Re: [PATCH net-next v2 1/2] fs/crashdd: add API to collect hardware dump in second kernel
On 4/2/2018 12:12 PM, Jiri Pirko wrote:

Fri, Mar 30, 2018 at 05:11:29PM CEST, and...@lunn.ch wrote:

Please see: http://patchwork.ozlabs.org/project/netdev/list/?series=36524

I believe that the solution in the patchset could be used for your use case too.

Hi Jiri

https://lkml.org/lkml/2018/3/20/436

How well does this API work for a 2 Gbyte snapshot?

Ccing Alex who did the tests.

I didn't check the performance for such a large snapshot. From my measurement it takes 0.09 s for 1 MB of data, which means about ~3 min. This can be tuned and improved since this is a socket application.

Andrew
linux-next: build failure after merge of the tip tree
Hi all,

After merging the tip tree, today's linux-next build (x86_64 allmodconfig) failed like this:

net/rxrpc/call_object.c: In function 'rxrpc_rcu_destroy_call':
net/rxrpc/call_object.c:661:3: error: implicit declaration of function 'wake_up_atomic_t'; did you mean 'wake_up_bit'? [-Werror=implicit-function-declaration]
   wake_up_atomic_t(&rxnet->nr_calls);
   ^~~~
   wake_up_bit
net/rxrpc/call_object.c: In function 'rxrpc_destroy_all_calls':
net/rxrpc/call_object.c:728:2: error: implicit declaration of function 'wait_on_atomic_t'; did you mean 'wait_on_bit'? [-Werror=implicit-function-declaration]
  wait_on_atomic_t(&rxnet->nr_calls, atomic_t_wait, TASK_UNINTERRUPTIBLE);
  ^~~~
  wait_on_bit
net/rxrpc/call_object.c:728:37: error: 'atomic_t_wait' undeclared (first use in this function); did you mean 'atomic_long_t'?
  wait_on_atomic_t(&rxnet->nr_calls, atomic_t_wait, TASK_UNINTERRUPTIBLE);
                                     ^
                                     atomic_long_t
net/rxrpc/call_object.c:728:37: note: each undeclared identifier is reported only once for each function it appears in
net/rxrpc/call_accept.c: In function 'rxrpc_discard_prealloc':
net/rxrpc/call_accept.c:223:4: error: implicit declaration of function 'wake_up_atomic_t'; did you mean 'wake_up_bit'? [-Werror=implicit-function-declaration]
    wake_up_atomic_t(&rxnet->nr_conns);
    ^~~~
    wake_up_bit

Caused by commit

  9b8cce52c4b5 ("sched/wait: Remove the wait_on_atomic_t() API")

interacting with commits

  d3be4d244330 ("rxrpc: Fix potential call vs socket/net destruction race")
  31f5f9a1691e ("rxrpc: Fix apparent leak of rxrpc_local objects")

from the net-next tree.

Haven't we figured out how to remove/change APIs yet? :-(

That tip tree commit is now in Linus' tree (merged since I started this morning) so the net-next tree will need the below patch (or something similar) when it is merged.
I have applied the following merge fix patch (this may need more work):

From: Stephen Rothwell
Date: Tue, 3 Apr 2018 15:34:48 +1000
Subject: [PATCH] sched/wait: merge fix up for wait_on_atomic_t() API removal

Signed-off-by: Stephen Rothwell
---
 net/rxrpc/call_accept.c | 2 +-
 net/rxrpc/call_object.c | 4 ++--
 net/rxrpc/conn_object.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c
index f67017dcb25e..a9a9be5519b9 100644
--- a/net/rxrpc/call_accept.c
+++ b/net/rxrpc/call_accept.c
@@ -220,7 +220,7 @@ void rxrpc_discard_prealloc(struct rxrpc_sock *rx)
 		write_unlock(&rxnet->conn_lock);
 		kfree(conn);
 		if (atomic_dec_and_test(&rxnet->nr_conns))
-			wake_up_atomic_t(&rxnet->nr_conns);
+			wake_up_var(&rxnet->nr_conns);
 		tail = (tail + 1) & (size - 1);
 	}
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index f721c2b7e234..f6734d8cb01a 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -658,7 +658,7 @@ static void rxrpc_rcu_destroy_call(struct rcu_head *rcu)
 	kfree(call->rxtx_annotations);
 	kmem_cache_free(rxrpc_call_jar, call);
 	if (atomic_dec_and_test(&rxnet->nr_calls))
-		wake_up_atomic_t(&rxnet->nr_calls);
+		wake_up_var(&rxnet->nr_calls);
 }
 
 /*
@@ -725,5 +725,5 @@ void rxrpc_destroy_all_calls(struct rxrpc_net *rxnet)
 	write_unlock(&rxnet->call_lock);
 	atomic_dec(&rxnet->nr_calls);
-	wait_on_atomic_t(&rxnet->nr_calls, atomic_t_wait, TASK_UNINTERRUPTIBLE);
+	wait_var_event(&rxnet->nr_calls, !atomic_read(&rxnet->nr_calls));
 }
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index 0950ee3d26f5..4c77a78a252a 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -367,7 +367,7 @@ static void rxrpc_destroy_connection(struct rcu_head *rcu)
 	rxrpc_put_peer(conn->params.peer);
 	if (atomic_dec_and_test(&conn->params.local->rxnet->nr_conns))
-		wake_up_atomic_t(&conn->params.local->rxnet->nr_conns);
+		wake_up_var(&conn->params.local->rxnet->nr_conns);
 	rxrpc_put_local(conn->params.local);
 	kfree(conn);
@@ -482,6 +482,6 @@ void rxrpc_destroy_all_connections(struct rxrpc_net *rxnet)
 	/* We need to wait for the connections to be destroyed by RCU as they
 	 * pin things that we still need to get rid of.
 	 */
-	wait_on_atomic_t(&rxnet->nr_conns, atomic_t_wait, TASK_UNINTERRUPTIBLE);
+	wait_var_event(&rxnet->nr_conns, !atomic_read(&rxnet->nr_conns));
 	_leave("");
 }
-- 
2.16.1

-- 
Cheers,
Stephen Rothwell
Re: WARNING: refcount bug in should_fail
On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:
> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
> of bugs (most of them preexisting, but I should've checked instead
> of assuming that everything's fine). Will push out later tonight.

OK, see vfs.git#for-linus. Caught: 4 old bugs (allocation failure in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp., and double-dput in late failure exit in rpc_fill_super()) and 5 regressions from register_shrinker() failure recovery.
Re: [GIT PULL] remove in-kernel calls to syscalls
On Mon, Apr 2, 2018 at 12:04 PM, Dominik Brodowski wrote:
>
> This patchset removes all in-kernel calls to syscall functions in the
> kernel with the exception of arch/.

Ok, this finished off my arch updates for today, I'll probably move on to driver pulls tomorrow. Anyway, it's in my tree, will push out once my test build finishes.

Linus
[PATCH iproute2-next v1] rdma: Print net device name and index for RDMA device
From: Leon Romanovsky

RDMA devices operating in RoCE and iWARP modes have a net device underneath. Present their names in regular output and their net index in detailed mode.

[root@nps ~]# rdma link show mlx5_3/1
4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7

[root@nps ~]# rdma link show mlx5_3/1 -d
4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7 netdev_index 7
    caps:

Signed-off-by: Leon Romanovsky
Reviewed-by: Steve Wise
---
Changes v0->v1:
 * Resend after commit 29122c1aae35 ("rdma: update rdma_netlink.h") which
   updated relevant netlink attributes.
 * Added Steve's ROB
---
 rdma/include/uapi/rdma/rdma_netlink.h |  4 ++++
 rdma/link.c                           | 21 +++++++++++++++++++++
 rdma/utils.c                          |  2 ++
 3 files changed, 27 insertions(+)

diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
index 9446a721..45474f13 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -388,6 +388,10 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_RES_LOCAL_DMA_LKEY,	/* u32 */
 	RDMA_NLDEV_ATTR_RES_UNSAFE_GLOBAL_RKEY,	/* u32 */
 
+	/* Netdev information for relevant protocols, like RoCE and iWARP */
+	RDMA_NLDEV_ATTR_NDEV_INDEX,		/* u32 */
+	RDMA_NLDEV_ATTR_NDEV_NAME,		/* string */
+
 	RDMA_NLDEV_ATTR_MAX
 };
 #endif /* _RDMA_NETLINK_H */
diff --git a/rdma/link.c b/rdma/link.c
index 66bcd50e..7e914c87 100644
--- a/rdma/link.c
+++ b/rdma/link.c
@@ -205,6 +205,26 @@ static void link_print_phys_state(struct rd *rd, struct nlattr **tb)
 	pr_out("physical_state %s ", phys_state_to_str(phys_state));
 }
 
+static void link_print_netdev(struct rd *rd, struct nlattr **tb)
+{
+	const char *netdev_name;
+	uint32_t idx;
+
+	if (!tb[RDMA_NLDEV_ATTR_NDEV_NAME] || !tb[RDMA_NLDEV_ATTR_NDEV_INDEX])
+		return;
+
+	netdev_name = mnl_attr_get_str(tb[RDMA_NLDEV_ATTR_NDEV_NAME]);
+	idx = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_NDEV_INDEX]);
+	if (rd->json_output) {
+		jsonw_string_field(rd->jw, "netdev", netdev_name);
+		jsonw_uint_field(rd->jw, "netdev_index", idx);
+	} else {
+		pr_out("netdev %s ", netdev_name);
+		if (rd->show_details)
+			pr_out("netdev_index %u ", idx);
+	}
+}
+
 static int link_parse_cb(const struct nlmsghdr *nlh, void *data)
 {
 	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX] = {};
@@ -241,6 +261,7 @@ static int link_parse_cb(const struct nlmsghdr *nlh, void *data)
 	link_print_lmc(rd, tb);
 	link_print_state(rd, tb);
 	link_print_phys_state(rd, tb);
+	link_print_netdev(rd, tb);
 
 	if (rd->show_details)
 		link_print_caps(rd, tb);
diff --git a/rdma/utils.c b/rdma/utils.c
index a2e08e91..f38f0459 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -391,6 +391,8 @@ static const enum mnl_attr_data_type nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_RES_LKEY] = MNL_TYPE_U32,
 	[RDMA_NLDEV_ATTR_RES_IOVA] = MNL_TYPE_U64,
 	[RDMA_NLDEV_ATTR_RES_MRLEN] = MNL_TYPE_U64,
+	[RDMA_NLDEV_ATTR_NDEV_INDEX] = MNL_TYPE_U32,
+	[RDMA_NLDEV_ATTR_NDEV_NAME] = MNL_TYPE_NUL_STRING,
 };
 
 int rd_attr_cb(const struct nlattr *attr, void *data)
-- 
2.14.3
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On Mon, Mar 26, 2018 at 01:01:58AM +0200, Andrew Lunn wrote: > The phylib core code will take the phydev lock before calling into the > driver. By violating the layering, we are missing on this lock. That lock protects the fields within the struct phy_device, like the state field. None of the time stamping methods need to read or write any part of that data structure. Actually it is not true that the core always takes the lock before calling the driver methods. See .ack_interrupt for example. > Maybe the one driver which currently implements these calls does not > need locking. It has locking. If a specific device needs locking to protect the integrity of its registers or other internal data structures, then it can and should implement those locks. Thanks, Richard
Re: WARNING in refcount_dec
No. Only the first crash (WARNING in refcount_dec) is reproduced by the attached reproducer. The second crash (kernel BUG at af_packet.c:3107) is reproduced by another reproducer. We reported it here:
http://lkml.iu.edu/hypermail/linux/kernel/1803.3/05324.html

On Sun, Apr 1, 2018 at 4:38 PM, Willem de Bruijn wrote:
> On Thu, Mar 29, 2018 at 1:16 AM, Cong Wang wrote:
>> (Cc'ing netdev and Willem)
>>
>> On Wed, Mar 28, 2018 at 12:03 PM, Byoungyoung Lee wrote:
>>> Another crash patterns observed: race between (setsockopt$packet_int)
>>> and (bind$packet).
>>>
>>> --
>>> [ 357.731597] kernel BUG at /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/net/packet/af_packet.c:3107!
>>> [ 357.733382] invalid opcode: [#1] SMP KASAN
>>> [ 357.734017] Modules linked in:
>>> [ 357.734662] CPU: 1 PID: 3871 Comm: repro.exe Not tainted 4.16.0-rc3 #1
>>> [ 357.735791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>>> [ 357.737434] RIP: 0010:packet_do_bind+0x88d/0x950
>>> [ 357.738121] RSP: 0018:8800b2787b08 EFLAGS: 00010293
>>> [ 357.738906] RAX: 8800b2fdc780 RBX: 880234358cc0 RCX: 838b244c
>>> [ 357.739905] RDX: RSI: 838b257d RDI: 0001
>>> [ 357.741315] RBP: 8800b2787c10 R08: 8800b2fdc780 R09:
>>> [ 357.743055] R10: 0001 R11: R12: 88023352ecc0
>>> [ 357.744744] R13: R14: 0001 R15: 1d00
>>> [ 357.746377] FS: 7f4b43733700() GS:8800b8b0() knlGS:
>>> [ 357.749599] CS: 0010 DS: ES: CR0: 80050033
>>> [ 357.752096] CR2: 20058000 CR3: 0002334b8000 CR4: 06e0
>>> [ 357.755045] Call Trace:
>>> [ 357.755822] ? compat_packet_setsockopt+0x100/0x100
>>> [ 357.757324] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
>>> [ 357.758810] packet_bind+0xa2/0xe0
>>> [ 357.759640] SYSC_bind+0x279/0x2f0
>>> [ 357.760364] ? move_addr_to_kernel.part.19+0xc0/0xc0
>>> [ 357.761491] ? __handle_mm_fault+0x25d0/0x25d0
>>> [ 357.762449] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.763663] ? __do_page_fault+0x417/0xba0
>>> [ 357.764569] ? vmalloc_fault+0x910/0x910
>>> [ 357.765405] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.766525] ? mark_held_locks+0x25/0xb0
>>> [ 357.767336] ? SyS_socketpair+0x4a0/0x4a0
>>> [ 357.768182] SyS_bind+0x24/0x30
>>> [ 357.768851] do_syscall_64+0x209/0x5d0
>>> [ 357.769650] ? syscall_return_slowpath+0x3e0/0x3e0
>>> [ 357.770665] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.771779] ? syscall_return_slowpath+0x260/0x3e0
>>> [ 357.772748] ? mark_held_locks+0x25/0xb0
>>> [ 357.773581] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
>>> [ 357.774720] ? retint_user+0x18/0x18
>>> [ 357.775493] ? trace_hardirqs_off_caller+0xb5/0x120
>>> [ 357.776567] ? trace_hardirqs_off_thunk+0x1a/0x1c
>>> [ 357.777512] entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>> [ 357.778508] RIP: 0033:0x4503a9
>>> [ 357.779156] RSP: 002b:7f4b43732ce8 EFLAGS: 0246 ORIG_RAX: 0031
>>> [ 357.780737] RAX: ffda RBX: RCX: 004503a9
>>> [ 357.782169] RDX: 0014 RSI: 20058000 RDI: 0003
>>> [ 357.783710] RBP: 7f4b43732d10 R08: R09:
>>> [ 357.785202] R10: R11: 0246 R12:
>>> [ 357.786664] R13: R14: 7f4b437339c0 R15: 7f4b43733700
>>> [ 357.788210] Code: c0 fd 48 c7 c2 00 c8 d9 84 be ab 02 00 00 48 c7 c7 60 c8 d9 84 c6 05 e7 a2 48 02 01 e8 3f 17 af fd e9 60 fb ff ff e8 43 b3 c0 fd <0f> 0b e8 3c b3 c0 fd 48 8b bd 20 ff ff ff e8 60 1e e7 fd 4c 89
>>> [ 357.792260] RIP: packet_do_bind+0x88d/0x950 RSP: 8800b2787b08
>>> [ 357.793698] ---[ end trace 0c5a2539f0247369 ]---
>>> [ 357.794696] Kernel panic - not syncing: Fatal exception
>>> [ 357.795918] Kernel Offset: disabled
>>> [ 357.796614] Rebooting in 86400 seconds..
>>>
>>> On Wed, Mar 28, 2018 at 1:19 AM, Byoungyoung Lee wrote:
>>>> We report the crash: WARNING in refcount_dec
>>>>
>>>> This crash has been found in v4.16-rc3 using RaceFuzzer (a modified
>>>> version of Syzkaller), which we describe more at the end of this report.
>>>> Our analysis shows that the race occurs when invoking two syscalls
>>>> concurrently, (setsockopt$packet_int) and (setsockopt$packet_rx_ring).
>>>>
>>>> C repro code: https://kiwi.cs.purdue.edu/static/race-fuzzer/repro-refcount_dec.c
>>>> kernel config: https://kiwi.cs.purdue.edu/static/race-fuzzer/kernel-config-v4.16-rc3
>>
>>
I tried your
Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek
On Tue, Apr 03, 2018 at 01:41:26PM +1000, NeilBrown wrote:
>
> Do we really need a rhashtable_walk_peek() interface?
>
> I imagine that a seqfile ->start function can do:
>
>	if (*ppos == 0 && last_pos != 0) {
>		rhashtable_walk_exit(&iter);
>		rhashtable_walk_enter(&ht, &iter);
>		last_pos = 0;
>	}
>	rhashtable_walk_start(&iter);
>	if (*ppos == last_pos && iter.p)
>		return iter.p;
>	last_pos = *ppos;

We don't want users poking into the internals of iter. If you're suggesting we could simplify rhashtable_walk_peek to just this after your patch then yes possibly. You also need to ensure that not only seqfs users continue to work but also netlink dump users which are slightly different.

> It might be OK to have a function call instead of expecting people to
> use iter.p directly.

Yes that would definitely be the preferred option.

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On Sun, Mar 25, 2018 at 04:01:49PM -0700, Florian Fainelli wrote: > The best that I can think about and it still is a hack in some way, is > to you have your time stamping driver create a proxy mii_bus whose > purpose is just to hook to mdio/phy_device events (such as link changes) > in order to do what is necessary, or at least, this would indicate its > transparent nature towards the MDIO/MDC lines... That won't work at all, AFAICT. There is only one mii_bus per netdev, that is one that is attached to the phydev. > Tangential: the existing PHY time stamping logic should probably be > generalized to a mdio_device (which the phy_device is a specialized > superset of) instead of to the phy_device. This would still allow > existing use cases but it would also allow us to support possible "pure > MDIO" devices would that become some thing in the future. So this is exactly what I did. The time stamping methods were pushed down into the mdio_device. The active device (mdiots pointer) points either to a non-PHY mdio_device or to the mdio_device embedded in the phydev. Thanks, Richard
Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek
On Fri, Mar 30 2018, Herbert Xu wrote:
> On Thu, Mar 29, 2018 at 06:52:34PM +0200, Andreas Gruenbacher wrote:
>>
>> Should rhashtable_walk_peek be kept around even if there are no more
>> users? I have my doubts.
>
> Absolutely. All netlink dumps using rhashtable_walk_next are buggy
> and need to switch over to rhashtable_walk_peek. As otherwise
> the object that triggers the out-of-space condition will be skipped
> upon resumption.

Do we really need a rhashtable_walk_peek() interface?

I imagine that a seqfile ->start function can do:

	if (*ppos == 0 && last_pos != 0) {
		rhashtable_walk_exit(&iter);
		rhashtable_walk_enter(&ht, &iter);
		last_pos = 0;
	}
	rhashtable_walk_start(&iter);
	if (*ppos == last_pos && iter.p)
		return iter.p;
	last_pos = *ppos;
	return rhashtable_walk_next(&iter);

and the ->next function just does

	last_pos = *ppos;
	*ppos += 1;
	do
		p = rhashtable_walk_next(&iter);
	while (IS_ERR(p));
	return p;

It might be OK to have a function call instead of expecting people to use iter.p directly.

	static inline void *rhashtable_walk_prev(struct rhashtable_iter *iter)
	{
		return iter->p;
	}

Thoughts?

Thanks,
NeilBrown
linux-next: manual merge of the net-next tree with the pci tree
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/mellanox/mlx5/core/en_main.c

between commit:

  2907938d2375 ("net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth")

from the pci tree and commit:

  0608d4dbaf4e ("net/mlx5e: Unify slow PCI heuristic")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 884337f88589,0aab3afc6885..
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@@ -3880,16 -4026,50 +4033,20 @@@ void mlx5e_build_default_indir_rqt(u32 
  		indirection_rqt[i] = i % num_channels;
  }
  
- static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw)
 -static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
 -{
 -	enum pcie_link_width width;
 -	enum pci_bus_speed speed;
 -	int err = 0;
 -
 -	err = pcie_get_minimum_link(mdev->pdev, &speed, &width);
 -	if (err)
 -		return err;
 -
 -	if (speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
 -		return -EINVAL;
 -
 -	switch (speed) {
 -	case PCIE_SPEED_2_5GT:
 -		*pci_bw = 2500 * width;
 -		break;
 -	case PCIE_SPEED_5_0GT:
 -		*pci_bw = 5000 * width;
 -		break;
 -	case PCIE_SPEED_8_0GT:
 -		*pci_bw = 8000 * width;
 -		break;
 -	default:
 -		return -EINVAL;
 -	}
 -
 -	return 0;
 -}
 -
+ static bool slow_pci_heuristic(struct mlx5_core_dev *mdev)
  {
- 	return (link_speed && pci_bw &&
- 		(pci_bw < 40000) && (pci_bw < link_speed));
- }
+ 	u32 link_speed = 0;
+ 	u32 pci_bw = 0;
  
- static bool hw_lro_heuristic(u32 link_speed, u32 pci_bw)
- {
- 	return !(link_speed && pci_bw &&
- 		 (pci_bw <= 16000) && (pci_bw < link_speed));
+ 	mlx5e_get_max_linkspeed(mdev, &link_speed);
- 	mlx5e_get_pci_bw(mdev, &pci_bw);
++	pci_bw = pcie_bandwidth_available(mdev->pdev, NULL, NULL, NULL);
+ 	mlx5_core_dbg_once(mdev, "Max link speed = %d, PCI BW = %d\n",
+ 			   link_speed, pci_bw);
+ 
+ #define MLX5E_SLOW_PCI_RATIO (2)
+ 
+ 	return link_speed && pci_bw &&
+ 		link_speed > MLX5E_SLOW_PCI_RATIO * pci_bw;
  }
  
  void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
Re: [PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
Alex, On 4/2/2018 3:06 PM, Sinan Kaya wrote: > Code includes wmb() followed by writel() in multiple places. writel() > already has a barrier on some architectures like arm64. > > This ends up CPU observing two barriers back to back before executing the > register write. > > Since code already has an explicit barrier call, changing writel() to > writel_relaxed(). > > I did a regex search for wmb() followed by writel() in each drivers > directory. I scrubbed the ones I care about in this series. > > I considered "ease of change", "popular usage" and "performance critical > path" as the determining criteria for my filtering. > > We used relaxed API heavily on ARM for a long time but > it did not exist on other architectures. For this reason, relaxed > architectures have been paying double penalty in order to use the common > drivers. > > Now that relaxed API is present on all architectures, we can go and scrub > all drivers to see what needs to change and what can remain. > > We start with mostly used ones and hope to increase the coverage over time. > It will take a while to cover all drivers. > > Feel free to apply patches individually. > > Change since 7: > > * API clarification with Linus and several architecture folks regarding > writel() > > https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html > > * removed wmb() in front of writel() at several places. > * remove old IA64 comments regarding ordering. > What do you think about this version? Did I miss any SMP barriers? 
> Sinan Kaya (7): > i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs > ixgbe: eliminate duplicate barriers on weakly-ordered archs > igbvf: eliminate duplicate barriers on weakly-ordered archs > igb: eliminate duplicate barriers on weakly-ordered archs > fm10k: Eliminate duplicate barriers on weakly-ordered archs > ixgbevf: keep writel() closer to wmb() > ixgbevf: eliminate duplicate barriers on weakly-ordered archs > > drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++--- > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 -- > drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +-- > drivers/net/ethernet/intel/igb/igb_main.c | 14 ++-- > drivers/net/ethernet/intel/igbvf/netdev.c | 16 +- > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++- > drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 - > drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 > --- > 8 files changed, 30 insertions(+), 100 deletions(-) > Sinan -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH 0/2] rhashtable_walk fixes
From: NeilBrown
Date: Tue, 03 Apr 2018 12:23:40 +1000

> I'm sorry if I've caused some confusion, but I didn't think that I was
> submitting patches to you and know nothing about your two trees.
> I was submitting patches to Thomas and Herbert, the registered
> maintainers of rhashtable. I assumed they would review, respond, and
> take responsibility for getting them upstream, if that's what they
> decided, based on whatever arrangements they have in place.
>
> If it is appropriate I can resend all of my patches that receive an
> Ack as a coherent series, and send this to you nominating a particular
> tree, but I'm unlikely to do that unless asked and told which tree to
> nominate.

Herbert and Thomas generally review rhashtable patches, but rhashtable itself is generally maintained in the networking tree(s). So once they review and ACK it, I would apply it.
Re: [PATCH 0/2] rhashtable_walk fixes
On Fri, Mar 30 2018, David Miller wrote:
> From: NeilBrown
> Date: Thu, 29 Mar 2018 12:19:09 +1100
>
>> These two patches apply on top of my previous "rhashtable: reset iter
>> when rhashtable_walk_start sees new table" patch.
>>
>> The first fixes a bug that I found in rhltable_insert().
>>
>> The second is an alternate to my "rhashtable: allow a walk of the hash
>> table without missing object."
>> This version doesn't require an API change and should be reliable for
>> rhltables too (my first version didn't handle these correctly).
>
> Neil, please don't mix and match patches.
>
> Also when you need to change a patch in a series, please post the entire
> new series not just the patch that changes.
>
> Patch #1 in this series is unnecessary. As Herbert explained this has
> been fixed already.
>
> So please repost freshly the patches that are relevant and you want me
> to consider for inclusion. Also be explicit and clear about which of
> my two networking trees you are targetting these changes.

Hi Dave,

I'm sorry if I've caused some confusion, but I didn't think that I was submitting patches to you and know nothing about your two trees. I was submitting patches to Thomas and Herbert, the registered maintainers of rhashtable. I assumed they would review, respond, and take responsibility for getting them upstream, if that's what they decided, based on whatever arrangements they have in place.

If it is appropriate I can resend all of my patches that receive an Ack as a coherent series, and send this to you nominating a particular tree, but I'm unlikely to do that unless asked and told which tree to nominate.

Thanks,
NeilBrown
Re: [PATCH net] net: dsa: mt7530: Use NULL instead of plain integer
On Mon, 2018-04-02 at 16:24 -0700, Florian Fainelli wrote:
> We would be passing 0 instead of NULL as the rsp argument to
> mt7530_fdb_cmd(), fix that.

Acked-by: Sean Wang

BTW, should part of the commit message be updated to say "passing NULL instead of 0"?

> Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
> Signed-off-by: Florian Fainelli
> ---
>  drivers/net/dsa/mt7530.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index 4e53c5ce23ff..a7f6c0a62fc7 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c
> @@ -917,7 +917,7 @@ mt7530_port_fdb_add(struct dsa_switch *ds, int port,
>
>  	mutex_lock(&priv->reg_mutex);
>  	mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_ENT);
> -	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0);
> +	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL);
>  	mutex_unlock(&priv->reg_mutex);
>
>  	return ret;
> @@ -933,7 +933,7 @@ mt7530_port_fdb_del(struct dsa_switch *ds, int port,
>
>  	mutex_lock(&priv->reg_mutex);
>  	mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_EMP);
> -	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0);
> +	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL);
>  	mutex_unlock(&priv->reg_mutex);
>
>  	return ret;
> @@ -1293,7 +1293,7 @@ mt7530_setup(struct dsa_switch *ds)
>  	}
>
>  	/* Flush the FDB table */
> -	ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, 0);
> +	ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, NULL);
>  	if (ret < 0)
>  		return ret;
>
[PATCH net-next] pptp: remove a buggy dst release in pptp_connect()
Once dst has been cached in socket via sk_setup_caps(), it is illegal to call ip_rt_put() (or dst_release()), since sk_setup_caps() did not change dst refcount. We can still dereference it since we hold the socket lock. Caught by syzbot:

BUG: KASAN: use-after-free in atomic_dec_return include/asm-generic/atomic-instrumented.h:198 [inline]
BUG: KASAN: use-after-free in dst_release+0x27/0xa0 net/core/dst.c:185
Write of size 4 at addr 8801c54dc040 by task syz-executor4/20088

CPU: 1 PID: 20088 Comm: syz-executor4 Not tainted 4.16.0+ #376
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1a7/0x27d lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x23c/0x360 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
 atomic_dec_return include/asm-generic/atomic-instrumented.h:198 [inline]
 dst_release+0x27/0xa0 net/core/dst.c:185
 sk_dst_set include/net/sock.h:1812 [inline]
 sk_dst_reset include/net/sock.h:1824 [inline]
 sock_setbindtodevice net/core/sock.c:610 [inline]
 sock_setsockopt+0x431/0x1b20 net/core/sock.c:707
 SYSC_setsockopt net/socket.c:1845 [inline]
 SyS_setsockopt+0x2ff/0x360 net/socket.c:1828
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4552d9
RSP: 002b:7f4878126c68 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 7f48781276d4 RCX: 004552d9
RDX: 0019 RSI: 0001 RDI: 0013
RBP: 0072bea0 R08: 0010 R09:
R10: 200010c0 R11: 0246 R12:
R13: 0526 R14: 006fac30 R15:

Allocated by task 20088:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3542
 dst_alloc+0x11f/0x1a0 net/core/dst.c:104
 rt_dst_alloc+0xe9/0x540 net/ipv4/route.c:1520
 __mkroute_output net/ipv4/route.c:2265 [inline]
 ip_route_output_key_hash_rcu+0xa49/0x2c60 net/ipv4/route.c:2493
 ip_route_output_key_hash+0x20b/0x370 net/ipv4/route.c:2322
 __ip_route_output_key include/net/route.h:126 [inline]
 ip_route_output_flow+0x26/0xa0 net/ipv4/route.c:2577
 ip_route_output_ports include/net/route.h:163 [inline]
 pptp_connect+0xa84/0x1170 drivers/net/ppp/pptp.c:453
 SYSC_connect+0x213/0x4a0 net/socket.c:1639
 SyS_connect+0x24/0x30 net/socket.c:1620
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Freed by task 20082:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527
 __cache_free mm/slab.c:3486 [inline]
 kmem_cache_free+0x83/0x2a0 mm/slab.c:3744
 dst_destroy+0x266/0x380 net/core/dst.c:140
 dst_destroy_rcu+0x16/0x20 net/core/dst.c:153
 __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
 rcu_do_batch kernel/rcu/tree.c:2675 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2930 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2897 [inline]
 rcu_process_callbacks+0xd6c/0x17b0 kernel/rcu/tree.c:2914
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285

The buggy address belongs to the object at 8801c54dc000 which belongs to the cache ip_dst_cache of size 168
The buggy address is located 64 bytes inside of 168-byte region [8801c54dc000, 8801c54dc0a8)
The buggy address belongs to the page:
page:ea0007153700 count:1 mapcount:0 mapping:8801c54dc000 index:0x0
flags: 0x2fffc000100(slab)
raw: 02fffc000100 8801c54dc000 00010010
raw: ea0006b34b20 ea0006b6c1e0 8801d674a1c0
page dumped because: kasan: bad access detected

Signed-off-by: Eric Dumazet
---
 drivers/net/ppp/pptp.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c
index 8249d46a784493faedbfad5ec78212afab8c872e..c4267ecefd85271ac561e94d57c586179e66777c 100644
--- a/drivers/net/ppp/pptp.c
+++ b/drivers/net/ppp/pptp.c
@@ -464,7 +464,6 @@ static int pptp_connect(struct socket *sock, struct sockaddr *uservaddr,
 	po->chan.mtu = dst_mtu(&rt->dst);
 	if (!po->chan.mtu)
 		po->chan.mtu = PPP_MRU;
-	ip_rt_put(rt);
 	po->chan.mtu -= PPTP_HEADER_OVERHEAD;
 
 	po->chan.hdrlen = 2 + sizeof(struct pptp_gre_header);
-- 
2.17.0.rc1.321.gba9d0f2565-goog
Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
On Thu, Mar 29, 2018 at 11:44:17PM +0100, Edward Cree wrote:
> By storing subprog boundaries as a subprogno mark on each insn, rather than
> a start (and implicit end) for each subprog, we collect a number of gains:
> * More efficient determination of which subprog contains a given insn, and
>   thus of find_subprog (which subprog begins at a given insn).
> * Number of verifier "full recursive walk" passes is reduced, since most of
>   the work is done in the main insn walk (do_check()). Leftover work in
>   other passes is mostly linear scans (O(insn_cnt)) or, in the case of
>   check_max_stack_depth(), a topological sort (O(subprog_cnt)).
>
> Some other changes were also included to support this:
> * Per-subprog info is stored in env->subprog_info, an array of structs,
>   rather than several arrays with a common index.
> * Call graph is now stored in the new bpf_subprog_info struct; used here
>   for check_max_stack_depth() but may have other uses too.
>
> Along with this, patch #3 puts parent pointers (used by liveness analysis)
> in the registers instead of the func_state or verifier_state, so that we
> don't need skip_callee() machinery. This also does the right thing for
> stack slots, so they don't need their own special handling for liveness
> marking either.

I like patch 3 and am going to play with it. How did you test it? Do you have processed_insn numbers for cilium or selftests programs before and after? There should be no difference, right?

Essentially it's trading extra run-time cost of skip_callee logic for higher memory overhead due to parent pointers in every register state. I guess that's ok, since it does clean things up nicely.

As far as patch 1, it was very difficult to review, since several logically different things were clamped together. So breaking it apart:
- converting two arrays of subprog_starts and subprog_stack_depth into a single array of struct bpf_subprog_info is a good thing
- tsort is interesting, but not sure it's correct.
more below - but main change of combining subprog logic with do_check is no good. The verifier has to move towards compiler style code analysis instead of the other way around. It has to analyze the program in simple and hopefully easy to understand passes instead of combining everything into one loop. We _have_ to get rid of the do_check() approach and corresponding insn_processed counter. That was and still is the biggest and most pressing issue we need to solve in bpf verification. The verifier has to switch to compiler style code analysis to scale. The algorithms should be such that they scale with thousands of instructions and thousands of functions. The only way I see reaching that goal is to do hierarchical program analysis in passes: - identify subprograms - identify basic blocks, build control flow graph - build dom graph, ssa and so on - and do per-basic block liveness and data flow analysis that propagates the combined result further into other BBs along cfg edges. There will be no 'do_check() across whole program' walk. Combining the subprog pass with do_check is going in the opposite direction of this long term work. Divide and conquer. Combining more things into do_check is the opposite of this programming principle. My short term plan is to split basic instruction correctness checks out of the do_check() loop into a separate pass and run it early on. It's necessary for bpf libraries to work, since the verifier cannot do full data flow analysis at this point, but should build cfg and move as much as possible out of the instruction by instruction walk to scale with the number of bpf programs/libraries and their sizes. As far as the tsort approach for determining max stack depth... It's an interesting idea, but the implementation suffers from the same 'combine everything into one loop' coding issue. I think updating the total_stack_depth math should be separate from sorting, since the same function can be part of different stacks with different max.
I don't see how updating the global subprog_info[i].total_stack_depth can be correct. It has to be different for every stack, and the longest stack is not necessarily the deepest. Maybe I'm missing something, but the combined sort and stack_depth math didn't make it easy to review. I find the existing check_max_stack_depth() algo much easier to understand.
Re: [PATCH net-next 4/5] tun: Add support for SCTP checksum offload
Hi Vladislav, Thank you for the patch! Yet something to improve: [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Vladislav-Yasevich/virtio-net-Add-SCTP-checksum-offload-support/20180402-221407 config: m68k-hp300_defconfig (attached as .config) compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=m68k Note: the linux-review/Vladislav-Yasevich/virtio-net-Add-SCTP-checksum-offload-support/20180402-221407 HEAD 5e0497a085e70055a1981959802173f4ff05c86b builds fine. It only hurts bisectibility. All errors (new ones prefixed by >>): drivers/net/tun.c: In function 'set_offload': >> drivers/net/tun.c:2722:12: error: 'TUN_F_SCTP_CSUM' undeclared (first use in >> this function); did you mean 'TUN_F_CSUM'? if (arg & TUN_F_SCTP_CSUM) { ^~~ TUN_F_CSUM drivers/net/tun.c:2722:12: note: each undeclared identifier is reported only once for each function it appears in vim +2722 drivers/net/tun.c 2696 2697 /* This is like a cut-down ethtool ops, except done via tun fd so no 2698 * privs required. 
*/ 2699 static int set_offload(struct tun_struct *tun, unsigned long arg) 2700 { 2701 netdev_features_t features = 0; 2702 2703 if (arg & TUN_F_CSUM) { 2704 features |= NETIF_F_HW_CSUM; 2705 arg &= ~TUN_F_CSUM; 2706 2707 if (arg & (TUN_F_TSO4|TUN_F_TSO6)) { 2708 if (arg & TUN_F_TSO_ECN) { 2709 features |= NETIF_F_TSO_ECN; 2710 arg &= ~TUN_F_TSO_ECN; 2711 } 2712 if (arg & TUN_F_TSO4) 2713 features |= NETIF_F_TSO; 2714 if (arg & TUN_F_TSO6) 2715 features |= NETIF_F_TSO6; 2716 arg &= ~(TUN_F_TSO4|TUN_F_TSO6); 2717 } 2718 2719 arg &= ~TUN_F_UFO; 2720 } 2721 > 2722 if (arg & TUN_F_SCTP_CSUM) { 2723 features |= NETIF_F_SCTP_CRC; 2724 arg &= ~TUN_F_SCTP_CSUM; 2725 } 2726 2727 /* This gives the user a way to test for new features in future by 2728 * trying to set them. */ 2729 if (arg) 2730 return -EINVAL; 2731 2732 tun->set_features = features; 2733 tun->dev->wanted_features &= ~TUN_USER_FEATURES; 2734 tun->dev->wanted_features |= features; 2735 netdev_update_features(tun->dev); 2736 2737 return 0; 2738 } 2739 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
On Mon, Apr 2, 2018 at 7:05 AM, Bjorn Helgaas wrote: > +/* PCIe speed to Mb/s reduced by encoding overhead */ > +#define PCIE_SPEED2MBS_ENC(speed) \ > + ((speed) == PCIE_SPEED_16_0GT ? (16000*(128/130)) : \ > +(speed) == PCIE_SPEED_8_0GT ? (8000*(128/130)) : \ > +(speed) == PCIE_SPEED_5_0GT ? (5000*(8/10)) : \ > +(speed) == PCIE_SPEED_2_5GT ? (2500*(8/10)) : \ > +0) > + Should this be "(speed * x) / y" instead? Wouldn't they calculate 128/130 and truncate that to zero before multiplying by the speed? Or are compilers smart enough to do this the other way to avoid the losses? Thanks, Jake
[PATCH net] net: dsa: mt7530: Use NULL instead of plain integer
We would be passing 0 instead of NULL as the rsp argument to mt7530_fdb_cmd(), fix that. Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Signed-off-by: Florian Fainelli --- drivers/net/dsa/mt7530.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c index 4e53c5ce23ff..a7f6c0a62fc7 100644 --- a/drivers/net/dsa/mt7530.c +++ b/drivers/net/dsa/mt7530.c @@ -917,7 +917,7 @@ mt7530_port_fdb_add(struct dsa_switch *ds, int port, mutex_lock(&priv->reg_mutex); mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_ENT); - ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0); + ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL); mutex_unlock(&priv->reg_mutex); return ret; @@ -933,7 +933,7 @@ mt7530_port_fdb_del(struct dsa_switch *ds, int port, mutex_lock(&priv->reg_mutex); mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_EMP); - ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, 0); + ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL); mutex_unlock(&priv->reg_mutex); return ret; @@ -1293,7 +1293,7 @@ mt7530_setup(struct dsa_switch *ds) } /* Flush the FDB table */ - ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, 0); + ret = mt7530_fdb_cmd(priv, MT7530_FDB_FLUSH, NULL); if (ret < 0) return ret; -- 2.14.1
[PATCH net] net: dsa: b53: Fix sparse warnings in b53_mmap.c
sparse complains about the following warnings: drivers/net/dsa/b53/b53_mmap.c:33:31: warning: incorrect type in initializer (different address spaces) drivers/net/dsa/b53/b53_mmap.c:33:31: expected unsigned char [noderef] [usertype] *regs drivers/net/dsa/b53/b53_mmap.c:33:31: got void *priv and indeed, while what we are doing is functional, we are dereferencing a void * pointer into a void __iomem * which is not great. Just use the defined b53_mmap_priv structure which holds our register base and use that. Fixes: 967dd82ffc52 ("net: dsa: b53: Add support for Broadcom RoboSwitch") Signed-off-by: Florian Fainelli --- drivers/net/dsa/b53/b53_mmap.c | 33 ++++++++++++++++++++++++--------- 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/drivers/net/dsa/b53/b53_mmap.c b/drivers/net/dsa/b53/b53_mmap.c index ef63d24fef81..c628d0980c0b 100644 --- a/drivers/net/dsa/b53/b53_mmap.c +++ b/drivers/net/dsa/b53/b53_mmap.c @@ -30,7 +30,8 @@ struct b53_mmap_priv { static int b53_mmap_read8(struct b53_device *dev, u8 page, u8 reg, u8 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; *val = readb(regs + (page << 8) + reg); @@ -39,7 +40,8 @@ static int b53_mmap_read8(struct b53_device *dev, u8 page, u8 reg, u8 *val) static int b53_mmap_read16(struct b53_device *dev, u8 page, u8 reg, u16 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 2)) return -EINVAL; @@ -54,7 +56,8 @@ static int b53_mmap_read16(struct b53_device *dev, u8 page, u8 reg, u16 *val) static int b53_mmap_read32(struct b53_device *dev, u8 page, u8 reg, u32 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 4)) return -EINVAL; @@ -69,7 +72,8 @@ static int b53_mmap_read32(struct b53_device *dev, u8 page, u8 reg, u32 *val) static int b53_mmap_read48(struct b53_device *dev, u8 page, u8 reg, u64 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 2)) return -EINVAL; @@ -107,7 +111,8 @@ static int b53_mmap_read48(struct b53_device *dev, u8 page, u8 reg, u64 *val) static int b53_mmap_read64(struct b53_device *dev, u8 page, u8 reg, u64 *val) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; u32 hi, lo; if (WARN_ON(reg % 4)) @@ -128,7 +133,8 @@ static int b53_mmap_read64(struct b53_device *dev, u8 page, u8 reg, u64 *val) static int b53_mmap_write8(struct b53_device *dev, u8 page, u8 reg, u8 value) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; writeb(value, regs + (page << 8) + reg); @@ -138,7 +144,8 @@ static int b53_mmap_write8(struct b53_device *dev, u8 page, u8 reg, u8 value) static int b53_mmap_write16(struct b53_device *dev, u8 page, u8 reg, u16 value) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 2)) return -EINVAL; @@ -154,7 +161,8 @@ static int b53_mmap_write16(struct b53_device *dev, u8 page, u8 reg, static int b53_mmap_write32(struct b53_device *dev, u8 page, u8 reg, u32 value) { - u8 __iomem *regs = dev->priv; + struct b53_mmap_priv *priv = dev->priv; + void __iomem *regs = priv->regs; if (WARN_ON(reg % 4)) return -EINVAL; @@ -223,12 +231,19 @@ static const struct b53_io_ops b53_mmap_ops = { static int b53_mmap_probe(struct platform_device *pdev) { struct b53_platform_data *pdata = pdev->dev.platform_data; + struct b53_mmap_priv *priv; struct b53_device *dev; if (!pdata) return -EINVAL; - dev = b53_switch_alloc(&pdev->dev, &b53_mmap_ops, pdata->regs); + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->regs = pdata->regs; + + dev = b53_switch_alloc(&pdev->dev, &b53_mmap_ops, priv); if (!dev) return -ENOMEM; -- 2.14.1
[PATCH net 0/2] net: Broadcom drivers sparse fixes
Hi all, This patch series fixes the same warning reported by sparse in bcmsysport and bcmgenet in the code that deals with inserting the TX checksum pointers: drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: warning: cast from restricted __be16 drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: warning: incorrect type in argument 1 (different base types) drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: expected unsigned short [unsigned] [usertype] val drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: got restricted __be16 [usertype] protocol This series fixes both issues by using the same construct: instead of swapping skb->protocol, swap the constant values we are checking against. Florian Fainelli (2): net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum() net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb() drivers/net/ethernet/broadcom/bcmsysport.c | 11 ++++++----- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 11 ++++++----- 2 files changed, 12 insertions(+), 10 deletions(-) -- 2.14.1
[PATCH net 2/2] net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()
skb->protocol is a __be16 that we were calling htons() against; while this is not wrong per se, since it correctly results in swapping the value on LE hosts, it still upsets sparse. Adopt a similar pattern to what other drivers do: assign ip_ver from skb->protocol, and then use htons() against the different constants such that the compiler can resolve the values at build time. Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver") Signed-off-by: Florian Fainelli --- drivers/net/ethernet/broadcom/bcmsysport.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index 3fc549b88c43..879db1c700bd 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -1133,7 +1133,7 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb, u32 csum_info; u8 ip_proto; u16 csum_start; - u16 ip_ver; + __be16 ip_ver; /* Re-allocate SKB if needed */ if (unlikely(skb_headroom(skb) < sizeof(*tsb))) { @@ -1152,12 +1152,12 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb, memset(tsb, 0, sizeof(*tsb)); if (skb->ip_summed == CHECKSUM_PARTIAL) { - ip_ver = htons(skb->protocol); + ip_ver = skb->protocol; switch (ip_ver) { - case ETH_P_IP: + case htons(ETH_P_IP): ip_proto = ip_hdr(skb)->protocol; break; - case ETH_P_IPV6: + case htons(ETH_P_IPV6): ip_proto = ipv6_hdr(skb)->nexthdr; break; default: @@ -1171,7 +1171,8 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb, if (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP) { csum_info |= L4_LENGTH_VALID; - if (ip_proto == IPPROTO_UDP && ip_ver == ETH_P_IP) + if (ip_proto == IPPROTO_UDP && + ip_ver == htons(ETH_P_IP)) csum_info |= L4_UDP; } else { csum_info = 0; -- 2.14.1
[PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()
skb->protocol is a __be16 that we were calling htons() against; while this is not wrong per se, since it correctly results in swapping the value on LE hosts, it still upsets sparse. Adopt a similar pattern to what other drivers do: assign ip_ver from skb->protocol, and then use htons() against the different constants such that the compiler can resolve the values at build time. Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file") Signed-off-by: Florian Fainelli --- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c index b1e35a9accf1..aeb329690f78 100644 --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c @@ -1460,7 +1460,7 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct net_device *dev, struct sk_buff *new_skb; u16 offset; u8 ip_proto; - u16 ip_ver; + __be16 ip_ver; u32 tx_csum_info; if (unlikely(skb_headroom(skb) < sizeof(*status))) { @@ -1480,12 +1480,12 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct net_device *dev, status = (struct status_64 *)skb->data; if (skb->ip_summed == CHECKSUM_PARTIAL) { - ip_ver = htons(skb->protocol); + ip_ver = skb->protocol; switch (ip_ver) { - case ETH_P_IP: + case htons(ETH_P_IP): ip_proto = ip_hdr(skb)->protocol; break; - case ETH_P_IPV6: + case htons(ETH_P_IPV6): ip_proto = ipv6_hdr(skb)->nexthdr; break; default: @@ -1501,7 +1501,8 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct net_device *dev, */ if (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP) { tx_csum_info |= STATUS_TX_CSUM_LV; - if (ip_proto == IPPROTO_UDP && ip_ver == ETH_P_IP) + if (ip_proto == IPPROTO_UDP && + ip_ver == htons(ETH_P_IP)) tx_csum_info |= STATUS_TX_CSUM_PROTO_UDP; } else { tx_csum_info = 0; -- 2.14.1
[PATCH net-next] rxrpc: Fix undefined packet handling
By analogy with other Rx implementations, RxRPC packet types 9, 10 and 11 should just be discarded rather than being aborted like other undefined packet types. Reported-by: Jeffrey Altman Signed-off-by: David Howells --- net/rxrpc/input.c | 6 ++++++ net/rxrpc/protocol.h | 6 ++++++ 2 files changed, 12 insertions(+) diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c index 21800e6f5019..0410d2277ca2 100644 --- a/net/rxrpc/input.c +++ b/net/rxrpc/input.c @@ -1200,6 +1200,12 @@ void rxrpc_data_ready(struct sock *udp_sk) !rxrpc_validate_jumbo(skb)) goto bad_message; break; + + /* Packet types 9-11 should just be ignored. */ + case RXRPC_PACKET_TYPE_PARAMS: + case RXRPC_PACKET_TYPE_10: + case RXRPC_PACKET_TYPE_11: + goto discard; } rcu_read_lock(); diff --git a/net/rxrpc/protocol.h b/net/rxrpc/protocol.h index 4bddcf3face3..93da73bf7098 100644 --- a/net/rxrpc/protocol.h +++ b/net/rxrpc/protocol.h @@ -46,6 +46,9 @@ struct rxrpc_wire_header { #define RXRPC_PACKET_TYPE_CHALLENGE 6 /* connection security challenge (SRVR->CLNT) */ #define RXRPC_PACKET_TYPE_RESPONSE 7 /* connection security response (CLNT->SRVR) */ #define RXRPC_PACKET_TYPE_DEBUG 8 /* debug info request */ +#define RXRPC_PACKET_TYPE_PARAMS 9 /* Parameter negotiation (unspec'd, ignore) */ +#define RXRPC_PACKET_TYPE_10 10 /* Ignored */ +#define RXRPC_PACKET_TYPE_11 11 /* Ignored */ #define RXRPC_PACKET_TYPE_VERSION 13 /* version string request */ #define RXRPC_N_PACKET_TYPES 14 /* number of packet types (incl type 0) */ @@ -78,6 +81,9 @@ struct rxrpc_wire_header { (1 << RXRPC_PACKET_TYPE_CHALLENGE) |\ (1 << RXRPC_PACKET_TYPE_RESPONSE) | \ /*(1 << RXRPC_PACKET_TYPE_DEBUG) | */ \ + (1 << RXRPC_PACKET_TYPE_PARAMS) | \ + (1 << RXRPC_PACKET_TYPE_10) | \ + (1 << RXRPC_PACKET_TYPE_11) | \ (1 << RXRPC_PACKET_TYPE_VERSION))
[PATCH] PCI: allow drivers to limit the number of VFs to 0
Some user space depends on enabling sriov_totalvfs number of VFs to not fail, e.g.: $ cat .../sriov_totalvfs > .../sriov_numvfs For devices which VF support depends on loaded FW we have the pci_sriov_{g,s}et_totalvfs() API. However, this API uses 0 as a special "unset" value, meaning drivers can't limit sriov_totalvfs to 0. Remove the special values completely and simply initialize driver_max_VFs to total_VFs. Then always use driver_max_VFs. Add a helper for drivers to reset the VF limit back to total. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_main.c | 6 +++--- drivers/pci/iov.c | 27 +++++++++++++++++++++------ include/linux/pci.h | 2 ++ 3 files changed, 26 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index c4b1f344b4da..a76d177e40dd 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -123,7 +123,7 @@ static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf) return pci_sriov_set_totalvfs(pf->pdev, pf->limit_vfs); pf->limit_vfs = ~0; - pci_sriov_set_totalvfs(pf->pdev, 0); /* 0 is unset */ + pci_sriov_reset_totalvfs(pf->pdev); /* Allow any setting for backwards compatibility if symbol not found */ if (err == -ENOENT) return 0; @@ -537,7 +537,7 @@ static int nfp_pci_probe(struct pci_dev *pdev, err_net_remove: nfp_net_pci_remove(pf); err_sriov_unlimit: - pci_sriov_set_totalvfs(pf->pdev, 0); + pci_sriov_reset_totalvfs(pf->pdev); err_fw_unload: kfree(pf->rtbl); nfp_mip_close(pf->mip); @@ -570,7 +570,7 @@ static void nfp_pci_remove(struct pci_dev *pdev) nfp_hwmon_unregister(pf); nfp_pcie_sriov_disable(pdev); - pci_sriov_set_totalvfs(pf->pdev, 0); + pci_sriov_reset_totalvfs(pf->pdev); nfp_net_pci_remove(pf); diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 677924ae0350..c63ea870d8be 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -443,6 +443,7 @@ static int sriov_init(struct pci_dev *dev, int pos) iov->nres = nres; iov->ctrl = ctrl; iov->total_VFs = total; + iov->driver_max_VFs = total; pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); iov->pgsz = pgsz; iov->self = dev; @@ -788,12 +789,29 @@ int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs) } EXPORT_SYMBOL_GPL(pci_sriov_set_totalvfs); +/** + * pci_sriov_reset_totalvfs -- return the TotalVFs value to the default + * @dev: the PCI PF device + * + * Should be called from PF driver's remove routine with + * device's mutex held. + */ +void pci_sriov_reset_totalvfs(struct pci_dev *dev) +{ + /* Shouldn't change if VFs already enabled */ + if (!dev->is_physfn || dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE) + return; + + dev->sriov->driver_max_VFs = dev->sriov->total_VFs; +} +EXPORT_SYMBOL_GPL(pci_sriov_reset_totalvfs); + /** * pci_sriov_get_totalvfs -- get total VFs supported on this device * @dev: the PCI PF device * - * For a PCIe device with SRIOV support, return the PCIe - * SRIOV capability value of TotalVFs or the value of driver_max_VFs + * For a PCIe device with SRIOV support, return the value of driver_max_VFs + * which can be equal to the PCIe SRIOV capability value of TotalVFs or lower * if the driver reduced it. Otherwise 0.
*/ int pci_sriov_get_totalvfs(struct pci_dev *dev) @@ -801,9 +819,6 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev) if (!dev->is_physfn) return 0; - if (dev->sriov->driver_max_VFs) - return dev->sriov->driver_max_VFs; - - return dev->sriov->total_VFs; + return dev->sriov->driver_max_VFs; } EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs); diff --git a/include/linux/pci.h b/include/linux/pci.h index 024a1beda008..95fde8850393 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1952,6 +1952,7 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id); int pci_num_vf(struct pci_dev *dev); int pci_vfs_assigned(struct pci_dev *dev); int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs); +void pci_sriov_reset_totalvfs(struct pci_dev *dev); int pci_sriov_get_totalvfs(struct pci_dev *dev); resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno); void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe); @@ -1978,6 +1979,7 @@ static inline int pci_vfs_assigned(struct pci_dev *dev) { return 0; } static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs) { return 0; } +static inline void pci_sriov_reset_totalvfs(struct pci_dev *dev) { } static inline int pci_sriov_get_totalvfs(struct pci_dev *dev) { return 0; } static inline resource_size_t
Re: WARNING: refcount bug in should_fail
On Mon, Apr 02, 2018 at 10:52:12PM +0100, Al Viro wrote: > On Mon, Apr 02, 2018 at 03:30:56PM -0500, Eric W. Biederman wrote: > > Tetsuo Handa writes: > > > > I don't think this is a dup of existing bug. > > > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080. > > > > Even if expanding mount_ns to more filesystems was magically fixed, > > proc would still have this issue with the pid namespace rather than > > the net namespace. > > > > This is a mess. I will take a look and see if I can see a fix. > > It's trivially fixable, and there's no need to modify mount_ns() at > all. > > All we need is for rpc_kill_sb() to recognize whether we are already > through the point in rpc_fill_super() where the refcount is bumped. > That's it. > > The most trivial way to do that is to move > net = get_net(sb->s_fs_info); > past > if (!root) > return -ENOMEM; > in the latter and have > out: > if (!sb->s_root) > net = NULL; > kill_litter_super(sb); > if (net) > put_net(net); > in the end of the former. And similar changes in other affected > instances. FWIW, I'm going through the ->kill_sb() instances, fixing that sort of bugs (most of them preexisting, but I should've checked instead of assuming that everything's fine). Will push out later tonight.
Re: WARNING: refcount bug in should_fail
On Mon, Apr 02, 2018 at 03:30:56PM -0500, Eric W. Biederman wrote: > Tetsuo Handa writes: > > I don't think this is a dup of existing bug. > > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080. > > Even if expanding mount_ns to more filesystems was magically fixed, > proc would still have this issue with the pid namespace rather than > the net namespace. > > This is a mess. I will take a look and see if I can see a fix. It's trivially fixable, and there's no need to modify mount_ns() at all. All we need is for rpc_kill_sb() to recognize whether we are already through the point in rpc_fill_super() where the refcount is bumped. That's it. The most trivial way to do that is to move net = get_net(sb->s_fs_info); past if (!root) return -ENOMEM; in the latter and have out: if (!sb->s_root) net = NULL; kill_litter_super(sb); if (net) put_net(net); in the end of the former. And similar changes in other affected instances.
Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
On Fri, 30 Mar 2018 09:54:37 -0700, Alexander Duyck wrote: > On Fri, Mar 30, 2018 at 4:49 AM, Christoph Hellwig wrote: > > On Thu, Mar 29, 2018 at 11:22:31AM -0700, Jakub Kicinski wrote: > >> Some user space depends on driver allowing sriov_totalvfs to be > >> enabled. > > > > I can't make sense of this sentence. Can you explain what user space > > code depends on what semantics? The sriov_totalvfs file should show > > up for any device supporting SR-IOV as far as I can tell. > > > >> > >> For devices which VF support depends on loaded FW we > >> have the pci_sriov_{g,s}et_totalvfs() API. However, this API > >> uses 0 as a special "unset" value, meaning drivers can't limit > >> sriov_totalvfs to 0. Change the special value to be U16_MAX. > >> Use simple min() to determine actual totalvfs. > > > > Please use a PCI_MAX_VFS or similar define instead of plain U16_MAX or ~0. > > Actually is there any reason why driver_max_VFs couldn't just be > initialized to the same value as total_VFs? > > Also looking over the patch I don't see how writing ~0 would be > accepted unless you also make changes to pci_sriov_set_totalvfs since > it should fail the "numvfs > dev->sriov->total_VFs" check. You might > just want to look at adding a new function that would reset the > driver_max_VFs value instead of trying to write it to an arbitrary > value from the driver. Ack, the reset function plus using total_VFs as unset seems a lot cleaner, thanks!
Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
On Fri, 30 Mar 2018 04:49:05 -0700, Christoph Hellwig wrote: > On Thu, Mar 29, 2018 at 11:22:31AM -0700, Jakub Kicinski wrote: > > Some user space depends on driver allowing sriov_totalvfs to be > > enabled. > > I can't make sense of this sentence. Can you explain what user space > code depends on what semantics? The sriov_totalvfs file should show > up for any device supporting SR-IOV as far as I can tell. Ugh, it took me a while to understand what I meant myself... So what I meant is that this should generally work: # cat .../sriov_totalvfs > .../sriov_numvfs I.e. one should be able to "enable" the number of VFs advertised in sriov_totalvfs. I will rephrase! > > For devices which VF support depends on loaded FW we > > have the pci_sriov_{g,s}et_totalvfs() API. However, this API > > uses 0 as a special "unset" value, meaning drivers can't limit > > sriov_totalvfs to 0. Change the special value to be U16_MAX. > > Use simple min() to determine actual totalvfs. > > Please use a PCI_MAX_VFS or similar define instead of plain U16_MAX or ~0. I think I may just go with what I think Alex is suggesting and "unset" the value to total_VFs.
Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
On 4/2/2018 11:25 PM, Keller, Jacob E wrote: -Original Message- From: Bjorn Helgaas [mailto:helg...@kernel.org] Sent: Monday, April 02, 2018 12:58 PM To: Keller, Jacob ECc: Tal Gilboa ; Tariq Toukan ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux- ker...@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote: -Original Message- From: Bjorn Helgaas [mailto:helg...@kernel.org] Sent: Friday, March 30, 2018 2:05 PM To: Tal Gilboa Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired- l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited From: Tal Gilboa Add pcie_print_link_status(). This logs the current settings of the link (speed, width, and total available bandwidth). If the device is capable of more bandwidth but is limited by a slower upstream link, we include information about the link that limits the device's performance. The user may be able to move the device to a different slot for better performance. This provides a unified method for all PCI devices to report status and issues, instead of each device reporting in a different way, using different code. 
Signed-off-by: Tal Gilboa [bhelgaas: changelog, reword log messages, print device capabilities when not limited] Signed-off-by: Bjorn Helgaas --- drivers/pci/pci.c | 29 + include/linux/pci.h | 1 + 2 files changed, 30 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index e00d56b12747..cec7aed09f6b 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed, return *width * PCIE_SPEED2MBS_ENC(*speed); } +/** + * pcie_print_link_status - Report the PCI device's link speed and width + * @dev: PCI device to query + * + * Report the available bandwidth at the device. If this is less than the + * device is capable of, report the device's maximum possible bandwidth and + * the upstream link that limits its performance to less than that. + */ +void pcie_print_link_status(struct pci_dev *dev) +{ + enum pcie_link_width width, width_cap; + enum pci_bus_speed speed, speed_cap; + struct pci_dev *limiting_dev = NULL; + u32 bw_avail, bw_cap; + + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width); + + if (bw_avail >= bw_cap) + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n", +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); + else + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n", +bw_avail, PCIE_SPEED2STR(speed), width, +limiting_dev ? pci_name(limiting_dev) : "<unknown>", +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); +} Personally, I would make this last one a pci_warn() to indicate it at a higher log level, but I'm ok with the wording, and if consensus is that this should be at info, I'm ok with that. Tal's original patch did have a pci_warn() here, and we went back and forth a bit. They get bug reports when a device doesn't perform as expected, which argues for pci_warn().
But they also got feedback saying warnings are a bit too much, which argues for pci_info() [1] I don't have a really strong opinion either way. I have a slight preference for info because the user may not be able to do anything about it (there may not be a faster slot available), and I think distros are usually configured so a warning interrupts the smooth graphical boot. It looks like mlx4, fm10k, and ixgbe currently use warnings, while bnx2x, bnxt_en, and cxgb4 use info. It's a tie so far :) [1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b- 5ffaf261c...@mellanox.com With that information, I'm fine with the proposal to display this as only an info. The message is still printed and can be used for debugging purposes, and I think that's really enough. Here's a proposal for printing the bandwidth as "x.xxx Gb/s": Nice, I like that also. Regards, Jake Same here for
RE: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
> -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] > On Behalf Of Bjorn Helgaas > Sent: Monday, April 02, 2018 1:32 PM > To: Keller, Jacob E> Cc: Tal Gilboa ; Tariq Toukan ; Ariel > Elior ; Ganesh Goudar ; > Kirsher, Jeffrey T ; everest-linux...@cavium.com; > intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux- > ker...@vger.kernel.org; linux-...@vger.kernel.org > Subject: Re: [PATCH v5 12/14] fm10k: Report PCIe link properties with > pcie_print_link_status() > > On Mon, Apr 02, 2018 at 03:56:06PM +, Keller, Jacob E wrote: > > > -Original Message- > > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > > Sent: Friday, March 30, 2018 2:06 PM > > > To: Tal Gilboa > > > Cc: Tariq Toukan ; Keller, Jacob E > > > ; Ariel Elior ; Ganesh > > > Goudar ; Kirsher, Jeffrey T > > > ; everest-linux...@cavium.com; intel-wired- > > > l...@lists.osuosl.org; netdev@vger.kernel.org; > > > linux-ker...@vger.kernel.org; > > > linux-...@vger.kernel.org > > > Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with > > > pcie_print_link_status() > > > > > > From: Bjorn Helgaas > > > > > > Use pcie_print_link_status() to report PCIe link speed and possible > > > limitations instead of implementing this in the driver itself. > > > > > > Note that pcie_get_minimum_link() can return misleading information > because > > > it finds the slowest link and the narrowest link without considering the > > > total bandwidth of the link. If the path contains a 16 GT/s x1 link and a > > > 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which > > > corresponds to 250 MB/s of bandwidth, not the actual available bandwidth > > > of > > > about 2000 MB/s for a 16 GT/s x1 link. > > > > This comment is about what's being fixed, so it would have been easier to > > parse if it were written to more clearly indicate that we're removing > > (and not adding) this behavior. > > Good point. Is this any better? 
> > fm10k: Report PCIe link properties with pcie_print_link_status() > > Previously the driver used pcie_get_minimum_link() to warn when the NIC > is in a slot that can't supply as much bandwidth as the NIC could use. > > pcie_get_minimum_link() can be misleading because it finds the slowest link > and the narrowest link (which may be different links) without considering > the total bandwidth of each link. For a path with a 16 GT/s x1 link and a > 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of > bandwidth, not the true available bandwidth of about 1969 MB/s for a > 16 GT/s x1 link. > > Use pcie_print_link_status() to report PCIe link speed and possible > limitations instead of implementing this in the driver itself. This finds > the slowest link in the path to the device by computing the total bandwidth > of each link and compares that with the capabilities of the device. > > Note that the driver previously used dev_warn() to suggest using a > different slot, but pcie_print_link_status() uses dev_info() because if the > platform has no faster slot available, the user can't do anything about the > warning and may not want to be bothered with it. Perfect! Thanks! -Jake
Re: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
On Mon, Apr 02, 2018 at 03:56:06PM +, Keller, Jacob E wrote: > > -Original Message- > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > Sent: Friday, March 30, 2018 2:06 PM > > To: Tal Gilboa> > Cc: Tariq Toukan ; Keller, Jacob E > > ; Ariel Elior ; Ganesh > > Goudar ; Kirsher, Jeffrey T > > ; everest-linux...@cavium.com; intel-wired- > > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > > linux-...@vger.kernel.org > > Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with > > pcie_print_link_status() > > > > From: Bjorn Helgaas > > > > Use pcie_print_link_status() to report PCIe link speed and possible > > limitations instead of implementing this in the driver itself. > > > > Note that pcie_get_minimum_link() can return misleading information because > > it finds the slowest link and the narrowest link without considering the > > total bandwidth of the link. If the path contains a 16 GT/s x1 link and a > > 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which > > corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of > > about 2000 MB/s for a 16 GT/s x1 link. > > This comment is about what's being fixed, so it would have been easier to > parse if it were written to more clearly indicate that we're removing > (and not adding) this behavior. Good point. Is this any better? fm10k: Report PCIe link properties with pcie_print_link_status() Previously the driver used pcie_get_minimum_link() to warn when the NIC is in a slot that can't supply as much bandwidth as the NIC could use. pcie_get_minimum_link() can be misleading because it finds the slowest link and the narrowest link (which may be different links) without considering the total bandwidth of each link. For a path with a 16 GT/s x1 link and a 2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of bandwidth, not the true available bandwidth of about 1969 MB/s for a 16 GT/s x1 link. 
Use pcie_print_link_status() to report PCIe link speed and possible limitations instead of implementing this in the driver itself. This finds the slowest link in the path to the device by computing the total bandwidth of each link and compares that with the capabilities of the device. Note that the driver previously used dev_warn() to suggest using a different slot, but pcie_print_link_status() uses dev_info() because if the platform has no faster slot available, the user can't do anything about the warning and may not want to be bothered with it.
Re: WARNING: refcount bug in should_fail
Tetsuo Handa writes: > syzbot wrote: >> > On Sun, Mar 4, 2018 at 6:57 AM, Tetsuo Handa >> > wrote: >> >> Switching from mm to fsdevel, for this report says that put_net(net) in >> >> rpc_kill_sb() made net->count < 0 when mount_ns() failed due to >> >> register_shrinker() failure. >> >> >> Relevant commits will be >> >> commit 9ee332d99e4d5a97 ("sget(): handle failures of >> >> register_shrinker()") and >> >> commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to >> >> mount_ns."). >> >> >> When sget_userns() in mount_ns() failed, mount_ns() returns an error >> >> code to >> >> the caller without calling fill_super(). That is, get_net(sb->s_fs_info) >> >> was >> >> not called by rpc_fill_super() (via fill_super callback passed to >> >> mount_ns()) >> >> but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb() >> >> from >> >> deactivate_locked_super()). >> >> >> -- >> >> static struct dentry * >> >> rpc_mount(struct file_system_type *fs_type, >> >> int flags, const char *dev_name, void *data) >> >> { >> >> struct net *net = current->nsproxy->net_ns; >> >> return mount_ns(fs_type, flags, data, net, net->user_ns, >> >> rpc_fill_super); >> >> } >> >> -- >> >> > Messed kernel output, this is definitely not in should_fail. >> >> > #syz dup: WARNING: refcount bug in sk_alloc >> >> Can't find the corresponding bug. >> > I don't think this is a dup of existing bug. > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080. Even if expanding mount_ns to more filesystems was magically fixed, proc would still have this issue with the pid namespace rather than the net namespace. This is a mess. I will take a look and see if I can see a fix. Eric
[PATCH net-next] nfp: add a separate counter for packets with CHECKSUM_COMPLETE
We are currently counting packets with CHECKSUM_COMPLETE as "hw_rx_csum_ok". This is confusing. Add a new counter. To make sure it fits in the same cacheline move the less used error counter to a different location. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 4 +++- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 16 +--- 3 files changed, 13 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 787df47ec430..bd7d8ae31e17 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -391,6 +391,7 @@ struct nfp_net_rx_ring { * @rx_drops: Number of packets dropped on RX due to lack of resources * @hw_csum_rx_ok: Counter of packets where the HW checksum was OK * @hw_csum_rx_inner_ok: Counter of packets where the inner HW checksum was OK + * @hw_csum_rx_complete: Counter of packets with CHECKSUM_COMPLETE reported * @hw_csum_rx_error: Counter of packets with bad checksums * @tx_sync: Seqlock for atomic updates of TX stats * @tx_pkts: Number of Transmitted packets @@ -434,7 +435,7 @@ struct nfp_net_r_vector { u64 rx_drops; u64 hw_csum_rx_ok; u64 hw_csum_rx_inner_ok; - u64 hw_csum_rx_error; + u64 hw_csum_rx_complete; struct nfp_net_tx_ring *xdp_ring; @@ -446,6 +447,7 @@ struct nfp_net_r_vector { u64 tx_gather; u64 tx_lso; + u64 hw_csum_rx_error; u64 rx_replace_buf_alloc_fail; u64 tx_errors; u64 tx_busy; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 43a9c207a049..1eb6549f2a54 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1406,7 +1406,7 @@ static void nfp_net_rx_csum(struct nfp_net_dp *dp, skb->ip_summed = meta->csum_type; skb->csum = meta->csum;
u64_stats_update_begin(&r_vec->rx_sync); - r_vec->hw_csum_rx_ok++; + r_vec->hw_csum_rx_complete++; u64_stats_update_end(&r_vec->rx_sync); return; } diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c index e1dae0616f52..c9016419bfa0 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c @@ -179,7 +179,7 @@ static const struct nfp_et_stat nfp_mac_et_stats[] = { #define NN_ET_GLOBAL_STATS_LEN ARRAY_SIZE(nfp_net_et_stats) #define NN_ET_SWITCH_STATS_LEN 9 -#define NN_RVEC_GATHER_STATS 8 +#define NN_RVEC_GATHER_STATS 9 #define NN_RVEC_PER_Q_STATS 3 static void nfp_net_get_nspinfo(struct nfp_app *app, char *version) @@ -468,6 +468,7 @@ static u8 *nfp_vnic_get_sw_stats_strings(struct net_device *netdev, u8 *data) data = nfp_pr_et(data, "hw_rx_csum_ok"); data = nfp_pr_et(data, "hw_rx_csum_inner_ok"); + data = nfp_pr_et(data, "hw_rx_csum_complete"); data = nfp_pr_et(data, "hw_rx_csum_err"); data = nfp_pr_et(data, "rx_replace_buf_alloc_fail"); data = nfp_pr_et(data, "hw_tx_csum"); @@ -493,18 +494,19 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data) data[0] = nn->r_vecs[i].rx_pkts; tmp[0] = nn->r_vecs[i].hw_csum_rx_ok; tmp[1] = nn->r_vecs[i].hw_csum_rx_inner_ok; - tmp[2] = nn->r_vecs[i].hw_csum_rx_error; - tmp[3] = nn->r_vecs[i].rx_replace_buf_alloc_fail; + tmp[2] = nn->r_vecs[i].hw_csum_rx_complete; + tmp[3] = nn->r_vecs[i].hw_csum_rx_error; + tmp[4] = nn->r_vecs[i].rx_replace_buf_alloc_fail; } while (u64_stats_fetch_retry(&nn->r_vecs[i].rx_sync, start)); do { start = u64_stats_fetch_begin(&nn->r_vecs[i].tx_sync); data[1] = nn->r_vecs[i].tx_pkts; data[2] = nn->r_vecs[i].tx_busy; - tmp[4] = nn->r_vecs[i].hw_csum_tx; - tmp[5] = nn->r_vecs[i].hw_csum_tx_inner; - tmp[6] = nn->r_vecs[i].tx_gather; - tmp[7] = nn->r_vecs[i].tx_lso; + tmp[5] = nn->r_vecs[i].hw_csum_tx; + tmp[6] = nn->r_vecs[i].hw_csum_tx_inner; + tmp[7] = 
nn->r_vecs[i].tx_gather; + tmp[8] = nn->r_vecs[i].tx_lso; } while (u64_stats_fetch_retry(&nn->r_vecs[i].tx_sync, start));
Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
On Mon, 2 Apr 2018 07:43:49 + Alexander Kurz wrote: > Remove the duplicated code for asix88179_178a bind and reset methods. > > Signed-off-by: Alexander Kurz > --- > drivers/net/usb/ax88179_178a.c | 137 > ++--- > 1 file changed, 31 insertions(+), 106 deletions(-) The good news is that this patch doesn't seem to break anything this time. A few remarks below: > > diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c > index a6ef75907ae9..fea4c7b877cc 100644 > --- a/drivers/net/usb/ax88179_178a.c > +++ b/drivers/net/usb/ax88179_178a.c > @@ -1223,7 +1223,7 @@ static int ax88179_led_setting(struct usbnet *dev) > return 0; > } > > -static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf) > +static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset) > { > u8 buf[5]; > u16 *tmp16; > @@ -1231,12 +1231,11 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > struct ax88179_data *ax179_data = (struct ax88179_data *)dev->data; > struct ethtool_eee eee_data; > > - usbnet_get_endpoints(dev, intf); > - So you move this to "bind"... > tmp16 = (u16 *)buf; > tmp = (u8 *)buf; > > - memset(ax179_data, 0, sizeof(*ax179_data)); > + if (!do_reset) > + memset(ax179_data, 0, sizeof(*ax179_data)); ... but not that. Why? They both are exclusive to the bind operation. 
> > /* Power up ethernet PHY */ > *tmp16 = 0; > @@ -1249,9 +1248,13 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_CLK_SELECT, 1, 1, tmp); > msleep(100); > > + if (do_reset) > + ax88179_auto_detach(dev, 0); > + > ax88179_read_cmd(dev, AX_ACCESS_MAC, AX_NODE_ID, ETH_ALEN, > ETH_ALEN, dev->net->dev_addr); > - memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN); > + if (!do_reset) > + memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN); > > /* RX bulk configuration */ > memcpy(tmp, &AX88179_BULKIN_SIZE[0], 5); > @@ -1266,19 +1269,21 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_PAUSE_WATERLVL_HIGH, > 1, 1, tmp); > > - dev->net->netdev_ops = &ax88179_netdev_ops; > - dev->net->ethtool_ops = &ax88179_ethtool_ops; > - dev->net->needed_headroom = 8; > - dev->net->max_mtu = 4088; > - > - /* Initialize MII structure */ > - dev->mii.dev = dev->net; > - dev->mii.mdio_read = ax88179_mdio_read; > - dev->mii.mdio_write = ax88179_mdio_write; > - dev->mii.phy_id_mask = 0xff; > - dev->mii.reg_num_mask = 0xff; > - dev->mii.phy_id = 0x03; > - dev->mii.supports_gmii = 1; > + if (!do_reset) { > + dev->net->netdev_ops = &ax88179_netdev_ops; > + dev->net->ethtool_ops = &ax88179_ethtool_ops; > + dev->net->needed_headroom = 8; > + dev->net->max_mtu = 4088; > + > + /* Initialize MII structure */ > + dev->mii.dev = dev->net; > + dev->mii.mdio_read = ax88179_mdio_read; > + dev->mii.mdio_write = ax88179_mdio_write; > + dev->mii.phy_id_mask = 0xff; > + dev->mii.reg_num_mask = 0xff; > + dev->mii.phy_id = 0x03; > + dev->mii.supports_gmii = 1; > + } > > dev->net->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | > NETIF_F_RXCSUM; > @@ -1330,6 +1335,13 @@ static int ax88179_bind(struct usbnet *dev, struct > usb_interface *intf) > return 0; > } > > +static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf) > +{ > + usbnet_get_endpoints(dev, intf); > + > + return 
ax88179_bind_or_reset(dev, false); > +} I find the whole construct extremely messy. You shouldn't need to "bind or reset" the adapter. You first reset it (that's the HW part), and you then bind it (that's the SW part). I understand that the code is quite messy already, and that you're trying to improve it. I'm just not convinced that having a single function containing everything and the kitchen sink is the most sensible approach. Instead, you probably want to extract stuff that needs to be done at reset time from what can be done at bind time, and keep the two quite separate. The fact that bind is mostly a superset of reset is a bit of an odd thing, and it'd be good to get to the bottom of that (I fully admit that my understanding of USB networking is close to zero). I came up with the below hack on top of your patch, and things seem to still behave. Thanks, M. diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c index fea4c7b877cc..aed98d400d7a 100644 --- a/drivers/net/usb/ax88179_178a.c +++ b/drivers/net/usb/ax88179_178a.c @@ -1223,7 +1223,7 @@ static int
RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
> -Original Message- > From: Bjorn Helgaas [mailto:helg...@kernel.org] > Sent: Monday, April 02, 2018 12:58 PM > To: Keller, Jacob E> Cc: Tal Gilboa ; Tariq Toukan ; Ariel > Elior ; Ganesh Goudar ; > Kirsher, Jeffrey T ; everest-linux...@cavium.com; > intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux- > ker...@vger.kernel.org; linux-...@vger.kernel.org > Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link > speed > and whether it's limited > > On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote: > > > -Original Message- > > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > > Sent: Friday, March 30, 2018 2:05 PM > > > To: Tal Gilboa > > > Cc: Tariq Toukan ; Keller, Jacob E > > > ; Ariel Elior ; Ganesh > > > Goudar ; Kirsher, Jeffrey T > > > ; everest-linux...@cavium.com; intel-wired- > > > l...@lists.osuosl.org; netdev@vger.kernel.org; > > > linux-ker...@vger.kernel.org; > > > linux-...@vger.kernel.org > > > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link > > > speed > and > > > whether it's limited > > > > > > From: Tal Gilboa > > > > > > Add pcie_print_link_status(). This logs the current settings of the link > > > (speed, width, and total available bandwidth). > > > > > > If the device is capable of more bandwidth but is limited by a slower > > > upstream link, we include information about the link that limits the > > > device's performance. > > > > > > The user may be able to move the device to a different slot for better > > > performance. > > > > > > This provides a unified method for all PCI devices to report status and > > > issues, instead of each device reporting in a different way, using > > > different code. 
> > > > > > Signed-off-by: Tal Gilboa > > > [bhelgaas: changelog, reword log messages, print device capabilities when > > > not limited] > > > Signed-off-by: Bjorn Helgaas > > > --- > > > drivers/pci/pci.c | 29 + > > > include/linux/pci.h |1 + > > > 2 files changed, 30 insertions(+) > > > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > > index e00d56b12747..cec7aed09f6b 100644 > > > --- a/drivers/pci/pci.c > > > +++ b/drivers/pci/pci.c > > > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, > > > enum pci_bus_speed *speed, > > > return *width * PCIE_SPEED2MBS_ENC(*speed); > > > } > > > > > > +/** > > > + * pcie_print_link_status - Report the PCI device's link speed and width > > > + * @dev: PCI device to query > > > + * > > > + * Report the available bandwidth at the device. If this is less than > > > the > > > + * device is capable of, report the device's maximum possible bandwidth > > > and > > > + * the upstream link that limits its performance to less than that. > > > + */ > > > +void pcie_print_link_status(struct pci_dev *dev) > > > +{ > > > + enum pcie_link_width width, width_cap; > > > + enum pci_bus_speed speed, speed_cap; > > > + struct pci_dev *limiting_dev = NULL; > > > + u32 bw_avail, bw_cap; > > > + > > > + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); > > > + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, > > > &width); > > > + > > > + if (bw_avail >= bw_cap) > > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n", > > > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > > + else > > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d > > > link at %s (capable of %d Mb/s with %s x%d link)\n", > > > + bw_avail, PCIE_SPEED2STR(speed), width, > > > + limiting_dev ? 
pci_name(limiting_dev) : "", > > > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > > +} > > Personally, I would make this last one a pci_warn() to indicate it at a > > higher log level, but I'm ok with the wording, and if consensus is that > > this should be at info, I'm ok with that. > > Tal's original patch did have a pci_warn() here, and we went back and > forth a bit. They get bug reports when a device doesn't perform as > expected, which argues for pci_warn(). But they also got feedback > saying warnings are a bit too much, which argues for pci_info() [1] > > I don't have a really strong opinion either way. I have a slight > preference for info because the user may not be able to do anything > about it (there may not be a faster slot available), and I think > distros are usually configured so a warning interrupts the smooth > graphical boot. > > It looks like mlx4, fm10k, and ixgbe currently use warnings, while > bnx2x, bnxt_en, and cxgb4 use info. It's a tie so far
Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote: > > -Original Message- > > From: Bjorn Helgaas [mailto:helg...@kernel.org] > > Sent: Friday, March 30, 2018 2:05 PM > > To: Tal Gilboa> > Cc: Tariq Toukan ; Keller, Jacob E > > ; Ariel Elior ; Ganesh > > Goudar ; Kirsher, Jeffrey T > > ; everest-linux...@cavium.com; intel-wired- > > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > > linux-...@vger.kernel.org > > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link > > speed and > > whether it's limited > > > > From: Tal Gilboa > > > > Add pcie_print_link_status(). This logs the current settings of the link > > (speed, width, and total available bandwidth). > > > > If the device is capable of more bandwidth but is limited by a slower > > upstream link, we include information about the link that limits the > > device's performance. > > > > The user may be able to move the device to a different slot for better > > performance. > > > > This provides a unified method for all PCI devices to report status and > > issues, instead of each device reporting in a different way, using > > different code. > > > > Signed-off-by: Tal Gilboa > > [bhelgaas: changelog, reword log messages, print device capabilities when > > not limited] > > Signed-off-by: Bjorn Helgaas > > --- > > drivers/pci/pci.c | 29 + > > include/linux/pci.h |1 + > > 2 files changed, 30 insertions(+) > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > index e00d56b12747..cec7aed09f6b 100644 > > --- a/drivers/pci/pci.c > > +++ b/drivers/pci/pci.c > > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, > > enum pci_bus_speed *speed, > > return *width * PCIE_SPEED2MBS_ENC(*speed); > > } > > > > +/** > > + * pcie_print_link_status - Report the PCI device's link speed and width > > + * @dev: PCI device to query > > + * > > + * Report the available bandwidth at the device. 
If this is less than the > > + * device is capable of, report the device's maximum possible bandwidth and > > + * the upstream link that limits its performance to less than that. > > + */ > > +void pcie_print_link_status(struct pci_dev *dev) > > +{ > > + enum pcie_link_width width, width_cap; > > + enum pci_bus_speed speed, speed_cap; > > + struct pci_dev *limiting_dev = NULL; > > + u32 bw_avail, bw_cap; > > + > > + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); > > + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, > > &width); > > + > > + if (bw_avail >= bw_cap) > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n", > > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > + else > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d > > link at %s (capable of %d Mb/s with %s x%d link)\n", > > +bw_avail, PCIE_SPEED2STR(speed), width, > > +limiting_dev ? pci_name(limiting_dev) : "", > > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap); > > +} > > Personally, I would make this last one a pci_warn() to indicate it at a > higher log level, but I'm ok with the wording, and if consensus is that > this should be at info, I'm ok with that. Tal's original patch did have a pci_warn() here, and we went back and forth a bit. They get bug reports when a device doesn't perform as expected, which argues for pci_warn(). But they also got feedback saying warnings are a bit too much, which argues for pci_info() [1] I don't have a really strong opinion either way. I have a slight preference for info because the user may not be able to do anything about it (there may not be a faster slot available), and I think distros are usually configured so a warning interrupts the smooth graphical boot. It looks like mlx4, fm10k, and ixgbe currently use warnings, while bnx2x, bnxt_en, and cxgb4 use info. 
It's a tie so far :) [1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b-5ffaf261c...@mellanox.com Here's a proposal for printing the bandwidth as "x.xxx Gb/s": commit ad370f38c1b5e9b8bb941eaed84ebb676c4bdaa4 Author: Tal Gilboa Date: Fri Mar 30 08:56:47 2018 -0500 PCI: Add pcie_print_link_status() to log link speed and whether it's limited Add pcie_print_link_status(). This logs the current settings of the link (speed, width, and total available bandwidth). If the device is capable of more bandwidth but is limited by a slower upstream link, we include information about the link that limits the device's performance. The user may be able to move the device to a different slot for better performance. This provides a unified method for all
Re: [bpf-next PATCH 1/2] bpf: sockmap, free memory on sock close with cork data
On 04/02/2018 12:50 PM, John Fastabend wrote: > If a socket with pending cork data is closed we do not return the > memory to the socket until the garbage collector frees the psock > structure. The garbage collector though can run after the sock has > completed its close operation. If this ordering happens the sock code > will throw a WARN_ON because there is still outstanding memory > accounted to the sock. > > To resolve this ensure we return memory to the sock when a socket > is closed. > > Signed-off-by: John Fastabend > Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper") > --- Hi Alexei, Daniel, These two fixes apply against current bpf-next or bpf after bpf-next is merged. I could resend later I suppose but I think it makes sense to get these in sooner rather than later. Thanks, John
[bpf-next PATCH 2/2] bpf: sockmap, duplicates release calls may NULL sk_prot
It is possible to have multiple ULP tcp_release call paths in flight if a sock is closed and simultaneously being removed from the sockmap control path. The result would be setting the sk_prot to the saved values on the first iteration and then on the second iteration setting the value to NULL. This patch resolves this by ensuring we only reset the sk_prot pointer if we have a valid saved state to set. Fixes: 4f738adba30a7 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: John Fastabend --- kernel/bpf/sockmap.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index 8ddf326..8dd9210 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -182,8 +182,10 @@ static void bpf_tcp_release(struct sock *sk) psock->cork = NULL; } - sk->sk_prot = psock->sk_proto; - psock->sk_proto = NULL; + if (psock->sk_proto) { + sk->sk_prot = psock->sk_proto; + psock->sk_proto = NULL; + } out: rcu_read_unlock(); }
[bpf-next PATCH 1/2] bpf: sockmap, free memory on sock close with cork data
If a socket with pending cork data is closed we do not return the memory to the socket until the garbage collector frees the psock structure. The garbage collector though can run after the sock has completed its close operation. If this ordering happens the sock code will throw a WARN_ON because there is still outstanding memory accounted to the sock. To resolve this ensure we return memory to the sock when a socket is closed. Signed-off-by: John Fastabend Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper") --- kernel/bpf/sockmap.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index d2bda5a..8ddf326 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -211,6 +211,12 @@ static void bpf_tcp_close(struct sock *sk, long timeout) close_fun = psock->save_close; write_lock_bh(&sk->sk_callback_lock); + if (psock->cork) { + free_start_sg(psock->sock, psock->cork); + kfree(psock->cork); + psock->cork = NULL; + } + list_for_each_entry_safe(md, mtmp, &psock->ingress, list) { list_del(&md->list); free_start_sg(psock->sock, md);
Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
On Mon, Apr 02, 2018 at 04:00:16PM +, Keller, Jacob E wrote: > > -Original Message- > > From: Tal Gilboa [mailto:ta...@mellanox.com] > > Sent: Monday, April 02, 2018 7:34 AM > > To: Bjorn Helgaas> > Cc: Tariq Toukan ; Keller, Jacob E > > ; Ariel Elior ; Ganesh > > Goudar ; Kirsher, Jeffrey T > > ; everest-linux...@cavium.com; intel-wired- > > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > > linux-...@vger.kernel.org > > Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute > > max supported link bandwidth > > > > On 4/2/2018 5:05 PM, Bjorn Helgaas wrote: > > > On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote: > > >> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote: > > >>> On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote: > > On 3/31/2018 12:05 AM, Bjorn Helgaas wrote: > > > From: Tal Gilboa > > > > > > Add pcie_bandwidth_capable() to compute the max link bandwidth > > supported by > > > a device, based on the max link speed and width, adjusted by the > > encoding > > > overhead. > > > > > > The maximum bandwidth of the link is computed as: > > > > > > max_link_speed * max_link_width * (1 - encoding_overhead) > > > > > > The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using > > 8b/10b > > > encoding, and about 1.5% for 8 GT/s or higher speed links using > > > 128b/130b > > > encoding. 
> > > > > > Signed-off-by: Tal Gilboa > > > [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap() > > > signatures, don't export outside drivers/pci] > > > Signed-off-by: Bjorn Helgaas > > > Reviewed-by: Tariq Toukan > > > --- > > > drivers/pci/pci.c | 21 + > > > drivers/pci/pci.h |9 + > > > 2 files changed, 30 insertions(+) > > > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > > index 43075be79388..9ce89e254197 100644 > > > --- a/drivers/pci/pci.c > > > +++ b/drivers/pci/pci.c > > > @@ -5208,6 +5208,27 @@ enum pcie_link_width > > pcie_get_width_cap(struct pci_dev *dev) > > > return PCIE_LNK_WIDTH_UNKNOWN; > > > } > > > +/** > > > + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth > > capability > > > + * @dev: PCI device > > > + * @speed: storage for link speed > > > + * @width: storage for link width > > > + * > > > + * Calculate a PCI device's link bandwidth by querying for its link > > > speed > > > + * and width, multiplying them, and applying encoding overhead. > > > + */ > > > +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed > > *speed, > > > +enum pcie_link_width *width) > > > +{ > > > + *speed = pcie_get_speed_cap(dev); > > > + *width = pcie_get_width_cap(dev); > > > + > > > + if (*speed == PCI_SPEED_UNKNOWN || *width == > > PCIE_LNK_WIDTH_UNKNOWN) > > > + return 0; > > > + > > > + return *width * PCIE_SPEED2MBS_ENC(*speed); > > > +} > > > + > > > /** > > > * pci_select_bars - Make BAR mask from the type of resource > > > * @dev: the PCI device for which BAR mask is made > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > > index 66738f1050c0..2a50172b9803 100644 > > > --- a/drivers/pci/pci.h > > > +++ b/drivers/pci/pci.h > > > @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev > > *dev); > > >(speed) == PCIE_SPEED_2_5GT ? 
"2.5 GT/s" : \ > > >"Unknown speed") > > > +/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for > > gen3 */ > > > +#define PCIE_SPEED2MBS_ENC(speed) \ > > > > Missing gen4. > > >>> > > >>> I made it "gen3+". I think that's accurate, isn't it? The spec > > >>> doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2 > > >>> says rates of 8 GT/s or higher (which I think includes gen3 and gen4) > > >>> use 128b/130b encoding. > > >>> > > >> > > >> I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it > > >> wasn't > > >> added. Need to return 15754. > > > > > > Oh, duh, of course! Sorry for being dense. What about the following? > > > I included the calculation as opposed to just the magic numbers to try > > > to make it clear how they're derived. This has the disadvantage of > > > truncating the result instead of rounding, but I doubt that's > > > significant in this context. If it is, we could use the magic numbers > > > and put the computation in a comment. > > > > We can always use DIV_ROUND_UP((speed * enc_nominator), > >
Re: [PATCH v2] KEYS: DNS: limit the length of option strings
On Fri, Mar 23, 2018 at 01:21:22PM -0700, Eric Biggers wrote: > On Mon, Mar 12, 2018 at 10:57:07AM -0700, Eric Biggers wrote: > > On Wed, Mar 07, 2018 at 03:54:37PM +, David Howells wrote: > > > Eric Biggers wrote: > > > > > > > Fix it by limiting option strings (combined name + value) to a much more > > > > reasonable 128 bytes. The exact limit is arbitrary, but currently the > > > > only recognized option is formatted as "dnserror=%lu" which fits well > > > > within this limit. > > > > > > There will be more options coming ("ipv4", "ipv6") but they shouldn't > > > overrun > > > this limit and we can always extend the limit if need be. > > > > > > David > > > > David (Howells) do you want to take this patch through the keyrings tree or > > should I ask David Miller to take it through net-next? > > > > Eric > > Ping. Ping again.
[PATCH v8 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to have a sequence as wmb() writel_relaxed() mmiowb() Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index afadba9..c17924b 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1696,12 +1696,6 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(i, rx_ring->tail); } } @@ -2467,10 +2461,6 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, if (xdp_xmit) { struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()]; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. -*/ - wmb(); writel(ring->next_to_use, ring->tail); xdp_do_flush_map(); @@ -8080,12 +8070,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring, /* set the timestamp */ first->time_stamp = jiffies; - /* -* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). 
-* -* We also need this memory barrier to make certain all of the + /* We need this memory barrier to make certain all of the * status bits have been updated before next_to_watch is written. */ wmb(); @@ -8102,7 +8087,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring, ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED); if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems @@ -10034,10 +10019,6 @@ static void ixgbe_xdp_flush(struct net_device *dev) if (unlikely(!ring)) return; - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. -*/ - wmb(); writel(ring->next_to_use, ring->tail); return; -- 2.7.4
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
❦ 2 April 2018 22:05 +0300, Julian Anastasov: > Sorry to say it, but maybe you missed the discussion > on lvs-devel about the new MH scheduler implemented by Inju Song: > > https://www.spinics.net/lists/lvs-devel/msg04928.html > http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html > > In the last 6 months we fixed all issues and I acked > v4 just yesterday: > > http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html For some reason, "ipvs maglev" in Google doesn't yield those results. But having code already reviewed is great news for me: I can use it with confidence. :) -- Make sure your code "does nothing" gracefully. - The Elements of Programming Style (Kernighan & Plauger)
[PATCH v8 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 +- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +--- 2 files changed, 10 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index f174c72..1b9fa7a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -186,7 +186,13 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter *fdir_data, /* Mark the data descriptor to be watched */ first->next_to_watch = tx_desc; - writel(tx_ring->next_to_use, tx_ring->tail); + writel_relaxed(tx_ring->next_to_use, tx_ring->tail); + + /* We need this if more than one processor can write to our tail +* at a time, it synchronizes IO on IA64/Altix systems +*/ + mmiowb(); + return 0; dma_fail: @@ -1523,12 +1529,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = val; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). 
-*/ - wmb(); writel(val, rx_ring->tail); } @@ -2274,11 +2274,7 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring, static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring) { - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. -*/ - wmb(); - writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail); + writel(xdp_ring->next_to_use, xdp_ring->tail); } /** @@ -3444,7 +3440,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, /* notify HW of packet */ if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c index 12bd937..eb5556e 100644 --- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c @@ -804,12 +804,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = val; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(val, rx_ring->tail); } @@ -2379,7 +2373,7 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb, /* notify HW of packet */ if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems -- 2.7.4
[PATCH v8 5/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++- 1 file changed, 2 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index df86070..41e3aaa 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -172,13 +172,6 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 cleaned_count) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); - /* notify hardware of new descriptors */ writel(i, rx_ring->tail); } @@ -1036,11 +1029,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring, /* record SW timestamp if HW timestamp is not available */ skb_tx_timestamp(first->skb); - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). -* -* We also need this memory barrier to make certain all of the + /* We need this memory barrier to make certain all of the * status bits have been updated before next_to_watch is written. 
*/ wmb(); @@ -1055,7 +1044,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring, /* notify HW of packet */ if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems -- 2.7.4
[PATCH v8 4/7] igb: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/igb/igb_main.c | 14 ++ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index c1c0bc3..c3f7130 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -5652,11 +5652,7 @@ static int igb_tx_map(struct igb_ring *tx_ring, /* set the timestamp */ first->time_stamp = jiffies; - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). -* -* We also need this memory barrier to make certain all of the + /* We need this memory barrier to make certain all of the * status bits have been updated before next_to_watch is written. 
*/ wmb(); @@ -5674,7 +5670,7 @@ static int igb_tx_map(struct igb_ring *tx_ring, igb_maybe_stop_tx(tx_ring, DESC_NEEDED); if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) { - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems @@ -8073,12 +8069,6 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 cleaned_count) /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(i, rx_ring->tail); } } -- 2.7.4
[PATCH v8 6/7] ixgbevf: keep writel() closer to wmb()
Remove ixgbevf_write_tail() in favor of moving writel() close to wmb(). Signed-off-by: Sinan Kaya Reviewed-by: Alexander Duyck --- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 - drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++--- 2 files changed, 3 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 447ce1d..c75ea1f 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -312,11 +312,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring *ring) return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1; } -static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value) -{ - writel(value, ring->tail); -} - #define IXGBEVF_RX_DESC(R, i) \ (&(((union ixgbe_adv_rx_desc *)((R)->desc))[i])) #define IXGBEVF_TX_DESC(R, i) \ diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index e3d04f2..757dac6 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring, * such as IA-64). */ wmb(); - ixgbevf_write_tail(rx_ring, i); + writel(i, rx_ring->tail); } } @@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, * know there are new descriptors to fetch. */ wmb(); - ixgbevf_write_tail(xdp_ring, xdp_ring->next_to_use); + writel(xdp_ring->next_to_use, xdp_ring->tail); } u64_stats_update_begin(_ring->syncp); @@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, tx_ring->next_to_use = i; /* notify HW of packet */ - ixgbevf_write_tail(tx_ring, i); + writel(i, tx_ring->tail); return; dma_error: -- 2.7.4
[PATCH v8 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 23 +++ 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 757dac6..29b71a7 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -719,12 +719,6 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring, /* update next to alloc since we have filled the ring */ rx_ring->next_to_alloc = i; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). -*/ - wmb(); writel(i, rx_ring->tail); } } @@ -1228,10 +1222,6 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, struct ixgbevf_ring *xdp_ring = adapter->xdp_ring[rx_ring->queue_index]; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. -*/ - wmb(); writel(xdp_ring->next_to_use, xdp_ring->tail); } @@ -3985,11 +3975,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, /* set the timestamp */ first->time_stamp = jiffies; - /* Force memory writes to complete before letting h/w know there -* are new descriptors to fetch. (Only applicable for weak-ordered -* memory model archs, such as IA-64). 
-* -* We also need this memory barrier (wmb) to make certain all of the + /* We also need this memory barrier (wmb) to make certain all of the * status bits have been updated before next_to_watch is written. */ wmb(); @@ -4004,7 +3990,12 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, tx_ring->next_to_use = i; /* notify HW of packet */ - writel(i, tx_ring->tail); + writel_relaxed(i, tx_ring->tail); + + /* We need this if more than one processor can write to our tail +* at a time, it synchronizes IO on IA64/Altix systems +*/ + mmiowb(); return; dma_error: -- 2.7.4
Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
[dropping Freddy as I'm getting bounces from asix.com.tw] On Mon, 2 Apr 2018 15:21:08 + Alexander Kurz wrote: Alexander, > Hi Marc, David, > with the v2 patch ("net: usb: asix88179_178a: de-duplicate code") > I made an embarrassingly stupid mistake of removing the wrong function. > The v2 patch accidentally changed ax88179_link_reset() instead of > ax88179_reset(). Hunk 6 of v2 ("net: usb: asix88179_178a: de-duplicate > code") is just utterly wrong. OK, that explains what I saw here. > ax88179_bind() and ax88179_reset() were the correct targets to be > de-duplicated, as done in the v3 patch. > > Sorry for this, Alexander We all make mistakes, so let's try to learn from them. You can improve your process by testing what you're about to send (a very basic requirement), and documenting the changes you make in each version you send (a cover letter is a good place to put it). Answering reviewer questions helps build trust between the contributor and the maintainer on the receiving end of the patch, and you probably want to answer them before sending a new version so that a proper discussion can take place, especially if the reviewer doesn't quite see what you're aiming for. I'll comment on the patch separately. Hope this helps, M. -- Without deviation from the norm, progress is not possible.
[PATCH v8 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs
memory-barriers.txt has been updated as follows: "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Remove old IA-64 comments in the code along with unneeded wmb() in front of writel(). There are places in the code where wmb() has been used as a double barrier for CPU and IO in place of smp_wmb() and wmb() as an optimization. For such places, keep the wmb() but replace the following writel() with writel_relaxed() to get the sequence wmb(); writel_relaxed(); mmiowb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/intel/igbvf/netdev.c | 16 +--- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c index e2b7502..d9f186a 100644 --- a/drivers/net/ethernet/intel/igbvf/netdev.c +++ b/drivers/net/ethernet/intel/igbvf/netdev.c @@ -246,12 +246,6 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring *rx_ring, else i--; - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). - */ - wmb(); writel(i, adapter->hw.hw_addr + rx_ring->tail); } } @@ -2289,16 +2283,16 @@ static inline void igbvf_tx_queue_adv(struct igbvf_adapter *adapter, } tx_desc->read.cmd_type_len |= cpu_to_le32(adapter->txd_cmd); - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). + + /* We use this memory barrier to make certain all of the +* status bits have been updated before next_to_watch is +* written. 
*/ wmb(); tx_ring->buffer_info[first].next_to_watch = tx_desc; tx_ring->next_to_use = i; - writel(i, adapter->hw.hw_addr + tx_ring->tail); + writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail); /* we need this if more than one processor can write to our tail * at a time, it synchronizes IO on IA64/Altix systems */ -- 2.7.4
[PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel() in multiple places. writel() already has a barrier on some architectures like arm64. This ends up with the CPU observing two barriers back to back before executing the register write. Since the code already has an explicit barrier call, change writel() to writel_relaxed(). I did a regex search for wmb() followed by writel() in each driver's directory. I scrubbed the ones I care about in this series. I considered "ease of change", "popular usage" and "performance critical path" as the determining criteria for my filtering. We have used the relaxed API heavily on ARM for a long time, but it did not exist on other architectures. For this reason, weakly-ordered architectures have been paying a double penalty in order to use the common drivers. Now that the relaxed API is present on all architectures, we can go and scrub all drivers to see what needs to change and what can remain. We start with the most used ones and hope to increase the coverage over time. It will take a while to cover all drivers. Feel free to apply patches individually. Changes since v7: * API clarification with Linus and several architecture folks regarding writel() https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html * removed wmb() in front of writel() at several places. * removed old IA64 comments regarding ordering. 
Sinan Kaya (7): i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs ixgbe: eliminate duplicate barriers on weakly-ordered archs igbvf: eliminate duplicate barriers on weakly-ordered archs igb: eliminate duplicate barriers on weakly-ordered archs fm10k: Eliminate duplicate barriers on weakly-ordered archs ixgbevf: keep writel() closer to wmb() ixgbevf: eliminate duplicate barriers on weakly-ordered archs drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++--- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 -- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +-- drivers/net/ethernet/intel/igb/igb_main.c | 14 ++-- drivers/net/ethernet/intel/igbvf/netdev.c | 16 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 - drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 --- 8 files changed, 30 insertions(+), 100 deletions(-) -- 2.7.4
[GIT PULL] remove in-kernel calls to syscalls
Linus, please pull the following changes since commit 0c8efd610b58cb23cefdfa12015799079aef94ae: Linux 4.16-rc5 (2018-03-11 17:25:09 -0700) which are available in the Git repository at: https://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux.git syscalls-next up to commit c9a211951c7c79cfb5de888d7d9550872868b086: bpf: whitelist all syscalls for error injection (2018-04-02 20:16:21 +0200) to remove all in-kernel calls to syscalls except from arch/ . Since the last time I sent the patches out for review,[*] I have solely added a few more ACKs. Jon Corbet raised the question whether the documentation really should go to Documentation/process/adding-syscalls.rst and not to Documentation/process/coding-style.rst (even though, as he said, that isn't quite right either). As most of the existing instances where syscalls were called in the kernel were (1) common codepaths for old and new syscalls, (2) common codepaths for native and compat syscalls, and (3) syscall multiplexers like sys_ipc(), I have kept it at the former location for the time being, but will be happy to submit a follow-up patch to move the documentation bits to a different file. [*] lkml.kernel.org/r/20180329112426.23043-1-li...@dominikbrodowski.net All these patches have been in -next, but got rebased a few minutes ago to include another ACK in patch 2/109 (no code changes). There were/are a few trivial conflicts against the net, sparc and vfs trees, but not (yet) against what is in your tree up to commit 86bbbebac1933e6e95e8234c4f7d220c5ddd38bc. Thanks, Dominik System calls are interaction points between userspace and the kernel. Therefore, system call functions such as sys_xyzzy() or compat_sys_xyzzy() should only be called from userspace via the syscall table, but not from elsewhere in the kernel. 
At least on 64-bit x86, it will likely be a hard requirement from v4.17 onwards to not call system call functions in the kernel: It is better to use a different calling convention for system calls there, where struct pt_regs is decoded on-the-fly in a syscall wrapper which then hands processing over to the actual syscall function. This means that only those parameters which are actually needed for a specific syscall are passed on during syscall entry, instead of filling in six CPU registers with random user space content all the time (which may cause serious trouble down the call chain). Those x86-specific patches will be pushed through the x86 tree in the near future. Moreover, rules on how data may be accessed may differ between kernel data and user data. This is another reason why calling sys_xyzzy() is generally a bad idea, and -- at most -- acceptable in arch-specific code. This patchset removes all in-kernel calls to syscall functions in the kernel with the exception of arch/. On top of this, it cleans up the three places where many syscalls are referenced or prototyped, namely kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h. First goes a patch which defines the goal and explains the rationale: syscalls: define and explain goal to not call syscalls in the kernel A few codepaths can trivially be converted to existing in-kernel interfaces: kernel: use kernel_wait4() instead of sys_wait4() kernel: open-code sys_rt_sigpending() in sys_sigpending() kexec: call do_kexec_load() in compat syscall directly mm: use do_futex() instead of sys_futex() in mm_release() x86: use _do_fork() in compat_sys_x86_clone() x86: remove compat_sys_x86_waitpid() Then follow many patches which only affect specific subsystems each, and replace sys_*() with internal helpers named __sys_*() or do_sys_*(). 
Let's start with net/: net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall net: socket: add __sys_sendto() helper; remove in-kernel call to syscall net: socket: add __sys_accept4() helper; remove in-kernel call to syscall net: socket: add __sys_socket() helper; remove in-kernel call to syscall net: socket: add __sys_bind() helper; remove in-kernel call to syscall net: socket: add __sys_connect() helper; remove in-kernel call to syscall net: socket: add __sys_listen() helper; remove in-kernel call to syscall net: socket: add __sys_getsockname() helper; remove in-kernel call to syscall net: socket: add __sys_getpeername() helper; remove in-kernel call to syscall net: socket: add __sys_socketpair() helper; remove in-kernel call to syscall net: socket: add __sys_shutdown() helper; remove in-kernel call to syscall net: socket: add __sys_setsockopt() helper; remove in-kernel call to syscall net: socket: add __sys_getsockopt() helper; remove in-kernel call to syscall net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to syscall net: socket: move check for forbid_cmsg_compat to __sys_...msg() net: socket: replace calls to sys_send() with __sys_sendto() net:
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
Hello, On Mon, 2 Apr 2018, Vincent Bernat wrote: > Based on Google's Maglev algorithm [1][2], this scheduler builds a > lookup table in a way that minimizes disruption when a change > occurs. This helps in case of an active/active setup without > synchronization. Like for classic source hashing, this lookup table is > used to assign connections to a real server. > > Both source address and port are used to compute the hash (unlike sh > where this is optional). > > Weights are correctly handled. Unlike sh, servers with a weight of 0 > are considered as absent. Also, unlike sh, when a server becomes > unavailable due to a threshold, no fallback is possible: doing so > would seriously impair the usefulness of using a consistent hash. > > There is a small hack to detect when all real servers have a weight of > 0. It relies on the fact that it is not possible for the weight of a real > server to change during the execution of the assignment. I believe > this is the case as modifications through netlink are subject to a > mutex, but the use of atomic_read() is unsettling. > > The value of 65537 for the hash table size is currently not modifiable > at compile-time. This is the value suggested in the Maglev > paper. Another possible value is 257 (for small tests) and 655373 (for > very large setups). 
> > [1]: https://research.google.com/pubs/pub44824.html > [2]: > https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/ Sorry to say it, but maybe you missed the discussion on lvs-devel about the new MH scheduler implemented by Inju Song: https://www.spinics.net/lists/lvs-devel/msg04928.html http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html In the last 6 months we fixed all issues and I acked v4 just yesterday: http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html This scheduler supports: - tables with different size (prime): IP_VS_MH_TAB_INDEX - gcd of weights: ip_vs_mh_gcd_weight - shifted weights: ip_vs_mh_shift_weight - weight can be changed any time > Signed-off-by: Vincent Bernat> --- > include/net/ip_vs.h| 27 > net/netfilter/ipvs/Kconfig | 13 ++ > net/netfilter/ipvs/Makefile| 1 + > net/netfilter/ipvs/ip_vs_csh.c | 339 > + > net/netfilter/ipvs/ip_vs_sh.c | 32 +--- > 5 files changed, 381 insertions(+), 31 deletions(-) > create mode 100644 net/netfilter/ipvs/ip_vs_csh.c Regards -- Julian Anastasov
Re: [PATCH 15/15] ARM: pxa: change SSP DMA channels allocation
Hi Robert, I love your patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v4.16] [cannot apply to arm-soc/for-next next-20180329] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Robert-Jarzmik/ARM-pxa-switch-to-DMA-slave-maps/20180402-233029 config: arm-allmodconfig (attached as .config) compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=arm All errors (new ones prefixed by >>): >> arch/arm/plat-pxa/ssp.c:19:10: fatal error: mach/audio.h: No such file or directory #include <mach/audio.h> ^~~~~~~~~~~~~~ compilation terminated. vim +19 arch/arm/plat-pxa/ssp.c > 19 #include <mach/audio.h> 20 #include 21 #include 22 #include 23 #include 24 #include 25 #include 26 #include 27 #include 28 #include 29 #include 30 #include 31 #include 32 #include 33 #include 34 #include 35 #include 36 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [net-next 0/2] Add promiscous mode support in k2g network driver
On 04/02/2018 12:28 PM, David Miller wrote: > From: Murali Karicheri> Date: Mon, 2 Apr 2018 12:17:17 -0400 > >> This patch adds support for promiscuous mode in network driver for K2G >> SoC. This depends on v3 of my series at >> https://www.spinics.net/lists/kernel/msg2765942.html > > The net-next tree is closed, please resubmit this series after the merge > window when the net-next tree is opened back up. > > Thank you. > Sure! As I have indicated in my cover letter, I will fold this into the other series and re-submit it when the merge window opens. Thanks -- Murali Karicheri Linux Kernel, Keystone
Re: [net-next 2/2] net: netcp: ethss: k2g: add promiscuous mode support
Andrew, Thanks for reviewing this! On 04/02/2018 12:47 PM, Andrew Lunn wrote: > On Mon, Apr 02, 2018 at 12:17:19PM -0400, Murali Karicheri wrote: >> +static int gbe_set_rx_mode(void *intf_priv, bool promisc) >> +{ >> +struct gbe_intf *gbe_intf = intf_priv; >> +struct gbe_priv *gbe_dev = gbe_intf->gbe_dev; >> +struct cpsw_ale *ale = gbe_dev->ale; >> +unsigned long timeout; >> +int i, ret = -ETIMEDOUT; >> + >> +/* Disable(1)/Enable(0) Learn for all ports (host is port 0 and >> + * slaves are port 1 and up >> + */ >> +for (i = 0; i <= gbe_dev->num_slaves; i++) { >> +cpsw_ale_control_set(ale, i, >> + ALE_PORT_NOLEARN, !!promisc); >> +cpsw_ale_control_set(ale, i, >> + ALE_PORT_NO_SA_UPDATE, !!promisc); >> +} > > Hi Murali > > Does this mean that in promisc mode, switching of frames between ports > in hardware is disabled? You are relying on the software bridge to > perform such bridging between ports? The K2G switch has only one slave port. The other port is the host port. So there is no switching applicable here. At the egress, the driver provides the frame with a PS_FLAG to indicate which port the frame is forwarded to, and in the ingress direction it forwards the received frame to the host port, which is the only other port in a K2G switch (2u). Implementing promiscuous mode requires the ALE to be enabled, and takes advantage of the ALE feature that floods all received unicast frames to the host port. In non-promiscuous mode, that feature is disabled, so only frames with a destination MAC address match are forwarded. For other K2 devices that have more than one port available, what you say is applicable. However, we have not implemented the switch mode on these devices with multiple ports and don't plan to do so in the future, as this device line is already mature and adding the feature at this point doesn't make much sense. The drivers on these devices currently bypass the ALE and implement plain Ethernet interfaces (n port) for Ethernet connectivity. 
> > You might want to look at skb->offload_fwd_mark. By setting this, you > can tell the software bridge the hardware has already bridged the > frame. You might then be able to have promisc enabled, and the > hardware still doing the forwarding. Yes, if we decide to support switch mode for K2 devices, I will certainly look at this and add support as you have suggested. > >Andrew > -- Murali Karicheri Linux Kernel, Keystone
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On Mon, Apr 02, 2018 at 12:09:44PM -0600, David Ahern wrote: > On 4/2/18 12:03 PM, John Fastabend wrote: > > > > Can the above be a normal BPF helper that returns an > > ifindex? Then something roughly like this pattern would > > work for all drivers with redirect support, > > > > > > route_ifindex = ip_route_lookup(__daddr, ) > > if (!route_ifindex) > >return do_foo() > > return xdp_redirect(route_ifindex); > > > > So my suggestion is, > > > > 1. enable veth xdp (including redirect support) > > 2. add a helper to lookup route from routing table > > > > Alternatively you can skip step (2) and encode the routing > > table in BPF directly. Maybe we need a more efficient data > > structure but that should also work. > > > > That's what I have here: > > https://github.com/dsahern/linux/commit/bab42f158c0925339f7519df7fb2cde8eac33aa8 Was wondering what's up with the delay and when you are going to submit them officially... The use case came up several times.
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On 4/2/18 12:03 PM, John Fastabend wrote: > > Can the above be a normal BPF helper that returns an > ifindex? Then something roughly like this pattern would > work for all drivers with redirect support, > > > route_ifindex = ip_route_lookup(__daddr, ) > if (!route_ifindex) >return do_foo() > return xdp_redirect(route_ifindex); > > So my suggestion is, > > 1. enable veth xdp (including redirect support) > 2. add a helper to lookup route from routing table > > Alternatively you can skip step (2) and encode the routing > table in BPF directly. Maybe we need a more efficient data > structure but that should also work. > That's what I have here: https://github.com/dsahern/linux/commit/bab42f158c0925339f7519df7fb2cde8eac33aa8 And Jesper has done some measurements showing a 400% improvement in throughput. I have not had time to come back to the xdp forwarding set. It needs to handle vlan devices at a minimum before I send an RFC.
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On 04/01/2018 05:47 PM, Md. Islam wrote: > This patch implements IPv4 forwarding on xdp_buff. I added a new > config option XDP_ROUTER. Kernel would forward packets through fast > path when this option is enabled. But it would require driver support. > Currently it only works with veth. Here I have modified veth such that > it outputs xdp_buff. I created a testbed in Mininet. The Mininet > script (topology.py) is attached. Here the topology is: > > h1 -r1-h2 (r1 acts as a router) > > This patch improves the throughput from 53.8Gb/s to 60Gb/s on my > machine. Median RTT also improved from around .055 ms to around .035 > ms. > > Then I disabled hyperthreading and cpu frequency scaling in order to > utilize CPU cache (DPDK also utilizes CPU cache to improve > forwarding). This further improves per-packet forwarding latency from > around 400ns to 200 ns. More specifically, header parsing and fib > lookup only takes around 82 ns. This shows that this could be used to > implement linerate packet forwarding in kernel. > > The patch has been generated on 4.15.0+. Please let me know your > feedback and suggestions. Please feel free to let me know if this > approach makes sense. Makes sense, although let's try to avoid hard-coded routing in the XDP xmit routines. See details below. > +#ifdef CONFIG_XDP_ROUTER > +int veth_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) > +{ This is nice, but instead of building a new config_xdp_router, just enable standard XDP for veth plus a new helper call to do routing. Then it will be immediately usable from any XDP-enabled device.
> +struct veth_priv *priv = netdev_priv(dev); > +struct net_device *rcv; > +struct ethhdr *ethh; > +struct sk_buff *skb; > +int length = xdp->data_end - xdp->data; > + > +rcu_read_lock(); > +rcv = rcu_dereference(priv->peer); > +if (unlikely(!rcv)) { > +kfree(xdp); > +goto drop; > +} > + > +/* Update MAC address and checksum */ > +ethh = eth_hdr_xdp(xdp); > +ether_addr_copy(ethh->h_source, dev->dev_addr); > +ether_addr_copy(ethh->h_dest, rcv->dev_addr); > + > +/* if IP forwarding is enabled on the receiver, > + * call xdp_router_forward() > + */ > +if (is_forwarding_enabled(rcv)) { > +prefetch_xdp(xdp); > +if (likely(xdp_router_forward(rcv, xdp) == NET_RX_SUCCESS)) { > +struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats); > + > +u64_stats_update_begin(&stats->syncp); > +stats->bytes += length; > +stats->packets++; > +u64_stats_update_end(&stats->syncp); > +goto success; > +} > +} > + > +/* Local deliver */ > +skb = (struct sk_buff *)xdp->data_meta; > +if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) { > +struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats); > + > +u64_stats_update_begin(&stats->syncp); > +stats->bytes += length; > +stats->packets++; > +u64_stats_update_end(&stats->syncp); > +} else { > +drop: > +atomic64_inc(&priv->dropped); > +} > +success: > +rcu_read_unlock(); > +return NETDEV_TX_OK; > +} > +#endif > + > static const struct net_device_ops veth_netdev_ops = { > .ndo_init= veth_dev_init, > .ndo_open= veth_open, > @@ -290,6 +370,9 @@ static const struct net_device_ops veth_netdev_ops = { > .ndo_get_iflink= veth_get_iflink, > .ndo_features_check= passthru_features_check, > .ndo_set_rx_headroom= veth_set_rx_headroom, > +#ifdef CONFIG_XDP_ROUTER > +.ndo_xdp_xmit= veth_xdp_xmit, > +#endif > }; > [...] > +#ifdef CONFIG_XDP_ROUTER > +int ip_route_lookup(__be32 daddr, __be32 saddr, > + u8 tos, struct net_device *dev, > + struct fib_result *res); > +#endif > + Can the above be a normal BPF helper that returns an ifindex?
Then something roughly like this pattern would work for all drivers with redirect support, route_ifindex = ip_route_lookup(__daddr, ) if (!route_ifindex) return do_foo() return xdp_redirect(route_ifindex); So my suggestion is, 1. enable veth xdp (including redirect support) 2. add a helper to lookup route from routing table Alternatively you can skip step (2) and encode the routing table in BPF directly. Maybe we need a more efficient data structure but that should also work. Thanks, John
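John's pattern (look up an egress ifindex, then redirect) can be modeled outside the kernel. The sketch below is a plain userspace C toy: the FIB array, the `ip_route_lookup()` name, and the ifindex values are all made up for illustration. In a real XDP program the lookup would be a BPF helper and the redirect would be a call like xdp_redirect() above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of "lookup route, get ifindex".
 * A real implementation would consult the kernel FIB via a helper. */
struct route {
    uint32_t prefix;   /* IPv4 prefix, host byte order for the sketch */
    uint8_t  plen;     /* prefix length in bits */
    int      ifindex;  /* egress device; 0 would mean "no route" */
};

static const struct route fib[] = {
    { 0x0a000000, 8,  2 },   /* 10.0.0.0/8  -> ifindex 2 */
    { 0x0a010000, 16, 3 },   /* 10.1.0.0/16 -> ifindex 3 */
    { 0x00000000, 0,  1 },   /* default     -> ifindex 1 */
};

/* Longest-prefix match over the toy FIB, returning an ifindex. */
static int ip_route_lookup(uint32_t daddr)
{
    int best_len = -1, best_if = 0;

    for (unsigned i = 0; i < sizeof(fib) / sizeof(fib[0]); i++) {
        /* plen == 0 means "match everything"; avoid a 32-bit shift by 32 */
        uint32_t mask = fib[i].plen ? ~0u << (32 - fib[i].plen) : 0;

        if ((daddr & mask) == fib[i].prefix && fib[i].plen > best_len) {
            best_len = fib[i].plen;
            best_if = fib[i].ifindex;
        }
    }
    return best_if;
}
```

The XDP program would then pass the returned ifindex to the redirect action instead of hard-coding a peer device.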
[Patch net] af_unix: remove redundant lockdep class
After commit 581319c58600 ("net/socket: use per af lockdep classes for sk queues") sock queue locks now have per-af lockdep classes, including unix socket. It is no longer necessary to workaround it. I noticed this while looking at a syzbot deadlock report, this patch itself doesn't fix it (this is why I don't add Reported-by). Fixes: 581319c58600 ("net/socket: use per af lockdep classes for sk queues") Cc: Paolo Abeni Signed-off-by: Cong Wang --- net/unix/af_unix.c | 10 -- 1 file changed, 10 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 2d465bdeccbc..45971e173924 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -745,14 +745,6 @@ static struct proto unix_proto = { .obj_size = sizeof(struct unix_sock), }; -/* - * AF_UNIX sockets do not interact with hardware, hence they - * dont trigger interrupts - so it's safe for them to have - * bh-unsafe locking for their sk_receive_queue.lock. Split off - * this special lock-class by reinitializing the spinlock key: - */ -static struct lock_class_key af_unix_sk_receive_queue_lock_key; - static struct sock *unix_create1(struct net *net, struct socket *sock, int kern) { struct sock *sk = NULL; @@ -767,8 +759,6 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern) goto out; sock_init_data(sock, sk); - lockdep_set_class(&sk->sk_receive_queue.lock, - &af_unix_sk_receive_queue_lock_key); sk->sk_allocation = GFP_KERNEL_ACCOUNT; sk->sk_write_space = unix_write_space; -- 2.13.0
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
❦ 2 April 2018 10:33 -0700, Eric Dumazet: >> +static inline u32 >> +ip_vs_csh_permutation(struct ip_vs_dest *d, int j) >> +{ >> +u32 offset, skip; >> +__be32 addr_fold = d->addr.ip; >> + >> +#ifdef CONFIG_IP_VS_IPV6 >> +if (d->af == AF_INET6) >> +addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^ >> +d->addr.ip6[2]^d->addr.ip6[3]; >> +#endif >> +addr_fold = ntohl(addr_fold) + ntohs(d->port); >> +offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE; >> +skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1; >> +return (offset + j * skip) % IP_VS_CSH_TAB_SIZE; >> +} >> + > > This does not look very strong to me, particularly the IPv6 folding > > I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(), > even if it is hard coded ;) I can switch to ipv6_addr_hash(). However, switching to __ipv6_addr_jhash seems useless as I would need to hardcode the initial value: people use source hashing to get the same result from one host to another. Am I missing something? -- Each module should do one thing well. - The Elements of Programming Style (Kernighan & Plauger)
Re: [GIT] Networking
Sorry, this is a dup of the bug fix pull request from last week. I'll send you the right one.
Re: [bpf-next PATCH 4/4] bpf: sockmap, add hash map support
On 04/02/2018 08:54 AM, Alexei Starovoitov wrote: > On Sun, Apr 01, 2018 at 08:01:10AM -0700, John Fastabend wrote: >> Sockmap is currently backed by an array and enforces keys to be >> four bytes. This works well for many use cases and was originally >> modeled after devmap which also uses four bytes keys. However, >> this has become limiting in larger use cases where a hash would >> be more appropriate. For example users may want to use the 5-tuple >> of the socket as the lookup key. >> >> To support this add hash support. >> >> Signed-off-by: John Fastabend> > api looks good, but I think it came a bit too late for this release. > _nulls part you don't need for this hash. Few other nits: > Yeah no problem will push when {bpf|net}-next opens again. We also have some fields we need to access from sockmap programs as well such as ip addrs and ports. I'll respin the fixes (first two patches) for bpf-net. >> +static void htab_elem_free_rcu(struct rcu_head *head) >> +{ >> +struct htab_elem *l = container_of(head, struct htab_elem, rcu); >> + >> +/* must increment bpf_prog_active to avoid kprobe+bpf triggering while >> + * we're calling kfree, otherwise deadlock is possible if kprobes >> + * are placed somewhere inside of slub >> + */ >> +preempt_disable(); >> +__this_cpu_inc(bpf_prog_active); >> +kfree(l); >> +__this_cpu_dec(bpf_prog_active); >> +preempt_enable(); > > I don't think it's necessary. > Yep agreed. >> +static struct bpf_map *sock_hash_alloc(union bpf_attr *attr) >> +{ >> +struct bpf_htab *htab; >> +int i, err; >> +u64 cost; >> + >> +if (!capable(CAP_NET_ADMIN)) >> +return ERR_PTR(-EPERM); >> + >> +/* check sanity of attributes */ >> +if (attr->max_entries == 0 || >> +attr->map_flags & ~SOCK_CREATE_FLAG_MASK) >> +return ERR_PTR(-EINVAL); >> + >> +if (attr->value_size > KMALLOC_MAX_SIZE) >> +return ERR_PTR(-E2BIG); > > doesn't seem to match > + u32 fd = *(u32 *)value; > that is done later. > Yep. 
>> +static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, >> + u32 hash, void *key, u32 key_size) >> +{ >> +struct hlist_nulls_node *n; >> +struct htab_elem *l; >> + >> +hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) >> +if (l->hash == hash && !memcmp(&l->key, key, key_size)) >> +return l; > > if nulls is needed, there gotta be a comment explaining it. > Sure its not needed lets drop it. > please add tests for all methods. > I'll add these tests to test_maps. >> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c >> index f95fa67..2fa4cbb 100644 >> --- a/tools/bpf/bpftool/map.c >> +++ b/tools/bpf/bpftool/map.c >> @@ -67,6 +67,7 @@ >> [BPF_MAP_TYPE_DEVMAP] = "devmap", >> [BPF_MAP_TYPE_SOCKMAP] = "sockmap", >> [BPF_MAP_TYPE_CPUMAP] = "cpumap", >> +[BPF_MAP_TYPE_SOCKHASH] = "sockhash", >> }; >> >> static unsigned int get_possible_cpus(void) >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> index 9d07465..1a19450 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -115,6 +115,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_DEVMAP, >> BPF_MAP_TYPE_SOCKMAP, >> BPF_MAP_TYPE_CPUMAP, >> +BPF_MAP_TYPE_SOCKHASH, > > tools/* updates should be in separate commit. > Great, thanks for the review.
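As a side note on the 5-tuple use case mentioned above, a sockhash key would just be a small struct whose bytes are hashed. The sketch below is purely illustrative: the struct layout and the FNV-style hash are assumptions for the example, not the kernel's actual key format or hash function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 5-tuple key for a sockhash-style map. */
struct sock_key {
    uint32_t sip4;    /* source address */
    uint32_t dip4;    /* destination address */
    uint16_t sport;   /* source port */
    uint16_t dport;   /* destination port */
    uint8_t  proto;   /* IPPROTO_TCP / IPPROTO_UDP */
};

/* Toy FNV-1a-style mix over the fields.  Fields are mixed explicitly
 * rather than hashing raw struct bytes, so compiler-inserted padding
 * cannot leak into the result. */
static uint32_t key_hash(const struct sock_key *k)
{
    uint32_t h = 2166136261u;
    uint32_t parts[4] = {
        k->sip4, k->dip4,
        ((uint32_t)k->sport << 16) | k->dport,
        k->proto
    };

    for (int i = 0; i < 4; i++) {
        h ^= parts[i];
        h *= 16777619u;
    }
    return h;
}

/* Helper for exercising the hash with one varying field. */
static uint32_t hash_example(uint16_t dport)
{
    struct sock_key k = { .sip4 = 0x0a000001, .dip4 = 0x0a000002,
                          .sport = 40000, .dport = dport, .proto = 6 };
    return key_hash(&k);
}
```

The point is only that the key is an opaque fixed-size blob from the map's perspective; the map code hashes whatever bytes the program supplies.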
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
On 04/02/2018 10:20 AM, Vincent Bernat wrote: > +static inline u32 > +ip_vs_csh_permutation(struct ip_vs_dest *d, int j) > +{ > + u32 offset, skip; > + __be32 addr_fold = d->addr.ip; > + > +#ifdef CONFIG_IP_VS_IPV6 > + if (d->af == AF_INET6) > + addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^ > + d->addr.ip6[2]^d->addr.ip6[3]; > +#endif > + addr_fold = ntohl(addr_fold) + ntohs(d->port); > + offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE; > + skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1; > + return (offset + j * skip) % IP_VS_CSH_TAB_SIZE; > +} > + This does not look very strong to me, particularly the IPv6 folding I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(), even if it is hard coded ;)
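Eric's objection to the folding can be made concrete: xor-ing the four 32-bit words of an IPv6 address together collides for any two addresses whose words are permutations of each other. A minimal demonstration (userspace sketch; the word values are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

/* Fold an IPv6 address the way the patch does: xor of its four words.
 * Swapping any two words leaves the fold unchanged, so distinct
 * addresses trivially share the same hash input. */
static uint32_t ipv6_fold(const uint32_t w[4])
{
    return w[0] ^ w[1] ^ w[2] ^ w[3];
}

/* Two distinct addresses differing only by a swap of words 1 and 2. */
static const uint32_t addr_a[4] = { 0x20010db8u, 0x00000001u,
                                    0x00000002u, 0x00000003u };
static const uint32_t addr_b[4] = { 0x20010db8u, 0x00000002u,
                                    0x00000001u, 0x00000003u };
```

A jhash-style mix over all 128 bits, as Eric suggests, does not have this structural collision.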
[PATCH net-next v1] ipvs: add consistent source hashing scheduling
Based on Google's Maglev algorithm [1][2], this scheduler builds a lookup table in a way that minimizes disruption when a change occurs. This helps in case of an active/active setup without synchronization. Like for classic source hashing, this lookup table is used to assign connections to a real server. Both source address and port are used to compute the hash (unlike sh, where this is optional). Weights are correctly handled. Unlike sh, servers with a weight of 0 are considered absent. Also, unlike sh, when a server becomes unavailable due to a threshold, no fallback is possible: doing so would seriously impair the usefulness of using a consistent hash. There is a small hack to detect when all real servers have a weight of 0. It relies on the fact that it is not possible for the weight of a real server to change during the execution of the assignment. I believe this is the case as modifications through netlink are subject to a mutex, but the use of atomic_read() is unsettling. The value of 65537 for the hash table size is hardcoded and can currently only be changed at compile time. This is the value suggested in the Maglev paper. Another possible value is 257 (for small tests) and 655373 (for very large setups).
[1]: https://research.google.com/pubs/pub44824.html [2]: https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/ Signed-off-by: Vincent Bernat--- include/net/ip_vs.h| 27 net/netfilter/ipvs/Kconfig | 13 ++ net/netfilter/ipvs/Makefile| 1 + net/netfilter/ipvs/ip_vs_csh.c | 339 + net/netfilter/ipvs/ip_vs_sh.c | 32 +--- 5 files changed, 381 insertions(+), 31 deletions(-) create mode 100644 net/netfilter/ipvs/ip_vs_csh.c diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h index eb0bec043c96..2184b43b7320 100644 --- a/include/net/ip_vs.h +++ b/include/net/ip_vs.h @@ -195,6 +195,33 @@ static inline int ip_vs_addr_equal(int af, const union nf_inet_addr *a, return a->ip == b->ip; } +static inline __be16 ip_vs_get_port(const struct sk_buff *skb, struct ip_vs_iphdr *iph) +{ + __be16 _ports[2], *ports; + + /* At this point we know that we have a valid packet of some kind. +* Because ICMP packets are only guaranteed to have the first 8 +* bytes, let's just grab the ports. Fortunately they're in the +* same position for all three of the protocols we care about. +*/ + switch (iph->protocol) { + case IPPROTO_TCP: + case IPPROTO_UDP: + case IPPROTO_SCTP: + ports = skb_header_pointer(skb, iph->len, sizeof(_ports), + &_ports); + if (unlikely(!ports)) + return 0; + + if (likely(!ip_vs_iph_inverse(iph))) + return ports[0]; + else + return ports[1]; + default: + return 0; + } +} + #ifdef CONFIG_IP_VS_DEBUG #include diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig index b32fb0dbe237..bfd091e020af 100644 --- a/net/netfilter/ipvs/Kconfig +++ b/net/netfilter/ipvs/Kconfig @@ -225,6 +225,19 @@ config IP_VS_SH If you want to compile it in kernel, say Y. To compile it as a module, choose M here. If unsure, say N. 
+config IP_VS_CSH + tristate "consistent source hashing scheduling" + ---help--- + The consistent source hashing scheduling algorithm assigns + network connections to the servers through looking up a + statically assigned hash table by their source IP addresses + and ports. The hash table is built to minimize disruption + when a server becomes unavailable. It relies on the Maglev + algorithm. + + If you want to compile it in kernel, say Y. To compile it as a + module, choose M here. If unsure, say N. + config IP_VS_SED tristate "shortest expected delay scheduling" ---help--- diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile index c552993fa4b9..7d0badf86dfe 100644 --- a/net/netfilter/ipvs/Makefile +++ b/net/netfilter/ipvs/Makefile @@ -33,6 +33,7 @@ obj-$(CONFIG_IP_VS_LBLC) += ip_vs_lblc.o obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o +obj-$(CONFIG_IP_VS_CSH) += ip_vs_csh.o obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o diff --git a/net/netfilter/ipvs/ip_vs_csh.c b/net/netfilter/ipvs/ip_vs_csh.c new file mode 100644 index ..c70db61ba934 --- /dev/null +++ b/net/netfilter/ipvs/ip_vs_csh.c @@ -0,0 +1,339 @@ +/* + * IPVS:Consistent Hashing scheduling module using Google's Maglev + * + * Authors: Vincent Bernat + * + * This program is free software; you can redistribute it and/or + * modify it under
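For readers unfamiliar with Maglev, the table-population loop the patch description refers to can be sketched in a few lines of userspace C. This is a toy model only: the table size is 13 instead of 65537, there are three servers identified by a plain integer, and `toy_hash()` stands in for the kernel's hash_32() over the folded address and port.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define M    13   /* small prime for the sketch; the patch uses 65537 */
#define NSRV 3    /* number of real servers in the toy setup */

/* Any reasonable integer hash works for illustrating the loop. */
static uint32_t toy_hash(uint32_t x)
{
    x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16;
    return x;
}

/* perm(i, j) as in ip_vs_csh_permutation(): offset + j * skip (mod M).
 * Because M is prime and skip is in [1, M-1], each server's sequence
 * visits every slot. */
static uint32_t permutation(uint32_t id, int j)
{
    uint32_t offset = toy_hash(id) % M;
    uint32_t skip = toy_hash(id + 1) % (M - 1) + 1;
    return (offset + (uint32_t)j * skip) % M;
}

/* Maglev population: round-robin over servers, each claims the next
 * free slot of its own permutation, until every slot is owned. */
static void populate(int table[M])
{
    int next[NSRV] = {0};
    int filled = 0;

    memset(table, -1, M * sizeof(int));
    while (filled < M) {
        for (uint32_t i = 0; i < NSRV && filled < M; i++) {
            uint32_t c = permutation(i, next[i]);

            while (table[c] >= 0)
                c = permutation(i, ++next[i]);
            table[c] = (int)i;
            next[i]++;
            filled++;
        }
    }
}

/* Convenience for checking: how many slots does server s own? */
static int slots_owned(int s)
{
    static int table[M];
    static int done;
    int n = 0;

    if (!done) { populate(table); done = 1; }
    for (int k = 0; k < M; k++)
        if (table[k] == s)
            n++;
    return n;
}
```

The round-robin claim order is what gives the near-equal split (here 5/4/4 across three servers) and, with weights, the proportional split the patch description mentions.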
Re: [PATCH net 0/2] net: bgmac: Couple of sparse warnings
From: Florian Fainelli Date: Mon, 2 Apr 2018 10:17:36 -0700 > On 04/01/2018 07:21 PM, David Miller wrote: >> From: Florian Fainelli >> Date: Sun, 1 Apr 2018 10:26:28 -0700 >> >>> This patch series fixes a couple of warnings reported by sparse, should not >>> cause any functional problems since bgmac is typically used on LE platforms >>> anyway. >> >> Series applied, thanks Florian. > > I did not see this in net/master or included in your recent pull request > to Linus, not a big deal, but I just want to make sure this is not lost > somewhere ;) All changes go into net-next as I prepare for a merge window pull request.
Re: [PATCH net 0/2] net: bgmac: Couple of sparse warnings
On 04/01/2018 07:21 PM, David Miller wrote: > From: Florian Fainelli> Date: Sun, 1 Apr 2018 10:26:28 -0700 > >> This patch series fixes a couple of warnings reported by sparse, should not >> cause any functional problems since bgmac is typically used on LE platforms >> anyway. > > Series applied, thanks Florian. I did not see this in net/master or included in your recent pull request to Linus, not a big deal, but I just want to make sure this is not lost somewhere ;) -- Florian
[ANNOUNCE] iproute 4.16
Release of iproute2 for Linux 4.16

Latest version of the iproute2 utility, supporting new features in Linux 4.16. This release covers a wide range of small changes: lots of changes to bpf, vrf, devlink, flower, and rdma support, plus more changes to ss and JSON support enhancements.

The tarball can be downloaded from: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.16.0.tar.gz

The upstream repositories for the master and net-next branches are now split. The master branch is at: git://git.kernel.org/pub/scm/network/iproute2/iproute2.git and patches for the next release are in (master branch): git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git There are symlinks so that old paths still work.

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

---

Adam Vyskovsky (1):
      tc: fix an off-by-one error while printing tc actions

Alexander Alemayhu (4):
      man: add examples to ip.8
      man: fix man page warnings
      tc: bpf: add ppc64 and sparc64 to list of archs with eBPF support
      examples/bpf: update list of examples

Alexander Aring (5):
      tc: m_ife: allow ife type to zero
      tc: m_ife: print IEEE ethertype format
      tc: m_ife: report about kernels default type
      man: tc-ife: add default type note
      tc: m_ife: fix match tcindex parsing

Alexander Heinlein (1):
      ip/xfrm: Fix deleteall when having many policies installed

Alexander Zubkov (5):
      iproute: list/flush/save filter also by metric
      iproute: "list/flush/save default" selected all of the routes
      treat "default" and "all"/"any" addresses differenty
      treat "default" and "all"/"any" addresses differenty
      arrange prefix parsing code after redundant patches

Alexey Kodanev (1):
      fix typo in ip-xfrm man page, rmd610 -> rmd160

Amir Vadai (14):
      libnetlink: Introduce rta_getattr_be*()
      tc/cls_flower: Classify packet in ip tunnels
      tc/act_tunnel: Introduce ip tunnel action
      tc/pedit: Fix a typo in pedit usage message
      tc/pedit: Extend pedit to specify offset relative to mac/transport headers
      tc/pedit: Introduce 'add' operation
      tc/pedit: p_ip: introduce editing ttl header
      tc/pedit: Support fields bigger than 32 bits
      tc/pedit: p_eth: ETH header editor
      tc/pedit: p_tcp: introduce pedit tcp support
      pedit: Fix a typo in warning
      pedit: Do not allow using retain for too big fields
      pedit: Check for extended capability in protocol parser
      pedit: Introduce ipv6 support

Amritha Nambiar (4):
      tc/mqprio: Offload mode and shaper options in mqprio
      flower: Represent HW traffic classes as classid values
      man: tc-mqprio: add documentation for new offload options
      man: tc-flower: add explanation for hw_tc option

Andreas Henriksson (1):
      ss: fix help/man TCP-STATE description for listening

Antonio Quartulli (2):
      ss: fix crash when skipping disabled header field
      ss: fix NULL pointer access when parsing unix sockets with oldformat

Arkadi Sharshevsky (15):
      devlink: Change netlink attribute validation
      devlink: Add support for pipeline debug (dpipe)
      bridge: Distinguish between externally learned vs offloaded FDBs
      devlink: Make match/action parsing more flexible
      devlink: Add support for special format protocol headers
      devlink: Add support for protocol IPv4/IPv6/Ethernet special formats
      devlink: Ignore unknown attributes
      devlink: Change empty line indication with indentations
      devlink: mnlg: Add support for extended ack
      devlink: Add support for devlink resource abstraction
      devlink: Add support for hot reload
      devlink: Move dpipe context from heap to stack
      devlink: Add support for resource/dpipe relation
      devlink: Update man pages and add resource man
      devlink: Fix error reporting

Asbjørn Sloth Tønnesen (2):
      testsuite: refactor kernel config search
      testsuite: search for kernel config in /boot

Baruch Siach (3):
      tc: add missing limits.h header
      ip: include libc headers first
      lib: fix multiple strlcpy definition

Benjamin LaHaise (2):
      f_flower: don't set TCA_FLOWER_KEY_ETH_TYPE for "protocol all"
      tc: flower: support for matching MPLS labels

Boris Pismenny (1):
      ip xfrm: Add xfrm state crypto offload

Casey Callendrello (1):
      netns: make /var/run/netns bind-mount recursive

Chris Mi (3):
      tc: fix command "tc actions del" hang issue
      lib/libnetlink: Add a new function rtnl_talk_iov
      tc: Add batchsize feature for filter and actions

Christian Brauner (1):
      netns: allow negative nsid

Christian Ehrhardt (2):
      tests: read limited amount from /dev/urandom
      tests: make sure rand_dev suffix has 6 chars

Christoph Paasch (1):
      ip: add fastopen_no_cookie option to ip route

Craig Gallek (2):
      gre6: fix copy/paste bugs in GREv6 attribute manipulation
      iplink: Expose IFLA_*_FWMARK attributes
Re: SO_TCP_NODELAY implementation in TCP stack
On Sun, Apr 1, 2018 at 4:06 AM Naruto Nguyen wrote: > Hello everyone, > As I know we have a socket option SO_TCP_NODELAY to disable Nagle > Algorithm, and I found it is implemented in TCP/IP stack at > https://elixir.bootlin.com/linux/v4.4.90/source/net/ipv4/tcp.c#L2401 . > However, I do not know where the source code the Nagle Algorithm is > implemented in kernel. If you know, could you please help me? The Linux TCP code for Nagle's algorithm is largely in tcp_output.c in the following functions: tcp_nagle_check(), tcp_minshall_check(), tcp_minshall_update() The variant of the Nagle algorithm that Linux implements is described here: https://tools.ietf.org/html/draft-minshall-nagle-00 Basically, the algorithm avoids having more than one outstanding unacknowledged packet that is less than one MSS in size. The Nagle algorithm is disabled using the TCP_NODELAY socket option. neal
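The send decision Neal describes can be condensed into a small predicate. This is a simplified model only, not the kernel's actual code path, which tracks the outstanding small segment via sequence numbers in tcp_minshall_check():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minshall-Nagle send test, modeled as a pure predicate: a sub-MSS
 * segment may go out only if no other sub-MSS segment is still
 * unacknowledged, unless Nagle is disabled via TCP_NODELAY. */
static bool nagle_allows_send(size_t payload, size_t mss,
                              bool small_unacked, bool nodelay)
{
    if (nodelay)
        return true;        /* TCP_NODELAY disables the algorithm */
    if (payload >= mss)
        return true;        /* full-sized segments always go out */
    return !small_unacked;  /* at most one small segment in flight */
}
```

The effect is that many small application writes coalesce into one segment per round-trip instead of one segment per write.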
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On Sun, 1 Apr 2018 20:47:28 -0400 "Md. Islam" wrote: > This patch implements IPv4 forwarding on xdp_buff. I added a new > config option XDP_ROUTER. Kernel would forward packets through fast > path when this option is enabled. But it would require driver support. > Currently it only works with veth. Here I have modified veth such that > it outputs xdp_buff. I created a testbed in Mininet. The Mininet > script (topology.py) is attached. Here the topology is: Having XDP routing would be great. The solution you have chosen, changing each driver, does not scale well, since it requires lots of changes and is not ready for upstream.
Re: [net-next 2/2] net: netcp: ethss: k2g: add promiscuous mode support
On Mon, Apr 02, 2018 at 12:17:19PM -0400, Murali Karicheri wrote: > +static int gbe_set_rx_mode(void *intf_priv, bool promisc) > +{ > + struct gbe_intf *gbe_intf = intf_priv; > + struct gbe_priv *gbe_dev = gbe_intf->gbe_dev; > + struct cpsw_ale *ale = gbe_dev->ale; > + unsigned long timeout; > + int i, ret = -ETIMEDOUT; > + > + /* Disable(1)/Enable(0) Learn for all ports (host is port 0 and > + * slaves are port 1 and up > + */ > + for (i = 0; i <= gbe_dev->num_slaves; i++) { > + cpsw_ale_control_set(ale, i, > + ALE_PORT_NOLEARN, !!promisc); > + cpsw_ale_control_set(ale, i, > + ALE_PORT_NO_SA_UPDATE, !!promisc); > + } Hi Murali Does this mean that in promisc mode, switching of frames between ports in hardware is disabled? You are relying on the software bridge to perform such bridging between ports? You might want to look at skb->offload_fwd_mark. By setting this, you can tell the software bridge the hardware has already bridged the frame. You might then be able to have promisc enabled, and the hardware still doing the forwarding. Andrew
Re: [PATCH 12/15] dmaengine: pxa: make the filter function internal
Hi Robert,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.16]
[cannot apply to arm-soc/for-next next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Robert-Jarzmik/ARM-pxa-switch-to-DMA-slave-maps/20180402-233029
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__

sparse warnings: (new ones prefixed by >>)

   drivers/mtd/nand/marvell_nand.c:2621:17: sparse: undefined identifier 'pxad_filter_fn'
>> drivers/mtd/nand/marvell_nand.c:2621:17: sparse: call with no type!
   In file included from drivers/mtd/nand/marvell_nand.c:21:0:
   drivers/mtd/nand/marvell_nand.c: In function 'marvell_nfc_init_dma':
   drivers/mtd/nand/marvell_nand.c:2621:42: error: 'pxad_filter_fn' undeclared (first use in this function); did you mean 'dma_filter_fn'?
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^
   drivers/mtd/nand/marvell_nand.c:2621:42: note: each undeclared identifier is reported only once for each function it appears in
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^

vim +2621 drivers/mtd/nand/marvell_nand.c

02f26ecf Miquel Raynal 2018-01-09  2588
02f26ecf Miquel Raynal 2018-01-09  2589  static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
02f26ecf Miquel Raynal 2018-01-09  2590  {
02f26ecf Miquel Raynal 2018-01-09  2591  	struct platform_device *pdev = container_of(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2592  						    struct platform_device,
02f26ecf Miquel Raynal 2018-01-09  2593  						    dev);
02f26ecf Miquel Raynal 2018-01-09  2594  	struct dma_slave_config config = {};
02f26ecf Miquel Raynal 2018-01-09  2595  	struct resource *r;
02f26ecf Miquel Raynal 2018-01-09  2596  	dma_cap_mask_t mask;
02f26ecf Miquel Raynal 2018-01-09  2597  	struct pxad_param param;
02f26ecf Miquel Raynal 2018-01-09  2598  	int ret;
02f26ecf Miquel Raynal 2018-01-09  2599
02f26ecf Miquel Raynal 2018-01-09  2600  	if (!IS_ENABLED(CONFIG_PXA_DMA)) {
02f26ecf Miquel Raynal 2018-01-09  2601  		dev_warn(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2602  			 "DMA not enabled in configuration\n");
02f26ecf Miquel Raynal 2018-01-09  2603  		return -ENOTSUPP;
02f26ecf Miquel Raynal 2018-01-09  2604  	}
02f26ecf Miquel Raynal 2018-01-09  2605
02f26ecf Miquel Raynal 2018-01-09  2606  	ret = dma_set_mask_and_coherent(nfc->dev, DMA_BIT_MASK(32));
02f26ecf Miquel Raynal 2018-01-09  2607  	if (ret)
02f26ecf Miquel Raynal 2018-01-09  2608  		return ret;
02f26ecf Miquel Raynal 2018-01-09  2609
02f26ecf Miquel Raynal 2018-01-09  2610  	r = platform_get_resource(pdev, IORESOURCE_DMA, 0);
02f26ecf Miquel Raynal 2018-01-09  2611  	if (!r) {
02f26ecf Miquel Raynal 2018-01-09  2612  		dev_err(nfc->dev, "No resource defined for data DMA\n");
02f26ecf Miquel Raynal 2018-01-09  2613  		return -ENXIO;
02f26ecf Miquel Raynal 2018-01-09  2614  	}
02f26ecf Miquel Raynal 2018-01-09  2615
02f26ecf Miquel Raynal 2018-01-09  2616  	param.drcmr = r->start;
02f26ecf Miquel Raynal 2018-01-09  2617  	param.prio = PXAD_PRIO_LOWEST;
02f26ecf Miquel Raynal 2018-01-09  2618  	dma_cap_zero(mask);
02f26ecf Miquel Raynal 2018-01-09  2619  	dma_cap_set(DMA_SLAVE, mask);
02f26ecf Miquel Raynal 2018-01-09  2620  	nfc->dma_chan =
02f26ecf Miquel Raynal 2018-01-09 @2621  		dma_request_slave_channel_compat(mask, pxad_filter_fn,
02f26ecf Miquel Raynal 2018-01-09  2622  						 &param, nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2623  						 "data");
02f26ecf Miquel Raynal 2018-01-09  2624  	if (!nfc->dma_chan) {
02f26ecf Miquel Raynal 2018-01-09  2625  		dev_err(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2626  			"Unable to request data DMA channel\n");
02f26ecf Miquel Raynal 2018-01-09  2627  		return -ENODEV;
[GIT] Networking
1) Fix RCU locking in xfrm_local_error(), from Taehee Yoo.

2) Fix return value assignments and thus error checking in iwl_mvm_start_ap_ibss(), from Johannes Berg.

3) Don't count header length twice in vti4, from Stefano Brivio.

4) Fix deadlock in rt6_age_examine_exception, from Eric Dumazet.

5) Fix out-of-bounds access in nf_sk_lookup_slow{v4,v6}() from Subash Abhinov.

6) Check nladdr size in netlink_connect(), from Alexander Potapenko.

7) VF representor SQ numbers are 32 not 16 bits, in mlx5 driver, from Or Gerlitz.

8) Out of bounds read in skb_network_protocol(), from Eric Dumazet.

9) r8169 driver sets driver data pointer after register_netdev(), which is too late. Fix from Heiner Kallweit.

10) Fix memory leak in mlx4 driver, from Moshe Shemesh.

11) The multi-VLAN decap fix added a regression when dealing with devices that lack a MAC header, such as tun. Fix from Toshiaki Makita.

12) Fix integer overflow in dynamic interrupt coalescing code. From Tal Gilboa.

13) Use after free in vrf code, from David Ahern.

14) IPV6 route leak between VRFs fix, also from David Ahern.

Please pull, thanks a lot!
The following changes since commit f36b7534b83357cf52e747905de6d65b4f7c2512:

  Merge branch 'akpm' (patches from Andrew) (2018-03-22 18:48:43 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

for you to fetch changes up to e81b5e01c14add8395dfba7130f8829206bb507d:

  net: mvneta: fix enable of all initialized RXQs (2018-03-30 14:27:47 -0400)

Alexander Potapenko (1):
      netlink: make sure nladdr has correct size in netlink_connect()

Andrei Otcheretianski (2):
      iwlwifi: mvm: Increase session protection time after CS
      iwlwifi: mvm: Move unused phy's to a default channel

Avraham Stern (3):
      iwlwifi: mvm: clear tx queue id when unreserving aggregation queue
      iwlwifi: mvm: make sure internal station has a valid id
      iwlwifi: mvm: fix array out of bounds reference

Beni Lev (1):
      iwlwifi: mvm: Correctly set IGTK for AP

Colin Ian King (1):
      atm: iphase: fix spelling mistake: "Receiverd" -> "Received"

Cong Wang (1):
      llc: properly handle dev_queue_xmit() return value

Craig Dillabaugh (1):
      net sched actions: fix dumping which requires several messages to user space

Dave Watson (1):
      strparser: Fix sign of err codes

David Ahern (2):
      vrf: Fix use after free and double free in vrf_finish_output
      net/ipv6: Fix route leaking between VRFs

David Lebrun (1):
      ipv6: sr: fix seg6 encap performances with TSO enabled

David S. Miller (11):
      Merge branch 'mlxsw-GRE-mtu-changes'
      Merge git://git.kernel.org/.../pablo/nf
      Merge tag 'wireless-drivers-for-davem-2018-03-24' of git://git.kernel.org/.../kvalo/wireless-drivers
      Merge branch 'bond-hwaddr-sync-fixes'
      Merge tag 'batadv-net-for-davem-20180326' of git://git.open-mesh.org/linux-merge
      Merge tag 'mlx5-fixes-2018-03-23' of git://git.kernel.org/.../saeed/linux
      Merge branch 'mlx4-misc-fixes-for-4.16'
      Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
      ip_tunnel: Resolve ipsec merge conflict properly.
      Merge git://git.kernel.org/.../bpf/bpf
      Merge branch 'vlan-fix'

Emmanuel Grumbach (1):
      iwlwifi: mvm: set the correct tid when we flush the MCAST sta

Eran Ben Elisha (1):
      net/mlx4_en: Fix mixed PFC and Global pause user control requests

Eric Dumazet (2):
      ipv6: fix possible deadlock in rt6_age_examine_exception()
      net: fix possible out-of-bound read in skb_network_protocol()

Florian Westphal (3):
      netfilter: nf_tables: meter: pick a set backend that supports updates
      netfilter: nf_tables: permit second nat hook if colliding hook is going away
      netfilter: nf_tables: add missing netlink attrs to policies

Giuseppe Lippolis (1):
      net-usb: add qmi_wwan if on lte modem wistron neweb d18q1

Hans Wippel (1):
      net/ipv4: disable SMC TCP option with SYN Cookies

Heiner Kallweit (1):
      r8169: fix setting driver_data after register_netdev

Jakub Kicinski (2):
      tools: bpftool: don't use hex numbers in JSON output
      nfp: bpf: fix check of program max insn count

Jason Wang (3):
      vhost_net: add missing lock nesting notation
      vhost: correctly remove wait queue during poll failure
      vhost: validate log when IOTLB is enabled

Jianbo Liu (2):
      net/mlx5e: Don't override vport admin link state in switchdev mode
      net/mlx5e: Fix memory usage issues in offloading TC flows

Johannes Berg (1):
      iwlwifi: mvm: fix error checking for multi/broadcast sta

John Fastabend (1):
      net: sched, fix OOO packets with pfifo_fast

Kalle Valo (2):
      Merge tag 'iwlwifi-for-kalle-2018-03-16' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes
      Merge tag 'iwlwifi-for-kalle-2018-03-19' of
Re: [net-next 0/2] Add promiscous mode support in k2g network driver
From: Murali Karicheri
Date: Mon, 2 Apr 2018 12:17:17 -0400

> This patch adds support for promiscuous mode in network driver for K2G
> SoC. This depends on v3 of my series at
> https://www.spinics.net/lists/kernel/msg2765942.html

The net-next tree is closed, please resubmit this series after the merge
window when the net-next tree is opened back up.

Thank you.
Re: [PATCH 12/15] dmaengine: pxa: make the filter function internal
Hi Robert,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16]
[cannot apply to arm-soc/for-next next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Robert-Jarzmik/ARM-pxa-switch-to-DMA-slave-maps/20180402-233029
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   In file included from drivers/mtd/nand/marvell_nand.c:21:0:
   drivers/mtd/nand/marvell_nand.c: In function 'marvell_nfc_init_dma':
>> drivers/mtd/nand/marvell_nand.c:2621:42: error: 'pxad_filter_fn' undeclared (first use in this function); did you mean 'dma_filter_fn'?
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^
   drivers/mtd/nand/marvell_nand.c:2621:42: note: each undeclared identifier is reported only once for each function it appears in
      dma_request_slave_channel_compat(mask, pxad_filter_fn,
                                             ^
   include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
     __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
                                                 ^

vim +2621 drivers/mtd/nand/marvell_nand.c

02f26ecf Miquel Raynal 2018-01-09  2588
02f26ecf Miquel Raynal 2018-01-09  2589  static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
02f26ecf Miquel Raynal 2018-01-09  2590  {
02f26ecf Miquel Raynal 2018-01-09  2591  	struct platform_device *pdev = container_of(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2592  						    struct platform_device,
02f26ecf Miquel Raynal 2018-01-09  2593  						    dev);
02f26ecf Miquel Raynal 2018-01-09  2594  	struct dma_slave_config config = {};
02f26ecf Miquel Raynal 2018-01-09  2595  	struct resource *r;
02f26ecf Miquel Raynal 2018-01-09  2596  	dma_cap_mask_t mask;
02f26ecf Miquel Raynal 2018-01-09  2597  	struct pxad_param param;
02f26ecf Miquel Raynal 2018-01-09  2598  	int ret;
02f26ecf Miquel Raynal 2018-01-09  2599
02f26ecf Miquel Raynal 2018-01-09  2600  	if (!IS_ENABLED(CONFIG_PXA_DMA)) {
02f26ecf Miquel Raynal 2018-01-09  2601  		dev_warn(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2602  			 "DMA not enabled in configuration\n");
02f26ecf Miquel Raynal 2018-01-09  2603  		return -ENOTSUPP;
02f26ecf Miquel Raynal 2018-01-09  2604  	}
02f26ecf Miquel Raynal 2018-01-09  2605
02f26ecf Miquel Raynal 2018-01-09  2606  	ret = dma_set_mask_and_coherent(nfc->dev, DMA_BIT_MASK(32));
02f26ecf Miquel Raynal 2018-01-09  2607  	if (ret)
02f26ecf Miquel Raynal 2018-01-09  2608  		return ret;
02f26ecf Miquel Raynal 2018-01-09  2609
02f26ecf Miquel Raynal 2018-01-09  2610  	r = platform_get_resource(pdev, IORESOURCE_DMA, 0);
02f26ecf Miquel Raynal 2018-01-09  2611  	if (!r) {
02f26ecf Miquel Raynal 2018-01-09  2612  		dev_err(nfc->dev, "No resource defined for data DMA\n");
02f26ecf Miquel Raynal 2018-01-09  2613  		return -ENXIO;
02f26ecf Miquel Raynal 2018-01-09  2614  	}
02f26ecf Miquel Raynal 2018-01-09  2615
02f26ecf Miquel Raynal 2018-01-09  2616  	param.drcmr = r->start;
02f26ecf Miquel Raynal 2018-01-09  2617  	param.prio = PXAD_PRIO_LOWEST;
02f26ecf Miquel Raynal 2018-01-09  2618  	dma_cap_zero(mask);
02f26ecf Miquel Raynal 2018-01-09  2619  	dma_cap_set(DMA_SLAVE, mask);
02f26ecf Miquel Raynal 2018-01-09  2620  	nfc->dma_chan =
02f26ecf Miquel Raynal 2018-01-09 @2621  		dma_request_slave_channel_compat(mask, pxad_filter_fn,
02f26ecf Miquel Raynal 2018-01-09  2622  						 &param, nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2623  						 "data");
02f26ecf Miquel Raynal 2018-01-09  2624  	if (!nfc->dma_chan) {
02f26ecf Miquel Raynal 2018-01-09  2625  		dev_err(nfc->dev,
02f26ecf Miquel Raynal 2018-01-09  2626  			"Unable to request data DMA channel\n");
02f26ecf Miquel Raynal 2018-01-09  2627  		return -ENODEV;
02f26ecf Miquel Raynal 2018-01-09  2628  	}
02f26ecf Miquel Raynal 2018-01-09  2629
02f26ecf Miquel Raynal 2018-01-09  2630
RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
> -----Original Message-----
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, March 30, 2018 2:05 PM
> To: Tal Gilboa
> Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
>
> From: Tal Gilboa
>
> Add pcie_print_link_status().  This logs the current settings of the link
> (speed, width, and total available bandwidth).
>
> If the device is capable of more bandwidth but is limited by a slower
> upstream link, we include information about the link that limits the
> device's performance.
>
> The user may be able to move the device to a different slot for better
> performance.
>
> This provides a unified method for all PCI devices to report status and
> issues, instead of each device reporting in a different way, using
> different code.
>
> Signed-off-by: Tal Gilboa
> [bhelgaas: changelog, reword log messages, print device capabilities when
> not limited]
> Signed-off-by: Bjorn Helgaas
> ---
>  drivers/pci/pci.c   | 29 +
>  include/linux/pci.h |  1 +
>  2 files changed, 30 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e00d56b12747..cec7aed09f6b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
>  	return *width * PCIE_SPEED2MBS_ENC(*speed);
>  }
>
> +/**
> + * pcie_print_link_status - Report the PCI device's link speed and width
> + * @dev: PCI device to query
> + *
> + * Report the available bandwidth at the device.  If this is less than the
> + * device is capable of, report the device's maximum possible bandwidth and
> + * the upstream link that limits its performance to less than that.
> + */
> +void pcie_print_link_status(struct pci_dev *dev)
> +{
> +	enum pcie_link_width width, width_cap;
> +	enum pci_bus_speed speed, speed_cap;
> +	struct pci_dev *limiting_dev = NULL;
> +	u32 bw_avail, bw_cap;
> +
> +	bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> +	bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
> +
> +	if (bw_avail >= bw_cap)
> +		pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> +	else
> +		pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n",
> +			 bw_avail, PCIE_SPEED2STR(speed), width,
> +			 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> +}

Personally, I would make this last one a pci_warn() to indicate it at a
higher log level, but I'm ok with the wording, and if consensus is that
this should be at info, I'm ok with that.

Thanks,
Jake

> +EXPORT_SYMBOL(pcie_print_link_status);
> +
>  /**
>   * pci_select_bars - Make BAR mask from the type of resource
>   * @dev: the PCI device for which BAR mask is made
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index f2bf2b7a66c7..38f7957121ef 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1086,6 +1086,7 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
>  u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
>  			     enum pci_bus_speed *speed,
>  			     enum pcie_link_width *width);
> +void pcie_print_link_status(struct pci_dev *dev);
>  void pcie_flr(struct pci_dev *dev);
>  int __pci_reset_function_locked(struct pci_dev *dev);
>  int pci_reset_function(struct pci_dev *dev);
[net-next 2/2] net: netcp: ethss: k2g: add promiscuous mode support
From: WingMan Kwok

This patch adds support for promiscuous mode in k2g's network driver.
When the upper layer instructs to transition from non-promiscuous mode to
promiscuous mode or vice versa, the K2G network driver needs to configure
the ALE accordingly so that in non-promiscuous mode the ALE will not
flood all unicast packets to the host port, while in promiscuous mode it
will pass all received unicast packets to the host port.

Signed-off-by: WingMan Kwok
Signed-off-by: Murali Karicheri
---
 drivers/net/ethernet/ti/netcp_ethss.c | 56 +++
 1 file changed, 56 insertions(+)

diff --git a/drivers/net/ethernet/ti/netcp_ethss.c b/drivers/net/ethernet/ti/netcp_ethss.c
index f7af999..1ac2cd6 100644
--- a/drivers/net/ethernet/ti/netcp_ethss.c
+++ b/drivers/net/ethernet/ti/netcp_ethss.c
@@ -2771,6 +2771,61 @@ static inline int gbe_hwtstamp_set(struct gbe_intf *gbe_intf, struct ifreq *req)
 }
 #endif /* CONFIG_TI_CPTS */
 
+static int gbe_set_rx_mode(void *intf_priv, bool promisc)
+{
+	struct gbe_intf *gbe_intf = intf_priv;
+	struct gbe_priv *gbe_dev = gbe_intf->gbe_dev;
+	struct cpsw_ale *ale = gbe_dev->ale;
+	unsigned long timeout;
+	int i, ret = -ETIMEDOUT;
+
+	/* Disable(1)/Enable(0) Learn for all ports (host is port 0 and
+	 * slaves are port 1 and up
+	 */
+	for (i = 0; i <= gbe_dev->num_slaves; i++) {
+		cpsw_ale_control_set(ale, i,
+				     ALE_PORT_NOLEARN, !!promisc);
+		cpsw_ale_control_set(ale, i,
+				     ALE_PORT_NO_SA_UPDATE, !!promisc);
+	}
+
+	if (!promisc) {
+		/* Don't Flood All Unicast Packets to Host port */
+		cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 0);
+		dev_vdbg(gbe_dev->dev, "promiscuous mode disabled\n");
+		return 0;
+	}
+
+	timeout = jiffies + HZ;
+
+	/* Clear All Untouched entries */
+	cpsw_ale_control_set(ale, 0, ALE_AGEOUT, 1);
+	do {
+		cpu_relax();
+		if (cpsw_ale_control_get(ale, 0, ALE_AGEOUT)) {
+			ret = 0;
+			break;
+		}
+
+	} while (time_after(timeout, jiffies));
+
+	/* Make sure it is not a false timeout */
+	if (ret && !cpsw_ale_control_get(ale, 0, ALE_AGEOUT))
+		return ret;
+
+	cpsw_ale_control_set(ale, 0, ALE_AGEOUT, 1);
+
+	/* Clear all mcast from ALE */
+	cpsw_ale_flush_multicast(ale,
+				 GBE_PORT_MASK(gbe_dev->ale_ports),
+				 -1);
+
+	/* Flood All Unicast Packets to Host port */
+	cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
+	dev_vdbg(gbe_dev->dev, "promiscuous mode enabled\n");
+	return ret;
+}
+
 static int gbe_ioctl(void *intf_priv, struct ifreq *req, int cmd)
 {
 	struct gbe_intf *gbe_intf = intf_priv;
@@ -3523,6 +3578,7 @@ static int gbe_probe(struct netcp_device *netcp_device, struct device *dev,
 		gbe_dev->max_num_slaves = 8;
 	} else if (of_device_is_compatible(node, "ti,netcp-gbe-2")) {
 		gbe_dev->max_num_slaves = 1;
+		gbe_module.set_rx_mode = gbe_set_rx_mode;
 	} else if (of_device_is_compatible(node, "ti,netcp-xgbe")) {
 		gbe_dev->max_num_slaves = 2;
 	} else {
-- 
1.9.1
[net-next 0/2] Add promiscous mode support in k2g network driver
This patch adds support for promiscuous mode in network driver for K2G
SoC. This depends on v3 of my series at
https://www.spinics.net/lists/kernel/msg2765942.html

I plan to fold this into the above series and submit again when the
net-next merge window opens. At this time, please review and let me know
if it looks good or needs any re-work. I would like to get this ready so
that it can be merged along with the above series.

The boot and promiscuous mode test logs are at
https://pastebin.ubuntu.com/p/XQCvFS3QZb/

WingMan Kwok (2):
  net: netcp: add api to support set rx mode in netcp modules
  net: netcp: ethss: k2g: add promiscuous mode support

 drivers/net/ethernet/ti/netcp.h       |  1 +
 drivers/net/ethernet/ti/netcp_core.c  | 19
 drivers/net/ethernet/ti/netcp_ethss.c | 56 +++
 3 files changed, 76 insertions(+)

-- 
1.9.1
[net-next 1/2] net: netcp: add api to support set rx mode in netcp modules
From: WingMan Kwok

This patch adds an API to support setting rx mode in netcp modules. If a
netcp module needs to be notified when the upper layer transitions from
one rx mode to another and react accordingly, such a module will
implement the new API set_rx_mode added in this patch. Currently the rx
modes supported are PROMISCUOUS and NON_PROMISCUOUS modes.

Signed-off-by: WingMan Kwok
Signed-off-by: Murali Karicheri
---
 drivers/net/ethernet/ti/netcp.h      |  1 +
 drivers/net/ethernet/ti/netcp_core.c | 19 +++
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/ti/netcp.h b/drivers/net/ethernet/ti/netcp.h
index 416f732..c4ffdf4 100644
--- a/drivers/net/ethernet/ti/netcp.h
+++ b/drivers/net/ethernet/ti/netcp.h
@@ -214,6 +214,7 @@ struct netcp_module {
 	int	(*add_vid)(void *intf_priv, int vid);
 	int	(*del_vid)(void *intf_priv, int vid);
 	int	(*ioctl)(void *intf_priv, struct ifreq *req, int cmd);
+	int	(*set_rx_mode)(void *intf_priv, bool promisc);
 
 	/* used internally */
 	struct list_head	module_list;
diff --git a/drivers/net/ethernet/ti/netcp_core.c b/drivers/net/ethernet/ti/netcp_core.c
index 736f6f7..e40aa3e 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1509,6 +1509,24 @@ static void netcp_addr_sweep_add(struct netcp_intf *netcp)
 	}
 }
 
+static int netcp_set_promiscuous(struct netcp_intf *netcp, bool promisc)
+{
+	struct netcp_intf_modpriv *priv;
+	struct netcp_module *module;
+	int error;
+
+	for_each_module(netcp, priv) {
+		module = priv->netcp_module;
+		if (!module->set_rx_mode)
+			continue;
+
+		error = module->set_rx_mode(priv->module_priv, promisc);
+		if (error)
+			return error;
+	}
+	return 0;
+}
+
 static void netcp_set_rx_mode(struct net_device *ndev)
 {
 	struct netcp_intf *netcp = netdev_priv(ndev);
@@ -1538,6 +1556,7 @@ static void netcp_set_rx_mode(struct net_device *ndev)
 	/* finally sweep and callout into modules */
 	netcp_addr_sweep_del(netcp);
 	netcp_addr_sweep_add(netcp);
+	netcp_set_promiscuous(netcp, promisc);
 
 	spin_unlock(&netcp->lock);
 }
-- 
1.9.1
Re: [PATCH 0/4] RFC: Realtek 83xx SMI driver core
Hi Linus,

did you make any progress with this?

I noticed that the Vodafone Easybox 904xdsl/904lte models both make use
of the RTL8367 switch. About one million of these routers have been
deployed in Germany. There is an OpenWrt fork at
https://github.com/Quallenauge/Easybox-904-XDSL/commits/master-lede
which depends on the out-of-tree patches which seem to be the basis for
your Realtek 83xx driver patches.

Having your Realtek 83xx patches in the upstream Linux kernel would help
tremendously in getting support for those router models merged in
OpenWrt.

Regards,
Carl-Daniel
RE: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
> -----Original Message-----
> From: Tal Gilboa [mailto:ta...@mellanox.com]
> Sent: Monday, April 02, 2018 7:34 AM
> To: Bjorn Helgaas
> Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
>
> On 4/2/2018 5:05 PM, Bjorn Helgaas wrote:
> > On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote:
> >> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:
> >>> On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:
> >>>> On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> >>>>> From: Tal Gilboa
> >>>>>
> >>>>> Add pcie_bandwidth_capable() to compute the max link bandwidth supported by
> >>>>> a device, based on the max link speed and width, adjusted by the encoding
> >>>>> overhead.
> >>>>>
> >>>>> The maximum bandwidth of the link is computed as:
> >>>>>
> >>>>>    max_link_speed * max_link_width * (1 - encoding_overhead)
> >>>>>
> >>>>> The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
> >>>>> encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
> >>>>> encoding.
> >>>>>
> >>>>> Signed-off-by: Tal Gilboa
> >>>>> [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
> >>>>> signatures, don't export outside drivers/pci]
> >>>>> Signed-off-by: Bjorn Helgaas
> >>>>> Reviewed-by: Tariq Toukan
> >>>>> ---
> >>>>>  drivers/pci/pci.c | 21 +
> >>>>>  drivers/pci/pci.h |  9 +
> >>>>>  2 files changed, 30 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >>>>> index 43075be79388..9ce89e254197 100644
> >>>>> --- a/drivers/pci/pci.c
> >>>>> +++ b/drivers/pci/pci.c
> >>>>> @@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev)
> >>>>>  	return PCIE_LNK_WIDTH_UNKNOWN;
> >>>>>  }
> >>>>>
> >>>>> +/**
> >>>>> + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth capability
> >>>>> + * @dev: PCI device
> >>>>> + * @speed: storage for link speed
> >>>>> + * @width: storage for link width
> >>>>> + *
> >>>>> + * Calculate a PCI device's link bandwidth by querying for its link speed
> >>>>> + * and width, multiplying them, and applying encoding overhead.
> >>>>> + */
> >>>>> +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
> >>>>> +			   enum pcie_link_width *width)
> >>>>> +{
> >>>>> +	*speed = pcie_get_speed_cap(dev);
> >>>>> +	*width = pcie_get_width_cap(dev);
> >>>>> +
> >>>>> +	if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
> >>>>> +		return 0;
> >>>>> +
> >>>>> +	return *width * PCIE_SPEED2MBS_ENC(*speed);
> >>>>> +}
> >>>>> +
> >>>>>  /**
> >>>>>   * pci_select_bars - Make BAR mask from the type of resource
> >>>>>   * @dev: the PCI device for which BAR mask is made
> >>>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> >>>>> index 66738f1050c0..2a50172b9803 100644
> >>>>> --- a/drivers/pci/pci.h
> >>>>> +++ b/drivers/pci/pci.h
> >>>>> @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
> >>>>>  	 (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
> >>>>>  	 "Unknown speed")
> >>>>>
> >>>>> +/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for gen3 */
> >>>>> +#define PCIE_SPEED2MBS_ENC(speed) \
> >>>>
> >>>> Missing gen4.
> >>>
> >>> I made it "gen3+".  I think that's accurate, isn't it?  The spec
> >>> doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2
> >>> says rates of 8 GT/s or higher (which I think includes gen3 and gen4)
> >>> use 128b/130b encoding.
> >>
> >> I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it
> >> wasn't added. Need to return 15754.
> >
> > Oh, duh, of course!  Sorry for being dense.  What about the following?
> > I included the calculation as opposed to just the magic numbers to try
> > to make it clear how they're derived.  This has the disadvantage of
> > truncating the result instead of rounding, but I doubt that's
> > significant in this context.  If it is, we could use the magic numbers
> > and put the computation in a comment.
>
> We can always use DIV_ROUND_UP((speed * enc_nominator),
> enc_denominator). I think this is confusing and since this introduces a
> bandwidth limit I would prefer to give a wider limit than a wrong one,
> even if it is by less than 1Mb/s. My vote is for leaving it as you wrote
> below.
>
> > Another question: we currently deal in
RE: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
> -----Original Message-----
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, March 30, 2018 2:06 PM
> To: Tal Gilboa
> Cc: Tariq Toukan ; Keller, Jacob E ; Ariel Elior ; Ganesh Goudar ; Kirsher, Jeffrey T ; everest-linux...@cavium.com; intel-wired-l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
>
> From: Bjorn Helgaas
>
> Use pcie_print_link_status() to report PCIe link speed and possible
> limitations instead of implementing this in the driver itself.
>
> Note that pcie_get_minimum_link() can return misleading information because
> it finds the slowest link and the narrowest link without considering the
> total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
> 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
> corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
> about 2000 MB/s for a 16 GT/s x1 link.

This comment is about what's being fixed, so it would have been easier to
parse if it were written to more clearly indicate that we're removing (and
not adding) this behavior.

Aside from the commit message (which I don't feel strongly enough needs a
re-send of the patch) this looks good to me.

Acked-by: Jacob Keller

Thanks Bjorn and Tal for fixing this!

>
> Signed-off-by: Bjorn Helgaas
> ---
>  drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 87 --
>  1 file changed, 1 insertion(+), 86 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> index a434fecfdfeb..aa05fb534942 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> @@ -2120,91 +2120,6 @@ static int fm10k_sw_init(struct fm10k_intfc *interface,
>  	return 0;
>  }
>
> -static void fm10k_slot_warn(struct fm10k_intfc *interface)
> -{
> -	enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
> -	enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
> -	struct fm10k_hw *hw = &interface->hw;
> -	int max_gts = 0, expected_gts = 0;
> -
> -	if (pcie_get_minimum_link(interface->pdev, &speed, &width) ||
> -	    speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
> -		dev_warn(&interface->pdev->dev,
> -			 "Unable to determine PCI Express bandwidth.\n");
> -		return;
> -	}
> -
> -	switch (speed) {
> -	case PCIE_SPEED_2_5GT:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		max_gts = 2 * width;
> -		break;
> -	case PCIE_SPEED_5_0GT:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		max_gts = 4 * width;
> -		break;
> -	case PCIE_SPEED_8_0GT:
> -		/* 128b/130b encoding has less than 2% impact on throughput */
> -		max_gts = 8 * width;
> -		break;
> -	default:
> -		dev_warn(&interface->pdev->dev,
> -			 "Unable to determine PCI Express bandwidth.\n");
> -		return;
> -	}
> -
> -	dev_info(&interface->pdev->dev,
> -		 "PCI Express bandwidth of %dGT/s available\n",
> -		 max_gts);
> -	dev_info(&interface->pdev->dev,
> -		 "(Speed:%s, Width: x%d, Encoding Loss:%s, Payload:%s)\n",
> -		 (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
> -		  speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
> -		  speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
> -		  "Unknown"),
> -		 hw->bus.width,
> -		 (speed == PCIE_SPEED_2_5GT ? "20%" :
> -		  speed == PCIE_SPEED_5_0GT ? "20%" :
> -		  speed == PCIE_SPEED_8_0GT ? "<2%" :
> -		  "Unknown"),
> -		 (hw->bus.payload == fm10k_bus_payload_128 ? "128B" :
> -		  hw->bus.payload == fm10k_bus_payload_256 ? "256B" :
> -		  hw->bus.payload == fm10k_bus_payload_512 ? "512B" :
> -		  "Unknown"));
> -
> -	switch (hw->bus_caps.speed) {
> -	case fm10k_bus_speed_2500:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		expected_gts = 2 * hw->bus_caps.width;
> -		break;
> -	case fm10k_bus_speed_5000:
> -		/* 8b/10b encoding reduces max throughput by 20% */
> -		expected_gts = 4 * hw->bus_caps.width;
> -		break;
> -	case fm10k_bus_speed_8000:
> -		/* 128b/130b encoding has less than 2% impact on throughput */
> -		expected_gts = 8 * hw->bus_caps.width;
> -		break;
> -	default:
> -		dev_warn(&interface->pdev->dev,
> -			 "Unable to
Re: [bpf-next PATCH 4/4] bpf: sockmap, add hash map support
On Sun, Apr 01, 2018 at 08:01:10AM -0700, John Fastabend wrote:
> Sockmap is currently backed by an array and enforces keys to be
> four bytes. This works well for many use cases and was originally
> modeled after devmap which also uses four bytes keys. However,
> this has become limiting in larger use cases where a hash would
> be more appropriate. For example users may want to use the 5-tuple
> of the socket as the lookup key.
>
> To support this add hash support.
>
> Signed-off-by: John Fastabend

api looks good, but I think it came a bit too late for this release.
_nulls part you don't need for this hash.
Few other nits:

> +static void htab_elem_free_rcu(struct rcu_head *head)
> +{
> +	struct htab_elem *l = container_of(head, struct htab_elem, rcu);
> +
> +	/* must increment bpf_prog_active to avoid kprobe+bpf triggering while
> +	 * we're calling kfree, otherwise deadlock is possible if kprobes
> +	 * are placed somewhere inside of slub
> +	 */
> +	preempt_disable();
> +	__this_cpu_inc(bpf_prog_active);
> +	kfree(l);
> +	__this_cpu_dec(bpf_prog_active);
> +	preempt_enable();

I don't think it's necessary.

> +static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_htab *htab;
> +	int i, err;
> +	u64 cost;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 ||
> +	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (attr->value_size > KMALLOC_MAX_SIZE)
> +		return ERR_PTR(-E2BIG);

doesn't seem to match
	u32 fd = *(u32 *)value;
that is done later.

> +static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head,
> +					 u32 hash, void *key, u32 key_size)
> +{
> +	struct hlist_nulls_node *n;
> +	struct htab_elem *l;
> +
> +	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> +		if (l->hash == hash && !memcmp(&l->key, key, key_size))
> +			return l;

if nulls is needed, there gotta be a comment explaining it.

please add tests for all methods.

> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index f95fa67..2fa4cbb 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -67,6 +67,7 @@
> 	[BPF_MAP_TYPE_DEVMAP]	= "devmap",
> 	[BPF_MAP_TYPE_SOCKMAP]	= "sockmap",
> 	[BPF_MAP_TYPE_CPUMAP]	= "cpumap",
> +	[BPF_MAP_TYPE_SOCKHASH]	= "sockhash",
> };
>
> static unsigned int get_possible_cpus(void)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 9d07465..1a19450 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -115,6 +115,7 @@ enum bpf_map_type {
> 	BPF_MAP_TYPE_DEVMAP,
> 	BPF_MAP_TYPE_SOCKMAP,
> 	BPF_MAP_TYPE_CPUMAP,
> +	BPF_MAP_TYPE_SOCKHASH,

tools/* updates should be in separate commit.
Re: [PATCH net-next] net: ipv6/gre: Add GRO support
On 02/04/2018 6:19 PM, Eric Dumazet wrote:
On 04/02/2018 08:00 AM, Eran Ben Elisha wrote:

Seems good, but why isn't this handled directly in GRO native layer?

ip6_tunnel and ip6_gre do not share initialization flow functions (unlike
ipv4). Changing the ipv6 init infrastructure should not be part of this
patch. We prefer to keep this one minimal, simple and safe.

Looking at gre_gro_receive() and gre_gro_complete() I could not see why
they could not be copied/pasted to IPv6.

These functions to handle GRO over GRE are already assigned in
gre_offload_init() (in net/ipv4/gre_offload.c under CONFIG_IPV6).
However, without initializing the gro_cells, the receive path will not go
via the napi_gro_receive path, but directly to netif_rx. So AFAIU, only
gcells->cells was missing for gro_cells_receive() to really go via the
GRO flow.

Maybe give more details on the changelog, it is really not obvious.

Hopefully the above filled this request.

Not really :/

So you're referring to the native interface. We thought you meant the
native IP module.

gro_cells_receive() is not really useful with native GRO, since the
packet is already a GRO packet by the time it reaches ip_tunnel_rcv() or
__ip6_tnl_rcv().

Right. If GRO on the native interface is ON, our patch doesn't help much.
The case we improve here is: native has GRO OFF, GRE has GRO ON. Before
this patch there were no GRO packets at all in this case, only MTU-sized
packets went up the stack.

Sure, it might be useful if native GRO (happening on eth0 if you prefer)
did not handle a particular encapsulation. Or it is turned OFF.

gro_cell was a workaround before we extended GRO to be able to decap some
tunnel headers.

It seems we have to extend this to also support GRE6.
Re: [BUG/Q] can_pernet_exit() leaves devices on dead net
Hi, Oliver,

On 02.04.2018 18:28, Oliver Hartkopp wrote:
> Hi Kirill, Marc,
>
> I checked the code once more and added some debug output to the other
> parts in CAN notifier code.
>
> In fact the code pointed to by both of you seems to be obsolete as I
> only wanted to be 'really sure' that no leftovers of the CAN filters
> remain at module unloading.
>
>> Yes, this one looks good:
>> https://marc.info/?l=linux-can=150169589119335=2
>>
>> Regards,
>> Kirill
>
> I was obviously too cautious ;-)
>
> All tests I made resulted in the function iterating through all the CAN
> netdevices doing exactly nothing.
>
> I'm fine with removing that stuff - but I'm not sure whether it's worth
> to push that patch to stable 4.12+ or even before 4.12 (without
> namespace support - and removing rcu_barrier() too).
>
> Any opinions?

I think the same -- it's not needed for stable as there is just an
iteration over an empty list, i.e., a noop.

Kirill
Re: [BUG/Q] can_pernet_exit() leaves devices on dead net
Hi Kirill, Marc,

I checked the code once more and added some debug output to the other
parts in CAN notifier code.

In fact the code pointed to by both of you seems to be obsolete as I only
wanted to be 'really sure' that no leftovers of the CAN filters remain at
module unloading.

> Yes, this one looks good:
> https://marc.info/?l=linux-can=150169589119335=2
>
> Regards,
> Kirill

I was obviously too cautious ;-)

All tests I made resulted in the function iterating through all the CAN
netdevices doing exactly nothing.

I'm fine with removing that stuff - but I'm not sure whether it's worth
to push that patch to stable 4.12+ or even before 4.12 (without namespace
support - and removing rcu_barrier() too).

Any opinions?

Best regards,
Oliver
Re: [PATCH net-next] bridge: Allow max MTU when multiple VLANs present
On Mon, Apr 2, 2018 at 11:08 AM, Roopa Prabhuwrote: > On Fri, Mar 30, 2018 at 12:54 PM, Chas Williams <3ch...@gmail.com> wrote: >> On Thu, Mar 29, 2018 at 9:02 PM, Toshiaki Makita >> wrote: >>> On 2018/03/30 1:49, Roopa Prabhu wrote: On Thu, Mar 22, 2018 at 9:53 PM, Roopa Prabhu wrote: > On Thu, Mar 22, 2018 at 8:34 AM, Chas Williams <3ch...@gmail.com> wrote: >> If the bridge is allowing multiple VLANs, some VLANs may have >> different MTUs. Instead of choosing the minimum MTU for the >> bridge interface, choose the maximum MTU of the bridge members. >> With this the user only needs to set a larger MTU on the member >> ports that are participating in the large MTU VLANS. >> >> Signed-off-by: Chas Williams <3ch...@gmail.com> >> --- > > Acked-by: Roopa Prabhu > > This or an equivalent fix is necessary: as stated above, today the > bridge mtu capped at min port mtu limits all > vlan devices on top of the vlan filtering bridge to min port mtu. On further thought, since this patch changes default behavior, it may upset people. ie with this patch, a vlan device on the bridge by default will now use the bridge max mtu and that could cause unexpected drops in the bridge driver if the xmit port had a lower mtu. This may surprise users. >> >> It only changes the default behavior when you are using VLAN aware bridges. >> The behavior remains the same otherwise. I don't know if VLAN aware bridges >> are that popular yet so there probably isn't any particular >> expectation from those >> bridges. > > they are popular...in-fact they are the default bridge mode on our > network switches. > And they have been around for some time now to ignore its users. > Plus it is not right to change default mtu behavior for one mode of the bridge > and not the others (bridge mtu handling from user-space is complex enough > today > due to dynamic mtu changes on port enslave/deslave). I don't see the issue with one mode of bridge behaving differently from another mode. 
The VLAN behavior between the two bridge modes is completely different, so having a different MTU behavior doesn't seem that surprising. You are potentially mixing different sized VLANs on the same bridge. The only sane choice is to pick the largest MTU for the bridge. This lets you have whatever MTU is appropriate on the child VLAN interfaces of the bridge. If you attempt to forward from a port with a larger MTU to one with a smaller MTU, you get the expected behavior. Forcing the end user to configure all the ports to the maximum MTU of all the VLANs on the bridge is wrong IMHO. You then risk attempting to forward oversize packets on a network that can't support that.

>> I don't think those drops are unexpected. If a user has misconfigured
>> the bridge we can't be expected to fix that for them. It is the user's
>> responsibility to ensure that the ports on the VLAN have a size
>> consistent with the traffic they expect to pass.
>
> By default they are not expected today. The problem is that changing the
> bridge to max mtu changes 'all' the vlan devices on top of the vlan
> aware bridge to max mtu by default, which makes drops at the bridge
> driver more common if the user had mixed mtu on its ports.

That's not been my experience. The MTU on the vlan devices is only limited by the bridge's MTU. Setting the bridge MTU doesn't change the child VLAN devices' MTUs.
[RFC] vhost: introduce mdev based hardware vhost backend
This patch introduces a mdev (mediated device) based hardware vhost backend. This backend is an abstraction of the various hardware vhost accelerators (potentially any device that uses a virtio ring can be used as a vhost accelerator). Some generic mdev parent ops are provided for accelerator drivers to support generating mdev instances.

What's this
===

The idea is that we can set up a virtio ring compatible device with the messages available at the vhost backend. Originally, these messages were used to implement a software vhost backend, but now we will use them to set up a virtio ring compatible hardware device. The hardware device will then work with the guest virtio driver in the VM just as the software backend does. That is to say, we can implement a hardware based vhost backend in QEMU, and potentially any virtio ring compatible device can be used with this backend. (We also call it vDPA -- vhost Data Path Acceleration.)

One problem is that different virtio ring compatible devices may have different device interfaces. That is to say, we would need different drivers in QEMU. It could be troublesome. And that's what this patch is trying to fix.

The idea behind this patch is very simple: mdev is a standard way to emulate devices in the kernel. So we define a standard device based on mdev which is able to accept vhost messages. When the mdev emulation code (i.e. the generic mdev parent ops provided by this patch) gets vhost messages, it will parse and deliver them to accelerator drivers. Drivers can use these messages to set up accelerators.

That is to say, the generic mdev parent ops (e.g. read()/write()/ioctl()/...) will be provided for accelerator drivers to register accelerators as mdev parent devices. And each accelerator device will support generating standard mdev instance(s).

With this standard device interface, we will be able to develop just one userspace driver to implement the hardware based vhost backend in QEMU.
Difference between vDPA and PCI passthru
===

The key difference between vDPA and PCI passthru is that, in vDPA, only the data path of the device (e.g. DMA ring, notify region and queue interrupt) is passed through to the VM; the device control path (e.g. PCI configuration space and MMIO regions) is still defined and emulated by QEMU.

The benefits of keeping virtio device emulation in QEMU compared with virtio device PCI passthru include (but are not limited to):

- consistent device interface for the guest OS in the VM;
- max flexibility on the hardware design, especially the accelerator
  for each vhost backend doesn't have to be a full PCI device;
- leveraging the existing virtio live-migration framework;

The interface of this mdev based device
===

1. BAR0

The MMIO region described by BAR0 is the main control interface. Messages will be written to or read from this region. The message type is determined by the `request` field in the message header; the message size is encoded in the message header too. The message format looks like this:

struct vhost_vfio_op {
	__u64 request;
	__u32 flags;
	/* Flag values: */
#define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
	__u32 size;
	union {
		__u64 u64;
		struct vhost_vring_state state;
		struct vhost_vring_addr addr;
		struct vhost_memory memory;
	} payload;
};

The existing vhost-kernel ioctl cmds are reused as the message requests in the above structure.
Each message will be written to or read from this region at offset 0:

int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
	struct vhost_vfio *vfio = dev->opaque;
	int ret;

	ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
	if (ret != count)
		return -1;

	return 0;
}

int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
	struct vhost_vfio *vfio = dev->opaque;
	uint64_t request = op->request;
	int ret;

	ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
	if (ret != count || request != op->request)
		return -1;

	return 0;
}

It's quite straightforward to set things to the device. We just need to write the message to the device directly:

int vhost_vfio_set_features(struct vhost_dev *dev, uint64_t features)
{
	struct vhost_vfio_op op;

	op.request = VHOST_SET_FEATURES;
	op.flags = 0;
	op.size = sizeof(features);
	op.payload.u64 = features;

	return vhost_vfio_write(dev, &op);
}

To get things from the device, two steps are needed. Take VHOST_GET_FEATURES as an example:

int vhost_vfio_get_features(struct vhost_dev *dev, uint64_t *features)
{
	struct vhost_vfio_op op;
Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
Hi Marc, David,

with the v2 patch ("net: usb: asix88179_178a: de-duplicate code") I made the embarrassingly stupid mistake of removing the wrong function. The v2 patch accidentally changed ax88179_link_reset() instead of ax88179_reset(). Hunk 6 of v2 ("net: usb: asix88179_178a: de-duplicate code") is just utterly wrong. ax88179_bind() and ax88179_reset() were the correct targets to be de-duplicated, as done in the v3 patch.

Sorry for this,
Alexander

On Mon, 2 Apr 2018, David Miller wrote:
> From: Marc Zyngier
> Date: Mon, 02 Apr 2018 10:45:40 +0100
>
>> What has changed between this patch and the previous one? Having a bit
>> of a change-log would certainly help. Also, I would have appreciated a
>> reply to the questions I had on v2 before you posted a third version.
>
> Agreed, and I'm not applying these patches until this is sorted out
> and explained properly.
Re: [PATCH net-next] net: ipv6/gre: Add GRO support
On 04/02/2018 08:00 AM, Eran Ben Elisha wrote:
>>>> Seems good, but why isn't this handled directly in GRO native layer ?
>>> ip6_tunnel and ip6_gre do not share initialization flow functions
>>> (unlike ipv4). Changing the ipv6 init infrastructure should not be
>>> part of this patch. we prefer to keep this one minimal, simple and safe.
>>
>> Looking at gre_gro_receive() and gre_gro_complete() I could not see why
>> they could not be copied/pasted to IPv6.
>
> These functions to handle GRO over GRE are already assigned in
> gre_offload_init() (in net/ipv4/gre_offload.c under CONFIG_IPV6).
> However without initializing the gro_cells, the receive path will not
> go via the napi_gro_receive path, but directly to netif_rx.
> So AFAIU, only gcells->cells was missing for gro_cells_receive to
> really go via the GRO flow.
>
>> Maybe give more details on the changelog, it is really not obvious.
>
> Hopefully the above filled this request.

Not really :/

gro_cells_receive() is not really useful with native GRO, since the packet is already a GRO packet by the time it reaches ip_tunnel_rcv() or __ip6_tnl_rcv().

Sure, it might be useful if native GRO (happening on eth0 if you prefer) did not handle a particular encapsulation. gro_cells was a workaround before we extended GRO to be able to decap some tunnel headers.

It seems we have to extend this to also support GRE6.
Re: [PATCH] connector: add parent pid and tgid to coredump and exit events
Hi David,

I don't see how it breaks UAPI. The point is that the structures coredump_proc_event and exit_proc_event are members of the *union* event_data, so the position of the existing data in the structure is unchanged. Furthermore, this change won't increase the size of struct proc_event, because comm_proc_event (also a member of event_data) is bigger than the changed structures. If I'm wrong, could you please explain what exactly the change will break in the UAPI?

On 30/03/18 19:59, David Miller wrote:
> From: Stefan Strogin
> Date: Thu, 29 Mar 2018 17:12:47 +0300
>
>> diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
>> index 68ff25414700..db210625cee8 100644
>> --- a/include/uapi/linux/cn_proc.h
>> +++ b/include/uapi/linux/cn_proc.h
>> @@ -116,12 +116,16 @@ struct proc_event {
>>  		struct coredump_proc_event {
>>  			__kernel_pid_t process_pid;
>>  			__kernel_pid_t process_tgid;
>> +			__kernel_pid_t parent_pid;
>> +			__kernel_pid_t parent_tgid;
>>  		} coredump;
>>
>>  		struct exit_proc_event {
>>  			__kernel_pid_t process_pid;
>>  			__kernel_pid_t process_tgid;
>>  			__u32 exit_code, exit_signal;
>> +			__kernel_pid_t parent_pid;
>> +			__kernel_pid_t parent_tgid;
>>  		} exit;
>>
>> 	} event_data;
>
> I don't think you can add these members without breaking UAPI.
Re: [PATCH net v5 2/3] ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
From: Alexey Kodanev
Date: Mon, 2 Apr 2018 17:00:34 +0300

> +++ b/net/ipv6/ip6_output.c
> @@ -1105,23 +1105,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6,
>   * @sk: socket which provides the dst cache and route info
>   * @fl6: flow to lookup
>   * @final_dst: final destination address for ipsec lookup
>   * @connected: whether @sk is connected or not
 ...
> struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
> -					 const struct in6_addr *final_dst)
> +					 const struct in6_addr *final_dst,
> +					 int connected)

Please use type 'bool' and true/false for this new parameter.

Thank you.
Re: [PATCH v3 0/2] net: mvneta: improve suspend/resume
From: Jisheng Zhang
Date: Mon, 2 Apr 2018 11:22:29 +0800

> This series tries to optimize the mvneta's suspend/resume
> implementation by only taking necessary actions.
>
> Since v2:
> - keep rtnl lock when calling mvneta_start_dev() and mvneta_stop_dev()
>   Thanks to Russell for pointing this out
>
> Since v1:
> - unify ret check
> - try best to keep the suspend/resume behavior
> - split txq deinit into sw/hw parts as well
> - adjust mvneta_stop_dev() location

Series applied, thank you.
Re: [PATCH net-next] bridge: Allow max MTU when multiple VLANs present
On Fri, Mar 30, 2018 at 12:54 PM, Chas Williams <3ch...@gmail.com> wrote:
> On Thu, Mar 29, 2018 at 9:02 PM, Toshiaki Makita wrote:
>> On 2018/03/30 1:49, Roopa Prabhu wrote:
>>> On Thu, Mar 22, 2018 at 9:53 PM, Roopa Prabhu wrote:
>>>> On Thu, Mar 22, 2018 at 8:34 AM, Chas Williams <3ch...@gmail.com> wrote:
>>>>> If the bridge is allowing multiple VLANs, some VLANs may have
>>>>> different MTUs. Instead of choosing the minimum MTU for the
>>>>> bridge interface, choose the maximum MTU of the bridge members.
>>>>> With this the user only needs to set a larger MTU on the member
>>>>> ports that are participating in the large MTU VLANS.
>>>>>
>>>>> Signed-off-by: Chas Williams <3ch...@gmail.com>
>>>>> ---
>>>>
>>>> Acked-by: Roopa Prabhu
>>>>
>>>> This or an equivalent fix is necessary: as stated above, today the
>>>> bridge mtu capped at min port mtu limits all vlan devices on top of
>>>> the vlan filtering bridge to min port mtu.
>>>
>>> On further thought, since this patch changes default behavior, it may
>>> upset people. i.e. with this patch, a vlan device on the bridge by
>>> default will now use the bridge max mtu and that could cause
>>> unexpected drops in the bridge driver if the xmit port had a lower
>>> mtu. This may surprise users.
>
> It only changes the default behavior when you are using VLAN aware
> bridges. The behavior remains the same otherwise. I don't know if VLAN
> aware bridges are that popular yet so there probably isn't any
> particular expectation from those bridges.

they are popular... in fact they are the default bridge mode on our network switches. And they have been around for some time now to ignore their users. Plus it is not right to change default mtu behavior for one mode of the bridge and not the others (bridge mtu handling from user-space is complex enough today due to dynamic mtu changes on port enslave/deslave).

> I don't think those drops are unexpected. If a user has misconfigured
> the bridge we can't be expected to fix that for them. It is the user's
> responsibility to ensure that the ports on the VLAN have a size
> consistent with the traffic they expect to pass.

By default they are not expected today. The problem is that changing the bridge to max mtu changes 'all' the vlan devices on top of the vlan aware bridge to max mtu by default, which makes drops at the bridge driver more common if the user had mixed mtu on its ports.
Re: [net-next PATCH v3 00/11] Add support for netcp driver on K2G SoC
On 04/02/2018 10:40 AM, David Miller wrote:
> The net-next tree is closed, please resubmit this after the merge window
> and the net-next tree is open back up again.

Ok. Will do.

Thanks

--
Murali Karicheri
Linux Kernel, Keystone
Re: [PATCH net-next] net: ipv6/gre: Add GRO support
>>> Seems good, but why isn't this handled directly in GRO native layer ?
>> ip6_tunnel and ip6_gre do not share initialization flow functions
>> (unlike ipv4). Changing the ipv6 init infrastructure should not be
>> part of this patch. we prefer to keep this one minimal, simple and safe.
>
> Looking at gre_gro_receive() and gre_gro_complete() I could not see why
> they could not be copied/pasted to IPv6.

These functions to handle GRO over GRE are already assigned in gre_offload_init() (in net/ipv4/gre_offload.c under CONFIG_IPV6). However without initializing the gro_cells, the receive path will not go via the napi_gro_receive path, but directly to netif_rx. So AFAIU, only gcells->cells was missing for gro_cells_receive to really go via the GRO flow.

> Maybe give more details on the changelog, it is really not obvious.

Hopefully the above filled this request.
Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support
On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
> Now that we have SCTP offload capabilities in the kernel, we can add
> them to virtio as well. First step is SCTP checksum.

Thanks.

> As for GSO, the way sctp GSO is currently implemented buys us nothing
> in added support to virtio. To add true GSO would require a lot of
> re-work inside of SCTP and would require extensions to the virtio
> net header to carry extra sctp data.

Can you please elaborate more on this? Is this because SCTP GSO relies on the gso skb format for knowing how to segment it instead of having a list of sizes?

  Marcelo
Re: [PATCH] net: implement IP_RECVHDRS option to get full headers through recvmsg cmsg.
From: Maciej Żenczykowski
Date: Sat, 31 Mar 2018 22:43:14 -0700

> From: Luigi Rizzo
>
> We have all sorts of different ways to fetch pre-UDP payload metadata:
>   IP_RECVTOS
>   IP_RECVTTL
>   IP_RECVOPTS
>   IP_RETOPTS
>
> But nothing generic which simply allows you to receive the entire
> packet header.
>
> This is in a similar vein to TCP_SAVE_SYN but for UDP and other
> datagram sockets.
>
> This is envisioned as a way to get GUE extension metadata for
> encapsulated packets, but implemented in a way to be much more
> future proof.
>
> (Implemented by Luigi, who asked me to send it upstream)
>
> Cc: Eric Dumazet
> Signed-off-by: Luigi Rizzo
> Signed-off-by: Maciej Żenczykowski

This is an ipv4 level socket option, so why are you copying in the MAC header(s)?

That part I don't like at all.

First of all, you have no idea what the link level protocol is for that MAC header, so how could you even begin to interpret its contents correctly?

Second of all, MAC level details do not belong in AF_INET socket interfaces.

Thank you.