Re: [PATCH] net/mlx5e: allocate 'indirection_rqt' buffer dynamically
On 3/8/2021 5:32 PM, Arnd Bergmann wrote:
> From: Arnd Bergmann
>
> Increasing the size of the indirection_rqt array from 128 to 256 bytes
> pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function
> over the warning limit when building with clang and CONFIG_KASAN:
>
> drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=]
>
> Using dynamic allocation here is safe because the caller does the same,
> and it reduces the stack usage of the function to just a few bytes.
>
> Fixes: 1dd55ba2fb70 ("net/mlx5e: Increase indirection RQ table size to 256")
> Signed-off-by: Arnd Bergmann
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index 0da69b98f38f..66f98618dc13 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> @@ -445,12 +445,16 @@ static void mlx5e_hairpin_destroy_transport(struct mlx5e_hairpin *hp)
>  	mlx5_core_dealloc_transport_domain(hp->func_mdev, hp->tdn);
>  }
>
> -static void mlx5e_hairpin_fill_rqt_rqns(struct mlx5e_hairpin *hp, void *rqtc)
> +static int mlx5e_hairpin_fill_rqt_rqns(struct mlx5e_hairpin *hp, void *rqtc)
>  {
> -	u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE], rqn;
> +	u32 *indirection_rqt, rqn;
>  	struct mlx5e_priv *priv = hp->func_priv;
>  	int i, ix, sz = MLX5E_INDIR_RQT_SIZE;
>
> +	indirection_rqt = kzalloc(sz, GFP_KERNEL);
> +	if (!indirection_rqt)
> +		return -ENOMEM;
> +
>  	mlx5e_build_default_indir_rqt(indirection_rqt, sz, hp->num_channels);
> @@ -462,6 +466,9 @@ static void mlx5e_hairpin_fill_rqt_rqns(struct mlx5e_hairpin *hp, void *rqtc)
>  		rqn = hp->pair->rqn[ix];
>  		MLX5_SET(rqtc, rqtc, rq_num[i], rqn);
>  	}
> +
> +	kfree(indirection_rqt);
> +	return 0;
>  }
>
>  static int mlx5e_hairpin_create_indirect_rqt(struct mlx5e_hairpin *hp)
> @@ -482,12 +489,15 @@ static int mlx5e_hairpin_create_indirect_rqt(struct mlx5e_hairpin *hp)
>  	MLX5_SET(rqtc, rqtc, rqt_actual_size, sz);
>  	MLX5_SET(rqtc, rqtc, rqt_max_size, sz);
>
> -	mlx5e_hairpin_fill_rqt_rqns(hp, rqtc);
> +	err = mlx5e_hairpin_fill_rqt_rqns(hp, rqtc);
> +	if (err)
> +		goto out;
>
>  	err = mlx5_core_create_rqt(mdev, in, inlen, &hp->indir_rqt.rqtn);
>  	if (!err)
>  		hp->indir_rqt.enabled = true;
>
> +out:
>  	kvfree(in);
>  	return err;
>  }

Reviewed-by: Tariq Toukan

Thanks for your patch.
Tariq
Re: [PATCH] net/mlx4_core: Add missed mlx4_free_cmd_mailbox()
On 2/21/2021 4:35 PM, Chuhong Yuan wrote:
> mlx4_do_mirror_rule() forgets to call mlx4_free_cmd_mailbox() to
> free the memory region allocated by mlx4_alloc_cmd_mailbox() before
> an exit.
> Add the missed call to fix it.
>
> Fixes: 78efed275117 ("net/mlx4_core: Support mirroring VF DMFS rules on both ports")
> Signed-off-by: Chuhong Yuan
> ---
>  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> index 394f43add85c..a99e71bc7b3c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> @@ -4986,6 +4986,7 @@ static int mlx4_do_mirror_rule(struct mlx4_dev *dev, struct res_fs_rule *fs_rule
>  	if (!fs_rule->mirr_mbox) {
>  		mlx4_err(dev, "rule mirroring mailbox is null\n");
> +		mlx4_free_cmd_mailbox(dev, mailbox);
>  		return -EINVAL;
>  	}
>  	memcpy(mailbox->buf, fs_rule->mirr_mbox, fs_rule->mirr_mbox_size);

Reviewed-by: Tariq Toukan

Thanks for your patch.
Tariq
Re: [PATCH v4 net-next] net: Remove redundant calls of sk_tx_queue_clear().
On 1/28/2021 2:42 PM, Kuniyuki Iwashima wrote:
> The commit 41b14fb8724d ("net: Do not clear the sock TX queue in
> sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and
> adds it instead in sk_alloc() and sk_clone_lock() to fix an issue
> introduced in the commit e022f0b4a03f ("net: Introduce
> sk_tx_queue_mapping"). On the other hand, the original commit had
> already put sk_tx_queue_clear() in sk_prot_alloc(): the callee of
> sk_alloc() and sk_clone_lock(). Thus sk_tx_queue_clear() is called
> twice in each path.
>
> If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
> currently works well because (i) sk_tx_queue_mapping is defined between
> sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called
> after sk_prot_alloc() in sk_clone_lock() does not overwrite
> sk_tx_queue_mapping. However, if we move sk_tx_queue_mapping out of
> the no copy area, it introduces a bug unintentionally.
>
> Therefore, this patch adds a compile-time check to take care of the
> order of sock_copy() and sk_tx_queue_clear() and removes
> sk_tx_queue_clear() from sk_prot_alloc() so that it does the only
> allocation and its callers initialize fields.
>
> v4:
>   * Fix typo in the changelog (runtime -> compile-time)
>
> v3: https://lore.kernel.org/netdev/20210128021905.57471-1-kun...@amazon.co.jp/
>   * Remove Fixes: tag
>   * Add BUILD_BUG_ON
>   * Remove sk_tx_queue_clear() from sk_prot_alloc() instead of
>     sk_alloc() and sk_clone_lock()
>
> v2: https://lore.kernel.org/netdev/20210127132215.10842-1-kun...@amazon.co.jp/
>   * Remove Reviewed-by: tag
>
> v1: https://lore.kernel.org/netdev/20210127125018.7059-1-kun...@amazon.co.jp/

Sorry for not pointing this out earlier, but shouldn't the changelog come
after the --- separator? Unless you want it to appear as part of the
commit message.

Other than that, I think now I'm fine with the patch.

Acked-by: Tariq Toukan

Thanks,
Tariq

> CC: Tariq Toukan
> CC: Boris Pismenny
> Signed-off-by: Kuniyuki Iwashima
> ---
>  net/core/sock.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index bbcd4b97eddd..cfbd62a5e079 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1657,6 +1657,16 @@ static void sock_copy(struct sock *nsk, const struct sock *osk)
>  #ifdef CONFIG_SECURITY_NETWORK
>  	void *sptr = nsk->sk_security;
>  #endif
> +
> +	/* If we move sk_tx_queue_mapping out of the private section,
> +	 * we must check if sk_tx_queue_clear() is called after
> +	 * sock_copy() in sk_clone_lock().
> +	 */
> +	BUILD_BUG_ON(offsetof(struct sock, sk_tx_queue_mapping) <
> +		     offsetof(struct sock, sk_dontcopy_begin) ||
> +		     offsetof(struct sock, sk_tx_queue_mapping) >=
> +		     offsetof(struct sock, sk_dontcopy_end));
> +
>  	memcpy(nsk, osk, offsetof(struct sock, sk_dontcopy_begin));
>
>  	memcpy(&nsk->sk_dontcopy_end, &osk->sk_dontcopy_end,
> @@ -1690,7 +1700,6 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
>  		if (!try_module_get(prot->owner))
>  			goto out_free_sec;
> -		sk_tx_queue_clear(sk);
>  	}
>
>  	return sk;
Re: [PATCH v3 net-next] net: Remove redundant calls of sk_tx_queue_clear().
On 1/28/2021 4:19 AM, Kuniyuki Iwashima wrote:
> The commit 41b14fb8724d ("net: Do not clear the sock TX queue in
> sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and
> adds it instead in sk_alloc() and sk_clone_lock() to fix an issue
> introduced in the commit e022f0b4a03f ("net: Introduce
> sk_tx_queue_mapping"). On the other hand, the original commit had
> already put sk_tx_queue_clear() in sk_prot_alloc(): the callee of
> sk_alloc() and sk_clone_lock(). Thus sk_tx_queue_clear() is called
> twice in each path.
>
> If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
> currently works well because (i) sk_tx_queue_mapping is defined between
> sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called
> after sk_prot_alloc() in sk_clone_lock() does not overwrite
> sk_tx_queue_mapping. However, if we move sk_tx_queue_mapping out of
> the no copy area, it introduces a bug unintentionally.
>
> Therefore, this patch adds a runtime check to take care of the order
> of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear()
> from sk_prot_alloc() so that it does the only allocation and its
> callers initialize fields.
>
> v3:
>   * Remove Fixes: tag
>   * Add BUILD_BUG_ON
>   * Remove sk_tx_queue_clear() from sk_prot_alloc() instead of
>     sk_alloc() and sk_clone_lock()
>
> v2: https://lore.kernel.org/netdev/20210127132215.10842-1-kun...@amazon.co.jp/
>   * Remove Reviewed-by: tag
>
> v1: https://lore.kernel.org/netdev/20210127125018.7059-1-kun...@amazon.co.jp/
>
> CC: Tariq Toukan
> CC: Boris Pismenny
> Signed-off-by: Kuniyuki Iwashima
> ---
>  net/core/sock.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index bbcd4b97eddd..cfbd62a5e079 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1657,6 +1657,16 @@ static void sock_copy(struct sock *nsk, const struct sock *osk)
>  #ifdef CONFIG_SECURITY_NETWORK
>  	void *sptr = nsk->sk_security;
>  #endif
> +
> +	/* If we move sk_tx_queue_mapping out of the private section,
> +	 * we must check if sk_tx_queue_clear() is called after
> +	 * sock_copy() in sk_clone_lock().
> +	 */
> +	BUILD_BUG_ON(offsetof(struct sock, sk_tx_queue_mapping) <
> +		     offsetof(struct sock, sk_dontcopy_begin) ||
> +		     offsetof(struct sock, sk_tx_queue_mapping) >=
> +		     offsetof(struct sock, sk_dontcopy_end));
> +
>  	memcpy(nsk, osk, offsetof(struct sock, sk_dontcopy_begin));
>
>  	memcpy(&nsk->sk_dontcopy_end, &osk->sk_dontcopy_end,
> @@ -1690,7 +1700,6 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
>  		if (!try_module_get(prot->owner))
>  			goto out_free_sec;
> -		sk_tx_queue_clear(sk);
>  	}
>
>  	return sk;
Re: [PATCH] mlx4: style: replace zero-length array with flexible-array member.
On 12/30/2020 8:28 AM, YANG LI wrote: There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members"[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9/process/ deprecated.html#zero-length-and-one-element-arrays Signed-off-by: YANG LI Reported-by: Abaci --- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h index e8ed2319..4029a8b 100644 --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h @@ -314,7 +314,7 @@ struct mlx4_en_tx_ring { struct mlx4_en_rx_desc { /* actual number of entries depends on rx ring stride */ - struct mlx4_wqe_data_seg data[0]; + struct mlx4_wqe_data_seg data[]; }; struct mlx4_en_rx_ring { Reviewed-by: Tariq Toukan Thanks.
Re: [patch 23/30] net/mlx5: Use effective interrupt affinity
On 12/10/2020 9:25 PM, Thomas Gleixner wrote:
> Using the interrupt affinity mask for checking locality is not really
> working well on architectures which support effective affinity masks.
> The affinity mask is either the system wide default or set by user
> space, but the architecture can or even must reduce the mask to the
> effective set, which means that checking the affinity mask itself does
> not really tell about the actual target CPUs.
>
> Signed-off-by: Thomas Gleixner
> Cc: Saeed Mahameed
> Cc: Leon Romanovsky
> Cc: "David S. Miller"
> Cc: Jakub Kicinski
> Cc: net...@vger.kernel.org
> Cc: linux-r...@vger.kernel.org
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -1998,7 +1998,7 @@ static int mlx5e_open_channel(struct mlx
>  	c->num_tc   = params->num_tc;
>  	c->xdp      = !!params->xdp_prog;
>  	c->stats    = &priv->channel_stats[ix].ch;
> -	c->aff_mask = irq_get_affinity_mask(irq);
> +	c->aff_mask = irq_get_effective_affinity_mask(irq);
>  	c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix);
>
>  	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);

Reviewed-by: Tariq Toukan

Thanks.
Re: [patch 22/30] net/mlx5: Replace irq_to_desc() abuse
On 12/10/2020 9:25 PM, Thomas Gleixner wrote: No driver has any business with the internals of an interrupt descriptor. Storing a pointer to it just to use yet another helper at the actual usage site to retrieve the affinity mask is creative at best. Just because C does not allow encapsulation does not mean that the kernel has no limits. Retrieve a pointer to the affinity mask itself and use that. It's still using an interface which is usually not for random drivers, but definitely less hideous than the previous hack. Signed-off-by: Thomas Gleixner --- drivers/net/ethernet/mellanox/mlx5/core/en.h |2 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +- drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |6 +- 3 files changed, 3 insertions(+), 7 deletions(-) Reviewed-by: Tariq Toukan Thanks.
Re: [patch 21/30] net/mlx4: Use effective interrupt affinity
On 12/10/2020 9:25 PM, Thomas Gleixner wrote: Using the interrupt affinity mask for checking locality is not really working well on architectures which support effective affinity masks. The affinity mask is either the system wide default or set by user space, but the architecture can or even must reduce the mask to the effective set, which means that checking the affinity mask itself does not really tell about the actual target CPUs. Signed-off-by: Thomas Gleixner Cc: Tariq Toukan Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Cc: linux-r...@vger.kernel.org --- drivers/net/ethernet/mellanox/mlx4/en_cq.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c @@ -117,7 +117,7 @@ int mlx4_en_activate_cq(struct mlx4_en_p assigned_eq = true; } irq = mlx4_eq_get_irq(mdev->dev, cq->vector); - cq->aff_mask = irq_get_affinity_mask(irq); + cq->aff_mask = irq_get_effective_affinity_mask(irq); } else { /* For TX we use the same irq per ring we assigned for the RX */ Reviewed-by: Tariq Toukan Thanks.
Re: [patch 20/30] net/mlx4: Replace irq_to_desc() abuse
On 12/10/2020 9:25 PM, Thomas Gleixner wrote: No driver has any business with the internals of an interrupt descriptor. Storing a pointer to it just to use yet another helper at the actual usage site to retrieve the affinity mask is creative at best. Just because C does not allow encapsulation does not mean that the kernel has no limits. Retrieve a pointer to the affinity mask itself and use that. It's still using an interface which is usually not for random drivers, but definitely less hideous than the previous hack. Signed-off-by: Thomas Gleixner Cc: Tariq Toukan Cc: "David S. Miller" Cc: Jakub Kicinski Cc: net...@vger.kernel.org Cc: linux-r...@vger.kernel.org --- drivers/net/ethernet/mellanox/mlx4/en_cq.c |8 +++- drivers/net/ethernet/mellanox/mlx4/en_rx.c |6 +- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |3 ++- 3 files changed, 6 insertions(+), 11 deletions(-) Reviewed-by: Tariq Toukan Thanks for your patch.
Re: [PATCH 044/141] net/mlx4: Fix fall-through warnings for Clang
On 11/20/2020 8:31 PM, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a
> warning by explicitly adding a break statement instead of just letting
> the code fall through to the next case.
>
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva
> ---
>  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> index 1187ef1375e2..e6b8b8dc7894 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> @@ -2660,6 +2660,7 @@ int mlx4_FREE_RES_wrapper(struct mlx4_dev *dev, int slave,
>  	case RES_XRCD:
>  		err = xrcdn_free_res(dev, slave, vhcr->op_modifier, alop,
>  				     vhcr->in_param, &vhcr->out_param);
> +		break;
>  	default:
>  		break;

Reviewed-by: Tariq Toukan

Thanks for your patch.
Re: [PATCH] net/mlx4: Assign boolean values to a bool variable
On 11/9/2020 9:07 AM, kaixuxia wrote: On 2020/11/8 16:20, Tariq Toukan wrote: On 11/7/2020 8:53 AM, xiakaixu1...@gmail.com wrote: From: Kaixu Xia Fix the following coccinelle warnings: Hi Kaixu, Which coccinelle version gave this warning? Hi Tariq, The version is coccinelle-1.0.7. Thanks, Kaixu ./drivers/net/ethernet/mellanox/mlx4/en_rx.c:687:1-17: WARNING: Assignment of 0/1 to bool variable Reported-by: Tosk Robot Signed-off-by: Kaixu Xia Reviewed-by: Tariq Toukan Thanks, Tariq
Re: [PATCH] net/mlx4: Assign boolean values to a bool variable
On 11/7/2020 8:53 AM, xiakaixu1...@gmail.com wrote:
> From: Kaixu Xia
>
> Fix the following coccinelle warnings:

Hi Kaixu,
Which coccinelle version gave this warning?

> ./drivers/net/ethernet/mellanox/mlx4/en_rx.c:687:1-17: WARNING: Assignment of 0/1 to bool variable
>
> Reported-by: Tosk Robot
> Signed-off-by: Kaixu Xia
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 502d1b97855c..b0f79a5151cf 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -684,7 +684,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	xdp_prog = rcu_dereference(ring->xdp_prog);
>  	xdp.rxq = &ring->xdp_rxq;
>  	xdp.frame_sz = priv->frag_info[0].frag_stride;
> -	doorbell_pending = 0;
> +	doorbell_pending = false;
>
>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
Re: [PATCH] net/mlx4_core : remove unneeded semicolon
On 11/1/2020 4:05 PM, t...@redhat.com wrote:
> From: Tom Rix
>
> A semicolon is not needed after a switch statement.
>
> Signed-off-by: Tom Rix
> ---
>  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> index 1187ef1375e2..394f43add85c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> @@ -300,7 +300,7 @@ static const char *resource_str(enum mlx4_resource rt)
>  	case RES_FS_RULE:	return "RES_FS_RULE";
>  	case RES_XRCD:		return "RES_XRCD";
>  	default:		return "Unknown resource type !!!";
> -	};
> +	}
>  }
>
>  static void rem_slave_vlans(struct mlx4_dev *dev, int slave);

Reviewed-by: Tariq Toukan

Thanks,
Tariq
[RFC PATCH] workqueue: Add support for exposing singlethread workqueues in sysfs
Extend the workqueue API so that singlethread workqueues can be exposed
in sysfs. Guarantee max_active is 1 by turning it read-only.

This allows admins to control the cpumask of a workqueue, and apply the
desired system cpu separation/isolation policy.

Signed-off-by: Tariq Toukan
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  2 +-
 include/linux/workqueue.h                     |  4 ++++
 kernel/workqueue.c                            | 21 +++++++++++++++++++--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 1d91a0d0ab1d..5106820a5cd5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -2033,7 +2033,7 @@ int mlx5_cmd_init(struct mlx5_core_dev *dev)
 	create_msg_cache(dev);
 
 	set_wqname(dev);
-	cmd->wq = create_singlethread_workqueue(cmd->wq_name);
+	cmd->wq = create_singlethread_sysfs_workqueue(cmd->wq_name);
 	if (!cmd->wq) {
 		mlx5_core_err(dev, "failed to create command workqueue\n");
 		err = -ENOMEM;
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 26de0cae2a0a..d4d4ca2b041a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -344,6 +344,7 @@ enum {
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
 	__WQ_ORDERED_EXPLICIT	= 1 << 19, /* internal: alloc_ordered_workqueue() */
+	__WQ_MAX_ACTIVE_RO	= 1 << 20, /* internal: make max_active read-only */
 
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
@@ -432,6 +433,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 			      WQ_MEM_RECLAIM, 1, (name))
 #define create_singlethread_workqueue(name)				\
 	alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
+#define create_singlethread_sysfs_workqueue(name)			\
+	alloc_ordered_workqueue("%s", __WQ_MAX_ACTIVE_RO |		\
+				__WQ_LEGACY | WQ_MEM_RECLAIM, name)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c41c3c17b86a..a80d34726e68 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4258,6 +4258,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 	if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
 		flags |= WQ_UNBOUND;
 
+	if (flags & __WQ_MAX_ACTIVE_RO)
+		flags |= WQ_SYSFS;
+
 	/* allocate wq and format name */
 	if (flags & WQ_UNBOUND)
 		tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);
@@ -5401,9 +5404,11 @@ static ssize_t max_active_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(max_active);
 
+static struct device_attribute dev_attr_max_active_ro =
+	__ATTR(max_active, 0444, max_active_show, NULL);
+
 static struct attribute *wq_sysfs_attrs[] = {
 	&dev_attr_per_cpu.attr,
-	&dev_attr_max_active.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(wq_sysfs);
@@ -5642,6 +5647,7 @@ static void wq_device_release(struct device *dev)
  */
 int workqueue_sysfs_register(struct workqueue_struct *wq)
 {
+	const struct device_attribute *max_active_entry;
 	struct wq_device *wq_dev;
 	int ret;
 
@@ -5650,7 +5656,8 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
 	 * attributes breaks ordering guarantee.  Disallow exposing ordered
 	 * workqueues.
 	 */
-	if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
+	if (WARN_ON(!(wq->flags & __WQ_MAX_ACTIVE_RO) &&
+		    wq->flags & __WQ_ORDERED_EXPLICIT))
 		return -EINVAL;
 
 	wq->wq_dev = wq_dev = kzalloc(sizeof(*wq_dev), GFP_KERNEL);
@@ -5675,6 +5682,16 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
 		return ret;
 	}
 
+	max_active_entry = wq->flags & __WQ_MAX_ACTIVE_RO ?
+			   &dev_attr_max_active_ro : &dev_attr_max_active;
+
+	ret = device_create_file(&wq_dev->dev, max_active_entry);
+	if (ret) {
+		device_unregister(&wq_dev->dev);
+		wq->wq_dev = NULL;
+		return ret;
+	}
+
 	if (wq->flags & WQ_UNBOUND) {
 		struct device_attribute *attr;
-- 
2.21.0
Re: [PATCH net-next] net: ethernet: mlx4: Avoid assigning a value to ring_cons but not used it anymore in mlx4_en_xmit()
On 9/14/2020 11:02 PM, Jakub Kicinski wrote: On Sun, 13 Sep 2020 13:12:05 +0300 Tariq Toukan wrote: 2. When MLX4_EN_PERF_STAT is not defined, we should totally remove the local variable declaration, not only its usage. I was actually wondering about this when working on the pause stat patch. Where is MLX4_EN_PERF_STAT ever defined? $ git grep MLX4_EN_PERF_STAT drivers/net/ethernet/mellanox/mlx4/mlx4_en.h:#ifdef MLX4_EN_PERF_STAT drivers/net/ethernet/mellanox/mlx4/mlx4_en.h:#endif /* MLX4_EN_PERF_STAT */ drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h:#ifdef MLX4_EN_PERF_STAT Good point. This was introduced long ago, since day 1 of mlx4 driver. I believe it had off-tree usage back then, not sure though... Anyway, I don't find it useful anymore. Should be removed. I'll prepare a cleanup patch for net-next. Thanks, Tariq
Re: [PATCH net-next] net: ethernet: mlx4: Avoid assigning a value to ring_cons but not used it anymore in mlx4_en_xmit()
On 9/13/2020 4:22 AM, David Miller wrote: From: Luo Jiaxing Date: Sat, 12 Sep 2020 16:08:15 +0800 We found a set but not used variable 'ring_cons' in mlx4_en_xmit(), it will cause a warning when build the kernel. And after checking the commit record of this function, we found that it was introduced by a previous patch. So, We delete this redundant assignment code. Fixes: 488a9b48e398 ("net/mlx4_en: Wake TX queues only when there's enough room") Signed-off-by: Luo Jiaxing Looks good, applied, thanks. Hi Luo, I didn't get a chance to review it during the weekend. The ring_cons local variable is used in line 903: https://elixir.bootlin.com/linux/v5.9-rc4/source/drivers/net/ethernet/mellanox/mlx4/en_tx.c#L903 AVG_PERF_COUNTER depends on the compile-time definition of MLX4_EN_PERF_STAT. Otherwise it is a nop. 1. Your patch causes a degradation to the case when MLX4_EN_PERF_STAT is defined. 2. When MLX4_EN_PERF_STAT is not defined, we should totally remove the local variable declaration, not only its usage. Please let me know if you're planning to fix this. Otherwise I'll do. Regards, Tariq
Re: [PATCH][next] net/mlx5e: fix memory leak of tls
On 6/30/2020 6:16 PM, Colin King wrote: From: Colin Ian King The error return path when create_singlethread_workqueue fails currently does not kfree tls and leads to a memory leak. Fix this by kfree'ing tls before returning -ENOMEM. Addresses-Coverity: ("Resource leak") Fixes: 1182f3659357 ("net/mlx5e: kTLS, Add kTLS RX HW offload support") Signed-off-by: Colin Ian King --- drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c index 99beb928feff..fee991f5ee7c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c @@ -232,8 +232,10 @@ int mlx5e_tls_init(struct mlx5e_priv *priv) return -ENOMEM; tls->rx_wq = create_singlethread_workqueue("mlx5e_tls_rx"); - if (!tls->rx_wq) + if (!tls->rx_wq) { + kfree(tls); return -ENOMEM; + } priv->tls = tls; return 0; Reviewed-by: Tariq Toukan Thanks.
Re: [PATCH][next] net/tls: fix sign extension issue when left shifting u16 value
On 6/30/2020 5:27 PM, Colin King wrote:
> From: Colin Ian King
>
> Left shifting the u16 value promotes it to a int and then it gets
> sign extended to a u64. If len << 16 is greater than 0x7fffffff then
> the upper bits get set to 1 because of the implicit sign extension.
> Fix this by casting len to u64 before shifting it.
>
> Addresses-Coverity: ("integer handling issues")
> Fixes: ed9b7646b06a ("net/tls: Add asynchronous resync")
> Signed-off-by: Colin Ian King
> ---
>  include/net/tls.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/tls.h b/include/net/tls.h
> index c875c0a445a6..e5dac7e74e79 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -637,7 +637,7 @@ tls_offload_rx_resync_async_request_start(struct sock *sk, __be32 seq, u16 len)
>  	struct tls_offload_context_rx *rx_ctx = tls_offload_ctx_rx(tls_ctx);
>
>  	atomic64_set(&rx_ctx->resync_async->req, ((u64)ntohl(seq) << 32) |
> -		     (len << 16) | RESYNC_REQ | RESYNC_REQ_ASYNC);
> +		     ((u64)len << 16) | RESYNC_REQ | RESYNC_REQ_ASYNC);
>  	rx_ctx->resync_async->loglen = 0;
>  }

Reviewed-by: Tariq Toukan

Thanks!
Re: [PATCH v2] net/mlx4_en: fix a memory leak bug
On 8/12/2019 10:11 PM, Wenwen Wang wrote: > In mlx4_en_config_rss_steer(), 'rss_map->indir_qp' is allocated through > kzalloc(). After that, mlx4_qp_alloc() is invoked to configure RSS > indirection. However, if mlx4_qp_alloc() fails, the allocated > 'rss_map->indir_qp' is not deallocated, leading to a memory leak bug. > > To fix the above issue, add the 'qp_alloc_err' label to free > 'rss_map->indir_qp'. > > Fixes: 4931c6ef04b4 ("net/mlx4_en: Optimized single ring steering") > > Signed-off-by: Wenwen Wang > --- > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index 6c01314..db3552f 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -1187,7 +1187,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv) > err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, rss_map->indir_qp); > if (err) { > en_err(priv, "Failed to allocate RSS indirection QP\n"); > - goto rss_err; > + goto qp_alloc_err; > } > > rss_map->indir_qp->event = mlx4_en_sqp_event; > @@ -1241,6 +1241,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv) > MLX4_QP_STATE_RST, NULL, 0, 0, rss_map->indir_qp); > mlx4_qp_remove(mdev->dev, rss_map->indir_qp); > mlx4_qp_free(mdev->dev, rss_map->indir_qp); > +qp_alloc_err: > kfree(rss_map->indir_qp); > rss_map->indir_qp = NULL; > rss_err: > Thanks for your patch. Reviewed-by: Tariq Toukan
Re: [PATCH] net/mlx4_en: fix a memory leak bug
Hi Wenwen, Thanks for your patch. On 8/12/2019 9:36 AM, Wenwen Wang wrote: > In mlx4_en_config_rss_steer(), 'rss_map->indir_qp' is allocated through > kzalloc(). After that, mlx4_qp_alloc() is invoked to configure RSS > indirection. However, if mlx4_qp_alloc() fails, the allocated > 'rss_map->indir_qp' is not deallocated, leading to a memory leak bug. > > To fix the above issue, add the 'mlx4_err' label to free > 'rss_map->indir_qp'. > Add a Fixes line. > Signed-off-by: Wenwen Wang > --- > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index 6c01314..9476dbd 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -1187,7 +1187,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv) > err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, rss_map->indir_qp); > if (err) { > en_err(priv, "Failed to allocate RSS indirection QP\n"); > - goto rss_err; > + goto mlx4_err; > } > > rss_map->indir_qp->event = mlx4_en_sqp_event; > @@ -1241,6 +1241,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv) > MLX4_QP_STATE_RST, NULL, 0, 0, rss_map->indir_qp); > mlx4_qp_remove(mdev->dev, rss_map->indir_qp); > mlx4_qp_free(mdev->dev, rss_map->indir_qp); > +mlx4_err: I don't like the label name. It's too general and not informative. Maybe qp_alloc_err? > kfree(rss_map->indir_qp); > rss_map->indir_qp = NULL; > rss_err: >
Re: [PATCH v2] net/mlx5e: always initialize frag->last_in_page
On 8/1/2019 4:52 PM, Qian Cai wrote:
> The commit 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue
> memory scheme") introduced an undefined behaviour below due to
> "frag->last_in_page" is only initialized in mlx5e_init_frags_partition()
> when,
>
> if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE)
>
> or after bailed out the loop,
>
> for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++)
>
> As the result, there could be some "frag" have uninitialized
> value of "last_in_page".
>
> Later, get_frag() obtains those "frag" and check "frag->last_in_page" in
> mlx5e_put_rx_frag() and triggers the error during boot. Fix it by always
> initializing "frag->last_in_page" to "false" in
> mlx5e_init_frags_partition().
>
> UBSAN: Undefined behaviour in
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:325:12
> load of value 170 is not a valid value for type 'bool' (aka '_Bool')
> Call trace:
>  dump_backtrace+0x0/0x264
>  show_stack+0x20/0x2c
>  dump_stack+0xb0/0x104
>  __ubsan_handle_load_invalid_value+0x104/0x128
>  mlx5e_handle_rx_cqe+0x8e8/0x12cc [mlx5_core]
>  mlx5e_poll_rx_cq+0xca8/0x1a94 [mlx5_core]
>  mlx5e_napi_poll+0x17c/0xa30 [mlx5_core]
>  net_rx_action+0x248/0x940
>  __do_softirq+0x350/0x7b8
>  irq_exit+0x200/0x26c
>  __handle_domain_irq+0xc8/0x128
>  gic_handle_irq+0x138/0x228
>  el1_irq+0xb8/0x140
>  arch_cpu_idle+0x1a4/0x348
>  do_idle+0x114/0x1b0
>  cpu_startup_entry+0x24/0x28
>  rest_init+0x1ac/0x1dc
>  arch_call_rest_init+0x10/0x18
>  start_kernel+0x4d4/0x57c
>
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> Signed-off-by: Qian Cai
> ---
>
> v2: zero-init the whole struct instead per Tariq.
>
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 47eea6b3a1c3..e1810c03a510 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -331,12 +331,11 @@ static inline u64 mlx5e_get_mpwqe_offset(struct mlx5e_rq *rq, u16 wqe_ix)
>
>  static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
>  {
> -	struct mlx5e_wqe_frag_info next_frag, *prev;
> +	struct mlx5e_wqe_frag_info next_frag = {};
> +	struct mlx5e_wqe_frag_info *prev = NULL;
>  	int i;
>
>  	next_frag.di = &rq->wqe.di[0];
> -	next_frag.offset = 0;
> -	prev = NULL;
>
>  	for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {
>  		struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];

Reviewed-by: Tariq Toukan

Thanks.
Re: [PATCH] net/mlx5e: always initialize frag->last_in_page
On 7/31/2019 10:02 PM, Qian Cai wrote: > The commit 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue > memory scheme") introduced an undefined behaviour below due to > "frag->last_in_page" is only initialized in > mlx5e_init_frags_partition() when, > > if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE) > > or after bailed out the loop, > > for (i = 0; i < mlx5_wq_cyc_get_size(>wqe.wq); i++) > > As the result, there could be some "frag" have uninitialized > value of "last_in_page". > > Later, get_frag() obtains those "frag" and check "rag->last_in_page" in > mlx5e_put_rx_frag() and triggers the error during boot. Fix it by always > initializing "frag->last_in_page" to "false" in > mlx5e_init_frags_partition(). > > UBSAN: Undefined behaviour in > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:325:12 > load of value 170 is not a valid value for type 'bool' (aka '_Bool') > Call trace: > dump_backtrace+0x0/0x264 > show_stack+0x20/0x2c > dump_stack+0xb0/0x104 > __ubsan_handle_load_invalid_value+0x104/0x128 > mlx5e_handle_rx_cqe+0x8e8/0x12cc [mlx5_core] > mlx5e_poll_rx_cq+0xca8/0x1a94 [mlx5_core] > mlx5e_napi_poll+0x17c/0xa30 [mlx5_core] > net_rx_action+0x248/0x940 > __do_softirq+0x350/0x7b8 > irq_exit+0x200/0x26c > __handle_domain_irq+0xc8/0x128 > gic_handle_irq+0x138/0x228 > el1_irq+0xb8/0x140 > arch_cpu_idle+0x1a4/0x348 > do_idle+0x114/0x1b0 > cpu_startup_entry+0x24/0x28 > rest_init+0x1ac/0x1dc > arch_call_rest_init+0x10/0x18 > start_kernel+0x4d4/0x57c > > Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory > scheme") > Signed-off-by: Qian Cai > --- > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > index 47eea6b3a1c3..96f5110a9b43 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > @@ -336,6 
+336,7 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq > *rq) > > next_frag.di = >wqe.di[0]; > next_frag.offset = 0; > + next_frag.last_in_page = false; > prev = NULL; > > for (i = 0; i < mlx5_wq_cyc_get_size(>wqe.wq); i++) { > Thanks Qian. Please zero-init the whole struct instead: diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 1f433a06e637..55f4f5cc1d8f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -312,11 +312,10 @@ static inline u64 mlx5e_get_mpwqe_offset(struct mlx5e_rq *rq, u16 wqe_ix) static void mlx5e_init_frags_partition(struct mlx5e_rq *rq) { - struct mlx5e_wqe_frag_info next_frag, *prev; + struct mlx5e_wqe_frag_info next_frag = {}, *prev; int i; next_frag.di = >wqe.di[0]; - next_frag.offset = 0; prev = NULL; for (i = 0; i < mlx5_wq_cyc_get_size(>wqe.wq); i++) { Thanks, Tariq
Re: linux-next: Fixes tag needs some work in the net tree
On 7/14/2019 3:15 PM, Stephen Rothwell wrote: > Hi Tariq, > > On Sun, 14 Jul 2019 07:55:48 +0000 Tariq Toukan wrote: >> >> How do you think we should handle this? > > Dave doesn't rebase his trees, so all you can really do is learn from > it and not do it again :-) > Sure. My bad, used the SHA1 I had on my internal branch. Thanks, Tariq
Re: linux-next: Fixes tag needs some work in the net tree
On 7/12/2019 9:50 AM, Stephen Rothwell wrote: > Hi all, > > In commit > >c93dfec10f1d ("net/mlx5e: Fix compilation error in TLS code") > > Fixes tag > >Fixes: 90687e1a9a50 ("net/mlx5: Kconfig, Better organize compilation > flags") > > has these problem(s): > >- Target SHA1 does not exist > > Did you mean > > Fixes: e2869fb2068b ("net/mlx5: Kconfig, Better organize compilation flags") > Right. Thank you Stephen! How do you think we should handle this? Tariq
Re: [PATCH] [net-next] net/mlx5e: avoid uninitialized variable use
On 7/10/2019 4:06 PM, Arnd Bergmann wrote: > clang points to a variable being used in an unexpected > code path: > > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:251:2: warning: > variable 'rec_seq_sz' is used uninitialized whenever switch default is taken > [-Wsometimes-uninitialized] > default: > ^~~ > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:255:46: note: > uninitialized use occurs here > skip_static_post = !memcmp(rec_seq, _be, rec_seq_sz); > ^~ > > From looking at the function logic, it seems that there is no > sensible way to continue here, so just return early and hope > for the best. > > Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support") > Signed-off-by: Arnd Bergmann > --- > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c > b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c > index 3f5f4317a22b..5c08891806f0 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c > @@ -250,6 +250,7 @@ tx_post_resync_params(struct mlx5e_txqsq *sq, > } > default: > WARN_ON(1); > + return; > } > > skip_static_post = !memcmp(rec_seq, _be, rec_seq_sz); > Reviewed-by: Tariq Toukan Thanks!
Re: [PATCH v3 15/27] net: use zeroing allocator rather than allocator followed by memset zero
On 7/2/2019 10:58 AM, Fuqian Huang wrote: > Replace allocator followed by memset with 0 with zeroing allocator. > > Signed-off-by: Fuqian Huang > --- > Changes in v3: >- Resend > > drivers/net/eql.c | 3 +-- > drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c | 4 +--- > drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c | 4 +--- > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 +-- > 4 files changed, 4 insertions(+), 10 deletions(-) > > diff --git a/drivers/net/eql.c b/drivers/net/eql.c > index 74263f8efe1a..2f101a6036e6 100644 > --- a/drivers/net/eql.c > +++ b/drivers/net/eql.c > @@ -419,14 +419,13 @@ static int eql_enslave(struct net_device *master_dev, > slaving_request_t __user * > if ((master_dev->flags & IFF_UP) == IFF_UP) { > /* slave is not a master & not already a slave: */ > if (!eql_is_master(slave_dev) && !eql_is_slave(slave_dev)) { > - slave_t *s = kmalloc(sizeof(*s), GFP_KERNEL); > + slave_t *s = kzalloc(sizeof(*s), GFP_KERNEL); > equalizer_t *eql = netdev_priv(master_dev); > int ret; > > if (!s) > return -ENOMEM; > > - memset(s, 0, sizeof(*s)); > s->dev = slave_dev; > s->priority = srq.priority; > s->priority_bps = srq.priority; > diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c > b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c > index 43d11c38b38a..cf3835da32c8 100644 > --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c > +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c > @@ -719,12 +719,10 @@ static int cn23xx_setup_pf_mbox(struct octeon_device > *oct) > for (i = 0; i < oct->sriov_info.max_vfs; i++) { > q_no = i * oct->sriov_info.rings_per_vf; > > - mbox = vmalloc(sizeof(*mbox)); > + mbox = vzalloc(sizeof(*mbox)); > if (!mbox) > goto free_mbox; > > - memset(mbox, 0, sizeof(struct octeon_mbox)); > - > spin_lock_init(>lock); > > mbox->oct_dev = oct; > diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c > b/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c > index 
fda49404968c..b3bd2767d3dd 100644 > --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c > +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c > @@ -279,12 +279,10 @@ static int cn23xx_setup_vf_mbox(struct octeon_device > *oct) > { > struct octeon_mbox *mbox = NULL; > > - mbox = vmalloc(sizeof(*mbox)); > + mbox = vzalloc(sizeof(*mbox)); > if (!mbox) > return 1; > > - memset(mbox, 0, sizeof(struct octeon_mbox)); > - > spin_lock_init(>lock); > > mbox->oct_dev = oct; > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index 6c01314e87b0..f1dff5c47676 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -1062,7 +1062,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv > *priv, int qpn, > struct mlx4_qp_context *context; > int err = 0; > > - context = kmalloc(sizeof(*context), GFP_KERNEL); > + context = kzalloc(sizeof(*context), GFP_KERNEL); > if (!context) > return -ENOMEM; > > @@ -1073,7 +1073,6 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv > *priv, int qpn, > } > qp->event = mlx4_en_sqp_event; > > - memset(context, 0, sizeof(*context)); > mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0, > qpn, ring->cqn, -1, context); > context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma); > Reviewed-by: Tariq Toukan
Re: [PATCH v2 15/27] net: use zeroing allocator rather than allocator followed by memset zero
On 6/28/2019 5:48 AM, Fuqian Huang wrote: > Replace allocator followed by memset with 0 with zeroing allocator. > > Signed-off-by: Fuqian Huang > --- .. > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -1062,7 +1062,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv > *priv, int qpn, > struct mlx4_qp_context *context; > int err = 0; > > - context = kmalloc(sizeof(*context), GFP_KERNEL); > + context = kzalloc(sizeof(*context), GFP_KERNEL); > if (!context) > return -ENOMEM; > > @@ -1073,7 +1073,6 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv > *priv, int qpn, > } > qp->event = mlx4_en_sqp_event; > > - memset(context, 0, sizeof(*context)); > mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0, > qpn, ring->cqn, -1, context); > context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma); > For the mlx4 part: Reviewed-by: Tariq Toukan Tariq
Re: [PATCH 87/87] ethernet: mlx4: remove memset after dma_alloc_coherent
On 6/27/2019 8:42 PM, Fuqian Huang wrote: > In commit af7ddd8a627c > ("Merge tag 'dma-mapping-4.21' of > git://git.infradead.org/users/hch/dma-mapping"), > dma_alloc_coherent has already zeroed the memory. > So memset is not needed. > > Signed-off-by: Fuqian Huang > --- > drivers/net/ethernet/mellanox/mlx4/eq.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c > b/drivers/net/ethernet/mellanox/mlx4/eq.c > index a5be27772b8e..c790a5fcea73 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/eq.c > +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c > @@ -1013,8 +1013,6 @@ static int mlx4_create_eq(struct mlx4_dev *dev, int > nent, > > dma_list[i] = t; > eq->page_list[i].map = t; > - > - memset(eq->page_list[i].buf, 0, PAGE_SIZE); > } > > eq->eqn = mlx4_bitmap_alloc(>eq_table.bitmap); > Reviewed-by: Tariq Toukan Thanks.
Re: [PATCH 1/8] net/mlx4: use kzalloc instead of kmalloc
On 2/27/2019 8:09 AM, Robert Eshleman wrote: > This patch replaces a kmalloc/memset(,0) call with a call to kzalloc. > It also removes a memset(,0) call that always follows a *zalloc call. > > Signed-off-by: Robert Eshleman > --- > drivers/net/ethernet/mellanox/mlx4/cmd.c | 1 - > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 +-- > 2 files changed, 1 insertion(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c > b/drivers/net/ethernet/mellanox/mlx4/cmd.c > index e65bc3c95630..7bfa6e850e5f 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c > +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c > @@ -3307,7 +3307,6 @@ int mlx4_get_counter_stats(struct mlx4_dev *dev, int > counter_index, > if (IS_ERR(mailbox)) > return PTR_ERR(mailbox); > > - memset(mailbox->buf, 0, sizeof(struct mlx4_counter)); > if_stat_in_mod = counter_index; > if (reset) > if_stat_in_mod |= MLX4_QUERY_IF_STAT_RESET; > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index 9a0881cb7f51..f55805d206ef 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -1044,7 +1044,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv > *priv, int qpn, > struct mlx4_qp_context *context; > int err = 0; > > - context = kmalloc(sizeof(*context), GFP_KERNEL); > + context = kzalloc(sizeof(*context), GFP_KERNEL); > if (!context) > return -ENOMEM; > > @@ -1055,7 +1055,6 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv > *priv, int qpn, > } > qp->event = mlx4_en_sqp_event; > > - memset(context, 0, sizeof(*context)); > mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0, > qpn, ring->cqn, -1, context); > context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma); > Hi, Thanks for your patch. It looks good, but misses one similar point you might want to cover: in drivers/net/ethernet/mellanox/mlx4/port.c:1780 : memset(context, 0, sizeof(*context)); Tariq
Re: [PATCH] net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
On 2/18/2019 10:08 PM, David Miller wrote: > From: Colin King > Date: Sun, 17 Feb 2019 23:03:31 + > >> From: Colin Ian King >> >> There is a spelling mistake in a en_err error message. Fix it. >> >> Signed-off-by: Colin Ian King > > Applied, thanks Colin. > > And I agree that this doesn't really deserve a Fixes: tag. > > Fixes: tags should really be for changes that introduce truly > functional bugs. > > And that could even be applied in this case _iff_ the string > was essential in some way for userland tools which parse the > output or similar. But that is not the case here. > Thanks for the clarification. > Anyways, thanks. >
Re: [PATCH] net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
On 2/18/2019 12:25 PM, Dan Carpenter wrote: > On Mon, Feb 18, 2019 at 09:37:22AM +0000, Tariq Toukan wrote: >> >> >> On 2/18/2019 1:03 AM, Colin King wrote: >>> From: Colin Ian King >>> >>> There is a spelling mistake in a en_err error message. Fix it. >>> >>> Signed-off-by: Colin Ian King >>> --- >>>drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +- >>>1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c >>> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c >>> index 6b1b8e35..c1438ae52a11 100644 >>> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c >>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c >>> @@ -3360,7 +3360,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int >>> port, >>> dev->addr_len = ETH_ALEN; >>> mlx4_en_u64_to_mac(dev->dev_addr, mdev->dev->caps.def_mac[priv->port]); >>> if (!is_valid_ether_addr(dev->dev_addr)) { >>> - en_err(priv, "Port: %d, invalid mac burned: %pM, quiting\n", >>> + en_err(priv, "Port: %d, invalid mac burned: %pM, quitting\n", >>>priv->port, dev->dev_addr); >>> err = -EINVAL; >>> goto out; >>> >> >> Hi Colin, thanks for your patch. >> >> Reviewed-by: Tariq Toukan >> >> I would suggest adding a Fixes line, but looking into the history of the >> typo, it went through many patches that modified this line but preserved >> the typo. >> Actually, it dates back to the very first commit that introduces mlx4 >> driver: >> >> Patches history: >> 2b3ddf27f48c net/mlx4_core: Replace VF zero mac with random mac in mlx4_core >> ef96f7d46ad8 net/mlx4_en: Handle unassigned VF MAC address correctly >> 6bbb6d99f3d2 net/mlx4_en: Optimize Rx fast path filter checks >> 453a60827735 mlx4_en: Giving interface name in debug messages >> c27a02cd94d6 mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC >> >> I'm not sure what the "Fixes:" policy is in these cases. > > I wouldn't necessarily put a Fixes tag on this, because does fixing the > spelling really count as a bugfix? 
It's borderline whether it's a fix > or a cleanup. > > regards, > Dan Carpenter > Thanks Dan, I'm fine with that.
Re: [PATCH] net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
On 2/18/2019 1:03 AM, Colin King wrote: > From: Colin Ian King > > There is a spelling mistake in a en_err error message. Fix it. > > Signed-off-by: Colin Ian King > --- > drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > index 6b1b8e35..c1438ae52a11 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > @@ -3360,7 +3360,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int > port, > dev->addr_len = ETH_ALEN; > mlx4_en_u64_to_mac(dev->dev_addr, mdev->dev->caps.def_mac[priv->port]); > if (!is_valid_ether_addr(dev->dev_addr)) { > - en_err(priv, "Port: %d, invalid mac burned: %pM, quiting\n", > + en_err(priv, "Port: %d, invalid mac burned: %pM, quitting\n", > priv->port, dev->dev_addr); > err = -EINVAL; > goto out; > Hi Colin, thanks for your patch. Reviewed-by: Tariq Toukan I would suggest adding a Fixes line, but looking into the history of the typo, it went through many patches that modified this line but preserved the typo. Actually, it dates back to the very first commit that introduces mlx4 driver: Patches history: 2b3ddf27f48c net/mlx4_core: Replace VF zero mac with random mac in mlx4_core ef96f7d46ad8 net/mlx4_en: Handle unassigned VF MAC address correctly 6bbb6d99f3d2 net/mlx4_en: Optimize Rx fast path filter checks 453a60827735 mlx4_en: Giving interface name in debug messages c27a02cd94d6 mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC I'm not sure what the "Fixes:" policy is in these cases. Thanks, Tariq
Re: [PATCH net-next] net/mlx4: Mark expected switch fall-through
On 1/23/2019 10:05 AM, Gustavo A. R. Silva wrote: > In preparation to enabling -Wimplicit-fallthrough, mark switch cases > where we are expecting to fall through. > > This patch fixes the following warning: > > drivers/net/ethernet/mellanox/mlx4/eq.c: In function ‘mlx4_eq_int’: > drivers/net/ethernet/mellanox/mlx4/mlx4.h:219:5: warning: this statement may > fall through [-Wimplicit-fallthrough=] >if (mlx4_debug_level) \ > ^ > drivers/net/ethernet/mellanox/mlx4/eq.c:558:4: note: in expansion of macro > ‘mlx4_dbg’ > mlx4_dbg(dev, "%s: MLX4_EVENT_TYPE_SRQ_LIMIT. srq_no=0x%x, eq 0x%x\n", > ^~~~ > drivers/net/ethernet/mellanox/mlx4/eq.c:561:3: note: here > case MLX4_EVENT_TYPE_SRQ_CATAS_ERROR: > ^~~~ > > Warning level 3 was used: -Wimplicit-fallthrough=3 > > This patch is part of the ongoing efforts to enabling > -Wimplicit-fallthrough. > > Signed-off-by: Gustavo A. R. Silva > --- > drivers/net/ethernet/mellanox/mlx4/eq.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c > b/drivers/net/ethernet/mellanox/mlx4/eq.c > index 4953c852c247..2f4201023836 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/eq.c > +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c > @@ -558,6 +558,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct > mlx4_eq *eq) > mlx4_dbg(dev, "%s: MLX4_EVENT_TYPE_SRQ_LIMIT. > srq_no=0x%x, eq 0x%x\n", >__func__, be32_to_cpu(eqe->event.srq.srqn), >eq->eqn); > + /* fall through */ > case MLX4_EVENT_TYPE_SRQ_CATAS_ERROR: > if (mlx4_is_master(dev)) { > /* forward only to slave owning the SRQ */ > Reviewed-by: Tariq Toukan Thanks for your patch.
Re: [PATCH v2 RESEND 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
On 19/11/2018 3:48 PM, Aaron Lu wrote: > page_frag_free() calls __free_pages_ok() to free the page back to > Buddy. This is OK for high order page, but for order-0 pages, it > misses the optimization opportunity of using Per-Cpu-Pages and can > cause zone lock contention when called frequently. > > Paweł Staszewski recently shared his result of 'how Linux kernel > handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer > found the lock contention comes from page allocator: > >mlx5e_poll_tx_cq >| > --16.34%--napi_consume_skb > | > |--12.65%--__free_pages_ok > | | > | --11.86%--free_one_page > | | > | |--10.10%--queued_spin_lock_slowpath > | | > | --0.65%--_raw_spin_lock > | > |--1.55%--page_frag_free > | >--1.44%--skb_release_data > > Jesper explained how it happened: mlx5 driver RX-page recycle > mechanism is not effective in this workload and pages have to go > through the page allocator. The lock contention happens during > mlx5 DMA TX completion cycle. And the page allocator cannot keep > up at these speeds.[2] > > I thought that __free_pages_ok() are mostly freeing high order > pages and thought this is an lock contention for high order pages > but Jesper explained in detail that __free_pages_ok() here are > actually freeing order-0 pages because mlx5 is using order-0 pages > to satisfy its page pool allocation request.[3] > > The free path as pointed out by Jesper is: > skb_free_head() >-> skb_free_frag() > -> page_frag_free() > And the pages being freed on this path are order-0 pages. > > Fix this by doing similar things as in __page_frag_cache_drain() - > send the being freed page to PCP if it's an order-0 page, or > directly to Buddy if it is a high order page. > > With this change, Paweł hasn't noticed lock contention yet in > his workload and Jesper has noticed a 7% performance improvement > using a micro benchmark and lock contention is gone. 
Ilias' test > on a 'low' speed 1Gbit interface on an cortex-a53 shows ~11% > performance boost testing with 64byte packets and __free_pages_ok() > disappeared from perf top. > > [1]: https://www.spinics.net/lists/netdev/msg531362.html > [2]: https://www.spinics.net/lists/netdev/msg531421.html > [3]: https://www.spinics.net/lists/netdev/msg531556.html > > Reported-by: Paweł Staszewski > Analysed-by: Jesper Dangaard Brouer > Acked-by: Vlastimil Babka > Acked-by: Mel Gorman > Acked-by: Jesper Dangaard Brouer > Acked-by: Ilias Apalodimas > Tested-by: Ilias Apalodimas > Acked-by: Alexander Duyck > Acked-by: Tariq Toukan ' sign in my email tag. > Signed-off-by: Aaron Lu > ---
Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
On 12/07/2018 4:55 PM, Jesper Dangaard Brouer wrote: On Thu, 12 Jul 2018 14:54:08 +0200 Michal Hocko wrote: [CC Jesper - I remember he was really concerned about the worst case latencies for highspeed network workloads.] Cc. Tariq as he have hit some networking benchmarks (around 100Gbit/s), where we are contenting on the page allocator lock, in a CPU scaling netperf test AFAIK. I also have some special-case micro-benchmarks where I can hit it, but it a micro-bench... Thanks! Looks good. Indeed, I simulated the page allocation rate of a 200Gbps NIC, and hit a major PCP/buddy bottleneck, where spinning the zonelock took up to 80% CPU, with dramatic BW degradation. Test ran relatively small number of TCP streams (4-16) with unpinned application (iperf). Larger batching reduces the contention on the zone lock and improves the CPU util. I also considered increasing the percpu_pagelist_fraction to a larger value (thought of 512, see patch below), which also affects the batch size (in pageset_set_high_and_batch). As far as I see it, to totally solve the page allocation bottleneck for the increasing networking speeds, the following is still required: 1) optimize order-0 allocations (even on the cost of higher-order allocations). 2) bulking API for page allocations. 3) do SKB remote-release (on the originating core). Regards, Tariq diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 697ef8c225df..88763bd716a5 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -741,9 +741,9 @@ of hot per cpu pagelists. User can specify a number like 100 to allocate The batch value of each per cpu pagelist is also updated as a result. It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) -The initial value is zero. Kernel does not use this value at boot time to set +The initial value is 512. Kernel uses this value at boot time to set the high water marks for each per cpu page list. 
If the user writes '0' to this -sysctl, it will revert to this default behavior. +sysctl, it will revert to a behavior based on batchsize calculation. == diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1521100f1e63..c88e8eb50bcb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -129,7 +129,7 @@ unsigned long totalreserve_pages __read_mostly; unsigned long totalcma_pages __read_mostly; -int percpu_pagelist_fraction; +int percpu_pagelist_fraction = 512; gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; /*
Re: WARNING and PANIC in irq_matrix_free
On 28/05/2018 1:53 PM, Thomas Gleixner wrote: On Fri, 25 May 2018, Song Liu wrote: On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <t...@linutronix.de> wrote: On Wed, 23 May 2018, Tariq Toukan wrote: I have your patch merged into my internal branch, it prints the following: [ 4898.226258] Trying to clear prev_vector: 0 [ 4898.226439] Trying to clear prev_vector: 0 i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR. Could you please enable the vector and irq matrix trace points and capture the trace when this happens? Does the patch below fix it? Thanks, tglx 8<--- diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index bb6f7a2148d7..54af3d4884b1 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec, * prev_vector for this and the offlined target case. */ apicd->prev_vector = 0; + apicd->move_in_progress = false; if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR) goto setnew; /* I took it into my internal branch. Will let you know. Tariq
Re: [PATCH 4.16 008/161] packet: in packet_snd start writing at link layer allocation
On 24/05/2018 12:37 PM, Greg Kroah-Hartman wrote: 4.16-stable review patch. If anyone has any objections, please let me know. Please hold on with this. I think it causes some degradation. I'll report the issue on the original patch. Thanks, Tariq -- From: Willem de Bruijn[ Upstream commit b84bbaf7a6c8cca24f8acf25a2c8e46913a947ba ] Packet sockets allow construction of packets shorter than dev->hard_header_len to accommodate protocols with variable length link layer headers. These packets are padded to dev->hard_header_len, because some device drivers interpret that as a minimum packet size. packet_snd reserves dev->hard_header_len bytes on allocation. SOCK_DGRAM sockets call skb_push in dev_hard_header() to ensure that link layer headers are stored in the reserved range. SOCK_RAW sockets do the same in tpacket_snd, but not in packet_snd. Syzbot was able to send a zero byte packet to a device with massive 116B link layer header, causing padding to cross over into skb_shinfo. Fix this by writing from the start of the llheader reserved range also in the case of packet_snd/SOCK_RAW. Update skb_set_network_header to the new offset. This also corrects it for SOCK_DGRAM, where it incorrectly double counted reserve due to the skb_push in dev_hard_header. Fixes: 9ed988cd5915 ("packet: validate variable length ll headers") Reported-by: syzbot+71d74a5406d02057d...@syzkaller.appspotmail.com Signed-off-by: Willem de Bruijn Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman --- net/packet/af_packet.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2903,13 +2903,15 @@ static int packet_snd(struct socket *soc if (skb == NULL) goto out_unlock; - skb_set_network_header(skb, reserve); + skb_reset_network_header(skb); err = -EINVAL; if (sock->type == SOCK_DGRAM) { offset = dev_hard_header(skb, dev, ntohs(proto), addr, NULL, len); if (unlikely(offset < 0)) goto out_free; + } else if (reserve) { + skb_push(skb, reserve); } /* Returns -EFAULT on error */
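The headroom arithmetic the fix relies on can be checked with a toy userspace model. The `fake_skb` struct and `fskb_*` helpers below are invented for illustration and only mimic the offset bookkeeping of `skb_reserve`/`skb_push`/padding, not real `sk_buff` behavior: `packet_snd` reserves `hard_header_len` bytes of headroom, and the SOCK_RAW path must push the data pointer back into that reserve before the user's link-layer header is copied in; otherwise padding a short packet out to `hard_header_len` runs past the data area into what would be `skb_shinfo`.

```c
#include <assert.h>
#include <stddef.h>

/* Toy offset bookkeeping in the spirit of sk_buff headroom. */
struct fake_skb {
	size_t head;	/* start of buffer (always 0 here) */
	size_t data;	/* current start of packet data */
	size_t tail;	/* current end of packet data */
	size_t end;	/* end of data area; "shinfo" lives after this */
};

/* Like skb_reserve(): carve out headroom before any data is written. */
static void fskb_reserve(struct fake_skb *s, size_t n)
{
	s->data += n;
	s->tail += n;
}

/* Like skb_push(): move the data pointer back into the headroom. */
static size_t fskb_push(struct fake_skb *s, size_t n)
{
	s->data -= n;
	return s->data;
}

/* Pad the packet out to min_len measured from the data pointer. */
static void fskb_put_padto(struct fake_skb *s, size_t min_len)
{
	if (s->tail - s->data < min_len)
		s->tail = s->data + min_len;
}
```

With the 116-byte header from the syzbot report and a zero-byte payload, the buggy path (no push) pads past `end`, while pushing first keeps the packet inside the data area.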
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On 24/05/2018 2:22 AM, Qing Huang wrote: When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However in order to support smaller ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use kvzalloc to replace kcalloc which will fall back to vmalloc automatically if kmalloc fails. 
Signed-off-by: Qing Huang <qing.hu...@oracle.com> Acked-by: Daniel Jurgens <dani...@mellanox.com> Reviewed-by: Zhu Yanjun <yanjun@oracle.com> --- v4: use kvzalloc instead of vzalloc add one err condition check don't include vmalloc.h any more v3: use PAGE_SIZE instead of PAGE_SHIFT add comma to the end of enum variables include vmalloc.h header file to avoid build issues on Sparc v2: adjusted chunk size to reflect different architectures drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..685337d 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in page size (default 4KB on many archs) chunks to avoid high + * order memory allocations in fragmented/high usage memory situation. 
*/ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = PAGE_SIZE, + MLX4_TABLE_CHUNK_SIZE = PAGE_SIZE, }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, u64 size; obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; + if (WARN_ON(!obj_per_chunk)) + return -EINVAL; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL); if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + kvfree(table->icm); return -ENOMEM; } @@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + kvfree(table->icm); } Thanks Qing. Reviewed-by: Tariq Toukan <tar...@mellanox.com>
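The sizing argument in the commit message is easy to verify with the driver's own round-up formula. The helper below is a userspace restatement of the `obj_per_chunk`/`num_icm` arithmetic from `mlx4_init_icm_table()`; the constants are taken from the discussion (4KB chunks, 8-byte MTT entries), and `sizeof(void *) == 8` assumes a 64-bit arch such as the x86_64 mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace restatement of the chunk-count arithmetic in the patch:
 * how many ICM chunks (and hence table->icm pointers) are needed to
 * hold nobj objects of obj_size bytes in chunk_size-byte chunks. */
static uint64_t icm_num_chunks(uint64_t nobj, uint64_t chunk_size,
			       uint64_t obj_size)
{
	uint64_t obj_per_chunk = chunk_size / obj_size;

	return (nobj + obj_per_chunk - 1) / obj_per_chunk;	/* round up */
}
```

For log_num_mtt=30 this gives 2M chunk pointers, i.e. a 16MB pointer array on 64-bit, which is exactly the allocation that motivates switching from kcalloc to kvzalloc.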
Re: [PATCH][V2] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
On 22/05/2018 6:42 PM, Colin King wrote: From: Colin Ian King <colin.k...@canonical.com> Trivial fix to spelling mistake in mlx4_dbg debug message and also change the phrasing of the message so that it is more readable Signed-off-by: Colin Ian King <colin.k...@canonical.com> --- V2: rephrase message, as helpfully suggested by Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c index 2edcce98ab2d..65482f004e50 100644 --- a/drivers/net/ethernet/mellanox/mlx4/intf.c +++ b/drivers/net/ethernet/mellanox/mlx4/intf.c @@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable) list_add_tail(&dev_ctx->list, &priv->ctx_list); spin_unlock_irqrestore(&priv->ctx_lock, flags); - mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n", + mlx4_dbg(dev, "Interface for protocol %d restarted with bonded mode %s\n", dev_ctx->intf->protocol, enable ? "enabled" : "disabled"); } Thanks Colin. Reviewed-by: Tariq Toukan <tar...@mellanox.com>
Re: WARNING and PANIC in irq_matrix_free
On 19/05/2018 2:20 PM, Thomas Gleixner wrote: On Fri, 18 May 2018, Dmitry Safonov wrote: I'm not entirely sure that it's the same fault, but at least backtrace looks resembling. Yes, it's similar, but not the same issue. I'll stare at the code ... Thanks, tglx We still see the issue in our daily regression runs. I have your patch merged into my internal branch, it prints the following: [ 4898.226258] Trying to clear prev_vector: 0 [ 4898.226439] Trying to clear prev_vector: 0 i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks
On 18/05/2018 12:45 AM, Qing Huang wrote: On 5/17/2018 2:14 PM, Eric Dumazet wrote: On 05/17/2018 01:53 PM, Qing Huang wrote: When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... NACK on this patch. You have been asked repeatedly to use kvmalloc() This is not a minor suggestion. Take a look at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d Would you please take a look at how table->icm is being used in the mlx4 driver? It's a meta data used for individual pointer variable referencing, not as data frag or in/out buffer. It has no need for contiguous phy. memory. Thanks. NACK. This would cause a degradation when iterating the entries of table->icm. For example, in mlx4_table_get_range. Thanks, Tariq And you'll understand some people care about this. Strongly. Thanks.
Re: [PATCH] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface"
On 22/05/2018 6:23 PM, Colin Ian King wrote: On 22/05/18 16:21, Tariq Toukan wrote: On 22/05/2018 11:37 AM, Colin King wrote: From: Colin Ian King <colin.k...@canonical.com> Trivial fix to spelling mistake in mlx4_dbg debug message Signed-off-by: Colin Ian King <colin.k...@canonical.com> --- drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c index 2edcce98ab2d..6bd4103265d2 100644 --- a/drivers/net/ethernet/mellanox/mlx4/intf.c +++ b/drivers/net/ethernet/mellanox/mlx4/intf.c @@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable) list_add_tail(&dev_ctx->list, &priv->ctx_list); spin_unlock_irqrestore(&priv->ctx_lock, flags); - mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n", + mlx4_dbg(dev, "Interface for protocol %d restarted with when bonded mode is %s\n", Thanks Colin. I think there's one more thing to fix here. It is redundant to say "with when", it was probably done by mistake. Let's rephrase, maybe this way? restarted with bonded mode %s Sounds like a good idea, do you want me to send V2 of the patch with this fix? Yes please. dev_ctx->intf->protocol, enable ? "enabled" : "disabled"); } -- To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface"
On 22/05/2018 11:37 AM, Colin King wrote: From: Colin Ian King Trivial fix to spelling mistake in mlx4_dbg debug message Signed-off-by: Colin Ian King --- drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c index 2edcce98ab2d..6bd4103265d2 100644 --- a/drivers/net/ethernet/mellanox/mlx4/intf.c +++ b/drivers/net/ethernet/mellanox/mlx4/intf.c @@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable) list_add_tail(&dev_ctx->list, &priv->ctx_list); spin_unlock_irqrestore(&priv->ctx_lock, flags); - mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n", + mlx4_dbg(dev, "Interface for protocol %d restarted with when bonded mode is %s\n", Thanks Colin. I think there's one more thing to fix here. It is redundant to say "with when", it was probably done by mistake. Let's rephrase, maybe this way? restarted with bonded mode %s dev_ctx->intf->protocol, enable ? "enabled" : "disabled"); }
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 15/05/2018 9:53 PM, Qing Huang wrote: On 5/15/2018 2:19 AM, Tariq Toukan wrote: On 14/05/2018 7:41 PM, Qing Huang wrote: On 5/13/2018 2:00 AM, Tariq Toukan wrote: On 11/05/2018 10:23 PM, Qing Huang wrote: When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However in order to support smaller ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver meta data structure (no need of DMA ops). Signed-off-by: Qing Huang Acked-by: Daniel Jurgens Reviewed-by: Zhu Yanjun --- v2 -> v1: adjusted chunk size to reflect different architectures. drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..ccb62b8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. 
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high + * order memory allocations in fragmented/high usage memory situation. */ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT, + MLX4_TABLE_CHUNK_SIZE = 1 << PAGE_SHIFT Which is actually PAGE_SIZE. Yes, we wanted to avoid high order memory allocations. Then please use PAGE_SIZE instead. PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT is actually more appropriate here. Definition of PAGE_SIZE varies among different archs. It is not always as simple as 1 << PAGE_SHIFT. It might be: PAGE_SIZE (1UL << PAGE_SHIFT) PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT) etc... Please replace 1 << PAGE_SHIFT with PAGE_SIZE. Also, please add a comma at the end of the last entry. Hmm..., followed the existing code style and checkpatch.pl didn't complain about the comma. I am in favor of having a comma also after the last element, so that when another enum element is added we do not modify this line again, which would falsely affect git blame. I know it didn't exist before your patch, but once we're here, let's do it. I'm okay either way. If adding an extra comma is preferred by many people, someone should update checkpatch.pl to enforce it. :) I agree. Until then, please use an extra comma in this patch. }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); Why not kvzalloc ? I think table->icm pointer array doesn't really need physically contiguous memory. Sometimes high order memory allocation by kmalloc variants may trigger slow path and cause tasks to be blocked. 
This is control path so it is less latency-sensitive. Let's not produce unnecessary degradation here, please call kvzalloc so we maintain a similar behavior when contiguous memory is available, and a fallback for resiliency. Not sure what exactly degradation is caused by vzalloc here. I think it's better to keep physically contiguous pages to other requests which really need them. Besides slow path/mem compacting can be really expensive. Degradation is expected when you replace a contig memory with non-contig memory, without any perf test. We agree that when contig memory is not available, we should use non-contig instead of simply failing, and for this you can call kvzalloc. Thanks, Qing if (!table->icm) return -ENOMEM;
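The behavior being asked for (try a physically contiguous allocation first, fall back to a vmalloc-style allocation only when it fails) can be sketched in userspace. Everything here is an invented stand-in for illustration: `try_contig_alloc` plays the role of kmalloc, `fallback_alloc` plays the role of vzalloc, and the real kernel `kvzalloc` additionally tweaks reclaim flags on the first attempt, which this toy model ignores.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for kmalloc: pretend contiguous allocations above 256KB
 * fail, as they can on a fragmented system. */
static void *try_contig_alloc(size_t size)
{
	return size <= (1UL << 18) ? calloc(1, size) : NULL;
}

/* Stand-in for vzalloc: built page by page, so it survives
 * fragmentation at the cost of non-contiguous backing memory. */
static void *fallback_alloc(size_t size)
{
	return calloc(1, size);
}

/* kvzalloc-like helper: prefer contiguous memory for iteration speed,
 * degrade gracefully instead of failing outright. */
static void *kvzalloc_like(size_t size)
{
	void *p = try_contig_alloc(size);

	return p ? p : fallback_alloc(size);
}
```

This is the shape of the compromise reached in the thread: callers that iterate table->icm keep contiguous memory whenever the allocator can provide it, and the 16MB pointer-array case no longer fails outright under fragmentation.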
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 14/05/2018 7:41 PM, Qing Huang wrote: On 5/13/2018 2:00 AM, Tariq Toukan wrote: On 11/05/2018 10:23 PM, Qing Huang wrote: When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However in order to support smaller ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver meta data structure (no need of DMA ops). Signed-off-by: Qing Huang <qing.hu...@oracle.com> Acked-by: Daniel Jurgens <dani...@mellanox.com> Reviewed-by: Zhu Yanjun <yanjun@oracle.com> --- v2 -> v1: adjusted chunk size to reflect different architectures. drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..ccb62b8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. 
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high + * order memory allocations in fragmented/high usage memory situation. */ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT, + MLX4_TABLE_CHUNK_SIZE = 1 << PAGE_SHIFT Which is actually PAGE_SIZE. Yes, we wanted to avoid high order memory allocations. Then please use PAGE_SIZE instead. Also, please add a comma at the end of the last entry. Hmm..., followed the existing code style and checkpatch.pl didn't complain about the comma. I am in favor of having a comma also after the last element, so that when another enum element is added we do not modify this line again, which would falsely affect git blame. I know it didn't exist before your patch, but once we're here, let's do it. }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); Why not kvzalloc ? I think table->icm pointer array doesn't really need physically contiguous memory. Sometimes high order memory allocation by kmalloc variants may trigger slow path and cause tasks to be blocked. This is control path so it is less latency-sensitive. Let's not produce unnecessary degradation here, please call kvzalloc so we maintain a similar behavior when contiguous memory is available, and a fallback for resiliency. 
Thanks, Qing if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); } Thanks for your patch. I need to verify there is no dramatic performance degradation here. You can prepare and send a v3 in the meanwhile. Thanks, Tariq
Re: [PATCH v2] net/mlx4_core: Fix error handling in mlx4_init_port_info.
On 14/05/2018 2:38 AM, Tarick Bedeir wrote: Avoid exiting the function with a lingering sysfs file (if the first call to device_create_file() fails while the second succeeds), and avoid calling devlink_port_unregister() twice. In other words, either mlx4_init_port_info() succeeds and returns zero, or it fails, returns non-zero, and requires no cleanup. Fixes: 096335b3f983 ("mlx4_core: Allow dynamic MTU configuration for IB ports") Signed-off-by: Tarick Bedeir <tar...@google.com> --- v1 -> v2: Added "Fixes" tag. --- drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index 4d84cab77105..e8a3a45d0b53 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -3007,6 +3007,7 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int port) mlx4_err(dev, "Failed to create file for port %d\n", port); devlink_port_unregister(&info->devlink_port); info->port = -1; + return err; } sprintf(info->dev_mtu_name, "mlx4_port%d_mtu", port); @@ -3028,9 +3029,10 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int port) &info->port_attr); devlink_port_unregister(&info->devlink_port); info->port = -1; + return err; } - return err; + return 0; } static void mlx4_cleanup_port_info(struct mlx4_port_info *info) Thanks for your patch. Reviewed-by: Tariq Toukan <tar...@mellanox.com> Dave, please queue for -stable. Best, Tariq
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 11/05/2018 10:23 PM, Qing Huang wrote: When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However in order to support smaller ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver meta data structure (no need of DMA ops). Signed-off-by: Qing Huang Acked-by: Daniel Jurgens Reviewed-by: Zhu Yanjun --- v2 -> v1: adjusted chunk size to reflect different architectures. drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..ccb62b8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in page size (default 4KB on many archs) chunks to avoid high + * order memory allocations in fragmented/high usage memory situation. 
*/ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT, + MLX4_TABLE_CHUNK_SIZE = 1 << PAGE_SHIFT Which is actually PAGE_SIZE. Also, please add a comma at the end of the last entry. }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); Why not kvzalloc ? if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); } Thanks for your patch. I need to verify there is no dramatic performance degradation here. You can prepare and send a v3 in the meanwhile. Thanks, Tariq
Re: [PATCH] net/mlx4_core: Fix error handling in mlx4_init_port_info.
On 02/05/2018 4:31 PM, Tariq Toukan wrote: On 27/04/2018 6:20 PM, Tarick Bedeir wrote: Avoid exiting the function with a lingering sysfs file (if the first call to device_create_file() fails while the second succeeds), and avoid calling devlink_port_unregister() twice. In other words, either mlx4_init_port_info() succeeds and returns zero, or it fails, returns non-zero, and requires no cleanup. Signed-off-by: Tarick Bedeir <tar...@google.com> --- drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index 4d84cab77105..e8a3a45d0b53 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -3007,6 +3007,7 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int port) mlx4_err(dev, "Failed to create file for port %d\n", port); devlink_port_unregister(&info->devlink_port); info->port = -1; + return err; } sprintf(info->dev_mtu_name, "mlx4_port%d_mtu", port); @@ -3028,9 +3029,10 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int port) &info->port_attr); devlink_port_unregister(&info->devlink_port); info->port = -1; + return err; } - return err; + return 0; } static void mlx4_cleanup_port_info(struct mlx4_port_info *info) Acked-by: Tariq Toukan <tar...@mellanox.com> Thanks Tarick. Actually, you need to add a Fixes line: Fixes: 096335b3f983 ("mlx4_core: Allow dynamic MTU configuration for IB ports")
Re: [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
On 10/05/2018 5:36 PM, Tariq Toukan wrote: On 10/05/2018 5:18 PM, Dan Carpenter wrote: On Thu, May 10, 2018 at 04:38:08PM +0300, Yuval Shaia wrote: On Thu, May 10, 2018 at 09:02:26AM +0200, Christophe JAILLET wrote: If an error occurs, 'mlx4_en_destroy_netdev()' is called. It then calls 'mlx4_en_free_resources()' which does the needed resources cleanup. So, doing some explicit kfree in the error handling path would lead to some double kfree. Patch makes sense but what's bothering me is that mlx4_en_free_resources loops on the entire array, assuming !priv->tx_ring[t] means entry is allocated but the existing code does not assume that, see [1]. So I looked to see where tx_ring array is zeroed and didn't find it. Am I missing something here? It's zeroed twice. alloc_etherdev_mqs() allocates zeroed memory and then we do a memset(priv, 0, sizeof(struct mlx4_en_priv)); regards, dan carpenter We do zero (twice) on init, that's right. But I think Yuval's comment is valid in case of the driver went into configuration change, or down/up, that reallocates the rings. I'm double checking this. Well, the flows in which we need to nullify the tx_rings pointer (if any, I still need to investigate this) is not related to this init function. Here we're safe. Anyway, a V2 is already submitted, please use it for your next comments. I think patch is OK.
Re: [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
On 10/05/2018 5:18 PM, Dan Carpenter wrote: On Thu, May 10, 2018 at 04:38:08PM +0300, Yuval Shaia wrote: On Thu, May 10, 2018 at 09:02:26AM +0200, Christophe JAILLET wrote: If an error occurs, 'mlx4_en_destroy_netdev()' is called. It then calls 'mlx4_en_free_resources()' which does the needed resources cleanup. So, doing some explicit kfree in the error handling path would lead to some double kfree. Patch makes sense but what's bothering me is that mlx4_en_free_resources loops on the entire array, assuming !priv->tx_ring[t] means entry is allocated but the existing code does not assume that, see [1]. So I looked to see where tx_ring array is zeroed and didn't find it. Am I missing something here? It's zeroed twice. alloc_etherdev_mqs() allocates zeroed memory and then we do a memset(priv, 0, sizeof(struct mlx4_en_priv)); regards, dan carpenter We do zero (twice) on init, that's right. But I think Yuval's comment is valid in case of the driver went into configuration change, or down/up, that reallocates the rings. I'm double checking this.
Re: [PATCH v2] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
On 10/05/2018 10:06 AM, Christophe JAILLET wrote: If an error occurs, 'mlx4_en_destroy_netdev()' is called. It then calls 'mlx4_en_free_resources()' which does the needed resources cleanup. So, doing some explicit kfree in the error handling path would lead to some double kfree. Simplify code to avoid such a case. Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme") Signed-off-by: Christophe JAILLET <christophe.jail...@wanadoo.fr> --- v1 -> v2 : rewrite the fix as explained by Tariq Toukan (this 2nd version may have been posted twice, once without the v2 tag. Please ignore the first one) --- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index e0adac4a9a19..9670b33fc9b1 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -3324,12 +3324,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port, MAX_TX_RINGS, GFP_KERNEL); if (!priv->tx_ring[t]) { err = -ENOMEM; - goto err_free_tx; + goto out; } priv->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) * MAX_TX_RINGS, GFP_KERNEL); if (!priv->tx_cq[t]) { - kfree(priv->tx_ring[t]); err = -ENOMEM; goto out; } @@ -3582,11 +3581,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port, return 0; -err_free_tx: - while (t--) { - kfree(priv->tx_ring[t]); - kfree(priv->tx_cq[t]); - } out: mlx4_en_destroy_netdev(dev); return err; Reviewed-by: Tariq Toukan <tar...@mellanox.com> Thanks Christophe.
Re: [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
On 08/05/2018 12:34 PM, Christophe JAILLET wrote: If the 2nd memory allocation of the loop fails, we must undo the memory allocation done so far. Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme") Signed-off-by: Christophe JAILLET --- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index e0adac4a9a19..bf078244e467 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -3331,7 +3331,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port, if (!priv->tx_cq[t]) { kfree(priv->tx_ring[t]); err = -ENOMEM; - goto out; + goto err_free_tx; } } priv->rx_ring_num = prof->rx_ring_num; Hi Christophe, Thanks for re-sending this. In your previous mail you referred to the call mlx4_en_destroy_netdev here: https://elixir.bootlin.com/linux/v4.16-rc5/source/drivers/net/ethernet/mellanox/mlx4/en_main.c#L232 While I was referring to the one below, which is always called on failures: https://elixir.bootlin.com/linux/v4.16-rc5/source/drivers/net/ethernet/mellanox/mlx4/en_netdev.c#L3587 I still believe that the err_free_tx label and its while loop is redundant. Regards, Tariq
Re: [PATCH] net/mlx4_core: Fix error handling in mlx4_init_port_info.
On 27/04/2018 6:20 PM, Tarick Bedeir wrote: Avoid exiting the function with a lingering sysfs file (if the first call to device_create_file() fails while the second succeeds), and avoid calling devlink_port_unregister() twice. In other words, either mlx4_init_port_info() succeeds and returns zero, or it fails, returns non-zero, and requires no cleanup. Signed-off-by: Tarick Bedeir <tar...@google.com> --- drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index 4d84cab77105..e8a3a45d0b53 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -3007,6 +3007,7 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int port) mlx4_err(dev, "Failed to create file for port %d\n", port); devlink_port_unregister(&info->devlink_port); info->port = -1; + return err; } sprintf(info->dev_mtu_name, "mlx4_port%d_mtu", port); @@ -3028,9 +3029,10 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int port) &info->port_attr); devlink_port_unregister(&info->devlink_port); info->port = -1; + return err; } - return err; + return 0; } static void mlx4_cleanup_port_info(struct mlx4_port_info *info) Acked-by: Tariq Toukan <tar...@mellanox.com> Thanks Tarick.
Re: [PATCH] net/mlx4_en: Fix a memory leak in case of error in 'mlx4_en_init_netdev()'
On 12/03/2018 12:45 AM, Christophe JAILLET wrote: If 'kzalloc' fails, we must free some memory before returning. Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme") Signed-off-by: Christophe JAILLET--- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 8fc51bc29003..f9db018e858f 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -3327,7 +3327,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port, if (!priv->tx_cq[t]) { kfree(priv->tx_ring[t]); err = -ENOMEM; - goto out; + goto err_free_tx; } } priv->rx_ring_num = prof->rx_ring_num; Hi Christophe, thanks for spotting this. However, I think these err_free_tx label and loop are redundant. Both tx_ring/tx_cq flows should just goto out, as resources are freed later in mlx4_en_destroy_netdev() -> mlx4_en_free_resources().
Re: [PATCH] net/mlx4_en: Fix a memory leak in case of error in 'mlx4_en_init_netdev()'
On 12/03/2018 12:45 AM, Christophe JAILLET wrote: If 'kzalloc' fails, we must free some memory before returning. Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme") Signed-off-by: Christophe JAILLET --- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 8fc51bc29003..f9db018e858f 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -3327,7 +3327,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port, if (!priv->tx_cq[t]) { kfree(priv->tx_ring[t]); err = -ENOMEM; - goto out; + goto err_free_tx; } } priv->rx_ring_num = prof->rx_ring_num; Hi Christophe, thanks for spotting this. However, I think these err_free_tx label and loop are redundant. Both tx_ring/tx_cq flows should just goto out, as resources are freed later in mlx4_en_destroy_netdev() -> mlx4_en_free_resources().
Re: WARNING and PANIC in irq_matrix_free
On 22/02/2018 11:38 PM, Thomas Gleixner wrote: On Wed, 21 Feb 2018, Tariq Toukan wrote: On 20/02/2018 8:18 PM, Thomas Gleixner wrote: On Tue, 20 Feb 2018, Thomas Gleixner wrote: On Tue, 20 Feb 2018, Tariq Toukan wrote: Is there CPU hotplugging in play? No. Ok. I'll come back to you tomorrow with a plan how to debug that after staring into the code some more. Do you have a rough idea what the test case is doing? It arbitrary appears in different flows, like sending traffic or interface configuration changes. Hmm. Looks like memory corruption, but I can't pin point it. Find below a debug patch which should prevent the crash and might give us some insight into the type of corruption. Please enable the irq_matrix and vector allocation trace points. echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable When the problem triggers the bogus vector is printed and the trace is frozen. Please provide dmesg and the tracebuffer output. OK, I'm temporarily adding this to the regression internal branch. I'll let you know once we have a repro. Thanks, Tariq Thanks, tglx 8<-- --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -822,6 +822,12 @@ static void free_moved_vector(struct api unsigned int cpu = apicd->prev_cpu; bool managed = apicd->is_managed; + if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) { + tracing_off(); + pr_err("Trying to clear prev_vector: %u\n", vector); + goto out; + } + /* * This should never happen. Managed interrupts are not * migrated except on CPU down, which does not involve the @@ -833,6 +839,7 @@ static void free_moved_vector(struct api trace_vector_free_moved(apicd->irq, cpu, vector, managed); irq_matrix_free(vector_matrix, cpu, vector, managed); per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED; +out: hlist_del_init(>clist); apicd->prev_vector = 0; apicd->move_in_progress = 0;
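Thomas's debug patch combines three things: a range check on the suspect vector, freezing the trace buffer (tracing_off()) so the events leading up to the corruption are preserved, and skipping the free that would corrupt state further. The guard itself can be sketched in userspace; the constants match the x86 values used in the patch context (FIRST_SYSTEM_VECTOR is config-dependent in the real kernel), and the tracing call is kernel-only, so it appears here only as a comment:

```c
#include <stdio.h>

#define FIRST_EXTERNAL_VECTOR 0x20
#define FIRST_SYSTEM_VECTOR   0xec

/* Returns 0 when the vector is in the legal external range and may be
 * freed; returns -1 when the guard trips. In the kernel patch a trip
 * also calls tracing_off(), freezing the trace buffer for analysis. */
static int check_prev_vector(unsigned int vector)
{
    if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {
        fprintf(stderr, "Trying to clear prev_vector: %u\n", vector);
        return -1;
    }
    return 0;
}
```

The value of the pattern is that a corrupted apicd->prev_vector no longer reaches irq_matrix_free() (avoiding the WARN and the later panic), while the frozen trace shows which allocation/free sequence preceded the corruption.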
Re: WARNING and PANIC in irq_matrix_free
On 20/02/2018 8:18 PM, Thomas Gleixner wrote: On Tue, 20 Feb 2018, Thomas Gleixner wrote: On Tue, 20 Feb 2018, Tariq Toukan wrote: Is there CPU hotplugging in play? No. I'll come back to you tomorrow with a plan how to debug that after staring into the code some more. Do you have a rough idea what the test case is doing? It arbitrarily appears in different flows, like sending traffic or interface configuration changes. Thanks, Tariq
WARNING and PANIC in irq_matrix_free
Hi Thomas, We started seeing new issues in our net-device daily regression tests. They are related to patch [1] introduced in kernel 4.15-rc1. We frequently see a warning in dmesg [2]. Repro is not consistent, we tried to narrow it down to a smaller run but couldn't. In addition, sometimes (less frequent) the warning is followed by a panic [3]. I can share all needed details to help analyze this bug. If you suspect specific flows, we can do an educated narrow down. Regards, Tariq [1] 2f75d9e1c905 genirq: Implement bitmap matrix allocator [2] [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 irq_matrix_free+0x30/0xd0 [ 8664.891905] Modules linked in: bonding rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlxfw mlx4_ib ib_core mlx4_en mlx4_core devlink macvlan vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc mst_pciconf(OE) nfsv3 nfs fscache netconsole dm_mirror dm_region_hash dm_log dm_mod dax kvm_intel kvm irqbypass pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ata_generic cirrus drm_kms_helper syscopyarea sysfillrect pata_acpi sysimgblt fb_sys_fops ttm drm e1000 serio_raw virtio_console i2c_core floppy ata_piix [last unloaded: mst_pci] [ 8664.905117] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G OE 4.15.0-for-upstream-perf-2018-02-08_07-00-42-18 #1 [ 8664.907613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014 [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0 [ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002 [ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX: [ 8664.917679] RDX: 0001 RSI: 0001 RDI: 880237038400 [ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09: [ 8664.922813] R10: 03ff R11: 0ad9 R12: 88023fc4 [ 8664.925345] R13: R14: 0001 R15: 002b [ 8664.927872] FS: () GS:88023fd4() knlGS: [ 8664.930455] CS: 0010 DS: ES: CR0: 80050033 [ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 06e0 [ 8664.935557] DR0: DR1: DR2: [ 8664.938051] DR3: 
DR6: fffe0ff0 DR7: 0400 [ 8664.940541] Call Trace: [ 8664.942980] [ 8664.945399] free_moved_vector+0x4e/0x100 [ 8664.947787] smp_irq_move_cleanup_interrupt+0x89/0x9e [ 8664.950134] irq_move_cleanup_interrupt+0x95/0xa0 [ 8664.952480] [ 8664.954800] RIP: 0010:native_safe_halt+0x2/0x10 [ 8664.957052] RSP: 0018:c9ccfee0 EFLAGS: 0246 ORIG_RAX: ffdf [ 8664.959186] RAX: 818ab6e0 RBX: 880236233f00 RCX: [ 8664.960499] RDX: RSI: RDI: [ 8664.961774] RBP: 0005 R08: R09: [ 8664.963048] R10: 03ff R11: 0ad9 R12: 880236233f00 [ 8664.964345] R13: 880236233f00 R14: R15: [ 8664.965579] ? __cpuidle_text_start+0x8/0x8 [ 8664.966808] default_idle+0x18/0xf0 [ 8664.968040] do_idle+0x150/0x1d0 [ 8664.969249] cpu_startup_entry+0x19/0x20 [ 8664.970477] start_secondary+0x133/0x170 [ 8664.971700] secondary_startup_64+0xa5/0xb0 [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74 [ 8664.975420] ---[ end trace 8be4ba51cd83f4bd ]--- [3] [ 8943.038767] BUG: unable to handle kernel paging request at 00037a6b561b [ 8943.040114] IP: free_moved_vector+0x61/0x100 [ 8943.041531] PGD 0 P4D 0 [ 8943.042855] Oops: 0002 [#1] SMP PTI [ 8943.044128] Modules linked in: bonding rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlxfw mlx4_ib ib_core mlx4_en mlx4_core devlink iptable_filter fuse btrfs xor zstd_decompress zstd_compress xxhash raid6_pq vfat msdos fat binfmt_misc bridge macvlan vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc mst_pciconf(OE) nfsv3 nfs fscache netconsole dm_mirror dm_region_hash dm_log dm_mod dax kvm_intel kvm irqbypass pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ata_generic cirrus drm_kms_helper syscopyarea sysfillrect pata_acpi sysimgblt fb_sys_fops ttm drm e1000 serio_raw virtio_console i2c_core floppy ata_piix [last unloaded: mst_pci] [ 8943.052038] CPU: 5 
PID: 0 Comm: swapper/5 Tainted: GW OE 4.15.0-for-upstream-perf-2018-02-08_07-00-42-18 #1 [ 8943.053350] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014 [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100 [ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007 [ 8943.057233] RAX:
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
On 25/01/2018 8:25 AM, jianchao.wang wrote: Hi Eric Thanks for you kindly response and suggestion. That's really appreciated. Jianchao On 01/25/2018 11:55 AM, Eric Dumazet wrote: On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote: Hi Tariq On 01/22/2018 10:12 AM, jianchao.wang wrote: On 19/01/2018 5:49 PM, Eric Dumazet wrote: On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote: Hi Tariq Very sad that the crash was reproduced again after applied the patch. Memory barriers vary for different Archs, can you please share more details regarding arch and repro steps? The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015 The xen is installed. The crash occurred in DOM0. Regarding to the repro steps, it is a customer's test which does heavy disk I/O over NFS storage without any guest. What is the finial suggestion on this ? If use wmb there, is the performance pulled down ? I want to evaluate this effect. I agree with Eric, expected impact is restricted, especially after batching the allocations. Since https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_davem_net-2Dnext.git_commit_-3Fid-3Ddad42c3038a59d27fced28ee4ec1d4a891b28155=DwICaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ=c0oI8duFkyFBILMQYDsqRApHQrOlLY_2uGiz_utcd7s=E4_XKmSI0B63qB0DLQ1EX_fj1bOP78ZdeYADBf33B-k= we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often. I doubt the additional wmb() will have serious impact there. I will test the effect (it'll be beginning of next week). I'll update so we can make a more confident decision. Thanks, Tariq -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
On 21/01/2018 11:31 AM, Tariq Toukan wrote: On 19/01/2018 5:49 PM, Eric Dumazet wrote: On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote: Hi Tariq Very sad that the crash was reproduced again after applied the patch. Memory barriers vary for different Archs, can you please share more details regarding arch and repro steps? --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -252,6 +252,7 @@ static inline bool mlx4_en_is_ring_empty(struct mlx4_en_rx_ring *ring) static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring) { + dma_wmb(); So... is wmb() here fixing the issue ? *ring->wqres.db.db = cpu_to_be32(ring->prod & 0x); } I analyzed the kdump, it should be a memory corruption. Thanks Jianchao Hmm, this is actually consistent with the example below [1]. AIU from the example, it seems that the dma_wmb/dma_rmb barriers are good for synchronizing cpu/device accesses to the "Streaming DMA mapped" buffers (the descriptors, went through the dma_map_page() API), but not for the doorbell (a coherent memory, typically allocated via dma_alloc_coherent) that requires using the stronger wmb() barrier. [1] Documentation/memory-barriers.txt (*) dma_wmb(); (*) dma_rmb(); These are for use with consistent memory to guarantee the ordering of writes or reads of shared memory accessible to both the CPU and a DMA capable device. 
For example, consider a device driver that shares memory with a device and uses a descriptor status value to indicate if the descriptor belongs to the device or the CPU, and a doorbell to notify it when new descriptors are available: if (desc->status != DEVICE_OWN) { /* do not read data until we own descriptor */ dma_rmb(); /* read/modify data */ read_data = desc->data; desc->data = write_data; /* flush modifications before status update */ dma_wmb(); /* assign ownership */ desc->status = DEVICE_OWN; /* force memory to sync before notifying device via MMIO */ wmb(); /* notify device of new descriptors */ writel(DESC_NOTIFY, doorbell); } The dma_rmb() allows us guarantee the device has released ownership before we read the data from the descriptor, and the dma_wmb() allows us to guarantee the data is written to the descriptor before the device can see it now has ownership. The wmb() is needed to guarantee that the cache coherent memory writes have completed before attempting a write to the cache incoherent MMIO region. See Documentation/DMA-API.txt for more information on consistent memory. On 01/15/2018 01:50 PM, jianchao.wang wrote: Hi Tariq Thanks for your kindly response. On 01/14/2018 05:47 PM, Tariq Toukan wrote: Thanks Jianchao for your patch. And Thank you guys for your reviews, much appreciated. I was off-work on Friday and Saturday. On 14/01/2018 4:40 AM, jianchao.wang wrote: Dear all Thanks for the kindly response and reviewing. That's really appreciated. On 01/13/2018 12:46 AM, Eric Dumazet wrote: Does this need to be dma_wmb(), and should it be in mlx4_en_update_rx_prod_db ? +1 on dma_wmb() On what architecture bug was observed ? This issue was observed on x86-64. And I will send a new patch, in which replace wmb() with dma_wmb(), to customer to confirm. +1 on dma_wmb, let us know once customer confirms. Please place it within mlx4_en_update_rx_prod_db as suggested. Yes, I have recommended it to customer. Once I get the result, I will share it here. 
All other calls to mlx4_en_update_rx_prod_db are in control/slow path so I prefer being on the safe side, and care less about bulking the barrier. Thanks, Tariq -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
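The memory-barriers.txt example quoted above reduces to a small producer routine. Below is a compilable userspace sketch of the same ordering; the two fence macros are stand-ins for the kernel's dma_wmb()/wmb() (real drivers must use the kernel primitives), and the "doorbell" here is ordinary memory rather than an MMIO register:

```c
#include <stdint.h>

/* Userspace stand-ins; NOT the kernel barriers under discussion. */
#define dma_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#define wmb()     __sync_synchronize()

#define DEVICE_OWN 1u

struct desc {
    uint32_t data;              /* streaming-DMA payload */
    volatile uint32_t status;   /* ownership flag shared with the device */
};

static void publish_desc(struct desc *d, volatile uint32_t *doorbell,
                         uint32_t write_data)
{
    d->data = write_data;   /* fill the descriptor */
    dma_wmb();              /* data must be visible before ownership flips */
    d->status = DEVICE_OWN; /* hand the descriptor to the device */
    wmb();                  /* coherent writes must complete before the
                             * write to the incoherent MMIO region */
    *doorbell = 1;          /* ring the doorbell (writel() in a driver) */
}
```

This mirrors Tariq's reading of the thread: dma_wmb() orders the descriptor writes among themselves, but the doorbell write needs the stronger wmb() because it targets MMIO, which is exactly where the mlx4_en_update_rx_prod_db() fix places it.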
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
On 19/01/2018 5:49 PM, Eric Dumazet wrote:
On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
Hi Tariq
Very sad that the crash was reproduced again after applying the patch.

--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -252,6 +252,7 @@ static inline bool mlx4_en_is_ring_empty(struct mlx4_en_rx_ring *ring)
 static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
 {
+	dma_wmb();

So... is wmb() here fixing the issue ?

 	*ring->wqres.db.db = cpu_to_be32(ring->prod & 0x);
 }

I analyzed the kdump; it should be a memory corruption.
Thanks
Jianchao

Hmm, this is actually consistent with the example below [1]. AIUI from the example, the dma_wmb/dma_rmb barriers are good for synchronizing CPU/device accesses to the "streaming DMA mapped" buffers (the descriptors, which went through the dma_map_page() API), but not for the doorbell (coherent memory, typically allocated via dma_alloc_coherent), which requires the stronger wmb() barrier.

[1] Documentation/memory-barriers.txt

(*) dma_wmb();
(*) dma_rmb();

These are for use with consistent memory to guarantee the ordering of writes or reads of shared memory accessible to both the CPU and a DMA capable device.

For example, consider a device driver that shares memory with a device and uses a descriptor status value to indicate if the descriptor belongs to the device or the CPU, and a doorbell to notify it when new descriptors are available:

	if (desc->status != DEVICE_OWN) {
		/* do not read data until we own descriptor */
		dma_rmb();

		/* read/modify data */
		read_data = desc->data;
		desc->data = write_data;

		/* flush modifications before status update */
		dma_wmb();

		/* assign ownership */
		desc->status = DEVICE_OWN;

		/* force memory to sync before notifying device via MMIO */
		wmb();

		/* notify device of new descriptors */
		writel(DESC_NOTIFY, doorbell);
	}

The dma_rmb() allows us to guarantee the device has released ownership before we read the data from the descriptor, and the dma_wmb() allows us to guarantee the data is written to the descriptor before the device can see it now has ownership. The wmb() is needed to guarantee that the cache coherent memory writes have completed before attempting a write to the cache incoherent MMIO region. See Documentation/DMA-API.txt for more information on consistent memory.

On 01/15/2018 01:50 PM, jianchao.wang wrote:
Hi Tariq
Thanks for your kind response.

On 01/14/2018 05:47 PM, Tariq Toukan wrote:
Thanks Jianchao for your patch. And thank you guys for your reviews, much appreciated. I was off work on Friday and Saturday.

On 14/01/2018 4:40 AM, jianchao.wang wrote:
Dear all
Thanks for the kind responses and reviews. That's really appreciated.

On 01/13/2018 12:46 AM, Eric Dumazet wrote:
Does this need to be dma_wmb(), and should it be in mlx4_en_update_rx_prod_db ?

+1 on dma_wmb()

On what architecture was the bug observed ?

This issue was observed on x86-64. And I will send a new patch, in which wmb() is replaced with dma_wmb(), to the customer to confirm.

+1 on dma_wmb; let us know once the customer confirms. Please place it within mlx4_en_update_rx_prod_db as suggested.

Yes, I have recommended it to the customer. Once I get the result, I will share it here.

All other calls to mlx4_en_update_rx_prod_db are in the control/slow path, so I prefer being on the safe side, and care less about bulking the barrier.
Thanks,
Tariq
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
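The example above is kernel code and not runnable on its own, but the same ownership hand-off can be sketched as a user-space analogue with C11 atomics, where memory_order_release stands in for the dma_wmb()-ordered status update and memory_order_acquire for dma_rmb(). The struct and function names here are illustrative, not the driver's code; the MMIO doorbell step has no user-space equivalent and is omitted:

```c
#include <stdatomic.h>
#include <assert.h>

enum { CPU_OWN = 0, DEVICE_OWN = 1 };

struct desc {
	int data;           /* payload, written with plain stores */
	_Atomic int status; /* ownership flag, used for the hand-off */
};

/* Producer: fill the descriptor, then release ownership. The release
 * store orders the d->data write before the status update, playing the
 * role of dma_wmb() in the kernel example. */
static void producer(struct desc *d, int value)
{
	d->data = value;
	atomic_store_explicit(&d->status, DEVICE_OWN, memory_order_release);
}

/* Consumer: acquire ownership, then it is safe to read the data. The
 * acquire load plays the role of dma_rmb(). */
static int consumer(struct desc *d)
{
	while (atomic_load_explicit(&d->status, memory_order_acquire) != DEVICE_OWN)
		; /* spin until the descriptor is handed to us */
	return d->data;
}
```

The release/acquire pair gives the same guarantee the thread discusses for streaming DMA buffers: the consumer never observes DEVICE_OWN without also observing the data written before it.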
Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating
Thanks Jianchao for your patch. And thank you guys for your reviews, much appreciated. I was off work on Friday and Saturday.

On 14/01/2018 4:40 AM, jianchao.wang wrote:
Dear all
Thanks for the kind responses and reviews. That's really appreciated.

On 01/13/2018 12:46 AM, Eric Dumazet wrote:
Does this need to be dma_wmb(), and should it be in mlx4_en_update_rx_prod_db ?

+1 on dma_wmb()

On what architecture was the bug observed ?

This issue was observed on x86-64. And I will send a new patch, in which wmb() is replaced with dma_wmb(), to the customer to confirm.

+1 on dma_wmb; let us know once the customer confirms. Please place it within mlx4_en_update_rx_prod_db as suggested. All other calls to mlx4_en_update_rx_prod_db are in the control/slow path, so I prefer being on the safe side, and care less about bulking the barrier.
Thanks,
Tariq
Re: [PATCH net-next V2 2/2] tuntap: XDP transmission
On 04/01/2018 5:14 AM, Jason Wang wrote:
This patch implements XDP transmission for TAP. Since we can't create new queues for TAP during XDP set, the existing ptr_ring was reused for queuing XDP buffers. To distinguish an xdp_buff from an sk_buff, TUN_XDP_FLAG (0x1UL) is encoded into the lowest bit of the xdp_buff pointer during ptr_ring_produce, and decoded during consumption. XDP metadata is stored in the headroom of the packet, which should work in most cases since drivers usually reserve enough headroom. Very minor changes were done for vhost_net: it just needs to peek at the length depending on the type of the pointer.

Tests were done on two Intel E5-2630 2.40GHz machines connected back to back through two 82599ES NICs. Traffic was generated/received through MoonGen/testpmd (rxonly). It reports ~20% improvement when xdp_redirect_map is doing redirection from ixgbe to TAP (from 2.50Mpps to 3.05Mpps).

Cc: Jesper Dangaard Brouer
Signed-off-by: Jason Wang
---
 drivers/net/tun.c      | 211 +
 drivers/vhost/net.c    |  13 ++-
 include/linux/if_tun.h |  17
 3 files changed, 208 insertions(+), 33 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2c89efe..f2e805d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -240,6 +240,24 @@ struct tun_struct {
 	struct tun_steering_prog __rcu *steering_prog;
 };

+bool tun_is_xdp_buff(void *ptr)
+{
+	return (unsigned long)ptr & TUN_XDP_FLAG;
+}
+EXPORT_SYMBOL(tun_is_xdp_buff);
+
+void *tun_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
+}
+EXPORT_SYMBOL(tun_xdp_to_ptr);
+
+void *tun_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
+}
+EXPORT_SYMBOL(tun_ptr_to_xdp);
+

Hi Jason, I started getting the following compilation issues.

+ make -j24 -s
net/socket.o: In function `tun_xdp_to_ptr':
/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:46: multiple definition of `tun_xdp_to_ptr'
fs/compat_ioctl.o:/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:46: first defined here
net/socket.o: In function `tun_ptr_to_xdp':
/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:50: multiple definition of `tun_ptr_to_xdp'
fs/compat_ioctl.o:/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:50: first defined here
make: *** [vmlinux] Error 1

Seems you missed adding the following ifdef:
#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)

Thanks,
Tariq

 static int tun_napi_receive(struct napi_struct *napi, int budget)
 {
 	struct tun_file *tfile = container_of(napi, struct tun_file, napi);
@@ -630,12 +648,25 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile)
 	return tun;
 }

+static void tun_ptr_free(void *ptr)
+{
+	if (!ptr)
+		return;
+	if (tun_is_xdp_buff(ptr)) {
+		struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+
+		put_page(virt_to_head_page(xdp->data));
+	} else {
+		__skb_array_destroy_skb(ptr);
+	}
+}
+
 static void tun_queue_purge(struct tun_file *tfile)
 {
-	struct sk_buff *skb;
+	void *ptr;

-	while ((skb = ptr_ring_consume(&tfile->tx_ring)) != NULL)
-		kfree_skb(skb);
+	while ((ptr = ptr_ring_consume(&tfile->tx_ring)) != NULL)
+		tun_ptr_free(ptr);

 	skb_queue_purge(&tfile->sk.sk_write_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
@@ -688,8 +719,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 			unregister_netdevice(tun->dev);
 		}
 		if (tun)
-			ptr_ring_cleanup(&tfile->tx_ring,
-					 __skb_array_destroy_skb);
+			ptr_ring_cleanup(&tfile->tx_ring, tun_ptr_free);
 		sock_put(&tfile->sk);
 	}
 }
@@ -1201,6 +1231,67 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_get_stats64	= tun_net_get_stats64,
 };

+static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	struct xdp_buff *buff = xdp->data_hard_start;
+	int headroom = xdp->data - xdp->data_hard_start;
+	struct tun_file *tfile;
+	u32 numqueues;
+	int ret = 0;
+
+	/* Assure headroom is available and buff is properly aligned */
+	if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
+		return -ENOSPC;
+
+	*buff = *xdp;
+
+	rcu_read_lock();
+
+	numqueues = READ_ONCE(tun->numqueues);
+	if (!numqueues) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	tfile = rcu_dereference(tun->tfiles[smp_processor_id() % numqueues]);
+	/* Encode the XDP flag into lowest bit for consumer to differ
+	 * XDP buffer from
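The TUN_XDP_FLAG encoding above relies on pointers to aligned allocations having their low bit clear, so that bit can carry a type tag. A minimal user-space sketch of the same trick (hypothetical names, mirroring tun_is_xdp_buff/tun_xdp_to_ptr/tun_ptr_to_xdp, not the driver's code):

```c
#include <stdlib.h>
#include <assert.h>

#define XDP_TAG 0x1UL /* mirrors TUN_XDP_FLAG: low bit marks an XDP buffer */

/* Test whether the low tag bit is set on the pointer. */
static int is_xdp(void *ptr)
{
	return (unsigned long)ptr & XDP_TAG;
}

/* Set the tag bit before queuing the pointer. */
static void *tag_xdp(void *ptr)
{
	return (void *)((unsigned long)ptr | XDP_TAG);
}

/* Clear the tag bit to recover the original pointer on the consume side. */
static void *untag(void *ptr)
{
	return (void *)((unsigned long)ptr & ~XDP_TAG);
}
```

This works because malloc (like the page allocator in the kernel case) never returns odd addresses, so tagging and untagging round-trips losslessly while a ring of void pointers can carry two distinct payload types.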
Re: [PATCH] ethernet: mlx4: Delete an error message for a failed memory allocation in five functions
On 01/01/2018 10:46 PM, SF Markus Elfring wrote:
From: Markus Elfring <elfr...@users.sourceforge.net>
Date: Mon, 1 Jan 2018 21:42:27 +0100

Omit an extra message for a memory allocation failure in these functions. This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
---

Acked-by: Tariq Toukan <tar...@mellanox.com>
Thanks.
Re: ethernet: mlx4: Delete an error message for a failed memory allocation in five functions
On 03/01/2018 4:22 PM, SF Markus Elfring wrote:

I don't really accept this claim... Short informative strings are worth the tiny space they consume.

There can be different opinions about their usefulness.

In addition, some out-of-memory errors are recoverable, even though their backtrace is also printed.

What do you think about suppressing the backtrace generation for them?

OK, makes sense.

For example, in function mlx4_en_create_cq (appears in the patch) we have a first allocation attempt (kzalloc_node)

Would it be helpful to pass the option “__GFP_NOWARN” there?

I'll prepare a patch to use it. Will ack this patch for now.

and a fallback (kzalloc). I'd prefer to state a clear error message only when both have failed, because otherwise the user might be confused whether the backtrace should indicate a malfunctioning interface, or not.

Can the distinction become easier by any other means?

Regards,
Markus
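The pattern under discussion can be sketched in user space: a preferred allocation attempt that stays silent on failure (the kernel analogue would pass __GFP_NOWARN to kzalloc_node), a fallback attempt, and an error message only when both fail. The function name and the simulate_first_failure flag are purely illustrative:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: try a preferred zeroed allocation first, fall
 * back to a second attempt, and warn only when both attempts fail.
 * simulate_first_failure lets a caller exercise the fallback path. */
static void *alloc_with_fallback(size_t size, int simulate_first_failure)
{
	/* Preferred attempt (kernel analogue: kzalloc_node + __GFP_NOWARN). */
	void *buf = simulate_first_failure ? NULL : calloc(1, size);

	/* Fallback attempt (kernel analogue: plain kzalloc). */
	if (!buf)
		buf = calloc(1, size);

	/* A clear message only when the interface really cannot proceed. */
	if (!buf)
		fprintf(stderr, "allocation of %zu bytes failed\n", size);
	return buf;
}
```

With this shape, a transient failure of the preferred attempt produces no confusing backtrace or message, matching the preference expressed above to report only when both allocations have failed.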
Re: [PATCH] ethernet: mlx4: Delete an error message for a failed memory allocation in five functions
On 03/01/2018 10:06 AM, Julia Lawall wrote:
On Wed, 3 Jan 2018, Tariq Toukan wrote:
On 01/01/2018 10:46 PM, SF Markus Elfring wrote:
From: Markus Elfring <elfr...@users.sourceforge.net>
Date: Mon, 1 Jan 2018 21:42:27 +0100

Omit an extra message for a memory allocation failure in these functions. This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
---

Is this an issue? Why? What is your motivation? These are error messages, very informative, appear only upon errors, and in control flow.

Strings take up space. Since there is a backtrace on an out-of-memory problem, if the string does not provide any more information than the position of the call, then there is not much added value. I don't know what the string was in this case. If it provides some additional information, then it would be reasonable to keep it.

I don't really accept this claim... Short informative strings are worth the tiny space they consume. They help the users of our driver understand what went wrong in simple words, without the need to understand the role of the functions/callstack or being familiar with different parts of the driver code.

In addition, some out-of-memory errors are recoverable, even though their backtrace is also printed. For example, in function mlx4_en_create_cq (appears in the patch) we have a first allocation attempt (kzalloc_node) and a fallback (kzalloc). I'd prefer to state a clear error message only when both have failed, because otherwise the user might be confused whether the backtrace should indicate a malfunctioning interface, or not.

Tariq

julia
--
To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ethernet: mlx4: Delete an error message for a failed memory allocation in five functions
On 01/01/2018 10:46 PM, SF Markus Elfring wrote:
From: Markus Elfring
Date: Mon, 1 Jan 2018 21:42:27 +0100

Omit an extra message for a memory allocation failure in these functions. This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring
---

Is this an issue? Why? What is your motivation? These are error messages, very informative, appear only upon errors, and in control flow.
Re: [PATCH] kobject: fix suppressing modalias in uevents delivered over netlink
On 31/12/2017 12:39 PM, Greg Kroah-Hartman wrote:
On Sun, Dec 31, 2017 at 12:28:02PM +0200, Tariq Toukan wrote:
On 21/12/2017 7:13 AM, Dmitry Torokhov wrote:

Greg, can you please schedule my patch in for 4.15? We must have it in; otherwise there will be a major degradation (many drivers won't be able to reload).

Greg?

It's already queued up to get merged, please be patient, it is the holidays...

greg k-h

Many thanks, and happy holidays!
Tariq