Re: [PATCH] net/mlx5e: allocate 'indirection_rqt' buffer dynamically

2021-03-08 Thread Tariq Toukan




On 3/8/2021 5:32 PM, Arnd Bergmann wrote:

From: Arnd Bergmann 

Increasing the size of the indirection_rqt array from 128 to 256 bytes
pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function
over the warning limit when building with clang and CONFIG_KASAN:

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size 
of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=]

Using dynamic allocation here is safe because the caller does the
same, and it reduces the stack usage of the function to just a few
bytes.

Fixes: 1dd55ba2fb70 ("net/mlx5e: Increase indirection RQ table size to 256")
Signed-off-by: Arnd Bergmann 
---
  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 16 +---
  1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 0da69b98f38f..66f98618dc13 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -445,12 +445,16 @@ static void mlx5e_hairpin_destroy_transport(struct 
mlx5e_hairpin *hp)
mlx5_core_dealloc_transport_domain(hp->func_mdev, hp->tdn);
  }
  
-static void mlx5e_hairpin_fill_rqt_rqns(struct mlx5e_hairpin *hp, void *rqtc)

+static int mlx5e_hairpin_fill_rqt_rqns(struct mlx5e_hairpin *hp, void *rqtc)
  {
-   u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE], rqn;
+   u32 *indirection_rqt, rqn;
struct mlx5e_priv *priv = hp->func_priv;
int i, ix, sz = MLX5E_INDIR_RQT_SIZE;
  
+	indirection_rqt = kzalloc(sz, GFP_KERNEL);

+   if (!indirection_rqt)
+   return -ENOMEM;
+
mlx5e_build_default_indir_rqt(indirection_rqt, sz,
  hp->num_channels);
  
@@ -462,6 +466,9 @@ static void mlx5e_hairpin_fill_rqt_rqns(struct mlx5e_hairpin *hp, void *rqtc)

rqn = hp->pair->rqn[ix];
MLX5_SET(rqtc, rqtc, rq_num[i], rqn);
}
+
+   kfree(indirection_rqt);
+   return 0;
  }
  
  static int mlx5e_hairpin_create_indirect_rqt(struct mlx5e_hairpin *hp)

@@ -482,12 +489,15 @@ static int mlx5e_hairpin_create_indirect_rqt(struct 
mlx5e_hairpin *hp)
MLX5_SET(rqtc, rqtc, rqt_actual_size, sz);
MLX5_SET(rqtc, rqtc, rqt_max_size, sz);
  
-	mlx5e_hairpin_fill_rqt_rqns(hp, rqtc);

+   err = mlx5e_hairpin_fill_rqt_rqns(hp, rqtc);
+   if (err)
+   goto out;
  
  	err = mlx5_core_create_rqt(mdev, in, inlen, &hp->indir_rqt.rqtn);

if (!err)
hp->indir_rqt.enabled = true;
  
+out:

kvfree(in);
    return err;
  }



Reviewed-by: Tariq Toukan 
Thanks for your patch.

Tariq


Re: [PATCH] net/mlx4_core: Add missed mlx4_free_cmd_mailbox()

2021-02-21 Thread Tariq Toukan




On 2/21/2021 4:35 PM, Chuhong Yuan wrote:

mlx4_do_mirror_rule() forgets to call mlx4_free_cmd_mailbox() to
free the memory region allocated by mlx4_alloc_cmd_mailbox() before
an exit.
Add the missed call to fix it.

Fixes: 78efed275117 ("net/mlx4_core: Support mirroring VF DMFS rules on both 
ports")
Signed-off-by: Chuhong Yuan 
---
  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c 
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 394f43add85c..a99e71bc7b3c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -4986,6 +4986,7 @@ static int mlx4_do_mirror_rule(struct mlx4_dev *dev, 
struct res_fs_rule *fs_rule
  
  	if (!fs_rule->mirr_mbox) {

mlx4_err(dev, "rule mirroring mailbox is null\n");
+   mlx4_free_cmd_mailbox(dev, mailbox);
return -EINVAL;
}
memcpy(mailbox->buf, fs_rule->mirr_mbox, fs_rule->mirr_mbox_size);



Reviewed-by: Tariq Toukan 

Thanks for your patch.
Tariq


Re: [PATCH v4 net-next] net: Remove redundant calls of sk_tx_queue_clear().

2021-01-28 Thread Tariq Toukan




On 1/28/2021 2:42 PM, Kuniyuki Iwashima wrote:

The commit 41b14fb8724d ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit e022f0b4a03f ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.

If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if we move sk_tx_queue_mapping out of the no copy area, it
introduces a bug unintentionally.
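
For context, a rough sketch of the marker idiom involved (illustrative field set
and struct name, not the real struct sock layout): sock_copy() copies everything
before sk_dontcopy_begin and everything from sk_dontcopy_end onwards, and skips
the region in between.

    struct sock_sketch {
            int     sk_family;              /* copied by sock_copy() */
            char    sk_dontcopy_begin[0];   /* start of the "no copy" area */
            int     sk_tx_queue_mapping;    /* skipped by sock_copy() */
            char    sk_dontcopy_end[0];     /* end of the "no copy" area */
            int     sk_rcvbuf;              /* copied by sock_copy() */
    };

The BUILD_BUG_ON added by this patch asserts at compile time that
sk_tx_queue_mapping still sits between the two markers, so anyone moving the
field out of the no-copy area is forced to re-examine the ordering of
sock_copy() and sk_tx_queue_clear() instead of introducing a silent bug.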

Therefore, this patch adds a compile-time check to take care of the order
of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it only does the allocation and its callers
initialize fields.

v4:
* Fix typo in the changelog (runtime -> compile-time)

v3: https://lore.kernel.org/netdev/20210128021905.57471-1-kun...@amazon.co.jp/
* Remove Fixes: tag
* Add BUILD_BUG_ON
* Remove sk_tx_queue_clear() from sk_prot_alloc()
   instead of sk_alloc() and sk_clone_lock()

v2: https://lore.kernel.org/netdev/20210127132215.10842-1-kun...@amazon.co.jp/
* Remove Reviewed-by: tag

v1: https://lore.kernel.org/netdev/20210127125018.7059-1-kun...@amazon.co.jp/



Sorry for not pointing this out earlier, but shouldn't the changelog 
come after the --- separator? Unless you want it to appear as part of 
the commit message.


Other than that, I think now I'm fine with the patch.

Acked-by: Tariq Toukan 

Thanks,
Tariq


CC: Tariq Toukan 
CC: Boris Pismenny 
Signed-off-by: Kuniyuki Iwashima 
---
  net/core/sock.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index bbcd4b97eddd..cfbd62a5e079 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1657,6 +1657,16 @@ static void sock_copy(struct sock *nsk, const struct 
sock *osk)
  #ifdef CONFIG_SECURITY_NETWORK
void *sptr = nsk->sk_security;
  #endif
+
+   /* If we move sk_tx_queue_mapping out of the private section,
+* we must check if sk_tx_queue_clear() is called after
+* sock_copy() in sk_clone_lock().
+*/
+   BUILD_BUG_ON(offsetof(struct sock, sk_tx_queue_mapping) <
+offsetof(struct sock, sk_dontcopy_begin) ||
+offsetof(struct sock, sk_tx_queue_mapping) >=
+offsetof(struct sock, sk_dontcopy_end));
+
memcpy(nsk, osk, offsetof(struct sock, sk_dontcopy_begin));
  
  	memcpy(&nsk->sk_dontcopy_end, &osk->sk_dontcopy_end,

@@ -1690,7 +1700,6 @@ static struct sock *sk_prot_alloc(struct proto *prot, 
gfp_t priority,
  
  		if (!try_module_get(prot->owner))

goto out_free_sec;
-   sk_tx_queue_clear(sk);
}
  
  	return sk;




Re: [PATCH v3 net-next] net: Remove redundant calls of sk_tx_queue_clear().

2021-01-28 Thread Tariq Toukan




On 1/28/2021 4:19 AM, Kuniyuki Iwashima wrote:

The commit 41b14fb8724d ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit e022f0b4a03f ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.

If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if we move sk_tx_queue_mapping out of the no copy area, it
introduces a bug unintentionally.

Therefore, this patch adds a runtime 


compile-time


check to take care of the order of
sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it only does the allocation and its callers
initialize fields.

v3:
* Remove Fixes: tag
* Add BUILD_BUG_ON
* Remove sk_tx_queue_clear() from sk_prot_alloc()
   instead of sk_alloc() and sk_clone_lock()

v2: https://lore.kernel.org/netdev/20210127132215.10842-1-kun...@amazon.co.jp/
* Remove Reviewed-by: tag

v1: https://lore.kernel.org/netdev/20210127125018.7059-1-kun...@amazon.co.jp/

CC: Tariq Toukan 
CC: Boris Pismenny 
Signed-off-by: Kuniyuki Iwashima 
---
  net/core/sock.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index bbcd4b97eddd..cfbd62a5e079 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1657,6 +1657,16 @@ static void sock_copy(struct sock *nsk, const struct 
sock *osk)
  #ifdef CONFIG_SECURITY_NETWORK
void *sptr = nsk->sk_security;
  #endif
+
+   /* If we move sk_tx_queue_mapping out of the private section,
+* we must check if sk_tx_queue_clear() is called after
+* sock_copy() in sk_clone_lock().
+*/
+   BUILD_BUG_ON(offsetof(struct sock, sk_tx_queue_mapping) <
+offsetof(struct sock, sk_dontcopy_begin) ||
+offsetof(struct sock, sk_tx_queue_mapping) >=
+offsetof(struct sock, sk_dontcopy_end));
+
memcpy(nsk, osk, offsetof(struct sock, sk_dontcopy_begin));
  
  	memcpy(&nsk->sk_dontcopy_end, &osk->sk_dontcopy_end,

@@ -1690,7 +1700,6 @@ static struct sock *sk_prot_alloc(struct proto *prot, 
gfp_t priority,
  
  		if (!try_module_get(prot->owner))

goto out_free_sec;
-   sk_tx_queue_clear(sk);
}
  
  	return sk;




Re: [PATCH] mlx4: style: replace zero-length array with flexible-array member.

2020-12-30 Thread Tariq Toukan




On 12/30/2020 8:28 AM, YANG LI wrote:

There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use "flexible array members"[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/
 deprecated.html#zero-length-and-one-element-arrays
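
A minimal before/after sketch of the two declaration styles (the struct names,
helper, and 'nr' field below are made up for illustration; struct_size() and
kzalloc() are the usual kernel helpers):

    /* Deprecated style: zero-length trailing array. */
    struct rx_desc_old {
            u32 nr;
            struct mlx4_wqe_data_seg data[0];
    };

    /* Preferred style: flexible array member. */
    struct rx_desc_new {
            u32 nr;
            struct mlx4_wqe_data_seg data[];
    };

    static struct rx_desc_new *rx_desc_alloc(u32 nr_entries)
    {
            struct rx_desc_new *desc;

            /* struct_size() sizes the header plus nr_entries trailing elements. */
            desc = kzalloc(struct_size(desc, data, nr_entries), GFP_KERNEL);
            if (desc)
                    desc->nr = nr_entries;
            return desc;
    }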

Signed-off-by: YANG LI 
Reported-by: Abaci 
---
  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index e8ed2319..4029a8b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -314,7 +314,7 @@ struct mlx4_en_tx_ring {
  
  struct mlx4_en_rx_desc {

/* actual number of entries depends on rx ring stride */
-   struct mlx4_wqe_data_seg data[0];
+   struct mlx4_wqe_data_seg data[];
  };
  
  struct mlx4_en_rx_ring {




Reviewed-by: Tariq Toukan 

Thanks.


Re: [patch 23/30] net/mlx5: Use effective interrupt affinity

2020-12-13 Thread Tariq Toukan




On 12/10/2020 9:25 PM, Thomas Gleixner wrote:

Using the interrupt affinity mask for checking locality is not really
working well on architectures which support effective affinity masks.

The affinity mask is either the system wide default or set by user space,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs.
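
The mask ends up feeding a locality check of roughly this shape in the napi
poll path (a paraphrased sketch with a made-up function name, not the exact
driver code):

    static bool mlx5e_napi_on_irq_cpu(struct mlx5e_channel *c)
    {
            /* With the effective mask, this reflects where the IRQ actually
             * fires, not merely what was requested via the affinity mask. */
            return cpumask_test_cpu(smp_processor_id(), c->aff_mask);
    }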

Signed-off-by: Thomas Gleixner 
Cc: Saeed Mahameed 
Cc: Leon Romanovsky 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: net...@vger.kernel.org
Cc: linux-r...@vger.kernel.org
---
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1998,7 +1998,7 @@ static int mlx5e_open_channel(struct mlx
c->num_tc   = params->num_tc;
c->xdp  = !!params->xdp_prog;
	c->stats    = &priv->channel_stats[ix].ch;
-   c->aff_mask = irq_get_affinity_mask(irq);
+   c->aff_mask = irq_get_effective_affinity_mask(irq);
c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix);
  
  	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);




Reviewed-by: Tariq Toukan 

Thanks.


Re: [patch 22/30] net/mlx5: Replace irq_to_desc() abuse

2020-12-13 Thread Tariq Toukan




On 12/10/2020 9:25 PM, Thomas Gleixner wrote:

No driver has any business with the internals of an interrupt
descriptor. Storing a pointer to it just to use yet another helper at the
actual usage site to retrieve the affinity mask is creative at best. Just
because C does not allow encapsulation does not mean that the kernel has no
limits.

Retrieve a pointer to the affinity mask itself and use that. It's still
using an interface which is usually not for random drivers, but definitely
less hideous than the previous hack.

Signed-off-by: Thomas Gleixner 
---
  drivers/net/ethernet/mellanox/mlx5/core/en.h  |2 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |6 +-
  3 files changed, 3 insertions(+), 7 deletions(-)



Reviewed-by: Tariq Toukan 

Thanks.


Re: [patch 21/30] net/mlx4: Use effective interrupt affinity

2020-12-13 Thread Tariq Toukan




On 12/10/2020 9:25 PM, Thomas Gleixner wrote:

Using the interrupt affinity mask for checking locality is not really
working well on architectures which support effective affinity masks.

The affinity mask is either the system wide default or set by user space,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs.

Signed-off-by: Thomas Gleixner 
Cc: Tariq Toukan 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: net...@vger.kernel.org
Cc: linux-r...@vger.kernel.org
---
  drivers/net/ethernet/mellanox/mlx4/en_cq.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -117,7 +117,7 @@ int mlx4_en_activate_cq(struct mlx4_en_p
assigned_eq = true;
}
irq = mlx4_eq_get_irq(mdev->dev, cq->vector);
-   cq->aff_mask = irq_get_affinity_mask(irq);
+   cq->aff_mask = irq_get_effective_affinity_mask(irq);
} else {
/* For TX we use the same irq per
ring we assigned for the RX    */



Reviewed-by: Tariq Toukan 

Thanks.


Re: [patch 20/30] net/mlx4: Replace irq_to_desc() abuse

2020-12-13 Thread Tariq Toukan




On 12/10/2020 9:25 PM, Thomas Gleixner wrote:

No driver has any business with the internals of an interrupt
descriptor. Storing a pointer to it just to use yet another helper at the
actual usage site to retrieve the affinity mask is creative at best. Just
because C does not allow encapsulation does not mean that the kernel has no
limits.

Retrieve a pointer to the affinity mask itself and use that. It's still
using an interface which is usually not for random drivers, but definitely
less hideous than the previous hack.

Signed-off-by: Thomas Gleixner 
Cc: Tariq Toukan 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: net...@vger.kernel.org
Cc: linux-r...@vger.kernel.org
---
  drivers/net/ethernet/mellanox/mlx4/en_cq.c   |8 +++-
  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |6 +-
  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |3 ++-
  3 files changed, 6 insertions(+), 11 deletions(-)



Reviewed-by: Tariq Toukan 

Thanks for your patch.


Re: [PATCH 044/141] net/mlx4: Fix fall-through warnings for Clang

2020-11-22 Thread Tariq Toukan




On 11/20/2020 8:31 PM, Gustavo A. R. Silva wrote:

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva 
---
  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c 
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 1187ef1375e2..e6b8b8dc7894 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -2660,6 +2660,7 @@ int mlx4_FREE_RES_wrapper(struct mlx4_dev *dev, int slave,
case RES_XRCD:
err = xrcdn_free_res(dev, slave, vhcr->op_modifier, alop,
 vhcr->in_param, &vhcr->out_param);
+   break;
  
  	default:

break;



Reviewed-by: Tariq Toukan 

Thanks for your patch.


Re: [PATCH] net/mlx4: Assign boolean values to a bool variable

2020-11-09 Thread Tariq Toukan




On 11/9/2020 9:07 AM, kaixuxia wrote:



On 2020/11/8 16:20, Tariq Toukan wrote:



On 11/7/2020 8:53 AM, xiakaixu1...@gmail.com wrote:

From: Kaixu Xia 

Fix the following coccinelle warnings:



Hi Kaixu,

Which coccinelle version gave this warning?


Hi Tariq,

The version is coccinelle-1.0.7.

Thanks,
Kaixu




./drivers/net/ethernet/mellanox/mlx4/en_rx.c:687:1-17: WARNING: Assignment of 
0/1 to bool variable

Reported-by: Tosk Robot 
Signed-off-by: Kaixu Xia 



Reviewed-by: Tariq Toukan 

Thanks,
Tariq



Re: [PATCH] net/mlx4: Assign boolean values to a bool variable

2020-11-08 Thread Tariq Toukan




On 11/7/2020 8:53 AM, xiakaixu1...@gmail.com wrote:

From: Kaixu Xia 

Fix the following coccinelle warnings:



Hi Kaixu,

Which coccinelle version gave this warning?



./drivers/net/ethernet/mellanox/mlx4/en_rx.c:687:1-17: WARNING: Assignment of 
0/1 to bool variable

Reported-by: Tosk Robot 
Signed-off-by: Kaixu Xia 
---
  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 502d1b97855c..b0f79a5151cf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -684,7 +684,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
xdp_prog = rcu_dereference(ring->xdp_prog);
	xdp.rxq = &ring->xdp_rxq;
xdp.frame_sz = priv->frag_info[0].frag_stride;
-   doorbell_pending = 0;
+   doorbell_pending = false;
  
  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx

 * descriptor offset can be deduced from the CQE index instead of





Re: [PATCH] net/mlx4_core : remove unneeded semicolon

2020-11-02 Thread Tariq Toukan




On 11/1/2020 4:05 PM, t...@redhat.com wrote:

From: Tom Rix 

A semicolon is not needed after a switch statement.

Signed-off-by: Tom Rix 
---
  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c 
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 1187ef1375e2..394f43add85c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -300,7 +300,7 @@ static const char *resource_str(enum mlx4_resource rt)
case RES_FS_RULE: return "RES_FS_RULE";
case RES_XRCD: return "RES_XRCD";
default: return "Unknown resource type !!!";
-   };
+   }
  }
  
  static void rem_slave_vlans(struct mlx4_dev *dev, int slave);




Reviewed-by: Tariq Toukan 

Thanks,
Tariq


[RFC PATCH] workqueue: Add support for exposing singlethread workqueues in sysfs

2020-10-06 Thread Tariq Toukan
Extend the workqueue API so that singlethread workqueues can be exposed
in sysfs. Guarantee max_active is 1 by turning it read-only.

This allows admins to control the cpumask of a workqueue, and apply
the desired system cpu separation/isolation policy.
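
Driver-side usage stays a one-liner; a sketch with a made-up queue name:

    struct workqueue_struct *wq;

    /* Same semantics as create_singlethread_workqueue(), but the queue is
     * also registered under /sys/devices/virtual/workqueue/<name>/ with a
     * writable cpumask and a read-only max_active (pinned to 1). */
    wq = create_singlethread_sysfs_workqueue("my_cmd_wq");
    if (!wq)
            return -ENOMEM;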

Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  2 +-
 include/linux/workqueue.h |  4 
 kernel/workqueue.c| 21 +--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 1d91a0d0ab1d..5106820a5cd5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -2033,7 +2033,7 @@ int mlx5_cmd_init(struct mlx5_core_dev *dev)
create_msg_cache(dev);
 
set_wqname(dev);
-   cmd->wq = create_singlethread_workqueue(cmd->wq_name);
+   cmd->wq = create_singlethread_sysfs_workqueue(cmd->wq_name);
if (!cmd->wq) {
mlx5_core_err(dev, "failed to create command workqueue\n");
err = -ENOMEM;
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 26de0cae2a0a..d4d4ca2b041a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -344,6 +344,7 @@ enum {
__WQ_ORDERED= 1 << 17, /* internal: workqueue is ordered */
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
__WQ_ORDERED_EXPLICIT   = 1 << 19, /* internal: 
alloc_ordered_workqueue() */
+   __WQ_MAX_ACTIVE_RO  = 1 << 20, /* internal: make max_active 
read-only */
 
WQ_MAX_ACTIVE   = 512,/* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU  = 4,  /* 4 * #cpus for unbound wq */
@@ -432,6 +433,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
WQ_MEM_RECLAIM, 1, (name))
 #define create_singlethread_workqueue(name)\
alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
+#define create_singlethread_sysfs_workqueue(name)  \
+   alloc_ordered_workqueue("%s", __WQ_MAX_ACTIVE_RO |  \
+   __WQ_LEGACY | WQ_MEM_RECLAIM, name)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c41c3c17b86a..a80d34726e68 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4258,6 +4258,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
flags |= WQ_UNBOUND;
 
+   if (flags & __WQ_MAX_ACTIVE_RO)
+   flags |= WQ_SYSFS;
+
/* allocate wq and format name */
if (flags & WQ_UNBOUND)
tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);
@@ -5401,9 +5404,11 @@ static ssize_t max_active_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(max_active);
 
+static struct device_attribute dev_attr_max_active_ro =
+   __ATTR(max_active, 0444, max_active_show, NULL);
+
 static struct attribute *wq_sysfs_attrs[] = {
	&dev_attr_per_cpu.attr,
-	&dev_attr_max_active.attr,
NULL,
 };
 ATTRIBUTE_GROUPS(wq_sysfs);
@@ -5642,6 +5647,7 @@ static void wq_device_release(struct device *dev)
  */
 int workqueue_sysfs_register(struct workqueue_struct *wq)
 {
+   const struct device_attribute *max_active_entry;
struct wq_device *wq_dev;
int ret;
 
@@ -5650,7 +5656,8 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
 * attributes breaks ordering guarantee.  Disallow exposing ordered
 * workqueues.
 */
-   if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
+   if (WARN_ON(!(wq->flags & __WQ_MAX_ACTIVE_RO) &&
+   wq->flags & __WQ_ORDERED_EXPLICIT))
return -EINVAL;
 
wq->wq_dev = wq_dev = kzalloc(sizeof(*wq_dev), GFP_KERNEL);
@@ -5675,6 +5682,16 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
return ret;
}
 
+   max_active_entry = wq->flags & __WQ_MAX_ACTIVE_RO ?
+   &dev_attr_max_active_ro : &dev_attr_max_active;
+
+   ret = device_create_file(&wq_dev->dev, max_active_entry);
+   if (ret) {
+   device_unregister(&wq_dev->dev);
+   wq->wq_dev = NULL;
+   return ret;
+   }
+
if (wq->flags & WQ_UNBOUND) {
struct device_attribute *attr;
 
-- 
2.21.0



Re: [PATCH net-next] net: ethernet: mlx4: Avoid assigning a value to ring_cons but not used it anymore in mlx4_en_xmit()

2020-09-15 Thread Tariq Toukan




On 9/14/2020 11:02 PM, Jakub Kicinski wrote:

On Sun, 13 Sep 2020 13:12:05 +0300 Tariq Toukan wrote:

2. When MLX4_EN_PERF_STAT is not defined, we should totally remove the
local variable declaration, not only its usage.


I was actually wondering about this when working on the pause stat
patch. Where is MLX4_EN_PERF_STAT ever defined?

$ git grep MLX4_EN_PERF_STAT
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h:#ifdef MLX4_EN_PERF_STAT
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h:#endif /* MLX4_EN_PERF_STAT */
drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h:#ifdef MLX4_EN_PERF_STAT



Good point.

This was introduced long ago, since day 1 of mlx4 driver.
I believe it had off-tree usage back then, not sure though...

Anyway, I don't find it useful anymore.
Should be removed. I'll prepare a cleanup patch for net-next.

Thanks,
Tariq


Re: [PATCH net-next] net: ethernet: mlx4: Avoid assigning a value to ring_cons but not used it anymore in mlx4_en_xmit()

2020-09-13 Thread Tariq Toukan




On 9/13/2020 4:22 AM, David Miller wrote:

From: Luo Jiaxing 
Date: Sat, 12 Sep 2020 16:08:15 +0800


We found a set but not used variable 'ring_cons' in mlx4_en_xmit(), it will
cause a warning when build the kernel. And after checking the commit record
of this function, we found that it was introduced by a previous patch.

So, We delete this redundant assignment code.

Fixes: 488a9b48e398 ("net/mlx4_en: Wake TX queues only when there's enough 
room")

Signed-off-by: Luo Jiaxing 


Looks good, applied, thanks.



Hi Luo,

I didn't get a chance to review it during the weekend.

The ring_cons local variable is used in line 903:
https://elixir.bootlin.com/linux/v5.9-rc4/source/drivers/net/ethernet/mellanox/mlx4/en_tx.c#L903

AVG_PERF_COUNTER depends on the compile-time definition of 
MLX4_EN_PERF_STAT. Otherwise it is a nop.
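
For reference, the pattern is roughly the following (a simplified sketch, not
the verbatim mlx4_stats.h definition; the macro body and the counter name in
the call site are assumptions):

    #ifdef MLX4_EN_PERF_STAT
    #define AVG_PERF_COUNTER(cnt, sample)   ((cnt) = ((cnt) + (sample)) / 2)
    #else
    #define AVG_PERF_COUNTER(cnt, sample)   do { } while (0)
    #endif

    /* With MLX4_EN_PERF_STAT unset, the line below expands to nothing and
     * ring_cons becomes set-but-unused, which is what the warning flags: */
    AVG_PERF_COUNTER(priv->pstats.inflight_avg, ring->prod - ring_cons);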


1. Your patch causes a degradation to the case when MLX4_EN_PERF_STAT is 
defined.
2. When MLX4_EN_PERF_STAT is not defined, we should totally remove the 
local variable declaration, not only its usage.


Please let me know if you're planning to fix this. Otherwise I'll do.

Regards,
Tariq


Re: [PATCH][next] net/mlx5e: fix memory leak of tls

2020-06-30 Thread Tariq Toukan




On 6/30/2020 6:16 PM, Colin King wrote:

From: Colin Ian King 

The error return path when create_singlethread_workqueue fails currently
does not kfree tls and leads to a memory leak. Fix this by kfree'ing
tls before returning -ENOMEM.

Addresses-Coverity: ("Resource leak")
Fixes: 1182f3659357 ("net/mlx5e: kTLS, Add kTLS RX HW offload support")
Signed-off-by: Colin Ian King 
---
  drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index 99beb928feff..fee991f5ee7c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -232,8 +232,10 @@ int mlx5e_tls_init(struct mlx5e_priv *priv)
return -ENOMEM;
  
  	tls->rx_wq = create_singlethread_workqueue("mlx5e_tls_rx");

-   if (!tls->rx_wq)
+   if (!tls->rx_wq) {
+   kfree(tls);
return -ENOMEM;
+   }
  
  	priv->tls = tls;

return 0;



Reviewed-by: Tariq Toukan 
Thanks.


Re: [PATCH][next] net/tls: fix sign extension issue when left shifting u16 value

2020-06-30 Thread Tariq Toukan




On 6/30/2020 5:27 PM, Colin King wrote:

From: Colin Ian King 

Left shifting the u16 value promotes it to a int and then it
gets sign extended to a u64.  If len << 16 is greater than 0x7fffffff
then the upper bits get set to 1 because of the implicit sign extension.
Fix this by casting len to u64 before shifting it.
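
A tiny stand-alone illustration of the promotion at work (userspace example,
not from the kernel tree):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint16_t len = 0xffff;

            /* len is promoted to (signed) int; shifting into bit 31 yields a
             * negative int, which then sign-extends when widened to 64 bits. */
            uint64_t bad  = len << 16;
            uint64_t good = (uint64_t)len << 16;

            printf("bad=%#llx good=%#llx\n",
                   (unsigned long long)bad, (unsigned long long)good);
            /* prints bad=0xffffffffffff0000 good=0xffff0000 */
            return 0;
    }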

Addresses-Coverity: ("integer handling issues")
Fixes: ed9b7646b06a ("net/tls: Add asynchronous resync")
Signed-off-by: Colin Ian King 
---
  include/net/tls.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index c875c0a445a6..e5dac7e74e79 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -637,7 +637,7 @@ tls_offload_rx_resync_async_request_start(struct sock *sk, 
__be32 seq, u16 len)
struct tls_offload_context_rx *rx_ctx = tls_offload_ctx_rx(tls_ctx);
  
  	atomic64_set(&rx_ctx->resync_async->req, ((u64)ntohl(seq) << 32) |

-(len << 16) | RESYNC_REQ | RESYNC_REQ_ASYNC);
+((u64)len << 16) | RESYNC_REQ | RESYNC_REQ_ASYNC);
rx_ctx->resync_async->loglen = 0;
  }
  



Reviewed-by: Tariq Toukan 
Thanks!


Re: [PATCH v2] net/mlx4_en: fix a memory leak bug

2019-08-13 Thread Tariq Toukan


On 8/12/2019 10:11 PM, Wenwen Wang wrote:
> In mlx4_en_config_rss_steer(), 'rss_map->indir_qp' is allocated through
> kzalloc(). After that, mlx4_qp_alloc() is invoked to configure RSS
> indirection. However, if mlx4_qp_alloc() fails, the allocated
> 'rss_map->indir_qp' is not deallocated, leading to a memory leak bug.
> 
> To fix the above issue, add the 'qp_alloc_err' label to free
> 'rss_map->indir_qp'.
> 
> Fixes: 4931c6ef04b4 ("net/mlx4_en: Optimized single ring steering")
> 
> Signed-off-by: Wenwen Wang 
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6c01314..db3552f 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1187,7 +1187,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
>   err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, rss_map->indir_qp);
>   if (err) {
>   en_err(priv, "Failed to allocate RSS indirection QP\n");
> - goto rss_err;
> + goto qp_alloc_err;
>   }
>   
>   rss_map->indir_qp->event = mlx4_en_sqp_event;
> @@ -1241,6 +1241,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
>  MLX4_QP_STATE_RST, NULL, 0, 0, rss_map->indir_qp);
>   mlx4_qp_remove(mdev->dev, rss_map->indir_qp);
>   mlx4_qp_free(mdev->dev, rss_map->indir_qp);
> +qp_alloc_err:
>   kfree(rss_map->indir_qp);
>   rss_map->indir_qp = NULL;
>   rss_err:
> 

Thanks for your patch.
Reviewed-by: Tariq Toukan 


Re: [PATCH] net/mlx4_en: fix a memory leak bug

2019-08-12 Thread Tariq Toukan
Hi Wenwen,

Thanks for your patch.

On 8/12/2019 9:36 AM, Wenwen Wang wrote:
> In mlx4_en_config_rss_steer(), 'rss_map->indir_qp' is allocated through
> kzalloc(). After that, mlx4_qp_alloc() is invoked to configure RSS
> indirection. However, if mlx4_qp_alloc() fails, the allocated
> 'rss_map->indir_qp' is not deallocated, leading to a memory leak bug.
> 
> To fix the above issue, add the 'mlx4_err' label to free
> 'rss_map->indir_qp'.
> 

Add a Fixes line.

> Signed-off-by: Wenwen Wang 
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6c01314..9476dbd 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1187,7 +1187,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
>   err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, rss_map->indir_qp);
>   if (err) {
>   en_err(priv, "Failed to allocate RSS indirection QP\n");
> - goto rss_err;
> + goto mlx4_err;
>   }
>   
>   rss_map->indir_qp->event = mlx4_en_sqp_event;
> @@ -1241,6 +1241,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
>  MLX4_QP_STATE_RST, NULL, 0, 0, rss_map->indir_qp);
>   mlx4_qp_remove(mdev->dev, rss_map->indir_qp);
>   mlx4_qp_free(mdev->dev, rss_map->indir_qp);
> +mlx4_err:

I don't like the label name. It's too general and not informative.
Maybe qp_alloc_err?

>   kfree(rss_map->indir_qp);
>   rss_map->indir_qp = NULL;
>   rss_err:
> 


Re: [PATCH v2] net/mlx5e: always initialize frag->last_in_page

2019-08-04 Thread Tariq Toukan


On 8/1/2019 4:52 PM, Qian Cai wrote:
> The commit 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue
> memory scheme") introduced an undefined behaviour below due to
> "frag->last_in_page" is only initialized in mlx5e_init_frags_partition()
> when,
> 
> if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE)
> 
> or after bailed out the loop,
> 
> for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++)
> 
> As the result, there could be some "frag" have uninitialized
> value of "last_in_page".
> 
> Later, get_frag() obtains those "frag" and check "frag->last_in_page" in
> mlx5e_put_rx_frag() and triggers the error during boot. Fix it by always
> initializing "frag->last_in_page" to "false" in
> mlx5e_init_frags_partition().
> 
> UBSAN: Undefined behaviour in
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:325:12
> load of value 170 is not a valid value for type 'bool' (aka '_Bool')
> Call trace:
>   dump_backtrace+0x0/0x264
>   show_stack+0x20/0x2c
>   dump_stack+0xb0/0x104
>   __ubsan_handle_load_invalid_value+0x104/0x128
>   mlx5e_handle_rx_cqe+0x8e8/0x12cc [mlx5_core]
>   mlx5e_poll_rx_cq+0xca8/0x1a94 [mlx5_core]
>   mlx5e_napi_poll+0x17c/0xa30 [mlx5_core]
>   net_rx_action+0x248/0x940
>   __do_softirq+0x350/0x7b8
>   irq_exit+0x200/0x26c
>   __handle_domain_irq+0xc8/0x128
>   gic_handle_irq+0x138/0x228
>   el1_irq+0xb8/0x140
>   arch_cpu_idle+0x1a4/0x348
>   do_idle+0x114/0x1b0
>   cpu_startup_entry+0x24/0x28
>   rest_init+0x1ac/0x1dc
>   arch_call_rest_init+0x10/0x18
>   start_kernel+0x4d4/0x57c
> 
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory 
> scheme")
> Signed-off-by: Qian Cai 
> ---
> 
> v2: zero-init the whole struct instead per Tariq.
> 
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 ++---
>   1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 47eea6b3a1c3..e1810c03a510 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -331,12 +331,11 @@ static inline u64 mlx5e_get_mpwqe_offset(struct 
> mlx5e_rq *rq, u16 wqe_ix)
>   
>   static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
>   {
> - struct mlx5e_wqe_frag_info next_frag, *prev;
> + struct mlx5e_wqe_frag_info next_frag = {};
> +     struct mlx5e_wqe_frag_info *prev = NULL;
>   int i;
>   
> next_frag.di = &rq->wqe.di[0];
> - next_frag.offset = 0;
> - prev = NULL;
>   
>   for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {
>   struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
> 

Reviewed-by: Tariq Toukan 

Thanks.


Re: [PATCH] net/mlx5e: always initialize frag->last_in_page

2019-08-01 Thread Tariq Toukan


On 7/31/2019 10:02 PM, Qian Cai wrote:
> The commit 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue
> memory scheme") introduced an undefined behaviour below due to
> "frag->last_in_page" is only initialized in
> mlx5e_init_frags_partition() when,
> 
> if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE)
> 
> or after bailed out the loop,
> 
> for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++)
> 
> As the result, there could be some "frag" have uninitialized
> value of "last_in_page".
> 
> Later, get_frag() obtains those "frag" and check "frag->last_in_page" in
> mlx5e_put_rx_frag() and triggers the error during boot. Fix it by always
> initializing "frag->last_in_page" to "false" in
> mlx5e_init_frags_partition().
> 
> UBSAN: Undefined behaviour in
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:325:12
> load of value 170 is not a valid value for type 'bool' (aka '_Bool')
> Call trace:
>   dump_backtrace+0x0/0x264
>   show_stack+0x20/0x2c
>   dump_stack+0xb0/0x104
>   __ubsan_handle_load_invalid_value+0x104/0x128
>   mlx5e_handle_rx_cqe+0x8e8/0x12cc [mlx5_core]
>   mlx5e_poll_rx_cq+0xca8/0x1a94 [mlx5_core]
>   mlx5e_napi_poll+0x17c/0xa30 [mlx5_core]
>   net_rx_action+0x248/0x940
>   __do_softirq+0x350/0x7b8
>   irq_exit+0x200/0x26c
>   __handle_domain_irq+0xc8/0x128
>   gic_handle_irq+0x138/0x228
>   el1_irq+0xb8/0x140
>   arch_cpu_idle+0x1a4/0x348
>   do_idle+0x114/0x1b0
>   cpu_startup_entry+0x24/0x28
>   rest_init+0x1ac/0x1dc
>   arch_call_rest_init+0x10/0x18
>   start_kernel+0x4d4/0x57c
> 
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory 
> scheme")
> Signed-off-by: Qian Cai 
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 47eea6b3a1c3..96f5110a9b43 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -336,6 +336,7 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq 
> *rq)
>   
>   next_frag.di = &rq->wqe.di[0];
>   next_frag.offset = 0;
> + next_frag.last_in_page = false;
>   prev = NULL;
>   
> for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {
> 

Thanks Qian.
Please zero-init the whole struct instead:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1f433a06e637..55f4f5cc1d8f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -312,11 +312,10 @@ static inline u64 mlx5e_get_mpwqe_offset(struct 
mlx5e_rq *rq, u16 wqe_ix)

  static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
  {
-   struct mlx5e_wqe_frag_info next_frag, *prev;
+   struct mlx5e_wqe_frag_info next_frag = {}, *prev;
 int i;

  next_frag.di = &rq->wqe.di[0];
-   next_frag.offset = 0;
 prev = NULL;

  for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {


Thanks,
Tariq


Re: linux-next: Fixes tag needs some work in the net tree

2019-07-14 Thread Tariq Toukan



On 7/14/2019 3:15 PM, Stephen Rothwell wrote:
> Hi Tariq,
> 
> On Sun, 14 Jul 2019 07:55:48 +0000 Tariq Toukan  wrote:
>>
>> How do you think we should handle this?
> 
> Dave doesn't rebase his trees, so all you can really do is learn from
> it and not do it again :-)
> 

Sure.
My bad, used the SHA1 I had on my internal branch.

Thanks,
Tariq


Re: linux-next: Fixes tag needs some work in the net tree

2019-07-14 Thread Tariq Toukan



On 7/12/2019 9:50 AM, Stephen Rothwell wrote:
> Hi all,
> 
> In commit
> 
>c93dfec10f1d ("net/mlx5e: Fix compilation error in TLS code")
> 
> Fixes tag
> 
>Fixes: 90687e1a9a50 ("net/mlx5: Kconfig, Better organize compilation 
> flags")
> 
> has these problem(s):
> 
>- Target SHA1 does not exist
> 
> Did you mean
> 
> Fixes: e2869fb2068b ("net/mlx5: Kconfig, Better organize compilation flags")
> 

Right.
Thank you Stephen!
How do you think we should handle this?

Tariq


Re: [PATCH] [net-next] net/mlx5e: avoid uninitialized variable use

2019-07-10 Thread Tariq Toukan


On 7/10/2019 4:06 PM, Arnd Bergmann wrote:
> clang points to a variable being used in an unexpected
> code path:
> 
> drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:251:2: warning: 
> variable 'rec_seq_sz' is used uninitialized whenever switch default is taken 
> [-Wsometimes-uninitialized]
>  default:
>  ^~~
> drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:255:46: note: 
> uninitialized use occurs here
 skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
>  ^~
> 
>  From looking at the function logic, it seems that there is no
> sensible way to continue here, so just return early and hope
> for the best.
> 
> Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
> Signed-off-by: Arnd Bergmann 
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> index 3f5f4317a22b..5c08891806f0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> @@ -250,6 +250,7 @@ tx_post_resync_params(struct mlx5e_txqsq *sq,
>   }
>   default:
>   WARN_ON(1);
> +     return;
>   }
>   
  	skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
> 

Reviewed-by: Tariq Toukan 

Thanks!


Re: [PATCH v3 15/27] net: use zeroing allocator rather than allocator followed by memset zero

2019-07-07 Thread Tariq Toukan


On 7/2/2019 10:58 AM, Fuqian Huang wrote:
> Replace allocator followed by memset with 0 with zeroing allocator.
> 
> Signed-off-by: Fuqian Huang 
> ---
> Changes in v3:
>- Resend
> 
>   drivers/net/eql.c   | 3 +--
>   drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c | 4 +---
>   drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c | 4 +---
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c  | 3 +--
>   4 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/eql.c b/drivers/net/eql.c
> index 74263f8efe1a..2f101a6036e6 100644
> --- a/drivers/net/eql.c
> +++ b/drivers/net/eql.c
> @@ -419,14 +419,13 @@ static int eql_enslave(struct net_device *master_dev, 
> slaving_request_t __user *
>   if ((master_dev->flags & IFF_UP) == IFF_UP) {
>   /* slave is not a master & not already a slave: */
>   if (!eql_is_master(slave_dev) && !eql_is_slave(slave_dev)) {
> - slave_t *s = kmalloc(sizeof(*s), GFP_KERNEL);
> + slave_t *s = kzalloc(sizeof(*s), GFP_KERNEL);
>   equalizer_t *eql = netdev_priv(master_dev);
>   int ret;
>   
>   if (!s)
>   return -ENOMEM;
>   
> - memset(s, 0, sizeof(*s));
>   s->dev = slave_dev;
>   s->priority = srq.priority;
>   s->priority_bps = srq.priority;
> diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c 
> b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
> index 43d11c38b38a..cf3835da32c8 100644
> --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
> +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
> @@ -719,12 +719,10 @@ static int cn23xx_setup_pf_mbox(struct octeon_device 
> *oct)
>   for (i = 0; i < oct->sriov_info.max_vfs; i++) {
>   q_no = i * oct->sriov_info.rings_per_vf;
>   
> - mbox = vmalloc(sizeof(*mbox));
> + mbox = vzalloc(sizeof(*mbox));
>   if (!mbox)
>   goto free_mbox;
>   
> - memset(mbox, 0, sizeof(struct octeon_mbox));
> -
>   spin_lock_init(&mbox->lock);
>   
>   mbox->oct_dev = oct;
> diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c 
> b/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c
> index fda49404968c..b3bd2767d3dd 100644
> --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c
> +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c
> @@ -279,12 +279,10 @@ static int cn23xx_setup_vf_mbox(struct octeon_device 
> *oct)
>   {
>   struct octeon_mbox *mbox = NULL;
>   
> - mbox = vmalloc(sizeof(*mbox));
> + mbox = vzalloc(sizeof(*mbox));
>   if (!mbox)
>   return 1;
>   
> - memset(mbox, 0, sizeof(struct octeon_mbox));
> -
>   spin_lock_init(&mbox->lock);
>   
>   mbox->oct_dev = oct;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6c01314e87b0..f1dff5c47676 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1062,7 +1062,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
> *priv, int qpn,
>   struct mlx4_qp_context *context;
>   int err = 0;
>   
> - context = kmalloc(sizeof(*context), GFP_KERNEL);
> + context = kzalloc(sizeof(*context), GFP_KERNEL);
>   if (!context)
>   return -ENOMEM;
>   
> @@ -1073,7 +1073,6 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
> *priv, int qpn,
>   }
>   qp->event = mlx4_en_sqp_event;
>   
> - memset(context, 0, sizeof(*context));
>   mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0,
>   qpn, ring->cqn, -1, context);
>   context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
> 

Reviewed-by: Tariq Toukan 


Re: [PATCH v2 15/27] net: use zeroing allocator rather than allocator followed by memset zero

2019-06-30 Thread Tariq Toukan


On 6/28/2019 5:48 AM, Fuqian Huang wrote:
> Replace allocator followed by memset with 0 with zeroing allocator.
> 
> Signed-off-by: Fuqian Huang 
> ---

..


> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1062,7 +1062,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
> *priv, int qpn,
>   struct mlx4_qp_context *context;
>   int err = 0;
>   
> - context = kmalloc(sizeof(*context), GFP_KERNEL);
> + context = kzalloc(sizeof(*context), GFP_KERNEL);
>   if (!context)
>   return -ENOMEM;
>   
> @@ -1073,7 +1073,6 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
> *priv, int qpn,
>   }
>   qp->event = mlx4_en_sqp_event;
>   
> - memset(context, 0, sizeof(*context));
>   mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0,
>   qpn, ring->cqn, -1, context);
>   context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
> 


For the mlx4 part:
Reviewed-by: Tariq Toukan 

Tariq


Re: [PATCH 87/87] ethernet: mlx4: remove memset after dma_alloc_coherent

2019-06-30 Thread Tariq Toukan


On 6/27/2019 8:42 PM, Fuqian Huang wrote:
> In commit af7ddd8a627c
> ("Merge tag 'dma-mapping-4.21' of 
> git://git.infradead.org/users/hch/dma-mapping"),
> dma_alloc_coherent has already zeroed the memory.
> So memset is not needed.
> 
> Signed-off-by: Fuqian Huang 
> ---
>   drivers/net/ethernet/mellanox/mlx4/eq.c | 2 --
>   1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c 
> b/drivers/net/ethernet/mellanox/mlx4/eq.c
> index a5be27772b8e..c790a5fcea73 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
> @@ -1013,8 +1013,6 @@ static int mlx4_create_eq(struct mlx4_dev *dev, int 
> nent,
>   
>   dma_list[i] = t;
>   eq->page_list[i].map = t;
> -
> - memset(eq->page_list[i].buf, 0, PAGE_SIZE);
>       }
>   
>   eq->eqn = mlx4_bitmap_alloc(&priv->eq_table.bitmap);
> 

Reviewed-by: Tariq Toukan 

Thanks.


Re: [PATCH 1/8] net/mlx4: use kzalloc instead of kmalloc

2019-02-28 Thread Tariq Toukan


On 2/27/2019 8:09 AM, Robert Eshleman wrote:
> This patch replaces a kmalloc/memset(,0) call with a call to kzalloc.
> It also removes a memset(,0) call that always follows a *zalloc call.
> 
> Signed-off-by: Robert Eshleman 
> ---
>   drivers/net/ethernet/mellanox/mlx4/cmd.c   | 1 -
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 +--
>   2 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
> b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> index e65bc3c95630..7bfa6e850e5f 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> @@ -3307,7 +3307,6 @@ int mlx4_get_counter_stats(struct mlx4_dev *dev, int 
> counter_index,
>   if (IS_ERR(mailbox))
>   return PTR_ERR(mailbox);
>   
> - memset(mailbox->buf, 0, sizeof(struct mlx4_counter));
>   if_stat_in_mod = counter_index;
>   if (reset)
>   if_stat_in_mod |= MLX4_QUERY_IF_STAT_RESET;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 9a0881cb7f51..f55805d206ef 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1044,7 +1044,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
> *priv, int qpn,
>   struct mlx4_qp_context *context;
>   int err = 0;
>   
> - context = kmalloc(sizeof(*context), GFP_KERNEL);
> + context = kzalloc(sizeof(*context), GFP_KERNEL);
>   if (!context)
>   return -ENOMEM;
>   
> @@ -1055,7 +1055,6 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
> *priv, int qpn,
>   }
>   qp->event = mlx4_en_sqp_event;
>   
> - memset(context, 0, sizeof(*context));
>   mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0,
>   qpn, ring->cqn, -1, context);
>   context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
> 

Hi,

Thanks for your patch.
It looks good, but misses one similar point you might want to cover:

in drivers/net/ethernet/mellanox/mlx4/port.c:1780 :
memset(context, 0, sizeof(*context));

Tariq


Re: [PATCH] net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"

2019-02-19 Thread Tariq Toukan


On 2/18/2019 10:08 PM, David Miller wrote:
> From: Colin King 
> Date: Sun, 17 Feb 2019 23:03:31 +
> 
>> From: Colin Ian King 
>>
>> There is a spelling mistake in a en_err error message. Fix it.
>>
>> Signed-off-by: Colin Ian King 
> 
> Applied, thanks Colin.
> 
> And I agree that this doesn't really deserve a Fixes: tag.
> 
> Fixes: tags should really be for changes that introduce truly
> functional bugs.
> 
> And that could even be applied in this case _iff_ the string
> was essential in some way for userland tools which parse the
> output or similar.  But that is not the case here.
> 

Thanks for the clarification.

> Anyways, thanks.
> 


Re: [PATCH] net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"

2019-02-18 Thread Tariq Toukan


On 2/18/2019 12:25 PM, Dan Carpenter wrote:
> On Mon, Feb 18, 2019 at 09:37:22AM +0000, Tariq Toukan wrote:
>>
>>
>> On 2/18/2019 1:03 AM, Colin King wrote:
>>> From: Colin Ian King 
>>>
>>> There is a spelling mistake in a en_err error message. Fix it.
>>>
>>> Signed-off-by: Colin Ian King 
>>> ---
>>>drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +-
>>>1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
>>> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> index 6b1b8e35..c1438ae52a11 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> @@ -3360,7 +3360,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int 
>>> port,
>>> dev->addr_len = ETH_ALEN;
>>> mlx4_en_u64_to_mac(dev->dev_addr, mdev->dev->caps.def_mac[priv->port]);
>>> if (!is_valid_ether_addr(dev->dev_addr)) {
>>> -   en_err(priv, "Port: %d, invalid mac burned: %pM, quiting\n",
>>> +   en_err(priv, "Port: %d, invalid mac burned: %pM, quitting\n",
>>>priv->port, dev->dev_addr);
>>> err = -EINVAL;
>>> goto out;
>>>
>>
>> Hi Colin, thanks for your patch.
>>
>> Reviewed-by: Tariq Toukan 
>>
>> I would suggest adding a Fixes line, but looking into the history of the
>> typo, it went through many patches that modified this line but preserved
>> the typo.
>> Actually, it dates back to the very first commit that introduces mlx4
>> driver:
>>
>> Patches history:
>> 2b3ddf27f48c net/mlx4_core: Replace VF zero mac with random mac in mlx4_core
>> ef96f7d46ad8 net/mlx4_en: Handle unassigned VF MAC address correctly
>> 6bbb6d99f3d2 net/mlx4_en: Optimize Rx fast path filter checks
>> 453a60827735 mlx4_en: Giving interface name in debug messages
>> c27a02cd94d6 mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC
>>
>> I'm not sure what the "Fixes:" policy is in these cases.
> 
> I wouldn't necessarily put a Fixes tag on this, because does fixing the
> spelling really count as a bugfix?  It's borderline whether it's a fix
> or a cleanup.
> 
> regards,
> dan carpenter
> 

Thanks Dan, I'm fine with that.


Re: [PATCH] net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"

2019-02-18 Thread Tariq Toukan


On 2/18/2019 1:03 AM, Colin King wrote:
> From: Colin Ian King 
> 
> There is a spelling mistake in a en_err error message. Fix it.
> 
> Signed-off-by: Colin Ian King 
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 6b1b8e35..c1438ae52a11 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -3360,7 +3360,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int 
> port,
>   dev->addr_len = ETH_ALEN;
>   mlx4_en_u64_to_mac(dev->dev_addr, mdev->dev->caps.def_mac[priv->port]);
>   if (!is_valid_ether_addr(dev->dev_addr)) {
> - en_err(priv, "Port: %d, invalid mac burned: %pM, quiting\n",
> + en_err(priv, "Port: %d, invalid mac burned: %pM, quitting\n",
>  priv->port, dev->dev_addr);
>   err = -EINVAL;
>   goto out;
> 

Hi Colin, thanks for your patch.

Reviewed-by: Tariq Toukan 

I would suggest adding a Fixes line, but looking into the history of the 
typo, it went through many patches that modified this line but preserved 
the typo.
Actually, it dates back to the very first commit that introduces mlx4 
driver:

Patches history:
2b3ddf27f48c net/mlx4_core: Replace VF zero mac with random mac in mlx4_core
ef96f7d46ad8 net/mlx4_en: Handle unassigned VF MAC address correctly
6bbb6d99f3d2 net/mlx4_en: Optimize Rx fast path filter checks
453a60827735 mlx4_en: Giving interface name in debug messages
c27a02cd94d6 mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC

I'm not sure what the "Fixes:" policy is in these cases.

Thanks,
Tariq


Re: [PATCH net-next] net/mlx4: Mark expected switch fall-through

2019-01-23 Thread Tariq Toukan


On 1/23/2019 10:05 AM, Gustavo A. R. Silva wrote:
> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> This patch fixes the following warning:
> 
> drivers/net/ethernet/mellanox/mlx4/eq.c: In function ‘mlx4_eq_int’:
> drivers/net/ethernet/mellanox/mlx4/mlx4.h:219:5: warning: this statement may 
> fall through [-Wimplicit-fallthrough=]
>if (mlx4_debug_level)  \
>   ^
> drivers/net/ethernet/mellanox/mlx4/eq.c:558:4: note: in expansion of macro 
> ‘mlx4_dbg’
>  mlx4_dbg(dev, "%s: MLX4_EVENT_TYPE_SRQ_LIMIT. srq_no=0x%x, eq 0x%x\n",
>  ^~~~
> drivers/net/ethernet/mellanox/mlx4/eq.c:561:3: note: here
> case MLX4_EVENT_TYPE_SRQ_CATAS_ERROR:
> ^~~~
> 
> Warning level 3 was used: -Wimplicit-fallthrough=3
> 
> This patch is part of the ongoing efforts to enabling
> -Wimplicit-fallthrough.
> 
> Signed-off-by: Gustavo A. R. Silva 
> ---
>   drivers/net/ethernet/mellanox/mlx4/eq.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c 
> b/drivers/net/ethernet/mellanox/mlx4/eq.c
> index 4953c852c247..2f4201023836 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
> @@ -558,6 +558,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct 
> mlx4_eq *eq)
>   mlx4_dbg(dev, "%s: MLX4_EVENT_TYPE_SRQ_LIMIT. 
> srq_no=0x%x, eq 0x%x\n",
>__func__, be32_to_cpu(eqe->event.srq.srqn),
>eq->eqn);
> + /* fall through */
>   case MLX4_EVENT_TYPE_SRQ_CATAS_ERROR:
>   if (mlx4_is_master(dev)) {
>   /* forward only to slave owning the SRQ */
> 


Reviewed-by: Tariq Toukan 
Thanks for your patch.


Re: [PATCH v2 RESEND 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()

2018-11-19 Thread Tariq Toukan


On 19/11/2018 3:48 PM, Aaron Lu wrote:
> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
> 
> Paweł Staszewski recently shared his result of 'how Linux kernel
> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> found the lock contention comes from page allocator:
> 
>mlx5e_poll_tx_cq
>|
> --16.34%--napi_consume_skb
>   |
>   |--12.65%--__free_pages_ok
>   |  |
>   |   --11.86%--free_one_page
>   | |
>   | |--10.10%--queued_spin_lock_slowpath
>   | |
>   |  --0.65%--_raw_spin_lock
>   |
>   |--1.55%--page_frag_free
>   |
>--1.44%--skb_release_data
> 
> Jesper explained how it happened: mlx5 driver RX-page recycle
> mechanism is not effective in this workload and pages have to go
> through the page allocator. The lock contention happens during
> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> up at these speeds.[2]
> 
> I thought that __free_pages_ok() are mostly freeing high order
> pages and thought this is an lock contention for high order pages
> but Jesper explained in detail that __free_pages_ok() here are
> actually freeing order-0 pages because mlx5 is using order-0 pages
> to satisfy its page pool allocation request.[3]
> 
> The free path as pointed out by Jesper is:
> skb_free_head()
>-> skb_free_frag()
>  -> page_frag_free()
> And the pages being freed on this path are order-0 pages.
> 
> Fix this by doing similar things as in __page_frag_cache_drain() -
> send the being freed page to PCP if it's an order-0 page, or
> directly to Buddy if it is a high order page.
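
In code, the described change amounts to something like the following in
page_frag_free() (a sketch of the approach from the changelog; the exact merged
patch may differ slightly — free_unref_page() is the order-0 per-cpu free path):

    void page_frag_free(void *addr)
    {
            struct page *page = virt_to_head_page(addr);

            if (unlikely(put_page_testzero(page))) {
                    unsigned int order = compound_order(page);

                    if (order == 0)         /* order-0: free via per-cpu pages */
                            free_unref_page(page);
                    else
                            __free_pages_ok(page, order);
            }
    }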
> 
> With this change, Paweł hasn't noticed lock contention yet in
> his workload and Jesper has noticed a 7% performance improvement
> using a micro benchmark and lock contention is gone. Ilias' test
> on a 'low' speed 1Gbit interface on an cortex-a53 shows ~11%
> performance boost testing with 64byte packets and __free_pages_ok()
> disappeared from perf top.
> 
> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> 
> Reported-by: Paweł Staszewski 
> Analysed-by: Jesper Dangaard Brouer 
> Acked-by: Vlastimil Babka 
> Acked-by: Mel Gorman 
> Acked-by: Jesper Dangaard Brouer 
> Acked-by: Ilias Apalodimas 
> Tested-by: Ilias Apalodimas 
> Acked-by: Alexander Duyck 
> Acked-by: Tariq Toukan 
Please add the missing '>' sign in my email tag.

> Signed-off-by: Aaron Lu 
> ---
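
A minimal sketch of the idea described above, as it would sit in
mm/page_alloc.c (assuming free_unref_page() is the per-CPU free helper of
that kernel generation; this is an illustration, not the exact patch):

/* Route order-0 pages through the per-cpu lists instead of
 * __free_pages_ok(), mirroring what __page_frag_cache_drain() does.
 */
void page_frag_free(void *addr)
{
	struct page *page = virt_to_head_page(addr);

	if (unlikely(put_page_testzero(page))) {
		unsigned int order = compound_order(page);

		if (order == 0)		/* the common case on this path */
			free_unref_page(page);
		else
			__free_pages_ok(page, order);
	}
}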


Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-12 Thread Tariq Toukan




On 12/07/2018 4:55 PM, Jesper Dangaard Brouer wrote:

On Thu, 12 Jul 2018 14:54:08 +0200
Michal Hocko  wrote:


[CC Jesper - I remember he was really concerned about the worst case
  latencies for highspeed network workloads.]


Cc. Tariq as he has hit some networking benchmarks (around 100Gbit/s),
where we are contending on the page allocator lock, in a CPU scaling
netperf test AFAIK.  I also have some special-case micro-benchmarks
where I can hit it, but it's a micro-bench...



Thanks! Looks good.

Indeed, I simulated the page allocation rate of a 200Gbps NIC, and hit a 
major PCP/buddy bottleneck, where spinning the zonelock took up to 80% 
CPU, with dramatic BW degradation.


The test ran a relatively small number of TCP streams (4-16) with an 
unpinned application (iperf).


Larger batching reduces the contention on the zone lock and improves the 
CPU util. I also considered increasing the percpu_pagelist_fraction to a 
larger value (thought of 512, see patch below), which also affects the 
batch size (in pageset_set_high_and_batch).


As far as I see it, to totally solve the page allocation bottleneck for 
the increasing networking speeds, the following is still required:
1) optimize order-0 allocations (even at the cost of higher-order 
allocations).

2) bulking API for page allocations.
3) do SKB remote-release (on the originating core).

Regards,
Tariq

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 697ef8c225df..88763bd716a5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -741,9 +741,9 @@ of hot per cpu pagelists.  User can specify a number 
like 100 to allocate
 The batch value of each per cpu pagelist is also updated as a result. 
It is

 set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

-The initial value is zero.  Kernel does not use this value at boot time 
to set

+The initial value is 512.  Kernel uses this value at boot time to set
 the high water marks for each per cpu page list.  If the user writes 
'0' to this

-sysctl, it will revert to this default behavior.
+sysctl, it will revert to a behavior based on batchsize calculation.

 ==

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..c88e8eb50bcb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,7 +129,7 @@
 unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;

-int percpu_pagelist_fraction;
+int percpu_pagelist_fraction = 512;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

 /*
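
For a feel of the numbers, a small standalone sketch (plain userspace C, not
kernel code) of the pcp->high and batch values this sysctl implies, following
the high = managed_pages / fraction and batch = high / 4 rule quoted above:

#include <stdio.h>

#define PAGE_SHIFT 12	/* assuming 4 KB pages */

int main(void)
{
	unsigned long managed_pages = 4UL << 20;	/* example zone: 16 GB of 4 KB pages */
	unsigned long fraction = 512;			/* the value suggested above */
	unsigned long high = managed_pages / fraction;
	unsigned long batch = high / 4;

	if (batch > PAGE_SHIFT * 8)	/* upper limit quoted from the sysctl doc */
		batch = PAGE_SHIFT * 8;

	printf("pcp->high = %lu pages, batch = %lu pages\n", high, batch);
	return 0;
}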


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Tariq Toukan



On 28/05/2018 1:53 PM, Thomas Gleixner wrote:

On Fri, 25 May 2018, Song Liu wrote:

On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <t...@linutronix.de> wrote:

On Wed, 23 May 2018, Tariq Toukan wrote:

I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.


Could you please enable the vector and irq matrix trace points and capture
the trace when this happens?


Does the patch below fix it?

Thanks,

tglx

8<---
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index bb6f7a2148d7..54af3d4884b1 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, 
unsigned int newvec,
 * prev_vector for this and the offlined target case.
 */
apicd->prev_vector = 0;
+   apicd->move_in_progress = false;
if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
goto setnew;
/*



I took it into my internal branch. Will let you know.

Tariq


Re: [PATCH 4.16 008/161] packet: in packet_snd start writing at link layer allocation

2018-05-24 Thread Tariq Toukan



On 24/05/2018 12:37 PM, Greg Kroah-Hartman wrote:

4.16-stable review patch.  If anyone has any objections, please let me know.



Please hold on with this.
I think it causes some degradation.

I'll report the issue on the original patch.

Thanks,
Tariq


--

From: Willem de Bruijn 

[ Upstream commit b84bbaf7a6c8cca24f8acf25a2c8e46913a947ba ]

Packet sockets allow construction of packets shorter than
dev->hard_header_len to accommodate protocols with variable length
link layer headers. These packets are padded to dev->hard_header_len,
because some device drivers interpret that as a minimum packet size.

packet_snd reserves dev->hard_header_len bytes on allocation.
SOCK_DGRAM sockets call skb_push in dev_hard_header() to ensure that
link layer headers are stored in the reserved range. SOCK_RAW sockets
do the same in tpacket_snd, but not in packet_snd.

Syzbot was able to send a zero byte packet to a device with massive
116B link layer header, causing padding to cross over into skb_shinfo.
Fix this by writing from the start of the llheader reserved range also
in the case of packet_snd/SOCK_RAW.

Update skb_set_network_header to the new offset. This also corrects
it for SOCK_DGRAM, where it incorrectly double counted reserve due to
the skb_push in dev_hard_header.

Fixes: 9ed988cd5915 ("packet: validate variable length ll headers")
Reported-by: syzbot+71d74a5406d02057d...@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
  net/packet/af_packet.c |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2903,13 +2903,15 @@ static int packet_snd(struct socket *soc
if (skb == NULL)
goto out_unlock;
  
-	skb_set_network_header(skb, reserve);

+   skb_reset_network_header(skb);
  
  	err = -EINVAL;

if (sock->type == SOCK_DGRAM) {
offset = dev_hard_header(skb, dev, ntohs(proto), addr, NULL, 
len);
if (unlikely(offset < 0))
goto out_free;
+   } else if (reserve) {
+   skb_push(skb, reserve);
}
  
  	/* Returns -EFAULT on error */
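
A sketch of the headroom layout the fix relies on (illustrative helper, not
af_packet.c itself): reserve = dev->hard_header_len is set aside at
allocation time and skb_push() claims it back, so padding a short SOCK_RAW
frame up to hard_header_len stays inside the reserved area.

static struct sk_buff *raw_headroom_example(struct net_device *dev,
					    unsigned int len)
{
	unsigned int reserve = dev->hard_header_len;
	struct sk_buff *skb = alloc_skb(reserve + len, GFP_KERNEL);

	if (!skb)
		return NULL;
	skb_reserve(skb, reserve);		/* headroom for the link layer header */
	if (reserve)
		skb_push(skb, reserve);		/* SOCK_RAW: write from the link layer start */
	skb_reset_network_header(skb);
	return skb;
}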





Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks

2018-05-24 Thread Tariq Toukan



On 24/05/2018 2:22 AM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v4: use kvzalloc instead of vzalloc
 add one err condition check
 don't include vmalloc.h any more

v3: use PAGE_SIZE instead of PAGE_SHIFT
 add comma to the end of enum variables
 include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

  drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +---
  1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..685337d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
  
  /*

- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
   */
  enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = PAGE_SIZE,
+   MLX4_TABLE_CHUNK_SIZE   = PAGE_SIZE,
  };
  
  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)

@@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
u64 size;
  
  	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;

+   if (WARN_ON(!obj_per_chunk))
+   return -EINVAL;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  
-	table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);

+   table->icm  = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}
  
-	kfree(table->icm);

+   kvfree(table->icm);
  
  	return -ENOMEM;

  }
@@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}
  
-	kfree(table->icm);

+   kvfree(table->icm);
  }



Thanks Qing.

Reviewed-by: Tariq Toukan <tar...@mellanox.com>
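
For readers less familiar with the helper, a minimal sketch of the
kvzalloc()/kvfree() pairing used above (illustrative function names): a
physically contiguous kmalloc-style allocation is tried first, vmalloc is
used as the fallback, and a single free helper handles both cases.

#include <linux/mm.h>
#include <linux/slab.h>

static void **icm_ptr_array_alloc_example(size_t nptrs)
{
	return kvzalloc(nptrs * sizeof(void *), GFP_KERNEL);
}

static void icm_ptr_array_free_example(void **arr)
{
	kvfree(arr);	/* correct for both kmalloc- and vmalloc-backed memory */
}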


Re: [PATCH][V2] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message

2018-05-23 Thread Tariq Toukan



On 22/05/2018 6:42 PM, Colin King wrote:

From: Colin Ian King <colin.k...@canonical.com>

Trivial fix to spelling mistake in mlx4_dbg debug message and also
change the phrasing of the message so that it is more readable

Signed-off-by: Colin Ian King <colin.k...@canonical.com>

---
V2: rephrase message, as helpfully suggested by Tariq Toukan
---
  drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c 
b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 2edcce98ab2d..65482f004e50 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable)
	list_add_tail(&dev_ctx->list, &priv->ctx_list);
	spin_unlock_irqrestore(&priv->ctx_lock, flags);
  
-		mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n",

+   mlx4_dbg(dev, "Interface for protocol %d restarted with bonded mode 
%s\n",
 dev_ctx->intf->protocol, enable ?
 "enabled" : "disabled");
}



Thanks Colin.

Reviewed-by: Tariq Toukan <tar...@mellanox.com>



Re: WARNING and PANIC in irq_matrix_free

2018-05-23 Thread Tariq Toukan



On 19/05/2018 2:20 PM, Thomas Gleixner wrote:

On Fri, 18 May 2018, Dmitry Safonov wrote:

I'm not entirely sure that it's the same fault, but at least backtrace
looks resembling.


Yes, it's similar, but not the same issue. I'll stare at the code ...

Thanks,

tglx



We still see the issue in our daily regression runs.
I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.


Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-22 Thread Tariq Toukan



On 18/05/2018 12:45 AM, Qing Huang wrote:



On 5/17/2018 2:14 PM, Eric Dumazet wrote:

On 05/17/2018 01:53 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...


NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d



Would you please take a look at how table->icm is being used in the mlx4 
driver? It's a meta data used for individual pointer variable referencing,
not as data frag or in/out buffer. It has no need for contiguous phy. 
memory.


Thanks.



NACK.

This would cause a degradation when iterating the entries of table->icm.
For example, in mlx4_table_get_range.

Thanks,
Tariq


And you'll understand some people care about this.

Strongly.

Thanks.
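
Regarding the iteration point above, a simplified sketch of the
mlx4_table_get_range()-style walk (abridged; follows the shape of the driver
code rather than quoting it exactly): the table->icm pointer array is
touched chunk by chunk on every range operation, which is why the allocator
choice for that array matters.

static int table_get_range_sketch(struct mlx4_dev *dev,
				  struct mlx4_icm_table *table,
				  u32 start, u32 end)
{
	int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size;
	int err = 0;
	u32 i;

	for (i = start; i <= end; i += inc) {
		err = mlx4_table_get(dev, table, i);	/* maps/refs the chunk holding obj i */
		if (err)
			break;
	}
	if (err) {
		while (i > start) {
			i -= inc;
			mlx4_table_put(dev, table, i);	/* unwind on failure */
		}
	}
	return err;
}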





Re: [PATCH] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface"

2018-05-22 Thread Tariq Toukan



On 22/05/2018 6:23 PM, Colin Ian King wrote:

On 22/05/18 16:21, Tariq Toukan wrote:



On 22/05/2018 11:37 AM, Colin King wrote:

From: Colin Ian King <colin.k...@canonical.com>

Trivial fix to spelling mistake in mlx4_dbg debug message

Signed-off-by: Colin Ian King <colin.k...@canonical.com>
---
   drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c
b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 2edcce98ab2d..6bd4103265d2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable)
   list_add_tail(&dev_ctx->list, &priv->ctx_list);
   spin_unlock_irqrestore(&priv->ctx_lock, flags);
   -    mlx4_dbg(dev, "Inrerface for protocol %d restarted with
when bonded mode is %s\n",
+    mlx4_dbg(dev, "Interface for protocol %d restarted with when
bonded mode is %s\n",


Thanks Colin.
I think there's one more thing to fix here.
It is redundant to say "with when", it was probably done by mistake.
Let's rephrase, maybe this way?

restarted with bonded mode %s


Sounds like a good idea, do you want me to send V2 of the patch with
this fix?



Yes please.




    dev_ctx->intf->protocol, enable ?
    "enabled" : "disabled");
   }







Re: [PATCH] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface"

2018-05-22 Thread Tariq Toukan



On 22/05/2018 11:37 AM, Colin King wrote:

From: Colin Ian King 

Trivial fix to spelling mistake in mlx4_dbg debug message

Signed-off-by: Colin Ian King 
---
  drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c 
b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 2edcce98ab2d..6bd4103265d2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable)
	list_add_tail(&dev_ctx->list, &priv->ctx_list);
	spin_unlock_irqrestore(&priv->ctx_lock, flags);
  
-		mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n",

+   mlx4_dbg(dev, "Interface for protocol %d restarted with when bonded 
mode is %s\n",


Thanks Colin.
I think there's one more thing to fix here.
It is redundant to say "with when", it was probably done by mistake. 
Let's rephrase, maybe this way?


restarted with bonded mode %s


 dev_ctx->intf->protocol, enable ?
 "enabled" : "disabled");
}





Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-16 Thread Tariq Toukan



On 15/05/2018 9:53 PM, Qing Huang wrote:



On 5/15/2018 2:19 AM, Tariq Toukan wrote:



On 14/05/2018 7:41 PM, Qing Huang wrote:



On 5/13/2018 2:00 AM, Tariq Toukan wrote:



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for 
each mtt

entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v2 -> v1: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c

index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to 
avoid high
+ * order memory allocations in fragmented/high usage memory 
situation.

   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.


Yes, we wanted to avoid high order memory allocations.



Then please use PAGE_SIZE instead.


PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT 
is actually more appropriate here.




Definition of PAGE_SIZE varies among different archs.
It is not always as simple as 1 << PAGE_SHIFT.
It might be:
PAGE_SIZE (1UL << PAGE_SHIFT)
PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
etc...

Please replace 1 << PAGE_SHIFT with PAGE_SIZE.






Also, please add a comma at the end of the last entry.


Hmm..., followed the existing code style and checkpatch.pl didn't 
complain about the comma.




I am in favor of having a comma also after the last element, so that 
when another enum element is added we do not modify this line again, 
which would falsely affect git blame.


I know it didn't exist before your patch, but once we're here, let's 
do it.


I'm okay either way. If adding an extra comma is preferred by many 
people, someone should update checkpatch.pl to enforce it. :)



I agree.
Until then, please use an extra comma in this patch.






  };
    static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct 
mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table,

  obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
  num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  -    table->icm  = kcalloc(num_icm, sizeof(*table->icm), 
GFP_KERNEL);

+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));


Why not kvzalloc ?


I think table->icm pointer array doesn't really need physically 
contiguous memory. Sometimes high order
memory allocation by kmalloc variants may trigger slow path and cause 
tasks to be blocked.




This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here, please call kvzalloc 
so we maintain a similar behavior when contiguous memory is available, 
and a fallback for resiliency.


Not sure what degradation exactly is caused by vzalloc here. I think it's 
better to keep physically contiguous pages
for other requests which really need them. Besides, slow path/mem 
compacting can be really expensive.




Degradation is expected when you replace contig memory with non-contig 
memory, without any perf test.
We agree that when contig memory is not available, we should use 
non-contig instead of simply failing, and for this you can call kvzalloc.





Thanks,
Qing




  if (!table->icm)
  return -ENOMEM;
 

Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-15 Thread Tariq Toukan



On 14/05/2018 7:41 PM, Qing Huang wrote:



On 5/13/2018 2:00 AM, Tariq Toukan wrote:



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v2 -> v1: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c

index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to 
avoid high

+ * order memory allocations in fragmented/high usage memory situation.
   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.


Yes, we wanted to avoid high order memory allocations.



Then please use PAGE_SIZE instead.


Also, please add a comma at the end of the last entry.


Hmm..., followed the existing code style and checkpatch.pl didn't 
complain about the comma.




I am in favor of having a comma also after the last element, so that 
when another enum element is added we do not modify this line again, 
which would falsely affect git blame.


I know it didn't exist before your patch, but once we're here, let's do it.




  };
    static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct 
mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table,

  obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
  num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  -    table->icm  = kcalloc(num_icm, sizeof(*table->icm), 
GFP_KERNEL);

+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));


Why not kvzalloc ?


I think table->icm pointer array doesn't really need physically 
contiguous memory. Sometimes high order
memory allocation by kmalloc variants may trigger slow path and cause 
tasks to be blocked.




This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here, please call kvzalloc so 
we maintain a similar behavior when contiguous memory is available, and 
a fallback for resiliency.



Thanks,
Qing




  if (!table->icm)
  return -ENOMEM;
  table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table,

  mlx4_free_icm(dev, table->icm[i], use_coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
    return -ENOMEM;
  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table)

  mlx4_free_icm(dev, table->icm[i], table->coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
  }



Thanks for your patch.

I need to verify there is no dramatic performance degradation here.
You can prepare and send a v3 in the meanwhile.

Thanks,
Tariq




Re: [PATCH v2] net/mlx4_core: Fix error handling in mlx4_init_port_info.

2018-05-14 Thread Tariq Toukan



On 14/05/2018 2:38 AM, Tarick Bedeir wrote:

Avoid exiting the function with a lingering sysfs file (if the first
call to device_create_file() fails while the second succeeds), and avoid
calling devlink_port_unregister() twice.

In other words, either mlx4_init_port_info() succeeds and returns zero, or
it fails, returns non-zero, and requires no cleanup.

Fixes: 096335b3f983 ("mlx4_core: Allow dynamic MTU configuration for IB
ports")
Signed-off-by: Tarick Bedeir <tar...@google.com>
---
v1 -> v2: Added "Fixes" tag.
---
  drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 4d84cab77105..e8a3a45d0b53 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3007,6 +3007,7 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int 
port)
mlx4_err(dev, "Failed to create file for port %d\n", port);
devlink_port_unregister(&info->devlink_port);
info->port = -1;
+   return err;
}
  
  	sprintf(info->dev_mtu_name, "mlx4_port%d_mtu", port);

@@ -3028,9 +3029,10 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int 
port)
   &info->port_attr);
devlink_port_unregister(&info->devlink_port);
info->port = -1;
+   return err;
}
  
-	return err;

+   return 0;
  }
  
  static void mlx4_cleanup_port_info(struct mlx4_port_info *info)




Thanks for your patch.
Reviewed-by: Tariq Toukan <tar...@mellanox.com>

Dave, please queue for -stable.

Best,
Tariq
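
The shape of the fix, as a generic sketch (all helper names below are
hypothetical, not the mlx4 functions): every failure path undoes exactly
what was set up before it and returns, so the caller never sees partial
state and nothing is unregistered twice.

struct port_info;
int register_port(struct port_info *info);
void unregister_port(struct port_info *info);
int create_file_a(struct port_info *info);
void remove_file_a(struct port_info *info);
int create_file_b(struct port_info *info);

static int init_port_example(struct port_info *info)
{
	int err;

	err = register_port(info);
	if (err)
		return err;

	err = create_file_a(info);
	if (err)
		goto err_unregister;

	err = create_file_b(info);
	if (err)
		goto err_remove_a;

	return 0;		/* fully initialized */

err_remove_a:
	remove_file_a(info);
err_unregister:
	unregister_port(info);
	return err;		/* nothing left behind */
}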


Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-13 Thread Tariq Toukan



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may easily get stuck
for more than 120s even though a lot of free pages in smaller chunks are
available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v1 -> v2: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
  
  /*

- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
   */
  enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.
Also, please add a comma at the end of the last entry.


  };
  
  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)

@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  
-	table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);

+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));


Why not kvzalloc ?


if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}
  
-	kfree(table->icm);

+   vfree(table->icm);
  
  	return -ENOMEM;

  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}
  
-	kfree(table->icm);

+   vfree(table->icm);
  }



Thanks for your patch.

I need to verify there is no dramatic performance degradation here.
You can prepare and send a v3 in the meanwhile.

Thanks,
Tariq
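
For reference, addressing the three review comments above (use PAGE_SIZE directly, add the trailing comma, and prefer kvzalloc) might look roughly like the sketch below. This is only a sketch of the suggested direction, not the submitted v3; kvzalloc() tries a kmalloc first and transparently falls back to vmalloc, and kvfree() releases memory obtained either way:

/* Sketch of the review suggestions, not the actual v3 patch. */
enum {
	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,	/* trailing comma as requested */
};

	/* kvzalloc() avoids requiring physically contiguous pages */
	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
	if (!table->icm)
		return -ENOMEM;
	...
	kvfree(table->icm);	/* works for both kmalloc and vmalloc backing */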


Re: [PATCH] net/mlx4_core: Fix error handling in mlx4_init_port_info.

2018-05-13 Thread Tariq Toukan



On 02/05/2018 4:31 PM, Tariq Toukan wrote:



On 27/04/2018 6:20 PM, Tarick Bedeir wrote:

Avoid exiting the function with a lingering sysfs file (if the first
call to device_create_file() fails while the second succeeds), and avoid
calling devlink_port_unregister() twice.

In other words, either mlx4_init_port_info() succeeds and returns 
zero, or

it fails, returns non-zero, and requires no cleanup.

Signed-off-by: Tarick Bedeir <tar...@google.com>
---
  drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c

index 4d84cab77105..e8a3a45d0b53 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3007,6 +3007,7 @@ static int mlx4_init_port_info(struct mlx4_dev 
*dev, int port)

  mlx4_err(dev, "Failed to create file for port %d\n", port);
  devlink_port_unregister(>devlink_port);
  info->port = -1;
+    return err;
  }
  sprintf(info->dev_mtu_name, "mlx4_port%d_mtu", port);
@@ -3028,9 +3029,10 @@ static int mlx4_init_port_info(struct mlx4_dev 
*dev, int port)

 >port_attr);
  devlink_port_unregister(>devlink_port);
  info->port = -1;
+    return err;
  }
-    return err;
+    return 0;
  }
  static void mlx4_cleanup_port_info(struct mlx4_port_info *info)


Acked-by: Tariq Toukan <tar...@mellanox.com>

Thanks Tarick.


Actually, you need to add a Fixes line:

Fixes: 096335b3f983 ("mlx4_core: Allow dynamic MTU configuration for IB 
ports")


Re: [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'

2018-05-10 Thread Tariq Toukan



On 10/05/2018 5:36 PM, Tariq Toukan wrote:



On 10/05/2018 5:18 PM, Dan Carpenter wrote:

On Thu, May 10, 2018 at 04:38:08PM +0300, Yuval Shaia wrote:

On Thu, May 10, 2018 at 09:02:26AM +0200, Christophe JAILLET wrote:

If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed 
resources

cleanup.

So, doing some explicit kfree in the error handling path would lead to
some double kfree.


The patch makes sense, but what's bothering me is that mlx4_en_free_resources
loops on the entire array, assuming !priv->tx_ring[t] means the entry is
not allocated, while the existing code does not assume that, see [1]. So I
looked to see where the tx_ring array is zeroed and didn't find it.

Am I missing something here?



It's zeroed twice.  alloc_etherdev_mqs() allocates zeroed memory and
then we do a memset(priv, 0, sizeof(struct mlx4_en_priv));

regards,
dan carpenter



We do zero (twice) on init, that's right. But I think Yuval's comment is
valid in case the driver goes through a configuration change, or down/up,
that reallocates the rings. I'm double-checking this.


Well, the flows in which we would need to nullify the tx_rings pointer (if
any, I still need to investigate this) are not related to this init function.


Here we're safe.

Anyway, a V2 is already submitted, please use it for your next comments.
I think the patch is OK.


Re: [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'

2018-05-10 Thread Tariq Toukan



On 10/05/2018 5:18 PM, Dan Carpenter wrote:

On Thu, May 10, 2018 at 04:38:08PM +0300, Yuval Shaia wrote:

On Thu, May 10, 2018 at 09:02:26AM +0200, Christophe JAILLET wrote:

If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.

So, doing some explicit kfree in the error handling path would lead to
some double kfree.


The patch makes sense, but what's bothering me is that mlx4_en_free_resources
loops on the entire array, assuming !priv->tx_ring[t] means the entry is
not allocated, while the existing code does not assume that, see [1]. So I looked
to see where the tx_ring array is zeroed and didn't find it.

Am I missing something here?



It's zeroed twice.  alloc_etherdev_mqs() allocates zeroed memory and
then we do a memset(priv, 0, sizeof(struct mlx4_en_priv));

regards,
dan carpenter



We do zero (twice) on init, that's right. But I think Yuval's comment is
valid in case the driver goes through a configuration change, or down/up,
that reallocates the rings. I'm double-checking this.


Re: [PATCH v2] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'

2018-05-10 Thread Tariq Toukan



On 10/05/2018 10:06 AM, Christophe JAILLET wrote:

If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.

So, doing some explicit kfree in the error handling path would lead to
some double kfree.

Simplify code to avoid such a case.

Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET <christophe.jail...@wanadoo.fr>
---
v1 -> v2 : rewrite the fix as explained by Tariq Toukan
(this 2nd version may have been posted twice, once without the
v2 tag. Please ignore the first one)
---

  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 8 +---
  1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index e0adac4a9a19..9670b33fc9b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3324,12 +3324,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int 
port,
   MAX_TX_RINGS, GFP_KERNEL);
if (!priv->tx_ring[t]) {
err = -ENOMEM;
-   goto err_free_tx;
+   goto out;
}
priv->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
 MAX_TX_RINGS, GFP_KERNEL);
if (!priv->tx_cq[t]) {
-   kfree(priv->tx_ring[t]);
err = -ENOMEM;
goto out;
}
@@ -3582,11 +3581,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int 
port,
  
  	return 0;
  
-err_free_tx:

-   while (t--) {
-   kfree(priv->tx_ring[t]);
-   kfree(priv->tx_cq[t]);
-   }
  out:
mlx4_en_destroy_netdev(dev);
    return err;



Reviewed-by: Tariq Toukan <tar...@mellanox.com>

Thanks Christophe.
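
The reason a bare "goto out" is sufficient here is the shape of the centralized cleanup: priv is zeroed at allocation time and kfree(NULL) is a no-op, so a free loop over every slot copes with a partially populated array. Roughly (a sketch of the idea, with the loop bound name taken from the driver's TX-type enum; the real mlx4_en_free_resources() body differs):

	/* Sketch only: tolerate entries that were never allocated. */
	for (t = 0; t < MLX4_EN_NUM_TX_TYPES; t++) {
		kfree(priv->tx_ring[t]);	/* kfree(NULL) is a no-op */
		priv->tx_ring[t] = NULL;
		kfree(priv->tx_cq[t]);
		priv->tx_cq[t] = NULL;
	}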


Re: [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'

2018-05-09 Thread Tariq Toukan



On 08/05/2018 12:34 PM, Christophe JAILLET wrote:

If the 2nd memory allocation of the loop fails, we must undo the
memory allocation done so far.

Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET 
---
  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index e0adac4a9a19..bf078244e467 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3331,7 +3331,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int 
port,
if (!priv->tx_cq[t]) {
kfree(priv->tx_ring[t]);
err = -ENOMEM;
-   goto out;
+   goto err_free_tx;
}
}
priv->rx_ring_num = prof->rx_ring_num;


Hi Christophe,

Thanks for re-sending this.

In your previous mail you referred to the call mlx4_en_destroy_netdev here:
https://elixir.bootlin.com/linux/v4.16-rc5/source/drivers/net/ethernet/mellanox/mlx4/en_main.c#L232

While I was referring to the one below, which is always called on failures:

https://elixir.bootlin.com/linux/v4.16-rc5/source/drivers/net/ethernet/mellanox/mlx4/en_netdev.c#L3587

I still believe that the err_free_tx label and its while loop are redundant.

Regards,
Tariq


Re: [PATCH] net/mlx4_core: Fix error handling in mlx4_init_port_info.

2018-05-02 Thread Tariq Toukan



On 27/04/2018 6:20 PM, Tarick Bedeir wrote:

Avoid exiting the function with a lingering sysfs file (if the first
call to device_create_file() fails while the second succeeds), and avoid
calling devlink_port_unregister() twice.

In other words, either mlx4_init_port_info() succeeds and returns zero, or
it fails, returns non-zero, and requires no cleanup.

Signed-off-by: Tarick Bedeir <tar...@google.com>
---
  drivers/net/ethernet/mellanox/mlx4/main.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 4d84cab77105..e8a3a45d0b53 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3007,6 +3007,7 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int 
port)
mlx4_err(dev, "Failed to create file for port %d\n", port);
devlink_port_unregister(>devlink_port);
info->port = -1;
+   return err;
}
  
  	sprintf(info->dev_mtu_name, "mlx4_port%d_mtu", port);

@@ -3028,9 +3029,10 @@ static int mlx4_init_port_info(struct mlx4_dev *dev, int 
port)
   >port_attr);
devlink_port_unregister(>devlink_port);
info->port = -1;
+   return err;
}
  
-	return err;

+   return 0;
  }
  
  static void mlx4_cleanup_port_info(struct mlx4_port_info *info)



Acked-by: Tariq Toukan <tar...@mellanox.com>

Thanks Tarick.


Re: [PATCH] net/mlx4_en: Fix a memory leak in case of error in 'mlx4_en_init_netdev()'

2018-03-12 Thread Tariq Toukan



On 12/03/2018 12:45 AM, Christophe JAILLET wrote:

If 'kzalloc' fails, we must free some memory before returning.

Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET 
---
  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 8fc51bc29003..f9db018e858f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3327,7 +3327,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int 
port,
if (!priv->tx_cq[t]) {
kfree(priv->tx_ring[t]);
err = -ENOMEM;
-   goto out;
+   goto err_free_tx;
}
}
priv->rx_ring_num = prof->rx_ring_num;



Hi Christophe, thanks for spotting this.

However, I think the err_free_tx label and its loop are redundant.
Both tx_ring/tx_cq flows should just goto out, as resources are freed 
later in mlx4_en_destroy_netdev() -> mlx4_en_free_resources().


Re: WARNING and PANIC in irq_matrix_free

2018-02-25 Thread Tariq Toukan



On 22/02/2018 11:38 PM, Thomas Gleixner wrote:

On Wed, 21 Feb 2018, Tariq Toukan wrote:

On 20/02/2018 8:18 PM, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Tariq Toukan wrote:

Is there CPU hotplugging in play?


No.


Ok.



I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.


Do you have a rough idea what the test case is doing?



It appears arbitrarily in different flows, like sending traffic or interface
configuration changes.


Hmm. Looks like memory corruption, but I can't pin point it.

Find below a debug patch which should prevent the crash and might give us
some insight into the type of corruption.

Please enable the irq_matrix and vector allocation trace points.

echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable

When the problem triggers the bogus vector is printed and the trace is
frozen. Please provide dmesg and the tracebuffer output.



OK, I'm temporarily adding this to the regression internal branch. I'll 
let you know once we have a repro.


Thanks,
Tariq


Thanks,

tglx

8<--
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -822,6 +822,12 @@ static void free_moved_vector(struct api
unsigned int cpu = apicd->prev_cpu;
bool managed = apicd->is_managed;
  
+	if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {

+   tracing_off();
+   pr_err("Trying to clear prev_vector: %u\n", vector);
+   goto out;
+   }
+
/*
 * This should never happen. Managed interrupts are not
 * migrated except on CPU down, which does not involve the
@@ -833,6 +839,7 @@ static void free_moved_vector(struct api
trace_vector_free_moved(apicd->irq, cpu, vector, managed);
irq_matrix_free(vector_matrix, cpu, vector, managed);
per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+out:
hlist_del_init(>clist);
apicd->prev_vector = 0;
apicd->move_in_progress = 0;



Re: WARNING and PANIC in irq_matrix_free

2018-02-21 Thread Tariq Toukan



On 20/02/2018 8:18 PM, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Tariq Toukan wrote:

Is there CPU hotplugging in play?


No.



I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.


Do you have a rough idea what the test case is doing?



It appears arbitrarily in different flows, like sending traffic or
interface configuration changes.


Thanks,
Tariq


WARNING and PANIC in irq_matrix_free

2018-02-20 Thread Tariq Toukan

Hi Thomas,

We started seeing new issues in our net-device daily regression tests.
They are related to patch [1] introduced in kernel 4.15-rc1.

We frequently see a warning in dmesg [2]. Repro is not consistent, we 
tried to narrow it down to a smaller run but couldn't.


In addition, sometimes (less frequent) the warning is followed by a 
panic [3].


I can share all needed details to help analyze this bug.
If you suspect specific flows, we can do an educated narrow down.

Regards,
Tariq


[1] 2f75d9e1c905 genirq: Implement bitmap matrix allocator

[2]
[ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 
irq_matrix_free+0x30/0xd0
[ 8664.891905] Modules linked in: bonding rdma_ucm ib_ucm rdma_cm iw_cm 
ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlxfw mlx4_ib ib_core 
mlx4_en mlx4_core devlink macvlan vxlan ip6_udp_tunnel udp_tunnel 8021q 
garp mrp stp llc mst_pciconf(OE) nfsv3 nfs fscache netconsole dm_mirror 
dm_region_hash dm_log dm_mod dax kvm_intel kvm irqbypass pcspkr 
i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables 
ata_generic cirrus drm_kms_helper syscopyarea sysfillrect pata_acpi 
sysimgblt fb_sys_fops ttm drm e1000 serio_raw virtio_console i2c_core 
floppy ata_piix [last unloaded: mst_pci]
[ 8664.905117] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G   OE 
4.15.0-for-upstream-perf-2018-02-08_07-00-42-18 #1
[ 8664.907613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014

[ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
[ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002
[ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX: 

[ 8664.917679] RDX: 0001 RSI: 0001 RDI: 
880237038400
[ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09: 

[ 8664.922813] R10: 03ff R11: 0ad9 R12: 
88023fc4
[ 8664.925345] R13:  R14: 0001 R15: 
002b
[ 8664.927872] FS:  () GS:88023fd4() 
knlGS:

[ 8664.930455] CS:  0010 DS:  ES:  CR0: 80050033
[ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 
06e0
[ 8664.935557] DR0:  DR1:  DR2: 

[ 8664.938051] DR3:  DR6: fffe0ff0 DR7: 
0400

[ 8664.940541] Call Trace:
[ 8664.942980]  
[ 8664.945399]  free_moved_vector+0x4e/0x100
[ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
[ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
[ 8664.952480]  
[ 8664.954800] RIP: 0010:native_safe_halt+0x2/0x10
[ 8664.957052] RSP: 0018:c9ccfee0 EFLAGS: 0246 ORIG_RAX: 
ffdf
[ 8664.959186] RAX: 818ab6e0 RBX: 880236233f00 RCX: 

[ 8664.960499] RDX:  RSI:  RDI: 

[ 8664.961774] RBP: 0005 R08:  R09: 

[ 8664.963048] R10: 03ff R11: 0ad9 R12: 
880236233f00
[ 8664.964345] R13: 880236233f00 R14:  R15: 


[ 8664.965579]  ? __cpuidle_text_start+0x8/0x8
[ 8664.966808]  default_idle+0x18/0xf0
[ 8664.968040]  do_idle+0x150/0x1d0
[ 8664.969249]  cpu_startup_entry+0x19/0x20
[ 8664.970477]  start_secondary+0x133/0x170
[ 8664.971700]  secondary_startup_64+0xa5/0xb0
[ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 
d5 60 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 
72 0b <0f> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74

[ 8664.975420] ---[ end trace 8be4ba51cd83f4bd ]---


[3]
[ 8943.038767] BUG: unable to handle kernel paging request at 
00037a6b561b

[ 8943.040114] IP: free_moved_vector+0x61/0x100
[ 8943.041531] PGD 0 P4D 0
[ 8943.042855] Oops: 0002 [#1] SMP PTI
[ 8943.044128] Modules linked in: bonding rdma_ucm ib_ucm rdma_cm iw_cm 
ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlxfw mlx4_ib ib_core 
mlx4_en mlx4_core devlink iptable_filter fuse btrfs xor zstd_decompress 
zstd_compress xxhash raid6_pq vfat msdos fat binfmt_misc bridge macvlan 
vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc mst_pciconf(OE) 
nfsv3 nfs fscache netconsole dm_mirror dm_region_hash dm_log dm_mod dax 
kvm_intel kvm irqbypass pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd 
grace sunrpc ip_tables ata_generic cirrus drm_kms_helper syscopyarea 
sysfillrect pata_acpi sysimgblt fb_sys_fops ttm drm e1000 serio_raw 
virtio_console i2c_core floppy ata_piix [last unloaded: mst_pci]
[ 8943.052038] CPU: 5 PID: 0 Comm: swapper/5 Tainted: GW  OE 
4.15.0-for-upstream-perf-2018-02-08_07-00-42-18 #1
[ 8943.053350] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014

[ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
[ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007
[ 8943.057233] RAX: 

Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-25 Thread Tariq Toukan



On 25/01/2018 8:25 AM, jianchao.wang wrote:

Hi Eric

Thanks for your kind response and suggestion.
That's really appreciated.

Jianchao

On 01/25/2018 11:55 AM, Eric Dumazet wrote:

On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:

Hi Tariq

On 01/22/2018 10:12 AM, jianchao.wang wrote:

On 19/01/2018 5:49 PM, Eric Dumazet wrote:

On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:

Hi Tariq

Very sad that the crash was reproduced again after applied the patch.


Memory barriers vary for different Archs, can you please share more details 
regarding arch and repro steps?

The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
Xen is installed. The crash occurred in DOM0.
Regarding the repro steps, it is a customer's test which does heavy disk I/O
over NFS storage without any guest.



What is the final suggestion on this?
If we use wmb() there, is the performance pulled down?


I want to evaluate this effect.
I agree with Eric, the expected impact is limited, especially after 
batching the allocations.




Since 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155

we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.

I doubt the additional wmb() will have serious impact there.



I will test the effect (it'll be beginning of next week).
I'll update so we can make a more confident decision.

Thanks,
Tariq







Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-21 Thread Tariq Toukan



On 21/01/2018 11:31 AM, Tariq Toukan wrote:



On 19/01/2018 5:49 PM, Eric Dumazet wrote:

On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:

Hi Tariq

Very sad that the crash was reproduced again after applied the patch.


Memory barriers vary for different Archs, can you please share more 
details regarding arch and repro steps?




--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -252,6 +252,7 @@ static inline bool mlx4_en_is_ring_empty(struct 
mlx4_en_rx_ring *ring)
  static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring 
*ring)

  {
+    dma_wmb();


So... is wmb() here fixing the issue ?


  *ring->wqres.db.db = cpu_to_be32(ring->prod & 0xffff);
  }

I analyzed the kdump, it should be a memory corruption.

Thanks
Jianchao


Hmm, this is actually consistent with the example below [1].

As I understand it from the example, it seems that the dma_wmb/dma_rmb barriers are 
good for synchronizing cpu/device accesses to the "Streaming DMA mapped" 
buffers (the descriptors, went through the dma_map_page() API), but not 
for the doorbell (a coherent memory, typically allocated via 
dma_alloc_coherent) that requires using the stronger wmb() barrier.



[1] Documentation/memory-barriers.txt

  (*) dma_wmb();
  (*) dma_rmb();

  These are for use with consistent memory to guarantee the ordering
  of writes or reads of shared memory accessible to both the CPU and a
  DMA capable device.

  For example, consider a device driver that shares memory with a 
device
  and uses a descriptor status value to indicate if the descriptor 
belongs

  to the device or the CPU, and a doorbell to notify it when new
  descriptors are available:

 if (desc->status != DEVICE_OWN) {
     /* do not read data until we own descriptor */
     dma_rmb();

     /* read/modify data */
     read_data = desc->data;
     desc->data = write_data;

     /* flush modifications before status update */
     dma_wmb();

     /* assign ownership */
     desc->status = DEVICE_OWN;

     /* force memory to sync before notifying device via MMIO */
     wmb();

     /* notify device of new descriptors */
     writel(DESC_NOTIFY, doorbell);
 }

  The dma_rmb() allows us guarantee the device has released ownership
  before we read the data from the descriptor, and the dma_wmb() allows
  us to guarantee the data is written to the descriptor before the 
device
  can see it now has ownership.  The wmb() is needed to guarantee 
that the
  cache coherent memory writes have completed before attempting a 
write to

  the cache incoherent MMIO region.

  See Documentation/DMA-API.txt for more information on consistent 
memory.




On 01/15/2018 01:50 PM, jianchao.wang wrote:

Hi Tariq

Thanks for your kindly response.

On 01/14/2018 05:47 PM, Tariq Toukan wrote:

Thanks Jianchao for your patch.

And Thank you guys for your reviews, much appreciated.
I was off-work on Friday and Saturday.

On 14/01/2018 4:40 AM, jianchao.wang wrote:

Dear all

Thanks for the kindly response and reviewing. That's really 
appreciated.


On 01/13/2018 12:46 AM, Eric Dumazet wrote:

Does this need to be dma_wmb(), and should it be in
mlx4_en_update_rx_prod_db ?



+1 on dma_wmb()

On what architecture bug was observed ?


This issue was observed on x86-64.
And I will send a new patch, in which replace wmb() with 
dma_wmb(), to customer

to confirm.


+1 on dma_wmb, let us know once customer confirms.
Please place it within mlx4_en_update_rx_prod_db as suggested.


Yes, I have recommended it to customer.
Once I get the result, I will share it here.
All other calls to mlx4_en_update_rx_prod_db are in control/slow 
path so I prefer being on the safe side, and care less about 
bulking the barrier.


Thanks,
Tariq
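
To make the distinction above concrete: following the memory-barriers.txt example, the placement discussed in this thread puts a full wmb() (rather than dma_wmb()) in front of the doorbell record update, since the doorbell is read by the device from coherent memory. A sketch of that option, not necessarily the patch that was finally applied:

static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
{
	/* order the preceding rx_desc writes before exposing the new
	 * producer index to the HW; dma_wmb() vs wmb() is exactly the
	 * open question in this thread
	 */
	wmb();
	*ring->wqres.db.db = cpu_to_be32(ring->prod & 0xffff);
}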





Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-21 Thread Tariq Toukan



On 19/01/2018 5:49 PM, Eric Dumazet wrote:

On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:

Hi Tariq

Very sad that the crash was reproduced again after applied the patch.

--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -252,6 +252,7 @@ static inline bool mlx4_en_is_ring_empty(struct 
mlx4_en_rx_ring *ring)
  
  static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)

  {
+   dma_wmb();


So... is wmb() here fixing the issue ?


	*ring->wqres.db.db = cpu_to_be32(ring->prod & 0xffff);
  }

I analyzed the kdump, it should be a memory corruption.

Thanks
Jianchao


Hmm, this is actually consistent with the example below [1].

As I understand it from the example, it seems that the dma_wmb/dma_rmb barriers are 
good for synchronizing cpu/device accesses to the "Streaming DMA mapped" 
buffers (the descriptors, went through the dma_map_page() API), but not 
for the doorbell (a coherent memory, typically allocated via 
dma_alloc_coherent) that requires using the stronger wmb() barrier.



[1] Documentation/memory-barriers.txt

 (*) dma_wmb();
 (*) dma_rmb();

 These are for use with consistent memory to guarantee the ordering
 of writes or reads of shared memory accessible to both the CPU and a
 DMA capable device.

 For example, consider a device driver that shares memory with a device
 and uses a descriptor status value to indicate if the descriptor 
belongs

 to the device or the CPU, and a doorbell to notify it when new
 descriptors are available:

if (desc->status != DEVICE_OWN) {
/* do not read data until we own descriptor */
dma_rmb();

/* read/modify data */
read_data = desc->data;
desc->data = write_data;

/* flush modifications before status update */
dma_wmb();

/* assign ownership */
desc->status = DEVICE_OWN;

/* force memory to sync before notifying device via MMIO */
wmb();

/* notify device of new descriptors */
writel(DESC_NOTIFY, doorbell);
}

 The dma_rmb() allows us guarantee the device has released ownership
 before we read the data from the descriptor, and the dma_wmb() allows
 us to guarantee the data is written to the descriptor before the 
device
 can see it now has ownership.  The wmb() is needed to guarantee 
that the
 cache coherent memory writes have completed before attempting a 
write to

 the cache incoherent MMIO region.

 See Documentation/DMA-API.txt for more information on consistent 
memory.




On 01/15/2018 01:50 PM, jianchao.wang wrote:

Hi Tariq

Thanks for your kindly response.

On 01/14/2018 05:47 PM, Tariq Toukan wrote:

Thanks Jianchao for your patch.

And Thank you guys for your reviews, much appreciated.
I was off-work on Friday and Saturday.

On 14/01/2018 4:40 AM, jianchao.wang wrote:

Dear all

Thanks for the kindly response and reviewing. That's really appreciated.

On 01/13/2018 12:46 AM, Eric Dumazet wrote:

Does this need to be dma_wmb(), and should it be in
mlx4_en_update_rx_prod_db ?



+1 on dma_wmb()

On what architecture bug was observed ?


This issue was observed on x86-64.
And I will send a new patch, in which I replace wmb() with dma_wmb(), to the
customer to confirm.


+1 on dma_wmb, let us know once customer confirms.
Please place it within mlx4_en_update_rx_prod_db as suggested.


Yes, I have recommended it to customer.
Once I get the result, I will share it here.

All other calls to mlx4_en_update_rx_prod_db are in control/slow path so I 
prefer being on the safe side, and care less about bulking the barrier.

Thanks,
Tariq


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-14 Thread Tariq Toukan

Thanks Jianchao for your patch.

And Thank you guys for your reviews, much appreciated.
I was off-work on Friday and Saturday.

On 14/01/2018 4:40 AM, jianchao.wang wrote:

Dear all

Thanks for the kind response and reviews. That's really appreciated.

On 01/13/2018 12:46 AM, Eric Dumazet wrote:

Does this need to be dma_wmb(), and should it be in
mlx4_en_update_rx_prod_db ?


+1 on dma_wmb()

On what architecture bug was observed ?

This issue was observed on x86-64.
And I will send a new patch, in which I replace wmb() with dma_wmb(), to the
customer to confirm.


+1 on dma_wmb, let us know once customer confirms.
Please place it within mlx4_en_update_rx_prod_db as suggested.
All other calls to mlx4_en_update_rx_prod_db are in control/slow path so 
I prefer being on the safe side, and care less about bulking the barrier.


Thanks,
Tariq


Re: [PATCH net-next V2 2/2] tuntap: XDP transmission

2018-01-10 Thread Tariq Toukan



On 04/01/2018 5:14 AM, Jason Wang wrote:

This patch implements XDP transmission for TAP. Since we can't create
new queues for TAP during XDP set, the existing ptr_ring is reused for
queuing XDP buffers. To distinguish an xdp_buff from an sk_buff, TUN_XDP_FLAG
(0x1UL) is encoded into the lowest bit of the xdp_buff pointer during
ptr_ring_produce, and decoded during consumption. XDP metadata is
stored in the headroom of the packet, which should work in most
cases since drivers usually reserve enough headroom. Very minor changes
were done to vhost_net: it just needs to peek the length depending on
the type of pointer.

Tests were done on two Intel E5-2630 2.40GHz machines connected back
to back through two 82599ES NICs. Traffic was generated/received through
MoonGen/testpmd(rxonly). It reports ~20% improvement when
xdp_redirect_map is doing redirection from ixgbe to TAP (from 2.50Mpps
to 3.05Mpps).

Cc: Jesper Dangaard Brouer 
Signed-off-by: Jason Wang 
---
  drivers/net/tun.c  | 211 +
  drivers/vhost/net.c|  13 ++-
  include/linux/if_tun.h |  17 
  3 files changed, 208 insertions(+), 33 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2c89efe..f2e805d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -240,6 +240,24 @@ struct tun_struct {
struct tun_steering_prog __rcu *steering_prog;
  };
  
+bool tun_is_xdp_buff(void *ptr)

+{
+   return (unsigned long)ptr & TUN_XDP_FLAG;
+}
+EXPORT_SYMBOL(tun_is_xdp_buff);
+
+void *tun_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
+}
+EXPORT_SYMBOL(tun_xdp_to_ptr);
+
+void *tun_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
+}
+EXPORT_SYMBOL(tun_ptr_to_xdp);
+


Hi Jason,
I started getting the following compilation issues.

+ make -j24 -s
net/socket.o: In function `tun_xdp_to_ptr':
/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:46: 
multiple definition of `tun_xdp_to_ptr'
fs/compat_ioctl.o:/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:46: 
first defined here

net/socket.o: In function `tun_ptr_to_xdp':
/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:50: 
multiple definition of `tun_ptr_to_xdp'
fs/compat_ioctl.o:/images/autom/buildbot/worker/merge-net-next/build/./include/linux/if_tun.h:50: 
first defined here

make: *** [vmlinux] Error 1

Seems you missed adding the following ifdef:
#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
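
For illustration, the header would need something along the lines of the
sketch below (the stub bodies here are a hypothetical sketch, not the
committed fix):

#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
bool tun_is_xdp_buff(void *ptr);
void *tun_xdp_to_ptr(void *ptr);
void *tun_ptr_to_xdp(void *ptr);
#else
static inline bool tun_is_xdp_buff(void *ptr)
{
	return false;
}
static inline void *tun_xdp_to_ptr(void *ptr)
{
	return NULL;
}
static inline void *tun_ptr_to_xdp(void *ptr)
{
	return NULL;
}
#endif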

Thanks,
Tariq


  static int tun_napi_receive(struct napi_struct *napi, int budget)
  {
struct tun_file *tfile = container_of(napi, struct tun_file, napi);
@@ -630,12 +648,25 @@ static struct tun_struct *tun_enable_queue(struct 
tun_file *tfile)
return tun;
  }
  
+static void tun_ptr_free(void *ptr)

+{
+   if (!ptr)
+   return;
+   if (tun_is_xdp_buff(ptr)) {
+   struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+
+   put_page(virt_to_head_page(xdp->data));
+   } else {
+   __skb_array_destroy_skb(ptr);
+   }
+}
+
  static void tun_queue_purge(struct tun_file *tfile)
  {
-   struct sk_buff *skb;
+   void *ptr;
  
-	while ((skb = ptr_ring_consume(>tx_ring)) != NULL)

-   kfree_skb(skb);
+   while ((ptr = ptr_ring_consume(>tx_ring)) != NULL)
+   tun_ptr_free(ptr);
  
  	skb_queue_purge(>sk.sk_write_queue);

skb_queue_purge(>sk.sk_error_queue);
@@ -688,8 +719,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
unregister_netdevice(tun->dev);
}
if (tun)
-   ptr_ring_cleanup(>tx_ring,
-__skb_array_destroy_skb);
+   ptr_ring_cleanup(>tx_ring, tun_ptr_free);
sock_put(>sk);
}
  }
@@ -1201,6 +1231,67 @@ static const struct net_device_ops tun_netdev_ops = {
.ndo_get_stats64= tun_net_get_stats64,
  };
  
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)

+{
+   struct tun_struct *tun = netdev_priv(dev);
+   struct xdp_buff *buff = xdp->data_hard_start;
+   int headroom = xdp->data - xdp->data_hard_start;
+   struct tun_file *tfile;
+   u32 numqueues;
+   int ret = 0;
+
+   /* Assure headroom is available and buff is properly aligned */
+   if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
+   return -ENOSPC;
+
+   *buff = *xdp;
+
+   rcu_read_lock();
+
+   numqueues = READ_ONCE(tun->numqueues);
+   if (!numqueues) {
+   ret = -ENOSPC;
+   goto out;
+   }
+
+   tfile = rcu_dereference(tun->tfiles[smp_processor_id() %
+   numqueues]);
+   /* Encode the XDP flag into lowest bit for consumer to 


Re: [PATCH] ethernet: mlx4: Delete an error message for a failed memory allocation in five functions

2018-01-04 Thread Tariq Toukan



On 01/01/2018 10:46 PM, SF Markus Elfring wrote:

From: Markus Elfring <elfr...@users.sourceforge.net>
Date: Mon, 1 Jan 2018 21:42:27 +0100

Omit an extra message for a memory allocation failure in these functions.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
---



Acked-by: Tariq Toukan <tar...@mellanox.com>

Thanks.



Re: ethernet: mlx4: Delete an error message for a failed memory allocation in five functions

2018-01-04 Thread Tariq Toukan



On 03/01/2018 4:22 PM, SF Markus Elfring wrote:

I don't really accept this claim...
Short informative strings are worth the tiny space they consume.


There can be different opinions for their usefulness.



In addition, some out-of-memory errors are recoverable, even though their 
backtrace is also printed.


What do you think about suppressing the backtrace generation for them?



OK, makes sense.


For example, in function mlx4_en_create_cq (appears in patch) we have a first 
allocation attempt (kzalloc_node)


Would it be helpful to pass the option “__GFP_NOWARN” there?




I'll prepare a patch to use it.
Will ack this patch for now.
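
For illustration only, the pattern I have in mind looks roughly like the
sketch below (the call site, en_err() text and variable names follow the
surrounding mlx4_en code as assumptions, not an actual patch):

	cq = kzalloc_node(sizeof(*cq), GFP_KERNEL | __GFP_NOWARN, node);
	if (!cq) {
		/* Quiet node-local attempt failed; fall back to any node
		 * and only print a clear error when both attempts fail.
		 */
		cq = kzalloc(sizeof(*cq), GFP_KERNEL);
		if (!cq) {
			en_err(priv, "Failed to allocate CQ structure\n");
			return -ENOMEM;
		}
	}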


and a fallback (kzalloc). I'd prefer to state a clear error message only when 
both have failed,
because otherwise the user might be confused whether the backtrace should 
indicate a malfunctioning interface, or not.


Could the distinction be made easier by any other means?

Regards,
Markus
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] ethernet: mlx4: Delete an error message for a failed memory allocation in five functions

2018-01-03 Thread Tariq Toukan



On 03/01/2018 10:06 AM, Julia Lawall wrote:



On Wed, 3 Jan 2018, Tariq Toukan wrote:




On 01/01/2018 10:46 PM, SF Markus Elfring wrote:

From: Markus Elfring <elfr...@users.sourceforge.net>
Date: Mon, 1 Jan 2018 21:42:27 +0100

Omit an extra message for a memory allocation failure in these functions.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
---


Is this an issue? Why? What is your motivation?
These are error messages, very informative, appear only upon errors, and in
control flow.


Strings take up space.  Since there is a backtrace on an out of memory
problem, if the string does not provide any more information than the
position of the call, then there is not much added value.  I don't know
what was the string in this case.  If it provides some additional
information, then it would be reasonable to keep it.


I don't really accept this claim...

Short informative strings are worth the tiny space they consume. They help 
the users of our driver understand what went wrong in simple words, 
without needing to understand the role of the functions in the callstack or 
to be familiar with different parts of the driver code.


In addition, some out-of-memory errors are recoverable, even though 
their backtrace is also printed. For example, in function 
mlx4_en_create_cq (appears in patch) we have a first allocation attempt 
(kzalloc_node) and a fallback (kzalloc). I'd prefer to state a clear 
error message only when both have failed, because otherwise the user 
might be confused whether the backtrace should indicate a malfunctioning 
interface, or not.


Tariq



julia


--
To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] ethernet: mlx4: Delete an error message for a failed memory allocation in five functions

2018-01-02 Thread Tariq Toukan



On 01/01/2018 10:46 PM, SF Markus Elfring wrote:

From: Markus Elfring 
Date: Mon, 1 Jan 2018 21:42:27 +0100

Omit an extra message for a memory allocation failure in these functions.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---


Is this an issue? Why? What is your motivation?
These are error messages, very informative, appear only upon errors, and 
in control flow.


Re: [PATCH] kobject: fix suppressing modalias in uevents delivered over netlink

2017-12-31 Thread Tariq Toukan



On 31/12/2017 12:39 PM, Greg Kroah-Hartman wrote:

On Sun, Dec 31, 2017 at 12:28:02PM +0200, Tariq Toukan wrote:



On 21/12/2017 7:13 AM, Dmitry Torokhov wrote:

  Greg,
can you please schedule my patch in for 4.15?



We must have it in, otherwise there will be a major degradation (many
drivers won't be able to reload).
Greg?


It's already queued up to get merged, please be patient, it is the
holidays...

greg k-h


Many thanks, and happy holidays!

Tariq

