[PATCH v2] IB/mlx5: avoid excessive warning msgs when creating VFs on 2nd port

2018-07-23 Thread Qing Huang
When a CX5 device is configured in dual-port RoCE mode, after creating
many VFs against port 1, creating the same number of VFs against port 2
will flood kernel/syslog with something like
"mlx5_*:mlx5_ib_bind_slave_port:4266:(pid 5269): port 2 already
affiliated."

So basically, when traversing mlx5_ib_dev_list, mlx5_ib_add_slave_port()
repeatedly attempts to bind the new mpi structure to every device
on the list until it finds an unbound device.

Change the log level from warn to dbg to avoid log flooding as the warning
should be harmless.
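
For context, a condensed sketch of the traversal described above (simplified from
the mlx5_ib_add_slave_port() hunk quoted elsewhere in this thread; the success
path is abbreviated):

	/* Sketch only: every device that is already affiliated makes
	 * mlx5_ib_bind_slave_port() print once and return false, so with
	 * N bound devices each new VF on port 2 produces N messages.
	 */
	mutex_lock(&mlx5_ib_multiport_mutex);
	list_for_each_entry(dev, &mlx5_ib_dev_list, ib_dev_list) {
		if (dev->sys_image_guid == mpi->sys_image_guid)
			bound = mlx5_ib_bind_slave_port(dev, mpi); /* may print */

		if (bound)
			break;	/* simplified; the real code also updates lists */
	}
	mutex_unlock(&mlx5_ib_multiport_mutex);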

Signed-off-by: Qing Huang 
---
  v1 -> v2: change the log level instead

 drivers/infiniband/hw/mlx5/main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b3ba9a2..f57b8f7 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -5127,8 +5127,8 @@ static bool mlx5_ib_bind_slave_port(struct mlx5_ib_dev *ibdev,
 
 	spin_lock(&ibdev->port[port_num].mp.mpi_lock);
 	if (ibdev->port[port_num].mp.mpi) {
-		mlx5_ib_warn(ibdev, "port %d already affiliated.\n",
-			     port_num + 1);
+		mlx5_ib_dbg(ibdev, "port %d already affiliated.\n",
+			    port_num + 1);
 		spin_unlock(&ibdev->port[port_num].mp.mpi_lock);
 		return false;
 	}
-- 
2.9.3



Re: [PATCH] IB/mlx5: avoid binding a new mpi unit to the same devices repeatedly

2018-07-23 Thread Qing Huang

Hi Daniel,


On 7/23/2018 11:11 AM, Daniel Jurgens wrote:


On 7/23/2018 10:36 AM, Qing Huang wrote:

Hi Daniel/Parav,

Have you got a chance to review this patch? Thanks!

Hi Qing, sorry for the delay; I just got back to the office today. I don't 
agree with the proposed fix; I've provided an alternative suggestion below.

Or.


Reported-by: Gerald Gibson 
Signed-off-by: Qing Huang 
---
   drivers/infiniband/hw/mlx5/main.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b3ba9a2..1ddd1d3 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6068,7 +6068,8 @@ static void *mlx5_ib_add_slave_port(struct mlx5_core_dev *mdev, u8 port_num)
 
 	mutex_lock(&mlx5_ib_multiport_mutex);
 	list_for_each_entry(dev, &mlx5_ib_dev_list, ib_dev_list) {
-		if (dev->sys_image_guid == mpi->sys_image_guid)
+		if (dev->sys_image_guid == mpi->sys_image_guid &&
+		    !dev->port[mlx5_core_native_port_num(mdev) - 1].mp.mpi)

You shouldn't check the mpi field like that without holding the lock in the mp 
structure. I'd prefer you change the print in mlx5_ib_bind_slave_port from a 
warning to a debug message.
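
To make the locking point concrete, a side-by-side sketch assembled from the
diffs quoted in this archive (simplified, not a literal copy of the driver):

	/* v1 proposal: reads mp.mpi in mlx5_ib_add_slave_port() without
	 * taking mpi_lock, so the value can change under us:
	 */
	if (dev->sys_image_guid == mpi->sys_image_guid &&
	    !dev->port[mlx5_core_native_port_num(mdev) - 1].mp.mpi)	/* unlocked read */
		bound = mlx5_ib_bind_slave_port(dev, mpi);

	/* mlx5_ib_bind_slave_port() already performs the same test under the
	 * lock, so that is the race-free place to treat "already bound" as a
	 * quiet, expected condition:
	 */
	spin_lock(&ibdev->port[port_num].mp.mpi_lock);
	if (ibdev->port[port_num].mp.mpi) {		/* authoritative check */
		mlx5_ib_dbg(ibdev, "port %d already affiliated.\n", port_num + 1);
		spin_unlock(&ibdev->port[port_num].mp.mpi_lock);
		return false;
	}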

Thanks for the review. That works for us too. Will resend the patch.

Regards,
Qing




 			bound = mlx5_ib_bind_slave_port(dev, mpi);
 
 		if (bound) {
--
2.9.3






Re: [PATCH] IB/mlx5: avoid binding a new mpi unit to the same devices repeatedly

2018-07-23 Thread Qing Huang




On 7/15/2018 12:48 PM, Daniel Jurgens wrote:

On 7/14/2018 10:57 AM, Or Gerlitz wrote:

On Sat, Jul 14, 2018 at 2:50 AM, Qing Huang  wrote:

When a CX5 device is configured in dual-port RoCE mode, after creating
many VFs against port 1, creating the same number of VFs against port 2
will flood kernel/syslog with something like
"mlx5_*:mlx5_ib_bind_slave_port:4266:(pid 5269): port 2 already
affiliated."

So basically, when traversing mlx5_ib_dev_list, mlx5_ib_add_slave_port()
shouldn't repeatedly attempt to bind the new mpi data unit to every device
on the list until it finds an unbound device.

Daniel,

What is mpi data unit?

It's a structure to keep track of affiliated port info in dual-port RoCE mode; 
mpi stands for multi-port info. Parav can review this in my absence, otherwise 
I can take a closer look when I return to the office.
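
As a rough picture of what is being discussed (the names below are the ones that
appear in the diffs in this thread; the layout is simplified and is not the full
kernel definition):

	/* "mpi" = multi-port info: one per slave (port 2) core device that is
	 * waiting to be affiliated with a master IB device.
	 */
	struct mlx5_ib_multiport_info {
		u64 sys_image_guid;	/* matched against dev->sys_image_guid */
		/* ... list linkage, core device pointer, etc. ... */
	};

	/* Each IB device keeps a per-port slot guarded by a spinlock; the slot
	 * stays NULL until a slave port is bound:
	 *   ibdev->port[port_num].mp.mpi       (checked in the bind path)
	 *   ibdev->port[port_num].mp.mpi_lock  (taken around that check)
	 */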

Hi Daniel/Parav,

Have you got a chance to review this patch? Thanks!


Or.


Reported-by: Gerald Gibson 
Signed-off-by: Qing Huang 
---
  drivers/infiniband/hw/mlx5/main.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b3ba9a2..1ddd1d3 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6068,7 +6068,8 @@ static void *mlx5_ib_add_slave_port(struct mlx5_core_dev *mdev, u8 port_num)
 
 	mutex_lock(&mlx5_ib_multiport_mutex);
 	list_for_each_entry(dev, &mlx5_ib_dev_list, ib_dev_list) {
-		if (dev->sys_image_guid == mpi->sys_image_guid)
+		if (dev->sys_image_guid == mpi->sys_image_guid &&
+		    !dev->port[mlx5_core_native_port_num(mdev) - 1].mp.mpi)
 			bound = mlx5_ib_bind_slave_port(dev, mpi);
 
 		if (bound) {
--
2.9.3






[PATCH] IB/mlx5: avoid binding a new mpi unit to the same devices repeatedly

2018-07-13 Thread Qing Huang
When a CX5 device is configured in dual-port RoCE mode, after creating
many VFs against port 1, creating the same number of VFs against port 2
will flood kernel/syslog with something like
"mlx5_*:mlx5_ib_bind_slave_port:4266:(pid 5269): port 2 already
affiliated."

So basically, when traversing mlx5_ib_dev_list, mlx5_ib_add_slave_port()
shouldn't repeatedly attempt to bind the new mpi data unit to every device
on the list until it finds an unbound device.

Reported-by: Gerald Gibson 
Signed-off-by: Qing Huang 
---
 drivers/infiniband/hw/mlx5/main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b3ba9a2..1ddd1d3 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6068,7 +6068,8 @@ static void *mlx5_ib_add_slave_port(struct mlx5_core_dev *mdev, u8 port_num)
 
 	mutex_lock(&mlx5_ib_multiport_mutex);
 	list_for_each_entry(dev, &mlx5_ib_dev_list, ib_dev_list) {
-		if (dev->sys_image_guid == mpi->sys_image_guid)
+		if (dev->sys_image_guid == mpi->sys_image_guid &&
+		    !dev->port[mlx5_core_native_port_num(mdev) - 1].mp.mpi)
 			bound = mlx5_ib_bind_slave_port(dev, mpi);
 
 		if (bound) {
-- 
2.9.3




Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks

2018-05-31 Thread Qing Huang




On 5/31/2018 2:10 AM, Michal Hocko wrote:

On Thu 31-05-18 10:55:32, Michal Hocko wrote:

On Thu 31-05-18 04:35:31, Eric Dumazet wrote:

[...]

I merely copied/pasted from alloc_skb_with_frags() :/

I will have a look at it. Thanks!

OK, so this is an example of an incremental development ;).

__GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
high order allocations") to prevent from OOM killer. Yet this was
not enough because fb05e7a89f50 ("net: don't wait for order-3 page
allocation") didn't want an excessive reclaim for non-costly orders
so it made it completely NOWAIT while it preserved __GFP_NORETRY in
place which is now redundant. Should I send a patch?



Just curious, how about the GFP_ATOMIC flag? Would it work in a similar 
fashion? We experimented with it a bit in the past, but it seemed to cause 
other issues in our tests. :-)
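
For reference, a sketch of the allocation pattern Michal describes above
(gfp_mask and order stand in for the caller's arguments; flag handling as in the
commits he cites). GFP_ATOMIC is a different tool: it cannot sleep and may dip
into the atomic reserves, so it is not a substitute for this pattern:

	/* Try the high-order allocation without direct reclaim and without
	 * warnings; __GFP_NORETRY is the now-redundant older flag. If that
	 * fails, fall back to order-0 pages, which fragmented-but-free memory
	 * can still satisfy.
	 */
	struct page *page;

	page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
			   __GFP_NOWARN | __GFP_NORETRY, order);
	if (!page)
		page = alloc_pages(gfp_mask, 0);	/* order-0 fallback */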


By the way, we didn't encounter any OOM killer events. It seemed that 
mlx4_alloc_icm() triggered the slowpath.

We still had about 2GB of free memory, but it was highly fragmented.

 #0 [8801f308b380] remove_migration_pte at 811f0e0b
 #1 [8801f308b3e0] rmap_walk_file at 811cb890
 #2 [8801f308b440] rmap_walk at 811cbaf2
 #3 [8801f308b450] remove_migration_ptes at 811f0db0
 #4 [8801f308b490] __unmap_and_move at 811f2ea6
 #5 [8801f308b4e0] unmap_and_move at 811f2fc5
 #6 [8801f308b540] migrate_pages at 811f3219
 #7 [8801f308b5c0] compact_zone at 811b707e
 #8 [8801f308b650] compact_zone_order at 811b735d
 #9 [8801f308b6e0] try_to_compact_pages at 811b7485
#10 [8801f308b770] __alloc_pages_direct_compact at 81195f96
#11 [8801f308b7b0] __alloc_pages_slowpath at 811978a1
#12 [8801f308b890] __alloc_pages_nodemask at 81197ec1
#13 [8801f308b970] alloc_pages_current at 811e261f
#14 [8801f308b9e0] mlx4_alloc_icm at a01f39b2 [mlx4_core]

Thanks!



[PATCH V4] mlx4_core: allocate ICM memory in page size chunks

2018-05-23 Thread Qing Huang
When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.
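
Spelling out the arithmetic behind those numbers (assuming 8-byte pointers on
x86_64):

	/*
	 * nobj          = 1 << 30                     log_num_mtt = 30 -> 1G mtt entries
	 * obj_per_chunk = 4096 / 8        = 512       8-byte mtt entries per 4KB chunk
	 * num_icm       = (1 << 30) / 512 = 2M        entries in the table->icm array
	 * array size    = 2M * 8 bytes    = 16MB      too large for kcalloc/kmalloc,
	 *                                             fine for kvzalloc's vmalloc fallback
	 */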

The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v4: use kvzalloc instead of vzalloc
add one err condition check
don't include vmalloc.h any more

v3: use PAGE_SIZE instead of PAGE_SHIFT
add comma to the end of enum variables
include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..685337d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u64 size;
 
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
+	if (WARN_ON(!obj_per_chunk))
+		return -EINVAL;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm      = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm      = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt     = virt;
@@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 		mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 }
-- 
2.9.3




Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-22 Thread Qing Huang



On 5/22/2018 8:33 AM, Tariq Toukan wrote:



On 18/05/2018 12:45 AM, Qing Huang wrote:



On 5/17/2018 2:14 PM, Eric Dumazet wrote:

On 05/17/2018 01:53 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...


NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d



Would you please take a look at how table->icm is being used in the mlx4 
driver? It's metadata used for individual pointer references, not a data 
fragment or an in/out buffer. It has no need for contiguous physical memory.


Thanks.



NACK.

This would cause a degradation when iterating the entries of table->icm.
For example, in mlx4_table_get_range.

E.g.
int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
			 u32 start, u32 end)
{
	int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size;
	int err;
	u32 i;

	for (i = start; i <= end; i += inc) {
		err = mlx4_table_get(dev, table, i);
		if (err)
			goto fail;
	}

	return 0;
...
}

E.g. an mtt object is 8 bytes, so a 4KB ICM block holds 512 mtt objects. You 
therefore have to allocate another 512 mtt objects before the table->icm index 
advances by 1 to fetch the next pointer value. So it takes 256K mtt objects 
before the walk over table->icm crosses a page boundary in these call paths.
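
The arithmetic behind that claim, assuming 4KB pages and 8-byte mtt objects and
pointers as above:

	/*
	 * mtt objects per ICM chunk     = 4096 / 8  = 512  (one table->icm slot
	 *                                                    per 512 objects)
	 * table->icm slots per 4KB page = 4096 / 8  = 512
	 * objects per page of slots     = 512 * 512 = 256K
	 *
	 * So the loop in mlx4_table_get_range() crosses a page boundary of the
	 * pointer array only once every 256K mtt objects, which is why a
	 * vmalloc-backed (non-contiguous) array costs little in this control path.
	 */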

Considering mlx4_table_get_range() is only used in the control path, there is 
no significant gain from kvzalloc vs. vzalloc for table->icm.

Anyway, if a user makes sure the mlx4 driver is loaded very early and is not 
removed and reloaded afterwards, we should have enough contiguous physical 
memory (without wasting any) for the table->icm allocation. I will use 
kvzalloc to replace vzalloc and send a V4 patch.

Thanks,
Qing




Thanks,
Tariq


And you'll understand some people care about this.

Strongly.

Thanks.










Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-17 Thread Qing Huang



On 5/17/2018 2:14 PM, Eric Dumazet wrote:

On 05/17/2018 01:53 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...


NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d


Would you please take a look at how table->icm is being used in the mlx4 
driver? It's metadata used for individual pointer references, not a data 
fragment or an in/out buffer. It has no need for contiguous physical memory.


Thanks.


And you'll understand some people care about this.

Strongly.

Thanks.






[PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-17 Thread Qing Huang
When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v3: use PAGE_SIZE instead of PAGE_SHIFT
add comma to the end of enum variables
include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..3705f4e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -33,6 +33,7 @@
 
 #include <linux/errno.h>
 #include <linux/mm.h>
+#include <linux/vmalloc.h>
 #include <linux/scatterlist.h>
 #include <linux/slab.h>
 
@@ -43,12 +44,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +401,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm      = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm      = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt     = virt;
@@ -446,7 +447,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +463,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 		mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 }
-- 
2.9.3





Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-15 Thread Qing Huang



On 5/15/2018 12:08 PM, Eric Dumazet wrote:


On 05/15/2018 11:53 AM, Qing Huang wrote:

This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here, please call kvzalloc so we 
maintain a similar behavior when contiguous memory is available, and a fallback 
for resiliency.

Not sure what degradation vzalloc would cause here. I think it's better to leave 
the physically contiguous pages to other requests that really need them. 
Besides, the slow path / memory compaction can be really expensive.


Just use kvzalloc(), and you get the benefit of having contiguous memory if 
available,
without expensive compact phase.

This thing _automatically_ falls back to vmalloc(), thus your problem will be 
solved.

If you are not sure, trust others.


Thanks for the review. There are many places in the kernel and in applications 
where physically contiguous pages are needed. We saw quite a few issues when 
there was not enough contiguous physical memory available. My main concern here 
is: why use physically contiguous pages when they are not really needed?
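
A minimal usage sketch of the kvzalloc()/kvfree() pair Eric is recommending,
with the size expression taken from the patch (behavior as he describes: try
contiguous first, fall back to vmalloc):

	/* kvzalloc() attempts a zeroed kmalloc-style (physically contiguous)
	 * allocation first and transparently falls back to vmalloc when that
	 * is not available; kvfree() releases either kind.
	 */
	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
	if (!table->icm)
		return -ENOMEM;

	/* ... */

	kvfree(table->icm);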







Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-15 Thread Qing Huang



On 5/15/2018 2:19 AM, Tariq Toukan wrote:



On 14/05/2018 7:41 PM, Qing Huang wrote:



On 5/13/2018 2:00 AM, Tariq Toukan wrote:



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for 
each mtt

entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v2 -> v1: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c

index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to 
avoid high
+ * order memory allocations in fragmented/high usage memory 
situation.

   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.


Yes, we wanted to avoid high order memory allocations.



Then please use PAGE_SIZE instead.


PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT 
is actually more appropriate here.






Also, please add a comma at the end of the last entry.


Hmm... I followed the existing code style, and checkpatch.pl didn't 
complain about the comma.




I am in favor of having a comma also after the last element, so that 
when another enum element is added we do not modify this line again, 
which would falsely affect git blame.


I know it didn't exist before your patch, but once we're here, let's 
do it.


I'm okay either way. If adding an extra comma is preferred by many 
people, someone should update checkpatch.pl to enforce it. :)







  };
    static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct 
mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table,

  obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
  num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  -    table->icm  = kcalloc(num_icm, sizeof(*table->icm), 
GFP_KERNEL);

+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));


Why not kvzalloc ?


I think the table->icm pointer array doesn't really need physically 
contiguous memory. Sometimes high-order memory allocations by kmalloc 
variants may trigger the slow path and cause tasks to be blocked.




This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here, please call kvzalloc 
so we maintain a similar behavior when contiguous memory is available, 
and a fallback for resiliency.


Not sure what degradation vzalloc would cause here. I think it's better to 
leave the physically contiguous pages to other requests that really need 
them. Besides, the slow path / memory compaction can be really expensive.





Thanks,
Qing




  if (!table->icm)
  return -ENOMEM;
  table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table,

  mlx4_free_icm(dev, table->icm[i], use_coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
    return -ENOMEM;
  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev 
*dev, struct mlx4_icm_table *table)

  mlx4_free_icm(dev, table->icm[i], table->coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);

Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-14 Thread Qing Huang



On 5/13/2018 2:00 AM, Tariq Toukan wrote:



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v2 -> v1: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c

index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to 
avoid high

+ * order memory allocations in fragmented/high usage memory situation.
   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.


Yes, we wanted to avoid high order memory allocations.


Also, please add a comma at the end of the last entry.


Hmm... I followed the existing code style, and checkpatch.pl didn't 
complain about the comma.





  };
    static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct 
mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table,

  obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
  num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  -    table->icm  = kcalloc(num_icm, sizeof(*table->icm), 
GFP_KERNEL);

+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));


Why not kvzalloc ?


I think the table->icm pointer array doesn't really need physically 
contiguous memory. Sometimes high-order memory allocations by kmalloc 
variants may trigger the slow path and cause tasks to be blocked.


Thanks,
Qing




  if (!table->icm)
  return -ENOMEM;
  table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table,

  mlx4_free_icm(dev, table->icm[i], use_coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
    return -ENOMEM;
  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, 
struct mlx4_icm_table *table)

  mlx4_free_icm(dev, table->icm[i], table->coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
  }



Thanks for your patch.

I need to verify there is no dramatic performance degradation here.
You can prepare and send a v3 in the meanwhile.

Thanks,
Tariq





[PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-11 Thread Qing Huang
When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v1 -> v2: adjusted chunk size to reflect different architectures.

 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << PAGE_SHIFT
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-   table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}
 
-   kfree(table->icm);
+   vfree(table->icm);
 
return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}
 
-   kfree(table->icm);
+   vfree(table->icm);
 }
-- 
2.9.3
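
As a quick sanity check of the sizing argument in the commit message, the arithmetic can be
reproduced with a small standalone program (illustrative only; the numbers are the ones quoted
above, not values read from the driver):

#include <stdio.h>

int main(void)
{
	unsigned long long nobj = 1ULL << 30;		/* log_num_mtt=30 -> 1G MTT entries */
	unsigned long long obj_size = 8;		/* bytes per MTT entry */
	unsigned long long chunk = 1ULL << 12;		/* 4KB ICM chunk (x86_64 PAGE_SIZE) */
	unsigned long long obj_per_chunk = chunk / obj_size;			  /* 512 */
	unsigned long long num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; /* ~2M chunks */

	/* 2M pointers at 8 bytes each is a 16MB table->icm array, too big for kcalloc */
	printf("num_icm=%llu, table->icm array = %llu MB\n",
	       num_icm, num_icm * sizeof(void *) >> 20);
	return 0;
}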



Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-11 Thread Qing Huang


On 5/11/2018 3:27 AM, Håkon Bugge wrote:

On 11 May 2018, at 01:31, Qing Huang <qing.hu...@oracle.com> wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang<qing.hu...@oracle.com>
Acked-by: Daniel Jurgens<dani...@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
#include "fw.h"

/*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
  */
enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << 12,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << 12

Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the 
hw page size is 8KiB.


Good point on supporting a wider range of architectures. I got tunnel vision when fixing this
on our x64 lab machines.

Will send a v2 patch.

Thanks,
Qing


Thxs, Håkon


};

static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-   table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}

-   kfree(table->icm);
+   vfree(table->icm);

return -ENOMEM;
}
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}

-   kfree(table->icm);
+   vfree(table->icm);
}
--
2.9.3






Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-10 Thread Qing Huang

Thank you for reviewing it!


On 5/10/2018 6:23 PM, Yanjun Zhu wrote:




On 2018/5/11 9:15, Qing Huang wrote:




On 5/10/2018 5:13 PM, Yanjun Zhu wrote:



On 2018/5/11 7:31, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need

Hi,

Replacing contiguous memory pages with virtual memory, is there any performance loss?


Not really. "table->icm" will be accessed as individual pointer 
variables randomly. Kcalloc


Sure. Thanks. If "table->icm" will be accessed as individual pointer 
variables randomly, the performance loss

caused by discontinuous memory will be very trivial.

Reviewed-by: Zhu Yanjun <yanjun@oracle.com>

also returns a virtual address except its mapped pages are guaranteed 
to be contiguous
which will provide little advantage over vzalloc for individual 
pointer variable access.


Qing



Zhu Yanjun

of DMA ops).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
---
  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << 12,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << 12
  };
  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

  obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
  num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  -    table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);

+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));
  if (!table->icm)
  return -ENOMEM;
  table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

  mlx4_free_icm(dev, table->icm[i], use_coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
    return -ENOMEM;
  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)

  mlx4_free_icm(dev, table->icm[i], table->coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
  }
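
To illustrate the access-pattern point made in this thread: the table->icm array is only ever
dereferenced one slot at a time, so the array itself gains nothing from physically contiguous
backing pages. A simplified, self-contained sketch (names are illustrative, not the actual
mlx4 code):

#include <stdio.h>
#include <stdlib.h>

struct icm_chunk { int dummy; };	/* stands in for struct mlx4_icm */

/* one lookup touches a single pointer slot; neighbouring slots may live on
 * completely unrelated physical pages without any cost for this access
 */
static struct icm_chunk *icm_lookup(struct icm_chunk **icm_array,
				    unsigned long obj, unsigned long obj_per_chunk)
{
	return icm_array[obj / obj_per_chunk];
}

int main(void)
{
	unsigned long obj_per_chunk = 512;	/* 4KB chunk / 8-byte MTT entry */
	struct icm_chunk **arr = calloc(4, sizeof(*arr));
	struct icm_chunk c = { 1 };

	if (!arr)
		return 1;
	arr[3] = &c;				/* chunk covering objects 1536..2047 */
	printf("%d\n", icm_lookup(arr, 2000, obj_per_chunk)->dummy);
	free(arr);
	return 0;
}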











[PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-10 Thread Qing Huang
When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
  */
 enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << 12,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << 12
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-   table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}
 
-   kfree(table->icm);
+   vfree(table->icm);
 
return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}
 
-   kfree(table->icm);
+   vfree(table->icm);
 }
-- 
2.9.3




Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands

2018-04-26 Thread Qing Huang
Thanks! Looks like Eran's patch will be available much sooner than the v4.18 release time
frame (since v4.16 was just released).


We will wait. :-)


On 04/26/2018 04:43 PM, Saeed Mahameed wrote:

Before you address my comments, it looks like Eran's work is
converging and we will finalize the internal
review next week, in which case submission will take 2-3 weeks from today.

So i suggest to wait.

please checkout:
https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=testing/mlx5e-stats=15ffb7f87432d073e8ac0e6b895188d40fdda4d4




Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands

2018-04-26 Thread Qing Huang



On 04/26/2018 02:50 PM, Saeed Mahameed wrote:

On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang  wrote:

Current stats collecting scheme in mlx5 driver is to periodically fetch
aggregated stats from all the active mlx5 software channels associated
with the device. However, when an mlx5 interface is brought down (ifdown),
all the channels will be deactivated and closed. A new set of channels
will be created when next ifup command or a similar command is called.
Unfortunately the new channels will have all stats reset to 0. So you
lose the accumulated stats information. This behavior is different from
other netdev drivers including the mlx4 driver. In order to fix it, we
now save prior mlx5 software stats into netdev stats fields, so all the
accumulated stats will survive multiple runs of ifdown/ifup commands and
be shown correctly.

Orabug: 27548610

Signed-off-by: Qing Huang 
---

Hi Qing,

I am adding Eran since he is currently working on a similar patch,
He is also taking care of all cores/rings stats to make them
persistent, so you won't have discrepancy  between
ethtool and ifconfig stats.

I am ok with this patch, but this means Eran has to work his way around it.

so we have two options:

1. Temporary accept this patch, and change it later with Eran's work.
2. Wait for Eran's work.

I am ok with either one of them, please let me know.

Thanks !

Hi Saeed,

Any idea on rough ETA of Eran's stats work to be in upstream? If it will 
be available soon, I think
we can wait a bit. If it will take a while to redesign the whole stats 
scheme (for both ethtool and ifconfig),

maybe we can go with this incremental fix first?

Thanks!




  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++
  1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f1fe490..5d50e69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
 netdev_set_tc_queue(netdev, tc, nch, 0);
  }

+static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
+{
+   struct net_device *netdev = priv->netdev;
+
+   netdev->stats.rx_packets += priv->stats.sw.rx_packets;
+   netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
+   netdev->stats.tx_packets += priv->stats.sw.tx_packets;
+   netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
+   netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
+
+   priv->stats.sw.rx_packets   = 0;
+   priv->stats.sw.rx_bytes = 0;
+   priv->stats.sw.tx_packets   = 0;
+   priv->stats.sw.tx_bytes = 0;
+   priv->stats.sw.tx_queue_dropped = 0;
+}
+

This means that we are now explicitly clearing channels stats on
ifconfig down or switch_channels.
and now after ifconfing down, ethtool will always show 0, before this
patch it didn't.
Anyway update sw stats function will always override them with the new
channels stats next time we load new channels.
so it is not that big of a deal.



  static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
  {
 struct mlx5e_channel *c;
@@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
 netif_set_real_num_tx_queues(netdev, new_num_txqs);

 mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
	mlx5e_close_channels(&priv->channels);

 priv->channels = *new_chs;
@@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)

 netif_carrier_off(priv->netdev);
 mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
	mlx5e_close_channels(&priv->channels);

 return 0;
@@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
 stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
 } else {
-   stats->rx_packets = sstats->rx_packets;
-   stats->rx_bytes   = sstats->rx_bytes;
-   stats->tx_packets = sstats->tx_packets;
-   stats->tx_bytes   = sstats->tx_bytes;
-   stats->tx_dropped = sstats->tx_queue_dropped;
+   stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
+   stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
+   stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
+   stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
+   stats->tx_dropped = sstats->tx_queue_dropped +
+   dev->stats.tx_dropped;
 }

 stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
--
1.8.3.1





[PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands

2018-04-26 Thread Qing Huang
Current stats collecting scheme in mlx5 driver is to periodically fetch
aggregated stats from all the active mlx5 software channels associated
with the device. However, when an mlx5 interface is brought down (ifdown),
all the channels will be deactivated and closed. A new set of channels
will be created when next ifup command or a similar command is called.
Unfortunately the new channels will have all stats reset to 0. So you
lose the accumulated stats information. This behavior is different from
other netdev drivers including the mlx4 driver. In order to fix it, we
now save prior mlx5 software stats into netdev stats fields, so all the
accumulated stats will survive multiple runs of ifdown/ifup commands and
be shown correctly.

Orabug: 27548610

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f1fe490..5d50e69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
netdev_set_tc_queue(netdev, tc, nch, 0);
 }
 
+static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
+{
+   struct net_device *netdev = priv->netdev;
+
+   netdev->stats.rx_packets += priv->stats.sw.rx_packets;
+   netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
+   netdev->stats.tx_packets += priv->stats.sw.tx_packets;
+   netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
+   netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
+
+   priv->stats.sw.rx_packets   = 0;
+   priv->stats.sw.rx_bytes = 0;
+   priv->stats.sw.tx_packets   = 0;
+   priv->stats.sw.tx_bytes = 0;
+   priv->stats.sw.tx_queue_dropped = 0;
+}
+
 static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
 {
struct mlx5e_channel *c;
@@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
netif_set_real_num_tx_queues(netdev, new_num_txqs);
 
mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
	mlx5e_close_channels(&priv->channels);
 
priv->channels = *new_chs;
@@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
 
netif_carrier_off(priv->netdev);
mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
	mlx5e_close_channels(&priv->channels);
 
return 0;
@@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
} else {
-   stats->rx_packets = sstats->rx_packets;
-   stats->rx_bytes   = sstats->rx_bytes;
-   stats->tx_packets = sstats->tx_packets;
-   stats->tx_bytes   = sstats->tx_bytes;
-   stats->tx_dropped = sstats->tx_queue_dropped;
+   stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
+   stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
+   stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
+   stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
+   stats->tx_dropped = sstats->tx_queue_dropped +
+   dev->stats.tx_dropped;
}
 
stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
-- 
1.8.3.1
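
The bookkeeping the patch relies on can be summarized with a small schematic (illustrative
only, not driver code): the saved part lives in netdev->stats and survives channel teardown,
the live part is the per-channel software counters that reset whenever channels are recreated,
and the value reported to userspace is always the sum of the two:

#include <stdio.h>

struct counter {
	unsigned long long saved;	/* accumulated into netdev->stats on close */
	unsigned long long live;	/* per-channel sw stats, reset when channels are recreated */
};

/* what the stats read path reports after the patch */
static unsigned long long counter_read(const struct counter *c)
{
	return c->saved + c->live;
}

/* what mlx5e_netdev_save_stats() above does before the channels are closed */
static void counter_on_close(struct counter *c)
{
	c->saved += c->live;
	c->live = 0;
}

int main(void)
{
	struct counter rx = { 0, 1000 };

	counter_on_close(&rx);			/* ifdown: fold live stats into saved */
	rx.live = 250;				/* ifup: new channels start from zero */
	printf("%llu\n", counter_read(&rx));	/* 1250, nothing lost */
	return 0;
}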




Re: Setting large MTU size on slave interfaces may stall the whole system

2017-12-14 Thread Qing Huang

Hi Or,


On 12/13/2017 06:28 AM, Or Gerlitz wrote:

On Tue, Dec 12, 2017 at 5:21 AM, Qing Huang <qing.hu...@oracle.com> wrote:

Hi,

We found an issue with the bonding driver when testing Mellanox devices.
The following test commands will stall the whole system sometimes, with
serial console
flooded with log messages from the bond_miimon_inspect() function. Setting
mtu size
to be 1500 seems okay but very rarely it may hit the same problem too.

ip address flush dev ens3f0
ip link set dev ens3f0 down
ip address flush dev ens3f1
ip link set dev ens3f1 down
[root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1
updelay=500 downdelay=500
[root@ca-hcl629 etc]# ifconfig bond0 up
[root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
[root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up
Serial console output:

** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for
interface ens3f0, disabling it in 500 ms

[..]



It seems that when setting a large mtu size on an RoCE interface, the RTNL
mutex may be held too long by the slave
interface, causing bond_mii_monitor() to be called repeatedly at an interval
of 1 tick (1K HZ kernel configuration) and kernel to become unresponsive.

Did you try/managed to reproduce that also with other NIC drivers?
This seems to be an issue with the bonding driver. Also running older 
kernels on the same
hardware without commit (de77ecd4ef02) seems to work fine. We can try to 
reproduce the

issue with other type of NIC hardware.



Or.





Setting large MTU size on slave interfaces may stall the whole system

2017-12-11 Thread Qing Huang

(resend this email in text format)


Hi,

We found an issue with the bonding driver when testing Mellanox devices.
The following test commands will stall the whole system sometimes, with serial console
flooded with log messages from the bond_miimon_inspect() function. Setting mtu size
to be 1500 seems okay but very rarely it may hit the same problem too.

ip address flush dev ens3f0
ip link set dev ens3f0 down
ip address flush dev ens3f1
ip link set dev ens3f1 down
[root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1
updelay=500 downdelay=500
[root@ca-hcl629 etc]# ifconfig bond0 up
[root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
[root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up


Serial console output:

** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 5 printk messages dropped ** [ 3717.755737] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 5 printk messages dropped ** [ 3717.767758] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 4 printk messages dropped ** [ 3717.37] bond0: link status down for interface ens3f0, disabling it in 500 ms

or

** 4 printk messages dropped ** [274743.297863] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.307866] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.317857] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.327823] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.337817] bond0: link status down again after 500 ms for interface enp48s0f1


The root cause is the combined effect of commit 1f2cd845d3827412e82bf26dde0abca332ede402
(Revert "Merge branch 'bonding_monitor_locking'") and commit
de77ecd4ef02ca783f7762e04e92b3d0964be66b ("bonding: improve link-status update in
mii-monitoring"). For example, reverting the second commit makes the problem go away.
see the problem.

It seems that when setting a large mtu size on an RoCE interface, the RTNL mutex may be held
too long by the slave interface, causing bond_mii_monitor() to be called repeatedly at an
interval of 1 tick (with a 1K HZ kernel configuration) and the kernel to become unresponsive.


We found two possible solutions:

#1, don't re-arm the mii monitor thread too quickly if we cannot get the RTNL lock:

index b2db581..8fd587a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2266,7 +2266,6 @@ static void bond_mii_monitor(struct work_struct *work)


    /* Race avoidance with bond_close cancel of workqueue */
    if (!rtnl_trylock()) {
-   delay = 1;
    should_notify_peers = false;
    goto re_arm;
    }

#2, we use printk_ratelimit() to avoid flooding log messages generated 
by bond_miimon_inspect().


index b2db581..0183b7f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2054,7 +2054,7 @@ static int bond_miimon_inspect(struct bonding *bond)
    bond_propose_link_state(slave, BOND_LINK_FAIL);
    commit++;
    slave->delay = bond->params.downdelay;
-   if (slave->delay) {
+   if (slave->delay && printk_ratelimit()) {
    netdev_info(bond->dev, "link status down for %sinterface %s, disabling it in %d ms\n",
    (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) ?
@@ -2105,7 +2105,8 @@ static int bond_miimon_inspect(struct bonding *bond)
    case BOND_LINK_BACK:
    if (!link_state) {
    bond_propose_link_state(slave, BOND_LINK_DOWN);
-   netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
+   if (printk_ratelimit())
+   netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
    (bond->params.updelay - slave->delay) * bond->params.miimon,
    slave->dev->name);


Regarding the flooding messages, the netdev_info output is misleading anyway when
bond_mii_monitor() is called at a 1-tick interval due to lock contention.



Solution #1 looks simpler and cleaner to me. Any side effects of doing that?


Thanks,
Qing
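
To make the timing effect of suggestion #1 explicit: with miimon=250 and a 1K HZ kernel, the
normal re-arm interval is 250 ms, while the current delay = 1 fallback re-runs the monitor
every jiffy for as long as RTNL stays contended, emitting roughly one netdev_info per run per
affected slave, which matches the flood above. A standalone illustration of those two rates
(not bonding code):

#include <stdio.h>

int main(void)
{
	unsigned int hz = 1000;		/* 1K HZ kernel configuration */
	unsigned int miimon_ms = 250;	/* miimon=250 from the modprobe line above */

	unsigned int normal_ms = miimon_ms;	/* usual re-arm period */
	unsigned int contended_ms = 1000 / hz;	/* delay = 1 jiffy while rtnl_trylock() fails */

	printf("monitor runs/sec, normal:         %u\n", 1000 / normal_ms);
	printf("monitor runs/sec, RTNL contended: %u\n", 1000 / contended_ms);
	return 0;
}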




[PATCH] IB/CM: fix memory corruption by avoiding unnecessary memset

2017-11-02 Thread Qing Huang
The size of the path array can be dynamic. However, the fixed number (2)
of memset calls could cause memory corruption by writing into the wrong
memory space.

Fixes: 9fdca4da4d8c (IB/SA: Split struct sa_path_rec based on IB ands
ROCE specific fields)

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/infiniband/core/cm.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 4c4b465..af4f6a0 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1856,7 +1856,9 @@ static int cm_req_handler(struct cm_work *work)
 	cm_process_routed_req(req_msg, work->mad_recv_wc->wc);
 
 	memset(&work->path[0], 0, sizeof(work->path[0]));
-	memset(&work->path[1], 0, sizeof(work->path[1]));
+	if (cm_req_has_alt_path(req_msg))
+		memset(&work->path[1], 0, sizeof(work->path[1]));
+
 	grh = rdma_ah_read_grh(&cm_id_priv->av.ah_attr);
 	ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
 				work->port->port_num,
@@ -3823,8 +3825,8 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent,
 
 	switch (mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) {
 	case CM_REQ_ATTR_ID:
-		paths = 1 + (((struct cm_req_msg *) mad_recv_wc->recv_buf.mad)->
-			alt_local_lid != 0);
+		paths = 1 + cm_req_has_alt_path(
+				(struct cm_req_msg *)mad_recv_wc->recv_buf.mad);
 		event = IB_CM_REQ_RECEIVED;
 		break;
 	case CM_MRA_ATTR_ID:
case CM_MRA_ATTR_ID:
-- 
2.9.3
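
For readers following the diff: cm_req_has_alt_path() is the helper the
change relies on. A minimal sketch of what it checks, derived from the
"alt_local_lid != 0" expression removed in cm_recv_handler() above; the
in-tree definition may look at additional alternate-path fields:

/* Sketch only, based on the removed expression above. */
static inline int cm_req_has_alt_path(struct cm_req_msg *req_msg)
{
	return req_msg->alt_local_lid != 0;
}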



Re: [PATCH] ib/core: not to set page dirty bit if it's already set.

2017-05-23 Thread Qing Huang



On 5/23/2017 12:42 AM, Christoph Hellwig wrote:

On Mon, May 22, 2017 at 04:43:57PM -0700, Qing Huang wrote:

On 5/19/2017 6:05 AM, Christoph Hellwig wrote:

On Thu, May 18, 2017 at 04:33:53PM -0700, Qing Huang wrote:

This change will optimize kernel memory deregistration operations.
__ib_umem_release() used to call set_page_dirty_lock() against every
writable page in its memory region. Its purpose is to keep data
synced between CPU and DMA device when swapping happens after mem
deregistration ops. Now we choose not to set page dirty bit if it's
already set by kernel prior to calling __ib_umem_release(). This
reduces memory deregistration time by half or even more when we ran
application simulation test program.

As far as I can tell this code doesn't even need set_page_dirty_lock
and could just use set_page_dirty

It seems that set_page_dirty_lock has been used here for more than 10 years.
Don't know the original purpose. Maybe it was used to prevent races between
setting dirty bits and swapping out pages?

I suspect copy & paste.  Or maybe I don't actually understand the
explanation of set_page_dirty vs set_page_dirty_lock enough.  But
I'd rather not hack around the problem.
--
I think there are two parts here. First part is that we don't need to 
set the dirty bit if it's already set. Second part is whether we use 
set_page_dirty or set_page_dirty_lock to set dirty bits.
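
To make the two parts concrete, a small sketch (not a posted patch) of how
they combine inside the __ib_umem_release() loop:

/* Part 1: skip the call entirely when the page is already dirty.
 * Part 2 (still open): whether the remaining call should be
 * set_page_dirty() or set_page_dirty_lock().
 */
if (umem->writable && dirty && !PageDirty(page))
	set_page_dirty_lock(page);	/* or set_page_dirty(), see part 2 */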





Re: [PATCH] ib/core: not to set page dirty bit if it's already set.

2017-05-22 Thread Qing Huang


On 5/19/2017 6:05 AM, Christoph Hellwig wrote:

On Thu, May 18, 2017 at 04:33:53PM -0700, Qing Huang wrote:

This change will optimize kernel memory deregistration operations.
__ib_umem_release() used to call set_page_dirty_lock() against every
writable page in its memory region. Its purpose is to keep data
synced between CPU and DMA device when swapping happens after mem
deregistration ops. Now we choose not to set page dirty bit if it's
already set by kernel prior to calling __ib_umem_release(). This
reduces memory deregistration time by half or even more when we ran
application simulation test program.

As far as I can tell this code doesn't even need set_page_dirty_lock
and could just use set_page_dirty


It seems that set_page_dirty_lock has been used here for more than 10 
years. Don't know the original purpose. Maybe it was used to prevent 
races between setting dirty bits and swapping out pages?


Perhaps we can call set_page_dirty before calling ib_dma_unmap_sg?


Signed-off-by: Qing Huang<qing.hu...@oracle.com>
---
  drivers/infiniband/core/umem.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 3dbf811..21e60b1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,7 +58,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
 		page = sg_page(sg);
-		if (umem->writable && dirty)
+		if (!PageDirty(page) && umem->writable && dirty)
 			set_page_dirty_lock(page);
 		put_page(page);
 	}
--
2.9.3
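
A rough, untested sketch of the reordering floated above -- marking the
pages dirty with plain set_page_dirty() while they are still DMA-mapped,
then unmapping and releasing them. It keeps the __ib_umem_release()
signature and fields (sg_head, npages, nmap, writable) from the umem.c
version quoted here:

static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem,
			      int dirty)
{
	struct scatterlist *sg;
	struct page *page;
	int i;

	/* Mark dirty while the pages are still mapped for DMA. */
	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
		page = sg_page(sg);
		if (umem->writable && dirty && !PageDirty(page))
			set_page_dirty(page);
	}

	if (umem->nmap > 0)
		ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->npages,
				DMA_BIDIRECTIONAL);

	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i)
		put_page(sg_page(sg));

	sg_free_table(&umem->sg_head);
}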





[PATCH] ib/core: not to set page dirty bit if it's already set.

2017-05-18 Thread Qing Huang
This change will optimize kernel memory deregistration operations.
__ib_umem_release() used to call set_page_dirty_lock() against every
writable page in its memory region. Its purpose is to keep data
synced between CPU and DMA device when swapping happens after mem
deregistration ops. Now we choose not to set page dirty bit if it's
already set by kernel prior to calling __ib_umem_release(). This
reduces memory deregistration time by half or even more when we ran
application simulation test program.

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/infiniband/core/umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 3dbf811..21e60b1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,7 +58,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
 		page = sg_page(sg);
-		if (umem->writable && dirty)
+		if (!PageDirty(page) && umem->writable && dirty)
 			set_page_dirty_lock(page);
 		put_page(page);
 	}
-- 
2.9.3



Re: [PATCH] device probe: add self triggered delayed work request

2016-08-09 Thread Qing Huang



On 08/09/2016 03:11 AM, Shamir Rabinovitch wrote:

On Mon, Aug 08, 2016 at 05:10:05PM -0700, Qing Huang wrote:

Not sure if I understood your scenario. Why is there a deadlock here?


  CPU0                                               | CPU1
  ---------------------------------------------------+--------------------------------------
  driver_deferred_probe_add                          | driver_deferred_probe_trigger_wrapper
   mutex_lock(&deferred_probe_mutex)                 |  driver_deferred_probe_trigger
    cancel_delayed_work(&deferred_probe_trigger_work)|   mutex_lock(&deferred_probe_mutex)
     wait for "driver_deferred_probe_trigger_wrapper"|    wait for "deferred_probe_mutex"

Is this scenario possible with this patch?

If yes, then CPU0 will wait for CPU1 to finish the delayed work with
deferred_probe_mutex held, while CPU1 will try to finish the delayed work
and will wait for the same mutex forever.


CPU0 will not wait for "driver_deferred_probe_trigger_wrapper" to finish;
it simply puts the work request onto the queue and returns.


Qing
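
A short aside on why CPU0 does not block at that point -- only the
non-blocking cancel variant is used in the patch:

/* cancel_delayed_work() merely tries to dequeue a pending work item and
 * returns immediately, so calling it while holding deferred_probe_mutex
 * is fine.  cancel_delayed_work_sync() would additionally wait for a
 * running callback to finish -- and that variant could deadlock here,
 * because the callback itself takes deferred_probe_mutex.
 */
cancel_delayed_work(&deferred_probe_trigger_work);	/* non-blocking */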



It seems like a deadlock scenario to me.

Please say if this scenario is possible.

BR, Shamir Rabinovitch




Re: [PATCH] device probe: add self triggered delayed work request

2016-08-08 Thread Qing Huang



On 08/08/2016 03:42 AM, Shamir Rabinovitch wrote:

Hi Qing,

I suspect there is a potential deadlock with this patch:

cpu0                                      cpu1
----                                      ----
driver_deferred_probe_add                 deferred_probe_work_func
  ...                                      mutex_unlock(&deferred_probe_mutex)
  mutex_lock(&deferred_probe_mutex)        bus_probe_device(dev)
   ...                                      device return -EPROBE_DEFER
   ...                                       driver_deferred_probe_add
   ...                                        mutex_lock(&deferred_probe_mutex)
   ...
   cancel_delayed_work(&deferred_probe_trigger_work)


Not sure if I understood your scenario. Why is there a deadlock here?



Please confirm if this scenario is possible.

BR, Shamir Rabinovitch




Re: [PATCH] device probe: add self triggered delayed work request

2016-08-08 Thread Qing Huang



On 08/08/2016 01:44 PM, Frank Rowand wrote:

On 07/29/16 22:39, Qing Huang wrote:

In normal condition, the device probe requests kept in deferred
queue would only be triggered for re-probing when another new device
probe is finished successfully. This change will set up a delayed
trigger work request if the current deferred probe being added is
the only one in the queue. This delayed work request will try to
reactivate any device from the deferred queue for re-probing later.

By doing this, if the last device being probed in system boot process
has a deferred probe error, this particular device will still be able
to be probed again.

I am trying to understand the use case.

Can you explain the scenario you are trying to fix?  If I understand
correctly, you expect that something will change such that a later
probe attempt will succeed.  How will that change occur and why
will the deferred probe list not be processed in this case?

Why are you conditioning this on the deferred_probe_pending_list
being empty?

-Frank


It turns out one corner case which we worried about has already been 
solved in the really_probe() function by comparing 
'deferred_trigger_count' values.


Another use case we are investigating now: when we probe a device, the
main thread returns EPROBE_DEFER from the driver after we spawn a child
thread to do the actual init work, so we can initialize multiple similar
devices at the same time. After the child thread finishes its task, we can
call driver_deferred_probe_trigger() directly from the child thread to
re-probe the device (driver_deferred_probe_trigger() has to be exported,
though). Or we could rely on something in this patch to re-probe the
deferred devices from the pending list...


What do you suggest?

Thanks,
-Qing
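
For reference, a rough sketch of the child-thread use case described
above. All foo_* names and fields are made up, and it assumes
driver_deferred_probe_trigger() were exported, which it currently is not:

#include <linux/kthread.h>
#include <linux/platform_device.h>

/* Hypothetical example: struct foo_device, foo_slow_hw_init() and
 * foo_finish_probe() only stand in for a driver's own long-running init
 * and its normal probe path.
 */
static int foo_init_thread(void *data)
{
	struct foo_device *fdev = data;

	foo_slow_hw_init(fdev);			/* long-running init work */
	fdev->ready = true;
	driver_deferred_probe_trigger();	/* assumed exported; re-runs deferred probes */
	return 0;
}

static int foo_probe(struct platform_device *pdev)
{
	struct foo_device *fdev = dev_get_platdata(&pdev->dev);

	if (fdev->ready)
		return foo_finish_probe(fdev);

	if (!fdev->init_started) {
		fdev->init_started = true;
		kthread_run(foo_init_thread, fdev, "foo-init");
	}
	return -EPROBE_DEFER;	/* probe again once the child thread is done */
}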


Cc: Greg Kroah-Hartman<gre...@linuxfoundation.org>
Cc: Grant Likely<grant.lik...@linaro.org>
Cc: Santosh Shilimkar<santosh.shilim...@oracle.com>
Signed-off-by: Qing Huang<qing.hu...@oracle.com>
---
 drivers/base/dd.c |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 16688f5..251042d 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -53,6 +53,9 @@ static LIST_HEAD(deferred_probe_pending_list);
 static LIST_HEAD(deferred_probe_active_list);
 static struct workqueue_struct *deferred_wq;
 static atomic_t deferred_trigger_count = ATOMIC_INIT(0);
+static atomic_t deferred_self_trigger_count = ATOMIC_INIT(0);
+#define DEFERRED_SELF_TRIGGER_INTERVAL 30 /* 30 secs */
+#define DEFERRED_SELF_TRIGGER_ATTEMPTS 5  /* 5 times */
 
 /*
  * In some cases, like suspend to RAM or hibernation, It might be reasonable
@@ -115,10 +118,23 @@ static void deferred_probe_work_func(struct work_struct *work)
 	mutex_unlock(&deferred_probe_mutex);
 }
 static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
+void driver_deferred_probe_trigger_wrapper(struct work_struct *work);
+static DECLARE_DELAYED_WORK(deferred_probe_trigger_work,
+			    driver_deferred_probe_trigger_wrapper);
 
 static void driver_deferred_probe_add(struct device *dev)
 {
+	int local_self_trigger_count;
+
 	mutex_lock(&deferred_probe_mutex);
+	local_self_trigger_count = atomic_read(&deferred_self_trigger_count);
+	if (list_empty(&deferred_probe_pending_list) &&
+	    local_self_trigger_count <= DEFERRED_SELF_TRIGGER_ATTEMPTS) {
+		cancel_delayed_work(&deferred_probe_trigger_work);
+		queue_delayed_work(deferred_wq, &deferred_probe_trigger_work,
+				   HZ * DEFERRED_SELF_TRIGGER_INTERVAL);
+	}
+
 	if (list_empty(&dev->p->deferred_probe)) {
 		dev_dbg(dev, "Added to deferred list\n");
 		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
@@ -202,6 +218,15 @@ void device_unblock_probing(void)
 	driver_deferred_probe_trigger();
 }
 
+/*
+ * Wrapper function for delayed trigger work requests
+ */
+void driver_deferred_probe_trigger_wrapper(struct work_struct *work)
+{
+	atomic_inc(&deferred_self_trigger_count);
+	driver_deferred_probe_trigger();
+}
+
 /**
  * deferred_probe_initcall() - Enable probing of deferred devices
  *





[PATCH] device probe: add self triggered delayed work request

2016-07-29 Thread Qing Huang
In normal condition, the device probe requests kept in deferred
queue would only be triggered for re-probing when another new device
probe is finished successfully. This change will set up a delayed
trigger work request if the current deferred probe being added is
the only one in the queue. This delayed work request will try to
reactivate any device from the deferred queue for re-probing later.

By doing this, if the last device being probed in system boot process
has a deferred probe error, this particular device will still be able
to be probed again.

Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Grant Likely <grant.lik...@linaro.org>
Cc: Santosh Shilimkar <santosh.shilim...@oracle.com>
Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/base/dd.c |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 16688f5..251042d 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -53,6 +53,9 @@ static LIST_HEAD(deferred_probe_pending_list);
 static LIST_HEAD(deferred_probe_active_list);
 static struct workqueue_struct *deferred_wq;
 static atomic_t deferred_trigger_count = ATOMIC_INIT(0);
+static atomic_t deferred_self_trigger_count = ATOMIC_INIT(0);
+#define DEFERRED_SELF_TRIGGER_INTERVAL 30 /* 30 secs */
+#define DEFERRED_SELF_TRIGGER_ATTEMPTS 5  /* 5 times */
 
 /*
  * In some cases, like suspend to RAM or hibernation, It might be reasonable
@@ -115,10 +118,23 @@ static void deferred_probe_work_func(struct work_struct *work)
 	mutex_unlock(&deferred_probe_mutex);
 }
 static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
+void driver_deferred_probe_trigger_wrapper(struct work_struct *work);
+static DECLARE_DELAYED_WORK(deferred_probe_trigger_work,
+			    driver_deferred_probe_trigger_wrapper);
 
 static void driver_deferred_probe_add(struct device *dev)
 {
+	int local_self_trigger_count;
+
 	mutex_lock(&deferred_probe_mutex);
+	local_self_trigger_count = atomic_read(&deferred_self_trigger_count);
+	if (list_empty(&deferred_probe_pending_list) &&
+	    local_self_trigger_count <= DEFERRED_SELF_TRIGGER_ATTEMPTS) {
+		cancel_delayed_work(&deferred_probe_trigger_work);
+		queue_delayed_work(deferred_wq, &deferred_probe_trigger_work,
+				   HZ * DEFERRED_SELF_TRIGGER_INTERVAL);
+	}
+
 	if (list_empty(&dev->p->deferred_probe)) {
 		dev_dbg(dev, "Added to deferred list\n");
 		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
@@ -202,6 +218,15 @@ void device_unblock_probing(void)
 	driver_deferred_probe_trigger();
 }
 
+/*
+ * Wrapper function for delayed trigger work requests
+ */
+void driver_deferred_probe_trigger_wrapper(struct work_struct *work)
+{
+	atomic_inc(&deferred_self_trigger_count);
+	driver_deferred_probe_trigger();
+}
+
 /**
  * deferred_probe_initcall() - Enable probing of deferred devices
  *
-- 
1.7.1
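
For completeness, a minimal standalone illustration (not from the patch)
of the delayed-work pattern the change builds on -- a one-shot work item
that is cancelled and re-armed on each use:

#include <linux/printk.h>
#include <linux/workqueue.h>

static void retry_fn(struct work_struct *work)
{
	pr_debug("deferred re-trigger fired\n");
}
static DECLARE_DELAYED_WORK(retry_work, retry_fn);

static void arm_retry(struct workqueue_struct *wq)
{
	cancel_delayed_work(&retry_work);		/* drop any pending instance */
	queue_delayed_work(wq, &retry_work, 30 * HZ);	/* fire in ~30 seconds */
}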


