[PATCH v2] IB/mlx5: avoid excessive warning msgs when creating VFs on 2nd port
When a CX5 device is configured in dual-port RoCE mode, after creating many VFs against port 1, creating the same number of VFs against port 2 will flood kernel/syslog with messages like:

  mlx5_*:mlx5_ib_bind_slave_port:4266:(pid 5269): port 2 already affiliated.

The flood occurs because, while traversing mlx5_ib_dev_list, mlx5_ib_add_slave_port() attempts to bind the new mpi structure to every device on the list until it finds an unbound device, and each failed attempt logs a warning. Change the log level from warn to dbg to avoid the log flooding, as the condition is harmless.

Signed-off-by: Qing Huang
---
v1 -> v2: change the log level instead

 drivers/infiniband/hw/mlx5/main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b3ba9a2..f57b8f7 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -5127,8 +5127,8 @@ static bool mlx5_ib_bind_slave_port(struct mlx5_ib_dev *ibdev,
 	spin_lock(&ibdev->port[port_num].mp.mpi_lock);
 	if (ibdev->port[port_num].mp.mpi) {
-		mlx5_ib_warn(ibdev, "port %d already affiliated.\n",
-			     port_num + 1);
+		mlx5_ib_dbg(ibdev, "port %d already affiliated.\n",
+			    port_num + 1);
 		spin_unlock(&ibdev->port[port_num].mp.mpi_lock);
 		return false;
 	}
--
2.9.3
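The binding walk described in the commit message can be sketched outside the kernel. The following is a hypothetical Python model, not driver code: the function names only mirror `mlx5_ib_bind_slave_port()` and `mlx5_ib_add_slave_port()`, and it assumes one ib device per VF with a two-slot per-port mpi array.

```python
# Hypothetical sketch (not driver code): models how mlx5_ib_add_slave_port()
# walks mlx5_ib_dev_list and tries to bind a new mpi to each device in turn.
# Every device whose slot for this port is already taken emits one
# "already affiliated" message before the walk reaches an unbound device.

def bind_slave_port(dev, port, log):
    """Mimics mlx5_ib_bind_slave_port(): fail if the port slot is taken."""
    if dev["ports"][port] is not None:
        log.append(f"{dev['name']}: port {port + 1} already affiliated.")
        return False
    dev["ports"][port] = "mpi"
    return True

def add_slave_port(dev_list, port, log):
    """Mimics mlx5_ib_add_slave_port(): try every device until one binds."""
    for dev in dev_list:
        if bind_slave_port(dev, port, log):
            return

# 8 devices (e.g. one per VF); port 1 is already fully affiliated.
devs = [{"name": f"dev{i}", "ports": ["mpi", None]} for i in range(8)]

log = []
for _ in range(8):            # now affiliate port 2 on each device
    add_slave_port(devs, 1, log)

# Device k binds only after k failed attempts, so the number of messages
# grows quadratically with the device count: 0 + 1 + ... + 7 = 28.
print(len(log))               # prints 28
```

This illustrates why the flood scales badly as VF counts grow, and why demoting the message to a debug print (rather than reordering the walk) is the low-risk fix.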
Re: [PATCH] IB/mlx5: avoid binding a new mpi unit to the same devices repeatedly
Hi Daniel,

On 7/23/2018 11:11 AM, Daniel Jurgens wrote:
> On 7/23/2018 10:36 AM, Qing Huang wrote:
>> Hi Daniel/Parav,
>>
>> Have you got a chance to review this patch? Thanks!
>
> Hi Qing, sorry for the delay, I just got back to the office today. I don't
> agree with the proposed fix; I provided an alternative suggestion below.
>
>> Reported-by: Gerald Gibson
>> Signed-off-by: Qing Huang
>> ---
>>  drivers/infiniband/hw/mlx5/main.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
>> index b3ba9a2..1ddd1d3 100644
>> --- a/drivers/infiniband/hw/mlx5/main.c
>> +++ b/drivers/infiniband/hw/mlx5/main.c
>> @@ -6068,7 +6068,8 @@ static void *mlx5_ib_add_slave_port(struct mlx5_core_dev *mdev, u8 port_num)
>>  	mutex_lock(&mlx5_ib_multiport_mutex);
>>  	list_for_each_entry(dev, &mlx5_ib_dev_list, ib_dev_list) {
>> -		if (dev->sys_image_guid == mpi->sys_image_guid)
>> +		if (dev->sys_image_guid == mpi->sys_image_guid &&
>> +		    !dev->port[mlx5_core_native_port_num(mdev) - 1].mp.mpi)
>
> You shouldn't check the mpi field like that without holding the lock in the
> mp structure. I'd prefer you change the print in mlx5_ib_bind_slave_port
> from a warning to a debug message.

Thanks for the review. That works for us too. Will resend the patch.

Regards,
Qing

>>  			bound = mlx5_ib_bind_slave_port(dev, mpi);
>>  		if (bound) {
>> --
>> 2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/mlx5: avoid binding a new mpi unit to the same devices repeatedly
On 7/15/2018 12:48 PM, Daniel Jurgens wrote:
> On 7/14/2018 10:57 AM, Or Gerlitz wrote:
>> On Sat, Jul 14, 2018 at 2:50 AM, Qing Huang wrote:
>>> When a CX5 device is configured in dual-port RoCE mode, after creating
>>> many VFs against port 1, creating the same number of VFs against port 2
>>> will flood kernel/syslog with something like
>>> "mlx5_*:mlx5_ib_bind_slave_port:4266:(pid 5269): port 2 already
>>> affiliated." So basically, when traversing mlx5_ib_dev_list,
>>> mlx5_ib_add_slave_port() shouldn't repeatedly attempt to bind the new
>>> mpi data unit to every device on the list until it finds an unbound
>>> device.
>>
>> Daniel, what is an mpi data unit?
>>
>> Or.
>
> It's a structure to keep track of affiliated port info in dual-port RoCE
> mode; mpi means multi-port info. Parav can review this in my absence,
> otherwise I can take a closer look when I return to the office.

Hi Daniel/Parav,

Have you got a chance to review this patch? Thanks!

>>> Reported-by: Gerald Gibson
>>> Signed-off-by: Qing Huang
>>> ---
>>>  drivers/infiniband/hw/mlx5/main.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
>>> index b3ba9a2..1ddd1d3 100644
>>> --- a/drivers/infiniband/hw/mlx5/main.c
>>> +++ b/drivers/infiniband/hw/mlx5/main.c
>>> @@ -6068,7 +6068,8 @@ static void *mlx5_ib_add_slave_port(struct mlx5_core_dev *mdev, u8 port_num)
>>>  	mutex_lock(&mlx5_ib_multiport_mutex);
>>>  	list_for_each_entry(dev, &mlx5_ib_dev_list, ib_dev_list) {
>>> -		if (dev->sys_image_guid == mpi->sys_image_guid)
>>> +		if (dev->sys_image_guid == mpi->sys_image_guid &&
>>> +		    !dev->port[mlx5_core_native_port_num(mdev) - 1].mp.mpi)
>>>  			bound = mlx5_ib_bind_slave_port(dev, mpi);
>>>  		if (bound) {
>>> --
>>> 2.9.3
[PATCH] IB/mlx5: avoid binding a new mpi unit to the same devices repeatedly
When a CX5 device is configured in dual-port RoCE mode, after creating many VFs against port 1, creating the same number of VFs against port 2 will flood kernel/syslog with messages like:

  mlx5_*:mlx5_ib_bind_slave_port:4266:(pid 5269): port 2 already affiliated.

So basically, when traversing mlx5_ib_dev_list, mlx5_ib_add_slave_port() shouldn't repeatedly attempt to bind the new mpi data unit to every device on the list until it finds an unbound device.

Reported-by: Gerald Gibson
Signed-off-by: Qing Huang
---
 drivers/infiniband/hw/mlx5/main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b3ba9a2..1ddd1d3 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6068,7 +6068,8 @@ static void *mlx5_ib_add_slave_port(struct mlx5_core_dev *mdev, u8 port_num)
 	mutex_lock(&mlx5_ib_multiport_mutex);
 	list_for_each_entry(dev, &mlx5_ib_dev_list, ib_dev_list) {
-		if (dev->sys_image_guid == mpi->sys_image_guid)
+		if (dev->sys_image_guid == mpi->sys_image_guid &&
+		    !dev->port[mlx5_core_native_port_num(mdev) - 1].mp.mpi)
 			bound = mlx5_ib_bind_slave_port(dev, mpi);
 		if (bound) {
--
2.9.3
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On 5/31/2018 2:10 AM, Michal Hocko wrote:
> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
>> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
>> [...]
>>> I merely copied/pasted from alloc_skb_with_frags() :/
>>
>> I will have a look at it. Thanks!
>
> OK, so this is an example of incremental development ;).
>
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for high
> order allocations") to prevent the OOM killer. Yet this was not enough,
> because fb05e7a89f50 ("net: don't wait for order-3 page allocation")
> didn't want excessive reclaim for non-costly orders, so it made the
> allocation completely NOWAIT while preserving __GFP_NORETRY, which is now
> redundant. Should I send a patch?

Just curious, would the GFP_ATOMIC flag work in a similar fashion? We experimented with it a bit in the past, but it seemed to cause other issues in our tests. :-)

By the way, we didn't encounter any OOM killer events. It seemed that mlx4_alloc_icm() triggered the slowpath. We still had about 2GB of free memory, but it was highly fragmented.

 #0 [8801f308b380] remove_migration_pte at 811f0e0b
 #1 [8801f308b3e0] rmap_walk_file at 811cb890
 #2 [8801f308b440] rmap_walk at 811cbaf2
 #3 [8801f308b450] remove_migration_ptes at 811f0db0
 #4 [8801f308b490] __unmap_and_move at 811f2ea6
 #5 [8801f308b4e0] unmap_and_move at 811f2fc5
 #6 [8801f308b540] migrate_pages at 811f3219
 #7 [8801f308b5c0] compact_zone at 811b707e
 #8 [8801f308b650] compact_zone_order at 811b735d
 #9 [8801f308b6e0] try_to_compact_pages at 811b7485
#10 [8801f308b770] __alloc_pages_direct_compact at 81195f96
#11 [8801f308b7b0] __alloc_pages_slowpath at 811978a1
#12 [8801f308b890] __alloc_pages_nodemask at 81197ec1
#13 [8801f308b970] alloc_pages_current at 811e261f
#14 [8801f308b9e0] mlx4_alloc_icm at a01f39b2 [mlx4_core]

Thanks!
[PATCH V4] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragmentation), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter the slow path, doing memory compaction/migration ops in order to complete high-order memory allocations. When that happens, user processes calling uverb APIs can easily get stuck for more than 120s even though there are a lot of free pages in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With a 4KB ICM chunk size on the x86_64 arch, the above issue is fixed. However, in order to support the smaller ICM chunk size, we need to fix another issue in large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes per mtt entry), so we need a 16MB allocation for the table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail.

The solution is to replace kcalloc with kvzalloc, which falls back to vmalloc automatically if kmalloc fails.
Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v4: use kvzalloc instead of vzalloc
    add one err condition check
    don't include vmalloc.h any more

v3: use PAGE_SIZE instead of PAGE_SHIFT
    add comma to the end of enum variables
    include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..685337d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u64 size;
 
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
+	if (WARN_ON(!obj_per_chunk))
+		return -EINVAL;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 }
--
2.9.3
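As a sanity check on the sizing math in the commit message, the numbers can be reproduced with a short sketch. This assumes x86_64 (4 KB pages, 8-byte pointers) and the 8-byte mtt entry size stated above; it is arithmetic only, not driver code.

```python
# Back-of-the-envelope check of the ICM table sizing in the commit message,
# assuming a 4 KB page/chunk size and 8-byte mtt entries.
PAGE_SIZE = 4096
MTT_ENTRY_SIZE = 8
num_mtt = 1 << 30                     # log_num_mtt=30 -> 1G mtt entries

obj_per_chunk = PAGE_SIZE // MTT_ENTRY_SIZE                 # entries per chunk
num_icm = (num_mtt + obj_per_chunk - 1) // obj_per_chunk    # chunks needed
ptr_array_bytes = num_icm * 8         # one 8-byte pointer per chunk (x86_64)

print(obj_per_chunk)                  # 512 entries per 4 KB chunk
print(num_icm)                        # 2097152 -> 2M chunk pointers
print(ptr_array_bytes >> 20)          # 16 -> a 16 MB table->icm array
```

A 16 MB physically contiguous kmalloc (order 12 with 4 KB pages) is well beyond what the page allocator will reliably satisfy, which is why kvzalloc's vmalloc fallback matters here.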
Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks
On 5/22/2018 8:33 AM, Tariq Toukan wrote:
> On 18/05/2018 12:45 AM, Qing Huang wrote:
>> On 5/17/2018 2:14 PM, Eric Dumazet wrote:
>>> On 05/17/2018 01:53 PM, Qing Huang wrote:
>>>> When a system is under memory pressure (high usage with fragmentation),
>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>> memory management to enter the slow path, doing memory compaction/
>>>> migration ops in order to complete high-order memory allocations. When
>>>> that happens, user processes calling uverb APIs can easily get stuck
>>>> for more than 120s even though there are a lot of free pages in smaller
>>>> chunks available in the system.
>>>>
>>>> Syslog:
>>>> ...
>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>> ...
>>>
>>> NACK on this patch.
>>>
>>> You have been asked repeatedly to use kvmalloc(). This is not a minor
>>> suggestion. Take a look at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d
>>> and you'll understand some people care about this. Strongly.
>>
>> Would you please take a look at how table->icm is being used in the mlx4
>> driver? It's metadata used for individual pointer variable referencing,
>> not a data frag or in/out buffer. It has no need for contiguous physical
>> memory. Thanks.
>
> NACK.
>
> This would cause a degradation when iterating the entries of table->icm,
> for example in mlx4_table_get_range.

E.g.:

int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
			 u32 start, u32 end)
{
	int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size;
	int err;
	u32 i;

	for (i = start; i <= end; i += inc) {
		err = mlx4_table_get(dev, table, i);
		if (err)
			goto fail;
	}

	return 0;
	...
}

An mtt obj is 8 bytes, so a 4KB ICM block holds 512 mtt objects. You would have to allocate another 512 mtt objects in order for the table->icm pointer to increment by 1 and fetch the next pointer value. So 256K mtt objects are needed to traverse the table->icm pointer array across a page boundary in these call stacks.

Considering mlx4_table_get_range() is only used in the control path, there is no significant gain from using kvzalloc vs. vzalloc for table->icm. Anyway, if a user makes sure the mlx4 driver is loaded very early and doesn't remove and reload it afterwards, we should have enough (and not waste) contiguous physical memory for the table->icm allocation. I will use kvzalloc to replace vzalloc and send a V4 patch.

Thanks,
Qing

> Thanks,
> Tariq
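The 256K figure above can be double-checked with a small sketch. This assumes 4 KB pages, 8-byte mtt objects, and 8-byte pointers in the table->icm array, as stated in the thread; it is arithmetic only, not driver code.

```python
# Sketch of the access pattern discussed above: with PAGE_SIZE chunks,
# mlx4_table_get_range() advances by one table->icm pointer every `inc`
# objects, so consecutive pointer reads stay within one 4 KB page of the
# pointer array for a long stretch of objects.
PAGE_SIZE = 4096
OBJ_SIZE = 8                          # mtt entry size
inc = PAGE_SIZE // OBJ_SIZE           # 512 objects per ICM chunk
ptrs_per_page = PAGE_SIZE // 8        # 512 chunk pointers per pointer page

# Objects covered before the walk crosses into the next page of pointers:
objs_per_ptr_page = inc * ptrs_per_page
print(objs_per_ptr_page)              # prints 262144, i.e. the 256K above
```

This is the basis of the argument that vmalloc's extra indirection is only hit once per 256K objects, and only on the control path.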
Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks
On 5/17/2018 2:14 PM, Eric Dumazet wrote:
> On 05/17/2018 01:53 PM, Qing Huang wrote:
>> When a system is under memory pressure (high usage with fragmentation),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter the slow path, doing memory compaction/
>> migration ops in order to complete high-order memory allocations. When
>> that happens, user processes calling uverb APIs can easily get stuck for
>> more than 120s even though there are a lot of free pages in smaller
>> chunks available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>
> NACK on this patch.
>
> You have been asked repeatedly to use kvmalloc(). This is not a minor
> suggestion. Take a look at
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d
> and you'll understand some people care about this. Strongly.

Would you please take a look at how table->icm is being used in the mlx4 driver? It's metadata used for individual pointer variable referencing, not a data frag or in/out buffer. It has no need for contiguous physical memory.

Thanks.
[PATCH v3] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragmentation), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter the slow path, doing memory compaction/migration ops in order to complete high-order memory allocations. When that happens, user processes calling uverb APIs can easily get stuck for more than 120s even though there are a lot of free pages in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With a 4KB ICM chunk size on the x86_64 arch, the above issue is fixed. However, in order to support the smaller ICM chunk size, we need to fix another issue in large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes per mtt entry), so we need a 16MB allocation for the table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail.

The solution is to replace kcalloc with vzalloc. There is no need for contiguous memory pages for a driver metadata structure (no DMA ops are involved).
Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v3: use PAGE_SIZE instead of PAGE_SHIFT
    add comma to the end of enum variables
    include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..3705f4e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -33,6 +33,7 @@
 #include <linux/errno.h>
 #include <linux/mm.h>
+#include <linux/vmalloc.h>
 #include <linux/scatterlist.h>
 #include <linux/slab.h>
 
@@ -43,12 +44,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +401,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +447,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +463,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 }
--
2.9.3
[PATCH v3] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragmentation), the
original 256KB ICM chunk allocations will likely trigger the kernel memory
manager to enter its slow path, doing memory compaction/migration ops in
order to complete high-order memory allocations. When that happens, user
processes calling uverbs APIs can easily get stuck for more than 120s, even
though there are a lot of free pages in smaller chunks available in the
system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With a 4KB ICM chunk size on the x86_64 arch, the above issue is fixed.

However, in order to support a smaller ICM chunk size, we need to fix
another issue in large kcalloc allocations. E.g. setting log_num_mtt=30
requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can
only hold 512 mtt entries (8 bytes for each mtt entry), so we need a 16MB
allocation for the table->icm pointer array to hold 2M pointers, which can
easily cause kcalloc to fail.

The solution is to use vzalloc in place of kcalloc. There is no need for
physically contiguous memory pages for a driver metadata structure (no DMA
ops are performed on it).

Signed-off-by: Qing Huang
Acked-by: Daniel Jurgens
Reviewed-by: Zhu Yanjun
---
v3: use PAGE_SIZE instead of PAGE_SHIFT
    add comma to the end of enum variables
    include vmalloc.h header file to avoid build issues on Sparc
v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..3705f4e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -33,6 +33,7 @@
 #include
 #include
+#include <linux/vmalloc.h>
 #include
 #include
@@ -43,12 +44,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +401,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +447,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +463,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);
 }
--
2.9.3
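For concreteness, the pointer-array blow-up described in the commit message can be checked directly. This is only a sketch of the arithmetic: the 4KB chunk size and 8-byte mtt entry size come from the patch itself, while the 8-byte pointer size is an assumption for a 64-bit architecture.

```python
# Reproduce the sizing argument from the commit message: with
# log_num_mtt=30 and 4KB ICM chunks, the table->icm pointer array
# alone needs a 16MB allocation, which kcalloc can easily fail to
# satisfy as one physically contiguous block.
CHUNK_SIZE = 4096       # MLX4_TABLE_CHUNK_SIZE after this patch (x86_64)
MTT_ENTRY_SIZE = 8      # bytes per mtt entry (from the commit message)
PTR_SIZE = 8            # sizeof(struct mlx4_icm *) on a 64-bit arch (assumed)

num_mtt = 1 << 30                             # log_num_mtt=30 -> 1G entries
obj_per_chunk = CHUNK_SIZE // MTT_ENTRY_SIZE  # entries held per ICM chunk
# Same rounding as mlx4_init_icm_table(): ceiling division.
num_icm = (num_mtt + obj_per_chunk - 1) // obj_per_chunk
array_bytes = num_icm * PTR_SIZE

print(obj_per_chunk, num_icm, array_bytes)  # 512 2097152 16777216
```

That is 512 entries per chunk, 2M chunk pointers, and a 16MB pointer array, matching the numbers in the commit message.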
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 5/15/2018 12:08 PM, Eric Dumazet wrote:
> On 05/15/2018 11:53 AM, Qing Huang wrote:
>>> This is control path so it is less latency-sensitive. Let's not
>>> produce unnecessary degradation here, please call kvzalloc so we
>>> maintain a similar behavior when contiguous memory is available, and
>>> a fallback for resiliency.
>>
>> Not sure what degradation exactly is caused by vzalloc here. I think
>> it's better to leave physically contiguous pages to other requests
>> which really need them. Besides, slow path/mem compacting can be
>> really expensive.
>
> Just use kvzalloc(), and you get the benefit of having contiguous
> memory if available, without the expensive compact phase. This thing
> _automatically_ falls back to vmalloc(), thus your problem will be
> solved. If you are not sure, trust others.

Thanks for the review. There are many places in the kernel and in
applications where physically contiguous pages are needed. We saw quite a
few issues when there were not enough contiguous physical memory available.
My main concern here is: why use physically contiguous pages when they are
not really needed?
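Eric's kvzalloc() suggestion boils down to "try the physically contiguous allocator first, and fall back to the virtually contiguous one instead of failing". The real kernel helper attempts kmalloc without triggering expensive reclaim/compaction and falls back to vmalloc on failure; the userspace model below only illustrates that shape, with a made-up size threshold standing in for the kmalloc failure condition.

```python
# A simplified userspace model of the kvzalloc() pattern: prefer the
# "cheap when it works" contiguous path, fall back to the "always
# works" path. The 4MB threshold is purely illustrative; the real
# kvzalloc falls back when kmalloc actually fails, not at a fixed size.
def kvzalloc_model(size, kmalloc_limit=4 * 1024 * 1024):
    """Return (path_taken, zeroed_buffer) for a hypothetical allocator."""
    if size <= kmalloc_limit:
        return ("kmalloc", bytearray(size))   # physically contiguous path
    return ("vmalloc", bytearray(size))       # virtually contiguous fallback

# A small request takes the contiguous path; the 16MB table->icm
# array from the commit message would take the fallback.
small_path, _ = kvzalloc_model(4096)
big_path, _ = kvzalloc_model(16 * 1024 * 1024)
```

The design point under debate: vzalloc always pays the page-table mapping cost but never compacts memory; kvzalloc gets contiguous memory opportunistically. The patch eventually merged upstream used kvzalloc, per Eric's suggestion.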
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 5/15/2018 2:19 AM, Tariq Toukan wrote: On 14/05/2018 7:41 PM, Qing Huang wrote: On 5/13/2018 2:00 AM, Tariq Toukan wrote: On 11/05/2018 10:23 PM, Qing Huang wrote: When a system is under memory presure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However in order to support smaller ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver meta data structure (no need of DMA ops). Signed-off-by: Qing Huang <qing.hu...@oracle.com> Acked-by: Daniel Jurgens <dani...@mellanox.com> Reviewed-by: Zhu Yanjun <yanjun@oracle.com> --- v2 -> v1: adjusted chunk size to reflect different architectures. 
drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..ccb62b8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in page size (default 4KB on many archs) chunks to avoid high + * order memory allocations in fragmented/high usage memory situation. */ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT, + MLX4_TABLE_CHUNK_SIZE = 1 << PAGE_SHIFT Which is actually PAGE_SIZE. Yes, we wanted to avoid high order memory allocations. Then please use PAGE_SIZE instead. PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT is actually more appropriate here. Also, please add a comma at the end of the last entry. Hmm..., followed the existing code style and checkpatch.pl didn't complain about the comma. I am in favor of having a comma also after the last element, so that when another enum element is added we do not modify this line again, which would falsely affect git blame. I know it didn't exist before your patch, but once we're here, let's do it. I'm okay either way. If adding an extra comma is preferred by many people, someone should update checkpatch.pl to enforce it. :) }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); Why not kvzalloc ? 
I think table->icm pointer array doesn't really need physically contiguous memory. Sometimes high order memory allocation by kmalloc variants may trigger slow path and cause tasks to be blocked. This is control path so it is less latency-sensitive. Let's not produce unnecessary degradation here, please call kvzalloc so we maintain a similar behavior when contiguous memory is available, and a fallback for resiliency. Not sure what degradation exactly is caused by vzalloc here. I think it's better to leave physically contiguous pages to other requests which really need them. Besides, slow path/mem compacting can be really expensive. Thanks, Qing if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); }
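As an aside, the rounding expression quoted throughout this thread, num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk, is the standard integer ceiling-division idiom (the kernel spells it DIV_ROUND_UP()). A quick sanity check of the idiom:

```python
import math

# The open-coded rounding in mlx4_init_icm_table() is ceiling
# division: one extra partial chunk is counted whenever nobj is not
# an exact multiple of obj_per_chunk.
def div_round_up(n, d):
    return (n + d - 1) // d

# Check it against math.ceil for a few representative object counts,
# including the 1G-entry case from the commit message (512 mtt
# entries per 4KB chunk).
for nobj in (1, 511, 512, 513, 1 << 30):
    assert div_round_up(nobj, 512) == math.ceil(nobj / 512)
```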
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 5/13/2018 2:00 AM, Tariq Toukan wrote: On 11/05/2018 10:23 PM, Qing Huang wrote: When a system is under memory presure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However in order to support smaller ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver meta data structure (no need of DMA ops). Signed-off-by: Qing Huang <qing.hu...@oracle.com> Acked-by: Daniel Jurgens <dani...@mellanox.com> Reviewed-by: Zhu Yanjun <yanjun@oracle.com> --- v2 -> v1: adjusted chunk size to reflect different architectures. drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..ccb62b8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. 
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high + * order memory allocations in fragmented/high usage memory situation. */ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT, + MLX4_TABLE_CHUNK_SIZE = 1 << PAGE_SHIFT Which is actually PAGE_SIZE. Yes, we wanted to avoid high order memory allocations. Also, please add a comma at the end of the last entry. Hmm..., followed the existing code style and checkpatch.pl didn't complain about the comma. }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); Why not kvzalloc ? I think table->icm pointer array doesn't really need physically contiguous memory. Sometimes high order memory allocation by kmalloc variants may trigger slow path and cause tasks to be blocked. Thanks, Qing if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); } Thanks for your patch. I need to verify there is no dramatic performance degradation here. You can prepare and send a v3 in the meanwhile. Thanks, Tariq -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V2] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragmentation), the
original 256KB ICM chunk allocations will likely trigger the kernel memory
manager to enter its slow path, doing memory compaction/migration ops in
order to complete high-order memory allocations. When that happens, user
processes calling uverbs APIs can easily get stuck for more than 120s, even
though there are a lot of free pages in smaller chunks available in the
system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With a 4KB ICM chunk size on the x86_64 arch, the above issue is fixed.

However, in order to support a smaller ICM chunk size, we need to fix
another issue in large kcalloc allocations. E.g. setting log_num_mtt=30
requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can
only hold 512 mtt entries (8 bytes for each mtt entry), so we need a 16MB
allocation for the table->icm pointer array to hold 2M pointers, which can
easily cause kcalloc to fail.

The solution is to use vzalloc in place of kcalloc. There is no need for
physically contiguous memory pages for a driver metadata structure (no DMA
ops are performed on it).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun@oracle.com>
---
v2 -> v1: adjusted chunk size to reflect different architectures.

 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= 1 << PAGE_SHIFT,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << PAGE_SHIFT
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);
 }
--
2.9.3
Re: [PATCH] mlx4_core: allocate 4KB ICM chunks
On 5/11/2018 3:27 AM, Håkon Bugge wrote: On 11 May 2018, at 01:31, Qing Huang<qing.hu...@oracle.com> wrote: When a system is under memory presure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size, the above issue is fixed. However in order to support 4KB ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver meta data structure (no need of DMA ops). Signed-off-by: Qing Huang<qing.hu...@oracle.com> Acked-by: Daniel Jurgens<dani...@mellanox.com> --- drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..2b17a4b 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in 4KB page size chunks to avoid high order memory + * allocations in fragmented/high usage memory situation. 
*/ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << 12, + MLX4_TABLE_CHUNK_SIZE = 1 << 12 Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the hw page size is 8KiB. Good point on supporting a wider range of architectures. I got tunnel vision when fixing this on our x64 lab machines. Will send a v2 patch. Thanks, Qing Thxs, Håkon }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); } -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
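Håkon's portability point is easy to observe from userspace: the hardware page size differs by architecture (4KiB on x86_64, 8KiB on SPARC per his note, and up to 64KiB on some arm64/ppc64 configurations), which is exactly why the hard-coded 1 << 12 became PAGE_SHIFT/PAGE_SIZE in the later revisions of this patch. A minimal check of the local page size:

```python
# Query the page size of the machine this runs on. mmap.PAGESIZE is
# the Python stdlib's view of the same arch-dependent constant that
# the kernel exposes as PAGE_SIZE. The value varies by architecture,
# so we only check structural properties, not a specific number.
import mmap

page = mmap.PAGESIZE
# Every supported page size is a power of two (so PAGE_SIZE really is
# 1 << PAGE_SHIFT for some integer PAGE_SHIFT).
assert page & (page - 1) == 0
page_shift = page.bit_length() - 1
assert 1 << page_shift == page
print(page, page_shift)
```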
Re: [PATCH] mlx4_core: allocate 4KB ICM chunks
Thank you for reviewing it!

On 5/10/2018 6:23 PM, Yanjun Zhu wrote:
On 2018/5/11 9:15, Qing Huang wrote:
On 5/10/2018 5:13 PM, Yanjun Zhu wrote:
On 2018/5/11 7:31, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation), the original 256KB ICM chunk allocations will likely force kernel memory management into the slow path, doing memory compaction/migration in order to complete high-order allocations. When that happens, user processes calling uverb APIs can easily get stuck for more than 120s, even though there are plenty of free pages available in smaller chunks.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds.
...

With a 4KB ICM chunk size, the above issue is fixed. However, in order to support the 4KB ICM chunk size, we need to fix another issue with large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes per mtt entry), so we need a 16MB allocation for the table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail. The solution is to replace kcalloc with vzalloc; there is no need for physically contiguous pages in a driver metadata structure (no DMA ops involved).

Hi,
Replace continuous memory pages with virtual memory — is there any performance loss?
Zhu Yanjun

Not really. "table->icm" will be accessed as individual pointer variables randomly. Kcalloc also returns a virtual address, except that its mapped pages are guaranteed to be contiguous, which provides little advantage over vzalloc for individual pointer-variable access.
Qing

Sure. Thanks. If "table->icm" will be accessed as individual pointer variables randomly, the performance loss caused by discontiguous memory will be very trivial.

Reviewed-by: Zhu Yanjun <yanjun@oracle.com>

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= 1 << 12,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << 12
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);
 }
[PATCH] mlx4_core: allocate 4KB ICM chunks
When a system is under memory pressure (high usage with fragmentation), the original 256KB ICM chunk allocations will likely force kernel memory management into the slow path, doing memory compaction/migration in order to complete high-order allocations. When that happens, user processes calling uverb APIs can easily get stuck for more than 120s, even though there are plenty of free pages available in smaller chunks.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds.
...

With a 4KB ICM chunk size, the above issue is fixed. However, in order to support the 4KB ICM chunk size, we need to fix another issue with large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes per mtt entry), so we need a 16MB allocation for the table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail. The solution is to replace kcalloc with vzalloc; there is no need for physically contiguous pages in a driver metadata structure (no DMA ops involved).

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
Acked-by: Daniel Jurgens <dani...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= 1 << 12,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << 12
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);
 }
--
2.9.3
Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
Thanks! Looks like Eran's patch will be available much sooner than the v4.18 release time frame (since v4.16 was just released). We will wait. :-)

On 04/26/2018 04:43 PM, Saeed Mahameed wrote:

Before you address my comments: it looks like Eran's work is converging and we will finalize the internal review next week, in which case submission will take 2-3 weeks from today. So I suggest to wait. Please check out:
https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=testing/mlx5e-stats&id=15ffb7f87432d073e8ac0e6b895188d40fdda4d4
Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
On 04/26/2018 02:50 PM, Saeed Mahameed wrote:

On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang <qing.hu...@oracle.com> wrote:

The current stats collection scheme in the mlx5 driver is to periodically fetch aggregated stats from all the active mlx5 software channels associated with the device. However, when an mlx5 interface is brought down (ifdown), all the channels are deactivated and closed. A new set of channels is created on the next ifup (or similar) command. Unfortunately the new channels start with all stats reset to 0, so the accumulated stats information is lost. This behavior is different from other netdev drivers, including the mlx4 driver. In order to fix it, we now save the prior mlx5 software stats into the netdev stats fields, so the accumulated stats survive multiple runs of ifdown/ifup commands and are shown correctly.

Orabug: 27548610

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---

Hi Qing,

I am adding Eran since he is currently working on a similar patch. He is also taking care of all core/ring stats to make them persistent, so you won't have discrepancies between ethtool and ifconfig stats. I am ok with this patch, but this means Eran has to work his way around it, so we have two options:

1. Temporarily accept this patch, and change it later with Eran's work.
2. Wait for Eran's work.

I am ok with either one of them, please let me know. Thanks!

Hi Saeed,

Any idea on a rough ETA for Eran's stats work to be upstream? If it will be available soon, I think we can wait a bit. If it will take a while to redesign the whole stats scheme (for both ethtool and ifconfig), maybe we can go with this incremental fix first? Thanks!

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f1fe490..5d50e69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
 		netdev_set_tc_queue(netdev, tc, nch, 0);
 }

+static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	netdev->stats.rx_packets += priv->stats.sw.rx_packets;
+	netdev->stats.rx_bytes += priv->stats.sw.rx_bytes;
+	netdev->stats.tx_packets += priv->stats.sw.tx_packets;
+	netdev->stats.tx_bytes += priv->stats.sw.tx_bytes;
+	netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
+
+	priv->stats.sw.rx_packets = 0;
+	priv->stats.sw.rx_bytes = 0;
+	priv->stats.sw.tx_packets = 0;
+	priv->stats.sw.tx_bytes = 0;
+	priv->stats.sw.tx_queue_dropped = 0;
+}
+

This means that we are now explicitly clearing channel stats on ifconfig down or switch_channels, and now after ifconfig down, ethtool will always show 0; before this patch it didn't. Anyway, the update-sw-stats function will always override them with the new channels' stats the next time we load new channels, so it is not that big of a deal.

 static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
 {
 	struct mlx5e_channel *c;
@@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
 	netif_set_real_num_tx_queues(netdev, new_num_txqs);

 	mlx5e_deactivate_priv_channels(priv);
+	mlx5e_netdev_save_stats(priv);
 	mlx5e_close_channels(&priv->channels);

 	priv->channels = *new_chs;
@@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
 	netif_carrier_off(priv->netdev);
 	mlx5e_deactivate_priv_channels(priv);
+	mlx5e_netdev_save_stats(priv);
 	mlx5e_close_channels(&priv->channels);

 	return 0;
@@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 		stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
 		stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
 	} else {
-		stats->rx_packets = sstats->rx_packets;
-		stats->rx_bytes   = sstats->rx_bytes;
-		stats->tx_packets = sstats->tx_packets;
-		stats->tx_bytes   = sstats->tx_bytes;
-		stats->tx_dropped = sstats->tx_queue_dropped;
+		stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
+		stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
+		stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
+		stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
+		stats->tx_dropped = sstats->tx_queue_dropped +
+				    dev->stats.tx_dropped;
 	}

 	stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
--
1.8.3.1
[PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
The current stats collection scheme in the mlx5 driver is to periodically fetch aggregated stats from all the active mlx5 software channels associated with the device. However, when an mlx5 interface is brought down (ifdown), all the channels are deactivated and closed. A new set of channels is created on the next ifup (or similar) command. Unfortunately the new channels start with all stats reset to 0, so the accumulated stats information is lost. This behavior is different from other netdev drivers, including the mlx4 driver. In order to fix it, we now save the prior mlx5 software stats into the netdev stats fields, so the accumulated stats survive multiple runs of ifdown/ifup commands and are shown correctly.

Orabug: 27548610

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f1fe490..5d50e69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
 		netdev_set_tc_queue(netdev, tc, nch, 0);
 }

+static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	netdev->stats.rx_packets += priv->stats.sw.rx_packets;
+	netdev->stats.rx_bytes += priv->stats.sw.rx_bytes;
+	netdev->stats.tx_packets += priv->stats.sw.tx_packets;
+	netdev->stats.tx_bytes += priv->stats.sw.tx_bytes;
+	netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
+
+	priv->stats.sw.rx_packets = 0;
+	priv->stats.sw.rx_bytes = 0;
+	priv->stats.sw.tx_packets = 0;
+	priv->stats.sw.tx_bytes = 0;
+	priv->stats.sw.tx_queue_dropped = 0;
+}
+
 static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
 {
 	struct mlx5e_channel *c;
@@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
 	netif_set_real_num_tx_queues(netdev, new_num_txqs);

 	mlx5e_deactivate_priv_channels(priv);
+	mlx5e_netdev_save_stats(priv);
 	mlx5e_close_channels(&priv->channels);

 	priv->channels = *new_chs;
@@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
 	netif_carrier_off(priv->netdev);
 	mlx5e_deactivate_priv_channels(priv);
+	mlx5e_netdev_save_stats(priv);
 	mlx5e_close_channels(&priv->channels);

 	return 0;
@@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 		stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
 		stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
 	} else {
-		stats->rx_packets = sstats->rx_packets;
-		stats->rx_bytes   = sstats->rx_bytes;
-		stats->tx_packets = sstats->tx_packets;
-		stats->tx_bytes   = sstats->tx_bytes;
-		stats->tx_dropped = sstats->tx_queue_dropped;
+		stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
+		stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
+		stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
+		stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
+		stats->tx_dropped = sstats->tx_queue_dropped +
+				    dev->stats.tx_dropped;
 	}

 	stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
--
1.8.3.1
[PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
The current stats collecting scheme in the mlx5 driver is to periodically fetch aggregated stats from all the active mlx5 software channels associated with the device. However, when an mlx5 interface is brought down (ifdown), all the channels are deactivated and closed. A new set of channels is created by the next ifup command or a similar command, and unfortunately the new channels have all stats reset to 0, so the accumulated stats information is lost. This behavior differs from other netdev drivers, including the mlx4 driver. To fix it, we now save prior mlx5 software stats into the netdev stats fields, so the accumulated stats survive multiple runs of ifdown/ifup commands and are shown correctly.

Orabug: 27548610

Signed-off-by: Qing Huang
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f1fe490..5d50e69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
 		netdev_set_tc_queue(netdev, tc, nch, 0);
 }
 
+static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	netdev->stats.rx_packets += priv->stats.sw.rx_packets;
+	netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
+	netdev->stats.tx_packets += priv->stats.sw.tx_packets;
+	netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
+	netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
+
+	priv->stats.sw.rx_packets = 0;
+	priv->stats.sw.rx_bytes = 0;
+	priv->stats.sw.tx_packets = 0;
+	priv->stats.sw.tx_bytes = 0;
+	priv->stats.sw.tx_queue_dropped = 0;
+}
+
 static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
 {
 	struct mlx5e_channel *c;
@@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
 	netif_set_real_num_tx_queues(netdev, new_num_txqs);
 
 	mlx5e_deactivate_priv_channels(priv);
+	mlx5e_netdev_save_stats(priv);
 	mlx5e_close_channels(&priv->channels);
 
 	priv->channels = *new_chs;
@@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
 	netif_carrier_off(priv->netdev);
 	mlx5e_deactivate_priv_channels(priv);
+	mlx5e_netdev_save_stats(priv);
 	mlx5e_close_channels(&priv->channels);
 
 	return 0;
@@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 		stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
 		stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
 	} else {
-		stats->rx_packets = sstats->rx_packets;
-		stats->rx_bytes   = sstats->rx_bytes;
-		stats->tx_packets = sstats->tx_packets;
-		stats->tx_bytes   = sstats->tx_bytes;
-		stats->tx_dropped = sstats->tx_queue_dropped;
+		stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
+		stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
+		stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
+		stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
+		stats->tx_dropped = sstats->tx_queue_dropped +
+				    dev->stats.tx_dropped;
 	}
 
 	stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
-- 
1.8.3.1
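The save-and-reset pattern above can be sketched as a small user-space analog (a minimal sketch; the struct and function names here are hypothetical stand-ins, not the driver's types): channel counters are folded into a persistent set before the channels are torn down, and the reporting path sums live plus saved counters.

```c
#include <stdint.h>

/* Toy stand-ins for the per-channel software counters and the
 * persistent netdev counters they get folded into. */
struct sw_stats  { uint64_t rx_packets, tx_packets; };
struct net_stats { uint64_t rx_packets, tx_packets; };

/* Analog of mlx5e_netdev_save_stats(): accumulate the soon-to-be
 * discarded channel counters into the persistent set, then zero the
 * source so the next channel set starts clean and is never counted
 * twice. */
static void save_stats(struct net_stats *net, struct sw_stats *sw)
{
    net->rx_packets += sw->rx_packets;
    net->tx_packets += sw->tx_packets;
    sw->rx_packets = 0;
    sw->tx_packets = 0;
}

/* Analog of the reporting path: live channel counters plus whatever
 * was saved across earlier ifdown/ifup cycles. */
static uint64_t total_rx(const struct net_stats *net,
                         const struct sw_stats *sw)
{
    return net->rx_packets + sw->rx_packets;
}
```

Because the source counters are zeroed in the same step that accumulates them, calling the save routine on every teardown is idempotent with respect to the reported totals.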
Re: Setting large MTU size on slave interfaces may stall the whole system
Hi Or,

On 12/13/2017 06:28 AM, Or Gerlitz wrote:
> On Tue, Dec 12, 2017 at 5:21 AM, Qing Huang <qing.hu...@oracle.com> wrote:
>> Hi,
>>
>> We found an issue with the bonding driver when testing Mellanox
>> devices. The following test commands can stall the whole system, with
>> the serial console flooded with log messages from the
>> bond_miimon_inspect() function. Setting the MTU size to 1500 seems
>> okay, but very rarely it may hit the same problem too.
>>
>>   ip address flush dev ens3f0
>>   ip link set dev ens3f0 down
>>   ip address flush dev ens3f1
>>   ip link set dev ens3f1 down
>>
>>   [root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1 updelay=500 downdelay=500
>>   [root@ca-hcl629 etc]# ifconfig bond0 up
>>   [root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
>>   [root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up
>>
>> Serial console output:
>>
>>   ** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms
>>   [..]
>>
>> It seems that when setting a large MTU size on a RoCE interface, the
>> RTNL mutex may be held too long by the slave interface, causing
>> bond_mii_monitor() to be called repeatedly at an interval of one tick
>> (1000 HZ kernel configuration) and the kernel to become unresponsive.
>
> Did you try/managed to reproduce that also with other NIC drivers?

This seems to be an issue with the bonding driver. Also, running older
kernels on the same hardware without commit de77ecd4ef02 works fine. We
can try to reproduce the issue with other types of NIC hardware.

> Or.
Setting large MTU size on slave interfaces may stall the whole system
(resend this email in text format)

Hi,

We found an issue with the bonding driver when testing Mellanox devices.
The following test commands can stall the whole system, with the serial
console flooded with log messages from the bond_miimon_inspect()
function. Setting the MTU size to 1500 seems okay, but very rarely it
may hit the same problem too.

  ip address flush dev ens3f0
  ip link set dev ens3f0 down
  ip address flush dev ens3f1
  ip link set dev ens3f1 down

  [root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1 updelay=500 downdelay=500
  [root@ca-hcl629 etc]# ifconfig bond0 up
  [root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
  [root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up

Serial console output:

  ** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms
  ** 5 printk messages dropped ** [ 3717.755737] bond0: link status down for interface ens3f0, disabling it in 500 ms
  ** 5 printk messages dropped ** [ 3717.767758] bond0: link status down for interface ens3f0, disabling it in 500 ms
  ** 4 printk messages dropped ** [ 3717.37] bond0: link status down for interface ens3f0, disabling it in 500 ms

or

  ** 4 printk messages dropped ** [274743.297863] bond0: link status down again after 500 ms for interface enp48s0f1
  ** 4 printk messages dropped ** [274743.307866] bond0: link status down again after 500 ms for interface enp48s0f1
  ** 4 printk messages dropped ** [274743.317857] bond0: link status down again after 500 ms for interface enp48s0f1
  ** 4 printk messages dropped ** [274743.327823] bond0: link status down again after 500 ms for interface enp48s0f1
  ** 4 printk messages dropped ** [274743.337817] bond0: link status down again after 500 ms for interface enp48s0f1

The root cause is the combined effect of commit
1f2cd845d3827412e82bf26dde0abca332ede402 (Revert "Merge branch
'bonding_monitor_locking'") and commit
de77ecd4ef02ca783f7762e04e92b3d0964be66b ("bonding: improve link-status
update in mii-monitoring"); e.g. after reverting the second commit, we
don't see the problem. It seems that when setting a large MTU size on a
RoCE interface, the RTNL mutex may be held too long by the slave
interface, causing bond_mii_monitor() to be called repeatedly at an
interval of one tick (1000 HZ kernel configuration) and the kernel to
become unresponsive.

We found two possible solutions.

#1, don't re-arm the mii monitor thread too quickly if we cannot get the
RTNL lock:

index b2db581..8fd587a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2266,7 +2266,6 @@ static void bond_mii_monitor(struct work_struct *work)
 
 	/* Race avoidance with bond_close cancel of workqueue */
 	if (!rtnl_trylock()) {
-		delay = 1;
 		should_notify_peers = false;
 		goto re_arm;
 	}

#2, use printk_ratelimit() to avoid flooding the log with messages
generated by bond_miimon_inspect():

index b2db581..0183b7f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2054,7 +2054,7 @@ static int bond_miimon_inspect(struct bonding *bond)
 			bond_propose_link_state(slave, BOND_LINK_FAIL);
 			commit++;
 			slave->delay = bond->params.downdelay;
-			if (slave->delay) {
+			if (slave->delay && printk_ratelimit()) {
 				netdev_info(bond->dev, "link status down for %sinterface %s, disabling it in %d ms\n",
 					    (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) ?
@@ -2105,7 +2105,8 @@ static int bond_miimon_inspect(struct bonding *bond)
 		case BOND_LINK_BACK:
 			if (!link_state) {
 				bond_propose_link_state(slave, BOND_LINK_DOWN);
-				netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
+				if (printk_ratelimit())
+					netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
 					    (bond->params.updelay - slave->delay) *
 					    bond->params.miimon,
 					    slave->dev->name);

Regarding the flooding messages, the netdev_info output is misleading
anyway when bond_mii_monitor() is called at a one-tick interval due to
lock contention. Solution #1 looks simpler and cleaner to me. Any side
effect of doing that?

Thanks,
Qing
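The effect of solution #1 can be sketched as a tiny decision helper (a hypothetical user-space model, not the driver code): before the fix, failing to take RTNL shrank the monitor's re-arm delay to a single jiffy, producing the flood; with the fix, the configured miimon interval is kept.

```c
#include <stdbool.h>

/* Hypothetical model of bond_mii_monitor()'s re-arm delay, in
 * jiffies.  "patched" selects solution #1's behavior.
 *
 * Unpatched: if rtnl_trylock() failed, re-arm after one tick, so a
 * slave holding RTNL (e.g. during a long MTU change) makes the
 * monitor spin every jiffy and flood the console.
 * Patched: keep the normal miimon interval regardless. */
static long miimon_rearm_delay(bool got_rtnl, long miimon_jiffies,
                               bool patched)
{
    if (!got_rtnl && !patched)
        return 1;               /* old behavior: retry on the next tick */
    return miimon_jiffies;      /* keep the configured interval */
}
```

With miimon=250 and a 1000 HZ kernel, the unpatched path retries roughly 250 times more often while RTNL is contended, which matches the observed log flood.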
[PATCH] IB/CM: fix memory corruption by avoiding unnecessary memset
The size of the path array can be dynamic. However, the fixed number (2)
of memset calls could cause memory corruption by writing into the wrong
memory space.

Fixes: 9fdca4da4d8c ("IB/SA: Split struct sa_path_rec based on IB and
ROCE specific fields")
Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/infiniband/core/cm.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 4c4b465..af4f6a0 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1856,7 +1856,9 @@ static int cm_req_handler(struct cm_work *work)
 	cm_process_routed_req(req_msg, work->mad_recv_wc->wc);
 
 	memset(&work->path[0], 0, sizeof(work->path[0]));
-	memset(&work->path[1], 0, sizeof(work->path[1]));
+	if (cm_req_has_alt_path(req_msg))
+		memset(&work->path[1], 0, sizeof(work->path[1]));
+
 	grh = rdma_ah_read_grh(&cm_id_priv->av.ah_attr);
 	ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
 				work->port->port_num,
@@ -3823,8 +3825,8 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent,
 	switch (mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) {
 	case CM_REQ_ATTR_ID:
-		paths = 1 + (((struct cm_req_msg *) mad_recv_wc->recv_buf.mad)->
-						alt_local_lid != 0);
+		paths = 1 + cm_req_has_alt_path(
+				(struct cm_req_msg *)mad_recv_wc->recv_buf.mad);
 		event = IB_CM_REQ_RECEIVED;
 		break;
 	case CM_MRA_ATTR_ID:
-- 
2.9.3
Re: [PATCH] ib/core: not to set page dirty bit if it's already set.
On 5/23/2017 12:42 AM, Christoph Hellwig wrote:
> On Mon, May 22, 2017 at 04:43:57PM -0700, Qing Huang wrote:
>> On 5/19/2017 6:05 AM, Christoph Hellwig wrote:
>>> On Thu, May 18, 2017 at 04:33:53PM -0700, Qing Huang wrote:
>>>> This change will optimize kernel memory deregistration operations.
>>>> __ib_umem_release() used to call set_page_dirty_lock() against every
>>>> writable page in its memory region. Its purpose is to keep data
>>>> synced between the CPU and the DMA device when swapping happens
>>>> after memory deregistration. Now we choose not to set the page dirty
>>>> bit if it's already set by the kernel prior to calling
>>>> __ib_umem_release(). This reduces memory deregistration time by half
>>>> or even more when we ran an application simulation test program.
>>>
>>> As far as I can tell this code doesn't even need set_page_dirty_lock
>>> and could just use set_page_dirty.
>>
>> It seems that set_page_dirty_lock has been used here for more than 10
>> years. Don't know the original purpose. Maybe it was used to prevent
>> races between setting dirty bits and swapping out pages?
>
> I suspect copy & paste. Or maybe I don't actually understand the
> explanation of set_page_dirty vs set_page_dirty_lock enough. But I'd
> rather not hack around the problem.

I think there are two parts here. The first part is that we don't need
to set the dirty bit if it's already set. The second part is whether we
use set_page_dirty or set_page_dirty_lock to set dirty bits.
Re: [PATCH] ib/core: not to set page dirty bit if it's already set.
On 5/19/2017 6:05 AM, Christoph Hellwig wrote:
> On Thu, May 18, 2017 at 04:33:53PM -0700, Qing Huang wrote:
>> This change will optimize kernel memory deregistration operations.
>> __ib_umem_release() used to call set_page_dirty_lock() against every
>> writable page in its memory region. Its purpose is to keep data synced
>> between the CPU and the DMA device when swapping happens after memory
>> deregistration. Now we choose not to set the page dirty bit if it's
>> already set by the kernel prior to calling __ib_umem_release(). This
>> reduces memory deregistration time by half or even more when we ran an
>> application simulation test program.
>
> As far as I can tell this code doesn't even need set_page_dirty_lock
> and could just use set_page_dirty.

It seems that set_page_dirty_lock has been used here for more than 10
years. Don't know the original purpose. Maybe it was used to prevent
races between setting dirty bits and swapping out pages?

Perhaps we can call set_page_dirty before calling ib_dma_unmap_sg?

>> Signed-off-by: Qing Huang <qing.hu...@oracle.com>
>> ---
>>  drivers/infiniband/core/umem.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index 3dbf811..21e60b1 100644
>> --- a/drivers/infiniband/core/umem.c
>> +++ b/drivers/infiniband/core/umem.c
>> @@ -58,7 +58,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>>  	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
>>  		page = sg_page(sg);
>> -		if (umem->writable && dirty)
>> +		if (!PageDirty(page) && umem->writable && dirty)
>>  			set_page_dirty_lock(page);
>>  		put_page(page);
>>  	}
>> --
>> 2.9.3
[PATCH] ib/core: not to set page dirty bit if it's already set.
This change will optimize kernel memory deregistration operations.
__ib_umem_release() used to call set_page_dirty_lock() against every
writable page in its memory region. Its purpose is to keep data synced
between the CPU and the DMA device when swapping happens after memory
deregistration. Now we choose not to set the page dirty bit if it's
already set by the kernel prior to calling __ib_umem_release(). This
reduces memory deregistration time by half or even more when we ran an
application simulation test program.

Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/infiniband/core/umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 3dbf811..21e60b1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,7 +58,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 		page = sg_page(sg);
-		if (umem->writable && dirty)
+		if (!PageDirty(page) && umem->writable && dirty)
 			set_page_dirty_lock(page);
 		put_page(page);
 	}
-- 
2.9.3
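The optimization is a check-before-expensive-call pattern that can be modeled in a few lines of user-space C (a sketch with made-up names; the real PageDirty()/set_page_dirty_lock() semantics are more subtle): skip the costly locked dirty-bit update when the flag is already set.

```c
#include <stdbool.h>

/* Toy page carrying only the flag the patch tests. */
struct fake_page { bool dirty; };

static int dirty_lock_calls; /* how often the expensive path ran */

/* Stand-in for set_page_dirty_lock(): the costly call the patch
 * tries to avoid repeating. */
static void set_page_dirty_lock_sim(struct fake_page *page)
{
    dirty_lock_calls++;
    page->dirty = true;
}

/* Analog of the patched __ib_umem_release() loop body: only take the
 * locked dirty path for writable, dirty-eligible pages that are not
 * already marked dirty. */
static void release_page(struct fake_page *page, bool writable, bool dirty)
{
    if (!page->dirty && writable && dirty)
        set_page_dirty_lock_sim(page);
    /* put_page(page) would follow here in the kernel */
}
```

On a region where most pages were already dirtied by the application, the expensive call count drops to near zero, which is consistent with the reported halving of deregistration time.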
Re: [PATCH] device probe: add self triggered delayed work request
On 08/09/2016 03:11 AM, Shamir Rabinovitch wrote:
> On Mon, Aug 08, 2016 at 05:10:05PM -0700, Qing Huang wrote:
>> Not sure if I understood your scenario. Why there is a deadlock here?
>
>   CPU0                                               | CPU1
>   ---------------------------------------------------+---------------------------------------
>   driver_deferred_probe_add                          | driver_deferred_probe_trigger_wrapper
>     mutex_lock(&deferred_probe_mutex)                |   driver_deferred_probe_trigger
>     cancel_delayed_work(&deferred_probe_trigger_work)|     mutex_lock(&deferred_probe_mutex)
>     wait for "driver_deferred_probe_trigger_wrapper" |     wait for "deferred_probe_mutex"
>
> Is this scenario possible with this patch? If yes, then CPU0 will wait
> for CPU1 to finish the delayed work with the mutex deferred_probe_mutex
> held, while CPU1 will try to finish the delayed work and will wait for
> the same mutex forever.

CPU0 will not wait for driver_deferred_probe_trigger_wrapper to finish;
it simply puts the work request onto the queue and returns.

> Qing, it seems like a deadlock scenario to me. Please say if this
> scenario is possible.
>
> BR, Shamir Rabinovitch
Re: [PATCH] device probe: add self triggered delayed work request
On 08/08/2016 03:42 AM, Shamir Rabinovitch wrote:
> Hi Qing,
>
> I suspect there is a potential dead-lock with this patch:
>
>   cpu0                                cpu1
>   ----------------------------------  ------------------------------------------------
>   driver_deferred_probe_add           deferred_probe_work_func
>   ...                                 mutex_unlock(&deferred_probe_mutex)
>   mutex_lock(&deferred_probe_mutex)   bus_probe_device(dev)
>   ...                                 device return -EPROBE_DEFER
>   ...                                 driver_deferred_probe_add
>   ...                                 mutex_lock(&deferred_probe_mutex)
>   ...                                 cancel_delayed_work(&deferred_probe_trigger_work)

Not sure if I understood your scenario. Why there is a deadlock here?

> Please confirm if this scenario is possible.
>
> BR, Shamir Rabinovitch
Re: [PATCH] device probe: add self triggered delayed work request
On 08/08/2016 01:44 PM, Frank Rowand wrote:
> On 07/29/16 22:39, Qing Huang wrote:
>> In the normal case, the device probe requests kept on the deferred
>> queue are only triggered for re-probing when another new device probe
>> finishes successfully. This change sets up a delayed trigger work
>> request if the deferred probe currently being added is the only one in
>> the queue. The delayed work request will try to reactivate any device
>> from the deferred queue for re-probing later. By doing this, if the
>> last device probed during system boot hits a deferred probe error,
>> that particular device can still be probed again.
>
> I am trying to understand the use case. Can you explain the scenario
> you are trying to fix?
>
> If I understand correctly, you expect that something will change such
> that a later probe attempt will succeed. How will that change occur,
> and why will the deferred probe list not be processed in this case?
>
> Why are you conditioning this on the deferred_probe_pending_list being
> empty?
>
> -Frank

It turns out one corner case which we worried about has already been
solved in the really_probe() function by comparing
'deferred_trigger_count' values.

Another use case we are investigating now: when we probe a device, the
main thread returns EPROBE_DEFER from the driver after we spawn a child
thread to do the actual init work. This lets us initialize multiple
similar devices at the same time. After the child thread finishes its
task, we can call driver_deferred_probe_trigger() directly from the
child thread to re-probe the device (driver_deferred_probe_trigger()
would have to be exported, though). Or we could rely on something in
this patch to re-probe the deferred devices from the pending list...
What do you suggest?
Thanks,
-Qing

> Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> Cc: Grant Likely <grant.lik...@linaro.org>
> Cc: Santosh Shilimkar <santosh.shilim...@oracle.com>
> Signed-off-by: Qing Huang <qing.hu...@oracle.com>
> ---
>  drivers/base/dd.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 16688f5..251042d 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -53,6 +53,9 @@ static LIST_HEAD(deferred_probe_pending_list);
>  static LIST_HEAD(deferred_probe_active_list);
>  static struct workqueue_struct *deferred_wq;
>  static atomic_t deferred_trigger_count = ATOMIC_INIT(0);
> +static atomic_t deferred_self_trigger_count = ATOMIC_INIT(0);
> +#define DEFERRED_SELF_TRIGGER_INTERVAL 30 /* 30 secs */
> +#define DEFERRED_SELF_TRIGGER_ATTEMPTS 5  /* 5 times */
>
>  /*
>   * In some cases, like suspend to RAM or hibernation, It might be reasonable
> @@ -115,10 +118,23 @@ static void deferred_probe_work_func(struct work_struct *work)
>  	mutex_unlock(&deferred_probe_mutex);
>  }
>  static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
> +void driver_deferred_probe_trigger_wrapper(struct work_struct *work);
> +static DECLARE_DELAYED_WORK(deferred_probe_trigger_work,
> +			    driver_deferred_probe_trigger_wrapper);
>
>  static void driver_deferred_probe_add(struct device *dev)
>  {
> +	int local_self_trigger_count;
> +
>  	mutex_lock(&deferred_probe_mutex);
> +	local_self_trigger_count = atomic_read(&deferred_self_trigger_count);
> +	if (list_empty(&deferred_probe_pending_list) &&
> +	    local_self_trigger_count <= DEFERRED_SELF_TRIGGER_ATTEMPTS) {
> +		cancel_delayed_work(&deferred_probe_trigger_work);
> +		queue_delayed_work(deferred_wq, &deferred_probe_trigger_work,
> +				   HZ * DEFERRED_SELF_TRIGGER_INTERVAL);
> +	}
> +
>  	if (list_empty(&dev->p->deferred_probe)) {
>  		dev_dbg(dev, "Added to deferred list\n");
>  		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
> @@ -202,6 +218,15 @@ void device_unblock_probing(void)
>  	driver_deferred_probe_trigger();
>  }
>
> +/*
> + * Wrapper function for delayed trigger work requests
> + */
> +void driver_deferred_probe_trigger_wrapper(struct work_struct *work)
> +{
> +	atomic_inc(&deferred_self_trigger_count);
> +	driver_deferred_probe_trigger();
> +}
> +
>  /**
>   * deferred_probe_initcall() - Enable probing of deferred devices
>   *
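The patch's arming condition, which Frank's question is about, can be isolated into a tiny pure helper (a hypothetical user-space sketch, not the patch itself): the delayed self-trigger is queued only when the device being deferred is the first entry on the pending list, and only while a bounded retry budget remains.

```c
#include <stdbool.h>

#define DEFERRED_SELF_TRIGGER_ATTEMPTS 5 /* matches the patch's cap */

/* Model of the condition guarding cancel_delayed_work() +
 * queue_delayed_work() in the patched driver_deferred_probe_add():
 * arm (or re-arm) the delayed retry only for the first pending
 * deferral, and stop once the self-trigger budget is exhausted so an
 * unprobeable device cannot re-trigger forever. */
static bool should_arm_self_trigger(bool pending_list_was_empty,
                                    int self_trigger_count)
{
    return pending_list_was_empty &&
           self_trigger_count <= DEFERRED_SELF_TRIGGER_ATTEMPTS;
}
```

Conditioning on the list being empty means a retry timer is armed exactly when no other probe activity is queued that could otherwise re-trigger the deferred devices.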
Re: [PATCH] device probe: add self triggered delayed work request
On 08/08/2016 01:44 PM, Frank Rowand wrote: On 07/29/16 22:39, Qing Huang wrote: In normal condition, the device probe requests kept in deferred queue would only be triggered for re-probing when another new device probe is finished successfully. This change will set up a delayed trigger work request if the current deferred probe being added is the only one in the queue. This delayed work request will try to reactivate any device from the deferred queue for re-probing later. By doing this, if the last device being probed in system boot process has a deferred probe error, this particular device will still be able to be probed again. I am trying to understand the use case. Can you explain the scenario you are trying to fix? If I understand correctly, you expect that something will change such that a later probe attempt will succeed. How will that change occur and why will the deferred probe list not be processed in this case? Why are you conditioning this on the deferred_probe_pending_list being empty? -Frank It turns out one corner case which we worried about has already been solved in the really_probe() function by comparing 'deferred_trigger_count' values. Another use case we are investigating now: when we probe a device, the main thread returns EPROBE_DEFER from the driver after we spawn a child thread to do the actual init work. So we can initialize multiple similar devices at the same time. After the child thread finishes its task, we can call driver_deferred_probe_trigger() directly from child thread to re-probe the device(driver_deferred_probe_trigger() has to be exported though). Or we could rely on something in this patch to re-probe the deferred devices from the pending list... What do you suggest? 
Thanks,
-Qing

> Cc: Greg Kroah-Hartman
> Cc: Grant Likely
> Cc: Santosh Shilimkar
> Signed-off-by: Qing Huang
> ---
>  drivers/base/dd.c | 25 +++++++++++++++++++++++++
>  1 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 16688f5..251042d 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -53,6 +53,9 @@ static LIST_HEAD(deferred_probe_pending_list);
>  static LIST_HEAD(deferred_probe_active_list);
>  static struct workqueue_struct *deferred_wq;
>  static atomic_t deferred_trigger_count = ATOMIC_INIT(0);
> +static atomic_t deferred_self_trigger_count = ATOMIC_INIT(0);
> +#define DEFERRED_SELF_TRIGGER_INTERVAL 30 /* 30 secs */
> +#define DEFERRED_SELF_TRIGGER_ATTEMPTS 5 /* 5 times */
>
>  /*
>   * In some cases, like suspend to RAM or hibernation, It might be reasonable
> @@ -115,10 +118,23 @@ static void deferred_probe_work_func(struct work_struct *work)
>  	mutex_unlock(&deferred_probe_mutex);
>  }
>  static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
> +void driver_deferred_probe_trigger_wrapper(struct work_struct *work);
> +static DECLARE_DELAYED_WORK(deferred_probe_trigger_work,
> +			    driver_deferred_probe_trigger_wrapper);
>
>  static void driver_deferred_probe_add(struct device *dev)
>  {
> +	int local_self_trigger_count;
> +
>  	mutex_lock(&deferred_probe_mutex);
> +	local_self_trigger_count = atomic_read(&deferred_self_trigger_count);
> +	if (list_empty(&deferred_probe_pending_list) &&
> +	    local_self_trigger_count <= DEFERRED_SELF_TRIGGER_ATTEMPTS) {
> +		cancel_delayed_work(&deferred_probe_trigger_work);
> +		queue_delayed_work(deferred_wq, &deferred_probe_trigger_work,
> +				   HZ * DEFERRED_SELF_TRIGGER_INTERVAL);
> +	}
> +
>  	if (list_empty(&dev->p->deferred_probe)) {
>  		dev_dbg(dev, "Added to deferred list\n");
>  		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
> @@ -202,6 +218,15 @@ void device_unblock_probing(void)
>  	driver_deferred_probe_trigger();
>  }
>
> +/*
> + * Wrapper function for delayed trigger work requests
> + */
> +void driver_deferred_probe_trigger_wrapper(struct work_struct *work)
> +{
> +	atomic_inc(&deferred_self_trigger_count);
> +	driver_deferred_probe_trigger();
> +}
> +
>  /**
>   * deferred_probe_initcall() - Enable probing of deferred devices
>   *
[PATCH] device probe: add self triggered delayed work request
Under normal conditions, the device probe requests kept on the deferred
queue are only triggered for re-probing when another device probe
finishes successfully. This change sets up a delayed trigger work
request when the deferred probe being added is the only one in the
queue. The delayed work request will later reactivate the devices on the
deferred queue for re-probing. This way, if the last device probed
during system boot defers its probe, it will still get probed again.

Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Grant Likely <grant.lik...@linaro.org>
Cc: Santosh Shilimkar <santosh.shilim...@oracle.com>
Signed-off-by: Qing Huang <qing.hu...@oracle.com>
---
 drivers/base/dd.c | 25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 16688f5..251042d 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -53,6 +53,9 @@ static LIST_HEAD(deferred_probe_pending_list);
 static LIST_HEAD(deferred_probe_active_list);
 static struct workqueue_struct *deferred_wq;
 static atomic_t deferred_trigger_count = ATOMIC_INIT(0);
+static atomic_t deferred_self_trigger_count = ATOMIC_INIT(0);
+#define DEFERRED_SELF_TRIGGER_INTERVAL 30 /* 30 secs */
+#define DEFERRED_SELF_TRIGGER_ATTEMPTS 5 /* 5 times */

 /*
  * In some cases, like suspend to RAM or hibernation, It might be reasonable
@@ -115,10 +118,23 @@ static void deferred_probe_work_func(struct work_struct *work)
 	mutex_unlock(&deferred_probe_mutex);
 }
 static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
+void driver_deferred_probe_trigger_wrapper(struct work_struct *work);
+static DECLARE_DELAYED_WORK(deferred_probe_trigger_work,
+			    driver_deferred_probe_trigger_wrapper);

 static void driver_deferred_probe_add(struct device *dev)
 {
+	int local_self_trigger_count;
+
 	mutex_lock(&deferred_probe_mutex);
+	local_self_trigger_count = atomic_read(&deferred_self_trigger_count);
+	if (list_empty(&deferred_probe_pending_list) &&
+	    local_self_trigger_count <= DEFERRED_SELF_TRIGGER_ATTEMPTS) {
+		cancel_delayed_work(&deferred_probe_trigger_work);
+		queue_delayed_work(deferred_wq, &deferred_probe_trigger_work,
+				   HZ * DEFERRED_SELF_TRIGGER_INTERVAL);
+	}
+
 	if (list_empty(&dev->p->deferred_probe)) {
 		dev_dbg(dev, "Added to deferred list\n");
 		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
@@ -202,6 +218,15 @@ void device_unblock_probing(void)
 	driver_deferred_probe_trigger();
 }

+/*
+ * Wrapper function for delayed trigger work requests
+ */
+void driver_deferred_probe_trigger_wrapper(struct work_struct *work)
+{
+	atomic_inc(&deferred_self_trigger_count);
+	driver_deferred_probe_trigger();
+}
+
 /**
  * deferred_probe_initcall() - Enable probing of deferred devices
  *
-- 
1.7.1