Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On Thu 31-05-18 11:10:22, Michal Hocko wrote:
> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> [...]
> > > I merely copied/pasted from alloc_skb_with_frags() :/
> >
> > I will have a look at it. Thanks!
>
> OK, so this is an example of an incremental development ;).
>
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to prevent from OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant. Should I send a patch?

Just in case you are interested
---
From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 4 Jun 2018 15:07:37 +0200
Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags

alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
which is just a noop and a little bit confusing.

__GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
high order allocations") to prevent from the OOM killer. Yet this was
not enough because fb05e7a89f50 ("net: don't wait for order-3 page
allocation") didn't want an excessive reclaim for non-costly orders
so it made it completely NOWAIT while it preserved __GFP_NORETRY in
place which is now redundant.

Drop the pointless __GFP_NORETRY because this function is used as
copy source for other places.

Signed-off-by: Michal Hocko
---
 net/core/skbuff.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 857e4e6f751a..c1f22adc30de 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5239,8 +5239,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 			if (npages >= 1 << order) {
 				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
 						   __GFP_COMP |
-						   __GFP_NOWARN |
-						   __GFP_NORETRY,
+						   __GFP_NOWARN,
 						   order);
 				if (page)
 					goto fill_page;
-- 
2.17.0

-- 
Michal Hocko
SUSE Labs
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On 5/31/2018 2:10 AM, Michal Hocko wrote:
> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> [...]
> > > I merely copied/pasted from alloc_skb_with_frags() :/
> >
> > I will have a look at it. Thanks!
>
> OK, so this is an example of an incremental development ;).
>
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to prevent from OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant. Should I send a patch?

Just curious, how about GFP_ATOMIC flag? Would it work in a similar fashion?
We experimented with it a bit in the past but it seemed to cause other issue
in our tests. :-)

By the way, we didn't encounter any OOM killer events. It seemed that the
mlx4_alloc_icm() triggered slowpath. We still had about 2GB free memory while
it was highly fragmented.

 #0 [8801f308b380] remove_migration_pte at 811f0e0b
 #1 [8801f308b3e0] rmap_walk_file at 811cb890
 #2 [8801f308b440] rmap_walk at 811cbaf2
 #3 [8801f308b450] remove_migration_ptes at 811f0db0
 #4 [8801f308b490] __unmap_and_move at 811f2ea6
 #5 [8801f308b4e0] unmap_and_move at 811f2fc5
 #6 [8801f308b540] migrate_pages at 811f3219
 #7 [8801f308b5c0] compact_zone at 811b707e
 #8 [8801f308b650] compact_zone_order at 811b735d
 #9 [8801f308b6e0] try_to_compact_pages at 811b7485
#10 [8801f308b770] __alloc_pages_direct_compact at 81195f96
#11 [8801f308b7b0] __alloc_pages_slowpath at 811978a1
#12 [8801f308b890] __alloc_pages_nodemask at 81197ec1
#13 [8801f308b970] alloc_pages_current at 811e261f
#14 [8801f308b9e0] mlx4_alloc_icm at a01f39b2 [mlx4_core]

Thanks!
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On 05/29/2018 11:44 PM, Eric Dumazet wrote:
>
> And I will add this simple fix, this really should address your initial
> concern much better.
>
> @@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
>  {
>  	struct page *page;
>
> +	if (order)
> +		gfp_mask |= __GFP_NORETRY;

and also gfp_mask &= ~__GFP_DIRECT_RECLAIM

>  	page = alloc_pages_node(node, gfp_mask, order);
>  	if (!page) {
>  		page = alloc_pages(gfp_mask, order);
>
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On 05/29/2018 11:34 PM, Eric Dumazet wrote:
> I will test :
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 685337d58276fc91baeeb64387c52985e1bc6dda..4d2a71381acb739585d662175e86caef72338097 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,13 @@
>  #include "fw.h"
>
>  /*
> - * We allocate in page size (default 4KB on many archs) chunks to avoid high
> - * order memory allocations in fragmented/high usage memory situation.
> + * We allocate in as big chunks as we can, up to a maximum of 256 KB
> + * per chunk. Note that the chunks are not necessarily in contiguous
> + * physical memory.
>   */
>  enum {
> -	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
> -	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
> +	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
> +	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
>  };
>
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
>

And I will add this simple fix, this really should address your initial
concern much better.

@@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
 {
 	struct page *page;
 
+	if (order)
+		gfp_mask |= __GFP_NORETRY;
 	page = alloc_pages_node(node, gfp_mask, order);
 	if (!page) {
 		page = alloc_pages(gfp_mask, order);
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On 05/25/2018 10:23 AM, David Miller wrote:
> From: Qing Huang
> Date: Wed, 23 May 2018 16:22:46 -0700
>
>> When a system is under memory pressure (high usage with fragments),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter slow path doing memory compact/migration
>> ops in order to complete high order memory allocations.
>>
>> When that happens, user processes calling uverb APIs may get stuck
>> for more than 120s easily even though there are a lot of free pages
>> in smaller chunks available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>>
>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>
>> However in order to support smaller ICM chunk size, we need to fix
>> another issue in large size kcalloc allocations.
>>
>> E.g.
>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
>> entry). So we need a 16MB allocation for a table->icm pointer array to
>> hold 2M pointers which can easily cause kcalloc to fail.
>>
>> The solution is to use kvzalloc to replace kcalloc which will fall back
>> to vmalloc automatically if kmalloc fails.
>>
>> Signed-off-by: Qing Huang
>> Acked-by: Daniel Jurgens
>> Reviewed-by: Zhu Yanjun
>
> Applied, thanks.
>

I must say this patch causes regressions here.

KASAN is not happy.

It looks like you guys did not really look at mlx4_alloc_icm().

This function is properly handling high order allocations with fallbacks to
order-0 pages under high memory pressure.

BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
Read of size 4 at addr 8817df584f68 by task qp_listing_test/92585

CPU: 38 PID: 92585 Comm: qp_listing_test Tainted: G O
Call Trace:
 [] dump_stack+0x4d/0x72
 [] print_address_description+0x6f/0x260
 [] kasan_report+0x257/0x370
 [] __asan_report_load4_noabort+0x19/0x20
 [] to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
 [] mlx4_ib_query_qp+0x1213/0x1660 [mlx4_ib]
 [] qpstat_print_qp+0x13b/0x500 [ib_uverbs]
 [] qpstat_seq_show+0x4a/0xb0 [ib_uverbs]
 [] seq_read+0xa9c/0x1230
 [] proc_reg_read+0xc1/0x180
 [] __vfs_read+0xe8/0x730
 [] vfs_read+0xf7/0x300
 [] SyS_read+0xd2/0x1b0
 [] do_syscall_64+0x186/0x420
 [] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f851a7bb30d
RSP: 002b:7ffd09a758c0 EFLAGS: 0293 ORIG_RAX:
RAX: ffda RBX: 7f84ff959440 RCX: 7f851a7bb30d
RDX: 0003fc00 RSI: 7f84ff60a000 RDI: 000b
RBP: 7ffd09a75900 R08: R09:
R10: 0022 R11: 0293 R12:
R13: 0003 R14: 0003 R15: 7f84ff60a000

Allocated by task 4488:
 save_stack+0x46/0xd0
 kasan_kmalloc+0xad/0xe0
 __kmalloc+0x101/0x5e0
 ib_register_device+0xc03/0x1250 [ib_core]
 mlx4_ib_add+0x27d6/0x4dd0 [mlx4_ib]
 mlx4_add_device+0xa9/0x340 [mlx4_core]
 mlx4_register_interface+0x16e/0x390 [mlx4_core]
 xhci_pci_remove+0x7a/0x180 [xhci_pci]
 do_one_initcall+0xa0/0x230
 do_init_module+0x1b9/0x5a4
 load_module+0x63e6/0x94c0
 SYSC_init_module+0x1a4/0x1c0
 SyS_init_module+0xe/0x10
 do_syscall_64+0x186/0x420
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Freed by task 0:
(stack is not available)

The buggy address belongs to the object at 8817df584f40
 which belongs to the cache kmalloc-32 of size 32
The buggy address is located 8 bytes to the right of
 32-byte region [8817df584f40, 8817df584f60)
The buggy address belongs to the page:
page:ea005f7d6100 count:1 mapcount:0 mapping:8817df584000 index:0x8817df584fc1
flags: 0x8800100(slab)
raw: 08800100 8817df584000 8817df584fc1 0001003f
raw: ea005f3ac0a0 ea005c476760 8817fec00900 883ff78d26c0
page dumped because: kasan: bad access detected
page->mem_cgroup:883ff78d26c0

Memory state around the buggy address:
 8817df584e00: 00 03 fc fc fc fc fc fc 00 03 fc fc fc fc fc fc
 8817df584e80: 00 00 00 04 fc fc fc fc 00 00 00 fc fc fc fc fc
>8817df584f00: fb fb fb fb fc fc fc fc 00 00 00 00 fc fc fc fc
                                                      ^
 8817df584f80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
 8817df585000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

I will test :

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 685337d58276fc91baeeb64387c52985e1bc6dda..4d2a71381acb739585d662175e86caef72338097 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,13 @@
 #include "fw.h"
 
 /*
- * We allocate in page size (default 4KB on many archs) chunks to avoid high
- * order memory allocations in fragmented/high usage memory situation.
+ * We allocate in as big
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Qing Huang
Date: Wed, 23 May 2018 16:22:46 -0700

> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
>
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
>
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
>
> Signed-off-by: Qing Huang
> Acked-by: Daniel Jurgens
> Reviewed-by: Zhu Yanjun

Applied, thanks.
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On 24/05/2018 2:22 AM, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
>
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
>
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
>
> Signed-off-by: Qing Huang
> Acked-by: Daniel Jurgens
> Reviewed-by: Zhu Yanjun
> ---
> v4: use kvzalloc instead of vzalloc
>     add one err condition check
>     don't include vmalloc.h any more
>
> v3: use PAGE_SIZE instead of PAGE_SHIFT
>     add comma to the end of enum variables
>     include vmalloc.h header file to avoid build issues on Sparc
>
> v2: adjusted chunk size to reflect different architectures
>
>  drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..685337d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
>  #include "fw.h"
>
>  /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
> + * order memory allocations in fragmented/high usage memory situation.
>   */
>  enum {
> -	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
> -	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
> +	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
> +	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
>  };
>
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>  	u64 size;
>
>  	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
> +	if (WARN_ON(!obj_per_chunk))
> +		return -EINVAL;
>  	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>
> -	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> +	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
>  	if (!table->icm)
>  		return -ENOMEM;
>  	table->virt = virt;
> @@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>  			mlx4_free_icm(dev, table->icm[i], use_coherent);
>  		}
>
> -	kfree(table->icm);
> +	kvfree(table->icm);
>
>  	return -ENOMEM;
>  }
> @@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
>  			mlx4_free_icm(dev, table->icm[i], table->coherent);
>  		}
>
> -	kfree(table->icm);
> +	kvfree(table->icm);
>  }

Thanks Qing.

Reviewed-by: Tariq Toukan
Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
On Thu, May 24, 2018 at 1:22 AM, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
>
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
>
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.

Hi,

Could you please explain why it first tries to allocate contiguous pages?
I think a comment is needed on why it uses kvzalloc instead of vzalloc.

>
> Signed-off-by: Qing Huang
> Acked-by: Daniel Jurgens
> Reviewed-by: Zhu Yanjun
+Reviewed-by: Gioh Kim
> ---
> v4: use kvzalloc instead of vzalloc
>     add one err condition check
>     don't include vmalloc.h any more
>
> v3: use PAGE_SIZE instead of PAGE_SHIFT
>     add comma to the end of enum variables
>     include vmalloc.h header file to avoid build issues on Sparc
>
> v2: adjusted chunk size to reflect different architectures
>
>  drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c
> b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..685337d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
>  #include "fw.h"
>
>  /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
> + * order memory allocations in fragmented/high usage memory situation.
>   */
>  enum {
> -	MLX4_ICM_ALLOC_SIZE = 1 << 18,
> -	MLX4_TABLE_CHUNK_SIZE = 1 << 18
> +	MLX4_ICM_ALLOC_SIZE = PAGE_SIZE,
> +	MLX4_TABLE_CHUNK_SIZE = PAGE_SIZE,
>  };
>
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk
> *chunk)
> @@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct
> mlx4_icm_table *table,
>  	u64 size;
>
>  	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
> +	if (WARN_ON(!obj_per_chunk))
> +		return -EINVAL;
>  	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>
> -	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> +	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
>  	if (!table->icm)
>  		return -ENOMEM;
>  	table->virt = virt;
> @@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct
> mlx4_icm_table *table,
>  			mlx4_free_icm(dev, table->icm[i], use_coherent);
>  		}
>
> -	kfree(table->icm);
> +	kvfree(table->icm);
>
>  	return -ENOMEM;
>  }
> @@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct
> mlx4_icm_table *table)
>  			mlx4_free_icm(dev, table->icm[i], table->coherent);
>  		}
>
> -	kfree(table->icm);
> +	kvfree(table->icm);
>  }
> --
> 2.9.3
>

--
GIOH KIM
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel: +49 176 2697 8962
Fax: +49 30 577 008 299
Email: gi-oh@profitbricks.com
URL: https://www.profitbricks.de

Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing directors: Achim Weiss, Matthias Steinberg, Christoph Steffens
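On the kvzalloc versus vzalloc question: as far as I understand it, kvzalloc first attempts a kmalloc-style, physically contiguous allocation and only falls back to vmalloc when that attempt fails, so small pointer arrays keep the cheaper allocation while very large ones can still succeed. Below is a userspace sketch of that try-then-fall-back pattern only. It is not the kernel implementation: alloc_with_fallback, contig_alloc_sim, virt_alloc_sim and the 64 KiB failure threshold are hypothetical stand-ins chosen for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator signature used by this sketch. */
typedef void *(*alloc_fn)(size_t size);

/*
 * Try the preferred allocator first and fall back to the second one
 * only when the first returns NULL. *used_fallback reports which path
 * was taken (0 = preferred, 1 = fallback).
 */
static void *alloc_with_fallback(size_t size, alloc_fn preferred,
				 alloc_fn fallback, int *used_fallback)
{
	void *p = preferred(size);

	*used_fallback = 0;
	if (p)
		return p;

	*used_fallback = 1;
	return fallback(size);
}

/*
 * Stand-in for a contiguous allocator: ASSUMPTION for illustration,
 * requests above 64 KiB are treated as failing, to mimic a fragmented
 * system. Memory is zeroed, as with the kernel's *zalloc variants.
 */
static void *contig_alloc_sim(size_t size)
{
	if (size > 64 * 1024)
		return NULL;
	return calloc(1, size);
}

/* Stand-in for a virtually contiguous allocator: always tries calloc. */
static void *virt_alloc_sim(size_t size)
{
	return calloc(1, size);
}
```

With these stand-ins, a 4 KiB request is served by the preferred path, while a 16 MiB request (the size of the table->icm array in the log_num_mtt=30 example) takes the fallback path. Either way the caller frees the memory through one call, which is the role kvfree plays in the patch.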
[PATCH V4] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.

Signed-off-by: Qing Huang
Acked-by: Daniel Jurgens
Reviewed-by: Zhu Yanjun
---
v4: use kvzalloc instead of vzalloc
    add one err condition check
    don't include vmalloc.h any more

v3: use PAGE_SIZE instead of PAGE_SHIFT
    add comma to the end of enum variables
    include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..685337d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE = 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE = 1 << 18
+	MLX4_ICM_ALLOC_SIZE = PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE = PAGE_SIZE,
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u64 size;

 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
+	if (WARN_ON(!obj_per_chunk))
+		return -EINVAL;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 		}

-	kfree(table->icm);
+	kvfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 		}

-	kfree(table->icm);
+	kvfree(table->icm);
 }
--
2.9.3
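For anyone wanting to check the sizing argument in the commit message (512 mtt entries per 4KB chunk, 2M pointers, 16MB array), the arithmetic can be reproduced in a small userspace sketch. It mirrors the obj_per_chunk and num_icm computations from mlx4_init_icm_table, but the struct icm_sizing type, the icm_sizing_calc helper and the explicit ptr_size parameter are hypothetical names introduced here for illustration. An 8-byte pointer is an assumption that holds on 64-bit builds.

```c
#include <stdint.h>

/* Hypothetical result type: sizing of the table->icm pointer array. */
struct icm_sizing {
	uint64_t obj_per_chunk;	/* objects that fit in one ICM chunk */
	uint64_t num_icm;	/* number of chunks, rounded up */
	uint64_t array_bytes;	/* bytes needed for the pointer array */
};

/*
 * Returns 0 on success, or -1 when an object does not fit in a single
 * chunk (the case the patch guards with WARN_ON(!obj_per_chunk)) or
 * when obj_size is zero.
 */
static int icm_sizing_calc(uint64_t nobj, uint64_t obj_size,
			   uint64_t chunk_size, uint64_t ptr_size,
			   struct icm_sizing *out)
{
	if (obj_size == 0)
		return -1;

	out->obj_per_chunk = chunk_size / obj_size;
	if (out->obj_per_chunk == 0)
		return -1;

	/* Same round-up division as in mlx4_init_icm_table. */
	out->num_icm = (nobj + out->obj_per_chunk - 1) / out->obj_per_chunk;
	out->array_bytes = out->num_icm * ptr_size;
	return 0;
}
```

Calling icm_sizing_calc(1ULL << 30, 8, 4096, 8, &s) gives 512 objects per chunk, 2097152 chunks and a 16 MiB pointer array, matching the commit message. With the old 256KB chunk size the same table needs only 32768 pointers (256 KiB), which is why the large kcalloc was not a problem before the chunk size was reduced.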