Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 21/11/2018 16:57, Will Deacon wrote:
> On Wed, Nov 21, 2018 at 04:47:48PM +, John Garry wrote:
> > On 21/11/2018 16:07, Will Deacon wrote:
> > > On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> > > > From: Ganapatrao Kulkarni
> > > >
> > > > Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> > > > from respective device NUMA node. The ternary operator which would
> > > > be for alloc_pages_node() is tidied along with this.
> > > >
> > > > We also include a change to use kvzalloc() for kzalloc()/vzalloc()
> > > > combination.
> > > >
> > > > Signed-off-by: Ganapatrao Kulkarni
> > > > [JPG: Added kvzalloc(), drop pages ** being device local, tidied
> > > > ternary operator]
> > > > Signed-off-by: John Garry
> > >
> > > Weird, you're missing a diffstat here.
> > >
> > > Anyway, the patch looks fine to me, but it would be nice if you could
> > > justify the change with some numbers. Do you actually see an
> > > improvement from this change?
> >
> > Hi Will,
> >
> > Ah, I missed adding my comments explaining the motivation. It would be
> > better in the commit log. Anyway, here's the snippet:
> >
> > " ... as mentioned in [3], dma_alloc_coherent() uses the locality
> > information from the device - as in direct DMA - so this patch is just
> > applying this same policy."
> >
> > [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html
>
> Yes, please add this to the commit log.
>
> > Sure, I did have some numbers to show improvement in some scenarios
> > when I tested this a while back, which I'll dig out.
> >
> > However I would say that some scenarios will improve and the opposite
> > for others with this change, considering the different conditions in
> > which DMA memory may be used.
>
> Well, if you can show that it's useful in some cases and not catastrophic
> in others, then I think shooting for parity with direct DMA is a
> reasonable justification for the change.

So I have done some more testing with our SoC crypto engine, using the
tcrypt module. The reason I used this device is that we can utilise a
local device per socket in the system, unlike other DMAing devices, which
generally only exist on a single socket (for us, anyway).
Overall the results aren't brilliant - as expected - but show a general
marginal improvement. Here are some figures:

Summary:
Average diff  +0.9%
Max diff      +1.5%
Min diff      +0.2%

Test                                      Ops/second        diff %
                                          before    after
async ecb(aes) encrypt
test 0 (128 bit key, 16 byte blocks):     68717     69057   0.5
test 1 (128 bit key, 64 byte blocks):     72633     73163   0.7
test 2 (128 bit key, 256 byte blocks):    71475     72076   0.8
test 3 (128 bit key, 1024 byte blocks):   66819     67467   1.0
test 4 (128 bit key, 8192 byte blocks):   38237     38495   0.7
test 5 (192 bit key, 16 byte blocks):     70273     71079   1.2
test 6 (192 bit key, 64 byte blocks):     72455     73292   1.2
test 7 (192 bit key, 256 byte blocks):    71085     71876   1.1
test 8 (192 bit key, 1024 byte blocks):   65891     66576   1.0
test 9 (192 bit key, 8192 byte blocks):   34846     35061   0.6
test 10 (256 bit key, 16 byte blocks):    72927     73762   1.2
test 11 (256 bit key, 64 byte blocks):    72361     73207   1.2
test 12 (256 bit key, 256 byte blocks):   70907     71602   1.0
test 13 (256 bit key, 1024 byte blocks):  65035     65653   1.0
test 14 (256 bit key, 8192 byte blocks):  32835     32998   0.5

async ecb(aes) decrypt
test 0 (128 bit key, 16 byte blocks):     68384     69130   1.1
test 1 (128 bit key, 64 byte blocks):     72645     73133   0.7
test 2 (128 bit key, 256 byte blocks):    71523     71912   0.5
test 3 (128 bit key, 1024 byte blocks):   66902     67258   0.5
test 4 (128 bit key, 8192 byte blocks):   38260     38434   0.5
test 5 (192 bit key, 16 byte blocks):     70301     70816   0.7
test 6 (192 bit key, 64 byte blocks):     72473     73064   0.8
test 7 (192 bit key, 256 byte blocks):    71106     71663   0.8
test 8 (192 bit key, 1024 byte blocks):   65915     66363   0.7
test 9 (192 bit key, 8192 byte blocks):   34876     35006   0.4
test 10 (256 bit key, 16 byte blocks):    72969     73519   0.8
test 11 (256 bit key, 64 byte blocks):    72404     72925   0.7
test 12 (256 bit key, 256 byte blocks):   70861     71350   0.7
test 13 (256 bit key, 1024 byte blocks):  65074     65485   0.6
test 14 (256 bit key, 8192 byte blocks):  32861     32974   0.3

async cbc(aes) encrypt
test 0 (128 bit key, 16 byte blocks):     58306     59131   1.4
test 1 (128 bit key, 64 byte blocks):     61647     62565   1.5
test 2 (128 bit key, 256 byte blocks):    60841     61666   1.4
test 3 (128 bit key, 1024 byte blocks):   57503     58204   1.2
test 4 (128 bit key, 8192 byte blocks):   34760     35055   0.9
test 5 (192 bit key, 16 byte blocks):     59684     60452   1.3
test 6 (192 bit key, 64 byte blocks):     61705     62516   1.3
test 7 (192 bit key, 256 byte blocks):    60678     61426   1.2
test 8 (192 bit key, 1024 byte blocks):   56805     57487   1.2
test 9 (192 bit key, 8192 byte blocks):   31836     32093   0.8
test 10 (256 bit key, 16 byte blocks):    61961     62785   1.3
test 11 (256 bit key, 64 byte blocks):    61584     62427   1.4
test 12 (256 bit key, 256 byte blocks):   60407     61246   1.4
test 13 (256 bit key, 1024
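The policy under discussion (allocate on the device's NUMA node, with a sane fallback when the device reports no node) mirrors what alloc_pages_node() does when given NUMA_NO_NODE: it falls back to allocating near the calling CPU. A minimal userspace model of that node-selection rule, using hypothetical names (resolve_node(), local_node - these are not kernel functions):

```c
#include <assert.h>

/* Userspace sketch of the node-selection policy, not kernel code.
 * dev_to_node() returns NUMA_NO_NODE (-1) for a device with no NUMA
 * affinity; alloc_pages_node() then falls back to the caller's node. */
#define NUMA_NO_NODE (-1)

/* resolve_node() is a hypothetical helper: prefer the device-local
 * node when known, otherwise allocate near the calling CPU. */
static int resolve_node(int dev_node, int local_node)
{
	return dev_node == NUMA_NO_NODE ? local_node : dev_node;
}
```

With this rule, a device on node 1 always gets node-1 memory regardless of which CPU issues the allocation, which is the "parity with direct DMA" behaviour the thread is aiming for.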
Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On Wed, Nov 21, 2018 at 04:47:48PM +, John Garry wrote:
> On 21/11/2018 16:07, Will Deacon wrote:
> > On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> > > From: Ganapatrao Kulkarni
> > >
> > > Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> > > from respective device NUMA node. The ternary operator which would
> > > be for alloc_pages_node() is tidied along with this.
> > >
> > > We also include a change to use kvzalloc() for kzalloc()/vzalloc()
> > > combination.
> > >
> > > Signed-off-by: Ganapatrao Kulkarni
> > > [JPG: Added kvzalloc(), drop pages ** being device local, tidied
> > > ternary operator]
> > > Signed-off-by: John Garry
> >
> > Weird, you're missing a diffstat here.
> >
> > Anyway, the patch looks fine to me, but it would be nice if you could
> > justify the change with some numbers. Do you actually see an
> > improvement from this change?
>
> Hi Will,
>
> Ah, I missed adding my comments explaining the motivation. It would be
> better in the commit log. Anyway, here's the snippet:
>
> " ... as mentioned in [3], dma_alloc_coherent() uses the locality
> information from the device - as in direct DMA - so this patch is just
> applying this same policy."
>
> [3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html

Yes, please add this to the commit log.

> I did have some numbers to show improvement in some scenarios when I
> tested this a while back, which I'll dig out.
>
> However I would say that some scenarios will improve and the opposite
> for others with this change, considering the different conditions in
> which DMA memory may be used.

Well, if you can show that it's useful in some cases and not catastrophic
in others, then I think shooting for parity with direct DMA is a
reasonable justification for the change.

Will

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On 21/11/2018 16:07, Will Deacon wrote:
> On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> > From: Ganapatrao Kulkarni
> >
> > Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> > from respective device NUMA node. The ternary operator which would
> > be for alloc_pages_node() is tidied along with this.
> >
> > We also include a change to use kvzalloc() for kzalloc()/vzalloc()
> > combination.
> >
> > Signed-off-by: Ganapatrao Kulkarni
> > [JPG: Added kvzalloc(), drop pages ** being device local, tidied
> > ternary operator]
> > Signed-off-by: John Garry
>
> Weird, you're missing a diffstat here.
>
> Anyway, the patch looks fine to me, but it would be nice if you could
> justify the change with some numbers. Do you actually see an improvement
> from this change?

Hi Will,

Ah, I missed adding my comments explaining the motivation. It would be
better in the commit log. Anyway, here's the snippet:

" ... as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy."

[3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html

I did have some numbers to show improvement in some scenarios when I
tested this a while back, which I'll dig out.

However I would say that some scenarios will improve and the opposite
for others with this change, considering the different conditions in
which DMA memory may be used.

Cheers,
John
Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> From: Ganapatrao Kulkarni
>
> Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> from respective device NUMA node. The ternary operator which would
> be for alloc_pages_node() is tidied along with this.
>
> We also include a change to use kvzalloc() for kzalloc()/vzalloc()
> combination.
>
> Signed-off-by: Ganapatrao Kulkarni
> [JPG: Added kvzalloc(), drop pages ** being device local, tidied
> ternary operator]
> Signed-off-by: John Garry

Weird, you're missing a diffstat here.

Anyway, the patch looks fine to me, but it would be nice if you could
justify the change with some numbers. Do you actually see an improvement
from this change?

Will
[PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()
From: Ganapatrao Kulkarni

Change function __iommu_dma_alloc_pages() to allocate pages for DMA
from respective device NUMA node. The ternary operator which would
be for alloc_pages_node() is tidied along with this.

We also include a change to use kvzalloc() for kzalloc()/vzalloc()
combination.

Signed-off-by: Ganapatrao Kulkarni
[JPG: Added kvzalloc(), drop pages ** being device local, tidied
ternary operator]
Signed-off-by: John Garry
---
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..4afb1a8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
 	kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-		unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+		unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
 	struct page **pages;
-	unsigned int i = 0, array_size = count * sizeof(*pages);
+	unsigned int i = 0, nid = dev_to_node(dev);
 
 	order_mask &= (2U << MAX_ORDER) - 1;
 	if (!order_mask)
 		return NULL;
 
-	if (array_size <= PAGE_SIZE)
-		pages = kzalloc(array_size, GFP_KERNEL);
-	else
-		pages = vzalloc(array_size);
+	pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
 	if (!pages)
 		return NULL;
 
@@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
 	for (order_mask &= (2U << __fls(count)) - 1;
 	     order_mask; order_mask &= ~order_size) {
 		unsigned int order = __fls(order_mask);
+		gfp_t alloc_flags = gfp;
 
 		order_size = 1U << order;
-		page = alloc_pages((order_mask - order_size) ?
-				gfp | __GFP_NORETRY : gfp, order);
+		if (order_mask > order_size)
+			alloc_flags |= __GFP_NORETRY;
+		page = alloc_pages_node(nid, alloc_flags, order);
 		if (!page)
 			continue;
 		if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 		alloc_sizes = min_size;
 
 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+			gfp);
 	if (!pages)
 		return NULL;
-- 
1.9.1
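For readers following the second hunk: the loop the patch touches splits an allocation of `count` pages into the largest chunks permitted by `order_mask`, falling back to smaller orders as the remainder shrinks. A simplified userspace model of just that order-selection arithmetic, under assumed names (fls_u(), split_into_orders() - not kernel functions), ignoring allocation failure and the __GFP_NORETRY fallback:

```c
#include <assert.h>

/* Index of the highest set bit, a stand-in for the kernel's __fls();
 * x must be non-zero. */
static unsigned int fls_u(unsigned long x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

/* Model of the splitting loop in __iommu_dma_alloc_pages(): each pass
 * restricts order_mask to orders not exceeding the remaining count
 * (mirroring "order_mask &= (2U << __fls(count)) - 1"), then takes the
 * highest remaining order. Assumes order_mask includes order 0, as the
 * caller guarantees in the kernel. Writes chunk sizes (in pages) to
 * out[]; returns the number of chunks. */
static int split_into_orders(unsigned int count, unsigned long order_mask,
			     unsigned int *out, int max)
{
	int n = 0;

	while (count && n < max) {
		unsigned long mask = order_mask & ((2UL << fls_u(count)) - 1);
		unsigned int order = fls_u(mask);
		unsigned int order_size = 1U << order;

		out[n++] = order_size;
		count -= order_size;
	}
	return n;
}
```

For example, 13 pages with orders 0-3 allowed split as 8 + 4 + 1; this highest-order-first behaviour is unchanged by the patch, which only redirects each chunk request to the device's node via alloc_pages_node().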