Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-23 Thread John Garry

On 21/11/2018 16:57, Will Deacon wrote:
> On Wed, Nov 21, 2018 at 04:47:48PM +0000, John Garry wrote:
> > On 21/11/2018 16:07, Will Deacon wrote:
> > > On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> > > > From: Ganapatrao Kulkarni 
> > > >
> > > > Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> > > > from the respective device's NUMA node. The ternary operator which
> > > > would be needed for alloc_pages_node() is tidied up along with this.
> > > >
> > > > We also include a change to use kvzalloc() in place of the
> > > > kzalloc()/vzalloc() combination.
> > > >
> > > > Signed-off-by: Ganapatrao Kulkarni 
> > > > [JPG: Added kvzalloc(), dropped pages ** being device local, tidied
> > > > ternary operator]
> > > > Signed-off-by: John Garry 
> > >
> > > Weird, you're missing a diffstat here.
> > >
> > > Anyway, the patch looks fine to me, but it would be nice if you could
> > > justify the change with some numbers. Do you actually see an improvement
> > > from this change?
> >
> > Hi Will,
> >
> > Ah, I missed adding my comments explaining the motivation. It would be
> > better in the commit log. Anyway, here's the snippet:
> >
> > " ... as mentioned in [3], dma_alloc_coherent() uses the locality
> > information from the device - as in direct DMA - so this patch is just
> > applying this same policy.
> >
> > [3]
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html"
>
> Yes, please add this to the commit log.

Sure,

> > I did have some numbers to show improvement in some scenarios when I
> > tested this a while back, which I'll dig out.
> >
> > However I would say that some scenarios will improve and the opposite for
> > others with this change, considering different conditions in which DMA
> > memory may be used.
>
> Well, if you can show that it's useful in some cases and not catastrophic in
> others, then I think shooting for parity with direct DMA is a reasonable
> justification for the change.

So I have done some more testing with our SoC crypto engine, using the
tcrypt module. The reason I used this device is that we can utilise a
local device per socket in the system, unlike other DMAing devices, which
generally only exist on a single socket (for us, anyway).

Overall the results aren't brilliant - as expected - but they show a
general marginal improvement. Here are some figures:


Summary:
Average diff: +0.9%
Max diff:     +1.5%
Min diff:     +0.2%

Test (ops/second)                          before   after   diff %

async ecb(aes) encrypt
test 0  (128 bit key, 16 byte blocks):      68717   69057   0.5
test 1  (128 bit key, 64 byte blocks):      72633   73163   0.7
test 2  (128 bit key, 256 byte blocks):     71475   72076   0.8
test 3  (128 bit key, 1024 byte blocks):    66819   67467   1.0
test 4  (128 bit key, 8192 byte blocks):    38237   38495   0.7
test 5  (192 bit key, 16 byte blocks):      70273   71079   1.2
test 6  (192 bit key, 64 byte blocks):      72455   73292   1.2
test 7  (192 bit key, 256 byte blocks):     71085   71876   1.1
test 8  (192 bit key, 1024 byte blocks):    65891   66576   1.0
test 9  (192 bit key, 8192 byte blocks):    34846   35061   0.6
test 10 (256 bit key, 16 byte blocks):      72927   73762   1.2
test 11 (256 bit key, 64 byte blocks):      72361   73207   1.2
test 12 (256 bit key, 256 byte blocks):     70907   71602   1.0
test 13 (256 bit key, 1024 byte blocks):    65035   65653   1.0
test 14 (256 bit key, 8192 byte blocks):    32835   32998   0.5

async ecb(aes) decrypt
test 0  (128 bit key, 16 byte blocks):      68384   69130   1.1
test 1  (128 bit key, 64 byte blocks):      72645   73133   0.7
test 2  (128 bit key, 256 byte blocks):     71523   71912   0.5
test 3  (128 bit key, 1024 byte blocks):    66902   67258   0.5
test 4  (128 bit key, 8192 byte blocks):    38260   38434   0.5
test 5  (192 bit key, 16 byte blocks):      70301   70816   0.7
test 6  (192 bit key, 64 byte blocks):      72473   73064   0.8
test 7  (192 bit key, 256 byte blocks):     71106   71663   0.8
test 8  (192 bit key, 1024 byte blocks):    65915   66363   0.7
test 9  (192 bit key, 8192 byte blocks):    34876   35006   0.4
test 10 (256 bit key, 16 byte blocks):      72969   73519   0.8
test 11 (256 bit key, 64 byte blocks):      72404   72925   0.7
test 12 (256 bit key, 256 byte blocks):     70861   71350   0.7
test 13 (256 bit key, 1024 byte blocks):    65074   65485   0.6
test 14 (256 bit key, 8192 byte blocks):    32861   32974   0.3

async cbc(aes) encrypt
test 0  (128 bit key, 16 byte blocks):      58306   59131   1.4
test 1  (128 bit key, 64 byte blocks):      61647   62565   1.5
test 2  (128 bit key, 256 byte blocks):     60841   61666   1.4
test 3  (128 bit key, 1024 byte blocks):    57503   58204   1.2
test 4  (128 bit key, 8192 byte blocks):    34760   35055   0.9
test 5  (192 bit key, 16 byte blocks):      59684   60452   1.3
test 6  (192 bit key, 64 byte blocks):      61705   62516   1.3
test 7  (192 bit key, 256 byte blocks):     60678   61426   1.2
test 8  (192 bit key, 1024 byte blocks):    56805   57487   1.2
test 9  (192 bit key, 8192 byte blocks):    31836   32093   0.8
test 10 (256 bit key, 16 byte blocks):      61961   62785   1.3
test 11 (256 bit key, 64 byte blocks):      61584   62427   1.4
test 12 (256 bit key, 256 byte blocks):     60407   61246   1.4
test 13 (256 bit key, 1024
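
For reference, the diff column above is just the relative change in
ops/second. A quick worked check (plain C, nothing kernel-specific;
diff_pct() is a throwaway helper for illustration only):

#include <stdio.h>

/* diff % = (after - before) / before * 100 */
static double diff_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}

int main(void)
{
	/* test 0, async ecb(aes) encrypt: 68717 -> 69057 ops/s gives ~0.5 */
	printf("%.1f\n", diff_pct(68717.0, 69057.0));
	return 0;
}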

Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-21 Thread Will Deacon
On Wed, Nov 21, 2018 at 04:47:48PM +0000, John Garry wrote:
> On 21/11/2018 16:07, Will Deacon wrote:
> >On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> >>From: Ganapatrao Kulkarni 
> >>
> >>Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> >>from the respective device's NUMA node. The ternary operator which
> >>would be needed for alloc_pages_node() is tidied up along with this.
> >>
> >>We also include a change to use kvzalloc() in place of the
> >>kzalloc()/vzalloc() combination.
> >>
> >>Signed-off-by: Ganapatrao Kulkarni 
> >>[JPG: Added kvzalloc(), dropped pages ** being device local, tidied
> >>ternary operator]
> >>Signed-off-by: John Garry 
> >
> >Weird, you're missing a diffstat here.
> >
> >Anyway, the patch looks fine to me, but it would be nice if you could
> >justify the change with some numbers. Do you actually see an improvement
> >from this change?
> >
> 
> Hi Will,
> 
> Ah, I missed adding my comments explaining the motivation. It would be
> better in the commit log. Anyway, here's the snippet:
> 
> " ... as mentioned in [3], dma_alloc_coherent() uses the locality
> information from the device - as in direct DMA - so this patch is just
> applying this same policy.
> 
> [3]
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html"

Yes, please add this to the commit log.

> I did have some numbers to show improvement in some scenarios when I
> tested this a while back, which I'll dig out.
> 
> However I would say that some scenarios will improve and the opposite for
> others with this change, considering different conditions in which DMA
> memory may be used.

Well, if you can show that it's useful in some cases and not catastrophic in
others, then I think shooting for parity with direct DMA is a reasonable
justification for the change.

Will


Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-21 Thread John Garry

On 21/11/2018 16:07, Will Deacon wrote:
> On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> > From: Ganapatrao Kulkarni 
> >
> > Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> > from the respective device's NUMA node. The ternary operator which
> > would be needed for alloc_pages_node() is tidied up along with this.
> >
> > We also include a change to use kvzalloc() in place of the
> > kzalloc()/vzalloc() combination.
> >
> > Signed-off-by: Ganapatrao Kulkarni 
> > [JPG: Added kvzalloc(), dropped pages ** being device local, tidied
> > ternary operator]
> > Signed-off-by: John Garry 
>
> Weird, you're missing a diffstat here.
>
> Anyway, the patch looks fine to me, but it would be nice if you could
> justify the change with some numbers. Do you actually see an improvement
> from this change?

Hi Will,

Ah, I missed adding my comments explaining the motivation. It would be
better in the commit log. Anyway, here's the snippet:

" ... as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[3]
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html"
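
To illustrate the policy being matched, here is a rough sketch of the
node-local allocation that direct DMA performs (simplified, and
local_dma_pages() is a made-up name; the real dma_direct_alloc() has
extra fallback and zone logic):

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Request the backing pages from the device's own NUMA node. */
static struct page *local_dma_pages(struct device *dev, size_t size,
				    gfp_t gfp)
{
	return alloc_pages_node(dev_to_node(dev), gfp, get_order(size));
}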


I did have some numbers to show improvement in some scenarios when I
tested this a while back, which I'll dig out.

However I would say that some scenarios will improve and the opposite
for others with this change, considering different conditions in which
DMA memory may be used.


Cheers,
John




Re: [PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-21 Thread Will Deacon
On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> From: Ganapatrao Kulkarni 
> 
> Change function __iommu_dma_alloc_pages() to allocate pages for DMA
> from the respective device's NUMA node. The ternary operator which
> would be needed for alloc_pages_node() is tidied up along with this.
> 
> We also include a change to use kvzalloc() in place of the
> kzalloc()/vzalloc() combination.
> 
> Signed-off-by: Ganapatrao Kulkarni 
> [JPG: Added kvzalloc(), dropped pages ** being device local, tidied
> ternary operator]
> Signed-off-by: John Garry 

Weird, you're missing a diffstat here.

Anyway, the patch looks fine to me, but it would be nice if you could
justify the change with some numbers. Do you actually see an improvement
from this change?

Will


[PATCH v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

2018-11-21 Thread John Garry
From: Ganapatrao Kulkarni 

Change function __iommu_dma_alloc_pages() to allocate pages for DMA
from the respective device's NUMA node. The ternary operator which
would be needed for alloc_pages_node() is tidied up along with this.

We also include a change to use kvzalloc() in place of the
kzalloc()/vzalloc() combination.

Signed-off-by: Ganapatrao Kulkarni 
[JPG: Added kvzalloc(), dropped pages ** being device local, tidied
ternary operator]
Signed-off-by: John Garry 

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..4afb1a8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
 	kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-		unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+		unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
 	struct page **pages;
-	unsigned int i = 0, array_size = count * sizeof(*pages);
+	unsigned int i = 0, nid = dev_to_node(dev);
 
 	order_mask &= (2U << MAX_ORDER) - 1;
 	if (!order_mask)
 		return NULL;
 
-	if (array_size <= PAGE_SIZE)
-		pages = kzalloc(array_size, GFP_KERNEL);
-	else
-		pages = vzalloc(array_size);
+	pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
 	if (!pages)
 		return NULL;
 
@@ -481,10 +478,12 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
 	for (order_mask &= (2U << __fls(count)) - 1;
 	     order_mask; order_mask &= ~order_size) {
 		unsigned int order = __fls(order_mask);
+		gfp_t alloc_flags = gfp;
 
 		order_size = 1U << order;
-		page = alloc_pages((order_mask - order_size) ?
-				   gfp | __GFP_NORETRY : gfp, order);
+		if (order_mask > order_size)
+			alloc_flags |= __GFP_NORETRY;
+		page = alloc_pages_node(nid, alloc_flags, order);
 		if (!page)
 			continue;
 		if (!order)
@@ -569,7 +568,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 		alloc_sizes = min_size;
 
 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+					gfp);
 	if (!pages)
 		return NULL;
 
-- 
1.9.1
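
For readers skimming the diff, here is a minimal self-contained sketch of
the combined kvzalloc() + alloc_pages_node() pattern it introduces
(illustrative only; alloc_node_local_pages() is a hypothetical helper,
not from the patch, and it allocates single pages rather than walking an
order mask):

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>

static struct page **alloc_node_local_pages(struct device *dev,
					    unsigned int count)
{
	int nid = dev_to_node(dev);	/* NUMA_NO_NODE if no affinity */
	struct page **pages;
	unsigned int i;

	/* kvzalloc() picks kzalloc() or vzalloc() based on the size */
	pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	for (i = 0; i < count; i++) {
		/* prefers nid but falls back to other nodes under pressure */
		pages[i] = alloc_pages_node(nid, GFP_KERNEL, 0);
		if (!pages[i])
			goto out_free;
	}
	return pages;

out_free:
	while (i--)
		__free_pages(pages[i], 0);
	kvfree(pages);
	return NULL;
}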
