Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-09-19 Thread Robin Murphy
On 19/09/17 10:42, Leizhen (ThunderTown) wrote:
[...]
 static void
   __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
 *free)
   {
   struct iova *cached_iova;
 -struct rb_node *curr;
 +struct rb_node **curr = NULL;
   -if (!iovad->cached32_node)
 -return;
 -curr = iovad->cached32_node;
 -cached_iova = rb_entry(curr, struct iova, node);
> 
> -
 +if (free->pfn_hi < iovad->dma_32bit_pfn)
 +curr = &iovad->cached32_node;
 +if (!curr)
 +curr = &iovad->cached_node;
>>
>> +if (!*curr)
>> +return;
> Should we consider the following adjustment?
> + if (free->pfn_hi < iovad->dma_32bit_pfn)
> + curr = &iovad->cached32_node;
> + else
> + curr = &iovad->cached_node;
> +
> + if (!*curr) {
> + *curr = rb_next(&free->node);
> + return;
> + }

Yeah, I spotted that this looked a bit wonky after I posted it. It's
already cleaned up in v3, which I'll be posting shortly after I write up
some cover letters.

Thanks,
Robin.
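
[A self-contained sketch of the restructuring suggested above, for
reference -- an illustration only, not the actual v3 hunk, assuming the
iova_domain fields from the v2 patch (cached_node, cached32_node,
dma_32bit_pfn):]

static void
__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
{
	struct iova *cached_iova;
	struct rb_node **curr;

	if (free->pfn_hi < iovad->dma_32bit_pfn)
		curr = &iovad->cached32_node;
	else
		curr = &iovad->cached_node;

	/* No node cached for this range, so nothing to invalidate */
	if (!*curr)
		return;

	cached_iova = rb_entry(*curr, struct iova, node);

	/* Freeing at or above the cached node: advance the cache past it */
	if (free->pfn_lo >= cached_iova->pfn_lo)
		*curr = rb_next(&free->node);
}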


Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-09-19 Thread Leizhen (ThunderTown)


On 2017/7/31 19:42, Robin Murphy wrote:
> Hi Nate,
> 
> On 29/07/17 04:57, Nate Watterson wrote:
>> Hi Robin,
>> I am seeing a crash when performing very basic testing on this series
>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>> patch is the culprit, but this rcache business is still mostly
>> witchcraft to me.
>>
>> # ifconfig eth5 up
>> # ifconfig eth5 down
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 0020
>> user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
>> [0020] *pgd=0006efab0003, *pud=0006efab0003,
>> *pmd=0007d8720003, *pte=
>> Internal error: Oops: 9607 [#1] SMP
>> Modules linked in:
>> CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>> task: 8007da1e5780 task.stack: 8007ddcb8000
>> PC is at __cached_rbnode_delete_update+0x2c/0x58
>> LR is at private_free_iova+0x2c/0x60
>> pc : [] lr : [] pstate: 204001c5
>> sp : 8007ddcbba00
>> x29: 8007ddcbba00 x28: 8007c8350210
>> x27: 8007d1a8 x26: 8007dcc20800
>> x25: 0140 x24: 8007c98f0008
>> x23: fe4e x22: 0140
>> x21: 8007c98f0008 x20: 8007c9adb240
>> x19: 8007c98f0018 x18: 0010
>> x17:  x16: 
>> x15: 4000 x14: 
>> x13:  x12: 0001
>> x11: dead0200 x10: 
>> x9 :  x8 : 8007c9adb1c0
>> x7 : 40002000 x6 : 00210d00
>> x5 :  x4 : c57e
>> x3 : ffcf x2 : ffcf
>> x1 : 8007c9adb240 x0 : 
>> [...]
>> [] __cached_rbnode_delete_update+0x2c/0x58
>> [] private_free_iova+0x2c/0x60
>> [] iova_magazine_free_pfns+0x4c/0xa0
>> [] free_iova_fast+0x1b0/0x230
>> [] iommu_dma_free_iova+0x5c/0x80
>> [] __iommu_dma_unmap+0x5c/0x98
>> [] iommu_dma_unmap_resource+0x24/0x30
>> [] iommu_dma_unmap_page+0xc/0x18
>> [] __iommu_unmap_page+0x40/0x60
>> [] mlx5e_page_release+0xbc/0x128
>> [] mlx5e_dealloc_rx_wqe+0x30/0x40
>> [] mlx5e_close_channel+0x70/0x1f8
>> [] mlx5e_close_channels+0x2c/0x50
>> [] mlx5e_close_locked+0x54/0x68
>> [] mlx5e_close+0x30/0x58
>> [...]
>>
>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>>   92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>> 0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
>> 0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
>> 0852C6CC|cmp x3,x2
>> 0852C6D0|b.cs 0x0852C708
>> |curr = &iovad->cached32_node;
>>   94|if (!curr)
>> 0852C6D4|adds x19,x0,#0x18 ; x19,iovad,#24
>> 0852C6D8|b.eq 0x0852C708
>> |
>> |cached_iova = rb_entry(*curr, struct iova, node);
>> |
>>   99|if (free->pfn_lo >= cached_iova->pfn_lo)
>> 0852C6DC|ldr x0,[x19] ; xiovad,[curr]
>> 0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
>> 0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>
>> 0852C6E8|cmp x2,x0
>> 0852C6EC|b.cc 0x0852C6FC
>> 0852C6F0|mov x0,x1; x0,free
>>  100|*curr = rb_next(&free->node);
>> After instrumenting the code a bit, this seems to be the culprit. In the
>> previous call, free->pfn_lo was 0x_ which is actually the
>> dma_limit for the domain so rb_next() returns NULL.
>>
>> Let me know if you have any questions or would like additional tests
>> run. I also applied your "DMA domain debug info" patches and dumped the
>> contents of the domain at each of the steps above in case that would be
>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
>> especially when using 64k pages and many CPUs.
> 
> Thanks for the report - I somehow managed to reason myself out of
> keeping the "no cached node" check in __cached_rbnode_delete_update() on
> the assumption that it must always be set by a previous allocation.
> However, there is indeed just one case for which that fails: when
> you free any IOVA immediately after freeing the very topmost one. Which
> is something that freeing an entire magazine's worth of IOVAs back to
> the tree all at once has a very real chance of doing...
> 
> The obvious straightforward fix is inline below, but I'm now starting to
> understand the appeal of reserving a sentinel node to ensure the tree
> can never be empty, so I might have a quick go at that to see if it
> results in 

Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-08-31 Thread Leizhen (ThunderTown)


On 2017/8/4 3:41, Nate Watterson wrote:
> Hi Robin,
> 
> On 7/31/2017 7:42 AM, Robin Murphy wrote:
>> Hi Nate,
>>
>> On 29/07/17 04:57, Nate Watterson wrote:
>>> Hi Robin,
>>> I am seeing a crash when performing very basic testing on this series
>>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>>> patch is the culprit, but this rcache business is still mostly
>>> witchcraft to me.
>>>
>>> # ifconfig eth5 up
>>> # ifconfig eth5 down
>>>  Unable to handle kernel NULL pointer dereference at virtual address
>>> 0020
>>>  user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
>>>  [0020] *pgd=0006efab0003, *pud=0006efab0003,
>>> *pmd=0007d8720003, *pte=
>>>  Internal error: Oops: 9607 [#1] SMP
>>>  Modules linked in:
>>>  CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>>>  task: 8007da1e5780 task.stack: 8007ddcb8000
>>>  PC is at __cached_rbnode_delete_update+0x2c/0x58
>>>  LR is at private_free_iova+0x2c/0x60
>>>  pc : [] lr : [] pstate: 204001c5
>>>  sp : 8007ddcbba00
>>>  x29: 8007ddcbba00 x28: 8007c8350210
>>>  x27: 8007d1a8 x26: 8007dcc20800
>>>  x25: 0140 x24: 8007c98f0008
>>>  x23: fe4e x22: 0140
>>>  x21: 8007c98f0008 x20: 8007c9adb240
>>>  x19: 8007c98f0018 x18: 0010
>>>  x17:  x16: 
>>>  x15: 4000 x14: 
>>>  x13:  x12: 0001
>>>  x11: dead0200 x10: 
>>>  x9 :  x8 : 8007c9adb1c0
>>>  x7 : 40002000 x6 : 00210d00
>>>  x5 :  x4 : c57e
>>>  x3 : ffcf x2 : ffcf
>>>  x1 : 8007c9adb240 x0 : 
>>>  [...]
>>>  [] __cached_rbnode_delete_update+0x2c/0x58
>>>  [] private_free_iova+0x2c/0x60
>>>  [] iova_magazine_free_pfns+0x4c/0xa0
>>>  [] free_iova_fast+0x1b0/0x230
>>>  [] iommu_dma_free_iova+0x5c/0x80
>>>  [] __iommu_dma_unmap+0x5c/0x98
>>>  [] iommu_dma_unmap_resource+0x24/0x30
>>>  [] iommu_dma_unmap_page+0xc/0x18
>>>  [] __iommu_unmap_page+0x40/0x60
>>>  [] mlx5e_page_release+0xbc/0x128
>>>  [] mlx5e_dealloc_rx_wqe+0x30/0x40
>>>  [] mlx5e_close_channel+0x70/0x1f8
>>>  [] mlx5e_close_channels+0x2c/0x50
>>>  [] mlx5e_close_locked+0x54/0x68
>>>  [] mlx5e_close+0x30/0x58
>>>  [...]
>>>
>>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>>>92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>>> 0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
>>> 0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
>>> 0852C6CC|cmp x3,x2
>>> 0852C6D0|b.cs 0x0852C708
>>>  |curr = &iovad->cached32_node;
>>>94|if (!curr)
>>> 0852C6D4|adds x19,x0,#0x18 ; x19,iovad,#24
>>> 0852C6D8|b.eq 0x0852C708
>>>  |
>>>  |cached_iova = rb_entry(*curr, struct iova, node);
>>>  |
>>>99|if (free->pfn_lo >= cached_iova->pfn_lo)
>>> 0852C6DC|ldr x0,[x19] ; xiovad,[curr]
>>> 0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
>>> 0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
>>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>>
>>> 0852C6E8|cmp x2,x0
>>> 0852C6EC|b.cc 0x0852C6FC
>>> 0852C6F0|mov x0,x1; x0,free
>>>   100|*curr = rb_next(&free->node);
>>> After instrumenting the code a bit, this seems to be the culprit. In the
>>> previous call, free->pfn_lo was 0x_ which is actually the
>>> dma_limit for the domain so rb_next() returns NULL.
>>>
>>> Let me know if you have any questions or would like additional tests
>>> run. I also applied your "DMA domain debug info" patches and dumped the
>>> contents of the domain at each of the steps above in case that would be
>>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
>>> especially when using 64k pages and many CPUs.
>>
>> Thanks for the report - I somehow managed to reason myself out of
>> keeping the "no cached node" check in __cached_rbnode_delete_update() on
>> the assumption that it must always be set by a previous allocation.
>> However, there is indeed just one case for which that fails: when
>> you free any IOVA immediately after freeing the very topmost one. Which
>> is something that freeing an entire magazine's worth of IOVAs back to
>> the tree all at once has a very real chance of doing...
>>
>> The obvious 

Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-08-03 Thread Nate Watterson

Hi Robin,

On 7/31/2017 7:42 AM, Robin Murphy wrote:

Hi Nate,

On 29/07/17 04:57, Nate Watterson wrote:

Hi Robin,
I am seeing a crash when performing very basic testing on this series
with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
patch is the culprit, but this rcache business is still mostly
witchcraft to me.

# ifconfig eth5 up
# ifconfig eth5 down
 Unable to handle kernel NULL pointer dereference at virtual address
0020
 user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
 [0020] *pgd=0006efab0003, *pud=0006efab0003,
*pmd=0007d8720003, *pte=
 Internal error: Oops: 9607 [#1] SMP
 Modules linked in:
 CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
 task: 8007da1e5780 task.stack: 8007ddcb8000
 PC is at __cached_rbnode_delete_update+0x2c/0x58
 LR is at private_free_iova+0x2c/0x60
 pc : [] lr : [] pstate: 204001c5
 sp : 8007ddcbba00
 x29: 8007ddcbba00 x28: 8007c8350210
 x27: 8007d1a8 x26: 8007dcc20800
 x25: 0140 x24: 8007c98f0008
 x23: fe4e x22: 0140
 x21: 8007c98f0008 x20: 8007c9adb240
 x19: 8007c98f0018 x18: 0010
 x17:  x16: 
 x15: 4000 x14: 
 x13:  x12: 0001
 x11: dead0200 x10: 
 x9 :  x8 : 8007c9adb1c0
 x7 : 40002000 x6 : 00210d00
 x5 :  x4 : c57e
 x3 : ffcf x2 : ffcf
 x1 : 8007c9adb240 x0 : 
 [...]
 [] __cached_rbnode_delete_update+0x2c/0x58
 [] private_free_iova+0x2c/0x60
 [] iova_magazine_free_pfns+0x4c/0xa0
 [] free_iova_fast+0x1b0/0x230
 [] iommu_dma_free_iova+0x5c/0x80
 [] __iommu_dma_unmap+0x5c/0x98
 [] iommu_dma_unmap_resource+0x24/0x30
 [] iommu_dma_unmap_page+0xc/0x18
 [] __iommu_unmap_page+0x40/0x60
 [] mlx5e_page_release+0xbc/0x128
 [] mlx5e_dealloc_rx_wqe+0x30/0x40
 [] mlx5e_close_channel+0x70/0x1f8
 [] mlx5e_close_channels+0x2c/0x50
 [] mlx5e_close_locked+0x54/0x68
 [] mlx5e_close+0x30/0x58
 [...]

** Disassembly for __cached_rbnode_delete_update() near the fault **
   92|if (free->pfn_hi < iovad->dma_32bit_pfn)
0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
0852C6CC|cmp x3,x2
0852C6D0|b.cs 0x0852C708
 |curr = &iovad->cached32_node;
   94|if (!curr)
0852C6D4|adds x19,x0,#0x18 ; x19,iovad,#24
0852C6D8|b.eq 0x0852C708
 |
 |cached_iova = rb_entry(*curr, struct iova, node);
 |
   99|if (free->pfn_lo >= cached_iova->pfn_lo)
0852C6DC|ldr x0,[x19] ; xiovad,[curr]
0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
Apparently cached_iova was NULL so the pfn_lo access faulted.

0852C6E8|cmp x2,x0
0852C6EC|b.cc 0x0852C6FC
0852C6F0|mov x0,x1; x0,free
  100|*curr = rb_next(&free->node);
After instrumenting the code a bit, this seems to be the culprit. In the
previous call, free->pfn_lo was 0x_ which is actually the
dma_limit for the domain so rb_next() returns NULL.

Let me know if you have any questions or would like additional tests
run. I also applied your "DMA domain debug info" patches and dumped the
contents of the domain at each of the steps above in case that would be
useful. If nothing else, they reinforce how thirsty the CX4 NIC is
especially when using 64k pages and many CPUs.


Thanks for the report - I somehow managed to reason myself out of
keeping the "no cached node" check in __cached_rbnode_delete_update() on
the assumption that it must always be set by a previous allocation.
However, there is indeed just one case for which that fails: when
you free any IOVA immediately after freeing the very topmost one. Which
is something that freeing an entire magazine's worth of IOVAs back to
the tree all at once has a very real chance of doing...

The obvious straightforward fix is inline below, but I'm now starting to
understand the appeal of reserving a sentinel node to ensure the tree
can never be empty, so I might have a quick go at that to see if it
results in nicer code overall.


After applying your fix, the crash no longer occurs, but the performance
of this use case is only marginally less awful than it was before. I'll
start a new thread to discuss the causes and potential 

Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-07-31 Thread Robin Murphy
Hi Nate,

On 29/07/17 04:57, Nate Watterson wrote:
> Hi Robin,
> I am seeing a crash when performing very basic testing on this series
> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
> patch is the culprit, but this rcache business is still mostly
> witchcraft to me.
> 
> # ifconfig eth5 up
> # ifconfig eth5 down
> Unable to handle kernel NULL pointer dereference at virtual address
> 0020
> user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
> [0020] *pgd=0006efab0003, *pud=0006efab0003,
> *pmd=0007d8720003, *pte=
> Internal error: Oops: 9607 [#1] SMP
> Modules linked in:
> CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
> task: 8007da1e5780 task.stack: 8007ddcb8000
> PC is at __cached_rbnode_delete_update+0x2c/0x58
> LR is at private_free_iova+0x2c/0x60
> pc : [] lr : [] pstate: 204001c5
> sp : 8007ddcbba00
> x29: 8007ddcbba00 x28: 8007c8350210
> x27: 8007d1a8 x26: 8007dcc20800
> x25: 0140 x24: 8007c98f0008
> x23: fe4e x22: 0140
> x21: 8007c98f0008 x20: 8007c9adb240
> x19: 8007c98f0018 x18: 0010
> x17:  x16: 
> x15: 4000 x14: 
> x13:  x12: 0001
> x11: dead0200 x10: 
> x9 :  x8 : 8007c9adb1c0
> x7 : 40002000 x6 : 00210d00
> x5 :  x4 : c57e
> x3 : ffcf x2 : ffcf
> x1 : 8007c9adb240 x0 : 
> [...]
> [] __cached_rbnode_delete_update+0x2c/0x58
> [] private_free_iova+0x2c/0x60
> [] iova_magazine_free_pfns+0x4c/0xa0
> [] free_iova_fast+0x1b0/0x230
> [] iommu_dma_free_iova+0x5c/0x80
> [] __iommu_dma_unmap+0x5c/0x98
> [] iommu_dma_unmap_resource+0x24/0x30
> [] iommu_dma_unmap_page+0xc/0x18
> [] __iommu_unmap_page+0x40/0x60
> [] mlx5e_page_release+0xbc/0x128
> [] mlx5e_dealloc_rx_wqe+0x30/0x40
> [] mlx5e_close_channel+0x70/0x1f8
> [] mlx5e_close_channels+0x2c/0x50
> [] mlx5e_close_locked+0x54/0x68
> [] mlx5e_close+0x30/0x58
> [...]
> 
> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>   92|if (free->pfn_hi < iovad->dma_32bit_pfn)
> 0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
> 0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
> 0852C6CC|cmp x3,x2
> 0852C6D0|b.cs 0x0852C708
> |curr = &iovad->cached32_node;
>   94|if (!curr)
> 0852C6D4|adds x19,x0,#0x18 ; x19,iovad,#24
> 0852C6D8|b.eq 0x0852C708
> |
> |cached_iova = rb_entry(*curr, struct iova, node);
> |
>   99|if (free->pfn_lo >= cached_iova->pfn_lo)
> 0852C6DC|ldr x0,[x19] ; xiovad,[curr]
> 0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
> 0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
> Apparently cached_iova was NULL so the pfn_lo access faulted.
> 
> 0852C6E8|cmp x2,x0
> 0852C6EC|b.cc 0x0852C6FC
> 0852C6F0|mov x0,x1; x0,free
>  100|*curr = rb_next(&free->node);
> After instrumenting the code a bit, this seems to be the culprit. In the
> previous call, free->pfn_lo was 0x_ which is actually the
> dma_limit for the domain so rb_next() returns NULL.
> 
> Let me know if you have any questions or would like additional tests
> run. I also applied your "DMA domain debug info" patches and dumped the
> contents of the domain at each of the steps above in case that would be
> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
> especially when using 64k pages and many CPUs.

Thanks for the report - I somehow managed to reason myself out of
keeping the "no cached node" check in __cached_rbnode_delete_update() on
the assumption that it must always be set by a previous allocation.
However, there is indeed just one case for which that fails: when
you free any IOVA immediately after freeing the very topmost one. Which
is something that freeing an entire magazine's worth of IOVAs back to
the tree all at once has a very real chance of doing...
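
[For context, the drain path in question looks roughly like the sketch
below -- simplified, not verbatim from this series: every pfn in the
magazine is freed back to the rbtree individually, so
__cached_rbnode_delete_update() runs once per entry, including
immediately after the topmost IOVA has gone.]

static void iova_magazine_free_pfns(struct iova_magazine *mag,
				    struct iova_domain *iovad)
{
	unsigned long flags;
	int i;

	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
	for (i = 0; i < mag->size; i++) {
		struct iova *iova = private_find_iova(iovad, mag->pfns[i]);

		BUG_ON(!iova);
		/* Each free updates the cached nodes under the lock */
		private_free_iova(iovad, iova);
	}
	mag->size = 0;
	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
}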

The obvious straightforward fix is inline below, but I'm now starting to
understand the appeal of reserving a sentinel node to ensure the tree
can never be empty, so I might have a quick go at that to see if it
results in nicer code overall.
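
[The sentinel idea might look something like the sketch below -- purely
hypothetical, with the "anchor" field and helper name assumed rather
than taken from this series: reserving one permanent node at the very
top of the space means the tree is never empty, and rb_next() from the
topmost real node still lands on a valid node instead of NULL.]

#define IOVA_ANCHOR	~0UL

static void init_iova_anchor(struct iova_domain *iovad)
{
	struct iova *anchor = &iovad->anchor;	/* assumed new field */

	anchor->pfn_lo = anchor->pfn_hi = IOVA_ANCHOR;
	/* The tree is empty at init time, so link directly at the root */
	rb_link_node(&anchor->node, NULL, &iovad->rbroot.rb_node);
	rb_insert_color(&anchor->node, &iovad->rbroot);
}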

Robin.

>
> -Nate
> 
> On 7/21/2017 7:42 AM, Robin Murphy wrote:
>> The cached node mechanism provides a significant performance benefit 


Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-07-28 Thread Nate Watterson

Hi Robin,
I am seeing a crash when performing very basic testing on this series
with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
patch is the culprit, but this rcache business is still mostly
witchcraft to me.

# ifconfig eth5 up
# ifconfig eth5 down
Unable to handle kernel NULL pointer dereference at virtual address 0020
user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
[0020] *pgd=0006efab0003, *pud=0006efab0003, 
*pmd=0007d8720003, *pte=
Internal error: Oops: 9607 [#1] SMP
Modules linked in:
CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
task: 8007da1e5780 task.stack: 8007ddcb8000
PC is at __cached_rbnode_delete_update+0x2c/0x58
LR is at private_free_iova+0x2c/0x60
pc : [] lr : [] pstate: 204001c5
sp : 8007ddcbba00
x29: 8007ddcbba00 x28: 8007c8350210
x27: 8007d1a8 x26: 8007dcc20800
x25: 0140 x24: 8007c98f0008
x23: fe4e x22: 0140
x21: 8007c98f0008 x20: 8007c9adb240
x19: 8007c98f0018 x18: 0010
x17:  x16: 
x15: 4000 x14: 
x13:  x12: 0001
x11: dead0200 x10: 
x9 :  x8 : 8007c9adb1c0
x7 : 40002000 x6 : 00210d00
x5 :  x4 : c57e
x3 : ffcf x2 : ffcf
x1 : 8007c9adb240 x0 : 
[...]
[] __cached_rbnode_delete_update+0x2c/0x58
[] private_free_iova+0x2c/0x60
[] iova_magazine_free_pfns+0x4c/0xa0
[] free_iova_fast+0x1b0/0x230
[] iommu_dma_free_iova+0x5c/0x80
[] __iommu_dma_unmap+0x5c/0x98
[] iommu_dma_unmap_resource+0x24/0x30
[] iommu_dma_unmap_page+0xc/0x18
[] __iommu_unmap_page+0x40/0x60
[] mlx5e_page_release+0xbc/0x128
[] mlx5e_dealloc_rx_wqe+0x30/0x40
[] mlx5e_close_channel+0x70/0x1f8
[] mlx5e_close_channels+0x2c/0x50
[] mlx5e_close_locked+0x54/0x68
[] mlx5e_close+0x30/0x58
[...]

** Disassembly for __cached_rbnode_delete_update() near the fault **
  92|if (free->pfn_hi < iovad->dma_32bit_pfn)
0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
0852C6CC|cmp x3,x2
0852C6D0|b.cs 0x0852C708
|curr = &iovad->cached32_node;
  94|if (!curr)
0852C6D4|adds x19,x0,#0x18 ; x19,iovad,#24
0852C6D8|b.eq 0x0852C708
|
|cached_iova = rb_entry(*curr, struct iova, node);
|
  99|if (free->pfn_lo >= cached_iova->pfn_lo)
0852C6DC|ldr x0,[x19] ; xiovad,[curr]
0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
Apparently cached_iova was NULL so the pfn_lo access faulted.

0852C6E8|cmp x2,x0
0852C6EC|b.cc 0x0852C6FC
0852C6F0|mov x0,x1; x0,free
 100|*curr = rb_next(&free->node);
After instrumenting the code a bit, this seems to be the culprit. In the
previous call, free->pfn_lo was 0x_ which is actually the
dma_limit for the domain so rb_next() returns NULL.
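
[Why the fault address is 0x20: rb_entry() is container_of(), i.e.
unchecked pointer arithmetic, so applying it to a NULL cached rb_node
yields a near-NULL struct iova pointer, and the subsequent pfn_lo load
faults at pfn_lo's offset within the struct. A minimal illustration,
with the layout inferred from the disassembly offsets above:]

#include <stddef.h>

struct rb_node {
	unsigned long	__rb_parent_color;
	struct rb_node	*rb_right;
	struct rb_node	*rb_left;
};					/* 24 bytes on 64-bit */

struct iova {
	struct rb_node	node;		/* offset 0x00 */
	unsigned long	pfn_hi;		/* offset 0x18 */
	unsigned long	pfn_lo;		/* offset 0x20 <- fault address */
};

#define rb_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* rb_entry(NULL, struct iova, node) == (struct iova *)NULL, so reading
 * ->pfn_lo through it touches virtual address 0x20. */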

Let me know if you have any questions or would like additional tests
run. I also applied your "DMA domain debug info" patches and dumped the
contents of the domain at each of the steps above in case that would be
useful. If nothing else, they reinforce how thirsty the CX4 NIC is
especially when using 64k pages and many CPUs.

-Nate

On 7/21/2017 7:42 AM, Robin Murphy wrote:

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that traversing to the end then walking all the way down
to find free space every time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei .

Tested-by: Ard Biesheuvel 
Tested-by: Zhen Lei 
Signed-off-by: Robin Murphy 
---

v2: No change

  drivers/iommu/iova.c | 59 +---
  include/linux/iova.h |  3 ++-
  2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d094d1ca8f23..f5809a2ee6c2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -46,6 +46,7 @@ 
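
[The quoted diff is truncated here; the full patch follows below. In
brief, its node-selection policy on the allocation side condenses to
the sketch below (hypothetical helper name, condensing the selection
logic of __get_cached_rbnode()): prefer the 32-bit cached node when
the limit allows, fall back to the whole-space cached node, and only
then to the rightmost node of the tree.]

static struct rb_node *
start_node(struct iova_domain *iovad, unsigned long limit_pfn)
{
	struct rb_node *n = NULL;

	if (limit_pfn <= iovad->dma_32bit_pfn)
		n = iovad->cached32_node;
	if (!n)
		n = iovad->cached_node;
	return n ? n : rb_last(&iovad->rbroot);
}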

[PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-07-21 Thread Robin Murphy
The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that traversing to the end then walking all the way down
to find free space every time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei .

Tested-by: Ard Biesheuvel 
Tested-by: Zhen Lei 
Signed-off-by: Robin Murphy 
---

v2: No change

 drivers/iommu/iova.c | 59 +---
 include/linux/iova.h |  3 ++-
 2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d094d1ca8f23..f5809a2ee6c2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 
spin_lock_init(&iovad->iova_rbtree_lock);
iovad->rbroot = RB_ROOT;
+   iovad->cached_node = NULL;
iovad->cached32_node = NULL;
iovad->granule = granule;
iovad->start_pfn = start_pfn;
@@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
 static struct rb_node *
 __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
 {
-   if ((*limit_pfn > iovad->dma_32bit_pfn) ||
-   (iovad->cached32_node == NULL))
+   struct rb_node *cached_node = NULL;
+   struct iova *curr_iova;
+
+   if (*limit_pfn <= iovad->dma_32bit_pfn)
+   cached_node = iovad->cached32_node;
+   if (!cached_node)
+   cached_node = iovad->cached_node;
+   if (!cached_node)
return rb_last(&iovad->rbroot);
-   else {
-   struct rb_node *prev_node = rb_prev(iovad->cached32_node);
-   struct iova *curr_iova =
-   rb_entry(iovad->cached32_node, struct iova, node);
-   *limit_pfn = curr_iova->pfn_lo;
-   return prev_node;
-   }
+
+   curr_iova = rb_entry(cached_node, struct iova, node);
+   *limit_pfn = curr_iova->pfn_lo;
+
+   return rb_prev(cached_node);
 }
 
 static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
-   unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
 {
-   if (limit_pfn != iovad->dma_32bit_pfn)
-   return;
-   iovad->cached32_node = &new->node;
+   if (new->pfn_lo > iovad->dma_32bit_pfn)
+   iovad->cached_node = &new->node;
+   else
+   iovad->cached32_node = &new->node;
 }
 
 static void
 __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 {
struct iova *cached_iova;
-   struct rb_node *curr;
+   struct rb_node **curr = NULL;
 
-   if (!iovad->cached32_node)
-   return;
-   curr = iovad->cached32_node;
-   cached_iova = rb_entry(curr, struct iova, node);
+   if (free->pfn_hi < iovad->dma_32bit_pfn)
+   curr = &iovad->cached32_node;
+   if (!curr)
+   curr = &iovad->cached_node;
 
-   if (free->pfn_lo >= cached_iova->pfn_lo) {
-   struct rb_node *node = rb_next(&free->node);
-   struct iova *iova = rb_entry(node, struct iova, node);
+   cached_iova = rb_entry(*curr, struct iova, node);
 
-   /* only cache if it's below 32bit pfn */
-   if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
-   iovad->cached32_node = node;
-   else
-   iovad->cached32_node = NULL;
-   }
+   if (free->pfn_lo >= cached_iova->pfn_lo)
+   *curr = rb_next(&free->node);
 }
 
 /* Insert the iova into domain rbtree by holding writer lock */
@@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 {
struct rb_node *prev, *curr = NULL;
unsigned long flags;
-   unsigned long saved_pfn;
unsigned long align_mask = ~0UL;
 
if (size_aligned)
@@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
/* Walk the tree backwards */
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-   saved_pfn = limit_pfn;
curr = __get_cached_rbnode(iovad, &limit_pfn);
prev = curr;
while (curr) {
@@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
/* If we have 'prev', it's a valid place to start the insertion. */
iova_insert_rbtree(&iovad->rbroot, new, prev);
-   __cached_rbnode_insert_update(iovad, saved_pfn, new);
+   __cached_rbnode_insert_update(iovad, new);
 
spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 
diff --git a/include/linux/iova.h b/include/linux/iova.h
index
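
[The include/linux/iova.h hunk is truncated above; per the diffstat it
is a two-insertion, one-deletion change. A plausible sketch of the
resulting structure, inferred from the fields used in iova.c -- not the
verbatim hunk:]

struct iova_domain {
	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
	struct rb_root	rbroot;		/* iova domain rbtree root */
	struct rb_node	*cached_node;	/* Save last alloced node */
	struct rb_node	*cached32_node;	/* Save last 32-bit alloced node */
	unsigned long	granule;	/* pfn granularity for this domain */
	unsigned long	start_pfn;	/* Lower limit for this domain */
	unsigned long	dma_32bit_pfn;
	/* ... remaining fields unchanged ... */
};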