Re: IOAT DMA w/IOMMU

2018-08-10 Thread Kit Chow
There is an internal routine (__intel_map_single) inside the intel iommu 
code that does the actual mapping using a phys_addr_t. Think I'll try to 
implement a intel_map_resource routine that calls that routine directly 
without all of the conversions done for dma_map_{single,page} (pci bar 
addr -> page -> phys_addr)...



On 08/10/2018 09:02 AM, Kit Chow wrote:
Turns out there is no dma_map_resource routine on x86. get_dma_ops 
returns intel_dma_ops which has map_resource pointing to NULL.


(gdb) p intel_dma_ops
$7 = {alloc = 0x8150f310 ,
  free = 0x8150ec20 ,
  mmap = 0x0 , get_sgtable = 0x0 ,
  map_page = 0x8150f2d0 ,
  unmap_page = 0x8150ec10 ,
  map_sg = 0x8150ef40 ,
  unmap_sg = 0x8150eb80 ,
  map_resource = 0x0 ,
  unmap_resource = 0x0 ,
  sync_single_for_cpu = 0x0 ,
  sync_single_for_device = 0x0 ,
  sync_sg_for_cpu = 0x0 ,
  sync_sg_for_device = 0x0 ,
  cache_sync = 0x0 ,
  mapping_error = 0x815095f0 ,
  dma_supported = 0x81033830 , is_phys = 0}

Will poke around some in the intel_map_page code but can you actually 
get a valid struct page for a pci bar address (dma_map_single calls 
virt_to_page)?  If not, does a map_resource routine that can properly 
map a pci bar address need to be implemented?


Kit

---


static inline dma_addr_t dma_map_single_attrs(struct device *dev, void 
*ptr,

  size_t size,
  enum dma_data_direction 
dir,

  unsigned long attrs)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);
    dma_addr_t addr;

    BUG_ON(!valid_dma_direction(dir));
    addr = ops->map_page(dev, virt_to_page(ptr),
 offset_in_page(ptr), size,
 dir, attrs);
    debug_dma_map_page(dev, virt_to_page(ptr),
   offset_in_page(ptr), size,
   dir, addr, true);
    return addr;
}






On 08/09/2018 04:00 PM, Kit Chow wrote:



On 08/09/2018 03:50 PM, Logan Gunthorpe wrote:


On 09/08/18 04:48 PM, Kit Chow wrote:

Based on Logan's comments, I am very hopeful that the dma_map_resource
will make things work on the older platforms...

Well, I *think* dma_map_single() would still work. So I'm not that
confident that's the root of your problem. I'd still like to see the
actual code snippet you are using.

Logan
Here's the code snippet - (ntbdebug & 4) path does dma_map_resource 
of the pci bar address.


It was:
    unmap->addr[1] = dma_map_single(device->dev, (void 
*)dest, len,

    DMA_TO_DEVICE);

Kit
---


static int ntb_async_tx_submit(struct ntb_transport_qp *qp,
   struct ntb_queue_entry *entry)
{
    struct dma_async_tx_descriptor *txd;
    struct dma_chan *chan = qp->tx_dma_chan;
    struct dma_device *device;
    size_t len = entry->len;
    void *buf = entry->buf;
    size_t dest_off, buff_off;
    struct dmaengine_unmap_data *unmap;
    dma_addr_t dest;
    dma_cookie_t cookie;
    int unmapcnt;

    device = chan->device;

    dest = qp->tx_mw_phys + qp->tx_max_frame * entry->tx_index;

    buff_off = (size_t)buf & ~PAGE_MASK;
    dest_off = (size_t)dest & ~PAGE_MASK;

    if (!is_dma_copy_aligned(device, buff_off, dest_off, len))
    goto err;


    if (ntbdebug & 0x4) {
    unmapcnt = 2;
    } else {
    unmapcnt = 1;
    }

    unmap = dmaengine_get_unmap_data(device->dev, unmapcnt, 
GFP_NOWAIT);

    if (!unmap)
    goto err;

    unmap->len = len;
    unmap->addr[0] = dma_map_page(device->dev, virt_to_page(buf),
  buff_off, len, DMA_TO_DEVICE);
    if (dma_mapping_error(device->dev, unmap->addr[0]))
    goto err_get_unmap;

    if (ntbdebug & 0x4) {
    unmap->addr[1] = dma_map_resource(device->dev,
    (phys_addr_t)dest, len, DMA_TO_DEVICE, 0);
    if (dma_mapping_error(device->dev, unmap->addr[1]))
    goto err_get_unmap;
    unmap->to_cnt = 2;
    } else {
    unmap->addr[1] = dest;
    unmap->to_cnt = 1;
    }

    txd = device->device_prep_dma_memcpy(chan, unmap->addr[1],
    unmap->addr[0], len, DMA_PREP_INTERRUPT);

    if (!txd)
    goto err_get_unmap;

    txd->callback_result = ntb_tx_copy_callback;
    txd->callback_param = entry;
    dma_set_unmap(txd, unmap);

    cookie = dmaengine_submit(txd);
    if (dma_submit_error(cookie))
    goto err_set_unmap;

    dmaengine_unmap_put(unmap);

    dma_async_issue_pending(chan);

    return 0;

err_set_unmap:
    dma_descriptor_unmap(txd);
    txd->desc_free(txd);
err_get_unmap:
  

Re: IOAT DMA w/IOMMU

2018-08-10 Thread Logan Gunthorpe


On 10/08/18 10:02 AM, Kit Chow wrote:
> Turns out there is no dma_map_resource routine on x86. get_dma_ops 
> returns intel_dma_ops which has map_resource pointing to NULL.

Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
shouldn't be too hard to implement though.

> Will poke around some in the intel_map_page code but can you actually 
> get a valid struct page for a pci bar address (dma_map_single calls 
> virt_to_page)?  If not, does a map_resource routine that can properly 
> map a pci bar address need to be implemented?

Yes, you can not get a struct page for a PCI bar address unless it's
mapped with ZONE_DEVICE like in my p2p work. So that would explain why
dma_map_single() didn't work.

This all implies that ntb_transport doesn't work with DMA and the IOMMU
turned on. I'm not sure I've ever tried that configuration myself but it
is a bit surprising.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: IOAT DMA w/IOMMU

2018-08-10 Thread Logan Gunthorpe



On 10/08/18 10:23 AM, Kit Chow wrote:
> There is an internal routine (__intel_map_single) inside the intel iommu 
> code that does the actual mapping using a phys_addr_t. Think I'll try to 
> implement a intel_map_resource routine that calls that routine directly 
> without all of the conversions done for dma_map_{single,page} (pci bar 
> addr -> page -> phys_addr)...

Nice, yes, that was what I was thinking.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-10 Thread Logan Gunthorpe


On 10/08/18 10:31 AM, Dave Jiang wrote:
> 
> 
> On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
>>
>>
>> On 10/08/18 10:02 AM, Kit Chow wrote:
>>> Turns out there is no dma_map_resource routine on x86. get_dma_ops 
>>> returns intel_dma_ops which has map_resource pointing to NULL.
>>
>> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
>> shouldn't be too hard to implement though.
>>
>>> Will poke around some in the intel_map_page code but can you actually 
>>> get a valid struct page for a pci bar address (dma_map_single calls 
>>> virt_to_page)?  If not, does a map_resource routine that can properly 
>>> map a pci bar address need to be implemented?
>>
>> Yes, you can not get a struct page for a PCI bar address unless it's
>> mapped with ZONE_DEVICE like in my p2p work. So that would explain why
>> dma_map_single() didn't work.
>>
>> This all implies that ntb_transport doesn't work with DMA and the IOMMU
>> turned on. I'm not sure I've ever tried that configuration myself but it
>> is a bit surprising.
> 
> Hmmthat's surprising because it seems to work on Skylake platform
> when I tested it yesterday with Intel NTB. Kit is using a Haswell
> platform at the moment I think. Although I'm curious if it works with
> the PLX NTB he's using on Skylake.

Does that mean on Skylake the IOAT can bypass the IOMMU? Because it
looks like the ntb_transport code doesn't map the physical address of
the NTB MW into the IOMMU when doing DMA...

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 1/3] dts: arm64/sdm845: Add node for arm,mmu-500

2018-08-10 Thread Doug Anderson via iommu
Hi,

On Thu, Jul 19, 2018 at 10:53 AM, Vivek Gautam
 wrote:
> Add device node for arm,mmu-500 available on sdm845.
> This MMU-500 with single TCU and multiple TBU architecture
> is shared among all the peripherals except gpu on sdm845.
>
> Signed-off-by: Vivek Gautam 
> ---
>  arch/arm64/boot/dts/qcom/sdm845-mtp.dts |  4 ++
>  arch/arm64/boot/dts/qcom/sdm845.dtsi| 73 
> +
>  2 files changed, 77 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/qcom/sdm845-mtp.dts 
> b/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
> index 6d651f314193..13b50dff440f 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
> +++ b/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
> @@ -58,3 +58,7 @@
> bias-pull-up;
> };
>  };
> +
> +_smmu {
> +   status = "okay";
> +};

When you spin this patch please put the above in the correct place.
Since "a" sorts alphabetically before "i" then this should be just
before the line:

 {

Thanks!

-Doug
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-10 Thread Dave Jiang


On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
> 
> 
> On 10/08/18 10:02 AM, Kit Chow wrote:
>> Turns out there is no dma_map_resource routine on x86. get_dma_ops 
>> returns intel_dma_ops which has map_resource pointing to NULL.
> 
> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
> shouldn't be too hard to implement though.
> 
>> Will poke around some in the intel_map_page code but can you actually 
>> get a valid struct page for a pci bar address (dma_map_single calls 
>> virt_to_page)?  If not, does a map_resource routine that can properly 
>> map a pci bar address need to be implemented?
> 
> Yes, you can not get a struct page for a PCI bar address unless it's
> mapped with ZONE_DEVICE like in my p2p work. So that would explain why
> dma_map_single() didn't work.
> 
> This all implies that ntb_transport doesn't work with DMA and the IOMMU
> turned on. I'm not sure I've ever tried that configuration myself but it
> is a bit surprising.

Hmmthat's surprising because it seems to work on Skylake platform
when I tested it yesterday with Intel NTB. Kit is using a Haswell
platform at the moment I think. Although I'm curious if it works with
the PLX NTB he's using on Skylake.

> 
> Logan
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: IOAT DMA w/IOMMU

2018-08-10 Thread Dave Jiang


On 08/10/2018 09:33 AM, Logan Gunthorpe wrote:
> 
> 
> On 10/08/18 10:31 AM, Dave Jiang wrote:
>>
>>
>> On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
>>>
>>>
>>> On 10/08/18 10:02 AM, Kit Chow wrote:
 Turns out there is no dma_map_resource routine on x86. get_dma_ops 
 returns intel_dma_ops which has map_resource pointing to NULL.
>>>
>>> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
>>> shouldn't be too hard to implement though.
>>>
 Will poke around some in the intel_map_page code but can you actually 
 get a valid struct page for a pci bar address (dma_map_single calls 
 virt_to_page)?  If not, does a map_resource routine that can properly 
 map a pci bar address need to be implemented?
>>>
>>> Yes, you can not get a struct page for a PCI bar address unless it's
>>> mapped with ZONE_DEVICE like in my p2p work. So that would explain why
>>> dma_map_single() didn't work.
>>>
>>> This all implies that ntb_transport doesn't work with DMA and the IOMMU
>>> turned on. I'm not sure I've ever tried that configuration myself but it
>>> is a bit surprising.
>>
>> Hmmthat's surprising because it seems to work on Skylake platform
>> when I tested it yesterday with Intel NTB. Kit is using a Haswell
>> platform at the moment I think. Although I'm curious if it works with
>> the PLX NTB he's using on Skylake.
> 
> Does that mean on Skylake the IOAT can bypass the IOMMU? Because it
> looks like the ntb_transport code doesn't map the physical address of
> the NTB MW into the IOMMU when doing DMA...

Or if the BIOS has provided mapping for the Intel NTB device
specifically? Is that a possibility? NTB does go through the IOMMU.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: IOAT DMA w/IOMMU

2018-08-10 Thread Dave Jiang



On 08/10/2018 10:15 AM, Logan Gunthorpe wrote:
> 
> 
> On 10/08/18 11:01 AM, Dave Jiang wrote:
>> Or if the BIOS has provided mapping for the Intel NTB device
>> specifically? Is that a possibility? NTB does go through the IOMMU.
> 
> I don't know but if the BIOS is doing it, but that would only at best
> work for Intel NTB I see no hope in getting the BIOS to map the MW
> for a Switchtec NTB device...

Right. I'm just speculating why it may possibly work. But yes, I think
kernel will need appropriate mapping if it's not happening right now.
Hopefully Kit is onto something.

> 
> Logan
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-08-10 Thread Alan Cox
> The hardware isn't public yet, so can't talk about it :-(. Once this patch 
> gets 
> merged, will let the OSV engagement folks drive it for inclusions. We 
> could mark this for stable, but i would rather wait until we know the 
> timeline when they are expecting it to be in. It shouldn't break anything
> since we are just enforcing the spec.

Until a new better spec appears...

I know there is always fun when it comes to the people involved in
such a screwup having to admit it in public but this should IMHO be
tied to a PCI identifier table so that we know what the afflicted
hardware is. Otherwise some day in the future SRIOV will grow new features
and we'll have no idea what particular devices we need to narrow the
workaround too and indeed not necessarily even know this device is the
only one, as we'll silently fix other stuff then have it break on us
later.

Alan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-10 Thread Kit Chow
Turns out there is no dma_map_resource routine on x86. get_dma_ops 
returns intel_dma_ops which has map_resource pointing to NULL.


(gdb) p intel_dma_ops
$7 = {alloc = 0x8150f310 ,
  free = 0x8150ec20 ,
  mmap = 0x0 , get_sgtable = 0x0 ,
  map_page = 0x8150f2d0 ,
  unmap_page = 0x8150ec10 ,
  map_sg = 0x8150ef40 ,
  unmap_sg = 0x8150eb80 ,
  map_resource = 0x0 ,
  unmap_resource = 0x0 ,
  sync_single_for_cpu = 0x0 ,
  sync_single_for_device = 0x0 ,
  sync_sg_for_cpu = 0x0 ,
  sync_sg_for_device = 0x0 ,
  cache_sync = 0x0 ,
  mapping_error = 0x815095f0 ,
  dma_supported = 0x81033830 , is_phys = 0}

Will poke around some in the intel_map_page code but can you actually 
get a valid struct page for a pci bar address (dma_map_single calls 
virt_to_page)?  If not, does a map_resource routine that can properly 
map a pci bar address need to be implemented?


Kit

---


static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
  size_t size,
  enum dma_data_direction dir,
  unsigned long attrs)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);
    dma_addr_t addr;

    BUG_ON(!valid_dma_direction(dir));
    addr = ops->map_page(dev, virt_to_page(ptr),
 offset_in_page(ptr), size,
 dir, attrs);
    debug_dma_map_page(dev, virt_to_page(ptr),
   offset_in_page(ptr), size,
   dir, addr, true);
    return addr;
}






On 08/09/2018 04:00 PM, Kit Chow wrote:



On 08/09/2018 03:50 PM, Logan Gunthorpe wrote:


On 09/08/18 04:48 PM, Kit Chow wrote:

Based on Logan's comments, I am very hopeful that the dma_map_resource
will make things work on the older platforms...

Well, I *think* dma_map_single() would still work. So I'm not that
confident that's the root of your problem. I'd still like to see the
actual code snippet you are using.

Logan
Here's the code snippet - (ntbdebug & 4) path does dma_map_resource of 
the pci bar address.


It was:
    unmap->addr[1] = dma_map_single(device->dev, (void 
*)dest, len,

    DMA_TO_DEVICE);

Kit
---


static int ntb_async_tx_submit(struct ntb_transport_qp *qp,
   struct ntb_queue_entry *entry)
{
    struct dma_async_tx_descriptor *txd;
    struct dma_chan *chan = qp->tx_dma_chan;
    struct dma_device *device;
    size_t len = entry->len;
    void *buf = entry->buf;
    size_t dest_off, buff_off;
    struct dmaengine_unmap_data *unmap;
    dma_addr_t dest;
    dma_cookie_t cookie;
    int unmapcnt;

    device = chan->device;

    dest = qp->tx_mw_phys + qp->tx_max_frame * entry->tx_index;

    buff_off = (size_t)buf & ~PAGE_MASK;
    dest_off = (size_t)dest & ~PAGE_MASK;

    if (!is_dma_copy_aligned(device, buff_off, dest_off, len))
    goto err;


    if (ntbdebug & 0x4) {
    unmapcnt = 2;
    } else {
    unmapcnt = 1;
    }

    unmap = dmaengine_get_unmap_data(device->dev, unmapcnt, 
GFP_NOWAIT);

    if (!unmap)
    goto err;

    unmap->len = len;
    unmap->addr[0] = dma_map_page(device->dev, virt_to_page(buf),
  buff_off, len, DMA_TO_DEVICE);
    if (dma_mapping_error(device->dev, unmap->addr[0]))
    goto err_get_unmap;

    if (ntbdebug & 0x4) {
    unmap->addr[1] = dma_map_resource(device->dev,
    (phys_addr_t)dest, len, DMA_TO_DEVICE, 0);
    if (dma_mapping_error(device->dev, unmap->addr[1]))
    goto err_get_unmap;
    unmap->to_cnt = 2;
    } else {
    unmap->addr[1] = dest;
    unmap->to_cnt = 1;
    }

    txd = device->device_prep_dma_memcpy(chan, unmap->addr[1],
    unmap->addr[0], len, DMA_PREP_INTERRUPT);

    if (!txd)
    goto err_get_unmap;

    txd->callback_result = ntb_tx_copy_callback;
    txd->callback_param = entry;
    dma_set_unmap(txd, unmap);

    cookie = dmaengine_submit(txd);
    if (dma_submit_error(cookie))
    goto err_set_unmap;

    dmaengine_unmap_put(unmap);

    dma_async_issue_pending(chan);

    return 0;

err_set_unmap:
    dma_descriptor_unmap(txd);
    txd->desc_free(txd);
err_get_unmap:
    dmaengine_unmap_put(unmap);
err:
    return -ENXIO;
}



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-08-10 Thread Raj, Ashok
On Fri, Aug 10, 2018 at 05:48:36PM +0100, Alan Cox wrote:
> > The hardware isn't public yet, so can't talk about it :-(. Once this patch 
> > gets 
> > merged, will let the OSV engagement folks drive it for inclusions. We 
> > could mark this for stable, but i would rather wait until we know the 
> > timeline when they are expecting it to be in. It shouldn't break anything
> > since we are just enforcing the spec.
> 
> Until a new better spec appears...

Well, here we are talking about PIN interrupts and VF's. IMO it would be
weird to allow Virtual functions and physical interrupts other than MSIx.

There are other things in the PCIe spec which could be enforced in SW
or expect devices to implement them.  

What might be ok is probably doing the right thing in SW as done in this
patch, additionaly maybe we can flag them and say "Your device is busted"
message. 

> 
> I know there is always fun when it comes to the people involved in
> such a screwup having to admit it in public but this should IMHO be
> tied to a PCI identifier table so that we know what the afflicted
> hardware is. Otherwise some day in the future SRIOV will grow new features
> and we'll have no idea what particular devices we need to narrow the
> workaround too and indeed not necessarily even know this device is the
> only one, as we'll silently fix other stuff then have it break on us
> later.
> 
> Alan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-10 Thread Jerome Glisse
On Fri, Aug 10, 2018 at 11:39:13AM +0800, Kenneth Lee wrote:
> On Thu, Aug 09, 2018 at 10:46:13AM -0400, Jerome Glisse wrote:
> > Date: Thu, 9 Aug 2018 10:46:13 -0400
> > From: Jerome Glisse 
> > To: Kenneth Lee 
> > CC: Kenneth Lee , "Tian, Kevin"
> >  , Alex Williamson ,
> >  Herbert Xu , "k...@vger.kernel.org"
> >  , Jonathan Corbet , Greg
> >  Kroah-Hartman , Zaibo Xu ,
> >  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
> >  , Hao Fang ,
> >  "linux-ker...@vger.kernel.org" ,
> >  "linux...@huawei.com" ,
> >  "iommu@lists.linux-foundation.org" ,
> >  "linux-cry...@vger.kernel.org" , Philippe
> >  Ombredanne , Thomas Gleixner ,
> >  "David S . Miller" ,
> >  "linux-accelerat...@lists.ozlabs.org"
> >  
> > Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> > User-Agent: Mutt/1.10.0 (2018-05-17)
> > Message-ID: <20180809144613.gb3...@redhat.com>
> > 
> > On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> > > On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> > > > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > > > > 在 2018年08月06日 星期一 11:32 下午, Jerome Glisse 写道:
> > > > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > 
> > [...]
> > 
> > > > > > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > > > > > drivers for each hardware and thus there are virtually no
> > > > > > > > > > differences between having this userspace driver open a 
> > > > > > > > > > device
> > > > > > > > > > file in vfio or somewhere else in the device filesystem. 
> > > > > > > > > > This is
> > > > > > > > > > just a different path.
> > > > > > > > > > 
> > > > > > > > > The basic problem WarpDrive want to solve it to avoid 
> > > > > > > > > syscall. This is important
> > > > > > > > > to accelerators. We have some data here:
> > > > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > > > > 
> > > > > > > > > (see page 3)
> > > > > > > > > 
> > > > > > > > > The performance is different on using kernel and user drivers.
> > > > > > > > Yes and example i point to is exactly that. You have a one time 
> > > > > > > > setup
> > > > > > > > cost (creating command buffer binding PASID with command buffer 
> > > > > > > > and
> > > > > > > > couple other setup steps). Then userspace no longer have to do 
> > > > > > > > any
> > > > > > > > ioctl to schedule work on the GPU. It is all down from 
> > > > > > > > userspace and
> > > > > > > > it use a doorbell to notify hardware when it should go look at 
> > > > > > > > command
> > > > > > > > buffer for new thing to execute.
> > > > > > > > 
> > > > > > > > My point stands on that. You have existing driver already doing 
> > > > > > > > so
> > > > > > > > with no new framework and in your scheme you need a userspace 
> > > > > > > > driver.
> > > > > > > > So i do not see the value add, using one path or the other in 
> > > > > > > > the
> > > > > > > > userspace driver is litteraly one line to change.
> > > > > > > > 
> > > > > > > Sorry, I'd got confuse here. I partially agree that the user 
> > > > > > > driver is
> > > > > > > redundance of kernel driver. (But for WarpDrive, the kernel 
> > > > > > > driver is a full
> > > > > > > driver include all preparation and setup stuff for the hardware, 
> > > > > > > the user driver
> > > > > > > is simply to send request and receive answer). Yes, it is just a 
> > > > > > > choice of path.
> > > > > > > But the user path is faster if the request come from use space. 
> > > > > > > And to do that,
> > > > > > > we need user land DMA support. Then why is it invaluable to let 
> > > > > > > VFIO involved?
> > > > > > Some drivers in the kernel already do exactly what you said. The 
> > > > > > user
> > > > > > space emit commands without ever going into kernel by directly 
> > > > > > scheduling
> > > > > > commands and ringing a doorbell. They do not need VFIO either and 
> > > > > > they
> > > > > > can map userspace address into the DMA address space of the device 
> > > > > > and
> > > > > > again they do not need VFIO for that.
> > > > > Could you please directly point out which driver you refer to here? 
> > > > > Thank
> > > > > you.
> > > > 
> > > > drivers/gpu/drm/amd/
> > > > 
> > > > Sub-directory of interest is amdkfd
> > > > 
> > > > Because it is a big driver here is a highlevel overview of how it works
> > > > (this is a simplification):
> > > >   - Process can allocate GPUs buffer (through ioclt) and map them into
> > > > its address space (through mmap of device file at buffer object
> > > > specific offset).
> > 

Re: IOAT DMA w/IOMMU

2018-08-10 Thread Logan Gunthorpe



On 10/08/18 11:01 AM, Dave Jiang wrote:
> Or if the BIOS has provided mapping for the Intel NTB device
> specifically? Is that a possibility? NTB does go through the IOMMU.

I don't know but if the BIOS is doing it, but that would only at best
work for Intel NTB I see no hope in getting the BIOS to map the MW
for a Switchtec NTB device...

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-10 Thread Logan Gunthorpe



On 10/08/18 06:53 PM, Kit Chow wrote:
> I was able to finally succeed in doing the dma transfers over ioat only
> when prot has DMA_PTE_WRITE set by setting the direction to either
> DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. Any ideas if the prot settings
> need to be changed? Are there any bad side effects if I used
> DMA_BIDIRECTIONAL?

Good to hear it. Without digging into the direction much all I can say
is that it can sometimes be very confusing what the direction is. Adding
another PCI device just adds to the confusion.

I believe, the direction should be from the IOAT's point of view. So if
the IOAT is writing to the BAR you'd set DMA_FROM_DEVICE (ie. data is
coming from the IOAT) and if it's reading you'd set DMA_TO_DEVICE (ie.
data is going to the IOAT).

Using DMA_BIDIRECTIONAL just forgoes any hardware security / protection
that the buffer would have in terms of direction. Generally it's good
practice to use the strictest direction you can.

> Given that using the pci bar address as is without getting an iommu
> address results in the same "PTE Write access" error, I wonder if there
> is some internal 'prot' associated with the non-translated pci bar
> address that just needs to be tweaked to include DMA_PTE_WRITE???

No, I don't think so. The 'prot' will be a property of the IOMMU. Not
having an entry is probably just the same (from the perspective of the
error you see) as only having an entry for reading.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-10 Thread Kit Chow

Success!

I've implemented a new intel_map_resource (and intel_unmap_resource) 
routine which is called by dma_map_resource. As mentioned previously, 
the primary job of dma_map_resource/intel_map_resource is to call the 
intel iommu internal mapping routine (__intel_map_single) without 
translating the pci bar address into a page and then back to a phys addr 
as dma_map_page and dma_map_single are doing.


With the new dma_map_resource routine mapping the pci bar address in 
ntb_transport_tx_submit, I was still getting the "PTE Write access is 
not set" error.


The __intel_map_single routine sets prot based on the DMA direction.

    if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || \
    !cap_zlr(iommu->cap))
    prot |= DMA_PTE_READ;
    if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
    prot |= DMA_PTE_WRITE;

I was able to finally succeed in doing the dma transfers over ioat only 
when prot has DMA_PTE_WRITE set by setting the direction to either 
DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. Any ideas if the prot settings 
need to be changed? Are there any bad side effects if I used 
DMA_BIDIRECTIONAL?


Given that using the pci bar address as is without getting an iommu 
address results in the same "PTE Write access" error, I wonder if there 
is some internal 'prot' associated with the non-translated pci bar 
address that just needs to be tweaked to include DMA_PTE_WRITE???


Thanks!




On 08/10/2018 10:46 AM, Dave Jiang wrote:


On 08/10/2018 10:15 AM, Logan Gunthorpe wrote:


On 10/08/18 11:01 AM, Dave Jiang wrote:

Or if the BIOS has provided mapping for the Intel NTB device
specifically? Is that a possibility? NTB does go through the IOMMU.

I don't know but if the BIOS is doing it, but that would only at best
work for Intel NTB I see no hope in getting the BIOS to map the MW
for a Switchtec NTB device...

Right. I'm just speculating why it may possibly work. But yes, I think
kernel will need appropriate mapping if it's not happening right now.
Hopefully Kit is onto something.


Logan



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-08-10 Thread Ganapatrao Kulkarni
Hi Robin,

On Fri, Aug 10, 2018 at 2:13 AM, Robin Murphy  wrote:
> On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:
>>
>> Hi Robin,
>>
>> On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy  wrote:
>>>
>>> On 07/08/18 09:54, Ganapatrao Kulkarni wrote:


 As an optimisation for PCI devices, there is always first attempt
 been made to allocate iova from SAC address range. This will lead
 to unnecessary attempts/function calls, when there are no free ranges
 available.

 This patch optimises by adding flag to track previous failed attempts
 and avoids further attempts until replenish happens.
>>>
>>>
>>>
>>> Agh, what I overlooked is that this still suffers from the original
>>> problem,
>>> wherein a large allocation which fails due to fragmentation then blocks
>>> all
>>> subsequent smaller allocations, even if they may have succeeded.
>>>
>>> For a minimal change, though, what I think we could do is instead of just
>>> having a flag, track the size of the last 32-bit allocation which failed.
>>> If
>>> we're happy to assume that nobody's likely to mix aligned and unaligned
>>> allocations within the same domain, then that should be sufficiently
>>> robust
>>> whilst being no more complicated than this version, i.e. (modulo thinking
>>> up
>>> a better name for it):
>>
>>
>> I agree, it would be better to track size and attempt to allocate for
>> smaller chunks, if not for bigger one.
>>
>>>

 Signed-off-by: Ganapatrao Kulkarni 
 ---
 This patch is based on comments from Robin Murphy 
 for patch [1]

 [1] https://lkml.org/lkml/2018/4/19/780

drivers/iommu/iova.c | 11 ++-
include/linux/iova.h |  1 +
2 files changed, 11 insertions(+), 1 deletion(-)

 diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
 index 83fe262..d97bb5a 100644
 --- a/drivers/iommu/iova.c
 +++ b/drivers/iommu/iova.c
 @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
 long granule,
  iovad->granule = granule;
  iovad->start_pfn = start_pfn;
  iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
 +   iovad->free_32bit_pfns = true;
>>>
>>>
>>>
>>>  iovad->max_32bit_free = iovad->dma_32bit_pfn;
>>>
  iovad->flush_cb = NULL;
  iovad->fq = NULL;
  iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
 @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain
 *iovad, struct iova *free)
  cached_iova = rb_entry(iovad->cached32_node, struct iova,
 node);
  if (free->pfn_hi < iovad->dma_32bit_pfn &&
 -   free->pfn_lo >= cached_iova->pfn_lo)
 +   free->pfn_lo >= cached_iova->pfn_lo) {
  iovad->cached32_node = rb_next(>node);
 +   iovad->free_32bit_pfns = true;
>>>
>>>
>>>
>>>  iovad->max_32bit_free = iovad->dma_32bit_pfn;
>>
>>
>> i think, you intended to say,
>>iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);
>
>
> Nope, that's why I said it needed a better name ;)
>
> (I nearly called it last_failed_32bit_alloc_size, but that's a bit much)

may be we can name it "max32_alloc_size"?
>
> The point of this value (whetever it's called) is that at any given time it
> holds an upper bound on the size of the largest contiguous free area. It
> doesn't have to be the *smallest* upper bound, which is why we can keep
> things simple and avoid arithmetic - in realistic use-cases like yours when
> the allocations are a pretty constant size, this should work out directly
> equivalent to the boolean, only with values of "size" and "dma_32bit_pfn"
> instead of 0 and 1, so we don't do any more work than necessary. In the edge
> cases where allocations are all different sizes, it does mean that we will
> probably end up performing more failing allocations than if we actually
> tried to track all of the free space exactly, but I think that's reasonable
> - just because I want to make sure we handle such cases fairly gracefully,
> doesn't mean that we need to do extra work on the typical fast path to try
> and actually optimise for them (which is why I didn't really like the
> accounting implementation I came up with).
>

ok got it, thanks for the explanation.
>>>
 +   }
  cached_iova = rb_entry(iovad->cached_node, struct iova, node);
  if (free->pfn_lo >= cached_iova->pfn_lo)
 @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long
 size,
  struct iova *new_iova;
  int ret;
+ if (limit_pfn <= iovad->dma_32bit_pfn &&
 +   !iovad->free_32bit_pfns)
>>>
>>>
>>>
>>>  size >= iovad->max_32bit_free)
>>>
 +   return NULL;
 +
  new_iova = alloc_iova_mem();
  if (!new_iova)
  return NULL;
 @@ 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-10 Thread Jean-Philippe Brucker
Hi Kenneth,

On 10/08/18 04:39, Kenneth Lee wrote:
>> You can achieve everything you want to achieve with existing upstream
>> solution. Re-inventing a whole new driver infrastructure should really
>> be motivated with strong and obvious reasons.
> 
> I want to understand better of your idea. If I create some unified helper
> APIs in drivers/iommu/, say:
> 
>   wd_create_dev(parent_dev, wd_dev)
>   wd_release_dev(wd_dev)
> 
> The API create chrdev to take request from user space for open(resource
> allocation), iomap, epoll (irq), and dma_map(with pasid automatically).
> 
> Do you think it is acceptable?

Maybe not drivers/iommu/ :) That subsystem only contains tools for
dealing with DMA, I don't think epoll, resource enumeration or iomap fit
in there.

Creating new helpers seems to be precisely what we're trying to avoid in
this thread, and vfio-mdev does provide the components that you
describe, so I wouldn't discard it right away. When the GPU, net, block
or another subsystem doesn't fit your needs, either because your
accelerator provides some specialized function, or because for
performance reasons your client wants direct MMIO access, you can at
least build your driver and library on top of those existing VFIO
components:

* open allocates a partition of an accelerator.
* vfio_device_info, vfio_region_info and vfio_irq_info enumerates
available resources.
* vfio_irq_set deals with epoll.
* mmap gives you a private MMIO doorbell.
* vfio_iommu_type1 provides the DMA operations.

Currently missing:

* Sharing the parent IOMMU between mdev, which is also what the "IOMMU
aware mediated device" series tackles, and seems like a logical addition
to VFIO. I'd argue that the existing IOMMU ops (or ones implemented by
the SVA series) can be used to deal with this

* The interface to discover an accelerator near your memory node, or one
that you can chain with other devices. If I understood correctly the
conclusion was that the API (a topology description in sysfs?) should be
common to various subsystems, in which case vfio-mdev (or the mediating
driver) could also use it.

* The queue abstraction discussed on patch 3/7. Perhaps the current vfio
resource description of MMIO and IRQ is sufficient here as well, since
vendors tend to each implement their own queue schemes. If you need
additional features, read/write fops give the mediating driver a lot of
freedom. To support features that are too specific for drivers/vfio/ you
can implement a config space with capabilities and registers of your
choice. If you're versioning the capabilities, the code to handle them
could even be shared between different accelerator drivers and libraries.

Thanks,
Jean
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-08-10 Thread Robin Murphy

On 10/08/18 10:24, Ganapatrao Kulkarni wrote:

Hi Robin,

On Fri, Aug 10, 2018 at 2:13 AM, Robin Murphy  wrote:

On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:


Hi Robin,

On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy  wrote:


On 07/08/18 09:54, Ganapatrao Kulkarni wrote:



As an optimisation for PCI devices, there is always first attempt
been made to allocate iova from SAC address range. This will lead
to unnecessary attempts/function calls, when there are no free ranges
available.

This patch optimises by adding flag to track previous failed attempts
and avoids further attempts until replenish happens.




Agh, what I overlooked is that this still suffers from the original
problem,
wherein a large allocation which fails due to fragmentation then blocks
all
subsequent smaller allocations, even if they may have succeeded.

For a minimal change, though, what I think we could do is instead of just
having a flag, track the size of the last 32-bit allocation which failed.
If
we're happy to assume that nobody's likely to mix aligned and unaligned
allocations within the same domain, then that should be sufficiently
robust
whilst being no more complicated than this version, i.e. (modulo thinking
up
a better name for it):



I agree, it would be better to track size and attempt to allocate for
smaller chunks, if not for bigger one.





Signed-off-by: Ganapatrao Kulkarni 
---
This patch is based on comments from Robin Murphy 
for patch [1]

[1] https://lkml.org/lkml/2018/4/19/780

drivers/iommu/iova.c | 11 ++-
include/linux/iova.h |  1 +
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe262..d97bb5a 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
long granule,
  iovad->granule = granule;
  iovad->start_pfn = start_pfn;
  iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
+   iovad->free_32bit_pfns = true;




  iovad->max_32bit_free = iovad->dma_32bit_pfn;


  iovad->flush_cb = NULL;
  iovad->fq = NULL;
  iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
@@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain
*iovad, struct iova *free)
  cached_iova = rb_entry(iovad->cached32_node, struct iova,
node);
  if (free->pfn_hi < iovad->dma_32bit_pfn &&
-   free->pfn_lo >= cached_iova->pfn_lo)
+   free->pfn_lo >= cached_iova->pfn_lo) {
  iovad->cached32_node = rb_next(>node);
+   iovad->free_32bit_pfns = true;




  iovad->max_32bit_free = iovad->dma_32bit_pfn;



i think, you intended to say,
iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);



Nope, that's why I said it needed a better name ;)

(I nearly called it last_failed_32bit_alloc_size, but that's a bit much)


may be we can name it "max32_alloc_size"?


The point of this value (whetever it's called) is that at any given time it
holds an upper bound on the size of the largest contiguous free area. It
doesn't have to be the *smallest* upper bound, which is why we can keep
things simple and avoid arithmetic - in realistic use-cases like yours when
the allocations are a pretty constant size, this should work out directly
equivalent to the boolean, only with values of "size" and "dma_32bit_pfn"
instead of 0 and 1, so we don't do any more work than necessary. In the edge
cases where allocations are all different sizes, it does mean that we will
probably end up performing more failing allocations than if we actually
tried to track all of the free space exactly, but I think that's reasonable
- just because I want to make sure we handle such cases fairly gracefully,
doesn't mean that we need to do extra work on the typical fast path to try
and actually optimise for them (which is why I didn't really like the
accounting implementation I came up with).



ok got it, thanks for the explanation.



+   }
  cached_iova = rb_entry(iovad->cached_node, struct iova, node);
  if (free->pfn_lo >= cached_iova->pfn_lo)
@@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long
size,
  struct iova *new_iova;
  int ret;
+ if (limit_pfn <= iovad->dma_32bit_pfn &&
+   !iovad->free_32bit_pfns)




  size >= iovad->max_32bit_free)


+   return NULL;
+
  new_iova = alloc_iova_mem();
  if (!new_iova)
  return NULL;
@@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long
size,
  if (ret) {
  free_iova_mem(new_iova);
+   if (limit_pfn <= iovad->dma_32bit_pfn)
+   iovad->free_32bit_pfns = false;




  iovad->max_32bit_free = size;



same here, we should decrease available free range after successful
allocation.

Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-08-10 Thread Ganapatrao Kulkarni
On Fri, Aug 10, 2018 at 3:19 PM, Robin Murphy  wrote:
> On 10/08/18 10:24, Ganapatrao Kulkarni wrote:
>>
>> Hi Robin,
>>
>> On Fri, Aug 10, 2018 at 2:13 AM, Robin Murphy 
>> wrote:
>>>
>>> On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:


 Hi Robin,

 On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy 
 wrote:
>
>
> On 07/08/18 09:54, Ganapatrao Kulkarni wrote:
>>
>>
>>
>> As an optimisation for PCI devices, there is always first attempt
>> been made to allocate iova from SAC address range. This will lead
>> to unnecessary attempts/function calls, when there are no free ranges
>> available.
>>
>> This patch optimises by adding flag to track previous failed attempts
>> and avoids further attempts until replenish happens.
>
>
>
>
> Agh, what I overlooked is that this still suffers from the original
> problem,
> wherein a large allocation which fails due to fragmentation then blocks
> all
> subsequent smaller allocations, even if they may have succeeded.
>
> For a minimal change, though, what I think we could do is instead of
> just
> having a flag, track the size of the last 32-bit allocation which
> failed.
> If
> we're happy to assume that nobody's likely to mix aligned and unaligned
> allocations within the same domain, then that should be sufficiently
> robust
> whilst being no more complicated than this version, i.e. (modulo
> thinking
> up
> a better name for it):



 I agree, it would be better to track size and attempt to allocate for
 smaller chunks, if not for bigger one.

>
>>
>> Signed-off-by: Ganapatrao Kulkarni 
>> ---
>> This patch is based on comments from Robin Murphy
>> 
>> for patch [1]
>>
>> [1] https://lkml.org/lkml/2018/4/19/780
>>
>> drivers/iommu/iova.c | 11 ++-
>> include/linux/iova.h |  1 +
>> 2 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 83fe262..d97bb5a 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>> long granule,
>>   iovad->granule = granule;
>>   iovad->start_pfn = start_pfn;
>>   iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
>> +   iovad->free_32bit_pfns = true;
>
>
>
>
>   iovad->max_32bit_free = iovad->dma_32bit_pfn;
>
>>   iovad->flush_cb = NULL;
>>   iovad->fq = NULL;
>>   iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain
>> *iovad, struct iova *free)
>>   cached_iova = rb_entry(iovad->cached32_node, struct iova,
>> node);
>>   if (free->pfn_hi < iovad->dma_32bit_pfn &&
>> -   free->pfn_lo >= cached_iova->pfn_lo)
>> +   free->pfn_lo >= cached_iova->pfn_lo) {
>>   iovad->cached32_node = rb_next(>node);
>> +   iovad->free_32bit_pfns = true;
>
>
>
>
>   iovad->max_32bit_free = iovad->dma_32bit_pfn;



 i think, you intended to say,
 iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);
>>>
>>>
>>>
>>> Nope, that's why I said it needed a better name ;)
>>>
>>> (I nearly called it last_failed_32bit_alloc_size, but that's a bit much)
>>
>>
>> may be we can name it "max32_alloc_size"?
>>>
>>>
>>> The point of this value (whetever it's called) is that at any given time
>>> it
>>> holds an upper bound on the size of the largest contiguous free area. It
>>> doesn't have to be the *smallest* upper bound, which is why we can keep
>>> things simple and avoid arithmetic - in realistic use-cases like yours
>>> when
>>> the allocations are a pretty constant size, this should work out directly
>>> equivalent to the boolean, only with values of "size" and "dma_32bit_pfn"
>>> instead of 0 and 1, so we don't do any more work than necessary. In the
>>> edge
>>> cases where allocations are all different sizes, it does mean that we
>>> will
>>> probably end up performing more failing allocations than if we actually
>>> tried to track all of the free space exactly, but I think that's
>>> reasonable
>>> - just because I want to make sure we handle such cases fairly
>>> gracefully,
>>> doesn't mean that we need to do extra work on the typical fast path to
>>> try
>>> and actually optimise for them (which is why I didn't really like the
>>> accounting implementation I came up with).
>>>
>>
>> ok got it, thanks for the explanation.
>
>
>> +   }
>>   cached_iova = rb_entry(iovad->cached_node, struct iova,
>> node);
>>   if (free->pfn_lo >=