Re: IOAT DMA w/IOMMU
There is an internal routine (__intel_map_single) inside the intel iommu
code that does the actual mapping using a phys_addr_t. Think I'll try to
implement an intel_map_resource routine that calls that routine directly
without all of the conversions done for dma_map_{single,page} (pci bar
addr -> page -> phys_addr)...

On 08/10/2018 09:02 AM, Kit Chow wrote:

Turns out there is no dma_map_resource routine on x86. get_dma_ops
returns intel_dma_ops which has map_resource pointing to NULL.

(gdb) p intel_dma_ops
$7 = {alloc = 0x8150f310, free = 0x8150ec20, mmap = 0x0,
  get_sgtable = 0x0, map_page = 0x8150f2d0, unmap_page = 0x8150ec10,
  map_sg = 0x8150ef40, unmap_sg = 0x8150eb80, map_resource = 0x0,
  unmap_resource = 0x0, sync_single_for_cpu = 0x0,
  sync_single_for_device = 0x0, sync_sg_for_cpu = 0x0,
  sync_sg_for_device = 0x0, cache_sync = 0x0,
  mapping_error = 0x815095f0, dma_supported = 0x81033830, is_phys = 0}

Will poke around some in the intel_map_page code but can you actually
get a valid struct page for a pci bar address (dma_map_single calls
virt_to_page)? If not, does a map_resource routine that can properly
map a pci bar address need to be implemented?

Kit

---

static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
					      size_t size,
					      enum dma_data_direction dir,
					      unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);
	dma_addr_t addr;

	BUG_ON(!valid_dma_direction(dir));
	addr = ops->map_page(dev, virt_to_page(ptr),
			     offset_in_page(ptr), size,
			     dir, attrs);
	debug_dma_map_page(dev, virt_to_page(ptr), offset_in_page(ptr),
			   size, dir, addr, true);
	return addr;
}

On 08/09/2018 04:00 PM, Kit Chow wrote:

On 08/09/2018 03:50 PM, Logan Gunthorpe wrote:

On 09/08/18 04:48 PM, Kit Chow wrote:

Based on Logan's comments, I am very hopeful that the dma_map_resource
will make things work on the older platforms...

Well, I *think* dma_map_single() would still work. So I'm not that
confident that's the root of your problem. I'd still like to see the
actual code snippet you are using.

Logan

Here's the code snippet - (ntbdebug & 4) path does dma_map_resource of
the pci bar address.

It was:

	unmap->addr[1] = dma_map_single(device->dev, (void *)dest,
					len, DMA_TO_DEVICE);

Kit

---

static int ntb_async_tx_submit(struct ntb_transport_qp *qp,
			       struct ntb_queue_entry *entry)
{
	struct dma_async_tx_descriptor *txd;
	struct dma_chan *chan = qp->tx_dma_chan;
	struct dma_device *device;
	size_t len = entry->len;
	void *buf = entry->buf;
	size_t dest_off, buff_off;
	struct dmaengine_unmap_data *unmap;
	dma_addr_t dest;
	dma_cookie_t cookie;
	int unmapcnt;

	device = chan->device;
	dest = qp->tx_mw_phys + qp->tx_max_frame * entry->tx_index;
	buff_off = (size_t)buf & ~PAGE_MASK;
	dest_off = (size_t)dest & ~PAGE_MASK;

	if (!is_dma_copy_aligned(device, buff_off, dest_off, len))
		goto err;

	if (ntbdebug & 0x4) {
		unmapcnt = 2;
	} else {
		unmapcnt = 1;
	}

	unmap = dmaengine_get_unmap_data(device->dev, unmapcnt, GFP_NOWAIT);
	if (!unmap)
		goto err;

	unmap->len = len;
	unmap->addr[0] = dma_map_page(device->dev, virt_to_page(buf),
				      buff_off, len, DMA_TO_DEVICE);
	if (dma_mapping_error(device->dev, unmap->addr[0]))
		goto err_get_unmap;

	if (ntbdebug & 0x4) {
		unmap->addr[1] = dma_map_resource(device->dev,
						  (phys_addr_t)dest, len,
						  DMA_TO_DEVICE, 0);
		if (dma_mapping_error(device->dev, unmap->addr[1]))
			goto err_get_unmap;
		unmap->to_cnt = 2;
	} else {
		unmap->addr[1] = dest;
		unmap->to_cnt = 1;
	}

	txd = device->device_prep_dma_memcpy(chan, unmap->addr[1],
					     unmap->addr[0], len,
					     DMA_PREP_INTERRUPT);
	if (!txd)
		goto err_get_unmap;

	txd->callback_result = ntb_tx_copy_callback;
	txd->callback_param = entry;
	dma_set_unmap(txd, unmap);

	cookie = dmaengine_submit(txd);
	if (dma_submit_error(cookie))
		goto err_set_unmap;

	dmaengine_unmap_put(unmap);
	dma_async_issue_pending(chan);

	return 0;

err_set_unmap:
	dma_descriptor_unmap(txd);
	txd->desc_free(txd);
err_get_unmap:
	dmaengine_unmap_put(unmap);
err:
	return -ENXIO;
}
Re: IOAT DMA w/IOMMU
On 10/08/18 10:02 AM, Kit Chow wrote:
> Turns out there is no dma_map_resource routine on x86. get_dma_ops
> returns intel_dma_ops which has map_resource pointing to NULL.

Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
shouldn't be too hard to implement though.

> Will poke around some in the intel_map_page code but can you actually
> get a valid struct page for a pci bar address (dma_map_single calls
> virt_to_page)? If not, does a map_resource routine that can properly
> map a pci bar address need to be implemented?

Yes, you can not get a struct page for a PCI bar address unless it's
mapped with ZONE_DEVICE like in my p2p work. So that would explain why
dma_map_single() didn't work.

This all implies that ntb_transport doesn't work with DMA and the IOMMU
turned on. I'm not sure I've ever tried that configuration myself but it
is a bit surprising.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: IOAT DMA w/IOMMU
On 10/08/18 10:23 AM, Kit Chow wrote:
> There is an internal routine (__intel_map_single) inside the intel iommu
> code that does the actual mapping using a phys_addr_t. Think I'll try to
> implement a intel_map_resource routine that calls that routine directly
> without all of the conversions done for dma_map_{single,page} (pci bar
> addr -> page -> phys_addr)...

Nice, yes, that was what I was thinking.

Logan
Re: IOAT DMA w/IOMMU
On 10/08/18 10:31 AM, Dave Jiang wrote:
> On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
>> On 10/08/18 10:02 AM, Kit Chow wrote:
>>> Turns out there is no dma_map_resource routine on x86. get_dma_ops
>>> returns intel_dma_ops which has map_resource pointing to NULL.
>>
>> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
>> shouldn't be too hard to implement though.
>>
>>> Will poke around some in the intel_map_page code but can you actually
>>> get a valid struct page for a pci bar address (dma_map_single calls
>>> virt_to_page)? If not, does a map_resource routine that can properly
>>> map a pci bar address need to be implemented?
>>
>> Yes, you can not get a struct page for a PCI bar address unless it's
>> mapped with ZONE_DEVICE like in my p2p work. So that would explain why
>> dma_map_single() didn't work.
>>
>> This all implies that ntb_transport doesn't work with DMA and the IOMMU
>> turned on. I'm not sure I've ever tried that configuration myself but
>> it is a bit surprising.
>
> Hmm...that's surprising because it seems to work on Skylake platform
> when I tested it yesterday with Intel NTB. Kit is using a Haswell
> platform at the moment I think. Although I'm curious if it works with
> the PLX NTB he's using on Skylake.

Does that mean on Skylake the IOAT can bypass the IOMMU? Because it
looks like the ntb_transport code doesn't map the physical address of
the NTB MW into the IOMMU when doing DMA...

Logan
Re: [PATCH 1/3] dts: arm64/sdm845: Add node for arm,mmu-500
Hi,

On Thu, Jul 19, 2018 at 10:53 AM, Vivek Gautam wrote:
> Add device node for arm,mmu-500 available on sdm845.
> This MMU-500 with single TCU and multiple TBU architecture
> is shared among all the peripherals except gpu on sdm845.
>
> Signed-off-by: Vivek Gautam
> ---
>  arch/arm64/boot/dts/qcom/sdm845-mtp.dts |  4 ++
>  arch/arm64/boot/dts/qcom/sdm845.dtsi    | 73 +
>  2 files changed, 77 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/qcom/sdm845-mtp.dts b/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
> index 6d651f314193..13b50dff440f 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
> +++ b/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
> @@ -58,3 +58,7 @@
>                 bias-pull-up;
>         };
>  };
> +
> +_smmu {
> +       status = "okay";
> +};

When you spin this patch please put the above in the correct place.
Since "a" sorts alphabetically before "i" then this should be just
before the line:

	{

Thanks!

-Doug
Re: IOAT DMA w/IOMMU
On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
> On 10/08/18 10:02 AM, Kit Chow wrote:
>> Turns out there is no dma_map_resource routine on x86. get_dma_ops
>> returns intel_dma_ops which has map_resource pointing to NULL.
>
> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
> shouldn't be too hard to implement though.
>
>> Will poke around some in the intel_map_page code but can you actually
>> get a valid struct page for a pci bar address (dma_map_single calls
>> virt_to_page)? If not, does a map_resource routine that can properly
>> map a pci bar address need to be implemented?
>
> Yes, you can not get a struct page for a PCI bar address unless it's
> mapped with ZONE_DEVICE like in my p2p work. So that would explain why
> dma_map_single() didn't work.
>
> This all implies that ntb_transport doesn't work with DMA and the IOMMU
> turned on. I'm not sure I've ever tried that configuration myself but
> it is a bit surprising.

Hmm...that's surprising because it seems to work on Skylake platform
when I tested it yesterday with Intel NTB. Kit is using a Haswell
platform at the moment I think. Although I'm curious if it works with
the PLX NTB he's using on Skylake.

> Logan
Re: IOAT DMA w/IOMMU
On 08/10/2018 09:33 AM, Logan Gunthorpe wrote:
> On 10/08/18 10:31 AM, Dave Jiang wrote:
>> On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
>>> On 10/08/18 10:02 AM, Kit Chow wrote:
>>>> Turns out there is no dma_map_resource routine on x86. get_dma_ops
>>>> returns intel_dma_ops which has map_resource pointing to NULL.
>>>
>>> Oh, yup. I wasn't aware of that. From a cursory view, it looks like
>>> it shouldn't be too hard to implement though.
>>>
>>>> Will poke around some in the intel_map_page code but can you
>>>> actually get a valid struct page for a pci bar address
>>>> (dma_map_single calls virt_to_page)? If not, does a map_resource
>>>> routine that can properly map a pci bar address need to be
>>>> implemented?
>>>
>>> Yes, you can not get a struct page for a PCI bar address unless it's
>>> mapped with ZONE_DEVICE like in my p2p work. So that would explain
>>> why dma_map_single() didn't work.
>>>
>>> This all implies that ntb_transport doesn't work with DMA and the
>>> IOMMU turned on. I'm not sure I've ever tried that configuration
>>> myself but it is a bit surprising.
>>
>> Hmm...that's surprising because it seems to work on Skylake platform
>> when I tested it yesterday with Intel NTB. Kit is using a Haswell
>> platform at the moment I think. Although I'm curious if it works with
>> the PLX NTB he's using on Skylake.
>
> Does that mean on Skylake the IOAT can bypass the IOMMU? Because it
> looks like the ntb_transport code doesn't map the physical address of
> the NTB MW into the IOMMU when doing DMA...

Or if the BIOS has provided mapping for the Intel NTB device
specifically? Is that a possibility? NTB does go through the IOMMU.
Re: IOAT DMA w/IOMMU
On 08/10/2018 10:15 AM, Logan Gunthorpe wrote:
> On 10/08/18 11:01 AM, Dave Jiang wrote:
>> Or if the BIOS has provided mapping for the Intel NTB device
>> specifically? Is that a possibility? NTB does go through the IOMMU.
>
> I don't know, but if the BIOS is doing it, that would only at best
> work for Intel NTB. I see no hope in getting the BIOS to map the MW
> for a Switchtec NTB device...

Right. I'm just speculating why it may possibly work. But yes, I think
kernel will need appropriate mapping if it's not happening right now.
Hopefully Kit is onto something.

> Logan
Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.
> The hardware isn't public yet, so can't talk about it :-(. Once this
> patch gets merged, will let the OSV engagement folks drive it for
> inclusions. We could mark this for stable, but i would rather wait
> until we know the timeline when they are expecting it to be in. It
> shouldn't break anything since we are just enforcing the spec.

Until a new better spec appears...

I know there is always fun when it comes to the people involved in such
a screwup having to admit it in public, but this should IMHO be tied to
a PCI identifier table so that we know what the afflicted hardware is.
Otherwise some day in the future SRIOV will grow new features and we'll
have no idea what particular devices we need to narrow the workaround
to, and indeed not necessarily even know this device is the only one, as
we'll silently fix other stuff then have it break on us later.

Alan
Re: IOAT DMA w/IOMMU
Turns out there is no dma_map_resource routine on x86. get_dma_ops
returns intel_dma_ops which has map_resource pointing to NULL.

(gdb) p intel_dma_ops
$7 = {alloc = 0x8150f310, free = 0x8150ec20, mmap = 0x0,
  get_sgtable = 0x0, map_page = 0x8150f2d0, unmap_page = 0x8150ec10,
  map_sg = 0x8150ef40, unmap_sg = 0x8150eb80, map_resource = 0x0,
  unmap_resource = 0x0, sync_single_for_cpu = 0x0,
  sync_single_for_device = 0x0, sync_sg_for_cpu = 0x0,
  sync_sg_for_device = 0x0, cache_sync = 0x0,
  mapping_error = 0x815095f0, dma_supported = 0x81033830, is_phys = 0}

Will poke around some in the intel_map_page code but can you actually
get a valid struct page for a pci bar address (dma_map_single calls
virt_to_page)? If not, does a map_resource routine that can properly
map a pci bar address need to be implemented?

Kit

---

static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
					      size_t size,
					      enum dma_data_direction dir,
					      unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);
	dma_addr_t addr;

	BUG_ON(!valid_dma_direction(dir));
	addr = ops->map_page(dev, virt_to_page(ptr),
			     offset_in_page(ptr), size,
			     dir, attrs);
	debug_dma_map_page(dev, virt_to_page(ptr), offset_in_page(ptr),
			   size, dir, addr, true);
	return addr;
}

On 08/09/2018 04:00 PM, Kit Chow wrote:

On 08/09/2018 03:50 PM, Logan Gunthorpe wrote:

On 09/08/18 04:48 PM, Kit Chow wrote:

Based on Logan's comments, I am very hopeful that the dma_map_resource
will make things work on the older platforms...

Well, I *think* dma_map_single() would still work. So I'm not that
confident that's the root of your problem. I'd still like to see the
actual code snippet you are using.

Logan

Here's the code snippet - (ntbdebug & 4) path does dma_map_resource of
the pci bar address.
It was:

	unmap->addr[1] = dma_map_single(device->dev, (void *)dest,
					len, DMA_TO_DEVICE);

Kit

---

static int ntb_async_tx_submit(struct ntb_transport_qp *qp,
			       struct ntb_queue_entry *entry)
{
	struct dma_async_tx_descriptor *txd;
	struct dma_chan *chan = qp->tx_dma_chan;
	struct dma_device *device;
	size_t len = entry->len;
	void *buf = entry->buf;
	size_t dest_off, buff_off;
	struct dmaengine_unmap_data *unmap;
	dma_addr_t dest;
	dma_cookie_t cookie;
	int unmapcnt;

	device = chan->device;
	dest = qp->tx_mw_phys + qp->tx_max_frame * entry->tx_index;
	buff_off = (size_t)buf & ~PAGE_MASK;
	dest_off = (size_t)dest & ~PAGE_MASK;

	if (!is_dma_copy_aligned(device, buff_off, dest_off, len))
		goto err;

	if (ntbdebug & 0x4) {
		unmapcnt = 2;
	} else {
		unmapcnt = 1;
	}

	unmap = dmaengine_get_unmap_data(device->dev, unmapcnt, GFP_NOWAIT);
	if (!unmap)
		goto err;

	unmap->len = len;
	unmap->addr[0] = dma_map_page(device->dev, virt_to_page(buf),
				      buff_off, len, DMA_TO_DEVICE);
	if (dma_mapping_error(device->dev, unmap->addr[0]))
		goto err_get_unmap;

	if (ntbdebug & 0x4) {
		unmap->addr[1] = dma_map_resource(device->dev,
						  (phys_addr_t)dest, len,
						  DMA_TO_DEVICE, 0);
		if (dma_mapping_error(device->dev, unmap->addr[1]))
			goto err_get_unmap;
		unmap->to_cnt = 2;
	} else {
		unmap->addr[1] = dest;
		unmap->to_cnt = 1;
	}

	txd = device->device_prep_dma_memcpy(chan, unmap->addr[1],
					     unmap->addr[0], len,
					     DMA_PREP_INTERRUPT);
	if (!txd)
		goto err_get_unmap;

	txd->callback_result = ntb_tx_copy_callback;
	txd->callback_param = entry;
	dma_set_unmap(txd, unmap);

	cookie = dmaengine_submit(txd);
	if (dma_submit_error(cookie))
		goto err_set_unmap;

	dmaengine_unmap_put(unmap);
	dma_async_issue_pending(chan);

	return 0;

err_set_unmap:
	dma_descriptor_unmap(txd);
	txd->desc_free(txd);
err_get_unmap:
	dmaengine_unmap_put(unmap);
err:
	return -ENXIO;
}
Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.
On Fri, Aug 10, 2018 at 05:48:36PM +0100, Alan Cox wrote:
> > The hardware isn't public yet, so can't talk about it :-(. Once this
> > patch gets merged, will let the OSV engagement folks drive it for
> > inclusions. We could mark this for stable, but i would rather wait
> > until we know the timeline when they are expecting it to be in. It
> > shouldn't break anything since we are just enforcing the spec.
>
> Until a new better spec appears...

Well, here we are talking about PIN interrupts and VF's. IMO it would
be weird to allow Virtual functions and physical interrupts other than
MSIx. There are other things in the PCIe spec which could be enforced
in SW or expect devices to implement them. What might be ok is probably
doing the right thing in SW as done in this patch; additionally maybe
we can flag them and print a "Your device is busted" message.

> I know there is always fun when it comes to the people involved in
> such a screwup having to admit it in public but this should IMHO be
> tied to a PCI identifier table so that we know what the afflicted
> hardware is. Otherwise some day in the future SRIOV will grow new
> features and we'll have no idea what particular devices we need to
> narrow the workaround to and indeed not necessarily even know this
> device is the only one, as we'll silently fix other stuff then have it
> break on us later.
>
> Alan
Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
On Fri, Aug 10, 2018 at 11:39:13AM +0800, Kenneth Lee wrote:
> On Thu, Aug 09, 2018 at 10:46:13AM -0400, Jerome Glisse wrote:
> > Date: Thu, 9 Aug 2018 10:46:13 -0400
> > From: Jerome Glisse
> > To: Kenneth Lee
> > CC: Kenneth Lee , "Tian, Kevin" ,
> > Alex Williamson , Herbert Xu ,
> > "k...@vger.kernel.org" , Jonathan Corbet ,
> > Greg Kroah-Hartman , Zaibo Xu ,
> > "linux-...@vger.kernel.org" , "Kumar, Sanjay K" ,
> > Hao Fang , "linux-ker...@vger.kernel.org" ,
> > "linux...@huawei.com" , "iommu@lists.linux-foundation.org" ,
> > "linux-cry...@vger.kernel.org" , Philippe Ombredanne ,
> > Thomas Gleixner , "David S . Miller" ,
> > "linux-accelerat...@lists.ozlabs.org"
> > Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> > User-Agent: Mutt/1.10.0 (2018-05-17)
> > Message-ID: <20180809144613.gb3...@redhat.com>
> >
> > On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> > > On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> > > > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > > > > On Monday, 06 August 2018 11:32 PM, Jerome Glisse wrote:
> > > > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > > >
> > > > [...]
> > > >
> > > > > > > > > > your mechanisms the userspace must have a specific
> > > > > > > > > > userspace drivers for each hardware and thus there are
> > > > > > > > > > virtually no differences between having this userspace
> > > > > > > > > > driver open a device file in vfio or somewhere else in
> > > > > > > > > > the device filesystem. This is just a different path.
> > > > > > > > >
> > > > > > > > > The basic problem WarpDrive want to solve it to avoid
> > > > > > > > > syscall. This is important to accelerators. We have some
> > > > > > > > > data here:
> > > > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > > > >
> > > > > > > > > (see page 3)
> > > > > > > > >
> > > > > > > > > The performance is different on using kernel and user
> > > > > > > > > drivers.
> > > > > > > >
> > > > > > > > Yes and example i point to is exactly that. You have a one
> > > > > > > > time setup cost (creating command buffer binding PASID with
> > > > > > > > command buffer and couple other setup steps). Then userspace
> > > > > > > > no longer have to do any ioctl to schedule work on the GPU.
> > > > > > > > It is all down from userspace and it use a doorbell to
> > > > > > > > notify hardware when it should go look at command buffer
> > > > > > > > for new thing to execute.
> > > > > > > >
> > > > > > > > My point stands on that. You have existing driver already
> > > > > > > > doing so with no new framework and in your scheme you need
> > > > > > > > a userspace driver. So i do not see the value add, using
> > > > > > > > one path or the other in the userspace driver is litteraly
> > > > > > > > one line to change.
> > > > > > >
> > > > > > > Sorry, I'd got confuse here. I partially agree that the user
> > > > > > > driver is redundance of kernel driver. (But for WarpDrive,
> > > > > > > the kernel driver is a full driver include all preparation
> > > > > > > and setup stuff for the hardware, the user driver is simply
> > > > > > > to send request and receive answer). Yes, it is just a choice
> > > > > > > of path. But the user path is faster if the request come from
> > > > > > > use space. And to do that, we need user land DMA support.
> > > > > > > Then why is it invaluable to let VFIO involved?
> > > > > >
> > > > > > Some drivers in the kernel already do exactly what you said.
> > > > > > The user space emit commands without ever going into kernel by
> > > > > > directly scheduling commands and ringing a doorbell. They do
> > > > > > not need VFIO either and they can map userspace address into
> > > > > > the DMA address space of the device and again they do not need
> > > > > > VFIO for that.
> > > > >
> > > > > Could you please directly point out which driver you refer to
> > > > > here? Thank you.
> > > >
> > > > drivers/gpu/drm/amd/
> > > >
> > > > Sub-directory of interest is amdkfd
> > > >
> > > > Because it is a big driver here is a highlevel overview of how it
> > > > works (this is a simplification):
> > > >   - Process can allocate GPUs buffer (through ioclt) and map them
> > > >     into its address space (through mmap of device file at buffer
> > > >     object specific offset).
Re: IOAT DMA w/IOMMU
On 10/08/18 11:01 AM, Dave Jiang wrote:
> Or if the BIOS has provided mapping for the Intel NTB device
> specifically? Is that a possibility? NTB does go through the IOMMU.

I don't know, but if the BIOS is doing it, that would only at best work
for Intel NTB. I see no hope in getting the BIOS to map the MW for a
Switchtec NTB device...

Logan
Re: IOAT DMA w/IOMMU
On 10/08/18 06:53 PM, Kit Chow wrote:
> I was able to finally succeed in doing the dma transfers over ioat only
> when prot has DMA_PTE_WRITE set by setting the direction to either
> DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. Any ideas if the prot settings
> need to be changed? Are there any bad side effects if I used
> DMA_BIDIRECTIONAL?

Good to hear it. Without digging into the direction much all I can say
is that it can sometimes be very confusing what the direction is. Adding
another PCI device just adds to the confusion.

I believe, the direction should be from the IOAT's point of view. So if
the IOAT is writing to the BAR you'd set DMA_FROM_DEVICE (ie. data is
coming from the IOAT) and if it's reading you'd set DMA_TO_DEVICE (ie.
data is going to the IOAT).

Using DMA_BIDIRECTIONAL just forgoes any hardware security / protection
that the buffer would have in terms of direction. Generally it's good
practice to use the strictest direction you can.

> Given that using the pci bar address as is without getting an iommu
> address results in the same "PTE Write access" error, I wonder if there
> is some internal 'prot' associated with the non-translated pci bar
> address that just needs to be tweaked to include DMA_PTE_WRITE???

No, I don't think so. The 'prot' will be a property of the IOMMU. Not
having an entry is probably just the same (from the perspective of the
error you see) as only having an entry for reading.

Logan
Re: IOAT DMA w/IOMMU
Success!

I've implemented a new intel_map_resource (and intel_unmap_resource)
routine which is called by dma_map_resource. As mentioned previously,
the primary job of dma_map_resource/intel_map_resource is to call the
intel iommu internal mapping routine (__intel_map_single) without
translating the pci bar address into a page and then back to a phys
addr, as dma_map_page and dma_map_single are doing.

With the new dma_map_resource routine mapping the pci bar address in
ntb_transport_tx_submit, I was still getting the "PTE Write access is
not set" error. The __intel_map_single routine sets prot based on the
DMA direction:

	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL ||
	    !cap_zlr(iommu->cap))
		prot |= DMA_PTE_READ;
	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		prot |= DMA_PTE_WRITE;

I was able to finally succeed in doing the dma transfers over ioat only
when prot has DMA_PTE_WRITE set, by setting the direction to either
DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. Any ideas if the prot settings
need to be changed? Are there any bad side effects if I used
DMA_BIDIRECTIONAL?

Given that using the pci bar address as is without getting an iommu
address results in the same "PTE Write access" error, I wonder if there
is some internal 'prot' associated with the non-translated pci bar
address that just needs to be tweaked to include DMA_PTE_WRITE???

Thanks!

On 08/10/2018 10:46 AM, Dave Jiang wrote:

On 08/10/2018 10:15 AM, Logan Gunthorpe wrote:

On 10/08/18 11:01 AM, Dave Jiang wrote:

Or if the BIOS has provided mapping for the Intel NTB device
specifically? Is that a possibility? NTB does go through the IOMMU.

I don't know, but if the BIOS is doing it, that would only at best work
for Intel NTB. I see no hope in getting the BIOS to map the MW for a
Switchtec NTB device...

Right. I'm just speculating why it may possibly work. But yes, I think
kernel will need appropriate mapping if it's not happening right now.
Hopefully Kit is onto something.
Logan
Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range
Hi Robin,

On Fri, Aug 10, 2018 at 2:13 AM, Robin Murphy wrote:
> On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:
>> Hi Robin,
>>
>> On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy wrote:
>>> On 07/08/18 09:54, Ganapatrao Kulkarni wrote:
>>>> As an optimisation for PCI devices, there is always first attempt
>>>> been made to allocate iova from SAC address range. This will lead
>>>> to unnecessary attempts/function calls, when there are no free
>>>> ranges available. This patch optimises by adding flag to track
>>>> previous failed attempts and avoids further attempts until
>>>> replenish happens.
>>>
>>> Agh, what I overlooked is that this still suffers from the original
>>> problem, wherein a large allocation which fails due to fragmentation
>>> then blocks all subsequent smaller allocations, even if they may
>>> have succeeded.
>>>
>>> For a minimal change, though, what I think we could do is instead of
>>> just having a flag, track the size of the last 32-bit allocation
>>> which failed. If we're happy to assume that nobody's likely to mix
>>> aligned and unaligned allocations within the same domain, then that
>>> should be sufficiently robust whilst being no more complicated than
>>> this version, i.e. (modulo thinking up a better name for it):
>>
>> I agree, it would be better to track size and attempt to allocate for
>> smaller chunks, if not for bigger one.
>>
>>>> Signed-off-by: Ganapatrao Kulkarni
>>>> ---
>>>> This patch is based on comments from Robin Murphy for patch [1]
>>>>
>>>> [1] https://lkml.org/lkml/2018/4/19/780
>>>>
>>>>  drivers/iommu/iova.c | 11 ++-
>>>>  include/linux/iova.h |  1 +
>>>>  2 files changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>>> index 83fe262..d97bb5a 100644
>>>> --- a/drivers/iommu/iova.c
>>>> +++ b/drivers/iommu/iova.c
>>>> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>>>         iovad->granule = granule;
>>>>         iovad->start_pfn = start_pfn;
>>>>         iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
>>>> +       iovad->free_32bit_pfns = true;
>>>
>>> iovad->max_32bit_free = iovad->dma_32bit_pfn;
>>>
>>>>         iovad->flush_cb = NULL;
>>>>         iovad->fq = NULL;
>>>>         iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>>>> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>>>>         cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>>>>         if (free->pfn_hi < iovad->dma_32bit_pfn &&
>>>> -           free->pfn_lo >= cached_iova->pfn_lo)
>>>> +           free->pfn_lo >= cached_iova->pfn_lo) {
>>>>                 iovad->cached32_node = rb_next(&free->node);
>>>> +               iovad->free_32bit_pfns = true;
>>>
>>> iovad->max_32bit_free = iovad->dma_32bit_pfn;
>>
>> i think, you intended to say,
>>     iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);
>
> Nope, that's why I said it needed a better name ;)
>
> (I nearly called it last_failed_32bit_alloc_size, but that's a bit
> much)

may be we can name it "max32_alloc_size"?

> The point of this value (whatever it's called) is that at any given
> time it holds an upper bound on the size of the largest contiguous
> free area. It doesn't have to be the *smallest* upper bound, which is
> why we can keep things simple and avoid arithmetic - in realistic
> use-cases like yours when the allocations are a pretty constant size,
> this should work out directly equivalent to the boolean, only with
> values of "size" and "dma_32bit_pfn" instead of 0 and 1, so we don't
> do any more work than necessary. In the edge cases where allocations
> are all different sizes, it does mean that we will probably end up
> performing more failing allocations than if we actually tried to track
> all of the free space exactly, but I think that's reasonable - just
> because I want to make sure we handle such cases fairly gracefully,
> doesn't mean that we need to do extra work on the typical fast path to
> try and actually optimise for them (which is why I didn't really like
> the accounting implementation I came up with).

ok got it, thanks for the explanation.

>>>> +       }
>>>>         cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>>>>         if (free->pfn_lo >= cached_iova->pfn_lo)
>>>> @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>>>>         struct iova *new_iova;
>>>>         int ret;
>>>>
>>>> +       if (limit_pfn <= iovad->dma_32bit_pfn &&
>>>> +           !iovad->free_32bit_pfns)
>>>
>>> size >= iovad->max_32bit_free)
>>>
>>>> +               return NULL;
>>>> +
>>>>         new_iova = alloc_iova_mem();
>>>>         if (!new_iova)
>>>>                 return NULL;
>>>> @@
Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
Hi Kenneth,

On 10/08/18 04:39, Kenneth Lee wrote:
>> You can achieve everything you want to achieve with the existing upstream
>> solution. Re-inventing a whole new driver infrastructure should really
>> be motivated with strong and obvious reasons.
>
> I want to understand your idea better. If I create some unified helper
> APIs in drivers/iommu/, say:
>
>     wd_create_dev(parent_dev, wd_dev)
>     wd_release_dev(wd_dev)
>
> The API creates a chrdev to take requests from user space for open
> (resource allocation), iomap, epoll (irq), and dma_map (with pasid
> handled automatically).
>
> Do you think it is acceptable?

Maybe not drivers/iommu/ :) That subsystem only contains tools for dealing with DMA; I don't think epoll, resource enumeration or iomap fit in there.

Creating new helpers seems to be precisely what we're trying to avoid in this thread, and vfio-mdev does provide the components that you describe, so I wouldn't discard it right away.

When the GPU, net, block or another subsystem doesn't fit your needs, either because your accelerator provides some specialized function or because for performance reasons your client wants direct MMIO access, you can at least build your driver and library on top of the existing VFIO components:

* open() allocates a partition of an accelerator.
* vfio_device_info, vfio_region_info and vfio_irq_info enumerate the available resources.
* vfio_irq_set deals with epoll.
* mmap() gives you a private MMIO doorbell.
* vfio_iommu_type1 provides the DMA operations.

Currently missing:

* Sharing the parent IOMMU between mdev, which is also what the "IOMMU aware mediated device" series tackles, and seems like a logical addition to VFIO. I'd argue that the existing IOMMU ops (or the ones implemented by the SVA series) can be used to deal with this.
* The interface to discover an accelerator near your memory node, or one that you can chain with other devices. If I understood correctly, the conclusion was that the API (a topology description in sysfs?) should be common to the various subsystems, in which case vfio-mdev (or the mediating driver) could also use it.
* The queue abstraction discussed in patch 3/7. Perhaps the current VFIO resource description of MMIO and IRQ is sufficient here as well, since vendors tend to each implement their own queue schemes. If you need additional features, the read/write fops give the mediating driver a lot of freedom: to support features that are too specific for drivers/vfio/, you can implement a config space with capabilities and registers of your choice. If you're versioning the capabilities, the code to handle them could even be shared between different accelerator drivers and libraries.

Thanks,
Jean

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range
On 10/08/18 10:24, Ganapatrao Kulkarni wrote:

Hi Robin,

On Fri, Aug 10, 2018 at 2:13 AM, Robin Murphy wrote:

On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:

Hi Robin,

On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy wrote:

On 07/08/18 09:54, Ganapatrao Kulkarni wrote:

As an optimisation for PCI devices, the first allocation attempt is always made from the SAC address range. This leads to unnecessary attempts/function calls when there are no free ranges available.

This patch optimises that by adding a flag to track previous failed attempts, and avoids further attempts until a replenish happens.

Agh, what I overlooked is that this still suffers from the original problem, wherein a large allocation which fails due to fragmentation then blocks all subsequent smaller allocations, even if they may have succeeded.

For a minimal change, though, what I think we could do is, instead of just having a flag, track the size of the last 32-bit allocation which failed. If we're happy to assume that nobody's likely to mix aligned and unaligned allocations within the same domain, then that should be sufficiently robust whilst being no more complicated than this version, i.e. (modulo thinking up a better name for it):

I agree, it would be better to track the size and attempt the allocation for smaller chunks, if not for the bigger one.
Signed-off-by: Ganapatrao Kulkarni
---
This patch is based on comments from Robin Murphy for patch [1]

[1] https://lkml.org/lkml/2018/4/19/780

 drivers/iommu/iova.c | 11 ++-
 include/linux/iova.h | 1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe262..d97bb5a 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
 	iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
+	iovad->free_32bit_pfns = true;

	iovad->max_32bit_free = iovad->dma_32bit_pfn;

 	iovad->flush_cb = NULL;
 	iovad->fq = NULL;
 	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
@@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 	cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
 	if (free->pfn_hi < iovad->dma_32bit_pfn &&
-	    free->pfn_lo >= cached_iova->pfn_lo)
+	    free->pfn_lo >= cached_iova->pfn_lo) {
 		iovad->cached32_node = rb_next(&free->node);
+		iovad->free_32bit_pfns = true;

	iovad->max_32bit_free = iovad->dma_32bit_pfn;

i think, you intended to say,

	iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);

Nope, that's why I said it needed a better name ;)

(I nearly called it last_failed_32bit_alloc_size, but that's a bit much)

may be we can name it "max32_alloc_size"?

The point of this value (whatever it's called) is that at any given time it holds an upper bound on the size of the largest contiguous free area. It doesn't have to be the *smallest* upper bound, which is why we can keep things simple and avoid arithmetic - in realistic use-cases like yours, when the allocations are a pretty constant size, this should work out directly equivalent to the boolean, only with values of "size" and "dma_32bit_pfn" instead of 0 and 1, so we don't do any more work than necessary.
In the edge cases where allocations are all different sizes, it does mean that we will probably end up performing more failing allocations than if we actually tried to track all of the free space exactly, but I think that's reasonable - just because I want to make sure we handle such cases fairly gracefully doesn't mean that we need to do extra work on the typical fast path to try and actually optimise for them (which is why I didn't really like the accounting implementation I came up with).

ok got it, thanks for the explanation.

+	}
 	cached_iova = rb_entry(iovad->cached_node, struct iova, node);
 	if (free->pfn_lo >= cached_iova->pfn_lo)
@@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
 	struct iova *new_iova;
 	int ret;

+	if (limit_pfn <= iovad->dma_32bit_pfn &&
+	    !iovad->free_32bit_pfns)

	    size >= iovad->max_32bit_free)

+		return NULL;
+
 	new_iova = alloc_iova_mem();
 	if (!new_iova)
 		return NULL;
@@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
 	if (ret) {
 		free_iova_mem(new_iova);
+		if (limit_pfn <= iovad->dma_32bit_pfn)
+			iovad->free_32bit_pfns = false;

	iovad->max_32bit_free = size;

same here, we should decrease available free range after successful allocation.
Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range
On Fri, Aug 10, 2018 at 3:19 PM, Robin Murphy wrote:
> On 10/08/18 10:24, Ganapatrao Kulkarni wrote:
>>
>> Hi Robin,
>>
>> On Fri, Aug 10, 2018 at 2:13 AM, Robin Murphy wrote:
>>>
>>> On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:
>>>>
>>>> Hi Robin,
>>>>
>>>> On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy wrote:
>>>>>
>>>>> On 07/08/18 09:54, Ganapatrao Kulkarni wrote:
>>>>>>
>>>>>> As an optimisation for PCI devices, the first allocation attempt is
>>>>>> always made from the SAC address range. This leads to unnecessary
>>>>>> attempts/function calls when there are no free ranges available.
>>>>>>
>>>>>> This patch optimises that by adding a flag to track previous failed
>>>>>> attempts, and avoids further attempts until a replenish happens.
>>>>>
>>>>> Agh, what I overlooked is that this still suffers from the original
>>>>> problem, wherein a large allocation which fails due to fragmentation
>>>>> then blocks all subsequent smaller allocations, even if they may have
>>>>> succeeded.
>>>>>
>>>>> For a minimal change, though, what I think we could do is, instead of
>>>>> just having a flag, track the size of the last 32-bit allocation which
>>>>> failed. If we're happy to assume that nobody's likely to mix aligned
>>>>> and unaligned allocations within the same domain, then that should be
>>>>> sufficiently robust whilst being no more complicated than this
>>>>> version, i.e. (modulo thinking up a better name for it):
>>>>
>>>> I agree, it would be better to track the size and attempt the
>>>> allocation for smaller chunks, if not for the bigger one.
>> Signed-off-by: Ganapatrao Kulkarni
>> ---
>> This patch is based on comments from Robin Murphy
>> for patch [1]
>>
>> [1] https://lkml.org/lkml/2018/4/19/780
>>
>>  drivers/iommu/iova.c | 11 ++-
>>  include/linux/iova.h | 1 +
>>  2 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 83fe262..d97bb5a 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>> long granule,
>>         iovad->granule = granule;
>>         iovad->start_pfn = start_pfn;
>>         iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
>> +       iovad->free_32bit_pfns = true;
>
> iovad->max_32bit_free = iovad->dma_32bit_pfn;
>
>>         iovad->flush_cb = NULL;
>>         iovad->fq = NULL;
>>         iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain
>> *iovad, struct iova *free)
>>         cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>>         if (free->pfn_hi < iovad->dma_32bit_pfn &&
>> -           free->pfn_lo >= cached_iova->pfn_lo)
>> +           free->pfn_lo >= cached_iova->pfn_lo) {
>>                 iovad->cached32_node = rb_next(&free->node);
>> +               iovad->free_32bit_pfns = true;
>
> iovad->max_32bit_free = iovad->dma_32bit_pfn;
>
>>>> i think, you intended to say,
>>>> iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);
>>>
>>> Nope, that's why I said it needed a better name ;)
>>>
>>> (I nearly called it last_failed_32bit_alloc_size, but that's a bit much)
>>
>> may be we can name it "max32_alloc_size"?
>>>
>>> The point of this value (whatever it's called) is that at any given
>>> time it holds an upper bound on the size of the largest contiguous
>>> free area.
>>> It doesn't have to be the *smallest* upper bound, which is why we can
>>> keep things simple and avoid arithmetic - in realistic use-cases like
>>> yours, when the allocations are a pretty constant size, this should
>>> work out directly equivalent to the boolean, only with values of "size"
>>> and "dma_32bit_pfn" instead of 0 and 1, so we don't do any more work
>>> than necessary. In the edge cases where allocations are all different
>>> sizes, it does mean that we will probably end up performing more
>>> failing allocations than if we actually tried to track all of the free
>>> space exactly, but I think that's reasonable - just because I want to
>>> make sure we handle such cases fairly gracefully doesn't mean that we
>>> need to do extra work on the typical fast path to try and actually
>>> optimise for them (which is why I didn't really like the accounting
>>> implementation I came up with).
>>
>> ok got it, thanks for the explanation.
>
>> +       }
>>         cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>>         if (free->pfn_lo >=