Re: [PATCH 2/4] dma-buf: lock the reservation object during (un)map_dma_buf v2
On 03.07.2018 at 14:52, Daniel Vetter wrote:
On Tue, Jul 03, 2018 at 01:46:44PM +0200, Christian König wrote:
On 25.06.2018 at 11:12, Daniel Vetter wrote:
On Mon, Jun 25, 2018 at 10:22:31AM +0200, Daniel Vetter wrote:
On Fri, Jun 22, 2018 at 04:11:01PM +0200, Christian König wrote:

[Christian:] First step towards unpinned DMA-buf operation. I've checked the DRM drivers for potential locking of the reservation object, but essentially we need to audit all implementations of the dma_buf_ops for this to work.

v2: reordered

Signed-off-by: Christian König
Reviewed-by: Daniel Vetter

[Daniel:] Ok, I did review the drivers a bit, but apparently not well enough by far. i915 CI is unhappy:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9400/fi-whl-u/igt@gem_mmap_...@basic-small-bo-tiledx.html

So yeah, inserting that lock in there isn't the most trivial thing :-/ I kind of assume that other drivers will have similar issues, e.g. omapdrm's use of dev->struct_mutex also very much looks like it'll result in a new locking inversion.

[Christian:] Ah, crap. I already feared that this wouldn't be easy, but that it is as bad as this is rather disappointing. Thanks for the info, going to keep thinking about how to solve those issues.

[Daniel:] Side note: we want to make sure that drivers don't get the reservation_obj locking hierarchy wrong in other places (using dev->struct_mutex is kind of a pre-existing misuse that we unfortunately can't wish away retroactively). One really important thing is that shrinker vs. resv_obj must work with trylocks in the shrinker, so that you can allocate memory while holding reservation objects. One neat trick to teach lockdep about this would be to have a dummy

	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
		ww_mutex_lock(&dma_buf->resv_obj);
		fs_reclaim_acquire(GFP_KERNEL);
		fs_reclaim_release(GFP_KERNEL);
		ww_mutex_unlock(&dma_buf->resv_obj);
	}

in dma_buf_init(). We're using the fs_reclaim_acquire/release check very successfully to improve our igt test coverage for i915.ko in other areas.
[Daniel:] Totally unrelated to dev->struct_mutex, but thoughts? Well, for dev->struct_mutex we could at least decide on one true way to nest resv_obj vs. dev->struct_mutex as an interim step, but I'm not sure how much that would help.

[Christian:] I don't think that would help. As far as I can see we only have two choices:

1. Have one big patch which fixes all DMA-buf implementations to allow the reservation lock to be held during map/unmap (unrealistic).

2. Add a flag to tell the DMA-buf helper functions what to do, at least as a mid-term measure, e.g. create the mapping without the reservation lock held.

How about moving the SGL caching from the DRM layer into the DMA-buf layer, and adding a flag for whether the exporter wants/needs this caching? Then only the implementations which can deal with dynamic invalidation disable SGL caching, and with it enable creating the sgl with the reservation object locked.

This way we can kill two birds with one stone, by both avoiding the SGL caching in the DRM layer as well as having sane handling for the locking.

Thoughts?
Christian.

-Daniel
Re: [PATCH 04/15] dma-fence: Make ->wait callback optional
On 04.05.2018 at 11:25, Daniel Vetter wrote:
On Fri, May 4, 2018 at 11:16 AM, Chris Wilson wrote:
Quoting Daniel Vetter (2018-05-04 09:57:59)
On Fri, May 04, 2018 at 09:31:33AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2018-05-04 09:23:01)
On Fri, May 04, 2018 at 10:17:22AM +0200, Daniel Vetter wrote:
On Fri, May 04, 2018 at 09:09:10AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2018-05-03 15:25:52)

[Daniel:] Almost everyone uses dma_fence_default_wait.

v2: Also remove the BUG_ON(!ops->wait) (Chris).

[Chris:] I just don't get the rationale for implicit over explicit.

[Daniel:] Closer approximation of dwim semantics. There have been tons of patch series all over drm and related places to get there, once we have a big pile of implementations and know what the dwim semantics should be. Individually they're all not much, but in aggregate they substantially simplify simple drivers. I also think clearer separation between optional optimization hooks and mandatory core parts is useful in itself.

[Chris:] A new spelling of midlayer ;) I don't see the contradiction between a driver saying "use the default" and simplicity. (I know which one the compiler thinks is simpler ;)

[Daniel:] If the compiler overhead is real then I guess it would make sense to be explicit. I don't expect that to be a problem for a blocking function, though. I disagree on this being a midlayer - you can still override everything you please to. What it does help with is people doing less copy-pasting (and assorted bugs), at least in the grand scheme of things. And we do have a _lot_ more random small drivers than just a few years ago. Reducing the amount of explicit typing just to get default behaviour has been an ongoing theme for a few years now, and your objection here is about the first that this is not a good idea. So I'm somewhat confused.

[Chris:] I'm just saying I don't see any rationale for this patch. "Almost everyone uses dma_fence_default_wait." Why change? Making it look simpler on the surface, so that you don't have to think about things straight away?
[Chris:] I understand the appeal, but I do worry about it just being an illusion. (Cutting and pasting a line saying ".wait = default_wait," doesn't feel that onerous, as you likely cut and paste the ops anyway, and at the very least you are reminded about some of the interactions. You could even have default initializers and/or magic macros to hide the cut and paste; maybe a simple_dma_fence [now that's a midlayer!] but I haven't looked.)

[Daniel:] In really monolithic vtables like drm_driver we do use default function macros, so you type one line and get them all. But dma_fence_ops is pretty small, and most drivers only implement a few callbacks. Also note that e.g. the ->release callback already works like that, so this pattern is there already; I simply extended it to ->wait and ->enable_signaling. And I leave the EXPORT_SYMBOL in place, so you can still wrap dma_fence_default_wait if you wish to do so. But I just realized that I didn't clean out the optional release hooks; I guess I should do that too (for the few cases where it's not yet done) and respin.

[Christian:] I kind of agree with Chris here, but I also see the practical problem of having to copy the default function into all the implementations. We had the same problem in TTM, and I also don't really like the resulting "if (some_callback) some_callback(); else default();" pattern. The run time overhead might be negligible, but it doesn't feel right from a coding style perspective.

Christian.

-Daniel
Re: [PATCH 04/17] dma-fence: Allow wait_any_timeout for all fences
On 27.04.2018 at 08:17, Daniel Vetter wrote:

[Daniel:] When this was introduced in

    commit a519435a96597d8cd96123246fea4ae5a6c90b02
    Author: Christian König <christian.koe...@amd.com>
    Date:   Tue Oct 20 16:34:16 2015 +0200

        dma-buf/fence: add fence_wait_any_timeout function v2

there was a restriction added that this only works if the dma-fence uses the dma_fence_default_wait hook. Which works for amdgpu, which is the only caller. Well, until you share some buffers with e.g. i915 - then you get an -EINVAL.

But there's really no reason for this, because all drivers must support callbacks. The special ->wait hook is only an optimization: if the driver needs to create a worker thread for an active callback, it can avoid doing that when it knows that there's a process context available already. So ->wait is just an optimization, and just using the logic in dma_fence_default_wait() should work for all drivers. Let's remove this restriction.

[Christian:] Mhm, that was intentionally introduced, because for radeon this is not only an optimization but mandatory for correct operation. On the other hand radeon isn't using this function, so it should be fine as long as the Intel driver can live with it.

Christian.
Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: Christian König <christian.koe...@amd.com>
Cc: Alex Deucher <alexander.deuc...@amd.com>
---
 drivers/dma-buf/dma-fence.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 7b5b40d6b70e..59049375bd19 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -503,11 +503,6 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
 	for (i = 0; i < count; ++i) {
 		struct dma_fence *fence = fences[i];
 
-		if (fence->ops->wait != dma_fence_default_wait) {
-			ret = -EINVAL;
-			goto fence_rm_cb;
-		}
-
 		cb[i].task = current;
 		if (dma_fence_add_callback(fence, &cb[i].base,
 					   dma_fence_default_wait_cb)) {
Re: [PATCH 4/8] dma-buf: add peer2peer flag
On 20.04.2018 at 12:17, Christoph Hellwig wrote:
On Fri, Apr 20, 2018 at 10:58:50AM +0200, Christian König wrote:

[Daniel:] Yes, there's a bit of a layering violation insofar as drivers really shouldn't each have their own copy of "how do I convert a piece of dma memory into dma-buf", but that doesn't render the interface a bad idea.

[Christian:] Completely agree on that. What we need is an sg_alloc_table_from_resources(dev, resources, num_resources) which does the handling common to all drivers.

[Christoph:] A structure that contains {page,offset,len} + {dma_addr+dma_len} is not a good container for storing {virt addr, dma_addr, len}, no matter what interface you build around it.

[Christian:] Why not? I mean, at least for my use case we actually don't need the virtual address. What we need is {dma_addr+dma_len} in a consistent interface which can come from both {page,offset,len} as well as {resource, len}. What I actually don't need is separate handling for system memory and resources, but that is exactly what we would get if we don't use sg_table.

[Christoph:] And that is discounting all the problems around mapping coherent allocations for other devices, or the iommu merging problem we are having another thread on. So let's come up with a better high-level interface first, and then worry about how to implement it in the low-level dma-mapping interface second. Especially given that my consolidation of the dma_map_ops implementations is in full stream, and there shouldn't be all that many left to bother with. So, first question: do you actually care about having multiple pairs of the above, or can you, instead of multiple chunks, just deal with a single one? In that case we really should not need that many new interfaces, as dma_map_resource will be all you need anyway.

Christian.
-Daniel
---end quoted text---
Re: [PATCH 4/8] dma-buf: add peer2peer flag
On 20.04.2018 at 09:13, Daniel Vetter wrote:
On Thu, Apr 19, 2018 at 01:16:57AM -0700, Christoph Hellwig wrote:
On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:

[Daniel:] We've broken that assumption in i915 years ago. GPU memory that is not struct page backed is very real. Of course we'll never feed such a strange sg table to a driver which doesn't understand it, but allowing sg_page == NULL works perfectly fine. At least for gpu drivers.

[Christoph:] For GPU drivers on x86 with no dma coherency problems, sure. But not all the world is x86. We already have problems due to dma-buf's use of the awkward get_sgtable interface (see the comment on arm_dma_get_sgtable that I fully agree with), and doing this for memory that doesn't have a struct page at all will make things even worse.

[Daniel:] x86 dma isn't coherent either, if you're a GPU :-) Flushing gpu caches tends to be too expensive, so there's pci-e support and chipset support to forgo it. Plus drivers flushing caches themselves.

The dma_get_sgtable thing is indeed fun; the right solution would probably be to push the dma-buf export down into the dma layer. The comment for arm_dma_get_sgtable is also not really a concern, because dma-buf also abstracts away the flushing (or, well, is supposed to), so there really shouldn't be anyone calling the streaming APIs on the returned sg table. That's why dma-buf gives you an sg table that's mapped already.

If that's not acceptable, then I guess we could go over the entire tree and frob all the gpu related code to switch over to a new struct sg_table_might_not_be_struct_page_backed, including all the other functions we added over the past few years to iterate over sg tables. But that seems slightly silly, given that sg tables seem to do exactly what we need.

[Christoph:] It isn't silly. We will have to do some surgery like that anyway, because the current APIs don't work. So relax, sit back and come up with an API that solves the existing issues and serves us well in the future.
[Daniel:] So we should just implement a copy of sg_table for dma-buf, since I still think it does exactly what we need for gpus?

Yes, there's a bit of a layering violation insofar as drivers really shouldn't each have their own copy of "how do I convert a piece of dma memory into dma-buf", but that doesn't render the interface a bad idea.

[Christian:] Completely agree on that. What we need is an sg_alloc_table_from_resources(dev, resources, num_resources) which does the handling common to all drivers.

Christian.

-Daniel
[PATCH 5/5] drm/amdgpu: add independent DMA-buf import v3
Instead of relying on the DRM functions just implement our own import functions. This adds support for taking care of unpinned DMA-buf.

v2: enable for all exporters, not just amdgpu, fix invalidation handling, lock reservation object while setting callback
v3: change to new dma_buf attach interface

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 72 ++++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 32 ++++++++------
 2 files changed, 89 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
index 7fef95f0fed1..58dcfba0225a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
@@ -86,28 +86,24 @@ int amdgpu_gem_prime_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma
 	return ret;
 }
 
-struct drm_gem_object *
-amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
-				 struct dma_buf_attachment *attach,
-				 struct sg_table *sg)
+static struct drm_gem_object *
+amdgpu_gem_prime_create_obj(struct drm_device *dev, struct dma_buf *dma_buf)
 {
-	struct reservation_object *resv = attach->dmabuf->resv;
+	struct reservation_object *resv = dma_buf->resv;
 	struct amdgpu_device *adev = dev->dev_private;
 	struct amdgpu_bo *bo;
 	int ret;
 
 	ww_mutex_lock(&resv->lock, NULL);
-	ret = amdgpu_bo_create(adev, attach->dmabuf->size, PAGE_SIZE,
+	ret = amdgpu_bo_create(adev, dma_buf->size, PAGE_SIZE,
 			       AMDGPU_GEM_DOMAIN_CPU, 0, ttm_bo_type_sg,
 			       resv, &bo);
 	if (ret)
 		goto error;
 
-	bo->tbo.sg = sg;
-	bo->tbo.ttm->sg = sg;
 	bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT;
 	bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT;
-	if (attach->dmabuf->ops != &amdgpu_dmabuf_ops)
+	if (dma_buf->ops != &amdgpu_dmabuf_ops)
 		bo->prime_shared_count = 1;
 
 	ww_mutex_unlock(&resv->lock);
@@ -118,6 +114,26 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
 	return ERR_PTR(ret);
 }
 
+struct drm_gem_object *
+amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
+				 struct dma_buf_attachment *attach,
+				 struct sg_table *sg)
+{
+	struct drm_gem_object *obj;
+	struct amdgpu_bo *bo;
+
+	obj = amdgpu_gem_prime_create_obj(dev, attach->dmabuf);
+	if (IS_ERR(obj))
+		return obj;
+
+	bo = gem_to_amdgpu_bo(obj);
+	amdgpu_bo_reserve(bo, true);
+	bo->tbo.sg = sg;
+	bo->tbo.ttm->sg = sg;
+	amdgpu_bo_unreserve(bo);
+	return obj;
+}
+
 static struct sg_table *
 amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach,
 		       enum dma_data_direction dir)
@@ -293,9 +309,29 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_device *dev,
 	return buf;
 }
 
+static void
+amdgpu_gem_prime_invalidate_mappings(struct dma_buf_attachment *attach)
+{
+	struct ttm_operation_ctx ctx = { false, false };
+	struct drm_gem_object *obj = attach->priv;
+	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+	struct ttm_placement placement = {};
+	int r;
+
+	r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+	if (r)
+		DRM_ERROR("Failed to unmap DMA-buf import (%d))\n", r);
+}
+
 struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev,
 					       struct dma_buf *dma_buf)
 {
+	struct dma_buf_attach_info attach_info = {
+		.dev = dev->dev,
+		.dmabuf = dma_buf,
+		.invalidate = amdgpu_gem_prime_invalidate_mappings
+	};
+	struct dma_buf_attachment *attach;
 	struct drm_gem_object *obj;
 
 	if (dma_buf->ops == &amdgpu_dmabuf_ops) {
@@ -310,5 +346,21 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev,
 	}
 
-	return drm_gem_prime_import(dev, dma_buf);
+	if (!dma_buf->invalidation_supported)
+		return drm_gem_prime_import(dev, dma_buf);
+
+	obj = amdgpu_gem_prime_create_obj(dev, dma_buf);
+	if (IS_ERR(obj))
+		return obj;
+
+	attach_info.importer_priv = obj;
+	attach = dma_buf_attach(&attach_info);
+	if (IS_ERR(attach)) {
+		drm_gem_object_put(obj);
+		return ERR_CAST(attach);
+	}
+
+	get_dma_buf(dma_buf);
+	obj->import_attach = attach;
+	return obj;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index ab73300e6c7f..2a8f328918cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgp
[PATCH 1/5] dma-buf: use parameter structure for dma_buf_attach
Move the parameters into a structure to make it simpler to extend it in follow up patches. This also adds the importer private as parameter so that we can directly work with a completely filled in attachment structure.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/dma-buf.c                             | 15 +++++++++------
 drivers/gpu/drm/armada/armada_gem.c                   |  6 +++++-
 drivers/gpu/drm/drm_prime.c                           |  6 +++++-
 drivers/gpu/drm/i915/i915_gem_dmabuf.c                |  6 +++++-
 drivers/gpu/drm/tegra/gem.c                           |  6 +++++-
 drivers/gpu/drm/udl/udl_dmabuf.c                      |  6 +++++-
 drivers/media/common/videobuf2/videobuf2-dma-contig.c |  6 +++++-
 drivers/media/common/videobuf2/videobuf2-dma-sg.c     |  6 +++++-
 drivers/staging/media/tegra-vde/tegra-vde.c           |  6 +++++-
 include/linux/dma-buf.h                               | 17 +++++++++++-----
 10 files changed, 63 insertions(+), 17 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..4b46982c6d9c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -534,8 +534,9 @@ EXPORT_SYMBOL_GPL(dma_buf_put);
 /**
  * dma_buf_attach - Add the device to dma_buf's attachments list; optionally,
  * calls attach() of dma_buf_ops to allow device-specific attach functionality
- * @dmabuf:	[in]	buffer to attach device to.
- * @dev:	[in]	device to be attached.
+ * @info:	[in]	holds all the attach related information provided
+ *			by the importer. see dma_buf_attach_info
+ *			for further details.
  *
  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
  * must be cleaned up by calling dma_buf_detach().
@@ -549,26 +550,26 @@ EXPORT_SYMBOL_GPL(dma_buf_put);
  * accessible to @dev, and cannot be moved to a more suitable place. This is
  * indicated with the error code -EBUSY.
  */
-struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
-					  struct device *dev)
+struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info)
 {
+	struct dma_buf *dmabuf = info->dmabuf;
 	struct dma_buf_attachment *attach;
 	int ret;
 
-	if (WARN_ON(!dmabuf || !dev))
+	if (WARN_ON(!dmabuf || !info->dev))
 		return ERR_PTR(-EINVAL);
 
 	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
 	if (!attach)
 		return ERR_PTR(-ENOMEM);
 
-	attach->dev = dev;
+	attach->dev = info->dev;
 	attach->dmabuf = dmabuf;
 
 	mutex_lock(&dmabuf->lock);
 
 	if (dmabuf->ops->attach) {
-		ret = dmabuf->ops->attach(dmabuf, dev, attach);
+		ret = dmabuf->ops->attach(dmabuf, info->dev, attach);
 		if (ret)
 			goto err_attach;
 	}
diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
index a97f509743a5..f4d1c11f57ea 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -514,6 +514,10 @@ armada_gem_prime_export(struct drm_device *dev, struct drm_gem_object *obj,
 struct drm_gem_object *
 armada_gem_prime_import(struct drm_device *dev, struct dma_buf *buf)
 {
+	struct dma_buf_attach_info attach_info = {
+		.dev = dev->dev,
+		.dmabuf = buf
+	};
 	struct dma_buf_attachment *attach;
 	struct armada_gem_object *dobj;
 
@@ -529,7 +533,7 @@ armada_gem_prime_import(struct drm_device *dev, struct dma_buf *buf)
 	}
 
-	attach = dma_buf_attach(buf, dev->dev);
+	attach = dma_buf_attach(&attach_info);
 	if (IS_ERR(attach))
 		return ERR_CAST(attach);
 
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 7856a9b3f8a8..4da242de51c2 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -707,6 +707,10 @@ struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
 					    struct dma_buf *dma_buf,
 					    struct device *attach_dev)
 {
+	struct dma_buf_attach_info attach_info = {
+		.dev = attach_dev,
+		.dmabuf = dma_buf
+	};
 	struct dma_buf_attachment *attach;
 	struct sg_table *sgt;
 	struct drm_gem_object *obj;
@@ -727,7 +731,7 @@ struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
 	if (!dev->driver->gem_prime_import_sg_table)
 		return ERR_PTR(-EINVAL);
 
-	attach = dma_buf_attach(dma_buf, attach_dev);
+	attach = dma_buf_attach(&attach_info);
 	if (IS_ERR(attach))
 		return ERR_CAST(attach);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_dmabuf.c b/drivers/gpu/drm/i9
[PATCH 3/5] drm/ttm: remove the backing store if no placement is given
Pipeline removal of the BO's backing store when no placement is given during validation.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 98e06f8bf23b..17e821f01d0a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1078,6 +1078,18 @@ int ttm_bo_validate(struct ttm_buffer_object *bo,
 	uint32_t new_flags;
 
 	reservation_object_assert_held(bo->resv);
+
+	/*
+	 * Remove the backing store if no placement is given.
+	 */
+	if (!placement->num_placement && !placement->num_busy_placement) {
+		ret = ttm_bo_pipeline_gutting(bo);
+		if (ret)
+			return ret;
+
+		return ttm_tt_create(bo, false);
+	}
+
 	/*
 	 * Check whether we need to move buffer.
 	 */
-- 
2.14.1
[PATCH 4/5] drm/amdgpu: add independent DMA-buf export v2
Instead of relying on the DRM functions just implement our own export functions. This adds support for taking care of unpinned DMA-buf.

v2: fix unintended recursion, remove debugging leftovers

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |   5 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c  | 108 ++++++++++++---------
 4 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 2babfad1fd7f..3e1727251edc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -371,7 +371,6 @@ int amdgpu_gem_object_open(struct drm_gem_object *obj,
 void amdgpu_gem_object_close(struct drm_gem_object *obj,
 			     struct drm_file *file_priv);
 unsigned long amdgpu_gem_timeout(uint64_t timeout_ns);
-struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj);
 struct drm_gem_object *
 amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
 				 struct dma_buf_attachment *attach,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 0b19482b36b8..6107b845d8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -890,7 +890,6 @@ static struct drm_driver kms_driver = {
 	.gem_prime_export = amdgpu_gem_prime_export,
 	.gem_prime_import = amdgpu_gem_prime_import,
 	.gem_prime_res_obj = amdgpu_gem_prime_res_obj,
-	.gem_prime_get_sg_table = amdgpu_gem_prime_get_sg_table,
 	.gem_prime_import_sg_table = amdgpu_gem_prime_import_sg_table,
 	.gem_prime_vmap = amdgpu_gem_prime_vmap,
 	.gem_prime_vunmap = amdgpu_gem_prime_vunmap,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 9e23d6f6f3f3..15bb48cebbc4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -31,6 +31,7 @@
  */
 #include
 #include
+#include
 #include
 #include
 #include
@@ -953,6 +954,10 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
 	amdgpu_bo_kunmap(abo);
 
+	if (abo->gem_base.dma_buf && !abo->gem_base.import_attach &&
+	    bo->mem.mem_type != TTM_PL_SYSTEM)
+		dma_buf_invalidate_mappings(abo->gem_base.dma_buf);
+
 	/* remember the eviction */
 	if (evict)
 		atomic64_inc(&adev->num_evictions);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
index 4b584cb75bf4..7fef95f0fed1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
@@ -32,14 +32,6 @@
 static const struct dma_buf_ops amdgpu_dmabuf_ops;
 
-struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj)
-{
-	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
-	int npages = bo->tbo.num_pages;
-
-	return drm_prime_pages_to_sg(bo->tbo.ttm->pages, npages);
-}
-
 void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj)
 {
 	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
@@ -126,23 +118,22 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
 	return ERR_PTR(ret);
 }
 
-static int amdgpu_gem_map_attach(struct dma_buf *dma_buf,
-				 struct device *target_dev,
-				 struct dma_buf_attachment *attach)
+static struct sg_table *
+amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach,
+		       enum dma_data_direction dir)
 {
+	struct dma_buf *dma_buf = attach->dmabuf;
 	struct drm_gem_object *obj = dma_buf->priv;
 	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+	struct sg_table *sgt;
 	long r;
 
-	r = drm_gem_map_attach(dma_buf, target_dev, attach);
-	if (r)
-		return r;
-
-	r = amdgpu_bo_reserve(bo, false);
-	if (unlikely(r != 0))
-		goto error_detach;
-
+	if (!attach->invalidate) {
+		r = amdgpu_bo_reserve(bo, false);
+		if (unlikely(r != 0))
+			return ERR_PTR(r);
+	}
 
 	if (attach->dev->driver != adev->dev->driver) {
 		/*
@@ -158,42 +149,77 @@ static int amdgpu_gem_map_attach(struct dma_buf *dma_buf,
 	}
 
-	/* pin buffer into GTT */
-	r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL);
-	if (r)
+	if (!attach->invalidate) {
+		/* pin buffer into GTT */
+		r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL);
+		if (r)
+			goto er
[PATCH 2/5] dma-buf: add optional invalidate_mappings callback v3
Each importer can now provide an invalidate_mappings callback. This allows the exporter to provide the mappings without the need to pin the backing store.

v2: don't try to invalidate mappings when the callback is NULL, lock the reservation obj while using the attachments, add helper to set the callback
v3: move flag for invalidation support into the DMA-buf, use new attach_info structure to set the callback
v4: use importer_priv field instead of mangling exporter priv.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/dma-buf.c | 44 ++++++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h   | 36 ++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 4b46982c6d9c..ffdaab10e2f2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -565,6 +565,8 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
 	attach->dev = info->dev;
 	attach->dmabuf = dmabuf;
+	attach->importer_priv = info->importer_priv;
+	attach->invalidate = info->invalidate;
 
 	mutex_lock(&dmabuf->lock);
 
@@ -573,7 +575,9 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
 		if (ret)
 			goto err_attach;
 	}
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_add(&attach->node, &dmabuf->attachments);
+	reservation_object_unlock(dmabuf->resv);
 
 	mutex_unlock(&dmabuf->lock);
 	return attach;
@@ -599,7 +603,9 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach)
 		return;
 
 	mutex_lock(&dmabuf->lock);
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_del(&attach->node);
+	reservation_object_unlock(dmabuf->resv);
 	if (dmabuf->ops->detach)
 		dmabuf->ops->detach(dmabuf, attach);
 
@@ -633,10 +639,23 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 	if (WARN_ON(!attach || !attach->dmabuf))
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * Mapping a DMA-buf can trigger its invalidation, prevent sending this
+	 * event to the caller by temporarily removing this attachment from the
+	 * list.
+	 */
+	if (attach->invalidate) {
+		reservation_object_assert_held(attach->dmabuf->resv);
+		list_del(&attach->node);
+	}
+
 	sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
 	if (!sg_table)
 		sg_table = ERR_PTR(-ENOMEM);
 
+	if (attach->invalidate)
+		list_add(&attach->node, &attach->dmabuf->attachments);
+
 	return sg_table;
 }
 EXPORT_SYMBOL_GPL(dma_buf_map_attachment);
@@ -657,6 +676,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
+	if (attach->invalidate)
+		reservation_object_assert_held(attach->dmabuf->resv);
+
 	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
 		return;
 
@@ -665,6 +687,26 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
 
+/**
+ * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
+ *
+ * @dmabuf:	[in]	buffer which mappings should be invalidated
+ *
+ * Informs all attachments that they need to destroy and recreate all their
+ * mappings.
+ */
+void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
+{
+	struct dma_buf_attachment *attach;
+
+	reservation_object_assert_held(dmabuf->resv);
+
+	list_for_each_entry(attach, &dmabuf->attachments, node)
+		if (attach->invalidate)
+			attach->invalidate(attach);
+}
+EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
+
 /**
  * DOC: cpu access
  *
@@ -1122,10 +1164,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		seq_puts(s, "\tAttached Devices:\n");
 		attach_count = 0;
 
+		reservation_object_lock(buf_obj->resv, NULL);
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
 			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
 			attach_count++;
 		}
+		reservation_object_unlock(buf_obj->resv);
 
 		seq_printf(s, "Total %d devices attached\n\n", attach_count);
 
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 414b4dde5eb7..566503dd2b4f 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -270,6 +270,8 @@ struct dma_buf_ops {
  * @poll: for userspace poll support
  * @cb_excl: for userspace poll support
  * @cb_shared: for userspace po
Re: [PATCH v2] Add udmabuf misc device
On 06.04.2018 at 11:33, Gerd Hoffmann wrote:

Hi,

[Christian:] The pages backing a DMA-buf are not allowed to move (at least not without a patch set I'm currently working on), but for certain MM operations to work correctly you must be able to modify the page table entries and move the backing pages around. For example, try to use fork() with some copy-on-write pages with this approach. You will find that you have only two options to correctly handle this.

[Gerd:] The fork() issue should go away with shared memory pages (no COW). I guess this is the reason why vgem is internally backed by shmem.

[Christian:] Yes, exactly; that is also an approach which should work fine. Just don't try to get this working with get_user_pages().

[Gerd:] Hmm. So I could try to limit the udmabuf driver to shmem too (i.e. have the ioctl take a shmem filehandle and offset instead of a virtual address). But maybe it is better then to just extend vgem, i.e. add support for creating gem objects from existing shmem. Comments?

[Christian:] Yes, extending vgem instead of creating something new sounds like a good idea to me as well.

Regards,
Christian.

cheers,
  Gerd
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
Am 29.03.2018 um 18:25 schrieb Logan Gunthorpe: On 29/03/18 10:10 AM, Christian König wrote: Why not? I mean the dma_map_resource() function is for P2P while other dma_map_* functions are only for system memory. Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping P2P. Though it's a bit odd seeing we've been working under the assumption that PCI P2P is different as it has to translate the PCI bus address. Where as P2P for devices on other buses is a big unknown. Yeah, completely agree. On my TODO list (but rather far down) is actually supporting P2P with USB devices. And no, I don't have the slightest idea how to do this at the moment. And this is necessary to check if the DMA ops in use support it or not. We can't have the dma_map_X() functions do the wrong thing because they don't support it yet. Well that sounds like we should just return an error from dma_map_resources() when an architecture doesn't support P2P yet as Alex suggested. Yes, well except in our patch-set we can't easily use dma_map_resources() as we either have SGLs to deal with or we need to create whole new interfaces to a number of subsystems. Agree as well. I was also in clear favor of extending the SGLs to have a flag for this instead of the dma_map_resource() interface, but for some reason that didn't made it into the kernel. You don't seem to understand the implications: The devices do have a common upstream bridge! In other words your code would currently claim that P2P is supported, but in practice it doesn't work. Do they? They don't on any of the Intel machines I'm looking at. The previous version of the patchset not only required a common upstream bridge but two layers of upstream bridges on both devices which would effectively limit transfers to PCIe switches only. But Bjorn did not like this. At least to me that sounds like a good idea, it would at least disable (the incorrect) auto detection of P2P for such devices. 
You need to include both drivers which participate in the P2P transaction to make sure that both support this, and give them the opportunity to chicken out and, in the case of AMD APUs, even redirect the request to another location (e.g. participate in the DMA translation).

I don't think it's the driver's responsibility to reject P2P. The topology is what governs support or not. The discussions we had with Bjorn settled on: if the devices are all behind the same bridge they can communicate with each other. This is essentially guaranteed by the PCI spec.

Well it is not only rejecting P2P. See, the devices I need to worry about are essentially part of the CPU. Their resources look like a PCI BAR to the BIOS and OS, but are actually backed by stolen system memory. So as crazy as it sounds, what you get is an operation which starts as P2P, but then the GPU driver sees it and says: Hey, please don't write that to my PCIe BAR, but rather to system memory location X. DMA-buf fortunately seems to handle all this already, that's why we chose it as the base for our implementation.

Well, unfortunately DMA-buf doesn't help for the drivers we are working with, as neither the block layer nor the RDMA subsystem have any interfaces for it.

A fact that gives me quite some sleepless nights as well. I think we sooner or later need to extend those interfaces to work with DMA-bufs as well. I will try to give your patch set a review when I'm back from vacation and rebase my DMA-buf work on top of that.

Regards, Christian. Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 29.03.2018 at 17:45, Logan Gunthorpe wrote: On 29/03/18 05:44 AM, Christian König wrote: On 28.03.2018 at 21:53, Logan Gunthorpe wrote: On 28/03/18 01:44 PM, Christian König wrote:

Well, isn't that exactly what dma_map_resource() is good for? As far as I can see it makes sure the IOMMU is aware of the access route and translates a CPU address into a PCI Bus address. I'm using that with the AMD IOMMU driver and at least there it works perfectly fine.

Yes, it would be nice, but no arch has implemented this yet. We are just lucky in the x86 case because that arch is simple and doesn't need to do anything for P2P (partially due to the Bus and CPU addresses being the same). But in the general case, you can't rely on it.

Well, that an arch hasn't implemented it doesn't mean that we don't have the right interface to do it.

Yes, but right now we don't have a performant way to check if we are doing P2P or not in the dma_map_X() wrappers.

Why not? I mean the dma_map_resource() function is for P2P while other dma_map_* functions are only for system memory.

And this is necessary to check if the DMA ops in use support it or not. We can't have the dma_map_X() functions do the wrong thing because they don't support it yet.

Well that sounds like we should just return an error from dma_map_resources() when an architecture doesn't support P2P yet, as Alex suggested.

Devices integrated in the CPU usually only "claim" to be PCIe devices. In reality their memory request paths go directly through the integrated north bridge. The reason for this is simply better throughput/latency.

These are just more reasons why our patchset restricts to devices behind a switch. And more mess for someone to deal with if they need to relax that restriction.

You don't seem to understand the implications: The devices do have a common upstream bridge! In other words your code would currently claim that P2P is supported, but in practice it doesn't work.
You need to include both drivers which participate in the P2P transaction to make sure that both support this, and give them the opportunity to chicken out and, in the case of AMD APUs, even redirect the request to another location (e.g. participate in the DMA translation). DMA-buf fortunately seems to handle all this already, that's why we chose it as the base for our implementation. Regards, Christian.
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28.03.2018 at 21:53, Logan Gunthorpe wrote: On 28/03/18 01:44 PM, Christian König wrote:

Well, isn't that exactly what dma_map_resource() is good for? As far as I can see it makes sure the IOMMU is aware of the access route and translates a CPU address into a PCI Bus address. I'm using that with the AMD IOMMU driver and at least there it works perfectly fine.

Yes, it would be nice, but no arch has implemented this yet. We are just lucky in the x86 case because that arch is simple and doesn't need to do anything for P2P (partially due to the Bus and CPU addresses being the same). But in the general case, you can't rely on it.

Well, that an arch hasn't implemented it doesn't mean that we don't have the right interface to do it.

Yeah, but not for ours. See, if you want to do real peer 2 peer you need to take both the operation as well as the direction into account.

Not sure what you are saying here... I'm pretty sure we are doing "real" peer 2 peer...

For example when you can do writes between A and B that doesn't mean that writes between B and A work. And reads are generally less likely to work than writes. etc...

If both devices are behind a switch then the PCI spec guarantees that A can both read and write B and vice versa.

Sorry to say that, but I know a whole bunch of PCI devices which horribly ignore that.

Can you elaborate? As far as the device is concerned it shouldn't know whether a request comes from a peer or from the host. If it does do crazy stuff like that it's well out of spec. It's up to the switch (or root complex if good support exists) to route the request to the device, and it's the root complex that tends to be what drops the load requests which causes the asymmetries.

Devices integrated in the CPU usually only "claim" to be PCIe devices. In reality their memory request paths go directly through the integrated north bridge. The reason for this is simply better throughput/latency.
That is hidden from the software; for example the BIOS just allocates address space for the BARs as if it's a normal PCIe device. The only crux is that when you then do peer2peer, your requests simply go into nirvana and are not handled by anything because the BARs are only visible from the CPU side of the northbridge. Regards, Christian. Logan ___ amd-gfx mailing list amd-...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 4/8] dma-buf: add peer2peer flag
On 29.03.2018 at 08:57, Daniel Vetter wrote: On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:

Add a peer2peer flag noting that the importer can deal with device resources which are not backed by pages. Signed-off-by: Christian König <christian.koe...@amd.com>

Um, strictly speaking they all should, but ttm never bothered to use the real interfaces and instead just hacked around the provided sg list, grabbing the underlying struct pages, then rebuilding the sg list again.

Actually that isn't correct. TTM converts them to a dma address array because drivers need it like this (at least nouveau, radeon and amdgpu). I've fixed radeon and amdgpu to be able to deal without it and mailed with Ben about nouveau, but the outcome is they don't really know. TTM itself doesn't have any need for the pages on imported BOs (you can't mmap them anyway); the real underlying problem is that sg tables don't provide what drivers need. I think we could rather easily fix sg tables, but that is a totally separate task.

The entire point of using sg lists was exactly to allow this use case of peer2peer dma (or well, in general have special exporters which manage memory/IO ranges not backed by struct page). So essentially you're having an "I'm totally not broken" flag here.

No, independent of needed struct page pointers we need to note if the exporter can handle peer2peer stuff from the hardware side in general. So what I did is just to set peer2peer allowed on the importer because of the driver needs, and clear it in the exporter if the hardware can't handle that.

I think a better approach would be if we add a requires_struct_page or so, and annotate the current importers accordingly. Or we just fix them up (it is all in shared ttm code after all; I think everyone else got this right).

I would rather not bet on that. Christian.
-Daniel

---
 drivers/dma-buf/dma-buf.c | 1 +
 include/linux/dma-buf.h   | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ffaa2f9a9c2c..f420225f93c6 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
 	attach->dev = info->dev;
 	attach->dmabuf = dmabuf;
+	attach->peer2peer = info->peer2peer;
 	attach->priv = info->priv;
 	attach->invalidate = info->invalidate;
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 15dd8598bff1..1ef50bd9bc5b 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -313,6 +313,7 @@ struct dma_buf {
  * @dmabuf: buffer for this attachment.
  * @dev: device attached to the buffer.
  * @node: list of dma_buf_attachment.
+ * @peer2peer: true if the importer can handle peer resources without pages.
  * @priv: exporter specific attachment data.
  *
  * This structure holds the attachment information between the dma_buf buffer
@@ -328,6 +329,7 @@ struct dma_buf_attachment {
 	struct dma_buf *dmabuf;
 	struct device *dev;
 	struct list_head node;
+	bool peer2peer;
 	void *priv;
 
 	/**
@@ -392,6 +394,7 @@ struct dma_buf_export_info {
 * @dmabuf: the exported dma_buf
 * @dev: the device which wants to import the attachment
 * @priv: private data of importer to this attachment
+ * @peer2peer: true if the importer can handle peer resources without pages
 * @invalidate: callback to use for invalidating mappings
 *
 * This structure holds the information required to attach to a buffer. Used
@@ -401,6 +404,7 @@ struct dma_buf_attach_info {
 	struct dma_buf *dmabuf;
 	struct device *dev;
 	void *priv;
+	bool peer2peer;
 	void (*invalidate)(struct dma_buf_attachment *attach);
 };
-- 
2.14.1

___ dri-devel mailing list dri-de...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28.03.2018 at 20:57, Logan Gunthorpe wrote: On 28/03/18 12:28 PM, Christian König wrote:

I'm just using amdgpu as blueprint because I'm the co-maintainer of it and know it mostly inside out.

Ah, I see.

The resource addresses are translated using dma_map_resource(). As far as I know that should be sufficient to offload all the architecture specific stuff to the DMA subsystem.

It's not. The dma_map infrastructure currently has no concept of peer-to-peer mappings and is designed for system memory only. No architecture I'm aware of will translate PCI CPU addresses into PCI Bus addresses, which is necessary for any transfer that doesn't go through the root complex (though on arches like x86 the CPU and Bus address happen to be the same). There's a lot of people that would like to see this change, but it's likely going to be a long road before it does.

Well, isn't that exactly what dma_map_resource() is good for? As far as I can see it makes sure the IOMMU is aware of the access route and translates a CPU address into a PCI Bus address.

Furthermore, one of the reasons our patch-set avoids going through the root complex at all is that IOMMU drivers will need to be made aware that it is operating on P2P memory and do arch-specific things accordingly. There will also need to be flags that indicate whether a given IOMMU driver supports this. None of this work is done or easy.

I'm using that with the AMD IOMMU driver and at least there it works perfectly fine.

Yeah, but not for ours. See, if you want to do real peer 2 peer you need to take both the operation as well as the direction into account.

Not sure what you are saying here... I'm pretty sure we are doing "real" peer 2 peer...

For example when you can do writes between A and B that doesn't mean that writes between B and A work. And reads are generally less likely to work than writes. etc...

If both devices are behind a switch then the PCI spec guarantees that A can both read and write B and vice versa.
Sorry to say that, but I know a whole bunch of PCI devices which horribly ignore that. For example all AMD APUs fall under that category...

Only once you involve root complexes do you have this problem. I.e. you have unknown support, which may be no support, or partial support (stores but not loads); or sometimes bad performance; or a combination of both... and you need some way to figure out all this mess, and that is hard. Whoever tries to implement a white list will have to sort all this out.

Yes, exactly, and unfortunately it looks like I'm the poor guy who needs to do this :)

Regards, Christian. Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28.03.2018 at 18:25, Logan Gunthorpe wrote: On 28/03/18 10:02 AM, Christian König wrote:

Yeah, that looks very similar to what I picked up from the older patches, going to read up on that after my vacation.

Yeah, I was just reading through your patchset and there are a lot of similarities. Though, I'm not sure what you're trying to accomplish as I could not find a cover letter and it seems to only enable one driver.

Yeah, it was the last day before my Easter vacation and I wanted it out of the door.

Is it meant to enable DMA transactions only between two AMD GPUs?

Not really, DMA-buf is a general framework for sharing buffers between device drivers. It is widely used in the GFX stack on laptops with both Intel+AMD, Intel+NVIDIA or AMD+AMD graphics devices. In addition to that, ARM uses it quite massively for their GFX stacks because they have rendering and display devices separated. I'm just using amdgpu as blueprint because I'm the co-maintainer of it and know it mostly inside out.

I also don't see where you've taken into account the PCI bus address. On some architectures this is not the same as the CPU physical address.

The resource addresses are translated using dma_map_resource(). As far as I know that should be sufficient to offload all the architecture specific stuff to the DMA subsystem.

Just in general why are you interested in the "distance" of the devices?

We've taken a general approach where some drivers may provide p2p memory (ie. an NVMe card or an RDMA NIC) and other drivers make use of it (ie. the NVMe-of driver). The orchestrator driver needs to find the most applicable provider device for a transaction in a situation that may have multiple providers and multiple clients. So the most applicable provider is the one that's closest ("distance"-wise) to all the clients for the P2P transaction.

That seems to make sense.
And BTW: At least for writes, that Peer 2 Peer transactions between different root complexes work is actually more common than the other way around.

Maybe on x86 with hardware made in the last few years. But on PowerPC, ARM64, and likely a lot more the chance of support is *much* less. Also, hardware that only supports P2P stores is hardly full support and is insufficient for our needs.

Yeah, but not for ours. See, if you want to do real peer 2 peer you need to take both the operation as well as the direction into account. For example when you can do writes between A and B that doesn't mean that writes between B and A work. And reads are generally less likely to work than writes. etc... Since the use case I'm targeting is GFX or GFX+V4L (or GFX+NIC in the future) I really need to handle all such use cases as well. So I'm a bit torn between using a blacklist or a whitelist. A whitelist is certainly the more conservative approach, but that could get a bit long.

I think a whitelist approach is correct. Given old hardware and other architectures, a black list is going to be too long and too difficult to comprehensively populate.

Yeah, it would certainly be better if we had something in the root complex capabilities. But you're right that a whitelist sounds the less painful way.

Regards, Christian. Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28.03.2018 at 17:47, Logan Gunthorpe wrote: On 28/03/18 09:07 AM, Christian König wrote: On 28.03.2018 at 14:38, Christoph Hellwig wrote: On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:

From: "wda...@nvidia.com" <wda...@nvidia.com> Add an interface to find the first device which is upstream of both devices.

Please work with Logan and base this on top of the outstanding peer to peer patchset.

Can you point me to that? The last code I could find about that was from 2015.

The latest posted series is here: https://lkml.org/lkml/2018/3/12/830 However, we've made some significant changes to the area that's similar to what you are doing. You can find the latest unposted version here: https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2 Specifically this function would be of interest to you: https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239 However, the difference between what we are doing is that we are interested in the distance through the common upstream device, and you appear to be finding the actual common device.

Yeah, that looks very similar to what I picked up from the older patches, going to read up on that after my vacation. Just in general why are you interested in the "distance" of the devices?

And BTW: At least for writes, that Peer 2 Peer transactions between different root complexes work is actually more common than the other way around. So I'm a bit torn between using a blacklist or a whitelist. A whitelist is certainly the more conservative approach, but that could get a bit long.

Thanks, Christian. Thanks, Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28.03.2018 at 14:38, Christoph Hellwig wrote: On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:

From: "wda...@nvidia.com" <wda...@nvidia.com> Add an interface to find the first device which is upstream of both devices.

Please work with Logan and base this on top of the outstanding peer to peer patchset.

Can you point me to that? The last code I could find about that was from 2015. Thanks, Christian.
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
On 27.03.2018 at 09:53, Daniel Vetter wrote: [SNIP] [SNIP]

A slightly better solution is using an atomic counter: driver_range_start() { atomic_inc(>notifier_count); ...

Yeah, that is exactly what amdgpu is doing now. Sorry if my description didn't make that clear.

I would like to see drivers using the same code, as it means one place to fix issues. I have had doing the above conversion to the amd or radeon kernel driver on my TODO list for a long time. I am pushing it up my todo list; hopefully in the next few weeks I can send an RFC so people can have a real sense of how it can look.

Certainly a good idea, but I think we might want to keep that separate from HMM. TTM suffered really from feature overload, e.g. trying to do everything in a single subsystem. And it would be rather nice to have coherent userptr handling for GPUs as a separate feature.

TTM suffered from being a midlayer imo, not from doing too much.

Yeah, completely agree. Midlayers work as long as you concentrate on doing exactly one thing in your midlayer. E.g. in the case of TTM the callback for BO move handling is well justified. Only all the stuff around it like address space handling etc... is really wrongly designed and should be separated (which is exactly what DRM MM did, but TTM still uses this in the wrong way).

HMM is apparently structured like a toolbox (despite its documentation claiming otherwise), so you can pick freely.

That sounds good, but I would still have a better feeling if userptr handling were separated. That avoids mangling things together again. Christian. -Daniel
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
On 26.03.2018 at 17:42, Jerome Glisse wrote: On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote: On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote: On 22.03.2018 at 08:18, Daniel Vetter wrote: [SNIP]

Key take-away from that was that you can't take any locks from either the MMU notifier or the shrinker which you also take while calling kmalloc (or simpler speaking get_user_pages()). Additionally, in the MMU or shrinker callback all different kinds of locks might be held, so you basically can't assume that you can do things like recursive page table walks or call dma_unmap_anything.

That sounds like a design bug in mmu_notifiers, since it would render them useless for KVM. And they were developed for that originally. I think I'll chat with Jerome to understand this, since it's all rather confusing.

Doing dma_unmap() during a mmu_notifier callback should be fine, it was last time I checked. However there is no formal contract that it is ok to do so.

As I said before, dma_unmap() isn't the real problem here. The issue is more that you can't take a lock in the MMU notifier which you would also take while allocating memory without GFP_NOIO. That makes it rather tricky to do any command submission, e.g. you need to grab all the pages/memory/resources beforehand, then make sure that you don't have a MMU notifier running concurrently, and do the submission. If any of the prerequisites isn't fulfilled we need to restart the operation.

[SNIP]

A slightly better solution is using an atomic counter: driver_range_start() { atomic_inc(>notifier_count); ...

Yeah, that is exactly what amdgpu is doing now. Sorry if my description didn't make that clear.

I would like to see drivers using the same code, as it means one place to fix issues. I have had doing the above conversion to the amd or radeon kernel driver on my TODO list for a long time. I am pushing it up my todo list; hopefully in the next few weeks I can send an RFC so people can have a real sense of how it can look.
Certainly a good idea, but I think we might want to keep that separate from HMM. TTM suffered really from feature overload, e.g. trying to do everything in a single subsystem. And it would be rather nice to have coherent userptr handling for GPUs as a separate feature. Regards, Christian.
Re: [PATCH] dma-buf: use parameter structure for dma_buf_attach
On 26.03.2018 at 10:36, Daniel Vetter wrote: On Sun, Mar 25, 2018 at 01:34:51PM +0200, Christian König wrote: [SNIP]

- attach->dev = dev; + attach->dev = info->dev; attach->dmabuf = dmabuf; + attach->priv = info->priv;

The ->priv field is for the exporter, not the importer. See e.g. drm_gem_map_attach. You can't let the importer set this now too, so it needs to be removed from the info struct.

Crap, in this case I need to add an importer_priv field, because we now need to map from the attachment to its importer object as well. Thanks for noticing this. Regards, Christian.
[PATCH] dma-buf: use parameter structure for dma_buf_attach
Move the parameters into a structure to make it simpler to extend it in follow up patches. This also adds the importer private as parameter so that we can directly work with a completely filled in attachment structure.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/dma-buf.c                             | 16 +---
 drivers/gpu/drm/armada/armada_gem.c                   |  6 +-
 drivers/gpu/drm/drm_prime.c                           |  6 +-
 drivers/gpu/drm/i915/i915_gem_dmabuf.c                |  6 +-
 drivers/gpu/drm/tegra/gem.c                           |  6 +-
 drivers/gpu/drm/udl/udl_dmabuf.c                      |  6 +-
 drivers/media/common/videobuf2/videobuf2-dma-contig.c |  6 +-
 drivers/media/common/videobuf2/videobuf2-dma-sg.c     |  6 +-
 drivers/staging/media/tegra-vde/tegra-vde.c           |  6 +-
 include/linux/dma-buf.h                               | 19 +--
 10 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..d2e8ca0d9427 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -534,8 +534,9 @@ EXPORT_SYMBOL_GPL(dma_buf_put);
 /**
  * dma_buf_attach - Add the device to dma_buf's attachments list; optionally,
  * calls attach() of dma_buf_ops to allow device-specific attach functionality
- * @dmabuf:	[in]	buffer to attach device to.
- * @dev:	[in]	device to be attached.
+ * @info:	[in]	holds all the attach related information provided
+ *			by the importer. see dma_buf_attach_info
+ *			for further details.
  *
  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
  * must be cleaned up by calling dma_buf_detach().
@@ -549,26 +550,27 @@ EXPORT_SYMBOL_GPL(dma_buf_put);
  * accessible to @dev, and cannot be moved to a more suitable place. This is
  * indicated with the error code -EBUSY.
  */
-struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
-					  struct device *dev)
+struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info)
 {
+	struct dma_buf *dmabuf = info->dmabuf;
 	struct dma_buf_attachment *attach;
 	int ret;
 
-	if (WARN_ON(!dmabuf || !dev))
+	if (WARN_ON(!dmabuf || !info->dev))
 		return ERR_PTR(-EINVAL);
 
 	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
 	if (!attach)
 		return ERR_PTR(-ENOMEM);
 
-	attach->dev = dev;
+	attach->dev = info->dev;
 	attach->dmabuf = dmabuf;
+	attach->priv = info->priv;
 
 	mutex_lock(&dmabuf->lock);
 
 	if (dmabuf->ops->attach) {
-		ret = dmabuf->ops->attach(dmabuf, dev, attach);
+		ret = dmabuf->ops->attach(dmabuf, info->dev, attach);
 		if (ret)
 			goto err_attach;
 	}
diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
index a97f509743a5..f4d1c11f57ea 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -514,6 +514,10 @@ armada_gem_prime_export(struct drm_device *dev, struct drm_gem_object *obj,
 struct drm_gem_object *
 armada_gem_prime_import(struct drm_device *dev, struct dma_buf *buf)
 {
+	struct dma_buf_attach_info attach_info = {
+		.dev = dev->dev,
+		.dmabuf = buf
+	};
 	struct dma_buf_attachment *attach;
 	struct armada_gem_object *dobj;
 
@@ -529,7 +533,7 @@ armada_gem_prime_import(struct drm_device *dev, struct dma_buf *buf)
 		}
 	}
 
-	attach = dma_buf_attach(buf, dev->dev);
+	attach = dma_buf_attach(&attach_info);
 	if (IS_ERR(attach))
 		return ERR_CAST(attach);
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 7856a9b3f8a8..4da242de51c2 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -707,6 +707,10 @@ struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
 					    struct dma_buf *dma_buf,
 					    struct device *attach_dev)
 {
+	struct dma_buf_attach_info attach_info = {
+		.dev = attach_dev,
+		.dmabuf = dma_buf
+	};
 	struct dma_buf_attachment *attach;
 	struct sg_table *sgt;
 	struct drm_gem_object *obj;
@@ -727,7 +731,7 @@ struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
 	if (!dev->driver->gem_prime_import_sg_table)
 		return ERR_PTR(-EINVAL);
 
-	attach = dma_buf_attach(dma_buf, attach_dev);
+	attach = dma_buf_attach(&attach_info);
 	if (IS_ERR(attach))
 		return ERR_CAST(attach);
diff --git a/drivers/gpu/drm/i915
[PATCH 8/8] drm/amdgpu: add support for exporting VRAM using DMA-buf
We should be able to do this now after checking all the prerequisites.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c    | 44 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h      |  9 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 91
 3 files changed, 135 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
index 133596df0775..86d983a0678b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
@@ -197,23 +197,44 @@ amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach,
 	} else {
 		/* move buffer into GTT */
 		struct ttm_operation_ctx ctx = { false, false };
+		unsigned domains = AMDGPU_GEM_DOMAIN_GTT;
 
-		amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
+		if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM &&
+		    attach->peer2peer) {
+			bo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
+			domains |= AMDGPU_GEM_DOMAIN_VRAM;
+		}
+		amdgpu_ttm_placement_from_domain(bo, domains);
 		r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
 		if (r)
 			goto error_unreserve;
 	}
 
-	sgt = drm_prime_pages_to_sg(bo->tbo.ttm->pages, bo->tbo.num_pages);
-	if (IS_ERR(sgt)) {
-		r = PTR_ERR(sgt);
+	switch (bo->tbo.mem.mem_type) {
+	case TTM_PL_TT:
+		sgt = drm_prime_pages_to_sg(bo->tbo.ttm->pages,
+					    bo->tbo.num_pages);
+		if (IS_ERR(sgt)) {
+			r = PTR_ERR(sgt);
+			goto error_unreserve;
+		}
+
+		if (!dma_map_sg_attrs(attach->dev, sgt->sgl, sgt->nents, dir,
+				      DMA_ATTR_SKIP_CPU_SYNC))
+			goto error_free;
+		break;
+
+	case TTM_PL_VRAM:
+		r = amdgpu_vram_mgr_alloc_sgt(adev, &bo->tbo.mem, attach->dev,
+					      dir, &sgt);
+		if (r)
+			goto error_unreserve;
+		break;
+	default:
+		r = -EINVAL;
 		goto error_unreserve;
 	}
 
-	if (!dma_map_sg_attrs(attach->dev, sgt->sgl, sgt->nents, dir,
-			      DMA_ATTR_SKIP_CPU_SYNC))
-		goto error_free;
-
 	if (attach->dev->driver != adev->dev->driver)
 		bo->prime_shared_count++;
@@ -254,10 +275,15 @@ static void amdgpu_gem_unmap_dma_buf(struct dma_buf_attachment *attach,
 	if (!attach->invalidate)
 		amdgpu_bo_unreserve(bo);
 
-	if (sgt) {
+	if (!sgt)
+		return;
+
+	if (sgt->sgl->page_link) {
 		dma_unmap_sg(attach->dev, sgt->sgl, sgt->nents, dir);
 		sg_free_table(sgt);
 		kfree(sgt);
+	} else {
+		amdgpu_vram_mgr_free_sgt(adev, attach->dev, dir, sgt);
 	}
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 6ea7de863041..b483900abed2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -73,6 +73,15 @@ bool amdgpu_gtt_mgr_has_gart_addr(struct ttm_mem_reg *mem);
 uint64_t amdgpu_gtt_mgr_usage(struct ttm_mem_type_manager *man);
 int amdgpu_gtt_mgr_recover(struct ttm_mem_type_manager *man);
 
+int amdgpu_vram_mgr_alloc_sgt(struct amdgpu_device *adev,
+			      struct ttm_mem_reg *mem,
+			      struct device *dev,
+			      enum dma_data_direction dir,
+			      struct sg_table **sgt);
+void amdgpu_vram_mgr_free_sgt(struct amdgpu_device *adev,
+			      struct device *dev,
+			      enum dma_data_direction dir,
+			      struct sg_table *sgt);
 uint64_t amdgpu_vram_mgr_usage(struct ttm_mem_type_manager *man);
 uint64_t amdgpu_vram_mgr_vis_usage(struct ttm_mem_type_manager *man);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 9aca653bec07..eb8f75525a81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -233,6 +233,97 @@ static void amdgpu_vram_mgr_del(struct ttm_mem_type_manager *man,
 	mem->mm_node = NULL;
 }
 
+/**
+ * amdgpu_vram_mgr_alloc_sgt - allocate and fill a sg table
+ *
+ * @adev: amdgpu device pointer
+ * @mem: TTM memory object
+ * @dev: the other device
+ * @dir: dma direction
+ * @sgt: resulting sg table
+ *
[PATCH 1/8] lib/scatterlist: add sg_set_dma_addr() helper
Use this function to set an sg entry to point to device resources mapped using dma_map_resource(). The page pointer is set to NULL and only the DMA address, length and offset values are valid.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 include/linux/scatterlist.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 22b2131bcdcd..f944ee4e482c 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -149,6 +149,29 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
 	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
 }
 
+/**
+ * sg_set_dma_addr - Set sg entry to point at specified dma address
+ * @sg:		SG entry
+ * @address:	DMA address to set
+ * @len:	Length of data
+ * @offset:	Offset into page
+ *
+ * Description:
+ *   Use this function to set an sg entry to point to device resources mapped
+ *   using dma_map_resource(). The page pointer is set to NULL and only the DMA
+ *   address, length and offset values are valid.
+ *
+ **/
+static inline void sg_set_dma_addr(struct scatterlist *sg, dma_addr_t address,
+				   unsigned int len, unsigned int offset)
+{
+	sg_set_page(sg, NULL, len, offset);
+	sg->dma_address = address;
+#ifdef CONFIG_NEED_SG_DMA_LENGTH
+	sg->dma_length = len;
+#endif
+}
+
 /*
  * Loop over each sg element, following the pointer to a new list if necessary
  */
-- 
2.14.1
[PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
From: "wda...@nvidia.com" <wda...@nvidia.com> Add an interface to find the first device which is upstream of both devices. Signed-off-by: Will Davis <wda...@nvidia.com> Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/pci/search.c | 24 include/linux/pci.h | 2 ++ 2 files changed, 26 insertions(+) diff --git a/drivers/pci/search.c b/drivers/pci/search.c index bc1e023f1353..446648f4238b 100644 --- a/drivers/pci/search.c +++ b/drivers/pci/search.c @@ -393,3 +393,27 @@ int pci_dev_present(const struct pci_device_id *ids) return 0; } EXPORT_SYMBOL(pci_dev_present); + +/** + * pci_find_common_upstream_dev - Returns the first common upstream device + * @dev: the PCI device from which the bus hierarchy walk will start + * @other: the PCI device whose bus number will be used for the search + * + * Walks up the bus hierarchy from the @dev device, looking for the first bus + * which the @other device is downstream of. Returns %NULL if the devices do + * not share any upstream devices. + */ +struct pci_dev *pci_find_common_upstream_dev(struct pci_dev *dev, +struct pci_dev *other) +{ + struct pci_dev *pdev = dev; + + while (pdev != NULL) { + if ((other->bus->number >= pdev->bus->busn_res.start) && + (other->bus->number <= pdev->bus->busn_res.end)) + return pdev; + pdev = pci_upstream_bridge(pdev); + } + + return NULL; +} diff --git a/include/linux/pci.h b/include/linux/pci.h index 024a1beda008..0d29f5cdcb04 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -956,6 +956,8 @@ static inline struct pci_dev *pci_get_bus_and_slot(unsigned int bus, } struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from); int pci_dev_present(const struct pci_device_id *ids); +struct pci_dev *pci_find_common_upstream_dev(struct pci_dev *from, +struct pci_dev *to); int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn, int where, u8 *val); -- 2.14.1
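The bus-number containment walk in pci_find_common_upstream_dev() above can be modelled in plain userspace C. This is an illustrative sketch only — `struct toy_dev` and `find_common_upstream` are made-up names, not the kernel API; the real code walks `pci_upstream_bridge()` and checks `busn_res`:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of pci_find_common_upstream_dev(): walk up from "dev"
 * until we reach a node whose downstream bus range contains the bus
 * number of "other". All types and names here are illustrative. */
struct toy_dev {
	int bus_number;              /* bus this device sits on */
	int busn_start, busn_end;    /* bus range reachable below this node */
	struct toy_dev *parent;      /* upstream bridge, NULL at the root */
};

static struct toy_dev *find_common_upstream(struct toy_dev *dev,
					    struct toy_dev *other)
{
	for (struct toy_dev *p = dev; p; p = p->parent)
		if (other->bus_number >= p->busn_start &&
		    other->bus_number <= p->busn_end)
			return p;
	return NULL;
}
```

With two endpoints behind the same switch the walk stops at that switch; otherwise it keeps climbing until a bridge (possibly the root) spans both, or returns NULL when the hierarchies are disjoint.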
[PATCH 4/8] dma-buf: add peer2peer flag
Add a peer2peer flag noting that the importer can deal with device resources which are not backed by pages. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/dma-buf.c | 1 + include/linux/dma-buf.h | 4 2 files changed, 5 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index ffaa2f9a9c2c..f420225f93c6 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info attach->dev = info->dev; attach->dmabuf = dmabuf; + attach->peer2peer = info->peer2peer; attach->priv = info->priv; attach->invalidate = info->invalidate; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 15dd8598bff1..1ef50bd9bc5b 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -313,6 +313,7 @@ struct dma_buf { * @dmabuf: buffer for this attachment. * @dev: device attached to the buffer. * @node: list of dma_buf_attachment. + * @peer2peer: true if the importer can handle peer resources without pages. * @priv: exporter specific attachment data. * * This structure holds the attachment information between the dma_buf buffer @@ -328,6 +329,7 @@ struct dma_buf_attachment { struct dma_buf *dmabuf; struct device *dev; struct list_head node; + bool peer2peer; void *priv; /** @@ -392,6 +394,7 @@ struct dma_buf_export_info { * @dmabuf:the exported dma_buf * @dev: the device which wants to import the attachment * @priv: private data of importer to this attachment + * @peer2peer: true if the importer can handle peer resources without pages * @invalidate:callback to use for invalidating mappings * * This structure holds the information required to attach to a buffer. Used @@ -401,6 +404,7 @@ struct dma_buf_attach_info { struct dma_buf *dmabuf; struct device *dev; void *priv; + bool peer2peer; void (*invalidate)(struct dma_buf_attachment *attach); }; -- 2.14.1
[PATCH 3/8] PCI: Add pci_peer_traffic_supported()
From: "wda...@nvidia.com" <wda...@nvidia.com> Add checks for topology and ACS configuration to determine whether or not peer traffic should be supported between two PCI devices. Signed-off-by: Will Davis <wda...@nvidia.com> Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/pci/pci.c | 112 include/linux/pci.h | 1 + 2 files changed, 113 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index f6a4dd10d9b0..efdca3c9dad1 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -5285,6 +5286,117 @@ void pci_ignore_hotplug(struct pci_dev *dev) } EXPORT_SYMBOL_GPL(pci_ignore_hotplug); +static bool pci_peer_host_traffic_supported(struct pci_host_bridge *host) +{ + struct pci_dev *bridge = pci_get_slot(host->bus, PCI_DEVFN(0, 0)); + unsigned short vendor, device; + + if (!bridge) + return false; + + vendor = bridge->vendor; + device = bridge->device; + pci_dev_put(bridge); + + /* AMD ZEN host bridges can do peer to peer */ + if (vendor == PCI_VENDOR_ID_AMD && device == 0x1450) + return true; + + /* TODO: Extend that to a proper whitelist */ + return false; +} + +bool pci_peer_traffic_supported(struct pci_dev *dev, struct pci_dev *peer) +{ + struct pci_dev *rp_dev, *rp_peer, *common_upstream; + struct pci_host_bridge *peer_host_bridge; + struct pci_host_bridge *dev_host_bridge; + int pos; + u16 cap; + + /* This is tricky enough, focus on PCIe for now */ + if (!pci_is_pcie(dev) || !pci_is_pcie(peer)) + return false; + + /* +* Disallow the peer-to-peer traffic if the devices do not share a +* host bridge. The PCI specifications does not make any guarantees +* about P2P capabilities between devices under separate domains. 
+* +* PCI Local Bus Specification Revision 3.0, section 3.10: +*"Peer-to-peer transactions crossing multiple host bridges +* PCI host bridges may, but are not required to, support PCI +* peer-to-peer transactions that traverse multiple PCI host +* bridges." +*/ + dev_host_bridge = pci_find_host_bridge(dev->bus); + peer_host_bridge = pci_find_host_bridge(peer->bus); + if (dev_host_bridge != peer_host_bridge) + return pci_peer_host_traffic_supported(dev_host_bridge); + + rp_dev = pcie_find_root_port(dev); + rp_peer = pcie_find_root_port(peer); + common_upstream = pci_find_common_upstream_dev(dev, peer); + + /* +* Access Control Services (ACS) Checks +* +* ACS has a capability bit for P2P Request Redirects (RR), +* but unfortunately it doesn't tell us much about the real +* capabilities of the hardware. +* +* PCI Express Base Specification Revision 3.0, section +* 6.12.1.1: +*"ACS P2P Request Redirect: must be implemented by Root +* Ports that support peer-to-peer traffic with other +* Root Ports; [80]" +* but +*"[80] Root Port indication of ACS P2P Request Redirect +* or ACS P2P Completion Redirect support does not imply +* any particular level of peer-to-peer support by the +* Root Complex, or that peer-to-peer traffic is +* supported at all" +*/ + + /* +* If ACS is not implemented, we have no idea about P2P +* support. Optimistically allow this if there is a common +* upstream device. +*/ + pos = pci_find_ext_capability(rp_dev, PCI_EXT_CAP_ID_ACS); + if (!pos) + return common_upstream != NULL; + + /* +* If the devices are under the same root port and have a common +* upstream device, allow if the root port is further upstream +* from the common upstream device and the common upstream +* device has Upstream Forwarding disabled, or if the root port +* is the common upstream device and ACS is not implemented. 
+*/ + pci_read_config_word(rp_dev, pos + PCI_ACS_CAP, &cap); + if ((rp_dev == rp_peer && common_upstream) && + (((common_upstream != rp_dev) && + !pci_acs_enabled(common_upstream, PCI_ACS_UF)) || +((common_upstream == rp_dev) && ((cap & PCI_ACS_RR) == 0)))) + return true; + + /* +* If ACS RR is implemented and disabled, allow only if the +* devices are under the same root port. +*/ + if (cap & PCI_ACS_RR && !pci_acs_enabled(rp_dev, PCI_ACS_RR)) + return rp_dev == rp_peer; + + /* +
[PATCH 7/8] drm/amdgpu: add amdgpu_gem_attach
Check if we can do peer2peer on the PCIe bus. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 24 1 file changed, 24 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 2566268806c3..133596df0775 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -134,6 +134,29 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev, return obj; } +static int amdgpu_gem_attach(struct dma_buf *dma_buf, struct device *dev, +struct dma_buf_attachment *attach) +{ + struct drm_gem_object *obj = dma_buf->priv; + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); + + if (!attach->peer2peer) + return 0; + + if (!dev_is_pci(dev)) + goto no_peer2peer; + + if (!pci_peer_traffic_supported(adev->pdev, to_pci_dev(dev))) + goto no_peer2peer; + + return 0; + +no_peer2peer: + attach->peer2peer = false; + return 0; +} + static struct sg_table * amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach, enum dma_data_direction dir) @@ -274,6 +297,7 @@ static int amdgpu_gem_begin_cpu_access(struct dma_buf *dma_buf, } static const struct dma_buf_ops amdgpu_dmabuf_ops = { + .attach = amdgpu_gem_attach, .map_dma_buf = amdgpu_gem_map_dma_buf, .unmap_dma_buf = amdgpu_gem_unmap_dma_buf, .release = drm_gem_dmabuf_release, -- 2.14.1
[PATCH 6/8] drm/amdgpu: note that we can handle peer2peer DMA-buf
Importing should work out of the box. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index fb43faf88eb0..2566268806c3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -353,6 +353,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, if (IS_ERR(obj)) return obj; + attach_info.peer2peer = true; attach_info.priv = obj; attach = dma_buf_attach(&attach_info); if (IS_ERR(attach)) { -- 2.14.1
[PATCH 5/8] drm/amdgpu: print DMA-buf status in debugfs
Just note if a BO was imported/exported. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 46b9ea4e6103..7c1f761bb1a5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -777,6 +777,8 @@ static int amdgpu_debugfs_gem_bo_info(int id, void *ptr, void *data) struct amdgpu_bo *bo = gem_to_amdgpu_bo(gobj); struct seq_file *m = data; + struct dma_buf_attachment *attachment; + struct dma_buf *dma_buf; unsigned domain; const char *placement; unsigned pin_count; @@ -805,6 +807,15 @@ static int amdgpu_debugfs_gem_bo_info(int id, void *ptr, void *data) pin_count = READ_ONCE(bo->pin_count); if (pin_count) seq_printf(m, " pin count %d", pin_count); + + dma_buf = READ_ONCE(bo->gem_base.dma_buf); + attachment = READ_ONCE(bo->gem_base.import_attach); + + if (attachment) + seq_printf(m, " imported from %p", dma_buf); + else if (dma_buf) + seq_printf(m, " exported as %p", dma_buf); + seq_printf(m, "\n"); return 0; -- 2.14.1
[PATCH 4/5] drm/amdgpu: add independent DMA-buf export v2
Instead of relying on the DRM functions just implement our own export functions. This adds support for taking care of unpinned DMA-buf. v2: fix unintended recursion, remove debugging leftovers Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 108 ++--- 4 files changed, 72 insertions(+), 43 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index a7e22298ce5c..190bada6f487 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -371,7 +371,6 @@ int amdgpu_gem_object_open(struct drm_gem_object *obj, void amdgpu_gem_object_close(struct drm_gem_object *obj, struct drm_file *file_priv); unsigned long amdgpu_gem_timeout(uint64_t timeout_ns); -struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj); struct drm_gem_object * amdgpu_gem_prime_import_sg_table(struct drm_device *dev, struct dma_buf_attachment *attach, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 1bfce79bc074..33893f0311c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -892,7 +892,6 @@ static struct drm_driver kms_driver = { .gem_prime_export = amdgpu_gem_prime_export, .gem_prime_import = amdgpu_gem_prime_import, .gem_prime_res_obj = amdgpu_gem_prime_res_obj, - .gem_prime_get_sg_table = amdgpu_gem_prime_get_sg_table, .gem_prime_import_sg_table = amdgpu_gem_prime_import_sg_table, .gem_prime_vmap = amdgpu_gem_prime_vmap, .gem_prime_vunmap = amdgpu_gem_prime_vunmap, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index ac1fa3394a9e..11d2ac175b30 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -31,6 
+31,7 @@ */ #include #include +#include #include #include #include @@ -935,6 +936,10 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo, amdgpu_bo_kunmap(abo); + if (abo->gem_base.dma_buf && !abo->gem_base.import_attach && + bo->mem.mem_type != TTM_PL_SYSTEM) + dma_buf_invalidate_mappings(abo->gem_base.dma_buf); + /* remember the eviction */ if (evict) atomic64_inc(&adev->num_evictions); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 4b584cb75bf4..7fef95f0fed1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -32,14 +32,6 @@ static const struct dma_buf_ops amdgpu_dmabuf_ops; -struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj) -{ - struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); - int npages = bo->tbo.num_pages; - - return drm_prime_pages_to_sg(bo->tbo.ttm->pages, npages); -} - void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj) { struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); @@ -126,23 +118,22 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev, return ERR_PTR(ret); } -static int amdgpu_gem_map_attach(struct dma_buf *dma_buf, -struct device *target_dev, -struct dma_buf_attachment *attach) +static struct sg_table * +amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach, + enum dma_data_direction dir) { + struct dma_buf *dma_buf = attach->dmabuf; struct drm_gem_object *obj = dma_buf->priv; struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); + struct sg_table *sgt; long r; - r = drm_gem_map_attach(dma_buf, target_dev, attach); - if (r) - return r; - - r = amdgpu_bo_reserve(bo, false); - if (unlikely(r != 0)) - goto error_detach; - + if (!attach->invalidate) { + r = amdgpu_bo_reserve(bo, false); + if (unlikely(r != 0)) + return ERR_PTR(r); + } if (attach->dev->driver != adev->dev->driver) { /* @@ -158,42 +149,77 @@ static int amdgpu_gem_map_attach(struct dma_buf 
*dma_buf, } } - /* pin buffer into GTT */ - r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL); - if (r) + if (!attach->invalidate) { + /* pin buffer into GTT */ + r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL); + if (r) + goto er
[PATCH 2/5] drm/ttm: keep a reference to transfer pipelined BOs
Make sure the transferred BO is never destroyed before the transfer BO. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/ttm/ttm_bo_util.c | 50 +++ 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 1f730b3f18e5..1ee20558ee31 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -39,6 +39,11 @@ #include #include +struct ttm_transfer_obj { + struct ttm_buffer_object base; + struct ttm_buffer_object *bo; +}; + void ttm_bo_free_old_node(struct ttm_buffer_object *bo) { ttm_bo_mem_put(bo, &bo->mem); @@ -435,7 +440,11 @@ EXPORT_SYMBOL(ttm_bo_move_memcpy); static void ttm_transfered_destroy(struct ttm_buffer_object *bo) { - kfree(bo); + struct ttm_transfer_obj *fbo; + + fbo = container_of(bo, struct ttm_transfer_obj, base); + ttm_bo_unref(&fbo->bo); + kfree(fbo); } /** @@ -456,14 +465,15 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo) static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, struct ttm_buffer_object **new_obj) { - struct ttm_buffer_object *fbo; + struct ttm_transfer_obj *fbo; int ret; fbo = kmalloc(sizeof(*fbo), GFP_KERNEL); if (!fbo) return -ENOMEM; - *fbo = *bo; + fbo->base = *bo; + fbo->bo = ttm_bo_reference(bo); /** * Fix up members that we shouldn't copy directly: @@ -471,25 +481,25 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, */ atomic_inc(&bo->bdev->glob->bo_count); - INIT_LIST_HEAD(&fbo->ddestroy); - INIT_LIST_HEAD(&fbo->lru); - INIT_LIST_HEAD(&fbo->swap); - INIT_LIST_HEAD(&fbo->io_reserve_lru); - mutex_init(&fbo->wu_mutex); - fbo->moving = NULL; - drm_vma_node_reset(&fbo->vma_node); - atomic_set(&fbo->cpu_writers, 0); - - kref_init(&fbo->list_kref); - kref_init(&fbo->kref); - fbo->destroy = &ttm_transfered_destroy; - fbo->acc_size = 0; - fbo->resv = &fbo->ttm_resv; - reservation_object_init(fbo->resv); - ret = reservation_object_trylock(fbo->resv); + INIT_LIST_HEAD(&fbo->base.ddestroy); + INIT_LIST_HEAD(&fbo->base.lru); + INIT_LIST_HEAD(&fbo->base.swap); + INIT_LIST_HEAD(&fbo->base.io_reserve_lru); + mutex_init(&fbo->base.wu_mutex); + fbo->base.moving = NULL; + drm_vma_node_reset(&fbo->base.vma_node); + atomic_set(&fbo->base.cpu_writers, 0); + + kref_init(&fbo->base.list_kref); + kref_init(&fbo->base.kref); + fbo->base.destroy = &ttm_transfered_destroy; + fbo->base.acc_size = 0; + fbo->base.resv = &fbo->base.ttm_resv; + reservation_object_init(fbo->base.resv); + ret = reservation_object_trylock(fbo->base.resv); WARN_ON(!ret); - *new_obj = fbo; + *new_obj = &fbo->base; return 0; } -- 2.14.1
[PATCH 5/5] drm/amdgpu: add independent DMA-buf import v3
Instead of relying on the DRM functions just implement our own import functions. This adds support for taking care of unpinned DMA-buf. v2: enable for all exporters, not just amdgpu, fix invalidation handling, lock reservation object while setting callback v3: change to new dma_buf attach interface Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 72 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 32 +++--- 2 files changed, 89 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 7fef95f0fed1..fb43faf88eb0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -86,28 +86,24 @@ int amdgpu_gem_prime_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma return ret; } -struct drm_gem_object * -amdgpu_gem_prime_import_sg_table(struct drm_device *dev, -struct dma_buf_attachment *attach, -struct sg_table *sg) +static struct drm_gem_object * +amdgpu_gem_prime_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) { - struct reservation_object *resv = attach->dmabuf->resv; + struct reservation_object *resv = dma_buf->resv; struct amdgpu_device *adev = dev->dev_private; struct amdgpu_bo *bo; int ret; ww_mutex_lock(&resv->lock, NULL); - ret = amdgpu_bo_create(adev, attach->dmabuf->size, PAGE_SIZE, + ret = amdgpu_bo_create(adev, dma_buf->size, PAGE_SIZE, AMDGPU_GEM_DOMAIN_CPU, 0, ttm_bo_type_sg, resv, &bo); if (ret) goto error; - bo->tbo.sg = sg; - bo->tbo.ttm->sg = sg; bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT; bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT; - if (attach->dmabuf->ops != &amdgpu_dmabuf_ops) + if (dma_buf->ops != &amdgpu_dmabuf_ops) bo->prime_shared_count = 1; ww_mutex_unlock(&resv->lock); @@ -118,6 +114,26 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev, return ERR_PTR(ret); } +struct drm_gem_object * +amdgpu_gem_prime_import_sg_table(struct drm_device *dev, +struct dma_buf_attachment *attach, +struct sg_table *sg) +{ + struct drm_gem_object *obj; + struct amdgpu_bo *bo; + + obj = amdgpu_gem_prime_create_obj(dev, attach->dmabuf); + if (IS_ERR(obj)) + return obj; + + bo = gem_to_amdgpu_bo(obj); + amdgpu_bo_reserve(bo, true); + bo->tbo.sg = sg; + bo->tbo.ttm->sg = sg; + amdgpu_bo_unreserve(bo); + return obj; +} + static struct sg_table * amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach, enum dma_data_direction dir) @@ -293,9 +309,29 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_device *dev, return buf; } +static void +amdgpu_gem_prime_invalidate_mappings(struct dma_buf_attachment *attach) +{ + struct ttm_operation_ctx ctx = { false, false }; + struct drm_gem_object *obj = attach->priv; + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct ttm_placement placement = {}; + int r; + + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); + if (r) + DRM_ERROR("Failed to unmap DMA-buf import (%d))\n", r); +} + struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, struct dma_buf *dma_buf) { + struct dma_buf_attach_info attach_info = { + .dev = dev->dev, + .dmabuf = dma_buf, + .invalidate = amdgpu_gem_prime_invalidate_mappings + }; + struct dma_buf_attachment *attach; struct drm_gem_object *obj; if (dma_buf->ops == &amdgpu_dmabuf_ops) { @@ -310,5 +346,21 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, } } - return drm_gem_prime_import(dev, dma_buf); + if (!dma_buf->invalidation_supported) + return drm_gem_prime_import(dev, dma_buf); + + obj = amdgpu_gem_prime_create_obj(dev, dma_buf); + if (IS_ERR(obj)) + return obj; + + attach_info.priv = obj; + attach = dma_buf_attach(&attach_info); + if (IS_ERR(attach)) { + drm_gem_object_put(obj); + return ERR_CAST(attach); + } + + get_dma_buf(dma_buf); + obj->import_attach = attach; + return obj; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index d2ab40494a4c..ad93f201e7b6 100644 --- 
a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -43,6
[PATCH 1/5] dma-buf: add optional invalidate_mappings callback v3
Each importer can now provide an invalidate_mappings callback. This allows the exporter to provide the mappings without the need to pin the backing store. v2: don't try to invalidate mappings when the callback is NULL, lock the reservation obj while using the attachments, add helper to set the callback v3: move flag for invalidation support into the DMA-buf, use new attach_info structure to set the callback Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/dma-buf.c | 43 +++ include/linux/dma-buf.h | 28 2 files changed, 71 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index d2e8ca0d9427..ffaa2f9a9c2c 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -566,6 +566,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info attach->dev = info->dev; attach->dmabuf = dmabuf; attach->priv = info->priv; + attach->invalidate = info->invalidate; mutex_lock(&dmabuf->lock); @@ -574,7 +575,9 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info if (ret) goto err_attach; } + reservation_object_lock(dmabuf->resv, NULL); list_add(&attach->node, &dmabuf->attachments); + reservation_object_unlock(dmabuf->resv); mutex_unlock(&dmabuf->lock); return attach; @@ -600,7 +603,9 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach) return; mutex_lock(&dmabuf->lock); + reservation_object_lock(dmabuf->resv, NULL); list_del(&attach->node); + reservation_object_unlock(dmabuf->resv); if (dmabuf->ops->detach) dmabuf->ops->detach(dmabuf, attach); @@ -634,10 +639,23 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, if (WARN_ON(!attach || !attach->dmabuf)) return ERR_PTR(-EINVAL); + /* +* Mapping a DMA-buf can trigger its invalidation, prevent sending this +* event to the caller by temporarily removing this attachment from the +* list. 
+*/ + if (attach->invalidate) { + reservation_object_assert_held(attach->dmabuf->resv); + list_del(&attach->node); + } + sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction); if (!sg_table) sg_table = ERR_PTR(-ENOMEM); + if (attach->invalidate) + list_add(&attach->node, &attach->dmabuf->attachments); + return sg_table; } EXPORT_SYMBOL_GPL(dma_buf_map_attachment); @@ -658,6 +676,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, { might_sleep(); + if (attach->invalidate) + reservation_object_assert_held(attach->dmabuf->resv); + if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) return; @@ -666,6 +687,26 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, } EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment); +/** + * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf + * + * @dmabuf: [in] buffer which mappings should be invalidated + * + * Informs all attachments that they need to destroy and recreate all their + * mappings. + */ +void dma_buf_invalidate_mappings(struct dma_buf *dmabuf) +{ + struct dma_buf_attachment *attach; + + reservation_object_assert_held(dmabuf->resv); + + list_for_each_entry(attach, &dmabuf->attachments, node) + if (attach->invalidate) + attach->invalidate(attach); +} +EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings); + /** * DOC: cpu access * @@ -1123,10 +1164,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused) seq_puts(s, "\tAttached Devices:\n"); attach_count = 0; + reservation_object_lock(buf_obj->resv, NULL); list_for_each_entry(attach_obj, &buf_obj->attachments, node) { seq_printf(s, "\t%s\n", dev_name(attach_obj->dev)); attach_count++; } + reservation_object_unlock(buf_obj->resv); seq_printf(s, "Total %d devices attached\n\n", attach_count); diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 2c27568d44af..15dd8598bff1 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -270,6 +270,8 @@ struct dma_buf_ops { * @poll: for userspace poll support * @cb_excl: for 
userspace poll support * @cb_shared: for userspace poll support + * @invalidation_supported: True when the exporter supports unpinned operation + *
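The core of the new mechanism is just a walk over the attachment list, notifying only importers that registered a callback. A userspace model of dma_buf_invalidate_mappings() — with illustrative names, not the real kernel types — might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of dma_buf_invalidate_mappings(): walk the attachment
 * list and invoke the invalidate callback of every importer that set
 * one. Importers without a callback (classic pinning importers) are
 * skipped. Names here are illustrative only. */
struct attachment {
	void (*invalidate)(struct attachment *a);
	int invalidated;          /* test bookkeeping: times notified */
	struct attachment *next;
};

static void importer_invalidate(struct attachment *a)
{
	a->invalidated++;         /* a real importer would unmap here */
}

static void invalidate_mappings(struct attachment *head)
{
	for (struct attachment *a = head; a; a = a->next)
		if (a->invalidate)
			a->invalidate(a);
}
```

In the kernel this walk happens with the reservation object held, which is why the patch also asserts the lock in map/unmap and temporarily unlinks the attachment being mapped to avoid self-notification.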
[PATCH 3/5] drm/ttm: remove the backing store if no placement is given
Pipeline removal of the BO's backing store when no placement is given during validation. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/ttm/ttm_bo.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 98e06f8bf23b..17e821f01d0a 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1078,6 +1078,18 @@ int ttm_bo_validate(struct ttm_buffer_object *bo, uint32_t new_flags; reservation_object_assert_held(bo->resv); + + /* +* Remove the backing store if no placement is given. +*/ + if (!placement->num_placement && !placement->num_busy_placement) { + ret = ttm_bo_pipeline_gutting(bo); + if (ret) + return ret; + + return ttm_tt_create(bo, false); + } + /* * Check whether we need to move buffer. */ -- 2.14.1
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 22.03.2018 um 08:18 schrieb Daniel Vetter: On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote: Am 21.03.2018 um 09:28 schrieb Daniel Vetter: On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote: Am 20.03.2018 um 15:08 schrieb Daniel Vetter: [SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though. That is exactly what we do in amdgpu as well, it's just neither very efficient nor reliable to retry getting the right pages for a submission over and over again. Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding. That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop. Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check if all pages are still the ones we want to have: amdgpu_mn_lock(p->mn); if (p->bo_list) { for (i = p->bo_list->first_userptr; i < p->bo_list->num_entries; ++i) { struct amdgpu_bo *bo = p->bo_list->array[i].robj; if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) { amdgpu_mn_unlock(p->mn); return -ERESTARTSYS; } } } If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again. I'm not talking about userptr here, but general bo eviction. Sorry for the confusion. The reason I'm dragging all the general bo management into this discussion is because we do seem to have a fairly fundamental difference in how that's done, with resulting consequences for the locking hierarchy. 
And if this invalidate_mapping stuff should work, together with userptr and everything else, I think we're required to agree on how this is all supposed to nest, and how exactly we should back off for the other side that needs to break the locking circle. That aside, I don't entirely understand why you need to restart so much. I figured that get_user_pages is ordered correctly against mmu invalidations, but I get the impression you think that's not the case. How does that happen? Correct. I've had the same assumption, but both Jerome as well as our internal tests proved me wrong on that. Key take-away from that was that you can't take any locks in either the MMU notifier or the shrinker that you also take while calling kmalloc (or, simply speaking, get_user_pages()). Additionally, in the MMU or shrinker callback all different kinds of locks might be held, so you basically can't assume that you can do things like recursive page table walks or call dma_unmap_anything. Thinking a moment about that, it actually seems to make perfect sense. So it doesn't matter what order you got between the mmap_sem and your buffer or allocation lock, it will simply be incorrect with other locks in the system anyway. The only solution to that problem we have found is the dance we have in amdgpu now: 1. Inside invalidate_range_start you grab a recursive read side lock. 2. Wait for all GPU operations on those pages to finish. 3. Then mark all pages as dirty and accessed. 4. Inside invalidate_range_end you release your recursive read side lock again. Then during command submission you do the following: 1. Take the locks and figure out all the userptr BOs which need new pages. 2. Drop the locks again and call get_user_pages(). 3. Install the new page arrays and reiterate with #1 4. As soon as everybody has pages update your page tables and prepare all command submission. 5. Take the write side of our recursive lock. 6. Check if all pages are still valid, if not restart the whole IOCTL. 7. 
Submit the job to the hardware. 8. Drop the write side of our recursive lock. Regards, Christian. -Daniel
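The validate-and-restart dance described above can be modelled as a generation counter: an invalidation bumps the counter, and the submission path snapshots it before pinning pages and retries whenever the snapshot went stale. This is a hypothetical userspace sketch of the pattern only — `page_generation`, `submit_with_restart`, etc. are illustrative names, not the amdgpu API, and the real code uses a recursive rw lock rather than a bare counter:

```c
#include <assert.h>

/* Illustrative model of the amdgpu userptr dance: an MMU-notifier
 * invalidation bumps a generation counter; command submission snapshots
 * the counter while "pinning" pages and restarts (-ERESTARTSYS in the
 * kernel) whenever the counter moved underneath it. */

static unsigned long page_generation;

static void invalidate_range(void)         /* models invalidate_range_start */
{
	page_generation++;
}

static unsigned long snapshot_and_pin_pages(void)  /* models get_user_pages */
{
	return page_generation;
}

/* Returns the number of attempts needed until a submission saw a stable
 * snapshot; "invalidations" injects that many concurrent invalidations. */
static int submit_with_restart(int invalidations)
{
	int attempts = 0;

	for (;;) {
		unsigned long gen = snapshot_and_pin_pages();

		attempts++;
		if (invalidations-- > 0)
			invalidate_range();     /* concurrent MMU notifier */
		if (gen == page_generation)     /* step 6: pages still valid */
			return attempts;        /* step 7: submit to hardware */
		/* stale snapshot: drop everything and retry the whole IOCTL */
	}
}
```

Each injected invalidation forces exactly one extra pass, mirroring how the real IOCTL restarts until step 6 finds all userptr pages unchanged.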
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 22.03.2018 um 08:14 schrieb Daniel Vetter: On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote: Am 21.03.2018 um 09:18 schrieb Daniel Vetter: [SNIP] For correct operation you always need to implement invalidate_range_end as well and add some lock/completion work. Otherwise get_user_pages() can again grab the reference to the wrong page. Is this really a problem? Yes, and quite a big one. I figured that if a mmu_notifier invalidation is going on, a get_user_pages on that mm from anywhere else (whether i915 or anyone really) will serialize with the ongoing invalidate? No, that isn't correct. Jerome can probably explain that better than I can. If that's not the case, then really any get_user_pages is racy, including all the DIRECT_IO ones. The key point here is that get_user_pages() grabs a reference to the page. So what you get is a bunch of pages which were mapped at that location at a specific point in time. There is no guarantee that after get_user_pages() returns you still have the same pages mapped at that point, you only guarantee that the pages are not reused for something else. That is perfectly sufficient for a task like DIRECT_IO where you can only have block or network I/O, but unfortunately not really for GPUs where you crunch on data, write results back to pages and actually count on the CPU seeing the result in the right place. [SNIP] So no matter how you put it i915 is clearly doing something wrong here :) tbh I'm not entirely clear on the reasons why this works, but cross-release lockdep catches these things, and it did not complain. On a high level we make sure that mm locks needed by get_user_pages do _not_ nest within dev->struct_mutex. We have massive back-off slowpaths to do anything that could fault outside of our own main gem locking. I'm pretty sure that this doesn't work as intended and just hides the real problem. That was (at least in the past) a major difference with amdgpu, which essentially has none of these paths. 
That would trivially deadlock with your own gem mmap fault handler, so you had (maybe that changed) a dumb retry loop, which did shut up lockdep but didn't fix any of the locking inversions. Any lock you grab in an MMU callback can't even be held when you call kmalloc() or get_free_page() (without GFP_NOIO). Even simple things like drm_vm_open() violate that by using GFP_KERNEL. So I can 100% assure you that what you do here is not correct. So yeah, grabbing dev->struct_mutex is in principle totally fine while holding all kinds of struct mm/vma locks. I'm not entirely clear why we punt the actual unmapping to the worker though, maybe simply to not have a constrained stack. I strongly disagree on that. As far as I can see what TTM does actually looks like the right approach to the problem. This is re: your statement that you can't unmap sg tables from the shrinker. We can, because we've actually untangled the locking dependencies so that you can fully operate on gem objects from within mm/vma locks. Maybe the code has changed, but last time I looked at radeon/ttm a while back that was totally not the case, and if you don't do all this work then yes you'll deadlock. Doesn't mean it's impossible, because we've done it :-) And I'm pretty sure you didn't do it correctly :D Well, it actually gets the job done. We'd need to at least get to per-object locking, and probably even then we'd need to rewrite the code a lot. But please note that this here is only to avoid the GFP_NOIO constraint; all the other bits I clarified around why we don't actually have circular locking (because the entire hierarchy is inverted for us) still hold even if you would only trylock here. Well you reversed your allocation and mmap_sem lock, which avoids the lock inversion during page faults, but it doesn't help you at all with the MMU notifier and shrinker because then there are a lot more locks involved. Regards, Christian. 
Aside: Given that yesterday a bunch of folks complained on #dri-devel that amdgpu prematurely OOMs compared to i915, and that we've switched from a simple trylock to this nastiness to be able to recover from more low-memory situations, it's maybe not such a silly idea. Horrible, but not silly, because actually necessary. -Daniel
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 21.03.2018 um 09:28 schrieb Daniel Vetter: On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote: Am 20.03.2018 um 15:08 schrieb Daniel Vetter: [SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though. That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again. Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding. That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop. Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check if all pages are still the ones we want to have: amdgpu_mn_lock(p->mn); if (p->bo_list) { for (i = p->bo_list->first_userptr; i < p->bo_list->num_entries; ++i) { struct amdgpu_bo *bo = p->bo_list->array[i].robj; if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) { amdgpu_mn_unlock(p->mn); return -ERESTARTSYS; } } } If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again. Regards, Christian. -Daniel
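Reflowed from the flattened quote above, the check Christian cites from amdgpu_cs_submit() reads:

```c
	amdgpu_mn_lock(p->mn);
	if (p->bo_list) {
		for (i = p->bo_list->first_userptr;
		     i < p->bo_list->num_entries; ++i) {
			struct amdgpu_bo *bo = p->bo_list->array[i].robj;

			/* pages changed under us: unwind and restart the IOCTL */
			if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
				amdgpu_mn_unlock(p->mn);
				return -ERESTARTSYS;
			}
		}
	}
```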
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 21.03.2018 um 09:18 schrieb Daniel Vetter: [SNIP] They're both in i915_gem_userptr.c, somewhat interleaved. Would be interesting if you could show what you think is going wrong in there compared to amdgpu_mn.c. i915 implements only one callback: static const struct mmu_notifier_ops i915_gem_userptr_notifier = { .invalidate_range_start = i915_gem_userptr_mn_invalidate_range_start, }; For correct operation you always need to implement invalidate_range_end as well and add some lock/completion work. Otherwise get_user_pages() can again grab the reference to the wrong page. The next problem seems to be that cancel_userptr() doesn't prevent any new command submission. E.g. i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL); What prevents new command submissions from using the GEM object directly after you finished waiting here? I get a feeling we're talking past each other here. Yeah, agree. In addition to that I don't know the i915 code very well. Can you perhaps explain what exactly the race is you're seeing? The i915 userptr code is fairly convoluted and pushes a lot of stuff to workers (but then syncs with those workers again later on), so it's easily possible you've overlooked one of these lines that might already guarantee what you think needs to be guaranteed. We're definitely not aiming to allow userspace to write to random pages all over. You don't read/write random pages; there is still a reference to the page, so the page can't be reused until you are done. The problem is rather that you can't guarantee that you write to the page which is mapped into the process at that location. E.g. the CPU and the GPU might see two different things. Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of stray writes (I don't see any code in amdgpu_mn.c to unmap at least the gpu side PTEs to make stuff inaccessible) and wreck the core kernel's book-keeping. 
In i915 we guarantee that we call set_page_dirty/mark_page_accessed only after all the mappings are really gone (both GPU PTEs and sg mapping), guaranteeing that any stray writes from either the GPU or IOMMU will result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU actually works" is an assumption behind device isolation). Well exactly that's the point, the handling in i915 looks incorrect to me. You need to call set_page_dirty/mark_page_accessed way before the mapping is destroyed. To be more precise for userptrs it must be called from the invalidate_range_start, but i915 seems to delegate everything into a background worker to avoid the locking problems. Yeah, and at the end of the function there's a flush_work to make sure the worker has caught up. Ah, yes, I hadn't seen that. But then grabbing the obj->base.dev->struct_mutex lock in cancel_userptr() is rather evil. You just silenced lockdep because you offloaded that into a work item. So no matter how you put it i915 is clearly doing something wrong here :) I know. i915 gem has tons of fallbacks and retry loops (we restart the entire CS if needed), and i915 userptr pushes the entire get_user_pages dance off into a worker if the fastpath doesn't succeed and we run out of memory or hit contended locks. We also have obscene amounts of __GFP_NORETRY and __GFP_NOWARN all over the place to make sure the core mm code doesn't do something we don't want it to do in the fastpaths (because there's really no point in spending lots of time trying to make memory available if we have a slowpath fallback with much less constraints). Well I haven't audited the code, but I'm pretty sure that just mitigates the problem and silences lockdep instead of really fixing the issue. We're also not limiting ourselves to GFP_NOIO, but instead have a recursion detection in our own shrinker callback to avoid these deadlocks. Which if you ask me is absolutely horrible. 
I mean the comment in linux/mutex.h sums it up pretty well:

 * This function should not be used, _ever_. It is purely for hysterical GEM
 * raisins, and once those are gone this will be removed.

Regards, Christian.
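For contrast, the structure Christian is arguing i915 is missing looks roughly like this. This is only a sketch of the pattern (paired range hooks bracketing command submission); the my_mn_* names and the rwsem choice are illustrative, only the mmu_notifier_ops layout matches the kernel API of that era:

```c
/* Sketch: both range hooks implemented, with a lock that command
 * submission takes for read so that get_user_pages() results cannot
 * be consumed while an invalidation is in flight. */
static void my_mn_invalidate_range_start(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end)
{
	struct my_mn *rmn = container_of(mn, struct my_mn, mn);

	down_read(&rmn->lock);	/* blocks new submissions; released in _end */
	/* wait for fences and tear down GPU mappings of [start, end) here */
}

static void my_mn_invalidate_range_end(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	struct my_mn *rmn = container_of(mn, struct my_mn, mn);

	up_read(&rmn->lock);	/* submissions may grab fresh pages again */
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start = my_mn_invalidate_range_start,
	.invalidate_range_end   = my_mn_invalidate_range_end,
};
```

Without the _end hook there is no point after which a racing get_user_pages() is known to return the post-invalidation pages, which is exactly the "reference to the wrong page" problem described above.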
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 20.03.2018 um 15:08 schrieb Daniel Vetter: [SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. Ttm doesn't do that though. That is exactly what we do in amdgpu as well, it's just not very efficient nor reliable to retry getting the right pages for a submission over and over again. [SNIP] Note that there are 2 paths for i915 userptr. One is the mmu notifier, the other one is the root-only hack we have for dubious reasons (whose point I really don't see myself). Well I'm referring to i915_gem_userptr.c, if that isn't what you are exposing then just feel free to ignore this whole discussion. For coherent usage you need to install some lock to prevent concurrent get_user_pages(), command submission and invalidate_range_start/invalidate_range_end from the MMU notifier. Otherwise you can't guarantee that you are actually accessing the right page in the case of a fork() or mprotect(). Yeah doing that with a full lock will create endless amounts of issues, but I don't see why we need that. Userspace racing stuff with itself gets to keep all the pieces. This is like racing DIRECT_IO against mprotect and fork. First of all I strongly disagree on that. A thread calling fork() because it wants to run a command is not something we can forbid just because we have a gfx stack loaded. That the video driver is not capable of handling that correctly is certainly not the problem of userspace. Second it's not only userspace racing here, you can get into this kind of issues just because of transparent huge page support where the background daemon tries to reallocate the page tables into bigger chunks. And if I'm not completely mistaken you can also open up quite a bunch of security problems if you suddenly access the wrong page. 
Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of stray writes (I don't see any code in amdgpu_mn.c to unmap at least the gpu side PTEs to make stuff inaccessible) and wreck the core kernel's book-keeping. In i915 we guarantee that we call set_page_dirty/mark_page_accessed only after all the mappings are really gone (both GPU PTEs and sg mapping), guaranteeing that any stray writes from either the GPU or IOMMU will result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU actually works" is an assumption behind device isolation). Well exactly that's the point, the handling in i915 looks incorrect to me. You need to call set_page_dirty/mark_page_accessed way before the mapping is destroyed. To be more precise for userptrs it must be called from the invalidate_range_start, but i915 seems to delegate everything into a background worker to avoid the locking problems. Felix and I hammered for quite some time on amdgpu until all of this was handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c. Maybe we should have more shared code for this, it seems to be a source of endless amounts of fun ... I can try to gather the lockdep splat from my mail history, but it essentially took us multiple years to get rid of all of them. I'm very much interested in specifically the splat that makes it impossible for you folks to remove the sg mappings. That one sounds bad. And would essentially make mmu_notifiers useless for their primary use case, which is handling virtual machines where you really have to make sure the IOMMU mapping is gone before you claim it's gone, because there's no 2nd level of device checks (like GPU PTEs, or command checker) catching stray writes. Well to be more precise the problem is not that we can't destroy the sg table, but rather that we can't grab the locks to do so. See, when you need to destroy the sg table you usually need to grab the same lock you grabbed when you created it. 
And all locks taken while in an MMU notifier can only depend on memory allocation with GFP_NOIO, which is not really feasible for gfx drivers. Regards, Christian.
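The GFP_NOIO constraint described here can be stated with the kernel's scoped-allocation helpers. A minimal sketch, assuming a hypothetical my_recover_memory() that allocates internally (the memalloc_noio_save/restore pair is the real API; everything else is illustrative):

```c
/* Every allocation between save and restore implicitly loses __GFP_IO,
 * so it cannot recurse into reclaim paths that would take the locks we
 * are currently holding.  Making *all* allocations reachable under a
 * reservation lock behave like this is what Christian calls infeasible
 * for gfx drivers. */
unsigned int noio_flags = memalloc_noio_save();

err = my_recover_memory(bo);	/* hypothetical helper; allocates inside */

memalloc_noio_restore(noio_flags);
```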
Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 20.03.2018 um 08:44 schrieb Daniel Vetter: On Mon, Mar 19, 2018 at 5:23 PM, Christian König <ckoenig.leichtzumer...@gmail.com> wrote: Am 19.03.2018 um 16:53 schrieb Chris Wilson: Quoting Christian König (2018-03-16 14:22:32) [snip, probably lost too must context] This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence. Ok, the situation I have in mind is the non-pipelined case: revoking dma-buf for mmu_invalidate_range or shrink_slab. I would need a completion event that can be waited on the cpu for all the invalidate callbacks. (Essentially an atomic_t counter plus struct completion; a lighter version of dma_fence, I wonder where I've seen that before ;) Actually that is harmless. When you need to unmap a DMA-buf because of mmu_invalidate_range or shrink_slab you need to wait for it's reservation object anyway. reservation_object only prevents adding new fences, you still have to wait for all the current ones to signal. Also, we have dma-access without fences in i915. "I hold the reservation_object" does not imply you can just go and nuke the backing storage. I was not talking about taking the lock, but rather using reservation_object_wait_timeout_rcu(). To be more precise you actually can't take the reservation object lock in an mmu_invalidate_range callback and you can only trylock it in a shrink_slab callback. This needs to be done to make sure that the backing memory is now idle, it doesn't matter if the jobs where submitted by DMA-buf importers or your own driver. The sg tables pointing to the now released memory might live a bit longer, but that is unproblematic and actually intended. I think that's very problematic. One reason for an IOMMU is that you have device access isolation, and a broken device can't access memory it shouldn't be able to access. 
From that security-in-depth point of view it's not cool that there are some sg tables still hanging around that a broken GPU could use. And let's not pretend hw is perfect, especially GPUs :-) I completely agree on that, but there is unfortunately no other way. See you simply can't take a reservation object lock in an mmu or slab callback, you can only trylock them. For example it would require changing all allocations done while holding any reservation lock to GFP_NOIO. When we would try to destroy the sg tables in an mmu_invalidate_range or shrink_slab callback we would run into a lockdep horror. So I'm no expert on this, but I think this is exactly what we're doing in i915. Kinda no other way to actually free the memory without throwing all the nice isolation aspects of an IOMMU into the wind. Can you please paste the lockdep splats you've seen with amdgpu when trying to do that? Taking a quick look at i915 I can definitely say that what you guys do here is actually quite buggy. For coherent usage you need to install some lock to prevent concurrent get_user_pages(), command submission and invalidate_range_start/invalidate_range_end from the MMU notifier. Otherwise you can't guarantee that you are actually accessing the right page in the case of a fork() or mprotect(). Felix and I hammered for quite some time on amdgpu until all of this was handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c. I can try to gather the lockdep splat from my mail history, but it essentially took us multiple years to get rid of all of them. Regards, Christian. -Daniel Regards, Christian. Even so, it basically means passing a fence object down to the async callbacks for them to signal when they are complete. Just to handle the non-pipelined version. 
:| -Chris ___ dri-devel mailing list dri-de...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ___ Linaro-mm-sig mailing list linaro-mm-...@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
Re: [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 19.03.2018 um 16:53 schrieb Chris Wilson: Quoting Christian König (2018-03-16 14:22:32) [snip, probably lost too much context] This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence. Ok, the situation I have in mind is the non-pipelined case: revoking dma-buf for mmu_invalidate_range or shrink_slab. I would need a completion event that can be waited on the cpu for all the invalidate callbacks. (Essentially an atomic_t counter plus struct completion; a lighter version of dma_fence, I wonder where I've seen that before ;) Actually that is harmless. When you need to unmap a DMA-buf because of mmu_invalidate_range or shrink_slab you need to wait for its reservation object anyway. This needs to be done to make sure that the backing memory is now idle, it doesn't matter if the jobs were submitted by DMA-buf importers or your own driver. The sg tables pointing to the now released memory might live a bit longer, but that is unproblematic and actually intended. When we would try to destroy the sg tables in an mmu_invalidate_range or shrink_slab callback we would run into a lockdep horror. Regards, Christian. Even so, it basically means passing a fence object down to the async callbacks for them to signal when they are complete. Just to handle the non-pipelined version. :| -Chris
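"Wait for its reservation object" here means the RCU wait helper, not taking the ww_mutex (which, as discussed elsewhere in the thread, is forbidden in an mmu notifier and trylock-only in a shrinker). A sketch against the reservation API of that era; the error handling is illustrative:

```c
/* Wait for all fences on the buffer (exclusive and shared) to signal,
 * without taking the reservation lock -- legal from paths that may
 * only trylock the reservation object, or not lock it at all. */
long ret = reservation_object_wait_timeout_rcu(bo->resv,
					       true,	/* wait_all: shared fences too */
					       false,	/* not interruptible */
					       MAX_SCHEDULE_TIMEOUT);
if (ret <= 0)
	return ret ? ret : -EBUSY;	/* error or timed out */
/* backing memory is idle now; sg tables may outlive it harmlessly */
```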
Re: [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Am 16.03.2018 um 14:51 schrieb Chris Wilson: Quoting Christian König (2018-03-16 13:20:45) @@ -326,6 +338,29 @@ struct dma_buf_attachment { struct device *dev; struct list_head node; void *priv; + + /** +* @invalidate_mappings: +* +* Optional callback provided by the importer of the attachment which +* must be set before mappings are created. +* +* If provided the exporter can avoid pinning the backing store while +* mappings exists. Hmm, no I don't think it avoids the pinning issue entirely. As it stands, the importer doesn't have a page refcount and so they all rely on the exporter keeping the dmabuf pages pinned while attached. What can happen is that given the invalidate cb, the importers can revoke their attachments, letting the exporter recover the pages/sg, and then start again from scratch. Yes, exactly. The wording is just not 100% precise and I haven't found something better so far. That also neatly answers what happens if not all importers provide an invalidate cb, or fail, the dmabuf remains pinned and the exporter must retreat. Yes, exactly as well. As soon as at least one importer says "I can't do this", we must fallback to the old behavior. +* +* The function is called with the lock of the reservation object +* associated with the dma_buf held and the mapping function must be +* called with this lock held as well. This makes sure that no mapping +* is created concurrently with an ongoing invalidation. +* +* After the callback all existing mappings are still valid until all +* fences in the dma_bufs reservation object are signaled, but should be +* destroyed by the importer as soon as possible. +* +* New mappings can be created immediately, but can't be used before the +* exclusive fence in the dma_bufs reservation object is signaled. +*/ + void (*invalidate_mappings)(struct dma_buf_attachment *attach); The intent is that the invalidate is synchronous and immediate, while locked? 
We are looking at recursing back into the dma_buf functions to remove each attachment from the invalidate cb (as well as waiting for dma), won't that cause some nasty issues? No, with this idea invalidation is asynchronous. Already discussed that with Daniel as well and YES Daniel also already pointed out that I need to better document this. When the exporter calls invalidate_mappings() it just means that all importers can no longer use their sg tables for new submissions, old ones stay active. The existing sg tables are guaranteed to stay valid until all fences in the reservation object have signaled and the importer should also delay its call to dma_buf_unmap_attachment() until all the fences have signaled. When the importer has new work to do, e.g. wants to attach a new fence to the reservation object, it must grab a new sg table for that. The importer also needs to make sure that all new work touching the dma-buf doesn't start before the exclusive fence in the reservation object signals. This allows for full grown pipelining, e.g. the exporter can say I need to move the buffer for some operation. Then let the move operation wait for all existing fences in the reservation object and install the fence of the move operation as exclusive fence. The importer can then immediately grab a new sg table for the new location of the buffer and use it to prepare the next operation. Regards, Christian. -Chris
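The importer-side contract described here can be sketched as follows. The importer_bo structure, the sgt_stale flag and the function names are hypothetical; only dma_buf_map_attachment()/dma_buf_unmap_attachment() and the locking rule (reservation object held by the exporter around the callback, and by the importer around mapping) come from the proposal itself:

```c
/* Called by the exporter with the dma_buf's reservation object held. */
static void importer_invalidate_mappings(struct dma_buf_attachment *attach)
{
	struct importer_bo *bo = attach->priv;	/* hypothetical importer state */

	/* The old sg table stays valid until all fences signal; just mark
	 * it stale so no *new* submission uses it. */
	bo->sgt_stale = true;
}

/* On the next submission, with the same reservation object held: */
static int importer_prepare_submission(struct importer_bo *bo,
				       struct dma_buf_attachment *attach)
{
	if (bo->sgt_stale) {
		/* after waiting for the old fences to signal: */
		dma_buf_unmap_attachment(attach, bo->sgt, DMA_BIDIRECTIONAL);
		bo->sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
		if (IS_ERR(bo->sgt))
			return PTR_ERR(bo->sgt);
		bo->sgt_stale = false;
	}
	/* new work must additionally wait for the exclusive fence in the
	 * reservation object before touching the buffer */
	return 0;
}
```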
[PATCH 4/5] drm/amdgpu: add independent DMA-buf export v2
Instead of relying on the DRM functions just implement our own export functions. This adds support for taking care of unpinned DMA-buf. v2: fix unintended recursion, remove debugging leftovers Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 110 ++--- 4 files changed, 73 insertions(+), 44 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 96bcdb97e7e2..909dc9764a22 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -371,7 +371,6 @@ int amdgpu_gem_object_open(struct drm_gem_object *obj, void amdgpu_gem_object_close(struct drm_gem_object *obj, struct drm_file *file_priv); unsigned long amdgpu_gem_timeout(uint64_t timeout_ns); -struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj); struct drm_gem_object * amdgpu_gem_prime_import_sg_table(struct drm_device *dev, struct dma_buf_attachment *attach, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index e6709362994a..e32dcdf8d3db 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -886,7 +886,6 @@ static struct drm_driver kms_driver = { .gem_prime_export = amdgpu_gem_prime_export, .gem_prime_import = amdgpu_gem_prime_import, .gem_prime_res_obj = amdgpu_gem_prime_res_obj, - .gem_prime_get_sg_table = amdgpu_gem_prime_get_sg_table, .gem_prime_import_sg_table = amdgpu_gem_prime_import_sg_table, .gem_prime_vmap = amdgpu_gem_prime_vmap, .gem_prime_vunmap = amdgpu_gem_prime_vunmap, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 48e0115d4f76..18483a79dcdd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -31,6 
+31,7 @@ */ #include #include +#include #include #include #include @@ -931,6 +932,10 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo, amdgpu_bo_kunmap(abo); + if (abo->gem_base.dma_buf && !abo->gem_base.import_attach && + bo->mem.mem_type != TTM_PL_SYSTEM) + dma_buf_invalidate_mappings(abo->gem_base.dma_buf); + /* remember the eviction */ if (evict) atomic64_inc(>num_evictions); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 1c9991738477..7b9edbc938eb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -32,14 +32,6 @@ static const struct dma_buf_ops amdgpu_dmabuf_ops; -struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj) -{ - struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); - int npages = bo->tbo.num_pages; - - return drm_prime_pages_to_sg(bo->tbo.ttm->pages, npages); -} - void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj) { struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); @@ -126,22 +118,21 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev, return ERR_PTR(ret); } -static int amdgpu_gem_map_attach(struct dma_buf *dma_buf, -struct device *target_dev, -struct dma_buf_attachment *attach) +static struct sg_table * +amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach, + enum dma_data_direction dir) { + struct dma_buf *dma_buf = attach->dmabuf; struct drm_gem_object *obj = dma_buf->priv; struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct sg_table *sgt; long r; - r = drm_gem_map_attach(dma_buf, target_dev, attach); - if (r) - return r; - - r = amdgpu_bo_reserve(bo, false); - if (unlikely(r != 0)) - goto error_detach; - + if (!attach->invalidate_mappings) { + r = amdgpu_bo_reserve(bo, false); + if (unlikely(r != 0)) + return ERR_PTR(r); + } if (dma_buf->ops != _dmabuf_ops) { /* @@ -157,41 +148,77 @@ static int amdgpu_gem_map_attach(struct dma_buf *dma_buf, } } - /* pin buffer into GTT */ - r = amdgpu_bo_pin(bo, 
AMDGPU_GEM_DOMAIN_GTT, NULL); - if (r) + if (!attach->invalidate_mappings) { + /* pin buffer into GTT */ + r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL); + if (r) + goto error_unreserve; + + } else { + /* move buffer into GTT */
[PATCH 3/5] drm/ttm: remove the backing store if no placement is given
Pipeline removal of the BO's backing store when no placement is given during validation.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 98e06f8bf23b..17e821f01d0a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1078,6 +1078,18 @@ int ttm_bo_validate(struct ttm_buffer_object *bo,
 	uint32_t new_flags;
 
 	reservation_object_assert_held(bo->resv);
+
+	/*
+	 * Remove the backing store if no placement is given.
+	 */
+	if (!placement->num_placement && !placement->num_busy_placement) {
+		ret = ttm_bo_pipeline_gutting(bo);
+		if (ret)
+			return ret;
+
+		return ttm_tt_create(bo, false);
+	}
+
 	/*
 	 * Check whether we need to move buffer.
 	 */
-- 
2.14.1
[PATCH 5/5] drm/amdgpu: add independent DMA-buf import v2
Instead of relying on the DRM functions just implement our own import functions. This adds support for taking care of unpinned DMA-buf. v2: enable for all exporters, not just amdgpu, fix invalidation handling, lock reservation object while setting callback Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 44 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 32 ++ 2 files changed, 70 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 7b9edbc938eb..cb6ec9b7844a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -291,10 +291,26 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_device *dev, return buf; } +static void +amdgpu_gem_prime_invalidate_mappings(struct dma_buf_attachment *attach) +{ + struct ttm_operation_ctx ctx = { false, false }; + struct drm_gem_object *obj = attach->priv; + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct ttm_placement placement = {}; + int r; + + r = ttm_bo_validate(>tbo, , ); + if (r) + DRM_ERROR("Failed to unmap DMA-buf import (%d))\n", r); +} + struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, struct dma_buf *dma_buf) { + struct dma_buf_attachment *attach; struct drm_gem_object *obj; + int ret; if (dma_buf->ops == _dmabuf_ops) { obj = dma_buf->priv; @@ -308,5 +324,31 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, } } - return drm_gem_prime_import(dev, dma_buf); + if (!dma_buf->ops->supports_mapping_invalidation) + return drm_gem_prime_import(dev, dma_buf); + + attach = dma_buf_attach(dma_buf, dev->dev); + if (IS_ERR(attach)) + return ERR_CAST(attach); + + get_dma_buf(dma_buf); + + obj = amdgpu_gem_prime_import_sg_table(dev, attach, NULL); + if (IS_ERR(obj)) { + ret = PTR_ERR(obj); + goto fail_detach; + } + + obj->import_attach = attach; + attach->priv = obj; + + 
dma_buf_set_invalidate_callback(attach, + amdgpu_gem_prime_invalidate_mappings); + return obj; + +fail_detach: + dma_buf_detach(dma_buf, attach); + dma_buf_put(dma_buf); + + return ERR_PTR(ret); } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 291dd3d600cd..aeead0281e92 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "amdgpu.h" #include "amdgpu_object.h" #include "amdgpu_trace.h" @@ -685,6 +686,7 @@ struct amdgpu_ttm_gup_task_list { struct amdgpu_ttm_tt { struct ttm_dma_tt ttm; + struct amdgpu_bo*bo; u64 offset; uint64_tuserptr; struct mm_struct*usermm; @@ -993,6 +995,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo, return NULL; } gtt->ttm.ttm.func = _backend_func; + gtt->bo = ttm_to_amdgpu_bo(bo); if (ttm_sg_tt_init(>ttm, bo, page_flags)) { kfree(gtt); return NULL; @@ -1005,7 +1008,6 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm, { struct amdgpu_device *adev = amdgpu_ttm_adev(ttm->bdev); struct amdgpu_ttm_tt *gtt = (void *)ttm; - bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG); if (gtt && gtt->userptr) { ttm->sg = kzalloc(sizeof(struct sg_table), GFP_KERNEL); @@ -1017,7 +1019,19 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm, return 0; } - if (slave && ttm->sg) { + if (ttm->page_flags & TTM_PAGE_FLAG_SG) { + if (!ttm->sg) { + struct dma_buf_attachment *attach; + struct sg_table *sgt; + + attach = gtt->bo->gem_base.import_attach; + sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL); + if (IS_ERR(sgt)) + return PTR_ERR(sgt); + + ttm->sg = sgt; + } + drm_prime_sg_to_page_addr_arrays(ttm->sg, ttm->pages, gtt->ttm.dma_address, ttm->num_pages); @@ -1036,9 +1050,8 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm
[PATCH 2/5] drm/ttm: keep a reference to transfer pipelined BOs
Make sure the transferred BO is never destroyed before the transfer BO.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 50 +++
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 1f730b3f18e5..1ee20558ee31 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -39,6 +39,11 @@
 #include
 #include
 
+struct ttm_transfer_obj {
+	struct ttm_buffer_object base;
+	struct ttm_buffer_object *bo;
+};
+
 void ttm_bo_free_old_node(struct ttm_buffer_object *bo)
 {
 	ttm_bo_mem_put(bo, &bo->mem);
@@ -435,7 +440,11 @@ EXPORT_SYMBOL(ttm_bo_move_memcpy);
 
 static void ttm_transfered_destroy(struct ttm_buffer_object *bo)
 {
-	kfree(bo);
+	struct ttm_transfer_obj *fbo;
+
+	fbo = container_of(bo, struct ttm_transfer_obj, base);
+	ttm_bo_unref(&fbo->bo);
+	kfree(fbo);
 }
 
 /**
@@ -456,14 +465,15 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo)
 static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
 				      struct ttm_buffer_object **new_obj)
 {
-	struct ttm_buffer_object *fbo;
+	struct ttm_transfer_obj *fbo;
 	int ret;
 
 	fbo = kmalloc(sizeof(*fbo), GFP_KERNEL);
 	if (!fbo)
 		return -ENOMEM;
 
-	*fbo = *bo;
+	fbo->base = *bo;
+	fbo->bo = ttm_bo_reference(bo);
 
 	/**
 	 * Fix up members that we shouldn't copy directly:
@@ -471,25 +481,25 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
 	 */
 
 	atomic_inc(&bo->bdev->glob->bo_count);
-	INIT_LIST_HEAD(&fbo->ddestroy);
-	INIT_LIST_HEAD(&fbo->lru);
-	INIT_LIST_HEAD(&fbo->swap);
-	INIT_LIST_HEAD(&fbo->io_reserve_lru);
-	mutex_init(&fbo->wu_mutex);
-	fbo->moving = NULL;
-	drm_vma_node_reset(&fbo->vma_node);
-	atomic_set(&fbo->cpu_writers, 0);
-
-	kref_init(&fbo->list_kref);
-	kref_init(&fbo->kref);
-	fbo->destroy = &ttm_transfered_destroy;
-	fbo->acc_size = 0;
-	fbo->resv = &fbo->ttm_resv;
-	reservation_object_init(fbo->resv);
-	ret = reservation_object_trylock(fbo->resv);
+	INIT_LIST_HEAD(&fbo->base.ddestroy);
+	INIT_LIST_HEAD(&fbo->base.lru);
+	INIT_LIST_HEAD(&fbo->base.swap);
+	INIT_LIST_HEAD(&fbo->base.io_reserve_lru);
+	mutex_init(&fbo->base.wu_mutex);
+	fbo->base.moving = NULL;
+	drm_vma_node_reset(&fbo->base.vma_node);
+	atomic_set(&fbo->base.cpu_writers, 0);
+
+	kref_init(&fbo->base.list_kref);
+	kref_init(&fbo->base.kref);
+	fbo->base.destroy = &ttm_transfered_destroy;
+	fbo->base.acc_size = 0;
+	fbo->base.resv = &fbo->base.ttm_resv;
+	reservation_object_init(fbo->base.resv);
+	ret = reservation_object_trylock(fbo->base.resv);
 	WARN_ON(!ret);
 
-	*new_obj = fbo;
+	*new_obj = &fbo->base;
 
 	return 0;
 }
-- 
2.14.1
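The idea of the patch above can be modeled in a few lines of userspace C: the transfer object embeds a copy of the BO state and additionally holds a counted reference to the original BO, so the original can never be freed while a transfer object still points into it. This is only an illustrative sketch with made-up names (`bo_ref`, `transfer_create`, ...), not the TTM API.

```c
#include <assert.h>
#include <stdlib.h>

struct bo {
	int refcount;
	int freed;	/* stand-in for the object actually being kfree()d */
};

static struct bo *bo_ref(struct bo *bo)
{
	bo->refcount++;
	return bo;
}

static void bo_unref(struct bo *bo)
{
	if (--bo->refcount == 0)
		bo->freed = 1;
}

struct transfer_obj {
	struct bo base;		/* copied state, like fbo->base = *bo */
	struct bo *bo;		/* reference keeping the original alive */
};

static struct transfer_obj *transfer_create(struct bo *bo)
{
	struct transfer_obj *fbo = malloc(sizeof(*fbo));

	if (!fbo)
		return NULL;
	fbo->base = *bo;		/* like "fbo->base = *bo" */
	fbo->base.refcount = 1;
	fbo->bo = bo_ref(bo);		/* like ttm_bo_reference(bo) */
	return fbo;
}

static void transfer_destroy(struct transfer_obj *fbo)
{
	bo_unref(fbo->bo);		/* mirrors ttm_bo_unref(&fbo->bo) */
	free(fbo);
}
```

Dropping the last external reference to the BO no longer frees it; only destroying the transfer object does.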
[PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2
Each importer can now provide an invalidate_mappings callback. This allows the exporter to provide the mappings without the need to pin the backing store.

v2: don't try to invalidate mappings when the callback is NULL, lock the reservation obj while using the attachments, add helper to set the callback

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/dma-buf.c | 60 +++
 include/linux/dma-buf.h   | 38 ++
 2 files changed, 98 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..ed2b3559ba25 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -572,7 +572,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 		if (ret)
 			goto err_attach;
 	}
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_add(&attach->node, &dmabuf->attachments);
+	reservation_object_unlock(dmabuf->resv);
 	mutex_unlock(&dmabuf->lock);
 
 	return attach;
@@ -598,7 +600,9 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach)
 		return;
 
 	mutex_lock(&dmabuf->lock);
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_del(&attach->node);
+	reservation_object_unlock(dmabuf->resv);
 	if (dmabuf->ops->detach)
 		dmabuf->ops->detach(dmabuf, attach);
 
@@ -632,10 +636,23 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 	if (WARN_ON(!attach || !attach->dmabuf))
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * Mapping a DMA-buf can trigger its invalidation; prevent sending this
+	 * event to the caller by temporarily removing this attachment from the
+	 * list.
+	 */
+	if (attach->invalidate_mappings) {
+		reservation_object_assert_held(attach->dmabuf->resv);
+		list_del(&attach->node);
+	}
+
 	sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
 	if (!sg_table)
 		sg_table = ERR_PTR(-ENOMEM);
 
+	if (attach->invalidate_mappings)
+		list_add(&attach->node, &attach->dmabuf->attachments);
+
 	return sg_table;
 }
 EXPORT_SYMBOL_GPL(dma_buf_map_attachment);
@@ -656,6 +673,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
+	if (attach->invalidate_mappings)
+		reservation_object_assert_held(attach->dmabuf->resv);
+
 	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
 		return;
 
@@ -664,6 +684,44 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
 
+/**
+ * dma_buf_set_invalidate_callback - set the invalidate_mappings callback
+ *
+ * @attach:	[in]	attachment where to set the callback
+ * @cb:		[in]	the callback to set
+ *
+ * Makes sure to take the appropriate locks when updating the invalidate
+ * mappings callback.
+ */
+void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
+				     void (*cb)(struct dma_buf_attachment *))
+{
+	reservation_object_lock(attach->dmabuf->resv, NULL);
+	attach->invalidate_mappings = cb;
+	reservation_object_unlock(attach->dmabuf->resv);
+}
+EXPORT_SYMBOL_GPL(dma_buf_set_invalidate_callback);
+
+/**
+ * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
+ *
+ * @dmabuf:	[in]	buffer which mappings should be invalidated
+ *
+ * Informs all attachments that they need to destroy and recreate all their
+ * mappings.
+ */
+void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
+{
+	struct dma_buf_attachment *attach;
+
+	reservation_object_assert_held(dmabuf->resv);
+
+	list_for_each_entry(attach, &dmabuf->attachments, node)
+		if (attach->invalidate_mappings)
+			attach->invalidate_mappings(attach);
+}
+EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
+
 /**
  * DOC: cpu access
  *
@@ -1121,10 +1179,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		seq_puts(s, "\tAttached Devices:\n");
 		attach_count = 0;
 
+		reservation_object_lock(buf_obj->resv, NULL);
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
 			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
 			attach_count++;
 		}
+		reservation_object_unlock(buf_obj->resv);
 
 		seq_printf(s, "Total %d devices attached\n\n", attach_count);

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 085db2fee2d7..70c65fcfe1e3 100644
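The list_del/list_add dance in dma_buf_map_attachment() above can be modeled in isolation: mapping may itself trigger an invalidation inside the exporter, so the attachment being mapped is temporarily taken off the list to avoid notifying the caller about an event it caused. A userspace sketch with illustrative names (not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTACH 4

struct attachment {
	int on_list;		/* models membership in dmabuf->attachments */
	int invalidate_count;	/* how often this importer was notified */
};

static struct attachment *attachments[MAX_ATTACH];

/* Exporter side: notify every attachment currently on the list. */
static void invalidate_all(void)
{
	for (int i = 0; i < MAX_ATTACH; i++)
		if (attachments[i] && attachments[i]->on_list)
			attachments[i]->invalidate_count++;
}

/* Importer side: map, which may move the buffer and thus invalidate it. */
static void map_attachment(struct attachment *a)
{
	a->on_list = 0;		/* list_del(&attach->node) */
	invalidate_all();	/* exporter moves the buffer while mapping */
	a->on_list = 1;		/* list_add(&attach->node, ...) */
}
```

The caller of map never sees the invalidation it triggered; every other importer does.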
RFC: unpinned DMA-buf exporting v2
Hi everybody,

since I've got positive feedback from Daniel I continued working on this approach. A few issues are still open:

1. Daniel suggested that I make the invalidate_mappings callback a parameter of dma_buf_attach(). This approach unfortunately won't work because when the attachment is created the importer is not necessarily ready to handle invalidation events. E.g. in the amdgpu example we first need to set up the imported GEM/TTM objects and install those in the attachment. My solution is to introduce a separate function to grab the locks and set the callback; this function could then also be used to pin the buffer later on if that turns out to be necessary after all.

2. With my example setup this currently results in a ping/pong situation because the exporter prefers a VRAM placement while the importer prefers a GTT placement. This results in quite a performance drop, but it can be fixed by a simple Mesa patch which allows shared BOs to be placed in both VRAM and GTT. The question is what we should do in the meantime: accept the performance drop, or only allow unpinned sharing with a new Mesa?

Please review and comment,
Christian.
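Point 1 above — the importer not being ready at attach time — is easy to model: with the v2 semantics, an invalidation with a NULL callback is simply skipped, so the importer can safely defer registration until its objects exist. A minimal userspace sketch (illustrative names, not the dma-buf API):

```c
#include <assert.h>
#include <stddef.h>

struct attachment {
	void (*invalidate)(struct attachment *);	/* NULL until ready */
	int events;
};

/* Exporter side: v2 rule, don't call a NULL callback. */
static void invalidate(struct attachment *a)
{
	if (a->invalidate)
		a->invalidate(a);
}

/* Importer callback, installed only once its objects are set up. */
static void importer_cb(struct attachment *a)
{
	a->events++;
}
```

Before the callback is installed, invalidations are no-ops; afterwards they are delivered — which is exactly why the registration is a separate, later step.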
Re: [PATCH v2] Add udmabuf misc device
Am 16.03.2018 um 08:46 schrieb Gerd Hoffmann: A driver to let userspace turn iovecs into dma-bufs. Use case: Allows qemu to create dmabufs for the vga framebuffer or virtio-gpu resources. Then they can be passed around to display those guest things on the host. To the spice client for classic full-framebuffer display, and hopefully some day to a wayland server for seamless guest window display. Those dma-bufs are accounted against the user's shm mlock bucket as the pages are effectively locked in memory.

Sorry to disappoint you, but we have investigated the same idea for userptr use quite extensively and found that the whole approach doesn't work. The pages backing a DMA-buf are not allowed to move (at least not without a patch set I'm currently working on), but for certain MM operations to work correctly you must be able to modify the page table entries and move the pages backing them around.

For example, try to use fork() with some copy-on-write pages with this approach. You will find that you have only two options to correctly handle this. Either you access memory which could no longer belong to your process, which in turn is a major security no-go. Or you use an MM notifier, but without being able to unmap DMA-bufs that won't work and you will certainly create a deadlock. Even with the patch set I'm currently working on, the MM notifier approach is a real beast to get right.

Regards,
Christian.
Cc: David Airlie
Cc: Tomeu Vizoso
Cc: Daniel Vetter
Signed-off-by: Gerd Hoffmann
---
 include/uapi/linux/udmabuf.h                      |  23 ++
 drivers/dma-buf/udmabuf.c                         | 261 ++
 tools/testing/selftests/drivers/dma-buf/udmabuf.c |  69 ++
 drivers/dma-buf/Kconfig                           |   7 +
 drivers/dma-buf/Makefile                          |   1 +
 tools/testing/selftests/drivers/dma-buf/Makefile  |   5 +
 6 files changed, 366 insertions(+)
 create mode 100644 include/uapi/linux/udmabuf.h
 create mode 100644 drivers/dma-buf/udmabuf.c
 create mode 100644 tools/testing/selftests/drivers/dma-buf/udmabuf.c
 create mode 100644 tools/testing/selftests/drivers/dma-buf/Makefile

diff --git a/include/uapi/linux/udmabuf.h b/include/uapi/linux/udmabuf.h
new file mode 100644
index 00..54ceba203a
--- /dev/null
+++ b/include/uapi/linux/udmabuf.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UDMABUF_H
+#define _UAPI_LINUX_UDMABUF_H
+
+#include
+#include
+
+struct udmabuf_iovec {
+	__u64 base;
+	__u64 len;
+};
+
+#define UDMABUF_FLAGS_CLOEXEC	0x01
+
+struct udmabuf_create {
+	__u32 flags;
+	__u32 niov;
+	struct udmabuf_iovec iovs[];
+};
+
+#define UDMABUF_CREATE _IOW(0x42, 0x23, struct udmabuf_create)
+
+#endif /* _UAPI_LINUX_UDMABUF_H */

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
new file mode 100644
index 00..664ab4ee4e
--- /dev/null
+++ b/drivers/dma-buf/udmabuf.c
@@ -0,0 +1,261 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+struct udmabuf {
+	u32 pagecount;
+	struct page **pages;
+	struct user_struct *owner;
+};
+
+static int udmabuf_vm_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct udmabuf *ubuf = vma->vm_private_data;
+
+	if (WARN_ON(vmf->pgoff >= ubuf->pagecount))
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = ubuf->pages[vmf->pgoff];
+	get_page(vmf->page);
+	return 0;
+}
+
+static const struct vm_operations_struct udmabuf_vm_ops = {
+	.fault = udmabuf_vm_fault,
+};
+
+static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
+{
+	struct udmabuf *ubuf = buf->priv;
+
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+
+	vma->vm_ops = &udmabuf_vm_ops;
+	vma->vm_private_data = ubuf;
+	return 0;
+}
+
+static struct sg_table *map_udmabuf(struct dma_buf_attachment *at,
+				    enum dma_data_direction direction)
+{
+	struct udmabuf *ubuf = at->dmabuf->priv;
+	struct sg_table *sg;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		goto err1;
+	if (sg_alloc_table_from_pages(sg, ubuf->pages, ubuf->pagecount,
+				      0, ubuf->pagecount << PAGE_SHIFT,
+				      GFP_KERNEL) < 0)
+		goto err2;
+	if (!dma_map_sg(at->dev, sg->sgl, sg->nents, direction))
+		goto err3;
+
+	return sg;
+
+err3:
+	sg_free_table(sg);
+err2:
+	kfree(sg);
+err1:
Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback
Am 15.03.2018 um 10:20 schrieb Daniel Vetter: On Tue, Mar 13, 2018 at 06:20:07PM +0100, Christian König wrote: [SNIP] Take a look at the DOT graphs for atomic I've done a while ago. I think we could make a formidable competition for who's doing the worst diagrams :-) Thanks, going to give that a try.

[SNIP] amdgpu: Expects that you never hold any of the heavyweight locks while waiting for a fence (since gpu resets will need them). i915: Happily blocks on fences while holding all kinds of locks, expects gpu reset to be able to recover even in this case.

In this case I can comfort you: the locks amdgpu needs to grab during GPU reset are the reservation locks of the VM page tables. I have strong doubts that i915 will ever hold those. It could be that we run into problems because thread A holds lock 1 and tries to take lock 2, while i915 holds 2 and our reset path needs 1.

[SNIP] Yes, except for fallback paths and bootup self tests we simply never wait for fences while holding locks. That's not what I meant with "are you sure". Did you enable the cross-release stuff (after patching the bunch of leftover core kernel issues still present), annotate dma_fence with the cross-release stuff, run a bunch of multi-driver (amdgpu vs i915) dma-buf sharing tests and weep? Ok, what exactly do you mean with cross-release checking?

I didn't do the full thing yet, but just within i915 we've found tons of little deadlocks we never really considered, thanks to cross-release, and that wasn't even including the dma_fence annotation. Luckily nothing that needed a full-on driver redesign. I guess I need to ping core kernel maintainers about cross-release again. I'd much prefer if we could validate ->invalidate_mapping and the locking/fence dependency issues using that, instead of me having to read and understand all the drivers.

[SNIP] I fear that with the ->invalidate_mapping callback (which inverts the control flow between importer and exporter) and tying dma_fences into all this it will be a _lot_ worse. And I'm definitely too stupid to understand all the dependency chains without the aid of lockdep and a full test suite (we have a bunch of amdgpu/i915 dma-buf tests in igt btw). Yes, that is also something I worry about.

Regards,
Christian.
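One locking rule from this thread (spelled out earlier in the discussion) is that a shrinker running against objects whose reservation lock may be held by an allocating task must only trylock and skip contended objects — otherwise allocation under the reservation lock can deadlock against reclaim. A tiny userspace model of that discipline, with a flag standing in for the ww_mutex (all names illustrative):

```c
#include <assert.h>

struct buffer {
	int locked;	/* models the reservation object being held */
	int evicted;
};

static int resv_trylock(struct buffer *b)
{
	if (b->locked)
		return 0;	/* contended */
	b->locked = 1;
	return 1;
}

static void resv_unlock(struct buffer *b)
{
	b->locked = 0;
}

/* Shrinker: never blocks on a held reservation, just skips the object. */
static int shrink(struct buffer *bufs, int n)
{
	int freed = 0;

	for (int i = 0; i < n; i++) {
		if (!resv_trylock(&bufs[i]))
			continue;	/* skip instead of deadlocking */
		bufs[i].evicted = 1;
		freed++;
		resv_unlock(&bufs[i]);
	}
	return freed;
}
```

A task that allocates memory while holding one buffer's reservation can then never wait on itself through reclaim; the shrinker simply frees what it can.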
Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback
Am 13.03.2018 um 17:00 schrieb Daniel Vetter: On Tue, Mar 13, 2018 at 04:52:02PM +0100, Christian König wrote: Am 13.03.2018 um 16:17 schrieb Daniel Vetter: [SNIP] Ok, so the plan is to support fully pipelined moves and everything, with the old sg tables lazily cleaned up. I was thinking more about evicting stuff and throwing it out, where there's not going to be any new sg list but the object is going to be swapped out. Yes, exactly. Well, my example was the unlikely case where the object is swapped out and immediately swapped in again because somebody needs it.

I think some state flow charts (we can do SVG or DOT) in the kerneldoc would be sweet. Yeah, probably a good idea. Sounds good, and I find it great that you're volunteering for that :D Ok seriously, my drawing capabilities are a bit underdeveloped, so I would prefer if somebody could at least help with that.

Re GPU might cause a deadlock: Isn't that already a problem if you hold reservations of buffers used on other gpus, which want those reservations to complete the gpu reset, but that gpu reset blocks some fence that the reservation holder is waiting for? Correct, that's why amdgpu and TTM try quite hard to never wait for a fence while a reservation object is locked. We might have a fairly huge mismatch of expectations here :-/ What do you mean with that?

The only use case I haven't fixed so far is reaping deleted objects during eviction, but that is only a matter of my free time to fix it. Yeah, this is the hard one. Actually it isn't so hard, it's just that I didn't have time so far to clean it up and we never hit that issue during our reset testing. The main missing piece is just a bit of functionality in the reservation object, and Chris and I already had a good idea how to implement that.

In general the assumption is that dma_fence will get signalled no matter what you're doing, assuming the only thing you need is to not block interrupts. The i915 gpu reset logic to make that work is a bit a work of art ... Correct, but I don't understand why that is so hard on i915? Our GPU scheduler makes all of that rather trivial, e.g. fences either signal correctly or are aborted and set as erroneous after a timeout.

If we expect amdgpu and i915 to cooperate with shared buffers I guess one has to give in. No idea how to do that best. Again, at least from the amdgpu side I don't see much of an issue with that. So what exactly do you have in mind here?

We have tons of fun with deadlocks against GPU resets, and lots of test cases, and I kinda get the impression amdgpu is throwing a lot of issues under the rug through trylock tricks that shut up lockdep, but don't fix much really. Hui? Why do you think that? The only trylock I'm aware of is during eviction and there it isn't a problem. The mmap fault handler had one too last time I looked, and it smelled fishy. Good point, never wrapped my head fully around that one either.

btw adding cross-release lockdep annotations for fences will probably turn up _lots_ more bugs in this area. At least for amdgpu that should be handled by now. You're sure? :-) Yes, except for fallback paths and bootup self tests we simply never wait for fences while holding locks.

Trouble is that cross-release wasn't even ever enabled, much less anyone typed the dma_fence annotations. And just cross-release alone turned up _lots_ of deadlocks in i915 between fences, async workers (userptr, gpu reset) and core mm stuff. Yeah, we had lots of fun with the mm locks as well, but as far as I know Felix and I already fixed all of them. Christian. I'd be seriously surprised if it wouldn't find an entire rat's nest of issues around dma_fence once we enable it. -Daniel

+	 *
+	 * New mappings can be created immediately, but can't be used before
+	 * the exclusive fence in the dma_buf's reservation object is signaled.
+	 */
+	void (*invalidate_mappings)(struct dma_buf_attachment *attach);

Bunch of questions about exact semantics, but I very much like this. And I think besides those technical details, the overall approach seems sound. Yeah, this initial implementation was buggy like hell. Just wanted to confirm that the idea is going in the right direction. I wanted this 7 years ago, idea very much acked :-) Ok, thanks. Good to know. Christian.
Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback
Am 13.03.2018 um 16:17 schrieb Daniel Vetter: [SNIP] I think a helper which both unmaps _and_ waits for all the fences to clear would be best, with some guarantees that it'll either fail or all the mappings _will_ be gone. The locking for that one will be hilarious, since we need to figure out dmabuf->lock vs. the reservation. I kinda prefer we throw away the dmabuf->lock and supersede it entirely by the reservation lock.

Big NAK on that. The whole API is asynchronous, e.g. we never block for any operation to finish. Otherwise you run into big trouble with cross-device GPU resets and stuff like that.

But how will the unmapping work then? You can't throw the sg list away before the dma stopped. The dma only stops once the fence is signalled. The importer can't call dma_buf_detach because the reservation lock is hogged already by the exporter trying to unmap everything. How is this supposed to work?

Even after invalidation the sg list stays alive until it is explicitly destroyed by the importer using dma_buf_unmap_attachment(), which in turn is only allowed after all fences have signaled. The implementation is in ttm_bo_pipeline_gutting(): basically we use the same functionality as for pipelined moves/evictions, which hangs the old backing store on a dummy object and destroys it after all fences signaled.

While the old sg list is still about to be destroyed, the importer can request a new sg list for the new location of the DMA-buf using dma_buf_map_attachment(). This new location becomes valid after the move fence in the reservation object is signaled. So from the CPU point of view multiple sg lists can exist at the same time, which allows us to have a seamless transition from the old to the new location from the GPU point of view.

Re GPU might cause a deadlock: Isn't that already a problem if you hold reservations of buffers used on other gpus, which want those reservations to complete the gpu reset, but that gpu reset blocks some fence that the reservation holder is waiting for? Correct, that's why amdgpu and TTM try quite hard to never wait for a fence while a reservation object is locked. The only use case I haven't fixed so far is reaping deleted objects during eviction, but that is only a matter of my free time to fix it.

We have tons of fun with deadlocks against GPU resets, and lots of test cases, and I kinda get the impression amdgpu is throwing a lot of issues under the rug through trylock tricks that shut up lockdep, but don't fix much really. Hui? Why do you think that? The only trylock I'm aware of is during eviction and there it isn't a problem.

btw adding cross-release lockdep annotations for fences will probably turn up _lots_ more bugs in this area. At least for amdgpu that should be handled by now.

+	 *
+	 * New mappings can be created immediately, but can't be used before
+	 * the exclusive fence in the dma_buf's reservation object is signaled.
+	 */
+	void (*invalidate_mappings)(struct dma_buf_attachment *attach);

Bunch of questions about exact semantics, but I very much like this. And I think besides those technical details, the overall approach seems sound. Yeah, this initial implementation was buggy like hell. Just wanted to confirm that the idea is going in the right direction. I wanted this 7 years ago, idea very much acked :-) Ok, thanks. Good to know. Christian.
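The pipelined-gutting semantics Christian describes — the old sg list stays usable until all fences signal and may only then be destroyed, while a new mapping exists in parallel but becomes usable only once the exclusive move fence signals — can be captured in a small state model. Userspace sketch with illustrative names:

```c
#include <assert.h>
#include <stddef.h>

struct fence {
	int signaled;
};

struct mapping {
	struct fence *gate;	/* move fence gating use; NULL = none */
	int invalidated;	/* exporter has invalidated this mapping */
};

/* New mapping: can't be used before its exclusive move fence signals. */
static int mapping_usable(const struct mapping *m)
{
	return m->gate == NULL || m->gate->signaled;
}

/* Old mapping: may only be unmapped after all importer fences signaled. */
static int mapping_may_destroy(const struct mapping *m,
			       const struct fence *work)
{
	return m->invalidated && work->signaled;
}
```

Note how both mappings coexist: the old one stays valid for in-flight DMA while the new one waits on the move fence, giving the seamless transition described above.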
Re: RFC: unpinned DMA-buf exporting
Am 12.03.2018 um 18:24 schrieb Daniel Vetter: On Fri, Mar 09, 2018 at 08:11:40PM +0100, Christian König wrote: This set of patches adds an optional invalidate_mappings callback to each DMA-buf attachment which can be filled in by the importer. This callback allows the exporter to provide the DMA-buf content without pinning it. The reservation object's lock acts as synchronization point for buffer moves and creating mappings. This set includes an implementation for amdgpu which should be rather easily portable to other DRM drivers.

Bunch of higher level comments, and one I've forgotten in reply to patch 1:

- What happens when a dma-buf is pinned (e.g. i915 loves to pin buffers for scanout)? When you need to pin an imported DMA-buf you need to detach and reattach without the invalidate_mappings callback.

- Pulling the dma-buf implementations into amdgpu makes sense, that's kinda how it was meant to be anyway. The gem prime helpers are a bit too much midlayer for my taste (mostly because nvidia wanted to bypass the EXPORT_SYMBOL_GPL of core dma-buf, hooray for legal bs). We can always extract more helpers once there are more ttm based drivers doing this.

Yeah, I thought to abstract that similar to the AGP backend. Just moving some callbacks around in TTM should be sufficient to de-midlayer the whole thing. Thanks, Christian.

Overall I like, there's some details to figure out first. -Daniel
Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback
Am 12.03.2018 um 18:07 schrieb Daniel Vetter: On Fri, Mar 09, 2018 at 08:11:41PM +0100, Christian König wrote: [SNIP]

+/**
+ * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
+ *
+ * @dmabuf:	[in]	buffer which mappings should be invalidated
+ *
+ * Informs all attachments that they need to destroy and recreate all their
+ * mappings.
+ */
+void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
+{
+	struct dma_buf_attachment *attach;
+
+	reservation_object_assert_held(dmabuf->resv);
+
+	list_for_each_entry(attach, &dmabuf->attachments, node)
+		attach->invalidate_mappings(attach);

To make the locking work I think we also need to require importers to hold the reservation object while attaching/detaching. Otherwise the list walk above could go boom. Oh, good point. Going to fix this.

[SNIP]

+	/**
+	 * @supports_mapping_invalidation:
+	 *
+	 * True for exporters which support unpinned DMA-buf operation using
+	 * the reservation lock.
+	 *
+	 * When attachment->invalidate_mappings is set the @map_dma_buf and
+	 * @unmap_dma_buf callbacks can be called with the reservation lock
+	 * held.
+	 */
+	bool supports_mapping_invalidation;

Why do we need this? Importers could simply always register with the invalidate_mapping hook registered, and exporters could use it when they see fit. That gives us more lockdep coverage to make sure importers use their attachment callbacks correctly (aka they hold the reservation object). One sole reason: backward compatibility. I didn't want to audit all those different drivers to check whether they can handle being called with the reservation lock held.

+	/**
+	 * @map_dma_buf:
+	 *
@@ -326,6 +338,29 @@ struct dma_buf_attachment {
 	struct device *dev;
 	struct list_head node;
 	void *priv;
+
+	/**
+	 * @invalidate_mappings:
+	 *
+	 * Optional callback provided by the importer of the attachment which
+	 * must be set before mappings are created.

This doesn't work, it must be set before the attachment is created, otherwise you race with your invalidate callback. Another good point. I think the simplest option would be to add a new dma_buf_attach_dynamic (well, with a less crappy name). Well, how about adding an optional invalidate_mappings parameter to the existing dma_buf_attach?

+	 *
+	 * If provided the exporter can avoid pinning the backing store while
+	 * mappings exist.
+	 *
+	 * The function is called with the lock of the reservation object
+	 * associated with the dma_buf held and the mapping function must be
+	 * called with this lock held as well. This makes sure that no mapping
+	 * is created concurrently with an ongoing invalidation.
+	 *
+	 * After the callback all existing mappings are still valid until all
+	 * fences in the dma_buf's reservation object are signaled, but should
+	 * be destroyed by the importer as soon as possible.

Do we guarantee that the importer will attach a fence, after which the mapping will be gone? What about re-trying? Or just best effort (i.e. only useful for evicting to try to make room)? The importer should attach fences for all its operations with the DMA-buf.

I think a helper which both unmaps _and_ waits for all the fences to clear would be best, with some guarantees that it'll either fail or all the mappings _will_ be gone. The locking for that one will be hilarious, since we need to figure out dmabuf->lock vs. the reservation. I kinda prefer we throw away the dmabuf->lock and supersede it entirely by the reservation lock. Big NAK on that. The whole API is asynchronous, e.g. we never block for any operation to finish. Otherwise you run into big trouble with cross-device GPU resets and stuff like that.

+	 *
+	 * New mappings can be created immediately, but can't be used before
+	 * the exclusive fence in the dma_buf's reservation object is signaled.
+	 */
+	void (*invalidate_mappings)(struct dma_buf_attachment *attach);

Bunch of questions about exact semantics, but I very much like this. And I think besides those technical details, the overall approach seems sound. Yeah, this initial implementation was buggy like hell. Just wanted to confirm that the idea is going in the right direction. Thanks for the comments, Christian. -Daniel
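Daniel's suggested helper — unmap _and_ guarantee that either the call fails or all mappings are really gone once every importer fence has cleared — reduces to a simple invariant that can be modeled without any kernel infrastructure. Userspace sketch with illustrative names; a real implementation would block on the fences rather than fail fast:

```c
#include <assert.h>

#define MAX_FENCES 8

struct resv {
	int signaled[MAX_FENCES];	/* importer fences on the buffer */
	int nfences;
	int mapped;			/* any mappings still alive? */
};

static int all_signaled(const struct resv *r)
{
	for (int i = 0; i < r->nfences; i++)
		if (!r->signaled[i])
			return 0;
	return 1;
}

/* Either fails (fences still pending) or all mappings _will_ be gone. */
static int unmap_and_wait(struct resv *r)
{
	if (!all_signaled(r))
		return -1;	/* model of "fail or block" */
	r->mapped = 0;
	return 0;
}
```

The point of the invariant: the exporter may only reuse the backing store on the success path, never after a failure, which is the guarantee the proposed helper is meant to provide.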
[PATCH 1/4] dma-buf: add optional invalidate_mappings callback
Each importer can now provide an invalidate_mappings callback. This allows the exporter to provide the mappings without the need to pin the backing store.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/dma-buf.c | 25 +
 include/linux/dma-buf.h   | 36 +
 2 files changed, 61 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..ed8d5844ae74 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -629,6 +629,9 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 	might_sleep();
 
+	if (attach->invalidate_mappings)
+		reservation_object_assert_held(attach->dmabuf->resv);
+
 	if (WARN_ON(!attach || !attach->dmabuf))
 		return ERR_PTR(-EINVAL);
 
@@ -656,6 +659,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 	might_sleep();
 
+	if (attach->invalidate_mappings)
+		reservation_object_assert_held(attach->dmabuf->resv);
+
 	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
 		return;
 
@@ -664,6 +670,25 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
 
+/**
+ * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
+ *
+ * @dmabuf:	[in]	buffer which mappings should be invalidated
+ *
+ * Informs all attachments that they need to destroy and recreate all their
+ * mappings.
+ */
+void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
+{
+	struct dma_buf_attachment *attach;
+
+	reservation_object_assert_held(dmabuf->resv);
+
+	list_for_each_entry(attach, &dmabuf->attachments, node)
+		attach->invalidate_mappings(attach);
+}
+EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
+
 /**
  * DOC: cpu access
  *
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 085db2fee2d7..c1e2f7d93509 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -91,6 +91,18 @@ struct dma_buf_ops {
 	 */
 	void (*detach)(struct dma_buf *, struct dma_buf_attachment *);
 
+	/**
+	 * @supports_mapping_invalidation:
+	 *
+	 * True for exporters which support unpinned DMA-buf operation using
+	 * the reservation lock.
+	 *
+	 * When attachment->invalidate_mappings is set the @map_dma_buf and
+	 * @unmap_dma_buf callbacks can be called with the reservation lock
+	 * held.
+	 */
+	bool supports_mapping_invalidation;
+
 	/**
 	 * @map_dma_buf:
 	 *
@@ -326,6 +338,29 @@ struct dma_buf_attachment {
 	struct device *dev;
 	struct list_head node;
 	void *priv;
+
+	/**
+	 * @invalidate_mappings:
+	 *
+	 * Optional callback provided by the importer of the attachment which
+	 * must be set before mappings are created.
+	 *
+	 * If provided the exporter can avoid pinning the backing store while
+	 * mappings exist.
+	 *
+	 * The function is called with the lock of the reservation object
+	 * associated with the dma_buf held and the mapping function must be
+	 * called with this lock held as well. This makes sure that no mapping
+	 * is created concurrently with an ongoing invalidation.
+	 *
+	 * After the callback all existing mappings are still valid until all
+	 * fences in the dma_buf's reservation object are signaled, but should
+	 * be destroyed by the importer as soon as possible.
+	 *
+	 * New mappings can be created immediately, but can't be used before
+	 * the exclusive fence in the dma_buf's reservation object is signaled.
+	 */
+	void (*invalidate_mappings)(struct dma_buf_attachment *attach);
 };
 
 /**
@@ -391,6 +426,7 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
 					enum dma_data_direction);
 void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
 			      enum dma_data_direction);
+void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
 int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
 			     enum dma_data_direction dir);
 int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
-- 
2.14.1
[PATCH 3/4] drm/amdgpu: add independent DMA-buf export
Instead of relying on the DRM functions just implement our own export functions. This adds support for taking care of unpinned DMA-buf.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c  | 117 ++---
 4 files changed, 79 insertions(+), 44 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 96bcdb97e7e2..909dc9764a22 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -371,7 +371,6 @@ int amdgpu_gem_object_open(struct drm_gem_object *obj,
 void amdgpu_gem_object_close(struct drm_gem_object *obj,
 			     struct drm_file *file_priv);
 unsigned long amdgpu_gem_timeout(uint64_t timeout_ns);
-struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj);
 struct drm_gem_object *
 amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
 				 struct dma_buf_attachment *attach,

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index e6709362994a..e32dcdf8d3db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -886,7 +886,6 @@ static struct drm_driver kms_driver = {
 	.gem_prime_export = amdgpu_gem_prime_export,
 	.gem_prime_import = amdgpu_gem_prime_import,
 	.gem_prime_res_obj = amdgpu_gem_prime_res_obj,
-	.gem_prime_get_sg_table = amdgpu_gem_prime_get_sg_table,
 	.gem_prime_import_sg_table = amdgpu_gem_prime_import_sg_table,
 	.gem_prime_vmap = amdgpu_gem_prime_vmap,
 	.gem_prime_vunmap = amdgpu_gem_prime_vunmap,

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 48e0115d4f76..d5db5955a70a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -31,6 +31,7 @@
  */
 #include
 #include
+#include
 #include
 #include
 #include
@@ -931,6 +932,9 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
 	amdgpu_bo_kunmap(abo);
 
+	if (abo->gem_base.dma_buf)
+		dma_buf_invalidate_mappings(abo->gem_base.dma_buf);
+
 	/* remember the eviction */
 	if (evict)
 		atomic64_inc(&adev->num_evictions);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
index 1c9991738477..f575fb46d7a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c
@@ -32,13 +32,9 @@
 
 static const struct dma_buf_ops amdgpu_dmabuf_ops;
 
-struct sg_table *amdgpu_gem_prime_get_sg_table(struct drm_gem_object *obj)
-{
-	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
-	int npages = bo->tbo.num_pages;
-
-	return drm_prime_pages_to_sg(bo->tbo.ttm->pages, npages);
-}
+static void amdgpu_gem_unmap_dma_buf(struct dma_buf_attachment *attach,
+				     struct sg_table *sgt,
+				     enum dma_data_direction dir);
 
 void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj)
 {
@@ -126,22 +122,21 @@ amdgpu_gem_prime_import_sg_table(struct drm_device *dev,
 	return ERR_PTR(ret);
 }
 
-static int amdgpu_gem_map_attach(struct dma_buf *dma_buf,
-				 struct device *target_dev,
-				 struct dma_buf_attachment *attach)
+static struct sg_table *
+amdgpu_gem_map_dma_buf(struct dma_buf_attachment *attach,
+		       enum dma_data_direction dir)
 {
+	struct dma_buf *dma_buf = attach->dmabuf;
 	struct drm_gem_object *obj = dma_buf->priv;
 	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+	struct sg_table *sgt;
 	long r;
 
-	r = drm_gem_map_attach(dma_buf, target_dev, attach);
-	if (r)
-		return r;
-
-	r = amdgpu_bo_reserve(bo, false);
-	if (unlikely(r != 0))
-		goto error_detach;
-
+	if (!attach->invalidate_mappings) {
+		r = amdgpu_bo_reserve(bo, false);
+		if (unlikely(r != 0))
+			return ERR_PTR(r);
+	}
 
 	if (dma_buf->ops != &amdgpu_dmabuf_ops) {
 		/*
@@ -157,41 +152,80 @@ static int amdgpu_gem_map_attach(struct dma_buf *dma_buf,
 		}
 	}
 
-	/* pin buffer into GTT */
-	r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL);
-	if (r)
-		goto error_unreserve;
+	if (!attach->invalidate_mappings || true) {
+		/* pin buffer into GTT */
+		r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT, NULL);
+		if (r)
+			goto error_unreserve;
+
+	} else {
[PATCH 4/4] drm/amdgpu: add independent DMA-buf import
Instead of relying on the DRM functions just implement our own import functions. This adds support for taking care of unpinned DMA-buf. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c | 64 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 32 +--- 2 files changed, 80 insertions(+), 16 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index f575fb46d7a8..7963ce329519 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -298,22 +298,64 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_device *dev, return buf; } +static void +amdgpu_gem_prime_invalidate_mappings(struct dma_buf_attachment *attach) +{ + struct ttm_operation_ctx ctx = { false, false }; + struct drm_gem_object *obj = attach->priv; + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + int r; + + amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_CPU); + r = ttm_bo_validate(>tbo, >placement, ); + if (r) + DRM_ERROR("Failed to unmap DMA-buf import (%d))\n", r); +} + struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, struct dma_buf *dma_buf) { + struct dma_buf_attachment *attach; struct drm_gem_object *obj; + int ret; - if (dma_buf->ops == _dmabuf_ops) { - obj = dma_buf->priv; - if (obj->dev == dev) { - /* -* Importing dmabuf exported from out own gem increases -* refcount on gem itself instead of f_count of dmabuf. -*/ - drm_gem_object_get(obj); - return obj; - } + if (dma_buf->ops != _dmabuf_ops) + return drm_gem_prime_import(dev, dma_buf); + + obj = dma_buf->priv; + if (obj->dev == dev) { + /* +* Importing dmabuf exported from out own gem increases +* refcount on gem itself instead of f_count of dmabuf. 
+*/ + drm_gem_object_get(obj); + return obj; + } + + /* +* Attach, but don't map other amdgpu BOs +*/ + attach = dma_buf_attach(dma_buf, dev->dev); + if (IS_ERR(attach)) + return ERR_CAST(attach); + + get_dma_buf(dma_buf); + + obj = amdgpu_gem_prime_import_sg_table(dev, attach, NULL); + if (IS_ERR(obj)) { + ret = PTR_ERR(obj); + goto fail_detach; } - return drm_gem_prime_import(dev, dma_buf); + obj->import_attach = attach; + attach->invalidate_mappings = amdgpu_gem_prime_invalidate_mappings; + attach->priv = obj; + + return obj; + +fail_detach: + dma_buf_detach(dma_buf, attach); + dma_buf_put(dma_buf); + + return ERR_PTR(ret); } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 291dd3d600cd..aeead0281e92 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "amdgpu.h" #include "amdgpu_object.h" #include "amdgpu_trace.h" @@ -685,6 +686,7 @@ struct amdgpu_ttm_gup_task_list { struct amdgpu_ttm_tt { struct ttm_dma_tt ttm; + struct amdgpu_bo*bo; u64 offset; uint64_tuserptr; struct mm_struct*usermm; @@ -993,6 +995,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo, return NULL; } gtt->ttm.ttm.func = _backend_func; + gtt->bo = ttm_to_amdgpu_bo(bo); if (ttm_sg_tt_init(>ttm, bo, page_flags)) { kfree(gtt); return NULL; @@ -1005,7 +1008,6 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm, { struct amdgpu_device *adev = amdgpu_ttm_adev(ttm->bdev); struct amdgpu_ttm_tt *gtt = (void *)ttm; - bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG); if (gtt && gtt->userptr) { ttm->sg = kzalloc(sizeof(struct sg_table), GFP_KERNEL); @@ -1017,7 +1019,19 @@ static int amdgpu_ttm_tt_populate(struct ttm_tt *ttm, return 0; } - if (slave && ttm->sg) { + if (ttm->page_flags & TTM_PAGE_FLAG_SG) { + if (!ttm->sg) { + struct dma_buf_attachment *attach; + struct sg_table *sgt; + + attach = 
gtt->bo->gem_base.import_attach; + sgt
RFC: unpinned DMA-buf exporting
This set of patches adds an optional invalidate_mappings callback to each DMA-buf attachment, which can be filled in by the importer. This callback allows the exporter to provide the DMA-buf content without pinning it. The reservation object's lock acts as the synchronization point between buffer moves and the creation of mappings. The set includes an implementation for amdgpu which should be rather easily portable to other DRM drivers. Please comment, Christian.
[PATCH 2/4] drm/ttm: keep a reference to transfer pipelined BOs
Make sure the transferred BO is never destroyed before the transfer BO. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/ttm/ttm_bo_util.c | 50 +++ 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 1f730b3f18e5..1ee20558ee31 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -39,6 +39,11 @@ #include #include +struct ttm_transfer_obj { + struct ttm_buffer_object base; + struct ttm_buffer_object *bo; +}; + void ttm_bo_free_old_node(struct ttm_buffer_object *bo) { ttm_bo_mem_put(bo, &bo->mem); } @@ -435,7 +440,11 @@ EXPORT_SYMBOL(ttm_bo_move_memcpy); static void ttm_transfered_destroy(struct ttm_buffer_object *bo) { - kfree(bo); + struct ttm_transfer_obj *fbo; + + fbo = container_of(bo, struct ttm_transfer_obj, base); + ttm_bo_unref(&fbo->bo); + kfree(fbo); } /** @@ -456,14 +465,15 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo) static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, struct ttm_buffer_object **new_obj) { - struct ttm_buffer_object *fbo; + struct ttm_transfer_obj *fbo; int ret; fbo = kmalloc(sizeof(*fbo), GFP_KERNEL); if (!fbo) return -ENOMEM; - *fbo = *bo; + fbo->base = *bo; + fbo->bo = ttm_bo_reference(bo); /** * Fix up members that we shouldn't copy directly: @@ -471,25 +481,25 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, */ atomic_inc(&bo->bdev->glob->bo_count); - INIT_LIST_HEAD(&fbo->ddestroy); - INIT_LIST_HEAD(&fbo->lru); - INIT_LIST_HEAD(&fbo->swap); - INIT_LIST_HEAD(&fbo->io_reserve_lru); - mutex_init(&fbo->wu_mutex); - fbo->moving = NULL; - drm_vma_node_reset(&fbo->vma_node); - atomic_set(&fbo->cpu_writers, 0); - - kref_init(&fbo->list_kref); - kref_init(&fbo->kref); - fbo->destroy = &ttm_transfered_destroy; - fbo->acc_size = 0; - fbo->resv = &fbo->ttm_resv; - reservation_object_init(fbo->resv); - ret = reservation_object_trylock(fbo->resv); + INIT_LIST_HEAD(&fbo->base.ddestroy); + INIT_LIST_HEAD(&fbo->base.lru);
+ INIT_LIST_HEAD(&fbo->base.swap); + INIT_LIST_HEAD(&fbo->base.io_reserve_lru); + mutex_init(&fbo->base.wu_mutex); + fbo->base.moving = NULL; + drm_vma_node_reset(&fbo->base.vma_node); + atomic_set(&fbo->base.cpu_writers, 0); + + kref_init(&fbo->base.list_kref); + kref_init(&fbo->base.kref); + fbo->base.destroy = &ttm_transfered_destroy; + fbo->base.acc_size = 0; + fbo->base.resv = &fbo->base.ttm_resv; + reservation_object_init(fbo->base.resv); + ret = reservation_object_trylock(fbo->base.resv); WARN_ON(!ret); - *new_obj = fbo; + *new_obj = &fbo->base; return 0; } -- 2.14.1
[PATCH 1/4] dma-buf: add optional invalidate_mappings callback
Each importer can now provide an invalidate_mappings callback. This allows the exporter to provide the mappings without the need to pin the backing store. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/dma-buf.c | 25 + include/linux/dma-buf.h | 36 2 files changed, 61 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index d78d5fc173dc..ed8d5844ae74 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -629,6 +629,9 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, might_sleep(); + if (attach->invalidate_mappings) + reservation_object_assert_held(attach->dmabuf->resv); + if (WARN_ON(!attach || !attach->dmabuf)) return ERR_PTR(-EINVAL); @@ -656,6 +659,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, { might_sleep(); + if (attach->invalidate_mappings) + reservation_object_assert_held(attach->dmabuf->resv); + if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) return; @@ -664,6 +670,25 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, } EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment); +/** + * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf + * + * @dmabuf:[in]buffer whose mappings should be invalidated + * + * Informs all attachments that they need to destroy and recreate all their + * mappings.
+ */ +void dma_buf_invalidate_mappings(struct dma_buf *dmabuf) +{ + struct dma_buf_attachment *attach; + + reservation_object_assert_held(dmabuf->resv); + + list_for_each_entry(attach, &dmabuf->attachments, node) + attach->invalidate_mappings(attach); +} +EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings); + /** * DOC: cpu access * diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 085db2fee2d7..c1e2f7d93509 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -91,6 +91,18 @@ struct dma_buf_ops { */ void (*detach)(struct dma_buf *, struct dma_buf_attachment *); + /** +* @supports_mapping_invalidation: +* +* True for exporters which support unpinned DMA-buf operation using +* the reservation lock. +* +* When attachment->invalidate_mappings is set the @map_dma_buf and +* @unmap_dma_buf callbacks can be called with the reservation lock +* held. +*/ + bool supports_mapping_invalidation; + /** * @map_dma_buf: * @@ -326,6 +338,29 @@ struct dma_buf_attachment { struct device *dev; struct list_head node; void *priv; + + /** +* @invalidate_mappings: +* +* Optional callback provided by the importer of the attachment which +* must be set before mappings are created. +* +* If provided the exporter can avoid pinning the backing store while +* mappings exist. +* +* The function is called with the lock of the reservation object +* associated with the dma_buf held and the mapping function must be +* called with this lock held as well. This makes sure that no mapping +* is created concurrently with an ongoing invalidation. +* +* After the callback all existing mappings are still valid until all +* fences in the dma_buf's reservation object are signaled, but should be +* destroyed by the importer as soon as possible. +* +* New mappings can be created immediately, but can't be used before the +* exclusive fence in the dma_buf's reservation object is signaled.
+*/ + void (*invalidate_mappings)(struct dma_buf_attachment *attach); }; /** @@ -391,6 +426,7 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *, enum dma_data_direction); void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *, enum dma_data_direction); +void dma_buf_invalidate_mappings(struct dma_buf *dma_buf); int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, enum dma_data_direction dir); int dma_buf_end_cpu_access(struct dma_buf *dma_buf, -- 2.14.1
Re: [PATCH 1/3] dma-buf: make returning the exclusive fence optional
Ping? Daniel you requested the patch with its user. Would be nice when I can commit this cause we need it for debugging and cleaning up a bunch of other things as well. Regards, Christian. Am 12.01.2018 um 10:47 schrieb Christian König: Change reservation_object_get_fences_rcu to make the exclusive fence pointer optional. If not specified the exclusive fence is put into the fence array as well. This is helpful for a couple of cases where we need all fences in a single array. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/reservation.c | 31 ++- 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index b759a569b7b8..461afa9febd4 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -374,8 +374,9 @@ EXPORT_SYMBOL(reservation_object_copy_fences); * @pshared: the array of shared fence ptrs returned (array is krealloc'd to * the required size, and must be freed by caller) * - * RETURNS - * Zero or -errno + * Retrieve all fences from the reservation object. If the pointer for the + * exclusive fence is not specified the fence is put into the array of the + * shared fences as well. Returns either zero or -ENOMEM. 
*/ int reservation_object_get_fences_rcu(struct reservation_object *obj, struct dma_fence **pfence_excl, @@ -389,8 +390,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, do { struct reservation_object_list *fobj; - unsigned seq; - unsigned int i; + unsigned int i, seq; + size_t sz = 0; shared_count = i = 0; @@ -402,9 +403,14 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, goto unlock; fobj = rcu_dereference(obj->fence); - if (fobj) { + if (fobj) + sz += sizeof(*shared) * fobj->shared_max; + + if (!pfence_excl && fence_excl) + sz += sizeof(*shared); + + if (sz) { struct dma_fence **nshared; - size_t sz = sizeof(*shared) * fobj->shared_max; nshared = krealloc(shared, sz, GFP_NOWAIT | __GFP_NOWARN); @@ -420,13 +426,19 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, break; } shared = nshared; - shared_count = fobj->shared_count; - + shared_count = fobj ? fobj->shared_count : 0; for (i = 0; i < shared_count; ++i) { shared[i] = rcu_dereference(fobj->shared[i]); if (!dma_fence_get_rcu(shared[i])) break; } + + if (!pfence_excl && fence_excl) { + shared[i] = fence_excl; + fence_excl = NULL; + ++i; + ++shared_count; + } } if (i != shared_count || read_seqcount_retry(>seq, seq)) { @@ -448,7 +460,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, *pshared_count = shared_count; *pshared = shared; - *pfence_excl = fence_excl; + if (pfence_excl) + *pfence_excl = fence_excl; return ret; }
[PATCH 2/3] drm/amdgpu: add amdgpu_pasid_free_delayed v2
Free up a pasid after all fences signaled. v2: also handle the case when we can't allocate a fence array. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 82 + drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h | 2 + 2 files changed, 84 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index 5248a3232aff..842caa5ed73b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -40,6 +40,12 @@ */ static DEFINE_IDA(amdgpu_pasid_ida); +/* Helper to free pasid from a fence callback */ +struct amdgpu_pasid_cb { + struct dma_fence_cb cb; + unsigned int pasid; +}; + /** * amdgpu_pasid_alloc - Allocate a PASID * @bits: Maximum width of the PASID in bits, must be at least 1 @@ -75,6 +81,82 @@ void amdgpu_pasid_free(unsigned int pasid) ida_simple_remove(_pasid_ida, pasid); } +static void amdgpu_pasid_free_cb(struct dma_fence *fence, +struct dma_fence_cb *_cb) +{ + struct amdgpu_pasid_cb *cb = + container_of(_cb, struct amdgpu_pasid_cb, cb); + + amdgpu_pasid_free(cb->pasid); + dma_fence_put(fence); + kfree(cb); +} + +/** + * amdgpu_pasid_free_delayed - free pasid when fences signal + * + * @resv: reservation object with the fences to wait for + * @pasid: pasid to free + * + * Free the pasid only after all the fences in resv are signaled. 
+ */ +void amdgpu_pasid_free_delayed(struct reservation_object *resv, + unsigned int pasid) +{ + struct dma_fence *fence, **fences; + struct amdgpu_pasid_cb *cb; + unsigned count; + int r; + + r = reservation_object_get_fences_rcu(resv, NULL, , ); + if (r) + goto fallback; + + if (count == 0) { + amdgpu_pasid_free(pasid); + return; + } + + if (count == 1) { + fence = fences[0]; + kfree(fences); + } else { + uint64_t context = dma_fence_context_alloc(1); + struct dma_fence_array *array; + + array = dma_fence_array_create(count, fences, context, + 1, false); + if (!array) { + kfree(fences); + goto fallback; + } + fence = >base; + } + + cb = kmalloc(sizeof(*cb), GFP_KERNEL); + if (!cb) { + /* Last resort when we are OOM */ + dma_fence_wait(fence, false); + dma_fence_put(fence); + amdgpu_pasid_free(pasid); + } else { + cb->pasid = pasid; + if (dma_fence_add_callback(fence, >cb, + amdgpu_pasid_free_cb)) + amdgpu_pasid_free_cb(fence, >cb); + } + + return; + +fallback: + /* Not enough memory for the delayed delete, as last resort +* block for all the fences to complete. +*/ + reservation_object_wait_timeout_rcu(resv, true, false, + MAX_SCHEDULE_TIMEOUT); + amdgpu_pasid_free(pasid); +} + /* * VMID manager * diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h index ad931fa570b3..38f37c16fc5e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h @@ -69,6 +69,8 @@ struct amdgpu_vmid_mgr { int amdgpu_pasid_alloc(unsigned int bits); void amdgpu_pasid_free(unsigned int pasid); +void amdgpu_pasid_free_delayed(struct reservation_object *resv, + unsigned int pasid); bool amdgpu_vmid_had_gpu_reset(struct amdgpu_device *adev, struct amdgpu_vmid *id); -- 2.14.1
[PATCH 1/3] dma-buf: make returning the exclusive fence optional
Change reservation_object_get_fences_rcu to make the exclusive fence pointer optional. If not specified the exclusive fence is put into the fence array as well. This is helpful for a couple of cases where we need all fences in a single array. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/reservation.c | 31 ++- 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index b759a569b7b8..461afa9febd4 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -374,8 +374,9 @@ EXPORT_SYMBOL(reservation_object_copy_fences); * @pshared: the array of shared fence ptrs returned (array is krealloc'd to * the required size, and must be freed by caller) * - * RETURNS - * Zero or -errno + * Retrieve all fences from the reservation object. If the pointer for the + * exclusive fence is not specified the fence is put into the array of the + * shared fences as well. Returns either zero or -ENOMEM. */ int reservation_object_get_fences_rcu(struct reservation_object *obj, struct dma_fence **pfence_excl, @@ -389,8 +390,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, do { struct reservation_object_list *fobj; - unsigned seq; - unsigned int i; + unsigned int i, seq; + size_t sz = 0; shared_count = i = 0; @@ -402,9 +403,14 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, goto unlock; fobj = rcu_dereference(obj->fence); - if (fobj) { + if (fobj) + sz += sizeof(*shared) * fobj->shared_max; + + if (!pfence_excl && fence_excl) + sz += sizeof(*shared); + + if (sz) { struct dma_fence **nshared; - size_t sz = sizeof(*shared) * fobj->shared_max; nshared = krealloc(shared, sz, GFP_NOWAIT | __GFP_NOWARN); @@ -420,13 +426,19 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, break; } shared = nshared; - shared_count = fobj->shared_count; - + shared_count = fobj ? 
fobj->shared_count : 0; for (i = 0; i < shared_count; ++i) { shared[i] = rcu_dereference(fobj->shared[i]); if (!dma_fence_get_rcu(shared[i])) break; } + + if (!pfence_excl && fence_excl) { + shared[i] = fence_excl; + fence_excl = NULL; + ++i; + ++shared_count; + } } if (i != shared_count || read_seqcount_retry(>seq, seq)) { @@ -448,7 +460,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, *pshared_count = shared_count; *pshared = shared; - *pfence_excl = fence_excl; + if (pfence_excl) + *pfence_excl = fence_excl; return ret; } -- 2.14.1
[PATCH 3/3] drm/amdgpu: always allocate a PASID for each VM v2
Start to always allocate a pasid for each VM. v2: use dev_warn when we run out of PASIDs Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 43 ++--- 1 file changed, 29 insertions(+), 14 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 5773a581761b..a108b30d8186 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -805,7 +805,7 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) { struct amdgpu_device *adev = dev->dev_private; struct amdgpu_fpriv *fpriv; - int r; + int r, pasid; file_priv->driver_priv = NULL; @@ -819,28 +819,25 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) goto out_suspend; } - r = amdgpu_vm_init(adev, >vm, - AMDGPU_VM_CONTEXT_GFX, 0); - if (r) { - kfree(fpriv); - goto out_suspend; + pasid = amdgpu_pasid_alloc(16); + if (pasid < 0) { + dev_warn(adev->dev, "No more PASIDs available!"); + pasid = 0; } + r = amdgpu_vm_init(adev, >vm, AMDGPU_VM_CONTEXT_GFX, pasid); + if (r) + goto error_pasid; fpriv->prt_va = amdgpu_vm_bo_add(adev, >vm, NULL); if (!fpriv->prt_va) { r = -ENOMEM; - amdgpu_vm_fini(adev, >vm); - kfree(fpriv); - goto out_suspend; + goto error_vm; } if (amdgpu_sriov_vf(adev)) { r = amdgpu_map_static_csa(adev, >vm, >csa_va); - if (r) { - amdgpu_vm_fini(adev, >vm); - kfree(fpriv); - goto out_suspend; - } + if (r) + goto error_vm; } mutex_init(>bo_list_lock); @@ -849,6 +846,16 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv) amdgpu_ctx_mgr_init(>ctx_mgr); file_priv->driver_priv = fpriv; + goto out_suspend; + +error_vm: + amdgpu_vm_fini(adev, >vm); + +error_pasid: + if (pasid) + amdgpu_pasid_free(pasid); + + kfree(fpriv); out_suspend: pm_runtime_mark_last_busy(dev->dev); @@ -871,6 +878,8 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev, struct amdgpu_device *adev = 
dev->dev_private; struct amdgpu_fpriv *fpriv = file_priv->driver_priv; struct amdgpu_bo_list *list; + struct amdgpu_bo *pd; + unsigned int pasid; int handle; if (!fpriv) @@ -895,7 +904,13 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev, amdgpu_bo_unreserve(adev->virt.csa_obj); } + pasid = fpriv->vm.pasid; + pd = amdgpu_bo_ref(fpriv->vm.root.base.bo); + amdgpu_vm_fini(adev, >vm); + if (pasid) + amdgpu_pasid_free_delayed(pd->tbo.resv, pasid); + amdgpu_bo_unref(); idr_for_each_entry(>bo_list_handles, list, handle) amdgpu_bo_list_free(list); -- 2.14.1
Re: [Linaro-mm-sig] [PATCH] dma-buf: add some lockdep asserts to the reservation object implementation
Yeah, somehow missed that one. The patch looks mostly good, except for reservation_object_get_excl(). For that one an RCU protection is usually sufficient, so annotating it with reservation_object_assert_held() sounds incorrect to me. Regards, Christian. Am 11.01.2018 um 11:43 schrieb Lucas Stach: Did this fall through the cracks over the holidays? It really has made my work much easier while reworking some of the reservation object handling in etnaviv and I think it might benefit others. Regards, Lucas Am Freitag, den 01.12.2017, 12:12 +0100 schrieb Lucas Stach: This adds lockdep asserts to the reservation functions which state in their documentation that obj->lock must be held. Allows builds with PROVE_LOCKING enabled to check that the locking requirements are met. Signed-off-by: Lucas Stach--- drivers/dma-buf/reservation.c | 8 include/linux/reservation.h | 2 ++ 2 files changed, 10 insertions(+) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index b44d9d7db347..accd398e2ea6 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -71,6 +71,8 @@ int reservation_object_reserve_shared(struct reservation_object *obj) struct reservation_object_list *fobj, *old; u32 max; + reservation_object_assert_held(obj); + old = reservation_object_get_list(obj); if (old && old->shared_max) { @@ -211,6 +213,8 @@ void reservation_object_add_shared_fence(struct reservation_object *obj, { struct reservation_object_list *old, *fobj = obj->staged; + reservation_object_assert_held(obj); + old = reservation_object_get_list(obj); obj->staged = NULL; @@ -236,6 +240,8 @@ void reservation_object_add_excl_fence(struct reservation_object *obj, struct reservation_object_list *old; u32 i = 0; + reservation_object_assert_held(obj); + old = reservation_object_get_list(obj); if (old) i = old->shared_count; @@ -276,6 +282,8 @@ int reservation_object_copy_fences(struct reservation_object *dst, size_t size; unsigned i; + 
reservation_object_assert_held(dst); + rcu_read_lock(); src_list = rcu_dereference(src->fence); diff --git a/include/linux/reservation.h b/include/linux/reservation.h index 21fc84d82d41..55e7318800fd 100644 --- a/include/linux/reservation.h +++ b/include/linux/reservation.h @@ -212,6 +212,8 @@ reservation_object_unlock(struct reservation_object *obj) static inline struct dma_fence * reservation_object_get_excl(struct reservation_object *obj) { + reservation_object_assert_held(obj); + return rcu_dereference_protected(obj->fence_excl, reservation_object_held(obj)); } ___ dri-devel mailing list dri-de...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Linaro-mm-sig] [PATCH] dma-buf: make returning the exclusive fence optional
Am 10.01.2018 um 14:21 schrieb Daniel Vetter: On Wed, Jan 10, 2018 at 01:53:41PM +0100, Christian König wrote: Change reservation_object_get_fences_rcu to make the exclusive fence pointer optional. If not specified the exclusive fence is put into the fence array as well. This is helpful for a couple of cases where we need all fences in a single array. Signed-off-by: Christian König <christian.koe...@amd.com> Seeing the use-case for this would be a lot more interesting ... Yeah, sorry the use case is a 20 patches set on amd-gfx. Didn't wanted to post all those here as well. Christian. -Daniel --- drivers/dma-buf/reservation.c | 31 ++- 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index b759a569b7b8..461afa9febd4 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -374,8 +374,9 @@ EXPORT_SYMBOL(reservation_object_copy_fences); * @pshared: the array of shared fence ptrs returned (array is krealloc'd to * the required size, and must be freed by caller) * - * RETURNS - * Zero or -errno + * Retrieve all fences from the reservation object. If the pointer for the + * exclusive fence is not specified the fence is put into the array of the + * shared fences as well. Returns either zero or -ENOMEM. 
*/ int reservation_object_get_fences_rcu(struct reservation_object *obj, struct dma_fence **pfence_excl, @@ -389,8 +390,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, do { struct reservation_object_list *fobj; - unsigned seq; - unsigned int i; + unsigned int i, seq; + size_t sz = 0; shared_count = i = 0; @@ -402,9 +403,14 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, goto unlock; fobj = rcu_dereference(obj->fence); - if (fobj) { + if (fobj) + sz += sizeof(*shared) * fobj->shared_max; + + if (!pfence_excl && fence_excl) + sz += sizeof(*shared); + + if (sz) { struct dma_fence **nshared; - size_t sz = sizeof(*shared) * fobj->shared_max; nshared = krealloc(shared, sz, GFP_NOWAIT | __GFP_NOWARN); @@ -420,13 +426,19 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, break; } shared = nshared; - shared_count = fobj->shared_count; - + shared_count = fobj ? fobj->shared_count : 0; for (i = 0; i < shared_count; ++i) { shared[i] = rcu_dereference(fobj->shared[i]); if (!dma_fence_get_rcu(shared[i])) break; } + + if (!pfence_excl && fence_excl) { + shared[i] = fence_excl; + fence_excl = NULL; + ++i; + ++shared_count; + } } if (i != shared_count || read_seqcount_retry(>seq, seq)) { @@ -448,7 +460,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, *pshared_count = shared_count; *pshared = shared; - *pfence_excl = fence_excl; + if (pfence_excl) + *pfence_excl = fence_excl; return ret; } -- 2.14.1 ___ Linaro-mm-sig mailing list linaro-mm-...@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
[PATCH] dma-buf: make returning the exclusive fence optional
Change reservation_object_get_fences_rcu to make the exclusive fence pointer optional. If not specified the exclusive fence is put into the fence array as well. This is helpful for a couple of cases where we need all fences in a single array. Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/reservation.c | 31 ++- 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index b759a569b7b8..461afa9febd4 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -374,8 +374,9 @@ EXPORT_SYMBOL(reservation_object_copy_fences); * @pshared: the array of shared fence ptrs returned (array is krealloc'd to * the required size, and must be freed by caller) * - * RETURNS - * Zero or -errno + * Retrieve all fences from the reservation object. If the pointer for the + * exclusive fence is not specified the fence is put into the array of the + * shared fences as well. Returns either zero or -ENOMEM. */ int reservation_object_get_fences_rcu(struct reservation_object *obj, struct dma_fence **pfence_excl, @@ -389,8 +390,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, do { struct reservation_object_list *fobj; - unsigned seq; - unsigned int i; + unsigned int i, seq; + size_t sz = 0; shared_count = i = 0; @@ -402,9 +403,14 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, goto unlock; fobj = rcu_dereference(obj->fence); - if (fobj) { + if (fobj) + sz += sizeof(*shared) * fobj->shared_max; + + if (!pfence_excl && fence_excl) + sz += sizeof(*shared); + + if (sz) { struct dma_fence **nshared; - size_t sz = sizeof(*shared) * fobj->shared_max; nshared = krealloc(shared, sz, GFP_NOWAIT | __GFP_NOWARN); @@ -420,13 +426,19 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, break; } shared = nshared; - shared_count = fobj->shared_count; - + shared_count = fobj ? 
fobj->shared_count : 0; for (i = 0; i < shared_count; ++i) { shared[i] = rcu_dereference(fobj->shared[i]); if (!dma_fence_get_rcu(shared[i])) break; } + + if (!pfence_excl && fence_excl) { + shared[i] = fence_excl; + fence_excl = NULL; + ++i; + ++shared_count; + } } if (i != shared_count || read_seqcount_retry(>seq, seq)) { @@ -448,7 +460,8 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj, *pshared_count = shared_count; *pshared = shared; - *pfence_excl = fence_excl; + if (pfence_excl) + *pfence_excl = fence_excl; return ret; } -- 2.14.1
Re: [PATCH 0/4] Backported amdgpu ttm deadlock fixes for 4.14
Am 01.12.2017 um 01:23 schrieb Lyude Paul: I haven't gone to see where it started, but as of late a good number of pretty nasty deadlock issues have appeared with the kernel. Easy reproduction recipe on a laptop with i915/amdgpu prime with lockdep enabled: DRI_PRIME=1 glxinfo Acked-by: Christian König <christian.koe...@amd.com> Thanks for taking care of this, Christian. Additionally, some more race conditions exist that I've managed to trigger with piglit and lockdep enabled after applying these patches: = WARNING: suspicious RCU usage 4.14.3Lyude-Test+ #2 Not tainted - ./include/linux/reservation.h:216 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ext_image_dma_b/27451: #0: (reservation_ww_class_mutex){+.+.}, at: [] ttm_bo_unref+0x9f/0x3c0 [ttm] stack backtrace: CPU: 0 PID: 27451 Comm: ext_image_dma_b Not tainted 4.14.3Lyude-Test+ #2 Hardware name: HP HP ZBook 15 G4/8275, BIOS P70 Ver. 01.02 06/09/2017 Call Trace: dump_stack+0x8e/0xce lockdep_rcu_suspicious+0xc5/0x100 reservation_object_copy_fences+0x292/0x2b0 ? ttm_bo_unref+0x9f/0x3c0 [ttm] ttm_bo_unref+0xbd/0x3c0 [ttm] amdgpu_bo_unref+0x2a/0x50 [amdgpu] amdgpu_gem_object_free+0x4b/0x50 [amdgpu] drm_gem_object_free+0x1f/0x40 [drm] drm_gem_object_put_unlocked+0x40/0xb0 [drm] drm_gem_object_handle_put_unlocked+0x6c/0xb0 [drm] drm_gem_object_release_handle+0x51/0x90 [drm] drm_gem_handle_delete+0x5e/0x90 [drm] ? drm_gem_handle_create+0x40/0x40 [drm] drm_gem_close_ioctl+0x20/0x30 [drm] drm_ioctl_kernel+0x5d/0xb0 [drm] drm_ioctl+0x2f7/0x3b0 [drm] ? drm_gem_handle_create+0x40/0x40 [drm] ? trace_hardirqs_on_caller+0xf4/0x190 ? trace_hardirqs_on+0xd/0x10 amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] do_vfs_ioctl+0x93/0x670 ? __fget+0x108/0x1f0 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 I've also added the relevant fixes for the issue mentioned above. 
Christian König (3): drm/ttm: fix ttm_bo_cleanup_refs_or_queue once more dma-buf: make reservation_object_copy_fences rcu save drm/amdgpu: reserve root PD while releasing it Michel Dänzer (1): drm/ttm: Always and only destroy bo->ttm_resv in ttm_bo_release_list drivers/dma-buf/reservation.c | 56 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 13 ++-- drivers/gpu/drm/ttm/ttm_bo.c | 43 +- 3 files changed, 74 insertions(+), 38 deletions(-) -- 2.14.3 ___ amd-gfx mailing list amd-...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] reservation: don't wait when timeout=0
Am 21.11.2017 um 16:58 schrieb Chris Wilson: Quoting Christian König (2017-11-21 15:49:55) Am 21.11.2017 um 15:59 schrieb Rob Clark: On Tue, Nov 21, 2017 at 9:38 AM, Chris Wilson <ch...@chris-wilson.co.uk> wrote: Quoting Rob Clark (2017-11-21 14:08:46) If we are testing if a reservation object's fences have been signaled with timeout=0 (non-blocking), we need to pass 0 for timeout to dma_fence_wait_timeout(). Plus bonus spelling correction. Signed-off-by: Rob Clark <robdcl...@gmail.com> --- drivers/dma-buf/reservation.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index dec3a815455d..71f51140a9ad 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -420,7 +420,7 @@ EXPORT_SYMBOL_GPL(reservation_object_get_fences_rcu); * * RETURNS * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or - * greater than zer on success. + * greater than zero on success. */ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, bool wait_all, bool intr, @@ -483,7 +483,14 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, goto retry; } - ret = dma_fence_wait_timeout(fence, intr, ret); + /* +* Note that dma_fence_wait_timeout() will return 1 if +* the fence is already signaled, so in the wait_all +* case when we go through the retry loop again, ret +* will be greater than 0 and we don't want this to +* cause _wait_timeout() to block +*/ + ret = dma_fence_wait_timeout(fence, intr, timeout ? ret : 0); One should ask if we should just fix the interface to stop returning incorrect results (stop "correcting" a completion with 0 jiffies remaining as 1). A timeout can be distinguished by -ETIME (or your pick of errno). perhaps -EBUSY, if we go that route (although maybe it should be a follow-on patch, this one is suitable for backport to stable/lts if one should so choose..) 
I think the current approach was chosen to match schedule_timeout() and other such functions that take a timeout in jiffies. Not making a judgement on whether that is a good or bad reason. We intentionally switched away from that to be in sync with the wait_event_* interface. Returning 1 when a function with a zero timeout succeeds is actually quite common in the kernel. We actually had this conversation last time, and outside of that it isn't. Looking at all the convolution to first return 1, then undo the damage in the caller, it looks pretty silly. I don't find that very intuitive either, but you would also have to handle the error code in the calling function as well. And it is what the whole kernel does all over the place with its wait_event_* and scheduler timeouts as well. Regards, Christian. -Chris
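The `timeout ? ret : 0` clamp being debated can be modeled in plain user-space C. fake_fence_wait_timeout() and poll_two_fences() below are hypothetical stand-ins for the kernel functions, just to show why feeding ret back unclamped would turn a non-blocking poll into a blocking wait:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the jiffies convention: a successful wait returns the
 * remaining timeout, but by convention returns 1 (not 0) when called
 * with a zero timeout and the fence is already signaled, so that 0
 * keeps meaning "timed out". */
static long fake_fence_wait_timeout(bool signaled, long timeout)
{
	if (signaled)
		return timeout ? timeout : 1;
	return 0;
}

/* The bug Rob's patch fixes, modeled: ret starts at 1 even when
 * timeout == 0, so naively passing ret as the next timeout in the
 * wait_all retry loop would block. Clamping with (timeout ? ret : 0)
 * preserves the poll semantics across iterations. */
static long poll_two_fences(bool f1, bool f2, long timeout)
{
	long ret = timeout ? timeout : 1;

	ret = fake_fence_wait_timeout(f1, timeout ? ret : 0);
	if (ret <= 0)
		return ret;
	return fake_fence_wait_timeout(f2, timeout ? ret : 0);
}
```

With timeout = 0 the model returns > 0 only if every fence is already signaled, and 0 otherwise, without ever blocking, which is exactly the non-blocking test the patch wants.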
Re: [PATCH] reservation: don't wait when timeout=0
Am 21.11.2017 um 15:59 schrieb Rob Clark: On Tue, Nov 21, 2017 at 9:38 AM, Chris Wilson wrote: Quoting Rob Clark (2017-11-21 14:08:46) If we are testing if a reservation object's fences have been signaled with timeout=0 (non-blocking), we need to pass 0 for timeout to dma_fence_wait_timeout(). Plus bonus spelling correction. Signed-off-by: Rob Clark --- drivers/dma-buf/reservation.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index dec3a815455d..71f51140a9ad 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -420,7 +420,7 @@ EXPORT_SYMBOL_GPL(reservation_object_get_fences_rcu); * * RETURNS * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or - * greater than zer on success. + * greater than zero on success. */ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, bool wait_all, bool intr, @@ -483,7 +483,14 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, goto retry; } - ret = dma_fence_wait_timeout(fence, intr, ret); + /* +* Note that dma_fence_wait_timeout() will return 1 if +* the fence is already signaled, so in the wait_all +* case when we go through the retry loop again, ret +* will be greater than 0 and we don't want this to +* cause _wait_timeout() to block +*/ + ret = dma_fence_wait_timeout(fence, intr, timeout ? ret : 0); One should ask if we should just fix the interface to stop returning incorrect results (stop "correcting" a completion with 0 jiffies remaining as 1). A timeout can be distinguished by -ETIME (or your pick of errno). perhaps -EBUSY, if we go that route (although maybe it should be a follow-on patch, this one is suitable for backport to stable/lts if one should so choose..) I think current approach was chosen to match schedule_timeout() and other such functions that take a timeout in jiffies. Not making a judgement on whether that is a good or bad reason..
We intentionally switched away from that to be in sync with the wait_event_* interface. Returning 1 when a function with a zero timeout succeeds is actually quite common in the kernel. Regards, Christian. BR, -R -Chris ___ dri-devel mailing list dri-de...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] reservation: don't wait when timeout=0
Am 21.11.2017 um 15:08 schrieb Rob Clark: If we are testing if a reservation object's fences have been signaled with timeout=0 (non-blocking), we need to pass 0 for timeout to dma_fence_wait_timeout(). Plus bonus spelling correction. Signed-off-by: Rob Clark <robdcl...@gmail.com> Reviewed-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/reservation.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index dec3a815455d..71f51140a9ad 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -420,7 +420,7 @@ EXPORT_SYMBOL_GPL(reservation_object_get_fences_rcu); * * RETURNS * Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or - * greater than zer on success. + * greater than zero on success. */ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, bool wait_all, bool intr, @@ -483,7 +483,14 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, goto retry; } - ret = dma_fence_wait_timeout(fence, intr, ret); + /* +* Note that dma_fence_wait_timeout() will return 1 if +* the fence is already signaled, so in the wait_all +* case when we go through the retry loop again, ret +* will be greater than 0 and we don't want this to +* cause _wait_timeout() to block +*/ + ret = dma_fence_wait_timeout(fence, intr, timeout ? ret : 0); dma_fence_put(fence); if (ret > 0 && wait_all && (i + 1 < shared_count)) goto retry;
Re: [PATCH 0/4] dma-buf: Silence dma_fence __rcu sparse warnings
Patch #4 is Reviewed-by: Christian König <christian.koe...@amd.com>. The rest is Acked-by: Christian König <christian.koe...@amd.com>. Regards, Christian. Am 02.11.2017 um 21:03 schrieb Ville Syrjala: From: Ville Syrjälä <ville.syrj...@linux.intel.com> When building drm+i915 I get around 150 lines of sparse noise from dma_fence __rcu warnings. This series eliminates all of that. The first two patches were already posted by Chris, but there wasn't any real reaction, so I figured I'd repost with a wider Cc list. As for the other two patches, I'm no expert on dma_fence and I didn't spend a lot of time looking at it so I can't be sure I annotated all the accesses correctly. But I figured someone will scream at me if I got it wrong ;) Cc: Dave Airlie <airl...@redhat.com> Cc: Jason Ekstrand <ja...@jlekstrand.net> Cc: linaro-mm-...@lists.linaro.org Cc: linux-media@vger.kernel.org Cc: Alex Deucher <alexander.deuc...@amd.com> Cc: Christian König <christian.koe...@amd.com> Cc: Sumit Semwal <sumit.sem...@linaro.org> Cc: Chris Wilson <ch...@chris-wilson.co.uk> Chris Wilson (2): drm/syncobj: Mark up the fence as an RCU protected pointer dma-buf/fence: Sparse wants __rcu on the object itself Ville Syrjälä (2): drm/syncobj: Use proper methods for accessing rcu protected pointers dma-buf: Use rcu_assign_pointer() to set rcu protected pointers drivers/dma-buf/reservation.c | 2 +- drivers/gpu/drm/drm_syncobj.c | 11 +++ include/drm/drm_syncobj.h | 2 +- include/linux/dma-fence.h | 2 +- 4 files changed, 10 insertions(+), 7 deletions(-)
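For readers unfamiliar with why the rcu_assign_pointer() conversions in this series matter beyond silencing sparse: the helper implies a release barrier, so the pointed-to data is ordered before the pointer store. A rough user-space sketch of that publish pattern, using C11 atomics in place of the kernel primitives (all names hypothetical):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of what rcu_assign_pointer() guarantees: the object is
 * fully initialized before the pointer is published (release store),
 * so a reader that sees the pointer via an acquire load (standing in
 * for rcu_dereference()) also sees the initialized data. */
struct payload {
	int data;
};

static struct payload *_Atomic slot;	/* the __rcu-annotated pointer */

static void publish(struct payload *p, int value)
{
	p->data = value;			/* initialize first... */
	atomic_store_explicit(&slot, p,		/* ...then publish */
			      memory_order_release);
}

static int read_published(void)
{
	struct payload *p = atomic_load_explicit(&slot,
						 memory_order_acquire);
	return p ? p->data : -1;		/* -1: nothing published */
}
```

A plain assignment compiles to the same thing on x86 but carries no ordering guarantee on weakly ordered architectures, which is why sparse flags direct stores to __rcu pointers.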
Re: [PATCH] dma-buf: Cleanup comments on dma_buf_map_attachment()
Am 01.11.2017 um 15:06 schrieb Liviu Dudau: Mappings need to be unmapped by calling dma_buf_unmap_attachment() and not by calling again dma_buf_map_attachment(). Also fix some spelling mistakes. Signed-off-by: Liviu Dudau <liviu.du...@arm.com> Reviewed-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/dma-buf.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index bc1cb284111cb..1792385405f0e 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -351,13 +351,13 @@ static inline int is_dma_buf_file(struct file *file) * * 2. Userspace passes this file-descriptors to all drivers it wants this buffer *to share with: First the filedescriptor is converted to a _buf using - *dma_buf_get(). The the buffer is attached to the device using + *dma_buf_get(). Then the buffer is attached to the device using *dma_buf_attach(). * *Up to this stage the exporter is still free to migrate or reallocate the *backing storage. * - * 3. Once the buffer is attached to all devices userspace can inniate DMA + * 3. Once the buffer is attached to all devices userspace can initiate DMA *access to the shared buffer. In the kernel this is done by calling *dma_buf_map_attachment() and dma_buf_unmap_attachment(). * @@ -617,7 +617,7 @@ EXPORT_SYMBOL_GPL(dma_buf_detach); * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR * on error. May return -EINTR if it is interrupted by a signal. * - * A mapping must be unmapped again using dma_buf_map_attachment(). Note that + * A mapping must be unmapped by using dma_buf_unmap_attachment(). Note that * the underlying backing storage is pinned for as long as a mapping exists, * therefore users/importers should not hold onto a mapping for undue amounts of * time.
Re: [PATCH] dma-fence: fix dma_fence_get_rcu_safe
Am 20.09.2017 um 20:20 schrieb Daniel Vetter: On Mon, Sep 11, 2017 at 01:06:32PM +0200, Christian König wrote: Am 11.09.2017 um 12:01 schrieb Chris Wilson: [SNIP] Yeah, but that is illegal with fence objects. When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single(). Many, many months ago I sent patches to fix them all. Found those after a bit of searching. Yeah, those patches were proposed more than a year ago, but never pushed upstream. Not sure if we really should go this way. dma_fence objects are shared between drivers and since we can't judge if it's the correct fence based on a criterion in the object (only the read counter which is outside) all drivers need to be correct for this. I would rather go the way and change dma_fence_release() to wrap fence->ops->release into call_rcu() to keep the whole RCU handling outside of the individual drivers. Hm, I entirely dropped the ball on this, I kinda assumed that we managed to get some agreement on this between i915 and dma_fence. Adding a pile more people. For the meantime I've sent a v2 of this patch to fix at least the buggy return of NULL when we fail to grab the RCU reference but keeping the extra checking for now. Can I get an rb on this please so that we fix at least the bug at hand? Thanks, Christian. Joonas, Tvrtko, I guess we need to fix this one way or the other. -Daniel
[PATCH] dma-fence: fix dma_fence_get_rcu_safe v2
From: Christian König <christian.koe...@amd.com> When dma_fence_get_rcu() fails to acquire a reference it doesn't necessarily mean that there is no fence at all. It usually means that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL. v2: Keep extra check after dma_fence_get_rcu(). Signed-off-by: Christian König <christian.koe...@amd.com> Cc: Chris Wilson <ch...@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vet...@ffwll.ch> Cc: Sumit Semwal <sumit.sem...@linaro.org> Cc: linux-media@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linaro-mm-...@lists.linaro.org --- include/linux/dma-fence.h | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 0a186c4..f4f23cb 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -248,9 +248,12 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep) struct dma_fence *fence; fence = rcu_dereference(*fencep); - if (!fence || !dma_fence_get_rcu(fence)) + if (!fence) return NULL; + if (!dma_fence_get_rcu(fence)) + continue; + /* The atomic_inc_not_zero() inside dma_fence_get_rcu() * provides a full memory barrier upon success (such as now). * This is paired with the write barrier from assigning -- 2.7.4
Re: [PATCH] dma-fence: fix dma_fence_get_rcu_safe
Am 11.09.2017 um 12:01 schrieb Chris Wilson: [SNIP] Yeah, but that is illegal with fence objects. When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single(). Many, many months ago I sent patches to fix them all. Found those after a bit of searching. Yeah, those patches were proposed more than a year ago, but never pushed upstream. Not sure if we really should go this way. dma_fence objects are shared between drivers and since we can't judge if it's the correct fence based on a criterion in the object (only the read counter which is outside) all drivers need to be correct for this. I would rather go the way and change dma_fence_release() to wrap fence->ops->release into call_rcu() to keep the whole RCU handling outside of the individual drivers. Regards, Christian.
Re: [PATCH] dma-fence: fix dma_fence_get_rcu_safe
Am 11.09.2017 um 11:23 schrieb Chris Wilson: Quoting Christian König (2017-09-11 10:06:50) Am 11.09.2017 um 10:59 schrieb Chris Wilson: Quoting Christian König (2017-09-11 09:50:40) Sorry for the delayed response, but your mail somehow ended up in the Spam folder. Am 04.09.2017 um 15:40 schrieb Chris Wilson: Quoting Christian König (2017-09-04 14:27:33) From: Christian König <christian.koe...@amd.com> The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all. It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL. Which is not guaranteed by the code you wrote either. The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence. You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence. You are completely missing the point here. It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid. So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for. Not quite. We can grab a reference on a fence that was already freed and reused between the rcu_dereference() and dma_fence_get_rcu(). Reusing a memory structure before the RCU grace period is completed is illegal, otherwise the whole RCU approach won't work. RCU only protects that the pointer remains valid. If you use SLAB_TYPESAFE_BY_RCU, it is possible to reuse the pointer within a grace period. 
It does happen and that is the point the comment is trying to make. Yeah, but that is illegal with fence objects. When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single(). Because all of them rely on dma_fence_get() to return NULL when the fence isn't valid any more to restart the operation. When dma_fence_get_rcu() returns a reallocated fence the operation wouldn't correctly restart and the end result would most likely not be correct at all. Using SLAB_TYPESAFE_BY_RCU is only valid if you can ensure that you have the right object using a second criterion and that is not the case with fences. Regards, Christian.
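Chris's SLAB_TYPESAFE_BY_RCU point can be sketched in user space: after a successful reference grab, a pointer re-check detects whether the slot was republished while we raced, which is what the comment removed by the patch describes. This is a hypothetical model built on C11 atomics, not the kernel helpers:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Model of the re-check guarding against type-stable reuse: after
 * taking a reference, verify the slot still points at the object we
 * grabbed; if not, we may hold a recycled object and must drop it
 * and retry. */
struct robj {
	atomic_int ref;
};

/* Like dma_fence_get_rcu(): take a reference only if one is held. */
static int robj_tryget(struct robj *o)
{
	int v = atomic_load(&o->ref);

	while (v > 0)
		if (atomic_compare_exchange_weak(&o->ref, &v, v + 1))
			return 1;
	return 0;	/* refcount already hit zero */
}

static void robj_put(struct robj *o)
{
	atomic_fetch_sub(&o->ref, 1);
}

static struct robj *get_checked(struct robj *_Atomic *slot)
{
	for (;;) {
		struct robj *o = atomic_load(slot);

		if (!o)
			return NULL;
		if (!robj_tryget(o))
			continue;	/* dying object: retry the slot */
		if (o == atomic_load(slot))
			return o;	/* still current: safe handoff */
		robj_put(o);		/* possibly recycled: drop, retry */
	}
}
```

With a type-stable allocator the tryget can succeed on a recycled object; only the final pointer comparison tells the two cases apart, which is Chris's argument for keeping the check.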
Re: [PATCH] dma-fence: fix dma_fence_get_rcu_safe
Am 11.09.2017 um 10:59 schrieb Chris Wilson: Quoting Christian König (2017-09-11 09:50:40) Sorry for the delayed response, but your mail somehow ended up in the Spam folder. Am 04.09.2017 um 15:40 schrieb Chris Wilson: Quoting Christian König (2017-09-04 14:27:33) From: Christian König <christian.koe...@amd.com> The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all. It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL. Which is not guaranteed by the code you wrote either. The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence. You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence. You are completely missing the point here. It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid. So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for. Not quite. We can grab a reference on a fence that was already freed and reused between the rcu_dereference() and dma_fence_get_rcu(). Reusing a memory structure before the RCU grace period is completed is illegal, otherwise the whole RCU approach won't work. So that case can't happen. Regards, Christian. -Chris
Re: [PATCH] dma-fence: fix dma_fence_get_rcu_safe
Sorry for the delayed response, but your mail somehow ended up in the Spam folder. Am 04.09.2017 um 15:40 schrieb Chris Wilson: Quoting Christian König (2017-09-04 14:27:33) From: Christian König <christian.koe...@amd.com> The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all. It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL. Which is not guaranteed by the code you wrote either. The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence. You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence. You are completely missing the point here. It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid. So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for. See the usual life of a fence* variable looks like this: 1. assigning pointer to fence A; 2. assigning pointer to fence B; 3. assigning pointer to fence C; When dma_fence_get_rcu_safe() is called between step #1 and step #2 for example it is perfectly valid to just return either fence A or fence B. But it is invalid to return NULL because that suggests that we don't need to sync at all. Regards, Christian.
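The "return *ANY* published fence" semantics Christian argues for look roughly like this in a user-space model (fake_fence and the helpers are hypothetical; the kernel versions use kref and RCU):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Model: grab a reference on whatever fence is currently published
 * in the slot. If the refcount already dropped to zero, the fence is
 * being torn down and the slot is expected to be republished with a
 * newer fence, so retry instead of returning NULL. */
struct fake_fence {
	atomic_int refcount;
};

/* Like dma_fence_get_rcu(): succeed only if a reference is held. */
static int fence_tryget(struct fake_fence *f)
{
	int v = atomic_load(&f->refcount);

	while (v > 0)
		if (atomic_compare_exchange_weak(&f->refcount, &v, v + 1))
			return 1;
	return 0;
}

/* Returns NULL only when the slot itself is empty, never because we
 * raced with fence A being replaced by fence B: either A or B is a
 * valid answer, since both order after the point of the call. */
static struct fake_fence *fence_get_safe(struct fake_fence *_Atomic *slot)
{
	for (;;) {
		struct fake_fence *f = atomic_load(slot);

		if (!f)
			return NULL;
		if (fence_tryget(f))
			return f;
		/* Dying fence: loop and pick up its replacement. */
	}
}
```

The thread's disagreement is about what this loop must additionally do when the allocator is SLAB_TYPESAFE_BY_RCU; the model above assumes memory is not recycled within the race window.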
Re: [PATCH] dma-fence: fix dma_fence_get_rcu_safe
I really wonder what's wrong with my mail client, but it looks like this patch never made it at least to dri-devel. Forwarding manually now, Christian. Am 04.09.2017 um 15:16 schrieb Christian König: From: Christian König <christian.koe...@amd.com> The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all. It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL. Signed-off-by: Christian König <christian.koe...@amd.com> Cc: Chris Wilson <ch...@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vet...@ffwll.ch> Cc: Sumit Semwal <sumit.sem...@linaro.org> Cc: linux-media@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linaro-mm-...@lists.linaro.org --- include/linux/dma-fence.h | 23 ++- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index a5195a7..37f3d67 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -246,27 +246,8 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep) struct dma_fence *fence; fence = rcu_dereference(*fencep); - if (!fence || !dma_fence_get_rcu(fence)) - return NULL; - - /* The atomic_inc_not_zero() inside dma_fence_get_rcu() -* provides a full memory barrier upon success (such as now). -* This is paired with the write barrier from assigning -* to the __rcu protected fence pointer so that if that -* pointer still matches the current fence, we know we -* have successfully acquire a reference to it. If it no -* longer matches, we are holding a reference to some other -* reallocated pointer. This is possible if the allocator -* is using a freelist like SLAB_TYPESAFE_BY_RCU where the -* fence remains valid for the RCU grace period, but it -* may be reallocated. 
When using such allocators, we are -* responsible for ensuring the reference we get is to -* the right fence, as below. -*/ - if (fence == rcu_access_pointer(*fencep)) - return rcu_pointer_handoff(fence); - - dma_fence_put(fence); + if (!fence || dma_fence_get_rcu(fence)) + return fence; } while (1); }
[PATCH] dma-fence: fix dma_fence_get_rcu_safe
From: Christian König <christian.koe...@amd.com> The logic is buggy and unnecessarily complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessarily mean that there is no fence at all. It usually means that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL. Signed-off-by: Christian König <christian.koe...@amd.com> Cc: Chris Wilson <ch...@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vet...@ffwll.ch> Cc: Sumit Semwal <sumit.sem...@linaro.org> Cc: linux-media@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linaro-mm-...@lists.linaro.org --- include/linux/dma-fence.h | 23 ++- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index a5195a7..37f3d67 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -246,27 +246,8 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep) struct dma_fence *fence; fence = rcu_dereference(*fencep); - if (!fence || !dma_fence_get_rcu(fence)) - return NULL; - - /* The atomic_inc_not_zero() inside dma_fence_get_rcu() -* provides a full memory barrier upon success (such as now). -* This is paired with the write barrier from assigning -* to the __rcu protected fence pointer so that if that -* pointer still matches the current fence, we know we -* have successfully acquire a reference to it. If it no -* longer matches, we are holding a reference to some other -* reallocated pointer. This is possible if the allocator -* is using a freelist like SLAB_TYPESAFE_BY_RCU where the -* fence remains valid for the RCU grace period, but it -* may be reallocated. When using such allocators, we are -* responsible for ensuring the reference we get is to -* the right fence, as below. -*/ - if (fence == rcu_access_pointer(*fencep)) - return rcu_pointer_handoff(fence); - - dma_fence_put(fence); + if (!fence || dma_fence_get_rcu(fence)) + return fence; } while (1); } -- 2.7.4
Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly v2
Ping, what do you guys think of that? Am 25.07.2017 um 15:35 schrieb Christian König: From: Christian König <christian.koe...@amd.com> With hardware resets in mind it is possible that all shared fences are signaled, but the exlusive isn't. Fix waiting for everything in this situation. v2: make sure we always wait for the exclusive fence Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/reservation.c | 33 +++-- 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index 393817e..9d4316d 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -373,12 +373,25 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, long ret = timeout ? timeout : 1; retry: - fence = NULL; shared_count = 0; seq = read_seqcount_begin(>seq); rcu_read_lock(); - if (wait_all) { + fence = rcu_dereference(obj->fence_excl); + if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, >flags)) { + if (!dma_fence_get_rcu(fence)) + goto unlock_retry; + + if (dma_fence_is_signaled(fence)) { + dma_fence_put(fence); + fence = NULL; + } + + } else { + fence = NULL; + } + + if (!fence && wait_all) { struct reservation_object_list *fobj = rcu_dereference(obj->fence); @@ -405,22 +418,6 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, } } - if (!shared_count) { - struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl); - - if (fence_excl && - !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, - _excl->flags)) { - if (!dma_fence_get_rcu(fence_excl)) - goto unlock_retry; - - if (dma_fence_is_signaled(fence_excl)) - dma_fence_put(fence_excl); - else - fence = fence_excl; - } - } - rcu_read_unlock(); if (fence) { if (read_seqcount_retry(>seq, seq)) {
[PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly v2
From: Christian König <christian.koe...@amd.com> With hardware resets in mind it is possible that all shared fences are signaled, but the exclusive isn't. Fix waiting for everything in this situation. v2: make sure we always wait for the exclusive fence Signed-off-by: Christian König <christian.koe...@amd.com> --- drivers/dma-buf/reservation.c | 33 +++-- 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index 393817e..9d4316d 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -373,12 +373,25 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, long ret = timeout ? timeout : 1; retry: - fence = NULL; shared_count = 0; seq = read_seqcount_begin(&obj->seq); rcu_read_lock(); - if (wait_all) { + fence = rcu_dereference(obj->fence_excl); + if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { + if (!dma_fence_get_rcu(fence)) + goto unlock_retry; + + if (dma_fence_is_signaled(fence)) { + dma_fence_put(fence); + fence = NULL; + } + + } else { + fence = NULL; + } + + if (!fence && wait_all) { struct reservation_object_list *fobj = rcu_dereference(obj->fence); @@ -405,22 +418,6 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj, } } - if (!shared_count) { - struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl); - - if (fence_excl && - !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, - &fence_excl->flags)) { - if (!dma_fence_get_rcu(fence_excl)) - goto unlock_retry; - - if (dma_fence_is_signaled(fence_excl)) - dma_fence_put(fence_excl); - else - fence = fence_excl; - } - } - rcu_read_unlock(); if (fence) { if (read_seqcount_retry(&obj->seq, seq)) { -- 2.7.4
Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
Am 24.07.2017 um 13:57 schrieb Daniel Vetter: On Mon, Jul 24, 2017 at 11:51 AM, Christian König <deathsim...@vodafone.de> wrote: Am 24.07.2017 um 10:33 schrieb Daniel Vetter: On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote: From: Christian König <christian.koe...@amd.com> With hardware resets in mind it is possible that all shared fences are signaled, but the exclusive isn't. Fix waiting for everything in this situation. How did you end up with both shared and exclusive fences on the same reservation object? At least I thought the point of exclusive was that it's exclusive (and has an implicit barrier on all previous shared fences). Same for shared fences, they need to wait for the exclusive one (and replace it). Is this fallout from the amdgpu trickery where by default you do all shared fences? I thought we've aligned semantics a while back ... No, that is perfectly normal even for other drivers. Take a look at the reservation code. The exclusive fence replaces all shared fences, but adding a shared fence doesn't replace the exclusive fence. That actually makes sense, because when you want to add more shared fences those need to wait for the last exclusive fence as well. Hm right. Now normally I would agree that when you have shared fences it is sufficient to wait for all of them, because those operations can't start before the exclusive one finishes. But with GPU reset and/or the ability to abort already submitted operations it is perfectly possible that you end up with an exclusive fence which isn't signaled and a shared fence which is signaled in the same reservation object. How does that work? The batch(es) with the shared fence are all supposed to wait for the exclusive fence before they start, which means even if you gpu reset and restart/cancel certain things, they shouldn't be able to complete out of order. Assume the following: 1. The exclusive fence is some move operation by the kernel which executes on a DMA engine. 2.
The shared fence is a 3D operation submitted by userspace which executes on the 3D engine. Now we found the 3D engine to be hung and needs a reset, all currently submitted jobs are aborted, marked with an error code and their fences put into the signaled state. Since we only reset the 3D engine, the move operation (fortunately) isn't affected by this. I think this applies to all drivers and isn't something amdgpu specific. Regards, Christian. If you outright cancel a fence then you're supposed to first call dma_fence_set_error(-EIO) and then complete it. Note that atm that part might be slightly overengineered and I'm not sure about how we expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or soon, has been) used by i915 for it's internal book-keeping, which might not be the best to leak to other consumers. But completing fences (at least exported ones, where userspace or other drivers can get at them) shouldn't be possible. -Daniel
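The failure mode Christian describes can be sketched in a few lines of plain C. This is a user-space model with made-up names (`struct resv`, `all_idle_buggy`, `all_idle_fixed`), not the kernel API; it only illustrates why the exclusive fence must be checked even when shared fences are present:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fence { bool signaled; int error; };

struct resv {
    struct fence *excl;        /* exclusive fence, may be NULL */
    struct fence *shared[4];
    size_t shared_count;
};

/* Buggy variant: mirrors the pre-patch logic, which only looked at the
 * exclusive fence when there were no shared fences at all. */
static bool all_idle_buggy(const struct resv *r)
{
    for (size_t i = 0; i < r->shared_count; i++)
        if (!r->shared[i]->signaled)
            return false;
    if (r->shared_count == 0 && r->excl && !r->excl->signaled)
        return false;
    return true;
}

/* Fixed variant: the exclusive fence is checked unconditionally, because
 * a GPU reset can signal the shared fences (with an error) while the
 * exclusive DMA move is still in flight. */
static bool all_idle_fixed(const struct resv *r)
{
    for (size_t i = 0; i < r->shared_count; i++)
        if (!r->shared[i]->signaled)
            return false;
    if (r->excl && !r->excl->signaled)
        return false;
    return true;
}
```

With a pending exclusive "move" fence and an aborted-and-signaled shared "3D job" fence, the buggy variant wrongly reports the object as idle while the fixed one does not.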
Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
On 24.07.2017 at 10:34, zhoucm1 wrote:

On July 22, 2017 at 00:20, Christian König wrote:

From: Christian König <christian.koe...@amd.com>

With hardware resets in mind it is possible that all shared fences are signaled, but the exclusive one isn't. Fix waiting for everything in this situation.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/reservation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index e2eff86..ce3f9c1 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
 		}
 	}
 
-	if (!shared_count) {
+	if (!fence) {

The previous code seems to be a bug: the exclusive fence isn't waited on at all if shared_count != 0. With your fix there is still a case where the exclusive fence can be skipped: when fobj->shared[shared_count-1] isn't signalled.

Yeah, indeed, that looks like it needs to be fixed as well. I'm still completely jet-lagged and need to work through tons of stuff from last week. Do you have time to take care of fixing up this patch and sending a v2?

Thanks in advance,
Christian.

Regards,
David Zhou

 		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
 
 		if (fence_excl &&
Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
On 24.07.2017 at 10:33, Daniel Vetter wrote:

On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:

From: Christian König <christian.koe...@amd.com>

With hardware resets in mind it is possible that all shared fences are signaled, but the exclusive one isn't. Fix waiting for everything in this situation.

How did you end up with both shared and exclusive fences on the same reservation object? At least I thought the point of exclusive was that it's exclusive (and has an implicit barrier on all previous shared fences). Same for shared fences: they need to wait for the exclusive one (and replace it). Is this fallout from the amdgpu trickery where by default you do all shared fences? I thought we'd aligned semantics a while back ...

No, that is perfectly normal, even for other drivers. Take a look at the reservation code: the exclusive fence replaces all shared fences, but adding a shared fence doesn't replace the exclusive fence. That actually makes sense, because when you want to add more shared fences those need to wait for the last exclusive fence as well.

Now normally I would agree that when you have shared fences it is sufficient to wait for all of them, because those operations can't start before the exclusive one finishes. But with GPU reset and/or the ability to abort already submitted operations it is perfectly possible that you end up with an exclusive fence which isn't signaled and a shared fence which is signaled in the same reservation object.

Christian.
-Daniel

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/reservation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index e2eff86..ce3f9c1 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
 		}
 	}
 
-	if (!shared_count) {
+	if (!fence) {
 		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
 
 		if (fence_excl &&
-- 
2.7.4
___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
[PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
From: Christian König <christian.koe...@amd.com>

With hardware resets in mind it is possible that all shared fences are signaled, but the exclusive one isn't. Fix waiting for everything in this situation.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/reservation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index e2eff86..ce3f9c1 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
 		}
 	}
 
-	if (!shared_count) {
+	if (!fence) {
 		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
 
 		if (fence_excl &&
-- 
2.7.4
Re: [PATCH] dma-buf: avoid scheduling on fence status query v2
On 26.04.2017 at 16:46, Andres Rodriguez wrote:

When a timeout of zero is specified, the caller is only interested in the fence status. In the current implementation, dma_fence_default_wait will always call schedule_timeout() at least once for an unsignaled fence. This adds a significant overhead to a fence status query.

Avoid this overhead by returning early if a zero timeout is specified.

v2: move early return after enable_signaling

Signed-off-by: Andres Rodriguez <andre...@gmail.com>

Reviewed-by: Christian König <christian.koe...@amd.com>

---

If I'm understanding correctly, I don't think we need to register the default wait callback. But if that isn't the case please let me know. This patch has the same perf improvements as v1.

 drivers/dma-buf/dma-fence.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 0918d3f..57da14c 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -402,6 +402,11 @@ dma_fence_default_wait(struct dma_fence *fence, bool intr, signed long timeout)
 		}
 	}
 
+	if (!timeout) {
+		ret = 0;
+		goto out;
+	}
+
 	cb.base.func = dma_fence_default_wait_cb;
 	cb.task = current;
 	list_add(&cb.base.node, &fence->cb_list);
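The ordering the v2 patch settles on — enable signalling first, only then take the zero-timeout early exit — can be modelled in user space. `fence_model` and `wait_model` are invented names for illustration; the real code paths are `dma_fence_default_wait()` and `dma_fence_is_signaled()`:

```c
#include <assert.h>
#include <stdbool.h>

struct fence_model {
    bool signaled;
    bool signaling_enabled;  /* stands in for DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT */
};

/* Mirrors the structure of dma_fence_default_wait() after the v2 patch:
 * returns the remaining timeout on success, 0 on timeout, and for a pure
 * status query (timeout == 0) returns 1 if already signaled. */
static long wait_model(struct fence_model *f, long timeout)
{
    long ret = timeout ? timeout : 1;

    if (f->signaled)
        return ret;

    /* Signalling must be enabled even when timeout == 0; otherwise a
     * fence that userspace only ever polls on would never signal. */
    f->signaling_enabled = true;

    if (!timeout)
        return 0;  /* status-only query: skip schedule_timeout() entirely */

    /* ... in the kernel: register a callback and sleep up to 'timeout' ... */
    return 0;
}
```

Note the zero-timeout return sits after the enable step, which is exactly the NAK Christian raised against v1.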
Re: [PATCH] dma-buf: avoid scheduling on fence status query
On 26.04.2017 at 11:59, Dave Airlie wrote:

On 26 April 2017 at 17:20, Christian König <deathsim...@vodafone.de> wrote:

NAK, I'm wondering how often I have to reject that change. We should probably add a comment here.

Even with a zero timeout we still need to enable signaling; otherwise some fences will never signal if userspace just polls on them. If a caller is only interested in the fence status without enabling the signaling it should call dma_fence_is_signaled() instead.

Can we not move the return 0 (with spin unlock) down to after we enable signalling, but before we enter the schedule_timeout(1)?

Yes, that would be an option.

Christian.

Dave.
Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions
On 25.04.2017 at 20:20, Logan Gunthorpe wrote:

This patch introduces functions which kmap the pages inside an sgl. These functions replace a common pattern of kmap(sg_page(sg)) that is used in more than 50 places within the kernel.

The motivation for this work is to eventually safely support sgls that contain io memory. In order for that to work, any access to the contents of an iomem SGL will need to be done with iomemcpy or hit some warning. (The exact details of how this will work have yet to be worked out.) Having all the kmaps in one place is just a first step in that direction. Additionally, seeing this helps cut down the users of sg_page, it should make any effort to go to struct-page-less DMAs a little easier (should that idea ever swing back into favour again).

A flags option is added to select between a regular or atomic mapping so these functions can replace kmap(sg_page or kmap_atomic(sg_page. Future work may expand this to have flags for using page_address or vmap. We include a flag to require the function not to fail, to support legacy code that has no easy error path. Much further in the future, there may be a flag to allocate memory and copy the data from/to iomem.

We also add the semantic that sg_map can fail to create a mapping, despite the fact that the current code this is replacing is assumed to never fail and the current version of these functions cannot fail. This is to support iomem, which may either have to fail to create the mapping or allocate memory as a bounce buffer, which itself can fail.

Also, in terms of cleanup, a few of the existing kmap(sg_page) users play things a bit loose in terms of whether they apply sg->offset, so using these helper functions should help avoid such issues.

Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---

Good to know that somebody is working on this. Those problems troubled us as well.

Patch is Acked-by: Christian König <christian.koe...@amd.com>.

Regards,
Christian.
 include/linux/scatterlist.h | 85 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..fad170b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/bug.h>
+#include <linux/highmem.h>
 #include <linux/mm.h>
 
 struct scatterlist {
@@ -126,6 +127,90 @@ static inline struct page *sg_page(struct scatterlist *sg)
 	return (struct page *)((sg)->page_link & ~0x3);
 }
 
+#define SG_KMAP		 (1 << 0)	/* create a mapping with kmap */
+#define SG_KMAP_ATOMIC	 (1 << 1)	/* create a mapping with kmap_atomic */
+#define SG_MAP_MUST_NOT_FAIL (1 << 2)	/* indicate sg_map should not fail */
+
+/**
+ * sg_map - kmap a page inside an sgl
+ * @sg:		SG entry
+ * @offset:	Offset into entry
+ * @flags:	Flags for creating the mapping
+ *
+ * Description:
+ *   Use this function to map a page in the scatterlist at the specified
+ *   offset. sg->offset is already added for you. Note: the semantics of
+ *   this function are that it may fail. Thus, its output should be checked
+ *   with IS_ERR and PTR_ERR. Otherwise, a pointer to the specified offset
+ *   in the mapped page is returned.
+ *
+ *   Flags can be any of:
+ *	* SG_KMAP		- Use kmap to create the mapping
+ *	* SG_KMAP_ATOMIC	- Use kmap_atomic to map the page atomically.
+ *				  Thus, the rules of that function apply: the
+ *				  CPU may not sleep until it is unmapped.
+ *	* SG_MAP_MUST_NOT_FAIL	- Indicate that sg_map must not fail.
+ *				  If it does, it will issue a BUG_ON instead.
+ *				  This is intended for legacy code only, it
+ *				  is not to be used in new code.
+ *
+ *   Also, consider carefully whether this function is appropriate. It is
+ *   largely not recommended for new code and if the sgl came from another
+ *   subsystem and you don't know what kind of memory might be in the list
+ *   then you definitely should not call it. Non-mappable memory may be in
+ *   the sgl and thus this function may fail unexpectedly. Consider using
+ *   sg_copy_to_buffer instead.
+ **/
+static inline void *sg_map(struct scatterlist *sg, size_t offset, int flags)
+{
+	struct page *pg;
+	unsigned int pg_off;
+	void *ret;
+
+	offset += sg->offset;
+	pg = nth_page(sg_page(sg), offset >> PAGE_SHIFT);
+	pg_off = offset_in_page(offset);
+
+	if (flags & SG_KMAP_ATOMIC)
+		ret = kmap_atomic(pg) + pg_off;
+	else if (flags & SG_KMAP)
+		ret = kmap(pg) + pg_off;
+	else
+		ret = ERR_PTR(-EINVAL);
+
+	/*
+	 * In theory, this can't happen yet. Once we start adding
+	 * unmappable memory, it also shouldn't happen unless
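The page/offset arithmetic at the top of `sg_map()` is exactly the kind of thing the patch says existing callers get wrong (forgetting `sg->offset`, or not handling an offset that crosses a page boundary). A user-space model of just that arithmetic, with an assumed 4 KiB page size and invented names (`model_sg_locate`):

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Given the index of an sgl entry's first page and a byte offset that
 * already includes sg->offset, compute which page gets mapped and the
 * offset within it -- the same split nth_page()/offset_in_page() do. */
static void model_sg_locate(size_t first_page, size_t offset,
                            size_t *page, size_t *pg_off)
{
    *page   = first_page + (offset >> MODEL_PAGE_SHIFT);
    *pg_off = offset & (MODEL_PAGE_SIZE - 1);
}
```

A caller that naively does `kmap(sg_page(sg)) + offset` would map the wrong page whenever `sg->offset + offset` exceeds `PAGE_SIZE`; the helper's split avoids that.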
Re: [PATCH] dma-buf: avoid scheduling on fence status query
NAK, I'm wondering how often I have to reject that change. We should probably add a comment here.

Even with a zero timeout we still need to enable signaling; otherwise some fences will never signal if userspace just polls on them. If a caller is only interested in the fence status without enabling the signaling it should call dma_fence_is_signaled() instead.

Regards,
Christian.

On 26.04.2017 at 04:50, Andres Rodriguez wrote:

CC a few extra lists I missed.

Regards,
Andres

On 2017-04-25 09:36 PM, Andres Rodriguez wrote:

When a timeout of zero is specified, the caller is only interested in the fence status. In the current implementation, dma_fence_default_wait will always call schedule_timeout() at least once for an unsignaled fence. This adds a significant overhead to a fence status query.

Avoid this overhead by returning early if a zero timeout is specified.

Signed-off-by: Andres Rodriguez
---

This heavily affects the performance of the Source2 engine running on radv. This patch improves dota2 (radv) perf on an i7-6700k + RX480 system from 72fps to 81fps.

 drivers/dma-buf/dma-fence.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 0918d3f..348e9e2 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -380,6 +380,9 @@ dma_fence_default_wait(struct dma_fence *fence, bool intr, signed long timeout)
 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
 		return ret;
 
+	if (!timeout)
+		return 0;
+
 	spin_lock_irqsave(fence->lock, flags);
 
 	if (intr && signal_pending(current)) {
Re: [PATCH] dma-buf: fence debugging
On 31.03.2017 at 12:00, Russell King wrote:

Add debugfs output to report shared and exclusive fences on a dma_buf object. This produces output such as:

Dma-buf Objects:
size		flags		mode		count		exp_name
08294400	00050005	drm
	Exclusive fence: etnaviv 134000.gpu signalled
	Attached Devices:
	gpu-subsystem
Total 1 devices attached

Total 1 objects, 8294400 bytes

Signed-off-by: Russell King <rmk+ker...@arm.linux.org.uk>
Reviewed-by: Christian König <christian.koe...@amd.com>
---
 drivers/dma-buf/dma-buf.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0007b792827b..f72aaacbe023 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1059,7 +1059,11 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 	int ret;
 	struct dma_buf *buf_obj;
 	struct dma_buf_attachment *attach_obj;
-	int count = 0, attach_count;
+	struct reservation_object *robj;
+	struct reservation_object_list *fobj;
+	struct dma_fence *fence;
+	unsigned seq;
+	int count = 0, attach_count, shared_count, i;
 	size_t size = 0;
 
 	ret = mutex_lock_interruptible(&db_list.lock);
@@ -1085,6 +1090,34 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 			file_count(buf_obj->file),
 			buf_obj->exp_name);
 
+		robj = buf_obj->resv;
+		while (true) {
+			seq = read_seqcount_begin(&robj->seq);
+			rcu_read_lock();
+			fobj = rcu_dereference(robj->fence);
+			shared_count = fobj ? fobj->shared_count : 0;
+			fence = rcu_dereference(robj->fence_excl);
+			if (!read_seqcount_retry(&robj->seq, seq))
+				break;
+			rcu_read_unlock();
+		}
+
+		if (fence)
+			seq_printf(s, "\tExclusive fence: %s %s %ssignalled\n",
+				   fence->ops->get_driver_name(fence),
+				   fence->ops->get_timeline_name(fence),
+				   dma_fence_is_signaled(fence) ? "" : "un");
+		for (i = 0; i < shared_count; i++) {
+			fence = rcu_dereference(fobj->shared[i]);
+			if (!dma_fence_get_rcu(fence))
+				continue;
+			seq_printf(s, "\tShared fence: %s %s %ssignalled\n",
+				   fence->ops->get_driver_name(fence),
+				   fence->ops->get_timeline_name(fence),
+				   dma_fence_is_signaled(fence) ? "" : "un");
+		}
+		rcu_read_unlock();
+
 		seq_puts(s, "\tAttached Devices:\n");
 		attach_count = 0;
Re: [PATCH] dma-buf: add support for compat ioctl
On 21.02.2017 at 15:55, Marek Szyprowski wrote:

Dear All,

On 2017-02-21 15:37, Marek Szyprowski wrote:

Hi Christian,

On 2017-02-21 14:59, Christian König wrote:

On 21.02.2017 at 14:21, Marek Szyprowski wrote:

Add compat ioctl support to dma-buf. This lets one use the DMA_BUF_IOCTL_SYNC ioctl from a 32-bit application on a 64-bit kernel. The data structures for the 32- and 64-bit modes are the same, so there is no need for an additional translation layer.

Well, I might be wrong, but IIRC compat_ioctl was just optional and if not specified unlocked_ioctl was called instead. If that is true your patch wouldn't have any effect at all.

Well, then why did I get -ENOTTY in the 32-bit test app for this ioctl on a 64-bit ARM64 kernel without this patch? I've checked fs/compat_ioctl.c: I see no fallback in COMPAT_SYSCALL_DEFINE3, so one has to provide a compat_ioctl callback to have the ioctl working with 32-bit apps.

Then my memory cheated on me. In this case the patch is Reviewed-by: Christian König <christian.koe...@amd.com>.

Regards,
Christian.

Best regards
Re: [PATCH] dma-buf: add support for compat ioctl
On 21.02.2017 at 14:21, Marek Szyprowski wrote:

Add compat ioctl support to dma-buf. This lets one use the DMA_BUF_IOCTL_SYNC ioctl from a 32-bit application on a 64-bit kernel. The data structures for the 32- and 64-bit modes are the same, so there is no need for an additional translation layer.

Well, I might be wrong, but IIRC compat_ioctl was just optional and if not specified unlocked_ioctl was called instead. If that is true your patch wouldn't have any effect at all.

Regards,
Christian.

Signed-off-by: Marek Szyprowski
---
 drivers/dma-buf/dma-buf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 718f832a5c71..0007b792827b 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -325,6 +325,9 @@ static long dma_buf_ioctl(struct file *file,
 	.llseek		= dma_buf_llseek,
 	.poll		= dma_buf_poll,
 	.unlocked_ioctl	= dma_buf_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= dma_buf_ioctl,
+#endif
 };
 
 /*
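The behaviour Marek measured — -ENOTTY from a 32-bit app until a compat_ioctl handler is wired up — follows from the compat syscall path having no fallback to unlocked_ioctl. A user-space model of that dispatch decision (`fops_model` and `model_dispatch` are invented names; the real logic at the time lived in fs/compat_ioctl.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <errno.h>

struct fops_model {
    bool has_unlocked_ioctl;
    bool has_compat_ioctl;
};

/* Returns 0 if the ioctl would be dispatched, -ENOTTY otherwise. */
static int model_dispatch(const struct fops_model *fops, bool compat_task)
{
    if (compat_task) {
        /* 32-bit task on a 64-bit kernel: only compat_ioctl is tried;
         * there is no automatic fallback to unlocked_ioctl. */
        return fops->has_compat_ioctl ? 0 : -ENOTTY;
    }
    return fops->has_unlocked_ioctl ? 0 : -ENOTTY;
}
```

Since the DMA_BUF_IOCTL_SYNC structures have the same layout in both modes, reusing dma_buf_ioctl as the compat handler (as the patch does) is all that is needed.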
Re: Enabling peer to peer device transactions for PCIe devices
On 12.01.2017 at 16:11, Jerome Glisse wrote:

On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:

On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:

On 06/01/17 11:26 AM, Jason Gunthorpe wrote:

Make a generic API for all of this and you'd have my vote.. IMHO, you must support basic pinning semantics - that is necessary to support generic short-lived DMA (eg filesystem, etc). That hardware can clearly do that if it can support ODP.

I agree completely. What we want is for RDMA, O_DIRECT, etc to just work with special VMAs (ie. at least those backed with ZONE_DEVICE memory). Then GPU/NVMe/DAX/whatever drivers can just hand these VMAs to userspace (using whatever interface is most appropriate) and userspace can do what it pleases with them. This makes _so_ much sense and actually largely already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to see something upstream that enables P2P, even if it is only the minimum viable useful functionality to begin with. I think aiming for the moon (which is what HMM and things like it are) is simply going to take more time, if they ever get there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or BARs on the NIC). I am even seeing users who now want to move data P2P between FPGAs and NVMe SSDs, and the upstream kernel should be able to support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and, while it is not an in-kernel API, it could have been modified to be one reasonably easily. As Logan states, the driver can then choose to pass the VMAs to user-space in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a proposal to discuss this topic, so if you are interested please respond to the email letting the committee know this topic is of interest to you [1].
Also earlier in the thread someone discussed the issues around the IOMMU. Given the known issues around P2P transfers in certain CPU root complexes [2], it might just be a case of only allowing P2P when a PCIe switch connects the two EPs. Another option is just to use CONFIG_EXPERT and make sure people are aware of the pitfalls if they invoke the P2P option.

iopmem is not applicable to GPUs. What I propose is to split the issue in two, so that everyone can reuse the part that needs to be common, namely the DMA API part, where you have to create an IOMMU mapping for one device to point to the other device's memory. We can have a DMA API that is agnostic to how the device memory is managed (so it does not matter whether the device memory has struct page or not). This is what I have been arguing in this thread. To make progress on this issue we need to stop conflating different use cases. So I say let's solve the IOMMU issue first and let everyone use it in their own way with their device. I do not think we can share much more than that.

Yeah, exactly what I said from the very beginning as well. Just hacking together quick solutions doesn't really solve the problem in the long term. What we need is proper adjustment of the DMA API towards handling of P2P, and then to build solutions for the different use cases on top of that.

We should also avoid falling into the trap of trying to just handle the existing get_user_pages and co. interfaces so that the existing code doesn't need to change. P2P needs to be validated for each use case individually and not implemented in workarounds with fingers crossed and hopes for the best.

Regards,
Christian.

Cheers,
Jérôme

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling peer to peer device transactions for PCIe devices
On 27.11.2016 at 15:02, Haggai Eran wrote:

On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:

On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:

Like you say below we have to handle short-lived in the usual way, and that covers basically every device except IB MRs, including the command queue on an NVMe drive.

Well, a problem which wasn't mentioned so far is that while GPUs do have a page table to mirror the CPU page table, they usually can't recover from page faults. So what we do is making sure that all memory accessed by the GPU jobs stays in place while those jobs run (pretty much the same pinning you do for the DMA).

Yes, it is DMA, so this is a valid approach. But you don't need page faults from the GPU to do proper coherent page table mirroring. Basically, when the driver submits the work to the GPU it 'faults' the pages into the CPU and mirror translation table (instead of pinning). Like in ODP, MMU notifiers/HMM are used to monitor for translation changes. If a change comes in, the GPU driver checks if an executing command is touching those pages and blocks the MMU notifier until the command flushes, then unfaults the page (blocking future commands) and unblocks the MMU notifier.

I think blocking MMU notifiers against something that is basically controlled by user-space can be problematic. This can block things like memory reclaim. If you have user-space access to the device's queues, user-space can block the MMU notifier forever.

Really good point. I think this means the bare minimum, if we don't have recoverable page faults, is to have preemption support like Felix described in his answer as well.

Going to keep that in mind,
Christian.