Re: [PATCH 1/1] mm/migrate: Trylock device page in do_swap_page
On 2024-09-20 17:23, Matthew Brost wrote: On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: On 2024-09-18 11:10, Alistair Popple wrote: Matthew Brost writes: On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: Matthew Brost writes: I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) Still a driver. Indeed, and I'm happy to answer any questions about our implementation. but theoretically it seems like it should be possible. However we serialize migrations of the same virtual address range to avoid these kind of issues as they can happen the other way too (ie. multiple threads trying to migrate to GPU). So I suspect what happens in UVM is that one thread wins and installs the migration entry while the others fail to get the driver migration lock and bail out sufficiently early in the fault path to avoid the live-lock. I had to try hard to show this, doubt an actual user could trigger this. I wrote a test which kicked 8 threads, each thread did a pthread join, and then tried to read the same page. This repeats in loop for like 512 pages or something. I needed an exclusive lock in migrate_to_ram vfunc for it to livelock. Without an exclusive lock I think on average I saw about 32k retries (i.e. migrate_to_ram calls on the same page) before a thread won this race. From reading UVM, pretty sure if you tried hard enough you could trigger a livelock given it appears you take excluvise locks in migrate_to_ram. Yes, I suspect you're correct here and that we just haven't tried hard enough to trigger it. Cc: Philip Yang Cc: Felix Kuehling Cc: Christian König Cc: Andrew Morton Suggessted-by: Simona Vetter Signed-off-by: Matthew Brost --- mm/memory.c | 13 +++--- mm/migrate_device.c | 60 +++-- 2 files changed, 50 insertions(+), 23 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3c01d68065be..bbd97d16a96a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Get a page reference while we know the page can't be * freed. */ - get_page(vmf->page); - pte_unmap_unlock(vmf->pte, vmf->ptl); - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); - put_page(vmf->page); + if (trylock_page(vmf->page)) { + get_page(vmf->page); + pte_unmap_unlock(vmf->pte, vmf->ptl); This is all beginning to look a lot like migrate_vma_collect_pmd(). So rather than do this and then have to pass all this context (ie. fault_page) down to the migrate_vma_* functions could we instead just do what migrate_vma_collect_pmd() does here? Ie. we already have the PTL and the page lock so there's no reason we couldn't just setup the migration entry prior to calling migrate_to_ram(). Obviously calling migrate_vma_setup() would show the page as not migrating, but drivers could easily just fill in the src_pfn info after calling migrate_vma_setup(). This would eliminate the whole fault_page ugliness. This seems like it would work and agree it likely be cleaner. Let me play around with this and see what I come up with. Multi-tasking a bit so expect a bit of delay here. Thanks for the input, Matt Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, Christian and myself in the loop with future patches in this area. Will do. Already have another local patch set which helps drivers dma map 2M pages for migrations if SRAM is physically contiguous. Seems helpful for performance on Intel hardware. Probably post that soon for early feedack. OK. Longer term I thinking 2M migration entries, 2M device pages, and being able to install 2M THP on VRAM -> SRAM could be really helpful. I'm finding migrate_vma_* functions take up like 80-90% of the time in the CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem ideal. Seems like 2M entries for everything would really help here. No idea how feasible this is as the core MM stuff gets confusing fast. Any input on this idea? I agree with your observations. We found that the migrate_vma_* code was the bottle neck for migration performance as well, and not breaking 2M pages could reduce that overhead a lot. I don't have any specific ideas. I'm not familiar with the details of that code myself. Philip has looked at this (and some old NVidia patches from a few years ago) in the past but never had enough uninterrupted time to make it past prototyping. Regards, Felix Matt Regards, Felix + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); +
Re: [PATCH 1/1] mm/migrate: Trylock device page in do_swap_page
On 2024-09-18 11:10, Alistair Popple wrote: Matthew Brost writes: On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: Matthew Brost writes: I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) Still a driver. Indeed, and I'm happy to answer any questions about our implementation. but theoretically it seems like it should be possible. However we serialize migrations of the same virtual address range to avoid these kind of issues as they can happen the other way too (ie. multiple threads trying to migrate to GPU). So I suspect what happens in UVM is that one thread wins and installs the migration entry while the others fail to get the driver migration lock and bail out sufficiently early in the fault path to avoid the live-lock. I had to try hard to show this, doubt an actual user could trigger this. I wrote a test which kicked 8 threads, each thread did a pthread join, and then tried to read the same page. This repeats in loop for like 512 pages or something. I needed an exclusive lock in migrate_to_ram vfunc for it to livelock. Without an exclusive lock I think on average I saw about 32k retries (i.e. migrate_to_ram calls on the same page) before a thread won this race. From reading UVM, pretty sure if you tried hard enough you could trigger a livelock given it appears you take excluvise locks in migrate_to_ram. Yes, I suspect you're correct here and that we just haven't tried hard enough to trigger it. Cc: Philip Yang Cc: Felix Kuehling Cc: Christian König Cc: Andrew Morton Suggessted-by: Simona Vetter Signed-off-by: Matthew Brost --- mm/memory.c | 13 +++--- mm/migrate_device.c | 60 +++-- 2 files changed, 50 insertions(+), 23 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3c01d68065be..bbd97d16a96a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Get a page reference while we know the page can't be * freed. */ - get_page(vmf->page); - pte_unmap_unlock(vmf->pte, vmf->ptl); - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); - put_page(vmf->page); + if (trylock_page(vmf->page)) { + get_page(vmf->page); + pte_unmap_unlock(vmf->pte, vmf->ptl); This is all beginning to look a lot like migrate_vma_collect_pmd(). So rather than do this and then have to pass all this context (ie. fault_page) down to the migrate_vma_* functions could we instead just do what migrate_vma_collect_pmd() does here? Ie. we already have the PTL and the page lock so there's no reason we couldn't just setup the migration entry prior to calling migrate_to_ram(). Obviously calling migrate_vma_setup() would show the page as not migrating, but drivers could easily just fill in the src_pfn info after calling migrate_vma_setup(). This would eliminate the whole fault_page ugliness. This seems like it would work and agree it likely be cleaner. Let me play around with this and see what I come up with. Multi-tasking a bit so expect a bit of delay here. Thanks for the input, Matt Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, Christian and myself in the loop with future patches in this area. Regards, Felix + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); + put_page(vmf->page); + unlock_page(vmf->page); + } else { + pte_unmap_unlock(vmf->pte, vmf->ptl); + } } else if (is_hwpoison_entry(entry)) { ret = VM_FAULT_HWPOISON; } else if (is_pte_marker_entry(entry)) { diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 6d66dc1c6ffa..049893a5a179 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, struct mm_walk *walk) { struct migrate_vma *migrate = walk->private; + struct folio *fault_folio = migrate->fault_page ? + page_folio(migrate->fault_page) : NULL; struct vm_area_struct *vma = walk->vma; struct mm_struct *mm = vma->vm_mm; unsigned long addr = start, unmapped = 0; @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, folio_get(folio); spin_unlock(ptl); - if (unlikely(!folio_trylock(folio))) + if (unlikely(fault_folio != folio && +
Re: [PATCH 2/4] amdgpu: fix a race in kfd_mem_export_dmabuf()
On 2024-08-12 02:59, Al Viro wrote: Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into descriptor table, only to have it looked up by file descriptor and remove it from descriptor table is not just too convoluted - it's racy; another thread might have modified the descriptor table while we'd been going through that song and dance. Switch kfd_mem_export_dmabuf() to using drm_gem_prime_handle_to_dmabuf() and leave the descriptor table alone... Signed-off-by: Al Viro This patch is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 11672bfe4fad..bc5401de2948 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,7 +25,6 @@ #include #include #include -#include #include #include @@ -818,18 +817,13 @@ static int kfd_mem_export_dmabuf(struct kgd_mem *mem) if (!mem->dmabuf) { struct amdgpu_device *bo_adev; struct dma_buf *dmabuf; - int r, fd; bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); - r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, + dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev, bo_adev->kfd.client.file, mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0, &fd); - if (r) - return r; - dmabuf = dma_buf_get(fd); - close_fd(fd); - if (WARN_ON_ONCE(IS_ERR(dmabuf))) + DRM_RDWR : 0); + if (IS_ERR(dmabuf)) return PTR_ERR(dmabuf); mem->dmabuf = dmabuf; }
Re: va range based memory management discussion (was: 回复:回复:Re:Proposal to add CRIU support to DRM render nodes)
On 2024-07-09 22:38, 周春明(日月) wrote: -- 发件人:Felix Kuehling 发送时间:2024年7月10日(星期三) 01:07 收件人:周春明(日月) ; Tvrtko Ursulin ; dri-devel@lists.freedesktop.org ; amd-...@lists.freedesktop.org ; Dave Airlie ; Daniel Vetter ; criu 抄 送:"Errabolu, Ramesh" ; "Christian König" 主 题:Re: 回复:Re:Proposal to add CRIU support to DRM render nodes On 2024-07-09 5:30, 周春明(日月) wrote: > > > > > > > -- > 发件人:Felix Kuehling > 发送时间:2024年7月9日(星期二) 06:40 > 收件人:周春明(日月) ; Tvrtko Ursulin ; dri-devel@lists.freedesktop.org ; amd-...@lists.freedesktop.org ; Dave Airlie ; Daniel Vetter ; criu > 抄 送:"Errabolu, Ramesh" ; "Christian König" > 主 题:Re: Re:Proposal to add CRIU support to DRM render nodes > > > On 2024-07-08 2:51, 周春明(日月) wrote: >> >>> Hi Felix, >>> >>> When I learn CRIU you introduced in https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>>> , there is a sentence >>> "ROCm manages memory in the form of buffer objects (BOs). We are also working on a new memory management API that will be based on virtual address ranges...", >>> Out of curious, how about "new memory management based on virtual address ranges"? Any introduction for that? >> >>>Hi David, >> >>>This refers to the SVM API that has been in the upstream driver for a while now: https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732 <https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732> <https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732>> >> >> [David] Can all ROCm runtime memory management switch to use svm apis? No need BOs any more? >I had thought about that when I started working on SVM years ago. But I came to the conclusion that we need to use BOs for VRAM to support DMABuf exports and imports to support P2P and IPC features. [David] OK, I guessed you would say DMABuf and IPC factors, if we don't use dmabuf (as you know, dmabuf isn't popular in compute area) and implement a new ipc based on va ranges, is that possible to using svm api to cover all ROCm memory management? When I tried memory pool used by cuda graph, seems that's OK. DMABuf and IPC are important for collective communications libraries used by distributed applications. You could get away without it when you're running a single-process application on a single machine. But changing all memory allocations to SVM would probably cause some performance regressions, because our BO allocators and memory mapping functions are simpler and easier to optimize than for unified memory. That leaves the question, what's the expected benefit or a compelling reason for making such an invasive change? Regards, Felix Thanks, -David >Regards, > Felix > > Thanks, > -David > > Regards, > Felix > > >> >> Thanks, >> -David >> >> -- >> 发件人:Felix Kuehling >> 发送时间:2024年5月3日(星期五) 22:44 >> 收件人:Tvrtko Ursulin ; dri-devel@lists.freedesktop.org ; amd-...@lists.freedesktop.org ; Dave Airlie ; Daniel Vetter ; criu >> 抄 送:"Errabolu, Ramesh" ; "Christian König" >> 主 题:Re: Proposal to add CRIU support to DRM render nodes >> >> >> >> On 2024-04-16 10:04, Tvrtko Ursulin wrote: >> > >> > On 01/04/2024 18:58, Felix Kuehling wrote: >> >> >> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote: >> >>> >> >>> On 01/04/2024 17:37, Felix Kuehling wrote: >> >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote: >> >>>>> >> >>>>> On 28/03/2024 20:42, Felix Kuehling wrote: >> >>>>>> >> >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote: >> >>>>>>> >> >>>>>>> Hi Felix, >> >>>>>>> >> >>&
Re: 回复:Re:Proposal to add CRIU support to DRM render nodes
On 2024-07-09 5:30, 周春明(日月) wrote: > > > > > > > -- > 发件人:Felix Kuehling > 发送时间:2024年7月9日(星期二) 06:40 > 收件人:周春明(日月) ; Tvrtko Ursulin > ; dri-devel@lists.freedesktop.org > ; amd-...@lists.freedesktop.org > ; Dave Airlie ; Daniel > Vetter ; criu > 抄 送:"Errabolu, Ramesh" ; "Christian König" > > 主 题:Re: Re:Proposal to add CRIU support to DRM render nodes > > > On 2024-07-08 2:51, 周春明(日月) wrote: >> >> Hi Felix, >> >> When I learn CRIU you introduced in >> https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu >> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu> >> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu> >> <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>> , >> there is a sentence >> "ROCm manages memory in the form of buffer objects (BOs). We are also >> working on a new memory management API that will be based on virtual address >> ranges...", >> Out of curious, how about "new memory management based on virtual address >> ranges"? Any introduction for that? > >>Hi David, > >>This refers to the SVM API that has been in the upstream driver for a while >>now: >>https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732 >> >><https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732> > > [David] Can all ROCm runtime memory management switch to use svm apis? No > need BOs any more? I had thought about that when I started working on SVM years ago. But I came to the conclusion that we need to use BOs for VRAM to support DMABuf exports and imports to support P2P and IPC features. Regards, Felix > > Thanks, > -David > > Regards, > Felix > > >> >> Thanks, >> -David >> >> -- >> 发件人:Felix Kuehling >> 发送时间:2024年5月3日(星期五) 22:44 >> 收件人:Tvrtko Ursulin ; >> dri-devel@lists.freedesktop.org ; >> amd-...@lists.freedesktop.org ; Dave Airlie >> ; Daniel Vetter ; criu >> 抄 送:"Errabolu, Ramesh" ; "Christian König" >> >> 主 题:Re: Proposal to add CRIU support to DRM render nodes >> >> >> >> On 2024-04-16 10:04, Tvrtko Ursulin wrote: >> > >> > On 01/04/2024 18:58, Felix Kuehling wrote: >> >> >> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote: >> >>> >> >>> On 01/04/2024 17:37, Felix Kuehling wrote: >> >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote: >> >>>>> >> >>>>> On 28/03/2024 20:42, Felix Kuehling wrote: >> >>>>>> >> >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote: >> >>>>>>> >> >>>>>>> Hi Felix, >> >>>>>>> >> >>>>>>> I had one more thought while browsing around the amdgpu CRIU >> plugin. It appears it relies on the KFD support being compiled in and >> /dev/kfd present, correct? AFAICT at least, it relies on that to figure out >> the amdgpu DRM node. >> >>>>>>> >> >>>>>>> In would be probably good to consider designing things without >> that dependency. So that checkpointing an application which does not use >> /dev/kfd is possible. Or if the kernel does not even have the KFD support >> compiled in. >> >>>>>> >> >>>>>> Yeah, if we want to support graphics apps that don't use KFD, we >> should definitely do that. Currently we get a lot of topology information >> from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed >> by KFD. We'd need to get GPU device info from the render nodes instead. And >> if KFD is available, we may need to integrate both sources of information. >> >>>>>> >> >>>>>> >> >>>>>>> >> >>>>>>> It could perhaps mean no more than adding some GPU discovery >> code into CRIU. Which shuold be flexible enough to account for things like >> re-assigned minor numbers due driver reload. >> >>>>>> >> >>>>>> Do you mean adding
Re: Re:Proposal to add CRIU support to DRM render nodes
On 2024-07-08 2:51, 周春明(日月) wrote: > > Hi Felix, > > When I learn CRIU you introduced in > https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu > <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu> , > there is a sentence > "ROCm manages memory in the form of buffer objects (BOs). We are also working > on a new memory management API that will be based on virtual address > ranges...", > Out of curious, how about "new memory management based on virtual address > ranges"? Any introduction for that? Hi David, This refers to the SVM API that has been in the upstream driver for a while now: https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732 Regards, Felix > > Thanks, > -David > > -- > 发件人:Felix Kuehling > 发送时间:2024年5月3日(星期五) 22:44 > 收件人:Tvrtko Ursulin ; > dri-devel@lists.freedesktop.org ; > amd-...@lists.freedesktop.org ; Dave Airlie > ; Daniel Vetter ; criu > 抄 送:"Errabolu, Ramesh" ; "Christian König" > > 主 题:Re: Proposal to add CRIU support to DRM render nodes > > > > On 2024-04-16 10:04, Tvrtko Ursulin wrote: > > > > On 01/04/2024 18:58, Felix Kuehling wrote: > >> > >> On 2024-04-01 12:56, Tvrtko Ursulin wrote: > >>> > >>> On 01/04/2024 17:37, Felix Kuehling wrote: > >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote: > >>>>> > >>>>> On 28/03/2024 20:42, Felix Kuehling wrote: > >>>>>> > >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote: > >>>>>>> > >>>>>>> Hi Felix, > >>>>>>> > >>>>>>> I had one more thought while browsing around the amdgpu CRIU > plugin. It appears it relies on the KFD support being compiled in and > /dev/kfd present, correct? AFAICT at least, it relies on that to figure out > the amdgpu DRM node. > >>>>>>> > >>>>>>> In would be probably good to consider designing things without > that dependency. So that checkpointing an application which does not use > /dev/kfd is possible. Or if the kernel does not even have the KFD support > compiled in. > >>>>>> > >>>>>> Yeah, if we want to support graphics apps that don't use KFD, we > should definitely do that. Currently we get a lot of topology information > from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed > by KFD. We'd need to get GPU device info from the render nodes instead. And > if KFD is available, we may need to integrate both sources of information. > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> It could perhaps mean no more than adding some GPU discovery code > into CRIU. Which shuold be flexible enough to account for things like > re-assigned minor numbers due driver reload. > >>>>>> > >>>>>> Do you mean adding GPU discovery to the core CRIU, or to the > plugin. I was thinking this is still part of the plugin. > >>>>> > >>>>> Yes I agree. I was only thinking about adding some DRM device > discovery code in a more decoupled fashion from the current plugin, for both > the reason discussed above (decoupling a bit from reliance on kfd sysfs), and > then also if/when a new DRM driver might want to implement this the code > could be move to some common plugin area. > >>>>> > >>>>> I am not sure how feasible that would be though. The "gpu id" > concept and it's matching in the current kernel code and CRIU plugin - is > that value tied to the physical GPU instance or how it works? > >>>> > >>>> The concept of the GPU ID is that it's stable while the system is > up, even when devices get added and removed dynamically. It was baked into > the API early on, but I don't think we ever fully validated device hot plug. > I think the closest we're getting is with our latest MI GPUs and dynamic > partition mode change. > >>> > >>> Doesn't it read the saved gpu id from the image file while doing > restore and tries to open the render node to match it? Maybe I am misreading > the code.. But if it does, does it imply that in practice it could be stable > acro
Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()
On 2024-06-05 05:14, Christian König wrote: Am 04.06.24 um 20:08 schrieb Felix Kuehling: On 2024-06-03 22:13, Al Viro wrote: Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into descriptor table, only to have it looked up by file descriptor and remove it from descriptor table is not just too convoluted - it's racy; another thread might have modified the descriptor table while we'd been going through that song and dance. It's not hard to fix - turn drm_gem_prime_handle_to_fd() into a wrapper for a new helper that would simply return the dmabuf, without messing with descriptor table. Then kfd_mem_export_dmabuf() would simply use that new helper and leave the descriptor table alone. Signed-off-by: Al Viro This patch looks good to me on the amdgpu side. For the DRM side I'm adding dri-devel. Yeah that patch should probably be split up and the DRM changes discussed separately. On the other hand skimming over it it seems reasonable to me. Felix are you going to look into this or should I take a look and try to push it through drm-misc-next? It doesn't matter much to me, as long as we submit both changes together. Thanks, Felix Thanks, Christian. Acked-by: Felix Kuehling --- diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 8975cf41a91a..793780bb819c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,7 +25,6 @@ #include #include #include -#include #include #include @@ -812,18 +811,13 @@ static int kfd_mem_export_dmabuf(struct kgd_mem *mem) if (!mem->dmabuf) { struct amdgpu_device *bo_adev; struct dma_buf *dmabuf; - int r, fd; bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); - r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, + dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev, bo_adev->kfd.client.file, mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0, &fd); - if (r) - return r; - dmabuf = dma_buf_get(fd); - close_fd(fd); - if (WARN_ON_ONCE(IS_ERR(dmabuf))) + DRM_RDWR : 0); + if (IS_ERR(dmabuf)) return PTR_ERR(dmabuf); mem->dmabuf = dmabuf; } diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 03bd3c7bd0dc..622c51d3fe18 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -409,23 +409,9 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/** - * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers - * @dev: dev to export the buffer from - * @file_priv: drm file-private structure - * @handle: buffer handle to export - * @flags: flags like DRM_CLOEXEC - * @prime_fd: pointer to storage for the fd id of the create dma-buf - * - * This is the PRIME export function which must be used mandatorily by GEM - * drivers to ensure correct lifetime management of the underlying GEM object. - * The actual exporting from GEM object to a dma-buf is done through the - * &drm_gem_object_funcs.export callback. - */ -int drm_gem_prime_handle_to_fd(struct drm_device *dev, +struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev, struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) + uint32_t flags) { struct drm_gem_object *obj; int ret = 0; @@ -434,14 +420,14 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, mutex_lock(&file_priv->prime.lock); obj = drm_gem_object_lookup(file_priv, handle); if (!obj) { - ret = -ENOENT; + dmabuf = ERR_PTR(-ENOENT); goto out_unlock; } dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, handle); if (dmabuf) { get_dma_buf(dmabuf); - goto out_have_handle; + goto out; } mutex_lock(&dev->object_name_lock); @@ -463,7 +449,6 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, /* normally the created dma-buf takes ownership of the ref, * but if that fails then drop the ref */ - ret = PTR_ERR(dmabuf); mutex_unlock(&dev->object_name_lock); goto out; } @@ -478,34 +463,49 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, ret = drm_prime_add_buf_handle(&file_priv->prime, dmabuf, handle); mutex_unlock(&dev->object_name_lock); - if (ret) - goto fail_put_dmabuf; - -out_have_handle: - ret = dma_buf_fd(d
Re: [PATCH] Revert "drm/amdgpu: init iommu after amdkfd device init"
On 2024-06-03 18:19, Armin Wolf wrote: Am 23.05.24 um 19:30 schrieb Armin Wolf: This reverts commit 56b522f4668167096a50c39446d6263c96219f5f. A user reported that this commit breaks the integrated gpu of his notebook, causing a black screen. He was able to bisect the problematic commit and verified that by reverting it the notebook works again. He also confirmed that kernel 6.8.1 also works on his device, so the upstream commit itself seems to be ok. An amdgpu developer (Alex Deucher) confirmed that this patch should have never been ported to 5.15 in the first place, so revert this commit from the 5.15 stable series. Hi, what is the status of this? Which branch is this for? This patch won't apply to anything after Linux 6.5. Support for IOMMUv2 was removed from amdgpu in Linux 6.6 by: commit c99a2e7ae291e5b19b60443eb6397320ef9e8571 Author: Alex Deucher Date: Fri Jul 28 12:20:12 2023 -0400 drm/amdkfd: drop IOMMUv2 support Now that we use the dGPU path for all APUs, drop the IOMMUv2 support. v2: drop the now unused queue manager functions for gfx7/8 APUs Reviewed-by: Felix Kuehling Acked-by: Christian König Tested-by: Mike Lothian Signed-off-by: Alex Deucher Regards, Felix Armin Wolf Reported-by: Barry Kauler Signed-off-by: Armin Wolf --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 222a1d9ecf16..5f6c32ec674d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2487,6 +2487,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev) if (r) goto init_failed; + r = amdgpu_amdkfd_resume_iommu(adev); + if (r) + goto init_failed; + r = amdgpu_device_ip_hw_init_phase1(adev); if (r) goto init_failed; @@ -2525,10 +2529,6 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev) if (!adev->gmc.xgmi.pending_reset) amdgpu_amdkfd_device_init(adev); - r = amdgpu_amdkfd_resume_iommu(adev); - if (r) - goto init_failed; - amdgpu_fru_get_product_info(adev); init_failed: -- 2.39.2
Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()
On 2024-06-03 22:13, Al Viro wrote: Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into descriptor table, only to have it looked up by file descriptor and remove it from descriptor table is not just too convoluted - it's racy; another thread might have modified the descriptor table while we'd been going through that song and dance. It's not hard to fix - turn drm_gem_prime_handle_to_fd() into a wrapper for a new helper that would simply return the dmabuf, without messing with descriptor table. Then kfd_mem_export_dmabuf() would simply use that new helper and leave the descriptor table alone. Signed-off-by: Al Viro This patch looks good to me on the amdgpu side. For the DRM side I'm adding dri-devel. Acked-by: Felix Kuehling --- diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 8975cf41a91a..793780bb819c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,7 +25,6 @@ #include #include #include -#include #include #include @@ -812,18 +811,13 @@ static int kfd_mem_export_dmabuf(struct kgd_mem *mem) if (!mem->dmabuf) { struct amdgpu_device *bo_adev; struct dma_buf *dmabuf; - int r, fd; bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); - r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, + dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev, bo_adev->kfd.client.file, mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0, &fd); - if (r) - return r; - dmabuf = dma_buf_get(fd); - close_fd(fd); - if (WARN_ON_ONCE(IS_ERR(dmabuf))) + DRM_RDWR : 0); + if (IS_ERR(dmabuf)) return PTR_ERR(dmabuf); mem->dmabuf = dmabuf; } diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 03bd3c7bd0dc..622c51d3fe18 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -409,23 +409,9 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/** - * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers - * @dev: dev to export the buffer from - * @file_priv: drm file-private structure - * @handle: buffer handle to export - * @flags: flags like DRM_CLOEXEC - * @prime_fd: pointer to storage for the fd id of the create dma-buf - * - * This is the PRIME export function which must be used mandatorily by GEM - * drivers to ensure correct lifetime management of the underlying GEM object. - * The actual exporting from GEM object to a dma-buf is done through the - * &drm_gem_object_funcs.export callback. - */ -int drm_gem_prime_handle_to_fd(struct drm_device *dev, +struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev, struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) + uint32_t flags) { struct drm_gem_object *obj; int ret = 0; @@ -434,14 +420,14 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, mutex_lock(&file_priv->prime.lock); obj = drm_gem_object_lookup(file_priv, handle); if (!obj) { - ret = -ENOENT; + dmabuf = ERR_PTR(-ENOENT); goto out_unlock; } dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, handle); if (dmabuf) { get_dma_buf(dmabuf); - goto out_have_handle; + goto out; } mutex_lock(&dev->object_name_lock); @@ -463,7 +449,6 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, /* normally the created dma-buf takes ownership of the ref, * but if that fails then drop the ref */ - ret = PTR_ERR(dmabuf); mutex_unlock(&dev->object_name_lock); goto out; } @@ -478,34 +463,49 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, ret = drm_prime_add_buf_handle(&file_priv->prime, dmabuf, handle); mutex_unlock(&dev->object_name_lock); - if (ret) - goto fail_put_dmabuf; - -out_have_handle: - ret = dma_buf_fd(dmabuf, flags); - /* -* We must _not_ remove the buffer from the handle cache since the newly -* created dma buf is already link
Re: [PATCH 11/11] drm/tegra: Use fbdev client helpers
On 2024-05-07 07:58, Thomas Zimmermann wrote: Implement struct drm_client_funcs with the respective helpers and remove the custom code from the emulation. The generic helpers are equivalent in functionality. Signed-off-by: Thomas Zimmermann --- drivers/gpu/drm/radeon/radeon_fbdev.c | 66 ++- Was radeon meant to be a separate patch? Regards, Felix drivers/gpu/drm/tegra/fbdev.c | 58 ++- 2 files changed, 6 insertions(+), 118 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon_fbdev.c b/drivers/gpu/drm/radeon/radeon_fbdev.c index 02bf25759059a..cf790922174ea 100644 --- a/drivers/gpu/drm/radeon/radeon_fbdev.c +++ b/drivers/gpu/drm/radeon/radeon_fbdev.c @@ -29,7 +29,6 @@ #include #include -#include #include #include #include @@ -293,71 +292,12 @@ static const struct drm_fb_helper_funcs radeon_fbdev_fb_helper_funcs = { }; /* - * Fbdev client and struct drm_client_funcs + * struct drm_client_funcs */ -static void radeon_fbdev_client_unregister(struct drm_client_dev *client) -{ - struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client); - struct drm_device *dev = fb_helper->dev; - struct radeon_device *rdev = dev->dev_private; - - if (fb_helper->info) { - vga_switcheroo_client_fb_set(rdev->pdev, NULL); - drm_helper_force_disable_all(dev); - drm_fb_helper_unregister_info(fb_helper); - } else { - drm_client_release(&fb_helper->client); - drm_fb_helper_unprepare(fb_helper); - kfree(fb_helper); - } -} - -static int radeon_fbdev_client_restore(struct drm_client_dev *client) -{ - drm_fb_helper_lastclose(client->dev); - vga_switcheroo_process_delayed_switch(); - - return 0; -} - -static int radeon_fbdev_client_hotplug(struct drm_client_dev *client) -{ - struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client); - struct drm_device *dev = client->dev; - struct radeon_device *rdev = dev->dev_private; - int ret; - - if (dev->fb_helper) - return drm_fb_helper_hotplug_event(dev->fb_helper); - - ret = drm_fb_helper_init(dev, fb_helper); - if (ret) - goto err_drm_err; - - if (!drm_drv_uses_atomic_modeset(dev)) - drm_helper_disable_unused_functions(dev); - - ret = drm_fb_helper_initial_config(fb_helper); - if (ret) - goto err_drm_fb_helper_fini; - - vga_switcheroo_client_fb_set(rdev->pdev, fb_helper->info); - - return 0; - -err_drm_fb_helper_fini: - drm_fb_helper_fini(fb_helper); -err_drm_err: - drm_err(dev, "Failed to setup radeon fbdev emulation (ret=%d)\n", ret); - return ret; -} - static const struct drm_client_funcs radeon_fbdev_client_funcs = { - .owner = THIS_MODULE, - .unregister = radeon_fbdev_client_unregister, - .restore= radeon_fbdev_client_restore, - .hotplug= radeon_fbdev_client_hotplug, + .owner = THIS_MODULE, + DRM_FBDEV_HELPER_CLIENT_FUNCS, }; void radeon_fbdev_setup(struct radeon_device *rdev) diff --git a/drivers/gpu/drm/tegra/fbdev.c b/drivers/gpu/drm/tegra/fbdev.c index db6eaac3d30e6..f9cc365cfed94 100644 --- a/drivers/gpu/drm/tegra/fbdev.c +++ b/drivers/gpu/drm/tegra/fbdev.c @@ -12,7 +12,6 @@ #include #include -#include #include #include #include @@ -150,63 +149,12 @@ static const struct drm_fb_helper_funcs tegra_fb_helper_funcs = { }; /* - * struct drm_client + * struct drm_client_funcs */ -static void tegra_fbdev_client_unregister(struct drm_client_dev *client) -{ - struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client); - - if (fb_helper->info) { - drm_fb_helper_unregister_info(fb_helper); - } else { - drm_client_release(&fb_helper->client); - drm_fb_helper_unprepare(fb_helper); - kfree(fb_helper); - } -} - -static int tegra_fbdev_client_restore(struct drm_client_dev *client) -{ - drm_fb_helper_lastclose(client->dev); - - return 0; -} - -static int tegra_fbdev_client_hotplug(struct drm_client_dev *client) -{ - struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client); - struct drm_device *dev = client->dev; - int ret; - - if (dev->fb_helper) - return drm_fb_helper_hotplug_event(dev->fb_helper); - - ret = drm_fb_helper_init(dev, fb_helper); - if (ret) - goto err_drm_err; - - if (!drm_drv_uses_atomic_modeset(dev)) - drm_helper_disable_unused_functions(dev); - - ret = drm_fb_helper_initial_config(fb_helper); - if (ret) - goto err_drm_fb_helper_fini; - - return 0; - -err_drm_fb_helper_fini: - drm_fb_helper_fini(fb_helper); -err_drm_err: - drm_err(dev,
Re: Proposal to add CRIU support to DRM render nodes
On 2024-04-16 10:04, Tvrtko Ursulin wrote: > > On 01/04/2024 18:58, Felix Kuehling wrote: >> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote: >>> >>> On 01/04/2024 17:37, Felix Kuehling wrote: >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote: >>>>> >>>>> On 28/03/2024 20:42, Felix Kuehling wrote: >>>>>> >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote: >>>>>>> >>>>>>> Hi Felix, >>>>>>> >>>>>>> I had one more thought while browsing around the amdgpu CRIU plugin. It >>>>>>> appears it relies on the KFD support being compiled in and /dev/kfd >>>>>>> present, correct? AFAICT at least, it relies on that to figure out the >>>>>>> amdgpu DRM node. >>>>>>> >>>>>>> In would be probably good to consider designing things without that >>>>>>> dependency. So that checkpointing an application which does not use >>>>>>> /dev/kfd is possible. Or if the kernel does not even have the KFD >>>>>>> support compiled in. >>>>>> >>>>>> Yeah, if we want to support graphics apps that don't use KFD, we should >>>>>> definitely do that. Currently we get a lot of topology information from >>>>>> KFD, not even from the /dev/kfd device but from the sysfs nodes exposed >>>>>> by KFD. We'd need to get GPU device info from the render nodes instead. >>>>>> And if KFD is available, we may need to integrate both sources of >>>>>> information. >>>>>> >>>>>> >>>>>>> >>>>>>> It could perhaps mean no more than adding some GPU discovery code into >>>>>>> CRIU. Which shuold be flexible enough to account for things like >>>>>>> re-assigned minor numbers due driver reload. >>>>>> >>>>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin. I >>>>>> was thinking this is still part of the plugin. >>>>> >>>>> Yes I agree. I was only thinking about adding some DRM device discovery >>>>> code in a more decoupled fashion from the current plugin, for both the >>>>> reason discussed above (decoupling a bit from reliance on kfd sysfs), and >>>>> then also if/when a new DRM driver might want to implement this the code >>>>> could be move to some common plugin area. >>>>> >>>>> I am not sure how feasible that would be though. The "gpu id" concept and >>>>> it's matching in the current kernel code and CRIU plugin - is that value >>>>> tied to the physical GPU instance or how it works? >>>> >>>> The concept of the GPU ID is that it's stable while the system is up, even >>>> when devices get added and removed dynamically. It was baked into the API >>>> early on, but I don't think we ever fully validated device hot plug. I >>>> think the closest we're getting is with our latest MI GPUs and dynamic >>>> partition mode change. >>> >>> Doesn't it read the saved gpu id from the image file while doing restore >>> and tries to open the render node to match it? Maybe I am misreading the >>> code.. But if it does, does it imply that in practice it could be stable >>> across reboots? Or that it is not possible to restore to a different >>> instance of maybe the same GPU model installed in a system? >> >> Ah, the idea is, that when you restore on a different system, you may get >> different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore >> it on GPU 2 on the same system. That's why we need to translate GPU IDs in >> restored applications. User mode still uses the old GPU IDs, but the kernel >> mode driver translates them to the actual GPU IDs of the GPUs that the >> process was restored on. > > I see.. I think. Normal flow is ppd->user_gpu_id set during client init, but > for restored clients it gets overriden during restore so that any further > ioctls can actually not instantly fail. > > And then in amdgpu_plugin_restore_file, when it is opening the render node, > it relies on the kfd topology to have filled in (more or less) the > target_gpu_id corresponding to the render node gpu id of the target GPU - the > one associated with the new kfd gpu_id? Ye
Re: [PATCH] drm/amdkfd: fix NULL pointer dereference
This patch does not apply to amd-staging-drm-next. This is against a DKMS branch and should be reviewed on our internal mailing list. However, I suspect that part of the problem is, that the DKMS branch has diverged quite a bit in this area, and is missing at least one patch from me that was reverted, probably because of an improper port. The proper solution should involve getting the DKMS branch back in sync with upstream. I'll look into that. Regards, Felix On 2024-04-13 14:07, vitaly.pros...@amd.com wrote: From: Vitaly Prosyak [ +0.006038] BUG: kernel NULL pointer dereference, address: 0028 [ +0.006969] #PF: supervisor read access in kernel mode [ +0.005139] #PF: error_code(0x) - not-present page [ +0.005139] PGD 0 P4D 0 [ +0.002530] Oops: [#1] PREEMPT SMP NOPTI [ +0.004356] CPU: 11 PID: 12625 Comm: kworker/11:0 Tainted: GW 6.7.0+ #2 [ +0.008097] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI II, BIOS 1302 12/08/2023 [ +0.009398] Workqueue: events evict_process_worker [amdgpu] [ +0.005750] RIP: 0010:evict_process_worker+0x2f/0x460 [amdgpu] [ +0.005991] Code: 55 48 89 e5 41 57 41 56 4c 8d b7 a8 fc ff ff 41 55 41 54 53 48 89 fb 48 83 ec 10 0f 1f 44 00 00 48 8b 43 f8 8b 93 b0 00 00 00 <48> 3b 50 28 0f 85 50 03 00 00 48 8d 7b 58 e8 ee be cb bf 48 8b 05 [ +0.018791] RSP: 0018:c90009a2be10 EFLAGS: 00010282 [ +0.005226] RAX: RBX: 888197ffc358 RCX: [ +0.007140] RDX: 0a1b RSI: RDI: 888197ffc358 [ +0.007139] RBP: c90009a2be48 R08: R09: [ +0.007139] R10: R11: R12: 888197ffc358 [ +0.007139] R13: 888100153a00 R14: 888197ffc000 R15: 888100153a05 [ +0.007137] FS: () GS:889facac() knlGS: [ +0.008094] CS: 0010 DS: ES: CR0: 80050033 [ +0.005747] CR2: 0028 CR3: 00010d1fc001 CR4: 00770ef0 [ +0.007138] PKRU: 5554 [ +0.002702] Call Trace: [ +0.002443] [ +0.002096] ? show_regs+0x72/0x90 [ +0.003402] ? __die+0x25/0x80 [ +0.003052] ? page_fault_oops+0x154/0x4c0 [ +0.004099] ? do_user_addr_fault+0x30e/0x6e0 [ +0.004357] ? psi_group_change+0x237/0x520 [ +0.004185] ? exc_page_fault+0x84/0x1b0 [ +0.003926] ? asm_exc_page_fault+0x27/0x30 [ +0.004187] ? evict_process_worker+0x2f/0x460 [amdgpu] [ +0.005377] process_one_work+0x17b/0x360 [ +0.004011] ? __pfx_worker_thread+0x10/0x10 [ +0.004269] worker_thread+0x307/0x430 [ +0.003748] ? __pfx_worker_thread+0x10/0x10 [ +0.004268] kthread+0xf7/0x130 [ +0.003142] ? __pfx_kthread+0x10/0x10 [ +0.003749] ret_from_fork+0x46/0x70 [ +0.003573] ? __pfx_kthread+0x10/0x10 [ +0.003747] ret_from_fork_asm+0x1b/0x30 [ +0.003924] When we run stressful tests, the eviction fence could be zero and not match to last_eviction_seqno. Avoid calling dma_fence_signal and dma_fence_put with zero fences to rely on checking parameters in DMA API. Cc: Alex Deucher Cc: Christian Koenig Cc: Xiaogang Chen Cc: Felix Kuehling Signed-off-by: Vitaly Prosyak --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index eb380296017d..a15fae1c398a 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -2118,7 +2118,7 @@ static void evict_process_worker(struct work_struct *work) */ p = container_of(dwork, struct kfd_process, eviction_work); trace_kfd_evict_process_worker_start(p); - WARN_ONCE(p->last_eviction_seqno != p->ef->seqno, + WARN_ONCE(p->ef && p->last_eviction_seqno != p->ef->seqno, "Eviction fence mismatch\n"); /* Narrow window of overlap between restore and evict work @@ -2134,9 +2134,11 @@ static void evict_process_worker(struct work_struct *work) pr_debug("Started evicting pasid 0x%x\n", p->pasid); ret = kfd_process_evict_queues(p, false, KFD_QUEUE_EVICTION_TRIGGER_TTM); if (!ret) { - dma_fence_signal(p->ef); - dma_fence_put(p->ef); - p->ef = NULL; + if (p->ef) { + dma_fence_signal(p->ef); + dma_fence_put(p->ef); + p->ef = NULL; + } if (!kfd_process_unmap_doorbells_if_idle(p)) kfd_process_schedule_restore(p);
Re: [PATCH] drm/ttm: stop pooling cached NUMA pages v2
On 2024-04-15 10:08, Christian König wrote: Am 15.04.24 um 15:53 schrieb Felix Kuehling: On 2024-04-15 9:48, Christian König wrote: From: Christian König We only pool write combined and uncached allocations because they require extra overhead on allocation and release. If we also pool cached NUMA it not only means some extra unnecessary overhead, but also that under memory pressure it can happen that pages from the wrong NUMA node enters the pool and are re-used over and over again. This can lead to performance reduction after running into memory pressure. v2: restructure and cleanup the code a bit from the internal hack to test this. Signed-off-by: Christian König Fixes: 4482d3c94d7f ("drm/ttm: add NUMA node id to the pool") CC: sta...@vger.kernel.org --- drivers/gpu/drm/ttm/ttm_pool.c | 38 +- 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index 112438d965ff..6e1fd6985ffc 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -288,17 +288,23 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool, enum ttm_caching caching, unsigned int order) { - if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) + if (pool->use_dma_alloc) return &pool->caching[caching].orders[order]; #ifdef CONFIG_X86 switch (caching) { case ttm_write_combined: + if (pool->nid != NUMA_NO_NODE) + return &pool->caching[caching].orders[order]; Doesn't this break USWC allocations on NUMA systems, where we set a NUMA node for the default pool (at least we were planning to at some point)? I don't think so, but I might have missed something. Why do you think that would break? I mean the idea is basically if the pool is associated with a NUMA id we should rather use this pool instead of the global one. And that is true for both cases, the default pool and the specialized ones. OK, I think I misunderstood what I was reading. It looked to me like it would always use a "caching" pool if nid was set. But caching here is a variable; each node still has specialized pools for write combining etc. Then the concern you stated in the commit message "under memory pressure it can happen that pages from the wrong NUMA node enters the pool and are re-used over and over again" is still possible for uncached and wc pages. Anyway, it's better than not having NUMA, I guess. The patch is Reviewed-by: Felix Kuehling Regards, Christian. Regards, Felix + if (pool->use_dma32) return &global_dma32_write_combined[order]; return &global_write_combined[order]; case ttm_uncached: + if (pool->nid != NUMA_NO_NODE) + return &pool->caching[caching].orders[order]; + if (pool->use_dma32) return &global_dma32_uncached[order]; @@ -566,11 +572,17 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev, pool->use_dma_alloc = use_dma_alloc; pool->use_dma32 = use_dma32; - if (use_dma_alloc || nid != NUMA_NO_NODE) { - for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) - for (j = 0; j < NR_PAGE_ORDERS; ++j) - ttm_pool_type_init(&pool->caching[i].orders[j], - pool, i, j); + for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) { + for (j = 0; j < NR_PAGE_ORDERS; ++j) { + struct ttm_pool_type *pt; + + /* Initialize only pool types which are actually used */ + pt = ttm_pool_select_type(pool, i, j); + if (pt != &pool->caching[i].orders[j]) + continue; + + ttm_pool_type_init(pt, pool, i, j); + } } } EXPORT_SYMBOL(ttm_pool_init); @@ -599,10 +611,16 @@ void ttm_pool_fini(struct ttm_pool *pool) { unsigned int i, j; - if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) { - for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) - for (j = 0; j < NR_PAGE_ORDERS; ++j) - ttm_pool_type_fini(&pool->caching[i].orders[j]); + for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) { + for (j = 0; j < NR_PAGE_ORDERS; ++j) { + struct ttm_pool_type *pt; + + pt = ttm_pool_select_type(pool, i, j); + if (pt != &pool->caching[i].orders[j]) + continue; + + ttm_pool_type_fini(pt); + } } /* We removed the pool types from the LRU, but we need to also make sure
Re: [PATCH] drm/ttm: stop pooling cached NUMA pages v2
On 2024-04-15 9:48, Christian König wrote: From: Christian König We only pool write combined and uncached allocations because they require extra overhead on allocation and release. If we also pool cached NUMA it not only means some extra unnecessary overhead, but also that under memory pressure it can happen that pages from the wrong NUMA node enters the pool and are re-used over and over again. This can lead to performance reduction after running into memory pressure. v2: restructure and cleanup the code a bit from the internal hack to test this. Signed-off-by: Christian König Fixes: 4482d3c94d7f ("drm/ttm: add NUMA node id to the pool") CC: sta...@vger.kernel.org --- drivers/gpu/drm/ttm/ttm_pool.c | 38 +- 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index 112438d965ff..6e1fd6985ffc 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -288,17 +288,23 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool, enum ttm_caching caching, unsigned int order) { - if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) + if (pool->use_dma_alloc) return &pool->caching[caching].orders[order]; #ifdef CONFIG_X86 switch (caching) { case ttm_write_combined: + if (pool->nid != NUMA_NO_NODE) + return &pool->caching[caching].orders[order]; Doesn't this break USWC allocations on NUMA systems, where we set a NUMA node for the default pool (at least we were planning to at some point)? Regards, Felix + if (pool->use_dma32) return &global_dma32_write_combined[order]; return &global_write_combined[order]; case ttm_uncached: + if (pool->nid != NUMA_NO_NODE) + return &pool->caching[caching].orders[order]; + if (pool->use_dma32) return &global_dma32_uncached[order]; @@ -566,11 +572,17 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev, pool->use_dma_alloc = use_dma_alloc; pool->use_dma32 = use_dma32; - if (use_dma_alloc || nid != NUMA_NO_NODE) { - for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) - for (j = 0; j < NR_PAGE_ORDERS; ++j) - ttm_pool_type_init(&pool->caching[i].orders[j], - pool, i, j); + for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) { + for (j = 0; j < NR_PAGE_ORDERS; ++j) { + struct ttm_pool_type *pt; + + /* Initialize only pool types which are actually used */ + pt = ttm_pool_select_type(pool, i, j); + if (pt != &pool->caching[i].orders[j]) + continue; + + ttm_pool_type_init(pt, pool, i, j); + } } } EXPORT_SYMBOL(ttm_pool_init); @@ -599,10 +611,16 @@ void ttm_pool_fini(struct ttm_pool *pool) { unsigned int i, j; - if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) { - for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) - for (j = 0; j < NR_PAGE_ORDERS; ++j) - ttm_pool_type_fini(&pool->caching[i].orders[j]); + for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) { + for (j = 0; j < NR_PAGE_ORDERS; ++j) { + struct ttm_pool_type *pt; + + pt = ttm_pool_select_type(pool, i, j); + if (pt != &pool->caching[i].orders[j]) + continue; + + ttm_pool_type_fini(pt); + } } /* We removed the pool types from the LRU, but we need to also make sure
Re: Proposal to add CRIU support to DRM render nodes
On 2024-04-01 12:56, Tvrtko Ursulin wrote: On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. Doesn't it read the saved gpu id from the image file while doing restore and tries to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system? Ah, the idea is, that when you restore on a different system, you may get different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore it on GPU 2 on the same system. That's why we need to translate GPU IDs in restored applications. User mode still uses the old GPU IDs, but the kernel mode driver translates them to the actual GPU IDs of the GPUs that the process was restored on. This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID. I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more? Right, the bare-metal driver can partition a PF spatially without SRIOV. SRIOV can also use spatial partitioning and expose each partition through its own VF, but that's not useful for bare metal. Spatial partitioning is new in MI300. There is some high-level info in this whitepaper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf. Regards, Felix Regards, Tvrtko Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on
Re: Proposal to add CRIU support to DRM render nodes
On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID. Regards, Felix Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers i
Re: Proposal to add CRIU support to DRM render nodes
On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and
Re: [PATCH 05/10] drivers: use new capable_any functionality
On 2024-03-15 7:37, Christian Göttsche wrote: Use the new added capable_any function in appropriate cases, where a task is required to have any of two capabilities. Reorder CAP_SYS_ADMIN last. Signed-off-by: Christian Göttsche Acked-by: Alexander Gordeev (s390 portion) Acked-by: Felix Kuehling (amdkfd portion) --- v4: Additional usage in kfd_ioctl() v3: rename to capable_any() --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 3 +-- drivers/net/caif/caif_serial.c | 2 +- drivers/s390/block/dasd_eckd.c | 2 +- 3 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index dfa8c69532d4..8c7ebca01c17 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -3290,8 +3290,7 @@ static long kfd_ioctl(struct file *filep, unsigned int cmd, unsigned long arg) * more priviledged access. */ if (unlikely(ioctl->flags & KFD_IOC_FLAG_CHECKPOINT_RESTORE)) { - if (!capable(CAP_CHECKPOINT_RESTORE) && - !capable(CAP_SYS_ADMIN)) { + if (!capable_any(CAP_CHECKPOINT_RESTORE, CAP_SYS_ADMIN)) { retcode = -EACCES; goto err_i1; } diff --git a/drivers/net/caif/caif_serial.c b/drivers/net/caif/caif_serial.c index ed3a589def6b..e908b9ce57dc 100644 --- a/drivers/net/caif/caif_serial.c +++ b/drivers/net/caif/caif_serial.c @@ -326,7 +326,7 @@ static int ldisc_open(struct tty_struct *tty) /* No write no play */ if (tty->ops->write == NULL) return -EOPNOTSUPP; - if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_TTY_CONFIG)) + if (!capable_any(CAP_SYS_TTY_CONFIG, CAP_SYS_ADMIN)) return -EPERM; /* release devices to avoid name collision */ diff --git a/drivers/s390/block/dasd_eckd.c b/drivers/s390/block/dasd_eckd.c index 373c1a86c33e..8f9a5136306a 100644 --- a/drivers/s390/block/dasd_eckd.c +++ b/drivers/s390/block/dasd_eckd.c @@ -5384,7 +5384,7 @@ static int dasd_symm_io(struct dasd_device *device, void __user *argp) char psf0, psf1; int rc; - if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RAWIO)) + if (!capable_any(CAP_SYS_RAWIO, CAP_SYS_ADMIN)) return -EACCES; psf0 = psf1 = 0;
Re: Proposal to add CRIU support to DRM render nodes
On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to restored for things to continue executing sensibly? Or how is that handled? In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore) where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", that are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint. If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wait
Re: [PATCH AUTOSEL 5.15 3/5] drm/amdgpu: Enable gpu reset for S3 abort cases on Raven series
On 2024-03-11 11:14, Sasha Levin wrote: From: Prike Liang [ Upstream commit c671ec01311b4744b377f98b0b4c6d033fe569b3 ] Currently, GPU resets can now be performed successfully on the Raven series. While GPU reset is required for the S3 suspend abort case. So now can enable gpu reset for S3 abort cases on the Raven series. This looks suspicious to me. I'm not sure what conditions made the GPU reset successful. But unless all the changes involved were also backported, this should probably not be applied to older kernel branches. I'm speculating it may be related to the removal of AMD IOMMUv2. Regards, Felix Signed-off-by: Prike Liang Acked-by: Alex Deucher Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin --- drivers/gpu/drm/amd/amdgpu/soc15.c | 45 +- 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c index 6a3486f52d698..ef5b3eedc8615 100644 --- a/drivers/gpu/drm/amd/amdgpu/soc15.c +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c @@ -605,11 +605,34 @@ soc15_asic_reset_method(struct amdgpu_device *adev) return AMD_RESET_METHOD_MODE1; } +static bool soc15_need_reset_on_resume(struct amdgpu_device *adev) +{ + u32 sol_reg; + + sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81); + + /* Will reset for the following suspend abort cases. +* 1) Only reset limit on APU side, dGPU hasn't checked yet. +* 2) S3 suspend abort and TOS already launched. +*/ + if (adev->flags & AMD_IS_APU && adev->in_s3 && + !adev->suspend_complete && + sol_reg) + return true; + + return false; +} + static int soc15_asic_reset(struct amdgpu_device *adev) { /* original raven doesn't have full asic reset */ - if ((adev->apu_flags & AMD_APU_IS_RAVEN) || - (adev->apu_flags & AMD_APU_IS_RAVEN2)) + /* On the latest Raven, the GPU reset can be performed +* successfully. So now, temporarily enable it for the +* S3 suspend abort case. +*/ + if (((adev->apu_flags & AMD_APU_IS_RAVEN) || + (adev->apu_flags & AMD_APU_IS_RAVEN2)) && + !soc15_need_reset_on_resume(adev)) return 0; switch (soc15_asic_reset_method(adev)) { @@ -1490,24 +1513,6 @@ static int soc15_common_suspend(void *handle) return soc15_common_hw_fini(adev); } -static bool soc15_need_reset_on_resume(struct amdgpu_device *adev) -{ - u32 sol_reg; - - sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81); - - /* Will reset for the following suspend abort cases. -* 1) Only reset limit on APU side, dGPU hasn't checked yet. -* 2) S3 suspend abort and TOS already launched. -*/ - if (adev->flags & AMD_IS_APU && adev->in_s3 && - !adev->suspend_complete && - sol_reg) - return true; - - return false; -} - static int soc15_common_resume(void *handle) { struct amdgpu_device *adev = (struct amdgpu_device *)handle;
Re: [PATCH] drm/amdkfd: make kfd_class constant
On 2024-03-05 7:15, Ricardo B. Marliere wrote: Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the kfd_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman Suggested-by: Greg Kroah-Hartman Signed-off-by: Ricardo B. Marliere The patch looks good to me. Do you want me to apply this to Alex's amd-staging-drm-next? Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index f030cafc5a0a..dfa8c69532d4 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -63,8 +63,10 @@ static const struct file_operations kfd_fops = { }; static int kfd_char_dev_major = -1; -static struct class *kfd_class; struct device *kfd_device; +static const struct class kfd_class = { + .name = kfd_dev_name, +}; static inline struct kfd_process_device *kfd_lock_pdd_by_id(struct kfd_process *p, __u32 gpu_id) { @@ -94,14 +96,13 @@ int kfd_chardev_init(void) if (err < 0) goto err_register_chrdev; - kfd_class = class_create(kfd_dev_name); - err = PTR_ERR(kfd_class); - if (IS_ERR(kfd_class)) + err = class_register(&kfd_class); + if (err) goto err_class_create; - kfd_device = device_create(kfd_class, NULL, - MKDEV(kfd_char_dev_major, 0), - NULL, kfd_dev_name); + kfd_device = device_create(&kfd_class, NULL, + MKDEV(kfd_char_dev_major, 0), + NULL, kfd_dev_name); err = PTR_ERR(kfd_device); if (IS_ERR(kfd_device)) goto err_device_create; @@ -109,7 +110,7 @@ int kfd_chardev_init(void) return 0; err_device_create: - class_destroy(kfd_class); + class_unregister(&kfd_class); err_class_create: unregister_chrdev(kfd_char_dev_major, kfd_dev_name); err_register_chrdev: @@ -118,8 +119,8 @@ int kfd_chardev_init(void) void kfd_chardev_exit(void) { - device_destroy(kfd_class, MKDEV(kfd_char_dev_major, 0)); - class_destroy(kfd_class); + device_destroy(&kfd_class, MKDEV(kfd_char_dev_major, 0)); + class_unregister(&kfd_class); unregister_chrdev(kfd_char_dev_major, kfd_dev_name); kfd_device = NULL; } --- base-commit: 8bc75586ea01f1c645063d3472c115ecab03e76c change-id: 20240305-class_cleanup-drm-amd-bdc7255b7540 Best regards,
Re: Making drm_gpuvm work across gpu devices
On 2024-01-29 14:03, Christian König wrote: Am 29.01.24 um 18:52 schrieb Felix Kuehling: On 2024-01-29 11:28, Christian König wrote: Am 29.01.24 um 17:24 schrieb Felix Kuehling: On 2024-01-29 10:33, Christian König wrote: Am 29.01.24 um 16:03 schrieb Felix Kuehling: On 2024-01-25 13:32, Daniel Vetter wrote: On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote: Am 23.01.24 um 20:37 schrieb Zeng, Oak: [SNIP] Yes most API are per device based. One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices. Yeah and that was a big mistake in my opinion. We should really not do that ever again. Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc). Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea. The background is that this only works for some use cases but not all of them. What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space. Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management. When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow. [SNIP] I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state. As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process. Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda. This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general. A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process. SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI. If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus. I think the one and only one exception where an SVM uapi like in kfd makes sense, is if the _hardware_ itself, not the software stack defined semantics that you've happened to build on top of that hw, enforces a 1:1 mapping with the cpu process address space. Which means your hardware is using PASID, IOMMU based translation, PCI-ATS (address translation services) or whatever your hw calls it and has _no_ device-side pagetables on top. Which from what I've seen all devices with device-memory have, simply because they need some place to store whether that memory is currently in device memory or should be translated using PASID. Currently there's no gpu that works with PASID only, but there are some on-cpu-die accelerator things that do work like that. Maybe in the future there will be some accelerators that are fully cpu cache coherent (including atomics) with something like CXL, and the on-device memory is managed as normal system memory with struct page as ZONE_DEVICE and accelerator va -> physical address translation is only done with PASID ... but for now I haven't seen that, definitely not in upstream drivers. And the moment you have some per-device pagetables or per-device memory management of some sort (like using gpuva mgr) then I'm 100% agreeing with Christian that the kfd SVM model is too str
Re: Making drm_gpuvm work across gpu devices
On 2024-01-29 11:28, Christian König wrote: Am 29.01.24 um 17:24 schrieb Felix Kuehling: On 2024-01-29 10:33, Christian König wrote: Am 29.01.24 um 16:03 schrieb Felix Kuehling: On 2024-01-25 13:32, Daniel Vetter wrote: On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote: Am 23.01.24 um 20:37 schrieb Zeng, Oak: [SNIP] Yes most API are per device based. One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices. Yeah and that was a big mistake in my opinion. We should really not do that ever again. Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc). Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea. The background is that this only works for some use cases but not all of them. What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space. Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management. When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow. [SNIP] I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state. As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process. Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda. This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general. A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process. SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI. If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus. I think the one and only one exception where an SVM uapi like in kfd makes sense, is if the _hardware_ itself, not the software stack defined semantics that you've happened to build on top of that hw, enforces a 1:1 mapping with the cpu process address space. Which means your hardware is using PASID, IOMMU based translation, PCI-ATS (address translation services) or whatever your hw calls it and has _no_ device-side pagetables on top. Which from what I've seen all devices with device-memory have, simply because they need some place to store whether that memory is currently in device memory or should be translated using PASID. Currently there's no gpu that works with PASID only, but there are some on-cpu-die accelerator things that do work like that. Maybe in the future there will be some accelerators that are fully cpu cache coherent (including atomics) with something like CXL, and the on-device memory is managed as normal system memory with struct page as ZONE_DEVICE and accelerator va -> physical address translation is only done with PASID ... but for now I haven't seen that, definitely not in upstream drivers. And the moment you have some per-device pagetables or per-device memory management of some sort (like using gpuva mgr) then I'm 100% agreeing with Christian that the kfd SVM model is too strict and not a great idea. That basically means, without ATS/PRI+PASID you cannot im
Re: Making drm_gpuvm work across gpu devices
On 2024-01-29 10:33, Christian König wrote: Am 29.01.24 um 16:03 schrieb Felix Kuehling: On 2024-01-25 13:32, Daniel Vetter wrote: On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote: Am 23.01.24 um 20:37 schrieb Zeng, Oak: [SNIP] Yes most API are per device based. One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices. Yeah and that was a big mistake in my opinion. We should really not do that ever again. Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc). Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea. The background is that this only works for some use cases but not all of them. What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space. Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management. When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow. [SNIP] I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state. As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process. Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda. This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general. A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process. SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI. If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus. I think the one and only one exception where an SVM uapi like in kfd makes sense, is if the _hardware_ itself, not the software stack defined semantics that you've happened to build on top of that hw, enforces a 1:1 mapping with the cpu process address space. Which means your hardware is using PASID, IOMMU based translation, PCI-ATS (address translation services) or whatever your hw calls it and has _no_ device-side pagetables on top. Which from what I've seen all devices with device-memory have, simply because they need some place to store whether that memory is currently in device memory or should be translated using PASID. Currently there's no gpu that works with PASID only, but there are some on-cpu-die accelerator things that do work like that. Maybe in the future there will be some accelerators that are fully cpu cache coherent (including atomics) with something like CXL, and the on-device memory is managed as normal system memory with struct page as ZONE_DEVICE and accelerator va -> physical address translation is only done with PASID ... but for now I haven't seen that, definitely not in upstream drivers. And the moment you have some per-device pagetables or per-device memory management of some sort (like using gpuva mgr) then I'm 100% agreeing with Christian that the kfd SVM model is too strict and not a great idea. That basically means, without ATS/PRI+PASID you cannot implement a unified memory programming model, where GPUs or accelerators access virtual addresses wi
Re: Making drm_gpuvm work across gpu devices
On 2024-01-25 13:32, Daniel Vetter wrote: On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote: Am 23.01.24 um 20:37 schrieb Zeng, Oak: [SNIP] Yes most API are per device based. One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices. Yeah and that was a big mistake in my opinion. We should really not do that ever again. Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc). Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea. The background is that this only works for some use cases but not all of them. What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space. Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management. When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow. [SNIP] I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state. As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process. Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda. This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general. A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process. SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI. If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus. I think the one and only one exception where an SVM uapi like in kfd makes sense, is if the _hardware_ itself, not the software stack defined semantics that you've happened to build on top of that hw, enforces a 1:1 mapping with the cpu process address space. Which means your hardware is using PASID, IOMMU based translation, PCI-ATS (address translation services) or whatever your hw calls it and has _no_ device-side pagetables on top. Which from what I've seen all devices with device-memory have, simply because they need some place to store whether that memory is currently in device memory or should be translated using PASID. Currently there's no gpu that works with PASID only, but there are some on-cpu-die accelerator things that do work like that. Maybe in the future there will be some accelerators that are fully cpu cache coherent (including atomics) with something like CXL, and the on-device memory is managed as normal system memory with struct page as ZONE_DEVICE and accelerator va -> physical address translation is only done with PASID ... but for now I haven't seen that, definitely not in upstream drivers. And the moment you have some per-device pagetables or per-device memory management of some sort (like using gpuva mgr) then I'm 100% agreeing with Christian that the kfd SVM model is too strict and not a great idea. That basically means, without ATS/PRI+PASID you cannot implement a unified memory programming model, where GPUs or accelerators access virtual addresses without pre-registering them with an SVM API call. Unified memory is a feature implemented by the KFD SVM API and used by ROCm. This is used e.g. to implement OpenMP USM (unified shared memory). It'
Re: Making drm_gpuvm work across gpu devices
On 2024-01-24 20:17, Zeng, Oak wrote: Hi Christian, Even though I mentioned KFD design, I didn’t mean to copy the KFD design. I also had hard time to understand the difficulty of KFD under virtualization environment. The problem with virtualization is related to virtualization design choices. There is a single process that proxies requests from multiple processes in one (or more?) VMs to the GPU driver. That means, we need a single process with multiple contexts (and address spaces). One proxy process on the host must support multiple guest address spaces. I don't know much more than these very high level requirements, and I only found out about those a few weeks ago. Due to my own bias I can't comment whether there are bad design choices in the proxy architecture or in KFD or both. The way we are considering fixing this, is to enable creating multiple KFD contexts in the same process. Each of those contexts will still represent a shared virtual address space across devices (but not the CPU). Because the device address space is not shared with the CPU, we cannot support our SVM API in this situation. I still believe that it makes sense to have the kernel mode driver aware of a shared virtual address space at some level. A per-GPU API and an API that doesn't require matching CPU and GPU virtual addresses would enable more flexibility at the cost duplicate information tracking for multiple devices and duplicate overhead for things like MMU notifiers and interval tree data structures. Having to coordinate multiple devices with potentially different address spaces would probably make it more awkward to implement memory migration. The added flexibility would go mostly unused, except in some very niche applications. Regards, Felix For us, Xekmd doesn't need to know it is running under bare metal or virtualized environment. Xekmd is always a guest driver. All the virtual address used in xekmd is guest virtual address. For SVM, we require all the VF devices share one single shared address space with guest CPU program. So all the design works in bare metal environment can automatically work under virtualized environment. +@Shah, Ankur N <mailto:ankur.n.s...@intel.com> +@Winiarski, Michal <mailto:michal.winiar...@intel.com> to backup me if I am wrong. Again, shared virtual address space b/t cpu and all gpu devices is a hard requirement for our system allocator design (which means malloc’ed memory, cpu stack variables, globals can be directly used in gpu program. Same requirement as kfd SVM design). This was aligned with our user space software stack. For anyone who want to implement system allocator, or SVM, this is a hard requirement. I started this thread hoping I can leverage the drm_gpuvm design to manage the shared virtual address space (as the address range split/merge function was scary to me and I didn’t want re-invent). I guess my takeaway from this you and Danilo is this approach is a NAK. Thomas also mentioned to me drm_gpuvm is a overkill for our svm address range split/merge. So I will make things work first by manage address range xekmd internally. I can re-look drm-gpuvm approach in the future. Maybe a pseudo user program can illustrate our programming model: Fd0 = open(card0) Fd1 = open(card1) Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's first vm_create Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called from same process Queue0 = xe_exec_queue_create(fd0, vm0) Queue1 = xe_exec_queue_create(fd1, vm1) //check p2p capability calling L0 API…. ptr = malloc()//this replace bo_create, vm_bind, dma-import/export Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0 Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1 //Gpu page fault handles memory allocation/migration/mapping to gpu As you can see, from above model, our design is a little bit different than the KFD design. user need to explicitly create gpuvm (vm0 and vm1 above) for each gpu device. Driver internally have a xe_svm represent the shared address space b/t cpu and multiple gpu devices. But end user doesn’t see and no need to create xe_svm. The shared virtual address space is really managed by linux core mm (through the vma struct, mm_struct etc). From each gpu device’s perspective, it just operate under its own gpuvm, not aware of the existence of other gpuvm, even though in reality all those gpuvm shares a same virtual address space. See one more comment inline *From:*Christian König *Sent:* Wednesday, January 24, 2024 3:33 AM *To:* Zeng, Oak ; Danilo Krummrich ; Dave Airlie ; Daniel Vetter ; Felix Kuehling *Cc:* Welty, Brian ; dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; Bommu, Krishnaiah ; Ghimiray, Himal Prasad ; thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana ; Brost, Matthew ; Gupta, saurabhg *Subject:*
Re: [bug report] drm/amdkfd: Export DMABufs from KFD using GEM handles
On 2024-01-23 5:21, Dan Carpenter wrote: Hello Felix Kuehling, The patch 1819200166ce: "drm/amdkfd: Export DMABufs from KFD using GEM handles" from Aug 24, 2023 (linux-next), leads to the following Smatch static checker warning: drivers/dma-buf/dma-buf.c:729 dma_buf_get() warn: fd used after fd_install() 'fd' drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 809 static int kfd_mem_export_dmabuf(struct kgd_mem *mem) 810 { 811 if (!mem->dmabuf) { 812 struct amdgpu_device *bo_adev; 813 struct dma_buf *dmabuf; 814 int r, fd; 815 816 bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); 817 r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, 818 mem->gem_handle, 819 mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? 820 DRM_RDWR : 0, &fd); ^^^ The drm_gem_prime_handle_to_fd() function does an fd_install() and returns the result as "fd". 821 if (r) 822 return r; 823 dmabuf = dma_buf_get(fd); ^^ Then we do another fget() inside dma_buf_get(). I'm not an expert, but this looks wrong. We can't assume that the dmabuf here is the same one from drm_gem_prime_handle_to_fd() because the user could change it after the fd_install(). I suspect drm_gem_prime_handle_to_fd() should pass the dmabuf back instead. We had several CVEs similar to this such as CVE-2022-1998. That CVE is a system crash. I don't think that can happen here. It's true that user mode can close the fd. But then dma_buf_get would return an error that we'd catch with "WARN_ON_ONCE(IS_ERR(dmabuf))" below. It's possible that a different DMABuf gets bound to the fd by a malicious user mode. That said, we're treating this fd as if it had come from user mode. dma_buf_get and the subsequent check on the dmabuf should be robust against any user mode messing with the file descriptor in the meantime. Regards, Felix 824 close_fd(fd); 825 if (WARN_ON_ONCE(IS_ERR(dmabuf))) 826 return PTR_ERR(dmabuf); 827 mem->dmabuf = dmabuf; 828 } 829 830 return 0; 831 } regards, dan carpenter
Re: Making drm_gpuvm work across gpu devices
On 2024-01-23 14:37, Zeng, Oak wrote: Thanks Christian. I have some comment inline below. Danilo, can you also take a look and give your feedback? Thanks. Sorry, just catching up with this thread now. I'm also not familiar with drm_gpuvm. Some general observations based on my experience with KFD, amdgpu and SVM. With SVM we have a single virtual address space managed in user mode (basically using mmap) with attributes per virtual address range maintained in the kernel mode driver. Different devices get their mappings of pieces of that address space using the same virtual addresses. We also support migration to different DEVICE_PRIVATE memory spaces. However, we still have page tables managed per device. Each device can have different page table formats and layout (e.g. different GPU generations in the same system) and the same memory may be mapped with different flags on different devices in order to get the right coherence behaviour. We also need to maintain per-device DMA mappings somewhere. That means, as far as the device page tables are concerned, we still have separate address spaces. SVM only adds a layer on top, which coordinates these separate device virtual address spaces so that some parts of them provide the appearance of a shared virtual address space. At some point you need to decide, where you draw the boundary between managing a per-process shared virtual address space and managing per-device virtual address spaces. In amdgpu that boundary is currently where kfd_svm code calls amdgpu_vm code to manage the per-device page tables. In the amdgpu driver, we still have the traditional memory management APIs in the render nodes that don't do SVM. They share the device virtual address spaces with SVM. We have to be careful that we don't try to manage the same device virtual address ranges with these two different memory managers. In practice, we let the non-SVM APIs use the upper half of the canonical address space, while the lower half can be used almost entirely for SVM. Regards, Felix -Original Message- From: Christian König Sent: Tuesday, January 23, 2024 6:13 AM To: Zeng, Oak ; Danilo Krummrich ; Dave Airlie ; Daniel Vetter Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; intel- x...@lists.freedesktop.org; Bommu, Krishnaiah ; Ghimiray, Himal Prasad ; thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana ; Brost, Matthew Subject: Re: Making drm_gpuvm work across gpu devices Hi Oak, Am 23.01.24 um 04:21 schrieb Zeng, Oak: Hi Danilo and all, During the work of Intel's SVM code, we came up the idea of making drm_gpuvm to work across multiple gpu devices. See some discussion here: https://lore.kernel.org/dri- devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd 11.prod.outlook.com/ The reason we try to do this is, for a SVM (shared virtual memory across cpu program and all gpu program on all gpu devices) process, the address space has to be across all gpu devices. So if we make drm_gpuvm to work across devices, then our SVM code can leverage drm_gpuvm as well. At a first look, it seems feasible because drm_gpuvm doesn't really use the drm_device *drm pointer a lot. This param is used only for printing/warning. So I think maybe we can delete this drm field from drm_gpuvm. This way, on a multiple gpu device system, for one process, we can have only one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for each gpu device). What do you think? Well from the GPUVM side I don't think it would make much difference if we have the drm device or not. But the experience we had with the KFD I think I should mention that we should absolutely *not* deal with multiple devices at the same time in the UAPI or VM objects inside the driver. The background is that all the APIs inside the Linux kernel are build around the idea that they work with only one device at a time. This accounts for both low level APIs like the DMA API as well as pretty high level things like for example file system address space etc... Yes most API are per device based. One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices. Cc Felix. Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc). We have the same design requirement of SVM. For anyone who want to implement the SVM concept, this is a hard requirement. Since now drm has the drm_gpuvm concept which strictly speaking is designed for one device, I want to see whether we can extend drm_gpuvm to make it work for both single device (as used in
Re: [pull] amdgpu, amdkfd drm-fixes-6.8
On 2024-01-15 17:08, Alex Deucher wrote: Hi Dave, Sima, Same PR as Friday, but with the new clang warning fixed. The following changes since commit e54478fbdad20f2c58d0a4f99d01299ed8e7fe9c: Merge tag 'amd-drm-next-6.8-2024-01-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-next (2024-01-09 09:07:50 +1000) are available in the Git repository at: https://gitlab.freedesktop.org/agd5f/linux.git tags/amd-drm-fixes-6.8-2024-01-15 for you to fetch changes up to 86718cf93237a0f45773bbc49b6006733fc4e051: drm/amd/display: Avoid enum conversion warning (2024-01-15 16:28:05 -0500) amd-drm-fixes-6.8-2024-01-15: amdgpu: - SubVP fixes - VRR fixes - USB4 fixes - DCN 3.5 fixes - GFX11 harvesting fix - RAS fixes - Misc small fixes - KFD dma-buf import fixes - Power reporting fixes - ATHUB 3.3 fix - SR-IOV fix - Add missing fw release for fiji - GFX 11.5 fix - Debugging module parameter fix - SMU 13.0.6 fixes - Fix new clang warning amdkfd: - Fix lockdep warnings - Fix sparse __rcu warnings - Bump interface version so userspace knows that the kernel supports dma-bufs exported from KFD Most of the fixes for this went into 6.7, but the last fix is in this PR - HMM fix - SVM fix Alex Deucher (4): drm/amdgpu: fix avg vs input power reporting on smu7 drm/amdgpu: fall back to INPUT power for AVG power via INFO IOCTL drm/amdgpu/pm: clarify debugfs pm output drm/amdgpu: drop exp hw support check for GC 9.4.3 Aric Cyr (1): drm/amd/display: 3.2.266 Candice Li (2): drm/amdgpu: Drop unnecessary sentences about CE and deferred error. drm/amdgpu: Support poison error injection via ras_ctrl debugfs Charlene Liu (1): drm/amd/display: Update z8 latency Dafna Hirschfeld (1): drm/amdkfd: fixes for HMM mem allocation Daniel Miess (1): Revert "drm/amd/display: Fix conversions between bytes and KB" Felix Kuehling (4): drm/amdkfd: Fix lock dependency warning drm/amdkfd: Fix sparse __rcu annotation warnings drm/amdgpu: Auto-validate DMABuf imports in compute VMs drm/amdkfd: Bump KFD ioctl version I don't think the last two patches should be on -fixes. This is really for a new interoperability feature and not meant to fix other functionality that was already expected to work. That's why the user mode code checks for the API version. Regards, Felix George Shen (1): drm/amd/display: Disconnect phantom pipe OPP from OPTC being disabled Hawking Zhang (1): drm/amdgpu: Packed socket_id to ras feature mask Ivan Lipski (1): Revert "drm/amd/display: fix bandwidth validation failure on DCN 2.1" James Zhu (1): drm/amdgpu: make a correction on comment Le Ma (3): Revert "drm/amdgpu: add param to specify fw bo location for front-door loading" drm/amdgpu: add debug flag to place fw bo on vram for frontdoor loading drm/amdgpu: move debug options init prior to amdgpu device init Lijo Lazar (2): drm/amd/pm: Add error log for smu v13.0.6 reset drm/amd/pm: Fix smuv13.0.6 current clock reporting Likun Gao (1): drm/amdgpu: correct the cu count for gfx v11 Martin Leung (2): drm/amd/display: revert "for FPO & SubVP/DRR config program vmin/max" drm/amd/display: revert "Optimize VRR updates to only necessary ones" Martin Tsai (1): drm/amd/display: To adjust dprefclk by down spread percentage Meenakshikumar Somasundaram (1): drm/amd/display: Dpia hpd status not in sync after S4 Melissa Wen (1): drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color Nathan Chancellor (1): drm/amd/display: Avoid enum conversion warning Peichen Huang (1): drm/amd/display: Request usb4 bw for mst streams Philip Yang (1): drm/amdkfd: Fix lock dependency warning with srcu Srinivasan Shanmugam (6): drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 'get_platform_power_management_table()' drm/amdgpu: Fix with right return code '-EIO' in 'amdgpu_gmc_vram_checking()' drm/amdgpu: Fix unsigned comparison with less than zero in vpe_u1_8_from_fraction() drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()' drm/amd/display: Fix variable deferencing before NULL check in edp_setup_replay() drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()' Victor Lu (1): drm/amdgpu: Do not program VM_L2_CNTL under SRIOV Yifan Zhang (3): drm/amdgpu: update headers for nbio v7.11 drm/amdgpu: update ATHUB_MISC_CNTL offset for athub v3.3 drm/amdgpu: update regGL2C_CTRL4 value in golden sett
Re: Proposal to add CRIU support to DRM render nodes
I haven't seen any replies to this proposal. Either it got lost in the pre-holiday noise, or there is genuinely no interest in this. If it's the latter, I would look for an AMDGPU driver-specific solution with minimally invasive changes in DRM and DMABuf code, if needed. Maybe it could be generalized later if there is interest then. Regards, Felix On 2023-12-06 16:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI. Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data. # Second API for creating objects Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic a
Re: [PATCH v2] drm/amdkfd: fixes for HMM mem allocation
On 2024-01-07 08:07, Dafna Hirschfeld wrote: Fix err return value and reset pgmap->type after checking it. Fixes: c83dee9b6394 ("drm/amdkfd: add SPM support for SVM") Reviewed-by: Felix Kuehling Signed-off-by: Dafna Hirschfeld --- v2: remove unrelated DOC fix and add 'Fixes' tag. Thank you. I'm going to apply this patch to amd-staging-drm-next. Regards, Felix drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 6c25dab051d5..b8680e0753ca 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -1021,7 +1021,7 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev) } else { res = devm_request_free_mem_region(adev->dev, &iomem_resource, size); if (IS_ERR(res)) - return -ENOMEM; + return PTR_ERR(res); pgmap->range.start = res->start; pgmap->range.end = res->end; pgmap->type = MEMORY_DEVICE_PRIVATE; @@ -1037,10 +1037,10 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev) r = devm_memremap_pages(adev->dev, pgmap); if (IS_ERR(r)) { pr_err("failed to register HMM device memory\n"); - /* Disable SVM support capability */ - pgmap->type = 0; if (pgmap->type == MEMORY_DEVICE_PRIVATE) devm_release_mem_region(adev->dev, res->start, resource_size(res)); + /* Disable SVM support capability */ + pgmap->type = 0; return PTR_ERR(r); }
Re: [PATCH] drm/amdkfd: fixes for HMM mem allocation
On 2023-12-31 09:37, Dafna Hirschfeld wrote: Few fixes to amdkfd and the doc of devm_request_free_mem_region. Signed-off-by: Dafna Hirschfeld --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 6 +++--- kernel/resource.c| 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 6c25dab051d5..b8680e0753ca 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -1021,7 +1021,7 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev) } else { res = devm_request_free_mem_region(adev->dev, &iomem_resource, size); if (IS_ERR(res)) - return -ENOMEM; + return PTR_ERR(res); pgmap->range.start = res->start; pgmap->range.end = res->end; pgmap->type = MEMORY_DEVICE_PRIVATE; @@ -1037,10 +1037,10 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev) r = devm_memremap_pages(adev->dev, pgmap); if (IS_ERR(r)) { pr_err("failed to register HMM device memory\n"); - /* Disable SVM support capability */ - pgmap->type = 0; if (pgmap->type == MEMORY_DEVICE_PRIVATE) devm_release_mem_region(adev->dev, res->start, resource_size(res)); + /* Disable SVM support capability */ + pgmap->type = 0; Ooff, thanks for catching that. For the KFD driver changes you can add Fixes: c83dee9b6394 ("drm/amdkfd: add SPM support for SVM") Reviewed-by: Felix Kuehling return PTR_ERR(r); } diff --git a/kernel/resource.c b/kernel/resource.c index 866ef3663a0b..fe890b874606 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1905,8 +1905,8 @@ get_free_mem_region(struct device *dev, struct resource *base, * devm_request_free_mem_region - find free region for device private memory * * @dev: device struct to bind the resource to - * @size: size in bytes of the device memory to add * @base: resource tree to look in + * @size: size in bytes of the device memory to add * * This function tries to find an empty range of physical address big enough to * contain the new resource, so that it can later be hotplugged as ZONE_DEVICE
Re: [PATCH v3 2/2] drm/amdgpu: Enable clear page functionality
On 2023-12-14 08:42, Arunpravin Paneer Selvam wrote: Add clear page support in vram memory region. v1:(Christian) - Dont handle clear page as TTM flag since when moving the BO back in from GTT again we don't need that. - Make a specialized version of amdgpu_fill_buffer() which only clears the VRAM areas which are not already cleared - Drop the TTM_PL_FLAG_WIPE_ON_RELEASE check in amdgpu_object.c v2: - Modify the function name amdgpu_ttm_* (Alex) - Drop the delayed parameter (Christian) - handle amdgpu_res_cleared(&cursor) just above the size calculation (Christian) - Use AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE for clearing the buffers in the free path to properly wait for fences etc.. (Christian) v3:(Christian) - Remove buffer clear code in VRAM manager instead change the AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE handling to set the DRM_BUDDY_CLEARED flag. - Remove ! from amdgpu_res_cleared(&cursor) check. Signed-off-by: Arunpravin Paneer Selvam Suggested-by: Christian König Acked-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c| 22 --- .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h| 25 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 61 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 5 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 6 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h | 5 ++ 6 files changed, 111 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index cef920a93924..be8bf375d823 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -39,6 +39,7 @@ #include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h" +#include "amdgpu_vram_mgr.h" /** * DOC: amdgpu_object @@ -598,8 +599,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev, if (!amdgpu_bo_support_uswc(bo->flags)) bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC; - if (adev->ras_enabled) - bo->flags |= AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE; + bo->flags |= AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE; bo->tbo.bdev = &adev->mman.bdev; if (bp->domain & (AMDGPU_GEM_DOMAIN_GWS | AMDGPU_GEM_DOMAIN_OA | @@ -629,15 +629,17 @@ int amdgpu_bo_create(struct amdgpu_device *adev, if (bp->flags & AMDGPU_GEM_CREATE_VRAM_CLEARED && bo->tbo.resource->mem_type == TTM_PL_VRAM) { - struct dma_fence *fence; + struct dma_fence *fence = NULL; - r = amdgpu_fill_buffer(bo, 0, bo->tbo.base.resv, &fence, true); + r = amdgpu_ttm_clear_buffer(bo, bo->tbo.base.resv, &fence); if (unlikely(r)) goto fail_unreserve; - dma_resv_add_fence(bo->tbo.base.resv, fence, - DMA_RESV_USAGE_KERNEL); - dma_fence_put(fence); + if (fence) { + dma_resv_add_fence(bo->tbo.base.resv, fence, + DMA_RESV_USAGE_KERNEL); + dma_fence_put(fence); + } } if (!bp->resv) amdgpu_bo_unreserve(bo); @@ -1360,8 +1362,12 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv))) return; - r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence, true); + r = amdgpu_fill_buffer(abo, 0, bo->base.resv, &fence, true); if (!WARN_ON(r)) { + struct amdgpu_vram_mgr_resource *vres; + + vres = to_amdgpu_vram_mgr_resource(bo->resource); + vres->flags |= DRM_BUDDY_CLEARED; amdgpu_bo_fence(abo, fence, false); dma_fence_put(fence); } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h index 381101d2bf05..50fcd86e1033 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h @@ -164,4 +164,29 @@ static inline void amdgpu_res_next(struct amdgpu_res_cursor *cur, uint64_t size) } } +/** + * amdgpu_res_cleared - check if blocks are cleared + * + * @cur: the cursor to extract the block + * + * Check if the @cur block is cleared + */ +static inline bool amdgpu_res_cleared(struct amdgpu_res_cursor *cur) +{ + struct drm_buddy_block *block; + + switch (cur->mem_type) { + case TTM_PL_VRAM: + block = cur->node; + + if (!amdgpu_vram_mgr_is_cleared(block)) + return false; + break; + default: + return false;
Re: [PATCH 1/2] drm: update drm_show_memory_stats() for dma-bufs
On 2023-12-07 13:02, Alex Deucher wrote: Show buffers as shared if they are shared via dma-buf as well (e.g., shared with v4l or some other subsystem). You can add KFD to that list. With the in-progress CUDA11 VM changes and improved interop between KFD and render nodes, sharing DMABufs between KFD and render nodes will become much more common. Regards, Felix Signed-off-by: Alex Deucher Cc: Rob Clark --- drivers/gpu/drm/drm_file.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c index 5ddaffd32586..5d5f93b9c263 100644 --- a/drivers/gpu/drm/drm_file.c +++ b/drivers/gpu/drm/drm_file.c @@ -973,7 +973,7 @@ void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file) DRM_GEM_OBJECT_PURGEABLE; } - if (obj->handle_count > 1) { + if ((obj->handle_count > 1) || obj->dma_buf) { status.shared += obj->size; } else { status.private += obj->size;
Re: [PATCH 2/2] drm/amdgpu: Enable clear page functionality
On 2023-12-13 9:20, Christian König wrote: Am 12.12.23 um 00:32 schrieb Felix Kuehling: On 2023-12-11 04:50, Christian König wrote: Am 08.12.23 um 20:53 schrieb Alex Deucher: [SNIP] You also need a functionality which resets all cleared blocks to uncleared after suspend/resume. No idea how to do this, maybe Alex knows of hand. Since the buffers are cleared on creation, is there actually anything to do? Well exactly that's the problem, the buffers are no longer always cleared on creation with this patch. Instead we clear on free, track which areas are cleared and clear only the ones which aren't cleared yet on creation. The code I added for clearing-on-free a long time ago, does not clear to 0, but to a non-0 poison value. That was meant to make it easier to catch applications incorrectly relying on 0-initialized memory. Is that being changed? I didn't see it in this patch series. Yeah, Arun stumbled over that as well. Any objections that we fill with zeros instead or is that poison value something necessary for debugging? I don't think it's strictly necessary. But it may encourage sloppy user mode programming to rely on 0-initialized memory that ends up breaking in corner cases or on older kernels. That said, I see that this patch series adds clearing of memory in the VRAM manager, but it doesn't remove the clearing for AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE in amdgpu_bo_release_notify and amdgpu_move_blit. This will lead to duplicate work. I'm also not sure how the clearing added in this patch series will affect free-latency observed in user mode. Will this be synchronous and cause the user mode thread to stall while the memory is being cleared? Regards, Felix Regards, Christian. Regards, Felix So some cases need special handling. E.g. when the engine is not initialized yet or suspend/resume. In theory after a suspend/resume cycle the VRAM is cleared to zeros, but in practice that's not always true. Christian. Alex
Re: [PATCH 2/2] drm/amdgpu: Enable clear page functionality
On 2023-12-11 04:50, Christian König wrote: Am 08.12.23 um 20:53 schrieb Alex Deucher: [SNIP] You also need a functionality which resets all cleared blocks to uncleared after suspend/resume. No idea how to do this, maybe Alex knows of hand. Since the buffers are cleared on creation, is there actually anything to do? Well exactly that's the problem, the buffers are no longer always cleared on creation with this patch. Instead we clear on free, track which areas are cleared and clear only the ones which aren't cleared yet on creation. The code I added for clearing-on-free a long time ago, does not clear to 0, but to a non-0 poison value. That was meant to make it easier to catch applications incorrectly relying on 0-initialized memory. Is that being changed? I didn't see it in this patch series. Regards, Felix So some cases need special handling. E.g. when the engine is not initialized yet or suspend/resume. In theory after a suspend/resume cycle the VRAM is cleared to zeros, but in practice that's not always true. Christian. Alex
Proposal to add CRIU support to DRM render nodes
Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI. Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data. # Second API for creating objects Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic attachment, we need to restore pinned BOs as pinned. Validation of virtual addresses and handling MMU notifiers must be suspended until the virtual address space is restored. For user mode queues we need to save and restore a lot of queue execution state so that execution can resume cleanly. # Restoring buffer sharing relationships Different GEM handles in different render nodes and processes can refer to the same underlying shared memory, either b
Re: [PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle conversion"
On 2023-12-04 12:49, Deucher, Alexander wrote: [AMD Official Use Only - General] -Original Message- From: Kuehling, Felix Sent: Friday, December 1, 2023 6:40 PM To: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Deucher, Alexander Cc: Daniel Vetter ; Koenig, Christian ; Thomas Zimmermann Subject: Re: [PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle conversion" Hi Alex, I'm about to push patches 1-3 to the rebased amd-staging-drm-next. It would be good to get patch 1 into drm-fixes so that Linux 6.6 will be the only kernel without these prime helpers. That would minimize the hassle for DKMS driver installations on future distros. Already done: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0514f63cfff38a0dcb7ba9c5f245827edc0c5107 Thank you, all! I also saw Sasha Levin is backporting it to 6.6. Cheers, Felix Alex Thanks, Felix On 2023-12-01 18:34, Felix Kuehling wrote: This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. Acked-by: Christian König Acked-by: Thomas Zimmermann Acked-by: Daniel Vetter Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, + struct drm_file *file_priv, uint32_t handle, + uint32_t flags, + int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export + functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, int flags) @@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev); * @dev: drm_device to import into * @dma_bu
Re: [PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle conversion"
Hi Alex, I'm about to push patches 1-3 to the rebased amd-staging-drm-next. It would be good to get patch 1 into drm-fixes so that Linux 6.6 will be the only kernel without these prime helpers. That would minimize the hassle for DKMS driver installations on future distros. Thanks, Felix On 2023-12-01 18:34, Felix Kuehling wrote: This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. Acked-by: Christian König Acked-by: Thomas Zimmermann Acked-by: Daniel Vetter Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, + struct drm_file *file_priv, uint32_t handle, + uint32_t flags, + int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, int flags) @@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev); * @dev: drm_device to import into * @dma_buf: dma-buf object to import * - * This is the implementation of the gem_prime_import functions for GEM - * drivers using the PRIME helpers. It is the default for drivers that do - * not set their own &drm_driver.gem_prime_import. + * This is the implementation of the gem_prime_import functions for GEM drivers + * using the PRIME helpers. Drivers can use this as their + * &drm_driver.gem_prime_import implementation. It is used as the default + * implementation in drm_gem_prime_fd_to_handle(). * * Drivers must arrange to call drm_prime_gem_destroy() from their * &drm_ge
[PATCH 3/6] drm/amdkfd: Import DMABufs for interop through DRM
Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This ensures that a GEM handle is created on import and that obj->dma_buf will be set and remain set as long as the object is imported into KFD. Signed-off-by: Felix Kuehling Reviewed-by: Ramesh Errabolu Reviewed-by: Xiaogang.Chen Acked-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 9 ++- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 64 +-- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15 ++--- 3 files changed, 52 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index 02973f5c8caf..cf6ed5fce291 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -314,11 +314,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info, struct dma_fence **ef); int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, struct kfd_vm_fault_info *info); -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dmabuf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset); +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset); int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem, struct dma_buf **dmabuf); void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index ae7dfaf59159..48697b789342 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -1956,8 +1956,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu( /* Free the BO*/ drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv); - if (!mem->is_imported) - drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); + drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); if (mem->dmabuf) { dma_buf_put(mem->dmabuf); mem->dmabuf = NULL; @@ -2313,34 +2312,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, return 0; } -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dma_buf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset) +static int import_obj_create(struct amdgpu_device *adev, +struct dma_buf *dma_buf, +struct drm_gem_object *obj, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) { struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv); - struct drm_gem_object *obj; struct amdgpu_bo *bo; int ret; - obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf); - if (IS_ERR(obj)) - return PTR_ERR(obj); - bo = gem_to_amdgpu_bo(obj); if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM | - AMDGPU_GEM_DOMAIN_GTT))) { + AMDGPU_GEM_DOMAIN_GTT))) /* Only VRAM and GTT BOs are supported */ - ret = -EINVAL; - goto err_put_obj; - } + return -EINVAL; *mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL); - if (!*mem) { - ret = -ENOMEM; - goto err_put_obj; - } + if (!*mem) + return -ENOMEM; ret = drm_vma_node_allow(&obj->vma_node, drm_priv); if (ret) @@ -2390,8 +2381,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, drm_vma_node_revoke(&obj->vma_node, drm_priv); err_free_mem: kfree(*mem); + return ret; +} + +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) +{ + struct drm_gem_object *ob
[PATCH 2/6] drm/amdkfd: Export DMABufs from KFD using GEM handles
Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM handles are created in a drm_client_dev context to avoid exposing them in user mode contexts through a DMABuf import. Signed-off-by: Felix Kuehling Reviewed-by: Ramesh Errabolu --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 5 +++ .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 33 +++ drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 +-- 4 files changed, 44 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index 2d22f7d45512..067690ba7bff 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) { int i; int last_valid_bit; + int ret; amdgpu_amdkfd_gpuvm_init_mem_limits(); @@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) .enable_mes = adev->enable_mes, }; + ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL); + if (ret) { + dev_err(adev->dev, "Failed to init DRM client: %d\n", ret); + return; + } + /* this is going to have a few of the MSBs set that we need to * clear */ @@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev, &gpu_resources); + if (adev->kfd.init_complete) + drm_client_register(&adev->kfd.client); + else + drm_client_release(&adev->kfd.client); amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index 16794c2eea35..02973f5c8caf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -33,6 +33,7 @@ #include #include #include +#include #include "amdgpu_sync.h" #include "amdgpu_vm.h" #include "amdgpu_xcp.h" @@ -83,6 +84,7 @@ struct kgd_mem { struct amdgpu_sync sync; + uint32_t gem_handle; bool aql_queue; bool is_imported; }; @@ -105,6 +107,9 @@ struct amdgpu_kfd_dev { /* HMM page migration MEMORY_DEVICE_PRIVATE mapping */ struct dev_pagemap pgmap; + + /* Client for KFD BO GEM handle allocations */ + struct drm_client_dev client; }; enum kgd_engine_type { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 73288f9ccaf8..ae7dfaf59159 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -806,13 +807,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem, static int kfd_mem_export_dmabuf(struct kgd_mem *mem) { if (!mem->dmabuf) { - struct dma_buf *ret = amdgpu_gem_prime_export( - &mem->bo->tbo.base, + struct amdgpu_device *bo_adev; + struct dma_buf *dmabuf; + int r, fd; + + bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); + r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, + mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0); - if (IS_ERR(ret)) - return PTR_ERR(ret); - mem->dmabuf = ret; + DRM_RDWR : 0, &fd); + if (r) + return r; + dmabuf = dma_buf_get(fd); + close_fd(fd); + if (WARN_ON_ONCE(IS_ERR(dmabuf))) + return PTR_ERR(dmabuf); + mem->dmabuf = dmabuf; } return 0; @@ -1778,6 +1788,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( pr_debug("Failed to allow vma node access. ret %d\n", ret); goto err_node_allow; } + ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle); + if (ret) + goto err_gem_handle_create; bo = gem_to_amdgpu_bo(gobj); if (bo_type == ttm_bo_type_sg) { bo->tb
[PATCH 5/6] drm/amdgpu: Auto-validate DMABuf imports in compute VMs
DMABuf imports in compute VMs are not wrapped in a kgd_mem object on the process_info->kfd_bo_list. There is no explicit KFD API call to validate them or add eviction fences to them. This patch automatically validates and fences dymanic DMABuf imports when they are added to a compute VM. Revalidation after evictions is handled in the VM code. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 3 + .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 45 --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 6 +- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 4 + drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 26 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 122 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 12 +- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 12 +- 8 files changed, 196 insertions(+), 34 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index cf6ed5fce291..f2e920734c98 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -182,6 +182,9 @@ int amdgpu_queue_mask_bit_to_set_resource_bit(struct amdgpu_device *adev, struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context, struct mm_struct *mm, struct svm_range_bo *svm_bo); +int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo, + uint32_t domain, + struct dma_fence *fence); #if defined(CONFIG_DEBUG_FS) int kfd_debugfs_kfd_mem_limits(struct seq_file *m, void *data); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 48697b789342..7d91f99acb59 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -426,9 +426,9 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain, return ret; } -static int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo, - uint32_t domain, - struct dma_fence *fence) +int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo, + uint32_t domain, + struct dma_fence *fence) { int ret = amdgpu_bo_reserve(bo, false); @@ -464,13 +464,16 @@ static int amdgpu_amdkfd_validate_vm_bo(void *_unused, struct amdgpu_bo *bo) * again. Page directories are only updated after updating page * tables. */ -static int vm_validate_pt_pd_bos(struct amdgpu_vm *vm) +static int vm_validate_pt_pd_bos(struct amdgpu_vm *vm, +struct ww_acquire_ctx *ticket) { struct amdgpu_bo *pd = vm->root.bo; struct amdgpu_device *adev = amdgpu_ttm_adev(pd->tbo.bdev); int ret; - ret = amdgpu_vm_validate_pt_bos(adev, vm, amdgpu_amdkfd_validate_vm_bo, NULL); + ret = amdgpu_vm_validate_evicted_bos(adev, vm, ticket, +amdgpu_amdkfd_validate_vm_bo, +NULL); if (ret) { pr_err("failed to validate PT BOs\n"); return ret; @@ -1310,14 +1313,15 @@ static int map_bo_to_gpuvm(struct kgd_mem *mem, return ret; } -static int process_validate_vms(struct amdkfd_process_info *process_info) +static int process_validate_vms(struct amdkfd_process_info *process_info, + struct ww_acquire_ctx *ticket) { struct amdgpu_vm *peer_vm; int ret; list_for_each_entry(peer_vm, &process_info->vm_list_head, vm_list_node) { - ret = vm_validate_pt_pd_bos(peer_vm); + ret = vm_validate_pt_pd_bos(peer_vm, ticket); if (ret) return ret; } @@ -1402,7 +1406,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info, ret = amdgpu_bo_reserve(vm->root.bo, true); if (ret) goto reserve_pd_fail; - ret = vm_validate_pt_pd_bos(vm); + ret = vm_validate_pt_pd_bos(vm, NULL); if (ret) { pr_err("validate_pt_pd_bos() failed\n"); goto validate_pd_fail; @@ -2043,7 +2047,7 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu( bo->tbo.resource->mem_type == TTM_PL_SYSTEM) is_invalid_userptr = true; - ret = vm_validate_pt_pd_bos(avm); + ret = vm_validate_pt_pd_bos(avm, NULL); if (unlikely(ret)) goto out_unreserve; @@ -2122,7 +2126,7 @@ int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu( goto unreserve_out; } - ret = vm_validate_pt_pd_
[PATCH 6/6] drm/amdkfd: Bump KFD ioctl version
This is not strictly a change in the IOCTL API. This version bump is meant to indicate to user mode the presence of a number of changes and fixes that enable the management of VA mappings in compute VMs using the GEM_VA ioctl for DMABufs exported from KFD. Signed-off-by: Felix Kuehling --- include/uapi/linux/kfd_ioctl.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h index f0ed68974c54..9ce46edc62a5 100644 --- a/include/uapi/linux/kfd_ioctl.h +++ b/include/uapi/linux/kfd_ioctl.h @@ -40,9 +40,10 @@ * - 1.12 - Add DMA buf export ioctl * - 1.13 - Add debugger API * - 1.14 - Update kfd_event_data + * - 1.15 - Enable managing mappings in compute VMs with GEM_VA ioctl */ #define KFD_IOCTL_MAJOR_VERSION 1 -#define KFD_IOCTL_MINOR_VERSION 14 +#define KFD_IOCTL_MINOR_VERSION 15 struct kfd_ioctl_get_version_args { __u32 major_version;/* from KFD */ -- 2.34.1
[PATCH 4/6] drm/amdgpu: New VM state for evicted user BOs
Create a new VM state to track user BOs that are in the system domain. In the next patch this will be used do conditionally re-validate them in amdgpu_vm_handle_moved. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 + drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 5 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 7da71b6a9dc6..a748e17ff031 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -233,6 +233,22 @@ static void amdgpu_vm_bo_invalidated(struct amdgpu_vm_bo_base *vm_bo) spin_unlock(&vm_bo->vm->status_lock); } +/** + * amdgpu_vm_bo_evicted_user - vm_bo is evicted + * + * @vm_bo: vm_bo which is evicted + * + * State for BOs used by user mode queues which are not at the location they + * should be. + */ +static void amdgpu_vm_bo_evicted_user(struct amdgpu_vm_bo_base *vm_bo) +{ + vm_bo->moved = true; + spin_lock(&vm_bo->vm->status_lock); + list_move(&vm_bo->vm_status, &vm_bo->vm->evicted_user); + spin_unlock(&vm_bo->vm->status_lock); +} + /** * amdgpu_vm_bo_relocated - vm_bo is reloacted * @@ -2195,6 +2211,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, for (i = 0; i < AMDGPU_MAX_VMHUBS; i++) vm->reserved_vmid[i] = NULL; INIT_LIST_HEAD(&vm->evicted); + INIT_LIST_HEAD(&vm->evicted_user); INIT_LIST_HEAD(&vm->relocated); INIT_LIST_HEAD(&vm->moved); INIT_LIST_HEAD(&vm->idle); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index b6cd565562ad..9156ed22abe7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h @@ -288,9 +288,12 @@ struct amdgpu_vm { /* Lock to protect vm_bo add/del/move on all lists of vm */ spinlock_t status_lock; - /* BOs who needs a validation */ + /* Per VM and PT BOs who needs a validation */ struct list_headevicted; + /* BOs for user mode queues that need a validation */ + struct list_headevicted_user; + /* PT BOs which relocated and their parent need an update */ struct list_headrelocated; -- 2.34.1
[PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle conversion"
This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. Acked-by: Christian König Acked-by: Thomas Zimmermann Acked-by: Daniel Vetter Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, + struct drm_file *file_priv, uint32_t handle, + uint32_t flags, + int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, int flags) @@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev); * @dev: drm_device to import into * @dma_buf: dma-buf object to import * - * This is the implementation of the gem_prime_import functions for GEM - * drivers using the PRIME helpers. It is the default for drivers that do - * not set their own &drm_driver.gem_prime_import. + * This is the implementation of the gem_prime_import functions for GEM drivers + * using the PRIME helpers. Drivers can use this as their + * &drm_driver.gem_prime_import implementation. It is used as the default + * implementation in drm_gem_prime_fd_to_handle(). * * Drivers must arrange to call drm_prime_gem_destroy() from their * &drm_gem_object_funcs.free hook when using this function. diff --git a/include/drm/drm_prime.h b/include/drm/drm_prime.h index a7abf9f3e697..2a1d01e5b56b 100644 --- a/include/drm/drm_prime.h +++ b/include/drm/drm_prime.h @@ -60,12 +60,19 @@ enum dma_data_direction; struct drm_device; struct drm_gem_object; +struct drm_file; /* core prime functions */ struct dma_buf *drm_gem_dmabuf_expor
Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"
On 2023-11-28 12:22, Alex Deucher wrote: On Thu, Nov 23, 2023 at 6:12 PM Felix Kuehling wrote: [+Alex] On 2023-11-17 16:44, Felix Kuehling wrote: This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. CC: Christian König CC: Thomas Zimmermann Signed-off-by: Felix Kuehling Re: our discussion about v2 of this patch: If this version is acceptable, can I get an R-b or A-b? I would like to get this patch into drm-next as a prerequisite for patches 2 and 3. I cannot submit it to the current amd-staging-drm-next because the patch I'm reverting doesn't exist there yet. Patch 2 and 3 could go into drm-next as well, or go through Alex's amd-staging-drm-next branch once patch 1 is in drm-next. Alex, how do you prefer to coordinate this? I guess ideally this would go through my drm-next tree since your other patches depend on it unless others feel strongly that it should go through drm-misc. Yes, drm-next would work best for applying this patch and the two patches that depend on it. I can send you the rebased patches from my drm-next based branch that I used for testing this. Regards, Felix Alex Regards, Felix --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, +struct drm_file *file_priv, int prime_fd, +uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, +struct drm_file *file_priv, uint32_t handle, +uint32_t flags, +int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, int flags) @@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev); * @dev
Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"
[+Alex] On 2023-11-17 16:44, Felix Kuehling wrote: This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. CC: Christian König CC: Thomas Zimmermann Signed-off-by: Felix Kuehling Re: our discussion about v2 of this patch: If this version is acceptable, can I get an R-b or A-b? I would like to get this patch into drm-next as a prerequisite for patches 2 and 3. I cannot submit it to the current amd-staging-drm-next because the patch I'm reverting doesn't exist there yet. Patch 2 and 3 could go into drm-next as well, or go through Alex's amd-staging-drm-next branch once patch 1 is in drm-next. Alex, how do you prefer to coordinate this? Regards, Felix --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, + struct drm_file *file_priv, uint32_t handle, + uint32_t flags, + int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, int flags) @@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev); * @dev: drm_device to import into * @dma_buf: dma-buf object to import * - * This is the implementation of the gem_prime_import functions for GEM - * drivers using the PRIME helpers. It is the default for drivers that do - * not set their own &drm_driver.gem_prime_import. + * This is the implementation of the gem_prime_import functions for GEM drivers + * using the PRIME helpers. Drivers can use this as their + * &drm_driver.gem_prime_import implem
Re: [PATCH v2 2/4] drm/prime: Helper to export dmabuf without fd
On 2023-11-22 05:32, Thomas Zimmermann wrote: Hi, my apologies if this sounds picky or annoying. This change appears to be going in the wrong direction. The goal of the refactoring is to be able to use drm_driver.gem_prime_import and drm_gem_object_funcs.export for the additional import/export code; and hence keep the GEM object code in a single place. Keeping the prime_fd file descriptor within amdkfd will likely help with that. Here's my suggestion: 1) Please keep the internal interfaces drm_gem_prime_handle_to_fd() and drm_gem_prime_fd_to_handle(). They should be called from the _ioctl entry functions as is. That could be stream-lined in a later patch set. 2) From drm_gem_prime_handle_to_fd() and drm_gem_prime_fd_to_handle(), create drm_gem_prime_handle_to_dmabuf() and drm_gem_prime_dmabuf_to_handle(). Do you mean duplicate the code, or call drm_gem_prime_handle_to_dmabuf from drm_gem_prime_handle_to_fd? They should be exported. You can then keep the file-descriptor code in amdkfd and out of the PRIME helpers. 3) Patches 1 and 2 should be squashed into one. 4) And if I'm not mistaken, the additional import/export code can then go into drm_driver.gem_prime_import and drm_gem_object_funcs.export, which are being called from within the PRIME helpers. I'm not sure what you mean by "additional import/export code" that would move into those driver callbacks. That's admittedly quite a bit of refactoring. OR simply go back to v1 of this patch set, which was consistent at least. I think I'd prefer that because I don't really understand what you're trying to achieve. Thanks, Felix Best regards Thomas Am 22.11.23 um 00:11 schrieb Felix Kuehling: Change drm_gem_prime_handle_to_fd to drm_gem_prime_handle_to_dmabuf to export a dmabuf without creating an FD as a user mode handle. This is more useful for users in kernel mode. Suggested-by: Thomas Zimmermann Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 63 ++--- include/drm/drm_prime.h | 6 ++-- 2 files changed, 33 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 834a5e28abbe..d491b5f73eea 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -410,26 +410,25 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, } /** - * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers + * drm_gem_prime_handle_to_dmabuf - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure * @handle: buffer handle to export * @flags: flags like DRM_CLOEXEC - * @prime_fd: pointer to storage for the fd id of the create dma-buf + * @dma_buf: pointer to storage for the dma-buf reference * * This is the PRIME export function which must be used mandatorily by GEM * drivers to ensure correct lifetime management of the underlying GEM object. * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev, + struct drm_file *file_priv, + uint32_t handle, uint32_t flags) { struct drm_gem_object *obj; int ret = 0; - struct dma_buf *dmabuf; + struct dma_buf *dmabuf = NULL; mutex_lock(&file_priv->prime.lock); obj = drm_gem_object_lookup(file_priv, handle); @@ -441,7 +440,7 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, handle); if (dmabuf) { get_dma_buf(dmabuf); - goto out_have_handle; + goto out; } mutex_lock(&dev->object_name_lock); @@ -479,40 +478,22 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, dmabuf, handle); mutex_unlock(&dev->object_name_lock); if (ret) - goto fail_put_dmabuf; - -out_have_handle: - ret = dma_buf_fd(dmabuf, flags); - /* -* We must _not_ remove the buffer from the handle cache since the newly -* created dma buf is already linked in the global obj->dma_buf pointer, -* and that is invariant as long as a userspace gem handle exists. -* Closing the handle will clean out the cache anyway, so we don't leak. -*/ - if (ret < 0) { - goto fail_put_dmabuf; - } else { - *p
[PATCH v2 4/4] drm/amdkfd: Import DMABufs for interop through DRM
Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This ensures that a GEM handle is created on import and that obj->dma_buf will be set and remain set as long as the object is imported into KFD. Signed-off-by: Felix Kuehling Reviewed-by: Ramesh Errabolu Reviewed-by: Xiaogang.Chen Acked-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 9 ++- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 64 +-- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15 ++--- 3 files changed, 52 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index c1195eb67057..8da42e0dddcb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -319,11 +319,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info, struct dma_fence **ef); int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, struct kfd_vm_fault_info *info); -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dmabuf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset); +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset); int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem, struct dma_buf **dmabuf); void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index e96e1595791f..652657c863ff 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -1953,8 +1953,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu( /* Free the BO*/ drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv); - if (!mem->is_imported) - drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); + drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); if (mem->dmabuf) { dma_buf_put(mem->dmabuf); mem->dmabuf = NULL; @@ -2310,34 +2309,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, return 0; } -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dma_buf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset) +static int import_obj_create(struct amdgpu_device *adev, +struct dma_buf *dma_buf, +struct drm_gem_object *obj, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) { struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv); - struct drm_gem_object *obj; struct amdgpu_bo *bo; int ret; - obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf); - if (IS_ERR(obj)) - return PTR_ERR(obj); - bo = gem_to_amdgpu_bo(obj); if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM | - AMDGPU_GEM_DOMAIN_GTT))) { + AMDGPU_GEM_DOMAIN_GTT))) /* Only VRAM and GTT BOs are supported */ - ret = -EINVAL; - goto err_put_obj; - } + return -EINVAL; *mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL); - if (!*mem) { - ret = -ENOMEM; - goto err_put_obj; - } + if (!*mem) + return -ENOMEM; ret = drm_vma_node_allow(&obj->vma_node, drm_priv); if (ret) @@ -2387,8 +2378,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, drm_vma_node_revoke(&obj->vma_node, drm_priv); err_free_mem: kfree(*mem); + return ret; +} + +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) +{ + struct drm_gem_object *ob
[PATCH v2 1/4] Revert "drm/prime: Unexport helpers for fd/handle conversion"
This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. CC: Christian König CC: Thomas Zimmermann Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, + struct drm_file *file_priv, uint32_t handle, + uint32_t flags, + int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, int flags) @@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev); * @dev: drm_device to import into * @dma_buf: dma-buf object to import * - * This is the implementation of the gem_prime_import functions for GEM - * drivers using the PRIME helpers. It is the default for drivers that do - * not set their own &drm_driver.gem_prime_import. + * This is the implementation of the gem_prime_import functions for GEM drivers + * using the PRIME helpers. Drivers can use this as their + * &drm_driver.gem_prime_import implementation. It is used as the default + * implementation in drm_gem_prime_fd_to_handle(). * * Drivers must arrange to call drm_prime_gem_destroy() from their * &drm_gem_object_funcs.free hook when using this function. diff --git a/include/drm/drm_prime.h b/include/drm/drm_prime.h index a7abf9f3e697..2a1d01e5b56b 100644 --- a/include/drm/drm_prime.h +++ b/include/drm/drm_prime.h @@ -60,12 +60,19 @@ enum dma_data_direction; struct drm_device; struct drm_gem_object; +struct drm_file; /* core prime functions */ struct dma_buf *drm_gem_dmabuf_expor
[PATCH v2 3/4] drm/amdkfd: Export DMABufs from KFD using GEM handles
Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM handles are created in a drm_client_dev context to avoid exposing them in user mode contexts through a DMABuf import. Signed-off-by: Felix Kuehling Reviewed-by: Ramesh Errabolu --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 5 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 29 ++- 3 files changed, 38 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index b8412202a1b0..aa8b24079070 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) { int i; int last_valid_bit; + int ret; amdgpu_amdkfd_gpuvm_init_mem_limits(); @@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) .enable_mes = adev->enable_mes, }; + ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL); + if (ret) { + dev_err(adev->dev, "Failed to init DRM client: %d\n", ret); + return; + } + /* this is going to have a few of the MSBs set that we need to * clear */ @@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev, &gpu_resources); + if (adev->kfd.init_complete) + drm_client_register(&adev->kfd.client); + else + drm_client_release(&adev->kfd.client); amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index dac983da961d..c1195eb67057 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -33,6 +33,7 @@ #include #include #include +#include #include "amdgpu_sync.h" #include "amdgpu_vm.h" #include "amdgpu_xcp.h" @@ -83,6 +84,7 @@ struct kgd_mem { struct amdgpu_sync sync; + uint32_t gem_handle; bool aql_queue; bool is_imported; }; @@ -105,6 +107,9 @@ struct amdgpu_kfd_dev { /* HMM page migration MEMORY_DEVICE_PRIVATE mapping */ struct dev_pagemap pgmap; + + /* Client for KFD BO GEM handle allocations */ + struct drm_client_dev client; }; enum kgd_engine_type { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 41fbc4fd0fac..e96e1595791f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -806,13 +807,18 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem, static int kfd_mem_export_dmabuf(struct kgd_mem *mem) { if (!mem->dmabuf) { - struct dma_buf *ret = amdgpu_gem_prime_export( - &mem->bo->tbo.base, + struct amdgpu_device *bo_adev; + struct dma_buf *dmabuf; + + bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); + dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev, + bo_adev->kfd.client.file, + mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0); - if (IS_ERR(ret)) - return PTR_ERR(ret); - mem->dmabuf = ret; + DRM_RDWR : 0); + if (IS_ERR(dmabuf)) + return PTR_ERR(dmabuf); + mem->dmabuf = dmabuf; } return 0; @@ -1779,6 +1785,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( pr_debug("Failed to allow vma node access. ret %d\n", ret); goto err_node_allow; } + ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle); + if (ret) + goto err_gem_handle_create; bo = gem_to_amdgpu_bo(gobj); if (bo_type == ttm_bo_type_sg) { bo->tbo.sg = sg; @@ -1830,6 +1839,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( err_pin_bo: err_validate_bo: remove_kgd_mem_from_kfd_bo_list(*mem, avm
[PATCH v2 2/4] drm/prime: Helper to export dmabuf without fd
Change drm_gem_prime_handle_to_fd to drm_gem_prime_handle_to_dmabuf to export a dmabuf without creating an FD as a user mode handle. This is more useful for users in kernel mode. Suggested-by: Thomas Zimmermann Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 63 ++--- include/drm/drm_prime.h | 6 ++-- 2 files changed, 33 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 834a5e28abbe..d491b5f73eea 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -410,26 +410,25 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, } /** - * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers + * drm_gem_prime_handle_to_dmabuf - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure * @handle: buffer handle to export * @flags: flags like DRM_CLOEXEC - * @prime_fd: pointer to storage for the fd id of the create dma-buf + * @dma_buf: pointer to storage for the dma-buf reference * * This is the PRIME export function which must be used mandatorily by GEM * drivers to ensure correct lifetime management of the underlying GEM object. * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev, + struct drm_file *file_priv, + uint32_t handle, uint32_t flags) { struct drm_gem_object *obj; int ret = 0; - struct dma_buf *dmabuf; + struct dma_buf *dmabuf = NULL; mutex_lock(&file_priv->prime.lock); obj = drm_gem_object_lookup(file_priv, handle); @@ -441,7 +440,7 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, handle); if (dmabuf) { get_dma_buf(dmabuf); - goto out_have_handle; + goto out; } mutex_lock(&dev->object_name_lock); @@ -479,40 +478,22 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, dmabuf, handle); mutex_unlock(&dev->object_name_lock); if (ret) - goto fail_put_dmabuf; - -out_have_handle: - ret = dma_buf_fd(dmabuf, flags); - /* -* We must _not_ remove the buffer from the handle cache since the newly -* created dma buf is already linked in the global obj->dma_buf pointer, -* and that is invariant as long as a userspace gem handle exists. -* Closing the handle will clean out the cache anyway, so we don't leak. -*/ - if (ret < 0) { - goto fail_put_dmabuf; - } else { - *prime_fd = ret; - ret = 0; - } - - goto out; - -fail_put_dmabuf: - dma_buf_put(dmabuf); + dma_buf_put(dmabuf); out: drm_gem_object_put(obj); out_unlock: mutex_unlock(&file_priv->prime.lock); - return ret; + return ret ? ERR_PTR(ret) : dmabuf; } -EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); +EXPORT_SYMBOL(drm_gem_prime_handle_to_dmabuf); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) { struct drm_prime_handle *args = data; + struct dma_buf *dmabuf; + int ret; /* check flags are valid */ if (args->flags & ~(DRM_CLOEXEC | DRM_RDWR)) @@ -523,8 +504,24 @@ int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, args->handle, args->flags, &args->fd); } - return drm_gem_prime_handle_to_fd(dev, file_priv, args->handle, - args->flags, &args->fd); + dmabuf = drm_gem_prime_handle_to_dmabuf(dev, file_priv, args->handle, + args->flags); + if (IS_ERR(dmabuf)) + return PTR_ERR(dmabuf); + ret = dma_buf_fd(dmabuf, args->flags); + /* +* We must _not_ remove the buffer from the handle cache since the newly +* created dma buf is already linked in the global obj->dma_buf pointer, +* and that is invariant as long as a userspace gem handle exists. +* Closing the handle will clean out the cache anyway, so we don't leak. +*/ +
Re: [Bug 218168] New: amdgpu: kfd_topology.c warning: the frame size of 1408 bytes is larger than 1024 bytes
There are two patches that didn't make it into Linux 6.6 that reduce the stack size in kfd_topology_add_device. Can you check if those fix the problem? commit aa5a9b2ccda2fa834fddb4bd30a2ab831598f551 Author: Alex Deucher Date: Tue Sep 26 12:00:23 2023 -0400 drm/amdkfd: drop struct kfd_cu_info I think this was an abstraction back from when kfd supported both radeon and amdgpu. Since we just support amdgpu now, there is no more need for this and we can use the amdgpu structures directly. This also avoids having the kfd_cu_info structures on the stack when inlining which can blow up the stack. Cc: Arnd Bergmann Acked-by: Arnd Bergmann Reviewed-by: Felix Kuehling Acked-by: Christian König Signed-off-by: Alex Deucher commit 1f3b515578a1d73926993629a06a7f3b60535b59 Author: Alex Deucher Date: Thu Sep 21 10:32:09 2023 -0400 drm/amdkfd: reduce stack size in kfd_topology_add_device() kfd_topology.c:2082:1: warning: the frame size of 1440 bytes is larger than 1024 bytes Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2866 Cc: Arnd Bergmann Acked-by: Arnd Bergmann Acked-by: Christian König Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Regards, Felix On 2023-11-20 10:36, Hamza Mahfooz wrote: + amd-gfx + Felix On 11/20/23 10:16, bugzilla-dae...@kernel.org wrote: https://bugzilla.kernel.org/show_bug.cgi?id=218168 Bug ID: 218168 Summary: amdgpu: kfd_topology.c warning: the frame size of 1408 bytes is larger than 1024 bytes Product: Drivers Version: 2.5 Hardware: All OS: Linux Status: NEW Severity: normal Priority: P3 Component: Video(DRI - non Intel) Assignee: drivers_video-...@kernel-bugs.osdl.org Reporter: bluesun...@gmail.com Regression: No Trying to compile Linux 6.6.2 with GCC 13.2.1 and CONFIG_WERROR=y: [...] drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c: In function 'kfd_topology_add_device': drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:2082:1: error: the frame size of 1408 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] 2082 | } | ^ cc1: all warnings being treated as errors [...]
Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"
On 2023-11-20 11:02, Thomas Zimmermann wrote: Hi Christian Am 20.11.23 um 16:22 schrieb Christian König: Am 20.11.23 um 16:18 schrieb Thomas Zimmermann: Hi Am 20.11.23 um 16:06 schrieb Felix Kuehling: On 2023-11-20 6:54, Thomas Zimmermann wrote: Hi Am 17.11.23 um 22:44 schrieb Felix Kuehling: This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. I'm unhappy to see these functions making a comeback. They are the boiler-plate logic that all drivers should use. Historically, drivers did a lot one things in their GEM code that was only semi-correct. Unifying most of that made the memory management more readable. Not giving back drivers to option of tinkering with this might be preferable. The rsp hooks in struct drm_driver, prime_fd_to_handle and prime_handle_to_fd, are only there for vmwgfx. If you want to hook into prime import and export, there are drm_driver.gem_prime_import and drm_gem_object_funcs.export. Isn't it possible to move the additional code behind these pointers? I'm not trying to hook into these functions, I'm just trying to call them. I'm not bringing back the .prime_handle_to_fd and .prime_fd_to_handle hooks in struct drm_driver. I need a clean way to export and import DMAbuffers from a kernel mode context. I had incorrect or semi-correct ways of doing that by calling some driver-internal functions, but then my DMABufs aren't correctly linked with GEM handles in DRM and move notifiers in amdgpu aren't working correctly. I understand that. But why don't you use drm_driver.gem_prime_import and drm_gem_object_funcs.export to add the amdkfd-specific code? These callbacks are being invoked from within drm_gem_prime_fd_to_handle() and drm_gem_prime_handle_to_fd() as part of the importing and exporting logic. With the intention of doing driver-specific things. Hence you should not have to re-export the internal drm_gem_prime_*_to_*() helpers. My question is if the existing hooks are not suitable for your needs. If so, how could we improve them? No no. You don't seem to understand the use case :) Felix doesn't try to implement any driver-specific things. I meant that I understand that this patchset is not about setting drm_driver.prime_handle_to_fd, et al. What Felix tries to do is to export a DMA-buf handle from kernel space. For example, looking at patch 2, it converts a GEM handle to a file descriptor and then assigns the rsp dmabuf to mem, which is of type struct kgd_mem. From my impression, this could be done within the existing ->export hook. That would skip the call to export_and_register_object. I think that's what I'm currently missing to set up gem_obj->dmabuf. Regards, Felix Then there's close_fd(), which cannot go into ->export. It looks like the fd isn't really required. Could the drm_prime_handle_to_fd() be reworked into a helper that converts the handle to the dmabuf without the fd? Something like drm_gem_prime_handle_to_dmabuf(), which would then be exported? And I have the question wrt the 3rd patch; just that it's about importing. (Looking further through the code, it appears that the fd could be removed from the helpers, the callbacks and vmwgfx. It would then be used entirely in the ioctl entry points, such as drm_prime_fd_to_handle_ioctl().) Best regards Thomas Regards, Christian. Best regards Thomas Regards, Felix Best regards Thomas CC: Christian König CC: Thomas Zimmermann Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime
Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"
On 2023-11-20 6:54, Thomas Zimmermann wrote: Hi Am 17.11.23 um 22:44 schrieb Felix Kuehling: This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. I'm unhappy to see these functions making a comeback. They are the boiler-plate logic that all drivers should use. Historically, drivers did a lot one things in their GEM code that was only semi-correct. Unifying most of that made the memory management more readable. Not giving back drivers to option of tinkering with this might be preferable. The rsp hooks in struct drm_driver, prime_fd_to_handle and prime_handle_to_fd, are only there for vmwgfx. If you want to hook into prime import and export, there are drm_driver.gem_prime_import and drm_gem_object_funcs.export. Isn't it possible to move the additional code behind these pointers? I'm not trying to hook into these functions, I'm just trying to call them. I'm not bringing back the .prime_handle_to_fd and .prime_fd_to_handle hooks in struct drm_driver. I need a clean way to export and import DMAbuffers from a kernel mode context. I had incorrect or semi-correct ways of doing that by calling some driver-internal functions, but then my DMABufs aren't correctly linked with GEM handles in DRM and move notifiers in amdgpu aren't working correctly. Regards, Felix Best regards Thomas CC: Christian König CC: Thomas Zimmermann Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, + struct drm_file *file_priv, uint32_t handle, + uint32_t flags, + int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
[PATCH 3/3] drm/amdkfd: Import DMABufs for interop through DRM
Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This ensures that a GEM handle is created on import and that obj->dma_buf will be set and remain set as long as the object is imported into KFD. Signed-off-by: Felix Kuehling Reviewed-by: Ramesh Errabolu Reviewed-by: Xiaogang.Chen Acked-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 9 ++- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 64 +-- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15 ++--- 3 files changed, 52 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index c1195eb67057..8da42e0dddcb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -319,11 +319,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info, struct dma_fence **ef); int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, struct kfd_vm_fault_info *info); -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dmabuf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset); +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset); int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem, struct dma_buf **dmabuf); void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index b13d68b7bb28..966272e067b2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -1957,8 +1957,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu( /* Free the BO*/ drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv); - if (!mem->is_imported) - drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); + drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); if (mem->dmabuf) { dma_buf_put(mem->dmabuf); mem->dmabuf = NULL; @@ -2314,34 +2313,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, return 0; } -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dma_buf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset) +static int import_obj_create(struct amdgpu_device *adev, +struct dma_buf *dma_buf, +struct drm_gem_object *obj, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) { struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv); - struct drm_gem_object *obj; struct amdgpu_bo *bo; int ret; - obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf); - if (IS_ERR(obj)) - return PTR_ERR(obj); - bo = gem_to_amdgpu_bo(obj); if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM | - AMDGPU_GEM_DOMAIN_GTT))) { + AMDGPU_GEM_DOMAIN_GTT))) /* Only VRAM and GTT BOs are supported */ - ret = -EINVAL; - goto err_put_obj; - } + return -EINVAL; *mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL); - if (!*mem) { - ret = -ENOMEM; - goto err_put_obj; - } + if (!*mem) + return -ENOMEM; ret = drm_vma_node_allow(&obj->vma_node, drm_priv); if (ret) @@ -2391,8 +2382,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, drm_vma_node_revoke(&obj->vma_node, drm_priv); err_free_mem: kfree(*mem); + return ret; +} + +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) +{ + struct drm_gem_object *ob
[PATCH 2/3] drm/amdkfd: Export DMABufs from KFD using GEM handles
Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM handles are created in a drm_client_dev context to avoid exposing them in user mode contexts through a DMABuf import. Signed-off-by: Felix Kuehling Reviewed-by: Ramesh Errabolu --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 5 +++ .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 33 +++ 3 files changed, 42 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index b8412202a1b0..aa8b24079070 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) { int i; int last_valid_bit; + int ret; amdgpu_amdkfd_gpuvm_init_mem_limits(); @@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) .enable_mes = adev->enable_mes, }; + ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL); + if (ret) { + dev_err(adev->dev, "Failed to init DRM client: %d\n", ret); + return; + } + /* this is going to have a few of the MSBs set that we need to * clear */ @@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev, &gpu_resources); + if (adev->kfd.init_complete) + drm_client_register(&adev->kfd.client); + else + drm_client_release(&adev->kfd.client); amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index dac983da961d..c1195eb67057 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -33,6 +33,7 @@ #include #include #include +#include #include "amdgpu_sync.h" #include "amdgpu_vm.h" #include "amdgpu_xcp.h" @@ -83,6 +84,7 @@ struct kgd_mem { struct amdgpu_sync sync; + uint32_t gem_handle; bool aql_queue; bool is_imported; }; @@ -105,6 +107,9 @@ struct amdgpu_kfd_dev { /* HMM page migration MEMORY_DEVICE_PRIVATE mapping */ struct dev_pagemap pgmap; + + /* Client for KFD BO GEM handle allocations */ + struct drm_client_dev client; }; enum kgd_engine_type { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 41fbc4fd0fac..b13d68b7bb28 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -806,13 +807,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem, static int kfd_mem_export_dmabuf(struct kgd_mem *mem) { if (!mem->dmabuf) { - struct dma_buf *ret = amdgpu_gem_prime_export( - &mem->bo->tbo.base, + struct amdgpu_device *bo_adev; + struct dma_buf *dmabuf; + int r, fd; + + bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); + r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, + mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0); - if (IS_ERR(ret)) - return PTR_ERR(ret); - mem->dmabuf = ret; + DRM_RDWR : 0, &fd); + if (r) + return r; + dmabuf = dma_buf_get(fd); + close_fd(fd); + if (WARN_ON_ONCE(IS_ERR(dmabuf))) + return PTR_ERR(dmabuf); + mem->dmabuf = dmabuf; } return 0; @@ -1779,6 +1789,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( pr_debug("Failed to allow vma node access. ret %d\n", ret); goto err_node_allow; } + ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle); + if (ret) + goto err_gem_handle_create; bo = gem_to_amdgpu_bo(gobj); if (bo_type == ttm_bo_type_sg) { bo->tbo.sg = sg; @@ -1830,6 +1843,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_g
[PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"
This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3. These helper functions are needed for KFD to export and import DMABufs the right way without duplicating the tracking of DMABufs associated with GEM objects while ensuring that move notifier callbacks are working as intended. CC: Christian König CC: Thomas Zimmermann Signed-off-by: Felix Kuehling --- drivers/gpu/drm/drm_prime.c | 33 ++--- include/drm/drm_prime.h | 7 +++ 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 63b709a67471..834a5e28abbe 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf) } EXPORT_SYMBOL(drm_gem_dmabuf_release); -/* +/** * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers * @dev: drm_device to import into * @file_priv: drm file-private structure @@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release); * * Returns 0 on success or a negative error code on failure. */ -static int drm_gem_prime_fd_to_handle(struct drm_device *dev, - struct drm_file *file_priv, int prime_fd, - uint32_t *handle) +int drm_gem_prime_fd_to_handle(struct drm_device *dev, + struct drm_file *file_priv, int prime_fd, + uint32_t *handle) { struct dma_buf *dma_buf; struct drm_gem_object *obj; @@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev, dma_buf_put(dma_buf); return ret; } +EXPORT_SYMBOL(drm_gem_prime_fd_to_handle); int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; } -/* +/** * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers * @dev: dev to export the buffer from * @file_priv: drm file-private structure @@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * The actual exporting from GEM object to a dma-buf is done through the * &drm_gem_object_funcs.export callback. */ -static int drm_gem_prime_handle_to_fd(struct drm_device *dev, - struct drm_file *file_priv, uint32_t handle, - uint32_t flags, - int *prime_fd) +int drm_gem_prime_handle_to_fd(struct drm_device *dev, + struct drm_file *file_priv, uint32_t handle, + uint32_t flags, + int *prime_fd) { struct drm_gem_object *obj; int ret = 0; @@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev, return ret; } +EXPORT_SYMBOL(drm_gem_prime_handle_to_fd); int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) @@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size); * @obj: GEM object to export * @flags: flags like DRM_CLOEXEC and DRM_RDWR * - * This is the implementation of the &drm_gem_object_funcs.export functions - * for GEM drivers using the PRIME helpers. It is used as the default for - * drivers that do not set their own. + * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers + * using the PRIME helpers. It is used as the default in + * drm_gem_prime_handle_to_fd(). */ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, int flags) @@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev); * @dev: drm_device to import into * @dma_buf: dma-buf object to import * - * This is the implementation of the gem_prime_import functions for GEM - * drivers using the PRIME helpers. It is the default for drivers that do - * not set their own &drm_driver.gem_prime_import. + * This is the implementation of the gem_prime_import functions for GEM drivers + * using the PRIME helpers. Drivers can use this as their + * &drm_driver.gem_prime_import implementation. It is used as the default + * implementation in drm_gem_prime_fd_to_handle(). * * Drivers must arrange to call drm_prime_gem_destroy() from their * &drm_gem_object_funcs.free hook when using this function. diff --git a/include/drm/drm_prime.h b/include/drm/drm_prime.h index a7abf9f3e697..2a1d01e5b56b 100644 --- a/include/drm/drm_prime.h +++ b/include/drm/drm_prime.h @@ -60,12 +60,19 @@ enum dma_data_direction; struct drm_device; struct drm_gem_object; +struct drm_file; /* core prime functions */ struct dma_buf *drm_gem_dmabuf_expor
Re: [PATCH 4/6] drm/amdkfd: Export DMABufs from KFD using GEM handles
On 2023-11-07 11:58, Felix Kuehling wrote: Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM handles are created in a drm_client_dev context to avoid exposing them in user mode contexts through a DMABuf import. This patch (and the next one) won't apply upstream because Thomas Zimmerman just made drm_gem_prime_fd_to_handle and drm_gem_prime_handle_to_fd private because nobody was using them. (drm/prime: Unexport helpers for fd/handle conversion) Is it OK to export those APIs again? Or is there a better way for drivers to export/import DMABufs without using the GEM ioctls? Regards, Felix Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 5 +++ .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 33 +++ drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 +-- 4 files changed, 44 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index 6ab17330a6ed..b0a67f16540a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) { int i; int last_valid_bit; + int ret; amdgpu_amdkfd_gpuvm_init_mem_limits(); @@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) .enable_mes = adev->enable_mes, }; + ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL); + if (ret) { + dev_err(adev->dev, "Failed to init DRM client: %d\n", ret); + return; + } + /* this is going to have a few of the MSBs set that we need to * clear */ @@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev, &gpu_resources); + if (adev->kfd.init_complete) + drm_client_register(&adev->kfd.client); + else + drm_client_release(&adev->kfd.client); amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index 68d534a89942..4caf8cece028 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -32,6 +32,7 @@ #include #include #include +#include #include #include "amdgpu_sync.h" #include "amdgpu_vm.h" @@ -84,6 +85,7 @@ struct kgd_mem { struct amdgpu_sync sync; + uint32_t gem_handle; bool aql_queue; bool is_imported; }; @@ -106,6 +108,9 @@ struct amdgpu_kfd_dev { /* HMM page migration MEMORY_DEVICE_PRIVATE mapping */ struct dev_pagemap pgmap; + + /* Client for KFD BO GEM handle allocations */ + struct drm_client_dev client; }; enum kgd_engine_type { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 0c1cb6048259..4bb8b5fd7598 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include "amdgpu_object.h" @@ -804,13 +805,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem, static int kfd_mem_export_dmabuf(struct kgd_mem *mem) { if (!mem->dmabuf) { - struct dma_buf *ret = amdgpu_gem_prime_export( - &mem->bo->tbo.base, + struct amdgpu_device *bo_adev; + struct dma_buf *dmabuf; + int r, fd; + + bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); + r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, + mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0); - if (IS_ERR(ret)) - return PTR_ERR(ret); - mem->dmabuf = ret; + DRM_RDWR : 0, &fd); + if (r) + return r; + dmabuf = dma_buf_get(fd); + close_fd(fd); + if (WARN_ON_ONCE(IS_ERR(dmabuf))) + return PTR_ERR(dmabuf); + mem->dmabuf = dmabuf; } return 0; @@ -1826,6 +1836,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
Re: [Patch v2] drm/ttm: Schedule delayed_delete worker closer
On 2023-11-08 09:49, Christian König wrote: Am 08.11.23 um 13:58 schrieb Rajneesh Bhardwaj: Try to allocate system memory on the NUMA node the device is closest to and try to run delayed_delete workers on a CPU of this node as well. To optimize the memory clearing operation when a TTM BO gets freed by the delayed_delete worker, scheduling it closer to a NUMA node where the memory was initially allocated helps avoid the cases where the worker gets randomly scheduled on the CPU cores that are across interconnect boundaries such as xGMI, PCIe etc. This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD APU platforms such as GFXIP9.4.3. Acked-by: Felix Kuehling Signed-off-by: Rajneesh Bhardwaj Reviewed-by: Christian König Going to push this to drm-misc-next. Hold on. Rajneesh just pointed out a WARN regression from testing. I think the problem is that the bdev->wq is not unbound. Regards, Felix Thanks, Christian. --- Changes in v2: - Absorbed the feedback provided by Christian in the commit message and the comment. drivers/gpu/drm/ttm/ttm_bo.c | 8 +++- drivers/gpu/drm/ttm/ttm_device.c | 3 ++- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 5757b9415e37..6f28a77a565b 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -370,7 +370,13 @@ static void ttm_bo_release(struct kref *kref) spin_unlock(&bo->bdev->lru_lock); INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete); - queue_work(bdev->wq, &bo->delayed_delete); + + /* Schedule the worker on the closest NUMA node. This + * improves performance since system memory might be + * cleared on free and that is best done on a CPU core + * close to it. + */ + queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete); return; } diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c index 43e27ab77f95..72b81a2ee6c7 100644 --- a/drivers/gpu/drm/ttm/ttm_device.c +++ b/drivers/gpu/drm/ttm/ttm_device.c @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs, bdev->funcs = funcs; ttm_sys_man_init(bdev); - ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32); + + ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32); bdev->vma_manager = vma_manager; spin_lock_init(&bdev->lru_lock);
Re: [PATCH] drm/ttm: Schedule delayed_delete worker closer
On 2023-11-07 14:45, Rajneesh Bhardwaj wrote: When a TTM BO is getting freed, to optimize the clearing operation on the workqueue, schedule it closer to a NUMA node where the memory was allocated. This avoids the cases where the ttm_bo_delayed_delete gets scheduled on the CPU cores that are across interconnect boundaries such as xGMI, PCIe etc. This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD APU platforms such as GFXIP9.4.3. Signed-off-by: Rajneesh Bhardwaj Acked-by: Felix Kuehling --- drivers/gpu/drm/ttm/ttm_bo.c | 10 +- drivers/gpu/drm/ttm/ttm_device.c | 3 ++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 5757b9415e37..0d608441a112 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref) spin_unlock(&bo->bdev->lru_lock); INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete); - queue_work(bdev->wq, &bo->delayed_delete); + /* Schedule the worker on the closest NUMA node, if no +* CPUs are available, this falls back to any CPU core +* available system wide. This helps avoid the +* bottleneck to clear memory in cases where the worker +* is scheduled on a CPU which is remote to the node +* where the memory is getting freed. +*/ + + queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete); return; } diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c index 43e27ab77f95..72b81a2ee6c7 100644 --- a/drivers/gpu/drm/ttm/ttm_device.c +++ b/drivers/gpu/drm/ttm/ttm_device.c @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs, bdev->funcs = funcs; ttm_sys_man_init(bdev); - ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32); + + ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32); bdev->vma_manager = vma_manager; spin_lock_init(&bdev->lru_lock);
Re: [PATCH 03/11] drm/amdkfd: Improve amdgpu_vm_handle_moved
On 2023-10-17 17:13, Felix Kuehling wrote: Let amdgpu_vm_handle_moved update all BO VA mappings of BOs reserved by the caller. This will be useful for handling extra BO VA mappings in KFD VMs that are managed through the render node API. Signed-off-by: Felix Kuehling Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 22 + drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 19 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 3 ++- 4 files changed, 18 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 74769afaa33d..c8f2907ebd6f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1113,7 +1113,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p) struct amdgpu_vm *vm = &fpriv->vm; struct amdgpu_bo_list_entry *e; struct amdgpu_bo_va *bo_va; - struct amdgpu_bo *bo; unsigned int i; int r; @@ -1141,26 +1140,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p) return r; } - amdgpu_bo_list_for_each_entry(e, p->bo_list) { - /* ignore duplicates */ - bo = ttm_to_amdgpu_bo(e->tv.bo); - if (!bo) - continue; - - bo_va = e->bo_va; - if (bo_va == NULL) - continue; - - r = amdgpu_vm_bo_update(adev, bo_va, false); - if (r) - return r; - - r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update); - if (r) - return r; - } FYI, removing this loop seemed to cause PSDB failures, mostly in RADV tests. It may have been a glitch in the infrastructure, but the failures were consistent on the three subsequent patches, too. Restoring this loop seems to make the tests pass, so I'll do that for now. I don't have time to debug CS with RADV, and this change is not needed for my ROCm work. Regards, Felix - - r = amdgpu_vm_handle_moved(adev, vm); + r = amdgpu_vm_handle_moved(adev, vm, &p->ticket); if (r) return r; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index b5e28fa3f414..e7e87a3b2601 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -409,7 +409,7 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) if (!r) r = amdgpu_vm_clear_freed(adev, vm, NULL); if (!r) - r = amdgpu_vm_handle_moved(adev, vm); + r = amdgpu_vm_handle_moved(adev, vm, ticket); if (r && r != -EBUSY) DRM_ERROR("Failed to invalidate VM page tables (%d))\n", diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index d72daf15662f..c586d0e93d75 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1285,6 +1285,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev, * * @adev: amdgpu_device pointer * @vm: requested vm + * @ticket: optional reservation ticket used to reserve the VM * * Make sure all BOs which are moved are updated in the PTs. * @@ -1294,11 +1295,12 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev, * PTs have to be reserved! */ int amdgpu_vm_handle_moved(struct amdgpu_device *adev, - struct amdgpu_vm *vm) + struct amdgpu_vm *vm, + struct ww_acquire_ctx *ticket) { struct amdgpu_bo_va *bo_va; struct dma_resv *resv; - bool clear; + bool clear, unlock; int r; spin_lock(&vm->status_lock); @@ -1321,17 +1323,24 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev, spin_unlock(&vm->status_lock); /* Try to reserve the BO to avoid clearing its ptes */ - if (!adev->debug_vm && dma_resv_trylock(resv)) + if (!adev->debug_vm && dma_resv_trylock(resv)) { clear = false; + unlock = true; + /* The caller is already holding the reservation lock */ + } else if (ticket && dma_resv_locking_ctx(resv) == ticket) { + clear = false; + unlock = false; /* Somebody else is using the BO right now */ - else + } else { clear = true; + unlock = false; + } r = amdgpu_vm_
[PATCH 11/11] drm/amdkfd: Bump KFD ioctl version
This is not strictly a change in the IOCTL API. This version bump is meant to indicate to user mode the presence of a number of changes and fixes that enable the management of VA mappings in compute VMs using the GEM_VA ioctl for DMABufs exported from KFD. Signed-off-by: Felix Kuehling --- include/uapi/linux/kfd_ioctl.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h index f0ed68974c54..9ce46edc62a5 100644 --- a/include/uapi/linux/kfd_ioctl.h +++ b/include/uapi/linux/kfd_ioctl.h @@ -40,9 +40,10 @@ * - 1.12 - Add DMA buf export ioctl * - 1.13 - Add debugger API * - 1.14 - Update kfd_event_data + * - 1.15 - Enable managing mappings in compute VMs with GEM_VA ioctl */ #define KFD_IOCTL_MAJOR_VERSION 1 -#define KFD_IOCTL_MINOR_VERSION 14 +#define KFD_IOCTL_MINOR_VERSION 15 struct kfd_ioctl_get_version_args { __u32 major_version;/* from KFD */ -- 2.34.1
[PATCH 08/11] drm/amdgpu: Auto-validate DMABuf imports in compute VMs
DMABuf imports in compute VMs are not wrapped in a kgd_mem object on the process_info->kfd_bo_list. There is no explicit KFD API call to validate them or add eviction fences to them. This patch automatically validates and fences dymanic DMABuf imports when they are added to a compute VM. Revalidation after evictions is handled in the VM code. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 3 + .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 6 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 26 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 117 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 6 +- 7 files changed, 164 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index fcf8a98ad15e..68d534a89942 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -178,6 +178,9 @@ int amdgpu_queue_mask_bit_to_set_resource_bit(struct amdgpu_device *adev, struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context, struct mm_struct *mm, struct svm_range_bo *svm_bo); +int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo, + uint32_t domain, + struct dma_fence *fence); #if defined(CONFIG_DEBUG_FS) int kfd_debugfs_kfd_mem_limits(struct seq_file *m, void *data); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 2e302956a279..0c1cb6048259 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -423,9 +423,9 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain, return ret; } -static int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo, - uint32_t domain, - struct dma_fence *fence) +int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo, + uint32_t domain, + struct dma_fence *fence) { int ret = amdgpu_bo_reserve(bo, false); @@ -2948,7 +2948,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) struct amdgpu_device *adev = amdgpu_ttm_adev( peer_vm->root.bo->tbo.bdev); - ret = amdgpu_vm_handle_moved(adev, peer_vm, &ctx.ticket); + ret = amdgpu_vm_handle_moved(adev, peer_vm, &ctx.ticket, true); if (ret) { pr_debug("Memory eviction: handle moved failed. Try again\n"); goto validate_map_fail; @@ -3001,7 +3001,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) &process_info->eviction_fence->base, DMA_RESV_USAGE_BOOKKEEP); } - /* Attach eviction fence to PD / PT BOs */ + /* Attach eviction fence to PD / PT BOs and DMABuf imports */ list_for_each_entry(peer_vm, &process_info->vm_list_head, vm_list_node) { struct amdgpu_bo *bo = peer_vm->root.bo; @@ -3009,6 +3009,11 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) dma_resv_add_fence(bo->tbo.base.resv, &process_info->eviction_fence->base, DMA_RESV_USAGE_BOOKKEEP); + + ret = amdgpu_vm_fence_imports(peer_vm, &ctx.ticket, + &process_info->eviction_fence->base); + if (ret) + break; } validate_map_fail: diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index c8f2907ebd6f..e6dcd17c3c8e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1140,7 +1140,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p) return r; } - r = amdgpu_vm_handle_moved(adev, vm, &p->ticket); + r = amdgpu_vm_handle_moved(adev, vm, &p->ticket, false); if (r) return r; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index e7e87a3b2601..234244704f27 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -373,6 +373,10 @@ amdgpu_dma_buf_move_notify(stru
[PATCH 03/11] drm/amdkfd: Improve amdgpu_vm_handle_moved
Let amdgpu_vm_handle_moved update all BO VA mappings of BOs reserved by the caller. This will be useful for handling extra BO VA mappings in KFD VMs that are managed through the render node API. Signed-off-by: Felix Kuehling Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 22 + drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 19 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 3 ++- 4 files changed, 18 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 74769afaa33d..c8f2907ebd6f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1113,7 +1113,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p) struct amdgpu_vm *vm = &fpriv->vm; struct amdgpu_bo_list_entry *e; struct amdgpu_bo_va *bo_va; - struct amdgpu_bo *bo; unsigned int i; int r; @@ -1141,26 +1140,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p) return r; } - amdgpu_bo_list_for_each_entry(e, p->bo_list) { - /* ignore duplicates */ - bo = ttm_to_amdgpu_bo(e->tv.bo); - if (!bo) - continue; - - bo_va = e->bo_va; - if (bo_va == NULL) - continue; - - r = amdgpu_vm_bo_update(adev, bo_va, false); - if (r) - return r; - - r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update); - if (r) - return r; - } - - r = amdgpu_vm_handle_moved(adev, vm); + r = amdgpu_vm_handle_moved(adev, vm, &p->ticket); if (r) return r; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index b5e28fa3f414..e7e87a3b2601 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -409,7 +409,7 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) if (!r) r = amdgpu_vm_clear_freed(adev, vm, NULL); if (!r) - r = amdgpu_vm_handle_moved(adev, vm); + r = amdgpu_vm_handle_moved(adev, vm, ticket); if (r && r != -EBUSY) DRM_ERROR("Failed to invalidate VM page tables (%d))\n", diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index d72daf15662f..c586d0e93d75 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1285,6 +1285,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev, * * @adev: amdgpu_device pointer * @vm: requested vm + * @ticket: optional reservation ticket used to reserve the VM * * Make sure all BOs which are moved are updated in the PTs. * @@ -1294,11 +1295,12 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev, * PTs have to be reserved! */ int amdgpu_vm_handle_moved(struct amdgpu_device *adev, - struct amdgpu_vm *vm) + struct amdgpu_vm *vm, + struct ww_acquire_ctx *ticket) { struct amdgpu_bo_va *bo_va; struct dma_resv *resv; - bool clear; + bool clear, unlock; int r; spin_lock(&vm->status_lock); @@ -1321,17 +1323,24 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev, spin_unlock(&vm->status_lock); /* Try to reserve the BO to avoid clearing its ptes */ - if (!adev->debug_vm && dma_resv_trylock(resv)) + if (!adev->debug_vm && dma_resv_trylock(resv)) { clear = false; + unlock = true; + /* The caller is already holding the reservation lock */ + } else if (ticket && dma_resv_locking_ctx(resv) == ticket) { + clear = false; + unlock = false; /* Somebody else is using the BO right now */ - else + } else { clear = true; + unlock = false; + } r = amdgpu_vm_bo_update(adev, bo_va, clear); if (r) return r; - if (!clear) + if (unlock) dma_resv_unlock(resv); spin_lock(&vm->status_lock); } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 6e71978db13f..ebcc75132b74 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_
[PATCH 10/11] drm/amdkfd: Import DMABufs for interop through DRM
Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This ensures that a GEM handle is created on import and that obj->dma_buf will be set and remain set as long as the object is imported into KFD. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 9 ++- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 64 +-- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15 ++--- 3 files changed, 52 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index 4caf8cece028..88a0e0734270 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -318,11 +318,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info, struct dma_fence **ef); int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, struct kfd_vm_fault_info *info); -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dmabuf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset); +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset); int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem, struct dma_buf **dmabuf); void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 4bb8b5fd7598..1077de8bced2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -2006,8 +2006,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu( /* Free the BO*/ drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv); - if (!mem->is_imported) - drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); + drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle); if (mem->dmabuf) { dma_buf_put(mem->dmabuf); mem->dmabuf = NULL; @@ -2363,34 +2362,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev, return 0; } -int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, - struct dma_buf *dma_buf, - uint64_t va, void *drm_priv, - struct kgd_mem **mem, uint64_t *size, - uint64_t *mmap_offset) +static int import_obj_create(struct amdgpu_device *adev, +struct dma_buf *dma_buf, +struct drm_gem_object *obj, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) { struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv); - struct drm_gem_object *obj; struct amdgpu_bo *bo; int ret; - obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf); - if (IS_ERR(obj)) - return PTR_ERR(obj); - bo = gem_to_amdgpu_bo(obj); if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM | - AMDGPU_GEM_DOMAIN_GTT))) { + AMDGPU_GEM_DOMAIN_GTT))) /* Only VRAM and GTT BOs are supported */ - ret = -EINVAL; - goto err_put_obj; - } + return -EINVAL; *mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL); - if (!*mem) { - ret = -ENOMEM; - goto err_put_obj; - } + if (!*mem) + return -ENOMEM; ret = drm_vma_node_allow(&obj->vma_node, drm_priv); if (ret) @@ -2440,8 +2431,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, drm_vma_node_revoke(&obj->vma_node, drm_priv); err_free_mem: kfree(*mem); + return ret; +} + +int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd, +uint64_t va, void *drm_priv, +struct kgd_mem **mem, uint64_t *size, +uint64_t *mmap_offset) +{ + struct drm_gem_object *obj; + uint32_t handle; + int ret; + + ret = drm_gem
[PATCH 09/11] drm/amdkfd: Export DMABufs from KFD using GEM handles
Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM handles are created in a drm_client_dev context to avoid exposing them in user mode contexts through a DMABuf import. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 5 +++ .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 33 +++ drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 +-- 4 files changed, 44 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index 6ab17330a6ed..b0a67f16540a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) { int i; int last_valid_bit; + int ret; amdgpu_amdkfd_gpuvm_init_mem_limits(); @@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) .enable_mes = adev->enable_mes, }; + ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL); + if (ret) { + dev_err(adev->dev, "Failed to init DRM client: %d\n", ret); + return; + } + /* this is going to have a few of the MSBs set that we need to * clear */ @@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev, &gpu_resources); + if (adev->kfd.init_complete) + drm_client_register(&adev->kfd.client); + else + drm_client_release(&adev->kfd.client); amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index 68d534a89942..4caf8cece028 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -32,6 +32,7 @@ #include #include #include +#include #include #include "amdgpu_sync.h" #include "amdgpu_vm.h" @@ -84,6 +85,7 @@ struct kgd_mem { struct amdgpu_sync sync; + uint32_t gem_handle; bool aql_queue; bool is_imported; }; @@ -106,6 +108,9 @@ struct amdgpu_kfd_dev { /* HMM page migration MEMORY_DEVICE_PRIVATE mapping */ struct dev_pagemap pgmap; + + /* Client for KFD BO GEM handle allocations */ + struct drm_client_dev client; }; enum kgd_engine_type { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 0c1cb6048259..4bb8b5fd7598 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include "amdgpu_object.h" @@ -804,13 +805,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem, static int kfd_mem_export_dmabuf(struct kgd_mem *mem) { if (!mem->dmabuf) { - struct dma_buf *ret = amdgpu_gem_prime_export( - &mem->bo->tbo.base, + struct amdgpu_device *bo_adev; + struct dma_buf *dmabuf; + int r, fd; + + bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev); + r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file, + mem->gem_handle, mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? - DRM_RDWR : 0); - if (IS_ERR(ret)) - return PTR_ERR(ret); - mem->dmabuf = ret; + DRM_RDWR : 0, &fd); + if (r) + return r; + dmabuf = dma_buf_get(fd); + close_fd(fd); + if (WARN_ON_ONCE(IS_ERR(dmabuf))) + return PTR_ERR(dmabuf); + mem->dmabuf = dmabuf; } return 0; @@ -1826,6 +1836,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( pr_debug("Failed to allow vma node access. ret %d\n", ret); goto err_node_allow; } + ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle); + if (ret) + goto err_gem_handle_create; bo = gem_to_amdgpu_bo(gobj); if (bo_type == ttm_bo_type_sg) { bo->tbo.sg = sg; @@ -1877,6 +1890,8 @@ int
[PATCH 07/11] drm/amdgpu: New VM state for evicted user BOs
Create a new VM state to track user BOs that are in the system domain. In the next patch this will be used do conditionally re-validate them in amdgpu_vm_handle_moved. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 + drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 5 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 3307c5765787..76a8a7fd3fde 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -232,6 +232,22 @@ static void amdgpu_vm_bo_invalidated(struct amdgpu_vm_bo_base *vm_bo) spin_unlock(&vm_bo->vm->status_lock); } +/** + * amdgpu_vm_bo_evicted_user - vm_bo is evicted + * + * @vm_bo: vm_bo which is evicted + * + * State for BOs used by user mode queues which are not at the location they + * should be. + */ +static void amdgpu_vm_bo_evicted_user(struct amdgpu_vm_bo_base *vm_bo) +{ + vm_bo->moved = true; + spin_lock(&vm_bo->vm->status_lock); + list_move(&vm_bo->vm_status, &vm_bo->vm->evicted_user); + spin_unlock(&vm_bo->vm->status_lock); +} + /** * amdgpu_vm_bo_relocated - vm_bo is reloacted * @@ -2105,6 +2121,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, int32_t xcp for (i = 0; i < AMDGPU_MAX_VMHUBS; i++) vm->reserved_vmid[i] = NULL; INIT_LIST_HEAD(&vm->evicted); + INIT_LIST_HEAD(&vm->evicted_user); INIT_LIST_HEAD(&vm->relocated); INIT_LIST_HEAD(&vm->moved); INIT_LIST_HEAD(&vm->idle); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 577cdb6d1649..914e6753a6d0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h @@ -280,9 +280,12 @@ struct amdgpu_vm { /* Lock to protect vm_bo add/del/move on all lists of vm */ spinlock_t status_lock; - /* BOs who needs a validation */ + /* Per VM and PT BOs who needs a validation */ struct list_headevicted; + /* BOs for user mode queues that need a validation */ + struct list_headevicted_user; + /* PT BOs which relocated and their parent need an update */ struct list_headrelocated; -- 2.34.1
[PATCH 06/11] drm/amdkfd: Move TLB flushing logic into amdgpu
This will make it possible for amdgpu GEM ioctls to flush TLBs on compute VMs. This removes VMID-based TLB flushing and always uses PASID-based flushing. This still works because it scans the VMID-PASID mapping registers to find the right VMID. It's only slightly less efficient. This is not a production use case. Signed-off-by: Felix Kuehling Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 29 -- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 5 --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 44 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 5 +++ drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 10 - drivers/gpu/drm/amd/amdkfd/kfd_process.c | 31 --- 6 files changed, 57 insertions(+), 67 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index b8412202a1b0..6ab17330a6ed 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -710,35 +710,6 @@ bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid) return false; } -int amdgpu_amdkfd_flush_gpu_tlb_vmid(struct amdgpu_device *adev, -uint16_t vmid) -{ - if (adev->family == AMDGPU_FAMILY_AI) { - int i; - - for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS) - amdgpu_gmc_flush_gpu_tlb(adev, vmid, i, 0); - } else { - amdgpu_gmc_flush_gpu_tlb(adev, vmid, AMDGPU_GFXHUB(0), 0); - } - - return 0; -} - -int amdgpu_amdkfd_flush_gpu_tlb_pasid(struct amdgpu_device *adev, - uint16_t pasid, - enum TLB_FLUSH_TYPE flush_type, - uint32_t inst) -{ - bool all_hub = false; - - if (adev->family == AMDGPU_FAMILY_AI || - adev->family == AMDGPU_FAMILY_RV) - all_hub = true; - - return amdgpu_gmc_flush_gpu_tlb_pasid(adev, pasid, flush_type, all_hub, inst); -} - bool amdgpu_amdkfd_have_atomics_support(struct amdgpu_device *adev) { return adev->have_atomics_support; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index 3ad8dc523b42..fcf8a98ad15e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -163,11 +163,6 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev, uint32_t *ib_cmd, uint32_t ib_len); void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle); bool amdgpu_amdkfd_have_atomics_support(struct amdgpu_device *adev); -int amdgpu_amdkfd_flush_gpu_tlb_vmid(struct amdgpu_device *adev, - uint16_t vmid); -int amdgpu_amdkfd_flush_gpu_tlb_pasid(struct amdgpu_device *adev, - uint16_t pasid, enum TLB_FLUSH_TYPE flush_type, - uint32_t inst); bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index c586d0e93d75..3307c5765787 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1349,6 +1349,50 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev, return 0; } +/** + * amdgpu_vm_flush_compute_tlb - Flush TLB on compute VM + * + * @adev: amdgpu_device pointer + * @vm: requested vm + * @flush_type: flush type + * + * Flush TLB if needed for a compute VM. + * + * Returns: + * 0 for success. + */ +int amdgpu_vm_flush_compute_tlb(struct amdgpu_device *adev, + struct amdgpu_vm *vm, + uint32_t flush_type, + uint32_t xcc_mask) +{ + uint64_t tlb_seq = amdgpu_vm_tlb_seq(vm); + bool all_hub = false; + int xcc = 0, r = 0; + + WARN_ON_ONCE(!vm->is_compute_context); + + /* +* It can be that we race and lose here, but that is extremely unlikely +* and the worst thing which could happen is that we flush the changes +* into the TLB once more which is harmless. +*/ + if (atomic64_xchg(&vm->kfd_last_flushed_seq, tlb_seq) == tlb_seq) + return 0; + + if (adev->family == AMDGPU_FAMILY_AI || + adev->family == AMDGPU_FAMILY_RV) + all_hub = true; + + for_each_inst(xcc, xcc_mask) { + r = amdgpu_gmc_flush_gpu_tlb_pasid(adev, vm->pasid, flush_type, + all_hub, xcc); + if (r) + break; + } + return r; +} + /** * amdgpu_vm_bo_add - add a bo to a specific vm * diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drive
[PATCH 05/11] drm/amdgpu: update mappings not managed by KFD
When restoring after an eviction, use amdgpu_vm_handle_moved to update BO VA mappings in KFD VMs that are not managed through the KFD API. This should allow using the render node API to create more flexible memory mappings in KFD VMs. Signed-off-by: Felix Kuehling Acked-by: Christian König --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 28 +++ 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 7c29f6c377a8..2e302956a279 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -2892,12 +2892,6 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) if (ret) goto validate_map_fail; - ret = process_sync_pds_resv(process_info, &sync_obj); - if (ret) { - pr_debug("Memory eviction: Failed to sync to PD BO moving fence. Try again\n"); - goto validate_map_fail; - } - /* Validate BOs and map them to GPUVM (update VM page tables). */ list_for_each_entry(mem, &process_info->kfd_bo_list, validate_list.head) { @@ -2948,6 +2942,19 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) if (failed_size) pr_debug("0x%lx/0x%lx in system\n", failed_size, total_size); + /* Update mappings not managed by KFD */ + list_for_each_entry(peer_vm, &process_info->vm_list_head, + vm_list_node) { + struct amdgpu_device *adev = amdgpu_ttm_adev( + peer_vm->root.bo->tbo.bdev); + + ret = amdgpu_vm_handle_moved(adev, peer_vm, &ctx.ticket); + if (ret) { + pr_debug("Memory eviction: handle moved failed. Try again\n"); + goto validate_map_fail; + } + } + /* Update page directories */ ret = process_update_pds(process_info, &sync_obj); if (ret) { @@ -2955,6 +2962,15 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) goto validate_map_fail; } + /* Sync with fences on all the page tables. They implicitly depend on any +* move fences from amdgpu_vm_handle_moved above. +*/ + ret = process_sync_pds_resv(process_info, &sync_obj); + if (ret) { + pr_debug("Memory eviction: Failed to sync to PD BO moving fence. Try again\n"); + goto validate_map_fail; + } + /* Wait for validate and PT updates to finish */ amdgpu_sync_wait(&sync_obj, false); -- 2.34.1
[PATCH 02/11] drm/amdgpu: Reserve fences for VM update
In amdgpu_dma_buf_move_notify reserve fences for the page table updates in amdgpu_vm_clear_freed and amdgpu_vm_handle_moved. This fixes a BUG_ON in dma_resv_add_fence when using SDMA for page table updates. Signed-off-by: Felix Kuehling Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 76b618735dc0..b5e28fa3f414 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -404,7 +404,10 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) continue; } - r = amdgpu_vm_clear_freed(adev, vm, NULL); + /* Reserve fences for two SDMA page table updates */ + r = dma_resv_reserve_fences(resv, 2); + if (!r) + r = amdgpu_vm_clear_freed(adev, vm, NULL); if (!r) r = amdgpu_vm_handle_moved(adev, vm); -- 2.34.1
[PATCH 04/11] drm/amdgpu: Attach eviction fence on alloc
Instead of attaching the eviction fence when a KFD BO is first mapped, attach it when it is allocated or imported. This in preparation to allow KFD BOs to be mapped using the render node API. Signed-off-by: Felix Kuehling Acked-by: Christian König --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 79 +++ 1 file changed, 48 insertions(+), 31 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 54f31a420229..7c29f6c377a8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -423,6 +423,32 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain, return ret; } +static int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo, + uint32_t domain, + struct dma_fence *fence) +{ + int ret = amdgpu_bo_reserve(bo, false); + + if (ret) + return ret; + + ret = amdgpu_amdkfd_bo_validate(bo, domain, true); + if (ret) + goto unreserve_out; + + ret = dma_resv_reserve_fences(bo->tbo.base.resv, 1); + if (ret) + goto unreserve_out; + + dma_resv_add_fence(bo->tbo.base.resv, fence, + DMA_RESV_USAGE_BOOKKEEP); + +unreserve_out: + amdgpu_bo_unreserve(bo); + + return ret; +} + static int amdgpu_amdkfd_validate_vm_bo(void *_unused, struct amdgpu_bo *bo) { return amdgpu_amdkfd_bo_validate(bo, bo->allowed_domains, false); @@ -1831,6 +1857,15 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( } bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT; bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT; + } else { + mutex_lock(&avm->process_info->lock); + if (avm->process_info->eviction_fence && + !dma_fence_is_signaled(&avm->process_info->eviction_fence->base)) + ret = amdgpu_amdkfd_bo_validate_and_fence(bo, domain, + &avm->process_info->eviction_fence->base); + mutex_unlock(&avm->process_info->lock); + if (ret) + goto err_validate_bo; } if (offset) @@ -1840,6 +1875,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( allocate_init_user_pages_failed: err_pin_bo: +err_validate_bo: remove_kgd_mem_from_kfd_bo_list(*mem, avm->process_info); drm_vma_node_revoke(&gobj->vma_node, drm_priv); err_node_allow: @@ -1915,10 +1951,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu( if (unlikely(ret)) return ret; - /* The eviction fence should be removed by the last unmap. -* TODO: Log an error condition if the bo still has the eviction fence -* attached -*/ amdgpu_amdkfd_remove_eviction_fence(mem->bo, process_info->eviction_fence); pr_debug("Release VA 0x%llx - 0x%llx\n", mem->va, @@ -2047,19 +2079,6 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu( if (unlikely(ret)) goto out_unreserve; - if (mem->mapped_to_gpu_memory == 0 && - !amdgpu_ttm_tt_get_usermm(bo->tbo.ttm)) { - /* Validate BO only once. The eviction fence gets added to BO -* the first time it is mapped. Validate will wait for all -* background evictions to complete. -*/ - ret = amdgpu_amdkfd_bo_validate(bo, domain, true); - if (ret) { - pr_debug("Validate failed\n"); - goto out_unreserve; - } - } - list_for_each_entry(entry, &mem->attachments, list) { if (entry->bo_va->base.vm != avm || entry->is_mapped) continue; @@ -2086,10 +2105,6 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu( mem->mapped_to_gpu_memory); } - if (!amdgpu_ttm_tt_get_usermm(bo->tbo.ttm) && !bo->tbo.pin_count) - dma_resv_add_fence(bo->tbo.base.resv, - &avm->process_info->eviction_fence->base, - DMA_RESV_USAGE_BOOKKEEP); ret = unreserve_bo_and_vms(&ctx, false, false); goto out; @@ -2123,7 +2138,6 @@ int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu( struct amdgpu_device *adev, struct kgd_mem *mem, void *drm_priv) { struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv); - struct amdkfd_process_info *process_info = avm->process_info; unsigned long bo_size = mem->bo->tbo.base.si
[PATCH 01/11] drm/amdgpu: Fix possible null pointer dereference
abo->tbo.resource may be NULL in amdgpu_vm_bo_update. Fixes: 180253782038 ("drm/ttm: stop allocating dummy resources during BO creation") Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 46d27c87..d72daf15662f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1007,7 +1007,8 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va, struct drm_gem_object *gobj = dma_buf->priv; struct amdgpu_bo *abo = gem_to_amdgpu_bo(gobj); - if (abo->tbo.resource->mem_type == TTM_PL_VRAM) + if (abo->tbo.resource && + abo->tbo.resource->mem_type == TTM_PL_VRAM) bo = gem_to_amdgpu_bo(gobj); } mem = bo->tbo.resource; -- 2.34.1
[PATCH 00/11] Enable integration of KFD with DRM GEM_VA ioctl
This patch series enables better integration of KFD memory management with the DRM GEM ioctl API. It allow managing virtual address mappings in compute VMs with the GEM_VA ioctl after importing DMABufs exported from KFD into libdrm. This will enable more flexible virtual address management for ROCm user mode, better interoperability between compute and graphics, as well as sharing of memory between processes using DMABufs. Felix Kuehling (11): drm/amdgpu: Fix possible null pointer dereference drm/amdgpu: Reserve fences for VM update drm/amdkfd: Improve amdgpu_vm_handle_moved drm/amdgpu: Attach eviction fence on alloc drm/amdgpu: update mappings not managed by KFD drm/amdkfd: Move TLB flushing logic into amdgpu drm/amdgpu: New VM state for evicted user BOs drm/amdgpu: Auto-validate DMABuf imports in compute VMs drm/amdkfd: Export DMABufs from KFD using GEM handles drm/amdkfd: Import DMABufs for interop through DRM drm/amdkfd: Bump KFD ioctl version drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 40 +--- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 22 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 207 -- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 22 +- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 11 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 26 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 198 - drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 17 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 19 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 10 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 31 --- include/uapi/linux/kfd_ioctl.h| 3 +- 12 files changed, 424 insertions(+), 182 deletions(-) -- 2.34.1
Re: [PATCH] drm/amdkfd: clean up some inconsistent indenting
On 2023-10-12 23:21, Jiapeng Chong wrote: No functional modification involved. drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:305 svm_range_free() warn: inconsistent indenting. Reported-by: Abaci Robot Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6804 Signed-off-by: Jiapeng Chong The patch is Reviewed-by: Felix Kuehling Applied to amd-staging-drm-next. Thanks, Felix --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index f4038b33c404..eef76190800c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -302,7 +302,7 @@ static void svm_range_free(struct svm_range *prange, bool do_unmap) for (gpuidx = 0; gpuidx < MAX_GPU_INSTANCE; gpuidx++) { if (prange->dma_addr[gpuidx]) { kvfree(prange->dma_addr[gpuidx]); - prange->dma_addr[gpuidx] = NULL; + prange->dma_addr[gpuidx] = NULL; } }
Re: [PATCH v4 1/1] drm/amdkfd: get doorbell's absolute offset based on the db_size
On 2023-10-05 13:20, Arvind Yadav wrote: Here, Adding db_size in byte to find the doorbell's absolute offset for both 32-bit and 64-bit doorbell sizes. So that doorbell offset will be aligned based on the doorbell size. v2: - Addressed the review comment from Felix. v3: - Adding doorbell_size as parameter to get db absolute offset. v4: Squash the two patches into one. Cc: Christian Koenig Cc: Alex Deucher Signed-off-by: Shashank Sharma Signed-off-by: Arvind Yadav Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h| 5 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c| 13 + .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 3 ++- drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 10 -- .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 3 ++- 5 files changed, 24 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h index 09f6727e7c73..4a8b33f55f6b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h @@ -357,8 +357,9 @@ int amdgpu_doorbell_init(struct amdgpu_device *adev); void amdgpu_doorbell_fini(struct amdgpu_device *adev); int amdgpu_doorbell_create_kernel_doorbells(struct amdgpu_device *adev); uint32_t amdgpu_doorbell_index_on_bar(struct amdgpu_device *adev, - struct amdgpu_bo *db_bo, - uint32_t doorbell_index); + struct amdgpu_bo *db_bo, + uint32_t doorbell_index, + uint32_t db_size); #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index)) #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v)) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c index da4be0bbb446..6690f5a72f4d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c @@ -114,19 +114,24 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v) * @adev: amdgpu_device pointer * @db_bo: doorbell object's bo * @db_index: doorbell relative index in this doorbell object + * @db_size: doorbell size is in byte * * returns doorbell's absolute index in BAR */ uint32_t amdgpu_doorbell_index_on_bar(struct amdgpu_device *adev, - struct amdgpu_bo *db_bo, - uint32_t doorbell_index) + struct amdgpu_bo *db_bo, + uint32_t doorbell_index, + uint32_t db_size) { int db_bo_offset; db_bo_offset = amdgpu_bo_gpu_offset_no_check(db_bo); - /* doorbell index is 32 bit but doorbell's size is 64-bit, so *2 */ - return db_bo_offset / sizeof(u32) + doorbell_index * 2; + /* doorbell index is 32 bit but doorbell's size can be 32 bit +* or 64 bit, so *db_size(in byte)/4 for alignment. +*/ + return db_bo_offset / sizeof(u32) + doorbell_index * + DIV_ROUND_UP(db_size, 4); } /** diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index 0d3d538b64eb..e07652e72496 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -407,7 +407,8 @@ static int allocate_doorbell(struct qcm_process_device *qpd, q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev, qpd->proc_doorbells, - q->doorbell_id); + q->doorbell_id, + dev->kfd->device_info.doorbell_size); return 0; } diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c index 7b38537c7c99..05c74887fd6f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c @@ -161,7 +161,10 @@ void __iomem *kfd_get_kernel_doorbell(struct kfd_dev *kfd, if (inx >= KFD_MAX_NUM_OF_QUEUES_PER_PROCESS) return NULL; - *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, kfd->doorbells, inx); + *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, +kfd->doorbells, +inx, + kfd->device_info.doorbell_size); inx *= 2;
Re: [PATCH v3 2/2] drm/amdkfd: get doorbell's absolute offset based on the db size
On 2023-10-04 12:16, Arvind Yadav wrote: This patch is to align the absolute doorbell offset based on the doorbell's size. So that doorbell offset will be aligned for both 32 bit and 64 bit. v2: - Addressed the review comment from Felix. v3: - Adding doorbell_size as parameter to get db absolute offset. Cc: Christian Koenig Cc: Alex Deucher Signed-off-by: Shashank Sharma Signed-off-by: Arvind Yadav The final result looks good to me. But please squash the two patches into one. The first patch on its own breaks the build, and that's something we don't want to commit to the branch history as it makes tracking regressions (e.g. with git bisect) very hard or impossible. More nit-picks inline. --- .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 6 +- drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 13 +++-- .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 4 +++- 3 files changed, 19 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index 0d3d538b64eb..690ff131fe4b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -346,6 +346,7 @@ static int allocate_doorbell(struct qcm_process_device *qpd, uint32_t const *restore_id) { struct kfd_node *dev = qpd->dqm->dev; + uint32_t doorbell_size; if (!KFD_IS_SOC15(dev)) { /* On pre-SOC15 chips we need to use the queue ID to @@ -405,9 +406,12 @@ static int allocate_doorbell(struct qcm_process_device *qpd, } } + doorbell_size = dev->kfd->device_info.doorbell_size; + q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev, qpd->proc_doorbells, - q->doorbell_id); + q->doorbell_id, + doorbell_size); You don't need a local variable for doorbell size that's only used once. Just pass dev->kfd->device_info.doorbell_size directly. return 0; } diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c index 7b38537c7c99..59dd76c4b138 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c @@ -161,7 +161,10 @@ void __iomem *kfd_get_kernel_doorbell(struct kfd_dev *kfd, if (inx >= KFD_MAX_NUM_OF_QUEUES_PER_PROCESS) return NULL; - *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, kfd->doorbells, inx); + *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, +kfd->doorbells, +inx, + kfd->device_info.doorbell_size); inx *= 2; pr_debug("Get kernel queue doorbell\n" @@ -233,6 +236,7 @@ phys_addr_t kfd_get_process_doorbells(struct kfd_process_device *pdd) { struct amdgpu_device *adev = pdd->dev->adev; uint32_t first_db_index; + uint32_t doorbell_size; if (!pdd->qpd.proc_doorbells) { if (kfd_alloc_process_doorbells(pdd->dev->kfd, pdd)) @@ -240,7 +244,12 @@ phys_addr_t kfd_get_process_doorbells(struct kfd_process_device *pdd) return 0; } - first_db_index = amdgpu_doorbell_index_on_bar(adev, pdd->qpd.proc_doorbells, 0); + doorbell_size = pdd->dev->kfd->device_info.doorbell_size; + + first_db_index = amdgpu_doorbell_index_on_bar(adev, + pdd->qpd.proc_doorbells, + 0, + doorbell_size); Same as above, no local variable needed. return adev->doorbell.base + first_db_index * sizeof(uint32_t); } diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c index adb5e4bdc0b2..010cd8e8e6a1 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c @@ -375,9 +375,11 @@ int pqm_create_queue(struct process_queue_manager *pqm, * relative doorbell index = Absolute doorbell index - * absolute index of first doorbell in the page. */ + uint32_t doorbell_size = pdd->dev->kfd->device_info.doorbell_size; uint32_t first_db_index = amdgpu_doorbell_index_on_bar(pdd->dev->adev, pdd->qpd.proc_doorbells, -
Re: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8
On 2023-09-28 11:38, Shashank Sharma wrote: Hello Felix, Mukul, On 28/09/2023 17:30, Felix Kuehling wrote: On 2023-09-28 10:30, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Yadav, Arvind Sent: Thursday, September 28, 2023 5:54 AM To: Koenig, Christian ; Deucher, Alexander ; Sharma, Shashank ; Kuehling, Felix ; Joshi, Mukul ; Pan, Xinhui ; airl...@gmail.com; dan...@ffwll.ch Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux- ker...@vger.kernel.org; Yadav, Arvind ; Koenig, Christian Subject: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8 This patch is to adjust the absolute doorbell offset against the doorbell id considering the doorbell size of 32/64 bit. v2: - Addressed the review comment from Felix. Cc: Christian Koenig Cc: Alex Deucher Signed-off-by: Shashank Sharma Signed-off-by: Arvind Yadav --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index 0d3d538b64eb..c54c4392d26e 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -407,7 +407,14 @@ static int allocate_doorbell(struct qcm_process_device *qpd, q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev- adev, qpd- proc_doorbells, - q- doorbell_id); + 0); + It looks like amdgpu_doorbell_index_on_bar() works only for 64-bit doorbells. Shouldn't it work for both 32-bit and 64-bit doorbells considering this is common doorbell manager code? Yes, You are right that the calculations to find a particular doorbell in the doorbell page considers a doorbell width of 64-bit. I could see this argument going either way. KFD is the only one that cares about managing doorbells for user mode queues on GFXv8 GPUs. This is not a use case that amdgpu cares about. So I'm OK with KFD doing its own address calculations to make sure doorbells continue to work on GFXv8. It may not be worth adding complexity to the common doorbell manager code to support legacy GPUs with 32-bit doorbells. I was thinking about adding an additional input parameter which will indicate if the doorbell width is 32-bit vs 64-bit (like is_doorbell_64_bit), and doorbell manager can alter the multiplier while calculating the final offset. Please let me know if that will work for both the cases. Yes, that would work for KFD because we already have the doorbell size in our device-info structure. Instead of making it a boolean flag, you could make it a doorbell_size parameter, in byte or dword units to simplify the pointer math. Regards, Felix - Shashank Regards, Felix Thanks, Mukul + /* Adjust the absolute doorbell offset against the doorbell id considering + * the doorbell size of 32/64 bit. + */ + q->properties.doorbell_off += q->doorbell_id * + dev->kfd->device_info.doorbell_size / 4; + return 0; } -- 2.34.1
Re: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8
On 2023-09-28 10:30, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Yadav, Arvind Sent: Thursday, September 28, 2023 5:54 AM To: Koenig, Christian ; Deucher, Alexander ; Sharma, Shashank ; Kuehling, Felix ; Joshi, Mukul ; Pan, Xinhui ; airl...@gmail.com; dan...@ffwll.ch Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux- ker...@vger.kernel.org; Yadav, Arvind ; Koenig, Christian Subject: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8 This patch is to adjust the absolute doorbell offset against the doorbell id considering the doorbell size of 32/64 bit. v2: - Addressed the review comment from Felix. Cc: Christian Koenig Cc: Alex Deucher Signed-off-by: Shashank Sharma Signed-off-by: Arvind Yadav --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index 0d3d538b64eb..c54c4392d26e 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -407,7 +407,14 @@ static int allocate_doorbell(struct qcm_process_device *qpd, q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev- adev, qpd- proc_doorbells, - q- doorbell_id); + 0); + It looks like amdgpu_doorbell_index_on_bar() works only for 64-bit doorbells. Shouldn't it work for both 32-bit and 64-bit doorbells considering this is common doorbell manager code? I could see this argument going either way. KFD is the only one that cares about managing doorbells for user mode queues on GFXv8 GPUs. This is not a use case that amdgpu cares about. So I'm OK with KFD doing its own address calculations to make sure doorbells continue to work on GFXv8. It may not be worth adding complexity to the common doorbell manager code to support legacy GPUs with 32-bit doorbells. Regards, Felix Thanks, Mukul + /* Adjust the absolute doorbell offset against the doorbell id considering + * the doorbell size of 32/64 bit. + */ + q->properties.doorbell_off += q->doorbell_id * + dev->kfd->device_info.doorbell_size / 4; + return 0; } -- 2.34.1
Re: [PATCH 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8
[+Mukul] On 2023-09-27 12:16, Arvind Yadav wrote: This patch is to adjust the absolute doorbell offset against the doorbell id considering the doorbell size of 32/64 bit. Cc: Christian Koenig Cc: Alex Deucher Signed-off-by: Shashank Sharma Signed-off-by: Arvind Yadav --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index 0d3d538b64eb..c327f7f604d7 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -407,7 +407,16 @@ static int allocate_doorbell(struct qcm_process_device *qpd, q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev, qpd->proc_doorbells, - q->doorbell_id); + 0); Looks like we're now always calling amdgpu_doorbell_index_on_bar with the third parameter = 0. So we could remove that parameter. It doesn't do us any good and only causes bugs if we use any non-0 value. + + /* Adjust the absolute doorbell offset against the doorbell id considering +* the doorbell size of 32/64 bit. +*/ + if (!KFD_IS_SOC15(dev)) + q->properties.doorbell_off += q->doorbell_id; + else + q->properties.doorbell_off += q->doorbell_id * 2; This could be simplified and generalized as q->properties.doorbell_off += q->doorbell_id * dev->kfd->device_info.doorbell_size / 4; Regards, Felix + return 0; }
Re: [PATCH v2 1/2] drm/amdgpu: Merge debug module parameters
On 2023-08-30 18:08, André Almeida wrote: Merge all developer debug options available as separated module parameters in one, making it obvious that are for developers. Drop the obsolete module options in favor of the new ones. Signed-off-by: André Almeida --- v2: - drop old module params - use BIT() macros - replace global var with adev-> vars --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 48 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_crat.c| 2 +- drivers/gpu/drm/amd/include/amd_shared.h | 8 8 files changed, 45 insertions(+), 25 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 4de074243c4d..82eaccfce347 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1101,6 +1101,10 @@ struct amdgpu_device { booldc_enabled; /* Mask of active clusters */ uint32_taid_mask; + + /* Debug */ + booldebug_vm; + booldebug_largebar; }; static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index fb78a8f47587..8a26bed76505 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1191,7 +1191,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p) job->vm_pd_addr = amdgpu_gmc_pd_addr(vm->root.bo); } - if (amdgpu_vm_debug) { + if (adev->debug_vm) { /* Invalidate all BOs to test for userspace bugs */ amdgpu_bo_list_for_each_entry(e, p->bo_list) { struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index f5856b82605e..0cd48c025433 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -140,7 +140,6 @@ int amdgpu_vm_size = -1; int amdgpu_vm_fragment_size = -1; int amdgpu_vm_block_size = -1; int amdgpu_vm_fault_stop; -int amdgpu_vm_debug; int amdgpu_vm_update_mode = -1; int amdgpu_exp_hw_support; int amdgpu_dc = -1; @@ -194,6 +193,7 @@ int amdgpu_use_xgmi_p2p = 1; int amdgpu_vcnfw_log; int amdgpu_sg_display = -1; /* auto */ int amdgpu_user_partt_mode = AMDGPU_AUTO_COMPUTE_PARTITION_MODE; +uint amdgpu_debug_mask; static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work); @@ -405,13 +405,6 @@ module_param_named(vm_block_size, amdgpu_vm_block_size, int, 0444); MODULE_PARM_DESC(vm_fault_stop, "Stop on VM fault (0 = never (default), 1 = print first, 2 = always)"); module_param_named(vm_fault_stop, amdgpu_vm_fault_stop, int, 0444); -/** - * DOC: vm_debug (int) - * Debug VM handling (0 = disabled, 1 = enabled). The default is 0 (Disabled). - */ -MODULE_PARM_DESC(vm_debug, "Debug VM handling (0 = disabled (default), 1 = enabled)"); -module_param_named(vm_debug, amdgpu_vm_debug, int, 0644); This parameter used to be writable, which means it could be changed through sysfs after loading the module. Code looking at the global variable would see the last value written by user mode. With your changes, this is no longer writable, and driver code is now looking at adev->debug_vm, which cannot be updated through sysfs. As long as everyone is OK with that change, I have no objections. Just pointing it out. Regardless, this patch is Acked-by: Felix Kuehling - /** * DOC: vm_update_mode (int) * Override VM update mode. VM updated by using CPU (0 = never, 1 = Graphics only, 2 = Compute only, 3 = Both). The default @@ -743,18 +736,6 @@ module_param(send_sigterm, int, 0444); MODULE_PARM_DESC(send_sigterm, "Send sigterm to HSA process on unhandled exception (0 = disable, 1 = enable)"); -/** - * DOC: debug_largebar (int) - * Set debug_largebar as 1 to enable simulating large-bar capability on non-large bar - * system. This limits the VRAM size reported to ROCm applications to the visible - * size, usually 256MB. - * Default value is 0, diabled. - */ -int debug_largebar; -module_param(debug_largebar, int, 0444); -MODULE_PARM_DESC(debug_largebar, - "Debug large-bar flag used to simulate large-bar capability on non-large bar machine (0 = disable, 1 = enable)"); - /** * DOC: halt_if_hws_hang (int) * Halt if HWS hang is detected. Default value, 0, disables the halt on hang. @@ -938,6 +919,18 @@ module_param_named(user_partt_mode, amdgpu_user_partt_mo
Re: [PATCH] drm/prime: Support page array >= 4GB
On 2023-08-21 16:02, Philip Yang wrote: Without unsigned long typecast, the size is passed in as zero if page array size >= 4GB, nr_pages >= 0x10, then sg list converted will have the first and the last chunk lost. Signed-off-by: Philip Yang The patch looks reasonable to me. I don't have authority to approve it. But FWIW, Acked-by: Felix Kuehling Can anyone give a Reviewed-by? Thanks, Felix --- drivers/gpu/drm/drm_prime.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index f924b8b4ab6b..2630ad2e504d 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -830,7 +830,7 @@ struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev, if (max_segment == 0) max_segment = UINT_MAX; err = sg_alloc_table_from_pages_segment(sg, pages, nr_pages, 0, - nr_pages << PAGE_SHIFT, + (unsigned long)nr_pages << PAGE_SHIFT, max_segment, GFP_KERNEL); if (err) { kfree(sg);
Re: [PATCH AUTOSEL 5.15 6/6] drm/amdkfd: ignore crat by default
On 2023-08-22 11:41, Deucher, Alexander wrote: [Public] -Original Message- From: Sasha Levin Sent: Tuesday, August 22, 2023 7:37 AM To: linux-ker...@vger.kernel.org; sta...@vger.kernel.org Cc: Deucher, Alexander ; Kuehling, Felix ; Koenig, Christian ; Mike Lothian ; Sasha Levin ; Pan, Xinhui ; airl...@gmail.com; dan...@ffwll.ch; amd- g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org Subject: [PATCH AUTOSEL 5.15 6/6] drm/amdkfd: ignore crat by default From: Alex Deucher [ Upstream commit a6dea2d64ff92851e68cd4e20a35f6534286e016 ] We are dropping the IOMMUv2 path, so no need to enable this. It's often buggy on consumer platforms anyway. This is not needed for stable. I agree. I was about to comment in the 5.10 patch as well. Regards, Felix Alex Reviewed-by: Felix Kuehling Acked-by: Christian König Tested-by: Mike Lothian Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin --- drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c index e574aa32a111d..46dfd9baeb013 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c @@ -1523,11 +1523,7 @@ static bool kfd_ignore_crat(void) if (ignore_crat) return true; -#ifndef KFD_SUPPORT_IOMMU_V2 ret = true; -#else - ret = false; -#endif return ret; } -- 2.40.1
Re: [PATCH] drm/prime: Support page array >= 4GB
On 2023-08-23 01:49, Christian König wrote: Am 22.08.23 um 20:27 schrieb Philip Yang: On 2023-08-22 05:43, Christian König wrote: Am 21.08.23 um 22:02 schrieb Philip Yang: Without unsigned long typecast, the size is passed in as zero if page array size >= 4GB, nr_pages >= 0x10, then sg list converted will have the first and the last chunk lost. Good catch, but I'm not sure if this is enough to make it work. Additional to that I don't think we have an use case for BOs > 4GiB. >4GB buffer is normal for compute applications, the issue is reported by "Maelstrom generated exerciser detects micompares when GPU accesses larger remote GPU memory." on GFX 9.4.3 APU, which uses GTT domain to allocate VRAM, and trigger the bug in this drm prime helper. With this fix, the test passed. Why is the application allocating all the data as a single BO? Usually you have a single texture, image, array etc... in a single BO but this here looks a bit like the application tries to allocate all their memory in a single BO (could of course be that this isn't the case and that's really just one giant data structure). Compute applications work with pretty big data structures. For example huge multi-dimensional matrices are not uncommon in large machine-learning models. Swapping such large BOs out at once is quite impractical, so should we ever have an use case like suspend/resume or checkpoint/restore with this it will most likely fail. Checkpointing and restoring multiple GB at a time should not be a problem. I'm pretty sure we have tested that. On systems with 100s of GBs of memory, HBM memory bandwidth approaching TB/s and PCIe/CXL bus bandwidths going into 10s of GB/s, dealing with multi-GB BOs should not be a fundamental problem. That said, if you wanted to impose limits on the size of single allocations, then I would expect some policy somewhere that prohibits large allocations. On the contrary, I see long or 64-bit data types all over the VRAM manager and TTM code, which tells me that >4GB allocations must be part of the plan. This patch is clearly addressing a bug in the code that results in data corruption when mapping large BOs on multiple GPUs. You could address this with an allocation policy change, if you want, and leave the bug in place. Then we have to update ROCm user mode to break large allocations into multiple BOs. It would break applications that try to share such large allocations via DMABufs (e.g. with an RDMA NIC), because it would become impossible to share large allocations with a single DMABuf handle. Regards, Felix Christian. Regards, Philip Christian. Signed-off-by: Philip Yang --- drivers/gpu/drm/drm_prime.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index f924b8b4ab6b..2630ad2e504d 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -830,7 +830,7 @@ struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev, if (max_segment == 0) max_segment = UINT_MAX; err = sg_alloc_table_from_pages_segment(sg, pages, nr_pages, 0, - nr_pages << PAGE_SHIFT, + (unsigned long)nr_pages << PAGE_SHIFT, max_segment, GFP_KERNEL); if (err) { kfree(sg);
Re: Implement svm without BO concept in xe driver
On 2023-08-21 15:41, Zeng, Oak wrote: I have thought about emulating BO allocation APIs on top of system SVM. This was in the context of KFD where memory management is not tied into command submissions APIs, which would add a whole other layer of complexity. The main unsolved (unsolvable?) problem I ran into was, that there is no way to share SVM memory as DMABufs. So there is no good way to support applications that expect to share memory in that way. Great point. I also discussed the dmabuf thing with Mike (cc'ed). dmabuf is a particular technology created specially for the BO driver (and other driver) to share buffer b/t devices. Hmm/system SVM doesn't need this technology: malloc'ed memory by the nature is already shared b/t different devices (in one process) and CPU. We just can simply submit GPU kernel to all devices with malloc'ed memory and let kmd decide the memory placement (such as map in place or migrate). No need of buffer export/import in hmm/system SVM world. I disagree. DMABuf can be used for sharing memory between processes. And it can be used for sharing memory with 3rd-party devices via PCIe P2P (e.g. a Mellanox NIC). You cannot easily do that with malloc'ed memory. POSIX IPC requires that you know that you'll be sharing the memory at allocation time. It adds overhead. And because it's file-backed, it's currently incompatible with migration. And HMM currently doesn't have a solution for P2P. Any access by a different device causes a migration to system memory. Regards, Felix So yes from buffer sharing perspective, the design philosophy is also very different. Thanks, Oak
Re: Implement svm without BO concept in xe driver
On 2023-08-21 11:10, Zeng, Oak wrote: Accidently deleted Brian. Add back. Thanks, Oak -Original Message- From: Zeng, Oak Sent: August 21, 2023 11:07 AM To: Dave Airlie Cc: Brost, Matthew ; Thomas Hellström ; Philip Yang ; Felix Kuehling ; dri-devel@lists.freedesktop.org; intel- x...@lists.freedesktop.org; Vishwanathapura, Niranjana ; Christian König Subject: RE: Implement svm without BO concept in xe driver -Original Message- From: dri-devel On Behalf Of Dave Airlie Sent: August 20, 2023 6:21 PM To: Zeng, Oak Cc: Brost, Matthew ; Thomas Hellström ; Philip Yang ; Felix Kuehling ; Welty, Brian ; dri- de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Vishwanathapura, Niranjana ; Christian König Subject: Re: Implement svm without BO concept in xe driver On Thu, 17 Aug 2023 at 12:13, Zeng, Oak wrote: -Original Message- From: Dave Airlie Sent: August 16, 2023 6:52 PM To: Felix Kuehling Cc: Zeng, Oak ; Christian König ; Thomas Hellström ; Brost, Matthew ; maarten.lankho...@linux.intel.com; Vishwanathapura, Niranjana ; Welty, Brian ; Philip Yang ; intel- x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org Subject: Re: Implement svm without BO concept in xe driver On Thu, 17 Aug 2023 at 08:15, Felix Kuehling wrote: On 2023-08-16 13:30, Zeng, Oak wrote: I spoke with Thomas. We discussed two approaches: 1) make ttm_resource a central place for vram management functions such as eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM codes call into ttm_resource_alloc/free functions for vram allocation/free. *This way BO driver and SVM driver shares the eviction/cgroup logic, no need to reimplment LRU eviction list in SVM driver. Cgroup logic should be in ttm_resource layer. +Maarten. *ttm_resource is not a perfect match for SVM to allocate vram. It is still a big overhead. The *bo* member of ttm_resource is not needed for SVM - this might end up with invasive changes to ttm...need to look into more details Overhead is a problem. We'd want to be able to allocate, free and evict memory at a similar granularity as our preferred migration and page fault granularity, which defaults to 2MB in our SVM implementation. 2) svm code allocate memory directly from drm-buddy allocator, and expose memory eviction functions from both ttm and svm so they can evict memory from each other. For example, expose the ttm_mem_evict_first function from ttm side so hmm/svm code can call it; expose a similar function from svm side so ttm can evict hmm memory. I like this option. One thing that needs some thought with this is how to get some semblance of fairness between the two types of clients. Basically how to choose what to evict. And what share of the available memory does each side get to use on average. E.g. an idle client may get all its memory evicted while a busy client may get a bigger share of the available memory. I'd also like to suggest we try to write any management/generic code in driver agnostic way as much as possible here. I don't really see much hw difference should be influencing it. I do worry about having effectively 2 LRUs here, you can't really have two "leasts". Like if we hit the shrinker paths who goes first? do we shrink one object from each side in turn? One way to solve this fairness problem is to create a driver agnostic drm_vram_mgr. Maintain a single LRU in drm_vram_mgr. Move the memory eviction/cgroups memory accounting logic from ttm_resource manager to drm_vram_mgr. Both BO-based driver and SVM driver calls to drm_vram_mgr to allocate/free memory. I am not sure whether this meets the 2M allocate/free/evict granularity requirement Felix mentioned above. SVM can allocate 2M size blocks. But BO driver should be able to allocate any arbitrary sized blocks - So the eviction is also arbitrary size. Also will we have systems where we can expose system SVM but userspace may choose to not use the fine grained SVM and use one of the older modes, will that path get emulated on top of SVM or use the BO paths? If by "older modes" you meant the gem_bo_create (such as xe_gem_create or amdgpu_gem_create), then today both amd and intel implement those interfaces using BO path. We don't have a plan to emulate that old mode on tope of SVM, afaict. I'm not sure how the older modes manifest in the kernel I assume as bo creates (but they may use userptr), SVM isn't a specific thing, it's a group of 3 things. 1) coarse-grained SVM which I think is BO 2) fine-grained SVM which is page level 3) fine-grained system SVM which is HMM I suppose I'm asking about the previous versions and how they would operate in a system SVM capable system. I got your question now. As I understand it, the system SVM provides similar functionality as BO-based SVM (i.e., share virtual address space b/t cpu and gpu program, no explici
Re: Implement svm without BO concept in xe driver
On 2023-08-18 12:10, Zeng, Oak wrote: Thanks Thomas. I will then look into more details of option 3: * create a lean drm layer vram manager, a central control place for vram eviction and cgroup accounting. Single LRU for eviction fairness. * pretty much move the current ttm_resource eviction/cgroups logic to drm layer * the eviction/allocation granularity should be flexible so svm can do 2M while ttm can do arbitrary size SVM will need smaller sizes too, for VMAs that are smaller or not aligned to 2MB size. Regards, Felix * both ttm_resource and svm code should call the new drm_vram_manager for eviction/accounting I will come back with some RFC proof of concept codes later. Cheers, Oak -Original Message- From: Thomas Hellström Sent: August 18, 2023 3:36 AM To: Zeng, Oak ; Dave Airlie ; Felix Kuehling Cc: Christian König ; Brost, Matthew ; maarten.lankho...@linux.intel.com; Vishwanathapura, Niranjana ; Welty, Brian ; Philip Yang ; intel- x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org Subject: Re: Implement svm without BO concept in xe driver On 8/17/23 04:12, Zeng, Oak wrote: -Original Message- From: Dave Airlie Sent: August 16, 2023 6:52 PM To: Felix Kuehling Cc: Zeng, Oak ; Christian König ; Thomas Hellström ; Brost, Matthew ; maarten.lankho...@linux.intel.com; Vishwanathapura, Niranjana ; Welty, Brian ; Philip Yang ; intel- x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org Subject: Re: Implement svm without BO concept in xe driver On Thu, 17 Aug 2023 at 08:15, Felix Kuehling wrote: On 2023-08-16 13:30, Zeng, Oak wrote: I spoke with Thomas. We discussed two approaches: 1) make ttm_resource a central place for vram management functions such as eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM codes call into ttm_resource_alloc/free functions for vram allocation/free. *This way BO driver and SVM driver shares the eviction/cgroup logic, no need to reimplment LRU eviction list in SVM driver. Cgroup logic should be in ttm_resource layer. +Maarten. *ttm_resource is not a perfect match for SVM to allocate vram. It is still a big overhead. The *bo* member of ttm_resource is not needed for SVM - this might end up with invasive changes to ttm...need to look into more details Overhead is a problem. We'd want to be able to allocate, free and evict memory at a similar granularity as our preferred migration and page fault granularity, which defaults to 2MB in our SVM implementation. 2) svm code allocate memory directly from drm-buddy allocator, and expose memory eviction functions from both ttm and svm so they can evict memory from each other. For example, expose the ttm_mem_evict_first function from ttm side so hmm/svm code can call it; expose a similar function from svm side so ttm can evict hmm memory. I like this option. One thing that needs some thought with this is how to get some semblance of fairness between the two types of clients. Basically how to choose what to evict. And what share of the available memory does each side get to use on average. E.g. an idle client may get all its memory evicted while a busy client may get a bigger share of the available memory. I'd also like to suggest we try to write any management/generic code in driver agnostic way as much as possible here. I don't really see much hw difference should be influencing it. I do worry about having effectively 2 LRUs here, you can't really have two "leasts". Like if we hit the shrinker paths who goes first? do we shrink one object from each side in turn? One way to solve this fairness problem is to create a driver agnostic drm_vram_mgr. Maintain a single LRU in drm_vram_mgr. Move the memory eviction/cgroups memory accounting logic from ttm_resource manager to drm_vram_mgr. Both BO-based driver and SVM driver calls to drm_vram_mgr to allocate/free memory. I am not sure whether this meets the 2M allocate/free/evict granularity requirement Felix mentioned above. SVM can allocate 2M size blocks. But BO driver should be able to allocate any arbitrary sized blocks - So the eviction is also arbitrary size. This is not far from what a TTM resource manager does with TTM resources, only made generic at the drm level, and making the "resource" as lean as possible. With 2M granularity this seems plausible. Also will we have systems where we can expose system SVM but userspace may choose to not use the fine grained SVM and use one of the older modes, will that path get emulated on top of SVM or use the BO paths? If by "older modes" you meant the gem_bo_create (such as xe_gem_create or amdgpu_gem_create), then today both amd and intel implement those interfaces using BO path. We don't have a plan to emulate that old mode on tope of SVM, afaict. I think we might end up emulating "older modes" on top of SVM at some point, not to f
Re: Implement svm without BO concept in xe driver
On 2023-08-16 13:30, Zeng, Oak wrote: I spoke with Thomas. We discussed two approaches: 1) make ttm_resource a central place for vram management functions such as eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM codes call into ttm_resource_alloc/free functions for vram allocation/free. *This way BO driver and SVM driver shares the eviction/cgroup logic, no need to reimplment LRU eviction list in SVM driver. Cgroup logic should be in ttm_resource layer. +Maarten. *ttm_resource is not a perfect match for SVM to allocate vram. It is still a big overhead. The *bo* member of ttm_resource is not needed for SVM - this might end up with invasive changes to ttm...need to look into more details Overhead is a problem. We'd want to be able to allocate, free and evict memory at a similar granularity as our preferred migration and page fault granularity, which defaults to 2MB in our SVM implementation. 2) svm code allocate memory directly from drm-buddy allocator, and expose memory eviction functions from both ttm and svm so they can evict memory from each other. For example, expose the ttm_mem_evict_first function from ttm side so hmm/svm code can call it; expose a similar function from svm side so ttm can evict hmm memory. I like this option. One thing that needs some thought with this is how to get some semblance of fairness between the two types of clients. Basically how to choose what to evict. And what share of the available memory does each side get to use on average. E.g. an idle client may get all its memory evicted while a busy client may get a bigger share of the available memory. Regards, Felix Today we don't know which approach is better. I will work on some prove of concept codes, starting with #1 approach firstly. Btw, I talked with application engineers and they said most applications actually use a mixture of gem_bo create and malloc, so we definitely need to solve this problem. Cheers, Oak -Original Message- From: Christian König Sent: August 16, 2023 2:06 AM To: Zeng, Oak ; Felix Kuehling ; Thomas Hellström ; Brost, Matthew ; Vishwanathapura, Niranjana ; Welty, Brian ; Philip Yang ; intel...@lists.freedesktop.org; dri- de...@lists.freedesktop.org Subject: Re: Implement svm without BO concept in xe driver Hi Oak, yeah, I completely agree with you and Felix. The main problem here is getting the memory pressure visible on both sides. At the moment I have absolutely no idea how to handle that, maybe something like the ttm_resource object shared between TTM and HMM? Regards, Christian. Am 16.08.23 um 05:47 schrieb Zeng, Oak: Hi Felix, It is great to hear from you! When I implement the HMM-based SVM for intel devices, I found this interesting problem: HMM uses struct page based memory management scheme which is completely different against the BO/TTM style memory management philosophy. Writing SVM code upon the BO/TTM concept seems overkill and awkward. So I thought we better make the SVM code BO-less and TTM-less. But on the other hand, currently vram eviction and cgroup memory accounting are all hooked to the TTM layer, which means a TTM-less SVM driver won't be able to evict vram allocated through TTM/gpu_vram_mgr. Ideally HMM migration should use drm-buddy for vram allocation, but we need to solve this TTM/HMM mutual eviction problem as you pointed out (I am working with application engineers to figure out whether mutual eviction can truly benefit applications). Maybe we can implement a TTM-less vram management block which can be shared b/t the HMM-based driver and the BO- based driver: * allocate/free memory from drm-buddy, buddy-block based * memory eviction logics, allow driver to specify which allocation is evictable * memory accounting, cgroup logic Maybe such a block can be placed at drm layer (say, call it drm_vram_mgr for now), so it can be shared b/t amd and intel. So I involved amd folks. Today both amd and intel-xe driver implemented a TTM-based vram manager which doesn't serve above design goal. Once the drm_vram_mgr is implemented, both amd and intel's BO-based/TTM-based vram manager, and the HMM-based vram manager can call into this drm-vram-mgr. Thanks again, Oak -Original Message- From: Felix Kuehling Sent: August 15, 2023 6:17 PM To: Zeng, Oak ; Thomas Hellström ; Brost, Matthew ; Vishwanathapura, Niranjana ; Welty, Brian ; Christian König ; Philip Yang ; intel...@lists.freedesktop.org; dri- de...@lists.freedesktop.org Subject: Re: Implement svm without BO concept in xe driver Hi Oak, I'm not sure what you're looking for from AMD? Are we just CC'ed FYI? Or are you looking for comments about * Our plans for VRAM management with HMM * Our experience with BO-based VRAM management * Something else? IMO, having separate memory pools for HMM and TTM is a non-starter for AMD. We need access to
Re: Implement svm without BO concept in xe driver
Hi Oak, I'm not sure what you're looking for from AMD? Are we just CC'ed FYI? Or are you looking for comments about * Our plans for VRAM management with HMM * Our experience with BO-based VRAM management * Something else? IMO, having separate memory pools for HMM and TTM is a non-starter for AMD. We need access to the full VRAM in either of the APIs for it to be useful. That also means, we need to handle memory pressure in both directions. That's one of the main reasons we went with the BO-based approach initially. I think in the long run, using the buddy allocator, or the amdgpu_vram_mgr directly for HMM migrations would be better, assuming we can handle memory pressure in both directions between HMM and TTM sharing the same pool of physical memory. Regards, Felix On 2023-08-15 16:34, Zeng, Oak wrote: Also + Christian Thanks, Oak *From:*Intel-xe *On Behalf Of *Zeng, Oak *Sent:* August 14, 2023 11:38 PM *To:* Thomas Hellström ; Brost, Matthew ; Vishwanathapura, Niranjana ; Welty, Brian ; Felix Kuehling ; Philip Yang ; intel...@lists.freedesktop.org; dri-devel@lists.freedesktop.org *Subject:* [Intel-xe] Implement svm without BO concept in xe driver Hi Thomas, Matt and all, This came up when I port i915 svm codes to xe driver. In i915 implementation, we have i915_buddy manage gpu vram and svm codes directly call i915_buddy layer to allocate/free vram. There is no gem_bo/ttm bo concept involved in the svm implementation. In xe driver, we have drm_buddy, xe_ttm_vram_mgr and ttm layer to manage vram. Drm_buddy is initialized during xe_ttm_vram_mgr initialization. Vram allocation/free is done through xe_ttm_vram_mgr functions which call into drm_buddy layer to allocate vram blocks. I plan to implement xe svm driver the same way as we did in i915, which means there will not be bo concept in the svm implementation. Drm_buddy will be passed to svm layer during vram initialization and svm will allocate/free memory directly from drm_buddy, bypassing ttm/xee vram manager. Here are a few considerations/things we are aware of: 1. This approach seems match hmm design better than bo concept. Our svm implementation will be based on hmm. In hmm design, each vram page is backed by a struct page. It is very easy to perform page granularity migrations (b/t vram and system memory). If BO concept is involved, we will have to split/remerge BOs during page granularity migrations. 2. We have a prove of concept of this approach in i915, originally implemented by Niranjana. It seems work but it only has basic functionalities for now. We don’t have advanced features such as memory eviction etc. 3. With this approach, vram will divided into two separate pools: one for xe_gem_created BOs and one for vram used by svm. Those two pools are not connected: memory pressure from one pool won’t be able to evict vram from another pool. At this point, we don’t whether this aspect is good or not. 4. Amdkfd svm went different approach which is BO based. The benefit of this approach is a lot of existing driver facilities (such as memory eviction/cgroup/accounting) can be reused Do you have any comment to this approach? Should I come back with a RFC of some POC codes? Thanks, Oak
Re: [PATCH] drm/amdkfd: fix build failure without CONFIG_DYNAMIC_DEBUG
On 2023-08-04 9:29, Arnd Bergmann wrote: From: Arnd Bergmann When CONFIG_DYNAMIC_DEBUG is disabled altogether, calling _dynamic_func_call_no_desc() does not work: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c: In function 'svm_range_set_attr': drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:52:9: error: implicit declaration of function '_dynamic_func_call_no_desc' [-Werror=implicit-function-declaration] 52 | _dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms) | ^~ drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:3564:9: note: in expansion of macro 'dynamic_svm_range_dump' 3564 | dynamic_svm_range_dump(svms); | ^~ Add a compile-time conditional in addition to the runtime check. Fixes: 8923137dbe4b2 ("drm/amdkfd: avoid svm dump when dynamic debug disabled") Signed-off-by: Arnd Bergmann The patch is Reviewed-by: Felix Kuehling I'm applying it to amd-staging-drm-next. Thanks, Felix --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 308384dbc502d..44e710821b6d9 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -23,6 +23,7 @@ #include #include +#include #include #include @@ -48,8 +49,13 @@ * page table is updated. */ #define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING (2UL * NSEC_PER_MSEC) +#if IS_ENABLED(CONFIG_DYNAMIC_DEBUG) #define dynamic_svm_range_dump(svms) \ _dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms) +#else +#define dynamic_svm_range_dump(svms) \ + do { if (0) svm_range_debug_dump(svms); } while (0) +#endif /* Giant svm range split into smaller ranges based on this, it is decided using * minimum of all dGPU/APU 1/32 VRAM size, between 2MB to 1GB and alignment to
Re: [PATCH v2 2/4] drm/amdkfd: use vma_is_initial_stack() and vma_is_initial_heap()
Am 2023-07-19 um 03:51 schrieb Kefeng Wang: Use the helpers to simplify code. Cc: Felix Kuehling Cc: Alex Deucher Cc: "Christian König" Cc: "Pan, Xinhui" Cc: David Airlie Cc: Daniel Vetter Signed-off-by: Kefeng Wang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 5ff1a5a89d96..0b7bfbd0cb66 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -2621,10 +2621,7 @@ svm_range_get_range_boundaries(struct kfd_process *p, int64_t addr, return -EFAULT; } - *is_heap_stack = (vma->vm_start <= vma->vm_mm->brk && - vma->vm_end >= vma->vm_mm->start_brk) || -(vma->vm_start <= vma->vm_mm->start_stack && - vma->vm_end >= vma->vm_mm->start_stack); + *is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma); start_limit = max(vma->vm_start >> PAGE_SHIFT, (unsigned long)ALIGN_DOWN(addr, 2UL << 8));
Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()
Am 2023-07-14 um 10:26 schrieb Vlastimil Babka: On 7/12/23 18:24, Felix Kuehling wrote: Allocations in the heap and stack tend to be small, with several allocations sharing the same page. Sharing the same page for different allocations with different access patterns leads to thrashing when we migrate data back and forth on GPU and CPU access. To avoid this we disable HMM migrations for head and stack VMAs. Wonder how well does it really work in practice? AFAIK "heaps" (malloc()) today uses various arenas obtained by mmap() and not a single brk() managed space anymore? And programs might be multithreaded, thus have multiple stacks, while vma_is_stack() will recognize only the initial one... Thanks for these pointers. I have not heard of such problems with mmap arenas and multiple thread stacks in practice. But I'll keep it in mind in case we observe unexpected thrashing in the future. FWIW, we once had the opposite problem of a custom malloc implementation that used sbrk for very large allocations. This disabled migrations of large buffers unexpectedly. I agree that eventually we'll want a more dynamic way of detecting and suppressing thrashing that's based on observed memory access patterns. Getting this right is probably trickier than it sounds, so I'd prefer to have some more experience with real workloads to use as benchmarks. Compared to other things we're working on, this is fairly low on our priority list at the moment. Using the VMA flags is a simple and effective method for now, at least until we see it failing in real workloads. Regards, Felix Vlastimil Regards, Felix Am 2023-07-12 um 10:42 schrieb Christoph Hellwig: On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote: Use the helpers to simplify code. Nothing against your addition of a helper, but a GPU driver really should have no business even looking at this information..
Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()
Allocations in the heap and stack tend to be small, with several allocations sharing the same page. Sharing the same page for different allocations with different access patterns leads to thrashing when we migrate data back and forth on GPU and CPU access. To avoid this we disable HMM migrations for head and stack VMAs. Regards, Felix Am 2023-07-12 um 10:42 schrieb Christoph Hellwig: On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote: Use the helpers to simplify code. Nothing against your addition of a helper, but a GPU driver really should have no business even looking at this information..
Re: [PATCH] drm/amdkfd: Switch over to memdup_user()
Am 2023-06-13 um 22:04 schrieb Jiapeng Chong: Use memdup_user() rather than duplicating its implementation. This is a little bit restricted to reduce false positives. ./drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:2813:13-20: WARNING opportunity for memdup_user. Reported-by: Abaci Robot Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5523 Signed-off-by: Jiapeng Chong Kernel test robot is reporting a failure with this patch, looks like you used PTR_ERR incorrectly. Please make sure your patch compiles without warnings. I see more opportunities to use memdup_user in kfd_chardev.c, kfd_events.c, kfd_process_queue_manager.c and kfd_svm.c. Do you want to fix those, too, while you're at it? Thanks, Felix --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index d6b15493fffd..637962d4083c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -2810,12 +2810,9 @@ static uint32_t *get_queue_ids(uint32_t num_queues, uint32_t *usr_queue_id_array if (!usr_queue_id_array) return NULL; - queue_ids = kzalloc(array_size, GFP_KERNEL); - if (!queue_ids) - return ERR_PTR(-ENOMEM); - - if (copy_from_user(queue_ids, usr_queue_id_array, array_size)) - return ERR_PTR(-EFAULT); + queue_ids = memdup_user(usr_queue_id_array, array_size); + if (IS_ERR(queue_ids)) + return PTR_ERR(queue_ids); return queue_ids; }
Re: [PATCH v2] gpu: drm/amd: Remove the redundant null pointer check in list_for_each_entry() loops
[+Jon] Am 2023-06-12 um 07:58 schrieb Lu Hongfei: pqn bound in list_for_each_entry loop will not be null, so there is no need to check whether pqn is NULL or not. Thus remove a redundant null pointer check. Signed-off-by: Lu Hongfei --- The filename of the previous version was: 0001-gpu-drm-amd-Fix-the-bug-in-list_for_each_entry-loops.patch The modifications made compared to the previous version are as follows: 1. Modified the patch title 2. "Thus remove a redundant null pointer check." is used instead of "We could remove this check." drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c index cd34e7aaead4..10d0cef844f0 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c @@ -1097,9 +1097,6 @@ void kfd_dbg_set_enabled_debug_exception_mask(struct kfd_process *target, pqm = &target->pqm; list_for_each_entry(pqn, &pqm->queues, process_queue_list) { - if (!pqn) Right, this check doesn't make a lot of sense. Jon, was this meant to check pqn->q? Regards, Felix - continue; - found_mask |= pqn->q->properties.exception_status; }
Re: [PATCH -next] drm/amdkfd: clean up one inconsistent indenting
On 2023-05-30 22:08, Yang Li wrote: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:1036 kgd2kfd_interrupt() warn: inconsistent indenting Signed-off-by: Yang Li Reviewed-by: Felix Kuehling I'm applying the patch to amd-staging-drm-next. Thanks! --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c index 862a50f7b490..0398a8c52a44 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c @@ -1033,7 +1033,7 @@ void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry) is_patched ? patched_ihre : ih_ring_entry)) { kfd_queue_work(node->ih_wq, &node->interrupt_work); spin_unlock_irqrestore(&node->interrupt_lock, flags); - return; + return; } spin_unlock_irqrestore(&node->interrupt_lock, flags); }
Re: [PATCH 32/33] drm/amdkfd: add debug device snapshot operation
Am 2023-05-25 um 13:27 schrieb Jonathan Kim: Similar to queue snapshot, return an array of device information using an entry_size check and return. Unlike queue snapshots, the debugger needs to pass to correct number of devices that exist. If it fails to do so, the KFD will return the number of actual devices so that the debugger can make a subsequent successful call. v2: add num_xcc to device snapshot and fixup new kfd_node reference Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 7 ++- drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 73 drivers/gpu/drm/amd/amdkfd/kfd_debug.h | 5 ++ 3 files changed, 83 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index b24a73fd53af..f522325b409b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -3060,8 +3060,11 @@ static int kfd_ioctl_set_debug_trap(struct file *filep, struct kfd_process *p, v &args->queue_snapshot.entry_size); break; case KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT: - pr_warn("Debug op %i not supported yet\n", args->op); - r = -EACCES; + r = kfd_dbg_trap_device_snapshot(target, + args->device_snapshot.exception_mask, + (void __user *)args->device_snapshot.snapshot_buf_ptr, + &args->device_snapshot.num_devices, + &args->device_snapshot.entry_size); break; default: pr_err("Invalid option: %i\n", args->op); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c index 24e2b285448a..125274445f43 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c @@ -22,6 +22,7 @@ #include "kfd_debug.h" #include "kfd_device_queue_manager.h" +#include "kfd_topology.h" #include #include @@ -1010,6 +1011,78 @@ int kfd_dbg_trap_query_exception_info(struct kfd_process *target, return r; } +int kfd_dbg_trap_device_snapshot(struct kfd_process *target, + uint64_t exception_clear_mask, + void __user *user_info, + uint32_t *number_of_device_infos, + uint32_t *entry_size) +{ + struct kfd_dbg_device_info_entry device_info; + uint32_t tmp_entry_size = *entry_size, tmp_num_devices; + int i, r = 0; + + if (!(target && user_info && number_of_device_infos && entry_size)) + return -EINVAL; + + tmp_num_devices = min_t(size_t, *number_of_device_infos, target->n_pdds); + *number_of_device_infos = target->n_pdds; + *entry_size = min_t(size_t, *entry_size, sizeof(device_info)); + + if (!tmp_num_devices) + return 0; + + memset(&device_info, 0, sizeof(device_info)); + + mutex_lock(&target->event_mutex); + + /* Run over all pdd of the process */ + for (i = 0; i < tmp_num_devices; i++) { + struct kfd_process_device *pdd = target->pdds[i]; + struct kfd_topology_device *topo_dev = kfd_topology_device_by_id(pdd->dev->id); + + device_info.gpu_id = pdd->dev->id; + device_info.exception_status = pdd->exception_status; + device_info.lds_base = pdd->lds_base; + device_info.lds_limit = pdd->lds_limit; + device_info.scratch_base = pdd->scratch_base; + device_info.scratch_limit = pdd->scratch_limit; + device_info.gpuvm_base = pdd->gpuvm_base; + device_info.gpuvm_limit = pdd->gpuvm_limit; + device_info.location_id = topo_dev->node_props.location_id; + device_info.vendor_id = topo_dev->node_props.vendor_id; + device_info.device_id = topo_dev->node_props.device_id; + device_info.revision_id = pdd->dev->adev->pdev->revision; + device_info.subsystem_vendor_id = pdd->dev->adev->pdev->subsystem_vendor; + device_info.subsystem_device_id = pdd->dev->adev->pdev->subsystem_device; + device_info.fw_version = pdd->dev->kfd->mec_fw_version; + device_info.gfx_target_version = + topo_dev->node_props.gfx_target_version; + device_info.simd_count = topo_dev->node_props.simd_count; + device_info.max_waves_per_simd = + topo_dev->node_props.max_waves_per_simd; + device_info.array_count = topo_dev->node_props.array_count
Re: [PATCH 28/33] drm/amdkfd: add debug set flags operation
Am 2023-05-25 um 13:27 schrieb Jonathan Kim: Allow the debugger to set single memory and single ALU operations. Some exceptions are imprecise (memory violations, address watch) in the sense that a trap occurs only when the exception interrupt occurs and not at the non-halting faulty instruction. Trap temporaries 0 & 1 save the program counter address, which means that these values will not point to the faulty instruction address but to whenever the interrupt was raised. Setting the Single Memory Operations flag will inject an automatic wait on every memory operation instruction forcing imprecise memory exceptions to become precise at the cost of performance. This setting is not permitted on debug devices that support only a global setting of this option. Return the previous set flags to the debugger as well. v2: fixup with new kfd_node struct reference mes checks Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 + drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 58 drivers/gpu/drm/amd/amdkfd/kfd_debug.h | 1 + 3 files changed, 61 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index e88be582d44d..e5d95b144dcd 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -3035,6 +3035,8 @@ static int kfd_ioctl_set_debug_trap(struct file *filep, struct kfd_process *p, v args->clear_node_address_watch.id); break; case KFD_IOC_DBG_TRAP_SET_FLAGS: + r = kfd_dbg_trap_set_flags(target, &args->set_flags.flags); + break; case KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT: case KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO: case KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c index 4b36cc8b5fb7..43c3170998d3 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c @@ -23,6 +23,7 @@ #include "kfd_debug.h" #include "kfd_device_queue_manager.h" #include +#include #define MAX_WATCH_ADDRESSES 4 @@ -423,6 +424,59 @@ static void kfd_dbg_clear_process_address_watch(struct kfd_process *target) kfd_dbg_trap_clear_dev_address_watch(target->pdds[i], j); } +int kfd_dbg_trap_set_flags(struct kfd_process *target, uint32_t *flags) +{ + uint32_t prev_flags = target->dbg_flags; + int i, r = 0, rewind_count = 0; + + for (i = 0; i < target->n_pdds; i++) { + if (!kfd_dbg_is_per_vmid_supported(target->pdds[i]->dev) && + (*flags & KFD_DBG_TRAP_FLAG_SINGLE_MEM_OP)) { + *flags = prev_flags; + return -EACCES; + } + } + + target->dbg_flags = *flags & KFD_DBG_TRAP_FLAG_SINGLE_MEM_OP; + *flags = prev_flags; + for (i = 0; i < target->n_pdds; i++) { + struct kfd_process_device *pdd = target->pdds[i]; + + if (!kfd_dbg_is_per_vmid_supported(pdd->dev)) + continue; + + if (!pdd->dev->kfd->shared_resources.enable_mes) + r = debug_refresh_runlist(pdd->dev->dqm); + else + r = kfd_dbg_set_mes_debug_mode(pdd); + + if (r) { + target->dbg_flags = prev_flags; + break; + } + + rewind_count++; + } + + /* Rewind flags */ + if (r) { + target->dbg_flags = prev_flags; + + for (i = 0; i < rewind_count; i++) { + struct kfd_process_device *pdd = target->pdds[i]; + + if (!kfd_dbg_is_per_vmid_supported(pdd->dev)) + continue; + + if (!pdd->dev->kfd->shared_resources.enable_mes) + debug_refresh_runlist(pdd->dev->dqm); + else + kfd_dbg_set_mes_debug_mode(pdd); + } + } + + return r; +} /* kfd_dbg_trap_deactivate: *target: target process @@ -437,9 +491,13 @@ void kfd_dbg_trap_deactivate(struct kfd_process *target, bool unwind, int unwind int i; if (!unwind) { + uint32_t flags = 0; + cancel_work_sync(&target->debug_event_workarea); kfd_dbg_clear_process_address_watch(target); kfd_dbg_trap_set_wave_launch_mode(target, 0); + + kfd_dbg_trap_set_flags(target, &flags); } for (i = 0; i < target->n_pdds; i++) { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.h b/drivers/gpu
Re: [PATCH 27/33] drm/amdkfd: add debug set and clear address watch points operation
Am 2023-05-25 um 13:27 schrieb Jonathan Kim: Shader read, write and atomic memory operations can be alerted to the debugger as an address watch exception. Allow the debugger to pass in a watch point to a particular memory address per device. Note that there exists only 4 watch points per devices to date, so have the KFD keep track of what watch points are allocated or not. v2: fixup with new kfd_node struct reference for mes and watch point checks Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling --- .../drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 51 +++ .../drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 + .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 78 ++ .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h| 8 ++ .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10_3.c | 5 +- .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c| 52 ++- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 77 ++ .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h | 8 ++ drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 24 drivers/gpu/drm/amd/amdkfd/kfd_debug.c| 136 ++ drivers/gpu/drm/amd/amdkfd/kfd_debug.h| 8 +- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 6 +- 13 files changed, 452 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c index 774ecfc3451a..efd6a72aab4e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c @@ -118,6 +118,55 @@ static uint32_t kgd_aldebaran_set_wave_launch_mode(struct amdgpu_device *adev, return data; } +#define TCP_WATCH_STRIDE (regTCP_WATCH1_ADDR_H - regTCP_WATCH0_ADDR_H) +static uint32_t kgd_gfx_aldebaran_set_address_watch( + struct amdgpu_device *adev, + uint64_t watch_address, + uint32_t watch_address_mask, + uint32_t watch_id, + uint32_t watch_mode, + uint32_t debug_vmid) +{ + uint32_t watch_address_high; + uint32_t watch_address_low; + uint32_t watch_address_cntl; + + watch_address_cntl = 0; + watch_address_low = lower_32_bits(watch_address); + watch_address_high = upper_32_bits(watch_address) & 0x; + + watch_address_cntl = REG_SET_FIELD(watch_address_cntl, + TCP_WATCH0_CNTL, + MODE, + watch_mode); + + watch_address_cntl = REG_SET_FIELD(watch_address_cntl, + TCP_WATCH0_CNTL, + MASK, + watch_address_mask >> 6); + + watch_address_cntl = REG_SET_FIELD(watch_address_cntl, + TCP_WATCH0_CNTL, + VALID, + 1); + + WREG32_RLC((SOC15_REG_OFFSET(GC, 0, regTCP_WATCH0_ADDR_H) + + (watch_id * TCP_WATCH_STRIDE)), + watch_address_high); + + WREG32_RLC((SOC15_REG_OFFSET(GC, 0, regTCP_WATCH0_ADDR_L) + + (watch_id * TCP_WATCH_STRIDE)), + watch_address_low); + + return watch_address_cntl; +} + +uint32_t kgd_gfx_aldebaran_clear_address_watch(struct amdgpu_device *adev, + uint32_t watch_id) +{ + return 0; +} + const struct kfd2kgd_calls aldebaran_kfd2kgd = { .program_sh_mem_settings = kgd_gfx_v9_program_sh_mem_settings, .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping, @@ -141,6 +190,8 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = { .validate_trap_override_request = kgd_aldebaran_validate_trap_override_request, .set_wave_launch_trap_override = kgd_aldebaran_set_wave_launch_trap_override, .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode, + .set_address_watch = kgd_gfx_aldebaran_set_address_watch, + .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch, .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times, .build_grace_period_packet_info = kgd_gfx_v9_build_grace_period_packet_info, .program_trap_handler_settings = kgd_gfx_v9_program_trap_handler_settings, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c index fbdc1b7b1e42..6df215aba4c4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c @@ -413,6 +413,8 @@ const struct kfd2kgd_calls arcturus_kfd2kgd = { .validate_trap_override_request = kgd_gfx_v9_validate_trap_override_request, .set_wave_launch_tra
Re: [PATCH 26/33] drm/amdkfd: add debug suspend and resume process queues operation
Am 2023-05-25 um 13:27 schrieb Jonathan Kim: In order to inspect waves from the saved context at any point during a debug session, the debugger must be able to preempt queues to trigger context save by suspending them. On queue suspend, the KFD will copy the context save header information so that the debugger can correctly crawl the appropriate size of the saved context. The debugger must then also be allowed to resume suspended queues. A queue that is newly created cannot be suspended because queue ids are recycled after destruction so the debugger needs to know that this has occurred. Query functions will be later added that will clear a given queue of its new queue status. A queue cannot be destroyed while it is suspended to preserve its saved context during debugger inspection. Have queue destruction block while a queue is suspended and unblocked when it is resumed. Likewise, if a queue is about to be destroyed, it cannot be suspended. Return the number of queues successfully suspended or resumed along with a per queue status array where the upper bits per queue status show that the request was invalid (new/destroyed queue suspend request, missing queue) or an error occurred (HWS in a fatal state so it can't suspend or resume queues). v2: fixup new kfd_node struct reference for mes fw check. also fixup missing EC_QUEUE_NEW flagging on newly created queue. Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 5 + drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 1 + drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 11 + drivers/gpu/drm/amd/amdkfd/kfd_debug.c| 7 + .../drm/amd/amdkfd/kfd_device_queue_manager.c | 447 +- .../drm/amd/amdkfd/kfd_device_queue_manager.h | 10 + .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c | 10 + .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c | 15 +- .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 14 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 5 +- .../amd/amdkfd/kfd_process_queue_manager.c| 1 + 11 files changed, 512 insertions(+), 14 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index 98cd52bb005f..b4fcad0e62f7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -772,6 +772,11 @@ bool amdgpu_amdkfd_have_atomics_support(struct amdgpu_device *adev) return adev->have_atomics_support; } +void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev) +{ + amdgpu_device_flush_hdp(adev, NULL); +} + void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, bool reset) { amdgpu_umc_poison_handler(adev, reset); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index dd740e64e6e1..2d0406bff84e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -322,6 +322,7 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev, uint64_t *mmap_offset); int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem, struct dma_buf **dmabuf); +void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev); int amdgpu_amdkfd_get_tile_config(struct amdgpu_device *adev, struct tile_config *config); void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 4b45d4539d48..adda60273456 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -410,6 +410,7 @@ static int kfd_ioctl_create_queue(struct file *filep, struct kfd_process *p, pr_debug("Write ptr address == 0x%016llX\n", args->write_pointer_address); + kfd_dbg_ev_raise(KFD_EC_MASK(EC_QUEUE_NEW), p, dev, queue_id, false, NULL, 0); return 0; err_create_queue: @@ -2996,7 +2997,17 @@ static int kfd_ioctl_set_debug_trap(struct file *filep, struct kfd_process *p, v args->launch_mode.launch_mode); break; case KFD_IOC_DBG_TRAP_SUSPEND_QUEUES: + r = suspend_queues(target, + args->suspend_queues.num_queues, + args->suspend_queues.grace_period, + args->suspend_queues.exception_mask, + (uint32_t *)args->suspend_queues.queue_array_ptr); + + break; case KFD_IOC_DBG_TRAP_RESUME_QUEUES: + r = resume_queues(target, args->resume_queues.num_queues, + (uint32_t *)args->resume_queues.queue_array