from:"Felix Kuehling"

Re: [PATCH 1/1] mm/migrate: Trylock device page in do_swap_page

2024-09-20 Thread Felix Kuehling




On 2024-09-20 17:23, Matthew Brost wrote:

On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote:

On 2024-09-18 11:10, Alistair Popple wrote:

Matthew Brost  writes:


On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote:

Matthew Brost  writes:

I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know)

Still a driver.

Indeed, and I'm happy to answer any questions about our implementation.


but theoretically it seems like it should be possible. However we
serialize migrations of the same virtual address range to avoid these
kind of issues as they can happen the other way too (ie. multiple
threads trying to migrate to GPU).

So I suspect what happens in UVM is that one thread wins and installs
the migration entry while the others fail to get the driver migration
lock and bail out sufficiently early in the fault path to avoid the
live-lock.


I had to try hard to show this, doubt an actual user could trigger this.

I wrote a test which kicked off 8 threads, each thread did a pthread join,
and then tried to read the same page. This repeats in a loop for like 512
pages or something. I needed an exclusive lock in the migrate_to_ram vfunc
for it to livelock. Without an exclusive lock I think on average I saw
about 32k retries (i.e. migrate_to_ram calls on the same page) before a
thread won this race.

From reading UVM, I'm pretty sure if you tried hard enough you could trigger
a livelock given it appears you take exclusive locks in migrate_to_ram.

Yes, I suspect you're correct here and that we just haven't tried hard
enough to trigger it.
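
The access pattern of the reproducer described above can be pictured with a
minimal userspace sketch. This is not Matthew's actual test. It assumes that
the synchronisation point before each read behaves like a pthread barrier, and
it uses ordinary heap memory as a stand-in for device-private pages, so it only
shows how 8 threads are made to touch the same page at the same moment and
will not by itself reach migrate_to_ram(). The names race_ctx, reader and
hammer_pages are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NTHREADS 8      /* thread count mentioned in the description */
#define NPAGES   512    /* "like 512 pages or something" */

struct race_ctx {
	pthread_barrier_t barrier;  /* releases all readers at the same moment */
	volatile uint8_t *buf;      /* stand-in for device-backed memory */
	size_t page_size;
	size_t npages;
};

static void *reader(void *arg)
{
	struct race_ctx *ctx = arg;

	for (size_t i = 0; i < ctx->npages; i++) {
		/* Line every thread up, then touch the same page together. */
		pthread_barrier_wait(&ctx->barrier);
		(void)ctx->buf[i * ctx->page_size];
	}
	return NULL;
}

/* Returns 0 once every thread has read every page, -1 on setup failure. */
static int hammer_pages(uint8_t *buf, size_t page_size, size_t npages)
{
	struct race_ctx ctx = { .buf = buf, .page_size = page_size, .npages = npages };
	pthread_t tid[NTHREADS];

	assert(buf != NULL && page_size > 0);
	if (pthread_barrier_init(&ctx.barrier, NULL, NTHREADS) != 0)
		return -1;

	for (int i = 0; i < NTHREADS; i++) {
		/* A partial set of threads would block on the barrier forever. */
		if (pthread_create(&tid[i], NULL, reader, &ctx) != 0)
			abort();
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	pthread_barrier_destroy(&ctx.barrier);
	return 0;
}
```

To exercise the real fault path, buf would have to point at memory that a
driver has already migrated to device-private pages; how that is arranged is
driver specific and is not shown here.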


Cc: Philip Yang 
Cc: Felix Kuehling 
Cc: Christian König 
Cc: Andrew Morton 
Suggested-by: Simona Vetter
Signed-off-by: Matthew Brost 
---
   mm/memory.c | 13 +++---
   mm/migrate_device.c | 60 +++--
   2 files changed, 50 insertions(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3c01d68065be..bbd97d16a96a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 * Get a page reference while we know the page can't be
 * freed.
 */
-   get_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
-   put_page(vmf->page);
+   if (trylock_page(vmf->page)) {
+   get_page(vmf->page);
+   pte_unmap_unlock(vmf->pte, vmf->ptl);

This is all beginning to look a lot like migrate_vma_collect_pmd(). So
rather than do this and then have to pass all this context
(ie. fault_page) down to the migrate_vma_* functions could we instead
just do what migrate_vma_collect_pmd() does here? Ie. we already have
the PTL and the page lock so there's no reason we couldn't just setup
the migration entry prior to calling migrate_to_ram().

Obviously calling migrate_vma_setup() would show the page as not
migrating, but drivers could easily just fill in the src_pfn info after
calling migrate_vma_setup().

This would eliminate the whole fault_page ugliness.
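
The control flow of the hunk under discussion, "call migrate_to_ram() only if
the page lock can be taken without blocking, otherwise drop the PTL and
return", can be sketched in userspace. This is an analogy only:
pthread_mutex_trylock() stands in for trylock_page(), and struct fake_page,
fake_migrate_to_ram() and fault_on_page() are hypothetical names, not kernel
interfaces.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device page; not a kernel structure. */
struct fake_page {
	pthread_mutex_t lock;   /* plays the role of the page lock */
	int refcount;           /* plays the role of the page reference */
	int migrations;         /* how many times the "migration" ran */
};

/* Stand-in for the driver's migrate_to_ram() callback. */
static void fake_migrate_to_ram(struct fake_page *page)
{
	page->migrations++;
}

/*
 * Returns true if this caller performed the migration, false if another
 * caller holds the lock and this one should return so the fault is retried.
 */
static bool fault_on_page(struct fake_page *page)
{
	assert(page != NULL);
	if (pthread_mutex_trylock(&page->lock) != 0)
		return false;       /* lost the race: back off, do not wait */

	page->refcount++;           /* like get_page() */
	fake_migrate_to_ram(page);
	page->refcount--;           /* like put_page() */
	pthread_mutex_unlock(&page->lock);
	return true;
}
```

The point of the pattern is that a loser never waits on the lock while holding
other resources; it returns immediately and lets the winner finish.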


This seems like it would work, and I agree it would likely be cleaner. Let me
play around with this and see what I come up with. Multi-tasking a bit
so expect a bit of delay here.

Thanks for the input,
Matt

Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip,
Christian and myself in the loop with future patches in this area.


Will do. I already have another local patch set which helps drivers DMA-map
2M pages for migrations if SRAM is physically contiguous. It seems
helpful for performance on Intel hardware. I'll probably post that soon for
early feedback.


OK.




Longer term I'm thinking 2M migration entries, 2M device pages, and being
able to install 2M THP on VRAM -> SRAM could be really helpful. I'm
finding migrate_vma_* functions take up like 80-90% of the time in the
CPU / GPU fault handlers on a fault (or prefetch), which doesn't seem
ideal. Seems like 2M entries for everything would really help here. No
idea how feasible this is as the core MM stuff gets confusing fast. Any
input on this idea?


I agree with your observations. We found that the migrate_vma_* code was
the bottleneck for migration performance as well, and not breaking 2M
pages could reduce that overhead a lot. I don't have any specific ideas.
I'm not familiar with the details of that code myself. Philip has looked
at this (and some old NVIDIA patches from a few years ago) in the past
but never had enough uninterrupted time to make it past prototyping.


Regards,
  Felix




Matt


Regards,
   Felix



+   ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+

Re: [PATCH 1/1] mm/migrate: Trylock device page in do_swap_page

2024-09-20 Thread Felix Kuehling


On 2024-09-18 11:10, Alistair Popple wrote:

Matthew Brost  writes:


On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote:

Matthew Brost  writes:

I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know)

Still a driver.

Indeed, and I'm happy to answer any questions about our implementation.


but theoretically it seems like it should be possible. However we
serialize migrations of the same virtual address range to avoid these
kind of issues as they can happen the other way too (ie. multiple
threads trying to migrate to GPU).

So I suspect what happens in UVM is that one thread wins and installs
the migration entry while the others fail to get the driver migration
lock and bail out sufficiently early in the fault path to avoid the
live-lock.


I had to try hard to show this, doubt an actual user could trigger this.

I wrote a test which kicked off 8 threads, each thread did a pthread join,
and then tried to read the same page. This repeats in a loop for like 512
pages or something. I needed an exclusive lock in the migrate_to_ram vfunc
for it to livelock. Without an exclusive lock I think on average I saw
about 32k retries (i.e. migrate_to_ram calls on the same page) before a
thread won this race.

From reading UVM, I'm pretty sure if you tried hard enough you could trigger
a livelock given it appears you take exclusive locks in migrate_to_ram.

Yes, I suspect you're correct here and that we just haven't tried hard
enough to trigger it.


Cc: Philip Yang 
Cc: Felix Kuehling 
Cc: Christian König 
Cc: Andrew Morton 
Suggested-by: Simona Vetter
Signed-off-by: Matthew Brost 
---
  mm/memory.c | 13 +++---
  mm/migrate_device.c | 60 +++--
  2 files changed, 50 insertions(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3c01d68065be..bbd97d16a96a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 * Get a page reference while we know the page can't be
 * freed.
 */
-   get_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
-   put_page(vmf->page);
+   if (trylock_page(vmf->page)) {
+   get_page(vmf->page);
+   pte_unmap_unlock(vmf->pte, vmf->ptl);

This is all beginning to look a lot like migrate_vma_collect_pmd(). So
rather than do this and then have to pass all this context
(ie. fault_page) down to the migrate_vma_* functions could we instead
just do what migrate_vma_collect_pmd() does here? Ie. we already have
the PTL and the page lock so there's no reason we couldn't just setup
the migration entry prior to calling migrate_to_ram().

Obviously calling migrate_vma_setup() would show the page as not
migrating, but drivers could easily just fill in the src_pfn info after
calling migrate_vma_setup().

This would eliminate the whole fault_page ugliness.


This seems like it would work, and I agree it would likely be cleaner. Let me
play around with this and see what I come up with. Multi-tasking a bit
so expect a bit of delay here.

Thanks for the input,
Matt


Thanks! Sorry, I'm late catching up after a vacation. Please keep 
Philip, Christian and myself in the loop with future patches in this area.


Regards,
  Felix





+   ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+   put_page(vmf->page);
+   unlock_page(vmf->page);
+   } else {
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   }
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else if (is_pte_marker_entry(entry)) {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..049893a5a179 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
   struct mm_walk *walk)
  {
struct migrate_vma *migrate = walk->private;
+   struct folio *fault_folio = migrate->fault_page ?
+   page_folio(migrate->fault_page) : NULL;
struct vm_area_struct *vma = walk->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr = start, unmapped = 0;
@@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
  
  			folio_get(folio);

spin_unlock(ptl);
-   if (unlikely(!folio_trylock(folio)))
+   if (unlikely(fault_folio != folio &&
+

Re: [PATCH 2/4] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-08-14 Thread Felix Kuehling




On 2024-08-12 02:59, Al Viro wrote:

Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into
descriptor table, only to have it looked up by file descriptor and
remove it from descriptor table is not just too convoluted - it's
racy; another thread might have modified the descriptor table while
we'd been going through that song and dance.

Switch kfd_mem_export_dmabuf() to using drm_gem_prime_handle_to_dmabuf()
and leave the descriptor table alone...

Signed-off-by: Al Viro 


This patch is

Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 12 +++-
  1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 11672bfe4fad..bc5401de2948 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,7 +25,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  
  #include 

@@ -818,18 +817,13 @@ static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
if (!mem->dmabuf) {
struct amdgpu_device *bo_adev;
struct dma_buf *dmabuf;
-   int r, fd;
  
  		bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);

-   r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file,
+   dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev, bo_adev->kfd.client.file,
   mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-  DRM_RDWR : 0, &fd);
-   if (r)
-   return r;
-   dmabuf = dma_buf_get(fd);
-   close_fd(fd);
-   if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+  DRM_RDWR : 0);
+   if (IS_ERR(dmabuf))
return PTR_ERR(dmabuf);
mem->dmabuf = dmabuf;
}
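
The race the commit message describes can be modelled in userspace with a toy
descriptor table. This is a deterministic, single-threaded simulation, not
kernel code: the "other thread" is represented by a callback that runs between
the install and the lookup. All names here (toy_buf, fd_table, export_via_fd,
export_direct, swap_in_other) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Toy object and toy descriptor table shared by all "threads". */
struct toy_buf { int id; };

static struct toy_buf *fd_table[TABLE_SIZE];

/* Install obj in the lowest free slot, as fd allocation does. */
static int toy_fd_install(struct toy_buf *obj)
{
	for (int fd = 0; fd < TABLE_SIZE; fd++) {
		if (!fd_table[fd]) {
			fd_table[fd] = obj;
			return fd;
		}
	}
	return -1;
}

static struct toy_buf *toy_fd_lookup(int fd)
{
	return (fd >= 0 && fd < TABLE_SIZE) ? fd_table[fd] : NULL;
}

static void toy_fd_close(int fd)
{
	if (fd >= 0 && fd < TABLE_SIZE)
		fd_table[fd] = NULL;
}

/*
 * Old shape: publish the object in the shared table, then fetch it back by
 * number. "interloper" models whatever another thread does in between.
 */
static struct toy_buf *export_via_fd(struct toy_buf *obj,
				     void (*interloper)(int fd))
{
	int fd = toy_fd_install(obj);
	struct toy_buf *got;

	if (fd < 0)
		return NULL;
	if (interloper)
		interloper(fd);
	got = toy_fd_lookup(fd);    /* may no longer be our object */
	toy_fd_close(fd);           /* may close somebody else's entry */
	return got;
}

/* New shape: hand the object back directly; the table is never involved. */
static struct toy_buf *export_direct(struct toy_buf *obj)
{
	return obj;
}

static struct toy_buf other = { .id = 99 };

/* Another thread closing our fd and reusing the number for its own object. */
static void swap_in_other(int fd)
{
	toy_fd_close(fd);
	fd_table[fd] = &other;
}
```

With the interloper in place, export_via_fd() returns the other thread's
object and then closes that thread's descriptor, which are the two failure
modes a round trip through a shared table allows. export_direct() has no
window in which that can happen.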

Re: va range based memory management discussion (was: Re: Re: Re: Proposal to add CRIU support to DRM render nodes)

2024-07-10 Thread Felix Kuehling


On 2024-07-09 22:38, 周春明(日月) wrote:







--
From: Felix Kuehling
Sent: 2024-07-10 (Wednesday) 01:07
To: 周春明(日月); Tvrtko Ursulin; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Dave Airlie; Daniel Vetter; criu
Cc: "Errabolu, Ramesh"; "Christian König"
Subject: Re: Re: Re: Proposal to add CRIU support to DRM render nodes



On 2024-07-09 5:30, 周春明(日月) wrote:
>
>
>
>
>
>
> --
> From: Felix Kuehling
> Sent: 2024-07-09 (Tuesday) 06:40
> To: 周春明(日月); Tvrtko Ursulin; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Dave Airlie; Daniel Vetter; criu
> Cc: "Errabolu, Ramesh"; "Christian König"
> Subject: Re: Re: Proposal to add CRIU support to DRM render nodes
>
>
> On 2024-07-08 2:51, 周春明(日月) wrote:
>>
>>> Hi Felix,
>>>
>>> While learning about the CRIU plugin you introduced at
>>> https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu
>>> I found this sentence:
>>> "ROCm manages memory in the form of buffer objects (BOs). We are
>>> also working on a new memory management API that will be based on
>>> virtual address ranges...",
>>> Out of curiosity, what about this "new memory management based on virtual
>>> address ranges"? Is there any introduction to it?

>>
>>>Hi David,
>>
>>>This refers to the SVM API that has been in the upstream driver for
>>>a while now:
>>>https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732

>>
>> [David] Could all ROCm runtime memory management switch to the SVM
>> APIs, so that BOs are no longer needed?


>I had thought about that when I started working on SVM years ago. But 
I came to the conclusion that we need to use BOs for VRAM to support 
DMABuf exports and imports to support P2P and IPC features.


[David] OK, I guessed you would mention DMABuf and IPC. If we
don't use dmabuf (as you know, dmabuf isn't popular in the compute area)
and implement a new IPC based on VA ranges, would it be possible to use
the SVM API to cover all ROCm memory management?

When I tried the memory pool used by CUDA graphs, that seemed to work.


DMABuf and IPC are important for collective communications libraries 
used by distributed applications. You could get away without it when 
you're running a single-process application on a single machine. But 
changing all memory allocations to SVM would probably cause some 
performance regressions, because our BO allocators and memory mapping 
functions are simpler and easier to optimize than for unified memory.


That leaves the question, what's the expected benefit or a compelling 
reason for making such an invasive change?


Regards,
  Felix




Thanks,
-David

>Regards,
>  Felix


>
> Thanks,
> -David
>
> Regards,
>   Felix
>
>
>>
>> Thanks,
>> -David
>>
>> --
>>     From: Felix Kuehling
>>     Sent: 2024-05-03 (Friday) 22:44
>>     To: Tvrtko Ursulin; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Dave Airlie; Daniel Vetter; criu
>>     Cc: "Errabolu, Ramesh"; "Christian König"
>>     Subject: Re: Proposal to add CRIU support to DRM render nodes
>>
>>
>>
>>     On 2024-04-16 10:04, Tvrtko Ursulin wrote:
>>     >
>>     > On 01/04/2024 18:58, Felix Kuehling wrote:
>>     >>
>>     >> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>     >>>
>>     >>> On 01/04/2024 17:37, Felix Kuehling wrote:
>>     >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>     >>>>>
>>     >>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>     >>>>>>
>>     >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>     >>>>>>>
>>     >>>>>>> Hi Felix,
>>     >>>>>>>
>>     >>&

Re: Re: Re: Proposal to add CRIU support to DRM render nodes

2024-07-09 Thread Felix Kuehling




On 2024-07-09 5:30, 周春明(日月) wrote:
> 
> 
> 
> 
> 
> 
> --
> From: Felix Kuehling
> Sent: 2024-07-09 (Tuesday) 06:40
> To: 周春明(日月); Tvrtko Ursulin; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Dave Airlie; Daniel Vetter; criu
> Cc: "Errabolu, Ramesh"; "Christian König"
> Subject: Re: Re: Proposal to add CRIU support to DRM render nodes
> 
> 
> On 2024-07-08 2:51, 周春明(日月) wrote:
>>
>> Hi Felix,
>>
>> While learning about the CRIU plugin you introduced at
>> https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu
>> I found this sentence:
>> "ROCm manages memory in the form of buffer objects (BOs). We are also
>> working on a new memory management API that will be based on virtual address
>> ranges...",
>> Out of curiosity, what about this "new memory management based on virtual
>> address ranges"? Is there any introduction to it?
> 
>>Hi David,
> 
>>This refers to the SVM API that has been in the upstream driver for a while
>>now:
>>https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732
> 
> [David] Could all ROCm runtime memory management switch to the SVM APIs,
> so that BOs are no longer needed?

I had thought about that when I started working on SVM years ago. But I came to 
the conclusion that we need to use BOs for VRAM to support DMABuf exports and 
imports to support P2P and IPC features.

Regards,
  Felix


> 
> Thanks,
> -David
> 
> Regards,
>   Felix
> 
> 
>>
>> Thanks,
>> -David
>>
>>     --
>>     From: Felix Kuehling
>>     Sent: 2024-05-03 (Friday) 22:44
>>     To: Tvrtko Ursulin; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Dave Airlie; Daniel Vetter; criu
>>     Cc: "Errabolu, Ramesh"; "Christian König"
>>     Subject: Re: Proposal to add CRIU support to DRM render nodes
>>
>>
>>
>>     On 2024-04-16 10:04, Tvrtko Ursulin wrote:
>>     >
>>     > On 01/04/2024 18:58, Felix Kuehling wrote:
>>     >>
>>     >> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>     >>>
>>     >>> On 01/04/2024 17:37, Felix Kuehling wrote:
>>     >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>     >>>>>
>>     >>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>     >>>>>>
>>     >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>     >>>>>>>
>>     >>>>>>> Hi Felix,
>>     >>>>>>>
>>     >>>>>>> I had one more thought while browsing around the amdgpu CRIU 
>> plugin. It appears it relies on the KFD support being compiled in and 
>> /dev/kfd present, correct? AFAICT at least, it relies on that to figure out 
>> the amdgpu DRM node.
>>     >>>>>>>
>>     >>>>>>> It would probably be good to consider designing things without
>> that dependency. So that checkpointing an application which does not use
>> /dev/kfd is possible. Or if the kernel does not even have the KFD support
>> compiled in.
>>     >>>>>>
>>     >>>>>> Yeah, if we want to support graphics apps that don't use KFD, we 
>> should definitely do that. Currently we get a lot of topology information 
>> from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
>> by KFD. We'd need to get GPU device info from the render nodes instead. And 
>> if KFD is available, we may need to integrate both sources of information.
>>     >>>>>>
>>     >>>>>>
>>     >>>>>>>
>>     >>>>>>> It could perhaps mean no more than adding some GPU discovery
>> code into CRIU. Which should be flexible enough to account for things like
>> re-assigned minor numbers due to driver reload.
>>     >>>>>>
>>     >>>>>> Do you mean adding

Re: Re: Proposal to add CRIU support to DRM render nodes

2024-07-08 Thread Felix Kuehling



On 2024-07-08 2:51, 周春明(日月) wrote:
> 
> Hi Felix,
> 
> While learning about the CRIU plugin you introduced at
> https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu
> I found this sentence:
> "ROCm manages memory in the form of buffer objects (BOs). We are also working
> on a new memory management API that will be based on virtual address
> ranges...",
> Out of curiosity, what about this "new memory management based on virtual
> address ranges"? Is there any introduction to it?

Hi David,

This refers to the SVM API that has been in the upstream driver for a while 
now: 
https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732

Regards,
  Felix


> 
> Thanks,
> -David
> 
> --
> From: Felix Kuehling
> Sent: 2024-05-03 (Friday) 22:44
> To: Tvrtko Ursulin; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Dave Airlie; Daniel Vetter; criu
> Cc: "Errabolu, Ramesh"; "Christian König"
> Subject: Re: Proposal to add CRIU support to DRM render nodes
> 
> 
> 
> On 2024-04-16 10:04, Tvrtko Ursulin wrote:
> >
> > On 01/04/2024 18:58, Felix Kuehling wrote:
> >>
> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
> >>>
> >>> On 01/04/2024 17:37, Felix Kuehling wrote:
> >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
> >>>>>
> >>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
> >>>>>>
> >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
> >>>>>>>
> >>>>>>> Hi Felix,
> >>>>>>>
> >>>>>>> I had one more thought while browsing around the amdgpu CRIU 
> plugin. It appears it relies on the KFD support being compiled in and 
> /dev/kfd present, correct? AFAICT at least, it relies on that to figure out 
> the amdgpu DRM node.
> >>>>>>>
> >>>>>>> It would probably be good to consider designing things without
> that dependency. So that checkpointing an application which does not use
> /dev/kfd is possible. Or if the kernel does not even have the KFD support
> compiled in.
> >>>>>>
> >>>>>> Yeah, if we want to support graphics apps that don't use KFD, we 
> should definitely do that. Currently we get a lot of topology information 
> from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
> by KFD. We'd need to get GPU device info from the render nodes instead. And 
> if KFD is available, we may need to integrate both sources of information.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> It could perhaps mean no more than adding some GPU discovery code
> into CRIU. Which should be flexible enough to account for things like
> re-assigned minor numbers due to driver reload.
> >>>>>>
> >>>>>> Do you mean adding GPU discovery to the core CRIU, or to the 
> plugin. I was thinking this is still part of the plugin.
> >>>>>
> >>>>> Yes I agree. I was only thinking about adding some DRM device
> discovery code in a more decoupled fashion from the current plugin, both for
> the reason discussed above (decoupling a bit from reliance on kfd sysfs), and
> also so that if/when a new DRM driver wants to implement this, the code
> could be moved to some common plugin area.
> >>>>>
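
As a rough illustration of what discovery based on render nodes could start
from, here is a small sketch. It relies only on the fact that DRM render nodes
appear as /dev/dri/renderD<minor>; the helper names render_node_minor() and
count_render_nodes() are hypothetical, and this is not code from the CRIU
amdgpu plugin. Matching a node to a saved GPU identity would need further
per-device queries that are not shown.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the minor number encoded in a name such as
 * "renderD128", or -1 if the name is not a render node name.
 */
static int render_node_minor(const char *name)
{
	static const char prefix[] = "renderD";
	const size_t plen = sizeof(prefix) - 1;
	char *end;
	long minor;

	assert(name != NULL);
	if (strncmp(name, prefix, plen) != 0)
		return -1;
	name += plen;
	if (*name < '0' || *name > '9')
		return -1;          /* require a digit, reject signs and spaces */

	errno = 0;
	minor = strtol(name, &end, 10);
	if (errno != 0 || *end != '\0' || minor > INT_MAX)
		return -1;
	return (int)minor;
}

/*
 * Hypothetical helper: count render nodes under dir (normally "/dev/dri").
 * Returns -1 if the directory cannot be opened.
 */
static int count_render_nodes(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *ent;
	int count = 0;

	if (!d)
		return -1;
	while ((ent = readdir(d)) != NULL)
		if (render_node_minor(ent->d_name) >= 0)
			count++;
	closedir(d);
	return count;
}
```

Because minor numbers can be reassigned after a driver reload, a real
implementation would treat the minor as a lookup key for the current boot
only and match devices on something more stable.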
> >>>>> I am not sure how feasible that would be though. The "gpu id"
> concept and its matching in the current kernel code and CRIU plugin - is
> that value tied to the physical GPU instance, or how does it work?
> >>>>
> >>>> The concept of the GPU ID is that it's stable while the system is 
> up, even when devices get added and removed dynamically. It was baked into 
> the API early on, but I don't think we ever fully validated device hot plug. 
> I think the closest we're getting is with our latest MI GPUs and dynamic 
> partition mode change.
> >>>
> >>> Doesn't it read the saved gpu id from the image file while doing 
> restore and tries to open the render node to match it? Maybe I am misreading 
> the code.. But if it does, does it imply that in practice it could be stable 
> acro

Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-06-06 Thread Felix Kuehling




On 2024-06-05 05:14, Christian König wrote:

Am 04.06.24 um 20:08 schrieb Felix Kuehling:


On 2024-06-03 22:13, Al Viro wrote:

Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into
descriptor table, only to have it looked up by file descriptor and
remove it from descriptor table is not just too convoluted - it's
racy; another thread might have modified the descriptor table while
we'd been going through that song and dance.

It's not hard to fix - turn drm_gem_prime_handle_to_fd()
into a wrapper for a new helper that would simply return the
dmabuf, without messing with descriptor table.

Then kfd_mem_export_dmabuf() would simply use that new helper
and leave the descriptor table alone.

Signed-off-by: Al Viro 


This patch looks good to me on the amdgpu side. For the DRM side I'm 
adding dri-devel.


Yeah that patch should probably be split up and the DRM changes 
discussed separately.


On the other hand skimming over it it seems reasonable to me.

Felix are you going to look into this or should I take a look and try 
to push it through drm-misc-next?


It doesn't matter much to me, as long as we submit both changes together.

Thanks,
  Felix




Thanks,
Christian.



Acked-by: Felix Kuehling 



---
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c

index 8975cf41a91a..793780bb819c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,7 +25,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
    #include 
@@ -812,18 +811,13 @@ static int kfd_mem_export_dmabuf(struct 
kgd_mem *mem)

  if (!mem->dmabuf) {
  struct amdgpu_device *bo_adev;
  struct dma_buf *dmabuf;
-    int r, fd;
    bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
-    r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, 
bo_adev->kfd.client.file,
+    dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev, 
bo_adev->kfd.client.file,

 mem->gem_handle,
  mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-   DRM_RDWR : 0, &fd);
-    if (r)
-    return r;
-    dmabuf = dma_buf_get(fd);
-    close_fd(fd);
-    if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+   DRM_RDWR : 0);
+    if (IS_ERR(dmabuf))
  return PTR_ERR(dmabuf);
  mem->dmabuf = dmabuf;
  }
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 03bd3c7bd0dc..622c51d3fe18 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -409,23 +409,9 @@ static struct dma_buf 
*export_and_register_object(struct drm_device *dev,

  return dmabuf;
  }
  -/**
- * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
- * @dev: dev to export the buffer from
- * @file_priv: drm file-private structure
- * @handle: buffer handle to export
- * @flags: flags like DRM_CLOEXEC
- * @prime_fd: pointer to storage for the fd id of the create dma-buf
- *
- * This is the PRIME export function which must be used mandatorily 
by GEM
- * drivers to ensure correct lifetime management of the underlying 
GEM object.
- * The actual exporting from GEM object to a dma-buf is done 
through the

- * &drm_gem_object_funcs.export callback.
- */
-int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev,
 struct drm_file *file_priv, uint32_t handle,
-   uint32_t flags,
-   int *prime_fd)
+   uint32_t flags)
  {
  struct drm_gem_object *obj;
  int ret = 0;
@@ -434,14 +420,14 @@ int drm_gem_prime_handle_to_fd(struct 
drm_device *dev,

  mutex_lock(&file_priv->prime.lock);
  obj = drm_gem_object_lookup(file_priv, handle);
  if (!obj)  {
-    ret = -ENOENT;
+    dmabuf = ERR_PTR(-ENOENT);
  goto out_unlock;
  }
    dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, 
handle);

  if (dmabuf) {
  get_dma_buf(dmabuf);
-    goto out_have_handle;
+    goto out;
  }
    mutex_lock(&dev->object_name_lock);
@@ -463,7 +449,6 @@ int drm_gem_prime_handle_to_fd(struct drm_device 
*dev,

  /* normally the created dma-buf takes ownership of the ref,
   * but if that fails then drop the ref
   */
-    ret = PTR_ERR(dmabuf);
  mutex_unlock(&dev->object_name_lock);
  goto out;
  }
@@ -478,34 +463,49 @@ int drm_gem_prime_handle_to_fd(struct 
drm_device *dev,

  ret = drm_prime_add_buf_handle(&file_priv->prime,
 dmabuf, handle);
  mutex_unlock(&dev->object_name_lock);
-    if (ret)
-    goto fail_put_dmabuf;
-
-out_have_handle:
-    ret = dma_buf_fd(d

Re: [PATCH] Revert "drm/amdgpu: init iommu after amdkfd device init"

2024-06-04 Thread Felix Kuehling




On 2024-06-03 18:19, Armin Wolf wrote:

Am 23.05.24 um 19:30 schrieb Armin Wolf:


This reverts commit 56b522f4668167096a50c39446d6263c96219f5f.

A user reported that this commit breaks the integrated gpu of his
notebook, causing a black screen. He was able to bisect the problematic
commit and verified that by reverting it the notebook works again.
He also confirmed that kernel 6.8.1 also works on his device, so the
upstream commit itself seems to be ok.

An amdgpu developer (Alex Deucher) confirmed that this patch should
have never been ported to 5.15 in the first place, so revert this
commit from the 5.15 stable series.


Hi,

what is the status of this?


Which branch is this for? This patch won't apply to anything after Linux 
6.5. Support for IOMMUv2 was removed from amdgpu in Linux 6.6 by:


commit c99a2e7ae291e5b19b60443eb6397320ef9e8571
Author: Alex Deucher 
Date:   Fri Jul 28 12:20:12 2023 -0400

    drm/amdkfd: drop IOMMUv2 support

    Now that we use the dGPU path for all APUs, drop the
    IOMMUv2 support.

    v2: drop the now unused queue manager functions for gfx7/8 APUs

    Reviewed-by: Felix Kuehling 
    Acked-by: Christian König 
    Tested-by: Mike Lothian 
    Signed-off-by: Alex Deucher 

Regards,
  Felix




Armin Wolf



Reported-by: Barry Kauler 
Signed-off-by: Armin Wolf 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 222a1d9ecf16..5f6c32ec674d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2487,6 +2487,10 @@ static int amdgpu_device_ip_init(struct 
amdgpu_device *adev)

  if (r)
  goto init_failed;

+    r = amdgpu_amdkfd_resume_iommu(adev);
+    if (r)
+    goto init_failed;
+
  r = amdgpu_device_ip_hw_init_phase1(adev);
  if (r)
  goto init_failed;
@@ -2525,10 +2529,6 @@ static int amdgpu_device_ip_init(struct 
amdgpu_device *adev)

  if (!adev->gmc.xgmi.pending_reset)
  amdgpu_amdkfd_device_init(adev);

-    r = amdgpu_amdkfd_resume_iommu(adev);
-    if (r)
-    goto init_failed;
-
  amdgpu_fru_get_product_info(adev);

  init_failed:
--
2.39.2

Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-06-04 Thread Felix Kuehling




On 2024-06-03 22:13, Al Viro wrote:

Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into
descriptor table, only to have it looked up by file descriptor and
remove it from descriptor table is not just too convoluted - it's
racy; another thread might have modified the descriptor table while
we'd been going through that song and dance.

It's not hard to fix - turn drm_gem_prime_handle_to_fd()
into a wrapper for a new helper that would simply return the
dmabuf, without messing with descriptor table.

Then kfd_mem_export_dmabuf() would simply use that new helper
and leave the descriptor table alone.
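
The shape of that fix, a helper that returns either the object or an encoded
error, with the older int-returning entry point reduced to a wrapper, follows
the kernel's ERR_PTR convention. The sketch below re-creates ERR_PTR(),
PTR_ERR() and IS_ERR() in userspace in simplified form so that it can be
compiled on its own. The functions handle_to_buf() and handle_to_slot() and
the toy_buf type are hypothetical and are not the DRM functions from the
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace re-creations of the kernel helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error codes occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct toy_buf { int handle; };

static struct toy_buf known = { .handle = 7 };

/* New-style helper: returns the object, or an errno encoded as a pointer. */
static struct toy_buf *handle_to_buf(int handle)
{
	if (handle != known.handle)
		return ERR_PTR(-ENOENT);
	return &known;
}

/* Old-style signature kept as a thin wrapper around the new helper. */
static int handle_to_slot(int handle, int *slot)
{
	struct toy_buf *buf = handle_to_buf(handle);

	assert(slot != NULL);
	if (IS_ERR(buf))
		return (int)PTR_ERR(buf);
	*slot = 0;      /* placeholder for publishing buf in a descriptor table */
	return 0;
}
```

A caller that only needs the object uses handle_to_buf() and never touches
any shared table; only callers that really want a descriptor go through the
wrapper.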

Signed-off-by: Al Viro 


This patch looks good to me on the amdgpu side. For the DRM side I'm 
adding dri-devel.


Acked-by: Felix Kuehling 



---
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 8975cf41a91a..793780bb819c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,7 +25,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  
  #include 

@@ -812,18 +811,13 @@ static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
if (!mem->dmabuf) {
struct amdgpu_device *bo_adev;
struct dma_buf *dmabuf;
-   int r, fd;
  
  		bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);

-   r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file,
+   dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev, bo_adev->kfd.client.file,
   mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-  DRM_RDWR : 0, &fd);
-   if (r)
-   return r;
-   dmabuf = dma_buf_get(fd);
-   close_fd(fd);
-   if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+  DRM_RDWR : 0);
+   if (IS_ERR(dmabuf))
return PTR_ERR(dmabuf);
mem->dmabuf = dmabuf;
}
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 03bd3c7bd0dc..622c51d3fe18 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -409,23 +409,9 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
return dmabuf;
  }
  
-/**

- * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
- * @dev: dev to export the buffer from
- * @file_priv: drm file-private structure
- * @handle: buffer handle to export
- * @flags: flags like DRM_CLOEXEC
- * @prime_fd: pointer to storage for the fd id of the create dma-buf
- *
- * This is the PRIME export function which must be used mandatorily by GEM
- * drivers to ensure correct lifetime management of the underlying GEM object.
- * The actual exporting from GEM object to a dma-buf is done through the
- * &drm_gem_object_funcs.export callback.
- */
-int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev,
   struct drm_file *file_priv, uint32_t handle,
-  uint32_t flags,
-  int *prime_fd)
+  uint32_t flags)
  {
struct drm_gem_object *obj;
int ret = 0;
@@ -434,14 +420,14 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
mutex_lock(&file_priv->prime.lock);
obj = drm_gem_object_lookup(file_priv, handle);
if (!obj)  {
-   ret = -ENOENT;
+   dmabuf = ERR_PTR(-ENOENT);
goto out_unlock;
}
  
  	dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, handle);

if (dmabuf) {
get_dma_buf(dmabuf);
-   goto out_have_handle;
+   goto out;
}
  
  	mutex_lock(&dev->object_name_lock);

@@ -463,7 +449,6 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
/* normally the created dma-buf takes ownership of the ref,
 * but if that fails then drop the ref
 */
-   ret = PTR_ERR(dmabuf);
mutex_unlock(&dev->object_name_lock);
goto out;
}
@@ -478,34 +463,49 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
ret = drm_prime_add_buf_handle(&file_priv->prime,
   dmabuf, handle);
mutex_unlock(&dev->object_name_lock);
-   if (ret)
-   goto fail_put_dmabuf;
-
-out_have_handle:
-   ret = dma_buf_fd(dmabuf, flags);
-   /*
-* We must _not_ remove the buffer from the handle cache since the newly
-* created dma buf is already link

Re: [PATCH 11/11] drm/tegra: Use fbdev client helpers

2024-05-07 Thread Felix Kuehling




On 2024-05-07 07:58, Thomas Zimmermann wrote:

Implement struct drm_client_funcs with the respective helpers and
remove the custom code from the emulation. The generic helpers are
equivalent in functionality.

Signed-off-by: Thomas Zimmermann 
---
  drivers/gpu/drm/radeon/radeon_fbdev.c | 66 ++-


Was radeon meant to be a separate patch?

Regards,
  Felix



  drivers/gpu/drm/tegra/fbdev.c | 58 ++-
  2 files changed, 6 insertions(+), 118 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_fbdev.c b/drivers/gpu/drm/radeon/radeon_fbdev.c
index 02bf25759059a..cf790922174ea 100644
--- a/drivers/gpu/drm/radeon/radeon_fbdev.c
+++ b/drivers/gpu/drm/radeon/radeon_fbdev.c
@@ -29,7 +29,6 @@
  #include 
  #include 
  
-#include 

  #include 
  #include 
  #include 
@@ -293,71 +292,12 @@ static const struct drm_fb_helper_funcs radeon_fbdev_fb_helper_funcs = {
  };
  
  /*

- * Fbdev client and struct drm_client_funcs
+ * struct drm_client_funcs
   */
  
-static void radeon_fbdev_client_unregister(struct drm_client_dev *client)

-{
-   struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client);
-   struct drm_device *dev = fb_helper->dev;
-   struct radeon_device *rdev = dev->dev_private;
-
-   if (fb_helper->info) {
-   vga_switcheroo_client_fb_set(rdev->pdev, NULL);
-   drm_helper_force_disable_all(dev);
-   drm_fb_helper_unregister_info(fb_helper);
-   } else {
-   drm_client_release(&fb_helper->client);
-   drm_fb_helper_unprepare(fb_helper);
-   kfree(fb_helper);
-   }
-}
-
-static int radeon_fbdev_client_restore(struct drm_client_dev *client)
-{
-   drm_fb_helper_lastclose(client->dev);
-   vga_switcheroo_process_delayed_switch();
-
-   return 0;
-}
-
-static int radeon_fbdev_client_hotplug(struct drm_client_dev *client)
-{
-   struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client);
-   struct drm_device *dev = client->dev;
-   struct radeon_device *rdev = dev->dev_private;
-   int ret;
-
-   if (dev->fb_helper)
-   return drm_fb_helper_hotplug_event(dev->fb_helper);
-
-   ret = drm_fb_helper_init(dev, fb_helper);
-   if (ret)
-   goto err_drm_err;
-
-   if (!drm_drv_uses_atomic_modeset(dev))
-   drm_helper_disable_unused_functions(dev);
-
-   ret = drm_fb_helper_initial_config(fb_helper);
-   if (ret)
-   goto err_drm_fb_helper_fini;
-
-   vga_switcheroo_client_fb_set(rdev->pdev, fb_helper->info);
-
-   return 0;
-
-err_drm_fb_helper_fini:
-   drm_fb_helper_fini(fb_helper);
-err_drm_err:
-   drm_err(dev, "Failed to setup radeon fbdev emulation (ret=%d)\n", ret);
-   return ret;
-}
-
  static const struct drm_client_funcs radeon_fbdev_client_funcs = {
-   .owner  = THIS_MODULE,
-   .unregister = radeon_fbdev_client_unregister,
-   .restore= radeon_fbdev_client_restore,
-   .hotplug= radeon_fbdev_client_hotplug,
+   .owner = THIS_MODULE,
+   DRM_FBDEV_HELPER_CLIENT_FUNCS,
  };
  
  void radeon_fbdev_setup(struct radeon_device *rdev)

diff --git a/drivers/gpu/drm/tegra/fbdev.c b/drivers/gpu/drm/tegra/fbdev.c
index db6eaac3d30e6..f9cc365cfed94 100644
--- a/drivers/gpu/drm/tegra/fbdev.c
+++ b/drivers/gpu/drm/tegra/fbdev.c
@@ -12,7 +12,6 @@
  #include 
  
  #include 

-#include 
  #include 
  #include 
  #include 
@@ -150,63 +149,12 @@ static const struct drm_fb_helper_funcs tegra_fb_helper_funcs = {
  };
  
  /*

- * struct drm_client
+ * struct drm_client_funcs
   */
  
-static void tegra_fbdev_client_unregister(struct drm_client_dev *client)

-{
-   struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client);
-
-   if (fb_helper->info) {
-   drm_fb_helper_unregister_info(fb_helper);
-   } else {
-   drm_client_release(&fb_helper->client);
-   drm_fb_helper_unprepare(fb_helper);
-   kfree(fb_helper);
-   }
-}
-
-static int tegra_fbdev_client_restore(struct drm_client_dev *client)
-{
-   drm_fb_helper_lastclose(client->dev);
-
-   return 0;
-}
-
-static int tegra_fbdev_client_hotplug(struct drm_client_dev *client)
-{
-   struct drm_fb_helper *fb_helper = drm_fb_helper_from_client(client);
-   struct drm_device *dev = client->dev;
-   int ret;
-
-   if (dev->fb_helper)
-   return drm_fb_helper_hotplug_event(dev->fb_helper);
-
-   ret = drm_fb_helper_init(dev, fb_helper);
-   if (ret)
-   goto err_drm_err;
-
-   if (!drm_drv_uses_atomic_modeset(dev))
-   drm_helper_disable_unused_functions(dev);
-
-   ret = drm_fb_helper_initial_config(fb_helper);
-   if (ret)
-   goto err_drm_fb_helper_fini;
-
-   return 0;
-
-err_drm_fb_helper_fini:
-   drm_fb_helper_fini(fb_helper);
-err_drm_err:
-   drm_err(dev,

Re: Proposal to add CRIU support to DRM render nodes

2024-05-03 Thread Felix Kuehling




On 2024-04-16 10:04, Tvrtko Ursulin wrote:
> 
> On 01/04/2024 18:58, Felix Kuehling wrote:
>>
>> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>>
>>> On 01/04/2024 17:37, Felix Kuehling wrote:
>>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>>>>
>>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi Felix,
>>>>>>>
>>>>>>> I had one more thought while browsing around the amdgpu CRIU plugin. It 
>>>>>>> appears it relies on the KFD support being compiled in and /dev/kfd 
>>>>>>> present, correct? AFAICT at least, it relies on that to figure out the 
>>>>>>> amdgpu DRM node.
>>>>>>>
>>>>>>> In would be probably good to consider designing things without that 
>>>>>>> dependency. So that checkpointing an application which does not use 
>>>>>>> /dev/kfd is possible. Or if the kernel does not even have the KFD 
>>>>>>> support compiled in.
>>>>>>
>>>>>> Yeah, if we want to support graphics apps that don't use KFD, we should 
>>>>>> definitely do that. Currently we get a lot of topology information from 
>>>>>> KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
>>>>>> by KFD. We'd need to get GPU device info from the render nodes instead. 
>>>>>> And if KFD is available, we may need to integrate both sources of 
>>>>>> information.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> It could perhaps mean no more than adding some GPU discovery code into 
>>>>>>> CRIU. Which shuold be flexible enough to account for things like 
>>>>>>> re-assigned minor numbers due driver reload.
>>>>>>
>>>>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin. I 
>>>>>> was thinking this is still part of the plugin.
>>>>>
>>>>> Yes I agree. I was only thinking about adding some DRM device discovery 
>>>>> code in a more decoupled fashion from the current plugin, for both the 
>>>>> reason discussed above (decoupling a bit from reliance on kfd sysfs), and 
>>>>> then also if/when a new DRM driver might want to implement this the code 
>>>>> could be move to some common plugin area.
>>>>>
>>>>> I am not sure how feasible that would be though. The "gpu id" concept and 
>>>>> it's matching in the current kernel code and CRIU plugin - is that value 
>>>>> tied to the physical GPU instance or how it works?
>>>>
>>>> The concept of the GPU ID is that it's stable while the system is up, even 
>>>> when devices get added and removed dynamically. It was baked into the API 
>>>> early on, but I don't think we ever fully validated device hot plug. I 
>>>> think the closest we're getting is with our latest MI GPUs and dynamic 
>>>> partition mode change.
>>>
>>> Doesn't it read the saved gpu id from the image file while doing restore 
>>> and tries to open the render node to match it? Maybe I am misreading the 
>>> code.. But if it does, does it imply that in practice it could be stable 
>>> across reboots? Or that it is not possible to restore to a different 
>>> instance of maybe the same GPU model installed in a system?
>>
>> Ah, the idea is, that when you restore on a different system, you may get 
>> different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore 
>> it on GPU 2 on the same system. That's why we need to translate GPU IDs in 
>> restored applications. User mode still uses the old GPU IDs, but the kernel 
>> mode driver translates them to the actual GPU IDs of the GPUs that the 
>> process was restored on.
> 
> I see.. I think. Normal flow is ppd->user_gpu_id set during client init, but 
> for restored clients it gets overriden during restore so that any further 
> ioctls can actually not instantly fail.
> 
> And then in amdgpu_plugin_restore_file, when it is opening the render node, 
> it relies on the kfd topology to have filled in (more or less) the 
> target_gpu_id corresponding to the render node gpu id of the target GPU - the 
> one associated with the new kfd gpu_id?

Ye

Re: [PATCH] drm/amdkfd: fix NULL pointer dereference

2024-04-15 Thread Felix Kuehling

This patch does not apply to amd-staging-drm-next. This is against a 
DKMS branch and should be reviewed on our internal mailing list.


However, I suspect that part of the problem is that the DKMS branch has 
diverged quite a bit in this area, and is missing at least one patch 
from me that was reverted, probably because of an improper port. The 
proper solution should involve getting the DKMS branch back in sync with 
upstream. I'll look into that.


Regards,
  Felix

On 2024-04-13 14:07, vitaly.pros...@amd.com wrote:

From: Vitaly Prosyak 

[  +0.006038] BUG: kernel NULL pointer dereference, address: 0028
[  +0.006969] #PF: supervisor read access in kernel mode
[  +0.005139] #PF: error_code(0x) - not-present page
[  +0.005139] PGD 0 P4D 0
[  +0.002530] Oops:  [#1] PREEMPT SMP NOPTI
[  +0.004356] CPU: 11 PID: 12625 Comm: kworker/11:0 Tainted: GW 
 6.7.0+ #2
[  +0.008097] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE 
WIFI II, BIOS 1302 12/08/2023
[  +0.009398] Workqueue: events evict_process_worker [amdgpu]
[  +0.005750] RIP: 0010:evict_process_worker+0x2f/0x460 [amdgpu]
[  +0.005991] Code: 55 48 89 e5 41 57 41 56 4c 8d b7 a8 fc ff ff 41 55 41 54 53 48 89 
fb 48 83 ec 10 0f 1f 44 00 00 48 8b 43 f8 8b 93 b0 00 00 00 <48> 3b 50 28 0f 85 
50 03 00 00 48 8d 7b 58 e8 ee be cb bf 48 8b 05
[  +0.018791] RSP: 0018:c90009a2be10 EFLAGS: 00010282
[  +0.005226] RAX:  RBX: 888197ffc358 RCX: 
[  +0.007140] RDX: 0a1b RSI:  RDI: 888197ffc358
[  +0.007139] RBP: c90009a2be48 R08:  R09: 
[  +0.007139] R10:  R11:  R12: 888197ffc358
[  +0.007139] R13: 888100153a00 R14: 888197ffc000 R15: 888100153a05
[  +0.007137] FS:  () GS:889facac() 
knlGS:
[  +0.008094] CS:  0010 DS:  ES:  CR0: 80050033
[  +0.005747] CR2: 0028 CR3: 00010d1fc001 CR4: 00770ef0
[  +0.007138] PKRU: 5554
[  +0.002702] Call Trace:
[  +0.002443]  
[  +0.002096]  ? show_regs+0x72/0x90
[  +0.003402]  ? __die+0x25/0x80
[  +0.003052]  ? page_fault_oops+0x154/0x4c0
[  +0.004099]  ? do_user_addr_fault+0x30e/0x6e0
[  +0.004357]  ? psi_group_change+0x237/0x520
[  +0.004185]  ? exc_page_fault+0x84/0x1b0
[  +0.003926]  ? asm_exc_page_fault+0x27/0x30
[  +0.004187]  ? evict_process_worker+0x2f/0x460 [amdgpu]
[  +0.005377]  process_one_work+0x17b/0x360
[  +0.004011]  ? __pfx_worker_thread+0x10/0x10
[  +0.004269]  worker_thread+0x307/0x430
[  +0.003748]  ? __pfx_worker_thread+0x10/0x10
[  +0.004268]  kthread+0xf7/0x130
[  +0.003142]  ? __pfx_kthread+0x10/0x10
[  +0.003749]  ret_from_fork+0x46/0x70
[  +0.003573]  ? __pfx_kthread+0x10/0x10
[  +0.003747]  ret_from_fork_asm+0x1b/0x30
[  +0.003924]  

When we run stress tests, the eviction fence can be NULL and fail to match
last_eviction_seqno.

Avoid calling dma_fence_signal and dma_fence_put with zero fences to rely
on checking parameters in DMA API.

Cc: Alex Deucher 
Cc: Christian Koenig 
Cc: Xiaogang Chen 
Cc: Felix Kuehling 
Signed-off-by: Vitaly Prosyak 
---
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index eb380296017d..a15fae1c398a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2118,7 +2118,7 @@ static void evict_process_worker(struct work_struct *work)
 */
p = container_of(dwork, struct kfd_process, eviction_work);
trace_kfd_evict_process_worker_start(p);
-   WARN_ONCE(p->last_eviction_seqno != p->ef->seqno,
+   WARN_ONCE(p->ef && p->last_eviction_seqno != p->ef->seqno,
  "Eviction fence mismatch\n");
  
  	/* Narrow window of overlap between restore and evict work

@@ -2134,9 +2134,11 @@ static void evict_process_worker(struct work_struct *work)
pr_debug("Started evicting pasid 0x%x\n", p->pasid);
ret = kfd_process_evict_queues(p, false, 
KFD_QUEUE_EVICTION_TRIGGER_TTM);
if (!ret) {
-   dma_fence_signal(p->ef);
-   dma_fence_put(p->ef);
-   p->ef = NULL;
+   if (p->ef) {
+   dma_fence_signal(p->ef);
+   dma_fence_put(p->ef);
+   p->ef = NULL;
+   }
  
  		if (!kfd_process_unmap_doorbells_if_idle(p))

kfd_process_schedule_restore(p);

Re: [PATCH] drm/ttm: stop pooling cached NUMA pages v2

2024-04-15 Thread Felix Kuehling


On 2024-04-15 10:08, Christian König wrote:

Am 15.04.24 um 15:53 schrieb Felix Kuehling:

On 2024-04-15 9:48, Christian König wrote:

From: Christian König 

We only pool write combined and uncached allocations because they
require extra overhead on allocation and release.

If we also pool cached NUMA it not only means some extra unnecessary
overhead, but also that under memory pressure it can happen that
pages from the wrong NUMA node enter the pool and are re-used
over and over again.

This can lead to performance reduction after running into memory
pressure.

v2: restructure and clean up the code a bit from the internal hack to
 test this.

Signed-off-by: Christian König 
Fixes: 4482d3c94d7f ("drm/ttm: add NUMA node id to the pool")
CC: sta...@vger.kernel.org
---
  drivers/gpu/drm/ttm/ttm_pool.c | 38 +-

  1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 112438d965ff..6e1fd6985ffc 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -288,17 +288,23 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,

    enum ttm_caching caching,
    unsigned int order)
  {
-    if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE)
+    if (pool->use_dma_alloc)
  return &pool->caching[caching].orders[order];
    #ifdef CONFIG_X86
  switch (caching) {
  case ttm_write_combined:
+    if (pool->nid != NUMA_NO_NODE)
+    return &pool->caching[caching].orders[order];


Doesn't this break USWC allocations on NUMA systems, where we set a 
NUMA node for the default pool (at least we were planning to at some 
point)?


I don't think so, but I might have missed something. Why do you think 
that would break?


I mean the idea is basically if the pool is associated with a NUMA id 
we should rather use this pool instead of the global one.


And that is true for both cases, the default pool and the specialized 
ones.


OK, I think I misunderstood what I was reading. It looked to me like it 
would always use a "caching" pool if nid was set. But caching here is a 
variable; each node still has specialized pools for write combining etc.


Then the concern you stated in the commit message, "under memory pressure 
it can happen that pages from the wrong NUMA node enter the pool and 
are re-used over and over again", is still possible for uncached and wc 
pages. Anyway, it's better than not having NUMA, I guess.
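
To make the selection logic easier to follow, here is a small
userspace model of what ttm_pool_select_type() does after this patch,
as far as the quoted hunk shows. It is a sketch under assumptions, not
the TTM code: struct pool_cfg, enum pool_kind, select_pool() and
NO_NODE are hypothetical names, the CONFIG_X86 guard is ignored, and
the "no pool" result for cached pages is inferred from the commit
message rather than from the hunk.

```c
#include <stdbool.h>

/* Hypothetical model types; the real code returns struct ttm_pool_type *. */
#define NO_NODE (-1)		/* stand-in for NUMA_NO_NODE */

enum caching {
	CACHING_CACHED,
	CACHING_WRITE_COMBINED,
	CACHING_UNCACHED,
};

enum pool_kind {
	POOL_NONE,		/* pages are not pooled at all */
	POOL_PER_DEVICE,	/* pool->caching[caching].orders[order] */
	POOL_GLOBAL,		/* global_write_combined / global_uncached */
	POOL_GLOBAL_DMA32,	/* global_dma32_* variants */
};

struct pool_cfg {
	bool use_dma_alloc;
	bool use_dma32;
	int nid;
};

static enum pool_kind select_pool(const struct pool_cfg *pool,
				  enum caching caching)
{
	/* DMA-allocated pools always keep their own per-caching pools. */
	if (pool->use_dma_alloc)
		return POOL_PER_DEVICE;

	switch (caching) {
	case CACHING_WRITE_COMBINED:
	case CACHING_UNCACHED:
		/* A NUMA-bound pool keeps WC/UC pages in its own pools. */
		if (pool->nid != NO_NODE)
			return POOL_PER_DEVICE;
		return pool->use_dma32 ? POOL_GLOBAL_DMA32 : POOL_GLOBAL;
	default:
		/* Assumption from the commit message: cached pages are not pooled. */
		return POOL_NONE;
	}
}
```

In this model a pool with a NUMA node still gets a separate pool per
caching type for write-combined and uncached pages, which is the
"caching here is a variable" point above, while cached pages bypass
pooling regardless of nid.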


The patch is

Reviewed-by: Felix Kuehling 




Regards,
Christian.



Regards,
  Felix



+
  if (pool->use_dma32)
  return &global_dma32_write_combined[order];
    return &global_write_combined[order];
  case ttm_uncached:
+    if (pool->nid != NUMA_NO_NODE)
+    return &pool->caching[caching].orders[order];
+
  if (pool->use_dma32)
  return &global_dma32_uncached[order];
  @@ -566,11 +572,17 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,

  pool->use_dma_alloc = use_dma_alloc;
  pool->use_dma32 = use_dma32;
  -    if (use_dma_alloc || nid != NUMA_NO_NODE) {
-    for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-    for (j = 0; j < NR_PAGE_ORDERS; ++j)
- ttm_pool_type_init(&pool->caching[i].orders[j],
-   pool, i, j);
+    for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
+    for (j = 0; j < NR_PAGE_ORDERS; ++j) {
+    struct ttm_pool_type *pt;
+
+    /* Initialize only pool types which are actually used */
+    pt = ttm_pool_select_type(pool, i, j);
+    if (pt != &pool->caching[i].orders[j])
+    continue;
+
+    ttm_pool_type_init(pt, pool, i, j);
+    }
  }
  }
  EXPORT_SYMBOL(ttm_pool_init);
@@ -599,10 +611,16 @@ void ttm_pool_fini(struct ttm_pool *pool)
  {
  unsigned int i, j;
  -    if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) {
-    for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-    for (j = 0; j < NR_PAGE_ORDERS; ++j)
- ttm_pool_type_fini(&pool->caching[i].orders[j]);
+    for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
+    for (j = 0; j < NR_PAGE_ORDERS; ++j) {
+    struct ttm_pool_type *pt;
+
+    pt = ttm_pool_select_type(pool, i, j);
+    if (pt != &pool->caching[i].orders[j])
+    continue;
+
+    ttm_pool_type_fini(pt);
+    }
  }
    /* We removed the pool types from the LRU, but we need to 
also make sure

Re: [PATCH] drm/ttm: stop pooling cached NUMA pages v2

2024-04-15 Thread Felix Kuehling


On 2024-04-15 9:48, Christian König wrote:

From: Christian König 

We only pool write combined and uncached allocations because they
require extra overhead on allocation and release.

If we also pool cached NUMA it not only means some extra unnecessary
overhead, but also that under memory pressure it can happen that
pages from the wrong NUMA node enter the pool and are re-used
over and over again.

This can lead to performance reduction after running into memory
pressure.

v2: restructure and clean up the code a bit from the internal hack to
 test this.

Signed-off-by: Christian König 
Fixes: 4482d3c94d7f ("drm/ttm: add NUMA node id to the pool")
CC: sta...@vger.kernel.org
---
  drivers/gpu/drm/ttm/ttm_pool.c | 38 +-
  1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 112438d965ff..6e1fd6985ffc 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -288,17 +288,23 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
  enum ttm_caching caching,
  unsigned int order)
  {
-   if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE)
+   if (pool->use_dma_alloc)
return &pool->caching[caching].orders[order];
  
  #ifdef CONFIG_X86

switch (caching) {
case ttm_write_combined:
+   if (pool->nid != NUMA_NO_NODE)
+   return &pool->caching[caching].orders[order];


Doesn't this break USWC allocations on NUMA systems, where we set a NUMA 
node for the default pool (at least we were planning to at some point)?


Regards,
  Felix



+
if (pool->use_dma32)
return &global_dma32_write_combined[order];
  
  		return &global_write_combined[order];

case ttm_uncached:
+   if (pool->nid != NUMA_NO_NODE)
+   return &pool->caching[caching].orders[order];
+
if (pool->use_dma32)
return &global_dma32_uncached[order];
  
@@ -566,11 +572,17 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,

pool->use_dma_alloc = use_dma_alloc;
pool->use_dma32 = use_dma32;
  
-	if (use_dma_alloc || nid != NUMA_NO_NODE) {

-   for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-   for (j = 0; j < NR_PAGE_ORDERS; ++j)
-   ttm_pool_type_init(&pool->caching[i].orders[j],
-  pool, i, j);
+   for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
+   for (j = 0; j < NR_PAGE_ORDERS; ++j) {
+   struct ttm_pool_type *pt;
+
+   /* Initialize only pool types which are actually used */
+   pt = ttm_pool_select_type(pool, i, j);
+   if (pt != &pool->caching[i].orders[j])
+   continue;
+
+   ttm_pool_type_init(pt, pool, i, j);
+   }
}
  }
  EXPORT_SYMBOL(ttm_pool_init);
@@ -599,10 +611,16 @@ void ttm_pool_fini(struct ttm_pool *pool)
  {
unsigned int i, j;
  
-	if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) {

-   for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-   for (j = 0; j < NR_PAGE_ORDERS; ++j)
-   ttm_pool_type_fini(&pool->caching[i].orders[j]);
+   for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
+   for (j = 0; j < NR_PAGE_ORDERS; ++j) {
+   struct ttm_pool_type *pt;
+
+   pt = ttm_pool_select_type(pool, i, j);
+   if (pt != &pool->caching[i].orders[j])
+   continue;
+
+   ttm_pool_type_fini(pt);
+   }
}
  
  	/* We removed the pool types from the LRU, but we need to also make sure

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling

On 2024-04-01 12:56, Tvrtko Ursulin wrote:

On 01/04/2024 17:37, Felix Kuehling wrote:

On 2024-04-01 11:09, Tvrtko Ursulin wrote:

On 28/03/2024 20:42, Felix Kuehling wrote:

On 2024-03-28 12:03, Tvrtko Ursulin wrote:

Hi Felix,

I had one more thought while browsing around the amdgpu CRIU
plugin. It appears it relies on the KFD support being compiled in
and /dev/kfd present, correct? AFAICT at least, it relies on that
to figure out the amdgpu DRM node.

It would probably be good to consider designing things without
that dependency. So that checkpointing an application which does
not use /dev/kfd is possible. Or if the kernel does not even have
the KFD support compiled in.

Yeah, if we want to support graphics apps that don't use KFD, we
should definitely do that. Currently we get a lot of topology
information from KFD, not even from the /dev/kfd device but from
the sysfs nodes exposed by KFD. We'd need to get GPU device info
from the render nodes instead. And if KFD is available, we may need
to integrate both sources of information.

It could perhaps mean no more than adding some GPU discovery code
into CRIU. Which should be flexible enough to account for things
like re-assigned minor numbers due to driver reload.

Do you mean adding GPU discovery to the core CRIU, or to the
plugin? I was thinking this is still part of the plugin.

Yes I agree. I was only thinking about adding some DRM device
discovery code in a more decoupled fashion from the current plugin,
for both the reason discussed above (decoupling a bit from reliance
on kfd sysfs), and then also if/when a new DRM driver might want to
implement this the code could be moved to some common plugin area.

I am not sure how feasible that would be though. The "gpu id"
concept and its matching in the current kernel code and CRIU plugin
- is that value tied to the physical GPU instance, or how does it work?

The concept of the GPU ID is that it's stable while the system is up,
even when devices get added and removed dynamically. It was baked
into the API early on, but I don't think we ever fully validated
device hot plug. I think the closest we're getting is with our latest
MI GPUs and dynamic partition mode change.

Doesn't it read the saved gpu id from the image file while doing
restore and tries to open the render node to match it? Maybe I am
misreading the code.. But if it does, does it imply that in practice
it could be stable across reboots? Or that it is not possible to
restore to a different instance of maybe the same GPU model installed
in a system?

Ah, the idea is that when you restore on a different system, you may
get different GPU IDs. Or you may checkpoint an app running on GPU 1 but
restore it on GPU 2 on the same system. That's why we need to translate
GPU IDs in restored applications. User mode still uses the old GPU IDs,
but the kernel mode driver translates them to the actual GPU IDs of the
GPUs that the process was restored on.
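
As a rough illustration of that translation step, here is a
self-contained userspace sketch. It is not the KFD implementation:
struct gpu_id_map and translate_gpu_id() are hypothetical names, and
a flat per-process table is simply the easiest way to show the idea of
mapping the IDs user mode still holds to the IDs of the GPUs the
process was restored on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process table built at restore time. */
struct gpu_id_map {
	uint32_t user_gpu_id;	/* ID the application saw before checkpoint */
	uint32_t actual_gpu_id;	/* ID of the GPU it was restored on */
};

/*
 * Translate an ID passed in by user mode to the actual GPU ID.
 * Returns 0 and stores the result in *actual on success, or -1
 * (leaving *actual untouched) if the process has no such GPU.
 */
static int translate_gpu_id(const struct gpu_id_map *map, size_t count,
			    uint32_t user_gpu_id, uint32_t *actual)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (map[i].user_gpu_id == user_gpu_id) {
			*actual = map[i].actual_gpu_id;
			return 0;
		}
	}
	return -1;
}
```

For a process that was never checkpointed, such a table would just map
every ID to itself, so the same lookup path could serve both cases.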

This also highlights another aspect on those spatially partitioned
GPUs. GPU IDs identify device partitions, not devices. Similarly,
each partition has its own render node, and the KFD topology info in
sysfs points to the render-minor number corresponding to each GPU ID.

I am not familiar with this. This is not SR-IOV but some other kind of
partitioning? Would you have any links where I could read more?

Right, the bare-metal driver can partition a PF spatially without SRIOV.
SRIOV can also use spatial partitioning and expose each partition
through its own VF, but that's not useful for bare metal. Spatial
partitioning is new in MI300. There is some high-level info in this
whitepaper:
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.

Regards,
Felix

Regards,

Tvrtko

Otherwise I am eagerly awaiting to hear more about the design
specifics around dma-buf handling. And also seeing how to extend
to other DRM related anonymous fds.

I've been pretty far under-water lately. I hope I'll find time to
work on this more, but it's probably going to be at least a few weeks.

Got it.

Regards,

Tvrtko

Regards,
Felix

Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:

On 15/03/2024 02:33, Felix Kuehling wrote:

On 2024-03-12 5:45, Tvrtko Ursulin wrote:

On 11/03/2024 14:48, Tvrtko Ursulin wrote:

Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render
nodes in order to maintain CRIU support for ROCm applications
once they start relying on render nodes for more GPU memory
management. In this email I'm providing some background on why
we are doing this, and outlining some of the problems we need
to solve to checkpoint and restore render node state and
shared memory (DMABuf) state. I have some thoughts on the API
design, leaning on

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling


On 2024-04-01 11:09, Tvrtko Ursulin wrote:


On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and 
/dev/kfd present, correct? AFAICT at least, it relies on that to 
figure out the amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from the 
sysfs nodes exposed by KFD. We'd need to get GPU device info from the 
render nodes instead. And if KFD is available, we may need to 
integrate both sources of information.





It could perhaps mean no more than adding some GPU discovery code 
into CRIU. Which should be flexible enough to account for things 
like re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin? 
I was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance on 
kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be moved to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" concept 
and its matching in the current kernel code and CRIU plugin - is that 
value tied to the physical GPU instance, or how does it work?


The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked into 
the API early on, but I don't think we ever fully validated device hot 
plug. I think the closest we're getting is with our latest MI GPUs and 
dynamic partition mode change.


This also highlights another aspect on those spatially partitioned GPUs. 
GPU IDs identify device partitions, not devices. Similarly, each 
partition has its own render node, and the KFD topology info in sysfs 
points to the render-minor number corresponding to each GPU ID.


Regards,
  Felix




Otherwise I am eagerly awaiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to 
work on this more, but it's probably going to be at least a few weeks.


Got it.

Regards,

Tvrtko



Regards,
   Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm applications 
once they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background on why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual 
address management, that creates a problem for checkpointing 
and restoring ROCm applications with CRIU. Currently there is 
no support for checkpointing and restoring render node state, 
other than CPU virtual address mappings. Support will be needed 
for checkpointing GEM buffer objects and handles, their GPU 
virtual address mappings and memory sharing relationships 
between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including 
scheduler contexts and BO lists. Most of this state is 
driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf 
APIs and may have implications for other drivers i

Re: Proposal to add CRIU support to DRM render nodes

2024-03-28 Thread Felix Kuehling




On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and /dev/kfd 
present, correct? AFAICT at least, it relies on that to figure out the 
amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we should 
definitely do that. Currently we get a lot of topology information from 
KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
by KFD. We'd need to get GPU device info from the render nodes instead. 
And if KFD is available, we may need to integrate both sources of 
information.





It could perhaps mean no more than adding some GPU discovery code into 
CRIU. Which should be flexible enough to account for things like 
re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin? I 
was thinking this is still part of the plugin.
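
For illustration only, a minimal user-space sketch of what render-node discovery on the plugin side could start from, assuming the usual /dev/dri/renderD<minor> naming. The helper names render_node_minor() and list_render_minors() are hypothetical and are not taken from CRIU or the amdgpu plugin. Since minors can be re-assigned after a driver reload, the minor found here is only valid for the current boot; matching devices across checkpoint and restore would presumably need a more stable identifier, which this sketch does not attempt.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the minor number encoded in a render node name such as
 * "renderD128", or -1 if the name is not a render node.
 * Hypothetical helper; not taken from CRIU or the amdgpu plugin.
 */
int render_node_minor(const char *name)
{
	static const char prefix[] = "renderD";
	const size_t plen = sizeof(prefix) - 1;
	char *end;
	long minor;

	assert(name != NULL);
	if (strncmp(name, prefix, plen) != 0)
		return -1;
	/* Require a digit right away so strtol() cannot accept "+1" or " 1". */
	if (name[plen] < '0' || name[plen] > '9')
		return -1;

	errno = 0;
	minor = strtol(name + plen, &end, 10);
	if (errno != 0 || *end != '\0' || minor > INT_MAX)
		return -1;
	return (int)minor;
}

/*
 * Scan dir_path (normally "/dev/dri") and store up to max render node
 * minors in minors[]. Returns the number stored, or -1 if the directory
 * cannot be opened.
 */
int list_render_minors(const char *dir_path, int *minors, int max)
{
	DIR *dir;
	struct dirent *ent;
	int n = 0;

	assert(dir_path != NULL && minors != NULL && max >= 0);
	dir = opendir(dir_path);
	if (!dir)
		return -1;
	while (n < max && (ent = readdir(dir)) != NULL) {
		int minor = render_node_minor(ent->d_name);

		if (minor >= 0)
			minors[n++] = minor;
	}
	closedir(dir);
	return n;
}
```

A caller would pass "/dev/dri" as dir_path and then open each node it finds to query device information through the render node itself rather than through the KFD sysfs topology.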





Otherwise I am eagerly awaiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to work 
on this more, but it's probably going to be at least a few weeks.


Regards,
  Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm applications once 
they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and 
restoring ROCm applications with CRIU. Currently there is no 
support for checkpointing and restoring render node state, other 
than CPU virtual address mappings. Support will be needed for 
checkpointing GEM buffer objects and handles, their GPU virtual 
address mappings and memory sharing relationships between devices 
and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including 
scheduler contexts and BO lists. Most of this state is 
driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf 
APIs and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, 
although I cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of 
any concrete plans, but I think the feature is pretty cool and if 
amdgpu gets it working I wouldn't be surprised if other drivers 
would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I 
recently implemented on Mesa's request, which is to be able to 
"upload" the GPU context from the GPU hang error state and replay 
the hanging request. It is kind of (at a stretch) a very special 
tiny subset of checkpoint and restore, so I am just mentioning it as a 
curiosity.


And there is also another partial conceptual intersection with the 
(at the moment not yet upstream) i915 online debugger. This part 
being in the area of discovering and enumerating GPU resources 
belonging to the client.


I don't see any immediate design or code sharing opportunities, 
though; just mentioning it.


I did spend some time reading your plugin and kernel 
implementation out of curiosity and have some comments and

Re: [PATCH 05/10] drivers: use new capable_any functionality

2024-03-15 Thread Felix Kuehling


On 2024-03-15 7:37, Christian Göttsche wrote:

Use the new added capable_any function in appropriate cases, where a
task is required to have any of two capabilities.

Reorder CAP_SYS_ADMIN last.

Signed-off-by: Christian Göttsche 
Acked-by: Alexander Gordeev  (s390 portion)


Acked-by: Felix Kuehling  (amdkfd portion)



---
v4:
Additional usage in kfd_ioctl()
v3:
rename to capable_any()
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 3 +--
  drivers/net/caif/caif_serial.c   | 2 +-
  drivers/s390/block/dasd_eckd.c   | 2 +-
  3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index dfa8c69532d4..8c7ebca01c17 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -3290,8 +3290,7 @@ static long kfd_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
 * more priviledged access.
 */
if (unlikely(ioctl->flags & KFD_IOC_FLAG_CHECKPOINT_RESTORE)) {
-   if (!capable(CAP_CHECKPOINT_RESTORE) &&
-   !capable(CAP_SYS_ADMIN)) {
+   if (!capable_any(CAP_CHECKPOINT_RESTORE, CAP_SYS_ADMIN)) {
retcode = -EACCES;
goto err_i1;
}
diff --git a/drivers/net/caif/caif_serial.c b/drivers/net/caif/caif_serial.c
index ed3a589def6b..e908b9ce57dc 100644
--- a/drivers/net/caif/caif_serial.c
+++ b/drivers/net/caif/caif_serial.c
@@ -326,7 +326,7 @@ static int ldisc_open(struct tty_struct *tty)
/* No write no play */
if (tty->ops->write == NULL)
return -EOPNOTSUPP;
-   if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_TTY_CONFIG))
+   if (!capable_any(CAP_SYS_TTY_CONFIG, CAP_SYS_ADMIN))
return -EPERM;
  
  	/* release devices to avoid name collision */

diff --git a/drivers/s390/block/dasd_eckd.c b/drivers/s390/block/dasd_eckd.c
index 373c1a86c33e..8f9a5136306a 100644
--- a/drivers/s390/block/dasd_eckd.c
+++ b/drivers/s390/block/dasd_eckd.c
@@ -5384,7 +5384,7 @@ static int dasd_symm_io(struct dasd_device *device, void __user *argp)
char psf0, psf1;
int rc;
  
-   if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RAWIO))
+   if (!capable_any(CAP_SYS_RAWIO, CAP_SYS_ADMIN))
return -EACCES;
psf0 = psf1 = 0;
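
The "any of two capabilities, CAP_SYS_ADMIN last" pattern used in the hunks above can be sketched in user space as follows. This is only a toy model over a bitmask and is not the kernel's capable_any() implementation; model_capable_any() and cap_set_t are hypothetical names. The model returns which capability granted access, to make the effect of the argument order visible: the narrower capability is tried first and CAP_SYS_ADMIN is only the fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Capability numbers as defined in include/uapi/linux/capability.h. */
#define CAP_SYS_RAWIO          17
#define CAP_SYS_ADMIN          21
#define CAP_SYS_TTY_CONFIG     26
#define CAP_CHECKPOINT_RESTORE 40

/* Toy capability set: bit N set means capability N is held. */
typedef uint64_t cap_set_t;

static int has_cap(cap_set_t set, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (int)((set >> cap) & 1);
}

/*
 * Model of an "any of two" check. Returns the capability that granted
 * access, or -1 if neither is held. cap1 is tried first, so callers put
 * the narrower capability there and CAP_SYS_ADMIN last.
 */
int model_capable_any(cap_set_t set, int cap1, int cap2)
{
	if (has_cap(set, cap1))
		return cap1;
	if (has_cap(set, cap2))
		return cap2;
	return -1;
}
```

With this ordering, a task that holds both CAP_CHECKPOINT_RESTORE and CAP_SYS_ADMIN is granted access on the narrower capability, and CAP_SYS_ADMIN is consulted only when the narrower one is missing.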

Re: Proposal to add CRIU support to DRM render nodes

2024-03-14 Thread Felix Kuehling




On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes 
in order to maintain CRIU support for ROCm applications once they 
start relying on render nodes for more GPU memory management. In 
this email I'm providing some background why we are doing this, and 
outlining some of the problems we need to solve to checkpoint and 
restore render node state and shared memory (DMABuf) state. I have 
some thoughts on the API design, leaning on what we did for KFD, but 
would like to get feedback from the DRI community regarding that API 
and to what extent there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM API 
and improve interoperability between graphics and compute. This uses 
DMABufs for sharing buffer objects between KFD and multiple render 
node devices, as well as between processes. In the long run this 
also provides a path to moving all or most memory management from 
the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU 
virtual address mappings. Support will be needed for checkpointing 
GEM buffer objects and handles, their GPU virtual address mappings 
and memory sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including scheduler 
contexts and BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf APIs 
and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although 
I cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think the feature is pretty cool and if amdgpu gets 
it working I wouldn't be surprised if other drivers would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the 
GPU context from the GPU hang error state and replay the hanging 
request. It is kind of (at a stretch) a very special tiny subset of 
checkpoint and restore, so I am just mentioning it as a curiosity.


And there is also another partial conceptual intersection with the (at 
the moment not yet upstream) i915 online debugger. This part being in 
the area of discovering and enumerating GPU resources belonging to the 
client.


I don't see any immediate design or code sharing opportunities, 
though; just mentioning it.


I did spend some time reading your plugin and kernel implementation 
out of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM 
CRIU API (either generic or AMDGPU driver specific): The API goes 
through several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this
    time)
 2. Resume (final fixups after the VMAs are sorted out, resume
    execution)


Btw is check-pointing guaranteeing all relevant activity is idled? 
For instance dma_resv objects are free of fences which would need to be 
restored for things to continue executing sensibly? Or how is that 
handled?


In our compute use cases, we suspend user mode queues. This can include 
CWSR (compute-wave-save-restore) where the state of in-flight waves is 
stored in memory and can be reloaded and resumed from memory later. We 
don't use any fences other than "eviction fences", that are signaled 
after the queues are suspended. And those fences are never handed to 
user mode. So we don't need to worry about any fence state in the 
checkpoint.


If we extended this to support the kernel mode command submission APIs, 
I would expect that we'd wait for all current submissions to complete, 
and stop new ones from being sent to the HW before taking the 
checkpoint. When we take the checkpoint in the CRIU plugin, the CPU 
threads are already frozen and cannot submit any more work. If we wait

Re: [PATCH AUTOSEL 5.15 3/5] drm/amdgpu: Enable gpu reset for S3 abort cases on Raven series

2024-03-13 Thread Felix Kuehling


On 2024-03-11 11:14, Sasha Levin wrote:

From: Prike Liang 

[ Upstream commit c671ec01311b4744b377f98b0b4c6d033fe569b3 ]

Currently, GPU resets can now be performed successfully on the Raven
series. While GPU reset is required for the S3 suspend abort case.
So now can enable gpu reset for S3 abort cases on the Raven series.


This looks suspicious to me. I'm not sure what conditions made the GPU 
reset successful. But unless all the changes involved were also 
backported, this should probably not be applied to older kernel 
branches. I'm speculating it may be related to the removal of AMD IOMMUv2.


Regards,
  Felix




Signed-off-by: Prike Liang 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
  drivers/gpu/drm/amd/amdgpu/soc15.c | 45 +-
  1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 6a3486f52d698..ef5b3eedc8615 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -605,11 +605,34 @@ soc15_asic_reset_method(struct amdgpu_device *adev)
return AMD_RESET_METHOD_MODE1;
  }
  
+static bool soc15_need_reset_on_resume(struct amdgpu_device *adev)
+{
+   u32 sol_reg;
+
+   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
+
+   /* Will reset for the following suspend abort cases.
+* 1) Only reset limit on APU side, dGPU hasn't checked yet.
+* 2) S3 suspend abort and TOS already launched.
+*/
+   if (adev->flags & AMD_IS_APU && adev->in_s3 &&
+   !adev->suspend_complete &&
+   sol_reg)
+   return true;
+
+   return false;
+}
+
  static int soc15_asic_reset(struct amdgpu_device *adev)
  {
/* original raven doesn't have full asic reset */
-   if ((adev->apu_flags & AMD_APU_IS_RAVEN) ||
-   (adev->apu_flags & AMD_APU_IS_RAVEN2))
+   /* On the latest Raven, the GPU reset can be performed
+* successfully. So now, temporarily enable it for the
+* S3 suspend abort case.
+*/
+   if (((adev->apu_flags & AMD_APU_IS_RAVEN) ||
+   (adev->apu_flags & AMD_APU_IS_RAVEN2)) &&
+   !soc15_need_reset_on_resume(adev))
return 0;
  
  	switch (soc15_asic_reset_method(adev)) {

@@ -1490,24 +1513,6 @@ static int soc15_common_suspend(void *handle)
return soc15_common_hw_fini(adev);
  }
  
-static bool soc15_need_reset_on_resume(struct amdgpu_device *adev)
-{
-   u32 sol_reg;
-
-   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
-
-   /* Will reset for the following suspend abort cases.
-* 1) Only reset limit on APU side, dGPU hasn't checked yet.
-* 2) S3 suspend abort and TOS already launched.
-*/
-   if (adev->flags & AMD_IS_APU && adev->in_s3 &&
-   !adev->suspend_complete &&
-   sol_reg)
-   return true;
-
-   return false;
-}
-
  static int soc15_common_resume(void *handle)
  {
struct amdgpu_device *adev = (struct amdgpu_device *)handle;

Re: [PATCH] drm/amdkfd: make kfd_class constant

2024-03-05 Thread Felix Kuehling


On 2024-03-05 7:15, Ricardo B. Marliere wrote:

Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the kfd_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Greg Kroah-Hartman 
Suggested-by: Greg Kroah-Hartman 
Signed-off-by: Ricardo B. Marliere 


The patch looks good to me. Do you want me to apply this to Alex's 
amd-staging-drm-next?


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 21 +++--
  1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index f030cafc5a0a..dfa8c69532d4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -63,8 +63,10 @@ static const struct file_operations kfd_fops = {
  };
  
  static int kfd_char_dev_major = -1;
-static struct class *kfd_class;
  struct device *kfd_device;
+static const struct class kfd_class = {
+   .name = kfd_dev_name,
+};
  
  static inline struct kfd_process_device *kfd_lock_pdd_by_id(struct kfd_process *p, __u32 gpu_id)
  {
@@ -94,14 +96,13 @@ int kfd_chardev_init(void)
if (err < 0)
goto err_register_chrdev;
  
-   kfd_class = class_create(kfd_dev_name);
-   err = PTR_ERR(kfd_class);
-   if (IS_ERR(kfd_class))
+   err = class_register(&kfd_class);
+   if (err)
goto err_class_create;
  
-   kfd_device = device_create(kfd_class, NULL,
-   MKDEV(kfd_char_dev_major, 0),
-   NULL, kfd_dev_name);
+   kfd_device = device_create(&kfd_class, NULL,
+  MKDEV(kfd_char_dev_major, 0),
+  NULL, kfd_dev_name);
err = PTR_ERR(kfd_device);
if (IS_ERR(kfd_device))
goto err_device_create;
@@ -109,7 +110,7 @@ int kfd_chardev_init(void)
return 0;
  
  err_device_create:
-   class_destroy(kfd_class);
+   class_unregister(&kfd_class);
  err_class_create:
unregister_chrdev(kfd_char_dev_major, kfd_dev_name);
  err_register_chrdev:
@@ -118,8 +119,8 @@ int kfd_chardev_init(void)
  
  void kfd_chardev_exit(void)
  {
-   device_destroy(kfd_class, MKDEV(kfd_char_dev_major, 0));
-   class_destroy(kfd_class);
+   device_destroy(&kfd_class, MKDEV(kfd_char_dev_major, 0));
+   class_unregister(&kfd_class);
unregister_chrdev(kfd_char_dev_major, kfd_dev_name);
kfd_device = NULL;
  }

---
base-commit: 8bc75586ea01f1c645063d3472c115ecab03e76c
change-id: 20240305-class_cleanup-drm-amd-bdc7255b7540

Best regards,

Re: Making drm_gpuvm work across gpu devices

2024-01-29 Thread Felix Kuehling




On 2024-01-29 14:03, Christian König wrote:

Am 29.01.24 um 18:52 schrieb Felix Kuehling:

On 2024-01-29 11:28, Christian König wrote:

Am 29.01.24 um 17:24 schrieb Felix Kuehling:

On 2024-01-29 10:33, Christian König wrote:

Am 29.01.24 um 16:03 schrieb Felix Kuehling:

On 2024-01-25 13:32, Daniel Vetter wrote:

On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]
Yes most API are per device based.

One exception I know is actually the kfd SVM API. If you look at the
svm_ioctl function, it is per-process based. Each kfd_process represent
a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do
that ever again.

Need to say, kfd SVM represent a shared virtual address space across
CPU and all GPU devices on the system. This is by the definition of SVM
(shared virtual memory). This is very different from our legacy gpu
*device* driver which works for only one device (i.e., if you want one
device to access another device's memory, you will have to use dma-buf
export/import etc).

Exactly that thinking is what we have currently found as blocker for a
virtualization projects. Having SVM as device independent feature which
somehow ties to the process address space turned out to be an extremely
bad idea.

The background is that this only works for some use cases but not all
of them.

What's working much better is to just have a mirror functionality which
says that a range A..B of the process address space is mapped into a
range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for
higher level APIs and not something you need at the UAPI or even inside
the low level kernel memory management.

When you talk about migrating memory to a device you also do this on a
per device basis and *not* tied to the process address space. If you
then get crappy performance because userspace gave contradicting
information where to migrate memory then that's a bug in userspace and
not something the kernel should try to prevent somehow.

[SNIP]
I think if you start using the same drm_gpuvm for multiple devices you
will sooner or later start to run into the same mess we have seen with
KFD, where we moved more and more functionality from the KFD to the DRM
render node because we found that a lot of the stuff simply doesn't
work correctly with a single object to maintain the state.

As I understand it, KFD is designed to work across devices. A single
pseudo /dev/kfd device represent all hardware gpu devices. That is why
during kfd open, many pdd (process device data) is created, each for
one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I
see this design as a rather extreme failure. And I think it's one of
the reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending
the CPU process into the GPUs, but this idea only works for a few use
cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with
CPU address != GPU address because the VAs are actually coming from the
guest VM and not the host process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not
have any influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to
get explicit permission to do this from Dave and Daniel and maybe even
Linus.

I think the one and only one exception where an SVM uapi like in kfd
makes sense, is if the _hardware_ itself, not the software stack
defined semantics that you've happened to build on top of that hw,
enforces a 1:1 mapping with the cpu process address space.

Which means your hardware is using PASID, IOMMU based translation,
PCI-ATS (address translation services) or whatever your hw calls it and
has _no_ device-side pagetables on top. Which from what I've seen all
devices with device-memory have, simply because they need some place to
store whether that memory is currently in device memory or should be
translated using PASID. Currently there's no gpu that works with PASID
only, but there are some on-cpu-die accelerator things that do work
like that.

Maybe in the future there will be some accelerators that are fully cpu
cache coherent (including atomics) with something like CXL, and the
on-device memory is managed as normal system memory with struct page as
ZONE_DEVICE and accelerator va -> physical address translation is only
done with PASID ... but for now I haven't seen that, definitely not in
upstream drivers.

And the moment you have some per-device pagetables or per-device memory
management of some sort (like using gpuva mgr) then I'm 100% agreeing
with Christian that the kfd SVM model is too str

Re: Making drm_gpuvm work across gpu devices

2024-01-29 Thread Felix Kuehling




On 2024-01-29 11:28, Christian König wrote:

Am 29.01.24 um 17:24 schrieb Felix Kuehling:

On 2024-01-29 10:33, Christian König wrote:

Am 29.01.24 um 16:03 schrieb Felix Kuehling:

On 2024-01-25 13:32, Daniel Vetter wrote:

On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]
Yes most API are per device based.

One exception I know is actually the kfd SVM API. If you look at the
svm_ioctl function, it is per-process based. Each kfd_process represent
a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do
that ever again.

Need to say, kfd SVM represent a shared virtual address space across
CPU and all GPU devices on the system. This is by the definition of SVM
(shared virtual memory). This is very different from our legacy gpu
*device* driver which works for only one device (i.e., if you want one
device to access another device's memory, you will have to use dma-buf
export/import etc).

Exactly that thinking is what we have currently found as blocker for a
virtualization projects. Having SVM as device independent feature which
somehow ties to the process address space turned out to be an extremely
bad idea.

The background is that this only works for some use cases but not all
of them.

What's working much better is to just have a mirror functionality which
says that a range A..B of the process address space is mapped into a
range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for
higher level APIs and not something you need at the UAPI or even inside
the low level kernel memory management.

When you talk about migrating memory to a device you also do this on a
per device basis and *not* tied to the process address space. If you
then get crappy performance because userspace gave contradicting
information where to migrate memory then that's a bug in userspace and
not something the kernel should try to prevent somehow.

[SNIP]
I think if you start using the same drm_gpuvm for multiple devices you
will sooner or later start to run into the same mess we have seen with
KFD, where we moved more and more functionality from the KFD to the DRM
render node because we found that a lot of the stuff simply doesn't
work correctly with a single object to maintain the state.

As I understand it, KFD is designed to work across devices. A single
pseudo /dev/kfd device represent all hardware gpu devices. That is why
during kfd open, many pdd (process device data) is created, each for
one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I
see this design as a rather extreme failure. And I think it's one of
the reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending
the CPU process into the GPUs, but this idea only works for a few use
cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with
CPU address != GPU address because the VAs are actually coming from the
guest VM and not the host process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not
have any influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to
get explicit permission to do this from Dave and Daniel and maybe even
Linus.

I think the one and only one exception where an SVM uapi like in kfd
makes sense, is if the _hardware_ itself, not the software stack
defined semantics that you've happened to build on top of that hw,
enforces a 1:1 mapping with the cpu process address space.

Which means your hardware is using PASID, IOMMU based translation,
PCI-ATS (address translation services) or whatever your hw calls it and
has _no_ device-side pagetables on top. Which from what I've seen all
devices with device-memory have, simply because they need some place to
store whether that memory is currently in device memory or should be
translated using PASID. Currently there's no gpu that works with PASID
only, but there are some on-cpu-die accelerator things that do work
like that.

Maybe in the future there will be some accelerators that are fully cpu
cache coherent (including atomics) with something like CXL, and the
on-device memory is managed as normal system memory with struct page as
ZONE_DEVICE and accelerator va -> physical address translation is only
done with PASID ... but for now I haven't seen that, definitely not in
upstream drivers.

And the moment you have some per-device pagetables or per-device memory
management of some sort (like using gpuva mgr) then I'm 100% agreeing
with Christian that the kfd SVM model is too strict and not a great
idea.


That basically means, without ATS/PRI+PASID you cannot im

Re: Making drm_gpuvm work across gpu devices

2024-01-29 Thread Felix Kuehling




On 2024-01-29 10:33, Christian König wrote:

Am 29.01.24 um 16:03 schrieb Felix Kuehling:

On 2024-01-25 13:32, Daniel Vetter wrote:

On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]
Yes most API are per device based.

One exception I know is actually the kfd SVM API. If you look at the
svm_ioctl function, it is per-process based. Each kfd_process represent
a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do
that ever again.

Need to say, kfd SVM represent a shared virtual address space across
CPU and all GPU devices on the system. This is by the definition of SVM
(shared virtual memory). This is very different from our legacy gpu
*device* driver which works for only one device (i.e., if you want one
device to access another device's memory, you will have to use dma-buf
export/import etc).

Exactly that thinking is what we have currently found as blocker for a
virtualization projects. Having SVM as device independent feature which
somehow ties to the process address space turned out to be an extremely
bad idea.

The background is that this only works for some use cases but not all
of them.

What's working much better is to just have a mirror functionality which
says that a range A..B of the process address space is mapped into a
range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for
higher level APIs and not something you need at the UAPI or even inside
the low level kernel memory management.

When you talk about migrating memory to a device you also do this on a
per device basis and *not* tied to the process address space. If you
then get crappy performance because userspace gave contradicting
information where to migrate memory then that's a bug in userspace and
not something the kernel should try to prevent somehow.

[SNIP]
I think if you start using the same drm_gpuvm for multiple devices you
will sooner or later start to run into the same mess we have seen with
KFD, where we moved more and more functionality from the KFD to the DRM
render node because we found that a lot of the stuff simply doesn't
work correctly with a single object to maintain the state.

As I understand it, KFD is designed to work across devices. A single
pseudo /dev/kfd device represent all hardware gpu devices. That is why
during kfd open, many pdd (process device data) is created, each for
one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I
see this design as a rather extreme failure. And I think it's one of
the reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending
the CPU process into the GPUs, but this idea only works for a few use
cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with
CPU address != GPU address because the VAs are actually coming from the
guest VM and not the host process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not
have any influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to
get explicit permission to do this from Dave and Daniel and maybe even
Linus.

I think the one and only one exception where an SVM uapi like in kfd
makes sense, is if the _hardware_ itself, not the software stack
defined semantics that you've happened to build on top of that hw,
enforces a 1:1 mapping with the cpu process address space.

Which means your hardware is using PASID, IOMMU based translation,
PCI-ATS (address translation services) or whatever your hw calls it and
has _no_ device-side pagetables on top. Which from what I've seen all
devices with device-memory have, simply because they need some place to
store whether that memory is currently in device memory or should be
translated using PASID. Currently there's no gpu that works with PASID
only, but there are some on-cpu-die accelerator things that do work
like that.

Maybe in the future there will be some accelerators that are fully cpu
cache coherent (including atomics) with something like CXL, and the
on-device memory is managed as normal system memory with struct page as
ZONE_DEVICE and accelerator va -> physical address translation is only
done with PASID ... but for now I haven't seen that, definitely not in
upstream drivers.

And the moment you have some per-device pagetables or per-device memory
management of some sort (like using gpuva mgr) then I'm 100% agreeing
with Christian that the kfd SVM model is too strict and not a great
idea.


That basically means, without ATS/PRI+PASID you cannot implement a 
unified memory programming model, where GPUs or accelerators access 
virtual addresses wi

Re: Making drm_gpuvm work across gpu devices

2024-01-29 Thread Felix Kuehling


On 2024-01-25 13:32, Daniel Vetter wrote:

On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]
Yes, most APIs are per-device.

One exception I know of is the kfd SVM API. If you look at the svm_ioctl
function, it is per-process. Each kfd_process represents a process across
N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that
ever again.


Note that kfd SVM represents a shared virtual address space across the CPU and
all GPU devices on the system. This is by the definition of SVM (shared virtual
memory). This is very different from our legacy gpu *device* driver, which works
for only one device (i.e., if you want one device to access another device's
memory, you will have to use dma-buf export/import etc.).

Exactly that thinking is what we have currently found to be a blocker for
virtualization projects. Having SVM as a device-independent feature which
somehow ties to the process address space turned out to be an extremely bad
idea.

The background is that this only works for some use cases but not all of
them.

What's working much better is to just have a mirror functionality which says
that a range A..B of the process address space is mapped into a range C..D
of the GPU address space.

Those ranges can then be used to implement the SVM feature required for
higher level APIs and not something you need at the UAPI or even inside the
low level kernel memory management.
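
The mirror idea described above can be illustrated with a minimal
user-space sketch in C. The names `mirror_range`, `mirror_translate` and
`mirror_is_svm` are hypothetical and not taken from any driver; the sketch
only assumes that a range A..B of the process address space is mapped to a
range C..D of one GPU's address space, and that the two need not coincide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device mirror: CPU range [cpu_start, cpu_end) is
 * visible in one GPU's address space starting at gpu_start. */
struct mirror_range {
	uint64_t cpu_start;	/* A */
	uint64_t cpu_end;	/* B, exclusive */
	uint64_t gpu_start;	/* C; D is implied by the range size */
};

/* Translate a CPU virtual address into this device's address space.
 * Returns false if the address lies outside the mirrored range. */
static bool mirror_translate(const struct mirror_range *m,
			     uint64_t cpu_va, uint64_t *gpu_va)
{
	if (cpu_va < m->cpu_start || cpu_va >= m->cpu_end)
		return false;
	*gpu_va = m->gpu_start + (cpu_va - m->cpu_start);
	return true;
}

/* SVM is the special case where the GPU range equals the CPU range. */
static bool mirror_is_svm(const struct mirror_range *m)
{
	return m->gpu_start == m->cpu_start;
}
```

In this picture a higher-level SVM feature is just a set of mirrors with
identical CPU and GPU ranges, while the virtualization case (CPU address !=
GPU address) uses the same structure with different ranges.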

When you talk about migrating memory to a device you also do this on a per
device basis and *not* tied to the process address space. If you then get
crappy performance because userspace gave contradicting information where to
migrate memory then that's a bug in userspace and not something the kernel
should try to prevent somehow.

[SNIP]

I think if you start using the same drm_gpuvm for multiple devices you
will sooner or later start to run into the same mess we have seen with
KFD, where we moved more and more functionality from the KFD to the DRM
render node because we found that a lot of the stuff simply doesn't work
correctly with a single object to maintain the state.

As I understand it, KFD is designed to work across devices. A single pseudo
/dev/kfd device represents all hardware gpu devices. That is why, during kfd
open, many pdds (process device data) are created, one per hardware device
for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see
this design as a rather extreme failure. And I think it's one of the reasons
why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the
CPU process into the GPUs, but this idea only works for a few use cases and
is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU
address != GPU address because the VAs are actually coming from the guest VM
and not the host process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have
any influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to get
explicit permission to do this from Dave and Daniel and maybe even Linus.

I think the one and only one exception where an SVM uapi like in kfd makes
sense, is if the _hardware_ itself, not the software stack defined
semantics that you've happened to build on top of that hw, enforces a 1:1
mapping with the cpu process address space.

Which means your hardware is using PASID, IOMMU based translation, PCI-ATS
(address translation services) or whatever your hw calls it and has _no_
device-side pagetables on top. Which from what I've seen all devices with
device-memory have, simply because they need some place to store whether
that memory is currently in device memory or should be translated using
PASID. Currently there's no gpu that works with PASID only, but there are
some on-cpu-die accelerator things that do work like that.

Maybe in the future there will be some accelerators that are fully cpu
cache coherent (including atomics) with something like CXL, and the
on-device memory is managed as normal system memory with struct page as
ZONE_DEVICE and accelerator va -> physical address translation is only
done with PASID ... but for now I haven't seen that, definitely not in
upstream drivers.

And the moment you have some per-device pagetables or per-device memory
management of some sort (like using gpuva mgr) then I'm 100% agreeing with
Christian that the kfd SVM model is too strict and not a great idea.


That basically means, without ATS/PRI+PASID you cannot implement a 
unified memory programming model, where GPUs or accelerators access 
virtual addresses without pre-registering them with an SVM API call.


Unified memory is a feature implemented by the KFD SVM API and used by 
ROCm. This is used e.g. to implement OpenMP USM (unified shared memory). 
It'

Re: Making drm_gpuvm work across gpu devices

2024-01-25 Thread Felix Kuehling




On 2024-01-24 20:17, Zeng, Oak wrote:


Hi Christian,

Even though I mentioned the KFD design, I didn’t mean to copy the KFD 
design. I also had a hard time understanding the difficulty of KFD in a 
virtualization environment.


The problem with virtualization is related to virtualization design 
choices. There is a single process that proxies requests from multiple 
processes in one (or more?) VMs to the GPU driver. That means, we need a 
single process with multiple contexts (and address spaces). One proxy 
process on the host must support multiple guest address spaces.


I don't know much more than these very high level requirements, and I 
only found out about those a few weeks ago. Due to my own bias I can't 
comment whether there are bad design choices in the proxy architecture 
or in KFD or both. The way we are considering fixing this, is to enable 
creating multiple KFD contexts in the same process. Each of those 
contexts will still represent a shared virtual address space across 
devices (but not the CPU). Because the device address space is not 
shared with the CPU, we cannot support our SVM API in this situation.


I still believe that it makes sense to have the kernel mode driver aware 
of a shared virtual address space at some level. A per-GPU API and an 
API that doesn't require matching CPU and GPU virtual addresses would 
enable more flexibility at the cost of duplicate information tracking for 
multiple devices and duplicate overhead for things like MMU notifiers 
and interval tree data structures. Having to coordinate multiple devices 
with potentially different address spaces would probably make it more 
awkward to implement memory migration. The added flexibility would go 
mostly unused, except in some very niche applications.


Regards,
  Felix


For us, Xekmd doesn't need to know whether it is running on bare metal or 
in a virtualized environment. Xekmd is always a guest driver. All the 
virtual addresses used in xekmd are guest virtual addresses. For SVM, we 
require all the VF devices to share one single address space with the 
guest CPU program. So any design that works in a bare metal environment 
automatically works in a virtualized environment. +@Shah, Ankur N 
<mailto:ankur.n.s...@intel.com> +@Winiarski, Michal 
<mailto:michal.winiar...@intel.com> to back me up if I am wrong.


Again, a shared virtual address space b/t cpu and all gpu devices is a 
hard requirement for our system allocator design (which means 
malloc’ed memory, cpu stack variables and globals can be used directly in 
a gpu program; same requirement as the kfd SVM design). This was aligned 
with our user space software stack.


For anyone who wants to implement a system allocator, or SVM, this is a 
hard requirement. I started this thread hoping I could leverage the 
drm_gpuvm design to manage the shared virtual address space (as the 
address range split/merge function was scary to me and I didn’t want to 
re-invent it). I guess my takeaway from you and Danilo is that this 
approach is a NAK. Thomas also mentioned to me that drm_gpuvm is overkill 
for our svm address range split/merge. So I will make things work 
first by managing address ranges internally in xekmd. I can revisit the 
drm_gpuvm approach in the future.


Maybe a pseudo user program can illustrate our programming model:

fd0 = open(card0)
fd1 = open(card1)

vm0 = xe_vm_create(fd0) //driver creates the process's xe_svm on the process's first vm_create
vm1 = xe_vm_create(fd1) //driver re-uses the xe_svm created above if called from the same process

queue0 = xe_exec_queue_create(fd0, vm0)
queue1 = xe_exec_queue_create(fd1, vm1)

//check p2p capability calling L0 API….

ptr = malloc() //this replaces bo_create, vm_bind, dma-import/export

xe_exec(queue0, ptr) //submit gpu job which uses ptr, on card0
xe_exec(queue1, ptr) //submit gpu job which uses ptr, on card1

//Gpu page fault handles memory allocation/migration/mapping to gpu

As you can see from the above model, our design is a little different 
from the KFD design. The user needs to explicitly create a gpuvm (vm0 and 
vm1 above) for each gpu device. The driver internally has an xe_svm that 
represents the shared address space b/t cpu and multiple gpu devices, but 
the end user doesn’t see it and doesn’t need to create it. The shared 
virtual address space is really managed by linux core mm (through the vma 
struct, mm_struct etc). From each gpu device’s perspective, it just 
operates under its own gpuvm, not aware of the existence of other 
gpuvms, even though in reality all those gpuvms share the same virtual 
address space.


See one more comment inline

*From:*Christian König 
*Sent:* Wednesday, January 24, 2024 3:33 AM
*To:* Zeng, Oak ; Danilo Krummrich 
; Dave Airlie ; Daniel Vetter 
; Felix Kuehling 
*Cc:* Welty, Brian ; 
dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; 
Bommu, Krishnaiah ; Ghimiray, Himal Prasad 
; thomas.hellst...@linux.intel.com; 
Vishwanathapura, Niranjana ; 
Brost, Matthew ; Gupta, saurabhg 


*Subject:*

Re: [bug report] drm/amdkfd: Export DMABufs from KFD using GEM handles

2024-01-23 Thread Felix Kuehling


On 2024-01-23 5:21, Dan Carpenter wrote:

Hello Felix Kuehling,

The patch 1819200166ce: "drm/amdkfd: Export DMABufs from KFD using
GEM handles" from Aug 24, 2023 (linux-next), leads to the following
Smatch static checker warning:

drivers/dma-buf/dma-buf.c:729 dma_buf_get()
warn: fd used after fd_install() 'fd'

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
809  static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
810  {
811          if (!mem->dmabuf) {
812                  struct amdgpu_device *bo_adev;
813                  struct dma_buf *dmabuf;
814                  int r, fd;
815
816                  bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
817                  r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file,
818                                                 mem->gem_handle,
819                          mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
820                                                 DRM_RDWR : 0, &fd);
                                                                  ^^^
The drm_gem_prime_handle_to_fd() function does an fd_install() and
returns the result as "fd".

821                  if (r)
822                          return r;
823                  dmabuf = dma_buf_get(fd);
                                          ^^
Then we do another fget() inside dma_buf_get().  I'm not an expert,
but this looks wrong.  We can't assume that the dmabuf here is the
same one from drm_gem_prime_handle_to_fd() because the user could
change it after the fd_install().  I suspect drm_gem_prime_handle_to_fd()
should pass the dmabuf back instead.

We had several CVEs similar to this such as CVE-2022-1998.


That CVE is a system crash. I don't think that can happen here. It's 
true that user mode can close the fd. But then dma_buf_get would return 
an error that we'd catch with "WARN_ON_ONCE(IS_ERR(dmabuf))" below.


It's possible that a different DMABuf gets bound to the fd by a 
malicious user mode. That said, we're treating this fd as if it had come 
from user mode. dma_buf_get and the subsequent check on the dmabuf 
should be robust against any user mode messing with the file descriptor 
in the meantime.


Regards,
  Felix




824                  close_fd(fd);
825                  if (WARN_ON_ONCE(IS_ERR(dmabuf)))
826                          return PTR_ERR(dmabuf);
827                  mem->dmabuf = dmabuf;
828          }
829
830          return 0;
831  }

regards,
dan carpenter
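
The two outcomes discussed in this report (the fd gets closed, or the fd
number gets rebound to a different object before the second lookup) can be
reproduced with a rough user-space analogy. This is only a sketch: it uses
plain pipes and POSIX `fstat`/`dup2` rather than dma-bufs or kernel APIs,
and the helper names `fd_matches`, `demo_rebind` and `demo_close` are made
up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if fd still names the file whose identity was saved in *id. */
static bool fd_matches(int fd, const struct stat *id)
{
	struct stat now;

	if (fstat(fd, &now) != 0)
		return false;	/* fd was closed in the meantime */
	return now.st_dev == id->st_dev && now.st_ino == id->st_ino;
}

/* The fd number is rebound to another object between "install" and the
 * second lookup. Returns true if the second lookup sees a different file. */
static bool demo_rebind(void)
{
	int a[2], b[2];
	struct stat id;
	bool changed = false;

	if (pipe(a) != 0)
		return false;
	if (pipe(b) == 0) {
		if (fstat(a[0], &id) == 0) {
			dup2(b[0], a[0]);	/* what another thread could do */
			changed = !fd_matches(a[0], &id);
		}
		close(b[0]);
		close(b[1]);
	}
	close(a[0]);
	close(a[1]);
	return changed;
}

/* The fd is closed before the second lookup, so the lookup fails. */
static bool demo_close(void)
{
	int a[2];
	struct stat id;
	bool gone = false;

	if (pipe(a) != 0)
		return false;
	if (fstat(a[0], &id) == 0) {
		close(a[0]);
		gone = !fd_matches(a[0], &id);
	} else {
		close(a[0]);
	}
	close(a[1]);
	return gone;
}
```

The point of the analogy is that a lookup by fd number can only be trusted
as much as any fd handed in from user mode, which is why the result of the
second lookup has to be validated rather than assumed.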

Re: Making drm_gpuvm work across gpu devices

2024-01-23 Thread Felix Kuehling


On 2024-01-23 14:37, Zeng, Oak wrote:

Thanks Christian. I have some comment inline below.

Danilo, can you also take a look and give your feedback? Thanks.


Sorry, just catching up with this thread now. I'm also not familiar with 
drm_gpuvm.


Some general observations based on my experience with KFD, amdgpu and 
SVM. With SVM we have a single virtual address space managed in user 
mode (basically using mmap) with attributes per virtual address range 
maintained in the kernel mode driver. Different devices get their 
mappings of pieces of that address space using the same virtual 
addresses. We also support migration to different DEVICE_PRIVATE memory 
spaces.


However, we still have page tables managed per device. Each device can 
have different page table formats and layout (e.g. different GPU 
generations in the same system) and the same memory may be mapped with 
different flags on different devices in order to get the right coherence 
behaviour. We also need to maintain per-device DMA mappings somewhere. 
That means, as far as the device page tables are concerned, we still 
have separate address spaces. SVM only adds a layer on top, which 
coordinates these separate device virtual address spaces so that some 
parts of them provide the appearance of a shared virtual address space.
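
A small user-space sketch in C may help picture this layering. It is not
KFD code: `shared_range`, `device_mapping`, `range_map` and `NUM_DEVICES`
are hypothetical names, and the flag values are arbitrary. The only thing
it assumes from the text above is that one range of the shared address
space carries separate per-device mapping state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_DEVICES 2	/* hypothetical: two GPUs in the system */

/* Hypothetical per-device state for one range of the address space. */
struct device_mapping {
	bool mapped;
	uint32_t pte_flags;	/* device-specific format and coherence bits */
};

/* One range of the shared virtual address space. Every device that maps
 * it uses the same virtual addresses but keeps its own mapping state. */
struct shared_range {
	uint64_t start;
	uint64_t last;		/* inclusive */
	struct device_mapping dev[NUM_DEVICES];
};

/* Record that device 'dev' maps the range with the given flags. */
static bool range_map(struct shared_range *r, size_t dev, uint32_t pte_flags)
{
	if (dev >= NUM_DEVICES)
		return false;
	r->dev[dev].mapped = true;
	r->dev[dev].pte_flags = pte_flags;
	return true;
}

/* Count the devices that currently map the range. */
static size_t range_mapped_count(const struct shared_range *r)
{
	size_t i, n = 0;

	for (i = 0; i < NUM_DEVICES; i++)
		if (r->dev[i].mapped)
			n++;
	return n;
}
```

The shared part (start, last) is what gives the appearance of one address
space; everything in `dev[]` is still per-device.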


At some point you need to decide, where you draw the boundary between 
managing a per-process shared virtual address space and managing 
per-device virtual address spaces. In amdgpu that boundary is currently 
where kfd_svm code calls amdgpu_vm code to manage the per-device page 
tables.


In the amdgpu driver, we still have the traditional memory management 
APIs in the render nodes that don't do SVM. They share the device 
virtual address spaces with SVM. We have to be careful that we don't try 
to manage the same device virtual address ranges with these two 
different memory managers. In practice, we let the non-SVM APIs use the 
upper half of the canonical address space, while the lower half can be 
used almost entirely for SVM.
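
As a rough illustration of that split, the following C sketch classifies a
virtual address by which half of a canonical address space it falls into.
It assumes a 48-bit canonical layout purely for the sake of the example;
the actual address width, the exact boundaries and the names `va_owner`
and `classify_va` are assumptions, not amdgpu definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 48-bit canonical layout (illustrative only). */
#define LOWER_HALF_LAST		0x00007fffffffffffULL
#define UPPER_HALF_FIRST	0xffff800000000000ULL

enum va_owner {
	VA_OWNER_SVM,		/* lower half: shared with the CPU process */
	VA_OWNER_NON_SVM,	/* upper half: traditional per-device APIs */
	VA_OWNER_INVALID,	/* non-canonical hole in between */
};

/* Decide which memory manager is responsible for a device VA. */
static enum va_owner classify_va(uint64_t va)
{
	if (va <= LOWER_HALF_LAST)
		return VA_OWNER_SVM;
	if (va >= UPPER_HALF_FIRST)
		return VA_OWNER_NON_SVM;
	return VA_OWNER_INVALID;
}
```

Because CPU user-space addresses only ever live in the lower half, keeping
the non-SVM allocations in the upper half means the two managers should not
compete for the same ranges.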


Regards,
  Felix





-Original Message-
From: Christian König 
Sent: Tuesday, January 23, 2024 6:13 AM
To: Zeng, Oak ; Danilo Krummrich ;
Dave Airlie ; Daniel Vetter 
Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
intel-
x...@lists.freedesktop.org; Bommu, Krishnaiah ;
Ghimiray, Himal Prasad ;
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
; Brost, Matthew

Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,

Am 23.01.24 um 04:21 schrieb Zeng, Oak:

Hi Danilo and all,

During our work on Intel's SVM code, we came up with the idea of making
drm_gpuvm work across multiple gpu devices. See some discussion here:
https://lore.kernel.org/dri-devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd11.prod.outlook.com/

The reason we are trying to do this is that, for an SVM (shared virtual
memory across the cpu program and all gpu programs on all gpu devices)
process, the address space has to span all gpu devices. So if we make
drm_gpuvm work across devices, then our SVM code can leverage drm_gpuvm as
well.

At first look, it seems feasible because drm_gpuvm doesn't really use the
drm_device *drm pointer a lot. This param is used only for printing/warning.
So I think maybe we can delete this drm field from drm_gpuvm.

This way, on a system with multiple gpu devices, one process can have only
one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
each gpu device).

What do you think?

Well from the GPUVM side I don't think it would make much difference if
we have the drm device or not.

But given the experience we had with the KFD, I think I should mention that
we should absolutely *not* deal with multiple devices at the same time in
the UAPI or VM objects inside the driver.

The background is that all the APIs inside the Linux kernel are built
around the idea that they work with only one device at a time. This
holds for both low level APIs like the DMA API as well as pretty high
level things like for example file system address space etc...

Yes, most APIs are per-device.

One exception I know of is the kfd SVM API. If you look at the svm_ioctl
function, it is per-process. Each kfd_process represents a process across
N gpu devices. Cc Felix.

Note that kfd SVM represents a shared virtual address space across the CPU and
all GPU devices on the system. This is by the definition of SVM (shared virtual
memory). This is very different from our legacy gpu *device* driver, which works
for only one device (i.e., if you want one device to access another device's
memory, you will have to use dma-buf export/import etc.).

We have the same design requirement for SVM. For anyone who wants to implement
the SVM concept, this is a hard requirement. Since drm now has the drm_gpuvm
concept, which strictly speaking is designed for one device, I want to see
whether we can extend drm_gpuvm to make it work for both single device (as used
in

Re: [pull] amdgpu, amdkfd drm-fixes-6.8

2024-01-15 Thread Felix Kuehling




On 2024-01-15 17:08, Alex Deucher wrote:

Hi Dave, Sima,

Same PR as Friday, but with the new clang warning fixed.

The following changes since commit e54478fbdad20f2c58d0a4f99d01299ed8e7fe9c:

   Merge tag 'amd-drm-next-6.8-2024-01-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-next (2024-01-09 09:07:50 +1000)

are available in the Git repository at:

   https://gitlab.freedesktop.org/agd5f/linux.git tags/amd-drm-fixes-6.8-2024-01-15

for you to fetch changes up to 86718cf93237a0f45773bbc49b6006733fc4e051:

   drm/amd/display: Avoid enum conversion warning (2024-01-15 16:28:05 -0500)


amd-drm-fixes-6.8-2024-01-15:

amdgpu:
- SubVP fixes
- VRR fixes
- USB4 fixes
- DCN 3.5 fixes
- GFX11 harvesting fix
- RAS fixes
- Misc small fixes
- KFD dma-buf import fixes
- Power reporting fixes
- ATHUB 3.3 fix
- SR-IOV fix
- Add missing fw release for fiji
- GFX 11.5 fix
- Debugging module parameter fix
- SMU 13.0.6 fixes
- Fix new clang warning

amdkfd:
- Fix lockdep warnings
- Fix sparse __rcu warnings
- Bump interface version so userspace knows that the kernel supports dma-bufs exported from KFD
   Most of the fixes for this went into 6.7, but the last fix is in this PR
- HMM fix
- SVM fix


Alex Deucher (4):
   drm/amdgpu: fix avg vs input power reporting on smu7
   drm/amdgpu: fall back to INPUT power for AVG power via INFO IOCTL
   drm/amdgpu/pm: clarify debugfs pm output
   drm/amdgpu: drop exp hw support check for GC 9.4.3

Aric Cyr (1):
   drm/amd/display: 3.2.266

Candice Li (2):
   drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
   drm/amdgpu: Support poison error injection via ras_ctrl debugfs

Charlene Liu (1):
   drm/amd/display: Update z8 latency

Dafna Hirschfeld (1):
   drm/amdkfd: fixes for HMM mem allocation

Daniel Miess (1):
   Revert "drm/amd/display: Fix conversions between bytes and KB"

Felix Kuehling (4):
   drm/amdkfd: Fix lock dependency warning
   drm/amdkfd: Fix sparse __rcu annotation warnings
   drm/amdgpu: Auto-validate DMABuf imports in compute VMs
   drm/amdkfd: Bump KFD ioctl version


I don't think the last two patches should be on -fixes. This is really 
for a new interoperability feature and not meant to fix other 
functionality that was already expected to work. That's why the user 
mode code checks for the API version.


Regards,
  Felix




George Shen (1):
   drm/amd/display: Disconnect phantom pipe OPP from OPTC being disabled

Hawking Zhang (1):
   drm/amdgpu: Packed socket_id to ras feature mask

Ivan Lipski (1):
   Revert "drm/amd/display: fix bandwidth validation failure on DCN 2.1"

James Zhu (1):
   drm/amdgpu: make a correction on comment

Le Ma (3):
   Revert "drm/amdgpu: add param to specify fw bo location for front-door loading"
   drm/amdgpu: add debug flag to place fw bo on vram for frontdoor loading
   drm/amdgpu: move debug options init prior to amdgpu device init

Lijo Lazar (2):
   drm/amd/pm: Add error log for smu v13.0.6 reset
   drm/amd/pm: Fix smuv13.0.6 current clock reporting

Likun Gao (1):
   drm/amdgpu: correct the cu count for gfx v11

Martin Leung (2):
   drm/amd/display: revert "for FPO & SubVP/DRR config program vmin/max"
   drm/amd/display: revert "Optimize VRR updates to only necessary ones"

Martin Tsai (1):
   drm/amd/display: To adjust dprefclk by down spread percentage

Meenakshikumar Somasundaram (1):
   drm/amd/display: Dpia hpd status not in sync after S4

Melissa Wen (1):
   drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color

Nathan Chancellor (1):
   drm/amd/display: Avoid enum conversion warning

Peichen Huang (1):
   drm/amd/display: Request usb4 bw for mst streams

Philip Yang (1):
   drm/amdkfd: Fix lock dependency warning with srcu

Srinivasan Shanmugam (6):
   drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 'get_platform_power_management_table()'
   drm/amdgpu: Fix with right return code '-EIO' in 'amdgpu_gmc_vram_checking()'
   drm/amdgpu: Fix unsigned comparison with less than zero in vpe_u1_8_from_fraction()
   drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()'
   drm/amd/display: Fix variable deferencing before NULL check in edp_setup_replay()
   drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()'

Victor Lu (1):
   drm/amdgpu: Do not program VM_L2_CNTL under SRIOV

Yifan Zhang (3):
   drm/amdgpu: update headers for nbio v7.11
   drm/amdgpu: update ATHUB_MISC_CNTL offset for athub v3.3
   drm/amdgpu: update regGL2C_CTRL4 value in golden sett

Re: Proposal to add CRIU support to DRM render nodes

2024-01-15 Thread Felix Kuehling

I haven't seen any replies to this proposal. Either it got lost in the 
pre-holiday noise, or there is genuinely no interest in this.


If it's the latter, I would look for an AMDGPU driver-specific solution 
with minimally invasive changes in DRM and DMABuf code, if needed. Maybe 
it could be generalized later if there is interest then.


Regards,
  Felix


On 2023-12-06 16:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm applications once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background why we are doing this, and outlining 
some of the problems we need to solve to checkpoint and restore render 
node state and shared memory (DMABuf) state. I have some thoughts on 
the API design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address mappings 
in ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, 
as well as between processes. In the long run this also provides a 
path to moving all or most memory management from the KFD ioctl API to 
libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU virtual 
address mappings. Support will be needed for checkpointing GEM buffer 
objects and handles, their GPU virtual address mappings and memory 
sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may 
have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic or AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can
allocate memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this
time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)
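
The ordering of the phases listed above can be sketched as a small state
machine in C. This is only an illustration: the `criu_op` and `criu_state`
names are hypothetical, and the strict ordering enforced here (for example
that Unpause is only valid after Checkpoint) is an assumption made for the
sketch, not a description of the KFD ioctl.

```c
#include <assert.h>

/* Hypothetical names for the five phases listed above. */
enum criu_op {
	OP_PROCESS_INFO,
	OP_CHECKPOINT,
	OP_UNPAUSE,
	OP_RESTORE,
	OP_RESUME,
};

enum criu_state {
	ST_RUNNING,		/* normal execution */
	ST_PAUSED,		/* GPU execution stopped by Process-info */
	ST_CHECKPOINTED,	/* object metadata has been stored */
	ST_NEW,			/* freshly created process, nothing restored */
	ST_RESTORED,		/* objects restored, VMAs not yet fixed up */
	ST_INVALID,		/* operation not allowed in this state */
};

/* Return the state after applying 'op', or ST_INVALID if out of order. */
static enum criu_state criu_step(enum criu_state s, enum criu_op op)
{
	switch (op) {
	case OP_PROCESS_INFO:
		return s == ST_RUNNING ? ST_PAUSED : ST_INVALID;
	case OP_CHECKPOINT:
		return s == ST_PAUSED ? ST_CHECKPOINTED : ST_INVALID;
	case OP_UNPAUSE:
		return s == ST_CHECKPOINTED ? ST_RUNNING : ST_INVALID;
	case OP_RESTORE:
		return s == ST_NEW ? ST_RESTORED : ST_INVALID;
	case OP_RESUME:
		return s == ST_RESTORED ? ST_RUNNING : ST_INVALID;
	}
	return ST_INVALID;
}
```

Both sequences end in the running state: checkpoint goes Process-info,
Checkpoint, Unpause on the original process, and restore goes Restore,
Resume on the new one.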

For some more background about our implementation in KFD, you can 
refer to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md


Potential objections to a KFD-style CRIU API in DRM render nodes are 
listed here; I'll address each of them in more detail below:


  * Opaque information in the checkpoint data that user mode can't
interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and import
from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to parse that 
information. Therefore, much of the information in our KFD CRIU ioctl 
API is opaque. It is written by kernel mode in the checkpoint, it is 
consumed by kernel mode when restoring the checkpoint, but user mode 
doesn't care about the contents or binary layout, so there is no user 
mode ABI to break. This is how we were able to maintain CRIU support 
when we added the SVM API to KFD without changing the CRIU plugin and 
without breaking our ABI.
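
The opaque-data idea can be pictured with a user-space sketch in C. All
names here (`ckpt_record`, `priv_state`, `kernel_checkpoint`,
`plugin_copy`, `kernel_restore`) are hypothetical and the layout is not
the KFD ioctl ABI; the sketch only assumes that user mode sees a type and
a size, and carries the payload through without parsing it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: user mode reads only the two header fields. */
struct ckpt_record {
	uint32_t object_type;
	uint32_t opaque_size;
	unsigned char opaque[];	/* layout private to the "kernel" side */
};

/* Hypothetical driver-private state; fields can be added over time
 * without user mode noticing, because it never looks inside. */
struct priv_state {
	uint64_t gpu_va;
	uint32_t flags;
	uint32_t svm_attr;
};

/* "Kernel" side: serialize private state into a record. */
static struct ckpt_record *kernel_checkpoint(uint32_t type,
					     const void *priv, uint32_t size)
{
	struct ckpt_record *rec = malloc(sizeof(*rec) + size);

	if (!rec)
		return NULL;
	rec->object_type = type;
	rec->opaque_size = size;
	memcpy(rec->opaque, priv, size);
	return rec;
}

/* "User mode" side: store and reload the record byte for byte. */
static struct ckpt_record *plugin_copy(const struct ckpt_record *rec)
{
	size_t total = sizeof(*rec) + rec->opaque_size;
	struct ckpt_record *copy = malloc(total);

	if (copy)
		memcpy(copy, rec, total);
	return copy;
}

/* "Kernel" side: consume the payload again; returns 0 on success. */
static int kernel_restore(const struct ckpt_record *rec,
			  void *priv, uint32_t size)
{
	if (rec->opaque_size != size)
		return -1;
	memcpy(priv, rec->opaque, size);
	return 0;
}
```

Since `plugin_copy` depends only on `opaque_size`, changing `priv_state`
changes nothing on the user-mode side, which is the property the paragraph
above relies on.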


Opaque information may also lend itself to API abstraction, if this 
becomes a generic DRM API with driver-specific callbacks that fill in 
HW-specific opaque data.


# Second API for creating objects

Creating BOs and other objects when restoring a checkpoint needs more 
information than the usual BO alloc and similar APIs provide. For 
example, we need to restore BOs with the same GEM handles so that user 
mode can continue using those handles after resuming execution. If BOs 
are shared through DMABufs without dynamic a

Re: [PATCH v2] drm/amdkfd: fixes for HMM mem allocation

2024-01-08 Thread Felix Kuehling




On 2024-01-07 08:07, Dafna Hirschfeld wrote:

Fix err return value and reset pgmap->type after checking it.

Fixes: c83dee9b6394 ("drm/amdkfd: add SPM support for SVM")
Reviewed-by: Felix Kuehling 
Signed-off-by: Dafna Hirschfeld 
---
v2: remove unrelated DOC fix and add 'Fixes' tag.


Thank you. I'm going to apply this patch to amd-staging-drm-next.

Regards,
  Felix




  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 6c25dab051d5..b8680e0753ca 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -1021,7 +1021,7 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
 	} else {
 		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
 		if (IS_ERR(res))
-			return -ENOMEM;
+			return PTR_ERR(res);
 		pgmap->range.start = res->start;
 		pgmap->range.end = res->end;
 		pgmap->type = MEMORY_DEVICE_PRIVATE;
@@ -1037,10 +1037,10 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
 	r = devm_memremap_pages(adev->dev, pgmap);
 	if (IS_ERR(r)) {
 		pr_err("failed to register HMM device memory\n");
-		/* Disable SVM support capability */
-		pgmap->type = 0;
 		if (pgmap->type == MEMORY_DEVICE_PRIVATE)
 			devm_release_mem_region(adev->dev, res->start, resource_size(res));
+		/* Disable SVM support capability */
+		pgmap->type = 0;
 		return PTR_ERR(r);
 	}
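
The two fixes in this patch can be illustrated with a user-space sketch in
C. The `ERR_PTR`, `PTR_ERR` and `IS_ERR` helpers below are simplified
stand-ins written for this example, not the kernel's own definitions, and
`fake_request_region`, `init_old`, `init_new`, `cleanup_old`,
`cleanup_new` and `TYPE_PRIVATE` are hypothetical names that only model
the control flow of the hunks above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline bool IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

#define TYPE_PRIVATE 1	/* stand-in for MEMORY_DEVICE_PRIVATE */

/* Pretend allocator: fails with 'err' if it is non-zero. */
static void *fake_request_region(long err)
{
	static int region;

	return err ? ERR_PTR(err) : (void *)&region;
}

/* Before the patch: every failure is reported as -ENOMEM. */
static long init_old(long err)
{
	void *res = fake_request_region(err);

	if (IS_ERR(res))
		return -ENOMEM;
	return 0;
}

/* After the patch: the real error code is passed up to the caller. */
static long init_new(long err)
{
	void *res = fake_request_region(err);

	if (IS_ERR(res))
		return PTR_ERR(res);
	return 0;
}

/* Before the patch: type is cleared first, so the check never matches
 * and the region is never released. Returns whether it was released. */
static bool cleanup_old(int type)
{
	type = 0;
	return type == TYPE_PRIVATE;
}

/* After the patch: check first, then clear the type. */
static bool cleanup_new(int type)
{
	bool released = (type == TYPE_PRIVATE);

	type = 0;
	(void)type;
	return released;
}
```

In this model a caller of `init_new` can tell a busy region apart from an
out-of-memory condition, and `cleanup_new` actually releases the region on
the error path.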

Re: [PATCH] drm/amdkfd: fixes for HMM mem allocation

2024-01-02 Thread Felix Kuehling




On 2023-12-31 09:37, Dafna Hirschfeld wrote:

A few fixes to amdkfd and the doc of
devm_request_free_mem_region.

Signed-off-by: Dafna Hirschfeld 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 6 +++---
  kernel/resource.c| 2 +-
  2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 6c25dab051d5..b8680e0753ca 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -1021,7 +1021,7 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
 	} else {
 		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
 		if (IS_ERR(res))
-			return -ENOMEM;
+			return PTR_ERR(res);
 		pgmap->range.start = res->start;
 		pgmap->range.end = res->end;
 		pgmap->type = MEMORY_DEVICE_PRIVATE;
@@ -1037,10 +1037,10 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
 	r = devm_memremap_pages(adev->dev, pgmap);
 	if (IS_ERR(r)) {
 		pr_err("failed to register HMM device memory\n");
-		/* Disable SVM support capability */
-		pgmap->type = 0;
 		if (pgmap->type == MEMORY_DEVICE_PRIVATE)
 			devm_release_mem_region(adev->dev, res->start, resource_size(res));
+		/* Disable SVM support capability */
+		pgmap->type = 0;


Ooff, thanks for catching that. For the KFD driver changes you can add

Fixes: c83dee9b6394 ("drm/amdkfd: add SPM support for SVM")
Reviewed-by: Felix Kuehling 



 		return PTR_ERR(r);
 	}
  
diff --git a/kernel/resource.c b/kernel/resource.c

index 866ef3663a0b..fe890b874606 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1905,8 +1905,8 @@ get_free_mem_region(struct device *dev, struct resource *base,
   * devm_request_free_mem_region - find free region for device private memory
   *
   * @dev: device struct to bind the resource to
- * @size: size in bytes of the device memory to add
   * @base: resource tree to look in
+ * @size: size in bytes of the device memory to add
   *
   * This function tries to find an empty range of physical address big enough to
   * contain the new resource, so that it can later be hotplugged as ZONE_DEVICE

Re: [PATCH v3 2/2] drm/amdgpu: Enable clear page functionality

2023-12-14 Thread Felix Kuehling




On 2023-12-14 08:42, Arunpravin Paneer Selvam wrote:

Add clear page support in vram memory region.

v1:(Christian)
   - Don't handle clear page as TTM flag since when moving the BO back
 in from GTT again we don't need that.
   - Make a specialized version of amdgpu_fill_buffer() which only
 clears the VRAM areas which are not already cleared
   - Drop the TTM_PL_FLAG_WIPE_ON_RELEASE check in
 amdgpu_object.c

v2:
   - Modify the function name amdgpu_ttm_* (Alex)
   - Drop the delayed parameter (Christian)
   - handle amdgpu_res_cleared(&cursor) just above the size
 calculation (Christian)
   - Use AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE for clearing the buffers
 in the free path to properly wait for fences etc.. (Christian)

v3:(Christian)
   - Remove buffer clear code in VRAM manager instead change the
 AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE handling to set
 the DRM_BUDDY_CLEARED flag.
   - Remove ! from amdgpu_res_cleared(&cursor) check.

Signed-off-by: Arunpravin Paneer Selvam 
Suggested-by: Christian König 


Acked-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c| 22 ---
  .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h| 25 
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 61 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |  5 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  |  6 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h  |  5 ++
  6 files changed, 111 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index cef920a93924..be8bf375d823 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -39,6 +39,7 @@
  #include "amdgpu.h"
  #include "amdgpu_trace.h"
  #include "amdgpu_amdkfd.h"
+#include "amdgpu_vram_mgr.h"
  
  /**

   * DOC: amdgpu_object
@@ -598,8 +599,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
if (!amdgpu_bo_support_uswc(bo->flags))
bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
  
-	if (adev->ras_enabled)

-   bo->flags |= AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
+   bo->flags |= AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
  
  	bo->tbo.bdev = &adev->mman.bdev;

if (bp->domain & (AMDGPU_GEM_DOMAIN_GWS | AMDGPU_GEM_DOMAIN_OA |
@@ -629,15 +629,17 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
  
  	if (bp->flags & AMDGPU_GEM_CREATE_VRAM_CLEARED &&

bo->tbo.resource->mem_type == TTM_PL_VRAM) {
-   struct dma_fence *fence;
+   struct dma_fence *fence = NULL;
  
-		r = amdgpu_fill_buffer(bo, 0, bo->tbo.base.resv, &fence, true);

+   r = amdgpu_ttm_clear_buffer(bo, bo->tbo.base.resv, &fence);
if (unlikely(r))
goto fail_unreserve;
  
-		dma_resv_add_fence(bo->tbo.base.resv, fence,

-  DMA_RESV_USAGE_KERNEL);
-   dma_fence_put(fence);
+   if (fence) {
+   dma_resv_add_fence(bo->tbo.base.resv, fence,
+  DMA_RESV_USAGE_KERNEL);
+   dma_fence_put(fence);
+   }
}
if (!bp->resv)
amdgpu_bo_unreserve(bo);
@@ -1360,8 +1362,12 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object 
*bo)
if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
return;
  
-	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence, true);

+   r = amdgpu_fill_buffer(abo, 0, bo->base.resv, &fence, true);
if (!WARN_ON(r)) {
+   struct amdgpu_vram_mgr_resource *vres;
+
+   vres = to_amdgpu_vram_mgr_resource(bo->resource);
+   vres->flags |= DRM_BUDDY_CLEARED;
amdgpu_bo_fence(abo, fence, false);
dma_fence_put(fence);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
index 381101d2bf05..50fcd86e1033 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
@@ -164,4 +164,29 @@ static inline void amdgpu_res_next(struct 
amdgpu_res_cursor *cur, uint64_t size)
}
  }
  
+/**

+ * amdgpu_res_cleared - check if blocks are cleared
+ *
+ * @cur: the cursor to extract the block
+ *
+ * Check if the @cur block is cleared
+ */
+static inline bool amdgpu_res_cleared(struct amdgpu_res_cursor *cur)
+{
+   struct drm_buddy_block *block;
+
+   switch (cur->mem_type) {
+   case TTM_PL_VRAM:
+   block = cur->node;
+
+   if (!amdgpu_vram_mgr_is_cleared(block))
+   return false;
+   break;
+   default:
+   return false;

Re: [PATCH 1/2] drm: update drm_show_memory_stats() for dma-bufs

2023-12-13 Thread Felix Kuehling


On 2023-12-07 13:02, Alex Deucher wrote:

Show buffers as shared if they are shared via dma-buf as well
(e.g., shared with v4l or some other subsystem).


You can add KFD to that list. With the in-progress CUDA11 VM changes and 
improved interop between KFD and render nodes, sharing DMABufs between 
KFD and render nodes will become much more common.


Regards,
  Felix




Signed-off-by: Alex Deucher 
Cc: Rob Clark 
---
  drivers/gpu/drm/drm_file.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 5ddaffd32586..5d5f93b9c263 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -973,7 +973,7 @@ void drm_show_memory_stats(struct drm_printer *p, struct 
drm_file *file)
DRM_GEM_OBJECT_PURGEABLE;
}
  
-		if (obj->handle_count > 1) {

+   if ((obj->handle_count > 1) || obj->dma_buf) {
status.shared += obj->size;
} else {
status.private += obj->size;

Re: [PATCH 2/2] drm/amdgpu: Enable clear page functionality

2023-12-13 Thread Felix Kuehling


On 2023-12-13 9:20, Christian König wrote:

Am 12.12.23 um 00:32 schrieb Felix Kuehling:


On 2023-12-11 04:50, Christian König wrote:

Am 08.12.23 um 20:53 schrieb Alex Deucher:

[SNIP]

You also need a functionality which resets all cleared blocks to
uncleared after suspend/resume.

No idea how to do this, maybe Alex knows off hand.
Since the buffers are cleared on creation, is there actually 
anything to do?


Well exactly that's the problem, the buffers are no longer always 
cleared on creation with this patch.


Instead we clear on free, track which areas are cleared and clear 
only the ones which aren't cleared yet on creation.


The code I added for clearing-on-free a long time ago does not clear 
to 0, but to a non-0 poison value. That was meant to make it easier 
to catch applications incorrectly relying on 0-initialized memory. Is 
that being changed? I didn't see it in this patch series.


Yeah, Arun stumbled over that as well. Any objections that we fill 
with zeros instead or is that poison value something necessary for 
debugging?


I don't think it's strictly necessary. But it may encourage sloppy user 
mode programming to rely on 0-initialized memory that ends up breaking 
in corner cases or on older kernels.


That said, I see that this patch series adds clearing of memory in the 
VRAM manager, but it doesn't remove the clearing for 
AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE in amdgpu_bo_release_notify and 
amdgpu_move_blit. This will lead to duplicate work.


I'm also not sure how the clearing added in this patch series will 
affect free-latency observed in user mode. Will this be synchronous and 
cause the user mode thread to stall while the memory is being cleared?


Regards,
  Felix




Regards,
Christian.



Regards,
  Felix




So some cases need special handling. E.g. when the engine is not 
initialized yet or suspend/resume.


In theory after a suspend/resume cycle the VRAM is cleared to zeros, 
but in practice that's not always true.


Christian.


Alex

Re: [PATCH 2/2] drm/amdgpu: Enable clear page functionality

2023-12-11 Thread Felix Kuehling




On 2023-12-11 04:50, Christian König wrote:

Am 08.12.23 um 20:53 schrieb Alex Deucher:

[SNIP]

You also need a functionality which resets all cleared blocks to
uncleared after suspend/resume.

No idea how to do this, maybe Alex knows off hand.

Since the buffers are cleared on creation, is there actually anything to do?


Well exactly that's the problem, the buffers are no longer always 
cleared on creation with this patch.


Instead we clear on free, track which areas are cleared and clear only 
the ones which aren't cleared yet on creation.


The code I added for clearing-on-free a long time ago does not clear to 
0, but to a non-0 poison value. That was meant to make it easier to 
catch applications incorrectly relying on 0-initialized memory. Is that 
being changed? I didn't see it in this patch series.
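
To make the interaction concrete, here is a rough user-space sketch of the "clear on free, track which areas are cleared" scheme discussed above. It is only an illustration under my own assumptions: struct block, POISON_BYTE and the three helpers are hypothetical names, not the VRAM manager or drm_buddy code. The point it tries to show is that a block can only be tracked as cleared if the free path fills it with zeros; a poison fill forces another clear on the next zeroed allocation, and a suspend/resume cycle has to drop the tracking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  8
#define POISON_BYTE 0xA5 /* hypothetical stand-in for a non-zero poison value */

/* Toy model of one memory block plus its "already cleared" tracking bit. */
struct block {
	uint8_t data[BLOCK_SIZE];
	bool cleared;
};

/* Free path: wipe the contents. Only a zero fill may be tracked as cleared. */
void block_free(struct block *b, uint8_t fill)
{
	assert(b != NULL);
	memset(b->data, fill, BLOCK_SIZE);
	b->cleared = (fill == 0);
}

/*
 * Allocation path for a buffer that must read back as zeros.
 * Returns true if a clear had to be done here, false if it was skipped.
 * Either way the block is handed out and no longer counts as cleared.
 */
bool block_alloc_cleared(struct block *b)
{
	assert(b != NULL);
	if (b->cleared) {
		b->cleared = false;
		return false;
	}
	memset(b->data, 0, BLOCK_SIZE);
	return true;
}

/* Suspend/resume: contents can no longer be trusted, drop the tracking. */
void block_invalidate(struct block *b)
{
	assert(b != NULL);
	b->cleared = false;
}
```

In this model, freeing with a zero fill lets the next zeroed allocation skip its clear, while freeing with POISON_BYTE (or calling block_invalidate() after resume) makes the allocation path clear again.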


Regards,
  Felix




So some cases need special handling. E.g. when the engine is not 
initialized yet or suspend/resume.


In theory after a suspend/resume cycle the VRAM is cleared to zeros, 
but in practice that's not always true.


Christian.


Alex

Proposal to add CRIU support to DRM render nodes

2023-12-06 Thread Felix Kuehling

Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm applications once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background why we are doing this, and outlining some 
of the problems we need to solve to checkpoint and restore render node 
state and shared memory (DMABuf) state. I have some thoughts on the API 
design, leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent there is 
interest in making that generic.


We are working on using DRM render nodes for virtual address mappings in 
ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, as 
well as between processes. In the long run this also provides a path to 
moving all or most memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring ROCm 
applications with CRIU. Currently there is no support for checkpointing 
and restoring render node state, other than CPU virtual address 
mappings. Support will be needed for checkpointing GEM buffer objects 
and handles, their GPU virtual address mappings and memory sharing 
relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may have 
implications for other drivers in the future.


One basic question before going into any API details: Is there a desire 
to have CRIU support for other DRM drivers?


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic or AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

1. Process-info (enumerates objects and sizes so user mode can allocate
   memory for the checkpoint, stops execution on the GPU)
2. Checkpoint (store object metadata for BOs, queues, etc.)
3. Unpause (resumes execution after the checkpoint is complete)

Restore:

1. Restore (restore objects, VMAs are not in the right place at this time)
2. Resume (final fixups after the VMAs are sorted out, resume execution)
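
As a rough illustration of the phase ordering above, the following sketch models it as a small state machine. All names here (criu_op, criu_state, criu_step) are hypothetical and not an existing or proposed uAPI. Allowing Unpause directly after Process-info, to abort a checkpoint, is my assumption for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation names mirroring the phases above; not a real uAPI. */
enum criu_op {
	CRIU_OP_PROCESS_INFO,
	CRIU_OP_CHECKPOINT,
	CRIU_OP_UNPAUSE,
	CRIU_OP_RESTORE,
	CRIU_OP_RESUME,
};

enum criu_state {
	CRIU_STATE_RUNNING,  /* not in a checkpoint or restore sequence */
	CRIU_STATE_PAUSED,   /* process-info done, GPU execution stopped */
	CRIU_STATE_DUMPED,   /* object metadata has been stored */
	CRIU_STATE_RESTORED, /* objects recreated, VMAs not yet in place */
};

struct criu_ctx {
	enum criu_state state;
};

/* Advance the context if op is legal in the current state, else reject it. */
bool criu_step(struct criu_ctx *ctx, enum criu_op op)
{
	assert(ctx != NULL);

	switch (ctx->state) {
	case CRIU_STATE_RUNNING:
		if (op == CRIU_OP_PROCESS_INFO) {
			ctx->state = CRIU_STATE_PAUSED;
			return true;
		}
		if (op == CRIU_OP_RESTORE) {
			ctx->state = CRIU_STATE_RESTORED;
			return true;
		}
		return false;
	case CRIU_STATE_PAUSED:
		if (op == CRIU_OP_CHECKPOINT) {
			ctx->state = CRIU_STATE_DUMPED;
			return true;
		}
		if (op == CRIU_OP_UNPAUSE) { /* assumption: abort is allowed */
			ctx->state = CRIU_STATE_RUNNING;
			return true;
		}
		return false;
	case CRIU_STATE_DUMPED:
		if (op == CRIU_OP_UNPAUSE) {
			ctx->state = CRIU_STATE_RUNNING;
			return true;
		}
		return false;
	case CRIU_STATE_RESTORED:
		if (op == CRIU_OP_RESUME) {
			ctx->state = CRIU_STATE_RUNNING;
			return true;
		}
		return false;
	}
	return false;
}
```

A checkpoint would then be the sequence PROCESS_INFO, CHECKPOINT, UNPAUSE, and a restore in a fresh process would be RESTORE followed by RESUME; anything out of order is rejected without changing state.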

For some more background about our implementation in KFD, you can refer 
to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md


Potential objections to a KFD-style CRIU API in DRM render nodes (I'll 
address each of them in more detail below):


 * Opaque information in the checkpoint data that user mode can't
   interpret or do anything with
 * A second API for creating objects (e.g. BOs) that is separate from
   the regular BO creation API
 * Kernel mode would need to be involved in restoring BO sharing
   relationships rather than replaying BO creation, export and import
   from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to parse that information. 
Therefore, much of the information in our KFD CRIU ioctl API is opaque. 
It is written by kernel mode in the checkpoint, it is consumed by kernel 
mode when restoring the checkpoint, but user mode doesn't care about the 
contents or binary layout, so there is no user mode ABI to break. This 
is how we were able to maintain CRIU support when we added the SVM API 
to KFD without changing the CRIU plugin and without breaking our ABI.


Opaque information may also lend itself to API abstraction, if this 
becomes a generic DRM API with driver-specific callbacks that fill in 
HW-specific opaque data.
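
A minimal sketch of the opaque-data idea, under assumptions of my own: the priv_v1/priv_v2 layouts, the version field and both function names are hypothetical, not the KFD ioctl structures. "User mode" only ever copies the blob by size, so adding a field in a newer layout does not change anything the plugin has to understand. Only the "kernel" restore path interprets the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical private layouts, known only to the "kernel" side. */
struct priv_v1 { uint32_t version; uint32_t bo_flags; };
struct priv_v2 { uint32_t version; uint32_t bo_flags; uint32_t svm_attr; };

/* "User mode": stores the blob by size alone and never parses it. */
size_t plugin_copy_blob(void *dst, size_t dst_cap, const void *src, size_t size)
{
	assert(dst != NULL && src != NULL);
	if (size > dst_cap)
		return 0;
	memcpy(dst, src, size);
	return size;
}

/* "Kernel" restore path: the only place that interprets the payload. */
int kernel_restore(const void *blob, size_t size,
		   uint32_t *bo_flags, uint32_t *svm_attr)
{
	uint32_t version;

	if (size < sizeof(struct priv_v1))
		return -1;
	memcpy(&version, blob, sizeof(version));

	if (version == 1 && size == sizeof(struct priv_v1)) {
		struct priv_v1 p;

		memcpy(&p, blob, sizeof(p));
		*bo_flags = p.bo_flags;
		*svm_attr = 0; /* attribute did not exist in v1 */
		return 0;
	}
	if (version == 2 && size == sizeof(struct priv_v2)) {
		struct priv_v2 p;

		memcpy(&p, blob, sizeof(p));
		*bo_flags = p.bo_flags;
		*svm_attr = p.svm_attr;
		return 0;
	}
	return -1; /* unknown version or inconsistent size */
}
```

The design choice this tries to capture is that the compatibility burden stays between the kernel that wrote the checkpoint and the kernel that reads it; the plugin in the middle has no layout to break.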


# Second API for creating objects

Creating BOs and other objects when restoring a checkpoint needs more 
information than the usual BO alloc and similar APIs provide. For 
example, we need to restore BOs with the same GEM handles so that user 
mode can continue using those handles after resuming execution. If BOs 
are shared through DMABufs without dynamic attachment, we need to 
restore pinned BOs as pinned. Validation of virtual addresses and 
handling MMU notifiers must be suspended until the virtual address space 
is restored. For user mode queues we need to save and restore a lot of 
queue execution state so that execution can resume cleanly.
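
To illustrate why the regular creation path is not enough for the GEM handle requirement, here is a toy handle table, purely as a sketch. handle_table, handle_alloc and handle_alloc_at are hypothetical names and this is not how DRM manages its handle space internally. The normal path hands out whatever handle is free, while the restore path has to place an object at exactly the handle value recorded in the checkpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLES 16

/* Toy handle table; slot 0 stays unused so that handle 0 means "invalid". */
struct handle_table {
	const void *slot[MAX_HANDLES];
};

/* Normal creation path: returns the lowest free handle, or 0 if full. */
uint32_t handle_alloc(struct handle_table *t, const void *obj)
{
	assert(t != NULL && obj != NULL);
	for (uint32_t h = 1; h < MAX_HANDLES; h++) {
		if (!t->slot[h]) {
			t->slot[h] = obj;
			return h;
		}
	}
	return 0;
}

/* Restore path: the checkpoint dictates the handle value. */
int handle_alloc_at(struct handle_table *t, uint32_t h, const void *obj)
{
	assert(t != NULL && obj != NULL);
	if (h == 0 || h >= MAX_HANDLES || t->slot[h])
		return -1; /* out of range or already taken */
	t->slot[h] = obj;
	return 0;
}
```

Replaying ordinary allocations from user mode could only reproduce the original handle values by accident, which is one reason a separate restore-time creation path comes up at all.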


# Restoring buffer sharing relationships

Different GEM handles in different render nodes and processes can refer 
to the same underlying shared memory, either b

Re: [PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-12-04 Thread Felix Kuehling




On 2023-12-04 12:49, Deucher, Alexander wrote:



-Original Message-
From: Kuehling, Felix 
Sent: Friday, December 1, 2023 6:40 PM
To: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Deucher,
Alexander 
Cc: Daniel Vetter ; Koenig, Christian
; Thomas Zimmermann

Subject: Re: [PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle
conversion"

Hi Alex,

I'm about to push patches 1-3 to the rebased amd-staging-drm-next. It would
be good to get patch 1 into drm-fixes so that Linux 6.6 will be the only kernel
without these prime helpers. That would minimize the hassle for DKMS driver
installations on future distros.

Already done:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0514f63cfff38a0dcb7ba9c5f245827edc0c5107


Thank you, all! I also saw Sasha Levin is backporting it to 6.6.

Cheers,
  Felix




Alex


Thanks,
Felix


On 2023-12-01 18:34, Felix Kuehling wrote:

This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated
with GEM objects while ensuring that move notifier callbacks are
working as intended.

Acked-by: Christian König 
Acked-by: Thomas Zimmermann 
Acked-by: Daniel Vetter 
Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/drm_prime.c | 33 ++---
   include/drm/drm_prime.h |  7 +++
   2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf

*dma_buf)

   }
   EXPORT_SYMBOL(drm_gem_dmabuf_release);

-/*
+/**
* drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
* @dev: drm_device to import into
* @file_priv: drm file-private structure @@ -292,9 +292,9 @@
EXPORT_SYMBOL(drm_gem_dmabuf_release);
*
* Returns 0 on success or a negative error code on failure.
*/
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
- struct drm_file *file_priv, int prime_fd,
- uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+  struct drm_file *file_priv, int prime_fd,
+  uint32_t *handle)
   {
 struct dma_buf *dma_buf;
 struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct

drm_device *dev,

 dma_buf_put(dma_buf);
 return ret;
   }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);

   int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,
  struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf

*export_and_register_object(struct drm_device *dev,

 return dmabuf;
   }

-/*
+/**
* drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
* @dev: dev to export the buffer from
* @file_priv: drm file-private structure @@ -421,10 +422,10 @@
static struct dma_buf *export_and_register_object(struct drm_device *dev,
* The actual exporting from GEM object to a dma-buf is done through the
* &drm_gem_object_funcs.export callback.
*/
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
- struct drm_file *file_priv, uint32_t handle,
- uint32_t flags,
- int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+  struct drm_file *file_priv, uint32_t handle,
+  uint32_t flags,
+  int *prime_fd)
   {
 struct drm_gem_object *obj;
 int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct
drm_device *dev,

 return ret;
   }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);

   int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
  struct drm_file *file_priv)
@@ -864,9 +866,9 @@

EXPORT_SYMBOL(drm_prime_get_contiguous_size);

* @obj: GEM object to export
* @flags: flags like DRM_CLOEXEC and DRM_RDWR
*
- * This is the implementation of the &drm_gem_object_funcs.export
functions
- * for GEM drivers using the PRIME helpers. It is used as the default
for
- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export
+ functions for GEM drivers
+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
*/
   struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
  int flags)
@@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
* @dev: drm_device to import into
* @dma_bu

Re: [PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-12-01 Thread Felix Kuehling


Hi Alex,

I'm about to push patches 1-3 to the rebased amd-staging-drm-next. It 
would be good to get patch 1 into drm-fixes so that Linux 6.6 will be 
the only kernel without these prime helpers. That would minimize the 
hassle for DKMS driver installations on future distros.


Thanks,
  Felix


On 2023-12-01 18:34, Felix Kuehling wrote:

This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

Acked-by: Christian König 
Acked-by: Thomas Zimmermann 
Acked-by: Daniel Vetter 
Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/drm_prime.c | 33 ++---
  include/drm/drm_prime.h |  7 +++
  2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf)
  }
  EXPORT_SYMBOL(drm_gem_dmabuf_release);
  
-/*

+/**
   * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
   * @dev: drm_device to import into
   * @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
   *
   * Returns 0 on success or a negative error code on failure.
   */
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
- struct drm_file *file_priv, int prime_fd,
- uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+  struct drm_file *file_priv, int prime_fd,
+  uint32_t *handle)
  {
struct dma_buf *dma_buf;
struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device 
*dev,
dma_buf_put(dma_buf);
return ret;
  }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);
  
  int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,

 struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct 
drm_device *dev,
return dmabuf;
  }
  
-/*

+/**
   * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
   * @dev: dev to export the buffer from
   * @file_priv: drm file-private structure
@@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct 
drm_device *dev,
   * The actual exporting from GEM object to a dma-buf is done through the
   * &drm_gem_object_funcs.export callback.
   */
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
- struct drm_file *file_priv, uint32_t 
handle,
- uint32_t flags,
- int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+  struct drm_file *file_priv, uint32_t handle,
+  uint32_t flags,
+  int *prime_fd)
  {
struct drm_gem_object *obj;
int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device 
*dev,
  
  	return ret;

  }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);
  
  int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,

 struct drm_file *file_priv)
@@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size);
   * @obj: GEM object to export
   * @flags: flags like DRM_CLOEXEC and DRM_RDWR
   *
- * This is the implementation of the &drm_gem_object_funcs.export functions
- * for GEM drivers using the PRIME helpers. It is used as the default for
- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export functions 
for GEM drivers
+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
   */
  struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
 int flags)
@@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
   * @dev: drm_device to import into
   * @dma_buf: dma-buf object to import
   *
- * This is the implementation of the gem_prime_import functions for GEM
- * drivers using the PRIME helpers. It is the default for drivers that do
- * not set their own &drm_driver.gem_prime_import.
+ * This is the implementation of the gem_prime_import functions for GEM drivers
+ * using the PRIME helpers. Drivers can use this as their
+ * &drm_driver.gem_prime_import implementation. It is used as the default
+ * implementation in drm_gem_prime_fd_to_handle().
   *
   * Drivers must arrange to call drm_prime_gem_destroy() from their
   * &drm_ge

[PATCH 3/6] drm/amdkfd: Import DMABufs for interop through DRM

2023-12-01 Thread Felix Kuehling

Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This
ensures that a GEM handle is created on import and that obj->dma_buf
will be set and remain set as long as the object is imported into KFD.

Signed-off-by: Felix Kuehling 
Reviewed-by: Ramesh Errabolu 
Reviewed-by: Xiaogang.Chen 
Acked-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  9 ++-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 64 +--
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 15 ++---
 3 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 02973f5c8caf..cf6ed5fce291 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -314,11 +314,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void 
*process_info,
struct dma_fence **ef);
 int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
  struct kfd_vm_fault_info *info);
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dmabuf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset);
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset);
 int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem,
  struct dma_buf **dmabuf);
 void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index ae7dfaf59159..48697b789342 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1956,8 +1956,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
 
/* Free the BO*/
drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv);
-   if (!mem->is_imported)
-   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
+   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
if (mem->dmabuf) {
dma_buf_put(mem->dmabuf);
mem->dmabuf = NULL;
@@ -2313,34 +2312,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct 
amdgpu_device *adev,
return 0;
 }
 
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dma_buf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset)
+static int import_obj_create(struct amdgpu_device *adev,
+struct dma_buf *dma_buf,
+struct drm_gem_object *obj,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
 {
struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv);
-   struct drm_gem_object *obj;
struct amdgpu_bo *bo;
int ret;
 
-   obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf);
-   if (IS_ERR(obj))
-   return PTR_ERR(obj);
-
bo = gem_to_amdgpu_bo(obj);
if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM |
-   AMDGPU_GEM_DOMAIN_GTT))) {
+   AMDGPU_GEM_DOMAIN_GTT)))
/* Only VRAM and GTT BOs are supported */
-   ret = -EINVAL;
-   goto err_put_obj;
-   }
+   return -EINVAL;
 
*mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL);
-   if (!*mem) {
-   ret = -ENOMEM;
-   goto err_put_obj;
-   }
+   if (!*mem)
+   return -ENOMEM;
 
ret = drm_vma_node_allow(&obj->vma_node, drm_priv);
if (ret)
@@ -2390,8 +2381,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct 
amdgpu_device *adev,
drm_vma_node_revoke(&obj->vma_node, drm_priv);
 err_free_mem:
kfree(*mem);
+   return ret;
+}
+
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
+{
+   struct drm_gem_object *ob

[PATCH 2/6] drm/amdkfd: Export DMABufs from KFD using GEM handles

2023-12-01 Thread Felix Kuehling

Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM
handles are created in a drm_client_dev context to avoid exposing them
in user mode contexts through a DMABuf import.

Signed-off-by: Felix Kuehling 
Reviewed-by: Ramesh Errabolu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  5 +++
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 33 +++
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  4 +--
 4 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 2d22f7d45512..067690ba7bff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
int i;
int last_valid_bit;
+   int ret;
 
amdgpu_amdkfd_gpuvm_init_mem_limits();
 
@@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
.enable_mes = adev->enable_mes,
};
 
+   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL);
+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n", ret);
+   return;
+   }
+
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
&gpu_resources);
+   if (adev->kfd.init_complete)
+   drm_client_register(&adev->kfd.client);
+   else
+   drm_client_release(&adev->kfd.client);
 
amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 16794c2eea35..02973f5c8caf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "amdgpu_sync.h"
 #include "amdgpu_vm.h"
 #include "amdgpu_xcp.h"
@@ -83,6 +84,7 @@ struct kgd_mem {
 
struct amdgpu_sync sync;
 
+   uint32_t gem_handle;
bool aql_queue;
bool is_imported;
 };
@@ -105,6 +107,9 @@ struct amdgpu_kfd_dev {
 
/* HMM page migration MEMORY_DEVICE_PRIVATE mapping */
struct dev_pagemap pgmap;
+
+   /* Client for KFD BO GEM handle allocations */
+   struct drm_client_dev client;
 };
 
 enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 73288f9ccaf8..ae7dfaf59159 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -806,13 +807,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem,
 static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
 {
if (!mem->dmabuf) {
-   struct dma_buf *ret = amdgpu_gem_prime_export(
-   &mem->bo->tbo.base,
+   struct amdgpu_device *bo_adev;
+   struct dma_buf *dmabuf;
+   int r, fd;
+
+   bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
+   r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, 
bo_adev->kfd.client.file,
+  mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-   DRM_RDWR : 0);
-   if (IS_ERR(ret))
-   return PTR_ERR(ret);
-   mem->dmabuf = ret;
+  DRM_RDWR : 0, &fd);
+   if (r)
+   return r;
+   dmabuf = dma_buf_get(fd);
+   close_fd(fd);
+   if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+   return PTR_ERR(dmabuf);
+   mem->dmabuf = dmabuf;
}
 
return 0;
@@ -1778,6 +1788,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
pr_debug("Failed to allow vma node access. ret %d\n", ret);
goto err_node_allow;
}
+   ret = drm_gem_handle_create(adev->kfd.client.file, gobj, 
&(*mem)->gem_handle);
+   if (ret)
+   goto err_gem_handle_create;
bo = gem_to_amdgpu_bo(gobj);
if (bo_type == ttm_bo_type_sg) {
bo->tb

[PATCH 5/6] drm/amdgpu: Auto-validate DMABuf imports in compute VMs

2023-12-01 Thread Felix Kuehling

DMABuf imports in compute VMs are not wrapped in a kgd_mem object on the
process_info->kfd_bo_list. There is no explicit KFD API call to validate
them or add eviction fences to them.

This patch automatically validates and fences dynamic DMABuf imports when
they are added to a compute VM. Revalidation after evictions is handled
in the VM code.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|   3 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  45 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c   |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  26 
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 122 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  12 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  |  12 +-
 8 files changed, 196 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index cf6ed5fce291..f2e920734c98 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -182,6 +182,9 @@ int amdgpu_queue_mask_bit_to_set_resource_bit(struct 
amdgpu_device *adev,
 struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
struct mm_struct *mm,
struct svm_range_bo *svm_bo);
+int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo,
+   uint32_t domain,
+   struct dma_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 int kfd_debugfs_kfd_mem_limits(struct seq_file *m, void *data);
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 48697b789342..7d91f99acb59 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -426,9 +426,9 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
return ret;
 }
 
-static int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo,
-  uint32_t domain,
-  struct dma_fence *fence)
+int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo,
+   uint32_t domain,
+   struct dma_fence *fence)
 {
int ret = amdgpu_bo_reserve(bo, false);
 
@@ -464,13 +464,16 @@ static int amdgpu_amdkfd_validate_vm_bo(void *_unused, struct amdgpu_bo *bo)
  * again. Page directories are only updated after updating page
  * tables.
  */
-static int vm_validate_pt_pd_bos(struct amdgpu_vm *vm)
+static int vm_validate_pt_pd_bos(struct amdgpu_vm *vm,
+struct ww_acquire_ctx *ticket)
 {
struct amdgpu_bo *pd = vm->root.bo;
struct amdgpu_device *adev = amdgpu_ttm_adev(pd->tbo.bdev);
int ret;
 
-   ret = amdgpu_vm_validate_pt_bos(adev, vm, amdgpu_amdkfd_validate_vm_bo, NULL);
+   ret = amdgpu_vm_validate_evicted_bos(adev, vm, ticket,
+amdgpu_amdkfd_validate_vm_bo,
+NULL);
if (ret) {
pr_err("failed to validate PT BOs\n");
return ret;
@@ -1310,14 +1313,15 @@ static int map_bo_to_gpuvm(struct kgd_mem *mem,
return ret;
 }
 
-static int process_validate_vms(struct amdkfd_process_info *process_info)
+static int process_validate_vms(struct amdkfd_process_info *process_info,
+   struct ww_acquire_ctx *ticket)
 {
struct amdgpu_vm *peer_vm;
int ret;
 
list_for_each_entry(peer_vm, &process_info->vm_list_head,
vm_list_node) {
-   ret = vm_validate_pt_pd_bos(peer_vm);
+   ret = vm_validate_pt_pd_bos(peer_vm, ticket);
if (ret)
return ret;
}
@@ -1402,7 +1406,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
ret = amdgpu_bo_reserve(vm->root.bo, true);
if (ret)
goto reserve_pd_fail;
-   ret = vm_validate_pt_pd_bos(vm);
+   ret = vm_validate_pt_pd_bos(vm, NULL);
if (ret) {
pr_err("validate_pt_pd_bos() failed\n");
goto validate_pd_fail;
@@ -2043,7 +2047,7 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
bo->tbo.resource->mem_type == TTM_PL_SYSTEM)
is_invalid_userptr = true;
 
-   ret = vm_validate_pt_pd_bos(avm);
+   ret = vm_validate_pt_pd_bos(avm, NULL);
if (unlikely(ret))
goto out_unreserve;
 
@@ -2122,7 +2126,7 @@ int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu(
goto unreserve_out;
}
 
-   ret = vm_validate_pt_pd_

[PATCH 6/6] drm/amdkfd: Bump KFD ioctl version

2023-12-01 Thread Felix Kuehling

This is not strictly a change in the IOCTL API. This version bump is meant
to indicate to user mode the presence of a number of changes and fixes
that enable the management of VA mappings in compute VMs using the GEM_VA
ioctl for DMABufs exported from KFD.

Signed-off-by: Felix Kuehling 
---
 include/uapi/linux/kfd_ioctl.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index f0ed68974c54..9ce46edc62a5 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -40,9 +40,10 @@
  * - 1.12 - Add DMA buf export ioctl
  * - 1.13 - Add debugger API
  * - 1.14 - Update kfd_event_data
+ * - 1.15 - Enable managing mappings in compute VMs with GEM_VA ioctl
  */
 #define KFD_IOCTL_MAJOR_VERSION 1
-#define KFD_IOCTL_MINOR_VERSION 14
+#define KFD_IOCTL_MINOR_VERSION 15
 
 struct kfd_ioctl_get_version_args {
__u32 major_version;/* from KFD */
-- 
2.34.1
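
The version bump in the patch above gives user mode a way to detect the GEM_VA mapping support before relying on it. A minimal sketch of such a check follows. The helper name kfd_has_gem_va_mappings is hypothetical, and the version pair is assumed to have been read from the major_version and minor_version fields of struct kfd_ioctl_get_version_args. Treating any major version other than 1 as unsupported is a conservative assumption of this sketch, not something the patch states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minor version that advertises GEM_VA-managed mappings, per the patch above. */
#define KFD_MINOR_GEM_VA_MAPPINGS 15u

/*
 * Hypothetical helper (not part of any KFD header): report whether the
 * given KFD ioctl version is new enough for user mode to manage VA
 * mappings of KFD-exported DMABufs with the GEM_VA ioctl.
 *
 * Assumption: only major version 1 is understood; anything else is
 * treated as unsupported rather than guessed at.
 */
static bool kfd_has_gem_va_mappings(uint32_t major, uint32_t minor)
{
	if (major != 1)
		return false;
	return minor >= KFD_MINOR_GEM_VA_MAPPINGS;
}
```

In a real program the two numbers would presumably come from the KFD get-version ioctl on the KFD device node; that plumbing is left out here so the sketch stays self-contained.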

[PATCH 4/6] drm/amdgpu: New VM state for evicted user BOs

2023-12-01 Thread Felix Kuehling

Create a new VM state to track user BOs that are in the system domain.
In the next patch this will be used to conditionally re-validate them in
amdgpu_vm_handle_moved.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  5 -
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 7da71b6a9dc6..a748e17ff031 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -233,6 +233,22 @@ static void amdgpu_vm_bo_invalidated(struct amdgpu_vm_bo_base *vm_bo)
spin_unlock(&vm_bo->vm->status_lock);
 }
 
+/**
+ * amdgpu_vm_bo_evicted_user - vm_bo is evicted
+ *
+ * @vm_bo: vm_bo which is evicted
+ *
+ * State for BOs used by user mode queues which are not at the location they
+ * should be.
+ */
+static void amdgpu_vm_bo_evicted_user(struct amdgpu_vm_bo_base *vm_bo)
+{
+   vm_bo->moved = true;
+   spin_lock(&vm_bo->vm->status_lock);
+   list_move(&vm_bo->vm_status, &vm_bo->vm->evicted_user);
+   spin_unlock(&vm_bo->vm->status_lock);
+}
+
 /**
  * amdgpu_vm_bo_relocated - vm_bo is reloacted
  *
@@ -2195,6 +2211,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
for (i = 0; i < AMDGPU_MAX_VMHUBS; i++)
vm->reserved_vmid[i] = NULL;
INIT_LIST_HEAD(&vm->evicted);
+   INIT_LIST_HEAD(&vm->evicted_user);
INIT_LIST_HEAD(&vm->relocated);
INIT_LIST_HEAD(&vm->moved);
INIT_LIST_HEAD(&vm->idle);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index b6cd565562ad..9156ed22abe7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -288,9 +288,12 @@ struct amdgpu_vm {
/* Lock to protect vm_bo add/del/move on all lists of vm */
spinlock_t  status_lock;
 
-   /* BOs who needs a validation */
+   /* Per VM and PT BOs who needs a validation */
struct list_headevicted;
 
+   /* BOs for user mode queues that need a validation */
+   struct list_headevicted_user;
+
/* PT BOs which relocated and their parent need an update */
struct list_headrelocated;
 
-- 
2.34.1
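
The state tracking added by the patch above amounts to moving a BO's list node onto a dedicated per-VM list. The following user-space sketch illustrates that idea only. All toy_* names are hypothetical stand-ins for struct amdgpu_vm, struct amdgpu_vm_bo_base and the kernel list helpers; the status_lock spinlock is omitted, and starting every BO on the idle list is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Circular doubly linked list node, loosely modelled on the kernel's list_head. */
struct toy_list {
	struct toy_list *prev, *next;
};

static void toy_list_init(struct toy_list *head)
{
	head->prev = head;
	head->next = head;
}

static bool toy_list_empty(const struct toy_list *head)
{
	return head->next == head;
}

/* Unlink entry from its current list and insert it right after head. */
static void toy_list_move(struct toy_list *entry, struct toy_list *head)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Toy VM with one list per BO state (only three states shown). */
struct toy_vm {
	struct toy_list evicted;      /* per-VM and PT BOs needing validation */
	struct toy_list evicted_user; /* user mode queue BOs needing validation */
	struct toy_list idle;
};

struct toy_vm_bo {
	struct toy_vm *vm;
	struct toy_list vm_status;    /* node on exactly one state list */
	bool moved;
};

static void toy_vm_init(struct toy_vm *vm)
{
	toy_list_init(&vm->evicted);
	toy_list_init(&vm->evicted_user);
	toy_list_init(&vm->idle);
}

/* Attach a BO to a VM; in this sketch it starts out on the idle list. */
static void toy_vm_bo_init(struct toy_vm_bo *bo, struct toy_vm *vm)
{
	assert(bo != NULL && vm != NULL);
	bo->vm = vm;
	bo->moved = false;
	toy_list_init(&bo->vm_status);
	toy_list_move(&bo->vm_status, &vm->idle);
}

/* Same shape as amdgpu_vm_bo_evicted_user() in the patch, minus the locking. */
static void toy_vm_bo_evicted_user(struct toy_vm_bo *bo)
{
	bo->moved = true;
	toy_list_move(&bo->vm_status, &bo->vm->evicted_user);
}
```

Because each BO carries a single list node, putting it on evicted_user implicitly takes it off whichever state list it was on before, which is why a separate list (rather than a flag) is enough to keep user BOs apart from the per-VM and PT BOs on evicted.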

[PATCH 1/6] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-12-01 Thread Felix Kuehling

This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

Acked-by: Christian König 
Acked-by: Thomas Zimmermann 
Acked-by: Daniel Vetter 
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/drm_prime.c | 33 ++---
 include/drm/drm_prime.h |  7 +++
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf)
 }
 EXPORT_SYMBOL(drm_gem_dmabuf_release);
 
-/*
+/**
  * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
  * @dev: drm_device to import into
  * @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
  *
  * Returns 0 on success or a negative error code on failure.
  */
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
- struct drm_file *file_priv, int prime_fd,
- uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+  struct drm_file *file_priv, int prime_fd,
+  uint32_t *handle)
 {
struct dma_buf *dma_buf;
struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
dma_buf_put(dma_buf);
return ret;
 }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);
 
 int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
return dmabuf;
 }
 
-/*
+/**
  * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
  * @dev: dev to export the buffer from
  * @file_priv: drm file-private structure
@@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
  * The actual exporting from GEM object to a dma-buf is done through the
  * &drm_gem_object_funcs.export callback.
  */
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
- struct drm_file *file_priv, uint32_t handle,
- uint32_t flags,
- int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+  struct drm_file *file_priv, uint32_t handle,
+  uint32_t flags,
+  int *prime_fd)
 {
struct drm_gem_object *obj;
int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
 
return ret;
 }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);
 
 int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size);
  * @obj: GEM object to export
  * @flags: flags like DRM_CLOEXEC and DRM_RDWR
  *
- * This is the implementation of the &drm_gem_object_funcs.export functions
- * for GEM drivers using the PRIME helpers. It is used as the default for
- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers
+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
  */
 struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
 int flags)
@@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
  * @dev: drm_device to import into
  * @dma_buf: dma-buf object to import
  *
- * This is the implementation of the gem_prime_import functions for GEM
- * drivers using the PRIME helpers. It is the default for drivers that do
- * not set their own &drm_driver.gem_prime_import.
+ * This is the implementation of the gem_prime_import functions for GEM drivers
+ * using the PRIME helpers. Drivers can use this as their
+ * &drm_driver.gem_prime_import implementation. It is used as the default
+ * implementation in drm_gem_prime_fd_to_handle().
  *
  * Drivers must arrange to call drm_prime_gem_destroy() from their
  * &drm_gem_object_funcs.free hook when using this function.
diff --git a/include/drm/drm_prime.h b/include/drm/drm_prime.h
index a7abf9f3e697..2a1d01e5b56b 100644
--- a/include/drm/drm_prime.h
+++ b/include/drm/drm_prime.h
@@ -60,12 +60,19 @@ enum dma_data_direction;
 
 struct drm_device;
 struct drm_gem_object;
+struct drm_file;
 
 /* core prime functions */
 struct dma_buf *drm_gem_dmabuf_expor

Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-11-28 Thread Felix Kuehling


On 2023-11-28 12:22, Alex Deucher wrote:

On Thu, Nov 23, 2023 at 6:12 PM Felix Kuehling  wrote:

[+Alex]

On 2023-11-17 16:44, Felix Kuehling wrote:


This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

CC: Christian König 
CC: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 

Re: our discussion about v2 of this patch: If this version is
acceptable, can I get an R-b or A-b?

I would like to get this patch into drm-next as a prerequisite for
patches 2 and 3. I cannot submit it to the current amd-staging-drm-next
because the patch I'm reverting doesn't exist there yet.

Patches 2 and 3 could go into drm-next as well, or go through Alex's
amd-staging-drm-next branch once patch 1 is in drm-next. Alex, how do
you prefer to coordinate this?

I guess ideally this would go through my drm-next tree since your
other patches depend on it unless others feel strongly that it should
go through drm-misc.


Yes, drm-next would work best for applying this patch and the two 
patches that depend on it. I can send you the rebased patches from my 
drm-next based branch that I used for testing this.


Regards,
  Felix




Alex



Regards,
Felix



---
   drivers/gpu/drm/drm_prime.c | 33 ++---
   include/drm/drm_prime.h |  7 +++
   2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf)
   }
   EXPORT_SYMBOL(drm_gem_dmabuf_release);

-/*
+/**
* drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
* @dev: drm_device to import into
* @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
*
* Returns 0 on success or a negative error code on failure.
*/
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
-   struct drm_file *file_priv, int prime_fd,
-   uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+struct drm_file *file_priv, int prime_fd,
+uint32_t *handle)
   {
   struct dma_buf *dma_buf;
   struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
   dma_buf_put(dma_buf);
   return ret;
   }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);

   int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,
struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
   return dmabuf;
   }

-/*
+/**
* drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
* @dev: dev to export the buffer from
* @file_priv: drm file-private structure
@@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
* The actual exporting from GEM object to a dma-buf is done through the
* &drm_gem_object_funcs.export callback.
*/
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
-   struct drm_file *file_priv, uint32_t handle,
-   uint32_t flags,
-   int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+struct drm_file *file_priv, uint32_t handle,
+uint32_t flags,
+int *prime_fd)
   {
   struct drm_gem_object *obj;
   int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev,

   return ret;
   }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);

   int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
struct drm_file *file_priv)
@@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size);
* @obj: GEM object to export
* @flags: flags like DRM_CLOEXEC and DRM_RDWR
*
- * This is the implementation of the &drm_gem_object_funcs.export functions
- * for GEM drivers using the PRIME helpers. It is used as the default for
- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers
+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
*/
   struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
int flags)
@@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
* @dev

Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-11-23 Thread Felix Kuehling


[+Alex]

On 2023-11-17 16:44, Felix Kuehling wrote:


This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

CC: Christian König 
CC: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 


Re: our discussion about v2 of this patch: If this version is 
acceptable, can I get an R-b or A-b?


I would like to get this patch into drm-next as a prerequisite for 
patches 2 and 3. I cannot submit it to the current amd-staging-drm-next 
because the patch I'm reverting doesn't exist there yet.


Patches 2 and 3 could go into drm-next as well, or go through Alex's 
amd-staging-drm-next branch once patch 1 is in drm-next. Alex, how do 
you prefer to coordinate this?


Regards,
  Felix



---
  drivers/gpu/drm/drm_prime.c | 33 ++---
  include/drm/drm_prime.h |  7 +++
  2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf)
  }
  EXPORT_SYMBOL(drm_gem_dmabuf_release);
  
-/*
+/**
   * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
   * @dev: drm_device to import into
   * @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
   *
   * Returns 0 on success or a negative error code on failure.
   */
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
- struct drm_file *file_priv, int prime_fd,
- uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+  struct drm_file *file_priv, int prime_fd,
+  uint32_t *handle)
  {
struct dma_buf *dma_buf;
struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
dma_buf_put(dma_buf);
return ret;
  }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);
  
  int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
return dmabuf;
  }
  
-/*
+/**
   * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
   * @dev: dev to export the buffer from
   * @file_priv: drm file-private structure
@@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
   * The actual exporting from GEM object to a dma-buf is done through the
   * &drm_gem_object_funcs.export callback.
   */
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
- struct drm_file *file_priv, uint32_t handle,
- uint32_t flags,
- int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+  struct drm_file *file_priv, uint32_t handle,
+  uint32_t flags,
+  int *prime_fd)
  {
struct drm_gem_object *obj;
int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
  
  	return ret;
  }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);
  
  int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size);
   * @obj: GEM object to export
   * @flags: flags like DRM_CLOEXEC and DRM_RDWR
   *
- * This is the implementation of the &drm_gem_object_funcs.export functions
- * for GEM drivers using the PRIME helpers. It is used as the default for
- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers
+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
   */
  struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
 int flags)
@@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
   * @dev: drm_device to import into
   * @dma_buf: dma-buf object to import
   *
- * This is the implementation of the gem_prime_import functions for GEM
- * drivers using the PRIME helpers. It is the default for drivers that do
- * not set their own &drm_driver.gem_prime_import.
+ * This is the implementation of the gem_prime_import functions for GEM drivers
+ * using the PRIME helpers. Drivers can use this as their
+ * &drm_driver.gem_prime_import implem

Re: [PATCH v2 2/4] drm/prime: Helper to export dmabuf without fd

2023-11-22 Thread Felix Kuehling




On 2023-11-22 05:32, Thomas Zimmermann wrote:

Hi,

my apologies if this sounds picky or annoying. This change appears to be
going in the wrong direction. The goal of the refactoring is to be able
to use drm_driver.gem_prime_import and drm_gem_object_funcs.export for
the additional import/export code; and hence keep the GEM object code in
a single place. Keeping the prime_fd file descriptor within amdkfd will
likely help with that.

Here's my suggestion:

   1) Please keep the internal interfaces drm_gem_prime_handle_to_fd()
and drm_gem_prime_fd_to_handle(). They should be called from the _ioctl
entry functions as is. That could be streamlined in a later patch set.

   2) From drm_gem_prime_handle_to_fd() and drm_gem_prime_fd_to_handle(),
create drm_gem_prime_handle_to_dmabuf() and
drm_gem_prime_dmabuf_to_handle().


Do you mean duplicate the code, or call drm_gem_prime_handle_to_dmabuf 
from drm_gem_prime_handle_to_fd?




  They should be exported. You can then
keep the file-descriptor code in amdkfd and out of the PRIME helpers.

   3) Patches 1 and 2 should be squashed into one.

   4) And if I'm not mistaken, the additional import/export code can then
go into drm_driver.gem_prime_import and drm_gem_object_funcs.export,
which are being called from within the PRIME helpers.


I'm not sure what you mean by "additional import/export code" that would 
move into those driver callbacks.





That's admittedly quite a bit of refactoring. OR simply go back to v1 of
this patch set, which was consistent at least.


I think I'd prefer that because I don't really understand what you're 
trying to achieve.


Thanks,
  Felix




Best regards
Thomas


Am 22.11.23 um 00:11 schrieb Felix Kuehling:

Change drm_gem_prime_handle_to_fd to drm_gem_prime_handle_to_dmabuf to
export a dmabuf without creating an FD as a user mode handle. This is
more useful for users in kernel mode.

Suggested-by: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/drm_prime.c | 63 ++---
   include/drm/drm_prime.h |  6 ++--
   2 files changed, 33 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 834a5e28abbe..d491b5f73eea 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -410,26 +410,25 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
   }
   
   /**
- * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
+ * drm_gem_prime_handle_to_dmabuf - PRIME export function for GEM drivers
* @dev: dev to export the buffer from
* @file_priv: drm file-private structure
* @handle: buffer handle to export
* @flags: flags like DRM_CLOEXEC
- * @prime_fd: pointer to storage for the fd id of the create dma-buf
+ * @dma_buf: pointer to storage for the dma-buf reference
*
* This is the PRIME export function which must be used mandatorily by GEM
* drivers to ensure correct lifetime management of the underlying GEM object.
* The actual exporting from GEM object to a dma-buf is done through the
* &drm_gem_object_funcs.export callback.
*/
-int drm_gem_prime_handle_to_fd(struct drm_device *dev,
-  struct drm_file *file_priv, uint32_t handle,
-  uint32_t flags,
-  int *prime_fd)
+struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev,
+  struct drm_file *file_priv,
+  uint32_t handle, uint32_t flags)
   {
struct drm_gem_object *obj;
int ret = 0;
-   struct dma_buf *dmabuf;
+   struct dma_buf *dmabuf = NULL;
   
   	mutex_lock(&file_priv->prime.lock);
obj = drm_gem_object_lookup(file_priv, handle);
@@ -441,7 +440,7 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, handle);
if (dmabuf) {
get_dma_buf(dmabuf);
-   goto out_have_handle;
+   goto out;
}
   
   	mutex_lock(&dev->object_name_lock);
@@ -479,40 +478,22 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
   dmabuf, handle);
mutex_unlock(&dev->object_name_lock);
if (ret)
-   goto fail_put_dmabuf;
-
-out_have_handle:
-   ret = dma_buf_fd(dmabuf, flags);
-   /*
-* We must _not_ remove the buffer from the handle cache since the newly
-* created dma buf is already linked in the global obj->dma_buf pointer,
-* and that is invariant as long as a userspace gem handle exists.
-* Closing the handle will clean out the cache anyway, so we don't leak.
-*/
-   if (ret < 0) {
-   goto fail_put_dmabuf;
-   } else {
-   *p

[PATCH v2 4/4] drm/amdkfd: Import DMABufs for interop through DRM

2023-11-21 Thread Felix Kuehling

Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This
ensures that a GEM handle is created on import and that obj->dma_buf
will be set and remain set as long as the object is imported into KFD.

Signed-off-by: Felix Kuehling 
Reviewed-by: Ramesh Errabolu 
Reviewed-by: Xiaogang.Chen 
Acked-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  9 ++-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 64 +--
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 15 ++---
 3 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index c1195eb67057..8da42e0dddcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -319,11 +319,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info,
struct dma_fence **ef);
 int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
  struct kfd_vm_fault_info *info);
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dmabuf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset);
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset);
 int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem,
  struct dma_buf **dmabuf);
 void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index e96e1595791f..652657c863ff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1953,8 +1953,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
 
/* Free the BO*/
drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv);
-   if (!mem->is_imported)
-   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
+   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
if (mem->dmabuf) {
dma_buf_put(mem->dmabuf);
mem->dmabuf = NULL;
@@ -2310,34 +2309,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
return 0;
 }
 
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dma_buf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset)
+static int import_obj_create(struct amdgpu_device *adev,
+struct dma_buf *dma_buf,
+struct drm_gem_object *obj,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
 {
struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv);
-   struct drm_gem_object *obj;
struct amdgpu_bo *bo;
int ret;
 
-   obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf);
-   if (IS_ERR(obj))
-   return PTR_ERR(obj);
-
bo = gem_to_amdgpu_bo(obj);
if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM |
-   AMDGPU_GEM_DOMAIN_GTT))) {
+   AMDGPU_GEM_DOMAIN_GTT)))
/* Only VRAM and GTT BOs are supported */
-   ret = -EINVAL;
-   goto err_put_obj;
-   }
+   return -EINVAL;
 
*mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL);
-   if (!*mem) {
-   ret = -ENOMEM;
-   goto err_put_obj;
-   }
+   if (!*mem)
+   return -ENOMEM;
 
ret = drm_vma_node_allow(&obj->vma_node, drm_priv);
if (ret)
@@ -2387,8 +2378,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
drm_vma_node_revoke(&obj->vma_node, drm_priv);
 err_free_mem:
kfree(*mem);
+   return ret;
+}
+
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
+{
+   struct drm_gem_object *ob

[PATCH v2 1/4] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-11-21 Thread Felix Kuehling

This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

CC: Christian König 
CC: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/drm_prime.c | 33 ++---
 include/drm/drm_prime.h |  7 +++
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf)
 }
 EXPORT_SYMBOL(drm_gem_dmabuf_release);
 
-/*
+/**
  * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
  * @dev: drm_device to import into
  * @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
  *
  * Returns 0 on success or a negative error code on failure.
  */
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
- struct drm_file *file_priv, int prime_fd,
- uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+  struct drm_file *file_priv, int prime_fd,
+  uint32_t *handle)
 {
struct dma_buf *dma_buf;
struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
dma_buf_put(dma_buf);
return ret;
 }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);
 
 int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
return dmabuf;
 }
 
-/*
+/**
  * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
  * @dev: dev to export the buffer from
  * @file_priv: drm file-private structure
@@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
  * The actual exporting from GEM object to a dma-buf is done through the
  * &drm_gem_object_funcs.export callback.
  */
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
- struct drm_file *file_priv, uint32_t handle,
- uint32_t flags,
- int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+  struct drm_file *file_priv, uint32_t handle,
+  uint32_t flags,
+  int *prime_fd)
 {
struct drm_gem_object *obj;
int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
 
return ret;
 }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);
 
 int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size);
  * @obj: GEM object to export
  * @flags: flags like DRM_CLOEXEC and DRM_RDWR
  *
- * This is the implementation of the &drm_gem_object_funcs.export functions
- * for GEM drivers using the PRIME helpers. It is used as the default for
- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers
+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
  */
 struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
 int flags)
@@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
  * @dev: drm_device to import into
  * @dma_buf: dma-buf object to import
  *
- * This is the implementation of the gem_prime_import functions for GEM
- * drivers using the PRIME helpers. It is the default for drivers that do
- * not set their own &drm_driver.gem_prime_import.
+ * This is the implementation of the gem_prime_import functions for GEM drivers
+ * using the PRIME helpers. Drivers can use this as their
+ * &drm_driver.gem_prime_import implementation. It is used as the default
+ * implementation in drm_gem_prime_fd_to_handle().
  *
  * Drivers must arrange to call drm_prime_gem_destroy() from their
  * &drm_gem_object_funcs.free hook when using this function.
diff --git a/include/drm/drm_prime.h b/include/drm/drm_prime.h
index a7abf9f3e697..2a1d01e5b56b 100644
--- a/include/drm/drm_prime.h
+++ b/include/drm/drm_prime.h
@@ -60,12 +60,19 @@ enum dma_data_direction;
 
 struct drm_device;
 struct drm_gem_object;
+struct drm_file;
 
 /* core prime functions */
 struct dma_buf *drm_gem_dmabuf_expor

[PATCH v2 3/4] drm/amdkfd: Export DMABufs from KFD using GEM handles

2023-11-21 Thread Felix Kuehling

Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM
handles are created in a drm_client_dev context to avoid exposing them
in user mode contexts through a DMABuf import.

Signed-off-by: Felix Kuehling 
Reviewed-by: Ramesh Errabolu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  5 
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 29 ++-
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index b8412202a1b0..aa8b24079070 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
int i;
int last_valid_bit;
+   int ret;
 
amdgpu_amdkfd_gpuvm_init_mem_limits();
 
@@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
.enable_mes = adev->enable_mes,
};
 
+   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL);
+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n", ret);
+   return;
+   }
+
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
&gpu_resources);
+   if (adev->kfd.init_complete)
+   drm_client_register(&adev->kfd.client);
+   else
+   drm_client_release(&adev->kfd.client);
 
amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index dac983da961d..c1195eb67057 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "amdgpu_sync.h"
 #include "amdgpu_vm.h"
 #include "amdgpu_xcp.h"
@@ -83,6 +84,7 @@ struct kgd_mem {
 
struct amdgpu_sync sync;
 
+   uint32_t gem_handle;
bool aql_queue;
bool is_imported;
 };
@@ -105,6 +107,9 @@ struct amdgpu_kfd_dev {
 
/* HMM page migration MEMORY_DEVICE_PRIVATE mapping */
struct dev_pagemap pgmap;
+
+   /* Client for KFD BO GEM handle allocations */
+   struct drm_client_dev client;
 };
 
 enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 41fbc4fd0fac..e96e1595791f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -806,13 +807,18 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem,
 static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
 {
if (!mem->dmabuf) {
-   struct dma_buf *ret = amdgpu_gem_prime_export(
-   &mem->bo->tbo.base,
+   struct amdgpu_device *bo_adev;
+   struct dma_buf *dmabuf;
+
+   bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
+   dmabuf = drm_gem_prime_handle_to_dmabuf(&bo_adev->ddev,
+   bo_adev->kfd.client.file,
+   mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-   DRM_RDWR : 0);
-   if (IS_ERR(ret))
-   return PTR_ERR(ret);
-   mem->dmabuf = ret;
+   DRM_RDWR : 0);
+   if (IS_ERR(dmabuf))
+   return PTR_ERR(dmabuf);
+   mem->dmabuf = dmabuf;
}
 
return 0;
@@ -1779,6 +1785,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
pr_debug("Failed to allow vma node access. ret %d\n", ret);
goto err_node_allow;
}
+   ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle);
+   if (ret)
+   goto err_gem_handle_create;
bo = gem_to_amdgpu_bo(gobj);
if (bo_type == ttm_bo_type_sg) {
bo->tbo.sg = sg;
@@ -1830,6 +1839,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 err_pin_bo:
 err_validate_bo:
remove_kgd_mem_from_kfd_bo_list(*mem, avm

[PATCH v2 2/4] drm/prime: Helper to export dmabuf without fd

2023-11-21 Thread Felix Kuehling

Change drm_gem_prime_handle_to_fd to drm_gem_prime_handle_to_dmabuf to
export a dmabuf without creating an FD as a user mode handle. This is
more useful for users in kernel mode.
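
For readers less familiar with the pointer-or-errno return convention this change relies on, here is a minimal userspace sketch. It is not the kernel implementation: ERR_PTR, PTR_ERR and IS_ERR are simplified stand-ins for the kernel helpers of the same names, and struct fake_dmabuf, fake_handle_to_dmabuf() and fake_handle_to_fd() are hypothetical names used only for illustration. The fd value 3 is a placeholder for whatever dma_buf_fd() would return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error codes occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct dma_buf. */
struct fake_dmabuf {
	int refcount;
};

static struct fake_dmabuf demo_buf = { .refcount = 1 };

/*
 * Hypothetical model of a handle-to-dmabuf helper: returns a referenced
 * buffer on success or an encoded errno on failure. Only handle 1 exists.
 */
static struct fake_dmabuf *fake_handle_to_dmabuf(uint32_t handle)
{
	if (handle != 1)
		return ERR_PTR(-ENOENT);
	demo_buf.refcount++;	/* the caller receives its own reference */
	return &demo_buf;
}

/*
 * Hypothetical model of the ioctl-side wrapper: only this layer deals
 * with file descriptors, so in-kernel callers never need one.
 */
static int fake_handle_to_fd(uint32_t handle, int *fd)
{
	struct fake_dmabuf *buf = fake_handle_to_dmabuf(handle);

	if (IS_ERR(buf))
		return (int)PTR_ERR(buf);
	*fd = 3;	/* placeholder for the value dma_buf_fd() would return */
	return 0;
}
```

An in-kernel user would call the pointer-returning helper directly and check IS_ERR(), while the ioctl path keeps the fd handling to itself.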

Suggested-by: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/drm_prime.c | 63 ++---
 include/drm/drm_prime.h |  6 ++--
 2 files changed, 33 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 834a5e28abbe..d491b5f73eea 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -410,26 +410,25 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
 }
 
 /**
- * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
+ * drm_gem_prime_handle_to_dmabuf - PRIME export function for GEM drivers
  * @dev: dev to export the buffer from
  * @file_priv: drm file-private structure
  * @handle: buffer handle to export
  * @flags: flags like DRM_CLOEXEC
- * @prime_fd: pointer to storage for the fd id of the create dma-buf
+ * @dma_buf: pointer to storage for the dma-buf reference
  *
  * This is the PRIME export function which must be used mandatorily by GEM
  * drivers to ensure correct lifetime management of the underlying GEM object.
  * The actual exporting from GEM object to a dma-buf is done through the
  * &drm_gem_object_funcs.export callback.
  */
-int drm_gem_prime_handle_to_fd(struct drm_device *dev,
-  struct drm_file *file_priv, uint32_t handle,
-  uint32_t flags,
-  int *prime_fd)
+struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev,
+  struct drm_file *file_priv,
+  uint32_t handle, uint32_t flags)
 {
struct drm_gem_object *obj;
int ret = 0;
-   struct dma_buf *dmabuf;
+   struct dma_buf *dmabuf = NULL;
 
mutex_lock(&file_priv->prime.lock);
obj = drm_gem_object_lookup(file_priv, handle);
@@ -441,7 +440,7 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
dmabuf = drm_prime_lookup_buf_by_handle(&file_priv->prime, handle);
if (dmabuf) {
get_dma_buf(dmabuf);
-   goto out_have_handle;
+   goto out;
}
 
mutex_lock(&dev->object_name_lock);
@@ -479,40 +478,22 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
   dmabuf, handle);
mutex_unlock(&dev->object_name_lock);
if (ret)
-   goto fail_put_dmabuf;
-
-out_have_handle:
-   ret = dma_buf_fd(dmabuf, flags);
-   /*
-* We must _not_ remove the buffer from the handle cache since the newly
-* created dma buf is already linked in the global obj->dma_buf pointer,
-* and that is invariant as long as a userspace gem handle exists.
-* Closing the handle will clean out the cache anyway, so we don't leak.
-*/
-   if (ret < 0) {
-   goto fail_put_dmabuf;
-   } else {
-   *prime_fd = ret;
-   ret = 0;
-   }
-
-   goto out;
-
-fail_put_dmabuf:
-   dma_buf_put(dmabuf);
+   dma_buf_put(dmabuf);
 out:
drm_gem_object_put(obj);
 out_unlock:
mutex_unlock(&file_priv->prime.lock);
 
-   return ret;
+   return ret ? ERR_PTR(ret) : dmabuf;
 }
-EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);
+EXPORT_SYMBOL(drm_gem_prime_handle_to_dmabuf);
 
 int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
 {
struct drm_prime_handle *args = data;
+   struct dma_buf *dmabuf;
+   int ret;
 
/* check flags are valid */
if (args->flags & ~(DRM_CLOEXEC | DRM_RDWR))
@@ -523,8 +504,24 @@ int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
   args->handle, args->flags,
   &args->fd);
}
-   return drm_gem_prime_handle_to_fd(dev, file_priv, args->handle,
- args->flags, &args->fd);
+   dmabuf = drm_gem_prime_handle_to_dmabuf(dev, file_priv, args->handle,
+   args->flags);
+   if (IS_ERR(dmabuf))
+   return PTR_ERR(dmabuf);
+   ret = dma_buf_fd(dmabuf, args->flags);
+   /*
+* We must _not_ remove the buffer from the handle cache since the newly
+* created dma buf is already linked in the global obj->dma_buf pointer,
+* and that is invariant as long as a userspace gem handle exists.
+* Closing the handle will clean out the cache anyway, so we don't leak.
+*/
+

Re: [Bug 218168] New: amdgpu: kfd_topology.c warning: the frame size of 1408 bytes is larger than 1024 bytes

2023-11-21 Thread Felix Kuehling

There are two patches that didn't make it into Linux 6.6 that reduce the 
stack size in kfd_topology_add_device. Can you check if those fix the 
problem?


commit aa5a9b2ccda2fa834fddb4bd30a2ab831598f551
Author: Alex Deucher 
Date:   Tue Sep 26 12:00:23 2023 -0400

drm/amdkfd: drop struct kfd_cu_info

I think this was an abstraction back from when
kfd supported both radeon and amdgpu.  Since we just
support amdgpu now, there is no more need for this and
we can use the amdgpu structures directly.

This also avoids having the kfd_cu_info structures on
the stack when inlining which can blow up the stack.

Cc: Arnd Bergmann 
Acked-by: Arnd Bergmann 
Reviewed-by: Felix Kuehling 
Acked-by: Christian König 
Signed-off-by: Alex Deucher 

commit 1f3b515578a1d73926993629a06a7f3b60535b59
Author: Alex Deucher 
Date:   Thu Sep 21 10:32:09 2023 -0400

drm/amdkfd: reduce stack size in kfd_topology_add_device()

kfd_topology.c:2082:1: warning: the frame size of 1440 bytes is larger than 1024 bytes

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2866

Cc: Arnd Bergmann 
Acked-by: Arnd Bergmann 
Acked-by: Christian König 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
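
As a generic illustration of why a large local structure trips -Wframe-larger-than, and one common way to avoid it, here is a small userspace sketch. It does not reproduce what the two commits above actually change. struct big_info, fill_info(), count_on_stack() and count_on_heap() are hypothetical names, and malloc()/free() stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical large structure; the array alone is 1 KiB. */
struct big_info {
	unsigned int bitmap[256];
	unsigned int count;
};

static void fill_info(struct big_info *info)
{
	memset(info, 0, sizeof(*info));
	info->bitmap[0] = 0xf;
	info->count = 4;
}

/*
 * Stack variant: the whole structure lives in this function's frame,
 * so the frame alone exceeds a 1024-byte limit.
 */
static unsigned int count_on_stack(void)
{
	struct big_info info;

	fill_info(&info);
	return info.count;
}

/*
 * Heap variant: the frame only holds a pointer. Returns 0 on success
 * or -1 if the allocation fails.
 */
static int count_on_heap(unsigned int *count)
{
	struct big_info *info = malloc(sizeof(*info));

	if (!info)
		return -1;
	fill_info(info);
	*count = info->count;
	free(info);
	return 0;
}
```

The trade-off is an extra failure path for the allocation, which is why dropping the intermediate structure altogether can be the simpler fix when it is possible.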

Regards,
  Felix


On 2023-11-20 10:36, Hamza Mahfooz wrote:

+ amd-gfx
+ Felix

On 11/20/23 10:16, bugzilla-dae...@kernel.org wrote:

https://bugzilla.kernel.org/show_bug.cgi?id=218168

 Bug ID: 218168
    Summary: amdgpu: kfd_topology.c warning: the frame size of 1408 bytes is larger than 1024 bytes
    Product: Drivers
    Version: 2.5
   Hardware: All
 OS: Linux
 Status: NEW
   Severity: normal
   Priority: P3
  Component: Video(DRI - non Intel)
   Assignee: drivers_video-...@kernel-bugs.osdl.org
   Reporter: bluesun...@gmail.com
 Regression: No

Trying to compile Linux 6.6.2 with GCC 13.2.1 and CONFIG_WERROR=y:

[...]
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c: In function 'kfd_topology_add_device':
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:2082:1: error: the frame size of 1408 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
  2082 | }
   | ^
cc1: all warnings being treated as errors
[...]

Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-11-20 Thread Felix Kuehling




On 2023-11-20 11:02, Thomas Zimmermann wrote:

Hi Christian

Am 20.11.23 um 16:22 schrieb Christian König:

Am 20.11.23 um 16:18 schrieb Thomas Zimmermann:

Hi

Am 20.11.23 um 16:06 schrieb Felix Kuehling:

On 2023-11-20 6:54, Thomas Zimmermann wrote:

Hi

Am 17.11.23 um 22:44 schrieb Felix Kuehling:

This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

I'm unhappy to see these functions making a comeback. They are the
boiler-plate logic that all drivers should use. Historically,
drivers did a lot of things in their GEM code that were only
semi-correct. Unifying most of that made the memory management more
readable. Not giving drivers back the option of tinkering with this
might be preferable. The respective hooks in struct drm_driver,
prime_fd_to_handle and prime_handle_to_fd, are only there for vmwgfx.

If you want to hook into prime import and export, there are
drm_driver.gem_prime_import and drm_gem_object_funcs.export. Isn't
it possible to move the additional code behind these pointers?

I'm not trying to hook into these functions, I'm just trying to call
them. I'm not bringing back the .prime_handle_to_fd and
.prime_fd_to_handle hooks in struct drm_driver. I need a clean way to
export and import DMAbuffers from a kernel mode context. I had
incorrect or semi-correct ways of doing that by calling some
driver-internal functions, but then my DMABufs aren't correctly
linked with GEM handles in DRM and move notifiers in amdgpu aren't
working correctly.

I understand that. But why don't you use drm_driver.gem_prime_import
and drm_gem_object_funcs.export to add the amdkfd-specific code? These
callbacks are being invoked from within drm_gem_prime_fd_to_handle() and
drm_gem_prime_handle_to_fd() as part of the importing and exporting
logic. With the intention of doing driver-specific things. Hence you
should not have to re-export the internal drm_gem_prime_*_to_*() helpers.

My question is if the existing hooks are not suitable for your needs.
If so, how could we improve them?

No no. You don't seem to understand the use case :) Felix doesn't try to
implement any driver-specific things.

I meant that I understand that this patchset is not about setting
drm_driver.prime_handle_to_fd, et al.


What Felix tries to do is to export a DMA-buf handle from kernel space.

For example, looking at patch 2, it converts a GEM handle to a file
descriptor and then assigns the respective dmabuf to mem, which is of type
struct kgd_mem. From my impression, this could be done within the
existing ->export hook.


That would skip the call to export_and_register_object. I think that's 
what I'm currently missing to set up gem_obj->dmabuf.
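
To make the distinction concrete, here is a small userspace model of the difference between calling only the export callback and going through an export-and-register step. It is a sketch under my reading of the helper, not the drm_prime.c code: struct fake_gem_obj, struct fake_dmabuf, fake_export(), fake_export_and_register() and fake_handle_to_dmabuf() are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dma_buf. */
struct fake_dmabuf {
	int refcount;
};

/* Hypothetical stand-in for a GEM object with a cached export. */
struct fake_gem_obj {
	struct fake_dmabuf *dma_buf;	/* NULL until the first registered export */
	struct fake_dmabuf storage;	/* backing storage for the model */
	int export_calls;
};

/* Models a driver ->export callback: creates the buffer, registers nothing. */
static struct fake_dmabuf *fake_export(struct fake_gem_obj *obj)
{
	obj->export_calls++;
	obj->storage.refcount = 1;
	return &obj->storage;
}

/* Models export-and-register: export, then cache it on the object. */
static struct fake_dmabuf *fake_export_and_register(struct fake_gem_obj *obj)
{
	struct fake_dmabuf *buf = fake_export(obj);

	obj->dma_buf = buf;
	buf->refcount++;	/* reference held by the object's cache */
	return buf;
}

/* Models the helper path: reuse the cached buffer when there is one. */
static struct fake_dmabuf *fake_handle_to_dmabuf(struct fake_gem_obj *obj)
{
	if (obj->dma_buf) {
		obj->dma_buf->refcount++;
		return obj->dma_buf;
	}
	return fake_export_and_register(obj);
}
```

In this model, a caller that invokes fake_export() directly gets a buffer but leaves obj->dma_buf unset, which is the situation described above.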


Regards,
  Felix




Then there's close_fd(), which cannot go into ->export. It looks like
the fd isn't really required.  Could the drm_prime_handle_to_fd() be
reworked into a helper that converts the handle to the dmabuf without
the fd?  Something like drm_gem_prime_handle_to_dmabuf(), which would
then be exported?

And I have the same question wrt the 3rd patch; it's just that it's about importing.

(Looking further through the code, it appears that the fd could be
removed from the helpers, the callbacks and vmwgfx. It would then be
used entirely in the ioctl entry points, such as
drm_prime_fd_to_handle_ioctl().)

Best regards
Thomas



Regards,
Christian.


Best regards
Thomas



Regards,
    Felix



Best regards
Thomas


CC: Christian König 
CC: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/drm_prime.c | 33 ++---
   include/drm/drm_prime.h |  7 +++
   2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf
*dma_buf)
   }
   EXPORT_SYMBOL(drm_gem_dmabuf_release);
   -/*
+/**
    * drm_gem_prime_fd_to_handle - PRIME import function for GEM
drivers
    * @dev: drm_device to import into
    * @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
    *
    * Returns 0 on success or a negative error code on failure.
    */
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
-  struct drm_file *file_priv, int prime_fd,
-  uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+   struct drm_file *file_priv, int prime_fd,
+   uint32_t *handle)
   {
   struct dma_buf *dma_buf;
   struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime

Re: [PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-11-20 Thread Felix Kuehling


On 2023-11-20 6:54, Thomas Zimmermann wrote:

Hi

Am 17.11.23 um 22:44 schrieb Felix Kuehling:

This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.


I'm unhappy to see these functions making a comeback. They are the 
boiler-plate logic that all drivers should use. Historically, drivers 
did a lot of things in their GEM code that were only semi-correct. 
Unifying most of that made the memory management more readable. Not 
giving drivers back the option of tinkering with this might be 
preferable. The respective hooks in struct drm_driver, prime_fd_to_handle 
and prime_handle_to_fd, are only there for vmwgfx.


If you want to hook into prime import and export, there are 
drm_driver.gem_prime_import and drm_gem_object_funcs.export. Isn't it 
possible to move the additional code behind these pointers?


I'm not trying to hook into these functions, I'm just trying to call 
them. I'm not bringing back the .prime_handle_to_fd and 
.prime_fd_to_handle hooks in struct drm_driver. I need a clean way to 
export and import DMAbuffers from a kernel mode context. I had incorrect 
or semi-correct ways of doing that by calling some driver-internal 
functions, but then my DMABufs aren't correctly linked with GEM handles 
in DRM and move notifiers in amdgpu aren't working correctly.


Regards,
  Felix




Best regards
Thomas



CC: Christian König 
CC: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/drm_prime.c | 33 ++---
  include/drm/drm_prime.h |  7 +++
  2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf)
  }
  EXPORT_SYMBOL(drm_gem_dmabuf_release);
  -/*
+/**
   * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
   * @dev: drm_device to import into
   * @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
   *
   * Returns 0 on success or a negative error code on failure.
   */
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
-  struct drm_file *file_priv, int prime_fd,
-  uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+   struct drm_file *file_priv, int prime_fd,
+   uint32_t *handle)
  {
  struct dma_buf *dma_buf;
  struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct 
drm_device *dev,

  dma_buf_put(dma_buf);
  return ret;
  }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);
    int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,
   struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf 
*export_and_register_object(struct drm_device *dev,

  return dmabuf;
  }
  -/*
+/**
   * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
   * @dev: dev to export the buffer from
   * @file_priv: drm file-private structure
@@ -421,10 +422,10 @@ static struct dma_buf 
*export_and_register_object(struct drm_device *dev,
   * The actual exporting from GEM object to a dma-buf is done 
through the

   * &drm_gem_object_funcs.export callback.
   */
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
-  struct drm_file *file_priv, uint32_t handle,
-  uint32_t flags,
-  int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+   struct drm_file *file_priv, uint32_t handle,
+   uint32_t flags,
+   int *prime_fd)
  {
  struct drm_gem_object *obj;
  int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct 
drm_device *dev,

    return ret;
  }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);
    int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
   struct drm_file *file_priv)
@@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size);
   * @obj: GEM object to export
   * @flags: flags like DRM_CLOEXEC and DRM_RDWR
   *
- * This is the implementation of the &drm_gem_object_funcs.export 
functions
- * for GEM drivers using the PRIME helpers. It is used as the 
default for

- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export 
functions for GEM drivers

+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
   */
  struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,

[PATCH 3/3] drm/amdkfd: Import DMABufs for interop through DRM

2023-11-17 Thread Felix Kuehling

Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This
ensures that a GEM handle is created on import and that obj->dma_buf
will be set and remain set as long as the object is imported into KFD.

Signed-off-by: Felix Kuehling 
Reviewed-by: Ramesh Errabolu 
Reviewed-by: Xiaogang.Chen 
Acked-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  9 ++-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 64 +--
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 15 ++---
 3 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index c1195eb67057..8da42e0dddcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -319,11 +319,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info,
struct dma_fence **ef);
 int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
  struct kfd_vm_fault_info *info);
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dmabuf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset);
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset);
 int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem,
  struct dma_buf **dmabuf);
 void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index b13d68b7bb28..966272e067b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1957,8 +1957,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
 
/* Free the BO*/
drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv);
-   if (!mem->is_imported)
-   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
+   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
if (mem->dmabuf) {
dma_buf_put(mem->dmabuf);
mem->dmabuf = NULL;
@@ -2314,34 +2313,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
return 0;
 }
 
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dma_buf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset)
+static int import_obj_create(struct amdgpu_device *adev,
+struct dma_buf *dma_buf,
+struct drm_gem_object *obj,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
 {
struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv);
-   struct drm_gem_object *obj;
struct amdgpu_bo *bo;
int ret;
 
-   obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf);
-   if (IS_ERR(obj))
-   return PTR_ERR(obj);
-
bo = gem_to_amdgpu_bo(obj);
if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM |
-   AMDGPU_GEM_DOMAIN_GTT))) {
+   AMDGPU_GEM_DOMAIN_GTT)))
/* Only VRAM and GTT BOs are supported */
-   ret = -EINVAL;
-   goto err_put_obj;
-   }
+   return -EINVAL;
 
*mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL);
-   if (!*mem) {
-   ret = -ENOMEM;
-   goto err_put_obj;
-   }
+   if (!*mem)
+   return -ENOMEM;
 
ret = drm_vma_node_allow(&obj->vma_node, drm_priv);
if (ret)
@@ -2391,8 +2382,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
drm_vma_node_revoke(&obj->vma_node, drm_priv);
 err_free_mem:
kfree(*mem);
+   return ret;
+}
+
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
+{
+   struct drm_gem_object *ob

[PATCH 2/3] drm/amdkfd: Export DMABufs from KFD using GEM handles

2023-11-17 Thread Felix Kuehling

Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM
handles are created in a drm_client_dev context to avoid exposing them
in user mode contexts through a DMABuf import.

Signed-off-by: Felix Kuehling 
Reviewed-by: Ramesh Errabolu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  5 +++
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 33 +++
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index b8412202a1b0..aa8b24079070 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
int i;
int last_valid_bit;
+   int ret;
 
amdgpu_amdkfd_gpuvm_init_mem_limits();
 
@@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
.enable_mes = adev->enable_mes,
};
 
+   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL);
+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n", ret);
+   return;
+   }
+
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
&gpu_resources);
+   if (adev->kfd.init_complete)
+   drm_client_register(&adev->kfd.client);
+   else
+   drm_client_release(&adev->kfd.client);
 
amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index dac983da961d..c1195eb67057 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "amdgpu_sync.h"
 #include "amdgpu_vm.h"
 #include "amdgpu_xcp.h"
@@ -83,6 +84,7 @@ struct kgd_mem {
 
struct amdgpu_sync sync;
 
+   uint32_t gem_handle;
bool aql_queue;
bool is_imported;
 };
@@ -105,6 +107,9 @@ struct amdgpu_kfd_dev {
 
/* HMM page migration MEMORY_DEVICE_PRIVATE mapping */
struct dev_pagemap pgmap;
+
+   /* Client for KFD BO GEM handle allocations */
+   struct drm_client_dev client;
 };
 
 enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 41fbc4fd0fac..b13d68b7bb28 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -806,13 +807,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem,
 static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
 {
if (!mem->dmabuf) {
-   struct dma_buf *ret = amdgpu_gem_prime_export(
-   &mem->bo->tbo.base,
+   struct amdgpu_device *bo_adev;
+   struct dma_buf *dmabuf;
+   int r, fd;
+
+   bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
+   r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file,
+  mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-   DRM_RDWR : 0);
-   if (IS_ERR(ret))
-   return PTR_ERR(ret);
-   mem->dmabuf = ret;
+  DRM_RDWR : 0, &fd);
+   if (r)
+   return r;
+   dmabuf = dma_buf_get(fd);
+   close_fd(fd);
+   if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+   return PTR_ERR(dmabuf);
+   mem->dmabuf = dmabuf;
}
 
return 0;
@@ -1779,6 +1789,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
pr_debug("Failed to allow vma node access. ret %d\n", ret);
goto err_node_allow;
}
+   ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle);
+   if (ret)
+   goto err_gem_handle_create;
bo = gem_to_amdgpu_bo(gobj);
if (bo_type == ttm_bo_type_sg) {
bo->tbo.sg = sg;
@@ -1830,6 +1843,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_g

[PATCH 1/3] Revert "drm/prime: Unexport helpers for fd/handle conversion"

2023-11-17 Thread Felix Kuehling

This reverts commit 71a7974ac7019afeec105a54447ae1dc7216cbb3.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

CC: Christian König 
CC: Thomas Zimmermann 
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/drm_prime.c | 33 ++---
 include/drm/drm_prime.h |  7 +++
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 63b709a67471..834a5e28abbe 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -278,7 +278,7 @@ void drm_gem_dmabuf_release(struct dma_buf *dma_buf)
 }
 EXPORT_SYMBOL(drm_gem_dmabuf_release);
 
-/*
+/**
  * drm_gem_prime_fd_to_handle - PRIME import function for GEM drivers
  * @dev: drm_device to import into
  * @file_priv: drm file-private structure
@@ -292,9 +292,9 @@ EXPORT_SYMBOL(drm_gem_dmabuf_release);
  *
  * Returns 0 on success or a negative error code on failure.
  */
-static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
- struct drm_file *file_priv, int prime_fd,
- uint32_t *handle)
+int drm_gem_prime_fd_to_handle(struct drm_device *dev,
+  struct drm_file *file_priv, int prime_fd,
+  uint32_t *handle)
 {
struct dma_buf *dma_buf;
struct drm_gem_object *obj;
@@ -360,6 +360,7 @@ static int drm_gem_prime_fd_to_handle(struct drm_device *dev,
dma_buf_put(dma_buf);
return ret;
 }
+EXPORT_SYMBOL(drm_gem_prime_fd_to_handle);
 
 int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -408,7 +409,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
return dmabuf;
 }
 
-/*
+/**
  * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
  * @dev: dev to export the buffer from
  * @file_priv: drm file-private structure
@@ -421,10 +422,10 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev,
  * The actual exporting from GEM object to a dma-buf is done through the
  * &drm_gem_object_funcs.export callback.
  */
-static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
- struct drm_file *file_priv, uint32_t handle,
- uint32_t flags,
- int *prime_fd)
+int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+  struct drm_file *file_priv, uint32_t handle,
+  uint32_t flags,
+  int *prime_fd)
 {
struct drm_gem_object *obj;
int ret = 0;
@@ -506,6 +507,7 @@ static int drm_gem_prime_handle_to_fd(struct drm_device *dev,
 
return ret;
 }
+EXPORT_SYMBOL(drm_gem_prime_handle_to_fd);
 
 int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data,
 struct drm_file *file_priv)
@@ -864,9 +866,9 @@ EXPORT_SYMBOL(drm_prime_get_contiguous_size);
  * @obj: GEM object to export
  * @flags: flags like DRM_CLOEXEC and DRM_RDWR
  *
- * This is the implementation of the &drm_gem_object_funcs.export functions
- * for GEM drivers using the PRIME helpers. It is used as the default for
- * drivers that do not set their own.
+ * This is the implementation of the &drm_gem_object_funcs.export functions for GEM drivers
+ * using the PRIME helpers. It is used as the default in
+ * drm_gem_prime_handle_to_fd().
  */
 struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
 int flags)
@@ -962,9 +964,10 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
  * @dev: drm_device to import into
  * @dma_buf: dma-buf object to import
  *
- * This is the implementation of the gem_prime_import functions for GEM
- * drivers using the PRIME helpers. It is the default for drivers that do
- * not set their own &drm_driver.gem_prime_import.
+ * This is the implementation of the gem_prime_import functions for GEM drivers
+ * using the PRIME helpers. Drivers can use this as their
+ * &drm_driver.gem_prime_import implementation. It is used as the default
+ * implementation in drm_gem_prime_fd_to_handle().
  *
  * Drivers must arrange to call drm_prime_gem_destroy() from their
  * &drm_gem_object_funcs.free hook when using this function.
diff --git a/include/drm/drm_prime.h b/include/drm/drm_prime.h
index a7abf9f3e697..2a1d01e5b56b 100644
--- a/include/drm/drm_prime.h
+++ b/include/drm/drm_prime.h
@@ -60,12 +60,19 @@ enum dma_data_direction;
 
 struct drm_device;
 struct drm_gem_object;
+struct drm_file;
 
 /* core prime functions */
 struct dma_buf *drm_gem_dmabuf_expor

Re: [PATCH 4/6] drm/amdkfd: Export DMABufs from KFD using GEM handles

2023-11-16 Thread Felix Kuehling




On 2023-11-07 11:58, Felix Kuehling wrote:

Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM
handles are created in a drm_client_dev context to avoid exposing them
in user mode contexts through a DMABuf import.

This patch (and the next one) won't apply upstream because Thomas 
Zimmermann just made drm_gem_prime_fd_to_handle and 
drm_gem_prime_handle_to_fd private, since nobody was using them 
(commit "drm/prime: Unexport helpers for fd/handle conversion").


Is it OK to export those APIs again? Or is there a better way for 
drivers to export/import DMABufs without using the GEM ioctls?


Regards,
  Felix




Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  5 +++
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 33 +++
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  4 +--
  4 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 6ab17330a6ed..b0a67f16540a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
  {
int i;
int last_valid_bit;
+   int ret;
  
  	amdgpu_amdkfd_gpuvm_init_mem_limits();
  
@@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)

.enable_mes = adev->enable_mes,
};
  
+		ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL);

+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n", ret);
+   return;
+   }
+
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
  
  		adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,

&gpu_resources);
+   if (adev->kfd.init_complete)
+   drm_client_register(&adev->kfd.client);
+   else
+   drm_client_release(&adev->kfd.client);
  
  		amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h

index 68d534a89942..4caf8cece028 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -32,6 +32,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include "amdgpu_sync.h"
  #include "amdgpu_vm.h"
@@ -84,6 +85,7 @@ struct kgd_mem {
  
  	struct amdgpu_sync sync;
  
+	uint32_t gem_handle;

bool aql_queue;
bool is_imported;
  };
@@ -106,6 +108,9 @@ struct amdgpu_kfd_dev {
  
  	/* HMM page migration MEMORY_DEVICE_PRIVATE mapping */

struct dev_pagemap pgmap;
+
+   /* Client for KFD BO GEM handle allocations */
+   struct drm_client_dev client;
  };
  
  enum kgd_engine_type {

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 0c1cb6048259..4bb8b5fd7598 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,6 +25,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #include "amdgpu_object.h"

@@ -804,13 +805,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem,
  static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
  {
if (!mem->dmabuf) {
-   struct dma_buf *ret = amdgpu_gem_prime_export(
-   &mem->bo->tbo.base,
+   struct amdgpu_device *bo_adev;
+   struct dma_buf *dmabuf;
+   int r, fd;
+
+   bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
+   r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file,
+  mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-   DRM_RDWR : 0);
-   if (IS_ERR(ret))
-   return PTR_ERR(ret);
-   mem->dmabuf = ret;
+  DRM_RDWR : 0, &fd);
+   if (r)
+   return r;
+   dmabuf = dma_buf_get(fd);
+   close_fd(fd);
+   if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+   return PTR_ERR(dmabuf);
+   mem->dmabuf = dmabuf;
}
  
  	return 0;

@@ -1826,6 +1836,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(

Re: [Patch v2] drm/ttm: Schedule delayed_delete worker closer

2023-11-08 Thread Felix Kuehling


On 2023-11-08 09:49, Christian König wrote:

Am 08.11.23 um 13:58 schrieb Rajneesh Bhardwaj:

Try to allocate system memory on the NUMA node the device is closest to
and try to run delayed_delete workers on a CPU of this node as well.

To optimize the memory clearing operation when a TTM BO gets freed by
the delayed_delete worker, scheduling it closer to a NUMA node where the
memory was initially allocated helps avoid the cases where the worker
gets randomly scheduled on the CPU cores that are across interconnect
boundaries such as xGMI, PCIe etc.

This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
APU platforms such as GFXIP9.4.3.

Acked-by: Felix Kuehling 
Signed-off-by: Rajneesh Bhardwaj 


Reviewed-by: Christian König 

Going to push this to drm-misc-next.


Hold on. Rajneesh just pointed out a WARN regression from testing. I 
think the problem is that the bdev->wq is not unbound.


Regards,
  Felix




Thanks,
Christian.


---

Changes in v2:
  - Absorbed the feedback provided by Christian in the commit message and
    the comment.

  drivers/gpu/drm/ttm/ttm_bo.c | 8 +++-
  drivers/gpu/drm/ttm/ttm_device.c | 3 ++-
  2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 5757b9415e37..6f28a77a565b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -370,7 +370,13 @@ static void ttm_bo_release(struct kref *kref)
  spin_unlock(&bo->bdev->lru_lock);
    INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
-    queue_work(bdev->wq, &bo->delayed_delete);
+
+    /* Schedule the worker on the closest NUMA node. This
+ * improves performance since system memory might be
+ * cleared on free and that is best done on a CPU core
+ * close to it.
+ */
+    queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);

  return;
  }
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 43e27ab77f95..72b81a2ee6c7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
  bdev->funcs = funcs;
    ttm_sys_man_init(bdev);
-    ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
+
+    ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
    bdev->vma_manager = vma_manager;
  spin_lock_init(&bdev->lru_lock);

Re: [PATCH] drm/ttm: Schedule delayed_delete worker closer

2023-11-07 Thread Felix Kuehling


On 2023-11-07 14:45, Rajneesh Bhardwaj wrote:

When a TTM BO is getting freed, to optimize the clearing operation on
the workqueue, schedule it closer to a NUMA node where the memory was
allocated. This avoids the cases where the ttm_bo_delayed_delete gets
scheduled on the CPU cores that are across interconnect boundaries such
as xGMI, PCIe etc.

This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
APU platforms such as GFXIP9.4.3.

Signed-off-by: Rajneesh Bhardwaj 


Acked-by: Felix Kuehling 



---
  drivers/gpu/drm/ttm/ttm_bo.c | 10 +-
  drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
  2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 5757b9415e37..0d608441a112 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
spin_unlock(&bo->bdev->lru_lock);
  
  			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);

-   queue_work(bdev->wq, &bo->delayed_delete);
+   /* Schedule the worker on the closest NUMA node, if no
+* CPUs are available, this falls back to any CPU core
+* available system wide. This helps avoid the
+* bottleneck to clear memory in cases where the worker
+* is scheduled on a CPU which is remote to the node
+* where the memory is getting freed.
+*/
+
+   queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
return;
}
  
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c

index 43e27ab77f95..72b81a2ee6c7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
bdev->funcs = funcs;
  
  	ttm_sys_man_init(bdev);

-   ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
+
+   ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
  
  	bdev->vma_manager = vma_manager;

spin_lock_init(&bdev->lru_lock);

Re: [PATCH 03/11] drm/amdkfd: Improve amdgpu_vm_handle_moved

2023-11-01 Thread Felix Kuehling


On 2023-10-17 17:13, Felix Kuehling wrote:

Let amdgpu_vm_handle_moved update all BO VA mappings of BOs reserved by
the caller. This will be useful for handling extra BO VA mappings in
KFD VMs that are managed through the render node API.

Signed-off-by: Felix Kuehling 
Reviewed-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 22 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 19 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  3 ++-
  4 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 74769afaa33d..c8f2907ebd6f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1113,7 +1113,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
struct amdgpu_vm *vm = &fpriv->vm;
struct amdgpu_bo_list_entry *e;
struct amdgpu_bo_va *bo_va;
-   struct amdgpu_bo *bo;
unsigned int i;
int r;
  
@@ -1141,26 +1140,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)

return r;
}
  
-	amdgpu_bo_list_for_each_entry(e, p->bo_list) {

-   /* ignore duplicates */
-   bo = ttm_to_amdgpu_bo(e->tv.bo);
-   if (!bo)
-   continue;
-
-   bo_va = e->bo_va;
-   if (bo_va == NULL)
-   continue;
-
-   r = amdgpu_vm_bo_update(adev, bo_va, false);
-   if (r)
-   return r;
-
-   r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
-   if (r)
-   return r;
-   }


FYI, removing this loop seemed to cause PSDB failures, mostly in RADV 
tests. It may have been a glitch in the infrastructure, but the failures 
were consistent on the three subsequent patches, too. Restoring this 
loop seems to make the tests pass, so I'll do that for now. I don't have 
time to debug CS with RADV, and this change is not needed for my ROCm work.


Regards,
  Felix



-
-   r = amdgpu_vm_handle_moved(adev, vm);
+   r = amdgpu_vm_handle_moved(adev, vm, &p->ticket);
if (r)
return r;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c

index b5e28fa3f414..e7e87a3b2601 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -409,7 +409,7 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
if (!r)
r = amdgpu_vm_clear_freed(adev, vm, NULL);
if (!r)
-   r = amdgpu_vm_handle_moved(adev, vm);
+   r = amdgpu_vm_handle_moved(adev, vm, ticket);
  
  		if (r && r != -EBUSY)

DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index d72daf15662f..c586d0e93d75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1285,6 +1285,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
   *
   * @adev: amdgpu_device pointer
   * @vm: requested vm
+ * @ticket: optional reservation ticket used to reserve the VM
   *
   * Make sure all BOs which are moved are updated in the PTs.
   *
@@ -1294,11 +1295,12 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
   * PTs have to be reserved!
   */
  int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
-  struct amdgpu_vm *vm)
+  struct amdgpu_vm *vm,
+  struct ww_acquire_ctx *ticket)
  {
struct amdgpu_bo_va *bo_va;
struct dma_resv *resv;
-   bool clear;
+   bool clear, unlock;
int r;
  
  	spin_lock(&vm->status_lock);

@@ -1321,17 +1323,24 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
spin_unlock(&vm->status_lock);
  
  		/* Try to reserve the BO to avoid clearing its ptes */

-   if (!adev->debug_vm && dma_resv_trylock(resv))
+   if (!adev->debug_vm && dma_resv_trylock(resv)) {
clear = false;
+   unlock = true;
+   /* The caller is already holding the reservation lock */
+   } else if (ticket && dma_resv_locking_ctx(resv) == ticket) {
+   clear = false;
+   unlock = false;
/* Somebody else is using the BO right now */
-   else
+   } else {
clear = true;
+   unlock = false;
+   }
  
  		r = amdgpu_vm_

[PATCH 11/11] drm/amdkfd: Bump KFD ioctl version

2023-10-17 Thread Felix Kuehling

This is not strictly a change in the IOCTL API. This version bump is meant
to indicate to user mode the presence of a number of changes and fixes
that enable the management of VA mappings in compute VMs using the GEM_VA
ioctl for DMABufs exported from KFD.

Signed-off-by: Felix Kuehling 
---
 include/uapi/linux/kfd_ioctl.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index f0ed68974c54..9ce46edc62a5 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -40,9 +40,10 @@
  * - 1.12 - Add DMA buf export ioctl
  * - 1.13 - Add debugger API
  * - 1.14 - Update kfd_event_data
+ * - 1.15 - Enable managing mappings in compute VMs with GEM_VA ioctl
  */
 #define KFD_IOCTL_MAJOR_VERSION 1
-#define KFD_IOCTL_MINOR_VERSION 14
+#define KFD_IOCTL_MINOR_VERSION 15
 
 struct kfd_ioctl_get_version_args {
__u32 major_version;/* from KFD */
-- 
2.34.1
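
As the commit message above says, the bump to 1.15 is only a signal to user mode. A minimal sketch of how a user-mode runtime might gate on it, assuming the hypothetical names `struct kfd_version`, `kfd_version_at_least()` and `kfd_has_gem_va_mappings()`; in a real program the two numbers would come from the AMDKFD_IOC_GET_VERSION ioctl, which fills struct kfd_ioctl_get_version_args. Treating any later major version as compatible is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the major/minor pair reported by KFD. */
struct kfd_version {
	uint32_t major;
	uint32_t minor;
};

/* True if the reported version is at least want_major.want_minor. */
static bool kfd_version_at_least(struct kfd_version v,
				 uint32_t want_major, uint32_t want_minor)
{
	if (v.major != want_major)
		return v.major > want_major;
	return v.minor >= want_minor;
}

/* 1.15 is the version this patch assigns to GEM_VA-managed mappings. */
static bool kfd_has_gem_va_mappings(struct kfd_version v)
{
	return kfd_version_at_least(v, 1, 15);
}
```

A caller would fall back to the KFD-only mapping path whenever `kfd_has_gem_va_mappings()` returns false.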

[PATCH 08/11] drm/amdgpu: Auto-validate DMABuf imports in compute VMs

2023-10-17 Thread Felix Kuehling

DMABuf imports in compute VMs are not wrapped in a kgd_mem object on the
process_info->kfd_bo_list. There is no explicit KFD API call to validate
them or add eviction fences to them.

This patch automatically validates and fences dynamic DMABuf imports when
they are added to a compute VM. Revalidation after evictions is handled
in the VM code.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|   3 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  15 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c   |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  26 
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 117 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|   6 +-
 7 files changed, 164 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index fcf8a98ad15e..68d534a89942 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -178,6 +178,9 @@ int amdgpu_queue_mask_bit_to_set_resource_bit(struct amdgpu_device *adev,
 struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
struct mm_struct *mm,
struct svm_range_bo *svm_bo);
+int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo,
+   uint32_t domain,
+   struct dma_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 int kfd_debugfs_kfd_mem_limits(struct seq_file *m, void *data);
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 2e302956a279..0c1cb6048259 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -423,9 +423,9 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
return ret;
 }
 
-static int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo,
-  uint32_t domain,
-  struct dma_fence *fence)
+int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo,
+   uint32_t domain,
+   struct dma_fence *fence)
 {
int ret = amdgpu_bo_reserve(bo, false);
 
@@ -2948,7 +2948,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
struct amdgpu_device *adev = amdgpu_ttm_adev(
peer_vm->root.bo->tbo.bdev);
 
-   ret = amdgpu_vm_handle_moved(adev, peer_vm, &ctx.ticket);
+   ret = amdgpu_vm_handle_moved(adev, peer_vm, &ctx.ticket, true);
if (ret) {
pr_debug("Memory eviction: handle moved failed. Try again\n");
goto validate_map_fail;
@@ -3001,7 +3001,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
   &process_info->eviction_fence->base,
   DMA_RESV_USAGE_BOOKKEEP);
}
-   /* Attach eviction fence to PD / PT BOs */
+   /* Attach eviction fence to PD / PT BOs and DMABuf imports */
list_for_each_entry(peer_vm, &process_info->vm_list_head,
vm_list_node) {
struct amdgpu_bo *bo = peer_vm->root.bo;
@@ -3009,6 +3009,11 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
dma_resv_add_fence(bo->tbo.base.resv,
   &process_info->eviction_fence->base,
   DMA_RESV_USAGE_BOOKKEEP);
+
+   ret = amdgpu_vm_fence_imports(peer_vm, &ctx.ticket,
+ &process_info->eviction_fence->base);
+   if (ret)
+   break;
}
 
 validate_map_fail:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index c8f2907ebd6f..e6dcd17c3c8e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1140,7 +1140,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
return r;
}
 
-   r = amdgpu_vm_handle_moved(adev, vm, &p->ticket);
+   r = amdgpu_vm_handle_moved(adev, vm, &p->ticket, false);
if (r)
return r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index e7e87a3b2601..234244704f27 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -373,6 +373,10 @@ amdgpu_dma_buf_move_notify(stru

[PATCH 03/11] drm/amdkfd: Improve amdgpu_vm_handle_moved

2023-10-17 Thread Felix Kuehling

Let amdgpu_vm_handle_moved update all BO VA mappings of BOs reserved by
the caller. This will be useful for handling extra BO VA mappings in
KFD VMs that are managed through the render node API.

Signed-off-by: Felix Kuehling 
Reviewed-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 22 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 19 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  3 ++-
 4 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 74769afaa33d..c8f2907ebd6f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1113,7 +1113,6 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
struct amdgpu_vm *vm = &fpriv->vm;
struct amdgpu_bo_list_entry *e;
struct amdgpu_bo_va *bo_va;
-   struct amdgpu_bo *bo;
unsigned int i;
int r;
 
@@ -1141,26 +1140,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
return r;
}
 
-   amdgpu_bo_list_for_each_entry(e, p->bo_list) {
-   /* ignore duplicates */
-   bo = ttm_to_amdgpu_bo(e->tv.bo);
-   if (!bo)
-   continue;
-
-   bo_va = e->bo_va;
-   if (bo_va == NULL)
-   continue;
-
-   r = amdgpu_vm_bo_update(adev, bo_va, false);
-   if (r)
-   return r;
-
-   r = amdgpu_sync_fence(&p->sync, bo_va->last_pt_update);
-   if (r)
-   return r;
-   }
-
-   r = amdgpu_vm_handle_moved(adev, vm);
+   r = amdgpu_vm_handle_moved(adev, vm, &p->ticket);
if (r)
return r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index b5e28fa3f414..e7e87a3b2601 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -409,7 +409,7 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
if (!r)
r = amdgpu_vm_clear_freed(adev, vm, NULL);
if (!r)
-   r = amdgpu_vm_handle_moved(adev, vm);
+   r = amdgpu_vm_handle_moved(adev, vm, ticket);
 
if (r && r != -EBUSY)
DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index d72daf15662f..c586d0e93d75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1285,6 +1285,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
  *
  * @adev: amdgpu_device pointer
  * @vm: requested vm
+ * @ticket: optional reservation ticket used to reserve the VM
  *
  * Make sure all BOs which are moved are updated in the PTs.
  *
@@ -1294,11 +1295,12 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
  * PTs have to be reserved!
  */
 int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
-  struct amdgpu_vm *vm)
+  struct amdgpu_vm *vm,
+  struct ww_acquire_ctx *ticket)
 {
struct amdgpu_bo_va *bo_va;
struct dma_resv *resv;
-   bool clear;
+   bool clear, unlock;
int r;
 
spin_lock(&vm->status_lock);
@@ -1321,17 +1323,24 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
spin_unlock(&vm->status_lock);
 
/* Try to reserve the BO to avoid clearing its ptes */
-   if (!adev->debug_vm && dma_resv_trylock(resv))
+   if (!adev->debug_vm && dma_resv_trylock(resv)) {
clear = false;
+   unlock = true;
+   /* The caller is already holding the reservation lock */
+   } else if (ticket && dma_resv_locking_ctx(resv) == ticket) {
+   clear = false;
+   unlock = false;
/* Somebody else is using the BO right now */
-   else
+   } else {
clear = true;
+   unlock = false;
+   }
 
r = amdgpu_vm_bo_update(adev, bo_va, clear);
if (r)
return r;
 
-   if (!clear)
+   if (unlock)
dma_resv_unlock(resv);
spin_lock(&vm->status_lock);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 6e71978db13f..ebcc75132b74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_
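
The three-way locking decision that the patch above adds to amdgpu_vm_handle_moved can be modelled outside the kernel as a pure function. This is a minimal userspace sketch, not kernel code: `struct lock_decision` and `decide()` are hypothetical names, and the three booleans stand in for adev->debug_vm, the result of dma_resv_trylock(resv), and the `ticket && dma_resv_locking_ctx(resv) == ticket` test.

```c
#include <stdbool.h>

/* Outcome of the reservation check for one moved BO (hypothetical model). */
struct lock_decision {
	bool clear;   /* clear the PTEs instead of updating them */
	bool unlock;  /* this function took the lock and must drop it */
};

static struct lock_decision decide(bool debug_vm, bool trylock_ok,
				   bool held_by_caller)
{
	struct lock_decision d;

	if (!debug_vm && trylock_ok) {
		/* We reserved the BO ourselves: update PTEs, then unlock. */
		d.clear = false;
		d.unlock = true;
	} else if (held_by_caller) {
		/* The caller already holds the reservation: do not unlock. */
		d.clear = false;
		d.unlock = false;
	} else {
		/* Somebody else is using the BO right now. */
		d.clear = true;
		d.unlock = false;
	}
	return d;
}
```

Before the patch, the unlock was keyed on `!clear`. With the new middle branch that would release a lock owned by the caller, which is why the diff introduces the separate `unlock` flag.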

[PATCH 10/11] drm/amdkfd: Import DMABufs for interop through DRM

2023-10-17 Thread Felix Kuehling

Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This
ensures that a GEM handle is created on import and that obj->dma_buf
will be set and remain set as long as the object is imported into KFD.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  9 ++-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 64 +--
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 15 ++---
 3 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 4caf8cece028..88a0e0734270 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -318,11 +318,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info,
struct dma_fence **ef);
 int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
  struct kfd_vm_fault_info *info);
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dmabuf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset);
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset);
 int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem,
  struct dma_buf **dmabuf);
 void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 4bb8b5fd7598..1077de8bced2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2006,8 +2006,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
 
/* Free the BO*/
drm_vma_node_revoke(&mem->bo->tbo.base.vma_node, drm_priv);
-   if (!mem->is_imported)
-   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
+   drm_gem_handle_delete(adev->kfd.client.file, mem->gem_handle);
if (mem->dmabuf) {
dma_buf_put(mem->dmabuf);
mem->dmabuf = NULL;
@@ -2363,34 +2362,26 @@ int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
return 0;
 }
 
-int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
- struct dma_buf *dma_buf,
- uint64_t va, void *drm_priv,
- struct kgd_mem **mem, uint64_t *size,
- uint64_t *mmap_offset)
+static int import_obj_create(struct amdgpu_device *adev,
+struct dma_buf *dma_buf,
+struct drm_gem_object *obj,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
 {
struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv);
-   struct drm_gem_object *obj;
struct amdgpu_bo *bo;
int ret;
 
-   obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf);
-   if (IS_ERR(obj))
-   return PTR_ERR(obj);
-
bo = gem_to_amdgpu_bo(obj);
if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM |
-   AMDGPU_GEM_DOMAIN_GTT))) {
+   AMDGPU_GEM_DOMAIN_GTT)))
/* Only VRAM and GTT BOs are supported */
-   ret = -EINVAL;
-   goto err_put_obj;
-   }
+   return -EINVAL;
 
*mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL);
-   if (!*mem) {
-   ret = -ENOMEM;
-   goto err_put_obj;
-   }
+   if (!*mem)
+   return -ENOMEM;
 
ret = drm_vma_node_allow(&obj->vma_node, drm_priv);
if (ret)
@@ -2440,8 +2431,41 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device *adev,
drm_vma_node_revoke(&obj->vma_node, drm_priv);
 err_free_mem:
kfree(*mem);
+   return ret;
+}
+
+int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct amdgpu_device *adev, int fd,
+uint64_t va, void *drm_priv,
+struct kgd_mem **mem, uint64_t *size,
+uint64_t *mmap_offset)
+{
+   struct drm_gem_object *obj;
+   uint32_t handle;
+   int ret;
+
+   ret = drm_gem

[PATCH 09/11] drm/amdkfd: Export DMABufs from KFD using GEM handles

2023-10-17 Thread Felix Kuehling

Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM
handles are created in a drm_client_dev context to avoid exposing them
in user mode contexts through a DMABuf import.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  5 +++
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 33 +++
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  4 +--
 4 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 6ab17330a6ed..b0a67f16540a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -142,6 +142,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
int i;
int last_valid_bit;
+   int ret;
 
amdgpu_amdkfd_gpuvm_init_mem_limits();
 
@@ -160,6 +161,12 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
.enable_mes = adev->enable_mes,
};
 
+   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL);
+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n", ret);
+   return;
+   }
+
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -198,6 +205,10 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
&gpu_resources);
+   if (adev->kfd.init_complete)
+   drm_client_register(&adev->kfd.client);
+   else
+   drm_client_release(&adev->kfd.client);
 
amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 68d534a89942..4caf8cece028 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "amdgpu_sync.h"
 #include "amdgpu_vm.h"
@@ -84,6 +85,7 @@ struct kgd_mem {
 
struct amdgpu_sync sync;
 
+   uint32_t gem_handle;
bool aql_queue;
bool is_imported;
 };
@@ -106,6 +108,9 @@ struct amdgpu_kfd_dev {
 
/* HMM page migration MEMORY_DEVICE_PRIVATE mapping */
struct dev_pagemap pgmap;
+
+   /* Client for KFD BO GEM handle allocations */
+   struct drm_client_dev client;
 };
 
 enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 0c1cb6048259..4bb8b5fd7598 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "amdgpu_object.h"
@@ -804,13 +805,22 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem,
 static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
 {
if (!mem->dmabuf) {
-   struct dma_buf *ret = amdgpu_gem_prime_export(
-   &mem->bo->tbo.base,
+   struct amdgpu_device *bo_adev;
+   struct dma_buf *dmabuf;
+   int r, fd;
+
+   bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
+   r = drm_gem_prime_handle_to_fd(&bo_adev->ddev, bo_adev->kfd.client.file,
+  mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-   DRM_RDWR : 0);
-   if (IS_ERR(ret))
-   return PTR_ERR(ret);
-   mem->dmabuf = ret;
+  DRM_RDWR : 0, &fd);
+   if (r)
+   return r;
+   dmabuf = dma_buf_get(fd);
+   close_fd(fd);
+   if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+   return PTR_ERR(dmabuf);
+   mem->dmabuf = dmabuf;
}
 
return 0;
@@ -1826,6 +1836,9 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
pr_debug("Failed to allow vma node access. ret %d\n", ret);
goto err_node_allow;
}
+   ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle);
+   if (ret)
+   goto err_gem_handle_create;
bo = gem_to_amdgpu_bo(gobj);
if (bo_type == ttm_bo_type_sg) {
bo->tbo.sg = sg;
@@ -1877,6 +1890,8 @@ int

[PATCH 07/11] drm/amdgpu: New VM state for evicted user BOs

2023-10-17 Thread Felix Kuehling

Create a new VM state to track user BOs that are in the system domain.
In the next patch this will be used to conditionally re-validate them in
amdgpu_vm_handle_moved.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  5 -
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3307c5765787..76a8a7fd3fde 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -232,6 +232,22 @@ static void amdgpu_vm_bo_invalidated(struct amdgpu_vm_bo_base *vm_bo)
spin_unlock(&vm_bo->vm->status_lock);
 }
 
+/**
+ * amdgpu_vm_bo_evicted_user - vm_bo is evicted
+ *
+ * @vm_bo: vm_bo which is evicted
+ *
+ * State for BOs used by user mode queues which are not at the location they
+ * should be.
+ */
+static void amdgpu_vm_bo_evicted_user(struct amdgpu_vm_bo_base *vm_bo)
+{
+   vm_bo->moved = true;
+   spin_lock(&vm_bo->vm->status_lock);
+   list_move(&vm_bo->vm_status, &vm_bo->vm->evicted_user);
+   spin_unlock(&vm_bo->vm->status_lock);
+}
+
 /**
  * amdgpu_vm_bo_relocated - vm_bo is reloacted
  *
@@ -2105,6 +2121,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, int32_t xcp
for (i = 0; i < AMDGPU_MAX_VMHUBS; i++)
vm->reserved_vmid[i] = NULL;
INIT_LIST_HEAD(&vm->evicted);
+   INIT_LIST_HEAD(&vm->evicted_user);
INIT_LIST_HEAD(&vm->relocated);
INIT_LIST_HEAD(&vm->moved);
INIT_LIST_HEAD(&vm->idle);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 577cdb6d1649..914e6753a6d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -280,9 +280,12 @@ struct amdgpu_vm {
/* Lock to protect vm_bo add/del/move on all lists of vm */
spinlock_t  status_lock;
 
-   /* BOs who needs a validation */
+   /* Per VM and PT BOs who needs a validation */
struct list_headevicted;
 
+   /* BOs for user mode queues that need a validation */
+   struct list_headevicted_user;
+
/* PT BOs which relocated and their parent need an update */
struct list_headrelocated;
 
-- 
2.34.1

[PATCH 06/11] drm/amdkfd: Move TLB flushing logic into amdgpu

2023-10-17 Thread Felix Kuehling

This will make it possible for amdgpu GEM ioctls to flush TLBs on compute
VMs.

This removes VMID-based TLB flushing and always uses PASID-based
flushing. This still works because it scans the VMID-PASID mapping
registers to find the right VMID. It's only slightly less efficient. This
is not a production use case.

Signed-off-by: Felix Kuehling 
Reviewed-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 29 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  5 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 44 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  5 +++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h  | 10 -
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 31 ---
 6 files changed, 57 insertions(+), 67 deletions(-)
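
The sequence check in amdgpu_vm_flush_compute_tlb() below can be looked at
in isolation. What follows is only a user-space sketch using C11 atomics,
not the kernel code: struct toy_vm and toy_needs_flush() are hypothetical
names, and atomic_exchange() stands in for the kernel's atomic64_xchg().
It shows why a lost race costs at most one redundant flush.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-VM "last flushed sequence" state. */
struct toy_vm {
    _Atomic uint64_t last_flushed_seq;
};

/*
 * Record tlb_seq as the most recently flushed sequence number and report
 * whether a flush is needed. The exchange is atomic, so two racing callers
 * with the same tlb_seq cannot both skip the flush; at worst one of them
 * flushes again, which is harmless.
 */
static bool toy_needs_flush(struct toy_vm *vm, uint64_t tlb_seq)
{
    return atomic_exchange(&vm->last_flushed_seq, tlb_seq) != tlb_seq;
}
```

A caller would flush only when toy_needs_flush() returns true, so repeated
calls with an unchanged sequence number become no-ops.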

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index b8412202a1b0..6ab17330a6ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -710,35 +710,6 @@ bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
return false;
 }
 
-int amdgpu_amdkfd_flush_gpu_tlb_vmid(struct amdgpu_device *adev,
-uint16_t vmid)
-{
-   if (adev->family == AMDGPU_FAMILY_AI) {
-   int i;
-
-   for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
-   amdgpu_gmc_flush_gpu_tlb(adev, vmid, i, 0);
-   } else {
-   amdgpu_gmc_flush_gpu_tlb(adev, vmid, AMDGPU_GFXHUB(0), 0);
-   }
-
-   return 0;
-}
-
-int amdgpu_amdkfd_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
- uint16_t pasid,
- enum TLB_FLUSH_TYPE flush_type,
- uint32_t inst)
-{
-   bool all_hub = false;
-
-   if (adev->family == AMDGPU_FAMILY_AI ||
-   adev->family == AMDGPU_FAMILY_RV)
-   all_hub = true;
-
-   return amdgpu_gmc_flush_gpu_tlb_pasid(adev, pasid, flush_type, all_hub, inst);
-}
-
 bool amdgpu_amdkfd_have_atomics_support(struct amdgpu_device *adev)
 {
return adev->have_atomics_support;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 3ad8dc523b42..fcf8a98ad15e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -163,11 +163,6 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
uint32_t *ib_cmd, uint32_t ib_len);
 void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle);
 bool amdgpu_amdkfd_have_atomics_support(struct amdgpu_device *adev);
-int amdgpu_amdkfd_flush_gpu_tlb_vmid(struct amdgpu_device *adev,
-   uint16_t vmid);
-int amdgpu_amdkfd_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
-   uint16_t pasid, enum TLB_FLUSH_TYPE flush_type,
-   uint32_t inst);
 
 bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index c586d0e93d75..3307c5765787 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1349,6 +1349,50 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
return 0;
 }
 
+/**
+ * amdgpu_vm_flush_compute_tlb - Flush TLB on compute VM
+ *
+ * @adev: amdgpu_device pointer
+ * @vm: requested vm
+ * @flush_type: flush type
+ *
+ * Flush TLB if needed for a compute VM.
+ *
+ * Returns:
+ * 0 for success.
+ */
+int amdgpu_vm_flush_compute_tlb(struct amdgpu_device *adev,
+   struct amdgpu_vm *vm,
+   uint32_t flush_type,
+   uint32_t xcc_mask)
+{
+   uint64_t tlb_seq = amdgpu_vm_tlb_seq(vm);
+   bool all_hub = false;
+   int xcc = 0, r = 0;
+
+   WARN_ON_ONCE(!vm->is_compute_context);
+
+   /*
+* It can be that we race and lose here, but that is extremely unlikely
+* and the worst thing which could happen is that we flush the changes
+* into the TLB once more which is harmless.
+*/
+   if (atomic64_xchg(&vm->kfd_last_flushed_seq, tlb_seq) == tlb_seq)
+   return 0;
+
+   if (adev->family == AMDGPU_FAMILY_AI ||
+   adev->family == AMDGPU_FAMILY_RV)
+   all_hub = true;
+
+   for_each_inst(xcc, xcc_mask) {
+   r = amdgpu_gmc_flush_gpu_tlb_pasid(adev, vm->pasid, flush_type,
+  all_hub, xcc);
+   if (r)
+   break;
+   }
+   return r;
+}
+
 /**
  * amdgpu_vm_bo_add - add a bo to a specific vm
  *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drive

[PATCH 05/11] drm/amdgpu: update mappings not managed by KFD

2023-10-17 Thread Felix Kuehling

When restoring after an eviction, use amdgpu_vm_handle_moved to update
BO VA mappings in KFD VMs that are not managed through the KFD API. This
should allow using the render node API to create more flexible memory
mappings in KFD VMs.

Signed-off-by: Felix Kuehling 
Acked-by: Christian König 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 28 +++
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7c29f6c377a8..2e302956a279 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2892,12 +2892,6 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
if (ret)
goto validate_map_fail;
 
-   ret = process_sync_pds_resv(process_info, &sync_obj);
-   if (ret) {
-   pr_debug("Memory eviction: Failed to sync to PD BO moving fence. Try again\n");
-   goto validate_map_fail;
-   }
-
/* Validate BOs and map them to GPUVM (update VM page tables). */
list_for_each_entry(mem, &process_info->kfd_bo_list,
validate_list.head) {
@@ -2948,6 +2942,19 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
if (failed_size)
pr_debug("0x%lx/0x%lx in system\n", failed_size, total_size);
 
+   /* Update mappings not managed by KFD */
+   list_for_each_entry(peer_vm, &process_info->vm_list_head,
+   vm_list_node) {
+   struct amdgpu_device *adev = amdgpu_ttm_adev(
+   peer_vm->root.bo->tbo.bdev);
+
+   ret = amdgpu_vm_handle_moved(adev, peer_vm, &ctx.ticket);
+   if (ret) {
+   pr_debug("Memory eviction: handle moved failed. Try again\n");
+   goto validate_map_fail;
+   }
+   }
+
/* Update page directories */
ret = process_update_pds(process_info, &sync_obj);
if (ret) {
@@ -2955,6 +2962,15 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
goto validate_map_fail;
}
 
+   /* Sync with fences on all the page tables. They implicitly depend on any
+* move fences from amdgpu_vm_handle_moved above.
+*/
+   ret = process_sync_pds_resv(process_info, &sync_obj);
+   if (ret) {
+   pr_debug("Memory eviction: Failed to sync to PD BO moving fence. Try again\n");
+   goto validate_map_fail;
+   }
+
/* Wait for validate and PT updates to finish */
amdgpu_sync_wait(&sync_obj, false);
 
-- 
2.34.1

[PATCH 02/11] drm/amdgpu: Reserve fences for VM update

2023-10-17 Thread Felix Kuehling

In amdgpu_dma_buf_move_notify reserve fences for the page table updates
in amdgpu_vm_clear_freed and amdgpu_vm_handle_moved. This fixes a BUG_ON
in dma_resv_add_fence when using SDMA for page table updates.

Signed-off-by: Felix Kuehling 
Reviewed-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)
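
The reserve-before-add contract behind this fix can be sketched with a toy
model. This is not the dma_resv implementation: struct toy_resv,
toy_reserve_fences(), toy_add_fence() and TOY_MAX_FENCES are hypothetical
names invented for illustration. The point is only that the add step has
no way to report failure, so room must be reserved (and that reservation
checked) beforehand.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_FENCES 8   /* arbitrary capacity for this sketch */

/* Hypothetical, simplified reservation object. */
struct toy_resv {
    unsigned int num_fences;   /* fences currently stored */
    unsigned int max_fences;   /* slots guaranteed by earlier reservations */
};

/* Guarantee room for 'count' more fences; can fail, so callers must check. */
static int toy_reserve_fences(struct toy_resv *r, unsigned int count)
{
    if (r->num_fences + count > TOY_MAX_FENCES)
        return -ENOMEM;
    if (r->num_fences + count > r->max_fences)
        r->max_fences = r->num_fences + count;
    return 0;
}

/*
 * Add one fence. Returns false in the situation where the real code would
 * trip a BUG_ON: no slot was reserved for this fence.
 */
static bool toy_add_fence(struct toy_resv *r)
{
    if (r->num_fences >= r->max_fences)
        return false;
    r->num_fences++;
    return true;
}
```

In this model, reserving two slots up front is what lets the two page table
updates each add their fence without hitting the failure path.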

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 76b618735dc0..b5e28fa3f414 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -404,7 +404,10 @@ amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
continue;
}
 
-   r = amdgpu_vm_clear_freed(adev, vm, NULL);
+   /* Reserve fences for two SDMA page table updates */
+   r = dma_resv_reserve_fences(resv, 2);
+   if (!r)
+   r = amdgpu_vm_clear_freed(adev, vm, NULL);
if (!r)
r = amdgpu_vm_handle_moved(adev, vm);
 
-- 
2.34.1

[PATCH 04/11] drm/amdgpu: Attach eviction fence on alloc

2023-10-17 Thread Felix Kuehling

Instead of attaching the eviction fence when a KFD BO is first mapped,
attach it when it is allocated or imported. This is in preparation for
allowing KFD BOs to be mapped using the render node API.

Signed-off-by: Felix Kuehling 
Acked-by: Christian König 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 79 +++
 1 file changed, 48 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 54f31a420229..7c29f6c377a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -423,6 +423,32 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
return ret;
 }
 
+static int amdgpu_amdkfd_bo_validate_and_fence(struct amdgpu_bo *bo,
+  uint32_t domain,
+  struct dma_fence *fence)
+{
+   int ret = amdgpu_bo_reserve(bo, false);
+
+   if (ret)
+   return ret;
+
+   ret = amdgpu_amdkfd_bo_validate(bo, domain, true);
+   if (ret)
+   goto unreserve_out;
+
+   ret = dma_resv_reserve_fences(bo->tbo.base.resv, 1);
+   if (ret)
+   goto unreserve_out;
+
+   dma_resv_add_fence(bo->tbo.base.resv, fence,
+  DMA_RESV_USAGE_BOOKKEEP);
+
+unreserve_out:
+   amdgpu_bo_unreserve(bo);
+
+   return ret;
+}
+
 static int amdgpu_amdkfd_validate_vm_bo(void *_unused, struct amdgpu_bo *bo)
 {
return amdgpu_amdkfd_bo_validate(bo, bo->allowed_domains, false);
@@ -1831,6 +1857,15 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
}
bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT;
bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT;
+   } else {
+   mutex_lock(&avm->process_info->lock);
+   if (avm->process_info->eviction_fence &&
+   !dma_fence_is_signaled(&avm->process_info->eviction_fence->base))
+   ret = amdgpu_amdkfd_bo_validate_and_fence(bo, domain,
+   &avm->process_info->eviction_fence->base);
+   mutex_unlock(&avm->process_info->lock);
+   if (ret)
+   goto err_validate_bo;
}
 
if (offset)
@@ -1840,6 +1875,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 
 allocate_init_user_pages_failed:
 err_pin_bo:
+err_validate_bo:
remove_kgd_mem_from_kfd_bo_list(*mem, avm->process_info);
drm_vma_node_revoke(&gobj->vma_node, drm_priv);
 err_node_allow:
@@ -1915,10 +1951,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
if (unlikely(ret))
return ret;
 
-   /* The eviction fence should be removed by the last unmap.
-* TODO: Log an error condition if the bo still has the eviction fence
-* attached
-*/
amdgpu_amdkfd_remove_eviction_fence(mem->bo,
process_info->eviction_fence);
pr_debug("Release VA 0x%llx - 0x%llx\n", mem->va,
@@ -2047,19 +2079,6 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
if (unlikely(ret))
goto out_unreserve;
 
-   if (mem->mapped_to_gpu_memory == 0 &&
-   !amdgpu_ttm_tt_get_usermm(bo->tbo.ttm)) {
-   /* Validate BO only once. The eviction fence gets added to BO
-* the first time it is mapped. Validate will wait for all
-* background evictions to complete.
-*/
-   ret = amdgpu_amdkfd_bo_validate(bo, domain, true);
-   if (ret) {
-   pr_debug("Validate failed\n");
-   goto out_unreserve;
-   }
-   }
-
list_for_each_entry(entry, &mem->attachments, list) {
if (entry->bo_va->base.vm != avm || entry->is_mapped)
continue;
@@ -2086,10 +2105,6 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
 mem->mapped_to_gpu_memory);
}
 
-   if (!amdgpu_ttm_tt_get_usermm(bo->tbo.ttm) && !bo->tbo.pin_count)
-   dma_resv_add_fence(bo->tbo.base.resv,
-  &avm->process_info->eviction_fence->base,
-  DMA_RESV_USAGE_BOOKKEEP);
ret = unreserve_bo_and_vms(&ctx, false, false);
 
goto out;
@@ -2123,7 +2138,6 @@ int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu(
struct amdgpu_device *adev, struct kgd_mem *mem, void *drm_priv)
 {
struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv);
-   struct amdkfd_process_info *process_info = avm->process_info;
unsigned long bo_size = mem->bo->tbo.base.si

[PATCH 01/11] drm/amdgpu: Fix possible null pointer dereference

2023-10-17 Thread Felix Kuehling

abo->tbo.resource may be NULL in amdgpu_vm_bo_update.

Fixes: 180253782038 ("drm/ttm: stop allocating dummy resources during BO creation")
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 46d27c87..d72daf15662f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1007,7 +1007,8 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
struct drm_gem_object *gobj = dma_buf->priv;
struct amdgpu_bo *abo = gem_to_amdgpu_bo(gobj);
 
-   if (abo->tbo.resource->mem_type == TTM_PL_VRAM)
+   if (abo->tbo.resource &&
+   abo->tbo.resource->mem_type == TTM_PL_VRAM)
bo = gem_to_amdgpu_bo(gobj);
}
mem = bo->tbo.resource;
-- 
2.34.1

[PATCH 00/11] Enable integration of KFD with DRM GEM_VA ioctl

2023-10-17 Thread Felix Kuehling

This patch series enables better integration of KFD memory management with
the DRM GEM ioctl API. It allows managing virtual address mappings in
compute VMs with the GEM_VA ioctl after importing DMABufs exported from
KFD into libdrm.

This will enable more flexible virtual address management for ROCm user
mode, better interoperability between compute and graphics, as well as
sharing of memory between processes using DMABufs.

Felix Kuehling (11):
  drm/amdgpu: Fix possible null pointer dereference
  drm/amdgpu: Reserve fences for VM update
  drm/amdkfd: Improve amdgpu_vm_handle_moved
  drm/amdgpu: Attach eviction fence on alloc
  drm/amdgpu: update mappings not managed by KFD
  drm/amdkfd: Move TLB flushing logic into amdgpu
  drm/amdgpu: New VM state for evicted user BOs
  drm/amdgpu: Auto-validate DMABuf imports in compute VMs
  drm/amdkfd: Export DMABufs from KFD using GEM handles
  drm/amdkfd: Import DMABufs for interop through DRM
  drm/amdkfd: Bump KFD ioctl version

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  40 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  22 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 207 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|  22 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c   |  11 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  26 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 198 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  17 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  19 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  10 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  |  31 ---
 include/uapi/linux/kfd_ioctl.h|   3 +-
 12 files changed, 424 insertions(+), 182 deletions(-)

-- 
2.34.1

Re: [PATCH] drm/amdkfd: clean up some inconsistent indenting

2023-10-17 Thread Felix Kuehling




On 2023-10-12 23:21, Jiapeng Chong wrote:

No functional modification involved.

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:305 svm_range_free() warn: inconsistent indenting.

Reported-by: Abaci Robot 
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6804
Signed-off-by: Jiapeng Chong 


The patch is

Reviewed-by: Felix Kuehling 

Applied to amd-staging-drm-next.

Thanks,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index f4038b33c404..eef76190800c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -302,7 +302,7 @@ static void svm_range_free(struct svm_range *prange, bool do_unmap)
for (gpuidx = 0; gpuidx < MAX_GPU_INSTANCE; gpuidx++) {
if (prange->dma_addr[gpuidx]) {
kvfree(prange->dma_addr[gpuidx]);
-   prange->dma_addr[gpuidx] = NULL;
+   prange->dma_addr[gpuidx] = NULL;
}
}

Re: [PATCH v4 1/1] drm/amdkfd: get doorbell's absolute offset based on the db_size

2023-10-05 Thread Felix Kuehling


On 2023-10-05 13:20, Arvind Yadav wrote:

Add db_size (in bytes) as a parameter for computing the doorbell's
absolute offset for both 32-bit and 64-bit doorbell sizes, so that
the doorbell offset is aligned based on the doorbell size.

v2:
- Addressed the review comment from Felix.
v3:
- Adding doorbell_size as parameter to get db absolute offset.
v4:
- Squash the two patches into one.

Cc: Christian Koenig 
Cc: Alex Deucher 
Signed-off-by: Shashank Sharma 
Signed-off-by: Arvind Yadav 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h|  5 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c| 13 +
  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c   |  3 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c   | 10 --
  .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c  |  3 ++-
  5 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
index 09f6727e7c73..4a8b33f55f6b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
@@ -357,8 +357,9 @@ int amdgpu_doorbell_init(struct amdgpu_device *adev);
  void amdgpu_doorbell_fini(struct amdgpu_device *adev);
  int amdgpu_doorbell_create_kernel_doorbells(struct amdgpu_device *adev);
  uint32_t amdgpu_doorbell_index_on_bar(struct amdgpu_device *adev,
-  struct amdgpu_bo *db_bo,
-  uint32_t doorbell_index);
+ struct amdgpu_bo *db_bo,
+ uint32_t doorbell_index,
+ uint32_t db_size);
  
  #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))

  #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
index da4be0bbb446..6690f5a72f4d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
@@ -114,19 +114,24 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
   * @adev: amdgpu_device pointer
   * @db_bo: doorbell object's bo
   * @db_index: doorbell relative index in this doorbell object
+ * @db_size: doorbell size is in byte
   *
   * returns doorbell's absolute index in BAR
   */
  uint32_t amdgpu_doorbell_index_on_bar(struct amdgpu_device *adev,
-  struct amdgpu_bo *db_bo,
-  uint32_t doorbell_index)
+ struct amdgpu_bo *db_bo,
+ uint32_t doorbell_index,
+ uint32_t db_size)
  {
int db_bo_offset;
  
  	db_bo_offset = amdgpu_bo_gpu_offset_no_check(db_bo);
  
-	/* doorbell index is 32 bit but doorbell's size is 64-bit, so *2 */

-   return db_bo_offset / sizeof(u32) + doorbell_index * 2;
+   /* doorbell index is 32 bit but doorbell's size can be 32 bit
+* or 64 bit, so *db_size(in byte)/4 for alignment.
+*/
+   return db_bo_offset / sizeof(u32) + doorbell_index *
+  DIV_ROUND_UP(db_size, 4);
  }
  
  /**

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 0d3d538b64eb..e07652e72496 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -407,7 +407,8 @@ static int allocate_doorbell(struct qcm_process_device *qpd,
  
  q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev,
                                                            qpd->proc_doorbells,
-                                                           q->doorbell_id);
+                                                           q->doorbell_id,
+                                                           dev->kfd->device_info.doorbell_size);
return 0;
  }
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c

index 7b38537c7c99..05c74887fd6f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -161,7 +161,10 @@ void __iomem *kfd_get_kernel_doorbell(struct kfd_dev *kfd,
if (inx >= KFD_MAX_NUM_OF_QUEUES_PER_PROCESS)
return NULL;
  
-   *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, kfd->doorbells, inx);
+   *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev,
+                                                kfd->doorbells,
+                                                inx,
+                                                kfd->device_info.doorbell_size);
inx *= 2;

Re: [PATCH v3 2/2] drm/amdkfd: get doorbell's absolute offset based on the db size

2023-10-04 Thread Felix Kuehling




On 2023-10-04 12:16, Arvind Yadav wrote:

This patch aligns the absolute doorbell offset based on the
doorbell's size, so that the offset is aligned for both
32-bit and 64-bit doorbells.

v2:
- Addressed the review comment from Felix.
v3:
- Adding doorbell_size as parameter to get db absolute offset.

Cc: Christian Koenig 
Cc: Alex Deucher 
Signed-off-by: Shashank Sharma 
Signed-off-by: Arvind Yadav 


The final result looks good to me. But please squash the two patches 
into one. The first patch on its own breaks the build, and that's 
something we don't want to commit to the branch history as it makes 
tracking regressions (e.g. with git bisect) very hard or impossible.


More nit-picks inline.



---
  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c   |  6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c   | 13 +++--
  .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c  |  4 +++-
  3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 0d3d538b64eb..690ff131fe4b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -346,6 +346,7 @@ static int allocate_doorbell(struct qcm_process_device *qpd,
 uint32_t const *restore_id)
  {
struct kfd_node *dev = qpd->dqm->dev;
+   uint32_t doorbell_size;
  
  	if (!KFD_IS_SOC15(dev)) {

/* On pre-SOC15 chips we need to use the queue ID to
@@ -405,9 +406,12 @@ static int allocate_doorbell(struct qcm_process_device *qpd,
}
}
  
+	doorbell_size = dev->kfd->device_info.doorbell_size;

+
q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev,
                                                          qpd->proc_doorbells,
-                                                         q->doorbell_id);
+                                                         q->doorbell_id,
+                                                         doorbell_size);


You don't need a local variable for doorbell size that's only used once. 
Just pass dev->kfd->device_info.doorbell_size directly.




return 0;
  }
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c

index 7b38537c7c99..59dd76c4b138 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -161,7 +161,10 @@ void __iomem *kfd_get_kernel_doorbell(struct kfd_dev *kfd,
if (inx >= KFD_MAX_NUM_OF_QUEUES_PER_PROCESS)
return NULL;
  
-   *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, kfd->doorbells, inx);
+   *doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev,
+                                                kfd->doorbells,
+                                                inx,
+                                                kfd->device_info.doorbell_size);
inx *= 2;
  
  	pr_debug("Get kernel queue doorbell\n"

@@ -233,6 +236,7 @@ phys_addr_t kfd_get_process_doorbells(struct kfd_process_device *pdd)
  {
struct amdgpu_device *adev = pdd->dev->adev;
uint32_t first_db_index;
+   uint32_t doorbell_size;
  
  	if (!pdd->qpd.proc_doorbells) {

if (kfd_alloc_process_doorbells(pdd->dev->kfd, pdd))
@@ -240,7 +244,12 @@ phys_addr_t kfd_get_process_doorbells(struct kfd_process_device *pdd)
return 0;
}
  
-	first_db_index = amdgpu_doorbell_index_on_bar(adev, pdd->qpd.proc_doorbells, 0);

+   doorbell_size = pdd->dev->kfd->device_info.doorbell_size;
+
+   first_db_index = amdgpu_doorbell_index_on_bar(adev,
+ pdd->qpd.proc_doorbells,
+ 0,
+ doorbell_size);


Same as above, no local variable needed.



return adev->doorbell.base + first_db_index * sizeof(uint32_t);
  }
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c

index adb5e4bdc0b2..010cd8e8e6a1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -375,9 +375,11 @@ int pqm_create_queue(struct process_queue_manager *pqm,
 * relative doorbell index = Absolute doorbell index -
 * absolute index of first doorbell in the page.
 */
+   uint32_t doorbell_size = pdd->dev->kfd->device_info.doorbell_size;
uint32_t first_db_index = amdgpu_doorbell_index_on_bar(pdd->dev->adev,
                                                       pdd->qpd.proc_doorbells,
-

Re: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8

2023-09-28 Thread Felix Kuehling


On 2023-09-28 11:38, Shashank Sharma wrote:

Hello Felix, Mukul,

On 28/09/2023 17:30, Felix Kuehling wrote:

On 2023-09-28 10:30, Joshi, Mukul wrote:

[AMD Official Use Only - General]


-Original Message-
From: Yadav, Arvind 
Sent: Thursday, September 28, 2023 5:54 AM
To: Koenig, Christian; Deucher, Alexander; Sharma, Shashank; Kuehling, Felix;
Joshi, Mukul; Pan, Xinhui; airl...@gmail.com; dan...@ffwll.ch
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
linux-ker...@vger.kernel.org; Yadav, Arvind; Koenig, Christian
Subject: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8

This patch is to adjust the absolute doorbell offset against the doorbell id
considering the doorbell size of 32/64 bit.

v2:
- Addressed the review comment from Felix.

Cc: Christian Koenig 
Cc: Alex Deucher 
Signed-off-by: Shashank Sharma 
Signed-off-by: Arvind Yadav 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 0d3d538b64eb..c54c4392d26e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -407,7 +407,14 @@ static int allocate_doorbell(struct qcm_process_device *qpd,

   q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev,
                                                             qpd->proc_doorbells,
-                                                            q->doorbell_id);
+                                                            0);
+
It looks like amdgpu_doorbell_index_on_bar() works only for 64-bit doorbells.
Shouldn't it work for both 32-bit and 64-bit doorbells, considering this is
common doorbell manager code?



Yes, you are right that the calculation to find a particular doorbell 
in the doorbell page assumes a doorbell width of 64 bits.




I could see this argument going either way. KFD is the only one that 
cares about managing doorbells for user mode queues on GFXv8 GPUs. 
This is not a use case that amdgpu cares about. So I'm OK with KFD 
doing its own address calculations to make sure doorbells continue to 
work on GFXv8.


It may not be worth adding complexity to the common doorbell manager 
code to support legacy GPUs with 32-bit doorbells.



I was thinking about adding an additional input parameter which will 
indicate if the doorbell width is 32-bit vs 64-bit (like 
is_doorbell_64_bit), and doorbell manager can alter the multiplier 
while calculating the final offset. Please let me know if that will 
work for both the cases.


Yes, that would work for KFD because we already have the doorbell size 
in our device-info structure. Instead of making it a boolean flag, you 
could make it a doorbell_size parameter, in byte or dword units to 
simplify the pointer math.
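
A minimal sketch of that suggestion, assuming the size is passed in bytes.
The helper name toy_doorbell_index(), the TOY_DIV_ROUND_UP() macro and the
plain byte offset standing in for the doorbell BO's offset are hypothetical
and exist only for illustration. With a byte size, the per-doorbell stride
in dwords is the size divided by four, rounded up:

```c
#include <stdint.h>

/* Round-up division, in the spirit of the kernel's DIV_ROUND_UP() macro. */
#define TOY_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical stand-in for the doorbell index calculation.
 *
 * bo_offset_bytes: byte offset of the doorbell BO (assumed input)
 * db_index:        doorbell index relative to that BO
 * db_size_bytes:   4 for 32-bit doorbells, 8 for 64-bit doorbells
 *
 * Returns the absolute doorbell index in dword units.
 */
static uint32_t toy_doorbell_index(uint32_t bo_offset_bytes,
                                   uint32_t db_index,
                                   uint32_t db_size_bytes)
{
    return bo_offset_bytes / (uint32_t)sizeof(uint32_t) +
           db_index * TOY_DIV_ROUND_UP(db_size_bytes, 4);
}
```

Under these assumptions a BO at byte offset 4096 places relative doorbell 3
at dword index 1030 for 64-bit doorbells and 1027 for 32-bit doorbells.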


Regards,
  Felix




- Shashank




Regards,
  Felix




Thanks,
Mukul


+ /* Adjust the absolute doorbell offset against the doorbell id considering
+  * the doorbell size of 32/64 bit.
+  */
+ q->properties.doorbell_off += q->doorbell_id *
+ dev->kfd->device_info.doorbell_size / 4;
+
   return 0;
  }

--
2.34.1

Re: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8

2023-09-28 Thread Felix Kuehling

On 2023-09-28 10:30, Joshi, Mukul wrote:

[AMD Official Use Only - General]

-Original Message-
From: Yadav, Arvind 
Sent: Thursday, September 28, 2023 5:54 AM
To: Koenig, Christian; Deucher, Alexander; Sharma, Shashank; Kuehling, Felix;
Joshi, Mukul; Pan, Xinhui; airl...@gmail.com; dan...@ffwll.ch
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
linux-ker...@vger.kernel.org; Yadav, Arvind; Koenig, Christian
Subject: [PATCH v2 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8

This patch is to adjust the absolute doorbell offset against the doorbell id
considering the doorbell size of 32/64 bit.

v2:
- Addressed the review comment from Felix.

Cc: Christian Koenig 
Cc: Alex Deucher 
Signed-off-by: Shashank Sharma 
Signed-off-by: Arvind Yadav 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 0d3d538b64eb..c54c4392d26e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -407,7 +407,14 @@ static int allocate_doorbell(struct qcm_process_device *qpd,

   q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev,
                                                             qpd->proc_doorbells,
-                                                            q->doorbell_id);
+                                                            0);
+

It looks like amdgpu_doorbell_index_on_bar() works only for 64-bit doorbells.
Shouldn't it work for both 32-bit and 64-bit doorbells, considering this is
common doorbell manager code?

I could see this argument going either way. KFD is the only one that 
cares about managing doorbells for user mode queues on GFXv8 GPUs. This 
is not a use case that amdgpu cares about. So I'm OK with KFD doing its 
own address calculations to make sure doorbells continue to work on GFXv8.

It may not be worth adding complexity to the common doorbell manager 
code to support legacy GPUs with 32-bit doorbells.

Regards,
  Felix

Thanks,
Mukul

+ /* Adjust the absolute doorbell offset against the doorbell id considering
+  * the doorbell size of 32/64 bit.
+  */
+ q->properties.doorbell_off += q->doorbell_id *
+   dev->kfd->device_info.doorbell_size / 4;
+
   return 0;
  }

--
2.34.1

Re: [PATCH 1/1] drm/amdkfd: Fix unaligned doorbell absolute offset for gfx8

2023-09-27 Thread Felix Kuehling


[+Mukul]

On 2023-09-27 12:16, Arvind Yadav wrote:

This patch is to adjust the absolute doorbell offset
against the doorbell id considering the doorbell
size of 32/64 bit.

Cc: Christian Koenig
Cc: Alex Deucher
Signed-off-by: Shashank Sharma
Signed-off-by: Arvind Yadav
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 0d3d538b64eb..c327f7f604d7 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -407,7 +407,16 @@ static int allocate_doorbell(struct qcm_process_device *qpd,
  
  q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev,
                                                            qpd->proc_doorbells,
-                                                           q->doorbell_id);
+                                                           0);


Looks like we're now always calling amdgpu_doorbell_index_on_bar with 
the third parameter = 0. So we could remove that parameter. It doesn't 
do us any good and only causes bugs if we use any non-0 value.




+
+	/* Adjust the absolute doorbell offset against the doorbell id considering
+	 * the doorbell size of 32/64 bit.
+	 */
+	if (!KFD_IS_SOC15(dev))
+		q->properties.doorbell_off += q->doorbell_id;
+	else
+		q->properties.doorbell_off += q->doorbell_id * 2;


This could be simplified and generalized as

q->properties.doorbell_off += q->doorbell_id * 
dev->kfd->device_info.doorbell_size / 4;
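
To make the arithmetic behind that one-liner concrete, here is a minimal standalone sketch. It assumes the offset is counted in dwords (4-byte units) and that doorbell_size is given in bytes, which is what the `/ 4` implies. The helper name doorbell_offset_dw() and its parameters are made up for illustration and are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not driver code.
 *
 * base_dw:       dword offset of the process's first doorbell
 * doorbell_id:   index of the queue's doorbell within that range
 * doorbell_size: size of one doorbell in bytes (4 = 32-bit, 8 = 64-bit)
 */
static uint32_t doorbell_offset_dw(uint32_t base_dw, uint32_t doorbell_id,
				   uint32_t doorbell_size)
{
	/* Only the two sizes discussed in the thread are modelled here. */
	assert(doorbell_size == 4 || doorbell_size == 8);

	/* Each doorbell occupies doorbell_size / 4 dwords. */
	return base_dw + doorbell_id * doorbell_size / 4;
}
```

With a 4-byte doorbell the id is added as is, and with an 8-byte doorbell it is doubled, which matches the two branches of the if/else in the patch.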

Regards,
  Felix



+
return 0;
  }

Re: [PATCH v2 1/2] drm/amdgpu: Merge debug module parameters

2023-08-31 Thread Felix Kuehling




On 2023-08-30 18:08, André Almeida wrote:

Merge all developer debug options, currently available as separate module
parameters, into one, making it obvious that they are for developers.

Drop the obsolete module options in favor of the new ones.

Signed-off-by: André Almeida 
---
v2:
- drop old module params
- use BIT() macros
- replace global var with adev-> vars
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h  |  4 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  | 48 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c  |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c   |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c|  2 +-
  drivers/gpu/drm/amd/include/amd_shared.h |  8 
  8 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4de074243c4d..82eaccfce347 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1101,6 +1101,10 @@ struct amdgpu_device {
 	bool				dc_enabled;
 	/* Mask of active clusters */
 	uint32_t			aid_mask;
+
+	/* Debug */
+	bool				debug_vm;
+	bool				debug_largebar;
  };
  
  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fb78a8f47587..8a26bed76505 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1191,7 +1191,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 		job->vm_pd_addr = amdgpu_gmc_pd_addr(vm->root.bo);
 	}

-	if (amdgpu_vm_debug) {
+	if (adev->debug_vm) {
 		/* Invalidate all BOs to test for userspace bugs */
 		amdgpu_bo_list_for_each_entry(e, p->bo_list) {
 			struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index f5856b82605e..0cd48c025433 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -140,7 +140,6 @@ int amdgpu_vm_size = -1;
  int amdgpu_vm_fragment_size = -1;
  int amdgpu_vm_block_size = -1;
  int amdgpu_vm_fault_stop;
-int amdgpu_vm_debug;
  int amdgpu_vm_update_mode = -1;
  int amdgpu_exp_hw_support;
  int amdgpu_dc = -1;
@@ -194,6 +193,7 @@ int amdgpu_use_xgmi_p2p = 1;
  int amdgpu_vcnfw_log;
  int amdgpu_sg_display = -1; /* auto */
  int amdgpu_user_partt_mode = AMDGPU_AUTO_COMPUTE_PARTITION_MODE;
+uint amdgpu_debug_mask;
  
  static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work);
  
@@ -405,13 +405,6 @@ module_param_named(vm_block_size, amdgpu_vm_block_size, int, 0444);

  MODULE_PARM_DESC(vm_fault_stop, "Stop on VM fault (0 = never (default), 1 = print 
first, 2 = always)");
  module_param_named(vm_fault_stop, amdgpu_vm_fault_stop, int, 0444);
  
-/**
- * DOC: vm_debug (int)
- * Debug VM handling (0 = disabled, 1 = enabled). The default is 0 (Disabled).
- */
-MODULE_PARM_DESC(vm_debug, "Debug VM handling (0 = disabled (default), 1 = enabled)");
-module_param_named(vm_debug, amdgpu_vm_debug, int, 0644);


This parameter used to be writable, which means it could be changed 
through sysfs after loading the module. Code looking at the global 
variable would see the last value written by user mode. With your 
changes, this is no longer writable, and driver code is now looking at 
adev->debug_vm, which cannot be updated through sysfs. As long as 
everyone is OK with that change, I have no objections. Just pointing it out.
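
The behavioural difference can be shown with a small userspace model. This is only a sketch: debug_mask stands in for a writable module-wide variable, struct device_state for the per-device fields, and the bit assignments are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(n) (1u << (n))

/* Hypothetical bit assignments, for illustration only. */
#define DEBUG_VM	BIT(0)
#define DEBUG_LARGEBAR	BIT(1)

/* Stand-in for a module-wide value that user mode could rewrite later. */
static unsigned int debug_mask;

/* Stand-in for per-device state that is filled in once at init time. */
struct device_state {
	bool debug_vm;
	bool debug_largebar;
};

static void device_init(struct device_state *dev)
{
	assert(dev);
	/* Snapshot: later writes to debug_mask are not seen by dev. */
	dev->debug_vm = debug_mask & DEBUG_VM;
	dev->debug_largebar = debug_mask & DEBUG_LARGEBAR;
}

/* Old style: consult the global on every call. */
static bool vm_debug_live(void)
{
	return debug_mask & DEBUG_VM;
}

/* New style: consult the per-device snapshot. */
static bool vm_debug_snapshot(const struct device_state *dev)
{
	return dev->debug_vm;
}
```

If debug_mask is changed after device_init() has run, vm_debug_live() follows the new value while vm_debug_snapshot() keeps reporting the old one until the device is initialized again.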


Regardless, this patch is

Acked-by: Felix Kuehling 



-
  /**
   * DOC: vm_update_mode (int)
   * Override VM update mode. VM updated by using CPU (0 = never, 1 = Graphics 
only, 2 = Compute only, 3 = Both). The default
@@ -743,18 +736,6 @@ module_param(send_sigterm, int, 0444);
  MODULE_PARM_DESC(send_sigterm,
"Send sigterm to HSA process on unhandled exception (0 = disable, 1 = 
enable)");
  
-/**
- * DOC: debug_largebar (int)
- * Set debug_largebar as 1 to enable simulating large-bar capability on non-large bar
- * system. This limits the VRAM size reported to ROCm applications to the visible
- * size, usually 256MB.
- * Default value is 0, diabled.
- */
-int debug_largebar;
-module_param(debug_largebar, int, 0444);
-MODULE_PARM_DESC(debug_largebar,
-	"Debug large-bar flag used to simulate large-bar capability on non-large bar machine (0 = disable, 1 = enable)");
-
  /**
   * DOC: halt_if_hws_hang (int)
   * Halt if HWS hang is detected. Default value, 0, disables the halt on hang.
@@ -938,6 +919,18 @@ module_param_named(user_partt_mode, 
amdgpu_user_partt_mo

Re: [PATCH] drm/prime: Support page array >= 4GB

2023-08-28 Thread Felix Kuehling


On 2023-08-21 16:02, Philip Yang wrote:

Without the unsigned long typecast, the size is passed in as zero if the page
array size is >= 4GB (nr_pages >= 0x100000), and the converted sg list then
has its first and last chunk lost.

Signed-off-by: Philip Yang 


The patch looks reasonable to me. I don't have authority to approve it. 
But FWIW,


Acked-by: Felix Kuehling 

Can anyone give a Reviewed-by?

Thanks,
  Felix



---
  drivers/gpu/drm/drm_prime.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index f924b8b4ab6b..2630ad2e504d 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -830,7 +830,7 @@ struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
 	if (max_segment == 0)
 		max_segment = UINT_MAX;
 	err = sg_alloc_table_from_pages_segment(sg, pages, nr_pages, 0,
-						nr_pages << PAGE_SHIFT,
+						(unsigned long)nr_pages << PAGE_SHIFT,
 						max_segment, GFP_KERNEL);
 	if (err) {
 		kfree(sg);
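
The truncation this patch fixes can be reproduced outside the kernel. The sketch below assumes 4 KiB pages (PAGE_SHIFT of 12) and uses uint32_t and uint64_t to stand in for the kernel's unsigned int and a 64-bit unsigned long. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Models `nr_pages << PAGE_SHIFT` evaluated in 32-bit arithmetic. */
static uint64_t size_without_cast(uint32_t nr_pages)
{
	uint32_t size = nr_pages << PAGE_SHIFT; /* bits above 31 are discarded */

	return size;
}

/* Models `(unsigned long)nr_pages << PAGE_SHIFT` on a 64-bit kernel. */
static uint64_t size_with_cast(uint32_t nr_pages)
{
	uint64_t size = (uint64_t)nr_pages << PAGE_SHIFT;

	/* Widening first means the shift can be undone without loss. */
	assert(size >> PAGE_SHIFT == nr_pages);
	return size;
}
```

For 0x100000 pages (exactly 4 GiB) the 32-bit shift yields zero, while the widened shift yields the full byte count.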

Re: [PATCH AUTOSEL 5.15 6/6] drm/amdkfd: ignore crat by default

2023-08-23 Thread Felix Kuehling


On 2023-08-22 11:41, Deucher, Alexander wrote:

[Public]


-Original Message-
From: Sasha Levin 
Sent: Tuesday, August 22, 2023 7:37 AM
To: linux-ker...@vger.kernel.org; sta...@vger.kernel.org
Cc: Deucher, Alexander ; Kuehling, Felix
; Koenig, Christian ;
Mike Lothian ; Sasha Levin ; Pan,
Xinhui ; airl...@gmail.com; dan...@ffwll.ch; amd-
g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
Subject: [PATCH AUTOSEL 5.15 6/6] drm/amdkfd: ignore crat by default

From: Alex Deucher 

[ Upstream commit a6dea2d64ff92851e68cd4e20a35f6534286e016 ]

We are dropping the IOMMUv2 path, so no need to enable this.
It's often buggy on consumer platforms anyway.

This is not needed for stable.


I agree. I was about to comment in the 5.10 patch as well.

Regards,
  Felix




Alex


Reviewed-by: Felix Kuehling 
Acked-by: Christian König 
Tested-by: Mike Lothian 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index e574aa32a111d..46dfd9baeb013 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -1523,11 +1523,7 @@ static bool kfd_ignore_crat(void)
   if (ignore_crat)
   return true;

-#ifndef KFD_SUPPORT_IOMMU_V2
   ret = true;
-#else
- ret = false;
-#endif

   return ret;
  }
--
2.40.1

Re: [PATCH] drm/prime: Support page array >= 4GB

2023-08-23 Thread Felix Kuehling


On 2023-08-23 01:49, Christian König wrote:

Am 22.08.23 um 20:27 schrieb Philip Yang:


On 2023-08-22 05:43, Christian König wrote:



Am 21.08.23 um 22:02 schrieb Philip Yang:

Without the unsigned long typecast, the size is passed in as zero if the page
array size is >= 4GB (nr_pages >= 0x100000), and the converted sg list then
has its first and last chunk lost.


Good catch, but I'm not sure if this is enough to make it work.

In addition to that, I don't think we have a use case for BOs > 4GiB.


>4GB buffer is normal for compute applications, the issue is reported 
by "Maelstrom generated exerciser detects micompares when GPU 
accesses larger remote GPU memory." on GFX 9.4.3 APU, which uses GTT 
domain to allocate VRAM, and trigger the bug in this drm prime 
helper. With this fix, the test passed.




Why is the application allocating all the data as a single BO?

Usually you have a single texture, image, array etc... in a single BO 
but this here looks a bit like the application tries to allocate all 
their memory in a single BO (could of course be that this isn't the 
case and that's really just one giant data structure).


Compute applications work with pretty big data structures. For example 
huge multi-dimensional matrices are not uncommon in large 
machine-learning models.






Swapping such large BOs out at once is quite impractical, so should we 
ever have an use case like suspend/resume or checkpoint/restore with 
this it will most likely fail.


Checkpointing and restoring multiple GB at a time should not be a 
problem. I'm pretty sure we have tested that. On systems with 100s of 
GBs of memory, HBM memory bandwidth approaching TB/s and PCIe/CXL bus 
bandwidths going into 10s of GB/s, dealing with multi-GB BOs should not 
be a fundamental problem.


That said, if you wanted to impose limits on the size of single 
allocations, then I would expect some policy somewhere that prohibits 
large allocations. On the contrary, I see long or 64-bit data types all 
over the VRAM manager and TTM code, which tells me that >4GB allocations 
must be part of the plan.


This patch is clearly addressing a bug in the code that results in data 
corruption when mapping large BOs on multiple GPUs. You could address 
this with an allocation policy change, if you want, and leave the bug in 
place. Then we have to update ROCm user mode to break large allocations 
into multiple BOs. It would break applications that try to share such 
large allocations via DMABufs (e.g. with an RDMA NIC), because it would 
become impossible to share large allocations with a single DMABuf handle.


Regards,
  Felix




Christian.


Regards,

Philip



Christian.



Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/drm_prime.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index f924b8b4ab6b..2630ad2e504d 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -830,7 +830,7 @@ struct sg_table *drm_prime_pages_to_sg(struct 
drm_device *dev,

  if (max_segment == 0)
  max_segment = UINT_MAX;
  err = sg_alloc_table_from_pages_segment(sg, pages, nr_pages, 0,
-    nr_pages << PAGE_SHIFT,
+    (unsigned long)nr_pages << PAGE_SHIFT,
  max_segment, GFP_KERNEL);
  if (err) {
  kfree(sg);

Re: Implement svm without BO concept in xe driver

2023-08-21 Thread Felix Kuehling




On 2023-08-21 15:41, Zeng, Oak wrote:

I have thought about emulating BO allocation APIs on top of system SVM.
This was in the context of KFD where memory management is not tied into
command submissions APIs, which would add a whole other layer of
complexity. The main unsolved (unsolvable?) problem I ran into was, that
there is no way to share SVM memory as DMABufs. So there is no good way
to support applications that expect to share memory in that way.

Great point. I also discussed the dmabuf thing with Mike (cc'ed). dmabuf is a 
technology created specifically for BO drivers (and other drivers) to share 
buffers b/t devices. HMM/system SVM doesn't need this technology: malloc'ed 
memory is by nature already shared b/t different devices (in one process) and 
the CPU. We can simply submit GPU kernels to all devices with malloc'ed memory 
and let the kmd decide the memory placement (such as map in place or migrate). 
There is no need for buffer export/import in the hmm/system SVM world.


I disagree. DMABuf can be used for sharing memory between processes. And 
it can be used for sharing memory with 3rd-party devices via PCIe P2P 
(e.g. a Mellanox NIC). You cannot easily do that with malloc'ed memory. 
POSIX IPC requires that you know that you'll be sharing the memory at 
allocation time. It adds overhead. And because it's file-backed, it's 
currently incompatible with migration. And HMM currently doesn't have a 
solution for P2P. Any access by a different device causes a migration to 
system memory.


Regards,
  Felix




So yes from buffer sharing perspective, the design philosophy is also very 
different.

Thanks,
Oak

Re: Implement svm without BO concept in xe driver

2023-08-21 Thread Felix Kuehling




On 2023-08-21 11:10, Zeng, Oak wrote:

Accidentally deleted Brian. Adding him back.

Thanks,
Oak


-Original Message-
From: Zeng, Oak
Sent: August 21, 2023 11:07 AM
To: Dave Airlie 
Cc: Brost, Matthew ; Thomas Hellström
; Philip Yang ; Felix
Kuehling ; dri-devel@lists.freedesktop.org; intel-
x...@lists.freedesktop.org; Vishwanathapura, Niranjana
; Christian König

Subject: RE: Implement svm without BO concept in xe driver


-Original Message-
From: dri-devel  On Behalf Of Dave Airlie
Sent: August 20, 2023 6:21 PM
To: Zeng, Oak 
Cc: Brost, Matthew ; Thomas Hellström
; Philip Yang ; Felix
Kuehling ; Welty, Brian ; dri-
de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Vishwanathapura,
Niranjana ; Christian König

Subject: Re: Implement svm without BO concept in xe driver

On Thu, 17 Aug 2023 at 12:13, Zeng, Oak  wrote:

-Original Message-
From: Dave Airlie 
Sent: August 16, 2023 6:52 PM
To: Felix Kuehling 
Cc: Zeng, Oak ; Christian König
; Thomas Hellström
; Brost, Matthew
; maarten.lankho...@linux.intel.com;
Vishwanathapura, Niranjana ; Welty,
Brian ; Philip Yang ; intel-
x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
Subject: Re: Implement svm without BO concept in xe driver

On Thu, 17 Aug 2023 at 08:15, Felix Kuehling  wrote:

On 2023-08-16 13:30, Zeng, Oak wrote:

I spoke with Thomas. We discussed two approaches:

1) make ttm_resource a central place for vram management functions such as
eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM
codes call into ttm_resource_alloc/free functions for vram allocation/free.

  *This way BO driver and SVM driver shares the eviction/cgroup logic, no
need to reimplement LRU eviction list in SVM driver. Cgroup logic should be in
ttm_resource layer. +Maarten.

  *ttm_resource is not a perfect match for SVM to allocate vram. It is still a
big overhead. The *bo* member of ttm_resource is not needed for SVM - this
might end up with invasive changes to ttm...need to look into more details

Overhead is a problem. We'd want to be able to allocate, free and evict
memory at a similar granularity as our preferred migration and page
fault granularity, which defaults to 2MB in our SVM implementation.



2) svm code allocate memory directly from drm-buddy allocator, and expose
memory eviction functions from both ttm and svm so they can evict memory
from each other. For example, expose the ttm_mem_evict_first function from
ttm side so hmm/svm code can call it; expose a similar function from svm side
so ttm can evict hmm memory.

I like this option. One thing that needs some thought with this is how
to get some semblance of fairness between the two types of clients.
Basically how to choose what to evict. And what share of the available
memory does each side get to use on average. E.g. an idle client may get
all its memory evicted while a busy client may get a bigger share of the
available memory.

I'd also like to suggest we try to write any management/generic code
in driver agnostic way as much as possible here. I don't really see
much hw difference should be influencing it.

I do worry about having effectively 2 LRUs here, you can't really have
two "leasts".

Like if we hit the shrinker paths who goes first? do we shrink one
object from each side in turn?

One way to solve this fairness problem is to create a driver agnostic
drm_vram_mgr. Maintain a single LRU in drm_vram_mgr. Move the memory
eviction/cgroups memory accounting logic from ttm_resource manager to
drm_vram_mgr. Both BO-based driver and SVM driver calls to drm_vram_mgr to
allocate/free memory.

I am not sure whether this meets the 2M allocate/free/evict granularity
requirement Felix mentioned above. SVM can allocate 2M size blocks. But BO
driver should be able to allocate any arbitrary sized blocks - So the eviction
is also arbitrary size.

Also will we have systems where we can expose system SVM but userspace
may choose to not use the fine grained SVM and use one of the older
modes, will that path get emulated on top of SVM or use the BO paths?

If by "older modes" you meant the gem_bo_create (such as xe_gem_create or
amdgpu_gem_create), then today both amd and intel implement those
interfaces using BO path. We don't have a plan to emulate that old mode on
top of SVM, afaict.

I'm not sure how the older modes manifest in the kernel I assume as bo
creates (but they may use userptr), SVM isn't a specific thing, it's a
group of 3 things.

1) coarse-grained SVM which I think is BO
2) fine-grained SVM which is page level
3) fine-grained system SVM which is HMM

I suppose I'm asking about the previous versions and how they would
operate in a system SVM capable system.

I got your question now.

As I understand it, the system SVM provides similar functionality as BO-based
SVM (i.e., share virtual address space b/t cpu and gpu program, no explici

Re: Implement svm without BO concept in xe driver

2023-08-18 Thread Felix Kuehling




On 2023-08-18 12:10, Zeng, Oak wrote:

Thanks Thomas. I will then look into more details of option 3:

* create a lean drm layer vram manager, a central control place for vram 
eviction and cgroup accounting. Single LRU for eviction fairness.
* pretty much move the current ttm_resource eviction/cgroups logic to drm 
layer
* the eviction/allocation granularity should be flexible so svm can do 2M 
while ttm can do arbitrary size


SVM will need smaller sizes too, for VMAs that are smaller or not 
aligned to 2MB size.


Regards,
  Felix



* both ttm_resource and svm code should call the new drm_vram_manager for 
eviction/accounting
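
Felix's remark above, that SVM also needs sizes smaller than 2M for VMAs that are small or not 2MB-aligned, can be illustrated with a short sketch. It splits a byte range so that no piece crosses a 2 MiB boundary. struct chunk and split_range() are hypothetical names for illustration, not an existing DRM or KFD interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE (2ull << 20) /* 2 MiB, the granularity named in the thread */

struct chunk {
	uint64_t start; /* byte address of the piece */
	uint64_t size;  /* length in bytes, at most CHUNK_SIZE */
};

/*
 * Split [start, end) so that no piece crosses a 2 MiB boundary.
 * Writes at most max pieces to out and returns how many were written.
 */
static size_t split_range(uint64_t start, uint64_t end,
			  struct chunk *out, size_t max)
{
	size_t n = 0;

	assert(out);
	while (start < end && n < max) {
		/* First 2 MiB boundary strictly above start. */
		uint64_t next = (start / CHUNK_SIZE + 1) * CHUNK_SIZE;

		if (next > end)
			next = end;
		out[n].start = start;
		out[n].size = next - start;
		n++;
		start = next;
	}
	return n;
}
```

A range that starts 4 KiB below one boundary and ends 4 KiB above the next comes out as a 4 KiB head, one full 2 MiB piece and a 4 KiB tail, so a manager that can only hand out whole 2 MiB blocks would not fit it exactly.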

I will come back with some RFC proof of concept codes later.

Cheers,
Oak


-Original Message-
From: Thomas Hellström 
Sent: August 18, 2023 3:36 AM
To: Zeng, Oak ; Dave Airlie ; Felix
Kuehling 
Cc: Christian König ; Brost, Matthew
; maarten.lankho...@linux.intel.com;
Vishwanathapura, Niranjana ; Welty,
Brian ; Philip Yang ; intel-
x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
Subject: Re: Implement svm without BO concept in xe driver


On 8/17/23 04:12, Zeng, Oak wrote:

-Original Message-
From: Dave Airlie 
Sent: August 16, 2023 6:52 PM
To: Felix Kuehling 
Cc: Zeng, Oak ; Christian König
; Thomas Hellström
; Brost, Matthew
; maarten.lankho...@linux.intel.com;
Vishwanathapura, Niranjana ; Welty,
Brian ; Philip Yang ; intel-
x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
Subject: Re: Implement svm without BO concept in xe driver

On Thu, 17 Aug 2023 at 08:15, Felix Kuehling  wrote:

On 2023-08-16 13:30, Zeng, Oak wrote:

I spoke with Thomas. We discussed two approaches:

1) make ttm_resource a central place for vram management functions such as
eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM
codes call into ttm_resource_alloc/free functions for vram allocation/free.

   *This way BO driver and SVM driver shares the eviction/cgroup logic, no
need to reimplement LRU eviction list in SVM driver. Cgroup logic should be in
ttm_resource layer. +Maarten.

   *ttm_resource is not a perfect match for SVM to allocate vram. It is still a
big overhead. The *bo* member of ttm_resource is not needed for SVM - this
might end up with invasive changes to ttm...need to look into more details

Overhead is a problem. We'd want to be able to allocate, free and evict
memory at a similar granularity as our preferred migration and page
fault granularity, which defaults to 2MB in our SVM implementation.



2) svm code allocate memory directly from drm-buddy allocator, and expose
memory eviction functions from both ttm and svm so they can evict memory
from each other. For example, expose the ttm_mem_evict_first function from
ttm side so hmm/svm code can call it; expose a similar function from svm side
so ttm can evict hmm memory.

I like this option. One thing that needs some thought with this is how
to get some semblance of fairness between the two types of clients.
Basically how to choose what to evict. And what share of the available
memory does each side get to use on average. E.g. an idle client may get
all its memory evicted while a busy client may get a bigger share of the
available memory.

I'd also like to suggest we try to write any management/generic code
in driver agnostic way as much as possible here. I don't really see
much hw difference should be influencing it.

I do worry about having effectively 2 LRUs here, you can't really have
two "leasts".

Like if we hit the shrinker paths who goes first? do we shrink one
object from each side in turn?

One way to solve this fairness problem is to create a driver agnostic
drm_vram_mgr. Maintain a single LRU in drm_vram_mgr. Move the memory
eviction/cgroups memory accounting logic from ttm_resource manager to
drm_vram_mgr. Both BO-based driver and SVM driver calls to drm_vram_mgr to
allocate/free memory.

I am not sure whether this meets the 2M allocate/free/evict granularity
requirement Felix mentioned above. SVM can allocate 2M size blocks. But BO
driver should be able to allocate any arbitrary sized blocks - So the eviction
is also arbitrary size.

This is not far from what a TTM resource manager does with TTM
resources, only made generic at the drm level, and making the "resource"
as lean as possible. With 2M granularity this seems plausible.


Also will we have systems where we can expose system SVM but userspace
may choose to not use the fine grained SVM and use one of the older
modes, will that path get emulated on top of SVM or use the BO paths?

If by "older modes" you meant the gem_bo_create (such as xe_gem_create or
amdgpu_gem_create), then today both amd and intel implement those
interfaces using BO path. We don't have a plan to emulate that old mode on top
of SVM, afaict.

I think we might end up emulating "older modes" on top of SVM at some
point, not to f

Re: Implement svm without BO concept in xe driver

2023-08-16 Thread Felix Kuehling


On 2023-08-16 13:30, Zeng, Oak wrote:

I spoke with Thomas. We discussed two approaches:

1) make ttm_resource a central place for vram management functions such as 
eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM 
codes call into ttm_resource_alloc/free functions for vram allocation/free.
 *This way BO driver and SVM driver shares the eviction/cgroup logic, no 
need to reimplement LRU eviction list in SVM driver. Cgroup logic should be in 
ttm_resource layer. +Maarten.
 *ttm_resource is not a perfect match for SVM to allocate vram. It is still 
a big overhead. The *bo* member of ttm_resource is not needed for SVM - this 
might end up with invasive changes to ttm...need to look into more details


Overhead is a problem. We'd want to be able to allocate, free and evict 
memory at a similar granularity as our preferred migration and page 
fault granularity, which defaults to 2MB in our SVM implementation.





2) svm code allocate memory directly from drm-buddy allocator, and expose 
memory eviction functions from both ttm and svm so they can evict memory from 
each other. For example, expose the ttm_mem_evict_first function from ttm side 
so hmm/svm code can call it; expose a similar function from svm side so ttm can 
evict hmm memory.


I like this option. One thing that needs some thought with this is how 
to get some semblance of fairness between the two types of clients. 
Basically how to choose what to evict. And what share of the available 
memory does each side get to use on average. E.g. an idle client may get 
all its memory evicted while a busy client may get a bigger share of the 
available memory.
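
One candidate answer raised elsewhere in this thread is a single LRU shared by both kinds of clients. The sketch below models only that idea, with an array scan instead of a real list. enum owner, struct vram_block and evict_one() are hypothetical names, and locking, cgroup accounting and pinned memory are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum owner { OWNER_BO, OWNER_SVM };

struct vram_block {
	enum owner owner;  /* which kind of client allocated the block */
	uint64_t last_use; /* larger value = used more recently */
	int resident;      /* nonzero while the block occupies VRAM */
};

/* Pick the least recently used resident block, whoever owns it. */
static struct vram_block *pick_victim(struct vram_block *blocks, size_t n)
{
	struct vram_block *victim = NULL;

	assert(blocks || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!blocks[i].resident)
			continue;
		if (!victim || blocks[i].last_use < victim->last_use)
			victim = &blocks[i];
	}
	return victim;
}

/* Evict one block; returns its index, or -1 if nothing is resident. */
static int evict_one(struct vram_block *blocks, size_t n)
{
	struct vram_block *victim = pick_victim(blocks, n);

	if (!victim)
		return -1;
	victim->resident = 0;
	return (int)(victim - blocks);
}
```

Under such a policy the owner type plays no role in victim selection, so an idle client of either kind loses its memory first. Whether that counts as fair is exactly the open question here.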


Regards,
  Felix





Today we don't know which approach is better. I will work on some 
proof-of-concept code, starting with approach #1.

Btw, I talked with application engineers and they said most applications 
actually use a mixture of gem_bo create and malloc, so we definitely need to 
solve this problem.

Cheers,
Oak


-Original Message-
From: Christian König 
Sent: August 16, 2023 2:06 AM
To: Zeng, Oak ; Felix Kuehling ;
Thomas Hellström ; Brost, Matthew
; Vishwanathapura, Niranjana
; Welty, Brian ;
Philip Yang ; intel...@lists.freedesktop.org; dri-
de...@lists.freedesktop.org
Subject: Re: Implement svm without BO concept in xe driver

Hi Oak,

yeah, I completely agree with you and Felix. The main problem here is
getting the memory pressure visible on both sides.

At the moment I have absolutely no idea how to handle that, maybe
something like the ttm_resource object shared between TTM and HMM?

Regards,
Christian.

Am 16.08.23 um 05:47 schrieb Zeng, Oak:

Hi Felix,

It is great to hear from you!

When I implement the HMM-based SVM for intel devices, I found this
interesting problem: HMM uses struct page based memory management scheme
which is completely different against the BO/TTM style memory management
philosophy. Writing SVM code upon the BO/TTM concept seems overkill and
awkward. So I thought we better make the SVM code BO-less and TTM-less. But
on the other hand, currently vram eviction and cgroup memory accounting are all
hooked to the TTM layer, which means a TTM-less SVM driver won't be able to
evict vram allocated through TTM/gpu_vram_mgr.

Ideally HMM migration should use drm-buddy for vram allocation, but we need
to solve this TTM/HMM mutual eviction problem as you pointed out (I am
working with application engineers to figure out whether mutual eviction can
truly benefit applications). Maybe we can implement a TTM-less vram
management block which can be shared b/t the HMM-based driver and the BO-
based driver:

 * allocate/free memory from drm-buddy, buddy-block based
 * memory eviction logics, allow driver to specify which allocation is 
evictable
 * memory accounting, cgroup logic

Maybe such a block can be placed at drm layer (say, call it drm_vram_mgr for
now), so it can be shared b/t amd and intel. So I involved amd folks. Today both
amd and intel-xe driver implemented a TTM-based vram manager which doesn't
serve above design goal. Once the drm_vram_mgr is implemented, both amd
and intel's BO-based/TTM-based vram manager, and the HMM-based vram
manager can call into this drm-vram-mgr.

Thanks again,
Oak


-Original Message-
From: Felix Kuehling 
Sent: August 15, 2023 6:17 PM
To: Zeng, Oak ; Thomas Hellström
; Brost, Matthew
; Vishwanathapura, Niranjana
; Welty, Brian ;
Christian König ; Philip Yang
; intel...@lists.freedesktop.org; dri-
de...@lists.freedesktop.org
Subject: Re: Implement svm without BO concept in xe driver

Hi Oak,

I'm not sure what you're looking for from AMD? Are we just CC'ed FYI? Or
are you looking for comments about

* Our plans for VRAM management with HMM
* Our experience with BO-based VRAM management
* Something else?

IMO, having separate memory pools for HMM and TTM is a non-starter for
AMD. We need access to

Re: Implement svm without BO concept in xe driver

2023-08-15 Thread Felix Kuehling


Hi Oak,

I'm not sure what you're looking for from AMD? Are we just CC'ed FYI? Or 
are you looking for comments about


 * Our plans for VRAM management with HMM
 * Our experience with BO-based VRAM management
 * Something else?

IMO, having separate memory pools for HMM and TTM is a non-starter for 
AMD. We need access to the full VRAM in either of the APIs for it to be 
useful. That also means, we need to handle memory pressure in both 
directions. That's one of the main reasons we went with the BO-based 
approach initially. I think in the long run, using the buddy allocator, 
or the amdgpu_vram_mgr directly for HMM migrations would be better, 
assuming we can handle memory pressure in both directions between HMM 
and TTM sharing the same pool of physical memory.


Regards,
  Felix


On 2023-08-15 16:34, Zeng, Oak wrote:


Also + Christian

Thanks,

Oak

*From:*Intel-xe  *On Behalf Of 
*Zeng, Oak

*Sent:* August 14, 2023 11:38 PM
*To:* Thomas Hellström ; Brost, 
Matthew ; Vishwanathapura, Niranjana 
; Welty, Brian 
; Felix Kuehling ; 
Philip Yang ; intel...@lists.freedesktop.org; 
dri-devel@lists.freedesktop.org

*Subject:* [Intel-xe] Implement svm without BO concept in xe driver

Hi Thomas, Matt and all,

This came up when I port i915 svm codes to xe driver. In i915 
implementation, we have i915_buddy manage gpu vram and svm codes 
directly call i915_buddy layer to allocate/free vram. There is no 
gem_bo/ttm bo concept involved in the svm implementation.


In xe driver,  we have drm_buddy, xe_ttm_vram_mgr and ttm layer to 
manage vram. Drm_buddy is initialized during xe_ttm_vram_mgr 
initialization. Vram allocation/free is done through xe_ttm_vram_mgr 
functions which call into drm_buddy layer to allocate vram blocks.


I plan to implement the xe svm driver the same way as we did in i915, 
which means there will be no bo concept in the svm implementation. 
Drm_buddy will be passed to the svm layer during vram initialization and 
svm will allocate/free memory directly from drm_buddy, bypassing the 
ttm/xe vram manager. Here are a few considerations/things we are 
aware of:


 1. This approach seems to match the hmm design better than the bo concept.
    Our svm implementation will be based on hmm. In the hmm design, each vram
    page is backed by a struct page. It is very easy to perform page
    granularity migrations (b/t vram and system memory). If the BO concept
    is involved, we will have to split/remerge BOs during page
    granularity migrations.

 2. We have a proof of concept of this approach in i915, originally
    implemented by Niranjana. It seems to work but it only has basic
    functionality for now. We don’t have advanced features such as
    memory eviction etc.

 3. With this approach, vram will be divided into two separate pools: one
    for xe_gem_created BOs and one for vram used by svm. Those two
    pools are not connected: memory pressure from one pool won’t be
    able to evict vram from the other pool. At this point, we don’t
    know whether this aspect is good or not.

 4. Amdkfd svm took a different approach, which is BO based. The benefit
    of this approach is that a lot of existing driver facilities (such as
    memory eviction/cgroup/accounting) can be reused.

Do you have any comments on this approach? Should I come back with an 
RFC of some POC code?
Do you have any comment to this approach? Should I come back with a 
RFC of some POC codes?


Thanks,

Oak

Re: [PATCH] drm/amdkfd: fix build failure without CONFIG_DYNAMIC_DEBUG

2023-08-04 Thread Felix Kuehling


On 2023-08-04 9:29, Arnd Bergmann wrote:

From: Arnd Bergmann 

When CONFIG_DYNAMIC_DEBUG is disabled altogether, calling
_dynamic_func_call_no_desc() does not work:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c: In function 'svm_range_set_attr':
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:52:9: error: implicit declaration of function '_dynamic_func_call_no_desc' [-Werror=implicit-function-declaration]
   52 | _dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)
      | ^~
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:3564:9: note: in expansion of macro 'dynamic_svm_range_dump'
 3564 | dynamic_svm_range_dump(svms);
      | ^~

Add a compile-time conditional in addition to the runtime check.

Fixes: 8923137dbe4b2 ("drm/amdkfd: avoid svm dump when dynamic debug disabled")
Signed-off-by: Arnd Bergmann 


The patch is

Reviewed-by: Felix Kuehling 

I'm applying it to amd-staging-drm-next.

Thanks,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 308384dbc502d..44e710821b6d9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -23,6 +23,7 @@
  
  #include 

  #include 
+#include 
  #include 
  #include 
  
@@ -48,8 +49,13 @@

   * page table is updated.
   */
  #define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING  (2UL * NSEC_PER_MSEC)
+#if IS_ENABLED(CONFIG_DYNAMIC_DEBUG)
  #define dynamic_svm_range_dump(svms) \
_dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)
+#else
+#define dynamic_svm_range_dump(svms) \
+   do { if (0) svm_range_debug_dump(svms); } while (0)
+#endif
  
  /* Giant svm range split into smaller ranges based on this, it is decided using

   * minimum of all dGPU/APU 1/32 VRAM size, between 2MB to 1GB and alignment to

Re: [PATCH v2 2/4] drm/amdkfd: use vma_is_initial_stack() and vma_is_initial_heap()

2023-07-19 Thread Felix Kuehling




On 2023-07-19 03:51, Kefeng Wang wrote:

Use the helpers to simplify code.

Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: "Christian König" 
Cc: "Pan, Xinhui" 
Cc: David Airlie 
Cc: Daniel Vetter 
Signed-off-by: Kefeng Wang 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 5ff1a5a89d96..0b7bfbd0cb66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2621,10 +2621,7 @@ svm_range_get_range_boundaries(struct kfd_process *p, 
int64_t addr,
return -EFAULT;
}
  
-	*is_heap_stack = (vma->vm_start <= vma->vm_mm->brk &&

- vma->vm_end >= vma->vm_mm->start_brk) ||
-(vma->vm_start <= vma->vm_mm->start_stack &&
- vma->vm_end >= vma->vm_mm->start_stack);
+   *is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma);
  
  	start_limit = max(vma->vm_start >> PAGE_SHIFT,

  (unsigned long)ALIGN_DOWN(addr, 2UL << 8));
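
To make the removed expression easier to follow, here is a small userspace model of it. It mirrors only the open-coded test deleted by the diff above. The real `vma_is_initial_heap()` and `vma_is_initial_stack()` helpers may differ in detail (for example in how boundaries are compared), and the `demo_` structures are hypothetical, simplified stand-ins for the kernel types.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the mm_struct and
 * vm_area_struct fields used by the removed expression. */
struct demo_mm {
	unsigned long start_brk;	/* start of the brk heap */
	unsigned long brk;		/* current end of the brk heap */
	unsigned long start_stack;	/* address inside the initial stack */
};

struct demo_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	const struct demo_mm *vm_mm;
};

/* Mirrors the removed heap test: the VMA overlaps [start_brk, brk]. */
static bool demo_is_heap(const struct demo_vma *vma)
{
	assert(vma && vma->vm_mm);
	return vma->vm_start <= vma->vm_mm->brk &&
	       vma->vm_end >= vma->vm_mm->start_brk;
}

/* Mirrors the removed stack test: the VMA contains start_stack. */
static bool demo_is_stack(const struct demo_vma *vma)
{
	assert(vma && vma->vm_mm);
	return vma->vm_start <= vma->vm_mm->start_stack &&
	       vma->vm_end >= vma->vm_mm->start_stack;
}

/* Combined predicate, as assigned to *is_heap_stack in the driver. */
static bool demo_is_heap_stack(const struct demo_vma *vma)
{
	return demo_is_heap(vma) || demo_is_stack(vma);
}
```

In this model an anonymous mapping that lies between the brk heap and the initial stack matches neither test, so only the two "initial" regions are flagged.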

Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()

2023-07-14 Thread Felix Kuehling


On 2023-07-14 10:26, Vlastimil Babka wrote:

On 7/12/23 18:24, Felix Kuehling wrote:

Allocations in the heap and stack tend to be small, with several
allocations sharing the same page. Sharing the same page for different
allocations with different access patterns leads to thrashing when we
migrate data back and forth on GPU and CPU access. To avoid this we
disable HMM migrations for heap and stack VMAs.

Wonder how well does it really work in practice? AFAIK "heaps" (malloc())
today uses various arenas obtained by mmap() and not a single brk() managed
space anymore? And programs might be multithreaded, thus have multiple
stacks, while vma_is_stack() will recognize only the initial one...


Thanks for these pointers. I have not heard of such problems with mmap 
arenas and multiple thread stacks in practice. But I'll keep it in mind 
in case we observe unexpected thrashing in the future. FWIW, we once had 
the opposite problem of a custom malloc implementation that used sbrk 
for very large allocations. This disabled migrations of large buffers 
unexpectedly.


I agree that eventually we'll want a more dynamic way of detecting and 
suppressing thrashing that's based on observed memory access patterns. 
Getting this right is probably trickier than it sounds, so I'd prefer to 
have some more experience with real workloads to use as benchmarks. 
Compared to other things we're working on, this is fairly low on our 
priority list at the moment. Using the VMA flags is a simple and 
effective method for now, at least until we see it failing in real 
workloads.


Regards,
  Felix




Vlastimil


Regards,
    Felix


On 2023-07-12 10:42, Christoph Hellwig wrote:

On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote:

Use the helpers to simplify code.

Nothing against your addition of a helper, but a GPU driver really
should have no business even looking at this information..

Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()

2023-07-12 Thread Felix Kuehling

Allocations in the heap and stack tend to be small, with several 
allocations sharing the same page. Sharing the same page for different 
allocations with different access patterns leads to thrashing when we 
migrate data back and forth on GPU and CPU access. To avoid this we 
disable HMM migrations for heap and stack VMAs.


Regards,
  Felix


On 2023-07-12 10:42, Christoph Hellwig wrote:

On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote:

Use the helpers to simplify code.

Nothing against your addition of a helper, but a GPU driver really
should have no business even looking at this information..

Re: [PATCH] drm/amdkfd: Switch over to memdup_user()

2023-06-14 Thread Felix Kuehling




On 2023-06-13 22:04, Jiapeng Chong wrote:

Use memdup_user() rather than duplicating its implementation. This is a
little bit restricted to reduce false positives.

./drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:2813:13-20: WARNING 
opportunity for memdup_user.

Reported-by: Abaci Robot 
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5523
Signed-off-by: Jiapeng Chong 


The kernel test robot is reporting a failure with this patch; it looks 
like you used PTR_ERR incorrectly. Please make sure your patch compiles 
without warnings.


I see more opportunities to use memdup_user in kfd_chardev.c, 
kfd_events.c, kfd_process_queue_manager.c and kfd_svm.c. Do you want to 
fix those, too, while you're at it?


Thanks,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 +++--
  1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index d6b15493fffd..637962d4083c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2810,12 +2810,9 @@ static uint32_t *get_queue_ids(uint32_t num_queues, 
uint32_t *usr_queue_id_array
if (!usr_queue_id_array)
return NULL;
  
-	queue_ids = kzalloc(array_size, GFP_KERNEL);

-   if (!queue_ids)
-   return ERR_PTR(-ENOMEM);
-
-   if (copy_from_user(queue_ids, usr_queue_id_array, array_size))
-   return ERR_PTR(-EFAULT);
+   queue_ids = memdup_user(usr_queue_id_array, array_size);
+   if (IS_ERR(queue_ids))
+   return PTR_ERR(queue_ids);
  
  	return queue_ids;

  }
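
The review comment about `PTR_ERR` can be illustrated with a userspace model of the error-pointer convention. This is a sketch under stated assumptions, not the fix that was eventually merged: the `demo_` helpers are re-creations written for illustration, `demo_memdup()` is a hypothetical stand-in for `memdup_user()`, and the `simulate_fault` parameter exists only to force the error path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace re-creations of the error-pointer helpers (illustrative). */
#define DEMO_MAX_ERRNO 4095

static void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical stand-in for memdup_user(): returns a copy of src or an
 * encoded errno. simulate_fault forces the -EFAULT path. */
static void *demo_memdup(const void *src, size_t len, int simulate_fault)
{
	void *p;

	assert(src != NULL);
	if (simulate_fault)
		return demo_err_ptr(-EFAULT);
	p = malloc(len);
	if (!p)
		return demo_err_ptr(-ENOMEM);
	memcpy(p, src, len);
	return p;
}

/* The function returns a pointer, so on failure it passes the error
 * pointer through unchanged. Returning demo_ptr_err(...) instead would
 * hand a long to a pointer return type, which is the kind of misuse the
 * compiler warns about. */
static uint32_t *demo_get_queue_ids(const uint32_t *usr, size_t n,
				    int simulate_fault)
{
	if (!usr)
		return NULL;
	return demo_memdup(usr, n * sizeof(*usr), simulate_fault);
}
```

A caller of such a function checks the result with the `IS_ERR`-style test before use and converts it to an errno only at the point where an integer is returned.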

Re: [PATCH v2] gpu: drm/amd: Remove the redundant null pointer check in list_for_each_entry() loops

2023-06-12 Thread Felix Kuehling


[+Jon]

On 2023-06-12 07:58, Lu Hongfei wrote:

pqn bound in list_for_each_entry loop will not be null, so there is
no need to check whether pqn is NULL or not.
Thus remove a redundant null pointer check.

Signed-off-by: Lu Hongfei 
---
The filename of the previous version was:
0001-gpu-drm-amd-Fix-the-bug-in-list_for_each_entry-loops.patch

The modifications made compared to the previous version are as follows:
1. Modified the patch title
2. "Thus remove a redundant null pointer check." is used instead of
"We could remove this check."

  drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index cd34e7aaead4..10d0cef844f0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -1097,9 +1097,6 @@ void kfd_dbg_set_enabled_debug_exception_mask(struct 
kfd_process *target,
  
  	pqm = &target->pqm;

list_for_each_entry(pqn, &pqm->queues, process_queue_list) {
-   if (!pqn)


Right, this check doesn't make a lot of sense. Jon, was this meant to 
check pqn->q?


Regards,
  Felix



-   continue;
-
found_mask |= pqn->q->properties.exception_status;
}
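
Why the loop cursor cannot be NULL follows from how the iteration macro builds it: the cursor is computed from a list node by pointer arithmetic, and the loop stops when the node wraps around to the list head. The sketch below is a userspace model with `demo_` names, not the kernel macro itself. The check on `pqn->q` only shows what a member check would look like; whether `q` can actually be NULL in the driver is the open question raised in the reply.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of a circular doubly linked list (illustrative names). */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The cursor is always derived from a non-NULL node pointer; the loop
 * ends when the embedded node is the list head again. */
#define demo_list_for_each_entry(pos, head, type, member)		\
	for (pos = demo_container_of((head)->next, type, member);	\
	     &pos->member != (head);					\
	     pos = demo_container_of(pos->member.next, type, member))

struct demo_queue { unsigned int exception_status; };

struct demo_pqn {
	struct demo_queue *q;		/* may be NULL in this model */
	struct demo_list_head node;
};

static void demo_list_init(struct demo_list_head *h)
{
	h->next = h->prev = h;
}

static void demo_list_add_tail(struct demo_list_head *n,
			       struct demo_list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* ORs the status of every queue; tests the member, never the cursor. */
static unsigned int demo_collect_mask(struct demo_list_head *queues)
{
	struct demo_pqn *pqn;
	unsigned int mask = 0;

	assert(queues != NULL);
	demo_list_for_each_entry(pqn, queues, struct demo_pqn, node) {
		if (!pqn->q)
			continue;
		mask |= pqn->q->exception_status;
	}
	return mask;
}
```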

Re: [PATCH -next] drm/amdkfd: clean up one inconsistent indenting

2023-06-01 Thread Felix Kuehling


On 2023-05-30 22:08, Yang Li wrote:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:1036 kgd2kfd_interrupt() 
warn: inconsistent indenting

Signed-off-by: Yang Li 


Reviewed-by: Felix Kuehling 

I'm applying the patch to amd-staging-drm-next. Thanks!



---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 862a50f7b490..0398a8c52a44 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1033,7 +1033,7 @@ void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
*ih_ring_entry)
is_patched ? patched_ihre : ih_ring_entry)) {
kfd_queue_work(node->ih_wq, &node->interrupt_work);
spin_unlock_irqrestore(&node->interrupt_lock, flags);
-   return;
+   return;
}
spin_unlock_irqrestore(&node->interrupt_lock, flags);
}

Re: [PATCH 32/33] drm/amdkfd: add debug device snapshot operation

2023-05-30 Thread Felix Kuehling


On 2023-05-25 13:27, Jonathan Kim wrote:

Similar to queue snapshot, return an array of device information using
an entry_size check and return.
Unlike queue snapshots, the debugger needs to pass the correct number of
devices that exist.  If it fails to do so, the KFD will return the
number of actual devices so that the debugger can make a subsequent
successful call.

v2: add num_xcc to device snapshot and fixup new kfd_node reference

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  7 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c   | 73 
  drivers/gpu/drm/amd/amdkfd/kfd_debug.h   |  5 ++
  3 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index b24a73fd53af..f522325b409b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -3060,8 +3060,11 @@ static int kfd_ioctl_set_debug_trap(struct file *filep, 
struct kfd_process *p, v
&args->queue_snapshot.entry_size);
break;
case KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT:
-   pr_warn("Debug op %i not supported yet\n", args->op);
-   r = -EACCES;
+   r = kfd_dbg_trap_device_snapshot(target,
+   args->device_snapshot.exception_mask,
+   (void __user 
*)args->device_snapshot.snapshot_buf_ptr,
+   &args->device_snapshot.num_devices,
+   &args->device_snapshot.entry_size);
break;
default:
pr_err("Invalid option: %i\n", args->op);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index 24e2b285448a..125274445f43 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -22,6 +22,7 @@
  
  #include "kfd_debug.h"

  #include "kfd_device_queue_manager.h"
+#include "kfd_topology.h"
  #include 
  #include 
  
@@ -1010,6 +1011,78 @@ int kfd_dbg_trap_query_exception_info(struct kfd_process *target,

return r;
  }
  
+int kfd_dbg_trap_device_snapshot(struct kfd_process *target,

+   uint64_t exception_clear_mask,
+   void __user *user_info,
+   uint32_t *number_of_device_infos,
+   uint32_t *entry_size)
+{
+   struct kfd_dbg_device_info_entry device_info;
+   uint32_t tmp_entry_size = *entry_size, tmp_num_devices;
+   int i, r = 0;
+
+   if (!(target && user_info && number_of_device_infos && entry_size))
+   return -EINVAL;
+
+   tmp_num_devices = min_t(size_t, *number_of_device_infos, 
target->n_pdds);
+   *number_of_device_infos = target->n_pdds;
+   *entry_size = min_t(size_t, *entry_size, sizeof(device_info));
+
+   if (!tmp_num_devices)
+   return 0;
+
+   memset(&device_info, 0, sizeof(device_info));
+
+   mutex_lock(&target->event_mutex);
+
+   /* Run over all pdd of the process */
+   for (i = 0; i < tmp_num_devices; i++) {
+   struct kfd_process_device *pdd = target->pdds[i];
+   struct kfd_topology_device *topo_dev = 
kfd_topology_device_by_id(pdd->dev->id);
+
+   device_info.gpu_id = pdd->dev->id;
+   device_info.exception_status = pdd->exception_status;
+   device_info.lds_base = pdd->lds_base;
+   device_info.lds_limit = pdd->lds_limit;
+   device_info.scratch_base = pdd->scratch_base;
+   device_info.scratch_limit = pdd->scratch_limit;
+   device_info.gpuvm_base = pdd->gpuvm_base;
+   device_info.gpuvm_limit = pdd->gpuvm_limit;
+   device_info.location_id = topo_dev->node_props.location_id;
+   device_info.vendor_id = topo_dev->node_props.vendor_id;
+   device_info.device_id = topo_dev->node_props.device_id;
+   device_info.revision_id = pdd->dev->adev->pdev->revision;
+   device_info.subsystem_vendor_id = 
pdd->dev->adev->pdev->subsystem_vendor;
+   device_info.subsystem_device_id = 
pdd->dev->adev->pdev->subsystem_device;
+   device_info.fw_version = pdd->dev->kfd->mec_fw_version;
+   device_info.gfx_target_version =
+   topo_dev->node_props.gfx_target_version;
+   device_info.simd_count = topo_dev->node_props.simd_count;
+   device_info.max_waves_per_simd =
+   topo_dev->node_props.max_waves_per_simd;
+   device_info.array_count = topo_dev->node_props.array_count
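
The count and entry-size negotiation described in the commit message can be modelled in userspace as follows. This is a sketch under assumptions: the patch is truncated above before its copy loop, so the way entries are strided in the output buffer is a guess, and `struct demo_dev_info` is a hypothetical cut-down record.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down device record for the sketch. */
struct demo_dev_info {
	uint32_t gpu_id;
	uint32_t simd_count;
};

#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Copies at most *num entries of at most *entry_size bytes each into
 * buf. On return *num holds the actual device count and *entry_size the
 * bytes written per entry, so a caller that passed too small a count
 * can retry with the reported value.
 */
static int demo_device_snapshot(const struct demo_dev_info *devs,
				uint32_t n_devs, void *buf,
				uint32_t *num, uint32_t *entry_size)
{
	uint32_t copy_n, stride, i;

	/* Validate every pointer before reading through any of them. */
	if (!devs || !buf || !num || !entry_size)
		return -1;

	copy_n = DEMO_MIN(*num, n_devs);
	stride = *entry_size;	/* assumed: caller's record stride */
	*num = n_devs;
	*entry_size = DEMO_MIN(*entry_size, (uint32_t)sizeof(*devs));
	assert(*entry_size <= stride);

	for (i = 0; i < copy_n; i++)
		memcpy((char *)buf + (size_t)i * stride, &devs[i],
		       *entry_size);
	return 0;
}
```

A caller would typically call once with its best guess, compare the returned count with what it passed, and call again with a larger buffer if needed.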

Re: [PATCH 28/33] drm/amdkfd: add debug set flags operation

2023-05-30 Thread Felix Kuehling




On 2023-05-25 13:27, Jonathan Kim wrote:

Allow the debugger to set single memory and single ALU operations.

Some exceptions are imprecise (memory violations, address watch) in the
sense that a trap occurs only when the exception interrupt occurs and
not at the non-halting faulty instruction.  Trap temporaries 0 & 1 save
the program counter address, which means that these values will not point
to the faulty instruction address but to whenever the interrupt was
raised.

Setting the Single Memory Operations flag will inject an automatic wait
on every memory operation instruction forcing imprecise memory exceptions
to become precise at the cost of performance.  This setting is not
permitted on debug devices that support only a global setting of this
option.

Return the previous set flags to the debugger as well.

v2: fixup with new kfd_node struct reference mes checks

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  2 +
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c   | 58 
  drivers/gpu/drm/amd/amdkfd/kfd_debug.h   |  1 +
  3 files changed, 61 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index e88be582d44d..e5d95b144dcd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -3035,6 +3035,8 @@ static int kfd_ioctl_set_debug_trap(struct file *filep, 
struct kfd_process *p, v
args->clear_node_address_watch.id);
break;
case KFD_IOC_DBG_TRAP_SET_FLAGS:
+   r = kfd_dbg_trap_set_flags(target, &args->set_flags.flags);
+   break;
case KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT:
case KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO:
case KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index 4b36cc8b5fb7..43c3170998d3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -23,6 +23,7 @@
  #include "kfd_debug.h"
  #include "kfd_device_queue_manager.h"
  #include 
+#include 
  
  #define MAX_WATCH_ADDRESSES	4
  
@@ -423,6 +424,59 @@ static void kfd_dbg_clear_process_address_watch(struct kfd_process *target)

kfd_dbg_trap_clear_dev_address_watch(target->pdds[i], 
j);
  }
  
+int kfd_dbg_trap_set_flags(struct kfd_process *target, uint32_t *flags)

+{
+   uint32_t prev_flags = target->dbg_flags;
+   int i, r = 0, rewind_count = 0;
+
+   for (i = 0; i < target->n_pdds; i++) {
+   if (!kfd_dbg_is_per_vmid_supported(target->pdds[i]->dev) &&
+   (*flags & KFD_DBG_TRAP_FLAG_SINGLE_MEM_OP)) {
+   *flags = prev_flags;
+   return -EACCES;
+   }
+   }
+
+   target->dbg_flags = *flags & KFD_DBG_TRAP_FLAG_SINGLE_MEM_OP;
+   *flags = prev_flags;
+   for (i = 0; i < target->n_pdds; i++) {
+   struct kfd_process_device *pdd = target->pdds[i];
+
+   if (!kfd_dbg_is_per_vmid_supported(pdd->dev))
+   continue;
+
+   if (!pdd->dev->kfd->shared_resources.enable_mes)
+   r = debug_refresh_runlist(pdd->dev->dqm);
+   else
+   r = kfd_dbg_set_mes_debug_mode(pdd);
+
+   if (r) {
+   target->dbg_flags = prev_flags;
+   break;
+   }
+
+   rewind_count++;
+   }
+
+   /* Rewind flags */
+   if (r) {
+   target->dbg_flags = prev_flags;
+
+   for (i = 0; i < rewind_count; i++) {
+   struct kfd_process_device *pdd = target->pdds[i];
+
+   if (!kfd_dbg_is_per_vmid_supported(pdd->dev))
+   continue;
+
+   if (!pdd->dev->kfd->shared_resources.enable_mes)
+   debug_refresh_runlist(pdd->dev->dqm);
+   else
+   kfd_dbg_set_mes_debug_mode(pdd);
+   }
+   }
+
+   return r;
+}
  
  /* kfd_dbg_trap_deactivate:

   *target: target process
@@ -437,9 +491,13 @@ void kfd_dbg_trap_deactivate(struct kfd_process *target, 
bool unwind, int unwind
int i;
  
  	if (!unwind) {

+   uint32_t flags = 0;
+
cancel_work_sync(&target->debug_event_workarea);
kfd_dbg_clear_process_address_watch(target);
kfd_dbg_trap_set_wave_launch_mode(target, 0);
+
+   kfd_dbg_trap_set_flags(target, &flags);
}
  
  	for (i = 0; i < target->n_pdds; i++) {

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.h 
b/drivers/gpu
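
The flag update in this patch follows an apply-then-rewind pattern: validate first, apply to each capable device, and restore the previous state if any device fails. Below is a userspace model of that pattern, offered as a sketch rather than the driver logic. All `demo_` names, the flag value, and the `fail_refresh` switch are hypothetical. The sketch rewinds by device index instead of by a count of updated devices, a choice made here for simplicity.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_DEVS 4
#define DEMO_FLAG_SINGLE_MEM_OP 0x1u	/* hypothetical flag value */

struct demo_dev {
	int per_vmid_supported;	/* flag can be applied per process */
	int fail_refresh;	/* force the hardware update to fail */
	uint32_t hw_flags;	/* flags the "hardware" currently uses */
};

struct demo_process {
	uint32_t dbg_flags;
	int n_devs;
	struct demo_dev devs[DEMO_MAX_DEVS];
};

/* Pushes the process flags to one device (stand-in for the refresh). */
static int demo_refresh(struct demo_process *p, struct demo_dev *d)
{
	if (d->fail_refresh)
		return -EIO;
	d->hw_flags = p->dbg_flags;
	return 0;
}

/* Applies *flags to all capable devices and undoes the update if one
 * device fails. *flags returns the previously set flags. */
static int demo_set_flags(struct demo_process *p, uint32_t *flags)
{
	uint32_t prev;
	int i, failed = -1;

	assert(p && flags && p->n_devs <= DEMO_MAX_DEVS);
	prev = p->dbg_flags;

	/* Reject up front if a device only has a global setting. */
	for (i = 0; i < p->n_devs; i++) {
		if (!p->devs[i].per_vmid_supported &&
		    (*flags & DEMO_FLAG_SINGLE_MEM_OP)) {
			*flags = prev;
			return -EACCES;
		}
	}

	p->dbg_flags = *flags & DEMO_FLAG_SINGLE_MEM_OP;
	*flags = prev;

	for (i = 0; i < p->n_devs; i++) {
		if (!p->devs[i].per_vmid_supported)
			continue;
		if (demo_refresh(p, &p->devs[i])) {
			failed = i;
			break;
		}
	}
	if (failed < 0)
		return 0;

	/* Rewind: restore the old flags on devices updated so far. */
	p->dbg_flags = prev;
	for (i = 0; i < failed; i++) {
		if (p->devs[i].per_vmid_supported)
			(void)demo_refresh(p, &p->devs[i]);
	}
	return -EIO;
}
```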

Re: [PATCH 27/33] drm/amdkfd: add debug set and clear address watch points operation

2023-05-30 Thread Felix Kuehling


On 2023-05-25 13:27, Jonathan Kim wrote:

Shader read, write and atomic memory operations can be alerted to the
debugger as an address watch exception.

Allow the debugger to pass in a watch point to a particular memory
address per device.

Note that there exist only 4 watch points per device to date, so have
the KFD keep track of what watch points are allocated or not.

v2: fixup with new kfd_node struct reference for mes and watch point
checks

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  .../drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c  |  51 +++
  .../drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c   |   2 +
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c|  78 ++
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h|   8 ++
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10_3.c  |   5 +-
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c|  52 ++-
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c |  77 ++
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h |   8 ++
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  24 
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c| 136 ++
  drivers/gpu/drm/amd/amdkfd/kfd_debug.h|   8 +-
  drivers/gpu/drm/amd/amdkfd/kfd_device.c   |   2 +
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   6 +-
  13 files changed, 452 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
index 774ecfc3451a..efd6a72aab4e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
@@ -118,6 +118,55 @@ static uint32_t kgd_aldebaran_set_wave_launch_mode(struct 
amdgpu_device *adev,
return data;
  }
  
+#define TCP_WATCH_STRIDE (regTCP_WATCH1_ADDR_H - regTCP_WATCH0_ADDR_H)

+static uint32_t kgd_gfx_aldebaran_set_address_watch(
+   struct amdgpu_device *adev,
+   uint64_t watch_address,
+   uint32_t watch_address_mask,
+   uint32_t watch_id,
+   uint32_t watch_mode,
+   uint32_t debug_vmid)
+{
+   uint32_t watch_address_high;
+   uint32_t watch_address_low;
+   uint32_t watch_address_cntl;
+
+   watch_address_cntl = 0;
+   watch_address_low = lower_32_bits(watch_address);
+   watch_address_high = upper_32_bits(watch_address) & 0x;
+
+   watch_address_cntl = REG_SET_FIELD(watch_address_cntl,
+   TCP_WATCH0_CNTL,
+   MODE,
+   watch_mode);
+
+   watch_address_cntl = REG_SET_FIELD(watch_address_cntl,
+   TCP_WATCH0_CNTL,
+   MASK,
+   watch_address_mask >> 6);
+
+   watch_address_cntl = REG_SET_FIELD(watch_address_cntl,
+   TCP_WATCH0_CNTL,
+   VALID,
+   1);
+
+   WREG32_RLC((SOC15_REG_OFFSET(GC, 0, regTCP_WATCH0_ADDR_H) +
+   (watch_id * TCP_WATCH_STRIDE)),
+   watch_address_high);
+
+   WREG32_RLC((SOC15_REG_OFFSET(GC, 0, regTCP_WATCH0_ADDR_L) +
+   (watch_id * TCP_WATCH_STRIDE)),
+   watch_address_low);
+
+   return watch_address_cntl;
+}
+
+uint32_t kgd_gfx_aldebaran_clear_address_watch(struct amdgpu_device *adev,
+   uint32_t watch_id)
+{
+   return 0;
+}
+
  const struct kfd2kgd_calls aldebaran_kfd2kgd = {
.program_sh_mem_settings = kgd_gfx_v9_program_sh_mem_settings,
.set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
@@ -141,6 +190,8 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
.validate_trap_override_request = 
kgd_aldebaran_validate_trap_override_request,
.set_wave_launch_trap_override = 
kgd_aldebaran_set_wave_launch_trap_override,
.set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
+   .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
+   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
.get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
.build_grace_period_packet_info = 
kgd_gfx_v9_build_grace_period_packet_info,
.program_trap_handler_settings = 
kgd_gfx_v9_program_trap_handler_settings,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
index fbdc1b7b1e42..6df215aba4c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
@@ -413,6 +413,8 @@ const struct kfd2kgd_calls arcturus_kfd2kgd = {
.validate_trap_override_request = 
kgd_gfx_v9_validate_trap_override_request,
.set_wave_launch_tra

Re: [PATCH 26/33] drm/amdkfd: add debug suspend and resume process queues operation

2023-05-30 Thread Felix Kuehling


On 2023-05-25 13:27, Jonathan Kim wrote:

In order to inspect waves from the saved context at any point during a
debug session, the debugger must be able to preempt queues to trigger
context save by suspending them.

On queue suspend, the KFD will copy the context save header information
so that the debugger can correctly crawl the appropriate size of the saved
context. The debugger must then also be allowed to resume suspended queues.

A queue that is newly created cannot be suspended because queue ids are
recycled after destruction so the debugger needs to know that this has
occurred.  Query functions will be later added that will clear a given
queue of its new queue status.

A queue cannot be destroyed while it is suspended to preserve its saved
context during debugger inspection.  Have queue destruction block while
a queue is suspended and unblocked when it is resumed.  Likewise, if a
queue is about to be destroyed, it cannot be suspended.

Return the number of queues successfully suspended or resumed along with
a per queue status array where the upper bits per queue status show that
the request was invalid (new/destroyed queue suspend request, missing
queue) or an error occurred (HWS in a fatal state so it can't suspend or
resume queues).

v2: fixup new kfd_node struct reference for mes fw check.
also fixup missing EC_QUEUE_NEW flagging on newly created queue.

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|   5 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|   1 +
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  11 +
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c|   7 +
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 447 +-
  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  10 +
  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |  10 +
  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c  |  15 +-
  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |  14 +-
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   5 +-
  .../amd/amdkfd/kfd_process_queue_manager.c|   1 +
  11 files changed, 512 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 98cd52bb005f..b4fcad0e62f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -772,6 +772,11 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
amdgpu_device *adev)
return adev->have_atomics_support;
  }
  
+void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev)

+{
+   amdgpu_device_flush_hdp(adev, NULL);
+}
+
  void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, 
bool reset)
  {
amdgpu_umc_poison_handler(adev, reset);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index dd740e64e6e1..2d0406bff84e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -322,6 +322,7 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device 
*adev,
  uint64_t *mmap_offset);
  int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem,
  struct dma_buf **dmabuf);
+void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device *adev);
  int amdgpu_amdkfd_get_tile_config(struct amdgpu_device *adev,
struct tile_config *config);
  void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 4b45d4539d48..adda60273456 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -410,6 +410,7 @@ static int kfd_ioctl_create_queue(struct file *filep, 
struct kfd_process *p,
pr_debug("Write ptr address   == 0x%016llX\n",
args->write_pointer_address);
  
+	kfd_dbg_ev_raise(KFD_EC_MASK(EC_QUEUE_NEW), p, dev, queue_id, false, NULL, 0);

return 0;
  
  err_create_queue:

@@ -2996,7 +2997,17 @@ static int kfd_ioctl_set_debug_trap(struct file *filep, 
struct kfd_process *p, v
args->launch_mode.launch_mode);
break;
case KFD_IOC_DBG_TRAP_SUSPEND_QUEUES:
+   r = suspend_queues(target,
+   args->suspend_queues.num_queues,
+   args->suspend_queues.grace_period,
+   args->suspend_queues.exception_mask,
+   (uint32_t 
*)args->suspend_queues.queue_array_ptr);
+
+   break;
case KFD_IOC_DBG_TRAP_RESUME_QUEUES:
+   r = resume_queues(target, args->resume_queues.num_queues,
+   (uint32_t 
*)args->resume_queues.queue_array
