Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access

2021-04-06 Thread Christian König

Hi Qu,

Am 06.04.21 um 08:04 schrieb Qu Huang:

Hi Christian,

On 2021/4/3 16:49, Christian König wrote:

Hi Qu,

Am 03.04.21 um 07:08 schrieb Qu Huang:

Hi Christian,

On 2021/4/3 0:25, Christian König wrote:

Hi Qu,

Am 02.04.21 um 05:18 schrieb Qu Huang:

Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(),


That can't happen since when bo_release_notify is called the BO has no
more references and is therefore deleted.

And we never evict a deleted BO, we just wait for it to become idle.


Yes, when the BO reference counter drops to zero we enter
ttm_bo_release(), but the release notification (the call to
amdgpu_bo_release_notify()) happens first; only then does it test
whether the reservation object's fences have been signaled, mark the
BO as deleted, and remove it from the LRU list.

When ttm_bo_release() and ttm_mem_evict_first() run concurrently, the
BO has not yet been removed from the LRU list and is not marked as
deleted, so this can happen.


Not sure which code base you are on, but I don't see how this can happen.


ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and
ttm_bo_release() is only called when the BO reference count becomes zero.


So ttm_mem_evict_first() will see that this BO is about to be destroyed
and skips it.
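
To make that guard concrete, here is a minimal stand-alone C sketch of
the "get unless zero" idea (hypothetical bo type, user-space atomics;
not the TTM implementation):

#include <stdatomic.h>
#include <stdbool.h>

struct bo {
    atomic_int refcount;
};

/* Take a reference only if the object is not already being destroyed. */
static bool bo_get_unless_zero(struct bo *bo)
{
    int old = atomic_load(&bo->refcount);

    while (old != 0) {
        /* CAS retries until we either bump a non-zero count or
         * observe that it has dropped to zero. */
        if (atomic_compare_exchange_weak(&bo->refcount, &old, old + 1))
            return true;
    }
    return false; /* count hit zero: the release path owns the BO now */
}

Either the evictor takes its reference before the count reaches zero,
or it observes zero and skips the BO, so the race window described
above cannot open.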



Yes, you are right. My version of TTM is from ROCm 3.3, where
ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(); I checked
that the ROCm 4.0 TTM does not have this issue. This was an oversight
on my part.



As a test, when we use CPU memset instead of SDMA fill in
amdgpu_bo_release_notify(), the result is a page fault:

PID: 5490   TASK: 8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
  #0 [8e79eaa17970] machine_kexec at b2863784
  #1 [8e79eaa179d0] __crash_kexec at b291ce92
  #2 [8e79eaa17aa0] crash_kexec at b291cf80
  #3 [8e79eaa17ab8] oops_end at b2f6c768
  #4 [8e79eaa17ae0] no_context at b2f5aaa6
  #5 [8e79eaa17b30] __bad_area_nosemaphore at b2f5ab3d
  #6 [8e79eaa17b80] bad_area_nosemaphore at b2f5acae
  #7 [8e79eaa17b90] __do_page_fault at b2f6f6c0
  #8 [8e79eaa17c00] do_page_fault at b2f6f925
  #9 [8e79eaa17c30] page_fault at b2f6b758
 [exception RIP: memset+31]
 RIP: b2b8668f  RSP: 8e79eaa17ce8  RFLAGS: 00010a17
 RAX: bebebebebebebebe  RBX: 8e747bff10c0  RCX: 060b0020
 RDX:   RSI: 00be  RDI: ab807f00
 RBP: 8e79eaa17d10   R8: 8e79eaa14000   R9: ab7c8000
 R10: bcba  R11: 01ba  R12: 8e79ebaa4050
 R13: ab7c8000  R14: 00022600  R15: 8e8136e04100
 ORIG_RAX:   CS: 0010  SS: 0018
#10 [8e79eaa17ce8] amdgpu_bo_release_notify at c092f2d1 [amdgpu]
#11 [8e79eaa17d18] ttm_bo_release at c08f39dd [amdttm]
#12 [8e79eaa17d58] amdttm_bo_put at c08f3c8c [amdttm]
#13 [8e79eaa17d68] amdttm_bo_vm_close at c08f7ac9 [amdttm]
#14 [8e79eaa17d80] remove_vma at b29ef115
#15 [8e79eaa17da0] exit_mmap at b29f2c64
#16 [8e79eaa17e58] mmput at b28940c7
#17 [8e79eaa17e78] do_exit at b289dc95
#18 [8e79eaa17f10] do_group_exit at b289e4cf
#19 [8e79eaa17f40] sys_exit_group at b289e544
#20 [8e79eaa17f50] system_call_fastpath at b2f74ddb


Well that might be perfectly expected. VRAM is not necessarily CPU
accessible.


As a test, I used CPU memset instead of SDMA fill. This is my code:
void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
{
    struct amdgpu_bo *abo;
    uint64_t num_pages;
    struct drm_mm_node *mm_node;
    struct amdgpu_device *adev;
    void __iomem *kaddr;

    if (!amdgpu_bo_is_amdgpu_bo(bo))
        return;

    abo = ttm_to_amdgpu_bo(bo);
    num_pages = abo->tbo.num_pages;
    mm_node = abo->tbo.mem.mm_node;
    adev = amdgpu_ttm_adev(abo->tbo.bdev);
    kaddr = adev->mman.aper_base_kaddr;

    if (abo->kfd_bo)
        amdgpu_amdkfd_unreserve_memory_limit(abo);

    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
        !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
        return;

    dma_resv_lock(amdkcl_ttm_resvp(bo), NULL);
    while (num_pages && mm_node) {
        void *ptr = kaddr + (mm_node->start << PAGE_SHIFT);


That might not work as expected.

aper_base_kaddr can only point to a 256MiB window into VRAM, but VRAM
itself is usually much larger.

So your memset_io() might end up in nirvana if the BO is allocated
outside of the window.
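
For illustration, the bounds check such a CPU wipe would need — assuming
adev->gmc.visible_vram_size describes the VRAM range mapped at
adev->mman.aper_base_kaddr; this is a sketch, not the driver's actual
wipe path:

/* Poison one drm_mm_node through the BAR window, refusing nodes that
 * fall outside the CPU-visible part of VRAM. Illustrative only. */
static int vram_node_cpu_poison(struct amdgpu_device *adev,
                                struct drm_mm_node *mm_node)
{
    uint64_t offset = (uint64_t)mm_node->start << PAGE_SHIFT;
    uint64_t size = (uint64_t)mm_node->size << PAGE_SHIFT;

    /* aper_base_kaddr only covers [0, visible_vram_size); anything
     * beyond it is not CPU-accessible through this mapping. */
    if (offset + size > adev->gmc.visible_vram_size)
        return -EFAULT;

    memset_io(adev->mman.aper_base_kaddr + offset,
              AMDGPU_POISON & 0xff, size);
    return 0;
}

Even with the check, nodes beyond the window could only be rejected,
not wiped, which is one reason the driver clears VRAM with the SDMA
fill instead.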



        memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size << PAGE_SHIFT);

Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access

2021-04-06 Thread Qu Huang

Hi Christian,

On 2021/4/3 16:49, Christian König wrote:

Hi Qu,

Am 03.04.21 um 07:08 schrieb Qu Huang:

Hi Christian,

On 2021/4/3 0:25, Christian König wrote:

Hi Qu,

Am 02.04.21 um 05:18 schrieb Qu Huang:

Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(),


That can't happen since when bo_release_notify is called the BO has no
more references and is therefore deleted.

And we never evict a deleted BO, we just wait for it to become idle.


Yes, when the BO reference counter drops to zero we enter
ttm_bo_release(), but the release notification (the call to
amdgpu_bo_release_notify()) happens first; only then does it test
whether the reservation object's fences have been signaled, mark the
BO as deleted, and remove it from the LRU list.

When ttm_bo_release() and ttm_mem_evict_first() run concurrently, the
BO has not yet been removed from the LRU list and is not marked as
deleted, so this can happen.


Not sure which code base you are on, but I don't see how this can happen.

ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and
ttm_bo_release() is only called when the BO reference count becomes zero.

So ttm_mem_evict_first() will see that this BO is about to be destroyed
and skips it.



Yes, you are right. My version of TTM is from ROCm 3.3, where
ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(); I checked
that the ROCm 4.0 TTM does not have this issue. This was an oversight
on my part.



As a test, when we use CPU memset instead of SDMA fill in
amdgpu_bo_release_notify(), the result is a page fault:

PID: 5490   TASK: 8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
  #0 [8e79eaa17970] machine_kexec at b2863784
  #1 [8e79eaa179d0] __crash_kexec at b291ce92
  #2 [8e79eaa17aa0] crash_kexec at b291cf80
  #3 [8e79eaa17ab8] oops_end at b2f6c768
  #4 [8e79eaa17ae0] no_context at b2f5aaa6
  #5 [8e79eaa17b30] __bad_area_nosemaphore at b2f5ab3d
  #6 [8e79eaa17b80] bad_area_nosemaphore at b2f5acae
  #7 [8e79eaa17b90] __do_page_fault at b2f6f6c0
  #8 [8e79eaa17c00] do_page_fault at b2f6f925
  #9 [8e79eaa17c30] page_fault at b2f6b758
 [exception RIP: memset+31]
 RIP: b2b8668f  RSP: 8e79eaa17ce8  RFLAGS: 00010a17
 RAX: bebebebebebebebe  RBX: 8e747bff10c0  RCX: 060b0020
 RDX:   RSI: 00be  RDI: ab807f00
 RBP: 8e79eaa17d10   R8: 8e79eaa14000   R9: ab7c8000
 R10: bcba  R11: 01ba  R12: 8e79ebaa4050
 R13: ab7c8000  R14: 00022600  R15: 8e8136e04100
 ORIG_RAX:   CS: 0010  SS: 0018
#10 [8e79eaa17ce8] amdgpu_bo_release_notify at c092f2d1 [amdgpu]
#11 [8e79eaa17d18] ttm_bo_release at c08f39dd [amdttm]
#12 [8e79eaa17d58] amdttm_bo_put at c08f3c8c [amdttm]
#13 [8e79eaa17d68] amdttm_bo_vm_close at c08f7ac9 [amdttm]
#14 [8e79eaa17d80] remove_vma at b29ef115
#15 [8e79eaa17da0] exit_mmap at b29f2c64
#16 [8e79eaa17e58] mmput at b28940c7
#17 [8e79eaa17e78] do_exit at b289dc95
#18 [8e79eaa17f10] do_group_exit at b289e4cf
#19 [8e79eaa17f40] sys_exit_group at b289e544
#20 [8e79eaa17f50] system_call_fastpath at b2f74ddb


Well that might be perfectly expected. VRAM is not necessarily CPU
accessible.


As a test, I used CPU memset instead of SDMA fill. This is my code:
void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
{
    struct amdgpu_bo *abo;
    uint64_t num_pages;
    struct drm_mm_node *mm_node;
    struct amdgpu_device *adev;
    void __iomem *kaddr;

    if (!amdgpu_bo_is_amdgpu_bo(bo))
        return;

    abo = ttm_to_amdgpu_bo(bo);
    num_pages = abo->tbo.num_pages;
    mm_node = abo->tbo.mem.mm_node;
    adev = amdgpu_ttm_adev(abo->tbo.bdev);
    kaddr = adev->mman.aper_base_kaddr;

    if (abo->kfd_bo)
        amdgpu_amdkfd_unreserve_memory_limit(abo);

    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
        !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
        return;

    dma_resv_lock(amdkcl_ttm_resvp(bo), NULL);
    while (num_pages && mm_node) {
        void *ptr = kaddr + (mm_node->start << PAGE_SHIFT);
        memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size << PAGE_SHIFT);
        num_pages -= mm_node->size;
        ++mm_node;
    }
    dma_resv_unlock(amdkcl_ttm_resvp(bo));
}

amdgpu_bo_release_notify() will then hold the bo->base.resv lock, and
SDMA will get an invalid address in amdgpu_fill_buffer(), resulting in
a VMFAULT or memory corruption.

Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access

2021-04-03 Thread Christian König

Hi Qu,

Am 03.04.21 um 07:08 schrieb Qu Huang:

Hi Christian,

On 2021/4/3 0:25, Christian König wrote:

Hi Qu,

Am 02.04.21 um 05:18 schrieb Qu Huang:
Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(),


That can't happen since when bo_release_notify is called the BO has no
more references and is therefore deleted.

And we never evict a deleted BO, we just wait for it to become idle.


Yes, when the BO reference counter drops to zero we enter
ttm_bo_release(), but the release notification (the call to
amdgpu_bo_release_notify()) happens first; only then does it test
whether the reservation object's fences have been signaled, mark the
BO as deleted, and remove it from the LRU list.

When ttm_bo_release() and ttm_mem_evict_first() run concurrently, the
BO has not yet been removed from the LRU list and is not marked as
deleted, so this can happen.


Not sure which code base you are on, but I don't see how this can happen.

ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and 
ttm_bo_release() is only called when the BO reference count becomes zero.


So ttm_mem_evict_first() will see that this BO is about to be destroyed 
and skips it.




As a test, when we use CPU memset instead of SDMA fill in
amdgpu_bo_release_notify(), the result is a page fault:

PID: 5490   TASK: 8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
  #0 [8e79eaa17970] machine_kexec at b2863784
  #1 [8e79eaa179d0] __crash_kexec at b291ce92
  #2 [8e79eaa17aa0] crash_kexec at b291cf80
  #3 [8e79eaa17ab8] oops_end at b2f6c768
  #4 [8e79eaa17ae0] no_context at b2f5aaa6
  #5 [8e79eaa17b30] __bad_area_nosemaphore at b2f5ab3d
  #6 [8e79eaa17b80] bad_area_nosemaphore at b2f5acae
  #7 [8e79eaa17b90] __do_page_fault at b2f6f6c0
  #8 [8e79eaa17c00] do_page_fault at b2f6f925
  #9 [8e79eaa17c30] page_fault at b2f6b758
 [exception RIP: memset+31]
 RIP: b2b8668f  RSP: 8e79eaa17ce8  RFLAGS: 00010a17
 RAX: bebebebebebebebe  RBX: 8e747bff10c0  RCX: 060b0020
 RDX:   RSI: 00be  RDI: ab807f00
 RBP: 8e79eaa17d10   R8: 8e79eaa14000   R9: ab7c8000
 R10: bcba  R11: 01ba  R12: 8e79ebaa4050
 R13: ab7c8000  R14: 00022600  R15: 8e8136e04100
 ORIG_RAX:   CS: 0010  SS: 0018
#10 [8e79eaa17ce8] amdgpu_bo_release_notify at c092f2d1 [amdgpu]
#11 [8e79eaa17d18] ttm_bo_release at c08f39dd [amdttm]
#12 [8e79eaa17d58] amdttm_bo_put at c08f3c8c [amdttm]
#13 [8e79eaa17d68] amdttm_bo_vm_close at c08f7ac9 [amdttm]
#14 [8e79eaa17d80] remove_vma at b29ef115
#15 [8e79eaa17da0] exit_mmap at b29f2c64
#16 [8e79eaa17e58] mmput at b28940c7
#17 [8e79eaa17e78] do_exit at b289dc95
#18 [8e79eaa17f10] do_group_exit at b289e4cf
#19 [8e79eaa17f40] sys_exit_group at b289e544
#20 [8e79eaa17f50] system_call_fastpath at b2f74ddb


Well that might be perfectly expected. VRAM is not necessarily CPU 
accessible.


Regards,
Christian.



Regards,
Qu.



Regards,
Christian.


and the VRAM mem will be evicted, with the mem region replaced by a
GTT mem region. amdgpu_bo_release_notify() will then hold the
bo->base.resv lock, and SDMA will get an invalid address in
amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption.

To avoid this, we have to take the bo->base.resv lock first, and only
then check whether mem.mem_type is still TTM_PL_VRAM.

Signed-off-by: Qu Huang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
  if (bo->base.resv == &bo->base._resv)
  amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);

-    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
-    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+    if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
  return;

  dma_resv_lock(bo->base.resv, NULL);

+    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+    dma_resv_unlock(bo->base.resv);
+    return;
+    }
+
  r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);

  if (!WARN_ON(r)) {
  amdgpu_bo_fence(abo, fence, false);
--
1.8.3.1
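
The shape of the fix is the classic re-check-under-the-lock pattern:
testing mem.mem_type before taking bo->base.resv is a
time-of-check/time-of-use race, since eviction can swap the placement
to GTT in between. A generic sketch of the before/after (illustrative,
condensed from the diff above):

/* Racy (old code): check, then lock. Between the check and the lock,
 * ttm_mem_evict_first() can move the BO from VRAM to GTT. */
if (bo->mem.mem_type != TTM_PL_VRAM)
    return;
dma_resv_lock(bo->base.resv, NULL); /* too late: placement may differ */

/* Fixed (new code): lock, then check. While bo->base.resv is held the
 * placement cannot change, so the VRAM address handed to the SDMA
 * fill stays valid. */
dma_resv_lock(bo->base.resv, NULL);
if (bo->mem.mem_type != TTM_PL_VRAM) {
    dma_resv_unlock(bo->base.resv);
    return;
}
/* ... amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); ... */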







Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access

2021-04-02 Thread Qu Huang

Hi Christian,

On 2021/4/3 0:25, Christian König wrote:

Hi Qu,

Am 02.04.21 um 05:18 schrieb Qu Huang:

Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(),


That can't happen since when bo_release_notify is called the BO has no
more references and is therefore deleted.

And we never evict a deleted BO, we just wait for it to become idle.


Yes, when the BO reference counter drops to zero we enter
ttm_bo_release(), but the release notification (the call to
amdgpu_bo_release_notify()) happens first; only then does it test
whether the reservation object's fences have been signaled, mark the
BO as deleted, and remove it from the LRU list.

When ttm_bo_release() and ttm_mem_evict_first() run concurrently, the
BO has not yet been removed from the LRU list and is not marked as
deleted, so this can happen.

As a test, when we use CPU memset instead of SDMA fill in
amdgpu_bo_release_notify(), the result is a page fault:

PID: 5490   TASK: 8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
  #0 [8e79eaa17970] machine_kexec at b2863784
  #1 [8e79eaa179d0] __crash_kexec at b291ce92
  #2 [8e79eaa17aa0] crash_kexec at b291cf80
  #3 [8e79eaa17ab8] oops_end at b2f6c768
  #4 [8e79eaa17ae0] no_context at b2f5aaa6
  #5 [8e79eaa17b30] __bad_area_nosemaphore at b2f5ab3d
  #6 [8e79eaa17b80] bad_area_nosemaphore at b2f5acae
  #7 [8e79eaa17b90] __do_page_fault at b2f6f6c0
  #8 [8e79eaa17c00] do_page_fault at b2f6f925
  #9 [8e79eaa17c30] page_fault at b2f6b758
 [exception RIP: memset+31]
 RIP: b2b8668f  RSP: 8e79eaa17ce8  RFLAGS: 00010a17
 RAX: bebebebebebebebe  RBX: 8e747bff10c0  RCX: 060b0020
 RDX:   RSI: 00be  RDI: ab807f00
 RBP: 8e79eaa17d10   R8: 8e79eaa14000   R9: ab7c8000
 R10: bcba  R11: 01ba  R12: 8e79ebaa4050
 R13: ab7c8000  R14: 00022600  R15: 8e8136e04100
 ORIG_RAX:   CS: 0010  SS: 0018
#10 [8e79eaa17ce8] amdgpu_bo_release_notify at c092f2d1 [amdgpu]
#11 [8e79eaa17d18] ttm_bo_release at c08f39dd [amdttm]
#12 [8e79eaa17d58] amdttm_bo_put at c08f3c8c [amdttm]
#13 [8e79eaa17d68] amdttm_bo_vm_close at c08f7ac9 [amdttm]
#14 [8e79eaa17d80] remove_vma at b29ef115
#15 [8e79eaa17da0] exit_mmap at b29f2c64
#16 [8e79eaa17e58] mmput at b28940c7
#17 [8e79eaa17e78] do_exit at b289dc95
#18 [8e79eaa17f10] do_group_exit at b289e4cf
#19 [8e79eaa17f40] sys_exit_group at b289e544
#20 [8e79eaa17f50] system_call_fastpath at b2f74ddb

Regards,
Qu.



Regards,
Christian.


and the VRAM mem will be evicted, with the mem region replaced by a
GTT mem region. amdgpu_bo_release_notify() will then hold the
bo->base.resv lock, and SDMA will get an invalid address in
amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption.

To avoid this, we have to take the bo->base.resv lock first, and only
then check whether mem.mem_type is still TTM_PL_VRAM.

Signed-off-by: Qu Huang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
  if (bo->base.resv == &bo->base._resv)
  amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);

-    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
-    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+    if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
  return;

  dma_resv_lock(bo->base.resv, NULL);

+    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+    dma_resv_unlock(bo->base.resv);
+    return;
+    }
+
  r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
  if (!WARN_ON(r)) {
  amdgpu_bo_fence(abo, fence, false);
--
1.8.3.1





Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access

2021-04-02 Thread Christian König

Hi Qu,

Am 02.04.21 um 05:18 schrieb Qu Huang:

Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(),


That can't happen since when bo_release_notify is called the BO has no
more references and is therefore deleted.


And we never evict a deleted BO, we just wait for it to become idle.

Regards,
Christian.


and the VRAM mem will be evicted, with the mem region replaced by a
GTT mem region. amdgpu_bo_release_notify() will then hold the
bo->base.resv lock, and SDMA will get an invalid address in
amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption.

To avoid this, we have to take the bo->base.resv lock first, and only
then check whether mem.mem_type is still TTM_PL_VRAM.

Signed-off-by: Qu Huang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
	if (bo->base.resv == &bo->base._resv)
amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);

-   if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
-   !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+   if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
return;

dma_resv_lock(bo->base.resv, NULL);

+   if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+   dma_resv_unlock(bo->base.resv);
+   return;
+   }
+
	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
if (!WARN_ON(r)) {
amdgpu_bo_fence(abo, fence, false);
--
1.8.3.1





[PATCH] drm/amdgpu: Fix a potential sdma invalid access

2021-04-01 Thread Qu Huang
Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(), and the
VRAM mem will be evicted, with the mem region replaced by a GTT mem
region. amdgpu_bo_release_notify() will then hold the bo->base.resv
lock, and SDMA will get an invalid address in amdgpu_fill_buffer(),
resulting in a VMFAULT or memory corruption.

To avoid this, we have to take the bo->base.resv lock first, and only
then check whether mem.mem_type is still TTM_PL_VRAM.

Signed-off-by: Qu Huang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
	if (bo->base.resv == &bo->base._resv)
amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);

-   if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
-   !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+   if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
return;

dma_resv_lock(bo->base.resv, NULL);

+   if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+   dma_resv_unlock(bo->base.resv);
+   return;
+   }
+
	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
if (!WARN_ON(r)) {
amdgpu_bo_fence(abo, fence, false);
--
1.8.3.1