Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-30 Thread Christian König

On 25.03.20 at 12:03, Nirmoy wrote:


On 3/25/20 10:23 AM, Pan, Xinhui wrote:


On 2020-03-25 at 15:48, Koenig, Christian wrote:



On 25.03.20 at 06:47, xinhui pan wrote:

Hit a panic during the GPU recovery test. drm_sched_entity_select_rq might
set rq to NULL, so add a check like drm_sched_job_init does.

NAK, the rq should never be set to NULL in the first place.

How did that happen?

Well, I have not checked the details, but I just got the call trace below.
It looks like the sched is not ready, so drm_sched_entity_select_rq sets
entity->rq to NULL.
Then, in the next amdgpu_vm_sdma_commit, we hit a panic when we dereference
entity->rq.
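
For reference, the check the commit message points to sits at the top of drm_sched_job_init() in drivers/gpu/drm/scheduler/sched_main.c. Roughly (an abridged sketch, not an exact quote of the upstream function):

int drm_sched_job_init(struct drm_sched_job *job,
		       struct drm_sched_entity *entity,
		       void *owner)
{
	struct drm_gpu_scheduler *sched;

	/* This can leave entity->rq == NULL when no backing scheduler is ready. */
	drm_sched_entity_select_rq(entity);
	if (!entity->rq)
		return -ENOENT;

	sched = entity->rq->sched;
	/* ... remaining job setup elided ... */
	return 0;
}

amdgpu_vm_sdma_commit(), by contrast, dereferences entity->rq->sched without such a guard, which is where the oops below lands.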


"drm/amdgpu: stop disable the scheduler during HW fini" from Christian 
should've fix it already. But


I can't find that commit in brahma/amd-staging-drm-next.


Yeah, my fault. I actually forgot to push it.

Should be fixed by now,
Christian.



Regards,

Nirmoy



297567 [   44.667677] amdgpu :03:00.0: GPU reset begin!
297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 0038
297572 [   44.955132] #PF: supervisor read access in kernel mode
297573 [   44.960451] #PF: error_code(0x) - not-present page
297574 [   44.965714] PGD 0 P4D 0
297575 [   44.968331] Oops:  [#1] SMP PTI
297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: G    W 5.4.0-rc7+ #1
297577 [   44.980221] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff f   f 0f 84 06 01 00 00 48 8b 80
297580 [   45.014931] RSP: 0018:b66e008839d0 EFLAGS: 00010246
297581 [   45.020504] RAX:  RBX: b66e00883a30 RCX: 00100400
297582 [   45.028062] RDX: 003c RSI: 8df123662138 RDI: b66e00883a30
297583 [   45.035662] RBP: b66e00883a00 R08: b66e0088395c R09: b66e00883960
297584 [   45.043298] R10: 00100240 R11: 0035 R12: 8df1425385e8
297585 [   45.050916] R13: 8df13cfd1288 R14: 8df123662138 R15: 8df13cfd1000
297586 [   45.058524] FS:  7fcc8f6b2100() GS:8df15e38() knlGS:
297587 [   45.067114] CS:  0010 DS:  ES:  CR0: 80050033
297588 [   45.073206] CR2: 0038 CR3: 000641fb6006 CR4: 003606e0
297589 [   45.080791] DR0:  DR1:  DR2: 
297590 [   45.088277] DR3:  DR6: fffe0ff0 DR7: 0400
297591 [   45.095773] Call Trace:
297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297603 [   45.11]  ? trace_hardirqs_on+0x3b/0xf0
297604 [   45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
297605 [   45.172104]  do_vfs_ioctl+0xa9/0x6f0
297606 [   45.175909]  ? tomoyo_file_ioctl+0x19/0x20
297607 [   45.180241]  ksys_ioctl+0x75/0x80
297608 [   45.183760]  ? do_syscall_64+0x17/0x230
297609 [   45.187833]  __x64_sys_ioctl+0x1a/0x20
297610 [   45.191846]  do_syscall_64+0x5f/0x230
297611 [   45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
297612 [   45.201126] RIP: 0033:0x7fcc8c7725d7


Regards,
Christian.


Cc: Christian König 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index cf96c335b258..d30d103e48a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return -ENOENT;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);


Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Pan, Xinhui
Well, submitting a job with the HW disabled should do no harm.

The only concern is that we might use up IBs if we park the scheduler thread during 
recovery. I have seen recovery get stuck in the SA allocation path (amdgpu_sa_bo_new): 
the ring test allocates IBs to check whether recovery succeeded, but if there are not 
enough free IBs it waits for fences to signal. However, since we have parked the 
scheduler thread, the pending jobs will never run and no fences will be signaled.

See, a deadlock indeed. Now that we are allowing job submission here, it is more 
likely that the IBs get used up.
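
To make the cycle explicit, here is a minimal userspace model of the scenario described above (purely illustrative, hypothetical names, not driver code): the waiter needs a fence that only the parked scheduler thread would signal. A timed wait is used so the demo terminates instead of actually hanging.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool fence_signaled;                  /* set only by the scheduler thread */
static atomic_bool scheduler_parked = true;  /* parked for GPU recovery */

/* Stand-in for the parked scheduler thread: while parked it runs no jobs,
 * so the fence of the job holding the IBs never signals. */
static void *scheduler_thread(void *arg)
{
	while (atomic_load(&scheduler_parked))
		;                            /* busy-wait, demo only */
	pthread_mutex_lock(&lock);
	fence_signaled = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t sched;
	struct timespec deadline;

	pthread_create(&sched, NULL, scheduler_thread, NULL);

	/* "ring test": wants an IB, the SA pool is exhausted, so it waits
	 * for an IB fence -- which only the parked thread would signal. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 2;

	pthread_mutex_lock(&lock);
	while (!fence_signaled &&
	       pthread_cond_timedwait(&cond, &lock, &deadline) == 0)
		;
	if (!fence_signaled)
		printf("deadlock: the fence never signals while the scheduler is parked\n");
	pthread_mutex_unlock(&lock);

	atomic_store(&scheduler_parked, false); /* unpark so the demo can exit */
	pthread_join(sched, NULL);
	return 0;
}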

The deadlock call trace:
271384 [27069.375047] INFO: task gnome-shell:2507 blocked for more than 120 
seconds.
271385 [27069.382510]   Tainted: GW 5.4.0-rc7+ #1
271386 [27069.388207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
271387 [27069.396221] gnome-shell D0  2507   2487 0x
271388 [27069.401869] Call Trace:
271389 [27069.404404]  __schedule+0x2ab/0x860
271390 [27069.408009]  ? dma_fence_wait_any_timeout+0x1a4/0x2b0
271391 [27069.413198]  schedule+0x3a/0xc0
271392 [27069.416432]  schedule_timeout+0x21d/0x3c0
271393 [27069.420583]  ? trace_hardirqs_on+0x3b/0xf0
271394 [27069.424815]  ? dma_fence_add_callback+0x6e/0xe0
271395 [27069.429449]  ? dma_fence_wait_any_timeout+0x1a4/0x2b0
271396 [27069.434640]  dma_fence_wait_any_timeout+0x205/0x2b0
271397 [27069.439633]  ? dma_fence_wait_any_timeout+0x238/0x2b0
271398 [27069.444944]  amdgpu_sa_bo_new+0x4d7/0x5c0 [amdgpu]
271399 [27069.449949]  amdgpu_ib_get+0x36/0xa0 [amdgpu]
271400 [27069.454534]  amdgpu_job_alloc_with_ib+0x4d/0x70 [amdgpu]
271401 [27069.460057]  amdgpu_vm_sdma_prepare+0x28/0x60 [amdgpu]
271402 [27069.465370]  amdgpu_vm_bo_update_mapping+0xd7/0x1f0 [amdgpu]
271403 [27069.471171]  ? mark_held_locks+0x4d/0x80
271404 [27069.475281]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
271405 [27069.480538]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
271406 [27069.485838]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
271407 [27069.491380]  drm_ioctl_kernel+0xb0/0x100 [drm]
271408 [27069.496045]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
271409 [27069.501569]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
271410 [27069.506353]  drm_ioctl+0x389/0x450 [drm]
271411 [27069.510458]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
271412 [27069.516000]  ? trace_hardirqs_on+0x3b/0xf0
271413 [27069.520305]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
271414 [27069.525048]  do_vfs_ioctl+0xa9/0x6f0
271415 [27069.528753]  ? tomoyo_file_ioctl+0x19/0x20
271416 [27069.532972]  ksys_ioctl+0x75/0x80
271417 [27069.536396]  ? do_syscall_64+0x17/0x230
271418 [27069.540357]  __x64_sys_ioctl+0x1a/0x20
271419 [27069.544239]  do_syscall_64+0x5f/0x230


> On 2020-03-25 at 19:13, Koenig, Christian wrote:
> 
> Hi guys,
> 
> thanks for pointing this out Nirmoy.
> 
> Yeah, could be that I forgot to commit the patch. Currently I don't know at 
> which end of the chaos I should start to clean up.
> 
> Christian.
> 
> On 25.03.2020 at 12:09, "Das, Nirmoy" wrote:
> Hi Xinhui,
> 
> 
> Can you please check if you can reproduce the crash with 
> https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html
> 
> Christian fixed it earlier; I think he forgot to push it.
> 
> 
> Regards,
> 
> Nirmoy
> 
> On 3/25/20 12:07 PM, xinhui pan wrote:
> > GPU recovery will call sdma suspend/resume. During this period the ring will be
> > disabled, so the vm_pte_scheds (sdma.instance[X].ring.sched)->ready will
> > be false.
> >
> > If we submit any jobs in this ring-disabled period, we fail to pick up
> > a rq for the vm entity and entity->rq will be set to NULL.
> > amdgpu_vm_sdma_commit did not check entity->rq, so fix it; otherwise
> > we hit a panic.
> >
> > Cc: Christian König 
> > Cc: Alex Deucher 
> > Cc: Felix Kuehling 
> > Signed-off-by: xinhui pan 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > index cf96c335b258..d30d103e48a2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
> >int r;
> >   
> >entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> > + if (!entity->rq)
> > + return -ENOENT;
> >ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
> >   
> >WARN_ON(ib->length_dw == 0);
> 
> 

Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Pan, Xinhui
[AMD Official Use Only - Internal Distribution Only]

Well, submitting a job with the HW disabled should do no harm.

The only concern is that we might use up IBs if we park the scheduler during 
recovery. I have seen recovery get stuck in the SA allocation path (amdgpu_sa_bo_new).

The ring test allocates IBs to check whether recovery succeeded, but if there are not 
enough free IBs it waits for fences to signal. However, since we have parked the scheduler 
thread, the pending jobs will never run and no fences will be signaled.

See, a deadlock indeed. Now that we are allowing job submission here, it is more 
likely that the IBs get used up.


From: Koenig, Christian 
Sent: Wednesday, March 25, 2020 7:13:13 PM
To: Das, Nirmoy 
Cc: Pan, Xinhui ; amd-gfx@lists.freedesktop.org 
; Deucher, Alexander 
; Kuehling, Felix 
Subject: Re: [PATCH] drm/amdgpu: Check entity rq

Hi guys,

thanks for pointing this out Nirmoy.

Yeah, could be that I forgot to commit the patch. Currently I don't know at 
which end of the chaos I should start to clean up.

Christian.

On 25.03.2020 at 12:09, "Das, Nirmoy" wrote:
Hi Xinhui,


Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed it earlier; I think he forgot to push it.


Regards,

Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> GPU recovery will call sdma suspend/resume. During this period the ring will be
> disabled, so the vm_pte_scheds (sdma.instance[X].ring.sched)->ready will
> be false.
>
> If we submit any jobs in this ring-disabled period, we fail to pick up
> a rq for the vm entity and entity->rq will be set to NULL.
> amdgpu_vm_sdma_commit did not check entity->rq, so fix it; otherwise
> we hit a panic.
>
> Cc: Christian König 
> Cc: Alex Deucher 
> Cc: Felix Kuehling 
> Signed-off-by: xinhui pan 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>int r;
>
>entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> + if (!entity->rq)
> + return -ENOENT;
>ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>
>WARN_ON(ib->length_dw == 0);



Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Koenig, Christian
Hi guys,

thanks for pointing this out Nirmoy.

Yeah, could be that I forgot to commit the patch. Currently I don't know at 
which end of the chaos I should start to clean up.

Christian.

On 25.03.2020 at 12:09, "Das, Nirmoy" wrote:
Hi Xinhui,


Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed it earlier; I think he forgot to push it.


Regards,

Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> GPU recovery will call sdma suspend/resume. During this period the ring will be
> disabled, so the vm_pte_scheds (sdma.instance[X].ring.sched)->ready will
> be false.
>
> If we submit any jobs in this ring-disabled period, we fail to pick up
> a rq for the vm entity and entity->rq will be set to NULL.
> amdgpu_vm_sdma_commit did not check entity->rq, so fix it; otherwise
> we hit a panic.
>
> Cc: Christian König 
> Cc: Alex Deucher 
> Cc: Felix Kuehling 
> Signed-off-by: xinhui pan 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>int r;
>
>entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> + if (!entity->rq)
> + return -ENOENT;
>ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>
>WARN_ON(ib->length_dw == 0);




Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Nirmoy

Hi Xinhui,


Can you please check if you can reproduce the crash with 
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html


Christian fixed it earlier; I think he forgot to push it.


Regards,

Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:

GPU recovery will call sdma suspend/resume. During this period the ring will be
disabled, so the vm_pte_scheds (sdma.instance[X].ring.sched)->ready will
be false.

If we submit any jobs in this ring-disabled period, we fail to pick up
a rq for the vm entity and entity->rq will be set to NULL.
amdgpu_vm_sdma_commit did not check entity->rq, so fix it; otherwise
we hit a panic.

Cc: Christian König 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index cf96c335b258..d30d103e48a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return -ENOENT;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);



Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Nirmoy


On 3/25/20 10:23 AM, Pan, Xinhui wrote:



On 2020-03-25 at 15:48, Koenig, Christian wrote:

On 25.03.20 at 06:47, xinhui pan wrote:

Hit a panic during the GPU recovery test. drm_sched_entity_select_rq might
set rq to NULL, so add a check like drm_sched_job_init does.

NAK, the rq should never be set to NULL in the first place.

How did that happen?

Well, I have not checked the details, but I just got the call trace below.
It looks like the sched is not ready, so drm_sched_entity_select_rq sets entity->rq to 
NULL.
Then, in the next amdgpu_vm_sdma_commit, we hit a panic when we dereference entity->rq.


"drm/amdgpu: stop disable the scheduler during HW fini" from Christian 
should've fix it already. But


I can't find that commit in brahma/amd-staging-drm-next.

Regards,

Nirmoy



297567 [   44.667677] amdgpu :03:00.0: GPU reset begin!
297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
update BO_VA (-2)
297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 
0038
297572 [   44.955132] #PF: supervisor read access in kernel mode
297573 [   44.960451] #PF: error_code(0x) - not-present page
297574 [   44.965714] PGD 0 P4D 0
297575 [   44.968331] Oops:  [#1] SMP PTI
297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: GW
 5.4.0-rc7+ #1
297577 [   44.980221] Hardware name: System manufacturer System Product 
Name/Z170-A, BIOS 1702 01/28/2016
297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d 
a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 8b 40 
38 85 d2 48 8d b8 30 ff ff f   f 0f 84 06 01 00 00 48 8b 80
297580 [   45.014931] RSP: 0018:b66e008839d0 EFLAGS: 00010246
297581 [   45.020504] RAX:  RBX: b66e00883a30 RCX: 
00100400
297582 [   45.028062] RDX: 003c RSI: 8df123662138 RDI: 
b66e00883a30
297583 [   45.035662] RBP: b66e00883a00 R08: b66e0088395c R09: 
b66e00883960
297584 [   45.043298] R10: 00100240 R11: 0035 R12: 
8df1425385e8
297585 [   45.050916] R13: 8df13cfd1288 R14: 8df123662138 R15: 
8df13cfd1000
297586 [   45.058524] FS:  7fcc8f6b2100() GS:8df15e38() 
knlGS:
297587 [   45.067114] CS:  0010 DS:  ES:  CR0: 80050033
297588 [   45.073206] CR2: 0038 CR3: 000641fb6006 CR4: 
003606e0
297589 [   45.080791] DR0:  DR1:  DR2: 

297590 [   45.088277] DR3:  DR6: fffe0ff0 DR7: 
0400
297591 [   45.095773] Call Trace:
297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297603 [   45.11]  ? trace_hardirqs_on+0x3b/0xf0
297604 [   45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
297605 [   45.172104]  do_vfs_ioctl+0xa9/0x6f0
297606 [   45.175909]  ? tomoyo_file_ioctl+0x19/0x20
297607 [   45.180241]  ksys_ioctl+0x75/0x80
297608 [   45.183760]  ? do_syscall_64+0x17/0x230
297609 [   45.187833]  __x64_sys_ioctl+0x1a/0x20
297610 [   45.191846]  do_syscall_64+0x5f/0x230
297611 [   45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
297612 [   45.201126] RIP: 0033:0x7fcc8c7725d7


Regards,
Christian.


Cc: Christian König 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index cf96c335b258..d30d103e48a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return -ENOENT;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);


Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Pan, Xinhui


> On 2020-03-25 at 17:23, Pan, Xinhui wrote:
> 
> 
> 
>> On 2020-03-25 at 15:48, Koenig, Christian wrote:
>> 
>> On 25.03.20 at 06:47, xinhui pan wrote:
>>> Hit a panic during the GPU recovery test. drm_sched_entity_select_rq might
>>> set rq to NULL, so add a check like drm_sched_job_init does.
>> 
>> NAK, the rq should never be set to NULL in the first place.
>> 
>> How did that happen?
> 
> Well, I have not checked the details.

So recovery will disable the sdma ring, and sched->ready will then be false. 
Any job submitted during suspend and resume will hit this issue.
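
To make that concrete, below is a self-contained toy model of the run-queue selection (all names are made up for illustration; this is not the driver code): only ready schedulers are considered, and with both SDMA schedulers disabled for the reset the entity is left with rq == NULL, which the next commit path then dereferences. The actual dmesg from the recovery path follows after it.

#include <stdbool.h>
#include <stdio.h>

struct toy_sched  { const char *name; bool ready; int num_jobs; };
struct toy_rq     { struct toy_sched *sched; };
struct toy_entity { struct toy_rq **rq_list; int num_rqs; struct toy_rq *rq; };

/* Pick the least-loaded ready run queue, or NULL if none is ready. */
static struct toy_rq *toy_select_rq(struct toy_entity *e)
{
	struct toy_rq *best = NULL;
	int min_jobs = -1;

	for (int i = 0; i < e->num_rqs; i++) {
		struct toy_sched *s = e->rq_list[i]->sched;

		if (!s->ready) {
			printf("scheduler %s is not ready, skipping\n", s->name);
			continue;
		}
		if (min_jobs < 0 || s->num_jobs < min_jobs) {
			min_jobs = s->num_jobs;
			best = e->rq_list[i];
		}
	}
	return best;
}

int main(void)
{
	/* Both SDMA schedulers are disabled, as during suspend for GPU reset. */
	struct toy_sched sdma0 = { "sdma0", false, 0 };
	struct toy_sched sdma1 = { "sdma1", false, 0 };
	struct toy_rq rq0 = { &sdma0 }, rq1 = { &sdma1 };
	struct toy_rq *list[] = { &rq0, &rq1 };
	struct toy_entity entity = { list, 2, &rq0 };

	entity.rq = toy_select_rq(&entity);
	if (!entity.rq)
		printf("entity->rq is NULL; dereferencing it would oops\n");
	return 0;
}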

[   99.011614] amdgpu :03:00.0: GPU reset begin!
[   99.265504] CPU: 5 PID: 163 Comm: kworker/5:1 Tainted: GW 
5.4.0-rc7+ #1
[   99.273659] Hardware name: System manufacturer System Product Name/Z170-A, 
BIOS 1702 01/28/2016
[   99.282522] Workqueue: events drm_sched_job_timedout [gpu_sched]
[   99.288682] Call Trace:
[   99.291193]  dump_stack+0x98/0xd5
[   99.294629]  sdma_v5_0_enable+0x1ab/0x1d0 [amdgpu]
[   99.299563]  sdma_v5_0_suspend+0x2a/0x30 [amdgpu]
[   99.304360]  amdgpu_device_ip_suspend_phase2+0xa3/0x110 [amdgpu]
[   99.310504]  ? amdgpu_device_ip_suspend_phase1+0x5b/0xe0 [amdgpu]
[   99.316727]  amdgpu_device_ip_suspend+0x37/0x60 [amdgpu]
[   99.322159]  amdgpu_device_pre_asic_reset+0x81/0x1f0 [amdgpu]
[   99.328054]  amdgpu_device_gpu_recover+0x27f/0xc60 [amdgpu]
[   99.333767]  amdgpu_job_timedout+0x123/0x140 [amdgpu]
[   99.338898]  drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[   99.35]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
[   99.350145]  ? drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[   99.355834]  process_one_work+0x231/0x5c0
[   99.359927]  worker_thread+0x3f/0x3b0
[   99.363641]  ? __kthread_parkme+0x61/0x90
[   99.367701]  kthread+0x12c/0x150
[   99.371010]  ? process_one_work+0x5c0/0x5c0
[   99.375318]  ? kthread_park+0x90/0x90
[   99.379042]  ret_from_fork+0x3a/0x50


> But I just got the call trace below.
> It looks like the sched is not ready, so drm_sched_entity_select_rq sets entity->rq 
> to NULL.
> Then, in the next amdgpu_vm_sdma_commit, we hit a panic when we dereference entity->rq.
> 
> 297567 [   44.667677] amdgpu :03:00.0: GPU reset begin!
> 297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
> 297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
> 297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
> update BO_VA (-2)
> 297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 
> 0038
> 297572 [   44.955132] #PF: supervisor read access in kernel mode
> 297573 [   44.960451] #PF: error_code(0x) - not-present page
> 297574 [   44.965714] PGD 0 P4D 0
> 297575 [   44.968331] Oops:  [#1] SMP PTI
> 297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: GW  
>5.4.0-rc7+ #1
> 297577 [   44.980221] Hardware name: System manufacturer System Product 
> Name/Z170-A, BIOS 1702 01/28/2016
> 297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
> 297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 
> 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 
> <48> 8b 40 38 85 d2 48 8d b8 30 ff ff f   f 0f 84 06 01 00 00 48 8b 80
> 297580 [   45.014931] RSP: 0018:b66e008839d0 EFLAGS: 00010246
> 297581 [   45.020504] RAX:  RBX: b66e00883a30 RCX: 
> 00100400
> 297582 [   45.028062] RDX: 003c RSI: 8df123662138 RDI: 
> b66e00883a30
> 297583 [   45.035662] RBP: b66e00883a00 R08: b66e0088395c R09: 
> b66e00883960
> 297584 [   45.043298] R10: 00100240 R11: 0035 R12: 
> 8df1425385e8
> 297585 [   45.050916] R13: 8df13cfd1288 R14: 8df123662138 R15: 
> 8df13cfd1000
> 297586 [   45.058524] FS:  7fcc8f6b2100() GS:8df15e38() 
> knlGS:
> 297587 [   45.067114] CS:  0010 DS:  ES:  CR0: 80050033
> 297588 [   45.073206] CR2: 0038 CR3: 000641fb6006 CR4: 
> 003606e0
> 297589 [   45.080791] DR0:  DR1:  DR2: 
> 
> 297590 [   45.088277] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> 297591 [   45.095773] Call Trace:
> 297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
> 297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
> 297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
> 297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
> 297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
> 297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
> 297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
> 297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
> 297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297603 [   45.1

Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Pan, Xinhui


> On 2020-03-25 at 15:48, Koenig, Christian wrote:
> 
> On 25.03.20 at 06:47, xinhui pan wrote:
>> Hit a panic during the GPU recovery test. drm_sched_entity_select_rq might
>> set rq to NULL, so add a check like drm_sched_job_init does.
> 
> NAK, the rq should never be set to NULL in the first place.
> 
> How did that happen?

Well, I have not checked the details, but I just got the call trace below.
It looks like the sched is not ready, so drm_sched_entity_select_rq sets entity->rq to 
NULL.
Then, in the next amdgpu_vm_sdma_commit, we hit a panic when we dereference entity->rq.

297567 [   44.667677] amdgpu :03:00.0: GPU reset begin!
297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
update BO_VA (-2)
297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 
0038
297572 [   44.955132] #PF: supervisor read access in kernel mode
297573 [   44.960451] #PF: error_code(0x) - not-present page
297574 [   44.965714] PGD 0 P4D 0
297575 [   44.968331] Oops:  [#1] SMP PTI
297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: GW
 5.4.0-rc7+ #1
297577 [   44.980221] Hardware name: System manufacturer System Product 
Name/Z170-A, BIOS 1702 01/28/2016
297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 
4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 
8b 40 38 85 d2 48 8d b8 30 ff ff f   f 0f 84 06 01 00 00 48 8b 80
297580 [   45.014931] RSP: 0018:b66e008839d0 EFLAGS: 00010246
297581 [   45.020504] RAX:  RBX: b66e00883a30 RCX: 
00100400
297582 [   45.028062] RDX: 003c RSI: 8df123662138 RDI: 
b66e00883a30
297583 [   45.035662] RBP: b66e00883a00 R08: b66e0088395c R09: 
b66e00883960
297584 [   45.043298] R10: 00100240 R11: 0035 R12: 
8df1425385e8
297585 [   45.050916] R13: 8df13cfd1288 R14: 8df123662138 R15: 
8df13cfd1000
297586 [   45.058524] FS:  7fcc8f6b2100() GS:8df15e38() 
knlGS:
297587 [   45.067114] CS:  0010 DS:  ES:  CR0: 80050033
297588 [   45.073206] CR2: 0038 CR3: 000641fb6006 CR4: 
003606e0
297589 [   45.080791] DR0:  DR1:  DR2: 

297590 [   45.088277] DR3:  DR6: fffe0ff0 DR7: 
0400
297591 [   45.095773] Call Trace:
297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297603 [   45.11]  ? trace_hardirqs_on+0x3b/0xf0
297604 [   45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
297605 [   45.172104]  do_vfs_ioctl+0xa9/0x6f0
297606 [   45.175909]  ? tomoyo_file_ioctl+0x19/0x20
297607 [   45.180241]  ksys_ioctl+0x75/0x80
297608 [   45.183760]  ? do_syscall_64+0x17/0x230
297609 [   45.187833]  __x64_sys_ioctl+0x1a/0x20
297610 [   45.191846]  do_syscall_64+0x5f/0x230
297611 [   45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
297612 [   45.201126] RIP: 0033:0x7fcc8c7725d7

> 
> Regards,
> Christian.
> 
>> 
>> Cc: Christian König 
>> Cc: Alex Deucher 
>> Cc: Felix Kuehling 
>> Signed-off-by: xinhui pan 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> index cf96c335b258..d30d103e48a2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>  int r;
>>  entity = p->direct ? &p->vm->direct : &p->vm->delayed;
>> +if (!entity->rq)
>> +return -ENOENT;
>>  ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>>  WARN_ON(ib->length_dw == 0);
> 



Re: [PATCH] drm/amdgpu: Check entity rq

2020-03-25 Thread Christian König

On 25.03.20 at 06:47, xinhui pan wrote:

Hit a panic during the GPU recovery test. drm_sched_entity_select_rq might
set rq to NULL, so add a check like drm_sched_job_init does.


NAK, the rq should never be set to NULL in the first place.

How did that happen?

Regards,
Christian.



Cc: Christian König 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index cf96c335b258..d30d103e48a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return -ENOENT;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);

