Re: [PATCH] drm/amdgpu: Check entity rq
On 3/25/20 12:03 PM, Nirmoy wrote:

On 3/25/20 10:23 AM, Pan, Xinhui wrote:

On 3/25/20 3:48 PM, Koenig, Christian wrote:

On 3/25/20 6:47 AM, xinhui pan wrote:
Hit panic during GPU recovery test. drm_sched_entity_select_rq might set rq to NULL. So add a check like drm_sched_job_init does.

NAK, the rq should never be set to NULL in the first place. How did that happen?

Well, I have not checked the details, but I got the call trace below. It looks like the scheduler is not ready, and drm_sched_entity_select_rq sets entity->rq to NULL. In the next amdgpu_vm_sdma_commit, we hit the panic when we dereference entity->rq.

"drm/amdgpu: stop disable the scheduler during HW fini" from Christian should've fixed it already. But I can't find that commit in brahma/amd-staging-drm-next.

Yeah, my fault. I actually forgot to push it. Should be fixed by now.

Christian.

Regards,
Nirmoy

[  44.667677] amdgpu :03:00.0: GPU reset begin!
[  44.929047] [drm] scheduler sdma0 is not ready, skipping
[  44.929048] [drm] scheduler sdma1 is not ready, skipping
[  44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
[  44.947941] BUG: kernel NULL pointer dereference, address: 0038
[  44.955132] #PF: supervisor read access in kernel mode
[  44.960451] #PF: error_code(0x) - not-present page
[  44.965714] PGD 0 P4D 0
[  44.968331] Oops: [#1] SMP PTI
[  44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: G W 5.4.0-rc7+ #1
[  44.980221] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
[  44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
[  44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff ff 0f 84 06 01 00 00 48 8b 80
[  45.014931] RSP: 0018:b66e008839d0 EFLAGS: 00010246
[  45.020504] RAX:  RBX: b66e00883a30 RCX: 00100400
[  45.028062] RDX: 003c RSI: 8df123662138 RDI: b66e00883a30
[  45.035662] RBP: b66e00883a00 R08: b66e0088395c R09: b66e00883960
[  45.043298] R10: 00100240 R11: 0035 R12: 8df1425385e8
[  45.050916] R13: 8df13cfd1288 R14: 8df123662138 R15: 8df13cfd1000
[  45.058524] FS: 7fcc8f6b2100() GS:8df15e38() knlGS:
[  45.067114] CS: 0010 DS: ES: CR0: 80050033
[  45.073206] CR2: 0038 CR3: 000641fb6006 CR4: 003606e0
[  45.080791] DR0: DR1: DR2:
[  45.088277] DR3: DR6: fffe0ff0 DR7: 0400
[  45.095773] Call Trace:
[  45.098354] amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
[  45.104427] ? mark_held_locks+0x4d/0x80
[  45.108682] amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
[  45.114049] ? rcu_read_lock_sched_held+0x4f/0x80
[  45.119111] amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
[  45.124495] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  45.130250] drm_ioctl_kernel+0xb0/0x100 [drm]
[  45.134988] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  45.140742] ? drm_ioctl_kernel+0xb0/0x100 [drm]
[  45.145622] drm_ioctl+0x389/0x450 [drm]
[  45.149804] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  45.11] ? trace_hardirqs_on+0x3b/0xf0
[  45.159892] amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  45.172104] do_vfs_ioctl+0xa9/0x6f0
[  45.175909] ? tomoyo_file_ioctl+0x19/0x20
[  45.180241] ksys_ioctl+0x75/0x80
[  45.183760] ? do_syscall_64+0x17/0x230
[  45.187833] __x64_sys_ioctl+0x1a/0x20
[  45.191846] do_syscall_64+0x5f/0x230
[  45.195764] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  45.201126] RIP: 0033:0x7fcc8c7725d7

Regards,
Christian.

Cc: Christian König
Cc: Alex Deucher
Cc: Felix Kuehling
Signed-off-by: xinhui pan
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index cf96c335b258..d30d103e48a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return -ENOENT;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);
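The guard in the patch above can be exercised outside the kernel with a small stand-alone mock. The struct and function names below are illustrative stand-ins (not the real amdgpu or DRM types); only the shape of the check is the same: bail out with -ENOENT before entity->rq is ever dereferenced.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for struct drm_sched_rq / struct drm_sched_entity,
 * reduced to the single field the sketch needs. */
struct mock_rq { int ready; };

struct mock_entity {
	struct mock_rq *rq;	/* NULL when no ready scheduler was found */
};

/* Mirrors the shape of the patched amdgpu_vm_sdma_commit():
 * return -ENOENT before entity->rq is dereferenced. */
static int mock_vm_sdma_commit(struct mock_entity *entity)
{
	if (!entity->rq)
		return -ENOENT;
	/* The real code would now do:
	 * ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
	 */
	return 0;
}
```

With a NULL rq the mock fails cleanly instead of crashing, which is exactly the behavior the patch gives the kernel path.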
Re: [PATCH] drm/amdgpu: Check entity rq
Well, submitting a job with the HW disabled should do no harm. The only concern is that we might use up IBs if we park the scheduler thread during recovery.

I have seen recovery get stuck in the sa (suballocator) new function: the ring test allocates IBs to check whether recovery succeeded or not. But if there are not enough IBs, it will wait for fences to signal. However, we have parked the scheduler thread, so the jobs will never run and no fences will be signaled. See, a deadlock indeed.

Now we are allowing job submission here, so it is more likely that the IBs get used up.

Deadlock call trace:

[27069.375047] INFO: task gnome-shell:2507 blocked for more than 120 seconds.
[27069.382510] Tainted: G W 5.4.0-rc7+ #1
[27069.388207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27069.396221] gnome-shell D0 2507 2487 0x
[27069.401869] Call Trace:
[27069.404404] __schedule+0x2ab/0x860
[27069.408009] ? dma_fence_wait_any_timeout+0x1a4/0x2b0
[27069.413198] schedule+0x3a/0xc0
[27069.416432] schedule_timeout+0x21d/0x3c0
[27069.420583] ? trace_hardirqs_on+0x3b/0xf0
[27069.424815] ? dma_fence_add_callback+0x6e/0xe0
[27069.429449] ? dma_fence_wait_any_timeout+0x1a4/0x2b0
[27069.434640] dma_fence_wait_any_timeout+0x205/0x2b0
[27069.439633] ? dma_fence_wait_any_timeout+0x238/0x2b0
[27069.444944] amdgpu_sa_bo_new+0x4d7/0x5c0 [amdgpu]
[27069.449949] amdgpu_ib_get+0x36/0xa0 [amdgpu]
[27069.454534] amdgpu_job_alloc_with_ib+0x4d/0x70 [amdgpu]
[27069.460057] amdgpu_vm_sdma_prepare+0x28/0x60 [amdgpu]
[27069.465370] amdgpu_vm_bo_update_mapping+0xd7/0x1f0 [amdgpu]
[27069.471171] ? mark_held_locks+0x4d/0x80
[27069.475281] amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
[27069.480538] amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
[27069.485838] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[27069.491380] drm_ioctl_kernel+0xb0/0x100 [drm]
[27069.496045] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[27069.501569] ? drm_ioctl_kernel+0xb0/0x100 [drm]
[27069.506353] drm_ioctl+0x389/0x450 [drm]
[27069.510458] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[27069.516000] ? trace_hardirqs_on+0x3b/0xf0
[27069.520305] amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[27069.525048] do_vfs_ioctl+0xa9/0x6f0
[27069.528753] ? tomoyo_file_ioctl+0x19/0x20
[27069.532972] ksys_ioctl+0x75/0x80
[27069.536396] ? do_syscall_64+0x17/0x230
[27069.540357] __x64_sys_ioctl+0x1a/0x20
[27069.544239] do_syscall_64+0x5f/0x230

> On 3/25/2020 7:13 PM, Koenig, Christian wrote:
>
> Hi guys,
>
> thanks for pointing this out Nirmoy.
>
> Yeah, could be that I forgot to commit the patch. Currently I don't know at
> which end of the chaos I should start to clean up.
>
> Christian.
>
> On 3/25/2020 12:09 PM, "Das, Nirmoy" wrote:
> Hi Xinhui,
>
> Can you please check if you can reproduce the crash with
> https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html
>
> Christian fixed it earlier; I think he forgot to push it.
>
> Regards,
> Nirmoy
>
> On 3/25/20 12:07 PM, xinhui pan wrote:
>> GPU recovery will call sdma suspend/resume. During this period the ring
>> will be disabled, so vm_pte_scheds (sdma.instance[X].ring.sched)->ready
>> will be false.
>>
>> If we submit any jobs in this ring-disabled period, we fail to pick up
>> a rq for the vm entity and entity->rq will be set to NULL.
>> amdgpu_vm_sdma_commit did not check entity->rq, so fix it. Otherwise we
>> hit a panic.
>>
>> [patch snipped]
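The IB-exhaustion scenario above can be modeled as a tiny state machine: a fixed pool of IBs where an empty pool would block in dma_fence_wait_any_timeout() (the amdgpu_sa_bo_new frame in the trace), and where fences only signal while the scheduler thread is running. All names here are illustrative, not the real amdgpu API; the point is only that a parked signaler plus a bounded pool makes the wait unbounded.

```c
/* Toy model of the suballocator deadlock: allocation consumes from a
 * bounded pool; recycling only happens when the scheduler runs jobs. */
struct ib_pool {
	int free_ibs;		/* IBs still available in the suballocator */
	int sched_parked;	/* nonzero while recovery parked the scheduler */
};

/* Returns 1 on success; 0 means the caller would block waiting for
 * a fence to recycle an IB. */
static int ib_try_alloc(struct ib_pool *p)
{
	if (p->free_ibs > 0) {
		p->free_ibs--;
		return 1;
	}
	return 0;	/* would wait: dma_fence_wait_any_timeout() */
}

/* One fence "signals" (recycling an IB) only when the scheduler
 * thread actually runs jobs; while parked, nothing ever signals. */
static void ib_fence_tick(struct ib_pool *p)
{
	if (!p->sched_parked)
		p->free_ibs++;
}
```

Once free_ibs reaches zero with sched_parked set, no number of ib_fence_tick() calls ever makes ib_try_alloc() succeed again; that is the deadlock the mail describes.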
Re: [PATCH] drm/amdgpu: Check entity rq
[AMD Official Use Only - Internal Distribution Only]

Well, submitting a job with the HW disabled should do no harm. The only concern is that we might use up IBs if we park the scheduler during recovery.

I have seen recovery get stuck in the sa (suballocator) new function: the ring test allocates IBs to check whether recovery succeeded or not. But if there are not enough IBs, it will wait for fences to signal. However, we have parked the scheduler thread, so the jobs will never run and no fences will be signaled. See, a deadlock indeed.

Now we are allowing job submission here, so it is more likely that the IBs get used up.

From: Koenig, Christian
Sent: Wednesday, March 25, 2020 7:13:13 PM
To: Das, Nirmoy
Cc: Pan, Xinhui; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix
Subject: Re: [PATCH] drm/amdgpu: Check entity rq

Hi guys,

thanks for pointing this out Nirmoy.

Yeah, could be that I forgot to commit the patch. Currently I don't know at which end of the chaos I should start to clean up.

Christian.

On 3/25/2020 12:09 PM, "Das, Nirmoy" wrote:

Hi Xinhui,

Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed it earlier; I think he forgot to push it.

Regards,
Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> GPU recovery will call sdma suspend/resume. During this period the ring
> will be disabled, so vm_pte_scheds (sdma.instance[X].ring.sched)->ready
> will be false.
>
> If we submit any jobs in this ring-disabled period, we fail to pick up
> a rq for the vm entity and entity->rq will be set to NULL.
> amdgpu_vm_sdma_commit did not check entity->rq, so fix it. Otherwise we
> hit a panic.
>
> [patch snipped]
Re: [PATCH] drm/amdgpu: Check entity rq
Hi guys,

thanks for pointing this out Nirmoy.

Yeah, could be that I forgot to commit the patch. Currently I don't know at which end of the chaos I should start to clean up.

Christian.

On 3/25/2020 12:09 PM, "Das, Nirmoy" wrote:

Hi Xinhui,

Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed it earlier; I think he forgot to push it.

Regards,
Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> GPU recovery will call sdma suspend/resume. During this period the ring
> will be disabled, so vm_pte_scheds (sdma.instance[X].ring.sched)->ready
> will be false.
>
> If we submit any jobs in this ring-disabled period, we fail to pick up
> a rq for the vm entity and entity->rq will be set to NULL.
> amdgpu_vm_sdma_commit did not check entity->rq, so fix it. Otherwise we
> hit a panic.
>
> Cc: Christian König
> Cc: Alex Deucher
> Cc: Felix Kuehling
> Signed-off-by: xinhui pan
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  	int r;
>  
>  	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return -ENOENT;
>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>  
>  	WARN_ON(ib->length_dw == 0);

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: Check entity rq
Hi Xinhui,

Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed it earlier; I think he forgot to push it.

Regards,
Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> GPU recovery will call sdma suspend/resume. During this period the ring
> will be disabled, so vm_pte_scheds (sdma.instance[X].ring.sched)->ready
> will be false.
>
> If we submit any jobs in this ring-disabled period, we fail to pick up
> a rq for the vm entity and entity->rq will be set to NULL.
> amdgpu_vm_sdma_commit did not check entity->rq, so fix it. Otherwise we
> hit a panic.
>
> [patch snipped]
Re: [PATCH] drm/amdgpu: Check entity rq
On 3/25/20 10:23 AM, Pan, Xinhui wrote:
> On 3/25/20 3:48 PM, Koenig, Christian wrote:
>> On 3/25/20 6:47 AM, xinhui pan wrote:
>>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>>> set rq to NULL. So add a check like drm_sched_job_init does.
>>
>> NAK, the rq should never be set to NULL in the first place.
>>
>> How did that happen?
>
> Well, I have not checked the details, but I got the call trace below.
> It looks like the scheduler is not ready, and drm_sched_entity_select_rq
> sets entity->rq to NULL. In the next amdgpu_vm_sdma_commit, we hit the
> panic when we dereference entity->rq.

"drm/amdgpu: stop disable the scheduler during HW fini" from Christian should've fixed it already. But I can't find that commit in brahma/amd-staging-drm-next.

Regards,
Nirmoy

> [  44.667677] amdgpu :03:00.0: GPU reset begin!
> [  44.929047] [drm] scheduler sdma0 is not ready, skipping
> [  44.929048] [drm] scheduler sdma1 is not ready, skipping
> [  44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
> [  44.947941] BUG: kernel NULL pointer dereference, address: 0038
> [call trace snipped]
>
> Regards,
> Christian.
>
> [patch snipped]
Re: [PATCH] drm/amdgpu: Check entity rq
> On 3/25/20 5:23 PM, Pan, Xinhui wrote:
>
>> On 3/25/20 3:48 PM, Koenig, Christian wrote:
>>
>>> On 3/25/20 6:47 AM, xinhui pan wrote:
>>>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>>>> set rq to NULL. So add a check like drm_sched_job_init does.
>>>
>>> NAK, the rq should never be set to NULL in the first place.
>>>
>>> How did that happen?
>>
>> Well, I have not checked the details.

So recovery will disable the sdma ring, and sched->ready will then be false. Any job submitted during suspend and resume will hit this issue.

[  99.011614] amdgpu :03:00.0: GPU reset begin!
[  99.265504] CPU: 5 PID: 163 Comm: kworker/5:1 Tainted: G W 5.4.0-rc7+ #1
[  99.273659] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
[  99.282522] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  99.288682] Call Trace:
[  99.291193] dump_stack+0x98/0xd5
[  99.294629] sdma_v5_0_enable+0x1ab/0x1d0 [amdgpu]
[  99.299563] sdma_v5_0_suspend+0x2a/0x30 [amdgpu]
[  99.304360] amdgpu_device_ip_suspend_phase2+0xa3/0x110 [amdgpu]
[  99.310504] ? amdgpu_device_ip_suspend_phase1+0x5b/0xe0 [amdgpu]
[  99.316727] amdgpu_device_ip_suspend+0x37/0x60 [amdgpu]
[  99.322159] amdgpu_device_pre_asic_reset+0x81/0x1f0 [amdgpu]
[  99.328054] amdgpu_device_gpu_recover+0x27f/0xc60 [amdgpu]
[  99.333767] amdgpu_job_timedout+0x123/0x140 [amdgpu]
[  99.338898] drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[  99.35] ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
[  99.350145] ? drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[  99.355834] process_one_work+0x231/0x5c0
[  99.359927] worker_thread+0x3f/0x3b0
[  99.363641] ? __kthread_parkme+0x61/0x90
[  99.367701] kthread+0x12c/0x150
[  99.371010] ? process_one_work+0x5c0/0x5c0
[  99.375318] ? kthread_park+0x90/0x90
[  99.379042] ret_from_fork+0x3a/0x50

> But I got the call trace below. It looks like the scheduler is not
> ready, and drm_sched_entity_select_rq sets entity->rq to NULL. In the
> next amdgpu_vm_sdma_commit, we hit the panic when we dereference
> entity->rq.
>
> [  44.667677] amdgpu :03:00.0: GPU reset begin!
> [  44.929047] [drm] scheduler sdma0 is not ready, skipping
> [  44.929048] [drm] scheduler sdma1 is not ready, skipping
> [call trace snipped]
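The window described here, where every SDMA scheduler has its ready flag cleared by suspend, can be sketched as a selection pass that finds nothing usable. The names below are illustrative stand-ins for what drm_sched_entity_select_rq() does, not the real DRM scheduler API; the point is only that "no ready scheduler" yields the NULL that the commit path later dereferenced.

```c
#include <stddef.h>

/* Toy run queue and scheduler; ready is cleared by suspend during
 * GPU reset, just like the "scheduler sdma0 is not ready" messages. */
struct toy_rq { int id; };

struct toy_sched {
	int ready;		/* cleared while the ring is disabled */
	struct toy_rq *rq;
};

/* Pick a run queue from the schedulers the entity may use, skipping
 * any that are not ready. With all of them down, the result is NULL,
 * which is what ends up in entity->rq. */
static struct toy_rq *toy_select_rq(struct toy_sched **scheds, int n)
{
	for (int i = 0; i < n; i++)
		if (scheds[i]->ready)
			return scheds[i]->rq;
	return NULL;	/* no ready scheduler: entity->rq becomes NULL */
}
```

Any commit path that then dereferences the selected rq without a NULL check reproduces the panic in the trace, which is why the patch guards entity->rq first.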
Re: [PATCH] drm/amdgpu: Check entity rq
> On 3/25/20 3:48 PM, Koenig, Christian wrote:
>
> On 3/25/20 6:47 AM, xinhui pan wrote:
>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>> set rq to NULL. So add a check like drm_sched_job_init does.
>
> NAK, the rq should never be set to NULL in the first place.
>
> How did that happen?

Well, I have not checked the details, but I got the call trace below. It looks like the scheduler is not ready, and drm_sched_entity_select_rq sets entity->rq to NULL. In the next amdgpu_vm_sdma_commit, we hit the panic when we dereference entity->rq.

[  44.667677] amdgpu :03:00.0: GPU reset begin!
[  44.929047] [drm] scheduler sdma0 is not ready, skipping
[  44.929048] [drm] scheduler sdma1 is not ready, skipping
[  44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
[  44.947941] BUG: kernel NULL pointer dereference, address: 0038
[call trace snipped]

> Regards,
> Christian.
>
>> Cc: Christian König
>> Cc: Alex Deucher
>> Cc: Felix Kuehling
>> Signed-off-by: xinhui pan
>> ---
>> [patch snipped]
Re: [PATCH] drm/amdgpu: Check entity rq
On 3/25/20 6:47 AM, xinhui pan wrote:
> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
> set rq to NULL. So add a check like drm_sched_job_init does.

NAK, the rq should never be set to NULL in the first place.

How did that happen?

Regards,
Christian.

> Cc: Christian König
> Cc: Alex Deucher
> Cc: Felix Kuehling
> Signed-off-by: xinhui pan
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  	int r;
>  
>  	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return -ENOENT;
>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>  
>  	WARN_ON(ib->length_dw == 0);