Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again
Am 29.08.2018 um 16:51 schrieb Michel Dänzer: On 2018-08-29 10:57 a.m., Christian König wrote: Am 29.08.2018 um 09:52 schrieb Michel Dänzer: On 2018-08-28 7:03 p.m., Michel Dänzer wrote: On 2018-08-28 11:14 a.m., Michel Dänzer wrote: On 2018-08-22 9:52 a.m., Huang Rui wrote: The new bulk moving functionality is ready, the overhead of moving PD/PT bos to LRU is fixed. So move them on LRU again. Signed-off-by: Huang Rui Tested-by: Mike Lothian Tested-by: Dieter Nützel Acked-by: Chunming Zhou Reviewed-by: Junwei Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index db1f28a..d195a3d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev, struct amdgpu_vm_bo_base, vm_status); bo_base->moved = false; - list_del_init(_base->vm_status); + list_move(_base->vm_status, >idle); bo = bo_base->bo->parent; if (!bo) Since this change, I'm getting various badness when running piglit using radeonsi on Bonaire, see the attached dmesg excerpt. Reverting just this change on top of current amd-staging-drm-next avoids the problem. Looks like some list manipulation isn't sufficiently protected against concurrent execution? KASAN pointed me to one issue: https://patchwork.freedesktop.org/patch/246212/ However, this doesn't fully fix the problem. Ray, any ideas yet for solving this? If not, let's revert this change for now. I've gone over this multiple times now as well, but can't find anything obvious wrong either. After looking at the code, one question: Why does vm->moved need a spinlock, but not vm->idle? What is protecting against concurrent access to the latter? The moved state is occupied by both normal and per VM BOs, e.g. BOs with different reservation objects. All other states are only used by per VM BOs or PDs/PTs, so we only put the BOs on those when the reservation object of the root BO is locked. We could probably split the moved state into two separate ones to further avoid that lock. Christian. ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again
On 2018-08-29 10:57 a.m., Christian König wrote: > Am 29.08.2018 um 09:52 schrieb Michel Dänzer: >> On 2018-08-28 7:03 p.m., Michel Dänzer wrote: >>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote: On 2018-08-22 9:52 a.m., Huang Rui wrote: > The new bulk moving functionality is ready, the overhead of moving > PD/PT bos to > LRU is fixed. So move them on LRU again. > > Signed-off-by: Huang Rui > Tested-by: Mike Lothian > Tested-by: Dieter Nützel > Acked-by: Chunming Zhou > Reviewed-by: Junwei Zhang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index db1f28a..d195a3d 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct > amdgpu_device *adev, > struct amdgpu_vm_bo_base, > vm_status); > bo_base->moved = false; > - list_del_init(_base->vm_status); > + list_move(_base->vm_status, >idle); > bo = bo_base->bo->parent; > if (!bo) > Since this change, I'm getting various badness when running piglit using radeonsi on Bonaire, see the attached dmesg excerpt. Reverting just this change on top of current amd-staging-drm-next avoids the problem. Looks like some list manipulation isn't sufficiently protected against concurrent execution? >>> KASAN pointed me to one issue: >>> https://patchwork.freedesktop.org/patch/246212/ >>> >>> However, this doesn't fully fix the problem. >> Ray, any ideas yet for solving this? If not, let's revert this change >> for now. > > I've gone over this multiple times now as well, but can't find anything > obvious wrong either. After looking at the code, one question: Why does vm->moved need a spinlock, but not vm->idle? What is protecting against concurrent access to the latter? -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again
On 2018-08-29 10:57 a.m., Christian König wrote: > Am 29.08.2018 um 09:52 schrieb Michel Dänzer: >> On 2018-08-28 7:03 p.m., Michel Dänzer wrote: >>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote: On 2018-08-22 9:52 a.m., Huang Rui wrote: > The new bulk moving functionality is ready, the overhead of moving > PD/PT bos to > LRU is fixed. So move them on LRU again. > > Signed-off-by: Huang Rui > Tested-by: Mike Lothian > Tested-by: Dieter Nützel > Acked-by: Chunming Zhou > Reviewed-by: Junwei Zhang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index db1f28a..d195a3d 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct > amdgpu_device *adev, > struct amdgpu_vm_bo_base, > vm_status); > bo_base->moved = false; > - list_del_init(_base->vm_status); > + list_move(_base->vm_status, >idle); > bo = bo_base->bo->parent; > if (!bo) > Since this change, I'm getting various badness when running piglit using radeonsi on Bonaire, see the attached dmesg excerpt. Reverting just this change on top of current amd-staging-drm-next avoids the problem. Looks like some list manipulation isn't sufficiently protected against concurrent execution? >>> KASAN pointed me to one issue: >>> https://patchwork.freedesktop.org/patch/246212/ >>> >>> However, this doesn't fully fix the problem. >> Ray, any ideas yet for solving this? If not, let's revert this change >> for now. > > I've gone over this multiple times now as well, but can't find anything > obvious wrong either. Thanks for looking into it. > If we don't have any more ideas I would say revert it for now and try to > debug it further. Yep. > BTW: Any idea how to force the issue? Not specifically. It happens reliably and pretty quickly for me when running the piglit gpu profile. -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again
Am 29.08.2018 um 09:52 schrieb Michel Dänzer: On 2018-08-28 7:03 p.m., Michel Dänzer wrote: On 2018-08-28 11:14 a.m., Michel Dänzer wrote: On 2018-08-22 9:52 a.m., Huang Rui wrote: The new bulk moving functionality is ready, the overhead of moving PD/PT bos to LRU is fixed. So move them on LRU again. Signed-off-by: Huang Rui Tested-by: Mike Lothian Tested-by: Dieter Nützel Acked-by: Chunming Zhou Reviewed-by: Junwei Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index db1f28a..d195a3d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev, struct amdgpu_vm_bo_base, vm_status); bo_base->moved = false; - list_del_init(_base->vm_status); + list_move(_base->vm_status, >idle); bo = bo_base->bo->parent; if (!bo) Since this change, I'm getting various badness when running piglit using radeonsi on Bonaire, see the attached dmesg excerpt. Reverting just this change on top of current amd-staging-drm-next avoids the problem. Looks like some list manipulation isn't sufficiently protected against concurrent execution? KASAN pointed me to one issue: https://patchwork.freedesktop.org/patch/246212/ However, this doesn't fully fix the problem. Ray, any ideas yet for solving this? If not, let's revert this change for now. I've gone over this multiple times now as well, but can't find anything obvious wrong either. If we don't have any more ideas I would say revert it for now and try to debug it further. BTW: Any idea how to force the issue? Christian. ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again
On 2018-08-28 7:03 p.m., Michel Dänzer wrote: > On 2018-08-28 11:14 a.m., Michel Dänzer wrote: >> On 2018-08-22 9:52 a.m., Huang Rui wrote: >>> The new bulk moving functionality is ready, the overhead of moving PD/PT >>> bos to >>> LRU is fixed. So move them on LRU again. >>> >>> Signed-off-by: Huang Rui >>> Tested-by: Mike Lothian >>> Tested-by: Dieter Nützel >>> Acked-by: Chunming Zhou >>> Reviewed-by: Junwei Zhang >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> index db1f28a..d195a3d 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device >>> *adev, >>>struct amdgpu_vm_bo_base, >>>vm_status); >>> bo_base->moved = false; >>> - list_del_init(_base->vm_status); >>> + list_move(_base->vm_status, >idle); >>> >>> bo = bo_base->bo->parent; >>> if (!bo) >>> >> >> Since this change, I'm getting various badness when running piglit using >> radeonsi on Bonaire, see the attached dmesg excerpt. >> >> Reverting just this change on top of current amd-staging-drm-next avoids >> the problem. >> >> Looks like some list manipulation isn't sufficiently protected against >> concurrent execution? > > KASAN pointed me to one issue: > https://patchwork.freedesktop.org/patch/246212/ > > However, this doesn't fully fix the problem. Ray, any ideas yet for solving this? If not, let's revert this change for now. -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again
On 2018-08-28 11:14 a.m., Michel Dänzer wrote: > > Hi Ray, > > > On 2018-08-22 9:52 a.m., Huang Rui wrote: >> The new bulk moving functionality is ready, the overhead of moving PD/PT bos >> to >> LRU is fixed. So move them on LRU again. >> >> Signed-off-by: Huang Rui >> Tested-by: Mike Lothian >> Tested-by: Dieter Nützel >> Acked-by: Chunming Zhou >> Reviewed-by: Junwei Zhang >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> index db1f28a..d195a3d 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device >> *adev, >> struct amdgpu_vm_bo_base, >> vm_status); >> bo_base->moved = false; >> -list_del_init(_base->vm_status); >> +list_move(_base->vm_status, >idle); >> >> bo = bo_base->bo->parent; >> if (!bo) >> > > Since this change, I'm getting various badness when running piglit using > radeonsi on Bonaire, see the attached dmesg excerpt. > > Reverting just this change on top of current amd-staging-drm-next avoids > the problem. > > Looks like some list manipulation isn't sufficiently protected against > concurrent execution? KASAN pointed me to one issue: https://patchwork.freedesktop.org/patch/246212/ However, this doesn't fully fix the problem. -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again
Hi Ray, On 2018-08-22 9:52 a.m., Huang Rui wrote: > The new bulk moving functionality is ready, the overhead of moving PD/PT bos > to > LRU is fixed. So move them on LRU again. > > Signed-off-by: Huang Rui > Tested-by: Mike Lothian > Tested-by: Dieter Nützel > Acked-by: Chunming Zhou > Reviewed-by: Junwei Zhang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index db1f28a..d195a3d 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device > *adev, > struct amdgpu_vm_bo_base, > vm_status); > bo_base->moved = false; > - list_del_init(_base->vm_status); > + list_move(_base->vm_status, >idle); > > bo = bo_base->bo->parent; > if (!bo) > Since this change, I'm getting various badness when running piglit using radeonsi on Bonaire, see the attached dmesg excerpt. Reverting just this change on top of current amd-staging-drm-next avoids the problem. Looks like some list manipulation isn't sufficiently protected against concurrent execution? -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer Aug 27 17:34:59 kaveri kernel: [ 567.429026] WARNING: CPU: 7 PID: 12214 at drivers/gpu/drm//ttm/ttm_bo.c:228 ttm_bo_move_to_lru_tail+0x28b/0x3d0 [ttm] Aug 27 17:34:59 kaveri kernel: [ 567.429029] Modules linked in: fuse(E) lz4(E) lz4_compress(E) amdkfd(OE) amdgpu(OE) cpufreq_powersave(E) cpufreq_userspace(E) cpufreq_conservative(E) chash(OE) gpu_sched(OE) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) edac_mce_amd(E) fat(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) radeon(OE) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) ttm(OE) snd_hda_intel(E) wmi_bmof(E) aesni_intel(E) efi_pstore(E) aes_x86_64(E) crypto_simd(E) drm_kms_helper(OE) cryptd(E) glue_helper(E) pcspkr(E) efivars(E) snd_hda_codec(E) k10temp(E) drm(OE) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) i2c_algo_bit(E) fb_sys_fops(E) r8169(E) sp5100_tco(E) syscopyarea(E) snd(E) sysfillrect(E) ccp(E) sg(E) mii(E) sysimgblt(E) soundcore(E) i2c_piix4(E) Aug 27 17:34:59 kaveri kernel: [ 567.429118] rng_core(E) wmi(E) button(E) acpi_cpufreq(E) tcp_bbr(E) sch_fq(E) sunrpc(E) nct6775(E) hwmon_vid(E) efivarfs(E) ip_tables(E) x_tables(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) fscrypto(E) dm_mod(E) raid10(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) sd_mod(E) evdev(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) xhci_hcd(E) crc32c_intel(E) scsi_mod(E) usbcore(E) gpio_amdpt(E) gpio_generic(E) Aug 27 17:34:59 kaveri kernel: [ 567.429182] CPU: 7 PID: 12214 Comm: shader_run:cs0 Tainted: GW OE 4.18.0-rc1+ #111 Aug 27 17:34:59 kaveri kernel: [ 567.429184] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017 Aug 27 17:34:59 kaveri kernel: [ 567.429191] RIP: 0010:ttm_bo_move_to_lru_tail+0x28b/0x3d0 [ttm] Aug 27 17:34:59 kaveri kernel: [ 567.429192] Code: c1 ea 03 80 3c 02 00 0f 85 e6 00 00 00 48 8b 83 e8 01 00 00 be ff ff ff ff 48 8d 78 60 e8 1d 08 1e c3 85 c0 0f 85 c3 fd ff ff <0f> 0b e9 bc fd ff ff 48 8d bb d0 01 00 00 48 b8 00 00 00 00 00 fc Aug 27 17:34:59 kaveri kernel: [ 567.429285] RSP: 0018:8803e6aa76a8 EFLAGS: 00010246 Aug 27 17:34:59 kaveri kernel: [ 567.429290] RAX: RBX: 8803d7f4aad0 RCX: 11007afe95f1 Aug 27 17:34:59 kaveri kernel: [ 567.429292] RDX: RSI: 8803c61b57a0 RDI: 0246 Aug 27 17:34:59 kaveri kernel: [ 567.429295] RBP: 8803d4c58638 R08: ed007cd54ebe R09: ed007cd54ebe Aug 27 17:34:59 kaveri kernel: [ 567.429297] R10: 0001 R11: ed007cd54ebe R12: 8803d4c58078 Aug 27 17:34:59 kaveri kernel: [ 567.429299] R13: 8803d4c58000 R14: 8803d7f4aa80 R15: dc00 Aug 27 17:34:59 kaveri kernel: [ 567.429301] FS: 7ff030b2d700() GS:8803ee1c() knlGS: Aug 27 17:34:59 kaveri kernel: [ 567.429304] CS: 0010 DS: ES: CR0: 80050033 Aug 27 17:34:59 kaveri kernel: [ 567.429306] CR2: 7ff00cee6a90 CR3: 0003471e6000 CR4: 003406e0 Aug 27 17:34:59 kaveri kernel: [ 567.429307] Call Trace: Aug 27 17:34:59 kaveri kernel: [ 567.429363] amdgpu_vm_move_to_lru_tail+0x128/0x240 [amdgpu] Aug 27 17:34:59 kaveri kernel: [ 567.429420] amdgpu_cs_ioctl+0x967/0x4ba0 [amdgpu] Aug 27 17:34:59 kaveri kernel: [ 567.429433] ? lock_acquire+0x10b/0x330 Aug 27 17:34:59 kaveri kernel: [