Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again

2018-08-29 Thread Christian König

Am 29.08.2018 um 16:51 schrieb Michel Dänzer:

On 2018-08-29 10:57 a.m., Christian König wrote:

Am 29.08.2018 um 09:52 schrieb Michel Dänzer:

On 2018-08-28 7:03 p.m., Michel Dänzer wrote:

On 2018-08-28 11:14 a.m., Michel Dänzer wrote:

On 2018-08-22 9:52 a.m., Huang Rui wrote:

The new bulk moving functionality is ready, the overhead of moving
PD/PT bos to
LRU is fixed. So move them on LRU again.

Signed-off-by: Huang Rui 
Tested-by: Mike Lothian 
Tested-by: Dieter Nützel 
Acked-by: Chunming Zhou 
Reviewed-by: Junwei Zhang 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index db1f28a..d195a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct
amdgpu_device *adev,
  struct amdgpu_vm_bo_base,
  vm_status);
   bo_base->moved = false;
-    list_del_init(_base->vm_status);
+    list_move(_base->vm_status, >idle);
     bo = bo_base->bo->parent;
   if (!bo)


Since this change, I'm getting various badness when running piglit
using
radeonsi on Bonaire, see the attached dmesg excerpt.

Reverting just this change on top of current amd-staging-drm-next
avoids
the problem.

Looks like some list manipulation isn't sufficiently protected against
concurrent execution?

KASAN pointed me to one issue:
https://patchwork.freedesktop.org/patch/246212/

However, this doesn't fully fix the problem.

Ray, any ideas yet for solving this? If not, let's revert this change
for now.

I've gone over this multiple times now as well, but can't find anything
obvious wrong either.

After looking at the code, one question: Why does vm->moved need a
spinlock, but not vm->idle? What is protecting against concurrent access
to the latter?


The moved state is occupied by both normal and per VM BOs, e.g. BOs with 
different reservation objects.


All other states are only used by per VM BOs or PDs/PTs, so we only put 
the BOs on those when the reservation object of the root BO is locked.


We could probably split the moved state into two separate ones to 
further avoid that lock.


Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again

2018-08-29 Thread Michel Dänzer
On 2018-08-29 10:57 a.m., Christian König wrote:
> Am 29.08.2018 um 09:52 schrieb Michel Dänzer:
>> On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
>>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
 On 2018-08-22 9:52 a.m., Huang Rui wrote:
> The new bulk moving functionality is ready, the overhead of moving
> PD/PT bos to
> LRU is fixed. So move them on LRU again.
>
> Signed-off-by: Huang Rui 
> Tested-by: Mike Lothian 
> Tested-by: Dieter Nützel 
> Acked-by: Chunming Zhou 
> Reviewed-by: Junwei Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index db1f28a..d195a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct
> amdgpu_device *adev,
>  struct amdgpu_vm_bo_base,
>  vm_status);
>   bo_base->moved = false;
> -    list_del_init(_base->vm_status);
> +    list_move(_base->vm_status, >idle);
>     bo = bo_base->bo->parent;
>   if (!bo)
>
 Since this change, I'm getting various badness when running piglit
 using
 radeonsi on Bonaire, see the attached dmesg excerpt.

 Reverting just this change on top of current amd-staging-drm-next
 avoids
 the problem.

 Looks like some list manipulation isn't sufficiently protected against
 concurrent execution?
>>> KASAN pointed me to one issue:
>>> https://patchwork.freedesktop.org/patch/246212/
>>>
>>> However, this doesn't fully fix the problem.
>> Ray, any ideas yet for solving this? If not, let's revert this change
>> for now.
> 
> I've gone over this multiple times now as well, but can't find anything
> obvious wrong either.

After looking at the code, one question: Why does vm->moved need a
spinlock, but not vm->idle? What is protecting against concurrent access
to the latter?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again

2018-08-29 Thread Michel Dänzer
On 2018-08-29 10:57 a.m., Christian König wrote:
> Am 29.08.2018 um 09:52 schrieb Michel Dänzer:
>> On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
>>> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
 On 2018-08-22 9:52 a.m., Huang Rui wrote:
> The new bulk moving functionality is ready, the overhead of moving
> PD/PT bos to
> LRU is fixed. So move them on LRU again.
>
> Signed-off-by: Huang Rui 
> Tested-by: Mike Lothian 
> Tested-by: Dieter Nützel 
> Acked-by: Chunming Zhou 
> Reviewed-by: Junwei Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index db1f28a..d195a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct
> amdgpu_device *adev,
>  struct amdgpu_vm_bo_base,
>  vm_status);
>   bo_base->moved = false;
> -    list_del_init(_base->vm_status);
> +    list_move(_base->vm_status, >idle);
>     bo = bo_base->bo->parent;
>   if (!bo)
>
 Since this change, I'm getting various badness when running piglit
 using
 radeonsi on Bonaire, see the attached dmesg excerpt.

 Reverting just this change on top of current amd-staging-drm-next
 avoids
 the problem.

 Looks like some list manipulation isn't sufficiently protected against
 concurrent execution?
>>> KASAN pointed me to one issue:
>>> https://patchwork.freedesktop.org/patch/246212/
>>>
>>> However, this doesn't fully fix the problem.
>> Ray, any ideas yet for solving this? If not, let's revert this change
>> for now.
> 
> I've gone over this multiple times now as well, but can't find anything
> obvious wrong either.

Thanks for looking into it.


> If we don't have any more ideas I would say revert it for now and try to
> debug it further.

Yep.


> BTW: Any idea how to force the issue?

Not specifically. It happens reliably and pretty quickly for me when
running the piglit gpu profile.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again

2018-08-29 Thread Christian König

Am 29.08.2018 um 09:52 schrieb Michel Dänzer:

On 2018-08-28 7:03 p.m., Michel Dänzer wrote:

On 2018-08-28 11:14 a.m., Michel Dänzer wrote:

On 2018-08-22 9:52 a.m., Huang Rui wrote:

The new bulk moving functionality is ready, the overhead of moving PD/PT bos to
LRU is fixed. So move them on LRU again.

Signed-off-by: Huang Rui 
Tested-by: Mike Lothian 
Tested-by: Dieter Nützel 
Acked-by: Chunming Zhou 
Reviewed-by: Junwei Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index db1f28a..d195a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device 
*adev,
   struct amdgpu_vm_bo_base,
   vm_status);
bo_base->moved = false;
-   list_del_init(_base->vm_status);
+   list_move(_base->vm_status, >idle);
  
  		bo = bo_base->bo->parent;

if (!bo)


Since this change, I'm getting various badness when running piglit using
radeonsi on Bonaire, see the attached dmesg excerpt.

Reverting just this change on top of current amd-staging-drm-next avoids
the problem.

Looks like some list manipulation isn't sufficiently protected against
concurrent execution?

KASAN pointed me to one issue:
https://patchwork.freedesktop.org/patch/246212/

However, this doesn't fully fix the problem.

Ray, any ideas yet for solving this? If not, let's revert this change
for now.


I've gone over this multiple times now as well, but can't find anything 
obvious wrong either.


If we don't have any more ideas I would say revert it for now and try to 
debug it further.


BTW: Any idea how to force the issue?

Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again

2018-08-29 Thread Michel Dänzer
On 2018-08-28 7:03 p.m., Michel Dänzer wrote:
> On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
>> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>>> The new bulk moving functionality is ready, the overhead of moving PD/PT 
>>> bos to
>>> LRU is fixed. So move them on LRU again.
>>>
>>> Signed-off-by: Huang Rui 
>>> Tested-by: Mike Lothian 
>>> Tested-by: Dieter Nützel 
>>> Acked-by: Chunming Zhou 
>>> Reviewed-by: Junwei Zhang 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index db1f28a..d195a3d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device 
>>> *adev,
>>>struct amdgpu_vm_bo_base,
>>>vm_status);
>>> bo_base->moved = false;
>>> -   list_del_init(_base->vm_status);
>>> +   list_move(_base->vm_status, >idle);
>>>  
>>> bo = bo_base->bo->parent;
>>> if (!bo)
>>>
>>
>> Since this change, I'm getting various badness when running piglit using
>> radeonsi on Bonaire, see the attached dmesg excerpt.
>>
>> Reverting just this change on top of current amd-staging-drm-next avoids
>> the problem.
>>
>> Looks like some list manipulation isn't sufficiently protected against
>> concurrent execution?
> 
> KASAN pointed me to one issue:
> https://patchwork.freedesktop.org/patch/246212/
> 
> However, this doesn't fully fix the problem.

Ray, any ideas yet for solving this? If not, let's revert this change
for now.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again

2018-08-28 Thread Michel Dänzer
On 2018-08-28 11:14 a.m., Michel Dänzer wrote:
> 
> Hi Ray,
> 
> 
> On 2018-08-22 9:52 a.m., Huang Rui wrote:
>> The new bulk moving functionality is ready, the overhead of moving PD/PT bos 
>> to
>> LRU is fixed. So move them on LRU again.
>>
>> Signed-off-by: Huang Rui 
>> Tested-by: Mike Lothian 
>> Tested-by: Dieter Nützel 
>> Acked-by: Chunming Zhou 
>> Reviewed-by: Junwei Zhang 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index db1f28a..d195a3d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device 
>> *adev,
>> struct amdgpu_vm_bo_base,
>> vm_status);
>>  bo_base->moved = false;
>> -list_del_init(_base->vm_status);
>> +list_move(_base->vm_status, >idle);
>>  
>>  bo = bo_base->bo->parent;
>>  if (!bo)
>>
> 
> Since this change, I'm getting various badness when running piglit using
> radeonsi on Bonaire, see the attached dmesg excerpt.
> 
> Reverting just this change on top of current amd-staging-drm-next avoids
> the problem.
> 
> Looks like some list manipulation isn't sufficiently protected against
> concurrent execution?

KASAN pointed me to one issue:
https://patchwork.freedesktop.org/patch/246212/

However, this doesn't fully fix the problem.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 5/5] drm/amdgpu: move PD/PT bos on LRU again

2018-08-28 Thread Michel Dänzer

Hi Ray,


On 2018-08-22 9:52 a.m., Huang Rui wrote:
> The new bulk moving functionality is ready, the overhead of moving PD/PT bos 
> to
> LRU is fixed. So move them on LRU again.
> 
> Signed-off-by: Huang Rui 
> Tested-by: Mike Lothian 
> Tested-by: Dieter Nützel 
> Acked-by: Chunming Zhou 
> Reviewed-by: Junwei Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index db1f28a..d195a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1107,7 +1107,7 @@ int amdgpu_vm_update_directories(struct amdgpu_device 
> *adev,
>  struct amdgpu_vm_bo_base,
>  vm_status);
>   bo_base->moved = false;
> - list_del_init(_base->vm_status);
> + list_move(_base->vm_status, >idle);
>  
>   bo = bo_base->bo->parent;
>   if (!bo)
> 

Since this change, I'm getting various badness when running piglit using
radeonsi on Bonaire, see the attached dmesg excerpt.

Reverting just this change on top of current amd-staging-drm-next avoids
the problem.

Looks like some list manipulation isn't sufficiently protected against
concurrent execution?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
Aug 27 17:34:59 kaveri kernel: [  567.429026] WARNING: CPU: 7 PID: 12214 at drivers/gpu/drm//ttm/ttm_bo.c:228 ttm_bo_move_to_lru_tail+0x28b/0x3d0 [ttm]
Aug 27 17:34:59 kaveri kernel: [  567.429029] Modules linked in: fuse(E) lz4(E) lz4_compress(E) amdkfd(OE) amdgpu(OE) cpufreq_powersave(E) cpufreq_userspace(E) cpufreq_conservative(E) chash(OE) gpu_sched(OE) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) edac_mce_amd(E) fat(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) radeon(OE) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) ttm(OE) snd_hda_intel(E) wmi_bmof(E) aesni_intel(E) efi_pstore(E) aes_x86_64(E) crypto_simd(E) drm_kms_helper(OE) cryptd(E) glue_helper(E) pcspkr(E) efivars(E) snd_hda_codec(E) k10temp(E) drm(OE) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) i2c_algo_bit(E) fb_sys_fops(E) r8169(E) sp5100_tco(E) syscopyarea(E) snd(E) sysfillrect(E) ccp(E) sg(E) mii(E) sysimgblt(E) soundcore(E) i2c_piix4(E)
Aug 27 17:34:59 kaveri kernel: [  567.429118]  rng_core(E) wmi(E) button(E) acpi_cpufreq(E) tcp_bbr(E) sch_fq(E) sunrpc(E) nct6775(E) hwmon_vid(E) efivarfs(E) ip_tables(E) x_tables(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) fscrypto(E) dm_mod(E) raid10(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) sd_mod(E) evdev(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) xhci_hcd(E) crc32c_intel(E) scsi_mod(E) usbcore(E) gpio_amdpt(E) gpio_generic(E)
Aug 27 17:34:59 kaveri kernel: [  567.429182] CPU: 7 PID: 12214 Comm: shader_run:cs0 Tainted: GW  OE 4.18.0-rc1+ #111
Aug 27 17:34:59 kaveri kernel: [  567.429184] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017
Aug 27 17:34:59 kaveri kernel: [  567.429191] RIP: 0010:ttm_bo_move_to_lru_tail+0x28b/0x3d0 [ttm]
Aug 27 17:34:59 kaveri kernel: [  567.429192] Code: c1 ea 03 80 3c 02 00 0f 85 e6 00 00 00 48 8b 83 e8 01 00 00 be ff ff ff ff 48 8d 78 60 e8 1d 08 1e c3 85 c0 0f 85 c3 fd ff ff <0f> 0b e9 bc fd ff ff 48 8d bb d0 01 00 00 48 b8 00 00 00 00 00 fc 
Aug 27 17:34:59 kaveri kernel: [  567.429285] RSP: 0018:8803e6aa76a8 EFLAGS: 00010246
Aug 27 17:34:59 kaveri kernel: [  567.429290] RAX:  RBX: 8803d7f4aad0 RCX: 11007afe95f1
Aug 27 17:34:59 kaveri kernel: [  567.429292] RDX:  RSI: 8803c61b57a0 RDI: 0246
Aug 27 17:34:59 kaveri kernel: [  567.429295] RBP: 8803d4c58638 R08: ed007cd54ebe R09: ed007cd54ebe
Aug 27 17:34:59 kaveri kernel: [  567.429297] R10: 0001 R11: ed007cd54ebe R12: 8803d4c58078
Aug 27 17:34:59 kaveri kernel: [  567.429299] R13: 8803d4c58000 R14: 8803d7f4aa80 R15: dc00
Aug 27 17:34:59 kaveri kernel: [  567.429301] FS:  7ff030b2d700() GS:8803ee1c() knlGS:
Aug 27 17:34:59 kaveri kernel: [  567.429304] CS:  0010 DS:  ES:  CR0: 80050033
Aug 27 17:34:59 kaveri kernel: [  567.429306] CR2: 7ff00cee6a90 CR3: 0003471e6000 CR4: 003406e0
Aug 27 17:34:59 kaveri kernel: [  567.429307] Call Trace:
Aug 27 17:34:59 kaveri kernel: [  567.429363]  amdgpu_vm_move_to_lru_tail+0x128/0x240 [amdgpu]
Aug 27 17:34:59 kaveri kernel: [  567.429420]  amdgpu_cs_ioctl+0x967/0x4ba0 [amdgpu]
Aug 27 17:34:59 kaveri kernel: [  567.429433]  ? lock_acquire+0x10b/0x330
Aug 27 17:34:59 kaveri kernel: [