Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-26 Thread Sharma, Shashank

Thanks for the patch,

Patch pushed to staging.


Regards

Shashank

On 25/03/2024 00:23, Alex Deucher wrote:

On Sat, Mar 23, 2024 at 4:47 PM Sharma, Shashank wrote:


On 23/03/2024 15:52, Johannes Weiner wrote:

On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:

Hello,

On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:

Am 07.03.24 um 23:07 schrieb Johannes Weiner:

Lastly I went with an open loop instead of a memcpy() as I wasn't
sure if that memory is safe to address a byte at a time.

Shashank pointed out to me in private that byte access would indeed be
safe. However, after actually trying it, it won't work because memcpy()
doesn't play nice with mqd being volatile:

/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In 
function 'amdgpu_debugfs_mqd_read':
/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: 
warning: passing argument 1 of '__builtin_dynamic_object_size' discards 
'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
550 | memcpy(kbuf, mqd, ring->mqd_size);

So I would propose leaving the patch as-is. Shashank, does that sound
good to you?

Friendly ping :)

Shashank, is your Reviewed-by still good for this patch, given the
above?

Ah, sorry I missed this due to some parallel work, and just realized the
memcpy/volatile limitation.

I also see the need to protect the MQD read with a lock, to avoid a
parallel change to the MQD while we do the byte-by-byte copy, but I will
add that to my to-do list.

Please feel free to use my R-b.

Shashank, if the patch looks good, can you pick it up and apply it?

Alex



- Shashank


Thanks
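
For context on the warning quoted above: the kernel's fortified memcpy() runs a
__builtin_dynamic_object_size() check on its arguments, and that check rejects a
volatile-qualified source pointer, hence the -Wdiscarded-qualifiers warning. A
minimal sketch of the open-coded byte copy the thread settles on instead; the
mqd_ptr/mqd_size fields come from struct amdgpu_ring, the rest is illustrative:

	/* Snapshot the volatile MQD one byte at a time into a plain kernel
	 * buffer, since fortified memcpy() refuses a volatile source. */
	volatile u8 *mqd = (volatile u8 *)ring->mqd_ptr;
	u8 *kbuf;
	int i;

	kbuf = kmalloc(ring->mqd_size, GFP_KERNEL);
	if (!kbuf)
		return -ENOMEM;

	for (i = 0; i < ring->mqd_size; i++)
		kbuf[i] = mqd[i];	/* byte access is safe, per the discussion above */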


Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-25 Thread Sharma, Shashank
[AMD Official Use Only - General]

Hey Alex,
Sure, I will pick it up and push it to staging.

Regards
Shashank

From: Alex Deucher 
Sent: Monday, March 25, 2024 12:23 AM
To: Sharma, Shashank 
Cc: Johannes Weiner ; Christian König ; Deucher, Alexander ; Koenig, Christian ; amd-...@lists.freedesktop.org ; dri-devel@lists.freedesktop.org ; linux-ker...@vger.kernel.org
Subject: Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

On Sat, Mar 23, 2024 at 4:47 PM Sharma, Shashank wrote:
>
>
> On 23/03/2024 15:52, Johannes Weiner wrote:
> > On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:
> >> Hello,
> >>
> >> On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:
> >>> Am 07.03.24 um 23:07 schrieb Johannes Weiner:
> >>>> Lastly I went with an open loop instead of a memcpy() as I wasn't
> >>>> sure if that memory is safe to address a byte at a time.
> >> Shashank pointed out to me in private that byte access would indeed be
> >> safe. However, after actually trying it, it won't work because memcpy()
> >> doesn't play nice with mqd being volatile:
> >>
> >> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In 
> >> function 'amdgpu_debugfs_mqd_read':
> >> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22:
> >>  warning: passing argument 1 of '__builtin_dynamic_object_size' discards 
> >> 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
> >>550 | memcpy(kbuf, mqd, ring->mqd_size);
> >>
> >> So I would propose leaving the patch as-is. Shashank, does that sound
> >> good to you?
> > Friendly ping :)
> >
> > Shashank, is your Reviewed-by still good for this patch, given the
> > above?
>
> Ah, sorry I missed this due to some parallel work, and just realized the
> memcpy/volatile limitation.
>
> I also see the need to protect the MQD read with a lock, to avoid a
> parallel change to the MQD while we do the byte-by-byte copy, but I will
> add that to my to-do list.
>
> Please feel free to use my R-b.

Shashank, if the patch looks good, can you pick it up and apply it?

Alex


>
> - Shashank
>
> > Thanks


Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-24 Thread Alex Deucher
On Sat, Mar 23, 2024 at 4:47 PM Sharma, Shashank wrote:
>
>
> On 23/03/2024 15:52, Johannes Weiner wrote:
> > On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:
> >> Hello,
> >>
> >> On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:
> >>> Am 07.03.24 um 23:07 schrieb Johannes Weiner:
>  Lastly I went with an open loop instead of a memcpy() as I wasn't
>  sure if that memory is safe to address a byte at a time.
> >> Shashank pointed out to me in private that byte access would indeed be
> >> safe. However, after actually trying it, it won't work because memcpy()
> >> doesn't play nice with mqd being volatile:
> >>
> >> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In 
> >> function 'amdgpu_debugfs_mqd_read':
> >> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22:
> >>  warning: passing argument 1 of '__builtin_dynamic_object_size' discards 
> >> 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
> >>550 | memcpy(kbuf, mqd, ring->mqd_size);
> >>
> >> So I would propose leaving the patch as-is. Shashank, does that sound
> >> good to you?
> > Friendly ping :)
> >
> > Shashank, is your Reviewed-by still good for this patch, given the
> > above?
>
> Ah, sorry I missed this due to some parallel work, and just realized the
> memcpy/volatile limitation.
>
> I also see the need to protect the MQD read with a lock, to avoid a
> parallel change to the MQD while we do the byte-by-byte copy, but I will
> add that to my to-do list.
>
> Please feel free to use my R-b.

Shashank, if the patch looks good, can you pick it up and apply it?

Alex


>
> - Shashank
>
> > Thanks


Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-23 Thread Sharma, Shashank



On 23/03/2024 15:52, Johannes Weiner wrote:

On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:

Hello,

On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:

Am 07.03.24 um 23:07 schrieb Johannes Weiner:

Lastly I went with an open loop instead of a memcpy() as I wasn't
sure if that memory is safe to address a byte at a time.

Shashank pointed out to me in private that byte access would indeed be
safe. However, after actually trying it, it won't work because memcpy()
doesn't play nice with mqd being volatile:

/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In 
function 'amdgpu_debugfs_mqd_read':
/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: 
warning: passing argument 1 of '__builtin_dynamic_object_size' discards 
'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
  550 | memcpy(kbuf, mqd, ring->mqd_size);

So I would propose leaving the patch as-is. Shashank, does that sound
good to you?

Friendly ping :)

Shashank, is your Reviewed-by still good for this patch, given the
above?


Ah, sorry I missed this due to some parallel work, and just realized the 
memcpy/volatile limitation.


I also see the need to protect the MQD read with a lock, to avoid a
parallel change to the MQD while we do the byte-by-byte copy, but I will
add that to my to-do list.


Please feel free to use my R-b.

- Shashank


Thanks
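
On the to-do above: the byte-by-byte snapshot can still race with a concurrent
MQD update. One way to close that would be to serialize MQD writers and the
debugfs reader on a common lock. A purely hypothetical sketch; mqd_lock is an
invented member, not an existing amdgpu field:

	/* Hypothetical: ring->mqd_lock does not exist in amdgpu today. MQD
	 * writers would take the same mutex, so the snapshot below sees a
	 * consistent queue descriptor. */
	mutex_lock(&ring->mqd_lock);
	for (i = 0; i < ring->mqd_size; i++)
		kbuf[i] = ((volatile u8 *)ring->mqd_ptr)[i];
	mutex_unlock(&ring->mqd_lock);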


Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-23 Thread Johannes Weiner
On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:
> Hello,
> 
> On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:
> > Am 07.03.24 um 23:07 schrieb Johannes Weiner:
> > > Lastly I went with an open loop instead of a memcpy() as I wasn't
> > > sure if that memory is safe to address a byte at a time.
> 
> Shashank pointed out to me in private that byte access would indeed be
> safe. However, after actually trying it, it won't work because memcpy()
> doesn't play nice with mqd being volatile:
> 
> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In 
> function 'amdgpu_debugfs_mqd_read':
> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: 
> warning: passing argument 1 of '__builtin_dynamic_object_size' discards 
> 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
>   550 | memcpy(kbuf, mqd, ring->mqd_size);
> 
> So I would propose leaving the patch as-is. Shashank, does that sound
> good to you?

Friendly ping :)

Shashank, is your Reviewed-by still good for this patch, given the
above?

Thanks


Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-14 Thread Johannes Weiner
Hello,

On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:
> Am 07.03.24 um 23:07 schrieb Johannes Weiner:
> > Lastly I went with an open loop instead of a memcpy() as I wasn't
> > sure if that memory is safe to address a byte at a time.

Shashank pointed out to me in private that byte access would indeed be
safe. However, after actually trying it, it won't work because memcpy()
doesn't play nice with mqd being volatile:

/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In 
function 'amdgpu_debugfs_mqd_read':
/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: 
warning: passing argument 1 of '__builtin_dynamic_object_size' discards 
'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
  550 | memcpy(kbuf, mqd, ring->mqd_size);

So I would propose leaving the patch as-is. Shashank, does that sound
good to you?

(Please keep me CC'd on replies, as I'm not subscribed to the graphics
lists.)

Thanks!


Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-14 Thread Sharma, Shashank

+ Johannes

Regards
Shashank

On 13/03/2024 18:22, Sharma, Shashank wrote:

Hello Johannes,

On 07/03/2024 23:07, Johannes Weiner wrote:

An errant disk backup on my desktop got into debugfs and triggered the
following deadlock scenario in the amdgpu debugfs files. The machine
also hard-resets immediately after those lines are printed (although I
wasn't able to reproduce that part when reading by hand):

[ 1318.016074][ T1082] ======================================================
[ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
[ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
[ 1318.017598][ T1082] ------------------------------------------------------
[ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
[ 1318.018585][ T1082] 98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
[ 1318.019084][ T1082]
[ 1318.019084][ T1082] but task is already holding lock:
[ 1318.020052][ T1082] 98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.020607][ T1082]
[ 1318.020607][ T1082] which lock already depends on the new lock.
[ 1318.020607][ T1082]
[ 1318.022081][ T1082]
[ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
[ 1318.023083][ T1082]
[ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 1318.024114][ T1082]        __ww_mutex_lock.constprop.0+0xe0/0x12f0
[ 1318.024639][ T1082]        ww_mutex_lock+0x32/0x90
[ 1318.025161][ T1082]        dma_resv_lockdep+0x18a/0x330
[ 1318.025683][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.026210][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.026728][ T1082]        kernel_init+0x15/0x1a0
[ 1318.027242][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.027759][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.028281][ T1082]
[ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
[ 1318.029297][ T1082]        dma_resv_lockdep+0x16c/0x330
[ 1318.029790][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.030263][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.030722][ T1082]        kernel_init+0x15/0x1a0
[ 1318.031168][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.031598][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.032011][ T1082]
[ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
[ 1318.032778][ T1082]        __lock_acquire+0x14bf/0x2680
[ 1318.033141][ T1082]        lock_acquire+0xcd/0x2c0
[ 1318.033487][ T1082]        __might_fault+0x58/0x80
[ 1318.033814][ T1082]        amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
[ 1318.034181][ T1082]        full_proxy_read+0x55/0x80
[ 1318.034487][ T1082]        vfs_read+0xa7/0x360
[ 1318.034788][ T1082]        ksys_read+0x70/0xf0
[ 1318.035085][ T1082]        do_syscall_64+0x94/0x180
[ 1318.035375][ T1082]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.035664][ T1082]
[ 1318.035664][ T1082] other info that might help us debug this:
[ 1318.035664][ T1082]
[ 1318.036487][ T1082] Chain exists of:
[ 1318.036487][ T1082]   &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[ 1318.036487][ T1082]
[ 1318.037310][ T1082]  Possible unsafe locking scenario:
[ 1318.037310][ T1082]
[ 1318.037838][ T1082]        CPU0                    CPU1
[ 1318.038101][ T1082]        ----                    ----
[ 1318.038350][ T1082]   lock(reservation_ww_class_mutex);
[ 1318.038590][ T1082]                                lock(reservation_ww_class_acquire);
[ 1318.038839][ T1082]                                lock(reservation_ww_class_mutex);
[ 1318.039083][ T1082]   rlock(&mm->mmap_lock);
[ 1318.039328][ T1082]
[ 1318.039328][ T1082]                *** DEADLOCK ***
[ 1318.039328][ T1082]
[ 1318.040029][ T1082] 1 lock held by tar/1082:
[ 1318.040259][ T1082]  #0: 98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.040560][ T1082]
[ 1318.040560][ T1082] stack backtrace:
[ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
[ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[ 1318.041614][ T1082] Call Trace:
[ 1318.041895][ T1082]  <TASK>
[ 1318.042175][ T1082]  dump_stack_lvl+0x4a/0x80
[ 1318.042460][ T1082]  check_noncircular+0x145/0x160
[ 1318.042743][ T1082]  __lock_acquire+0x14bf/0x2680
[ 1318.043022][ T1082]  lock_acquire+0xcd/0x2c0
[ 1318.043301][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043580][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043856][ T1082]  __might_fault+0x58/0x80
[ 1318.044131][ T1082]  ? __might_fault+0x40/0x80
[ 1318.044408][ T1082]  amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
[ 1318.044749][ T1082]  full_proxy_read+0x55/0x80
[ 1318.045042][ T1082]  vfs_read+0xa7/0x360
[ 1318.045333][ T1082]  ksys_read+0x70/0xf0
[ 1318.045623][ T1082]  do_syscall_64+0x94/0x180
[ 1318.045913][ T1082]  ? do_syscall_64+0xa0/0x180
[ 
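
To spell out the inversion lockdep reports above: amdgpu_debugfs_mqd_read()
holds the MQD buffer object's reservation (a reservation_ww_class_mutex) while
copying to user memory; that copy can fault and take mm->mmap_lock, while the
rest of the kernel establishes the opposite mmap_lock -> reservation ordering.
The fix pattern is to snapshot the MQD into a kernel bounce buffer while the BO
is reserved, drop the reservation, and only then touch userspace. A sketch
along those lines, using the amdgpu BO helpers (amdgpu_bo_reserve/kmap/kunmap/
unreserve); the debugfs wiring via i_private and the exact field names are
assumptions, not the literal patch:

#include <linux/fs.h>
#include <linux/slab.h>
#include "amdgpu.h"	/* struct amdgpu_ring, amdgpu_bo_* helpers */

static ssize_t amdgpu_debugfs_mqd_read(struct file *f, char __user *buf,
				       size_t size, loff_t *pos)
{
	struct amdgpu_ring *ring = file_inode(f)->i_private; /* assumed wiring */
	volatile u8 *mqd;
	ssize_t ret;
	void *ptr;
	u8 *kbuf;
	int i, r;

	kbuf = kmalloc(ring->mqd_size, GFP_KERNEL);
	if (!kbuf)
		return -ENOMEM;

	r = amdgpu_bo_reserve(ring->mqd_obj, false);
	if (r) {
		ret = r;
		goto out_free;
	}

	r = amdgpu_bo_kmap(ring->mqd_obj, &ptr);
	if (r) {
		ret = r;
		goto out_unreserve;
	}

	/* Snapshot the volatile MQD byte by byte while the BO is reserved. */
	mqd = ptr;
	for (i = 0; i < ring->mqd_size; i++)
		kbuf[i] = mqd[i];

	amdgpu_bo_kunmap(ring->mqd_obj);
	amdgpu_bo_unreserve(ring->mqd_obj);

	/* The user copy may fault and take mmap_lock; the reservation is no
	 * longer held here, so the chain above cannot form. */
	ret = simple_read_from_buffer(buf, size, pos, kbuf, ring->mqd_size);
	goto out_free;

out_unreserve:
	amdgpu_bo_unreserve(ring->mqd_obj);
out_free:
	kfree(kbuf);
	return ret;
}

The decisive detail is the bounce buffer: the fault-prone user copy (inside
simple_read_from_buffer()) runs strictly after amdgpu_bo_unreserve().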

Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-13 Thread Sharma, Shashank

Hello Johannes,

On 07/03/2024 23:07, Johannes Weiner wrote:

An errant disk backup on my desktop got into debugfs and triggered the
following deadlock scenario in the amdgpu debugfs files. The machine
also hard-resets immediately after those lines are printed (although I
wasn't able to reproduce that part when reading by hand):

[ 1318.016074][ T1082] ======================================================
[ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
[ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
[ 1318.017598][ T1082] ------------------------------------------------------
[ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
[ 1318.018585][ T1082] 98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
[ 1318.019084][ T1082]
[ 1318.019084][ T1082] but task is already holding lock:
[ 1318.020052][ T1082] 98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.020607][ T1082]
[ 1318.020607][ T1082] which lock already depends on the new lock.
[ 1318.020607][ T1082]
[ 1318.022081][ T1082]
[ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
[ 1318.023083][ T1082]
[ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 1318.024114][ T1082]        __ww_mutex_lock.constprop.0+0xe0/0x12f0
[ 1318.024639][ T1082]        ww_mutex_lock+0x32/0x90
[ 1318.025161][ T1082]        dma_resv_lockdep+0x18a/0x330
[ 1318.025683][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.026210][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.026728][ T1082]        kernel_init+0x15/0x1a0
[ 1318.027242][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.027759][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.028281][ T1082]
[ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
[ 1318.029297][ T1082]        dma_resv_lockdep+0x16c/0x330
[ 1318.029790][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.030263][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.030722][ T1082]        kernel_init+0x15/0x1a0
[ 1318.031168][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.031598][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.032011][ T1082]
[ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
[ 1318.032778][ T1082]        __lock_acquire+0x14bf/0x2680
[ 1318.033141][ T1082]        lock_acquire+0xcd/0x2c0
[ 1318.033487][ T1082]        __might_fault+0x58/0x80
[ 1318.033814][ T1082]        amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
[ 1318.034181][ T1082]        full_proxy_read+0x55/0x80
[ 1318.034487][ T1082]        vfs_read+0xa7/0x360
[ 1318.034788][ T1082]        ksys_read+0x70/0xf0
[ 1318.035085][ T1082]        do_syscall_64+0x94/0x180
[ 1318.035375][ T1082]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.035664][ T1082]
[ 1318.035664][ T1082] other info that might help us debug this:
[ 1318.035664][ T1082]
[ 1318.036487][ T1082] Chain exists of:
[ 1318.036487][ T1082]   &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[ 1318.036487][ T1082]
[ 1318.037310][ T1082]  Possible unsafe locking scenario:
[ 1318.037310][ T1082]
[ 1318.037838][ T1082]        CPU0                    CPU1
[ 1318.038101][ T1082]        ----                    ----
[ 1318.038350][ T1082]   lock(reservation_ww_class_mutex);
[ 1318.038590][ T1082]                                lock(reservation_ww_class_acquire);
[ 1318.038839][ T1082]                                lock(reservation_ww_class_mutex);
[ 1318.039083][ T1082]   rlock(&mm->mmap_lock);
[ 1318.039328][ T1082]
[ 1318.039328][ T1082]                *** DEADLOCK ***
[ 1318.039328][ T1082]
[ 1318.040029][ T1082] 1 lock held by tar/1082:
[ 1318.040259][ T1082]  #0: 98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.040560][ T1082]
[ 1318.040560][ T1082] stack backtrace:
[ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
[ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[ 1318.041614][ T1082] Call Trace:
[ 1318.041895][ T1082]  <TASK>
[ 1318.042175][ T1082]  dump_stack_lvl+0x4a/0x80
[ 1318.042460][ T1082]  check_noncircular+0x145/0x160
[ 1318.042743][ T1082]  __lock_acquire+0x14bf/0x2680
[ 1318.043022][ T1082]  lock_acquire+0xcd/0x2c0
[ 1318.043301][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043580][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043856][ T1082]  __might_fault+0x58/0x80
[ 1318.044131][ T1082]  ? __might_fault+0x40/0x80
[ 1318.044408][ T1082]  amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
[ 1318.044749][ T1082]  full_proxy_read+0x55/0x80
[ 1318.045042][ T1082]  vfs_read+0xa7/0x360
[ 1318.045333][ T1082]  ksys_read+0x70/0xf0
[ 1318.045623][ T1082]  do_syscall_64+0x94/0x180
[ 1318.045913][ T1082]  ? do_syscall_64+0xa0/0x180
[ 

Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-08 Thread Christian König

Good catch. Shashank, can you take a closer look?

Thanks,
Christian.

Am 07.03.24 um 23:07 schrieb Johannes Weiner:

An errant disk backup on my desktop got into debugfs and triggered the
following deadlock scenario in the amdgpu debugfs files. The machine
also hard-resets immediately after those lines are printed (although I
wasn't able to reproduce that part when reading by hand):

[ 1318.016074][ T1082] ======================================================
[ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
[ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
[ 1318.017598][ T1082] ------------------------------------------------------
[ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
[ 1318.018585][ T1082] 98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
[ 1318.019084][ T1082]
[ 1318.019084][ T1082] but task is already holding lock:
[ 1318.020052][ T1082] 98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.020607][ T1082]
[ 1318.020607][ T1082] which lock already depends on the new lock.
[ 1318.020607][ T1082]
[ 1318.022081][ T1082]
[ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
[ 1318.023083][ T1082]
[ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 1318.024114][ T1082]        __ww_mutex_lock.constprop.0+0xe0/0x12f0
[ 1318.024639][ T1082]        ww_mutex_lock+0x32/0x90
[ 1318.025161][ T1082]        dma_resv_lockdep+0x18a/0x330
[ 1318.025683][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.026210][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.026728][ T1082]        kernel_init+0x15/0x1a0
[ 1318.027242][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.027759][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.028281][ T1082]
[ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
[ 1318.029297][ T1082]        dma_resv_lockdep+0x16c/0x330
[ 1318.029790][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.030263][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.030722][ T1082]        kernel_init+0x15/0x1a0
[ 1318.031168][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.031598][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.032011][ T1082]
[ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
[ 1318.032778][ T1082]        __lock_acquire+0x14bf/0x2680
[ 1318.033141][ T1082]        lock_acquire+0xcd/0x2c0
[ 1318.033487][ T1082]        __might_fault+0x58/0x80
[ 1318.033814][ T1082]        amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
[ 1318.034181][ T1082]        full_proxy_read+0x55/0x80
[ 1318.034487][ T1082]        vfs_read+0xa7/0x360
[ 1318.034788][ T1082]        ksys_read+0x70/0xf0
[ 1318.035085][ T1082]        do_syscall_64+0x94/0x180
[ 1318.035375][ T1082]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.035664][ T1082]
[ 1318.035664][ T1082] other info that might help us debug this:
[ 1318.035664][ T1082]
[ 1318.036487][ T1082] Chain exists of:
[ 1318.036487][ T1082]   &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[ 1318.036487][ T1082]
[ 1318.037310][ T1082]  Possible unsafe locking scenario:
[ 1318.037310][ T1082]
[ 1318.037838][ T1082]        CPU0                    CPU1
[ 1318.038101][ T1082]        ----                    ----
[ 1318.038350][ T1082]   lock(reservation_ww_class_mutex);
[ 1318.038590][ T1082]                                lock(reservation_ww_class_acquire);
[ 1318.038839][ T1082]                                lock(reservation_ww_class_mutex);
[ 1318.039083][ T1082]   rlock(&mm->mmap_lock);
[ 1318.039328][ T1082]
[ 1318.039328][ T1082]                *** DEADLOCK ***
[ 1318.039328][ T1082]
[ 1318.040029][ T1082] 1 lock held by tar/1082:
[ 1318.040259][ T1082]  #0: 98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.040560][ T1082]
[ 1318.040560][ T1082] stack backtrace:
[ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
[ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[ 1318.041614][ T1082] Call Trace:
[ 1318.041895][ T1082]  <TASK>
[ 1318.042175][ T1082]  dump_stack_lvl+0x4a/0x80
[ 1318.042460][ T1082]  check_noncircular+0x145/0x160
[ 1318.042743][ T1082]  __lock_acquire+0x14bf/0x2680
[ 1318.043022][ T1082]  lock_acquire+0xcd/0x2c0
[ 1318.043301][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043580][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043856][ T1082]  __might_fault+0x58/0x80
[ 1318.044131][ T1082]  ? __might_fault+0x40/0x80
[ 1318.044408][ T1082]  amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
[ 1318.044749][ T1082]  full_proxy_read+0x55/0x80
[ 1318.045042][ T1082]  vfs_read+0xa7/0x360
[ 1318.045333][ T1082]  ksys_read+0x70/0xf0
[ 1318.045623][ T1082]  do_syscall_64+0x94/0x180
[ 1318.045913][ T1082]  ? do_syscall_64+0xa0/0x180
[