Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-04 Thread Timur Kristóf
Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: > That's the worst-case scenario where you're debugging HW or FW > issues. > Those should be pretty rare post-bringup. But are there hangs caused > by > user mode driver or application bugs that are easier to debug and >

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-04 Thread Christian König
On 03.05.23 at 21:14, André Almeida wrote: On 03/05/2023 14:43, Timur Kristóf wrote: Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: That's the worst-case scenario where you're debugging HW or FW issues. Those should be pretty rare post-bringup. But are there hangs

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
On Wed, May 3, 2023, 14:53 André Almeida wrote: > On 03/05/2023 14:08, Marek Olšák wrote: > > GPU hangs are pretty common post-bringup. They are not common per user, > > but if we gather all hangs from all users, we can have lots and lots of > > them. > > > > GPU hangs are indeed not very

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida
On 03/05/2023 14:43, Timur Kristóf wrote: Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: That's the worst-case scenario where you're debugging HW or FW issues. Those should be pretty rare post-bringup. But are there hangs caused by user mode driver or application bugs

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida
On 03/05/2023 14:08, Marek Olšák wrote: GPU hangs are pretty common post-bringup. They are not common per user, but if we gather all hangs from all users, we can have lots and lots of them. GPU hangs are indeed not very debuggable. There are however some things we can do: - Identify the

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
WRITE_DATA with ENGINE=PFP will execute the packet on the frontend engine, while ENGINE=ME will execute the packet on the backend engine. Marek On Wed, May 3, 2023 at 1:08 PM Marek Olšák wrote: > GPU hangs are pretty common post-bringup. They are not common per user, > but if we gather all

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
GPU hangs are pretty common post-bringup. They are not common per user, but if we gather all hangs from all users, we can have lots and lots of them. GPU hangs are indeed not very debuggable. There are however some things we can do: - Identify the hanging IB by its VA (the kernel should know it)

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 03.05.23 at 17:08, Felix Kuehling wrote: On 2023-05-03 at 03:59, Christian König wrote: On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Felix Kuehling
On 2023-05-03 at 03:59, Christian König wrote: On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspace the necessary information about a GPU hang

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspace the necessary information about a GPU hang before a GPU reset. The fundamental problem

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
Hi, On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > Christian König wrote (on Tue, > > May 2, 2023, 9:59): > >   > > > On 02.05.23 at 03:26, André Almeida wrote: > > >  > On 01/05/2023 16:24, Alex Deucher wrote: > > >  >> On Mon, May 1, 2023 at 2:58 PM André

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote: > On Tue, May 2, 2023 at 9:35 AM Timur Kristóf > wrote: > > > > Hi, > > > > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > > > > > Christian König wrote (on > > > > Tue, > > > > May 2, 2023, 9:59): > > > > > >

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
Hi Christian, Christian König wrote (on Tue, May 2, 2023, 9:59): > On 02.05.23 at 03:26, André Almeida wrote: > > On 01/05/2023 16:24, Alex Deucher wrote: > >> On Mon, May 1, 2023 at 2:58 PM André Almeida > >> wrote: > >>> > >>> I know that devcoredump is also used for this kind

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Alex Deucher
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: > > On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote: > > On Tue, May 2, 2023 at 9:35 AM Timur Kristóf > > wrote: > > > > > > Hi, > > > > > > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > > > > > > > Christian König

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Alex Deucher
On Tue, May 2, 2023 at 9:35 AM Timur Kristóf wrote: > > Hi, > > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > > > Christian König wrote (on Tue, > > > May 2, 2023, 9:59): > > > > > > > On 02.05.23 at 03:26, André Almeida wrote: > > > > > On 01/05/2023 16:24, Alex

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Christian König
Hi Timur, On 02.05.23 at 11:12, Timur Kristóf wrote: Hi Christian, Christian König wrote (on Tue, May 2, 2023, 9:59): On 02.05.23 at 03:26, André Almeida wrote: > On 01/05/2023 16:24, Alex Deucher wrote: >> On Mon, May 1, 2023 at 2:58 PM André Almeida >>

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Bas Nieuwenhuizen
On Tue, May 2, 2023 at 11:12 AM Timur Kristóf wrote: > > Hi Christian, > > Christian König wrote (on Tue, May 2, 2023, > 9:59): >> >> On 02.05.23 at 03:26, André Almeida wrote: >> > On 01/05/2023 16:24, Alex Deucher wrote: >> >> On Mon, May 1, 2023 at 2:58 PM André Almeida >> >>

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Christian König
On 02.05.23 at 03:26, André Almeida wrote: On 01/05/2023 16:24, Alex Deucher wrote: On Mon, May 1, 2023 at 2:58 PM André Almeida wrote: I know that devcoredump is also used for this kind of information, but I believe that using an IOCTL is better for interfacing Mesa + Linux rather

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Christian König
Well first of all don't expose the VMID to userspace. The UMD doesn't know (and shouldn't know) which VMID is used for a submission since this is dynamically assigned and can change at any time. For debugging there is an interface to use a reserved VMID for your debugged process which

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-01 Thread André Almeida
On 01/05/2023 16:24, Alex Deucher wrote: On Mon, May 1, 2023 at 2:58 PM André Almeida wrote: I know that devcoredump is also used for this kind of information, but I believe that using an IOCTL is better for interfacing Mesa + Linux rather than parsing a file that its contents are

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-01 Thread Alex Deucher
On Mon, May 1, 2023 at 2:58 PM André Almeida wrote: > > Currently UMD hasn't much information on what went wrong during a GPU reset. > To > help with that, this patch proposes a new IOCTL that can be used to query > information about the resources that caused the hang. If we went with the

[RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-01 Thread André Almeida
Currently UMD hasn't much information on what went wrong during a GPU reset. To help with that, this patch proposes a new IOCTL that can be used to query information about the resources that caused the hang. The goal of this RFC is to gather feedback about this interface. The mesa part can be