Re: [PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Lazar, Lijo
[AMD Official Use Only - General] A dynamic partition switch could happen later. The switch could still be successful in terms of hardware, and hence gives a false feeling of success even if there are no render nodes available for any app to make use of the partition. Also, a kfd node is not

Re: [PATCH -next 3/7] drm/msm: Remove unnecessary NULL values

2023-08-11 Thread Abhinav Kumar
On 8/8/2023 8:44 PM, Ruan Jinjie wrote: The NULL initialization of pointers that are first assigned by kzalloc() is not necessary: if kzalloc() fails, the pointers are assigned NULL; otherwise it works as usual. So remove it. Signed-off-by: Ruan Jinjie ---
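The point made in this commit message can be illustrated with a small userspace sketch. This is not code from the patch: calloc() stands in for kzalloc() (both return zeroed memory, or NULL on failure), and struct item and item_create() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	int id;
};

/*
 * Hypothetical helper, not taken from the patch. calloc() is used as a
 * userspace stand-in for kzalloc().
 */
struct item *item_create(int id)
{
	struct item *it;	/* no "= NULL": assigned before first use */

	it = calloc(1, sizeof(*it));
	if (!it)
		return NULL;	/* the allocator already produced NULL */

	assert(it->id == 0);	/* memory comes back zeroed */
	it->id = id;
	return it;
}
```

Because the pointer is written by the allocation before anything reads it, an initializer would be overwritten immediately either way.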

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
On 2023-08-11 17:27, Chen, Xiaogang wrote: On 8/11/2023 4:22 PM, Felix Kuehling wrote: On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system hit an RCU CPU stall, then the kernel panicked, after which it stopped responding, including over ssh. This patch lets the prange list update task yield

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
-Remove others, continue discussing internally On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system hit an RCU CPU stall, then the kernel panicked, after which it stopped responding, including over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
One checkpoint: I saw that they use a serial port for the console, via the kernel parameter console=ttyS0,115200n8. * Booting Linux using a console connection that is too slow to keep up with the boot-time console-message rate. For example, a 115Kbaud serial console can be way too slow to keep up

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
If you have a complete kernel log, it may be worth looking at backtraces from other threads, to better understand the interactions. I'd expect that there is a thread there that's in an RCU read critical section. It may not be in our driver, though. If it's a customer system, it may also help

Re: [PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Felix Kuehling
On 2023-08-11 17:06, James Zhu wrote: Return 0 when drm device allocation fails with -ENOSPC so that the amdgpu driver can still load. An xcp without a drm device node assigned won't be visible in user space. This helps the amdgpu driver load on systems that have more than 64 nodes, the current

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
On 8/11/2023 4:22 PM, Felix Kuehling wrote: On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system hit an RCU CPU stall, then the kernel panicked, after which it stopped responding, including over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system hit an RCU CPU stall, then the kernel panicked, after which it stopped responding, including over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent the task from holding the mm lock for too long. Calling

[pull] amdgpu, amdkfd, radeon, drm_buddy drm-next-6.6

2023-08-11 Thread Alex Deucher
Hi Dave, Daniel, New stuff for 6.6. The following changes since commit d9aa1da9a8cfb0387eb5703c15bd1f54421460ac: Merge tag 'drm-intel-gt-next-2023-08-04' of git://anongit.freedesktop.org/drm/drm-intel into drm-next (2023-08-07 13:49:25 +1000) are available in the Git repository at:

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
I don't understand why this loop is causing a stall. These stall warnings indicate that there is an RCU grace period that's not making progress. That means there must be an RCU read critical section that's being blocked. But there is no RCU-read critical section in svm_range_set_attr function.

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
I know the original Jira ticket. The system hit an RCU CPU stall, then the kernel panicked, after which it stopped responding, including over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent the task from holding the mm lock for too long. The mm lock is an rw_semaphore, not an RCU mechanism. Can you

[PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread James Zhu
Return 0 when drm device allocation fails with -ENOSPC so that the amdgpu driver can still load. An xcp without a drm device node assigned won't be visible in user space. This helps the amdgpu driver load on systems that have more than 64 nodes, the current limitation. The proposal to add more drm

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
On 2023-08-11 16:06, Felix Kuehling wrote: On 2023-08-11 15:11, James Zhu wrote: update_list could be big in list_for_each_entry(prange, _list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on CPU in this case. RIP:

Re: [PATCH v2] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Felix Kuehling
On 2023-08-11 16:23, James Zhu wrote: Return 0 when drm device allocation fails with -ENOSPC so that the amdgpu driver can still load. An xcp without a drm device node assigned won't be visible in user space. This helps the amdgpu driver load on systems that have more than 64 nodes, the current

[PATCH v2] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread James Zhu
Return 0 when drm device allocation fails with -ENOSPC so that the amdgpu driver can still load. An xcp without a drm device node assigned won't be visible in user space. This helps the amdgpu driver load on systems that have more than 64 nodes, the current limitation. The proposal to add more drm

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
On 2023-08-11 15:11, James Zhu wrote: update_list could be big in list_for_each_entry(prange, _list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on CPU in this case. RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu]

Re: [PATCH v6] drm/doc: Document DRM device reset expectations

2023-08-11 Thread Randy Dunlap
Hi, On 8/11/23 11:55, André Almeida wrote: > Create a section that specifies how to deal with DRM device resets for > kernel and userspace drivers. > > Signed-off-by: André Almeida > > --- > > Changes: > - Due to substantial changes in the content, dropped Pekka's Acked-by > - Grammar fixes

[PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
update_list could be big in list_for_each_entry(prange, _list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on CPU in this case. RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu] Code: 00 00 00 bf 00 02 00 00 48 81 c2 90 00

[PATCH v6] drm/doc: Document DRM device reset expectations

2023-08-11 Thread André Almeida
Create a section that specifies how to deal with DRM device resets for kernel and userspace drivers. Signed-off-by: André Almeida --- Changes: - Due to substantial changes in the content, dropped Pekka's Acked-by - Grammar fixes (Randy) - Add paragraph about disabling device resets - Add

Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

2023-08-11 Thread Eric Huang
On 2023-08-11 09:26, Felix Kuehling wrote: On 2023-08-10 at 18:27, Eric Huang wrote: There is no UNMAP_QUEUES command sent for queue preemption because the queue is suspended and the test is close to the end. Function unmap_queue_cpsch will do nothing after that. How do you suspend queues

Re: [PATCH] drm/amdgpu: don't allow userspace to create a doorbell BO

2023-08-11 Thread Felix Kuehling
On 2023-08-09 at 15:09, Alex Deucher wrote: We need the domains in amdgpu_drm.h for the kernel driver to manage the pool, but we don't want userspace using it until the code is ready. So reject for now. Signed-off-by: Alex Deucher Acked-by: Felix Kuehling ---

Re: [PATCH V2 1/5] drm/amdkfd: ignore crat by default

2023-08-11 Thread Alex Deucher
On Fri, Aug 11, 2023 at 9:45 AM Jason Gunthorpe wrote: > > On Mon, Aug 07, 2023 at 06:05:41PM -0400, Alex Deucher wrote: > > We are dropping the IOMMUv2 path, so no need to enable this. > > It's often buggy on consumer platforms anyway. > > > > Signed-off-by: Alex Deucher > > --- > >

Re: [PATCH] drm/amdgpu: don't allow userspace to create a doorbell BO

2023-08-11 Thread Alex Deucher
Ping? On Thu, Aug 10, 2023 at 11:20 AM Alex Deucher wrote: > > Ping? > > On Wed, Aug 9, 2023 at 3:10 PM Alex Deucher wrote: > > > > We need the domains in amdgpu_drm.h for the kernel driver to manage > > the pool, but we don't want userspace using it until the code > > is ready. So reject for

Re: [PATCH V2 1/5] drm/amdkfd: ignore crat by default

2023-08-11 Thread Jason Gunthorpe
On Mon, Aug 07, 2023 at 06:05:41PM -0400, Alex Deucher wrote: > We are dropping the IOMMUv2 path, so no need to enable this. > It's often buggy on consumer platforms anyway. > > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 > 1 file changed, 4 deletions(-)

[PATCH] drm/radeon: Use pci_dev_id() to simplify the code

2023-08-11 Thread Zheng Zengkai
PCI core API pci_dev_id() can be used to get the BDF number for a PCI device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Zheng Zengkai --- drivers/gpu/drm/radeon/radeon_acpi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
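For readers unfamiliar with what "composing the BDF" means, here is a minimal userspace sketch of the arithmetic, assuming the usual PCI encoding (8-bit bus number in the high byte, then a 5-bit device slot and a 3-bit function). The names model_devfn() and model_dev_id() are hypothetical stand-ins for illustration, not kernel functions, and this is not the code touched by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: pack a 5-bit slot and a 3-bit function into devfn. */
static inline uint8_t model_devfn(uint8_t slot, uint8_t func)
{
	assert(slot < 32 && func < 8);	/* 5-bit slot, 3-bit function */
	return (uint8_t)((slot << 3) | func);
}

/* Hypothetical model: 16-bit BDF with the bus number in the high byte. */
static inline uint16_t model_dev_id(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)(((uint16_t)bus << 8) | devfn);
}
```

A helper such as pci_dev_id() keeps this shift-and-or in one place, so callers do not have to repeat it by hand.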

Re: [PATCH 28/29] drm/amdkfd: Refactor migrate init to support partition switch

2023-08-11 Thread Linux regression tracking #update (Thorsten Leemhuis)
[TLDR: This mail is primarily relevant for Linux kernel regression tracking. See link in footer if these mails annoy you.] On 19.07.23 18:17, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > On 17.07.23 15:09, Michel Dänzer wrote: >> On 5/10/23 23:23, Alex Deucher wrote: >>> From:

Re: [PATCH 9/9] drm/amd: Hide unsupported power attributes

2023-08-11 Thread Alex Deucher
Series is: Reviewed-by: Alex Deucher On Thu, Aug 10, 2023 at 11:40 PM Mario Limonciello wrote: > > Some ASICs only offer one type of power attribute, so in the visible > callback check whether the attributes are supported and hide if not > supported. > > Signed-off-by: Mario Limonciello > ---

Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

2023-08-11 Thread Felix Kuehling
On 2023-08-10 at 18:27, Eric Huang wrote: There is no UNMAP_QUEUES command sent for queue preemption because the queue is suspended and the test is close to the end. Function unmap_queue_cpsch will do nothing after that. How do you suspend queues without sending an UNMAP_QUEUES command?

Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-08-11 Thread Felix Kuehling
On 2023-08-11 at 06:11, Mike Lothian wrote: On Thu, 3 Aug 2023 at 20:43, Felix Kuehling wrote: Is your kernel configured without dynamic debugging? Maybe we need to wrap this in some #if defined(CONFIG_DYNAMIC_DEBUG_CORE). Apologies, I thought I'd replied to this, yes I didn't have dynamic

Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-08-11 Thread Mike Lothian
On Thu, 3 Aug 2023 at 20:43, Felix Kuehling wrote: > > Is your kernel configured without dynamic debugging? Maybe we need to > wrap this in some #if defined(CONFIG_DYNAMIC_DEBUG_CORE). > Apologies, I thought I'd replied to this, yes I didn't have dynamic debugging enabled

Re: [PATCH V8 2/9] drivers core: add ACPI based WBRF mechanism introduced by AMD

2023-08-11 Thread Simon Horman
On Thu, Aug 10, 2023 at 03:37:56PM +0800, Evan Quan wrote: > AMD has introduced an ACPI based mechanism to support WBRF for some > platforms with AMD dGPU + WLAN. This needs support from BIOS equipped > with necessary AML implementations and dGPU firmwares. > > For those systems without the ACPI

Re: [PATCH V8 6/9] drm/amd/pm: setup the framework to support Wifi RFI mitigation feature

2023-08-11 Thread Simon Horman
On Thu, Aug 10, 2023 at 03:38:00PM +0800, Evan Quan wrote: > With WBRF feature supported, as a driver responding to the frequencies, > amdgpu driver is able to do shadow pstate switching to mitigate possible > interference (between its (G-)DDR memory clocks and local radio module > frequency bands

Re: [PATCH] drm/amdgpu: Add memory vendor information

2023-08-11 Thread Lazar, Lijo
On 8/11/2023 12:36 PM, Chen, Guchun wrote: [Public] -Original Message- From: amd-gfx On Behalf Of Lijo Lazar Sent: Friday, August 11, 2023 12:12 PM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Zhang, Hawking Subject: [PATCH] drm/amdgpu: Add memory vendor information

RE: [PATCH] drm/amdgpu: Add memory vendor information

2023-08-11 Thread Chen, Guchun
[Public] > -Original Message- > From: amd-gfx On Behalf Of Lijo > Lazar > Sent: Friday, August 11, 2023 12:12 PM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Zhang, Hawking > > Subject: [PATCH] drm/amdgpu: Add memory vendor information > > For ASICs with GC v9.4.3,

RE: [PATCH] drm/radeon: Cleanup radeon/radeon_fence.c

2023-08-11 Thread Chen, Guchun
[Public] Reviewed-by: Guchun Chen Regards, Guchun > -Original Message- > From: amd-gfx On Behalf Of > Srinivasan Shanmugam > Sent: Wednesday, August 9, 2023 3:14 PM > To: Koenig, Christian ; Deucher, Alexander > ; Chen, Guchun ; > Pan, Xinhui > Cc: SHANMUGAM, SRINIVASAN ; >

RE: [PATCH 2/4] drm/amdgpu: Add bootloader status check

2023-08-11 Thread Chen, Guchun
[Public] > -Original Message- > From: amd-gfx On Behalf Of Lijo > Lazar > Sent: Friday, August 11, 2023 1:18 PM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Ma, Le > ; Kamal, Asad ; Zhang, Hawking > > Subject: [PATCH 2/4] drm/amdgpu: Add bootloader status check > > Add

[PATCH 4/5] drm/amdgpu: Add API to queue and do reset work

2023-08-11 Thread Lijo Lazar
Add API which queues a work to reset domain and waits for it to finish. Signed-off-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 18 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 4 2 files changed, 22 insertions(+) diff --git

RE: [PATCH] drm/amd/pm: bump SMU v13.0.5 driver_if header version

2023-08-11 Thread Zhang, Yifan
[AMD Official Use Only - General] This patch is : Reviewed-by: Yifan Zhang Best Regards, Yifan -Original Message- From: amd-gfx On Behalf Of Tim Huang Sent: Friday, August 11, 2023 1:37 PM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Zhang, Yifan ; Zhang, Jesse(Jie) ;

[PATCH 3/5] drm/amdgpu: Set flags to cancel all pending resets

2023-08-11 Thread Lijo Lazar
If reset is already done as part of recovery, set flags to cancel all pending work items in the reset domain. Also, drop unused functions. Signed-off-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 6 -- 2 files

[PATCH 5/5] drm/amdgpu: Add TDR queue for ring

2023-08-11 Thread Lijo Lazar
Add a TDR queue for rings to handle job timeouts. The ring's scheduler will use this queue to run job timeout handlers. The timeout handler will then use the appropriate reset domain to handle recovery. Signed-off-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-

[PATCH 2/5] drm/amdgpu: Move to reset_schedule_work

2023-08-11 Thread Lijo Lazar
Move recovery handlers to schedule reset work. Make use of the workpool in the reset domain and delete the individual work items. Signed-off-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 - drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 32 +-

[PATCH 0/5] Add work pool to reset domain

2023-08-11 Thread Lijo Lazar
Presently, there are multiple clients of reset like RAS, job timeout, KFD hang detection and debug method. Instead of each client maintaining a work item, reset work pool is moved to reset domain. When a client makes a recovery request, a work item is allocated by the reset domain and queued for

[PATCH 1/5] drm/amdgpu: Add work pool to reset domain

2023-08-11 Thread Lijo Lazar
Add a work pool to the reset domain. The work pool will be used to schedule any task in the reset domain. On a successful reset of the domain, as indicated by a flag in the reset context, all queued work will be drained. Their work handlers won't be executed. Signed-off-by: Lijo Lazar ---