[AMD Official Use Only - General]
A dynamic partition switch could happen later. The switch could still
succeed at the hardware level, and hence give a false impression of success
even if there are no render nodes available for any app to make use of the
partition.
Also, a kfd node is not
On 8/8/2023 8:44 PM, Ruan Jinjie wrote:
Initializing the pointers to NULL before assigning them the result of
kzalloc() is unnecessary: if kzalloc() fails, the pointer is NULL anyway,
and otherwise it holds the allocation. So remove the redundant initialization.
Signed-off-by: Ruan Jinjie
---
On 2023-08-11 17:12, Chen, Xiaogang wrote:
I know the original jira ticket. The system got an RCU CPU stall, then
the kernel entered panic, then there was no response or ssh. This patch lets
the prange list update task yield the CPU after each range update. It can prevent
One check point: I saw they use a serial port for the console in the kernel
parameters: console=ttyS0,115200n8
*
Booting Linux using a console connection that is too slow to keep up
with the boot-time console-message rate. For example, a 115Kbaud
serial console can be *way* too slow to keep up
If you have a complete kernel log, it may be worth looking at backtraces
from other threads, to better understand the interactions. I'd expect
that there is a thread there that's in an RCU read critical section. It
may not be in our driver, though. If it's a customer system, it may also
help
On 2023-08-11 17:06, James Zhu wrote:
Return 0 when drm device alloc failed with -ENOSPC in
order to allow amdgpu driver loading. But the xcp without
a drm device node assigned won't be visible in user space.
This helps amdgpu driver loading on systems which have more
than 64 nodes, the current
On 8/11/2023 4:22 PM, Felix Kuehling wrote:
On 2023-08-11 17:12, Chen, Xiaogang wrote:
I know the original jira ticket. The system got an RCU CPU stall, then
the kernel entered panic, then there was no response or ssh. This patch lets
the prange list update task yield the CPU after each range update. It can
prevent the task holding the mm lock too long.
Calling
Hi Dave, Daniel,
New stuff for 6.6.
The following changes since commit d9aa1da9a8cfb0387eb5703c15bd1f54421460ac:
Merge tag 'drm-intel-gt-next-2023-08-04' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next (2023-08-07 13:49:25
+1000)
are available in the Git repository at:
I don't understand why this loop is causing a stall. These stall
warnings indicate that there is an RCU grace period that's not making
progress. That means there must be an RCU read critical section that's
being blocked. But there is no RCU-read critical section in
svm_range_set_attr function.
I know the original jira ticket. The system got an RCU CPU stall, then
the kernel entered panic, then there was no response or ssh. This patch lets
the prange list update task yield the CPU after each range update. It can
prevent the task holding the mm lock too long. The mm lock is a rw_semaphore,
not an RCU mechanism.
Can you
Return 0 when drm device alloc failed with -ENOSPC in
order to allow amdgpu driver loading. But the xcp without
a drm device node assigned won't be visible in user space.
This helps amdgpu driver loading on systems which have more
than 64 nodes, the current limitation.
The proposal to add more drm
On 2023-08-11 16:06, Felix Kuehling wrote:
On 2023-08-11 15:11, James Zhu wrote:
update_list could be big in list_for_each_entry(prange, _list,
update_list);
mmap_read_lock(mm) is held the whole time, so adding schedule() can
remove the
RCU stall on the CPU for this case.
RIP:
On 2023-08-11 16:23, James Zhu wrote:
Return 0 when drm device alloc failed with -ENOSPC in
order to allow amdgpu driver loading. But the xcp without
a drm device node assigned won't be visible in user space.
This helps amdgpu driver loading on systems which have more
than 64 nodes, the current
Hi,
On 8/11/23 11:55, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida
>
> ---
>
> Changes:
> - Due to substantial changes in the content, dropped Pekka's Acked-by
> - Grammar fixes
update_list could be big in list_for_each_entry(prange, _list,
update_list);
mmap_read_lock(mm) is held the whole time, so adding schedule() can remove
the RCU stall on the CPU for this case.
RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu]
Code: 00 00 00 bf 00 02 00 00 48 81 c2 90 00
Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.
Signed-off-by: André Almeida
---
Changes:
- Due to substantial changes in the content, dropped Pekka's Acked-by
- Grammar fixes (Randy)
- Add paragraph about disabling device resets
- Add
On 2023-08-11 09:26, Felix Kuehling wrote:
Am 2023-08-10 um 18:27 schrieb Eric Huang:
There is not UNMAP_QUEUES command sending for queue preemption
because the queue is suspended and test is closed to the end.
Function unmap_queue_cpsch will do nothing after that.
How do you suspend queues
Am 2023-08-09 um 15:09 schrieb Alex Deucher:
We need the domains in amdgpu_drm.h for the kernel driver to manage
the pool, but we don't want userspace using it until the code
is ready. So reject for now.
Signed-off-by: Alex Deucher
Acked-by: Felix Kuehling
---
On Fri, Aug 11, 2023 at 9:45 AM Jason Gunthorpe wrote:
>
> On Mon, Aug 07, 2023 at 06:05:41PM -0400, Alex Deucher wrote:
> > We are dropping the IOMMUv2 path, so no need to enable this.
> > It's often buggy on consumer platforms anyway.
> >
> > Signed-off-by: Alex Deucher
> > ---
> >
Ping?
On Thu, Aug 10, 2023 at 11:20 AM Alex Deucher wrote:
>
> Ping?
>
> On Wed, Aug 9, 2023 at 3:10 PM Alex Deucher wrote:
> >
> > We need the domains in amdgpu_drm.h for the kernel driver to manage
> > the pool, but we don't want userspace using it until the code
> > is ready. So reject for
On Mon, Aug 07, 2023 at 06:05:41PM -0400, Alex Deucher wrote:
> We are dropping the IOMMUv2 path, so no need to enable this.
> It's often buggy on consumer platforms anyway.
>
> Signed-off-by: Alex Deucher
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4
> 1 file changed, 4 deletions(-)
PCI core API pci_dev_id() can be used to get the BDF number for a pci
device. We don't need to compose it manually. Use pci_dev_id() to
simplify the code a little bit.
Signed-off-by: Zheng Zengkai
---
drivers/gpu/drm/radeon/radeon_acpi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[TLDR: This mail in primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]
On 19.07.23 18:17, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 17.07.23 15:09, Michel Dänzer wrote:
>> On 5/10/23 23:23, Alex Deucher wrote:
>>> From:
Series is:
Reviewed-by: Alex Deucher
On Thu, Aug 10, 2023 at 11:40 PM Mario Limonciello
wrote:
>
> Some ASICs only offer one type of power attribute, so in the visible
> callback check whether the attributes are supported and hide if not
> supported.
>
> Signed-off-by: Mario Limonciello
> ---
Am 2023-08-10 um 18:27 schrieb Eric Huang:
There is no UNMAP_QUEUES command sent for queue preemption because
the queue is suspended and the test is close to the end. Function
unmap_queue_cpsch will do nothing after that.
How do you suspend queues without sending an UNMAP_QUEUES command?
Am 2023-08-11 um 06:11 schrieb Mike Lothian:
On Thu, 3 Aug 2023 at 20:43, Felix Kuehling wrote:
>
> Is your kernel configured without dynamic debugging? Maybe we need to
> wrap this in some #if defined(CONFIG_DYNAMIC_DEBUG_CORE).
>
Apologies, I thought I'd replied to this, yes I didn't have dynamic
debugging enabled
On Thu, Aug 10, 2023 at 03:37:56PM +0800, Evan Quan wrote:
> AMD has introduced an ACPI based mechanism to support WBRF for some
> platforms with AMD dGPU + WLAN. This needs support from BIOS equipped
> with the necessary AML implementations and dGPU firmware.
>
> For those systems without the ACPI
On Thu, Aug 10, 2023 at 03:38:00PM +0800, Evan Quan wrote:
> With WBRF feature supported, as a driver responding to the frequencies,
> amdgpu driver is able to do shadow pstate switching to mitigate possible
> interference (between its (G-)DDR memory clocks and local radio module
> frequency bands
On 8/11/2023 12:36 PM, Chen, Guchun wrote:
[Public]
> -Original Message-
> From: amd-gfx On Behalf Of Lijo
> Lazar
> Sent: Friday, August 11, 2023 12:12 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Zhang, Hawking
>
> Subject: [PATCH] drm/amdgpu: Add memory vendor information
>
> For ASICs with GC v9.4.3,
[Public]
Reviewed-by: Guchun Chen
Regards,
Guchun
> -Original Message-
> From: amd-gfx On Behalf Of
> Srinivasan Shanmugam
> Sent: Wednesday, August 9, 2023 3:14 PM
> To: Koenig, Christian ; Deucher, Alexander
> ; Chen, Guchun ;
> Pan, Xinhui
> Cc: SHANMUGAM, SRINIVASAN ;
>
[Public]
> -Original Message-
> From: amd-gfx On Behalf Of Lijo
> Lazar
> Sent: Friday, August 11, 2023 1:18 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Ma, Le
> ; Kamal, Asad ; Zhang, Hawking
>
> Subject: [PATCH 2/4] drm/amdgpu: Add bootloader status check
>
> Add
Add an API which queues a work item to the reset domain and waits for it to finish.
Signed-off-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 18 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 4
2 files changed, 22 insertions(+)
diff --git
This patch is:
Reviewed-by: Yifan Zhang
Best Regards,
Yifan
-Original Message-
From: amd-gfx On Behalf Of Tim Huang
Sent: Friday, August 11, 2023 1:37 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Zhang, Yifan
; Zhang, Jesse(Jie) ;
If reset is already done as part of recovery, set flags to cancel all
pending work items in the reset domain. Also, drop unused functions.
Signed-off-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 6 --
2 files
Add a TDR queue for rings to handle job timeouts. The ring's scheduler will
use this queue for running job timeout handlers. The timeout handler will
then use the appropriate reset domain to handle recovery.
Signed-off-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
Move recovery handlers to schedule reset work. Make use of the workpool
in the reset domain and delete the individual work items.
Signed-off-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 -
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 32 +-
Presently, there are multiple clients of reset, like RAS, job timeout, KFD hang
detection, and the debug method. Instead of each client maintaining a work item,
the reset work pool is moved to the reset domain. When a client makes a recovery
request,
a work item is allocated by the reset domain and queued for
Add a work pool to the reset domain. The work pool will be used to schedule
any task in the reset domain. On successful reset of the domain,
indicated by a flag in the reset context, all queued work items will be
drained; their work handlers won't be executed.
Signed-off-by: Lijo Lazar
---