Will be read by executors of async reset like debugfs.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 --
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
3 files changed, 6 insertions(+), 2 deletions
Save the extra useless work schedule. Also switch to delayed work.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 12 +++-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd
We need to be able to do a non-blocking cancel of pending reset works
from within GPU reset. Currently the kernel API allows this only
for delayed_work and not for work_struct. Switch to delayed
work and queue it with delay 0, which is equivalent to queuing a
work struct.
Signed-off-by: Andrey Grodzovsky
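The API contrast the message above relies on can be sketched roughly like this. This is a minimal sketch against the generic kernel workqueue API, not the actual amdgpu patch; `my_reset_source` and the helper names are made up for illustration:

```c
#include <linux/workqueue.h>

/* Hypothetical reset source: held a plain work_struct before the switch. */
struct my_reset_source {
	struct delayed_work reset_work;
};

static void my_reset_handler(struct work_struct *work)
{
	struct my_reset_source *s =
		container_of(to_delayed_work(work), struct my_reset_source, reset_work);
	/* ... perform the actual reset for this source ... */
}

static void my_reset_source_init(struct my_reset_source *s)
{
	INIT_DELAYED_WORK(&s->reset_work, my_reset_handler);
}

static void my_queue_reset(struct workqueue_struct *wq, struct my_reset_source *s)
{
	/* Delay 0 behaves like queue_work() on a plain work_struct. */
	queue_delayed_work(wq, &s->reset_work, 0);
}

static void my_cancel_pending_reset(struct my_reset_source *s)
{
	/* Non-blocking cancel, so it is safe from within GPU reset itself.
	 * A plain work_struct only offers the blocking cancel_work_sync(). */
	cancel_delayed_work(&s->reset_work);
}
```

The point of the switch is the last helper: `cancel_delayed_work()` returns immediately, while a `work_struct` would force the blocking `cancel_work_sync()`.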
as was in v1[1] to explicit
stopping of each reset request from each reset source
per each request submitter.
[1] -
https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzov...@amd.com/
Andrey Grodzovsky (7):
drm/amdgpu: Cache result of last reset at reset domain level.
drm/amdgpu
On 2022-05-16 11:08, Christian König wrote:
Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky:
Ping
Ah, yes sorry.
Andrey
On 2022-05-13 11:41, Andrey Grodzovsky wrote:
Yes, exactly that's the idea.
Basically the reset domain knows which amdgpu devices it needs to
reset together
Ping
Andrey
On 2022-05-13 11:41, Andrey Grodzovsky wrote:
Yes, exactly that's the idea.
Basically the reset domain knows which amdgpu devices it needs to
reset together.
If you then represent that so that you always have a hive even when
you only have one device in it, or if you put
On 2022-05-12 09:15, Christian König wrote:
Am 12.05.22 um 15:07 schrieb Andrey Grodzovsky:
On 2022-05-12 02:06, Christian König wrote:
Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky:
On 2022-05-11 11:39, Christian König wrote:
Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:
On 2022-05
Sure, I will investigate that. What about the ticket which Lijo raised
which was basically doing 8 resets instead of one? Lijo - can this
ticket wait until I come up with this new design for amdgpu reset
function or do you need a quick solution now, in which case we can use the
already existing
On 2022-05-12 02:06, Christian König wrote:
Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky:
On 2022-05-11 11:39, Christian König wrote:
Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:
On 2022-05-11 11:20, Lazar, Lijo wrote:
On 5/11/2022 7:28 PM, Christian König wrote:
Am 11.05.22 um 15
On 2022-05-12 02:03, Christian König wrote:
Am 11.05.22 um 17:57 schrieb Andrey Grodzovsky:
[SNIP]
How about we do it like this then:
struct amdgpu_reset_domain {
union {
struct {
struct work_item debugfs;
struct work_item ras
On 2022-05-11 11:39, Christian König wrote:
Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:
On 2022-05-11 11:20, Lazar, Lijo wrote:
On 5/11/2022 7:28 PM, Christian König wrote:
Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:
On 2022-05-11 03:38, Christian König wrote:
Am 10.05.22 um 20
On 2022-05-11 12:49, Felix Kuehling wrote:
Am 2022-05-11 um 09:49 schrieb Andrey Grodzovsky:
[snip]
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index f1a225a20719..4b789bec9670 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b
On 2022-05-11 11:39, Christian König wrote:
Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:
On 2022-05-11 11:20, Lazar, Lijo wrote:
On 5/11/2022 7:28 PM, Christian König wrote:
Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:
On 2022-05-11 03:38, Christian König wrote:
Am 10.05.22 um 20
On 2022-05-11 11:46, Lazar, Lijo wrote:
On 5/11/2022 9:13 PM, Andrey Grodzovsky wrote:
On 2022-05-11 11:37, Lazar, Lijo wrote:
On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote:
On 2022-05-11 11:20, Lazar, Lijo wrote:
On 5/11/2022 7:28 PM, Christian König wrote:
Am 11.05.22 um 15:43
On 2022-05-11 11:37, Lazar, Lijo wrote:
On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote:
On 2022-05-11 11:20, Lazar, Lijo wrote:
On 5/11/2022 7:28 PM, Christian König wrote:
Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:
On 2022-05-11 03:38, Christian König wrote:
Am 10.05.22 um 20:53
On 2022-05-11 11:20, Lazar, Lijo wrote:
On 5/11/2022 7:28 PM, Christian König wrote:
Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:
On 2022-05-11 03:38, Christian König wrote:
Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:
[SNIP]
E.g. in the reset code (either before or after the reset
On 2022-05-11 03:38, Christian König wrote:
Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:
On 2022-05-10 13:19, Christian König wrote:
Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky:
On 2022-05-10 12:17, Christian König wrote:
Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:
[SNIP
On 2022-05-10 13:19, Christian König wrote:
Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky:
On 2022-05-10 12:17, Christian König wrote:
Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:
[SNIP]
That's one of the reasons why we should have multiple work items
for job based reset and other
On 2022-05-09 14:03, Deucher, Alexander wrote:
[Public]
-Original Message-
From: Bjorn Helgaas
Sent: Monday, May 9, 2022 12:23 PM
To: Linux PCI
Cc: r087...@yahoo.it; Deucher, Alexander
; Koenig, Christian
; Pan, Xinhui ; amd-gfx
mailing list ; dri-devel
Subject: Re: [Bug 215958]
On 2022-05-10 12:17, Christian König wrote:
Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:
[SNIP]
That's one of the reasons why we should have multiple work items for
job based reset and other reset sources.
See the whole idea is the following:
1. We have one single queued work queue
On 2022-05-06 04:56, Christian König wrote:
Am 06.05.22 um 08:02 schrieb Lazar, Lijo:
On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote:
On 2022-05-05 15:49, Felix Kuehling wrote:
Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:
On 2022-05-05 11:06, Christian König wrote:
Am 05.05.22 um 15
Acked-by: Andrey Grodzovsky
Andrey
On 2022-05-10 10:58, Alex Deucher wrote:
Use kvmalloc and kvfree.
Fixes: 31aad22e2b3c ("drm/amdgpu/psp: Add vbflash sysfs interface support")
Reported-by: kernel test robot
Signed-off-by: Alex Deucher
---
drivers/gpu/drm/amd/amdgpu/amdgpu
On 2022-05-05 15:49, Felix Kuehling wrote:
Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:
On 2022-05-05 11:06, Christian König wrote:
Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:
On 2022-05-05 09:23, Christian König wrote:
Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:
On 2022
On 2022-05-05 11:06, Christian König wrote:
Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:
On 2022-05-05 09:23, Christian König wrote:
Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:
On 2022-05-05 06:09, Christian König wrote:
Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:
Problem
On 2022-05-05 09:23, Christian König wrote:
Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:
On 2022-05-05 06:09, Christian König wrote:
Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:
Problem:
During hive reset caused by a command timing out on a ring
extra resets are triggered by
On 2022-05-05 06:09, Christian König wrote:
Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:
Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.
Fix: Rework GPU reset
reset domain will cancel all those pending redundant resets.
This is in line with what we already do for redundant TDRs
in scheduler code.
Signed-off-by: Andrey Grodzovsky
Tested-by: Bai Zoy
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 11 +---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17
ng recursive fault but reboot is needed!
On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
wrote:
I retested hot plug tests at the commit I mentioned below - looks
ok, my ASIC is Navi 10, I also tested using Vega 10 and older Polaris
ASICs (whatever I had at home at the time). I
3 3 0 1
asserts 21 21 21 0 n/a
Elapsed time = 9.195 seconds
Andrey
On 2022-04-20 11:44, Andrey Grodzovsky wrote:
The only one in Radeon 7 I see is the same sysfs crash we already
fixed so you can use the same fix. The MI 200 issue i
The only one in Radeon 7 I see is the same sysfs crash we already fixed
so you can use the same fix. The MI 200 issue i haven't seen yet but I
also haven't tested MI200 so never saw it before. Need to test when i
get the time.
So try that fix with Radeon 7 again to see if you pass the tests
My bad, I see u already fixed this in amd-staging-drm-next. We had an
issue in an internal branch with this and just reinvented the wheel :))
Andrey
On 2022-04-14 10:32, Andrey Grodzovsky wrote:
Yea, i need to improve it a bit, ignore this one, will be back with V2.
Andrey
On 2022-04-14 03
handler
Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:
Lock reset domain unconditionally because on resume we unlock it
unconditionally.
This solves a mutex deadlock when handling both FATAL and non-FATAL PCI
errors one after another.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd
On 2022-04-14 02:40, Christian König wrote:
Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:
Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solves a mutex deadlock when handling both FATAL
and non-FATAL PCI errors one after another.
Signed-off
Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solves a mutex deadlock when handling both FATAL
and non-FATAL PCI errors one after another.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
1 file changed
On 2022-04-13 12:03, Shuotao Xu wrote:
On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky
wrote:
[Some people who received this message don't often get email
from andrey.grodzov...@amd.com. Learn why this is important
at http://aka.ms/LearnAboutSenderIdentification.]
On 2022-04-08 21:28
On 2022-04-08 21:28, Shuotao Xu wrote:
On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky
wrote:
[Some people who received this message don't often get email from
andrey.grodzov...@amd.com. Learn why this is important at
http://aka.ms/LearnAboutSenderIdentification.]
On 2022-04-08 04:45
On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data centers
in the next few years, such as GPU resource
disaggregation. Current AMDKFD does not support hotplug out b/o
I suggest adding another patch to handle unbalanced decrement of
kfd_lock in kgd2kfd_suspend. This patch alone is not enough to fix
all removal issues.
Andrey
On 2022-04-07 12:15, Mukul Joshi wrote:
Currently, the IO-links to the device being removed from topology,
are not cleared. As a
48 89 e5 48 89 7d f8 48 8b 45 f8
Best regards,
Shuotao
*From: *Andrey Grodzovsky
*Date: *Wednesday, April 6, 2022 at 10:36 PM
*To: *Shuotao Xu ,
amd-gfx@lists.freedesktop.org
*Cc: *Ziyue Yang , Lei Qu
, Peng Cheng , Ran Shu
*Subject: *Re: [EXTERNAL] Re: Code Review Request for AMDGPU
patch in this email, in case that
you would want to delete that later email.
Best regards,
Shuotao
*From: *Andrey Grodzovsky
*Date: *Wednesday, April 6, 2022 at 10:13 PM
*To: *Shuotao Xu ,
amd-gfx@lists.freedesktop.org
*Cc: *Ziyue Yang , Lei Qu
, Peng Cheng , Ran Shu
*Subject: *[EXTERN
Looks like you are using the 5.13 kernel for this work. FYI, we added
hot plug support for the graphics stack in the 5.14 kernel (see
https://www.phoronix.com/scan.php?page=news_item=Linux-5.14-AMDGPU-Hot-Unplug)
I am not sure about the code part since it all touches KFD driver (KFD
team can comment
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-03-11 10:15, Philip Yang wrote:
amdgpu_detect_virtualization reads a register; amdgpu_device_rreg accesses
adev->reset_domain->sem if the kernel defines CONFIG_LOCKDEP. Below is a
random boot hang backtrace on Vega10. It may get a random NULL p
On 2022-03-10 11:21, Sharma, Shashank wrote:
On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König
wrote:
Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
wrote:
From: Shashank Sharma
This patch adds a new
On 2022-03-10 05:06, Christian König wrote:
Am 10.03.22 um 07:11 schrieb Victor Zhao:
enable sdma v5_2 soft reset
Signed-off-by: Victor Zhao
---
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 79 +-
1 file changed, 78 insertions(+), 1 deletion(-)
diff --git
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-03-10 04:07, Somalapuram Amaranath wrote:
Schedule work function with valid PID, process name and
vram lost status during a GPU reset/recovery.
Signed-off-by: Somalapuram Amaranath
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13
On 2022-03-10 00:17, Lazar, Lijo wrote:
On 3/10/2022 2:33 AM, Andrey Grodzovsky wrote:
It will be used during GPU reset.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 10 +++
drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 3 +++
drivers/gpu
It will be used during GPU reset.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 10 +++
drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 3 +++
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++
drivers/gpu/drm/amd/pm/swsmu/inc
This should provide more debug info for the driver.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +
1 file changed, 9 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index
On 2022-03-08 12:20, Somalapuram, Amaranath wrote:
On 3/8/2022 10:00 PM, Sharma, Shashank wrote:
Hello Andrey
On 3/8/2022 5:26 PM, Andrey Grodzovsky wrote:
On 2022-03-07 11:26, Shashank Sharma wrote:
From: Shashank Sharma
This patch adds a work function, which will get scheduled
On 2022-03-08 12:04, Somalapuram, Amaranath wrote:
On 3/8/2022 10:27 PM, Sharma, Shashank wrote:
On 3/8/2022 5:55 PM, Andrey Grodzovsky wrote:
You can read on their side here -
https://www.phoronix.com/scan.php?page=news_item=AMD-STB-Linux-5.17
and see their patch. THey don't have
on behalf of
Andrey Grodzovsky
*Sent:* Tuesday, March 8, 2022 9:55:03 PM
*To:* Shashank Sharma ;
amd-gfx@lists.freedesktop.org
*Cc:* Deucher, Alexander ; Somalapuram,
Amaranath ; Koenig, Christian
; Sharma, Shashank
*Subject:* Re: [PATCH 1/2] drm: Add GPU reset sysfs event
On 2022-03-07 11
of
Andrey Grodzovsky
*Sent:* Tuesday, March 8, 2022 9:55:03 PM
*To:* Shashank Sharma ;
amd-gfx@lists.freedesktop.org
*Cc:* Deucher, Alexander ; Somalapuram,
Amaranath ; Koenig, Christian
; Sharma, Shashank
*Subject:* Re: [PATCH 1/2] drm: Add GPU reset sysfs event
On 2022-03-07 11:26
On 2022-03-08 11:35, Sharma, Shashank wrote:
On 3/8/2022 5:25 PM, Andrey Grodzovsky wrote:
On 2022-03-07 11:26, Shashank Sharma wrote:
From: Shashank Sharma
This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like
On 2022-03-07 11:26, Shashank Sharma wrote:
From: Shashank Sharma
This patch adds a work function, which will get scheduled
in event of a GPU reset, and will send a uevent to user with
some reset context information, like a PID and some flags.
Where is the actual scheduling of the work
On 2022-03-07 11:26, Shashank Sharma wrote:
From: Shashank Sharma
This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- which PID was involved in the GPU reset
- what was the GPU status (using flags)
This patch
On 2022-03-03 03:23, Christian König wrote:
Allows submitting jobs as gang which needs to run on multiple engines at the
same time.
All members of the gang get the same implicit, explicit and VM dependencies. So
no gang member will start running until everything else is ready.
The last job
:)
I am like - I must be crazy because no way this works but you insist
that it is and I know u are usually right.
Andrey
On 2022-03-07 10:59, Christian König wrote:
If we don't check for NULL here we would just crash.
But you go into the 'if clause' if job->gang_submit is equal to
On 2022-03-05 13:40, Christian König wrote:
Am 04.03.22 um 18:10 schrieb Andrey Grodzovsky:
On 2022-03-03 03:23, Christian König wrote:
Allows submitting jobs as gang which needs to run on multiple
engines at the same time.
Basic idea is that we have a global gang submit fence representing
On 2022-03-03 03:23, Christian König wrote:
Allows submitting jobs as gang which needs to run on multiple
engines at the same time.
Basic idea is that we have a global gang submit fence representing when the
gang leader is finally pushed to run on the hardware last.
Jobs submitted as gang
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-03-03 03:23, Christian König wrote:
This way we don't need to check for NULL any more.
Signed-off-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 1 +
2 files changed, 2
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-03-03 03:23, Christian König wrote:
We now have standard macros for that.
Signed-off-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 7 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 6 --
2 files changed, 5
Acked-by: Andrey Grodzovsky
Andrey
On 2022-03-03 03:23, Christian König wrote:
Instead of providing the ib index provide the job and ib pointers directly to
the patch and parse functions for UVD and VCE.
Also move the set/get functions for IB values to the IB declarations.
Signed-off
Acked-by: Andrey Grodzovsky
Andrey
On 2022-03-03 03:23, Christian König wrote:
No function change, just move a bunch of definitions from amdgpu.h into
separate header files.
Signed-off-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 95 ---
drivers
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-03-03 03:22, Christian König wrote:
Since we removed the context lock we need to make sure that no two threads
are trying to install an entity at the same time.
Signed-off-by: Christian König
Fixes: e68efb27647f ("drm/amdgpu: remove ctx-
I pushed all the changes including your patch.
Andrey
On 2022-03-02 22:16, Andrey Grodzovsky wrote:
OK, I will do a quick smoke test tomorrow and push all of it then.
Andrey
On 2022-03-02 21:59, Chen, JingWen wrote:
Hi Andrey,
I don't have the bare mental environment, I can only test
The patch is acked-by: Andrey Grodzovsky
If you also smoke tested bare metal feel free to apply all the patches, if not
let me know.
Andrey
On 2022-03-02 04:51, JingWen Chen wrote:
Hi Andrey,
Most part of the patches are OK, but the code will introduce a ib test fail on
the disabled vcn
Thanks, already did. Code pushed both here and in libdrm.
Andrey
On 2022-03-02 03:37, Christian König wrote:
Am 01.03.22 um 19:07 schrieb Andrey Grodzovsky:
Protect with drm_dev_enter/exit
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König for this one
here.
Regarding
the tests finally - if other people during testing will
encounter errors they will report and I will be able to fix.
The releated merge request for enabling libdrm tests suite is in
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/227
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd
Protect with drm_dev_enter/exit
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 10 --
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index f522b52725e4
are depending on
this patch series to fix the concurrency issue within SRIOV TDR sequence.
On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
No problem if so but before I do,
JingWen - why you think this patch is needed as a standalone now ? It has no
use without the
entire feature together
On 2022-02-24 13:11, Alex Deucher wrote:
On Thu, Feb 24, 2022 at 1:05 PM Andrey Grodzovsky
wrote:
According to my investigation of the state of PCI
reset recently it's not working. The reason is
that the kernel PCI code rejects SBR
when there is more than one PF under the same bridge,
and you cannot assume that the other devices under
the same bridge support SBR.
Once we enable FLR support we can re-enable this option, as
FLR is doable on a single PF and doesn't have this
restriction.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +
drivers
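The bridge restriction above can be illustrated roughly like this. This is a hedged sketch only; the real rejection logic lives in the PCI core's bus-reset path, and `my_sbr_is_safe` is a made-up helper:

```c
#include <linux/pci.h>

/* A secondary bus reset (SBR) resets every device below the bridge, so it
 * is only safe when our device is alone there (or every sibling is known
 * to tolerate the reset -- which generally cannot be assumed). */
static bool my_sbr_is_safe(struct pci_dev *pdev)
{
	struct pci_dev *sibling;

	list_for_each_entry(sibling, &pdev->bus->devices, bus_list) {
		if (sibling != pdev)
			return false;	/* another PF/device shares the bridge */
	}
	return true;	/* sole device under the bridge */
}
```

An FLR (`pci_reset_function()`-style, function-level reset) avoids this check entirely because it only touches the one function.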
If it applies cleanly, feel free to drop it in. I'll drop those
patches for drm-next since they are already in drm-misc.
Alex
*From:* amd-gfx on behalf of
Andrey Grodzovsky
*Sent:* Thursday, February 24, 2022 11:24 AM
Grodzovsky wrote:
All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.
Andrey
On 2022-02-09 02:53, Christian König wrote:
Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
Before we initialize schedulers we must know which reset
domain we are in - for single device
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-02-22 09:37, Somalapuram Amaranath wrote:
Dump the list of register values to trace event on GPU reset.
Signed-off-by: Somalapuram Amaranath
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +
drivers/gpu/drm/amd/amdgpu
On 2022-02-20 22:32, Gu, JiaWei (Will) wrote:
[AMD Official Use Only]
Pinging.
-Original Message-
From: Jiawei Gu
Sent: Thursday, February 17, 2022 6:44 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian
; Grodzovsky, Andrey ; Liu, Monk
; Deng,
On 2022-02-16 05:46, Somalapuram, Amaranath wrote:
On 2/15/2022 10:09 PM, Andrey Grodzovsky wrote:
On 2022-02-15 05:12, Somalapuram Amaranath wrote:
Dump the list of register values to trace event on GPU reset.
Signed-off-by: Somalapuram Amaranath
---
drivers/gpu/drm/amd/amdgpu
On 2022-02-15 05:12, Somalapuram Amaranath wrote:
Dump the list of register values to trace event on GPU reset.
Signed-off-by: Somalapuram Amaranath
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 -
drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 16
2
Acked-by: Andrey Grodzovsky
Andrey
On 2022-02-15 06:29, Jiawei Gu wrote:
Add device pointer so scheduler's printing can use
DRM_DEV_ERROR() instead, which makes life easier under multiple GPU
scenario.
Signed-off-by: Jiawei Gu
---
drivers/gpu/drm/scheduler/sched_main.c | 9
Update function name.
Signed-off-by: Andrey Grodzovsky
Reported-by: kernel test robot
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
On 2022-02-11 03:24, Ken Xue wrote:
KMD reports a warning on holding a lock from drm_syncobj_find_fence,
when running amdgpu_test case “syncobj timeline test”.
ctx->lock was designed to prevent concurrent "amdgpu_ctx_wait_prev_fence"
calls and avoid dead reservation lock from GPU reset.
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-02-03 21:45, Surbhi Kakarya wrote:
This patch handles the GPU recovery failure in sriov environment by
retrying the reset if the first reset fails. To determine the condition
of retry, a new macro AMDGPU_RETRY_SRIOV_RESET is added which returns
On 2022-02-10 02:06, Christian König wrote:
Am 10.02.22 um 04:17 schrieb Andrey Grodzovsky:
Seems I forgot to add this to the relevant commit
when submitting.
Rebase/merge issue? Looks like it.
It looks more like I forgot to add the header file
change to the commit after updating
Seems I forgot to add this to the relevant commit
when submitting.
Signed-off-by: Andrey Grodzovsky
Reported-by: kernel test robot
---
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
b
All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.
Andrey
On 2022-02-09 02:53, Christian König wrote:
Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
Before we initialize schedulers we must know which reset
domain we are in - for a single device there is a single
Thanks a lot!
Andrey
On 2022-02-09 01:06, JingWen Chen wrote:
Hi Andrey,
I have been testing your patch and it seems fine till now.
Best Regards,
Jingwen Chen
On 2022/2/3 上午2:57, Andrey Grodzovsky wrote:
Just another ping, with Shyun's help I was able to do some smoke testing on
XGMI
put
and a wrapper around send to reset wq (Lijo)
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 6 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 44 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 40
drivers/gpu/drm/
Since we have a single instance of the reset semaphore, which we
lock only once even for an XGMI hive, we don't need the nested
locking hint anymore.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 --
1 file changed, 4 insertions(+), 10 deletions
Since now all GPU resets are serialized there is no need for this.
This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++
1 file
This function needs to be split into 2 parts where
one is called only once for locking single instance of
reset_domain's sem and reset flag and the other part
which handles MP1 states should still be called for
each device in XGMI hive.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd
We want a single instance of the reset sem across all
reset clients because in case of XGMI we should stop
cross-device MMIO access, since any of the devices could be
in reset at the moment.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 -
drivers/gpu/drm/amd
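The shared-semaphore idea can be sketched as follows. This is a simplified, hypothetical shape, not the actual amdgpu structures: register accessors take the domain's rwsem for read, and the reset path takes it for write, so MMIO from any device in the hive blocks while any member resets:

```c
#include <linux/rwsem.h>
#include <linux/io.h>

/* Hypothetical, simplified shapes -- not the actual amdgpu code. */
struct my_reset_domain {
	struct rw_semaphore sem;	/* one instance per reset domain/hive */
};

struct my_device {
	struct my_reset_domain *reset_domain;
	void __iomem *mmio;
};

static u32 my_device_rreg(struct my_device *adev, u32 reg)
{
	u32 val;

	/* Reader side: blocked while any device in the hive is in reset. */
	down_read(&adev->reset_domain->sem);
	val = readl(adev->mmio + reg);
	up_read(&adev->reset_domain->sem);
	return val;
}

static void my_domain_reset(struct my_reset_domain *domain)
{
	/* Writer side: taken once for the whole domain, so cross-device
	 * MMIO stops for every hive member, not just the resetting one. */
	down_write(&domain->sem);
	/* ... reset all devices in the domain ... */
	up_write(&domain->sem);
}
```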
We should have a single instance per entire reset domain.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 7 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1
Since we serialize all resets there is no need to protect against
concurrent resets.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 -
drivers/gpu/drm/amd/amdgpu
No need to trigger another work queue inside the work queue.
v3:
Problem:
Extra reset caused by host-side FLR notification
following a guest-side triggered reset.
Fix: Prevent queuing flr_work from the mailbox IRQ if the guest is
already executing a reset.
Suggested-by: Liu Shaoyun
Signed-off-by: Andrey
to queue work and wait on it to finish.
v2: Rename to amdgpu_recover_work_struct
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +-
drivers/gpu/drm/amd/amdgpu
Before we initialize schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu
Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Daniel Vetter
Suggested-by: Christian König
Reviewed-by: Christian
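The serialization the commit message describes maps naturally onto an ordered workqueue. A hedged sketch under that assumption (`my_reset_domain` and the constructor are illustrative names, not the actual patch):

```c
#include <linux/workqueue.h>
#include <linux/slab.h>

/* Hypothetical shape of a reset domain whose single-threaded, ordered
 * workqueue serializes every reset source queued against it. */
struct my_reset_domain {
	struct workqueue_struct *wq;
};

static struct my_reset_domain *my_reset_domain_create(const char *name)
{
	struct my_reset_domain *domain = kzalloc(sizeof(*domain), GFP_KERNEL);

	if (!domain)
		return NULL;

	/* Ordered workqueue: at most one work item runs at a time, in
	 * queue order, so resets within a domain never overlap. */
	domain->wq = alloc_ordered_workqueue("%s", 0, name);
	if (!domain->wq) {
		kfree(domain);
		return NULL;
	}
	return domain;
}
```

A single device would own one such domain; all devices of an XGMI hive would share one, so hive-wide resets cannot run concurrently.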
md-gfx/msg58836.html
P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work
hasn't landed yet there.
P.P.S Patches 8-12 are the refactor on top of the original V2 patchset.
Andrey Grodzovsky (11):
drm/amdgpu: Introduce reset domain
drm/amdgpu: Move scheduler init to after XGMI is
On 2022-02-08 06:25, Lazar, Lijo wrote:
On 2/2/2022 10:56 PM, Andrey Grodzovsky wrote:
The reset domain contains the register access semaphore
now and so needs to be present as long as each device
in a hive needs it, and so it cannot be bound to the XGMI
hive life cycle.
Address this by making reset