On 2022-02-08 01:33, Lazar, Lijo wrote:
On 1/26/2022 4:07 AM, Andrey Grodzovsky wrote:
Since we serialize all resets, there is no need to protect against
concurrent resets.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19
Just another ping. With Shyun's help I was able to do some smoke testing
on an XGMI SRIOV system (booting and triggering hive reset)
and for now it looks good.
Andrey
On 2022-01-28 14:36, Andrey Grodzovsky wrote:
Just a gentle ping if people have more comments on this patch set?
Especially the last 5
on boot with XGMI hive by adding a type to reset_domain.
XGMI will only create a new reset_domain if the previous was of single
device type, meaning it's the first boot. Otherwise it will take a
refcount to the existing reset_domain from the amdgpu device.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd
On 2022-02-01 16:47, Surbhi Kakarya wrote:
This patch handles GPU recovery failure in an SRIOV environment by
retrying the reset if the first reset fails. To determine the retry
condition, a new function amdgpu_is_retry_sriov_reset() is added which
returns true if the failure is due
to
Just a gentle ping if people have more comments on this patch set?
Especially the last 5 patches,
as the first 7 are exactly the same as V2 and we already mostly went over them.
Andrey
On 2022-01-25 17:37, Andrey Grodzovsky wrote:
This patchset is based on earlier work by Boris[1] that allowed us to have
14:21, Andrey Grodzovsky wrote:
On 2022-01-17 2:17 p.m., Christian König wrote:
On 17.01.22 at 20:14, Andrey Grodzovsky wrote:
Ping on the question
Oh, my! That was already more than a week ago and is completely
swapped out of my head again.
Andrey
On 2022-01-05 1:11 p.m., Andrey
On 2022-01-26 07:07, Christian König wrote:
On 25.01.22 at 23:37, Andrey Grodzovsky wrote:
Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.
Signed-off-by: Andrey
Since we have a single instance of the reset semaphore, which we
lock only once even for an XGMI hive, we don't need the nested
locking hint anymore.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 --
1 file changed, 4 insertions(+), 10 deletions
Since now all GPU resets are serialized there is no need for this.
This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++
1 file
This function needs to be split into 2 parts, where
one is called only once for locking the single instance of
reset_domain's sem and reset flag, and the other part,
which handles MP1 states, should still be called for
each device in the XGMI hive.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd
We should have a single instance per entire reset domain.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 7 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1
We want a single instance of the reset sem across all
reset clients because, in the case of XGMI, we should stop
cross-device MMIO access since any of the devices could be
in a reset at the moment.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 -
drivers/gpu/drm/amd
-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 6 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 44 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 36 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 10 +
drivers/gpu/drm/amd/amdgpu
Since we serialize all resets, there is no need to protect against
concurrent resets.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 -
drivers/gpu/drm/amd/amdgpu
No need to trigger another work queue inside the work queue.
v3:
Problem:
Extra reset caused by host side FLR notification
following guest side triggered reset.
Fix: Prevent queuing flr_work from the mailbox IRQ if the guest is
already executing a reset.
Suggested-by: Liu Shaoyun
Signed-off-by: Andrey
to queue work and wait on it to finish.
v2: Rename to amdgpu_recover_work_struct
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +-
drivers/gpu/drm/amd/amdgpu
Restrict job resubmission to the suspend case
only, since schedulers are not initialised yet on
probe.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 -
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm
Before we initialize schedulers we must know which reset
domain we are in: for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu
Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Daniel Vetter
Suggested-by: Christian König
Reviewed-by: Christian
P.P.S. Patches 8-12 are the refactor on top of the original V2 patchset.
P.P.P.S. I wasn't yet able to test the reworked code on an XGMI SRIOV system
because drm-misc-next fails to load there.
Would appreciate it if maybe jingwech can try it on his system like he tested V2.
Andrey Grodzovsky (12):
drm/
they
feel a need to dump info? Also, how reliable is the STB infra during
a reset?
Regards
Shashank
On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote:
You probably can add the STB dump we worked on a while ago to your
info dump - a reminder
on the feature is here
https://www.spinics.net/lists/amd-gfx
You probably can add the STB dump we worked on a while ago to your info
dump - a reminder
on the feature is here https://www.spinics.net/lists/amd-gfx/msg70751.html
Andrey
On 2022-01-21 15:34, Sharma, Shashank wrote:
From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
From:
On 2022-01-17 2:17 p.m., Christian König wrote:
On 17.01.22 at 20:14, Andrey Grodzovsky wrote:
Ping on the question
Oh, my! That was already more than a week ago and is completely
swapped out of my head again.
Andrey
On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
Also, what about
Ping on the question
Andrey
On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
Also, what about having the reset_active or in_reset flag in the
reset_domain itself?
Off hand, that sounds like a good idea.
What then about the adev->reset_sem semaphore? Should we also m
On 2022-01-07 12:46 a.m., JingWen Chen wrote:
On 2022/1/7 11:57 AM, JingWen Chen wrote:
On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote:
On 2022-01-06 12:18 a.m., JingWen Chen wrote:
On 2022/1/6 12:59 PM, JingWen Chen wrote:
On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
On 2022-01-05 2:59 a.m
On 2022-01-06 12:18 a.m., JingWen Chen wrote:
On 2022/1/6 12:59 PM, JingWen Chen wrote:
On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
On 2022-01-05 2:59 a.m., Christian König wrote:
On 05.01.22 at 08:34, JingWen Chen wrote:
On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
On 2022-01-04 6:36
Got it
See below one small comment; with that the patch is Reviewed-by: Andrey
Grodzovsky
On 2022-01-05 9:24 p.m., Shi, Leslie wrote:
[AMD Official Use Only]
Hi Andrey,
It is the following patch that makes the call to amdgpu_device_unmap_mmio() conditioned on
the device being unplugged.
3efb17ae7e92 &quo
On 2022-01-05 2:59 a.m., Christian König wrote:
On 05.01.22 at 08:34, JingWen Chen wrote:
On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
On 2022-01-04 6:36 a.m., Christian König wrote:
On 04.01.22 at 11:49, Liu, Monk wrote:
[AMD Official Use Only]
See the FLR request from the hypervisor
On 2022-01-04 11:23 p.m., Leslie Shi wrote:
Patch: 3efb17ae7e92 ("drm/amdgpu: Call amdgpu_device_unmap_mmio() if device
is unplugged to prevent crash in GPU initialization failure") makes the call to
amdgpu_device_unmap_mmio() conditioned on the device being unplugged. This patch unmaps
MMIO mappings even
On 2022-01-05 7:31 a.m., Christian König wrote:
On 05.01.22 at 10:54, Lazar, Lijo wrote:
On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
Use the reset domain wq also for non-TDR gpu recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency
at we need to change SRIOV and not the driver.
Christian.
On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
Sure, I guess i can drop this patch then.
Andrey
On 2021-12-24 4:57 a.m., JingWen Chen wrote:
I do agree with shaoyun, if the host finds the gpu engine hangs first, and does
the flr, gues
work with that we need to change SRIOV and not the
driver.
Christian.
On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
Sure, I guess i can drop this patch then.
Andrey
On 2021-12-24 4:57 a.m., JingWen Chen wrote:
I do agree with shaoyun, if the host finds the gpu engine hangs
first, and does th
On 2022-01-03 9:30 p.m., Leslie Shi wrote:
If the driver load failed during hw_init(), delay unmapping MMIO VRAM until
amdgpu_ttm_fini().
This prevents accessing an invalid memory address in vcn_v3_0_sw_fini().
Signed-off-by: Leslie Shi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16
-
de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: dan...@ffwll.ch; Liu, Monk ; Chen, Horace
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
for SRIOV
On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
Since now flr work is serialized against GPU resets there is no
-gfx@lists.freedesktop.org
Cc: dan...@ffwll.ch; Liu, Monk ; Chen, Horace
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
for SRIOV
On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
Since now flr work is serialized against GPU resets there is no need
for this.
Signed-off-by: An
No need to trigger another work queue inside the work queue.
v3:
Problem:
Extra reset caused by host side FLR notification
following guest side triggered reset.
Fix: Prevent queuing flr_work from the mailbox IRQ if the guest is
already executing a reset.
Suggested-by: Liu Shaoyun
Signed-off-by: Andrey
wondering what is the impact
on a system like MI200 A+A.
Thanks,
Lijo
-Original Message-
From: amd-gfx On Behalf Of
Andrey Grodzovsky
Sent: Friday, December 17, 2021 8:32 PM
To: Koenig, Christian ; Shi, Leslie
; Pan, Xinhui ; Deucher,
Alexander ; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun
Subj
Since now all GPU resets are serialized there is no need for this.
This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++
1 file
Since we serialize all resets, there is no need to protect against
concurrent resets.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 -
drivers/gpu/drm/amd/amdgpu
Since now flr work is serialized against GPU resets
there is no need for this.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
2 files changed, 22 deletions(-)
diff --git a/drivers/gpu/drm/amd
No need to trigger another work queue inside the work queue.
Suggested-by: Liu Shaoyun
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 7 +--
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 7 +--
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 7 +--
3 files
Restrict job resubmission to the suspend case
only, since schedulers are not initialised yet on
probe.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers
to queue work and wait on it to finish.
v2: Rename to amdgpu_recover_work_struct
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 2 +-
3 files
Before we initialize schedulers we must know which reset
domain we are in: for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu
. (Shaoyun)
[1]
https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezil...@collabora.com/
P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work
hasn't landed yet there.
Andrey Grodzovsky (8):
drm/amdgpu: Introduce reset domain
drm/amdgpu
Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Daniel Vetter
Suggested-by: Christian König
Reviewed-by: Christian
On 2021-12-21 2:59 a.m., Christian König wrote:
On 20.12.21 at 23:17, Andrey Grodzovsky wrote:
On 2021-12-20 2:20 a.m., Christian König wrote:
On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
Use the reset domain wq also for non-TDR gpu recovery triggers
such as sysfs and RAS. We must serialize
On 2021-12-21 2:02 a.m., Christian König wrote:
On 20.12.21 at 20:22, Andrey Grodzovsky wrote:
On 2021-12-20 2:17 a.m., Christian König wrote:
On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
Restrict job resubmission to the suspend case
only, since schedulers are not initialised yet on
probe
On 2021-12-20 2:20 a.m., Christian König wrote:
On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
Use the reset domain wq also for non-TDR gpu recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR call the original recovery
On 2021-12-20 2:16 a.m., Christian König wrote:
On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
Before we initialize schedulers we must know which reset
domain we are in: for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans
On 2021-12-20 2:17 a.m., Christian König wrote:
On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
Restrict job resubmission to the suspend case
only, since schedulers are not initialised yet on
probe.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
1 file
, Monk
Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu
On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
This patchset is based on earlier work by Boris[1] that allowed us to
have an ordered workqueue at the driver level that will be used by the
different schedulers to
Since we serialize all resets, there is no need to protect against
concurrent resets.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 -
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 1 -
3 files
Do it for both single device and XGMI hive cases.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 7 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 20 +++-
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 9 +
drivers/gpu/drm/amd
Since now all GPU resets are serialized there is no need for this.
This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++
1 file changed, 7 insertions(+), 82
to queue work and wait on it to finish.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 2 +-
3 files changed, 35 insertions(+), 2 deletions
Restrict job resubmission to the suspend case
only, since schedulers are not initialised yet on
probe.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers
Before we initialize schedulers we must know which reset
domain we are in: for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu
landed yet there.
Andrey Grodzovsky (6):
drm/amdgpu: Init GPU reset single threaded wq
drm/amdgpu: Move scheduler init to after XGMI is ready
drm/amdgpu: Fix crash on modprobe
drm/amdgpu: Serialize non TDR gpu recovery with TDRs
drm/amdgpu: Drop hive->in_reset
drm/amdgpu: D
Reviewed-by: Andrey Grodzovsky
Andrey
On 2021-12-17 3:49 a.m., Christian König wrote:
On 17.12.21 at 03:26, Leslie Shi wrote:
[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns an error
during driver modprobe, it
will start the error handling path immediately and call
Maybe we should just use drm_dev_is_unplugged() for this particular case
because there would be no race, since when the device is unplugged it's
final. It's the other way around that requires strict drm_dev_enter/exit
scope.
Andrey
On 2021-12-16 3:38 a.m., Christian König wrote:
The
I think that we should not call amdgpu_device_unmap_mmio unless the device
is unplugged (as in amdgpu_pci_remove) because the point of this
function is to prevent accesses to the MMIO range the device was occupying
before removal.
There is no point in preventing MMIO accesses when init failed and we want
On 2021-12-14 12:03 p.m., Andrey Grodzovsky wrote:
-
-	if (job != NULL) {
-		/* mark this fence has a parent job */
-		set_bit(AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT, &fence->flags);
+	if (job)
+		dma_fence_init(fence, &amdgpu_job_fence_ops,
+			       &ring->fence_dr
On 2021-12-14 6:15 a.m., Huang Rui wrote:
The job embedded fence doesn't initialize the flags at
dma_fence_init(). Then we will go down a wrong path in the
amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
once we enable the trace event here. So introduce a new amdgpu_fence
object
On 2021-12-09 10:47 p.m., Lang Yu wrote:
On 12/09/ , Christian König wrote:
Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:
On 2021-12-09 4:00 a.m., Christian König wrote:
Am 09.12.21 um 09:49 schrieb Lang Yu:
It is useful to maintain error context when debugging
SW/FW issues. We
.
Compared to a simple hang, the system will stay stable,
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.
Suggested-by: Christian Koenig
Suggested-by: Andrey Grodzovsky
Signed-off-by: Lang Yu
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h
On 2021-12-01 8:11 a.m., Christian König wrote:
Adding Andrey as well.
Am 01.12.21 um 12:37 schrieb Yu, Lang:
[SNIP]
+ BUG_ON(unlikely(smu->smu_debug_mode) && res);
BUG_ON() really crashes the kernel and is only allowed if we
prevent further data corruption with that.
Most of the time
Ping - mostly just to get a final ack to push it into amd-staging-drm-next
Andrey
On 2021-11-18 1:18 p.m., Andrey Grodzovsky wrote:
The Smart Trace Buffer (STB) is a cyclic data buffer used to
log information about system execution for characterization and debug
purposes. If at any point should
Add debugfs hook.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Lijo Lazar
Reviewed-by: Luben Tuikov
---
drivers/gpu/drm/amd/pm/amdgpu_pm.c| 2 +
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 1 +
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 86 +++
3 files changed
Add STB implementation for sienna_cichlid
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Lijo Lazar
Reviewed-by: Luben Tuikov
---
.../amd/include/asic_reg/mp/mp_11_0_offset.h | 7 +++
.../amd/include/asic_reg/mp/mp_11_0_sh_mask.h | 12
.../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 55
Add interface to collect STB logs.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Lijo Lazar
Reviewed-by: Luben Tuikov
---
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 15 +++
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 18 ++
2 files changed, 33 insertions(+)
diff
additional instrumentation.
Andrey Grodzovsky (3):
drm/amd/pm: Add STB accessors interface
drm/amd/pm: Add STB support in sienna_cichlid
drm/amd/pm: Add debugfs info for STB
.../amd/include/asic_reg/mp/mp_11_0_offset.h | 7 ++
.../amd/include/asic_reg/mp/mp_11_0_sh_mask.h | 12 ++
drivers/gpu
On 2021-11-10 8:24 a.m., Daniel Vetter wrote:
On Wed, Nov 10, 2021 at 11:09:50AM +0100, Christian König wrote:
On 10.11.21 at 10:50, Daniel Vetter wrote:
On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter wrote:
On Mon, Nov 08, 2021 at
On 2021-11-10 5:09 a.m., Christian König wrote:
On 10.11.21 at 10:50, Daniel Vetter wrote:
On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter wrote:
On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
I stumbled across this
Pushed to drm-misc-next
Andrey
On 2021-10-29 3:07 a.m., Christian König wrote:
Attached a patch. Give it a try please, I tested it on my side and
tried to generate the right conditions to trigger this code path by
repeatedly submitting commands while issuing GPU reset to stop the
scheduler
On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote:
On 2021-10-27 10:50 a.m., Christian König wrote:
On 27.10.21 at 16:47, Andrey Grodzovsky wrote:
On 2021-10-27 10:34 a.m., Christian König wrote:
On 27.10.21 at 16:27, Andrey Grodzovsky wrote:
[SNIP]
Let me please know if I am still
On 2021-10-27 10:43 p.m., JingWen Chen wrote:
On 2021/10/28 3:43 AM, Andrey Grodzovsky wrote:
On 2021-10-25 10:57 p.m., JingWen Chen wrote:
On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
What do
On 2021-10-27 10:50 a.m., Christian König wrote:
On 27.10.21 at 16:47, Andrey Grodzovsky wrote:
On 2021-10-27 10:34 a.m., Christian König wrote:
On 27.10.21 at 16:27, Andrey Grodzovsky wrote:
[SNIP]
Let me please know if I am still missing some point of yours.
Well, I mean we need
On 2021-10-25 10:57 p.m., JingWen Chen wrote:
On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
What do you mean by underflow in this case ? You mean use after free because of
extra dma_fence_put
On 2021-10-27 10:34 a.m., Christian König wrote:
On 27.10.21 at 16:27, Andrey Grodzovsky wrote:
[SNIP]
Let me please know if I am still missing some point of yours.
Well, I mean we need to be able to handle this for all drivers.
For sure, but as i said above in my opinion we need
On 2021-10-26 6:54 a.m., Christian König wrote:
On 26.10.21 at 04:33, Andrey Grodzovsky wrote:
On 2021-10-25 3:56 p.m., Christian König wrote:
In general I'm all there to get this fixed, but there is one major
problem: Drivers don't expect the lock to be dropped.
I am probably missing
ill missing some point of yours.
Andrey
Regards,
Christian.
On 25.10.21 at 21:10, Andrey Grodzovsky wrote:
Adding back Daniel (somehow he got off the addresses list) and Chris
who worked a lot in this area.
On 2021-10-21 2:34 a.m., Christian König wrote:
On 20.10.21 at 21:32, And
Adding back Daniel (somehow he got off the addresses list) and Chris who
worked a lot in this area.
On 2021-10-21 2:34 a.m., Christian König wrote:
On 20.10.21 at 21:32, Andrey Grodzovsky wrote:
On 2021-10-04 4:14 a.m., Christian König wrote:
The problem is a bit different.
The callback
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
What do you mean by underflow in this case ? You mean use after free because of
extra dma_fence_put() ?
yes
Then maybe update the description because 'underflow' is very confusing
On 2021-10
What do you mean by underflow in this case ? You mean use after free
because of extra dma_fence_put() ?
On 2021-10-22 4:14 a.m., JingWen Chen wrote:
ping
On 2021/10/22 11:33 AM, Jingwen Chen wrote:
[Why]
In advance tdr mode, the real bad job will be resubmitted twice, while
in
ring->adev->rings[ring->idx] = NULL;
}
Regards,
Lang
Got it, Looks good to me.
Reviewed-by: Andrey Grodzovsky
Andrey
Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early
and late")
Signed-off-by: Lang Yu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device
On 2021-10-21 3:19 a.m., Yu, Lang wrote:
[AMD Official Use Only]
-Original Message-
From: Yu, Lang
Sent: Thursday, October 21, 2021 3:18 PM
To: Grodzovsky, Andrey
Cc: Deucher, Alexander ; Koenig, Christian
; Huang, Ray ; Yu, Lang
Subject: [PATCH 1/3] drm/amdgpu: fix a potential
t be done there.
Andrey
On 01.10.21 at 17:10, Andrey Grodzovsky wrote:
From what I see here you are supposed to have an actual deadlock and not
only a warning: sched_fence->finished is first signaled from within the
hw fence done callback (drm_sched_job_done_cb) but then again from
within its
On 2021-10-19 11:54 a.m., Christian König wrote:
On 19.10.21 at 17:41, Andrey Grodzovsky wrote:
On 2021-10-19 9:22 a.m., Nirmoy Das wrote:
Get rid of pin/unpin and evict and swap back the gart
page table, which should make things less likely to break.
+Christian
Could you guys also clarify
On 2021-10-19 9:22 a.m., Nirmoy Das wrote:
Get rid of pin/unpin and evict and swap back the gart
page table, which should make things less likely to break.
+Christian
Could you guys also clarify what exactly are the stability issues this
fixes ?
Andrey
Also remove 2nd call to
, and only continue the execution in amdgpu_pci_resume
when it's pci_channel_io_frozen.
Fixes: c9a6b82f45e2 ("drm/amdgpu: Implement DPC recovery")
Suggested-by: Andrey Grodzovsky
Signed-off-by: Guchun Chen
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 1 +
drivers/gpu/drm/amd/amdgpu/amdgp
the
scheduler fence.
Daniel is right that this needs an irq_work struct to handle this
properly.
Christian.
On 01.10.21 at 17:10, Andrey Grodzovsky wrote:
From what I see here you are supposed to have an actual deadlock and not
only a warning: sched_fence->finished is first signaled from within
From what I see here you are supposed to have an actual deadlock and not only
a warning: sched_fence->finished is first signaled from within the
hw fence done callback (drm_sched_job_done_cb) but then again from
within its own callback (drm_sched_entity_kill_jobs_cb), and so it
looks like the same fence object is
No, scheduler restart and device unlock must take place
in amdgpu_pci_resume (see struct pci_error_handlers for the various
states of PCI recovery). So just add a flag (probably in amdgpu_device)
so we can remember what pci_channel_state_t we came from (unfortunately
it's not passed to us in
On 2021-09-30 10:00 p.m., Guchun Chen wrote:
When a PCI error state pci_channel_io_normal is detected, it will
report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI driver
will continue the execution of PCI resume callback report_resume by
pci_walk_bridge, and the callback will go into
Can you test this change with the hotunplug tests in libdrm?
Since the tests are still in disabled mode until the latest fixes propagate
to drm-next upstream, you will need to comment out
https://gitlab.freedesktop.org/mesa/drm/-/blob/main/tests/amdgpu/hotunplug_tests.c#L65
I recently fixed a few
Series is Acked-by: Andrey Grodzovsky
Andrey
On 2021-09-21 2:53 p.m., Philip Yang wrote:
If svm migration init failed to create pgmap for device memory, set
pgmap type to 0 to disable device SVM support capability.
Signed-off-by: Philip Yang
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
Reviewed-by: Andrey Grodzovsky
Andrey
On 2021-09-21 9:11 a.m., Chen, Guchun wrote:
[Public]
Ping...
Regards,
Guchun
-Original Message-
From: Chen, Guchun
Sent: Saturday, September 18, 2021 2:09 PM
To: amd-gfx@lists.freedesktop.org; Koenig, Christian ; Pan, Xinhui
; Deucher
In any case, once you converge on a solution please include
the relevant ticket in the commit description -
https://gitlab.freedesktop.org/drm/amd/-/issues/1718
Andrey
On 2021-09-20 10:20 p.m., Felix Kuehling wrote:
Am 2021-09-20 um 5:55 p.m. schrieb Philip Yang:
Don't use
Ping
Andrey
On 2021-09-17 7:30 a.m., Andrey Grodzovsky wrote:
Problem:
When the device goes into suspend and is unplugged during it,
then all HW programming during resume fails, leading
to a bad SW state during the pci remove handling which follows.
Because the device is first resumed and only later removed,
we