[PATCH] drm/amdgpu: use amdgpu_ras.h in amdgpu_debugfs.c

2020-03-11 Thread Stanley . Yang
Include the amdgpu_ras.h header file instead of using an extern declaration of the ras_debugfs_create_all function. Signed-off-by: Stanley.Yang Change-Id: I2697250ba67d4deac4371fea05efb68a976f7e5a --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git

[PATCH] drm/amdgpu: fix warning in ras_debugfs_create_all()

2020-03-12 Thread Stanley . Yang
Fix the warning "warn: variable dereferenced before check 'obj' (see line 1131)" by removing the unnecessary checks, since amdgpu_ras_debugfs_create_all() is only called from amdgpu_debugfs_init(), where the obj member in the con->head list is not NULL. Use list_for_each_entry() instead of list_for_each_entry_safe()
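The difference between the two iteration styles named in this message can be sketched in plain C. This is only an illustration under stated assumptions: `struct node`, `push_front`, `count_nodes`, `free_all` and `list_demo` are hypothetical names on a hand-rolled singly linked list, not the kernel's `list_head` macros and not the driver's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a list entry; not the driver's structure. */
struct node {
    int id;
    struct node *next;
};

/* Prepend a node; returns the new head (or the old head if malloc fails). */
static struct node *push_front(struct node *head, int id)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return head;
    n->id = id;
    n->next = head;
    return n;
}

/* Read-only walk: nothing is removed inside the loop, so no saved
 * successor is needed. This is the case where a plain
 * list_for_each_entry() style loop is sufficient. */
static int count_nodes(const struct node *head)
{
    int n = 0;

    for (const struct node *pos = head; pos; pos = pos->next)
        n++;
    return n;
}

/* Walk that frees entries: the successor has to be saved before the
 * current node is released, which is the job of the "_safe" variant. */
static void free_all(struct node *head)
{
    struct node *pos = head, *tmp;

    while (pos) {
        tmp = pos->next;
        free(pos);
        pos = tmp;
    }
}

/* Small usage demo: build three nodes, count them, release them. */
static void list_demo(void)
{
    struct node *head = NULL;

    head = push_front(head, 1);
    head = push_front(head, 2);
    head = push_front(head, 3);
    assert(count_nodes(head) == 3);
    free_all(head);
}
```

A loop that only reads entries does not need the extra temporary pointer, which is presumably why the simpler iterator is preferred when nothing is deleted during the walk.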

[PATCH 1/2] drm/amdgpu: add function to creat all ras debugfs node

2020-03-09 Thread Stanley . Yang
From: Tao Zhou centralize all debugfs creation in one place for ras Signed-off-by: Tao Zhou Signed-off-by: Stanley.Yang Change-Id: I7489ccb41dcf7a11ecc45313ad42940474999d81 --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 29 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |

[PATCH 2/2] drm/amdgpu: call ras_debugfs_create_all in debugfs_init

2020-03-09 Thread Stanley . Yang
From: Tao Zhou and remove each ras IP's own debugfs creation Signed-off-by: Tao Zhou Signed-off-by: Stanley.Yang Change-Id: If3d16862afa0d97abad183dd6e60478b34029e95 --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 1 -

[PATCH 1/1] drm/amdgpu: fix hdp register access error

2020-09-22 Thread Stanley . Yang
mmHDP_READ_CACHE_INVALIDATE register is in HDP not in NBIO Signed-off-by: Stanley.Yang Change-Id: I4375a8a67d3a13f9605479e169169e22dd5833d1 --- drivers/gpu/drm/amd/amdgpu/nv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c

[PATCH V4 1/1] drm/amdgpu: update athub interrupt harvesting handle

2020-09-21 Thread Stanley . Yang
A GCEA/MMHUB EA error should not result in a DF freeze. This is fixed in the next generation, but for some reason a GCEA/MMHUB EA error does result in a DF freeze in the previous generation. The driver should avoid reporting a GCEA/MMHUB EA error as a hw fatal error in the kernel message by reading the GCEA/MMHUB err status

[PATCH] drm/amdgpu: support reserve bad page for virt

2020-06-03 Thread Stanley . Yang
Signed-off-by: Stanley.Yang Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816 --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 164 + drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 30 +++- 3 files changed, 196

[PATCH V3] drm/amdgpu: support reserve bad page for virt

2020-06-04 Thread Stanley . Yang
Changed from V1: rename some function names, only init ras error handler data for supported asic. Changed from V2: fix potential memory leak. Signed-off-by: Stanley.Yang Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816 --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |

[PATCH V3] drm/amdgpu: support reserve bad page for virt

2020-06-04 Thread Stanley . Yang
Changed from V1: rename some function names, only init ras error handler data for supported asic. Changed from V2: fix potential memory leak. Signed-off-by: Stanley.Yang Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816 --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

[PATCH V2] drm/amdgpu: support reserve bad page for virt

2020-06-04 Thread Stanley . Yang
Changed from V1: rename some function names, only init ras error handler data for supported asic. Signed-off-by: Stanley.Yang Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816 --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 + drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c |

[PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw in SRIOV for navi12

2020-11-24 Thread Stanley . Yang
KFDTopologyTest.BasicTest fails if the smc, sdma, sos, ta and asd fw are skipped in SRIOV for vega10, so skip loading these fw in SRIOV only for navi12. v2: remove unnecessary asic type check. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 3

[PATCH 1/1] drm/amdgpu: fix sdma instance fw version and feature version init

2020-12-06 Thread Stanley . Yang
Each sdma instance's fw_version and feature_version should be set to the right value when the asic type isn't between SIENNA_CICHLID and CHIP_DIMGREY_CAVEFISH Signed-off-by: Stanley.Yang Change-Id: I1edbf3e0557d771eb4c0b686fa5299a3b5f26e35 --- drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 2 +- 1 file changed, 1

[PATCH 1/1] drm/amdgpu: set default value of noretry to 1 for specified asic

2020-11-23 Thread Stanley . Yang
noretry = 0 causes the KFDGraphicsInterop test to fail on the SRIOV platform for vega10, so set noretry to 1 for vega10. Signed-off-by: Stanley.Yang Change-Id: I241da5c20970ea889909997ff044d6e61642da81 --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 + 1 file changed, 1 insertion(+) diff --git

[PATCH 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw in SRIOV for navi12

2020-11-23 Thread Stanley . Yang
KFDTopologyTest.BasicTest fails if the smc, sdma, sos, ta and asd fw are skipped in SRIOV for vega10, so skip loading these fw in SRIOV only for navi12. Signed-off-by: Stanley.Yang Change-Id: Id354be93723d7b5d769d73dc67c596af300305af --- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c

[PATCH V2 1/1] drm/amdgpu: skip load smu and sdma microcode on sriov for SIENNA_CICHLID

2020-12-14 Thread Stanley . Yang
Skip loading smu and sdma fw on sriov because the sos, ta and asd fw have already been skipped for SIENNA_CICHLID. V2: move asic check into smu11 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 +++ drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 10 --

[PATCH 1/1] drm/amdgpu: skip load smu and sdma microcode on sriov for SIENNA_CICHLID

2020-12-13 Thread Stanley . Yang
Skip loading smu and sdma fw on sriov because the smc, sos, ta and asd fw have already been skipped for SIENNA_CICHLID. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c| 3 +++ drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 4 +++- 2 files changed, 6 insertions(+), 1 deletion(-) diff

[PATCH Review 1/1] drm/amdgpu: fix bad address translation for sienna_cichlid

2021-06-16 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 5 + drivers/gpu/drm/amd/amdgpu/umc_v8_7.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h index

[PATCH Review 1/1] drm/amdgpu: force enable vega20 gaming sku gfx ras

2021-06-16 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index f404c2321a6a..ca5a32944242 100644 ---

[PATCH Review 1/1] drm/amdgpu: initialize umc ras function

2021-07-08 Thread Stanley . Yang
From: John Clements support umc ras function initialization for aldebaran Change-Id: I84155d4d3eaae86a8c1bd2331b1964946c47f6da Signed-off-by: John Clements Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 13 + drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 15

[PATCH Review 1/1] drm/amdgpu: force enable gfx ras for vega20 ws

2021-04-30 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index daf63a4c1fff..dfeaa57dd7ea 100644 ---

[PATCH Review 1/1] drm/amdgpu: support sdma error injection

2021-04-01 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 0e16683876aa..d9d292c79cfa 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++

[PATCH Review v3 1/1] drm/amdgpu: fix send ras disable cmd when asic not support ras

2021-03-14 Thread Stanley . Yang
cause: It is necessary to send the ras disable command to ras-ta during gfx block ras late init, because the ras capability read from vbios is disabled for vega20 gaming, but the ras context is released during the ras init process, this will cause send ras disable

[PATCH Review 1/1] drm/amdgpu: fix send ras disable cmd when asic not support ras

2021-03-12 Thread Stanley . Yang
cause: It is necessary to send the ras disable command to ras-ta to program GB_EDC_MODE to "BYPASS" mode during gfx block ras late init, because the ras capability read from vbios is disabled for vega20 gaming, but the ras context is released during ras init

[PATCH Review 1/1] drm/amdgpu: optimize gfx ras features flag clean

2021-04-19 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index ec3ebc33ee03..8fdf355d7de8 100644 ---

[PATCH Review 1/1] drm/ttm: fix debugfs node create failed

2021-10-12 Thread Stanley . Yang
Test scenario: modprobe amdgpu -> rmmod amdgpu -> modprobe amdgpu Error log: [ 54.396807] debugfs: File 'page_pool' in directory 'amdttm' already present! [ 54.396833] debugfs: File 'page_pool_shrink' in directory 'amdttm' already present! [ 54.396848] debugfs: File

[PATCH Review 1/1] drm/amd/pm: print errorno if get ecc info failed

2021-12-06 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c index 6e781cee8bb6..e0a8224e466f 100644

[PATCH Review 1/4] drm/amdgpu: Update smu driver interface for aldebaran

2021-11-17 Thread Stanley . Yang
update smu driver if version to 0x08 to avoid mismatch log A version mismatch can still happen with an older FW Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935 Signed-off-by: Stanley.Yang --- .../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-

[PATCH Review 2/4] drm/amdgpu: add new query interface for umc block

2021-11-17 Thread Stanley . Yang
add message smu to query error information Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 4 + drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 161 3 files changed, 181 insertions(+) diff --git

[PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Stanley . Yang
Support the ECC TABLE message; this table includes the umc ras error count and error address Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 7 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38 +++ .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 2

[PATCH Review 4/4] query umc error info from ecc_table

2021-11-17 Thread Stanley . Yang
if smu support ECCTABLE, driver can message smu to get ecc_table then query umc error info from ECCTABLE apply pmfw version check to ensure backward compatibility Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 42 ---

[PATCH Review 1/1] drm/amdgpu: fix smu not match warning

2021-11-15 Thread Stanley . Yang
update smu driver if version to avoid mismatch log Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h

[PATCH Review 1/1] drm/amdgpu: fix smu not match warning

2021-11-16 Thread Stanley . Yang
update smu driver if and version to avoid mismatch log v2: update smu driver interface Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935 Signed-off-by: Stanley.Yang --- .../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +- drivers/gpu/drm/amd/pm/inc/smu_v13_0.h

[PATCH Review 1/1] drm/amdgpu: fix smu not match warning

2021-11-16 Thread Stanley . Yang
update smu driver if version to avoid mismatch log Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h

[PATCH Review 1/1] drm/amdgpu: adjust ip block add sequence on aldebaran

2021-11-28 Thread Stanley . Yang
Reason: { [ 578.019986] amdgpu :23:00.0: amdgpu: GPU reset begin! [ 583.245566] amdgpu :23:00.0: amdgpu: Failed to disable smu features. [ 583.245621] amdgpu :23:00.0: amdgpu: Fail to disable dpm features! [ 583.245639] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]

[PATCH Review 1/1] drm/amdgpu: adjust ip block suspend sequence on aldebaran to fix disable smu feature failure

2021-11-28 Thread Stanley . Yang
{ [ 578.019986] amdgpu :23:00.0: amdgpu: GPU reset begin! [ 583.245566] amdgpu :23:00.0: amdgpu: Failed to disable smu features. [ 583.245621] amdgpu :23:00.0: amdgpu: Fail to disable dpm features! [ 583.245639] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*

[PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed when unload drvier

2021-11-26 Thread Stanley . Yang
amdgpu_device_fini_hw is called before amdgpu_device_fini_sw, so the ras ta is unloaded before the ras disable command is sent; the ras disable operation must happen before hw fini. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c

[PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed when unload drvier v2

2021-11-26 Thread Stanley . Yang
v2: still need to call ras_disable_all_features to handle the ras initialization failure case. amdgpu_device_fini_hw is called before amdgpu_device_fini_sw, so the ras ta is unloaded before the ras disable command is sent; the ras disable operation must happen before hw fini. Signed-off-by: Stanley.Yang

[PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table v2

2021-11-18 Thread Stanley . Yang
support ECC TABLE message, this table include umc ras error count and error address v2: add smu version check to query whether support ecctable call smu_cmn_update_table to get ecctable directly Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 8 +++

[PATCH Review 1/4] drm/amdgpu: Update smu driver interface for aldebaran

2021-11-18 Thread Stanley . Yang
update smu driver if version to 0x08 to avoid mismatch log A version mismatch can still happen with an older FW Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935 Signed-off-by: Stanley.Yang --- .../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-

[PATCH Review 2/4] drm/amdgpu: add new query interface for umc block v2

2021-11-18 Thread Stanley . Yang
add message smu to query error information v2: rename message_smu to ecc_info Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 4 + drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 161 3 files

[PATCH Review 4/4] query umc error info from ecc_table v2

2021-11-18 Thread Stanley . Yang
if smu support ECCTABLE, driver can message smu to get ecc_table then query umc error info from ECCTABLE v2: optimize source code makes logical more reasonable Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 42 +++

[PATCH Review 1/1] drm/amdgpu: only skip get ecc info for aldebaran

2021-12-02 Thread Stanley . Yang
Skip getting ecc info for aldebaran by checking the ip version, so that other asic types are not affected. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c

[PATCH Review 1/2] drm/amdgpu: skip query ecc info in gpu recovery

2021-12-02 Thread Stanley . Yang
this is a workaround due to get ecc info failed during gpu recovery [ 700.236122] amdgpu :09:00.0: amdgpu: Failed to export SMU ecc table! [ 700.236128] amdgpu :09:00.0: amdgpu: GPU reset begin! [ 704.331171] amdgpu: qcm fence wait loop timeout expired [ 704.331194] amdgpu: The cp

[PATCH Review 1/1] drm/amdgpu: skip umc ras error count harvest

2021-12-06 Thread Stanley . Yang
remove in recovery stat check, skip umc ras err cnt harvest in amdgpu_ras_log_on_err_counter Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c

[PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions

2022-01-11 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 10 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +- drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 3 ++- 3 files changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c

[PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions v2

2022-01-12 Thread Stanley . Yang
Changed from v1: remove unused brace Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 - drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +- drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 3 ++- 3 files changed, 11 insertions(+), 3 deletions(-) diff --git

[PATCH Review 1/1] drm/amd/pm: use pm mutex to protect ecc info table

2022-03-10 Thread Stanley . Yang
Change-Id: I6afe0332cbb20528648c38665264930d6b091c2f Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c index

[PATCH Review 1/1] drm/amdgpu: support send bad channel info to smu

2022-03-01 Thread Stanley . Yang
Message SMU bad channel information bitmap to update OOB table Change-Id: I49a79af64d5263c28db059ecb8b8405a471431b4 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 ++

[PATCH Review 1/2] drm/amd/pm: add send bad channel info function

2022-03-03 Thread Stanley . Yang
support message SMU update bad channel info to update HBM bad channel info in OOB table Change-Id: I1e50ed8118f4c1aaefb04c040e59ae4918cdc295 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 12 ++ drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 1 +

[PATCH Review 2/2] drm/amdgpu: message smu to update bad channel info

2022-03-03 Thread Stanley . Yang
The driver should notify the SMU to update bad channel info when it detects an uncorrectable error in the UMC block Change-Id: I2dc8848affdb53e52891013953ae9383fff5f20f Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 +++

[PATCH Review 1/1] drm/amdgpu: print more error info

2022-02-14 Thread Stanley . Yang
Print more error info for deferred uncorrectable ras errors. Changed from V1: move the Deferred error msg into the query uncorrectable error count function. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +- drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 72

[PATCH Review 1/1] drm/amdgpu/pm: add asic smu support check

2022-03-20 Thread Stanley . Yang
Check whether the asic supports smu before calling the smu powerplay function; otherwise it may cause a null pointer dereference on an asic without smu support. Change-Id: Ib86f3d4c88317b23eb1040b9ce1c5c8dcae42488 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 6 ++ 1 file changed, 6

[PATCH Review 1/1] drm/amdgpu: Reset OOB table error count info

2022-02-10 Thread Stanley . Yang
The OOB table error count info should be reset after reset eeprom table Change-Id: I2a39e0e44b7b1a5ab7d6b4d4b73ebe48264396b7 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++ 1 file changed, 3 insertions(+) diff --git

[PATCH Review 1/1] drm/amdgpu: adjust register address calculation

2022-02-11 Thread Stanley . Yang
The UMC_STATUS register offsets are not linear; adjust the offset calculation formula to get the correct address Change-Id: Ic8926078301848330babf289c4238dc8cbcf313d Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 7 +++ 1 file changed, 7 insertions(+) diff --git

[PATCH Review 1/1] drm/amdgpu: fix channel index mapping for SIENNA_CICHLID

2022-01-21 Thread Stanley . Yang
Pmfw reads ecc info registers in the following order, umc0: ch_inst 0, 1, 2 ... 7 umc1: ch_inst 0, 1, 2 ... 7 The position of the register value stored in the eccinfo table is calculated according to the formula below, channel_index = umc_inst * channel_in_umc + ch_inst Driver directly
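The index formula quoted in this message, channel_index = umc_inst * channel_in_umc + ch_inst, can be written as a small helper. This is a minimal sketch: the function name `eccinfo_channel_index` and the constant `CHANNELS_PER_UMC` are hypothetical, and the value 8 is an assumption taken from the "ch_inst 0, 1, 2 ... 7" ordering above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 8 channel instances per UMC instance,
 * matching the "ch_inst 0, 1, 2 ... 7" ordering quoted above. */
#define CHANNELS_PER_UMC 8u

/* Position of a channel's entry in the eccinfo table:
 * channel_index = umc_inst * channel_in_umc + ch_inst */
static inline uint32_t eccinfo_channel_index(uint32_t umc_inst, uint32_t ch_inst)
{
    assert(ch_inst < CHANNELS_PER_UMC);
    return umc_inst * CHANNELS_PER_UMC + ch_inst;
}
```

With these assumptions, umc0 occupies table positions 0 through 7 and umc1 starts at position 8.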

[PATCH Review 1/1] drm/amdgpu: fix convert bad page retiremt

2022-01-19 Thread Stanley . Yang
Pmfw reads ecc info registers and stores the values in eccinfo_table in the following order: umc0 ch_inst 0, 1, 2 ... 7 umc1 ch_inst 0, 1, 2 ... 7 ... umc3 ch_inst 0, 1, 2 ... 7 Driver should convert eccinfo_table_idx into channel_index according to channel_idx_tbl. Change-Id:

[PATCH Review 1/1] drm/amdgpu: remove unused variable warning

2022-01-19 Thread Stanley . Yang
Change-Id: Ic2a488ee253a913d806bd33ee9c90e31a71af320 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 23 --- drivers/gpu/drm/amd/amdgpu/umc_v8_7.c | 6 -- 2 files changed, 29 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c

[PATCH Review 1/1] drm/amdgpu: print more correctable error info

2022-04-07 Thread Stanley . Yang
Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 62 ++- 1 file changed, 60 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c

[PATCH Review 1/1] drm/amdgpu: add umc query error status function

2022-04-08 Thread Stanley . Yang
In order to debug ras error, driver will print IPID/SYND/MISC0 register value if detect correctable or uncorrectable error. Provide umc_query_error_status_helper function to reduce code redundancy. Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec Signed-off-by: Stanley.Yang ---

[PATCH Review 1/1] drm/amdgpu: Fix false positive error log

2023-09-15 Thread Stanley . Yang
First check whether the block ras obj is set, and return directly if it is not. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c

[PATCH Review V2 1/1] drm/amdgpu: Fix false positive error log

2023-09-15 Thread Stanley . Yang
First check whether the block ras obj is set, and return 0 directly if the block ras obj or hw_ops is not set. Changed from V1: return 0 directly if block ras obj or hw ops is not set Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +- 1

[PATCH Review 1/1] drm/amdgpu: Workaround to skip kiq ring test during ras gpu recovery

2023-10-17 Thread Stanley . Yang
This is a workaround: the kiq ring test fails in the suspend stage during ras recovery for gfx v9_4_3. Change-Id: I8de9900aa76706f59bc029d4e9e8438c6e1db8e0 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 21 + 1 file changed, 21 insertions(+) diff --git

[PATCH Review 1/1] drm/amdgpu: Reset vram error data info

2023-11-01 Thread Stanley . Yang
Reset error data info stored in vram when user clear eeprom table. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 97 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 + .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 4 + 3 files

[PATCH Review 1/1] drm/amdgpu: Fix delete nodes that have been relesed

2023-10-19 Thread Stanley . Yang
Fix deletion of nodes that have already been freed. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index

[PATCH Review 1/1] drm/amdgpu: Skip ring test during ras in recovery

2023-09-27 Thread Stanley . Yang
This is a workaround because the ring test fails during ras gpu recovery for aqua vanjaram. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c

[PATCH Review 1/1] drm/amdgpu: Fix potential null pointer derefernce

2023-09-27 Thread Stanley . Yang
amdgpu_ras_get_context may return NULL if the device does not support the ras feature, so add a check before using it. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

[PATCH Review V2 1/1] drm/amdgpu: Enable mca debug mode mode when ras enabled

2023-10-18 Thread Stanley . Yang
Enable smu_v13_0_6 mca debug mode if ras is enabled. Changed from V1: enable mca debug mode if ras enabled. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git

[PATCH Review 1/1] drm/amdgpu: Enable mca debug mode mode for apu

2023-10-18 Thread Stanley . Yang
Enable smu_v13_0_6 mca debug mode when GFX RAS feature is enabled on APU. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c

[PATCH Review 1/1] drm/amdgpu: Enable RAS feature by default for APU

2023-10-19 Thread Stanley . Yang
Enable RAS feature by default for aqua vanjaram on apu platform. Change-Id: I02105d07d169d1356251c994249a134ca5dd2a7a Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git

[PATCH Review 1/1] drm/amdgpu: support ras on SRIOV

2022-05-18 Thread Stanley . Yang
support umc/gfx/sdma ras on guest side Changed from V1: move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler Change-Id: Ic7dda45d8f8cf2d5f1abc7705abc153d558da8a1 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++

[PATCH Review 1/1] drm/amdgpu: fix ras suppoted check

2022-05-31 Thread Stanley . Yang
Fix the aldebaran ras supported check on the SRIOV guest side; the previous check condition blocked all ras features on baremetal Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git

[PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

2022-05-23 Thread Stanley . Yang
SMU add a new variable mca_ceumc_addr to record umc correctable error address in EccInfo table, driver side add ecctable_v2 to support this feature Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 2 +

[PATCH Review 2/2] drm/amdgpu: print umc correctable error address

2022-05-23 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 ++ drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 55 ++- .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 2 + 3 files changed, 60 insertions(+), 2 deletions(-) diff --git

[PATCH Review v3 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

2022-05-25 Thread Stanley . Yang
SMU add a new variable mca_ceumc_addr to record umc correctable error address in EccInfo table, driver side add EccInfo_V2_t to support this feature Changed from V1: remove ecc_table_v2 and unnecessary table id, define union struct include EccInfo_t and EccInfo_V2_t. Changed

[PATCH Review v3 2/2] drm/amdgpu: print umc correctable error address

2022-05-25 Thread Stanley . Yang
Changed from V1: remove unnecessary same row physical address calculation Changed from V2: move record_ce_addr_supported to umc_ecc_info struct Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 5 ++ drivers/gpu/drm/amd/amdgpu/umc_v6_7.c |

[PATCH Review 1/1] drm/amdgpu: support ras on SRIOV

2022-05-18 Thread Stanley . Yang
support umc/gfx/sdma ras on guest side Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c| 4 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 23 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c |

[PATCH Review 1/1] drm/amdgpu: add missed ras block id

2022-06-22 Thread Stanley . Yang
The VCN and JPEG ras are supported, so add VCN and JPEG ras block id. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 2 ++ 2 files changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h

[PATCH Review 1/1] drm/amdgpu/pm: adjust EccInfo_t struct

2022-06-15 Thread Stanley . Yang
The EccInfo_t struct in driver_if.h is as below in official release version 68.55.0 typedef struct { uint64_t mca_umc_status; uint64_t mca_umc_addr; uint16_t ce_count_lo_chip; uint16_t ce_count_hi_chip; uint32_t eccPadding; uint64_t mca_ceumc_addr; } EccInfo_t; It's different
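For readability, the struct quoted in this message is laid out below with compile-time layout checks. The field names and types are copied from the message; the size and offset checks are an added sketch that assumes a common ABI with no padding between these naturally aligned fields, not something the patch itself states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field list as quoted in the message above. */
typedef struct {
    uint64_t mca_umc_status;
    uint64_t mca_umc_addr;
    uint16_t ce_count_lo_chip;
    uint16_t ce_count_hi_chip;
    uint32_t eccPadding;
    uint64_t mca_ceumc_addr;
} EccInfo_t;

/* Layout checks (assumption: no implicit padding on the target ABI).
 * 8 + 8 + 2 + 2 + 4 + 8 = 32 bytes, with mca_ceumc_addr at offset 24. */
static_assert(sizeof(EccInfo_t) == 32, "unexpected EccInfo_t size");
static_assert(offsetof(EccInfo_t, mca_ceumc_addr) == 24,
              "unexpected mca_ceumc_addr offset");
```

Checks of this kind are one way to catch a mismatch between a driver-side table definition and the layout a firmware release expects.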

[PATCH Review v2 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

2022-05-24 Thread Stanley . Yang
SMU add a new variable mca_ceumc_addr to record umc correctable error address in EccInfo table, driver side add EccInfo_V2_t to support this feature Changed from V1: remove ecc_table_v2 and unnecessary table id, define union struct include EccInfo_t and EccInfo_V2_t.

[PATCH Review v2 2/2] drm/amdgpu: print umc correctable error address

2022-05-24 Thread Stanley . Yang
Changed from V1: remove unnecessary same row physical address calculation Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 ++ drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 52 ++- .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 1 +

[PATCH Reivew 1/1] drm/amdgpu: fix use-after-free during gpu recovery

2022-11-16 Thread Stanley . Yang
[Why] [ 754.862560] refcount_t: underflow; use-after-free. [ 754.862898] Call Trace: [ 754.862903] [ 754.862913] amdgpu_job_free_cb+0xc2/0xe1 [amdgpu] [ 754.863543] drm_sched_main.cold+0x34/0x39 [amd_sched] [How] The fw_fence may not be initialized, check whether

[PATCH Review 1/1] drm/amdgpu: print ras drv fw debug info

2023-03-23 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c index 6d2879ac585b..f76b1cb8baf8 100644 ---

[PATCH Review 1/1] drm/amdgpu: Add SDMA_UTCL1_WR_FIFO_SED field for sdma_v4_4_ras_field

2023-04-27 Thread Stanley . Yang
Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c index 6f9895cdddb1..0ddb6955a6d3 100644 --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c +++

[PATCH Review V2 2/2] drm/amdgpu: correct ras enabled flag

2023-04-11 Thread Stanley . Yang
XGMI RAS should depend on the gmc xgmi physical nodes number; XGMI RAS should not be enabled if xgmi num_physical_nodes is zero. Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++ 1 file changed, 7

[PATCH Review V2 1/2] drm/amdgpu: fix unexpected block id

2023-04-11 Thread Stanley . Yang
Aldebaran supports VCN and JPEG RAS, it reports unexpected block id message during VCN and JPEG RAS initialization if VCN and JPEG block id not defined. Change-Id: Icceb43556eec802f11c2077c1c58a1e92c9df599 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4

[PATCH Review 2/2] drm/amdgpu: correct ras enabled flag

2023-04-10 Thread Stanley . Yang
XGMI RAS should depend on the gmc xgmi supported flag and the xgmi physical nodes number. Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 1 file changed, 8 insertions(+) diff --git

[PATCH Review 1/2] drm/admgpu: fix unexpected block id

2023-04-10 Thread Stanley . Yang
Change-Id: Icceb43556eec802f11c2077c1c58a1e92c9df599 Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 2 ++ 2 files changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h

[PATCH Review 1/2] drm/amdgpu: Optimze checking ras supported

2023-06-12 Thread Stanley . Yang
Use "is_app_apu" to identify whether the device is in native APU mode or carveout mode. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 34 ++--- 3 files

[PATCH Review 2/2] drm/amdgpu: Add checking mc_vram_size

2023-06-12 Thread Stanley . Yang
Do not compare injection address with mc_vram_size if mc_vram_size is zero. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
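
A minimal sketch of the check this patch describes, assuming the injection address is rejected when it is at or beyond the VRAM size. The helper name `inject_addr_out_of_range()` is hypothetical and the comparison direction is an assumption; the truncated diff above does not show the real code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the driver's code: report whether an
 * injection address should be rejected as out of range.
 */
static bool inject_addr_out_of_range(uint64_t addr, uint64_t mc_vram_size)
{
    /* A zero mc_vram_size gives nothing to compare against: skip the check. */
    if (mc_vram_size == 0)
        return false;

    /* Assumed rule: addresses at or beyond the VRAM size are invalid. */
    return addr >= mc_vram_size;
}
```

Guarding on zero first keeps every address from being rejected on devices that report no VRAM size.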

[PATCH Review V2 1/2] drm/amdgpu: Enable aqua vanjaram RAS

2023-07-13 Thread Stanley . Yang
Enable RAS for aqua vanjaram. Changed from V1: Split the change in amdgpu_ras_asic_supported into a separate patch. Signed-off-by: Stanley.Yang Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 + 1 file changed, 1 insertion(+) diff --git

[PATCH Review V2 2/2] drm/amdgpu: Disable RAS by default on APU flatform

2023-07-13 Thread Stanley . Yang
Disable RAS feature by default for aqua vanjaram on APU platform. Changed from V1: Split "Disable RAS by default on APU platform" into a separate patch. Signed-off-by: Stanley.Yang Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 + 1 file

[PATCH Review V3 1/2] drm/amdgpu: Enable aqua vanjaram RAS

2023-07-13 Thread Stanley . Yang
Enable RAS for aqua vanjaram. Changed from V1: Split the change in amdgpu_ras_asic_supported into a separate patch. Changed from V2: Avoid modifying the global variable amdgpu_ras_enable. Signed-off-by: Stanley.Yang Reviewed-by: Hawking Zhang ---

[PATCH Review V3 2/2] drm/amdgpu: Disable RAS by default on APU flatform

2023-07-13 Thread Stanley . Yang
Disable RAS feature by default for aqua vanjaram on APU platform. Changed from V1: Split "Disable RAS by default on APU platform" into a separate patch. Changed from V2: Avoid modifying the global variable amdgpu_ras_enable. Signed-off-by: Stanley.Yang Reviewed-by: Hawking

[PATCH Review V2 1/3] drm/amdgpu: pass xcc mask to ras ta

2023-06-05 Thread Stanley . Yang
Pass the xcc mask to the ras ta; the ras ta will compare the mask with the one from chiplet topology. Changed from V1: Remove IP version checking. Set ras_cmd->ras_init_message.init_flags.xcc_mask directly because xcc_mask is a common structure for all the devices. Signed-off-by:

[PATCH Review V2 3/3] drm/amdgpu: convert vcn/jpeg logical mask to physical mask

2023-06-05 Thread Stanley . Yang
Changed from V1: Remove amdgpu_ras_logical_mask_to_physical_mask because GET_MASK provides the same feature. Support converting VCN/JPEG logical mask to physical mask. Signed-off-by: Stanley.Yang Reviewed-by: Tao Zhou Reviewed-by: Hawking Zhang ---

[PATCH Review V2 2/3] drm/amdgpu: support check vcn jpeg block mask

2023-06-05 Thread Stanley . Yang
Support VCN/JPEG instance mask checking; pass the logical mask directly for all blocks except GFX/SDMA/VCN/JPEG. Changed from V1: correct a typo Signed-off-by: Stanley.Yang Reviewed-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +- 1 file changed, 5

[PATCH Review 4/6] drm/amdgpu: Add support EEPROM table v2.1

2023-06-07 Thread Stanley . Yang
Add ras info to the EEPROM table so that an app can analyse device ECC status through the EEPROM table ras info without the GPU driver. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +- .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 204 --

[PATCH Review 6/6] drm/amdgpu: Set EEPROM ras info

2023-06-07 Thread Stanley . Yang
Set EEPROM ras info: rma status, health percent and bad page threshold. Signed-off-by: Stanley.Yang --- .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 24 +++ .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h| 5 2 files changed, 29 insertions(+) diff --git

[PATCH Review 5/6] drm/amdgpu: Calculate EEPROM table ras info bytes sum

2023-06-07 Thread Stanley . Yang
It's more reasonable to check EEPROM table ras info bytes. Signed-off-by: Stanley.Yang --- .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

[PATCH Review 1/6] drm/amdgpu: Rename ras table version

2023-06-07 Thread Stanley . Yang
Rename RAS_TABLE_VER to RAS_TABLE_VER_V1, move RAS_TABLE_VER_V1 from amdgpu_ras_eeprom.c to amdgpu_ras_eeprom.h. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 5 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 2 ++ 2 files changed, 4 insertions(+), 3
