Re: [PATCH] drm/amdgpu: clean up vbios fetching code

2024-09-17 Thread Lazar, Lijo



On 9/17/2024 6:24 PM, Alex Deucher wrote:
> After splitting the logic between APU and dGPU,
> clean up some of the APU and dGPU specific logic
> that no longer applied.
> 
> Signed-off-by: Alex Deucher 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 16 ++--
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> index e8f62d718167b..46bf623919d7c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> @@ -284,10 +284,6 @@ static bool amdgpu_atrm_get_bios(struct amdgpu_device 
> *adev)
>   acpi_status status;
>   bool found = false;
>  
> - /* ATRM is for the discrete card only */
> - if (adev->flags & AMD_IS_APU)
> - return false;
> -
>   /* ATRM is for on-platform devices only */
>   if (dev_is_removable(&adev->pdev->dev))
>   return false;
> @@ -343,11 +339,8 @@ static inline bool amdgpu_atrm_get_bios(struct 
> amdgpu_device *adev)
>  
>  static bool amdgpu_read_disabled_bios(struct amdgpu_device *adev)
>  {
> - if (adev->flags & AMD_IS_APU)
> - return igp_read_bios_from_vram(adev);
> - else
> - return (!adev->asic_funcs || 
> !adev->asic_funcs->read_disabled_bios) ?
> - false : amdgpu_asic_read_disabled_bios(adev);
> + return (!adev->asic_funcs || !adev->asic_funcs->read_disabled_bios) ?
> + false : amdgpu_asic_read_disabled_bios(adev);
>  }
>  
>  #ifdef CONFIG_ACPI
> @@ -455,11 +448,6 @@ static bool amdgpu_get_bios_dgpu(struct amdgpu_device 
> *adev)
>   goto success;
>   }
>  
> - if (igp_read_bios_from_vram(adev)) {
> - dev_info(adev->dev, "Fetched VBIOS from VRAM BAR\n");
> - goto success;
> - }
> -
>   if (amdgpu_read_platform_bios(adev)) {
>   dev_info(adev->dev, "Fetched VBIOS from platform\n");
>   goto success;


Re: [PATCH] drm/amdgpu/bios: split vbios fetching between APU and dGPU

2024-09-17 Thread Lazar, Lijo



On 9/17/2024 1:32 AM, Alex Deucher wrote:
> We need some different logic for dGPUs and the APU path
> can be simplified because there are some methods which
> are never used on APUs.  This also fixes a regression
> on some older APUs causing the driver to fetch
> the unpatched ROM image rather than the patched image.
> 
> Fixes: 9c081c11c621 ("drm/amdgpu: Reorder to read EFI exported ROM first")
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 47 +++-
>  1 file changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> index 42e64bce661e..e8f62d718167 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> @@ -414,7 +414,36 @@ static inline bool amdgpu_acpi_vfct_bios(struct 
> amdgpu_device *adev)
>  }
>  #endif
>  

Now that they are separated, a couple of additional changes -

> -bool amdgpu_get_bios(struct amdgpu_device *adev)
> +static bool amdgpu_get_bios_apu(struct amdgpu_device *adev)
> +{
> + if (amdgpu_acpi_vfct_bios(adev)) {
> + dev_info(adev->dev, "Fetched VBIOS from VFCT\n");
> + goto success;
> + }
> +
> + if (igp_read_bios_from_vram(adev)) {
> + dev_info(adev->dev, "Fetched VBIOS from VRAM BAR\n");
> + goto success;
> + }
> +

This may no longer be needed for the dGPU path.

> + if (amdgpu_read_bios(adev)) {
> + dev_info(adev->dev, "Fetched VBIOS from ROM BAR\n");
> + goto success;
> + }
> +
> + if (amdgpu_read_platform_bios(adev)) {
> + dev_info(adev->dev, "Fetched VBIOS from platform\n");
> + goto success;
> + }
> +
> + dev_err(adev->dev, "Unable to locate a BIOS ROM\n");
> + return false;
> +
> +success:
> + return true;
> +}
> +
> +static bool amdgpu_get_bios_dgpu(struct amdgpu_device *adev)
>  {
>   if (amdgpu_atrm_get_bios(adev)) {

Better to remove this check from this function -
/* ATRM is for the discrete card only */
if (adev->flags & AMD_IS_APU)
return false;

Thanks,
Lijo

>   dev_info(adev->dev, "Fetched VBIOS from ATRM\n");
> @@ -455,10 +484,24 @@ bool amdgpu_get_bios(struct amdgpu_device *adev)
>   return false;
>  
>  success:
> - adev->is_atom_fw = adev->asic_type >= CHIP_VEGA10;
>   return true;
>  }
>  
> +bool amdgpu_get_bios(struct amdgpu_device *adev)
> +{
> + bool found;
> +
> + if (adev->flags & AMD_IS_APU)
> + found = amdgpu_get_bios_apu(adev);
> + else
> + found = amdgpu_get_bios_dgpu(adev);
> +
> + if (found)
> + adev->is_atom_fw = adev->asic_type >= CHIP_VEGA10;
> +
> + return found;
> +}
> +
>  /* helper function for soc15 and onwards to read bios from rom */
>  bool amdgpu_soc15_read_bios_from_rom(struct amdgpu_device *adev,
>u8 *bios, u32 length_bytes)


Re: [PATCH v2 5/5] drm/amd/pm: Use metrics 1_6

2024-09-13 Thread Lazar, Lijo



On 9/12/2024 5:29 PM, Asad Kamal wrote:
> Use metrics 1_6 to report activities per partition
> 
> v2: Use separate per instance for different platforms, shared
> vcn handled by other fix
> 
> Signed-off-by: Asad Kamal 

Series is -

Reviewed-by: Lijo Lazar 

Thanks,
Lijo
> ---
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 78 ++-
>  1 file changed, 60 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index ee178914ca53..cd739f627df0 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -358,7 +358,7 @@ static int smu_v13_0_6_tables_init(struct smu_context 
> *smu)
>   return -ENOMEM;
>   smu_table->metrics_time = 0;
>  
> - smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v1_5);
> + smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v1_6);
>   smu_table->gpu_metrics_table =
>   kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
>   if (!smu_table->gpu_metrics_table) {
> @@ -2302,15 +2302,18 @@ static int 
> smu_v13_0_6_get_current_pcie_link_speed(struct smu_context *smu)
>  
>  static ssize_t smu_v13_0_6_get_gpu_metrics(struct smu_context *smu, void 
> **table)
>  {
> + bool per_inst, smu_13_0_6_per_inst, smu_13_0_14_per_inst, apu_per_inst;
>   struct smu_table_context *smu_table = &smu->smu_table;
> - struct gpu_metrics_v1_5 *gpu_metrics =
> - (struct gpu_metrics_v1_5 *)smu_table->gpu_metrics_table;
> + struct gpu_metrics_v1_6 *gpu_metrics =
> + (struct gpu_metrics_v1_6 *)smu_table->gpu_metrics_table;
>   bool flag = smu_v13_0_6_is_unified_metrics(smu);
> + int ret = 0, xcc_id, inst, i, j, k, idx;
>   struct amdgpu_device *adev = smu->adev;
> - int ret = 0, xcc_id, inst, i, j;
>   MetricsTableX_t *metrics_x;
>   MetricsTableA_t *metrics_a;
> + struct amdgpu_xcp *xcp;
>   u16 link_width_level;
> + u32 inst_mask;
>  
>   metrics_x = kzalloc(max(sizeof(MetricsTableX_t), 
> sizeof(MetricsTableA_t)), GFP_KERNEL);
>   ret = smu_v13_0_6_get_metrics_table(smu, metrics_x, true);
> @@ -2321,7 +2324,7 @@ static ssize_t smu_v13_0_6_get_gpu_metrics(struct 
> smu_context *smu, void **table
>  
>   metrics_a = (MetricsTableA_t *)metrics_x;
>  
> - smu_cmn_init_soft_gpu_metrics(gpu_metrics, 1, 5);
> + smu_cmn_init_soft_gpu_metrics(gpu_metrics, 1, 6);
>  
>   gpu_metrics->temperature_hotspot =
>   SMUQ10_ROUND(GET_METRIC_FIELD(MaxSocketTemperature, flag));
> @@ -2363,8 +2366,15 @@ static ssize_t smu_v13_0_6_get_gpu_metrics(struct 
> smu_context *smu, void **table
>  
>   gpu_metrics->current_uclk = 
> SMUQ10_ROUND(GET_METRIC_FIELD(UclkFrequency, flag));
>  
> - /* Throttle status is not reported through metrics now */
> - gpu_metrics->throttle_status = 0;
> + /* Total accumulated cycle counter */
> + gpu_metrics->accumulation_counter = 
> GET_METRIC_FIELD(AccumulationCounter, flag);
> +
> + /* Accumulated throttler residencies */
> + gpu_metrics->prochot_residency_acc = 
> GET_METRIC_FIELD(ProchotResidencyAcc, flag);
> + gpu_metrics->ppt_residency_acc = GET_METRIC_FIELD(PptResidencyAcc, 
> flag);
> + gpu_metrics->socket_thm_residency_acc = 
> GET_METRIC_FIELD(SocketThmResidencyAcc, flag);
> + gpu_metrics->vr_thm_residency_acc = GET_METRIC_FIELD(VrThmResidencyAcc, 
> flag);
> + gpu_metrics->hbm_thm_residency_acc = 
> GET_METRIC_FIELD(HbmThmResidencyAcc, flag);
>  
>   /* Clock Lock Status. Each bit corresponds to each GFXCLK instance */
>   gpu_metrics->gfxclk_lock_status = GET_METRIC_FIELD(GfxLockXCDMak, flag) 
> >> GET_INST(GC, 0);
> @@ -2419,19 +2429,51 @@ static ssize_t smu_v13_0_6_get_gpu_metrics(struct 
> smu_context *smu, void **table
>   SMUQ10_ROUND(GET_METRIC_FIELD(XgmiWriteDataSizeAcc, 
> flag)[i]);
>   }
>  
> - for (i = 0; i < adev->jpeg.num_jpeg_inst; ++i) {
> - inst = GET_INST(JPEG, i);
> - for (j = 0; j < adev->jpeg.num_jpeg_rings; ++j) {
> - gpu_metrics->jpeg_activity[(i * 
> adev->jpeg.num_jpeg_rings) + j] =
> - SMUQ10_ROUND(GET_METRIC_FIELD(JpegBusy, flag)
> - [(inst * adev->jpeg.num_jpeg_rings) + j]);
> + gpu_metrics->num_partition = adev->xcp_mgr->num_xcps;
> +
> + apu_per_inst = (adev->flags & AMD_IS_APU) && (smu->smc_fw_version >= 
> 0x04556A00);
> + smu_13_0_6_per_inst = !(adev->flags & AMD_IS_APU) &&
> + (amdgpu_ip_version(smu->adev, MP1_HWIP, 0)
> +  == IP_VERSION(13, 0, 6)) &&
> + (smu->smc_fw_version >= 0x556F00);
> + smu_13_0_14_per_inst = !(adev->flags & AMD_IS_APU) &&
> + (amdgpu_ip_version

Re: [PATCH v2 2/5] drm/amd/pm: Use same metric table for APU

2024-09-13 Thread Lazar, Lijo



On 9/12/2024 5:29 PM, Asad Kamal wrote:
> Use same metric table for APU and Non APU systems
> for smu_v_13_0_6 to get metric data based on newer pmfw
> versions
> 
> v2: Use inline func to check for ubified metrics support
> 

Typo -ubified  => unified

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> Signed-off-by: Asad Kamal 
> ---
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 102 ++
>  1 file changed, 55 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index 9974c9f8135e..ee178914ca53 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -102,6 +102,12 @@ MODULE_FIRMWARE("amdgpu/smu_13_0_14.bin");
>  #define MCA_BANK_IPID(_ip, _hwid, _type) \
>   [AMDGPU_MCA_IP_##_ip] = { .hwid = _hwid, .mcatype = _type, }
>  
> +static inline bool smu_v13_0_6_is_unified_metrics(struct smu_context *smu)
> +{
> + return (smu->adev->flags & AMD_IS_APU) &&
> + smu->smc_fw_version <= 0x4556900;
> +}
> +
>  struct mca_bank_ipid {
>   enum amdgpu_mca_ip ip;
>   uint16_t hwid;
> @@ -253,7 +259,7 @@ struct PPTable_t {
>  #define SMUQ10_TO_UINT(x) ((x) >> 10)
>  #define SMUQ10_FRAC(x) ((x) & 0x3ff)
>  #define SMUQ10_ROUND(x) ((SMUQ10_TO_UINT(x)) + ((SMUQ10_FRAC(x)) >= 0x200))
> -#define GET_METRIC_FIELD(field) ((adev->flags & AMD_IS_APU) ?\
> +#define GET_METRIC_FIELD(field, flag) ((flag) ?\
>   (metrics_a->field) : (metrics_x->field))
>  
>  struct smu_v13_0_6_dpm_map {
> @@ -583,7 +589,7 @@ static int smu_v13_0_6_setup_driver_pptable(struct 
> smu_context *smu)
>   MetricsTableA_t *metrics_a = (MetricsTableA_t 
> *)smu_table->metrics_table;
>   struct PPTable_t *pptable =
>   (struct PPTable_t *)smu_table->driver_pptable;
> - struct amdgpu_device *adev = smu->adev;
> + bool flag = smu_v13_0_6_is_unified_metrics(smu);
>   int ret, i, retry = 100;
>   uint32_t table_version;
>  
> @@ -595,7 +601,7 @@ static int smu_v13_0_6_setup_driver_pptable(struct 
> smu_context *smu)
>   return ret;
>  
>   /* Ensure that metrics have been updated */
> - if (GET_METRIC_FIELD(AccumulationCounter))
> + if (GET_METRIC_FIELD(AccumulationCounter, flag))
>   break;
>  
>   usleep_range(1000, 1100);
> @@ -612,29 +618,29 @@ static int smu_v13_0_6_setup_driver_pptable(struct 
> smu_context *smu)
>   table_version;
>  
>   pptable->MaxSocketPowerLimit =
> - SMUQ10_ROUND(GET_METRIC_FIELD(MaxSocketPowerLimit));
> + SMUQ10_ROUND(GET_METRIC_FIELD(MaxSocketPowerLimit, 
> flag));
>   pptable->MaxGfxclkFrequency =
> - SMUQ10_ROUND(GET_METRIC_FIELD(MaxGfxclkFrequency));
> + SMUQ10_ROUND(GET_METRIC_FIELD(MaxGfxclkFrequency, 
> flag));
>   pptable->MinGfxclkFrequency =
> - SMUQ10_ROUND(GET_METRIC_FIELD(MinGfxclkFrequency));
> + SMUQ10_ROUND(GET_METRIC_FIELD(MinGfxclkFrequency, 
> flag));
>  
>   for (i = 0; i < 4; ++i) {
>   pptable->FclkFrequencyTable[i] =
> - 
> SMUQ10_ROUND(GET_METRIC_FIELD(FclkFrequencyTable)[i]);
> + 
> SMUQ10_ROUND(GET_METRIC_FIELD(FclkFrequencyTable, flag)[i]);
>   pptable->UclkFrequencyTable[i] =
> - 
> SMUQ10_ROUND(GET_METRIC_FIELD(UclkFrequencyTable)[i]);
> + 
> SMUQ10_ROUND(GET_METRIC_FIELD(UclkFrequencyTable, flag)[i]);
>   pptable->SocclkFrequencyTable[i] = SMUQ10_ROUND(
> - GET_METRIC_FIELD(SocclkFrequencyTable)[i]);
> + GET_METRIC_FIELD(SocclkFrequencyTable, 
> flag)[i]);
>   pptable->VclkFrequencyTable[i] =
> - 
> SMUQ10_ROUND(GET_METRIC_FIELD(VclkFrequencyTable)[i]);
> + 
> SMUQ10_ROUND(GET_METRIC_FIELD(VclkFrequencyTable, flag)[i]);
>   pptable->DclkFrequencyTable[i] =
> - 
> SMUQ10_ROUND(GET_METRIC_FIELD(DclkFrequencyTable)[i]);
> + 
> SMUQ10_ROUND(GET_METRIC_FIELD(DclkFrequencyTable, flag)[i]);
>   pptable->LclkFrequencyTable[i] =
> - 
> SMUQ10_ROUND(GET_METRIC_FIELD(LclkFrequencyTable)[i]);
> + 
> SMUQ10_ROUND(GET_METRIC_FIELD(LclkFrequencyTable, flag)[i]);
>   }
>  
>   /* use AID0 serial number by default */
> - pptable->PublicSerialNumber_AID = 
> GET_METRIC_FIELD(PublicSerialNumber_AID)[0];
> + pptable->PublicSerialNumb

Re: [PATCH] drm/amdgpu: Retry i2c transfer once if it fails on SMU13.0.6

2024-09-12 Thread Lazar, Lijo



On 9/12/2024 2:42 AM, Russell, Kent wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Fixed the typo locally.
> 
>> -Original Message-
>> From: Russell, Kent 
>> Sent: Wednesday, September 11, 2024 2:06 PM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Russell, Kent 
>> Subject: [PATCH] drm/amdgpu: Retry i2c transfer once if it fails on SMU13.0.6
>>
>> During init, there can be some collisions on the i2c bus that result in
>> the EEPROM read failing. This has been mitigated in the PMFW to a
>> degree, but there is still a small chance that the bus will be busy.
>> When the read fails during RAS init, that disables page retirement
>> altogether, which is obviously not ideal. To try to avoid that
>> situation, set the eeprom_read function to retry once if the first read
>> fails, specifically for smu_v13_0_6.
>>
>> Signed-off-by: Kent Russell 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

>> ---
>>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 8 ++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> index 9974c9f8135e..65d24c2f7e24 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> @@ -2107,8 +2107,12 @@ static int smu_v13_0_6_i2c_xfer(struct i2c_adapter
>> *i2c_adap,
>>   }
>>   mutex_lock(&adev->pm.mutex);
>>   r = smu_v13_0_6_request_i2c_xfer(smu, req);
>> - if (r)
>> - goto fail;
>> + if (r) {
>> + /* Rrtry once, in case of an i2c collision */
> Rrtry->Retry
>> + r = smu_v13_0_6_request_i2c_xfer(smu, req);
>> + if (r)
>> + goto fail;
>> + }
>>
>>   for (c = i = 0; i < num_msgs; i++) {
>>   if (!(msg[i].flags & I2C_M_RD)) {
>> --
>> 2.34.1
> 


Re: [PATCH] drm/amdgpu/gfx9.4.3: drop extra wrapper

2024-09-10 Thread Lazar, Lijo



On 9/10/2024 8:19 PM, Alex Deucher wrote:
> Drop wrapper used in one place.  gfx_v9_4_3_xcc_cp_enable()
> is used in one place.  gfx_v9_4_3_xcc_cp_compute_enable()
> is used everywhere else.
> 
> Signed-off-by: Alex Deucher 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 408e5600bb61..b940d2ad57db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -2299,12 +2299,6 @@ static int gfx_v9_4_3_cp_resume(struct amdgpu_device 
> *adev)
>   return 0;
>  }
>  
> -static void gfx_v9_4_3_xcc_cp_enable(struct amdgpu_device *adev, bool enable,
> -  int xcc_id)
> -{
> - gfx_v9_4_3_xcc_cp_compute_enable(adev, enable, xcc_id);
> -}
> -
>  static void gfx_v9_4_3_xcc_fini(struct amdgpu_device *adev, int xcc_id)
>  {
>   if (amdgpu_gfx_disable_kcq(adev, xcc_id))
> @@ -2336,7 +2330,7 @@ static void gfx_v9_4_3_xcc_fini(struct amdgpu_device 
> *adev, int xcc_id)
>   }
>  
>   gfx_v9_4_3_xcc_kcq_fini_register(adev, xcc_id);
> - gfx_v9_4_3_xcc_cp_enable(adev, false, xcc_id);
> + gfx_v9_4_3_xcc_cp_compute_enable(adev, false, xcc_id);
>  }
>  
>  static int gfx_v9_4_3_hw_init(void *handle)


RE: [PATCH 2/2] drm/amdgpu: Retry i2c transfer once if it fails

2024-09-10 Thread Lazar, Lijo
[AMD Official Use Only - AMD Internal Distribution Only]

The ideal place is -
smu_v13_0_6_request_i2c_xfer

Restricts the change to the specific SOCs with the collision problem.
Gives a bit better survival chance, with a retry on every chunk requested.
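
For illustration, a rough sketch of that idea - retrying at the
smu_v13_0_6_request_i2c_xfer level so every requested chunk gets a second
chance (the wrapper below is hypothetical and not the patch that was posted;
the request argument type is left generic):

/* Sketch only: retry each chunk once in case of a transient bus collision.
 * smu_v13_0_6_request_i2c_xfer() is the existing SMU13.0.6 helper; the
 * wrapper name and its placement are illustrative.
 */
static int smu_v13_0_6_request_i2c_xfer_retry(struct smu_context *smu,
					      void *req)
{
	int r;

	r = smu_v13_0_6_request_i2c_xfer(smu, req);
	if (r)
		r = smu_v13_0_6_request_i2c_xfer(smu, req);

	return r;
}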

Thanks,
Lijo
-Original Message-
From: amd-gfx  On Behalf Of Kent Russell
Sent: Tuesday, September 10, 2024 7:07 PM
To: amd-gfx@lists.freedesktop.org
Cc: Russell, Kent 
Subject: [PATCH 2/2] drm/amdgpu: Retry i2c transfer once if it fails

During init, there can be some collisions on the i2c bus that result in the 
EEPROM read failing. This has been mitigated in the PMFW to a degree, but there 
is still a small chance that the bus will be busy.
When the read fails during RAS init, that disables page retirement altogether, 
which is obviously not ideal. To try to avoid that situation, set the 
eeprom_read function to retry once if the first read fails.

Signed-off-by: Kent Russell 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c
index 35fee3e8cde2..32755a37dcef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c
@@ -227,8 +227,14 @@ int amdgpu_eeprom_read(struct i2c_adapter *i2c_adap,
   u32 eeprom_addr, u8 *eeprom_buf,
   u32 bytes)
 {
-   return amdgpu_eeprom_xfer(i2c_adap, eeprom_addr, eeprom_buf, bytes,
+   int ret;
+
+   ret = amdgpu_eeprom_xfer(i2c_adap, eeprom_addr, eeprom_buf, bytes,
+ true);
+   if (ret)
+   ret = amdgpu_eeprom_xfer(i2c_adap, eeprom_addr, eeprom_buf, 
bytes,
  true);
+   return ret;
 }

 int amdgpu_eeprom_write(struct i2c_adapter *i2c_adap,
--
2.34.1



Re: [PATCH] drm/amdgpu: disable GPU RAS bad page feature for specific ASIC

2024-09-10 Thread Lazar, Lijo


On second thought, this may be made more generic by just checking the APU
flag - it holds true for any APU in general.
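
For illustration, the more generic form would simply be (sketch, placed in
amdgpu_ras_check_supported as in the v3 patch):

	/* bad page feature is not applicable to APUs in general */
	if (adev->flags & AMD_IS_APU)
		amdgpu_bad_page_threshold = 0;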

Thanks,
Lijo

On 9/10/2024 7:24 PM, Lazar, Lijo wrote:
> 
> 
> On 9/10/2024 2:07 PM, Tao Zhou wrote:
>> The feature is not applicable to specific app platform.
>>
>> v2: update the disablement condition and commit description
>> v3: move the setting to amdgpu_ras_check_supported
>>
>> Signed-off-by: Tao Zhou 
>> Reviewed-by: Hawking Zhang 
> 
> Reviewed-by: Lijo Lazar 
> 
> Thanks,
> Lijo
> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index dbfc41ddc3c7..ebe3e8f01fe2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -3483,6 +3483,11 @@ static void amdgpu_ras_check_supported(struct 
>> amdgpu_device *adev)
>>  
>>  /* aca is disabled by default */
>>  adev->aca.is_enabled = false;
>> +
>> +/* bad page feature is not applicable to specific app platform */
>> +if (adev->gmc.is_app_apu &&
>> +amdgpu_ip_version(adev, UMC_HWIP, 0) == IP_VERSION(12, 0, 0))
>> +amdgpu_bad_page_threshold = 0;
>>  }
>>  
>>  static void amdgpu_ras_counte_dw(struct work_struct *work)


Re: [PATCH] drm/amdgpu: disable GPU RAS bad page feature for specific ASIC

2024-09-10 Thread Lazar, Lijo



On 9/10/2024 2:07 PM, Tao Zhou wrote:
> The feature is not applicable to specific app platform.
> 
> v2: update the disablement condition and commit description
> v3: move the setting to amdgpu_ras_check_supported
> 
> Signed-off-by: Tao Zhou 
> Reviewed-by: Hawking Zhang 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dbfc41ddc3c7..ebe3e8f01fe2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3483,6 +3483,11 @@ static void amdgpu_ras_check_supported(struct 
> amdgpu_device *adev)
>  
>   /* aca is disabled by default */
>   adev->aca.is_enabled = false;
> +
> + /* bad page feature is not applicable to specific app platform */
> + if (adev->gmc.is_app_apu &&
> + amdgpu_ip_version(adev, UMC_HWIP, 0) == IP_VERSION(12, 0, 0))
> + amdgpu_bad_page_threshold = 0;
>  }
>  
>  static void amdgpu_ras_counte_dw(struct work_struct *work)


Re: [PATCH] drm/amdgpu: Retire un-used write in JPEG v4.0.3

2024-09-09 Thread Lazar, Lijo



On 9/10/2024 10:45 AM, Jane Jian wrote:
> write OP of HDP_DEBUG1(0x3fbc) is no longer functional, so remove it.
> 

You may copy the title/description from the one I shared -

Subj: Remove unneeded write in JPEG v4.0.3

Desc:

HDP_DEBUG1(offset = 0x3fbc) is no longer functional, remove the
redundant write.

> Signed-off-by: Jane Jian 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c 
> b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> index 86958cb2c2ab..eafd8bcf2870 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> @@ -743,14 +743,6 @@ void jpeg_v4_0_3_dec_ring_emit_fence(struct amdgpu_ring 
> *ring, u64 addr, u64 seq
>   amdgpu_ring_write(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6));
>   amdgpu_ring_write(ring, 0);
>  
> - amdgpu_ring_write(ring, 
> PACKETJ(regUVD_JRBC_EXTERNAL_REG_INTERNAL_OFFSET,
> - 0, 0, PACKETJ_TYPE0));
> - amdgpu_ring_write(ring, 0x3fbc);
> -
> - amdgpu_ring_write(ring, PACKETJ(JRBC_DEC_EXTERNAL_REG_WRITE_ADDR,
> - 0, 0, PACKETJ_TYPE0));
> - amdgpu_ring_write(ring, 0x1);
> -
>   amdgpu_ring_write(ring, PACKETJ(0, 0, 0, PACKETJ_TYPE6));
>   amdgpu_ring_write(ring, 0);
>  
> @@ -1088,7 +1080,7 @@ static const struct amdgpu_ring_funcs 
> jpeg_v4_0_3_dec_ring_vm_funcs = {
>   SOC15_FLUSH_GPU_TLB_NUM_WREG * 6 +
>   SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 8 +
>   8 + /* jpeg_v4_0_3_dec_ring_emit_vm_flush */
> - 22 + 22 + /* jpeg_v4_0_3_dec_ring_emit_fence x2 vm fence */
> + 18 + 18 + /* jpeg_v4_0_3_dec_ring_emit_fence x2 vm fence */
>   8 + 16,
>   .emit_ib_size = 22, /* jpeg_v4_0_3_dec_ring_emit_ib */
>   .emit_ib = jpeg_v4_0_3_dec_ring_emit_ib,


Re: [PATCH] drm/amdgpu: disable GPU RAS bad page feature for specific ASIC

2024-09-09 Thread Lazar, Lijo



On 9/10/2024 9:29 AM, Tao Zhou wrote:
> The feature is not applicable to specific app platform.
> 
> v2: update the disablement condition and commit description
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dbfc41ddc3c7..08efc9121adc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2055,6 +2055,11 @@ static int amdgpu_ras_fs_init(struct amdgpu_device 
> *adev)
>   con->event_state_attr = dev_attr_event_state;
>   sysfs_attr_init(attrs[3]);
>  
> + /* bad page feature is not applicable to specific app platform */
> + if (adev->gmc.is_app_apu &&
> + amdgpu_ip_version(adev, UMC_HWIP, 0) == IP_VERSION(12, 0, 0))
> + amdgpu_bad_page_threshold = 0;

I think sysfs file creation is not the right place to do this. It should
probably be done much earlier, at the place where it is decided which
features are supported for the SOC.

Thanks,
Lijo

> +
>   if (amdgpu_bad_page_threshold != 0) {
>   /* add bad_page_features entry */
>   bin_attr_gpu_vram_bad_pages.private = NULL;


Re: [PATCH] drm/amdkfd: Select reset method for poison handling

2024-09-06 Thread Lazar, Lijo



On 9/6/2024 1:42 PM, Hawking Zhang wrote:
> Driver mode-2 is only supported by relatively new
> smc firmware.
> 
> Signed-off-by: Hawking Zhang 
> ---
>  .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 40 +++
>  1 file changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index fecdbbab9894..d46a13156ee9 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -167,11 +167,23 @@ static void 
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
>   block = AMDGPU_RAS_BLOCK__GFX;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 
> 4, 3) ||
> - amdgpu_ip_version(dev->adev, GC_HWIP, 0) == 
> IP_VERSION(9, 4, 4))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 
> 4, 3)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x00557300 and onwards */
> + if (dev->adev->pm.fw_version < 0x00557300)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == 
> IP_VERSION(9, 4, 4)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x05550C00 and onwards */
> + if (dev->adev->pm.fw_version < 0x05550C00)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }

I think it's better to handle this inside amdgpu_ras_do_recovery rather
than here.

Something like -
int amdgpu_ras_reset_method_quirk(adev) which returns the right reset
method when (ras->gpu_reset_flags & AMDGPU_RAS_GPU_RESET_MODE2_RESET) is
set. Or add a few more flags like RAS_SDMA_POISON/RAS_GFX_POISON and
decide the method in amdgpu_ras handling.
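
For illustration, such a helper could look roughly like this (sketch only;
the function does not exist in the driver, and the firmware version checks
simply mirror the ones quoted in the patch for the GFX poison case):

static int amdgpu_ras_reset_method_quirk(struct amdgpu_device *adev)
{
	/* Driver mode-2 for poison consumption needs new enough PMFW */
	if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) &&
	    adev->pm.fw_version < 0x00557300)
		return AMDGPU_RAS_GPU_RESET_MODE1_RESET;

	if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4) &&
	    adev->pm.fw_version < 0x05550C00)
		return AMDGPU_RAS_GPU_RESET_MODE1_RESET;

	return AMDGPU_RAS_GPU_RESET_MODE2_RESET;
}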

Thanks,
Lijo


>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> @@ -184,11 +196,23 @@ static void 
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SDMA3:
>   case SOC15_IH_CLIENTID_SDMA4:
>   block = AMDGPU_RAS_BLOCK__SDMA;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 
> 4, 3) ||
> - amdgpu_ip_version(dev->adev, GC_HWIP, 0) == 
> IP_VERSION(9, 4, 4))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, SDMA0_HWIP, 0) == 
> IP_VERSION(4, 4, 2)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x00557300 and onwards */
> + if (dev->adev->pm.fw_version < 0x00557300)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else if (amdgpu_ip_version(dev->adev, SDMA0_HWIP, 0) == 
> IP_VERSION(4, 4, 5)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x05550C00 and onwards */
> + if (dev->adev->pm.fw_version < 0x05550C00)
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
>   break;
>   default:
>   dev_warn(dev->adev->dev,


Re: [PATCH 3/3] drm/amdgpu/gfx9: Refactor cleaner shader initialization for GFX9.4.3

2024-09-04 Thread Lazar, Lijo



On 9/4/2024 6:57 PM, Srinivasan Shanmugam wrote:
> This commit modifies the initialization only if the cleaner shader
> object has been allocated. This is done by adding checks for
> adev->gfx.cleaner_shader_obj before calling
> amdgpu_gfx_cleaner_shader_init
> 
> The changes are made in the gfx_v9_4_3_sw_init, gfx_v9_4_3_sw_fini, and
> gfx_v9_4_3_hw_init functions. These functions are responsible for
> initializing software components of the GFX v9.4.3 engines.
> 
> This change prevents unnecessary function calls and makes the control
> flow of the program clearer. It also ensures that the cleaner shader is
> only initialized when it has been properly allocated.
> 
> Fixes: 1b66421d29b7 ("drm/amdgpu/gfx9: Implement cleaner shader support for 
> GFX9.4.3 hardware")
> Cc: Christian König 
> Cc: Alex Deucher 
> Suggested-by: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 408e5600bb61..abf934863421 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -1061,10 +1061,12 @@ static int gfx_v9_4_3_sw_init(void *handle)
>   adev->gfx.cleaner_shader_size = 
> sizeof(gfx_9_4_3_cleaner_shader_hex);
>   if (adev->gfx.mec_fw_version >= 153) {

It's better to bring the assignment of the shader binary/size inside this one

>   adev->gfx.enable_cleaner_shader = true;
> - r = amdgpu_gfx_cleaner_shader_sw_init(adev, 
> adev->gfx.cleaner_shader_size);
> - if (r) {
> - adev->gfx.enable_cleaner_shader = false;
> - dev_err(adev->dev, "Failed to initialize 
> cleaner shader\n");
> + if (adev->gfx.cleaner_shader_obj) {

Keep this outside and check if a valid shader binary is available (size
!= 0). If so, do a sw_init.

> + r = amdgpu_gfx_cleaner_shader_sw_init(adev);
> + if (r) {
> + adev->gfx.enable_cleaner_shader = false;

Move this state change inside amdgpu_gfx_cleaner_shader_sw_init such
that the cleaner shader API keeps the state - the cleaner shader is enabled
if a valid CPU and GPU pointer is available.

Any further calls to the cleaner shader API - e.g.
amdgpu_gfx_cleaner_shader_init() - just need to check this state and
take action. Basically, all the checks may be moved inside the cleaner
shader API rather than implementing them at every client interface.
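
For illustration, the suggested ownership could look roughly like this
(sketch only; field names beyond cleaner_shader_size/obj, as well as the BO
domain and alignment, are assumptions rather than the actual implementation):

int amdgpu_gfx_cleaner_shader_sw_init(struct amdgpu_device *adev)
{
	int r;

	/* No valid shader binary selected for this IP version/firmware */
	if (!adev->gfx.cleaner_shader_size)
		return 0;

	r = amdgpu_bo_create_kernel(adev, adev->gfx.cleaner_shader_size,
				    PAGE_SIZE, AMDGPU_GEM_DOMAIN_GTT,
				    &adev->gfx.cleaner_shader_obj,
				    &adev->gfx.cleaner_shader_gpu_addr,
				    (void **)&adev->gfx.cleaner_shader_cpu_ptr);

	/* The API owns the state: enabled only when init fully succeeded */
	adev->gfx.enable_cleaner_shader = !r;

	return r;
}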

Thanks,
Lijo

> + dev_err(adev->dev, "Failed to 
> initialize cleaner shader\n");
> + }
>   }
>   }
>   break;
> @@ -1196,7 +1198,8 @@ static int gfx_v9_4_3_sw_fini(void *handle)
>   amdgpu_gfx_kiq_fini(adev, i);
>   }
>  
> - amdgpu_gfx_cleaner_shader_sw_fini(adev);
> + if (adev->gfx.cleaner_shader_obj)
> + amdgpu_gfx_cleaner_shader_sw_fini(adev);
>  
>   gfx_v9_4_3_mec_fini(adev);
>   amdgpu_bo_unref(&adev->gfx.rlc.clear_state_obj);
> @@ -2344,8 +2347,8 @@ static int gfx_v9_4_3_hw_init(void *handle)
>   int r;
>   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
> - amdgpu_gfx_cleaner_shader_init(adev, adev->gfx.cleaner_shader_size,
> -adev->gfx.cleaner_shader_ptr);
> + if (adev->gfx.cleaner_shader_obj)
> + amdgpu_gfx_cleaner_shader_init(adev);
>  
>   if (!amdgpu_sriov_vf(adev))
>   gfx_v9_4_3_init_golden_registers(adev);


Re: [PATCH v2] drm/amdgpu: fix a call trace when unload amdgpu driver

2024-09-04 Thread Lazar, Lijo



On 9/4/2024 8:08 PM, Philip Yang wrote:
> 
> On 2024-09-04 04:04, Asher Song wrote:
>> In some APUs, the bo type of GART page table is ttm_bo_type_sg.
>> BOs of that type are released by bo->delayed_delete, which is added to
>> ttm_device->wq, not released immediately.
>>
>> To make sure all the ttm_resources are released before the ttm_resource_manager
>> is finalized, drain the workqueue in ttm_device.
>>
>> v2: move drain_workqueue to amdgpu_ttm.c
>>
>> Fixes: d99fbd9aab62 ("drm/ttm: Always take the bo delayed cleanup path for 
>> imported bos")
>> Suggested-by: Christian König 
>> Signed-off-by: Asher Song 
> 
> Acked-by: Philip Yang 
> 
> Most likely this will fix another bug caused by race condition b/w GPU mode 1 
> reset and delayed bo cleanup worker.
> 

Unfortunately this won't - sw_fini doesn't get called during a reset.

Thanks,
Lijo

> Thank you.
> Philip
> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 5c938ff0bf48..cbac21df5c47 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -2461,6 +2461,7 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>>  drm_dev_exit(idx);
>>  }
>>  
>> +drain_workqueue(adev->mman.bdev.wq);
>>  amdgpu_direct_gma_fini(adev);
>>  amdgpu_vram_mgr_fini(adev);
>>  amdgpu_gtt_mgr_fini(adev);


Re: [PATCH v2] drm/amdgpu: fix a call trace when unload amdgpu driver

2024-09-04 Thread Lazar, Lijo



On 9/4/2024 1:34 PM, Asher Song wrote:
> In some APUs, the bo type of GART page table is ttm_bo_type_sg.
> BOs of that type are released by bo->delayed_delete, which is added to
> ttm_device->wq, not released immediately.
> 
> To make sure all the ttm_resources are released before the ttm_resource_manager
> is finalized, drain the workqueue in ttm_device.
> 
> v2: move drain_workqueue to amdgpu_ttm.c
> 
> Fixes: d99fbd9aab62 ("drm/ttm: Always take the bo delayed cleanup path for 
> imported bos")
> Suggested-by: Christian König 
> Signed-off-by: Asher Song 

Acked-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 5c938ff0bf48..cbac21df5c47 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -2461,6 +2461,7 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>   drm_dev_exit(idx);
>   }
>  
> + drain_workqueue(adev->mman.bdev.wq);
>   amdgpu_direct_gma_fini(adev);
>   amdgpu_vram_mgr_fini(adev);
>   amdgpu_gtt_mgr_fini(adev);


Re: [PATCH 02/10] drm/amdgpu: Use init level for pending_reset flag

2024-09-03 Thread Lazar, Lijo



On 9/4/2024 7:40 AM, Xu, Feifei wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Comment inline.
> 
> Thanks,
> Feifei
> 
> -Original Message-
> From: amd-gfx  On Behalf Of Lijo Lazar
> Sent: Monday, September 2, 2024 3:34 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Deucher, Alexander 
> ; Koenig, Christian 
> Subject: [PATCH 02/10] drm/amdgpu: Use init level for pending_reset flag
> 
> Drop pending_reset flag in gmc block. Instead use init level to determine 
> which type of init is preferred - in this case MINIMAL.
> 
> Signed-off-by: Lijo Lazar 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 33 ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  1 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h   |  1 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c  |  6 ++--
>  .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c|  3 +-
>  6 files changed, 13 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4fb09c4fbf22..db5046e8b10d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1691,7 +1691,7 @@ bool amdgpu_device_need_post(struct amdgpu_device *adev)
> }
> 
> /* Don't post if we need to reset whole hive on init */
> -   if (adev->gmc.xgmi.pending_reset)
> +   if (adev->init_lvl->level == AMDGPU_INIT_LEVEL_MINIMAL)
> return false;
> 
> if (adev->has_hw_reset) {
> @@ -2985,7 +2985,7 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
> *adev)
> amdgpu_ttm_set_buffer_funcs_status(adev, true);
> 
> /* Don't init kfd if whole hive need to be reset during init */
> -   if (!adev->gmc.xgmi.pending_reset) {
> +   if (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL) {
> kgd2kfd_init_zone_device(adev);
> amdgpu_amdkfd_device_init(adev);
> }
> @@ -3499,14 +3499,9 @@ static int amdgpu_device_ip_suspend_phase2(struct 
> amdgpu_device *adev)
> }
> 
> /* skip unnecessary suspend if we do not initialize them yet 
> */
> -   if (adev->gmc.xgmi.pending_reset &&
> -   !(adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_GMC ||
> - adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_SMC ||
> - adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_COMMON ||
> - adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_IH)) {
> -   adev->ip_blocks[i].status.hw = false;
> +   if (!amdgpu_ip_member_of_hwini(
> +   adev, adev->ip_blocks[i].version->type))
> continue;
> -   }
> 
> [Feifei]:  AMDGPU_INIT_LEVEL_MINIMAL indicates the minimal necessary blocks 
> which need to do hw_init if the SMC needs to handle the mode1 reset. Though in 
> newer ASICs it is the SMC doing the reset, in some older ones it is MP0.
>Would it be more readable if we used a name like 
> AMDGPU_INIT_LEVEL_MINIMAL_SMC to avoid confusion?

The original intention for the levels is -

Define a single 'minimal' level of init required for the SOC. Further
levels like suspend, s0i3, emulation/simulation etc. may be introduced
later, which define the level of initialization required for those
scenarios. Basically, the idea was to make it SOC specific with a callback.

It is kept this way now as the immediate purpose is to support the 'minimal'
init required for the XGMI-reset-on-init scenario for limited SOCs. In that
sense, this could be renamed that way as well.
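
For illustration only, the idea can be pictured as a small per-SOC level
descriptor (the struct and field names below are a sketch, not the exact
definitions from the series):

enum amdgpu_init_lvl_id {
	AMDGPU_INIT_LEVEL_DEFAULT,
	AMDGPU_INIT_LEVEL_MINIMAL,	/* e.g. XGMI reset-on-init */
};

struct amdgpu_init_level {
	enum amdgpu_init_lvl_id level;
	/* mask of IP block types that take part in hw init at this level */
	uint32_t hwini_ip_block_mask;
};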

> 
> 
> /* skip suspend of gfx/mes and psp for S0ix
>  * gfx is in gfxoff state, so on resume it will exit gfxoff 
> just @@ -4320,20 +4315,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
> if (adev->gmc.xgmi.num_physical_nodes) {
> dev_info(adev->dev, "Pending hive reset.\n");
> -   adev->gmc.xgmi.pending_reset = true;
> -   /* Only need to init necessary block for SMU to 
> handle the reset */
> -   for (i = 0; i < adev->num_ip_blocks; i++) {
> -   if (!adev->ip_blocks[i].status.valid)
> -   continue;
> -   if (!(adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_GMC ||
> - adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_COMMON ||
> - adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_IH ||
> - adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_T

Re: [PATCH] drm/amdgpu: fix a call trace when unload amdgpu driver

2024-09-03 Thread Lazar, Lijo



On 9/3/2024 6:01 PM, Asher Song wrote:
> In some APUs, the bo type of GART page table is ttm_bo_type_sg.
> BOs of that type are released by bo->delayed_delete, which is added to
> ttm_device->wq, not released immediately.
> 
> To make sure all the ttm_resources are released before the ttm_resource_manager
> is finalized, drain the workqueue in ttm_device.
> 
> Fixes: d99fbd9aab62 ("drm/ttm: Always take the bo delayed cleanup path for 
> imported bos")
> Acked-by: Christian König 
> Signed-off-by: Asher Song 

Acked-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 0a5c8d97787a..99017e426618 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -2216,6 +2216,7 @@ static int gmc_v9_0_sw_fini(void *handle)
>   if (!adev->gmc.real_vram_size) {
>   dev_info(adev->dev, "Put GART in system memory for APU free\n");
>   amdgpu_gart_table_ram_free(adev);
> + drain_workqueue(adev->mman.bdev.wq);
>   } else {
>   amdgpu_gart_table_vram_free(adev);
>   }


Re: [PATCH v5] drm/amdgpu/gfx9.4.3: Implement compute pipe reset

2024-08-29 Thread Lazar, Lijo



On 8/29/2024 9:17 AM, Prike Liang wrote:
> Implement the compute pipe reset, and the driver will
> fallback to pipe reset when queue reset fails.
> The pipe reset only deactivates the queue which is
> scheduled in the pipe, and meanwhile the MEC engine
> will be reset to the firmware _start pointer. So,

You may refine this to indicate that the reset to _start is for the specific
pipe and not applicable to the whole MEC engine.

> it seems pipe reset will cost more cycles than the
> queue reset; therefore, the driver tries to recover
> by doing queue reset first.
> 
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 127 
>  1 file changed, 108 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 2067f26d3a9d..26ae62d2a752 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -3466,6 +3466,98 @@ static void gfx_v9_4_3_emit_wave_limit(struct 
> amdgpu_ring *ring, bool enable)
>   }
>  }
>  
> +static int gfx_v9_4_3_unmap_done(struct amdgpu_device *adev, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id)
> +{
> + int i, r;
> + /* make sure dequeue is complete*/
> + gfx_v9_4_3_xcc_set_safe_mode(adev, xcc_id);
> + mutex_lock(&adev->srbm_mutex);
> + soc15_grbm_select(adev, me, pipe, queue, 0, GET_INST(GC, xcc_id));
> + for (i = 0; i < adev->usec_timeout; i++) {
> + if (!(RREG32_SOC15(GC, GET_INST(GC, xcc_id), regCP_HQD_ACTIVE) 
> & 1))
> + break;
> + udelay(1);
> + }
> + if (i >= adev->usec_timeout)
> + r = -ETIMEDOUT;
> + else
> + r = 0;
> + soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
> + mutex_unlock(&adev->srbm_mutex);
> + gfx_v9_4_3_xcc_unset_safe_mode(adev, xcc_id);
> +
> + return r;
> +
> +}
> +
> +static bool gfx_v9_4_3_pipe_reset_support(struct amdgpu_device *adev)
> +{
> + /*TODO: Need check gfx9.4.4 mec fw whether supports pipe reset as 
> well.*/
> + if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) &&
> + adev->gfx.mec_fw_version >= 0x009b)
> + return true;
> + else
> + dev_warn_once(adev->dev, "Please use the latest MEC version to 
> see whether support pipe reset\n");
> +
> + return false;
> +}
> +
> +static int gfx_v9_4_3_kiq_reset_hw_pipe(struct amdgpu_ring *ring)

Please drop the kiq name in this function to avoid confusion. It's not
restricted to kiq.

With those

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> +{
> + struct amdgpu_device *adev = ring->adev;
> + uint32_t reset_pipe, clean_pipe;
> + int r;
> +
> + if (!gfx_v9_4_3_pipe_reset_support(adev))
> + return -EINVAL;
> +
> + gfx_v9_4_3_xcc_set_safe_mode(adev, ring->xcc_id);
> + mutex_lock(&adev->srbm_mutex);
> +
> + reset_pipe = RREG32_SOC15(GC, GET_INST(GC, ring->xcc_id), 
> regCP_MEC_CNTL);
> + clean_pipe = reset_pipe;
> +
> + if (ring->me == 1) {
> + switch (ring->pipe) {
> + case 0:
> + reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
> +MEC_ME1_PIPE0_RESET, 1);
> + break;
> + case 1:
> + reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
> +MEC_ME1_PIPE1_RESET, 1);
> + break;
> + case 2:
> + reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
> +MEC_ME1_PIPE2_RESET, 1);
> + break;
> + case 3:
> + reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
> +MEC_ME1_PIPE3_RESET, 1);
> + break;
> + default:
> + break;
> + }
> + } else {
> + if (ring->pipe)
> + reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
> +MEC_ME2_PIPE1_RESET, 1);
> + else
> + reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
> +MEC_ME2_PIPE0_RESET, 1);
> + }
> +
> + WREG32_SOC15(GC, GET_INST(GC, ring->xcc_id), regCP_MEC_CNTL, 
> reset_pipe);
> + WREG32_SOC15(GC, GET_INST(GC, ring->xcc_id), regCP_MEC_CNTL, 
> clean_pipe);
> + mutex_unlock(&adev->srbm_mutex);
> + gfx_v9_4_3_xcc_unset_safe_mode(adev, ring->xcc_id);
> +
> + r = gfx_v9_4_3_unmap_done(adev, ring->me, ring->pipe, ring->queue, 
> ring->xcc_id);
> + return r;
> +}
> +
>  static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
>   

Re: [PATCH v4] drm/amdgpu/gfx9.4.3: Implement compute pipe reset

2024-08-27 Thread Lazar, Lijo



On 8/28/2024 10:50 AM, Prike Liang wrote:
> Implement the compute pipe reset, and the driver will
> fallback to pipe reset when queue reset fails.
> The pipe reset only deactivates the queue which is
> scheduled in the pipe, and meanwhile the MEC engine
> will be reset to the firmware _start pointer. So,
> it seems pipe reset will cost more cycles than the
> queue reset; therefore, the driver tries to recover
> by doing queue reset first.
> 
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |   5 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 139 
>  2 files changed, 124 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index e28c1ebfa98f..d4d74ba2bc27 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -143,6 +143,11 @@ struct kiq_pm4_funcs {
>  uint32_t queue_type, uint32_t me_id,
>  uint32_t pipe_id, uint32_t queue_id,
>  uint32_t xcc_id, uint32_t vmid);
> + int (*kiq_reset_hw_pipe)(struct amdgpu_ring *kiq_ring,
> +uint32_t queue_type, uint32_t me,
> +uint32_t pipe, uint32_t queue,
> +uint32_t xcc_id);

Missed the addition of this callback in the earlier review.

The implementation below -
Doesn't use kiq to do a pipe reset. It looks like a direct hardware
reset. Passing a kiq_ring here or defining a callback in kiq functions
doesn't look required unless a pipe reset through kiq is available for
other hardware generations.

Also, it uses pipe reset as a fallback when queue unmap fails, so the
callback is eventually not used.

Is this really needed? For the below implementation, it seems a private
function like gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring) is good
enough.
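
For illustration, something along these lines (a sketch of the suggested
private helper; me/pipe/queue/xcc_id are all derivable from the ring, and
the CP_MEC_CNTL programming from the patch is elided here):

static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
{
	struct amdgpu_device *adev = ring->adev;

	if (!gfx_v9_4_3_pipe_reset_support(adev))
		return -EINVAL;

	/* program the MEC_ME*_PIPE*_RESET bits in regCP_MEC_CNTL here,
	 * exactly as in the patch, then wait for the queue to deactivate
	 */
	return gfx_v9_4_3_unmap_done(adev, ring->me, ring->pipe,
				     ring->queue, ring->xcc_id);
}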

Thanks,
Lijo

> +
>   /* Packet sizes */
>   int set_resources_size;
>   int map_queues_size;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 2067f26d3a9d..f47b55d6f673 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -166,6 +166,10 @@ static int gfx_v9_4_3_get_cu_info(struct amdgpu_device 
> *adev,
>   struct amdgpu_cu_info *cu_info);
>  static void gfx_v9_4_3_xcc_set_safe_mode(struct amdgpu_device *adev, int 
> xcc_id);
>  static void gfx_v9_4_3_xcc_unset_safe_mode(struct amdgpu_device *adev, int 
> xcc_id);
> +static int gfx_v9_4_3_kiq_reset_hw_pipe(struct amdgpu_ring *kiq_ring,
> + uint32_t queue_type, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id);
>  
>  static void gfx_v9_4_3_kiq_set_resources(struct amdgpu_ring *kiq_ring,
>   uint64_t queue_mask)
> @@ -323,6 +327,7 @@ static const struct kiq_pm4_funcs 
> gfx_v9_4_3_kiq_pm4_funcs = {
>   .kiq_query_status = gfx_v9_4_3_kiq_query_status,
>   .kiq_invalidate_tlbs = gfx_v9_4_3_kiq_invalidate_tlbs,
>   .kiq_reset_hw_queue = gfx_v9_4_3_kiq_reset_hw_queue,
> + .kiq_reset_hw_pipe = gfx_v9_4_3_kiq_reset_hw_pipe,
>   .set_resources_size = 8,
>   .map_queues_size = 7,
>   .unmap_queues_size = 6,
> @@ -3466,6 +3471,101 @@ static void gfx_v9_4_3_emit_wave_limit(struct 
> amdgpu_ring *ring, bool enable)
>   }
>  }
>  
> +static int gfx_v9_4_3_unmap_done(struct amdgpu_device *adev, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id)
> +{
> + int i, r;
> + /* make sure dequeue is complete*/
> + gfx_v9_4_3_xcc_set_safe_mode(adev, xcc_id);
> + mutex_lock(&adev->srbm_mutex);
> + soc15_grbm_select(adev, me, pipe, queue, 0, GET_INST(GC, xcc_id));
> + for (i = 0; i < adev->usec_timeout; i++) {
> + if (!(RREG32_SOC15(GC, GET_INST(GC, xcc_id), regCP_HQD_ACTIVE) 
> & 1))
> + break;
> + udelay(1);
> + }
> + if (i >= adev->usec_timeout)
> + r = -ETIMEDOUT;
> + else
> + r = 0;
> + soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
> + mutex_unlock(&adev->srbm_mutex);
> + gfx_v9_4_3_xcc_unset_safe_mode(adev, xcc_id);
> +
> + return r;
> +
> +}
> +
> +static bool gfx_v9_4_3_pipe_reset_support(struct amdgpu_device *adev)
> +{
> + /*TODO: Need check gfx9.4.4 mec fw whether supports pipe reset as 
> well.*/
> + if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) &&
> + adev->gfx.mec_fw_version >= 0x009b)
> + return true;
> + else
> + dev_warn_once(adev->dev, "Please use the latest MEC version to 

Re: [PATCH 3/3] drm/amdgpu: nuke the VM PD/PT shadow handling

2024-08-27 Thread Lazar, Lijo



On 8/27/2024 7:42 PM, Christian König wrote:
> This was only used as workaround for recovering the page tables after
> VRAM was lost and is no longer necessary after the function
> amdgpu_vm_bo_reset_state_machine() started to do the same.
> 
> Compute never used shadows either, so the only problematic case left is
> SVM and that is most likely not recoverable in any way when VRAM is
> lost.
> 
> Signed-off-by: Christian König 

This patch works fine on GC 9.4.3 SOCs.

Acked-by: Lijo Lazar 

Alex or someone else may take a closer look.

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h |  4 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 87 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  | 67 +---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h  | 21 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 17 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c   | 56 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 19 +
>  7 files changed, 6 insertions(+), 265 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e8c284aea1f2..e2cf77a93a0f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1082,10 +1082,6 @@ struct amdgpu_device {
>  
>   struct amdgpu_virt  virt;
>  
> - /* link all shadow bo */
> - struct list_headshadow_list;
> - struct mutexshadow_list_lock;
> -
>   /* record hw reset is performed */
>   bool has_hw_reset;
>   u8  reset_magic[AMDGPU_RESET_MAGIC_NUM];
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index da06705f0026..33a939571f89 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4107,9 +4107,6 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   spin_lock_init(&adev->mm_stats.lock);
>   spin_lock_init(&adev->wb.lock);
>  
> - INIT_LIST_HEAD(&adev->shadow_list);
> - mutex_init(&adev->shadow_list_lock);
> -
>   INIT_LIST_HEAD(&adev->reset_list);
>  
>   INIT_LIST_HEAD(&adev->ras_list);
> @@ -5029,80 +5026,6 @@ static int amdgpu_device_ip_post_soft_reset(struct 
> amdgpu_device *adev)
>   return 0;
>  }
>  
> -/**
> - * amdgpu_device_recover_vram - Recover some VRAM contents
> - *
> - * @adev: amdgpu_device pointer
> - *
> - * Restores the contents of VRAM buffers from the shadows in GTT.  Used to
> - * restore things like GPUVM page tables after a GPU reset where
> - * the contents of VRAM might be lost.
> - *
> - * Returns:
> - * 0 on success, negative error code on failure.
> - */
> -static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
> -{
> - struct dma_fence *fence = NULL, *next = NULL;
> - struct amdgpu_bo *shadow;
> - struct amdgpu_bo_vm *vmbo;
> - long r = 1, tmo;
> -
> - if (amdgpu_sriov_runtime(adev))
> - tmo = msecs_to_jiffies(8000);
> - else
> - tmo = msecs_to_jiffies(100);
> -
> - dev_info(adev->dev, "recover vram bo from shadow start\n");
> - mutex_lock(&adev->shadow_list_lock);
> - list_for_each_entry(vmbo, &adev->shadow_list, shadow_list) {
> - /* If vm is compute context or adev is APU, shadow will be NULL 
> */
> - if (!vmbo->shadow)
> - continue;
> - shadow = vmbo->shadow;
> -
> - /* No need to recover an evicted BO */
> - if (!shadow->tbo.resource ||
> - shadow->tbo.resource->mem_type != TTM_PL_TT ||
> - shadow->tbo.resource->start == AMDGPU_BO_INVALID_OFFSET ||
> - shadow->parent->tbo.resource->mem_type != TTM_PL_VRAM)
> - continue;
> -
> - r = amdgpu_bo_restore_shadow(shadow, &next);
> - if (r)
> - break;
> -
> - if (fence) {
> - tmo = dma_fence_wait_timeout(fence, false, tmo);
> - dma_fence_put(fence);
> - fence = next;
> - if (tmo == 0) {
> - r = -ETIMEDOUT;
> - break;
> - } else if (tmo < 0) {
> - r = tmo;
> - break;
> - }
> - } else {
> - fence = next;
> - }
> - }
> - mutex_unlock(&adev->shadow_list_lock);
> -
> - if (fence)
> - tmo = dma_fence_wait_timeout(fence, false, tmo);
> - dma_fence_put(fence);
> -
> - if (r < 0 || tmo <= 0) {
> - dev_err(adev->dev, "recover vram bo from shadow failed, r is 
> %ld, tmo is %ld\n", r, tmo);
> - return -EIO;
> - }
> -
> - dev_info(adev->dev, "recover vram bo from shadow done\n");
> - retur

Re: [PATCH v3] drm/amdgpu/gfx9.4.3: Implement compute pipe reset

2024-08-27 Thread Lazar, Lijo



On 8/22/2024 3:08 PM, Prike Liang wrote:
> Implement the compute pipe reset and driver will
> fallback to pipe reset when queue reset failed.
> 
> Signed-off-by: Prike Liang 
> ---
> v3: Use the dev log and filter out the gfx9.4.4 pipe reset support.
> v2: Convert the GC logic instance to physical instance in the
> register accessing process and use the dev_* print to specify
> the failed device.
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |   5 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 154 +---
>  2 files changed, 139 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index e28c1ebfa98f..d4d74ba2bc27 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -143,6 +143,11 @@ struct kiq_pm4_funcs {
>  uint32_t queue_type, uint32_t me_id,
>  uint32_t pipe_id, uint32_t queue_id,
>  uint32_t xcc_id, uint32_t vmid);
> + int (*kiq_reset_hw_pipe)(struct amdgpu_ring *kiq_ring,
> +uint32_t queue_type, uint32_t me,
> +uint32_t pipe, uint32_t queue,
> +uint32_t xcc_id);
> +
>   /* Packet sizes */
>   int set_resources_size;
>   int map_queues_size;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 2067f26d3a9d..aa0c76eed452 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -166,6 +166,10 @@ static int gfx_v9_4_3_get_cu_info(struct amdgpu_device 
> *adev,
>   struct amdgpu_cu_info *cu_info);
>  static void gfx_v9_4_3_xcc_set_safe_mode(struct amdgpu_device *adev, int 
> xcc_id);
>  static void gfx_v9_4_3_xcc_unset_safe_mode(struct amdgpu_device *adev, int 
> xcc_id);
> +static int gfx_v9_4_3_kiq_reset_hw_pipe(struct amdgpu_ring *kiq_ring,
> + uint32_t queue_type, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id);
>  
>  static void gfx_v9_4_3_kiq_set_resources(struct amdgpu_ring *kiq_ring,
>   uint64_t queue_mask)
> @@ -323,6 +327,7 @@ static const struct kiq_pm4_funcs 
> gfx_v9_4_3_kiq_pm4_funcs = {
>   .kiq_query_status = gfx_v9_4_3_kiq_query_status,
>   .kiq_invalidate_tlbs = gfx_v9_4_3_kiq_invalidate_tlbs,
>   .kiq_reset_hw_queue = gfx_v9_4_3_kiq_reset_hw_queue,
> + .kiq_reset_hw_pipe = gfx_v9_4_3_kiq_reset_hw_pipe,
>   .set_resources_size = 8,
>   .map_queues_size = 7,
>   .unmap_queues_size = 6,
> @@ -3466,6 +3471,116 @@ static void gfx_v9_4_3_emit_wave_limit(struct 
> amdgpu_ring *ring, bool enable)
>   }
>  }
>  
> +static int gfx_v9_4_3_unmap_done(struct amdgpu_device *adev, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id)
> +{
> + int i, r;
> + /* make sure dequeue is complete*/
> + gfx_v9_4_3_xcc_set_safe_mode(adev, xcc_id);
> + mutex_lock(&adev->srbm_mutex);
> + soc15_grbm_select(adev, me, pipe, queue, 0, GET_INST(GC, xcc_id));
> + for (i = 0; i < adev->usec_timeout; i++) {
> + if (!(RREG32_SOC15(GC, GET_INST(GC, xcc_id), regCP_HQD_ACTIVE) 
> & 1))
> + break;
> + udelay(1);
> + }
> + if (i >= adev->usec_timeout)
> + r = -ETIMEDOUT;
> + else
> + r = 0;
> + soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
> + mutex_unlock(&adev->srbm_mutex);
> + gfx_v9_4_3_xcc_unset_safe_mode(adev, xcc_id);
> +
> + return r;
> +
> +}
> +
> +static bool gfx_v9_4_3_pipe_reset_support(struct amdgpu_device *adev)
> +{
> + /*TODO: Need check gfx9.4.4 mec fw whether supports pipe reset as 
> well.*/
> + if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) &&
> + adev->gfx.mec_fw_version >= 0x009b)
> + return true;
> + else
> + dev_warn_once(adev->dev, "Please use the latest MEC version to 
> see whether support pipe reset\n");
> +
> + return false;
> +}
> +
> +static int gfx_v9_4_3_kiq_reset_hw_pipe(struct amdgpu_ring *kiq_ring,
> + uint32_t queue_type, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id)
> +{
> + struct amdgpu_device *adev = kiq_ring->adev;
> + uint32_t reset_pipe, clean_pipe;
> + int r;
> +
> + if (!gfx_v9_4_3_pipe_reset_support(adev))
> + return -EINVAL;
> +
> + gfx_v9_4_3_xcc_set_safe_mode(adev, xcc_id);
> + mutex_lock(&adev->srbm_mutex);
> + soc15_grbm_select(adev, me, pipe

Re: [PATCH v2] drm/amdgpu/gfx9.4.3: Implement compute pipe reset

2024-08-20 Thread Lazar, Lijo



On 8/20/2024 4:01 PM, Prike Liang wrote:
> Implement the compute pipe reset; the driver will
> fall back to a pipe reset when the queue reset fails.
> 
> Signed-off-by: Prike Liang 
> ---
> v2: Convert the GC logic instance to physical instance in the
> register accessing process and 

> use the dev_* print to specify the failed device.

This is not fully done, marked below.

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |   5 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 153 
>  2 files changed, 138 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index e28c1ebfa98f..d4d74ba2bc27 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -143,6 +143,11 @@ struct kiq_pm4_funcs {
>  uint32_t queue_type, uint32_t me_id,
>  uint32_t pipe_id, uint32_t queue_id,
>  uint32_t xcc_id, uint32_t vmid);
> + int (*kiq_reset_hw_pipe)(struct amdgpu_ring *kiq_ring,
> +uint32_t queue_type, uint32_t me,
> +uint32_t pipe, uint32_t queue,
> +uint32_t xcc_id);
> +
>   /* Packet sizes */
>   int set_resources_size;
>   int map_queues_size;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 2067f26d3a9d..ab9d5adbbfe8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -166,6 +166,10 @@ static int gfx_v9_4_3_get_cu_info(struct amdgpu_device 
> *adev,
>   struct amdgpu_cu_info *cu_info);
>  static void gfx_v9_4_3_xcc_set_safe_mode(struct amdgpu_device *adev, int 
> xcc_id);
>  static void gfx_v9_4_3_xcc_unset_safe_mode(struct amdgpu_device *adev, int 
> xcc_id);
> +static int gfx_v9_4_3_kiq_reset_hw_pipe(struct amdgpu_ring *kiq_ring,
> + uint32_t queue_type, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id);
>  
>  static void gfx_v9_4_3_kiq_set_resources(struct amdgpu_ring *kiq_ring,
>   uint64_t queue_mask)
> @@ -323,6 +327,7 @@ static const struct kiq_pm4_funcs 
> gfx_v9_4_3_kiq_pm4_funcs = {
>   .kiq_query_status = gfx_v9_4_3_kiq_query_status,
>   .kiq_invalidate_tlbs = gfx_v9_4_3_kiq_invalidate_tlbs,
>   .kiq_reset_hw_queue = gfx_v9_4_3_kiq_reset_hw_queue,
> + .kiq_reset_hw_pipe = gfx_v9_4_3_kiq_reset_hw_pipe,
>   .set_resources_size = 8,
>   .map_queues_size = 7,
>   .unmap_queues_size = 6,
> @@ -3466,6 +3471,115 @@ static void gfx_v9_4_3_emit_wave_limit(struct 
> amdgpu_ring *ring, bool enable)
>   }
>  }
>  
> +static int gfx_v9_4_3_unmap_done(struct amdgpu_device *adev, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id)
> +{
> + int i, r;
> + /* make sure dequeue is complete*/
> + gfx_v9_4_3_xcc_set_safe_mode(adev, xcc_id);
> + mutex_lock(&adev->srbm_mutex);
> + soc15_grbm_select(adev, me, pipe, queue, 0, GET_INST(GC, xcc_id));
> + for (i = 0; i < adev->usec_timeout; i++) {
> + if (!(RREG32_SOC15(GC, GET_INST(GC, xcc_id), regCP_HQD_ACTIVE) 
> & 1))
> + break;
> + udelay(1);
> + }
> + if (i >= adev->usec_timeout)
> + r = -ETIMEDOUT;
> + else
> + r = 0;
> + soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
> + mutex_unlock(&adev->srbm_mutex);
> + gfx_v9_4_3_xcc_unset_safe_mode(adev, xcc_id);
> +
> + return r;
> +
> +}
> +
> +static bool gfx_v9_4_3_pipe_reset_support(struct amdgpu_device *adev)
> +{
> +
> + if (unlikely(adev->gfx.mec_fw_version < 0x009b)) {
> + DRM_WARN_ONCE("MEC firmware version too old, please use FW no 
> older than 155!\n");
> + return false;
> + }

This path will be taken for GC v9.4.3 and GC v9.4.4. GC v9.4.4 has a
different FW version. If the FW is not yet ready for 9.4.4, better to check
for that and return false there.
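A rough sketch of that suggestion (only the 9.4.3 threshold comes from the
patch; the 9.4.4 value below is a placeholder, not a released firmware
version):

static bool gfx_v9_4_3_pipe_reset_support(struct amdgpu_device *adev)
{
	/* Pipe reset needs a minimum MEC firmware per GC version. */
	if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) &&
	    adev->gfx.mec_fw_version >= 0x009b)
		return true;

	/* Placeholder: enable GC 9.4.4 only once its MEC firmware is
	 * confirmed to support pipe reset; 0x00b0 is an assumed value.
	 */
	if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4) &&
	    adev->gfx.mec_fw_version >= 0x00b0)
		return true;

	dev_warn_once(adev->dev, "MEC firmware too old for pipe reset\n");
	return false;
}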

> +
> + return true;
> +}
> +
> +static int gfx_v9_4_3_kiq_reset_hw_pipe(struct amdgpu_ring *kiq_ring,
> + uint32_t queue_type, uint32_t me,
> + uint32_t pipe, uint32_t queue,
> + uint32_t xcc_id)
> +{
> + struct amdgpu_device *adev = kiq_ring->adev;
> + uint32_t reset_pipe, clean_pipe;
> + int r;
> +
> + if (!gfx_v9_4_3_pipe_reset_support(adev))
> + return -EINVAL;
> +
> + gfx_v9_4_3_xcc_set_safe_mode(adev, xcc_id);
> + mutex_lock(&adev->srbm_mutex);
> + soc15_grbm_select(adev, me, pipe, queue, 0, GET_INST(GC, xcc_id));
> +
> 

Re: [PATCH v3] drm/amdgpu: Disable dpm_enabled flag while VF is in reset

2024-08-12 Thread Lazar, Lijo



On 8/12/2024 7:53 PM, Victor Skvortsov wrote:
> VFs do not perform HW fini/suspend in FLR, so the dpm_enabled
> is incorrectly kept enabled. Add interface to disable it in
> virt_pre_reset call.
> 
> v2: Made implementation generic for all asics
> v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF
> 
> Signed-off-by: Victor Skvortsov 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 8 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 1 +
>  drivers/gpu/drm/amd/include/kgd_pp_interface.h | 1 +
>  drivers/gpu/drm/amd/pm/amdgpu_dpm.c| 6 +-
>  5 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 29a4adee9286..a6b8d0ba4758 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5289,10 +5289,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> *adev,
>   if (reset_context->reset_req_dev == adev)
>   job = reset_context->job;
>  
> - if (amdgpu_sriov_vf(adev)) {
> - /* stop the data exchange thread */
> - amdgpu_virt_fini_data_exchange(adev);
> - }
> + if (amdgpu_sriov_vf(adev))
> + amdgpu_virt_pre_reset(adev);
>  
>   amdgpu_fence_driver_isr_toggle(adev, true);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index b287a82e6177..b6397d3229e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -33,6 +33,7 @@
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
>  #include "amdgpu_reset.h"
> +#include "amdgpu_dpm.h"
>  #include "vi.h"
>  #include "soc15.h"
>  #include "nv.h"
> @@ -849,6 +850,13 @@ enum amdgpu_sriov_vf_mode 
> amdgpu_virt_get_sriov_vf_mode(struct amdgpu_device *ad
>   return mode;
>  }
>  
> +void amdgpu_virt_pre_reset(struct amdgpu_device *adev)
> +{
> + /* stop the data exchange thread */
> + amdgpu_virt_fini_data_exchange(adev);
> + amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
> +}
> +
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev)
>  {
>   if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3)) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index b42a8854dca0..b650a2032c42 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -376,6 +376,7 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
> u32 offset, u32 acc_flags, u32 hwip, u32 xcc_id);
>  bool amdgpu_virt_fw_load_skip_check(struct amdgpu_device *adev,
>   uint32_t ucode_id);
> +void amdgpu_virt_pre_reset(struct amdgpu_device *adev);
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev);
>  bool amdgpu_sriov_xnack_support(struct amdgpu_device *adev);
>  bool amdgpu_virt_get_rlcg_reg_access_flag(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
> b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> index 4b20e2274313..19a48d98830a 100644
> --- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> +++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> @@ -218,6 +218,7 @@ enum pp_mp1_state {
>   PP_MP1_STATE_SHUTDOWN,
>   PP_MP1_STATE_UNLOAD,
>   PP_MP1_STATE_RESET,
> + PP_MP1_STATE_FLR,
>  };
>  
>  enum pp_df_cstate {
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
> b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> index 8b7d6ed7e2ed..9dc82f4d7c93 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> @@ -168,7 +168,11 @@ int amdgpu_dpm_set_mp1_state(struct amdgpu_device *adev,
>   int ret = 0;
>   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
>  
> - if (pp_funcs && pp_funcs->set_mp1_state) {
> + if (mp1_state == PP_MP1_STATE_FLR) {
> + /* VF lost access to SMU */
> + if (amdgpu_sriov_vf(adev))
> + adev->pm.dpm_enabled = false;
> + } else if (pp_funcs && pp_funcs->set_mp1_state) {
>   mutex_lock(&adev->pm.mutex);
>  
>   ret = pp_funcs->set_mp1_state(


Re: [PATCH v2] drm/amdgpu: Disable dpm_enabled flag while VF is in reset

2024-08-12 Thread Lazar, Lijo



On 8/12/2024 6:39 PM, Victor Skvortsov wrote:
> VFs do not perform HW fini/suspend in FLR, so the dpm_enabled
> is incorrectly kept enabled. Add interface to disable it in
> virt_pre_reset call.
> 
> v2: Made implementation generic for all asics
> 
> Signed-off-by: Victor Skvortsov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 8 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 1 +
>  drivers/gpu/drm/amd/include/kgd_pp_interface.h | 1 +
>  drivers/gpu/drm/amd/pm/amdgpu_dpm.c| 5 -
>  5 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 29a4adee9286..a6b8d0ba4758 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5289,10 +5289,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> *adev,
>   if (reset_context->reset_req_dev == adev)
>   job = reset_context->job;
>  
> - if (amdgpu_sriov_vf(adev)) {
> - /* stop the data exchange thread */
> - amdgpu_virt_fini_data_exchange(adev);
> - }
> + if (amdgpu_sriov_vf(adev))
> + amdgpu_virt_pre_reset(adev);
>  
>   amdgpu_fence_driver_isr_toggle(adev, true);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index b287a82e6177..b6397d3229e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -33,6 +33,7 @@
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
>  #include "amdgpu_reset.h"
> +#include "amdgpu_dpm.h"
>  #include "vi.h"
>  #include "soc15.h"
>  #include "nv.h"
> @@ -849,6 +850,13 @@ enum amdgpu_sriov_vf_mode 
> amdgpu_virt_get_sriov_vf_mode(struct amdgpu_device *ad
>   return mode;
>  }
>  
> +void amdgpu_virt_pre_reset(struct amdgpu_device *adev)
> +{
> + /* stop the data exchange thread */
> + amdgpu_virt_fini_data_exchange(adev);
> + amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
> +}
> +
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev)
>  {
>   if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3)) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index b42a8854dca0..b650a2032c42 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -376,6 +376,7 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
> u32 offset, u32 acc_flags, u32 hwip, u32 xcc_id);
>  bool amdgpu_virt_fw_load_skip_check(struct amdgpu_device *adev,
>   uint32_t ucode_id);
> +void amdgpu_virt_pre_reset(struct amdgpu_device *adev);
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev);
>  bool amdgpu_sriov_xnack_support(struct amdgpu_device *adev);
>  bool amdgpu_virt_get_rlcg_reg_access_flag(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
> b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> index 4b20e2274313..19a48d98830a 100644
> --- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> +++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> @@ -218,6 +218,7 @@ enum pp_mp1_state {
>   PP_MP1_STATE_SHUTDOWN,
>   PP_MP1_STATE_UNLOAD,
>   PP_MP1_STATE_RESET,
> + PP_MP1_STATE_FLR,
>  };
>  
>  enum pp_df_cstate {
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
> b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> index 8b7d6ed7e2ed..af39206a2c5f 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> @@ -168,7 +168,10 @@ int amdgpu_dpm_set_mp1_state(struct amdgpu_device *adev,
>   int ret = 0;
>   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
>  
> - if (pp_funcs && pp_funcs->set_mp1_state) {
> + if (amdgpu_sriov_vf(adev) && mp1_state == PP_MP1_STATE_FLR) {
> + /* VF lost access to SMU */
> + adev->pm.dpm_enabled = false;

For non-VF devices, PP_MP1_STATE_FLR needs to be a don't care.
Preferably, something like

if (mp1_state == PP_MP1_STATE_FLR) {
	if (amdgpu_sriov_vf(adev))
		adev->pm.dpm_enabled = false;
} else {
	...
}

Thanks,
Lijo

> + } else if (pp_funcs && pp_funcs->set_mp1_state) {
>   mutex_lock(&adev->pm.mutex);
>  
>   ret = pp_funcs->set_mp1_state(


Re: [PATCH] drm/amdgpu: Do not init ta microcode from guest side

2024-08-11 Thread Lazar, Lijo



On 8/12/2024 10:35 AM, Zhang, Hawking wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Yes, this applies to all types of TAs
> 

Presently, we have this -
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c#L925

which makes use of XGMI TA commands in VF mode.

Thanks,
Lijo

> Regards,
> Hawking
> 
> -Original Message-
> From: Lazar, Lijo 
> Sent: Monday, August 12, 2024 12:52
> To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org; 
> Zhou1, Tao 
> Subject: Re: [PATCH] drm/amdgpu: Do not init ta microcode from guest side
> 
> 
> 
> On 8/12/2024 8:52 AM, Hawking Zhang wrote:
>> TA should not be loaded from guest side.
> 
> Does this apply to XGMI TA?
> 
> Thanks,
> Lijo
> 
>>
>> Signed-off-by: Hawking Zhang 
>> Reviewed-by: Shiwu Zhang 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/psp_v13_0.c | 8 +---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
>> index 85ec9e35690a..749d8143b1e7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
>> @@ -132,9 +132,11 @@ static int psp_v13_0_init_microcode(struct psp_context 
>> *psp)
>>   (adev->emu_flags & AMDGPU_EMU_dGPU_SIDEWINDER))
>>   break;
>>   /* It's not necessary to load ras ta on Guest side */
>> - err = psp_init_ta_microcode(psp, ucode_prefix);
>> - if (err)
>> - return err;
>> + if (!amdgpu_sriov_vf(adev)) {
>> + err = psp_init_ta_microcode(psp, ucode_prefix);
>> + if (err)
>> + return err;
>> + }
>>   break;
>>   default:
>>   BUG();


Re: [PATCH] drm/amdgpu: Do not init ta microcode from guest side

2024-08-11 Thread Lazar, Lijo



On 8/12/2024 8:52 AM, Hawking Zhang wrote:
> TA should not be loaded from guest side.

Does this apply to XGMI TA?

Thanks,
Lijo

> 
> Signed-off-by: Hawking Zhang 
> Reviewed-by: Shiwu Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/psp_v13_0.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c 
> b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> index 85ec9e35690a..749d8143b1e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> @@ -132,9 +132,11 @@ static int psp_v13_0_init_microcode(struct psp_context 
> *psp)
>   (adev->emu_flags & AMDGPU_EMU_dGPU_SIDEWINDER))
>   break;
>   /* It's not necessary to load ras ta on Guest side */
> - err = psp_init_ta_microcode(psp, ucode_prefix);
> - if (err)
> - return err;
> + if (!amdgpu_sriov_vf(adev)) {
> + err = psp_init_ta_microcode(psp, ucode_prefix);
> + if (err)
> + return err;
> + }
>   break;
>   default:
>   BUG();


RE: [PATCH] drm/amdkfd: fix partition query when setting up recommended sdma engines

2024-08-09 Thread Lazar, Lijo
[Public]

Reviewed-by: Lijo Lazar 

Thanks,
Lijo
-Original Message-
From: Kim, Jonathan 
Sent: Thursday, August 8, 2024 9:39 PM
To: amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Kuehling, Felix ; 
Kim, Jonathan ; Kim, Jonathan 
Subject: [PATCH] drm/amdkfd: fix partition query when setting up recommended 
sdma engines

When users dynamically set the partition mode through sysfs writes, this can 
lead to a double lock situation where the KFD is trying to take the partition 
lock when updating the recommended SDMA engines.
Have the KFD reference its saved socket device number count instead.
Also ensure we have enough SDMA xGMI engines to report the recommended engines
in the first place.

v2: fixups in description

Fixes: a0f548d7871e ("drm/amdkfd: allow users to target recommended SDMA 
engines")
Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 40771f8752cb..27d452e50ca9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1286,9 +1286,8 @@ static void kfd_set_recommended_sdma_engines(struct 
kfd_topology_device *to_dev,
struct amdgpu_device *adev = gpu->adev;
int num_xgmi_nodes = adev->gmc.xgmi.num_physical_nodes;
bool support_rec_eng = !amdgpu_sriov_vf(adev) && to_dev->gpu &&
-   adev->aid_mask && num_xgmi_nodes &&
-   (amdgpu_xcp_query_partition_mode(adev->xcp_mgr, 
AMDGPU_XCP_FL_NONE) ==
- AMDGPU_SPX_PARTITION_MODE) &&
+   adev->aid_mask && num_xgmi_nodes && gpu->kfd->num_nodes == 1 &&
+   kfd_get_num_xgmi_sdma_engines(gpu) >= 14 &&
(!(adev->flags & AMD_IS_APU) && num_xgmi_nodes == 8);

if (support_rec_eng) {
--
2.34.1



Re: [PATCH] drm/amdgpu: Disable dpm_enabled flag while VF is in reset

2024-08-08 Thread Lazar, Lijo



On 8/8/2024 10:56 PM, Victor Skvortsov wrote:
> VFs do not perform HW fini/suspend in FLR, so the dpm_enabled
> is incorrectly kept enabled. Add interface to disable it in
> virt_pre_reset call.
> 
> Signed-off-by: Victor Skvortsov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 10 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c  |  8 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h  |  1 +
>  .../gpu/drm/amd/include/kgd_pp_interface.h|  1 +
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 21 +++
>  5 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 730dae77570c..1be5699f4190 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5288,10 +5288,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> *adev,
>   if (reset_context->reset_req_dev == adev)
>   job = reset_context->job;
>  
> - if (amdgpu_sriov_vf(adev)) {
> - /* stop the data exchange thread */
> - amdgpu_virt_fini_data_exchange(adev);
> - }
> + if (amdgpu_sriov_vf(adev))
> + amdgpu_virt_pre_reset(adev);
>  
>   amdgpu_fence_driver_isr_toggle(adev, true);
>  
> @@ -5561,6 +5559,10 @@ int amdgpu_do_asic_reset(struct list_head 
> *device_list_handle,
>  
>  static void amdgpu_device_set_mp1_state(struct amdgpu_device *adev)
>  {
> + if (amdgpu_sriov_vf(adev)) {
> + adev->mp1_state = PP_MP1_STATE_FLR;
> + return;
> + }

Better to remove this change. If at all this state needs to be persisted
throughout the reset, handle it only through amdgpu_dpm_set_mp1_state.
For now, I don't see a reason to store this state; we only need it as a
trigger for the action associated with it.
>  
>   switch (amdgpu_asic_reset_method(adev)) {
>   case AMD_RESET_METHOD_MODE1:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 111c380f929b..456a685c3975 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -33,6 +33,7 @@
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
>  #include "amdgpu_reset.h"
> +#include "amdgpu_dpm.h"
>  #include "vi.h"
>  #include "soc15.h"
>  #include "nv.h"
> @@ -849,6 +850,13 @@ enum amdgpu_sriov_vf_mode 
> amdgpu_virt_get_sriov_vf_mode(struct amdgpu_device *ad
>   return mode;
>  }
>  
> +void amdgpu_virt_pre_reset(struct amdgpu_device *adev)
> +{
> + /* stop the data exchange thread */
> + amdgpu_virt_fini_data_exchange(adev);
> + amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
> +}
> +
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev)
>  {
>   if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3)) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index b42a8854dca0..b650a2032c42 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -376,6 +376,7 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
> u32 offset, u32 acc_flags, u32 hwip, u32 xcc_id);
>  bool amdgpu_virt_fw_load_skip_check(struct amdgpu_device *adev,
>   uint32_t ucode_id);
> +void amdgpu_virt_pre_reset(struct amdgpu_device *adev);
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev);
>  bool amdgpu_sriov_xnack_support(struct amdgpu_device *adev);
>  bool amdgpu_virt_get_rlcg_reg_access_flag(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
> b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> index 4b20e2274313..19a48d98830a 100644
> --- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> +++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> @@ -218,6 +218,7 @@ enum pp_mp1_state {
>   PP_MP1_STATE_SHUTDOWN,
>   PP_MP1_STATE_UNLOAD,
>   PP_MP1_STATE_RESET,
> + PP_MP1_STATE_FLR,
>  };
>  
>  enum pp_df_cstate {
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index 78c3f94bb3ff..b85478b1eaa7 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -2638,6 +2638,26 @@ static int smu_v13_0_6_send_rma_reason(struct 
> smu_context *smu)
>   return ret;
>  }
>  
> +static int smu_v13_0_6_set_mp1_state(struct smu_context *smu,
> + enum pp_mp1_state mp1_state)
> +{
> + int ret =0;
> +
> + switch (mp1_state) {
> + case PP_MP1_STATE_FLR:
> + /* VF lost access to SMU */
> + smu->adev->pm.dpm_enabled = false;
> + ret = 0;
> + break;
> + default:
> + /* Ignore others */
> + ret = 0;
> + }
> +
> + return r

Re: [PATCH v1 05/15] drm/amdgpu: add vcn_v4_0_3 ip dump support

2024-08-08 Thread Lazar, Lijo



On 8/8/2024 12:36 PM, Khatri, Sunil wrote:
> 
> On 8/8/2024 11:20 AM, Lazar, Lijo wrote:
>>
>> On 8/7/2024 2:58 AM, Alex Deucher wrote:
>>> On Tue, Aug 6, 2024 at 4:18 AM Sunil Khatri 
>>> wrote:
>>>> Add support of vcn ip dump in the devcoredump
>>>> for vcn_v4_0_3.
>>>>
>>>> Signed-off-by: Sunil Khatri 
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 170
>>>> +++-
>>>>   1 file changed, 169 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> index 9bae95538b62..dd3baccb2904 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> @@ -45,6 +45,132 @@
>>>>   #define VCN_VID_SOC_ADDRESS_2_0    0x1fb00
>>>>   #define VCN1_VID_SOC_ADDRESS_3_0   0x48300
>>>>
>>>> +static const struct amdgpu_hwip_reg_entry vcn_reg_list_4_0_3[] = {
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_POWER_STATUS),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_STATUS),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_CACHE1_64BIT_BAR_HIGH),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_CACHE1_64BIT_BAR_LOW),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_CACHE2_64BIT_BAR_HIGH),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_CACHE2_64BIT_BAR_LOW),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_VCPU_CACHE_OFFSET0),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_VCPU_CACHE_OFFSET1),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_VCPU_CACHE_OFFSET2),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CONTEXT_ID),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_GPCOM_VCPU_DATA0),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_GPCOM_VCPU_DATA1),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_GPCOM_VCPU_CMD),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_NC1_64BIT_BAR_HIGH),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC1_64BIT_BAR_LOW),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0,
>>>> regUVD_LMI_VCPU_NC0_64BIT_BAR_HIGH),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC0_64BIT_BAR_LOW),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE_VMIDS_MULTI),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC_VMIDS_MULTI),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI2),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO2),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI3),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO3),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI4),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO4),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR2),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR2),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR3),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR3),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR4),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR4),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE2),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE3),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SOFT_RESET),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SOFT_RESET2),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_GATE),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_STATUS),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_CTRL),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_CTRL3),
>>>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SUVD

Re: [PATCH v1 05/15] drm/amdgpu: add vcn_v4_0_3 ip dump support

2024-08-07 Thread Lazar, Lijo



On 8/7/2024 2:58 AM, Alex Deucher wrote:
> On Tue, Aug 6, 2024 at 4:18 AM Sunil Khatri  wrote:
>>
>> Add support of vcn ip dump in the devcoredump
>> for vcn_v4_0_3.
>>
>> Signed-off-by: Sunil Khatri 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 170 +++-
>>  1 file changed, 169 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c 
>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>> index 9bae95538b62..dd3baccb2904 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>> @@ -45,6 +45,132 @@
>>  #define VCN_VID_SOC_ADDRESS_2_00x1fb00
>>  #define VCN1_VID_SOC_ADDRESS_3_0   0x48300
>>
>> +static const struct amdgpu_hwip_reg_entry vcn_reg_list_4_0_3[] = {
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_POWER_STATUS),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_STATUS),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE1_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE1_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE2_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE2_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_VCPU_CACHE_OFFSET0),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_VCPU_CACHE_OFFSET1),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_VCPU_CACHE_OFFSET2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CONTEXT_ID),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_GPCOM_VCPU_DATA0),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_GPCOM_VCPU_DATA1),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_GPCOM_VCPU_CMD),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC1_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC1_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC0_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC0_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_CACHE_VMIDS_MULTI),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_VCPU_NC_VMIDS_MULTI),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI3),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO3),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_HI4),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_BASE_LO4),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR3),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR3),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_RPTR4),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_WPTR4),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE3),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SOFT_RESET),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SOFT_RESET2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_GATE),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_STATUS),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_CTRL),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_CGC_CTRL3),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SUVD_CGC_GATE),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SUVD_CGC_STATUS),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SUVD_CGC_CTRL),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SUVD_CGC_GATE2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE3),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE4),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_RB_SIZE4),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SUVD_CGC_STATUS2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_SUVD_CGC_GATE2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_VCPU_CACHE_OFFSET2),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_MIF_GPGPU_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_MIF_GPGPU_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_MIF_CURR_LUMA_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_MIF_CURR_LUMA_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, 
>> regUVD_LMI_MIF_CURR_CHROMA_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, 
>> regUVD_LMI_MIF_CURR_CHROMA_64BIT_BAR_HIGH),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_MIF_DBW_64BIT_BAR_LOW),
>> +   SOC15_REG_ENTRY_STR(VCN, 0, regUVD_LMI_MIF_DBW_64BIT_BAR_HIGH),
>> +   SOC1

Re: [PATCH] drm/amdkfd: fix partition query when setting up recommended sdma engines

2024-08-07 Thread Lazar, Lijo



On 8/8/2024 2:04 AM, Jonathan Kim wrote:
> When users dynamically set the partition mode through sysfs writes,
> this can lead to a double lock situation where the KFD is trying to take
> the partition lock when updating the recommended SDMA engines.
> Have the KFD do a lockless query instead to avoid this.
> This should work since the KFD always initializes synchronously after
> the KGD partition mode is set regardless of user or system setup.
> 
> Fixes: a0f548d7871e ("drm/amdkfd: allow users to target recommended SDMA 
> engines")
> Signed-off-by: Jonathan Kim 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index 40771f8752cb..8fee89b8dd67 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -1287,7 +1287,7 @@ static void kfd_set_recommended_sdma_engines(struct 
> kfd_topology_device *to_dev,
>   int num_xgmi_nodes = adev->gmc.xgmi.num_physical_nodes;
>   bool support_rec_eng = !amdgpu_sriov_vf(adev) && to_dev->gpu &&
>   adev->aid_mask && num_xgmi_nodes &&
> - (amdgpu_xcp_query_partition_mode(adev->xcp_mgr, 
> AMDGPU_XCP_FL_NONE) ==
> + (amdgpu_xcp_query_partition_mode(adev->xcp_mgr, 
> AMDGPU_XCP_FL_LOCKED) ==
> AMDGPU_SPX_PARTITION_MODE) &&

Replacing with (gpu->kfd->num_nodes == 1) may be better.

Thanks,
Lijo

>   (!(adev->flags & AMD_IS_APU) && num_xgmi_nodes == 8);
>  


Re: [PATCH 2/2] drm/amdgpu: abort KIQ waits when there is a pending reset

2024-08-03 Thread Lazar, Lijo



On 8/3/2024 12:09 AM, Victor Skvortsov wrote:
> Stop waiting for the KIQ to return when there is a reset pending.
> It's quite likely that the KIQ will never respond.
> 
> Signed-off-by: Victor Skvortsov 

Copying Christian/Vignesh

The patch is originally from Christian. Please keep the author as
Christian and you may add Tested-By.

Thanks,
Lijo

> Suggested-by: Lazar Lijo 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c   | 3 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index c02659025656..8962be257942 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -785,7 +785,8 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct 
> amdgpu_device *adev,
>   goto failed_kiq;
>  
>   might_sleep();
> - while (r < 1 && cnt++ < MAX_KIQ_REG_TRY) {
> + while (r < 1 && cnt++ < MAX_KIQ_REG_TRY&&
> + !amdgpu_reset_pending(adev->reset_domain)) {
>  
>   msleep(MAX_KIQ_REG_BAILOUT_INTERVAL);
>   r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 4ae581f3fcb5..f33a4e0ffba1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -136,6 +136,11 @@ static inline bool amdgpu_reset_domain_schedule(struct 
> amdgpu_reset_domain *doma
>   return queue_work(domain->wq, work);
>  }
>  
> +static inline bool amdgpu_reset_pending(struct amdgpu_reset_domain *domain) {
> + lockdep_assert_held(&domain->sem);
> + return rwsem_is_contended(&domain->sem);
> +}
> +
>  void amdgpu_device_lock_reset_domain(struct amdgpu_reset_domain 
> *reset_domain);
>  
>  void amdgpu_device_unlock_reset_domain(struct amdgpu_reset_domain 
> *reset_domain);


Re: [PATCH] drm/amdgpu: optimize the padding with hw optimization

2024-08-02 Thread Lazar, Lijo



On 8/2/2024 12:25 AM, Marek Olšák wrote:
> On Thu, Aug 1, 2024, 03:37 Christian König  > wrote:
> 
> On 01.08.24 at 08:53, Marek Olšák wrote:
>> On Thu, Aug 1, 2024, 00:28 Khatri, Sunil > > wrote:
>>
>>
>> On 8/1/2024 8:49 AM, Marek Olšák wrote:
>> >> +       /* Header is at index 0, followed by num_nops - 1
>> NOP packet's */
>> >> +       for (i = 1; i < num_nop; i++)
>> >> +               amdgpu_ring_write(ring, ring->funcs->nop);
>> > This loop should be removed. It's unnecessary CPU overhead
>> and we
>> > should never get more than 0x3fff NOPs (maybe use BUG_ON).
>> Leaving the
>> > whole packet body uninitialized is the fastest option.
>> That was the original intent to just move the WPTR for the no
>> of nops
>> and tried too. Based on Christian inputs we should not let the
>> nops packet
>>
>> as garbage or whatever was there originally as a threat/safety
>> measure.
>>
>>
>> It doesn't help safety. It can only be read by the GPU with
>> kernel-level permissions.
>>
>> Initializing the packet body is useless and adds CPU overhead,
>> especially with the 256 NOPs or so that we use for no reason.
> 
> Not filling the remaining ring buffers with NOPs is a pretty clear
> NAK from my side. Leaving garbage in the ring buffer is not even
> remotely defensive.
> 
> 
> What are you defending against? You know the ring is kernel-owned
> memory, right? 
> 

Aside from that, the true hardware behavior is that the CP still fetches the
words and discards them, which is not what the description says. So the only
optimization it allows is to move the pointer without filling or caring about
the contents, since the hardware doesn't care about them either. The notion of
filling those unused regions is exactly the opposite of the intention. If
that's the case, nothing is gained; just drop these patches.
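For reference, a minimal sketch of the layout being argued about, assuming the
usual PM4 encoding where the NOP header's 14-bit count field holds the number
of body dwords minus one (a sketch only, not the submitted patch):

static void ring_insert_nop_sketch(struct amdgpu_ring *ring, u32 num_nop)
{
	u32 i;

	/* Assumes num_nop >= 1 and num_nop - 2 fits the 14-bit count field. */
	if (num_nop == 1) {
		amdgpu_ring_write(ring, ring->funcs->nop);
		return;
	}

	/* One NOP header whose count spans the remaining num_nop - 1 dwords,
	 * so CP treats the whole pad as a single packet.
	 */
	amdgpu_ring_write(ring, PACKET3(PACKET3_NOP, num_nop - 2));

	/* The disputed part: fill the body (never leaves stale data in the
	 * ring) or drop this loop and only advance wptr, since CP fetches
	 * and discards the body either way.
	 */
	for (i = 1; i < num_nop; i++)
		amdgpu_ring_write(ring, ring->funcs->nop);
}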

Thanks,
Lijo

> Marek
> 
> 
> What we can do is to optimize filling N DWs into the ring buffer
> without updating the WPTR each time.
> 
> Regards,
> Christian.
> 
>>
>> Marek
>>


Re: [PATCH] drm/amdgpu: report bad status in GPU recovery

2024-07-31 Thread Lazar, Lijo



On 8/1/2024 11:28 AM, Zhou1, Tao wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Yes, the bad status message is printed twice with this patch. I think it's
> harmless and the second message is more convenient for the customer.
> 
> I can add a parameter to amdgpu_ras_eeprom_check_err_threshold to disable
> the first message if you think printing the message twice is not a good idea.
> 

Instead of doing it this way, can't this be added to amdgpu_ras_do_recovery()
to stop all recovery actions?
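A rough sketch of that alternative, assuming the check sits at the top of the
recovery worker and that the worker resolves adev from its RAS context the way
the existing code does (placement and message text are assumptions):

	struct amdgpu_ras *ras = container_of(work, struct amdgpu_ras, recovery_work);
	struct amdgpu_device *adev = ras->adev;

	/* If the bad-page threshold is already exceeded, report the bad GPU
	 * state once and bail out instead of running the reset sequence and
	 * then reporting it as a plain reset failure.
	 */
	if (amdgpu_ras_eeprom_check_err_threshold(adev)) {
		dev_err(adev->dev, "GPU is in a bad state, stopping RAS recovery\n");
		return;
	}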

Thanks,
Lijo

> Tao
> 
>> -Original Message-
>> From: Zhang, Hawking 
>> Sent: Thursday, August 1, 2024 1:30 PM
>> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Right, it's functional. My concern is whether the kernel message in
>> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is the end
>> of gpu recovery (i.e., report gpu reset failed or gpu reset succeed).
>> Check_err_threshold was already done before reaching here.
>>
>> Regards,
>> Hawking
>>
>> -Original Message-
>> From: Zhou1, Tao 
>> Sent: Thursday, August 1, 2024 11:49
>> To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is good
>> enough, no need to update it with is_rma.
>>
>> Tao
>>
>>> -Original Message-
>>> From: Zhang, Hawking 
>>> Sent: Thursday, August 1, 2024 11:00 AM
>>> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
>>> Cc: Zhou1, Tao 
>>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> Might consider leverage is_RMA flag for the same purpose?
>>>
>>> Regards,
>>> Hawking
>>>
>>> -Original Message-
>>> From: amd-gfx  On Behalf Of Tao
>>> Zhou
>>> Sent: Wednesday, July 31, 2024 18:05
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: Zhou1, Tao 
>>> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>>
>>> Instead of printing GPU reset failed.
>>>
>>> Signed-off-by: Tao Zhou 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++--
>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 355c2478c4b6..b7c967779b4b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
>>> amdgpu_device *adev,
>>> tmp_adev->asic_reset_res = 0;
>>>
>>> if (r) {
>>> -   /* bad news, how to tell it to userspace ? */
>>> -   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
>>> atomic_read(&tmp_adev->gpu_reset_counter));
>>> +   /* bad news, how to tell it to userspace ?
>>> +* for ras error, we should report GPU bad status 
>>> instead of
>>> +* reset failure
>>> +*/
>>> +   if 
>>> (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
>>> +   dev_info(tmp_adev->dev, "GPU reset(%d)
>>> + failed\n",
>>> +
>>> + atomic_read(&tmp_adev->gpu_reset_counter));
>>> amdgpu_vf_error_put(tmp_adev,
>>> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>> } else {
>>> dev_info(tmp_adev->dev, "GPU reset(%d)
>>> succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
>>> --
>>> 2.34.1
>>>
>>
>>
> 


Re: [PATCH] drm/amdgpu: report bad status in GPU recovery

2024-07-31 Thread Lazar, Lijo



On 8/1/2024 9:17 AM, Zhou1, Tao wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
>> -Original Message-----
>> From: Lazar, Lijo 
>> Sent: Wednesday, July 31, 2024 9:31 PM
>> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>>
>>
>> On 7/31/2024 3:35 PM, Tao Zhou wrote:
>>> Instead of printing GPU reset failed.
>>>
>>> Signed-off-by: Tao Zhou 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++--
>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 355c2478c4b6..b7c967779b4b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
>> amdgpu_device *adev,
>>> tmp_adev->asic_reset_res = 0;
>>>
>>> if (r) {
>>> -   /* bad news, how to tell it to userspace ? */
>>> -   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
>> atomic_read(&tmp_adev->gpu_reset_counter));
>>> +   /* bad news, how to tell it to userspace ?
>>> +* for ras error, we should report GPU bad status 
>>> instead
>> of
>>> +* reset failure
>>> +*/
>>> +   if
>> (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
>>> +   dev_info(tmp_adev->dev, "GPU reset(%d)
>> failed\n",
>>> +   atomic_read(&tmp_adev-
>>> gpu_reset_counter));
>>
>> Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm that
>> the reset is indeed triggered due to ras error.
> 
> [Tao] It seems AMDGPU_RESET_SRC_RAS is not used currently, I will set it
> before using the flag.
> 

It's set here -
https://elixir.bootlin.com/linux/v6.11-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c#L2607

Thanks,
Lijo

>>
>> Thanks,
>> Lijo
>>
>>> amdgpu_vf_error_put(tmp_adev,
>> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>> } else {
>>> dev_info(tmp_adev->dev, "GPU reset(%d)
>> succeeded!\n",
>>> atomic_read(&tmp_adev->gpu_reset_counter));


Re: [PATCH] drm/amdgpu: report bad status in GPU recovery

2024-07-31 Thread Lazar, Lijo



On 7/31/2024 3:35 PM, Tao Zhou wrote:
> Instead of printing GPU reset failed.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 355c2478c4b6..b7c967779b4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   tmp_adev->asic_reset_res = 0;
>  
>   if (r) {
> - /* bad news, how to tell it to userspace ? */
> - dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", 
> atomic_read(&tmp_adev->gpu_reset_counter));
> + /* bad news, how to tell it to userspace ?
> +  * for ras error, we should report GPU bad status 
> instead of
> +  * reset failure
> +  */
> + if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> + dev_info(tmp_adev->dev, "GPU reset(%d) 
> failed\n",
> + 
> atomic_read(&tmp_adev->gpu_reset_counter));

Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm
that the reset is indeed triggered due to ras error.
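A minimal sketch of how that check could slot into the hunk above (the message
wording is only illustrative):

	if (r) {
		/* Report bad GPU status only when the reset really came from
		 * RAS and the bad-page threshold is exceeded; other reset
		 * sources keep the plain failure message.
		 */
		if (reset_context->src == AMDGPU_RESET_SRC_RAS &&
		    amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
			dev_info(tmp_adev->dev, "GPU reset(%d) failed, GPU is in a bad state due to RAS errors\n",
				 atomic_read(&tmp_adev->gpu_reset_counter));
		else
			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
				 atomic_read(&tmp_adev->gpu_reset_counter));
	}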

Thanks,
Lijo

>   amdgpu_vf_error_put(tmp_adev, 
> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>   } else {
>   dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", 
> atomic_read(&tmp_adev->gpu_reset_counter));


Re: [PATCH v2 1/2] drm/amdgpu: Remove debugfs amdgpu_reset_dump_register_list

2024-07-30 Thread Lazar, Lijo



On 7/30/2024 12:14 PM, Sunil Khatri wrote:
> There are some problems with the existing amdgpu_reset_dump_register_list
> debugfs node. It is supposed to read a list of registers, but there
> could be cases where the IP is not in the correct power state. Register
> reads in such cases could lead to more problems.
> 
> We take care of all such power states in devcoredump and dump the
> registers needed for debugging. So clean up this code; we don't need
> this functionality via debugfs anymore.
> 
> Signed-off-by: Sunil Khatri 

Series is -

Reviewed-by: Lijo Lazar 

Thanks,
Lijo
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 96 -
>  1 file changed, 96 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 0e1a11b6b989..cbef720de779 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -2026,100 +2026,6 @@ DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,
>  DEFINE_DEBUGFS_ATTRIBUTE(fops_sclk_set, NULL,
>   amdgpu_debugfs_sclk_set, "%llu\n");
>  
> -static ssize_t amdgpu_reset_dump_register_list_read(struct file *f,
> - char __user *buf, size_t size, loff_t *pos)
> -{
> - struct amdgpu_device *adev = (struct amdgpu_device 
> *)file_inode(f)->i_private;
> - char reg_offset[12];
> - int i, ret, len = 0;
> -
> - if (*pos)
> - return 0;
> -
> - memset(reg_offset, 0, 12);
> - ret = down_read_killable(&adev->reset_domain->sem);
> - if (ret)
> - return ret;
> -
> - for (i = 0; i < adev->reset_info.num_regs; i++) {
> - sprintf(reg_offset, "0x%x\n", 
> adev->reset_info.reset_dump_reg_list[i]);
> - up_read(&adev->reset_domain->sem);
> - if (copy_to_user(buf + len, reg_offset, strlen(reg_offset)))
> - return -EFAULT;
> -
> - len += strlen(reg_offset);
> - ret = down_read_killable(&adev->reset_domain->sem);
> - if (ret)
> - return ret;
> - }
> -
> - up_read(&adev->reset_domain->sem);
> - *pos += len;
> -
> - return len;
> -}
> -
> -static ssize_t amdgpu_reset_dump_register_list_write(struct file *f,
> - const char __user *buf, size_t size, loff_t *pos)
> -{
> - struct amdgpu_device *adev = (struct amdgpu_device 
> *)file_inode(f)->i_private;
> - char reg_offset[11];
> - uint32_t *new = NULL, *tmp = NULL;
> - unsigned int len = 0;
> - int ret, i = 0;
> -
> - do {
> - memset(reg_offset, 0, 11);
> - if (copy_from_user(reg_offset, buf + len,
> - min(10, (size-len {
> - ret = -EFAULT;
> - goto error_free;
> - }
> -
> - new = krealloc_array(tmp, i + 1, sizeof(uint32_t), GFP_KERNEL);
> - if (!new) {
> - ret = -ENOMEM;
> - goto error_free;
> - }
> - tmp = new;
> - if (sscanf(reg_offset, "%X %n", &tmp[i], &ret) != 1) {
> - ret = -EINVAL;
> - goto error_free;
> - }
> -
> - len += ret;
> - i++;
> - } while (len < size);
> -
> - new = kmalloc_array(i, sizeof(uint32_t), GFP_KERNEL);
> - if (!new) {
> - ret = -ENOMEM;
> - goto error_free;
> - }
> - ret = down_write_killable(&adev->reset_domain->sem);
> - if (ret)
> - goto error_free;
> -
> - swap(adev->reset_info.reset_dump_reg_list, tmp);
> - swap(adev->reset_info.reset_dump_reg_value, new);
> - adev->reset_info.num_regs = i;
> - up_write(&adev->reset_domain->sem);
> - ret = size;
> -
> -error_free:
> - if (tmp != new)
> - kfree(tmp);
> - kfree(new);
> - return ret;
> -}
> -
> -static const struct file_operations amdgpu_reset_dump_register_list = {
> - .owner = THIS_MODULE,
> - .read = amdgpu_reset_dump_register_list_read,
> - .write = amdgpu_reset_dump_register_list_write,
> - .llseek = default_llseek
> -};
> -
>  int amdgpu_debugfs_init(struct amdgpu_device *adev)
>  {
>   struct dentry *root = adev_to_drm(adev)->primary->debugfs_root;
> @@ -2204,8 +2110,6 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
>   &amdgpu_debugfs_vm_info_fops);
>   debugfs_create_file("amdgpu_benchmark", 0200, root, adev,
>   &amdgpu_benchmark_fops);
> - debugfs_create_file("amdgpu_reset_dump_register_list", 0644, root, adev,
> - &amdgpu_reset_dump_register_list);
>  
>   adev->debugfs_vbios_blob.data = adev->bios;
>   adev->debugfs_vbios_blob.size = adev->bios_size;


Re: [PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

2024-07-28 Thread Lazar, Lijo



On 7/29/2024 11:08 AM, Khatri, Sunil wrote:
> 
> On 7/29/2024 10:08 AM, Lazar, Lijo wrote:
>> On 7/27/2024 12:51 AM, Khatri, Sunil wrote:
>>> On 7/27/2024 12:13 AM, Alex Deucher wrote:
>>>> On Fri, Jul 26, 2024 at 1:16 PM Khatri, Sunil  wrote:
>>>>> On 7/26/2024 8:36 PM, Lazar, Lijo wrote:
>>>>>> On 7/26/2024 8:11 PM, Khatri, Sunil wrote:
>>>>>>> On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>>>>>>>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>>>>>>>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>>>>>>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri 
>>>>>>>>>> wrote:
>>>>>>>>>>> Problem:
>>>>>>>>>>> IP dump right now is done post suspend of
>>>>>>>>>>> all IP's which for some IP's could change power
>>>>>>>>>>> state and software state too which we do not want
>>>>>>>>>>> to reflect in the dump as it might not be same at
>>>>>>>>>>> the time of hang.
>>>>>>>>>>>
>>>>>>>>>>> Solution:
>>>>>>>>>>> IP should be dumped as close to the HW state when
>>>>>>>>>>> the GPU was in hung state without trying to reinitialize
>>>>>>>>>>> any resource.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Sunil Khatri 
>>>>>>>>>> Acked-by: Alex Deucher 
>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
>>>>>>>>>>> +++---
>>>>>>>>>>>     1 file changed, 30 insertions(+), 30 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> index 730dae77570c..74f6f15e73b5 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
>>>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>>>    return ret;
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>>>>> +{
>>>>>>>>>>> +   int i;
>>>>>>>>>>> +
>>>>>>>>>>> +   lockdep_assert_held(&adev->reset_domain->sem);
>>>>>>>>>>> +
>>>>>>>>>>> +   for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>>>>>>> +   adev->reset_info.reset_dump_reg_value[i] =
>>>>>>>>>>> +
>>>>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>>>>>> A suspend also involves power/clock ungate. When reg dump is moved
>>>>>>>>> earlier, I'm not sure if this read works for all. If it's left to
>>>>>>>>> individual IP call backs, they could just do the same or better
>>>>>>>>> to move
>>>>>>>>> these up before a dump.
>>>>>>>> Suspend also put the status.hw = false and each IP in their
>>>>>>>> respective
>>>>>>>> suspend state which i feel does change the state of the HW.
>>>>>>>> To get the correct snapshot of the GPU register we should not be
>>>>>>>> fiddling with the HW IP at least till we capture the dump and that is
>>>>>>>> the intention behind the change.
>>>>>>>>
>>>>>>>> Do you think there is a problem in this approach?
>>>>>>>>>    amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>>>>>>>    amdgpu_device_s

Re: [PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

2024-07-28 Thread Lazar, Lijo



On 7/27/2024 12:51 AM, Khatri, Sunil wrote:
> 
> On 7/27/2024 12:13 AM, Alex Deucher wrote:
>> On Fri, Jul 26, 2024 at 1:16 PM Khatri, Sunil  wrote:
>>>
>>> On 7/26/2024 8:36 PM, Lazar, Lijo wrote:
>>>> On 7/26/2024 8:11 PM, Khatri, Sunil wrote:
>>>>> On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>>>>>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>>>>>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>>>>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri 
>>>>>>>> wrote:
>>>>>>>>> Problem:
>>>>>>>>> IP dump right now is done post suspend of
>>>>>>>>> all IP's which for some IP's could change power
>>>>>>>>> state and software state too which we do not want
>>>>>>>>> to reflect in the dump as it might not be same at
>>>>>>>>> the time of hang.
>>>>>>>>>
>>>>>>>>> Solution:
>>>>>>>>> IP should be dumped as close to the HW state when
>>>>>>>>> the GPU was in hung state without trying to reinitialize
>>>>>>>>> any resource.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Sunil Khatri 
>>>>>>>> Acked-by: Alex Deucher 
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
>>>>>>>>> +++---
>>>>>>>>>     1 file changed, 30 insertions(+), 30 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index 730dae77570c..74f6f15e73b5 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>    return ret;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>>> +{
>>>>>>>>> +   int i;
>>>>>>>>> +
>>>>>>>>> +   lockdep_assert_held(&adev->reset_domain->sem);
>>>>>>>>> +
>>>>>>>>> +   for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>>>>> +   adev->reset_info.reset_dump_reg_value[i] =
>>>>>>>>> +
>>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>>>> A suspend also involves power/clock ungate. When reg dump is moved
>>>>>>> earlier, I'm not sure if this read works for all. If it's left to
>>>>>>> individual IP call backs, they could just do the same or better
>>>>>>> to move
>>>>>>> these up before a dump.
>>>>>> Suspend also put the status.hw = false and each IP in their
>>>>>> respective
>>>>>> suspend state which i feel does change the state of the HW.
>>>>>> To get the correct snapshot of the GPU register we should not be
>>>>>> fiddling with the HW IP at least till we capture the dump and that is
>>>>>> the intention behind the change.
>>>>>>
>>>>>> Do you think there is a problem in this approach?
>>>>>>>    amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>>>>>    amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>>>>> Adding this does sounds better to enable just before we dump the
>>>>>> registers but i am not very sure if ungating would be clean here or
>>>>>> not. i Could try quickly adding these two calls just before dump.
>>>>>>
>>>>>> All i am worried if it does change some register reflecting the
>>>>>> original state of registers at dump.
>>>>>>
>>>> I was thinking that if it includes some GFX regs and the hang happened
>

Re: [PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

2024-07-26 Thread Lazar, Lijo



On 7/26/2024 8:11 PM, Khatri, Sunil wrote:
> 
> On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>>
>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>>
>>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri 
>>>> wrote:
>>>>> Problem:
>>>>> IP dump right now is done post suspend of
>>>>> all IP's which for some IP's could change power
>>>>> state and software state too which we do not want
>>>>> to reflect in the dump as it might not be same at
>>>>> the time of hang.
>>>>>
>>>>> Solution:
>>>>> IP should be dumped as close to the HW state when
>>>>> the GPU was in hung state without trying to reinitialize
>>>>> any resource.
>>>>>
>>>>> Signed-off-by: Sunil Khatri 
>>>> Acked-by: Alex Deucher 
>>>>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
>>>>> +++---
>>>>>   1 file changed, 30 insertions(+), 30 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 730dae77570c..74f6f15e73b5 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
>>>>> amdgpu_device *adev)
>>>>>  return ret;
>>>>>   }
>>>>>
>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>> +{
>>>>> +   int i;
>>>>> +
>>>>> +   lockdep_assert_held(&adev->reset_domain->sem);
>>>>> +
>>>>> +   for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>> +   adev->reset_info.reset_dump_reg_value[i] =
>>>>> +  
>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>> A suspend also involves power/clock ungate. When reg dump is moved
>>> earlier, I'm not sure if this read works for all. If it's left to
>>> individual IP call backs, they could just do the same or better to move
>>> these up before a dump.
>> Suspend also put the status.hw = false and each IP in their respective
>> suspend state which i feel does change the state of the HW.
>> To get the correct snapshot of the GPU register we should not be
>> fiddling with the HW IP at least till we capture the dump and that is
>> the intention behind the change.
>>
>> Do you think there is a problem in this approach?
>>>
>>>  amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>  amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>> Adding this does sounds better to enable just before we dump the
>> registers but i am not very sure if ungating would be clean here or
>> not. i Could try quickly adding these two calls just before dump.
>>
>> All i am worried if it does change some register reflecting the
>> original state of registers at dump.
>>

I was thinking of the case where the list includes some GFX regs and the hang
happened because of SDMA/VCN jobs, which somehow keep the GFXOFF state intact.

BTW, since there is already dump_ip_state, which can capture IP regs
separately and handle their power/clock gating situations, do you think
this generic one is still needed?

Thanks,
Lijo
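
For reference, a minimal sketch of the per-IP alternative mentioned above,
where a dump_ip_state callback takes care of its own gating before reading
registers. The callback shape follows amd_ip_funcs; the register list and the
ip_dump buffer are illustrative assumptions, not existing code:

static void gfx_vN_0_dump_ip_state(void *handle)
{
	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
	uint32_t i;

	/* Keep GFX out of GFXOFF while its registers are read */
	amdgpu_gfx_off_ctrl(adev, false);

	for (i = 0; i < ARRAY_SIZE(gc_reg_list); i++)
		adev->gfx.ip_dump[i] = RREG32(gc_reg_list[i]);

	amdgpu_gfx_off_ctrl(adev, true);
}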

>> What u suggest ?
>> Regards
>> Sunil
> I quickly validated on Navi22 by adding the below calls just before we dump
> the registers:
>
> if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>     amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>     amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>
>     amdgpu_reset_reg_dumps(tmp_adev);
>     dev_info(tmp_adev->dev, "Dumping IP State\n");
>     /* Trigger ip dump before we reset the asic */
>     for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>         if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>             tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>                                     (void *)tmp_adev);
>     dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
> }
>
> If this sounds fine to you, I will update the patch accordingly.
> Regards,
> Sunil Khatri
>>
>>>
>>> T

Re: [PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

2024-07-26 Thread Lazar, Lijo



On 7/26/2024 6:42 PM, Alex Deucher wrote:
> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri  wrote:
>>
>> Problem:
>> IP dump right now is done post suspend of
>> all IP's which for some IP's could change power
>> state and software state too which we do not want
>> to reflect in the dump as it might not be same at
>> the time of hang.
>>
>> Solution:
>> IP should be dumped as close to the HW state when
>> the GPU was in hung state without trying to reinitialize
>> any resource.
>>
>> Signed-off-by: Sunil Khatri 
> 
> Acked-by: Alex Deucher 
> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 +++---
>>  1 file changed, 30 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 730dae77570c..74f6f15e73b5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct amdgpu_device 
>> *adev)
>> return ret;
>>  }
>>
>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>> +{
>> +   int i;
>> +
>> +   lockdep_assert_held(&adev->reset_domain->sem);
>> +
>> +   for (i = 0; i < adev->reset_info.num_regs; i++) {
>> +   adev->reset_info.reset_dump_reg_value[i] =
>> +   RREG32(adev->reset_info.reset_dump_reg_list[i]);

A suspend also involves power/clock ungate. When reg dump is moved
earlier, I'm not sure if this read works for all. If it's left to
individual IP call backs, they could just do the same or better to move
these up before a dump.

amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);

Thanks,
Lijo

>> +
>> +   
>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>> +
>> adev->reset_info.reset_dump_reg_value[i]);
>> +   }
>> +
>> +   return 0;
>> +}
>> +
>>  int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>  struct amdgpu_reset_context *reset_context)
>>  {
>> int i, r = 0;
>> struct amdgpu_job *job = NULL;
>> +   struct amdgpu_device *tmp_adev = reset_context->reset_req_dev;
>> bool need_full_reset =
>> test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>>
>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
>> *adev,
>> }
>> }
>>
>> +   if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>> +   amdgpu_reset_reg_dumps(tmp_adev);
>> +
>> +   dev_info(tmp_adev->dev, "Dumping IP State\n");
>> +   /* Trigger ip dump before we reset the asic */
>> +   for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>> +   if 
>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>> +   
>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>> +   (void *)tmp_adev);
>> +   dev_info(tmp_adev->dev, "Dumping IP State 
>> Completed\n");
>> +   }
>> +
>> if (need_full_reset)
>> r = amdgpu_device_ip_suspend(adev);
>> if (need_full_reset)
>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct 
>> amdgpu_device *adev,
>> return r;
>>  }
>>
>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>> -{
>> -   int i;
>> -
>> -   lockdep_assert_held(&adev->reset_domain->sem);
>> -
>> -   for (i = 0; i < adev->reset_info.num_regs; i++) {
>> -   adev->reset_info.reset_dump_reg_value[i] =
>> -   RREG32(adev->reset_info.reset_dump_reg_list[i]);
>> -
>> -   
>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>> -
>> adev->reset_info.reset_dump_reg_value[i]);
>> -   }
>> -
>> -   return 0;
>> -}
>> -
>>  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>  struct amdgpu_reset_context *reset_context)
>>  {
>> struct amdgpu_device *tmp_adev = NULL;
>> bool need_full_reset, skip_hw_reset, vram_lost = false;
>> int r = 0;
>> -   uint32_t i;
>>
>> /* Try reset handler method first */
>> tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
>> reset_list);
>>
>> -   if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>> -   amdgpu_reset_reg_dumps(tmp_adev);
>> -
>> -   dev_info(tmp_adev->dev, "Dumping IP State\n");
>> -   /* Trigger ip dump before we reset the asic */
>> -   for (i = 0; i

Re: [PATCH 3/3] drm/amdgpu/vcn: Use offsets local to VCN/JPEG in VF

2024-07-17 Thread Lazar, Lijo



On 7/16/2024 2:17 PM, Jane Jian wrote:
> For VCN/JPEG 4.0.3, use only the local addressing scheme.
> 
> - Mask bit higher than AID0 range
> 
> v2
> remain the case for mmhub use master XCC
> 
> Signed-off-by: Jane Jian 

This patch is

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 19 --
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c  | 46 ++--
>  2 files changed, 60 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c 
> b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> index 30a143ab592d..ad524ddc9760 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> @@ -32,6 +32,9 @@
>  #include "vcn/vcn_4_0_3_sh_mask.h"
>  #include "ivsrcid/vcn/irqsrcs_vcn_4_0.h"
>  
> +#define NORMALIZE_JPEG_REG_OFFSET(offset) \
> + (offset & 0x1ffff)
> +
>  enum jpeg_engin_status {
>   UVD_PGFSM_STATUS__UVDJ_PWR_ON  = 0,
>   UVD_PGFSM_STATUS__UVDJ_PWR_OFF = 2,
> @@ -824,7 +827,13 @@ void jpeg_v4_0_3_dec_ring_emit_ib(struct amdgpu_ring 
> *ring,
>  void jpeg_v4_0_3_dec_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t 
> reg,
>   uint32_t val, uint32_t mask)
>  {
> - uint32_t reg_offset = (reg << 2);
> + uint32_t reg_offset;
> +
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_JPEG_REG_OFFSET(reg);
> +
> + reg_offset = (reg << 2);
>  
>   amdgpu_ring_write(ring, 
> PACKETJ(regUVD_JRBC_RB_COND_RD_TIMER_INTERNAL_OFFSET,
>   0, 0, PACKETJ_TYPE0));
> @@ -865,7 +874,13 @@ void jpeg_v4_0_3_dec_ring_emit_vm_flush(struct 
> amdgpu_ring *ring,
>  
>  void jpeg_v4_0_3_dec_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg, 
> uint32_t val)
>  {
> - uint32_t reg_offset = (reg << 2);
> + uint32_t reg_offset;
> +
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_JPEG_REG_OFFSET(reg);
> +
> + reg_offset = (reg << 2);
>  
>   amdgpu_ring_write(ring, 
> PACKETJ(regUVD_JRBC_EXTERNAL_REG_INTERNAL_OFFSET,
>   0, 0, PACKETJ_TYPE0));
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c 
> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> index 101b120f6fbd..9bae95538b62 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> @@ -45,6 +45,9 @@
>  #define VCN_VID_SOC_ADDRESS_2_0  0x1fb00
>  #define VCN1_VID_SOC_ADDRESS_3_0 0x48300
>  
> +#define NORMALIZE_VCN_REG_OFFSET(offset) \
> + (offset & 0x1ffff)
> +
>  static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev);
>  static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device *adev);
>  static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev);
> @@ -1375,6 +1378,43 @@ static uint64_t 
> vcn_v4_0_3_unified_ring_get_wptr(struct amdgpu_ring *ring)
>   regUVD_RB_WPTR);
>  }
>  
> +static void vcn_v4_0_3_enc_ring_emit_reg_wait(struct amdgpu_ring *ring, 
> uint32_t reg,
> + uint32_t val, uint32_t mask)
> +{
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_VCN_REG_OFFSET(reg);
> +
> + amdgpu_ring_write(ring, VCN_ENC_CMD_REG_WAIT);
> + amdgpu_ring_write(ring, reg << 2);
> + amdgpu_ring_write(ring, mask);
> + amdgpu_ring_write(ring, val);
> +}
> +
> +static void vcn_v4_0_3_enc_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t 
> reg, uint32_t val)
> +{
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_VCN_REG_OFFSET(reg);
> +
> + amdgpu_ring_write(ring, VCN_ENC_CMD_REG_WRITE);
> + amdgpu_ring_write(ring, reg << 2);
> + amdgpu_ring_write(ring, val);
> +}
> +
> +static void vcn_v4_0_3_enc_ring_emit_vm_flush(struct amdgpu_ring *ring,
> + unsigned int vmid, uint64_t pd_addr)
> +{
> + struct amdgpu_vmhub *hub = &ring->adev->vmhub[ring->vm_hub];
> +
> + pd_addr = amdgpu_gmc_emit_flush_gpu_tlb(ring, vmid, pd_addr);
> +
> + /* wait for reg writes */
> + vcn_v4_0_3_enc_ring_emit_reg_wait(ring, hub->ctx0_ptb_addr_lo32 +
> + vmid * hub->ctx_addr_distance,
> + lower_32_bits(pd_addr), 0xffffffff);
> +}
> +
>  static void vcn_v4_0_3_ring_emit_hdp_flush(struct amdgpu_ring *ring)
>  {
>   /* VCN engine access for HDP flush doesn't work when RRMT is enabled.
> @@ -1421,7 +1461,7 @@ static const struct amdgpu_ring_funcs 
> vcn_v4_0_3_unified_ring_vm_funcs = {
>   .emit_ib_size = 5, /* vcn_v2_0_enc_ring_emit_ib */
>   .emit_ib = vcn_v2_0_enc_ring_emit_ib,
>   .emit_fence = vcn_v2_0_enc_ring_emit_fence,
> - .emit_vm_flush = vcn_v2_0_enc_

Re: [PATCH 3/3] drm/amdgpu/vcn: Use offsets local to VCN/JPEG in VF

2024-07-16 Thread Lazar, Lijo



On 7/16/2024 1:29 PM, Jane Jian wrote:
> For VCN/JPEG 4.0.3, use only the local addressing scheme.
> 
> - Mask bit higher than AID0 range
> - Remove gmc v9 mmhub vmid replacement, since the bit will be masked later in 
> register write/wait
> 
> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c|  5 ---
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 19 --
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c  | 46 ++--
>  3 files changed, 60 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index b73136d390cc..2c7b4002ed72 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -844,11 +844,6 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
>   req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
>   ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
>  
> - if (vmhub >= AMDGPU_MMHUB0(0))
> - inst = 0;
> - else
> - inst = vmhub;
> -

This doesn't look correct. This is also used to identify the KIQ to be
used to perform flush operation and it goes through master XCC in case
of MMHUB.

Thanks,
Lijo

>   /* This is necessary for SRIOV as well as for GFXOFF to function
>* properly under bare metal
>*/
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c 
> b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> index 30a143ab592d..ad524ddc9760 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> @@ -32,6 +32,9 @@
>  #include "vcn/vcn_4_0_3_sh_mask.h"
>  #include "ivsrcid/vcn/irqsrcs_vcn_4_0.h"
>  
> +#define NORMALIZE_JPEG_REG_OFFSET(offset) \
> + (offset & 0x1ffff)
> +
>  enum jpeg_engin_status {
>   UVD_PGFSM_STATUS__UVDJ_PWR_ON  = 0,
>   UVD_PGFSM_STATUS__UVDJ_PWR_OFF = 2,
> @@ -824,7 +827,13 @@ void jpeg_v4_0_3_dec_ring_emit_ib(struct amdgpu_ring 
> *ring,
>  void jpeg_v4_0_3_dec_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t 
> reg,
>   uint32_t val, uint32_t mask)
>  {
> - uint32_t reg_offset = (reg << 2);
> + uint32_t reg_offset;
> +
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_JPEG_REG_OFFSET(reg);
> +
> + reg_offset = (reg << 2);
>  
>   amdgpu_ring_write(ring, 
> PACKETJ(regUVD_JRBC_RB_COND_RD_TIMER_INTERNAL_OFFSET,
>   0, 0, PACKETJ_TYPE0));
> @@ -865,7 +874,13 @@ void jpeg_v4_0_3_dec_ring_emit_vm_flush(struct 
> amdgpu_ring *ring,
>  
>  void jpeg_v4_0_3_dec_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg, 
> uint32_t val)
>  {
> - uint32_t reg_offset = (reg << 2);
> + uint32_t reg_offset;
> +
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_JPEG_REG_OFFSET(reg);
> +
> + reg_offset = (reg << 2);
>  
>   amdgpu_ring_write(ring, 
> PACKETJ(regUVD_JRBC_EXTERNAL_REG_INTERNAL_OFFSET,
>   0, 0, PACKETJ_TYPE0));
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c 
> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> index 101b120f6fbd..9bae95538b62 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> @@ -45,6 +45,9 @@
>  #define VCN_VID_SOC_ADDRESS_2_0  0x1fb00
>  #define VCN1_VID_SOC_ADDRESS_3_0 0x48300
>  
> +#define NORMALIZE_VCN_REG_OFFSET(offset) \
> + (offset & 0x1ffff)
> +
>  static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev);
>  static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device *adev);
>  static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev);
> @@ -1375,6 +1378,43 @@ static uint64_t 
> vcn_v4_0_3_unified_ring_get_wptr(struct amdgpu_ring *ring)
>   regUVD_RB_WPTR);
>  }
>  
> +static void vcn_v4_0_3_enc_ring_emit_reg_wait(struct amdgpu_ring *ring, 
> uint32_t reg,
> + uint32_t val, uint32_t mask)
> +{
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_VCN_REG_OFFSET(reg);
> +
> + amdgpu_ring_write(ring, VCN_ENC_CMD_REG_WAIT);
> + amdgpu_ring_write(ring, reg << 2);
> + amdgpu_ring_write(ring, mask);
> + amdgpu_ring_write(ring, val);
> +}
> +
> +static void vcn_v4_0_3_enc_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t 
> reg, uint32_t val)
> +{
> + /* For VF, only local offsets should be used */
> + if (amdgpu_sriov_vf(ring->adev))
> + reg = NORMALIZE_VCN_REG_OFFSET(reg);
> +
> + amdgpu_ring_write(ring, VCN_ENC_CMD_REG_WRITE);
> + amdgpu_ring_write(ring, reg << 2);
> + amdgpu_ring_write(ring, val);
> +}
> +
> +static void vcn_v4_0_3_enc_ring_emit_vm_flush(struct amdgpu_ring *ring,
> + 

Re: [PATCH 1/3] drm/amdgpu: Add empty HDP flush function to JPEG v4.0.3

2024-07-15 Thread Lazar, Lijo



On 7/15/2024 8:28 PM, Christian König wrote:
> 
> 
> Am 15.07.24 um 16:47 schrieb Jane Jian:
>> From: Lijo Lazar 
>>
>> JPEG v4.0.3 doesn't support HDP flush when RRMT is enabled. Instead,
>> mmsch fw will do the flush.
>>
>> Signed-off-by: Lijo Lazar 
>> Signed-off-by: Jane Jian 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 9 +
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>> b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>> index 04d8966423de..ea601047dab0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>> @@ -621,6 +621,14 @@ static uint64_t
>> jpeg_v4_0_3_dec_ring_get_wptr(struct amdgpu_ring *ring)
>>   ring->pipe ? (0x40 * ring->pipe - 0xc80) : 0);
>>   }
>>   +static void jpeg_v4_0_3_ring_emit_hdp_flush(struct amdgpu_ring *ring)
>> +{
>> +    /* VCN engine access for HDP flush doesn't work when RRMT is
>> enabled.
>> + * This is a workaround to avoid any HDP flush through VCN ring.
>> Instead
>> + * HDP flush will be done by driver while submitting doorbell.
> 
> I think that should read "HDP flush will be done by firmware ".
> 
> Or is it really the driver which should do this? In this case the patch
> here would be wrong.
> 

That's a copy-paste mistake. This comment was originally in the initial
version of the patch.

Discussed with Jane and she'll be sending a revised version. Also, there
is a third patch expected which does normalization of register offsets
when submitted through ring.

Thanks,
Lijo

> Regards,
> Christian.
> 
>> + */
>> +}
>> +
>>   /**
>>    * jpeg_v4_0_3_dec_ring_set_wptr - set write pointer
>>    *
>> @@ -1072,6 +1080,7 @@ static const struct amdgpu_ring_funcs
>> jpeg_v4_0_3_dec_ring_vm_funcs = {
>>   .emit_ib = jpeg_v4_0_3_dec_ring_emit_ib,
>>   .emit_fence = jpeg_v4_0_3_dec_ring_emit_fence,
>>   .emit_vm_flush = jpeg_v4_0_3_dec_ring_emit_vm_flush,
>> +    .emit_hdp_flush = jpeg_v4_0_3_ring_emit_hdp_flush,
>>   .test_ring = amdgpu_jpeg_dec_ring_test_ring,
>>   .test_ib = amdgpu_jpeg_dec_ring_test_ib,
>>   .insert_nop = jpeg_v4_0_3_dec_ring_nop,
> 


Re: [PATCH v2] drm/amd/pm: Ignore initial value in smu response register

2024-07-09 Thread Lazar, Lijo



On 7/8/2024 7:01 PM, Danijel Slivka wrote:
> Why:
> If the reg mmMP1_SMN_C2PMSG_90 is being written to during amdgpu driver
> load or driver unload, subsequent amdgpu driver load will fail at
> smu_hw_init. The default of mmMP1_SMN_C2PMSG_90 register at a clean
> environment is 0x1 and if value differs from expected, amdgpu driver
> load will fail.
> 
> How to fix:
> Ignore the initial value in smu response register before the first smu
> message is sent,if smc in SMU_FW_INIT state, just proceed further to
> send the message. If register holds an unexpected value after smu message
> was sent set, smc_state to SMU_FW_HANG state and no further smu messages
> will be sent.
> 
> v2:
> Set SMU_FW_INIT state at the start of smu hw_init/resume.
> Check smc_fw_state before sending smu message if in hang state skip
> sending message.
> Set SMU_FW_HANG only in case unexpected value is detected
> 
> Signed-off-by: Danijel Slivka 

Patch looks good to me

Reviewed-by: Lijo Lazar 

Copying Kenneth/Kevin as well.

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  2 ++
>  drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  7 
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 34 ---
>  3 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index d79bdb1e8cdf..fb8643d25d1b 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1755,6 +1755,8 @@ static int smu_start_smc_engine(struct smu_context *smu)
>   struct amdgpu_device *adev = smu->adev;
>   int ret = 0;
>  
> + smu->smc_fw_state = SMU_FW_INIT;
> +
>   if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
>   if (amdgpu_ip_version(adev, MP1_HWIP, 0) < IP_VERSION(11, 0, 
> 0)) {
>   if (smu->ppt_funcs->load_microcode) {
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> index a34c802f52be..b44a185d07e8 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> @@ -495,6 +495,12 @@ struct stb_context {
>   spinlock_t lock;
>  };
>  
> +enum smu_fw_status {
> + SMU_FW_INIT = 0,
> + SMU_FW_RUNTIME,
> + SMU_FW_HANG,
> +};
> +
>  #define WORKLOAD_POLICY_MAX 7
>  
>  /*
> @@ -562,6 +568,7 @@ struct smu_context {
>   uint32_t smc_fw_if_version;
>   uint32_t smc_fw_version;
>   uint32_t smc_fw_caps;
> + uint8_t smc_fw_state;
>  
>   bool uploading_custom_pp_table;
>   bool dc_controlled_by_gpio;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 5592fd825aa3..d7c983a1f3f5 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -315,11 +315,20 @@ int smu_cmn_send_msg_without_waiting(struct smu_context 
> *smu,
>   if (adev->no_hw_access)
>   return 0;
>  
> - reg = __smu_cmn_poll_stat(smu);
> - res = __smu_cmn_reg2errno(smu, reg);
> - if (reg == SMU_RESP_NONE ||
> - res == -EREMOTEIO)
> + if (smu->smc_fw_state == SMU_FW_HANG) {
> + dev_err(adev->dev, "SMU is in hanged state, failed to send smu 
> message!\n");
>   goto Out;
> + }
> +
> + if (smu->smc_fw_state == SMU_FW_INIT) {
> + smu->smc_fw_state = SMU_FW_RUNTIME;
> + } else {
> + reg = __smu_cmn_poll_stat(smu);
> + res = __smu_cmn_reg2errno(smu, reg);
> + if (reg == SMU_RESP_NONE || res == -EREMOTEIO)
> + goto Out;
> + }
> +
>   __smu_cmn_send_msg(smu, msg_index, param);
>   res = 0;
>  Out:
> @@ -350,6 +359,9 @@ int smu_cmn_wait_for_response(struct smu_context *smu)
>   reg = __smu_cmn_poll_stat(smu);
>   res = __smu_cmn_reg2errno(smu, reg);
>  
> + if (res == -EREMOTEIO)
> + smu->smc_fw_state = SMU_FW_HANG;
> +
>   if (unlikely(smu->adev->pm.smu_debug_mask & SMU_DEBUG_HALT_ON_ERROR) &&
>   res && (res != -ETIME)) {
>   amdgpu_device_halt(smu->adev);
> @@ -418,6 +430,15 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
> *smu,
>   goto Out;
>   }
>  
> + if (smu->smc_fw_state == SMU_FW_HANG) {
> + dev_err(adev->dev, "SMU is in hanged state, failed to send smu 
> message!\n");
> + goto Out;
> + } else if (smu->smc_fw_state == SMU_FW_INIT) {
> + /* Ignore initial smu response register value */
> + poll = false;
> + smu->smc_fw_state = SMU_FW_RUNTIME;
> + }
> +
>   if (poll) {
>   reg = __smu_cmn_poll_stat(smu);
>   res = __smu_cmn_reg2errno(smu, reg);
> @@ -429,8 +450,11 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
> *smu,
>   __smu_cmn_send_msg(smu, (uint16_t

RE: [PATCH] drm/amd/pm: Ignore initial value in smu response register

2024-07-08 Thread Lazar, Lijo
[Public]

One problem is that it also bypasses a valid 0 response, which usually means the
FW may not have completed processing the previous message.

What I thought was that it shouldn't even attempt sending a message if it has
identified a FW hang.

Is there a possibility of hitting the same problem whenever there is SRIOV full
access - as in before/after reset etc.?

If state == FW_INIT, ignore the response state before sending the message.
If there is no expected response to a message, move the state to FW_HANG. This
part is tricky, as what qualifies as a FW hang could change based on the
specific SOC's message. Avoiding a bool for this reason, to keep it open for
having other FW states.
If state == FW_HANG, don't even attempt to send the message.

Move the FW state to FW_INIT whenever there is an init/resume sequence -
hw_init/hw_resume?

Thanks,
Lijo
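
As a rough sketch of the state handling described above (the helper below is
hypothetical and only illustrates the intent; the enum values follow the v2
patch reviewed elsewhere in this thread):

/* Returns false when no message should be sent at all (FW already flagged
 * as hung).  On the first message after init/resume, skip polling the stale
 * response register and move straight to RUNTIME. */
static bool smu_cmn_fw_state_allows_msg(struct smu_context *smu, bool *poll)
{
	if (smu->smc_fw_state == SMU_FW_HANG)
		return false;

	if (smu->smc_fw_state == SMU_FW_INIT) {
		*poll = false;
		smu->smc_fw_state = SMU_FW_RUNTIME;
	}

	return true;
}
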
-Original Message-
From: amd-gfx  On Behalf Of Danijel 
Slivka
Sent: Monday, July 8, 2024 1:37 PM
To: amd-gfx@lists.freedesktop.org
Cc: Slivka, Danijel 
Subject: [PATCH] drm/amd/pm: Ignore initial value in smu response register

Why:
If the reg mmMP1_SMN_C2PMSG_90 is being written to during amdgpu driver load or 
driver unload, subsequent amdgpu driver load will fail at smu_hw_init. The 
default of mmMP1_SMN_C2PMSG_90 register at a clean environment is 0x1 and if 
value differs from expected, amdgpu driver load will fail.

How to fix:
Ignore the initial value in the smu response register before the first smu message
is sent, and proceed further to send the message. If the register holds
0x0 or an unexpected value after an smu message was sent, set the fw_state_hang
flag and no further smu messages will be sent.

Signed-off-by: Danijel Slivka 
---
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 1 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 7 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
index a34c802f52be..bfe08fa0db6d 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
@@ -562,6 +562,7 @@ struct smu_context {
uint32_t smc_fw_if_version;
uint32_t smc_fw_version;
uint32_t smc_fw_caps;
+   bool smc_fw_state_hang;

bool uploading_custom_pp_table;
bool dc_controlled_by_gpio;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 5592fd825aa3..9e4e62dcbee7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -421,7 +421,7 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
if (poll) {
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (reg == SMU_RESP_NONE || res == -EREMOTEIO) {
+   if ((reg == SMU_RESP_NONE || res == -EREMOTEIO) &&
+smu->smc_fw_state_hang) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
@@ -429,8 +429,11 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
*smu,
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res != 0)
+   if (res != 0) {
+   if (reg == SMU_RESP_NONE || res == -EREMOTEIO)
+   smu->smc_fw_state_hang = true;
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+   }
if (read_arg) {
smu_cmn_read_arg(smu, read_arg);
dev_dbg(adev->dev, "smu send message: %s(%d) param: 0x%08x, 
resp: 0x%08x,\
--
2.34.1



Re: [PATCH] drm/amdgpu: set CP_HQD_PQ_DOORBELL_CONTROL.DOORBELL_MODE to 1

2024-07-05 Thread Lazar, Lijo



On 7/4/2024 9:10 PM, Zhigang Luo wrote:
> to avoid reading wrong WPTR from doorbell in sriov vf, set
> CP_HQD_PQ_DOORBELL_CONTROL.DOORBELL_MODE to 1 to read WPTR from MQD.
> 
> Signed-off-by: Zhigang Luo 

Acked-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 3 +++
>  drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 8d8763ebe027..4556a1be5f71 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -1584,6 +1584,9 @@ static int gfx_v9_4_3_xcc_mqd_init(struct amdgpu_ring 
> *ring, int xcc_id)
>   DOORBELL_SOURCE, 0);
>   tmp = REG_SET_FIELD(tmp, CP_HQD_PQ_DOORBELL_CONTROL,
>   DOORBELL_HIT, 0);
> + if (amdgpu_sriov_vf(adev))
> + tmp = REG_SET_FIELD(tmp, CP_HQD_PQ_DOORBELL_CONTROL,
> + DOORBELL_MODE, 1);
>   } else {
>   tmp = REG_SET_FIELD(tmp, CP_HQD_PQ_DOORBELL_CONTROL,
>DOORBELL_EN, 0);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> index 399fa2106631..66c73825c0a0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> @@ -546,6 +546,9 @@ static void init_mqd_hiq_v9_4_3(struct mqd_manager *mm, 
> void **mqd,
>   m->cp_hqd_pq_control |= CP_HQD_PQ_CONTROL__NO_UPDATE_RPTR_MASK |
>   1 << 
> CP_HQD_PQ_CONTROL__PRIV_STATE__SHIFT |
>   1 << 
> CP_HQD_PQ_CONTROL__KMD_QUEUE__SHIFT;
> + if (amdgpu_sriov_vf(mm->dev->adev))
> + m->cp_hqd_pq_doorbell_control |= 1 <<
> + 
> CP_HQD_PQ_DOORBELL_CONTROL__DOORBELL_MODE__SHIFT;
>   m->cp_mqd_stride_size = kfd_hiq_mqd_stride(mm->dev);
>   if (xcc == 0) {
>   /* Set no_update_rptr = 0 in Master XCC */


Re: [PATCH] drm/amdgpu: normalize registers as local xcc to read/write in gfx_v9_4_3

2024-06-25 Thread Lazar, Lijo



On 6/26/2024 11:31 AM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
> 
> [HOW]
> normalize the registers to keep lower 16-bit (dword aligned) to avoid higher 
> bit violation
> RLCG will mask xcd out and always assume it's accessing its own xcd
> 
> v2
> add check in wait mem that only do the normalization on regspace
> 
> Signed-off-by: Jane Jian 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo
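
As a concrete illustration of the normalization above (the offset values are
made-up examples; the range constants are the ones defined in the patch below):

	uint32_t reg = 0x1A0C0;	/* offset carrying a higher (cross-XCC) bit */

	reg &= 0xffff;		/* -> 0xA0C0, inside gfxdec1 [0xA000, 0x10000) */
	/* 0xA0C0 is returned as the XCC-local offset.  An offset such as
	 * 0x5123 masks to itself, falls outside both gfxdec ranges, and is
	 * therefore returned unchanged. */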

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 33 +
>  1 file changed, 33 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 8d8763ebe027..1149595a02d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -55,6 +55,14 @@ MODULE_FIRMWARE("amdgpu/gc_9_4_4_rlc.bin");
>  #define mmSMNAID_XCD1_MCA_SMU 0x38430400 /* SMN AID XCD1 */
>  #define mmSMNXCD_XCD0_MCA_SMU 0x40430400 /* SMN XCD XCD0 */
>  
> +#define XCC_REG_RANGE_0_LOW  0x2000 /* XCC gfxdec0 lower Bound */
> +#define XCC_REG_RANGE_0_HIGH 0x3400 /* XCC gfxdec0 upper Bound */
> +#define XCC_REG_RANGE_1_LOW  0xA000 /* XCC gfxdec1 lower Bound */
> +#define XCC_REG_RANGE_1_HIGH 0x10000 /* XCC gfxdec1 upper Bound */
> +
> +#define NORMALIZE_XCC_REG_OFFSET(offset) \
> + (offset & 0x)
> +
>  struct amdgpu_gfx_ras gfx_v9_4_3_ras;
>  
>  static void gfx_v9_4_3_set_ring_funcs(struct amdgpu_device *adev);
> @@ -217,9 +225,24 @@ static void gfx_v9_4_3_init_golden_registers(struct 
> amdgpu_device *adev)
>   }
>  }
>  
> +static uint32_t gfx_v9_4_3_normalize_xcc_reg_offset(uint32_t reg)
> +{
> + uint32_t normalized_reg = NORMALIZE_XCC_REG_OFFSET(reg);
> +
> + /* If it is an XCC reg, normalize the reg to keep
> +lower 16 bits in local xcc */
> +
> + if (((normalized_reg >= XCC_REG_RANGE_0_LOW) && (normalized_reg < 
> XCC_REG_RANGE_0_HIGH)) ||
> + ((normalized_reg >= XCC_REG_RANGE_1_LOW) && (normalized_reg < 
> XCC_REG_RANGE_1_HIGH)))
> + return normalized_reg;
> + else
> + return reg;
> +}
> +
>  static void gfx_v9_4_3_write_data_to_reg(struct amdgpu_ring *ring, int 
> eng_sel,
>  bool wc, uint32_t reg, uint32_t val)
>  {
> + reg = gfx_v9_4_3_normalize_xcc_reg_offset(reg);
>   amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
>   amdgpu_ring_write(ring, WRITE_DATA_ENGINE_SEL(eng_sel) |
>   WRITE_DATA_DST_SEL(0) |
> @@ -234,6 +257,12 @@ static void gfx_v9_4_3_wait_reg_mem(struct amdgpu_ring 
> *ring, int eng_sel,
> uint32_t addr1, uint32_t ref, uint32_t mask,
> uint32_t inv)
>  {
> + /* Only do the normalization on regspace */
> + if (mem_space == 0) {
> + addr0 = gfx_v9_4_3_normalize_xcc_reg_offset(addr0);
> + addr1 = gfx_v9_4_3_normalize_xcc_reg_offset(addr1);
> + }
> +
>   amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
>   amdgpu_ring_write(ring,
>/* memory (1) or register (0) */
> @@ -2725,6 +2754,8 @@ static void gfx_v9_4_3_ring_emit_rreg(struct 
> amdgpu_ring *ring, uint32_t reg,
>  {
>   struct amdgpu_device *adev = ring->adev;
>  
> + reg = gfx_v9_4_3_normalize_xcc_reg_offset(reg);
> +
>   amdgpu_ring_write(ring, PACKET3(PACKET3_COPY_DATA, 4));
>   amdgpu_ring_write(ring, 0 | /* src: register*/
>   (5 << 8) |  /* dst: memory */
> @@ -2742,6 +2773,8 @@ static void gfx_v9_4_3_ring_emit_wreg(struct 
> amdgpu_ring *ring, uint32_t reg,
>  {
>   uint32_t cmd = 0;
>  
> + reg = gfx_v9_4_3_normalize_xcc_reg_offset(reg);
> +
>   switch (ring->funcs->type) {
>   case AMDGPU_RING_TYPE_GFX:
>   cmd = WRITE_DATA_ENGINE_SEL(1) | WR_CONFIRM;


Re: [PATCH] drm/amdgpu: normalize registers as local xcc to read/write in gfx_v9_4_3

2024-06-25 Thread Lazar, Lijo



On 6/26/2024 8:31 AM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
> 
> [HOW]
> normalize the registers to keep lower 16-bit (dword aligned) to avoid higher 
> bit violation
> RLCG will mask xcd out and always assume it's accessing its own xcd
> 
> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 29 +
>  1 file changed, 29 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 8d8763ebe027..87a6a610e467 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -55,6 +55,14 @@ MODULE_FIRMWARE("amdgpu/gc_9_4_4_rlc.bin");
>  #define mmSMNAID_XCD1_MCA_SMU 0x38430400 /* SMN AID XCD1 */
>  #define mmSMNXCD_XCD0_MCA_SMU 0x40430400 /* SMN XCD XCD0 */
>  
> +#define XCC_REG_RANGE_0_LOW  0x2000 /* XCC gfxdec0 lower Bound */
> +#define XCC_REG_RANGE_0_HIGH 0x3400 /* XCC gfxdec0 upper Bound */
> +#define XCC_REG_RANGE_1_LOW  0xA000 /* XCC gfxdec1 lower Bound */
> +#define XCC_REG_RANGE_1_HIGH 0x10000 /* XCC gfxdec1 upper Bound */
> +
> +#define NORMALIZE_XCC_REG_OFFSET(offset) \
> + (offset & 0xffff)
> +
>  struct amdgpu_gfx_ras gfx_v9_4_3_ras;
>  
>  static void gfx_v9_4_3_set_ring_funcs(struct amdgpu_device *adev);
> @@ -217,9 +225,24 @@ static void gfx_v9_4_3_init_golden_registers(struct 
> amdgpu_device *adev)
>   }
>  }
>  
> +static uint32_t gfx_v9_4_3_normalize_xcc_reg_offset(uint32_t reg)
> +{
> + uint32_t normalized_reg = NORMALIZE_XCC_REG_OFFSET(reg);
> +
> + /* If it is an XCC reg, normalize the reg to keep
> +lower 16 bits in local xcc */
> +
> + if (((normalized_reg >= XCC_REG_RANGE_0_LOW) && (normalized_reg < 
> XCC_REG_RANGE_0_HIGH)) ||
> + ((normalized_reg >= XCC_REG_RANGE_1_LOW) && (normalized_reg < 
> XCC_REG_RANGE_1_HIGH)))
> + return normalized_reg;
> + else
> + return reg;
> +}
> +
>  static void gfx_v9_4_3_write_data_to_reg(struct amdgpu_ring *ring, int 
> eng_sel,
>  bool wc, uint32_t reg, uint32_t val)
>  {
> + reg = gfx_v9_4_3_normalize_xcc_reg_offset(reg);
>   amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
>   amdgpu_ring_write(ring, WRITE_DATA_ENGINE_SEL(eng_sel) |
>   WRITE_DATA_DST_SEL(0) |
> @@ -234,6 +257,8 @@ static void gfx_v9_4_3_wait_reg_mem(struct amdgpu_ring 
> *ring, int eng_sel,
> uint32_t addr1, uint32_t ref, uint32_t mask,
> uint32_t inv)
>  {
> + addr0 = gfx_v9_4_3_normalize_xcc_reg_offset(addr0);
> + addr1 = gfx_v9_4_3_normalize_xcc_reg_offset(addr1);

I guess, this should be done only if it's regspace. Apart from that,
this looks good.

Thanks,
Lijo

>   amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
>   amdgpu_ring_write(ring,
>/* memory (1) or register (0) */
> @@ -2725,6 +2750,8 @@ static void gfx_v9_4_3_ring_emit_rreg(struct 
> amdgpu_ring *ring, uint32_t reg,
>  {
>   struct amdgpu_device *adev = ring->adev;
>  
> + reg = gfx_v9_4_3_normalize_xcc_reg_offset(reg);
> +
>   amdgpu_ring_write(ring, PACKET3(PACKET3_COPY_DATA, 4));
>   amdgpu_ring_write(ring, 0 | /* src: register*/
>   (5 << 8) |  /* dst: memory */
> @@ -2742,6 +2769,8 @@ static void gfx_v9_4_3_ring_emit_wreg(struct 
> amdgpu_ring *ring, uint32_t reg,
>  {
>   uint32_t cmd = 0;
>  
> + reg = gfx_v9_4_3_normalize_xcc_reg_offset(reg);
> +
>   switch (ring->funcs->type) {
>   case AMDGPU_RING_TYPE_GFX:
>   cmd = WRITE_DATA_ENGINE_SEL(1) | WR_CONFIRM;


Re: [PATCH] drm/amdgpu: drop kiq access while in reset

2024-06-24 Thread Lazar, Lijo



On 6/24/2024 5:31 PM, Christian König wrote:
> Am 24.06.24 um 13:57 schrieb Lazar, Lijo:
>> On 6/24/2024 5:19 PM, Christian König wrote:
>>> Am 24.06.24 um 11:52 schrieb Lazar, Lijo:
>>>> On 6/24/2024 3:08 PM, Christian König wrote:
>>>>> Am 24.06.24 um 08:34 schrieb Lazar, Lijo:
>>>>>> On 6/24/2024 12:01 PM, Vignesh Chander wrote:
>>>>>>> correctly handle the case when trylock fails when gpu is
>>>>>>> about to be reset by dropping the request instead of using mmio
>>>>>>>
>>>>>>> Signed-off-by: Vignesh Chander 
>>>>>> Reviewed-by: Lijo Lazar 
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38
>>>>>>> --
>>>>>>>     1 file changed, 21 insertions(+), 17 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index a7ce8280b17ce7..3cfd24699d691d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -613,10 +613,11 @@ uint32_t amdgpu_device_rreg(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>       if ((reg * 4) < adev->rmmio_size) {
>>>>>>>     if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>>>> -    amdgpu_sriov_runtime(adev) &&
>>>>>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> -    ret = amdgpu_kiq_rreg(adev, reg, 0);
>>>>>>> -    up_read(&adev->reset_domain->sem);
>>>>>>> +    amdgpu_sriov_runtime(adev) {
>>>>>>> +    if (down_read_trylock(&adev->reset_domain->sem)) {
>>>>>>> +    ret = amdgpu_kiq_rreg(adev, reg, 0);
>>>>>>> +    up_read(&adev->reset_domain->sem);
>>>>>>> +    }
>>>>> What has actually changed here? As far as I can see that isn't
>>>>> functionally different to the existing code.
>>>>>
>>>> In earlier logic, if it fails to get the lock, it takes the 'else'
>>>> path.
>>>> The 'else' path is not meant for sriov/vf.
>>> Yeah, but the read or write is then just silently dropped.
>>>
>>> That can't be correct either.
>>>
>> These are void funcs. Moreover, the drops will happen if there is
>> concurrent access from another thread while a reset is going on. That is
>> expected and those accesses during a reset won't help anyway.
> 
> Nope, Teddy has been working on that for a while as well.

This silent drop is already there in bare metal.

https://elixir.bootlin.com/linux/v6.10-rc5/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L738

> 
> Trying to make those accesses while the reset is going on is wrong in
> the first place. What we need to do is to grab the reset lock in the
> higher level function so that the whole sequences of reads and writes
> are protected.
> 
> What this logic here does is to use readl()/writel() from the reset
> thread itself. Dropping that is incorrect and could lead to broken reset.

This doesn't change anything for a reset thread. This fixes an already
broken path for sriov where it attempts a direct readl()/writel() if
taking the lock fails. That is even more broken.

Thanks,
Lijo
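
A sketch of the higher-level locking Christian describes, where a whole
read-modify-write sequence is done under the reset domain lock rather than
each access taking it individually (caller, register and field names are
placeholders, not an existing code path):

	if (down_read_trylock(&adev->reset_domain->sem)) {
		tmp = RREG32(reg);
		tmp = REG_SET_FIELD(tmp, SOME_REG, SOME_FIELD, val);
		WREG32(reg, tmp);
		up_read(&adev->reset_domain->sem);
	} else {
		/* reset in flight: let the caller retry or give up the sequence */
		ret = -EBUSY;
	}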

> 
> So clear NAK from my side to this patch here.
> 
> Regards,
> Christian.
> 
>>
>> Thanks,
>> Lijo
>>
>>> Regards,
>>> Christian.
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>>     } else {
>>>>>>>     ret = readl(((void __iomem *)adev->rmmio) + (reg
>>>>>>> * 4));
>>>>>>>     }
>>>>>>> @@ -681,10 +682,11 @@ uint32_t amdgpu_device_xcc_rreg(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>  &rlcg_flag)) {
>>>>>>>     ret = amdgpu_virt_rlcg

Re: [PATCH] drm/amdgpu: drop kiq access while in reset

2024-06-24 Thread Lazar, Lijo



On 6/24/2024 5:19 PM, Christian König wrote:
> Am 24.06.24 um 11:52 schrieb Lazar, Lijo:
>>
>> On 6/24/2024 3:08 PM, Christian König wrote:
>>> Am 24.06.24 um 08:34 schrieb Lazar, Lijo:
>>>> On 6/24/2024 12:01 PM, Vignesh Chander wrote:
>>>>> correctly handle the case when trylock fails when gpu is
>>>>> about to be reset by dropping the request instead of using mmio
>>>>>
>>>>> Signed-off-by: Vignesh Chander 
>>>> Reviewed-by: Lijo Lazar 
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38
>>>>> --
>>>>>    1 file changed, 21 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index a7ce8280b17ce7..3cfd24699d691d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -613,10 +613,11 @@ uint32_t amdgpu_device_rreg(struct
>>>>> amdgpu_device *adev,
>>>>>      if ((reg * 4) < adev->rmmio_size) {
>>>>>    if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>> -    amdgpu_sriov_runtime(adev) &&
>>>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>>>> -    ret = amdgpu_kiq_rreg(adev, reg, 0);
>>>>> -    up_read(&adev->reset_domain->sem);
>>>>> +    amdgpu_sriov_runtime(adev) {
>>>>> +    if (down_read_trylock(&adev->reset_domain->sem)) {
>>>>> +    ret = amdgpu_kiq_rreg(adev, reg, 0);
>>>>> +    up_read(&adev->reset_domain->sem);
>>>>> +    }
>>> What has actually changed here? As far as I can see that isn't
>>> functionally different to the existing code.
>>>
>> In earlier logic, if it fails to get the lock, it takes the 'else' path.
>> The 'else' path is not meant for sriov/vf.
> 
> Yeah, but the read or write is then just silently dropped.
> 
> That can't be correct either.
> 

These are void funcs. Moreover, the drops will happen if there is
concurrent access from another thread while a reset is going on. That is
expected and those accesses during a reset won't help anyway.

Thanks,
Lijo

> Regards,
> Christian.
> 
>>
>> Thanks,
>> Lijo
>>
>>> Regards,
>>> Christian.
>>>
>>>>>    } else {
>>>>>    ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
>>>>>    }
>>>>> @@ -681,10 +682,11 @@ uint32_t amdgpu_device_xcc_rreg(struct
>>>>> amdgpu_device *adev,
>>>>>     &rlcg_flag)) {
>>>>>    ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0, rlcg_flag,
>>>>> GET_INST(GC, xcc_id));
>>>>>    } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>> -    amdgpu_sriov_runtime(adev) &&
>>>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>>>> -    ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
>>>>> -    up_read(&adev->reset_domain->sem);
>>>>> +    amdgpu_sriov_runtime(adev) {
>>>>> +    if (down_read_trylock(&adev->reset_domain->sem)) {
>>>>> +    ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
>>>>> +    up_read(&adev->reset_domain->sem);
>>>>> +    }
>>>>>    } else {
>>>>>    ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
>>>>>    }
>>>>> @@ -740,10 +742,11 @@ void amdgpu_device_wreg(struct amdgpu_device
>>>>> *adev,
>>>>>      if ((reg * 4) < adev->rmmio_size) {
>>>>>    if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>>>> -    amdgpu_sriov_runtime(adev) &&
>>>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>>>> -    amdgpu_kiq_wreg(adev, reg, v, 0);
>>>>> -    up_read(&adev->reset_domain->sem);
>>>>>

Re: [PATCH] drm/amdgpu: drop kiq access while in reset

2024-06-24 Thread Lazar, Lijo



On 6/24/2024 3:08 PM, Christian König wrote:
> Am 24.06.24 um 08:34 schrieb Lazar, Lijo:
>>
>> On 6/24/2024 12:01 PM, Vignesh Chander wrote:
>>> correctly handle the case when trylock fails when gpu is
>>> about to be reset by dropping the request instead of using mmio
>>>
>>> Signed-off-by: Vignesh Chander 
>> Reviewed-by: Lijo Lazar 
>>
>> Thanks,
>> Lijo
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 --
>>>   1 file changed, 21 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index a7ce8280b17ce7..3cfd24699d691d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -613,10 +613,11 @@ uint32_t amdgpu_device_rreg(struct
>>> amdgpu_device *adev,
>>>     if ((reg * 4) < adev->rmmio_size) {
>>>   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>> -    amdgpu_sriov_runtime(adev) &&
>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>> -    ret = amdgpu_kiq_rreg(adev, reg, 0);
>>> -    up_read(&adev->reset_domain->sem);
>>> +    amdgpu_sriov_runtime(adev) {
>>> +    if (down_read_trylock(&adev->reset_domain->sem)) {
>>> +    ret = amdgpu_kiq_rreg(adev, reg, 0);
>>> +    up_read(&adev->reset_domain->sem);
>>> +    }
> 
> What has actually changed here? As far as I can see that isn't
> functionally different to the existing code.
> 

In earlier logic, if it fails to get the lock, it takes the 'else' path.
The 'else' path is not meant for sriov/vf.

Thanks,
Lijo
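
To make the difference concrete, the two shapes side by side, condensed from
the diff (the MMIO fallback keeps the original readl() computation):

	/* before: a SRIOV VF that lost the trylock fell through to direct MMIO */
	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
	    amdgpu_sriov_runtime(adev) &&
	    down_read_trylock(&adev->reset_domain->sem)) {
		ret = amdgpu_kiq_rreg(adev, reg, 0);
		up_read(&adev->reset_domain->sem);
	} else {
		ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
	}

	/* after: the VF owns the branch; a failed trylock (reset in progress)
	 * simply drops the access instead of going through MMIO */
	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
	    amdgpu_sriov_runtime(adev)) {
		if (down_read_trylock(&adev->reset_domain->sem)) {
			ret = amdgpu_kiq_rreg(adev, reg, 0);
			up_read(&adev->reset_domain->sem);
		}
	} else {
		ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
	}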

> Regards,
> Christian.
> 
>>>   } else {
>>>   ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
>>>   }
>>> @@ -681,10 +682,11 @@ uint32_t amdgpu_device_xcc_rreg(struct
>>> amdgpu_device *adev,
>>>    &rlcg_flag)) {
>>>   ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0, rlcg_flag,
>>> GET_INST(GC, xcc_id));
>>>   } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>> -    amdgpu_sriov_runtime(adev) &&
>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>> -    ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
>>> -    up_read(&adev->reset_domain->sem);
>>> +    amdgpu_sriov_runtime(adev) {
>>> +    if (down_read_trylock(&adev->reset_domain->sem)) {
>>> +    ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
>>> +    up_read(&adev->reset_domain->sem);
>>> +    }
>>>   } else {
>>>   ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
>>>   }
>>> @@ -740,10 +742,11 @@ void amdgpu_device_wreg(struct amdgpu_device
>>> *adev,
>>>     if ((reg * 4) < adev->rmmio_size) {
>>>   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>> -    amdgpu_sriov_runtime(adev) &&
>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>> -    amdgpu_kiq_wreg(adev, reg, v, 0);
>>> -    up_read(&adev->reset_domain->sem);
>>> +    amdgpu_sriov_runtime(adev) {
>>> +    if (down_read_trylock(&adev->reset_domain->sem)) {
>>> +    amdgpu_kiq_wreg(adev, reg, v, 0);
>>> +    up_read(&adev->reset_domain->sem);
>>> +    }
>>>   } else {
>>>   writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
>>>   }
>>> @@ -812,11 +815,12 @@ void amdgpu_device_xcc_wreg(struct
>>> amdgpu_device *adev,
>>>    &rlcg_flag)) {
>>>   amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag,
>>> GET_INST(GC, xcc_id));
>>>   } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
>>> -    amdgpu_sriov_runtime(adev) &&
>>> -    down_read_trylock(&adev->reset_domain->sem)) {
>>> -    amdgpu_kiq_wreg(adev, reg, v, xcc_id);
>>> -    up_read(&adev->reset_domain->sem);
>>> -    } else {
>>> +    amdgpu_sriov_runtime(adev) {
>>> +    if (down_read_trylock(&adev->reset_domain->sem)) {
>>> +    amdgpu_kiq_wreg(adev, reg, v, xcc_id);
>>> +    up_read(&adev->reset_domain->sem);
>>> +    }
>>> +    } else {
>>>   writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
>>>   }
>>>   } else {
> 


Re: [PATCH] drm/amdgpu: normalize registers as local xcc to read/write under sriov in TLB flush

2024-06-24 Thread Lazar, Lijo



On 6/21/2024 1:45 PM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
> 
> [HOW]
> normalize the registers to keep lower 16-bit (dword aligned) to avoid higher 
> bit violation
> RLCG will mask xcd out and always assume it's accessing its own xcd
> 
> [TODO]
> later will add the normalization in sriovw/rreg after fixing bugs
> 
> v2
> rename the normalized macro, add ip block type for further use
> move asics func declaration after ip block type since new func refers ip 
> block type
> add normalization in emit flush tlb as well
> 
> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h| 112 +++--
>  drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c |  16 +++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  32 --
>  drivers/gpu/drm/amd/amdgpu/soc15.c |   1 +
>  drivers/gpu/drm/amd/amdgpu/soc15.h |   1 +
>  drivers/gpu/drm/amd/amdgpu/soc15_common.h  |   5 +-
>  6 files changed, 101 insertions(+), 66 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 083f353cff6e..070fd9e601fe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -583,61 +583,6 @@ struct amdgpu_video_codecs {
>   const struct amdgpu_video_codec_info *codec_array;
>  };
>  
> -/*
> - * ASIC specific functions.
> - */
> -struct amdgpu_asic_funcs {
> - bool (*read_disabled_bios)(struct amdgpu_device *adev);
> - bool (*read_bios_from_rom)(struct amdgpu_device *adev,
> -u8 *bios, u32 length_bytes);
> - int (*read_register)(struct amdgpu_device *adev, u32 se_num,
> -  u32 sh_num, u32 reg_offset, u32 *value);
> - void (*set_vga_state)(struct amdgpu_device *adev, bool state);
> - int (*reset)(struct amdgpu_device *adev);
> - enum amd_reset_method (*reset_method)(struct amdgpu_device *adev);
> - /* get the reference clock */
> - u32 (*get_xclk)(struct amdgpu_device *adev);
> - /* MM block clocks */
> - int (*set_uvd_clocks)(struct amdgpu_device *adev, u32 vclk, u32 dclk);
> - int (*set_vce_clocks)(struct amdgpu_device *adev, u32 evclk, u32 ecclk);
> - /* static power management */
> - int (*get_pcie_lanes)(struct amdgpu_device *adev);
> - void (*set_pcie_lanes)(struct amdgpu_device *adev, int lanes);
> - /* get config memsize register */
> - u32 (*get_config_memsize)(struct amdgpu_device *adev);
> - /* flush hdp write queue */
> - void (*flush_hdp)(struct amdgpu_device *adev, struct amdgpu_ring *ring);
> - /* invalidate hdp read cache */
> - void (*invalidate_hdp)(struct amdgpu_device *adev,
> -struct amdgpu_ring *ring);
> - /* check if the asic needs a full reset of if soft reset will work */
> - bool (*need_full_reset)(struct amdgpu_device *adev);
> - /* initialize doorbell layout for specific asic*/
> - void (*init_doorbell_index)(struct amdgpu_device *adev);
> - /* PCIe bandwidth usage */
> - void (*get_pcie_usage)(struct amdgpu_device *adev, uint64_t *count0,
> -uint64_t *count1);
> - /* do we need to reset the asic at init time (e.g., kexec) */
> - bool (*need_reset_on_init)(struct amdgpu_device *adev);
> - /* PCIe replay counter */
> - uint64_t (*get_pcie_replay_count)(struct amdgpu_device *adev);
> - /* device supports BACO */
> - int (*supports_baco)(struct amdgpu_device *adev);
> - /* pre asic_init quirks */
> - void (*pre_asic_init)(struct amdgpu_device *adev);
> - /* enter/exit umd stable pstate */
> - int (*update_umd_stable_pstate)(struct amdgpu_device *adev, bool enter);
> - /* query video codecs */
> - int (*query_video_codecs)(struct amdgpu_device *adev, bool encode,
> -   const struct amdgpu_video_codecs **codecs);
> - /* encode "> 32bits" smn addressing */
> - u64 (*encode_ext_smn_addressing)(int ext_id);
> -
> - ssize_t (*get_reg_state)(struct amdgpu_device *adev,
> -  enum amdgpu_reg_state reg_state, void *buf,
> -  size_t max_size);
> -};
> -
>  /*
>   * IOCTL.
>   */
> @@ -728,6 +673,63 @@ enum amd_hw_ip_block_type {
>   MAX_HWIP
>  };
>  
> +/*
> + * ASIC specific functions.
> + */
> +struct amdgpu_asic_funcs {
> + bool (*read_disabled_bios)(struct amdgpu_device *adev);
> + bool (*read_bios_from_rom)(struct amdgpu_device *adev,
> +u8 *bios, u32 length_bytes);
> + int (*read_register)(struct amdgpu_device *adev, u32 se_num,
> +  u32 sh_num, u32 reg_offset, u32 *value);
> + void (*set_vga_state)(struct amdgpu_device *adev, bool state);
> + int (*reset)(struct amdgpu_device *adev);
> + enum amd_reset_method (*reset_method)(struct amdgpu_device *adev);
> + /* get the reference clock */
> +   

Re: [PATCH] drm/amdgpu: drop kiq access while in reset

2024-06-23 Thread Lazar, Lijo



On 6/24/2024 12:01 PM, Vignesh Chander wrote:
> correctly handle the case when trylock fails when gpu is
> about to be reset by dropping the request instead of using mmio
> 
> Signed-off-by: Vignesh Chander 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 --
>  1 file changed, 21 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a7ce8280b17ce7..3cfd24699d691d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -613,10 +613,11 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,
>  
>   if ((reg * 4) < adev->rmmio_size) {
>   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
> - amdgpu_sriov_runtime(adev) &&
> - down_read_trylock(&adev->reset_domain->sem)) {
> - ret = amdgpu_kiq_rreg(adev, reg, 0);
> - up_read(&adev->reset_domain->sem);
> + amdgpu_sriov_runtime(adev) {
> + if (down_read_trylock(&adev->reset_domain->sem)) {
> + ret = amdgpu_kiq_rreg(adev, reg, 0);
> + up_read(&adev->reset_domain->sem);
> + }
>   } else {
>   ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
>   }
> @@ -681,10 +682,11 @@ uint32_t amdgpu_device_xcc_rreg(struct amdgpu_device 
> *adev,
>&rlcg_flag)) {
>   ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0, rlcg_flag, 
> GET_INST(GC, xcc_id));
>   } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
> - amdgpu_sriov_runtime(adev) &&
> - down_read_trylock(&adev->reset_domain->sem)) {
> - ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
> - up_read(&adev->reset_domain->sem);
> + amdgpu_sriov_runtime(adev) {
> + if (down_read_trylock(&adev->reset_domain->sem)) {
> + ret = amdgpu_kiq_rreg(adev, reg, xcc_id);
> + up_read(&adev->reset_domain->sem);
> + }
>   } else {
>   ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
>   }
> @@ -740,10 +742,11 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
>  
>   if ((reg * 4) < adev->rmmio_size) {
>   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
> - amdgpu_sriov_runtime(adev) &&
> - down_read_trylock(&adev->reset_domain->sem)) {
> - amdgpu_kiq_wreg(adev, reg, v, 0);
> - up_read(&adev->reset_domain->sem);
> + amdgpu_sriov_runtime(adev) {
> + if (down_read_trylock(&adev->reset_domain->sem)) {
> + amdgpu_kiq_wreg(adev, reg, v, 0);
> + up_read(&adev->reset_domain->sem);
> + }
>   } else {
>   writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
>   }
> @@ -812,11 +815,12 @@ void amdgpu_device_xcc_wreg(struct amdgpu_device *adev,
>&rlcg_flag)) {
>   amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag, 
> GET_INST(GC, xcc_id));
>   } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
> - amdgpu_sriov_runtime(adev) &&
> - down_read_trylock(&adev->reset_domain->sem)) {
> - amdgpu_kiq_wreg(adev, reg, v, xcc_id);
> - up_read(&adev->reset_domain->sem);
> - } else {
> + amdgpu_sriov_runtime(adev) {
> + if (down_read_trylock(&adev->reset_domain->sem)) {
> + amdgpu_kiq_wreg(adev, reg, v, xcc_id);
> + up_read(&adev->reset_domain->sem);
> + }
> + } else {
>   writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
>   }
>   } else {


Re: [Patch v2 2/2] drm/amdgpu: Don't warn for compute mode switch under SRIOV

2024-06-23 Thread Lazar, Lijo



On 6/22/2024 9:17 PM, Rajneesh Bhardwaj wrote:
> Under SRIOV environment, the compute partition mode is setup by the
> host driver so state machine cached copy might be different when doing
> the transition for the first time.
> 
> Signed-off-by: Rajneesh Bhardwaj 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> index 2b99eed5ba19..c4a9669bceb0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> @@ -228,7 +228,8 @@ int amdgpu_xcp_query_partition_mode(struct amdgpu_xcp_mgr 
> *xcp_mgr, u32 flags)
>   if (!(flags & AMDGPU_XCP_FL_LOCKED))
>   mutex_lock(&xcp_mgr->xcp_lock);
>   mode = xcp_mgr->funcs->query_partition_mode(xcp_mgr);
> - if (xcp_mgr->mode != AMDGPU_XCP_MODE_TRANS && mode != xcp_mgr->mode)
> + if (xcp_mgr->mode != AMDGPU_XCP_MODE_TRANS && mode != xcp_mgr->mode
> + && !amdgpu_sriov_vf(xcp_mgr->adev))

This indicates a fundamental problem in host-guest mode - the host is
doing a switch without the guest's knowledge. This needs to be fixed
differently.

Thanks,
Lijo

>   dev_WARN(
>   xcp_mgr->adev->dev,
>   "Cached partition mode %d not matching with device mode 
> %d",


Re: [Patch v2 1/2] drm/amdgpu: Disable compute partition switch under SRIOV

2024-06-23 Thread Lazar, Lijo



On 6/22/2024 9:17 PM, Rajneesh Bhardwaj wrote:
> Do not allow the compute partition mode switch from the guest driver but
> still allow the query for current_compute_partition.
> 
> Signed-off-by: Rajneesh Bhardwaj 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 5 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 ++---
>  2 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 82452606ae6c..1c673c0b65d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1292,6 +1292,11 @@ static ssize_t amdgpu_gfx_set_compute_partition(struct 
> device *dev,
>   enum amdgpu_gfx_partition mode;
>   int ret = 0, num_xcc;
>  
> + /* Under SRIOV, this is allowed only via the host driver but not from
> +  * within the VF */
> + if (amdgpu_sriov_vf(adev))
> + return -EPERM;
> +

This is not the way to do this. Instead, disable the switch partition
callback in xcp_mgr for VF mode. That callback is what should be checked
to decide whether the sysfs node is read/write or read-only.
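Roughly something like this (untested sketch; it assumes the xcp_mgr funcs
table is writable during init and that switch_partition_mode is the callback
name):

	/* VF: keep the query path, drop the switch capability entirely */
	if (amdgpu_sriov_vf(xcp_mgr->adev))
		xcp_mgr->funcs->switch_partition_mode = NULL;

The sysfs code can then key the node permissions off the presence of that
callback instead of sprinkling amdgpu_sriov_vf() checks around.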

Thanks,
Lijo
>   num_xcc = NUM_XCC(adev->gfx.xcc_mask);
>   if (num_xcc % 2 != 0)
>   return -EINVAL;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 8d8763ebe027..f87dc1b9d77c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -936,11 +936,7 @@ static int gfx_v9_4_3_sw_init(void *handle)
>   if (r)
>   return r;
>  
> -
> - if (!amdgpu_sriov_vf(adev))
> - r = amdgpu_gfx_sysfs_init(adev);
> -
> - return r;
> + return amdgpu_gfx_sysfs_init(adev);
>  }
>  
>  static int gfx_v9_4_3_sw_fini(void *handle)
> @@ -961,8 +957,7 @@ static int gfx_v9_4_3_sw_fini(void *handle)
>   gfx_v9_4_3_mec_fini(adev);
>   amdgpu_bo_unref(&adev->gfx.rlc.clear_state_obj);
>   gfx_v9_4_3_free_microcode(adev);
> - if (!amdgpu_sriov_vf(adev))
> - amdgpu_gfx_sysfs_fini(adev);
> + amdgpu_gfx_sysfs_fini(adev);
>  
>   return 0;
>  }


Re: [PATCH] drm/amdgpu: part I - normalize registers as local xcc to read/write under sriov in TLB

2024-06-20 Thread Lazar, Lijo



On 6/19/2024 10:01 PM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
> 
> [HOW]
> normalize the registers to keep lower 16-bit(dword aligned) to aviod higher 
> bit violation
> RLCG will mask xcd out and always assume it's accessing its own xcd
> 
> [TODO]
> later will add the normalization in sriovw/rreg after fixing bugs
> 
> v2
> rename the normalized macro, add ip block type for further use
> 

In subject, part I etc. doesn't look good. Only put the intent - like
normalize xcc register offsets for TLB flush. Rest of the story may be
put in description.


> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
>  drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c | 16 
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 10 --
>  drivers/gpu/drm/amd/amdgpu/soc15.c |  1 +
>  drivers/gpu/drm/amd/amdgpu/soc15.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/soc15_common.h  |  5 -
>  6 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 083f353cff6e..eb2e7312bf1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -636,6 +636,8 @@ struct amdgpu_asic_funcs {
>   ssize_t (*get_reg_state)(struct amdgpu_device *adev,
>enum amdgpu_reg_state reg_state, void *buf,
>size_t max_size);
> + /* normalize offset to keep in lower 16-bit */
> + u32 (*normalize_reg_offset)(u32 hwip, u32 offset);

Please change the hwip argument type to enum amd_hw_ip_block_type.
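i.e., something like:

	u32 (*normalize_reg_offset)(enum amd_hw_ip_block_type hwip, u32 offset);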

>  };
>  
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c 
> b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> index 2c9a0aa41e2d..9b4bea2ca7df 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> @@ -1085,3 +1085,19 @@ ssize_t aqua_vanjaram_get_reg_state(struct 
> amdgpu_device *adev,
>  
>   return size;
>  }
> +
> +u32 aqua_vanjaram_normalize_reg_offset(u32 hwip, u32 offset)
> +{
> + u32 normalized_offset;
> +
> + switch (hwip) {
> + case GC_HWIP:
> + normalized_offset = offset & 0xffff;
> + break;
> + default:
> + normalized_offset = offset;
> + break;
> + }
> +
> + return normalized_offset;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 88b4644f8e96..1d24e19f304d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -853,8 +853,14 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
>*/
>   if (adev->gfx.kiq[inst].ring.sched.ready &&
>   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
> - uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
> - uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
> +
> + /* Select lower 16 bits to write in local xcc
> +  * for MMHUB it uses xcc0, NO cross AID reg offset
> +  */

The comment about MMHUB is confusing. The MMHUB offset in its current
form is SOC-wide, even though it is passed to xcc0. Better to remove it.

> + if (AMDGPU_IS_GFXHUB(vmhub)) {
> + req = NORMALIZE_XCC_REG_OFFSET(GC, req);

Since the IP is a parameter, naming the macro XCC_REG_OFFSET doesn't fit.

Thanks,
Lijo

> + ack = NORMALIZE_XCC_REG_OFFSET(GC, ack);
> + }
>  
>   amdgpu_gmc_fw_reg_write_reg_wait(adev, req, ack, inv_req,
>1 << vmid, inst);
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 8d16dacdc172..e6e61fc77080 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -927,6 +927,7 @@ static const struct amdgpu_asic_funcs 
> aqua_vanjaram_asic_funcs =
>   .query_video_codecs = &soc15_query_video_codecs,
>   .encode_ext_smn_addressing = &aqua_vanjaram_encode_ext_smn_addressing,
>   .get_reg_state = &aqua_vanjaram_get_reg_state,
> + .normalize_reg_offset = &aqua_vanjaram_normalize_reg_offset,
>  };
>  
>  static int soc15_common_early_init(void *handle)
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.h 
> b/drivers/gpu/drm/amd/amdgpu/soc15.h
> index 282584a48be0..f1e974604e3e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.h
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.h
> @@ -124,4 +124,5 @@ ssize_t aqua_vanjaram_get_reg_state(struct amdgpu_device 
> *adev,
>  void vega10_doorbell_index_init(struct amdgpu_device *adev);
>  void vega20_doorbell_index_init(struct amdgpu_device *adev);
>  void aqua_vanjaram_doorbell_index_init(struct amdgpu_device *adev);
> +u32 aqua_vanjaram_normalize_reg_offset(u32 hwip, u32 offset);
>  #endif
> diff

Re: [PATCH] drm/amdgpu: process RAS fatal error MB notification

2024-06-19 Thread Lazar, Lijo



On 6/19/2024 2:44 AM, Vignesh Chander wrote:
> For RAS error scenario, VF guest driver will check mailbox
> and set fed flag to avoid unnecessary HW accesses.
> additionally, poll for reset completion message first
> to avoid accidentally spamming multiple reset requests to host.
> 
> v2: add another mailbox check for handling case where kfd detects
> timeout first
> 
> Signed-off-by: Vignesh Chander 
> Change-Id: Ib501c653265883999c62a12a209ce5eb81c80846
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 25 +---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h |  4 +++-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c| 22 +++--
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h|  4 +++-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c| 22 +++--
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h|  3 ++-
>  6 files changed, 70 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 63f2286858c484..ccb3d041c2b249 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -229,6 +229,22 @@ void amdgpu_virt_free_mm_table(struct amdgpu_device 
> *adev)
>   adev->virt.mm_table.gpu_addr = 0;
>  }
>  
> +/**
> + * amdgpu_virt_rcvd_ras_interrupt() - receive ras interrupt
> + * @adev:amdgpu device.
> + * Check whether host sent RAS error message
> + * Return: true if found, otherwise false
> + */
> +bool amdgpu_virt_rcvd_ras_interrupt(struct amdgpu_device *adev)
> +{
> + struct amdgpu_virt *virt = &adev->virt;
> +
> + if (!virt->ops || !virt->ops->rcvd_ras_intr)
> + return false;
> +
> + return virt->ops->rcvd_ras_intr(adev);
> +}
> +
>  >  unsigned int amd_sriov_msg_checksum(void *obj,
>   unsigned long obj_size,
> @@ -612,11 +628,14 @@ static void amdgpu_virt_update_vf2pf_work_item(struct 
> work_struct *work)
>   ret = amdgpu_virt_read_pf2vf_data(adev);
>   if (ret) {
>   adev->virt.vf2pf_update_retry_cnt++;
> - if ((adev->virt.vf2pf_update_retry_cnt >= 
> AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
> - amdgpu_sriov_runtime(adev)) {
> +
> + if ((amdgpu_virt_rcvd_ras_interrupt(adev) ||
> + adev->virt.vf2pf_update_retry_cnt >= 
> AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
> + amdgpu_sriov_runtime(adev)) {
> +
>   amdgpu_ras_set_fed(adev, true);
>   if (amdgpu_reset_domain_schedule(adev->reset_domain,
> -   
> &adev->kfd.reset_work))
> + &adev->kfd.reset_work))

Instead of this and the waits below, what about checking the status in
gpu_recover() or in device_reset_sriov()? Those get called for resets
initiated from all sources.

Setting the flag means it will wait for FLR completion.

	/* Actual ASIC resets if needed.*/
	/* Host driver will handle XGMI hive reset for SRIOV */
	if (amdgpu_sriov_vf(adev)) {
+
+		/* RAS error is equivalent to FLR initiated from host,
+		 * wait for completion
+		 */
+		if (amdgpu_virt_rcvd_ras_interrupt(adev) ||
+		    amdgpu_ras_get_fed_status(adev))
+			set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
+


Thanks,
Lijo
>   return;
>   else
>   dev_err(adev->dev, "Failed to queue work! at 
> %s", __func__);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index f04cd1586c7220..b42a8854dca0cb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -52,7 +52,7 @@
>  /* tonga/fiji use this offset */
>  #define mmBIF_IOV_FUNC_IDENTIFIER 0x1503
>  
> -#define AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT 5
> +#define AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT 2
>  
>  enum amdgpu_sriov_vf_mode {
>   SRIOV_VF_MODE_BARE_METAL = 0,
> @@ -94,6 +94,7 @@ struct amdgpu_virt_ops {
> u32 data1, u32 data2, u32 data3);
>   void (*ras_poison_handler)(struct amdgpu_device *adev,
>   enum amdgpu_ras_block block);
> + bool (*rcvd_ras_intr)(struct amdgpu_device *adev);
>  };
>  
>  /*
> @@ -352,6 +353,7 @@ void amdgpu_virt_ready_to_reset(struct amdgpu_device 
> *adev);
>  int amdgpu_virt_wait_reset(struct amdgpu_device *adev);
>  int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
>  void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
> +bool amdgpu_virt_rcvd_ras_interrupt(struct amdgpu_device *adev);
>  void amdgpu_virt_release_ras_err_handler_data(struct amdgpu_device *adev);
>  void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev);
>  void amdgpu_virt_exchange_data(struct amdgpu_device *adev);
> diff --g

Re: [PATCH] drm/amdgpu: part I - normalize registers as local xcc to read/write under sriov in TLB

2024-06-19 Thread Lazar, Lijo



On 6/19/2024 3:25 PM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
> 
> [HOW]
> normalize the registers to keep lower 16-bit(dword aligned) to aviod higher 
> bit violation
> RLCG will mask xcd out and always assume it's accessing its own xcd
> 
> [TODO]
> later will add the normalization in sriovw/rreg after fixing bugs
> 
> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
>  drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c |  9 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 10 --
>  drivers/gpu/drm/amd/amdgpu/soc15.c |  1 +
>  drivers/gpu/drm/amd/amdgpu/soc15.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/soc15_common.h  |  3 +++
>  6 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 083f353cff6e..da8d3669cc23 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -632,6 +632,8 @@ struct amdgpu_asic_funcs {
> const struct amdgpu_video_codecs **codecs);
>   /* encode "> 32bits" smn addressing */
>   u64 (*encode_ext_smn_addressing)(int ext_id);
> + /* normalize offset to keep in lower 16-bit */
> + u32 (*normalize_xcc_reg_offset)(u32 offset);

Suggest to rename to normalize_reg_offset() and add enum
amd_hw_ip_block_type as well. If required, the same callback could be
used for other IPs also.

>  
>   ssize_t (*get_reg_state)(struct amdgpu_device *adev,
>enum amdgpu_reg_state reg_state, void *buf,
> diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c 
> b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> index 2c9a0aa41e2d..3306df74457b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> @@ -1085,3 +1085,12 @@ ssize_t aqua_vanjaram_get_reg_state(struct 
> amdgpu_device *adev,
>  
>   return size;
>  }
> +
> +u32 aqua_vanjaram_normalize_xcc_reg_offset(u32 offset)
> +{
> + u32 normalized_offset;
> +
> + normalized_offset = offset & 0xffff;
> +
> + return normalized_offset;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 88b4644f8e96..fba2e4ad58db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -853,8 +853,14 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
>*/
>   if (adev->gfx.kiq[inst].ring.sched.ready &&
>   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
> - uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
> - uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
> +
> + /* Select lower 16 bits to write in local xcc
> +  * for MMHUB it uses xcc0, NO cross AID reg offset
> +  */
> + if (AMDGPU_IS_GFXHUB(vmhub)) {
> + req = NORMALIZE_XCC_REG_OFFSET(req);
> + ack = NORMALIZE_XCC_REG_OFFSET(ack);
> + }
>  
>   amdgpu_gmc_fw_reg_write_reg_wait(adev, req, ack, inv_req,
>1 << vmid, inst);
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 8d16dacdc172..31037f068902 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -927,6 +927,7 @@ static const struct amdgpu_asic_funcs 
> aqua_vanjaram_asic_funcs =
>   .query_video_codecs = &soc15_query_video_codecs,
>   .encode_ext_smn_addressing = &aqua_vanjaram_encode_ext_smn_addressing,
>   .get_reg_state = &aqua_vanjaram_get_reg_state,
> + .normalize_xcc_reg_offset = &aqua_vanjaram_normalize_xcc_reg_offset,
>  };
>  
>  static int soc15_common_early_init(void *handle)
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.h 
> b/drivers/gpu/drm/amd/amdgpu/soc15.h
> index 282584a48be0..0d405a474283 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.h
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.h
> @@ -124,4 +124,5 @@ ssize_t aqua_vanjaram_get_reg_state(struct amdgpu_device 
> *adev,
>  void vega10_doorbell_index_init(struct amdgpu_device *adev);
>  void vega20_doorbell_index_init(struct amdgpu_device *adev);
>  void aqua_vanjaram_doorbell_index_init(struct amdgpu_device *adev);
> +u32 aqua_vanjaram_normalize_xcc_reg_offset(u32 offset);
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15_common.h 
> b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
> index 242b24f73c17..43887836377d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15_common.h
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
> @@ -210,4 +210,7 @@
>  #define WREG64_MCA(ext, mca_base, idx, val) \
>   WREG64_PCIE_EXT(adev->asic_funcs->encode_ext_smn_addressing(ext) + 
> mca_base + (idx * 8), val)
>  
> +#define NORMALIZE_XCC_REG_OFFSET(offse

Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

2024-06-18 Thread Lazar, Lijo



On 6/18/2024 4:51 PM, Chai, Thomas wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> -
> Best Regards,
> Thomas
> 
> -Original Message-
> From: Chai, Thomas
> Sent: Tuesday, June 18, 2024 7:09 PM
> To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao ; 
> Li, Candice ; Wang, Yang(Kevin) ; 
> Yang, Stanley 
> Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
> complete
> 
> 
> 
> 
> -
> Best Regards,
> Thomas
> 
> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, June 18, 2024 6:09 PM
> To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao ; 
> Li, Candice ; Wang, Yang(Kevin) ; 
> Yang, Stanley 
> Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
> complete
> 
> 
> 
> On 6/18/2024 12:03 PM, YiPeng Chai wrote:
>> Add completion to wait for ras reset to complete.
>>
>> Signed-off-by: YiPeng Chai 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 898889600771..7f8e6ca07957 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if
>> *ras_block)
>>
>>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>>
>> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  12 //ms
>> +
>>  enum amdgpu_ras_retire_page_reservation {
>>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
>> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct 
>> *work)
>>   atomic_set(&hive->ras_recovery, 0);
>>   amdgpu_put_xgmi_hive(hive);
>>   }
>> +
>> + complete_all(&ras->ras_recovery_completion);
>>  }
>>
>>  /* alloc/realloc bps array */
>> @@ -2911,10 +2915,16 @@ static int
>> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>>
>>   flush_delayed_work(&con->page_retirement_dwork);
>>
>> + reinit_completion(&con->ras_recovery_completion);
>> +
>>   con->gpu_reset_flags |= reset;
>>   amdgpu_ras_reset_gpu(adev);
>>
>>   *gpu_reset = reset;
>> + if (!wait_for_completion_timeout(&con->ras_recovery_completion,
>> + 
>> msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
>> + dev_err(adev->dev, "Waiting for GPU to complete ras 
>> reset timeout! reset:0x%x\n",
>> + reset);
> 
>> If a mode-1 reset gets to execute first due to job timeout/hws detect cases 
>> in poison timeout, then the ras handler will never get executed.
>> Why this wait is required?
> 
>> Thanks,
>> Lijo
> 
> [Thomas]  "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception 
> handling" add the check before ras gpu reset.
> Poison ras reset is different from reset triggered by other 
> fatal errors, and all poison RAS resets are triggered from here,
>  in order to distinguish other gpu resets and facilitate 
> subsequent  code processing, so add wait for gpu ras reset here.
> 

The reset mechanism resets the GPU state - whether it's triggered due to
poison or fatal errors. As soon as the device is reset successfully, GPU
operations can continue. So why does there need to be a special wait for
a poison-triggered reset alone? Why not wait on the RAS recovery work
object rather than another completion notification?
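For example (rough, untested sketch; assumes recovery_work is the work item
that amdgpu_ras_reset_gpu() schedules):

	con->gpu_reset_flags |= reset;
	amdgpu_ras_reset_gpu(adev);

	/* wait on the recovery work itself instead of a new completion */
	flush_work(&con->recovery_work);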

Thanks,
Lijo

>>   }
>>
>>   return 0;
>> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device 
>> *adev)
>>   }
>>   }
>>
>> + init_completion(&con->ras_recovery_completion);
>>   mutex_init(&con->page_rsv_lock);
>>   INIT_KFIFO(con->poison_fifo);
>>   mutex_init(&con->page_retirement_lock);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 91daf48be03a..b47f03edac87 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>>   DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>>   struct ras_ecc_log_info  umc_ecc_log;
>>   struct delayed_work page_retirement_dwork;
>> + struct completion ras_recovery_completion;
>>
>>   /* Fatal error detected flag */
>>   atomic_t fed;


Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

2024-06-18 Thread Lazar, Lijo



On 6/18/2024 12:03 PM, YiPeng Chai wrote:
> Add completion to wait for ras reset to complete.
> 
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 898889600771..7f8e6ca07957 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if 
> *ras_block)
>  
>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>  
> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  12 //ms
> +
>  enum amdgpu_ras_retire_page_reservation {
>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct 
> *work)
>   atomic_set(&hive->ras_recovery, 0);
>   amdgpu_put_xgmi_hive(hive);
>   }
> +
> + complete_all(&ras->ras_recovery_completion);
>  }
>  
>  /* alloc/realloc bps array */
> @@ -2911,10 +2915,16 @@ static int 
> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>  
>   flush_delayed_work(&con->page_retirement_dwork);
>  
> + reinit_completion(&con->ras_recovery_completion);
> +
>   con->gpu_reset_flags |= reset;
>   amdgpu_ras_reset_gpu(adev);
>  
>   *gpu_reset = reset;
> + if (!wait_for_completion_timeout(&con->ras_recovery_completion,
> + 
> msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
> + dev_err(adev->dev, "Waiting for GPU to complete ras 
> reset timeout! reset:0x%x\n",
> + reset);

If a mode-1 reset gets to execute first due to job timeout/HWS detect
cases during poison timeout, then the RAS handler will never get executed.
Why is this wait required?

Thanks,
Lijo

>   }
>  
>   return 0;
> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>   }
>   }
>  
> + init_completion(&con->ras_recovery_completion);
>   mutex_init(&con->page_rsv_lock);
>   INIT_KFIFO(con->poison_fifo);
>   mutex_init(&con->page_retirement_lock);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 91daf48be03a..b47f03edac87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>   DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>   struct ras_ecc_log_info  umc_ecc_log;
>   struct delayed_work page_retirement_dwork;
> + struct completion ras_recovery_completion;
>  
>   /* Fatal error detected flag */
>   atomic_t fed;


Re: [PATCH] drm/amdgpu: normalize registers as local xcc to read/write under sriov

2024-06-18 Thread Lazar, Lijo



On 6/17/2024 3:41 PM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
> 
> [HOW]
> normalize the registers to keep lower 16-bit(dword aligned) to aviod higher 
> bit violation
> RLCG will mask xcd out and always assume it's accessing its own xcd
> 
> also fix the typo in sriov_w/rreg:
> for KIQ case, use xcc with xcc_id to read and write
> 
> v2
> amend some typos
> 
> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c  | 12 ++--
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |  8 ++--
>  drivers/gpu/drm/amd/amdgpu/soc15_common.h |  2 ++
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 63f2286858c4..d43652a38484 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -1075,6 +1075,10 @@ void amdgpu_sriov_wreg(struct amdgpu_device *adev,
>   if (amdgpu_device_skip_hw_access(adev))
>   return;
>  
> + /* Select lower 16 bits to write in local xcc */
> + if ((hwip == GC_HWIP) && !(acc_flags & AMDGPU_REGS_NO_KIQ))
> + offset = NORMALIZE_XCC_REG_OFFSET(offset);

This cannot be generalized. Instead, use a similar approach of having an
SOC-specific function, like adev->asic_funcs->encode_ext_smn_addressing.
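i.e., roughly (sketch only, assuming a new normalize_reg_offset asic callback):

	if ((hwip == GC_HWIP) && adev->asic_funcs->normalize_reg_offset)
		offset = adev->asic_funcs->normalize_reg_offset(hwip, offset);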

> +
>   if (!amdgpu_sriov_runtime(adev) &&
>   amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, 
> true, &rlcg_flag)) {
>   amdgpu_virt_rlcg_reg_rw(adev, offset, value, rlcg_flag, xcc_id);
> @@ -1084,7 +1088,7 @@ void amdgpu_sriov_wreg(struct amdgpu_device *adev,
>   if (acc_flags & AMDGPU_REGS_NO_KIQ)
>   WREG32_NO_KIQ(offset, value);
>   else
> - WREG32(offset, value);
> + WREG32_XCC(offset, value, xcc_id);

This doesn't look correct. AFAIU, this macro is specifically for XCC
registers. amdgpu_sriov_wreg can be called with registers other than hwip == GC_HWIP.
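One way to keep the non-GC path untouched would be something like (sketch):

	if (acc_flags & AMDGPU_REGS_NO_KIQ)
		WREG32_NO_KIQ(offset, value);
	else if (hwip == GC_HWIP)
		WREG32_XCC(offset, value, xcc_id);
	else
		WREG32(offset, value);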

>  }
>  
>  u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
> @@ -1095,6 +1099,10 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
>   if (amdgpu_device_skip_hw_access(adev))
>   return 0;
>  
> + /* Select lower 16 bits to read in local xcc */
> + if ((hwip == GC_HWIP) && !(acc_flags & AMDGPU_REGS_NO_KIQ))
> + offset = NORMALIZE_XCC_REG_OFFSET(offset);
> +
>   if (!amdgpu_sriov_runtime(adev) &&
>   amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, 
> false, &rlcg_flag))
>   return amdgpu_virt_rlcg_reg_rw(adev, offset, 0, rlcg_flag, 
> xcc_id);
> @@ -1102,7 +1110,7 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
>   if (acc_flags & AMDGPU_REGS_NO_KIQ)
>   return RREG32_NO_KIQ(offset);
>   else
> - return RREG32(offset);
> + return RREG32_XCC(offset, xcc_id);
>  }
>  
>  bool amdgpu_sriov_xnack_support(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 88b4644f8e96..5bb275b96e6a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -853,8 +853,12 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
>*/
>   if (adev->gfx.kiq[inst].ring.sched.ready &&
>   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
> - uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
> - uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
> +
> + /* Select lower 16 bits to write in local xcc */
> + if (AMDGPU_IS_GFXHUB(vmhub)) {
> + req = NORMALIZE_XCC_REG_OFFSET(req);
> + ack = NORMALIZE_XCC_REG_OFFSET(ack);
> + }
Not sure if there are other things to check like cross AID register
offsets for MMHUB.

Thanks,
Lijo
>  
>   amdgpu_gmc_fw_reg_write_reg_wait(adev, req, ack, inv_req,
>1 << vmid, inst);
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15_common.h 
> b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
> index 242b24f73c17..9ddf68e7d06e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15_common.h
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
> @@ -210,4 +210,6 @@
>  #define WREG64_MCA(ext, mca_base, idx, val) \
>   WREG64_PCIE_EXT(adev->asic_funcs->encode_ext_smn_addressing(ext) + 
> mca_base + (idx * 8), val)
>  
> +#define NORMALIZE_XCC_REG_OFFSET(offset) (offset & 0xffff)
> +
>  #endif


Re: [PATCH] drm/amdgpu: keep init xcc0 for all xccs under sriov

2024-06-16 Thread Lazar, Lijo



On 6/17/2024 8:58 AM, Chang, HaiJun wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> 
> Hi Lijo,
> 
>  
> 
> Right, 18bits are byte aligned range of local XCC register, 16bites are
> dword aligned offset range
> 
>  
> 
> We find the normalization needs to be applied to many functions, like
> 
>   * KIQ: amdgpu_kiq_r/wreg/
>   * RLC: amdgpu_virt_rlcg_reg_rw
>   * KIQ: amdgpu_gmc_fw_reg_write_reg_wait
>   * KIQ:
> 
> amdgpu_ring_emit_reg_write_reg_wait/amdgpu_ring_emit_reg_wait/amdgpu_ring_emit_wreg
> 
>  
> 
> For sriov gfx register access, it only has 2 ways: rlc or kiq.  Both of
> the ways can use local xcc offset,  so we think it’s simpler change to
> init the gfx register offsets with local xcc offset only.
>

Ok, is this the only place? What about other calls in gfx_v9_4_3 like
WREG32_SOC15_RLC/WREG32_SOC15 etc.?

Thanks,
Lijo

>  
> 
> Thanks,
> 
> HaiJun
> 
>  
> 
> *From:*Lazar, Lijo 
> *Sent:* Saturday, June 15, 2024 10:09 AM
> *To:* Jian, Jane ; Chang, HaiJun
> ; Zhao, Victor 
> *Cc:* amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] drm/amdgpu: keep init xcc0 for all xccs under sriov
> 
>  
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
>  
> 
> Never mind, bit 16 and above is probably because of dword aligned offset.
> 
>  
> 
> Any reason not to do this in kiq/rlc based writes to normalise all?
> 
>  
> 
> Thanks,
> 
> Lijo
> 
> 
> 
> *From:*Lazar, Lijo
> *Sent:* Friday, June 14, 2024 5:20:30 PM
> *To:* Jian, Jane <jane.j...@amd.com>; Chang, HaiJun <haijun.ch...@amd.com>; Zhao, Victor <victor.z...@amd.com>
> *Cc:* amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH] drm/amdgpu: keep init xcc0 for all xccs under sriov
> 
>  
> 
> 
> 
> On 6/14/2024 4:40 PM, Jane Jian wrote:
>> [WHY]
>> sriov has the higher bit violation when flushing tlb
>> 
>> [HOW]
>> for sriov only init XCC0(lower 16-bit) for all XCCs to avoid higher bit 
>> violation
>> since kiq ring is always local, local address without XCC ID is enough to be 
>> sent to the XCC KIQ
>> 
> 
> The description is incorrect.
> 
> Bits 18:20 represent xcc id. To guarantee all paths pass a local
> address, you should just strip bits 18:20 in kiq/rlcg read/write
> functions rather than here.
> 
> Thanks,
> Lijo
> 
>> Signed-off-by: Jane Jian <jane.j...@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c | 23 +++
>>  1 file changed, 15 insertions(+), 8 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c 
>> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> index e14acab5cceb..4e38a66a52f4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
>> @@ -537,29 +537,36 @@ static void gfxhub_v1_2_xcc_init(struct amdgpu_device 
>> *adev, uint32_t xcc_mask)
>>  {
>>    struct amdgpu_vmhub *hub;
>>    int i;
>> + uint32_t gc_index;
>>  
>>    for_each_inst(i, xcc_mask) {
>>    hub = &adev->vmhub[AMDGPU_GFXHUB(i)];
>>  
>> + /* for sriov only init XCC0(lower 16-bit) to avoid higher bit 
>> violation */
>> + if (amdgpu_sriov_vf(adev))
>> + gc_index = 0;
>> + else
>> + gc_index = GET_INST(GC, i);
>> +
>>    hub->ctx0_ptb_addr_lo32 =
>> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
>> + SOC15_REG_OFFSET(GC, gc_index,
>>    regVM_CONTEXT0_PAGE_TABLE_BASE_ADDR_LO32);
>>    hub->ctx0_ptb_addr_hi32 =
>> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
>> + SOC15_REG_OFFSET(GC, gc_index,
>>    regVM_CONTEXT0_PAGE_TABLE_BASE_ADDR_HI32);
>>    hub->vm_inv_eng0_sem =
>> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
>> regVM_INVALIDATE_ENG0_SEM);
>> + SOC15_REG_OFFSET(GC, gc_index, 
>> regVM_INVALIDATE_ENG0_SEM);
>>    hub->vm_inv_eng0_req =
>> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
>> regVM_INVALIDATE_ENG0_REQ);
>> + SOC15_REG_OFFSET(GC, gc_ind

Re: [PATCH] drm/amdgpu: keep init xcc0 for all xccs under sriov

2024-06-14 Thread Lazar, Lijo
[AMD Official Use Only - AMD Internal Distribution Only]

Never mind, bit 16 and above is probably because of dword aligned offset.

Any reason not to do this in kiq/rlc based writes to normalise all?

Thanks,
Lijo

From: Lazar, Lijo
Sent: Friday, June 14, 2024 5:20:30 PM
To: Jian, Jane ; Chang, HaiJun ; Zhao, 
Victor 
Cc: amd-gfx@lists.freedesktop.org 
Subject: Re: [PATCH] drm/amdgpu: keep init xcc0 for all xccs under sriov



On 6/14/2024 4:40 PM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
>
> [HOW]
> for sriov only init XCC0(lower 16-bit) for all XCCs to avoid higher bit 
> violation
> since kiq ring is always local, local address without XCC ID is enough to be 
> sent to the XCC KIQ
>

The description is incorrect.

Bits 18:20 represent xcc id. To guarantee all paths pass a local
address, you should just strip bits 18:20 in kiq/rlcg read/write
functions rather than here.

Thanks,
Lijo

> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c | 23 +++
>  1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c 
> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> index e14acab5cceb..4e38a66a52f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> @@ -537,29 +537,36 @@ static void gfxhub_v1_2_xcc_init(struct amdgpu_device 
> *adev, uint32_t xcc_mask)
>  {
>struct amdgpu_vmhub *hub;
>int i;
> + uint32_t gc_index;
>
>for_each_inst(i, xcc_mask) {
>hub = &adev->vmhub[AMDGPU_GFXHUB(i)];
>
> + /* for sriov only init XCC0(lower 16-bit) to avoid higher bit 
> violation */
> + if (amdgpu_sriov_vf(adev))
> + gc_index = 0;
> + else
> + gc_index = GET_INST(GC, i);
> +
>hub->ctx0_ptb_addr_lo32 =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
> + SOC15_REG_OFFSET(GC, gc_index,
>regVM_CONTEXT0_PAGE_TABLE_BASE_ADDR_LO32);
>hub->ctx0_ptb_addr_hi32 =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
> + SOC15_REG_OFFSET(GC, gc_index,
>regVM_CONTEXT0_PAGE_TABLE_BASE_ADDR_HI32);
>hub->vm_inv_eng0_sem =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_INVALIDATE_ENG0_SEM);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_INVALIDATE_ENG0_SEM);
>hub->vm_inv_eng0_req =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_INVALIDATE_ENG0_REQ);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_INVALIDATE_ENG0_REQ);
>hub->vm_inv_eng0_ack =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_INVALIDATE_ENG0_ACK);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_INVALIDATE_ENG0_ACK);
>hub->vm_context0_cntl =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_CONTEXT0_CNTL);
> + SOC15_REG_OFFSET(GC, gc_index, regVM_CONTEXT0_CNTL);
>hub->vm_l2_pro_fault_status =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
> + SOC15_REG_OFFSET(GC, gc_index,
>regVM_L2_PROTECTION_FAULT_STATUS);
>hub->vm_l2_pro_fault_cntl =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_L2_PROTECTION_FAULT_CNTL);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_L2_PROTECTION_FAULT_CNTL);
>
>hub->ctx_distance = regVM_CONTEXT1_CNTL -
>regVM_CONTEXT0_CNTL;


Re: [PATCH] drm/amdgpu: keep init xcc0 for all xccs under sriov

2024-06-14 Thread Lazar, Lijo



On 6/14/2024 4:40 PM, Jane Jian wrote:
> [WHY]
> sriov has the higher bit violation when flushing tlb
> 
> [HOW]
> for sriov only init XCC0(lower 16-bit) for all XCCs to avoid higher bit 
> violation
> since kiq ring is always local, local address without XCC ID is enough to be 
> sent to the XCC KIQ
> 

The description is incorrect.

Bits 18:20 represent xcc id. To guarantee all paths pass a local
address, you should just strip bits 18:20 in kiq/rlcg read/write
functions rather than here.
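For example (untested sketch; assumes byte-aligned offsets where bits 18:20
carry the xcc id - adjust the mask if the offsets are dword aligned):

	/* drop the xcc id bits so the offset always targets the local xcc */
	reg &= ~GENMASK(20, 18);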

Thanks,
Lijo

> Signed-off-by: Jane Jian 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c | 23 +++
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c 
> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> index e14acab5cceb..4e38a66a52f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> @@ -537,29 +537,36 @@ static void gfxhub_v1_2_xcc_init(struct amdgpu_device 
> *adev, uint32_t xcc_mask)
>  {
>   struct amdgpu_vmhub *hub;
>   int i;
> + uint32_t gc_index;
>  
>   for_each_inst(i, xcc_mask) {
>   hub = &adev->vmhub[AMDGPU_GFXHUB(i)];
>  
> + /* for sriov only init XCC0(lower 16-bit) to avoid higher bit 
> violation */
> + if (amdgpu_sriov_vf(adev))
> + gc_index = 0;
> + else
> + gc_index = GET_INST(GC, i);
> +
>   hub->ctx0_ptb_addr_lo32 =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
> + SOC15_REG_OFFSET(GC, gc_index,
>   regVM_CONTEXT0_PAGE_TABLE_BASE_ADDR_LO32);
>   hub->ctx0_ptb_addr_hi32 =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
> + SOC15_REG_OFFSET(GC, gc_index,
>   regVM_CONTEXT0_PAGE_TABLE_BASE_ADDR_HI32);
>   hub->vm_inv_eng0_sem =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_INVALIDATE_ENG0_SEM);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_INVALIDATE_ENG0_SEM);
>   hub->vm_inv_eng0_req =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_INVALIDATE_ENG0_REQ);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_INVALIDATE_ENG0_REQ);
>   hub->vm_inv_eng0_ack =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_INVALIDATE_ENG0_ACK);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_INVALIDATE_ENG0_ACK);
>   hub->vm_context0_cntl =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_CONTEXT0_CNTL);
> + SOC15_REG_OFFSET(GC, gc_index, regVM_CONTEXT0_CNTL);
>   hub->vm_l2_pro_fault_status =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i),
> + SOC15_REG_OFFSET(GC, gc_index,
>   regVM_L2_PROTECTION_FAULT_STATUS);
>   hub->vm_l2_pro_fault_cntl =
> - SOC15_REG_OFFSET(GC, GET_INST(GC, i), 
> regVM_L2_PROTECTION_FAULT_CNTL);
> + SOC15_REG_OFFSET(GC, gc_index, 
> regVM_L2_PROTECTION_FAULT_CNTL);
>  
>   hub->ctx_distance = regVM_CONTEXT1_CNTL -
>   regVM_CONTEXT0_CNTL;


Re: [PATCH] drm/amdkfd: add ASIC version check for the reset selection of RAS poison

2024-06-13 Thread Lazar, Lijo



On 6/13/2024 4:43 PM, Tao Zhou wrote:
> GFX v9.4.3 uses mode1 reset, other ASICs choose mode2.
> 
> Signed-off-by: Tao Zhou 

Acked-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index 78dde62fb04a..816800555f7f 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -164,7 +164,10 @@ static void event_interrupt_poison_consumption_v9(struct 
> kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
>   block = AMDGPU_RAS_BLOCK__GFX;
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 
> 4, 3))
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> @@ -177,7 +180,10 @@ static void event_interrupt_poison_consumption_v9(struct 
> kfd_node *dev,
>   case SOC15_IH_CLIENTID_SDMA3:
>   case SOC15_IH_CLIENTID_SDMA4:
>   block = AMDGPU_RAS_BLOCK__SDMA;
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 
> 4, 3))
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
>   break;
>   default:
>   dev_warn(dev->adev->dev,


Re: [PATCH 1/5] drm/amdgpu: add condition check for waking up thread

2024-06-13 Thread Lazar, Lijo



On 6/13/2024 7:55 AM, YiPeng Chai wrote:
> 1. Cannot add messages to fifo in gpu reset mode.
> 2. Only when the message is successfully saved to the
> fifo, the thread can be awakened.
> 

I think the fifo should still cache the poison requests while in reset. The
page retirement thread may try to acquire the read side of the reset lock and
wait if any reset is in progress.
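e.g. in the retirement worker, a rough sketch:

	/* block here while a reset is in flight; the fifo keeps filling */
	down_read(&adev->reset_domain->sem);
	/* ... drain the poison fifo and retire pages ... */
	up_read(&adev->reset_domain->sem);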

Thanks
Lijo

> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 18 +++---
>  2 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d0dcd3d37e6d..ed260966363f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2093,12 +2093,16 @@ static void 
> amdgpu_ras_interrupt_poison_creation_handler(struct ras_manager *obj
>   if (amdgpu_ip_version(obj->adev, UMC_HWIP, 0) >= IP_VERSION(12, 0, 0)) {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(obj->adev);
>  
> - amdgpu_ras_put_poison_req(obj->adev,
> - AMDGPU_RAS_BLOCK__UMC, 0, NULL, NULL, false);
> -
> - atomic_inc(&con->page_retirement_req_cnt);
> -
> - wake_up(&con->page_retirement_wq);
> + if (!amdgpu_in_reset(obj->adev) && 
> !atomic_read(&con->in_recovery)) {
> + int ret;
> +
> + ret = amdgpu_ras_put_poison_req(obj->adev,
> + AMDGPU_RAS_BLOCK__UMC, 0, NULL, NULL, false);
> + if (!ret) {
> + atomic_inc(&con->page_retirement_req_cnt);
> + wake_up(&con->page_retirement_wq);
> + }
> + }
>   }
>  #endif
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1dbe69eabb9a..94181ae85886 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -293,16 +293,20 @@ int amdgpu_umc_pasid_poison_handler(struct 
> amdgpu_device *adev,
>  
>   amdgpu_ras_error_data_fini(&err_data);
>   } else {
> - struct amdgpu_ras *con = 
> amdgpu_ras_get_context(adev);
> -
>  #ifdef HAVE_KFIFO_PUT_NON_POINTER
> - amdgpu_ras_put_poison_req(adev,
> - block, pasid, pasid_fn, data, reset);
> -#endif
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  
> - atomic_inc(&con->page_retirement_req_cnt);
> + if (!amdgpu_in_reset(adev) && 
> !atomic_read(&con->in_recovery)) {
> + int ret;
>  
> - wake_up(&con->page_retirement_wq);
> + ret = amdgpu_ras_put_poison_req(adev,
> + block, pasid, pasid_fn, data, reset);
> + if (!ret) {
> + 
> atomic_inc(&con->page_retirement_req_cnt);
> + wake_up(&con->page_retirement_wq);
> + }
> + }
> +#endif
>   }
>   } else {
>   if (adev->virt.ops && adev->virt.ops->ras_poison_handler)


Re: [PATCH] drm/amdkfd: use mode1 reset for RAS poison consumption

2024-06-13 Thread Lazar, Lijo



On 6/13/2024 12:27 PM, Tao Zhou wrote:
> Per FW requirement, replace mode2 with mode1.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index e1c21d250611..78dde62fb04a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -164,7 +164,7 @@ static void event_interrupt_poison_consumption_v9(struct 
> kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
>   block = AMDGPU_RAS_BLOCK__GFX;
> - reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> @@ -177,7 +177,7 @@ static void event_interrupt_poison_consumption_v9(struct 
> kfd_node *dev,
>   case SOC15_IH_CLIENTID_SDMA3:
>   case SOC15_IH_CLIENTID_SDMA4:
>   block = AMDGPU_RAS_BLOCK__SDMA;
> - reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
>   break;

Does this need a 9.4.3 IP version check?
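i.e. something like:

	if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3))
		reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
	else
		reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;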

Thanks,
Lijo
>   default:
>   dev_warn(dev->adev->dev,


Re: [PATCH v2] drm/amdgpu: use local xcc to flush tlb

2024-06-12 Thread Lazar, Lijo



On 6/12/2024 3:06 PM, Yiqing Yao wrote:
> When flushing gpu tlb using kiq for gfxhub, kiq ring is always
> local by selecting kiq instance. Test shows mmreg write data packet's
> higher bits then 16 have no effect when flush using kiq on gfxhub.
> 
> Also some variant have policy blocking higher offset when req/ack is set
> with extra bits and can cause flush to timeout.
> 
> So keep the lower 16 bits only.
> 
> Remove redundant code.
> 
> Signed-off-by: Yiqing Yao 
> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 350f6b6676f1..f3fe318e0c1d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -853,8 +853,16 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
>*/
>   if (adev->gfx.kiq[inst].ring.sched.ready &&
>   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
> - uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
> - uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
> +
> + /* 
> +  * Select lower 16 bits to write in local xcc when flushing
> +  * using kiq to write gfx as higher bits are always ignored
> +  */
> + if (vmhub < AMDGPU_MMHUB0(0))
> + {
> + req = req & 0xffff;
> + ack = ack & 0xffff;
> + }
>  

The issue is an incorrect mask passed by the host driver in the discovery
table, which results in incorrect register offsets. The fix should be in the
discovery table passed by the host driver; the RRMT mechanism will then take
care of it.

Thanks,
Lijo

>   amdgpu_gmc_fw_reg_write_reg_wait(adev, req, ack, inv_req,
>1 << vmid, inst);


Re: [PATCH] drm/amdgpu: Move SR-IOV check into amdgpu_gfx_sysfs_compute_init

2024-06-07 Thread Lazar, Lijo



On 6/7/2024 12:31 PM, SRINIVASAN SHANMUGAM wrote:
> 
> On 6/6/2024 10:58 PM, Lazar, Lijo wrote:
>> On 6/6/2024 5:35 PM, Srinivasan Shanmugam wrote:
>>> Previously, this check was performed in the gfx_v9_4_3_sw_init function,
>>> and the amdgpu_gfx_sysfs_compute_init function was only called if the
>>> GPU was not a VF in SR-IOV mode. This is because the sysfs entries
>>> created by amdgpu_gfx_sysfs_compute_init are specific to compute
>>> partitions, which are only applicable on GFX9 and not on a VF in SR-IOV
>>> mode.
>>>
>>> By moving the check into amdgpu_gfx_sysfs_compute_init, we make this
>>> function responsible for deciding whether or not to create the compute
>>> partition sysfs entries.
>>>
>>> This change improves the code organization and maintainability. If in
>>> the future the  conditions for creating the compute partition sysfs
>>> entries change, we  would only need to update the
>>> amdgpu_gfx_sysfs_compute_init function.
>> This is not correct. If this has to be true, this will reside somewhere
>> in amdgpu_gfx and you would also need IP version inside this one. If for
>> a new IP version say gfx v9.4.5 this needs to be created for VF also,
> 
> In this case, how about below?
> 
> int amdgpu_gfx_sysfs_compute_init(struct amdgpu_device *adev, bool check_sriov)
> {
> 	int r;
> 
> 	if (!check_sriov || !amdgpu_sriov_vf(adev)) {
> 		r = device_create_file(adev->dev, &dev_attr_current_compute_partition);
> 		if (r)
> 			return r;
> 
> 		r = device_create_file(adev->dev, &dev_attr_available_compute_partition);
> 		if (r)
> 			return r;
> 	}
> 
> 	return 0;
> }
> 
> In gfx_v9_4_3_sw_init you would call amdgpu_gfx_sysfs_compute_init(adev,
> true),
> 
> to perform the check, and in gfx_v9_4_5_sw_init you would call
> amdgpu_gfx_sysfs_compute_init(adev, false) to skip the check.
> 
> This way, we can control the behavior of the function without needing to
> put condition in IP code version.?
> 
> But would like have Christian's view also, onto this "a new IP version
> say gfx v9.4.5 this needs to be created for VF also,
> 

Drop the patch. As you see, the patch is just adding more complexity
with more variables rather than simplifying anything.

Thanks,
Lijo

> "
> 
>> then this check here won't work. This is the specific reason why we put
>> the conditions inside IP code.
>>
>> Thanks,
>> Lijo
>>
>>> Cc: Alex Deucher 
>>> Cc: Christian König 
>>> Suggested-by: Christian König 
>>> Signed-off-by: Srinivasan Shanmugam 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 24 +++-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  4 ++--
>>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 11 +--
>>>  3 files changed, 22 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> index 19b1817b55d7..72477a5aedca 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> @@ -1376,21 +1376,27 @@ static DEVICE_ATTR(current_compute_partition, 0644,
>>>  static DEVICE_ATTR(available_compute_partition, 0444,
>>>amdgpu_gfx_get_available_compute_partition, NULL);
>>>  
>>> -int amdgpu_gfx_sysfs_init(struct amdgpu_device *adev)
>>> +int amdgpu_gfx_sysfs_compute_init(struct amdgpu_device *adev)
>>>  {
>>> int r;
>>>  
>>> -   r = device_create_file(adev->dev, &dev_attr_current_compute_partition);
>>> -   if (r)
>>> -   return r;
>>> +   if (!amdgpu_sriov_vf(adev)) {
>>> +   r = device_create_file(adev->dev, 
>>> &dev_attr_current_compute_partition);
>>> +   if (r)
>>> +   return r;
>>>  
>>> -   r = device_create_file(adev->dev, 
>>> &dev_attr_available_compute_partition);
>>> +   r = device_create_file(adev->dev, 
>>> &dev_attr_available_compute_partition);
>>> +   if (r)
>>> +   return r;
>>> +   }
>>>  
>>> -   return r;
>>> +   return 0;
>>>  }
>>>  
>>> -void amdgpu_gfx_sysfs_fini(struct amdgpu_device *adev)
>>> +void amdgpu_gfx_sysfs_compute_fini(struct amdgpu_device *adev)
>>>  {
>>

Re: [PATCH] drm/amdgpu: fix NULL pointer in amdgpu_reset_get_desc

2024-06-06 Thread Lazar, Lijo



On 6/6/2024 9:13 PM, Eric Huang wrote:
> amdgpu_job_ring may return NULL, which causes kernel NULL
> pointer error, using another way to print ring name instead
> of ring->name.
> 
> Suggested-by: Lijo Lazar 
> Signed-off-by: Eric Huang 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 9deb41d61e8d..66c1a868c0e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -164,16 +164,14 @@ void amdgpu_device_unlock_reset_domain(struct 
> amdgpu_reset_domain *reset_domain)
>  void amdgpu_reset_get_desc(struct amdgpu_reset_context *rst_ctxt, char *buf,
>  size_t len)
>  {
> - struct amdgpu_ring *ring;
> -
>   if (!buf || !len)
>   return;
>  
>   switch (rst_ctxt->src) {
>   case AMDGPU_RESET_SRC_JOB:
>   if (rst_ctxt->job) {
> - ring = amdgpu_job_ring(rst_ctxt->job);
> - snprintf(buf, len, "job hang on ring:%s", ring->name);
> + snprintf(buf, len, "job hang on ring:%s",
> +  rst_ctxt->job->base.sched->name);
>   } else {
>   strscpy(buf, "job hang", len);
>   }


Re: [PATCH] drm/amdgpu: Move SR-IOV check into amdgpu_gfx_sysfs_compute_init

2024-06-06 Thread Lazar, Lijo



On 6/6/2024 5:35 PM, Srinivasan Shanmugam wrote:
> Previously, this check was performed in the gfx_v9_4_3_sw_init function,
> and the amdgpu_gfx_sysfs_compute_init function was only called if the
> GPU was not a VF in SR-IOV mode. This is because the sysfs entries
> created by amdgpu_gfx_sysfs_compute_init are specific to compute
> partitions, which are only applicable on GFX9 and not on a VF in SR-IOV
> mode.
> 
> By moving the check into amdgpu_gfx_sysfs_compute_init, we make this
> function responsible for deciding whether or not to create the compute
> partition sysfs entries.
> 
> This change improves the code organization and maintainability. If in
> the future the  conditions for creating the compute partition sysfs
> entries change, we  would only need to update the
> amdgpu_gfx_sysfs_compute_init function.

This is not correct. If this were true, the check would reside somewhere
in amdgpu_gfx and you would also need the IP version inside it. If, for
a new IP version, say gfx v9.4.5, this needs to be created for the VF as
well, then this check here won't work. This is the specific reason why we
put such conditions inside the IP code.
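i.e. keep the condition where it is today in gfx_v9_4_3_sw_init():

	if (!amdgpu_sriov_vf(adev))
		r = amdgpu_gfx_sysfs_init(adev);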

Thanks,
Lijo

> 
> Cc: Alex Deucher 
> Cc: Christian König 
> Suggested-by: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 24 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 11 +--
>  3 files changed, 22 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 19b1817b55d7..72477a5aedca 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1376,21 +1376,27 @@ static DEVICE_ATTR(current_compute_partition, 0644,
>  static DEVICE_ATTR(available_compute_partition, 0444,
>  amdgpu_gfx_get_available_compute_partition, NULL);
>  
> -int amdgpu_gfx_sysfs_init(struct amdgpu_device *adev)
> +int amdgpu_gfx_sysfs_compute_init(struct amdgpu_device *adev)
>  {
>   int r;
>  
> - r = device_create_file(adev->dev, &dev_attr_current_compute_partition);
> - if (r)
> - return r;
> + if (!amdgpu_sriov_vf(adev)) {
> + r = device_create_file(adev->dev, 
> &dev_attr_current_compute_partition);
> + if (r)
> + return r;
>  
> - r = device_create_file(adev->dev, 
> &dev_attr_available_compute_partition);
> + r = device_create_file(adev->dev, 
> &dev_attr_available_compute_partition);
> + if (r)
> + return r;
> + }
>  
> - return r;
> + return 0;
>  }
>  
> -void amdgpu_gfx_sysfs_fini(struct amdgpu_device *adev)
> +void amdgpu_gfx_sysfs_compute_fini(struct amdgpu_device *adev)
>  {
> - device_remove_file(adev->dev, &dev_attr_current_compute_partition);
> - device_remove_file(adev->dev, &dev_attr_available_compute_partition);
> + if (!amdgpu_sriov_vf(adev)) {
> + device_remove_file(adev->dev, 
> &dev_attr_current_compute_partition);
> + device_remove_file(adev->dev, 
> &dev_attr_available_compute_partition);
> + }
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index 6b0416777c5b..b65c459b3aa9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -533,8 +533,8 @@ int amdgpu_gfx_poison_consumption_handler(struct 
> amdgpu_device *adev,
>   struct amdgpu_iv_entry *entry);
>  
>  bool amdgpu_gfx_is_master_xcc(struct amdgpu_device *adev, int xcc_id);
> -int amdgpu_gfx_sysfs_init(struct amdgpu_device *adev);
> -void amdgpu_gfx_sysfs_fini(struct amdgpu_device *adev);
> +int amdgpu_gfx_sysfs_compute_init(struct amdgpu_device *adev);
> +void amdgpu_gfx_sysfs_compute_fini(struct amdgpu_device *adev);
>  void amdgpu_gfx_ras_error_func(struct amdgpu_device *adev,
>   void *ras_error_status,
>   void (*func)(struct amdgpu_device *adev, void *ras_error_status,
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index aecc2bcea145..07ce614ef282 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -939,11 +939,11 @@ static int gfx_v9_4_3_sw_init(void *handle)
>   if (r)
>   return r;
>  
> + r = amdgpu_gfx_sysfs_compute_init(adev);
> + if (r)
> + return r;
>  
> - if (!amdgpu_sriov_vf(adev))
> - r = amdgpu_gfx_sysfs_init(adev);
> -
> - return r;
> + return 0;
>  }
>  
>  static int gfx_v9_4_3_sw_fini(void *handle)
> @@ -964,8 +964,7 @@ static int gfx_v9_4_3_sw_fini(void *handle)
>   gfx_v9_4_3_mec_fini(adev);
>   amdgpu_bo_unref(&adev->gfx.rlc.clear_state_obj);
>   gfx_v9_4_3_free_microcode(adev);
> - i

Re: [PATCH 2/2] drm/amdkfd: add reset cause in gpu pre-reset smi event

2024-06-04 Thread Lazar, Lijo



On 6/3/2024 11:42 PM, Eric Huang wrote:
> reset cause is requested by customer as additional
> info for gpu reset smi event.
> 
> v2: integrate reset sources suggested by Lijo Lazar
> 
> Signed-off-by: Eric Huang 

This series is
Reviewed-by: Lijo Lazar 

I think SMI needs to get all reset cause descriptions. Are you planning
to fill reset source at other places also?

Thanks,
Lijo
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  | 10 +++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c |  7 ---
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 16 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  5 -
>  6 files changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index e3738d417245..eb601b41d9d5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
> *work)
>  
>   reset_context.method = AMD_RESET_METHOD_NONE;
>   reset_context.reset_req_dev = adev;
> + reset_context.src = adev->enable_mes ?
> + AMDGPU_RESET_SRC_MES :
> + AMDGPU_RESET_SRC_HWS;
>   clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>  
>   amdgpu_device_gpu_recover(adev, NULL, &reset_context);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 1de021ebdd46..7e945a4790bb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
>  };
>  
>  struct amdgpu_device;
> +struct amdgpu_reset_context;
>  
>  enum kfd_mem_attachment_type {
>   KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
> @@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
> amdgpu_device *adev);
>  
>  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
>  
> -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
> + struct amdgpu_reset_context *reset_context);
>  
>  int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
>  
> @@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> -int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> +int kgd2kfd_pre_reset(struct kfd_dev *kfd,
> +   struct amdgpu_reset_context *reset_context);
>  int kgd2kfd_post_reset(struct kfd_dev *kfd);
>  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
>  void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd);
> @@ -459,7 +462,8 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, 
> bool run_pm)
>   return 0;
>  }
>  
> -static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
> +static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd,
> + struct amdgpu_reset_context *reset_context)
>  {
>   return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 6711836054f9..4096cb3e937e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5775,7 +5775,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>  
>   cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>  
> - amdgpu_amdkfd_pre_reset(tmp_adev);
> + amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
>  
>   /*
>* Mark these ASICs to be reseted as untracked first
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index fba9b9a258a5..52be4e340fb1 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -924,7 +924,8 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd)
>   kfree(kfd);
>  }
>  
> -int kgd2kfd_pre_reset(struct kfd_dev *kfd)
> +int kgd2kfd_pre_reset(struct kfd_dev *kfd,
> +   struct amdgpu_reset_context *reset_context)
>  {
>   struct kfd_node *node;
>   int i;
> @@ -934,7 +935,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>  
>   for (i = 0; i < kfd->num_nodes; i++) {
>   node = kfd->nodes[i];
> - kfd_smi_event_update_gpu_reset(node, false);
> + kfd_smi_event_update_gpu_reset(node, false, reset_context);
>   node->dqm->ops.pre_reset(node->dqm);
>   }
>  
> @@ -974,7 +975,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>   for (i = 0; i < kfd->num_node

RE: [PATCH] drm/amdkfd: add reset cause in gpu pre-reset smi event

2024-06-03 Thread Lazar, Lijo
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Eric,

To consider other reset cases also, you may have something like attached.

Thanks,
Lijo
-Original Message-
From: amd-gfx  On Behalf Of Eric Huang
Sent: Friday, May 31, 2024 8:38 PM
To: amd-gfx@lists.freedesktop.org
Cc: Kasiviswanathan, Harish ; Huang, JinHuiEric 

Subject: [PATCH] drm/amdkfd: add reset cause in gpu pre-reset smi event

reset cause is requested by customer as additional info for gpu reset smi event.

Signed-off-by: Eric Huang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 34 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 17 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  9 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c  |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  7 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 71 +++
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   | 13 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h   |  5 +-
 9 files changed, 133 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index e3738d417245..3588c912214a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -125,17 +125,26 @@ static void amdgpu_doorbell_get_kfd_info(struct 
amdgpu_device *adev,  static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)  {
struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
- kfd.reset_work);
-
-   struct amdgpu_reset_context reset_context;
+ kfd.reset_work.work);
+
+   if (adev->kfd.reset_work.reset_context) {
+   amdgpu_device_gpu_recover(
+   adev, NULL,
+   (struct amdgpu_reset_context *)
+   adev->kfd.reset_work.reset_context);
+   kfree(adev->kfd.reset_work.reset_context);
+   adev->kfd.reset_work.reset_context = NULL;
+   } else {
+   struct amdgpu_reset_context reset_context;

-   memset(&reset_context, 0, sizeof(reset_context));
+   memset(&reset_context, 0, sizeof(reset_context));

-   reset_context.method = AMD_RESET_METHOD_NONE;
-   reset_context.reset_req_dev = adev;
-   clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+   reset_context.method = AMD_RESET_METHOD_NONE;
+   reset_context.reset_req_dev = adev;
+   clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);

-   amdgpu_device_gpu_recover(adev, NULL, &reset_context);
+   amdgpu_device_gpu_recover(adev, NULL, &reset_context);
+   }
 }

 static const struct drm_client_funcs kfd_client_funcs = { @@ -225,7 +234,7 @@ 
void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)

amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;

-   INIT_WORK(&adev->kfd.reset_work, amdgpu_amdkfd_reset_work);
+   INIT_WORK(&adev->kfd.reset_work.work, amdgpu_amdkfd_reset_work);
}
 }

@@ -261,12 +270,13 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
run_pm)
return r;
 }

-int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
+   struct amdgpu_reset_context *reset_context)
 {
int r = 0;

if (adev->kfd.dev)
-   r = kgd2kfd_pre_reset(adev->kfd.dev);
+   r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context);

return r;
 }
@@ -285,7 +295,7 @@ void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)  {
if (amdgpu_device_should_recover_gpu(adev))
amdgpu_reset_domain_schedule(adev->reset_domain,
-&adev->kfd.reset_work);
+&adev->kfd.reset_work.work);
 }

 int amdgpu_amdkfd_alloc_gtt_mem(struct amdgpu_device *adev, size_t size, diff 
--git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 1de021ebdd46..1fc9ed33a1c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
 };

 struct amdgpu_device;
+struct amdgpu_reset_context;

 enum kfd_mem_attachment_type {
KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
@@ -98,12 +99,17 @@ struct amdgpu_amdkfd_fence {
struct svm_range_bo *svm_bo;
 };

+struct kfd_reset_work {
+   struct work_struct work;
+   void *reset_context;
+};
+
 struct amdgpu_kfd_dev {
struct kfd_dev *dev;
int64_t vram_used[MAX_XCP];
uint64_t vram_used_aligned[MAX_XCP];
bool init_complete;
-   struct work_struct reset_work;

Re: [PATCH] drm/amdkfd: select CONFIG_CRC16

2024-05-28 Thread Lazar, Lijo



On 5/28/2024 5:20 PM, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The amdkfd support fails to link when CONFIG_CRC16 is disabled:
> 
> aarch64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_topology.o: in function 
> `kfd_topology_add_device':
> kfd_topology.c:(.text+0x3a4c): undefined reference to `crc16'
> 
> This is a library module that needs to be selected from every user.
> 
> Fixes: 3ed181b8ff43 ("drm/amdkfd: Ensure gpu_id is unique")
> Signed-off-by: Arnd Bergmann 

Thanks for the patch; this is already addressed with -
https://patchwork.freedesktop.org/patch/594816/

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdkfd/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig 
> b/drivers/gpu/drm/amd/amdkfd/Kconfig
> index d3c3d3ab7225..f82595af34bf 100644
> --- a/drivers/gpu/drm/amd/amdkfd/Kconfig
> +++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
> @@ -6,6 +6,7 @@
>  config HSA_AMD
>   bool "HSA kernel driver for AMD GPU devices"
>   depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
> + select CRC16
>   select HMM_MIRROR
>   select MMU_NOTIFIER
>   select DRM_AMDGPU_USERPTR


Re: [PATCH] drm/amd/amdgpu: fix the inst passed to amdgpu_virt_rlcg_reg_rw

2024-05-22 Thread Lazar, Lijo



On 5/22/2024 1:11 PM, Zhao, Victor wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Hi Lijo,
> 
> This patch alone is not working.
> Since in your approach amdgpu_virt_rlcg_reg_rw takes the logical xcc id, all
> the read/write calls need to be fixed to match.
> For example, WREG32_SOC15_OFFSET. There will be a bunch of places that need
> to be fixed.
> 

That definitely looks complicated. Using physical index and passing the
same to amdgpu_virt_rlcg_reg_rw is better. The patch below is -

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> Thanks,
> Victor
> 
> -Original Message-
> From: Lazar, Lijo 
> Sent: Wednesday, May 22, 2024 2:14 PM
> To: Zhao, Victor ; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/amd/amdgpu: fix the inst passed to 
> amdgpu_virt_rlcg_reg_rw
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Hi Victor,
> 
> Could you check if an approach like the attached one helps?
> 
> Thanks,
> Lijo
> -Original Message-
> From: Zhao, Victor 
> Sent: Wednesday, May 22, 2024 11:13 AM
> To: Zhao, Victor ; amd-gfx@lists.freedesktop.org; Lazar, 
> Lijo 
> Subject: RE: [PATCH] drm/amd/amdgpu: fix the inst passed to 
> amdgpu_virt_rlcg_reg_rw
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Hi @Lazar, Lijo,
> 
> Can you help review this?
> 
> Thanks,
> Victor
> 
> -----Original Message-
> From: Victor Zhao 
> Sent: Tuesday, May 21, 2024 12:08 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Lazar, Lijo ; Zhao, Victor 
> Subject: [PATCH] drm/amd/amdgpu: fix the inst passed to 
> amdgpu_virt_rlcg_reg_rw
> 
> the inst passed to amdgpu_virt_rlcg_reg_rw should be physical instance.
> Fix the mismatched code.
> 
> Signed-off-by: Victor Zhao 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 18 +-
>  2 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e72e774d17e6..e74789691070 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -679,7 +679,7 @@ uint32_t amdgpu_device_xcc_rreg(struct amdgpu_device 
> *adev,
> amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags,
>  GC_HWIP, false,
>  &rlcg_flag)) {
> -   ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0, 
> rlcg_flag, xcc_id);
> +   ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0,
> +rlcg_flag, GET_INST(GC, xcc_id));
> } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
> amdgpu_sriov_runtime(adev) &&
> down_read_trylock(&adev->reset_domain->sem)) { @@ -810,7 
> +810,7 @@ void amdgpu_device_xcc_wreg(struct amdgpu_device *adev,
> amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags,
>  GC_HWIP, true,
>  &rlcg_flag)) {
> -   amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag, 
> xcc_id);
> +   amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag,
> +GET_INST(GC, xcc_id));
> } else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
> amdgpu_sriov_runtime(adev) &&
> down_read_trylock(&adev->reset_domain->sem)) { diff --git 
> a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 094c08cb98e7..350f6b6676f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -844,7 +844,7 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
> ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
> 
> if (vmhub >= AMDGPU_MMHUB0(0))
> -   inst = GET_INST(GC, 0);
> +   inst = 0;
> else
> inst = vmhub;
> 
> @@ -876,9 +876,9 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
> *adev, uint32_t vmid,
> for (j = 0; j < adev->usec_timeout; j++) {
> /* a read return value of 1 means semaphore acquire */
> if (vmhub >= AMDGPU_MMHUB0(0))
> -   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, sem, 
> inst);
> +   tmp = RREG32_SOC15_IP_N

RE: [PATCH] drm/amd/amdgpu: fix the inst passed to amdgpu_virt_rlcg_reg_rw

2024-05-21 Thread Lazar, Lijo
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Victor,

Could you check if an approach like the attached one helps?

Thanks,
Lijo
-Original Message-
From: Zhao, Victor 
Sent: Wednesday, May 22, 2024 11:13 AM
To: Zhao, Victor ; amd-gfx@lists.freedesktop.org; Lazar, 
Lijo 
Subject: RE: [PATCH] drm/amd/amdgpu: fix the inst passed to 
amdgpu_virt_rlcg_reg_rw

[AMD Official Use Only - AMD Internal Distribution Only]

Hi @Lazar, Lijo,

Can you help review this?

Thanks,
Victor

-Original Message-
From: Victor Zhao 
Sent: Tuesday, May 21, 2024 12:08 AM
To: amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Zhao, Victor 
Subject: [PATCH] drm/amd/amdgpu: fix the inst passed to amdgpu_virt_rlcg_reg_rw

the inst passed to amdgpu_virt_rlcg_reg_rw should be physical instance.
Fix the mismatched code.

Signed-off-by: Victor Zhao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 18 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e72e774d17e6..e74789691070 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -679,7 +679,7 @@ uint32_t amdgpu_device_xcc_rreg(struct amdgpu_device *adev,
amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags,
 GC_HWIP, false,
 &rlcg_flag)) {
-   ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0, rlcg_flag, 
xcc_id);
+   ret = amdgpu_virt_rlcg_reg_rw(adev, reg, 0,
+rlcg_flag, GET_INST(GC, xcc_id));
} else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
amdgpu_sriov_runtime(adev) &&
down_read_trylock(&adev->reset_domain->sem)) { @@ -810,7 
+810,7 @@ void amdgpu_device_xcc_wreg(struct amdgpu_device *adev,
amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags,
 GC_HWIP, true,
 &rlcg_flag)) {
-   amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag, 
xcc_id);
+   amdgpu_virt_rlcg_reg_rw(adev, reg, v, rlcg_flag,
+GET_INST(GC, xcc_id));
} else if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
amdgpu_sriov_runtime(adev) &&
down_read_trylock(&adev->reset_domain->sem)) { diff --git 
a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 094c08cb98e7..350f6b6676f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -844,7 +844,7 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;

if (vmhub >= AMDGPU_MMHUB0(0))
-   inst = GET_INST(GC, 0);
+   inst = 0;
else
inst = vmhub;

@@ -876,9 +876,9 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
for (j = 0; j < adev->usec_timeout; j++) {
/* a read return value of 1 means semaphore acquire */
if (vmhub >= AMDGPU_MMHUB0(0))
-   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, sem, inst);
+   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, sem,
+ GET_INST(GC, inst));
else
-   tmp = RREG32_SOC15_IP_NO_KIQ(GC, sem, inst);
+   tmp = RREG32_SOC15_IP_NO_KIQ(GC, sem,
+ GET_INST(GC, inst));
if (tmp & 0x1)
break;
udelay(1);
@@ -889,9 +889,9 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
}

if (vmhub >= AMDGPU_MMHUB0(0))
-   WREG32_SOC15_IP_NO_KIQ(MMHUB, req, inv_req, inst);
+   WREG32_SOC15_IP_NO_KIQ(MMHUB, req, inv_req, GET_INST(GC,
+ inst));
else
-   WREG32_SOC15_IP_NO_KIQ(GC, req, inv_req, inst);
+   WREG32_SOC15_IP_NO_KIQ(GC, req, inv_req, GET_INST(GC,
+ inst));

/*
 * Issue a dummy read to wait for the ACK register to @@ -904,9 +904,9 
@@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,

for (j = 0; j < adev->usec_timeout; j++) {
if (vmhub >= AMDGPU_MMHUB0(0))
-   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, ack, inst);
+   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, ack,
+ GET_INST(GC, inst));
else
-   tmp = RREG32_SOC15_IP_NO_KIQ(GC, ack, inst);
+

Re: [PATCH 1/4 V2] drm/amdgpu: fix invadate operation for umsch

2024-05-21 Thread Lazar, Lijo



On 5/22/2024 7:49 AM, Zhang, Jesse(Jie) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Hi Lijo
> 
> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, May 21, 2024 4:20 PM
> To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian 
> ; Huang, Tim ; Yu, Lang 
> 
> Subject: Re: [PATCH 1/4 V2] drm/amdgpu: fix invadate operation for umsch
> 
> 
> 
> On 5/21/2024 12:46 PM, Jesse Zhang wrote:
>> Since the type of data_size is uint32_t, adev->umsch_mm.data_size - 1 >> 16 >> 16 is 0 regardless of the values of its operands
>>
>> So removing the operations upper_32_bits and lower_32_bits.
>>
>> Signed-off-by: Jesse Zhang 
>> Suggested-by: Tim Huang 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c
>> b/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c
>> index 2c5e7b0a73f9..ce3bb12e3572 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c
>> @@ -116,9 +116,8 @@ static int umsch_mm_v4_0_load_microcode(struct 
>> amdgpu_umsch_mm *umsch)
>>   upper_32_bits(adev->umsch_mm.data_start_addr));
>>
>>   WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_LO,
>> - lower_32_bits(adev->umsch_mm.data_size - 1));
>> - WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_HI,
>> - upper_32_bits(adev->umsch_mm.data_size - 1));
>> + adev->umsch_mm.data_size - 1);
>> + WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_HI, 0);
> 
> cc: Lang
> 
> The original programming and the new one don't look correct.
> 
> I see the below field definitions as per the header. As per this, both LO/HI 
> are 16-bit fields.
> 
> vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_HI__MASK0_HI__SHIFT
>  0x0 
> vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_HI__MASK0_HI_MASK
>  0xL
> 
> vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_LO__MASK0_LO__SHIFT
>  0x10 
> vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_LO__MASK0_LO_MASK
>  0xL
> 
> [Zhang, Jesse(Jie)]
> 
> The code seems to align with the Windows side, which has the same issue. Here is the 
> windows umsch_4_0 write register 
> regVCN_MES_LOCAL_MASK0_LO/regVCN_MES_LOCAL_MASK0_HI
> 
> enum umsch_mm_status umsch_mm_engine_init_unsecure_4_0(struct 
> umsch_mm_context* umsch_mm_ip) {
> ...
> temp_data = 
> (uint32_t)umsch_mm_ip->umsch_mm_fw.ucode_info[fw]->data_system_size - 1;
> data = temp_data;
> umsch_mm_cgs_write_register(umsch_mm_ip, 
> umsch_mm_reg_offset(hwip_info, regVCN_MES_LOCAL_MASK0_LO, 
> regVCN_MES_LOCAL_MASK0_LO_BASE_IDX), data, HWBLOCK_VCN);
> 
> data = temp_data >> 32;
> umsch_mm_cgs_write_register(umsch_mm_ip, 
> umsch_mm_reg_offset(hwip_info, regVCN_MES_LOCAL_MASK0_HI, 
> regVCN_MES_LOCAL_MASK0_HI_BASE_IDX), data, HWBLOCK_VCN);
> ...
> }
> 
> struct umsch_mm_ucode_consts
> {
>  ...
> uint32_t data_system_size;
> ...
> }
> 

Thanks, checked the MES spec. Looks like the mask field definitions are
wrong. They look like copies of BASE_HI/LO fields which are used for
keeping a 64k aligned 48-bit address.

Anyway, the mask fields are for indicating size of the local heap/stack,
so most likely won't require usage of MASK0_HI.
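
For reference, with a 32-bit data_size the HI half is zero by construction; a
quick sketch using the generic kernel macros (as they expand today):

    /* upper_32_bits(n) expands to ((u32)(((n) >> 16) >> 16)); for a u32
     * operand this is always 0, while lower_32_bits(n) is just the value
     * itself, so only the LO write carries any information. */
    u32 mask = adev->umsch_mm.data_size - 1;

    WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_LO, lower_32_bits(mask)); /* == mask */
    WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_HI, upper_32_bits(mask)); /* always 0 */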

Thanks,
Lijo

> Thanks
> Jesse
> 
> 
> Thanks,
> Lijo
> 
>>
>>   data = adev->firmware.load_type == AMDGPU_FW_LOAD_PSP ?
>>  0 : adev->umsch_mm.data_fw_gpu_addr;


Re: [PATCH 1/4 V2] drm/amdgpu: fix invadate operation for umsch

2024-05-21 Thread Lazar, Lijo



On 5/21/2024 12:46 PM, Jesse Zhang wrote:
> Since the type of data_size is uint32_t, adev->umsch_mm.data_size - 1 >> 16 >> 16 is 0
> regardless of the values of its operands
> 
> So removing the operations upper_32_bits and lower_32_bits.
> 
> Signed-off-by: Jesse Zhang 
> Suggested-by: Tim Huang 
> ---
>  drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c 
> b/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c
> index 2c5e7b0a73f9..ce3bb12e3572 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c
> @@ -116,9 +116,8 @@ static int umsch_mm_v4_0_load_microcode(struct 
> amdgpu_umsch_mm *umsch)
>   upper_32_bits(adev->umsch_mm.data_start_addr));
>  
>   WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_LO,
> - lower_32_bits(adev->umsch_mm.data_size - 1));
> - WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_HI,
> - upper_32_bits(adev->umsch_mm.data_size - 1));
> + adev->umsch_mm.data_size - 1);
> + WREG32_SOC15_UMSCH(regVCN_MES_LOCAL_MASK0_HI, 0);

cc: Lang

The original programming and the new one don't look correct.

I see the below field definitions as per the header. As per this, both
LO/HI are 16-bit fields.

vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_HI__MASK0_HI__SHIFT
 0x0
vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_HI__MASK0_HI_MASK
 0xL

vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_LO__MASK0_LO__SHIFT
 0x10
vcn/vcn_4_0_5_sh_mask.h:#define VCN_MES_LOCAL_MASK0_LO__MASK0_LO_MASK
 0xL

Thanks,
Lijo

>  
>   data = adev->firmware.load_type == AMDGPU_FW_LOAD_PSP ?
>  0 : adev->umsch_mm.data_fw_gpu_addr;


Re: [PATCH v2] drm/amdgpu: Fix snprintf usage in amdgpu_gfx_kiq_init_ring

2024-05-21 Thread Lazar, Lijo



On 5/21/2024 1:07 PM, Srinivasan Shanmugam wrote:
> This commit fixes a format truncation issue caused by the snprintf
> function potentially writing more characters into the ring->name buffer
> than it can hold, in the amdgpu_gfx_kiq_init_ring function 
>   
> The issue occurred because the '%d' format specifier could write between
> 1 and 10 bytes into a region of size between 0 and 8, depending on the
> values of xcc_id, ring->me, ring->pipe, and ring->queue. The snprintf
> function could output between 12 and 41 bytes into a destination of size
> 16, leading to potential truncation.  
>   
> To resolve this, the snprintf line was modified to use the '%hhu' format
> specifier for ring->me, ring->pipe, and ring->queue. The '%hhu'
> specifier is used for unsigned char variables and ensures that these
> values are printed as unsigned decimal integers.
> 
> Fixes the below with gcc W=1:
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c: In function 
> ‘amdgpu_gfx_kiq_init_ring’:
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:61: warning: ‘%d’ directive 
> output may be truncated writing between 1 and 10 bytes into a region of size 
> between 0 and 8 [-Wformat-truncation=]
>   332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
>   | ^~
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:50: note: directive argument in 
> the range [0, 2147483647]
>   332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
>   |  ^
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:9: note: ‘snprintf’ output 
> between 12 and 41 bytes into a destination of size 16
>   332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
>   | ^~~
>   333 |  xcc_id, ring->me, ring->pipe, ring->queue);
>   |  ~~
> 
> Fixes: 345a36c4f1ba ("drm/amdgpu: prefer snprintf over sprintf")
> Cc: Alex Deucher 
> Cc: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
> v2:
>  - Removed width specifiers %3, %1, typecasting of unsigned char,
>s/hhd/hhu (Lijo)
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 9b7dc61c331d..0f14d4a11441 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -329,7 +329,7 @@ int amdgpu_gfx_kiq_init_ring(struct amdgpu_device *adev, 
> int xcc_id)
>  
>   ring->eop_gpu_addr = kiq->eop_gpu_addr;
>   ring->no_scheduler = true;
> - snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
> + snprintf(ring->name, sizeof(ring->name), "kiq_%d.%hhu.%hhu.%hhu",
>xcc_id, ring->me, ring->pipe, ring->queue);

Even for xcc_id, the value range expected is < 255. Anyway,

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

>   r = amdgpu_ring_init(adev, ring, 1024, irq, AMDGPU_CP_KIQ_IRQ_DRIVER0,
>AMDGPU_RING_PRIO_DEFAULT, NULL);


Re: [PATCH] drm/amdgpu: Fix snprintf usage in amdgpu_gfx_kiq_init_ring

2024-05-20 Thread Lazar, Lijo



On 5/21/2024 10:13 AM, Srinivasan Shanmugam wrote:
> This commit fixes a format truncation issue caused by the snprintf
> function potentially writing more characters into the ring->name buffer
> than it can hold, in the amdgpu_gfx_kiq_init_ring function
> 
> The issue occurred because the '%d' format specifier could write between
> 1 and 10 bytes into a region of size between 0 and 8, depending on the
>   values of xcc_id, ring->me, ring->pipe, and ring->queue. The snprintf
> function could output between 12 and 41 bytes into a destination of size
> 16, leading to potential truncation.
> 
> To resolve this, the snprintf line was modified to use the '%3d' and
> '%1hhd' format specifiers. The '%3d' specifier is used for xcc_id and
> ensures that it is always printed with a width of 3 characters. The '%1hhd'
> specifier is used for ring->me, ring->pipe, and ring->queue, and


Width specifier only guarantees minimum width. It doesn't offer any
truncation. %1 also doesn't matter as that is the default minimum. What
about just using %hhu?
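
A small illustration of the difference (hypothetical values, not driver code):

    char name[16];
    unsigned char me = 1, pipe = 2, queue = 3;

    /* "%3d" is only a minimum field width: it pads short values but does
     * not limit long ones, so it does not help against truncation */
    snprintf(name, sizeof(name), "kiq_%3d.%1hhd.%1hhd.%1hhd",
             123456, me, pipe, queue);

    /* "%hhu" prints the argument as an unsigned char (0-255), so each of
     * these fields is at most 3 digits */
    snprintf(name, sizeof(name), "kiq_%d.%hhu.%hhu.%hhu",
             0, me, pipe, queue);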

Thanks,
Lijo


> ensures that these values are printed as single digit numbers. This is
> achieved by casting these values to unsigned char before passing them to
> snprintf, which ensures that these values will always be in the range of
> 0 to 9.
> 
> Fixes the below with gcc W=1:
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c: In function 
> ‘amdgpu_gfx_kiq_init_ring’:
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:61: warning: ‘%d’ directive 
> output may be truncated writing between 1 and 10 bytes into a region of size 
> between 0 and 8 [-Wformat-truncation=]
>   332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
>   | ^~
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:50: note: directive argument in 
> the range [0, 2147483647]
>   332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
>   |  ^
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:9: note: ‘snprintf’ output 
> between 12 and 41 bytes into a destination of size 16
>   332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
>   | ^~~
>   333 |  xcc_id, ring->me, ring->pipe, ring->queue);
>   |  ~~
> 
> Fixes: 345a36c4f1ba ("drm/amdgpu: prefer snprintf over sprintf")
> Cc: Alex Deucher 
> Cc: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 9b7dc61c331d..88da17c0340b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -329,8 +329,9 @@ int amdgpu_gfx_kiq_init_ring(struct amdgpu_device *adev, 
> int xcc_id)
>  
>   ring->eop_gpu_addr = kiq->eop_gpu_addr;
>   ring->no_scheduler = true;
> - snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d",
> -  xcc_id, ring->me, ring->pipe, ring->queue);
> + snprintf(ring->name, sizeof(ring->name), "kiq_%3d.%1hhd.%1hhd.%1hhd",
> +  xcc_id, (unsigned char)ring->me, (unsigned char)ring->pipe,
> +  (unsigned char)ring->queue);
>   r = amdgpu_ring_init(adev, ring, 1024, irq, AMDGPU_CP_KIQ_IRQ_DRIVER0,
>AMDGPU_RING_PRIO_DEFAULT, NULL);
>   if (r)


Re: [PATCH] drm/amd/amdgpu: fix the inst passed to reg read write under sriov

2024-05-20 Thread Lazar, Lijo



On 5/20/2024 4:44 PM, Victor Zhao wrote:
> the inst passed to reg read/write should be physical instance.
> Fix the mismatched code.
> 
> Signed-off-by: Victor Zhao 
> ---
>  .../drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c   |  6 ++---
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c |  2 +-
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c   |  8 +++---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 26 +--
>  4 files changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c
> index a5c7259cf2a3..319e6793053a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c
> @@ -300,7 +300,7 @@ static int kgd_gfx_v9_4_3_hqd_load(struct amdgpu_device 
> *adev, void *mqd,
>   hqd_end = SOC15_REG_OFFSET(GC, GET_INST(GC, inst), 
> regCP_HQD_AQL_DISPATCH_ID_HI);
>  
>   for (reg = hqd_base; reg <= hqd_end; reg++)
> - WREG32_XCC(reg, mqd_hqd[reg - hqd_base], inst);
> + WREG32_XCC(reg, mqd_hqd[reg - hqd_base], GET_INST(GC, inst));

Why does this need to be done? Isn't the expectation that it goes to the
right KIQ/RLCG as those are also indexed by logical XCC ids?

Thanks,
Lijo

>  
>  
>   /* Activate doorbell logic before triggering WPTR poll. */
> @@ -493,12 +493,12 @@ static uint32_t kgd_gfx_v9_4_3_set_address_watch(
>   WREG32_XCC((SOC15_REG_OFFSET(GC, GET_INST(GC, inst),
>   regTCP_WATCH0_ADDR_H) +
>   (watch_id * TCP_WATCH_STRIDE)),
> - watch_address_high, inst);
> + watch_address_high, GET_INST(GC, inst));
>  
>   WREG32_XCC((SOC15_REG_OFFSET(GC, GET_INST(GC, inst),
>   regTCP_WATCH0_ADDR_L) +
>   (watch_id * TCP_WATCH_STRIDE)),
> - watch_address_low, inst);
> + watch_address_low, GET_INST(GC, inst));
>  
>   return watch_address_cntl;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> index 5a35a8ca8922..76be23dcea31 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> @@ -239,7 +239,7 @@ int kgd_gfx_v9_hqd_load(struct amdgpu_device *adev, void 
> *mqd,
>  
>   for (reg = hqd_base;
>reg <= SOC15_REG_OFFSET(GC, GET_INST(GC, inst), 
> mmCP_HQD_PQ_WPTR_HI); reg++)
> - WREG32_XCC(reg, mqd_hqd[reg - hqd_base], inst);
> + WREG32_XCC(reg, mqd_hqd[reg - hqd_base], GET_INST(GC, inst));
>  
>  
>   /* Activate doorbell logic before triggering WPTR poll. */
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 07b299ec7169..349ece5a27ee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -2812,16 +2812,16 @@ static void 
> gfx_v9_4_3_xcc_set_compute_eop_interrupt_state(
>  
>   switch (state) {
>   case AMDGPU_IRQ_STATE_DISABLE:
> - mec_int_cntl = RREG32_XCC(mec_int_cntl_reg, xcc_id);
> + mec_int_cntl = RREG32_XCC(mec_int_cntl_reg, GET_INST(GC, 
> xcc_id));
>   mec_int_cntl = REG_SET_FIELD(mec_int_cntl, 
> CP_ME1_PIPE0_INT_CNTL,
>TIME_STAMP_INT_ENABLE, 0);
> - WREG32_XCC(mec_int_cntl_reg, mec_int_cntl, xcc_id);
> + WREG32_XCC(mec_int_cntl_reg, mec_int_cntl, GET_INST(GC, 
> xcc_id));
>   break;
>   case AMDGPU_IRQ_STATE_ENABLE:
> - mec_int_cntl = RREG32_XCC(mec_int_cntl_reg, xcc_id);
> + mec_int_cntl = RREG32_XCC(mec_int_cntl_reg, GET_INST(GC, 
> xcc_id));
>   mec_int_cntl = REG_SET_FIELD(mec_int_cntl, 
> CP_ME1_PIPE0_INT_CNTL,
>TIME_STAMP_INT_ENABLE, 1);
> - WREG32_XCC(mec_int_cntl_reg, mec_int_cntl, xcc_id);
> + WREG32_XCC(mec_int_cntl_reg, mec_int_cntl, GET_INST(GC, 
> xcc_id));
>   break;
>   default:
>   break;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 094c08cb98e7..aca842668c56 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -496,14 +496,14 @@ static int gmc_v9_0_vm_fault_interrupt_state(struct 
> amdgpu_device *adev,
>   if (j >= AMDGPU_MMHUB0(0))
>   tmp = RREG32_SOC15_IP(MMHUB, reg);
>   else
> - tmp = RREG32_XCC(reg, j);
> + tmp = RREG32_XCC(reg, GET_INST(GC, j));
>  
>   tmp &= ~bits;
>  
>   if (j >= AMDGPU_MMHUB0(0))
>   

Re: [PATCH 1/2] Revert "drm/amd/pm: Use gpu_metrics_v1_6 for SMUv13.0.6"

2024-05-19 Thread Lazar, Lijo



On 5/20/2024 10:31 AM, Asad Kamal wrote:
> Remove gpu_metrics_v1_6 usage for SMUv13.0.6 temporarily and use
> gpu_metrics_v1_5 until tool support is ready for it.
> 
> This reverts commit e6efb71ae640eada28f44cc97aa79e8ae4901e63.
> 
> Signed-off-by: Asad Kamal 

Series is
Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c   | 18 --
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index ceb2174baff6..81a241ed18f5 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -351,7 +351,7 @@ static int smu_v13_0_6_tables_init(struct smu_context 
> *smu)
>   return -ENOMEM;
>   smu_table->metrics_time = 0;
>  
> - smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v1_6);
> + smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v1_5);
>   smu_table->gpu_metrics_table =
>   kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
>   if (!smu_table->gpu_metrics_table) {
> @@ -2290,8 +2290,8 @@ static int 
> smu_v13_0_6_get_current_pcie_link_speed(struct smu_context *smu)
>  static ssize_t smu_v13_0_6_get_gpu_metrics(struct smu_context *smu, void 
> **table)
>  {
>   struct smu_table_context *smu_table = &smu->smu_table;
> - struct gpu_metrics_v1_6 *gpu_metrics =
> - (struct gpu_metrics_v1_6 *)smu_table->gpu_metrics_table;
> + struct gpu_metrics_v1_5 *gpu_metrics =
> + (struct gpu_metrics_v1_5 *)smu_table->gpu_metrics_table;
>   struct amdgpu_device *adev = smu->adev;
>   int ret = 0, xcc_id, inst, i, j;
>   MetricsTableX_t *metrics_x;
> @@ -2307,7 +2307,7 @@ static ssize_t smu_v13_0_6_get_gpu_metrics(struct 
> smu_context *smu, void **table
>  
>   metrics_a = (MetricsTableA_t *)metrics_x;
>  
> - smu_cmn_init_soft_gpu_metrics(gpu_metrics, 1, 6);
> + smu_cmn_init_soft_gpu_metrics(gpu_metrics, 1, 5);
>  
>   gpu_metrics->temperature_hotspot =
>   SMUQ10_ROUND(GET_METRIC_FIELD(MaxSocketTemperature));
> @@ -2349,16 +2349,6 @@ static ssize_t smu_v13_0_6_get_gpu_metrics(struct 
> smu_context *smu, void **table
>  
>   gpu_metrics->current_uclk = 
> SMUQ10_ROUND(GET_METRIC_FIELD(UclkFrequency));
>  
> - /* Total accumulated cycle counter */
> - gpu_metrics->accumulation_counter = 
> GET_METRIC_FIELD(AccumulationCounter);
> -
> - /* Accumulated throttler residencies */
> - gpu_metrics->prochot_residency_acc = 
> GET_METRIC_FIELD(ProchotResidencyAcc);
> - gpu_metrics->ppt_residency_acc = GET_METRIC_FIELD(PptResidencyAcc);
> - gpu_metrics->socket_thm_residency_acc = 
> GET_METRIC_FIELD(SocketThmResidencyAcc);
> - gpu_metrics->vr_thm_residency_acc = GET_METRIC_FIELD(VrThmResidencyAcc);
> - gpu_metrics->hbm_thm_residency_acc = 
> GET_METRIC_FIELD(HbmThmResidencyAcc);
> -
>   /* Throttle status is not reported through metrics now */
>   gpu_metrics->throttle_status = 0;
>  


Re: [PATCH v4 00/10] Add PM policy interfaces

2024-05-15 Thread Lazar, Lijo


On 5/14/2024 4:35 PM, Lijo Lazar wrote:
> This series adds APIs to get the supported PM policies and also set them. A PM
> policy type is a predefined policy type supported by an SOC and each policy 
> may
> define two or more levels to choose from. A user can select the appropriate
> level through amdgpu_dpm_set_pm_policy() or through sysfs node pm_policy. 
> Based
> on the specific PM functional area, multiple PM policies may be defined for an
> SOC. For ex: a policy may be defined to set the right setting for XGMI per link
> power down feature and another may be defined to select the SOC Pstate
> preferences.
>  
> Presently, XGMI PLPD and SOC Pstate policy types are supported. It also 
> removes
> the legacy sysfs interface to set XGMI PLPD as it is not used by any client like
> SMI tool.
> 
> v2:
>  Add NULL checks to avoid access on SOCs which don't support any policy.
> 
> v3:
>  Rebase and add documentation patch
> 
> v4:
>  Use consistent policy type naming for read/write (Alex Deucher)
> 
> Lijo Lazar (10):
>   drm/amd/pm: Add support for DPM policies
>   drm/amd/pm: Update PMFW messages for SMUv13.0.6
>   drm/amd/pm: Add support to select pstate policy
>   drm/amd/pm: Add xgmi plpd policy to pm_policy
>   drm/amd/pm: Add xgmi plpd to SMU v13.0.6 pm_policy
>   drm/amd/pm: Add xgmi plpd to aldebaran pm_policy
>   drm/amd/pm: Add xgmi plpd to arcturus pm_policy
>   drm/amd/pm: Remove legacy interface for xgmi plpd
>   drm/amd/pm: Remove unused interface to set plpd
>   Documentation/amdgpu: Add PM policy documentation
> 
>  Documentation/gpu/amdgpu/thermal.rst  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c  |   4 +-
>  .../gpu/drm/amd/include/kgd_pp_interface.h|  17 ++
>  drivers/gpu/drm/amd/pm/amdgpu_dpm.c   |  32 ++--
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c| 136 
>  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |   9 +-
>  drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h|   2 +-
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 113 +++--
>  drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  40 -
>  .../pm/swsmu/inc/pmfw_if/smu_v13_0_6_ppsmc.h  |   3 +-
>  drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h  |   3 +-
>  .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  64 +---
>  .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  59 ---
>  .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|   2 +
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 153 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c|  57 +++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|   2 +
>  17 files changed, 533 insertions(+), 169 deletions(-)
> 


Re: [PATCH 2/2 v2] drm/amd/pm: check specific index for aldebaran

2024-05-14 Thread Lazar, Lijo



On 5/14/2024 12:28 PM, Jesse Zhang wrote:
> To avoid warning problems, drop index and
> use PPSMC_MSG_GfxDriverReset instead of index for aldebaran.
> 
> Signed-off-by: Jesse Zhang 
> Suggested-by: Lijo Lazar 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index a22eb6bbb05e..d671314c46c8 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1880,17 +1880,18 @@ static int aldebaran_mode1_reset(struct smu_context 
> *smu)
>  
>  static int aldebaran_mode2_reset(struct smu_context *smu)
>  {
> - int ret = 0, index;
> + int ret = 0;
>   struct amdgpu_device *adev = smu->adev;
>   int timeout = 10;
>  
> - index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
> - SMU_MSG_GfxDeviceDriverReset);
> - if (index < 0 )
> - return -EINVAL;
>   mutex_lock(&smu->message_lock);
>   if (smu->smc_fw_version >= 0x00441400) {
> - ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, 
> SMU_RESET_MODE_2);

For clarity, original comment is - retain this as it is, only replace
index with PPSMC_MSG_GfxDriverReset.

Changing this to msg_with_param() breaks the reset sequence.
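
i.e., roughly (only the message index changes; the rest of the sequence stays
as it is):

    ret = smu_cmn_send_msg_without_waiting(smu, PPSMC_MSG_GfxDriverReset,
                                           SMU_RESET_MODE_2);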

Thanks,
Lijo

> + ret = smu_cmn_send_smc_msg_with_param(smu, 
> PPSMC_MSG_GfxDriverReset,
> + 
> SMU_RESET_MODE_2, NULL);
> + if (ret) {
> + dev_err(smu->adev->dev, "Failed to mode2 reset!\n");
> + goto out;
> + }
>   /* This is similar to FLR, wait till max FLR timeout */
>   msleep(100);
>   dev_dbg(smu->adev->dev, "restore config space...\n");


Re: [PATCH 2/2] drm/amd/pm: check specific index for aldebaran

2024-05-14 Thread Lazar, Lijo



On 5/14/2024 12:37 PM, Wang, Yang(Kevin) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> -Original Message-
> From: amd-gfx  On Behalf Of Lazar, Lijo
> Sent: Tuesday, May 14, 2024 2:07 PM
> To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian 
> ; Huang, Tim 
> Subject: Re: [PATCH 2/2] drm/amd/pm: check specific index for aldebaran
> 
> 
> 
> On 5/14/2024 11:34 AM, Jesse Zhang wrote:
>> To avoid warning problems, drop index and use PPSMC_MSG_GfxDriverReset
>> instead of index for aldebaran.
>>
>> Signed-off-by: Jesse Zhang 
>> Suggested-by: Lijo Lazar 
>> ---
>>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 13 +++--
>>  1 file changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> index a22eb6bbb05e..d671314c46c8 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> @@ -1880,17 +1880,18 @@ static int aldebaran_mode1_reset(struct
>> smu_context *smu)
>>
>>  static int aldebaran_mode2_reset(struct smu_context *smu)  {
>> - int ret = 0, index;
>> + int ret = 0;
>>   struct amdgpu_device *adev = smu->adev;
>>   int timeout = 10;
>>
>> - index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
>> - SMU_MSG_GfxDeviceDriverReset);
>> - if (index < 0 )
>> - return -EINVAL;
>>   mutex_lock(&smu->message_lock);
>>   if (smu->smc_fw_version >= 0x00441400) {
>> - ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, 
>> SMU_RESET_MODE_2);
>> + ret = smu_cmn_send_smc_msg_with_param(smu,
>> +SMU_MSG_GfxDeviceDriverReset,
> 
> PPSMC_MSG_GfxDriverReset is different from SMU_MSG_GfxDeviceDriverReset.
> Use PPSMC_MSG_GfxDriverReset here (for both patches).
> 
> Thanks,
> Lijo
> 
> [Kevin]:
> 
> There is no interface here to directly use PPSMC_MSG_XXX to send messages to 
> smu/pmfw in the swSMU driver,
> and it is not recommended to do so to maintain code consistency.
> 

Thanks, didn't notice earlier that smu_cmn_send_msg_without_waiting got
changed as well with this patch. This API is a direct interface.

Please note not to change anything else other than what is specifically
requested in review comment. The original comment was only to replace
index with PPSMC_MSG_GfxDriverReset. Please stick to that, otherwise it
will break the entire sequence.

Thanks,
Lijo

> Best Regards,
> Kevin
> 
>> + 
>> SMU_RESET_MODE_2, NULL);
>> + if (ret) {
>> + dev_err(smu->adev->dev, "Failed to mode2 reset!\n");
>> + goto out;
>> + }
>>   /* This is similar to FLR, wait till max FLR timeout */
>>   msleep(100);
>>   dev_dbg(smu->adev->dev, "restore config space...\n");


Re: [PATCH] drm/amdgpu/pm: Drop hard-code value of usTMax

2024-05-13 Thread Lazar, Lijo



On 5/14/2024 9:43 AM, Ma Jun wrote:
> Drop the hard-coded value of usTMax because we read this
> value from the fan table below.
> 
> Signed-off-by: Ma Jun 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c 
> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c
> index 17882f8dfdd3..6cfef1b295ab 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c
> @@ -977,8 +977,6 @@ static int init_thermal_controller(
>   = le16_to_cpu(tonga_fan_table->usPWMMed);
>   hwmgr->thermal_controller.advanceFanControlParameters.usPWMHigh
>   = le16_to_cpu(tonga_fan_table->usPWMHigh);
> - hwmgr->thermal_controller.advanceFanControlParameters.usTMax
> - = 10900;  /* hard coded */
>   hwmgr->thermal_controller.advanceFanControlParameters.usTMax
>   = le16_to_cpu(tonga_fan_table->usTMax);
>   
> hwmgr->thermal_controller.advanceFanControlParameters.ucFanControlMode


Re: [PATCH v2] drm/amdgpu: Fix the null pointer dereference to ras_manager

2024-05-13 Thread Lazar, Lijo



On 5/14/2024 9:42 AM, Ma Jun wrote:
> Check ras_manager before using it
> 
> Signed-off-by: Ma Jun 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 925ec65ac5ed..2bcf5c3b5d70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2172,12 +2172,15 @@ static void 
> amdgpu_ras_interrupt_process_handler(struct work_struct *work)
>  int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
>   struct ras_dispatch_if *info)
>  {
> - struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
> - struct ras_ih_data *data = &obj->ih_data;
> + struct ras_manager *obj;
> + struct ras_ih_data *data;
>  
> + obj = amdgpu_ras_find_obj(adev, &info->head);
>   if (!obj)
>   return -EINVAL;
>  
> + data = &obj->ih_data;
> +
>   if (data->inuse == 0)
>   return 0;
>  


Re: [PATCH 2/2] drm/amd/pm: check specific index for aldebaran

2024-05-13 Thread Lazar, Lijo



On 5/14/2024 11:34 AM, Jesse Zhang wrote:
> To avoid warning problems, drop index and
> use PPSMC_MSG_GfxDriverReset instead of index for aldebaran.
> 
> Signed-off-by: Jesse Zhang 
> Suggested-by: Lijo Lazar 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index a22eb6bbb05e..d671314c46c8 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1880,17 +1880,18 @@ static int aldebaran_mode1_reset(struct smu_context 
> *smu)
>  
>  static int aldebaran_mode2_reset(struct smu_context *smu)
>  {
> - int ret = 0, index;
> + int ret = 0;
>   struct amdgpu_device *adev = smu->adev;
>   int timeout = 10;
>  
> - index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
> - SMU_MSG_GfxDeviceDriverReset);
> - if (index < 0 )
> - return -EINVAL;
>   mutex_lock(&smu->message_lock);
>   if (smu->smc_fw_version >= 0x00441400) {
> - ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, 
> SMU_RESET_MODE_2);
> + ret = smu_cmn_send_smc_msg_with_param(smu, 
> SMU_MSG_GfxDeviceDriverReset,

PPSMC_MSG_GfxDriverReset is different from SMU_MSG_GfxDeviceDriverReset.
Use PPSMC_MSG_GfxDriverReset here (for both patches).

Thanks,
Lijo

> + 
> SMU_RESET_MODE_2, NULL);
> + if (ret) {
> + dev_err(smu->adev->dev, "Failed to mode2 reset!\n");
> + goto out;
> + }
>   /* This is similar to FLR, wait till max FLR timeout */
>   msleep(100);
>   dev_dbg(smu->adev->dev, "restore config space...\n");


Re: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

2024-05-13 Thread Lazar, Lijo



On 5/14/2024 6:30 AM, Ma, Jun wrote:
> Hi Lijo & Kevin, thanks for review, will drop this patch
> 

In the original function, the below check is already there.

if (!handle || !info || type >= ACA_ERROR_TYPE_COUNT)
return -EINVAL;

So moving this assignment to a later stage is still valid:
struct aca_error_cache *error_cache = &handle->error_cache;

A further NULL check of error_cache is not required.
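
In code terms, something like this (sketch):

    struct aca_error_cache *error_cache;
    struct aca_bank_error *bank_error;
    struct aca_error *aerr;

    if (!handle || !info || type >= ACA_ERROR_TYPE_COUNT)
        return -EINVAL;

    /* handle is known to be non-NULL here and error_cache is an embedded
     * member, so &handle->error_cache can never be NULL */
    error_cache = &handle->error_cache;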

Thanks,
Lijo

> Regards,
> Ma Jun
> 
> On 5/14/2024 7:13 AM, Wang, Yang(Kevin) wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> -Original Message-
>> From: Ma, Jun 
>> Sent: Monday, May 13, 2024 4:56 PM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Feng, Kenneth ; Deucher, Alexander 
>> ; Wang, Yang(Kevin) ; 
>> Koenig, Christian ; Ma, Jun 
>> Subject: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle
>>
>> Check handle pointer before using it
>>
>> Signed-off-by: Ma Jun 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
>> index 28febf33fb1b..e969a7d77b4d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
>> @@ -279,7 +279,7 @@ static struct aca_bank_error *get_bank_error(struct 
>> aca_error *aerr, struct aca_  int aca_error_cache_log_bank_error(struct 
>> aca_handle *handle, struct aca_bank_info *info,
>>enum aca_error_type type, u64 count)  {
>> -   struct aca_error_cache *error_cache = &handle->error_cache;
>> +   struct aca_error_cache *error_cache;
>> struct aca_bank_error *bank_error;
>> struct aca_error *aerr;
>>
>> @@ -289,6 +289,10 @@ int aca_error_cache_log_bank_error(struct aca_handle 
>> *handle, struct aca_bank_in
>> if (!count)
>> return 0;
>>
>> +   error_cache = &handle->error_cache;
>> [Kevin]:
>> The above code always returns a non-0 value, right?
>>
>> Best Regards,
>> Kevin
>> +   if (!error_cache)
>> +   return -EINVAL;
>> +
>> aerr = &error_cache->errors[type];
>> bank_error = get_bank_error(aerr, info);
>> if (!bank_error)
>> --
>> 2.34.1
>>


Re: [PATCH 05/22] drm/amd/pm: check specific index for aldebaran

2024-05-13 Thread Lazar, Lijo



On 5/10/2024 8:20 AM, Jesse Zhang wrote:
> Check for specific indexes that may be invalid values.
> 
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index ce941fbb9cfb..a22eb6bbb05e 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1886,7 +1886,8 @@ static int aldebaran_mode2_reset(struct smu_context 
> *smu)
>  
>   index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
>   SMU_MSG_GfxDeviceDriverReset);
> -
> + if (index < 0 )
> + return -EINVAL;

To avoid warning problems, drop index and use PPSMC_MSG_GfxDriverReset
instead of index.

Thanks,
Lijo

>   mutex_lock(&smu->message_lock);
>   if (smu->smc_fw_version >= 0x00441400) {
>   ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, 
> SMU_RESET_MODE_2);


Re: [PATCH 09/22] drm/amd/pm: check specific index for smu13

2024-05-13 Thread Lazar, Lijo



On 5/13/2024 4:27 PM, Lazar, Lijo wrote:
> 
> 
> On 5/10/2024 8:20 AM, Jesse Zhang wrote:
>> Check for specific indexes that may be invalid values.
>>
>> Signed-off-by: Jesse Zhang 
>> ---
>>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> index 051092f1b1b4..7c343dd12a7f 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> @@ -2336,6 +2336,8 @@ static int smu_v13_0_6_mode2_reset(struct smu_context 
>> *smu)
>>  
>>  index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
>> SMU_MSG_GfxDeviceDriverReset);
>> +if (index < 0)
>> +ret = -EINVAL;
>>  
> 
> This check is unnecessary. This is IP specific logic and the index is
> expected to exist.
> 

If you are seeing a warning problem because of this, drop index and use
PPSMC_MSG_GfxDriverReset directly.

Thanks,
Lijo

> See this entry in smu_v13_0_6_message_map
> 
> MSG_MAP(GfxDeviceDriverReset,PPSMC_MSG_GfxDriverReset,
>SMU_MSG_RAS_PRI),
> >
> Thanks,
> Lijo
> 
>>  mutex_lock(&smu->message_lock);
>>  


Re: [PATCH 09/22] drm/amd/pm: check specific index for smu13

2024-05-13 Thread Lazar, Lijo



On 5/10/2024 8:20 AM, Jesse Zhang wrote:
> Check for specific indexes that may be invalid values.
> 
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index 051092f1b1b4..7c343dd12a7f 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -2336,6 +2336,8 @@ static int smu_v13_0_6_mode2_reset(struct smu_context 
> *smu)
>  
>   index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
>  SMU_MSG_GfxDeviceDriverReset);
> + if (index < 0)
> + ret = -EINVAL;
>  

This check is unnecessary. This is IP specific logic and the index is
expected to exist.

See this entry in smu_v13_0_6_message_map

MSG_MAP(GfxDeviceDriverReset,PPSMC_MSG_GfxDriverReset,
   SMU_MSG_RAS_PRI),


Thanks,
Lijo

>   mutex_lock(&smu->message_lock);
>  


Re: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

2024-05-13 Thread Lazar, Lijo



On 5/13/2024 2:26 PM, Ma Jun wrote:
> Check handle pointer before using it
> 
> Signed-off-by: Ma Jun 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index 28febf33fb1b..e969a7d77b4d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -279,7 +279,7 @@ static struct aca_bank_error *get_bank_error(struct 
> aca_error *aerr, struct aca_
>  int aca_error_cache_log_bank_error(struct aca_handle *handle, struct 
> aca_bank_info *info,
>  enum aca_error_type type, u64 count)
>  {
> - struct aca_error_cache *error_cache = &handle->error_cache;
> + struct aca_error_cache *error_cache;
>   struct aca_bank_error *bank_error;
>   struct aca_error *aerr;
>  
> @@ -289,6 +289,10 @@ int aca_error_cache_log_bank_error(struct aca_handle 
> *handle, struct aca_bank_in
>   if (!count)
>   return 0;
>  
> + error_cache = &handle->error_cache;
> + if (!error_cache)
> + return -EINVAL;

Similar to patch 2. error_cache is not a pointer variable.

struct aca_error_cache error_cache;


Thanks,
Lijo

> +
>   aerr = &error_cache->errors[type];
>   bank_error = get_bank_error(aerr, info);
>   if (!bank_error)


Re: [PATCH 2/5] drm/amdgpu: Fix the null pointer dereference to ras_manager

2024-05-13 Thread Lazar, Lijo



On 5/13/2024 2:26 PM, Ma Jun wrote:
> Check ras_manager before using it
> 
> Signed-off-by: Ma Jun 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 1dd13ed3b7b5..6da02a209890 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2172,12 +2172,17 @@ static void 
> amdgpu_ras_interrupt_process_handler(struct work_struct *work)
>  int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
>   struct ras_dispatch_if *info)
>  {
> - struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
> - struct ras_ih_data *data = &obj->ih_data;
> + struct ras_manager *obj;
> + struct ras_ih_data *data;
>  
> + obj = amdgpu_ras_find_obj(adev, &info->head);
>   if (!obj)
>   return -EINVAL;
>  
> + data = &obj->ih_data;
> + if (!data)
> + return -EINVAL;

This check is not needed. ih_data is declared as below in ras_manager.

struct ras_ih_data ih_data;

Thanks,
Lijo

> +
>   if (data->inuse == 0)
>   return 0;
>  


Re: [PATCH v3] drm/amdgpu: Add Ring Hang Events

2024-05-12 Thread Lazar, Lijo



On 5/13/2024 9:44 AM, Ori Messinger wrote:
> This patch adds 'ring hang' events to the driver.
> This is done by adding a 'reset_ring_hang' bool variable to the
> struct 'amdgpu_reset_context' in the amdgpu_reset.h file.
> The purpose for this 'reset_ring_hang' variable is whenever a GPU
> reset is initiated due to a ring hang, the reset_ring_hang should
> be set to 'true'.
> 
> Additionally, the reset cause is passed into the kfd smi event as
> a string, and is displayed in dmesg as an error.
> 
> This 'amdgpu_reset_context' struct is now also passed
> through across all relevant functions, and another event type
> "KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum.
> 
> Signed-off-by: Ori Messinger 
> Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  7 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  9 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 16 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  4 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h   |  2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c |  7 ---
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  7 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  5 -
>  include/uapi/linux/kfd_ioctl.h  | 15 ---
>  10 files changed, 56 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index e3738d417245..f1c6dc939cc3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
> *work)
>  
>   reset_context.method = AMD_RESET_METHOD_NONE;
>   reset_context.reset_req_dev = adev;
> + reset_context.reset_ring_hang = true;
> + strscpy(reset_context.reset_cause, "hws_hang", 
> sizeof(reset_context.reset_cause));

Please add only the reset cause, as an enum, to the reset context. There is
no need for a separate variable like ring hang.

A user ring that induces an HWS hang may be identified separately, or
generically with "HWS hang" as the reason. An HWS hang could also be caused
by a RAS error.

A possible list is -

DRIVER_TRIGGERED (suspend/reset on init etc)
JOB HANG,
HWS HANG,
USER TRIGGERED,
RAS ERROR,

The description string for the reset cause can be obtained separately with
something like the call below, which returns the details - there is no need
to add them to the reset context and pass them around.

amdgpu_reset_get_reset_desc(reset_context);

If the reset is caused by a specific job, the reset context already has a
job pointer. From there you can get more details like device/partition id,
job id, ring name etc. and provide the description. For RAS errors, there is
already a detailed dmesg log of the IP that caused the issue, so the device
alone could be sufficient.
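
A minimal sketch of that suggestion (editor's illustration only - the enum
values and amdgpu_reset_get_reset_desc() are hypothetical names, not existing
driver API; for brevity it takes the cause directly rather than the reset
context):

enum amdgpu_reset_cause_demo {
	AMDGPU_RESET_CAUSE_DRIVER_TRIGGERED,	/* suspend/reset on init etc. */
	AMDGPU_RESET_CAUSE_JOB_HANG,
	AMDGPU_RESET_CAUSE_HWS_HANG,
	AMDGPU_RESET_CAUSE_USER_TRIGGERED,
	AMDGPU_RESET_CAUSE_RAS_ERROR,
};

/* Hypothetical helper: map the cause to a human readable description. */
static const char *amdgpu_reset_get_reset_desc(enum amdgpu_reset_cause_demo cause)
{
	switch (cause) {
	case AMDGPU_RESET_CAUSE_JOB_HANG:
		return "job hang";	/* job pointer in the reset context has the details */
	case AMDGPU_RESET_CAUSE_HWS_HANG:
		return "hws hang";
	case AMDGPU_RESET_CAUSE_USER_TRIGGERED:
		return "user triggered";
	case AMDGPU_RESET_CAUSE_RAS_ERROR:
		return "ras error";	/* RAS dmesg log already identifies the IP */
	case AMDGPU_RESET_CAUSE_DRIVER_TRIGGERED:
	default:
		return "driver triggered";
	}
}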

> + DRM_ERROR("Reset cause: %s\n", reset_context.reset_cause);

Please don't use DRM_*. They are deprecated. Use either drm_err() or
dev_err() - they help to identify the device also.
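
For example (editor's sketch of the same print with a device-aware helper):

	dev_err(adev->dev, "Reset cause: %s\n", reset_context.reset_cause);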

Thanks,
Lijo

>   clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>  
>   amdgpu_device_gpu_recover(adev, NULL, &reset_context);
> @@ -261,12 +264,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, 
> bool run_pm)
>   return r;
>  }
>  
> -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct 
> amdgpu_reset_context *reset_context)
>  {
>   int r = 0;
>  
>   if (adev->kfd.dev)
> - r = kgd2kfd_pre_reset(adev->kfd.dev);
> + r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context);
>  
>   return r;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 1de021ebdd46..c9030d8b8308 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
>  };
>  
>  struct amdgpu_device;
> +struct amdgpu_reset_context;
>  
>  enum kfd_mem_attachment_type {
>   KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
> @@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
> amdgpu_device *adev);
>  
>  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
>  
> -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
> + struct amdgpu_reset_context *reset_context);
>  
>  int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
>  
> @@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> -int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> +int kgd2kfd_pr

Re: [PATCH] drm/amdkfd: Ensure gpu_id is unique

2024-05-10 Thread Lazar, Lijo



On 5/10/2024 1:36 AM, Harish Kasiviswanathan wrote:
> gpu_id needs to be unique for user space to identify GPUs via KFD
> interface. In the current implementation there is a very small
> probability of having non unique gpu_ids.
> 
> v2: Add check to confirm if gpu_id is unique. If not unique, find one
> Changed commit header to reflect the above
> v3: Use crc16 as suggested-by: Lijo Lazar 
> Ensure that gpu_id != 0
> 
> Signed-off-by: Harish Kasiviswanathan 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 40 +++
>  1 file changed, 34 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index 219dcf504f24..4954a3021f70 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "kfd_priv.h"
>  #include "kfd_crat.h"
> @@ -1091,14 +1092,17 @@ void kfd_topology_shutdown(void)
>  
>  static uint32_t kfd_generate_gpu_id(struct kfd_node *gpu)
>  {
> - uint32_t hashout;
> + uint32_t gpu_id;
>   uint32_t buf[8];
>   uint64_t local_mem_size;
> - int i;
> + struct kfd_topology_device *dev;
> + bool is_unique;
> + uint8_t *crc_buf;
>  
>   if (!gpu)
>   return 0;
>  
> + crc_buf = (uint8_t*)&buf;
>   local_mem_size = gpu->local_mem_info.local_mem_size_private +
>   gpu->local_mem_info.local_mem_size_public;
>   buf[0] = gpu->adev->pdev->devfn;
> @@ -1111,10 +1115,34 @@ static uint32_t kfd_generate_gpu_id(struct kfd_node 
> *gpu)
>   buf[6] = upper_32_bits(local_mem_size);
>   buf[7] = (ffs(gpu->xcc_mask) - 1) | (NUM_XCC(gpu->xcc_mask) << 16);
>  
> - for (i = 0, hashout = 0; i < 8; i++)
> - hashout ^= hash_32(buf[i], KFD_GPU_ID_HASH_WIDTH);
> + gpu_id = crc16(0, crc_buf, sizeof(buf)) &
> +  ((1 << KFD_GPU_ID_HASH_WIDTH) - 1);
>  
> - return hashout;
> + /* There is a very small possibility when generating a
> +  * 16 (KFD_GPU_ID_HASH_WIDTH) bit value from 8 word buffer
> +  * that the value could be 0 or non-unique. So, check if
> +  * it is unique and non-zero. If not unique increment till
> +  * unique one is found. In case of overflow, restart from 1
> +  */
> +
> + down_read(&topology_lock);
> + do {
> + is_unique = true;
> + if (!gpu_id)
> + gpu_id = 1;
> + list_for_each_entry(dev, &topology_device_list, list) {
> + if (dev->gpu && dev->gpu_id == gpu_id) {
> + is_unique = false;
> + break;
> + }
> + }
> + if (unlikely(!is_unique))
> + gpu_id = (gpu_id + 1) &
> +   ((1 << KFD_GPU_ID_HASH_WIDTH) - 1);
> + } while (!is_unique);
> + up_read(&topology_lock);
> +
> + return gpu_id;
>  }
>  /* kfd_assign_gpu - Attach @gpu to the correct kfd topology device. If
>   *   the GPU device is not already present in the topology device
> @@ -1945,7 +1973,6 @@ int kfd_topology_add_device(struct kfd_node *gpu)
>   struct amdgpu_gfx_config *gfx_info = &gpu->adev->gfx.config;
>   struct amdgpu_cu_info *cu_info = &gpu->adev->gfx.cu_info;
>  
> - gpu_id = kfd_generate_gpu_id(gpu);
>   if (gpu->xcp && !gpu->xcp->ddev) {
>   dev_warn(gpu->adev->dev,
>"Won't add GPU to topology since it has no drm node 
> assigned.");
> @@ -1968,6 +1995,7 @@ int kfd_topology_add_device(struct kfd_node *gpu)
>   if (res)
>   return res;
>  
> + gpu_id = kfd_generate_gpu_id(gpu);
>   dev->gpu_id = gpu_id;
>   gpu->id = gpu_id;
>  


Re: [PATCH 19/22 V2] drm/amdgpu: Fix the warning division or modulo by zero for the variable num_xcc_per_xcp

2024-05-10 Thread Lazar, Lijo



On 5/10/2024 1:56 PM, Jesse Zhang wrote:
> Checks the partition mode and returns an error for an invalid mode.
> 
> Signed-off-by: Jesse Zhang 
> Suggested-by:  Lijo Lazar 
> ---
>  drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c 
> b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> index 414ea3f560a7..b1c18b7a38ad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
> @@ -501,6 +501,12 @@ static int aqua_vanjaram_switch_partition_mode(struct 
> amdgpu_xcp_mgr *xcp_mgr,
>  
>   if (mode == AMDGPU_AUTO_COMPUTE_PARTITION_MODE) {
>   mode = __aqua_vanjaram_get_auto_mode(xcp_mgr);
> + if (mode == AMDGPU_UNKNOWN_COMPUTE_PARTITION_MODE) {
> + dev_err(adev->dev,
> + "Invalid compute partition mode requested, 
> requested: %s, available memory partitions: %d",

Please change the error message to something like

"Invalid config, no compatible compute partition mode found, available
memory partitions: %d"

With that change,

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> + amdgpu_gfx_compute_mode_desc(mode), 
> adev->gmc.num_mem_partitions);
> + return -EINVAL;
> + }
>   } else if (!__aqua_vanjaram_is_valid_mode(xcp_mgr, mode)) {
>   dev_err(adev->dev,
>   "Invalid compute partition mode requested, requested: 
> %s, available memory partitions: %d",
> @@ -522,6 +528,7 @@ static int aqua_vanjaram_switch_partition_mode(struct 
> amdgpu_xcp_mgr *xcp_mgr,
>   goto unlock;
>  
>   num_xcc_per_xcp = __aqua_vanjaram_get_xcc_per_xcp(xcp_mgr, mode);
>   if (adev->gfx.funcs->switch_partition_mode)
>   adev->gfx.funcs->switch_partition_mode(xcp_mgr->adev,
>  num_xcc_per_xcp);


Re: [PATCH 19/22] drm/amdgpu: Fix the warning division or modulo by zero for the variable num_xcc_per_xcp

2024-05-10 Thread Lazar, Lijo



On 5/10/2024 1:09 PM, Zhang, Jesse(Jie) wrote:
> [AMD Official Use Only - General]
> 
> Hi Lijo,
> 
> -Original Message-
> From: amd-gfx  On Behalf Of Lazar, Lijo
> Sent: Friday, May 10, 2024 3:16 PM
> To: amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 19/22] drm/amdgpu: Fix the warning division or modulo by 
> zero for the variable num_xcc_per_xcp
> 
> 
> 
> On 5/10/2024 8:20 AM, Jesse Zhang wrote:
>> Dividing expression num_xcc_per_xcp which may be zero has undefined behavior.
>>
>> Signed-off-by: Jesse Zhang 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
>> b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
>> index 414ea3f560a7..5752c6760992 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
>> @@ -522,6 +522,9 @@ static int aqua_vanjaram_switch_partition_mode(struct 
>> amdgpu_xcp_mgr *xcp_mgr,
>>   goto unlock;
>>
>>   num_xcc_per_xcp = __aqua_vanjaram_get_xcc_per_xcp(xcp_mgr, mode);
>> + if (!num_xcc_per_xcp)
>> + goto unlock;
>> +
> 
> This won't happen as the mode is validated before and for each valid mode a 
> non-zero num_xcc_per_xcp is expected. To satisfy the warning-checker, before 
> going to unlock use a proper 'ret' value also (otherwise it will look odd).
> 
> [Zhang, Jesse(Jie)]  Is that possible?
> When the initial mode is AMDGPU_AUTO_COMPUTE_PARTITION_MODE and
> __aqua_vanjaram_get_auto_mode() returns AMDGPU_UNKNOWN_COMPUTE_PARTITION_MODE,
> the valid mode check (__aqua_vanjaram_is_valid_mode) is skipped.
> 
Yes, that is possible. If auto detection cannot figure out a valid mode,
then it needs to return an error from there without proceeding further.
That is a better fix.
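
Roughly (editor's sketch; the actual fix landed as the V2 patch elsewhere in
this archive, which also prints a dev_err()):

	if (mode == AMDGPU_AUTO_COMPUTE_PARTITION_MODE) {
		mode = __aqua_vanjaram_get_auto_mode(xcp_mgr);
		if (mode == AMDGPU_UNKNOWN_COMPUTE_PARTITION_MODE)
			return -EINVAL;
	}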

Thanks,
Lijo

> Thanks
> Jesse
> 
> 
> Thanks,
> Lijo
> 
>>   if (adev->gfx.funcs->switch_partition_mode)
>>   adev->gfx.funcs->switch_partition_mode(xcp_mgr->adev,
>>  num_xcc_per_xcp);

