RE: [PATCH v2 2/2] drm/amdgpu/pm: Use macro definitions in the smu IH process function

2024-01-24 Thread Wang, Yang(Kevin)


-Original Message-
From: Ma, Jun  
Sent: Thursday, January 25, 2024 3:45 PM
To: Wang, Yang(Kevin) ; Ma, Jun ; 
amd-gfx@lists.freedesktop.org
Cc: Ma, Jun ; Lazar, Lijo ; Feng, Kenneth 
; Deucher, Alexander 
Subject: Re: [PATCH v2 2/2] drm/amdgpu/pm: Use macro definitions in the smu IH 
process function

Hi Kevin,

On 1/23/2024 8:08 PM, Wang, Yang(Kevin) wrote:
> [AMD Official Use Only - General]
> 
> HI Jun,
> 
> I don't think it's necessary to delete these definitions in smu driver_if.h.
> Adding a prefix can be used to distinguish definitions in the driver 
> and can also make it easier for us to track problems. E.g: 
> SMU_IH_INTERRUPT_ID_TO_DRIVER And definitions in the smu_cmn.h file will 
> cause definition conflicts in all subsequent driver if files, we try to avoid 
> modifying the driver_if file and have kept synchronization with pmfw.
> 
In general, we do need to follow the above rules.
But smu_v13_0_irq_process() is a common function used by other asics.
Defining these macros in the corresponding smu driver_if.h will cause the 
compile error.
[kevin]:

It is precisely for this reason that the prefix SMU_ needs to be added to 
distinguish these macro definitions.
e.g:
IH_INTERRUPT_ID_TO_DRIVER to SMU_ IH_INTERRUPT_ID_TO_DRIVER

Best Regards,
Kevin

Regards,
Ma Jun
> Best Regards,
> Kevin
> 
> -Original Message-
> From: Ma, Jun 
> Sent: Tuesday, January 23, 2024 4:13 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Lazar, Lijo ; Feng, Kenneth 
> ; Deucher, Alexander 
> ; Wang, Yang(Kevin) 
> ; Ma, Jun 
> Subject: [PATCH v2 2/2] drm/amdgpu/pm: Use macro definitions in the 
> smu IH process function
> 
> Replace the hard-coded numbers with macro definition
> 
> Signed-off-by: Ma Jun 
> ---
>  .../pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h | 11 ---  
> .../pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h |  4   
> .../pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h | 11 ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 10 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 14 +++---
>  drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0.c |  2 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 10 ++
>  7 files changed, 23 insertions(+), 39 deletions(-)
> 
> diff --git 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h
> index b114d14fc053..91229b2dadb5 100644
> --- 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0
> +++ .h
> @@ -1618,15 +1618,4 @@ typedef struct {
>  #define TABLE_WIFIBAND12
>  #define TABLE_COUNT   13
> 
> -//IH Interupt ID
> -#define IH_INTERRUPT_ID_TO_DRIVER   0xFE
> -#define IH_INTERRUPT_CONTEXT_ID_BACO0x2
> -#define IH_INTERRUPT_CONTEXT_ID_AC  0x3
> -#define IH_INTERRUPT_CONTEXT_ID_DC  0x4
> -#define IH_INTERRUPT_CONTEXT_ID_AUDIO_D00x5
> -#define IH_INTERRUPT_CONTEXT_ID_AUDIO_D30x6
> -#define IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING  0x7
> -#define IH_INTERRUPT_CONTEXT_ID_FAN_ABNORMAL0x8
> -#define IH_INTERRUPT_CONTEXT_ID_FAN_RECOVERY0x9
> -
>  #endif
> diff --git 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h
> index ced348d2e8bb..957b177414a9 100644
> --- 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6
> +++ .h
> @@ -209,10 +209,6 @@ typedef struct {
>floatminPsmVoltage[30];
>  } AvfsDebugTableXcd_t;
> 
> -// Defines used for IH-based thermal interrupts to GFX driver - A/X only
> -#define IH_INTERRUPT_ID_TO_DRIVER   0xFE
> -#define IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING  0x7
> -
>  //thermal over-temp mask defines for IH interrupt to host
>  #define THROTTLER_PROCHOT_BIT   0
>  #define THROTTLER_PPT_BIT   1
> diff --git 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h
> index 8b1496f8ce58..33937c1d984f 100644
> --- 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7
> +++ .h
> @@ -1608,15 +1608,4 @@ typedef struct {
>  #define TABLE_WIFIBAND12
>  #define TABLE_COUNT   13
> 
> -//IH Interupt ID
> -#define IH_INTERRUPT_ID_TO_DRIVER   0xFE
> -#define IH_INTERRUPT_CONTEXT_ID_BACO0x2
> -#define IH_INTERRUPT_CONTEXT_ID_AC  0x3
> -#define IH_INTERRUPT_CONTEXT_ID_DC  0x4
> -#define IH_INTERRUPT_CONTEXT_ID_AUDIO_D00x5
> -#define 

Re: [PATCH v2 2/2] drm/amdgpu/pm: Use macro definitions in the smu IH process function

2024-01-24 Thread Ma, Jun
Hi Kevin,

On 1/23/2024 8:08 PM, Wang, Yang(Kevin) wrote:
> [AMD Official Use Only - General]
> 
> HI Jun,
> 
> I don't think it's necessary to delete these definitions in smu driver_if.h.
> Adding a prefix can be used to distinguish definitions in the driver and can 
> also make it easier for us to track problems. E.g: 
> SMU_IH_INTERRUPT_ID_TO_DRIVER
> And definitions in the smu_cmn.h file will cause definition conflicts in all 
> subsequent driver if files, we try to avoid modifying the driver_if file and 
> have kept synchronization with pmfw.
> 
In general, we do need to follow the above rules.
But smu_v13_0_irq_process() is a common function used by other asics.
Defining these macros in the corresponding smu driver_if.h will cause the
compile error.

Regards,
Ma Jun
> Best Regards,
> Kevin
> 
> -Original Message-
> From: Ma, Jun 
> Sent: Tuesday, January 23, 2024 4:13 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Lazar, Lijo ; Feng, Kenneth ; 
> Deucher, Alexander ; Wang, Yang(Kevin) 
> ; Ma, Jun 
> Subject: [PATCH v2 2/2] drm/amdgpu/pm: Use macro definitions in the smu IH 
> process function
> 
> Replace the hard-coded numbers with macro definition
> 
> Signed-off-by: Ma Jun 
> ---
>  .../pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h | 11 ---  
> .../pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h |  4   
> .../pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h | 11 ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 10 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 14 +++---
>  drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0.c |  2 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 10 ++
>  7 files changed, 23 insertions(+), 39 deletions(-)
> 
> diff --git 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h
> index b114d14fc053..91229b2dadb5 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_0.h
> @@ -1618,15 +1618,4 @@ typedef struct {
>  #define TABLE_WIFIBAND12
>  #define TABLE_COUNT   13
> 
> -//IH Interupt ID
> -#define IH_INTERRUPT_ID_TO_DRIVER   0xFE
> -#define IH_INTERRUPT_CONTEXT_ID_BACO0x2
> -#define IH_INTERRUPT_CONTEXT_ID_AC  0x3
> -#define IH_INTERRUPT_CONTEXT_ID_DC  0x4
> -#define IH_INTERRUPT_CONTEXT_ID_AUDIO_D00x5
> -#define IH_INTERRUPT_CONTEXT_ID_AUDIO_D30x6
> -#define IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING  0x7
> -#define IH_INTERRUPT_CONTEXT_ID_FAN_ABNORMAL0x8
> -#define IH_INTERRUPT_CONTEXT_ID_FAN_RECOVERY0x9
> -
>  #endif
> diff --git 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h
> index ced348d2e8bb..957b177414a9 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_6.h
> @@ -209,10 +209,6 @@ typedef struct {
>floatminPsmVoltage[30];
>  } AvfsDebugTableXcd_t;
> 
> -// Defines used for IH-based thermal interrupts to GFX driver - A/X only
> -#define IH_INTERRUPT_ID_TO_DRIVER   0xFE
> -#define IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING  0x7
> -
>  //thermal over-temp mask defines for IH interrupt to host
>  #define THROTTLER_PROCHOT_BIT   0
>  #define THROTTLER_PPT_BIT   1
> diff --git 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h
> index 8b1496f8ce58..33937c1d984f 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_7.h
> @@ -1608,15 +1608,4 @@ typedef struct {
>  #define TABLE_WIFIBAND12
>  #define TABLE_COUNT   13
> 
> -//IH Interupt ID
> -#define IH_INTERRUPT_ID_TO_DRIVER   0xFE
> -#define IH_INTERRUPT_CONTEXT_ID_BACO0x2
> -#define IH_INTERRUPT_CONTEXT_ID_AC  0x3
> -#define IH_INTERRUPT_CONTEXT_ID_DC  0x4
> -#define IH_INTERRUPT_CONTEXT_ID_AUDIO_D00x5
> -#define IH_INTERRUPT_CONTEXT_ID_AUDIO_D30x6
> -#define IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING  0x7
> -#define IH_INTERRUPT_CONTEXT_ID_FAN_ABNORMAL0x8
> -#define IH_INTERRUPT_CONTEXT_ID_FAN_RECOVERY0x9
> -
>  #endif
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> index fbeb31bf9e48..ddf435cdddfa 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> @@ -1432,24 +1432,24 @@ static int smu_v11_0_irq_process(struct amdgpu_device 

Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-24 Thread Ma, Jun
Hi Mirsad,


On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:
> Hi, Ma Jun,
> 
> Normally, I would reply under the quoted text, but I will adjust to your 
> convention.
> 
> I have just discovered that your patch causes Ubuntu 22.04 LTS GNOME XWayland 
> session
> to block at typing password and ENTER in the graphical logon screen (tested 
> several times).
> 
This problem is not caused by my patch. 
Based on your syslog, it looks more like a shedule issue.
I just saw a similar problem, please refer to the link below
https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun
> After that, I was not able to even log from another box with ssh, or the 
> session would
> block (tested one time, second time too, thrid time it passed after I 
> connected before
> attempt to login on XWayland console).
> 
> You might find useful syslog and dmesg of the freeze on this link (they were 
> +100K):
> 
> https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/
> 
> The exact applied patch was this:
> 
> marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 73f6d7e72c73..6ef333df9adf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct 
> amdgpu_device *adev)
>
>   if (!amdgpu_sriov_vf(adev)) {
>   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", 
> ucode_prefix);
> -   err = amdgpu_ucode_request(adev, >gfx.rlc_fw, fw_name);
> -   /* don't check this.  There are apparently firmwares in the 
> wild with
> -* incorrect size in the header
> -*/
> -   if (err == -ENODEV)
> -   goto out;
> +   err = request_firmware(>gfx.rlc_fw, fw_name, adev->dev);
>   if (err)
> -   dev_dbg(adev->dev,
> -   "gfx10: amdgpu_ucode_request() failed 
> \"%s\"\n",
> -   fw_name);
> +   goto out;
> +
> +   /* don't validate this firmware.  There are apparently 
> firmwares
> +* in the wild with incorrect size in the header
> +*/
>   rlc_hdr = (const struct rlc_firmware_header_v2_0 
> *)adev->gfx.rlc_fw->data;
>   version_major = 
> le16_to_cpu(rlc_hdr->header.header_version_major);
>   version_minor = 
> le16_to_cpu(rlc_hdr->header.header_version_minor);
> marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
> Linux 6.7.0-xway-09721-g61da593f4458 x86_64
> marvin@defiant:~/linux/kernel/linux_torvalds$
> 
> So, there seems to be a problem with the way the patch affects XWayland.
> 
> Checked multiple times the exact commit with and without the diff.
> 
> Hope this helps, because I am not familiar with the amdgpu driver.
> 
> Best regards,
> Mirsad Todorovac
> 
> On 1/22/24 09:34, Ma, Jun wrote:
>> Perhaps similar to the problem I encountered earlier, you can
>> try the following patch
>>
>> https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html
>>
>> Regards,
>> Ma Jun
>>
>> On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:
>>> Hi,
>>>
>>> The last email did not pass to the most of the recipients due to banned .xz 
>>> attachment.
>>>
>>> As the .config is too big to send inline or uncompressed either, I will 
>>> omit it in this
>>> attempt. In the meantime, I had some success in decoding the stack trace, 
>>> but sadly not
>>> complete.
>>>
>>> I don't think this Oops is deterministic, but I am working on a reproducer.
>>>
>>> The platform is Ubuntu 22.04 LTS.
>>>
>>> Complete list of hardware and .config is available here:
>>>
>>> https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/
>>>
>>> Best regards,
>>> Mirsad
>>>
>>> ---
>>> kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 
>>> 0008
>>> kernel: [5.576707] #PF: supervisor read access in kernel mode
>>> kernel: [5.576710] #PF: error_code(0x) - not-present page
>>> kernel: [5.576712] PGD 0 P4D 0
>>> kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
>>> kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 
>>> 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
>>> kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG 
>>> Lightning, BIOS 1.21 04/26/2023
>>> kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init 
>>> (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 
>>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
>>> kernel: [ 5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 
>>> ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 

Re: [PATCH 1/1] drm/amdgpu: move the drm client creation behind drm device registration

2024-01-24 Thread Lazar, Lijo



On 1/25/2024 11:48 AM, Le Ma wrote:
> This patch is to eliminate interrupt warning below:
> 
>   "[drm] Fence fallback timer expired on ring sdma0.0".
> 
> An early vm pt clearing job is sent to SDMA ahead of interrupt enabled,
> introduced by patch below:
> 
>   - drm/amdkfd: Export DMABufs from KFD using GEM handles
> 
> And re-locating the drm client creation following after drm_dev_register
> looks like a more proper flow.
> 
> Change-Id: I0fece177b78345187068f92a823d96b3b7581140
> Signed-off-by: Le Ma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 13 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 11 +++
>  3 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index add315644773..69eb0f5574d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -139,14 +139,13 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
> *work)
>   amdgpu_device_gpu_recover(adev, NULL, _context);
>  }
>  
> -static const struct drm_client_funcs kfd_client_funcs = {
> +const struct drm_client_funcs kfd_client_funcs = {
>   .unregister = drm_client_release,
>  };
>  void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>  {
>   int i;
>   int last_valid_bit;
> - int ret;
>  
>   amdgpu_amdkfd_gpuvm_init_mem_limits();
>  
> @@ -165,12 +164,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device 
> *adev)
>   .enable_mes = adev->enable_mes,
>   };
>  
> - ret = drm_client_init(>ddev, >kfd.client, "kfd", 
> _client_funcs);
> - if (ret) {
> - dev_err(adev->dev, "Failed to init DRM client: %d\n", 
> ret);
> - return;
> - }
> -
>   /* this is going to have a few of the MSBs set that we need to
>* clear
>*/
> @@ -209,10 +202,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device 
> *adev)
>  
>   adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
>   _resources);
> - if (adev->kfd.init_complete)
> - drm_client_register(>kfd.client);
> - else
> - drm_client_release(>kfd.client);
>  
>   amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 00eed8c10cd4..b2c6f2b3c0fa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -40,6 +40,8 @@
>  
>  extern uint64_t amdgpu_amdkfd_total_mem_size;
>  
> +extern const struct drm_client_funcs kfd_client_funcs;
> +
>  enum TLB_FLUSH_TYPE {
>   TLB_FLUSH_LEGACY = 0,
>   TLB_FLUSH_LIGHTWEIGHT,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 0d0aa4b798ac..d0b98343481d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2293,6 +2293,17 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>   drm_fbdev_generic_setup(adev_to_drm(adev), 32);
>   }
>  
> + if (adev->kfd.init_complete) {
> + ret = drm_client_init(>ddev, >kfd.client, "kfd",
> +   _client_funcs);
> + if (ret) {
> + dev_err(adev->dev, "Failed to init DRM client: %d\n",
> + ret);
> + goto err_pci;
> + }
> + drm_client_register(>kfd.client);
> + }
> +

Maybe better to wrap this in amdgpu_amdkfd_drm_client_init() or similar

Thanks,
Lijo

>   ret = amdgpu_debugfs_init(adev);
>   if (ret)
>   DRM_ERROR("Creating debugfs files failed (%d).\n", ret);


[PATCH 1/1] drm/amdgpu: move the drm client creation behind drm device registration

2024-01-24 Thread Le Ma
This patch is to eliminate interrupt warning below:

  "[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt clearing job is sent to SDMA ahead of interrupt enabled,
introduced by patch below:

  - drm/amdkfd: Export DMABufs from KFD using GEM handles

And re-locating the drm client creation following after drm_dev_register
looks like a more proper flow.

Change-Id: I0fece177b78345187068f92a823d96b3b7581140
Signed-off-by: Le Ma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 13 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 11 +++
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index add315644773..69eb0f5574d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -139,14 +139,13 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)
amdgpu_device_gpu_recover(adev, NULL, _context);
 }
 
-static const struct drm_client_funcs kfd_client_funcs = {
+const struct drm_client_funcs kfd_client_funcs = {
.unregister = drm_client_release,
 };
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
int i;
int last_valid_bit;
-   int ret;
 
amdgpu_amdkfd_gpuvm_init_mem_limits();
 
@@ -165,12 +164,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
.enable_mes = adev->enable_mes,
};
 
-   ret = drm_client_init(>ddev, >kfd.client, "kfd", 
_client_funcs);
-   if (ret) {
-   dev_err(adev->dev, "Failed to init DRM client: %d\n", 
ret);
-   return;
-   }
-
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -209,10 +202,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
_resources);
-   if (adev->kfd.init_complete)
-   drm_client_register(>kfd.client);
-   else
-   drm_client_release(>kfd.client);
 
amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 00eed8c10cd4..b2c6f2b3c0fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -40,6 +40,8 @@
 
 extern uint64_t amdgpu_amdkfd_total_mem_size;
 
+extern const struct drm_client_funcs kfd_client_funcs;
+
 enum TLB_FLUSH_TYPE {
TLB_FLUSH_LEGACY = 0,
TLB_FLUSH_LIGHTWEIGHT,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 0d0aa4b798ac..d0b98343481d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2293,6 +2293,17 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
drm_fbdev_generic_setup(adev_to_drm(adev), 32);
}
 
+   if (adev->kfd.init_complete) {
+   ret = drm_client_init(>ddev, >kfd.client, "kfd",
+ _client_funcs);
+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n",
+   ret);
+   goto err_pci;
+   }
+   drm_client_register(>kfd.client);
+   }
+
ret = amdgpu_debugfs_init(adev);
if (ret)
DRM_ERROR("Creating debugfs files failed (%d).\n", ret);
-- 
2.38.1



RE: [PATCH 1/3] drm/amdgpu: Support passing poison consumption ras block to SRIOV

2024-01-24 Thread Zhang, Hawking
[AMD Official Use Only - General]


@@ -210,8 +211,10 @@ static void event_interrupt_poison_consumption_v11(struct 
kfd_node *dev,
case SOC15_INTSRC_SQ_INTERRUPT_MSG:
if (dev->dqm->ops.reset_queues)
ret = dev->dqm->ops.reset_queues(dev->dqm, pasid);
+   block = AMDGPU_RAS_BLOCK__GFX;
break;
case SOC21_INTSRC_SDMA_ECC:
+   block = AMDGPU_RAS_BLOCK__SDMA;

Hi @Chai, Thomas/@Luo, 
Zhigang,

event_interrupt_poison_consumption_v11 was duplicated from v9 generation. 
However, the hardware/firmware takes completely different approach to handle 
poison consumption in gfx11.

At this stage, let's just initialize block to AMDGPU_RAS_BLOCK__GFX for all the 
IQR sources (i.e., gfx 11 poison consumption notification was centralized to 
RLC). I believe we still need a few series to correct the v11 implementation.

With above addressed, the series is

Reviewed-by: Hawking Zhang 

Regards,
Hawking

-Original Message-
From: Luo, Zhigang mailto:zhigang@amd.com>>
Sent: Thursday, January 25, 2024 07:22
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking mailto:hawking.zh...@amd.com>>; 
Skvortsov, Victor mailto:victor.skvort...@amd.com>>; 
Saye, Sashank mailto:sashank.s...@amd.com>>; Chai, Thomas 
mailto:yipeng.c...@amd.com>>
Subject: [PATCH 1/3] drm/amdgpu: Support passing poison consumption ras block 
to SRIOV

From: YiPeng Chai mailto:yipeng.c...@amd.com>>

Support passing poison consumption ras blocks to SRIOV.

Signed-off-by: YiPeng Chai mailto:yipeng.c...@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  5 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h   |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h  |  3 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0_3.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |  3 ++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 23 +++
 .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c  |  7 --  
.../gpu/drm/amd/amdkfd/kfd_int_process_v11.c  |  7 --
 .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   |  7 --
 13 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 77e263660288..dfb93664e866 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -732,9 +732,10 @@ void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device 
*adev)
amdgpu_device_flush_hdp(adev, NULL);
 }

-void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, 
bool reset)
+void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev,
+   enum amdgpu_ras_block block, bool reset)
 {
-   amdgpu_umc_poison_handler(adev, reset);
+   amdgpu_umc_poison_handler(adev, block, reset);
 }

 int amdgpu_amdkfd_send_close_event_drain_irq(struct amdgpu_device *adev, diff 
--git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 584a0cea5572..50d3e0149032 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -334,7 +334,7 @@ void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device 
*adev);  int amdgpu_amdkfd_get_tile_config(struct amdgpu_device *adev,
struct tile_config *config);
 void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev,
-   bool reset);
+   enum amdgpu_ras_block block, bool reset);
 bool amdgpu_amdkfd_bo_mapped_to_dev(struct amdgpu_device *adev, struct kgd_mem 
*mem);  void amdgpu_amdkfd_block_mmu_notifications(void *p);  int 
amdgpu_amdkfd_criu_resume(void *p); diff --git 
a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index ebcd1cb60052..79bf6bd428a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2041,7 +2041,7 @@ static void 
amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
}
}

-   amdgpu_umc_poison_handler(adev, false);
+   amdgpu_umc_poison_handler(adev, obj->head.block, false);

if (block_obj->hw_ops && block_obj->hw_ops->handle_poison_consumption)
poison_stat = 
block_obj->hw_ops->handle_poison_consumption(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index a6cdb69897f2..20436f81856a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -246,7 +246,8 @@ int 

[PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

2024-01-24 Thread Prike Liang
In the pm abort case the gfx power rail not turn off from FCH side and
this will lead to the gfx reinitialized failed base on the unknown gfx
HW status, so let's reset the gpu to a known good power state.

Signed-off-by: Prike Liang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +
 drivers/gpu/drm/amd/amdgpu/soc15.c | 8 +++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 56d9dfa61290..4c40ffaaa5c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4627,6 +4627,11 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
return r;
}
 
+   if(amdgpu_asic_need_reset_on_init(adev)) {
+   DRM_INFO("PM abort case and let's reset asic \n");
+   amdgpu_asic_reset(adev);
+   }
+
if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
return 0;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 15033efec2ba..9329a00b6abc 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -804,9 +804,16 @@ static bool soc15_need_reset_on_init(struct amdgpu_device 
*adev)
if (adev->asic_type == CHIP_RENOIR)
return true;
 
+   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
+
/* Just return false for soc15 GPUs.  Reset does not seem to
 * be necessary.
 */
+   if (adev->in_suspend && !adev->in_s0ix &&
+   !adev->pm_complete &&
+   sol_reg)
+   return true;
+
if (!amdgpu_passthrough(adev))
return false;
 
@@ -816,7 +823,6 @@ static bool soc15_need_reset_on_init(struct amdgpu_device 
*adev)
/* Check sOS sign of life register to confirm sys driver and sOS
 * are already been loaded.
 */
-   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
if (sol_reg)
return true;
 
-- 
2.34.1



[PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for PM abort case

2024-01-24 Thread Prike Liang
In the PM abort cases, the gfx power rail doesn't turn off so
some GFXDEC registers/CSB can't reset to default vaule. In order
to avoid unexpected problem now need skip to program GFXDEC registers
and bypass issue CSB packet for PM abort case.

Signed-off-by: Prike Liang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 6 ++
 3 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c5f3859fd682..9e9e3385b5d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1079,6 +1079,8 @@ struct amdgpu_device {
boolin_s3;
boolin_s4;
boolin_s0ix;
+   /* indicate amdgpu suspension status */
+   boolpm_complete;
 
enum pp_mp1_state   mp1_state;
struct amdgpu_doorbell_index doorbell_index;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 475bd59c9ac2..9cb8f7fe55cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2472,6 +2472,7 @@ static int amdgpu_pmops_suspend(struct device *dev)
struct drm_device *drm_dev = dev_get_drvdata(dev);
struct amdgpu_device *adev = drm_to_adev(drm_dev);
 
+   adev->pm_complete = false;
if (amdgpu_acpi_is_s0ix_active(adev))
adev->in_s0ix = true;
else if (amdgpu_acpi_is_s3_active(adev))
@@ -2486,6 +2487,7 @@ static int amdgpu_pmops_suspend_noirq(struct device *dev)
struct drm_device *drm_dev = dev_get_drvdata(dev);
struct amdgpu_device *adev = drm_to_adev(drm_dev);
 
+   adev->pm_complete = true;
if (amdgpu_acpi_should_gpu_reset(adev))
return amdgpu_asic_reset(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 57808be6e3ec..0abdc85eda77 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3034,6 +3034,12 @@ static int gfx_v9_0_cp_gfx_start(struct amdgpu_device 
*adev)
 
gfx_v9_0_cp_gfx_enable(adev, true);
 
+   /* TODO: double confirm whether need to reinitialize gfxedc and submit 
csb packet
+* on the other gfx generations for the pm suspend abort case. */
+   if (adev->in_suspend && !adev->pm_complete) {
+   DRM_INFO(" will skip the csb ring write\n");
+   return 0;
+   }
r = amdgpu_ring_alloc(ring, gfx_v9_0_get_csb_size(adev) + 4 + 3);
if (r) {
DRM_ERROR("amdgpu: cp failed to lock ring (%d).\n", r);
-- 
2.34.1



RE: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

2024-01-24 Thread Liang, Prike
[AMD Official Use Only - General]

Hi, Alex

> -Original Message-
> From: Alex Deucher 
> Sent: Thursday, January 25, 2024 12:07 AM
> To: Liang, Prike 
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Sharma, Deepak
> 
> Subject: Re: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case
>
> On Wed, Jan 24, 2024 at 2:17 AM Prike Liang  wrote:
> >
> > In the pm abort case the gfx power rail not turn off from FCH side and
> > this will lead to the gfx reinitialized failed base on the unknown gfx
> > HW status, so let's reset the gpu to a known good power state.
> >
> > Signed-off-by: Prike Liang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 5 +
> >  drivers/gpu/drm/amd/amdgpu/soc15.c  | 7 ++-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index 9153f69bad7f..14b66c49a536 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -2935,6 +2935,11 @@ static int psp_resume(void *handle)
> >
> > dev_info(adev->dev, "PSP is resuming...\n");
> >
> > +   if(amdgpu_asic_need_reset_on_init(adev)) {
> > +   DRM_INFO("PM abort case and let's reset asic \n");
> > +   amdgpu_asic_reset(adev);
> > +   }
> > +
>
> Seems like it would be better to put this in the resume callback.
[Liang, Prike]  Yes, it makes sense to put the device level function call in 
device resume callback.
>
> > if (psp->mem_train_ctx.enable_mem_training) {
> > ret = psp_mem_training(psp, PSP_MEM_TRAIN_RESUME);
> > if (ret) {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
> > b/drivers/gpu/drm/amd/amdgpu/soc15.c
> > index 15033efec2ba..6ec4f6958c4c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> > @@ -804,9 +804,15 @@ static bool soc15_need_reset_on_init(struct
> amdgpu_device *adev)
> > if (adev->asic_type == CHIP_RENOIR)
> > return true;
> >
> > +   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
>
> Is this register consistent for all soc15 chips?
>
> Alex
>
Yes, confirmed from the PSP firmware guys this register can be used for 
indicating the TOS/SOS live status on the SOC15 series.

> > +
> > /* Just return false for soc15 GPUs.  Reset does not seem to
> >  * be necessary.
> >  */
> > +   if (adev->in_suspend && !added->microplate &&
> > +   sol_reg)
> > +   return true;
> > +
> > if (!amdgpu_passthrough(adev))
> > return false;
> >
> > @@ -816,7 +822,6 @@ static bool soc15_need_reset_on_init(struct
> amdgpu_device *adev)
> > /* Check sOS sign of life register to confirm sys driver and sOS
> >  * are already been loaded.
> >  */
> > -   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> > if (sol_reg)
> > return true;
> >> --
> > 2.34.1
> >


RE: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for PM abort case

2024-01-24 Thread Liang, Prike
[AMD Official Use Only - General]

Hi, Alex
> -Original Message-
> From: Alex Deucher 
> Sent: Wednesday, January 24, 2024 11:59 PM
> To: Liang, Prike 
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Sharma, Deepak
> 
> Subject: Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for
> PM abort case
>
> On Wed, Jan 24, 2024 at 2:12 AM Prike Liang  wrote:
> >
> > In the PM abort cases, the gfx power rail doesn't turn off so some
> > GFXDEC registers/CSB can't reset to default vaule. In order to avoid
> > unexpected problem now need skip to program GFXDEC registers and
> > bypass issue CSB packet for PM abort case.
> >
> > Signed-off-by: Prike Liang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 4 
> >  3 files changed, 6 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index c5f3859fd682..26d983eb831b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -1079,6 +1079,7 @@ struct amdgpu_device {
> > boolin_s3;
> > boolin_s4;
> > boolin_s0ix;
> > +   boolpm_complete;
> >
> > enum pp_mp1_state   mp1_state;
> > struct amdgpu_doorbell_index doorbell_index; diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 475bd59c9ac2..a01f9b0c2f30 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2486,6 +2486,7 @@ static int amdgpu_pmops_suspend_noirq(struct
> device *dev)
> > struct drm_device *drm_dev = dev_get_drvdata(dev);
> > struct amdgpu_device *adev = drm_to_adev(drm_dev);
> >
> > +   adev->pm_complete = true;
>
> This needs to be cleared somewhere on resume.
[Liang, Prike]  This flag is designed to indicate the amdgpu device suspension 
process status and will update the patch and clear it at the amdgpu suspension 
beginning point.
>
> > if (amdgpu_acpi_should_gpu_reset(adev))
> > return amdgpu_asic_reset(adev);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index 57808be6e3ec..3bf51f18e13c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -3034,6 +3034,10 @@ static int gfx_v9_0_cp_gfx_start(struct
> > amdgpu_device *adev)
> >
> > gfx_v9_0_cp_gfx_enable(adev, true);
> >
> > +   if (adev->in_suspend && !adev->pm_complete) {
> > +   DRM_INFO(" will skip the csb ring write\n");
> > +   return 0;
> > +   }
>
> We probably want a similar fix for other gfx generations as well.
>
> Alex
>
[Liang, Prike] IIRC, there's no issue on the Mendocino side even without the 
fix. How about keep the other gfx generations unchanged firstly and after sort 
out the failed case will add the quirk for each specific gfx respectively?

> > r = amdgpu_ring_alloc(ring, gfx_v9_0_get_csb_size(adev) + 4 + 3);
> > if (r) {
> > DRM_ERROR("amdgpu: cp failed to lock ring (%d).\n",
> > r);
> > --
> > 2.34.1
> >


[PATCH 2/3] drm/amdgpu: Add RAS_POISON_READY host response message

2024-01-24 Thread Zhigang Luo
From: Victor Skvortsov 

In a non-FLR page avoidance scenario, the host driver will
provide the bad pages in the pf2vf exchange region.

Adding a new host response message to indicate when the
pf2vf exchange region has been updated.

Signed-off-by: Victor Skvortsov 
Change-Id: I58d5d11d959d91ad5723d33fddb93570c259e245
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index d0a018da3c7a..c49bf87d4b0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -170,6 +170,9 @@ static int xgpu_nv_send_access_requests_with_param(struct 
amdgpu_device *adev,
case IDH_REQ_GPU_INIT_DATA:
event = IDH_REQ_GPU_INIT_DATA_READY;
break;
+   case IDH_RAS_POISON:
+   if (data1 != 0)
+   event = IDH_RAS_POISON_READY;
default:
break;
}
@@ -437,8 +440,10 @@ static void xgpu_nv_ras_poison_handler(struct 
amdgpu_device *adev,
if (amdgpu_ip_version(adev, UMC_HWIP, 0) < IP_VERSION(12, 0, 0)) {
xgpu_nv_send_access_requests(adev, IDH_RAS_POISON);
} else {
+   amdgpu_virt_fini_data_exchange(adev);
xgpu_nv_send_access_requests_with_param(adev,
IDH_RAS_POISON, block, 0, 0);
+   amdgpu_virt_init_data_exchange(adev);
}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h
index d0221ce08769..1e8fd90cab43 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h
@@ -51,6 +51,7 @@ enum idh_event {
IDH_FAIL,
IDH_QUERY_ALIVE,
IDH_REQ_GPU_INIT_DATA_READY,
+   IDH_RAS_POISON_READY,
 
IDH_TEXT_MESSAGE = 255,
 };
-- 
2.25.1



[PATCH 3/3] amdgpu/drm: Use vram manager for virtualization page retirement

2024-01-24 Thread Zhigang Luo
From: Victor Skvortsov 

In runtime, use vram manager for virtualization page retirement.

Signed-off-by: Victor Skvortsov 
Change-Id: Ia8fe6c7d4e4acae9d3a953b3ba4567e8fc6de0fa
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 30 
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index f5c66e0038b5..6ff7d3fb2008 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -250,11 +250,11 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
amdgpu_device *adev)
if (!*data)
goto data_failure;
 
-   bps = kmalloc_array(align_space, sizeof((*data)->bps), GFP_KERNEL);
+   bps = kmalloc_array(align_space, sizeof(*(*data)->bps), GFP_KERNEL);
if (!bps)
goto bps_failure;
 
-   bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
GFP_KERNEL);
+   bps_bo = kmalloc_array(align_space, sizeof(*(*data)->bps_bo), 
GFP_KERNEL);
if (!bps_bo)
goto bps_bo_failure;
 
@@ -287,8 +287,10 @@ static void amdgpu_virt_ras_release_bp(struct 
amdgpu_device *adev)
 
for (i = data->last_reserved - 1; i >= 0; i--) {
bo = data->bps_bo[i];
-   amdgpu_bo_free_kernel(, NULL, NULL);
-   data->bps_bo[i] = bo;
+   if (bo) {
+   amdgpu_bo_free_kernel(, NULL, NULL);
+   data->bps_bo[i] = bo;
+   }
data->last_reserved = i;
}
 }
@@ -328,6 +330,8 @@ static void amdgpu_virt_ras_reserve_bps(struct 
amdgpu_device *adev)
 {
struct amdgpu_virt *virt = >virt;
struct amdgpu_virt_ras_err_handler_data *data = virt->virt_eh_data;
+   struct amdgpu_vram_mgr *mgr = >mman.vram_mgr;
+   struct ttm_resource_manager *man = >manager;
struct amdgpu_bo *bo = NULL;
uint64_t bp;
int i;
@@ -343,12 +347,18 @@ static void amdgpu_virt_ras_reserve_bps(struct 
amdgpu_device *adev)
 * 2) a ras bad page has been reserved (duplicate error 
injection
 *for one page);
 */
-   if (amdgpu_bo_create_kernel_at(adev, bp << 
AMDGPU_GPU_PAGE_SHIFT,
-  AMDGPU_GPU_PAGE_SIZE,
-  , NULL))
-   DRM_DEBUG("RAS WARN: reserve vram for retired page %llx 
fail\n", bp);
-
-   data->bps_bo[i] = bo;
+   if  (ttm_resource_manager_used(man)) {
+   amdgpu_vram_mgr_reserve_range(>mman.vram_mgr,
+   bp << AMDGPU_GPU_PAGE_SHIFT,
+   AMDGPU_GPU_PAGE_SIZE);
+   data->bps_bo[i] = NULL;
+   } else {
+   if (amdgpu_bo_create_kernel_at(adev, bp << 
AMDGPU_GPU_PAGE_SHIFT,
+   AMDGPU_GPU_PAGE_SIZE,
+   , NULL))
+   DRM_DEBUG("RAS WARN: reserve vram for retired 
page %llx fail\n", bp);
+   data->bps_bo[i] = bo;
+   }
data->last_reserved = i + 1;
bo = NULL;
}
-- 
2.25.1



[PATCH 1/3] drm/amdgpu: Support passing poison consumption ras block to SRIOV

2024-01-24 Thread Zhigang Luo
From: YiPeng Chai 

Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  5 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h   |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h  |  3 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0_3.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |  3 ++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 23 +++
 .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c  |  7 --
 .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c  |  7 --
 .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   |  7 --
 13 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 77e263660288..dfb93664e866 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -732,9 +732,10 @@ void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device 
*adev)
amdgpu_device_flush_hdp(adev, NULL);
 }
 
-void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, 
bool reset)
+void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev,
+   enum amdgpu_ras_block block, bool reset)
 {
-   amdgpu_umc_poison_handler(adev, reset);
+   amdgpu_umc_poison_handler(adev, block, reset);
 }
 
 int amdgpu_amdkfd_send_close_event_drain_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 584a0cea5572..50d3e0149032 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -334,7 +334,7 @@ void amdgpu_amdkfd_debug_mem_fence(struct amdgpu_device 
*adev);
 int amdgpu_amdkfd_get_tile_config(struct amdgpu_device *adev,
struct tile_config *config);
 void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev,
-   bool reset);
+   enum amdgpu_ras_block block, bool reset);
 bool amdgpu_amdkfd_bo_mapped_to_dev(struct amdgpu_device *adev, struct kgd_mem 
*mem);
 void amdgpu_amdkfd_block_mmu_notifications(void *p);
 int amdgpu_amdkfd_criu_resume(void *p);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index ebcd1cb60052..79bf6bd428a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2041,7 +2041,7 @@ static void 
amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
}
}
 
-   amdgpu_umc_poison_handler(adev, false);
+   amdgpu_umc_poison_handler(adev, obj->head.block, false);
 
if (block_obj->hw_ops && block_obj->hw_ops->handle_poison_consumption)
poison_stat = 
block_obj->hw_ops->handle_poison_consumption(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index a6cdb69897f2..20436f81856a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -246,7 +246,8 @@ int amdgpu_umc_bad_page_polling_timeout(struct 
amdgpu_device *adev,
return 0;
 }
 
-int amdgpu_umc_poison_handler(struct amdgpu_device *adev, bool reset)
+int amdgpu_umc_poison_handler(struct amdgpu_device *adev,
+   enum amdgpu_ras_block block, bool reset)
 {
int ret = AMDGPU_RAS_SUCCESS;
 
@@ -297,7 +298,7 @@ int amdgpu_umc_poison_handler(struct amdgpu_device *adev, 
bool reset)
}
} else {
if (adev->virt.ops && adev->virt.ops->ras_poison_handler)
-   adev->virt.ops->ras_poison_handler(adev);
+   adev->virt.ops->ras_poison_handler(adev, block);
else
dev_warn(adev->dev,
"No ras_poison_handler interface in SRIOV!\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 83199296ed10..26d2ae498daf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -102,7 +102,8 @@ struct amdgpu_umc {
 
 int amdgpu_umc_ras_sw_init(struct amdgpu_device *adev);
 int amdgpu_umc_ras_late_init(struct amdgpu_device *adev, struct ras_common_if 
*ras_block);
-int amdgpu_umc_poison_handler(struct amdgpu_device *adev, bool reset);
+int amdgpu_umc_poison_handler(struct amdgpu_device *adev,
+   enum amdgpu_ras_block block, bool reset);
 int amdgpu_umc_process_ecc_irq(struct amdgpu_device *adev,
struct amdgpu_irq_src *source,
struct amdgpu_iv_entry *entry);
diff 

Re: [PATCH v2] drm/amdkfd: reserve the BO before validating it

2024-01-24 Thread Felix Kuehling

On 2024-01-22 4:08, Lang Yu wrote:

Fixes: 410f08516e0f ("drm/amdkfd: Move dma unmapping after TLB flush")

v2:
Avoid unmapping attachment twice when ERESTARTSYS.

[   41.708711] WARNING: CPU: 0 PID: 1463 at drivers/gpu/drm/ttm/ttm_bo.c:846 
ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.708989] Call Trace:
[   41.708992]  
[   41.708996]  ? show_regs+0x6c/0x80
[   41.709000]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.709008]  ? __warn+0x93/0x190
[   41.709014]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.709024]  ? report_bug+0x1f9/0x210
[   41.709035]  ? handle_bug+0x46/0x80
[   41.709041]  ? exc_invalid_op+0x1d/0x80
[   41.709048]  ? asm_exc_invalid_op+0x1f/0x30
[   41.709057]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 [amdgpu]
[   41.709185]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.709197]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 [amdgpu]
[   41.709337]  ? srso_alias_return_thunk+0x5/0x7f
[   41.709346]  kfd_mem_dmaunmap_attachment+0x9e/0x1e0 [amdgpu]
[   41.709467]  amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x56/0x80 [amdgpu]
[   41.709586]  kfd_ioctl_unmap_memory_from_gpu+0x1b7/0x300 [amdgpu]
[   41.709710]  kfd_ioctl+0x1ec/0x650 [amdgpu]
[   41.709822]  ? __pfx_kfd_ioctl_unmap_memory_from_gpu+0x10/0x10 [amdgpu]
[   41.709945]  ? srso_alias_return_thunk+0x5/0x7f
[   41.709949]  ? tomoyo_file_ioctl+0x20/0x30
[   41.709959]  __x64_sys_ioctl+0x9c/0xd0
[   41.709967]  do_syscall_64+0x3f/0x90
[   41.709973]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +-
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 28 +--
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  4 ++-
  3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 584a0cea5572..41854417e487 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -311,7 +311,7 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(struct 
amdgpu_device *adev,
  struct kgd_mem *mem, void *drm_priv);
  int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu(
struct amdgpu_device *adev, struct kgd_mem *mem, void 
*drm_priv);
-void amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void *drm_priv);
+int amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void *drm_priv);
  int amdgpu_amdkfd_gpuvm_sync_memory(
struct amdgpu_device *adev, struct kgd_mem *mem, bool intr);
  int amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_mem *mem,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 6f3a4cb2a9ef..7a050d46fa4d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2088,21 +2088,43 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
return ret;
  }
  
-void amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void *drm_priv)

+int amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void *drm_priv)
  {
struct kfd_mem_attachment *entry;
struct amdgpu_vm *vm;
+   bool reserved = false;
+   int ret = 0;
  
  	vm = drm_priv_to_vm(drm_priv);
  
  	mutex_lock(>lock);
  
  	list_for_each_entry(entry, >attachments, list) {

-   if (entry->bo_va->base.vm == vm)
-   kfd_mem_dmaunmap_attachment(mem, entry);
+   if (entry->bo_va->base.vm != vm)
+   continue;
+   if (entry->type == KFD_MEM_ATT_SHARED ||
+   entry->type == KFD_MEM_ATT_DMABUF)
+   continue;
+   if (!entry->bo_va->base.bo->tbo.ttm->sg)
+   continue;


You're going to great lengths to avoid the reservation when it's not 
needed by kfd_mem_dmaunmap_attachment. However, this feels a bit 
fragile. Any changes in the kfd_mem_dmaunmap_* functions could break this.




+
+   if (!reserved) {
+   ret = amdgpu_bo_reserve(mem->bo, true);
+   if (ret)
+   goto out;
+   reserved = true;
+   }
+
+   kfd_mem_dmaunmap_attachment(mem, entry);
}
  
+	if (reserved)

+   amdgpu_bo_unreserve(mem->bo);
+
+out:
mutex_unlock(>lock);
+
+   return ret;
  }
  
  int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu(

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index ce4c52ec34d8..80e90fdef291 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1442,7 +1442,9 @@ static int kfd_ioctl_unmap_memory_from_gpu(struct file 
*filep,
kfd_flush_tlb(peer_pdd, TLB_FLUSH_HEAVYWEIGHT);
  
  		/* Remove dma mapping after tlb flush to avoid IO_PAGE_FAULT */

-   

Re: [PATCH] drm/amdkfd: Use S_ENDPGM_SAVED in trap handler

2024-01-24 Thread Alex Deucher
On Wed, Jan 24, 2024 at 4:01 PM Jay Cornwall  wrote:
>
> On 1/15/2024 13:07, Jay Cornwall wrote:
> > This instruction has no functional difference to S_ENDPGM
> > but allows performance counters to track save events correctly.
> >
> > Signed-off-by: Jay Cornwall 

Acked-by: Alex Deucher 

> > ---
> >  drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h | 14 +++---
> >  .../gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm |  2 +-
> >  .../gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm  |  2 +-
> >  3 files changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h 
> > b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> > index df75863393fc..d1caaf0e6a7c 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> > +++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> > @@ -674,7 +674,7 @@ static const uint32_t cwsr_trap_gfx9_hex[] = {
> >   0x86ea6a6a, 0x8f6e837a,
> >   0xb96ee0c2, 0xbf82,
> >   0xb97a0002, 0xbf8a,
> > - 0xbe801f6c, 0xbf81,
> > + 0xbe801f6c, 0xbf9b,
> >  };
> >
> >  static const uint32_t cwsr_trap_nv1x_hex[] = {
> > @@ -1091,7 +1091,7 @@ static const uint32_t cwsr_trap_nv1x_hex[] = {
> >   0xb9eef807, 0x876dff6d,
> >   0x, 0x87fe7e7e,
> >   0x87ea6a6a, 0xb9faf802,
> > - 0xbe80226c, 0xbf81,
> > + 0xbe80226c, 0xbf9b,
> >   0xbf9f, 0xbf9f,
> >   0xbf9f, 0xbf9f,
> >   0xbf9f, 0x,
> > @@ -1574,7 +1574,7 @@ static const uint32_t cwsr_trap_arcturus_hex[] = {
> >   0x86ea6a6a, 0x8f6e837a,
> >   0xb96ee0c2, 0xbf82,
> >   0xb97a0002, 0xbf8a,
> > - 0xbe801f6c, 0xbf81,
> > + 0xbe801f6c, 0xbf9b,
> >  };
> >
> >  static const uint32_t cwsr_trap_aldebaran_hex[] = {
> > @@ -2065,7 +2065,7 @@ static const uint32_t cwsr_trap_aldebaran_hex[] = {
> >   0x86ea6a6a, 0x8f6e837a,
> >   0xb96ee0c2, 0xbf82,
> >   0xb97a0002, 0xbf8a,
> > - 0xbe801f6c, 0xbf81,
> > + 0xbe801f6c, 0xbf9b,
> >  };
> >
> >  static const uint32_t cwsr_trap_gfx10_hex[] = {
> > @@ -2500,7 +2500,7 @@ static const uint32_t cwsr_trap_gfx10_hex[] = {
> >   0x876dff6d, 0x,
> >   0x87fe7e7e, 0x87ea6a6a,
> >   0xb9faf802, 0xbe80226c,
> > - 0xbf81, 0xbf9f,
> > + 0xbf9b, 0xbf9f,
> >   0xbf9f, 0xbf9f,
> >   0xbf9f, 0xbf9f,
> >  };
> > @@ -2944,7 +2944,7 @@ static const uint32_t cwsr_trap_gfx11_hex[] = {
> >   0xb8eef802, 0xbf0d866e,
> >   0xbfa20002, 0xb97af802,
> >   0xbe80486c, 0xb97af802,
> > - 0xbe804a6c, 0xbfb0,
> > + 0xbe804a6c, 0xbfb1,
> >   0xbf9f, 0xbf9f,
> >   0xbf9f, 0xbf9f,
> >   0xbf9f, 0x,
> > @@ -3436,5 +3436,5 @@ static const uint32_t cwsr_trap_gfx9_4_3_hex[] = {
> >   0x86ea6a6a, 0x8f6e837a,
> >   0xb96ee0c2, 0xbf82,
> >   0xb97a0002, 0xbf8a,
> > - 0xbe801f6c, 0xbf81,
> > + 0xbe801f6c, 0xbf9b,
> >  };
> > diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm 
> > b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
> > index e0140df0b0ec..71b3dc0c7363 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
> > +++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
> > @@ -1104,7 +1104,7 @@ L_RETURN_WITHOUT_PRIV:
> >   s_rfe_b64   s_restore_pc_lo   
> >   //Return to the main shader program and resume execution
> >
> >  L_END_PGM:
> > - s_endpgm
> > + s_endpgm_saved
> >  end
> >
> >  function write_hwreg_to_mem(s, s_rsrc, s_mem_offset)
> > diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm 
> > b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
> > index e506411ad28a..bb26338204f4 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
> > +++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
> > @@ -921,7 +921,7 @@ L_RESTORE:
> >  /*   the END   */
> >  
> > /**/
> >  L_END_PGM:
> > -s_endpgm
> > +s_endpgm_saved
> >
> >  end
> >
>
> Ping. Patch has been tested and verified, just looking for an Ack.


Re: [PATCH] drm/amdkfd: Use S_ENDPGM_SAVED in trap handler

2024-01-24 Thread Jay Cornwall
On 1/15/2024 13:07, Jay Cornwall wrote:
> This instruction has no functional difference to S_ENDPGM
> but allows performance counters to track save events correctly.
> 
> Signed-off-by: Jay Cornwall 
> ---
>  drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h | 14 +++---
>  .../gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm |  2 +-
>  .../gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm  |  2 +-
>  3 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h 
> b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> index df75863393fc..d1caaf0e6a7c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> +++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> @@ -674,7 +674,7 @@ static const uint32_t cwsr_trap_gfx9_hex[] = {
>   0x86ea6a6a, 0x8f6e837a,
>   0xb96ee0c2, 0xbf82,
>   0xb97a0002, 0xbf8a,
> - 0xbe801f6c, 0xbf81,
> + 0xbe801f6c, 0xbf9b,
>  };
>  
>  static const uint32_t cwsr_trap_nv1x_hex[] = {
> @@ -1091,7 +1091,7 @@ static const uint32_t cwsr_trap_nv1x_hex[] = {
>   0xb9eef807, 0x876dff6d,
>   0x, 0x87fe7e7e,
>   0x87ea6a6a, 0xb9faf802,
> - 0xbe80226c, 0xbf81,
> + 0xbe80226c, 0xbf9b,
>   0xbf9f, 0xbf9f,
>   0xbf9f, 0xbf9f,
>   0xbf9f, 0x,
> @@ -1574,7 +1574,7 @@ static const uint32_t cwsr_trap_arcturus_hex[] = {
>   0x86ea6a6a, 0x8f6e837a,
>   0xb96ee0c2, 0xbf82,
>   0xb97a0002, 0xbf8a,
> - 0xbe801f6c, 0xbf81,
> + 0xbe801f6c, 0xbf9b,
>  };
>  
>  static const uint32_t cwsr_trap_aldebaran_hex[] = {
> @@ -2065,7 +2065,7 @@ static const uint32_t cwsr_trap_aldebaran_hex[] = {
>   0x86ea6a6a, 0x8f6e837a,
>   0xb96ee0c2, 0xbf82,
>   0xb97a0002, 0xbf8a,
> - 0xbe801f6c, 0xbf81,
> + 0xbe801f6c, 0xbf9b,
>  };
>  
>  static const uint32_t cwsr_trap_gfx10_hex[] = {
> @@ -2500,7 +2500,7 @@ static const uint32_t cwsr_trap_gfx10_hex[] = {
>   0x876dff6d, 0x,
>   0x87fe7e7e, 0x87ea6a6a,
>   0xb9faf802, 0xbe80226c,
> - 0xbf81, 0xbf9f,
> + 0xbf9b, 0xbf9f,
>   0xbf9f, 0xbf9f,
>   0xbf9f, 0xbf9f,
>  };
> @@ -2944,7 +2944,7 @@ static const uint32_t cwsr_trap_gfx11_hex[] = {
>   0xb8eef802, 0xbf0d866e,
>   0xbfa20002, 0xb97af802,
>   0xbe80486c, 0xb97af802,
> - 0xbe804a6c, 0xbfb0,
> + 0xbe804a6c, 0xbfb1,
>   0xbf9f, 0xbf9f,
>   0xbf9f, 0xbf9f,
>   0xbf9f, 0x,
> @@ -3436,5 +3436,5 @@ static const uint32_t cwsr_trap_gfx9_4_3_hex[] = {
>   0x86ea6a6a, 0x8f6e837a,
>   0xb96ee0c2, 0xbf82,
>   0xb97a0002, 0xbf8a,
> - 0xbe801f6c, 0xbf81,
> + 0xbe801f6c, 0xbf9b,
>  };
> diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm 
> b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
> index e0140df0b0ec..71b3dc0c7363 100644
> --- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
> +++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
> @@ -1104,7 +1104,7 @@ L_RETURN_WITHOUT_PRIV:
>   s_rfe_b64   s_restore_pc_lo 
> //Return to the main shader program and resume execution
>  
>  L_END_PGM:
> - s_endpgm
> + s_endpgm_saved
>  end
>  
>  function write_hwreg_to_mem(s, s_rsrc, s_mem_offset)
> diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm 
> b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
> index e506411ad28a..bb26338204f4 100644
> --- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
> +++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
> @@ -921,7 +921,7 @@ L_RESTORE:
>  /*   the END   */
>  /**/
>  L_END_PGM:
> -s_endpgm
> +s_endpgm_saved
>  
>  end
>  

Ping. Patch has been tested and verified, just looking for an Ack.


Re: [PATCH 2/3] drm/amd: Add a DC debug mask for IPS

2024-01-24 Thread Mario Limonciello

On 1/24/2024 11:09, roman...@amd.com wrote:

From: Roman Li 

For debugging IPS-related issues, expose a new debug mask
that allows to disable IPS.
Usage:
amdgpu.dcdebugmask=0x800

Signed-off-by: Roman Li 
Tested-by: Mark Broadworth 

Reviewed-by: Mario Limonciello 

---
  drivers/gpu/drm/amd/include/amd_shared.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/include/amd_shared.h 
b/drivers/gpu/drm/amd/include/amd_shared.h
index da9b670fec86..a89d93154ddb 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -259,6 +259,7 @@ enum DC_DEBUG_MASK {
DC_ENABLE_DML2 = 0x100,
DC_DISABLE_PSR_SU = 0x200,
DC_DISABLE_REPLAY = 0x400,
+   DC_DISABLE_IPS = 0x800,
  };
  
  enum amd_dpm_forced_level;




Re: [PATCH 3/3] drm/amd/display: "Enable IPS by default"

2024-01-24 Thread Mario Limonciello

On 1/24/2024 11:09, roman...@amd.com wrote:

From: Roman Li 

[Why]
IPS was temporary disabled due to instability.
It was fixed in dmub firmware and with:
- "drm/amd/display: Add IPS checks before dcn register access"
- "drm/amd/display: Disable ips before dc interrupt setting"

[How]
Enable IPS by default.
Disable IPS if 0x800 bit set in amdgpu.dcdebugmask module params

Signed-off-by: Roman Li 
Tested-by: Mark Broadworth 

Reviewed-by: Mario Limonciello 

---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 41994a60e2cd..9d909c09a14f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1719,7 +1719,8 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
init_data.nbio_reg_offsets = adev->reg_offset[NBIO_HWIP][0];
init_data.clk_reg_offsets = adev->reg_offset[CLK_HWIP][0];
  
-	init_data.flags.disable_ips = DMUB_IPS_DISABLE_ALL;

+if (amdgpu_dc_debug_mask & DC_DISABLE_IPS)
+   init_data.flags.disable_ips = DMUB_IPS_DISABLE_ALL;
  
  	init_data.flags.disable_ips_in_vpb = 1;
  




Re: [PATCH 1/3] drm/amd/display: Disable ips before dc interrupt setting

2024-01-24 Thread Mario Limonciello

On 1/24/2024 11:09, roman...@amd.com wrote:

From: Roman Li 

[Why]
While in IPS2 an access to dcn registers is not allowed.
If interrupt results in dc call, we should disable IPS.

[How]
Safeguard register access in IPS2 by disabling idle optimization
before calling dc interrupt setting api.

Signed-off-by: Roman Li 
Tested-by: Mark Broadworth 

Reviewed-by: Mario Limonciello 

---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
index 58b880acb087..3390f0d8420a 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
@@ -711,7 +711,7 @@ static inline int dm_irq_state(struct amdgpu_device *adev,
  {
bool st;
enum dc_irq_source irq_source;
-
+   struct dc *dc = adev->dm.dc;
struct amdgpu_crtc *acrtc = adev->mode_info.crtcs[crtc_id];
  
  	if (!acrtc) {

@@ -729,6 +729,9 @@ static inline int dm_irq_state(struct amdgpu_device *adev,
  
  	st = (state == AMDGPU_IRQ_STATE_ENABLE);
  
+	if (dc && dc->caps.ips_support && dc->idle_optimizations_allowed)

+   dc_allow_idle_optimizations(dc, false);
+
dc_interrupt_set(adev->dm.dc, irq_source, st);
return 0;
  }




[PATCH 2/3] drm/amd: Add a DC debug mask for IPS

2024-01-24 Thread Roman.Li
From: Roman Li 

For debugging IPS-related issues, expose a new debug mask
that allows to disable IPS.
Usage:
amdgpu.dcdebugmask=0x800

Signed-off-by: Roman Li 
Tested-by: Mark Broadworth 
---
 drivers/gpu/drm/amd/include/amd_shared.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/include/amd_shared.h 
b/drivers/gpu/drm/amd/include/amd_shared.h
index da9b670fec86..a89d93154ddb 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -259,6 +259,7 @@ enum DC_DEBUG_MASK {
DC_ENABLE_DML2 = 0x100,
DC_DISABLE_PSR_SU = 0x200,
DC_DISABLE_REPLAY = 0x400,
+   DC_DISABLE_IPS = 0x800,
 };
 
 enum amd_dpm_forced_level;
-- 
2.34.1



[PATCH 3/3] drm/amd/display: "Enable IPS by default"

2024-01-24 Thread Roman.Li
From: Roman Li 

[Why]
IPS was temporary disabled due to instability.
It was fixed in dmub firmware and with:
- "drm/amd/display: Add IPS checks before dcn register access"
- "drm/amd/display: Disable ips before dc interrupt setting"

[How]
Enable IPS by default.
Disable IPS if 0x800 bit set in amdgpu.dcdebugmask module params

Signed-off-by: Roman Li 
Tested-by: Mark Broadworth 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 41994a60e2cd..9d909c09a14f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1719,7 +1719,8 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
init_data.nbio_reg_offsets = adev->reg_offset[NBIO_HWIP][0];
init_data.clk_reg_offsets = adev->reg_offset[CLK_HWIP][0];
 
-   init_data.flags.disable_ips = DMUB_IPS_DISABLE_ALL;
+if (amdgpu_dc_debug_mask & DC_DISABLE_IPS)
+   init_data.flags.disable_ips = DMUB_IPS_DISABLE_ALL;
 
init_data.flags.disable_ips_in_vpb = 1;
 
-- 
2.34.1



[PATCH 1/3] drm/amd/display: Disable ips before dc interrupt setting

2024-01-24 Thread Roman.Li
From: Roman Li 

[Why]
While in IPS2 an access to dcn registers is not allowed.
If interrupt results in dc call, we should disable IPS.

[How]
Safeguard register access in IPS2 by disabling idle optimization
before calling dc interrupt setting api.

Signed-off-by: Roman Li 
Tested-by: Mark Broadworth 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
index 58b880acb087..3390f0d8420a 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
@@ -711,7 +711,7 @@ static inline int dm_irq_state(struct amdgpu_device *adev,
 {
bool st;
enum dc_irq_source irq_source;
-
+   struct dc *dc = adev->dm.dc;
struct amdgpu_crtc *acrtc = adev->mode_info.crtcs[crtc_id];
 
if (!acrtc) {
@@ -729,6 +729,9 @@ static inline int dm_irq_state(struct amdgpu_device *adev,
 
st = (state == AMDGPU_IRQ_STATE_ENABLE);
 
+   if (dc && dc->caps.ips_support && dc->idle_optimizations_allowed)
+   dc_allow_idle_optimizations(dc, false);
+
dc_interrupt_set(adev->dm.dc, irq_source, st);
return 0;
 }
-- 
2.34.1



Re: [PATCH v2] drm/amdgpu: change vm->task_info handling

2024-01-24 Thread Felix Kuehling



On 2024-01-24 9:32, Shashank Sharma wrote:


On 19/01/2024 21:23, Felix Kuehling wrote:


On 2024-01-18 14:21, Shashank Sharma wrote:

This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
   reference counted.
- introducing two new helper funcs for task_info lifecycle management
 - amdgpu_vm_get_task_info: reference counts up task_info before
   returning this info
 - amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.

This patch also does logistical changes required for existing usage
of vm->task_info.

V2: Do not block all the prints when task_info not found (Felix)

Cc: Christian Koenig 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Signed-off-by: Shashank Sharma 


Nit-picks inline.


[snip]


+/**
+ * amdgpu_vm_put_task_info_pasid - reference down the vm task_info ptr
+ * frees the vm task_info ptr at the last put
+ *
+ * @adev: drm device pointer
+ * @task_info: task_info struct under discussion.
+ * @pasid: pasid of the VM which contains task_info
+ */
+void amdgpu_vm_put_task_info_pasid(struct amdgpu_device *adev,
+   struct amdgpu_task_info *task_info,
+   u32 pasid)
+{
+    int ret;
+
+    ret = kref_put(_info->refcount, amdgpu_vm_destroy_task_info);
+
+    /* Clean up if object was removed in the last put */
+    if (ret == 1) {
+    struct amdgpu_vm *vm;
+
+    vm = amdgpu_vm_get_vm_from_pasid(adev, pasid);
+    if (!vm) {
+    WARN(1, "Invalid PASID %u to put task info\n", pasid);
+    return;
+    }
+
+    vm->task_info = NULL;


This doesn't make sense. If there is a VM pointing to the task_info, 
then the ref count should not go to 0. Therefore this whole "look up 
the VM from PASID and clear vm->task_info" seams bogus.


Actually, (ret == 1) above indicates that cleanup function has been 
called and task_info has been freed, and the vm->task_info is a 
dangling ptr.


The current design is

- first open per process/fd creates vm->task_info

- last close per process/fd frees vm->task_info



I'd expect the last put_task_info call to come from amdgpu_vm_fini. 
At that point setting task_info to NULL is probably unnecessary. But 
if we wanted that, we could set it to NULL in the caller.


Even this can be done, I can call the first get_vm_info() from vm_init 
and last put from vm_fini.


Well, actually it doesn't matter where the last put comes from. The main 
point is, that vm->task_info is a counted reference. As long as that 
reference exists, the refcount should not become 0. If it does, that's a 
bug somewhere. The vm->task_info reference is only dropped in 
amdgpu_vm_fini, after which the whole vm structure is freed. So there 
should never be a need to set vm->task_info to NULL in 
amdgpu_vm_put_task_info_*.


[snip]

+static int amdgpu_vm_create_task_info(struct amdgpu_vm *vm)
+{
+    vm->task_info = kzalloc(sizeof(struct amdgpu_task_info), 
GFP_KERNEL);

+    if (!vm->task_info) {
+    DRM_ERROR("OOM while creating task_info space\n");


Printing OOM error messages is usually redundant. The allocators are 
already very noisy when OOM happens. I think checkpatch.pl also warns 
about that. Maybe it doesn't catch DRM_ERROR but only printk and 
friends.



Agree, I will make this debug instead of error.


Even a debug message is not needed. See https://lkml.org/lkml/2014/6/10/382.

[snip]

  @@ -157,18 +157,26 @@ static int gmc_v10_0_process_interrupt(struct 
amdgpu_device *adev,

  if (!printk_ratelimit())
  return 0;
  -    memset(_info, 0, sizeof(struct amdgpu_task_info));
-    amdgpu_vm_get_task_info(adev, entry->pasid, _info);
+    task_info = amdgpu_vm_get_task_info_pasid(adev, entry->pasid);
+    if (task_info) {
+    dev_err(adev->dev,
+    "[%s] page fault (src_id:%u ring:%u vmid:%u pasid:%u, 
for process %s pid %d thread %s pid %d)\n",

+    entry->vmid_src ? "mmhub" : "gfxhub",
+    entry->src_id, entry->ring_id, entry->vmid,
+    entry->pasid, task_info->process_name, task_info->tgid,
+    task_info->task_name, task_info->pid);
+    amdgpu_vm_put_task_info_pasid(adev, task_info, entry->pasid);
+    } else {
+    dev_err(adev->dev,
+    "[%s] page fault (src_id:%u ring:%u vmid:%u pasid:%u, 
no process info)\n",

+    entry->vmid_src ? "mmhub" : "gfxhub",
+    entry->src_id, entry->ring_id, entry->vmid,
+    entry->pasid);


Can we avoid the duplication here? It's too easy for them to get out 
of sync. I think it's OK to change the message format so that the 
process into is printed on a separate line. E.g.:


That's exactly I thought, but then I was afraid of breaking any 
existing scripts grepping for this exact text. I guess I can do this 
change and see if anyone complaints :).


I don't think there are any ABI guarantees for log 

Re: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

2024-01-24 Thread Alex Deucher
On Wed, Jan 24, 2024 at 2:17 AM Prike Liang  wrote:
>
> In the pm abort case the gfx power rail not turn off from FCH side and
> this will lead to the gfx reinitialized failed base on the unknown gfx
> HW status, so let's reset the gpu to a known good power state.
>
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 5 +
>  drivers/gpu/drm/amd/amdgpu/soc15.c  | 7 ++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 9153f69bad7f..14b66c49a536 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -2935,6 +2935,11 @@ static int psp_resume(void *handle)
>
> dev_info(adev->dev, "PSP is resuming...\n");
>
> +   if(amdgpu_asic_need_reset_on_init(adev)) {
> +   DRM_INFO("PM abort case and let's reset asic \n");
> +   amdgpu_asic_reset(adev);
> +   }
> +

Seems like it would be better to put this in the resume callback.

> if (psp->mem_train_ctx.enable_mem_training) {
> ret = psp_mem_training(psp, PSP_MEM_TRAIN_RESUME);
> if (ret) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 15033efec2ba..6ec4f6958c4c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -804,9 +804,15 @@ static bool soc15_need_reset_on_init(struct 
> amdgpu_device *adev)
> if (adev->asic_type == CHIP_RENOIR)
> return true;
>
> +   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);

Is this register consistent for all soc15 chips?

Alex

> +
> /* Just return false for soc15 GPUs.  Reset does not seem to
>  * be necessary.
>  */
> +   if (adev->in_suspend && !adev->pm_complete &&
> +   sol_reg)
> +   return true;
> +
> if (!amdgpu_passthrough(adev))
> return false;
>
> @@ -816,7 +822,6 @@ static bool soc15_need_reset_on_init(struct amdgpu_device 
> *adev)
> /* Check sOS sign of life register to confirm sys driver and sOS
>  * are already been loaded.
>  */
> -   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> if (sol_reg)
> return true;
>> --
> 2.34.1
>


Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for PM abort case

2024-01-24 Thread Alex Deucher
On Wed, Jan 24, 2024 at 2:12 AM Prike Liang  wrote:
>
> In the PM abort cases, the gfx power rail doesn't turn off so
> some GFXDEC registers/CSB can't reset to default vaule. In order
> to avoid unexpected problem now need skip to program GFXDEC registers
> and bypass issue CSB packet for PM abort case.
>
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 4 
>  3 files changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c5f3859fd682..26d983eb831b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1079,6 +1079,7 @@ struct amdgpu_device {
> boolin_s3;
> boolin_s4;
> boolin_s0ix;
> +   boolpm_complete;
>
> enum pp_mp1_state   mp1_state;
> struct amdgpu_doorbell_index doorbell_index;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 475bd59c9ac2..a01f9b0c2f30 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2486,6 +2486,7 @@ static int amdgpu_pmops_suspend_noirq(struct device 
> *dev)
> struct drm_device *drm_dev = dev_get_drvdata(dev);
> struct amdgpu_device *adev = drm_to_adev(drm_dev);
>
> +   adev->pm_complete = true;

This needs to be cleared somewhere on resume.

> if (amdgpu_acpi_should_gpu_reset(adev))
> return amdgpu_asic_reset(adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 57808be6e3ec..3bf51f18e13c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3034,6 +3034,10 @@ static int gfx_v9_0_cp_gfx_start(struct amdgpu_device 
> *adev)
>
> gfx_v9_0_cp_gfx_enable(adev, true);
>
> +   if (adev->in_suspend && !adev->pm_complete) {
> +   DRM_INFO(" will skip the csb ring write\n");
> +   return 0;
> +   }

We probably want a similar fix for other gfx generations as well.

Alex

> r = amdgpu_ring_alloc(ring, gfx_v9_0_get_csb_size(adev) + 4 + 3);
> if (r) {
> DRM_ERROR("amdgpu: cp failed to lock ring (%d).\n", r);
> --
> 2.34.1
>


Re: regression/bisected/6.8 commit f7fe64ad0f22ff034f8ebcfbd7299ee9cc9b57d7 leads to GPU hang when I open GNOME activities

2024-01-24 Thread Mario Limonciello

On 1/24/2024 08:37, Mikhail Gavrilov wrote:

On Wed, Jan 24, 2024 at 7:19 AM Mikhail Gavrilov
 wrote:


Who could dig into it, please?


You decided to revert it?
https://lkml.org/lkml/2024/1/22/1866


It's not a straight "git revert" on 6.8-rc1 because of some other 
contextual changes.


I posted that as an RFC specifically "in case" that's the direction we 
go and don't get a proper solution together.


Matthew also posted a debugging patch here for use with ftrace and the 
GPU scheduler events: https://gitlab.freedesktop.org/drm/amd/-/issues/3124


I reproduced it with that as well and posted my ftrace results.



Also I forgot to attach the kernel build .config in the previous
message. I'm going to fix it here.
It may be useful for reproducing my bug script.





Re: regression/bisected/6.8 commit f7fe64ad0f22ff034f8ebcfbd7299ee9cc9b57d7 leads to GPU hang when I open GNOME activities

2024-01-24 Thread Mikhail Gavrilov
On Wed, Jan 24, 2024 at 7:19 AM Mikhail Gavrilov
 wrote:
>
> Who could dig into it, please?

You decided to revert it?
https://lkml.org/lkml/2024/1/22/1866

Also I forgot to attach the kernel build .config in the previous
message. I'm going to fix it here.
It may be useful for reproducing my bug script.

-- 
Best Regards,
Mike Gavrilov.


.config.zip
Description: Zip archive


Re: [PATCH v2] drm/amdgpu: change vm->task_info handling

2024-01-24 Thread Shashank Sharma



On 19/01/2024 21:23, Felix Kuehling wrote:


On 2024-01-18 14:21, Shashank Sharma wrote:

This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
   reference counted.
- introducing two new helper funcs for task_info lifecycle management
 - amdgpu_vm_get_task_info: reference counts up task_info before
   returning this info
 - amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.

This patch also does logistical changes required for existing usage
of vm->task_info.

V2: Do not block all the prints when task_info not found (Felix)

Cc: Christian Koenig 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Signed-off-by: Shashank Sharma 


Nit-picks inline.



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |   7 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  18 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c   |  12 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 142 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  26 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c   |   2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  |  30 +++--
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  |  31 +++--
  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c   |  22 +--
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  26 ++--
  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c  |  26 ++--
  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c    |  26 ++--
  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  20 +--
  13 files changed, 287 insertions(+), 101 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c

index 0e61ebdb3f3e..99c736b6e32c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1775,9 +1775,12 @@ static int amdgpu_debugfs_vm_info_show(struct 
seq_file *m, void *unused)

  list_for_each_entry(file, >filelist, lhead) {
  struct amdgpu_fpriv *fpriv = file->driver_priv;
  struct amdgpu_vm *vm = >vm;
+    struct amdgpu_task_info *ti;
+
+    ti = amdgpu_vm_get_task_info_vm(vm);


Can ti be NULL here? I think it can, so you'd need a NULL check to 
avoid a possible kernel oops.

Agree



+    seq_printf(m, "pid:%d\tProcess:%s --\n", ti->pid, 
ti->process_name);

+    amdgpu_vm_put_task_info_vm(ti, vm);
  -    seq_printf(m, "pid:%d\tProcess:%s --\n",
-    vm->task_info.pid, vm->task_info.process_name);
  r = amdgpu_bo_reserve(vm->root.bo, true);
  if (r)
  break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

index 1f357198533f..af23746821b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -35,7 +35,7 @@ static enum drm_gpu_sched_stat 
amdgpu_job_timedout(struct drm_sched_job *s_job)

  {
  struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
  struct amdgpu_job *job = to_amdgpu_job(s_job);
-    struct amdgpu_task_info ti;
+    struct amdgpu_task_info *ti;
  struct amdgpu_device *adev = ring->adev;
  int idx;
  int r;
@@ -48,7 +48,7 @@ static enum drm_gpu_sched_stat 
amdgpu_job_timedout(struct drm_sched_job *s_job)

  return DRM_GPU_SCHED_STAT_ENODEV;
  }
  -    memset(, 0, sizeof(struct amdgpu_task_info));
+
  adev->job_hang = true;
    if (amdgpu_gpu_recovery &&
@@ -58,12 +58,16 @@ static enum drm_gpu_sched_stat 
amdgpu_job_timedout(struct drm_sched_job *s_job)

  goto exit;
  }
  -    amdgpu_vm_get_task_info(ring->adev, job->pasid, );
  DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
-  job->base.sched->name, 
atomic_read(>fence_drv.last_seq),

-  ring->fence_drv.sync_seq);
-    DRM_ERROR("Process information: process %s pid %d thread %s pid 
%d\n",

-  ti.process_name, ti.tgid, ti.task_name, ti.pid);
+  job->base.sched->name, 
atomic_read(>fence_drv.last_seq),

+  ring->fence_drv.sync_seq);


Unnecessary (and incorrect) indentation change.


Ah, my bad, looks like copy-paste screwed-up my editor config for 
alignment.

+
+    ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
+    if (ti) {
+    DRM_ERROR("Process information: process %s pid %d thread %s 
pid %d\n",

+  ti->process_name, ti->tgid, ti->task_name, ti->pid);
+    amdgpu_vm_put_task_info_pasid(ring->adev, ti, job->pasid);
+    }
    dma_fence_set_error(_job->s_fence->finished, -ETIME);
  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 4baa300121d8..bfd7a6067edd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -230,8 +230,16 @@ void amdgpu_coredump(struct amdgpu_device *adev, 
bool vram_lost,

    

Re: [PATCH v2 2/2] drm/amdgpu: Implement check_async_props for planes

2024-01-24 Thread André Almeida

Hi Ville,

Em 19/01/2024 15:25, Ville Syrjälä escreveu:

On Fri, Jan 19, 2024 at 03:12:35PM -0300, André Almeida wrote:

AMD GPUs can do async flips with changes on more properties than just
the FB ID, so implement a custom check_async_props for AMD planes.

Allow amdgpu to do async flips with IN_FENCE_ID and FB_DAMAGE_CLIPS
properties. For userspace to check if a driver support this two
properties, the strategy for now is to use TEST_ONLY commits.

Signed-off-by: André Almeida 
---
v2: Drop overlay plane option for now

  .../amd/display/amdgpu_dm/amdgpu_dm_plane.c   | 29 +++
  1 file changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
index 116121e647ca..7afe8c1b62d4 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
@@ -25,6 +25,7 @@
   */
  
  #include 

+#include 
  #include 
  #include 
  #include 
@@ -1430,6 +1431,33 @@ static void 
amdgpu_dm_plane_drm_plane_destroy_state(struct drm_plane *plane,
drm_atomic_helper_plane_destroy_state(plane, state);
  }
  
+static int amdgpu_dm_plane_check_async_props(struct drm_property *prop,

+ struct drm_plane *plane,
+ struct drm_plane_state *plane_state,
+ struct drm_mode_object *obj,
+ u64 prop_value, u64 old_val)
+{
+   struct drm_mode_config *config = >dev->mode_config;
+   int ret;
+
+   if (prop != config->prop_fb_id &&
+   prop != config->prop_in_fence_fd &&


IN_FENCE should just be allowed always.


Do you mean that the common path should allow IN_FENCE_FD for all drivers?


Re: [PATCH 1/1] drm/amdgpu: enable interrupt prior to kfd device_init

2024-01-24 Thread Lazar, Lijo



On 1/24/2024 2:28 PM, Le Ma wrote:
> This patch is to eliminate interrupt warning below:
> 
>   "[drm] Fence fallback timer expired on ring sdma0.0".
> 
> An early vm pt clearing job is sent to SDMA ahead of interrupt enabled,
> introduced by patch below:
> 
>   - drm/amdkfd: Export DMABufs from KFD using GEM handles

I think the fix needs to be in the above patch. In this flow, the client
is initialized even before the drm device is registered.

Thanks,
Lijo

> Signed-off-by: Le Ma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 56d9dfa61290..c8aa07282366 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2833,12 +2833,6 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
> *adev)
>   if (r)
>   goto init_failed;
>  
> - /* Don't init kfd if whole hive need to be reset during init */
> - if (!adev->gmc.xgmi.pending_reset) {
> - kgd2kfd_init_zone_device(adev);
> - amdgpu_amdkfd_device_init(adev);
> - }
> -
>   amdgpu_fru_get_product_info(adev);
>  
>  init_failed:
> @@ -4204,6 +4198,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  
>   amdgpu_fence_driver_hw_init(adev);
>  
> + /* Don't init kfd if whole hive need to be reset during init */
> + if (!adev->gmc.xgmi.pending_reset) {
> + kgd2kfd_init_zone_device(adev);
> + amdgpu_amdkfd_device_init(adev);
> + }
> +
>   dev_info(adev->dev,
>   "SE %d, SH per SE %d, CU per SH %d, active_cu_number %d\n",
>   adev->gfx.config.max_shader_engines,


RE: [PATCH] mm: Remove double faults once write a device pfn

2024-01-24 Thread Zhou, Xianrong
[AMD Official Use Only - General]

> > The vmf_insert_pfn_prot could cause unnecessary double faults on a
> > device pfn. Because currently the vmf_insert_pfn_prot does not
> > make the pfn writable so the pte entry is normally read-only or
> > dirty catching.
>  What? How do you got to this conclusion?
> >>> Sorry. I did not mention that this problem only exists on arm64 platform.
> >> Ok, that makes at least a little bit more sense.
> >>
> >>> Because on arm64 platform the PTE_RDONLY is automatically attached
> >>> to the userspace pte entries even through VM_WRITE + VM_SHARE.
> >>> The  PTE_RDONLY needs to be cleared in vmf_insert_pfn_prot. However
> >>> vmf_insert_pfn_prot do not make the pte writable passing false
> >>> @mkwrite to insert_pfn.
> >> Question is why is arm64 doing this? As far as I can see they must
> >> have some hardware reason for that.
> >>
> >> The mkwrite parameter to insert_pfn() was added by commit
> >> b2770da642540 to make insert_pfn() look more like insert_pfn_pmd() so
> >> that the DAX code can insert PTEs which are writable and dirty at the same
> time.
> >>
> > This is one scenario to do so. In fact on arm64 there are many
> > scenarios could be to do so. So we can let vmf_insert_pfn_prot
> > supporting @mkwrite for drivers at core layer and let drivers to
> > decide whether or not to make writable and dirty at one time. The
> > patch did this. Otherwise double faults on arm64 when call
> vmf_insert_pfn_prot.
>
> Well, that doesn't answer my question why arm64 is double faulting in the
> first place,.
>


Eh.

On arm64 When userspace mmap() with PROT_WRITE and MAP_SHARED the
vma->vm_page_prot has the PTE_RDONLY and PTE_WRITE within
PAGE_SHARED_EXEC. (seeing arm64 protection_map)

When write the userspace virtual address the first fault happen and call
into driver's .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot->insert_pfn.
The insert_pfn will establish the pte entry. However the vmf_insert_pfn_prot
pass false @mkwrite to insert_pfn by default and so insert_pfn could not make
the pfn writable and it do not call maybe_mkwrite(pte_mkdirty(entry), vma)
to clear the PTE_RDONLY bit. So the pte entry is actually write protection for 
mmu.
So when the first fault return and re-execute the store instruction the second
fault happen again. And in second fault it just only do pte_mkdirty(entry) which
clear the PTE_RDONLY.

I think so and hope no wrong.

> So as long as this isn't sorted out I'm going to reject this patch.
>
> Regards,
> Christian.
>
> >
> >> This is a completely different use case to what you try to use it
> >> here for and that looks extremely fishy to me.
> >>
> >> Regards,
> >> Christian.
> >>
> > The first fault only sets up the pte entry which actually is dirty
> > catching. And the second immediate fault to the pfn due to first
> > dirty catching when the cpu re-execute the store instruction.
>  It could be that this is done to work around some hw behavior, but
>  not because of dirty catching.
> 
> > Normally if the drivers call vmf_insert_pfn_prot and also supply
> > 'pfn_mkwrite' callback within vm_operations_struct which requires
> > the pte to be dirty catching then the vmf_insert_pfn_prot and the
> > double fault are reasonable. It is not a problem.
>  Well, as far as I can see that behavior absolutely doesn't make sense.
> 
>  When pfn_mkwrite is requested then the driver should use PAGE_COPY,
>  which is exactly what VMWGFX (the only driver using dirty tracking)
>  is
> >> doing.
>  Everybody else uses PAGE_SHARED which should make the pte writeable
>  immediately.
> 
>  Regards,
>  Christian.
> 
> > However the most of drivers calling vmf_insert_pfn_prot do not
> > supply the 'pfn_mkwrite' callback so that the second fault is
> unnecessary.
> >
> > So just like vmf_insert_mixed and vmf_insert_mixed_mkwrite pair,
> > we should also supply vmf_insert_pfn_mkwrite for drivers as well.
> >
> > Signed-off-by: Xianrong Zhou 
> > ---
> > arch/x86/entry/vdso/vma.c  |  3 ++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|  2 +-
> > drivers/gpu/drm/i915/gem/i915_gem_ttm.c|  2 +-
> > drivers/gpu/drm/nouveau/nouveau_gem.c  |  2 +-
> > drivers/gpu/drm/radeon/radeon_gem.c|  2 +-
> > drivers/gpu/drm/ttm/ttm_bo_vm.c|  8 +---
> > drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c |  8 +---
> > include/drm/ttm/ttm_bo.h   |  3 ++-
> > include/linux/mm.h |  2 +-
> > mm/memory.c| 14 +++---
> > 10 files changed, 30 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> > index 7645730dc228..dd2431c2975f 100644
> > --- a/arch/x86/entry/vdso/vma.c
> > +++ 

Re: [PATCH v2 2/2] drm/amdgpu: Implement check_async_props for planes

2024-01-24 Thread Xaver Hugl
Am Mo., 22. Jan. 2024 um 16:50 Uhr schrieb Harry Wentland
:
>
>
>
> On 2024-01-19 13:25, Ville Syrjälä wrote:
> > On Fri, Jan 19, 2024 at 03:12:35PM -0300, André Almeida wrote:
> >> AMD GPUs can do async flips with changes on more properties than just
> >> the FB ID, so implement a custom check_async_props for AMD planes.
> >>
> >> Allow amdgpu to do async flips with IN_FENCE_ID and FB_DAMAGE_CLIPS
> >> properties. For userspace to check if a driver support this two
> >> properties, the strategy for now is to use TEST_ONLY commits.
> >>
> >> Signed-off-by: André Almeida 
> >> ---
> >> v2: Drop overlay plane option for now
> >>
> >>   .../amd/display/amdgpu_dm/amdgpu_dm_plane.c   | 29 +++
> >>   1 file changed, 29 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c 
> >> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> >> index 116121e647ca..7afe8c1b62d4 100644
> >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> >> @@ -25,6 +25,7 @@
> >>*/
> >>
> >>   #include 
> >> +#include 
> >>   #include 
> >>   #include 
> >>   #include 
> >> @@ -1430,6 +1431,33 @@ static void 
> >> amdgpu_dm_plane_drm_plane_destroy_state(struct drm_plane *plane,
> >>  drm_atomic_helper_plane_destroy_state(plane, state);
> >>   }
> >>
> >> +static int amdgpu_dm_plane_check_async_props(struct drm_property *prop,
> >> +  struct drm_plane *plane,
> >> +  struct drm_plane_state *plane_state,
> >> +  struct drm_mode_object *obj,
> >> +  u64 prop_value, u64 old_val)
> >> +{
> >> +struct drm_mode_config *config = >dev->mode_config;
> >> +int ret;
> >> +
> >> +if (prop != config->prop_fb_id &&
> >> +prop != config->prop_in_fence_fd &&
> >
> > IN_FENCE should just be allowed always.
> >
> >> +prop != config->prop_fb_damage_clips) {
> >
> > This seems a bit dubious to me. How is amdgpu using the damage
> > information during async flips?
>
> Yeah, I'm also not sure this is right. Has anyone tested this
> with a PSR SU panel?
>
> Harry

I attempted to, but according to
/sys/kernel/debug/dri/1/eDP-1/psr_state, PSR never kicks in on my
laptop at all. The only reason I wanted this property though is to
reduce the number of special cases for async pageflips compositors
have to implement; as it's not necessary for any functionality I think
it's also fine to leave it out.

> >> +ret = drm_atomic_plane_get_property(plane, plane_state,
> >> +prop, _val);
> >> +return drm_atomic_check_prop_changes(ret, old_val, 
> >> prop_value, prop);
> >> +}
> >> +
> >> +if (plane_state->plane->type != DRM_PLANE_TYPE_PRIMARY) {
> >> +drm_dbg_atomic(prop->dev,
> >> +   "[OBJECT:%d] Only primary planes can be 
> >> changed during async flip\n",
> >> +   obj->id);
> >> +return -EINVAL;
> >> +}
> >> +
> >> +return 0;
> >> +}
> >> +
> >>   static const struct drm_plane_funcs dm_plane_funcs = {
> >>  .update_plane   = drm_atomic_helper_update_plane,
> >>  .disable_plane  = drm_atomic_helper_disable_plane,
> >> @@ -1438,6 +1466,7 @@ static const struct drm_plane_funcs dm_plane_funcs = 
> >> {
> >>  .atomic_duplicate_state = amdgpu_dm_plane_drm_plane_duplicate_state,
> >>  .atomic_destroy_state = amdgpu_dm_plane_drm_plane_destroy_state,
> >>  .format_mod_supported = amdgpu_dm_plane_format_mod_supported,
> >> +.check_async_props = amdgpu_dm_plane_check_async_props,
> >>   };
> >>
> >>   int amdgpu_dm_plane_init(struct amdgpu_display_manager *dm,
> >> --
> >> 2.43.0
> >


[PATCH 1/1] drm/amdgpu: enable interrupt prior to kfd device_init

2024-01-24 Thread Le Ma
This patch is to eliminate interrupt warning below:

  "[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt clearing job is sent to SDMA ahead of interrupt enabled,
introduced by patch below:

  - drm/amdkfd: Export DMABufs from KFD using GEM handles

Signed-off-by: Le Ma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 56d9dfa61290..c8aa07282366 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2833,12 +2833,6 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
if (r)
goto init_failed;
 
-   /* Don't init kfd if whole hive need to be reset during init */
-   if (!adev->gmc.xgmi.pending_reset) {
-   kgd2kfd_init_zone_device(adev);
-   amdgpu_amdkfd_device_init(adev);
-   }
-
amdgpu_fru_get_product_info(adev);
 
 init_failed:
@@ -4204,6 +4198,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 
amdgpu_fence_driver_hw_init(adev);
 
+   /* Don't init kfd if whole hive need to be reset during init */
+   if (!adev->gmc.xgmi.pending_reset) {
+   kgd2kfd_init_zone_device(adev);
+   amdgpu_amdkfd_device_init(adev);
+   }
+
dev_info(adev->dev,
"SE %d, SH per SE %d, CU per SH %d, active_cu_number %d\n",
adev->gfx.config.max_shader_engines,
-- 
2.38.1