Re: [PATCH] drm/amdgpu: Fix a typo

2021-03-18 Thread Alex Deucher
Applied both patches.  Thanks!

Alex

On Thu, Mar 18, 2021 at 7:20 PM Bhaskar Chowdhury  wrote:
>
>
> s/proces/process/
>
> Signed-off-by: Bhaskar Chowdhury 
> ---
>  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> index bf3857867f51..c1d5a3085bae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> @@ -598,7 +598,7 @@ static int psp_v11_0_memory_training_send_msg(struct psp_context *psp, int msg)
>  }
>
>  /*
> - * save and restore proces
> + * save and restore process
>   */
>  static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>  {
> --
> 2.26.2
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages

2021-03-18 Thread Alex Deucher
Applied.  Thanks!

Alex

On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian  wrote:
>
> Reviewed-by: Christian König 
> 
> Von: Daniel Gomez 
> Gesendet: Donnerstag, 18. März 2021 09:32
> Cc: dag...@gmail.com ; Daniel Gomez ; 
> Deucher, Alexander ; Koenig, Christian 
> ; David Airlie ; Daniel Vetter 
> ; amd-gfx@lists.freedesktop.org 
> ; dri-de...@lists.freedesktop.org 
> ; linux-ker...@vger.kernel.org 
> 
> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
>
> If userptr pages have been pinned but not bound,
> they remain uncleared.
>
> Signed-off-by: Daniel Gomez 
> ---
>  drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index e8c66d10478f..bbcc6264d48f 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
>  struct radeon_ttm_tt *gtt = (void *)ttm;
>  struct radeon_device *rdev = radeon_get_rdev(bdev);
>
> +   if (gtt->userptr)
> +   radeon_ttm_tt_unpin_userptr(bdev, ttm);
> +
>  if (!gtt->bound)
>  return;
>
>  radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
>
> -   if (gtt->userptr)
> -   radeon_ttm_tt_unpin_userptr(bdev, ttm);
>  gtt->bound = false;
>  }
>
> --
> 2.30.2
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/ttm: Fix memory leak userptr pages

2021-03-18 Thread Alex Deucher
Applied.  Thanks!

Alex

On Wed, Mar 17, 2021 at 12:09 PM Daniel Gomez  wrote:
>
> If userptr pages have been pinned but not bound,
> they remain uncleared.
>
> Signed-off-by: Daniel Gomez 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 9fd2157b133a..50c2b4827c13 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1162,13 +1162,13 @@ static void amdgpu_ttm_backend_unbind(struct ttm_bo_device *bdev,
> struct amdgpu_ttm_tt *gtt = (void *)ttm;
> int r;
>
> -   if (!gtt->bound)
> -   return;
> -
> /* if the pages have userptr pinning then clear that first */
> if (gtt->userptr)
> amdgpu_ttm_tt_unpin_userptr(bdev, ttm);
>
> +   if (!gtt->bound)
> +   return;
> +
> if (gtt->offset == AMDGPU_BO_INVALID_OFFSET)
> return;
>
> --
> 2.30.2
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: Fix a typo

2021-03-18 Thread Bhaskar Chowdhury


s/proces/process/

Signed-off-by: Bhaskar Chowdhury 
---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index bf3857867f51..c1d5a3085bae 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -598,7 +598,7 @@ static int psp_v11_0_memory_training_send_msg(struct psp_context *psp, int msg)
 }

 /*
- * save and restore proces
+ * save and restore process
  */
 static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
 {
--
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

2021-03-18 Thread Quan, Evan
[AMD Public Use]

Thanks Guchun & Jiansong. Yes, I had the same concern as Jiansong.

BR
Evan
-Original Message-
From: Chen, Jiansong (Simon)  
Sent: Thursday, March 18, 2021 5:36 PM
To: Chen, Guchun ; Quan, Evan ; 
amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Quan, Evan 
Subject: RE: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

We still need to keep the "return 0", otherwise it may trigger the warning
"not all control paths return a value".

Regards,
Jiansong
-Original Message-
From: amd-gfx  On Behalf Of Chen, Guchun
Sent: Thursday, March 18, 2021 5:28 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Quan, Evan 
Subject: RE: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

[AMD Public Use]

One comment inline. Other than this, the patch is:

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Evan Quan
Sent: Thursday, March 18, 2021 5:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Quan, Evan 
Subject: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

The RLC was put into a wrong state on runtime suspend. Thus the RLC autoload
will fail on the succeeding runtime resume. By adding an intermediate
PPSMC_MSG_PrepareMp1ForUnload (some GC hard reset involved, designed for PnP),
we can bring RLC back into the desired state.

V2: integrate INTERRUPTS_ENABLED flag clearing into current
mp1 state set routines

Change-Id: I4eb3d5d76068412a6ab228af7fe7f794e53c1eaa
Signed-off-by: Evan Quan 
Reviewed-by: Lijo Lazar 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c   |  9 --
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 28 +++
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  1 +
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 25 +
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 14 ++
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 14 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 28 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
 9 files changed, 102 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 809f4cb738f4..ab6f4059630c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2160,9 +2160,12 @@ static int psp_load_smu_fw(struct psp_context *psp)
if (!ucode->fw || amdgpu_sriov_vf(psp->adev))
return 0;
 
-
-   if (amdgpu_in_reset(adev) && ras && ras->supported &&
-   adev->asic_type == CHIP_ARCTURUS) {
+   if ((amdgpu_in_reset(adev) &&
+ras && ras->supported &&
+adev->asic_type == CHIP_ARCTURUS) ||
+(adev->in_runpm &&
+ adev->asic_type >= CHIP_NAVI10 &&
+ adev->asic_type <= CHIP_NAVI12)) {
ret = amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_UNLOAD);
if (ret) {
DRM_WARN("Failed to set MP1 state prepare for 
reload\n"); diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 629a988b069d..21c3c149614c 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -836,6 +836,13 @@ struct pptable_funcs {
 */
int (*check_fw_status)(struct smu_context *smu);
 
+   /**
+* @set_mp1_state: put SMU into a correct state for coming
+* resume from runpm or gpu reset.
+*/
+   int (*set_mp1_state)(struct smu_context *smu,
+enum pp_mp1_state mp1_state);
+
/**
 * @setup_pptable: Initialize the power play table and populate it with
 * default values.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index bae9016fedea..470ca4e5d4d7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1927,36 +1927,16 @@ int smu_set_mp1_state(void *handle,
  enum pp_mp1_state mp1_state)
 {
struct smu_context *smu = handle;
-   uint16_t msg;
-   int ret;
+   int ret = 0;
 
if (!smu->pm_enabled)
return -EOPNOTSUPP;
 
mutex_lock(&smu->mutex);
 
-   switch (mp1_state) {
-   case PP_MP1_STATE_SHUTDOWN:
-   msg = SMU_MSG_PrepareMp1ForShutdown;
-   break;
-   case PP_MP1_STATE_UNLOAD:
-   msg = SMU_MSG_PrepareMp1ForUnload;
-   break;
-   case PP_MP1_STATE_RESET:
-   msg = SMU_MSG_PrepareMp1ForReset;
-   break;
-   case PP_MP1_STATE_NONE:
-   default:
-   mutex_unlock(&smu->mutex);
-   return 0;
-   }
-
-   ret = smu_send_smc_msg(smu, msg, NULL);
-   /* some asics may not support those messages */
-   
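
(Based on the @set_mp1_state callback added to pptable_funcs above, the rest
of this hunk presumably just dispatches to the new per-ASIC hook. A sketch
only, not the verbatim patch:

	if (smu->ppt_funcs && smu->ppt_funcs->set_mp1_state)
		ret = smu->ppt_funcs->set_mp1_state(smu, mp1_state);

	mutex_unlock(&smu->mutex);

	return ret;
)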

[PATCH] drm/amdgpu/display: properly guard dc_dsc_stream_bandwidth_in_kbps

2021-03-18 Thread Alex Deucher
Move the function protoype to the right header and guard
the call with CONFIG_DRM_AMD_DC_DCN as DSC is only available
with DCN.

Fixes: a03f6c0e26b2 ("drm/amd/display: Add changes for dsc bpp in 16ths and unify bw calculations")
Signed-off-by: Alex Deucher 
Cc: Dillon Varone 
Cc: Stephen Rothwell 
---
 drivers/gpu/drm/amd/display/dc/core/dc_link.c | 4 ++--
 drivers/gpu/drm/amd/display/dc/dc_dsc.h   | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link.c b/drivers/gpu/drm/amd/display/dc/core/dc_link.c
index 10e34e411e06..f9a33dc52c45 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link.c
@@ -3498,17 +3498,17 @@ void dc_link_enable_hpd_filter(struct dc_link *link, bool enable)
}
 }
 
-uint32_t dc_dsc_stream_bandwidth_in_kbps(uint32_t pix_clk_100hz, uint32_t bpp_x16);
-
 uint32_t dc_bandwidth_in_kbps_from_timing(
const struct dc_crtc_timing *timing)
 {
uint32_t bits_per_channel = 0;
uint32_t kbps;
 
+#if defined(CONFIG_DRM_AMD_DC_DCN)
if (timing->flags.DSC) {
return dc_dsc_stream_bandwidth_in_kbps(timing->pix_clk_100hz, 
timing->dsc_cfg.bits_per_pixel);
}
+#endif
 
switch (timing->display_color_depth) {
case COLOR_DEPTH_666:
diff --git a/drivers/gpu/drm/amd/display/dc/dc_dsc.h b/drivers/gpu/drm/amd/display/dc/dc_dsc.h
index 0c5d98524536..c51d2d961b7a 100644
--- a/drivers/gpu/drm/amd/display/dc/dc_dsc.h
+++ b/drivers/gpu/drm/amd/display/dc/dc_dsc.h
@@ -88,4 +88,6 @@ void dc_dsc_policy_set_max_target_bpp_limit(uint32_t limit);
 
 void dc_dsc_policy_set_enable_dsc_when_not_needed(bool enable);
 
+uint32_t dc_dsc_stream_bandwidth_in_kbps(uint32_t pix_clk_100hz, uint32_t bpp_x16);
+
 #endif
-- 
2.30.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Amdgpu kernel oops and freezing on system suspend and hibernate

2021-03-18 Thread Quan, Evan
[AMD Public Use]

Hi Harvey,

Resuming after mode1 reset failed according to the error logs below.
Also, according to the lspci output of the last email, it happened on a Navi14 ASIC.
However, I cannot reproduce that on my desktop platform with 2 x Navi14 ASICs.

Mär 18 13:00:43 obelix kernel: amdgpu :03:00.0: amdgpu: MODE1 reset
Mär 18 13:00:43 obelix kernel: amdgpu :03:00.0: amdgpu: GPU psp mode1 reset
Mär 18 13:00:43 obelix kernel: [drm] psp mode1 reset succeed

...
Mär 18 13:00:43 obelix kernel: [drm:psp_v11_0_ring_create [amdgpu]] *ERROR* Failed to wait for sOS ready for ring creation
Mär 18 13:00:43 obelix kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
Mär 18 13:00:43 obelix kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Mär 18 13:00:43 obelix kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block  failed -62
Mär 18 13:00:43 obelix kernel: amdgpu :03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).


Considering you seem not to be running our latest driver, according to the
complaint below, maybe it's worth trying our latest driver (@Deucher,
Alexander - where can Harvey get our latest code?).
Mär 18 12:51:36 obelix kernel: amdgpu :03:00.0: amdgpu: smu driver if version = 0x0036, smu fw if version = 0x0038, smu fw version = 0x00352100 (53.33.0)

BR
Evan
-Original Message-
From: amd-gfx  On Behalf Of Harvey
Sent: Thursday, March 18, 2021 8:17 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: Amdgpu kernel oops and freezing on system suspend and hibernate

Alex,

I waited for kernel 5.11.7 to hit our repos yesterday evening and tested 
again:

1. The suspend issue is gone - suspend and resume now work as expected.

2. System hibernation seems to be a different beast - still freezing

When invoking 'systemctl hibernate' the system does not power off (I 
waited for 5 minutes) and I have to hard reset it to start up again. It 
then tries to resume from the swap partition and comes back up with only 
the external monitor connected to the HDMI port showing a picture and 
the builtin screen of the laptop staying black. Nevertheless the system 
is frozen and not responding to either mouse or keyboard. After
another hard reset I managed to get the following log from journalctl 
(only cut the relevant part):

Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3466] 
manager: sleep: sleep requested (sleeping: no  enabled: yes)
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3473] 
device (wlp4s0): state change: disconnected -> unmanaged (reason 
'sleeping', sys-iface-state: 'managed')
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3508] 
device (wlp4s0): set-hw-addr: reset MAC address to 14:F6:D8:18:8C:EC 
(unmanage)
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3575] 
device (p2p-dev-wlp4s0): state change: disconnected -> unmanaged (reason 
'sleeping', sys-iface-state: 'managed')
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3580] 
manager: NetworkManager state is now ASLEEP
Mär 18 12:51:11 obelix wpa_supplicant[954]: nl80211: deinit 
ifname=p2p-dev-wlp4s0 disabled_11b_rates=0
Mär 18 12:51:11 obelix wpa_supplicant[954]: nl80211: deinit 
ifname=wlp4s0 disabled_11b_rates=0
Mär 18 12:51:12 obelix gsd-media-keys[1691]: Unable to get default sink
Mär 18 12:51:15 obelix gnome-shell[1496]: 
../glib/gobject/gsignal.c:2732: instance '0x560b86c67b50' has no handler 
with id '15070'
Mär 18 12:51:16 obelix gsd-usb-protect[1724]: Error calling USBGuard 
DBus to change the protection after a screensaver event: 
GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name 
org.usbguard1 was not provided by any .service files
Mär 18 12:51:16 obelix systemd[1]: Reached target Sleep.
Mär 18 12:51:16 obelix systemd[1]: Starting Suspend...
Mär 18 12:51:16 obelix systemd-sleep[2000]: Suspending system...
Mär 18 12:51:16 obelix kernel: PM: suspend entry (deep)
Mär 18 12:51:16 obelix kernel: Filesystems sync: 0.005 seconds
Mär 18 12:51:36 obelix kernel: Freezing user space processes ... 
(elapsed 0.002 seconds) done.
Mär 18 12:51:36 obelix kernel: OOM killer disabled.
Mär 18 12:51:36 obelix kernel: Freezing remaining freezable tasks ... 
(elapsed 0.001 seconds) done.
Mär 18 12:51:36 obelix kernel: printk: Suspending console(s) (use 
no_console_suspend to debug)
Mär 18 12:51:36 obelix kernel: [drm] free PSP TMR buffer
Mär 18 12:51:36 obelix kernel: [drm] free PSP TMR buffer
Mär 18 12:51:36 obelix kernel: ACPI: EC: interrupt blocked
Mär 18 12:51:36 obelix kernel: ACPI: Preparing to enter system sleep 
state S3
Mär 18 12:51:36 obelix kernel: ACPI: EC: event blocked
Mär 18 12:51:36 obelix kernel: ACPI: EC: EC stopped
Mär 18 12:51:36 obelix kernel: PM: Saving platform NVS memory
Mär 18 12:51:36 obelix kernel: Disabling non-boot CPUs ...
Mär 18 12:51:36 obelix kernel: IRQ 86: no longer affine to CPU1
Mär 18 12:51:36 obelix kernel: smpboot: CPU 1 is now offline
Mär 18 12:51:36 obelix kernel: 

RE: [PATCH] drm/amdgpu: Add additional Sienna Cichlid PCI ID

2021-03-18 Thread Chen, Guchun
[AMD Public Use]

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Friday, March 19, 2021 4:45 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: Add additional Sienna Cichlid PCI ID

Add new DID.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 5438a4d3d517..6c78107db789 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1162,6 +1162,7 @@ static const struct pci_device_id pciidlist[] = {
{0x1002, 0x73A3, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
{0x1002, 0x73AB, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
{0x1002, 0x73AE, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
+   {0x1002, 0x73AF, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
{0x1002, 0x73BF, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
 
/* Van Gogh */
-- 
2.30.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: Fix the page fault issue in amdgpu_irq_fini

2021-03-18 Thread Deng, Emily
[AMD Official Use Only - Internal Distribution Only]

>-Original Message-
>From: Christian König 
>Sent: Thursday, March 18, 2021 7:52 PM
>To: Deng, Emily ; amd-gfx@lists.freedesktop.org
>Subject: Re: [PATCH] drm/amdgpu: Fix the page fault issue in amdgpu_irq_fini
>
>Am 18.03.21 um 12:48 schrieb Emily Deng:
>> Some sources are shared by multiple client ID and source ID pairs.
>> To fix the page fault issue, set all of those entries to NULL.
>>
>> Signed-off-by: Emily Deng 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 16 +---
>>   1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> index af026109421a..623b1ac6231d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> @@ -359,7 +359,7 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
>>*/
>>   void amdgpu_irq_fini(struct amdgpu_device *adev)
>>   {
>> -unsigned i, j;
>> +unsigned i, j, m, n;
>>
>>   if (adev->irq.installed) {
>>   drm_irq_uninstall(adev_to_drm(adev));
>> @@ -380,12 +380,22 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
>>   if (!src)
>>   continue;
>>
>> -kfree(src->enabled_types);
>> +if (src->enabled_types)
>> +kfree(src->enabled_types);
>
>A NULL check before kfree() is unecessary and will be complained about by the
>static checkers.
Sorry, will remove this.
>
>> +
>>   src->enabled_types = NULL;
>> +
>
>Unrelated white space change.
Sorry, will remove this also.
>
>>   if (src->data) {
>>   kfree(src->data);
>>   kfree(src);
>> -adev->irq.client[i].sources[j] = NULL;
>> +}
>> +
>> +for (m = 0; m < AMDGPU_IRQ_CLIENTID_MAX; ++m) {
>> +if (!adev->irq.client[m].sources)
>> +continue;
>> +for (n = 0; n < AMDGPU_MAX_IRQ_SRC_ID; ++n)
>> +if (adev->irq.client[m].sources[n] == src)
>> +adev->irq.client[m].sources[n] = NULL;
>
>Hui what? The memory you set to NULL here is freed on the line below.
>
>Accessing it after that would be illegal, so why do you want to set it to NULL?
[Emily] It is inside the loop "for (j = 0; j < AMDGPU_MAX_IRQ_SRC_ID; ++j) {",
so it hasn't been freed yet at that point. Setting only
"adev->irq.client[i].sources[j] = NULL;" is not enough, because other
client ID and src ID pairs may share the same src. Those entries also need
to be set to NULL.
>
>Christian.
>
>>   }
>>   }
>>   kfree(adev->irq.client[i].sources);

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/display: Allow idle optimization based on vblank.

2021-03-18 Thread R, Bindu
[AMD Official Use Only - Internal Distribution Only]

Hi All,
   Thanks for the inputs; I have updated the patch to include these changes.

Regards,
Bindu


From: Lakha, Bhawanpreet 
Sent: Wednesday, March 17, 2021 1:02 PM
To: Michel Dänzer ; R, Bindu ; 
amd-gfx@lists.freedesktop.org 
Cc: Deucher, Alexander ; Feng, Kenneth 
; Zhou1, Tao 
Subject: Re: [PATCH] drm/amd/display: Allow idle optimization based on vblank.


[AMD Official Use Only - Internal Distribution Only]

Hi Bindu,

dc_allow_idle_optimizations() should be called within mutex_lock(&dm->dc_lock).
Please call it right after WARN_ON(!dc_commit_state(dm->dc, dc_state)) but
before the unlock().
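
As a rough illustration of the suggested ordering (a sketch only; the
surrounding amdgpu_dm code and the exact dc_allow_idle_optimizations()
signature are assumptions, not taken from the patch):

	mutex_lock(&dm->dc_lock);
	WARN_ON(!dc_commit_state(dm->dc, dc_state));
	/* ... */
	dc_allow_idle_optimizations(dm->dc, true); /* assumed signature */
	mutex_unlock(&dm->dc_lock);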

Bhawan

From: amd-gfx  on behalf of Michel 
Dänzer 
Sent: March 17, 2021 7:37 AM
To: R, Bindu ; amd-gfx@lists.freedesktop.org 

Cc: Deucher, Alexander ; Feng, Kenneth 
; Zhou1, Tao ; Lakha, Bhawanpreet 

Subject: Re: [PATCH] drm/amd/display: Allow idle optimization based on vblank.

On 2021-03-17 12:49 a.m., Bindu Ramamurthy wrote:
> [Why]
> idle optimization was being disabled after commit.
>
> [How]
> check vblank count for display off and enable idle optimization based on this 
> count.
>
> Signed-off-by: Bindu Ramamurthy 
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 553e39f9538c..56a55143ad2d 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -987,7 +987,7 @@ static void event_mall_stutter(struct work_struct *work)
>
>if (vblank_work->enable)
>dm->active_vblank_irq_count++;
> - else
> + else if(dm->active_vblank_irq_count)
>dm->active_vblank_irq_count--;

The commit log should explain why this part is needed.


--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/dp_mst: Enhance DP MST topology logging

2021-03-18 Thread Lyude Paul
(going to try to take a look at this tomorrow JFYI)

On Thu, 2021-03-18 at 11:55 -0400, Eryk Brol wrote:
> [why]
> MST topology print was missing fec logging and pdt printed
> as an int wasn't clear. vcpi and payload info were also logged as an
> arbitrary series of ints which require the user to know the ordering
> of the prints, making the logs difficult to use.
> 
> [how]
> -add fec logging
> -add pdt parsing into strings
> -format vcpi and payload info into tables with headings
> -clean up topology prints
> 
> Signed-off-by: Eryk Brol 
> ---
>  drivers/gpu/drm/drm_dp_mst_topology.c | 67 ---
>  1 file changed, 51 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c
> index 932c4641ec3e..3afeaa59cbaa 100644
> --- a/drivers/gpu/drm/drm_dp_mst_topology.c
> +++ b/drivers/gpu/drm/drm_dp_mst_topology.c
> @@ -4720,6 +4720,24 @@ static void drm_dp_mst_kick_tx(struct drm_dp_mst_topology_mgr *mgr)
> queue_work(system_long_wq, &mgr->tx_work);
>  }
>  
> +static char *pdt_to_string(u8 pdt)
> +{
> +   switch (pdt) {
> +   case DP_PEER_DEVICE_NONE:
> +   return "NONE";
> +   case DP_PEER_DEVICE_SOURCE_OR_SST:
> +   return "SOURCE OR SST";
> +   case DP_PEER_DEVICE_MST_BRANCHING:
> +   return "MST BRANCHING";
> +   case DP_PEER_DEVICE_SST_SINK:
> +   return "SST SINK";
> +   case DP_PEER_DEVICE_DP_LEGACY_CONV:
> +   return "DP LEGACY CONV";
> +   default:
> +   return "ERR";
> +   }
> +}
> +
>  static void drm_dp_mst_dump_mstb(struct seq_file *m,
>  struct drm_dp_mst_branch *mstb)
>  {
> @@ -4732,9 +4750,20 @@ static void drm_dp_mst_dump_mstb(struct seq_file *m,
> prefix[i] = '\t';
> prefix[i] = '\0';
>  
> -   seq_printf(m, "%smst: %p, %d\n", prefix, mstb, mstb->num_ports);
> +   seq_printf(m, "%smstb - [%p]: num_ports: %d\n", prefix, mstb, mstb-
> >num_ports);
> list_for_each_entry(port, >ports, next) {
> -   seq_printf(m, "%sport: %d: input: %d: pdt: %d, ddps: %d ldps:
> %d, sdp: %d/%d, %p, conn: %p\n", prefix, port->port_num, port->input, port-
> >pdt, port->ddps, port->ldps, port->num_sdp_streams, port-
> >num_sdp_stream_sinks, port, port->connector);
> +   seq_printf(m, "%sport %d - [%p] (%s - %s): ddps: %d, ldps: %d,
> sdp: %d/%d, fec: %s, conn: %p\n",
> +   prefix,
> +   port->port_num,
> +   port,
> +   port->input ? "input" : "output",
> +   pdt_to_string(port->pdt),
> +   port->ddps,
> +   port->ldps,
> +   port->num_sdp_streams,
> +   port->num_sdp_stream_sinks,
> +   port->fec_capable ? "true" : "false",
> +   port->connector);
> if (port->mstb)
> drm_dp_mst_dump_mstb(m, port->mstb);
> }
> @@ -4787,33 +4816,39 @@ void drm_dp_mst_dump_topology(struct seq_file *m,
> mutex_unlock(&mgr->lock);
>  
> mutex_lock(&mgr->payload_lock);
> -   seq_printf(m, "vcpi: %lx %lx %d\n", mgr->payload_mask, mgr->vcpi_mask,
> -   mgr->max_payloads);
> +   seq_printf(m, "\n *** VCPI Info ***\npayload_mask: %lx, vcpi_mask: %lx, max_payloads: %d\n",
> +   mgr->payload_mask,
> +   mgr->vcpi_mask,
> +   mgr->max_payloads);
>  
> +   seq_printf(m, "\n|   idx   |  port # |  vcp_id | # slots | sink
> name |\n");
> for (i = 0; i < mgr->max_payloads; i++) {
> if (mgr->proposed_vcpis[i]) {
> char name[14];
>  
> port = container_of(mgr->proposed_vcpis[i], struct drm_dp_mst_port, vcpi);
> fetch_monitor_name(mgr, port, name, sizeof(name));
> -   seq_printf(m, "vcpi %d: %d %d %d sink name: %s\n", i,
> -  port->port_num, port->vcpi.vcpi,
> -  port->vcpi.num_slots,
> -  (*name != 0) ? name :  "Unknown");
> +   seq_printf(m, "%10d%10d%10d%10d%20s\n",
> +   i,
> +   port->port_num,
> +   port->vcpi.vcpi,
> +   port->vcpi.num_slots,
> +   (*name != 0) ? name :  "Unknown");
> } else
> -   seq_printf(m, "vcpi %d:unused\n", i);
> +   seq_printf(m, "%6d - Unused\n", i);
> }
> +   seq_printf(m, "\n *** Payload Info ***\n");
> +

[PATCH] drm/amdgpu: Add additional Sienna Cichlid PCI ID

2021-03-18 Thread Alex Deucher
Add new DID.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 5438a4d3d517..6c78107db789 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1162,6 +1162,7 @@ static const struct pci_device_id pciidlist[] = {
{0x1002, 0x73A3, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
{0x1002, 0x73AB, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
{0x1002, 0x73AE, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
+   {0x1002, 0x73AF, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
{0x1002, 0x73BF, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_SIENNA_CICHLID},
 
/* Van Gogh */
-- 
2.30.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH V2] drm/amdgpu: Fix a typo

2021-03-18 Thread Bhaskar Chowdhury
s/traing/training/

...Plus the entire sentence construction for better readability.

Signed-off-by: Bhaskar Chowdhury 
---
 Changes from V1:
  Alex and Randy's suggestions incorporated.

 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index c325d6f53a71..bf3857867f51 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -661,10 +661,10 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)

if (ops & PSP_MEM_TRAIN_SEND_LONG_MSG) {
/*
-* Long traing will encroach certain mount of bottom VRAM,
-* saving the content of this bottom VRAM to system memory
-* before training, and restoring it after training to avoid
-* VRAM corruption.
+* Long training will encroach a certain amount on the bottom of VRAM;
+ * save the content from the bottom VRAM to system memory
+ * before training, and restore it after training to avoid
+ * VRAM corruption.
 */
sz = GDDR6_MEM_TRAINING_ENCROACHED_SIZE;

--
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Fix a typo

2021-03-18 Thread Bhaskar Chowdhury

On 14:12 Thu 18 Mar 2021, Alex Deucher wrote:

On Thu, Mar 18, 2021 at 2:08 PM Randy Dunlap  wrote:


On 3/18/21 4:33 AM, Bhaskar Chowdhury wrote:
>
> s/traing/training/
>
> Signed-off-by: Bhaskar Chowdhury 
> ---
>  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> index c325d6f53a71..db18e4f6cf5f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> @@ -661,7 +661,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>
>   if (ops & PSP_MEM_TRAIN_SEND_LONG_MSG) {
>   /*
> -  * Long traing will encroach certain mount of bottom VRAM,
> +  * Long training will encroach certain mount of bottom VRAM,

   amount
I think.


Yeah, I think it should read something like:

Long training will encroach a certain amount on the bottom of VRAM;
save the content from the bottom VRAM to system memory
before training, and restore it after training to avoid
VRAM corruption.

Alex



>* saving the content of this bottom VRAM to system memory
>* before training, and restoring it after training to avoid
>* VRAM corruption.


Thanks.


> --
> 2.26.2
>


--
~Randy

___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [RFC PATCH 8/9] drm/gem: Associate GEM objects with drm cgroup

2021-03-18 Thread Brian Welty


On 3/18/2021 3:16 AM, Daniel Vetter wrote:
> On Sat, Mar 6, 2021 at 1:44 AM Brian Welty  wrote:
>>
>>
>> On 2/11/2021 7:34 AM, Daniel Vetter wrote:
>>> On Wed, Feb 10, 2021 at 02:00:57PM -0800, Brian Welty wrote:

 On 2/9/2021 2:54 AM, Daniel Vetter wrote:
> On Tue, Jan 26, 2021 at 01:46:25PM -0800, Brian Welty wrote:
>> This patch adds tracking of which cgroup to make charges against for a
>> given GEM object.  We associate the current task's cgroup with GEM 
>> objects
>> as they are created.  First user of this is for charging DRM cgroup for
>> device memory allocations.  The intended behavior is for device drivers 
>> to
>> make the cgroup charging calls at the time that backing store is 
>> allocated
>> or deallocated for the object.
>>
>> Exported functions are provided for charging memory allocations for a
>> GEM object to DRM cgroup. To aid in debugging, we store how many bytes
>> have been charged inside the GEM object.  Add helpers for setting and
>> clearing the object's associated cgroup which will check that charges are
>> not being leaked.
>>
>> For shared objects, this may make the charge against a cgroup that is
>> potentially not the same cgroup as the process using the memory.  Based
>> on the memory cgroup's discussion of "memory ownership", this seems
>> acceptable [1].
>>
>> [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt, "Memory 
>> Ownership"
>>
>> Signed-off-by: Brian Welty 
>
> Since for now we only have the generic gpu/xpu/bikeshed.memory bucket that
> counts everything, why don't we also charge in these gem functions?

 I'm not sure what you mean exactly.  You want to merge/move the charging 
 logic
 proposed in patch #5 (drm_cgroup_try_charge in kernel/cgroup/drm.c) into
 drm_gem_object_charge_mem() ?

 Or reading below, I think you are okay keeping the logic separated as is, 
 but
 you want much of the code in kernel/cgroup/drm.c moved to 
 drivers/gpu/cgroup ?
 Yes, I see that should allow to reduce number of exported functions.
>>>
>>> Both. I mean we'd need to look at the code again when it's shuffled, but
>>> I'd say:
>>>
>>> - cgroup code and the charging for general gpu memory moves to
>>>   drivers/gpu/cgroup, so dma-buf heaps can use it too.
>>>
>>> - the charging for gem buffers moves into core gem code, so it happens for
>>>   all gpu drivers and all gem buffer allocations.
>>
>> Daniel, I'm not sure we're in sync on what 'charging for general gpu memory'
>> means.  Thus far, I have been proposing to charge/uncharge when backing 
>> store is
>> allocated/freed.  And thus, this would be done in DRM driver (so then also in
>> the dma-buf exporter).
>> I can't see how we'd hoist this part into drm gem code.
>> The memory limit in this series is for VRAM usage/limit not GEM buffers...
> 
> Yes this would be at gem buffer creation time. And just to get cgroups
> for drm up

Okay, but it's not one of the ones in Tejun's list to start with:
   https://lists.freedesktop.org/archives/dri-devel/2020-April/262141.html
I hoped we would start by pursuing those (gpu.weight and gpu.memory.high)
as a first step.

Limiting GEM buffers is essentially controlling virtual memory size, which
tends to just always get set to unlimited.
Would be nice to get consensus from maintainers before proceeding to implement
this.


> 
>> Unless you are talking about charging for GEM buffer creation?  But this is
>> more of a 'soft resource' more along lines of Kenny's earlier GEM buffer 
>> limit
>> control.
>> I raised issue with this then, and at the time, Tejun agreed we should keep 
>> to
>> 'hard resource' controls, see [1] and [2].
>>
>> [1] https://lists.freedesktop.org/archives/dri-devel/2019-May/218071.html
>> [2] https://lists.freedesktop.org/archives/dri-devel/2020-April/262141.html
>>
>>>
>>> - this might or might not mean a bunch less exported stuff from the
>>>   cgroups files (since you don't need separate steps for linking a gem
>>>   object to a cgroup from the actual charging), and probably no exports
>>>   anymore for drivers (since they charge nothing). That will change
>>>   when we add charging for specific memory pools I guess, but we add that
>>>   when we add tha functionality.
>>
>> ... so considering VRAM charging, then yes, we very much need to have 
>> exported
>> functions for drivers to do the charging.
>> But these can be exported from drm.ko (or new .ko?) instead of kernel.  Is
>> that still preference?   Also, if number of exported functions is concern, we
>> can replace some of it with use of function pointers.
> 
> So the reason I suggested we drop all this is because we won't charge
> in drivers, we'll charge in ttm buffer management code. Which we'll
> adopt for dg1 in upstream. But it will take some time.

Okay, thanks for clarifying.
I'm not familiar with where try_charge/uncharge would 

Re: [PATCH] PCI: quirks: Quirk PCI d3hot delay for AMD xhci

2021-03-18 Thread Bjorn Helgaas
On Tue, Mar 16, 2021 at 03:28:51PM -0400, Alex Deucher wrote:
> From: Marcin Bachry 
> 
> Renoir needs a similar delay.

See 
https://lore.kernel.org/linux-pci/20210311125322.GA216@bjorn-Precision-5520/

This is becoming a problem.  We shouldn't have to merge a quirk for
every new device.  Either the devices are defective, and AMD should
publish errata and have a plan for fixing them, or Linux is broken and
we should fix that.

There are quite a few mechanisms for controlling delays like this
(Config Request Retry Status (PCIe r5.0, sec 2.3.1), Readiness
Notifications (sec 6.23), ACPI _DSM for power-on delays (PCI Firmware
Spec r3.3)), but most are for *reducing* delay, not for extending it.

Linux supports CRS, but not all the others.  Maybe we're missing
something we should support?

How do you deal with these issues for Windows?  If it works on Windows
without quirks, we should be able to make it work on Linux as well.

> Signed-off-by: Marcin Bachry 
> Signed-off-by: Alex Deucher 
> ---
>  drivers/pci/quirks.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1904,6 +1904,9 @@ static void quirk_ryzen_xhci_d3hot(struct pci_dev *dev)
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x15e0, quirk_ryzen_xhci_d3hot);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x15e1, quirk_ryzen_xhci_d3hot);
> +/* Renoir XHCI requires longer delay when transitioning from D0 to
> + * D3hot */

No need for "me too" comments that add no additional information.

> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x1639, quirk_ryzen_xhci_d3hot);
>  
>  #ifdef CONFIG_X86_IO_APIC
>  static int dmi_disable_ioapicreroute(const struct dmi_system_id *d)
> -- 
> 2.30.2
> 
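
(For reference, the quirk body shared by these FIXUP entries is tiny: from
memory it just lengthens the D0 -> D3hot transition delay beyond the usual
10 ms default. A sketch; verify against drivers/pci/quirks.c:

	static void quirk_ryzen_xhci_d3hot(struct pci_dev *dev)
	{
		/* extend the D3hot delay for these xHCI controllers */
		dev->d3hot_delay = 20;	/* ms */
	}
)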
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Fix a typo

2021-03-18 Thread Alex Deucher
On Thu, Mar 18, 2021 at 2:08 PM Randy Dunlap  wrote:
>
> On 3/18/21 4:33 AM, Bhaskar Chowdhury wrote:
> >
> > s/traing/training/
> >
> > Signed-off-by: Bhaskar Chowdhury 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> > index c325d6f53a71..db18e4f6cf5f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> > @@ -661,7 +661,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
> >
> >   if (ops & PSP_MEM_TRAIN_SEND_LONG_MSG) {
> >   /*
> > -  * Long traing will encroach certain mount of bottom VRAM,
> > +  * Long training will encroach certain mount of bottom VRAM,
>
>amount
> I think.

Yeah, I think it should read something like:

Long training will encroach a certain amount on the bottom of VRAM;
save the content from the bottom VRAM to system memory
before training, and restore it after training to avoid
VRAM corruption.

Alex

>
> >* saving the content of this bottom VRAM to system memory
> >* before training, and restoring it after training to avoid
> >* VRAM corruption.
> > --
> > 2.26.2
> >
>
>
> --
> ~Randy
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Fix a typo

2021-03-18 Thread Randy Dunlap
On 3/18/21 4:33 AM, Bhaskar Chowdhury wrote:
> 
> s/traing/training/
> 
> Signed-off-by: Bhaskar Chowdhury 
> ---
>  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> index c325d6f53a71..db18e4f6cf5f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> @@ -661,7 +661,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
> 
>   if (ops & PSP_MEM_TRAIN_SEND_LONG_MSG) {
>   /*
> -  * Long traing will encroach certain mount of bottom VRAM,
> +  * Long training will encroach certain mount of bottom VRAM,

   amount
I think.

>* saving the content of this bottom VRAM to system memory
>* before training, and restoring it after training to avoid
>* VRAM corruption.
> --
> 2.26.2
> 


-- 
~Randy

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 1/1] drm/amdgpu: Mark Albebaran HW support as experimental

2021-03-18 Thread Russell, Kent
[AMD Public Use]

Sorry, just realized a typo in the headline. Albebaran->Aldebaran. With that 
fixed,

Reviewed-by: Kent Russell 



> -Original Message-
> From: amd-gfx  On Behalf Of Russell, 
> Kent
> Sent: Thursday, March 18, 2021 12:32 PM
> To: Kuehling, Felix ; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 1/1] drm/amdgpu: Mark Albebaran HW support as experimental
> 
> [AMD Public Use]
> 
> Reviewed-by: Kent Russell 
> 
> 
> 
> > -Original Message-
> > From: amd-gfx  On Behalf Of Felix 
> > Kuehling
> > Sent: Thursday, March 18, 2021 12:06 PM
> > To: amd-gfx@lists.freedesktop.org
> > Subject: [PATCH 1/1] drm/amdgpu: Mark Albebaran HW support as experimental
> >
> > The HW is not in production yet. Driver support is still in development.
> >
> > Signed-off-by: Felix Kuehling 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 2e7b5d922f31..a3e3760e8d62 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -1179,9 +1179,9 @@ static const struct pci_device_id pciidlist[] = {
> > {0x1002, 0x73FF, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_DIMGREY_CAVEFISH},
> >
> > /* Aldebaran */
> > -   {0x1002, 0x7408, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
> > -   {0x1002, 0x740C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
> > -   {0x1002, 0x740F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
> > +   {0x1002, 0x7408, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
> > +   {0x1002, 0x740C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
> > +   {0x1002, 0x740F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
> >
> > {0, 0, 0}
> >  };
> > --
> > 2.17.1
> >
> > ___
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> >
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 1/1] drm/amdgpu: Mark Albebaran HW support as experimental

2021-03-18 Thread Russell, Kent
[AMD Public Use]

Reviewed-by: Kent Russell 



> -Original Message-
> From: amd-gfx  On Behalf Of Felix 
> Kuehling
> Sent: Thursday, March 18, 2021 12:06 PM
> To: amd-gfx@lists.freedesktop.org
> Subject: [PATCH 1/1] drm/amdgpu: Mark Albebaran HW support as experimental
> 
> The HW is not in production yet. Driver support is still in development.
> 
> Signed-off-by: Felix Kuehling 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 2e7b5d922f31..a3e3760e8d62 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1179,9 +1179,9 @@ static const struct pci_device_id pciidlist[] = {
>   {0x1002, 0x73FF, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_DIMGREY_CAVEFISH},
> 
>   /* Aldebaran */
> - {0x1002, 0x7408, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
> - {0x1002, 0x740C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
> - {0x1002, 0x740F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
> + {0x1002, 0x7408, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
> + {0x1002, 0x740C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
> + {0x1002, 0x740F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
> 
>   {0, 0, 0}
>  };
> --
> 2.17.1
> 
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-18 Thread Andrey Grodzovsky


On 2021-03-18 6:41 a.m., Zhang, Jack (Jian) wrote:


[AMD Official Use Only - Internal Distribution Only]


Hi, Andrey

Let me summarize the background of this patch:

In the TDR resubmit step “amdgpu_device_recheck_guilty_jobs”,

it will submit the first job of each ring and re-check for the guilty job.

At that point, we have to make sure each job is in the mirror list (or
already re-inserted back).


But we found the current code never re-inserts the job into the mirror list
in the 2nd, 3rd, ... job_timeout threads (the bailing TDR threads).


This will not only cause a memleak of the bailing jobs. What’s more
important, the 1st TDR thread can never iterate over the bailing job and
set its guilty status to a correct status.


Therefore, we had to re-insert the job (or even not delete the node) for a
bailing job.


For the above V3 patch, the racing condition in my mind is:

we cannot make sure all bailing jobs are finished before we do
amdgpu_device_recheck_guilty_jobs.




Yes, that race I missed - so you say that for the 2nd, bailing thread which
extracted the job, even if it reinserts it right away after the driver
callback returns DRM_GPU_SCHED_STAT_BAILING, there is a small time slot
where the job is not in the mirror list, and so the 1st TDR might miss it and
not find that the 2nd job is the actual guilty job, right? But still, this
job will get back into the mirror list, and since it's really the bad job,
it will never signal completion, and so on the next timeout cycle it will
be caught (of course there is a starvation scenario here if more TDRs
kick in and it bails out again, but this is really unlikely).




Based on this insight, I think we have two options to solve this issue:

 1. Skip deleting the node in TDR threads 2, 3, 4, ... (using a mutex or
an atomic variable).
 2. Re-insert the bailing job, and meanwhile use a semaphore in each
TDR thread to keep the sequence as expected and ensure each job is
in the mirror list when doing the resubmit step.

For Option 1, the logic is simpler and we need only one global atomic variable.

What do you think about this plan?

Option 1 would look like the following logic:

+static atomic_t in_reset; //a global atomic var for synchronization

 static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb);

 /**
@@ -295,6 +296,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
 	 * is parked at which point it's safe.
 	 */
+	if (atomic_cmpxchg(&in_reset, 0, 1) != 0) {	//skip delete node if it's thread 1,2,3,...
+		spin_unlock(&sched->job_list_lock);
+		drm_sched_start_timeout(sched);
+		return;
+	}
+
 	list_del_init(&job->node);
 	spin_unlock(&sched->job_list_lock);

@@ -320,6 +327,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	spin_lock(&sched->job_list_lock);
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
+	atomic_set(&in_reset, 0);	//reset in_reset when the first thread finished tdr
 }



Technically it looks like it should work, as you don't access the job
pointer any longer and so there is no risk that, if signaled, it will be
freed by drm_sched_get_cleanup_job. But you can't just use one global
variable and bail from TDR based on it when different drivers run their
TDR threads in parallel, and even for amdgpu, when devices are in
different XGMI hives or are 2 independent devices in a non-XGMI setup.
There should be some kind of GPU reset group structure defined at the
drm_scheduler level for which this variable would be used.


P.S. I wonder why we can't just ref-count the job, so that even if
drm_sched_get_cleanup_job deleted it before we had a chance to stop
the scheduler thread, we wouldn't crash. This would avoid all the dancing
with deletion and reinsertion.
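
As a rough illustration of that ref-count idea (a sketch only; drm_sched_job
has no such kref today, and all names below are made up):

	struct drm_sched_job {
		struct kref refcount;	/* hypothetical */
		/* ... existing fields ... */
	};

	static void drm_sched_job_release(struct kref *ref)
	{
		struct drm_sched_job *job =
			container_of(ref, struct drm_sched_job, refcount);

		/* the actual free would happen here instead of in free_job() */
	}

	/*
	 * The timeout handler would take a reference while still holding
	 * job_list_lock, so a concurrent drm_sched_get_cleanup_job() could
	 * only drop its own reference, never free the job under us.
	 */
	kref_get(&job->refcount);
	/* ... inspect/handle the job safely ... */
	kref_put(&job->refcount, drm_sched_job_release);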


Andrey



Thanks,

Jack

*From:* amd-gfx  *On Behalf Of 
*Zhang, Jack (Jian)

*Sent:* Wednesday, March 17, 2021 11:11 PM
*To:* Christian König ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; 
Koenig, Christian ; Liu, Monk 
; Deng, Emily ; Rob Herring 
; Tomeu Vizoso ; Steven 
Price ; Grodzovsky, Andrey 

*Subject:* Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid 
memleak


[AMD Official Use Only - Internal Distribution Only]


Hi, Andrey,

Good catch, I will explore this corner case and give feedback soon~

Best,

Jack



*From:* Grodzovsky, Andrey
*Sent:* Wednesday, March 17, 2021 10:50:59 PM
*To:* Christian König; Zhang, Jack (Jian) <jack.zha...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian
[PATCH 1/1] drm/amdgpu: Mark Albebaran HW support as experimental

2021-03-18 Thread Felix Kuehling
The HW is not in production yet. Driver support is still in development.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 2e7b5d922f31..a3e3760e8d62 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1179,9 +1179,9 @@ static const struct pci_device_id pciidlist[] = {
{0x1002, 0x73FF, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_DIMGREY_CAVEFISH},
 
/* Aldebaran */
-   {0x1002, 0x7408, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
-   {0x1002, 0x740C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
-   {0x1002, 0x740F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN},
+   {0x1002, 0x7408, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
+   {0x1002, 0x740C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
+   {0x1002, 0x740F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_ALDEBARAN|AMD_EXP_HW_SUPPORT},
 
{0, 0, 0}
 };
-- 
2.17.1
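
(For context: pci_device_id entries carrying this flag are rejected by
amdgpu_pci_probe() unless the user opts in, roughly the following check,
quoted from memory; see amdgpu_drv.c:

	if ((ent->driver_data & AMD_EXP_HW_SUPPORT) && !amdgpu_exp_hw_support) {
		DRM_INFO("This hardware requires experimental hardware support.\n"
			 "See modparam exp_hw_support\n");
		return -ENODEV;
	}
)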

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/dp_mst: Enhance DP MST topology logging

2021-03-18 Thread Eryk Brol
[why]
MST topology print was missing fec logging and pdt printed
as an int wasn't clear. vcpi and payload info were also logged as an
arbitrary series of ints which require the user to know the ordering
of the prints, making the logs difficult to use.

[how]
-add fec logging
-add pdt parsing into strings
-format vcpi and payload info into tables with headings
-clean up topology prints

Signed-off-by: Eryk Brol 
---
 drivers/gpu/drm/drm_dp_mst_topology.c | 67 ---
 1 file changed, 51 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c
index 932c4641ec3e..3afeaa59cbaa 100644
--- a/drivers/gpu/drm/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/drm_dp_mst_topology.c
@@ -4720,6 +4720,24 @@ static void drm_dp_mst_kick_tx(struct drm_dp_mst_topology_mgr *mgr)
queue_work(system_long_wq, &mgr->tx_work);
 }
 
+static char *pdt_to_string(u8 pdt)
+{
+   switch (pdt) {
+   case DP_PEER_DEVICE_NONE:
+   return "NONE";
+   case DP_PEER_DEVICE_SOURCE_OR_SST:
+   return "SOURCE OR SST";
+   case DP_PEER_DEVICE_MST_BRANCHING:
+   return "MST BRANCHING";
+   case DP_PEER_DEVICE_SST_SINK:
+   return "SST SINK";
+   case DP_PEER_DEVICE_DP_LEGACY_CONV:
+   return "DP LEGACY CONV";
+   default:
+   return "ERR";
+   }
+}
+
 static void drm_dp_mst_dump_mstb(struct seq_file *m,
 struct drm_dp_mst_branch *mstb)
 {
@@ -4732,9 +4750,20 @@ static void drm_dp_mst_dump_mstb(struct seq_file *m,
prefix[i] = '\t';
prefix[i] = '\0';
 
-   seq_printf(m, "%smst: %p, %d\n", prefix, mstb, mstb->num_ports);
+   seq_printf(m, "%smstb - [%p]: num_ports: %d\n", prefix, mstb, 
mstb->num_ports);
list_for_each_entry(port, >ports, next) {
-   seq_printf(m, "%sport: %d: input: %d: pdt: %d, ddps: %d ldps: 
%d, sdp: %d/%d, %p, conn: %p\n", prefix, port->port_num, port->input, 
port->pdt, port->ddps, port->ldps, port->num_sdp_streams, 
port->num_sdp_stream_sinks, port, port->connector);
+   seq_printf(m, "%sport %d - [%p] (%s - %s): ddps: %d, ldps: %d, 
sdp: %d/%d, fec: %s, conn: %p\n",
+   prefix,
+   port->port_num,
+   port,
+   port->input ? "input" : "output",
+   pdt_to_string(port->pdt),
+   port->ddps,
+   port->ldps,
+   port->num_sdp_streams,
+   port->num_sdp_stream_sinks,
+   port->fec_capable ? "true" : "false",
+   port->connector);
if (port->mstb)
drm_dp_mst_dump_mstb(m, port->mstb);
}
@@ -4787,33 +4816,39 @@ void drm_dp_mst_dump_topology(struct seq_file *m,
mutex_unlock(&mgr->lock);
 
mutex_lock(&mgr->payload_lock);
-   seq_printf(m, "vcpi: %lx %lx %d\n", mgr->payload_mask, mgr->vcpi_mask,
-   mgr->max_payloads);
+   seq_printf(m, "\n *** VCPI Info ***\npayload_mask: %lx, vcpi_mask: %lx, max_payloads: %d\n",
+   mgr->payload_mask,
+   mgr->vcpi_mask,
+   mgr->max_payloads);
 
+   seq_printf(m, "\n|   idx   |  port # |  vcp_id | # slots | sink 
name |\n");
for (i = 0; i < mgr->max_payloads; i++) {
if (mgr->proposed_vcpis[i]) {
char name[14];
 
port = container_of(mgr->proposed_vcpis[i], struct drm_dp_mst_port, vcpi);
fetch_monitor_name(mgr, port, name, sizeof(name));
-   seq_printf(m, "vcpi %d: %d %d %d sink name: %s\n", i,
-  port->port_num, port->vcpi.vcpi,
-  port->vcpi.num_slots,
-  (*name != 0) ? name :  "Unknown");
+   seq_printf(m, "%10d%10d%10d%10d%20s\n",
+   i,
+   port->port_num,
+   port->vcpi.vcpi,
+   port->vcpi.num_slots,
+   (*name != 0) ? name :  "Unknown");
} else
-   seq_printf(m, "vcpi %d:unused\n", i);
+   seq_printf(m, "%6d - Unused\n", i);
}
+   seq_printf(m, "\n *** Payload Info ***\n");
+   seq_printf(m, "|   idx   |  state  |  start slot  | # slots |\n");
for (i = 0; i < mgr->max_payloads; i++) {
-   seq_printf(m, "payload %d: %d, %d, %d\n",
-  i,
-  mgr->payloads[i].payload_state,
-  mgr->payloads[i].start_slot,

[PATCH] drm/amdgpu: Fix a typo

2021-03-18 Thread Bhaskar Chowdhury


s/traing/training/

Signed-off-by: Bhaskar Chowdhury 
---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index c325d6f53a71..db18e4f6cf5f 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -661,7 +661,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)

if (ops & PSP_MEM_TRAIN_SEND_LONG_MSG) {
/*
-* Long traing will encroach certain mount of bottom VRAM,
+* Long training will encroach certain mount of bottom VRAM,
 * saving the content of this bottom VRAM to system memory
 * before training, and restoring it after training to avoid
 * VRAM corruption.
--
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: revert "reserve backup pages for bad page retirment"

2021-03-18 Thread Zhang, Hawking
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Christian König  
Sent: Thursday, March 18, 2021 21:09
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; Li, 
Dennis ; Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: revert "reserve backup pages for bad page 
retirment"

As noted during the review this approach doesn't make sense at all.

We should not apply any limitation inside the kernel on how much VRAM 
applications can use.

If an application or end user wants to reserve a certain amount of VRAM for 
bad page handling, we should do this in the upper layer.

This reverts commit 6ff5de4f1cfc09308d060a7085c1664a3b37f118.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 29 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h  |  4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 96 +---
 4 files changed, 15 insertions(+), 118 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d6a9787e5597..2e7b5d922f31 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -180,7 +180,7 @@ struct amdgpu_mgpu_info mgpu_info = {  };  int 
amdgpu_ras_enable = -1;  uint amdgpu_ras_mask = 0xffffffff; -int 
amdgpu_bad_page_threshold = 100;
+int amdgpu_bad_page_threshold = -1;
 struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
.timeout_fatal_disable = false,
.period = 0x23, /* default to max. timeout = 1 << 0x23 cycles */ @@ 
-854,7 +854,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 
0444);
  * faulty pages by ECC exceed threshold value and leave it for user's further
  * check.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto, 0 = 
disable bad page retirement, 100 = default value");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = 
+auto(default value), 0 = disable bad page retirement)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup 
(8 if set to greater than 8 or less than 0, only affect gfx 8+)"); diff --git 
a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a90bf33358d3..0e16683876aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1790,14 +1790,13 @@ static bool amdgpu_ras_check_bad_page(struct 
amdgpu_device *adev,
return ret;
 }
 
-static uint32_t
-amdgpu_ras_calculate_badpags_threshold(struct amdgpu_device *adev)
+static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
+   uint32_t max_length)
 {
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
int tmp_threshold = amdgpu_bad_page_threshold;
u64 val;
-   uint32_t max_length = 0;
 
-   max_length = amdgpu_ras_eeprom_get_record_max_length();
/*
 * Justification of value bad_page_cnt_threshold in ras structure
 *
@@ -1823,18 +1822,20 @@ amdgpu_ras_calculate_badpags_threshold(struct 
amdgpu_device *adev)
tmp_threshold = max_length;
 
if (tmp_threshold == -1) {
-   val = adev->gmc.real_vram_size;
+   val = adev->gmc.mc_vram_size;
do_div(val, RAS_BAD_PAGE_RATE);
-   tmp_threshold = min(lower_32_bits(val), max_length);
+   con->bad_page_cnt_threshold = min(lower_32_bits(val),
+   max_length);
+   } else {
+   con->bad_page_cnt_threshold = tmp_threshold;
}
-
-   return tmp_threshold;
 }
 
 int amdgpu_ras_recovery_init(struct amdgpu_device *adev)  {
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
struct ras_err_handler_data **data;
+   uint32_t max_eeprom_records_len = 0;
bool exc_err_limit = false;
int ret;
 
@@ -1854,16 +1855,8 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
	atomic_set(&con->in_recovery, 0);
con->adev = adev;
 
-   if (!con->bad_page_cnt_threshold) {
-   con->bad_page_cnt_threshold =
-   amdgpu_ras_calculate_badpags_threshold(adev);
-
-   ret = amdgpu_vram_mgr_reserve_backup_pages(
-   ttm_manager_type(&adev->mman.bdev, TTM_PL_VRAM),
-   con->bad_page_cnt_threshold);
-   if (ret)
-   goto out;
-   }
+   max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
+   amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
 
/* Todo: During test the SMU might fail to read the eeprom through I2C
 * when the GPU is pending on XGMI reset during probe time diff --git 

Re: [RESEND 00/53] Rid GPU from W=1 warnings

2021-03-18 Thread Daniel Vetter
On Wed, Mar 17, 2021 at 9:32 PM Daniel Vetter  wrote:
>
> On Wed, Mar 17, 2021 at 9:17 AM Lee Jones  wrote:
> >
> > On Thu, 11 Mar 2021, Lee Jones wrote:
> >
> > > On Thu, 11 Mar 2021, Daniel Vetter wrote:
> > >
> > > > On Mon, Mar 08, 2021 at 09:19:32AM +, Lee Jones wrote:
> > > > > On Fri, 05 Mar 2021, Roland Scheidegger wrote:
> > > > >
> > > > > > The vmwgfx ones look all good to me, so for
> > > > > > 23-53: Reviewed-by: Roland Scheidegger 
> > > > > > That said, they were already signed off by Zack, so not sure what
> > > > > > happened here.
> > > > >
> > > > > Yes, they were accepted at one point, then dropped without a reason.
> > > > >
> > > > > Since I rebased onto the latest -next, I had to pluck them back out of
> > > > > a previous one.
> > > >
> > > > They should show up in linux-next again. We merge patches for next merge
> > > > window even during the current merge window, but need to make sure they
> > > > don't pollute linux-next. Occasionally the cut off is wrong so patches
> > > > show up, and then get pulled again.
> > > >
> > > > Unfortunately especially the 5.12 merge cycle was very wobbly due to 
> > > > some
> > > > confusion here. But your patches should all be in linux-next again (they
> > > > are queued up for 5.13 in drm-misc-next, I checked that).
> > > >
> > > > Sorry for the confusion here.
> > >
> > > Oh, I see.  Well so long as they don't get dropped, I'll be happy.
> > >
> > > Thanks for the explanation Daniel
> >
> > After rebasing today, all of my GPU patches have remained.  Would
> > someone be kind enough to check that everything is still in order
> > please?
>
> It's still broken somehow. I've kicked Maxime and Maarten again,
> they're also on this thread.

Your patches have made it into drm-next meanwhile, so they should
show up in linux-next through that tree at least. Except if that one
also has some trouble.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: revert "use the new cursor in the VM code"

2021-03-18 Thread Chen, Guchun
[AMD Public Use]

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: Christian König  
Sent: Thursday, March 18, 2021 7:53 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun 
Subject: [PATCH] drm/amdgpu: revert "use the new cursor in the VM code"

We are seeing VM page faults with this. Revert the change until the bugs are 
fixed.

This reverts commit e71af7b9807cc9ab2b40a7c02ff93c622786bd2a.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 55 +-
 1 file changed, 37 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index bc3951b71079..9268db1172bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -37,7 +37,6 @@
 #include "amdgpu_gmc.h"
 #include "amdgpu_xgmi.h"
 #include "amdgpu_dma_buf.h"
-#include "amdgpu_res_cursor.h"
 
 /**
  * DOC: GPUVM
@@ -1584,7 +1583,7 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
  * @last: last mapped entry
  * @flags: flags for the entries
  * @offset: offset into nodes and pages_addr
- * @res: ttm_resource to map
+ * @nodes: array of drm_mm_nodes with the MC addresses
  * @pages_addr: DMA addresses to use for mapping
  * @fence: optional resulting fence
  *
@@ -1599,13 +1598,13 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
   bool unlocked, struct dma_resv *resv,
   uint64_t start, uint64_t last,
   uint64_t flags, uint64_t offset,
-  struct ttm_resource *res,
+  struct drm_mm_node *nodes,
   dma_addr_t *pages_addr,
   struct dma_fence **fence)
 {
struct amdgpu_vm_update_params params;
-   struct amdgpu_res_cursor cursor;
enum amdgpu_sync_mode sync_mode;
+   uint64_t pfn;
int r;
 
	memset(&params, 0, sizeof(params));
@@ -1623,6 +1622,14 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
else
sync_mode = AMDGPU_SYNC_EXPLICIT;
 
+   pfn = offset >> PAGE_SHIFT;
+   if (nodes) {
+   while (pfn >= nodes->size) {
+   pfn -= nodes->size;
+   ++nodes;
+   }
+   }
+
amdgpu_vm_eviction_lock(vm);
if (vm->evicting) {
r = -EBUSY;
@@ -1641,17 +1648,23 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
if (r)
goto error_unlock;
 
-   amdgpu_res_first(res, offset, (last - start + 1) * AMDGPU_GPU_PAGE_SIZE,
-&cursor);
-   while (cursor.remaining) {
+   do {
uint64_t tmp, num_entries, addr;
 
-   num_entries = cursor.size >> AMDGPU_GPU_PAGE_SHIFT;
+
+   num_entries = last - start + 1;
+   if (nodes) {
+   addr = nodes->start << PAGE_SHIFT;
+   num_entries = min((nodes->size - pfn) *
+   AMDGPU_GPU_PAGES_IN_CPU_PAGE, num_entries);
+   } else {
+   addr = 0;
+   }
+
if (pages_addr) {
bool contiguous = true;
 
if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
-   uint64_t pfn = cursor.start >> PAGE_SHIFT;
uint64_t count;
 
contiguous = pages_addr[pfn + 1] == @@ -1671,18 
+1684,16 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
}
 
if (!contiguous) {
-   addr = cursor.start;
+   addr = pfn << PAGE_SHIFT;
params.pages_addr = pages_addr;
} else {
-   addr = pages_addr[cursor.start >> PAGE_SHIFT];
+   addr = pages_addr[pfn];
params.pages_addr = NULL;
}
 
} else if (flags & (AMDGPU_PTE_VALID | AMDGPU_PTE_PRT)) {
-   addr = bo_adev->vm_manager.vram_base_offset +
-   cursor.start;
-   } else {
-   addr = 0;
+   addr += bo_adev->vm_manager.vram_base_offset;
+   addr += pfn << PAGE_SHIFT;
}
 
tmp = start + num_entries;
@@ -1690,9 +1701,14 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
if (r)
goto error_unlock;
 
-   amdgpu_res_next(&cursor, num_entries * AMDGPU_GPU_PAGE_SIZE);
+   pfn += num_entries 

[PATCH] drm/amdgpu: revert "reserve backup pages for bad page retirment"

2021-03-18 Thread Christian König
As noted during the review this approach doesn't make sense at all.

We should not apply any limitation inside the kernel on how much VRAM 
applications can use.

If an application or end user wants to reserve a certain amount of VRAM for 
bad page handling, we should do this in the upper layer.

This reverts commit 6ff5de4f1cfc09308d060a7085c1664a3b37f118.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 29 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h  |  4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 96 +---
 4 files changed, 15 insertions(+), 118 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d6a9787e5597..2e7b5d922f31 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -180,7 +180,7 @@ struct amdgpu_mgpu_info mgpu_info = {
 };
 int amdgpu_ras_enable = -1;
 uint amdgpu_ras_mask = 0xffffffff;
-int amdgpu_bad_page_threshold = 100;
+int amdgpu_bad_page_threshold = -1;
 struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
.timeout_fatal_disable = false,
.period = 0x23, /* default to max. timeout = 1 << 0x23 cycles */
@@ -854,7 +854,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 
0444);
  * faulty pages by ECC exceed threshold value and leave it for user's further
  * check.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto, 0 = 
disable bad page retirement, 100 = default value");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default 
value), 0 = disable bad page retirement)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup 
(8 if set to greater than 8 or less than 0, only affect gfx 8+)");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a90bf33358d3..0e16683876aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1790,14 +1790,13 @@ static bool amdgpu_ras_check_bad_page(struct 
amdgpu_device *adev,
return ret;
 }
 
-static uint32_t
-amdgpu_ras_calculate_badpags_threshold(struct amdgpu_device *adev)
+static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
+   uint32_t max_length)
 {
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
int tmp_threshold = amdgpu_bad_page_threshold;
u64 val;
-   uint32_t max_length = 0;
 
-   max_length = amdgpu_ras_eeprom_get_record_max_length();
/*
 * Justification of value bad_page_cnt_threshold in ras structure
 *
@@ -1823,18 +1822,20 @@ amdgpu_ras_calculate_badpags_threshold(struct 
amdgpu_device *adev)
tmp_threshold = max_length;
 
if (tmp_threshold == -1) {
-   val = adev->gmc.real_vram_size;
+   val = adev->gmc.mc_vram_size;
do_div(val, RAS_BAD_PAGE_RATE);
-   tmp_threshold = min(lower_32_bits(val), max_length);
+   con->bad_page_cnt_threshold = min(lower_32_bits(val),
+   max_length);
+   } else {
+   con->bad_page_cnt_threshold = tmp_threshold;
}
-
-   return tmp_threshold;
 }
 
 int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 {
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
struct ras_err_handler_data **data;
+   uint32_t max_eeprom_records_len = 0;
bool exc_err_limit = false;
int ret;
 
@@ -1854,16 +1855,8 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
	atomic_set(&con->in_recovery, 0);
con->adev = adev;
 
-   if (!con->bad_page_cnt_threshold) {
-   con->bad_page_cnt_threshold =
-   amdgpu_ras_calculate_badpags_threshold(adev);
-
-   ret = amdgpu_vram_mgr_reserve_backup_pages(
-   ttm_manager_type(&adev->mman.bdev, TTM_PL_VRAM),
-   con->bad_page_cnt_threshold);
-   if (ret)
-   goto out;
-   }
+   max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
+   amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
 
/* Todo: During test the SMU might fail to read the eeprom through I2C
 * when the GPU is pending on XGMI reset during probe time
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 2df88978ccba..dec0db8b0b13 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -48,8 +48,6 @@ struct amdgpu_vram_mgr {
spinlock_t lock;
struct list_head reservations_pending;
struct list_head reserved_pages;
-   struct list_head 

Re: [PATCH 2/2 V2] platform/x86: force LPS0 functions for AMD

2021-03-18 Thread Alex Deucher
Let's hold off on these patches for the time being.  At least one of
them seems to cause problems on another laptop.

Thanks,

Alex

On Wed, Mar 17, 2021 at 10:39 AM Alex Deucher  wrote:
>
> ACPI_LPS0_ENTRY_AMD/ACPI_LPS0_EXIT_AMD are supposedly not
> required for AMD platforms, and on some platforms they are
> not even listed in the function mask but at least some HP
> laptops seem to require it to properly support s0ix.
>
> Based on a patch from Marcin Bachry .
>
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1230
> Signed-off-by: Alex Deucher 
> Cc: Marcin Bachry 
> ---
>
> V2: rework the patch to just fix up the specific problematic
> case by setting the function flags that are missing.
>
>  drivers/acpi/x86/s2idle.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/acpi/x86/s2idle.c b/drivers/acpi/x86/s2idle.c
> index 2d7ddb8a8cb6..482e6b23b21a 100644
> --- a/drivers/acpi/x86/s2idle.c
> +++ b/drivers/acpi/x86/s2idle.c
> @@ -368,6 +368,13 @@ static int lps0_device_attach(struct acpi_device *adev,
>
> ACPI_FREE(out_obj);
>
> +   /*
> +* Some HP laptops require ACPI_LPS0_ENTRY_AMD/ACPI_LPS0_EXIT_AMD for 
> proper
> +* S0ix, but don't set the function mask correctly.  Fix that up here.
> +*/
> +   if (acpi_s2idle_vendor_amd())
> +   lps0_dsm_func_mask |= (1 << ACPI_LPS0_ENTRY_AMD) | (1 << 
> ACPI_LPS0_EXIT_AMD);
> +
> acpi_handle_debug(adev->handle, "_DSM function mask: 0x%x\n",
>   lps0_dsm_func_mask);
>
> --
> 2.30.2
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Amdgpu kernel oops and freezing on system suspend and hibernate

2021-03-18 Thread Harvey

Alex,

I waited for kernel 5.11.7 to hit our repos yesterday evening and tested 
again:


1. The suspend issue is gone - suspend and resume now work as expected.

2. System hibernation seems to be a different beast - still freezing

When invoking 'systemctl hibernate' the system does not power off (I 
waited for 5 minutes) and I have to hard reset it to start up again. It 
then tries to resume from the swap partition and comes back up with only 
the external monitor connected to the HDMI port showing a picture and 
the builtin screen of the laptop staying black. Nevertheless the system 
is frozen and not responding to either mouse or keyboard. After 
another hard reset I managed to get the following log from journalctl 
(only cut the relevant part):


Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3466] 
manager: sleep: sleep requested (sleeping: no  enabled: yes)
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3473] 
device (wlp4s0): state change: disconnected -> unmanaged (reason 
'sleeping', sys-iface-state: 'managed')
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3508] 
device (wlp4s0): set-hw-addr: reset MAC address to 14:F6:D8:18:8C:EC 
(unmanage)
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3575] 
device (p2p-dev-wlp4s0): state change: disconnected -> unmanaged (reason 
'sleeping', sys-iface-state: 'managed')
Mär 18 12:51:11 obelix NetworkManager[866]:   [1616068271.3580] 
manager: NetworkManager state is now ASLEEP
Mär 18 12:51:11 obelix wpa_supplicant[954]: nl80211: deinit 
ifname=p2p-dev-wlp4s0 disabled_11b_rates=0
Mär 18 12:51:11 obelix wpa_supplicant[954]: nl80211: deinit 
ifname=wlp4s0 disabled_11b_rates=0

Mär 18 12:51:12 obelix gsd-media-keys[1691]: Unable to get default sink
Mär 18 12:51:15 obelix gnome-shell[1496]: 
../glib/gobject/gsignal.c:2732: instance '0x560b86c67b50' has no handler 
with id '15070'
Mär 18 12:51:16 obelix gsd-usb-protect[1724]: Error calling USBGuard 
DBus to change the protection after a screensaver event: 
GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name 
org.usbguard1 was not provided by any .service files

Mär 18 12:51:16 obelix systemd[1]: Reached target Sleep.
Mär 18 12:51:16 obelix systemd[1]: Starting Suspend...
Mär 18 12:51:16 obelix systemd-sleep[2000]: Suspending system...
Mär 18 12:51:16 obelix kernel: PM: suspend entry (deep)
Mär 18 12:51:16 obelix kernel: Filesystems sync: 0.005 seconds
Mär 18 12:51:36 obelix kernel: Freezing user space processes ... 
(elapsed 0.002 seconds) done.

Mär 18 12:51:36 obelix kernel: OOM killer disabled.
Mär 18 12:51:36 obelix kernel: Freezing remaining freezable tasks ... 
(elapsed 0.001 seconds) done.
Mär 18 12:51:36 obelix kernel: printk: Suspending console(s) (use 
no_console_suspend to debug)

Mär 18 12:51:36 obelix kernel: [drm] free PSP TMR buffer
Mär 18 12:51:36 obelix kernel: [drm] free PSP TMR buffer
Mär 18 12:51:36 obelix kernel: ACPI: EC: interrupt blocked
Mär 18 12:51:36 obelix kernel: ACPI: Preparing to enter system sleep 
state S3

Mär 18 12:51:36 obelix kernel: ACPI: EC: event blocked
Mär 18 12:51:36 obelix kernel: ACPI: EC: EC stopped
Mär 18 12:51:36 obelix kernel: PM: Saving platform NVS memory
Mär 18 12:51:36 obelix kernel: Disabling non-boot CPUs ...
Mär 18 12:51:36 obelix kernel: IRQ 86: no longer affine to CPU1
Mär 18 12:51:36 obelix kernel: smpboot: CPU 1 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 87: no longer affine to CPU2
Mär 18 12:51:36 obelix kernel: smpboot: CPU 2 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 88: no longer affine to CPU3
Mär 18 12:51:36 obelix kernel: smpboot: CPU 3 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 89: no longer affine to CPU4
Mär 18 12:51:36 obelix kernel: smpboot: CPU 4 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 90: no longer affine to CPU5
Mär 18 12:51:36 obelix kernel: smpboot: CPU 5 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 91: no longer affine to CPU6
Mär 18 12:51:36 obelix kernel: smpboot: CPU 6 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 92: no longer affine to CPU7
Mär 18 12:51:36 obelix kernel: smpboot: CPU 7 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 93: no longer affine to CPU8
Mär 18 12:51:36 obelix kernel: smpboot: CPU 8 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 94: no longer affine to CPU9
Mär 18 12:51:36 obelix kernel: smpboot: CPU 9 is now offline
Mär 18 12:51:36 obelix kernel: IRQ 95: no longer affine to CPU10
Mär 18 12:51:36 obelix kernel: smpboot: CPU 10 is now offline
Mär 18 12:51:36 obelix kernel: smpboot: CPU 11 is now offline
Mär 18 12:51:36 obelix kernel: smpboot: CPU 12 is now offline
Mär 18 12:51:36 obelix kernel: smpboot: CPU 13 is now offline
Mär 18 12:51:36 obelix kernel: smpboot: CPU 14 is now offline
Mär 18 12:51:36 obelix kernel: smpboot: CPU 15 is now offline
Mär 18 12:51:36 obelix kernel: ACPI: Low-level resume complete
Mär 18 12:51:36 obelix kernel: ACPI: EC: EC started
Mär 18 12:51:36 obelix kernel: PM: Restoring 

[PATCH] drm/amdgpu: revert "use the new cursor in the VM code"

2021-03-18 Thread Christian König
We are seeing VM page faults with this. Revert the change until the bugs
are fixed.

This reverts commit e71af7b9807cc9ab2b40a7c02ff93c622786bd2a.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 55 +-
 1 file changed, 37 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index bc3951b71079..9268db1172bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -37,7 +37,6 @@
 #include "amdgpu_gmc.h"
 #include "amdgpu_xgmi.h"
 #include "amdgpu_dma_buf.h"
-#include "amdgpu_res_cursor.h"
 
 /**
  * DOC: GPUVM
@@ -1584,7 +1583,7 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
  * @last: last mapped entry
  * @flags: flags for the entries
  * @offset: offset into nodes and pages_addr
- * @res: ttm_resource to map
+ * @nodes: array of drm_mm_nodes with the MC addresses
  * @pages_addr: DMA addresses to use for mapping
  * @fence: optional resulting fence
  *
@@ -1599,13 +1598,13 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
   bool unlocked, struct dma_resv *resv,
   uint64_t start, uint64_t last,
   uint64_t flags, uint64_t offset,
-  struct ttm_resource *res,
+  struct drm_mm_node *nodes,
   dma_addr_t *pages_addr,
   struct dma_fence **fence)
 {
struct amdgpu_vm_update_params params;
-   struct amdgpu_res_cursor cursor;
enum amdgpu_sync_mode sync_mode;
+   uint64_t pfn;
int r;
 
	memset(&params, 0, sizeof(params));
@@ -1623,6 +1622,14 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
else
sync_mode = AMDGPU_SYNC_EXPLICIT;
 
+   pfn = offset >> PAGE_SHIFT;
+   if (nodes) {
+   while (pfn >= nodes->size) {
+   pfn -= nodes->size;
+   ++nodes;
+   }
+   }
+
amdgpu_vm_eviction_lock(vm);
if (vm->evicting) {
r = -EBUSY;
@@ -1641,17 +1648,23 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
if (r)
goto error_unlock;
 
-   amdgpu_res_first(res, offset, (last - start + 1) * AMDGPU_GPU_PAGE_SIZE,
-&cursor);
-   while (cursor.remaining) {
+   do {
uint64_t tmp, num_entries, addr;
 
-   num_entries = cursor.size >> AMDGPU_GPU_PAGE_SHIFT;
+
+   num_entries = last - start + 1;
+   if (nodes) {
+   addr = nodes->start << PAGE_SHIFT;
+   num_entries = min((nodes->size - pfn) *
+   AMDGPU_GPU_PAGES_IN_CPU_PAGE, num_entries);
+   } else {
+   addr = 0;
+   }
+
if (pages_addr) {
bool contiguous = true;
 
if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
-   uint64_t pfn = cursor.start >> PAGE_SHIFT;
uint64_t count;
 
contiguous = pages_addr[pfn + 1] ==
@@ -1671,18 +1684,16 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
}
 
if (!contiguous) {
-   addr = cursor.start;
+   addr = pfn << PAGE_SHIFT;
params.pages_addr = pages_addr;
} else {
-   addr = pages_addr[cursor.start >> PAGE_SHIFT];
+   addr = pages_addr[pfn];
params.pages_addr = NULL;
}
 
} else if (flags & (AMDGPU_PTE_VALID | AMDGPU_PTE_PRT)) {
-   addr = bo_adev->vm_manager.vram_base_offset +
-   cursor.start;
-   } else {
-   addr = 0;
+   addr += bo_adev->vm_manager.vram_base_offset;
+   addr += pfn << PAGE_SHIFT;
}
 
tmp = start + num_entries;
@@ -1690,9 +1701,14 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
if (r)
goto error_unlock;
 
-   amdgpu_res_next(&cursor, num_entries * AMDGPU_GPU_PAGE_SIZE);
+   pfn += num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
+   if (nodes && nodes->size == pfn) {
+   pfn = 0;
+   ++nodes;
+   }
start = tmp;
-   };
+
+   } while (unlikely(start != last + 1));
 
r = 

Re: [PATCH] drm/amdgpu: Fix the page fault issue in amdgpu_irq_fini

2021-03-18 Thread Christian König

On 18.03.21 at 12:48, Emily Deng wrote:

For some source, it will be shared by some client ID and source ID.
To fix the page fault issue, set all those to null.

Signed-off-by: Emily Deng 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 16 +---
  1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index af026109421a..623b1ac6231d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -359,7 +359,7 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
   */
  void amdgpu_irq_fini(struct amdgpu_device *adev)
  {
-   unsigned i, j;
+   unsigned i, j, m, n;
  
  	if (adev->irq.installed) {

drm_irq_uninstall(adev_to_drm(adev));
@@ -380,12 +380,22 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
if (!src)
continue;
  
-			kfree(src->enabled_types);

+   if (src->enabled_types)
+   kfree(src->enabled_types);


A NULL check before kfree() is unnecessary and will be complained about 
by the static checkers.

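For reference, kfree(NULL) is defined to be a no-op in the kernel, so 
the unpatched code is already safe. A minimal sketch of the preferred 
form:

	/* kfree(NULL) is a no-op, no NULL check needed */
	kfree(src->enabled_types);
	src->enabled_types = NULL;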


+
src->enabled_types = NULL;
+


Unrelated white space change.


if (src->data) {
kfree(src->data);
kfree(src);
-   adev->irq.client[i].sources[j] = NULL;
+   }
+
+   for (m = 0; m < AMDGPU_IRQ_CLIENTID_MAX; ++m) {
+   if (!adev->irq.client[m].sources)
+   continue;
+   for (n = 0; n < AMDGPU_MAX_IRQ_SRC_ID; ++n)
+   if (adev->irq.client[m].sources[n] == 
src)
+   adev->irq.client[m].sources[n] 
= NULL;


Hui what? The memory you set to NULL here is freed on the line below.

Accessing it after that would be illegal, so why do you want to set it 
to NULL?


Christian.


}
}
kfree(adev->irq.client[i].sources);


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: Fix the page fault issue in amdgpu_irq_fini

2021-03-18 Thread Emily Deng
For some source, it will be shared by some client ID and source ID.
To fix the page fault issue, set all those to null.

Signed-off-by: Emily Deng 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index af026109421a..623b1ac6231d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -359,7 +359,7 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
  */
 void amdgpu_irq_fini(struct amdgpu_device *adev)
 {
-   unsigned i, j;
+   unsigned i, j, m, n;
 
if (adev->irq.installed) {
drm_irq_uninstall(adev_to_drm(adev));
@@ -380,12 +380,22 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
if (!src)
continue;
 
-   kfree(src->enabled_types);
+   if (src->enabled_types)
+   kfree(src->enabled_types);
+
src->enabled_types = NULL;
+
if (src->data) {
kfree(src->data);
kfree(src);
-   adev->irq.client[i].sources[j] = NULL;
+   }
+
+   for (m = 0; m < AMDGPU_IRQ_CLIENTID_MAX; ++m) {
+   if (!adev->irq.client[m].sources)
+   continue;
+   for (n = 0; n < AMDGPU_MAX_IRQ_SRC_ID; ++n)
+   if (adev->irq.client[m].sources[n] == 
src)
+   adev->irq.client[m].sources[n] 
= NULL;
}
}
kfree(adev->irq.client[i].sources);
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-18 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey

Let me summarize the background of this patch:

In the TDR resubmit step, amdgpu_device_recheck_guilty_jobs,
we submit the first job of each ring and re-check its guilty status.
At that point, we have to make sure each job is in the mirror list (or 
has already been re-inserted).

But we found the current code never re-inserts the job into the mirror list 
in the 2nd, 3rd, ... job_timeout threads (the bailing TDR threads).
Not only does this leak the bailing jobs; more importantly, the 1st TDR 
thread can then never iterate over a bailing job and set its guilty status 
to a correct value.

Therefore, we have to re-insert the job (or not delete its node at all) for a 
bailing job.

For the above V3 patch, the race condition in my mind is:
we cannot make sure all bailing jobs are finished before we do 
amdgpu_device_recheck_guilty_jobs.

Based on this insight, I think we have two options to solve this issue:

  1.  Skip deleting the node in TDR thread 2, 3, 4, ... (using a mutex or an 
atomic variable).
  2.  Re-insert the bailing job, and meanwhile use a semaphore in each TDR 
thread to keep the sequence as expected and to ensure each job is in the 
mirror list when we do the resubmit step.

For Option 1 the logic is simpler and we need only one global atomic variable.
What do you think about this plan?

Option1 should look like the following logic:


+static atomic_t in_reset; //a global atomic var for synchronization
static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb);
 /**
@@ -295,6 +296,12 @@ static void drm_sched_job_timedout(struct work_struct 
*work)
 * drm_sched_cleanup_jobs. It will be reinserted back after 
sched->thread
 * is parked at which point it's safe.
 */
+   if (atomic_cmpxchg(&in_reset, 0, 1) != 0) {  //skip delete node 
if it's TDR thread 2, 3, 4, ...
+   spin_unlock(&sched->job_list_lock);
+   drm_sched_start_timeout(sched);
+   return;
+   }
+
list_del_init(&job->node);
spin_unlock(&sched->job_list_lock);
@@ -320,6 +327,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
spin_lock(&sched->job_list_lock);
drm_sched_start_timeout(sched);
spin_unlock(&sched->job_list_lock);
+   atomic_set(&in_reset, 0); //reset in_reset when the first thread 
finished tdr
}

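For clarity on the gate semantics: atomic_cmpxchg(&in_reset, 0, 1) 
returns the old value, so only the first timed-out thread sees 0 and 
goes on to delete the node; every later (bailing) thread sees 1 and 
skips. A standalone sketch of the same pattern:

	static atomic_t in_reset = ATOMIC_INIT(0);

	static void recovery_entry(void)
	{
		if (atomic_cmpxchg(&in_reset, 0, 1) != 0)
			return;		/* another TDR thread owns recovery */
		/* ... single-threaded recovery work ... */
		atomic_set(&in_reset, 0);	/* reopen the gate */
	}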

Thanks,
Jack
From: amd-gfx  On Behalf Of Zhang, Jack 
(Jian)
Sent: Wednesday, March 17, 2021 11:11 PM
To: Christian König ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Liu, Monk ; Deng, Emily 
; Rob Herring ; Tomeu Vizoso 
; Steven Price ; Grodzovsky, 
Andrey 
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak


[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Good catch, I will explore this corner case and give feedback soon~
Best,
Jack


From: Grodzovsky, Andrey 
Sent: Wednesday, March 17, 2021 10:50:59 PM
To: Christian König ; Zhang, Jack (Jian) ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; 
Koenig, Christian ; Liu, Monk ; Deng, Emily ; 
Rob Herring ; Tomeu Vizoso ; Steven Price 
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

I actually have a race condition concern here - see below -

On 2021-03-17 3:43 a.m., Christian König wrote:
> I was hoping Andrey would take a look since I'm really busy with other
> work right now.
>
> Regards,
> Christian.
>
> On 17.03.21 at 07:46, Zhang, Jack (Jian) wrote:
>> Hi, Andrey/Christian and Team,
>>
>> I didn't receive the reviewers' feedback from the panfrost driver
>> maintainers for several days.
>> Since this patch is urgent for my current working project,
>> would you please help to give some review ideas?
>>
>> Many Thanks,
>> Jack
>> -Original Message-
>> From: Zhang, Jack (Jian)
>> Sent: Tuesday, March 16, 2021 3:20 PM
>> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> Koenig, Christian ; Grodzovsky, Andrey ; Liu, Monk ;
>> Deng, Emily ; Rob Herring ; Tomeu Vizoso ;
>> Steven Price 
>> 

Re: [RFC PATCH 8/9] drm/gem: Associate GEM objects with drm cgroup

2021-03-18 Thread Daniel Vetter
On Sat, Mar 6, 2021 at 1:44 AM Brian Welty  wrote:
>
>
> On 2/11/2021 7:34 AM, Daniel Vetter wrote:
> > On Wed, Feb 10, 2021 at 02:00:57PM -0800, Brian Welty wrote:
> >>
> >> On 2/9/2021 2:54 AM, Daniel Vetter wrote:
> >>> On Tue, Jan 26, 2021 at 01:46:25PM -0800, Brian Welty wrote:
>  This patch adds tracking of which cgroup to make charges against for a
>  given GEM object.  We associate the current task's cgroup with GEM 
>  objects
>  as they are created.  First user of this is for charging DRM cgroup for
>  device memory allocations.  The intended behavior is for device drivers 
>  to
>  make the cgroup charging calls at the time that backing store is 
>  allocated
>  or deallocated for the object.
> 
>  Exported functions are provided for charging memory allocations for a
>  GEM object to DRM cgroup. To aid in debugging, we store how many bytes
>  have been charged inside the GEM object.  Add helpers for setting and
>  clearing the object's associated cgroup which will check that charges are
>  not being leaked.
> 
>  For shared objects, this may make the charge against a cgroup that is
>  potentially not the same cgroup as the process using the memory.  Based
>  on the memory cgroup's discussion of "memory ownership", this seems
>  acceptable [1].
> 
>  [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt, "Memory 
>  Ownership"
> 
>  Signed-off-by: Brian Welty 
> >>>
> >>> Since for now we only have the generic gpu/xpu/bikeshed.memory bucket that
> >>> counts everything, why don't we also charge in these gem functions?
> >>
> >> I'm not sure what you mean exactly.  You want to merge/move the charging 
> >> logic
> >> proposed in patch #5 (drm_cgroup_try_charge in kernel/cgroup/drm.c) into
> >> drm_gem_object_charge_mem() ?
> >>
> >> Or reading below, I think you are okay keeping the logic separated as is, 
> >> but
> >> you want much of the code in kernel/cgroup/drm.c moved to 
> >> drivers/gpu/cgroup ?
> >> Yes, I see that should allow to reduce number of exported functions.
> >
> > Both. I mean we'd need to look at the code again when it's shuffled, but
> > I'd say:
> >
> > - cgroup code and the charging for general gpu memory moves to
> >   drivers/gpu/cgroup, so dma-buf heaps can use it too.
> >
> > - the charging for gem buffers moves into core gem code, so it happens for
> >   all gpu drivers and all gem buffer allocations.
>
> Daniel, I'm not sure we're in sync on what 'charging for general gpu memory'
> means.  Thus far, I have been proposing to charge/uncharge when backing store 
> is
> allocated/freed.  And thus, this would be done in DRM driver (so then also in
> the dma-buf exporter).
> I can't see how we'd hoist this part into drm gem code.
> The memory limit in this series is for VRAM usage/limit not GEM buffers...

Yes this would be at gem buffer creation time. And just to get cgroups
for drm up

> Unless you are talking about charging for GEM buffer creation?  But this is
> more of a 'soft resource' more along lines of Kenny's earlier GEM buffer limit
> control.
> I raised issue with this then, and at the time, Tejun agreed we should keep to
> 'hard resource' controls, see [1] and [2].
>
> [1] https://lists.freedesktop.org/archives/dri-devel/2019-May/218071.html
> [2] https://lists.freedesktop.org/archives/dri-devel/2020-April/262141.html
>
> >
> > - this might or might not mean a bunch less exported stuff from the
> >   cgroups files (since you don't need separate steps for linking a gem
> >   object to a cgroup from the actual charging), and probably no exports
> >   anymore for drivers (since they charge nothing). That will change
> >   when we add charging for specific memory pools I guess, but we add that
> >   when we add tha functionality.
>
> ... so considering VRAM charging, then yes, we very much need to have exported
> functions for drivers to do the charging.
> But these can be exported from drm.ko (or new .ko?) instead of kernel.  Is
> that still preference?   Also, if number of exported functions is concern, we
> can replace some of it with use of function pointers.

So the reason I suggested we drop all this is because we won't charge
in drivers, we'll charge in ttm buffer management code. Which we'll
adopt for dg1 in upstream. But it will take some time.

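To make "charge in the ttm buffer management code" a bit more concrete, 
the shape would be roughly like below - hand-wavy sketch only, every 
helper name here is hypothetical:

	/* charge when backing store is allocated, uncharge again if the
	 * allocation fails, so individual drivers never see the charging
	 */
	static int ttm_backing_store_alloc_charged(struct drm_device *dev,
						   size_t size)
	{
		int ret;

		ret = drm_cgroup_try_charge(dev, size);	/* hypothetical */
		if (ret)
			return ret;	/* over the cgroup limit */

		ret = ttm_backing_store_alloc(dev, size); /* hypothetical */
		if (ret)
			drm_cgroup_uncharge(dev, size);	/* hypothetical */

		return ret;
	}
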
> So then returning to this comment of yours:
>
> > - cgroup code and the charging for general gpu memory moves to
> >   drivers/gpu/cgroup, so dma-buf heaps can use it too.
>
> If you agree that we are charging just at backing-store level, then I think
> logic belongs in drivers/gpu/drm/cgroup ??  As charging is done in DRM driver
> (also dma-buf exporter).  In other words, part of drm.
> If I understand, dma-buf heaps is exporter of system memory and doesn't
> need to charge against gpu controller??
> Will need some help to understand the dma-buf heap use case a bit more.

Well we also need to track system 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König

On 18.03.21 at 10:30, Li, Dennis wrote:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


I understood your concern. It is more complex for DRM GFX, therefore I 
will abandon adding lock protection for the DRM ioctls for now. Maybe we 
can try to add all kernel dma_fence waits to a list, and signal them all 
in the recovery thread. Do you have the same concern for the compute cases?




Yes, compute (KFD) is even harder to handle.

See, you can't signal the dma_fence we are waiting on. Waiting for a 
dma_fence also means you wait for the GPU reset to finish.


If we signaled the dma_fence during the GPU reset, we would run into 
memory corruption, because the hardware jobs running after the GPU 
reset would access memory which has already been freed.


>>> Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agreed. This approach escapes lockdep's monitoring. Its goal is to 
block other threads when the GPU recovery thread starts. But I couldn't 
find a better method to solve this problem. Do you have a suggestion?




Well, completely abandon those changes here.

What we need to do is to identify where hardware access happens and then 
insert taking the read side of the GPU reset lock so that we don't wait 
for a dma_fence or allocate memory, but still protect the hardware from 
concurrent access and reset.

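Roughly speaking, each hardware access path would then be bracketed 
like this (just a sketch of the idea, not a concrete patch):

	if (!down_read_trylock(&adev->reset_sem))
		return;		/* reset in progress, back off */

	/* register access / other hardware work goes here;
	 * no dma_fence waits and no memory allocation allowed
	 */

	up_read(&adev->reset_sem);

while the recovery thread holds the write side of adev->reset_sem for 
the duration of the actual reset.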

Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The only 
reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*From:* Li, Dennis 
*Sent:* Thursday, March 18, 2021 09:28
*To:* Koenig, Christian ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking 
*Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


>>> Those two steps need to be exchanged or otherwise it is possible 
that new delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If 
we exchange the two steps, it may introduce a deadlock. For example, a 
user thread holds the read lock and waits for a fence; if the recovery 
thread tries to take the write lock and then complete the fences, the 
recovery thread will be blocked forever.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


On 18.03.21 at 08:23, Dennis Li wrote:
> We have defined two variables, in_gpu_reset and reset_sem, in the adev 
object. The atomic variable in_gpu_reset is used to keep the recovery 
thread from re-entering and to make lower-level functions return earlier 
when recovery starts, but it cannot block the recovery thread when it 
accesses hardware. The r/w semaphore reset_sem is used to solve these 
synchronization issues between the recovery thread and other threads.

>
> The original solution locked register access in the lower-level 
functions, which introduces the following issues:

>
> 1) Many lower-level functions are used both in the recovery thread and 
elsewhere. Firstly we must harvest these functions, and it is easy to 
miss some. Secondly these functions need to select which lock (read lock 
or write lock) to use, according to the thread they are running in. If 
the thread context isn't considered, the added lock will easily introduce 
a deadlock. Besides that, most of the time developers simply forget to 
add locks for new functions.

>
> 2) Performance drop. The lower-level functions are called very frequently.
>
> 3) easily 

RE: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

2021-03-18 Thread Chen, Jiansong (Simon)
We still need to keep the "return 0"; otherwise it may trigger the warning 
"not all control paths return a value".
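
Presumably the common helper keeps the shape of the switch removed from 
smu_set_mp1_state() in the patch below - a sketch only, based purely on 
the removed code:

	int smu_cmn_set_mp1_state(struct smu_context *smu,
				  enum pp_mp1_state mp1_state)
	{
		enum smu_message_type msg;
		int ret;

		switch (mp1_state) {
		case PP_MP1_STATE_SHUTDOWN:
			msg = SMU_MSG_PrepareMp1ForShutdown;
			break;
		case PP_MP1_STATE_UNLOAD:
			msg = SMU_MSG_PrepareMp1ForUnload;
			break;
		case PP_MP1_STATE_RESET:
			msg = SMU_MSG_PrepareMp1ForReset;
			break;
		case PP_MP1_STATE_NONE:
		default:
			return 0;	/* the "return 0" that must stay */
		}

		ret = smu_send_smc_msg(smu, msg, NULL);
		if (ret)
			dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");

		return ret;
	}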

Regards,
Jiansong
-Original Message-
From: amd-gfx  On Behalf Of Chen, Guchun
Sent: Thursday, March 18, 2021 5:28 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Quan, Evan 
Subject: RE: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

[AMD Public Use]

One comment inline. Other than this, the patch is:

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Evan Quan
Sent: Thursday, March 18, 2021 5:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Quan, Evan 
Subject: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

The RLC was put into a wrong state on runtime suspend. Thus the RLC autoload 
will fail on the succeeding runtime resume. By adding an intermediate 
PPSMC_MSG_PrepareMp1ForUnload(some GC hard reset involved, designed for PnP), 
we can bring RLC back into the desired state.

V2: integrate INTERRUPTS_ENABLED flag clearing into current
mp1 state set routines

Change-Id: I4eb3d5d76068412a6ab228af7fe7f794e53c1eaa
Signed-off-by: Evan Quan 
Reviewed-by: Lijo Lazar 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c   |  9 --
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 28 +++
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  1 +
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 25 +
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 14 ++
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 14 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 28 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
 9 files changed, 102 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 809f4cb738f4..ab6f4059630c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2160,9 +2160,12 @@ static int psp_load_smu_fw(struct psp_context *psp)
if (!ucode->fw || amdgpu_sriov_vf(psp->adev))
return 0;
 
-
-   if (amdgpu_in_reset(adev) && ras && ras->supported &&
-   adev->asic_type == CHIP_ARCTURUS) {
+   if ((amdgpu_in_reset(adev) &&
+ras && ras->supported &&
+adev->asic_type == CHIP_ARCTURUS) ||
+(adev->in_runpm &&
+ adev->asic_type >= CHIP_NAVI10 &&
+ adev->asic_type <= CHIP_NAVI12)) {
ret = amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_UNLOAD);
if (ret) {
DRM_WARN("Failed to set MP1 state prepare for 
reload\n"); diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 629a988b069d..21c3c149614c 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -836,6 +836,13 @@ struct pptable_funcs {
 */
int (*check_fw_status)(struct smu_context *smu);
 
+   /**
+* @set_mp1_state: put SMU into a correct state for coming
+* resume from runpm or gpu reset.
+*/
+   int (*set_mp1_state)(struct smu_context *smu,
+enum pp_mp1_state mp1_state);
+
/**
 * @setup_pptable: Initialize the power play table and populate it with
 * default values.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index bae9016fedea..470ca4e5d4d7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1927,36 +1927,16 @@ int smu_set_mp1_state(void *handle,
  enum pp_mp1_state mp1_state)
 {
struct smu_context *smu = handle;
-   uint16_t msg;
-   int ret;
+   int ret = 0;
 
if (!smu->pm_enabled)
return -EOPNOTSUPP;
 
	mutex_lock(&smu->mutex);
 
-   switch (mp1_state) {
-   case PP_MP1_STATE_SHUTDOWN:
-   msg = SMU_MSG_PrepareMp1ForShutdown;
-   break;
-   case PP_MP1_STATE_UNLOAD:
-   msg = SMU_MSG_PrepareMp1ForUnload;
-   break;
-   case PP_MP1_STATE_RESET:
-   msg = SMU_MSG_PrepareMp1ForReset;
-   break;
-   case PP_MP1_STATE_NONE:
-   default:
-   mutex_unlock(&smu->mutex);
-   return 0;
-   }
-
-   ret = smu_send_smc_msg(smu, msg, NULL);
-   /* some asics may not support those messages */
-   if (ret == -EINVAL)
-   ret = 0;
-   if (ret)
-   dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");
+   if (smu->ppt_funcs &&
+   smu->ppt_funcs->set_mp1_state)
+   ret = smu->ppt_funcs->set_mp1_state(smu, mp1_state);
 
	mutex_unlock(&smu->mutex);
 
diff --git 

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
>>> The GPU reset doesn't complete the fences we wait for. It only completes 
>>> the hardware fences as part of the reset.
>>> So waiting for a fence while holding the reset lock is illegal and needs to 
>>> be avoided.
I understood your concern. It is more complex for DRM GFX, therefore I will 
abandon adding lock protection for the DRM ioctls for now. Maybe we can try to 
add all kernel dma_fence waits to a list, and signal them all in the recovery 
thread. Do you have the same concern for the compute cases?

>>> Lockdep also complains about this when it is used correctly. The only 
>>> reason it doesn't complain here is because you use an atomic+wait_event 
>>> instead of a locking primitive.
Agreed. This approach escapes lockdep's monitoring. Its goal is to block 
other threads when the GPU recovery thread starts. But I couldn't find a 
better method to solve this problem. Do you have a suggestion?

Best Regards
Dennis Li

From: Koenig, Christian 
Sent: Thursday, March 18, 2021 4:59 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only completes the 
hardware fences as part of the reset.

So waiting for a fence while holding the reset lock is illegal and needs to be 
avoided.

Lockdep also complains about this when it is used correctly. The only reason it 
doesn't complain here is because you use an atomic+wait_event instead of a 
locking primitive.

Regards,
Christian.


From: Li, Dennis 
Sent: Thursday, March 18, 2021 09:28
To: Koenig, Christian ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix ; Zhang, Hawking 
Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

>>> Those two steps need to be exchanged or otherwise it is possible that new 
>>> delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If we 
exchange the two steps, it may introduce a deadlock. For example, a user 
thread holds the read lock and waits for a fence; if the recovery thread 
tries to take the write lock and then complete the fences, the recovery 
thread will be blocked forever.

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix ; Zhang, Hawking 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

On 18.03.21 at 08:23, Dennis Li wrote:
> We have defined two variables, in_gpu_reset and reset_sem, in the adev 
> object. The atomic variable in_gpu_reset is used to keep the recovery thread 
> from re-entering and to make lower-level functions return earlier when 
> recovery starts, but it cannot block the recovery thread when it accesses 
> hardware. The r/w semaphore reset_sem is used to solve these synchronization 
> issues between the recovery thread and other threads.
>
> The original solution locked register access in the lower-level functions, 
> which introduces the following issues:
>
> 1) Many lower-level functions are used both in the recovery thread and 
> elsewhere. Firstly we must harvest these functions, and it is easy to miss 
> some. Secondly these functions need to select which lock (read lock or write 
> lock) to use, according to the thread they are running in. If the thread 
> context isn't considered, the added lock will easily introduce a deadlock. 
> Besides that, most of the time developers simply forget to add locks for new 
> functions.
>
> 2) Performance drop. The lower-level functions are called very frequently.
>
> 3) Easily introduces false positive lockdep complaints, because the write 
> lock has a big range in the recovery thread, while the lower-level functions 
> holding the read lock may be protected by other locks in other threads.
>
> Therefore the new solution tries to add lock protection for the kfd ioctls. 
> Its goal is that no threads except for the recovery thread or its children 
> (for xgmi) access hardware while doing GPU reset and resume. So the recovery 
> thread is refined as follows:
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1). if it fails, the system already has a recovery thread running, and 
> the current thread exits directly;
> 2). if it succeeds, enter the recovery thread;
>
> Step 1: cancel all delayed works, stop the drm scheduler, complete all 
> unreceived fences and so on. It tries to 

RE: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

2021-03-18 Thread Chen, Guchun
[AMD Public Use]

One comment inline. Other than this, the patch is:

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Evan Quan
Sent: Thursday, March 18, 2021 5:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Quan, Evan 
Subject: [PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

The RLC was put into a wrong state on runtime suspend. Thus the RLC autoload 
will fail on the succeeding runtime resume. By adding an intermediate 
PPSMC_MSG_PrepareMp1ForUnload(some GC hard reset involved, designed for PnP), 
we can bring RLC back into the desired state.

V2: integrate INTERRUPTS_ENABLED flag clearing into current
mp1 state set routines

Change-Id: I4eb3d5d76068412a6ab228af7fe7f794e53c1eaa
Signed-off-by: Evan Quan 
Reviewed-by: Lijo Lazar 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c   |  9 --
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 28 +++
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  1 +
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 25 +
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 14 ++
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 14 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 28 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
 9 files changed, 102 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 809f4cb738f4..ab6f4059630c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2160,9 +2160,12 @@ static int psp_load_smu_fw(struct psp_context *psp)
if (!ucode->fw || amdgpu_sriov_vf(psp->adev))
return 0;
 
-
-   if (amdgpu_in_reset(adev) && ras && ras->supported &&
-   adev->asic_type == CHIP_ARCTURUS) {
+   if ((amdgpu_in_reset(adev) &&
+ras && ras->supported &&
+adev->asic_type == CHIP_ARCTURUS) ||
+(adev->in_runpm &&
+ adev->asic_type >= CHIP_NAVI10 &&
+ adev->asic_type <= CHIP_NAVI12)) {
ret = amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_UNLOAD);
if (ret) {
DRM_WARN("Failed to set MP1 state prepare for 
reload\n"); diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 629a988b069d..21c3c149614c 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -836,6 +836,13 @@ struct pptable_funcs {
 */
int (*check_fw_status)(struct smu_context *smu);
 
+   /**
+* @set_mp1_state: put SMU into a correct state for coming
+* resume from runpm or gpu reset.
+*/
+   int (*set_mp1_state)(struct smu_context *smu,
+enum pp_mp1_state mp1_state);
+
/**
 * @setup_pptable: Initialize the power play table and populate it with
 * default values.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index bae9016fedea..470ca4e5d4d7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1927,36 +1927,16 @@ int smu_set_mp1_state(void *handle,
  enum pp_mp1_state mp1_state)
 {
struct smu_context *smu = handle;
-   uint16_t msg;
-   int ret;
+   int ret = 0;
 
if (!smu->pm_enabled)
return -EOPNOTSUPP;
 
	mutex_lock(&smu->mutex);
 
-   switch (mp1_state) {
-   case PP_MP1_STATE_SHUTDOWN:
-   msg = SMU_MSG_PrepareMp1ForShutdown;
-   break;
-   case PP_MP1_STATE_UNLOAD:
-   msg = SMU_MSG_PrepareMp1ForUnload;
-   break;
-   case PP_MP1_STATE_RESET:
-   msg = SMU_MSG_PrepareMp1ForReset;
-   break;
-   case PP_MP1_STATE_NONE:
-   default:
-   mutex_unlock(&smu->mutex);
-   return 0;
-   }
-
-   ret = smu_send_smc_msg(smu, msg, NULL);
-   /* some asics may not support those messages */
-   if (ret == -EINVAL)
-   ret = 0;
-   if (ret)
-   dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");
+   if (smu->ppt_funcs &&
+   smu->ppt_funcs->set_mp1_state)
+   ret = smu->ppt_funcs->set_mp1_state(smu, mp1_state);
 
mutex_unlock(&smu->mutex);
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
index 69aa60a2e8b7..e033aa6c7d9b 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
@@ -2367,6 +2367,7 @@ static const struct pptable_funcs arcturus_ppt_funcs = {
.get_fan_parameters = arcturus_get_fan_parameters,

[PATCH] drm/amd/pm: fix Navi1x runtime resume failure V2

2021-03-18 Thread Evan Quan
The RLC was put into a wrong state on runtime suspend. Thus the RLC
autoload will fail on the succeeding runtime resume. By adding an
intermediate PPSMC_MSG_PrepareMp1ForUnload (some GC hard reset involved,
designed for PnP), we can bring RLC back into the desired state.

V2: integrate INTERRUPTS_ENABLED flag clearing into current
mp1 state set routines

Change-Id: I4eb3d5d76068412a6ab228af7fe7f794e53c1eaa
Signed-off-by: Evan Quan 
Reviewed-by: Lijo Lazar 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c   |  9 --
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 28 +++
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  1 +
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 25 +
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 14 ++
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 14 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 28 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
 9 files changed, 102 insertions(+), 27 deletions(-)
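
The smu_cmn.c hunk does not survive in this archive, so for reference here is 
a sketch of what the new common helper presumably looks like, reconstructed 
from the switch statement removed out of smu_set_mp1_state() below (an 
assumption, not the literal applied hunk):

int smu_cmn_set_mp1_state(struct smu_context *smu,
			  enum pp_mp1_state mp1_state)
{
	enum smu_message_type msg;
	int ret;

	switch (mp1_state) {
	case PP_MP1_STATE_SHUTDOWN:
		msg = SMU_MSG_PrepareMp1ForShutdown;
		break;
	case PP_MP1_STATE_UNLOAD:
		msg = SMU_MSG_PrepareMp1ForUnload;
		break;
	case PP_MP1_STATE_RESET:
		msg = SMU_MSG_PrepareMp1ForReset;
		break;
	case PP_MP1_STATE_NONE:
	default:
		return 0;
	}

	ret = smu_cmn_send_smc_msg(smu, msg, NULL);
	/* some asics may not support those messages, mirroring the
	 * removed code which squashed -EINVAL */
	if (ret == -EINVAL)
		ret = 0;
	if (ret)
		dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");

	return ret;
}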

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 809f4cb738f4..ab6f4059630c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2160,9 +2160,12 @@ static int psp_load_smu_fw(struct psp_context *psp)
if (!ucode->fw || amdgpu_sriov_vf(psp->adev))
return 0;
 
-
-   if (amdgpu_in_reset(adev) && ras && ras->supported &&
-   adev->asic_type == CHIP_ARCTURUS) {
+   if ((amdgpu_in_reset(adev) &&
+ras && ras->supported &&
+adev->asic_type == CHIP_ARCTURUS) ||
+(adev->in_runpm &&
+ adev->asic_type >= CHIP_NAVI10 &&
+ adev->asic_type <= CHIP_NAVI12)) {
ret = amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_UNLOAD);
if (ret) {
DRM_WARN("Failed to set MP1 state prepare for 
reload\n");
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 629a988b069d..21c3c149614c 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -836,6 +836,13 @@ struct pptable_funcs {
 */
int (*check_fw_status)(struct smu_context *smu);
 
+   /**
+* @set_mp1_state: put SMU into a correct state for coming
+* resume from runpm or gpu reset.
+*/
+   int (*set_mp1_state)(struct smu_context *smu,
+enum pp_mp1_state mp1_state);
+
/**
 * @setup_pptable: Initialize the power play table and populate it with
 * default values.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index bae9016fedea..470ca4e5d4d7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1927,36 +1927,16 @@ int smu_set_mp1_state(void *handle,
  enum pp_mp1_state mp1_state)
 {
struct smu_context *smu = handle;
-   uint16_t msg;
-   int ret;
+   int ret = 0;
 
if (!smu->pm_enabled)
return -EOPNOTSUPP;
 
mutex_lock(&smu->mutex);
 
-   switch (mp1_state) {
-   case PP_MP1_STATE_SHUTDOWN:
-   msg = SMU_MSG_PrepareMp1ForShutdown;
-   break;
-   case PP_MP1_STATE_UNLOAD:
-   msg = SMU_MSG_PrepareMp1ForUnload;
-   break;
-   case PP_MP1_STATE_RESET:
-   msg = SMU_MSG_PrepareMp1ForReset;
-   break;
-   case PP_MP1_STATE_NONE:
-   default:
-   mutex_unlock(&smu->mutex);
-   return 0;
-   }
-
-   ret = smu_send_smc_msg(smu, msg, NULL);
-   /* some asics may not support those messages */
-   if (ret == -EINVAL)
-   ret = 0;
-   if (ret)
-   dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");
+   if (smu->ppt_funcs &&
+   smu->ppt_funcs->set_mp1_state)
+   ret = smu->ppt_funcs->set_mp1_state(smu, mp1_state);
 
mutex_unlock(&smu->mutex);
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
index 69aa60a2e8b7..e033aa6c7d9b 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
@@ -2367,6 +2367,7 @@ static const struct pptable_funcs arcturus_ppt_funcs = {
.get_fan_parameters = arcturus_get_fan_parameters,
.interrupt_work = smu_v11_0_interrupt_work,
.set_light_sbr = smu_v11_0_set_light_sbr,
+   .set_mp1_state = smu_cmn_set_mp1_state,
 };
 
 void arcturus_set_ppt_funcs(struct smu_context *smu)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 88634b44f8ff..bd95d41fe7a9 100644
--- 

RE: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages

2021-03-18 Thread Koenig, Christian
Reviewed-by: Christian König 

From: Daniel Gomez 
Sent: Thursday, March 18, 2021 09:32
Cc: dag...@gmail.com ; Daniel Gomez ; 
Deucher, Alexander ; Koenig, Christian 
; David Airlie ; Daniel Vetter 
; amd-gfx@lists.freedesktop.org 
; dri-de...@lists.freedesktop.org 
; linux-ker...@vger.kernel.org 

Subject: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages

If userptr pages have been pinned but not bound,
they remain uncleared.

Signed-off-by: Daniel Gomez 
---
 drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index e8c66d10478f..bbcc6264d48f 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct 
ttm_bo_device *bdev, struct ttm_tt
 struct radeon_ttm_tt *gtt = (void *)ttm;
 struct radeon_device *rdev = radeon_get_rdev(bdev);

+   if (gtt->userptr)
+   radeon_ttm_tt_unpin_userptr(bdev, ttm);
+
 if (!gtt->bound)
 return;

 radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);

-   if (gtt->userptr)
-   radeon_ttm_tt_unpin_userptr(bdev, ttm);
 gtt->bound = false;
 }

--
2.30.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Koenig, Christian
Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only completes the 
hardware fences as part of the reset.

So waiting for a fence while holding the reset lock is illegal and needs to be 
avoided.

Lockdep also complains about this when it is used correctly. The only reason it 
doesn't complain here is because you use an atomic+wait_event instead of a 
locking primitive.

Regards,
Christian.


From: Li, Dennis 
Sent: Thursday, March 18, 2021 09:28
An: Koenig, Christian ; amd-gfx@lists.freedesktop.org 
; Deucher, Alexander 
; Kuehling, Felix ; Zhang, 
Hawking 
Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

>>> Those two steps need to be exchanged or otherwise it is possible that new 
>>> delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If we 
exchange the two steps, it may introduce a deadlock. For example, suppose a 
user thread holds the read lock and is waiting for a fence; if the recovery 
thread tries to take the write lock and only then completes the fences, the 
recovery thread will be blocked forever.

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

On 18.03.21 at 08:23, Dennis Li wrote:
> We have defined two variables, in_gpu_reset and reset_sem, in the adev 
> object. The atomic variable in_gpu_reset is used to avoid recovery thread 
> reentry and to make lower level functions return earlier when recovery 
> starts, but it couldn't block other threads while the recovery thread 
> accesses hardware. The r/w semaphore reset_sem is used to solve these 
> synchronization issues between the recovery thread and other threads.
>
> The original solution locked register access in lower level functions, 
> which will introduce the following issues:
>
> 1) Many lower level functions are used in both the recovery thread and 
> other threads. First we must audit all of these functions, and it is easy 
> to miss some. Second, these functions need to select which lock (read lock 
> or write lock) to use according to the thread they are running in. If the 
> thread context isn't considered, the added lock will easily introduce 
> deadlocks. Besides that, developers easily forget to add locks for new 
> functions.
>
> 2) Performance drops, because these lower level functions are called 
> frequently.
>
> 3) False positive lockdep complaints are easily introduced, because the 
> write lock covers a big range in the recovery thread, while the read lock 
> held by low level functions may be nested under other locks in other 
> threads.
>
> Therefore the new solution will try to add lock protection in the kfd 
> ioctls instead. Its goal is that no threads except the recovery thread or 
> its children (for xgmi) access hardware while doing GPU reset and resume. 
> So the recovery thread is refined as follows:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1) If it fails, the system already has a recovery thread running; the 
> current thread exits directly;
> 2) If it succeeds, enter the recovery thread;
>
> Step 1: Cancel all delayed works, stop the drm scheduler, complete all 
> unreceived fences and so on. This tries to stop or pause other threads.
>
> Step 2: Call down_write(&adev->reset_sem) to hold the write lock, which 
> blocks the recovery thread until other threads release their read locks.

Those two steps need to be exchanged or otherwise it is possible that new 
delayed work items etc are started before the lock is taken.

Just to make it clear until this is fixed the whole patch set is a NAK.

Regards,
Christian.

>
> Step 3: Normally only the recovery thread is running to access hardware 
> now, so it is safe to do the GPU reset.
>
> Step 4: Do the post GPU reset work, such as calling all IPs' resume 
> functions;
>
> Step 5: Atomically set adev->in_gpu_reset to 0, wake up other threads and 
> release the write lock. The recovery thread exits normally.
>
> Other threads call amdgpu_read_lock to synchronize with the recovery 
> thread. If a thread finds that in_gpu_reset is 1, it should release the 
> read lock if it holds one, and then block itself to wait for the 
> recovery-finished event. If the thread successfully holds the read lock 
> and in_gpu_reset is 0, it continues. It will exit normally or be stopped 
> by the recovery thread in step 1.
>
> Dennis Li (4):
>drm/amdgpu: remove reset lock from low level functions
>drm/amdgpu: refine the GPU recovery sequence
>drm/amdgpu: instead of using down/up_read directly
>drm/amdkfd: add reset lock protection for kfd entry functions
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 173 +-
>   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|   8 -
> 

[PATCH] drm/radeon/ttm: Fix memory leak userptr pages

2021-03-18 Thread Daniel Gomez
If userptr pages have been pinned but not bound,
they remain uncleared.

Signed-off-by: Daniel Gomez 
---
 drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index e8c66d10478f..bbcc6264d48f 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct 
ttm_bo_device *bdev, struct ttm_tt
struct radeon_ttm_tt *gtt = (void *)ttm;
struct radeon_device *rdev = radeon_get_rdev(bdev);
 
+   if (gtt->userptr)
+   radeon_ttm_tt_unpin_userptr(bdev, ttm);
+
if (!gtt->bound)
return;
 
radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
 
-   if (gtt->userptr)
-   radeon_ttm_tt_unpin_userptr(bdev, ttm);
gtt->bound = false;
 }
 
-- 
2.30.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
>>> Those two steps need to be exchanged or otherwise it is possible that new 
>>> delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If we 
exchange the two steps, it may introduce a deadlock. For example, suppose a 
user thread holds the read lock and is waiting for a fence; if the recovery 
thread tries to take the write lock and only then completes the fences, the 
recovery thread will be blocked forever.
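
To make that hazard concrete, here is a minimal userspace analogue of the 
scenario (illustrative only, not driver code; all names are placeholders). It 
compiles with "gcc -pthread deadlock.c" and hangs by design, which is exactly 
the problem described above:

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t reset_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t fence_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t fence_signalled = PTHREAD_COND_INITIALIZER;
static int fence_done;

static void *user_thread(void *arg)
{
	pthread_rwlock_rdlock(&reset_sem);	/* "amdgpu_read_lock()"     */
	pthread_mutex_lock(&fence_lock);
	while (!fence_done)			/* waits for the fence...   */
		pthread_cond_wait(&fence_signalled, &fence_lock);
	pthread_mutex_unlock(&fence_lock);
	pthread_rwlock_unlock(&reset_sem);
	return NULL;
}

static void *recovery_thread(void *arg)
{
	pthread_rwlock_wrlock(&reset_sem);	/* blocks: reader holds it  */
	pthread_mutex_lock(&fence_lock);	/* never reached            */
	fence_done = 1;				/* ...completed only here   */
	pthread_cond_signal(&fence_signalled);
	pthread_mutex_unlock(&fence_lock);
	pthread_rwlock_unlock(&reset_sem);
	return NULL;
}

int main(void)
{
	pthread_t user, recovery;

	pthread_create(&user, NULL, user_thread, NULL);
	pthread_create(&recovery, NULL, recovery_thread, NULL);
	printf("both threads started; this now deadlocks\n");
	pthread_join(user, NULL);		/* hangs forever */
	pthread_join(recovery, NULL);
	return 0;
}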

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian  
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

On 18.03.21 at 08:23, Dennis Li wrote:
> We have defined two variables, in_gpu_reset and reset_sem, in the adev 
> object. The atomic variable in_gpu_reset is used to avoid recovery thread 
> reentry and to make lower level functions return earlier when recovery 
> starts, but it couldn't block other threads while the recovery thread 
> accesses hardware. The r/w semaphore reset_sem is used to solve these 
> synchronization issues between the recovery thread and other threads.
>
> The original solution locked register access in lower level functions, 
> which will introduce the following issues:
>
> 1) Many lower level functions are used in both the recovery thread and 
> other threads. First we must audit all of these functions, and it is easy 
> to miss some. Second, these functions need to select which lock (read lock 
> or write lock) to use according to the thread they are running in. If the 
> thread context isn't considered, the added lock will easily introduce 
> deadlocks. Besides that, developers easily forget to add locks for new 
> functions.
>
> 2) Performance drops, because these lower level functions are called 
> frequently.
>
> 3) False positive lockdep complaints are easily introduced, because the 
> write lock covers a big range in the recovery thread, while the read lock 
> held by low level functions may be nested under other locks in other 
> threads.
>
> Therefore the new solution will try to add lock protection in the kfd 
> ioctls instead. Its goal is that no threads except the recovery thread or 
> its children (for xgmi) access hardware while doing GPU reset and resume. 
> So the recovery thread is refined as follows:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1) If it fails, the system already has a recovery thread running; the 
> current thread exits directly;
> 2) If it succeeds, enter the recovery thread;
>
> Step 1: Cancel all delayed works, stop the drm scheduler, complete all 
> unreceived fences and so on. This tries to stop or pause other threads.
>
> Step 2: Call down_write(&adev->reset_sem) to hold the write lock, which 
> blocks the recovery thread until other threads release their read locks.

Those two steps need to be exchanged or otherwise it is possible that new 
delayed work items etc are started before the lock is taken.

Just to make it clear until this is fixed the whole patch set is a NAK.

Regards,
Christian.

>
> Step 3: Normally only the recovery thread is running to access hardware 
> now, so it is safe to do the GPU reset.
>
> Step 4: Do the post GPU reset work, such as calling all IPs' resume 
> functions;
>
> Step 5: Atomically set adev->in_gpu_reset to 0, wake up other threads and 
> release the write lock. The recovery thread exits normally.
>
> Other threads call amdgpu_read_lock to synchronize with the recovery 
> thread. If a thread finds that in_gpu_reset is 1, it should release the 
> read lock if it holds one, and then block itself to wait for the 
> recovery-finished event. If the thread successfully holds the read lock 
> and in_gpu_reset is 0, it continues. It will exit normally or be stopped 
> by the recovery thread in step 1.
>
> Dennis Li (4):
>drm/amdgpu: remove reset lock from low level functions
>drm/amdgpu: refine the GPU recovery sequence
>drm/amdgpu: instead of using down/up_read directly
>drm/amdkfd: add reset lock protection for kfd entry functions
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 173 +-
>   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|   8 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   4 +-
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 172 -
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   3 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c  |   4 +
>   .../amd/amdkfd/kfd_process_queue_manager.c|  17 ++
>   12 files changed, 345 insertions(+), 75 deletions(-)
>

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/ttm: Fix memory leak userptr pages

2021-03-18 Thread Daniel Gomez
On Thu, 18 Mar 2021 at 08:49, Christian König  wrote:
>
> On 17.03.21 at 17:08, Daniel Gomez wrote:
> > If userptr pages have been pinned but not bound,
> > they remain uncleared.
> >
> > Signed-off-by: Daniel Gomez 
>
> Good catch, not sure if that can ever happen in practice but better safe
> than sorry.
Thanks! We actually had a case with clEnqueueWriteBuffer where the
driver was leaking.
I can see the problem also affects the radeon driver, so I'll send
a patch for that one as well.
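
For reference, a minimal sketch of the kind of call pattern that exercises 
the userptr path (an assumption about how the leak was hit -- a 
CL_MEM_USE_HOST_PTR buffer typically maps to a userptr BO -- not the actual 
application code from our case):

/* Compile against OpenCL headers and an ICD, e.g. gcc repro.c -lOpenCL.
 * "n" should be a multiple of the page size for aligned_alloc(). */
#include <CL/cl.h>
#include <stdlib.h>

void touch_userptr_path(cl_context ctx, cl_command_queue q, size_t n)
{
	void *host = aligned_alloc(4096, n);	/* page-aligned host memory */
	cl_mem buf = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR, n, host, NULL);

	/* The write pins the userptr pages in the kernel driver... */
	clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n, host, 0, NULL, NULL);

	/* ...and if the TT was never bound to GTT before release, the old
	 * unbind path returned early and the pinned pages leaked. */
	clReleaseMemObject(buf);
	free(host);
}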
>
> Reviewed-by: Christian König 
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +++---
> >   1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > index 9fd2157b133a..50c2b4827c13 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > @@ -1162,13 +1162,13 @@ static void amdgpu_ttm_backend_unbind(struct 
> > ttm_bo_device *bdev,
> >   struct amdgpu_ttm_tt *gtt = (void *)ttm;
> >   int r;
> >
> > - if (!gtt->bound)
> > - return;
> > -
> >   /* if the pages have userptr pinning then clear that first */
> >   if (gtt->userptr)
> >   amdgpu_ttm_tt_unpin_userptr(bdev, ttm);
> >
> > + if (!gtt->bound)
> > + return;
> > +
> >   if (gtt->offset == AMDGPU_BO_INVALID_OFFSET)
> >   return;
> >
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence

2021-03-18 Thread Christian König

On 18.03.21 at 08:23, Dennis Li wrote:

Changed to only set in_gpu_reset to 1 when the recovery thread begins,
and delay holding reset_sem until after pre-reset but before reset. This
makes sure that other threads have exited or been blocked before doing
the GPU reset. Compared with the old code, it lets some threads exit
earlier without waiting for a timeout.

Introduce an event, recovery_fini_event, which is used to block new threads
once the recovery thread has begun. These threads are only woken up when the
recovery thread exits.

v2: remove the code that checks the usage of adev->reset_sem, because
lockdep will show all locks held in the system when it detects a hang
timeout in the recovery thread.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 02a34f9a26aa..67c716e5ee8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1044,6 +1044,8 @@ struct amdgpu_device {
atomic_t in_gpu_reset;
enum pp_mp1_state   mp1_state;
struct rw_semaphore reset_sem;
+   wait_queue_head_t recovery_fini_event;
+
struct amdgpu_doorbell_index doorbell_index;
  
  	struct mutex			notifier_lock;

@@ -1406,4 +1408,8 @@ static inline int amdgpu_in_reset(struct amdgpu_device 
*adev)
  {
return atomic_read(&adev->in_gpu_reset);
  }
+
+int amdgpu_read_lock(struct drm_device *dev, bool interruptible);
+void amdgpu_read_unlock(struct drm_device *dev);
+
  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 24ff5992cb02..15235610cc54 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -211,6 +211,60 @@ static ssize_t amdgpu_device_get_serial_number(struct 
device *dev,
  static DEVICE_ATTR(serial_number, S_IRUGO,
amdgpu_device_get_serial_number, NULL);
  
+int amdgpu_read_lock(struct drm_device *dev, bool interruptible)

+{
+   struct amdgpu_device *adev = drm_to_adev(dev);
+   int ret = 0;
+
+   /**
+* if a thread holds the read lock, but the recovery thread has started,
+* it should release the read lock and wait for the recovery thread to
+* finish, because pre-reset functions have begun, which stop old threads
+* but do not include the current thread.
+   */
+   if (interruptible) {
+   while (!(ret = down_read_killable(&adev->reset_sem)) &&
+   amdgpu_in_reset(adev)) {
+   up_read(&adev->reset_sem);
+   ret = wait_event_interruptible(adev->recovery_fini_event,
+   !amdgpu_in_reset(adev));
+   if (ret)
+   break;
+   }
+   } else {
+   down_read(&adev->reset_sem);
+   while (amdgpu_in_reset(adev)) {
+   up_read(&adev->reset_sem);
+   wait_event(adev->recovery_fini_event,
+  !amdgpu_in_reset(adev));
+   down_read(&adev->reset_sem);
+   }
+   }


Ok once more. This general approach is a NAK. We have already tried this 
and it doesn't work.


All you do here is to replace the gpu reset lock with 
wait_event_interruptible().


From an upstream perspective that is strictly illegal since it will 
just prevent lockdep warnings from filling the logs and doesn't really 
solve any problem.


Regards,
Christian.


+
+   return ret;
+}
+
+void amdgpu_read_unlock(struct drm_device *dev)
+{
+   struct amdgpu_device *adev = drm_to_adev(dev);
+
+   up_read(&adev->reset_sem);
+}
+
+static void amdgpu_write_lock(struct amdgpu_device *adev, struct 
amdgpu_hive_info *hive)
+{
+   if (hive) {
+   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
+   } else {
+   down_write(&adev->reset_sem);
+   }
+}
+
+static void amdgpu_write_unlock(struct amdgpu_device *adev)
+{
+   up_write(&adev->reset_sem);
+}
+
  /**
   * amdgpu_device_supports_atpx - Is the device a dGPU with HG/PX power control
   *
@@ -3280,6 +3334,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
hash_init(adev->mn_hash);
atomic_set(&adev->in_gpu_reset, 0);
init_rwsem(&adev->reset_sem);
+   init_waitqueue_head(&adev->recovery_fini_event);
mutex_init(&adev->psp.mutex);
mutex_init(&adev->notifier_lock);
  
@@ -4509,39 +4564,18 @@ int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,

return r;
  }
  
-static bool amdgpu_device_lock_adev(struct amdgpu_device *adev,

-   struct amdgpu_hive_info *hive)
+static bool amdgpu_device_recovery_enter(struct amdgpu_device *adev)
  {
if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return false;
  
-	if (hive) {

-   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
-   } else {
-   

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König

On 18.03.21 at 08:23, Dennis Li wrote:

We have defined two variables, in_gpu_reset and reset_sem, in the adev 
object. The atomic variable in_gpu_reset is used to avoid recovery thread 
reentry and to make lower level functions return earlier when recovery 
starts, but it couldn't block other threads while the recovery thread 
accesses hardware. The r/w semaphore reset_sem is used to solve these 
synchronization issues between the recovery thread and other threads.

The original solution locked register access in lower level functions, which 
will introduce the following issues:

1) Many lower level functions are used in both the recovery thread and other 
threads. First we must audit all of these functions, and it is easy to miss 
some. Second, these functions need to select which lock (read lock or write 
lock) to use according to the thread they are running in. If the thread 
context isn't considered, the added lock will easily introduce deadlocks. 
Besides that, developers easily forget to add locks for new functions.

2) Performance drops, because these lower level functions are called 
frequently.

3) False positive lockdep complaints are easily introduced, because the 
write lock covers a big range in the recovery thread, while the read lock 
held by low level functions may be nested under other locks in other threads.

Therefore the new solution will try to add lock protection in the kfd ioctls 
instead. Its goal is that no threads except the recovery thread or its 
children (for xgmi) access hardware while doing GPU reset and resume. So the 
recovery thread is refined as follows:

Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
1) If it fails, the system already has a recovery thread running; the 
current thread exits directly;
2) If it succeeds, enter the recovery thread;

Step 1: Cancel all delayed works, stop the drm scheduler, complete all 
unreceived fences and so on. This tries to stop or pause other threads.

Step 2: Call down_write(&adev->reset_sem) to hold the write lock, which 
blocks the recovery thread until other threads release their read locks.


Those two steps need to be exchanged or otherwise it is possible that 
new delayed work items etc are started before the lock is taken.


Just to make it clear until this is fixed the whole patch set is a NAK.

Regards,
Christian.



Step 3: Normally only the recovery thread is running to access hardware now, 
so it is safe to do the GPU reset.

Step 4: Do the post GPU reset work, such as calling all IPs' resume functions;

Step 5: Atomically set adev->in_gpu_reset to 0, wake up other threads and 
release the write lock. The recovery thread exits normally.

Other threads call amdgpu_read_lock to synchronize with the recovery thread. 
If a thread finds that in_gpu_reset is 1, it should release the read lock if 
it holds one, and then block itself to wait for the recovery-finished event. 
If the thread successfully holds the read lock and in_gpu_reset is 0, it 
continues. It will exit normally or be stopped by the recovery thread in 
step 1.

Dennis Li (4):
   drm/amdgpu: remove reset lock from low level functions
   drm/amdgpu: refine the GPU recovery sequence
   drm/amdgpu: instead of using down/up_read directly
   drm/amdkfd: add reset lock protection for kfd entry functions

  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   6 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 173 +-
  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|   8 -
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   4 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 172 -
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   3 +-
  drivers/gpu/drm/amd/amdkfd/kfd_process.c  |   4 +
  .../amd/amdkfd/kfd_process_queue_manager.c|  17 ++
  12 files changed, 345 insertions(+), 75 deletions(-)



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/ttm: Fix memory leak userptr pages

2021-03-18 Thread Christian König

On 17.03.21 at 17:08, Daniel Gomez wrote:

If userptr pages have been pinned but not bound,
they remain uncleared.

Signed-off-by: Daniel Gomez 


Good catch, not sure if that can ever happen in practice but better safe 
than sorry.


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 9fd2157b133a..50c2b4827c13 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1162,13 +1162,13 @@ static void amdgpu_ttm_backend_unbind(struct 
ttm_bo_device *bdev,
struct amdgpu_ttm_tt *gtt = (void *)ttm;
int r;
  
-	if (!gtt->bound)

-   return;
-
/* if the pages have userptr pinning then clear that first */
if (gtt->userptr)
amdgpu_ttm_tt_unpin_userptr(bdev, ttm);
  
+	if (!gtt->bound)

+   return;
+
if (gtt->offset == AMDGPU_BO_INVALID_OFFSET)
return;
  


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 13/13] drm/amdgpu: skip kfd suspend/resume for S0ix

2021-03-18 Thread Quan, Evan
[AMD Public Use]

Patches 1-7 are reviewed-by: Evan Quan 
Patches 8-13 are acked-by: Evan Quan 

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Thursday, March 18, 2021 12:33 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH 13/13] drm/amdgpu: skip kfd suspend/resume for S0ix

GFX is in gfxoff mode during s0ix so we shouldn't need to
actually tear anything down and restore it.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 965b4f18ddc7..319ad9326a71 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3705,7 +3705,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
 
r = amdgpu_device_ip_suspend_phase1(adev);
 
-   amdgpu_amdkfd_suspend(adev, adev->in_runpm);
+   if (!adev->in_s0ix)
+   amdgpu_amdkfd_suspend(adev, adev->in_runpm);
 
/* evict vram memory */
amdgpu_bo_evict_vram(adev);
@@ -3765,9 +3766,11 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
queue_delayed_work(system_wq, &adev->delayed_init_work,
   msecs_to_jiffies(AMDGPU_RESUME_MS));
 
-   r = amdgpu_amdkfd_resume(adev, adev->in_runpm);
-   if (r)
-   return r;
+   if (!adev->in_s0ix) {
+   r = amdgpu_amdkfd_resume(adev, adev->in_runpm);
+   if (r)
+   return r;
+   }
 
/* Make sure IB tests flushed */
flush_delayed_work(>delayed_init_work);
-- 
2.30.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: enable mode-2 gpu reset for vangogh

2021-03-18 Thread Huang Rui
On Thu, Mar 18, 2021 at 03:21:20PM +0800, Du, Xiaojian wrote:
> From: Xiaojian Du 
> 
> This patch is to enable mode-2 gpu reset for vangogh.
> 
> Signed-off-by: Xiaojian Du 

Please add the fixed PSP firmware version in the commit message to let us
know the good version for reset.

The rest looks good to me, good job!

Reviewed-by: Huang Rui 

> ---
>  drivers/gpu/drm/amd/amdgpu/nv.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> index 5846eac292c3..a31ef68ee2ab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> @@ -561,10 +561,6 @@ static int nv_asic_reset(struct amdgpu_device *adev)
>   int ret = 0;
>   struct smu_context *smu = &adev->smu;
>  
> - /* skip reset on vangogh for now */
> - if (adev->asic_type == CHIP_VANGOGH)
> - return 0;
> -
>   switch (nv_asic_reset_method(adev)) {
>   case AMD_RESET_METHOD_PCI:
>   dev_info(adev->dev, "PCI reset\n");
> -- 
> 2.17.1
> 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions

2021-03-18 Thread Dennis Li
When doing GPU reset, try to block all kfd functions including
kfd ioctls and the file close function, which may access hardware.

v2: fix a potential recursive locking issue

kfd_ioctl_dbg_register has a chance to call into pqm_create_queue, which
would cause recursive locking. So remove the read_lock from the process
queue manager, and take the read_lock in the related ioctls instead.

v3: put pqm_query_dev_by_qid under the protection of p->mutex

Signed-off-by: Dennis Li 
Acked-by: Christian König 

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 6802c616e10e..283ba9435233 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -40,6 +40,7 @@
 #include "kfd_dbgmgr.h"
 #include "amdgpu_amdkfd.h"
 #include "kfd_smi_events.h"
+#include "amdgpu.h"
 
 static long kfd_ioctl(struct file *, unsigned int, unsigned long);
 static int kfd_open(struct inode *, struct file *);
@@ -298,6 +299,9 @@ static int kfd_ioctl_create_queue(struct file *filep, 
struct kfd_process *p,
}
 
mutex_lock(&p->mutex);
+   err = amdgpu_read_lock(dev->ddev, true);
+   if (err)
+   goto err_read_lock;
 
pdd = kfd_bind_process_to_device(dev, p);
if (IS_ERR(pdd)) {
@@ -326,6 +330,7 @@ static int kfd_ioctl_create_queue(struct file *filep, 
struct kfd_process *p,
 */
args->doorbell_offset |= doorbell_offset_in_process;
 
+   amdgpu_read_unlock(dev->ddev);
mutex_unlock(&p->mutex);
 
pr_debug("Queue id %d was created successfully\n", args->queue_id);
@@ -343,6 +348,8 @@ static int kfd_ioctl_create_queue(struct file *filep, 
struct kfd_process *p,
 
 err_create_queue:
 err_bind_process:
+   amdgpu_read_unlock(dev->ddev);
+err_read_lock:
mutex_unlock(&p->mutex);
return err;
 }
@@ -352,6 +359,7 @@ static int kfd_ioctl_destroy_queue(struct file *filp, 
struct kfd_process *p,
 {
int retval;
struct kfd_ioctl_destroy_queue_args *args = data;
+   struct kfd_dev *dev;
 
pr_debug("Destroying queue id %d for pasid 0x%x\n",
args->queue_id,
@@ -359,8 +367,20 @@ static int kfd_ioctl_destroy_queue(struct file *filp, 
struct kfd_process *p,
 
mutex_lock(&p->mutex);

+   dev = pqm_query_dev_by_qid(&p->pqm, args->queue_id);
+   if (!dev) {
+   retval = -EINVAL;
+   goto err_query_dev;
+   }
+
+   retval = amdgpu_read_lock(dev->ddev, true);
+   if (retval)
+   goto err_read_lock;
retval = pqm_destroy_queue(&p->pqm, args->queue_id);
+   amdgpu_read_unlock(dev->ddev);
 
+err_read_lock:
+err_query_dev:
mutex_unlock(&p->mutex);
return retval;
 }
@@ -371,6 +391,7 @@ static int kfd_ioctl_update_queue(struct file *filp, struct 
kfd_process *p,
int retval;
struct kfd_ioctl_update_queue_args *args = data;
struct queue_properties properties;
+   struct kfd_dev *dev;
 
if (args->queue_percentage > KFD_MAX_QUEUE_PERCENTAGE) {
pr_err("Queue percentage must be between 0 to 
KFD_MAX_QUEUE_PERCENTAGE\n");
@@ -404,10 +425,21 @@ static int kfd_ioctl_update_queue(struct file *filp, 
struct kfd_process *p,
 
mutex_lock(&p->mutex);

+   dev = pqm_query_dev_by_qid(&p->pqm, args->queue_id);
+   if (!dev) {
+   retval = -EINVAL;
+   goto err_query_dev;
+   }
+
+   retval = amdgpu_read_lock(dev->ddev, true);
+   if (retval)
+   goto err_read_lock;
retval = pqm_update_queue(&p->pqm, args->queue_id, &properties);
+   amdgpu_read_unlock(dev->ddev);
 
+err_read_lock:
+err_query_dev:
mutex_unlock(&p->mutex);
-
return retval;
 }
 
@@ -420,6 +452,7 @@ static int kfd_ioctl_set_cu_mask(struct file *filp, struct 
kfd_process *p,
struct queue_properties properties;
uint32_t __user *cu_mask_ptr = (uint32_t __user *)args->cu_mask_ptr;
size_t cu_mask_size = sizeof(uint32_t) * (args->num_cu_mask / 32);
+   struct kfd_dev *dev;
 
if ((args->num_cu_mask % 32) != 0) {
pr_debug("num_cu_mask 0x%x must be a multiple of 32",
@@ -456,8 +489,20 @@ static int kfd_ioctl_set_cu_mask(struct file *filp, struct 
kfd_process *p,
 
mutex_lock(&p->mutex);

+   dev = pqm_query_dev_by_qid(&p->pqm, args->queue_id);
+   if (!dev) {
+   retval = -EINVAL;
+   goto err_query_dev;
+   }
+
+   retval = amdgpu_read_lock(dev->ddev, true);
+   if (retval)
+   goto err_read_lock;
retval = pqm_set_cu_mask(&p->pqm, args->queue_id, &properties);
+   amdgpu_read_unlock(dev->ddev);
 
+err_read_lock:
+err_query_dev:
mutex_unlock(&p->mutex);
 
if (retval)
@@ -471,14 +516,27 @@ static int kfd_ioctl_get_queue_wave_state(struct file 
*filep,
 {
struct kfd_ioctl_get_queue_wave_state_args *args = data;
int r;
+   struct kfd_dev *dev;
 

[PATCH 3/4] drm/amdgpu: instead of using down/up_read directly

2021-03-18 Thread Dennis Li
change to use amdgpu_read_lock/unlock which could handle more cases

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index bcaf271b39bf..66dec0f49c4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -59,11 +59,12 @@ int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev)
 static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file)
 {
struct amdgpu_device *adev = inode->i_private;
+   struct drm_device *dev = adev_to_drm(adev);
int ret;
 
file->private_data = adev;
 
-   ret = down_read_killable(&adev->reset_sem);
+   ret = amdgpu_read_lock(dev, true);
if (ret)
return ret;
 
@@ -74,7 +75,7 @@ static int amdgpu_debugfs_autodump_open(struct inode *inode, 
struct file *file)
ret = -EBUSY;
}
 
-   up_read(&adev->reset_sem);
+   amdgpu_read_unlock(dev);
 
return ret;
 }
@@ -1206,7 +1207,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
}
 
/* Avoid accidently unparking the sched thread during GPU reset */
-   r = down_read_killable(&adev->reset_sem);
+   r = amdgpu_read_lock(dev, true);
if (r)
return r;
 
@@ -1235,7 +1236,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
kthread_unpark(ring->sched.thread);
}
 
-   up_read(&adev->reset_sem);
+   amdgpu_read_unlock(dev);
 
pm_runtime_mark_last_busy(dev->dev);
pm_runtime_put_autosuspend(dev->dev);
@@ -1427,6 +1428,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
struct amdgpu_ring *ring;
struct dma_fence **fences = NULL;
struct amdgpu_device *adev = (struct amdgpu_device *)data;
+   struct drm_device *dev = adev_to_drm(adev);
 
if (val >= AMDGPU_MAX_RINGS)
return -EINVAL;
@@ -1446,7 +1448,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
return -ENOMEM;
 
/* Avoid accidently unparking the sched thread during GPU reset */
-   r = down_read_killable(&adev->reset_sem);
+   r = amdgpu_read_lock(dev, true);
if (r)
goto pro_end;
 
@@ -1489,7 +1491,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
/* restart the scheduler */
kthread_unpark(ring->sched.thread);
 
-   up_read(&adev->reset_sem);
+   amdgpu_read_unlock(dev);
 
ttm_bo_unlock_delayed_workqueue(>mman.bdev, resched);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 3ee481557fc9..113c63bf187f 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -247,12 +247,13 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, 
flr_work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
+   struct drm_device *dev = adev_to_drm(adev);
 
/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_read_trylock(&adev->reset_sem))
+   if (amdgpu_read_lock(dev, true))
return;
 
amdgpu_virt_fini_data_exchange(adev);
@@ -268,7 +269,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 
 flr_done:
atomic_set(&adev->in_gpu_reset, 0);
-   up_read(&adev->reset_sem);
+   amdgpu_read_unlock(dev);
 
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 48e588d3c409..2cd910e5caa7 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -268,12 +268,13 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, 
flr_work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
+   struct drm_device *dev = adev_to_drm(adev);
 
/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_read_trylock(&adev->reset_sem))
+   if (amdgpu_read_lock(dev, true))
return;
 
amdgpu_virt_fini_data_exchange(adev);
@@ -289,7 +290,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 
 flr_done:
atomic_set(&adev->in_gpu_reset, 0);
-   up_read(&adev->reset_sem);
+   amdgpu_read_unlock(dev);
 
/* Trigger recovery for world switch failure if no 

[PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence

2021-03-18 Thread Dennis Li
Changed to only set in_gpu_reset to 1 when the recovery thread begins,
and delay holding reset_sem until after pre-reset but before reset. This
makes sure that other threads have exited or been blocked before doing
the GPU reset. Compared with the old code, it lets some threads exit
earlier without waiting for a timeout.

Introduce an event, recovery_fini_event, which is used to block new threads
once the recovery thread has begun. These threads are only woken up when the
recovery thread exits.

v2: remove the code that checks the usage of adev->reset_sem, because
lockdep will show all locks held in the system when it detects a hang
timeout in the recovery thread.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 02a34f9a26aa..67c716e5ee8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1044,6 +1044,8 @@ struct amdgpu_device {
atomic_t in_gpu_reset;
enum pp_mp1_state   mp1_state;
struct rw_semaphore reset_sem;
+   wait_queue_head_t recovery_fini_event;
+
struct amdgpu_doorbell_index doorbell_index;
 
struct mutex notifier_lock;
@@ -1406,4 +1408,8 @@ static inline int amdgpu_in_reset(struct amdgpu_device 
*adev)
 {
return atomic_read(&adev->in_gpu_reset);
 }
+
+int amdgpu_read_lock(struct drm_device *dev, bool interruptible);
+void amdgpu_read_unlock(struct drm_device *dev);
+
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 24ff5992cb02..15235610cc54 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -211,6 +211,60 @@ static ssize_t amdgpu_device_get_serial_number(struct 
device *dev,
 static DEVICE_ATTR(serial_number, S_IRUGO,
amdgpu_device_get_serial_number, NULL);
 
+int amdgpu_read_lock(struct drm_device *dev, bool interruptible)
+{
+   struct amdgpu_device *adev = drm_to_adev(dev);
+   int ret = 0;
+
+   /**
+* if a thread holds the read lock, but the recovery thread has started,
+* it should release the read lock and wait for the recovery thread to
+* finish, because pre-reset functions have begun, which stop old threads
+* but do not include the current thread.
+   */
+   if (interruptible) {
+   while (!(ret = down_read_killable(&adev->reset_sem)) &&
+   amdgpu_in_reset(adev)) {
+   up_read(&adev->reset_sem);
+   ret = wait_event_interruptible(adev->recovery_fini_event,
+   !amdgpu_in_reset(adev));
+   if (ret)
+   break;
+   }
+   } else {
+   down_read(&adev->reset_sem);
+   while (amdgpu_in_reset(adev)) {
+   up_read(&adev->reset_sem);
+   wait_event(adev->recovery_fini_event,
+  !amdgpu_in_reset(adev));
+   down_read(&adev->reset_sem);
+   }
+   }
+
+   return ret;
+}
+
+void amdgpu_read_unlock(struct drm_device *dev)
+{
+   struct amdgpu_device *adev = drm_to_adev(dev);
+
+   up_read(&adev->reset_sem);
+}
+
+static void amdgpu_write_lock(struct amdgpu_device *adev, struct 
amdgpu_hive_info *hive)
+{
+   if (hive) {
+   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
+   } else {
+   down_write(&adev->reset_sem);
+   }
+}
+
+static void amdgpu_write_unlock(struct amdgpu_device *adev)
+{
+   up_write(&adev->reset_sem);
+}
+
 /**
  * amdgpu_device_supports_atpx - Is the device a dGPU with HG/PX power control
  *
@@ -3280,6 +3334,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
hash_init(adev->mn_hash);
atomic_set(&adev->in_gpu_reset, 0);
init_rwsem(&adev->reset_sem);
+   init_waitqueue_head(&adev->recovery_fini_event);
mutex_init(&adev->psp.mutex);
mutex_init(&adev->notifier_lock);
 
@@ -4509,39 +4564,18 @@ int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
return r;
 }
 
-static bool amdgpu_device_lock_adev(struct amdgpu_device *adev,
-   struct amdgpu_hive_info *hive)
+static bool amdgpu_device_recovery_enter(struct amdgpu_device *adev)
 {
if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return false;
 
-   if (hive) {
-   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
-   } else {
-   down_write(&adev->reset_sem);
-   }
-
-   switch (amdgpu_asic_reset_method(adev)) {
-   case AMD_RESET_METHOD_MODE1:
-   adev->mp1_state = PP_MP1_STATE_SHUTDOWN;
-   break;
-   case AMD_RESET_METHOD_MODE2:
-   adev->mp1_state = PP_MP1_STATE_RESET;
-   break;
-   default:
-   adev->mp1_state = PP_MP1_STATE_NONE;
-   break;
-   }
-

[PATCH 1/4] drm/amdgpu: remove reset lock from low level functions

2021-03-18 Thread Dennis Li
It is easy to cause performance drop issues when using locks in low level
functions.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0b1e0127056f..24ff5992cb02 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -374,13 +374,10 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,
 
if ((reg * 4) < adev->rmmio_size) {
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
-   amdgpu_sriov_runtime(adev) &&
-   down_read_trylock(&adev->reset_sem)) {
+   amdgpu_sriov_runtime(adev))
ret = amdgpu_kiq_rreg(adev, reg);
-   up_read(&adev->reset_sem);
-   } else {
+   else
ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
-   }
} else {
ret = adev->pcie_rreg(adev, reg * 4);
}
@@ -459,13 +456,10 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
 
if ((reg * 4) < adev->rmmio_size) {
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
-   amdgpu_sriov_runtime(adev) &&
-   down_read_trylock(&adev->reset_sem)) {
+   amdgpu_sriov_runtime(adev))
amdgpu_kiq_wreg(adev, reg, v);
-   up_read(&adev->reset_sem);
-   } else {
+   else
writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
-   }
} else {
adev->pcie_wreg(adev, reg * 4, v);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index a05dbbbd9803..9f6eaca107ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -155,11 +155,7 @@ static int __update_table_header(struct 
amdgpu_ras_eeprom_control *control,
 
msg.addr = control->i2c_address;
 
-   /* i2c may be unstable in gpu reset */
-   down_read(&adev->reset_sem);
ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
-   up_read(&adev->reset_sem);
-
if (ret < 1)
DRM_ERROR("Failed to write EEPROM table header, ret:%d", ret);
 
@@ -546,11 +542,7 @@ int amdgpu_ras_eeprom_process_recods(struct 
amdgpu_ras_eeprom_control *control,
control->next_addr += EEPROM_TABLE_RECORD_SIZE;
}
 
-   /* i2c may be unstable in gpu reset */
-   down_read(&adev->reset_sem);
ret = i2c_transfer(&adev->pm.smu_i2c, msgs, num);
-   up_read(&adev->reset_sem);
-
if (ret < 1) {
DRM_ERROR("Failed to process EEPROM table records, ret:%d", 
ret);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 33e54eed2eec..690f368ce378 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -317,8 +317,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 * Directly use kiq to do the vm invalidation instead
 */
if (adev->gfx.kiq.ring.sched.ready &&
-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   down_read_trylock(&adev->reset_sem)) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
struct amdgpu_vmhub *hub = &adev->vmhub[vmhub];
const unsigned eng = 17;
u32 inv_req = hub->vmhub_funcs->get_invalidate_req(vmid, 
flush_type);
@@ -328,7 +327,6 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
1 << vmid);
 
-   up_read(&adev->reset_sem);
return;
}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 1567dd227f51..ec3c05360776 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -757,14 +757,12 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 * as GFXOFF under bare metal
 */
if (adev->gfx.kiq.ring.sched.ready &&
-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   down_read_trylock(&adev->reset_sem)) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
 
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
   1 << vmid);
-   up_read(&adev->reset_sem);
return;
}
 
@@ -859,7 +857,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
if (amdgpu_in_reset(adev))
return -EIO;

[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Dennis Li
We have defined two variables, in_gpu_reset and reset_sem, in the adev 
object. The atomic variable in_gpu_reset is used to avoid recovery thread 
reentry and to make lower level functions return earlier when recovery 
starts, but it couldn't block other threads while the recovery thread 
accesses hardware. The r/w semaphore reset_sem is used to solve these 
synchronization issues between the recovery thread and other threads.

The original solution locked register access in lower level functions, which 
will introduce the following issues:

1) Many lower level functions are used in both the recovery thread and other 
threads. First we must audit all of these functions, and it is easy to miss 
some. Second, these functions need to select which lock (read lock or write 
lock) to use according to the thread they are running in. If the thread 
context isn't considered, the added lock will easily introduce deadlocks. 
Besides that, developers easily forget to add locks for new functions.

2) Performance drops, because these lower level functions are called 
frequently.

3) False positive lockdep complaints are easily introduced, because the 
write lock covers a big range in the recovery thread, while the read lock 
held by low level functions may be nested under other locks in other threads.

Therefore the new solution will try to add lock protection in the kfd ioctls 
instead. Its goal is that no threads except the recovery thread or its 
children (for xgmi) access hardware while doing GPU reset and resume. So the 
recovery thread is refined as follows:

Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
   1) If it fails, the system already has a recovery thread running; the 
current thread exits directly;
   2) If it succeeds, enter the recovery thread;

Step 1: Cancel all delayed works, stop the drm scheduler, complete all 
unreceived fences and so on. This tries to stop or pause other threads.

Step 2: Call down_write(&adev->reset_sem) to hold the write lock, which 
blocks the recovery thread until other threads release their read locks.

Step 3: Normally only the recovery thread is running to access hardware now, 
so it is safe to do the GPU reset.

Step 4: Do the post GPU reset work, such as calling all IPs' resume functions;

Step 5: Atomically set adev->in_gpu_reset to 0, wake up other threads and 
release the write lock. The recovery thread exits normally.

Other threads call amdgpu_read_lock to synchronize with the recovery thread. 
If a thread finds that in_gpu_reset is 1, it should release the read lock if 
it holds one, and then block itself to wait for the recovery-finished event. 
If the thread successfully holds the read lock and in_gpu_reset is 0, it 
continues. It will exit normally or be stopped by the recovery thread in 
step 1.
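
Condensed into C, the proposed recovery-thread flow looks roughly like this 
(a sketch only: the locking and atomics follow the patches, but the 
pre/post-reset helpers are placeholders for steps 1, 3 and 4, not literal 
functions from this series):

static int amdgpu_gpu_recover_sketch(struct amdgpu_device *adev)
{
	/* step 0: only one recovery thread at a time */
	if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
		return 0;

	/* step 1: cancel delayed works, park the schedulers and
	 * complete unreceived fences (placeholder helper) */
	amdgpu_pre_reset_sketch(adev);

	/* step 2: blocks until every read-lock holder is gone */
	down_write(&adev->reset_sem);

	/* steps 3 and 4: we are the only hardware user now; reset the
	 * ASIC and resume all IPs (placeholder helper) */
	amdgpu_do_reset_and_resume_sketch(adev);

	/* step 5: reopen the gate, wake blocked readers, drop the lock */
	atomic_set(&adev->in_gpu_reset, 0);
	wake_up_all(&adev->recovery_fini_event);
	up_write(&adev->reset_sem);

	return 0;
}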

Dennis Li (4):
  drm/amdgpu: remove reset lock from low level functions
  drm/amdgpu: refine the GPU recovery sequence
  drm/amdgpu: instead of using down/up_read directly
  drm/amdkfd: add reset lock protection for kfd entry functions

 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 173 +-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|   8 -
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   4 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 172 -
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  |   4 +
 .../amd/amdkfd/kfd_process_queue_manager.c|  17 ++
 12 files changed, 345 insertions(+), 75 deletions(-)

-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: enable mode-2 gpu reset for vangogh

2021-03-18 Thread Xiaojian Du
From: Xiaojian Du 

This patch is to enable mode-2 gpu reset for vangogh.

Signed-off-by: Xiaojian Du 
---
 drivers/gpu/drm/amd/amdgpu/nv.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 5846eac292c3..a31ef68ee2ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -561,10 +561,6 @@ static int nv_asic_reset(struct amdgpu_device *adev)
int ret = 0;
struct smu_context *smu = &adev->smu;
 
-   /* skip reset on vangogh for now */
-   if (adev->asic_type == CHIP_VANGOGH)
-   return 0;
-
switch (nv_asic_reset_method(adev)) {
case AMD_RESET_METHOD_PCI:
dev_info(adev->dev, "PCI reset\n");
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx