RE: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

2023-03-15 Thread Liu, Monk
[AMD Official Use Only - General]

Hi Luben

Please let us know if you don't have bandwidth to review and we can require 
other people to review, please be understand that we are in an urgent situation 
and need this change to go to mainline this week.

Thanks 
---
Monk Liu | Cloud GPU & Virtualization Software | AMD
---

-Original Message-
From: Wang, YuBiao  
Sent: 2023年3月14日 14:34
To: Tuikov, Luben 
Cc: Quan, Evan ; Chen, Horace ; Koenig, 
Christian ; Deucher, Alexander 
; Zhang, Hawking ; Liu, Monk 
; Xu, Feifei ; Wang, Yang(Kevin) 
; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in 
non-sched jobs

Hi Luben,

I'd have to ping you because we've got a P1 ticket currently on this issue. 
Would you please give a vague time when would you confirm whether this patch is 
safe? Thank you a lot for helping double check this.

Regards & Thanks,
Yubiao 

-Original Message-
From: Tuikov, Luben 
Sent: Saturday, March 11, 2023 12:56 AM
To: Wang, YuBiao ; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan ; Chen, Horace ; Koenig, 
Christian ; Deucher, Alexander 
; Zhang, Hawking ; Liu, Monk 
; Xu, Feifei ; Wang, Yang(Kevin) 

Subject: Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in 
non-sched jobs

On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a 
> failed ib test before mode 1 reset during asic reset. The fences in 
> this case are never signaled and next time when we try to free the 
> sa_bo, kernel will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the 
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be 
> released in job_free_cb, and for non-sched jobs like ib_test, it's 
> meant to be released in sa_bo_free but only when the fences are 
> signaled. So we have to force signal the non_sched bad job's fence 
> during pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct
> amdgpu_ring *ring)  {
>   int i;
>   struct dma_fence *old, **ptr;
> + struct amdgpu_job *job;
>  
>   for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>   ptr = >fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>   if (old && old->ops == _job_fence_ops) {
>   RCU_INIT_POINTER(*ptr, NULL);
>   dma_fence_put(old);
> + /* For non-sched bad job, i.e. failed ib test, we need 
> to force
> +  * signal it right here or we won't be able to track 
> them in fence drv
> +  * and they will remain unsignaled during sa_bo free.
> +  */
> + job = container_of(old, struct amdgpu_job, hw_fence);
> + if (!job->base.s_fence && !dma_fence_is_signaled(old))
> + dma_fence_signal(old);

Conceptually, I don't mind this patch for what it does. The only thing which 
worries me is this check here, !job->base.s_fence, which is used here to 
qualify that we can signal the fence (and of course that the fence is not yet 
signalled.) We need to audit this check to make sure that it is not overloaded 
to mean other things. I'll take a look.

>   }
>   }
>  }

--
Regards,
Luben


Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

2023-03-15 Thread Luben Tuikov
On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a failed
> ib test before mode 1 reset during asic reset. The fences in this case
> are never signaled and next time when we try to free the sa_bo, kernel
> will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
> to be released in sa_bo_free but only when the fences are signaled. So
> we have to force signal the non_sched bad job's fence during
> pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>  {
>   int i;
>   struct dma_fence *old, **ptr;
> + struct amdgpu_job *job;
>  
>   for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>   ptr = >fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>   if (old && old->ops == _job_fence_ops) {
>   RCU_INIT_POINTER(*ptr, NULL);
>   dma_fence_put(old);
> + /* For non-sched bad job, i.e. failed ib test, we need 
> to force
> +  * signal it right here or we won't be able to track 
> them in fence drv
> +  * and they will remain unsignaled during sa_bo free.
> +  */
> + job = container_of(old, struct amdgpu_job, hw_fence);
> + if (!job->base.s_fence && !dma_fence_is_signaled(old))
> + dma_fence_signal(old);

Hi YuBiao,

Thanks for adding the clarifying comments and sending a v2 of this patch.

Perhaps move the chunk you're adding, to sit before, rather than after,
the statements of the if-conditional. Also move the "job" variable
declaration to be inside the if-conditional, since it is not used
by anything outside it. Something like this, (note a few small fixes to the 
comment),
if (old && old->ops == _job_fence_ops) {
struct amdgpu_job *job;

/* For non-scheduler bad job, i.e. failed IB test, we need to 
signal
 * it right here or we won't be able to track them in fence_drv
 * and they will remain unsignaled during sa_bo free.
 */
job = container_of(old, struct amdgpu_job, hw_fence);
if (!job->base.s_fence && !dma_fence_is_signaled(old))
dma_fence_signal(old);
RCU_INIT_POINTER(*ptr, NULL);
dma_fence_put(old);
}
Then, give it a test.
With that change, and upon successful test results, this patch is,
Acked-by: Luben Tuikov 
-- 
Regards,
Luben



RE: [PATCH 1/2] drm/amdgpu: reposition the gpu reset checking for reuse

2023-03-15 Thread Huang, Tim
[Public]

-Original Message-
From: Alex Deucher 
Sent: Wednesday, March 15, 2023 10:36 PM
To: Huang, Tim 
Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Zhang, Yifan ; Limonciello, 
Mario 
Subject: Re: [PATCH 1/2] drm/amdgpu: reposition the gpu reset checking for reuse

On Wed, Mar 15, 2023 at 7:05 AM Tim Huang  wrote:
>
> Move the amdgpu_acpi_should_gpu_reset out of CONFIG_SUSPEND to share
> it with hibernate case.
>
> Signed-off-by: Tim Huang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h  |  4 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 40
> +---
>  2 files changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 5c6132502f35..5bddc03332b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1392,10 +1392,12 @@ int amdgpu_acpi_smart_shift_update(struct
> drm_device *dev, enum amdgpu_ss ss_sta  int
> amdgpu_acpi_pcie_notify_device_ready(struct amdgpu_device *adev);
>
>  void amdgpu_acpi_get_backlight_caps(struct amdgpu_dm_backlight_caps
> *caps);
> +bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev);
>  void amdgpu_acpi_detect(void);
>  #else
>  static inline int amdgpu_acpi_init(struct amdgpu_device *adev) {
> return 0; }  static inline void amdgpu_acpi_fini(struct amdgpu_device
> *adev) { }
> +static inline bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device
> +*adev) { return false; }
>  static inline void amdgpu_acpi_detect(void) { }  static inline bool
> amdgpu_acpi_is_power_shift_control_supported(void) { return false; }
> static inline int amdgpu_acpi_power_shift_control(struct amdgpu_device
> *adev, @@ -1406,11 +1408,9 @@ static inline int
> amdgpu_acpi_smart_shift_update(struct drm_device *dev,
>
>  #if defined(CONFIG_ACPI) && defined(CONFIG_SUSPEND)  bool
> amdgpu_acpi_is_s3_active(struct amdgpu_device *adev); -bool
> amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev);  bool
> amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev);  #else  static
> inline bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev) {
> return false; } -static inline bool
> amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) { return
> false; }  static inline bool amdgpu_acpi_is_s3_active(struct
> amdgpu_device *adev) { return false; }  #endif
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index 25e902077caf..065944bdeee4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -971,6 +971,28 @@ static bool amdgpu_atcs_pci_probe_handle(struct pci_dev 
> *pdev)
> return true;
>  }
>
> +
> +/**
> + * amdgpu_acpi_should_gpu_reset
> + *
> + * @adev: amdgpu_device_pointer
> + *
> + * returns true if should reset GPU, false if not  */ bool
> +amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) {
> +   if (adev->flags & AMD_IS_APU)
> +   return false;
> +
> +   if (amdgpu_sriov_vf(adev))
> +   return false;
> +
> +#if IS_ENABLED(CONFIG_SUSPEND)
> +   return pm_suspend_target_state != PM_SUSPEND_TO_IDLE; #endif
> +/* CONFIG_SUSPEND */
> +   return true;


Should probably be:
#if IS_ENABLED(CONFIG_SUSPEND)
return pm_suspend_target_state != PM_SUSPEND_TO_IDLE; #else
return true;
#endif

Yes, will fix it. Thanks Alex.

With that fixed, series is:
Reviewed-by: Alex Deucher 

> +}
> +
>  /*
>   * amdgpu_acpi_detect - detect ACPI ATIF/ATCS methods
>   *
> @@ -1042,24 +1064,6 @@ bool amdgpu_acpi_is_s3_active(struct amdgpu_device 
> *adev)
> (pm_suspend_target_state == PM_SUSPEND_MEM);  }
>
> -/**
> - * amdgpu_acpi_should_gpu_reset
> - *
> - * @adev: amdgpu_device_pointer
> - *
> - * returns true if should reset GPU, false if not
> - */
> -bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) -{
> -   if (adev->flags & AMD_IS_APU)
> -   return false;
> -
> -   if (amdgpu_sriov_vf(adev))
> -   return false;
> -
> -   return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
> -}
> -
>  /**
>   * amdgpu_acpi_is_s0ix_active
>   *
> --
> 2.25.1
>


RE: [PATCH] drm/amdgpu: drop the extra sign extension

2023-03-15 Thread Wang, Yang(Kevin)
[AMD Official Use Only - General]

Reviewed-by: Yang Wang 

Best Regards,
Kevin

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Thursday, March 16, 2023 1:53 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: drop the extra sign extension

amdgpu_bo_gpu_offset_no_check() already calls
amdgpu_gmc_sign_extend() so no need to call it twice.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 69e105fa41f6..ce2afd7e775b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -152,7 +152,7 @@ static void amdgpu_vm_sdma_copy_ptes(struct 
amdgpu_vm_update_params *p,

src += p->num_dw_left * 4;

-   pe += amdgpu_gmc_sign_extend(amdgpu_bo_gpu_offset_no_check(bo));
+   pe += amdgpu_bo_gpu_offset_no_check(bo);
trace_amdgpu_vm_copy_ptes(pe, src, count, p->immediate);

amdgpu_vm_copy_pte(p->adev, ib, pe, src, count); @@ -179,7 +179,7 @@ 
static void amdgpu_vm_sdma_set_ptes(struct amdgpu_vm_update_params *p,  {
struct amdgpu_ib *ib = p->job->ibs;

-   pe += amdgpu_gmc_sign_extend(amdgpu_bo_gpu_offset_no_check(bo));
+   pe += amdgpu_bo_gpu_offset_no_check(bo);
trace_amdgpu_vm_set_ptes(pe, addr, count, incr, flags, p->immediate);
if (count < 3) {
amdgpu_vm_write_pte(p->adev, ib, pe, addr | flags,
--
2.39.2



Re: [PATCH v3 09/17] drm/amd/display: Register Colorspace property for DP and HDMI

2023-03-15 Thread Sebastian Wick
On Tue, Mar 7, 2023 at 4:12 PM Harry Wentland  wrote:
>
> We want compositors to be able to set the output
> colorspace on DP and HDMI outputs, based on the
> caps reported from the receiver via EDID.

About that... The documentation says that user space has to check the
EDID for what the sink actually supports. So whatever is in
supported_colorspaces is just what the driver/hardware is able to set
but doesn't actually indicate that the sink supports it.

So the only way to enable bt2020 is by checking if the sink supports
both RGB and YUV variants because both could be used by the driver.
Not great at all. Something to remember for the new property.

> Signed-off-by: Harry Wentland 
> Cc: Pekka Paalanen 
> Cc: Sebastian Wick 
> Cc: vitaly.pros...@amd.com
> Cc: Joshua Ashton 
> Cc: dri-de...@lists.freedesktop.org
> Cc: amd-gfx@lists.freedesktop.org
> Reviewed-By: Joshua Ashton 
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index f91b2ea13d96..2d883c6dae90 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -7184,6 +7184,12 @@ static int amdgpu_dm_connector_get_modes(struct 
> drm_connector *connector)
> return amdgpu_dm_connector->num_modes;
>  }
>
> +static const u32 supported_colorspaces =
> +   BIT(DRM_MODE_COLORIMETRY_BT709_YCC) |
> +   BIT(DRM_MODE_COLORIMETRY_OPRGB) |
> +   BIT(DRM_MODE_COLORIMETRY_BT2020) |
> +   BIT(DRM_MODE_COLORIMETRY_BT2020_DEPRECATED);
> +
>  void amdgpu_dm_connector_init_helper(struct amdgpu_display_manager *dm,
>  struct amdgpu_dm_connector *aconnector,
>  int connector_type,
> @@ -7264,6 +7270,15 @@ void amdgpu_dm_connector_init_helper(struct 
> amdgpu_display_manager *dm,
> adev->mode_info.abm_level_property, 0);
> }
>
> +   if (connector_type == DRM_MODE_CONNECTOR_HDMIA) {
> +   if 
> (!drm_mode_create_hdmi_colorspace_property(>base, 
> supported_colorspaces))
> +   
> drm_connector_attach_colorspace_property(>base);
> +   } else if (connector_type == DRM_MODE_CONNECTOR_DisplayPort ||
> +  connector_type == DRM_MODE_CONNECTOR_eDP) {
> +   if 
> (!drm_mode_create_dp_colorspace_property(>base, 
> supported_colorspaces))
> +   
> drm_connector_attach_colorspace_property(>base);
> +   }
> +
> if (connector_type == DRM_MODE_CONNECTOR_HDMIA ||
> connector_type == DRM_MODE_CONNECTOR_DisplayPort ||
> connector_type == DRM_MODE_CONNECTOR_eDP) {
> --
> 2.39.2
>



Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

2023-03-15 Thread Stefano Stabellini
On Wed, 15 Mar 2023, Jan Beulich wrote:
> On 15.03.2023 01:52, Stefano Stabellini wrote:
> > On Mon, 13 Mar 2023, Jan Beulich wrote:
> >> On 12.03.2023 13:01, Huang Rui wrote:
> >>> Xen PVH is the paravirtualized mode and takes advantage of hardware
> >>> virtualization support when possible. It will using the hardware IOMMU
> >>> support instead of xen-swiotlb, so disable swiotlb if current domain is
> >>> Xen PVH.
> >>
> >> But the kernel has no way (yet) to drive the IOMMU, so how can it get
> >> away without resorting to swiotlb in certain cases (like I/O to an
> >> address-restricted device)?
> > 
> > I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
> > need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
> > so we can use guest physical addresses instead of machine addresses for
> > DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
> > (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
> > case is XENFEAT_not_direct_mapped).
> 
> But how does Xen using an IOMMU help with, as said, address-restricted
> devices? They may still need e.g. a 32-bit address to be programmed in,
> and if the kernel has memory beyond the 4G boundary not all I/O buffers
> may fulfill this requirement.

In short, it is going to work as long as Linux has guest physical
addresses (not machine addresses, those could be anything) lower than
4GB.

If the address-restricted device does DMA via an IOMMU, then the device
gets programmed by Linux using its guest physical addresses (not machine
addresses).

The 32-bit restriction would be applied by Linux to its choice of guest
physical address to use to program the device, the same way it does on
native. The device would be fine as it always uses Linux-provided <4GB
addresses. After the IOMMU translation (pagetable setup by Xen), we
could get any address, including >4GB addresses, and that is expected to
work.


[pull] amdgpu, amdkfd drm-fixes-6.3

2023-03-15 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 6.3.

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-6.3-2023-03-15

for you to fetch changes up to f3921a9a641483784448fb982b2eb738b383d9b9:

  drm/amdgpu: Don't resume IOMMU after incomplete init (2023-03-15 18:21:51 
-0400)


amd-drm-fixes-6.3-2023-03-15:

amdgpu:
- SMU 13 update
- RDNA2 suspend/resume fix when overclocking is enabled
- SRIOV VCN fixes
- HDCP suspend/resume fix
- Fix drm polling splat regression
- Fix dirty rectangle tracking for PSR
- Fix vangogh regression on certain BIOSes
- Misc display fixes
- Suspend/resume IOMMU regression fix

amdkfd:
- Fix BO offset for multi-VMA page migration
- Fix a possible double free
- Fix potential use after free
- Fix process cleanup on module exit


Alex Deucher (1):
  drm/amdgpu/nv: fix codec array for SR_IOV

Ayush Gupta (1):
  drm/amd/display: disconnect MPCC only on OTG change

Benjamin Cheng (1):
  drm/amd/display: Write to correct dirty_rect

Bhawanpreet Lakha (1):
  drm/amd/display: Fix HDCP failing to enable after suspend

Błażej Szczygieł (1):
  drm/amd/pm: Fix sienna cichlid incorrect OD volage after resume

Chia-I Wu (2):
  drm/amdkfd: fix a potential double free in pqm_create_queue
  drm/amdkfd: fix potential kgd_mem UAFs

Cruise Hung (1):
  drm/amd/display: Fix DP MST sinks removal issue

David Belanger (1):
  drm/amdkfd: Fixed kfd_process cleanup on module exit.

Felix Kuehling (1):
  drm/amdgpu: Don't resume IOMMU after incomplete init

Guchun Chen (1):
  drm/amdgpu: move poll enabled/disable into non DC path

Guilherme G. Piccoli (1):
  drm/amdgpu/vcn: Disable indirect SRAM on Vangogh broken BIOSes

Jane Jian (1):
  drm/amdgpu/vcn: custom video info caps for sriov

Saaem Rizvi (1):
  drm/amd/display: Remove OTG DIV register write for Virtual signals.

Tim Huang (1):
  drm/amd/pm: bump SMU 13.0.4 driver_if header version

Wesley Chalmers (1):
  drm/amd/display: Do not set DRR on pipe Commit

Xiaogang Chen (2):
  drm/amdkfd: Fix BO offset for multi-VMA page migration
  drm/amdkfd: Get prange->offset after svm_range_vram_node_new

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c|   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c|  19 
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h|   3 +-
 drivers/gpu/drm/amd/amdgpu/nv.c|   4 +-
 drivers/gpu/drm/amd/amdgpu/soc21.c | 103 +++--
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  16 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_device.c|  11 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c   |  33 ---
 drivers/gpu/drm/amd/amdkfd/kfd_module.c|   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   |  67 --
 .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c |   4 +-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  |   6 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_hdcp.c |   2 +-
 drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c |   3 -
 drivers/gpu/drm/amd/display/dc/dcn32/dcn32_hwseq.c |   2 +-
 .../gpu/drm/amd/display/dc/dcn32/dcn32_resource.c  |   6 +-
 .../gpu/drm/amd/display/dc/link/link_detection.c   |   8 ++
 .../pm/swsmu/inc/pmfw_if/smu13_driver_if_v13_0_4.h |   4 +-
 drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h   |   2 +-
 .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c|  43 +++--
 23 files changed, 281 insertions(+), 69 deletions(-)


Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Steven Rostedt
On Wed, 15 Mar 2023 11:57:12 -0400
Steven Rostedt  wrote:

So I'm looking at the backtraces.

> The WARN_ON triggered:
> 
> [   21.481449] mpls_gso: MPLS GSO support
> [   21.488795] IPI shorthand broadcast: enabled
> [   21.488873] [ cut here ]
> [   21.490101] [ cut here ]
> 
> [   21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 
> ttm_bo_release+0x2ac/0x2fc  <<< Line of the added WARN_ON()

This happened on CPU 1.

> 
> [   21.492940] refcount_t: underflow; use-after-free.
> [   21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 
> refcount_warn_saturate+0xb6/0xfc

This happened on CPU 0.

> [   21.496116] Modules linked in:
> [   21.497197] Modules linked in:
> [   21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 
> 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [   21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 
> 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [   21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.0-debian-1.16.0-5 04/01/2014
> [   21.503533] sched_clock: Marking stable (20788024762, 
> 714243692)->(22140778105, -638509651)
> [   21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.0-debian-1.16.0-5 04/01/2014
> [   21.504089] Workqueue: ttm ttm_bo_delayed_delete
> [   21.507196] Workqueue: events drm_fb_helper_damage_work
> [   21.509235] 
> [   21.510291] registered taskstats version 1
> [   21.510302] Running ring buffer tests...
> [   21.511792] 
> [   21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
> [   21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
> [   21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 
> c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 
> 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
> [   21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 
> a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b 
> e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
> [   21.517905] EAX: 0026 EBX: c129d150 ECX: 0040 EDX: 0002
> [   21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 0001
> [   21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
> [   21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
> [   21.521546] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
> [   21.526154] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
> [   21.526162] CR0: 80050033 CR2:  CR3: 18506000 CR4: 00150ef0
> [   21.526166] Call Trace:
> [   21.526189]  ? ww_mutex_unlock+0x3a/0x94
> [   21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
> [   21.531722]  ? ttm_bo_cleanup_refs+0xc4/0x1e0
> [   21.533114] Call Trace:
> [   21.534516]  ttm_mem_evict_first+0x3d3/0x568
> [   21.535901]  ttm_bo_delayed_delete+0x9c/0xa4
> [   21.537391]  ? kfree+0x6b/0xdc
> [   21.538901]  process_one_work+0x21a/0x484
> [   21.540279]  ? ttm_range_man_alloc+0xe0/0xec
> [   21.540854]  worker_thread+0x14a/0x39c
> [   21.541714]  ? ttm_range_man_fini_nocheck+0xe8/0xe8
> [   21.543332]  kthread+0xea/0x10c
> [   21.544301]  ttm_bo_mem_space+0x1d0/0x1e4
> [   21.544942]  ? process_one_work+0x484/0x484
> [   21.545887]  ttm_bo_validate+0xc5/0x19c
> [   21.546986]  ? kthread_complete_and_exit+0x1c/0x1c
> [   21.547680]  ttm_bo_init_reserved+0x15e/0x1fc
> [   21.548716]  ret_from_fork+0x1c/0x28
> [   21.549650]  qxl_bo_create+0x145/0x20c

The qxl_bo_create() calls ttm_bo_init_reserved() as the object in question
is about to be freed.

I'm guessing what is happening here, is that an object was to be freed by
the delayed_delete, and in the mean time, something else picked it up.

What's protecting this from not being used again?

-- Steve



Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Steven Rostedt
On Wed, 15 Mar 2023 20:51:49 +0100
Christian König  wrote:

> Steven please try the attached patch.

I applied it, but as it's not always reproducible, I'll have to give it
several runs before I give you my "tested-by" tag.

-- Steve


Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Steven Rostedt
On Wed, 15 Mar 2023 16:25:11 +0100
Christian König  wrote:
> >>
> >> Thanks for the notice,  
> > I'm still getting this on Linus's latest tree.  
> 
> This must be some reference counting issue which only happens in your 
> particular use case. We have tested this quite extensively and couldn't 
> reproduce it so far.

Have you tried 32 bit with my config. I also sent a link to your previous
email that gives access to the VM image I'm using that is triggering this
issue.

Here it is again:

  The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
  and the VM image itself is here: 
https://rostedt.org/vm-images/tracetest-32.qcow2.bz2

> 
> Can you apply this code snippet here and see if you get any warning in 
> the system logs?
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 459f1b4440da..efc390bfd69c 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -314,6 +314,7 @@ static void ttm_bo_delayed_delete(struct work_struct 
> *work)
>      dma_resv_lock(bo->base.resv, NULL);
>      ttm_bo_cleanup_memtype_use(bo);
>      dma_resv_unlock(bo->base.resv);
> +   bo->delayed_delete.func = NULL;
>      ttm_bo_put(bo);
>   }
> 
> @@ -327,6 +328,8 @@ static void ttm_bo_release(struct kref *kref)
>      WARN_ON_ONCE(bo->pin_count);
>      WARN_ON_ONCE(bo->bulk_move);
> 
> +   WARN_ON(bo->delayed_delete.func != NULL);
> +
>      if (!bo->deleted) {
>      ret = ttm_bo_individualize_resv(bo);
>      if (ret) {
> 

The WARN_ON triggered:

[   21.481449] mpls_gso: MPLS GSO support
[   21.488795] IPI shorthand broadcast: enabled
[   21.488873] [ cut here ]
[   21.490101] [ cut here ]

[   21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 
ttm_bo_release+0x2ac/0x2fc  <<< Line of the added WARN_ON()

[   21.492940] refcount_t: underflow; use-after-free.
[   21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 
refcount_warn_saturate+0xb6/0xfc
[   21.496116] Modules linked in:
[   21.497197] Modules linked in:
[   21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   21.503533] sched_clock: Marking stable (20788024762, 
714243692)->(22140778105, -638509651)
[   21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   21.504089] Workqueue: ttm ttm_bo_delayed_delete
[   21.507196] Workqueue: events drm_fb_helper_damage_work
[   21.509235] 
[   21.510291] registered taskstats version 1
[   21.510302] Running ring buffer tests...
[   21.511792] 
[   21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
[   21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
[   21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 
c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 
c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
[   21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 
a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 
92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
[   21.517905] EAX: 0026 EBX: c129d150 ECX: 0040 EDX: 0002
[   21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 0001
[   21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
[   21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
[   21.521546] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   21.526154] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   21.526162] CR0: 80050033 CR2:  CR3: 18506000 CR4: 00150ef0
[   21.526166] Call Trace:
[   21.526189]  ? ww_mutex_unlock+0x3a/0x94
[   21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
[   21.531722]  ? ttm_bo_cleanup_refs+0xc4/0x1e0
[   21.533114] Call Trace:
[   21.534516]  ttm_mem_evict_first+0x3d3/0x568
[   21.535901]  ttm_bo_delayed_delete+0x9c/0xa4
[   21.537391]  ? kfree+0x6b/0xdc
[   21.538901]  process_one_work+0x21a/0x484
[   21.540279]  ? ttm_range_man_alloc+0xe0/0xec
[   21.540854]  worker_thread+0x14a/0x39c
[   21.541714]  ? ttm_range_man_fini_nocheck+0xe8/0xe8
[   21.543332]  kthread+0xea/0x10c
[   21.544301]  ttm_bo_mem_space+0x1d0/0x1e4
[   21.544942]  ? process_one_work+0x484/0x484
[   21.545887]  ttm_bo_validate+0xc5/0x19c
[   21.546986]  ? kthread_complete_and_exit+0x1c/0x1c
[   21.547680]  ttm_bo_init_reserved+0x15e/0x1fc
[   21.548716]  ret_from_fork+0x1c/0x28
[   21.549650]  qxl_bo_create+0x145/0x20c

Note, this is all on boot up before user space is running.

-- Steve


Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Steven Rostedt
On Wed, 15 Mar 2023 11:57:12 -0400
Steven Rostedt  wrote:

> The WARN_ON triggered:
> 
> [   21.481449] mpls_gso: MPLS GSO support
> [   21.488795] IPI shorthand broadcast: enabled
> [   21.488873] [ cut here ]
> [   21.490101] [ cut here ]
> 
> [   21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 
> ttm_bo_release+0x2ac/0x2fc  <<< Line of the added WARN_ON()
> 
> [   21.492940] refcount_t: underflow; use-after-free.
> [   21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 
> refcount_warn_saturate+0xb6/0xfc
> [   21.496116] Modules linked in:
> [   21.497197] Modules linked in:
> [   21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 
> 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [   21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 
> 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [   21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.0-debian-1.16.0-5 04/01/2014
> [   21.503533] sched_clock: Marking stable (20788024762, 
> 714243692)->(22140778105, -638509651)
> [   21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.0-debian-1.16.0-5 04/01/2014
> [   21.504089] Workqueue: ttm ttm_bo_delayed_delete
> [   21.507196] Workqueue: events drm_fb_helper_damage_work
> [   21.509235] 
> [   21.510291] registered taskstats version 1
> [   21.510302] Running ring buffer tests...
> [   21.511792] 
> [   21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
> [   21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
> [   21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 
> c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 
> 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
> [   21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 
> a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b 
> e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
> [   21.517905] EAX: 0026 EBX: c129d150 ECX: 0040 EDX: 0002
> [   21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 0001
> [   21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
> [   21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
> [   21.521546] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
> [   21.526154] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
> [   21.526162] CR0: 80050033 CR2:  CR3: 18506000 CR4: 00150ef0
> [   21.526166] Call Trace:
> [   21.526189]  ? ww_mutex_unlock+0x3a/0x94
> [   21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
> [   21.531722]  ? ttm_bo_cleanup_refs+0xc4/0x1e0
> [   21.533114] Call Trace:
> [   21.534516]  ttm_mem_evict_first+0x3d3/0x568
> [   21.535901]  ttm_bo_delayed_delete+0x9c/0xa4
> [   21.537391]  ? kfree+0x6b/0xdc
> [   21.538901]  process_one_work+0x21a/0x484
> [   21.540279]  ? ttm_range_man_alloc+0xe0/0xec
> [   21.540854]  worker_thread+0x14a/0x39c
> [   21.541714]  ? ttm_range_man_fini_nocheck+0xe8/0xe8
> [   21.543332]  kthread+0xea/0x10c

So I triggered it again, and the same backtrace is there.

> [   21.544301]  ttm_bo_mem_space+0x1d0/0x1e4

It looks like the object is being reserved before it's fully removed. And
it's somewhere in this tty_bo_mem_space() (which comes from the
qxl_bo_create()).

I don't know this code at all, nor do I have any idea of what it's trying
to do. All I know is that this is triggering often (not always), and it has
to do with some race.

Now my config has lots of debugging enabled, which slows down the system
quite a bit. This also happens to open up race windows. Just because your
testing doesn't trigger it, doesn't mean that the race doesn't exist. It's
just likely to be very hard to hit.

> [   21.544942]  ? process_one_work+0x484/0x484
> [   21.545887]  ttm_bo_validate+0xc5/0x19c
> [   21.546986]  ? kthread_complete_and_exit+0x1c/0x1c
> [   21.547680]  ttm_bo_init_reserved+0x15e/0x1fc
> [   21.548716]  ret_from_fork+0x1c/0x28
> [   21.549650]  qxl_bo_create+0x145/0x20c

Here's the latest backtrace:

[  170.817449] [ cut here ]
[  170.817455] [ cut here ]
[  170.818210] refcount_t: underflow; use-after-free.
[  170.818228] WARNING: CPU: 0 PID: 267 at lib/refcount.c:28 
refcount_warn_saturate+0xb6/0xfc
[  170.819352] WARNING: CPU: 3 PID: 2382 at drivers/gpu/drm/ttm/ttm_bo.c:332 
ttm_bo_release+0x278/0x2c8
[  170.820124] Modules linked in:
[  170.822127] Modules linked in:
[  170.823829] 
[  170.823832] CPU: 0 PID: 267 Comm: kworker/0:10H Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
[  170.824610] 
[  170.825121] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[  170.825124] Workqueue: ttm ttm_bo_delayed_delete
[  170.825498] CPU: 3 PID: 2382 Comm: kworker/3:3 Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
[  170.826996] 
[  

Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Christian König



Am 15.03.23 um 20:15 schrieb Matthew Auld:

On Wed, 15 Mar 2023 at 18:41, Christian König
 wrote:

Am 08.03.23 um 13:43 schrieb Steven Rostedt:

On Wed, 8 Mar 2023 07:17:38 +0100
Christian König  wrote:


What test case/environment do you run to trigger this?

I'm running a 32bit x86 qemu instance. Attached is the config.

The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
and the VM image itself is here: 
https://rostedt.org/vm-images/tracetest-32.qcow2.bz2

I've started to download that, but it will take about an hour. So I
tried to avoid that for now.

But looks like there isn't any other way to reproduce this, the code
seems to work with both amdgpu and radeon.

My suspicion is that we just have a reference count issue in qxl or ttm
which was never noticed because it didn't caused any problems (except
for a minor memory corruption).

Why does ttm_bo_cleanup_refs() do a bo_put() at the end?


Yeah, that's the bug. I own you a beer!

In the old model we had an extra reference while the BOs where on the 
deleted list and I've forgot to remove this put here.


Steven please try the attached patch.

Thanks,
Christian.



  It doesn't
make sense to me. Say if the BO is in the process of being delay freed
(bo->deleted = true), and we just did the kref_init() in
ttm_bo_release(), it might drop that ref hitting ttm_bo_release() yet
again, this time doing the actual bo->destroy(), which frees the
object. The worker then fires at some later point calling
ttm_bo_delayed_delete(), but the BO has already been freed.


Now you get a rain of warnings because we try to grab the lock in the
delete worker.

Christian.


It happened again in another test (it's not 100% reproducible).

[   23.234838] [ cut here ]
[   23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[   23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 
__ww_mutex_lock.constprop.0+0x566/0xfec
[   23.240990] Modules linked in:
[   23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 
6.3.0-rc1-test-1-ga98bd42762ed-dirty #972
[   23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   23.247900] Workqueue: ttm ttm_bo_delayed_delete
[   23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
[   23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 
0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 
ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
[   23.256901] EAX: 0028 EBX:  ECX: c1847dd8 EDX: 0002
[   23.258849] ESI:  EDI: c12958bc EBP: c1847f00 ESP: c1847eac
[   23.260786] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
[   23.264781] Call Trace:
[   23.265899]  ? lock_is_held_type+0xbe/0x10c
[   23.267434]  ? ttm_bo_delayed_delete+0x30/0x94
[   23.268971]  ww_mutex_lock+0x32/0x94
[   23.270327]  ttm_bo_delayed_delete+0x30/0x94
[   23.271818]  process_one_work+0x21a/0x538
[   23.273242]  worker_thread+0x146/0x398
[   23.274616]  kthread+0xea/0x10c
[   23.275859]  ? process_one_work+0x538/0x538
[   23.277312]  ? kthread_complete_and_exit+0x1c/0x1c
[   23.278899]  ret_from_fork+0x1c/0x28
[   23.280223] irq event stamp: 33
[   23.281440] hardirqs last  enabled at (33): [] 
_raw_spin_unlock_irqrestore+0x2d/0x58
[   23.283860] hardirqs last disabled at (32): [] 
kvfree_call_rcu+0x155/0x2ec
[   23.286066] softirqs last  enabled at (0): [] 
copy_process+0x989/0x2368
[   23.288220] softirqs last disabled at (0): [<>] 0x0
[   23.289952] ---[ end trace  ]---
[   23.291501] [ cut here ]
[   23.293027] refcount_t: underflow; use-after-free.
[   23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 
refcount_warn_saturate+0xb6/0xfc
[   23.296959] Modules linked in:
[   23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: GW  
6.3.0-rc1-test-1-ga98bd42762ed-dirty #972
[   23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   23.303642] Workqueue: ttm ttm_bo_delayed_delete
[   23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
[   23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 
c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 
80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
[   23.311935] EAX: 0026 EBX: c1295950 ECX: c1847e40 EDX: 0002
[   23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
[   23.315840] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010246
[   23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
[   23.319859] Call Trace:
[   23.320978]  ttm_bo_delayed_delete+0x8c/0x94
[   23.322492]  process_one_work+0x21a/0x538
[   23.323959]  worker_thread+0x146/0x398
[   23.325353]  kthread+0xea/0x10c
[   23.326609]  ? 

Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Matthew Auld
On Wed, 15 Mar 2023 at 18:41, Christian König
 wrote:
>
> Am 08.03.23 um 13:43 schrieb Steven Rostedt:
> > On Wed, 8 Mar 2023 07:17:38 +0100
> > Christian König  wrote:
> >
> >> What test case/environment do you run to trigger this?
> > I'm running a 32bit x86 qemu instance. Attached is the config.
> >
> > The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
> > and the VM image itself is here: 
> > https://rostedt.org/vm-images/tracetest-32.qcow2.bz2
>
> I've started to download that, but it will take about an hour. So I
> tried to avoid that for now.
>
> But looks like there isn't any other way to reproduce this, the code
> seems to work with both amdgpu and radeon.
>
> My suspicion is that we just have a reference count issue in qxl or ttm
> which was never noticed because it didn't caused any problems (except
> for a minor memory corruption).

Why does ttm_bo_cleanup_refs() do a bo_put() at the end? It doesn't
make sense to me. Say if the BO is in the process of being delay freed
(bo->deleted = true), and we just did the kref_init() in
ttm_bo_release(), it might drop that ref hitting ttm_bo_release() yet
again, this time doing the actual bo->destroy(), which frees the
object. The worker then fires at some later point calling
ttm_bo_delayed_delete(), but the BO has already been freed.

>
> Now you get a rain of warnings because we try to grab the lock in the
> delete worker.
>
> Christian.
>
> >
> > It happened again in another test (it's not 100% reproducible).
> >
> > [   23.234838] [ cut here ]
> > [   23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > [   23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 
> > __ww_mutex_lock.constprop.0+0x566/0xfec
> > [   23.240990] Modules linked in:
> > [   23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 
> > 6.3.0-rc1-test-1-ga98bd42762ed-dirty #972
> > [   23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.16.0-debian-1.16.0-5 04/01/2014
> > [   23.247900] Workqueue: ttm ttm_bo_delayed_delete
> > [   23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
> > [   23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b 
> > c8 85 c9 0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff 
> > <0f> 0b 58 5a e9 ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
> > [   23.256901] EAX: 0028 EBX:  ECX: c1847dd8 EDX: 0002
> > [   23.258849] ESI:  EDI: c12958bc EBP: c1847f00 ESP: c1847eac
> > [   23.260786] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
> > [   23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [   23.264781] Call Trace:
> > [   23.265899]  ? lock_is_held_type+0xbe/0x10c
> > [   23.267434]  ? ttm_bo_delayed_delete+0x30/0x94
> > [   23.268971]  ww_mutex_lock+0x32/0x94
> > [   23.270327]  ttm_bo_delayed_delete+0x30/0x94
> > [   23.271818]  process_one_work+0x21a/0x538
> > [   23.273242]  worker_thread+0x146/0x398
> > [   23.274616]  kthread+0xea/0x10c
> > [   23.275859]  ? process_one_work+0x538/0x538
> > [   23.277312]  ? kthread_complete_and_exit+0x1c/0x1c
> > [   23.278899]  ret_from_fork+0x1c/0x28
> > [   23.280223] irq event stamp: 33
> > [   23.281440] hardirqs last  enabled at (33): [] 
> > _raw_spin_unlock_irqrestore+0x2d/0x58
> > [   23.283860] hardirqs last disabled at (32): [] 
> > kvfree_call_rcu+0x155/0x2ec
> > [   23.286066] softirqs last  enabled at (0): [] 
> > copy_process+0x989/0x2368
> > [   23.288220] softirqs last disabled at (0): [<>] 0x0
> > [   23.289952] ---[ end trace  ]---
> > [   23.291501] [ cut here ]
> > [   23.293027] refcount_t: underflow; use-after-free.
> > [   23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 
> > refcount_warn_saturate+0xb6/0xfc
> > [   23.296959] Modules linked in:
> > [   23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: GW
> >   6.3.0-rc1-test-1-ga98bd42762ed-dirty #972
> > [   23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.16.0-debian-1.16.0-5 04/01/2014
> > [   23.303642] Workqueue: ttm ttm_bo_delayed_delete
> > [   23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
> > [   23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 
> > 8a 78 38 c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff 
> > <0f> 0b 59 c9 c3 80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
> > [   23.311935] EAX: 0026 EBX: c1295950 ECX: c1847e40 EDX: 0002
> > [   23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
> > [   23.315840] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010246
> > [   23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [   23.319859] Call Trace:
> > [   23.320978]  ttm_bo_delayed_delete+0x8c/0x94
> > [   23.322492]  process_one_work+0x21a/0x538
> > [   23.323959]  worker_thread+0x146/0x398
> > [   23.325353]  kthread+0xea/0x10c
> > [  

Re: [PATCH] drm/amdgpu: drop the extra sign extension

2023-03-15 Thread Christian König

Am 15.03.23 um 18:53 schrieb Alex Deucher:

amdgpu_bo_gpu_offset_no_check() already calls
amdgpu_gmc_sign_extend() so no need to call it twice.

Signed-off-by: Alex Deucher 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 69e105fa41f6..ce2afd7e775b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -152,7 +152,7 @@ static void amdgpu_vm_sdma_copy_ptes(struct 
amdgpu_vm_update_params *p,
  
  	src += p->num_dw_left * 4;
  
-	pe += amdgpu_gmc_sign_extend(amdgpu_bo_gpu_offset_no_check(bo));

+   pe += amdgpu_bo_gpu_offset_no_check(bo);
trace_amdgpu_vm_copy_ptes(pe, src, count, p->immediate);
  
  	amdgpu_vm_copy_pte(p->adev, ib, pe, src, count);

@@ -179,7 +179,7 @@ static void amdgpu_vm_sdma_set_ptes(struct 
amdgpu_vm_update_params *p,
  {
struct amdgpu_ib *ib = p->job->ibs;
  
-	pe += amdgpu_gmc_sign_extend(amdgpu_bo_gpu_offset_no_check(bo));

+   pe += amdgpu_bo_gpu_offset_no_check(bo);
trace_amdgpu_vm_set_ptes(pe, addr, count, incr, flags, p->immediate);
if (count < 3) {
amdgpu_vm_write_pte(p->adev, ib, pe, addr | flags,




Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Christian König

Am 08.03.23 um 13:43 schrieb Steven Rostedt:

On Wed, 8 Mar 2023 07:17:38 +0100
Christian König  wrote:


What test case/environment do you run to trigger this?

I'm running a 32bit x86 qemu instance. Attached is the config.

The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
and the VM image itself is here: 
https://rostedt.org/vm-images/tracetest-32.qcow2.bz2


I've started to download that, but it will take about an hour. So I 
tried to avoid that for now.


But looks like there isn't any other way to reproduce this, the code 
seems to work with both amdgpu and radeon.


My suspicion is that we just have a reference count issue in qxl or ttm 
which was never noticed because it didn't caused any problems (except 
for a minor memory corruption).


Now you get a rain of warnings because we try to grab the lock in the 
delete worker.


Christian.



It happened again in another test (it's not 100% reproducible).

[   23.234838] [ cut here ]
[   23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[   23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 
__ww_mutex_lock.constprop.0+0x566/0xfec
[   23.240990] Modules linked in:
[   23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 
6.3.0-rc1-test-1-ga98bd42762ed-dirty #972
[   23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   23.247900] Workqueue: ttm ttm_bo_delayed_delete
[   23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
[   23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 
0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 
ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
[   23.256901] EAX: 0028 EBX:  ECX: c1847dd8 EDX: 0002
[   23.258849] ESI:  EDI: c12958bc EBP: c1847f00 ESP: c1847eac
[   23.260786] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
[   23.264781] Call Trace:
[   23.265899]  ? lock_is_held_type+0xbe/0x10c
[   23.267434]  ? ttm_bo_delayed_delete+0x30/0x94
[   23.268971]  ww_mutex_lock+0x32/0x94
[   23.270327]  ttm_bo_delayed_delete+0x30/0x94
[   23.271818]  process_one_work+0x21a/0x538
[   23.273242]  worker_thread+0x146/0x398
[   23.274616]  kthread+0xea/0x10c
[   23.275859]  ? process_one_work+0x538/0x538
[   23.277312]  ? kthread_complete_and_exit+0x1c/0x1c
[   23.278899]  ret_from_fork+0x1c/0x28
[   23.280223] irq event stamp: 33
[   23.281440] hardirqs last  enabled at (33): [] 
_raw_spin_unlock_irqrestore+0x2d/0x58
[   23.283860] hardirqs last disabled at (32): [] 
kvfree_call_rcu+0x155/0x2ec
[   23.286066] softirqs last  enabled at (0): [] 
copy_process+0x989/0x2368
[   23.288220] softirqs last disabled at (0): [<>] 0x0
[   23.289952] ---[ end trace  ]---
[   23.291501] [ cut here ]
[   23.293027] refcount_t: underflow; use-after-free.
[   23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 
refcount_warn_saturate+0xb6/0xfc
[   23.296959] Modules linked in:
[   23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: GW  
6.3.0-rc1-test-1-ga98bd42762ed-dirty #972
[   23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   23.303642] Workqueue: ttm ttm_bo_delayed_delete
[   23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
[   23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 
c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 
80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
[   23.311935] EAX: 0026 EBX: c1295950 ECX: c1847e40 EDX: 0002
[   23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
[   23.315840] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010246
[   23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
[   23.319859] Call Trace:
[   23.320978]  ttm_bo_delayed_delete+0x8c/0x94
[   23.322492]  process_one_work+0x21a/0x538
[   23.323959]  worker_thread+0x146/0x398
[   23.325353]  kthread+0xea/0x10c
[   23.326609]  ? process_one_work+0x538/0x538
[   23.328081]  ? kthread_complete_and_exit+0x1c/0x1c
[   23.329683]  ret_from_fork+0x1c/0x28
[   23.331011] irq event stamp: 33
[   23.332251] hardirqs last  enabled at (33): [] 
_raw_spin_unlock_irqrestore+0x2d/0x58
[   23.334334] hardirqs last disabled at (32): [] 
kvfree_call_rcu+0x155/0x2ec
[   23.336176] softirqs last  enabled at (0): [] 
copy_process+0x989/0x2368
[   23.337904] softirqs last disabled at (0): [<>] 0x0
[   23.339313] ---[ end trace  ]---

-- Steve




Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Christian König

Am 15.03.23 um 18:31 schrieb Steven Rostedt:

On Wed, 15 Mar 2023 11:57:12 -0400
Steven Rostedt  wrote:

So I'm looking at the backtraces.


The WARN_ON triggered:

[   21.481449] mpls_gso: MPLS GSO support
[   21.488795] IPI shorthand broadcast: enabled
[   21.488873] [ cut here ]
[   21.490101] [ cut here ]

[   21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 
ttm_bo_release+0x2ac/0x2fc  <<< Line of the added WARN_ON()

This happened on CPU 1.


[   21.492940] refcount_t: underflow; use-after-free.
[   21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 
refcount_warn_saturate+0xb6/0xfc

This happened on CPU 0.


[   21.496116] Modules linked in:
[   21.497197] Modules linked in:
[   21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   21.503533] sched_clock: Marking stable (20788024762, 
714243692)->(22140778105, -638509651)
[   21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   21.504089] Workqueue: ttm ttm_bo_delayed_delete
[   21.507196] Workqueue: events drm_fb_helper_damage_work
[   21.509235]
[   21.510291] registered taskstats version 1
[   21.510302] Running ring buffer tests...
[   21.511792]
[   21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
[   21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
[   21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 
d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 
80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
[   21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd 
ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd 
ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
[   21.517905] EAX: 0026 EBX: c129d150 ECX: 0040 EDX: 0002
[   21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 0001
[   21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
[   21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
[   21.521546] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   21.526154] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   21.526162] CR0: 80050033 CR2:  CR3: 18506000 CR4: 00150ef0
[   21.526166] Call Trace:
[   21.526189]  ? ww_mutex_unlock+0x3a/0x94
[   21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
[   21.531722]  ? ttm_bo_cleanup_refs+0xc4/0x1e0
[   21.533114] Call Trace:
[   21.534516]  ttm_mem_evict_first+0x3d3/0x568
[   21.535901]  ttm_bo_delayed_delete+0x9c/0xa4
[   21.537391]  ? kfree+0x6b/0xdc
[   21.538901]  process_one_work+0x21a/0x484
[   21.540279]  ? ttm_range_man_alloc+0xe0/0xec
[   21.540854]  worker_thread+0x14a/0x39c
[   21.541714]  ? ttm_range_man_fini_nocheck+0xe8/0xe8
[   21.543332]  kthread+0xea/0x10c
[   21.544301]  ttm_bo_mem_space+0x1d0/0x1e4
[   21.544942]  ? process_one_work+0x484/0x484
[   21.545887]  ttm_bo_validate+0xc5/0x19c
[   21.546986]  ? kthread_complete_and_exit+0x1c/0x1c
[   21.547680]  ttm_bo_init_reserved+0x15e/0x1fc
[   21.548716]  ret_from_fork+0x1c/0x28
[   21.549650]  qxl_bo_create+0x145/0x20c

The qxl_bo_create() calls ttm_bo_init_reserved() as the object in question
is about to be freed.

I'm guessing what is happening here, is that an object was to be freed by
the delayed_delete, and in the mean time, something else picked it up.

What's protecting this from not being used again?


The reference count. This is pretty clearly an unbalanced reference 
counting issue.


It's just that previously you wouldn't notice it much because we were 
just silently removing the BO from the LRU list without checking if it 
was already removed (and so just damaging a bit of memory).


While now we get tons of errors because the delayed worker actually runs 
no matter if the BO is already freed or not.


Christian.



-- Steve





Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Christian König

Am 15.03.23 um 18:54 schrieb Steven Rostedt:

On Wed, 15 Mar 2023 11:57:12 -0400
Steven Rostedt  wrote:


The WARN_ON triggered:

[   21.481449] mpls_gso: MPLS GSO support
[   21.488795] IPI shorthand broadcast: enabled
[   21.488873] [ cut here ]
[   21.490101] [ cut here ]

[   21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 
ttm_bo_release+0x2ac/0x2fc  <<< Line of the added WARN_ON()

[   21.492940] refcount_t: underflow; use-after-free.
[   21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 
refcount_warn_saturate+0xb6/0xfc
[   21.496116] Modules linked in:
[   21.497197] Modules linked in:


The problem here is that two backtraces mix together. So it's pretty 
hard to figure out what's going on.




[   21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   21.503533] sched_clock: Marking stable (20788024762, 
714243692)->(22140778105, -638509651)
[   21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[   21.504089] Workqueue: ttm ttm_bo_delayed_delete
[   21.507196] Workqueue: events drm_fb_helper_damage_work
[   21.509235]
[   21.510291] registered taskstats version 1
[   21.510302] Running ring buffer tests...
[   21.511792]
[   21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
[   21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
[   21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 
d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 
80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
[   21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd 
ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd 
ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
[   21.517905] EAX: 0026 EBX: c129d150 ECX: 0040 EDX: 0002
[   21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 0001
[   21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
[   21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
[   21.521546] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   21.526154] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010286
[   21.526162] CR0: 80050033 CR2:  CR3: 18506000 CR4: 00150ef0
[   21.526166] Call Trace:
[   21.526189]  ? ww_mutex_unlock+0x3a/0x94
[   21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
[   21.531722]  ? ttm_bo_cleanup_refs+0xc4/0x1e0
[   21.533114] Call Trace:
[   21.534516]  ttm_mem_evict_first+0x3d3/0x568
[   21.535901]  ttm_bo_delayed_delete+0x9c/0xa4
[   21.537391]  ? kfree+0x6b/0xdc
[   21.538901]  process_one_work+0x21a/0x484
[   21.540279]  ? ttm_range_man_alloc+0xe0/0xec
[   21.540854]  worker_thread+0x14a/0x39c
[   21.541714]  ? ttm_range_man_fini_nocheck+0xe8/0xe8
[   21.543332]  kthread+0xea/0x10c

So I triggered it again, and the same backtrace is there.


[   21.544301]  ttm_bo_mem_space+0x1d0/0x1e4

It looks like the object is being reserved before it's fully removed. And
it's somewhere in this tty_bo_mem_space() (which comes from the
qxl_bo_create()).

I don't know this code at all, nor do I have any idea of what it's trying
to do. All I know is that this is triggering often (not always), and it has
to do with some race.

Now my config has lots of debugging enabled, which slows down the system
quite a bit. This also happens to open up race windows. Just because your
testing doesn't trigger it, doesn't mean that the race doesn't exist. It's
just likely to be very hard to hit.


[   21.544942]  ? process_one_work+0x484/0x484
[   21.545887]  ttm_bo_validate+0xc5/0x19c
[   21.546986]  ? kthread_complete_and_exit+0x1c/0x1c
[   21.547680]  ttm_bo_init_reserved+0x15e/0x1fc
[   21.548716]  ret_from_fork+0x1c/0x28
[   21.549650]  qxl_bo_create+0x145/0x20c

Here's the latest backtrace:

[  170.817449] [ cut here ]
[  170.817455] [ cut here ]
[  170.818210] refcount_t: underflow; use-after-free.
[  170.818228] WARNING: CPU: 0 PID: 267 at lib/refcount.c:28 
refcount_warn_saturate+0xb6/0xfc
[  170.819352] WARNING: CPU: 3 PID: 2382 at drivers/gpu/drm/ttm/ttm_bo.c:332 
ttm_bo_release+0x278/0x2c8
[  170.820124] Modules linked in:
[  170.822127] Modules linked in:
[  170.823829]
[  170.823832] CPU: 0 PID: 267 Comm: kworker/0:10H Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
[  170.824610]
[  170.825121] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[  170.825124] Workqueue: ttm ttm_bo_delayed_delete
[  170.825498] CPU: 3 PID: 2382 Comm: kworker/3:3 Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty 

[PATCH] drm/amdgpu: drop the extra sign extension

2023-03-15 Thread Alex Deucher
amdgpu_bo_gpu_offset_no_check() already calls
amdgpu_gmc_sign_extend() so no need to call it twice.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 69e105fa41f6..ce2afd7e775b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -152,7 +152,7 @@ static void amdgpu_vm_sdma_copy_ptes(struct 
amdgpu_vm_update_params *p,
 
src += p->num_dw_left * 4;
 
-   pe += amdgpu_gmc_sign_extend(amdgpu_bo_gpu_offset_no_check(bo));
+   pe += amdgpu_bo_gpu_offset_no_check(bo);
trace_amdgpu_vm_copy_ptes(pe, src, count, p->immediate);
 
amdgpu_vm_copy_pte(p->adev, ib, pe, src, count);
@@ -179,7 +179,7 @@ static void amdgpu_vm_sdma_set_ptes(struct 
amdgpu_vm_update_params *p,
 {
struct amdgpu_ib *ib = p->job->ibs;
 
-   pe += amdgpu_gmc_sign_extend(amdgpu_bo_gpu_offset_no_check(bo));
+   pe += amdgpu_bo_gpu_offset_no_check(bo);
trace_amdgpu_vm_set_ptes(pe, addr, count, incr, flags, p->immediate);
if (count < 3) {
amdgpu_vm_write_pte(p->adev, ib, pe, addr | flags,
-- 
2.39.2



[linux-next:master] BUILD SUCCESS WITH WARNING 225b6b81afe63b3850b7cee0a3590f51144f2a75

2023-03-15 Thread kernel test robot
tree/branch: 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: 225b6b81afe63b3850b7cee0a3590f51144f2a75  Add linux-next specific 
files for 20230315

Warning reports:

https://lore.kernel.org/oe-kbuild-all/202303081807.lblwkmpx-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202303151409.por0sbf7-...@intel.com

Warning: (recently discovered and may have been fixed)

drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_validation.c:258:10: 
warning: no previous prototype for 'link_timing_bandwidth_kbps' 
[-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:2184:
 warning: expecting prototype for Check if there is a native DP or passive 
DP(). Prototype was for dp_is_sink_present() instead
lib/dynamic_debug.c:947:6: warning: no previous prototype for function 
'__dynamic_ibdev_dbg' [-Wmissing-prototypes]

Unverified Warning (likely false positive, please contact us if interested):

crypto/akcipher.c:135:32: warning: Value stored to 'istat' during its 
initialization is never read [clang-analyzer-deadcode.DeadStores]

Warning ids grouped by kconfigs:

gcc_recent_errors
|-- alpha-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- alpha-randconfig-r024-20230312
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:no-previous-prototype-for-link_timing_bandwidth_kbps
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- arc-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- arm-allmodconfig
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:no-previous-prototype-for-link_timing_bandwidth_kbps
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- arm-allyesconfig
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:no-previous-prototype-for-link_timing_bandwidth_kbps
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- arm64-allyesconfig
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:no-previous-prototype-for-link_timing_bandwidth_kbps
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- i386-allyesconfig
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:no-previous-prototype-for-link_timing_bandwidth_kbps
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- ia64-allmodconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- loongarch-allmodconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- loongarch-defconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- mips-allmodconfig
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:no-previous-prototype-for-link_timing_bandwidth_kbps
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- mips-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype-was-for-dp_is_sink_present()-inste
|-- powerpc-allmodconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:expecting-prototype-for-Check-if-there-is-a-native-DP-or-passive-DP().-Prototype

Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

2023-03-15 Thread Christian König

Am 15.03.23 um 16:09 schrieb Steven Rostedt:

On Wed, 8 Mar 2023 07:17:38 +0100
Christian König  wrote:


Am 08.03.23 um 03:26 schrieb Steven Rostedt:

On Tue, 7 Mar 2023 21:22:23 -0500
Steven Rostedt  wrote:
  

Looks like there was a lock possibly used after free. But as commit
9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
workers") changed a lot of this code, I figured it may be the culprit.

If I bothered to look at the second warning after this one (I usually stop
after the first), it appears to state there was a use after free issue.

Yeah, that looks like the reference count was somehow messed up.

What test case/environment do you run to trigger this?

Thanks for the notice,

I'm still getting this on Linus's latest tree.


This must be some reference counting issue which only happens in your 
particular use case. We have tested this quite extensively and couldn't 
reproduce it so far.


Can you apply this code snippet here and see if you get any warning in 
the system logs?


diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 459f1b4440da..efc390bfd69c 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -314,6 +314,7 @@ static void ttm_bo_delayed_delete(struct work_struct 
*work)

    dma_resv_lock(bo->base.resv, NULL);
    ttm_bo_cleanup_memtype_use(bo);
    dma_resv_unlock(bo->base.resv);
+   bo->delayed_delete.func = NULL;
    ttm_bo_put(bo);
 }

@@ -327,6 +328,8 @@ static void ttm_bo_release(struct kref *kref)
    WARN_ON_ONCE(bo->pin_count);
    WARN_ON_ONCE(bo->bulk_move);

+   WARN_ON(bo->delayed_delete.func != NULL);
+
    if (!bo->deleted) {
    ret = ttm_bo_individualize_resv(bo);
    if (ret) {


Thanks,
Christian.



[  230.530222] [ cut here ]
[  230.569795] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[  230.569957] WARNING: CPU: 0 PID: 212 at kernel/locking/mutex.c:582 
__ww_mutex_lock.constprop.0+0x62a/0x1300
[  230.612599] Modules linked in:
[  230.632144] CPU: 0 PID: 212 Comm: kworker/0:8H Not tainted 
6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #992
[  230.654939] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-debian-1.16.0-5 04/01/2014
[  230.678866] Workqueue: ttm ttm_bo_delayed_delete
[  230.699452] EIP: __ww_mutex_lock.constprop.0+0x62a/0x1300
[  230.720582] Code: e8 3b 9a 95 ff 85 c0 0f 84 61 fa ff ff 8b 0d 58 bc 3a c4 85 c9 
0f 85 53 fa ff ff 68 54 98 06 c4 68 b7 b6 04 c4 e8 46 af 40 ff <0f> 0b 58 5a e9 
3b fa ff ff 8d 74 26 00 90 a1 ec 47 b0 c4 85 c0 75
[  230.768336] EAX: 0028 EBX:  ECX: c51abdd8 EDX: 0002
[  230.792001] ESI:  EDI: c53856bc EBP: c51abf00 ESP: c51abeac
[  230.815944] DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068 EFLAGS: 00010246
[  230.840033] CR0: 80050033 CR2: ff9ff000 CR3: 04506000 CR4: 00150ef0
[  230.864059] Call Trace:
[  230.886369]  ? ttm_bo_delayed_delete+0x30/0x94
[  230.909902]  ww_mutex_lock+0x32/0x94
[  230.932550]  ttm_bo_delayed_delete+0x30/0x94
[  230.955798]  process_one_work+0x21a/0x484
[  230.979335]  worker_thread+0x14a/0x39c
[  231.002258]  kthread+0xea/0x10c
[  231.024769]  ? process_one_work+0x484/0x484
[  231.047870]  ? kthread_complete_and_exit+0x1c/0x1c
[  231.071498]  ret_from_fork+0x1c/0x28
[  231.094701] irq event stamp: 4023
[  231.117272] hardirqs last  enabled at (4023): [] 
_raw_spin_unlock_irqrestore+0x2d/0x58
[  231.143217] hardirqs last disabled at (4022): [] 
kvfree_call_rcu+0x155/0x2ec
[  231.166058] softirqs last  enabled at (3460): [] 
__do_softirq+0x2c3/0x3bb
[  231.183104] softirqs last disabled at (3455): [] 
call_on_stack+0x45/0x4c
[  231.200336] ---[ end trace  ]---
[  231.216572] [ cut here ]


This is preventing me from adding any of my own patches on v6.3-rcX due to
this bug failing my tests. Which means I can't add anything to linux-next
until this is fixed!

-- Steve




Re: [RFC PATCH 5/5] xen/privcmd: add IOCTL_PRIVCMD_GSI_FROM_IRQ

2023-03-15 Thread Roger Pau Monné
On Sun, Mar 12, 2023 at 08:01:57PM +0800, Huang Rui wrote:
> From: Chen Jiqian 
> 
> When hypervisor get an interrupt, it needs interrupt's
> gsi number instead of irq number. Gsi number is unique
> in xen, but irq number is only unique in one domain.
> So, we need to record the relationship between irq and
> gsi when dom0 initialized pci devices, and provide syscall
> IOCTL_PRIVCMD_GSI_FROM_IRQ to translate irq to gsi. So
> that, we can map pirq successfully in hypervisor side.

GSI is not only unique in Xen, it's an ACPI provided value that's
unique in the platform.  The text above make it look like GSI is some
kind of internal Xen reference to an interrupt, but it's not.

How does a PV domain deal with this? I would assume there Linux will
also end up with IRQ != GSI, and hence will need some kind of
translation?

Thanks, Roger.


Re: [PATCH 1/2] drm/amdgpu: reposition the gpu reset checking for reuse

2023-03-15 Thread Alex Deucher
On Wed, Mar 15, 2023 at 7:05 AM Tim Huang  wrote:
>
> Move the amdgpu_acpi_should_gpu_reset out of
> CONFIG_SUSPEND to share it with hibernate case.
>
> Signed-off-by: Tim Huang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h  |  4 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 40 +---
>  2 files changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 5c6132502f35..5bddc03332b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1392,10 +1392,12 @@ int amdgpu_acpi_smart_shift_update(struct drm_device 
> *dev, enum amdgpu_ss ss_sta
>  int amdgpu_acpi_pcie_notify_device_ready(struct amdgpu_device *adev);
>
>  void amdgpu_acpi_get_backlight_caps(struct amdgpu_dm_backlight_caps *caps);
> +bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev);
>  void amdgpu_acpi_detect(void);
>  #else
>  static inline int amdgpu_acpi_init(struct amdgpu_device *adev) { return 0; }
>  static inline void amdgpu_acpi_fini(struct amdgpu_device *adev) { }
> +static inline bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) 
> { return false; }
>  static inline void amdgpu_acpi_detect(void) { }
>  static inline bool amdgpu_acpi_is_power_shift_control_supported(void) { 
> return false; }
>  static inline int amdgpu_acpi_power_shift_control(struct amdgpu_device *adev,
> @@ -1406,11 +1408,9 @@ static inline int 
> amdgpu_acpi_smart_shift_update(struct drm_device *dev,
>
>  #if defined(CONFIG_ACPI) && defined(CONFIG_SUSPEND)
>  bool amdgpu_acpi_is_s3_active(struct amdgpu_device *adev);
> -bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev);
>  bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev);
>  #else
>  static inline bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev) { 
> return false; }
> -static inline bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) 
> { return false; }
>  static inline bool amdgpu_acpi_is_s3_active(struct amdgpu_device *adev) { 
> return false; }
>  #endif
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index 25e902077caf..065944bdeee4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -971,6 +971,28 @@ static bool amdgpu_atcs_pci_probe_handle(struct pci_dev 
> *pdev)
> return true;
>  }
>
> +
> +/**
> + * amdgpu_acpi_should_gpu_reset
> + *
> + * @adev: amdgpu_device_pointer
> + *
> + * returns true if should reset GPU, false if not
> + */
> +bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
> +{
> +   if (adev->flags & AMD_IS_APU)
> +   return false;
> +
> +   if (amdgpu_sriov_vf(adev))
> +   return false;
> +
> +#if IS_ENABLED(CONFIG_SUSPEND)
> +   return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
> +#endif /* CONFIG_SUSPEND */
> +   return true;

Should probably be:
#if IS_ENABLED(CONFIG_SUSPEND)
return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
#else
return true;
#endif

With that fixed, series is:
Reviewed-by: Alex Deucher 

> +}
> +
>  /*
>   * amdgpu_acpi_detect - detect ACPI ATIF/ATCS methods
>   *
> @@ -1042,24 +1064,6 @@ bool amdgpu_acpi_is_s3_active(struct amdgpu_device 
> *adev)
> (pm_suspend_target_state == PM_SUSPEND_MEM);
>  }
>
> -/**
> - * amdgpu_acpi_should_gpu_reset
> - *
> - * @adev: amdgpu_device_pointer
> - *
> - * returns true if should reset GPU, false if not
> - */
> -bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
> -{
> -   if (adev->flags & AMD_IS_APU)
> -   return false;
> -
> -   if (amdgpu_sriov_vf(adev))
> -   return false;
> -
> -   return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
> -}
> -
>  /**
>   * amdgpu_acpi_is_s0ix_active
>   *
> --
> 2.25.1
>


Re: [RFC PATCH 4/5] x86/xen: acpi registers gsi for xen pvh

2023-03-15 Thread Roger Pau Monné
On Sun, Mar 12, 2023 at 08:01:56PM +0800, Huang Rui wrote:
> From: Chen Jiqian 
> 
> Add acpi_register_gsi_xen_pvh() to register gsi for PVH mode.
> In addition to call acpi_register_gsi_ioapic(), it also setup
> a map between gsi and vector in hypervisor side. So that,
> when dgpu create an interrupt, hypervisor can correctly find
> which guest domain to process interrupt by vector.

The term 'dgpu' needs clarifying or replacing by a more generic
naming.

Also, I would like to be able to get away from requiring dom0 to
register the GSIs in this way.  If you take a look at Xen, there's
code in the emulated IO-APIC available to dom0 that already does this
registering (see vioapic_hwdom_map_gsi() in Xen).

I think the problem here is that the GSI used by the device you want
to passthrough has never had it's pin unmasked in the IO-APIC, and
hence hasn't been registered.

It would be helpful if you could state in the commit message what
issue you are trying to solve by doing this registering here, I assume
it is done in order to map the IRQ to a PIRQ, so later calls by the
toolstack to bind it succeed.

Would it be possible instead to perform the call to PHYSDEVOP_map_pirq
in the toolstack itself if the PIRQ cannot be found?

Thanks, Roger.


Re: [PATCH] drm/amdgpu: add mes resume when do gfx post soft reset

2023-03-15 Thread Alex Deucher
On Wed, Mar 15, 2023 at 4:13 AM Tong Liu01  wrote:
>
> [why]
> when gfx do soft reset, mes will also do reset, if mes is not
> resumed when do recover from soft reset, mes is unable to respond
> in later sequence
>
> [how]
> resume mes when do gfx post soft reset
>
> Signed-off-by: Tong Liu01 

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index 3bf697a80cf2..08650f93f210 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -4655,6 +4655,14 @@ static bool gfx_v11_0_check_soft_reset(void *handle)
> return false;
>  }
>
> +static int gfx_v11_0_post_soft_reset(void *handle)
> +{
> +   /**
> +* GFX soft reset will impact MES, need resume MES when do GFX soft 
> reset
> +*/
> +   return amdgpu_mes_resume((struct amdgpu_device *)handle);
> +}
> +
>  static uint64_t gfx_v11_0_get_gpu_clock_counter(struct amdgpu_device *adev)
>  {
> uint64_t clock;
> @@ -6166,6 +6174,7 @@ static const struct amd_ip_funcs gfx_v11_0_ip_funcs = {
> .wait_for_idle = gfx_v11_0_wait_for_idle,
> .soft_reset = gfx_v11_0_soft_reset,
> .check_soft_reset = gfx_v11_0_check_soft_reset,
> +   .post_soft_reset = gfx_v11_0_post_soft_reset,
> .set_clockgating_state = gfx_v11_0_set_clockgating_state,
> .set_powergating_state = gfx_v11_0_set_powergating_state,
> .get_clockgating_state = gfx_v11_0_get_clockgating_state,
> --
> 2.34.1
>


Re: [RFC PATCH 3/5] drm/amdgpu: set passthrough mode for xen pvh/hvm

2023-03-15 Thread Roger Pau Monné
On Sun, Mar 12, 2023 at 08:01:55PM +0800, Huang Rui wrote:
> There is an second stage translation between the guest machine address
> and host machine address in Xen PVH/HVM. The PCI bar address in the xen
> guest kernel are not translated at the second stage on Xen PVH/HVM, so

I'm confused by the sentence above, do you think it could be reworded
or expanded to clarify?

PCI BAR addresses are not in the guest kernel, but rather in the
physical memory layout made available to the guest.

Also, I'm unsure why xen_initial_domain() needs to be used in the
conditional below: all PV domains handle addresses the same, so if
it's not needed for a PV dom0 it's likely not needed for a PV domU
either.  Albeit it would help to know more about what
AMDGPU_PASSTHROUGH_MODE implies.

Thanks, Roger.


Re: [PATCH] [RFC] drm/drm_buddy fails to initialize on 32-bit architectures

2023-03-15 Thread Luís Mendes
I'll give it a try this weekend.

Luís

On Fri, Mar 10, 2023 at 1:15 PM Arunpravin Paneer Selvam
 wrote:
>
>
>
> On 3/9/2023 3:42 PM, Luís Mendes wrote:
> > Hi,
> >
> > Ping? This is actually a regression.
> > If there is no one available to work this, maybe I can have a look in
> > my spare time, in accordance with your suggestion.
> >
> > Regards,
> > Luís
> >
> > On Tue, Jan 3, 2023 at 8:44 AM Christian König  
> > wrote:
> >> Am 25.12.22 um 20:39 schrieb Luís Mendes:
> >>> Re-sending with the correct  linux-kernel mailing list email address.
> >>> Sorry for the inconvenience.
> >>>
> >>> The proposed patch fixes the issue and allows amdgpu to work again on
> >>> armhf with a AMD RX 550 card, however it may not be the best solution
> >>> for the issue, as detailed below.
> >>>
> >>> include/log2.h defined macros rounddown_pow_of_two(...) and
> >>> roundup_pow_of_two(...) do not handle 64-bit values on 32-bit
> >>> architectures (tested on armv9 armhf machine) causing
> >>> drm_buddy_init(...) to fail on BUG_ON with an underflow on the order
> >>> value, thus impeding amdgpu to load properly (no GUI).
> >>>
> >>> One option is to modify rounddown_pow_of_two(...) to detect if the
> >>> variable takes 32 bits or less and call __rounddown_pow_of_two_u32(u32
> >>> n) or if the variable takes more space than 32 bits, then call
> >>> __rounddown_pow_of_two_u64(u64 n). This would imply renaming
> >>> __rounddown_pow_of_two(unsigne
> >>> d long n) to
> >>> __rounddown_pow_of_two_u32(u32 n) and add a new function
> >>> __rounddown_pow_of_two_u64(u64 n). This would be the most transparent
> >>> solution, however there a few complications, and they are:
> >>> - that the mm subsystem will fail to link on armhf with an undefined
> >>> reference on __aeabi_uldivmod
> >>> - there a few drivers that directly call __rounddown_pow_of_two(...)
> >>> - that other drivers and subsystems generate warnings
> >>>
> >>> So this alternate solution was devised which avoids touching existing
> >>> code paths, and just updates drm_buddy which seems to be the only
> >>> driver that is failing, however I am not sure if this is the proper
> >>> way to go. So I would like to get a second opinion on this, by those
> >>> who know.
> >>>
> >>> /include/linux/log2.h
> >>> /drivers/gpu/drm/drm_buddy.c
> >>>
> >>> Signed-off-by: Luís Mendes 
>  8--8<
> >>> diff -uprN linux-next/drivers/gpu/drm/drm_buddy.c
> >>> linux-nextLM/drivers/gpu/drm/drm_buddy.c
> >>> --- linux-next/drivers/gpu/drm/drm_buddy.c2022-12-25
> >>> 16:29:26.0 +
> >>> +++ linux-nextLM/drivers/gpu/drm/drm_buddy.c2022-12-25
> >>> 17:04:32.136007116 +
> >>> @@ -128,7 +128,7 @@ int drm_buddy_init(struct drm_buddy *mm,
> >>>unsigned int order;
> >>>u64 root_size;
> >>>
> >>> -root_size = rounddown_pow_of_two(size);
> >>> +root_size = rounddown_pow_of_two_u64(size);
> >>>order = ilog2(root_size) - ilog2(chunk_size);
> >> I think this can be handled much easier if keep around the root_order
> >> instead of the root_size in the first place.
> >>
> >> Cause ilog2() does the right thing even for non power of two values and
> >> so we just need the order for the offset subtraction below.
> Could you try with ilog2() and see if you are getting the right value
> for size as suggested
> by Christian.
>
> Thanks,
> Arun
> >>
> >> Arun can you take a closer look at this?
> >>
> >> Regards,
> >> Christian.
> >>
> >>>root = drm_block_alloc(mm, NULL, order, offset);
> >>> diff -uprN linux-next/include/linux/log2.h 
> >>> linux-nextLM/include/linux/log2.h
> >>> --- linux-next/include/linux/log2.h2022-12-25 16:29:29.0 +
> >>> +++ linux-nextLM/include/linux/log2.h2022-12-25 17:00:34.319901492 
> >>> +
> >>> @@ -58,6 +58,18 @@ unsigned long __roundup_pow_of_two(unsig
> >>>}
> >>>
> >>>/**
> >>> + * __roundup_pow_of_two_u64() - round up to nearest power of two
> >>> + * (unsgined 64-bits precision version)
> >>> + * @n: value to round up
> >>> + */
> >>> +static inline __attribute__((const))
> >>> +u64 __roundup_pow_of_two_u64(u64 n)
> >>> +{
> >>> +return 1ULL << fls64(n - 1);
> >>> +}
> >>> +
> >>> +
> >>> +/**
> >>> * __rounddown_pow_of_two() - round down to nearest power of two
> >>> * @n: value to round down
> >>> */
> >>> @@ -68,6 +80,17 @@ unsigned long __rounddown_pow_of_two(uns
> >>>}
> >>>
> >>>/**
> >>> + * __rounddown_pow_of_two_u64() - round down to nearest power of two
> >>> + * (unsgined 64-bits precision version)
> >>> + * @n: value to round down
> >>> + */
> >>> +static inline __attribute__((const))
> >>> +u64 __rounddown_pow_of_two_u64(u64 n)
> >>> +{
> >>> +return 1ULL << (fls64(n) - 1);
> >>> +}
> >>> +
> >>> +/**
> >>> * const_ilog2 - log base 2 of 32-bit or a 64-bit constant unsigned 
> >>> value
> >>> * @n: parameter
> >>> *
> >>> @@ -163,6 +186,7 @@ unsigned long 

Re: [RFC PATCH 2/5] xen/grants: update initialization order of xen grant table

2023-03-15 Thread Roger Pau Monné
On Sun, Mar 12, 2023 at 08:01:54PM +0800, Huang Rui wrote:
> The xen grant table will be initialied before parsing the PCI resources,
> so xen_alloc_unpopulated_pages() ends up using a range from the PCI
> window because Linux hasn't parsed the PCI information yet.
> 
> So modify the initialization order to make sure the real PCI resources
> are parsed before.

Has this been tested on a domU to make sure the late grant table init
doesn't interfere with PV devices getting setup?

> Signed-off-by: Huang Rui 
> ---
>  arch/x86/xen/grant-table.c | 2 +-
>  drivers/xen/grant-table.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
> index 1e681bf62561..64a04d1e70f5 100644
> --- a/arch/x86/xen/grant-table.c
> +++ b/arch/x86/xen/grant-table.c
> @@ -165,5 +165,5 @@ static int __init xen_pvh_gnttab_setup(void)
>  }
>  /* Call it _before_ __gnttab_init as we need to initialize the
>   * xen_auto_xlat_grant_frames first. */
> -core_initcall(xen_pvh_gnttab_setup);
> +fs_initcall_sync();
>  #endif
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index e1ec725c2819..6382112f3473 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -1680,4 +1680,4 @@ static int __gnttab_init(void)
>  }
>  /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
>   * beforehand to initialize xen_auto_xlat_grant_frames. */

Comment need to be updated, but I was thinking whether it won't be
best to simply call xen_pvh_gnttab_setup() from __gnttab_init() itself
when running as a PVH guest?

Thanks, Roger.


[PATCH 6.2 015/141] drm/display: Dont block HDR_OUTPUT_METADATA on unknown EOTF

2023-03-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

commit e5eef23e267c72521d81f23f7f82d1f523d4a253 upstream.

The EDID of an HDR display defines EOTFs that are supported
by the display and can be set in the HDR metadata infoframe.
Userspace is expected to read the EDID and set an appropriate
HDR_OUTPUT_METADATA.

In drm_parse_hdr_metadata_block the kernel reads the supported
EOTFs from the EDID and stores them in the
drm_connector->hdr_sink_metadata. While doing so it also
filters the EOTFs to the EOTFs the kernel knows about.
When an HDR_OUTPUT_METADATA is set it then checks to
make sure the EOTF is a supported EOTF. In cases where
the kernel doesn't know about a new EOTF this check will
fail, even if the EDID advertises support.

Since it is expected that userspace reads the EDID to understand
what the display supports it doesn't make sense for DRM to block
an HDR_OUTPUT_METADATA if it contains an EOTF the kernel doesn't
understand.

This comes with the added benefit of future-proofing metadata
support. If the spec defines a new EOTF there is no need to
update DRM and an compositor can immediately make use of it.

Bug: https://gitlab.freedesktop.org/wayland/weston/-/issues/609

v2: Distinguish EOTFs defind in kernel and ones defined
in EDID in the commit description (Pekka)

v3: Rebase; drm_hdmi_infoframe_set_hdr_metadata moved
to drm_hdmi_helper.c

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Acked-by: Pekka Paalanen 
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-2-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/display/drm_hdmi_helper.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/drivers/gpu/drm/display/drm_hdmi_helper.c
+++ b/drivers/gpu/drm/display/drm_hdmi_helper.c
@@ -44,10 +44,8 @@ int drm_hdmi_infoframe_set_hdr_metadata(
 
/* Sink EOTF is Bit map while infoframe is absolute values */
if (!is_eotf_supported(hdr_metadata->hdmi_metadata_type1.eotf,
-   connector->hdr_sink_metadata.hdmi_type1.eotf)) {
-   DRM_DEBUG_KMS("EOTF Not Supported\n");
-   return -EINVAL;
-   }
+   connector->hdr_sink_metadata.hdmi_type1.eotf))
+   DRM_DEBUG_KMS("Unknown EOTF %d\n", 
hdr_metadata->hdmi_metadata_type1.eotf);
 
err = hdmi_drm_infoframe_init(frame);
if (err < 0)




[PATCH 6.2 016/141] drm/connector: print max_requested_bpc in state debugfs

2023-03-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

commit 7d386975f6a495902e679a3a250a7456d7e54765 upstream.

This is useful to understand the bpc defaults and
support of a driver.

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/drm_atomic.c |1 +
 1 file changed, 1 insertion(+)

--- a/drivers/gpu/drm/drm_atomic.c
+++ b/drivers/gpu/drm/drm_atomic.c
@@ -1070,6 +1070,7 @@ static void drm_atomic_connector_print_s
drm_printf(p, "connector[%u]: %s\n", connector->base.id, 
connector->name);
drm_printf(p, "\tcrtc=%s\n", state->crtc ? state->crtc->name : 
"(null)");
drm_printf(p, "\tself_refresh_aware=%d\n", state->self_refresh_aware);
+   drm_printf(p, "\tmax_requested_bpc=%d\n", state->max_requested_bpc);
 
if (connector->connector_type == DRM_MODE_CONNECTOR_WRITEBACK)
if (state->writeback_job && state->writeback_job->fb)




[PATCH v2] drm/amdgpu/nv: Apply ASPM quirk on Intel ADL + AMD Navi

2023-03-15 Thread Kai-Heng Feng
S2idle resume freeze can be observed on Intel ADL + AMD WX5500. This is
caused by commit 0064b0ce85bb ("drm/amd/pm: enable ASPM by default").

The root cause is still not clear for now.

So extend and apply the ASPM quirk from commit e02fe3bc7aba
("drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems"), to
workaround the issue on Navi cards too.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2458
Reviewed-by: Alex Deucher 
Signed-off-by: Kai-Heng Feng 
---
v2:
 - Rename the quirk function.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++
 drivers/gpu/drm/amd/amdgpu/nv.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/vi.c| 17 +
 4 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 164141bc8b4a..5f3b139c1f99 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1272,6 +1272,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device 
*adev);
 int amdgpu_device_pci_reset(struct amdgpu_device *adev);
 bool amdgpu_device_need_post(struct amdgpu_device *adev);
 bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev);
+bool amdgpu_device_aspm_support_quirk(void);
 
 void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes,
  u64 num_vis_bytes);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c4a4e2fe6681..05a34ff79e78 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -80,6 +80,10 @@
 
 #include 
 
+#if IS_ENABLED(CONFIG_X86)
+#include 
+#endif
+
 MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -1356,6 +1360,17 @@ bool amdgpu_device_should_use_aspm(struct amdgpu_device 
*adev)
return pcie_aspm_enabled(adev->pdev);
 }
 
+bool amdgpu_device_aspm_support_quirk(void)
+{
+#if IS_ENABLED(CONFIG_X86)
+   struct cpuinfo_x86 *c = _data(0);
+
+   return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
+#else
+   return true;
+#endif
+}
+
 /* if we get transitioned to only one device, take VGA back */
 /**
  * amdgpu_device_vga_set_decode - enable/disable vga decode
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 855d390c41de..26733263913e 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -578,7 +578,7 @@ static void nv_pcie_gen3_enable(struct amdgpu_device *adev)
 
 static void nv_program_aspm(struct amdgpu_device *adev)
 {
-   if (!amdgpu_device_should_use_aspm(adev))
+   if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_aspm_support_quirk())
return;
 
if (!(adev->flags & AMD_IS_APU) &&
diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 12ef782eb478..ceab8783575c 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,10 +81,6 @@
 #include "mxgpu_vi.h"
 #include "amdgpu_dm.h"
 
-#if IS_ENABLED(CONFIG_X86)
-#include 
-#endif
-
 #define ixPCIE_LC_L1_PM_SUBSTATE   0x100100C6
 #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK   
0x0001L
 #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK   0x0002L
@@ -1138,24 +1134,13 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
WREG32_PCIE(ixPCIE_LC_CNTL, data);
 }
 
-static bool aspm_support_quirk_check(void)
-{
-#if IS_ENABLED(CONFIG_X86)
-   struct cpuinfo_x86 *c = _data(0);
-
-   return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
-#else
-   return true;
-#endif
-}
-
 static void vi_program_aspm(struct amdgpu_device *adev)
 {
u32 data, data1, orig;
bool bL1SS = false;
bool bClkReqSupport = true;
 
-   if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check())
+   if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_aspm_support_quirk())
return;
 
if (adev->flags & AMD_IS_APU ||
-- 
2.34.1



[PATCH 6.1 013/143] drm/display: Dont block HDR_OUTPUT_METADATA on unknown EOTF

2023-03-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

commit e5eef23e267c72521d81f23f7f82d1f523d4a253 upstream.

The EDID of an HDR display defines EOTFs that are supported
by the display and can be set in the HDR metadata infoframe.
Userspace is expected to read the EDID and set an appropriate
HDR_OUTPUT_METADATA.

In drm_parse_hdr_metadata_block the kernel reads the supported
EOTFs from the EDID and stores them in the
drm_connector->hdr_sink_metadata. While doing so it also
filters the EOTFs to the EOTFs the kernel knows about.
When an HDR_OUTPUT_METADATA is set it then checks to
make sure the EOTF is a supported EOTF. In cases where
the kernel doesn't know about a new EOTF this check will
fail, even if the EDID advertises support.

Since it is expected that userspace reads the EDID to understand
what the display supports it doesn't make sense for DRM to block
an HDR_OUTPUT_METADATA if it contains an EOTF the kernel doesn't
understand.

This comes with the added benefit of future-proofing metadata
support. If the spec defines a new EOTF there is no need to
update DRM and an compositor can immediately make use of it.

Bug: https://gitlab.freedesktop.org/wayland/weston/-/issues/609

v2: Distinguish EOTFs defind in kernel and ones defined
in EDID in the commit description (Pekka)

v3: Rebase; drm_hdmi_infoframe_set_hdr_metadata moved
to drm_hdmi_helper.c

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Acked-by: Pekka Paalanen 
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-2-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/display/drm_hdmi_helper.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/drivers/gpu/drm/display/drm_hdmi_helper.c
+++ b/drivers/gpu/drm/display/drm_hdmi_helper.c
@@ -44,10 +44,8 @@ int drm_hdmi_infoframe_set_hdr_metadata(
 
/* Sink EOTF is Bit map while infoframe is absolute values */
if (!is_eotf_supported(hdr_metadata->hdmi_metadata_type1.eotf,
-   connector->hdr_sink_metadata.hdmi_type1.eotf)) {
-   DRM_DEBUG_KMS("EOTF Not Supported\n");
-   return -EINVAL;
-   }
+   connector->hdr_sink_metadata.hdmi_type1.eotf))
+   DRM_DEBUG_KMS("Unknown EOTF %d\n", 
hdr_metadata->hdmi_metadata_type1.eotf);
 
err = hdmi_drm_infoframe_init(frame);
if (err < 0)




[PATCH 5.4 03/68] drm/connector: print max_requested_bpc in state debugfs

2023-03-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

commit 7d386975f6a495902e679a3a250a7456d7e54765 upstream.

This is useful to understand the bpc defaults and
support of a driver.

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/drm_atomic.c |1 +
 1 file changed, 1 insertion(+)

--- a/drivers/gpu/drm/drm_atomic.c
+++ b/drivers/gpu/drm/drm_atomic.c
@@ -1006,6 +1006,7 @@ static void drm_atomic_connector_print_s
drm_printf(p, "connector[%u]: %s\n", connector->base.id, 
connector->name);
drm_printf(p, "\tcrtc=%s\n", state->crtc ? state->crtc->name : 
"(null)");
drm_printf(p, "\tself_refresh_aware=%d\n", state->self_refresh_aware);
+   drm_printf(p, "\tmax_requested_bpc=%d\n", state->max_requested_bpc);
 
if (connector->connector_type == DRM_MODE_CONNECTOR_WRITEBACK)
if (state->writeback_job && state->writeback_job->fb)




[PATCH 5.15 007/145] drm/connector: print max_requested_bpc in state debugfs

2023-03-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

commit 7d386975f6a495902e679a3a250a7456d7e54765 upstream.

This is useful to understand the bpc defaults and
support of a driver.

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/drm_atomic.c |1 +
 1 file changed, 1 insertion(+)

--- a/drivers/gpu/drm/drm_atomic.c
+++ b/drivers/gpu/drm/drm_atomic.c
@@ -1052,6 +1052,7 @@ static void drm_atomic_connector_print_s
drm_printf(p, "connector[%u]: %s\n", connector->base.id, 
connector->name);
drm_printf(p, "\tcrtc=%s\n", state->crtc ? state->crtc->name : 
"(null)");
drm_printf(p, "\tself_refresh_aware=%d\n", state->self_refresh_aware);
+   drm_printf(p, "\tmax_requested_bpc=%d\n", state->max_requested_bpc);
 
if (connector->connector_type == DRM_MODE_CONNECTOR_WRITEBACK)
if (state->writeback_job && state->writeback_job->fb)




[PATCH 6.1 014/143] drm/connector: print max_requested_bpc in state debugfs

2023-03-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

commit 7d386975f6a495902e679a3a250a7456d7e54765 upstream.

This is useful to understand the bpc defaults and
support of a driver.

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/drm_atomic.c |1 +
 1 file changed, 1 insertion(+)

--- a/drivers/gpu/drm/drm_atomic.c
+++ b/drivers/gpu/drm/drm_atomic.c
@@ -1070,6 +1070,7 @@ static void drm_atomic_connector_print_s
drm_printf(p, "connector[%u]: %s\n", connector->base.id, 
connector->name);
drm_printf(p, "\tcrtc=%s\n", state->crtc ? state->crtc->name : 
"(null)");
drm_printf(p, "\tself_refresh_aware=%d\n", state->self_refresh_aware);
+   drm_printf(p, "\tmax_requested_bpc=%d\n", state->max_requested_bpc);
 
if (connector->connector_type == DRM_MODE_CONNECTOR_WRITEBACK)
if (state->writeback_job && state->writeback_job->fb)




[PATCH 5.10 005/104] drm/connector: print max_requested_bpc in state debugfs

2023-03-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

commit 7d386975f6a495902e679a3a250a7456d7e54765 upstream.

This is useful to understand the bpc defaults and
support of a driver.

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/drm_atomic.c |1 +
 1 file changed, 1 insertion(+)

--- a/drivers/gpu/drm/drm_atomic.c
+++ b/drivers/gpu/drm/drm_atomic.c
@@ -1010,6 +1010,7 @@ static void drm_atomic_connector_print_s
drm_printf(p, "connector[%u]: %s\n", connector->base.id, 
connector->name);
drm_printf(p, "\tcrtc=%s\n", state->crtc ? state->crtc->name : 
"(null)");
drm_printf(p, "\tself_refresh_aware=%d\n", state->self_refresh_aware);
+   drm_printf(p, "\tmax_requested_bpc=%d\n", state->max_requested_bpc);
 
if (connector->connector_type == DRM_MODE_CONNECTOR_WRITEBACK)
if (state->writeback_job && state->writeback_job->fb)




[PATCH 2/2] drm/amdgpu: skip ASIC reset for APUs when go to S4

2023-03-15 Thread Tim Huang
For GC IP v11.0.4/11, PSP TMR need to be reserved
for ASIC mode2 reset. But for S4, when psp suspend,
it will destroy the TMR that fails the ASIC reset.

[  96.006101] amdgpu :62:00.0: amdgpu: MODE2 reset
[  100.409717] amdgpu :62:00.0: amdgpu: SMU: I'm not done with your 
previous command: SMN_C2PMSG_66:0x0011 SMN_C2PMSG_82:0x0002
[  100.411593] amdgpu :62:00.0: amdgpu: Mode2 reset failed!
[  100.412470] amdgpu :62:00.0: PM: pci_pm_freeze(): 
amdgpu_pmops_freeze+0x0/0x50 [amdgpu] returns -62
[  100.414020] amdgpu :62:00.0: PM: dpm_run_callback(): 
pci_pm_freeze+0x0/0xd0 returns -62
[  100.415311] amdgpu :62:00.0: PM: pci_pm_freeze+0x0/0xd0 returned -62 
after 4623202 usecs
[  100.416608] amdgpu :62:00.0: PM: failed to freeze async: error -62

We can skip the reset on APUs, assuming we can resume them
properly. Verified on some GFX11, GFX10 and old GFX9 APUs.

Signed-off-by: Tim Huang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 5f02c530e2cc..64214996278b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2467,7 +2467,10 @@ static int amdgpu_pmops_freeze(struct device *dev)
adev->in_s4 = false;
if (r)
return r;
-   return amdgpu_asic_reset(adev);
+
+   if (amdgpu_acpi_should_gpu_reset(adev))
+   return amdgpu_asic_reset(adev);
+   return 0;
 }
 
 static int amdgpu_pmops_thaw(struct device *dev)
-- 
2.25.1



[PATCH 1/2] drm/amdgpu: reposition the gpu reset checking for reuse

2023-03-15 Thread Tim Huang
Move the amdgpu_acpi_should_gpu_reset out of
CONFIG_SUSPEND to share it with hibernate case.

Signed-off-by: Tim Huang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h  |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 40 +---
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 5c6132502f35..5bddc03332b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1392,10 +1392,12 @@ int amdgpu_acpi_smart_shift_update(struct drm_device 
*dev, enum amdgpu_ss ss_sta
 int amdgpu_acpi_pcie_notify_device_ready(struct amdgpu_device *adev);
 
 void amdgpu_acpi_get_backlight_caps(struct amdgpu_dm_backlight_caps *caps);
+bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev);
 void amdgpu_acpi_detect(void);
 #else
 static inline int amdgpu_acpi_init(struct amdgpu_device *adev) { return 0; }
 static inline void amdgpu_acpi_fini(struct amdgpu_device *adev) { }
+static inline bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) { 
return false; }
 static inline void amdgpu_acpi_detect(void) { }
 static inline bool amdgpu_acpi_is_power_shift_control_supported(void) { return 
false; }
 static inline int amdgpu_acpi_power_shift_control(struct amdgpu_device *adev,
@@ -1406,11 +1408,9 @@ static inline int amdgpu_acpi_smart_shift_update(struct 
drm_device *dev,
 
 #if defined(CONFIG_ACPI) && defined(CONFIG_SUSPEND)
 bool amdgpu_acpi_is_s3_active(struct amdgpu_device *adev);
-bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev);
 bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev);
 #else
 static inline bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev) { 
return false; }
-static inline bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev) { 
return false; }
 static inline bool amdgpu_acpi_is_s3_active(struct amdgpu_device *adev) { 
return false; }
 #endif
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
index 25e902077caf..065944bdeee4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
@@ -971,6 +971,28 @@ static bool amdgpu_atcs_pci_probe_handle(struct pci_dev 
*pdev)
return true;
 }
 
+
+/**
+ * amdgpu_acpi_should_gpu_reset
+ *
+ * @adev: amdgpu_device_pointer
+ *
+ * returns true if should reset GPU, false if not
+ */
+bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
+{
+   if (adev->flags & AMD_IS_APU)
+   return false;
+
+   if (amdgpu_sriov_vf(adev))
+   return false;
+
+#if IS_ENABLED(CONFIG_SUSPEND)
+   return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
+#endif /* CONFIG_SUSPEND */
+   return true;
+}
+
 /*
  * amdgpu_acpi_detect - detect ACPI ATIF/ATCS methods
  *
@@ -1042,24 +1064,6 @@ bool amdgpu_acpi_is_s3_active(struct amdgpu_device *adev)
(pm_suspend_target_state == PM_SUSPEND_MEM);
 }
 
-/**
- * amdgpu_acpi_should_gpu_reset
- *
- * @adev: amdgpu_device_pointer
- *
- * returns true if should reset GPU, false if not
- */
-bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
-{
-   if (adev->flags & AMD_IS_APU)
-   return false;
-
-   if (amdgpu_sriov_vf(adev))
-   return false;
-
-   return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
-}
-
 /**
  * amdgpu_acpi_is_s0ix_active
  *
-- 
2.25.1



Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

2023-03-15 Thread Jan Beulich
On 15.03.2023 01:52, Stefano Stabellini wrote:
> On Mon, 13 Mar 2023, Jan Beulich wrote:
>> On 12.03.2023 13:01, Huang Rui wrote:
>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>> virtualization support when possible. It will using the hardware IOMMU
>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
>>> Xen PVH.
>>
>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>> away without resorting to swiotlb in certain cases (like I/O to an
>> address-restricted device)?
> 
> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
> so we can use guest physical addresses instead of machine addresses for
> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
> case is XENFEAT_not_direct_mapped).

But how does Xen using an IOMMU help with, as said, address-restricted
devices? They may still need e.g. a 32-bit address to be programmed in,
and if the kernel has memory beyond the 4G boundary not all I/O buffers
may fulfill this requirement.

Jan


Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

2023-03-15 Thread Jan Beulich
On 15.03.2023 05:14, Huang Rui wrote:
> On Wed, Mar 15, 2023 at 08:52:30AM +0800, Stefano Stabellini wrote:
>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>> On 12.03.2023 13:01, Huang Rui wrote:
 Xen PVH is the paravirtualized mode and takes advantage of hardware
 virtualization support when possible. It will using the hardware IOMMU
 support instead of xen-swiotlb, so disable swiotlb if current domain is
 Xen PVH.
>>>
>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>>> away without resorting to swiotlb in certain cases (like I/O to an
>>> address-restricted device)?
>>
>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
>> so we can use guest physical addresses instead of machine addresses for
>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
>> case is XENFEAT_not_direct_mapped).
> 
> Hi Jan, sorry to late reply. We are using the native kernel amdgpu and ttm
> driver on Dom0, amdgpu/ttm would like to use IOMMU to allocate coherent
> buffers for userptr that map the user space memory to gpu access, however,
> swiotlb doesn't support this. In other words, with swiotlb, we only can
> handle the buffer page by page.

But how does outright disabling swiotlb help with this? There still wouldn't
be an IOMMU that your kernel has control over. Looks like you want something
like pvIOMMU, but that work was never completed. And even then the swiotlb
may continue to be needed for other purposes.

Jan


Re: [PATCH v2] drm/amdgpu: resove reboot exception for si oland

2023-03-15 Thread lizhenneng



On 2023/3/14 下午5:27, Chen, Guchun wrote:

[AMD Official Use Only - General]


-Original Message-
From: Lazar, Lijo 
Sent: Tuesday, March 14, 2023 5:07 PM
To: Chen, Guchun ; Zhenneng Li

Cc: David Airlie ; Pan, Xinhui ;
amd-gfx@lists.freedesktop.org; Daniel Vetter ; Deucher,
Alexander ; Koenig, Christian

Subject: RE: [PATCH v2] drm/amdgpu: resove reboot exception for si oland

[AMD Official Use Only - General]

Hi Guchun,

This patch doesn't look correct. Without dpm enabled, temperature range
shouldn't be set at all. The patch posted by Zhenneng is good enough or
better to skip late init altogether as it remains an empty function with that
patch.

My intention is to prevent setting temperature range again in late_init, as in 
hw_init prior to late_init, we have configured this range and set dpm_enabled 
to true already. Also this is a draft patch:)

Leaving a NULL function in late_init looks good to me.


To be consistent with previous code style, such as:

static bool si_dpm_is_idle(void *handle)
{
    /* XXX */
    return true;
}

static int si_dpm_wait_for_idle(void *handle)
{
    /* XXX */
    return 0;
}

static int si_dpm_soft_reset(void *handle)
{
    return 0;
}

static int si_dpm_set_clockgating_state(void *handle,
                    enum amd_clockgating_state state)
{
    return 0;
}

static int si_dpm_set_powergating_state(void *handle,
                    enum amd_powergating_state state)
{
    return 0;
}

We could  use "return 0".



Regards,
Guchun

Thanks,
Lijo

-Original Message-
From: amd-gfx  On Behalf Of Chen,
Guchun
Sent: Tuesday, March 14, 2023 6:35 AM
To: Zhenneng Li 
Cc: David Airlie ; Pan, Xinhui ;
amd-gfx@lists.freedesktop.org; Daniel Vetter ; Deucher,
Alexander ; Koenig, Christian

Subject: RE: [PATCH v2] drm/amdgpu: resove reboot exception for si oland

Will attached patch help?

Regards,
Guchun


-Original Message-
From: Zhenneng Li 
Sent: Monday, March 13, 2023 10:57 AM
To: Chen, Guchun 
Cc: Deucher, Alexander ; Koenig, Christian
; Pan, Xinhui ; David
Airlie ; Daniel Vetter ; amd-
g...@lists.freedesktop.org; Zhenneng Li 
Subject: [PATCH v2] drm/amdgpu: resove reboot exception for si oland

During reboot test on arm64 platform, it may failure on boot.

The error message are as follows:
[6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]]
*ERROR*
 late_init of IP block  failed -22
[7.006919][ 7] [  T295] amdgpu :04:00.0:

amdgpu_device_ip_late_init

failed
[7.014224][ 7] [  T295] amdgpu :04:00.0: Fatal error during GPU init
---
  drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 12 
  1 file changed, 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
index d6d9e3b1b2c0..ca9bce895dbe 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
@@ -7626,18 +7626,6 @@ static int si_dpm_process_interrupt(struct
amdgpu_device *adev,

  static int si_dpm_late_init(void *handle)  {
-   int ret;
-   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-
-   if (!adev->pm.dpm_enabled)
-   return 0;
-
-   ret = si_set_temperature_range(adev);
-   if (ret)
-   return ret;
-#if 0 //TODO ?
-   si_dpm_powergate_uvd(adev, true);
-#endif
 return 0;
  }

--
2.25.1


[PATCH] drm/amdgpu: add mes resume when do gfx post soft reset

2023-03-15 Thread Tong Liu01
[why]
when gfx do soft reset, mes will also do reset, if mes is not
resumed when do recover from soft reset, mes is unable to respond
in later sequence

[how]
resume mes when do gfx post soft reset

Signed-off-by: Tong Liu01 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 3bf697a80cf2..08650f93f210 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4655,6 +4655,14 @@ static bool gfx_v11_0_check_soft_reset(void *handle)
return false;
 }
 
+static int gfx_v11_0_post_soft_reset(void *handle)
+{
+   /**
+* GFX soft reset will impact MES, need resume MES when do GFX soft 
reset
+*/
+   return amdgpu_mes_resume((struct amdgpu_device *)handle);
+}
+
 static uint64_t gfx_v11_0_get_gpu_clock_counter(struct amdgpu_device *adev)
 {
uint64_t clock;
@@ -6166,6 +6174,7 @@ static const struct amd_ip_funcs gfx_v11_0_ip_funcs = {
.wait_for_idle = gfx_v11_0_wait_for_idle,
.soft_reset = gfx_v11_0_soft_reset,
.check_soft_reset = gfx_v11_0_check_soft_reset,
+   .post_soft_reset = gfx_v11_0_post_soft_reset,
.set_clockgating_state = gfx_v11_0_set_clockgating_state,
.set_powergating_state = gfx_v11_0_set_powergating_state,
.get_clockgating_state = gfx_v11_0_get_clockgating_state,
-- 
2.34.1



RE: [PATCH] drm/amd: fix compilation issue with legacy gcc

2023-03-15 Thread Chen, Guchun
Reviewed-by: Guchun Chen 

> -Original Message-
> From: bobzhou 
> Sent: Wednesday, March 15, 2023 3:29 PM
> To: amd-gfx@lists.freedesktop.org; Chen, Guchun
> ; Cui, Flora ; Shi, Leslie
> ; Ma, Jun 
> Cc: Zhou, Bob 
> Subject: [PATCH] drm/amd: fix compilation issue with legacy gcc
> 
> This patch is used to fix following compilation issue with legacy gcc
> 
> error: ‘for’ loop initial declarations are only allowed in C99 mode
> 
> Signed-off-by: bobzhou 
> ---
>  .../drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c  | 9 ++---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 7 ---
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git
> a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
> b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
> index 2e251dcbb022..931f7c6446de 100644
> --- a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
> +++ b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
> @@ -137,8 +137,9 @@ static uint8_t get_lowest_dpia_index(struct dc_link
> *link)  {
>   const struct dc *dc_struct = link->dc;
>   uint8_t idx = 0xFF;
> + int i;
> 
> - for (int i = 0; i < MAX_PIPES * 2; ++i) {
> + for (i = 0; i < MAX_PIPES * 2; ++i) {
> 
>   if (!dc_struct->links[i] ||
>   dc_struct->links[i]->ep_type !=
> DISPLAY_ENDPOINT_USB4_DPIA) @@ -165,8 +166,9 @@ static int
> get_host_router_total_bw(struct dc_link *link, uint8_t type)
>   uint8_t idx = (link->link_index - lowest_dpia_index) / 2, idx_temp = 0;
>   struct dc_link *link_temp;
>   int total_bw = 0;
> + int i;
> 
> - for (int i = 0; i < MAX_PIPES * 2; ++i) {
> + for (i = 0; i < MAX_PIPES * 2; ++i) {
> 
>   if (!dc_struct->links[i] || dc_struct->links[i]->ep_type !=
> DISPLAY_ENDPOINT_USB4_DPIA)
>   continue;
> @@ -467,12 +469,13 @@ bool dpia_validate_usb4_bw(struct dc_link **link,
> int *bw_needed_per_dpia, uint8
>   bool ret = true;
>   int bw_needed_per_hr[MAX_HR_NUM] = { 0, 0 };
>   uint8_t lowest_dpia_index = 0, dpia_index = 0;
> + uint8_t i;
> 
>   if (!num_dpias || num_dpias > MAX_DPIA_NUM)
>   return ret;
> 
>   //Get total Host Router BW & Validate against each Host Router max
> BW
> - for (uint8_t i = 0; i < num_dpias; ++i) {
> + for (i = 0; i < num_dpias; ++i) {
> 
>   if (!link[i]->dpia_bw_alloc_config.bw_alloc_enabled)
>   continue;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index 54d36df1306f..ea8f3d6fb98b 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -325,6 +325,7 @@ static int smu_v13_0_6_setup_driver_pptable(struct
> smu_context *smu)
>   struct PPTable_t *pptable =
>   (struct PPTable_t *)smu_table->driver_pptable;
>   int ret;
> + int i;
> 
>   /* Store one-time values in driver PPTable */
>   if (!pptable->Init) {
> @@ -339,7 +340,7 @@ static int smu_v13_0_6_setup_driver_pptable(struct
> smu_context *smu)
>   pptable->MinGfxclkFrequency =
>   SMUQ10_TO_UINT(metrics->MinGfxclkFrequency);
> 
> - for (int i = 0; i < 4; ++i) {
> + for (i = 0; i < 4; ++i) {
>   pptable->FclkFrequencyTable[i] =
>   SMUQ10_TO_UINT(metrics-
> >FclkFrequencyTable[i]);
>   pptable->UclkFrequencyTable[i] =
> @@ -466,7 +467,7 @@ static int
> smu_v13_0_6_set_default_dpm_table(struct smu_context *smu)
>   struct PPTable_t *pptable =
>   (struct PPTable_t *)smu_table->driver_pptable;
>   uint32_t gfxclkmin, gfxclkmax, levels;
> - int ret = 0, i;
> + int ret = 0, i, j;
>   struct smu_v13_0_6_dpm_map dpm_map[] = {
>   { SMU_SOCCLK, SMU_FEATURE_DPM_SOCCLK_BIT,
> _context->dpm_tables.soc_table,
> @@ -513,7 +514,7 @@ static int
> smu_v13_0_6_set_default_dpm_table(struct smu_context *smu)
>   dpm_table->max = dpm_table->dpm_levels[0].value;
>   }
> 
> - for (int j = 0; j < ARRAY_SIZE(dpm_map); j++) {
> + for (j = 0; j < ARRAY_SIZE(dpm_map); j++) {
>   dpm_table = dpm_map[j].dpm_table;
>   levels = 1;
>   if (smu_cmn_feature_is_enabled(smu,
> dpm_map[j].feature_num)) {
> --
> 2.34.1



[PATCH] drm/amd: fix compilation issue with legacy gcc

2023-03-15 Thread bobzhou
This patch is used to fix following compilation issue with legacy gcc

error: ‘for’ loop initial declarations are only allowed in C99 mode

Signed-off-by: bobzhou 
---
 .../drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c  | 9 ++---
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 7 ---
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c 
b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
index 2e251dcbb022..931f7c6446de 100644
--- a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
+++ b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
@@ -137,8 +137,9 @@ static uint8_t get_lowest_dpia_index(struct dc_link *link)
 {
const struct dc *dc_struct = link->dc;
uint8_t idx = 0xFF;
+   int i;
 
-   for (int i = 0; i < MAX_PIPES * 2; ++i) {
+   for (i = 0; i < MAX_PIPES * 2; ++i) {
 
if (!dc_struct->links[i] ||
dc_struct->links[i]->ep_type != 
DISPLAY_ENDPOINT_USB4_DPIA)
@@ -165,8 +166,9 @@ static int get_host_router_total_bw(struct dc_link *link, 
uint8_t type)
uint8_t idx = (link->link_index - lowest_dpia_index) / 2, idx_temp = 0;
struct dc_link *link_temp;
int total_bw = 0;
+   int i;
 
-   for (int i = 0; i < MAX_PIPES * 2; ++i) {
+   for (i = 0; i < MAX_PIPES * 2; ++i) {
 
if (!dc_struct->links[i] || dc_struct->links[i]->ep_type != 
DISPLAY_ENDPOINT_USB4_DPIA)
continue;
@@ -467,12 +469,13 @@ bool dpia_validate_usb4_bw(struct dc_link **link, int 
*bw_needed_per_dpia, uint8
bool ret = true;
int bw_needed_per_hr[MAX_HR_NUM] = { 0, 0 };
uint8_t lowest_dpia_index = 0, dpia_index = 0;
+   uint8_t i;
 
if (!num_dpias || num_dpias > MAX_DPIA_NUM)
return ret;
 
//Get total Host Router BW & Validate against each Host Router max BW
-   for (uint8_t i = 0; i < num_dpias; ++i) {
+   for (i = 0; i < num_dpias; ++i) {
 
if (!link[i]->dpia_bw_alloc_config.bw_alloc_enabled)
continue;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
index 54d36df1306f..ea8f3d6fb98b 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
@@ -325,6 +325,7 @@ static int smu_v13_0_6_setup_driver_pptable(struct 
smu_context *smu)
struct PPTable_t *pptable =
(struct PPTable_t *)smu_table->driver_pptable;
int ret;
+   int i;
 
/* Store one-time values in driver PPTable */
if (!pptable->Init) {
@@ -339,7 +340,7 @@ static int smu_v13_0_6_setup_driver_pptable(struct 
smu_context *smu)
pptable->MinGfxclkFrequency =
SMUQ10_TO_UINT(metrics->MinGfxclkFrequency);
 
-   for (int i = 0; i < 4; ++i) {
+   for (i = 0; i < 4; ++i) {
pptable->FclkFrequencyTable[i] =
SMUQ10_TO_UINT(metrics->FclkFrequencyTable[i]);
pptable->UclkFrequencyTable[i] =
@@ -466,7 +467,7 @@ static int smu_v13_0_6_set_default_dpm_table(struct 
smu_context *smu)
struct PPTable_t *pptable =
(struct PPTable_t *)smu_table->driver_pptable;
uint32_t gfxclkmin, gfxclkmax, levels;
-   int ret = 0, i;
+   int ret = 0, i, j;
struct smu_v13_0_6_dpm_map dpm_map[] = {
{ SMU_SOCCLK, SMU_FEATURE_DPM_SOCCLK_BIT,
  _context->dpm_tables.soc_table,
@@ -513,7 +514,7 @@ static int smu_v13_0_6_set_default_dpm_table(struct 
smu_context *smu)
dpm_table->max = dpm_table->dpm_levels[0].value;
}
 
-   for (int j = 0; j < ARRAY_SIZE(dpm_map); j++) {
+   for (j = 0; j < ARRAY_SIZE(dpm_map); j++) {
dpm_table = dpm_map[j].dpm_table;
levels = 1;
if (smu_cmn_feature_is_enabled(smu, dpm_map[j].feature_num)) {
-- 
2.34.1