RE: [PATCH] drm/amd/powerplay: drop the unnecessary uclk hard min setting

2019-01-07 Thread Gao, Likun
Reviewed-by: Likun Gao 

-Original Message-
From: Evan Quan  
Sent: Tuesday, January 08, 2019 12:28 PM
To: amd-gfx@lists.freedesktop.org
Cc: Gao, Likun ; Feng, Kenneth ; Quan, 
Evan 
Subject: [PATCH] drm/amd/powerplay: drop the unnecessary uclk hard min setting

Since the soft min setting is enough, the hard min setting is redundant.

Change-Id: I758386085f227bad94148ec0b38776312b6f5b25
Reported-by: Likun Gao 
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c 
b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
index c2061d351d04..82935a3bd950 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
@@ -1692,13 +1692,6 @@ static int vega20_upload_dpm_min_level(struct pp_hwmgr 
*hwmgr, uint32_t feature_
(PPCLK_UCLK << 16) | (min_freq & 
0x))),
"Failed to set soft min memclk !",
return ret);
-
-   min_freq = data->dpm_table.mem_table.dpm_state.hard_min_level;
-   PP_ASSERT_WITH_CODE(!(ret = smum_send_msg_to_smc_with_parameter(
-   hwmgr, PPSMC_MSG_SetHardMinByFreq,
-   (PPCLK_UCLK << 16) | (min_freq & 
0x))),
-   "Failed to set hard min memclk !",
-   return ret);
}
 
if (data->smu_features[GNLD_DPM_UVD].enabled &&
-- 
2.20.1



Re: [PATCH] drm/amd/powerplay: drop the unnecessary uclk hard min setting

2019-01-07 Thread Alex Deucher
Acked-by: Alex Deucher 

On Mon, Jan 7, 2019 at 11:28 PM Evan Quan  wrote:
>
> Since the soft min setting is enough, the hard min setting is redundant.
>
> Change-Id: I758386085f227bad94148ec0b38776312b6f5b25
> Reported-by: Likun Gao 
> Signed-off-by: Evan Quan 
> ---
>  drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 7 ---
>  1 file changed, 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c 
> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> index c2061d351d04..82935a3bd950 100644
> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> @@ -1692,13 +1692,6 @@ static int vega20_upload_dpm_min_level(struct pp_hwmgr 
> *hwmgr, uint32_t feature_
> (PPCLK_UCLK << 16) | (min_freq & 
> 0x))),
> "Failed to set soft min memclk !",
> return ret);
> -
> -   min_freq = data->dpm_table.mem_table.dpm_state.hard_min_level;
> -   PP_ASSERT_WITH_CODE(!(ret = 
> smum_send_msg_to_smc_with_parameter(
> -   hwmgr, PPSMC_MSG_SetHardMinByFreq,
> -   (PPCLK_UCLK << 16) | (min_freq & 
> 0x))),
> -   "Failed to set hard min memclk !",
> -   return ret);
> }
>
> if (data->smu_features[GNLD_DPM_UVD].enabled &&
> --
> 2.20.1
>


[PATCH] drm/amd/powerplay: drop the unnecessary uclk hard min setting

2019-01-07 Thread Evan Quan
Since the soft min setting is enough, the hard min setting is redundant.

Change-Id: I758386085f227bad94148ec0b38776312b6f5b25
Reported-by: Likun Gao 
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c 
b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
index c2061d351d04..82935a3bd950 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
@@ -1692,13 +1692,6 @@ static int vega20_upload_dpm_min_level(struct pp_hwmgr 
*hwmgr, uint32_t feature_
(PPCLK_UCLK << 16) | (min_freq & 
0x))),
"Failed to set soft min memclk !",
return ret);
-
-   min_freq = data->dpm_table.mem_table.dpm_state.hard_min_level;
-   PP_ASSERT_WITH_CODE(!(ret = smum_send_msg_to_smc_with_parameter(
-   hwmgr, PPSMC_MSG_SetHardMinByFreq,
-   (PPCLK_UCLK << 16) | (min_freq & 
0x))),
-   "Failed to set hard min memclk !",
-   return ret);
}
 
if (data->smu_features[GNLD_DPM_UVD].enabled &&
-- 
2.20.1



RE: [PATCH libdrm] amdgpu: add a faster BO list API

2019-01-07 Thread Zhou, David(ChunMing)
Looks good to me, Reviewed-by: Chunming Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of
> Marek Olšák
> Sent: Tuesday, January 08, 2019 3:31 AM
> To: amd-gfx@lists.freedesktop.org
> Subject: [PATCH libdrm] amdgpu: add a faster BO list API
> 
> From: Marek Olšák 
> 
> ---
>  amdgpu/amdgpu-symbol-check |  3 ++
>  amdgpu/amdgpu.h| 56
> +-
>  amdgpu/amdgpu_bo.c | 36 
>  amdgpu/amdgpu_cs.c | 25 +
>  4 files changed, 119 insertions(+), 1 deletion(-)
> 
> diff --git a/amdgpu/amdgpu-symbol-check b/amdgpu/amdgpu-symbol-
> check index 6f5e0f95..96a44b40 100755
> --- a/amdgpu/amdgpu-symbol-check
> +++ b/amdgpu/amdgpu-symbol-check
> @@ -12,20 +12,22 @@ _edata
>  _end
>  _fini
>  _init
>  amdgpu_bo_alloc
>  amdgpu_bo_cpu_map
>  amdgpu_bo_cpu_unmap
>  amdgpu_bo_export
>  amdgpu_bo_free
>  amdgpu_bo_import
>  amdgpu_bo_inc_ref
> +amdgpu_bo_list_create_raw
> +amdgpu_bo_list_destroy_raw
>  amdgpu_bo_list_create
>  amdgpu_bo_list_destroy
>  amdgpu_bo_list_update
>  amdgpu_bo_query_info
>  amdgpu_bo_set_metadata
>  amdgpu_bo_va_op
>  amdgpu_bo_va_op_raw
>  amdgpu_bo_wait_for_idle
>  amdgpu_create_bo_from_user_mem
>  amdgpu_cs_chunk_fence_info_to_data
> @@ -40,20 +42,21 @@ amdgpu_cs_destroy_semaphore
> amdgpu_cs_destroy_syncobj  amdgpu_cs_export_syncobj
> amdgpu_cs_fence_to_handle  amdgpu_cs_import_syncobj
> amdgpu_cs_query_fence_status  amdgpu_cs_query_reset_state
> amdgpu_query_sw_info  amdgpu_cs_signal_semaphore
> amdgpu_cs_submit  amdgpu_cs_submit_raw
> +amdgpu_cs_submit_raw2
>  amdgpu_cs_syncobj_export_sync_file
>  amdgpu_cs_syncobj_import_sync_file
>  amdgpu_cs_syncobj_reset
>  amdgpu_cs_syncobj_signal
>  amdgpu_cs_syncobj_wait
>  amdgpu_cs_wait_fences
>  amdgpu_cs_wait_semaphore
>  amdgpu_device_deinitialize
>  amdgpu_device_initialize
>  amdgpu_find_bo_by_cpu_mapping
> diff --git a/amdgpu/amdgpu.h b/amdgpu/amdgpu.h index
> dc51659a..5b800033 100644
> --- a/amdgpu/amdgpu.h
> +++ b/amdgpu/amdgpu.h
> @@ -35,20 +35,21 @@
>  #define _AMDGPU_H_
> 
>  #include 
>  #include 
> 
>  #ifdef __cplusplus
>  extern "C" {
>  #endif
> 
>  struct drm_amdgpu_info_hw_ip;
> +struct drm_amdgpu_bo_list_entry;
> 
>  
>  /*--*/
>  /* --- Defines  */
>  /*--*/
> 
>  /**
>   * Define max. number of Command Buffers (IB) which could be sent to the
> single
>   * hardware IP to accommodate CE/DE requirements
>   *
>   * \sa amdgpu_cs_ib_info
> @@ -767,34 +768,65 @@ int amdgpu_bo_cpu_unmap(amdgpu_bo_handle
> buf_handle);
>   *and no GPU access is scheduled.
>   *  1 GPU access is in fly or scheduled
>   *
>   * \return   0 - on success
>   *  <0 - Negative POSIX Error code
>   */
>  int amdgpu_bo_wait_for_idle(amdgpu_bo_handle buf_handle,
>   uint64_t timeout_ns,
>   bool *buffer_busy);
> 
> +/**
> + * Creates a BO list handle for command submission.
> + *
> + * \param   dev  - \c [in] Device handle.
> + *  See #amdgpu_device_initialize()
> + * \param   number_of_buffers- \c [in] Number of BOs in the list
> + * \param   buffers  - \c [in] List of BO handles
> + * \param   result   - \c [out] Created BO list handle
> + *
> + * \return   0 on success\n
> + *  <0 - Negative POSIX Error code
> + *
> + * \sa amdgpu_bo_list_destroy_raw()
> +*/
> +int amdgpu_bo_list_create_raw(amdgpu_device_handle dev,
> +   uint32_t number_of_buffers,
> +   struct drm_amdgpu_bo_list_entry *buffers,
> +   uint32_t *result);
> +
> +/**
> + * Destroys a BO list handle.
> + *
> + * \param   bo_list  - \c [in] BO list handle.
> + *
> + * \return   0 on success\n
> + *  <0 - Negative POSIX Error code
> + *
> + * \sa amdgpu_bo_list_create_raw(), amdgpu_cs_submit_raw2() */ int
> +amdgpu_bo_list_destroy_raw(amdgpu_device_handle dev, uint32_t
> bo_list);
> +
>  /**
>   * Creates a BO list handle for command submission.
>   *
>   * \param   dev  - \c [in] Device handle.
>   *  See #amdgpu_device_initialize()
>   * \param   number_of_resources  - \c [in] Number of BOs in the list
>   * \param   resources- \c [in] List of BO handles
>   * \param   resource_prios   - \c [in] Optional priority for each handle
>   * \param   result   - \c [out] Created BO list handle
>   *
>   * \return   0 on success\n
>   *  <0 - Negative POSIX Error code
>   *
> - * \sa amdgpu_bo_list_destroy()
> + * \sa amdgpu_bo_list_destroy(), amdgpu_cs_submit_raw2()
>  */
>  int amdgpu_bo_list_create(amdgpu_device_handle dev,
> 

[PATCH v6 1/2] drm/amd: validate user pitch alignment

2019-01-07 Thread Yu Zhao
Userspace may request a pitch alignment that is not supported by the GPU.
Some userspace requests 32, but the GPU ignores it and uses the default
of 64 when cpp is 4. If the GEM object is allocated based on the smaller
alignment, GPU DMA will go out of bounds.

Cc: sta...@vger.kernel.org # v4.2+
Signed-off-by: Yu Zhao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index 15ce7e681d67..de9f198d5371 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -527,6 +527,16 @@ amdgpu_display_user_framebuffer_create(struct drm_device 
*dev,
struct drm_gem_object *obj;
struct amdgpu_framebuffer *amdgpu_fb;
int ret;
+   struct amdgpu_device *adev = dev->dev_private;
+   int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
+   int pitch = mode_cmd->pitches[0] / cpp;
+
+   pitch = amdgpu_align_pitch(adev, pitch, cpp, false);
+   if (mode_cmd->pitches[0] != pitch) {
+   DRM_DEBUG_KMS("Invalid pitch: expecting %d but got %d\n",
+ pitch, mode_cmd->pitches[0]);
+   return ERR_PTR(-EINVAL);
+   }
 
obj = drm_gem_object_lookup(file_priv, mode_cmd->handles[0]);
if (obj ==  NULL) {
-- 
2.20.1.97.g81188d93c3-goog
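
For illustration, a hypothetical userspace request that this check would now
reject; the numbers, the helper name and the headers are assumptions made for
the sketch, not part of the patch:

#include <stdint.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/* DRM_FORMAT_XRGB8888 has cpp = 4 and amdgpu aligns the pitch to 64 pixels
 * for that cpp, so a tightly packed 990-pixel-wide framebuffer
 * (pitch = 990 * 4 = 3960 bytes) is under-aligned; the kernel now returns
 * -EINVAL instead of letting GPU DMA run past the end of the GEM object. */
static int add_underaligned_fb(int fd, uint32_t gem_handle, uint32_t *fb_id)
{
        uint32_t handles[4] = { gem_handle };
        uint32_t pitches[4] = { 990 * 4 };  /* expected: ALIGN(990, 64) * 4 = 4096 */
        uint32_t offsets[4] = { 0 };

        return drmModeAddFB2(fd, 990, 1080, DRM_FORMAT_XRGB8888,
                             handles, pitches, offsets, fb_id, 0);
}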



[PATCH v6 2/2] drm/amd: validate user GEM object size

2019-01-07 Thread Yu Zhao
When creating a frame buffer, userspace may request to attach it to a
previously allocated GEM object that is smaller than what the GPU
requires. Validation must be done to prevent out-of-bounds DMA,
which could otherwise be exploited to reveal sensitive data.

This fix is not done in a common code path because individual
drivers might have different requirements.

Cc: sta...@vger.kernel.org # v4.2+
Signed-off-by: Yu Zhao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index de9f198d5371..32b7648b7ef4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -527,6 +527,7 @@ amdgpu_display_user_framebuffer_create(struct drm_device 
*dev,
struct drm_gem_object *obj;
struct amdgpu_framebuffer *amdgpu_fb;
int ret;
+   int height;
struct amdgpu_device *adev = dev->dev_private;
int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
int pitch = mode_cmd->pitches[0] / cpp;
@@ -551,6 +552,13 @@ amdgpu_display_user_framebuffer_create(struct drm_device 
*dev,
return ERR_PTR(-EINVAL);
}
 
+   height = ALIGN(mode_cmd->height, 8);
+   if (obj->size < pitch * height) {
+   DRM_DEBUG_KMS("Invalid GEM size: expecting >= %d but got %zu\n",
+ pitch * height, obj->size);
+   return ERR_PTR(-EINVAL);
+   }
+
amdgpu_fb = kzalloc(sizeof(*amdgpu_fb), GFP_KERNEL);
if (amdgpu_fb == NULL) {
drm_gem_object_put_unlocked(obj);
-- 
2.20.1.97.g81188d93c3-goog
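
As a rough worked example with hypothetical numbers: for a 4096-byte pitch and
a mode height of 1080, the framebuffer needs at least
4096 * ALIGN(1080, 8) = 4,423,680 bytes, so attaching a 4 MiB (4,194,304-byte)
GEM object is now rejected with -EINVAL instead of letting scanout DMA read
past the end of the buffer.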



Re: [PATCH] drm/amd/powerplay: avoid possible buffer overflow

2019-01-07 Thread Alex Deucher
Reviewed-by: Alex Deucher 

On Mon, Jan 7, 2019 at 9:36 PM Evan Quan  wrote:
>
> Make sure the clock level enforced is within the allowed
> ranges.
>
> Change-Id: If69a8512121c0c94818ab698595502e17569d4c7
> Signed-off-by: Evan Quan 
> ---
>  drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c 
> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> index 8f6097c6a02b..c2061d351d04 100644
> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
> @@ -2251,6 +2251,13 @@ static int vega20_force_clock_level(struct pp_hwmgr 
> *hwmgr,
> soft_min_level = mask ? (ffs(mask) - 1) : 0;
> soft_max_level = mask ? (fls(mask) - 1) : 0;
>
> +   if (soft_max_level >= data->dpm_table.gfx_table.count) {
> +   pr_err("Clock level specified %d is over max allowed 
> %d\n",
> +   soft_max_level,
> +   data->dpm_table.gfx_table.count - 1);
> +   return -EINVAL;
> +   }
> +
> data->dpm_table.gfx_table.dpm_state.soft_min_level =
> 
> data->dpm_table.gfx_table.dpm_levels[soft_min_level].value;
> data->dpm_table.gfx_table.dpm_state.soft_max_level =
> @@ -2271,6 +2278,13 @@ static int vega20_force_clock_level(struct pp_hwmgr 
> *hwmgr,
> soft_min_level = mask ? (ffs(mask) - 1) : 0;
> soft_max_level = mask ? (fls(mask) - 1) : 0;
>
> +   if (soft_max_level >= data->dpm_table.mem_table.count) {
> +   pr_err("Clock level specified %d is over max allowed 
> %d\n",
> +   soft_max_level,
> +   data->dpm_table.mem_table.count - 1);
> +   return -EINVAL;
> +   }
> +
> data->dpm_table.mem_table.dpm_state.soft_min_level =
> 
> data->dpm_table.mem_table.dpm_levels[soft_min_level].value;
> data->dpm_table.mem_table.dpm_state.soft_max_level =
> --
> 2.20.1
>


[PATCH] drm/amd/powerplay: avoid possible buffer overflow

2019-01-07 Thread Evan Quan
Make sure the clock level enforced is within the allowed
ranges.

Change-Id: If69a8512121c0c94818ab698595502e17569d4c7
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c 
b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
index 8f6097c6a02b..c2061d351d04 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
@@ -2251,6 +2251,13 @@ static int vega20_force_clock_level(struct pp_hwmgr 
*hwmgr,
soft_min_level = mask ? (ffs(mask) - 1) : 0;
soft_max_level = mask ? (fls(mask) - 1) : 0;
 
+   if (soft_max_level >= data->dpm_table.gfx_table.count) {
+   pr_err("Clock level specified %d is over max allowed 
%d\n",
+   soft_max_level,
+   data->dpm_table.gfx_table.count - 1);
+   return -EINVAL;
+   }
+
data->dpm_table.gfx_table.dpm_state.soft_min_level =

data->dpm_table.gfx_table.dpm_levels[soft_min_level].value;
data->dpm_table.gfx_table.dpm_state.soft_max_level =
@@ -2271,6 +2278,13 @@ static int vega20_force_clock_level(struct pp_hwmgr 
*hwmgr,
soft_min_level = mask ? (ffs(mask) - 1) : 0;
soft_max_level = mask ? (fls(mask) - 1) : 0;
 
+   if (soft_max_level >= data->dpm_table.mem_table.count) {
+   pr_err("Clock level specified %d is over max allowed 
%d\n",
+   soft_max_level,
+   data->dpm_table.mem_table.count - 1);
+   return -EINVAL;
+   }
+
data->dpm_table.mem_table.dpm_state.soft_min_level =

data->dpm_table.mem_table.dpm_levels[soft_min_level].value;
data->dpm_table.mem_table.dpm_state.soft_max_level =
-- 
2.20.1
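
To see why the bound check matters, here is a small standalone sketch of the
same arithmetic; the level index, the table size, and the assumption that
writing "7" to pp_dpm_sclk produces the bitmask below are all hypothetical:

#include <stdio.h>
#include <strings.h>

/* Userspace stand-in for the kernel fls() used in vega20_force_clock_level(). */
static int fls_u32(unsigned int x) { return x ? 32 - __builtin_clz(x) : 0; }

int main(void)
{
        unsigned int mask = 1u << 7;    /* e.g. "echo 7 > pp_dpm_sclk" */
        unsigned int count = 3;         /* hypothetical gfx_table.count */
        unsigned int soft_min_level = mask ? (ffs(mask) - 1) : 0;       /* 7 */
        unsigned int soft_max_level = mask ? (fls_u32(mask) - 1) : 0;   /* 7 */

        /* Without the patch, dpm_levels[7] would be read from a 3-entry
         * table; with it, the request is rejected with -EINVAL. */
        printf("min %u max %u, allowed max index %u -> %s\n",
               soft_min_level, soft_max_level, count - 1,
               soft_max_level >= count ? "rejected" : "accepted");
        return 0;
}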



Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v5

2019-01-07 Thread Yang, Philip


On 2019-01-07 9:21 a.m., Christian König wrote:
> Am 14.12.18 um 22:10 schrieb Yang, Philip:
>> Use HMM helper function hmm_vma_fault() to get physical pages backing
>> userptr and start CPU page table update track of those pages. Then use
>> hmm_vma_range_done() to check if those pages are updated before
>> amdgpu_cs_submit for gfx or before user queues are resumed for kfd.
>>
>> If userptr pages are updated, for gfx, amdgpu_cs_ioctl will restart
>> from scratch, for kfd, restore worker is rescheduled to retry.
>>
>> HMM simplifies the concurrent CPU page table update check, so remove
>> the guptasklock, mmu_invalidations and last_set_pages fields from
>> the amdgpu_ttm_tt struct.
>>
>> HMM does not pin the page (increase page ref count), so remove related
>> operations like release_pages(), put_page(), mark_page_dirty().
>>
>> Change-Id: I2e8c0c6f0d8c21e5596a32d7fc91762778bc9e67
>> Signed-off-by: Philip Yang 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |   1 -
>>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  95 +++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c   |   2 -
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h   |   3 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c    | 185 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  14 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c    |  25 ++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h    |   4 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 168 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |   4 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    |   1 -
>>   11 files changed, 209 insertions(+), 293 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 70429f7aa9a8..717791d4fa45 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -62,7 +62,6 @@ struct kgd_mem {
>>   atomic_t invalid;
>>   struct amdkfd_process_info *process_info;
>> -    struct page **user_pages;
>>   struct amdgpu_sync sync;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index be1ab43473c6..2897793600f7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -582,28 +582,12 @@ static int init_user_pages(struct kgd_mem *mem, 
>> struct mm_struct *mm,
>>   goto out;
>>   }
>> -    /* If no restore worker is running concurrently, user_pages
>> - * should not be allocated
>> - */
>> -    WARN(mem->user_pages, "Leaking user_pages array");
>> -
>> -    mem->user_pages = kvmalloc_array(bo->tbo.ttm->num_pages,
>> -   sizeof(struct page *),
>> -   GFP_KERNEL | __GFP_ZERO);
>> -    if (!mem->user_pages) {
>> -    pr_err("%s: Failed to allocate pages array\n", __func__);
>> -    ret = -ENOMEM;
>> -    goto unregister_out;
>> -    }
>> -
>> -    ret = amdgpu_ttm_tt_get_user_pages(bo->tbo.ttm, mem->user_pages);
>> +    ret = amdgpu_ttm_tt_get_user_pages(bo->tbo.ttm, bo->tbo.ttm->pages);
>>   if (ret) {
>>   pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
>> -    goto free_out;
>> +    goto unregister_out;
>>   }
>> -    amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm, mem->user_pages);
>> -
>>   ret = amdgpu_bo_reserve(bo, true);
>>   if (ret) {
>>   pr_err("%s: Failed to reserve BO\n", __func__);
>> @@ -616,11 +600,7 @@ static int init_user_pages(struct kgd_mem *mem, 
>> struct mm_struct *mm,
>>   amdgpu_bo_unreserve(bo);
>>   release_out:
>> -    if (ret)
>> -    release_pages(mem->user_pages, bo->tbo.ttm->num_pages);
>> -free_out:
>> -    kvfree(mem->user_pages);
>> -    mem->user_pages = NULL;
>> +    amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>>   unregister_out:
>>   if (ret)
>>   amdgpu_mn_unregister(bo);
>> @@ -679,7 +659,6 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
>>   ctx->kfd_bo.priority = 0;
>>   ctx->kfd_bo.tv.bo = >tbo;
>>   ctx->kfd_bo.tv.num_shared = 1;
>> -    ctx->kfd_bo.user_pages = NULL;
>>   list_add(>kfd_bo.tv.head, >list);
>>   amdgpu_vm_get_pd_bo(vm, >list, >vm_pd[0]);
>> @@ -743,7 +722,6 @@ static int reserve_bo_and_cond_vms(struct kgd_mem 
>> *mem,
>>   ctx->kfd_bo.priority = 0;
>>   ctx->kfd_bo.tv.bo = >tbo;
>>   ctx->kfd_bo.tv.num_shared = 1;
>> -    ctx->kfd_bo.user_pages = NULL;
>>   list_add(>kfd_bo.tv.head, >list);
>>   i = 0;
>> @@ -1371,15 +1349,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>   list_del(_list_entry->head);
>>   mutex_unlock(_info->lock);
>> -    /* Free user pages if necessary */
>> -    if (mem->user_pages) {
>> -    pr_debug("%s: Freeing user_pages array\n", __func__);
>> -    if (mem->user_pages[0])
>> -    release_pages(mem->user_pages,
>> - 
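
In rough pseudocode, the submission flow the commit message describes looks
like the sketch below. It reuses the helper names from the diff but is not
the actual amdgpu_cs code, and it assumes amdgpu_ttm_tt_get_user_pages_done()
reports whether the pages are still valid:

        /* Illustrative fragment only, not the real amdgpu_cs_ioctl() path. */
retry:
        r = amdgpu_ttm_tt_get_user_pages(bo->tbo.ttm, bo->tbo.ttm->pages);
        if (r)
                return r;

        /* ... validate BOs, build and schedule the job ... */

        if (!amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm)) {
                /* A concurrent CPU page table update invalidated the pages:
                 * for gfx, restart the submission from scratch; for kfd,
                 * reschedule the restore worker and retry. */
                goto retry;
        }

        /* Commit the submission. No put_page()/release_pages() afterwards,
         * since HMM did not take a reference on the pages. */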

[PATCH 1/3] drm/amdgpu: Don't ignore rc from drm_dp_mst_topology_mgr_resume()

2019-01-07 Thread Lyude Paul
drm_dp_mst_topology_mgr_resume() returns whether or not it managed to
find the topology in question after a suspend/resume cycle, and the
driver is supposed to check this value and disable MST accordingly if
it's gone, in addition to sending a hotplug event to notify userspace
that something changed during suspend.

Currently, amdgpu makes the mistake of ignoring the return code
from drm_dp_mst_topology_mgr_resume(), which means that if a topology was
removed during suspend, amdgpu never notices and assumes it's still
connected, which leads to all sorts of problems.

So, fix this by actually checking the rc from
drm_dp_mst_topology_mgr_resume(). Also, reformat the rest of the
function while we're at it to fix the over-indenting.

Signed-off-by: Lyude Paul 
Cc: Jerry Zuo 
Cc:  # v4.15+
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 32 +--
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 8a626d16e8e3..3f326a2c513b 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -699,22 +699,36 @@ static void s3_handle_mst(struct drm_device *dev, bool 
suspend)
 {
struct amdgpu_dm_connector *aconnector;
struct drm_connector *connector;
+   struct drm_dp_mst_topology_mgr *mgr;
+   int ret;
+   bool need_hotplug = false;
 
drm_modeset_lock(>mode_config.connection_mutex, NULL);
 
-   list_for_each_entry(connector, >mode_config.connector_list, head) {
-  aconnector = to_amdgpu_dm_connector(connector);
-  if (aconnector->dc_link->type == dc_connection_mst_branch &&
-  !aconnector->mst_port) {
+   list_for_each_entry(connector, >mode_config.connector_list,
+   head) {
+   aconnector = to_amdgpu_dm_connector(connector);
+   if (aconnector->dc_link->type != dc_connection_mst_branch ||
+   aconnector->mst_port)
+   continue;
+
+   mgr = >mst_mgr;
 
-  if (suspend)
-  
drm_dp_mst_topology_mgr_suspend(>mst_mgr);
-  else
-  
drm_dp_mst_topology_mgr_resume(>mst_mgr);
-  }
+   if (suspend) {
+   drm_dp_mst_topology_mgr_suspend(mgr);
+   } else {
+   ret = drm_dp_mst_topology_mgr_resume(mgr);
+   if (ret < 0) {
+   drm_dp_mst_topology_mgr_set_mst(mgr, false);
+   need_hotplug = true;
+   }
+   }
}
 
drm_modeset_unlock(>mode_config.connection_mutex);
+
+   if (need_hotplug)
+   drm_kms_helper_hotplug_event(dev);
 }
 
 /**
-- 
2.20.1



[PATCH 2/3] drm/amdgpu: Don't fail resume process if resuming atomic state fails

2019-01-07 Thread Lyude Paul
This is an ugly one unfortunately. Currently, all DRM drivers supporting
atomic modesetting will save the state that userspace had set before
suspending, then attempt to restore that state on resume. This probably
worked very well at one point, like many other things, until DP MST came
into the picture. While it's easy to restore state on normal display
connectors that were disconnected during suspend regardless of their
state post-resume, this can't really be done with MST because of the
fact that setting up a downstream sink requires performing sideband
transactions between the source and the MST hub, sending out the ACT
packets, etc.

Because of this, there isn't really a guarantee that we can restore the
atomic state we had before suspend once we've resumed. This sucks pretty
bad, but so far I haven't run into any compositors that this actually
causes serious issues with. Most compositors will notice the hotplug we
send afterwards, and then reprobe state.

Since nouveau and i915 also don't fail the suspend/resume process due to
failing to restore the atomic state, let's make amdgpu match this
behavior. Better to resume the GPU properly than to stop the process
halfway because of a potentially unavoidable atomic commit failure.

Eventually, we'll have a real fix for this problem on the DRM level. But
we've got some more important low-hanging fruit to deal with first.

Signed-off-by: Lyude Paul 
Cc: Jerry Zuo 
Cc:  # v4.15+
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 3f326a2c513b..a3e65e457348 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -912,7 +912,6 @@ static int dm_resume(void *handle)
struct drm_plane_state *new_plane_state;
struct dm_plane_state *dm_new_plane_state;
enum dc_connection_type new_connection_type = dc_connection_none;
-   int ret;
int i;
 
/* power on hardware */
@@ -985,13 +984,13 @@ static int dm_resume(void *handle)
}
}
 
-   ret = drm_atomic_helper_resume(ddev, dm->cached_state);
+   drm_atomic_helper_resume(ddev, dm->cached_state);
 
dm->cached_state = NULL;
 
amdgpu_dm_irq_resume_late(adev);
 
-   return ret;
+   return 0;
 }
 
 /**
-- 
2.20.1



[PATCH 0/3] drm/amdgpu: Fix suspend/resume issues with MST

2019-01-07 Thread Lyude Paul
Fix the suspend/resume issues that Jerry Zuo found in amdgpu, and add some
compiler warnings for drivers ignoring the return code of
drm_dp_mst_topology_mgr_resume() to help ensure we don't need to fix
this again in the future for someone else's driver.

Cc: Jerry Zuo 

Lyude Paul (3):
  drm/amdgpu: Don't ignore rc from drm_dp_mst_topology_mgr_resume()
  drm/amdgpu: Don't fail resume process if resuming atomic state fails
  drm/dp_mst: Add __must_check to drm_dp_mst_topology_mgr_resume()

 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 37 +--
 drivers/gpu/drm/drm_dp_mst_topology.c |  3 +-
 include/drm/drm_dp_mst_helper.h   |  3 +-
 3 files changed, 29 insertions(+), 14 deletions(-)

-- 
2.20.1



[PATCH 3/3] drm/dp_mst: Add __must_check to drm_dp_mst_topology_mgr_resume()

2019-01-07 Thread Lyude Paul
Since I've had to fix two cases of drivers not checking the return code
from this function, let's make the compiler complain so this doesn't
come up again in the future.

Signed-off-by: Lyude Paul 
Cc: Jerry Zuo 
Cc: Daniel Vetter 
---
 drivers/gpu/drm/drm_dp_mst_topology.c | 3 ++-
 include/drm/drm_dp_mst_helper.h   | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c 
b/drivers/gpu/drm/drm_dp_mst_topology.c
index 2ab16c9e6243..13a7e497bfe2 100644
--- a/drivers/gpu/drm/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/drm_dp_mst_topology.c
@@ -2226,7 +2226,8 @@ EXPORT_SYMBOL(drm_dp_mst_topology_mgr_suspend);
  * if the device fails this returns -1, and the driver should do
  * a full MST reprobe, in case we were undocked.
  */
-int drm_dp_mst_topology_mgr_resume(struct drm_dp_mst_topology_mgr *mgr)
+int __must_check
+drm_dp_mst_topology_mgr_resume(struct drm_dp_mst_topology_mgr *mgr)
 {
int ret = 0;
 
diff --git a/include/drm/drm_dp_mst_helper.h b/include/drm/drm_dp_mst_helper.h
index 371cc2816477..4355b55d0081 100644
--- a/include/drm/drm_dp_mst_helper.h
+++ b/include/drm/drm_dp_mst_helper.h
@@ -614,7 +614,8 @@ void drm_dp_mst_dump_topology(struct seq_file *m,
  struct drm_dp_mst_topology_mgr *mgr);
 
 void drm_dp_mst_topology_mgr_suspend(struct drm_dp_mst_topology_mgr *mgr);
-int drm_dp_mst_topology_mgr_resume(struct drm_dp_mst_topology_mgr *mgr);
+int __must_check
+drm_dp_mst_topology_mgr_resume(struct drm_dp_mst_topology_mgr *mgr);
 struct drm_dp_mst_topology_state *drm_atomic_get_mst_topology_state(struct 
drm_atomic_state *state,
struct 
drm_dp_mst_topology_mgr *mgr);
 int drm_dp_atomic_find_vcpi_slots(struct drm_atomic_state *state,
-- 
2.20.1
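
For reference, __must_check expands to __attribute__((warn_unused_result)) in
the kernel, so a caller that keeps dropping the return value now gets a
diagnostic roughly like the one sketched here (wording approximate):

        /* old amdgpu pattern, after this patch: */
        drm_dp_mst_topology_mgr_resume(&aconnector->mst_mgr);
        /* warning: ignoring return value of 'drm_dp_mst_topology_mgr_resume',
         * declared with attribute warn_unused_result [-Wunused-result] */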



[PATCH] drm/amdkfd: Allow building KFD on ARM64 (v2)

2019-01-07 Thread Kuehling, Felix
ifdef x86_64 specific code.
Allow enabling CONFIG_HSA_AMD on ARM64.

v2: Fixed a compiler warning due to an unused variable

CC: Mark Nutter 
Signed-off-by: Felix Kuehling 
Acked-by: Alex Deucher 
Acked-by: Christian König 
---
 drivers/gpu/drm/amd/amdkfd/Kconfig| 4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 8 
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 9 +
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig 
b/drivers/gpu/drm/amd/amdkfd/Kconfig
index fbf0ee5..c361360 100644
--- a/drivers/gpu/drm/amd/amdkfd/Kconfig
+++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
@@ -4,8 +4,8 @@
 
 config HSA_AMD
bool "HSA kernel driver for AMD GPU devices"
-   depends on DRM_AMDGPU && X86_64
-   imply AMD_IOMMU_V2
+   depends on DRM_AMDGPU && (X86_64 || ARM64)
+   imply AMD_IOMMU_V2 if X86_64
select MMU_NOTIFIER
help
  Enable this if you want to use HSA features on AMD GPU devices.
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index c02adbb..d7ddb35 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -863,6 +863,7 @@ static int kfd_fill_mem_info_for_cpu(int numa_node_id, int 
*avail_size,
return 0;
 }
 
+#if CONFIG_X86_64
 static int kfd_fill_iolink_info_for_cpu(int numa_node_id, int *avail_size,
uint32_t *num_entries,
struct crat_subtype_iolink *sub_type_hdr)
@@ -905,6 +906,7 @@ static int kfd_fill_iolink_info_for_cpu(int numa_node_id, 
int *avail_size,
 
return 0;
 }
+#endif
 
 /* kfd_create_vcrat_image_cpu - Create Virtual CRAT for CPU
  *
@@ -920,7 +922,9 @@ static int kfd_create_vcrat_image_cpu(void *pcrat_image, 
size_t *size)
struct crat_subtype_generic *sub_type_hdr;
int avail_size = *size;
int numa_node_id;
+#ifdef CONFIG_X86_64
uint32_t entries = 0;
+#endif
int ret = 0;
 
if (!pcrat_image || avail_size < VCRAT_SIZE_FOR_CPU)
@@ -982,6 +986,7 @@ static int kfd_create_vcrat_image_cpu(void *pcrat_image, 
size_t *size)
sub_type_hdr->length);
 
/* Fill in Subtype: IO Link */
+#ifdef CONFIG_X86_64
ret = kfd_fill_iolink_info_for_cpu(numa_node_id, _size,
,
(struct crat_subtype_iolink *)sub_type_hdr);
@@ -992,6 +997,9 @@ static int kfd_create_vcrat_image_cpu(void *pcrat_image, 
size_t *size)
 
sub_type_hdr = (typeof(sub_type_hdr))((char *)sub_type_hdr +
sub_type_hdr->length * entries);
+#else
+   pr_info("IO link not available for non x86 platforms\n");
+#endif
 
crat_table->num_domains++;
}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 5f5b2ac..4734f1a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1392,7 +1392,6 @@ int kfd_topology_enum_kfd_devices(uint8_t idx, struct 
kfd_dev **kdev)
 
 static int kfd_cpumask_to_apic_id(const struct cpumask *cpumask)
 {
-   const struct cpuinfo_x86 *cpuinfo;
int first_cpu_of_numa_node;
 
if (!cpumask || cpumask == cpu_none_mask)
@@ -1400,9 +1399,11 @@ static int kfd_cpumask_to_apic_id(const struct cpumask 
*cpumask)
first_cpu_of_numa_node = cpumask_first(cpumask);
if (first_cpu_of_numa_node >= nr_cpu_ids)
return -1;
-   cpuinfo = _data(first_cpu_of_numa_node);
-
-   return cpuinfo->apicid;
+#ifdef CONFIG_X86_64
+   return cpu_data(first_cpu_of_numa_node).apicid;
+#else
+   return first_cpu_of_numa_node;
+#endif
 }
 
 /* kfd_numa_node_to_apic_id - Returns the APIC ID of the first logical 
processor
-- 
2.7.4



Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v3)

2019-01-07 Thread Grodzovsky, Andrey
Reviewed-by: Andrey Grodzovsky 

Andrey


On 01/07/2019 12:31 PM, StDenis, Tom wrote:
> v2: Move locks around in other functions so that this
> function can stand on its own.  Also only hold the hive
> specific lock for add/remove device instead of the driver
> global lock so you can't add/remove devices in parallel from
> one hive.
>
> v3: add reset_lock
>
> Signed-off-by: Tom St Denis 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 41 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  5 +--
>   3 files changed, 33 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 39d5d058b2c7..1a558dc41ba6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3525,9 +3525,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>* by different nodes. No point also since the one node already 
> executing
>* reset will also reset all the other nodes in the hive.
>*/
> - hive = amdgpu_get_xgmi_hive(adev);
> + hive = amdgpu_get_xgmi_hive(adev, 0);
>   if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
> - !mutex_trylock(>hive_lock))
> + !mutex_trylock(>reset_lock))
>   return 0;
>   
>   /* Start with adev pre asic reset first for soft reset check.*/
> @@ -3606,7 +3606,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   }
>   
>   if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
> - mutex_unlock(>hive_lock);
> + mutex_unlock(>reset_lock);
>   
>   if (r)
>   dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 8a8bc60cb6b4..9e98ab8b1525 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -40,26 +40,40 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
> *hive)
>   return >device_list;
>   }
>   
> -struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
> +struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev, 
> int lock)
>   {
>   int i;
>   struct amdgpu_hive_info *tmp;
>   
>   if (!adev->gmc.xgmi.hive_id)
>   return NULL;
> +
> + mutex_lock(_mutex);
> +
>   for (i = 0 ; i < hive_count; ++i) {
>   tmp = _hives[i];
> - if (tmp->hive_id == adev->gmc.xgmi.hive_id)
> + if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
> + if (lock)
> + mutex_lock(>hive_lock);
> + mutex_unlock(_mutex);
>   return tmp;
> + }
>   }
> - if (i >= AMDGPU_MAX_XGMI_HIVE)
> + if (i >= AMDGPU_MAX_XGMI_HIVE) {
> + mutex_unlock(_mutex);
>   return NULL;
> + }
>   
>   /* initialize new hive if not exist */
>   tmp = _hives[hive_count++];
>   tmp->hive_id = adev->gmc.xgmi.hive_id;
>   INIT_LIST_HEAD(>device_list);
>   mutex_init(>hive_lock);
> + mutex_init(>reset_lock);
> + if (lock)
> + mutex_lock(>hive_lock);
> +
> + mutex_unlock(_mutex);
>   
>   return tmp;
>   }
> @@ -111,8 +125,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>   return ret;
>   }
>   
> - mutex_lock(_mutex);
> - hive = amdgpu_get_xgmi_hive(adev);
> + /* find hive and take lock */
> + hive = amdgpu_get_xgmi_hive(adev, 1);
>   if (!hive)
>   goto exit;
>   
> @@ -142,8 +156,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>   break;
>   }
>   
> + mutex_unlock(>hive_lock);
>   exit:
> - mutex_unlock(_mutex);
>   return ret;
>   }
>   
> @@ -154,15 +168,14 @@ void amdgpu_xgmi_remove_device(struct amdgpu_device 
> *adev)
>   if (!adev->gmc.xgmi.supported)
>   return;
>   
> - mutex_lock(_mutex);
> -
> - hive = amdgpu_get_xgmi_hive(adev);
> + hive = amdgpu_get_xgmi_hive(adev, 1);
>   if (!hive)
> - goto exit;
> + return;
>   
> - if (!(hive->number_devices--))
> + if (!(hive->number_devices--)) {
>   mutex_destroy(>hive_lock);
> -
> -exit:
> - mutex_unlock(_mutex);
> + mutex_destroy(>reset_lock);
> + } else {
> + mutex_unlock(>hive_lock);
> + }
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> index 6151eb9c8ad3..14bc60664159 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> @@ -29,10 +29,11 @@ struct amdgpu_hive_info {
>   struct list_headdevice_list;
>   struct psp_xgmi_topology_info   

Re: [PATCH 2/2] drm/amdgpu: Add sysfs file for PCIe usage

2019-01-07 Thread Kuehling, Felix
On 2019-01-04 7:35 a.m., Russell, Kent wrote:
> Add a sysfs file that reports the number of bytes transmitted and
> received in the last second. This can be used to approximate the PCIe
> bandwidth usage over the last second.
>
> Change-Id: I6c82fe1cb17726825736512990fa64f5c1489861
> Signed-off-by: Kent Russell 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  4 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 31 
>  drivers/gpu/drm/amd/amdgpu/cik.c   | 41 +++
>  drivers/gpu/drm/amd/amdgpu/si.c| 41 +++
>  drivers/gpu/drm/amd/amdgpu/soc15.c | 44 
> ++
>  drivers/gpu/drm/amd/amdgpu/vi.c| 41 +++
>  6 files changed, 202 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e1b2c64..512b124 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -542,6 +542,9 @@ struct amdgpu_asic_funcs {
>   bool (*need_full_reset)(struct amdgpu_device *adev);
>   /* initialize doorbell layout for specific asic*/
>   void (*init_doorbell_index)(struct amdgpu_device *adev);
> + /* PCIe bandwidth usage */
> + void (*get_pcie_usage)(struct amdgpu_device *adev, uint64_t *count0,
> +uint64_t *count1);
>  };
>  
>  /*
> @@ -1045,6 +1048,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
>  #define amdgpu_asic_invalidate_hdp(adev, r) 
> (adev)->asic_funcs->invalidate_hdp((adev), (r))
>  #define amdgpu_asic_need_full_reset(adev) 
> (adev)->asic_funcs->need_full_reset((adev))
>  #define amdgpu_asic_init_doorbell_index(adev) 
> (adev)->asic_funcs->init_doorbell_index((adev))
> +#define amdgpu_asic_get_pcie_usage(adev, cnt0, cnt1) 
> ((adev)->asic_funcs->get_pcie_usage((adev), (cnt0), (cnt1)))
>  
>  /* Common functions */
>  bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> index 1f61ed9..051307e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> @@ -990,6 +990,30 @@ static ssize_t amdgpu_get_busy_percent(struct device 
> *dev,
>   return snprintf(buf, PAGE_SIZE, "%d\n", value);
>  }
>  
> +/**
> + * DOC: pcie_bw
> + *
> + * The amdgpu driver provides a sysfs API for reading how much data
> + * has been sent and received by the GPU in the last second through PCIe.
> + * The file pcie_bw is used for this.
> + * The Perf counters calculate this and return the number of sent and 
> received
> + * messages, which we multiply by the size of our PCIe packets (mps).
> + */
> +static ssize_t amdgpu_get_pcie_bw(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct drm_device *ddev = dev_get_drvdata(dev);
> + struct amdgpu_device *adev = ddev->dev_private;
> + uint64_t mps = pcie_get_mps(adev->pdev);
> + uint64_t count0, count1;
> +
> + amdgpu_asic_get_pcie_usage(adev, , );
> + return snprintf(buf, PAGE_SIZE,
> + "Bytes received: %llu\nBytes sent: %llu\n",
> + count0 * mps, count1 * mps);

This quietly assumes that all transfers use the maximum payload size.

We should either state the assumption or report the count rather than a
number in bytes.

Regards,
  Felix


> +}
> +
>  static DEVICE_ATTR(power_dpm_state, S_IRUGO | S_IWUSR, amdgpu_get_dpm_state, 
> amdgpu_set_dpm_state);
>  static DEVICE_ATTR(power_dpm_force_performance_level, S_IRUGO | S_IWUSR,
>  amdgpu_get_dpm_forced_performance_level,
> @@ -1025,6 +1049,7 @@ static DEVICE_ATTR(pp_od_clk_voltage, S_IRUGO | S_IWUSR,
>   amdgpu_set_pp_od_clk_voltage);
>  static DEVICE_ATTR(gpu_busy_percent, S_IRUGO,
>   amdgpu_get_busy_percent, NULL);
> +static DEVICE_ATTR(pcie_bw, S_IRUGO, amdgpu_get_pcie_bw, NULL);
>  
>  static ssize_t amdgpu_hwmon_show_temp(struct device *dev,
> struct device_attribute *attr,
> @@ -2105,6 +2130,11 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
>   "gpu_busy_level\n");
>   return ret;
>   }
> + ret = device_create_file(adev->dev, _attr_pcie_bw);
> + if (ret) {
> + DRM_ERROR("failed to create device file pcie_bw\n");
> + return ret;
> + }
>   ret = amdgpu_debugfs_pm_init(adev);
>   if (ret) {
>   DRM_ERROR("Failed to register debugfs file for dpm!\n");
> @@ -2141,6 +2171,7 @@ void amdgpu_pm_sysfs_fini(struct amdgpu_device *adev)
>   device_remove_file(adev->dev,
>   _attr_pp_od_clk_voltage);
>   device_remove_file(adev->dev, _attr_gpu_busy_percent);
> + device_remove_file(adev->dev, _attr_pcie_bw);
>  }
>  
>  void amdgpu_pm_compute_clocks(struct 
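
A minimal sketch of a userspace reader for the proposed file, assuming the
two-line format from the DOC comment above; the sysfs path is a guess, and as
noted the values are counts multiplied by the max payload size, so they are
upper-bound estimates rather than exact byte counts:

#include <stdio.h>

/* path: e.g. /sys/class/drm/card0/device/pcie_bw (assumed location) */
static int read_pcie_bw(const char *path,
                        unsigned long long *rx_bytes,
                        unsigned long long *tx_bytes)
{
        FILE *f = fopen(path, "r");
        int n;

        if (!f)
                return -1;
        n = fscanf(f, "Bytes received: %llu Bytes sent: %llu",
                   rx_bytes, tx_bytes);
        fclose(f);
        return n == 2 ? 0 : -1;
}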

Re: [PATCH v5 1/2] drm/sched: Refactor ring mirror list handling.

2019-01-07 Thread Grodzovsky, Andrey


On 01/07/2019 09:13 AM, Christian König wrote:
> Am 03.01.19 um 18:42 schrieb Grodzovsky, Andrey:
>>
>> On 01/03/2019 11:20 AM, Grodzovsky, Andrey wrote:
>>> On 01/03/2019 03:54 AM, Koenig, Christian wrote:
 Am 21.12.18 um 21:36 schrieb Grodzovsky, Andrey:
> On 12/21/2018 01:37 PM, Christian König wrote:
>> Am 20.12.18 um 20:23 schrieb Andrey Grodzovsky:
>>> Decouple sched thread stop and start and ring mirror
>>> list handling from the policy of what to do about the
>>> guilty jobs.
>>> When stopping the sched thread and detaching sched fences
>>> from non-signaled HW fences, wait for all signaled HW fences
>>> to complete before rerunning the jobs.
>>>
>>> v2: Fix resubmission of guilty job into HW after refactoring.
>>>
>>> v4:
>>> Full restart for all the jobs, not only from guilty ring.
>>> Extract karma increase into standalone function.
>>>
>>> v5:
>>> Rework waiting for signaled jobs without relying on the job
>>> struct itself as those might already be freed for non 'guilty'
>>> job's schedulers.
>>> Expose karma increase to drivers.
>>>
>>> Suggested-by: Christian Koenig 
>>> Signed-off-by: Andrey Grodzovsky 
>>> ---
>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  18 +--
>>>       drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  11 +-
>>>       drivers/gpu/drm/scheduler/sched_main.c | 188
>>> +++--
>>>       drivers/gpu/drm/v3d/v3d_sched.c    |  12 +-
>>>       include/drm/gpu_scheduler.h    |  10 +-
>>>       5 files changed, 151 insertions(+), 88 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 8a078f4..a4bd2d3 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3298,12 +3298,10 @@ static int
>>> amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>       if (!ring || !ring->sched.thread)
>>>       continue;
>>>       -    kthread_park(ring->sched.thread);
>>> +    drm_sched_stop(>sched, job ? >base : NULL);
>>>       -    if (job && job->base.sched != >sched)
>>> -    continue;
>>> -
>>> -    drm_sched_hw_job_reset(>sched, job ? >base : 
>>> NULL);
>>> +    if(job)
>>> + drm_sched_increase_karma(>base);
>> Since we dropped the "job && job->base.sched != >sched" check
>> above this will now increase the jobs karma multiple times.
>>
>> Maybe just move that outside of the loop.
>>
>>>         /* after all hw jobs are reset, hw fence is 
>>> meaningless,
>>> so force_completion */
>>> amdgpu_fence_driver_force_completion(ring);
>>> @@ -3454,14 +3452,10 @@ static void
>>> amdgpu_device_post_asic_reset(struct amdgpu_device *adev,
>>>       if (!ring || !ring->sched.thread)
>>>       continue;
>>>       -    /* only need recovery sched of the given job's ring
>>> - * or all rings (in the case @job is NULL)
>>> - * after above amdgpu_reset accomplished
>>> - */
>>> -    if ((!job || job->base.sched == >sched) &&
>>> !adev->asic_reset_res)
>>> - drm_sched_job_recovery(>sched);
>>> +    if (!adev->asic_reset_res)
>>> + drm_sched_resubmit_jobs(>sched);
>>>       -    kthread_unpark(ring->sched.thread);
>>> +    drm_sched_start(>sched, !adev->asic_reset_res);
>>>       }
>>>         if (!amdgpu_device_has_dc_support(adev)) {
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> index 49a6763..6f1268f 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> @@ -109,16 +109,19 @@ static void etnaviv_sched_timedout_job(struct
>>> drm_sched_job *sched_job)
>>>       }
>>>         /* block scheduler */
>>> -    kthread_park(gpu->sched.thread);
>>> -    drm_sched_hw_job_reset(>sched, sched_job);
>>> +    drm_sched_stop(>sched, sched_job);
>>> +
>>> +    if(sched_job)
>>> +    drm_sched_increase_karma(sched_job);
>>>         /* get the GPU back into the init state */
>>>       etnaviv_core_dump(gpu);
>>>       etnaviv_gpu_recover_hang(gpu);
>>>       + drm_sched_resubmit_jobs(>sched);
>>> +
>>>       /* restart scheduler after GPU is usable again */
>>> -    drm_sched_job_recovery(>sched);
>>> -    kthread_unpark(gpu->sched.thread);
>>> +    drm_sched_start(>sched, true);
>>>       }
>>>         static void etnaviv_sched_free_job(struct drm_sched_job 
>>> *sched_job)
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
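
Condensed, the driver-side recovery sequence discussed in this thread looks
roughly like this; a sketch based on the etnaviv/amdgpu hunks quoted above,
not a complete timeout handler:

#include <drm/gpu_scheduler.h>

static void example_timedout_job(struct drm_sched_job *sched_job)
{
        struct drm_gpu_scheduler *sched = sched_job->sched;

        /* Park the scheduler thread and detach scheduler fences from
         * unsignaled HW fences; signaled HW fences are waited on first. */
        drm_sched_stop(sched, sched_job);

        if (sched_job)
                drm_sched_increase_karma(sched_job);

        /* ... driver-specific hardware reset goes here ... */

        drm_sched_resubmit_jobs(sched);         /* re-queue the mirror list */
        drm_sched_start(sched, true);           /* unpark the scheduler thread */
}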

[PATCH libdrm] amdgpu: add a faster BO list API

2019-01-07 Thread Marek Olšák
From: Marek Olšák 

---
 amdgpu/amdgpu-symbol-check |  3 ++
 amdgpu/amdgpu.h| 56 +-
 amdgpu/amdgpu_bo.c | 36 
 amdgpu/amdgpu_cs.c | 25 +
 4 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/amdgpu/amdgpu-symbol-check b/amdgpu/amdgpu-symbol-check
index 6f5e0f95..96a44b40 100755
--- a/amdgpu/amdgpu-symbol-check
+++ b/amdgpu/amdgpu-symbol-check
@@ -12,20 +12,22 @@ _edata
 _end
 _fini
 _init
 amdgpu_bo_alloc
 amdgpu_bo_cpu_map
 amdgpu_bo_cpu_unmap
 amdgpu_bo_export
 amdgpu_bo_free
 amdgpu_bo_import
 amdgpu_bo_inc_ref
+amdgpu_bo_list_create_raw
+amdgpu_bo_list_destroy_raw
 amdgpu_bo_list_create
 amdgpu_bo_list_destroy
 amdgpu_bo_list_update
 amdgpu_bo_query_info
 amdgpu_bo_set_metadata
 amdgpu_bo_va_op
 amdgpu_bo_va_op_raw
 amdgpu_bo_wait_for_idle
 amdgpu_create_bo_from_user_mem
 amdgpu_cs_chunk_fence_info_to_data
@@ -40,20 +42,21 @@ amdgpu_cs_destroy_semaphore
 amdgpu_cs_destroy_syncobj
 amdgpu_cs_export_syncobj
 amdgpu_cs_fence_to_handle
 amdgpu_cs_import_syncobj
 amdgpu_cs_query_fence_status
 amdgpu_cs_query_reset_state
 amdgpu_query_sw_info
 amdgpu_cs_signal_semaphore
 amdgpu_cs_submit
 amdgpu_cs_submit_raw
+amdgpu_cs_submit_raw2
 amdgpu_cs_syncobj_export_sync_file
 amdgpu_cs_syncobj_import_sync_file
 amdgpu_cs_syncobj_reset
 amdgpu_cs_syncobj_signal
 amdgpu_cs_syncobj_wait
 amdgpu_cs_wait_fences
 amdgpu_cs_wait_semaphore
 amdgpu_device_deinitialize
 amdgpu_device_initialize
 amdgpu_find_bo_by_cpu_mapping
diff --git a/amdgpu/amdgpu.h b/amdgpu/amdgpu.h
index dc51659a..5b800033 100644
--- a/amdgpu/amdgpu.h
+++ b/amdgpu/amdgpu.h
@@ -35,20 +35,21 @@
 #define _AMDGPU_H_
 
 #include 
 #include 
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
 struct drm_amdgpu_info_hw_ip;
+struct drm_amdgpu_bo_list_entry;
 
 /*--*/
 /* --- Defines  */
 /*--*/
 
 /**
  * Define max. number of Command Buffers (IB) which could be sent to the single
  * hardware IP to accommodate CE/DE requirements
  *
  * \sa amdgpu_cs_ib_info
@@ -767,34 +768,65 @@ int amdgpu_bo_cpu_unmap(amdgpu_bo_handle buf_handle);
  *and no GPU access is scheduled.
  *  1 GPU access is in fly or scheduled
  *
  * \return   0 - on success
  *  <0 - Negative POSIX Error code
  */
 int amdgpu_bo_wait_for_idle(amdgpu_bo_handle buf_handle,
uint64_t timeout_ns,
bool *buffer_busy);
 
+/**
+ * Creates a BO list handle for command submission.
+ *
+ * \param   dev- \c [in] Device handle.
+ *See #amdgpu_device_initialize()
+ * \param   number_of_buffers  - \c [in] Number of BOs in the list
+ * \param   buffers- \c [in] List of BO handles
+ * \param   result - \c [out] Created BO list handle
+ *
+ * \return   0 on success\n
+ *  <0 - Negative POSIX Error code
+ *
+ * \sa amdgpu_bo_list_destroy_raw()
+*/
+int amdgpu_bo_list_create_raw(amdgpu_device_handle dev,
+ uint32_t number_of_buffers,
+ struct drm_amdgpu_bo_list_entry *buffers,
+ uint32_t *result);
+
+/**
+ * Destroys a BO list handle.
+ *
+ * \param   bo_list- \c [in] BO list handle.
+ *
+ * \return   0 on success\n
+ *  <0 - Negative POSIX Error code
+ *
+ * \sa amdgpu_bo_list_create_raw(), amdgpu_cs_submit_raw2()
+*/
+int amdgpu_bo_list_destroy_raw(amdgpu_device_handle dev, uint32_t bo_list);
+
 /**
  * Creates a BO list handle for command submission.
  *
  * \param   dev- \c [in] Device handle.
  *See #amdgpu_device_initialize()
  * \param   number_of_resources- \c [in] Number of BOs in the list
  * \param   resources  - \c [in] List of BO handles
  * \param   resource_prios - \c [in] Optional priority for each handle
  * \param   result - \c [out] Created BO list handle
  *
  * \return   0 on success\n
  *  <0 - Negative POSIX Error code
  *
- * \sa amdgpu_bo_list_destroy()
+ * \sa amdgpu_bo_list_destroy(), amdgpu_cs_submit_raw2()
 */
 int amdgpu_bo_list_create(amdgpu_device_handle dev,
  uint32_t number_of_resources,
  amdgpu_bo_handle *resources,
  uint8_t *resource_prios,
  amdgpu_bo_list_handle *result);
 
 /**
  * Destroys a BO list handle.
  *
@@ -1580,20 +1612,42 @@ struct drm_amdgpu_cs_chunk;
 struct drm_amdgpu_cs_chunk_dep;
 struct drm_amdgpu_cs_chunk_data;
 
 int amdgpu_cs_submit_raw(amdgpu_device_handle dev,
 amdgpu_context_handle context,
 amdgpu_bo_list_handle 
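
A hypothetical caller of the new raw API might look like the sketch below. It
uses only names declared in this patch plus the existing amdgpu_bo_export(),
and the amdgpu_cs_submit_raw2() call is left as a comment because its full
signature is cut off above:

#include <amdgpu.h>
#include <amdgpu_drm.h>

static int submit_with_raw_list(amdgpu_device_handle dev,
                                amdgpu_bo_handle bo_a, amdgpu_bo_handle bo_b)
{
        struct drm_amdgpu_bo_list_entry entries[2] = {{0}};
        uint32_t list_handle;
        int r;

        /* The raw API takes KMS handles directly, so no amdgpu_bo_list
         * object is allocated behind the scenes. */
        amdgpu_bo_export(bo_a, amdgpu_bo_handle_type_kms, &entries[0].bo_handle);
        amdgpu_bo_export(bo_b, amdgpu_bo_handle_type_kms, &entries[1].bo_handle);

        r = amdgpu_bo_list_create_raw(dev, 2, entries, &list_handle);
        if (r)
                return r;

        /* ... build CS chunks and pass list_handle to amdgpu_cs_submit_raw2() ... */

        return amdgpu_bo_list_destroy_raw(dev, list_handle);
}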

Re: [PATCH 2/2] drm/amdgpu: Add sysfs file for PCIe usage

2019-01-07 Thread Alex Deucher
On Mon, Jan 7, 2019 at 6:53 AM Russell, Kent  wrote:
>
> Did you mean on APU+dGPU, or just straight APU? The file currently returns 0s 
> on my Carrizo. The event values are the same on CZ as they are for the dGPU 
> ASICs, so it might just be that it's not supported since it's not going 
> through the physical PCIe bus on the board.  I have tested it on 
> Fiji/Vega10/Vega20 successfully, and they all give reasonable values, at 
> least. But I don't have access to Raven to see if it just returns 0s like CZ 
> does. Would you prefer that the sysfs file only be added for dGPUs, to save 
> confusion and wasted sysfs files that return no useful information?
>

On APUs.  If the registers don't exist or don't do anything on APUs,
then let's just not expose the file.  Otherwise, it just leads to
confusion.

Alex

>  Kent
>
> -Original Message-
> From: Alex Deucher 
> Sent: Friday, January 04, 2019 11:22 AM
> To: Russell, Kent 
> Cc: amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 2/2] drm/amdgpu: Add sysfs file for PCIe usage
>
> On Fri, Jan 4, 2019 at 7:35 AM Russell, Kent  wrote:
> >
> > Add a sysfs file that reports the number of bytes transmitted and
> > received in the last second. This can be used to approximate the PCIe
> > bandwidth usage over the last second.
> >
> > Change-Id: I6c82fe1cb17726825736512990fa64f5c1489861
> > Signed-off-by: Kent Russell 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  4 
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 31 
> >  drivers/gpu/drm/amd/amdgpu/cik.c   | 41 +++
> >  drivers/gpu/drm/amd/amdgpu/si.c| 41 +++
> >  drivers/gpu/drm/amd/amdgpu/soc15.c | 44 
> > ++
> >  drivers/gpu/drm/amd/amdgpu/vi.c| 41 +++
> >  6 files changed, 202 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index e1b2c64..512b124 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -542,6 +542,9 @@ struct amdgpu_asic_funcs {
> > bool (*need_full_reset)(struct amdgpu_device *adev);
> > /* initialize doorbell layout for specific asic*/
> > void (*init_doorbell_index)(struct amdgpu_device *adev);
> > +   /* PCIe bandwidth usage */
> > +   void (*get_pcie_usage)(struct amdgpu_device *adev, uint64_t *count0,
> > +  uint64_t *count1);
> >  };
> >
> >  /*
> > @@ -1045,6 +1048,7 @@ int emu_soc_asic_init(struct amdgpu_device
> > *adev);  #define amdgpu_asic_invalidate_hdp(adev, r)
> > (adev)->asic_funcs->invalidate_hdp((adev), (r))  #define
> > amdgpu_asic_need_full_reset(adev)
> > (adev)->asic_funcs->need_full_reset((adev))
> >  #define amdgpu_asic_init_doorbell_index(adev)
> > (adev)->asic_funcs->init_doorbell_index((adev))
> > +#define amdgpu_asic_get_pcie_usage(adev, cnt0, cnt1)
> > +((adev)->asic_funcs->get_pcie_usage((adev), (cnt0), (cnt1)))
> >
> >  /* Common functions */
> >  bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> > index 1f61ed9..051307e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> > @@ -990,6 +990,30 @@ static ssize_t amdgpu_get_busy_percent(struct device 
> > *dev,
> > return snprintf(buf, PAGE_SIZE, "%d\n", value);  }
> >
> > +/**
> > + * DOC: pcie_bw
> > + *
> > + * The amdgpu driver provides a sysfs API for reading how much data
> > + * has been sent and received by the GPU in the last second through PCIe.
> > + * The file pcie_bw is used for this.
> > + * The Perf counters calculate this and return the number of sent and
> > +received
> > + * messages, which we multiply by the size of our PCIe packets (mps).
> > + */
> > +static ssize_t amdgpu_get_pcie_bw(struct device *dev,
> > +   struct device_attribute *attr,
> > +   char *buf)
> > +{
> > +   struct drm_device *ddev = dev_get_drvdata(dev);
> > +   struct amdgpu_device *adev = ddev->dev_private;
> > +   uint64_t mps = pcie_get_mps(adev->pdev);
> > +   uint64_t count0, count1;
> > +
> > +   amdgpu_asic_get_pcie_usage(adev, , );
> > +   return snprintf(buf, PAGE_SIZE,
> > +   "Bytes received: %llu\nBytes sent: %llu\n",
> > +   count0 * mps, count1 * mps); }
> > +
> >  static DEVICE_ATTR(power_dpm_state, S_IRUGO | S_IWUSR,
> > amdgpu_get_dpm_state, amdgpu_set_dpm_state);  static 
> > DEVICE_ATTR(power_dpm_force_performance_level, S_IRUGO | S_IWUSR,
> >amdgpu_get_dpm_forced_performance_level,
> > @@ -1025,6 +1049,7 @@ static DEVICE_ATTR(pp_od_clk_voltage, S_IRUGO | 
> > S_IWUSR,
> > amdgpu_set_pp_od_clk_voltage);  static
> > DEVICE_ATTR(gpu_busy_percent, S_IRUGO,

Re: After Vega 56/64 GPU hang I unable reboot system

2019-01-07 Thread Grodzovsky, Andrey
I see the 'no active waves' print, meaning it's not a shader hang.

We can try to estimate around which commands the hang occurred. In
addition to what you already print, please also dump:

sudo umr -O many,bits -r *.*.mmGRBM_STATUS* &&
sudo umr -O many,bits -r *.*.mmCP_EOP_* &&
sudo umr -O many,bits -r *.*.mmCP_PFP_HEADER_DUMP &&
sudo umr -O many,bits -r *.*.mmCP_ME_HEADER_DUMP

Andrey


On 01/04/2019 12:50 PM, Mikhail Gavrilov wrote:
> On Fri, 4 Jan 2019 at 01:23, Mikhail Gavrilov
>  wrote:
>> On Tue, 18 Dec 2018 at 00:08, Grodzovsky, Andrey
>>  wrote:
>>> Please install UMR and dump gfx ring content and waves after the hang is
>>> happening.
>>>
>> Today I caught hang again and able dump gfx ring content.
>> Andrey, can you look my dumps?
>>
> And I am catch yet another hang.
> Of course I dumped all needed gfx ring content in attachment.
>
> --
> Best Regards,
> Mike Gavrilov.



Re: [PATCH 2/2] drm/amd/display: Move iteration out of dm_update_crtcs

2019-01-07 Thread Kazlauskas, Nicholas
On 1/7/19 10:53 AM, sunpeng...@amd.com wrote:
> From: Leo Li 
> 
> [Why]
> To reduce indent in dm_update_crtcs, and to make it operate on single
> instances.
> 
> [How]
> Move iteration of plane states into atomic_check.
> No functional change is intended.
> 
> Signed-off-by: Leo Li 

Series is:

Reviewed-by: Nicholas Kazlauskas 

Looks good to me. I guess we don't need the other two patches from the 
last series if we're not going to be making use of the helper.

Nicholas Kazlauskas

> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 341 
> +++---
>   1 file changed, 175 insertions(+), 166 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 0b6bce5..70cd8bc 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -5550,15 +5550,15 @@ static void reset_freesync_config_for_crtc(
>  sizeof(new_crtc_state->vrr_infopacket));
>   }
>   
> -static int dm_update_crtcs_state(struct amdgpu_display_manager *dm,
> -  struct drm_atomic_state *state,
> -  bool enable,
> -  bool *lock_and_validation_needed)
> +static int dm_update_crtc_state(struct amdgpu_display_manager *dm,
> + struct drm_atomic_state *state,
> + struct drm_crtc *crtc,
> + struct drm_crtc_state *old_crtc_state,
> + struct drm_crtc_state *new_crtc_state,
> + bool enable,
> + bool *lock_and_validation_needed)
>   {
>   struct dm_atomic_state *dm_state = NULL;
> - struct drm_crtc *crtc;
> - struct drm_crtc_state *old_crtc_state, *new_crtc_state;
> - int i;
>   struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state;
>   struct dc_stream_state *new_stream;
>   int ret = 0;
> @@ -5567,200 +5567,199 @@ static int dm_update_crtcs_state(struct 
> amdgpu_display_manager *dm,
>* TODO Move this code into dm_crtc_atomic_check once we get rid of 
> dc_validation_set
>* update changed items
>*/
> - for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, 
> new_crtc_state, i) {
> - struct amdgpu_crtc *acrtc = NULL;
> - struct amdgpu_dm_connector *aconnector = NULL;
> - struct drm_connector_state *drm_new_conn_state = NULL, 
> *drm_old_conn_state = NULL;
> - struct dm_connector_state *dm_new_conn_state = NULL, 
> *dm_old_conn_state = NULL;
> - struct drm_plane_state *new_plane_state = NULL;
> + struct amdgpu_crtc *acrtc = NULL;
> + struct amdgpu_dm_connector *aconnector = NULL;
> + struct drm_connector_state *drm_new_conn_state = NULL, 
> *drm_old_conn_state = NULL;
> + struct dm_connector_state *dm_new_conn_state = NULL, *dm_old_conn_state 
> = NULL;
> + struct drm_plane_state *new_plane_state = NULL;
>   
> - new_stream = NULL;
> + new_stream = NULL;
>   
> - dm_old_crtc_state = to_dm_crtc_state(old_crtc_state);
> - dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
> - acrtc = to_amdgpu_crtc(crtc);
> + dm_old_crtc_state = to_dm_crtc_state(old_crtc_state);
> + dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
> + acrtc = to_amdgpu_crtc(crtc);
>   
> - new_plane_state = drm_atomic_get_new_plane_state(state, 
> new_crtc_state->crtc->primary);
> + new_plane_state = drm_atomic_get_new_plane_state(state, 
> new_crtc_state->crtc->primary);
>   
> - if (new_crtc_state->enable && new_plane_state && 
> !new_plane_state->fb) {
> - ret = -EINVAL;
> - goto fail;
> - }
> + if (new_crtc_state->enable && new_plane_state && !new_plane_state->fb) {
> + ret = -EINVAL;
> + goto fail;
> + }
>   
> - aconnector = 
> amdgpu_dm_find_first_crtc_matching_connector(state, crtc);
> + aconnector = amdgpu_dm_find_first_crtc_matching_connector(state, crtc);
>   
> - /* TODO This hack should go away */
> - if (aconnector && enable) {
> - /* Make sure fake sink is created in plug-in scenario */
> - drm_new_conn_state = 
> drm_atomic_get_new_connector_state(state,
> - 
> >base);
> - drm_old_conn_state = 
> drm_atomic_get_old_connector_state(state,
> - 
> >base);
> + /* TODO This hack should go away */
> + if (aconnector && enable) {
> + /* Make sure fake sink is created in plug-in scenario */
> + drm_new_conn_state = drm_atomic_get_new_connector_state(state,
> +

[PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v3)

2019-01-07 Thread StDenis, Tom
v2: Move locks around in other functions so that this
function can stand on its own.  Also only hold the hive-specific
lock for add/remove of a device instead of the driver
global lock, so devices still can't be added/removed in parallel
within one hive.

v3: add reset_lock

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 41 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  5 +--
 3 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 39d5d058b2c7..1a558dc41ba6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3525,9 +3525,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * by different nodes. No point also since the one node already 
executing
 * reset will also reset all the other nodes in the hive.
 */
-   hive = amdgpu_get_xgmi_hive(adev);
+   hive = amdgpu_get_xgmi_hive(adev, 0);
if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
-   !mutex_trylock(>hive_lock))
+   !mutex_trylock(>reset_lock))
return 0;
 
/* Start with adev pre asic reset first for soft reset check.*/
@@ -3606,7 +3606,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
}
 
if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
-   mutex_unlock(>hive_lock);
+   mutex_unlock(>reset_lock);
 
if (r)
dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 8a8bc60cb6b4..9e98ab8b1525 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -40,26 +40,40 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
*hive)
return >device_list;
 }
 
-struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
+struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev, int 
lock)
 {
int i;
struct amdgpu_hive_info *tmp;
 
if (!adev->gmc.xgmi.hive_id)
return NULL;
+
+   mutex_lock(_mutex);
+
for (i = 0 ; i < hive_count; ++i) {
tmp = _hives[i];
-   if (tmp->hive_id == adev->gmc.xgmi.hive_id)
+   if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
+   if (lock)
+   mutex_lock(>hive_lock);
+   mutex_unlock(_mutex);
return tmp;
+   }
}
-   if (i >= AMDGPU_MAX_XGMI_HIVE)
+   if (i >= AMDGPU_MAX_XGMI_HIVE) {
+   mutex_unlock(_mutex);
return NULL;
+   }
 
/* initialize new hive if not exist */
tmp = _hives[hive_count++];
tmp->hive_id = adev->gmc.xgmi.hive_id;
INIT_LIST_HEAD(>device_list);
mutex_init(>hive_lock);
+   mutex_init(>reset_lock);
+   if (lock)
+   mutex_lock(>hive_lock);
+
+   mutex_unlock(_mutex);
 
return tmp;
 }
@@ -111,8 +125,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
return ret;
}
 
-   mutex_lock(_mutex);
-   hive = amdgpu_get_xgmi_hive(adev);
+   /* find hive and take lock */
+   hive = amdgpu_get_xgmi_hive(adev, 1);
if (!hive)
goto exit;
 
@@ -142,8 +156,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
break;
}
 
+   mutex_unlock(>hive_lock);
 exit:
-   mutex_unlock(_mutex);
return ret;
 }
 
@@ -154,15 +168,14 @@ void amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
if (!adev->gmc.xgmi.supported)
return;
 
-   mutex_lock(_mutex);
-
-   hive = amdgpu_get_xgmi_hive(adev);
+   hive = amdgpu_get_xgmi_hive(adev, 1);
if (!hive)
-   goto exit;
+   return;
 
-   if (!(hive->number_devices--))
+   if (!(hive->number_devices--)) {
mutex_destroy(>hive_lock);
-
-exit:
-   mutex_unlock(_mutex);
+   mutex_destroy(>reset_lock);
+   } else {
+   mutex_unlock(>hive_lock);
+   }
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index 6151eb9c8ad3..14bc60664159 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -29,10 +29,11 @@ struct amdgpu_hive_info {
struct list_headdevice_list;
struct psp_xgmi_topology_info   topology_info;
int number_devices;
-   struct mutex hive_lock;
+   struct mutex hive_lock,
+reset_lock;
 };
 
-struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
+struct amdgpu_hive_info 
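
To make the intended split between the three locks easier to follow, a
condensed caller-side sketch of what the patch does (the example_* wrappers
are illustrative only; the lock names, the lookup's 'lock' argument and the
trylock test are the ones from the diff, and xgmi_mutex itself is now only
taken inside amdgpu_get_xgmi_hive() to guard the hive array):

/* add/remove path: amdgpu_get_xgmi_hive(adev, 1) returns with hive_lock held */
static int example_add_device(struct amdgpu_device *adev)
{
	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 1);

	if (!hive)
		return -EINVAL;
	/* ... update topology for every device on hive->device_list ... */
	mutex_unlock(&hive->hive_lock);
	return 0;
}

/* reset path: look the hive up without hive_lock, then trylock reset_lock so
 * a reset already running on another node of the hive is not entered twice */
static int example_gpu_recover(struct amdgpu_device *adev)
{
	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);

	if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
	    !mutex_trylock(&hive->reset_lock))
		return 0;
	/* ... perform the actual reset ... */
	if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
		mutex_unlock(&hive->reset_lock);
	return 0;
}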

Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread Grodzovsky, Andrey


On 01/07/2019 12:03 PM, StDenis, Tom wrote:
> On 2019-01-07 12:00 p.m., Grodzovsky, Andrey wrote:
>>
>> On 01/07/2019 11:53 AM, StDenis, Tom wrote:
>>> On 2019-01-07 11:51 a.m., Grodzovsky, Andrey wrote:
 On 01/07/2019 11:36 AM, StDenis, Tom wrote:
> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
>> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
>>> I think it's reasonable to use the hive  specific lock for hive  
>>> specific functions.
>>> The changes is acked-by  Shaoyun.liu < shaoyun@amd.com>
>>>
>>> -Original Message-
>>> From: amd-gfx  On Behalf Of 
>>> StDenis, Tom
>>> Sent: Monday, January 7, 2019 10:16 AM
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: StDenis, Tom 
>>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>>>
>>> v2: Move locks around in other functions so that this function can 
>>> stand on its own.  Also only hold the hive specific lock for add/remove 
>>> device instead of the driver global lock so you can't add/remove 
>>> devices in parallel from one hive.
>>>
>>> Signed-off-by: Tom St Denis 
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 
>>> ++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>>> 3 files changed, 25 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 39d5d058b2c7..13d8e2ad2f7a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>  * by different nodes. No point also since the one node 
>>> already executing
>>>  * reset will also reset all the other nodes in the 
>>> hive.
>>>  */
>>> -   hive = amdgpu_get_xgmi_hive(adev);
>>> +   hive = amdgpu_get_xgmi_hive(adev, 0);
>>> if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>> !mutex_trylock(>hive_lock))
>>> return 0;
>> Let's say i have device 0 in hive A and it just got a gpu reset and at
>> the same time device 1 is being added to same have though
>> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
>> being added and so gpu reset for device 0 bails out on
>> '!mutex_trylock(>hive_lock))' without completing the reset.
>> Also in general i feel a bit uncomfortable about the confusing
>> acquisition scheme in the function and  the fact that you take the
>> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
>> of the function.
> Is adding a device while resetting a device even a valid operation
> anyways?
 In theory it's valid if you have hot pluggable devices
> I think this means more so that the reset logic is broken.  Instead
> there should be a per-hive reset lock that is taken and that is tested
> instead.
>
> Tom
 The hive->hive_lock was added exactly for this purpose and used only for
 that purpose. Maybe the naming i gave it wasn't reflective of it's
 purpose :)
>>> But the add/remove should use per-hive locks not the global lock... :-)
>>>
>>> (I'm honestly not trying to bike shed I just thought the get_hive
>>> function looked wrong :-)).
>>>
>>> Tom
>> Totally agree with you, if Shayun (who origianlly added the global
>> xgmi_mutex) is fine with switching to per hive mutex then me too, I just
>> point out the problem with gpu reset and as you said we then need to
>> rename the existing hive_lock into reset_lock and then and another per
>> hive lock to do what you propose. Also - is there a way to not take the
>> hive lock inside amdgpu_get_xgmi_hive but release it outside ? AFAIK
>> it's an opening for problems where people use it but forget to call
>> release.
> I wanted to take the per-hive lock inside get_hive because it also takes
> the global lock so that add/remove couldn't happen in parallel.
>
> For instance, deleting the last node while adding a new node means the
> per-hive mutex could be in limbo (because remove will delete the lock).
>
> Adding a per-hive reset lock would fix the remaining issues no?
>
> Tom

Looks like it.

Andrey


> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread StDenis, Tom
On 2019-01-07 12:00 p.m., Grodzovsky, Andrey wrote:
> 
> 
> On 01/07/2019 11:53 AM, StDenis, Tom wrote:
>> On 2019-01-07 11:51 a.m., Grodzovsky, Andrey wrote:
>>>
>>> On 01/07/2019 11:36 AM, StDenis, Tom wrote:
 On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
>> I think it's reasonable to use the hive  specific lock for hive  
>> specific functions.
>> The changes is acked-by  Shaoyun.liu < shaoyun@amd.com>
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of 
>> StDenis, Tom
>> Sent: Monday, January 7, 2019 10:16 AM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: StDenis, Tom 
>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>>
>> v2: Move locks around in other functions so that this function can stand 
>> on its own.  Also only hold the hive specific lock for add/remove device 
>> instead of the driver global lock so you can't add/remove devices in 
>> parallel from one hive.
>>
>> Signed-off-by: Tom St Denis 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 
>> ++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>>3 files changed, 25 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 39d5d058b2c7..13d8e2ad2f7a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>> *adev,
>>   * by different nodes. No point also since the one node already 
>> executing
>>   * reset will also reset all the other nodes in the hive.
>>   */
>> -hive = amdgpu_get_xgmi_hive(adev);
>> +hive = amdgpu_get_xgmi_hive(adev, 0);
>>  if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>  !mutex_trylock(>hive_lock))
>>  return 0;
> Let's say i have device 0 in hive A and it just got a gpu reset and at
> the same time device 1 is being added to same have though
> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
> being added and so gpu reset for device 0 bails out on
> '!mutex_trylock(>hive_lock))' without completing the reset.
> Also in general i feel a bit uncomfortable about the confusing
> acquisition scheme in the function and  the fact that you take the
> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
> of the function.
 Is adding a device while resetting a device even a valid operation
 anyways?
>>> In theory it's valid if you have hot pluggable devices
 I think this means more so that the reset logic is broken.  Instead
 there should be a per-hive reset lock that is taken and that is tested
 instead.

 Tom
>>> The hive->hive_lock was added exactly for this purpose and used only for
>>> that purpose. Maybe the naming i gave it wasn't reflective of it's
>>> purpose :)
>>
>> But the add/remove should use per-hive locks not the global lock... :-)
>>
>> (I'm honestly not trying to bike shed I just thought the get_hive
>> function looked wrong :-)).
>>
>> Tom
> 
> Totally agree with you, if Shayun (who origianlly added the global
> xgmi_mutex) is fine with switching to per hive mutex then me too, I just
> point out the problem with gpu reset and as you said we then need to
> rename the existing hive_lock into reset_lock and then and another per
> hive lock to do what you propose. Also - is there a way to not take the
> hive lock inside amdgpu_get_xgmi_hive but release it outside ? AFAIK
> it's an opening for problems where people use it but forget to call
> release.

I wanted to take the per-hive lock inside get_hive because it also takes 
the global lock so that add/remove couldn't happen in parallel.

For instance, deleting the last node while adding a new node means the 
per-hive mutex could be in limbo (because remove will delete the lock).

Adding a per-hive reset lock would fix the remaining issues no?

Tom
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread Grodzovsky, Andrey


On 01/07/2019 11:53 AM, StDenis, Tom wrote:
> On 2019-01-07 11:51 a.m., Grodzovsky, Andrey wrote:
>>
>> On 01/07/2019 11:36 AM, StDenis, Tom wrote:
>>> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
 On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
> I think it's reasonable to use the hive  specific lock for hive  specific 
> functions.
> The changes is acked-by  Shaoyun.liu < shaoyun@amd.com>
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> StDenis, Tom
> Sent: Monday, January 7, 2019 10:16 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: StDenis, Tom 
> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>
> v2: Move locks around in other functions so that this function can stand 
> on its own.  Also only hold the hive specific lock for add/remove device 
> instead of the driver global lock so you can't add/remove devices in 
> parallel from one hive.
>
> Signed-off-by: Tom St Denis 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 
> ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>   3 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 39d5d058b2c7..13d8e2ad2f7a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>* by different nodes. No point also since the one node already 
> executing
>* reset will also reset all the other nodes in the hive.
>*/
> - hive = amdgpu_get_xgmi_hive(adev);
> + hive = amdgpu_get_xgmi_hive(adev, 0);
>   if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>   !mutex_trylock(>hive_lock))
>   return 0;
 Let's say i have device 0 in hive A and it just got a gpu reset and at
 the same time device 1 is being added to same have though
 amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
 being added and so gpu reset for device 0 bails out on
 '!mutex_trylock(>hive_lock))' without completing the reset.
 Also in general i feel a bit uncomfortable about the confusing
 acquisition scheme in the function and  the fact that you take the
 hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
 of the function.
>>> Is adding a device while resetting a device even a valid operation
>>> anyways?
>> In theory it's valid if you have hot pluggable devices
>>> I think this means more so that the reset logic is broken.  Instead
>>> there should be a per-hive reset lock that is taken and that is tested
>>> instead.
>>>
>>> Tom
>> The hive->hive_lock was added exactly for this purpose and used only for
>> that purpose. Maybe the naming i gave it wasn't reflective of it's
>> purpose :)
>
> But the add/remove should use per-hive locks not the global lock... :-)
>
> (I'm honestly not trying to bike shed I just thought the get_hive
> function looked wrong :-)).
>
> Tom

Totally agree with you. If Shaoyun (who originally added the global
xgmi_mutex) is fine with switching to a per-hive mutex then me too; I just
point out the problem with gpu reset, and as you said we then need to
rename the existing hive_lock into reset_lock and then add another
per-hive lock to do what you propose. Also - is there a way to not take
the hive lock inside amdgpu_get_xgmi_hive but release it outside? AFAIK
it's an opening for problems where people use it but forget to call
release.

Andrey

> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread StDenis, Tom
On 2019-01-07 11:51 a.m., Grodzovsky, Andrey wrote:
> 
> 
> On 01/07/2019 11:36 AM, StDenis, Tom wrote:
>> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
>>>
>>> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
 I think it's reasonable to use the hive  specific lock for hive  specific 
 functions.
 The changes is acked-by  Shaoyun.liu < shaoyun@amd.com>

 -Original Message-
 From: amd-gfx  On Behalf Of 
 StDenis, Tom
 Sent: Monday, January 7, 2019 10:16 AM
 To: amd-gfx@lists.freedesktop.org
 Cc: StDenis, Tom 
 Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

 v2: Move locks around in other functions so that this function can stand 
 on its own.  Also only hold the hive specific lock for add/remove device 
 instead of the driver global lock so you can't add/remove devices in 
 parallel from one hive.

 Signed-off-by: Tom St Denis 
 ---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
  3 files changed, 25 insertions(+), 15 deletions(-)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
 index 39d5d058b2c7..13d8e2ad2f7a 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
 @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
 *adev,
 * by different nodes. No point also since the one node already 
 executing
 * reset will also reset all the other nodes in the hive.
 */
 -  hive = amdgpu_get_xgmi_hive(adev);
 +  hive = amdgpu_get_xgmi_hive(adev, 0);
if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
!mutex_trylock(>hive_lock))
return 0;
>>> Let's say i have device 0 in hive A and it just got a gpu reset and at
>>> the same time device 1 is being added to same have though
>>> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
>>> being added and so gpu reset for device 0 bails out on
>>> '!mutex_trylock(>hive_lock))' without completing the reset.
>>> Also in general i feel a bit uncomfortable about the confusing
>>> acquisition scheme in the function and  the fact that you take the
>>> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
>>> of the function.
>> Is adding a device while resetting a device even a valid operation
>> anyways?
> 
> In theory it's valid if you have hot pluggable devices
>>
>> I think this means more so that the reset logic is broken.  Instead
>> there should be a per-hive reset lock that is taken and that is tested
>> instead.
>>
>> Tom
> 
> The hive->hive_lock was added exactly for this purpose and used only for
> that purpose. Maybe the naming i gave it wasn't reflective of it's
> purpose :)


But the add/remove should use per-hive locks not the global lock... :-)

(I'm honestly not trying to bike shed I just thought the get_hive 
function looked wrong :-)).

Tom
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread Grodzovsky, Andrey


On 01/07/2019 11:36 AM, StDenis, Tom wrote:
> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
>>
>> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
>>> I think it's reasonable to use the hive  specific lock for hive  specific 
>>> functions.
>>> The changes is acked-by  Shaoyun.liu < shaoyun@amd.com>
>>>
>>> -Original Message-
>>> From: amd-gfx  On Behalf Of StDenis, 
>>> Tom
>>> Sent: Monday, January 7, 2019 10:16 AM
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: StDenis, Tom 
>>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>>>
>>> v2: Move locks around in other functions so that this function can stand on 
>>> its own.  Also only hold the hive specific lock for add/remove device 
>>> instead of the driver global lock so you can't add/remove devices in 
>>> parallel from one hive.
>>>
>>> Signed-off-by: Tom St Denis 
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>>> 3 files changed, 25 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 39d5d058b2c7..13d8e2ad2f7a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>>> *adev,
>>>  * by different nodes. No point also since the one node already 
>>> executing
>>>  * reset will also reset all the other nodes in the hive.
>>>  */
>>> -   hive = amdgpu_get_xgmi_hive(adev);
>>> +   hive = amdgpu_get_xgmi_hive(adev, 0);
>>> if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>> !mutex_trylock(>hive_lock))
>>> return 0;
>> Let's say i have device 0 in hive A and it just got a gpu reset and at
>> the same time device 1 is being added to same have though
>> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
>> being added and so gpu reset for device 0 bails out on
>> '!mutex_trylock(>hive_lock))' without completing the reset.
>> Also in general i feel a bit uncomfortable about the confusing
>> acquisition scheme in the function and  the fact that you take the
>> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
>> of the function.
> Is adding a device while resetting a device even a valid operation
> anyways?

In theory it's valid if you have hot pluggable devices
>
> I think this means more so that the reset logic is broken.  Instead
> there should be a per-hive reset lock that is taken and that is tested
> instead.
>
> Tom

The hive->hive_lock was added exactly for this purpose and used only for
that purpose. Maybe the naming I gave it wasn't reflective of its
purpose :)

Andrey

>
>
>> Andrey
>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> index 8a8bc60cb6b4..ebf50809485f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> @@ -40,26 +40,39 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
>>> *hive)
>>> return >device_list;
>>> }
>>> 
>>> -struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
>>> +struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device
>>> +*adev, int lock)
>>> {
>>> int i;
>>> struct amdgpu_hive_info *tmp;
>>> 
>>> if (!adev->gmc.xgmi.hive_id)
>>> return NULL;
>>> +
>>> +   mutex_lock(_mutex);
>>> +
>>> for (i = 0 ; i < hive_count; ++i) {
>>> tmp = _hives[i];
>>> -   if (tmp->hive_id == adev->gmc.xgmi.hive_id)
>>> +   if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
>>> +   if (lock)
>>> +   mutex_lock(>hive_lock);
>>> +   mutex_unlock(_mutex);
>>> return tmp;
>>> +   }
>>> }
>>> -   if (i >= AMDGPU_MAX_XGMI_HIVE)
>>> +   if (i >= AMDGPU_MAX_XGMI_HIVE) {
>>> +   mutex_unlock(_mutex);
>>> return NULL;
>>> +   }
>>> 
>>> /* initialize new hive if not exist */
>>> tmp = _hives[hive_count++];
>>> tmp->hive_id = adev->gmc.xgmi.hive_id;
>>> INIT_LIST_HEAD(>device_list);
>>> mutex_init(>hive_lock);
>>> +   if (lock)
>>> +   mutex_lock(>hive_lock);
>>> +
>>> +   mutex_unlock(_mutex);
>>> 
>>> return tmp;
>>> }
>>> @@ -111,8 +124,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>>> return ret;
>>> }
>>> 
>>> -   mutex_lock(_mutex);
>>> -   hive = amdgpu_get_xgmi_hive(adev);
>>> +   /* find hive and take lock */
>>> +   hive = 

Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread StDenis, Tom
On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
> 
> 
> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
>> I think it's reasonable to use the hive  specific lock for hive  specific 
>> functions.
>> The changes is acked-by  Shaoyun.liu < shaoyun@amd.com>
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of StDenis, 
>> Tom
>> Sent: Monday, January 7, 2019 10:16 AM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: StDenis, Tom 
>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>>
>> v2: Move locks around in other functions so that this function can stand on 
>> its own.  Also only hold the hive specific lock for add/remove device 
>> instead of the driver global lock so you can't add/remove devices in 
>> parallel from one hive.
>>
>> Signed-off-by: Tom St Denis 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>>3 files changed, 25 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 39d5d058b2c7..13d8e2ad2f7a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>> *adev,
>>   * by different nodes. No point also since the one node already 
>> executing
>>   * reset will also reset all the other nodes in the hive.
>>   */
>> -hive = amdgpu_get_xgmi_hive(adev);
>> +hive = amdgpu_get_xgmi_hive(adev, 0);
>>  if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>  !mutex_trylock(>hive_lock))
>>  return 0;
> 
> Let's say i have device 0 in hive A and it just got a gpu reset and at
> the same time device 1 is being added to same have though
> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
> being added and so gpu reset for device 0 bails out on
> '!mutex_trylock(>hive_lock))' without completing the reset.
> Also in general i feel a bit uncomfortable about the confusing
> acquisition scheme in the function and  the fact that you take the
> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
> of the function.

Is adding a device while resetting a device even a valid operation 
anyways?

I think this means more so that the reset logic is broken.  Instead 
there should be a per-hive reset lock that is taken and that is tested 
instead.

Tom


> 
> Andrey
> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> index 8a8bc60cb6b4..ebf50809485f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> @@ -40,26 +40,39 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
>> *hive)
>>  return >device_list;
>>}
>>
>> -struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
>> +struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device
>> +*adev, int lock)
>>{
>>  int i;
>>  struct amdgpu_hive_info *tmp;
>>
>>  if (!adev->gmc.xgmi.hive_id)
>>  return NULL;
>> +
>> +mutex_lock(_mutex);
>> +
>>  for (i = 0 ; i < hive_count; ++i) {
>>  tmp = _hives[i];
>> -if (tmp->hive_id == adev->gmc.xgmi.hive_id)
>> +if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
>> +if (lock)
>> +mutex_lock(>hive_lock);
>> +mutex_unlock(_mutex);
>>  return tmp;
>> +}
>>  }
>> -if (i >= AMDGPU_MAX_XGMI_HIVE)
>> +if (i >= AMDGPU_MAX_XGMI_HIVE) {
>> +mutex_unlock(_mutex);
>>  return NULL;
>> +}
>>
>>  /* initialize new hive if not exist */
>>  tmp = _hives[hive_count++];
>>  tmp->hive_id = adev->gmc.xgmi.hive_id;
>>  INIT_LIST_HEAD(>device_list);
>>  mutex_init(>hive_lock);
>> +if (lock)
>> +mutex_lock(>hive_lock);
>> +
>> +mutex_unlock(_mutex);
>>
>>  return tmp;
>>}
>> @@ -111,8 +124,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>>  return ret;
>>  }
>>
>> -mutex_lock(_mutex);
>> -hive = amdgpu_get_xgmi_hive(adev);
>> +/* find hive and take lock */
>> +hive = amdgpu_get_xgmi_hive(adev, 1);
>>  if (!hive)
>>  goto exit;
>>
>> @@ -142,8 +155,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>>  break;
>>  }
>>
>> +mutex_unlock(>hive_lock);
>>exit:
>> -mutex_unlock(_mutex);
>>  return ret;
>>}
>>
>> @@ -154,15 +167,12 @@ void amdgpu_xgmi_remove_device(struct amdgpu_device 
>> *adev)
>>  if (!adev->gmc.xgmi.supported)
>>  return;
>>
>> -mutex_lock(_mutex);
>> -
>> -hive = 

Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread Grodzovsky, Andrey


On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
> I think it's reasonable to use the hive  specific lock for hive  specific 
> functions.
> The changes is acked-by  Shaoyun.liu < shaoyun@amd.com>
>
> -Original Message-
> From: amd-gfx  On Behalf Of StDenis, 
> Tom
> Sent: Monday, January 7, 2019 10:16 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: StDenis, Tom 
> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>
> v2: Move locks around in other functions so that this function can stand on 
> its own.  Also only hold the hive specific lock for add/remove device instead 
> of the driver global lock so you can't add/remove devices in parallel from 
> one hive.
>
> Signed-off-by: Tom St Denis 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>   3 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 39d5d058b2c7..13d8e2ad2f7a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>* by different nodes. No point also since the one node already 
> executing
>* reset will also reset all the other nodes in the hive.
>*/
> - hive = amdgpu_get_xgmi_hive(adev);
> + hive = amdgpu_get_xgmi_hive(adev, 0);
>   if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>   !mutex_trylock(>hive_lock))
>   return 0;

Let's say I have device 0 in hive A and it just got a GPU reset, and at
the same time device 1 is being added to the same hive through
amdgpu_xgmi_add_device; hive->hive_lock is acquired by this new device
being added and so the GPU reset for device 0 bails out on
'!mutex_trylock(&hive->hive_lock)' without completing the reset.
Also, in general I feel a bit uncomfortable about the confusing
acquisition scheme in the function and the fact that you take the
hive->hive_lock inside amdgpu_get_xgmi_hive but the release is still
outside of the function.

Andrey

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 8a8bc60cb6b4..ebf50809485f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -40,26 +40,39 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
> *hive)
>   return >device_list;
>   }
>   
> -struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
> +struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device
> +*adev, int lock)
>   {
>   int i;
>   struct amdgpu_hive_info *tmp;
>   
>   if (!adev->gmc.xgmi.hive_id)
>   return NULL;
> +
> + mutex_lock(_mutex);
> +
>   for (i = 0 ; i < hive_count; ++i) {
>   tmp = _hives[i];
> - if (tmp->hive_id == adev->gmc.xgmi.hive_id)
> + if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
> + if (lock)
> + mutex_lock(>hive_lock);
> + mutex_unlock(_mutex);
>   return tmp;
> + }
>   }
> - if (i >= AMDGPU_MAX_XGMI_HIVE)
> + if (i >= AMDGPU_MAX_XGMI_HIVE) {
> + mutex_unlock(_mutex);
>   return NULL;
> + }
>   
>   /* initialize new hive if not exist */
>   tmp = _hives[hive_count++];
>   tmp->hive_id = adev->gmc.xgmi.hive_id;
>   INIT_LIST_HEAD(>device_list);
>   mutex_init(>hive_lock);
> + if (lock)
> + mutex_lock(>hive_lock);
> +
> + mutex_unlock(_mutex);
>   
>   return tmp;
>   }
> @@ -111,8 +124,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>   return ret;
>   }
>   
> - mutex_lock(_mutex);
> - hive = amdgpu_get_xgmi_hive(adev);
> + /* find hive and take lock */
> + hive = amdgpu_get_xgmi_hive(adev, 1);
>   if (!hive)
>   goto exit;
>   
> @@ -142,8 +155,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>   break;
>   }
>   
> + mutex_unlock(>hive_lock);
>   exit:
> - mutex_unlock(_mutex);
>   return ret;
>   }
>   
> @@ -154,15 +167,12 @@ void amdgpu_xgmi_remove_device(struct amdgpu_device 
> *adev)
>   if (!adev->gmc.xgmi.supported)
>   return;
>   
> - mutex_lock(_mutex);
> -
> - hive = amdgpu_get_xgmi_hive(adev);
> + hive = amdgpu_get_xgmi_hive(adev, 1);
>   if (!hive)
> - goto exit;
> + return;
>   
>   if (!(hive->number_devices--))
>   mutex_destroy(>hive_lock);
> -
> -exit:
> - mutex_unlock(_mutex);
> + else
> + mutex_unlock(>hive_lock);
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h 
> 

Re: [PATCH] drm/amdgpu: Add message print when unable to get valid hive

2019-01-07 Thread Deucher, Alexander
Reviewed-by: Alex Deucher 


From: amd-gfx  on behalf of Liu, Shaoyun 

Sent: Monday, January 7, 2019 11:01:51 AM
To: amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu: Add message print when unable to get valid hive

Ping ...

-Original Message-
From: Liu, Shaoyun 
Sent: Friday, January 4, 2019 1:29 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu: Add message print when unable to get valid hive

Add a message printout and return -EINVAL when the driver can not get a valid
hive from the hive array on an xgmi configuration

Change-Id: Ic03927904edf0e384b8c4651e19274bb3f2a7d9a
Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 8a8bc60..b226631 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -113,8 +113,13 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)

 mutex_lock(_mutex);
 hive = amdgpu_get_xgmi_hive(adev);
-   if (!hive)
+   if (!hive) {
+   ret = -EINVAL;
+   dev_err(adev->dev,
+   "XGMI: Can not matech hive 0xllx in the hive list.\n",
+   adev->gmc.xgmi.hive_id);
 goto exit;
+   }

 hive_topology = >topology_info;

--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread Liu, Shaoyun
I think it's reasonable to use the hive-specific lock for hive-specific
functions.
The change is acked-by Shaoyun.liu < shaoyun@amd.com>

-Original Message-
From: amd-gfx  On Behalf Of StDenis, Tom
Sent: Monday, January 7, 2019 10:16 AM
To: amd-gfx@lists.freedesktop.org
Cc: StDenis, Tom 
Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

v2: Move locks around in other functions so that this function can stand on its 
own.  Also only hold the hive specific lock for add/remove device instead of 
the driver global lock so you can't add/remove devices in parallel from one 
hive.

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 39d5d058b2c7..13d8e2ad2f7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * by different nodes. No point also since the one node already 
executing
 * reset will also reset all the other nodes in the hive.
 */
-   hive = amdgpu_get_xgmi_hive(adev);
+   hive = amdgpu_get_xgmi_hive(adev, 0);
if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
!mutex_trylock(>hive_lock))
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 8a8bc60cb6b4..ebf50809485f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -40,26 +40,39 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
*hive)
return >device_list;
 }
 
-struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
+struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device 
+*adev, int lock)
 {
int i;
struct amdgpu_hive_info *tmp;
 
if (!adev->gmc.xgmi.hive_id)
return NULL;
+
+   mutex_lock(_mutex);
+
for (i = 0 ; i < hive_count; ++i) {
tmp = _hives[i];
-   if (tmp->hive_id == adev->gmc.xgmi.hive_id)
+   if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
+   if (lock)
+   mutex_lock(>hive_lock);
+   mutex_unlock(_mutex);
return tmp;
+   }
}
-   if (i >= AMDGPU_MAX_XGMI_HIVE)
+   if (i >= AMDGPU_MAX_XGMI_HIVE) {
+   mutex_unlock(_mutex);
return NULL;
+   }
 
/* initialize new hive if not exist */
tmp = _hives[hive_count++];
tmp->hive_id = adev->gmc.xgmi.hive_id;
INIT_LIST_HEAD(>device_list);
mutex_init(>hive_lock);
+   if (lock)
+   mutex_lock(>hive_lock);
+
+   mutex_unlock(_mutex);
 
return tmp;
 }
@@ -111,8 +124,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
return ret;
}
 
-   mutex_lock(_mutex);
-   hive = amdgpu_get_xgmi_hive(adev);
+   /* find hive and take lock */
+   hive = amdgpu_get_xgmi_hive(adev, 1);
if (!hive)
goto exit;
 
@@ -142,8 +155,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
break;
}
 
+   mutex_unlock(>hive_lock);
 exit:
-   mutex_unlock(_mutex);
return ret;
 }
 
@@ -154,15 +167,12 @@ void amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
if (!adev->gmc.xgmi.supported)
return;
 
-   mutex_lock(_mutex);
-
-   hive = amdgpu_get_xgmi_hive(adev);
+   hive = amdgpu_get_xgmi_hive(adev, 1);
if (!hive)
-   goto exit;
+   return;
 
if (!(hive->number_devices--))
mutex_destroy(>hive_lock);
-
-exit:
-   mutex_unlock(_mutex);
+   else
+   mutex_unlock(>hive_lock);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index 6151eb9c8ad3..8d7984844174 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -32,7 +32,7 @@ struct amdgpu_hive_info {
struct mutex hive_lock;
 };
 
-struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
+struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device 
+*adev, int lock);
 int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct 
amdgpu_device *adev);  int amdgpu_xgmi_add_device(struct amdgpu_device *adev);  
void amdgpu_xgmi_remove_device(struct amdgpu_device *adev);
--
2.17.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org

RE: [PATCH] drm/amdgpu: Add message print when unable to get valid hive

2019-01-07 Thread Liu, Shaoyun
Ping ...

-Original Message-
From: Liu, Shaoyun  
Sent: Friday, January 4, 2019 1:29 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu: Add message print when unable to get valid hive

Add a message printout and return -EINVAL when the driver can not get a valid
hive from the hive array on an xgmi configuration

Change-Id: Ic03927904edf0e384b8c4651e19274bb3f2a7d9a
Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 8a8bc60..b226631 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -113,8 +113,13 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
 
mutex_lock(_mutex);
hive = amdgpu_get_xgmi_hive(adev);
-   if (!hive)
+   if (!hive) {
+   ret = -EINVAL;
+   dev_err(adev->dev,
+   "XGMI: Can not matech hive 0xllx in the hive list.\n",
+   adev->gmc.xgmi.hive_id);
goto exit;
+   }
 
hive_topology = >topology_info;
 
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/2] drm/amd/display: Move iteration out of dm_update_planes

2019-01-07 Thread sunpeng.li
From: Leo Li 

[Why]
To reduce indentation of dm_update_planes, and to make it operate on
single plane instances.

[How]
Move iteration of plane states into atomic_check.
No functional change is intended.

Signed-off-by: Leo Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 220 +++---
 1 file changed, 114 insertions(+), 106 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 483..0b6bce5 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5769,145 +5769,141 @@ static int dm_update_crtcs_state(struct 
amdgpu_display_manager *dm,
return ret;
 }
 
-static int dm_update_planes_state(struct dc *dc,
- struct drm_atomic_state *state,
- bool enable,
- bool *lock_and_validation_needed)
+static int dm_update_plane_state(struct dc *dc,
+struct drm_atomic_state *state,
+struct drm_plane *plane,
+struct drm_plane_state *old_plane_state,
+struct drm_plane_state *new_plane_state,
+bool enable,
+bool *lock_and_validation_needed)
 {
 
struct dm_atomic_state *dm_state = NULL;
struct drm_crtc *new_plane_crtc, *old_plane_crtc;
struct drm_crtc_state *old_crtc_state, *new_crtc_state;
-   struct drm_plane *plane;
-   struct drm_plane_state *old_plane_state, *new_plane_state;
struct dm_crtc_state *dm_new_crtc_state, *dm_old_crtc_state;
struct dm_plane_state *dm_new_plane_state, *dm_old_plane_state;
-   int i ;
/* TODO return page_flip_needed() function */
bool pflip_needed  = !state->allow_modeset;
int ret = 0;
 
 
-   /* Add new planes, in reverse order as DC expectation */
-   for_each_oldnew_plane_in_state_reverse(state, plane, old_plane_state, 
new_plane_state, i) {
-   new_plane_crtc = new_plane_state->crtc;
-   old_plane_crtc = old_plane_state->crtc;
-   dm_new_plane_state = to_dm_plane_state(new_plane_state);
-   dm_old_plane_state = to_dm_plane_state(old_plane_state);
+   new_plane_crtc = new_plane_state->crtc;
+   old_plane_crtc = old_plane_state->crtc;
+   dm_new_plane_state = to_dm_plane_state(new_plane_state);
+   dm_old_plane_state = to_dm_plane_state(old_plane_state);
 
-   /*TODO Implement atomic check for cursor plane */
-   if (plane->type == DRM_PLANE_TYPE_CURSOR)
-   continue;
+   /*TODO Implement atomic check for cursor plane */
+   if (plane->type == DRM_PLANE_TYPE_CURSOR)
+   return 0;
 
-   /* Remove any changed/removed planes */
-   if (!enable) {
-   if (pflip_needed &&
-   plane->type != DRM_PLANE_TYPE_OVERLAY)
-   continue;
+   /* Remove any changed/removed planes */
+   if (!enable) {
+   if (pflip_needed &&
+   plane->type != DRM_PLANE_TYPE_OVERLAY)
+   return 0;
 
-   if (!old_plane_crtc)
-   continue;
+   if (!old_plane_crtc)
+   return 0;
 
-   old_crtc_state = drm_atomic_get_old_crtc_state(
-   state, old_plane_crtc);
-   dm_old_crtc_state = to_dm_crtc_state(old_crtc_state);
+   old_crtc_state = drm_atomic_get_old_crtc_state(
+   state, old_plane_crtc);
+   dm_old_crtc_state = to_dm_crtc_state(old_crtc_state);
 
-   if (!dm_old_crtc_state->stream)
-   continue;
+   if (!dm_old_crtc_state->stream)
+   return 0;
 
-   DRM_DEBUG_ATOMIC("Disabling DRM plane: %d on DRM crtc 
%d\n",
-   plane->base.id, 
old_plane_crtc->base.id);
+   DRM_DEBUG_ATOMIC("Disabling DRM plane: %d on DRM crtc %d\n",
+   plane->base.id, old_plane_crtc->base.id);
 
-   ret = dm_atomic_get_state(state, _state);
-   if (ret)
-   return ret;
+   ret = dm_atomic_get_state(state, _state);
+   if (ret)
+   return ret;
 
-   if (!dc_remove_plane_from_context(
-   dc,
-   dm_old_crtc_state->stream,
-   dm_old_plane_state->dc_state,
-   dm_state->context)) {
+ 
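
Since the new amdgpu_dm_atomic_check() hunk is cut off in the archive
above, here is an illustrative sketch of the loop shape the commit message
describes: the iteration that used to live inside dm_update_planes_state()
now sits in atomic_check and calls the per-plane helper (the wrapper name
below is made up only to keep the sketch self-contained; the helper
signature and the reverse-order macro are the ones from the diff):

static int example_update_planes(struct dc *dc,
				 struct drm_atomic_state *state,
				 bool enable,
				 bool *lock_and_validation_needed)
{
	struct drm_plane *plane;
	struct drm_plane_state *old_plane_state, *new_plane_state;
	int i, ret;

	/* DC expects planes to be added/removed in reverse order */
	for_each_oldnew_plane_in_state_reverse(state, plane, old_plane_state,
					       new_plane_state, i) {
		ret = dm_update_plane_state(dc, state, plane,
					    old_plane_state, new_plane_state,
					    enable, lock_and_validation_needed);
		if (ret)
			return ret;
	}
	return 0;
}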

[PATCH 2/2] drm/amd/display: Move iteration out of dm_update_crtcs

2019-01-07 Thread sunpeng.li
From: Leo Li 

[Why]
To reduce indentation in dm_update_crtcs, and to make it operate on single
CRTC instances.

[How]
Move iteration of CRTC states into atomic_check.
No functional change is intended.

Signed-off-by: Leo Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 341 +++---
 1 file changed, 175 insertions(+), 166 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 0b6bce5..70cd8bc 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5550,15 +5550,15 @@ static void reset_freesync_config_for_crtc(
   sizeof(new_crtc_state->vrr_infopacket));
 }
 
-static int dm_update_crtcs_state(struct amdgpu_display_manager *dm,
-struct drm_atomic_state *state,
-bool enable,
-bool *lock_and_validation_needed)
+static int dm_update_crtc_state(struct amdgpu_display_manager *dm,
+   struct drm_atomic_state *state,
+   struct drm_crtc *crtc,
+   struct drm_crtc_state *old_crtc_state,
+   struct drm_crtc_state *new_crtc_state,
+   bool enable,
+   bool *lock_and_validation_needed)
 {
struct dm_atomic_state *dm_state = NULL;
-   struct drm_crtc *crtc;
-   struct drm_crtc_state *old_crtc_state, *new_crtc_state;
-   int i;
struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state;
struct dc_stream_state *new_stream;
int ret = 0;
@@ -5567,200 +5567,199 @@ static int dm_update_crtcs_state(struct 
amdgpu_display_manager *dm,
 * TODO Move this code into dm_crtc_atomic_check once we get rid of 
dc_validation_set
 * update changed items
 */
-   for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, 
new_crtc_state, i) {
-   struct amdgpu_crtc *acrtc = NULL;
-   struct amdgpu_dm_connector *aconnector = NULL;
-   struct drm_connector_state *drm_new_conn_state = NULL, 
*drm_old_conn_state = NULL;
-   struct dm_connector_state *dm_new_conn_state = NULL, 
*dm_old_conn_state = NULL;
-   struct drm_plane_state *new_plane_state = NULL;
+   struct amdgpu_crtc *acrtc = NULL;
+   struct amdgpu_dm_connector *aconnector = NULL;
+   struct drm_connector_state *drm_new_conn_state = NULL, 
*drm_old_conn_state = NULL;
+   struct dm_connector_state *dm_new_conn_state = NULL, *dm_old_conn_state 
= NULL;
+   struct drm_plane_state *new_plane_state = NULL;
 
-   new_stream = NULL;
+   new_stream = NULL;
 
-   dm_old_crtc_state = to_dm_crtc_state(old_crtc_state);
-   dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
-   acrtc = to_amdgpu_crtc(crtc);
+   dm_old_crtc_state = to_dm_crtc_state(old_crtc_state);
+   dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
+   acrtc = to_amdgpu_crtc(crtc);
 
-   new_plane_state = drm_atomic_get_new_plane_state(state, 
new_crtc_state->crtc->primary);
+   new_plane_state = drm_atomic_get_new_plane_state(state, 
new_crtc_state->crtc->primary);
 
-   if (new_crtc_state->enable && new_plane_state && 
!new_plane_state->fb) {
-   ret = -EINVAL;
-   goto fail;
-   }
+   if (new_crtc_state->enable && new_plane_state && !new_plane_state->fb) {
+   ret = -EINVAL;
+   goto fail;
+   }
 
-   aconnector = 
amdgpu_dm_find_first_crtc_matching_connector(state, crtc);
+   aconnector = amdgpu_dm_find_first_crtc_matching_connector(state, crtc);
 
-   /* TODO This hack should go away */
-   if (aconnector && enable) {
-   /* Make sure fake sink is created in plug-in scenario */
-   drm_new_conn_state = 
drm_atomic_get_new_connector_state(state,
-   
>base);
-   drm_old_conn_state = 
drm_atomic_get_old_connector_state(state,
-   
>base);
+   /* TODO This hack should go away */
+   if (aconnector && enable) {
+   /* Make sure fake sink is created in plug-in scenario */
+   drm_new_conn_state = drm_atomic_get_new_connector_state(state,
+   >base);
+   drm_old_conn_state = drm_atomic_get_old_connector_state(state,
+   >base);
 
-   if (IS_ERR(drm_new_conn_state)) {
-   ret = PTR_ERR_OR_ZERO(drm_new_conn_state);
- 
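
As with patch 1/2, the new caller loop in amdgpu_dm_atomic_check() is
truncated above; an illustrative sketch of its shape (the wrapper name is
made up, the helper signature and iteration macro are from the diff, and
atomic_check would run it once with enable = false and once with
enable = true, matching how dm_update_crtcs_state() used to be called):

static int example_update_crtcs(struct amdgpu_display_manager *dm,
				struct drm_atomic_state *state,
				bool enable,
				bool *lock_and_validation_needed)
{
	struct drm_crtc *crtc;
	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
	int i, ret;

	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state,
				      new_crtc_state, i) {
		ret = dm_update_crtc_state(dm, state, crtc,
					   old_crtc_state, new_crtc_state,
					   enable, lock_and_validation_needed);
		if (ret)
			return ret;
	}
	return 0;
}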

Re: System with kernel 4.21 hang immediately after start (invalid opcode: 0000 [#1] SMP NOPTI)

2019-01-07 Thread Michel Dänzer
On 2019-01-07 4:21 p.m., Mikhail Gavrilov wrote:
> On Mon, 7 Jan 2019 at 15:09, Michel Dänzer  wrote:
>>
>> Is this reproducible with commit 674e78acae0d ("drm/amd/display: Add
>> fast path for cursor plane updates") + the patch above?
> 
> Yes. I am sure I am able to reproduce the issue:
> All new dmesg are attached.
> 
>>
>> If yes, is it not reproducible with commit fc42d47ce011 ("drm/amdgpu:
>> Enable GPU recovery by default for CI")? (Please test long enough to be
>> at least 99% sure before saying it's not)
> 
> Commits 674e78acae0d and fc42d47ce011 are already present in my patched kernel.

I mean:

Run "git checkout 674e78acae0d" and apply the patch. Build the kernel
and boot it. If it reproduces the problem, un-apply the patch, run "git
checkout fc42d47ce011", build the kernel and boot it. Test until either
it reproduces the problem, or you're at least 99% sure it won't.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/2] drm/amd/powerplay: create pp_od_clk_voltage device file under OD support

2019-01-07 Thread Deucher, Alexander
Series is:

Reviewed-by: Alex Deucher 


From: amd-gfx  on behalf of Evan Quan 

Sent: Monday, January 7, 2019 7:05:56 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander; Quan, Evan; Freehill, Chris
Subject: [PATCH 2/2] drm/amd/powerplay: create pp_od_clk_voltage device file 
under OD support

Since the pp_od_clk_voltage device file is only for OD-related sysfs operations, create it only when OD is supported.

Change-Id: I13e95b4bd2ffb93b1cd5d272dd27171ab38dbe57
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
index 32aa4d09e12d..d8f3f006f5c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
@@ -2061,6 +2061,7 @@ void amdgpu_pm_print_power_states(struct amdgpu_device 
*adev)

 int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
 {
+   struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
 int ret;

 if (adev->pm.sysfs_initialized)
@@ -2144,12 +2145,14 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
 "pp_power_profile_mode\n");
 return ret;
 }
-   ret = device_create_file(adev->dev,
-   _attr_pp_od_clk_voltage);
-   if (ret) {
-   DRM_ERROR("failed to create device file "
-   "pp_od_clk_voltage\n");
-   return ret;
+   if (hwmgr->od_enabled) {
+   ret = device_create_file(adev->dev,
+   _attr_pp_od_clk_voltage);
+   if (ret) {
+   DRM_ERROR("failed to create device file "
+   "pp_od_clk_voltage\n");
+   return ret;
+   }
 }
 ret = device_create_file(adev->dev,
 _attr_gpu_busy_percent);
@@ -2171,6 +2174,8 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)

 void amdgpu_pm_sysfs_fini(struct amdgpu_device *adev)
 {
+   struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
+
 if (adev->pm.dpm_enabled == 0)
 return;

@@ -2191,8 +2196,9 @@ void amdgpu_pm_sysfs_fini(struct amdgpu_device *adev)
 device_remove_file(adev->dev, _attr_pp_mclk_od);
 device_remove_file(adev->dev,
 _attr_pp_power_profile_mode);
-   device_remove_file(adev->dev,
-   _attr_pp_od_clk_voltage);
+   if (hwmgr->od_enabled)
+   device_remove_file(adev->dev,
+   _attr_pp_od_clk_voltage);
 device_remove_file(adev->dev, _attr_gpu_busy_percent);
 }

--
2.20.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)

2019-01-07 Thread StDenis, Tom
v2: Move locks around in other functions so that this
function can stand on its own.  Also only hold the hive-specific
lock for add/remove of a device instead of the driver
global lock, so devices still can't be added/removed in parallel
within one hive.

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 39d5d058b2c7..13d8e2ad2f7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * by different nodes. No point also since the one node already 
executing
 * reset will also reset all the other nodes in the hive.
 */
-   hive = amdgpu_get_xgmi_hive(adev);
+   hive = amdgpu_get_xgmi_hive(adev, 0);
if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
    !mutex_trylock(&hive->hive_lock))
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 8a8bc60cb6b4..ebf50809485f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -40,26 +40,39 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
*hive)
 return &hive->device_list;
 }
 
-struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
+struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev, int 
lock)
 {
int i;
struct amdgpu_hive_info *tmp;
 
if (!adev->gmc.xgmi.hive_id)
return NULL;
+
+   mutex_lock(&xgmi_mutex);
+
for (i = 0 ; i < hive_count; ++i) {
tmp = &xgmi_hives[i];
-   if (tmp->hive_id == adev->gmc.xgmi.hive_id)
+   if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
+   if (lock)
+   mutex_lock(&tmp->hive_lock);
+   mutex_unlock(&xgmi_mutex);
return tmp;
+   }
}
-   if (i >= AMDGPU_MAX_XGMI_HIVE)
+   if (i >= AMDGPU_MAX_XGMI_HIVE) {
+   mutex_unlock(&xgmi_mutex);
return NULL;
+   }
 
/* initialize new hive if not exist */
tmp = &xgmi_hives[hive_count++];
tmp->hive_id = adev->gmc.xgmi.hive_id;
INIT_LIST_HEAD(&tmp->device_list);
mutex_init(&tmp->hive_lock);
+   if (lock)
+   mutex_lock(&tmp->hive_lock);
+
+   mutex_unlock(&xgmi_mutex);
 
return tmp;
 }
@@ -111,8 +124,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
return ret;
}
 
-   mutex_lock(&xgmi_mutex);
-   hive = amdgpu_get_xgmi_hive(adev);
+   /* find hive and take lock */
+   hive = amdgpu_get_xgmi_hive(adev, 1);
if (!hive)
goto exit;
 
@@ -142,8 +155,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
break;
}
 
+   mutex_unlock(&hive->hive_lock);
 exit:
-   mutex_unlock(&xgmi_mutex);
return ret;
 }
 
@@ -154,15 +167,12 @@ void amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
if (!adev->gmc.xgmi.supported)
return;
 
-   mutex_lock(&xgmi_mutex);
-
-   hive = amdgpu_get_xgmi_hive(adev);
+   hive = amdgpu_get_xgmi_hive(adev, 1);
if (!hive)
-   goto exit;
+   return;
 
if (!(hive->number_devices--))
mutex_destroy(&hive->hive_lock);
-
-exit:
-   mutex_unlock(&xgmi_mutex);
+   else
+   mutex_unlock(&hive->hive_lock);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index 6151eb9c8ad3..8d7984844174 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -32,7 +32,7 @@ struct amdgpu_hive_info {
struct mutex hive_lock;
 };
 
-struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
+struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev, int 
lock);
 int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct 
amdgpu_device *adev);
 int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
 void amdgpu_xgmi_remove_device(struct amdgpu_device *adev);
-- 
2.17.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd: fix typo

2019-01-07 Thread Matteo Croce
Fix spelling mistake: "lenght" -> "length"

Signed-off-by: Matteo Croce 
---
 drivers/gpu/drm/amd/include/atombios.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/include/atombios.h 
b/drivers/gpu/drm/amd/include/atombios.h
index 7931502fa54f..8ba21747b40a 100644
--- a/drivers/gpu/drm/amd/include/atombios.h
+++ b/drivers/gpu/drm/amd/include/atombios.h
@@ -4106,7 +4106,7 @@ typedef struct  _ATOM_LCD_MODE_CONTROL_CAP
 typedef struct _ATOM_FAKE_EDID_PATCH_RECORD
 {
   UCHAR ucRecordType;
-  UCHAR ucFakeEDIDLength;   // = 128 means EDID lenght is 128 bytes, 
otherwise the EDID length = ucFakeEDIDLength*128
+  UCHAR ucFakeEDIDLength;   // = 128 means EDID length is 128 bytes, 
otherwise the EDID length = ucFakeEDIDLength*128
   UCHAR ucFakeEDIDString[1];// This actually has ucFakeEdidLength elements.
 } ATOM_FAKE_EDID_PATCH_RECORD;
 
-- 
2.20.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: disable system memory page tables for now

2019-01-07 Thread Christian König

Am 07.01.19 um 16:03 schrieb Michel Dänzer:

On 2019-01-07 2:45 p.m., Christian König wrote:

We hit a problem with IOMMU with that. Disable until we have time to
debug further.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4cc0d1079935..ccffcadfd73b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -847,9 +847,6 @@ static void amdgpu_vm_bo_param(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
bp->size = amdgpu_vm_bo_size(adev, level);
bp->byte_align = AMDGPU_GPU_PAGE_SIZE;
bp->domain = AMDGPU_GEM_DOMAIN_VRAM;
-   if (bp->size <= PAGE_SIZE && adev->asic_type >= CHIP_VEGA10 &&
-   adev->flags & AMD_IS_APU)
-   bp->domain |= AMDGPU_GEM_DOMAIN_GTT;
bp->domain = amdgpu_bo_get_preferred_pin_domain(adev, bp->domain);
bp->flags = AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS |
AMDGPU_GEM_CREATE_CPU_GTT_USWC;


Thanks for taking care of this Christian.


Actually I'm not that happy about this because it seems to point out 
either a hardware problem or a software problem in the IOMMU driver.


I just don't have time to investigate further, so disabling it is the 
least I can do.



Maybe add a reference to at
least one of the bug reports about this, but either way:

Reviewed-by: Michel Dänzer 


Thanks,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: disable system memory page tables for now

2019-01-07 Thread Michel Dänzer
On 2019-01-07 2:45 p.m., Christian König wrote:
> We hit a problem with IOMMU with that. Disable until we have time to
> debug further.
> 
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 4cc0d1079935..ccffcadfd73b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -847,9 +847,6 @@ static void amdgpu_vm_bo_param(struct amdgpu_device 
> *adev, struct amdgpu_vm *vm,
>   bp->size = amdgpu_vm_bo_size(adev, level);
>   bp->byte_align = AMDGPU_GPU_PAGE_SIZE;
>   bp->domain = AMDGPU_GEM_DOMAIN_VRAM;
> - if (bp->size <= PAGE_SIZE && adev->asic_type >= CHIP_VEGA10 &&
> - adev->flags & AMD_IS_APU)
> - bp->domain |= AMDGPU_GEM_DOMAIN_GTT;
>   bp->domain = amdgpu_bo_get_preferred_pin_domain(adev, bp->domain);
>   bp->flags = AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS |
>   AMDGPU_GEM_CREATE_CPU_GTT_USWC;
> 

Thanks for taking care of this Christian. Maybe add a reference to at
least one of the bug reports about this, but either way:

Reviewed-by: Michel Dänzer 


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive()

2019-01-07 Thread StDenis, Tom
Self NAK this ... calling functions take the same lock

We should remove the locks from the callers so this function is thread 
safe on its own...

Tom


On 2019-01-07 10:00 a.m., StDenis, Tom wrote:
> Signed-off-by: Tom St Denis 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 8a8bc60cb6b4..587a5f73ae8c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -47,13 +47,20 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
> amdgpu_device *adev)
>   
>   if (!adev->gmc.xgmi.hive_id)
>   return NULL;
> +
> + mutex_lock(&xgmi_mutex);
> +
>   for (i = 0 ; i < hive_count; ++i) {
>   tmp = &xgmi_hives[i];
> - if (tmp->hive_id == adev->gmc.xgmi.hive_id)
> + if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
> + mutex_unlock(&xgmi_mutex);
>   return tmp;
> + }
>   }
> - if (i >= AMDGPU_MAX_XGMI_HIVE)
> + if (i >= AMDGPU_MAX_XGMI_HIVE) {
> + mutex_unlock(&xgmi_mutex);
>   return NULL;
> + }
>   
>   /* initialize new hive if not exist */
>   tmp = &xgmi_hives[hive_count++];
> @@ -61,6 +68,8 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
> amdgpu_device *adev)
>   INIT_LIST_HEAD(&tmp->device_list);
>   mutex_init(&tmp->hive_lock);
>   
> + mutex_unlock(&xgmi_mutex);
> +
>   return tmp;
>   }
>   
> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] add missing mutex lock to amdgpu_get_xgmi_hive()

2019-01-07 Thread StDenis, Tom
Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 8a8bc60cb6b4..587a5f73ae8c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -47,13 +47,20 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
amdgpu_device *adev)
 
if (!adev->gmc.xgmi.hive_id)
return NULL;
+
+   mutex_lock(&xgmi_mutex);
+
for (i = 0 ; i < hive_count; ++i) {
tmp = &xgmi_hives[i];
-   if (tmp->hive_id == adev->gmc.xgmi.hive_id)
+   if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
+   mutex_unlock(&xgmi_mutex);
return tmp;
+   }
}
-   if (i >= AMDGPU_MAX_XGMI_HIVE)
+   if (i >= AMDGPU_MAX_XGMI_HIVE) {
+   mutex_unlock(&xgmi_mutex);
return NULL;
+   }
 
/* initialize new hive if not exist */
tmp = &xgmi_hives[hive_count++];
@@ -61,6 +68,8 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
amdgpu_device *adev)
INIT_LIST_HEAD(&tmp->device_list);
mutex_init(&tmp->hive_lock);
 
+   mutex_unlock(&xgmi_mutex);
+
return tmp;
 }
 
-- 
2.17.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/3] drm/amdgpu: replace get_user_pages with HMM address mirror helpers v5

2019-01-07 Thread Christian König

Am 14.12.18 um 22:10 schrieb Yang, Philip:

Use the HMM helper function hmm_vma_fault() to get the physical pages backing
userptr and to start tracking CPU page table updates on those pages. Then use
hmm_vma_range_done() to check whether those pages were updated before
amdgpu_cs_submit for gfx, or before user queues are resumed for kfd.

If userptr pages are updated, for gfx, amdgpu_cs_ioctl will restart
from scratch, for kfd, restore worker is rescheduled to retry.

HMM simplifies the CPU page table concurrent update check, so remove
guptasklock, mmu_invalidations, last_set_pages fields from
amdgpu_ttm_tt struct.

HMM does not pin the page (increase page ref count), so remove related
operations like release_pages(), put_page(), mark_page_dirty().
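
As a rough outline of the retry flow described above (a sketch only; the
hmm_range setup and exact signatures are assumptions based on the 4.20-era
HMM API, not part of this patch):

    /* sketch: mirror userptr pages without pinning them */
    ret = hmm_vma_fault(&range, true);      /* assumed helper: fault pages, start tracking */
    if (ret)
            return ret;
    /* ... build the CS / prepare the user queue from the snapshot ... */
    if (!hmm_vma_range_done(&range))        /* CPU page tables changed meanwhile */
            goto retry;                     /* gfx: restart CS; kfd: reschedule restore worker */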

Change-Id: I2e8c0c6f0d8c21e5596a32d7fc91762778bc9e67
Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|   1 -
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  95 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c   |   2 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h   |   3 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 185 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  14 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c|  25 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|   4 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 168 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |   4 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|   1 -
  11 files changed, 209 insertions(+), 293 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 70429f7aa9a8..717791d4fa45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -62,7 +62,6 @@ struct kgd_mem {
  
  	atomic_t invalid;

struct amdkfd_process_info *process_info;
-   struct page **user_pages;
  
  	struct amdgpu_sync sync;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c

index be1ab43473c6..2897793600f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -582,28 +582,12 @@ static int init_user_pages(struct kgd_mem *mem, struct 
mm_struct *mm,
goto out;
}
  
-	/* If no restore worker is running concurrently, user_pages

-* should not be allocated
-*/
-   WARN(mem->user_pages, "Leaking user_pages array");
-
-   mem->user_pages = kvmalloc_array(bo->tbo.ttm->num_pages,
-  sizeof(struct page *),
-  GFP_KERNEL | __GFP_ZERO);
-   if (!mem->user_pages) {
-   pr_err("%s: Failed to allocate pages array\n", __func__);
-   ret = -ENOMEM;
-   goto unregister_out;
-   }
-
-   ret = amdgpu_ttm_tt_get_user_pages(bo->tbo.ttm, mem->user_pages);
+   ret = amdgpu_ttm_tt_get_user_pages(bo->tbo.ttm, bo->tbo.ttm->pages);
if (ret) {
pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
-   goto free_out;
+   goto unregister_out;
}
  
-	amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm, mem->user_pages);

-
ret = amdgpu_bo_reserve(bo, true);
if (ret) {
pr_err("%s: Failed to reserve BO\n", __func__);
@@ -616,11 +600,7 @@ static int init_user_pages(struct kgd_mem *mem, struct 
mm_struct *mm,
amdgpu_bo_unreserve(bo);
  
  release_out:

-   if (ret)
-   release_pages(mem->user_pages, bo->tbo.ttm->num_pages);
-free_out:
-   kvfree(mem->user_pages);
-   mem->user_pages = NULL;
+   amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
  unregister_out:
if (ret)
amdgpu_mn_unregister(bo);
@@ -679,7 +659,6 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
ctx->kfd_bo.priority = 0;
ctx->kfd_bo.tv.bo = &bo->tbo;
ctx->kfd_bo.tv.num_shared = 1;
-   ctx->kfd_bo.user_pages = NULL;
list_add(&ctx->kfd_bo.tv.head, &ctx->list);
  
  	amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);

@@ -743,7 +722,6 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
ctx->kfd_bo.priority = 0;
ctx->kfd_bo.tv.bo = &bo->tbo;
ctx->kfd_bo.tv.num_shared = 1;
-   ctx->kfd_bo.user_pages = NULL;
list_add(&ctx->kfd_bo.tv.head, &ctx->list);
  
  	i = 0;

@@ -1371,15 +1349,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
list_del(&bo_list_entry->head);
mutex_unlock(&process_info->lock);
  
-	/* Free user pages if necessary */

-   if (mem->user_pages) {
-   pr_debug("%s: Freeing user_pages array\n", __func__);
-   if (mem->user_pages[0])
-   release_pages(mem->user_pages,
-   mem->bo->tbo.ttm->num_pages);
-   kvfree(mem->user_pages);
-   }
-
ret = 

Re: [PATCH v5 1/2] drm/sched: Refactor ring mirror list handling.

2019-01-07 Thread Christian König

Am 03.01.19 um 18:42 schrieb Grodzovsky, Andrey:


On 01/03/2019 11:20 AM, Grodzovsky, Andrey wrote:

On 01/03/2019 03:54 AM, Koenig, Christian wrote:

Am 21.12.18 um 21:36 schrieb Grodzovsky, Andrey:

On 12/21/2018 01:37 PM, Christian König wrote:

Am 20.12.18 um 20:23 schrieb Andrey Grodzovsky:

Decouple sched threads stop and start and ring mirror
list handling from the policy of what to do about the
guilty jobs.
When stopping the sched thread and detaching sched fences
from non signaled HW fences wait for all signaled HW fences
to complete before rerunning the jobs.

v2: Fix resubmission of guilty job into HW after refactoring.

v4:
Full restart for all the jobs, not only from guilty ring.
Extract karma increase into standalone function.

v5:
Rework waiting for signaled jobs without relying on the job
struct itself as those might already be freed for non 'guilty'
job's schedulers.
Expose karma increase to drivers.

Suggested-by: Christian Koenig 
Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  18 +--
  drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  11 +-
  drivers/gpu/drm/scheduler/sched_main.c | 188
+++--
  drivers/gpu/drm/v3d/v3d_sched.c    |  12 +-
  include/drm/gpu_scheduler.h    |  10 +-
  5 files changed, 151 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8a078f4..a4bd2d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3298,12 +3298,10 @@ static int
amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
  if (!ring || !ring->sched.thread)
  continue;
  -    kthread_park(ring->sched.thread);
+    drm_sched_stop(&ring->sched, job ? &job->base : NULL);
  -    if (job && job->base.sched != &ring->sched)
-    continue;
-
-    drm_sched_hw_job_reset(&ring->sched, job ? &job->base : NULL);
+    if(job)
+    drm_sched_increase_karma(&job->base);

Since we dropped the "job && job->base.sched != &ring->sched" check
above, this will now increase the job's karma multiple times.

Maybe just move that outside of the loop.


    /* after all hw jobs are reset, hw fence is meaningless,
so force_completion */
  amdgpu_fence_driver_force_completion(ring);
@@ -3454,14 +3452,10 @@ static void
amdgpu_device_post_asic_reset(struct amdgpu_device *adev,
  if (!ring || !ring->sched.thread)
  continue;
  -    /* only need recovery sched of the given job's ring
- * or all rings (in the case @job is NULL)
- * after above amdgpu_reset accomplished
- */
-    if ((!job || job->base.sched == &ring->sched) &&
!adev->asic_reset_res)
-    drm_sched_job_recovery(&ring->sched);
+    if (!adev->asic_reset_res)
+    drm_sched_resubmit_jobs(&ring->sched);
  -    kthread_unpark(ring->sched.thread);
+    drm_sched_start(&ring->sched, !adev->asic_reset_res);
  }
    if (!amdgpu_device_has_dc_support(adev)) {
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 49a6763..6f1268f 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -109,16 +109,19 @@ static void etnaviv_sched_timedout_job(struct
drm_sched_job *sched_job)
  }
    /* block scheduler */
-    kthread_park(gpu->sched.thread);
-    drm_sched_hw_job_reset(&gpu->sched, sched_job);
+    drm_sched_stop(&gpu->sched, sched_job);
+
+    if(sched_job)
+    drm_sched_increase_karma(sched_job);
    /* get the GPU back into the init state */
  etnaviv_core_dump(gpu);
  etnaviv_gpu_recover_hang(gpu);
+    drm_sched_resubmit_jobs(&gpu->sched);
+
  /* restart scheduler after GPU is usable again */
-    drm_sched_job_recovery(&gpu->sched);
-    kthread_unpark(gpu->sched.thread);
+    drm_sched_start(&gpu->sched, true);
  }
    static void etnaviv_sched_free_job(struct drm_sched_job *sched_job)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index dbb6906..b5c5bee 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -60,8 +60,6 @@
    static void drm_sched_process_job(struct dma_fence *f, struct
dma_fence_cb *cb);
  -static void drm_sched_expel_job_unlocked(struct drm_sched_job
*s_job);
-
  /**
   * drm_sched_rq_init - initialize a given run queue struct
   *
@@ -335,6 +333,42 @@ static void drm_sched_job_timedout(struct
work_struct *work)
  spin_unlock_irqrestore(&sched->job_list_lock, flags);
  }

Kernel doc here would be nice to have.


+void drm_sched_increase_karma(struct drm_sched_job *bad)
+{
+    int i;
+    struct drm_sched_entity *tmp;
+    struct drm_sched_entity *entity;
+    struct drm_gpu_scheduler *sched = bad->sched;
+
+    /* 

[PATCH] drm/amdgpu: disable system memory page tables for now

2019-01-07 Thread Christian König
We hit a problem with IOMMU with that. Disable until we have time to
debug further.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4cc0d1079935..ccffcadfd73b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -847,9 +847,6 @@ static void amdgpu_vm_bo_param(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
bp->size = amdgpu_vm_bo_size(adev, level);
bp->byte_align = AMDGPU_GPU_PAGE_SIZE;
bp->domain = AMDGPU_GEM_DOMAIN_VRAM;
-   if (bp->size <= PAGE_SIZE && adev->asic_type >= CHIP_VEGA10 &&
-   adev->flags & AMD_IS_APU)
-   bp->domain |= AMDGPU_GEM_DOMAIN_GTT;
bp->domain = amdgpu_bo_get_preferred_pin_domain(adev, bp->domain);
bp->flags = AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS |
AMDGPU_GEM_CREATE_CPU_GTT_USWC;
-- 
2.14.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 2/2] drm/amd/powerplay: create pp_od_clk_voltage device file under OD support

2019-01-07 Thread Evan Quan
The pp_od_clk_voltage device file is only used for OD related sysfs operations, so create it only when OD is supported.
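
For illustration, a hypothetical userspace check (not part of the patch; the
card0 path is an assumption) showing the visible effect, since the node now
only exists when overdrive is supported:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            const char *node = "/sys/class/drm/card0/device/pp_od_clk_voltage";

            if (access(node, R_OK) == 0)
                    printf("overdrive supported: %s present\n", node);
            else
                    printf("no overdrive support: %s absent\n", node);
            return 0;
    }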

Change-Id: I13e95b4bd2ffb93b1cd5d272dd27171ab38dbe57
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
index 32aa4d09e12d..d8f3f006f5c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
@@ -2061,6 +2061,7 @@ void amdgpu_pm_print_power_states(struct amdgpu_device 
*adev)
 
 int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
 {
+   struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
int ret;
 
if (adev->pm.sysfs_initialized)
@@ -2144,12 +2145,14 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
"pp_power_profile_mode\n");
return ret;
}
-   ret = device_create_file(adev->dev,
-   &dev_attr_pp_od_clk_voltage);
-   if (ret) {
-   DRM_ERROR("failed to create device file "
-   "pp_od_clk_voltage\n");
-   return ret;
+   if (hwmgr->od_enabled) {
+   ret = device_create_file(adev->dev,
+   &dev_attr_pp_od_clk_voltage);
+   if (ret) {
+   DRM_ERROR("failed to create device file "
+   "pp_od_clk_voltage\n");
+   return ret;
+   }
}
ret = device_create_file(adev->dev,
&dev_attr_gpu_busy_percent);
@@ -2171,6 +2174,8 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
 
 void amdgpu_pm_sysfs_fini(struct amdgpu_device *adev)
 {
+   struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
+
if (adev->pm.dpm_enabled == 0)
return;
 
@@ -2191,8 +2196,9 @@ void amdgpu_pm_sysfs_fini(struct amdgpu_device *adev)
device_remove_file(adev->dev, &dev_attr_pp_mclk_od);
device_remove_file(adev->dev,
&dev_attr_pp_power_profile_mode);
-   device_remove_file(adev->dev,
-   &dev_attr_pp_od_clk_voltage);
+   if (hwmgr->od_enabled)
+   device_remove_file(adev->dev,
+   &dev_attr_pp_od_clk_voltage);
device_remove_file(adev->dev, &dev_attr_gpu_busy_percent);
 }
 
-- 
2.20.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/2] drm/amd/powerplay: update OD support flag for SKU with no OD capabilities

2019-01-07 Thread Evan Quan
For those ASICs with no overdrive capabilities, the OD support flag
will be reset.

Change-Id: I8b75ad27ec0035b80de555840ba496bc273fee08
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c 
b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
index 25b8742f51c5..ea0939ad3316 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c
@@ -980,6 +980,9 @@ static int vega20_od8_set_feature_capabilities(
pp_table->FanZeroRpmEnable)
od_settings->overdrive8_capabilities |= 
OD8_FAN_ZERO_RPM_CONTROL;
 
+   if (!od_settings->overdrive8_capabilities)
+   hwmgr->od_enabled = false;
+
return 0;
 }
 
-- 
2.20.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 2/2] drm/amdgpu: Add sysfs file for PCIe usage

2019-01-07 Thread Russell, Kent
Did you mean on APU+dGPU, or just straight APU? The file currently returns 0s 
on my Carrizo. The event values are the same on CZ as they are for the dGPU 
ASICs, so it might just be that it's not supported since it's not going through 
the physical PCIe bus on the board.  I have tested it on Fiji/Vega10/Vega20 
successfully, and they all give reasonable values, at least. But I don't have 
access to Raven to see if it just returns 0s like CZ does. Would you prefer 
that the sysfs file only be added for dGPUs, to save confusion and wasted sysfs 
files that return no useful information?

 Kent
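
For context, a hypothetical reader for the proposed node (the card0 path is
an assumption; the output format is taken from the patch quoted below, where
both counters are already multiplied by the PCIe max payload size):

    #include <stdio.h>

    int main(void)
    {
            unsigned long long rx = 0, tx = 0;
            FILE *f = fopen("/sys/class/drm/card0/device/pcie_bw", "r");

            if (!f)
                    return 1;
            if (fscanf(f, "Bytes received: %llu\nBytes sent: %llu", &rx, &tx) == 2)
                    printf("~%.2f MB/s received, ~%.2f MB/s sent over the last second\n",
                           rx / 1e6, tx / 1e6);
            fclose(f);
            return 0;
    }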

-Original Message-
From: Alex Deucher  
Sent: Friday, January 04, 2019 11:22 AM
To: Russell, Kent 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: Add sysfs file for PCIe usage

On Fri, Jan 4, 2019 at 7:35 AM Russell, Kent  wrote:
>
> Add a sysfs file that reports the number of bytes transmitted and 
> received in the last second. This can be used to approximate the PCIe 
> bandwidth usage over the last second.
>
> Change-Id: I6c82fe1cb17726825736512990fa64f5c1489861
> Signed-off-by: Kent Russell 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  4 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 31 
>  drivers/gpu/drm/amd/amdgpu/cik.c   | 41 +++
>  drivers/gpu/drm/amd/amdgpu/si.c| 41 +++
>  drivers/gpu/drm/amd/amdgpu/soc15.c | 44 
> ++
>  drivers/gpu/drm/amd/amdgpu/vi.c| 41 +++
>  6 files changed, 202 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e1b2c64..512b124 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -542,6 +542,9 @@ struct amdgpu_asic_funcs {
> bool (*need_full_reset)(struct amdgpu_device *adev);
> /* initialize doorbell layout for specific asic*/
> void (*init_doorbell_index)(struct amdgpu_device *adev);
> +   /* PCIe bandwidth usage */
> +   void (*get_pcie_usage)(struct amdgpu_device *adev, uint64_t *count0,
> +  uint64_t *count1);
>  };
>
>  /*
> @@ -1045,6 +1048,7 @@ int emu_soc_asic_init(struct amdgpu_device 
> *adev);  #define amdgpu_asic_invalidate_hdp(adev, r) 
> (adev)->asic_funcs->invalidate_hdp((adev), (r))  #define 
> amdgpu_asic_need_full_reset(adev) 
> (adev)->asic_funcs->need_full_reset((adev))
>  #define amdgpu_asic_init_doorbell_index(adev) 
> (adev)->asic_funcs->init_doorbell_index((adev))
> +#define amdgpu_asic_get_pcie_usage(adev, cnt0, cnt1) 
> +((adev)->asic_funcs->get_pcie_usage((adev), (cnt0), (cnt1)))
>
>  /* Common functions */
>  bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev); 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> index 1f61ed9..051307e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> @@ -990,6 +990,30 @@ static ssize_t amdgpu_get_busy_percent(struct device 
> *dev,
> return snprintf(buf, PAGE_SIZE, "%d\n", value);  }
>
> +/**
> + * DOC: pcie_bw
> + *
> + * The amdgpu driver provides a sysfs API for reading how much data
> + * has been sent and received by the GPU in the last second through PCIe.
> + * The file pcie_bw is used for this.
> + * The Perf counters calculate this and return the number of sent and 
> +received
> + * messages, which we multiply by the size of our PCIe packets (mps).
> + */
> +static ssize_t amdgpu_get_pcie_bw(struct device *dev,
> +   struct device_attribute *attr,
> +   char *buf)
> +{
> +   struct drm_device *ddev = dev_get_drvdata(dev);
> +   struct amdgpu_device *adev = ddev->dev_private;
> +   uint64_t mps = pcie_get_mps(adev->pdev);
> +   uint64_t count0, count1;
> +
> +   amdgpu_asic_get_pcie_usage(adev, &count0, &count1);
> +   return snprintf(buf, PAGE_SIZE,
> +   "Bytes received: %llu\nBytes sent: %llu\n",
> +   count0 * mps, count1 * mps); }
> +
>  static DEVICE_ATTR(power_dpm_state, S_IRUGO | S_IWUSR, 
> amdgpu_get_dpm_state, amdgpu_set_dpm_state);  static 
> DEVICE_ATTR(power_dpm_force_performance_level, S_IRUGO | S_IWUSR,
>amdgpu_get_dpm_forced_performance_level,
> @@ -1025,6 +1049,7 @@ static DEVICE_ATTR(pp_od_clk_voltage, S_IRUGO | S_IWUSR,
> amdgpu_set_pp_od_clk_voltage);  static 
> DEVICE_ATTR(gpu_busy_percent, S_IRUGO,
> amdgpu_get_busy_percent, NULL);
> +static DEVICE_ATTR(pcie_bw, S_IRUGO, amdgpu_get_pcie_bw, NULL);
>
>  static ssize_t amdgpu_hwmon_show_temp(struct device *dev,
>   struct device_attribute *attr, 
> @@ -2105,6 +2130,11 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
> "gpu_busy_level\n");
>

Re: System with kernel 4.21 hang immediately after start (invalid opcode: 0000 [#1] SMP NOPTI)

2019-01-07 Thread Michel Dänzer
On 2019-01-06 11:26 p.m., Mikhail Gavrilov wrote:
> On Sat, 5 Jan 2019 at 01:13, Mikhail Gavrilov
>  wrote:
>>
>> On Fri, 4 Jan 2019 at 23:02, Michel Dänzer  wrote:
>>>
>>> On 2019-01-04 4:24 p.m., Alex Deucher wrote:
 On Fri, Jan 4, 2019 at 9:52 AM Mikhail Gavrilov
  wrote:
>
> Hi folks, I found that kernel 4.21 is unworkable with AMD hardware.
> I also bisected the kernel and found the commit to blame.
> First bad commit: [674e78acae0dfb4beb56132e41cbae5b60f7d662]
> drm/amd/display: Add fast path for cursor plane updates

 IIRC, the issue only shows up with newer versions of the ddx.
>>>
>>> Wayland compositors seem to hit it as well, I think Mikhail is using
>>> GNOME on Wayland.
>>
>> Exactly!
> 
> Even after applying the patch
> https://cgit.freedesktop.org/~agd5f/linux/patch/?id=77acd1cd912987ffd62dad6a09275a1fb406f0c2
> the SMP NOPTI oops still occurs

Is this reproducible with commit 674e78acae0d ("drm/amd/display: Add
fast path for cursor plane updates") + the patch above?

If yes, is it not reproducible with commit fc42d47ce011 ("drm/amdgpu:
Enable GPU recovery by default for CI")? (Please test long enough to be
at least 99% sure before saying it's not)


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 1/2] drm/amd: validate user pitch alignment

2019-01-07 Thread Michel Dänzer
On 2019-01-07 5:00 a.m., Yu Zhao wrote:
> On Thu, Jan 03, 2019 at 05:33:19PM +0100, Michel Dänzer wrote:
>> On 2018-12-30 2:00 a.m., Yu Zhao wrote:
>>> Userspace may request a pitch alignment that is not supported by the GPU.
>>> Some userspace requests 32, but the GPU ignores it and uses the default 64
>>> when cpp is 4. If the GEM object is allocated based on the smaller
>>> alignment, GPU DMA will go out of bounds.
>>>
>>> Cc: sta...@vger.kernel.org # v4.2+
>>> Signed-off-by: Yu Zhao 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 16 
>>>  1 file changed, 16 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> index 15ce7e681d67..16af80ccd0a0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> @@ -527,6 +527,22 @@ amdgpu_display_user_framebuffer_create(struct 
>>> drm_device *dev,
>>> struct drm_gem_object *obj;
>>> struct amdgpu_framebuffer *amdgpu_fb;
>>> int ret;
>>> +   struct amdgpu_device *adev = dev->dev_private;
>>> +   int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
>>> +   int pitch = mode_cmd->pitches[0] / cpp;
>>> +
>>> +   if (pitch < mode_cmd->width) {
>>> +   DRM_DEBUG_KMS("Expecting pitch(%d)/cpp(%d) >= width(%d)\n",
>>> + mode_cmd->pitches[0], cpp, mode_cmd->width);
>>> +   return ERR_PTR(-EINVAL);
>>> +   }
>>
>> This should possibly be tested in drm_internal_framebuffer_create instead.
> 
> It seems to me drm_format_info_min_pitch() in framebuffer_check()
> already does it. We could just remove the check here?

Yeah, looks like that should be fine.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v5 1/2] drm/amd: validate user pitch alignment

2019-01-07 Thread Yu Zhao
On Thu, Jan 03, 2019 at 05:33:19PM +0100, Michel Dänzer wrote:
> On 2018-12-30 2:00 a.m., Yu Zhao wrote:
> > Userspace may request a pitch alignment that is not supported by the GPU.
> > Some userspace requests 32, but the GPU ignores it and uses the default 64
> > when cpp is 4. If the GEM object is allocated based on the smaller
> > alignment, GPU DMA will go out of bounds.
> > 
> > Cc: sta...@vger.kernel.org # v4.2+
> > Signed-off-by: Yu Zhao 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> > index 15ce7e681d67..16af80ccd0a0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> > @@ -527,6 +527,22 @@ amdgpu_display_user_framebuffer_create(struct 
> > drm_device *dev,
> > struct drm_gem_object *obj;
> > struct amdgpu_framebuffer *amdgpu_fb;
> > int ret;
> > +   struct amdgpu_device *adev = dev->dev_private;
> > +   int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
> > +   int pitch = mode_cmd->pitches[0] / cpp;
> > +
> > +   if (pitch < mode_cmd->width) {
> > +   DRM_DEBUG_KMS("Expecting pitch(%d)/cpp(%d) >= width(%d)\n",
> > + mode_cmd->pitches[0], cpp, mode_cmd->width);
> > +   return ERR_PTR(-EINVAL);
> > +   }
> 
> This should possibly be tested in drm_internal_framebuffer_create instead.

It seems to me drm_format_info_min_pitch() in framebuffer_check()
already does it. We could just remove the check here?
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx