RE: [PATCH 1/2] drm/amdgpu: add a spinlock to wb allocation

2024-04-22 Thread Liu, Shaoyun
[AMD Official Use Only - General]

These two patches look good to me.

Reviewed-by: Shaoyun Liu

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Monday, April 22, 2024 10:38 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH 1/2] drm/amdgpu: add a spinlock to wb allocation

As we use wb slots more dynamically, we need to lock access to avoid racing on 
allocation or free.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index cac0ca64367b..f87d53e183c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -502,6 +502,7 @@ struct amdgpu_wb {
	uint64_t		gpu_addr;
	u32			num_wb;	/* Number of wb slots actually reserved for amdgpu. */
	unsigned long		used[DIV_ROUND_UP(AMDGPU_MAX_WB, BITS_PER_LONG)];
+	spinlock_t		lock;
 };

 int amdgpu_device_wb_get(struct amdgpu_device *adev, u32 *wb);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f8a34db5d9e3..869256394136 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1482,13 +1482,17 @@ static int amdgpu_device_wb_init(struct amdgpu_device *adev)
  */
 int amdgpu_device_wb_get(struct amdgpu_device *adev, u32 *wb)
 {
-   unsigned long offset = find_first_zero_bit(adev->wb.used, adev->wb.num_wb);
+   unsigned long flags, offset;

+   spin_lock_irqsave(&adev->wb.lock, flags);
+   offset = find_first_zero_bit(adev->wb.used, adev->wb.num_wb);
	if (offset < adev->wb.num_wb) {
		__set_bit(offset, adev->wb.used);
+		spin_unlock_irqrestore(&adev->wb.lock, flags);
		*wb = offset << 3; /* convert to dw offset */
		return 0;
	} else {
+		spin_unlock_irqrestore(&adev->wb.lock, flags);
		return -EINVAL;
	}
 }
@@ -1503,9 +1507,13 @@ int amdgpu_device_wb_get(struct amdgpu_device *adev, u32 *wb)
  */
 void amdgpu_device_wb_free(struct amdgpu_device *adev, u32 wb)
 {
+   unsigned long flags;
+
	wb >>= 3;
+   spin_lock_irqsave(&adev->wb.lock, flags);
	if (wb < adev->wb.num_wb)
		__clear_bit(wb, adev->wb.used);
+   spin_unlock_irqrestore(&adev->wb.lock, flags);
 }

 /**
@@ -4061,6 +4069,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
	spin_lock_init(&adev->se_cac_idx_lock);
	spin_lock_init(&adev->audio_endpt_idx_lock);
	spin_lock_init(&adev->mm_stats.lock);
+	spin_lock_init(&adev->wb.lock);

	INIT_LIST_HEAD(&adev->shadow_list);
	mutex_init(&adev->shadow_list_lock);
--
2.44.0
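
A quick illustration of the race the new spinlock closes, plus the usual
caller pattern (an editor's sketch for review context, not part of the patch):

	/*
	 * Without adev->wb.lock, two threads can both observe the same
	 * free slot before either marks it used:
	 *
	 *   Thread A                      Thread B
	 *   find_first_zero_bit() -> 5
	 *                                 find_first_zero_bit() -> 5
	 *   __set_bit(5)
	 *                                 __set_bit(5)  <- double allocation
	 *
	 * With the lock held, the find/set pair is atomic.  Callers are
	 * unchanged:
	 */
	u32 wb;

	if (!amdgpu_device_wb_get(adev, &wb)) {
		/* ... use the dw-offset wb slot ... */
		amdgpu_device_wb_free(adev, wb);
	}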



RE: [PATCH 3/3] drm/amdgpu/mes11: make fence waits synchronous

2024-04-18 Thread Liu, Shaoyun
[AMD Official Use Only - General]

I think the only user of the MES is the kernel driver, and all submissions to 
MES need to be synced; the driver already has the mes->ring_lock to ensure 
that. I don't see why a separate fence per request is necessary.
MES does update the API status for each API call (it updates the fence seq at 
a specific address), but the driver can decide whether to check it. For now, 
MES doesn't generate an interrupt as a fence signal, but the driver side can 
poll the fence seq.
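
For reference, the polling described here is the pattern already used in
mes_v11_0_submit_pkt_and_poll_completion() (quoted in the patches below); a
minimal sketch:

	/* Submit under mes->ring_lock, then poll the fence seq instead of
	 * waiting for an interrupt from the firmware. */
	struct amdgpu_ring *ring = &mes->ring;
	signed long timeout = 300;	/* ~3000 ms, per the patch below */
	signed long r;

	r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq,
				      timeout);
	if (r < 1)
		dev_err(adev->dev, "MES did not update the API status in time\n");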

Regards
Shaoyun.liu
-Original Message-
From: Koenig, Christian 
Sent: Thursday, April 18, 2024 1:59 AM
To: Alex Deucher ; Liu, Shaoyun 
Cc: Chen, Horace ; amd-gfx@lists.freedesktop.org; Andrey 
Grodzovsky ; Kuehling, Felix 
; Deucher, Alexander ; Xiao, 
Jack ; Zhang, Hawking ; Liu, Monk 
; Xu, Feifei ; Chang, HaiJun 
; Leo Liu ; Liu, Jenny (Jing) 

Subject: Re: [PATCH 3/3] drm/amdgpu/mes11: make fence waits synchronous

Am 17.04.24 um 21:21 schrieb Alex Deucher:
> On Wed, Apr 17, 2024 at 3:17 PM Liu, Shaoyun  wrote:
>> [AMD Official Use Only - General]
>>
>> I had a discussion with Christian about this before.  The
>> conclusion is that the driver should prevent multiple processes from
>> using the MES ring at the same time.  Also, for the current MES ring
>> usage, the driver doesn't have the logic to prevent the ring from
>> being overflowed, and we don't hit the issue because MES will wait
>> polling for each MES submission.  If we want to change the MES to
>> work asynchronously, we need to consider a way to avoid this (similar
>> to adding the limit in the fence handling we use for kiq and HMM paging).
>>
> I think we need a separate fence (different GPU address and seq
> number) per request.  Then each caller can wait independently.

Well no, we need to modify the MES firmware to stop abusing the fence as a 
signaling mechanism for the result of an operation.

I've pointed that out before and I think this is a hard requirement for correct 
operation.

In addition to that, retrying on the reset flag looks like another broken 
workaround to me.

So just to make it clear this approach is a NAK from my side, don't commit that.

Regards,
Christian.

>
> Alex
>
>> Regards
>> Shaoyun.liu
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of
>> Christian König
>> Sent: Wednesday, April 17, 2024 8:49 AM
>> To: Chen, Horace ; amd-gfx@lists.freedesktop.org
>> Cc: Andrey Grodzovsky ; Kuehling, Felix
>> ; Deucher, Alexander
>> ; Xiao, Jack ; Zhang,
>> Hawking ; Liu, Monk ; Xu,
>> Feifei ; Chang, HaiJun ; Leo
>> Liu ; Liu, Jenny (Jing) 
>> Subject: Re: [PATCH 3/3] drm/amdgpu/mes11: make fence waits
>> synchronous
>>
>> Am 17.04.24 um 13:30 schrieb Horace Chen:
>>> The MES firmware expects synchronous operation with the driver.  For
>>> this to work asynchronously, each caller would need to provide its
>>> own fence location and sequence number.
>> Well, that's certainly not correct. The seqno ensures that we can wait
>> asynchronously for the submission to complete.
>>
>> So clear NAK for that patch here.
>>
>> Regards,
>> Christian.
>>
>>> For now, add a mutex lock to serialize the MES submission.
>>> For the SR-IOV long-wait case, break the long wait into separate parts
>>> to prevent this wait from impacting the reset sequence.
>>>
>>> Signed-off-by: Horace Chen 
>>> ---
>>>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  3 +++
>>>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  1 +
>>>drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 18 ++
>>>3 files changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>>> index 78e4f88f5134..8896be95b2c8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>>> @@ -137,6 +137,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
>>>spin_lock_init(&adev->mes.queue_id_lock);
>>>spin_lock_init(&adev->mes.ring_lock);
>>>mutex_init(&adev->mes.mutex_hidden);
>>> + mutex_init(&adev->mes.submission_lock);
>>>
>>>adev->mes.total_max_queue = AMDGPU_FENCE_MES_QUEUE_ID_MASK;
>>>adev->mes.vmid_mask_mmhub = 0xff00;
>>> @@ -221,6 +222,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
>>>idr_destroy(&adev->mes.queue_id_idr);
>>>ida_destroy(&adev->mes.doorbell_ida);
>>>mutex_destroy(&adev->mes.mutex_hidden);
>>> + mutex_destroy(&adev->mes.submission_lock);
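
For context, the per-request fence Alex suggests above would look roughly
like the sketch below (hypothetical struct and helper names; the real MES
API status layout is owned by the firmware interface):

	/* Sketch: each caller gets its own fence location and seq, so
	 * waits become independent instead of serialized on one seqno. */
	struct mes_request {
		u64	fence_gpu_addr;	/* per-caller fence location */
		u64	seq;		/* per-caller sequence number */
		u64	*fence_cpu_ptr;	/* CPU view of the same location */
	};

	static void mes_request_wait(struct mes_request *req)
	{
		while (READ_ONCE(*req->fence_cpu_ptr) < req->seq)
			udelay(10);	/* poll; no interrupt from MES today */
	}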

RE: [PATCH 1/3] drm/amdgpu/mes11: print MES opcodes rather than numbers

2024-04-17 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Looks good to me.
Reviewed-by: Shaoyun Liu <shaoyun@amd.com>

-Original Message-
From: amd-gfx  On Behalf Of Horace Chen
Sent: Wednesday, April 17, 2024 7:30 AM
To: amd-gfx@lists.freedesktop.org
Cc: Andrey Grodzovsky ; Kuehling, Felix 
; Chen, Horace ; Koenig, Christian 
; Deucher, Alexander ; 
Xiao, Jack ; Zhang, Hawking ; Liu, 
Monk ; Xu, Feifei ; Chang, HaiJun 
; Leo Liu ; Liu, Jenny (Jing) 
; Deucher, Alexander 
Subject: [PATCH 1/3] drm/amdgpu/mes11: print MES opcodes rather than numbers

From: Alex Deucher 

Makes it easier to review the logs when there are MES errors.

v2: use dbg for emitted, add helpers for fetching strings

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 78 --
 1 file changed, 74 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 81833395324a..784343fb7470 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -100,18 +100,72 @@ static const struct amdgpu_ring_funcs mes_v11_0_ring_funcs = {
.insert_nop = amdgpu_ring_insert_nop,
 };

+static const char *mes_v11_0_opcodes[] = {
+   "MES_SCH_API_SET_HW_RSRC",
+   "MES_SCH_API_SET_SCHEDULING_CONFIG",
+   "MES_SCH_API_ADD_QUEUE",
+   "MES_SCH_API_REMOVE_QUEUE",
+   "MES_SCH_API_PERFORM_YIELD",
+   "MES_SCH_API_SET_GANG_PRIORITY_LEVEL",
+   "MES_SCH_API_SUSPEND",
+   "MES_SCH_API_RESUME",
+   "MES_SCH_API_RESET",
+   "MES_SCH_API_SET_LOG_BUFFER",
+   "MES_SCH_API_CHANGE_GANG_PRORITY",
+   "MES_SCH_API_QUERY_SCHEDULER_STATUS",
+   "MES_SCH_API_PROGRAM_GDS",
+   "MES_SCH_API_SET_DEBUG_VMID",
+   "MES_SCH_API_MISC",
+   "MES_SCH_API_UPDATE_ROOT_PAGE_TABLE",
+   "MES_SCH_API_AMD_LOG",
+};
+
+static const char *mes_v11_0_misc_opcodes[] = {
+   "MESAPI_MISC__WRITE_REG",
+   "MESAPI_MISC__INV_GART",
+   "MESAPI_MISC__QUERY_STATUS",
+   "MESAPI_MISC__READ_REG",
+   "MESAPI_MISC__WAIT_REG_MEM",
+   "MESAPI_MISC__SET_SHADER_DEBUGGER",
+};
+
+static const char *mes_v11_0_get_op_string(union MESAPI__MISC *x_pkt)
+{
+   const char *op_str = NULL;
+
+   if (x_pkt->header.opcode < ARRAY_SIZE(mes_v11_0_opcodes))
+   op_str = mes_v11_0_opcodes[x_pkt->header.opcode];
+
+   return op_str;
+}
+
+static const char *mes_v11_0_get_misc_op_string(union MESAPI__MISC *x_pkt)
+{
+   const char *op_str = NULL;
+
+   if ((x_pkt->header.opcode == MES_SCH_API_MISC) &&
+   (x_pkt->opcode < ARRAY_SIZE(mes_v11_0_misc_opcodes)))
+   op_str = mes_v11_0_misc_opcodes[x_pkt->opcode];
+
+   return op_str;
+}
+
 static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
void *pkt, int size,
int api_status_off)
 {
int ndw = size / 4;
signed long r;
-   union MESAPI__ADD_QUEUE *x_pkt = pkt;
+   union MESAPI__MISC *x_pkt = pkt;
struct MES_API_STATUS *api_status;
struct amdgpu_device *adev = mes->adev;
	struct amdgpu_ring *ring = &mes->ring;
unsigned long flags;
signed long timeout = 300; /* 3000 ms */
+   const char *op_str, *misc_op_str;
+
+   if (x_pkt->header.opcode >= MES_SCH_API_MAX)
+   return -EINVAL;

if (amdgpu_emu_mode) {
timeout *= 100;
@@ -135,13 +189,29 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
	amdgpu_ring_commit(ring);
	spin_unlock_irqrestore(&mes->ring_lock, flags);

-   DRM_DEBUG("MES msg=%d was emitted\n", x_pkt->header.opcode);
+   op_str = mes_v11_0_get_op_string(x_pkt);
+   misc_op_str = mes_v11_0_get_misc_op_string(x_pkt);
+
+   if (misc_op_str)
+   dev_dbg(adev->dev, "MES msg=%s (%s) was emitted\n", op_str, misc_op_str);
+   else if (op_str)
+   dev_dbg(adev->dev, "MES msg=%s was emitted\n", op_str);
+   else
+   dev_dbg(adev->dev, "MES msg=%d was emitted\n", x_pkt->header.opcode);

r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq,
  timeout);
	if (r < 1) {
-   DRM_ERROR("MES failed to response msg=%d\n",
- x_pkt->header.opcode);
+
+   if (misc_op_str)
+   dev_err(adev->dev, "MES failed to respond to msg=%s (%s)\n",
+   op_str, misc_op_str);
+   else if (op_str)
+   dev_err(adev->dev, "MES failed to respond to msg=%s\n",
+   op_str);
+   else
+   dev_err(adev->dev, "MES failed to respond to msg=%d\n",
+   x_pkt->header.opcode);

while (halt_if_hws_hang)
schedule();
--
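
One hardening worth noting for string tables like mes_v11_0_opcodes above:
with designated initializers the strings cannot silently shift if the
MES_SCH_API_* enum ever grows a hole (a sketch, assuming those enum names
from the MES API header; not part of the patch):

	#define MES_OP_STR(x) [x] = #x

	static const char *mes_v11_0_opcodes[] = {
		MES_OP_STR(MES_SCH_API_SET_HW_RSRC),
		MES_OP_STR(MES_SCH_API_SET_SCHEDULING_CONFIG),
		MES_OP_STR(MES_SCH_API_ADD_QUEUE),
		/* ... one entry per opcode ... */
	};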

RE: [PATCH 3/3] drm/amdgpu/mes11: make fence waits synchronous

2024-04-17 Thread Liu, Shaoyun
[AMD Official Use Only - General]

I had a discussion with Christian about this before.  The conclusion is that 
the driver should prevent multiple processes from using the MES ring at the 
same time.  Also, for the current MES ring usage, the driver doesn't have the 
logic to prevent the ring from being overflowed, and we don't hit the issue 
because MES will wait polling for each MES submission.  If we want to change 
the MES to work asynchronously, we need to consider a way to avoid this 
(similar to adding the limit in the fence handling we use for kiq and HMM 
paging).

Regards
Shaoyun.liu
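
The overflow guard alluded to above (as in the kiq fence handling) amounts to
bounding the number of unsignaled submissions before allocating ring space; a
rough sketch with an illustrative limit and a hypothetical helper name:

	/* Sketch only: refuse to submit while too many earlier MES
	 * submissions are still unsignaled, so the ring cannot wrap over
	 * packets the firmware has not consumed yet. */
	if (ring->fence_drv.sync_seq - last_signaled_seq(ring) >=
	    MAX_INFLIGHT_MES_SUBMISSIONS)
		return -EBUSY;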

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Wednesday, April 17, 2024 8:49 AM
To: Chen, Horace ; amd-gfx@lists.freedesktop.org
Cc: Andrey Grodzovsky ; Kuehling, Felix 
; Deucher, Alexander ; Xiao, 
Jack ; Zhang, Hawking ; Liu, Monk 
; Xu, Feifei ; Chang, HaiJun 
; Leo Liu ; Liu, Jenny (Jing) 

Subject: Re: [PATCH 3/3] drm/amdgpu/mes11: make fence waits synchronous

Am 17.04.24 um 13:30 schrieb Horace Chen:
> The MES firmware expects synchronous operation with the driver.  For
> this to work asynchronously, each caller would need to provide its own
> fence location and sequence number.

Well, that's certainly not correct. The seqno ensures that we can wait 
asynchronously for the submission to complete.

So clear NAK for that patch here.

Regards,
Christian.

>
> For now, add a mutex lock to serialize the MES submission.
> For the SR-IOV long-wait case, break the long wait into separate parts
> to prevent this wait from impacting the reset sequence.
>
> Signed-off-by: Horace Chen 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  3 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 18 ++
>   3 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index 78e4f88f5134..8896be95b2c8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -137,6 +137,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
>   spin_lock_init(&adev->mes.queue_id_lock);
>   spin_lock_init(&adev->mes.ring_lock);
>   mutex_init(&adev->mes.mutex_hidden);
> + mutex_init(&adev->mes.submission_lock);
>
>   adev->mes.total_max_queue = AMDGPU_FENCE_MES_QUEUE_ID_MASK;
>   adev->mes.vmid_mask_mmhub = 0xff00;
> @@ -221,6 +222,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
>   idr_destroy(&adev->mes.queue_id_idr);
>   ida_destroy(&adev->mes.doorbell_ida);
>   mutex_destroy(&adev->mes.mutex_hidden);
> + mutex_destroy(&adev->mes.submission_lock);
>   return r;
>   }
>
> @@ -240,6 +242,7 @@ void amdgpu_mes_fini(struct amdgpu_device *adev)
>   idr_destroy(&adev->mes.queue_id_idr);
>   ida_destroy(&adev->mes.doorbell_ida);
>   mutex_destroy(&adev->mes.mutex_hidden);
> + mutex_destroy(&adev->mes.submission_lock);
>   }
>
>   static void amdgpu_mes_queue_free_mqd(struct amdgpu_mes_queue *q)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index 6b3e1844eac5..90af935cc889 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -85,6 +85,7 @@ struct amdgpu_mes {
>
>   struct amdgpu_ring  ring;
>   spinlock_t  ring_lock;
> + struct mutexsubmission_lock;
>
>   const struct firmware   *fw[AMDGPU_MAX_MES_PIPES];
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> index e40d00afd4f5..0a609a5b8835 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> @@ -162,6 +162,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
>   struct amdgpu_ring *ring = &mes->ring;
>   unsigned long flags;
>   signed long timeout = adev->usec_timeout;
> + signed long retry_count = 1;
>   const char *op_str, *misc_op_str;
>
>   if (x_pkt->header.opcode >= MES_SCH_API_MAX)
> @@ -169,15 +170,19 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
>
>   if (amdgpu_emu_mode) {
>   timeout *= 100;
> - } else if (amdgpu_sriov_vf(adev)) {
> + }
> +
> + if (amdgpu_sriov_vf(adev) && timeout > 0) {
>   /* Worst case in sriov where all other 15 VF timeout, each VF needs about 600ms */
> - timeout = 15 * 600 * 1000;
> + retry_count = (15 * 600 * 1000) / timeout;
>   }
>   BUG_ON(size % 4 != 0);
>
> + mutex_lock(&mes->submission_lock);
>   spin_lock_irqsave(&mes->ring_lock, flags);
>   if (amdgpu_ring_alloc(ring, ndw)) {
>   spin_unlock_irqrestore(&mes->ring_lock, flags);
> + mutex_unlock(&mes->submission_lock);
>   return -ENOMEM;
>   }
>
> @@ -199,8 +204,13 @@ static int 
> mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes 

RE: [PATCH] drm/amdgpu/mes11: print MES opcodes rather than numbers

2024-04-01 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Comments inline

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Saturday, March 30, 2024 10:01 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu/mes11: print MES opcodes rather than numbers

Makes it easier to review the logs when there are MES errors.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 65 --
 1 file changed, 61 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 072c478665ade..73a4bb0f5ba0f 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -100,19 +100,51 @@ static const struct amdgpu_ring_funcs mes_v11_0_ring_funcs = {
.insert_nop = amdgpu_ring_insert_nop,
 };

+static const char *mes_v11_0_opcodes[] = {
+   "MES_SCH_API_SET_HW_RSRC",
+   "MES_SCH_API_SET_SCHEDULING_CONFIG",
+   "MES_SCH_API_ADD_QUEUE",
+   "MES_SCH_API_REMOVE_QUEUE",
+   "MES_SCH_API_PERFORM_YIELD",
+   "MES_SCH_API_SET_GANG_PRIORITY_LEVEL",
+   "MES_SCH_API_SUSPEND",
+   "MES_SCH_API_RESUME",
+   "MES_SCH_API_RESET",
+   "MES_SCH_API_SET_LOG_BUFFER",
+   "MES_SCH_API_CHANGE_GANG_PRORITY",
+   "MES_SCH_API_QUERY_SCHEDULER_STATUS",
+   "MES_SCH_API_PROGRAM_GDS",
+   "MES_SCH_API_SET_DEBUG_VMID",
+   "MES_SCH_API_MISC",
+   "MES_SCH_API_UPDATE_ROOT_PAGE_TABLE",
+   "MES_SCH_API_AMD_LOG",
+};
+
+static const char *mes_v11_0_misc_opcodes[] = {
+   "MESAPI_MISC__WRITE_REG",
+   "MESAPI_MISC__INV_GART",
+   "MESAPI_MISC__QUERY_STATUS",
+   "MESAPI_MISC__READ_REG",
+   "MESAPI_MISC__WAIT_REG_MEM",
+   "MESAPI_MISC__SET_SHADER_DEBUGGER",
+};
+
 static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
void *pkt, int size,
int api_status_off)
 {
int ndw = size / 4;
signed long r;
-   union MESAPI__ADD_QUEUE *x_pkt = pkt;
+   union MESAPI__MISC *x_pkt = pkt;
struct MES_API_STATUS *api_status;
struct amdgpu_device *adev = mes->adev;
	struct amdgpu_ring *ring = &mes->ring;
unsigned long flags;
signed long timeout = adev->usec_timeout;

+   if (x_pkt->header.opcode >= MES_SCH_API_MAX)
+   return -EINVAL;
+
if (amdgpu_emu_mode) {
timeout *= 100;
} else if (amdgpu_sriov_vf(adev)) {
@@ -135,13 +167,38 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
	amdgpu_ring_commit(ring);
	spin_unlock_irqrestore(&mes->ring_lock, flags);
-   DRM_DEBUG("MES msg=%d was emitted\n", x_pkt->header.opcode);
+   if (x_pkt->header.opcode == MES_SCH_API_MISC) {
+   if (x_pkt->opcode < ARRAY_SIZE(mes_v11_0_misc_opcodes))
+   dev_err(adev->dev, "MES msg=%s (%s) was emitted\n",

[shaoyunl]  Shouldn't we use DRM_DEBUG for the valid condition?

Regards
Shaoyun.liu

+   mes_v11_0_opcodes[x_pkt->header.opcode],
+   mes_v11_0_misc_opcodes[x_pkt->opcode]);
+   else
+   dev_err(adev->dev, "MES msg=%s (%d) was emitted\n",
+   mes_v11_0_opcodes[x_pkt->header.opcode],
+   x_pkt->opcode);
+   } else if (x_pkt->header.opcode < ARRAY_SIZE(mes_v11_0_opcodes))
+   dev_err(adev->dev, "MES msg=%s was emitted\n",
+   mes_v11_0_opcodes[x_pkt->header.opcode]);
+   else
+   dev_err(adev->dev, "MES msg=%d was emitted\n", x_pkt->header.opcode);

r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq,
  timeout);
if (r < 1) {
-   DRM_ERROR("MES failed to response msg=%d\n",
- x_pkt->header.opcode);
+   if (x_pkt->header.opcode == MES_SCH_API_MISC) {
+   if (x_pkt->opcode < ARRAY_SIZE(mes_v11_0_misc_opcodes))
+   dev_err(adev->dev, "MES failed to response msg=%s (%s)\n",
+   mes_v11_0_opcodes[x_pkt->header.opcode],
+   mes_v11_0_misc_opcodes[x_pkt->opcode]);
+   else
+   dev_err(adev->dev, "MES failed to response msg=%s (%d)\n",
+   mes_v11_0_opcodes[x_pkt->header.opcode], x_pkt->opcode);
+   } else if (x_pkt->header.opcode < ARRAY_SIZE(mes_v11_0_opcodes))
+   dev_err(adev->dev, "MES failed to response msg=%s\n",
+   mes_v11_0_opcodes[x_pkt->header.opcode]);
+   else
+   dev_err(adev->dev, "MES failed to response msg=%d\n",
+   x_pkt->header.opcode);

RE: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log feature

2024-03-26 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Thanks, and your suggestion sounds like a good idea; sometimes we might just 
want to enable the log only when we want to run something specific. I think 
what we need is an API the driver can use to tell MES to enable it at 
runtime. I will think about it and check with the MES engineers.

Regards
Shaoyun.liu

-Original Message-
From: Alex Deucher 
Sent: Tuesday, March 26, 2024 12:50 PM
To: Liu, Shaoyun 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log feature

On Tue, Mar 26, 2024 at 11:51 AM Liu, Shaoyun  wrote:
>
> [AMD Official Use Only - General]
>
>
> ping

Maybe we'd want to make this something we could dynamically enable via debugfs? 
 Not sure how much of a pain it would be to change this at runtime.  Something 
we can think about for the future.
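
A runtime knob like that could be exposed with very little code; the open
question is telling the MES firmware about the change after SET_HW_RESOURCES
has already been sent. A sketch of the debugfs side only (variable name is
illustrative):

	static bool mes_log_runtime_enable;

	/* Writable toggle under the device's debugfs root; flipping it
	 * would still need a new MES API call to take effect in firmware. */
	debugfs_create_bool("amdgpu_mes_log_enable", 0644, root,
			    &mes_log_runtime_enable);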

Reviewed-by: Alex Deucher 

>
>
>
> From: amd-gfx  On Behalf Of
> Liu, Shaoyun
> Sent: Monday, March 25, 2024 8:51 AM
> To: amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu : Add mes_log_enable to control mes
> log feature
>
>
>
> [AMD Official Use Only - General]
>
> Ping
>
>
>
>
> 
>
> From: Liu, Shaoyun 
> Sent: Friday, March 22, 2024 2:00:21 PM
> To: amd-gfx@lists.freedesktop.org 
> Cc: Liu, Shaoyun 
> Subject: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log
> feature
>
>
>
> The MES log might slow down performance due to the extra step of
> logging the data. Disable it by default and introduce a parameter
> that can enable it when necessary.
>
> Signed-off-by: shaoyunl 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  5 -
> drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  7 +--
>  4 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 9c62552bec34..b3b84647207e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -210,6 +210,7 @@ extern int amdgpu_async_gfx_ring;  extern int
> amdgpu_mcbp;  extern int amdgpu_discovery;  extern int amdgpu_mes;
> +extern int amdgpu_mes_log_enable;
>  extern int amdgpu_mes_kiq;
>  extern int amdgpu_noretry;
>  extern int amdgpu_force_asic_type;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 80b9642f2bc4..e4277298cf1a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -195,6 +195,7 @@ int amdgpu_async_gfx_ring = 1;  int amdgpu_mcbp =
> -1;  int amdgpu_discovery = -1;  int amdgpu_mes;
> +int amdgpu_mes_log_enable = 0;
>  int amdgpu_mes_kiq;
>  int amdgpu_noretry = -1;
>  int amdgpu_force_asic_type = -1;
> @@ -667,6 +668,15 @@ MODULE_PARM_DESC(mes,
>  "Enable Micro Engine Scheduler (0 = disabled (default), 1 =
> enabled)");  module_param_named(mes, amdgpu_mes, int, 0444);
>
> +/**
> + * DOC: mes_log_enable (int)
> + * Enable Micro Engine Scheduler log. This is used to enable/disable MES 
> internal log.
> + * (0 = disabled (default), 1 = enabled)  */
> +MODULE_PARM_DESC(mes_log_enable,
> +   "Enable Micro Engine Scheduler log (0 = disabled (default), 1
> += enabled)"); module_param_named(mes_log_enable,
> +amdgpu_mes_log_enable, int, 0444);
> +
>  /**
>   * DOC: mes_kiq (int)
>   * Enable Micro Engine Scheduler KIQ. This is a new engine pipe for kiq.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index 78dfd027dc99..9ace848e174c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -100,6 +100,9 @@ static int amdgpu_mes_event_log_init(struct
> amdgpu_device *adev)  {
>  int r;
>
> +   if (!amdgpu_mes_log_enable)
> +   return 0;
> +
>  r = amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
>  AMDGPU_GEM_DOMAIN_GTT,
>  &adev->mes.event_log_gpu_obj,
> @@ -1561,7 +1564,7 @@ void amdgpu_debugfs_mes_event_log_init(struct amdgpu_device *adev)
>  #if defined(CONFIG_DEBUG_FS)
>  struct drm_minor *minor = adev_to_drm(adev)->primary;
>  struct dentry *root = minor->debugfs_root;
> -   if (adev->enable_mes)
> +   if (adev->enable_mes && amdgpu_mes_log_enable)
>  debugfs_create_file("amdgpu_mes_event_log", 0444, root,

RE: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES FW version

2024-03-26 Thread Liu, Shaoyun
[AMD Official Use Only - General]

That requires extra work in MES and an API-level change to let the driver 
send this info to MES. I think that's unnecessarily complicated.
The original problem is that the MES firmware doesn't encapsulate its API 
defines well enough; the Windows driver directly uses MES internal structures 
to calculate the buffer size.
I already pushed the MES team to have all the necessary info, including this 
log buffer size, defined in mes_api_def.h.  We also agreed that the maximum 
log buffer size won't exceed 0x4000 in the near future.  This will happen in 
a new MES release and may take some time for the driver side to pick up, but 
before then I'd like to have a solution that can fix the issue ASAP.
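
Once the size is exported from mes_api_def.h, the driver-side allocation can
follow it directly; a sketch (MES_API_LOG_BUFFER_SIZE is an assumed name for
the not-yet-published define):

	/* Allocate what the firmware expects, capped at the agreed 0x4000
	 * maximum carried in AMDGPU_MES_LOG_BUFFER_SIZE. */
	u32 log_size = min_t(u32, MES_API_LOG_BUFFER_SIZE,
			     AMDGPU_MES_LOG_BUFFER_SIZE);

	r = amdgpu_bo_create_kernel(adev, log_size, PAGE_SIZE,
				    AMDGPU_GEM_DOMAIN_GTT,
				    &adev->mes.event_log_gpu_obj,
				    &adev->mes.event_log_gpu_addr,
				    &adev->mes.event_log_cpu_addr);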

Regards
Shaoyun.liu

-Original Message-
From: Kuehling, Felix 
Sent: Tuesday, March 26, 2024 2:07 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new 
MES FW version


On 2024-03-25 19:33, Liu, Shaoyun wrote:
> [AMD Official Use Only - General]
>
> It can cause a page fault when the log size exceeds the page size.

I'd consider that a breaking change in the firmware that should be avoided. Is 
there a way the updated driver can tell the FW the log size that it allocated, 
so that old drivers continue to work with new firmware?

Regards,
   Felix


>
> -Original Message-
> From: Kuehling, Felix 
> Sent: Monday, March 25, 2024 2:58 PM
> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as
> per new MES FW version
>
>
> On 2024-03-22 12:49, shaoyunl wrote:
>>   From MES version 0x54, the log entries increased and require the log
>> buffer size to be increased. 16k is the maximum size agreed.
> What happens when you run the new firmware on an old kernel that only 
> allocates 4KB?
>
> Regards,
> Felix
>
>
>> Signed-off-by: shaoyunl 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 ++---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 1 +
>>2 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>> index 9ace848e174c..78e4f88f5134 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
>> @@ -103,7 +103,7 @@ static int amdgpu_mes_event_log_init(struct 
>> amdgpu_device *adev)
>>if (!amdgpu_mes_log_enable)
>>return 0;
>>
>> - r = amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
>> + r = amdgpu_bo_create_kernel(adev, AMDGPU_MES_LOG_BUFFER_SIZE, PAGE_SIZE,
>>AMDGPU_GEM_DOMAIN_GTT,
>>&adev->mes.event_log_gpu_obj,
>>&adev->mes.event_log_gpu_addr,
>> @@ -1548,12 +1548,11 @@ static int amdgpu_debugfs_mes_event_log_show(struct seq_file *m, void *unused)
>>uint32_t *mem = (uint32_t *)(adev->mes.event_log_cpu_addr);
>>
>>seq_hex_dump(m, "", DUMP_PREFIX_OFFSET, 32, 4,
>> -  mem, PAGE_SIZE, false);
>> +  mem, AMDGPU_MES_LOG_BUFFER_SIZE, false);
>>
>>return 0;
>>}
>>
>> -
>>DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_mes_event_log);
>>
>>#endif
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
>> index 7d4f93fea937..4c8fc3117ef8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
>> @@ -52,6 +52,7 @@ enum amdgpu_mes_priority_level {
>>
>>#define AMDGPU_MES_PROC_CTX_SIZE 0x1000 /* one page area */
>>#define AMDGPU_MES_GANG_CTX_SIZE 0x1000 /* one page area */
>> +#define AMDGPU_MES_LOG_BUFFER_SIZE 0x4000 /* Maximum log buffer size for MES */
>>
>>struct amdgpu_mes_funcs;
>>


RE: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log feature

2024-03-26 Thread Liu, Shaoyun
[AMD Official Use Only - General]

ping

From: amd-gfx  On Behalf Of Liu, Shaoyun
Sent: Monday, March 25, 2024 8:51 AM
To: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log feature


[AMD Official Use Only - General]

Ping


From: Liu, Shaoyun <shaoyun@amd.com>
Sent: Friday, March 22, 2024 2:00:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun <shaoyun@amd.com>
Subject: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log feature

The MES log might slow down performance due to the extra step of logging the
data. Disable it by default and introduce a parameter that can enable it when
necessary.

Signed-off-by: shaoyunl <shaoyun@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  5 -
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  7 +--
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c62552bec34..b3b84647207e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -210,6 +210,7 @@ extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
 extern int amdgpu_mes;
+extern int amdgpu_mes_log_enable;
 extern int amdgpu_mes_kiq;
 extern int amdgpu_noretry;
 extern int amdgpu_force_asic_type;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 80b9642f2bc4..e4277298cf1a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -195,6 +195,7 @@ int amdgpu_async_gfx_ring = 1;
 int amdgpu_mcbp = -1;
 int amdgpu_discovery = -1;
 int amdgpu_mes;
+int amdgpu_mes_log_enable = 0;
 int amdgpu_mes_kiq;
 int amdgpu_noretry = -1;
 int amdgpu_force_asic_type = -1;
@@ -667,6 +668,15 @@ MODULE_PARM_DESC(mes,
 "Enable Micro Engine Scheduler (0 = disabled (default), 1 = enabled)");
 module_param_named(mes, amdgpu_mes, int, 0444);

+/**
+ * DOC: mes_log_enable (int)
+ * Enable Micro Engine Scheduler log. This is used to enable/disable MES 
internal log.
+ * (0 = disabled (default), 1 = enabled)
+ */
+MODULE_PARM_DESC(mes_log_enable,
+   "Enable Micro Engine Scheduler log (0 = disabled (default), 1 = enabled)");
+module_param_named(mes_log_enable, amdgpu_mes_log_enable, int, 0444);
+
 /**
  * DOC: mes_kiq (int)
  * Enable Micro Engine Scheduler KIQ. This is a new engine pipe for kiq.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 78dfd027dc99..9ace848e174c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -100,6 +100,9 @@ static int amdgpu_mes_event_log_init(struct amdgpu_device 
*adev)
 {
 int r;

+   if (!amdgpu_mes_log_enable)
+   return 0;
+
 r = amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
 AMDGPU_GEM_DOMAIN_GTT,
 &adev->mes.event_log_gpu_obj,
@@ -1561,7 +1564,7 @@ void amdgpu_debugfs_mes_event_log_init(struct 
amdgpu_device *adev)
 #if defined(CONFIG_DEBUG_FS)
 struct drm_minor *minor = adev_to_drm(adev)->primary;
 struct dentry *root = minor->debugfs_root;
-   if (adev->enable_mes)
+   if (adev->enable_mes && amdgpu_mes_log_enable)
 debugfs_create_file("amdgpu_mes_event_log", 0444, root,
 adev, &amdgpu_debugfs_mes_event_log_fops);

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 072c478665ad..63f281a9984d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -411,8 +411,11 @@ static int mes_v11_0_set_hw_resources(struct amdgpu_mes 
*mes)
 mes_set_hw_res_pkt.enable_reg_active_poll = 1;
 mes_set_hw_res_pkt.enable_level_process_quantum_check = 1;
 mes_set_hw_res_pkt.oversubscription_timer = 50;
-   mes_set_hw_res_pkt.enable_mes_event_int_logging = 1;
-   mes_set_hw_res_pkt.event_intr_history_gpu_mc_ptr = mes->event_log_gpu_addr;
+   if (amdgpu_mes_log_enable) {
+   mes_set_hw_res_pkt.enable_mes_event_int_logging = 1;
+   mes_set_hw_res_pkt.event_intr_history_gpu_mc_ptr =
+   mes->event_log_gpu_addr;
+   }

 return mes_v11_0_submit_pkt_and_poll_completion(mes,
 &mes_set_hw_res_pkt, sizeof(mes_set_hw_res_pkt),
--
2.34.1


RE: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES FW version

2024-03-25 Thread Liu, Shaoyun
[AMD Official Use Only - General]

It can cause a page fault when the log size exceeds the page size.
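
Concretely, the mismatch is between the old allocation and what newer
firmware writes (schematic, using the calls from the patch below):

	/* Old driver: a single 4 KiB page for the event log. */
	r = amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
				    AMDGPU_GEM_DOMAIN_GTT,
				    &adev->mes.event_log_gpu_obj,
				    &adev->mes.event_log_gpu_addr,
				    &adev->mes.event_log_cpu_addr);
	/* MES FW >= 0x54 logs up to 0x4000 bytes at event_log_gpu_addr,
	 * running past the single mapped page -> page fault. */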

-Original Message-
From: Kuehling, Felix 
Sent: Monday, March 25, 2024 2:58 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new 
MES FW version


On 2024-03-22 12:49, shaoyunl wrote:
>  From MES version 0x54, the log entries increased and require the log
> buffer size to be increased. 16k is the maximum size agreed.

What happens when you run the new firmware on an old kernel that only allocates 
4KB?

Regards,
   Felix


>
> Signed-off-by: shaoyunl 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 ++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 1 +
>   2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index 9ace848e174c..78e4f88f5134 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -103,7 +103,7 @@ static int amdgpu_mes_event_log_init(struct amdgpu_device 
> *adev)
>   if (!amdgpu_mes_log_enable)
>   return 0;
>
> - r = amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
> + r = amdgpu_bo_create_kernel(adev, AMDGPU_MES_LOG_BUFFER_SIZE, PAGE_SIZE,
>   AMDGPU_GEM_DOMAIN_GTT,
>   &adev->mes.event_log_gpu_obj,
>   &adev->mes.event_log_gpu_addr,
> @@ -1548,12 +1548,11 @@ static int amdgpu_debugfs_mes_event_log_show(struct seq_file *m, void *unused)
>   uint32_t *mem = (uint32_t *)(adev->mes.event_log_cpu_addr);
>
>   seq_hex_dump(m, "", DUMP_PREFIX_OFFSET, 32, 4,
> -  mem, PAGE_SIZE, false);
> +  mem, AMDGPU_MES_LOG_BUFFER_SIZE, false);
>
>   return 0;
>   }
>
> -
>   DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_mes_event_log);
>
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index 7d4f93fea937..4c8fc3117ef8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -52,6 +52,7 @@ enum amdgpu_mes_priority_level {
>
>   #define AMDGPU_MES_PROC_CTX_SIZE 0x1000 /* one page area */
>   #define AMDGPU_MES_GANG_CTX_SIZE 0x1000 /* one page area */
> +#define AMDGPU_MES_LOG_BUFFER_SIZE 0x4000 /* Maximum log buffer size for MES */
>
>   struct amdgpu_mes_funcs;
>


Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES FW version

2024-03-25 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Ping


From: Liu, Shaoyun 
Sent: Friday, March 22, 2024 12:49:56 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES 
FW version

From MES version 0x54, the log entries increased and require the log buffer
size to be increased. 16k is the maximum size agreed.

Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 1 +
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 9ace848e174c..78e4f88f5134 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -103,7 +103,7 @@ static int amdgpu_mes_event_log_init(struct amdgpu_device 
*adev)
 if (!amdgpu_mes_log_enable)
 return 0;

-   r = amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
+   r = amdgpu_bo_create_kernel(adev, AMDGPU_MES_LOG_BUFFER_SIZE, PAGE_SIZE,
 AMDGPU_GEM_DOMAIN_GTT,
 &adev->mes.event_log_gpu_obj,
 &adev->mes.event_log_gpu_addr,
@@ -1548,12 +1548,11 @@ static int amdgpu_debugfs_mes_event_log_show(struct 
seq_file *m, void *unused)
 uint32_t *mem = (uint32_t *)(adev->mes.event_log_cpu_addr);

 seq_hex_dump(m, "", DUMP_PREFIX_OFFSET, 32, 4,
-mem, PAGE_SIZE, false);
+mem, AMDGPU_MES_LOG_BUFFER_SIZE, false);

 return 0;
 }

-
 DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_mes_event_log);

 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index 7d4f93fea937..4c8fc3117ef8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -52,6 +52,7 @@ enum amdgpu_mes_priority_level {

 #define AMDGPU_MES_PROC_CTX_SIZE 0x1000 /* one page area */
 #define AMDGPU_MES_GANG_CTX_SIZE 0x1000 /* one page area */
+#define AMDGPU_MES_LOG_BUFFER_SIZE 0x4000 /* Maximum log buffer size for MES */

 struct amdgpu_mes_funcs;

--
2.34.1



Re: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log feature

2024-03-25 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Ping


From: Liu, Shaoyun 
Sent: Friday, March 22, 2024 2:00:21 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu : Add mes_log_enable to control mes log feature

The MES log might slow down performance due to the extra step of logging the
data. Disable it by default and introduce a parameter that can enable it when
necessary.

Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  5 -
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  7 +--
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c62552bec34..b3b84647207e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -210,6 +210,7 @@ extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
 extern int amdgpu_mes;
+extern int amdgpu_mes_log_enable;
 extern int amdgpu_mes_kiq;
 extern int amdgpu_noretry;
 extern int amdgpu_force_asic_type;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 80b9642f2bc4..e4277298cf1a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -195,6 +195,7 @@ int amdgpu_async_gfx_ring = 1;
 int amdgpu_mcbp = -1;
 int amdgpu_discovery = -1;
 int amdgpu_mes;
+int amdgpu_mes_log_enable = 0;
 int amdgpu_mes_kiq;
 int amdgpu_noretry = -1;
 int amdgpu_force_asic_type = -1;
@@ -667,6 +668,15 @@ MODULE_PARM_DESC(mes,
 "Enable Micro Engine Scheduler (0 = disabled (default), 1 = enabled)");
 module_param_named(mes, amdgpu_mes, int, 0444);

+/**
+ * DOC: mes_log_enable (int)
+ * Enable Micro Engine Scheduler log. This is used to enable/disable MES 
internal log.
+ * (0 = disabled (default), 1 = enabled)
+ */
+MODULE_PARM_DESC(mes_log_enable,
+   "Enable Micro Engine Scheduler log (0 = disabled (default), 1 = enabled)");
+module_param_named(mes_log_enable, amdgpu_mes_log_enable, int, 0444);
+
 /**
  * DOC: mes_kiq (int)
  * Enable Micro Engine Scheduler KIQ. This is a new engine pipe for kiq.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 78dfd027dc99..9ace848e174c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -100,6 +100,9 @@ static int amdgpu_mes_event_log_init(struct amdgpu_device 
*adev)
 {
 int r;

+   if (!amdgpu_mes_log_enable)
+   return 0;
+
 r = amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
 AMDGPU_GEM_DOMAIN_GTT,
 &adev->mes.event_log_gpu_obj,
@@ -1561,7 +1564,7 @@ void amdgpu_debugfs_mes_event_log_init(struct 
amdgpu_device *adev)
 #if defined(CONFIG_DEBUG_FS)
 struct drm_minor *minor = adev_to_drm(adev)->primary;
 struct dentry *root = minor->debugfs_root;
-   if (adev->enable_mes)
+   if (adev->enable_mes && amdgpu_mes_log_enable)
 debugfs_create_file("amdgpu_mes_event_log", 0444, root,
 adev, &amdgpu_debugfs_mes_event_log_fops);

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 072c478665ad..63f281a9984d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -411,8 +411,11 @@ static int mes_v11_0_set_hw_resources(struct amdgpu_mes 
*mes)
 mes_set_hw_res_pkt.enable_reg_active_poll = 1;
 mes_set_hw_res_pkt.enable_level_process_quantum_check = 1;
 mes_set_hw_res_pkt.oversubscription_timer = 50;
-   mes_set_hw_res_pkt.enable_mes_event_int_logging = 1;
-   mes_set_hw_res_pkt.event_intr_history_gpu_mc_ptr = mes->event_log_gpu_addr;
+   if (amdgpu_mes_log_enable) {
+   mes_set_hw_res_pkt.enable_mes_event_int_logging = 1;
+   mes_set_hw_res_pkt.event_intr_history_gpu_mc_ptr =
+   mes->event_log_gpu_addr;
+   }

 return mes_v11_0_submit_pkt_and_poll_completion(mes,
 &mes_set_hw_res_pkt, sizeof(mes_set_hw_res_pkt),
--
2.34.1



RE: [PATCH 9/9] drm/amdgpu: enable MES discovery for GC 11.5.1

2024-02-16 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Reviewed-by: Shaoyun Liu

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Thursday, February 15, 2024 3:40 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Yifan ; Deucher, Alexander 

Subject: [PATCH 9/9] drm/amdgpu: enable MES discovery for GC 11.5.1

From: Yifan Zhang 

This patch enables MES for GC 11.5.1.

Signed-off-by: Yifan Zhang 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 70aeb56bfd53..704b7820c47c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -2185,6 +2185,7 @@ static int amdgpu_discovery_set_mes_ip_blocks(struct amdgpu_device *adev)
case IP_VERSION(11, 0, 3):
case IP_VERSION(11, 0, 4):
case IP_VERSION(11, 5, 0):
+   case IP_VERSION(11, 5, 1):
	amdgpu_device_ip_block_add(adev, &mes_v11_0_ip_block);
adev->enable_mes = true;
adev->enable_mes_kiq = true;
--
2.42.0



RE: [PATCH] drm/amdgpu: Only create mes event log debugfs when mes is enabled

2024-02-01 Thread Liu, Shaoyun
[AMD Official Use Only - General]

ping

-Original Message-
From: Liu, Shaoyun 
Sent: Wednesday, January 31, 2024 9:26 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu: Only create mes event log debugfs when mes is 
enabled

Skip the debugfs file creation for the mes event log if the GPU doesn't use 
MES. This is to prevent a potential kernel oops when a user tries to read the 
event log in debugfs on a GPU without MES.

Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 0626ac0192a8..dd2b8f3fa2f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -1565,9 +1565,9 @@ void amdgpu_debugfs_mes_event_log_init(struct amdgpu_device *adev)
 #if defined(CONFIG_DEBUG_FS)
struct drm_minor *minor = adev_to_drm(adev)->primary;
struct dentry *root = minor->debugfs_root;
-
-   debugfs_create_file("amdgpu_mes_event_log", 0444, root,
-   adev, &amdgpu_debugfs_mes_event_log_fops);
+   if (adev->enable_mes)
+   debugfs_create_file("amdgpu_mes_event_log", 0444, root,
+   adev, &amdgpu_debugfs_mes_event_log_fops);

 #endif
 }
--
2.34.1
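
The oops the patch prevents comes from the debugfs show callback, which
unconditionally dumps the log buffer (see the seq_hex_dump() call quoted
elsewhere in this digest); schematically:

	/* On a GPU without MES, amdgpu_mes_event_log_init() never ran: */
	uint32_t *mem = (uint32_t *)(adev->mes.event_log_cpu_addr); /* NULL */

	seq_hex_dump(m, "", DUMP_PREFIX_OFFSET, 32, 4,
		     mem, PAGE_SIZE, false);	/* NULL deref -> kernel oops */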



RE: [PATCH] drm/amdgpu: move kiq_reg_write_reg_wait() out of amdgpu_virt.c

2024-01-08 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Looks good to me.
Reviewed-by: Shaoyun Liu

-Original Message-
From: Deucher, Alexander 
Sent: Monday, January 8, 2024 4:38 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Liu, Shaoyun 
; Koenig, Christian 
Subject: [PATCH] drm/amdgpu: move kiq_reg_write_reg_wait() out of amdgpu_virt.c

It's used for more than just SR-IOV now, so move it to amdgpu_gmc.c and rename 
it to better match the functionality and update the comments in the code paths 
to better document when each path is used and why.  No functional change.

Signed-off-by: Alex Deucher 
Cc: shaoyun@amd.com
Cc: christian.koe...@amd.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 53
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  4 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 53
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h |  4 --
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c   |  9 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c   |  9 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c    | 12 +++---
 7 files changed, 74 insertions(+), 70 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index d2f273d77e59..331cf6384b12 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -746,6 +746,59 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device 
*adev, uint16_t pasid,
return r;
 }

+void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device *adev,
+ uint32_t reg0, uint32_t reg1,
+ uint32_t ref, uint32_t mask,
+ uint32_t xcc_inst)
+{
+   struct amdgpu_kiq *kiq = &adev->gfx.kiq[xcc_inst];
+   struct amdgpu_ring *ring = &kiq->ring;
+   signed long r, cnt = 0;
+   unsigned long flags;
+   uint32_t seq;
+
+   if (adev->mes.ring.sched.ready) {
+   amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1,
+ ref, mask);
+   return;
+   }
+
+   spin_lock_irqsave(&kiq->ring_lock, flags);
+   amdgpu_ring_alloc(ring, 32);
+   amdgpu_ring_emit_reg_write_reg_wait(ring, reg0, reg1,
+   ref, mask);
+   r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
+   if (r)
+   goto failed_undo;
+
+   amdgpu_ring_commit(ring);
+   spin_unlock_irqrestore(&kiq->ring_lock, flags);
+
+   r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
+
+   /* don't wait anymore for IRQ context */
+   if (r < 1 && in_interrupt())
+   goto failed_kiq;
+
+   might_sleep();
+   while (r < 1 && cnt++ < MAX_KIQ_REG_TRY) {
+
+   msleep(MAX_KIQ_REG_BAILOUT_INTERVAL);
+   r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
+   }
+
+   if (cnt > MAX_KIQ_REG_TRY)
+   goto failed_kiq;
+
+   return;
+
+failed_undo:
+   amdgpu_ring_undo(ring);
+   spin_unlock_irqrestore(&kiq->ring_lock, flags);
+failed_kiq:
+   dev_err(adev->dev, "failed to write reg %x wait reg %x\n", reg0, reg1);
+}
+
 /**
  * amdgpu_gmc_tmz_set -- check and set if a device supports TMZ
  * @adev: amdgpu_device pointer
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index e699d1ca8deb..17f40ea1104b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -417,6 +417,10 @@ void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
 int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
    uint32_t flush_type, bool all_hub,
    uint32_t inst);
+void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device *adev,
+ uint32_t reg0, uint32_t reg1,
+ uint32_t ref, uint32_t mask,
+ uint32_t xcc_inst);

 extern void amdgpu_gmc_tmz_set(struct amdgpu_device *adev);
 extern void amdgpu_gmc_noretry_set(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 0dcff2889e25..f5c66e0038b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -71,59 +71,6 @@ void amdgpu_virt_init_setting(struct amdgpu_device *adev)
amdgpu_num_kcq = 2;
 }

-void amdgpu_virt_kiq_reg_write_reg_wait(struct amdgpu_device *adev,
-   uint32_t reg0, uint32_t reg1,
-   uint32_t ref, uint32_t mask,
-   uint32_t xcc_inst)
-{
-   struct amdgpu_kiq *kiq = &adev->gfx.kiq[xcc_inst];
-   struct amdgpu_ring *ring = &kiq->ring;

RE: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Liu, Shaoyun
[AMD Official Use Only - General]

I remember we tried to use kiq to directly send the invalidation packet to the 
MEC FW with pasid as a parameter, since it's the FW that controls the VMID and 
PASID mapping, but during some tests it showed a performance drop compared to 
the driver directly getting the vmid/pasid mapping and doing the invalidation 
by itself.  For now, although the driver still uses kiq/mes, it only uses them 
for the register read/write and wait instead of sending the invalidate packet 
directly.  This actually has a potential issue: when the driver reads the 
vmid_pasid mapping from the hw register, the FW (hw scheduler) might change 
the process mapping (vmid/pasid mapping changed).
I think it probably makes more sense to directly use the mmio way for bare 
metal, but I heard someone compared it with kiq and it doesn't make much 
performance difference.  So if we want to minimize the code difference between 
SR-IOV and bare metal, using the kiq looks ok.
From my understanding, although kiq and mes both use MES pipes, the design is 
different: kiq (MES pipe 1) is mainly used for immediate jobs like register 
access, while mes (MES pipe 0) is a scheduler responsible mostly for queue 
management.  Although it has extended its support for misc-op (including 
register access), internally it will eventually pass these packets to kiq 
(pipe 1), so the driver side should try not to overuse them and send these 
operations directly to kiq if necessary.
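
The split described here mirrors the check later moved into
amdgpu_gmc_fw_reg_write_reg_wait() (see the patch earlier in this digest); in
outline, with locking and the fence wait omitted:

	if (adev->mes.ring.sched.ready) {
		/* MES scheduler ring (pipe 0): misc-op path; the firmware
		 * internally hands register ops to pipe 1 anyway. */
		amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1, ref, mask);
	} else {
		/* KIQ ring (pipe 1): emit the write/wait packet directly. */
		amdgpu_ring_emit_reg_write_reg_wait(&adev->gfx.kiq[0].ring,
						    reg0, reg1, ref, mask);
	}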

Regards
Shaoyun.liu

-Original Message-
From: Alex Deucher 
Sent: Thursday, December 14, 2023 10:07 AM
To: Liu, Shaoyun 
Cc: Christian König ; Limonciello, Mario 
; Huang, Tim ; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
sta...@vger.kernel.org
Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to 
flush TLB

On Thu, Dec 14, 2023 at 9:24 AM Liu, Shaoyun  wrote:
>
> [AMD Official Use Only - General]
>
> The gmc flush tlb function is used on both bare metal and sriov.  But the
> function amdgpu_virt_kiq_reg_write_reg_wait is defined in amdgpu_virt.c with
> the name 'virt', which makes it appear to be an SR-IOV-only function; this
> is confusing. Would it make more sense to move the function out of
> amdgpu_virt.c and rename it amdgpu_kiq_reg_write_reg_wait?
>
> Another thing I'm not sure about: inside amdgpu_virt_kiq_reg_write_reg_wait
> there is the below logic:
> if (adev->mes.ring.sched.ready) {
> amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1,
>   ref, mask);
> return;
> }
> In the MES-enabled situation, it will always go to the mes queue to do the
> register write and wait.  Shouldn't this OP be sent directly to kiq itself?
> The rings for kiq and mes are different; the driver should use the kiq ring
> (adev->gfx.kiq[0].ring) for these register read/write or wait operations
> and the mes ring (adev->mes.ring) for add/remove queues etc.
>

I understand why it is needed for SR-IOV.  Is there a reason to use the MES or 
KIQ for TLB invalidation rather than the register method on bare metal?  It 
looks like the register method is never used anymore.
 Seems like we should either make the KIQ/MES method SR-IOV only, or drop the 
register method and just always use KIQ/MES.

Alex


> Regards
> Shaoyun.liu
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Christian König
> Sent: Thursday, December 14, 2023 4:22 AM
> To: Alex Deucher ; Limonciello, Mario
> 
> Cc: Huang, Tim ; amd-gfx@lists.freedesktop.org;
> Koenig, Christian ; sta...@vger.kernel.org
> Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that
> fail to flush TLB
>
> Am 13.12.23 um 20:44 schrieb Alex Deucher:
> > On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
> >  wrote:
> >> On 12/13/2023 13:12, Mario Limonciello wrote:
> >>> On 12/13/2023 13:07, Alex Deucher wrote:
> >>>> On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
> >>>>  wrote:
> >>>>> Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> >>>>> causes the first MES packet after resume to fail. This packet is
> >>>>> used to flush the TLB when GART is enabled.
> >>>>>
> >>>>> This issue is fixed in newer firmware, but as OEMs may not roll
> >>>>> this out to the field, introduce a workaround that will retry
> >>>>> the flush when detecting running on an older firmware and
> >>>>> decrease relevant error messages to debug while workaround is in use.
> >>>>>
> >>>>> Cc: sta...@vger.kernel.org # 6.1+
> >>>>> Cc: Tim Huang 
> >>>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> >>>>

RE: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Liu, Shaoyun
[AMD Official Use Only - General]

The gmc flush tlb function is used on both bare metal and sriov.  But the 
function amdgpu_virt_kiq_reg_write_reg_wait is defined in amdgpu_virt.c with 
the name 'virt', which makes it appear to be an SR-IOV-only function; this is 
confusing.  Would it make more sense to move the function out of amdgpu_virt.c 
and rename it amdgpu_kiq_reg_write_reg_wait?

Another thing I'm not sure about: inside amdgpu_virt_kiq_reg_write_reg_wait 
there is the below logic:
if (adev->mes.ring.sched.ready) {
        amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1,
                                      ref, mask);
        return;
}
In the MES-enabled situation, it will always go to the mes queue to do the 
register write and wait.  Shouldn't this OP be sent directly to kiq itself?  
The rings for kiq and mes are different; the driver should use the kiq ring 
(adev->gfx.kiq[0].ring) for these register read/write or wait operations and 
the mes ring (adev->mes.ring) for add/remove queues etc.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, December 14, 2023 4:22 AM
To: Alex Deucher ; Limonciello, Mario 

Cc: Huang, Tim ; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; sta...@vger.kernel.org
Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to 
flush TLB

Am 13.12.23 um 20:44 schrieb Alex Deucher:
> On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
>  wrote:
>> On 12/13/2023 13:12, Mario Limonciello wrote:
>>> On 12/13/2023 13:07, Alex Deucher wrote:
 On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
  wrote:
> Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> causes the first MES packet after resume to fail. This packet is
> used to flush the TLB when GART is enabled.
>
> This issue is fixed in newer firmware, but as OEMs may not roll
> this out to the field, introduce a workaround that will retry the
> flush when detecting running on an older firmware and decrease
> relevant error messages to debug while workaround is in use.
>
> Cc: sta...@vger.kernel.org # 6.1+
> Cc: Tim Huang 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> Signed-off-by: Mario Limonciello 
> ---
>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
>drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
>drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
>4 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index 9ddbf1494326..6ce3f6e6b6de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
> amdgpu_device *adev,
>   }
>
>   r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
> -   if (r)
> -   DRM_ERROR("failed to reg_write_reg_wait\n");
> +   if (r) {
> +   const char *msg = "failed to reg_write_reg_wait\n";
> +
> +   if (adev->mes.suspend_workaround)
> +   DRM_DEBUG(msg);
> +   else
> +   DRM_ERROR(msg);
> +   }
>
>error:
>   return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index a27b424ffe00..90f2bba3b12b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -135,6 +135,8 @@ struct amdgpu_mes {
>
>   /* ip specific functions */
>   const struct amdgpu_mes_funcs   *funcs;
> +
> +   boolsuspend_workaround;
>};
>
>struct amdgpu_mes_process {
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> index 23d7b548d13f..e810c7bb3156 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> @@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
> amdgpu_device *adev)
>   false : true;
>
>   adev->mmhub.funcs->set_fault_enable_default(adev, value);
> -   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +
> +   do {
> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +   adev->mes.suspend_workaround = false;
> +   } while (adev->mes.suspend_workaround);
 Shouldn't this be something like:

> +   do {
> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +   adev->mes.suspend_workaround = 
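(The archive snippet is cut off above. One possible reading of the
suggestion, as a purely hypothetical sketch: re-evaluate the flag each
iteration so the loop can actually retry, e.g. with the MES error path
clearing suspend_workaround once the flush packet finally succeeds:

	adev->mes.suspend_workaround = true;
	do {
		/* hypothetical: mes_v11_0 would clear suspend_workaround
		 * when the flush goes through */
		gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
	} while (adev->mes.suspend_workaround);

As posted, the loop assigns false unconditionally before the while test,
so it can never run more than once.)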

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-13 Thread Liu, Shaoyun
[Public]

I checked the MES API. It's my fault: originally I thought it would use trap_en
as a parameter to tell MES to enable/disable the shader debugger, but it
actually does not. So we either need to add a parameter for this (e.g. an
enable/disable flag) or, as in your solution, add a flag for the flush.
Considering that a user process can be killed after calling set_shader but
before any add_queue, notifying MES to do a process context flush after process
termination seems more reasonable.
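To spell out the sequence I have in mind (hypothetical illustration only;
ctx_addr_A stands in for the per-process context address the driver passes
down, it is not a name from the patch):

	/*
	 * 1. debug attach: SET_SHADER_DEBUGGER(ctx_addr_A)
	 *      -> MES adds ctx_addr_A to its process list
	 * 2. process killed before any ADD_QUEUE
	 *      -> no queue removal, so MES never purges the entry
	 * 3. a new process allocates a BO at the same GPU address
	 * 4. SET_SHADER_DEBUGGER(ctx_addr_A) for the new process
	 *      -> MES dereferences the stale entry: fatal violation
	 *
	 * With the proposed flag, step 2 would instead end with:
	 */
	amdgpu_mes_flush_shader_debugger(adev, ctx_addr_A);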

Regards
Shaoyun.liu



From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 11:43 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management


[Public]

Again, MES only knows to flush if there was something enqueued in the first 
place.
SET_SHADER dictates what's on the process list.
SET_SHADER can be the last call prior to process termination with nothing 
enqueued, hence no MES auto flush occurs.

MES doesn't block anything on the flush flag request.
The driver guarantees that flush is only done on process termination after 
device dequeue, whether there were queues or not.
MES has no idea what an invalid context is.
It just has a value stored in its linked list that's associated with a driver 
allocated BO that no longer exists after process termination.

If you're still not sure about this solution, then this should be discussed 
offline with the MES team.
We're not going to gain ground discussing this here.  The solution has already 
been merged.
Feel free to propose a better solution if you're not satisfied with this one.

Jon

From: Liu, Shaoyun mailto:shaoyun@amd.com>>
Sent: Tuesday, December 12, 2023 11:08 PM
To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, 
JinHuiEric mailto:jinhuieric.hu...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
Harish mailto:harish.kasiviswanat...@amd.com>>
Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management


[Public]

You are trying to add one new interface to inform MES about the context flush
after the driver side finishes process termination. From my understanding, MES
already knows the process context needs to be purged after all the related
queues have been removed, even without this notification. What do you expect
MES to do with this context flush flag? Should MES block this process context
for the next set_sched command? MES can achieve this by ignoring a set_sched
command with the trap disable parameter on an invalid process context.

Shaoyun.liu


From: Kim, Jonathan mailto:jonathan@amd.com>>
Sent: Tuesday, December 12, 2023 8:19:09 PM
To: Liu, Shaoyun mailto:shaoyun@amd.com>>; Huang, 
JinHuiEric mailto:jinhuieric.hu...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
Harish mailto:harish.kasiviswanat...@amd.com>>
Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun mailto:shaoyun@amd.com>>
> Sent: Tuesday, December 12, 2023 7:08 PM
> To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, 
> JinHuiEric
> mailto:jinhuieric.hu...@amd.com>>; 
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
> Felix
> mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
> Harish
> mailto:harish.kasiviswanat...@amd.com>>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> I see,  so the  problem is after process context , set_shader been  called 
> with
> disable parameter,  do you know the  reason why  MES re-added the
> process context into the  list ?

Because MES has no idea what disable means.

All it knows is that without the flush flag, set_shader should update the 
necessary per-VMID (process) registers as requested by the driver, which 
requires persistent per-process HW settings so that potential future waves can 
inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why 
ADD_QUEUE auto clears the process context otherwise is another long story, 
basically an unsolvable MES cache bug problem).

Common use case example:
add_queue -> set_shader call either transiently stalls the SPI per-VMID or 
transiently dequeues the HWS per-VMID depending on the request settings 

Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

You are trying to add one new interface to inform MES about the context flush
after the driver side finishes process termination. From my understanding, MES
already knows the process context needs to be purged after all the related
queues have been removed, even without this notification. What do you expect
MES to do with this context flush flag? Should MES block this process context
for the next set_sched command? MES can achieve this by ignoring a set_sched
command with the trap disable parameter on an invalid process context.

Shaoyun.liu


From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 8:19:09 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org 

Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 7:08 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> I see,  so the  problem is after process context , set_shader been  called 
> with
> disable parameter,  do you know the  reason why  MES re-added the
> process context into the  list ?

Because MES has no idea what disable means.

All it knows is that without the flush flag, set_shader should update the 
necessary per-VMID (process) registers as requested by the driver, which 
requires persistent per-process HW settings so that potential future waves can 
inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why 
ADD_QUEUE auto clears the process context otherwise is another long story, 
basically an unsolvable MES cache bug problem).

Common use case example:
add_queue -> set_shader call either transiently stalls the SPI per-VMID or 
transiently dequeues the HWS per-VMID depending on the request settings -> 
fulfils the per-VMID register write updates -> resumes process queues so that 
potential waves on those queues inherit new debug settings.

You can't do this kind of operation at the queue level alone.

The problem that this patch solves (along with the MES FW upgrade) is an 
unfortunate quirk of having to operate between process (debug requests) and 
queue space (non-debug requests).
Old HWS used to operate at the per-process level via MAP_PROCESS so it was a 
lot easier to balance debug versus non-debug requests back then (but it was 
also lot less efficient performance wise).

Jon

>
> Shaoyun.liu
>
> -Original Message-----
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 6:07 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 5:44 PM
> > To: Kim, Jonathan ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?
> I
> > think  even in that  situation MES should still be able to handle it
> > as long as MES already  remove the process context from its list , MES
> > will treat the process context as a new item. I still don't understand why
> MES haven't
> > purged the  process context from the list after process termination .   Will
> > debug queue itself  also use the add/remove queue interface  and  is
> > it possible the debug queue itself from the  old process  still not be
> > removed ?
>
> SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
> The process list is updated on either on SET_SHADER_DEBUGGER or
> ADD_QUEUE.
> e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list
> purged) -> runtime_disable (set_shader process re-added) -> process
> termination (stale list) or debug attach (set_shader) -> add_queue ->
> remove_queue (list purged) -> debug detach (set_shader process re-added) -
> >process termination (stale list)
>
> MES has no idea what process termination means.  The new flag is a proxy
> for this.
> There are reasons for process settings to take place prior to queue add
> (debugger, gfx11 cwsr workaround, core dump etc need this).
>
> I'm not sure what kernel/debug 

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

I see, so the problem is that set_shader gets called with the disable parameter
as the last call on the process context. Do you know the reason why MES
re-added the process context into the list?

Shaoyun.liu

-Original Message-
From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 6:07 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 5:44 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?  I
> think  even in that  situation MES should still be able to handle it
> as long as MES already  remove the process context from its list , MES
> will treat the process context as a new item. I still don't understand why 
> MES haven't
> purged the  process context from the list after process termination .   Will
> debug queue itself  also use the add/remove queue interface  and  is
> it possible the debug queue itself from the  old process  still not be
> removed ?

SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
The process list is updated on either on SET_SHADER_DEBUGGER or ADD_QUEUE.
e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list purged) -> 
runtime_disable (set_shader process re-added) -> process termination (stale 
list) or debug attach (set_shader) -> add_queue -> remove_queue (list purged) 
-> debug detach (set_shader process re-added) ->process termination (stale list)

MES has no idea what process termination means.  The new flag is a proxy for 
this.
There are reasons for process settings to take place prior to queue add 
(debugger, gfx11 cwsr workaround, core dump etc need this).

I'm not sure what kernel/debug queues have to do with this.
By that argument, the list should be purged.

Jon

>
> Shaoyun.liu
>
> -----Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 4:48 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 4:45 PM
> > To: Kim, Jonathan ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Shouldn't the driver side  remove all the remaining  queues for the
> > process during  process termination ?  If all the  queues been
> > removed for the process ,  MES should purge the  process context
> > automatically , otherwise it's bug inside MES .
>
> That's only if there were queues added to begin with.
>
> Jon
>
> >
> > Regard
> > Sshaoyun.liu
> >
> > -Original Message-
> > From: Kim, Jonathan 
> > Sent: Tuesday, December 12, 2023 4:33 PM
> > To: Liu, Shaoyun ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > > -Original Message-
> > > From: Liu, Shaoyun 
> > > Sent: Tuesday, December 12, 2023 4:00 PM
> > > To: Huang, JinHuiEric ; Kim, Jonathan
> > > ; amd-gfx@lists.freedesktop.org
> > > Cc: Wong, Alice ; Kuehling, Felix
> > > ; Kasiviswanathan, Harish
> > > 
> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger
> > > process management
> > >
> > > [AMD Official Use Only - General]
> > >
> > > Does this requires the  new MES FW for this process_ctx_flush
> > > requirement ?  Can driver side add logic to guaranty when  call
> > > SET_SHADER_DEBUGGER, the process address  is always valid ?
> >
> > Call to flush on old fw is a NOP so it's harmless in that case.
> > Full solution will still require a new MES version as this is a
> > workaround on corner cases and not a new feature i.e. we can't stop
> > ROCm from runnin

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

Do you mean SET_SHADER_DEBUGGER can be called before ADD_QUEUE? I think even in
that situation MES should still be able to handle it: as long as MES has
already removed the process context from its list, MES will treat the process
context as a new item. I still don't understand why MES hasn't purged the
process context from the list after process termination. Will the debug queue
itself also use the add/remove queue interface, and is it possible the debug
queue from the old process has still not been removed?

Shaoyun.liu

-Original Message-
From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 4:48 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 4:45 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> Shouldn't the driver side  remove all the remaining  queues for the
> process during  process termination ?  If all the  queues been removed
> for the process ,  MES should purge the  process context automatically
> , otherwise it's bug inside MES .

That's only if there were queues added to begin with.

Jon

>
> Regard
> Sshaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 4:33 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 4:00 PM
> > To: Huang, JinHuiEric ; Kim, Jonathan
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [AMD Official Use Only - General]
> >
> > Does this requires the  new MES FW for this process_ctx_flush
> > requirement ?  Can driver side add logic to guaranty when  call
> > SET_SHADER_DEBUGGER, the process address  is always valid ?
>
> Call to flush on old fw is a NOP so it's harmless in that case.
> Full solution will still require a new MES version as this is a
> workaround on corner cases and not a new feature i.e. we can't stop
> ROCm from running on old fw.
> The process address is always valid from the driver side.  It's the
> MES side of things that gets stale as mentioned in the description
> (passed value to MES is reused with new BO but MES doesn't refresh).
> i.e. MES auto refreshes it's process list assuming process queues were
> all drained but driver can't guarantee that SET_SHADER_DEBUGGER (which
> adds to MES's process list) will get called after queues get added (in
> fact it's a requirements that it can be called at any time).
> We can attempt to defer calls these calls in the KFD, considering all cases.
> But that would be a large shift in debugger/runtime_enable/KFD code,
> which is already complicated and could get buggy plus it would not be
> intuitive at all as to why we're doing this.
> I think a single flag set to flush MES on process termination is a
> simpler compromise that shows the limitation in a more obvious way.
>
> Thanks,
>
> Jon
>
>
> >
> > Regards
> > Shaoyun.liu
> >
> >
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> > Eric Huang
> > Sent: Tuesday, December 12, 2023 12:49 PM
> > To: Kim, Jonathan ; amd-
> > g...@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> >
> > On 2023-12-11 16:16, Jonathan Kim wrote:
> > > MES provides the driver a call to explicitly flush stale process
> > > memory within the MES to avoid a race condition that results in a
> > > fatal memory violation.
> > >
> > > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> > address
> > > that represents a process context address MES uses to keep track
> > > of future per-process calls.
> > >
> > > Normally, MES will purge its process context list when the last
> > > queue has be

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

Shouldn't the driver side remove all the remaining queues for the process
during process termination? If all the queues have been removed for the
process, MES should purge the process context automatically; otherwise it's a
bug inside MES.

Regards
Shaoyun.liu

-Original Message-
From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 4:33 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 4:00 PM
> To: Huang, JinHuiEric ; Kim, Jonathan
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [AMD Official Use Only - General]
>
> Does this requires the  new MES FW for this process_ctx_flush
> requirement ?  Can driver side add logic to guaranty when  call
> SET_SHADER_DEBUGGER, the process address  is always valid ?

Call to flush on old fw is a NOP so it's harmless in that case.
Full solution will still require a new MES version as this is a workaround on 
corner cases and not a new feature i.e. we can't stop ROCm from running on old 
fw.
The process address is always valid from the driver side.  It's the MES side of 
things that gets stale as mentioned in the description (passed value to MES is 
reused with new BO but MES doesn't refresh).
i.e. MES auto refreshes it's process list assuming process queues were all 
drained but driver can't guarantee that SET_SHADER_DEBUGGER (which adds to 
MES's process list) will get called after queues get added (in fact it's a 
requirements that it can be called at any time).
We can attempt to defer calls these calls in the KFD, considering all cases.
But that would be a large shift in debugger/runtime_enable/KFD code, which is 
already complicated and could get buggy plus it would not be intuitive at all 
as to why we're doing this.
I think a single flag set to flush MES on process termination is a simpler 
compromise that shows the limitation in a more obvious way.

Thanks,

Jon
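As an untested sketch of how the KFD side could drive this at process
termination, using the amdgpu_mes_flush_shader_debugger() helper from the
patch (the wrapper function is illustrative only; the pdd field names are
assumptions, not taken from the patch):

	static void kfd_flush_mes_process_ctx(struct kfd_process_device *pdd)
	{
		struct amdgpu_device *adev = pdd->dev->adev;

		/* called once per process on teardown, after all queues
		 * have been destroyed and the device has been dequeued;
		 * tells MES to drop its cached process context entry so a
		 * future process reusing the same address starts clean */
		amdgpu_mes_flush_shader_debugger(adev, pdd->proc_ctx_gpu_addr);
	}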


>
> Regards
> Shaoyun.liu
>
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Eric Huang
> Sent: Tuesday, December 12, 2023 12:49 PM
> To: Kim, Jonathan ; amd-
> g...@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
>
> On 2023-12-11 16:16, Jonathan Kim wrote:
> > MES provides the driver a call to explicitly flush stale process
> > memory within the MES to avoid a race condition that results in a
> > fatal memory violation.
> >
> > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> address
> > that represents a process context address MES uses to keep track of
> > future per-process calls.
> >
> > Normally, MES will purge its process context list when the last
> > queue has been removed.  The driver, however, can call
> > SET_SHADER_DEBUGGER regardless of whether a queue has been added or not.
> >
> > If SET_SHADER_DEBUGGER has been called with no queues as the last
> > call prior to process termination, the passed process context
> > address will still reside within MES.
> >
> > On a new process call to SET_SHADER_DEBUGGER, the driver may end up
> > passing an identical process context address value (based on
> > per-process gpu memory address) to MES but is now pointing to a new
> > allocated buffer object during KFD process creation.  Since the MES
> > is unaware of this, access of the passed address points to the stale
> > object within MES and triggers a fatal memory violation.
> >
> > The solution is for KFD to explicitly flush the process context
> > address from MES on process termination.
> >
> > Note that the flush call and the MES debugger calls use the same MES
> > interface but are separated as KFD calls to avoid conflicting with
> > each other.
> >
> > Signed-off-by: Jonathan Kim 
> > Tested-by: Alice Wong 
> Reviewed-by: Eric Huang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 31
> +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 10 +++---
> >   .../amd/amdkfd/kfd_process_queue_manager.c|  1 +
> >   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
> >   4 files changed, 40 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > b/drivers/gpu/drm/amd/amdgpu/amd

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Does this require the new MES FW for this process_ctx_flush requirement? Can
the driver side add logic to guarantee that when SET_SHADER_DEBUGGER is called,
the process address is always valid?

Regards
Shaoyun.liu


-Original Message-
From: amd-gfx  On Behalf Of Eric Huang
Sent: Tuesday, December 12, 2023 12:49 PM
To: Kim, Jonathan ; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management


On 2023-12-11 16:16, Jonathan Kim wrote:
> MES provides the driver a call to explicitly flush stale process
> memory within the MES to avoid a race condition that results in a
> fatal memory violation.
>
> When SET_SHADER_DEBUGGER is called, the driver passes a memory address
> that represents a process context address MES uses to keep track of
> future per-process calls.
>
> Normally, MES will purge its process context list when the last queue
> has been removed.  The driver, however, can call SET_SHADER_DEBUGGER
> regardless of whether a queue has been added or not.
>
> If SET_SHADER_DEBUGGER has been called with no queues as the last call
> prior to process termination, the passed process context address will
> still reside within MES.
>
> On a new process call to SET_SHADER_DEBUGGER, the driver may end up
> passing an identical process context address value (based on
> per-process gpu memory address) to MES but is now pointing to a new
> allocated buffer object during KFD process creation.  Since the MES is
> unaware of this, access of the passed address points to the stale
> object within MES and triggers a fatal memory violation.
>
> The solution is for KFD to explicitly flush the process context
> address from MES on process termination.
>
> Note that the flush call and the MES debugger calls use the same MES
> interface but are separated as KFD calls to avoid conflicting with
> each other.
>
> Signed-off-by: Jonathan Kim 
> Tested-by: Alice Wong 
Reviewed-by: Eric Huang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 31 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 10 +++---
>   .../amd/amdkfd/kfd_process_queue_manager.c|  1 +
>   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
>   4 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index e544b823abf6..e98de23250dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
> *adev,
>   op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
>   op_input.set_shader_debugger.process_context_addr = 
> process_context_addr;
>   op_input.set_shader_debugger.flags.u32all = flags;
> +
> +	/* use amdgpu_mes_flush_shader_debugger instead */
> + if (op_input.set_shader_debugger.flags.process_ctx_flush)
> + return -EINVAL;
> +
>   op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = 
> spi_gdbg_per_vmid_cntl;
>   memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl,
>   sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
> @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
> *adev,
>   return r;
>   }
>
> +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
> +				     uint64_t process_context_addr)
> +{
> + struct mes_misc_op_input op_input = {0};
> + int r;
> +
> + if (!adev->mes.funcs->misc_op) {
> + DRM_ERROR("mes flush shader debugger is not supported!\n");
> + return -EINVAL;
> + }
> +
> + op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> + op_input.set_shader_debugger.process_context_addr = 
> process_context_addr;
> + op_input.set_shader_debugger.flags.process_ctx_flush = true;
> +
> +	amdgpu_mes_lock(&adev->mes);
> +
> +	r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
> +	if (r)
> +		DRM_ERROR("failed to set_shader_debugger\n");
> +
> +	amdgpu_mes_unlock(&adev->mes);
> +
> +	return r;
> +}
> +
>   static void
>   amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
>  struct amdgpu_ring *ring,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index 894b9b133000..7d4f93fea937 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -296,9 +296,10 @@ struct mes_misc_op_input {
>   uint64_t process_context_addr;
>   union {
>   struct {
> - uint64_t single_memop : 1;
> - uint64_t single_alu_op : 1;
> - uint64_t reserved: 30;
> + 
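(The hunk above is truncated in the archive. Given that the code earlier
in the patch tests and sets flags.process_ctx_flush, the rest of this hunk
presumably carves a process_ctx_flush bit out of the reserved field, along
these hypothetical lines:

	uint64_t single_memop : 1;
	uint64_t single_alu_op : 1;
	uint64_t reserved : 29;
	uint64_t process_ctx_flush : 1;

This reconstruction is a guess, not the actual hunk.)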

RE: [PATCH] drm: Disable XNACK on SRIOV environment

2023-11-02 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Looks ok to me .
Reviewed-by: Shaoyun.liu 

-Original Message-
From: Kakarya, Surbhi 
Sent: Thursday, November 2, 2023 12:10 PM
To: Kakarya, Surbhi ; amd-gfx@lists.freedesktop.org; 
Yang, Philip ; Liu, Shaoyun 
Subject: RE: [PATCH] drm: Disable XNACK on SRIOV environment

[AMD Official Use Only - General]

Ping..

-Original Message-
From: Surbhi Kakarya 
Sent: Monday, October 30, 2023 9:54 PM
To: amd-gfx@lists.freedesktop.org; Yang, Philip 
Cc: Kakarya, Surbhi 
Subject: [PATCH] drm: Disable XNACK on SRIOV environment

The purpose of this patch is to disable XNACK (i.e. force XNACK OFF mode) on
SRIOV platforms which don't support it.

This prevents user-space applications from failing or behaving unexpectedly
whenever an application needs to run a test case in XNACK ON mode.
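(For context: with ROCm user space, an application typically opts into XNACK ON
mode by setting the HSA_XNACK environment variable, e.g. running
HSA_XNACK=1 ./test_case, before the runtime initializes. With this change such
a request is refused up front on an unsupported SRIOV platform instead of
misbehaving later. The environment variable is the usual ROCm convention, not
something introduced by this patch.)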

Signed-off-by: Surbhi Kakarya 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  |  5 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 ++++++++--
 4 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 2dce338b0f1e..d582b240f919 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -826,7 +826,10 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
gc_ver == IP_VERSION(9, 4, 3) ||
gc_ver >= IP_VERSION(10, 3, 0));

-	gmc->noretry = (amdgpu_noretry == -1) ? noretry_default : amdgpu_noretry;
+	if (!amdgpu_sriov_xnack_support(adev))
+		gmc->noretry = 1;
+	else
+		gmc->noretry = (amdgpu_noretry == -1) ? noretry_default :
+			       amdgpu_noretry;
 }

 void amdgpu_gmc_set_vm_fault_masks(struct amdgpu_device *adev, int hub_type, 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index a0aa624f5a92..41c77d5c5a79 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -1093,3 +1093,12 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
else
return RREG32(offset);
 }
+bool amdgpu_sriov_xnack_support(struct amdgpu_device *adev)
+{
+	bool xnack_mode = 1;
+
+	if (amdgpu_sriov_vf(adev) &&
+	    (adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 2)))
+		xnack_mode = 0;
+
+   return xnack_mode;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 858ef21ae515..935ca736300e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -365,4 +365,5 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
 bool amdgpu_virt_fw_load_skip_check(struct amdgpu_device *adev,
 			uint32_t ucode_id);
 void amdgpu_virt_post_reset(struct amdgpu_device *adev);
+bool amdgpu_sriov_xnack_support(struct amdgpu_device *adev);
 #endif
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index fbf053001af9..69954a2a8503 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1416,8 +1416,14 @@ bool kfd_process_xnack_mode(struct kfd_process *p, bool supported)
 * per-process XNACK mode selection. But let the dev->noretry
 * setting still influence the default XNACK mode.
 */
-   if (supported && KFD_SUPPORT_XNACK_PER_PROCESS(dev))
-   continue;
+   if (supported && KFD_SUPPORT_XNACK_PER_PROCESS(dev)) {
+   if (!amdgpu_sriov_xnack_support(dev->kfd->adev)) {
+   pr_debug("SRIOV platform xnack not 
supported\n");
+   return false;
+   }
+   else
+   continue;
+   }

/* GFXv10 and later GPUs do not support shader preemption
 * during page faults. This can lead to poor QoS for queue
--
2.25.1




RE: [RFC 1/7] drm/amdgpu: UAPI for user queue management

2023-01-03 Thread Liu, Shaoyun
[AMD Official Use Only - General]

What about the existing ROCm apps that already use the hsakmt APIs for user
queues?

Shaoyun.liu

-Original Message-
From: Alex Deucher 
Sent: Tuesday, January 3, 2023 2:22 PM
To: Liu, Shaoyun 
Cc: Kuehling, Felix ; Sharma, Shashank 
; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Koenig, Christian ; 
Yadav, Arvind ; Paneer Selvam, Arunpravin 

Subject: Re: [RFC 1/7] drm/amdgpu: UAPI for user queue management

On Tue, Jan 3, 2023 at 2:17 PM Liu, Shaoyun  wrote:
>
> [AMD Official Use Only - General]
>
> Hsakmt  has  the  interfaces for compute user queue. Do we want a unify API 
> for both  graphic and compute  ?

Yeah, that is the eventual goal, hence the flag for AQL vs PM4.

Alex

>
> Regards
> Shaoyun.liu
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Felix Kuehling
> Sent: Tuesday, January 3, 2023 1:30 PM
> To: Sharma, Shashank ;
> amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Yadav, Arvind ;
> Paneer Selvam, Arunpravin 
> Subject: Re: [RFC 1/7] drm/amdgpu: UAPI for user queue management
>
> Am 2022-12-23 um 14:36 schrieb Shashank Sharma:
> > From: Alex Deucher 
> >
> > This patch introduces new UAPI/IOCTL for usermode graphics queue.
> > The userspace app will fill this structure and request the graphics
> > driver to add a graphics work queue for it. The output of this UAPI
> > is a queue id.
> >
> > This UAPI maps the queue into GPU, so the graphics app can start
> > submitting work to the queue as soon as the call returns.
> >
> > Cc: Alex Deucher 
> > Cc: Christian Koenig 
> > Signed-off-by: Alex Deucher 
> > Signed-off-by: Shashank Sharma 
> > ---
> >   include/uapi/drm/amdgpu_drm.h | 52 +++
> >   1 file changed, 52 insertions(+)
> >
> > diff --git a/include/uapi/drm/amdgpu_drm.h
> > b/include/uapi/drm/amdgpu_drm.h index 0d93ec132ebb..a3d0dd6f62c5
> > 100644
> > --- a/include/uapi/drm/amdgpu_drm.h
> > +++ b/include/uapi/drm/amdgpu_drm.h
> > @@ -54,6 +54,7 @@ extern "C" {
> >   #define DRM_AMDGPU_VM   0x13
> >   #define DRM_AMDGPU_FENCE_TO_HANDLE  0x14
> >   #define DRM_AMDGPU_SCHED0x15
> > +#define DRM_AMDGPU_USERQ 0x16
> >
> >   #define DRM_IOCTL_AMDGPU_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + 
> > DRM_AMDGPU_GEM_CREATE, union drm_amdgpu_gem_create)
> >   #define DRM_IOCTL_AMDGPU_GEM_MMAP   DRM_IOWR(DRM_COMMAND_BASE + 
> > DRM_AMDGPU_GEM_MMAP, union drm_amdgpu_gem_mmap)
> > @@ -71,6 +72,7 @@ extern "C" {
> >   #define DRM_IOCTL_AMDGPU_VM DRM_IOWR(DRM_COMMAND_BASE + 
> > DRM_AMDGPU_VM, union drm_amdgpu_vm)
> >   #define DRM_IOCTL_AMDGPU_FENCE_TO_HANDLE DRM_IOWR(DRM_COMMAND_BASE + 
> > DRM_AMDGPU_FENCE_TO_HANDLE, union drm_amdgpu_fence_to_handle)
> >   #define DRM_IOCTL_AMDGPU_SCHED  DRM_IOW(DRM_COMMAND_BASE + 
> > DRM_AMDGPU_SCHED, union drm_amdgpu_sched)
> > +#define DRM_IOCTL_AMDGPU_USERQ   DRM_IOW(DRM_COMMAND_BASE + 
> > DRM_AMDGPU_USERQ, union drm_amdgpu_userq)
> >
> >   /**
> >* DOC: memory domains
> > @@ -288,6 +290,56 @@ union drm_amdgpu_ctx {
> >   union drm_amdgpu_ctx_out out;
> >   };
> >
> > +/* user queue IOCTL */
> > +#define AMDGPU_USERQ_OP_CREATE   1
> > +#define AMDGPU_USERQ_OP_FREE 2
> > +
> > +#define AMDGPU_USERQ_MQD_FLAGS_SECURE(1 << 0)
>
> What does "secure" mean here? I don't see this flag referenced anywhere in 
> the rest of the patch series.
>
> Regards,
>Felix
>
>
> > +#define AMDGPU_USERQ_MQD_FLAGS_AQL   (1 << 1)
> > +
> > +struct drm_amdgpu_userq_mqd {
> > + /** Flags: AMDGPU_USERQ_MQD_FLAGS_* */
> > + __u32   flags;
> > + /** IP type: AMDGPU_HW_IP_* */
> > + __u32   ip_type;
> > + /** GEM object handle */
> > + __u32   doorbell_handle;
> > + /** Doorbell offset in dwords */
> > + __u32   doorbell_offset;
> > + /** GPU virtual address of the queue */
> > + __u64   queue_va;
> > + /** Size of the queue in bytes */
> > + __u64   queue_size;
> > + /** GPU virtual address of the rptr */
> > + __u64   rptr_va;
> > + /** GPU virtual address of the wptr */
> > + __u64   wptr_va;
> > +};
> > +
> > +struct drm_amdgpu_userq_in {
> > + /** AMDGPU_USERQ_OP_* */
> > + __u32   op;
> > + /** Flags */
> > + __u32   flags;
> > + /** Context handle to associate the queue with */
> > + __u32   ctx_id;
> > + __u32   pad;
> > + /** Queue descriptor */
> > + struct drm_amdgpu_userq_mqd mqd; };
> > +
> > +struct drm_amdgpu_userq_out {
> > + /** Queue handle */
> > + __u32   q_id;
> > + /** Flags */
> > + __u32   flags;
> > +};
> > +
> > +union drm_amdgpu_userq {
> > + struct drm_amdgpu_userq_in in;
> > + struct drm_amdgpu_userq_out out; };
> > +
> >   /* vm ioctl */
> >   #define AMDGPU_VM_OP_RESERVE_VMID   1
> >   #define AMDGPU_VM_OP_UNRESERVE_VMID 2
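To illustrate how user space would exercise this UAPI, a minimal hypothetical
sketch; the DRM fd, context id and doorbell/BO setup are assumed to already
exist, and error handling is reduced to a bare return:

	#include <string.h>
	#include <sys/ioctl.h>
	#include "amdgpu_drm.h"

	static int create_user_queue(int drm_fd, __u32 ctx_id,
				     __u32 doorbell_handle, __u64 queue_va,
				     __u64 queue_size, __u64 rptr_va,
				     __u64 wptr_va, __u32 *q_id)
	{
		union drm_amdgpu_userq args;

		memset(&args, 0, sizeof(args));
		args.in.op = AMDGPU_USERQ_OP_CREATE;
		args.in.ctx_id = ctx_id;
		/* PM4 queue on the gfx engine; set FLAGS_AQL for compute */
		args.in.mqd.flags = 0;
		args.in.mqd.ip_type = AMDGPU_HW_IP_GFX;
		args.in.mqd.doorbell_handle = doorbell_handle;
		args.in.mqd.doorbell_offset = 0;
		args.in.mqd.queue_va = queue_va;
		args.in.mqd.queue_size = queue_size;
		args.in.mqd.rptr_va = rptr_va;
		args.in.mqd.wptr_va = wptr_va;

		if (ioctl(drm_fd, DRM_IOCTL_AMDGPU_USERQ, &args))
			return -1;

		*q_id = args.out.q_id;
		return 0;
	}

The in/out union works like the other amdgpu ioctls: the same memory is reused
for the reply, so q_id is read back after the call returns.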


RE: [RFC 1/7] drm/amdgpu: UAPI for user queue management

2023-01-03 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Hsakmt has the interfaces for compute user queues. Do we want a unified API for
both graphics and compute?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Felix 
Kuehling
Sent: Tuesday, January 3, 2023 1:30 PM
To: Sharma, Shashank ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Yadav, Arvind ; Paneer 
Selvam, Arunpravin 
Subject: Re: [RFC 1/7] drm/amdgpu: UAPI for user queue management

Am 2022-12-23 um 14:36 schrieb Shashank Sharma:
> From: Alex Deucher 
>
> This patch introduces new UAPI/IOCTL for usermode graphics queue. The
> userspace app will fill this structure and request the graphics driver
> to add a graphics work queue for it. The output of this UAPI is a
> queue id.
>
> This UAPI maps the queue into GPU, so the graphics app can start
> submitting work to the queue as soon as the call returns.
>
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Alex Deucher 
> Signed-off-by: Shashank Sharma 
> ---
>   include/uapi/drm/amdgpu_drm.h | 52 +++
>   1 file changed, 52 insertions(+)
>
> diff --git a/include/uapi/drm/amdgpu_drm.h
> b/include/uapi/drm/amdgpu_drm.h index 0d93ec132ebb..a3d0dd6f62c5
> 100644
> --- a/include/uapi/drm/amdgpu_drm.h
> +++ b/include/uapi/drm/amdgpu_drm.h
> @@ -54,6 +54,7 @@ extern "C" {
>   #define DRM_AMDGPU_VM   0x13
>   #define DRM_AMDGPU_FENCE_TO_HANDLE  0x14
>   #define DRM_AMDGPU_SCHED0x15
> +#define DRM_AMDGPU_USERQ 0x16
>
>   #define DRM_IOCTL_AMDGPU_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + 
> DRM_AMDGPU_GEM_CREATE, union drm_amdgpu_gem_create)
>   #define DRM_IOCTL_AMDGPU_GEM_MMAP   DRM_IOWR(DRM_COMMAND_BASE + 
> DRM_AMDGPU_GEM_MMAP, union drm_amdgpu_gem_mmap)
> @@ -71,6 +72,7 @@ extern "C" {
>   #define DRM_IOCTL_AMDGPU_VM DRM_IOWR(DRM_COMMAND_BASE + 
> DRM_AMDGPU_VM, union drm_amdgpu_vm)
>   #define DRM_IOCTL_AMDGPU_FENCE_TO_HANDLE DRM_IOWR(DRM_COMMAND_BASE + 
> DRM_AMDGPU_FENCE_TO_HANDLE, union drm_amdgpu_fence_to_handle)
>   #define DRM_IOCTL_AMDGPU_SCHED  DRM_IOW(DRM_COMMAND_BASE + 
> DRM_AMDGPU_SCHED, union drm_amdgpu_sched)
> +#define DRM_IOCTL_AMDGPU_USERQ   DRM_IOW(DRM_COMMAND_BASE + 
> DRM_AMDGPU_USERQ, union drm_amdgpu_userq)
>
>   /**
>* DOC: memory domains
> @@ -288,6 +290,56 @@ union drm_amdgpu_ctx {
>   union drm_amdgpu_ctx_out out;
>   };
>
> +/* user queue IOCTL */
> +#define AMDGPU_USERQ_OP_CREATE   1
> +#define AMDGPU_USERQ_OP_FREE 2
> +
> +#define AMDGPU_USERQ_MQD_FLAGS_SECURE(1 << 0)

What does "secure" mean here? I don't see this flag referenced anywhere in the 
rest of the patch series.

Regards,
   Felix


> +#define AMDGPU_USERQ_MQD_FLAGS_AQL   (1 << 1)
> +
> +struct drm_amdgpu_userq_mqd {
> + /** Flags: AMDGPU_USERQ_MQD_FLAGS_* */
> + __u32   flags;
> + /** IP type: AMDGPU_HW_IP_* */
> + __u32   ip_type;
> + /** GEM object handle */
> + __u32   doorbell_handle;
> + /** Doorbell offset in dwords */
> + __u32   doorbell_offset;
> + /** GPU virtual address of the queue */
> + __u64   queue_va;
> + /** Size of the queue in bytes */
> + __u64   queue_size;
> + /** GPU virtual address of the rptr */
> + __u64   rptr_va;
> + /** GPU virtual address of the wptr */
> + __u64   wptr_va;
> +};
> +
> +struct drm_amdgpu_userq_in {
> + /** AMDGPU_USERQ_OP_* */
> + __u32   op;
> + /** Flags */
> + __u32   flags;
> + /** Context handle to associate the queue with */
> + __u32   ctx_id;
> + __u32   pad;
> + /** Queue descriptor */
> + struct drm_amdgpu_userq_mqd mqd;
> +};
> +
> +struct drm_amdgpu_userq_out {
> + /** Queue handle */
> + __u32   q_id;
> + /** Flags */
> + __u32   flags;
> +};
> +
> +union drm_amdgpu_userq {
> + struct drm_amdgpu_userq_in in;
> + struct drm_amdgpu_userq_out out;
> +};
> +
>   /* vm ioctl */
>   #define AMDGPU_VM_OP_RESERVE_VMID   1
>   #define AMDGPU_VM_OP_UNRESERVE_VMID 2


RE: [PATCH] drm/amdgpu: remove evict_resource for sriov when suspend.

2022-12-05 Thread Liu, Shaoyun
[AMD Official Use Only - General]

I agree with Christian. On some hypervisors with live migration support there
is a specific API between the OS and the PF driver to handle the FB content
save/restore for the VF, and in that case a guest-side save/restore is not
necessary. On other hypervisors without live migration (like KVM for now), it
depends on the guest driver doing the save/restore itself in its suspend/resume
callbacks.

Regards
Shaoyun.liu



-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Monday, December 5, 2022 5:43 AM
To: Fan, Shikang ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: remove evict_resource for sriov when suspend.

Am 05.12.22 um 09:47 schrieb Shikang Fan:
> - There is no need to evict resources from VRAM to system RAM for SRIOV
>   because the GPU is still powered on. And eviction takes too much time,
>   which would cause a full access mode timeout in multi-vf mode.

Well big NAK to this!

The suspend is usually done to migrating the virtual machine to a different hw 
instance. Because of this the content of VRAM is usually destroyed.

Regards,
Christian.

>
> Signed-off-by: Shikang Fan 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 +---
>   1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 818fa72c670d..55fe425fbe6d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4135,9 +4135,11 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>   if (!adev->in_s0ix)
>   amdgpu_amdkfd_suspend(adev, adev->in_runpm);
>
> - r = amdgpu_device_evict_resources(adev);
> - if (r)
> - return r;
> + if (!amdgpu_sriov_vf(adev)) {
> + r = amdgpu_device_evict_resources(adev);
> + if (r)
> + return r;
> + }
>
>   amdgpu_fence_driver_hw_fini(adev);
>


RE: [PATCH] drm/amdgpu: Ignore stop rlc on SRIOV environment.

2022-11-09 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Reviewed-by: shaoyun liu 

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Wednesday, November 9, 2022 2:07 PM
To: Wan, Gavin 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Ignore stop rlc on SRIOV environment.

On Wed, Nov 9, 2022 at 1:24 PM Gavin Wan  wrote:
>
> For SRIOV, the guest driver should not stop the RLC. The host handles
> programming the RLC.
>
> On SRIOV, stopping the RLC will hang (RLC-related registers are blocked
> by policy) when the RLCG interface is not enabled.
>
> Signed-off-by: Gavin Wan 

Acked-by: Alex Deucher 

> Change-Id: Iac31332e2c958aae9506759de1d3a311b5c84942
> ---
> drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 4fe75dd2b329..0e9529b95d35 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1517,7 +1517,7 @@ static int smu_disable_dpms(struct smu_context *smu)
> }
>
> if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(9, 4, 2) &&
> -   adev->gfx.rlc.funcs->stop)
> +   !amdgpu_sriov_vf(adev) && adev->gfx.rlc.funcs->stop)
> adev->gfx.rlc.funcs->stop(adev);
>
> return ret;
> --
> 2.34.1
>


RE: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-26 Thread Liu, Shaoyun
[AMD Official Use Only - General]

SRIOV already has its own reset routine, amdgpu_device_reset_sriov; we try to
put the SRIOV-specific sequence inside this function. For the rest
(re-submitting etc.) we should try to have the same behavior as bare metal. Can
we just skip the re-submission for all kinds of reset, since the kernel already
signals the reset event to user level (at least for the compute stack)?

Regards
Shaoyun.liu

-Original Message-
From: Koenig, Christian 
Sent: Wednesday, October 26, 2022 1:27 PM
To: Liu, Shaoyun ; Tuikov, Luben ; 
Prosyak, Vitaly ; Deucher, Alexander 
; daniel.vet...@ffwll.ch; 
amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Subject: Re: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

The problem is that this re-submitting is currently an integral part of how 
SRIOV works.

The host can send a function level reset request to the clients when it sees
that some schedule switching didn't work as expected, and in this case (and
only this case) the hardware has actually never started to even work on the
IBs. So the re-submission is actually safe from this side.

But in general you are right, the sw side is just completely broken because we 
came up with a bunch of rather strict rules for the dma_fence implementation 
(and those rules are perfectly valid and necessary).

Regards,
Christian.

Am 26.10.22 um 18:10 schrieb Liu, Shaoyun:
> [AMD Official Use Only - General]
>
> The  user space  shouldn't care about  SRIOV or not ,  I don't think we need 
> to keep the re-submission for SRIOV as well.  The reset from SRIOV could 
> trigger the  host do a whole GPU reset which will have the same issue as bare 
> metal.
>
> Regards
> Shaoyun.liu
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Christian König
> Sent: Wednesday, October 26, 2022 11:36 AM
> To: Tuikov, Luben ; Prosyak, Vitaly
> ; Deucher, Alexander
> ; daniel.vet...@ffwll.ch;
> amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
> Cc: Koenig, Christian 
> Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal
> reset
>
> Re-submitting IBs by the kernel has many problems because pre- requisite 
> state is not automatically re-created as well. In other words neither binary 
> semaphores nor things like ring buffer pointers are in the state they should 
> be when the hardware starts to work on the IBs again.
>
> Additional to that even after more than 5 years of developing this feature it 
> is still not stable and we have massively problems getting the reference 
> counts right.
>
> As discussed with user space developers this behavior is not helpful in the 
> first place. For graphics and multimedia workloads it makes much more sense 
> to either completely re-create the context or at least re-submitting the IBs 
> from userspace.
>
> For compute use cases re-submitting is also not very helpful since userspace 
> must rely on the accuracy of the result.
>
> Because of this we stop this practice and instead just properly note that the 
> fence submission was canceled. The only use case we keep the re-submission 
> for now is SRIOV and function level resets.
>
> Signed-off-by: Christian König 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d4584e577b51..39e94feba1ac 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>  continue;
>
>  /* No point to resubmit jobs if we didn't HW reset*/
> -   if (!tmp_adev->asic_reset_res && !job_signaled)
> +   if (!tmp_adev->asic_reset_res && !job_signaled &&
> +   amdgpu_sriov_vf(tmp_adev))
>
> drm_sched_resubmit_jobs(>sched);
>
>  drm_sched_start(>sched,
> !tmp_adev->asic_reset_res);
> --
> 2.25.1
>



RE: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-26 Thread Liu, Shaoyun
[AMD Official Use Only - General]

User space shouldn't care whether it's SRIOV or not; I don't think we need to
keep the re-submission for SRIOV either. The reset from SRIOV could trigger the
host to do a whole-GPU reset, which will have the same issue as bare metal.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Wednesday, October 26, 2022 11:36 AM
To: Tuikov, Luben ; Prosyak, Vitaly 
; Deucher, Alexander ; 
daniel.vet...@ffwll.ch; amd-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org
Cc: Koenig, Christian 
Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

Re-submitting IBs by the kernel has many problems because pre- requisite state 
is not automatically re-created as well. In other words neither binary 
semaphores nor things like ring buffer pointers are in the state they should be 
when the hardware starts to work on the IBs again.

Additionally, even after more than 5 years of developing this feature it is
still not stable and we have massive problems getting the reference counts
right.

As discussed with user space developers this behavior is not helpful in the 
first place. For graphics and multimedia workloads it makes much more sense to 
either completely re-create the context or at least re-submitting the IBs from 
userspace.

For compute use cases re-submitting is also not very helpful since userspace 
must rely on the accuracy of the result.

Because of this we stop this practice and instead just properly note that the 
fence submission was canceled. The only use case we keep the re-submission for 
now is SRIOV and function level resets.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4584e577b51..39e94feba1ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			continue;
 
 		/* No point to resubmit jobs if we didn't HW reset*/
-		if (!tmp_adev->asic_reset_res && !job_signaled)
+		if (!tmp_adev->asic_reset_res && !job_signaled &&
+		    amdgpu_sriov_vf(tmp_adev))
 			drm_sched_resubmit_jobs(&ring->sched);
 
 		drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
--
2.25.1



Re: [PATCH] drm/amdgpu: Skip put_reset_domain if it doesnt exist

2022-09-28 Thread Liu, Shaoyun
Looks OK to me.
Reviewed-by: shaoyun.liu 



From: Chander, Vignesh 
Sent: September 28, 2022 3:03 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Liu, Shaoyun ; Chander, Vignesh 

Subject: [PATCH] drm/amdgpu: Skip put_reset_domain if it doesnt exist

For XGMI SRIOV, the reset is handled by the host driver and hive->reset_domain
is not initialized, so we need to check that it exists before doing a put.
Signed-off-by: Vignesh Chander 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index dc43fcb93eac..f5318fedf2f0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -113,7 +113,8 @@ static inline bool amdgpu_reset_get_reset_domain(struct 
amdgpu_reset_domain *dom

 static inline void amdgpu_reset_put_reset_domain(struct amdgpu_reset_domain 
*domain)
 {
-	kref_put(&domain->refcount, amdgpu_reset_destroy_reset_domain);
+	if (domain)
+		kref_put(&domain->refcount, amdgpu_reset_destroy_reset_domain);
 }

 static inline bool amdgpu_reset_domain_schedule(struct amdgpu_reset_domain 
*domain,
--
2.25.1



RE: [PATCH] drm/amdgpu: Skip put_reset_domain if it doesnt exist

2022-09-28 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Please add a description, e.g.: under the SRIOV XGMI configuration the hive
reset is handled by the host driver, so hive->reset_domain is not initialized
and needs to be skipped.

Regards
Shaoyun.liu


-Original Message-
From: Chander, Vignesh 
Sent: Wednesday, September 28, 2022 1:38 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun ; Chander, Vignesh 

Subject: [PATCH] drm/amdgpu: Skip put_reset_domain if it doesnt exist

Change-Id: Ifd6121fb94db3fadaa1dee61d35699abe1259409
Signed-off-by: Vignesh Chander 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 47159e9a0884..80fb6ef929e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -217,7 +217,8 @@ static void amdgpu_xgmi_hive_release(struct kobject *kobj)
struct amdgpu_hive_info *hive = container_of(
kobj, struct amdgpu_hive_info, kobj);

-   amdgpu_reset_put_reset_domain(hive->reset_domain);
+   if (hive->reset_domain)
+   amdgpu_reset_put_reset_domain(hive->reset_domain);
hive->reset_domain = NULL;

mutex_destroy(>hive_lock);
--
2.25.1



RE: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

2022-09-12 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Just curious: what is this gfx software ring used for? Who decides the
priority? Can the user request a higher priority, or is it predefined?

Thanks
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Andrey 
Grodzovsky
Sent: Monday, September 12, 2022 11:34 AM
To: Christian König ; Zhu, Jiadong 
; amd-gfx@lists.freedesktop.org
Cc: Huang, Ray 
Subject: Re: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

On 2022-09-12 09:27, Christian König wrote:

> Am 12.09.22 um 15:22 schrieb Andrey Grodzovsky:
>>
>> On 2022-09-12 06:20, Christian König wrote:
>>> Am 09.09.22 um 18:45 schrieb Andrey Grodzovsky:

 On 2022-09-08 21:50, jiadong@amd.com wrote:
> From: "Jiadong.Zhu" 
>
> The software ring is created to support priority context while
> there is only one hardware queue for gfx.
>
> Every software ring has its own fence driver and can be used as an
> ordinary ring for the gpu_scheduler.
> Multiple software rings are bound to a real ring with the ring
> muxer. The packets committed on the software ring are copied to
> the real ring.
>
> v2: use array to store software ring entry.
> v3: remove unnecessary prints.
>
> Signed-off-by: Jiadong.Zhu 
> ---
>   drivers/gpu/drm/amd/amdgpu/Makefile  |   3 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  |   3 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |   3 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c | 182
> +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.h |  67 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sw_ring.c  | 204
> +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sw_ring.h  |  48 +
>   7 files changed, 509 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
>   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.h
>   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_sw_ring.c
>   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_sw_ring.h
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> index 3e0e2eb7e235..85224bc81ce5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> @@ -58,7 +58,8 @@ amdgpu-y += amdgpu_device.o amdgpu_kms.o \
>   amdgpu_vm_sdma.o amdgpu_discovery.o amdgpu_ras_eeprom.o
> amdgpu_nbio.o \
>   amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o
> amdgpu_rap.o \
>   amdgpu_fw_attestation.o amdgpu_securedisplay.o \
> -amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o
> +amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
> +amdgpu_sw_ring.o amdgpu_ring_mux.o
> amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index 53526ffb2ce1..0de8e3cd0f1c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -33,6 +33,7 @@
>   #include "amdgpu_imu.h"
>   #include "soc15.h"
>   #include "amdgpu_ras.h"
> +#include "amdgpu_ring_mux.h"
> /* GFX current status */
>   #define AMDGPU_GFX_NORMAL_MODE 0x00000000L
> @@ -346,6 +347,8 @@ struct amdgpu_gfx {
>   struct amdgpu_gfx_ras *ras;
> bool is_poweron;
> +
> + struct amdgpu_ring_mux muxer;
>   };
> #define amdgpu_gfx_get_gpu_clock_counter(adev)
> (adev)->gfx.funcs->get_gpu_clock_counter((adev))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 7d89a52091c0..fe33a683bfba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -278,6 +278,9 @@ struct amdgpu_ring {
>   bool is_mes_queue;
>   uint32_t hw_queue_id;
>   struct amdgpu_mes_ctx_data *mes_ctx;
> +
> + bool is_sw_ring;
> +
>   };
> #define amdgpu_ring_parse_cs(r, p, job, ib)
> ((r)->funcs->parse_cs((p), (job), (ib))) diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
> new file mode 100644
> index ..ea4a3c66119a
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
> @@ -0,0 +1,182 @@
> +/*
> + * Copyright 2022 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person
> obtaining a
> + * copy of this software and associated documentation files (the
> "Software"),
> + * to deal in the Software without restriction, including without
> limitation
> + * the 

RE: [PATCH] drm/amdgpu: Fix hive reference count leak

2022-09-09 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Looks good to me .

-Original Message-
From: Chander, Vignesh 
Sent: Friday, September 9, 2022 12:52 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun ; Chander, Vignesh 

Subject: [PATCH] drm/amdgpu: Fix hive reference count leak

both get_xgmi_hive and put_xgmi_hive can be skipped since the reset domain is 
not necessary for VF

Signed-off-by: Vignesh Chander 
Reviewed-by: Shaoyun Liu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e21804362995..943c9e750575 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2451,9 +2451,9 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 */
if (adev->gmc.xgmi.num_physical_nodes > 1) {
if (amdgpu_xgmi_add_device(adev) == 0) {
-   struct amdgpu_hive_info *hive = 
amdgpu_get_xgmi_hive(adev);
-
if (!amdgpu_sriov_vf(adev)) {
+   struct amdgpu_hive_info *hive = 
amdgpu_get_xgmi_hive(adev);
+
if (!hive->reset_domain ||

!amdgpu_reset_get_reset_domain(hive->reset_domain)) {
r = -ENOENT;
--
2.25.1



RE: [PATCH] drm/amdgpu: Use per device reset_domain for XGMI on sriov configuration

2022-09-07 Thread Liu, Shaoyun
[AMD Official Use Only - General]

ping

-Original Message-
From: Liu, Shaoyun 
Sent: Wednesday, September 7, 2022 11:38 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu: Use per device reset_domain for XGMI on sriov 
configuration

For the SRIOV configuration, the host driver controls the reset method (either
FLR or the heavier chain reset). The host will notify the guest individually
with an FLR message if an individual GPU within the hive needs to be reset. So
on the guest side there is no need to use hive->reset_domain to replace the
original per-device reset_domain (see the condensed sketch below).
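
As an editorial aside, here is a condensed sketch of the policy this patch
implements (error paths and the hive reference counting are omitted; the hunks
below are the authoritative change):

	if (!amdgpu_sriov_vf(adev)) {
		/* Bare metal: every device in the XGMI hive shares the
		 * hive-wide reset domain, so recovery serializes across
		 * the whole hive. */
		amdgpu_reset_get_reset_domain(hive->reset_domain);
		amdgpu_reset_put_reset_domain(adev->reset_domain);
		adev->reset_domain = hive->reset_domain;
	}
	/* SRIOV VF: keep the per-device reset domain created at early
	 * init; the host FLRs each VF individually, so no hive-wide
	 * serialization is needed on the guest side. */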

Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 20 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 +-
 2 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62b26f0e37b0..a5533e0d9d6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2453,17 +2453,19 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
if (amdgpu_xgmi_add_device(adev) == 0) {
struct amdgpu_hive_info *hive = 
amdgpu_get_xgmi_hive(adev);

-   if (!hive->reset_domain ||
-   !amdgpu_reset_get_reset_domain(hive->reset_domain)) 
{
-   r = -ENOENT;
+   if(!amdgpu_sriov_vf(adev)) {
+   if (!hive->reset_domain ||
+   
!amdgpu_reset_get_reset_domain(hive->reset_domain)) {
+   r = -ENOENT;
+   amdgpu_put_xgmi_hive(hive);
+   goto init_failed;
+   }
+
+   /* Drop the early temporary reset domain we 
created for device */
+   
amdgpu_reset_put_reset_domain(adev->reset_domain);
+   adev->reset_domain = hive->reset_domain;
amdgpu_put_xgmi_hive(hive);
-   goto init_failed;
}
-
-   /* Drop the early temporary reset domain we created for 
device */
-   amdgpu_reset_put_reset_domain(adev->reset_domain);
-   adev->reset_domain = hive->reset_domain;
-   amdgpu_put_xgmi_hive(hive);
}
}

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index d3b483aa81f8..a78b589e4f4f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -391,24 +391,32 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
amdgpu_device *adev)
goto pro_end;
}

+   /**
+* Only init hive->reset_domain for non-SRIOV configurations. For SRIOV,
+* the host driver decides how to reset the GPU, either through FLR or chain reset.
+* The guest side will get individual notifications from the host for the FLR
+* if necessary.
+*/
+   if (!amdgpu_sriov_vf(adev)) {
/**
 * Avoid recreating reset domain when hive is reconstructed for the case
-* of reset the devices in the XGMI hive during probe for SRIOV
-* of reset the devices in the XGMI hive during probe for SRIOV
+* of reset the devices in the XGMI hive during probe for passthrough GPU
 */
-   if (adev->reset_domain->type != XGMI_HIVE) {
-   hive->reset_domain = 
amdgpu_reset_create_reset_domain(XGMI_HIVE, "amdgpu-reset-hive");
-   if (!hive->reset_domain) {
-   dev_err(adev->dev, "XGMI: failed initializing 
reset domain for xgmi hive\n");
-   ret = -ENOMEM;
-   kobject_put(&hive->kobj);
-   kfree(hive);
-   hive = NULL;
-   goto pro_end;
-   }
-   } else {
-   amdgpu_reset_get_reset_domain(adev->reset_domain);
-   hive->reset_domain = adev->reset_domain;
+   if (adev->reset_domain->type != XGMI_HIVE) {
+   hive->reset_domain = 
amdgpu_reset_create_reset_domain(XGMI_HIVE, "amdgpu-reset-hive");
+   if (!hive->reset_domain) {
+   dev_err(adev->dev, "XGMI: failed 
initializing reset domain for xgmi hive\n");
+   ret = -ENOMEM;
+   kobject_put(&hive->kobj);
+

RE: [PATCH] drm/amdgpu: skip set_topology_info for VF

2022-08-19 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Looks good to me .
Reviewed-By : shaoyun.liu 

-Original Message-
From: Chander, Vignesh 
Sent: Thursday, August 18, 2022 1:38 PM
To: amd-gfx@lists.freedesktop.org
Cc: Kim, Jonathan ; Liu, Shaoyun ; 
Chander, Vignesh 
Subject: [PATCH] drm/amdgpu: skip set_topology_info for VF

Skip set_topology_info as xgmi TA will now block it and host needs to program 
it.

Signed-off-by: Vignesh Chander 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 1b108d03e785..1a2b4c4b745c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -504,6 +504,9 @@ int amdgpu_xgmi_update_topology(struct amdgpu_hive_info 
*hive, struct amdgpu_dev  {
int ret;

+   if (amdgpu_sriov_vf(adev))
+   return 0;
+
/* Each psp need to set the latest topology */
	ret = psp_xgmi_set_topology_info(&adev->psp,
 atomic_read(&hive->number_devices),
--
2.25.1



RE: [Patch V3] drm/amdgpu: Increase tlb flush timeout for sriov

2022-08-11 Thread Liu, Shaoyun
[AMD Official Use Only - General]

From the HW point of view, the maximum VF number can reach 16 instead of 12.
Although currently no product will use all 16 VFs together, I'm not sure about
the future.
You can add my Acked-by. I will let Alex & Christian decide whether to accept
this change.

Regards
Shaoyun.liu



-Original Message-
From: amd-gfx  On Behalf Of Dusica 
Milinkovic
Sent: Thursday, August 11, 2022 6:01 AM
To: amd-gfx@lists.freedesktop.org
Cc: Milinkovic, Dusica 
Subject: [Patch V3] drm/amdgpu: Increase tlb flush timeout for sriov

[Why]
While multiple VFs execute a benchmark (Luxmark), a KIQ error timeout is
observed. It happens because all of the VFs do the TLB invalidation at the same
time. Although each VF has its own invalidate register set, on the hardware
side the invalidate requests are queued for execution.

[How]
In the case of 12 VFs, increase the timeout to 12 * 100ms (1,200,000 usecs).

Signed-off-by: Dusica Milinkovic 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ++-  
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 5a639c857bd0..79bb6fd83094 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -320,7 +320,7 @@ enum amdgpu_kiq_irq {
AMDGPU_CP_KIQ_IRQ_DRIVER0 = 0,
AMDGPU_CP_KIQ_IRQ_LAST
 };
-
+#define SRIOV_USEC_TIMEOUT  1200000 /* wait 12 * 100ms for SRIOV */
 #define MAX_KIQ_REG_WAIT   5000 /* in usecs, 5ms */
 #define MAX_KIQ_REG_BAILOUT_INTERVAL   5 /* in msecs, 5ms */
 #define MAX_KIQ_REG_TRY 1000
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 9ae8cdaa033e..f513e2c2e964 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -419,6 +419,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
uint32_t seq;
uint16_t queried_pasid;
bool ret;
+   u32 usec_timeout = amdgpu_sriov_vf(adev) ? SRIOV_USEC_TIMEOUT : adev->usec_timeout;
struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
struct amdgpu_kiq *kiq = &adev->gfx.kiq;

@@ -437,7 +438,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,

amdgpu_ring_commit(ring);
spin_unlock(&adev->gfx.kiq.ring_lock);
-   r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
+   r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);
if (r < 1) {
dev_err(adev->dev, "wait for kiq fence error: %ld.\n", 
r);
return -ETIME;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index ab89d91975ab..4603653916f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -896,6 +896,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
uint32_t seq;
uint16_t queried_pasid;
bool ret;
+   u32 usec_timeout = amdgpu_sriov_vf(adev) ? SRIOV_USEC_TIMEOUT : adev->usec_timeout;
struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
struct amdgpu_kiq *kiq = &adev->gfx.kiq;

@@ -935,7 +936,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,

amdgpu_ring_commit(ring);
spin_unlock(&adev->gfx.kiq.ring_lock);
-   r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
+   r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);
if (r < 1) {
dev_err(adev->dev, "wait for kiq fence error: %ld.\n", 
r);
up_read(&adev->reset_domain->sem);
--
2.25.1



RE: [PATCH] drm/amdgpu: use sjt mec fw on aldebaran for sriov

2022-08-10 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Thanks for the review. Yes, on the host driver side the PF already loads the
SJT MEC FW, and the PSP policy requires that the MEC version loaded on the
guest side not be lower than the version already loaded on the host side. So
this guarantees that only a VF with the SJT version can be initialized and
enabled.

Regards
Shaoyun.liu

-Original Message-
From: Alex Deucher 
Sent: Wednesday, August 10, 2022 12:35 PM
To: Liu, Shaoyun 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: use sjt mec fw on aldebaran for sriov

On Fri, Aug 5, 2022 at 12:11 PM shaoyunl  wrote:
>
> The second jump table is required for live migration or multiple-VF
> configurations on Aldebaran. With this implemented, the first-level
> jump table (used by HW) will be the same; the MEC FW internally uses
> the second-level jump table to jump to the real functionality
> implementation, so different VFs can load different versions of the
> MEC FW as long as they support the SJT.

You might want some sort of mechanism to determine if the sjt firmware was 
loaded so you know whether live migration is possible, although I guess it's 
probably only used in controlled environments so it would be a known 
prerequisite.

Acked-by: Alex Deucher 

Alex

>
> Signed-off-by: shaoyunl 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index c6e0f9313a7f..7f187558220e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -126,6 +126,8 @@ MODULE_FIRMWARE("amdgpu/green_sardine_rlc.bin");
>  MODULE_FIRMWARE("amdgpu/aldebaran_mec.bin");
>  MODULE_FIRMWARE("amdgpu/aldebaran_mec2.bin");
>  MODULE_FIRMWARE("amdgpu/aldebaran_rlc.bin");
> +MODULE_FIRMWARE("amdgpu/aldebaran_sjt_mec.bin");
> +MODULE_FIRMWARE("amdgpu/aldebaran_sjt_mec2.bin");
>
>  #define mmTCP_CHAN_STEER_0_ARCT  
>   0x0b03
>  #define mmTCP_CHAN_STEER_0_ARCT_BASE_IDX 
>   0
> @@ -1496,7 +1498,11 @@ static int gfx_v9_0_init_cp_compute_microcode(struct 
> amdgpu_device *adev,
> const struct common_firmware_header *header = NULL;
> const struct gfx_firmware_header_v1_0 *cp_hdr;
>
> -   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_mec.bin", chip_name);
> +   if (amdgpu_sriov_vf(adev) && (adev->asic_type == CHIP_ALDEBARAN))
> +   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_sjt_mec.bin", 
> chip_name);
> +   else
> +   snprintf(fw_name, sizeof(fw_name),
> + "amdgpu/%s_mec.bin", chip_name);
> +
> err = request_firmware(>gfx.mec_fw, fw_name, adev->dev);
> if (err)
> goto out;
> @@ -1509,7 +1515,11 @@ static int
> gfx_v9_0_init_cp_compute_microcode(struct amdgpu_device *adev,
>
>
> if (gfx_v9_0_load_mec2_fw_bin_support(adev)) {
> -   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_mec2.bin", 
> chip_name);
> +   if (amdgpu_sriov_vf(adev) && (adev->asic_type == 
> CHIP_ALDEBARAN))
> +   snprintf(fw_name, sizeof(fw_name), 
> "amdgpu/%s_sjt_mec2.bin", chip_name);
> +   else
> +   snprintf(fw_name, sizeof(fw_name),
> + "amdgpu/%s_mec2.bin", chip_name);
> +
> err = request_firmware(>gfx.mec2_fw, fw_name, 
> adev->dev);
> if (!err) {
> err =
> amdgpu_ucode_validate(adev->gfx.mec2_fw);
> --
> 2.17.1
>


Re: [PATCH] Increase tlb flush timeout for sriov

2022-08-08 Thread Liu, Shaoyun
As I discussed with Alice, this change is for when multiple VFs run a compute
benchmark (Luxmark) at the same time, which involves multiple VFs doing the TLB
invalidation at the same time. They observed a KIQ timeout after submitting the
TLB invalidate command. Although each VF has its own invalidate register set,
in HW the invalidate requests are queued for execution.

Alice, as we discussed, we can use a maximum of 12 * 100ms for the timeout; it
shouldn't be 6000ms. Did you see issues with the 1200 ms timeout?
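
As an editorial aside, the agreed bound works out to 12 VFs * 100 ms = 1200 ms
= 1,200,000 usecs, which is what the V3 patch earlier in this archive encodes.
A minimal sketch of the selection, condensed from that patch:

	#define SRIOV_USEC_TIMEOUT	1200000	/* 12 * 100 ms, in usecs */

	/* Pick the longer bound only when running as an SRIOV VF. */
	u32 usec_timeout = amdgpu_sriov_vf(adev) ? SRIOV_USEC_TIMEOUT
						 : adev->usec_timeout;
	r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);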

Regards
Shaoyun.liu

From: amd-gfx  on behalf of Alex Deucher 

Sent: August 8, 2022 4:49 PM
To: Milinkovic, Dusica 
Cc: amd-gfx@lists.freedesktop.org 
Subject: Re: [PATCH] Increase tlb flush timeout for sriov

On Wed, Aug 3, 2022 at 5:02 AM Dusica Milinkovic
 wrote:
>

Please include a patch description.  Why do you need a longer timeout?
 What problem does it fix?

> Signed-off-by: Dusica Milinkovic 
> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 +-
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index 9ae8cdaa033e..6ab7d329916f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -419,6 +419,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
> uint32_t seq;
> uint16_t queried_pasid;
> bool ret;
> +   uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for SRIOV */
> struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
> struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>
> @@ -437,7 +438,10 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>
> amdgpu_ring_commit(ring);
> spin_unlock(&adev->gfx.kiq.ring_lock);
> -   r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> +   if (amdgpu_sriov_vf(adev))
> +   r = amdgpu_fence_wait_polling(ring, seq, 
> sriov_usec_timeout);
> +   else
> +   r = amdgpu_fence_wait_polling(ring, seq, 
> adev->usec_timeout);

What about something like this?
u32 usec_timeout = amdgpu_sriov_vf(adev) ? 6000000 : adev->usec_timeout;  /* wait for 12 * 500ms for SRIOV */
...
r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);


> if (r < 1) {
> dev_err(adev->dev, "wait for kiq fence error: 
> %ld.\n", r);
> return -ETIME;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 22761a3bb818..941a6b52fa72 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -896,6 +896,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
> uint32_t seq;
> uint16_t queried_pasid;
> bool ret;
> +   uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for SRIOV */
> struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
> struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>
> @@ -935,7 +936,10 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>
> amdgpu_ring_commit(ring);
> spin_unlock(&adev->gfx.kiq.ring_lock);
> -   r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> +   if (amdgpu_sriov_vf(adev))
> +   r = amdgpu_fence_wait_polling(ring, seq, 
> sriov_usec_timeout);
> +   else
> +   r = amdgpu_fence_wait_polling(ring, seq, 
> adev->usec_timeout);

Same comment here.

Alex

> if (r < 1) {
> dev_err(adev->dev, "wait for kiq fence error: 
> %ld.\n", r);
> > up_read(&adev->reset_domain->sem);
> --
> 2.25.1
>


RE: [PATCH] drm/amdgpu: fix hive reference leak when reflecting psp topology info

2022-07-28 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Looks good to me .
BTW, why didn't we catch it in bare-metal mode?

Reviewed-by: Shaoyun.liu 

-Original Message-
From: amd-gfx  On Behalf Of Jonathan Kim
Sent: Thursday, July 28, 2022 1:06 PM
To: amd-gfx@lists.freedesktop.org
Cc: Kim, Jonathan 
Subject: [PATCH] drm/amdgpu: fix hive reference leak when reflecting psp 
topology info

Hives that require PSP topology info to be reflected will leak a hive
reference, so fix it.

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 3ee363bfbac2..6c23e89366bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1292,6 +1292,8 @@ static void psp_xgmi_reflect_topology_info(struct 
psp_context *psp,

break;
}
+
+   amdgpu_put_xgmi_hive(hive);
 }

 int psp_xgmi_get_topology_info(struct psp_context *psp,
--
2.25.1



RE: [PATCH] drm/amdgpu: Ta fw needs to be loaded for SRIOV aldebaran

2022-04-22 Thread Liu, Shaoyun
[AMD Official Use Only]

Looks ok to me . 
You can  add  reviewed-by: Shaoyun.liu 

-Original Message-
From: amd-gfx  On Behalf Of David Yu
Sent: Friday, April 22, 2022 12:09 PM
To: amd-gfx@lists.freedesktop.org
Cc: Yu, David 
Subject: [PATCH] drm/amdgpu: Ta fw needs to be loaded for SRIOV aldebaran

Load the TA FW during psp_init_sriov_microcode to enable XGMI. It is required
to be loaded by both guest and host starting from Arcturus. The CAP FW needs to
be loaded first.
Fix the previous patch, which was pushed by mistake.

Signed-off-by: David Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 895251f42853..0bd22ebcc3d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -275,8 +275,8 @@ static int psp_init_sriov_microcode(struct psp_context *psp)
ret = psp_init_cap_microcode(psp, "sienna_cichlid");
break;
case IP_VERSION(13, 0, 2):
-   ret = psp_init_ta_microcode(psp, "aldebaran");
-   ret &= psp_init_cap_microcode(psp, "aldebaran");
+   ret = psp_init_cap_microcode(psp, "aldebaran");
+   ret &= psp_init_ta_microcode(psp, "aldebaran");
break;
default:
BUG();
-- 
2.25.1


RE: [PATCH] drm/amdgpu: Ta fw needs to be loaded for SRIOV aldebaran

2022-04-22 Thread Liu, Shaoyun
[AMD Official Use Only]

Please add some more info in the description to explain why we need to add the
TA in the SRIOV guest.

Regard
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of David Yu
Sent: Friday, April 22, 2022 10:58 AM
To: amd-gfx@lists.freedesktop.org
Cc: Yu, David 
Subject: [PATCH] drm/amdgpu: Ta fw needs to be loaded for SRIOV aldebaran

Load ta fw during psp_init_sriov_microcode to enable XGMI

Signed-off-by: David Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index f6527aa19238..895251f42853 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -275,7 +275,8 @@ static int psp_init_sriov_microcode(struct psp_context *psp)
ret = psp_init_cap_microcode(psp, "sienna_cichlid");
break;
case IP_VERSION(13, 0, 2):
-   ret = psp_init_cap_microcode(psp, "aldebaran");
+   ret = psp_init_ta_microcode(psp, "aldebaran");
+   ret &= psp_init_cap_microcode(psp, "aldebaran");
break;
default:
BUG();
-- 
2.25.1


RE: [PATCH] drm/amdgpu: fix aldebaran xgmi topology for vf

2022-03-09 Thread Liu, Shaoyun
[Public]

Yes, we need the correct setting in both bare-metal and SRIOV to populate the
correct XGMI link info. Moving the setting to a common place that is not
affected by SRIOV makes sense to me.

The change is reviewed by : Shaoyun.liu 


-Original Message-
From: Kim, Jonathan  
Sent: Wednesday, March 9, 2022 6:31 PM
To: Kuehling, Felix ; amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: RE: [PATCH] drm/amdgpu: fix aldebaran xgmi topology for vf

[Public]

> -Original Message-
> From: Kuehling, Felix 
> Sent: March 9, 2022 6:12 PM
> To: Kim, Jonathan ; 
> amd-gfx@lists.freedesktop.org
> Cc: Liu, Shaoyun 
> Subject: Re: [PATCH] drm/amdgpu: fix aldebaran xgmi topology for vf
>
> On 2022-03-09 17:16, Jonathan Kim wrote:
> > VFs must also distinguish whether or not the TA supports full duplex 
> > or half duplex link records in order to report the correct xGMI topology.
> >
> > Signed-off-by: Jonathan Kim 
> I think I'm missing something here. Your condition for setting 
> supports_extended_data is exactly the same, but you're initializing it 
> in a different function. Can you explain how that change relates to SRIOV?

I probably should have included more context when sending this out.
The proposed support assignment happens after this:

if (amdgpu_sriov_vf(adev))
ret = psp_init_sriov_microcode(psp);
else
ret = psp_init_microcode(psp);
if (ret) {
DRM_ERROR("Failed to load psp firmware!\n");
return ret;
}

and psp_init_sriov_microcode doesn't set the secure OS microcode info (this is
where the support assignment currently is).

Thanks,

Jon

>
> Thanks,
>Felix
>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 6 --
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index 3ce1d38a7822..a6acec1a6155 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -310,6 +310,10 @@ static int psp_sw_init(void *handle)
> > return ret;
> > }
> >
> > +   adev->psp.xgmi_context.supports_extended_data =
> > +   !adev->gmc.xgmi.connected_to_cpu &&
> > +   adev->ip_versions[MP0_HWIP][0] == IP_VERSION(13,
> 0, 2);
> > +
> > memset(&boot_cfg_entry, 0, sizeof(boot_cfg_entry));
> > if (psp_get_runtime_db_entry(adev,
> > PSP_RUNTIME_ENTRY_TYPE_BOOT_CONFIG,
> > @@ -3008,7 +3012,6 @@ static int psp_init_sos_base_fw(struct
> amdgpu_device *adev)
> > adev->psp.sos.size_bytes = le32_to_cpu(sos_hdr->sos.size_bytes);
> > adev->psp.sos.start_addr = ucode_array_start_addr +
> > le32_to_cpu(sos_hdr->sos.offset_bytes);
> > -   adev->psp.xgmi_context.supports_extended_data = false;
> > } else {
> > /* Load alternate PSP SOS FW */
> > sos_hdr_v1_3 = (const struct psp_firmware_header_v1_3  
> >*)adev->psp.sos_fw->data; @@ -3023,7 +3026,6 @@ static int
> psp_init_sos_base_fw(struct amdgpu_device *adev)
> > adev->psp.sos.size_bytes = le32_to_cpu(sos_hdr_v1_3->sos_aux.size_bytes);
> > adev->psp.sos.start_addr = ucode_array_start_addr +
> > le32_to_cpu(sos_hdr_v1_3->sos_aux.offset_bytes);
> > -   adev->psp.xgmi_context.supports_extended_data = true;
> > }
> >
> > if ((adev->psp.sys.size_bytes == 0) || (adev->psp.sos.size_bytes 
> > ==
> > 0)) {


RE: [PATCH] drm/amdgpu: Add DFC CAP support for aldebaran

2022-03-03 Thread Liu, Shaoyun
[AMD Official Use Only]

Reviewed by : Shaoyun.liu 

-Original Message-
From: amd-gfx  On Behalf Of David Yu
Sent: Thursday, March 3, 2022 11:25 AM
To: amd-gfx@lists.freedesktop.org
Cc: Yu, David 
Subject: [PATCH] drm/amdgpu: Add DFC CAP support for aldebaran

Add DFC CAP support for aldebaran

Initialize cap microcode in psp_init_sriov_microcode,  the ta microcode will be 
 initialized in psp_vxx_init_microcode

Signed-off-by: David Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/psp_v13_0.c  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 94bfe502b55e..3ce1d38a7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -277,7 +277,7 @@ static int psp_init_sriov_microcode(struct psp_context *psp)
ret = psp_init_cap_microcode(psp, "sienna_cichlid");
break;
case IP_VERSION(13, 0, 2):
-   ret = psp_init_ta_microcode(psp, "aldebaran");
+   ret = psp_init_cap_microcode(psp, "aldebaran");
break;
default:
BUG();
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c 
b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
index 2c6070b90dcf..024f60631faf 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
@@ -31,6 +31,7 @@
 
 MODULE_FIRMWARE("amdgpu/aldebaran_sos.bin");
 MODULE_FIRMWARE("amdgpu/aldebaran_ta.bin");
+MODULE_FIRMWARE("amdgpu/aldebaran_cap.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_asd.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_toc.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_ta.bin");
--
2.25.1


RE: [PATCH] drm/amdgpu: Add DFC CAP support for aldebaran

2022-03-03 Thread Liu, Shaoyun
[AMD Official Use Only]

Probably just described as follows : 
Initialize cap microcode in psp_init_sriov_microcode,  the ta microcode will be 
 initialized in psp_vxx_init_microcode


-Original Message-
From: amd-gfx  On Behalf Of David Yu
Sent: Thursday, March 3, 2022 9:10 AM
To: amd-gfx@lists.freedesktop.org
Cc: Yu, David 
Subject: [PATCH] drm/amdgpu: Add DFC CAP support for aldebaran

Add DFC CAP support for aldebaran

Changed incorrect call to psp_init_ta_microcode in psp_init_sriov_microcode to 
psp_init_cap_microcode which caused it to fail even with correct CAP firmware.

Signed-off-by: David Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/psp_v13_0.c  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 94bfe502b55e..3ce1d38a7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -277,7 +277,7 @@ static int psp_init_sriov_microcode(struct psp_context *psp)
ret = psp_init_cap_microcode(psp, "sienna_cichlid");
break;
case IP_VERSION(13, 0, 2):
-   ret = psp_init_ta_microcode(psp, "aldebaran");
+   ret = psp_init_cap_microcode(psp, "aldebaran");
break;
default:
BUG();
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c 
b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
index 2c6070b90dcf..024f60631faf 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
@@ -31,6 +31,7 @@
 
 MODULE_FIRMWARE("amdgpu/aldebaran_sos.bin");
 MODULE_FIRMWARE("amdgpu/aldebaran_ta.bin");
+MODULE_FIRMWARE("amdgpu/aldebaran_cap.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_asd.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_toc.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_ta.bin");
--
2.25.1


RE: [PATCH] drm/amdgpu: Add DFC CAP support for aldebaran

2022-03-02 Thread Liu, Shaoyun
[AMD Official Use Only]

Can you add more information in the description, like why we should not load
the TA for Aldebaran here?

Regards
Shaoyun.liu


-Original Message-
From: amd-gfx  On Behalf Of David Yu
Sent: Wednesday, March 2, 2022 10:20 PM
To: amd-gfx@lists.freedesktop.org
Cc: Yu, David 
Subject: [PATCH] drm/amdgpu: Add DFC CAP support for aldebaran

Add DFC CAP support for aldebaran

Signed-off-by: David Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/psp_v13_0.c  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 94bfe502b55e..3ce1d38a7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -277,7 +277,7 @@ static int psp_init_sriov_microcode(struct psp_context *psp)
ret = psp_init_cap_microcode(psp, "sienna_cichlid");
break;
case IP_VERSION(13, 0, 2):
-   ret = psp_init_ta_microcode(psp, "aldebaran");
+   ret = psp_init_cap_microcode(psp, "aldebaran");
break;
default:
BUG();
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c 
b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
index 2c6070b90dcf..024f60631faf 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
@@ -31,6 +31,7 @@
 
 MODULE_FIRMWARE("amdgpu/aldebaran_sos.bin");
 MODULE_FIRMWARE("amdgpu/aldebaran_ta.bin");
+MODULE_FIRMWARE("amdgpu/aldebaran_cap.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_asd.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_toc.bin");
 MODULE_FIRMWARE("amdgpu/yellow_carp_ta.bin");
--
2.25.1


RE: [PATCH] drm/amdgpu: Fix wait for RLCG command completion

2022-02-15 Thread Liu, Shaoyun
[AMD Official Use Only]

So the driver will use the flag on write but use the new check on read.

Looks good to me . Reviewed by : Shaoyun.liu


-Original Message-
From: Skvortsov, Victor  
Sent: Tuesday, February 15, 2022 10:54 AM
To: Zhang, Bokun ; amd-gfx@lists.freedesktop.org; Liu, 
Shaoyun 
Subject: RE: [PATCH] drm/amdgpu: Fix wait for RLCG command completion

[AMD Official Use Only]

+Shaoyun

-Original Message-
From: Zhang, Bokun  
Sent: Monday, February 14, 2022 4:09 PM
To: Skvortsov, Victor ; amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: RE: [PATCH] drm/amdgpu: Fix wait for RLCG command completion

[AMD Official Use Only]

Tested-by: Bokun Zhang 

The test configuration is 8VF with 100 loops of VM reboot.

-Original Message-
From: amd-gfx  On Behalf Of Victor 
Skvortsov
Sent: Thursday, February 3, 2022 4:25 PM
To: amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: [PATCH] drm/amdgpu: Fix wait for RLCG command completion

if (!(tmp & flag)) condition will always evaluate to true when the flag is 0x0 
(AMDGPU_RLCG_GC_WRITE). Instead check that address bits are cleared to 
determine whether the command is complete.
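
To make the failure mode concrete, here is a small standalone illustration
(editorial, not driver code) of why masking with a zero flag can never signal
completion:

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		const uint32_t flag = 0x0;       /* AMDGPU_RLCG_GC_WRITE */
		const uint32_t tmp = 0xdeadbeef; /* any scratch_reg1 readback */

		/* Old test: with flag == 0, (tmp & flag) is 0 for every tmp,
		 * so !(tmp & flag) is always true and the poll loop exits on
		 * the first iteration whether or not the command is done. */
		assert(!(tmp & flag));
		return 0;
	}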

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index e1288901beb6..a8babe3bccb8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -902,7 +902,7 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device 
*adev, u32 offset, u32 v
 
for (i = 0; i < timeout; i++) {
tmp = readl(scratch_reg1);
-   if (!(tmp & flag))
+   if (!(tmp & AMDGPU_RLCG_SCRATCH1_ADDRESS_MASK))
break;
udelay(10);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 40803aab136f..68f592f0e992 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -43,6 +43,8 @@
 #define AMDGPU_RLCG_WRONG_OPERATION_TYPE   0x2000000
 #define AMDGPU_RLCG_REG_NOT_IN_RANGE   0x1000000
 
+#define AMDGPU_RLCG_SCRATCH1_ADDRESS_MASK  0xFFFFF
+
 /* all asic after AI use this offset */
 #define mmRCC_IOV_FUNC_IDENTIFIER 0xDE5
 /* tonga/fiji use this offset */
--
2.25.1


RE: [RFC v4 04/11] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2022-02-08 Thread Liu, Shaoyun
[AMD Official Use Only]

This patch is reviewed by  Shaoyun.liu 

Since the other patches were suggested by other engineers and they may already
have done some review on them, I will leave them to continue reviewing the rest
of the patches.

Regards
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey  
Sent: Tuesday, February 8, 2022 7:23 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian ; dan...@ffwll.ch; Liu, Monk 
; Chen, Horace ; Lazar, Lijo 
; Chen, JingWen ; Grodzovsky, Andrey 
; Liu, Shaoyun 
Subject: [RFC v4 04/11] drm/amd/virt: For SRIOV send GPU reset directly to TDR 
queue.

No need to trigger another work queue inside the work queue.

v3:

Problem:
Extra reset caused by a host-side FLR notification following a guest-side
triggered reset.
Fix: Prevent queuing flr_work from the mailbox IRQ if the guest is already
executing a reset.

Suggested-by: Liu Shaoyun 
Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 9 ++---  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 9 ++---  
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 9 ++---
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 56da5ab82987..5869d51d8bee 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -282,7 +282,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) ||
adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
-   amdgpu_device_gpu_recover(adev, NULL);
+   amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev, @@ -307,8 
+307,11 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
 
switch (event) {
case IDH_FLR_NOTIFICATION:
-   if (amdgpu_sriov_runtime(adev))
-   schedule_work(>virt.flr_work);
+   if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+   WARN_ONCE(!queue_work(adev->reset_domain.wq,
+ >virt.flr_work),
+ "Failed to queue work! at %s",
+ __func__);
break;
case IDH_QUERY_ALIVE:
xgpu_ai_mailbox_send_ack(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 477d0dde19c5..5728a6401d73 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -309,7 +309,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
-   amdgpu_device_gpu_recover(adev, NULL);
+   amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev, @@ -337,8 
+337,11 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device *adev,
 
switch (event) {
case IDH_FLR_NOTIFICATION:
-   if (amdgpu_sriov_runtime(adev))
-   schedule_work(>virt.flr_work);
+   if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+   WARN_ONCE(!queue_work(adev->reset_domain.wq,
+ >virt.flr_work),
+ "Failed to queue work! at %s",
+ __func__);
break;
/* READY_TO_ACCESS_GPU is fetched by kernel polling, IRQ can 
ignore
 * it byfar since that polling thread will handle it, diff 
--git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index aef9d059ae52..02290febfcf4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -521,7 +521,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct 
*work)
 
/* Trigger recovery due to world switch failure */
if (amdgpu_device_should_recover_gpu(adev))
-   amdgpu_device_gpu_recover(adev, NULL);
+   amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev, @@ -550,8 
+550,11 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device *adev,
r = xgpu_vi_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
 
/* only handle FLR_NOTIFY now */
-   if (!r)
-   schedule_work(&adev->virt.flr_work);
+   if (!r && !amdgpu_in_reset(adev))

RE: [PATCH] drm/amdgpu: Fix kernel compilation; style

2022-01-20 Thread Liu, Shaoyun
[AMD Official Use Only]

Good catch .  Thanks . 
Reviewed by : shaoyun.liu 

-Original Message-
From: Tuikov, Luben  
Sent: Thursday, January 20, 2022 6:52 PM
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Deucher, Alexander 
; Liu, Shaoyun ; Russell, Kent 

Subject: [PATCH] drm/amdgpu: Fix kernel compilation; style

Problem:
drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c: In function 
‘is_fru_eeprom_supported’:
drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c:47:3: error: expected ‘)’ before 
‘return’
   47 |   return false;
  |   ^~
drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c:46:5: note: to match this ‘(’
   46 |  if (amdgpu_sriov_vf(adev)
  | ^

Fix kernel compilation:
if (amdgpu_sriov_vf(adev)
return false;
missing closing right parenthesis for the "if".

Fix style:
/* The i2c access is blocked on VF
 * TODO: Need other way to get the info
 */
Has white space after the closing */.

Cc: Alex Deucher 
Cc: shaoyunl 
Cc: Kent Russell 
Fixes: 824c2051039dfc ("drm/amdgpu: Disable FRU EEPROM access for SRIOV")
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
index 0548e279cc9fc4..60e7e637eaa33d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
@@ -42,8 +42,8 @@ static bool is_fru_eeprom_supported(struct amdgpu_device 
*adev)
 
/* The i2c access is blocked on VF
 * TODO: Need other way to get the info
-*/  
-   if (amdgpu_sriov_vf(adev)
+*/
+   if (amdgpu_sriov_vf(adev))
return false;
 
/* VBIOS is of the format ###-DXXXYY-##. For SKU identification,

base-commit: 2e8e13b0a6794f3ddae0ddcd13eedb64de94f0fd
-- 
2.34.0


RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread Liu, Shaoyun
[AMD Official Use Only]

I see, I didn't notice you already have this implemented. So the flr_work
routine itself is synced now; in this case, I agree it should be safe to remove
the in_gpu_reset and reset_sem handling in the flr_work.

Regards
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey  
Sent: Tuesday, January 4, 2022 3:55 PM
To: Liu, Shaoyun ; Koenig, Christian 
; Liu, Monk ; Chen, JingWen 
; Christian König ; 
Deng, Emily ; dri-de...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org; Chen, Horace 
Cc: dan...@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection 
for SRIOV

On 2022-01-04 12:13 p.m., Liu, Shaoyun wrote:

> [AMD Official Use Only]
>
> I mostly agree with the sequences Christian  described .  Just one  thing 
> might need to  discuss here.  For FLR notified from host,  in new sequenceas 
> described  , driver need to reply the  READY_TO_RESET in the  workitem  from 
> a reset  work queue which means inside flr_work, driver can not directly 
> reply to host but need to queue another workqueue .


Can you clarify why 'driver can not directly reply to host but need to queue 
another workqueue' ? To my understanding all steps 3-6 in Christian's 
description happen from the same single wq thread serially.


>   For current  code ,  the flr_work for sriov itself is a work queue queued 
> from ISR .  I think we should try to response to the host driver as soon as 
> possible.  Queue another workqueue  inside  the workqueue  doesn't sounds 
> efficient to me.


Check patch 5 please [1] - I just substituted
schedule_work(&adev->virt.flr_work) with
queue_work(adev->reset_domain.wq, &adev->virt.flr_work), so there is no extra
requeue here; instead of going to the system_wq it's sent to the dedicated reset wq.

[1] -
https://lore.kernel.org/all/2021121400.790842-1-andrey.grodzov...@amd.com/

Andrey


> Anyway, what we need is a working  solution for our project.  So if we need 
> to change the sequence, we  need to make sure it's been tested first and 
> won't break the functionality before the code is landed in the branch .
>
> Regards
> Shaoyun.liu
>
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Christian König
> Sent: Tuesday, January 4, 2022 6:36 AM
> To: Liu, Monk ; Chen, JingWen 
> ; Christian König 
> ; Grodzovsky, Andrey 
> ; Deng, Emily ; 
> dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, 
> Horace 
> Cc: dan...@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
> protection for SRIOV
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of 
>>>> signaling the need for a reset, similar to each job timeout on each queue. 
>>>> Otherwise you have a race condition between the hypervisor and the 
>>>> scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>> FLR is about to start or was already executed, but host will do FLR 
>> anyway without waiting for guest too long
>>
> Then we have a major design issue in the SRIOV protocol and really need to 
> question this.
>
> How do you want to prevent a race between the hypervisor resetting the 
> hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault a timeout 
> or signal from hypervisor.
> 2. For each of those potential reset sources a work item is send to the 
> single workqueue.
> 3. One of those work items execute first and prepares the reset.
> 4. We either do the reset our self or notify the hypervisor that we are ready 
> for the reset.
> 5. Cleanup after the reset, eventually resubmit jobs etc..
> 6. Cancel work items which might have been scheduled from other reset sources.
>
> It does make sense that the hypervisor resets the hardware without waiting 
> for the clients for too long, but if we don't follow this general steps we 
> will always have a race between the different components.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of 
>>>> signaling the need for a reset, similar to each job timeout on each queue. 
>>>> Otherwise you have a race condition between the hypervisor and the 
>>>> scheduler.
>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>> FLR is about to start or was already executed, but host will do FLR 
>> anyway without waiting for guest too long
>>
>>>> In

RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread Liu, Shaoyun
[AMD Official Use Only]

I mostly agree with the sequence Christian described. Just one thing might need
discussion here. For an FLR notified from the host, in the new sequence as
described, the driver needs to reply READY_TO_RESET in the work item from a
reset work queue, which means that inside flr_work the driver cannot directly
reply to the host but needs to queue another work item. In the current code,
the flr_work for SRIOV is itself a work item queued from the ISR. I think we
should try to respond to the host driver as soon as possible; queuing another
work item from inside the work item doesn't sound efficient to me.
Anyway, what we need is a working solution for our project. So if we need to
change the sequence, we need to make sure it's been tested first and won't
break the functionality before the code lands in the branch.

Regards
Shaoyun.liu


-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Tuesday, January 4, 2022 6:36 AM
To: Liu, Monk ; Chen, JingWen ; 
Christian König ; Grodzovsky, Andrey 
; Deng, Emily ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace 

Cc: dan...@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection 
for SRIOV

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling 
>>> the need for a reset, similar to each job timeout on each queue. Otherwise 
>>> you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>

Then we have a major design issue in the SRIOV protocol and really need to 
question this.

How do you want to prevent a race between the hypervisor resetting the hardware 
and the client trying the same because of a timeout?

As far as I can see the procedure should be (a condensed sketch follows the
list):
1. We detect that a reset is necessary, either because of a fault, a timeout,
or a signal from the hypervisor.
2. For each of those potential reset sources a work item is send to the single 
workqueue.
3. One of those work items execute first and prepares the reset.
4. We either do the reset our self or notify the hypervisor that we are ready 
for the reset.
5. Cleanup after the reset, eventually resubmit jobs etc..
6. Cancel work items which might have been scheduled from other reset sources.
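
A condensed editorial sketch of steps 1-6 using the stock workqueue API (the
names are illustrative, not actual amdgpu symbols):

	#include <linux/workqueue.h>

	static struct workqueue_struct *reset_wq;	/* one per reset domain */
	static struct work_struct flr_work;		/* hypervisor FLR source */
	static struct work_struct tdr_work;		/* job-timeout source */

	static void do_reset(struct work_struct *work)
	{
		/* step 3: the first item to run prepares the reset */
		/* step 4: reset ourselves, or tell the hypervisor we are ready */
		/* step 5: clean up, resubmit jobs */
		/* step 6: cancel items queued meanwhile by other sources */
	}

	static int reset_domain_init(void)
	{
		/* ordered wq: at most one item runs at a time, in queue order */
		reset_wq = alloc_ordered_workqueue("reset-domain", 0);
		if (!reset_wq)
			return -ENOMEM;
		INIT_WORK(&flr_work, do_reset);
		INIT_WORK(&tdr_work, do_reset);
		return 0;
	}

	/* steps 1-2: every detection path queues into the same ordered wq
	 * instead of resetting directly, so the hypervisor path and the
	 * scheduler path cannot race. */
	static void on_flr_irq(void)     { queue_work(reset_wq, &flr_work); }
	static void on_job_timeout(void) { queue_work(reset_wq, &tdr_work); }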

It does make sense that the hypervisor resets the hardware without waiting for 
the clients for too long, but if we don't follow these general steps we will 
always have a race between the different components.

Regards,
Christian.

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling 
>>> the need for a reset, similar to each job timeout on each queue. Otherwise 
>>> you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>
>>> In other words I strongly think that the current SRIOV reset implementation 
>>> is severely broken and what Andrey is doing is actually fixing it.
> It makes the code to crash ... how could it be a fix ?
>
> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not 
> ruin the logic, Andry or jingwen can try it if needed.
>
> Thanks
> ---
> Monk Liu | Cloud GPU & Virtualization Solution | AMD
> ---
> we are hiring software manager for CVS core team
> ---
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Tuesday, January 4, 2022 6:19 PM
> To: Chen, JingWen ; Christian König 
> ; Grodzovsky, Andrey 
> ; Deng, Emily ; Liu, 
> Monk ; dri-de...@lists.freedesktop.org; 
> amd-gfx@lists.freedesktop.org; Chen, Horace ; 
> Chen, JingWen 
> Cc: dan...@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
> protection for SRIOV
>
> Hi Jingwen,
>
> well what I mean is that we need to adjust the implementation in amdgpu to 
> actually match the requirements.
>
> Could be that the reset sequence is questionable in general, but I doubt so 
> at least for now.
>
> See the FLR request from the hypervisor is just another source of signaling 
> the need for a reset, similar to each job timeout on each queue. Otherwise 
> you have a race condition between the hypervisor and the scheduler.
>
> Properly setting in_gpu_reset is indeed mandatory, but should happen at a 
> central place and not in the SRIOV specific code.
>
> In other words I strongly think that the current SRIOV reset implementation 
> is 

RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-23 Thread Liu, Shaoyun
[AMD Official Use Only]

I had a discussion with Andrey about this offline. It seems dangerous to remove
the in_gpu_reset and reset_sem handling directly inside the flr_work. In the
case when the reset is triggered from the host side, the GPU needs to be locked
while the host performs the reset, after flr_work replies to the host with
READY_TO_RESET.
The original comments seem to need updating.
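
For reference, an editorial reassembly of the guard being discussed (condensed
from the hunks below, with the archive-mangled references restored); the
concern is the window between READY_TO_RESET and the host-side FLR completing:

	if (!down_write_trylock(&adev->reset_sem))
		return;			/* a recovery is already in flight */
	amdgpu_virt_fini_data_exchange(adev);
	atomic_set(&adev->in_gpu_reset, 1);
	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
	/* ... host performs the VF FLR while reset_sem is held ... */
	atomic_set(&adev->in_gpu_reset, 0);
	up_write(&adev->reset_sem);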

Regards
Shaoyun.liu
 

-Original Message-
From: amd-gfx  On Behalf Of Andrey 
Grodzovsky
Sent: Wednesday, December 22, 2021 5:14 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk ; Grodzovsky, Andrey 
; Chen, Horace ; Koenig, 
Christian ; dan...@ffwll.ch
Subject: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for 
SRIOV

Since now flr work is serialized against  GPU resets there is no need for this.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 
-   /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-* otherwise the mailbox msg will be ruined/reseted by
-* the VF FLR.
-*/
-   if (!down_write_trylock(&adev->reset_sem))
-   return;
-
amdgpu_virt_fini_data_exchange(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
} while (timeout > 1);
 
 flr_done:
-   atomic_set(&adev->in_gpu_reset, 0);
-   up_write(&adev->reset_sem);
-
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) || diff --git 
a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 
-   /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-* otherwise the mailbox msg will be ruined/reseted by
-* the VF FLR.
-*/
-   if (!down_write_trylock(&adev->reset_sem))
-   return;
-
amdgpu_virt_fini_data_exchange(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
} while (timeout > 1);
 
 flr_done:
-   atomic_set(&adev->in_gpu_reset, 0);
-   up_write(&adev->reset_sem);
-
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) ||
--
2.25.1


RE: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-20 Thread Liu, Shaoyun
[AMD Official Use Only]


Hi Andrey,
I actually have some concerns about this change.
1. On an SRIOV configuration, the reset notification comes from the host, and
the driver already triggers a work queue to handle the reset (check
xgpu_*_mailbox_flr_work). Is it a good idea to trigger another work queue
inside that work queue? Can we just use the new one you added?
2. For KFD, ROCm uses the user queue for submission, which won't call the drm
scheduler and hence has no job timeout. Can we handle that with your new
change?
3. For an XGMI hive, there is only a hive reset for all devices in a bare-metal
config, but for an SRIOV config the VF will support VF FLR, which means the
host might only need to reset a specific device instead of triggering a whole
hive reset. So we might still need a per-device reset_domain within the hive
for the SRIOV configuration.

Anyway, I think this change needs to be verified on an SRIOV configuration on
XGMI while some ROCm app is running.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Monday, December 20, 2021 2:25 AM
To: Grodzovsky, Andrey ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: dan...@ffwll.ch; Chen, Horace ; Koenig, Christian 
; Liu, Monk 
Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:
> This patchset is based on earlier work by Boris[1] that allowed to 
> have an ordered workqueue at the driver level that will be used by the 
> different schedulers to queue their timeout work. On top of that I 
> also serialized any GPU reset we trigger from within amdgpu code to 
> also go through the same ordered wq and in this way simplify somewhat 
> our GPU reset code so we don't need to protect from concurrency by 
> multiple GPU reset triggeres such as TDR on one hand and sysfs trigger or RAS 
> trigger on the other hand.
>
> As advised by Christian and Daniel I defined a reset_domain struct 
> such that all the entities that go through reset together will be 
> serialized one against another.
>
> TDR triggered by multiple entities within the same domain due to the 
> same reason will not be triggered as the first such reset will cancel 
> all the pending resets. This is relevant only to TDR timers and not to 
> triggered resets coming from RAS or SYSFS, those will still happen after the 
> in flight resets finishes.
>
> [1] 
> https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
>
> P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work 
> hasn't landed yet there.

Patches #1 and #5, #6 are Reviewed-by: Christian König 


Some minor comments on the rest, but in general absolutely looks like the way 
we want to go.

Regards,
Christian.

>
> Andrey Grodzovsky (6):
>drm/amdgpu: Init GPU reset single threaded wq
>drm/amdgpu: Move scheduler init to after XGMI is ready
>drm/amdgpu: Fix crash on modprobe
>drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>drm/amdgpu: Drop hive->in_reset
>drm/amdgpu: Drop concurrent GPU reset protection for device
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h|   9 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
>   7 files changed, 132 insertions(+), 136 deletions(-)
>


RE: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handling

2021-12-17 Thread Liu, Shaoyun
[AMD Official Use Only]

Reviewed by: Shaoyun.liu 

-Original Message-
From: amd-gfx  On Behalf Of sashank saye
Sent: Friday, December 17, 2021 1:56 PM
To: amd-gfx@lists.freedesktop.org
Cc: Saye, Sashank 
Subject: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for 
sbr handling

For the Aldebaran chip passthrough case we need to inform the SMU
about special handling for SBR. On older chips we send
LightSBR to the SMU; enable the same for Aldebaran. The slight
difference, compared to previous chips, is that on Aldebaran the SMU
would do a heavy reset on SBR. Hence, the word Heavy
instead of Light SBR is used for the SMU to differentiate.

Signed-off-by: sashank saye 
Change-Id: I79420e7352bb670d6f9696df97d7546f131b18fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  9 -
 drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h   |  4 +++-
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h|  6 +++---
 drivers/gpu/drm/amd/pm/inc/smu_types.h |  3 ++-
 drivers/gpu/drm/amd/pm/inc/smu_v11_0.h |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  6 +++---
 drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c  |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 10 ++
 9 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..e4c93d373224 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2618,11 +2618,10 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (r)
DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);
 
-   /* For XGMI + passthrough configuration on arcturus, enable light SBR */
-   if (adev->asic_type == CHIP_ARCTURUS &&
-   amdgpu_passthrough(adev) &&
-   adev->gmc.xgmi.num_physical_nodes > 1)
-   smu_set_light_sbr(&adev->smu, true);
+   /* For passthrough configuration on arcturus and aldebaran, enable 
special handling SBR */
+   if (amdgpu_passthrough(adev) && ((adev->asic_type == CHIP_ARCTURUS && 
adev->gmc.xgmi.num_physical_nodes > 1)||
+  adev->asic_type == CHIP_ALDEBARAN ))
+   smu_handle_passthrough_sbr(&adev->smu, true);
 
if (adev->gmc.xgmi.num_physical_nodes > 1) {
mutex_lock(&mgpu_info.mutex);
diff --git a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h 
b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
index 35fa0d8e92dd..ab66a4b9e438 100644
--- a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
+++ b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
@@ -102,7 +102,9 @@
 
 #define PPSMC_MSG_GfxDriverResetRecovery   0x42
 #define PPSMC_MSG_BoardPowerCalibration0x43
-#define PPSMC_Message_Count0x44
+#define PPSMC_MSG_HeavySBR  0x45
+#define PPSMC_Message_Count0x46
+
 
 //PPSMC Reset Types
 #define PPSMC_RESET_TYPE_WARM_RESET  0x00
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 2b9b9a7ba97a..ba7565bc8104 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -1257,9 +1257,9 @@ struct pptable_funcs {
int (*set_fine_grain_gfx_freq_parameters)(struct smu_context *smu);
 
/**
-* @set_light_sbr:  Set light sbr mode for the SMU.
+* @smu_handle_passthrough_sbr:  Send message to SMU about special 
handling for SBR.
 */
-   int (*set_light_sbr)(struct smu_context *smu, bool enable);
+   int (*smu_handle_passthrough_sbr)(struct smu_context *smu, bool enable);
 
/**
 * @wait_for_event:  Wait for events from SMU.
@@ -1415,7 +1415,7 @@ int smu_allow_xgmi_power_down(struct smu_context *smu, 
bool en);
 
 int smu_get_status_gfxoff(struct amdgpu_device *adev, uint32_t *value);
 
-int smu_set_light_sbr(struct smu_context *smu, bool enable);
+int smu_handle_passthrough_sbr(struct smu_context *smu, bool enable);
 
 int smu_wait_for_event(struct amdgpu_device *adev, enum smu_event_type event,
   uint64_t event_arg);
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_types.h 
b/drivers/gpu/drm/amd/pm/inc/smu_types.h
index 18b862a90fbe..ff8a0bcbd290 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_types.h
@@ -229,7 +229,8 @@
__SMU_DUMMY_MAP(BoardPowerCalibration),   \
__SMU_DUMMY_MAP(RequestGfxclk),   \
__SMU_DUMMY_MAP(ForceGfxVid), \
-   __SMU_DUMMY_MAP(UnforceGfxVid),
+   __SMU_DUMMY_MAP(UnforceGfxVid),   \
+   __SMU_DUMMY_MAP(HeavySBR),
 
 #undef __SMU_DUMMY_MAP
 #define __SMU_DUMMY_MAP(type)  SMU_MSG_##type
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
index 2d422e6a9feb..acb3be292096 100644

RE: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handling

2021-12-17 Thread Liu, Shaoyun
[AMD Official Use Only]

Comment inline .

-Original Message-
From: amd-gfx  On Behalf Of sashank saye
Sent: Friday, December 17, 2021 1:19 PM
To: amd-gfx@lists.freedesktop.org
Cc: Saye, Sashank 
Subject: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for 
sbr handling

For the Aldebaran chip passthrough case we need to inform the SMU about special
handling for SBR. On older chips we send LightSBR to the SMU; enable the same
for Aldebaran. The slight difference, compared to previous chips, is that on
Aldebaran the SMU would do a heavy reset on SBR. Hence, the word Heavy instead
of Light SBR is used for the SMU to differentiate.

Signed-off-by: sashank saye mailto:sashank.s...@amd.com>>
Change-Id: I79420e7352bb670d6f9696df97d7546f131b18fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  9 -
 drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h   |  4 +++-
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h|  6 +++---
 drivers/gpu/drm/amd/pm/inc/smu_types.h |  3 ++-
 drivers/gpu/drm/amd/pm/inc/smu_v11_0.h |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  6 +++---
 drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c  |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 10 ++
 9 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..0c292e119f7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2618,11 +2618,10 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (r)
DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);

-   /* For XGMI + passthrough configuration on arcturus, enable light SBR */
-   if (adev->asic_type == CHIP_ARCTURUS &&
-   amdgpu_passthrough(adev) &&
-   adev->gmc.xgmi.num_physical_nodes > 1)
-   smu_set_light_sbr(&adev->smu, true);
+   /* For passthrough configuration on arcturus and aldebaran, enable 
special handling SBR */
[shaoyunl] This will change the behavior for ARCTURUS, which only needs to set
light SBR for the XGMI configuration. You still need to check XGMI for
ARCTURUS, but don't do that check for ALDEBARAN.
+   if ((adev->asic_type == CHIP_ARCTURUS || adev->asic_type == 
CHIP_ALDEBARAN ) &&
+   amdgpu_passthrough(adev))
+   smu_handle_passthrough_sbr(&adev->smu, true);

if (adev->gmc.xgmi.num_physical_nodes > 1) {
mutex_lock(&mgpu_info.mutex);
diff --git a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h 
b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
index 35fa0d8e92dd..ab66a4b9e438 100644
--- a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
+++ b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
@@ -102,7 +102,9 @@

 #define PPSMC_MSG_GfxDriverResetRecovery   0x42
 #define PPSMC_MSG_BoardPowerCalibration0x43
-#define PPSMC_Message_Count0x44
+#define PPSMC_MSG_HeavySBR  0x45
+#define PPSMC_Message_Count0x46
+

 //PPSMC Reset Types
 #define PPSMC_RESET_TYPE_WARM_RESET  0x00
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 2b9b9a7ba97a..ba7565bc8104 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -1257,9 +1257,9 @@ struct pptable_funcs {
int (*set_fine_grain_gfx_freq_parameters)(struct smu_context *smu);

/**
-* @set_light_sbr:  Set light sbr mode for the SMU.
+* @smu_handle_passthrough_sbr:  Send message to SMU about special 
handling for SBR.
 */
-   int (*set_light_sbr)(struct smu_context *smu, bool enable);
+   int (*smu_handle_passthrough_sbr)(struct smu_context *smu, bool
+enable);

/**
 * @wait_for_event:  Wait for events from SMU.
@@ -1415,7 +1415,7 @@ int smu_allow_xgmi_power_down(struct smu_context *smu, 
bool en);

 int smu_get_status_gfxoff(struct amdgpu_device *adev, uint32_t *value);

-int smu_set_light_sbr(struct smu_context *smu, bool enable);
+int smu_handle_passthrough_sbr(struct smu_context *smu, bool enable);

 int smu_wait_for_event(struct amdgpu_device *adev, enum smu_event_type event,
   uint64_t event_arg);
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_types.h 
b/drivers/gpu/drm/amd/pm/inc/smu_types.h
index 18b862a90fbe..ff8a0bcbd290 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_types.h
@@ -229,7 +229,8 @@
__SMU_DUMMY_MAP(BoardPowerCalibration),   \
__SMU_DUMMY_MAP(RequestGfxclk),   \
__SMU_DUMMY_MAP(ForceGfxVid), \
-   __SMU_DUMMY_MAP(UnforceGfxVid),
+   __SMU_DUMMY_MAP(UnforceGfxVid),   \
+   __SMU_DUMMY_MAP(HeavySBR),

 #undef __SMU_DUMMY_MAP
 #define __SMU_DUMMY_MAP(type) 

RE: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handling

2021-12-17 Thread Liu, Shaoyun
[AMD Official Use Only]

From your explanation, it seems the SMU always needs this special handling for
SBR in passthrough mode, but in the code it only applies to the XGMI
configuration. Should you change that as well? Two comments inline.

Regards
Shaoyun.liu



-Original Message-
From: amd-gfx  On Behalf Of sashank saye
Sent: Friday, December 17, 2021 12:39 PM
To: amd-gfx@lists.freedesktop.org
Cc: Saye, Sashank 
Subject: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for 
sbr handling

For the Aldebaran chip passthrough case we need to inform the SMU about special
handling for SBR. On older chips we send LightSBR to the SMU; enable the same
for Aldebaran. The slight difference, compared to previous chips, is that on
Aldebaran the SMU would do a heavy reset on SBR. Hence, the word Heavy instead
of Light SBR is used for the SMU to differentiate.

Signed-off-by: sashank saye mailto:sashank.s...@amd.com>>
Change-Id: I79420e7352bb670d6f9696df97d7546f131b18fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 +++---
 drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h   |  4 +++-
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h|  6 +++---
 drivers/gpu/drm/amd/pm/inc/smu_types.h |  3 ++-
 drivers/gpu/drm/amd/pm/inc/smu_v11_0.h |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  6 +++---
 drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c  |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 11 +++
 9 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..01b02701121e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2618,11 +2618,11 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (r)
DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);

-   /* For XGMI + passthrough configuration on arcturus, enable light SBR */
-   if (adev->asic_type == CHIP_ARCTURUS &&
+   /* For XGMI + passthrough configuration on arcturus and aldebaran, 
enable light SBR */
+   if ((adev->asic_type == CHIP_ARCTURUS || adev->asic_type ==
+CHIP_ALDEBARAN ) &&
amdgpu_passthrough(adev) &&
adev->gmc.xgmi.num_physical_nodes > 1)

[shaoyunl] Should this apply to non-XGMI configurations as well?

-   smu_set_light_sbr(&adev->smu, true);
+   smu_handle_passthrough_sbr(&adev->smu, true);

if (adev->gmc.xgmi.num_physical_nodes > 1) {
mutex_lock(&mgpu_info.mutex);
diff --git a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h 
b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
index 35fa0d8e92dd..ab66a4b9e438 100644
--- a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
+++ b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
@@ -102,7 +102,9 @@

 #define PPSMC_MSG_GfxDriverResetRecovery   0x42
 #define PPSMC_MSG_BoardPowerCalibration0x43
-#define PPSMC_Message_Count0x44
+#define PPSMC_MSG_HeavySBR  0x45
+#define PPSMC_Message_Count0x46
+

 //PPSMC Reset Types
 #define PPSMC_RESET_TYPE_WARM_RESET  0x00
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 2b9b9a7ba97a..ba7565bc8104 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -1257,9 +1257,9 @@ struct pptable_funcs {
int (*set_fine_grain_gfx_freq_parameters)(struct smu_context *smu);

/**
-* @set_light_sbr:  Set light sbr mode for the SMU.
+* @smu_handle_passthrough_sbr:  Send message to SMU about special 
handling for SBR.
 */
-   int (*set_light_sbr)(struct smu_context *smu, bool enable);
+   int (*smu_handle_passthrough_sbr)(struct smu_context *smu, bool
+enable);

/**
 * @wait_for_event:  Wait for events from SMU.
@@ -1415,7 +1415,7 @@ int smu_allow_xgmi_power_down(struct smu_context *smu, 
bool en);

 int smu_get_status_gfxoff(struct amdgpu_device *adev, uint32_t *value);

-int smu_set_light_sbr(struct smu_context *smu, bool enable);
+int smu_handle_passthrough_sbr(struct smu_context *smu, bool enable);

 int smu_wait_for_event(struct amdgpu_device *adev, enum smu_event_type event,
   uint64_t event_arg);
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_types.h 
b/drivers/gpu/drm/amd/pm/inc/smu_types.h
index 18b862a90fbe..ff8a0bcbd290 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_types.h
@@ -229,7 +229,8 @@
__SMU_DUMMY_MAP(BoardPowerCalibration),   \
__SMU_DUMMY_MAP(RequestGfxclk),   \
__SMU_DUMMY_MAP(ForceGfxVid), \
-   __SMU_DUMMY_MAP(UnforceGfxVid),
+   __SMU_DUMMY_MAP(UnforceGfxVid),   \
+   __SMU_DUMMY_MAP(HeavySBR),


RE: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handling

2021-12-17 Thread Liu, Shaoyun
[AMD Official Use Only]

Ok, sounds reasonable. I'm ok with the function name change.
Another concern: on the driver side, before it starts the IP init it will check
the SMU clock to determine whether the ASIC needs a reset from the driver side.
For your case, the hypervisor will trigger the SBR on VM on/off and the SMU
will handle the reset. Can you check whether the SMU is still alive after this
reset? If it's alive, the driver will trigger the reset again.

Regards
Shaoyun.liu

-Original Message-
From: Saye, Sashank  
Sent: Friday, December 17, 2021 11:53 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough 
for sbr handling

[AMD Official Use Only]

Hi Shaoyun,
Yes, from the SMU FW point of view they do see a difference between bare metal
and the passthrough case for SBR. For bare metal they get it as a PCI reset,
whereas in the passthrough case they get it as a BIF reset. Within BIF reset
they need to differentiate between older ASICs (where we do BACO) and newer
ones where we do a mode 1 reset. Hence, in order for the SMU to differentiate
these scenarios we are adding a new message.

I think I will rename the function to smu_handle_passthrough_sbr from the
current smu_set_light_sbr function name.

Regards
Sashank
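
(For illustration, a minimal sketch of what the renamed common wrapper could
look like, following the existing smu_set_light_sbr() dispatch pattern; treat
this as an assumption about the final shape, not the committed code:)

int smu_handle_passthrough_sbr(struct smu_context *smu, bool enable)
{
        int ret = 0;

        mutex_lock(&smu->mutex);

        /* Each ASIC decides what "special SBR handling" means: LightSBR
         * on older chips, a heavy (mode 1) reset path on Aldebaran. */
        if (smu->ppt_funcs->smu_handle_passthrough_sbr)
                ret = smu->ppt_funcs->smu_handle_passthrough_sbr(smu, enable);

        mutex_unlock(&smu->mutex);

        return ret;
}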

-Original Message-
From: Liu, Shaoyun 
Sent: Friday, December 17, 2021 11:45 AM
To: Saye, Sashank ; amd-gfx@lists.freedesktop.org
Cc: Saye, Sashank 
Subject: RE: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough 
for sbr handling

[AMD Official Use Only]

First, the name heavy SBR is confusing when you need to go through the light
SBR code path.
Second, we originally introduced the light SBR because on older ASICs the FW
cannot synchronize the reset on the devices within the hive, so it depends on
the driver to sync the reset. From what I have heard, for the Arcturus chip
the FW actually can sync the reset itself. I don't see a need to introduce the
heavy SBR message; it seems the SMU will do a full reset when it gets the SBR
request. Is there a different code path for the SMU to handle the reset for
XGMI in passthrough mode?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of sashank saye
Sent: Friday, December 17, 2021 10:33 AM
To: amd-gfx@lists.freedesktop.org
Cc: Saye, Sashank 
Subject: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for 
sbr handling

For the Aldebaran chip passthrough case we need to inform the SMU about special
handling for SBR. On older chips we send LightSBR to the SMU; enable the same
for Aldebaran. The slight difference, compared to previous chips, is that on
Aldebaran the SMU would do a heavy reset on SBR. Hence, the word Heavy instead
of Light SBR is used for the SMU to differentiate.

Signed-off-by: sashank saye 
Change-Id: I79420e7352bb670d6f9696df97d7546f131b18fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
 drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h   |  4 +++-
 drivers/gpu/drm/amd/pm/inc/smu_types.h |  3 ++-
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 11 +++
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..06aee23505b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2618,8 +2618,8 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (r)
DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);
 
-   /* For XGMI + passthrough configuration on arcturus, enable light SBR */
-   if (adev->asic_type == CHIP_ARCTURUS &&
+   /* For XGMI + passthrough configuration on arcturus and aldebaran, 
enable light SBR */
+   if ((adev->asic_type == CHIP_ARCTURUS || adev->asic_type == 
+CHIP_ALDEBARAN ) &&
amdgpu_passthrough(adev) &&
adev->gmc.xgmi.num_physical_nodes > 1)
smu_set_light_sbr(&adev->smu, true);
diff --git a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h 
b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
index 35fa0d8e92dd..ab66a4b9e438 100644
--- a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
+++ b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
@@ -102,7 +102,9 @@
 
 #define PPSMC_MSG_GfxDriverResetRecovery   0x42
 #define PPSMC_MSG_BoardPowerCalibration0x43
-#define PPSMC_Message_Count0x44
+#define PPSMC_MSG_HeavySBR  0x45
+#define PPSMC_Message_Count0x46
+
 
 //PPSMC Reset Types
 #define PPSMC_RESET_TYPE_WARM_RESET  0x00
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_types.h 
b/drivers/gpu/drm/amd/pm/inc/smu_types.h
index 18b862a90fbe..ff8a0bcbd290 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_types.h
@@ -229,7 +22

RE: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handling

2021-12-17 Thread Liu, Shaoyun
[AMD Official Use Only]

First, the name heavy SBR is confusing when you need to go through the light
SBR code path.
Second, we originally introduced the light SBR because on older ASICs the FW
cannot synchronize the reset on the devices within the hive, so it depends on
the driver to sync the reset. From what I have heard, for the Arcturus chip
the FW actually can sync the reset itself. I don't see a need to introduce the
heavy SBR message; it seems the SMU will do a full reset when it gets the SBR
request. Is there a different code path for the SMU to handle the reset for
XGMI in passthrough mode?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of sashank saye
Sent: Friday, December 17, 2021 10:33 AM
To: amd-gfx@lists.freedesktop.org
Cc: Saye, Sashank 
Subject: [PATCH] drm/amdgpu: Send Message to SMU on aldebaran passthrough for 
sbr handling

For the Aldebaran chip passthrough case we need to inform the SMU about special
handling for SBR. On older chips we send LightSBR to the SMU; enable the same
for Aldebaran. The slight difference, compared to previous chips, is that on
Aldebaran the SMU would do a heavy reset on SBR. Hence, the word Heavy instead
of Light SBR is used for the SMU to differentiate.

Signed-off-by: sashank saye 
Change-Id: I79420e7352bb670d6f9696df97d7546f131b18fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
 drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h   |  4 +++-
 drivers/gpu/drm/amd/pm/inc/smu_types.h |  3 ++-
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 11 +++
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..06aee23505b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2618,8 +2618,8 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (r)
DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);
 
-   /* For XGMI + passthrough configuration on arcturus, enable light SBR */
-   if (adev->asic_type == CHIP_ARCTURUS &&
+   /* For XGMI + passthrough configuration on arcturus and aldebaran, 
enable light SBR */
+   if ((adev->asic_type == CHIP_ARCTURUS || adev->asic_type == 
+CHIP_ALDEBARAN ) &&
amdgpu_passthrough(adev) &&
adev->gmc.xgmi.num_physical_nodes > 1)
smu_set_light_sbr(&adev->smu, true);
diff --git a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h 
b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
index 35fa0d8e92dd..ab66a4b9e438 100644
--- a/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
+++ b/drivers/gpu/drm/amd/pm/inc/aldebaran_ppsmc.h
@@ -102,7 +102,9 @@
 
 #define PPSMC_MSG_GfxDriverResetRecovery   0x42
 #define PPSMC_MSG_BoardPowerCalibration0x43
-#define PPSMC_Message_Count0x44
+#define PPSMC_MSG_HeavySBR  0x45
+#define PPSMC_Message_Count0x46
+
 
 //PPSMC Reset Types
 #define PPSMC_RESET_TYPE_WARM_RESET  0x00
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_types.h 
b/drivers/gpu/drm/amd/pm/inc/smu_types.h
index 18b862a90fbe..ff8a0bcbd290 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_types.h
@@ -229,7 +229,8 @@
__SMU_DUMMY_MAP(BoardPowerCalibration),   \
__SMU_DUMMY_MAP(RequestGfxclk),   \
__SMU_DUMMY_MAP(ForceGfxVid), \
-   __SMU_DUMMY_MAP(UnforceGfxVid),
+   __SMU_DUMMY_MAP(UnforceGfxVid),   \
+   __SMU_DUMMY_MAP(HeavySBR),
 
 #undef __SMU_DUMMY_MAP
 #define __SMU_DUMMY_MAP(type)  SMU_MSG_##type
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 7433a051e795..f442950e9676 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -141,6 +141,7 @@ static const struct cmn2asic_msg_mapping 
aldebaran_message_map[SMU_MSG_MAX_COUNT
MSG_MAP(SetUclkDpmMode,  PPSMC_MSG_SetUclkDpmMode,  
0),
MSG_MAP(GfxDriverResetRecovery,  
PPSMC_MSG_GfxDriverResetRecovery,  0),
MSG_MAP(BoardPowerCalibration,   
PPSMC_MSG_BoardPowerCalibration,   0),
+   MSG_MAP(HeavySBR,PPSMC_MSG_HeavySBR,
0),
 };
 
 static const struct cmn2asic_mapping aldebaran_clk_map[SMU_CLK_COUNT] = { @@ 
-1912,6 +1913,15 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
return ret;
 }
 
+static int aldebaran_set_light_sbr(struct smu_context *smu, bool 
+enable) {
+   int ret = 0;
+   //For Aldebaran chip, SMU would do a mode 1 reset as part of SBR hence 
we call it HeavySBR instead of light
+   ret =  smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_HeavySBR, enable ? 
+1 : 0, NULL);
+

RE: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

2021-12-16 Thread Liu, Shaoyun
[AMD Official Use Only]

Reviewed by: shaoyun.liu 

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 16, 2021 2:43 PM
To: amd-gfx@lists.freedesktop.org; Deng, Emily ; Liu, Monk 
; Ming, Davis ; Liu, Shaoyun 
; Zhou, Peng Ju ; Chen, JingWen 
; Chen, Horace ; Nieto, David M 

Cc: Skvortsov, Victor 
Subject: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

Driver needs to call get_xgmi_info() before ip_init to determine whether it 
needs to handle a pending hive reset.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 --
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 --
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5bd785cfc5ca..4fd370016834 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3576,6 +3576,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
 
+   /* Need to get xgmi info early to decide the reset behavior*/
+   if (adev->gmc.xgmi.supported) {
+   r = adev->gfxhub.funcs->get_xgmi_info(adev);
+   if (r)
+   return r;
+   }
+
/* enable PCIE atomic ops */
if (amdgpu_sriov_vf(adev))
adev->have_atomics_support = ((struct amd_sriov_msg_pf2vf_info 
*) diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ae46eb35b3d7..3d5d47a799e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -914,12 +914,6 @@ static int gmc_v10_0_sw_init(void *handle)
return r;
}
 
-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
r = gmc_v10_0_mc_init(adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 2b86c63b032a..57f2729a7bd0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1628,12 +1628,6 @@ static int gmc_v9_0_sw_init(void *handle)
}
adev->need_swiotlb = drm_need_swiotlb(44);
 
-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
r = gmc_v9_0_mc_init(adev);
if (r)
return r;
--
2.25.1


RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

2021-12-16 Thread Liu, Shaoyun
[AMD Official Use Only]

Actually I don't know why the change "a35f147621bc drm/amdgpu: get xgmi info
at early_init" is not in drm-next; instead it's in amd-mainline-dkms-5.13.
That change is necessary for passing an XGMI hive through to a VM, which relies
on our driver to reset the whole hive when the driver is loaded.

I checked the code again; it seems we should be ok as long as we get the xgmi
info at early_init. So since gfx_v9_0_set_rlc_funcs() already gets called in
gfx_v9_0_early_init(), we can move getting the xgmi info out of gmc_early_init
and call it as the last step of early_init.

Regards
Shaoyun.liu
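
(A minimal sketch of that suggestion, assuming the query can simply move to the
tail of early init; hypothetical placement, not a tested patch:)

        /* Last step of early init: query the XGMI topology so a pending
         * hive reset can be detected before any IP sw_init/hw_init runs. */
        if (adev->gmc.xgmi.supported) {
                r = adev->gfxhub.funcs->get_xgmi_info(adev);
                if (r)
                        return r;
        }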

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 16, 2021 9:28 AM
To: Alex Deucher 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[AMD Official Use Only]

Gotcha, I will skip this patch for drm-next

-Original Message-
From: Alex Deucher 
Sent: Thursday, December 16, 2021 8:53 AM
To: Skvortsov, Victor 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[CAUTION: External Email]

On Wed, Dec 15, 2021 at 6:58 PM Skvortsov, Victor  
wrote:
>
> [AMD Official Use Only]
>
> Hey Alex,
>
> This change was based on the fact that amd-mainline-dkms-5.13 calls 
> get_xgmi_info() in gmc_v9_0_early_init(). But I can see that drm-next it's 
> instead called in gmc_v9_0_sw_init(). So, I'm not sure whats the correct 
> behavior. But I do agree that the change is kind of ugly. I don't know where 
> else to put it if we do need to call get_xgmi_info() in early_init.
>

We could skip this patch for drm-next and just apply it to the dkms branch.  
There's already a lot of ugly stuff in there to deal with multiple kernel 
versions.

Alex


> Thanks,
> Victor
>
> -Original Message-
> From: Alex Deucher 
> Sent: Wednesday, December 15, 2021 4:38 PM
> To: Skvortsov, Victor 
> Cc: amd-gfx list ; Deng, Emily 
> ; Liu, Monk ; Ming, Davis 
> ; Liu, Shaoyun ; Zhou, Peng 
> Ju ; Chen, JingWen ; Chen, 
> Horace ; Nieto, David M 
> Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function 
> pointers
>
> [CAUTION: External Email]
>
> On Wed, Dec 15, 2021 at 1:56 PM Victor Skvortsov  
> wrote:
> >
> > In SRIOV, RLC function pointers must be initialized early as we rely 
> > on the RLCG interface for all GC register access.
> >
> > Signed-off-by: Victor Skvortsov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 ++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +--
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h | 2 ++
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index 65e1f6cc59dd..1bc92a38d124 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -844,6 +844,8 @@ static int amdgpu_discovery_set_gc_ip_blocks(struct 
> > amdgpu_device *adev)
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > amdgpu_device_ip_block_add(adev, 
> > _v9_0_ip_block);
> > +   if (amdgpu_sriov_vf(adev) && adev->ip_versions[GC_HWIP][0] 
> > == IP_VERSION(9, 4, 2))
> > +   gfx_v9_0_set_rlc_funcs(adev);
>
> amdgpu_discovery.c is IP independent.  I'd rather not add random IP specific 
> function calls.  gfx_v9_0_set_rlc_funcs() already gets called in 
> gfx_v9_0_early_init().  Is that not early enough?  In general we shouldn't be 
> touching the hardware much if at all in early_init.
>
> Alex
>
> > break;
> > case IP_VERSION(10, 1, 10):
> > case IP_VERSION(10, 1, 2):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index edb3e3b08eed..d252b06efa43 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -816,7 +816,6 @@ static void gfx_v9_0_sriov_wreg(struct 
> > amdgpu_device *adev, u32 offset,  static void 
> > gfx_v9_0_set_ring_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_gds_init(struct amdgpu_device *adev); -static void 
> > gfx_v9_0_set_rlc_funcs(struct amdgpu_d

RE: [PATCH v2] drm/amdgpu: Separate vf2pf work item init from virt data exchange

2021-12-16 Thread Liu, Shaoyun
[AMD Official Use Only]

This one  looks better and  more logical . 

Reviewed By :Shaoyun.liu 

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 16, 2021 10:39 AM
To: amd-gfx@lists.freedesktop.org; Liu, Shaoyun ; Nieto, 
David M 
Cc: Skvortsov, Victor 
Subject: [PATCH v2] drm/amdgpu: Separate vf2pf work item init from virt data 
exchange

We want to be able to call virt data exchange conditionally after gmc sw init 
to reserve bad pages as early as possible.
Since this is a conditional call, we will need to call it again unconditionally 
later in the init sequence.

Refactor the data exchange function so it can be called multiple times without 
re-initializing the work item.

v2: Cleaned up the code. Kept the original call to init_exchange_data() inside 
early init to initialize the work item, afterwards call
exchange_data() when needed.
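
(Illustrative sketch of the split described above, with simplified, hypothetical
state: a one-time work item init that is safe to keep in early init, and a data
exchange step that may be called repeatedly:)

#include <linux/jiffies.h>
#include <linux/workqueue.h>

struct virt_ctx {
        struct delayed_work vf2pf_work;
};

/* Called exactly once from early init: only prepares the work item. */
static void virt_init_data_exchange(struct virt_ctx *ctx,
                                    void (*fn)(struct work_struct *))
{
        INIT_DELAYED_WORK(&ctx->vf2pf_work, fn);
}

/* Safe to call multiple times: once early (to reserve bad pages right
 * after gmc sw init) and again later in the init sequence. It never
 * re-initializes the work item, only (re)schedules it. */
static void virt_exchange_data(struct virt_ctx *ctx)
{
        /* ... read the pf2vf region, reserve bad pages, fill vf2pf ... */
        schedule_delayed_work(&ctx->vf2pf_work, msecs_to_jiffies(2000));
}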

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 36 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  1 +
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 48aeca3b8f16..ddc67b900587 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2316,6 +2316,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 
/* need to do gmc hw init early so we can allocate gpu mem */
if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
+   /* Try to reserve bad pages early */
+   if (amdgpu_sriov_vf(adev))
+   amdgpu_virt_exchange_data(adev);
+
r = amdgpu_device_vram_scratch_init(adev);
if (r) {
DRM_ERROR("amdgpu_vram_scratch_init failed 
%d\n", r); @@ -2347,7 +2351,7 @@ static int amdgpu_device_ip_init(struct 
amdgpu_device *adev)
}
 
if (amdgpu_sriov_vf(adev))
-   amdgpu_virt_init_data_exchange(adev);
+   amdgpu_virt_exchange_data(adev);
 
r = amdgpu_ib_pool_init(adev);
if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3fc49823f527..f8e574cc0e22 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -622,17 +622,35 @@ void amdgpu_virt_fini_data_exchange(struct amdgpu_device 
*adev)
 
 void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)  {
-   uint64_t bp_block_offset = 0;
-   uint32_t bp_block_size = 0;
-   struct amd_sriov_msg_pf2vf_info *pf2vf_v2 = NULL;
-
adev->virt.fw_reserve.p_pf2vf = NULL;
adev->virt.fw_reserve.p_vf2pf = NULL;
adev->virt.vf2pf_update_interval_ms = 0;
 
-   if (adev->mman.fw_vram_usage_va != NULL) {
+   if (adev->bios != NULL) {
adev->virt.vf2pf_update_interval_ms = 2000;
 
+   adev->virt.fw_reserve.p_pf2vf =
+   (struct amd_sriov_msg_pf2vf_info_header *)
+   (adev->bios + (AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10));
+
+   amdgpu_virt_read_pf2vf_data(adev);
+   }
+
+   if (adev->virt.vf2pf_update_interval_ms != 0) {
+   INIT_DELAYED_WORK(&adev->virt.vf2pf_work, 
amdgpu_virt_update_vf2pf_work_item);
+   schedule_delayed_work(&(adev->virt.vf2pf_work), 
msecs_to_jiffies(adev->virt.vf2pf_update_interval_ms));
+   }
+}
+
+
+void amdgpu_virt_exchange_data(struct amdgpu_device *adev) {
+   uint64_t bp_block_offset = 0;
+   uint32_t bp_block_size = 0;
+   struct amd_sriov_msg_pf2vf_info *pf2vf_v2 = NULL;
+
+   if (adev->mman.fw_vram_usage_va != NULL) {
+
adev->virt.fw_reserve.p_pf2vf =
(struct amd_sriov_msg_pf2vf_info_header *)
(adev->mman.fw_vram_usage_va + 
(AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10)); @@ -663,16 +681,10 @@ void 
amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)
(adev->bios + (AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10));
 
amdgpu_virt_read_pf2vf_data(adev);
-
-   return;
-   }
-
-   if (adev->virt.vf2pf_update_interval_ms != 0) {
-   INIT_DELAYED_WORK(&adev->virt.vf2pf_work, 
amdgpu_virt_update_vf2pf_work_item);
-   schedule_delayed_work(&(adev->virt.vf2pf_work), 
adev->virt.vf2pf_update_interval_ms);
}
 }
 
+
 void amdgpu_detect_virtualization(struct amdgpu_device *adev)  {
uint32_t reg;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 8d4c20bb71c5..9adfb8d63280 100644
--- a/drivers/g

RE: [PATCH 1/2] drm/amdgpu: Separate vf2pf work item init from virt data exchange

2021-12-15 Thread Liu, Shaoyun
[AMD Official Use Only]

Looks ok to me. This series is Reviewed by: Shaoyun.liu 

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Victor 
Skvortsov
Sent: Thursday, December 9, 2021 11:48 AM
To: amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: [PATCH 1/2] drm/amdgpu: Separate vf2pf work item init from virt data 
exchange

We want to be able to call virt data exchange conditionally after gmc sw init 
to reserve bad pages as early as possible.
Since this is a conditional call, we will need to call it again unconditionally 
later in the init sequence.

Refactor the data exchange function so it can be called multiple times without 
re-initializing the work item.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 20 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 42 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  5 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  2 +-
 5 files changed, 45 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ce9bdef185c0..3992c4086d26 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2181,7 +2181,7 @@ static int amdgpu_device_ip_early_init(struct 
amdgpu_device *adev)
 
/*get pf2vf msg info at it's earliest time*/
if (amdgpu_sriov_vf(adev))
-   amdgpu_virt_init_data_exchange(adev);
+   amdgpu_virt_exchange_data(adev);
 
}
}
@@ -2345,8 +2345,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
}
}
 
-   if (amdgpu_sriov_vf(adev))
-   amdgpu_virt_init_data_exchange(adev);
+   if (amdgpu_sriov_vf(adev)) {
+   amdgpu_virt_exchange_data(adev);
+   amdgpu_virt_init_vf2pf_work_item(adev);
+   }
 
r = amdgpu_ib_pool_init(adev);
if (r) {
@@ -2949,7 +2951,7 @@ int amdgpu_device_ip_suspend(struct amdgpu_device *adev)
int r;
 
if (amdgpu_sriov_vf(adev)) {
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
amdgpu_virt_request_full_gpu(adev, false);
}
 
@@ -3839,7 +3841,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 * */
if (amdgpu_sriov_vf(adev)) {
amdgpu_virt_request_full_gpu(adev, false);
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
}
 
/* disable all interrupts */
@@ -4317,7 +4319,9 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,
if (r)
goto error;
 
-   amdgpu_virt_init_data_exchange(adev);
+   amdgpu_virt_exchange_data(adev);
+   amdgpu_virt_init_vf2pf_work_item(adev);
+
/* we need recover gart prior to run SMC/CP/SDMA resume */
amdgpu_gtt_mgr_recover(ttm_manager_type(&adev->mman.bdev, TTM_PL_TT));
 
@@ -4495,7 +4499,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
 
if (amdgpu_sriov_vf(adev)) {
/* stop the data exchange thread */
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
}
 
/* block all schedulers and reset given job's ring */ @@ -4898,7 
+4902,7 @@ static void amdgpu_device_recheck_guilty_jobs(
 retry:
/* do hw reset */
if (amdgpu_sriov_vf(adev)) {
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
r = amdgpu_device_reset_sriov(adev, false);
if (r)
adev->asic_reset_res = r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3fc49823f527..b6e3d379a86a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -611,16 +611,7 @@ static void amdgpu_virt_update_vf2pf_work_item(struct 
work_struct *work)
schedule_delayed_work(&(adev->virt.vf2pf_work), 
adev->virt.vf2pf_update_interval_ms);
 }
 
-void amdgpu_virt_fini_data_exchange(struct amdgpu_device *adev) -{
-   if (adev->virt.vf2pf_update_interval_ms != 0) {
-   DRM_INFO("clean up the vf2pf work item\n");
-   cancel_delayed_work_sync(>virt.vf2pf_work);
-   adev->virt.vf2pf_update_interval_ms = 0;
-   }
-}
-
-void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)
+void amdgpu_virt_exchange_data(struct amdgpu_device *adev)
 {
uint64_t bp_block_offset = 0;
uint32_t bp_block_size = 0;
@@ 

RE: [PATCH v2 1/2] drm/amd/amdgpu: fix psp tmr bo pin count leak in SRIOV

2021-12-14 Thread Liu, Shaoyun
[AMD Official Use Only]

This workaround code looks confusing. For the PSP TMR, I think the guest side
should avoid loading it entirely since it's loaded on the host side. For the
gart table, in the current code path it's probably ok, but I think with a
correct sequence in SRIOV we shouldn't need these kinds of workarounds. E.g.,
can we try calling ip_suspend for SRIOV in amdgpu_device_pre_asic_reset, so we
will have the same logic as bare metal?

Regards
Shaoyun.liu
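
(A sketch of that suggestion only -- unverified and hypothetical: let SRIOV take
the same IP suspend path as bare metal during pre-reset, so the PSP TMR pin
count stays balanced without the skip_pin_bo special case.)

static int pre_asic_reset_sketch(struct amdgpu_device *adev)
{
        /* ... stop schedulers, cancel delayed init work ... */

        if (amdgpu_sriov_vf(adev))
                /* Suspending the IP blocks makes the resume path re-pin
                 * the TMR bo, mirroring the bare-metal reset sequence. */
                return amdgpu_device_ip_suspend(adev);

        return 0;
}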

-Original Message-
From: amd-gfx  On Behalf Of Jingwen Chen
Sent: Monday, December 13, 2021 11:18 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Horace ; Chen, JingWen ; 
Liu, Monk 
Subject: [PATCH v2 1/2] drm/amd/amdgpu: fix psp tmr bo pin count leak in SRIOV

[Why]
The psp tmr bo is pinned when amdgpu is loaded and on reset in SRIOV, while it
is only unpinned when amdgpu is unloaded.

[How]
Add an amdgpu_in_reset and sriov judgement to skip pinning the bo.

v2: fix wrong judgement

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 103bcadbc8b8..4de46fcb486c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2017,12 +2017,16 @@ static int psp_hw_start(struct psp_context *psp)
return ret;
}
 
+   if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev)) 
+   goto skip_pin_bo;
+
ret = psp_tmr_init(psp);
if (ret) {
DRM_ERROR("PSP tmr init failed!\n");
return ret;
}
 
+skip_pin_bo:
/*
 * For ASICs with DF Cstate management centralized
 * to PMFW, TMR setup should be performed after PMFW
--
2.30.2


RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

2021-12-09 Thread Liu, Shaoyun
[AMD Official Use Only]

Sounds reasonable. 
 This patch is Reviewed by : Shaoyun.liu 

Regards
Shaoyun.liu

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 9, 2021 1:33 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

[AMD Official Use Only]

I wanted to keep the order the same as in amdgpu_device_lock_adev() (Set flag 
then acquire lock) to prevent any weird race conditions.

Thanks,
Victor
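
(A sketch of that ordering, derived from the patch below: claim the in-progress
flag first, then block on the write lock; simplified and not the full handler.)

static void flr_work_ordering_sketch(struct amdgpu_device *adev)
{
        /* Claim "reset in progress" first; a second concurrent FLR or
         * TDR sees the flag and bails out instead of racing. */
        if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
                return;

        /* Then block -- not trylock -- so the host-initiated FLR is
         * never dropped just because a reader holds reset_sem. */
        down_write(&adev->reset_sem);

        /* ... perform the VF FLR handshake and recovery ... */

        atomic_set(&adev->in_gpu_reset, 0);
        up_write(&adev->reset_sem);
}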

-Original Message-
From: Liu, Shaoyun 
Sent: Thursday, December 9, 2021 1:25 PM
To: Skvortsov, Victor ; amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

[AMD Official Use Only]

I think it's a good catch for reset_sem; any reason to change adev->in_gpu_reset?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Victor 
Skvortsov
Sent: Thursday, December 9, 2021 12:02 PM
To: amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

Host initiated VF FLR may fail if someone else is already holding a read_lock. 
Change from down_write_trylock to down_write to guarantee the reset goes 
through.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +++--  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index cd2719bc0139..e4365c97adaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -252,11 +252,12 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 2bc93808469a..1cde70c72e54 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -281,11 +281,12 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
--
2.25.1


RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

2021-12-09 Thread Liu, Shaoyun
[AMD Official Use Only]

I think it's a good catch for reset_sem; any reason to change adev->in_gpu_reset?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Victor 
Skvortsov
Sent: Thursday, December 9, 2021 12:02 PM
To: amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

Host initiated VF FLR may fail if someone else is already holding a read_lock. 
Change from down_write_trylock to down_write to guarantee the reset goes 
through.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +++--  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index cd2719bc0139..e4365c97adaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -252,11 +252,12 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 2bc93808469a..1cde70c72e54 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -281,11 +281,12 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
--
2.25.1


RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF

2021-12-07 Thread Liu, Shaoyun
[AMD Official Use Only]

Ok, sounds reasonable. With the suggested modification,
patches 1, 2, 3 are Reviewed by: Shaoyun.liu. Patch 4 is
Acked by: Shaoyun.liu.

Regards
Shaoyun.liu

-Original Message-
From: Luo, Zhigang  
Sent: Tuesday, December 7, 2021 4:55 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive 
if it's SRIOV VF

[AMD Official Use Only]

Shaoyun, please see my comments inline.

Thanks,
Zhigang

-Original Message-
From: Liu, Shaoyun 
Sent: December 7, 2021 2:15 PM
To: Luo, Zhigang ; amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive 
if it's SRIOV VF

[AMD Official Use Only]

This patch looks ok to me.
Patch 2 actually adds the PSP xgmi init, not the whole XGMI init; can you
change the description accordingly?
[Zhigang] Ok. Will change it.
Patch 3: you take the hive lock inside the reset sriov function, but the hive
lock is already taken before this function is called in the gpu_recovery
function, so is it really necessary to get the hive inside the reset sriov
function? Can you try removing the code that checks the hive? Or maybe pass
the hive as a parameter into this function if the hive is needed?
[Zhigang] In patch 1, we made a change in gpu_recovery to skip getting the xgmi
hive if it's a sriov vf, as we don't want to reset other VFs in the same hive.
Patch 4 looks ok to me, but it may need an SRDC engineer to confirm it won't
have side effects on other AI asics.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Tuesday, December 7, 2021 11:57 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if 
it's SRIOV VF

On SRIOV, the host driver can support FLR (function level reset) on an
individual VF within the hive, which might bring the individual device back to
normal without the need to execute a hive reset. If the FLR fails, the host
driver will trigger the hive reset; each guest VF will get a reset notification
before the real hive reset is executed. The VF device can handle the reset
request individually in its reset work handler.

This change updates the gpu recover sequence to skip resetting other devices in
the same hive for a SRIOV VF.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c5afa45173c..474f8ea58aa5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4746,7 +4746,7 @@ static int amdgpu_device_lock_hive_adev(struct 
amdgpu_device *adev, struct amdgp  {
struct amdgpu_device *tmp_adev = NULL;
 
-   if (adev->gmc.xgmi.num_physical_nodes > 1) {
+   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
if (!hive) {
dev_err(adev->dev, "Hive is NULL while device has 
multiple xgmi nodes");
return -ENODEV;
@@ -4958,7 +4958,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * We always reset all schedulers for device and all devices for XGMI
 * hive so that should take care of them too.
 */
-   hive = amdgpu_get_xgmi_hive(adev);
+   if (!amdgpu_sriov_vf(adev))
+   hive = amdgpu_get_xgmi_hive(adev);
if (hive) {
if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as 
another already in progress", @@ -4999,7 +5000,7 @@ int 
amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * to put adev in the 1st position.
 */
INIT_LIST_HEAD(&device_list);
-   if (adev->gmc.xgmi.num_physical_nodes > 1) {
+   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
list_add_tail(&tmp_adev->reset_list, &device_list);
if (!list_is_first(&adev->reset_list, &device_list))
--
2.17.1


RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF

2021-12-07 Thread Liu, Shaoyun
[AMD Official Use Only]

This patch looks ok to me.
Patch 2 actually adds the PSP xgmi init, not the whole XGMI init; can you
change the description accordingly?
Patch 3: you take the hive lock inside the reset sriov function, but the hive
lock is already taken before this function is called in the gpu_recovery
function, so is it really necessary to get the hive inside the reset sriov
function? Can you try removing the code that checks the hive? Or maybe pass
the hive as a parameter into this function if the hive is needed?
Patch 4 looks ok to me, but it may need an SRDC engineer to confirm it won't
have side effects on other AI asics.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Tuesday, December 7, 2021 11:57 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if 
it's SRIOV VF

On SRIOV, the host driver can support FLR (function level reset) on an
individual VF within the hive, which might bring the individual device back to
normal without the need to execute a hive reset. If the FLR fails, the host
driver will trigger the hive reset; each guest VF will get a reset notification
before the real hive reset is executed. The VF device can handle the reset
request individually in its reset work handler.

This change updates the gpu recover sequence to skip resetting other devices in
the same hive for a SRIOV VF.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c5afa45173c..474f8ea58aa5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4746,7 +4746,7 @@ static int amdgpu_device_lock_hive_adev(struct 
amdgpu_device *adev, struct amdgp  {
struct amdgpu_device *tmp_adev = NULL;
 
-   if (adev->gmc.xgmi.num_physical_nodes > 1) {
+   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
if (!hive) {
dev_err(adev->dev, "Hive is NULL while device has 
multiple xgmi nodes");
return -ENODEV;
@@ -4958,7 +4958,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * We always reset all schedulers for device and all devices for XGMI
 * hive so that should take care of them too.
 */
-   hive = amdgpu_get_xgmi_hive(adev);
+   if (!amdgpu_sriov_vf(adev))
+   hive = amdgpu_get_xgmi_hive(adev);
if (hive) {
if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as 
another already in progress", @@ -4999,7 +5000,7 @@ int 
amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * to put adev in the 1st position.
 */
INIT_LIST_HEAD(&device_list);
-   if (adev->gmc.xgmi.num_physical_nodes > 1) {
+   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
list_add_tail(&tmp_adev->reset_list, &device_list);
if (!list_is_first(&adev->reset_list, &device_list))
--
2.17.1


RE: [PATCH] drm/amdgpu: skip reset other device in the same hive if it's sriov vf

2021-12-03 Thread Liu, Shaoyun
[AMD Official Use Only]

I think you need to describe in more detail why the hive reset on the guest
side is not necessary, and how the host and guest drivers work together to
handle the hive reset. You should have the 2 patches together as a series to
handle the FLR and mode 1 reset on the XGMI configuration.
The description is something like:
  On SRIOV, the host driver can support FLR (function level reset) on an
individual VF within the hive, which might bring the individual device back to
normal without the need to execute a hive reset. If the FLR fails, the host
driver will trigger the hive reset; each guest VF will get a reset notification
before the real hive reset is executed. The VF device can handle the reset
request individually in its reset work handler.
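
(Sketch of the guest-side behavior described above, using the existing flr_work
plumbing; simplified, and the recovery call is an assumption, not the final
patch:)

static void vf_flr_work_sketch(struct work_struct *work)
{
        struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
        struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);

        /* Per-VF recovery only: no amdgpu_get_xgmi_hive(), no walk of
         * hive->device_list -- the host coordinates any hive-wide reset. */
        amdgpu_device_gpu_recover(adev, NULL);
}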

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Friday, December 3, 2021 5:06 PM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH] drm/amdgpu: skip reset other device in the same hive if it's 
sriov vf

For sriov vf hang, vf flr will be triggered. Hive reset is not needed.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c5afa45173c..474f8ea58aa5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4746,7 +4746,7 @@ static int amdgpu_device_lock_hive_adev(struct 
amdgpu_device *adev, struct amdgp  {
struct amdgpu_device *tmp_adev = NULL;
 
-   if (adev->gmc.xgmi.num_physical_nodes > 1) {
+   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
if (!hive) {
dev_err(adev->dev, "Hive is NULL while device has 
multiple xgmi nodes");
return -ENODEV;
@@ -4958,7 +4958,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * We always reset all schedulers for device and all devices for XGMI
 * hive so that should take care of them too.
 */
-   hive = amdgpu_get_xgmi_hive(adev);
+   if (!amdgpu_sriov_vf(adev))
+   hive = amdgpu_get_xgmi_hive(adev);
if (hive) {
if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as 
another already in progress", @@ -4999,7 +5000,7 @@ int 
amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * to put adev in the 1st position.
 */
INIT_LIST_HEAD(&device_list);
-   if (adev->gmc.xgmi.num_physical_nodes > 1) {
+   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
list_add_tail(&tmp_adev->reset_list, &device_list);
if (!list_is_first(&adev->reset_list, &device_list))
--
2.17.1


RE: [PATCH] drm/amdgpu: adjust the kfd reset sequence in reset sriov function

2021-11-30 Thread Liu, Shaoyun
Thanks for the review; changed the description as suggested and submitted.

Shaoyun.liu

-Original Message-
From: Kuehling, Felix  
Sent: Tuesday, November 30, 2021 1:19 AM
To: amd-gfx@lists.freedesktop.org; Liu, Shaoyun 
Subject: Re: [PATCH] drm/amdgpu: adjust the kfd reset sequence in reset sriov 
function

Am 2021-11-29 um 9:40 p.m. schrieb shaoyunl:
> This change revert previous commit
> 7079e7d5c6bf: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
> cd547b93c62a: drm/amdgpu: move kfd post_reset out of reset_sriov 
> function

It looks like this is not a straight revert. It moves the 
amdgpu_amdkfd_pre_reset to an earlier place in amdgpu_device_reset_sriov, 
presumably to address the sequence issue that the first patch was originally 
meant to fix. The patch description should mention that.

With that fixed, the patch is

Reviewed-by: Felix Kuehling 


>
> Some register access (GRBM_GFX_CNTL) is only allowed in full access
> mode. Move kfd_pre_reset and kfd_post_reset back inside the reset_sriov
> function.
>
> Signed-off-by: shaoyunl 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1989f9e9379e..3c5afa45173c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4285,6 +4285,8 @@ static int amdgpu_device_reset_sriov(struct 
> amdgpu_device *adev,  {
>   int r;
>  
> + amdgpu_amdkfd_pre_reset(adev);
> +
>   if (from_hypervisor)
>   r = amdgpu_virt_request_full_gpu(adev, true);
>   else
> @@ -4312,6 +4314,7 @@ static int amdgpu_device_reset_sriov(struct 
> amdgpu_device *adev,
>  
>   amdgpu_irq_gpu_reset_resume_helper(adev);
>   r = amdgpu_ib_ring_tests(adev);
> + amdgpu_amdkfd_post_reset(adev);
>  
>  error:
>   if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) 
> { @@ -5026,7 +5029,8 @@ int amdgpu_device_gpu_recover(struct 
> amdgpu_device *adev,
>  
>   cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>  
> - amdgpu_amdkfd_pre_reset(tmp_adev);
> + if (!amdgpu_sriov_vf(tmp_adev))
> + amdgpu_amdkfd_pre_reset(tmp_adev);
>  
>   /*
>* Mark these ASICs to be reseted as untracked first @@ -5144,9 
> +5148,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  
>  skip_sched_resume:
>   list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> - /* unlock kfd */
> - if (!need_emergency_restart)
> - amdgpu_amdkfd_post_reset(tmp_adev);
> + /* unlock kfd: SRIOV would do it separately */
> + if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
> + amdgpu_amdkfd_post_reset(tmp_adev);
>  
>   /* kfd_post_reset will do nothing if kfd device is not 
> initialized,
>* need to bring up kfd here if it's not be initialized before


RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

2021-11-22 Thread Liu, Shaoyun
[AMD Official Use Only]

Thanks for the review.
The hash for the previous change from the gerrit git amd-staging-drm-next branch is 
7079e7d5c6bf248bff, so is there another drm-next branch that is not in the 
gerrit git for upstream? 

Thanks 
Shaoyun.liu


-Original Message-
From: Kuehling, Felix  
Sent: Monday, November 22, 2021 10:40 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov 
function

Am 2021-11-18 um 11:57 a.m. schrieb shaoyunl:
> For an SRIOV XGMI configuration, the host driver will handle the hive 
> reset, so on the guest side reset_sriov is only called once on one 
> device. This makes kfd post_reset unbalanced with kfd pre_reset, 
> since kfd pre_reset has already been moved out of the reset_sriov function. 
> Move kfd post_reset out of the reset_sriov function to make them balanced.
>
> Signed-off-by: shaoyunl 

Please change the headline prefix to "drm/amdgpu: ". The extra "/amd" is 
redundant. And I'd also add a tag

Fixes: 9f4f2c1a3524 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in
sriov")

Note that the commit hash is the one from the drm-next branch, which is what 
will get merged into master eventually. With those changes, the patch is

Reviewed-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 10c8008d1da0..9a9d5493c676 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct 
> amdgpu_device *adev,
>  
>   amdgpu_irq_gpu_reset_resume_helper(adev);
>   r = amdgpu_ib_ring_tests(adev);
> - amdgpu_amdkfd_post_reset(adev);
>  
>  error:
>   if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) 
> { @@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct 
> amdgpu_device *adev,
>  
>   tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>   /* Actual ASIC resets if needed.*/
> - /* TODO Implement XGMI hive reset logic for SRIOV */
> + /* Host driver will handle XGMI hive reset for SRIOV */
>   if (amdgpu_sriov_vf(adev)) {
>   r = amdgpu_device_reset_sriov(adev, job ? false : true);
>   if (r)
> @@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct 
> amdgpu_device *adev,
>  
>  skip_sched_resume:
>   list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> - /* unlock kfd: SRIOV would do it separately */
> - if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
> + /* unlock kfd */
> + if (!need_emergency_restart)
>   amdgpu_amdkfd_post_reset(tmp_adev);
>  
>   /* kfd_post_reset will do nothing if kfd device is not 
> initialized,


RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

2021-11-22 Thread Liu, Shaoyun
[AMD Official Use Only]

ping

-Original Message-
From: Liu, Shaoyun 
Sent: Thursday, November 18, 2021 10:08 PM
To: amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov 
function

[AMD Official Use Only]

Ping 

-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, November 18, 2021 11:58 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

For an SRIOV XGMI configuration, the host driver will handle the hive reset, so 
on the guest side reset_sriov is only called once on one device. This makes 
kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has already 
been moved out of the reset_sriov function. Move kfd post_reset out of the 
reset_sriov function to make them balanced.

Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 10c8008d1da0..9a9d5493c676 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,
 
amdgpu_irq_gpu_reset_resume_helper(adev);
r = amdgpu_ib_ring_tests(adev);
-   amdgpu_amdkfd_post_reset(adev);
 
 error:
if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) { 
@@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
/* Actual ASIC resets if needed.*/
-   /* TODO Implement XGMI hive reset logic for SRIOV */
+   /* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
r = amdgpu_device_reset_sriov(adev, job ? false : true);
if (r)
@@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 skip_sched_resume:
list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
-   /* unlock kfd: SRIOV would do it separately */
-   if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
+   /* unlock kfd */
+   if (!need_emergency_restart)
amdgpu_amdkfd_post_reset(tmp_adev);
 
/* kfd_post_reset will do nothing if kfd device is not 
initialized,
--
2.17.1


RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

2021-11-18 Thread Liu, Shaoyun
[AMD Official Use Only]

Ping 

-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, November 18, 2021 11:58 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

For an SRIOV XGMI configuration, the host driver will handle the hive reset, so 
on the guest side reset_sriov is only called once on one device. This makes 
kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has already 
been moved out of the reset_sriov function. Move kfd post_reset out of the 
reset_sriov function to make them balanced.

Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 10c8008d1da0..9a9d5493c676 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,
 
amdgpu_irq_gpu_reset_resume_helper(adev);
r = amdgpu_ib_ring_tests(adev);
-   amdgpu_amdkfd_post_reset(adev);
 
 error:
if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) { 
@@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
/* Actual ASIC resets if needed.*/
-   /* TODO Implement XGMI hive reset logic for SRIOV */
+   /* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
r = amdgpu_device_reset_sriov(adev, job ? false : true);
if (r)
@@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 skip_sched_resume:
list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
-   /* unlock kfd: SRIOV would do it separately */
-   if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
+   /* unlock kfd */
+   if (!need_emergency_restart)
amdgpu_amdkfd_post_reset(tmp_adev);
 
/* kfd_post_reset will do nothing if kfd device is not 
initialized,
--
2.17.1


RE: [PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again

2021-11-15 Thread Liu, Shaoyun
[AMD Official Use Only]

OK, sounds reasonable. 

Thanks 
Shaoyun.liu 

-Original Message-
From: Kuehling, Felix  
Sent: Monday, November 15, 2021 11:07 AM
To: amd-gfx@lists.freedesktop.org; Liu, Shaoyun 
Subject: Re: [PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and 
been triggered again

Am 2021-11-14 um 12:55 p.m. schrieb shaoyunl:
> In an SRIOV configuration, the reset may fail to bring the ASIC back to 
> normal even though stop_cpsch has already been called; start_cpsch will not be 
> called since there is no resume in this case. When a reset is triggered 
> again, the driver should avoid doing the uninitialization again.
>
> Signed-off-by: shaoyunl 

If there is a possibility that stop_cpsch is called multiple times, I think the 
check for that should be at the start of the function.
Something like:

    if (!dqm->sched_running)
        return 0;

Regards,
  Felix


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 42b2cc999434..bcc8980d77e0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1228,12 +1228,14 @@ static int stop_cpsch(struct device_queue_manager 
> *dqm)
>   if (!dqm->is_hws_hang)
>   unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>   hanging = dqm->is_hws_hang || dqm->is_resetting;
> - dqm->sched_running = false;
>  
> - pm_release_ib(&dqm->packet_mgr);
>  
> + if (dqm->sched_running) {
> + dqm->sched_running = false;
> + pm_release_ib(&dqm->packet_mgr);
> + kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> + pm_uninit(&dqm->packet_mgr, hanging);
> + }
>  
> - kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> - pm_uninit(&dqm->packet_mgr, hanging);
>  
>   return 0;


RE: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov

2021-11-05 Thread Liu, Shaoyun
[AMD Official Use Only]

Yes, a lot has already changed since then; now the pre_reset and post_reset 
are not inside the lock/unlock anymore. With my previous change, we made 
kfd_pre_reset avoid touching HW. Now it's pure SW handling, so it should be safe 
to be moved out of the full access window. 
Anyway, thanks for bringing this up; it will remind us to verify on the XGMI 
configuration on SRIOV. 

Regards
shaoyun.liu 

-Original Message-
From: Kuehling, Felix  
Sent: Friday, November 5, 2021 1:48 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov

There was a reason why pre_reset was done differently on SRIOV. However, the 
code has changed a lot since then. Is this concern still valid?

> commit 7b184b006185215daf4e911f8de212964c99a514
> Author: wentalou 
> Date:   Fri Dec 7 13:53:18 2018 +0800
>
>     drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang
>
>     XGMI hive put kfd_pre_reset into amdgpu_device_lock_adev,
>     but outside req_full_gpu of sriov.
>     It would make sriov hang during reset.
>
>     Signed-off-by: Wentao Lou 
>     Reviewed-by: Shaoyun Liu 
>     Signed-off-by: Alex Deucher 
Regards,
   Felix


On 2021-11-05 12:57 p.m., shaoyunl wrote:
> The KFD pre_reset should be called before the reset is executed; it will 
> hold the lock to prevent other ROCm processes from sending packages to the 
> HIQ while the host executes the real reset on the HW.
>
> Signed-off-by: shaoyunl 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 95fec36e385e..d7c9dce17cad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4278,8 +4278,6 @@ static int amdgpu_device_reset_sriov(struct 
> amdgpu_device *adev,
>   if (r)
>   return r;
>   
> - amdgpu_amdkfd_pre_reset(adev);
> -
>   /* Resume IP prior to SMC */
>   r = amdgpu_device_ip_reinit_early_sriov(adev);
>   if (r)
> @@ -5015,8 +5013,7 @@ int amdgpu_device_gpu_recover(struct 
> amdgpu_device *adev,
>   
>   cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>   
> - if (!amdgpu_sriov_vf(tmp_adev))
> - amdgpu_amdkfd_pre_reset(tmp_adev);
> + amdgpu_amdkfd_pre_reset(tmp_adev);
>   
>   /*
>* Mark these ASICs to be reseted as untracked first


RE: [PATCH] drm/amdgpu: Get atomicOps info from Host for sriov setup

2021-09-10 Thread Liu, Shaoyun
[AMD Official Use Only]

Good catch. My editor seems to have an auto-complete feature and I just selected 
the first one.  ☹

Thanks 
Shaoyun.liu

-Original Message-
From: Kuehling, Felix  
Sent: Friday, September 10, 2021 10:19 AM
To: amd-gfx@lists.freedesktop.org; Liu, Shaoyun 
Subject: Re: [PATCH] drm/amdgpu: Get atomicOps info from Host for sriov setup

Am 2021-09-10 um 10:04 a.m. schrieb shaoyunl:
> The AtomicOp Requester Enable bit is reserved in VFs, and the PF value 
> applies to all associated VFs, so the guest driver cannot directly enable 
> atomicOps for the VF; it depends on the PF to enable them. In the current 
> design, the amdgpu driver will get the enabled atomicOps bits through 
> private pf2vf data.
>
> Signed-off-by: shaoyunl 
> Change-Id: Ifdbcb4396d64e3f3cbf6bcbf7ab9c7b2cb061052

Please remove the Change-Id.

In general, the change looks good to me. One question and one more nit-pick 
inline ...


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 25 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 25 ++++++++++++++-----------
>  drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h |  4 +++-
>  2 files changed, 17 insertions(+), 12 deletions(-)
>  mode change 100644 => 100755 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>  mode change 100644 => 100755 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> old mode 100644
> new mode 100755
> index 653bd8fdaa33..fc6a6491c1b6
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3529,17 +3529,6 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   DRM_INFO("register mmio base: 0x%08X\n", (uint32_t)adev->rmmio_base);
>   DRM_INFO("register mmio size: %u\n", (unsigned)adev->rmmio_size);
>  
> - /* enable PCIE atomic ops */
> - r = pci_enable_atomic_ops_to_root(adev->pdev,
> -   PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
> -   PCI_EXP_DEVCAP2_ATOMIC_COMP64);
> - if (r) {
> - adev->have_atomics_support = false;
> - DRM_INFO("PCIE atomic ops is not supported\n");
> - } else {
> - adev->have_atomics_support = true;
> - }
> -
>   amdgpu_device_get_pcie_info(adev);
>  
>   if (amdgpu_mcbp)
> @@ -3562,6 +3551,20 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   if (r)
>   return r;
>  
> + /* enable PCIE atomic ops */
> + if (amdgpu_sriov_bios(adev))

Is this the correct condition? I think this would be true for the PF as well. 
But on the PF we still need to call pci_enable_atomic_ops_to_root.
I would expect a condition that only applies to VFs.


> + adev->have_atomics_support = ((struct amd_sriov_msg_pf2vf_info *)
> + adev->virt.fw_reserve.p_pf2vf)->pcie_atomic_ops_enabled_flags ==
> + (PCI_EXP_DEVCAP2_ATOMIC_COMP32 | PCI_EXP_DEVCAP2_ATOMIC_COMP64);
> + else
> + adev->have_atomics_support =
> + !pci_enable_atomic_ops_to_root(adev->pdev,
> +   PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
> +   PCI_EXP_DEVCAP2_ATOMIC_COMP64);
> + if (!adev->have_atomics_support)
> + dev_info(adev->dev, "PCIE atomic ops is not supported\n");
> +
> +

Double blank lines. One is enough.

Regards,
  Felix


>   /* doorbell bar mapping and doorbell index init*/
>   amdgpu_device_doorbell_init(adev);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
> old mode 100644
> new mode 100755
> index a434c71fde8e..995899191288
> --- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
> @@ -204,8 +204,10 @@ struct amd_sriov_msg_pf2vf_info {
>   } mm_bw_management[AMD_SRIOV_MSG_RESERVE_VCN_INST];
>   /* UUID info */
>   struct amd_sriov_msg_uuid_info uuid_info;
> + /* pcie atomic Ops info */
> + uint32_t pcie_atomic_ops_enabled_flags;
>   /* reserved */
> - uint32_t reserved[256 - 47];
> + uint32_t reserved[256 - 48];
>  };
>  
>  struct amd_sriov_msg_vf2pf_info_header {
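
For reference, the check under discussion boils down to a flag comparison; a
sketch assuming only the pf2vf field quoted above (the example_* name is
invented):

#include <linux/pci.h>
#include <linux/types.h>

static bool example_vf_has_pcie_atomics(u32 pcie_atomic_ops_enabled_flags)
{
	const u32 required = PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
			     PCI_EXP_DEVCAP2_ATOMIC_COMP64;

	/* masking (rather than '==') would also tolerate extra bits such
	 * as COMP128 being set by the PF */
	return (pcie_atomic_ops_enabled_flags & required) == required;
}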


RE: [PATCH v3 1/1] drm/amdkfd: make needs_pcie_atomics FW-version dependent

2021-09-10 Thread Liu, Shaoyun
[AMD Official Use Only]

Looks good to me . 
Reviewed by Shaoyun.liu < shaoyun@amd.com>

-Original Message-
From: Kuehling, Felix  
Sent: Friday, September 10, 2021 1:10 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: Re: [PATCH v3 1/1] drm/amdkfd: make needs_pcie_atomics FW-version 
dependent

Am 2021-09-08 um 6:48 p.m. schrieb Felix Kuehling:
> On some GPUs the PCIe atomic requirement for KFD depends on the MEC 
> firmware version. Add a firmware version check for this. The minimum 
> firmware version that works without atomics can be updated in the 
> device_info structure for each GPU type.
>
> Move PCIe atomic detection from kgf2kfd_probe into kgf2kfd_device_init 
> because the MEC firmware is not loaded yet at the probe stage.
>
> Signed-off-by: Felix Kuehling 
I tested this change on a Sienna Cichlid on a system without PCIe atomics, both 
with the old and the new firmware. This version of the change should be good to 
go if I can get an R-b.

Thanks,
  Felix


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 44 ++++++++++++++++++++++++++++----------------
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h   |  1 +
>  2 files changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 16a57b70cc1a..30fde852af19 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -468,6 +468,7 @@ static const struct kfd_device_info navi10_device_info = {
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
>   .needs_pci_atomics = true,
> + .no_atomic_fw_version = 145,
>   .num_sdma_engines = 2,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 8,
> @@ -487,6 +488,7 @@ static const struct kfd_device_info navi12_device_info = {
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
>   .needs_pci_atomics = true,
> + .no_atomic_fw_version = 145,
>   .num_sdma_engines = 2,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 8,
> @@ -506,6 +508,7 @@ static const struct kfd_device_info navi14_device_info = {
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
>   .needs_pci_atomics = true,
> + .no_atomic_fw_version = 145,
>   .num_sdma_engines = 2,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 8,
> @@ -525,6 +528,7 @@ static const struct kfd_device_info 
> sienna_cichlid_device_info = {
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
>   .needs_pci_atomics = true,
> + .no_atomic_fw_version = 92,
>   .num_sdma_engines = 4,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 8,
> @@ -544,6 +548,7 @@ static const struct kfd_device_info 
> navy_flounder_device_info = {
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
>   .needs_pci_atomics = true,
> + .no_atomic_fw_version = 92,
>   .num_sdma_engines = 2,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 8,
> @@ -562,7 +567,8 @@ static const struct kfd_device_info vangogh_device_info = 
> {
>   .mqd_size_aligned = MQD_SIZE_ALIGNED,
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
> - .needs_pci_atomics = false,
> + .needs_pci_atomics = true,
> + .no_atomic_fw_version = 92,
>   .num_sdma_engines = 1,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 2,
> @@ -582,6 +588,7 @@ static const struct kfd_device_info 
> dimgrey_cavefish_device_info = {
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
>   .needs_pci_atomics = true,
> + .no_atomic_fw_version = 92,
>   .num_sdma_engines = 2,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 8,
> @@ -601,6 +608,7 @@ static const struct kfd_device_info 
> beige_goby_device_info = {
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
>   .needs_pci_atomics = true,
> + .no_atomic_fw_version = 92,
>   .num_sdma_engines = 1,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 8,
> @@ -619,7 +627,8 @@ static const struct kfd_device_info 
> yellow_carp_device_info = {
>   .mqd_size_aligned = MQD_SIZE_ALIGNED,
>   .needs_iommu_device = false,
>   .supports_cwsr = true,
> - .needs_pci_atomics = false,
> + .needs_pci_atomics = true,
> + .no_atomic_fw_version = 92,
>   .num_sdma_engines = 1,
>   .num_xgmi_sdma_engines = 0,
>   .num_sdma_queues_per_engine = 2,
> @@ -708,20 +717,6 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev 
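
The resulting runtime decision can be summarized in a sketch over the two
device_info fields the patch touches (field names as in the diff; everything
else is invented, and the real check lives in kgd2kfd_device_init):

#include <linux/types.h>

struct example_device_info {
	bool needs_pci_atomics;
	u32  no_atomic_fw_version;	/* 0: no firmware lifts the requirement */
};

static bool example_needs_pcie_atomics(const struct example_device_info *info,
				       u32 mec_fw_version)
{
	if (!info->needs_pci_atomics)
		return false;
	/* new-enough MEC firmware works without PCIe atomics */
	if (info->no_atomic_fw_version &&
	    mec_fw_version >= info->no_atomic_fw_version)
		return false;
	return true;
}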

RE: [PATCH] drm/amdgpu: Get atomicOps info from Host for sriov setup

2021-09-09 Thread Liu, Shaoyun
[AMD Official Use Only]

Thanks for the review. I accepted your comments and will send another 
change list for review once your change is in. 

Regards
Shaoyun.liu


-Original Message-
From: Kuehling, Felix  
Sent: Thursday, September 9, 2021 12:18 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Get atomicOps info from Host for sriov setup

Am 2021-09-09 um 11:59 a.m. schrieb shaoyunl:
> The AtomicOp Requester Enable bit is reserved in VFs, and the PF value 
> applies to all associated VFs, so the guest driver cannot directly enable 
> atomicOps for the VF; it depends on the PF to enable them. In the current 
> design, the amdgpu driver will get the enabled atomicOps bits through 
> private pf2vf data.
>
> Signed-off-by: shaoyunl 
> Change-Id: Ifdbcb4396d64e3f3cbf6bcbf7ab9c7b2cb061052
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 20 ++++++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h |  4 +++-
>  2 files changed, 21 insertions(+), 3 deletions(-)
>  mode change 100644 => 100755 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>  mode change 100644 => 100755 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> old mode 100644
> new mode 100755
> index 653bd8fdaa33..a0d2b9eb84fc
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2167,8 +2167,6 @@ static int amdgpu_device_ip_early_init(struct 
> amdgpu_device *adev)
>   return -EINVAL;
>   }
>  
> - amdgpu_amdkfd_device_probe(adev);
> -
>   adev->pm.pp_feature = amdgpu_pp_feature_mask;
>   if (amdgpu_sriov_vf(adev) || sched_policy == KFD_SCHED_POLICY_NO_HWS)
>   adev->pm.pp_feature &= ~PP_GFXOFF_MASK; @@ -3562,6 +3560,24 @@ 
> int 
> amdgpu_device_init(struct amdgpu_device *adev,
>   if (r)
>   return r;
>  
> + /* enable PCIE atomic ops */
> + if (amdgpu_sriov_bios(adev))
> + adev->have_atomics_support = (((struct amd_sriov_msg_pf2vf_info *)
> + adev->virt.fw_reserve.p_pf2vf)->pcie_atomic_ops_enabled_flags ==
> + (PCI_EXP_DEVCAP2_ATOMIC_COMP32 | PCI_EXP_DEVCAP2_ATOMIC_COMP64))
> + ? TRUE : FALSE;

Please don't use this "condition ? TRUE : FALSE" idiom. Just "condition"
is good enough.


> + else
> + adev->have_atomics_support =
> + pci_enable_atomic_ops_to_root(adev->pdev,
> +   PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
> +   PCI_EXP_DEVCAP2_ATOMIC_COMP64)
> + ? FALSE : TRUE;

Same as above, but in this case it's "!condition". Also, I would have expected 
that you remove the other call to pci_enable_atomic_ops_to_root from this 
function.


> + if (adev->have_atomics_support = false )

This should be "==", but even better would be "if
(!adev->have_atomics_support) ...

That said, the message below may be redundant. The PCIe atomic check in 
kgd2kfd_device_init already prints an error message if atomics are required by 
the GPU but not supported. If you really want to print it for information on 
GPUs where it's not required, use dev_info so the message clearly shows which 
GPU in a multi-GPU system it refers to.


> + DRM_INFO("PCIE atomic ops is not supported\n");
> +
> + amdgpu_amdkfd_device_probe(adev);

This should not be necessary. I just sent another patch for review that moves 
the PCIe atomic check in KFD into kgd2kfd_device_init:
"drm/amdkfd: make needs_pcie_atomics FW-version dependent". So 
amdgpu_amdkfd_device_probe can stay where it is, if you can wait a few days for 
my change to go in first.

Regards,
  Felix


> +
> +
>   /* doorbell bar mapping and doorbell index init*/
>   amdgpu_device_doorbell_init(adev);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
> old mode 100644
> new mode 100755
> index a434c71fde8e..995899191288
> --- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
> @@ -204,8 +204,10 @@ struct amd_sriov_msg_pf2vf_info {
>   } mm_bw_management[AMD_SRIOV_MSG_RESERVE_VCN_INST];
>   /* UUID info */
>   struct amd_sriov_msg_uuid_info uuid_info;
> + /* pcie atomic Ops info */
> + uint32_t pcie_atomic_ops_enabled_flags;
>   /* reserved */
> - uint32_t reserved[256 - 47];
> + uint32_t reserved[256 - 48];
>  };
>  
>  struct amd_sriov_msg_vf2pf_info_header {

RE: [PATCH] drm/amdgpu: correct MMSCH 1.0 version

2021-08-16 Thread Liu, Shaoyun
[AMD Official Use Only]

Looks ok to me . 

Reviewed by Shaoyun.liu 

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Monday, August 16, 2021 11:04 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH] drm/amdgpu: correct MMSCH 1.0 version

MMSCH 1.0 doesn't have major/minor version, only version.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h 
b/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h
index 20958639b601..2cdab8062c86 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h
+++ b/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h
@@ -24,9 +24,7 @@
 #ifndef __MMSCH_V1_0_H__
 #define __MMSCH_V1_0_H__
 
-#define MMSCH_VERSION_MAJOR1
-#define MMSCH_VERSION_MINOR0
-#define MMSCH_VERSION  (MMSCH_VERSION_MAJOR << 16 | MMSCH_VERSION_MINOR)
+#define MMSCH_VERSION  0x1
 
 enum mmsch_v1_0_command_type {
MMSCH_COMMAND__DIRECT_REG_WRITE = 0,
-- 
2.17.1


RE: [PATCH] drm/amdgpu: correct MMSCH version

2021-08-16 Thread Liu, Shaoyun
[AMD Official Use Only]

Is that information from the MM team? 
Please make sure it won't break the ASICs that use the same code path. Also, if 
this is true for all of MMSCH v1.0, you need to specify that this is MMSCH v1.0, 
since other MMSCH versions will still use this major and minor. 

Shaoyun.liu


-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Thursday, August 12, 2021 11:07 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH] drm/amdgpu: correct MMSCH version

MMSCH doesn't have major/minor version, only version.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h 
b/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h
index 20958639b601..2cdab8062c86 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h
+++ b/drivers/gpu/drm/amd/amdgpu/mmsch_v1_0.h
@@ -24,9 +24,7 @@
 #ifndef __MMSCH_V1_0_H__
 #define __MMSCH_V1_0_H__
 
-#define MMSCH_VERSION_MAJOR1
-#define MMSCH_VERSION_MINOR0
-#define MMSCH_VERSION  (MMSCH_VERSION_MAJOR << 16 | MMSCH_VERSION_MINOR)
+#define MMSCH_VERSION  0x1
 
 enum mmsch_v1_0_command_type {
MMSCH_COMMAND__DIRECT_REG_WRITE = 0,
-- 
2.17.1


RE: [PATCH 5/5] drm/amdgpu: allocate psp fw private buffer from VRAM for sriov vf

2021-06-03 Thread Liu, Shaoyun
[AMD Official Use Only]

I will leave  Hawking to comment on this serial . 

Thanks 
Shaoyun.liu

-Original Message-
From: Luo, Zhigang  
Sent: Thursday, June 3, 2021 11:48 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 5/5] drm/amdgpu: allocate psp fw private buffer from VRAM 
for sriov vf

All new PSP releases will have this feature, and it will not cause any failure 
even if the PSP doesn't have this feature yet.

Thanks,
Zhigang

-Original Message-
From: Liu, Shaoyun  
Sent: June 3, 2021 11:15 AM
To: Luo, Zhigang ; amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: RE: [PATCH 5/5] drm/amdgpu: allocate psp fw private buffer from VRAM 
for sriov vf

[AMD Official Use Only]

Please double-check whether this feature applies to all ASICs that PSP supports, 
since this does not only apply to ARCTURUS and ALDEBARAN. 

Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Thursday, June 3, 2021 10:13 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH 5/5] drm/amdgpu: allocate psp fw private buffer from VRAM for 
sriov vf

PSP added a new feature to check the FW buffer address for the SRIOV VF. The 
address range must be in the VF FB.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 6bd7e39c3e75..7c0f1017a46b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2320,11 +2320,20 @@ static int psp_load_fw(struct amdgpu_device *adev)
if (!psp->cmd)
return -ENOMEM;
 
-   ret = amdgpu_bo_create_kernel(adev, PSP_1_MEG, PSP_1_MEG,
-   AMDGPU_GEM_DOMAIN_GTT,
-   &psp->fw_pri_bo,
-   &psp->fw_pri_mc_addr,
-   &psp->fw_pri_buf);
+   if (amdgpu_sriov_vf(adev)) {
+   ret = amdgpu_bo_create_kernel(adev, PSP_1_MEG, PSP_1_MEG,
+   AMDGPU_GEM_DOMAIN_VRAM,
+   &psp->fw_pri_bo,
+   &psp->fw_pri_mc_addr,
+   &psp->fw_pri_buf);
+   } else {
+   ret = amdgpu_bo_create_kernel(adev, PSP_1_MEG, PSP_1_MEG,
+   AMDGPU_GEM_DOMAIN_GTT,
+   &psp->fw_pri_bo,
+   &psp->fw_pri_mc_addr,
+   &psp->fw_pri_buf);
+   }
+
if (ret)
goto failed;
 
--
2.17.1
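
Since the two branches above differ only in the memory domain, an equivalent,
slightly tidier shape of the same hunk would be (a sketch, not what was
posted):

	/* pick the domain first, allocate once */
	u32 domain = amdgpu_sriov_vf(adev) ?
		     AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT;

	ret = amdgpu_bo_create_kernel(adev, PSP_1_MEG, PSP_1_MEG,
				      domain,
				      &psp->fw_pri_bo,
				      &psp->fw_pri_mc_addr,
				      &psp->fw_pri_buf);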

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 4/5] drm/amdgpu: add psp microcode init for arcturus and aldebaran sriov vf

2021-06-03 Thread Liu, Shaoyun
[AMD Official Use Only]

This one doesn't look like it applies to the XGMI TA only; it's for the whole PSP 
init. Can you double check it? 


Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Thursday, June 3, 2021 10:13 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH 4/5] drm/amdgpu: add psp microcode init for arcturus and 
aldebaran sriov vf

Need to load the XGMI TA for Arcturus and Aldebaran SRIOV VFs.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 55378c6b9722..6bd7e39c3e75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -170,7 +170,8 @@ static int psp_sw_init(void *handle)
struct psp_context *psp = &adev->psp;
int ret;
 
-   if (!amdgpu_sriov_vf(adev)) {
+   if ((adev->asic_type == CHIP_ARCTURUS) ||
+   (adev->asic_type == CHIP_ALDEBARAN) || (!amdgpu_sriov_vf(adev))) {
ret = psp_init_microcode(psp);
if (ret) {
DRM_ERROR("Failed to load psp firmware!\n");
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 3/5] drm/amdgpu: remove sriov vf mmhub system aperture and fb location programming

2021-06-03 Thread Liu, Shaoyun
[AMD Official Use Only]

Looks ok to me .

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Thursday, June 3, 2021 10:13 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH 3/5] drm/amdgpu: remove sriov vf mmhub system aperture and fb 
location programming

The host driver programmed the MMHUB system aperture and FB location for the VF; 
there is no need to program them on the guest side.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
index 998e674f9369..f5f7181f9af5 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
@@ -111,6 +111,9 @@ static void mmhub_v1_7_init_system_aperture_regs(struct 
amdgpu_device *adev)
WREG32_SOC15(MMHUB, 0, regMC_VM_AGP_BOT, adev->gmc.agp_start >> 24);
WREG32_SOC15(MMHUB, 0, regMC_VM_AGP_TOP, adev->gmc.agp_end >> 24);
 
+   if (amdgpu_sriov_vf(adev))
+   return;
+
/* Program the system aperture low logical page number. */
WREG32_SOC15(MMHUB, 0, regMC_VM_SYSTEM_APERTURE_LOW_ADDR,
 min(adev->gmc.fb_start, adev->gmc.agp_start) >> 18);
@@ -129,8 +132,6 @@ static void mmhub_v1_7_init_system_aperture_regs(struct amdgpu_device *adev)
WREG32_SOC15(MMHUB, 0, regMC_VM_SYSTEM_APERTURE_LOW_ADDR, 0x3FFFFFFF);
WREG32_SOC15(MMHUB, 0, regMC_VM_SYSTEM_APERTURE_HIGH_ADDR, 0);
}
-   if (amdgpu_sriov_vf(adev))
-   return;
 
/* Set default page address. */
value = amdgpu_gmc_vram_mc2pa(adev, adev->vram_scratch.gpu_addr);
@@ -331,18 +332,6 @@ static void mmhub_v1_7_program_invalidation(struct amdgpu_device *adev)
 
 static int mmhub_v1_7_gart_enable(struct amdgpu_device *adev)
 {
-   if (amdgpu_sriov_vf(adev)) {
-   /*
-* MC_VM_FB_LOCATION_BASE/TOP is NULL for VF, becuase they are
-* VF copy registers so vbios post doesn't program them, for
-* SRIOV driver need to program them
-*/
-   WREG32_SOC15(MMHUB, 0, regMC_VM_FB_LOCATION_BASE,
-adev->gmc.vram_start >> 24);
-   WREG32_SOC15(MMHUB, 0, regMC_VM_FB_LOCATION_TOP,
-adev->gmc.vram_end >> 24);
-   }
-
/* GART Enable. */
mmhub_v1_7_init_gart_aperture_regs(adev);
mmhub_v1_7_init_system_aperture_regs(adev);
--
2.17.1
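
The pattern at work in this patch, sketched with invented names: under SRIOV
the guest returns early before touching host-owned registers, while setup that
is still guest-owned runs in both cases.

#include <linux/types.h>

struct example_dev { bool is_sriov_vf; };

static void example_program_agp_window(struct example_dev *dev)      { /* ... */ }
static void example_program_system_aperture(struct example_dev *dev) { /* ... */ }
static void example_program_default_page(struct example_dev *dev)    { /* ... */ }

static void example_init_system_aperture(struct example_dev *dev)
{
	example_program_agp_window(dev);	/* guest-visible on PF and VF */

	if (dev->is_sriov_vf)
		return;			/* host already programmed the rest */

	example_program_system_aperture(dev);	/* PF / bare-metal only */
	example_program_default_page(dev);
}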

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 2/5] drm/amdgpu: remove sriov vf gfxhub fb location programming

2021-06-03 Thread Liu, Shaoyun
[AMD Official Use Only]

This looks like it will affect other ASICs. Can you double check that? 

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Thursday, June 3, 2021 10:13 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH 2/5] drm/amdgpu: remove sriov vf gfxhub fb location programming

The host driver programmed the GFXHUB FB location for the VF; there is no need to 
program it on the guest side.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index 063e48df0b2d..f51fd0688eca 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -320,18 +320,6 @@ static void gfxhub_v1_0_program_invalidation(struct amdgpu_device *adev)
 
 static int gfxhub_v1_0_gart_enable(struct amdgpu_device *adev)
 {
-   if (amdgpu_sriov_vf(adev) && adev->asic_type != CHIP_ARCTURUS) {
-   /*
-* MC_VM_FB_LOCATION_BASE/TOP is NULL for VF, becuase they are
-* VF copy registers so vbios post doesn't program them, for
-* SRIOV driver need to program them
-*/
-   WREG32_SOC15_RLC(GC, 0, mmMC_VM_FB_LOCATION_BASE,
-adev->gmc.vram_start >> 24);
-   WREG32_SOC15_RLC(GC, 0, mmMC_VM_FB_LOCATION_TOP,
-adev->gmc.vram_end >> 24);
-   }
-
/* GART Enable. */
gfxhub_v1_0_init_gart_aperture_regs(adev);
gfxhub_v1_0_init_system_aperture_regs(adev);
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 1/5] drm/amdgpu: remove sriov vf checking from getting fb location

2021-06-03 Thread Liu, Shaoyun
[AMD Official Use Only]

Looks ok to me . 

Reviewed-By : Shaoyun.liu 

-Original Message-
From: amd-gfx  On Behalf Of Zhigang Luo
Sent: Thursday, June 3, 2021 10:13 AM
To: amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang 
Subject: [PATCH 1/5] drm/amdgpu: remove sriov vf checking from getting fb 
location

host driver programmed fb location registers for vf, no need to check anymore.

Signed-off-by: Zhigang Luo 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index ceb3968d8326..1c2d9fde9021 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1292,10 +1292,7 @@ static int gmc_v9_0_late_init(void *handle)
 static void gmc_v9_0_vram_gtt_location(struct amdgpu_device *adev,
struct amdgpu_gmc *mc)
 {
-   u64 base = 0;
-
-   if (!amdgpu_sriov_vf(adev))
-   base = adev->mmhub.funcs->get_fb_location(adev);
+   u64 base = adev->mmhub.funcs->get_fb_location(adev);
 
/* add the xgmi offset of the physical node */
base += adev->gmc.xgmi.physical_node_id * 
adev->gmc.xgmi.node_segment_size;
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

2021-03-12 Thread Liu, Shaoyun
[AMD Public Use]

Hi , Lijo
For now we only enable this light SBR feature for SMU on XGMI + passthrough 
for Arcturus, since this is the use case our customer requires and the only 
configuration we have verified. I feel it's more reasonable to keep the 
logic of enable/disable on the amdgpu side instead of inside the SMU. We might 
need to support SBR in bare-metal mode in the future. Also, we plan to 
use this feature on future ASICs if the default HW reset method doesn't work 
stably. Your suggestion would move the enable/disable logic into the SMU 
internally, which invalidates our original design.

Regards
Shaoyun.liu


From: Lazar, Lijo 
Sent: Friday, March 12, 2021 11:54 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org; Quan, 
Evan ; Zhang, Hawking 
Subject: RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration


[AMD Public Use]

Looks like this can be handled during post_init. It will be called as 
smu_post_init(), which happens during the late_init part of the smu block. You can 
check the vangogh or navi examples for how to add your implementation.

Thanks,
Lijo

From: Liu, Shaoyun mailto:shaoyun@amd.com>>
Sent: Friday, March 12, 2021 8:57 PM
To: Lazar, Lijo mailto:lijo.la...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Quan, Evan 
mailto:evan.q...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration


[AMD Public Use]

I don't like adding this set_light_sbr into ppt_funcs either, but please 
check the current swsmu code structure: there is no ASIC-specific swsmu late 
init function, and there is no direct routine from amdgpu_smu.c to 
smu_v11_0.c either. It requires smu common code -> ppt_func -> smu_v11_0 for an 
Arcturus-specific function. So unless SMU and PPT have a major restructure, 
set_light_sbr needs to go through ppt_func for now. I think I had better leave 
this restructuring task to the SMU and PPT owners in the future.

Add  SMU and  PPT code owner  Hawking  and Quan for comments .

Regards
Shaoyun.liu


From: Lazar, Lijo mailto:lijo.la...@amd.com>>
Sent: Friday, March 12, 2021 8:55 AM
To: Liu, Shaoyun mailto:shaoyun@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration


[AMD Public Use]

We want to keep ppt_funcs minimal. Adding everything to ppt_funcs and keeping 
as NULL is not the right way. Please keep the code to arcturus.

Thanks,
Lijo

From: Liu, Shaoyun mailto:shaoyun@amd.com>>
Sent: Friday, March 12, 2021 7:21 PM
To: Lazar, Lijo mailto:lijo.la...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration

Thanks for the comments. This light SBR solution could be applied to other ASICs 
as well. In the swsmu code, it will check whether the function pointer 
set_light_sbr is valid before really calling the function. So for other ASICs, if 
the SMU applies the same change, just add the ppt function pointer and we will 
have this support without further code change.

Thanks
Shaoyun.liu


From: Lazar, Lijo mailto:lijo.la...@amd.com>>
Sent: March 11, 2021 10:42 PM
To: Liu, Shaoyun mailto:shaoyun@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Liu, Shaoyun mailto:shaoyun@amd.com>>
Subject: RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration

[AMD Public Use]

We don't need this as a generic ppt_func. Reset functionalities are changing 
over programs and this could be valid only for Arcturus. Please move it to 
Arcturus swsmu late init.

Thanks,
Lijo

-Original Message-
From: amd-gfx 
mailto:amd-gfx-boun...@lists.freedesktop.org>>
 On Behalf Of shaoyunl
Sent: Thursday, March 11, 2021 10:46 PM
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Liu, Shaoyun mailto:shaoyun@amd.com>>
Subject: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

This is to fix commit dda9bbb26c7, where it only enabled light SBR on 
normal device init. This feature actually needs to be enabled after the ASIC has 
been reset as well.

Signed-off-by: shaoyunl mailto:shaoyun@amd.com>>
Change-Id: Ie7ee02cd3ccdab3522aad9a02f681963e211ed44
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cada3e77c7d5..fb775a9c0db1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2513,6 +2513,9 @@ static int amdgpu_device_
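
The reason the move fixes the reset case, as a sketch (invented names; the real
call sites are amdgpu_device_init and the recovery path): late_init runs both
at boot and again after an ASIC reset, so anything enabled there is re-enabled
automatically.

#include <linux/types.h>

struct example_dev { bool passthrough; int xgmi_nodes; };

static void example_enable_light_sbr(struct example_dev *dev) { /* SMU message */ }

static int example_ip_late_init(struct example_dev *dev)
{
	if (dev->passthrough && dev->xgmi_nodes > 1)
		example_enable_light_sbr(dev);	/* covers boot AND post-reset */
	return 0;
}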

RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

2021-03-12 Thread Liu, Shaoyun
[AMD Public Use]

I don't like adding this set_light_sbr into ppt_funcs either, but please 
check the current swsmu code structure: there is no ASIC-specific swsmu late 
init function, and there is no direct routine from amdgpu_smu.c to 
smu_v11_0.c either. It requires smu common code -> ppt_func -> smu_v11_0 for an 
Arcturus-specific function. So unless SMU and PPT have a major restructure, 
set_light_sbr needs to go through ppt_func for now. I think I had better leave 
this restructuring task to the SMU and PPT owners in the future.

Add  SMU and  PPT code owner  Hawking  and Quan for comments .

Regards
Shaoyun.liu


From: Lazar, Lijo 
Sent: Friday, March 12, 2021 8:55 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration


[AMD Public Use]

We want to keep ppt_funcs minimal. Adding everything to ppt_funcs and keeping 
as NULL is not the right way. Please keep the code to arcturus.

Thanks,
Lijo

From: Liu, Shaoyun mailto:shaoyun@amd.com>>
Sent: Friday, March 12, 2021 7:21 PM
To: Lazar, Lijo mailto:lijo.la...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration

Thanks for the comments. This light SBR solution could be applied to other ASICs 
as well. In the swsmu code, it will check whether the function pointer 
set_light_sbr is valid before really calling the function. So for other ASICs, if 
the SMU applies the same change, just add the ppt function pointer and we will 
have this support without further code change.

Thanks
Shaoyun.liu


From: Lazar, Lijo mailto:lijo.la...@amd.com>>
Sent: March 11, 2021 10:42 PM
To: Liu, Shaoyun mailto:shaoyun@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Liu, Shaoyun mailto:shaoyun@amd.com>>
Subject: RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration

[AMD Public Use]

We don't need this as a generic ppt_func. Reset functionalities are changing 
over programs and this could be valid only for Arcturus. Please move it to 
Arcturus swsmu late init.

Thanks,
Lijo

-Original Message-
From: amd-gfx 
mailto:amd-gfx-boun...@lists.freedesktop.org>>
 On Behalf Of shaoyunl
Sent: Thursday, March 11, 2021 10:46 PM
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Liu, Shaoyun mailto:shaoyun@amd.com>>
Subject: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

This is to fix commit dda9bbb26c7, where it only enabled light SBR on 
normal device init. This feature actually needs to be enabled after the ASIC has 
been reset as well.

Signed-off-by: shaoyunl mailto:shaoyun@amd.com>>
Change-Id: Ie7ee02cd3ccdab3522aad9a02f681963e211ed44
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cada3e77c7d5..fb775a9c0db1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2513,6 +2513,9 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
 if (r)
 DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);

+   /* For XGMI + passthrough configuration , enable light SBR */
+   if (amdgpu_passthrough(adev) && adev->gmc.xgmi.num_physical_nodes > 1)
+   smu_set_light_sbr(&adev->smu, true);

 if (adev->gmc.xgmi.num_physical_nodes > 1) {
 mutex_lock(&mgpu_info.mutex);
@@ -3615,10 +3618,6 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 if (amdgpu_device_cache_pci_state(adev->pdev))
 pci_restore_state(pdev);

-   /* Enable lightSBR on SMU in passthrough + xgmi configuration */
-   if (amdgpu_passthrough(adev) && adev->gmc.xgmi.num_physical_nodes > 1)
-   smu_set_light_sbr(&adev->smu, true);
-
 if (adev->gmc.xgmi.pending_reset)
 queue_delayed_work(system_wq, _info.delayed_reset_work,
msecs_to_jiffies(AMDGPU_RESUME_MS));
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

2021-03-12 Thread Liu, Shaoyun
Thanks for the comments. This light SBR solution could be applied to other ASICs 
as well. In the swsmu code, it will check whether the function pointer 
set_light_sbr is valid before really calling the function. So for other ASICs, if 
the SMU applies the same change, just add the ppt function pointer and we will 
have this support without further code change.

Thanks
Shaoyun.liu
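
The dispatch described above reduces to a guarded indirect call; a simplified
sketch of that shape (smu_context/ppt_funcs trimmed down to the one hook, with
example_* names standing in for the real structs):

#include <linux/types.h>

struct example_smu;

struct example_ppt_funcs {
	int (*set_light_sbr)(struct example_smu *smu, bool enable);
};

struct example_smu {
	const struct example_ppt_funcs *ppt_funcs;
};

int example_smu_set_light_sbr(struct example_smu *smu, bool enable)
{
	int ret = 0;

	/* ASICs that don't implement the hook silently succeed */
	if (smu->ppt_funcs && smu->ppt_funcs->set_light_sbr)
		ret = smu->ppt_funcs->set_light_sbr(smu, enable);

	return ret;
}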


From: Lazar, Lijo 
Sent: March 11, 2021 10:42 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org 

Cc: Liu, Shaoyun 
Subject: RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough 
configuration

[AMD Public Use]

We don't need this as a generic ppt_func. Reset functionalities are changing 
over programs and this could be valid only for Arcturus. Please move it to 
Arcturus swsmu late init.

Thanks,
Lijo

-Original Message-
From: amd-gfx  On Behalf Of shaoyunl
Sent: Thursday, March 11, 2021 10:46 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

This is to fix commit dda9bbb26c7, where it only enabled light SBR on 
normal device init. This feature actually needs to be enabled after the ASIC has 
been reset as well.

Signed-off-by: shaoyunl 
Change-Id: Ie7ee02cd3ccdab3522aad9a02f681963e211ed44
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cada3e77c7d5..fb775a9c0db1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2513,6 +2513,9 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
 if (r)
 DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);

+   /* For XGMI + passthrough configuration , enable light SBR */
+   if (amdgpu_passthrough(adev) && adev->gmc.xgmi.num_physical_nodes > 1)
+   smu_set_light_sbr(&adev->smu, true);

 if (adev->gmc.xgmi.num_physical_nodes > 1) {
  mutex_lock(&mgpu_info.mutex);
@@ -3615,10 +3618,6 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 if (amdgpu_device_cache_pci_state(adev->pdev))
 pci_restore_state(pdev);

-   /* Enable lightSBR on SMU in passthrough + xgmi configuration */
-   if (amdgpu_passthrough(adev) && adev->gmc.xgmi.num_physical_nodes > 1)
-   smu_set_light_sbr(&adev->smu, true);
-
 if (adev->gmc.xgmi.pending_reset)
 queue_delayed_work(system_wq, _info.delayed_reset_work,
msecs_to_jiffies(AMDGPU_RESUME_MS));
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

2021-03-11 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Ping . 

-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, March 11, 2021 12:16 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH] drm/amdgpu: Enable light SBR in XGMI+passthrough configuration

This is to fix commit dda9bbb26c7, where it only enabled light SBR on 
normal device init. This feature actually needs to be enabled after the ASIC has 
been reset as well.

Signed-off-by: shaoyunl 
Change-Id: Ie7ee02cd3ccdab3522aad9a02f681963e211ed44
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cada3e77c7d5..fb775a9c0db1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2513,6 +2513,9 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (r)
DRM_ERROR("enable mgpu fan boost failed (%d).\n", r);
 
+   /* For XGMI + passthrough configuration , enable light SBR */
+   if (amdgpu_passthrough(adev) && adev->gmc.xgmi.num_physical_nodes > 1)
+   smu_set_light_sbr(&adev->smu, true);
 
if (adev->gmc.xgmi.num_physical_nodes > 1) {
 mutex_lock(&mgpu_info.mutex);
@@ -3615,10 +3618,6 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (amdgpu_device_cache_pci_state(adev->pdev))
pci_restore_state(pdev);
 
-   /* Enable lightSBR on SMU in passthrough + xgmi configuration */
-   if (amdgpu_passthrough(adev) && adev->gmc.xgmi.num_physical_nodes > 1)
-   smu_set_light_sbr(&adev->smu, true);
-
if (adev->gmc.xgmi.pending_reset)
 queue_delayed_work(system_wq, &mgpu_info.delayed_reset_work,
   msecs_to_jiffies(AMDGPU_RESUME_MS));
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng probe

2021-03-08 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey. 
The first 3 patches in this series have already been acked by Alex D. Can you help 
review the remaining two? 

Thanks
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey  
Sent: Monday, March 8, 2021 10:53 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng 
probe

I see, thanks for explaning.

Andrey

On 2021-03-08 10:27 a.m., Liu, Shaoyun wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> Check the function amdgpu_xgmi_add_device: when the PSP XGMI TA is not available, 
> the driver will assign a faked hive ID 0x10 for all GPUs, which means all 
> GPUs will belong to one same hive. So I can still use hive->tb to sync the 
> reset on all GPUs. The reason I cannot use the default 
> amdgpu_do_asic_reset function is because we want to build the correct hive and 
> node topology for all GPUs after reset, so we need to call 
> amdgpu_xgmi_add_device inside the amdgpu_do_asic_reset function. To make 
> this work, we need to destroy the hive by removing the devices (calling 
> amdgpu_xgmi_remove_device) first, so when calling amdgpu_do_asic_reset, 
> the faked hive (0x10) has already been destroyed, and the hive->tb will not 
> work in this case. That's the reason I need to call the reset explicitly 
> with the faked hive and then destroy the hive, building the device_list for 
> amdgpu_do_asic_reset without the hive.
> Hope I explained it clearly.
> 
> Thanks
> Shaoyun.liu
> 
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Monday, March 8, 2021 1:28 AM
> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive during probe
> 
> But the hive->tb object is used regardless, inside
> amdgpu_device_xgmi_reset_func currently; it means that even when you
> explicitly schedule xgmi_reset_work as you do now, the code will try to
> sync using a not-well-initialized tb object. Maybe you can define a global
> static tb object, fill it in the loop where you send xgmi_reset_work for
> all devices in the system, and use it from within
> amdgpu_device_xgmi_reset_func instead of the regular per-hive tb object
> (obviously under your special use case).
> 
> Andrey
> 
> On 2021-03-06 4:11 p.m., Liu, Shaoyun wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> It seems I cannot directly reuse the reset HW function inside
>> amdgpu_do_asic_reset; the synchronization is based on hive->tb, but as
>> explained, we actually don't know which hive the GPU belongs to and will
>> rebuild the correct hive info inside the amdgpu_do_asic_reset function
>> with amdgpu_xgmi_add_device. So I need to remove all GPUs from the hive
>> first. This means the sync won't work, since hive->tb is removed as well
>> once all GPUs are removed.
>>
>> Thanks
>> Shaoyun.liu
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of 
>> Liu, Shaoyun
>> Sent: Saturday, March 6, 2021 3:41 PM
>> To: Grodzovsky, Andrey ; 
>> amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
>> hive during probe
>>
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> I call amdgpu_do_asic_reset with the parameter skip_hw_reset = true, so
>> the reset won't be executed twice. But probably I can drop the explicit
>> reset scheduling and let amdgpu_do_asic_reset perform the reset instead,
>> since I now build the device_list not based on the hive. Let me try that.
>> As for scheduling the delayed work thread with AMDGPU_RESUME_MS: it is not
>> actually waiting for the SMU to start. As I explained, I need to reset all
>> the GPUs in the system since I don't know which GPUs belong to which hive,
>> so this delay allows the system to probe all the GPUs, which means that
>> when the delayed thread starts, we can assume all the devices have already
>> been populated in mgpu_info.
>>
>> Regards
>> Shaoyun.liu
>>
>> -Original Message-
>> From: Grodzovsky, Andrey 
>> Sent: Saturday, March 6, 2021 1:09 AM
>> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
>> hive during probe
>>
>> Thanks for explaining this. One thing I still don't understand is why you
>> schedule the reset work explicitly at the beginning of
>> amdgpu_drv_delayed_reset_work_handler and then also call
>> amdgpu_do_asic_reset, which will do the same thing too.

RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-03-08 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Check the function amdgpu_xgmi_add_device: when the PSP XGMI TA is not
available, the driver will assign a faked hive ID 0x10 for all GPUs, which
means all GPUs will belong to the same hive, so I can still use hive->tb to
sync the reset on all GPUs. The reason I cannot use the default
amdgpu_do_asic_reset function is that we want to build the correct hive and
node topology for all GPUs after the reset, so we need to call
amdgpu_xgmi_add_device inside the amdgpu_do_asic_reset function. To make this
work, we need to destroy the hive first by removing the devices (call
amdgpu_xgmi_remove_device), so by the time amdgpu_do_asic_reset is called,
the faked hive (0x10) has already been destroyed and hive->tb will not work.
That's the reason I need to call the reset explicitly with the faked hive,
then destroy the hive and build the device_list for amdgpu_do_asic_reset
without the hive.
Hope I explained it clearly.

Thanks 
Shaoyun.liu
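
As a rough sketch of the barrier idea being discussed (the task_barrier_*
helpers are the ones amdgpu already uses for hive-wide BACO sync; the shared
barrier and function name here are illustrative assumptions, not the actual
patch):

    /* Sketch only: every probed GPU registers with one shared barrier,
     * so all of them enter the reset inside the same time slot even
     * when the real hive topology is unknown. */
    static struct task_barrier example_tb;

    static void example_xgmi_reset_func(struct amdgpu_device *adev)
    {
            /* each device was added earlier via
             * task_barrier_add_task(&example_tb) */
            task_barrier_enter(&example_tb); /* wait for all devices */
            amdgpu_asic_reset(adev);         /* same window on every GPU */
            task_barrier_exit(&example_tb);
    }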

-Original Message-
From: Grodzovsky, Andrey  
Sent: Monday, March 8, 2021 1:28 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during 
probe

But the hive->tb object is used regardless, inside
amdgpu_device_xgmi_reset_func currently; it means that even when you
explicitly schedule xgmi_reset_work as you do now, the code will try to sync
using a not-well-initialized tb object. Maybe you can define a global static
tb object, fill it in the loop where you send xgmi_reset_work for all devices
in the system, and use it from within amdgpu_device_xgmi_reset_func instead
of the regular per-hive tb object (obviously under your special use case).

Andrey

On 2021-03-06 4:11 p.m., Liu, Shaoyun wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> It seems I cannot directly reuse the reset HW function inside
> amdgpu_do_asic_reset; the synchronization is based on hive->tb, but as
> explained, we actually don't know which hive the GPU belongs to and will
> rebuild the correct hive info inside the amdgpu_do_asic_reset function with
> amdgpu_xgmi_add_device. So I need to remove all GPUs from the hive first.
> This means the sync won't work, since hive->tb is removed as well once all
> GPUs are removed.
> 
> Thanks
> Shaoyun.liu
> 
> -----Original Message-
> From: amd-gfx  On Behalf Of 
> Liu, Shaoyun
> Sent: Saturday, March 6, 2021 3:41 PM
> To: Grodzovsky, Andrey ; 
> amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive during probe
> 
> [AMD Official Use Only - Internal Distribution Only]
> 
> I call amdgpu_do_asic_reset with the parameter skip_hw_reset = true, so the
> reset won't be executed twice. But probably I can drop the explicit reset
> scheduling and let amdgpu_do_asic_reset perform the reset instead, since I
> now build the device_list not based on the hive. Let me try that.
> As for scheduling the delayed work thread with AMDGPU_RESUME_MS: it is not
> actually waiting for the SMU to start. As I explained, I need to reset all
> the GPUs in the system since I don't know which GPUs belong to which hive,
> so this delay allows the system to probe all the GPUs, which means that
> when the delayed thread starts, we can assume all the devices have already
> been populated in mgpu_info.
> 
> Regards
> Shaoyun.liu
> 
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Saturday, March 6, 2021 1:09 AM
> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive during probe
> 
> Thanks for explaining this. One thing I still don't understand is why you
> schedule the reset work explicitly at the beginning of
> amdgpu_drv_delayed_reset_work_handler and then also call
> amdgpu_do_asic_reset, which will do the same thing too. It looks like the
> physical reset will execute twice for each device.
> Another thing, more of an improvement suggestion: currently you schedule
> delayed_reset_work using AMDGPU_RESUME_MS, so I guess this should give
> enough time for the SMU to start? Is there maybe a way to instead poll for
> SMU start completion and then execute this, some SMU status registers
> maybe? Just to avoid relying on this arbitrary value.
> 
> Andrey
> 
> On 2021-03-05 8:37 p.m., Liu, Shaoyun wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Andrey.
>> The existing reset functions (amdgpu_device_gpu_recover or
>> amdgpu_do_asic_reset) assume the driver already has the correct hive info.
>> But in my case that's not true. The GPUs are in a bad state and the XGMI
>> TA might not
>>

RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-03-06 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

It seems I cannot directly reuse the reset HW function inside
amdgpu_do_asic_reset; the synchronization is based on hive->tb, but as
explained, we actually don't know which hive the GPU belongs to and will
rebuild the correct hive info inside the amdgpu_do_asic_reset function with
amdgpu_xgmi_add_device. So I need to remove all GPUs from the hive first.
This means the sync won't work, since hive->tb is removed as well once all
GPUs are removed.

Thanks
Shaoyun.liu
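
To make the ordering constraint above concrete, a compact sketch (assuming
the usual gmc.xgmi.head list node; amdgpu_xgmi_remove_device and
amdgpu_xgmi_add_device are the real helpers, the surrounding shape is
illustrative only):

    /* Sketch only: the faked hive must be torn down before
     * amdgpu_do_asic_reset() can rebuild the real topology via
     * amdgpu_xgmi_add_device(), so the reset sync cannot rely on
     * hive->tb at that point. */
    static void example_teardown_faked_hive(struct list_head *device_list)
    {
            struct amdgpu_device *tmp_adev;

            list_for_each_entry(tmp_adev, device_list, gmc.xgmi.head)
                    amdgpu_xgmi_remove_device(tmp_adev); /* hive->tb gone */
    }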

-Original Message-
From: amd-gfx  On Behalf Of Liu, Shaoyun
Sent: Saturday, March 6, 2021 3:41 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during 
probe

[AMD Official Use Only - Internal Distribution Only]

I call amdgpu_do_asic_reset with the parameter skip_hw_reset = true, so the
reset won't be executed twice. But probably I can drop the explicit reset
scheduling and let amdgpu_do_asic_reset perform the reset instead, since I
now build the device_list not based on the hive. Let me try that.
As for scheduling the delayed work thread with AMDGPU_RESUME_MS: it is not
actually waiting for the SMU to start. As I explained, I need to reset all
the GPUs in the system since I don't know which GPUs belong to which hive,
so this delay allows the system to probe all the GPUs, which means that when
the delayed thread starts, we can assume all the devices have already been
populated in mgpu_info.

Regards
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey 
Sent: Saturday, March 6, 2021 1:09 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during 
probe

Thanks for explaining this. One thing I still don't understand is why you
schedule the reset work explicitly at the beginning of
amdgpu_drv_delayed_reset_work_handler and then also call amdgpu_do_asic_reset,
which will do the same thing too. It looks like the physical reset will
execute twice for each device.
Another thing, more of an improvement suggestion: currently you schedule
delayed_reset_work using AMDGPU_RESUME_MS, so I guess this should give enough
time for the SMU to start? Is there maybe a way to instead poll for SMU start
completion and then execute this, some SMU status registers maybe? Just to
avoid relying on this arbitrary value.

Andrey
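
As a rough illustration of the polling idea suggested above (the register
choice and the ready condition below are assumptions for the sketch, not a
verified SMU interface):

    /* Sketch only: poll an SMU status register with a timeout instead
     * of sleeping a fixed AMDGPU_RESUME_MS. The use of the C2PMSG_90
     * response register as an "SMU is up" indication is hypothetical. */
    static int example_wait_for_smu(struct amdgpu_device *adev)
    {
            int i;

            for (i = 0; i < 1000; i++) {
                    if (RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90))
                            return 0;       /* SMU responded */
                    msleep(1);
            }
            return -ETIMEDOUT;
    }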

On 2021-03-05 8:37 p.m., Liu, Shaoyun wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> Hi, Andrey.
> The existing reset functions (amdgpu_device_gpu_recover or
> amdgpu_do_asic_reset) assume the driver already has the correct hive info.
> But in my case that's not true. The GPUs are in a bad state and the XGMI TA
> might not function properly, so the driver cannot get the hive and node
> info when probing the device. That means the driver doesn't even know which
> hive the device belongs to on a system with a multiple-hive configuration
> (e.g., 8 GPUs in two hives). The only solution I can think of is to let the
> driver trigger the reset on all GPUs at the same time, after the driver
> does the minimum initialization on the HW (bringing up the SMU IP), no
> matter whether they belong to the same hive or not, and to call
> amdgpu_xgmi_add_device for each device after re-init.
> The 100 ms delay was added after the BACO reset; I think it can be removed.
> Let me verify it.
> 
> Regards
> Shaoyun.liu
> 
> 
> 
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Friday, March 5, 2021 2:27 PM
> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive during probe
> 
> 
> 
> On 2021-03-05 12:52 p.m., shaoyunl wrote:
> In a passthrough configuration, the hypervisor will trigger the
> SBR (Secondary Bus Reset) to the devices without syncing with each other.
> This can hang the devices since, in an XGMI configuration, all the devices
> within the hive need to be reset within a limited time slot. This series of
> patches tries to solve the issue by cooperating with a new SMU which only
> does minimal housekeeping in response to the SBR request, without doing the
> real reset job, leaving that to the driver. The driver needs to do the
> whole SW init and a minimal HW init to bring up the SMU and trigger the
> reset (possibly BACO) on all the ASICs at the same time
>>
>> Signed-off-by: shaoyunl 
>> Change-Id: I34e838e611b7623c7ad824704c7ce350808014fc
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu.h|  13 +++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 102 +++--
>>drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  71 ++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h|   1 +
>>drivers

RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-03-06 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

I call amdgpu_do_asic_reset with the parameter skip_hw_reset = true, so the
reset won't be executed twice. But probably I can drop the explicit reset
scheduling and let amdgpu_do_asic_reset perform the reset instead, since I
now build the device_list not based on the hive. Let me try that.
As for scheduling the delayed work thread with AMDGPU_RESUME_MS: it is not
actually waiting for the SMU to start. As I explained, I need to reset all
the GPUs in the system since I don't know which GPUs belong to which hive,
so this delay allows the system to probe all the GPUs, which means that when
the delayed thread starts, we can assume all the devices have already been
populated in mgpu_info.

Regards
Shaoyun.liu
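
For readers following the thread, the two-phase split being discussed looks
roughly like this (a sketch of the delayed worker's shape, assuming the
gmc.xgmi.head list node and the xgmi_reset_work member; not the final patch):

    /* Sketch only: phase 1 kicks the HW reset on every probed device at
     * once; phase 2 calls amdgpu_do_asic_reset() purely for the SW
     * re-init (skip_hw_reset = true), so the physical reset is not
     * executed twice. */
    static int example_delayed_reset_handler(struct list_head *device_list)
    {
            struct amdgpu_device *tmp_adev;
            bool need_full_reset = true;

            list_for_each_entry(tmp_adev, device_list, gmc.xgmi.head)
                    schedule_work(&tmp_adev->xgmi_reset_work);
            list_for_each_entry(tmp_adev, device_list, gmc.xgmi.head)
                    flush_work(&tmp_adev->xgmi_reset_work);

            return amdgpu_do_asic_reset(NULL, device_list,
                                        &need_full_reset, true);
    }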

-Original Message-
From: Grodzovsky, Andrey  
Sent: Saturday, March 6, 2021 1:09 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during 
probe

Thanks for explaining this. One thing I still don't understand is why you
schedule the reset work explicitly at the beginning of
amdgpu_drv_delayed_reset_work_handler and then also call amdgpu_do_asic_reset,
which will do the same thing too. It looks like the physical reset will
execute twice for each device.
Another thing, more of an improvement suggestion: currently you schedule
delayed_reset_work using AMDGPU_RESUME_MS, so I guess this should give enough
time for the SMU to start? Is there maybe a way to instead poll for SMU start
completion and then execute this, some SMU status registers maybe? Just to
avoid relying on this arbitrary value.

Andrey

On 2021-03-05 8:37 p.m., Liu, Shaoyun wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> Hi, Andrey.
> The existing reset functions (amdgpu_device_gpu_recover or
> amdgpu_do_asic_reset) assume the driver already has the correct hive info.
> But in my case that's not true. The GPUs are in a bad state and the XGMI TA
> might not function properly, so the driver cannot get the hive and node
> info when probing the device. That means the driver doesn't even know which
> hive the device belongs to on a system with a multiple-hive configuration
> (e.g., 8 GPUs in two hives). The only solution I can think of is to let the
> driver trigger the reset on all GPUs at the same time, after the driver
> does the minimum initialization on the HW (bringing up the SMU IP), no
> matter whether they belong to the same hive or not, and to call
> amdgpu_xgmi_add_device for each device after re-init.
> The 100 ms delay was added after the BACO reset; I think it can be removed.
> Let me verify it.
> 
> Regards
> Shaoyun.liu
> 
> 
> 
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Friday, March 5, 2021 2:27 PM
> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive during probe
> 
> 
> 
> On 2021-03-05 12:52 p.m., shaoyunl wrote:
>> In a passthrough configuration, the hypervisor will trigger the
>> SBR (Secondary Bus Reset) to the devices without syncing with each other.
>> This can hang the devices since, in an XGMI configuration, all the devices
>> within the hive need to be reset within a limited time slot. This series
>> of patches tries to solve the issue by cooperating with a new SMU which
>> only does minimal housekeeping in response to the SBR request, without
>> doing the real reset job, leaving that to the driver. The driver needs to
>> do the whole SW init and a minimal HW init to bring up the SMU and trigger
>> the reset (possibly BACO) on all the ASICs at the same time
>>
>> Signed-off-by: shaoyunl 
>> Change-Id: I34e838e611b7623c7ad824704c7ce350808014fc
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu.h|  13 +++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 102 +++--
>>drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  71 ++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h|   1 +
>>drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |   8 +-
>>5 files changed, 165 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index d46d3794699e..5602c6edee97 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -125,6 +125,10 @@ struct amdgpu_mgpu_info
>>  uint32_tnum_gpu;
>>  uint32_tnum_dgpu;
>>  uint32_tnum_apu;
>> +
>> +/* delayed reset_func for XGMI configuration if necessary */
>> +struct delayed_work delayed_reset_work;
>> +boolpending_reset;
>>

RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-03-05 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey.
The existing reset functions (amdgpu_device_gpu_recover or
amdgpu_do_asic_reset) assume the driver already has the correct hive info.
But in my case that's not true. The GPUs are in a bad state and the XGMI TA
might not function properly, so the driver cannot get the hive and node info
when probing the device. That means the driver doesn't even know which hive
the device belongs to on a system with a multiple-hive configuration (e.g.,
8 GPUs in two hives). The only solution I can think of is to let the driver
trigger the reset on all GPUs at the same time, after the driver does the
minimum initialization on the HW (bringing up the SMU IP), no matter whether
they belong to the same hive or not, and to call amdgpu_xgmi_add_device for
each device after re-init.
The 100 ms delay was added after the BACO reset; I think it can be removed.
Let me verify it.

Regards
Shaoyun.liu 
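
The "minimum initialization" above corresponds to bringing up only the
blocks the SMU depends on; condensed into a sketch, the filter mirrors the
hunks in the patch quoted below (the function name is an assumption):

    /* Sketch only: with a hive reset pending, hw_init just enough IPs
     * (COMMON/GMC/IH/SMC) for the SMU to accept the reset request. */
    static void example_minimal_hw_init(struct amdgpu_device *adev)
    {
            int i;

            for (i = 0; i < adev->num_ip_blocks; i++) {
                    enum amd_ip_block_type t =
                            adev->ip_blocks[i].version->type;

                    if (!adev->ip_blocks[i].status.valid)
                            continue;
                    if (t != AMD_IP_BLOCK_TYPE_COMMON &&
                        t != AMD_IP_BLOCK_TYPE_GMC &&
                        t != AMD_IP_BLOCK_TYPE_IH &&
                        t != AMD_IP_BLOCK_TYPE_SMC)
                            continue;
                    /* init only this block; the rest waits until after
                     * the hive-wide reset */
                    adev->ip_blocks[i].version->funcs->hw_init((void *)adev);
            }
    }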



-Original Message-
From: Grodzovsky, Andrey  
Sent: Friday, March 5, 2021 2:27 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during 
probe



On 2021-03-05 12:52 p.m., shaoyunl wrote:
> In a passthrough configuration, the hypervisor will trigger the
> SBR (Secondary Bus Reset) to the devices without syncing with each other.
> This can hang the devices since, in an XGMI configuration, all the devices
> within the hive need to be reset within a limited time slot. This series of
> patches tries to solve the issue by cooperating with a new SMU which only
> does minimal housekeeping in response to the SBR request, without doing the
> real reset job, leaving that to the driver. The driver needs to do the
> whole SW init and a minimal HW init to bring up the SMU and trigger the
> reset (possibly BACO) on all the ASICs at the same time
> 
> Signed-off-by: shaoyunl 
> Change-Id: I34e838e611b7623c7ad824704c7ce350808014fc
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h|  13 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 102 +++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  71 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h|   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |   8 +-
>   5 files changed, 165 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index d46d3794699e..5602c6edee97 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -125,6 +125,10 @@ struct amdgpu_mgpu_info
>   uint32_tnum_gpu;
>   uint32_tnum_dgpu;
>   uint32_tnum_apu;
> +
> + /* delayed reset_func for XGMI configuration if necessary */
> + struct delayed_work delayed_reset_work;
> + boolpending_reset;
>   };
>   
>   #define AMDGPU_MAX_TIMEOUT_PARAM_LENGTH 256
> @@ -1124,6 +1128,15 @@ void amdgpu_device_indirect_wreg64(struct 
> amdgpu_device *adev,
>   bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type);
>   bool amdgpu_device_has_dc_support(struct amdgpu_device *adev);
>   
> +int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> +   struct amdgpu_job *job,
> +   bool *need_full_reset_arg);
> +
> +int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
> +   struct list_head *device_list_handle,
> +   bool *need_full_reset_arg,
> +   bool skip_hw_reset);
> +
>   int emu_soc_asic_init(struct amdgpu_device *adev);
>   
>   /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3c35b0c1e710..5b520f70e660 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1220,6 +1220,10 @@ bool amdgpu_device_need_post(struct amdgpu_device 
> *adev)
>   }
>   }
>   
> + /* Don't post if we need to reset whole hive on init */
> + if (adev->gmc.xgmi.pending_reset)
> + return false;
> +
>   if (adev->has_hw_reset) {
>   adev->has_hw_reset = false;
>   return true;
> @@ -2149,6 +2153,9 @@ static int amdgpu_device_fw_loading(struct 
> amdgpu_device *adev)
>   if (adev->ip_blocks[i].version->type != 
> AMD_IP_BLOCK_TYPE_PSP)
>   continue;
>   
> + if (!adev->ip_blocks[i].status.sw)
> + continue;
> +
>   /* no need to do the fw loading again if already done*/
>   if 

RE: [PATCH 4/4] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-02-24 Thread Liu, Shaoyun
[AMD Public Use]

SBR happens when the hypervisor begins to start the VM. I think the purpose
is for the hypervisor to reset the device into a clean state before the VM
starts. The specific issue for XGMI is that the HW requires all GPUs
belonging to the hive to be reset within a limited time slot, which SBR
cannot guarantee. For a non-XGMI configuration there is no such requirement,
and SBR can reset the GPU correctly.

Regards
Shaoyun.liu
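
In code terms the distinction reads roughly as follows (a sketch mirroring
the gating visible in the patches in this thread, not a quote of any one
hunk):

    /* Sketch only: only passthrough + multi-node XGMI takes the
     * driver-driven, hive-wide reset path; plain SBR covers the rest. */
    if (amdgpu_passthrough(adev) && adev->gmc.xgmi.num_physical_nodes > 1)
            adev->gmc.xgmi.pending_reset = true; /* defer to hive reset */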

-Original Message-
From: Lazar, Lijo  
Sent: Wednesday, February 24, 2021 8:58 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: RE: [PATCH 4/4] drm/amdgpu: Reset the devices in the XGMI hive during 
probe

[AMD Public Use]

Hi Shaoyun,

If this is SBR happening during device init, how different is the handling
from the normal passthrough case without XGMI? Shouldn't the minimal init be
done and the reset performed in such a case also? Wondering why this is
specific to "xgmi.pending_reset". In the case of XGMI, wouldn't this cause
issues if other devices in the hive are reset without HV knowledge?

Thanks,
Lijo

-Original Message-
From: amd-gfx  On Behalf Of shaoyunl
Sent: Wednesday, February 24, 2021 7:53 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH 4/4] drm/amdgpu: Reset the devices in the XGMI hive during probe

In a passthrough configuration, the hypervisor will trigger the SBR
(Secondary Bus Reset) to the devices without syncing with each other. This
can hang the devices since, in an XGMI configuration, all the devices within
the hive need to be reset within a limited time slot. This series of patches
tries to solve the issue by cooperating with a new SMU which only does
minimal housekeeping in response to the SBR request, without doing the real
reset job, leaving that to the driver. The driver needs to do the whole SW
init and a minimal HW init to bring up the SMU and trigger the reset
(possibly BACO) on all the ASICs at the same time with the existing
gpu_recovery routine.

Signed-off-by: shaoyunl 
Change-Id: I34e838e611b7623c7ad824704c7ce350808014fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 96 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  6 +-
 4 files changed, 87 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 420ef08a51b5..ae8be6d813a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1220,6 +1220,10 @@ bool amdgpu_device_need_post(struct amdgpu_device *adev)
}
}
 
+   /* Don't post if we need to reset whole hive on init */
+   if (adev->gmc.xgmi.pending_reset)
+   return false;
+
if (adev->has_hw_reset) {
adev->has_hw_reset = false;
return true;
@@ -2147,6 +2151,9 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
if (adev->ip_blocks[i].version->type != 
AMD_IP_BLOCK_TYPE_PSP)
continue;
 
+   if (!adev->ip_blocks[i].status.sw)
+   continue;
+
/* no need to do the fw loading again if already done*/
if (adev->ip_blocks[i].status.hw == true)
break;
@@ -2287,7 +2294,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 
if (adev->gmc.xgmi.num_physical_nodes > 1)
amdgpu_xgmi_add_device(adev);
-   amdgpu_amdkfd_device_init(adev);
+
+   /* Don't init kfd if whole hive needs to be reset during init */
+   if (!adev->gmc.xgmi.pending_reset)
+   amdgpu_amdkfd_device_init(adev);
 
amdgpu_fru_get_product_info(adev);
 
@@ -2731,6 +2741,16 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
adev->ip_blocks[i].status.hw = false;
continue;
}
+
+   /* skip unnecessary suspend if we do not initialize them yet */
+   if (adev->gmc.xgmi.pending_reset &&
+   !(adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC 
||
+ adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC 
||
+ adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_COMMON ||
+ adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_IH)) {
+   adev->ip_blocks[i].status.hw = false;
+   continue;
+   }
/* XXX handle errors */
r = adev->ip_blocks[i].version->funcs->suspend(adev);
/* XXX handle errors */
@@ -3402,10 +3422,29 @@ int amdgpu_device_init(str

RE: [PATCH 2/4] drm/amdgpu: get xgmi info at early_init

2021-02-23 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Thanks, Alex.
All four patches are needed for the XGMI reset to work normally. I tried to
describe what these patches are for in the first patch. But if you don't
mind, I can adjust the order as suggested.

Thanks 
Shaoyun.liu

-Original Message-
From: Alex Deucher  
Sent: Tuesday, February 23, 2021 11:26 AM
To: Liu, Shaoyun 
Cc: amd-gfx list 
Subject: Re: [PATCH 2/4] drm/amdgpu: get xgmi info at early_init

On Thu, Feb 18, 2021 at 8:19 PM shaoyunl  wrote:
>
> The driver needs to get the XGMI info earlier, before ip_init, since the
> driver needs to check the XGMI setting to determine how to perform the
> reset during init
>
> Signed-off-by: shaoyunl 
> Change-Id: Ic37276bbb6640bb4e9360220fed99494cedd3ef5

I think this patch needs to come first or patch 1 won't work.  With that 
changed, this patch is:
Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 3686e777c76c..3e6bfab5b855 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1151,6 +1151,10 @@ static int gmc_v9_0_early_init(void *handle)
> adev->gmc.private_aperture_end =
> adev->gmc.private_aperture_start + (4ULL << 30) - 1;
>
> +   /* Need to get xgmi info earlier to decide the reset behavior*/
> +   if (adev->gmc.xgmi.supported)
> +   adev->gfxhub.funcs->get_xgmi_info(adev);
> +
> return 0;
>  }
>
> @@ -1416,12 +1420,6 @@ static int gmc_v9_0_sw_init(void *handle)
> }
> adev->need_swiotlb = drm_need_swiotlb(44);
>
> -   if (adev->gmc.xgmi.supported) {
> -   r = adev->gfxhub.funcs->get_xgmi_info(adev);
> -   if (r)
> -   return r;
> -   }
> -
> r = gmc_v9_0_mc_init(adev);
> if (r)
> return r;
> --
> 2.17.1
>


RE: [PATCH 1/4] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-02-23 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Comments inline.

-Original Message-
From: Alex Deucher  
Sent: Tuesday, February 23, 2021 11:47 AM
To: Liu, Shaoyun 
Cc: amd-gfx list 
Subject: Re: [PATCH 1/4] drm/amdgpu: Reset the devices in the XGMI hive during 
probe

On Thu, Feb 18, 2021 at 8:19 PM shaoyunl  wrote:
>
> In a passthrough configuration, the hypervisor will trigger the SBR
> (Secondary Bus Reset) to the devices without syncing with each other.
> This can hang the devices since, in an XGMI configuration, all the devices
> within the hive need to be reset within a limited time slot. This series of
> patches tries to solve the issue by cooperating with a new SMU which only
> does minimal housekeeping in response to the SBR request, without doing the
> real reset job, leaving that to the driver. The driver needs to do the
> whole SW init and a minimal HW init to bring up the SMU and trigger the
> reset (possibly BACO) on all the ASICs at the same time with the existing
> gpu_recovery routine.
>
> Signed-off-by: shaoyunl 
> Change-Id: I34e838e611b7623c7ad824704c7ce350808014fc
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 96 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h|  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  6 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  6 +-
>  4 files changed, 87 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2f9ad7ed82be..9f574fd151bc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1220,6 +1220,10 @@ bool amdgpu_device_need_post(struct amdgpu_device 
> *adev)
> }
> }
>
> +   /* Don't post if we need to reset whole hive on init */
> +   if (adev->gmc.xgmi.pending_reset)
> +   return false;
> +
> if (adev->has_hw_reset) {
> adev->has_hw_reset = false;
> return true;
> @@ -2147,6 +2151,9 @@ static int amdgpu_device_fw_loading(struct 
> amdgpu_device *adev)
> if (adev->ip_blocks[i].version->type != 
> AMD_IP_BLOCK_TYPE_PSP)
> continue;
>
> +   if (!adev->ip_blocks[i].status.sw)
> +   continue;
> +
> /* no need to do the fw loading again if already 
> done*/
> if (adev->ip_blocks[i].status.hw == true)
> break;
> @@ -2287,7 +2294,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>
> if (adev->gmc.xgmi.num_physical_nodes > 1)
> amdgpu_xgmi_add_device(adev);
> -   amdgpu_amdkfd_device_init(adev);
> +
> +   /* Don't init kfd if whole hive need to be reset during init */
> +   if (!adev->gmc.xgmi.pending_reset)
> +   amdgpu_amdkfd_device_init(adev);
>
> amdgpu_fru_get_product_info(adev);
>
> @@ -2731,6 +2741,16 @@ static int amdgpu_device_ip_suspend_phase2(struct 
> amdgpu_device *adev)
> adev->ip_blocks[i].status.hw = false;
> continue;
> }
> +
> +   /* skip unnecessary suspend if we do not initialize them yet 
> */
> +   if (adev->gmc.xgmi.pending_reset &&
> +   !(adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_GMC ||
> + adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_SMC ||
> + adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_COMMON ||
> + adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_IH)) {
> +   adev->ip_blocks[i].status.hw = false;
> +   continue;
> +   }
> /* XXX handle errors */
> r = adev->ip_blocks[i].version->funcs->suspend(adev);
> /* XXX handle errors */
> @@ -3402,10 +3422,29 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  *  E.g., driver was not cleanly unloaded previously, etc.
>  */
> if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
> -   r = amdgpu_asic_reset(adev);
> -   if (r) {
> -   dev_err(adev->dev, "asic reset on init failed\n");
> -   goto failed;
> +   if (adev->gmc.xgmi.num_physical_nodes) {
> +   dev_info(adev->dev, "Pending hive reset.\n");
> +   

RE: [PATCH 4/4] drm/amdgpu: Init the cp MQD if it has not been initialized before

2021-02-22 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]



-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, February 18, 2021 8:20 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH 4/4] drm/amdgpu: Init the cp MQD if it has not been initialized 
before

The MQD might not be initialized during the first init period if the device
needs to be reset during probe. The driver needs to properly init it in the
GPU recovery period

Signed-off-by: shaoyunl 
Change-Id: Iad58a050939af2afa46d1c74a90866c47ba9efd2
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 65db88bb6cbc..8fc2fd518a1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3696,11 +3696,18 @@ static int gfx_v9_0_kiq_init_queue(struct amdgpu_ring 
*ring)
struct amdgpu_device *adev = ring->adev;
struct v9_mqd *mqd = ring->mqd_ptr;
int mqd_idx = AMDGPU_MAX_COMPUTE_RINGS;
+   struct v9_mqd *tmp_mqd;
 
gfx_v9_0_kiq_setting(ring);
 
-   if (amdgpu_in_reset(adev)) { /* for GPU_RESET case */
-   /* reset MQD to a clean status */
+   /* The GPU could be in a bad state during probe; the driver triggers
+* the reset after loading the SMU, in which case the MQD has not
+* been initialized. The driver needs to re-init the MQD here.
+* Check mqd->cp_hqd_pq_control since this value should not be 0.
+*/
+   tmp_mqd = (struct v9_mqd *)adev->gfx.mec.mqd_backup[mqd_idx];
+   if (amdgpu_in_reset(adev) && tmp_mqd->cp_hqd_pq_control){
+   /* for GPU_RESET case , reset MQD to a clean status */
if (adev->gfx.mec.mqd_backup[mqd_idx])
memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], 
sizeof(struct v9_mqd_allocation));
 
@@ -3736,8 +3743,15 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring 
*ring)
struct amdgpu_device *adev = ring->adev;
struct v9_mqd *mqd = ring->mqd_ptr;
int mqd_idx = ring - &adev->gfx.compute_ring[0];
+   struct v9_mqd *tmp_mqd;
 
-   if (!amdgpu_in_reset(adev) && !adev->in_suspend) {
+   /* Same as the kiq init above: the driver needs to re-init the mqd
+* if mqd->cp_hqd_pq_control was not initialized before
+*/
+   tmp_mqd = (struct v9_mqd *)adev->gfx.mec.mqd_backup[mqd_idx];
+
+   if (!tmp_mqd->cp_hqd_pq_control ||
+   (!amdgpu_in_reset(adev) && !adev->in_suspend)) {
memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0x;
((struct v9_mqd_allocation *)mqd)->dynamic_rb_mask = 0x;
--
2.17.1


RE: [PATCH 3/4] drm/amdgpu: Add kfd init_complete flag to check from amdgpu side

2021-02-22 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]



-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, February 18, 2021 8:20 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH 3/4] drm/amdgpu: Add kfd init_complete flag to check from 
amdgpu side

The amdgpu driver may be in a reset state during init, in which case KFD will
not be initialized; the driver needs to initialize KFD after the reset by
checking the flag

Signed-off-by: shaoyunl 
Change-Id: Ic1684b55b27e0afd42bee8b9b431c4fb0afcec15
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 3 ++-  
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +  
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c5343a5eecbe..a876dc3af017 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -165,7 +165,8 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
adev->doorbell_index.last_non_cp;
}
 
-   kgd2kfd_device_init(adev->kfd.dev, adev_to_drm(adev), &gpu_resources);
+   adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
+   adev_to_drm(adev), &gpu_resources);
}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 4687ff2961e1..3182dd97840e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -80,6 +80,7 @@ struct amdgpu_amdkfd_fence {  struct amdgpu_kfd_dev {
struct kfd_dev *dev;
uint64_t vram_used;
+   bool init_complete;
 };
 
 enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9f574fd151bc..e898fce96f75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4841,6 +4841,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/*unlock kfd: SRIOV would do it separately */
if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
amdgpu_amdkfd_post_reset(tmp_adev);
+
+   /* kfd_post_reset will do nothing if the kfd device is not initialized;
+* need to bring up kfd here if it was not initialized before
+*/
+   if (!adev->kfd.init_complete)
+   amdgpu_amdkfd_device_init(adev);
+
if (audio_suspended)
amdgpu_device_resume_display_audio(tmp_adev);
amdgpu_device_unlock_adev(tmp_adev);
--
2.17.1


RE: [PATCH 2/4] drm/amdgpu: get xgmi info at early_init

2021-02-22 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]



-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, February 18, 2021 8:19 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH 2/4] drm/amdgpu: get xgmi info at early_init

The driver needs to get the XGMI info earlier, before ip_init, since the
driver needs to check the XGMI setting to determine how to perform the reset
during init

Signed-off-by: shaoyunl 
Change-Id: Ic37276bbb6640bb4e9360220fed99494cedd3ef5
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 3686e777c76c..3e6bfab5b855 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1151,6 +1151,10 @@ static int gmc_v9_0_early_init(void *handle)
adev->gmc.private_aperture_end =
adev->gmc.private_aperture_start + (4ULL << 30) - 1;
 
+   /* Need to get xgmi info earlier to decide the reset behavior*/
+   if (adev->gmc.xgmi.supported)
+   adev->gfxhub.funcs->get_xgmi_info(adev);
+
return 0;
 }
 
@@ -1416,12 +1420,6 @@ static int gmc_v9_0_sw_init(void *handle)
}
adev->need_swiotlb = drm_need_swiotlb(44);
 
-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
r = gmc_v9_0_mc_init(adev);
if (r)
return r;
--
2.17.1


RE: [PATCH 1/4] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-02-22 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Ping. 

-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, February 18, 2021 8:19 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH 1/4] drm/amdgpu: Reset the devices in the XGMI hive during probe

In a passthrough configuration, the hypervisor will trigger the SBR
(Secondary Bus Reset) to the devices without syncing with each other. This
can hang the devices since, in an XGMI configuration, all the devices within
the hive need to be reset within a limited time slot. This series of patches
tries to solve the issue by cooperating with a new SMU which only does
minimal housekeeping in response to the SBR request, without doing the real
reset job, leaving that to the driver. The driver needs to do the whole SW
init and a minimal HW init to bring up the SMU and trigger the reset
(possibly BACO) on all the ASICs at the same time with the existing
gpu_recovery routine.

Signed-off-by: shaoyunl 
Change-Id: I34e838e611b7623c7ad824704c7ce350808014fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 96 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  6 +-
 4 files changed, 87 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2f9ad7ed82be..9f574fd151bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1220,6 +1220,10 @@ bool amdgpu_device_need_post(struct amdgpu_device *adev)
}
}
 
+   /* Don't post if we need to reset whole hive on init */
+   if (adev->gmc.xgmi.pending_reset)
+   return false;
+
if (adev->has_hw_reset) {
adev->has_hw_reset = false;
return true;
@@ -2147,6 +2151,9 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
if (adev->ip_blocks[i].version->type != 
AMD_IP_BLOCK_TYPE_PSP)
continue;
 
+   if (!adev->ip_blocks[i].status.sw)
+   continue;
+
/* no need to do the fw loading again if already done*/
if (adev->ip_blocks[i].status.hw == true)
break;
@@ -2287,7 +2294,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 
if (adev->gmc.xgmi.num_physical_nodes > 1)
amdgpu_xgmi_add_device(adev);
-   amdgpu_amdkfd_device_init(adev);
+
+   /* Don't init kfd if whole hive needs to be reset during init */
+   if (!adev->gmc.xgmi.pending_reset)
+   amdgpu_amdkfd_device_init(adev);
 
amdgpu_fru_get_product_info(adev);
 
@@ -2731,6 +2741,16 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
adev->ip_blocks[i].status.hw = false;
continue;
}
+
+   /* skip unnecessary suspend if we do not initialize them yet */
+   if (adev->gmc.xgmi.pending_reset &&
+   !(adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC 
||
+ adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC 
||
+ adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_COMMON ||
+ adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_IH)) {
+   adev->ip_blocks[i].status.hw = false;
+   continue;
+   }
/* XXX handle errors */
r = adev->ip_blocks[i].version->funcs->suspend(adev);
/* XXX handle errors */
@@ -3402,10 +3422,29 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 *  E.g., driver was not cleanly unloaded previously, etc.
 */
if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
-   r = amdgpu_asic_reset(adev);
-   if (r) {
-   dev_err(adev->dev, "asic reset on init failed\n");
-   goto failed;
+   if (adev->gmc.xgmi.num_physical_nodes) {
+   dev_info(adev->dev, "Pending hive reset.\n");
+   adev->gmc.xgmi.pending_reset = true;
+   /* Only need to init necessary block for SMU to handle 
the reset */
+   for (i = 0; i < adev->num_ip_blocks; i++) {
+   if (!adev->ip_blocks[i].status.valid)
+   continue;
+   if (!(adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_GMC ||
+ adev->ip_blocks[i].vers

RE: [PATCH 1/4] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-02-19 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Ping.

-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, February 18, 2021 8:19 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun 
Subject: [PATCH 1/4] drm/amdgpu: Reset the devices in the XGMI hive during probe

In a passthrough configuration, the hypervisor will trigger the SBR
(Secondary Bus Reset) to the devices without syncing with each other. This
can hang the devices since, in an XGMI configuration, all the devices within
the hive need to be reset within a limited time slot. This series of patches
tries to solve the issue by cooperating with a new SMU which only does
minimal housekeeping in response to the SBR request, without doing the real
reset job, leaving that to the driver. The driver needs to do the whole SW
init and a minimal HW init to bring up the SMU and trigger the reset
(possibly BACO) on all the ASICs at the same time with the existing
gpu_recovery routine.

Signed-off-by: shaoyunl 
Change-Id: I34e838e611b7623c7ad824704c7ce350808014fc
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 96 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  6 +-
 4 files changed, 87 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2f9ad7ed82be..9f574fd151bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1220,6 +1220,10 @@ bool amdgpu_device_need_post(struct amdgpu_device *adev)
}
}
 
+   /* Don't post if we need to reset whole hive on init */
+   if (adev->gmc.xgmi.pending_reset)
+   return false;
+
if (adev->has_hw_reset) {
adev->has_hw_reset = false;
return true;
@@ -2147,6 +2151,9 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
if (adev->ip_blocks[i].version->type != 
AMD_IP_BLOCK_TYPE_PSP)
continue;
 
+   if (!adev->ip_blocks[i].status.sw)
+   continue;
+
/* no need to do the fw loading again if already done*/
if (adev->ip_blocks[i].status.hw == true)
break;
@@ -2287,7 +2294,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 
if (adev->gmc.xgmi.num_physical_nodes > 1)
amdgpu_xgmi_add_device(adev);
-   amdgpu_amdkfd_device_init(adev);
+
+   /* Don't init kfd if whole hive needs to be reset during init */
+   if (!adev->gmc.xgmi.pending_reset)
+   amdgpu_amdkfd_device_init(adev);
 
amdgpu_fru_get_product_info(adev);
 
@@ -2731,6 +2741,16 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
adev->ip_blocks[i].status.hw = false;
continue;
}
+
+   /* skip unnecessary suspend if we do not initialize them yet */
+   if (adev->gmc.xgmi.pending_reset &&
+   !(adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC 
||
+ adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC 
||
+ adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_COMMON ||
+ adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_IH)) {
+   adev->ip_blocks[i].status.hw = false;
+   continue;
+   }
/* XXX handle errors */
r = adev->ip_blocks[i].version->funcs->suspend(adev);
/* XXX handle errors */
@@ -3402,10 +3422,29 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 *  E.g., driver was not cleanly unloaded previously, etc.
 */
if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
-   r = amdgpu_asic_reset(adev);
-   if (r) {
-   dev_err(adev->dev, "asic reset on init failed\n");
-   goto failed;
+   if (adev->gmc.xgmi.num_physical_nodes) {
+   dev_info(adev->dev, "Pending hive reset.\n");
+   adev->gmc.xgmi.pending_reset = true;
+   /* Only need to init necessary block for SMU to handle 
the reset */
+   for (i = 0; i < adev->num_ip_blocks; i++) {
+   if (!adev->ip_blocks[i].status.valid)
+   continue;
+   if (!(adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_GMC ||
+ adev->ip_blocks[i].vers
