RE: [PATCH 10/11] drm/amdgpu: add gang submit frontend v4

2022-09-04 Thread Huang, Trigger
[AMD Official Use Only - General]

Before we finally add the gang submission frontend, is there any interface/flag 
for the user mode driver to detect whether gang submission is supported by the kernel?

Regards,
Trigger

-Original Message-
From: amd-gfx  On Behalf Of Christian König
Sent: August 29, 2022 21:18
To: Dong, Ruijing ; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian 
Subject: [PATCH 10/11] drm/amdgpu: add gang submit frontend v4

Allows submitting jobs as a gang which needs to run on multiple engines at the 
same time.

All members of the gang get the same implicit, explicit and VM dependencies. So 
no gang member will start running until everything else is ready.

The last job is considered the gang leader (usually a submission to the GFX
ring) and used for signaling output dependencies.

Each job is remembered individually as user of a buffer object, so there is no 
joining of work at the end.
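
For context, a rough sketch of the parser-side bookkeeping this introduces (illustrative
only; the real structure change is in the amdgpu_cs.h hunk of this patch): one
entity/job pair is tracked per gang member, and the last job doubles as the gang leader.

struct amdgpu_cs_parser {
	/* one entry per gang member, up to AMDGPU_CS_GANG_SIZE */
	struct drm_sched_entity	*entities[AMDGPU_CS_GANG_SIZE];
	struct amdgpu_job	*jobs[AMDGPU_CS_GANG_SIZE];
	/* the last job, used for signaling output dependencies */
	struct amdgpu_job	*gang_leader;
	unsigned int		gang_size;
	/* ... */
};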

v2: rebase and fix review comments from Andrey and Yogesh
v3: use READ instead of BOOKKEEP for now because of VM unmaps, set gang
leader only when necessary
v4: fix order of pushing jobs and adding fences found by Trigger.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 259 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h|  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  12 +-
 3 files changed, 184 insertions(+), 97 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 9821299dfb49..a6e50ad5e306 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -69,6 +69,7 @@ static int amdgpu_cs_p1_ib(struct amdgpu_cs_parser *p,
   unsigned int *num_ibs)
 {
struct drm_sched_entity *entity;
+   unsigned int i;
int r;

r = amdgpu_ctx_get_entity(p->ctx, chunk_ib->ip_type,
@@ -77,17 +78,28 @@ static int amdgpu_cs_p1_ib(struct amdgpu_cs_parser *p,
if (r)
return r;

-   /* Abort if there is no run queue associated with this entity.
-* Possibly because of disabled HW IP*/
+   /*
+* Abort if there is no run queue associated with this entity.
+* Possibly because of disabled HW IP.
+*/
if (entity->rq == NULL)
return -EINVAL;

-   /* Currently we don't support submitting to multiple entities */
-   if (p->entity && p->entity != entity)
+   /* Check if we can add this IB to some existing job */
+   for (i = 0; i < p->gang_size; ++i) {
+   if (p->entities[i] == entity)
+   goto found;
+   }
+
+   /* If not increase the gang size if possible */
+   if (i == AMDGPU_CS_GANG_SIZE)
return -EINVAL;

-   p->entity = entity;
-   ++(*num_ibs);
+   p->entities[i] = entity;
+   p->gang_size = i + 1;
+
+found:
+   ++(num_ibs[i]);
return 0;
 }

@@ -161,11 +173,12 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
   union drm_amdgpu_cs *cs)
 {
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
+   unsigned int num_ibs[AMDGPU_CS_GANG_SIZE] = { };
struct amdgpu_vm *vm = &fpriv->vm;
uint64_t *chunk_array_user;
uint64_t *chunk_array;
-   unsigned size, num_ibs = 0;
uint32_t uf_offset = 0;
+   unsigned int size;
int ret;
int i;

@@ -231,7 +244,7 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
if (size < sizeof(struct drm_amdgpu_cs_chunk_ib))
goto free_partial_kdata;

-   ret = amdgpu_cs_p1_ib(p, p->chunks[i].kdata, &num_ibs);
+   ret = amdgpu_cs_p1_ib(p, p->chunks[i].kdata, num_ibs);
if (ret)
goto free_partial_kdata;
break;
@@ -268,21 +281,28 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
}
}

-   ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
-   if (ret)
-   goto free_all_kdata;
+   if (!p->gang_size)
+   return -EINVAL;

-   ret = drm_sched_job_init(&p->job->base, p->entity, &fpriv->vm);
-   if (ret)
-   goto free_all_kdata;
+   for (i = 0; i < p->gang_size; ++i) {
+   ret = amdgpu_job_alloc(p->adev, num_ibs[i], &p->jobs[i], vm);
+   if (ret)
+   goto free_all_kdata;
+
+   ret = drm_sched_job_init(&p->jobs[i]->base, p->entities[i],
+                            &fpriv->vm);
+   if (ret)
+   goto free_all_kdata;
+   }
+   p->gang_leader = p->jobs[p->gang_size - 1];

-   if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
+   if (p->ctx->vram_lost_counter != p->gang_leader->vram_lost_counter) {
ret = -ECANCELED;
goto free_all_kdata;
}

 

RE: [PATCH] drm/amdgpu/sriov: fix Tonga load driver failed

2019-06-20 Thread Huang, Trigger
Reviewed-by: Trigger Huang 

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: amd-gfx  On Behalf Of Jack Zhang
Sent: Thursday, June 20, 2019 2:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amdgpu/sriov: fix Tonga load driver failed

Tonga SR-IOV needs to use the SMU to load firmware.
Remove the SR-IOV check because the default return value is zero.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
index 093cefd..469cda9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
@@ -2755,8 +2755,6 @@ int amdgpu_pm_load_smu_firmware(struct amdgpu_device 
*adev, uint32_t *smu_version)  {
int r;
 
-   if (amdgpu_sriov_vf(adev))
-   return 0;
 
if (adev->powerplay.pp_funcs && 
adev->powerplay.pp_funcs->load_firmware) {
r = 
adev->powerplay.pp_funcs->load_firmware(adev->powerplay.pp_handle);
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu/sriov: Correct some register program method

2019-05-31 Thread Huang, Trigger

Reviewed-by: Trigger Huang 


Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: amd-gfx  On Behalf Of Emily Deng
Sent: Friday, May 31, 2019 2:37 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily 
Subject: [PATCH] drm/amdgpu/sriov: Correct some register program method

For the VF, some registers can only be programmed through the RLC.

Signed-off-by: Emily Deng 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c| 10 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index cc5a382..2e9cac1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -1927,17 +1927,17 @@ static void gfx_v9_0_constants_init(struct 
amdgpu_device *adev)
if (i == 0) {
tmp = REG_SET_FIELD(0, SH_MEM_CONFIG, ALIGNMENT_MODE,
SH_MEM_ALIGNMENT_MODE_UNALIGNED);
-   WREG32_SOC15(GC, 0, mmSH_MEM_CONFIG, tmp);
-   WREG32_SOC15(GC, 0, mmSH_MEM_BASES, 0);
+   WREG32_SOC15_RLC(GC, 0, mmSH_MEM_CONFIG, tmp);
+   WREG32_SOC15_RLC(GC, 0, mmSH_MEM_BASES, 0);
} else {
tmp = REG_SET_FIELD(0, SH_MEM_CONFIG, ALIGNMENT_MODE,
SH_MEM_ALIGNMENT_MODE_UNALIGNED);
-   WREG32_SOC15(GC, 0, mmSH_MEM_CONFIG, tmp);
+   WREG32_SOC15_RLC(GC, 0, mmSH_MEM_CONFIG, tmp);
tmp = REG_SET_FIELD(0, SH_MEM_BASES, PRIVATE_BASE,
(adev->gmc.private_aperture_start >> 48));
tmp = REG_SET_FIELD(tmp, SH_MEM_BASES, SHARED_BASE,
(adev->gmc.shared_aperture_start >> 48));
-   WREG32_SOC15(GC, 0, mmSH_MEM_BASES, tmp);
+   WREG32_SOC15_RLC(GC, 0, mmSH_MEM_BASES, tmp);
}
}
soc15_grbm_select(adev, 0, 0, 0, 0);
@@ -3046,7 +3046,7 @@ static int gfx_v9_0_kiq_init_register(struct amdgpu_ring 
*ring)
(adev->doorbell_index.userqueue_end * 
2) << 2);
}
 
-   WREG32_SOC15(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
+   WREG32_SOC15_RLC(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
   mqd->cp_hqd_pq_doorbell_control);
 
/* reset read and write pointers, similar to CP_RB0_WPTR/_RPTR */ diff 
--git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index 0dc8926..9f0f189 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -146,12 +146,12 @@ static void gfxhub_v1_0_init_cache_regs(struct 
amdgpu_device *adev)
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 0);
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, CONTEXT1_IDENTITY_ACCESS_MODE, 1);
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, IDENTITY_MODE_FRAGMENT_SIZE, 0);
-   WREG32_SOC15(GC, 0, mmVM_L2_CNTL, tmp);
+   WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL, tmp);
 
tmp = RREG32_SOC15(GC, 0, mmVM_L2_CNTL2);
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL2, INVALIDATE_ALL_L1_TLBS, 1);
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL2, INVALIDATE_L2_CACHE, 1);
-   WREG32_SOC15(GC, 0, mmVM_L2_CNTL2, tmp);
+   WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL2, tmp);
 
tmp = mmVM_L2_CNTL3_DEFAULT;
if (adev->gmc.translate_further) {
@@ -163,12 +163,12 @@ static void gfxhub_v1_0_init_cache_regs(struct 
amdgpu_device *adev)
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL3,
L2_CACHE_BIGK_FRAGMENT_SIZE, 6);
}
-   WREG32_SOC15(GC, 0, mmVM_L2_CNTL3, tmp);
+   WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL3, tmp);
 
tmp = mmVM_L2_CNTL4_DEFAULT;
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL4, VMC_TAP_PDE_REQUEST_PHYSICAL, 0);
tmp = REG_SET_FIELD(tmp, VM_L2_CNTL4, VMC_TAP_PTE_REQUEST_PHYSICAL, 0);
-   WREG32_SOC15(GC, 0, mmVM_L2_CNTL4, tmp);
+   WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL4, tmp);
 }
 
 static void gfxhub_v1_0_enable_system_domain(struct amdgpu_device *adev)
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 0/2] Skip IH re-route on Vega SR-IOV

2019-05-07 Thread Huang, Trigger

OK, thanks for the detailed background; I didn't know about the limitation 
in the hardware before.

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Christian König  
Sent: Tuesday, May 07, 2019 5:04 PM
To: Huang, Trigger ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 0/2] Skip IH re-route on Vega SR-IOV

[CAUTION: External Email]

Hi Trigger,

> And I see this interrupt still comes from the IH0 amdgpu_irq_handler, which proves 
> this feature is not working under SR-IOV.
In this case this change is a clear NAK.

> I suggest removing this feature from SR-IOV, as my concern is that some weird 
> bugs may be caused by it in virtualization heavy stress testing.
And I really think we should keep it to make sure that we have the same 
handling for bare metal as for SRIOV.

> In the future, maybe we can request PSP team to add this support for SR-IOV.
We will never be able to use this under SR-IOV because of a limitation in the 
hardware.

What we could maybe do is check the response code from the PSP firmware if it 
correctly ignored the commands under SR-IOV, but I think the response code is 
the same for ignoring as for executing the commands.

Regards,
Christian.

On 07.05.19 at 10:54, Huang, Trigger wrote:
> Hi Christian,
>
> On Vega10 SR-IOV VF, I injected a 'real' VMC page fault from user space, 
> using the modified amdgpu_test.
> [   19.127874] amdgpu :00:08.0: [gfxhub] no-retry page fault (src_id:0 
> ring:174 vmid:1 pasid:32768, for process amdgpu_test pid 1071 thread 
> amdgpu_test pid 1071)
> [   19.130037] amdgpu :00:08.0:   in page starting at address 
> 0x0008 from 27
>
> And I see this interrupt still comes from the IH0 amdgpu_irq_handler, which proves 
> this feature is not working under SR-IOV.
>
> I suggest removing this feature from SR-IOV, as my concern is that some weird 
> bugs may be caused by it in virtualization heavy stress testing.
> In the future, maybe we can request the PSP team to add this support for SR-IOV.
>
> Thanks & Best Wishes,
> Trigger Huang
>
> -Original Message-----
> From: Christian König 
> Sent: Tuesday, May 07, 2019 3:37 PM
> To: Huang, Trigger ; 
> amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 0/2] Skip IH re-route on Vega SR-IOV
>
> [CAUTION: External Email]
>
> We intentionally didn't do this to make sure that the commands are ignored by 
> the PSP firmware.
>
> I have no strong opinion on if we should do this or not, but the PSP firmware 
> guys might have.
>
> Christian.
>
On 07.05.19 at 06:08, Trigger Huang wrote:
>> IH re-route is not supported on Vega SR-IOV, need to be skipped
>>
>> Trigger Huang (2):
>> drm/amdgpu: Skip IH reroute in Vega10 SR-IOV VF
>> drm/amdgpu: Skip IH reroute in Vega20 SR-IOV VF
>>
>>drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 4 
>>drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  | 4 
>>2 files changed, 8 insertions(+)
>>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 0/2] Skip IH re-route on Vega SR-IOV

2019-05-07 Thread Huang, Trigger
Hi Christian,

On Vega10 SR-IOV VF, I injected a 'real' VMC page fault from user space, using 
the modified amdgpu_test.
[   19.127874] amdgpu :00:08.0: [gfxhub] no-retry page fault (src_id:0 
ring:174 vmid:1 pasid:32768, for process amdgpu_test pid 1071 thread 
amdgpu_test pid 1071)
[   19.130037] amdgpu :00:08.0:   in page starting at address 
0x0008 from 27

And I see this interrupt still comes from the IH0 amdgpu_irq_handler, which proves 
this feature is not working under SR-IOV.

I suggest removing this feature from SR-IOV, as my concern is that some weird 
bugs may be caused by it in virtualization heavy stress testing.
In the future, maybe we can request the PSP team to add this support for SR-IOV.

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Christian König  
Sent: Tuesday, May 07, 2019 3:37 PM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 0/2] Skip IH re-route on Vega SR-IOV

[CAUTION: External Email]

We intentionally didn't do this to make sure that the commands are ignored by 
the PSP firmware.

I have no strong opinion on if we should do this or not, but the PSP firmware 
guys might have.

Christian.

On 07.05.19 at 06:08, Trigger Huang wrote:
> IH re-route is not supported on Vega SR-IOV, need to be skipped
>
> Trigger Huang (2):
>drm/amdgpu: Skip IH reroute in Vega10 SR-IOV VF
>drm/amdgpu: Skip IH reroute in Vega20 SR-IOV VF
>
>   drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 4 
>   drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  | 4 
>   2 files changed, 8 insertions(+)
>

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

2019-05-06 Thread Huang, Trigger
Thanks for Alex's comments.

Yes, they are only in the SR-IOV HW initialization path of UVD/VCE.
Thanks & Best Wishes,
Trigger Huang

From: Deucher, Alexander 
Sent: Monday, May 06, 2019 10:52 PM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

As long as this doesn't break bare metal, I'm ok with it.
Acked-by: Alex Deucher 
mailto:alexander.deuc...@amd.com>>

From: amd-gfx 
mailto:amd-gfx-boun...@lists.freedesktop.org>>
 on behalf of Trigger Huang 
mailto:trigger.hu...@amd.com>>
Sent: Thursday, May 2, 2019 8:56 AM
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Huang, Trigger
Subject: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

[CAUTION: External Email]

On Vega10 SR-IOV VF, the FW address returned by PSP should be
set into the init table, not the original BO MC address.
Otherwise, the UVD and VCE IB tests will fail under Vega10 SR-IOV.

reference:
commit bfcea5204287 ("drm/amdgpu:change VEGA booting with firmware 
loaded by PSP")
commit aa5873dca463 ("drm/amdgpu: Change VCE booting with firmware 
loaded by PSP")

Signed-off-by: Trigger Huang 
mailto:trigger.hu...@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c | 16 ++--
 drivers/gpu/drm/amd/amdgpu/vce_v4_0.c | 17 +++--
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c 
b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
index dc461df..2191d3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
@@ -787,10 +787,13 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device 
*adev)
   0x, 
0x0004);
/* mc resume*/
if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
-   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
-   
lower_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
-   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
-   
upper_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+   
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
+   
adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_lo);
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+   
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
+   
adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_hi);
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, mmUVD_VCPU_CACHE_OFFSET0), 
0);
offset = 0;
} else {

MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
@@ -798,10 +801,11 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device 
*adev)

MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),

upper_32_bits(adev->uvd.inst[i].gpu_addr));
offset = size;
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, mmUVD_VCPU_CACHE_OFFSET0),
+   
AMDGPU_UVD_FIRMWARE_OFFSET >> 3);
+
}

-   MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_VCPU_CACHE_OFFSET0),
-   AMDGPU_UVD_FIRMWARE_OFFSET 
>> 3);
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_VCPU_CACHE_SIZE0), size);

MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE1_64BIT_BAR_LOW),
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c 
b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
index f3f5938..c0ec279 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
@@ -244,13 +244,18 @@ static int vce_v4_0_sriov_start(struct amdgpu_device 
*adev)
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, 
mmVCE_LMI_SWAP_CNTL1), 0);
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, 
mmVCE_LMI_VM_CTRL), 0);

+   offset = AMDGPU_VCE_FIRMWARE_OFFSET;
if

RE: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

2019-05-06 Thread Huang, Trigger
Ping again.


Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Trigger Huang  
Sent: Thursday, May 02, 2019 8:57 PM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger 
Subject: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

On Vega10 SR-IOV VF, the FW address returned by PSP should be set into the 
init table, not the original BO MC address.
Otherwise, the UVD and VCE IB tests will fail under Vega10 SR-IOV.

reference:
commit bfcea5204287 ("drm/amdgpu:change VEGA booting with firmware 
loaded by PSP")
commit aa5873dca463 ("drm/amdgpu: Change VCE booting with firmware 
loaded by PSP")

Signed-off-by: Trigger Huang 
---
 drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c | 16 ++--  
drivers/gpu/drm/amd/amdgpu/vce_v4_0.c | 17 +++--
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c 
b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
index dc461df..2191d3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
@@ -787,10 +787,13 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device 
*adev)
   0x, 
0x0004);
/* mc resume*/
if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
-   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
-   
lower_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
-   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
-   
upper_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+   
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
+   
adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_lo);
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+   
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
+   
adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_hi);
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, 
+mmUVD_VCPU_CACHE_OFFSET0), 0);
offset = 0;
} else {

MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
@@ -798,10 +801,11 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device 
*adev)

MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),

upper_32_bits(adev->uvd.inst[i].gpu_addr));
offset = size;
+   
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, mmUVD_VCPU_CACHE_OFFSET0),
+   
AMDGPU_UVD_FIRMWARE_OFFSET >> 3);
+
}
 
-   MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_VCPU_CACHE_OFFSET0),
-   AMDGPU_UVD_FIRMWARE_OFFSET 
>> 3);
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_VCPU_CACHE_SIZE0), size);
 
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, 
mmUVD_LMI_VCPU_CACHE1_64BIT_BAR_LOW),
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c 
b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
index f3f5938..c0ec279 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
@@ -244,13 +244,18 @@ static int vce_v4_0_sriov_start(struct amdgpu_device 
*adev)
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, 
mmVCE_LMI_SWAP_CNTL1), 0);
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, 
mmVCE_LMI_VM_CTRL), 0);
 
+   offset = AMDGPU_VCE_FIRMWARE_OFFSET;
if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
+   uint32_t low = 
adev->firmware.ucode[AMDGPU_UCODE_ID_VCE].tmr_mc_addr_lo;
+   uint32_t hi = 
adev->firmware.ucode[AMDGPU_UCODE_ID_VCE].tmr_mc_addr_hi;
+   uint64_t tmr_mc_addr = (uint64_t)(hi) << 32 | low;
+
MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0,
-   
mmVCE_LMI_VCPU_CACHE_40BIT_BAR0),
- 

RE: [PATCH] drm/amdgpu: Add IDH_QUERY_ALIVE event for SR-IOV

2019-05-01 Thread Huang, Trigger
Hi Guys,

Would you help on this patch?

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: amd-gfx  On Behalf Of Trigger Huang
Sent: Tuesday, April 30, 2019 4:30 PM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger 
Subject: [PATCH] drm/amdgpu: Add IDH_QUERY_ALIVE event for SR-IOV

[CAUTION: External Email]

The SR-IOV host side will send IDH_QUERY_ALIVE to the guest VM to check whether this 
guest VM is still alive (not destroyed). The only thing the guest KMD needs to do is 
send an ACK back to the host.

Signed-off-by: Trigger Huang 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 3 +++  
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 8dbad49..2471e7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -372,6 +372,9 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device 
*adev,
if (amdgpu_sriov_runtime(adev))
schedule_work(&adev->virt.flr_work);
break;
+   case IDH_QUERY_ALIVE:
+   xgpu_ai_mailbox_send_ack(adev);
+   break;
/* READY_TO_ACCESS_GPU is fetched by kernel polling, IRQ can 
ignore
 * it byfar since that polling thread will handle it,
 * other msg like flr complete is not handled here.
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
index 39d151b..077e91a 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
@@ -49,6 +49,7 @@ enum idh_event {
IDH_FLR_NOTIFICATION_CMPL,
IDH_SUCCESS,
IDH_FAIL,
+   IDH_QUERY_ALIVE,
IDH_EVENT_MAX
 };

--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: Rearm IRQ in Vega10 SR-IOV if IRQ lost

2019-05-01 Thread Huang, Trigger
Hi Christian,

Yes, it is the HW guys who suggested to rearm it.

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Christian König  
Sent: Wednesday, May 01, 2019 1:10 AM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Rearm IRQ in Vega10 SR-IOV if IRQ lost

[CAUTION: External Email]

On 30.04.19 at 17:14, Trigger Huang wrote:
> In multi-VF stress tests, we sometimes see IRQs lost when running 
> benchmarks; just rearm it.

Well I think I have seen that on bare metal as well, it would certainly explain 
some very odd behavior I've got from the IH block.

Have you pinged the hw guys about that already?

> Signed-off-by: Trigger Huang 

Acked-by: Christian König 

> ---
>   drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 37 
> +-
>   1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c 
> b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> index 1b2f69a..8d89ab7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> @@ -31,7 +31,7 @@
>   #include "soc15_common.h"
>   #include "vega10_ih.h"
>
> -
> +#define MAX_REARM_RETRY 10
>
>   static void vega10_ih_set_interrupt_funcs(struct amdgpu_device 
> *adev);
>
> @@ -382,6 +382,38 @@ static void vega10_ih_decode_iv(struct amdgpu_device 
> *adev,
>   }
>
>   /**
> + * vega10_ih_irq_rearm - rearm IRQ if lost
> + *
> + * @adev: amdgpu_device pointer
> + *
> + */
> +static void vega10_ih_irq_rearm(struct amdgpu_device *adev,
> +				struct amdgpu_ih_ring *ih)
> +{
> + uint32_t reg_rptr = 0;
> + uint32_t v = 0;
> + uint32_t i = 0;
> +
> + if (ih == &adev->irq.ih)
> + reg_rptr = SOC15_REG_OFFSET(OSSSYS, 0, mmIH_RB_RPTR);
> + else if (ih == &adev->irq.ih1)
> + reg_rptr = SOC15_REG_OFFSET(OSSSYS, 0, mmIH_RB_RPTR_RING1);
> + else if (ih == &adev->irq.ih2)
> + reg_rptr = SOC15_REG_OFFSET(OSSSYS, 0, mmIH_RB_RPTR_RING2);
> + else
> + return;
> +
> + /* Rearm IRQ / re-write doorbell if doorbell write is lost */
> + for (i = 0; i < MAX_REARM_RETRY; i++) {
> + v = RREG32_NO_KIQ(reg_rptr);
> + if ((v < ih->ring_size) && (v != ih->rptr))
> + WDOORBELL32(ih->doorbell_index, ih->rptr);
> + else
> + break;
> + }
> +}
> +
> +/**
>* vega10_ih_set_rptr - set the IH ring buffer rptr
>*
>* @adev: amdgpu_device pointer
> @@ -395,6 +427,9 @@ static void vega10_ih_set_rptr(struct amdgpu_device *adev,
>   /* XXX check if swapping is necessary on BE */
>   *ih->rptr_cpu = ih->rptr;
>   WDOORBELL32(ih->doorbell_index, ih->rptr);
> +
> + if (amdgpu_sriov_vf(adev))
> + vega10_ih_irq_rearm(adev, ih);
> + } else if (ih == &adev->irq.ih) {
> + WREG32_SOC15(OSSSYS, 0, mmIH_RB_RPTR, ih->rptr);
> + } else if (ih == &adev->irq.ih1) {

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path

2019-04-30 Thread Huang, Trigger
Thanks, OK, I got it.
That makes it clearer.

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Christian König  
Sent: Tuesday, April 30, 2019 6:32 PM
To: Huang, Trigger ; Koenig, Christian 
; Kuehling, Felix ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path

[CAUTION: External Email]

On 30.04.19 at 12:28, Huang, Trigger wrote:
> Hi Christian,
>
> Thanks for the quick response.
>
> One thing I want to confirm: per my understanding, the CSA is only used by CP 
> preemption for the KCQ and KGQ paths, but in the KFD user mode queue path, can we 
> delete the CSA?
> If yes, then maybe we can split this topic into two patches: 1) the 
> original one that unmaps the CSA in the KFD path is still needed

I won't go down that route anyway, we should probably move the CSA mapping into 
the CTX handling completely.

> 2) Add a new patch to use the new method for VM checking.

That should actually be the first patch, because this checking is certainly 
incorrect.

Christian.

>
> Thanks & Best Wishes,
> Trigger Huang
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Tuesday, April 30, 2019 5:59 PM
> To: Huang, Trigger ; Kuehling, Felix 
> ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
>
On 30.04.19 at 11:53, Huang, Trigger wrote:
>> Hi Christian & Felix,
>>
>> Thanks for the information.
>>
> But actually, currently the CSA is mapped only in amdgpu_driver_open_kms under 
> SR-IOV.
> So would you give more information about 'Well that is exactly what we 
> already do here'? Where can I find its implementation?
Well, the plan was to move this to when the actual execution context is 
created. But Rex left the company; it could be that the patches for this were 
never committed.
>
>> On the other hand, I will  try the method 'Instead of checking if some page 
>> tables are already filled we check if some mapping is already made'
Yeah, that should certainly work as well. Just check all entries of 
the root PD in a for loop to see if any subsequent PDs are allocated, and 
abort if you find one.
>
> Christian.
>
>> Thanks & Best Wishes,
>> Trigger Huang
>>
>> -Original Message-
>> From: Christian König 
>> Sent: Tuesday, April 30, 2019 5:23 PM
>> To: Kuehling, Felix ; Huang, Trigger 
>> ; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
>>
>> [CAUTION: External Email]
>>
>> Well that is exactly what we already do here. The only problem is we do the 
>> wrong check amdgpu_vm_make_compute().
>>
>> Instead of checking if some page tables are already filled we check if some 
>> mapping is already made.
>>
>> Regards,
>> Christian.
>>
On 30.04.19 at 01:34, Kuehling, Felix wrote:
>>> I remember a past discussion to change the CSA allocation/mapping 
>>> scheme to avoid this issue in the first place. Can adding the CSA to 
>>> the VM be delayed a little to a point after the VM gets converted to a 
>>> compute VM?
>>> Maybe the first command submission?
>>>
>>> Regards,
>>>   Felix
>>>
>>> On 2019-04-28 6:25 a.m., Trigger Huang wrote:
>>>> In the amdgpu open path, the CSA will be mapped in the VM, so when opening 
>>>> KFD, calling amdgpu_vm_make_compute will fail because it finds this 
>>>> VM is not a clean VM (it already has some mappings); as a result, it will 
>>>> lead to failing to create the process VM object.
>>>>
>>>> The fix is to try to unmap the CSA; actually, the CSA is not needed in 
>>>> the compute VF world switch.
>>>>
>>>> Signed-off-by: Trigger Huang 
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  2 +-
>>>>  2 files changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> index 697b8ef..e0bc457 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> @@ -956,6 +956,16 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct 
>>>> kgd_dev *kgd,
>>>>if (avm->process_info)
>>>>return -EINVAL;
>>>>
>>>> +/* Delete CSA mapping to make sure this VM is a clean VM  before
>

RE: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path

2019-04-30 Thread Huang, Trigger
Hi Christian,

Thanks for the quick response.

One thing I want to confirm: per my understanding, the CSA is only used by CP 
preemption for the KCQ and KGQ paths, but in the KFD user mode queue path, can we 
delete the CSA?
If yes, then maybe we can split this topic into two patches:
1) the original one that unmaps the CSA in the KFD path is still needed
2) a new patch to use the new method for VM checking

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Koenig, Christian  
Sent: Tuesday, April 30, 2019 5:59 PM
To: Huang, Trigger ; Kuehling, Felix 
; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path

On 30.04.19 at 11:53, Huang, Trigger wrote:
> Hi Christian & Felix,
>
> Thanks for the information.
>
> But actually, currently the CSA is mapped only in amdgpu_driver_open_kms under 
> SR-IOV.
> So would you give more information about 'Well that is exactly what we 
> already do here'? Where can I find its implementation?

Well, the plan was to move this to when the actual execution context is 
created. But Rex left the company; it could be that the patches for this were 
never committed.

>
> On the other hand, I will  try the method 'Instead of checking if some page 
> tables are already filled we check if some mapping is already made'

Yeah, that should certainly work as well. Just check all entries of the 
root PD in a for loop to see if any subsequent PDs are allocated, and abort if 
you find one.
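
A rough sketch of what that check could look like (illustrative only; helper and
field names are from memory and may not match this kernel version exactly):

static bool amdgpu_vm_root_pd_has_entries(struct amdgpu_device *adev,
					  struct amdgpu_vm *vm)
{
	unsigned int i;

	/* Walk the root page directory; any allocated child PD means the VM
	 * already has page tables filled in and is not a clean VM.
	 */
	for (i = 0; i < amdgpu_vm_num_entries(adev, 0); ++i) {
		if (vm->root.entries[i].base.bo)
			return true;
	}

	return false;
}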

Christian.

>
> Thanks & Best Wishes,
> Trigger Huang
>
> -Original Message-
> From: Christian König 
> Sent: Tuesday, April 30, 2019 5:23 PM
> To: Kuehling, Felix ; Huang, Trigger 
> ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
>
> [CAUTION: External Email]
>
> Well that is exactly what we already do here. The only problem is we do the 
> wrong check amdgpu_vm_make_compute().
>
> Instead of checking if some page tables are already filled we check if some 
> mapping is already made.
>
> Regards,
> Christian.
>
> On 30.04.19 at 01:34, Kuehling, Felix wrote:
>> I remember a past discussion to change the CSA allocation/mapping
>> scheme to avoid this issue in the first place. Can adding the CSA to
>> the VM be delayed a little to a point after the VM gets converted to a 
>> compute VM?
>> Maybe the first command submission?
>>
>> Regards,
>>  Felix
>>
>> On 2019-04-28 6:25 a.m., Trigger Huang wrote:
>>> In the amdgpu open path, the CSA will be mapped in the VM, so when opening KFD,
>>> calling amdgpu_vm_make_compute will fail because it finds this VM is
>>> not a clean VM (it already has some mappings); as a result, it will lead to
>>> failing to create the process VM object.
>>>
>>> The fix is to try to unmap the CSA; actually, the CSA is not needed in
>>> the compute VF world switch.
>>>
>>> Signed-off-by: Trigger Huang 
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  2 +-
>>> 2 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index 697b8ef..e0bc457 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -956,6 +956,16 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct 
>>> kgd_dev *kgd,
>>>   if (avm->process_info)
>>>   return -EINVAL;
>>>
>>> +/* Delete CSA mapping to make sure this VM is a clean VM  before
>>> + *  converting VM
>>> + */
>>> +if (amdgpu_sriov_vf(adev) && drv_priv->csa_va) {
>>> +amdgpu_bo_reserve(adev->virt.csa_obj, true);
>>> +amdgpu_vm_bo_rmv(adev, drv_priv->csa_va);
>>> +drv_priv->csa_va = NULL;
>>> +amdgpu_bo_unreserve(adev->virt.csa_obj);
>>> +}
>>> +
>>>   /* Convert VM into a compute VM */
>>>   ret = amdgpu_vm_make_compute(adev, avm, pasid);
>>>   if (ret)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index da7b4fe..361c2e5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -1069,7 +1069,7 @@ void amdgpu_driver_postclose_kms(struct
>>> drm_device *dev,
>>>
>>>   amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>>
>>> -if (amdgpu_sriov_vf(adev)) {
>>> +if (amdgpu_sriov_vf(adev) && fpriv->csa_va) {
>>>   /* TODO: how to handle reserve failure */
>>>   BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, true));
>>>   amdgpu_vm_bo_rmv(adev, fpriv->csa_va);
>> ___
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path

2019-04-30 Thread Huang, Trigger
Hi Christian & Felix,

Thanks for the information.

But actually, currently the CSA is mapped only in amdgpu_driver_open_kms under 
SR-IOV.
So would you give more information about 'Well that is exactly what we already 
do here'? Where can I find its implementation?

On the other hand, I will  try the method 'Instead of checking if some page 
tables are already filled we check if some mapping is already made'

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Christian König  
Sent: Tuesday, April 30, 2019 5:23 PM
To: Kuehling, Felix ; Huang, Trigger 
; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path

[CAUTION: External Email]

Well, that is exactly what we already do here. The only problem is we do the 
wrong check in amdgpu_vm_make_compute().

Instead of checking if some page tables are already filled we check if some 
mapping is already made.

Regards,
Christian.

On 30.04.19 at 01:34, Kuehling, Felix wrote:
> I remember a past discussion to change the CSA allocation/mapping 
> scheme to avoid this issue in the first place. Can adding the CSA to 
> the VM be delayed a little to a point after the VM gets converted to a 
> compute VM?
> Maybe the first command submission?
>
> Regards,
> Felix
>
> On 2019-04-28 6:25 a.m., Trigger Huang wrote:
>> In the amdgpu open path, the CSA will be mapped in the VM, so when opening KFD, 
>> calling amdgpu_vm_make_compute will fail because it finds this VM is 
>> not a clean VM (it already has some mappings); as a result, it will lead to 
>> failing to create the process VM object.
>>
>> The fix is to try to unmap the CSA; actually, the CSA is not needed in 
>> the compute VF world switch.
>>
>> Signed-off-by: Trigger Huang 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  2 +-
>>2 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 697b8ef..e0bc457 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -956,6 +956,16 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct 
>> kgd_dev *kgd,
>>  if (avm->process_info)
>>  return -EINVAL;
>>
>> +/* Delete CSA mapping to make sure this VM is a clean VM  before
>> + *  converting VM
>> + */
>> +if (amdgpu_sriov_vf(adev) && drv_priv->csa_va) {
>> +amdgpu_bo_reserve(adev->virt.csa_obj, true);
>> +amdgpu_vm_bo_rmv(adev, drv_priv->csa_va);
>> +drv_priv->csa_va = NULL;
>> +amdgpu_bo_unreserve(adev->virt.csa_obj);
>> +}
>> +
>>  /* Convert VM into a compute VM */
>>  ret = amdgpu_vm_make_compute(adev, avm, pasid);
>>  if (ret)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index da7b4fe..361c2e5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -1069,7 +1069,7 @@ void amdgpu_driver_postclose_kms(struct 
>> drm_device *dev,
>>
>>  amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>
>> -if (amdgpu_sriov_vf(adev)) {
>> +if (amdgpu_sriov_vf(adev) && fpriv->csa_va) {
>>  /* TODO: how to handle reserve failure */
>>  BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, true));
>>  amdgpu_vm_bo_rmv(adev, fpriv->csa_va);
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: Add more process info in VM for debug

2018-12-17 Thread Huang, Trigger
Yeah, makes sense.
Let me give another patch: [PATCH] drm/amdgpu: print process info when job 
timeout

-Original Message-
From: Koenig, Christian  
Sent: Monday, December 17, 2018 9:39 PM
To: Liu, Monk ; Huang, Trigger ; 
amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey ; Qu, Jim 
Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug

Hi guys,

We could also print shaders, IBs or textures to figure out what's going wrong, 
but that would be overkill as well.

The log is there to notice that something is wrong, not to be a detailed crash report.

The PID is perfectly sufficient to identify the process which triggered an 
issue, and when the system goes down completely because of this, we have done 
something wrong in the first place.

So that is still a NAK; CPU crash reports don't contain the command line 
either.

Regards,
Christian.

On 17.12.18 at 10:02, Liu, Monk wrote:
> Hi Christian,
>
> I think some SR-IOV customers need this rich information. 
> Maybe we can use a kernel option to let the user select whether rich or simple 
> information should be printed upon job TDR.
> In the SR-IOV branch we can enable it by default, while keeping it disabled by 
> default for drm-next.
>
> /Monk
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Huang, Trigger
> Sent: Monday, December 17, 2018 4:27 PM
> To: Koenig, Christian ; 
> amd-gfx@lists.freedesktop.org
> Cc: Grodzovsky, Andrey ; Qu, Jim 
> 
> Subject: RE: [PATCH] drm/amdgpu: Add more process info in VM for debug
>
> Hi Christian,
>
> Yes, if the test machine is still there for debugging, we can login it and 
> check a lot of things, such as ' ps -p 1 -o args ' as you suggested.
>
> But sometimes, the system is not alive anymore, and we only got some log 
> files (such as kern.log ) from QA or customers.
> And at this time, the full command line information in dmesg is quite useful. 
> Let's take the following message for example:
>   
>   [ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring 
> comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572
>   [ 1740.050167] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's 
> process information is as below:
>   [ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process SDMA 
> pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode goldimage_compare 
> --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor 
> --n-test-threads 4
> When customer/QA reported a compute ring job timeout while running a big test 
> case, vk-example, but the machine is not alive anymore and can't be logged into 
> for debugging, then we need to re-run the whole vk-example to reproduce this 
> issue, and this will waste a lot of time.
> But if we get the specific sub-case in the kern.log file when the job timeout 
> happened, then we can try only the specific one, here it is 'SDMA --mode 
> goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0 --frontend 
> test_executor --n-test-threads 4';
> maybe several minutes later, this issue is reproduced.
>
>
> Thanks,
> Trigger.
>
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Koenig, Christian
> Sent: Monday, December 17, 2018 3:50 PM
> To: Huang, Trigger ; 
> amd-gfx@lists.freedesktop.org
> Cc: Grodzovsky, Andrey ; Qu, Jim 
> 
> Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug
>
> Hi Trigger,
>
>> Does this make sense?
> Sorry I wasn't clear enough. The key point that we don't want/need the full 
> commandline of the process here is that we can already get that from the 
> information we have.
>
> E.g. "ps -p 1 -o args" gives you the command line of the process 1.
>
> The only case where directly printing this into the logs is useful is when we 
> run into a total system crash, and in this case the process is only the 
> trigger, but not the root cause.
>
> Regards,
> Christian.
>
On 17.12.18 at 04:03, Huang, Trigger wrote:
>> Hi Christian,
>>
>> Many thanks for pointing out the mistakes
>>
>> I have some comments as below, would you help to check again?
>>
>> First of all you can't get the process info during VM creating since that 
>> can happen in X as well.
>> [Trigger]: Ok, I will keep the original logic, which is that set the vm info 
>> in cs . I will still invoke kfree(cmd_line) in amdgpu_vm_fini to avoid 
>> memory leak.
>>
>> Second when a timeout happen the VM structure might already be released, so 
>> using job->vm is illegal here. What we could try is to get the VM using the 
>> PASID.
>> [Trigger]: Ok, I will do it in job timeout like what VMC page fault's 
>> handler does
>>
>> And last we

RE: [PATCH] drm/amdgpu: Add more process info in VM for debug

2018-12-16 Thread Huang, Trigger
Hi Christian,

Many thanks for pointing out the mistakes

I have some comments as below, would you help to check again?

First of all, you can't get the process info during VM creation since that can 
happen in X as well.
[Trigger]: OK, I will keep the original logic, which is to set the VM info in 
CS. I will still invoke kfree(cmd_line) in amdgpu_vm_fini to avoid a memory leak.

Second, when a timeout happens, the VM structure might already be released, so 
using job->vm is illegal here. What we could try is to get the VM using the 
PASID.
[Trigger]: OK, I will do it in the job timeout handler like the VMC page fault 
handler does.

And last, we don't want to keep the full command line around.
[Trigger]: Well, actually, the detailed command line is just what we want.
For example, there are thousands of sub-cases in one big test case, and for 
each sub-case the arguments may also be different.
In some corner cases the test machine hangs after running the big test case for 
several hours or even several days; it is really painful to wait another several 
hours to reproduce it and debug.
But if we know the last sub-case running on the test machine, then this issue 
*may* be reproduced by only running the specific sub-case with the specific 
arguments for several rounds, 
and in this situation it will save us a lot of time for reproducing and debugging.
Does this make sense? If not, how about we add a parameter, such as 
amdgpu_vm_debug_verbose, to turn the command line dumping on/off?
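
A minimal sketch of such a knob (amdgpu_vm_debug_verbose is a hypothetical name, not
an existing amdgpu parameter, and task_info.cmd_line below is likewise illustrative):

/* Hypothetical module parameter gating the command line dump; off by default. */
static int amdgpu_vm_debug_verbose;
module_param_named(vm_debug_verbose, amdgpu_vm_debug_verbose, int, 0444);
MODULE_PARM_DESC(vm_debug_verbose,
		 "Dump full process command line on VM fault/job timeout (0 = off (default), 1 = on)");

	/* ... at the fault / timeout report site ... */
	if (amdgpu_vm_debug_verbose)
		DRM_ERROR("cmd line: %s\n", task_info.cmd_line);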


Thanks,
Trigger

-Original Message-
From: Christian König  
Sent: Saturday, December 15, 2018 8:23 PM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey ; Qu, Jim 
Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug

On 15.12.18 at 09:56, Trigger Huang wrote:
> When debugging VMC page fault and ring hang issues, the detailed 
> process information is quite helpful, especially when the issue can 
> only be reproduced after a very long time running. With this 
> information, only running the specific sub-testcase may also 
> reproduce the issue, which may save a lot of time for debugging.
>
> With this patch, the process information is similar as following.
>   When VMC page fault issue happened:
> [  142.978417] amdgpu :00:08.0: [gfxhub] VMC page fault (src_id:0 
> ring:171 vmid:2 pasid:32769), [  142.978542] amdgpu :00:08.0: for process 
> ocltst pid 1354 thread ocltst pid 1354, args:./ocltst -m oclperf.so -t 
> OCLPerfDeviceEnqueueEvent,
> [  142.978693] amdgpu :00:08.0:   in page starting at address 
> 0x from 27
>
>   When ring hang issue happened:
> [ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring 
> comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572 [ 1740.050167] 
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's process information is 
> as below:
> [ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process SDMA 
> pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode goldimage_compare 
> --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor 
> --n-test-threads 4
>
> Signed-off-by: Trigger Huang 

Well NAK, we intentionally didn't do it this way.

First of all you can't get the process info during VM creating since that can 
happen in X as well.

Second when a timeout happen the VM structure might already be released, so 
using job->vm is illegal here. What we could try is to get the VM using the 
PASID.

And last we don't want to keep the full command line around.

The only valid addition I can see here is to print the thread info when the 
timeout happens.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  3 ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 11 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  1 +
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   | 11 ++-
>   5 files changed, 26 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 1c49b82..1a2d0c9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -235,9 +235,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser 
> *p, union drm_amdgpu_cs
>   p->job->uf_addr = uf_offset;
>   kfree(chunk_array);
>   
> - /* Use this opportunity to fill in task info for the vm */
> - amdgpu_vm_set_task_info(vm);
> -
>   return 0;
>   
>   free_all_kdata:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index e0af44f..c75ecb3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -43,6 +43,14 @@ s

RE: [PATCH V3] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-16 Thread Huang, Trigger
Hi, 
Would you help on it?

-Original Message-
From: amd-gfx  On Behalf Of Trigger Huang
Sent: Thursday, November 15, 2018 3:30 PM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger 
Subject: [PATCH V3] drm/scheduler: Fix bad job be re-processed in TDR

A bad job is the one that triggered TDR (in the current amdgpu implementation, 
actually all the jobs in the current job queue will be treated as bad jobs). In 
the recovery process, its fence will be fake signaled and, as a result, the work 
behind it will be scheduled to delete it from the mirror list, but if the TDR 
process is invoked before the work's execution, then this bad job might be 
processed again, and the dma_fence_set_error call on its fence in the TDR process 
will lead to a kernel warning trace:

[  143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 
amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
kernel: [  143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) 
amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit 
fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 
snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event 
snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc 
input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc 
sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too 
floppy psmouse 8139cp mii i2c_piix4 pata_acpi
[  143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G   OE
4.15.0-20-generic #21-Ubuntu
[  143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014 [  143.033653] Workqueue: events 
drm_sched_job_timedout [amd_sched] [  143.033656] RIP: 
0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched] [  143.033657] RSP: 
0018:a9f880fe7d48 EFLAGS: 00010202 [  143.033659] RAX: 0007 
RBX: 9b98f2b24c00 RCX: 9b98efef4f08 [  143.033660] RDX: 
9b98f2b27400 RSI: 9b98f2b24c50 RDI: 9b98efef4f18 [  143.033660] 
RBP: a9f880fe7d98 R08: 0001 R09: 02b6 [  
143.033661] R10:  R11:  R12: 9b98efef3430 [ 
 143.033662] R13: 9b98efef4d80 R14: 9b98efef4e98 R15: 9b98eaf91c00 
[  143.033663] FS:  () GS:9b98ffd0() 
knlGS: [  143.033664] CS:  0010 DS:  ES:  CR0: 
80050033 [  143.033665] CR2: 7fc49c96d470 CR3: 1400a005 
CR4: 003606e0 [  143.033669] DR0:  DR1: 
 DR2:  [  143.033669] DR3:  
DR6: fffe0ff0 DR7: 0400 [  143.033670] Call Trace:
[  143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu] [  143.033788]  
amdgpu_job_timedout+0x9b/0xa0 [amdgpu] [  143.033791]  
drm_sched_job_timedout+0xcc/0x150 [amd_sched] [  143.033795]  
process_one_work+0x1de/0x410 [  143.033797]  worker_thread+0x32/0x410 [  
143.033799]  kthread+0x121/0x140 [  143.033801]  ? process_one_work+0x410/0x410 
[  143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
[  143.033806]  ret_from_fork+0x35/0x40

So just delete the bad job from mirror list directly

Changes in v3:
- Add a helper function to delete the bad jobs from mirror list and call
it directly *before* the job's fence is signaled

Changes in v2:
- delete the useless list node check
- also delete bad jobs in drm_sched_main because:
kthread_unpark(ring->sched.thread) will be invoked very early before
amdgpu_device_gpu_recover's return, then drm_sched_main will have a
chance to pick up a new job from the job queue. This new job will be
added into the mirror list and processed by amdgpu_job_run, but may
not be deleted from the mirror list on time due to the same reason,
and finally be re-processed by drm_sched_job_recovery.

Signed-off-by: Trigger Huang 
---
 drivers/gpu/drm/scheduler/sched_main.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 18ebbb0..6fedf95 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -60,6 +60,8 @@
 
 static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb 
*cb);
 
+static void drm_sched_expel_job_unlocked(struct drm_sched_job *s_job);
+
 /**
  * drm_sched_rq_init - initialize a given run queue struct
  *
@@ -228,7 +230,7 @@ static void drm_sched_job_finish(struct work_struct *work)
 
spin_lock(&sched->job_list_lock);
/* remove job from ring_mirror_list */
-   list_del(&s_job->node);
+   list_del_init(_
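
The archived diff is cut off here. A hedged sketch of what the helper declared above
presumably does, based on the commit message (remove the bad job from the mirror list
under the list lock), not the literal patch body:

static void drm_sched_expel_job_unlocked(struct drm_sched_job *s_job)
{
	struct drm_gpu_scheduler *sched = s_job->sched;

	/* Take the job out of the ring mirror list so a later recovery pass
	 * cannot process it and mark its fence a second time.
	 */
	spin_lock(&sched->job_list_lock);
	list_del(&s_job->node);
	spin_unlock(&sched->job_list_lock);
}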

RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-14 Thread Huang, Trigger
Hi guys,

I have modified the patch as V2 in the other thread; would you help to review it?

-Original Message-
From: amd-gfx  On Behalf Of Koenig, 
Christian
Sent: Monday, November 12, 2018 6:12 PM
To: Liu, Monk ; Huang, Trigger ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

Hi guys,

yeah that is the same problem I was trying to fix with drm/sched: fix timeout 
handling v2.

But this case is even more complicated because here it is the hypervisor or KFD 
which triggers the second reset.

Going to take another look at this today, Christian.

On 12.11.18 at 10:19, Liu, Monk wrote:
> Hi Christian
>
> Trigger's patch intends to fix the double signaling on dma_fence in the below 
> scenario:
>
> 1) Gpu_recovery(job = some job) invoked by guest TDR ==> the hang 
> job's fence is set as -ECANCELED and fake signaled
> 2) GPU_recovery(job = NULL) again invoked by the hypervisor or KFD, but the job 
> of the above step hasn't finished its "drm_sched_job_finish" routine, which is 
> kicked by queue_work(), because the second gpu_recovery() is called 
> prior to "drm_sched_job_finish" (so the job is not taken out of the 
> mirror_list, so there will be a duplicated dma_fence_set_error on that 
> job, giving you the warning message)
>
> /Monk
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Huang, Trigger
> Sent: Monday, November 12, 2018 4:36 PM
> To: Koenig, Christian ; 
> amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR
>
> Hi Christian
>
> Thanks for the correction before, I gave my explanation as below, would you 
> help to check again, thanks in advance.
>
> The key thing here is, if a job's fence is signaled already, then calling 
> dma_fence_set_error on its fence will lead to a kernel warning call trace:
>
> static inline void dma_fence_set_error(struct dma_fence *fence, int error)
> {
>         WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));   # here is warning source
>         WARN_ON(error >= 0 || error < -MAX_ERRNO);
>
>         fence->error = error;
> }
>   
>   
> Then, let’s see a guilty job’s process flow in TDR.
>   amdgpu_device_gpu_recover -> drm_sched_job_recovery  -> for each job in 
> ring_mirror_list :
>1)dma_fence_set_error(_fence->finished, 
> -ECANCELED) if this job is guilty
>2)amdgpu_job_run: the guilty job will be 
> skipped, and not submitted into to ring
>3)drm_sched_process_job(guilty_job) -> 
> drm_sched_fence_finished -> dma_fence_signal(>finished) -> 
> drm_sched_job_finish_cb -> schedule_work(>finish_work);
>Later, finish_work’s callback function,  
> drm_sched_job_finish, will be called by work queue, and guilty job will 
> finally be deleted from ring_mirror_list.
>   But sometimes, amdgpu_device_gpu_recover will be called(from host FLR 
> interrupt or KFD) again immediately, and at this moment, finish_work  is not 
> scheduled by work queue, and drm_sched_job_finish for the guilty job is not 
> called yet, so this guilty job is still in the   ring_mirror_list. But 
> remember, the guilty job’s fence was signaled before (by 
> drm_sched_fence_finished), followed by the TDR process, dma_fence_set_error 
> will be called again to this guilty, then cause the kernel warning call trace.
>   
> So the root cause is, the finish_work for each job  is not scheduled quickly 
> enough for GPU TDR scenario.
>   Three ideas:
>   1), Instead of using the system global work queue, maybe we can 
> create a dedicate work queue to manage the job finish things, but still can’t 
> guarantee its finish_work’s execution in GPU TDR scenario
>   2), Wait for all the guilty jobs’ finish_work to be finished 
> execution before return from drm_sched_job_recovery  , but I think it will 
> waste too much time.
>   3), to delete the guilty job from the ring_mirror_list directly 
> after this job is processed in amdgpu_job_run().amdgpu_job_run is the 
> final entrance for a job to be submitted into the ring, if something wrong 
> with this job:(finished->error < 0), then this job should be  
>  never be submitted again, and so should be deleted from the recovery list.
> I choose 3).
>
>
> I am going to make a new patch as below instead of the original one
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index e0af44f..aaf697f 100644

RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

2018-11-12 Thread Huang, Trigger
Hi Christian

Thanks for the correction earlier. I have given my explanation below; would you 
help to check it again? Thanks in advance.

The key thing here is that if a job’s fence has already been signaled, then 
calling dma_fence_set_error on that fence will lead to a kernel warning call 
trace:

static inline void dma_fence_set_error(struct dma_fence *fence,
                                       int error)
{
	WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));  # here is warning source
	WARN_ON(error >= 0 || error < -MAX_ERRNO);

	fence->error = error;
}
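As a minimal illustration of the failure mode (not driver code, just the two 
calls that end up conflicting, assuming some already-scheduled fence f):

	dma_fence_signal(f);                 /* first recovery: fence gets signaled,
	                                      * DMA_FENCE_FLAG_SIGNALED_BIT is set */
	dma_fence_set_error(f, -ECANCELED);  /* second recovery on the same job:
	                                      * the WARN_ON above fires */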
 
 
Then, let’s see a guilty job’s process flow in TDR.

amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in 
ring_mirror_list:
		1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this 
job is guilty
		2) amdgpu_job_run: the guilty job will be skipped and not 
submitted to the ring
		3) drm_sched_process_job(guilty_job) -> drm_sched_fence_finished 
-> dma_fence_signal(&fence->finished) -> drm_sched_job_finish_cb -> 
schedule_work(&job->finish_work);
		Later, finish_work’s callback function, drm_sched_job_finish, 
will be called by the work queue, and the guilty job will finally be deleted 
from ring_mirror_list.
	But sometimes amdgpu_device_gpu_recover will be called again immediately 
(from a host FLR interrupt or KFD), and at that moment finish_work has not been 
scheduled by the work queue and drm_sched_job_finish for the guilty job has not 
been called yet, so this guilty job is still in the ring_mirror_list. But 
remember, the guilty job’s fence was already signaled (by 
drm_sched_fence_finished); in the following TDR process dma_fence_set_error 
will be called again on this guilty job, causing the kernel warning call trace.

So the root cause is that each job's finish_work is not scheduled quickly 
enough for the GPU TDR scenario.
Three ideas:
	1) Instead of using the system global workqueue, maybe we can create a 
dedicated workqueue to manage the job finish work, but this still can’t 
guarantee that finish_work has executed in the GPU TDR scenario (see the rough 
sketch after this list).
	2) Wait for all the guilty jobs’ finish_work to finish executing before 
returning from drm_sched_job_recovery, but I think that would waste too much 
time.
	3) Delete the guilty job from the ring_mirror_list directly after the 
job is processed in amdgpu_job_run(). amdgpu_job_run is the final entrance for 
a job to be submitted to the ring; if something is wrong with the job 
(finished->error < 0), then it should never be submitted again, and so should 
be deleted from the recovery list.
I choose 3).
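For reference, idea 1) might look roughly like the sketch below. This is only 
an illustration of the direction, not a tested patch; the workqueue name and 
the init helper are assumptions of mine:

	/* one dedicated, ordered workqueue instead of the system global one */
	static struct workqueue_struct *drm_sched_finish_wq;

	static int drm_sched_finish_wq_init(void)
	{
		drm_sched_finish_wq = alloc_ordered_workqueue("drm-sched-finish", 0);
		return drm_sched_finish_wq ? 0 : -ENOMEM;
	}

	/* ... and in drm_sched_job_finish_cb(), queue onto it: */
	queue_work(drm_sched_finish_wq, &job->finish_work);
	/* instead of schedule_work(&job->finish_work); */

But as said above, even a dedicated workqueue cannot guarantee that the finish 
work has already run when the second gpu_recovery() comes in, which is why I 
go with 3).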


I am going to make a new patch as below instead of the original one

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e0af44f..aaf697f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -208,6 +208,7 @@ static struct dma_fence *amdgpu_job_dependency(struct 
drm_sched_job *sched_job,
 static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 {
struct amdgpu_ring *ring = to_amdgpu_ring(sched_job->sched);
+   struct drm_gpu_scheduler *sched = sched_job->sched;
struct dma_fence *fence = NULL, *finished;
struct amdgpu_job *job;
int r;
@@ -223,7 +224,10 @@ static struct dma_fence *amdgpu_job_run(struct 
drm_sched_job *sched_job)
dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if 
VRAM lost */

if (finished->error < 0) {  
-   DRM_INFO("Skip scheduling IBs!\n");
+   DRM_INFO("Skip scheduling IBs and delete job from recovery 
list!\n");
+   spin_lock(>job_list_lock);
+   list_del_init(_job->node);
+   spin_unlock(>job_list_lock);
} else {
r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
   &fence);


Thanks,
Trigger

-----Original Message-
From: Christian König  
Sent: Friday, November 09, 2018 8:18 PM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

Am 09.11.18 um 13:10 schrieb Trigger Huang:
> A bad job is the one triggered TDR. In the recovery process, its fence 
> will be signaled and as a result, the work behind will be scheduled to 
> delete it from the mirror list, but if the TDR process is invoked 
> before the work's execution, then the call dma_fence_set_error to its 
> fence in TDR proc

RE: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

2017-04-19 Thread Huang, Trigger
Hi Alex,
Thanks for your help.



Thanks & Best Wishes,
Trigger Huang


-Original Message-
From: Alex Deucher [mailto:alexdeuc...@gmail.com] 
Sent: Thursday, April 20, 2017 3:18 AM
To: Huang, Trigger <trigger.hu...@amd.com>
Cc: amd-gfx list <amd-gfx@lists.freedesktop.org>; Huang, Ray 
<ray.hu...@amd.com>; Yu, Xiangliang <xiangliang...@amd.com>; Liu, Monk 
<monk@amd.com>
Subject: Re: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

On Tue, Apr 18, 2017 at 3:04 AM, Trigger Huang <trigger.hu...@amd.com> wrote:
> Fix an issue where PSP initialization will fail if the amdgpu module is
> reloaded. That's because the PSP ring must be destroyed so that it is ready
> for the next PSP initialization.
>
> Changes in v2:
> - Move psp_ring_destroy before all BOs free according to
>   Ray Huang's suggestion.
>
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>

I'm not really a PSP expert, but the change looks logical.
Acked-by: Alex Deucher <alexander.deuc...@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c |  3 +++  
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  3 +++
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c   | 27 +++
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.h   |  2 ++
>  4 files changed, 35 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 19180aa..73ddf4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -56,6 +56,7 @@ static int psp_sw_init(void *handle)
> psp->prep_cmd_buf = psp_v3_1_prep_cmd_buf;
> psp->ring_init = psp_v3_1_ring_init;
> psp->ring_create = psp_v3_1_ring_create;
> +   psp->ring_destroy = psp_v3_1_ring_destroy;
> psp->cmd_submit = psp_v3_1_cmd_submit;
> psp->compare_sram_data = psp_v3_1_compare_sram_data;
> psp->smu_reload_quirk = psp_v3_1_smu_reload_quirk; @@ 
> -411,6 +412,8 @@ static int psp_hw_fini(void *handle)
> struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> struct psp_context *psp = &adev->psp;
>
> +   psp_ring_destroy(psp, PSP_RING_TYPE__KM);
> +
> if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)
> amdgpu_ucode_fini_bo(adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> index 28faba5..0301e4e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> @@ -66,6 +66,8 @@ struct psp_context
> struct psp_gfx_cmd_resp *cmd);
> int (*ring_init)(struct psp_context *psp, enum psp_ring_type 
> ring_type);
> int (*ring_create)(struct psp_context *psp, enum psp_ring_type 
> ring_type);
> +   int (*ring_destroy)(struct psp_context *psp,
> +   enum psp_ring_type ring_type);
> int (*cmd_submit)(struct psp_context *psp, struct 
> amdgpu_firmware_info *ucode,
>   uint64_t cmd_buf_mc_addr, uint64_t fence_mc_addr, 
> int index);
> bool (*compare_sram_data)(struct psp_context *psp, @@ -116,6 
> +118,7 @@ struct amdgpu_psp_funcs {  #define psp_prep_cmd_buf(ucode, 
> type) (psp)->prep_cmd_buf((ucode), (type))  #define psp_ring_init(psp, 
> type) (psp)->ring_init((psp), (type))  #define psp_ring_create(psp, 
> type) (psp)->ring_create((psp), (type))
> +#define psp_ring_destroy(psp, type) ((psp)->ring_destroy((psp), 
> +(type)))
>  #define psp_cmd_submit(psp, ucode, cmd_mc, fence_mc, index) \
> (psp)->cmd_submit((psp), (ucode), (cmd_mc), 
> (fence_mc), (index))  #define psp_compare_sram_data(psp, ucode, type) 
> \ diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c 
> b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> index d351583..e834b55 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> @@ -321,6 +321,33 @@ int psp_v3_1_ring_create(struct psp_context *psp, enum 
> psp_ring_type ring_type)
> return ret;
>  }
>
> +int psp_v3_1_ring_destroy(struct psp_context *psp, enum psp_ring_type 
> +ring_type) {
> +   int ret = 0;
> +   struct psp_ring *ring;
> +   unsigned int psp_ring_reg = 0;
> +   struct amdgpu_device *adev = psp->adev;
> +
> +   ring = &psp->km_ring;
> +
> +   /* Write the ring destroy command to C2PMSG_64 */
> +   psp_ring_reg = 3 << 16;
> +   WREG32(SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64), 
> + psp_ring_reg);
> +
> +   /* there might be handshake issue with hardware which needs delay */

RE: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

2017-04-19 Thread Huang, Trigger
Hi Ray,

Thanks for your suggestions. As discussed, freeing adev->firmware.rbuf is still 
needed.
I have made a patch V3 according to your suggestions.

Thanks & Best Wishes,
Trigger Huang


-Original Message-
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of Huang 
Rui
Sent: Thursday, April 20, 2017 8:54 AM
To: Huang, Trigger <trigger.hu...@amd.com>
Cc: Yu, Xiangliang <xiangliang...@amd.com>; Liu, Monk <monk@amd.com>; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

On Tue, Apr 18, 2017 at 03:04:45PM +0800, Trigger Huang wrote:
> Fix an issue where PSP initialization will fail if the amdgpu module is
> reloaded. That's because the PSP ring must be destroyed so that it is ready
> for the next PSP initialization.
> 
> Changes in v2:
> - Move psp_ring_destroy before all BOs free according to
>   Ray Huang's suggestion.
> 
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c |  3 +++  
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  3 +++
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c   | 27 +++
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.h   |  2 ++
>  4 files changed, 35 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> b/drivers/gpu/drm/amd/ amdgpu/amdgpu_psp.c index 19180aa..73ddf4b 
> 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -56,6 +56,7 @@ static int psp_sw_init(void *handle)
>  psp->prep_cmd_buf = psp_v3_1_prep_cmd_buf;
>  psp->ring_init = psp_v3_1_ring_init;
>  psp->ring_create = psp_v3_1_ring_create;
> +   psp->ring_destroy = psp_v3_1_ring_destroy;
>  psp->cmd_submit = psp_v3_1_cmd_submit;
>  psp->compare_sram_data = psp_v3_1_compare_sram_data;
>  psp->smu_reload_quirk = psp_v3_1_smu_reload_quirk; @@ 
> -411,6 +412,8 @@ static int psp_hw_fini(void *handle)
>  struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  struct psp_context *psp = &adev->psp;
>  
> +   psp_ring_destroy(psp, PSP_RING_TYPE__KM);
> +
>  if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)
>  amdgpu_ucode_fini_bo(adev);

If adev->firmware.load_type is AMDGPU_FW_LOAD_DIRECT, we should not call 
psp_ring_destroy, right?

I already told you in the last mail:

	if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP)
		return 0;

	amdgpu_ucode_fini_bo(adev);
	psp_ring_destroy(psp, PSP_RING_TYPE__KM);
	...
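For clarity, psp_hw_fini following that ordering would look roughly like the 
sketch below (illustrative only, the rest of the BO tear-down is elided):

	static int psp_hw_fini(void *handle)
	{
		struct amdgpu_device *adev = (struct amdgpu_device *)handle;
		struct psp_context *psp = &adev->psp;

		/* nothing was set up through the PSP for direct FW loading */
		if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP)
			return 0;

		amdgpu_ucode_fini_bo(adev);
		psp_ring_destroy(psp, PSP_RING_TYPE__KM);

		/* ... free the remaining PSP BOs here ... */
		return 0;
	}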

>  
> +int psp_v3_1_ring_destroy(struct psp_context *psp, enum psp_ring_type
> ring_type)
> +{
> +   int ret = 0;
> +   struct psp_ring *ring;
> +   unsigned int psp_ring_reg = 0;
> +   struct amdgpu_device *adev = psp->adev;
> +
> +   ring = &psp->km_ring;
> +
> +   /* Write the ring destroy command to C2PMSG_64 */
> +   psp_ring_reg = 3 << 16;
> +   WREG32(SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64), 
> + psp_ring_reg);
> +
> +   /* there might be handshake issue with hardware which needs delay */
> +   mdelay(20);
> +
> +   /* Wait for response flag (bit 31) in C2PMSG_64 */
> +   ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
> +  0x80000000, 0x80000000, false);
> +

> +   if (ring->ring_mem)
> +   amdgpu_bo_free_kernel(&adev->firmware.rbuf,
> + &ring->ring_mem_mc_addr,
> + (void **)&ring->ring_mem);

adev->firmware.rbuf is the entire memory space for storing firmware data, which 
is initialized by amdgpu_ucode_init_bo and already torn down via 
amdgpu_ucode_fini_bo. You cannot free it again.
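With that, the destroy path would only keep the ring destroy command and the 
handshake, and leave the buffer to amdgpu_ucode_fini_bo. A rough sketch of that 
direction (the actual v3 patch is not shown in this thread):

	int psp_v3_1_ring_destroy(struct psp_context *psp, enum psp_ring_type ring_type)
	{
		struct amdgpu_device *adev = psp->adev;

		/* Write the ring destroy command to C2PMSG_64 */
		WREG32(SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64), 3 << 16);

		/* there might be handshake issue with hardware which needs delay */
		mdelay(20);

		/* Wait for response flag (bit 31) in C2PMSG_64 */
		return psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
				    0x80000000, 0x80000000, false);
	}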

Thanks,
Rui
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

2017-04-19 Thread Huang, Trigger
Hi,

Would you please help to review this patch?

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Trigger Huang [mailto:trigger.hu...@amd.com] 
Sent: Tuesday, April 18, 2017 3:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <monk@amd.com>; Yu, Xiangliang <xiangliang...@amd.com>; 
Huang, Ray <ray.hu...@amd.com>; Huang, Trigger <trigger.hu...@amd.com>
Subject: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

Fix an issue where PSP initialization will fail if the amdgpu module is 
reloaded. That's because the PSP ring must be destroyed so that it is ready for 
the next PSP initialization.

Changes in v2:
- Move psp_ring_destroy before all BOs free according to
  Ray Huang's suggestion.

Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c |  3 +++  
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  3 +++
 drivers/gpu/drm/amd/amdgpu/psp_v3_1.c   | 27 +++
 drivers/gpu/drm/amd/amdgpu/psp_v3_1.h   |  2 ++
 4 files changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 19180aa..73ddf4b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -56,6 +56,7 @@ static int psp_sw_init(void *handle)
psp->prep_cmd_buf = psp_v3_1_prep_cmd_buf;
psp->ring_init = psp_v3_1_ring_init;
psp->ring_create = psp_v3_1_ring_create;
+   psp->ring_destroy = psp_v3_1_ring_destroy;
psp->cmd_submit = psp_v3_1_cmd_submit;
psp->compare_sram_data = psp_v3_1_compare_sram_data;
psp->smu_reload_quirk = psp_v3_1_smu_reload_quirk; @@ -411,6 
+412,8 @@ static int psp_hw_fini(void *handle)
struct amdgpu_device *adev = (struct amdgpu_device *)handle;
struct psp_context *psp = &adev->psp;
 
+   psp_ring_destroy(psp, PSP_RING_TYPE__KM);
+
if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)
amdgpu_ucode_fini_bo(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 28faba5..0301e4e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -66,6 +66,8 @@ struct psp_context
struct psp_gfx_cmd_resp *cmd);
int (*ring_init)(struct psp_context *psp, enum psp_ring_type ring_type);
int (*ring_create)(struct psp_context *psp, enum psp_ring_type 
ring_type);
+   int (*ring_destroy)(struct psp_context *psp,
+   enum psp_ring_type ring_type);
int (*cmd_submit)(struct psp_context *psp, struct amdgpu_firmware_info 
*ucode,
  uint64_t cmd_buf_mc_addr, uint64_t fence_mc_addr, int 
index);
bool (*compare_sram_data)(struct psp_context *psp, @@ -116,6 +118,7 @@ 
struct amdgpu_psp_funcs {  #define psp_prep_cmd_buf(ucode, type) 
(psp)->prep_cmd_buf((ucode), (type))  #define psp_ring_init(psp, type) 
(psp)->ring_init((psp), (type))  #define psp_ring_create(psp, type) 
(psp)->ring_create((psp), (type))
+#define psp_ring_destroy(psp, type) ((psp)->ring_destroy((psp), 
+(type)))
 #define psp_cmd_submit(psp, ucode, cmd_mc, fence_mc, index) \
(psp)->cmd_submit((psp), (ucode), (cmd_mc), (fence_mc), 
(index))  #define psp_compare_sram_data(psp, ucode, type) \ diff --git 
a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
index d351583..e834b55 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
@@ -321,6 +321,33 @@ int psp_v3_1_ring_create(struct psp_context *psp, enum 
psp_ring_type ring_type)
return ret;
 }
 
+int psp_v3_1_ring_destroy(struct psp_context *psp, enum psp_ring_type 
+ring_type) {
+   int ret = 0;
+   struct psp_ring *ring;
+   unsigned int psp_ring_reg = 0;
+   struct amdgpu_device *adev = psp->adev;
+
+   ring = &psp->km_ring;
+
+   /* Write the ring destroy command to C2PMSG_64 */
+   psp_ring_reg = 3 << 16;
+   WREG32(SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64), psp_ring_reg);
+
+   /* there might be handshake issue with hardware which needs delay */
+   mdelay(20);
+
+   /* Wait for response flag (bit 31) in C2PMSG_64 */
+   ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
+  0x80000000, 0x80000000, false);
+
+   if (ring->ring_mem)
+   amdgpu_bo_free_kernel(&adev->firmware.rbuf,
+ &ring->ring_mem_mc_addr,
+ (void **)&ring->ring_mem);
+   return ret;
+}
+
 int psp_v3_1_cmd_submit(struct psp_context *psp,
struct amdgpu_firmware_info *ucode,
uint64_t cmd_buf_mc_addr, uint64_t fence_mc_addr, diff 
-

RE: [PATCH] drm/amdgpu: Fix module unload hang by KIQ IRQ set

2017-02-16 Thread Huang, Trigger
Hi Xiangliang,

Thanks for your suggestions, I will make a new patch according to the discussion

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Yu, Xiangliang 
Sent: Friday, February 17, 2017 2:15 PM
To: Huang, Trigger <trigger.hu...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <monk....@amd.com>; Huang, Trigger <trigger.hu...@amd.com>
Subject: RE: [PATCH] drm/amdgpu: Fix module unload hang by KIQ IRQ set

Hi Trigger,

Can you try to move the src->data = NULL handling into the 
gfx_v8_0_kiq_set_interrupt_state function? I think it is simpler and clearer.
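One possible reading of that suggestion, as a rough sketch (my assumption, not 
the posted patch): keep irq->data, let gfx_v8_0_kiq_free_ring clear it, and 
bail out early in the interrupt-state handler once it is gone:

	static int gfx_v8_0_kiq_set_interrupt_state(struct amdgpu_device *adev,
						    struct amdgpu_irq_src *src,
						    unsigned int type,
						    enum amdgpu_interrupt_state state)
	{
		struct amdgpu_ring *ring = (struct amdgpu_ring *)src->data;

		/* KIQ ring already torn down by GFX sw_fini, nothing to program */
		if (!ring)
			return 0;

		/* ... program CP_ME1/ME2_PIPE0_INT_CNTL as before ... */
		return 0;
	}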

Thanks!
Xiangliang Yu

> -Original Message-
> From: Trigger Huang [mailto:trigger.hu...@amd.com]
> Sent: Thursday, February 16, 2017 6:48 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Liu, Monk <monk@amd.com>; Yu, Xiangliang 
> <xiangliang...@amd.com>; Huang, Trigger <trigger.hu...@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix module unload hang by KIQ IRQ set
> 
> In some cases, manually doing insmod/rmmod of amdgpu is necessary. When
> unloading amdgpu, the KIQ IRQ enable/disable function will cause a system
> hang. The root cause is that, in the sequence of amdgpu_fini, the sw_fini of
> IP block AMD_IP_BLOCK_TYPE_GFX is invoked earlier than that of
> AMD_IP_BLOCK_TYPE_IH, so continuing to use the variable freed by
> AMD_IP_BLOCK_TYPE_GFX will cause a system hang.
> 
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  2 ++
>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 17 -
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 1db2e7b..be43d09 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -793,6 +793,8 @@ struct amdgpu_kiq {
>   struct amdgpu_bo*eop_obj;
>   struct amdgpu_ring  ring;
>   struct amdgpu_irq_src   irq;
> + u32 me;
> + u32 pipe;
>  };
> 
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> index 772c42b..04e2a5b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> @@ -1390,12 +1390,15 @@ static int gfx_v8_0_kiq_init_ring(struct 
> amdgpu_device *adev,
>   if (adev->gfx.mec2_fw) {
>   ring->me = 2;
>   ring->pipe = 0;
> + adev->gfx.kiq.me = 2;
> + adev->gfx.kiq.pipe = 0;
>   } else {
>   ring->me = 1;
>   ring->pipe = 1;
> + adev->gfx.kiq.me = 1;
> + adev->gfx.kiq.pipe = 1;
>   }
> 
> - irq->data = ring;
>   ring->queue = 0;
>   sprintf(ring->name, "kiq %d.%d.%d", ring->me, ring->pipe, ring->queue);
>   r = amdgpu_ring_init(adev, ring, 1024, @@ -1410,7 +1413,6 @@ static 
> void gfx_v8_0_kiq_free_ring(struct amdgpu_ring *ring,  {
>   amdgpu_wb_free(ring->adev, ring->adev->virt.reg_val_offs);
>   amdgpu_ring_fini(ring);
> - irq->data = NULL;
>  }
> 
>  #define MEC_HPD_SIZE 2048
> @@ -6927,15 +6929,12 @@ static int
> gfx_v8_0_kiq_set_interrupt_state(struct amdgpu_device *adev,
>   enum amdgpu_interrupt_state
> state)  {
>   uint32_t tmp, target;
> - struct amdgpu_ring *ring = (struct amdgpu_ring *)src->data;
> 
> - BUG_ON(!ring || (ring->funcs->type != AMDGPU_RING_TYPE_KIQ));
> -
> - if (ring->me == 1)
> + if (adev->gfx.kiq.me == 1)
>   target = mmCP_ME1_PIPE0_INT_CNTL;
>   else
>   target = mmCP_ME2_PIPE0_INT_CNTL;
> - target += ring->pipe;
> + target += adev->gfx.kiq.pipe;
> 
>   switch (type) {
>   case AMDGPU_CP_KIQ_IRQ_DRIVER0:
> @@ -6973,7 +6972,7 @@ static int gfx_v8_0_kiq_irq(struct amdgpu_device 
> *adev,
>   struct amdgpu_iv_entry *entry)  {
>   u8 me_id, pipe_id, queue_id;
> - struct amdgpu_ring *ring = (struct amdgpu_ring *)source->data;
> + struct amdgpu_ring *ring = &(adev->gfx.kiq.ring);
> 
>   BUG_ON(!ring || (ring->funcs->type != AMDGPU_RING_TYPE_KIQ));
> 
> @@ -7380,4 +7379,4 @@ static void
> gfx_v8_0_compute_mqd_soft_fini(struct amdgpu_device *adev)
> 
>   ring = &adev->gfx.kiq.ring;
>   amdgpu_bo_free_kernel(&ring->mqd_obj, &ring->mqd_gpu_addr, (void 
> **)&ring->mqd_ptr); -} \ No newline at end of file
> +}
> --
> 2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH v2] drm/amdgpu:no gpu scheduler for KIQ

2016-11-03 Thread Huang, Trigger
Hi Christian,
OK, I got it, thanks for your detailed suggestion.

Thanks & Best Wishes,
Trigger Huang


-Original Message-
From: Christian König [mailto:deathsim...@vodafone.de] 
Sent: Wednesday, November 02, 2016 8:24 PM
To: Huang, Trigger <trigger.hu...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Yu, Xiangliang <xiangliang...@amd.com>; Liu, Monk <monk@amd.com>
Subject: Re: [PATCH v2] drm/amdgpu:no gpu scheduler for KIQ

Am 02.11.2016 um 12:48 schrieb Trigger Huang:
> KIQ is used for interaction between the driver and CP, and is not exposed to 
> outside clients; as such it doesn't need to be handled by the GPU 
> scheduler.
>
> Signed-off-by: Monk Liu <monk@amd.com>
> Signed-off-by: Xiangliang Yu <xiangliang...@amd.com>
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>

Even if you only fix a small mistake, it is usually good practice to increase 
the version number of the patch; e.g. you would use v3 in this case and write a 
one-liner in the commit message about what was wrong.

But that's only a nitpick, so the patch is Reviewed-by: Christian König 
<christian.koe...@amd.com> anyway.

Regards,
Christian.

>
> Changes in v2:
>   - According to Alex's suggestion, wrapping the scheduler setup
> conditionally instead of returning early.
>   - Use another simple method to check if is a KIQ ring.
>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 39 
> +--
>   1 file changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 77b34ec..5772ef2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -382,24 +382,27 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
> *ring,
>   if (!ring->fence_drv.fences)
>   return -ENOMEM;
>   
> - timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
> - if (timeout == 0) {
> - /*
> -  * FIXME:
> -  * Delayed workqueue cannot use it directly,
> -  * so the scheduler will not use delayed workqueue if
> -  * MAX_SCHEDULE_TIMEOUT is set.
> -  * Currently keep it simple and silly.
> -  */
> - timeout = MAX_SCHEDULE_TIMEOUT;
> - }
> - r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
> -num_hw_submission,
> -timeout, ring->name);
> - if (r) {
> - DRM_ERROR("Failed to create scheduler on ring %s.\n",
> -   ring->name);
> - return r;
> + /* No need to setup the GPU scheduler for KIQ ring */
> + if (ring->funcs->type != AMDGPU_RING_TYPE_KIQ) {
> + timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
> + if (timeout == 0) {
> + /*
> +  * FIXME:
> +  * Delayed workqueue cannot use it directly,
> +  * so the scheduler will not use delayed workqueue if
> +  * MAX_SCHEDULE_TIMEOUT is set.
> +  * Currently keep it simple and silly.
> +  */
> + timeout = MAX_SCHEDULE_TIMEOUT;
> + }
> + r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
> +num_hw_submission,
> +timeout, ring->name);
> + if (r) {
> + DRM_ERROR("Failed to create scheduler on ring %s.\n",
> +   ring->name);
> + return r;
> + }
>   }
>   
>   return 0;


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [V2] drm/amdgpu:no gpu scheduler for KIQ

2016-11-02 Thread Huang, Trigger
Sorry, please just ignore this patch; I made a small mistake and will send it 
again soon.

Thanks & Best Wishes,
Trigger Huang

-Original Message-
From: Trigger Huang [mailto:trigger.hu...@amd.com] 
Sent: Wednesday, November 02, 2016 7:26 PM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger <trigger.hu...@amd.com>; Liu, Monk <monk@amd.com>; Yu, 
Xiangliang <xiangliang...@amd.com>
Subject: [V2] drm/amdgpu:no gpu scheduler for KIQ

KIQ is used for interaction between the driver and CP, and is not exposed to 
outside clients; as such it doesn't need to be handled by the GPU scheduler.

Signed-off-by: Monk Liu <monk@amd.com>
Signed-off-by: Xiangliang Yu <xiangliang...@amd.com>
Signed-off-by: Trigger Huang <trigger.hu...@amd.com>

Changes in v2:
 - According to Alex's suggestion, wrapping the scheduler setup
   conditionally instead of returning early.
 - Use another simple method to check if is a KIQ ring.

Change-Id: Iba09b6dfb6515e5de5fc48c4e8904e8a0ec5f22b
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 39 +--
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 77b34ec..5772ef2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -382,24 +382,27 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
if (!ring->fence_drv.fences)
return -ENOMEM;
 
-   timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
-   if (timeout == 0) {
-   /*
-* FIXME:
-* Delayed workqueue cannot use it directly,
-* so the scheduler will not use delayed workqueue if
-* MAX_SCHEDULE_TIMEOUT is set.
-* Currently keep it simple and silly.
-*/
-   timeout = MAX_SCHEDULE_TIMEOUT;
-   }
-   r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
-  num_hw_submission,
-  timeout, ring->name);
-   if (r) {
-   DRM_ERROR("Failed to create scheduler on ring %s.\n",
- ring->name);
-   return r;
+   /* No need to setup the GPU scheduler for KIQ ring */
+   if (ring->funcs->type != AMDGPU_RING_TYPE_KIQ) {
+   timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
+   if (timeout == 0) {
+   /*
+* FIXME:
+* Delayed workqueue cannot use it directly,
+* so the scheduler will not use delayed workqueue if
+* MAX_SCHEDULE_TIMEOUT is set.
+* Currently keep it simple and silly.
+*/
+   timeout = MAX_SCHEDULE_TIMEOUT;
+   }
+   r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
+  num_hw_submission,
+  timeout, ring->name);
+   if (r) {
+   DRM_ERROR("Failed to create scheduler on ring %s.\n",
+ ring->name);
+   return r;
+   }
}
 
return 0;
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx