RE: [PATCH 10/11] drm/amdgpu: add gang submit frontend v4
[AMD Official Use Only - General]

Before we finally add the gang submission frontend, is there any interface/flag for the user mode driver to detect whether gang submission is supported by the kernel?

Regards,
Trigger

-----Original Message-----
From: amd-gfx On Behalf Of Christian König
Sent: August 29, 2022 21:18
To: Dong, Ruijing ; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian
Subject: [PATCH 10/11] drm/amdgpu: add gang submit frontend v4

Allows submitting jobs as a gang which needs to run on multiple engines at the same time.

All members of the gang get the same implicit, explicit and VM dependencies. So no gang member will start running until everything else is ready.

The last job is considered the gang leader (usually a submission to the GFX ring) and is used for signaling output dependencies.

Each job is remembered individually as user of a buffer object, so there is no joining of work at the end.

v2: rebase and fix review comments from Andrey and Yogesh
v3: use READ instead of BOOKKEEP for now because of VM unmaps, set gang leader only when necessary
v4: fix order of pushing jobs and adding fences found by Trigger.

Signed-off-by: Christian König
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c    | 259 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h    |  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  12 +-
 3 files changed, 184 insertions(+), 97 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 9821299dfb49..a6e50ad5e306 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -69,6 +69,7 @@ static int amdgpu_cs_p1_ib(struct amdgpu_cs_parser *p,
 			   unsigned int *num_ibs)
 {
 	struct drm_sched_entity *entity;
+	unsigned int i;
 	int r;

 	r = amdgpu_ctx_get_entity(p->ctx, chunk_ib->ip_type,
@@ -77,17 +78,28 @@ static int amdgpu_cs_p1_ib(struct amdgpu_cs_parser *p,
 	if (r)
 		return r;

-	/* Abort if there is no run queue associated with this entity.
-	 * Possibly because of disabled HW IP*/
+	/*
+	 * Abort if there is no run queue associated with this entity.
+	 * Possibly because of disabled HW IP.
+	 */
 	if (entity->rq == NULL)
 		return -EINVAL;

-	/* Currently we don't support submitting to multiple entities */
-	if (p->entity && p->entity != entity)
+	/* Check if we can add this IB to some existing job */
+	for (i = 0; i < p->gang_size; ++i) {
+		if (p->entities[i] == entity)
+			goto found;
+	}
+
+	/* If not increase the gang size if possible */
+	if (i == AMDGPU_CS_GANG_SIZE)
 		return -EINVAL;

-	p->entity = entity;
-	++(*num_ibs);
+	p->entities[i] = entity;
+	p->gang_size = i + 1;
+
+found:
+	++(num_ibs[i]);
 	return 0;
 }

@@ -161,11 +173,12 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
 			   union drm_amdgpu_cs *cs)
 {
 	struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
+	unsigned int num_ibs[AMDGPU_CS_GANG_SIZE] = { };
 	struct amdgpu_vm *vm = &fpriv->vm;
 	uint64_t *chunk_array_user;
 	uint64_t *chunk_array;
-	unsigned size, num_ibs = 0;
 	uint32_t uf_offset = 0;
+	unsigned int size;
 	int ret;
 	int i;

@@ -231,7 +244,7 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
 			if (size < sizeof(struct drm_amdgpu_cs_chunk_ib))
 				goto free_partial_kdata;

-			ret = amdgpu_cs_p1_ib(p, p->chunks[i].kdata, &num_ibs);
+			ret = amdgpu_cs_p1_ib(p, p->chunks[i].kdata, num_ibs);
 			if (ret)
 				goto free_partial_kdata;
 			break;
@@ -268,21 +281,28 @@ static int amdgpu_cs_pass1(struct amdgpu_cs_parser *p,
 		}
 	}

-	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
-	if (ret)
-		goto free_all_kdata;
+	if (!p->gang_size)
+		return -EINVAL;

-	ret = drm_sched_job_init(&p->job->base, p->entity, &fpriv->vm);
-	if (ret)
-		goto free_all_kdata;
+	for (i = 0; i < p->gang_size; ++i) {
+		ret = amdgpu_job_alloc(p->adev, num_ibs[i], &p->jobs[i], vm);
+		if (ret)
+			goto free_all_kdata;
+
+		ret = drm_sched_job_init(&p->jobs[i]->base, p->entities[i],
+					 &fpriv->vm);
+		if (ret)
+			goto free_all_kdata;
+	}
+	p->gang_leader = p->jobs[p->gang_size - 1];

-	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
+	if (p->ctx->vram_lost_counter != p->gang_leader->vram_lost_counter) {
 		ret = -ECANCELED;
 		goto free_all_kdata;
 	}
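The core of the amdgpu_cs_p1_ib() hunk above is the find-or-grow bookkeeping over p->entities[]/p->gang_size. A minimal userspace model of just that logic (not kernel code; GANG_SIZE and the struct names here are illustrative stand-ins for AMDGPU_CS_GANG_SIZE and the real parser state):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the gang bookkeeping in amdgpu_cs_p1_ib():
 * each IB either joins the job of an entity already in the gang, or
 * grows the gang by one member, up to a fixed maximum. */
#define GANG_SIZE 4 /* stand-in for AMDGPU_CS_GANG_SIZE */

struct parser {
	const void *entities[GANG_SIZE]; /* one scheduler entity per member */
	unsigned int gang_size;
	unsigned int num_ibs[GANG_SIZE]; /* IB count per gang member */
};

/* Returns 0 on success, -1 (standing in for -EINVAL) when full. */
static int add_ib(struct parser *p, const void *entity)
{
	unsigned int i;

	/* Check if this IB can join an existing job */
	for (i = 0; i < p->gang_size; ++i)
		if (p->entities[i] == entity)
			goto found;

	/* If not, grow the gang if possible */
	if (i == GANG_SIZE)
		return -1;

	p->entities[i] = entity;
	p->gang_size = i + 1;
found:
	++p->num_ibs[i];
	return 0;
}
```

Note how two IBs targeting the same entity end up counted against one gang member, while a new entity extends the gang, matching the per-job num_ibs[] array the patch introduces.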
RE: [PATCH] drm/amdgpu/sriov: fix Tonga load driver failed
Reviewed-by: Trigger Huang

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: amd-gfx On Behalf Of Jack Zhang
Sent: Thursday, June 20, 2019 2:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian)
Subject: [PATCH] drm/amdgpu/sriov: fix Tonga load driver failed

Tonga SR-IOV needs to use the SMU to load firmware. Remove the SR-IOV early return because the default return value is zero.

Signed-off-by: Jack Zhang
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
index 093cefd..469cda9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
@@ -2755,8 +2755,6 @@ int amdgpu_pm_load_smu_firmware(struct amdgpu_device *adev, uint32_t *smu_versio
 {
 	int r;

-	if (amdgpu_sriov_vf(adev))
-		return 0;
 	if (adev->powerplay.pp_funcs && adev->powerplay.pp_funcs->load_firmware) {
 		r = adev->powerplay.pp_funcs->load_firmware(adev->powerplay.pp_handle);
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu/sriov: Correct some register program method
Reviewed-by: Trigger Huang

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: amd-gfx On Behalf Of Emily Deng
Sent: Friday, May 31, 2019 2:37 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily
Subject: [PATCH] drm/amdgpu/sriov: Correct some register program method

For the VF, some registers can only be programmed through the RLC.

Signed-off-by: Emily Deng
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c    | 10 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index cc5a382..2e9cac1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -1927,17 +1927,17 @@ static void gfx_v9_0_constants_init(struct amdgpu_device *adev)
 		if (i == 0) {
 			tmp = REG_SET_FIELD(0, SH_MEM_CONFIG, ALIGNMENT_MODE,
 					    SH_MEM_ALIGNMENT_MODE_UNALIGNED);
-			WREG32_SOC15(GC, 0, mmSH_MEM_CONFIG, tmp);
-			WREG32_SOC15(GC, 0, mmSH_MEM_BASES, 0);
+			WREG32_SOC15_RLC(GC, 0, mmSH_MEM_CONFIG, tmp);
+			WREG32_SOC15_RLC(GC, 0, mmSH_MEM_BASES, 0);
 		} else {
 			tmp = REG_SET_FIELD(0, SH_MEM_CONFIG, ALIGNMENT_MODE,
 					    SH_MEM_ALIGNMENT_MODE_UNALIGNED);
-			WREG32_SOC15(GC, 0, mmSH_MEM_CONFIG, tmp);
+			WREG32_SOC15_RLC(GC, 0, mmSH_MEM_CONFIG, tmp);
 			tmp = REG_SET_FIELD(0, SH_MEM_BASES, PRIVATE_BASE,
 					    (adev->gmc.private_aperture_start >> 48));
 			tmp = REG_SET_FIELD(tmp, SH_MEM_BASES, SHARED_BASE,
 					    (adev->gmc.shared_aperture_start >> 48));
-			WREG32_SOC15(GC, 0, mmSH_MEM_BASES, tmp);
+			WREG32_SOC15_RLC(GC, 0, mmSH_MEM_BASES, tmp);
 		}
 	}
 	soc15_grbm_select(adev, 0, 0, 0, 0);
@@ -3046,7 +3046,7 @@ static int gfx_v9_0_kiq_init_register(struct amdgpu_ring *ring)
 					(adev->doorbell_index.userqueue_end * 2) << 2);
 	}

-	WREG32_SOC15(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
+	WREG32_SOC15_RLC(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
 	       mqd->cp_hqd_pq_doorbell_control);

 	/* reset read and write pointers, similar to CP_RB0_WPTR/_RPTR */
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index 0dc8926..9f0f189 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -146,12 +146,12 @@ static void gfxhub_v1_0_init_cache_regs(struct amdgpu_device *adev)
 	tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 0);
 	tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, CONTEXT1_IDENTITY_ACCESS_MODE, 1);
 	tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, IDENTITY_MODE_FRAGMENT_SIZE, 0);
-	WREG32_SOC15(GC, 0, mmVM_L2_CNTL, tmp);
+	WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL, tmp);

 	tmp = RREG32_SOC15(GC, 0, mmVM_L2_CNTL2);
 	tmp = REG_SET_FIELD(tmp, VM_L2_CNTL2, INVALIDATE_ALL_L1_TLBS, 1);
 	tmp = REG_SET_FIELD(tmp, VM_L2_CNTL2, INVALIDATE_L2_CACHE, 1);
-	WREG32_SOC15(GC, 0, mmVM_L2_CNTL2, tmp);
+	WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL2, tmp);

 	tmp = mmVM_L2_CNTL3_DEFAULT;
 	if (adev->gmc.translate_further) {
@@ -163,12 +163,12 @@ static void gfxhub_v1_0_init_cache_regs(struct amdgpu_device *adev)
 		tmp = REG_SET_FIELD(tmp, VM_L2_CNTL3,
 				    L2_CACHE_BIGK_FRAGMENT_SIZE, 6);
 	}
-	WREG32_SOC15(GC, 0, mmVM_L2_CNTL3, tmp);
+	WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL3, tmp);

 	tmp = mmVM_L2_CNTL4_DEFAULT;
 	tmp = REG_SET_FIELD(tmp, VM_L2_CNTL4, VMC_TAP_PDE_REQUEST_PHYSICAL, 0);
 	tmp = REG_SET_FIELD(tmp, VM_L2_CNTL4, VMC_TAP_PTE_REQUEST_PHYSICAL, 0);
-	WREG32_SOC15(GC, 0, mmVM_L2_CNTL4, tmp);
+	WREG32_SOC15_RLC(GC, 0, mmVM_L2_CNTL4, tmp);
 }

 static void gfxhub_v1_0_enable_system_domain(struct amdgpu_device *adev)
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
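The patch above swaps WREG32_SOC15 for WREG32_SOC15_RLC because a VF cannot write these registers directly. A toy model of that dispatch (the rlc_write path, register file and counters are illustrative, not the real driver interfaces):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the WREG32_SOC15 vs. WREG32_SOC15_RLC distinction: on an
 * SR-IOV VF certain registers must be programmed via the RLC, so the
 * _RLC variant routes the write through an RLC request instead of a
 * direct MMIO store. */
#define NUM_REGS 16

struct device {
	bool sriov_vf;
	uint32_t mmio[NUM_REGS]; /* pretend register file */
	unsigned int rlc_writes; /* how many writes went via the RLC */
};

static void rlc_write(struct device *adev, unsigned int reg, uint32_t v)
{
	adev->rlc_writes++; /* real driver: hand the write to RLC firmware */
	adev->mmio[reg] = v;
}

/* Counterpart of WREG32_SOC15_RLC: pick the path by SR-IOV mode. */
static void wreg32_rlc(struct device *adev, unsigned int reg, uint32_t v)
{
	if (adev->sriov_vf)
		rlc_write(adev, reg, v);
	else
		adev->mmio[reg] = v; /* direct MMIO on bare metal */
}
```

Either path lands the same value in the register; only the route differs, which is why converting call sites is safe on bare metal.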
RE: [PATCH 0/2] Skip IH re-route on Vega SR-IOV
OK, thanks for the detailed background; before this I didn't know about the limitation in the hardware.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Christian König
Sent: Tuesday, May 07, 2019 5:04 PM
To: Huang, Trigger ; Koenig, Christian ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 0/2] Skip IH re-route on Vega SR-IOV

[CAUTION: External Email]

Hi Trigger,

> And see this interrupt is still from IH0 amdgpu_irq_handler, which can prove this feature is not working under SR-IOV.

In this case this change is a clear NAK.

> I suggest to remove this feature from SR-IOV, as my concern is, some weird bugs may be caused by it in the Virtualization heavy stress test.

And I really think we should keep it to make sure that we have the same handling for bare metal as for SRIOV.

> In the future, maybe we can request PSP team to add this support for SR-IOV.

We will never be able to use this under SRIOV because of limitation in the hardware.

What we could maybe do is check the response code from the PSP firmware if it correctly ignored the commands under SR-IOV, but I think the response code is the same for ignoring as for executing the commands.

Regards,
Christian.

Am 07.05.19 um 10:54 schrieb Huang, Trigger:
> Hi Christian,
>
> On Vega10 SR-IOV VF, I injected a 'real' VMC page fault from user space, using the modified amdgpu_test.
> [ 19.127874] amdgpu :00:08.0: [gfxhub] no-retry page fault (src_id:0 ring:174 vmid:1 pasid:32768, for process amdgpu_test pid 1071 thread amdgpu_test pid 1071)
> [ 19.130037] amdgpu :00:08.0: in page starting at address 0x0008 from 27
>
> And see this interrupt is still from IH0 amdgpu_irq_handler, which can prove this feature is not working under SR-IOV.
>
> I suggest to remove this feature from SR-IOV, as my concern is, some weird bugs may be caused by it in the Virtualization heavy stress test.
> In the future, maybe we can request PSP team to add this support for SR-IOV.
>
> Thanks & Best Wishes,
> Trigger Huang
>
> -----Original Message-----
> From: Christian König
> Sent: Tuesday, May 07, 2019 3:37 PM
> To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 0/2] Skip IH re-route on Vega SR-IOV
>
> [CAUTION: External Email]
>
> We intentionally didn't do this to make sure that the commands are ignored by the PSP firmware.
>
> I have no strong opinion on if we should do this or not, but the PSP firmware guys might have.
>
> Christian.
>
> Am 07.05.19 um 06:08 schrieb Trigger Huang:
>> IH re-route is not supported on Vega SR-IOV, need to be skipped
>>
>> Trigger Huang (2):
>>   drm/amdgpu: Skip IH reroute in Vega10 SR-IOV VF
>>   drm/amdgpu: Skip IH reroute in Vega20 SR-IOV VF
>>
>>  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 4 
>>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  | 4 
>>  2 files changed, 8 insertions(+)
>>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH 0/2] Skip IH re-route on Vega SR-IOV
Hi Christian,

On Vega10 SR-IOV VF, I injected a 'real' VMC page fault from user space, using the modified amdgpu_test.
[ 19.127874] amdgpu :00:08.0: [gfxhub] no-retry page fault (src_id:0 ring:174 vmid:1 pasid:32768, for process amdgpu_test pid 1071 thread amdgpu_test pid 1071)
[ 19.130037] amdgpu :00:08.0: in page starting at address 0x0008 from 27

And see this interrupt is still from IH0 amdgpu_irq_handler, which can prove this feature is not working under SR-IOV.

I suggest to remove this feature from SR-IOV, as my concern is that some weird bugs may be caused by it in the Virtualization heavy stress test.
In the future, maybe we can request the PSP team to add this support for SR-IOV.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Christian König
Sent: Tuesday, May 07, 2019 3:37 PM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 0/2] Skip IH re-route on Vega SR-IOV

[CAUTION: External Email]

We intentionally didn't do this to make sure that the commands are ignored by the PSP firmware.

I have no strong opinion on if we should do this or not, but the PSP firmware guys might have.

Christian.

Am 07.05.19 um 06:08 schrieb Trigger Huang:
> IH re-route is not supported on Vega SR-IOV, need to be skipped
>
> Trigger Huang (2):
>   drm/amdgpu: Skip IH reroute in Vega10 SR-IOV VF
>   drm/amdgpu: Skip IH reroute in Vega20 SR-IOV VF
>
>  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 4 
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  | 4 
>  2 files changed, 8 insertions(+)
>

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM
Thanks for Alex's comments.

Yes, they are only in the SR-IOV HW initialization path of UVD/VCE.

Thanks & Best Wishes,
Trigger Huang

From: Deucher, Alexander
Sent: Monday, May 06, 2019 10:52 PM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

As long as this doesn't break bare metal, I'm ok with it.

Acked-by: Alex Deucher <alexander.deuc...@amd.com>

From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> on behalf of Trigger Huang <trigger.hu...@amd.com>
Sent: Thursday, May 2, 2019 8:56 AM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger
Subject: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

[CAUTION: External Email]

On a Vega10 SR-IOV VF, the FW address returned by PSP should be set into the init table, not the original BO MC address; otherwise, the UVD and VCE IB tests will fail under Vega10 SR-IOV.

reference:
commit bfcea5204287 ("drm/amdgpu:change VEGA booting with firmware loaded by PSP")
commit aa5873dca463 ("drm/amdgpu: Change VCE booting with firmware loaded by PSP")

Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c | 16 ++--
 drivers/gpu/drm/amd/amdgpu/vce_v4_0.c | 17 +++--
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
index dc461df..2191d3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
@@ -787,10 +787,13 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device *adev)
 					    0x, 0x0004);
 		/* mc resume*/
 		if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
-			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
-				lower_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
-			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
-				upper_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+				mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
+				adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_lo);
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+				mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
+				adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_hi);
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, mmUVD_VCPU_CACHE_OFFSET0), 0);
 			offset = 0;
 		} else {
 			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
@@ -798,10 +801,11 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device *adev)
 			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
 				upper_32_bits(adev->uvd.inst[i].gpu_addr));
 			offset = size;
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, mmUVD_VCPU_CACHE_OFFSET0),
+				AMDGPU_UVD_FIRMWARE_OFFSET >> 3);
+
 		}

-		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_VCPU_CACHE_OFFSET0),
-			AMDGPU_UVD_FIRMWARE_OFFSET >> 3);
 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_VCPU_CACHE_SIZE0), size);

 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE1_64BIT_BAR_LOW),
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
index f3f5938..c0ec279 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
@@ -244,13 +244,18 @@ static int vce_v4_0_sriov_start(struct amdgpu_device *adev)
 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, mmVCE_LMI_SWAP_CNTL1), 0);
 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, mmVCE_LMI_VM_CTRL), 0);

+		offset = AMDGPU_VCE_FIRMWARE_OFFSET;
 		if
RE: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM
Ping again.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Trigger Huang
Sent: Thursday, May 02, 2019 8:57 PM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger
Subject: [PATCH] drm/amdgpu: Use FW addr returned by PSP for VF MM

On a Vega10 SR-IOV VF, the FW address returned by PSP should be set into the init table, not the original BO MC address; otherwise, the UVD and VCE IB tests will fail under Vega10 SR-IOV.

reference:
commit bfcea5204287 ("drm/amdgpu:change VEGA booting with firmware loaded by PSP")
commit aa5873dca463 ("drm/amdgpu: Change VCE booting with firmware loaded by PSP")

Signed-off-by: Trigger Huang
---
 drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c | 16 ++--
 drivers/gpu/drm/amd/amdgpu/vce_v4_0.c | 17 +++--
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
index dc461df..2191d3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c
@@ -787,10 +787,13 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device *adev)
 					    0x, 0x0004);
 		/* mc resume*/
 		if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
-			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
-				lower_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
-			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
-				upper_32_bits(adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].mc_addr));
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+				mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
+				adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_lo);
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i,
+				mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
+				adev->firmware.ucode[AMDGPU_UCODE_ID_UVD].tmr_mc_addr_hi);
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, mmUVD_VCPU_CACHE_OFFSET0), 0);
 			offset = 0;
 		} else {
 			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
@@ -798,10 +801,11 @@ static int uvd_v7_0_sriov_start(struct amdgpu_device *adev)
 			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
 				upper_32_bits(adev->uvd.inst[i].gpu_addr));
 			offset = size;
+			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, 0, mmUVD_VCPU_CACHE_OFFSET0),
+				AMDGPU_UVD_FIRMWARE_OFFSET >> 3);
+
 		}

-		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_VCPU_CACHE_OFFSET0),
-			AMDGPU_UVD_FIRMWARE_OFFSET >> 3);
 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_VCPU_CACHE_SIZE0), size);

 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(UVD, i, mmUVD_LMI_VCPU_CACHE1_64BIT_BAR_LOW),
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
index f3f5938..c0ec279 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
@@ -244,13 +244,18 @@ static int vce_v4_0_sriov_start(struct amdgpu_device *adev)
 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, mmVCE_LMI_SWAP_CNTL1), 0);
 		MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0, mmVCE_LMI_VM_CTRL), 0);

+		offset = AMDGPU_VCE_FIRMWARE_OFFSET;
 		if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
+			uint32_t low = adev->firmware.ucode[AMDGPU_UCODE_ID_VCE].tmr_mc_addr_lo;
+			uint32_t hi = adev->firmware.ucode[AMDGPU_UCODE_ID_VCE].tmr_mc_addr_hi;
+			uint64_t tmr_mc_addr = (uint64_t)(hi) << 32 | low;
+
 			MMSCH_V1_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCE, 0,
-				mmVCE_LMI_VCPU_CACHE_40BIT_BAR0),
-
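The VCE hunk in the patch rebuilds a 64-bit TMR MC address from the lo/hi 32-bit halves that PSP reports. A minimal sketch of that assembly, together with the matching split helpers (names mirror the kernel's lower_32_bits()/upper_32_bits(), but this is standalone illustration, not driver code):

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a 64-bit address from the two 32-bit halves returned by PSP. */
static uint64_t tmr_mc_addr(uint32_t lo, uint32_t hi)
{
	/* hi must be widened to 64 bits BEFORE shifting, otherwise the
	 * shift would be performed in 32-bit arithmetic and lose bits */
	return (uint64_t)hi << 32 | lo;
}

/* The inverse split, as used when programming the BAR_LOW/BAR_HIGH pair. */
static uint32_t lower_32_bits(uint64_t v) { return (uint32_t)v; }
static uint32_t upper_32_bits(uint64_t v) { return (uint32_t)(v >> 32); }
```

Splitting and reassembling must round-trip exactly; the bug the patch fixes is using the wrong source address (the BO MC address instead of the PSP-returned TMR address), not the arithmetic itself.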
RE: [PATCH] drm/amdgpu: Add IDH_QUERY_ALIVE event for SR-IOV
Hi Guys,

Would you help on this patch?

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: amd-gfx On Behalf Of Trigger Huang
Sent: Tuesday, April 30, 2019 4:30 PM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger
Subject: [PATCH] drm/amdgpu: Add IDH_QUERY_ALIVE event for SR-IOV

[CAUTION: External Email]

The SR-IOV host side will send IDH_QUERY_ALIVE to a guest VM to check whether that guest VM is still alive (not destroyed). The only thing the guest KMD needs to do is send an ACK back to the host.

Signed-off-by: Trigger Huang
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 3 +++
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 8dbad49..2471e7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -372,6 +372,9 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
 		if (amdgpu_sriov_runtime(adev))
 			schedule_work(&adev->virt.flr_work);
 		break;
+	case IDH_QUERY_ALIVE:
+		xgpu_ai_mailbox_send_ack(adev);
+		break;
 	/* READY_TO_ACCESS_GPU is fetched by kernel polling, IRQ can ignore
 	 * it byfar since that polling thread will handle it,
 	 * other msg like flr complete is not handled here.
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
index 39d151b..077e91a 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
@@ -49,6 +49,7 @@ enum idh_event {
 	IDH_FLR_NOTIFICATION_CMPL,
 	IDH_SUCCESS,
 	IDH_FAIL,
+	IDH_QUERY_ALIVE,
 	IDH_EVENT_MAX
 };
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
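The patch is a one-case addition to the guest-side mailbox dispatch: on IDH_QUERY_ALIVE the VF's only obligation is to ACK. A standalone sketch of that pattern (only the enum tail shown in the patch is reproduced; handle_event() and the counters are illustrative stand-ins for xgpu_ai_mailbox_rcv_irq()/xgpu_ai_mailbox_send_ack(), not the real driver code):

```c
#include <assert.h>

/* Enum tail as in mxgpu_ai.h; preceding events elided. The new value is
 * appended just before IDH_EVENT_MAX, keeping existing values stable. */
enum idh_event {
	IDH_FLR_NOTIFICATION_CMPL = 0,
	IDH_SUCCESS,
	IDH_FAIL,
	IDH_QUERY_ALIVE, /* host asks: is this guest VM still alive? */
	IDH_EVENT_MAX
};

struct vf_state {
	unsigned int acks_sent;
	unsigned int flr_scheduled;
};

static void handle_event(struct vf_state *vf, enum idh_event ev)
{
	switch (ev) {
	case IDH_FLR_NOTIFICATION_CMPL:
		vf->flr_scheduled++; /* real code: schedule FLR work */
		break;
	case IDH_QUERY_ALIVE:
		vf->acks_sent++; /* real code: xgpu_ai_mailbox_send_ack(adev) */
		break;
	default:
		break; /* other events are handled elsewhere, e.g. by polling */
	}
}
```

Appending the new event before IDH_EVENT_MAX matters because the event numbers form the host/guest mailbox protocol; renumbering existing entries would break compatibility.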
RE: [PATCH] drm/amdgpu: Rearm IRQ in Vega10 SR-IOV if IRQ lost
Hi Christian,

Yes, it is the HW guys who suggested to rearm it.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Christian König
Sent: Wednesday, May 01, 2019 1:10 AM
To: Huang, Trigger ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Rearm IRQ in Vega10 SR-IOV if IRQ lost

[CAUTION: External Email]

Am 30.04.19 um 17:14 schrieb Trigger Huang:
> In Multi-VFs stress test, sometimes we see IRQ lost when running
> benchmark, just rearm it.

Well I think I have seen that on bare metal as well, it would certainly explain some very odd behavior I've got from the IH block.

Have you pinged the hw guys about that already?

> Signed-off-by: Trigger Huang

Acked-by: Christian König

> ---
>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 37 +-
>  1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> index 1b2f69a..8d89ab7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> @@ -31,7 +31,7 @@
>   #include "soc15_common.h"
>   #include "vega10_ih.h"
>
> -
> +#define MAX_REARM_RETRY 10
>
>   static void vega10_ih_set_interrupt_funcs(struct amdgpu_device *adev);
>
> @@ -382,6 +382,38 @@ static void vega10_ih_decode_iv(struct amdgpu_device *adev,
>   }
>
>   /**
> + * vega10_ih_irq_rearm - rearm IRQ if lost
> + *
> + * @adev: amdgpu_device pointer
> + *
> + */
> +static void vega10_ih_irq_rearm(struct amdgpu_device *adev,
> +				struct amdgpu_ih_ring *ih)
> +{
> +	uint32_t reg_rptr = 0;
> +	uint32_t v = 0;
> +	uint32_t i = 0;
> +
> +	if (ih == &adev->irq.ih)
> +		reg_rptr = SOC15_REG_OFFSET(OSSSYS, 0, mmIH_RB_RPTR);
> +	else if (ih == &adev->irq.ih1)
> +		reg_rptr = SOC15_REG_OFFSET(OSSSYS, 0, mmIH_RB_RPTR_RING1);
> +	else if (ih == &adev->irq.ih2)
> +		reg_rptr = SOC15_REG_OFFSET(OSSSYS, 0, mmIH_RB_RPTR_RING2);
> +	else
> +		return;
> +
> +	/* Rearm IRQ / re-write doorbell if doorbell write is lost */
> +	for (i = 0; i < MAX_REARM_RETRY; i++) {
> +		v = RREG32_NO_KIQ(reg_rptr);
> +		if ((v < ih->ring_size) && (v != ih->rptr))
> +			WDOORBELL32(ih->doorbell_index, ih->rptr);
> +		else
> +			break;
> +	}
> +}
> +
> +/**
>    * vega10_ih_set_rptr - set the IH ring buffer rptr
>    *
>    * @adev: amdgpu_device pointer
> @@ -395,6 +427,9 @@ static void vega10_ih_set_rptr(struct amdgpu_device *adev,
>   		/* XXX check if swapping is necessary on BE */
>   		*ih->rptr_cpu = ih->rptr;
>   		WDOORBELL32(ih->doorbell_index, ih->rptr);
> +
> +		if (amdgpu_sriov_vf(adev))
> +			vega10_ih_irq_rearm(adev, ih);
>   	} else if (ih == &adev->irq.ih) {
>   		WREG32_SOC15(OSSSYS, 0, mmIH_RB_RPTR, ih->rptr);
>   	} else if (ih == &adev->irq.ih1) {

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
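The retry loop in vega10_ih_irq_rearm() works because a lost doorbell write leaves the IH_RB_RPTR register holding a stale but in-range value different from the software rptr. A userspace model of that loop, with a fake "hardware" that drops the first N doorbell writes to exercise the retry (all names and the drop mechanism are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_REARM_RETRY 10

struct fake_ih {
	uint32_t ring_size;
	uint32_t rptr;          /* software read pointer */
	uint32_t hw_rptr_reg;   /* what the IH_RB_RPTR register reports */
	unsigned int drop_next; /* simulate this many lost doorbell writes */
};

static void wdoorbell(struct fake_ih *ih, uint32_t v)
{
	if (ih->drop_next) {
		ih->drop_next--; /* doorbell write lost in transit */
		return;
	}
	ih->hw_rptr_reg = v;
}

static void irq_rearm(struct fake_ih *ih)
{
	for (unsigned int i = 0; i < MAX_REARM_RETRY; i++) {
		uint32_t v = ih->hw_rptr_reg;

		/* stale but valid value => the doorbell write never arrived,
		 * so re-write it; once the register matches rptr, stop */
		if (v < ih->ring_size && v != ih->rptr)
			wdoorbell(ih, ih->rptr);
		else
			break;
	}
}
```

Bounding the loop with MAX_REARM_RETRY keeps the fix from spinning forever if the register never catches up, at the cost of possibly giving up after ten attempts.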
RE: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
Thanks, OK, I got it. That makes it clearer.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Christian König
Sent: Tuesday, April 30, 2019 6:32 PM
To: Huang, Trigger ; Koenig, Christian ; Kuehling, Felix ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path

[CAUTION: External Email]

Am 30.04.19 um 12:28 schrieb Huang, Trigger:
> Hi Christian,
>
> Thanks for the quick response.
>
> One thing I want to confirm, per my understanding that CSA is only used by CP preemption for KCQ and KGQ path, but in KFD user mode queue path, can we delete CSA?
> If yes, then maybe we can split this topic into two patches
> 1, the original one that unmap CSA in KFD path is still needed

I won't go down that route anyway, we should probably move the CSA mapping into the CTX handling completely.

> 2, Add new patch to use new method for VM checking.

That should actually be the first patch, cause this checking is certainly incorrect.

Christian.

>
> Thanks & Best Wishes,
> Trigger Huang
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Tuesday, April 30, 2019 5:59 PM
> To: Huang, Trigger ; Kuehling, Felix ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
>
> Am 30.04.19 um 11:53 schrieb Huang, Trigger:
>> Hi Christian & Felix,
>>
>> Thanks for the information.
>>
>> But actually currently CSA is mapped only in amdgpu_driver_open_kms under SR-IOV.
>> So would you give more information about 'Well that is exactly what we already do here'? Where I can find its implementation?
> Well the plan was to move this to when the actual execution context is created. But Rex left the company, could be that the patches for this were never committed.
>
>> On the other hand, I will try the method 'Instead of checking if some page tables are already filled we check if some mapping is already made'
> Yeah, that should certainly work as well. Just check all entries of the root PD in a for loop if any subsequent PDs are allocated and abort if you find one.
>
> Christian.
>
>> Thanks & Best Wishes,
>> Trigger Huang
>>
>> -----Original Message-----
>> From: Christian König
>> Sent: Tuesday, April 30, 2019 5:23 PM
>> To: Kuehling, Felix ; Huang, Trigger ; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
>>
>> [CAUTION: External Email]
>>
>> Well that is exactly what we already do here. The only problem is we do the wrong check amdgpu_vm_make_compute().
>>
>> Instead of checking if some page tables are already filled we check if some mapping is already made.
>>
>> Regards,
>> Christian.
>>
>> Am 30.04.19 um 01:34 schrieb Kuehling, Felix:
>>> I remember a past discussion to change the CSA allocation/mapping scheme to avoid this issue in the first place. Can adding the CSA to the VM be delayed a little to a point after the VM gets converted to a compute VM?
>>> Maybe the first command submission?
>>>
>>> Regards,
>>> Felix
>>>
>>> On 2019-04-28 6:25 a.m., Trigger Huang wrote:
>>>> In amdgpu open path, CSA will be mapped in VM, so when opening KFD, calling amdgpu_vm_make_compute will fail because it found this VM is not a clean VM with some mappings, as a result, it will lead to failed to create process VM object
>>>>
>>>> The fix is try to unmap CSA, and actually CSA is not needed in compute VF world switch
>>>>
>>>> Signed-off-by: Trigger Huang
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c          |  2 +-
>>>>   2 files changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> index 697b8ef..e0bc457 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> @@ -956,6 +956,16 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct kgd_dev *kgd,
>>>>    if (avm->process_info)
>>>>    return -EINVAL;
>>>>
>>>> +/* Delete CSA mapping to make sure this VM is a clean VM before
RE: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
Hi Christian, Thanks for the quick response. One thing I want to confirm, per my understanding that CSA is only used by CP preemption for KCQ and KGQ path, but in KFD user mode queue path, can we delete CSA? If yes, then maybe we can split this topic into two patches 1, the original one that unmap CSA in KFD path is still needed 2, Add new patch to use new method for VM checking. Thanks & Best Wishes, Trigger Huang -Original Message- From: Koenig, Christian Sent: Tuesday, April 30, 2019 5:59 PM To: Huang, Trigger ; Kuehling, Felix ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path Am 30.04.19 um 11:53 schrieb Huang, Trigger: > Hi Christian & Felix, > > Thanks for the information. > > But actually currently CSA is mapped only in amdgpu_driver_open_kms under > SR-IOV. > So would you give more information about ' Well that is exactly what we > already do here'? Where I can find its implementation? Well the plan was to move this to when the actually execution context is created. But Rex left the company, could be that the patches for this where never committed. > > On the other hand, I will try the method 'Instead of checking if some page > tables are already filled we check if some mapping is already made' Yeah, that should certainly work as well. Just check all entries of the root PD in a for loop if any subsequent PDs are allocated and abort if you find one. Christian. > > Thanks & Best Wishes, > Trigger Huang > > -Original Message- > From: Christian König > Sent: Tuesday, April 30, 2019 5:23 PM > To: Kuehling, Felix ; Huang, Trigger > ; amd-gfx@lists.freedesktop.org > Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path > > [CAUTION: External Email] > > Well that is exactly what we already do here. The only problem is we do the > wrong check amdgpu_vm_make_compute(). > > Instead of checking if some page tables are already filled we check if some > mapping is already made. > > Regards, > Christian. 
> > Am 30.04.19 um 01:34 schrieb Kuehling, Felix: >> I remember a past discussion to change the CSA allocation/mapping >> scheme to avoid this issue in the first place. Can adding the CSA to >> the VM be delayed a little to a point after the VM gets converted to a >> compute VM? >> Maybe the first command submission? >> >> Regards, >> Felix >> >> On 2019-04-28 6:25 a.m., Trigger Huang wrote: >>> In the amdgpu open path, CSA will be mapped in the VM, so when opening KFD, >>> calling amdgpu_vm_make_compute will fail because it found this VM is >>> not a clean VM with some mappings; as a result, it will lead to >>> failure to create the process VM object >>> >>> The fix is to unmap the CSA; actually, CSA is not needed in >>> compute VF world switch >>> >>> Signed-off-by: Trigger Huang >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 +- >>> 2 files changed, 11 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> index 697b8ef..e0bc457 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> @@ -956,6 +956,16 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct >>> kgd_dev *kgd, >>> if (avm->process_info) >>> return -EINVAL; >>> >>> +/* Delete CSA mapping to make sure this VM is a clean VM before >>> + * converting VM >>> + */ >>> +if (amdgpu_sriov_vf(adev) && drv_priv->csa_va) { >>> +amdgpu_bo_reserve(adev->virt.csa_obj, true); >>> +amdgpu_vm_bo_rmv(adev, drv_priv->csa_va); >>> +drv_priv->csa_va = NULL; >>> +amdgpu_bo_unreserve(adev->virt.csa_obj); >>> +} >>> + >>> /* Convert VM into a compute VM */ >>> ret = amdgpu_vm_make_compute(adev, avm, pasid); >>> if (ret) >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> index da7b4fe..361c2e5 100644 >>> --- 
a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>> @@ -1069,7 +1069,7 @@ void amdgpu_driver_postclose_kms(struct >>> drm_device *dev, >>> >>> amdgpu_vm_bo_rmv(adev, fpriv->prt_va); >>> >>> -if (amdgpu_sriov_vf(adev)) { >>> +if (amdgpu_sriov_vf(adev) && fpriv->csa_va) { >>> /* TODO: how to handle reserve failure */ >>> BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, true)); >>> amdgpu_vm_bo_rmv(adev, fpriv->csa_va); >> ___ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path
Hi Christian & Felix, Thanks for the information. But actually currently CSA is mapped only in amdgpu_driver_open_kms under SR-IOV. So would you give more information about ' Well that is exactly what we already do here'? Where I can find its implementation? On the other hand, I will try the method 'Instead of checking if some page tables are already filled we check if some mapping is already made' Thanks & Best Wishes, Trigger Huang -Original Message- From: Christian König Sent: Tuesday, April 30, 2019 5:23 PM To: Kuehling, Felix ; Huang, Trigger ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amdgpu: Unmap CSA under SR-IOV in KFD path [CAUTION: External Email] Well that is exactly what we already do here. The only problem is we do the wrong check amdgpu_vm_make_compute(). Instead of checking if some page tables are already filled we check if some mapping is already made. Regards, Christian. Am 30.04.19 um 01:34 schrieb Kuehling, Felix: > I remember a past discussion to change the CSA allocation/mapping > scheme to avoid this issue in the first place. Can adding the CSA to > the VM be delayed a little to a point after the VM gets converted to a > compute VM? > Maybe the first command submission? 
> > Regards, > Felix > > On 2019-04-28 6:25 a.m., Trigger Huang wrote: >> In the amdgpu open path, CSA will be mapped in the VM, so when opening KFD, >> calling amdgpu_vm_make_compute will fail because it found this VM is >> not a clean VM with some mappings; as a result, it will lead to >> failure to create the process VM object >> >> The fix is to unmap the CSA; actually, CSA is not needed in >> compute VF world switch >> >> Signed-off-by: Trigger Huang >> --- >>drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++ >>drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 +- >>2 files changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> index 697b8ef..e0bc457 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> @@ -956,6 +956,16 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct >> kgd_dev *kgd, >> if (avm->process_info) >> return -EINVAL; >> >> +/* Delete CSA mapping to make sure this VM is a clean VM before >> + * converting VM >> + */ >> +if (amdgpu_sriov_vf(adev) && drv_priv->csa_va) { >> +amdgpu_bo_reserve(adev->virt.csa_obj, true); >> +amdgpu_vm_bo_rmv(adev, drv_priv->csa_va); >> +drv_priv->csa_va = NULL; >> +amdgpu_bo_unreserve(adev->virt.csa_obj); >> +} >> + >> /* Convert VM into a compute VM */ >> ret = amdgpu_vm_make_compute(adev, avm, pasid); >> if (ret) >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> index da7b4fe..361c2e5 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> @@ -1069,7 +1069,7 @@ void amdgpu_driver_postclose_kms(struct >> drm_device *dev, >> >> amdgpu_vm_bo_rmv(adev, fpriv->prt_va); >> >> -if (amdgpu_sriov_vf(adev)) { >> +if (amdgpu_sriov_vf(adev) && fpriv->csa_va) { >> /* TODO: how to handle reserve failure */ >> BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, 
true)); >> amdgpu_vm_bo_rmv(adev, fpriv->csa_va); > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu: Add more process info in VM for debug
Yeah, makes sense. Let me give another patch: [PATCH] drm/amdgpu: print process info when job timeout -Original Message- From: Koenig, Christian Sent: Monday, December 17, 2018 9:39 PM To: Liu, Monk ; Huang, Trigger ; amd-gfx@lists.freedesktop.org Cc: Grodzovsky, Andrey ; Qu, Jim Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug Hi guys, We could also print shaders, IBs or textures to figure out what's going wrong, but that would be overkill as well. The log is there to notice that something is wrong, not to be a detailed crash report. The PID is perfectly sufficient to identify the process which triggered an issue, and when the system goes down completely because of this, we have done something wrong in the first place. So that is still a NAK; CPU crash reports don't contain the commandline either. Regards, Christian. Am 17.12.18 um 10:02 schrieb Liu, Monk: > Hi Christian, > > I think some SRIOV customers need this rich information. > Maybe we can use a kernel option to let the user select whether rich or simple > information should be printed upon job TDR? > In the SRIOV branch we can enable it by default while disabling it by default > for drm-next > > /Monk > > -Original Message- > From: amd-gfx On Behalf Of > Huang, Trigger > Sent: Monday, December 17, 2018 4:27 PM > To: Koenig, Christian ; > amd-gfx@lists.freedesktop.org > Cc: Grodzovsky, Andrey ; Qu, Jim > > Subject: RE: [PATCH] drm/amdgpu: Add more process info in VM for debug > > Hi Christian, > > Yes, if the test machine is still there for debugging, we can log in to it and > check a lot of things, such as 'ps -p 1 -o args' as you suggested. > > But sometimes, the system is not alive anymore, and we only get some log > files (such as kern.log) from QA or customers. > And at this time, the full command line information in dmesg is quite useful. 
> Let's take the following message for example: > > [ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring > comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572 > [ 1740.050167] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's > process information is as below: > [ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process SDMA > pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode goldimage_compare > --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor > --n-test-threads 4 When customer/QA reported that there is a comp ring job > timeout when running a big test case, vk-example, but the machine is not > alive anymore and we can't log in for debugging, then we need to re-run the > whole vk-example to reproduce this issue, and this will waste a lot of time. > But if we get the specific sub-case in the kern.log file when the job timeout > happened, then we need only try the specific one, here it is 'SDMA --mode > goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0 --frontend > test_executor --n-test-threads 4'; > maybe several minutes later, this issue is reproduced. > > > Thanks, > Trigger. > > > -Original Message- > From: amd-gfx On Behalf Of > Koenig, Christian > Sent: Monday, December 17, 2018 3:50 PM > To: Huang, Trigger ; > amd-gfx@lists.freedesktop.org > Cc: Grodzovsky, Andrey ; Qu, Jim > > Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug > > Hi Trigger, > >> Does this make sense? > Sorry I wasn't clear enough. The key point that we don't want/need the full > commandline of the process here is that we can already get that from the > information we have. > > E.g. "ps -p 1 -o args" gives you the command line of process 1. > > The only case where directly printing this into the logs is useful is when we > run into a total system crash, and in this case the process is only the > trigger, but not the root cause. > > Regards, > Christian. 
> > Am 17.12.18 um 04:03 schrieb Huang, Trigger: >> Hi Christian, >> >> Many thanks for pointing out the mistakes >> >> I have some comments as below, would you help to check again? >> >> First of all you can't get the process info during VM creating since that >> can happen in X as well. >> [Trigger]: Ok, I will keep the original logic, which is that set the vm info >> in cs . I will still invoke kfree(cmd_line) in amdgpu_vm_fini to avoid >> memory leak. >> >> Second when a timeout happen the VM structure might already be released, so >> using job->vm is illegal here. What we could try is to get the VM using the >> PASID. >> [Trigger]: Ok, I will do it in job timeout like what VMC page fault's >> handler does >> >> And last we
RE: [PATCH] drm/amdgpu: Add more process info in VM for debug
Hi Christian, Many thanks for pointing out the mistakes. I have some comments below; would you help to check again? First of all, you can't get the process info during VM creation since that can happen in X as well. [Trigger]: Ok, I will keep the original logic, which is to set the VM info in CS. I will still invoke kfree(cmd_line) in amdgpu_vm_fini to avoid a memory leak. Second, when a timeout happens the VM structure might already be released, so using job->vm is illegal here. What we could try is to get the VM using the PASID. [Trigger]: Ok, I will do it in job timeout like what the VMC page fault handler does. And last, we don't want to keep the full command line around. [Trigger]: well, actually, the detailed command line is just what we want. For example, there are thousands of sub-cases of one big test case, and for each sub-case, the arguments may also be different. In some corner cases, the test machine is hung after running the big test case for several hours or even several days; it is really painful to wait another several hours to reproduce and debug it. But if we know the last sub-case running on the test machine, then this issue *may* be reproduced by only running the specific sub-case with specific arguments for several rounds, and in this situation it will save us a lot of time for reproducing and debugging. Does this make sense? If not, how about we add a parameter, such as amdgpu_vm_debug_verbose, to turn on/off the cmd line dumping? Thanks, Trigger -Original Message- From: Christian König Sent: Saturday, December 15, 2018 8:23 PM To: Huang, Trigger ; amd-gfx@lists.freedesktop.org Cc: Grodzovsky, Andrey ; Qu, Jim Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug Am 15.12.18 um 09:56 schrieb Trigger Huang: > When debugging VMC page fault and ring hang issues, the detailed > process information is quite helpful, especially when the issue can > only be reproduced after running for a very long time. 
With this > information, only running the specific sub-testcase may also > reproduce the issue, which may save a lot of time for debugging. > > With this patch, the process information is similar to the following. > When a VMC page fault issue happened: > [ 142.978417] amdgpu :00:08.0: [gfxhub] VMC page fault (src_id:0 > ring:171 vmid:2 pasid:32769), [ 142.978542] amdgpu :00:08.0: for process > ocltst pid 1354 thread ocltst pid 1354, args:./ocltst -m oclperf.so -t > OCLPerfDeviceEnqueueEvent, > [ 142.978693] amdgpu :00:08.0: in page starting at address > 0x from 27 > > When a ring hang issue happened: > [ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring > comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572 [ 1740.050167] > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's process information is > as below: > [ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process SDMA > pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode goldimage_compare > --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor > --n-test-threads 4 > > Signed-off-by: Trigger Huang Well NAK, we intentionally didn't do it this way. First of all, you can't get the process info during VM creation since that can happen in X as well. Second, when a timeout happens the VM structure might already be released, so using job->vm is illegal here. What we could try is to get the VM using the PASID. And last, we don't want to keep the full command line around. The only valid addition I can see here is to print the thread info when the timeout happens. Regards, Christian. 
> --- > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 3 --- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++ > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 1 + > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 11 ++- > 5 files changed, 26 insertions(+), 8 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > index 1c49b82..1a2d0c9 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > @@ -235,9 +235,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser > *p, union drm_amdgpu_cs > p->job->uf_addr = uf_offset; > kfree(chunk_array); > > - /* Use this opportunity to fill in task info for the vm */ > - amdgpu_vm_set_task_info(vm); > - > return 0; > > free_all_kdata: > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > index e0af44f..c75ecb3 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > @@ -43,6 +43,14 @@ s
RE: [PATCH V3] drm/scheduler: Fix bad job be re-processed in TDR
Hi, would you help with it? -Original Message- From: amd-gfx On Behalf Of Trigger Huang Sent: Thursday, November 15, 2018 3:30 PM To: amd-gfx@lists.freedesktop.org Cc: Huang, Trigger Subject: [PATCH V3] drm/scheduler: Fix bad job be re-processed in TDR A bad job is the one that triggered TDR (in the current amdgpu implementation, actually all the jobs in the current job queue will be treated as bad jobs). In the recovery process, its fence will be fake signaled and as a result, the work behind will be scheduled to delete it from the mirror list, but if the TDR process is invoked before the work's execution, then this bad job might be processed again, and calling dma_fence_set_error on its fence in the TDR process will lead to a kernel warning trace: [ 143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched] kernel: [ 143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too floppy psmouse 8139cp mii i2c_piix4 pata_acpi [ 143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G OE 4.15.0-20-generic #21-Ubuntu [ 143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 143.033653] Workqueue: events drm_sched_job_timedout [amd_sched] [ 143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched] [ 143.033657] RSP: 0018:a9f880fe7d48 EFLAGS: 00010202 [ 143.033659] RAX: 0007 
RBX: 9b98f2b24c00 RCX: 9b98efef4f08 [ 143.033660] RDX: 9b98f2b27400 RSI: 9b98f2b24c50 RDI: 9b98efef4f18 [ 143.033660] RBP: a9f880fe7d98 R08: 0001 R09: 02b6 [ 143.033661] R10: R11: R12: 9b98efef3430 [ 143.033662] R13: 9b98efef4d80 R14: 9b98efef4e98 R15: 9b98eaf91c00 [ 143.033663] FS: () GS:9b98ffd0() knlGS: [ 143.033664] CS: 0010 DS: ES: CR0: 80050033 [ 143.033665] CR2: 7fc49c96d470 CR3: 1400a005 CR4: 003606e0 [ 143.033669] DR0: DR1: DR2: [ 143.033669] DR3: DR6: fffe0ff0 DR7: 0400 [ 143.033670] Call Trace: [ 143.033744] amdgpu_device_gpu_recover+0x144/0x820 [amdgpu] [ 143.033788] amdgpu_job_timedout+0x9b/0xa0 [amdgpu] [ 143.033791] drm_sched_job_timedout+0xcc/0x150 [amd_sched] [ 143.033795] process_one_work+0x1de/0x410 [ 143.033797] worker_thread+0x32/0x410 [ 143.033799] kthread+0x121/0x140 [ 143.033801] ? process_one_work+0x410/0x410 [ 143.033803] ? kthread_create_worker_on_cpu+0x70/0x70 [ 143.033806] ret_from_fork+0x35/0x40 So just delete the bad job from mirror list directly Changes in v3: - Add a helper function to delete the bad jobs from mirror list and call it directly *before* the job's fence is signaled Changes in v2: - delete the useless list node check - also delete bad jobs in drm_sched_main because: kthread_unpark(ring->sched.thread) will be invoked very early before amdgpu_device_gpu_recover's return, then drm_sched_main will have chance to pick up a new job from the job queue. This new job will be added into the mirror list and processed by amdgpu_job_run, but may not be deleted from the mirror list on time due to the same reason. 
And finally re-processed by drm_sched_job_recovery Signed-off-by: Trigger Huang --- drivers/gpu/drm/scheduler/sched_main.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 18ebbb0..6fedf95 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -60,6 +60,8 @@ static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb); +static void drm_sched_expel_job_unlocked(struct drm_sched_job *s_job); + /** * drm_sched_rq_init - initialize a given run queue struct * @@ -228,7 +230,7 @@ static void drm_sched_job_finish(struct work_struct *work) spin_lock(&sched->job_list_lock); /* remove job from ring_mirror_list */ - list_del(&s_job->node); + list_del_init(&s_
RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR
Hi guys, I have modified the patch as V2 in the other thread; would you help to review it? -Original Message- From: amd-gfx On Behalf Of Koenig, Christian Sent: Monday, November 12, 2018 6:12 PM To: Liu, Monk ; Huang, Trigger ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR Hi guys, yeah that is the same problem I was trying to fix with drm/sched: fix timeout handling v2. But this case is even more complicated because here it is the hypervisor or KFD which triggers the second reset. Going to take another look at this today, Christian. Am 12.11.18 um 10:19 schrieb Liu, Monk: > Hi Christian > > Trigger's patch intends to fix the double signaling on dma_fence from the below > scenario: > > 1) Gpu_recovery(job = some job) invoked by guest TDR ==> the hang > job's fence is set as -ECANCELED and fake signaled; 2) GPU_recovery(job = > NULL) again invoked by hypervisor or KFD, but the job of the above step > hasn't finished its "drm_sched_job_finish" routine, which is kicked by > queue_work(), because the second gpu_recovery() is called > prior to "drm_sched_job_finish" (so the job is not taken out from the > mirror_list, so there will be duplicated dma_fence_set_error on that > job, giving you a warning message) > > /Monk > > -Original Message- > From: amd-gfx On Behalf Of > Huang, Trigger > Sent: Monday, November 12, 2018 4:36 PM > To: Koenig, Christian ; > amd-gfx@lists.freedesktop.org > Subject: RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR > > Hi Christian > > Thanks for the correction before, I gave my explanation as below, would you > help to check again, thanks in advance. 
> > The key thing here is, if a job’s fence is signaled already, then calling > dma_fence_set_error on its fence will lead to a kernel warning call trace > static inline void dma_fence_set_error(struct dma_fence *fence, > int > error) { > WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, > &fence->flags)); # here is warning source > WARN_ON(error >= 0 || error < -MAX_ERRNO); > > fence->error = error; } > > > Then, let’s see a guilty job’s process flow in TDR. > amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in > ring_mirror_list : >1) dma_fence_set_error(&s_fence->finished, > -ECANCELED) if this job is guilty >2) amdgpu_job_run: the guilty job will be > skipped, and not submitted into the ring >3) drm_sched_process_job(guilty_job) -> > drm_sched_fence_finished -> dma_fence_signal(&fence->finished) -> > drm_sched_job_finish_cb -> schedule_work(&job->finish_work); >Later, finish_work’s callback function, > drm_sched_job_finish, will be called by the work queue, and the guilty job will > finally be deleted from ring_mirror_list. > But sometimes, amdgpu_device_gpu_recover will be called (from host FLR > interrupt or KFD) again immediately, and at this moment, finish_work is not > scheduled by the work queue, and drm_sched_job_finish for the guilty job is not > called yet, so this guilty job is still in the ring_mirror_list. But > remember, the guilty job’s fence was signaled before (by > drm_sched_fence_finished); in the following TDR process, dma_fence_set_error > will be called again on this guilty job, which then causes the kernel warning call trace. > > So the root cause is, the finish_work for each job is not scheduled quickly > enough for the GPU TDR scenario. 
> Three ideas: > 1), Instead of using the system global work queue, maybe we can > create a dedicate work queue to manage the job finish things, but still can’t > guarantee its finish_work’s execution in GPU TDR scenario > 2), Wait for all the guilty jobs’ finish_work to be finished > execution before return from drm_sched_job_recovery , but I think it will > waste too much time. > 3), to delete the guilty job from the ring_mirror_list directly > after this job is processed in amdgpu_job_run().amdgpu_job_run is the > final entrance for a job to be submitted into the ring, if something wrong > with this job:(finished->error < 0), then this job should be > never be submitted again, and so should be deleted from the recovery list. > I choose 3). > > > I am going to make a new patch as below instead of the original one > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > index e0
RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR
Hi Christian Thanks for the correction before, I gave my explanation as below, would you help to check again, thanks in advance. The key thing here is, if a job’s fence is signaled already, then calling dma_fence_set_error on its fence will lead to a kernel warning call trace static inline void dma_fence_set_error(struct dma_fence *fence, int error) { WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)); # here is warning source WARN_ON(error >= 0 || error < -MAX_ERRNO); fence->error = error; } Then, let’s see a guilty job’s process flow in TDR. amdgpu_device_gpu_recover -> drm_sched_job_recovery -> for each job in ring_mirror_list : 1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this job is guilty 2) amdgpu_job_run: the guilty job will be skipped, and not submitted into the ring 3) drm_sched_process_job(guilty_job) -> drm_sched_fence_finished -> dma_fence_signal(&fence->finished) -> drm_sched_job_finish_cb -> schedule_work(&job->finish_work); Later, finish_work’s callback function, drm_sched_job_finish, will be called by the work queue, and the guilty job will finally be deleted from ring_mirror_list. But sometimes, amdgpu_device_gpu_recover will be called (from host FLR interrupt or KFD) again immediately, and at this moment, finish_work is not scheduled by the work queue, and drm_sched_job_finish for the guilty job is not called yet, so this guilty job is still in the ring_mirror_list. But remember, the guilty job’s fence was signaled before (by drm_sched_fence_finished); in the following TDR process, dma_fence_set_error will be called again on this guilty job, which then causes the kernel warning call trace. So the root cause is, the finish_work for each job is not scheduled quickly enough for the GPU TDR scenario. 
Three ideas: 1), Instead of using the system global work queue, maybe we can create a dedicated work queue to manage the job finish things, but this still can’t guarantee finish_work’s execution in the GPU TDR scenario 2), Wait for all the guilty jobs’ finish_work to finish execution before returning from drm_sched_job_recovery, but I think it will waste too much time. 3), Delete the guilty job from the ring_mirror_list directly after this job is processed in amdgpu_job_run(). amdgpu_job_run is the final entrance for a job to be submitted into the ring; if something is wrong with this job (finished->error < 0), then this job should never be submitted again, and so should be deleted from the recovery list. I choose 3). I am going to make a new patch as below instead of the original one diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index e0af44f..aaf697f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c @@ -208,6 +208,7 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job, static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job) { struct amdgpu_ring *ring = to_amdgpu_ring(sched_job->sched); + struct drm_gpu_scheduler *sched = sched_job->sched; struct dma_fence *fence = NULL, *finished; struct amdgpu_job *job; int r; @@ -223,7 +224,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job) dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */ if (finished->error < 0) { - DRM_INFO("Skip scheduling IBs!\n"); + DRM_INFO("Skip scheduling IBs and delete job from recovery list!\n"); + spin_lock(&sched->job_list_lock); + list_del_init(&s_job->node); + spin_unlock(&sched->job_list_lock); } else { r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job, &fence); Thanks, Trigger -----Original Message- From: Christian König Sent: Friday, November 09, 2018 8:18 PM To: Huang, Trigger ; amd-gfx@lists.freedesktop.org 
Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR Am 09.11.18 um 13:10 schrieb Trigger Huang: > A bad job is the one that triggered TDR. In the recovery process, its fence > will be signaled and as a result, the work behind will be scheduled to > delete it from the mirror list, but if the TDR process is invoked > before the work's execution, then calling dma_fence_set_error on its > fence in the TDR proc
RE: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini
Hi Alex, Thanks for your help. Thanks & Best Wishes, Trigger Huang -Original Message- From: Alex Deucher [mailto:alexdeuc...@gmail.com] Sent: Thursday, April 20, 2017 3:18 AM To: Huang, Trigger <trigger.hu...@amd.com> Cc: amd-gfx list <amd-gfx@lists.freedesktop.org>; Huang, Ray <ray.hu...@amd.com>; Yu, Xiangliang <xiangliang...@amd.com>; Liu, Monk <monk@amd.com> Subject: Re: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini On Tue, Apr 18, 2017 at 3:04 AM, Trigger Huang <trigger.hu...@amd.com> wrote: > Fix an issue that PSP initialization will fail if the amdgpu module is reloaded. > That's because the PSP ring must be destroyed to be ready for the next > PSP initialization. > > Changes in v2: > - Move psp_ring_destroy before all BOs free according to > Ray Huang's suggestion. > > Signed-off-by: Trigger Huang <trigger.hu...@amd.com> I'm not really a PSP expert, but the change looks logical. Acked-by: Alex Deucher <alexander.deuc...@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 3 +++ > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h | 3 +++ > drivers/gpu/drm/amd/amdgpu/psp_v3_1.c | 27 +++ > drivers/gpu/drm/amd/amdgpu/psp_v3_1.h | 2 ++ > 4 files changed, 35 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > index 19180aa..73ddf4b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > @@ -56,6 +56,7 @@ static int psp_sw_init(void *handle) > psp->prep_cmd_buf = psp_v3_1_prep_cmd_buf; > psp->ring_init = psp_v3_1_ring_init; > psp->ring_create = psp_v3_1_ring_create; > + psp->ring_destroy = psp_v3_1_ring_destroy; > psp->cmd_submit = psp_v3_1_cmd_submit; > psp->compare_sram_data = psp_v3_1_compare_sram_data; > psp->smu_reload_quirk = psp_v3_1_smu_reload_quirk; @@ > -411,6 +412,8 @@ static int psp_hw_fini(void *handle) > struct amdgpu_device *adev = (struct amdgpu_device *)handle; > struct psp_context *psp = &adev->psp; > > + psp_ring_destroy(psp, PSP_RING_TYPE__KM); 
> + > if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) > amdgpu_ucode_fini_bo(adev); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h > index 28faba5..0301e4e 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h > @@ -66,6 +66,8 @@ struct psp_context > struct psp_gfx_cmd_resp *cmd); > int (*ring_init)(struct psp_context *psp, enum psp_ring_type > ring_type); > int (*ring_create)(struct psp_context *psp, enum psp_ring_type > ring_type); > + int (*ring_destroy)(struct psp_context *psp, > + enum psp_ring_type ring_type); > int (*cmd_submit)(struct psp_context *psp, struct > amdgpu_firmware_info *ucode, > uint64_t cmd_buf_mc_addr, uint64_t fence_mc_addr, > int index); > bool (*compare_sram_data)(struct psp_context *psp, @@ -116,6 > +118,7 @@ struct amdgpu_psp_funcs { #define psp_prep_cmd_buf(ucode, > type) (psp)->prep_cmd_buf((ucode), (type)) #define psp_ring_init(psp, > type) (psp)->ring_init((psp), (type)) #define psp_ring_create(psp, > type) (psp)->ring_create((psp), (type)) > +#define psp_ring_destroy(psp, type) ((psp)->ring_destroy((psp), > +(type))) > #define psp_cmd_submit(psp, ucode, cmd_mc, fence_mc, index) \ > (psp)->cmd_submit((psp), (ucode), (cmd_mc), > (fence_mc), (index)) #define psp_compare_sram_data(psp, ucode, type) > \ diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c > b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c > index d351583..e834b55 100644 > --- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c > +++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c > @@ -321,6 +321,33 @@ int psp_v3_1_ring_create(struct psp_context *psp, enum psp_ring_type ring_type) > return ret; > } > > +int psp_v3_1_ring_destroy(struct psp_context *psp, enum psp_ring_type > +ring_type) { > + int ret = 0; > + struct psp_ring *ring; > + unsigned int psp_ring_reg = 0; > + struct amdgpu_device *adev = psp->adev; > + > + ring = &psp->km_ring; > + > + /* Write the ring destroy command to C2PMSG_64 */ > + 
psp_ring_reg = 3 << 16; > + WREG32(SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64), > + psp_ring_reg); > + > + /* there m
RE: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini
Hi Ray,

Thanks for your suggestions. As discussed, freeing adev->firmware.rbuf is still needed. I have made a patch V3 according to your suggestions.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of Huang Rui
Sent: Thursday, April 20, 2017 8:54 AM
To: Huang, Trigger <trigger.hu...@amd.com>
Cc: Yu, Xiangliang <xiangliang...@amd.com>; Liu, Monk <monk@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

On Tue, Apr 18, 2017 at 03:04:45PM +0800, Trigger Huang wrote:
> Fix issue that PSP initialization will fail if reload amdgpu module.
> That's because the PSP ring must be destroyed to be ready for the next
> time PSP initialization.
>
> Changes in v2:
>   - Move psp_ring_destroy before all BOs free according to
>     Ray Huang's suggestion.
>
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c |  3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  3 +++
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c   | 27 +++
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.h   |  2 ++
>  4 files changed, 35 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 19180aa..73ddf4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -56,6 +56,7 @@ static int psp_sw_init(void *handle)
>  	psp->prep_cmd_buf = psp_v3_1_prep_cmd_buf;
>  	psp->ring_init = psp_v3_1_ring_init;
>  	psp->ring_create = psp_v3_1_ring_create;
> +	psp->ring_destroy = psp_v3_1_ring_destroy;
>  	psp->cmd_submit = psp_v3_1_cmd_submit;
>  	psp->compare_sram_data = psp_v3_1_compare_sram_data;
>  	psp->smu_reload_quirk = psp_v3_1_smu_reload_quirk;
> @@ -411,6 +412,8 @@ static int psp_hw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  	struct psp_context *psp = &adev->psp;
>
> +	psp_ring_destroy(psp, PSP_RING_TYPE__KM);
> +
>  	if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)
>  		amdgpu_ucode_fini_bo(adev);

If adev->firmware.load_type is AMDGPU_FW_LOAD_DIRECT, we should not call psp_ring_destroy, right? I already told you in the last mail:

	if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP)
		return 0;

	amdgpu_ucode_fini_bo(adev);
	psp_ring_destroy(psp, PSP_RING_TYPE__KM);

...

> +int psp_v3_1_ring_destroy(struct psp_context *psp, enum psp_ring_type ring_type)
> +{
> +	int ret = 0;
> +	struct psp_ring *ring;
> +	unsigned int psp_ring_reg = 0;
> +	struct amdgpu_device *adev = psp->adev;
> +
> +	ring = &psp->km_ring;
> +
> +	/* Write the ring destroy command to C2PMSG_64 */
> +	psp_ring_reg = 3 << 16;
> +	WREG32(SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64), psp_ring_reg);
> +
> +	/* there might be handshake issue with hardware which needs delay */
> +	mdelay(20);
> +
> +	/* Wait for response flag (bit 31) in C2PMSG_64 */
> +	ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
> +			   0x80000000, 0x80000000, false);
> +
> +	if (ring->ring_mem)
> +		amdgpu_bo_free_kernel(&adev->firmware.rbuf,
> +				      &ring->ring_mem_mc_addr,
> +				      (void **)&ring->ring_mem);

adev->firmware.rbuf is the entire memory space for storing firmware data, which is inited by amdgpu_ucode_init_bo, and it is already torn down via amdgpu_ucode_fini_bo. You cannot free it again.

Thanks,
Rui

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini
Hi,

Would you please help to review this patch?

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Trigger Huang [mailto:trigger.hu...@amd.com]
Sent: Tuesday, April 18, 2017 3:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <monk@amd.com>; Yu, Xiangliang <xiangliang...@amd.com>; Huang, Ray <ray.hu...@amd.com>; Huang, Trigger <trigger.hu...@amd.com>
Subject: [PATCH V2] drm/amdgpu: Destroy psp ring in hw_fini

Fix issue that PSP initialization will fail if reload amdgpu module. That's because the PSP ring must be destroyed to be ready for the next time PSP initialization.

Changes in v2:
  - Move psp_ring_destroy before all BOs free according to Ray Huang's suggestion.

Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  3 +++
 drivers/gpu/drm/amd/amdgpu/psp_v3_1.c   | 27 +++
 drivers/gpu/drm/amd/amdgpu/psp_v3_1.h   |  2 ++
 4 files changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 19180aa..73ddf4b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -56,6 +56,7 @@ static int psp_sw_init(void *handle)
 	psp->prep_cmd_buf = psp_v3_1_prep_cmd_buf;
 	psp->ring_init = psp_v3_1_ring_init;
 	psp->ring_create = psp_v3_1_ring_create;
+	psp->ring_destroy = psp_v3_1_ring_destroy;
 	psp->cmd_submit = psp_v3_1_cmd_submit;
 	psp->compare_sram_data = psp_v3_1_compare_sram_data;
 	psp->smu_reload_quirk = psp_v3_1_smu_reload_quirk;
@@ -411,6 +412,8 @@ static int psp_hw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 	struct psp_context *psp = &adev->psp;
 
+	psp_ring_destroy(psp, PSP_RING_TYPE__KM);
+
 	if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)
 		amdgpu_ucode_fini_bo(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 28faba5..0301e4e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -66,6 +66,8 @@ struct psp_context
 			  struct psp_gfx_cmd_resp *cmd);
 	int (*ring_init)(struct psp_context *psp, enum psp_ring_type ring_type);
 	int (*ring_create)(struct psp_context *psp, enum psp_ring_type ring_type);
+	int (*ring_destroy)(struct psp_context *psp,
+			    enum psp_ring_type ring_type);
 	int (*cmd_submit)(struct psp_context *psp, struct amdgpu_firmware_info *ucode,
 			  uint64_t cmd_buf_mc_addr, uint64_t fence_mc_addr, int index);
 	bool (*compare_sram_data)(struct psp_context *psp,
@@ -116,6 +118,7 @@ struct amdgpu_psp_funcs {
 #define psp_prep_cmd_buf(ucode, type) (psp)->prep_cmd_buf((ucode), (type))
 #define psp_ring_init(psp, type) (psp)->ring_init((psp), (type))
 #define psp_ring_create(psp, type) (psp)->ring_create((psp), (type))
+#define psp_ring_destroy(psp, type) ((psp)->ring_destroy((psp), (type)))
 #define psp_cmd_submit(psp, ucode, cmd_mc, fence_mc, index) \
 		(psp)->cmd_submit((psp), (ucode), (cmd_mc), (fence_mc), (index))
 #define psp_compare_sram_data(psp, ucode, type) \
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
index d351583..e834b55 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
@@ -321,6 +321,33 @@ int psp_v3_1_ring_create(struct psp_context *psp, enum psp_ring_type ring_type)
 	return ret;
 }
 
+int psp_v3_1_ring_destroy(struct psp_context *psp, enum psp_ring_type ring_type)
+{
+	int ret = 0;
+	struct psp_ring *ring;
+	unsigned int psp_ring_reg = 0;
+	struct amdgpu_device *adev = psp->adev;
+
+	ring = &psp->km_ring;
+
+	/* Write the ring destroy command to C2PMSG_64 */
+	psp_ring_reg = 3 << 16;
+	WREG32(SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64), psp_ring_reg);
+
+	/* there might be handshake issue with hardware which needs delay */
+	mdelay(20);
+
+	/* Wait for response flag (bit 31) in C2PMSG_64 */
+	ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
+			   0x80000000, 0x80000000, false);
+
+	if (ring->ring_mem)
+		amdgpu_bo_free_kernel(&adev->firmware.rbuf,
+				      &ring->ring_mem_mc_addr,
+				      (void **)&ring->ring_mem);
+	return ret;
+}
+
 int psp_v3_1_cmd_submit(struct psp_context *psp,
 			struct amdgpu_firmware_info *ucode,
 			uint64_t cmd_buf_mc_addr, uint64_t fence_mc_addr,
diff -
RE: [PATCH] drm/amdgpu: Fix module unload hang by KIQ IRQ set
Hi Xiangliang,

Thanks for your suggestions, I will make a new patch according to the discussion.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Yu, Xiangliang
Sent: Friday, February 17, 2017 2:15 PM
To: Huang, Trigger <trigger.hu...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <monk....@amd.com>; Huang, Trigger <trigger.hu...@amd.com>
Subject: RE: [PATCH] drm/amdgpu: Fix module unload hang by KIQ IRQ set

Hi Trigger,

Can you try to move src->data = NULL into the gfx_v8_0_kiq_set_interrupt_state function? I think it is simpler and clearer.

Thanks!
Xiangliang Yu

> -----Original Message-----
> From: Trigger Huang [mailto:trigger.hu...@amd.com]
> Sent: Thursday, February 16, 2017 6:48 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Liu, Monk <monk@amd.com>; Yu, Xiangliang <xiangliang...@amd.com>; Huang, Trigger <trigger.hu...@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix module unload hang by KIQ IRQ set
>
> In some cases, manually insmod/rmmod amdgpu is necessary. When
> unloading amdgpu, the KIQ IRQ enable/disable function will cause a
> system hang. The root cause is that in the sequence of function
> amdgpu_fini, the sw_fini of IP block AMD_IP_BLOCK_TYPE_GFX is invoked
> earlier than that of AMD_IP_BLOCK_TYPE_IH, so continuing to use the
> variable freed by AMD_IP_BLOCK_TYPE_GFX will cause a system hang.
>
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  2 ++
>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 17 -
>  2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 1db2e7b..be43d09 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -793,6 +793,8 @@ struct amdgpu_kiq {
>  	struct amdgpu_bo	*eop_obj;
>  	struct amdgpu_ring	ring;
>  	struct amdgpu_irq_src	irq;
> +	u32			me;
> +	u32			pipe;
>  };
>
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> index 772c42b..04e2a5b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> @@ -1390,12 +1390,15 @@ static int gfx_v8_0_kiq_init_ring(struct amdgpu_device *adev,
>  	if (adev->gfx.mec2_fw) {
>  		ring->me = 2;
>  		ring->pipe = 0;
> +		adev->gfx.kiq.me = 2;
> +		adev->gfx.kiq.pipe = 0;
>  	} else {
>  		ring->me = 1;
>  		ring->pipe = 1;
> +		adev->gfx.kiq.me = 1;
> +		adev->gfx.kiq.pipe = 1;
>  	}
>
> -	irq->data = ring;
>  	ring->queue = 0;
>  	sprintf(ring->name, "kiq %d.%d.%d", ring->me, ring->pipe, ring->queue);
>  	r = amdgpu_ring_init(adev, ring, 1024,
> @@ -1410,7 +1413,6 @@ static void gfx_v8_0_kiq_free_ring(struct amdgpu_ring *ring,
>  {
>  	amdgpu_wb_free(ring->adev, ring->adev->virt.reg_val_offs);
>  	amdgpu_ring_fini(ring);
> -	irq->data = NULL;
>  }
>
>  #define MEC_HPD_SIZE 2048
> @@ -6927,15 +6929,12 @@ static int gfx_v8_0_kiq_set_interrupt_state(struct amdgpu_device *adev,
>  					    enum amdgpu_interrupt_state state)
>  {
>  	uint32_t tmp, target;
> -	struct amdgpu_ring *ring = (struct amdgpu_ring *)src->data;
>
> -	BUG_ON(!ring || (ring->funcs->type != AMDGPU_RING_TYPE_KIQ));
> -
> -	if (ring->me == 1)
> +	if (adev->gfx.kiq.me == 1)
>  		target = mmCP_ME1_PIPE0_INT_CNTL;
>  	else
>  		target = mmCP_ME2_PIPE0_INT_CNTL;
> -	target += ring->pipe;
> +	target += adev->gfx.kiq.pipe;
>
>  	switch (type) {
>  	case AMDGPU_CP_KIQ_IRQ_DRIVER0:
> @@ -6973,7 +6972,7 @@ static int gfx_v8_0_kiq_irq(struct amdgpu_device *adev,
>  			    struct amdgpu_iv_entry *entry)
>  {
>  	u8 me_id, pipe_id, queue_id;
> -	struct amdgpu_ring *ring = (struct amdgpu_ring *)source->data;
> +	struct amdgpu_ring *ring = &(adev->gfx.kiq.ring);
>
>  	BUG_ON(!ring || (ring->funcs->type != AMDGPU_RING_TYPE_KIQ));
>
> @@ -7380,4 +7379,4 @@ static void gfx_v8_0_compute_mqd_soft_fini(struct amdgpu_device *adev)
>
>  	ring = &adev->gfx.kiq.ring;
>  	amdgpu_bo_free_kernel(&ring->mqd_obj, &ring->mqd_gpu_addr, (void **)&ring->mqd_ptr);
> -}
> \ No newline at end of file
> +}
> --
> 2.7.4
RE: [PATCH v2] drm/amdgpu:no gpu scheduler for KIQ
Hi Christian,

OK, I got it, thanks for your detailed suggestion.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Christian König [mailto:deathsim...@vodafone.de]
Sent: Wednesday, November 02, 2016 8:24 PM
To: Huang, Trigger <trigger.hu...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Yu, Xiangliang <xiangliang...@amd.com>; Liu, Monk <monk@amd.com>
Subject: Re: [PATCH v2] drm/amdgpu:no gpu scheduler for KIQ

Am 02.11.2016 um 12:48 schrieb Trigger Huang:
> KIQ is used for interaction between driver and CP, and not exposed to
> outside client, as such it doesn't need to be handled by GPU
> scheduler.
>
> Signed-off-by: Monk Liu <monk@amd.com>
> Signed-off-by: Xiangliang Yu <xiangliang...@amd.com>
> Signed-off-by: Trigger Huang <trigger.hu...@amd.com>

Even if you only fix a small mistake it is usually good practice to increase the version number of the patch, e.g. you would use v3 in this case, and write a one-liner in the commit message about what was wrong.

But that's only a nit pick, so the patch is Reviewed-by: Christian König <christian.koe...@amd.com> anyway.

Regards,
Christian.

> Changes in v2:
>   - According to Alex's suggestion, wrapping the scheduler setup
>     conditionally instead of returning early.
>   - Use another simple method to check if is a KIQ ring.
>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 39 +--
>  1 file changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 77b34ec..5772ef2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -382,24 +382,27 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>  	if (!ring->fence_drv.fences)
>  		return -ENOMEM;
>
> -	timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
> -	if (timeout == 0) {
> -		/*
> -		 * FIXME:
> -		 * Delayed workqueue cannot use it directly,
> -		 * so the scheduler will not use delayed workqueue if
> -		 * MAX_SCHEDULE_TIMEOUT is set.
> -		 * Currently keep it simple and silly.
> -		 */
> -		timeout = MAX_SCHEDULE_TIMEOUT;
> -	}
> -	r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
> -			   num_hw_submission,
> -			   timeout, ring->name);
> -	if (r) {
> -		DRM_ERROR("Failed to create scheduler on ring %s.\n",
> -			  ring->name);
> -		return r;
> +	/* No need to setup the GPU scheduler for KIQ ring */
> +	if (ring->funcs->type != AMDGPU_RING_TYPE_KIQ) {
> +		timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
> +		if (timeout == 0) {
> +			/*
> +			 * FIXME:
> +			 * Delayed workqueue cannot use it directly,
> +			 * so the scheduler will not use delayed workqueue if
> +			 * MAX_SCHEDULE_TIMEOUT is set.
> +			 * Currently keep it simple and silly.
> +			 */
> +			timeout = MAX_SCHEDULE_TIMEOUT;
> +		}
> +		r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
> +				   num_hw_submission,
> +				   timeout, ring->name);
> +		if (r) {
> +			DRM_ERROR("Failed to create scheduler on ring %s.\n",
> +				  ring->name);
> +			return r;
> +		}
>  	}
>
>  	return 0;
RE: [V2] drm/amdgpu:no gpu scheduler for KIQ
Sorry, please just ignore this patch; I made a small mistake, I will send it again soon.

Thanks & Best Wishes,
Trigger Huang

-----Original Message-----
From: Trigger Huang [mailto:trigger.hu...@amd.com]
Sent: Wednesday, November 02, 2016 7:26 PM
To: amd-gfx@lists.freedesktop.org
Cc: Huang, Trigger <trigger.hu...@amd.com>; Liu, Monk <monk@amd.com>; Yu, Xiangliang <xiangliang...@amd.com>
Subject: [V2] drm/amdgpu:no gpu scheduler for KIQ

KIQ is used for interaction between driver and CP, and not exposed to outside client, as such it doesn't need to be handled by GPU scheduler.

Signed-off-by: Monk Liu <monk@amd.com>
Signed-off-by: Xiangliang Yu <xiangliang...@amd.com>
Signed-off-by: Trigger Huang <trigger.hu...@amd.com>

Changes in v2:
  - According to Alex's suggestion, wrapping the scheduler setup conditionally instead of returning early.
  - Use another simple method to check if is a KIQ ring.

Change-Id: Iba09b6dfb6515e5de5fc48c4e8904e8a0ec5f22b
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 39 +--
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 77b34ec..5772ef2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -382,24 +382,27 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 	if (!ring->fence_drv.fences)
 		return -ENOMEM;
 
-	timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
-	if (timeout == 0) {
-		/*
-		 * FIXME:
-		 * Delayed workqueue cannot use it directly,
-		 * so the scheduler will not use delayed workqueue if
-		 * MAX_SCHEDULE_TIMEOUT is set.
-		 * Currently keep it simple and silly.
-		 */
-		timeout = MAX_SCHEDULE_TIMEOUT;
-	}
-	r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
-			   num_hw_submission,
-			   timeout, ring->name);
-	if (r) {
-		DRM_ERROR("Failed to create scheduler on ring %s.\n",
-			  ring->name);
-		return r;
+	/* No need to setup the GPU scheduler for KIQ ring */
+	if (ring->funcs->type != AMDGPU_RING_TYPE_KIQ) {
+		timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
+		if (timeout == 0) {
+			/*
+			 * FIXME:
+			 * Delayed workqueue cannot use it directly,
+			 * so the scheduler will not use delayed workqueue if
+			 * MAX_SCHEDULE_TIMEOUT is set.
+			 * Currently keep it simple and silly.
+			 */
+			timeout = MAX_SCHEDULE_TIMEOUT;
+		}
+		r = amd_sched_init(&ring->sched, &amdgpu_sched_ops,
+				   num_hw_submission,
+				   timeout, ring->name);
+		if (r) {
+			DRM_ERROR("Failed to create scheduler on ring %s.\n",
+				  ring->name);
+			return r;
+		}
 	}
 
 	return 0;
--
2.7.4