[PATCH 6/9] drm/amdgpu: Enable BAD_OPCODE intr for gfx8

2018-09-12 Thread Felix Kuehling
From: Harish Kasiviswanathan This enables KFD_EVENT_TYPE_HW_EXCEPTION notifications to user mode in response to bad opcodes in a CP queue. Signed-off-by: Harish Kasiviswanathan Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c | 3 ++- 1 file changed, 2

Re: [PATCH 2/2] drm/amdgpu: use a single linked list for amdgpu_vm_bo_base

2018-09-12 Thread Felix Kuehling
it doesn't need to be a separate allocation from the amdgpu_vm_pt. Acked-by: Felix Kuehling Regards,   Felix On 2018-09-12 04:55 AM, Christian König wrote: > Instead of the double linked list. Gets the size of amdgpu_vm_pt down to > 64 bytes again. > > We could even reduce it down

Re: [PATCH 2/6] drm/amdgpu/sriov: Correct the setting about sdma doorbell offset of Vega10

2018-09-12 Thread Felix Kuehling
On 2018-09-12 09:55 PM, Alex Deucher wrote: > On Wed, Sep 12, 2018 at 9:45 PM Felix Kuehling wrote: >> From: Emily Deng >> >> Correct the format >> >> For vega10 sriov, the sdma doorbell must be fixed as follow to keep the >> same setting with h

Re: [PATCH] drm/amdgpu: use HMM mirror callback to replace mmu notifier v4

2018-09-14 Thread Felix Kuehling
On 2018-09-14 01:52 PM, Christian König wrote: > On 14.09.2018 at 19:47, Philip Yang wrote: >> On 2018-09-14 03:51 AM, Christian König wrote: >>> On 13.09.2018 at 23:51, Felix Kuehling wrote: >>>> On 2018-09-13 04:52 PM, Philip Yang wrote:

Re: kfdtest failures for amdkfd (amd-staging-drm-next)

2018-09-14 Thread Felix Kuehling
You need ROCm 1.9 to work with the upstream KFD. libhsakmt from ROCm 1.8 is incompatible with the upstream KFD ABI. Where did you get KFDTest? It's part of the same repository on GitHub as libhsakmt. It's new on the 1.9 branch. You need libhsakmt from the same branch. The ROCm 1.9 binaries are

Re: [PATCH] drm/amdgpu: use HMM mirror callback to replace mmu notifier v4

2018-09-13 Thread Felix Kuehling
On 2018-09-13 04:52 PM, Philip Yang wrote: > Replace our MMU notifier with hmm_mirror_ops.sync_cpu_device_pagetables > callback. Enable CONFIG_HMM and CONFIG_HMM_MIRROR as a dependency in > DRM_AMDGPU_USERPTR Kconfig. > > It supports both KFD userptr and gfx userptr paths. > > This depends on

Re: [PATCH 1/2] drm/amdgpu: remove amdgpu_bo_list_entry.robj

2018-09-13 Thread Felix Kuehling
On 2018-09-13 01:50 PM, Christian König wrote: > On 12.09.2018 at 22:21, Felix Kuehling wrote: >> On 2018-09-12 04:55 AM, Christian König wrote: >>> We can get that just by casting tv.bo. >>> >>> Signed-off-by: Christian König >>> --- >>&

Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-07-13 Thread Felix Kuehling
On 2019-04-30 at 1:03 p.m., Koenig, Christian wrote: >>> The only real solution I can see is to be able to reliable kill shaders >>> in an OOM situation. >> Well, we can in fact preempt our compute shaders with low latency. >> Killing a KFD process will do exactly that. > I've taken a look at

Re: [PATCH] drm/amdkfd: Rename create_cp_queue() to init_user_queue()

2019-11-11 Thread Felix Kuehling
On 2019-11-01 16:12, Zhao, Yong wrote: create_cp_queue() could also work with SDMA queues, so we should rename it. Change-Id: I76cbaed8fa95dd9062d786cbc1dd037ff041da9d Signed-off-by: Yong Zhao The name change makes sense. This patch is Reviewed-by: Felix Kuehling --- drivers/gpu/drm

Re: [PATCH] drm/amdkfd: Rename kfd_kernel_queue_*.c to kfd_packet_manager_*.c

2019-11-13 Thread Felix Kuehling
On 2019-11-13 5:09 p.m., Yong Zhao wrote: After the recent cleanup, the functionalities provided by the previous kfd_kernel_queue_*.c are actually all packet manager related. So rename them to reflect that. Change-Id: I6544ccb38da827c747544c0787aa949df20edbb0 Signed-off-by: Yong Zhao ---

Re: [PATCH] drm/amdkfd: Rename kfd_kernel_queue_*.c to kfd_packet_manager_*.c

2019-11-13 Thread Felix Kuehling
that the Vega code name for the SOC that's used elsewhere in the code. Regards,   Felix Yong On 2019-11-13 5:19 p.m., Felix Kuehling wrote: On 2019-11-13 5:09 p.m., Yong Zhao wrote: After the recent cleanup, the functionalities provided by the previous kfd_kernel_queue_*.c are actually all p

Re: [PATCH 2/2] drm/amdkfd: Eliminate ops_asic_specific in kernel queue

2019-11-13 Thread Felix Kuehling
See one comment inline. With that fixed, the series is Reviewed-by: Felix Kuehling I could think of more follow-up cleanup while you're at it: 1. Can you see any reason why the kq->ops need to be function pointers. Looks to me like they are the same for all kernel queues, so th

Re: [PATCH] drm/amdkfd: Rename kfd_kernel_queue_*.c to kfd_packet_manager_*.c

2019-11-13 Thread Felix Kuehling
On 2019-11-13 5:39 p.m., Yong Zhao wrote: After the recent cleanup, the functionalities provided by the previous kfd_kernel_queue_*.c are actually all packet manager related. So rename them to reflect that. NAK. Like I mentioned in the other email, AI refers to the SOC generation by its

Re: [PATCH] drm/amdkfd: Rename kfd_kernel_queue_*.c to kfd_packet_manager_*.c

2019-11-13 Thread Felix Kuehling
Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/Makefile | 4 ++-- .../amdkfd/{kfd_kernel_queue_v9.c => kfd_packet_manager_v9.c} | 0 .../amdkfd/{kfd_kernel_queue_vi.c => kfd_packet_manager_vi.c} | 0 3 files changed, 2 insertions(+), 2 deletions(-)

Re: [PATCH 2/2] drm/amdkfd: Stop using GFP_NOIO explicitly for GFX10

2019-11-12 Thread Felix Kuehling
-off-by: Yong Zhao The series is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v10

Re: [PATCH 2/2] drm/amdkfd: Stop using GFP_NOIO explicitly for GFX10

2019-11-12 Thread Felix Kuehling
On 2019-11-12 4:35 p.m., Yong Zhao wrote: Hi Felix, See one thing inline I am not too sure. Yong On 2019-11-12 4:30 p.m., Felix Kuehling wrote: On 2019-11-12 4:26 p.m., Yong Zhao wrote: Adapt the change from 1cd106ecfc1f04 The change is: drm/amdkfd: Stop using GFP_NOIO explicitly

Re: [PATCH 3/3] drm/amdkfd: Fix a bug when calculating save_area_used_size

2019-11-12 Thread Felix Kuehling
qd_cntl_stack_size;; Please fix the double-semicolon. With that fixed this change is Reviewed-by: Felix Kuehling if (copy_to_user(ctl_stack, mqd_ctl_stack, m->cp_hqd_cntl_stack_size)) return -EFAULT; ___ amd-gfx

Re: [PATCH 2/3] drm/amdkfd: Update get_wave_state() for GFX10

2019-11-12 Thread Felix Kuehling
are Reviewed-by: Felix Kuehling Patch 3 should arguably not be part of this series, because it does not affect GFXv10. Regards,   Felix --- drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd

Re: [PATCH 1/2] drm/amdkfd: Use better name to indicate the offset is in dwords

2019-11-11 Thread Felix Kuehling
On 2019-11-01 16:10, Zhao, Yong wrote: Change-Id: I75da23bba90231762cf58da3170f5bb77ece45ed Signed-off-by: Yong Zhao I agree with the name changes. One suggestion for a comment inline. With that fixed, this patch is Reviewed-by: Felix Kuehling --- .../gpu/drm/amd/amdkfd

Re: [PATCH 2/2] drm/amdkfd: Avoid using doorbell_off as offset in process doorbell pages

2019-11-11 Thread Felix Kuehling
separate the queue properties from the queue driver state. That would probably change some internal interfaces to use struct queue instead of queue_properties. Anyway, this patch is Reviewed-by: Felix Kuehling Change-Id: I553045ff9fcb3676900c92d10426f2ceb3660005 Signed-off-by: Yong Zhao

Re: [PATCH 2/2] drm/amdkfd: Avoid using doorbell_off as offset in process doorbell pages

2019-11-11 Thread Felix Kuehling
On 2019-11-11 15:43, Felix Kuehling wrote: On 2019-11-01 16:10, Zhao, Yong wrote: doorbell_off in the queue properties is mainly used for the doorbell dw offset in pci bar. We should not set it to the doorbell byte offset in process doorbell pages. This makes the code much easier to read. I

Re: [PATCH 2/4] drm/ttm: cleanup ttm_buffer_object_transfer

2019-11-15 Thread Felix Kuehling
The subject doesn't match the change. This changes ttm_bo_cleanup_refs, not ttm_buffer_object_transfer. On 2019-11-11 9:58 a.m., Christian König wrote: The function is always called with deleted BOs. While at it cleanup the indentation as well. Signed-off-by: Christian König ---

Re: [PATCH] drm/amdkfd: remove set but not used variable 'top_dev'

2019-11-15 Thread Felix Kuehling
-by: Hulk Robot Fixes: 1ae99eab34f9 ("drm/amdkfd: Initialize HSA_CAP_ATS_PRESENT capability in topology codes") Signed-off-by: zhengbin The patch is Reviewed-by: Felix Kuehling I'm applying it to amd-staging-drm-next. Thanks,   Felix --- drivers/gpu/drm/amd/amdkfd/kfd_iommu.c |

Re: [PATCH 3/4] drm/ttm: rework BO delayed delete.

2019-11-15 Thread Felix Kuehling
, to give a confident R-b. Acked-by: Felix Kuehling --- drivers/gpu/drm/ttm/ttm_bo.c | 215 +- drivers/gpu/drm/ttm/ttm_bo_util.c | 1 - include/drm/ttm/ttm_bo_api.h | 11 +- 3 files changed, 97 insertions(+), 130 deletions(-) diff --git a/drivers/gpu

Re: [PATCH 2/2] drm/amdkfd: Move pm_create_runlist_ib() out of pm_send_runlist()

2019-11-22 Thread Felix Kuehling
I'm not sure about this one. Looks like the interface is getting needlessly more complicated. Now the caller has to keep track of the runlist IB address and size just to pass those to another function. I could understand this if there was a use case that needs to separate the allocation of the

Re: [PATCH] drm/amdgpu: Apply noretry setting for gfx10 and mmhub9.4

2019-11-22 Thread Felix Kuehling
On 2019-11-22 3:23 p.m., Oak Zeng wrote: Config the translation retry behavior according to noretry kernel parameter Change-Id: I5b91ea77715137cf8cb84e258ccdfbb19c7a4ed1 Signed-off-by: Oak Zeng Suggested-by: Jay Cornwall --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 4 +++-

Re: [PATCH] drm/amdgpu: Apply noretry setting for mmhub9.4

2019-11-22 Thread Felix Kuehling
On 2019-11-22 5:55 p.m., Oak Zeng wrote: Config the translation retry behavior according to noretry kernel parameter Change-Id: I5b91ea77715137cf8cb84e258ccdfbb19c7a4ed1 Signed-off-by: Oak Zeng Suggested-by: Jay Cornwall Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu

Re: [PATCH 2/3] drm/amdgpu: explicitely sync to VM updates

2019-12-04 Thread Felix Kuehling
On 2019-12-04 10:38 a.m., Christian König wrote: Allows us to reduce the overhead while syncing to fences a bit. This allows some further simplification. See two comments inline. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 18 +++---

[PATCH 1/1] drm/amdkfd: Improve kfd_process lookup in kfd_ioctl

2019-12-04 Thread Felix Kuehling
can be distinguished in user mode. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 30 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 ++ 2 files changed, 28 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c

Re: [PATCH 3/3] drm/amdgpu: stop adding VM updates fences to the resv obj

2019-12-04 Thread Felix Kuehling
[+Alejandro] On 2019-12-04 10:38 a.m., Christian König wrote: This way we can do updates even without the resv obj locked. This could use a bit more explanation. This change depends on the previous one that adds explicit synchronization with page table updates during command submission in

Re: Deadlock on PTEs update for HMM

2019-12-04 Thread Felix Kuehling
ic context any more, but that took me way longer than expected as well. I'm currently experimenting with using a trylock driver mutex, that at least that should work for now until we got something better. Regards, Christian. On 28.11.19 at 21:30, Felix Kuehling wrote: Hi Christian, I'm thin

Re: [PATCH 2/2] drm/amdkfd: Add Arcturus specific set_vm_context_page_table_base()

2019-12-12 Thread Felix Kuehling
I agree with Christian's comments on patch 1. With those fixed, the series is Reviewed-by: Felix Kuehling Regards,   Felix On 2019-12-02 20:42, Yong Zhao wrote: Since Arcturus has its own function pointer, we can move Arcturus specific logic to there rather than leaving it entangled

Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-17 Thread Felix Kuehling
I agree. Removing the call to pre-reset probably breaks GPU reset for KFD. We call the KFD suspend function in pre-reset, which uses the HIQ to stop any user mode queues still running. If that is not possible because the HIQ is hanging, it should fail with a timeout. There may be something we

Re: [PATCH 2/2] drm/amdgpu: attempt xgmi perfmon re-arm on failed arm

2019-12-17 Thread Felix Kuehling
On 2019-12-17 12:28, Jonathan Kim wrote: The DF routines to arm xGMI performance will attempt to re-arm both on performance monitoring start and read on initial failure to arm. Signed-off-by: Jonathan Kim --- drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 153 --- 1 file

Re: [PATCH 2/2] drm/amdkfd: expose num_cp_queues data field to topology node

2019-12-17 Thread Felix Kuehling
See comment inline. Other than that, the series looks good to me. On 2019-12-16 2:02, Huang Rui wrote: Thunk driver would like to know the num_cp_queues data, however this data relied on different asic specific. So it's better to get it from kfd driver. Signed-off-by: Huang Rui ---

Re: [PATCH 4/5] drm/amdgpu: add VM eviction lock v2

2019-12-05 Thread Felix Kuehling
you that the operation is in progress and you should check its status later." This call is neither non-blocking nor is the requested page table update in progress when this error is returned. So I'd think a better error to return here would be EBUSY. Other than that, this patch is Reviewed-b

Re: [PATCH 2/5] drm/amdgpu: explicitely sync to VM updates v2

2019-12-05 Thread Felix Kuehling
On 2019-12-05 8:39 a.m., Christian König wrote: Allows us to reduce the overhead while syncing to fences a bit. v2: also drop adev parameter from the functions Signed-off-by: Christian König Reviewed-by: Felix Kuehling --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 8

Re: [PATCH 3/5] drm/amdgpu: stop adding VM updates fences to the resv obj

2019-12-05 Thread Felix Kuehling
be updated because "other VM updates" fences are no longer in the resv. Something like this: VM updates only sync with moves but not with user command submissions or KFD evictions fences. With that fixed, this patch is Reviewed-by: Feli

Re: [PATCH 5/5] drm/amdgpu: immedially invalidate PTEs

2019-12-05 Thread Felix Kuehling
On 2019-12-05 8:39 a.m., Christian König wrote: When a BO is evicted immedially invalidate the mapped PTEs. I think you mentioned that this is just a proof of concept. I wouldn't submit the patch like this because it's overkill for VMs that don't want to use recoverable page faults and

Re: reserving VRAM for page tables

2019-12-05 Thread Felix Kuehling
I don't think this should go into amdgpu_vram_mgr. KFD tries to avoid running out of VRAM for page tables because we cannot oversubscribe memory within a process and we want to avoid compute processes evicting each other because that would mean thrashing. Those limitations don't apply to

Re: [PATCH 1/1] drm/amdkfd: Improve kfd_process lookup in kfd_ioctl

2019-12-05 Thread Felix Kuehling
On 2019-12-05 11:10 a.m., Philip Yang wrote: One comment in line. With it is fixed, this is reviewed by Philip Yang Philip On 2019-12-04 11:13 p.m., Felix Kuehling wrote:   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 8276601a122f

Re: [PATCH] drm/amdkfd: Improve function get_sdma_rlc_reg_offset()

2019-12-16 Thread Felix Kuehling
On 2019-12-16 3:06 p.m., Zhao, Yong wrote: [AMD Official Use Only - Internal Distribution Only] The problem happens when we want to reuse the same function for ASICs which have fewer SDMA engines. Some pointers on which SOC15_REG_OFFSET depends for some higher index SDMA engines are 0,

Re: [PATCH] drm/amdkfd: Improve function get_sdma_rlc_reg_offset()

2019-12-16 Thread Felix Kuehling
On 2019-12-13 8:38, Yong Zhao wrote: This prevents the NULL pointer access when there are fewer than 8 sdma engines. I don't see where you got a NULL pointer in the old code. Also this change is in an Arcturus-specific source file. AFAIK Arcturus always has 8 SDMA engines. The new code is

Re: [PATCH 5/5] drm/amdgpu: immedially invalidate PTEs

2019-12-11 Thread Felix Kuehling
Hi Christian, Alex started trying to invalidate PTEs in the MMU notifiers and we're finding that we still need to reserve the VM reservation for amdgpu_sync_resv in amdgpu_vm_sdma_prepare. Is that sync_resv still needed now, given that VM fences aren't in that reservation object any more?

Re: [PATCH 2/2] amd/amdgpu: force to trigger a no-retry-fault after a retry-fault

2019-11-18 Thread Felix Kuehling
On 2019-11-18 17:24, Alex Sierra wrote: Only for the debugger use case. [why] Avoid endless translation retries, after an invalid address access has been issued to the GPU. Instead, the trap handler is forced to enter by generating a no-retry-fault. A s_trap instruction is inserted in the

Re: [PATCH] drm/amdkfd: Delete KFD_MQD_TYPE_COMPUTE

2019-11-20 Thread Felix Kuehling
n 2019-11-15 11:07, Yong Zhao wrote: It is the same as KFD_MQD_TYPE_CP, so delete it. As a result, we will have one less mqd manager per device. Change-Id: Iaa98fc17be06b216de7a826c3577f44bc0536b4c Signed-off-by: Yong Zhao Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd

Re: [PATCH 2/2] amd/amdgpu: force to trigger a no-retry-fault after a retry-fault

2019-11-19 Thread Felix Kuehling
: I4180c30e2631dc0401cbd6171f8a6776e4733c9a Signed-off-by: Alex Sierra This commit adds some unnecessary empty lines. See inline. With that fixed, the series is Reviewed-by: Felix Kuehling Please also give Christian a chance to review. Thanks,   Felix --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 - 1 file changed

Re: [PATCH RFC v4 14/16] drm, cgroup: Introduce lgpu as DRM cgroup resource

2019-11-29 Thread Felix Kuehling
On 2019-10-11 1:12 p.m., t...@kernel.org wrote: Hello, Daniel. On Wed, Oct 09, 2019 at 06:06:52PM +0200, Daniel Vetter wrote: That's not the point I was making. For cpu cgroups there's a very well defined connection between the cpu bitmasks/numbers in cgroups and the cpu bitmasks you use in

Re: [PATCH 1/1] amdgpu: Enable KFD on POWER systems

2019-11-25 Thread Felix Kuehling
Hi Timothy, Thank you for the patch and for confirming that it works. We did some experimental work on Power8 a few years ago. I see that Talos II is Power9. At the time we were working on Power8 we had to add some #ifdef CONFIG_ACPI guards around some ACPI-specific code in KFD. Do you know

[PATCH 1/1] drm/amdgpu: Optimize KFD page table reservation

2019-11-25 Thread Felix Kuehling
each + 256GB system memory = 512GB Old page table reservation per GPU: 1GB New page table reservation per GPU: 32MB Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/drivers/gpu

Re: [PATCH 1/1] drm/amdgpu: Optimize KFD page table reservation

2019-11-25 Thread Felix Kuehling
., Felix Kuehling wrote: Be less pessimistic about estimated page table use for KFD. Most allocations use 2MB pages and therefore need less VRAM for page tables. This allows more VRAM to be used for applications especially on large systems with many GPUs and hundreds of GB of system memory. Example
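The figures quoted in the original posting (512GB of mappable memory, a 1GB reservation before the change and 32MB after) are consistent with a simple shift-based estimate. The following is an illustrative Python sketch, not the kernel code: it assumes 8-byte page table entries, the shift of 14 for the new estimate is inferred from the 512GB to 32MB ratio rather than taken from the patch, and all function names are made up for the example.

```python
GIB = 1 << 30
MIB = 1 << 20

def pt_size_4k_pages(mem_size):
    """Worst case: one 8-byte entry per 4KB page, i.e. mem_size / 512."""
    return mem_size >> 9

def pt_size_2m_pages(mem_size):
    """Best case: one 8-byte entry per 2MB page, i.e. mem_size / 262144."""
    return mem_size >> 18

def pt_size_new_estimate(mem_size):
    """Assumed middle ground; the shift is inferred from the posted numbers."""
    return mem_size >> 14

total = 512 * GIB
print(pt_size_4k_pages(total) // MIB)      # 1024 MB, matching the old 1GB reservation
print(pt_size_2m_pages(total) // MIB)      # 2 MB if every mapping used 2MB pages
print(pt_size_new_estimate(total) // MIB)  # 32 MB, matching the new reservation
```

Under these assumptions the new reservation sits between the all-4KB and all-2MB extremes, which fits the stated reasoning that most, but not all, allocations use 2MB pages.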

Re: [PATCH 1/1] amdgpu: Enable KFD on POWER systems

2019-11-25 Thread Felix Kuehling
On 2019-11-25 4:06 p.m., Timothy Pearson wrote: - Original Message - From: "Felix Kuehling" To: "Timothy Pearson" , "amd-gfx" Sent: Monday, November 25, 2019 11:07:31 AM Subject: Re: [PATCH 1/1] amdgpu: Enable KFD on POWER systems Hi Tim

[PATCH 1/1] drm/amdgpu: Raise KFD unpinned system memory limit

2019-11-25 Thread Felix Kuehling
Allow KFD applications to use more unpinned system memory through HMM. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm

Re: Deadlock on PTEs update for HMM

2019-11-28 Thread Felix Kuehling
Hi Christian, I'm thinking about this problem, trying to come up with a solution. The fundamental problem is that we need low-overhead access to the page table in the MMU notifier, without much memory management or locking. There is one "driver lock" that we're supposed to take in the MMU

Re: [PATCH] drm/amdgpu: attempt xgmi perfmon re-arm on failed arm

2019-12-19 Thread Felix Kuehling
confusing. Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 151 +++ 1 file changed, 129 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c index

Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-19 Thread Felix Kuehling
I'm thinking, if we know we're preparing for a GPU reset, maybe we shouldn't even try to suspend processes and stop the HIQ. kfd_suspend_all_processes, stop_cpsch and other functions up that call chain up to kgd2kfd_suspend could have a parameter (bool pre_reset) that would update the driver

Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

2019-12-19 Thread Felix Kuehling
? Regards shaoyun.liu On 2019-12-19 5:44 p.m., Felix Kuehling wrote: I'm thinking, if we know we're preparing for a GPU reset, maybe we shouldn't even try to suspend processes and stop the HIQ. kfd_suspend_all_processes, stop_cpsch and other functions up that call chain up to kgd2kfd_suspend could

Re: [PATCH 4/4] drm/amdkfd: Avoid hanging hardware in stop_cpsch

2019-12-20 Thread Felix Kuehling
un.liu On 2019-12-20 11:33 a.m., Felix Kuehling wrote: dqm->is_hws_hang is protected by the DQM lock. kq_uninitialize runs outside that lock protection. Therefore I opted to pass in the hanging flag as a parameter. It also keeps the logic that decides all of that inside the device queue manager
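The design choice described here (read a lock-protected flag while holding the lock, then hand the value to code that runs outside the lock) can be sketched generically. This is illustrative Python with hypothetical class and method names, not the kernel's C code; what the real kq_uninitialize does with the flag is not shown in the excerpt, so the two return strings are placeholders.

```python
import threading

class KernelQueue:
    def uninitialize(self, hanging):
        # Runs without the manager's lock, so it must not read the flag itself.
        if hanging:
            return "skipped hardware teardown"
        return "destroyed queue on hardware"

class DeviceQueueManager:
    def __init__(self):
        self.lock = threading.Lock()
        self.is_hws_hang = False  # only read or written with self.lock held

    def stop(self, kernel_queue):
        with self.lock:
            hanging = self.is_hws_hang  # snapshot taken under the lock
        # The decision input travels as an argument instead of a shared read.
        return kernel_queue.uninitialize(hanging)
```

Passing the snapshot keeps the locking rule in one place: only the manager touches the flag, and the queue code stays unaware of how the hang state is tracked.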

Re: [PATCH v2 2/2] drm/amdkfd: expose num_cp_queues data field to topology node (v2)

2019-12-18 Thread Felix Kuehling
On 2019-12-18 3:45 a.m., Huang Rui wrote: Thunk driver would like to know the num_cp_queues data, however this data relied on different asic specific. So it's better to get it from kfd driver. v2: don't update name size. Signed-off-by: Huang Rui The series is Reviewed-by: Felix Kuehling

[PATCH 3/4] drm/amdkfd: Improve HWS hang detection and handling

2019-12-20 Thread Felix Kuehling
Move HWS hang detection into unmap_queues_cpsch to catch hangs in all cases. If this happens during a reset, don't schedule another reset because the reset already in progress is expected to take care of it. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 3

[PATCH 1/4] drm/amdkfd: Fix permissions of hang_hws

2019-12-20 Thread Felix Kuehling
Reading from /sys/kernel/debug/kfd/hang_hws would cause a kernel oops because we didn't implement a read callback. Set the permission to write-only to prevent that. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion

[PATCH 2/4] drm/amdkfd: Remove unused variable

2019-12-20 Thread Felix Kuehling
dqm->pipeline_mem wasn't used anywhere. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 - drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 1 - 2 files changed, 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manage

[PATCH 4/4] drm/amdkfd: Avoid hanging hardware in stop_cpsch

2019-12-20 Thread Felix Kuehling
Don't use the HWS if it's known to be hanging. In a reset also don't try to destroy the HIQ because that may hang on SRIOV if the KIQ is unresponsive. Signed-off-by: Felix Kuehling --- .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c| 12 drivers/gpu/drm/amd/amdkfd

Re: [PATCH 1/1] drm/amdkfd: Don't touch the hardware in pre_reset callback

2019-12-19 Thread Felix Kuehling
FLR for SRIOV. Regards,   Felix On 2019-12-19 9:09 p.m., Felix Kuehling wrote: The intention of the pre_reset callback is to update the driver state to reflect that all user mode queues are preempted and the HIQ is destroyed. However we should not actually preempt any queues or otherwise touch the hardw

Re: [PATCH 1/1] drm/amdkfd: Don't touch the hardware in pre_reset callback

2019-12-19 Thread Felix Kuehling
he reset by avoiding unnecessary timeouts from a potentially hanging GPU scheduler. CC: shaoyunl CC: Liu Monk Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 24 ++--- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 27 ---

[PATCH 1/1] drm/amdkfd: Don't touch the hardware in pre_reset callback

2019-12-19 Thread Felix Kuehling
of actually stopping all queues. This should prevent KIQ register access hanging on SRIOV function level reset (FLR). It should also speed up the reset by avoiding unnecessary timeouts from a potentially hanging GPU scheduler. CC: shaoyunl CC: Liu Monk Signed-off-by: Felix Kuehling --- drivers

Re: [PATCH 1/1] drm/amdkfd: Don't touch the hardware in pre_reset callback

2019-12-19 Thread Felix Kuehling
FLR for SRIOV. Regards,    Felix On 2019-12-19 9:09 p.m., Felix Kuehling wrote: > The intention of the pre_reset callback is to update the driver > state to reflect that all user mode queues are preempted and the > HIQ is destroyed. However we should not actually preempt any queues > or

Re: [PATCH] drm/amdgpu: attempt xgmi perfmon re-arm on failed arm

2019-12-18 Thread Felix Kuehling
Some nit-picks inline. Looks good otherwise. On 2019-12-18 2:04 p.m., Jonathan Kim wrote: The DF routines to arm xGMI performance will attempt to re-arm both on performance monitoring start and read on initial failure to arm. v2: Roll back reset_perfmon_cntr to void return since new perfmon

Re: [PATCH 4/4] drm/amdkfd: Avoid hanging hardware in stop_cpsch

2019-12-20 Thread Felix Kuehling
On 2019-12-20 12:22, Zeng, Oak wrote: [AMD Official Use Only - Internal Distribution Only] Regards, Oak -Original Message- From: amd-gfx On Behalf Of Felix Kuehling Sent: Friday, December 20, 2019 3:30 AM To: amd-gfx@lists.freedesktop.org Subject: [PATCH 4/4] drm/amdkfd: Avoid

Re: [PATCH 4/4] drm/amdkfd: Avoid hanging hardware in stop_cpsch

2019-12-20 Thread Felix Kuehling
I think even Inside that kq_uninitialize function , we still can get dqm as  kq->dev->dqm . shaoyun.liu On 2019-12-20 3:30 a.m., Felix Kuehling wrote: Don't use the HWS if it's known to be hanging. In a reset also don't try to destroy the HIQ because that may hang on SRIOV if the K

Re: [PATCH 3/4] drm/amdkfd: Improve HWS hang detection and handling

2019-12-20 Thread Felix Kuehling
or GPU resets in KFD. I think we created the worker to avoid locking issues, but there may be better ways to do this. Regards,   Felix Regards, Oak -Original Message- From: amd-gfx On Behalf Of Felix Kuehling Sent: Friday, December 20, 2019 3:30 AM To: amd-gfx@lists.freedeskto

Re: [PATCH 2/5] drm/amdgpu: export function to flush TLB via pasid

2019-12-20 Thread Felix Kuehling
On 2019-12-20 1:24, Alex Sierra wrote: This can be used directly from amdgpu and amdkfd to invalidate TLB through pasid. It supports gmc v7, v8, v9 and v10. Two small corrections inline to make the behaviour between KIQ and MMIO-based flushing consistent. Looks good otherwise. Change-Id:

Re: [PATCH 4/5] drm/amdgpu: flush TLB functions removal from kfd2kgd interface

2019-12-20 Thread Felix Kuehling
: Ic2c7d4a0d19fe1e884dee1ff10a520d31252afee Signed-off-by: Alex Sierra This patch is Reviewed-by: Felix Kuehling --- .../drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 - .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 67 - .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c | 41 .../gpu/drm

Re: [PATCH 1/5] drm/amdgpu: Avoid reclaim fs while eviction lock

2019-12-20 Thread Felix Kuehling
: I5531c9337836e7d4a430df3f16dcc82888e8018c Signed-off-by: Alex Sierra This patch is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 14 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 28 +- 2 files changed, 34 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm

Re: [PATCH 3/5] drm/amdgpu: GPU TLB flush API moved to amdgpu_amdkfd

2019-12-20 Thread Felix Kuehling
On 2019-12-20 1:24, Alex Sierra wrote: [Why] TLB flush method has been deprecated using kfd2kgd interface. This implementation is now on the amdgpu_amdkfd API. [How] TLB flush functions now implemented in amdgpu_amdkfd. Change-Id: Ic51cccdfe6e71288d78da772b6e1b6ced72f8ef7 Signed-off-by: Alex

Re: [PATCH 5/5] drm/amdgpu: invalidate BO during userptr mapping

2019-12-20 Thread Felix Kuehling
I think this patch is just a proof of concept for now. It should not be submitted because there are still some known locking issues that need to be solved, and we don't have the code yet that handles the recoverable page faults resulting from this. Regards,   Felix On 2019-12-20 1:24, Alex

Re: [PATCH] drm/amdgpu: add VM update fences back to the root PD v2

2020-02-25 Thread Felix Kuehling
adding VM updates fences to the resv obj Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index

Re: [PATCH 6/6] drm/amdkfd: Delete unnecessary unmap queue package submissions

2020-02-25 Thread Felix Kuehling
-by: Felix Kuehling On 2020-02-24 17:18, Yong Zhao wrote: The previous SDMA queue counting was wrong. In addition, after confirming with MEC firmware team, we understand that only one unmap queue package, instead of one unmap queue package for CP and each SDMA engine, is needed, which results in much

Re: [PATCH] drm/amdkfd: change SDMA MQD memory type

2020-02-26 Thread Felix Kuehling
Signed-off-by: Eric Huang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amdkfd: Consolidate duplicated bo alloc flags

2020-03-05 Thread Felix Kuehling
On 2020-03-04 3:21 p.m., Yong Zhao wrote: ALLOC_MEM_FLAGS_* used are the same as the KFD_IOC_ALLOC_MEM_FLAGS_*, but they are interweavedly used in kernel driver, resulting in bad readability. For example, KFD_IOC_ALLOC_MEM_FLAGS_COHERENT is totally not referenced in kernel, and it functions in

Re: [PATCH 2/2] drm/amdkfd: Signal eviction fence on process destruction (v2)

2020-03-05 Thread Felix Kuehling
wrote: Series is Reviewed-by: xinhui pan On 2020-03-05 at 05:50, Kuehling, Felix wrote: Otherwise BOs may wait for the fence indefinitely and never be destroyed. v2: Signal the fence right after destroying queues to avoid unnecessary delayed-delete in kfd_process_wq_release Signed-off-by: Felix

Re: [PATCH] drm/amdgpu: stop allocating PDs/PTs with the eviction lock held

2020-02-27 Thread Felix Kuehling
On 2020-02-27 9:28, Christian König wrote: Hi Felix, so coming back to this after two weeks of distraction. On 14.02.20 at 22:12, Felix Kuehling wrote: Now you allow eviction of page tables while you allocate page tables. Isn't the whole point of the eviction lock to prevent page table

[PATCH 1/1] drm/amdgpu: Fix 32-bit build

2020-02-26 Thread Felix Kuehling
Add a dummy implementation of amdgpu_amdkfd_remove_fence_on_pt_pd_bos for kernel configs without KFD. Fixes: be8e48e08499 ("drm/amdgpu: Remove kfd eviction fence before release bo") Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 + 1 file

Re: Raven: freeze at 'modprobe amdgpu' in early console with android-x86

2020-01-27 Thread Felix Kuehling
I've seen hangs on a Raven AM4 system after the Ubuntu upgrade to kernel 5.3. I am able to work around it by disabling stutter mode with the module parameter amdgpu.ppfeaturemask=0xfffdbfff. If that doesn't help, you could also try disabling GFXOFF with amdgpu.ppfeaturemask=0xfffd3fff.
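To see what the two suggested masks actually change, it can help to list the bits each one clears. The sketch below is plain Python arithmetic on the two values from the message; the helper name is made up, and the mapping from bit positions to named features is not stated in the message, so check the PP_* feature mask definitions in the kernel headers for your version before relying on it.

```python
def cleared_bits(mask, width=32):
    """Return the positions of the zero bits in a width-bit mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]

stutter_workaround = 0xfffdbfff
gfxoff_workaround = 0xfffd3fff

print(cleared_bits(stutter_workaround))             # [14, 17]
print(cleared_bits(gfxoff_workaround))              # [14, 15, 17]
print(hex(stutter_workaround ^ gfxoff_workaround))  # 0x8000
```

The second mask clears everything the first one does plus bit 15, so it applies the GFXOFF workaround on top of the first workaround rather than instead of it.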

Re: [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked

2020-01-28 Thread Felix Kuehling
On 2020-01-27 20:29, Rajneesh Bhardwaj wrote: During system suspend the kfd driver acquires a lock that prohibits further kfd actions unless the gpu is resumed. This adds some info which can be useful while debugging. Signed-off-by: Rajneesh Bhardwaj ---

Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-29 Thread Felix Kuehling
Hi Rajneesh, See comments inline ... And a general question: Why do you need to set the autosuspend_delay in so many places? Amdgpu only has a single call to this function during initialization. On 2020-01-27 20:29, Rajneesh Bhardwaj wrote: So far the kfd driver implemented same routines

Re: [PATCH v3] drm/amdkfd: Add queue information to sysfs

2020-02-04 Thread Felix Kuehling
and files underneath are generated when a queue is created. They are removed when the queue is destroyed. Signed-off-by: Amber Lin Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 7 ++ drivers/gpu/drm/amd/amdkfd/kfd_process.c | 90

Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco

2020-02-04 Thread Felix Kuehling
Bhardwaj One small comment inline. Other than that patches 1-3 are Reviewed-by: Felix Kuehling Also, I believe patch 1 is unchanged from v1 and already got a Reviewed-by from Alex. Please remember to add that tag before you submit. The last patch that enabled runtime PM by default, I'd leave

Re: [PATCH 2/4] drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access

2020-02-05 Thread Felix Kuehling
If we're using the BAR, we should probably flush HDP cache/buffers before reading or after writing. Regards,   Felix On 2020-02-05 10:22 a.m., Christian König wrote: This should speed up debugging VRAM access a lot. Signed-off-by: Christian König ---

Re: [PATCH 4/4] drm/amdgpu: use amdgpu_device_vram_access in amdgpu_ttm_access_memory

2020-02-05 Thread Felix Kuehling
On 2020-02-05 10:22 a.m., Christian König wrote: Make use of the better performance here as well. This patch is only compile tested! Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 38 +++-- 1 file changed, 23 insertions(+), 15 deletions(-)

Re: [PATCH 4/4] drm/amdgpu: use amdgpu_device_vram_access in amdgpu_ttm_access_memory v2

2020-02-06 Thread Felix Kuehling
On 2020-02-06 9:30, Christian König wrote: Make use of the better performance here as well. This patch is only compile tested! v2: fix calculation bug pointed out by Felix Signed-off-by: Christian König Acked-by: Jonathan Kim The series is Reviewed-by: Felix Kuehling --- drivers

Re: [PATCH] drm/amdgpu: fix amdgpu pmu to use hwc->config instead of hwc->conf

2020-02-06 Thread Felix Kuehling
hwc->conf and hwc->config are in different members of that union. So hwc->conf aliases some other variable in the structure that hwc->config is in. If I did the math right, hwc->conf aliases hwc->last_tag. Anyway, the patch is Reviewed-by: Felix Kuehling Signed-off-by: Jonathan K
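The aliasing described above can be reproduced in miniature. The following Python sketch uses `ctypes` to build a union of two structs. The layout is a hypothetical, simplified stand-in and not the kernel's real perf event structure; the type names `GroupA`, `GroupB` and `Event` and the `pad` field are assumptions made for illustration. It shows that a field in one union member can share its offset with an unrelated field in another member, so a write through the wrong name silently lands in the other field.

```python
import ctypes


class GroupA(ctypes.Structure):
    # Hypothetical first union member: config at offset 0, last_tag at offset 8.
    _fields_ = [("config", ctypes.c_uint64),
                ("last_tag", ctypes.c_uint64)]


class GroupB(ctypes.Structure):
    # Hypothetical second union member: conf sits at offset 8, after padding.
    _fields_ = [("pad", ctypes.c_uint64),
                ("conf", ctypes.c_uint64)]


class Event(ctypes.Union):
    # Anonymous members expose the inner field names directly on Event.
    _anonymous_ = ("a", "b")
    _fields_ = [("a", GroupA), ("b", GroupB)]


ev = Event()
ev.config = 0x1234   # intended field
ev.conf = 0xBEEF     # wrong field: lives in the other union member

print(hex(ev.config))    # config is untouched by the write to conf
print(hex(ev.last_tag))  # last_tag now holds the value written through conf
```

In this simplified layout, `conf` and `last_tag` share offset 8, so code that meant to set `config` but wrote `conf` leaves `config` unchanged and overwrites `last_tag` instead.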

Re: [PATCH 1/5] drm/amdgpu: fix braces in amdgpu_vm_update_ptes

2020-01-30 Thread Felix Kuehling
On 2020-01-30 7:49, Christian König wrote: For the root PD mask can be 0x as well which would overrun to 0 if we don't cast it before we add one. You're fixing parentheses, not braces. Parentheses: () Brackets: [] Braces: {} With the title fixed, this patch is Reviewed-by: Felix
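The overrun mentioned in the quoted commit message is ordinary unsigned wraparound. The Python sketch below emulates fixed-width arithmetic with `ctypes`. The mask value is truncated in the snippet above, so the all-ones 32-bit value used here is an assumption chosen only to illustrate the effect, and the variable names are hypothetical.

```python
import ctypes

# Assumed all-ones 32-bit mask; the actual value is truncated in the snippet.
mask = 0xffffffff

# Adding one in 32-bit arithmetic drops the carry and wraps around to 0.
narrow = ctypes.c_uint32(mask + 1).value

# Widening to 64 bits before adding one preserves the carry.
wide = ctypes.c_uint64(mask).value + 1

print(narrow)     # 0
print(hex(wide))  # 0x100000000
```

Casting before the addition, rather than after it, is what keeps the result from collapsing to zero.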

Re: [PATCH 2/5] drm/amdgpu: return EINVAL instead of ENOENT in the VM code

2020-01-30 Thread Felix Kuehling
On 2020-01-30 7:49, Christian König wrote: That we can't find a PD above the root is expected and can only happen if we try to update a larger range than actually managed by the VM. Signed-off-by: Christian König Tested-by: Tom St Denis Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd

Re: [PATCH 3/5] drm/amdgpu: allow higher level PD invalidations

2020-01-30 Thread Felix Kuehling
, this patch is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 23 ++- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 9705c961405b

Re: [PATCH 5/5] drm/amdgpu: rework synchronization of VM updates v4

2020-01-30 Thread Felix Kuehling
-by: Christian König Tested-by: Tom St Denis Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 35 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 3 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c| 7 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

Re: [PATCH] drm/amdkfd: Fix a bug in SDMA RLC queue counting under HWS mode

2020-01-30 Thread Felix Kuehling
, compute_queue_count in pm_calc_rlib_size() is one more than the actual compute queue number, because the queue_count has been incremented while sdma_queue_count has not. This patch fixes that. Change-Id: I20353e657efd505353d0dd9f7eb2fab5085e7202 Signed-off-by: Yong Zhao Reviewed-by: Felix Kuehling But I

Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Felix Kuehling
On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote: Hello Felix, Thanks for your time to review and for your feedback. On 1/29/2020 5:52 PM, Felix Kuehling wrote: Hi Rajneesh, See comments inline ... And a general question: Why do you need to set the autosuspend_delay in so many places? Amdgpu

Re: [PATCH 4/5] drm/amdgpu: simplify and fix amdgpu_sync_resv

2020-01-30 Thread Felix Kuehling
On 2020-01-30 7:49, Christian König wrote: No matter what we always need to sync to moves. Signed-off-by: Christian König Tested-by: Tom St Denis Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 15 +++ 1 file changed, 11 insertions(+), 4

Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Felix Kuehling
On 2020-01-30 17:11, Alex Deucher wrote: On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling wrote: On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote: Hello Felix, Thanks for your time to review and for your feedback. On 1/29/2020 5:52 PM, Felix Kuehling wrote: Hi Rajneesh, See comments inline
