Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

2021-10-20 Thread Lazar, Lijo
On 10/20/2021 10:05 PM, Kent Russell wrote: If the bad_page_threshold kernel parameter is set to -2, continue to post the GPU. Print a warning to dmesg that this action has been done, and that page retirement will obviously not work for said GPU Cc: Luben Tuikov Cc: Mukul Joshi

RE: [PATCH] drm/amdgpu/display: add yellow carp B0 with rest of driver

2021-10-20 Thread Liu, Aaron
[AMD Official Use Only] Reviewed-by: Aaron Liu -- Best Regards Aaron Liu > -----Original Message----- > From: amd-gfx On Behalf Of Alex > Deucher > Sent: Wednesday, October 20, 2021 9:53 PM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander > Subject: [PATCH] drm/amdgpu/display:

Re: [PATCH v2] drm/amdgpu: remove grbm cam index/data operations for gfx v10

2021-10-20 Thread Alex Deucher
On Wed, Oct 20, 2021 at 10:27 PM Huang Rui wrote: > > PSP firmware will be responsible for applying the GRBM CAM remapping in > the production. And the GRBM_CAM_INDEX / GRBM_CAM_DATA registers will be > protected by PSP under security policy. So remove it according to the > new security policy. >

[PATCH v2] drm/amdgpu: remove grbm cam index/data operations for gfx v10

2021-10-20 Thread Huang Rui
PSP firmware will be responsible for applying the GRBM CAM remapping in the production. And the GRBM_CAM_INDEX / GRBM_CAM_DATA registers will be protected by PSP under security policy. So remove it according to the new security policy. Signed-off-by: Huang Rui ---

Re: [PATCH 1/1] drm/amdgpu: fix BO leak after successful move test

2021-10-20 Thread zhang
On 2021/10/20 19:51, Christian König wrote: On 20.10.21 at 13:50, Christian König wrote: On 13.10.21 at 17:09, Nirmoy Das wrote: GTT BO cleanup code is within the test for loop and we would skip cleaning up GTT BO on success. Reported-by: zhang Signed-off-by: Nirmoy Das ---

[PATCH 2/2] drm/amdkfd: debug message to count successfully migrated pages

2021-10-20 Thread Philip Yang
Not all migrate.cpages returned from migrate_vma_setup can be migrated, for example non-anonymous pages, or when out of device memory. So after migrate_vma_pages returns, add a debug message that counts the pages that were successfully migrated, i.e. those with both the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flags set.

[PATCH 1/2] drm/amdkfd: clarify the origin of cpages returned by migration functions

2021-10-20 Thread Philip Yang
cpages is only updated by migrate_vma_setup. So capture its value at that point to clarify the significance of the number. The next patch will add counting of actually migrated pages after migrate_vma_pages for debug purposes. Signed-off-by: Philip Yang ---

[PATCH] drm/amdkfd: svm get successfully migrated pages

2021-10-20 Thread Philip Yang
Not all migrate.cpages returned from migrate_vma_setup can be migrated, for example non-anonymous pages, or when out of device memory. So after migrate_vma_pages returns, check which pages were successfully migrated, i.e. those with both the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flags set. Signed-off-by: Philip Yang ---
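The check described above can be sketched in plain user-space C. This is only an illustration, not the patch itself: the DEMO_PFN_* constants and count_migrated() are hypothetical stand-ins for the kernel's MIGRATE_PFN_VALID / MIGRATE_PFN_MIGRATE flags and for whatever loop the patch actually adds, and the bit positions are chosen just for the demo.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flags named in the message;
 * the bit positions here are assumptions for illustration only. */
#define DEMO_PFN_VALID   (1UL << 0)
#define DEMO_PFN_MIGRATE (1UL << 1)

/* Count the entries that have BOTH flags set, i.e. the pages that
 * would be treated as successfully migrated in this sketch. */
static size_t count_migrated(const unsigned long *entries, size_t npages)
{
    const unsigned long both = DEMO_PFN_VALID | DEMO_PFN_MIGRATE;
    size_t i, migrated = 0;

    for (i = 0; i < npages; i++) {
        if ((entries[i] & both) == both)
            migrated++;
    }
    return migrated;
}
```

The point of the sketch is that the count can be smaller than the number of collected pages: an entry with only one of the two flags set is not counted.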

Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-20 Thread Felix Kuehling
On 2021-10-20 5:50 p.m., Felix Kuehling wrote: On 2021-10-20 12:35 p.m., Kent Russell wrote: Currently dmesg doesn't warn when the number of bad pages approaches the threshold for page retirement. WARN when the number of bad pages is at 90% or greater for easier checks and planning, instead of

Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-20 Thread Luben Tuikov
On 2021-10-20 17:50, Felix Kuehling wrote: > On 2021-10-20 12:35 p.m., Kent Russell wrote: >> Currently dmesg doesn't warn when the number of bad pages approaches the >> threshold for page retirement. WARN when the number of bad pages >> is at 90% or greater for easier checks and planning, instead

Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

2021-10-20 Thread Luben Tuikov
On 2021-10-20 17:54, Felix Kuehling wrote: > On 2021-10-20 12:35 p.m., Kent Russell wrote: >> If the bad_page_threshold kernel parameter is set to -2, >> continue to post the GPU. Print a warning to dmesg that this action has >> been done, and that page retirement will obviously not work for said

Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

2021-10-20 Thread Felix Kuehling
On 2021-10-20 12:35 p.m., Kent Russell wrote: If the bad_page_threshold kernel parameter is set to -2, continue to post the GPU. Print a warning to dmesg that this action has been done, and that page retirement will obviously not work for said GPU I'd squash patch 2 and 3. The squashed patch

Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-20 Thread Felix Kuehling
On 2021-10-20 12:35 p.m., Kent Russell wrote: Currently dmesg doesn't warn when the number of bad pages approaches the threshold for page retirement. WARN when the number of bad pages is at 90% or greater for easier checks and planning, instead of waiting until the GPU is full of bad pages Cc:

Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-20 Thread Luben Tuikov
On 2021-10-20 12:35, Kent Russell wrote: Currently dmesg doesn't warn when the number of bad pages approaches the "Currently" is redundant in this sentence as it is already in present simple tense. threshold for page retirement. WARN

Re: Lockdep spalt on killing a processes

2021-10-20 Thread Andrey Grodzovsky
On 2021-10-04 4:14 a.m., Christian König wrote: The problem is a bit different. The callback is on the dependent fence, while we need to signal the scheduler fence. Daniel is right that this needs an irq_work struct to handle this properly. Christian. So we had some discussions with

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-20 Thread Joao Martins
On 10/20/21 18:12, Dan Williams wrote: > On Wed, Oct 20, 2021 at 10:09 AM Joao Martins > wrote: >> On 10/19/21 20:21, Dan Williams wrote: >>> On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe wrote: On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: > On 10/19/21 00:06, Jason

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-20 Thread Dan Williams
On Wed, Oct 20, 2021 at 10:09 AM Joao Martins wrote: > > On 10/19/21 20:21, Dan Williams wrote: > > On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe wrote: > >> > >> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: > >>> On 10/19/21 00:06, Jason Gunthorpe wrote: > On Mon, Oct

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-20 Thread Joao Martins
On 10/19/21 20:21, Dan Williams wrote: > On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe wrote: >> >> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote: >>> On 10/19/21 00:06, Jason Gunthorpe wrote: On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote: >>

[PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

2021-10-20 Thread Kent Russell
If the bad_page_threshold kernel parameter is set to -2, continue to post the GPU. Print a warning to dmesg that this action has been done, and that page retirement will obviously not work for said GPU Cc: Luben Tuikov Cc: Mukul Joshi Signed-off-by: Kent Russell ---

[PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-20 Thread Kent Russell
Currently dmesg doesn't warn when the number of bad pages approaches the threshold for page retirement. WARN when the number of bad pages is at 90% or greater for easier checks and planning, instead of waiting until the GPU is full of bad pages Cc: Luben Tuikov Cc: Mukul Joshi Signed-off-by:
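As a rough illustration of the 90% check described above, here is a minimal user-space sketch. The helper name and its parameters are hypothetical and do not come from the patch; the real driver code may compute the ratio differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true once bad_pages has reached
 * 90% (or more) of threshold. A threshold of 0 never warns here,
 * which is an assumption made for this sketch. */
static bool bad_pages_near_threshold(uint32_t bad_pages, uint32_t threshold)
{
    if (threshold == 0)
        return false;

    /* Integer form of bad_pages / threshold >= 0.9, widened to
     * 64 bits so the multiplications cannot overflow. */
    return (uint64_t)bad_pages * 10 >= (uint64_t)threshold * 9;
}
```

Using integer arithmetic avoids floating point, which kernel code generally stays away from.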

[PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-20 Thread Kent Russell
When a GPU hits the bad_page_threshold, it will not be initialized by the amdgpu driver. This means that the table cannot be cleared, nor can information gathering be performed (getting serial number, BDF, etc). Add an override by using amdgpu_bad_page_threshold = -2 which will still initialize
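The override described above might be pictured as follows. This is a hedged sketch only: demo_should_init_gpu(), DEMO_THRESHOLD_IGNORE and the separate limit argument are hypothetical names introduced for illustration, and the sketch models nothing beyond what the message states (a GPU that hits the threshold is not initialized unless the parameter is -2).

```c
#include <stdbool.h>

/* Value named in the message: -2 means "initialize anyway". */
#define DEMO_THRESHOLD_IGNORE (-2)

/* Hypothetical decision helper.
 * threshold_param: the module parameter value as given by the user.
 * bad_pages:       number of bad pages recorded for the GPU.
 * limit:           the effective bad-page limit in use (how the driver
 *                  derives it is outside the scope of this sketch). */
static bool demo_should_init_gpu(int threshold_param,
                                 unsigned int bad_pages,
                                 unsigned int limit)
{
    if (threshold_param == DEMO_THRESHOLD_IGNORE)
        return true;            /* override: still initialize the GPU */

    return bad_pages < limit;   /* hitting the limit blocks initialization */
}
```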

Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold

2021-10-20 Thread Christian König
As it stands, we have at least two customers who are focused on having the threshold automatically remove the GPUs from use, to ensure data integrity. They just want warnings to know that it's getting bad (my 90% threshold patch), so that they can plan for HW replacement accordingly. We

RE: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold

2021-10-20 Thread Russell, Kent
[AMD Official Use Only] I can see both sides of the argument. Having a configurable threshold means that you can determine what sort of "HW reliability" that you want. The default value is likely not going to get hit by the average user. And users that DO hit that threshold can determine if

Re: [PATCH] drm/amd/amdgpu: move dpcs headers to dpcs directory

2021-10-20 Thread Harry Wentland
On 2021-10-20 09:54, Tom St Denis wrote: > Move dpcs headers from asic_reg/dcn to asic_reg/dpcs. > > Update various .c files to include new path. > > Signed-off-by: Tom St Denis Acked-by: Harry Wentland Harry > --- > drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c | 4 ++-- >

Re: [PATCH] drm/amdgpu/display: add yellow carp B0 with rest of driver

2021-10-20 Thread Kazlauskas, Nicholas
On 2021-10-20 9:53 a.m., Alex Deucher wrote: Fix revision id. Fixes: 626cbb641f1052 ("drm/amdgpu: support B0 external revision id for yellow carp") Signed-off-by: Alex Deucher Reviewed-by: Nicholas Kazlauskas Regards, Nicholas Kazlauskas ---

Re: [PATCH] drm/amdgpu/display: remove unused variable in dcn31_init_hw()

2021-10-20 Thread Harry Wentland
On 2021-10-19 16:51, Alex Deucher wrote: > Unused. Remove it. > > Fixes: d1065882691179 ("Revert "drm/amd/display: Add helper for blanking all > dp displays"") > Signed-off-by: Alex Deucher Reviewed-by: Harry Wentland Harry > --- > drivers/gpu/drm/amd/display/dc/dcn31/dcn31_hwseq.c | 1 -

Re: [PATCH] drm/amdgpu/display: add yellow carp B0 with rest of driver

2021-10-20 Thread Harry Wentland
On 2021-10-20 09:53, Alex Deucher wrote: > Fix revision id. > > Fixes: 626cbb641f1052 ("drm/amdgpu: support B0 external revision id for > yellow carp") > Signed-off-by: Alex Deucher Acked-by: Harry Wentland Harry > --- > drivers/gpu/drm/amd/display/include/dal_asic_id.h | 2 +- > 1 file

[PATCH] drm/amd/amdgpu: move dpcs headers to dpcs directory

2021-10-20 Thread Tom St Denis
Move dpcs headers from asic_reg/dcn to asic_reg/dpcs. Update various .c files to include new path. Signed-off-by: Tom St Denis --- drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c | 4 ++-- drivers/gpu/drm/amd/display/dc/dcn30/dcn30_resource.c | 4 ++--

[PATCH] drm/amdgpu/display: add yellow carp B0 with rest of driver

2021-10-20 Thread Alex Deucher
Fix revision id. Fixes: 626cbb641f1052 ("drm/amdgpu: support B0 external revision id for yellow carp") Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/display/include/dal_asic_id.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git

Re: [PATCH 1/1] drm/amdgpu: fix BO leak after successful move test

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 1:51 PM, Christian König wrote: On 20.10.21 at 13:50, Christian König wrote: On 13.10.21 at 17:09, Nirmoy Das wrote: GTT BO cleanup code is within the test for loop and we would skip cleaning up GTT BO on success. Reported-by: zhang Signed-off-by: Nirmoy Das ---

Re: [PATCH 1/1] drm/amdgpu: fix BO leak after successful move test

2021-10-20 Thread Christian König
On 20.10.21 at 13:50, Christian König wrote: On 13.10.21 at 17:09, Nirmoy Das wrote: GTT BO cleanup code is within the test for loop and we would skip cleaning up GTT BO on success. Reported-by: zhang Signed-off-by: Nirmoy Das --- drivers/gpu/drm/amd/amdgpu/amdgpu_test.c | 25

Re: [PATCH 1/1] drm/amdgpu: fix BO leak after successful move test

2021-10-20 Thread Christian König
On 13.10.21 at 17:09, Nirmoy Das wrote: GTT BO cleanup code is within the test for loop and we would skip cleaning up GTT BO on success. Reported-by: zhang Signed-off-by: Nirmoy Das --- drivers/gpu/drm/amd/amdgpu/amdgpu_test.c | 25 1 file changed, 12

Re: [PATCH 1/3] drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 12:49 PM, Christian König wrote: On 20.10.21 at 11:19, Lazar, Lijo wrote: On 10/20/2021 2:18 PM, Das, Nirmoy wrote: On 10/20/2021 8:49 AM, Christian König wrote: On 19.10.21 at 20:14, Nirmoy Das wrote: Do not allow exported amdgpu_gtt_mgr_*() to accept any

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 12:51 PM, Christian König wrote: On 20.10.21 at 12:21, Das, Nirmoy wrote: On 10/20/2021 12:15 PM, Lazar, Lijo wrote: On 10/20/2021 3:42 PM, Das, Nirmoy wrote: On 10/20/2021 12:03 PM, Lazar, Lijo wrote: On 10/20/2021 3:23 PM, Das, Nirmoy wrote: On 10/20/2021 11:11

Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold

2021-10-20 Thread Christian König
On 19.10.21 at 19:50, Kent Russell wrote: When a GPU hits the bad_page_threshold, it will not be initialized by the amdgpu driver. This means that the table cannot be cleared, nor can information gathering be performed (getting serial number, BDF, etc). Add an override called

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Christian König
On 20.10.21 at 12:21, Das, Nirmoy wrote: On 10/20/2021 12:15 PM, Lazar, Lijo wrote: On 10/20/2021 3:42 PM, Das, Nirmoy wrote: On 10/20/2021 12:03 PM, Lazar, Lijo wrote: On 10/20/2021 3:23 PM, Das, Nirmoy wrote: On 10/20/2021 11:11 AM, Lazar, Lijo wrote: On 10/19/2021 11:44 PM,

Re: [PATCH 1/3] drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr

2021-10-20 Thread Christian König
On 20.10.21 at 11:19, Lazar, Lijo wrote: On 10/20/2021 2:18 PM, Das, Nirmoy wrote: On 10/20/2021 8:49 AM, Christian König wrote: On 19.10.21 at 20:14, Nirmoy Das wrote: Do not allow exported amdgpu_gtt_mgr_*() to accept any ttm_resource_manager pointer. Also there is no need to force

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 12:15 PM, Lazar, Lijo wrote: On 10/20/2021 3:42 PM, Das, Nirmoy wrote: On 10/20/2021 12:03 PM, Lazar, Lijo wrote: On 10/20/2021 3:23 PM, Das, Nirmoy wrote: On 10/20/2021 11:11 AM, Lazar, Lijo wrote: On 10/19/2021 11:44 PM, Nirmoy Das wrote: Get rid of pin/unpin of

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Lazar, Lijo
On 10/20/2021 3:42 PM, Das, Nirmoy wrote: On 10/20/2021 12:03 PM, Lazar, Lijo wrote: On 10/20/2021 3:23 PM, Das, Nirmoy wrote: On 10/20/2021 11:11 AM, Lazar, Lijo wrote: On 10/19/2021 11:44 PM, Nirmoy Das wrote: Get rid of pin/unpin of gart BO at resume/suspend and instead pin only

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 12:03 PM, Lazar, Lijo wrote: On 10/20/2021 3:23 PM, Das, Nirmoy wrote: On 10/20/2021 11:11 AM, Lazar, Lijo wrote: On 10/19/2021 11:44 PM, Nirmoy Das wrote: Get rid of pin/unpin of gart BO at resume/suspend and instead pin only once and try to recover gart content at

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Lazar, Lijo
On 10/20/2021 3:23 PM, Das, Nirmoy wrote: On 10/20/2021 11:11 AM, Lazar, Lijo wrote: On 10/19/2021 11:44 PM, Nirmoy Das wrote: Get rid of pin/unpin of gart BO at resume/suspend and instead pin only once and try to recover gart content at resume time. This is much more stable in case

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 11:11 AM, Lazar, Lijo wrote: On 10/19/2021 11:44 PM, Nirmoy Das wrote: Get rid of pin/unpin of gart BO at resume/suspend and instead pin only once and try to recover gart content at resume time. This is much more stable in case there is OOM situation at 2nd call to

Re: [PATCH 1/3] drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr

2021-10-20 Thread Lazar, Lijo
On 10/20/2021 2:18 PM, Das, Nirmoy wrote: On 10/20/2021 8:49 AM, Christian König wrote: On 19.10.21 at 20:14, Nirmoy Das wrote: Do not allow exported amdgpu_gtt_mgr_*() to accept any ttm_resource_manager pointer. Also there is no need to force other module to call a ttm function just to

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Lazar, Lijo
On 10/19/2021 11:44 PM, Nirmoy Das wrote: Get rid of pin/unpin of gart BO at resume/suspend and instead pin only once and try to recover gart content at resume time. This is much more stable in case there is OOM situation at 2nd call to amdgpu_device_evict_resources() while evicting GART

Re: [PATCH 1/1] drm/amdgpu: fix BO leak after successful move test

2021-10-20 Thread Das, Nirmoy
ping. On 10/13/2021 5:09 PM, Nirmoy Das wrote: GTT BO cleanup code is within the test for loop and we would skip cleaning up GTT BO on success. Reported-by: zhang Signed-off-by: Nirmoy Das --- drivers/gpu/drm/amd/amdgpu/amdgpu_test.c | 25 1 file changed, 12

Re: [PATCH 1/3] drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 8:49 AM, Christian König wrote: On 19.10.21 at 20:14, Nirmoy Das wrote: Do not allow exported amdgpu_gtt_mgr_*() to accept any ttm_resource_manager pointer. Also there is no need to force other module to call a ttm function just to eventually call gtt_mgr functions. That's a

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Das, Nirmoy
On 10/20/2021 8:52 AM, Christian König wrote: On 19.10.21 at 20:14, Nirmoy Das wrote: Get rid of pin/unpin of gart BO at resume/suspend and instead pin only once and try to recover gart content at resume time. This is much more stable in case there is OOM situation at 2nd call to

Re: [PATCH 02/13] drm: Move and rename i915 buddy source

2021-10-20 Thread Jani Nikula
On Wed, 20 Oct 2021, Arunpravin wrote: > - Move i915_buddy.c to drm root folder > - Rename "i915" string with "drm" string wherever applicable > - Rename "I915" string with "DRM" string wherever applicable > - Fix header file dependencies > - Fix alignment issues > > Signed-off-by: Arunpravin >

Re: [PATCH 03/13] drm: add Makefile support for drm buddy

2021-10-20 Thread Thomas Zimmermann
Hi On 20.10.21 at 00:53, Arunpravin wrote: - Include drm buddy to DRM root Makefile - Add drm buddy init and exit function calls to drm core Is there a hard requirement to have this code in the core? IMHO there's already too much code in the DRM core that should rather go into helpers.

Re: [PATCH 00/13] drm: Enable buddy allocator support

2021-10-20 Thread Christian König
Well please keep in mind that each patch on its own should not break anything. Especially patches #1, #2, #3 and #10 look like they need to be squashed together to cleanly move the i915 code into a common place. Christian. On 20.10.21 at 00:53, Arunpravin wrote: This series of patches

Re: [PATCH v2 3/3] drm/amdgpu: recover gart table at resume

2021-10-20 Thread Christian König
On 19.10.21 at 20:14, Nirmoy Das wrote: Get rid of pin/unpin of gart BO at resume/suspend and instead pin only once and try to recover gart content at resume time. This is much more stable in case there is OOM situation at 2nd call to amdgpu_device_evict_resources() while evicting GART table.

Re: [PATCH 1/3] drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr

2021-10-20 Thread Christian König
On 19.10.21 at 20:14, Nirmoy Das wrote: Do not allow exported amdgpu_gtt_mgr_*() to accept any ttm_resource_manager pointer. Also there is no need to force other module to call a ttm function just to eventually call gtt_mgr functions. That's a rather bad idea I think. The GTT and VRAM