RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Liu, Monk
[AMD Official Use Only - Internal Distribution Only]

Christian

What is feasible and practical now is:
1) we implement the advanced TDR mode upstream first (so we can copy the
same scheme into our LTS kernel) -- if you want, we can avoid changing the
drm/scheduler code, but that approach was already rejected by you as too complicated
2) then we retire the mirror-list concept and rework drm/scheduler around a
KFIFO
3) remove the guilty/karma handling from the scheduler

So I basically agree with you on the spirit of the above changes: hide the
AMD-internal concepts and tricks in the vendor's driver part and keep the
scheduler simple and scalable.
But that definitely needs a longer design discussion, so why don't we focus
on our current problem for now?
As long as the new change doesn't regress anything, it is still a good change
on top of the current TDR implementation.

I would propose we only change the AMD-side code this time. Jack's first
version of the patch didn't touch the scheduler, but you stated it was too
complicated and rejected it.

So the remaining option is what Jack did in v2, which needs to introduce a
new scheduler API, drm_sched_resubmit_jobs2().

What do you think?

Thanks 

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: Koenig, Christian  
Sent: Monday, March 8, 2021 3:53 PM
To: Liu, Monk ; Zhang, Jack (Jian) ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; 
Deng, Emily 
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode



Am 08.03.21 um 05:06 schrieb Liu, Monk:
> [AMD Official Use Only - Internal Distribution Only]
>
>>> well first of all please completely drop the affinity group stuff from this 
>>> patch. We should concentrate on one feature at a time.
> We need it to expedite the process; we can introduce this change in 
> another patch.
>
>
>>> Then the implementation is way too complicated. All you need to do is insert 
>>> a dma_fence_wait after re-scheduling each job after a reset.
> No, that's not true. During drm_sched_resubmit_jobs() all jobs in the
> mirror list are put into the hw ring, but we can only allow the first job
> into the ring in order to catch the real guilty one (otherwise a later job
> in the ring could also have a bug and affect our judgement). So we need to
> implement a new drm_sched_resubmit_jobs2(), like this:

Something like this. But since waiting for the guilty job is AMD specific we 
should rather rework the stuff from the beginning.

What I have in mind is the following:
1. Add a reference from the scheduler fence back to the job which is cleared 
only when the scheduler fence finishes.
2. Completely drop the ring_mirror_list and replace it with a kfifo of pointers 
to the active scheduler fences.
3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro 
which allows drivers to iterate over all the active jobs and resubmit/wait/mark 
them as guilty etc etc..
4. Remove the guilty/karma handling from the scheduler. This is something AMD 
specific and shouldn't leak into common code.

Regards,
Christian.

>
> drm_sched_resubmit_jobs2()
> 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
> 500 {
> 501         struct drm_sched_job *s_job, *tmp;
> 502         uint64_t guilty_context;
> 503         bool found_guilty = false;
> 504         struct dma_fence *fence;
> + 505       int i = 0;
> 506
> 507         list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
> 508                 struct drm_sched_fence *s_fence = s_job->s_fence;
> 509
> + 510               if (i >= max)
> + 511                       break;
> + 512
> 513                 if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
> 514                         found_guilty = true;
> 515                         guilty_context = s_job->s_fence->scheduled.context;
> 516                 }
> 517
> 518                 if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
> 519                         dma_fence_set_error(&s_fence->finished, -ECANCELED);
> 520
> 521                 dma_fence_put(s_job->s_fence->parent);
> 522                 fence = sched->ops->run_job(s_job);
> + 523               i++;
> 524
> 525                 if (IS_ERR_OR_NULL(fence)) {
> 526                         if (IS_ERR(fence))
> 527                                 dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> 528
> 529                         s_job->s_fence->parent = NULL;
> 530                 } else {
> 531                         s_job->s_fence->parent = fence;
> 532                 }
> 533
> 534
> 535         }
> 536 }
> 537 EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
>538
>
>
>
> Thanks
>
> --
> Monk Liu | Cloud-GPU Core team
> --
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Sunday, March 7, 2021 3:03 AM
> To: Zhang, Jack (Jian) ; 
> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
> ; Liu, 

RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,

Since this change is critical to our project, we would be grateful if we
could get your review.

Is there anything that is not clear enough that I could help explain?

Again, thanks for your huge help with our problem.

Thanks,
Jack

-Original Message-
From: Koenig, Christian 
Sent: Monday, March 8, 2021 8:45 PM
To: Zhang, Jack (Jian) ; Liu, Monk ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; 
Deng, Emily 
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

Hi Jack,

yes that comes pretty close. I'm going over the patch right now.

Some things still look a bit complicated to me, but I need to wrap my head 
around how and why we are doing it this way once more.

Christian.

Am 08.03.21 um 13:43 schrieb Zhang, Jack (Jian):
> [AMD Public Use]
>
> Hi, Christian,
>
> I made some changes in the V3 patch to insert a dma_fence_wait for the first 
> job after resubmitting the jobs.
> It seems simpler than the V2 patch. Is this what you first had in mind?
>
> Thanks,
> Jack
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Monday, March 8, 2021 3:53 PM
> To: Liu, Monk ; Zhang, Jack (Jian)
> ; amd-gfx@lists.freedesktop.org; Grodzovsky,
> Andrey ; Deng, Emily 
> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>
>
>
> Am 08.03.21 um 05:06 schrieb Liu, Monk:
>> [AMD Official Use Only - Internal Distribution Only]
>>
 well first of all please completely drop the affinity group stuff from 
 this patch. We should concentrate on one feature at a time.
>> We need it to expedite the process; we can introduce this change in
>> another patch.
>>
>>
 Then the implementation is way too complicated. All you need to do is insert 
 a dma_fence_wait after re-scheduling each job after a reset.
>> No, that's not true. During drm_sched_resubmit_jobs() all jobs in the
>> mirror list are put into the hw ring, but we can only allow the first job
>> into the ring in order to catch the real guilty one (otherwise a later job
>> in the ring could also have a bug and affect our judgement). So we need to
>> implement a new drm_sched_resubmit_jobs2(), like this:
> Something like this. But since waiting for the guilty job is AMD specific we 
> should rather rework the stuff from the beginning.
>
> What I have in mind is the following:
> 1. Add a reference from the scheduler fence back to the job which is cleared 
> only when the scheduler fence finishes.
> 2. Completely drop the ring_mirror_list and replace it with a kfifo of 
> pointers to the active scheduler fences.
> 3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro 
> which allows drivers to iterate over all the active jobs and 
> resubmit/wait/mark them as guilty etc etc..
> 4. Remove the guilty/karma handling from the scheduler. This is something AMD 
> specific and shouldn't leak into common code.
>
> Regards,
> Christian.
>
>> drm_sched_resubmit_jobs2()
>> 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
>> 500 {
>> 501         struct drm_sched_job *s_job, *tmp;
>> 502         uint64_t guilty_context;
>> 503         bool found_guilty = false;
>> 504         struct dma_fence *fence;
>> + 505       int i = 0;
>> 506
>> 507         list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>> 508                 struct drm_sched_fence *s_fence = s_job->s_fence;
>> 509
>> + 510               if (i >= max)
>> + 511                       break;
>> + 512
>> 513                 if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
>> 514                         found_guilty = true;
>> 515                         guilty_context = s_job->s_fence->scheduled.context;
>> 516                 }
>> 517
>> 518                 if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
>> 519                         dma_fence_set_error(&s_fence->finished, -ECANCELED);
>> 520
>> 521                 dma_fence_put(s_job->s_fence->parent);
>> 522                 fence = sched->ops->run_job(s_job);
>> + 523               i++;
>> 524
>> 525                 if (IS_ERR_OR_NULL(fence)) {
>> 526                         if (IS_ERR(fence))
>> 527                                 dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>> 528
>> 529                         s_job->s_fence->parent = NULL;
>> 530                 } else {
>> 531                         s_job->s_fence->parent = fence;
>> 532                 }
>> 533
>> 534
>> 535         }
>> 536 }
>> 537 EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
>> 538
>>
>>
>>
>> Thanks
>>
>> --
>> Monk Liu | Cloud-GPU Core team
>> --
>>
>> -Original Message-
>> From: Koenig, Christian 
>> Sent: Sunday, March 7, 2021 3:03 AM
>> To: Zhang, Jack (Jian) ;
>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
>> ; Liu, Monk ; Deng,
>> Emily 
>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement 

RE: [PATCH 4/7] drm/amdgpu: track what pmops flow we are in

2021-03-08 Thread Lazar, Lijo
[AMD Public Use]

This seems to be a duplicate of the dev_pm_info states. Can't we reuse that?

Thanks,
Lijo

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Tuesday, March 9, 2021 9:40 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH 4/7] drm/amdgpu: track what pmops flow we are in

We reuse the same suspend and resume functions for all of the pmops states, so 
flag what state we are in so that we can alter behavior deeper in the driver 
depending on the current flow.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   | 20 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   | 58 +++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  3 +-
 3 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d47626ce9bc5..4ddc5cc983c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -347,6 +347,24 @@ int amdgpu_device_ip_block_add(struct amdgpu_device *adev, 
 bool amdgpu_get_bios(struct amdgpu_device *adev);  bool 
amdgpu_read_bios(struct amdgpu_device *adev);
 
+/*
+ * PM Ops
+ */
+enum amdgpu_pmops_state {
+   AMDGPU_PMOPS_NONE = 0,
+   AMDGPU_PMOPS_PREPARE,
+   AMDGPU_PMOPS_COMPLETE,
+   AMDGPU_PMOPS_SUSPEND,
+   AMDGPU_PMOPS_RESUME,
+   AMDGPU_PMOPS_FREEZE,
+   AMDGPU_PMOPS_THAW,
+   AMDGPU_PMOPS_POWEROFF,
+   AMDGPU_PMOPS_RESTORE,
+   AMDGPU_PMOPS_RUNTIME_SUSPEND,
+   AMDGPU_PMOPS_RUNTIME_RESUME,
+   AMDGPU_PMOPS_RUNTIME_IDLE,
+};
+
 /*
  * Clocks
  */
@@ -1019,8 +1037,8 @@ struct amdgpu_device {
u8  reset_magic[AMDGPU_RESET_MAGIC_NUM];
 
/* s3/s4 mask */
+   enum amdgpu_pmops_state pmops_state;
boolin_suspend;
-   boolin_hibernate;
 
/*
 * The combination flag in_poweroff_reboot_com used to identify the 
poweroff diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3e6bb7d79652..0312c52bd39d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1297,34 +1297,54 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)  static int 
amdgpu_pmops_prepare(struct device *dev)  {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
+   adev->pmops_state = AMDGPU_PMOPS_PREPARE;
/* Return a positive number here so
 * DPM_FLAG_SMART_SUSPEND works properly
 */
if (amdgpu_device_supports_boco(drm_dev))
-   return pm_runtime_suspended(dev) &&
+   r = pm_runtime_suspended(dev) &&
pm_suspend_via_firmware();
-
-   return 0;
+   else
+   r = 0;
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
+   return r;
 }
 
 static void amdgpu_pmops_complete(struct device *dev)  {
+   struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+
+   adev->pmops_state = AMDGPU_PMOPS_COMPLETE;
/* nothing to do */
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
 }
 
 static int amdgpu_pmops_suspend(struct device *dev)  {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
-   return amdgpu_device_suspend(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_SUSPEND;
+   r = amdgpu_device_suspend(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
+   return r;
 }
 
 static int amdgpu_pmops_resume(struct device *dev)  {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
-   return amdgpu_device_resume(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_RESUME;
+   r = amdgpu_device_resume(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
+   return r;
 }
 
 static int amdgpu_pmops_freeze(struct device *dev) @@ -1333,9 +1353,9 @@ 
static int amdgpu_pmops_freeze(struct device *dev)
struct amdgpu_device *adev = drm_to_adev(drm_dev);
int r;
 
-   adev->in_hibernate = true;
+   adev->pmops_state = AMDGPU_PMOPS_FREEZE;
r = amdgpu_device_suspend(drm_dev, true);
-   adev->in_hibernate = false;
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
if (r)
return r;
return amdgpu_asic_reset(adev);
@@ -1344,8 +1364,13 @@ static int amdgpu_pmops_freeze(struct device *dev)  
static int amdgpu_pmops_thaw(struct device *dev)  {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
-   return amdgpu_device_resume(drm_dev, true);
+   adev->pmops_state = 

Re: [PATCH] drm/amdgpu: fix the hibernation suspend with s0ix

2021-03-08 Thread Huang Rui
On Tue, Mar 09, 2021 at 12:45:44PM +0800, Liang, Prike wrote:
> 
> 
> > -Original Message-
> > From: Alex Deucher 
> > Sent: Tuesday, March 9, 2021 12:07 PM
> > To: Liang, Prike 
> > Cc: amd-gfx list ; Deucher, Alexander
> > ; Huang, Ray 
> > Subject: Re: [PATCH] drm/amdgpu: fix the hibernation suspend with s0ix
> > 
> > On Mon, Mar 8, 2021 at 10:52 PM Prike Liang  wrote:
> > >
> > > During system hibernation, suspend still needs to un-gate GFX CG/PG first
> > > to handle the HW status check before HW resources are destroyed.
> > >
> > > Signed-off-by: Prike Liang 
> > 
> > This is fine for stable, but we should work on cleaning this up.  I have a 
> > patch
> > set to improve this, but it's more invasive.  We really need to sort out 
> > what
> > specific parts of
> > amdgpu_device_ip_suspend_phase2() are problematic and special case
> > them.  We shouldn't just be skipping that function.
> [Prike] Yeah, at this stage we're just trying to make s0ix functional and 
> stable. The AMDGPU work mode is aligned with the Windows KMD s0ix sequence and 
> only suspends the DCE and IH for s0i3 entry. We will try to figure out each GNB 
> IP's idle-off dependency and then improve the AMDGPU suspend/resume sequence 
> for system-wide Sx entry/exit.
> 

Maybe we need a comment before amdgpu_device_ip_suspend_phase2() to mark it
as a TODO. For the moment, it's OK for me as well.
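The suggested marker might read roughly like this at the call site — the wording is only a guess, not part of any posted patch:

```c
/* TODO: don't skip amdgpu_device_ip_suspend_phase2() wholesale for S0ix.
 * Work out which IP-block suspend steps actually break S0i3 entry and
 * special-case only those (see discussion on amd-gfx). */
```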

Acked-by: Huang Rui 

> > Acked-by: Alex Deucher 
> > 
> > Alex
> > 
> > 
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index e247c3a..7079bfc 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -2683,7 +2683,7 @@ static int
> > > amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)  {
> > > int i, r;
> > >
> > > -   if (adev->in_poweroff_reboot_com ||
> > > +   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
> > > !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev))
> > {
> > > amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> > > amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> > > @@ -3750,7 +3750,7 @@ int amdgpu_device_suspend(struct drm_device
> > > *dev, bool fbcon)
> > >
> > > amdgpu_fence_driver_suspend(adev);
> > >
> > > -   if (adev->in_poweroff_reboot_com ||
> > > +   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
> > > !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev))
> > > r = amdgpu_device_ip_suspend_phase2(adev);
> > > else
> > > --
> > > 2.7.4
> > >
> > > ___
> > > amd-gfx mailing list
> > > amd-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: remove ECO_BITS programming on gmc9

2021-03-08 Thread Xu, Feifei
[AMD Public Use]

Thanks, Anna. The result is good on the SRIOV guest driver as well. Will push with:

Reviewed-by: Hawking Zhang 
Tested-by: Anna Jin <anna@amd.com>

Thanks,
Feifei

-Original Message-
From: Zhang, Hawking  
Sent: Friday, March 5, 2021 8:51 PM
To: Xu, Feifei ; amd-gfx@lists.freedesktop.org
Cc: Lin, Amber ; Xu, Feifei ; Jin, Anna 

Subject: RE: [PATCH] drm/amdgpu: remove ECO_BITS programming on gmc9

[AMD Public Use]

Reviewed-by: Hawking Zhang 

Per discussion, please work with Anna to identify the potential risk in SRIOV 
guest driver (VEGA10) before pushing the patch. Thanks.

Regards,
Hawking
-Original Message-
From: Feifei Xu  
Sent: Friday, March 5, 2021 17:10
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Lin, Amber ; Xu, 
Feifei 
Subject: [PATCH] drm/amdgpu: remove ECO_BITS programming on gmc9

Remove the ECO_BITS programming in the driver on gfxhub1.0, mmhub1_x and mmhub_9.4.

Signed-off-by: Feifei Xu 
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 1 -  
drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c  | 1 -  
drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c  | 1 -  
drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c  | 2 --
 4 files changed, 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index 0ab498d93e48..0cf993797df8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -186,7 +186,6 @@ static void gfxhub_v1_0_init_tlb_regs(struct amdgpu_device 
*adev)
ENABLE_ADVANCED_DRIVER_MODEL, 1);
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL,
SYSTEM_APERTURE_UNMAPPED_ACCESS, 0);
-   tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL, ECO_BITS, 0);
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL,
MTYPE, MTYPE_UC);/* XXX for emulation. */
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL, ATC_EN, 1); diff --git 
a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
index 0145d4d201cf..37b985317012 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
@@ -164,7 +164,6 @@ static void mmhub_v1_0_init_tlb_regs(struct amdgpu_device 
*adev)
ENABLE_ADVANCED_DRIVER_MODEL, 1);
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL,
SYSTEM_APERTURE_UNMAPPED_ACCESS, 0);
-   tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL, ECO_BITS, 0);
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL,
MTYPE, MTYPE_UC);/* XXX for emulation. */
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL, ATC_EN, 1); diff --git 
a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
index 816ff110a074..9099162553a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
@@ -189,7 +189,6 @@ static void mmhub_v1_7_init_tlb_regs(struct amdgpu_device 
*adev)
ENABLE_ADVANCED_DRIVER_MODEL, 1);
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL,
SYSTEM_APERTURE_UNMAPPED_ACCESS, 0);
-   tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL, ECO_BITS, 0);
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL,
MTYPE, MTYPE_UC);/* XXX for emulation. */
tmp = REG_SET_FIELD(tmp, MC_VM_MX_L1_TLB_CNTL, ATC_EN, 1); diff --git 
a/drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c
index 65fb88d391d3..d68f3cd2d40d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c
@@ -219,8 +219,6 @@ static void mmhub_v9_4_init_tlb_regs(struct amdgpu_device 
*adev, int hubid)
ENABLE_ADVANCED_DRIVER_MODEL, 1);
tmp = REG_SET_FIELD(tmp, VMSHAREDVC0_MC_VM_MX_L1_TLB_CNTL,
SYSTEM_APERTURE_UNMAPPED_ACCESS, 0);
-   tmp = REG_SET_FIELD(tmp, VMSHAREDVC0_MC_VM_MX_L1_TLB_CNTL,
-   ECO_BITS, 0);
tmp = REG_SET_FIELD(tmp, VMSHAREDVC0_MC_VM_MX_L1_TLB_CNTL,
MTYPE, MTYPE_UC);/* XXX for emulation. */
tmp = REG_SET_FIELD(tmp, VMSHAREDVC0_MC_VM_MX_L1_TLB_CNTL,
--
2.25.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: fix the hibernation suspend with s0ix

2021-03-08 Thread Liang, Prike



> -Original Message-
> From: Alex Deucher 
> Sent: Tuesday, March 9, 2021 12:07 PM
> To: Liang, Prike 
> Cc: amd-gfx list ; Deucher, Alexander
> ; Huang, Ray 
> Subject: Re: [PATCH] drm/amdgpu: fix the hibernation suspend with s0ix
> 
> On Mon, Mar 8, 2021 at 10:52 PM Prike Liang  wrote:
> >
> > > During system hibernation, suspend still needs to un-gate GFX CG/PG first
> > > to handle the HW status check before HW resources are destroyed.
> >
> > Signed-off-by: Prike Liang 
> 
> This is fine for stable, but we should work on cleaning this up.  I have a 
> patch
> set to improve this, but it's more invasive.  We really need to sort out what
> specific parts of
> amdgpu_device_ip_suspend_phase2() are problematic and special case
> them.  We shouldn't just be skipping that function.
[Prike] Yeah, at this stage we're just trying to make s0ix functional and 
stable. The AMDGPU work mode is aligned with the Windows KMD s0ix sequence and 
only suspends the DCE and IH for s0i3 entry. We will try to figure out each GNB 
IP's idle-off dependency and then improve the AMDGPU suspend/resume sequence 
for system-wide Sx entry/exit.

> Acked-by: Alex Deucher 
> 
> Alex
> 
> 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index e247c3a..7079bfc 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2683,7 +2683,7 @@ static int
> > amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)  {
> > int i, r;
> >
> > -   if (adev->in_poweroff_reboot_com ||
> > +   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
> > !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev))
> {
> > amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> > amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> > @@ -3750,7 +3750,7 @@ int amdgpu_device_suspend(struct drm_device
> > *dev, bool fbcon)
> >
> > amdgpu_fence_driver_suspend(adev);
> >
> > -   if (adev->in_poweroff_reboot_com ||
> > +   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
> > !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev))
> > r = amdgpu_device_ip_suspend_phase2(adev);
> > else
> > --
> > 2.7.4
> >
> > ___
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 4/7] drm/amdgpu: track what pmops flow we are in

2021-03-08 Thread Alex Deucher
We reuse the same suspend and resume functions for
all of the pmops states, so flag what state we are
in so that we can alter behavior deeper in the driver
depending on the current flow.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   | 20 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   | 58 +++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  3 +-
 3 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d47626ce9bc5..4ddc5cc983c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -347,6 +347,24 @@ int amdgpu_device_ip_block_add(struct amdgpu_device *adev,
 bool amdgpu_get_bios(struct amdgpu_device *adev);
 bool amdgpu_read_bios(struct amdgpu_device *adev);
 
+/*
+ * PM Ops
+ */
+enum amdgpu_pmops_state {
+   AMDGPU_PMOPS_NONE = 0,
+   AMDGPU_PMOPS_PREPARE,
+   AMDGPU_PMOPS_COMPLETE,
+   AMDGPU_PMOPS_SUSPEND,
+   AMDGPU_PMOPS_RESUME,
+   AMDGPU_PMOPS_FREEZE,
+   AMDGPU_PMOPS_THAW,
+   AMDGPU_PMOPS_POWEROFF,
+   AMDGPU_PMOPS_RESTORE,
+   AMDGPU_PMOPS_RUNTIME_SUSPEND,
+   AMDGPU_PMOPS_RUNTIME_RESUME,
+   AMDGPU_PMOPS_RUNTIME_IDLE,
+};
+
 /*
  * Clocks
  */
@@ -1019,8 +1037,8 @@ struct amdgpu_device {
u8  reset_magic[AMDGPU_RESET_MAGIC_NUM];
 
/* s3/s4 mask */
+   enum amdgpu_pmops_state pmops_state;
boolin_suspend;
-   boolin_hibernate;
 
/*
 * The combination flag in_poweroff_reboot_com used to identify the 
poweroff
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3e6bb7d79652..0312c52bd39d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1297,34 +1297,54 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
 static int amdgpu_pmops_prepare(struct device *dev)
 {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
+   adev->pmops_state = AMDGPU_PMOPS_PREPARE;
/* Return a positive number here so
 * DPM_FLAG_SMART_SUSPEND works properly
 */
if (amdgpu_device_supports_boco(drm_dev))
-   return pm_runtime_suspended(dev) &&
+   r = pm_runtime_suspended(dev) &&
pm_suspend_via_firmware();
-
-   return 0;
+   else
+   r = 0;
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
+   return r;
 }
 
 static void amdgpu_pmops_complete(struct device *dev)
 {
+   struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+
+   adev->pmops_state = AMDGPU_PMOPS_COMPLETE;
/* nothing to do */
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
 }
 
 static int amdgpu_pmops_suspend(struct device *dev)
 {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
-   return amdgpu_device_suspend(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_SUSPEND;
+   r = amdgpu_device_suspend(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
+   return r;
 }
 
 static int amdgpu_pmops_resume(struct device *dev)
 {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
-   return amdgpu_device_resume(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_RESUME;
+   r = amdgpu_device_resume(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
+   return r;
 }
 
 static int amdgpu_pmops_freeze(struct device *dev)
@@ -1333,9 +1353,9 @@ static int amdgpu_pmops_freeze(struct device *dev)
struct amdgpu_device *adev = drm_to_adev(drm_dev);
int r;
 
-   adev->in_hibernate = true;
+   adev->pmops_state = AMDGPU_PMOPS_FREEZE;
r = amdgpu_device_suspend(drm_dev, true);
-   adev->in_hibernate = false;
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
if (r)
return r;
return amdgpu_asic_reset(adev);
@@ -1344,8 +1364,13 @@ static int amdgpu_pmops_freeze(struct device *dev)
 static int amdgpu_pmops_thaw(struct device *dev)
 {
struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+   int r;
 
-   return amdgpu_device_resume(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_THAW;
+   r = amdgpu_device_resume(drm_dev, true);
+   adev->pmops_state = AMDGPU_PMOPS_NONE;
+   return r;
 }
 
 static int amdgpu_pmops_poweroff(struct device *dev)
@@ -1354,17 +1379,24 @@ static int amdgpu_pmops_poweroff(struct device *dev)
struct amdgpu_device *adev = drm_to_adev(drm_dev);
int r;
 
+   

[PATCH 3/7] drm/amdgpu: disentangle HG systems from vgaswitcheroo

2021-03-08 Thread Alex Deucher
There's no need to keep vgaswitcheroo around for HG
systems.  They don't use muxes and their power control
is handled via ACPI.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 34 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c|  9 ++---
 4 files changed, 34 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b5310b35721c..d47626ce9bc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1260,7 +1260,7 @@ void amdgpu_device_program_register_sequence(struct 
amdgpu_device *adev,
 const u32 *registers,
 const u32 array_size);
 
-bool amdgpu_device_supports_atpx(struct drm_device *dev);
+bool amdgpu_device_supports_px(struct drm_device *dev);
 bool amdgpu_device_supports_boco(struct drm_device *dev);
 bool amdgpu_device_supports_baco(struct drm_device *dev);
 bool amdgpu_device_is_peer_accessible(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6eb3b4d2c9b2..ac5f7837285b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -211,18 +211,18 @@ static DEVICE_ATTR(serial_number, S_IRUGO,
amdgpu_device_get_serial_number, NULL);
 
 /**
- * amdgpu_device_supports_atpx - Is the device a dGPU with HG/PX power control
+ * amdgpu_device_supports_px - Is the device a dGPU with ATPX power control
  *
  * @dev: drm_device pointer
  *
- * Returns true if the device is a dGPU with HG/PX power control,
+ * Returns true if the device is a dGPU with ATPX power control,
  * otherwise return false.
  */
-bool amdgpu_device_supports_atpx(struct drm_device *dev)
+bool amdgpu_device_supports_px(struct drm_device *dev)
 {
struct amdgpu_device *adev = drm_to_adev(dev);
 
-   if (adev->flags & AMD_IS_PX)
+   if ((adev->flags & AMD_IS_PX) && !amdgpu_is_atpx_hybrid())
return true;
return false;
 }
@@ -232,14 +232,15 @@ bool amdgpu_device_supports_atpx(struct drm_device *dev)
  *
  * @dev: drm_device pointer
  *
- * Returns true if the device is a dGPU with HG/PX power control,
+ * Returns true if the device is a dGPU with ACPI power control,
  * otherwise return false.
  */
 bool amdgpu_device_supports_boco(struct drm_device *dev)
 {
struct amdgpu_device *adev = drm_to_adev(dev);
 
-   if (adev->has_pr3)
+   if (adev->has_pr3 ||
+   ((adev->flags & AMD_IS_PX) && amdgpu_is_atpx_hybrid()))
return true;
return false;
 }
@@ -1429,7 +1430,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev 
*pdev,
struct drm_device *dev = pci_get_drvdata(pdev);
int r;
 
-   if (amdgpu_device_supports_atpx(dev) && state == VGA_SWITCHEROO_OFF)
+   if (amdgpu_device_supports_px(dev) && state == VGA_SWITCHEROO_OFF)
return;
 
if (state == VGA_SWITCHEROO_ON) {
@@ -3213,7 +3214,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
struct drm_device *ddev = adev_to_drm(adev);
struct pci_dev *pdev = adev->pdev;
int r, i;
-   bool atpx = false;
+   bool px = false;
u32 max_MBps;
 
adev->shutdown = false;
@@ -3385,16 +3386,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
vga_client_register(adev->pdev, adev, NULL, 
amdgpu_device_vga_set_decode);
 
-   if (amdgpu_device_supports_atpx(ddev))
-   atpx = true;
-   if (amdgpu_has_atpx() &&
-   (amdgpu_is_atpx_hybrid() ||
-amdgpu_has_atpx_dgpu_power_cntl()) &&
-   !pci_is_thunderbolt_attached(adev->pdev))
+   if (amdgpu_device_supports_px(ddev)) {
+   px = true;
vga_switcheroo_register_client(adev->pdev,
-  &amdgpu_switcheroo_ops, atpx);
-   if (atpx)
+  &amdgpu_switcheroo_ops, px);
vga_switcheroo_init_domain_pm_ops(adev->dev, &adev->vga_pm_domain);
+   }
 
if (amdgpu_emu_mode == 1) {
/* post the asic on emulation mode */
@@ -3576,7 +3573,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 
 failed:
amdgpu_vf_error_trans_all(adev);
-   if (atpx)
+   if (px)
vga_switcheroo_fini_domain_pm_ops(adev->dev);
 
 failed_unmap:
@@ -3636,13 +3633,10 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 
kfree(adev->bios);
adev->bios = NULL;
-   if (amdgpu_has_atpx() &&
-   (amdgpu_is_atpx_hybrid() ||
-amdgpu_has_atpx_dgpu_power_cntl()) &&
-   

[PATCH 6/7] drm/amdgpu: clean up S0ix logic

2021-03-08 Thread Alex Deucher
We only need special handling for the S0ix suspend and resume
cases; legacy S3/S4/shutdown/reboot/reset should use the
standard code paths.  This should fix systems with S0ix
plus legacy S4.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  6 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  4 
 3 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4ddc5cc983c7..bf9359ccf3da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1040,12 +1040,6 @@ struct amdgpu_device {
enum amdgpu_pmops_state pmops_state;
bool in_suspend;
 
-   /*
-* The combination flag in_poweroff_reboot_com used to identify the 
poweroff
-* and reboot opt in the s0i3 system-wide suspend.
-*/
-   bool in_poweroff_reboot_com;
-
atomic_t in_gpu_reset;
enum pp_mp1_state   mp1_state;
struct rw_semaphore reset_sem;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ac5f7837285b..2b6e483259f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2680,9 +2680,10 @@ static void amdgpu_device_delay_enable_gfx_off(struct 
work_struct *work)
 static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
 {
int i, r;
+   bool s0ix_suspend = amdgpu_acpi_is_s0ix_supported(adev) &&
+   (adev->pmops_state == AMDGPU_PMOPS_SUSPEND);
 
-   if (adev->in_poweroff_reboot_com ||
-   !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev)) {
+   if (!s0ix_suspend) {
amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
}
@@ -3672,13 +3673,13 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
  */
 int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 {
-   struct amdgpu_device *adev;
+   struct amdgpu_device *adev = drm_to_adev(dev);
struct drm_crtc *crtc;
struct drm_connector *connector;
struct drm_connector_list_iter iter;
int r;
-
-   adev = drm_to_adev(dev);
+   bool s0ix_suspend = amdgpu_acpi_is_s0ix_supported(adev) &&
+   (adev->pmops_state == AMDGPU_PMOPS_SUSPEND);
 
if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
return 0;
@@ -3741,11 +3742,10 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
 
amdgpu_fence_driver_suspend(adev);
 
-   if (adev->in_poweroff_reboot_com ||
-   !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev))
-   r = amdgpu_device_ip_suspend_phase2(adev);
-   else
+   if (s0ix_suspend)
amdgpu_gfx_state_change_set(adev, sGpuChangeState_D3Entry);
+   else
+   r = amdgpu_device_ip_suspend_phase2(adev);
/* evict remaining vram memory
 * This second call to evict vram is to evict the gart page table
 * using the CPU.
@@ -3772,11 +3772,13 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
struct amdgpu_device *adev = drm_to_adev(dev);
struct drm_crtc *crtc;
int r = 0;
+   bool s0ix_resume = amdgpu_acpi_is_s0ix_supported(adev) &&
+   (adev->pmops_state == AMDGPU_PMOPS_RESUME);
 
if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
return 0;
 
-   if (amdgpu_acpi_is_s0ix_supported(adev))
+   if (s0ix_resume)
amdgpu_gfx_state_change_set(adev, sGpuChangeState_D0Entry);
 
/* post card */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 0312c52bd39d..dd6d24305b16 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1288,9 +1288,7 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
 */
if (!amdgpu_passthrough(adev))
adev->mp1_state = PP_MP1_STATE_UNLOAD;
-   adev->in_poweroff_reboot_com = true;
amdgpu_device_ip_suspend(adev);
-   adev->in_poweroff_reboot_com = false;
adev->mp1_state = PP_MP1_STATE_NONE;
 }
 
@@ -1380,9 +1378,7 @@ static int amdgpu_pmops_poweroff(struct device *dev)
int r;
 
adev->pmops_state = AMDGPU_PMOPS_POWEROFF;
-   adev->in_poweroff_reboot_com = true;
r =  amdgpu_device_suspend(drm_dev, true);
-   adev->in_poweroff_reboot_com = false;
adev->pmops_state = AMDGPU_PMOPS_NONE;
return r;
 }
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org

[PATCH 5/7] drm/amdgpu: don't evict vram on APUs for suspend to ram

2021-03-08 Thread Alex Deucher
Vram is system memory, so no need to evict.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b8205442..2da3a3480863 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1028,13 +1028,11 @@ int amdgpu_bo_evict_vram(struct amdgpu_device *adev)
 {
struct ttm_resource_manager *man;
 
-   /* late 2.6.33 fix IGP hibernate - we need pm ops to do this correct */
-#ifndef CONFIG_HIBERNATION
-   if (adev->flags & AMD_IS_APU) {
-   /* Useless to evict on IGP chips */
+   if ((adev->flags & AMD_IS_APU) &&
+   (adev->pmops_state == AMDGPU_PMOPS_SUSPEND)) {
+   /* Useless to evict vram on APUs for suspend to ram */
return 0;
}
-#endif
 
man = ttm_manager_type(&adev->mman.bdev, TTM_PL_VRAM);
return ttm_resource_manager_evict_all(&adev->mman.bdev, man);
-- 
2.29.2



[PATCH 7/7] drm/amdgpu: clean up non-DC suspend/resume handling

2021-03-08 Thread Alex Deucher
Move the non-DC specific code into the DCE IP blocks similar
to how we handle DC.  This cleans up the common suspend
and resume paths.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 82 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 88 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.h |  3 +
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c  |  9 ++-
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c  |  9 ++-
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c   |  8 +-
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c   |  9 ++-
 drivers/gpu/drm/amd/amdgpu/dce_virtual.c| 15 +++-
 8 files changed, 137 insertions(+), 86 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2b6e483259f1..c4ccf7a313f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3674,9 +3674,6 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 {
struct amdgpu_device *adev = drm_to_adev(dev);
-   struct drm_crtc *crtc;
-   struct drm_connector *connector;
-   struct drm_connector_list_iter iter;
int r;
bool s0ix_suspend = amdgpu_acpi_is_s0ix_supported(adev) &&
(adev->pmops_state == AMDGPU_PMOPS_SUSPEND);
@@ -3692,45 +3689,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
 
cancel_delayed_work_sync(&adev->delayed_init_work);
 
-   if (!amdgpu_device_has_dc_support(adev)) {
-   /* turn off display hw */
-   drm_modeset_lock_all(dev);
-   drm_connector_list_iter_begin(dev, &iter);
-   drm_for_each_connector_iter(connector, &iter)
-   drm_helper_connector_dpms(connector,
- DRM_MODE_DPMS_OFF);
-   drm_connector_list_iter_end(&iter);
-   drm_modeset_unlock_all(dev);
-   /* unpin the front buffers and cursors */
-   list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
-   struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
-   struct drm_framebuffer *fb = crtc->primary->fb;
-   struct amdgpu_bo *robj;
-
-   if (amdgpu_crtc->cursor_bo && 
!adev->enable_virtual_display) {
-   struct amdgpu_bo *aobj = 
gem_to_amdgpu_bo(amdgpu_crtc->cursor_bo);
-   r = amdgpu_bo_reserve(aobj, true);
-   if (r == 0) {
-   amdgpu_bo_unpin(aobj);
-   amdgpu_bo_unreserve(aobj);
-   }
-   }
-
-   if (fb == NULL || fb->obj[0] == NULL) {
-   continue;
-   }
-   robj = gem_to_amdgpu_bo(fb->obj[0]);
-   /* don't unpin kernel fb objects */
-   if (!amdgpu_fbdev_robj_is_fb(adev, robj)) {
-   r = amdgpu_bo_reserve(robj, true);
-   if (r == 0) {
-   amdgpu_bo_unpin(robj);
-   amdgpu_bo_unreserve(robj);
-   }
-   }
-   }
-   }
-
amdgpu_ras_suspend(adev);
 
r = amdgpu_device_ip_suspend_phase1(adev);
@@ -3767,10 +3725,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
  */
 int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
 {
-   struct drm_connector *connector;
-   struct drm_connector_list_iter iter;
struct amdgpu_device *adev = drm_to_adev(dev);
-   struct drm_crtc *crtc;
int r = 0;
bool s0ix_resume = amdgpu_acpi_is_s0ix_supported(adev) &&
(adev->pmops_state == AMDGPU_PMOPS_RESUME);
@@ -3803,24 +3758,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
queue_delayed_work(system_wq, &adev->delayed_init_work,
   msecs_to_jiffies(AMDGPU_RESUME_MS));
 
-   if (!amdgpu_device_has_dc_support(adev)) {
-   /* pin cursors */
-   list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
-   struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
-
-   if (amdgpu_crtc->cursor_bo && 
!adev->enable_virtual_display) {
-   struct amdgpu_bo *aobj = 
gem_to_amdgpu_bo(amdgpu_crtc->cursor_bo);
-   r = amdgpu_bo_reserve(aobj, true);
-   if (r == 0) {
-   r = amdgpu_bo_pin(aobj, 
AMDGPU_GEM_DOMAIN_VRAM);
-   if (r != 0)
-   

[PATCH 2/7] drm/amdgpu: enable DPM_FLAG_MAY_SKIP_RESUME and DPM_FLAG_SMART_SUSPEND flags (v2)

2021-03-08 Thread Alex Deucher
Once the device has runtime suspended, we don't need to power it
back up again for system suspend.  Likewise for resume, we don't
need to power up the device again on resume only to power it back
off again via runtime pm because it's still idle.

v2: add DPM_FLAG_SMART_PREPARE as well

Acked-by: Rajneesh Bhardwaj  (v1)
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index f269267a96d3..8e6ef4d8b7ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -205,6 +205,13 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, 
unsigned long flags)
if (amdgpu_device_supports_atpx(dev) &&
!amdgpu_is_atpx_hybrid())
dev_pm_set_driver_flags(dev->dev, 
DPM_FLAG_NO_DIRECT_COMPLETE);
+   /* we want direct complete for BOCO */
+   if ((amdgpu_device_supports_atpx(dev) &&
+   amdgpu_is_atpx_hybrid()) ||
+   amdgpu_device_supports_boco(dev))
+   dev_pm_set_driver_flags(dev->dev, 
DPM_FLAG_SMART_PREPARE |
+   DPM_FLAG_SMART_SUSPEND |
+   DPM_FLAG_MAY_SKIP_RESUME);
pm_runtime_use_autosuspend(dev->dev);
pm_runtime_set_autosuspend_delay(dev->dev, 5000);
pm_runtime_allow(dev->dev);
-- 
2.29.2



[PATCH 1/7] drm/amdgpu: add a dev_pm_ops prepare callback (v2)

2021-03-08 Thread Alex Deucher
as per:
https://www.kernel.org/doc/html/latest/driver-api/pm/devices.html

The prepare callback is required to support the DPM_FLAG_SMART_SUSPEND
driver flag.  This allows runtime pm to auto-complete when the
system goes into suspend, avoiding a wake-up on suspend and on resume.
Apply this for hybrid gfx and BOCO systems where d3cold is
provided by the ACPI platform.

v2: check if device is runtime suspended in prepare.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index aecf7baf219a..8d4fbee01011 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include <linux/suspend.h>
 
 #include "amdgpu.h"
 #include "amdgpu_irq.h"
@@ -1293,6 +1294,27 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
adev->mp1_state = PP_MP1_STATE_NONE;
 }
 
+static int amdgpu_pmops_prepare(struct device *dev)
+{
+   struct drm_device *drm_dev = dev_get_drvdata(dev);
+
+   /* Return a positive number here so
+* DPM_FLAG_SMART_SUSPEND works properly
+*/
+   if ((amdgpu_device_supports_atpx(drm_dev) &&
+   amdgpu_is_atpx_hybrid()) ||
+   amdgpu_device_supports_boco(drm_dev))
+   return pm_runtime_suspended(dev) &&
+   pm_suspend_via_firmware();
+
+   return 0;
+}
+
+static void amdgpu_pmops_complete(struct device *dev)
+{
+   /* nothing to do */
+}
+
 static int amdgpu_pmops_suspend(struct device *dev)
 {
struct drm_device *drm_dev = dev_get_drvdata(dev);
@@ -1511,6 +1533,8 @@ long amdgpu_drm_ioctl(struct file *filp,
 }
 
 static const struct dev_pm_ops amdgpu_pm_ops = {
+   .prepare = amdgpu_pmops_prepare,
+   .complete = amdgpu_pmops_complete,
.suspend = amdgpu_pmops_suspend,
.resume = amdgpu_pmops_resume,
.freeze = amdgpu_pmops_freeze,
-- 
2.29.2



Re: [PATCH] drm/amdgpu: fix the hibernation suspend with s0ix

2021-03-08 Thread Alex Deucher
On Mon, Mar 8, 2021 at 10:52 PM Prike Liang  wrote:
>
> During system hibernation, suspend still needs to un-gate gfx CG/PG first to
> handle the HW status check before HW resources are destroyed.
>
> Signed-off-by: Prike Liang 

This is fine for stable, but we should work on cleaning this up.  I
have a patch set to improve this, but it's more invasive.  We really
need to sort out what specific parts of
amdgpu_device_ip_suspend_phase2() are problematic and special case
them.  We shouldn't just be skipping that function.

Acked-by: Alex Deucher 

Alex


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a..7079bfc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2683,7 +2683,7 @@ static int amdgpu_device_ip_suspend_phase1(struct 
> amdgpu_device *adev)
>  {
> int i, r;
>
> -   if (adev->in_poweroff_reboot_com ||
> +   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
> !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev)) {
> amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> @@ -3750,7 +3750,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>
> amdgpu_fence_driver_suspend(adev);
>
> -   if (adev->in_poweroff_reboot_com ||
> +   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
> !amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev))
> r = amdgpu_device_ip_suspend_phase2(adev);
> else
> --
> 2.7.4
>


[PATCH] drm/amdgpu: fix the hibernation suspend with s0ix

2021-03-08 Thread Prike Liang
During system hibernation, suspend still needs to un-gate gfx CG/PG first to
handle the HW status check before HW resources are destroyed.

Signed-off-by: Prike Liang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a..7079bfc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2683,7 +2683,7 @@ static int amdgpu_device_ip_suspend_phase1(struct 
amdgpu_device *adev)
 {
int i, r;
 
-   if (adev->in_poweroff_reboot_com ||
+   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
!amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev)) {
amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
@@ -3750,7 +3750,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
 
amdgpu_fence_driver_suspend(adev);
 
-   if (adev->in_poweroff_reboot_com ||
+   if (adev->in_poweroff_reboot_com || adev->in_hibernate ||
!amdgpu_acpi_is_s0ix_supported(adev) || amdgpu_in_reset(adev))
r = amdgpu_device_ip_suspend_phase2(adev);
else
-- 
2.7.4



Re: [PATCH v2] drm/amdgpu: Verify bo size can fit framebuffer size on init.

2021-03-08 Thread Alex Deucher
On Mon, Mar 8, 2021 at 4:36 PM Mark Yacoub  wrote:
>
> From: Mark Yacoub 
>
> To initialize the framebuffer, call drm_gem_fb_init_with_funcs which
> verifies that the BO size can fit the FB size by calculating the minimum
> expected size of each plane.
>
> The bug was caught using igt-gpu-tools test: kms_addfb_basic.too-high
> and kms_addfb_basic.bo-too-small
>
> Tested on ChromeOS Zork by turning on the display and running a YT
> video.
>
> === Changes from v1 ===
> 1. Added new line under declarations.
> 2. Use C style comment.
>
> Cc: Alex Deucher 
> Cc: "Christian König" 
> Cc: Sean Paul 
> Signed-off-by: Mark Yacoub 

Applied.  Thanks.  FWIW, it would be nice to clean this up in general.
All of this fbdev stuff is pretty convoluted.

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 68 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c  |  4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h|  8 +++
>  3 files changed, 65 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> index 48cb33e5b3826..afa5f8ad0f563 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> @@ -870,17 +870,62 @@ static int amdgpu_display_get_fb_info(const struct 
> amdgpu_framebuffer *amdgpu_fb
> return r;
>  }
>
> +int amdgpu_display_gem_fb_init(struct drm_device *dev,
> +  struct amdgpu_framebuffer *rfb,
> +  const struct drm_mode_fb_cmd2 *mode_cmd,
> +  struct drm_gem_object *obj)
> +{
> +   int ret;
> +
> +   rfb->base.obj[0] = obj;
> +   drm_helper_mode_fill_fb_struct(dev, &rfb->base, mode_cmd);
> +   ret = drm_framebuffer_init(dev, &rfb->base, &amdgpu_fb_funcs);
> +   if (ret)
> +   goto err;
> +
> +   ret = amdgpu_display_framebuffer_init(dev, rfb, mode_cmd, obj);
> +   if (ret)
> +   goto err;
> +
> +   return 0;
> +err:
> +   drm_err(dev, "Failed to init gem fb: %d\n", ret);
> +   rfb->base.obj[0] = NULL;
> +   return ret;
> +}
> +
> +int amdgpu_display_gem_fb_verify_and_init(
> +   struct drm_device *dev, struct amdgpu_framebuffer *rfb,
> +   struct drm_file *file_priv, const struct drm_mode_fb_cmd2 *mode_cmd,
> +   struct drm_gem_object *obj)
> +{
> +   int ret;
> +
> +   rfb->base.obj[0] = obj;
> +
> +   /* Verify that bo size can fit the fb size. */
> +   ret = drm_gem_fb_init_with_funcs(dev, &rfb->base, file_priv, mode_cmd,
> +&amdgpu_fb_funcs);
> +   if (ret)
> +   goto err;
> +
> +   ret = amdgpu_display_framebuffer_init(dev, rfb, mode_cmd, obj);
> +   if (ret)
> +   goto err;
> +
> +   return 0;
> +err:
> +   drm_err(dev, "Failed to verify and init gem fb: %d\n", ret);
> +   rfb->base.obj[0] = NULL;
> +   return ret;
> +}
> +
>  int amdgpu_display_framebuffer_init(struct drm_device *dev,
> struct amdgpu_framebuffer *rfb,
> const struct drm_mode_fb_cmd2 *mode_cmd,
> struct drm_gem_object *obj)
>  {
> int ret, i;
> -   rfb->base.obj[0] = obj;
> -   drm_helper_mode_fill_fb_struct(dev, &rfb->base, mode_cmd);
> -   ret = drm_framebuffer_init(dev, &rfb->base, &amdgpu_fb_funcs);
> -   if (ret)
> -   goto fail;
>
> /*
>  * This needs to happen before modifier conversion as that might 
> change
> @@ -891,13 +936,13 @@ int amdgpu_display_framebuffer_init(struct drm_device 
> *dev,
> drm_dbg_kms(dev, "Plane 0 and %d have different BOs: 
> %u vs. %u\n",
> i, mode_cmd->handles[0], 
> mode_cmd->handles[i]);
> ret = -EINVAL;
> -   goto fail;
> +   return ret;
> }
> }
>
> ret = amdgpu_display_get_fb_info(rfb, &rfb->tiling_flags, 
> &rfb->tmz_surface);
> if (ret)
> -   goto fail;
> +   return ret;
>
> if (dev->mode_config.allow_fb_modifiers &&
> !(rfb->base.flags & DRM_MODE_FB_MODIFIERS)) {
> @@ -905,7 +950,7 @@ int amdgpu_display_framebuffer_init(struct drm_device 
> *dev,
> if (ret) {
> drm_dbg_kms(dev, "Failed to convert tiling flags 
> 0x%llX to a modifier",
> rfb->tiling_flags);
> -   goto fail;
> +   return ret;
> }
> }
>
> @@ -915,10 +960,6 @@ int amdgpu_display_framebuffer_init(struct drm_device 
> *dev,
> }
>
> return 0;
> -
> -fail:
> -   rfb->base.obj[0] = NULL;
> -   return ret;
>  }
>
>  struct drm_framebuffer *
> @@ -953,7 +994,8 @@ amdgpu_display_user_framebuffer_create(struct drm_device 
> *dev,
>   

Re: [PATCH 5/5] drm/amdgpu: use metadata members of struct amdgpu_bo_user

2021-03-08 Thread Felix Kuehling
On 2021-03-05 at 10:06 a.m., Nirmoy Das wrote:
> These members are only needed for BOs created by
> amdgpu_gem_object_create(), so we can remove these from the
> base class.
>
> CC: felix.kuehl...@amd.com
> Signed-off-by: Nirmoy Das 

Acked-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 48 +++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  3 --
>  3 files changed, 34 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index c5343a5eecbe..f8c8cbd8ab59 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -494,8 +494,6 @@ int amdgpu_amdkfd_get_dmabuf_info(struct kgd_dev *kgd, 
> int dma_buf_fd,
>   *dma_buf_kgd = (struct kgd_dev *)adev;
>   if (bo_size)
>   *bo_size = amdgpu_bo_size(bo);
> - if (metadata_size)
> - *metadata_size = bo->metadata_size;
>   if (metadata_buffer)
>   r = amdgpu_bo_get_metadata(bo, metadata_buffer, buffer_size,
>  metadata_size, &metadata_flags);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index ca60943312dc..7c744d90ed34 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -77,6 +77,7 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
>  {
>   struct amdgpu_device *adev = amdgpu_ttm_adev(tbo->bdev);
>   struct amdgpu_bo *bo = ttm_to_amdgpu_bo(tbo);
> + struct amdgpu_bo_user *ubo;
>  
>   if (bo->pin_count > 0)
>   amdgpu_bo_subtract_pin_size(bo);
> @@ -94,7 +95,11 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object 
> *tbo)
>   }
>   amdgpu_bo_unref(&bo->parent);
>  
> - kfree(bo->metadata);
> + if (bo->tbo.type == ttm_bo_type_device) {
> + ubo = container_of((bo), struct amdgpu_bo_user, bo);
> + kfree(ubo->metadata);
> + }
> +
>   kfree(bo);
>  }
>  
> @@ -1236,13 +1241,20 @@ void amdgpu_bo_get_tiling_flags(struct amdgpu_bo *bo, 
> u64 *tiling_flags)
>  int amdgpu_bo_set_metadata (struct amdgpu_bo *bo, void *metadata,
>   uint32_t metadata_size, uint64_t flags)
>  {
> + struct amdgpu_bo_user *ubo;
>   void *buffer;
>  
> + if (bo->tbo.type != ttm_bo_type_device) {
> + DRM_ERROR("can not set metadata for a non-amdgpu_bo_user type 
> BO\n");
> + return -EINVAL;
> + }
> +
> + ubo = container_of((bo), struct amdgpu_bo_user, bo);
>   if (!metadata_size) {
> - if (bo->metadata_size) {
> - kfree(bo->metadata);
> - bo->metadata = NULL;
> - bo->metadata_size = 0;
> + if (ubo->metadata_size) {
> + kfree(ubo->metadata);
> + ubo->metadata = NULL;
> + ubo->metadata_size = 0;
>   }
>   return 0;
>   }
> @@ -1254,10 +1266,10 @@ int amdgpu_bo_set_metadata (struct amdgpu_bo *bo, 
> void *metadata,
>   if (buffer == NULL)
>   return -ENOMEM;
>  
> - kfree(bo->metadata);
> - bo->metadata_flags = flags;
> - bo->metadata = buffer;
> - bo->metadata_size = metadata_size;
> + kfree(ubo->metadata);
> + ubo->metadata_flags = flags;
> + ubo->metadata = buffer;
> + ubo->metadata_size = metadata_size;
>  
>   return 0;
>  }
> @@ -1281,21 +1293,29 @@ int amdgpu_bo_get_metadata(struct amdgpu_bo *bo, void 
> *buffer,
>  size_t buffer_size, uint32_t *metadata_size,
>  uint64_t *flags)
>  {
> + struct amdgpu_bo_user *ubo;
> +
>   if (!buffer && !metadata_size)
>   return -EINVAL;
>  
> + if (bo->tbo.type != ttm_bo_type_device) {
> + DRM_ERROR("can not get metadata for a non-amdgpu_bo_user type 
> BO\n");
> + return -EINVAL;
> + }
> +
> + ubo = container_of((bo), struct amdgpu_bo_user, bo);
>   if (buffer) {
> - if (buffer_size < bo->metadata_size)
> + if (buffer_size < ubo->metadata_size)
>   return -EINVAL;
>  
> - if (bo->metadata_size)
> - memcpy(buffer, bo->metadata, bo->metadata_size);
> + if (ubo->metadata_size)
> + memcpy(buffer, ubo->metadata, ubo->metadata_size);
>   }
>  
>   if (metadata_size)
> - *metadata_size = bo->metadata_size;
> + *metadata_size = ubo->metadata_size;
>   if (flags)
> - *flags = bo->metadata_flags;
> + *flags = ubo->metadata_flags;
>  
>   return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> index 

[PATCH 1/1] drm/amdkfd: fix build error with AMD_IOMMU_V2=m

2021-03-08 Thread Felix Kuehling
Using 'imply AMD_IOMMU_V2' does not guarantee that the driver can link
against the exported functions. If the GPU driver is built-in but the
IOMMU driver is a loadable module, the kfd_iommu.c file is indeed
built but does not work:

x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_bind_process_to_device':
kfd_iommu.c:(.text+0x516): undefined reference to `amd_iommu_bind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_unbind_process':
kfd_iommu.c:(.text+0x691): undefined reference to `amd_iommu_unbind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_suspend':
kfd_iommu.c:(.text+0x966): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x97f): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x9a4): undefined reference to 
`amd_iommu_free_device'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_resume':
kfd_iommu.c:(.text+0xa9a): undefined reference to `amd_iommu_init_device'
x86_64-linux-ld: kfd_iommu.c:(.text+0xadc): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xaff): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xc72): undefined reference to 
`amd_iommu_bind_pasid'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe08): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe26): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe42): undefined reference to 
`amd_iommu_free_device'

Use IS_REACHABLE to only build IOMMU-V2 support if the amd_iommu symbols
are reachable by the amdkfd driver. Output a warning if they are not,
because that may not be what the user was expecting.

Fixes: 64d1c3a43a6f ("drm/amdkfd: Centralize IOMMUv2 code and make it 
conditional")
Reported-by: Arnd Bergmann 
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c | 10 ++
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.h |  6 --
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
index 66bbca61e3ef..7199eb833f66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
@@ -20,6 +20,10 @@
  * OTHER DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/kconfig.h>
+
+#if IS_REACHABLE(CONFIG_AMD_IOMMU_V2)
+
 #include 
 #include 
 #include 
@@ -355,3 +359,9 @@ int kfd_iommu_add_perf_counters(struct kfd_topology_device 
*kdev)
 
return 0;
 }
+
+#else
+
+#warning "Modular IOMMU-V2 is not usable by built-in KFD"
+
+#endif
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.h
index dd23d9fdf6a8..b25365fc2c4e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.h
@@ -23,7 +23,9 @@
 #ifndef __KFD_IOMMU_H__
 #define __KFD_IOMMU_H__
 
-#if defined(CONFIG_AMD_IOMMU_V2_MODULE) || defined(CONFIG_AMD_IOMMU_V2)
+#include <linux/kconfig.h>
+
+#if IS_REACHABLE(CONFIG_AMD_IOMMU_V2)
 
 #define KFD_SUPPORT_IOMMU_V2
 
@@ -73,6 +75,6 @@ static inline int kfd_iommu_add_perf_counters(struct 
kfd_topology_device *kdev)
return 0;
 }
 
-#endif /* defined(CONFIG_AMD_IOMMU_V2) */
+#endif /* IS_REACHABLE(CONFIG_AMD_IOMMU_V2) */
 
 #endif /* __KFD_IOMMU_H__ */
-- 
2.30.0



Re: [PATCH] [variant b] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Felix Kuehling
On 2021-03-08 at 3:45 p.m., Arnd Bergmann wrote:
> From: Arnd Bergmann 
>
> Using 'imply AMD_IOMMU_V2' does not guarantee that the driver can link
> against the exported functions. If the GPU driver is built-in but the
> IOMMU driver is a loadable module, the kfd_iommu.c file is indeed
> built but does not work:
>
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_bind_process_to_device':
> kfd_iommu.c:(.text+0x516): undefined reference to `amd_iommu_bind_pasid'
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_unbind_process':
> kfd_iommu.c:(.text+0x691): undefined reference to `amd_iommu_unbind_pasid'
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_suspend':
> kfd_iommu.c:(.text+0x966): undefined reference to 
> `amd_iommu_set_invalidate_ctx_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0x97f): undefined reference to 
> `amd_iommu_set_invalid_ppr_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0x9a4): undefined reference to 
> `amd_iommu_free_device'
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_resume':
> kfd_iommu.c:(.text+0xa9a): undefined reference to `amd_iommu_init_device'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xadc): undefined reference to 
> `amd_iommu_set_invalidate_ctx_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xaff): undefined reference to 
> `amd_iommu_set_invalid_ppr_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xc72): undefined reference to 
> `amd_iommu_bind_pasid'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xe08): undefined reference to 
> `amd_iommu_set_invalidate_ctx_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xe26): undefined reference to 
> `amd_iommu_set_invalid_ppr_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xe42): undefined reference to 
> `amd_iommu_free_device'
>
> Change the 'imply' to a weak dependency that still allows compiling
> in all other configurations but disallows the configuration that
> causes a link failure.

I don't like this solution. It hides the HSA_AMD option in menuconfig
and it's not intuitively obvious to someone configuring a kernel why
it's not available. They may not even notice that it's missing and then
wonder later why KFD functionality is missing from their kernel.

What I'm trying to achieve is that KFD can be compiled without
AMD-IOMMU-V2 support in this case. I just tested my version using
IS_REACHABLE. I'll reply with that patch in a moment.

Regards,
  Felix


>
> Fixes: 64d1c3a43a6f ("drm/amdkfd: Centralize IOMMUv2 code and make it 
> conditional")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/gpu/drm/amd/amdkfd/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig 
> b/drivers/gpu/drm/amd/amdkfd/Kconfig
> index f02c938f75da..d01dba2af3bb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/Kconfig
> +++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
> @@ -6,7 +6,7 @@
>  config HSA_AMD
>   bool "HSA kernel driver for AMD GPU devices"
>   depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
> - imply AMD_IOMMU_V2 if X86_64
> + depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m
>   select HMM_MIRROR
>   select MMU_NOTIFIER
>   select DRM_AMDGPU_USERPTR
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: update secure display TA header

2021-03-08 Thread Jinzhou Su
update secure display TA header file.

Signed-off-by: Jinzhou Su 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_securedisplay.c | 3 +++
 drivers/gpu/drm/amd/amdgpu/ta_secureDisplay_if.h  | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_securedisplay.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_securedisplay.c
index 834440ab9ff7..9cf856c94f94 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_securedisplay.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_securedisplay.c
@@ -69,6 +69,9 @@ void psp_securedisplay_parse_resp_status(struct psp_context 
*psp,
case TA_SECUREDISPLAY_STATUS__READ_CRC_ERROR:
dev_err(psp->adev->dev, "Secure display: Failed to Read CRC");
break;
+	case TA_SECUREDISPLAY_STATUS__I2C_INIT_ERROR:
+		dev_err(psp->adev->dev, "Secure display: Failed to initialize I2C.");
+		break;
default:
		dev_err(psp->adev->dev, "Secure display: Failed to parse status: %d\n", status);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/ta_secureDisplay_if.h 
b/drivers/gpu/drm/amd/amdgpu/ta_secureDisplay_if.h
index 5039375bb1d4..cf8ff064dc72 100644
--- a/drivers/gpu/drm/amd/amdgpu/ta_secureDisplay_if.h
+++ b/drivers/gpu/drm/amd/amdgpu/ta_secureDisplay_if.h
@@ -50,6 +50,7 @@ enum ta_securedisplay_status {
	TA_SECUREDISPLAY_STATUS__I2C_WRITE_ERROR         = 0x04,    /* Fail to Write to I2C */
	TA_SECUREDISPLAY_STATUS__READ_DIO_SCRATCH_ERROR  = 0x05,    /* Fail Read DIO Scratch Register */
	TA_SECUREDISPLAY_STATUS__READ_CRC_ERROR          = 0x06,    /* Fail to Read CRC */
+	TA_SECUREDISPLAY_STATUS__I2C_INIT_ERROR          = 0x07,    /* Failed to initialize I2C */
 
	TA_SECUREDISPLAY_STATUS__MAX = 0x7FFF,    /* Maximum Value for status */
 };
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: capture invalid hardware access v2

2021-03-08 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   amdgpu_device_skip_hw_access() will always assert in the reset thread, which
doesn't seem like a good idea.

Best Regards
Dennis Li
-Original Message-
From: Christian König 
Sent: Tuesday, March 9, 2021 2:07 AM
To: amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey ; Li, Dennis 

Subject: [PATCH] drm/amdgpu: capture invalid hardware access v2

From: Dennis Li 

When the recovery thread has begun a GPU reset, no other threads should access
the hardware; otherwise the system randomly hangs.

v2 (chk): rewritten from scratch, use trylock and lockdep instead of
  hand wiring the logic.

Signed-off-by: Dennis Li 
Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 74 +-
 1 file changed, 57 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..c990af6a43ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -326,6 +326,34 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, 
loff_t pos,
 /*
  * register access helper functions.
  */
+
+/* Check if hw access should be skipped because of hotplug or device error */
+static bool amdgpu_device_skip_hw_access(struct amdgpu_device *adev)
+{
+if (adev->in_pci_err_recovery)
+return true;
+
+#ifdef CONFIG_LOCKDEP
+/*
+ * This is a bit complicated to understand, so worth a comment. What we assert
+ * here is that the GPU reset is not running on another thread in parallel.
+ *
+ * For this we trylock the read side of the reset semaphore, if that succeeds
+ * we know that the reset is not running in parallel.
+ *
+ * If the trylock fails we assert that we are either already holding the read
+ * side of the lock or are the reset thread itself and hold the write side of
+ * the lock.
+ */
+	if (down_read_trylock(&adev->reset_sem))
+		up_read(&adev->reset_sem);
+	else
+		lockdep_assert_held(&adev->reset_sem);
+#endif
+
+return false;
+}
+
 /**
  * amdgpu_device_rreg - read a memory mapped IO or indirect register
  *
@@ -340,7 +368,7 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,  {
 uint32_t ret;

-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return 0;

 if ((reg * 4) < adev->rmmio_size) {
@@ -377,7 +405,7 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)  {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return 0;

 if (offset < adev->rmmio_size)
@@ -402,7 +430,7 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, 
uint32_t offset)
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t 
value)  {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return;

 if (offset < adev->rmmio_size)
@@ -425,7 +453,7 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
 uint32_t reg, uint32_t v,
 uint32_t acc_flags)
 {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return;

 if ((reg * 4) < adev->rmmio_size) {
@@ -452,7 +480,7 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,  void 
amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
  uint32_t reg, uint32_t v)
 {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return;

 if (amdgpu_sriov_fullaccess(adev) &&
@@ -475,7 +503,7 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)  {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return 0;

 if ((reg * 4) < adev->rio_mem_size)
@@ -497,7 +525,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)  {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return;

 if ((reg * 4) < adev->rio_mem_size)
@@ -519,7 +547,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, 
u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)  {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return 0;

 if (index < adev->doorbell.num_doorbells) { @@ -542,7 +570,7 @@ u32 
amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)  {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return;

 if (index < adev->doorbell.num_doorbells) { @@ -563,7 +591,7 @@ void 
amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)  {
-if (adev->in_pci_err_recovery)
+if (amdgpu_device_skip_hw_access(adev))
 return 0;

 if (index < adev->doorbell.num_doorbells) { @@ -586,7 +614,7 @@ u64 
amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
  */
 void amdgpu_mm_wdoorbell64(struct 

2021 X.Org Foundation Membership renewal ENDS on THURSDAY Mar 11

2021-03-08 Thread Harry Wentland
The nomination period for the 2021 X.Org Foundation Board of Directors 
Election closed yesterday and the election is rapidly approaching. We 
currently only see membership renewals for 59 people.


If you have not renewed your membership please do so by Thursday, Mar 11 
at https://members.x.org.


The nominated candidates will be announced a week from yesterday.

There were some hiccups with our earlier emails and we realize some of 
you may not have received them. To ensure you receive this email we're 
BCCing anyone who has registered as a member in the last 2 years.


** Election Schedule **

Nomination period Start: Mon 22nd February
Nomination period End: Sun 7th March
Deadline of X.Org membership application or renewal: Thu 11th March
Publication of Candidates & start of Candidate QA: Mon 15th March
Election Planned Start: Mon 22nd March anywhere on earth
Election Planned End: Sun 4th April anywhere on earth

** Election Committee **

 * Eric Anholt
 * Mark Filion
 * Keith Packard
 * Harry Wentland

Thanks,
Harry Wentland,
on behalf of the X.Org elections committee
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v3] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Andrey Grodzovsky




On 2021-03-08 7:33 a.m., Jack Zhang wrote:

[Why]
Previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout, because the gfx and compute rings
share internal GC hardware.

[How]
This patch implements an advanced tdr mode.
1. It adds a synchronized resubmit-and-wait step to find the real bad
job. If a job's hw fence times out, we decrease the old job's karma,
treat the newly found one as guilty, and do a hw reset to recover the
hw. After that, we continue the resubmit step to resubmit the
remaining jobs.

2. For a whole gpu reset (vram lost), do the resubmit in the old style.

Signed-off-by: Jack Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 33 +
  3 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..fa53c6c00ee9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4639,7 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
-
+   int tmp_vram_lost_counter;
/*
 * Special case: RAS triggered and full reset isn't supported
 */
@@ -4788,6 +4788,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
}
}
  
+	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));

/* Actual ASIC resets if needed.*/
/* TODO Implement XGMI hive reset logic for SRIOV */
if (amdgpu_sriov_vf(adev)) {
@@ -4807,17 +4808,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
  
  		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {

struct amdgpu_ring *ring = tmp_adev->rings[i];
+   int ret = 0;
+   struct drm_sched_job *s_job = NULL;
  
  			if (!ring || !ring->sched.thread)

continue;
  
  			/* No point to resubmit jobs if we didn't HW reset*/

-   if (!tmp_adev->asic_reset_res && !job_signaled)
+   if (!tmp_adev->asic_reset_res && !job_signaled) {
				drm_sched_resubmit_jobs(&ring->sched);
  
-			drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);

+				if (amdgpu_gpu_recovery == 2 &&
+				    !list_empty(&ring->sched->ring_mirror_list) &&
+				    !(tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter))) {
+
+					s_job = list_first_entry_or_null(&ring->sched->ring_mirror_list, struct drm_sched_job, node);


Seems better to check for NULL here and skip checking for list_empty
above



+					ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched->timeout);
+   if (ret == 0) { /* timeout */
+   /*reset karma to the right job*/
+						if (job && job != s_job)
+							amdgpu_sched_decrease_karma(&job->base);
+   drm_sched_increase_karma(s_job);
+
+						/* do hw reset */
+						if (amdgpu_sriov_vf(adev)) {
+							amdgpu_virt_fini_data_exchange(adev);
+							r = amdgpu_device_reset_sriov(adev, false);
+							if (r)
+								adev->asic_reset_res = r;
+						} else {
+							r = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
+							if (r && r == -EAGAIN)
+								goto retry;
+
+							/* add reset counter so that the following resubmitted job could flush vmid */
+							atomic_inc(&tmp_adev->gpu_reset_counter);
+
+							/* resubmit again the left jobs */
+							drm_sched_resubmit_jobs(&ring->sched);
+						}
+   }
+   }
+   }
+   if (amdgpu_gpu_recovery != 2)
+ 

Re: [PATCH] drm/amdgpu: Remove unnecessary conversion to bool

2021-03-08 Thread Alex Deucher
On Sun, Mar 7, 2021 at 10:14 PM Jiapeng Chong
 wrote:
>
> Fix the following coccicheck warnings:
>
> ./drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c:1600:40-45: WARNING: conversion
> to bool not needed here.
>
> ./drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c:1598:40-45: WARNING: conversion
> to bool not needed here.
>
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 

This patch was already applied.

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index 690a509..b39e7db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -1595,9 +1595,9 @@ static int sdma_v5_2_set_clockgating_state(void *handle,
> case CHIP_VANGOGH:
> case CHIP_DIMGREY_CAVEFISH:
> sdma_v5_2_update_medium_grain_clock_gating(adev,
> -   state == AMD_CG_STATE_GATE ? true : false);
> +   state == AMD_CG_STATE_GATE);
> sdma_v5_2_update_medium_grain_light_sleep(adev,
> -   state == AMD_CG_STATE_GATE ? true : false);
> +   state == AMD_CG_STATE_GATE);
> break;
> default:
> break;
> --
> 1.8.3.1
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/display: Remove unnecessary conversion to bool

2021-03-08 Thread Alex Deucher
On Sun, Mar 7, 2021 at 10:00 PM Jiapeng Chong
 wrote:
>
> Fix the following coccicheck warnings:
>
> ./drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c:561:34-39: WARNING:
> conversion to bool not needed here.
>
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 

This patch was already applied.

Alex

> ---
>  drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c 
> b/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c
> index ae6484a..42a4177 100644
> --- a/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c
> @@ -558,7 +558,7 @@ bool dal_ddc_service_query_ddc_data(
> /* should not set mot (middle of transaction) to 0
>  * if there are pending read payloads
>  */
> -   payload.mot = read_size == 0 ? false : true;
> +   payload.mot = !(read_size == 0);
> payload.length = write_size;
> payload.data = write_buf;
>
> --
> 1.8.3.1
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH v2] drm/amdgpu: Verify bo size can fit framebuffer size on init.

2021-03-08 Thread Mark Yacoub
From: Mark Yacoub 

To initialize the framebuffer, call drm_gem_fb_init_with_funcs which
verifies that the BO size can fit the FB size by calculating the minimum
expected size of each plane.

The bug was caught using igt-gpu-tools test: kms_addfb_basic.too-high
and kms_addfb_basic.bo-too-small

Tested on ChromeOS Zork by turning on the display and running a YT
video.

=== Changes from v1 ===
1. Added new line under declarations.
2. Use C style comment.

Cc: Alex Deucher 
Cc: "Christian König" 
Cc: Sean Paul 
Signed-off-by: Mark Yacoub 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 68 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c  |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h|  8 +++
 3 files changed, 65 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index 48cb33e5b3826..afa5f8ad0f563 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -870,17 +870,62 @@ static int amdgpu_display_get_fb_info(const struct 
amdgpu_framebuffer *amdgpu_fb
return r;
 }
 
+int amdgpu_display_gem_fb_init(struct drm_device *dev,
+  struct amdgpu_framebuffer *rfb,
+  const struct drm_mode_fb_cmd2 *mode_cmd,
+  struct drm_gem_object *obj)
+{
+   int ret;
+
+   rfb->base.obj[0] = obj;
+	drm_helper_mode_fill_fb_struct(dev, &rfb->base, mode_cmd);
+	ret = drm_framebuffer_init(dev, &rfb->base, &amdgpu_fb_funcs);
+   if (ret)
+   goto err;
+
+   ret = amdgpu_display_framebuffer_init(dev, rfb, mode_cmd, obj);
+   if (ret)
+   goto err;
+
+   return 0;
+err:
+   drm_err(dev, "Failed to init gem fb: %d\n", ret);
+   rfb->base.obj[0] = NULL;
+   return ret;
+}
+
+int amdgpu_display_gem_fb_verify_and_init(
+   struct drm_device *dev, struct amdgpu_framebuffer *rfb,
+   struct drm_file *file_priv, const struct drm_mode_fb_cmd2 *mode_cmd,
+   struct drm_gem_object *obj)
+{
+   int ret;
+
+   rfb->base.obj[0] = obj;
+
+   /* Verify that bo size can fit the fb size. */
+	ret = drm_gem_fb_init_with_funcs(dev, &rfb->base, file_priv, mode_cmd,
+					 &amdgpu_fb_funcs);
+   if (ret)
+   goto err;
+
+   ret = amdgpu_display_framebuffer_init(dev, rfb, mode_cmd, obj);
+   if (ret)
+   goto err;
+
+   return 0;
+err:
+   drm_err(dev, "Failed to verify and init gem fb: %d\n", ret);
+   rfb->base.obj[0] = NULL;
+   return ret;
+}
+
 int amdgpu_display_framebuffer_init(struct drm_device *dev,
struct amdgpu_framebuffer *rfb,
const struct drm_mode_fb_cmd2 *mode_cmd,
struct drm_gem_object *obj)
 {
int ret, i;
-   rfb->base.obj[0] = obj;
-	drm_helper_mode_fill_fb_struct(dev, &rfb->base, mode_cmd);
-	ret = drm_framebuffer_init(dev, &rfb->base, &amdgpu_fb_funcs);
-   if (ret)
-   goto fail;
 
/*
 * This needs to happen before modifier conversion as that might change
@@ -891,13 +936,13 @@ int amdgpu_display_framebuffer_init(struct drm_device 
*dev,
drm_dbg_kms(dev, "Plane 0 and %d have different BOs: %u 
vs. %u\n",
i, mode_cmd->handles[0], 
mode_cmd->handles[i]);
ret = -EINVAL;
-   goto fail;
+   return ret;
}
}
 
	ret = amdgpu_display_get_fb_info(rfb, &rfb->tiling_flags, &rfb->tmz_surface);
if (ret)
-   goto fail;
+   return ret;
 
if (dev->mode_config.allow_fb_modifiers &&
!(rfb->base.flags & DRM_MODE_FB_MODIFIERS)) {
@@ -905,7 +950,7 @@ int amdgpu_display_framebuffer_init(struct drm_device *dev,
if (ret) {
drm_dbg_kms(dev, "Failed to convert tiling flags 0x%llX 
to a modifier",
rfb->tiling_flags);
-   goto fail;
+   return ret;
}
}
 
@@ -915,10 +960,6 @@ int amdgpu_display_framebuffer_init(struct drm_device *dev,
}
 
return 0;
-
-fail:
-   rfb->base.obj[0] = NULL;
-   return ret;
 }
 
 struct drm_framebuffer *
@@ -953,7 +994,8 @@ amdgpu_display_user_framebuffer_create(struct drm_device 
*dev,
return ERR_PTR(-ENOMEM);
}
 
-   ret = amdgpu_display_framebuffer_init(dev, amdgpu_fb, mode_cmd, obj);
+   ret = amdgpu_display_gem_fb_verify_and_init(dev, amdgpu_fb, file_priv,
+   mode_cmd, obj);
if (ret) {
kfree(amdgpu_fb);
drm_gem_object_put(obj);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c 

Re: [PATCH] drm/amd/display: remove duplicate include in dcn21 and gpio

2021-03-08 Thread Alex Deucher
Applied.  Thanks!

Alex

On Sat, Mar 6, 2021 at 6:05 AM  wrote:
>
> From: Zhang Yunkai 
>
> 'dce110_resource.h' included in 'dcn21_resource.c' is duplicated.
> 'hw_gpio.h' included in 'hw_factory_dce110.c' is duplicated.
>
> Signed-off-by: Zhang Yunkai 
> ---
>  drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c | 1 -
>  .../gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c| 4 
>  2 files changed, 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c 
> b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
> index 072f8c880924..8a6a965751e8 100644
> --- a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
> +++ b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
> @@ -61,7 +61,6 @@
>  #include "dcn21/dcn21_dccg.h"
>  #include "dcn21_hubbub.h"
>  #include "dcn10/dcn10_resource.h"
> -#include "dce110/dce110_resource.h"
>  #include "dce/dce_panel_cntl.h"
>
>  #include "dcn20/dcn20_dwb.h"
> diff --git a/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c 
> b/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c
> index 66e4841f41e4..ca335ea60412 100644
> --- a/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c
> +++ b/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c
> @@ -48,10 +48,6 @@
>  #define REGI(reg_name, block, id)\
> mm ## block ## id ## _ ## reg_name
>
> -#include "../hw_gpio.h"
> -#include "../hw_ddc.h"
> -#include "../hw_hpd.h"
> -
>  #include "reg_helper.h"
>  #include "../hpd_regs.h"
>
> --
> 2.25.1
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/display: remove duplicate include in amdgpu_dm.c

2021-03-08 Thread Alex Deucher
Applied.  Thanks!

Alex

On Sat, Mar 6, 2021 at 5:48 AM  wrote:
>
> From: Zhang Yunkai 
>
> 'drm/drm_hdcp.h' included in 'amdgpu_dm.c' is duplicated.
> It is also included in the 79th line.
>
> Signed-off-by: Zhang Yunkai 
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 3e1fd1e7d09f..fee46fbcb0b7 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -44,7 +44,6 @@
>  #include "amdgpu_dm.h"
>  #ifdef CONFIG_DRM_AMD_DC_HDCP
>  #include "amdgpu_dm_hdcp.h"
> -#include 
>  #endif
>  #include "amdgpu_pm.h"
>
> --
> 2.25.1
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Verify bo size can fit framebuffer size on init.

2021-03-08 Thread Alex Deucher
On Thu, Mar 4, 2021 at 2:15 PM Mark Yacoub  wrote:
>
> From: Mark Yacoub 
>
> To initialize the framebuffer, use drm_gem_fb_init_with_funcs which
> verifies that the BO size can fit the FB size by calculating the minimum
> expected size of each plane.
>
> The bug was caught using igt-gpu-tools test: kms_addfb_basic.too-high
> and kms_addfb_basic.bo-too-small
>
> Tested on ChromeOS Zork by turning on the display and running a YT
> video.
>
> Cc: Alex Deucher 
> Cc: "Christian König" 
> Cc: Sean Paul 
> Signed-off-by: Mark Yacoub 

Thanks for the patch.  Just a few minor comments below.

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 66 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c  |  4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mode.h|  8 +++
>  3 files changed, 62 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> index 48cb33e5b3826..554038e5bbf6a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> @@ -870,18 +870,59 @@ static int amdgpu_display_get_fb_info(const struct 
> amdgpu_framebuffer *amdgpu_fb
> return r;
>  }
>
> -int amdgpu_display_framebuffer_init(struct drm_device *dev,
> -   struct amdgpu_framebuffer *rfb,
> -   const struct drm_mode_fb_cmd2 *mode_cmd,
> -   struct drm_gem_object *obj)
> +int amdgpu_display_gem_fb_init(struct drm_device *dev,
> +  struct amdgpu_framebuffer *rfb,
> +  const struct drm_mode_fb_cmd2 *mode_cmd,
> +  struct drm_gem_object *obj)
>  {
> -   int ret, i;
> +   int ret;

Please add a new line here.

> rfb->base.obj[0] = obj;
> drm_helper_mode_fill_fb_struct(dev, &rfb->base, mode_cmd);
> ret = drm_framebuffer_init(dev, &rfb->base, &amdgpu_fb_funcs);
> if (ret)
> -   goto fail;
> +   goto err;
> +
> +   ret = amdgpu_display_framebuffer_init(dev, rfb, mode_cmd, obj);
> +   if (ret)
> +   goto err;
> +
> +   return 0;
> +err:
> +   drm_err(dev, "Failed to init gem fb: %d\n", ret);
> +   rfb->base.obj[0] = NULL;
> +   return ret;
> +}
> +
> +int amdgpu_display_gem_fb_verify_and_init(
> +   struct drm_device *dev, struct amdgpu_framebuffer *rfb,
> +   struct drm_file *file_priv, const struct drm_mode_fb_cmd2 *mode_cmd,
> +   struct drm_gem_object *obj)
> +{
> +   int ret;
> +   rfb->base.obj[0] = obj;
> +   // Verify that bo size can fit the fb size.

Please change this to use C style comments.

> +   ret = drm_gem_fb_init_with_funcs(dev, &rfb->base, file_priv, mode_cmd,
> +&amdgpu_fb_funcs);
> +   if (ret)
> +   goto err;
>
> +   ret = amdgpu_display_framebuffer_init(dev, rfb, mode_cmd, obj);
> +   if (ret)
> +   goto err;
> +
> +   return 0;
> +err:
> +   drm_err(dev, "Failed to verify and init gem fb: %d\n", ret);
> +   rfb->base.obj[0] = NULL;
> +   return ret;
> +}
> +
> +int amdgpu_display_framebuffer_init(struct drm_device *dev,
> +   struct amdgpu_framebuffer *rfb,
> +   const struct drm_mode_fb_cmd2 *mode_cmd,
> +   struct drm_gem_object *obj)
> +{
> +   int ret, i;

New line here.

> /*
>  * This needs to happen before modifier conversion as that might 
> change
>  * the number of planes.
> @@ -891,13 +932,13 @@ int amdgpu_display_framebuffer_init(struct drm_device 
> *dev,
> drm_dbg_kms(dev, "Plane 0 and %d have different BOs: 
> %u vs. %u\n",
> i, mode_cmd->handles[0], 
> mode_cmd->handles[i]);
> ret = -EINVAL;
> -   goto fail;
> +   return ret;
> }
> }
>
> ret = amdgpu_display_get_fb_info(rfb, &rfb->tiling_flags, &rfb->tmz_surface);
> if (ret)
> -   goto fail;
> +   return ret;
>
> if (dev->mode_config.allow_fb_modifiers &&
> !(rfb->base.flags & DRM_MODE_FB_MODIFIERS)) {
> @@ -905,7 +946,7 @@ int amdgpu_display_framebuffer_init(struct drm_device 
> *dev,
> if (ret) {
> drm_dbg_kms(dev, "Failed to convert tiling flags 
> 0x%llX to a modifier",
> rfb->tiling_flags);
> -   goto fail;
> +   return ret;
> }
> }
>
> @@ -915,10 +956,6 @@ int amdgpu_display_framebuffer_init(struct drm_device 
> *dev,
> }
>
> return 0;
> -
> -fail:
> -   rfb->base.obj[0] = NULL;
> -   return ret;
>  }
>
>  struct drm_framebuffer *
> @@ -953,7 

Re: [PATCH] drm/amd/pm: correct the watermark settings for Polaris

2021-03-08 Thread Alex Deucher
On Fri, Mar 5, 2021 at 1:25 AM Evan Quan  wrote:
>
> The "/ 10" should be applied to the right-hand operand instead of
> the left-hand one.
>
> Change-Id: Ie730a1981aa5dee45cd6c3efccc7fb0f088cd679
> Signed-off-by: Evan Quan 
> Noticed-by: Georgios Toptsidis 

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c 
> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
> index c57dc9ae81f2..a2681fe875ed 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
> @@ -5216,10 +5216,10 @@ static int 
> smu7_set_watermarks_for_clocks_ranges(struct pp_hwmgr *hwmgr,
> for (j = 0; j < dep_sclk_table->count; j++) {
> valid_entry = false;
> for (k = 0; k < watermarks->num_wm_sets; k++) {
> -   if (dep_sclk_table->entries[i].clk / 10 >= 
> watermarks->wm_clk_ranges[k].wm_min_eng_clk_in_khz &&
> -   dep_sclk_table->entries[i].clk / 10 < 
> watermarks->wm_clk_ranges[k].wm_max_eng_clk_in_khz &&
> -   dep_mclk_table->entries[i].clk / 10 >= 
> watermarks->wm_clk_ranges[k].wm_min_mem_clk_in_khz &&
> -   dep_mclk_table->entries[i].clk / 10 < 
> watermarks->wm_clk_ranges[k].wm_max_mem_clk_in_khz) {
> +   if (dep_sclk_table->entries[i].clk >= 
> watermarks->wm_clk_ranges[k].wm_min_eng_clk_in_khz / 10 &&
> +   dep_sclk_table->entries[i].clk < 
> watermarks->wm_clk_ranges[k].wm_max_eng_clk_in_khz / 10 &&
> +   dep_mclk_table->entries[i].clk >= 
> watermarks->wm_clk_ranges[k].wm_min_mem_clk_in_khz / 10 &&
> +   dep_mclk_table->entries[i].clk < 
> watermarks->wm_clk_ranges[k].wm_max_mem_clk_in_khz / 10) {
> valid_entry = true;
> table->DisplayWatermark[i][j] = 
> watermarks->wm_clk_ranges[k].wm_set_id;
> break;
> --
> 2.29.0
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] gpu: drm: swsmu: fix error return code of smu_v11_0_set_allowed_mask()

2021-03-08 Thread Alex Deucher
Applied.  Thanks!

Alex

On Thu, Mar 4, 2021 at 11:02 PM Quan, Evan  wrote:
>
> [AMD Public Use]
>
> Thanks. Reviewed-by: Evan Quan 
>
> -Original Message-
> From: Jia-Ju Bai 
> Sent: Friday, March 5, 2021 11:54 AM
> To: Deucher, Alexander ; Koenig, Christian 
> ; airl...@linux.ie; dan...@ffwll.ch; Quan, Evan 
> ; Zhang, Hawking ; Wang, 
> Kevin(Yang) ; Gao, Likun 
> Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
> linux-ker...@vger.kernel.org; Jia-Ju Bai 
> Subject: [PATCH] gpu: drm: swsmu: fix error return code of 
> smu_v11_0_set_allowed_mask()
>
> When the bitmap_empty() or feature->feature_num check fails, 
> smu_v11_0_set_allowed_mask() does not assign an error return code.
> To fix this bug, ret is assigned -EINVAL as the error return code.
>
> Reported-by: TOTE Robot 
> Signed-off-by: Jia-Ju Bai 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> index 90585461a56e..82731a932308 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> @@ -747,8 +747,10 @@ int smu_v11_0_set_allowed_mask(struct smu_context *smu)
> int ret = 0;
> uint32_t feature_mask[2];
>
> -   if (bitmap_empty(feature->allowed, SMU_FEATURE_MAX) || 
> feature->feature_num < 64)
> +   if (bitmap_empty(feature->allowed, SMU_FEATURE_MAX) || 
> feature->feature_num < 64) {
> +   ret = -EINVAL;
> goto failed;
> +   }
>
> bitmap_copy((unsigned long *)feature_mask, feature->allowed, 64);
>
> --
> 2.17.1
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: add ih waiter on process until checkpoint

2021-03-08 Thread Kim, Jonathan
[AMD Official Use Only - Internal Distribution Only]

> -Original Message-
> From: Koenig, Christian 
> Sent: Saturday, March 6, 2021 4:12 AM
> To: Kim, Jonathan ; Christian König
> ; amd-gfx@lists.freedesktop.org
> Cc: Yang, Philip ; Kuehling, Felix
> 
> Subject: Re: [PATCH] drm/amdgpu: add ih waiter on process until checkpoint
>
>
>
> Am 05.03.21 um 22:34 schrieb Kim, Jonathan:
> > [AMD Official Use Only - Internal Distribution Only]
> >
> >> -Original Message-
> >> From: Christian König 
> >> Sent: Friday, March 5, 2021 3:18 PM
> >> To: Kim, Jonathan ; amd-
> >> g...@lists.freedesktop.org
> >> Cc: Yang, Philip ; Kuehling, Felix
> >> ; Koenig, Christian
> >> 
> >> Subject: Re: [PATCH] drm/amdgpu: add ih waiter on process until
> >> checkpoint
> >>
> >> [CAUTION: External Email]
> >>
> >> Am 05.03.21 um 21:11 schrieb Jonathan Kim:
> >>> Add IH function to allow caller to wait until ring entries are
> >>> processed until the checkpoint write pointer.
> >>>
> >>> This will be primarily used by HMM to drain pending page fault
> >>> interrupts before memory unmap to prevent HMM from handling stale
> >> interrupts.
> >>> v2: Update title and description to clarify use.
> >>> Add rptr/wptr wrap counter checks to guarantee ring entries are
> >>> processed until the checkpoint.
> >> First of all as I said please use a wait_event, busy waiting is a clear 
> >> NAK.
> > Who would do the wake though?  Are you suggesting wake be done in
> amdgpu_ih_process?  Or is waiting happening by the caller and this should go
> somewhere higher (like amdgpu_amdkfd for example)?
> >
> >>> Signed-off-by: Jonathan Kim 
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 68 +-
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h |  2 +
> >>>2 files changed, 69 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> >>> index dc852af4f3b7..954518b4fb79 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> >>> @@ -22,7 +22,7 @@
> >>> */
> >>>
> >>>#include 
> >>> -
> >>> +#include 
> >>>#include "amdgpu.h"
> >>>#include "amdgpu_ih.h"
> >>>
> >>> @@ -160,6 +160,72 @@ void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const uint32_t *iv,
> >>>}
> >>>}
> >>>
> >>> +/**
> >>> + * amdgpu_ih_wait_on_checkpoint_process - wait to process IVs up to checkpoint
> >>> + *
> >>> + * @adev: amdgpu_device pointer
> >>> + * @ih: ih ring to process
> >>> + *
> >>> + * Used to ensure ring has processed IVs up to the checkpoint write pointer.
> >>> + */
> >>> +int amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device *adev,
> >>> + struct amdgpu_ih_ring *ih) {
> >>> + u64 rptr_check, wptr_check, rptr_wrap = 0, wptr_wrap = 0;
> >>> + u32 prev_rptr, prev_wptr;
> >>> +
> >>> + if (!ih->enabled || adev->shutdown)
> >>> + return -ENODEV;
> >>> +
> >>> + prev_wptr = amdgpu_ih_get_wptr(adev, ih);
> >>> + /* Order wptr with rptr. */
> >>> + rmb();
> >>> + prev_rptr = READ_ONCE(ih->rptr);
> >>> + rptr_check = prev_rptr | (rptr_wrap << 32);
> >>> + wptr_check = prev_wptr | (wptr_wrap << 32);
> >> Hui what? That check doesn't even make remotely sense to me.
> > Can you clarify what you meant by creating a new 64 bit compare?
> > Snip from your last response:
> >
> > "This way you do something like this:
> > 1. get the wrap around counter.
> > 2. get wptr
> > 3. get rptr
> > 4. compare the wrap around counter with the old one, if it has changed
> > start over with #1 5. Use wrap around counter and rptr/wptr to come up
> with 64bit values.
> > 6. Compare wptr with rptr/wrap around counter until we are sure the IHs
> are processed."
> >
> >  From a discussion with Felix, I interpreted this as a way to guarantee
> rptr/wptr ordering so that rptr monotonically follows wptr per check.
> > I'm assuming rptr/wptrs are 32 bits wide by the use of ptr_mask on
> read/write functions so a respective mask of rptr/wptr wrap count to the top
> 32 bits would mark how far apart the rptr and wptr are per check.
>
> Mhm, sounds like my description was a bit confusing. Let me try again.
>
> First of all rptr/wptr are not 32bit, their maximum is 20 or 19 bits IIRC (and
> they are dw, so 4M or 2M bytes).
>

Thanks Christian.  This makes sense now.  I can see how rptrs advance by dword 
sets in the iv decode helper.
My apologies, but I'm still a bit confused on the pseudo code below and have a 
few questions before I give this another go ...

> Then the purpose of the wait_event() is to wait for changes of the rptr, so
> the matching wake_up() should be at the same place as calling
> amdgpu_ih_set_rptr().
>
> My original idea of the wrap around counter assumes that the counter is
> updated in amdgpu_ih_process(). That isn't strictly necessary, but it could 

[PATCH] [variant b] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Arnd Bergmann
From: Arnd Bergmann 

Using 'imply AMD_IOMMU_V2' does not guarantee that the driver can link
against the exported functions. If the GPU driver is built-in but the
IOMMU driver is a loadable module, the kfd_iommu.c file is indeed
built but does not work:

x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_bind_process_to_device':
kfd_iommu.c:(.text+0x516): undefined reference to `amd_iommu_bind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_unbind_process':
kfd_iommu.c:(.text+0x691): undefined reference to `amd_iommu_unbind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_suspend':
kfd_iommu.c:(.text+0x966): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x97f): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x9a4): undefined reference to 
`amd_iommu_free_device'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_resume':
kfd_iommu.c:(.text+0xa9a): undefined reference to `amd_iommu_init_device'
x86_64-linux-ld: kfd_iommu.c:(.text+0xadc): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xaff): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xc72): undefined reference to 
`amd_iommu_bind_pasid'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe08): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe26): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe42): undefined reference to 
`amd_iommu_free_device'

Change the 'imply' to a weak dependency that still allows compiling
in all other configurations but disallows the configuration that
causes a link failure.

Fixes: 64d1c3a43a6f ("drm/amdkfd: Centralize IOMMUv2 code and make it 
conditional")
Signed-off-by: Arnd Bergmann 
---
 drivers/gpu/drm/amd/amdkfd/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig 
b/drivers/gpu/drm/amd/amdkfd/Kconfig
index f02c938f75da..d01dba2af3bb 100644
--- a/drivers/gpu/drm/amd/amdkfd/Kconfig
+++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
@@ -6,7 +6,7 @@
 config HSA_AMD
bool "HSA kernel driver for AMD GPU devices"
depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
-   imply AMD_IOMMU_V2 if X86_64
+   depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m
select HMM_MIRROR
select MMU_NOTIFIER
select DRM_AMDGPU_USERPTR
-- 
2.29.2



Re: [PATCH 3/6] amd/display: fail on cursor plane without an underlying plane

2021-03-08 Thread Kazlauskas, Nicholas

On 2021-03-08 3:18 p.m., Daniel Vetter wrote:

On Fri, Mar 5, 2021 at 10:24 AM Michel Dänzer  wrote:


On 2021-03-04 7:26 p.m., Kazlauskas, Nicholas wrote:

On 2021-03-04 10:35 a.m., Michel Dänzer wrote:

On 2021-03-04 4:09 p.m., Kazlauskas, Nicholas wrote:

On 2021-03-04 4:05 a.m., Michel Dänzer wrote:

On 2021-03-03 8:17 p.m., Daniel Vetter wrote:

On Wed, Mar 3, 2021 at 5:53 PM Michel Dänzer 
wrote:


Moreover, in the same scenario plus an overlay plane enabled with a
HW cursor compatible format, if the FB bound to the overlay plane is
destroyed, the common DRM code will attempt to disable the overlay
plane, but dm_check_crtc_cursor will reject that now. I can't
remember
exactly what the result is, but AFAIR it's not pretty.


CRTC gets disabled instead. That's why we went with the "always
require primary plane" hack. I think the only solution here would be
to enable the primary plane (but not in userspace-visible state, so
this needs to be done in the dc derived state objects only) that scans
out black any time we're in such a situation with cursor with no
planes.


This is about a scenario described by Nicholas earlier:

Cursor Plane - ARGB

Overlay Plane - ARGB Desktop/UI with cutout for video

Primary Plane - NV12 video

And destroying the FB bound to the overlay plane. The fallback to
disable
the CRTC in atomic_remove_fb only kicks in for the primary plane, so it
wouldn't in this case and would fail. Which would in turn trigger the
WARN in drm_framebuffer_remove (and leave the overlay plane scanning
out
from freed memory?).


The cleanest solution might be not to allow any formats incompatible
with
the HW cursor for the primary plane.


Legacy X userspace doesn't use overlays but Chrome OS does.

This would regress ChromeOS MPO support because it relies on the NV12
video plane being on the bottom.


Could it use the NV12 overlay plane below the ARGB primary plane?


Plane ordering was previously undefined in DRM so we have userspace that
assumes overlays are on top.


They can still be by default?


Today we have the z-order property in DRM that defines where it is in
the stack, so technically it could but we'd also be regressing existing
behavior on Chrome OS today.


That's unfortunate, but might be the least bad choice overall.

BTW, doesn't Chrome OS try to disable the ARGB overlay plane while there are no 
UI elements to display? If it does, this series might break it anyway (if the 
cursor plane can be enabled while the ARGB overlay plane is off).



When ChromeOS disables MPO it doesn't do it plane by plane, it does it
in one go from NV12+ARGB -> ARGB8.


Even so, we cannot expect all user space to do the same, and we cannot
allow any user space to trigger a WARN and scanout from freed memory.


The WARN doesn't trigger because there's still a reference on the FB -


The WARN triggers if atomic_remove_fb returns an error, which is the case if it can't 
disable an overlay plane. I actually hit this with IGT tests while working on 
b836a274b797 "drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC 
is" (I initially tried allowing the cursor plane to be enabled together with an 
overlay plane while the primary plane is off).


the reference held by DRM since it's still scanning out the overlay.
Userspace can't reclaim this memory with another buffer allocation
because it's still in use.


Good point, so at least there's no scanout of freed memory. Even so, the 
overlay plane continues displaying contents which user space apparently doesn't 
want to be displayed anymore.


Hm I do wonder how much we need to care for this. If you use planes,
you better use TEST_ONLY in atomic to its full extent (including
cursor, if that's a real plane, which it is for every driver except
msm/mdp4). If userspace screws this up and worse, shuts off planes with
an RMFB, I think it's not entirely unreasonable to claim that it
should keep the pieces.

So maybe we should refine the WARN_ON to not trigger if other planes
than crtc->primary and crtc->cursor are enabled right now?


It's a little odd that a disable commit can fail, but I don't think
there's anything in DRM core that specifies that this can't happen for
planes.


I'd say it's more than just a little odd. :) Being unable to disable an overlay 
plane seems very surprising, and could make it tricky for user space (not to 
mention core DRM code like atomic_remove_fb) to find a solution.

I'd suggest the amdgpu DM code should rather virtualize the KMS API planes 
somehow such that an overlay plane can always be disabled. While this might 
incur some short-term pain, it will likely save more pain overall in the long 
term.


Yeah I think this amd dc cursor problem is the first case where
removing a plane can make things worse.

Since the hw is what it is, can't we put a transparent plane with
cursor compatible format in for the case where stuff would fail? So
not fully virtualize the planes (since I don't see how that helps),

Re: [PATCH 3/5] drm/amdgpu: fb BO should be ttm_bo_type_device

2021-03-08 Thread Christian König

Am 08.03.21 um 21:34 schrieb Alex Deucher:

On Mon, Mar 8, 2021 at 3:20 PM Christian König  wrote:

Am 08.03.21 um 16:37 schrieb Nirmoy Das:

FB BO should not be ttm_bo_type_kernel type and
amdgpufb_create_pinned_object() pins the FB BO anyway.

Mhm, why the heck was that a kernel object?

Maybe because the fbcon was the main user for this historically and
the code was copied from radeon which also still sets it to kernel.


That's most likely wrong for radeon as well.

All BOs which can be mapped using mmap() into an userspace process 
should be of type device if I'm not completely mistaken.


Going to double check that stuff when I have time.

Thanks for pointing this out Nirmoy.

Christian.



Alex


Signed-off-by: Nirmoy Das 

Acked-by: Christian König 


---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
index 51cd49c6f38f..24010cacf7d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
@@ -146,7 +146,7 @@ static int amdgpufb_create_pinned_object(struct 
amdgpu_fbdev *rfbdev,
   size = mode_cmd->pitches[0] * height;
   aligned_size = ALIGN(size, PAGE_SIZE);
   ret = amdgpu_gem_object_create(adev, aligned_size, 0, domain, flags,
-ttm_bo_type_kernel, NULL, );
+ttm_bo_type_device, NULL, );
   if (ret) {
   pr_err("failed to allocate framebuffer (%d)\n", aligned_size);
   return -ENOMEM;



Re: [PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Arnd Bergmann
On Mon, Mar 8, 2021 at 9:12 PM Christian König
 wrote:
> Am 08.03.21 um 21:02 schrieb Felix Kuehling:
> > Am 2021-03-08 um 2:33 p.m. schrieb Arnd Bergmann:

> > I don't want to create a hard dependency on AMD_IOMMU_V2 if I can avoid
> > it, because it is only really needed for a small number of AMD APUs, and
> > even there it is now optional for more recent ones.
> >
> > Is there a better way to avoid build failures without creating a hard
> > dependency?
>
> What you need is the same trick we used for AGP on radeon/nouveau:
>
> depends on AMD_IOMMU_V2 || !AMD_IOMMU_V2
>
This way, when AMD_IOMMU_V2 is built as a module, DRM_AMDGPU will be built
as a module as well. When it is disabled completely we don't care.

Note that this trick only works if you put it into the DRM_AMDGPU Kconfig option
itself, since that decides if the driver is built-in or a loadable module. If
HSA_AMD is disabled, that dependency is not really necessary.

The version I suggested -- adding "depends on AMD_IOMMU_V2=y ||
DRM_AMDGPU=m" to the HSA_AMD option -- might be slightly nicer
since it lets you still build a kernel with DRM_AMDGPU=y and
AMD_IOMMU_V2=m, just without HSA_AMD.
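The two Kconfig placements under discussion, side by side (sketch only; option bodies abbreviated from the patches in this thread):

```kconfig
# Variant A (Christian): at the driver level. If AMD_IOMMU_V2=m, Kconfig
# forces DRM_AMDGPU=m as well; if it is disabled, nothing changes.
config DRM_AMDGPU
	tristate "AMD GPU"
	depends on AMD_IOMMU_V2 || !AMD_IOMMU_V2

# Variant B (Arnd): at the feature level. DRM_AMDGPU=y with
# AMD_IOMMU_V2=m stays buildable, but then HSA_AMD is unavailable.
config HSA_AMD
	bool "HSA kernel driver for AMD GPU devices"
	depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
	depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m
```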

I can send a patch with either of those two options to replace my
original patch.

> >The documentation in
> > Documentation/kbuild/kconfig-language.rst suggests using if
> > (IS_REACHABLE(CONFIG_AMD_IOMMU_V2)) to guard those problematic function
> > calls. I think more generally, we could guard all of kfd_iommu.c with
> >
> >  #if IS_REACHABLE(CONFIG_AMD_IOMMU_V2)
> >
> > And use the same condition to define the stubs in kfd_iommu.h.

This would fix the compile-time error, but it's also the one I'd least
recommend out of all the options, because that causes the driver to
silently not work as expected. This seems even worse than failing
the build.

   Arnd


Re: [PATCH 3/5] drm/amdgpu: fb BO should be ttm_bo_type_device

2021-03-08 Thread Alex Deucher
On Mon, Mar 8, 2021 at 3:20 PM Christian König  wrote:
>
> Am 08.03.21 um 16:37 schrieb Nirmoy Das:
> > FB BO should not be ttm_bo_type_kernel type and
> > amdgpufb_create_pinned_object() pins the FB BO anyway.
>
> Mhm, why the heck was that a kernel object?

Maybe because the fbcon was the main user for this historically and
the code was copied from radeon which also still sets it to kernel.

Alex

>
> >
> > Signed-off-by: Nirmoy Das 
>
> Acked-by: Christian König 
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
> > index 51cd49c6f38f..24010cacf7d0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
> > @@ -146,7 +146,7 @@ static int amdgpufb_create_pinned_object(struct 
> > amdgpu_fbdev *rfbdev,
> >   size = mode_cmd->pitches[0] * height;
> >   aligned_size = ALIGN(size, PAGE_SIZE);
> >   ret = amdgpu_gem_object_create(adev, aligned_size, 0, domain, flags,
> > -ttm_bo_type_kernel, NULL, );
> > +ttm_bo_type_device, NULL, );
> >   if (ret) {
> >   pr_err("failed to allocate framebuffer (%d)\n", aligned_size);
> >   return -ENOMEM;
>


Re: [PATCH 2/5] drm/amdgpu: introduce struct amdgpu_bo_user

2021-03-08 Thread Christian König

Am 08.03.21 um 16:37 schrieb Nirmoy Das:

Implement a new struct amdgpu_bo_user as subclass of
struct amdgpu_bo and a function to created amdgpu_bo_user
bo with a flag to identify the owner.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 28 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 14 +++
  2 files changed, 42 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index d32379cbad89..abfeb8304894 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -695,6 +695,34 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
return r;
  }
  
+/**
+ * amdgpu_bo_create_user - create an &amdgpu_bo_user buffer object
+ * @adev: amdgpu device object
+ * @bp: parameters to be used for the buffer object
+ * @ubo_ptr: pointer to the buffer object pointer
+ *
+ * Create a BO to be used by user application;
+ *
+ * Returns:
+ * 0 for success or a negative error code on failure.
+ */
+
+int amdgpu_bo_create_user(struct amdgpu_device *adev,
+ struct amdgpu_bo_param *bp,
+ struct amdgpu_bo_user **ubo_ptr)
+{
+   struct amdgpu_bo *bo_ptr;
+   int r;
+
+   bp->flags = bp->flags & ~AMDGPU_GEM_CREATE_SHADOW;
+   bp->bo_ptr_size = sizeof(struct amdgpu_bo_user);
+   r = amdgpu_bo_do_create(adev, bp, &bo_ptr);
+   if (r)
+   return r;
+
+   *ubo_ptr = amdgpu_bo_to_amdgpu_bo_user(bo_ptr);
+   return r;
+}
  /**
   * amdgpu_bo_validate - validate an &amdgpu_bo buffer object
   * @bo: pointer to the buffer object
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 8e2b556f0b7b..fd30221266c8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -37,6 +37,8 @@
  #define AMDGPU_BO_INVALID_OFFSET  LONG_MAX
  #define AMDGPU_BO_MAX_PLACEMENTS  3
  
+#define amdgpu_bo_to_amdgpu_bo_user(abo) container_of((abo), struct amdgpu_bo_user, bo)


Mhm, the name could be improved, but apart from that the patch looks 
good to me.


Christian.


+
  struct amdgpu_bo_param {
unsigned long   size;
int byte_align;
@@ -112,6 +114,15 @@ struct amdgpu_bo {
struct kgd_mem  *kfd_bo;
  };
  
+struct amdgpu_bo_user {

+   struct amdgpu_bo bo;
+   u64 tiling_flags;
+   u64 metadata_flags;
+   void*metadata;
+   u32 metadata_size;
+
+};
+
  static inline struct amdgpu_bo *ttm_to_amdgpu_bo(struct ttm_buffer_object 
*tbo)
  {
return container_of(tbo, struct amdgpu_bo, tbo);
@@ -255,6 +266,9 @@ int amdgpu_bo_create_kernel(struct amdgpu_device *adev,
  int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev,
   uint64_t offset, uint64_t size, uint32_t domain,
   struct amdgpu_bo **bo_ptr, void **cpu_addr);
+int amdgpu_bo_create_user(struct amdgpu_device *adev,
+ struct amdgpu_bo_param *bp,
+ struct amdgpu_bo_user **ubo_ptr);
  void amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64 *gpu_addr,
   void **cpu_addr);
  int amdgpu_bo_kmap(struct amdgpu_bo *bo, void **ptr);




Re: [PATCH 5/5] drm/amdgpu: use amdgpu_bo_user bo for metadata and tiling flag

2021-03-08 Thread Christian König

Am 08.03.21 um 16:37 schrieb Nirmoy Das:

Tiling flag and metadata are only needed for BOs created by
amdgpu_gem_object_create(), so we can remove those from the
base class.

CC: felix.kuehl...@amd.com
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 59 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  4 --
  3 files changed, 43 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 00ac5c272f47..04a19cdc08c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -496,8 +496,6 @@ int amdgpu_amdkfd_get_dmabuf_info(struct kgd_dev *kgd, int dma_buf_fd,
*dma_buf_kgd = (struct kgd_dev *)adev;
if (bo_size)
*bo_size = amdgpu_bo_size(bo);
-   if (metadata_size)
-   *metadata_size = bo->metadata_size;
if (metadata_buffer)
r = amdgpu_bo_get_metadata(bo, metadata_buffer, buffer_size,
   metadata_size, _flags);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index abfeb8304894..c105ba96dd58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -77,6 +77,7 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
  {
struct amdgpu_device *adev = amdgpu_ttm_adev(tbo->bdev);
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(tbo);
+   struct amdgpu_bo_user *ubo;

if (bo->tbo.pin_count > 0)
amdgpu_bo_subtract_pin_size(bo);
@@ -94,7 +95,11 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
}
amdgpu_bo_unref(&bo->parent);

-   kfree(bo->metadata);
+   if (bo->tbo.type == ttm_bo_type_device) {
+   ubo = container_of((bo), struct amdgpu_bo_user, bo);


You could use your new casting macro here.


+   kfree(ubo->metadata);
+   }
+
kfree(bo);
  }

@@ -1161,12 +1166,15 @@ int amdgpu_bo_fbdev_mmap(struct amdgpu_bo *bo,
  int amdgpu_bo_set_tiling_flags(struct amdgpu_bo *bo, u64 tiling_flags)
  {
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+   struct amdgpu_bo_user *ubo;

+   BUG_ON(bo->tbo.type != ttm_bo_type_device);
if (adev->family <= AMDGPU_FAMILY_CZ &&
AMDGPU_TILING_GET(tiling_flags, TILE_SPLIT) > 6)
return -EINVAL;

-   bo->tiling_flags = tiling_flags;
+   ubo = container_of((bo), struct amdgpu_bo_user, bo);
+   ubo->tiling_flags = tiling_flags;
return 0;
  }

@@ -1180,10 +1188,14 @@ int amdgpu_bo_set_tiling_flags(struct amdgpu_bo *bo, u64 tiling_flags)
   */
  void amdgpu_bo_get_tiling_flags(struct amdgpu_bo *bo, u64 *tiling_flags)
  {
+   struct amdgpu_bo_user *ubo;
+
+   BUG_ON(bo->tbo.type != ttm_bo_type_device);
dma_resv_assert_held(bo->tbo.base.resv);
+   ubo = amdgpu_bo_to_amdgpu_bo_user(bo);

if (tiling_flags)
-   *tiling_flags = bo->tiling_flags;
+   *tiling_flags = ubo->tiling_flags;
  }

  /**
@@ -1202,13 +1214,20 @@ void amdgpu_bo_get_tiling_flags(struct amdgpu_bo *bo, u64 *tiling_flags)
  int amdgpu_bo_set_metadata (struct amdgpu_bo *bo, void *metadata,
uint32_t metadata_size, uint64_t flags)
  {
+   struct amdgpu_bo_user *ubo;
void *buffer;

+   if (bo->tbo.type != ttm_bo_type_device) {
+   DRM_ERROR("can not set metadata for a non-amdgpu_bo_user type BO\n");
+   return -EINVAL;
+   }


Either BUG_ON or DRM_ERROR, but keep that consistent please.

Christian.


+
+   ubo = amdgpu_bo_to_amdgpu_bo_user(bo);
if (!metadata_size) {
-   if (bo->metadata_size) {
-   kfree(bo->metadata);
-   bo->metadata = NULL;
-   bo->metadata_size = 0;
+   if (ubo->metadata_size) {
+   kfree(ubo->metadata);
+   ubo->metadata = NULL;
+   ubo->metadata_size = 0;
}
return 0;
}
@@ -1220,10 +1239,10 @@ int amdgpu_bo_set_metadata (struct amdgpu_bo *bo, void *metadata,
if (buffer == NULL)
return -ENOMEM;

-   kfree(bo->metadata);
-   bo->metadata_flags = flags;
-   bo->metadata = buffer;
-   bo->metadata_size = metadata_size;
+   kfree(ubo->metadata);
+   ubo->metadata_flags = flags;
+   ubo->metadata = buffer;
+   ubo->metadata_size = metadata_size;

return 0;
  }
@@ -1247,21 +1266,29 @@ int amdgpu_bo_get_metadata(struct amdgpu_bo *bo, void *buffer,
   size_t buffer_size, uint32_t *metadata_size,
   uint64_t *flags)
  {
+   struct amdgpu_bo_user *ubo;
+

Re: [PATCH 3/5] drm/amdgpu: fb BO should be ttm_bo_type_device

2021-03-08 Thread Christian König

Am 08.03.21 um 16:37 schrieb Nirmoy Das:

FB BO should not be ttm_bo_type_kernel type and
amdgpufb_create_pinned_object() pins the FB BO anyway.


Mhm, why the heck was that a kernel object?



Signed-off-by: Nirmoy Das 


Acked-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
index 51cd49c6f38f..24010cacf7d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
@@ -146,7 +146,7 @@ static int amdgpufb_create_pinned_object(struct 
amdgpu_fbdev *rfbdev,
size = mode_cmd->pitches[0] * height;
aligned_size = ALIGN(size, PAGE_SIZE);
ret = amdgpu_gem_object_create(adev, aligned_size, 0, domain, flags,
-  ttm_bo_type_kernel, NULL, );
+  ttm_bo_type_device, NULL, );
if (ret) {
pr_err("failed to allocate framebuffer (%d)\n", aligned_size);
return -ENOMEM;




Re: [PATCH 3/6] amd/display: fail on cursor plane without an underlying plane

2021-03-08 Thread Daniel Vetter
On Fri, Mar 5, 2021 at 10:24 AM Michel Dänzer  wrote:
>
> On 2021-03-04 7:26 p.m., Kazlauskas, Nicholas wrote:
> > On 2021-03-04 10:35 a.m., Michel Dänzer wrote:
> >> On 2021-03-04 4:09 p.m., Kazlauskas, Nicholas wrote:
> >>> On 2021-03-04 4:05 a.m., Michel Dänzer wrote:
>  On 2021-03-03 8:17 p.m., Daniel Vetter wrote:
> > On Wed, Mar 3, 2021 at 5:53 PM Michel Dänzer 
> > wrote:
> >>
> >> Moreover, in the same scenario plus an overlay plane enabled with a
> >> HW cursor compatible format, if the FB bound to the overlay plane is
> >> destroyed, the common DRM code will attempt to disable the overlay
> >> plane, but dm_check_crtc_cursor will reject that now. I can't
> >> remember
> >> exactly what the result is, but AFAIR it's not pretty.
> >
> > CRTC gets disabled instead. That's why we went with the "always
> > require primary plane" hack. I think the only solution here would be
> > to enable the primary plane (but not in userspace-visible state, so
> > this needs to be done in the dc derived state objects only) that scans
> > out black any time we're in such a situation with cursor with no
> > planes.
> 
>  This is about a scenario described by Nicholas earlier:
> 
>  Cursor Plane - ARGB
> 
>  Overlay Plane - ARGB Desktop/UI with cutout for video
> 
>  Primary Plane - NV12 video
> 
>  And destroying the FB bound to the overlay plane. The fallback to
>  disable
>  the CRTC in atomic_remove_fb only kicks in for the primary plane, so it
>  wouldn't in this case and would fail. Which would in turn trigger the
>  WARN in drm_framebuffer_remove (and leave the overlay plane scanning
>  out
>  from freed memory?).
> 
> 
>  The cleanest solution might be not to allow any formats incompatible
>  with
>  the HW cursor for the primary plane.
> >>>
> >>> Legacy X userspace doesn't use overlays but Chrome OS does.
> >>>
> >>> This would regress ChromeOS MPO support because it relies on the NV12
> >>> video plane being on the bottom.
> >>
> >> Could it use the NV12 overlay plane below the ARGB primary plane?
> >
> > Plane ordering was previously undefined in DRM so we have userspace that
> > assumes overlays are on top.
>
> They can still be by default?
>
> > Today we have the z-order property in DRM that defines where it is in
> > the stack, so technically it could but we'd also be regressing existing
> > behavior on Chrome OS today.
>
> That's unfortunate, but might be the least bad choice overall.
>
> BTW, doesn't Chrome OS try to disable the ARGB overlay plane while there are 
> no UI elements to display? If it does, this series might break it anyway (if 
> the cursor plane can be enabled while the ARGB overlay plane is off).
>
>
> >>> When ChromeOS disables MPO it doesn't do it plane by plane, it does it
> >>> in one go from NV12+ARGB -> ARGB8.
> >>
> >> Even so, we cannot expect all user space to do the same, and we cannot
> >> allow any user space to trigger a WARN and scanout from freed memory.
> >
> > The WARN doesn't trigger because there's still a reference on the FB -
>
> The WARN triggers if atomic_remove_fb returns an error, which is the case if 
> it can't disable an overlay plane. I actually hit this with IGT tests while 
> working on b836a274b797 "drm/amdgpu/dc: Require primary plane to be enabled 
> whenever the CRTC is" (I initially tried allowing the cursor plane to be 
> enabled together with an overlay plane while the primary plane is off).
>
> > the reference held by DRM since it's still scanning out the overlay.
> > Userspace can't reclaim this memory with another buffer allocation
> > because it's still in use.
>
> Good point, so at least there's no scanout of freed memory. Even so, the 
> overlay plane continues displaying contents which user space apparently 
> doesn't want to be displayed anymore.

Hm I do wonder how much we need to care for this. If you use planes,
you better use TEST_ONLY in atomic to its full extent (including
cursor, if that's a real plane, which it is for every driver except
msm/mdp4). If userspace screws this up and worse, shuts off planes with
an RMFB, I think it's not entirely unreasonable to claim that it
should keep the pieces.

So maybe we should refine the WARN_ON to not trigger if other planes
than crtc->primary and crtc->cursor are enabled right now?

> > It's a little odd that a disable commit can fail, but I don't think
> > there's anything in DRM core that specifies that this can't happen for
> > planes.
>
> I'd say it's more than just a little odd. :) Being unable to disable an 
> overlay plane seems very surprising, and could make it tricky for user space 
> (not to mention core DRM code like atomic_remove_fb) to find a solution.
>
> I'd suggest the amdgpu DM code should rather virtualize the KMS API planes 
> somehow such that an overlay plane can always be disabled. While this might 
> 

Re: [PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Christian König

Am 08.03.21 um 21:02 schrieb Felix Kuehling:

Am 2021-03-08 um 2:33 p.m. schrieb Arnd Bergmann:

On Mon, Mar 8, 2021 at 8:11 PM Felix Kuehling  wrote:

On 2021-03-08 at 2:05 p.m., Arnd Bergmann wrote:

On Mon, Mar 8, 2021 at 5:24 PM Felix Kuehling  wrote:

The driver build should work without IOMMUv2. In amdkfd/Makefile, we
have this condition:

ifneq ($(CONFIG_AMD_IOMMU_V2),)
AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
endif

In amdkfd/kfd_iommu.h we define inline stubs of the functions that are
causing your link-failures if IOMMU_V2 is not enabled:

#if defined(CONFIG_AMD_IOMMU_V2_MODULE) || defined(CONFIG_AMD_IOMMU_V2)
... function declarations ...
#else
... stubs ...
#endif

Right, that is the problem I tried to explain in my patch description.

Should we just drop the 'imply' then and add a proper dependency like this?

   depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
   depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m

I can send a v2 after some testing if you prefer this version.

No. My point is, there should not be a hard dependency. The build should
work without CONFIG_AMD_IOMMU_V2. I don't understand why it's not
working for you. It looks like you're building kfd_iommu.o, which should
not be happening when AMD_IOMMU_V2 is not enabled. The condition in
amdkfd/Makefile should make sure that kfd_iommu.o doesn't get built with
your kernel config.

Again, as I explained in the changelog text, AMD_IOMMU_V2 is configured as
a loadable module, while AMDGPU is configured as built-in.

I'm sorry, I didn't read it carefully. And I thought "imply" was meant
to fix exactly this kind of issue.

I don't want to create a hard dependency on AMD_IOMMU_V2 if I can avoid
it, because it is only really needed for a small number of AMD APUs, and
even there it is now optional for more recent ones.

Is there a better way to avoid build failures without creating a hard
dependency?


What you need is the same trick we used for AGP on radeon/nouveau:

depends on AMD_IOMMU_V2 || !AMD_IOMMU_V2

This way, when AMD_IOMMU_V2 is built as a module, DRM_AMDGPU will be built
as a module as well. When it is disabled completely we don't care.


Regards,
Christian.


   The documentation in
Documentation/kbuild/kconfig-language.rst suggests using if
(IS_REACHABLE(CONFIG_AMD_IOMMU_V2)) to guard those problematic function
calls. I think more generally, we could guard all of kfd_iommu.c with

     #if IS_REACHABLE(CONFIG_AMD_IOMMU_V2)

And use the same condition to define the stubs in kfd_iommu.h.

Regards,
   Felix



That causes a link failure for the vmlinux file, because the linker cannot
resolve addresses of loadable modules at compile time -- they have
not been loaded yet.

   Arnd
___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




Re: [PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Felix Kuehling
On 2021-03-08 at 2:33 p.m., Arnd Bergmann wrote:
> On Mon, Mar 8, 2021 at 8:11 PM Felix Kuehling  wrote:
>> On 2021-03-08 at 2:05 p.m., Arnd Bergmann wrote:
>>> On Mon, Mar 8, 2021 at 5:24 PM Felix Kuehling  
>>> wrote:
 The driver build should work without IOMMUv2. In amdkfd/Makefile, we
 have this condition:

 ifneq ($(CONFIG_AMD_IOMMU_V2),)
 AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
 endif

 In amdkfd/kfd_iommu.h we define inline stubs of the functions that are
 causing your link-failures if IOMMU_V2 is not enabled:

 #if defined(CONFIG_AMD_IOMMU_V2_MODULE) || defined(CONFIG_AMD_IOMMU_V2)
 ... function declarations ...
 #else
 ... stubs ...
 #endif
>>> Right, that is the problem I tried to explain in my patch description.
>>>
>>> Should we just drop the 'imply' then and add a proper dependency like this?
>>>
>>>   depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
>>>   depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m
>>>
>>> I can send a v2 after some testing if you prefer this version.
>> No. My point is, there should not be a hard dependency. The build should
>> work without CONFIG_AMD_IOMMU_V2. I don't understand why it's not
>> working for you. It looks like you're building kfd_iommu.o, which should
>> not be happening when AMD_IOMMU_V2 is not enabled. The condition in
>> amdkfd/Makefile should make sure that kfd_iommu.o doesn't get built with
>> your kernel config.
> Again, as I explained in the changelog text, AMD_IOMMU_V2 is configured as
> a loadable module, while AMDGPU is configured as built-in.
I'm sorry, I didn't read it carefully. And I thought "imply" was meant
to fix exactly this kind of issue.

I don't want to create a hard dependency on AMD_IOMMU_V2 if I can avoid
it, because it is only really needed for a small number of AMD APUs, and
even there it is now optional for more recent ones.

Is there a better way to avoid build failures without creating a hard
dependency?  The documentation in
Documentation/kbuild/kconfig-language.rst suggests using if
(IS_REACHABLE(CONFIG_AMD_IOMMU_V2)) to guard those problematic function
calls. I think more generally, we could guard all of kfd_iommu.c with

    #if IS_REACHABLE(CONFIG_AMD_IOMMU_V2)

And use the same condition to define the stubs in kfd_iommu.h.

Regards,
  Felix


>
> That causes a link failure for the vmlinux file, because the linker cannot
> resolve addresses of loadable modules at compile time -- they have
> not been loaded yet.
>
>   Arnd


Re: [PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Arnd Bergmann
On Mon, Mar 8, 2021 at 8:11 PM Felix Kuehling  wrote:
>
> On 2021-03-08 at 2:05 p.m., Arnd Bergmann wrote:
> > On Mon, Mar 8, 2021 at 5:24 PM Felix Kuehling  
> > wrote:
> >> The driver build should work without IOMMUv2. In amdkfd/Makefile, we
> >> have this condition:
> >>
> >> ifneq ($(CONFIG_AMD_IOMMU_V2),)
> >> AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
> >> endif
> >>
> >> In amdkfd/kfd_iommu.h we define inline stubs of the functions that are
> >> causing your link-failures if IOMMU_V2 is not enabled:
> >>
> >> #if defined(CONFIG_AMD_IOMMU_V2_MODULE) || defined(CONFIG_AMD_IOMMU_V2)
> >> ... function declarations ...
> >> #else
> >> ... stubs ...
> >> #endif
> > Right, that is the problem I tried to explain in my patch description.
> >
> > Should we just drop the 'imply' then and add a proper dependency like this?
> >
> >   depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
> >   depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m
> >
> > I can send a v2 after some testing if you prefer this version.
>
> No. My point is, there should not be a hard dependency. The build should
> work without CONFIG_AMD_IOMMU_V2. I don't understand why it's not
> working for you. It looks like you're building kfd_iommu.o, which should
> not be happening when AMD_IOMMU_V2 is not enabled. The condition in
> amdkfd/Makefile should make sure that kfd_iommu.o doesn't get built with
> your kernel config.

Again, as I explained in the changelog text, AMD_IOMMU_V2 is configured as
a loadable module, while AMDGPU is configured as built-in.

That causes a link failure for the vmlinux file, because the linker cannot
resolve addresses of loadable modules at compile time -- they have
not been loaded yet.

  Arnd


Re: [PATCH] drm/gem: add checks of drm_gem_object->funcs

2021-03-08 Thread Alex Deucher
On Mon, Mar 1, 2021 at 5:25 AM Christian König
 wrote:
>
>
>
> On 01.03.21 at 11:04, Daniel Vetter wrote:
> > On Mon, Mar 1, 2021 at 10:56 AM Thomas Zimmermann  
> > wrote:
> >> (cc'ing amd devs)
> >>
> >> Hi
> >>
> >> On 28.02.21 at 17:10, Pavel Turinský wrote:
> >>> The checks were removed in commit d693def4fd1c ("drm: Remove obsolete GEM
> >>> and PRIME callbacks from struct drm_driver") and can lead to following
> >>> kernel oops:
> >> Thanks for reporting. All drivers are supposed to set the funcs pointer
> >> in their GEM objects. This looks like a radeon bug. Adding the AMD devs.
> > Looks like we're setting obj->funcs only in radeon_gem_object_create,
> > but should set it in radeon_bo_create instead so it catches internal
> > functions too. I think this was missed in
> >
> > commit ce77038fdae385f947757a37573d90f2e83f0271
> > Author: Gerd Hoffmann 
> > Date:   Mon Aug 5 16:01:06 2019 +0200
> >
> > drm/radeon: use embedded gem object
>
> Maybe the same problem we had for amdgpu that the function pointer
> wasn't set for imported DMA-bufs?

Should be fixed here:
https://patchwork.freedesktop.org/patch/423250/

Alex

>
> Regards,
> Christian.
>
> >
> > Adding Gerd.
> > -Daniel
> >
> >> Best regards
> >> Thomas
> >>
> >>> [  139.449098] BUG: kernel NULL pointer dereference, address: 
> >>> 0008
> >>> [  139.449110] #PF: supervisor read access in kernel mode
> >>> [  139.449113] #PF: error_code(0x) - not-present page
> >>> [  139.449116] PGD 0 P4D 0
> >>> [  139.449121] Oops:  [#1] PREEMPT SMP PTI
> >>> [  139.449126] CPU: 4 PID: 1181 Comm: Xorg Not tainted 5.11.2LEdoian #2
> >>> [  139.449130] Hardware name: Gigabyte Technology Co., Ltd. To be filled 
> >>> by O.E.M./Z77-DS3H, BIOS F4 04/25/2012
> >>> [  139.449133] RIP: 0010:drm_gem_handle_create_tail+0xcb/0x190 [drm]
> >>> [  139.449185] Code: 00 48 89 ef e8 06 b4 49 f7 45 85 e4 78 77 48 8d 6b 
> >>> 18 4c 89 ee 48 89 ef e8 c2 f5 00 00 89 c2 85 c0 75 3e 48 8b 83 40 01 00 
> >>> 00 <48> 8b 40 0
> >>> 8 48 85 c0 74 0f 4c 89 ee 48 89 df e8 71 5d 87 f7 85 c0
> >>> [  139.449190] RSP: 0018:be21c194bd28 EFLAGS: 00010246
> >>> [  139.449194] RAX:  RBX: 9da9b3caf078 RCX: 
> >>> 
> >>> [  139.449197] RDX:  RSI: c039b893 RDI: 
> >>> 
> >>> [  139.449199] RBP: 9da9b3caf090 R08: 0040 R09: 
> >>> 9da983b911c0
> >>> [  139.449202] R10: 9da984749e00 R11: 9da9859bfc38 R12: 
> >>> 0007
> >>> [  139.449204] R13: 9da9859bfc00 R14: 9da9859bfc50 R15: 
> >>> 9da9859bfc38
> >>> [  139.449207] FS:  7f6332a56900() GS:9daea7b0() 
> >>> knlGS:
> >>> [  139.449211] CS:  0010 DS:  ES:  CR0: 80050033
> >>> [  139.449214] CR2: 0008 CR3: 0001319b8005 CR4: 
> >>> 001706e0
> >>> [  139.449217] Call Trace:
> >>> [  139.449224]  drm_gem_prime_fd_to_handle+0xff/0x1d0 [drm]
> >>> [  139.449274]  ? drm_prime_destroy_file_private+0x20/0x20 [drm]
> >>> [  139.449323]  drm_ioctl_kernel+0xac/0xf0 [drm]
> >>> [  139.449363]  drm_ioctl+0x20f/0x3b0 [drm]
> >>> [  139.449403]  ? drm_prime_destroy_file_private+0x20/0x20 [drm]
> >>> [  139.449454]  radeon_drm_ioctl+0x49/0x80 [radeon]
> >>> [  139.449500]  __x64_sys_ioctl+0x84/0xc0
> >>> [  139.449507]  do_syscall_64+0x33/0x40
> >>> [  139.449514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>> [  139.449522] RIP: 0033:0x7f63330fbe6b
> >>> [  139.449526] Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 
> >>> 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 
> >>> 05 <48> 3d 01 f
> >>> 0 ff ff 73 01 c3 48 8b 0d d5 af 0c 00 f7 d8 64 89 01 48
> >>> [  139.449529] RSP: 002b:7fff1e9c4438 EFLAGS: 0246 ORIG_RAX: 
> >>> 0010
> >>> [  139.449534] RAX: ffda RBX: 7fff1e9c447c RCX: 
> >>> 7f63330fbe6b
> >>> [  139.449537] RDX: 7fff1e9c447c RSI: c00c642e RDI: 
> >>> 0012
> >>> [  139.449539] RBP: c00c642e R08: 7fff1e9c4520 R09: 
> >>> 7f63331c7a60
> >>> [  139.449542] R10: 7f6329fb9ab0 R11: 0246 R12: 
> >>> 55f69810ad40
> >>> [  139.449544] R13: 0012 R14: 0010 R15: 
> >>> 7fff1e9c4c20
> >>> [  139.449549] Modules linked in: 8021q garp mrp bridge stp llc 
> >>> nls_iso8859_1 vfat fat fuse btrfs blake2b_generic xor raid6_pq libcrc32c 
> >>> crypto_user tun i2c_de
> >>> v it87 hwmon_vid snd_seq snd_hda_codec_realtek snd_hda_codec_generic 
> >>> ledtrig_audio sg snd_hda_codec_hdmi virtio_balloon snd_hda_intel 
> >>> virtio_console snd_intel_
> >>> dspcfg soundwire_intel virtio_pci soundwire_generic_allocation 
> >>> soundwire_cadence virtio_blk snd_hda_codec intel_rapl_msr btusb 
> >>> intel_rapl_common virtio_net btr
> >>> tl net_failover uvcvideo snd_usb_audio snd_hda_core btbcm 
> >>> x86_pkg_temp_thermal failover soundwire_bus btintel intel_powerclamp 
> >>> snd_soc_core 

Re: [PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Felix Kuehling
On 2021-03-08 at 2:05 p.m., Arnd Bergmann wrote:
> On Mon, Mar 8, 2021 at 5:24 PM Felix Kuehling  wrote:
>> The driver build should work without IOMMUv2. In amdkfd/Makefile, we
>> have this condition:
>>
>> ifneq ($(CONFIG_AMD_IOMMU_V2),)
>> AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
>> endif
>>
>> In amdkfd/kfd_iommu.h we define inline stubs of the functions that are
>> causing your link-failures if IOMMU_V2 is not enabled:
>>
>> #if defined(CONFIG_AMD_IOMMU_V2_MODULE) || defined(CONFIG_AMD_IOMMU_V2)
>> ... function declarations ...
>> #else
>> ... stubs ...
>> #endif
> Right, that is the problem I tried to explain in my patch description.
>
> Should we just drop the 'imply' then and add a proper dependency like this?
>
>   depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
>   depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m
>
> I can send a v2 after some testing if you prefer this version.

No. My point is, there should not be a hard dependency. The build should
work without CONFIG_AMD_IOMMU_V2. I don't understand why it's not
working for you. It looks like you're building kfd_iommu.o, which should
not be happening when AMD_IOMMU_V2 is not enabled. The condition in
amdkfd/Makefile should make sure that kfd_iommu.o doesn't get built with
your kernel config.

Regards,
  Felix


>
> Arnd


Re: [PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Arnd Bergmann
On Mon, Mar 8, 2021 at 5:24 PM Felix Kuehling  wrote:
>
> The driver build should work without IOMMUv2. In amdkfd/Makefile, we
> have this condition:
>
> ifneq ($(CONFIG_AMD_IOMMU_V2),)
> AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
> endif
>
> In amdkfd/kfd_iommu.h we define inline stubs of the functions that are
> causing your link-failures if IOMMU_V2 is not enabled:
>
> #if defined(CONFIG_AMD_IOMMU_V2_MODULE) || defined(CONFIG_AMD_IOMMU_V2)
> ... function declarations ...
> #else
> ... stubs ...
> #endif

Right, that is the problem I tried to explain in my patch description.

Should we just drop the 'imply' then and add a proper dependency like this?

  depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
  depends on AMD_IOMMU_V2=y || DRM_AMDGPU=m

I can send a v2 after some testing if you prefer this version.

Arnd


Re: [PATCH 3/3] drm/radeon: keep __user during cast

2021-03-08 Thread Alex Deucher
Series is:
Reviewed-by: Alex Deucher 

On Mon, Mar 8, 2021 at 1:36 PM Christian König
 wrote:
>
> Silence static checker warning.
>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
> b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 5ea647f454d3..808941e31d34 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -921,7 +921,7 @@ static ssize_t radeon_ttm_vram_read(struct file *f, char 
> __user *buf,
> value = RREG32(RADEON_MM_DATA);
> spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
>
> -   r = put_user(value, (uint32_t *)buf);
> +   r = put_user(value, (uint32_t __user *)buf);
> if (r)
> return r;
>
> --
> 2.25.1
>


[PATCH 3/3] drm/radeon: keep __user during cast

2021-03-08 Thread Christian König
Silence static checker warning.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index 5ea647f454d3..808941e31d34 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -921,7 +921,7 @@ static ssize_t radeon_ttm_vram_read(struct file *f, char 
__user *buf,
value = RREG32(RADEON_MM_DATA);
spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
 
-   r = put_user(value, (uint32_t *)buf);
+   r = put_user(value, (uint32_t __user *)buf);
if (r)
return r;
 
-- 
2.25.1



[PATCH 2/3] drm/radeon: fix AGP dependency

2021-03-08 Thread Christian König
When AGP is compiled as module radeon must be compiled as module as
well.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index e392a90ca687..85b79a7fee63 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -228,6 +228,7 @@ source "drivers/gpu/drm/arm/Kconfig"
 config DRM_RADEON
tristate "ATI Radeon"
depends on DRM && PCI && MMU
+   depends on AGP || !AGP
select FW_LOADER
 select DRM_KMS_HELPER
 select DRM_TTM
-- 
2.25.1



[PATCH 1/3] drm/radeon: also init GEM funcs in radeon_gem_prime_import_sg_table

2021-03-08 Thread Christian König
Otherwise we will run into a NULL ptr deref.

Signed-off-by: Christian König 
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=212137
---
 drivers/gpu/drm/radeon/radeon.h   | 2 ++
 drivers/gpu/drm/radeon/radeon_gem.c   | 4 ++--
 drivers/gpu/drm/radeon/radeon_prime.c | 2 ++
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 2dcdd8448331..42281fce552e 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -567,6 +567,8 @@ struct radeon_gem {
struct list_headobjects;
 };
 
+extern const struct drm_gem_object_funcs radeon_gem_object_funcs;
+
 int radeon_gem_init(struct radeon_device *rdev);
 void radeon_gem_fini(struct radeon_device *rdev);
 int radeon_gem_object_create(struct radeon_device *rdev, unsigned long size,
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c 
b/drivers/gpu/drm/radeon/radeon_gem.c
index 412ab3181f84..05ea2f39f626 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -42,7 +42,7 @@ struct sg_table *radeon_gem_prime_get_sg_table(struct 
drm_gem_object *obj);
 int radeon_gem_prime_pin(struct drm_gem_object *obj);
 void radeon_gem_prime_unpin(struct drm_gem_object *obj);
 
-static const struct drm_gem_object_funcs radeon_gem_object_funcs;
+const struct drm_gem_object_funcs radeon_gem_object_funcs;
 
 static void radeon_gem_object_free(struct drm_gem_object *gobj)
 {
@@ -226,7 +226,7 @@ static int radeon_gem_handle_lockup(struct radeon_device 
*rdev, int r)
return r;
 }
 
-static const struct drm_gem_object_funcs radeon_gem_object_funcs = {
+const struct drm_gem_object_funcs radeon_gem_object_funcs = {
.free = radeon_gem_object_free,
.open = radeon_gem_object_open,
.close = radeon_gem_object_close,
diff --git a/drivers/gpu/drm/radeon/radeon_prime.c 
b/drivers/gpu/drm/radeon/radeon_prime.c
index ab29eb9e8667..42a87948e28c 100644
--- a/drivers/gpu/drm/radeon/radeon_prime.c
+++ b/drivers/gpu/drm/radeon/radeon_prime.c
@@ -56,6 +56,8 @@ struct drm_gem_object 
*radeon_gem_prime_import_sg_table(struct drm_device *dev,
if (ret)
return ERR_PTR(ret);
 
+   bo->tbo.base.funcs = &radeon_gem_object_funcs;
+
	mutex_lock(&rdev->gem.mutex);
	list_add_tail(&bo->list, &rdev->gem.objects);
	mutex_unlock(&rdev->gem.mutex);
-- 
2.25.1



Re: [PATCH 1/5] drm/amdgpu: allow variable BO struct creation

2021-03-08 Thread Christian König

On 08.03.21 at 16:37, Nirmoy Das wrote:

Allow allocating BO structures with different structure size
than struct amdgpu_bo.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 9 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
  2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index ac1bb5089260..d32379cbad89 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -543,9 +543,10 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
if (!amdgpu_bo_validate_size(adev, size, bp->domain))
return -ENOMEM;
  
-	*bo_ptr = NULL;
+	BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
 
-	bo = kzalloc(sizeof(struct amdgpu_bo), GFP_KERNEL);
+	*bo_ptr = NULL;
+	bo = kzalloc(bp->bo_ptr_size, GFP_KERNEL);
if (bo == NULL)
return -ENOMEM;
drm_gem_private_object_init(adev_to_drm(adev), >tbo.base, size);
@@ -635,6 +636,7 @@ static int amdgpu_bo_create_shadow(struct amdgpu_device 
*adev,
AMDGPU_GEM_CREATE_SHADOW;
bp.type = ttm_bo_type_kernel;
bp.resv = bo->tbo.base.resv;
+   bp.bo_ptr_size = sizeof(struct amdgpu_bo);
  
	r = amdgpu_bo_do_create(adev, &bp, &bo->shadow);

if (!r) {
@@ -669,6 +671,9 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
int r;
  
  	bp->flags = bp->flags & ~AMDGPU_GEM_CREATE_SHADOW;

+   if (bp->bo_ptr_size < sizeof(struct amdgpu_bo))
+   bp->bo_ptr_size = sizeof(struct amdgpu_bo);
+


It's not strictly necessary, but I would prefer if you change all 
callers of amdgpu_bo_create() to correctly do this instead of the check 
here.


Christian.


r = amdgpu_bo_do_create(adev, bp, bo_ptr);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 54ceb065e546..8e2b556f0b7b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -40,6 +40,7 @@
  struct amdgpu_bo_param {
unsigned long   size;
int byte_align;
+   u32 bo_ptr_size;
u32 domain;
u32 preferred_domain;
u64 flags;




[PATCH] drm/amdgpu: capture invalid hardware access v2

2021-03-08 Thread Christian König
From: Dennis Li 

When recovery thread has begun GPU reset, there should be not other
threads to access hardware, otherwise system randomly hang.

v2 (chk): rewritten from scratch, use trylock and lockdep instead of
  hand wiring the logic.

Signed-off-by: Dennis Li 
Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 74 +-
 1 file changed, 57 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..c990af6a43ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -326,6 +326,34 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, 
loff_t pos,
 /*
  * register access helper functions.
  */
+
+/* Check if hw access should be skipped because of hotplug or device error */
+static bool amdgpu_device_skip_hw_access(struct amdgpu_device *adev)
+{
+   if (adev->in_pci_err_recovery)
+   return true;
+
+#ifdef CONFIG_LOCKDEP
+   /*
+* This is a bit complicated to understand, so worth a comment. What we 
assert
+* here is that the GPU reset is not running on another thread in 
parallel.
+*
+* For this we trylock the read side of the reset semaphore, if that 
succeeds
+* we know that the reset is not running in parallel.
+*
+* If the trylock fails we assert that we are either already holding 
the read
+* side of the lock or are the reset thread itself and hold the write 
side of
+* the lock.
+*/
+   if (down_read_trylock(&adev->reset_sem))
+   up_read(&adev->reset_sem);
+   else
+   lockdep_assert_held(&adev->reset_sem);
+#endif
+
+   return false;
+}
+
 /**
  * amdgpu_device_rreg - read a memory mapped IO or indirect register
  *
@@ -340,7 +368,7 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,
 {
uint32_t ret;
 
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return 0;
 
if ((reg * 4) < adev->rmmio_size) {
@@ -377,7 +405,7 @@ uint32_t amdgpu_device_rreg(struct amdgpu_device *adev,
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return 0;
 
if (offset < adev->rmmio_size)
@@ -402,7 +430,7 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, 
uint32_t offset)
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t 
value)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return;
 
if (offset < adev->rmmio_size)
@@ -425,7 +453,7 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
uint32_t reg, uint32_t v,
uint32_t acc_flags)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return;
 
if ((reg * 4) < adev->rmmio_size) {
@@ -452,7 +480,7 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
 void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
 uint32_t reg, uint32_t v)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return;
 
if (amdgpu_sriov_fullaccess(adev) &&
@@ -475,7 +503,7 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return 0;
 
if ((reg * 4) < adev->rio_mem_size)
@@ -497,7 +525,7 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return;
 
if ((reg * 4) < adev->rio_mem_size)
@@ -519,7 +547,7 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, 
u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return 0;
 
if (index < adev->doorbell.num_doorbells) {
@@ -542,7 +570,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 
index)
  */
 void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return;
 
if (index < adev->doorbell.num_doorbells) {
@@ -563,7 +591,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 
index, u32 v)
  */
 u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
 {
-   if (adev->in_pci_err_recovery)
+   if (amdgpu_device_skip_hw_access(adev))
return 0;
 

Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during probe

2021-03-08 Thread Andrey Grodzovsky
Sure, patch 4 Reviewed-by: Andrey Grodzovsky andrey.grodzov...@amd.com 
and patch 5 Acked-by: Andrey Grodzovsky andrey.grodzov...@amd.com since 
I am not sure about the KFD bits.


Andrey

On 2021-03-08 11:10 a.m., Liu, Shaoyun wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey.
The first 3 patches in this serial already been acked by Alex. D, can you help 
review the rest two ?

Thanks
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey 
Sent: Monday, March 8, 2021 10:53 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive during 
probe

I see, thanks for explaning.

Andrey

On 2021-03-08 10:27 a.m., Liu, Shaoyun wrote:

[AMD Official Use Only - Internal Distribution Only]

Check the function amdgpu_xgmi_add_device: when the PSP XGMI TA is not available, the
driver will assign a faked hive ID 0x10 for all GPUs, which means all GPUs will belong
to one same hive. So I can still use hive->tb to sync the reset on all GPUs.
The reason I cannot use the default amdgpu_do_asic_reset function is because we
want to build correct hive and node topology for all GPUs after reset, so we need to
call amdgpu_xgmi_add_device inside the amdgpu_do_asic_reset function. To make this
work, we need to destroy the hive by removing the device (call
amdgpu_xgmi_remove_device) first, so when calling amdgpu_do_asic_reset, the faked
hive (0x10) has already been destroyed, and the hive->tb will not work in this case.
That's the reason I need to call the reset explicitly with the faked hive and
then destroy the hive, building the device_list for amdgpu_do_asic_reset without the
hive.
Hope I explained it clearly.

Thanks
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey 
Sent: Monday, March 8, 2021 1:28 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI
hive during probe

But the hive->tb object is used regardless inside
amdgpu_device_xgmi_reset_func currently; it means that even when you explicitly
schedule xgmi_reset_work as you do now, the code will try to sync using a not well
initialized tb object. Maybe you can define a global static tb object, fill it in
the loop where you send xgmi_reset_work for all devices in the system, and use it from
within amdgpu_device_xgmi_reset_func instead of the regular per-hive tb object
(obviously under your special use case).

Andrey

On 2021-03-06 4:11 p.m., Liu, Shaoyun wrote:

[AMD Official Use Only - Internal Distribution Only]

It seems I cannot directly reuse the reset HW function inside
amdgpu_do_asic_reset; the synchronization is based on hive->tb, but as
explained, we actually don't know which hive the GPU belongs to and will rebuild the
correct hive info inside the amdgpu_do_asic_reset function with
amdgpu_xgmi_add_device, so I need to remove all GPUs from the hive first. This
means the sync won't work, since the hive->tb will be removed as well when
all GPUs are removed.

Thanks
shaopyunliu

-Original Message-
From: amd-gfx  On Behalf Of
Liu, Shaoyun
Sent: Saturday, March 6, 2021 3:41 PM
To: Grodzovsky, Andrey ;
amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI
hive during probe

[AMD Official Use Only - Internal Distribution Only]

I call the amdgpu_do_asic_reset with the parameter skip_hw_reset = true, so the
reset won't be executed twice. But probably I can set this parameter to true
and remove the code scheduled for reset, since now I already build the
device_list not based on the hive. Let me try that.
For the scheduled delayed work thread with AMDGPU_RESUME_MS, it's actually not
waiting for SMU to start. As I explained, I need to reset all the GPUs in
the system since I don't know which GPUs belong to which hive. So this time
allows the system to probe all the GPUs in the system, which means when this
delayed thread starts, we can assume all the devices have already been populated
in mgpu_info.

Regards
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey 
Sent: Saturday, March 6, 2021 1:09 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI
hive during probe

Thanks for explaining this, one thing I still don't understand is why you 
schedule the reset work explicitly in the beginning of 
amdgpu_drv_delayed_reset_work_handler and then also call amdgpu_do_asic_reset 
which will do the same thing too. It looks like the physical reset will execute 
twice for each device.
Another thing is, more like improvement suggestion  - currently you schedule 
delayed_reset_work using AMDGPU_RESUME_MS - so I guess this should give enough 
time for SMU to start ? Is there maybe a way to instead poll for SMU start 
completion and then execute this - some SMU status registers maybe ? Just to 
avoid relying on this arbitrary 

Re: [PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Felix Kuehling
The driver build should work without IOMMUv2. In amdkfd/Makefile, we
have this condition:

ifneq ($(CONFIG_AMD_IOMMU_V2),)
AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
endif

In amdkfd/kfd_iommu.h we define inline stubs of the functions that are
causing your link-failures if IOMMU_V2 is not enabled:

#if defined(CONFIG_AMD_IOMMU_V2_MODULE) || defined(CONFIG_AMD_IOMMU_V2)
... function declarations ...
#else
... stubs ...
#endif

Regards,
  Felix

Am 2021-03-08 um 10:33 a.m. schrieb Arnd Bergmann:
> From: Arnd Bergmann 
>
> Using 'imply AMD_IOMMU_V2' does not guarantee that the driver can link
> against the exported functions. If the GPU driver is built-in but the
> IOMMU driver is a loadable module, the kfd_iommu.c file is indeed
> built but does not work:
>
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_bind_process_to_device':
> kfd_iommu.c:(.text+0x516): undefined reference to `amd_iommu_bind_pasid'
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_unbind_process':
> kfd_iommu.c:(.text+0x691): undefined reference to `amd_iommu_unbind_pasid'
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_suspend':
> kfd_iommu.c:(.text+0x966): undefined reference to 
> `amd_iommu_set_invalidate_ctx_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0x97f): undefined reference to 
> `amd_iommu_set_invalid_ppr_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0x9a4): undefined reference to 
> `amd_iommu_free_device'
> x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
> `kfd_iommu_resume':
> kfd_iommu.c:(.text+0xa9a): undefined reference to `amd_iommu_init_device'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xadc): undefined reference to 
> `amd_iommu_set_invalidate_ctx_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xaff): undefined reference to 
> `amd_iommu_set_invalid_ppr_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xc72): undefined reference to 
> `amd_iommu_bind_pasid'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xe08): undefined reference to 
> `amd_iommu_set_invalidate_ctx_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xe26): undefined reference to 
> `amd_iommu_set_invalid_ppr_cb'
> x86_64-linux-ld: kfd_iommu.c:(.text+0xe42): undefined reference to 
> `amd_iommu_free_device'
>
> Use a stronger 'select' instead.
>
> Fixes: 64d1c3a43a6f ("drm/amdkfd: Centralize IOMMUv2 code and make it 
> conditional")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/gpu/drm/amd/amdkfd/Kconfig | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig 
> b/drivers/gpu/drm/amd/amdkfd/Kconfig
> index f02c938f75da..91f85dfb7ba6 100644
> --- a/drivers/gpu/drm/amd/amdkfd/Kconfig
> +++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
> @@ -5,8 +5,9 @@
>  
>  config HSA_AMD
>   bool "HSA kernel driver for AMD GPU devices"
> - depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
> - imply AMD_IOMMU_V2 if X86_64
> + depends on DRM_AMDGPU && ((X86_64 && IOMMU_SUPPORT && ACPI) || ARM64 || 
> PPC64)
> + select AMD_IOMMU if X86_64
> + select AMD_IOMMU_V2 if X86_64
>   select HMM_MIRROR
>   select MMU_NOTIFIER
>   select DRM_AMDGPU_USERPTR
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng probe

2021-03-08 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey. 
The first 3 patches in this series have already been acked by Alex D.; can you 
help review the remaining two?

Thanks
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey  
Sent: Monday, March 8, 2021 10:53 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng 
probe

I see, thanks for explaining.

Andrey

On 2021-03-08 10:27 a.m., Liu, Shaoyun wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> Check the function amdgpu_xgmi_add_device, when  psp XGMI TA is bot available 
> ,  the driver will assign a faked hive ID 0x10 for all  GPUs, it means all 
> GPU will belongs to one same hive .  So I can still use hive->tb to sync the 
> reset on all GPUs.   The reason I can  not use the default 
> amdgpu_do_asic_reset function  is because we  want to build correct hive and 
> node topology for all GPUs after reset, so we need to call 
> amdgpu_xgmi_add_device inside the amdgpu_do_asic_reset function . To make 
> this works ,  we need to destroy the hive by remove  the device (call 
> amdgpu_xgmi_remove_device) first , so when calling amdgpu_do_asic_reset ,  
> the  faked hive(0x10) already   been destroyed. And  the hive->tb will not 
> work in this case .   That's the reason I need to call the reset explicitly 
> with the faked hive and then destroy the hive ,  build the device_list for 
> amdgpu_do_asic_reset without the hive .
> Hope I explain it clearly .
> 
> Thanks
> Shaoyun.liu
> 
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Monday, March 8, 2021 1:28 AM
> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive duirng probe
> 
> But the hive->tb object is used regardless, inside 
> amdgpu_device_xgmi_reset_func currently, it means then even when you 
> explcicitly schdule xgmi_reset_work as you do now they code will try to sync 
> using a not well iniitlized tb object. Maybe you can define a global static 
> tb object, fill it in the loop where you send xgmi_reset_work for all devices 
> in system and use it from within amdgpu_device_xgmi_reset_func instead of the 
> regular per hive tb object (obviosly under your special use case).
> 
> Andrey
> 
> On 2021-03-06 4:11 p.m., Liu, Shaoyun wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> It  seems I can  not directly reuse the reset HW  function inside the  
>> amdgpu_do_asic_reset,  the  synchronization is based on hive->tb,   but as 
>> explained , we actually don't know the GPU belongs to which hive and will 
>> rebuild the correct hive info inside the amdgpu_do_asic_reset function with 
>> amdgpu_xgmi_add_device .  so I need to remove  all GPUs from the hive first 
>> . This will lead to the sync don't work since the hive->tb will be removed 
>> as well when all GPUs are removed .
>>
>> Thanks
>> shaopyunliu
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of 
>> Liu, Shaoyun
>> Sent: Saturday, March 6, 2021 3:41 PM
>> To: Grodzovsky, Andrey ; 
>> amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
>> hive duirng probe
>>
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> I call the amdgpu_do_asic_reset with the parameter skip_hw_reset = true  so 
>> the reset won't be execute twice .  but probably I can  set this parameter 
>> to true and remove the code schedule for reset since now I already build the 
>> device_list not based on hive. Let me try that .
>> For the  schedule delayed work thread with AMDGPU_RESUME_MS, It's actually 
>> not wait for SMU  to start. As I explained , I need to reset the all the 
>> GPUs in the system since I don't know which gpus belongs to which hive.  So 
>> this time is allow system to probe all the GPUs  in the system which means 
>> when this delayed thread starts ,  we can assume all the devices already 
>> been  populated in mgpu_info.
>>
>> Regards
>> Shaoyun.liu
>>
>> -Original Message-
>> From: Grodzovsky, Andrey 
>> Sent: Saturday, March 6, 2021 1:09 AM
>> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
>> hive duirng probe
>>
>> Thanks for explaining this, one thing I still don't understand is why you 
>> schedule the reset work explicilty in the begining of 
>> amdgpu_drv_delayed_reset_work_handler and then also call 
>> amdgpu_do_asic_reset which will do the same thing too. It looks like the 
>> physical reset will execute twice for each device.
>> Another thing is, more like improvement suggestion  - currently you schedule 
>> delayed_reset_work using AMDGPU_RESUME_MS - so i guesss this should give 
>> enough time for SMU to start ? Is there maybe a way to instead poll for SMU 
>> start completion and then execute this - some SMU status registers maybe ? 
>> Just to avoid relying on this 

Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng probe

2021-03-08 Thread Andrey Grodzovsky

I see, thanks for explaining.

Andrey

On 2021-03-08 10:27 a.m., Liu, Shaoyun wrote:

[AMD Official Use Only - Internal Distribution Only]

Check the function amdgpu_xgmi_add_device: when the PSP XGMI TA is not available, the 
driver will assign a faked hive ID 0x10 for all GPUs, which means all GPUs will belong 
to one same hive, so I can still use hive->tb to sync the reset on all GPUs. 
The reason I can not use the default amdgpu_do_asic_reset function is because we 
want to build the correct hive and node topology for all GPUs after reset, so we need 
to call amdgpu_xgmi_add_device inside the amdgpu_do_asic_reset function. To make this 
work, we need to destroy the hive by removing the devices (call 
amdgpu_xgmi_remove_device) first, so when calling amdgpu_do_asic_reset, the faked 
hive (0x10) has already been destroyed, and the hive->tb will not work in this case. 
That's the reason I need to call the reset explicitly with the faked hive and 
then destroy the hive, building the device_list for amdgpu_do_asic_reset without the 
hive.
Hope I explained it clearly.

Thanks
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey 
Sent: Monday, March 8, 2021 1:28 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng 
probe

But the hive->tb object is used regardless, inside 
amdgpu_device_xgmi_reset_func currently; it means that even when you explicitly 
schedule xgmi_reset_work as you do now, the code will try to sync using a 
not-well-initialized tb object. Maybe you can define a global static tb object, 
fill it in the loop where you send xgmi_reset_work for all devices in the 
system, and use it from within amdgpu_device_xgmi_reset_func instead of the 
regular per-hive tb object (obviously under your special use case).

Andrey

On 2021-03-06 4:11 p.m., Liu, Shaoyun wrote:

[AMD Official Use Only - Internal Distribution Only]

It seems I can not directly reuse the reset HW function inside 
amdgpu_do_asic_reset; the synchronization is based on hive->tb, but as 
explained, we actually don't know which hive the GPU belongs to and will rebuild the 
correct hive info inside the amdgpu_do_asic_reset function with 
amdgpu_xgmi_add_device. So I need to remove all GPUs from the hive first. This 
will lead to the sync not working, since the hive->tb will be removed as well when 
all GPUs are removed.

Thanks
shaopyunliu

-Original Message-
From: amd-gfx  On Behalf Of
Liu, Shaoyun
Sent: Saturday, March 6, 2021 3:41 PM
To: Grodzovsky, Andrey ;
amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI
hive duirng probe

[AMD Official Use Only - Internal Distribution Only]

I call amdgpu_do_asic_reset with the parameter skip_hw_reset = true, so the 
reset won't be executed twice. But probably I can set this parameter and remove 
the code that schedules the reset, since now I already build the device_list 
not based on the hive. Let me try that.
For the scheduled delayed work thread with AMDGPU_RESUME_MS: it's actually not 
waiting for the SMU to start. As I explained, I need to reset all the GPUs in 
the system since I don't know which GPUs belong to which hive. So this time 
allows the system to probe all the GPUs, which means that when this delayed 
thread starts, we can assume all the devices have already been populated 
in mgpu_info.

Regards
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey 
Sent: Saturday, March 6, 2021 1:09 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI
hive duirng probe

Thanks for explaining this. One thing I still don't understand is why you 
schedule the reset work explicitly at the beginning of 
amdgpu_drv_delayed_reset_work_handler and then also call amdgpu_do_asic_reset, 
which will do the same thing too. It looks like the physical reset will execute 
twice for each device.
Another thing, more of an improvement suggestion: currently you schedule 
delayed_reset_work using AMDGPU_RESUME_MS, so I guess this should give enough 
time for the SMU to start? Is there maybe a way to instead poll for SMU start 
completion and then execute this, some SMU status registers maybe? Just to 
avoid relying on this arbitrary value.

Andrey

On 2021-03-05 8:37 p.m., Liu, Shaoyun wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi,  Andrey
The existing reset functions (amdgpu_device_gpu_recover or amdgpu_do_asic_reset) 
assumed the driver already has the correct hive info. But in my case, it's not 
true. The GPUs are in a bad state and the XGMI TA might not function 
properly, so the driver can not get the hive and node info when probing the device. 
It means the driver doesn't even know which hive the device belongs to on a 
system with a multiple-hive configuration (e.g., 8 GPUs in two hives). The only 
solution I can think of is to let the driver trigger the
[PATCH 4/5] drm/amdgpu: use amdgpu_bo_create_user() where possible

2021-03-08 Thread Nirmoy Das
Use amdgpu_bo_create_user() for all the BO allocations for
ttm_bo_type_device type.

CC: felix.kuehl...@amd.com
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c| 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index f44185f512de..00ac5c272f47 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -316,6 +316,7 @@ int amdgpu_amdkfd_alloc_gws(struct kgd_dev *kgd, size_t 
size,
 {
struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
struct amdgpu_bo *bo = NULL;
+   struct amdgpu_bo_user *ubo;
struct amdgpu_bo_param bp;
int r;

@@ -327,13 +328,14 @@ int amdgpu_amdkfd_alloc_gws(struct kgd_dev *kgd, size_t 
size,
bp.type = ttm_bo_type_device;
bp.resv = NULL;

-   r = amdgpu_bo_create(adev, &bp, &bo);
+   r = amdgpu_bo_create_user(adev, &bp, &ubo);
if (r) {
dev_err(adev->dev,
"failed to allocate gws BO for amdkfd (%d)\n", r);
return r;
}

+   bo = &ubo->bo;
*mem_obj = bo;
return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index fb7171e5507c..beff96ddc0b4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -58,6 +58,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, 
unsigned long size,
 struct drm_gem_object **obj)
 {
struct amdgpu_bo *bo;
+   struct amdgpu_bo_user *ubo;
struct amdgpu_bo_param bp;
int r;

@@ -71,10 +72,11 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, 
unsigned long size,
bp.preferred_domain = initial_domain;
bp.flags = flags;
bp.domain = initial_domain;
-   r = amdgpu_bo_create(adev, &bp, &bo);
+   r = amdgpu_bo_create_user(adev, &bp, &ubo);
if (r)
return r;

+   bo = &ubo->bo;
*obj = &bo->tbo.base;
(*obj)->funcs = &amdgpu_gem_object_funcs;

--
2.30.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 5/5] drm/amdgpu: use amdgpu_bo_user bo for metadata and tiling flag

2021-03-08 Thread Nirmoy Das
Tiling flag and metadata are only needed for BOs created by
amdgpu_gem_object_create(), so we can remove those from the
base class.

CC: felix.kuehl...@amd.com
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 59 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  4 --
 3 files changed, 43 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 00ac5c272f47..04a19cdc08c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -496,8 +496,6 @@ int amdgpu_amdkfd_get_dmabuf_info(struct kgd_dev *kgd, int 
dma_buf_fd,
*dma_buf_kgd = (struct kgd_dev *)adev;
if (bo_size)
*bo_size = amdgpu_bo_size(bo);
-   if (metadata_size)
-   *metadata_size = bo->metadata_size;
if (metadata_buffer)
r = amdgpu_bo_get_metadata(bo, metadata_buffer, buffer_size,
   metadata_size, &metadata_flags);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index abfeb8304894..c105ba96dd58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -77,6 +77,7 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
 {
struct amdgpu_device *adev = amdgpu_ttm_adev(tbo->bdev);
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(tbo);
+   struct amdgpu_bo_user *ubo;

if (bo->tbo.pin_count > 0)
amdgpu_bo_subtract_pin_size(bo);
@@ -94,7 +95,11 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
}
amdgpu_bo_unref(&bo->parent);

-   kfree(bo->metadata);
+   if (bo->tbo.type == ttm_bo_type_device) {
+   ubo = container_of((bo), struct amdgpu_bo_user, bo);
+   kfree(ubo->metadata);
+   }
+
kfree(bo);
 }

@@ -1161,12 +1166,15 @@ int amdgpu_bo_fbdev_mmap(struct amdgpu_bo *bo,
 int amdgpu_bo_set_tiling_flags(struct amdgpu_bo *bo, u64 tiling_flags)
 {
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+   struct amdgpu_bo_user *ubo;

+   BUG_ON(bo->tbo.type != ttm_bo_type_device);
if (adev->family <= AMDGPU_FAMILY_CZ &&
AMDGPU_TILING_GET(tiling_flags, TILE_SPLIT) > 6)
return -EINVAL;

-   bo->tiling_flags = tiling_flags;
+   ubo = container_of((bo), struct amdgpu_bo_user, bo);
+   ubo->tiling_flags = tiling_flags;
return 0;
 }

@@ -1180,10 +1188,14 @@ int amdgpu_bo_set_tiling_flags(struct amdgpu_bo *bo, 
u64 tiling_flags)
  */
 void amdgpu_bo_get_tiling_flags(struct amdgpu_bo *bo, u64 *tiling_flags)
 {
+   struct amdgpu_bo_user *ubo;
+
+   BUG_ON(bo->tbo.type != ttm_bo_type_device);
dma_resv_assert_held(bo->tbo.base.resv);
+   ubo = amdgpu_bo_to_amdgpu_bo_user(bo);

if (tiling_flags)
-   *tiling_flags = bo->tiling_flags;
+   *tiling_flags = ubo->tiling_flags;
 }

 /**
@@ -1202,13 +1214,20 @@ void amdgpu_bo_get_tiling_flags(struct amdgpu_bo *bo, 
u64 *tiling_flags)
 int amdgpu_bo_set_metadata (struct amdgpu_bo *bo, void *metadata,
uint32_t metadata_size, uint64_t flags)
 {
+   struct amdgpu_bo_user *ubo;
void *buffer;

+   if (bo->tbo.type != ttm_bo_type_device) {
+   DRM_ERROR("can not set metadata for a non-amdgpu_bo_user type BO\n");
+   return -EINVAL;
+   }
+
+   ubo = amdgpu_bo_to_amdgpu_bo_user(bo);
if (!metadata_size) {
-   if (bo->metadata_size) {
-   kfree(bo->metadata);
-   bo->metadata = NULL;
-   bo->metadata_size = 0;
+   if (ubo->metadata_size) {
+   kfree(ubo->metadata);
+   ubo->metadata = NULL;
+   ubo->metadata_size = 0;
}
return 0;
}
@@ -1220,10 +1239,10 @@ int amdgpu_bo_set_metadata (struct amdgpu_bo *bo, void 
*metadata,
if (buffer == NULL)
return -ENOMEM;

-   kfree(bo->metadata);
-   bo->metadata_flags = flags;
-   bo->metadata = buffer;
-   bo->metadata_size = metadata_size;
+   kfree(ubo->metadata);
+   ubo->metadata_flags = flags;
+   ubo->metadata = buffer;
+   ubo->metadata_size = metadata_size;

return 0;
 }
@@ -1247,21 +1266,29 @@ int amdgpu_bo_get_metadata(struct amdgpu_bo *bo, void 
*buffer,
   size_t buffer_size, uint32_t *metadata_size,
   uint64_t *flags)
 {
+   struct amdgpu_bo_user *ubo;
+
if (!buffer && !metadata_size)
return -EINVAL;

+   if (bo->tbo.type != ttm_bo_type_device) {
+   DRM_ERROR("can not get metadata for a 

[PATCH 3/5] drm/amdgpu: fb BO should be ttm_bo_type_device

2021-03-08 Thread Nirmoy Das
The FB BO should not be of type ttm_bo_type_kernel, and
amdgpufb_create_pinned_object() pins the FB BO anyway.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
index 51cd49c6f38f..24010cacf7d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c
@@ -146,7 +146,7 @@ static int amdgpufb_create_pinned_object(struct 
amdgpu_fbdev *rfbdev,
size = mode_cmd->pitches[0] * height;
aligned_size = ALIGN(size, PAGE_SIZE);
ret = amdgpu_gem_object_create(adev, aligned_size, 0, domain, flags,
-  ttm_bo_type_kernel, NULL, &gobj);
+  ttm_bo_type_device, NULL, &gobj);
if (ret) {
pr_err("failed to allocate framebuffer (%d)\n", aligned_size);
return -ENOMEM;
-- 
2.30.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/5] drm/amdgpu: allow variable BO struct creation

2021-03-08 Thread Nirmoy Das
Allow allocating BO structures with different structure size
than struct amdgpu_bo.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 9 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index ac1bb5089260..d32379cbad89 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -543,9 +543,10 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
if (!amdgpu_bo_validate_size(adev, size, bp->domain))
return -ENOMEM;
 
-   *bo_ptr = NULL;
+   BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
 
-   bo = kzalloc(sizeof(struct amdgpu_bo), GFP_KERNEL);
+   *bo_ptr = NULL;
+   bo = kzalloc(bp->bo_ptr_size, GFP_KERNEL);
if (bo == NULL)
return -ENOMEM;
drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
@@ -635,6 +636,7 @@ static int amdgpu_bo_create_shadow(struct amdgpu_device 
*adev,
AMDGPU_GEM_CREATE_SHADOW;
bp.type = ttm_bo_type_kernel;
bp.resv = bo->tbo.base.resv;
+   bp.bo_ptr_size = sizeof(struct amdgpu_bo);
 
r = amdgpu_bo_do_create(adev, &bp, &bo->shadow);
if (!r) {
@@ -669,6 +671,9 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
int r;
 
bp->flags = bp->flags & ~AMDGPU_GEM_CREATE_SHADOW;
+   if (bp->bo_ptr_size < sizeof(struct amdgpu_bo))
+   bp->bo_ptr_size = sizeof(struct amdgpu_bo);
+
r = amdgpu_bo_do_create(adev, bp, bo_ptr);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 54ceb065e546..8e2b556f0b7b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -40,6 +40,7 @@
 struct amdgpu_bo_param {
unsigned long   size;
int byte_align;
+   u32 bo_ptr_size;
u32 domain;
u32 preferred_domain;
u64 flags;
-- 
2.30.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 2/5] drm/amdgpu: introduce struct amdgpu_bo_user

2021-03-08 Thread Nirmoy Das
Implement a new struct amdgpu_bo_user as a subclass of
struct amdgpu_bo, and a function to create an amdgpu_bo_user
BO with a flag to identify the owner.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 28 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 14 +++
 2 files changed, 42 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index d32379cbad89..abfeb8304894 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -695,6 +695,34 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
return r;
 }
 
+/**
+ * amdgpu_bo_create_user - create an &amdgpu_bo_user buffer object
+ * @adev: amdgpu device object
+ * @bp: parameters to be used for the buffer object
+ * @ubo_ptr: pointer to the buffer object pointer
+ *
+ * Create a BO to be used by user application;
+ *
+ * Returns:
+ * 0 for success or a negative error code on failure.
+ */
+
+int amdgpu_bo_create_user(struct amdgpu_device *adev,
+ struct amdgpu_bo_param *bp,
+ struct amdgpu_bo_user **ubo_ptr)
+{
+   struct amdgpu_bo *bo_ptr;
+   int r;
+
+   bp->flags = bp->flags & ~AMDGPU_GEM_CREATE_SHADOW;
+   bp->bo_ptr_size = sizeof(struct amdgpu_bo_user);
+   r = amdgpu_bo_do_create(adev, bp, &bo_ptr);
+   if (r)
+   return r;
+
+   *ubo_ptr = amdgpu_bo_to_amdgpu_bo_user(bo_ptr);
+   return r;
+}
 /**
 * amdgpu_bo_validate - validate an &amdgpu_bo buffer object
  * @bo: pointer to the buffer object
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 8e2b556f0b7b..fd30221266c8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -37,6 +37,8 @@
 #define AMDGPU_BO_INVALID_OFFSET   LONG_MAX
 #define AMDGPU_BO_MAX_PLACEMENTS   3
 
+#define amdgpu_bo_to_amdgpu_bo_user(abo) container_of((abo), struct 
amdgpu_bo_user, bo)
+
 struct amdgpu_bo_param {
unsigned long   size;
int byte_align;
@@ -112,6 +114,15 @@ struct amdgpu_bo {
struct kgd_mem  *kfd_bo;
 };
 
+struct amdgpu_bo_user {
+   struct amdgpu_bobo;
+   u64 tiling_flags;
+   u64 metadata_flags;
+   void*metadata;
+   u32 metadata_size;
+
+};
+
 static inline struct amdgpu_bo *ttm_to_amdgpu_bo(struct ttm_buffer_object *tbo)
 {
return container_of(tbo, struct amdgpu_bo, tbo);
@@ -255,6 +266,9 @@ int amdgpu_bo_create_kernel(struct amdgpu_device *adev,
 int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev,
   uint64_t offset, uint64_t size, uint32_t domain,
   struct amdgpu_bo **bo_ptr, void **cpu_addr);
+int amdgpu_bo_create_user(struct amdgpu_device *adev,
+ struct amdgpu_bo_param *bp,
+ struct amdgpu_bo_user **ubo_ptr);
 void amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64 *gpu_addr,
   void **cpu_addr);
 int amdgpu_bo_kmap(struct amdgpu_bo *bo, void **ptr);
-- 
2.30.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdkfd: fix build error with missing AMD_IOMMU_V2

2021-03-08 Thread Arnd Bergmann
From: Arnd Bergmann 

Using 'imply AMD_IOMMU_V2' does not guarantee that the driver can link
against the exported functions. If the GPU driver is built-in but the
IOMMU driver is a loadable module, the kfd_iommu.c file is indeed
built but does not work:

x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_bind_process_to_device':
kfd_iommu.c:(.text+0x516): undefined reference to `amd_iommu_bind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_unbind_process':
kfd_iommu.c:(.text+0x691): undefined reference to `amd_iommu_unbind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_suspend':
kfd_iommu.c:(.text+0x966): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x97f): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x9a4): undefined reference to 
`amd_iommu_free_device'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function 
`kfd_iommu_resume':
kfd_iommu.c:(.text+0xa9a): undefined reference to `amd_iommu_init_device'
x86_64-linux-ld: kfd_iommu.c:(.text+0xadc): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xaff): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xc72): undefined reference to 
`amd_iommu_bind_pasid'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe08): undefined reference to 
`amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe26): undefined reference to 
`amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe42): undefined reference to 
`amd_iommu_free_device'

Use a stronger 'select' instead.

Fixes: 64d1c3a43a6f ("drm/amdkfd: Centralize IOMMUv2 code and make it 
conditional")
Signed-off-by: Arnd Bergmann 
---
 drivers/gpu/drm/amd/amdkfd/Kconfig | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig 
b/drivers/gpu/drm/amd/amdkfd/Kconfig
index f02c938f75da..91f85dfb7ba6 100644
--- a/drivers/gpu/drm/amd/amdkfd/Kconfig
+++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
@@ -5,8 +5,9 @@
 
 config HSA_AMD
bool "HSA kernel driver for AMD GPU devices"
-   depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
-   imply AMD_IOMMU_V2 if X86_64
+   depends on DRM_AMDGPU && ((X86_64 && IOMMU_SUPPORT && ACPI) || ARM64 || 
PPC64)
+   select AMD_IOMMU if X86_64
+   select AMD_IOMMU_V2 if X86_64
select HMM_MIRROR
select MMU_NOTIFIER
select DRM_AMDGPU_USERPTR
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
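The difference the patch exploits: 'imply' only sets a default and still allows the implied symbol to be =m (or =n) while the GPU driver is =y, which is exactly the built-in-vs-module mismatch behind the link errors above; 'select' forces the symbol to at least the selecting symbol's own value. Schematically, with illustrative symbol names rather than the real Kconfig:

```kconfig
config DRIVER_A
	bool "built-in driver that calls exported helper functions"
	# imply HELPER        # HELPER may still be =m or =n -> link failure
	select HELPER         # HELPER is forced =y whenever DRIVER_A=y

config HELPER
	tristate "helper driver that exports the symbols"
```

The cost of 'select' is that it bypasses HELPER's own dependencies, which is why the patch also tightens the 'depends on' line (IOMMU_SUPPORT, ACPI) at the same time.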


RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng probe

2021-03-08 Thread Liu, Shaoyun
[AMD Official Use Only - Internal Distribution Only]

Check the function amdgpu_xgmi_add_device: when the PSP XGMI TA is not available, 
the driver will assign a faked hive ID 0x10 for all GPUs, which means all GPUs 
will belong to one same hive, so I can still use hive->tb to sync the reset 
on all GPUs. The reason I can not use the default amdgpu_do_asic_reset 
function is because we want to build the correct hive and node topology for all 
GPUs after reset, so we need to call amdgpu_xgmi_add_device inside the 
amdgpu_do_asic_reset function. To make this work, we need to destroy the 
hive by removing the devices (call amdgpu_xgmi_remove_device) first, so when 
calling amdgpu_do_asic_reset, the faked hive (0x10) has already been destroyed, 
and the hive->tb will not work in this case. That's the reason I need to 
call the reset explicitly with the faked hive and then destroy the hive, 
building the device_list for amdgpu_do_asic_reset without the hive. 
Hope I explained it clearly.

Thanks 
Shaoyun.liu

-Original Message-
From: Grodzovsky, Andrey  
Sent: Monday, March 8, 2021 1:28 AM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI hive duirng 
probe

But the hive->tb object is used regardless, inside 
amdgpu_device_xgmi_reset_func currently, it means then even when you 
explcicitly schdule xgmi_reset_work as you do now they code will try to sync 
using a not well iniitlized tb object. Maybe you can define a global static tb 
object, fill it in the loop where you send xgmi_reset_work for all devices in 
system and use it from within amdgpu_device_xgmi_reset_func instead of the 
regular per hive tb object (obviosly under your special use case).

Andrey

On 2021-03-06 4:11 p.m., Liu, Shaoyun wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> It  seems I can  not directly reuse the reset HW  function inside the  
> amdgpu_do_asic_reset,  the  synchronization is based on hive->tb,   but as 
> explained , we actually don't know the GPU belongs to which hive and will 
> rebuild the correct hive info inside the amdgpu_do_asic_reset function with 
> amdgpu_xgmi_add_device .  so I need to remove  all GPUs from the hive first . 
> This will lead to the sync don't work since the hive->tb will be removed as 
> well when all GPUs are removed .
> 
> Thanks
> shaopyunliu
> 
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Liu, Shaoyun
> Sent: Saturday, March 6, 2021 3:41 PM
> To: Grodzovsky, Andrey ; 
> amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive duirng probe
> 
> [AMD Official Use Only - Internal Distribution Only]
> 
> I call amdgpu_do_asic_reset with the parameter skip_hw_reset = true, so 
> the reset won't be executed twice. But probably I can set this parameter to 
> false and remove the code that schedules the reset, since now I already build the 
> device_list not based on the hive. Let me try that.
> For the delayed-work thread scheduled with AMDGPU_RESUME_MS: it is actually 
> not waiting for the SMU to start. As I explained, I need to reset all the GPUs 
> in the system since I don't know which GPUs belong to which hive. So this 
> time allows the system to probe all the GPUs in the system, which means when 
> this delayed thread starts, we can assume all the devices have already been 
> populated in mgpu_info.
> 
> Regards
> Shaoyun.liu
> 
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Saturday, March 6, 2021 1:09 AM
> To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amdgpu: Reset the devices in the XGMI 
> hive during probe
> 
> Thanks for explaining this. One thing I still don't understand is why you 
> schedule the reset work explicitly at the beginning of 
> amdgpu_drv_delayed_reset_work_handler and then also call amdgpu_do_asic_reset, 
> which will do the same thing too. It looks like the physical reset will 
> execute twice for each device.
> Another thing, more of an improvement suggestion: currently you schedule 
> delayed_reset_work using AMDGPU_RESUME_MS - so I guess this should give 
> enough time for the SMU to start? Is there maybe a way to instead poll for SMU 
> start completion and then execute this - some SMU status registers maybe? 
> Just to avoid relying on this arbitrary value.
> 
> Andrey
> 
> On 2021-03-05 8:37 p.m., Liu, Shaoyun wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi,  Andrey
>> The existing reset functions (amdgpu_device_gpu_recover or 
>> amdgpu_do_asic_reset) assume the driver already has the correct hive info. But in 
>> my case it's not true. The GPUs are in a bad state and the XGMI TA might not 
>> function properly, so the driver can not get the hive and node info when 
>> probing the device. It means the driver doesn't even know which hive the 
>> device belongs to on a system with a multiple-hive configuration (e.g., 8 GPUs in two 
>> 

Re: [PATCH v1 12/15] powerpc/uaccess: Refactor get/put_user() and __get/put_user()

2021-03-08 Thread Christian König
The radeon warning is trivial to fix, going to send out a patch in a few 
moments.


Regards,
Christian.

Am 08.03.21 um 13:14 schrieb Christophe Leroy:

+Evgeniy for W1 Dallas
+Alex & Christian for RADEON

Le 07/03/2021 à 11:23, kernel test robot a écrit :

Hi Christophe,

I love your patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on v5.12-rc2 next-20210305]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-Cleanup-of-uaccess-h/20210226-015715
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next

config: powerpc-randconfig-s031-20210307 (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.3.0
reproduce:
 wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # apt-get install sparse
 # sparse version: v0.6.3-245-gacc5c298-dirty
 # https://github.com/0day-ci/linux/commit/449bdbf978936e67e4919be8be0eec3e490a65e2
 git remote add linux-review https://github.com/0day-ci/linux
 git fetch --no-tags linux-review Christophe-Leroy/powerpc-Cleanup-of-uaccess-h/20210226-015715
 git checkout 449bdbf978936e67e4919be8be0eec3e490a65e2
 # save the attached .config to linux build tree
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=powerpc


If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 



The mentioned patch is not the source of the problem; it only makes it 
possible to spot it.


Christophe




"sparse warnings: (new ones prefixed by >>)"
drivers/w1/slaves/w1_ds28e04.c:342:13: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected char [noderef] __user *_pu_addr @@ got char *buf @@
    drivers/w1/slaves/w1_ds28e04.c:342:13: sparse: expected char [noderef] __user *_pu_addr
    drivers/w1/slaves/w1_ds28e04.c:342:13: sparse: got char *buf
drivers/w1/slaves/w1_ds28e04.c:356:13: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected char const [noderef] __user *_gu_addr @@ got char const *buf @@
    drivers/w1/slaves/w1_ds28e04.c:356:13: sparse: expected char const [noderef] __user *_gu_addr
    drivers/w1/slaves/w1_ds28e04.c:356:13: sparse: got char const *buf

--
    drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: cast removes address space '__user' of expression
    drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: cast removes address space '__user' of expression
drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected unsigned int [noderef] __user *_pu_addr @@ got unsigned int [usertype] * @@
    drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: expected unsigned int [noderef] __user *_pu_addr
    drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: got unsigned int [usertype] *
    drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: cast removes address space '__user' of expression


vim +342 drivers/w1/slaves/w1_ds28e04.c

fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  338
fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  339  static ssize_t crccheck_show(struct device *dev, struct device_attribute *attr,
fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  340  		char *buf)
fbf7f7b4e2ae40 Markus Franke  2012-05-26  341  {
fbf7f7b4e2ae40 Markus Franke  2012-05-26 @342  	if (put_user(w1_enable_crccheck + 0x30, buf))
fbf7f7b4e2ae40 Markus Franke  2012-05-26  343  		return -EFAULT;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  344
fbf7f7b4e2ae40 Markus Franke  2012-05-26  345  	return sizeof(w1_enable_crccheck);
fbf7f7b4e2ae40 Markus Franke  2012-05-26  346  }
fbf7f7b4e2ae40 Markus Franke  2012-05-26  347
fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  348  static ssize_t crccheck_store(struct device *dev, struct device_attribute *attr,
fbf7f7b4e2ae40 Markus Franke  2012-05-26  349  		const char *buf, size_t count)
fbf7f7b4e2ae40 Markus Franke  2012-05-26  350  {
fbf7f7b4e2ae40 Markus Franke  2012-05-26  351  	char val;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  352
fbf7f7b4e2ae40 Markus Franke  2012-05-26  353  	if (count != 1 || !buf)
fbf7f7b4e2ae40 Markus Franke  2012-05-26  354  		return -EINVAL;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  355
fbf7f7b4e2ae40 Markus Franke  2012-05-26 @356  	if (get_user(val, buf))
fbf7f7b4e2ae40 Markus Franke  2012-05-26  357  		return -EFAULT;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  358
fbf7f7b4e2ae40 Markus Franke  

Re: [PATCH v1 12/15] powerpc/uaccess: Refactor get/put_user() and __get/put_user()

2021-03-08 Thread Christophe Leroy

+Evgeniy for W1 Dallas
+Alex & Christian for RADEON

Le 07/03/2021 à 11:23, kernel test robot a écrit :

Hi Christophe,

I love your patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on v5.12-rc2 next-20210305]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-Cleanup-of-uaccess-h/20210226-015715
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-randconfig-s031-20210307 (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.3.0
reproduce:
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # apt-get install sparse
 # sparse version: v0.6.3-245-gacc5c298-dirty
 # 
https://github.com/0day-ci/linux/commit/449bdbf978936e67e4919be8be0eec3e490a65e2
 git remote add linux-review https://github.com/0day-ci/linux
 git fetch --no-tags linux-review 
Christophe-Leroy/powerpc-Cleanup-of-uaccess-h/20210226-015715
 git checkout 449bdbf978936e67e4919be8be0eec3e490a65e2
 # save the attached .config to linux build tree
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 
CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=powerpc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 



The mentioned patch is not the source of the problem; it only makes it possible to spot it.

Christophe




"sparse warnings: (new ones prefixed by >>)"

drivers/w1/slaves/w1_ds28e04.c:342:13: sparse: sparse: incorrect type in 
initializer (different address spaces) @@ expected char [noderef] __user 
*_pu_addr @@ got char *buf @@

drivers/w1/slaves/w1_ds28e04.c:342:13: sparse: expected char [noderef] 
__user *_pu_addr
drivers/w1/slaves/w1_ds28e04.c:342:13: sparse: got char *buf

drivers/w1/slaves/w1_ds28e04.c:356:13: sparse: sparse: incorrect type in 
initializer (different address spaces) @@ expected char const [noderef] 
__user *_gu_addr @@ got char const *buf @@

drivers/w1/slaves/w1_ds28e04.c:356:13: sparse: expected char const 
[noderef] __user *_gu_addr
drivers/w1/slaves/w1_ds28e04.c:356:13: sparse: got char const *buf
--
drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: cast removes 
address space '__user' of expression
drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: cast removes 
address space '__user' of expression

drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: incorrect type in 
initializer (different address spaces) @@ expected unsigned int [noderef] 
__user *_pu_addr @@ got unsigned int [usertype] * @@

drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: expected unsigned 
int [noderef] __user *_pu_addr
drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: got unsigned int 
[usertype] *
drivers/gpu/drm/radeon/radeon_ttm.c:933:21: sparse: sparse: cast removes 
address space '__user' of expression

vim +342 drivers/w1/slaves/w1_ds28e04.c

fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  338
fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  339  static ssize_t 
crccheck_show(struct device *dev, struct device_attribute *attr,
fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  340
char *buf)
fbf7f7b4e2ae40 Markus Franke  2012-05-26  341  {
fbf7f7b4e2ae40 Markus Franke  2012-05-26 @342   if 
(put_user(w1_enable_crccheck + 0x30, buf))
fbf7f7b4e2ae40 Markus Franke  2012-05-26  343   return -EFAULT;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  344
fbf7f7b4e2ae40 Markus Franke  2012-05-26  345   return 
sizeof(w1_enable_crccheck);
fbf7f7b4e2ae40 Markus Franke  2012-05-26  346  }
fbf7f7b4e2ae40 Markus Franke  2012-05-26  347
fa33a65a9cf7e2 Greg Kroah-Hartman 2013-08-21  348  static ssize_t 
crccheck_store(struct device *dev, struct device_attribute *attr,
fbf7f7b4e2ae40 Markus Franke  2012-05-26  349 
const char *buf, size_t count)
fbf7f7b4e2ae40 Markus Franke  2012-05-26  350  {
fbf7f7b4e2ae40 Markus Franke  2012-05-26  351   char val;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  352
fbf7f7b4e2ae40 Markus Franke  2012-05-26  353   if (count != 1 || !buf)
fbf7f7b4e2ae40 Markus Franke  2012-05-26  354   return -EINVAL;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  355
fbf7f7b4e2ae40 Markus Franke  2012-05-26 @356   if (get_user(val, buf))
fbf7f7b4e2ae40 Markus Franke  2012-05-26  357   return -EFAULT;
fbf7f7b4e2ae40 Markus Franke  2012-05-26  358
fbf7f7b4e2ae40 Markus Franke  2012-05-26  359   /* convert to decimal */
fbf7f7b4e2ae40 Markus Franke  

Re: [RESEND 00/53] Rid GPU from W=1 warnings

2021-03-08 Thread Lee Jones
On Fri, 05 Mar 2021, Roland Scheidegger wrote:

> The vmwgfx ones look all good to me, so for
> 23-53: Reviewed-by: Roland Scheidegger 
> That said, they were already signed off by Zack, so not sure what
> happened here.

Yes, they were accepted at one point, then dropped without a reason.

Since I rebased onto the latest -next, I had to pluck them back out of
a previous one.

-- 
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/5] drm/amdgpu: use amdgpu_bo_create_user() for gem object

2021-03-08 Thread Nirmoy


On 3/8/21 2:58 PM, Christian König wrote:



Am 08.03.21 um 14:56 schrieb Nirmoy:


On 3/5/21 4:11 PM, Christian König wrote:

We might need to use this for the KFD as well.


Do you mean for amdgpu_amdkfd_alloc_gws() ?


For example, yes. Basically all places where the KFD allocates a BO with 
the TTM type device/user.



Thanks Christian!




Regards,
Christian.




Regards,

Nirmoy





Christian.

Am 05.03.21 um 15:35 schrieb Nirmoy Das:

GEM objects encapsulate amdgpu_bo for userspace applications.
Now that we have a new amdgpu_bo_user subclass for that purpose,
let's use that instead.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c

index 8e9b8a6e6ef0..9d2b55eb31c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -54,6 +54,7 @@ int amdgpu_gem_object_create(struct amdgpu_device 
*adev, unsigned long size,

   struct drm_gem_object **obj)
  {
  struct amdgpu_bo *bo;
+    struct amdgpu_bo_user *ubo;
  struct amdgpu_bo_param bp;
  int r;
  @@ -68,7 +69,7 @@ int amdgpu_gem_object_create(struct 
amdgpu_device *adev, unsigned long size,

  retry:
  bp.flags = flags;
  bp.domain = initial_domain;
-    r = amdgpu_bo_create(adev, &bp, &bo);
+    r = amdgpu_bo_create_user(adev, &bp, &ubo);
  if (r) {
  if (r != -ERESTARTSYS) {
  if (flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) {
@@ -85,6 +86,7 @@ int amdgpu_gem_object_create(struct amdgpu_device 
*adev, unsigned long size,

  }
  return r;
  }
+    bo = &ubo->bo;
   *obj = &bo->tbo.base;
    return 0;







Re: [PATCH 3/5] drm/amdgpu: use amdgpu_bo_create_user() for gem object

2021-03-08 Thread Christian König



Am 08.03.21 um 14:56 schrieb Nirmoy:


On 3/5/21 4:11 PM, Christian König wrote:

We might need to use this for the KFD as well.


Do you mean for amdgpu_amdkfd_alloc_gws() ?


For example, yes. Basically all places where the KFD allocates a BO with 
the TTM type device/user.


Regards,
Christian.




Regards,

Nirmoy





Christian.

Am 05.03.21 um 15:35 schrieb Nirmoy Das:

GEM objects encapsulate amdgpu_bo for userspace applications.
Now that we have a new amdgpu_bo_user subclass for that purpose,
let's use that instead.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c

index 8e9b8a6e6ef0..9d2b55eb31c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -54,6 +54,7 @@ int amdgpu_gem_object_create(struct amdgpu_device 
*adev, unsigned long size,

   struct drm_gem_object **obj)
  {
  struct amdgpu_bo *bo;
+    struct amdgpu_bo_user *ubo;
  struct amdgpu_bo_param bp;
  int r;
  @@ -68,7 +69,7 @@ int amdgpu_gem_object_create(struct 
amdgpu_device *adev, unsigned long size,

  retry:
  bp.flags = flags;
  bp.domain = initial_domain;
-    r = amdgpu_bo_create(adev, &bp, &bo);
+    r = amdgpu_bo_create_user(adev, &bp, &ubo);
  if (r) {
  if (r != -ERESTARTSYS) {
  if (flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) {
@@ -85,6 +86,7 @@ int amdgpu_gem_object_create(struct amdgpu_device 
*adev, unsigned long size,

  }
  return r;
  }
+    bo = &ubo->bo;
   *obj = &bo->tbo.base;
    return 0;






Re: [PATCH 3/5] drm/amdgpu: use amdgpu_bo_create_user() for gem object

2021-03-08 Thread Nirmoy


On 3/5/21 4:11 PM, Christian König wrote:

We might need to use this for the KFD as well.


Do you mean for amdgpu_amdkfd_alloc_gws() ?


Regards,

Nirmoy





Christian.

Am 05.03.21 um 15:35 schrieb Nirmoy Das:

GEM objects encapsulate amdgpu_bo for userspace applications.
Now that we have a new amdgpu_bo_user subclass for that purpose,
let's use that instead.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c

index 8e9b8a6e6ef0..9d2b55eb31c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -54,6 +54,7 @@ int amdgpu_gem_object_create(struct amdgpu_device 
*adev, unsigned long size,

   struct drm_gem_object **obj)
  {
  struct amdgpu_bo *bo;
+    struct amdgpu_bo_user *ubo;
  struct amdgpu_bo_param bp;
  int r;
  @@ -68,7 +69,7 @@ int amdgpu_gem_object_create(struct amdgpu_device 
*adev, unsigned long size,

  retry:
  bp.flags = flags;
  bp.domain = initial_domain;
-    r = amdgpu_bo_create(adev, &bp, &bo);
+    r = amdgpu_bo_create_user(adev, &bp, &ubo);
  if (r) {
  if (r != -ERESTARTSYS) {
  if (flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) {
@@ -85,6 +86,7 @@ int amdgpu_gem_object_create(struct amdgpu_device 
*adev, unsigned long size,

  }
  return r;
  }
+    bo = &ubo->bo;
   *obj = &bo->tbo.base;
    return 0;





[PATCH 5/8] drm/amdgpu: use the new cursor in amdgpu_ttm_access_memory

2021-03-08 Thread Christian König
Separate the drm_mm_node walking from the actual handling.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 67 +++--
 1 file changed, 18 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 517611b709fa..2cbe4ace591f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -178,26 +178,6 @@ static int amdgpu_verify_access(struct ttm_buffer_object 
*bo, struct file *filp)
  filp->private_data);
 }
 
-/**
- * amdgpu_find_mm_node - Helper function finds the drm_mm_node corresponding to
- * @offset. It also modifies the offset to be within the drm_mm_node returned
- *
- * @mem: The region where the bo resides.
- * @offset: The offset that drm_mm_node is used for finding.
- *
- */
-static struct drm_mm_node *amdgpu_find_mm_node(struct ttm_resource *mem,
-  uint64_t *offset)
-{
-   struct drm_mm_node *mm_node = mem->mm_node;
-
-   while (*offset >= (mm_node->size << PAGE_SHIFT)) {
-   *offset -= (mm_node->size << PAGE_SHIFT);
-   ++mm_node;
-   }
-   return mm_node;
-}
-
 /**
  * amdgpu_ttm_map_buffer - Map memory into the GART windows
  * @bo: buffer object to map
@@ -1478,41 +1458,36 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct 
ttm_buffer_object *bo,
  * access for debugging purposes.
  */
 static int amdgpu_ttm_access_memory(struct ttm_buffer_object *bo,
-   unsigned long offset,
-   void *buf, int len, int write)
+   unsigned long offset, void *buf, int len,
+   int write)
 {
struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
-   struct drm_mm_node *nodes;
+   struct amdgpu_res_cursor cursor;
+   unsigned long flags;
uint32_t value = 0;
int ret = 0;
-   uint64_t pos;
-   unsigned long flags;
 
if (bo->mem.mem_type != TTM_PL_VRAM)
return -EIO;
 
-   pos = offset;
-   nodes = amdgpu_find_mm_node(&abo->tbo.mem, &pos);
-   pos += (nodes->start << PAGE_SHIFT);
-
-   while (len && pos < adev->gmc.mc_vram_size) {
-   uint64_t aligned_pos = pos & ~(uint64_t)3;
-   uint64_t bytes = 4 - (pos & 3);
-   uint32_t shift = (pos & 3) * 8;
+   amdgpu_res_first(&bo->mem, offset, len, &cursor);
+   while (cursor.remaining) {
+   uint64_t aligned_pos = cursor.start & ~(uint64_t)3;
+   uint64_t bytes = 4 - (cursor.start & 3);
+   uint32_t shift = (cursor.start & 3) * 8;
		uint32_t mask = 0xffffffff << shift;
 
-   if (len < bytes) {
-   mask &= 0xffffffff >> (bytes - len) * 8;
-   bytes = len;
+   if (cursor.size < bytes) {
+   mask &= 0xffffffff >> (bytes - cursor.size) * 8;
+   bytes = cursor.size;
}
 
		if (mask != 0xffffffff) {
			spin_lock_irqsave(&adev->mmio_idx_lock, flags);
			WREG32_NO_KIQ(mmMM_INDEX, ((uint32_t)aligned_pos) | 0x80000000);
			WREG32_NO_KIQ(mmMM_INDEX_HI, aligned_pos >> 31);
-			if (!write || mask != 0xffffffff)
-				value = RREG32_NO_KIQ(mmMM_DATA);
+			value = RREG32_NO_KIQ(mmMM_DATA);
if (write) {
value &= ~mask;
value |= (*(uint32_t *)buf << shift) & mask;
@@ -1524,21 +1499,15 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
				memcpy(buf, &value, bytes);
}
} else {
-   bytes = (nodes->start + nodes->size) << PAGE_SHIFT;
-   bytes = min(bytes - pos, (uint64_t)len & ~0x3ull);
-
-   amdgpu_device_vram_access(adev, pos, (uint32_t *)buf,
- bytes, write);
+   bytes = cursor.size & ~0x3ull;
+   amdgpu_device_vram_access(adev, cursor.start,
+ (uint32_t *)buf, bytes,
+ write);
}
 
ret += bytes;
buf = (uint8_t *)buf + bytes;
-   pos += bytes;
-   len -= bytes;
-   if (pos >= (nodes->start + nodes->size) << PAGE_SHIFT) {
-   ++nodes;
-   pos = (nodes->start << PAGE_SHIFT);
-   }
+   amdgpu_res_next(&cursor, bytes);
 

[PATCH 8/8] drm/amdgpu: use the new cursor in the VM code

2021-03-08 Thread Christian König
Separate the drm_mm_node walking from the actual handling.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 54 +-
 1 file changed, 18 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9d19078246c8..bf9638bd0ddd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -37,6 +37,7 @@
 #include "amdgpu_gmc.h"
 #include "amdgpu_xgmi.h"
 #include "amdgpu_dma_buf.h"
+#include "amdgpu_res_cursor.h"
 
 /**
  * DOC: GPUVM
@@ -1582,7 +1583,7 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
  * @last: last mapped entry
  * @flags: flags for the entries
  * @offset: offset into nodes and pages_addr
- * @nodes: array of drm_mm_nodes with the MC addresses
+ * @res: ttm_resource to map
  * @pages_addr: DMA addresses to use for mapping
  * @fence: optional resulting fence
  *
@@ -1597,13 +1598,13 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
   bool unlocked, struct dma_resv *resv,
   uint64_t start, uint64_t last,
   uint64_t flags, uint64_t offset,
-  struct drm_mm_node *nodes,
+  struct ttm_resource *res,
   dma_addr_t *pages_addr,
   struct dma_fence **fence)
 {
struct amdgpu_vm_update_params params;
+   struct amdgpu_res_cursor cursor;
enum amdgpu_sync_mode sync_mode;
-   uint64_t pfn;
int r;
 
	memset(&params, 0, sizeof(params));
@@ -1621,14 +1622,6 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
else
sync_mode = AMDGPU_SYNC_EXPLICIT;
 
-   pfn = offset >> PAGE_SHIFT;
-   if (nodes) {
-   while (pfn >= nodes->size) {
-   pfn -= nodes->size;
-   ++nodes;
-   }
-   }
-
amdgpu_vm_eviction_lock(vm);
if (vm->evicting) {
r = -EBUSY;
@@ -1647,23 +1640,17 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
if (r)
goto error_unlock;
 
-   do {
+   amdgpu_res_first(res, offset, (last - start + 1) * AMDGPU_GPU_PAGE_SIZE,
+			 &cursor);
+   while (cursor.remaining) {
uint64_t tmp, num_entries, addr;
 
-
-   num_entries = last - start + 1;
-   if (nodes) {
-   addr = nodes->start << PAGE_SHIFT;
-   num_entries = min((nodes->size - pfn) *
-   AMDGPU_GPU_PAGES_IN_CPU_PAGE, num_entries);
-   } else {
-   addr = 0;
-   }
-
+   num_entries = cursor.size >> AMDGPU_GPU_PAGE_SHIFT;
if (pages_addr) {
bool contiguous = true;
 
if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
+   uint64_t pfn = cursor.start >> PAGE_SHIFT;
uint64_t count;
 
contiguous = pages_addr[pfn + 1] ==
@@ -1683,16 +1670,18 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
}
 
if (!contiguous) {
-   addr = pfn << PAGE_SHIFT;
+   addr = cursor.start;
params.pages_addr = pages_addr;
} else {
-   addr = pages_addr[pfn];
+   addr = pages_addr[cursor.start >> PAGE_SHIFT];
params.pages_addr = NULL;
}
 
} else if (flags & (AMDGPU_PTE_VALID | AMDGPU_PTE_PRT)) {
-   addr += bo_adev->vm_manager.vram_base_offset;
-   addr += pfn << PAGE_SHIFT;
+   addr = bo_adev->vm_manager.vram_base_offset +
+   cursor.start;
+   } else {
+   addr = 0;
}
 
tmp = start + num_entries;
@@ -1700,14 +1689,9 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
if (r)
goto error_unlock;
 
-   pfn += num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
-   if (nodes && nodes->size == pfn) {
-   pfn = 0;
-   ++nodes;
-   }
+   amdgpu_res_next(&cursor, num_entries * AMDGPU_GPU_PAGE_SIZE);
start = tmp;
-
-   } while (unlikely(start != last + 1));
+   };
 
	r = vm->update_funcs->commit(&params, fence);
 
@@ -1736,7 

[PATCH 7/8] drm/amdgpu: use the new cursor in amdgpu_ttm_bo_eviction_valuable

2021-03-08 Thread Christian König
Separate the drm_mm_node walking from the actual handling.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index d469ba5fef2c..5d88d1850781 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1398,7 +1398,7 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct 
ttm_buffer_object *bo,
const struct ttm_place *place)
 {
unsigned long num_pages = bo->mem.num_pages;
-   struct drm_mm_node *node = bo->mem.mm_node;
+   struct amdgpu_res_cursor cursor;
struct dma_resv_list *flist;
struct dma_fence *f;
int i;
@@ -1430,13 +1430,15 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct 
ttm_buffer_object *bo,
 
case TTM_PL_VRAM:
/* Check each drm MM node individually */
-   while (num_pages) {
-   if (place->fpfn < (node->start + node->size) &&
-   !(place->lpfn && place->lpfn <= node->start))
+   amdgpu_res_first(&bo->mem, 0, (u64)num_pages << PAGE_SHIFT,
+			 &cursor);
+   while (cursor.remaining) {
+   if (place->fpfn < PFN_DOWN(cursor.start + cursor.size)
+   && !(place->lpfn &&
+place->lpfn <= PFN_DOWN(cursor.start)))
return true;
 
-   num_pages -= node->size;
-   ++node;
+   amdgpu_res_next(&cursor, cursor.size);
}
return false;
 
-- 
2.25.1



[PATCH 3/8] drm/amdgpu: use the new cursor in amdgpu_fill_buffer

2021-03-08 Thread Christian König
Separate the drm_mm_node walking from the actual handling.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 65 ++---
 1 file changed, 15 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index f71a9ff06748..365d5693f5f0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -178,28 +178,6 @@ static int amdgpu_verify_access(struct ttm_buffer_object 
*bo, struct file *filp)
  filp->private_data);
 }
 
-/**
- * amdgpu_mm_node_addr - Compute the GPU relative offset of a GTT buffer.
- *
- * @bo: The bo to assign the memory to.
- * @mm_node: Memory manager node for drm allocator.
- * @mem: The region where the bo resides.
- *
- */
-static uint64_t amdgpu_mm_node_addr(struct ttm_buffer_object *bo,
-   struct drm_mm_node *mm_node,
-   struct ttm_resource *mem)
-{
-   uint64_t addr = 0;
-
-   if (mm_node->start != AMDGPU_BO_INVALID_OFFSET) {
-   addr = mm_node->start << PAGE_SHIFT;
-   addr += amdgpu_ttm_domain_start(amdgpu_ttm_adev(bo->bdev),
-   mem->mem_type);
-   }
-   return addr;
-}
-
 /**
  * amdgpu_find_mm_node - Helper function finds the drm_mm_node corresponding to
  * @offset. It also modifies the offset to be within the drm_mm_node returned
@@ -2081,9 +2059,9 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
uint32_t max_bytes = adev->mman.buffer_funcs->fill_max_bytes;
struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
 
-   struct drm_mm_node *mm_node;
-   unsigned long num_pages;
+   struct amdgpu_res_cursor cursor;
unsigned int num_loops, num_dw;
+   uint64_t num_bytes;
 
struct amdgpu_job *job;
int r;
@@ -2099,15 +2077,13 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
return r;
}
 
-   num_pages = bo->tbo.mem.num_pages;
-   mm_node = bo->tbo.mem.mm_node;
+   num_bytes = bo->tbo.mem.num_pages << PAGE_SHIFT;
num_loops = 0;
-   while (num_pages) {
-   uint64_t byte_count = mm_node->size << PAGE_SHIFT;
 
-   num_loops += DIV_ROUND_UP_ULL(byte_count, max_bytes);
-   num_pages -= mm_node->size;
-   ++mm_node;
+   amdgpu_res_first(&bo->tbo.mem, 0, num_bytes, &cursor);
+   while (cursor.remaining) {
+   num_loops += DIV_ROUND_UP_ULL(cursor.size, max_bytes);
+   amdgpu_res_next(&cursor, cursor.size);
}
num_dw = num_loops * adev->mman.buffer_funcs->fill_num_dw;
 
@@ -2129,27 +2105,16 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
}
}
 
-   num_pages = bo->tbo.mem.num_pages;
-   mm_node = bo->tbo.mem.mm_node;
-
-   while (num_pages) {
-   uint64_t byte_count = mm_node->size << PAGE_SHIFT;
-   uint64_t dst_addr;
+   amdgpu_res_first(&bo->tbo.mem, 0, num_bytes, &cursor);
+   while (cursor.remaining) {
+   uint32_t cur_size = min_t(uint64_t, cursor.size, max_bytes);
+   uint64_t dst_addr = cursor.start;
 
-   dst_addr = amdgpu_mm_node_addr(&bo->tbo, mm_node, &bo->tbo.mem);
-   while (byte_count) {
-   uint32_t cur_size_in_bytes = min_t(uint64_t, byte_count,
-  max_bytes);
+   dst_addr += amdgpu_ttm_domain_start(adev, bo->tbo.mem.mem_type);
+   amdgpu_emit_fill_buffer(adev, &job->ibs[0], src_data, dst_addr,
+   cur_size);
 
-   amdgpu_emit_fill_buffer(adev, &job->ibs[0], src_data,
-   dst_addr, cur_size_in_bytes);
-
-   dst_addr += cur_size_in_bytes;
-   byte_count -= cur_size_in_bytes;
-   }
-
-   num_pages -= mm_node->size;
-   ++mm_node;
+   amdgpu_res_next(&cursor, cur_size);
}
 
	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
-- 
2.25.1



[PATCH 4/8] drm/amdgpu: use new cursor in amdgpu_ttm_io_mem_pfn

2021-03-08 Thread Christian König
Separate the drm_mm_node walking from the actual handling.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 365d5693f5f0..517611b709fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -631,12 +631,10 @@ static unsigned long amdgpu_ttm_io_mem_pfn(struct 
ttm_buffer_object *bo,
   unsigned long page_offset)
 {
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
-   uint64_t offset = (page_offset << PAGE_SHIFT);
-   struct drm_mm_node *mm;
+   struct amdgpu_res_cursor cursor;
 
-   mm = amdgpu_find_mm_node(&bo->mem, &offset);
-   offset += adev->gmc.aper_base;
-   return mm->start + (offset >> PAGE_SHIFT);
+   amdgpu_res_first(&bo->mem, (u64)page_offset << PAGE_SHIFT, 0, &cursor);
+   return (adev->gmc.aper_base + cursor.start) >> PAGE_SHIFT;
 }
 
 /**
-- 
2.25.1



[PATCH 6/8] drm/amdgpu: use new cursor in amdgpu_mem_visible

2021-03-08 Thread Christian König
Separate the drm_mm_node walking from the actual handling.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 2cbe4ace591f..d469ba5fef2c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -441,7 +441,8 @@ static int amdgpu_move_blit(struct ttm_buffer_object *bo,
 static bool amdgpu_mem_visible(struct amdgpu_device *adev,
   struct ttm_resource *mem)
 {
-   struct drm_mm_node *nodes = mem->mm_node;
+   uint64_t mem_size = (u64)mem->num_pages << PAGE_SHIFT;
+   struct amdgpu_res_cursor cursor;
 
if (mem->mem_type == TTM_PL_SYSTEM ||
mem->mem_type == TTM_PL_TT)
@@ -449,12 +450,13 @@ static bool amdgpu_mem_visible(struct amdgpu_device *adev,
if (mem->mem_type != TTM_PL_VRAM)
return false;
 
+   amdgpu_res_first(mem, 0, mem_size, &cursor);
+
/* ttm_resource_ioremap only supports contiguous memory */
-   if (nodes->size != mem->num_pages)
+   if (cursor.size != mem_size)
return false;
 
-   return ((nodes->start + nodes->size) << PAGE_SHIFT)
-   <= adev->gmc.visible_vram_size;
+   return cursor.start + cursor.size <= adev->gmc.visible_vram_size;
 }
 
 /*
-- 
2.25.1



[PATCH 1/8] drm/amdgpu: new resource cursor

2021-03-08 Thread Christian König
Allow walking over the drm_mm nodes in a TTM resource object.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h| 105 ++
 1 file changed, 105 insertions(+)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
new file mode 100644
index ..1335e098510f
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Copyright 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: Christian König
+ */
+
+#ifndef __AMDGPU_RES_CURSOR_H__
+#define __AMDGPU_RES_CURSOR_H__
+
+#include <drm/drm_mm.h>
+#include <drm/ttm/ttm_resource.h>
+
+/* state back for walking over vram_mgr and gtt_mgr allocations */
+struct amdgpu_res_cursor {
+   uint64_tstart;
+   uint64_tsize;
+   uint64_tremaining;
+   struct drm_mm_node  *node;
+};
+
+/**
+ * amdgpu_res_first - initialize a amdgpu_res_cursor
+ *
+ * @res: TTM resource object to walk
+ * @start: Start of the range
+ * @size: Size of the range
+ * @cur: cursor object to initialize
+ *
+ * Start walking over the range of allocations between @start and @size.
+ */
+static inline void amdgpu_res_first(struct ttm_resource *res,
+   uint64_t start, uint64_t size,
+   struct amdgpu_res_cursor *cur)
+{
+   struct drm_mm_node *node;
+
+   if (!res || !res->mm_node) {
+   cur->start = start;
+   cur->size = size;
+   cur->remaining = size;
+   cur->node = NULL;
+   return;
+   }
+
+   BUG_ON(start + size > res->num_pages << PAGE_SHIFT);
+
+   node = res->mm_node;
+   while (start > node->size << PAGE_SHIFT)
+   start -= node++->size << PAGE_SHIFT;
+
+   cur->start = (node->start << PAGE_SHIFT) + start;
+   cur->size = (node->size << PAGE_SHIFT) - start;
+   cur->remaining = size;
+   cur->node = node;
+}
+
+/**
+ * amdgpu_res_next - advance the cursor
+ *
+ * @cur: the cursor to advance
+ * @size: number of bytes to move forward
+ *
+ * Move the cursor @size bytes forward, walking to the next node if necessary.
+ */
+static inline void amdgpu_res_next(struct amdgpu_res_cursor *cur, uint64_t 
size)
+{
+   struct drm_mm_node *node = cur->node;
+
+   BUG_ON(size > cur->remaining);
+
+   cur->remaining -= size;
+   if (!cur->remaining)
+   return;
+
+   cur->size -= size;
+   if (cur->size) {
+   cur->start += size;
+   return;
+   }
+
+   cur->node = ++node;
+   cur->start = node->start << PAGE_SHIFT;
+   cur->size = min(node->size << PAGE_SHIFT, cur->remaining);
+}
+
+#endif
-- 
2.25.1



[PATCH 2/8] drm/amdgpu: use the new cursor in amdgpu_ttm_copy_mem_to_mem

2021-03-08 Thread Christian König
Separate the drm_mm_node walking from the actual handling.

Signed-off-by: Christian König 
Acked-by: Oak Zeng 
Tested-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 87 -
 1 file changed, 26 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 37751e7e4edc..f71a9ff06748 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -56,6 +56,7 @@
 #include "amdgpu_sdma.h"
 #include "amdgpu_ras.h"
 #include "amdgpu_atomfirmware.h"
+#include "amdgpu_res_cursor.h"
 #include "bif/bif_4_1_d.h"
 
 #define AMDGPU_TTM_VRAM_MAX_DW_READ(size_t)128
@@ -223,9 +224,8 @@ static struct drm_mm_node *amdgpu_find_mm_node(struct 
ttm_resource *mem,
  * amdgpu_ttm_map_buffer - Map memory into the GART windows
  * @bo: buffer object to map
  * @mem: memory object to map
- * @mm_node: drm_mm node object to map
+ * @mm_cur: range to map
  * @num_pages: number of pages to map
- * @offset: offset into @mm_node where to start
  * @window: which GART window to use
  * @ring: DMA ring to use for the copy
  * @tmz: if we should setup a TMZ enabled mapping
@@ -236,10 +236,10 @@ static struct drm_mm_node *amdgpu_find_mm_node(struct 
ttm_resource *mem,
  */
 static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
 struct ttm_resource *mem,
-struct drm_mm_node *mm_node,
-unsigned num_pages, uint64_t offset,
-unsigned window, struct amdgpu_ring *ring,
-bool tmz, uint64_t *addr)
+struct amdgpu_res_cursor *mm_cur,
+unsigned num_pages, unsigned window,
+struct amdgpu_ring *ring, bool tmz,
+uint64_t *addr)
 {
struct amdgpu_device *adev = ring->adev;
struct amdgpu_job *job;
@@ -256,14 +256,15 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object 
*bo,
 
/* Map only what can't be accessed directly */
if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) {
-   *addr = amdgpu_mm_node_addr(bo, mm_node, mem) + offset;
+   *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) +
+   mm_cur->start;
return 0;
}
 
*addr = adev->gmc.gart_start;
*addr += (u64)window * AMDGPU_GTT_MAX_TRANSFER_SIZE *
AMDGPU_GPU_PAGE_SIZE;
-   *addr += offset & ~PAGE_MASK;
+   *addr += mm_cur->start & ~PAGE_MASK;
 
num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
num_bytes = num_pages * 8;
@@ -291,17 +292,17 @@ static int amdgpu_ttm_map_buffer(struct ttm_buffer_object 
*bo,
cpu_addr = &job->ibs[0].ptr[num_dw];
 
if (mem->mem_type == TTM_PL_TT) {
-   dma_addr_t *dma_address;
+   dma_addr_t *dma_addr;
 
-   dma_address = &bo->ttm->dma_address[offset >> PAGE_SHIFT];
-   r = amdgpu_gart_map(adev, 0, num_pages, dma_address, flags,
+   dma_addr = &bo->ttm->dma_address[mm_cur->start >> PAGE_SHIFT];
+   r = amdgpu_gart_map(adev, 0, num_pages, dma_addr, flags,
cpu_addr);
if (r)
goto error_free;
} else {
dma_addr_t dma_address;
 
-   dma_address = (mm_node->start << PAGE_SHIFT) + offset;
+   dma_address = mm_cur->start;
dma_address += adev->vm_manager.vram_base_offset;
 
for (i = 0; i < num_pages; ++i) {
@@ -353,9 +354,8 @@ int amdgpu_ttm_copy_mem_to_mem(struct amdgpu_device *adev,
const uint32_t GTT_MAX_BYTES = (AMDGPU_GTT_MAX_TRANSFER_SIZE *
AMDGPU_GPU_PAGE_SIZE);
 
-   uint64_t src_node_size, dst_node_size, src_offset, dst_offset;
struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
-   struct drm_mm_node *src_mm, *dst_mm;
+   struct amdgpu_res_cursor src_mm, dst_mm;
struct dma_fence *fence = NULL;
int r = 0;
 
@@ -364,29 +364,13 @@ int amdgpu_ttm_copy_mem_to_mem(struct amdgpu_device *adev,
return -EINVAL;
}
 
-   src_offset = src->offset;
-   if (src->mem->mm_node) {
-   src_mm = amdgpu_find_mm_node(src->mem, &src_offset);
-   src_node_size = (src_mm->size << PAGE_SHIFT) - src_offset;
-   } else {
-   src_mm = NULL;
-   src_node_size = ULLONG_MAX;
-   }
-
-   dst_offset = dst->offset;
-   if (dst->mem->mm_node) {
-   dst_mm = amdgpu_find_mm_node(dst->mem, &dst_offset);
-   dst_node_size = (dst_mm->size << PAGE_SHIFT) - dst_offset;
-   } else {
-   dst_mm = NULL;
-   dst_node_size = ULLONG_MAX;
-   }
+   

RE: [PATCH] drm/amdgpu: Check if FB BAR is enabled for ROM read

2021-03-08 Thread Zhang, Hawking
[AMD Public Use]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
From: Lazar, Lijo 
Sent: Monday, March 8, 2021 21:16
To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Xu, Feifei 
; Zhang, Hawking 
Subject: RE: [PATCH] drm/amdgpu: Check if FB BAR is enabled for ROM read


[AMD Public Use]



From: amd-gfx 
mailto:amd-gfx-boun...@lists.freedesktop.org>>
 On Behalf Of Lazar, Lijo
Sent: Wednesday, March 3, 2021 10:15 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; Xu, Feifei 
mailto:feifei...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: [PATCH] drm/amdgpu: Check if FB BAR is enabled for ROM read


[AMD Public Use]

Some configurations don't have FB BAR enabled. Avoid reading ROM image
from FB BAR region in such cases.

Signed-off-by: Lijo Lazar mailto:lijo.la...@amd.com>>
Reviewed-by: Hawking Zhang mailto:hawking.zh...@amd.com>>
Reviewed-by: Feifei Xu mailto:feifei...@amd.com>>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 4 
1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
index efdf639f6593..f454a6bd0ed6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
@@ -97,6 +97,10 @@ static bool igp_read_bios_from_vram(struct amdgpu_device 
*adev)
if (amdgpu_device_need_post(adev))
return false;

+   /* FB BAR not enabled */
+   if (pci_resource_len(adev->pdev, 0) == 0)
+   return false;
+
adev->bios = NULL;
vram_base = pci_resource_start(adev->pdev, 0);
bios = ioremap_wc(vram_base, size);
--
2.29.2


RE: [PATCH] drm/amdgpu: Check if FB BAR is enabled for ROM read

2021-03-08 Thread Lazar, Lijo
[AMD Public Use]





Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Christian König

Hi Jack,

yes that comes pretty close. I'm going over the patch right now.

Some things still look a bit complicated to me, but I need to wrap my 
head around how and why we are doing it this way once more.


Christian.

Am 08.03.21 um 13:43 schrieb Zhang, Jack (Jian):

[AMD Public Use]

Hi, Christian,

I made some changes in the V3 patch that insert a dma_fence_wait for the first
job after resubmitting the jobs.
It seems simpler than the V2 patch. Is this what you first had in mind?

Thanks,
Jack

-Original Message-
From: Koenig, Christian 
Sent: Monday, March 8, 2021 3:53 PM
To: Liu, Monk ; Zhang, Jack (Jian) ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; Deng, Emily 

Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode



Am 08.03.21 um 05:06 schrieb Liu, Monk:

[AMD Official Use Only - Internal Distribution Only]


well first of all please completely drop the affinity group stuff from this 
patch. We should concentrate on one feature at a time.

We need it to expedite the process, we can introduce this change in
another patch



Then the implementation is way too complicated. All you need to do is insert a 
dma_fence_wait after re-scheduling each job after a reset.

No, that's not true: during drm_sched_resubmit_jobs it will put
all jobs in the mirror list into the hw ring, but we can only allow the
first job into the ring to catch the real guilty one (otherwise it is possible that a
later job in the ring also has a bug and affects our judgement). So we need to implement a
new drm_sched_resubmit_jobs2(), like this:

Something like this. But since waiting for the guilty job is AMD specific we 
should rather rework the stuff from the beginning.

What I have in mind is the following:
1. Add a reference from the scheduler fence back to the job which is cleared 
only when the scheduler fence finishes.
2. Completely drop the ring_mirror_list and replace it with a kfifo of pointers 
to the active scheduler fences.
3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro 
which allows drivers to iterate over all the active jobs and resubmit/wait/mark 
them as guilty etc etc..
4. Remove the guilty/karma handling from the scheduler. This is something AMD 
specific and shouldn't leak into common code.

Regards,
Christian.


drm_sched_resubmit_jobs2() (lines marked + are the additions over
drm_sched_resubmit_jobs):

void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
{
	struct drm_sched_job *s_job, *tmp;
	uint64_t guilty_context;
	bool found_guilty = false;
	struct dma_fence *fence;
+	int i = 0;

	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
		struct drm_sched_fence *s_fence = s_job->s_fence;

+		if (i >= max)
+			break;
+
		if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
			found_guilty = true;
			guilty_context = s_job->s_fence->scheduled.context;
		}

		if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
			dma_fence_set_error(&s_fence->finished, -ECANCELED);

		dma_fence_put(s_job->s_fence->parent);
		fence = sched->ops->run_job(s_job);
+		i++;

		if (IS_ERR_OR_NULL(fence)) {
			if (IS_ERR(fence))
				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));

			s_job->s_fence->parent = NULL;
		} else {
			s_job->s_fence->parent = fence;
		}
	}
}
EXPORT_SYMBOL(drm_sched_resubmit_jobs2);



Thanks

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: Koenig, Christian 
Sent: Sunday, March 7, 2021 3:03 AM
To: Zhang, Jack (Jian) ;
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
; Liu, Monk ; Deng, Emily

Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

Hi Jack,

well first of all please completely drop the affinity group stuff from this 
patch. We should concentrate on one feature at a time.

Then the implementation is way too complicated. All you need to do is insert a 
dma_fence_wait after re-scheduling each job after a reset.

In addition, this feature is completely AMD specific and shouldn't affect 
the common scheduler in any way.

Regards,
Christian.

Am 06.03.21 um 18:25 schrieb Jack Zhang:

[Why]
Previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout because gfx and compute ring
share internal GC HW mutually.

[How]
This patch implements an advanced tdr mode. It involves an additional
synchronous pre-resubmit step (Step0 Resubmit) before the normal 

RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Zhang, Jack (Jian)
[AMD Public Use]

Hi, Christian,

I made some changes in the V3 patch that insert a dma_fence_wait for the first
job after resubmitting the jobs.
It seems simpler than the V2 patch. Is this what you first had in mind?

Thanks,
Jack

-Original Message-
From: Koenig, Christian  
Sent: Monday, March 8, 2021 3:53 PM
To: Liu, Monk ; Zhang, Jack (Jian) ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; 
Deng, Emily 
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode



Am 08.03.21 um 05:06 schrieb Liu, Monk:
> [AMD Official Use Only - Internal Distribution Only]
>
>>> well first of all please completely drop the affinity group stuff from this 
>>> patch. We should concentrate on one feature at a time.
> We need it to expedite the process, we can introduce this change in 
> another patch
>
>
>>> Then the implementation is way too complicated. All you need to do is insert 
>>> a dma_fence_wait after re-scheduling each job after a reset.
> No, that's not true: during drm_sched_resubmit_jobs it will put
> all jobs in the mirror list into the hw ring, but we can only allow the
> first job into the ring to catch the real guilty one (otherwise it is possible
> that a later job in the ring also has a bug and affects our judgement). So we
> need to implement a new drm_sched_resubmit_jobs2(), like this:

Something like this. But since waiting for the guilty job is AMD specific we 
should rather rework the stuff from the beginning.

What I have in mind is the following:
1. Add a reference from the scheduler fence back to the job which is cleared 
only when the scheduler fence finishes.
2. Completely drop the ring_mirror_list and replace it with a kfifo of pointers 
to the active scheduler fences.
3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro 
which allows drivers to iterate over all the active jobs and resubmit/wait/mark 
them as guilty etc etc..
4. Remove the guilty/karma handling from the scheduler. This is something AMD 
specific and shouldn't leak into common code.

Regards,
Christian.

>
> drm_sched_resubmit_jobs2() (lines marked + are the additions over
> drm_sched_resubmit_jobs):
>
> void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
> {
>	struct drm_sched_job *s_job, *tmp;
>	uint64_t guilty_context;
>	bool found_guilty = false;
>	struct dma_fence *fence;
> +	int i = 0;
>
>	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>		struct drm_sched_fence *s_fence = s_job->s_fence;
>
> +		if (i >= max)
> +			break;
> +
>		if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
>			found_guilty = true;
>			guilty_context = s_job->s_fence->scheduled.context;
>		}
>
>		if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
>			dma_fence_set_error(&s_fence->finished, -ECANCELED);
>
>		dma_fence_put(s_job->s_fence->parent);
>		fence = sched->ops->run_job(s_job);
> +		i++;
>
>		if (IS_ERR_OR_NULL(fence)) {
>			if (IS_ERR(fence))
>				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>
>			s_job->s_fence->parent = NULL;
>		} else {
>			s_job->s_fence->parent = fence;
>		}
>	}
> }
> EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
>
>
>
> Thanks
>
> --
> Monk Liu | Cloud-GPU Core team
> --
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Sunday, March 7, 2021 3:03 AM
> To: Zhang, Jack (Jian) ; 
> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
> ; Liu, Monk ; Deng, Emily 
> 
> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>
> Hi Jack,
>
> well first of all please completely drop the affinity group stuff from this 
> patch. We should concentrate on one feature at a time.
>
> Then the implementation is way too complicated. All you need to do is insert a 
> dma_fence_wait after re-scheduling each job after a reset.
>
> In addition, this feature is completely AMD specific and shouldn't 
> affect the common scheduler in any way.
>
> Regards,
> Christian.
>
> Am 06.03.21 um 18:25 schrieb Jack Zhang:
>> [Why]
>> Previous tdr design treats the first job in job_timeout as the bad job.
>> But sometimes a later bad compute job can block a good gfx job and 
>> cause an unexpected gfx job timeout because gfx and compute ring 
>> share internal GC HW mutually.
>>
>> [How]
>> This patch implements an advanced tdr mode. It involves an additional 
>> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit 
>> step in order to find the real bad job.
>>
>> 1. For Bailing TDR job, re-insert it to 

[PATCH v3] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Jack Zhang
[Why]
Previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout because gfx and compute ring share
internal GC HW mutually.

[How]
This patch implements an advanced tdr mode.
1. It adds a synchronized resubmit-and-wait step to find the real bad job.
If a job's hw fence gets a timeout, we decrease the old job's karma, treat
the newly found one as the guilty one, and do a hw reset to recover the hw.
After that, we continue the resubmit step to resubmit the remaining jobs.

2. For a whole gpu reset (vram lost), resubmit in the old style.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 33 +
 3 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..fa53c6c00ee9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4639,7 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
-
+   int tmp_vram_lost_counter;
/*
 * Special case: RAS triggered and full reset isn't supported
 */
@@ -4788,6 +4788,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
}
}
 
+   tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
/* Actual ASIC resets if needed.*/
/* TODO Implement XGMI hive reset logic for SRIOV */
if (amdgpu_sriov_vf(adev)) {
@@ -4807,17 +4808,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
 
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = tmp_adev->rings[i];
+   int ret = 0;
+   struct drm_sched_job *s_job = NULL;
 
if (!ring || !ring->sched.thread)
continue;
 
/* No point to resubmit jobs if we didn't HW reset*/
-		if (!tmp_adev->asic_reset_res && !job_signaled)
+		if (!tmp_adev->asic_reset_res && !job_signaled) {
 			drm_sched_resubmit_jobs(&ring->sched);
 
-			drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
+			if (amdgpu_gpu_recovery == 2 &&
+			    !list_empty(&ring->sched.ring_mirror_list) &&
+			    !(tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter))) {
+
+				s_job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
+								 struct drm_sched_job, node);
+				ret = dma_fence_wait_timeout(s_job->s_fence->parent, false,
+							     ring->sched.timeout);
+				if (ret == 0) { /* timeout */
+					/* reset karma to the right job */
+					if (job && job != s_job)
+						amdgpu_sched_decrease_karma(&job->base);
+					drm_sched_increase_karma(s_job);
+
+					/* do hw reset */
+					if (amdgpu_sriov_vf(adev)) {
+						amdgpu_virt_fini_data_exchange(adev);
+						r = amdgpu_device_reset_sriov(adev, false);
+						if (r)
+							adev->asic_reset_res = r;
+					} else {
+						r = amdgpu_do_asic_reset(hive, device_list_handle,
+									 &need_full_reset, false);
+						if (r && r == -EAGAIN)
+							goto retry;
+
+						/* add reset counter so that the following
+						 * resubmitted job could flush vmid
+						 */
+						atomic_inc(&tmp_adev->gpu_reset_counter);
+
+						/* resubmit again the left jobs */
+						drm_sched_resubmit_jobs(&ring->sched);
+					}
+				}
+			}
+		}
+		if (amdgpu_gpu_recovery != 2)
+			drm_sched_start(&ring->sched, 

[PATCH] drm/amdgpu: Remove unnecessary conversion to bool

2021-03-08 Thread Jiapeng Chong
Fix the following coccicheck warnings:

./drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c:1600:40-45: WARNING: conversion
to bool not needed here.

./drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c:1598:40-45: WARNING: conversion
to bool not needed here.

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 690a509..b39e7db 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1595,9 +1595,9 @@ static int sdma_v5_2_set_clockgating_state(void *handle,
case CHIP_VANGOGH:
case CHIP_DIMGREY_CAVEFISH:
sdma_v5_2_update_medium_grain_clock_gating(adev,
-   state == AMD_CG_STATE_GATE ? true : false);
+   state == AMD_CG_STATE_GATE);
sdma_v5_2_update_medium_grain_light_sleep(adev,
-   state == AMD_CG_STATE_GATE ? true : false);
+   state == AMD_CG_STATE_GATE);
break;
default:
break;
-- 
1.8.3.1



[PATCH] gpu: drm: amd: amdgpu: fix error return code of amdgpu_acpi_init()

2021-03-08 Thread Jia-Ju Bai
Add error return codes in the error handling code of amdgpu_acpi_init().

Reported-by: TOTE Robot 
Signed-off-by: Jia-Ju Bai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
index 8155c54392c8..156f30d5a2c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
@@ -788,12 +788,15 @@ int amdgpu_acpi_init(struct amdgpu_device *adev)
 
/* Probe for ATIF, and initialize it if found */
atif_handle = amdgpu_atif_probe_handle(handle);
-   if (!atif_handle)
+   if (!atif_handle) {
+   ret = -EINVAL;
goto out;
+   }
 
atif = kzalloc(sizeof(*atif), GFP_KERNEL);
if (!atif) {
DRM_WARN("Not enough memory to initialize ATIF\n");
+   ret = -ENOMEM;
goto out;
}
atif->handle = atif_handle;
@@ -803,6 +806,7 @@ int amdgpu_acpi_init(struct amdgpu_device *adev)
if (ret) {
DRM_DEBUG_DRIVER("Call to ATIF verify_interface failed: %d\n", 
ret);
kfree(atif);
+   ret = -EINVAL;
goto out;
}
adev->atif = atif;
-- 
2.17.1



[PATCH] drm/amd/display: Remove unnecessary conversion to bool

2021-03-08 Thread Jiapeng Chong
Fix the following coccicheck warnings:

./drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c:561:34-39: WARNING:
conversion to bool not needed here.

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
 drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c
index ae6484a..42a4177 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c
@@ -558,7 +558,7 @@ bool dal_ddc_service_query_ddc_data(
/* should not set mot (middle of transaction) to 0
 * if there are pending read payloads
 */
-   payload.mot = read_size == 0 ? false : true;
+   payload.mot = !(read_size == 0);
payload.length = write_size;
payload.data = write_buf;
 
-- 
1.8.3.1



[PATCH] drm/amd/display: remove duplicate include in amdgpu_dm.c

2021-03-08 Thread menglong8 . dong
From: Zhang Yunkai 

'drm/drm_hdcp.h' included in 'amdgpu_dm.c' is duplicated.
It is also included at line 79.

Signed-off-by: Zhang Yunkai 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 3e1fd1e7d09f..fee46fbcb0b7 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -44,7 +44,6 @@
 #include "amdgpu_dm.h"
 #ifdef CONFIG_DRM_AMD_DC_HDCP
 #include "amdgpu_dm_hdcp.h"
-#include 
 #endif
 #include "amdgpu_pm.h"
 
-- 
2.25.1



[PATCH] drm/amd/display: remove duplicate include in dcn21 and gpio

2021-03-08 Thread menglong8 . dong
From: Zhang Yunkai 

'dce110_resource.h' included in 'dcn21_resource.c' is duplicated.
'hw_gpio.h' included in 'hw_factory_dce110.c' is duplicated.

Signed-off-by: Zhang Yunkai 
---
 drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c | 1 -
 .../gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c| 4 
 2 files changed, 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
index 072f8c880924..8a6a965751e8 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
@@ -61,7 +61,6 @@
 #include "dcn21/dcn21_dccg.h"
 #include "dcn21_hubbub.h"
 #include "dcn10/dcn10_resource.h"
-#include "dce110/dce110_resource.h"
 #include "dce/dce_panel_cntl.h"
 
 #include "dcn20/dcn20_dwb.h"
diff --git a/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c 
b/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c
index 66e4841f41e4..ca335ea60412 100644
--- a/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c
+++ b/drivers/gpu/drm/amd/display/dc/gpio/dce110/hw_factory_dce110.c
@@ -48,10 +48,6 @@
 #define REGI(reg_name, block, id)\
mm ## block ## id ## _ ## reg_name
 
-#include "../hw_gpio.h"
-#include "../hw_ddc.h"
-#include "../hw_hpd.h"
-
 #include "reg_helper.h"
 #include "../hpd_regs.h"
 
-- 
2.25.1
