Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

2021-12-09 Thread Lang Yu
On 12/10/ , Christian König wrote:
> On 10.12.21 at 04:21, Lang Yu wrote:
> > On 12/10/ , Quan, Evan wrote:
> > > [AMD Official Use Only]
> > > 
> > > 
> > > 
> > > > -Original Message-
> > > > From: Yu, Lang 
> > > > Sent: Friday, December 10, 2021 10:34 AM
> > > > To: Quan, Evan 
> > > > Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
> > > > ; Lazar, Lijo ; Huang,
> > > > Ray ; Deucher, Alexander
> > > > ; Koenig, Christian
> > > > 
> > > > Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > > 
> > > > On 12/10/ , Quan, Evan wrote:
> > > > > [AMD Official Use Only]
> > > > > 
> > > > > 
> > > > > 
> > > > > > -Original Message-
> > > > > > From: amd-gfx  On Behalf Of
> > > > > > Lang Yu
> > > > > > Sent: Thursday, December 9, 2021 4:49 PM
> > > > > > To: amd-gfx@lists.freedesktop.org
> > > > > > Cc: Grodzovsky, Andrey ; Lazar, Lijo
> > > > > > ; Huang, Ray ; Deucher,
> > > > > > Alexander ; Yu, Lang
> > > > ;
> > > > > > Koenig, Christian 
> > > > > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > > > > 
> > > > > > SMU firmware guys expect the driver to maintain error context and
> > > > > > not interact with the SMU any more once an SMU error has occurred.
> > > > > > That will aid in debugging SMU firmware issues.
> > > > > > 
> > > > > > Add SMU debug option support for this request; it can be enabled or
> > > > > > disabled via the amdgpu_smu_debug debugfs file.
> > > > > > When enabled, it brings the hardware to a kind of halt state so that
> > > > > > no one can touch it any more in the event of SMU errors.
> > > > > > 
> > > > > > Currently, the driver interacts with the SMU by sending messages.
> > > > > > And there are three ways to send messages to the SMU.
> > > > > > Handle them respectively as follows:
> > > > > > 
> > > > > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > > > > > 
> > > > > >Halt on any error.
> > > > > > 
> > > > > > 2,
> > > > smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > > > > > for longer timeout cases
> > > > > > 
> > > > > >Halt on errors apart from ETIME. Otherwise this way won't work.
> > > > > > 
> > > > > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > > > > > 
> > > > > >Halt on errors apart from ETIME. Otherwise second way won't work.
> > > > > > 
> > > > > > After halting, use BUG() to explicitly notify users.
> > > > > > 
> > > > > > == Command Guide ==
> > > > > > 
> > > > > > 1, enable SMU debug option
> > > > > > 
> > > > > >   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > > > > 
> > > > > > 2, disable SMU debug option
> > > > > > 
> > > > > >   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > > > > 
> > > > > > v4:
> > > > > >   - Set to halt state instead of a simple hang.(Christian)
> > > > > > 
> > > > > > v3:
> > > > > >   - Use debugfs_create_bool().(Christian)
> > > > > >   - Put variable into smu_context struct.
> > > > > >   - Don't resend command when timeout.
> > > > > > 
> > > > > > v2:
> > > > > >   - Resend command when timeout.(Lijo)
> > > > > >   - Use debugfs file instead of module parameter.
> > > > > > 
> > > > > > Signed-off-by: Lang Yu 
> > > > > > ---
> > > > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> > > > > >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h |  5 +
> > > > > >   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 20
> > > > > > +++-
> > > > > >   3 files changed, 27 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > index 164d6a9e9fbb..86cd888c7822 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> > > > amdgpu_device
> > > > > > *adev)
> > > > > > if (!debugfs_initialized())
> > > > > > return 0;
> > > > > > 
> > > > > > +   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > > > > > +   &adev->smu.smu_debug_mode);
> > > > > > +
> > > > > > ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> > > > > >   &fops_ib_preempt);
> > > > > > if (IS_ERR(ent)) {
> > > > > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > index f738f7dc20c9..50dbf5594a9d 100644
> > > > > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > @@ -569,6 +569,11 @@ struct smu_context
> > > > > > struct smu_user_dpm_profile user_dpm_profile;
> > > > > > 
> > > > > > struct stb_context stb_context;
> > > > > > +   /*
> > > > > > +* When enabled, it makes SMU errors fatal.
> > > > > > +* (0 = disabled (default), 1 = enabled)
> > > > > > +*/
> > > > > > +   bool smu_debug_mode;

Re: [PATCH] drm/amdkfd: explicitly create/destroy queue attributes under /sys

2021-12-09 Thread Christian König

On 09.12.21 at 23:27, Felix Kuehling wrote:

On 2021-12-09 at 5:14 p.m., Chen, Xiaogang wrote:

On 12/9/2021 12:40 PM, Felix Kuehling wrote:

On 2021-12-09 at 2:49 a.m., Xiaogang.Chen wrote:

From: Xiaogang Chen 

When an application is about to finish, it destroys the queues it has
created via an ioctl. The driver deletes the queue entry
(/sys/class/kfd/kfd/proc/pid/queues/queueid/), which is a directory
holding all of this queue's attributes. Low-level kernel code deletes
all attributes under this directory. The kernel's lock is taken on the
queue entry, not on its attributes. Meanwhile, another user space
application can read the attributes. So it is possible for that
application to hold/read the attributes while the kernel is deleting
the queue entry, causing the application to make an invalid memory
access and get killed by the kernel.

Driver changes: explicitly create/destroy each attribute for each
queue, so the kernel takes a lock on each attribute too.

Is this working around a bug in kobject_del? Shouldn't that code take
care of the necessary locking itself?

Regards,
    Felix

The patches do not change kobject/kernfs; those are too low level and
would involve deeper discussions.
I made the changes at a higher level (kfd) instead.

I have tested with the MSF tool overnight.

OK. I'm OK with your changes. The patch is

Reviewed-by: Felix Kuehling 

But I think we should let the kernfs folks know that there is a problem
anyway. It might save someone else a lot of time and headaches down the
line. Ideally we'd come up with a small reproducer (dummy driver and a
user mode tool (could just be a bash script)) that doesn't require
special AMD hardware and the whole ROCm stack.


I think we could do this in the DKMS/release branches, but for upstream 
we should rather fix the underlying problem.


In addition to that, this is explicitly what we should not do if I 
understood Greg correctly in previous discussions, but take that with a 
grain of salt since I'm not an expert on the topic.


Regards,
Christian.



Regards,
   Felix



Thanks
Xiaogang


Signed-off-by: Xiaogang Chen 
---
   drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  3 +++
   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 33
+++-
   2 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 0c3f911e3bf4..045da300749e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -546,6 +546,9 @@ struct queue {
     /* procfs */
   struct kobject kobj;
+    struct attribute attr_guid;
+    struct attribute attr_size;
+    struct attribute attr_type;
   };
     enum KFD_MQD_TYPE {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 9158f9754a24..04a5638f9196 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -73,6 +73,8 @@ static void evict_process_worker(struct
work_struct *work);
   static void restore_process_worker(struct work_struct *work);
     static void kfd_process_device_destroy_cwsr_dgpu(struct
kfd_process_device *pdd);
+static void kfd_sysfs_create_file(struct kobject *kobj, struct
attribute *attr,
+    char *name);
     struct kfd_procfs_tree {
   struct kobject *kobj;
@@ -441,35 +443,12 @@ static ssize_t kfd_sysfs_counters_show(struct
kobject *kobj,
   return 0;
   }
   -static struct attribute attr_queue_size = {
-    .name = "size",
-    .mode = KFD_SYSFS_FILE_MODE
-};
-
-static struct attribute attr_queue_type = {
-    .name = "type",
-    .mode = KFD_SYSFS_FILE_MODE
-};
-
-static struct attribute attr_queue_gpuid = {
-    .name = "gpuid",
-    .mode = KFD_SYSFS_FILE_MODE
-};
-
-static struct attribute *procfs_queue_attrs[] = {
-    &attr_queue_size,
-    &attr_queue_type,
-    &attr_queue_gpuid,
-    NULL
-};
-
   static const struct sysfs_ops procfs_queue_ops = {
   .show = kfd_procfs_queue_show,
   };
     static struct kobj_type procfs_queue_type = {
   .sysfs_ops = &procfs_queue_ops,
-    .default_attrs = procfs_queue_attrs,
   };
     static const struct sysfs_ops procfs_stats_ops = {
@@ -511,6 +490,10 @@ int kfd_procfs_add_queue(struct queue *q)
   return ret;
   }
   +    kfd_sysfs_create_file(&q->kobj, &q->attr_guid, "guid");
+    kfd_sysfs_create_file(&q->kobj, &q->attr_size, "size");
+    kfd_sysfs_create_file(&q->kobj, &q->attr_type, "type");
+
   return 0;
   }
   @@ -655,6 +638,10 @@ void kfd_procfs_del_queue(struct queue *q)
   if (!q)
   return;
   +    sysfs_remove_file(&q->kobj, &q->attr_guid);
+    sysfs_remove_file(&q->kobj, &q->attr_size);
+    sysfs_remove_file(&q->kobj, &q->attr_type);
+
   kobject_del(&q->kobj);
   kobject_put(&q->kobj);
   }
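
For reference, the kfd_sysfs_create_file() helper that the patch
forward-declares would look roughly like this (a sketch from the sysfs
API, not necessarily the exact implementation in kfd_process.c):

static void kfd_sysfs_create_file(struct kobject *kobj, struct attribute *attr,
				  char *name)
{
	int ret;

	attr->name = name;
	attr->mode = KFD_SYSFS_FILE_MODE;
	sysfs_attr_init(attr);	/* init per-attribute lockdep key */

	ret = sysfs_create_file(kobj, attr);
	if (ret)
		pr_warn("Create sysfs %s/%s failed %d\n", kobj->name, name, ret);
}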




Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Gerd Hoffmann
  Hi,

> > The drivers are asic and platform specific.  E.g., the driver for
> > vangogh is different from renoir is different from skylake, etc.  The
> > display programming interfaces are asic specific.
> 
> Cool, that makes sense! But if you (or anybody here) know some of these
> GOP drivers, e.g. for the qemu/qxl device,

OvmfPkg/QemuVideoDxe in tianocore source tree.

> I'm just curious to see/understand how complex the FW driver is to
> just put the device/screen in a usable state.

Note that qemu has a paravirtual interface for vesa vga mode programming
where you basically program a handful of registers with xres, yres,
depth etc. (after resetting the device to put it into vga compatibility
mode) and you are done.
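
For a sense of scale, that interface is small enough to sketch from the
Bochs VBE ("dispi") spec; a rough user-space illustration (register
indices from that spec, ioperm() setup and error handling omitted):

#include <stdint.h>
#include <sys/io.h>			/* outw(); needs ioperm() and root */

#define VBE_DISPI_IOPORT_INDEX	0x01CE
#define VBE_DISPI_IOPORT_DATA	0x01CF

static void dispi_write(uint16_t index, uint16_t val)
{
	outw(index, VBE_DISPI_IOPORT_INDEX);
	outw(val, VBE_DISPI_IOPORT_DATA);
}

static void set_mode(uint16_t xres, uint16_t yres, uint16_t bpp)
{
	dispi_write(4, 0x00);		/* ENABLE: off while reprogramming */
	dispi_write(1, xres);		/* XRES */
	dispi_write(2, yres);		/* YRES */
	dispi_write(3, bpp);		/* BPP */
	dispi_write(4, 0x01 | 0x40);	/* ENABLED | LFB_ENABLED */
}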

Initializing physical hardware is an order of magnitude harder than
that.

With qxl you could also go figure the current state of the hardware and
fill screen_info with that to get a working boot framebuffer in the
kexec'ed kernel.
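
Roughly like this (field names from include/uapi/linux/screen_info.h;
the qxl_current_*() helpers are made up to stand for reading back the
device's scanout state):

	/* Sketch: describe the current scanout as a boot framebuffer so
	 * efifb in the kexec'ed kernel can pick it up. */
	screen_info.orig_video_isVGA = VIDEO_TYPE_EFI;
	screen_info.lfb_base	= qxl_current_fb_base(qdev);	/* phys addr */
	screen_info.lfb_size	= qxl_current_fb_size(qdev);
	screen_info.lfb_width	= qxl_current_xres(qdev);
	screen_info.lfb_height	= qxl_current_yres(qdev);
	screen_info.lfb_linelength = qxl_current_stride(qdev);
	screen_info.lfb_depth	= 32;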

Problem with this approach is this works only in case the framebuffer
happens to be in a format usable by vesafb/efifb.  So no modifiers
(tiling etc.) and contiguous in physical address space.  That is true
for qxl.  With virtio-gpu it wouldn't work though (framebuffer can be
scattered), and I expect with most modern physical hardware it wouldn't
work either.

take care,
  Gerd



RE: [PATCH V4 05/17] drm/amd/pm: do not expose those APIs used internally only in si_dpm.c

2021-12-09 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, December 9, 2021 8:09 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Feng, Kenneth 
> Subject: Re: [PATCH V4 05/17] drm/amd/pm: do not expose those APIs used
> internally only in si_dpm.c
> 
> 
> 
> On 12/3/2021 8:35 AM, Evan Quan wrote:
> > Move them to si_dpm.c instead.
> >
> > Signed-off-by: Evan Quan 
> > Change-Id: I288205cfd7c6ba09cfb22626ff70360d61ff0c67
> > --
> > v1->v2:
> >- rename the API with "si_" prefix(Alex)
> > v2->v3:
> >- rename other data structures used only in si_dpm.c(Lijo)
> > ---
> >   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   |  25 -
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  25 -
> >   drivers/gpu/drm/amd/pm/powerplay/si_dpm.c | 106
> +++---
> >   drivers/gpu/drm/amd/pm/powerplay/si_dpm.h |  15 ++-
> >   4 files changed, 83 insertions(+), 88 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > index 72a8cb70d36b..b31858ad9b83 100644
> > --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > @@ -894,31 +894,6 @@ void amdgpu_add_thermal_controller(struct
> amdgpu_device *adev)
> > }
> >   }
> >
> > -enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct
> amdgpu_device *adev,
> > -u32 sys_mask,
> > -enum amdgpu_pcie_gen
> asic_gen,
> > -enum amdgpu_pcie_gen
> default_gen)
> > -{
> > -   switch (asic_gen) {
> > -   case AMDGPU_PCIE_GEN1:
> > -   return AMDGPU_PCIE_GEN1;
> > -   case AMDGPU_PCIE_GEN2:
> > -   return AMDGPU_PCIE_GEN2;
> > -   case AMDGPU_PCIE_GEN3:
> > -   return AMDGPU_PCIE_GEN3;
> > -   default:
> > -   if ((sys_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
> &&
> > -   (default_gen == AMDGPU_PCIE_GEN3))
> > -   return AMDGPU_PCIE_GEN3;
> > -   else if ((sys_mask &
> CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2) &&
> > -(default_gen == AMDGPU_PCIE_GEN2))
> > -   return AMDGPU_PCIE_GEN2;
> > -   else
> > -   return AMDGPU_PCIE_GEN1;
> > -   }
> > -   return AMDGPU_PCIE_GEN1;
> > -}
> > -
> >   struct amd_vce_state*
> >   amdgpu_get_vce_clock_state(void *handle, u32 idx)
> >   {
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> > index 6681b878e75f..f43b96dfe9d8 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> > @@ -45,19 +45,6 @@ enum amdgpu_int_thermal_type {
> > THERMAL_TYPE_KV,
> >   };
> >
> > -enum amdgpu_dpm_auto_throttle_src {
> > -   AMDGPU_DPM_AUTO_THROTTLE_SRC_THERMAL,
> > -   AMDGPU_DPM_AUTO_THROTTLE_SRC_EXTERNAL
> > -};
> > -
> > -enum amdgpu_dpm_event_src {
> > -   AMDGPU_DPM_EVENT_SRC_ANALOG = 0,
> > -   AMDGPU_DPM_EVENT_SRC_EXTERNAL = 1,
> > -   AMDGPU_DPM_EVENT_SRC_DIGITAL = 2,
> > -   AMDGPU_DPM_EVENT_SRC_ANALOG_OR_EXTERNAL = 3,
> > -   AMDGPU_DPM_EVENT_SRC_DIGIAL_OR_EXTERNAL = 4
> > -};
> > -
> >   struct amdgpu_ps {
> > u32 caps; /* vbios flags */
> > u32 class; /* vbios flags */
> > @@ -252,13 +239,6 @@ struct amdgpu_dpm_fan {
> > bool ucode_fan_control;
> >   };
> >
> > -enum amdgpu_pcie_gen {
> > -   AMDGPU_PCIE_GEN1 = 0,
> > -   AMDGPU_PCIE_GEN2 = 1,
> > -   AMDGPU_PCIE_GEN3 = 2,
> > > -   AMDGPU_PCIE_GEN_INVALID = 0xffff
> > -};
> > -
> >   #define amdgpu_dpm_reset_power_profile_state(adev, request) \
> > ((adev)->powerplay.pp_funcs-
> >reset_power_profile_state(\
> > (adev)->powerplay.pp_handle, request)) @@ -
> 403,11 +383,6 @@ void
> > amdgpu_free_extended_power_table(struct amdgpu_device *adev);
> >
> >   void amdgpu_add_thermal_controller(struct amdgpu_device *adev);
> >
> > -enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct
> amdgpu_device *adev,
> > -u32 sys_mask,
> > -enum amdgpu_pcie_gen
> asic_gen,
> > -enum amdgpu_pcie_gen
> default_gen);
> > -
> >   struct amd_vce_state*
> >   amdgpu_get_vce_clock_state(void *handle, u32 idx);
> >
> > diff --git a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
> > b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
> > index 81f82aa05ec2..ab0fa6c79255 100644
> > --- a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
> > +++ b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
> > @@ -96,6 +96,19 @@ union pplib_clock_info {
> > struct _ATOM_PPLIB_SI_CLOCK_INFO si;
> >   };
> >
> > +enum si_dpm_auto_throttle_src {
> > +   DPM_AUTO_THROTTLE_SRC_THERMAL,
> > +   DPM_AUTO_THROTTLE_SRC_EXTERNAL
> > +};
> > +
> 
> Since the final usage is something like (1 <<
> DPM_AUTO_THROTTLE_SRC_EXTERNAL), it's better to 

Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

2021-12-09 Thread Christian König

On 10.12.21 at 04:21, Lang Yu wrote:

On 12/10/ , Quan, Evan wrote:

[AMD Official Use Only]




-Original Message-
From: Yu, Lang 
Sent: Friday, December 10, 2021 10:34 AM
To: Quan, Evan 
Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
; Lazar, Lijo ; Huang,
Ray ; Deucher, Alexander
; Koenig, Christian

Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

On 12/10/ , Quan, Evan wrote:

[AMD Official Use Only]




-Original Message-
From: amd-gfx  On Behalf Of
Lang Yu
Sent: Thursday, December 9, 2021 4:49 PM
To: amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey ; Lazar, Lijo
; Huang, Ray ; Deucher,
Alexander ; Yu, Lang

;

Koenig, Christian 
Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

SMU firmware guys expect the driver to maintain error context and
not interact with the SMU any more once an SMU error has occurred.
That will aid in debugging SMU firmware issues.

Add SMU debug option support for this request; it can be enabled or
disabled via the amdgpu_smu_debug debugfs file.
When enabled, it brings the hardware to a kind of halt state so that
no one can touch it any more in the event of SMU errors.

Currently, the driver interacts with the SMU by sending messages.
And there are three ways to send messages to the SMU.
Handle them respectively as follows:

1, smu_cmn_send_smc_msg_with_param() for normal timeout cases

   Halt on any error.

2,

smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()

for longer timeout cases

   Halt on errors apart from ETIME. Otherwise this way won't work.

3, smu_cmn_send_msg_without_waiting() for no waiting cases

   Halt on errors apart from ETIME. Otherwise second way won't work.

After halting, use BUG() to explicitly notify users.
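
To condense the three cases, the policy could be expressed as a single
helper along these lines (a sketch with a made-up name; the patch below
open-codes the check in each send path instead):

static void smu_debug_halt_on_error(struct smu_context *smu, int res,
				    bool ignore_etime)
{
	/* Only act when the debug option is on and an error occurred. */
	if (likely(!smu->smu_debug_mode) || !res)
		return;

	/* Cases 2 and 3 must let -ETIME through or the waiting paths break. */
	if (ignore_etime && res == -ETIME)
		return;

	amdgpu_device_halt(smu->adev);	/* bring the hardware to a halt state */
	BUG();				/* explicitly notify users */
}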

== Command Guide ==

1, enable SMU debug option

  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v4:
  - Set to halt state instead of a simple hang.(Christian)

v3:
  - Use debugfs_create_bool().(Christian)
  - Put variable into smu_context struct.
  - Don't resend command when timeout.

v2:
  - Resend command when timeout.(Lijo)
  - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h |  5 +
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 20
+++-
  3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct

amdgpu_device

*adev)
if (!debugfs_initialized())
return 0;

+   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+   &adev->smu.smu_debug_mode);
+
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
  &fops_ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;

struct stb_context stb_context;
+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;

[Quan, Evan] Can you expand this to a bit mask (as ppfeaturemask)? So that
in the future we can add support for other debug features.

  };

OK.
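
Evan's bit-mask suggestion would look roughly like this (fragments only,
names are illustrative):

/* a per-feature debug mask instead of a single bool */
#define SMU_DEBUG_HALT_ON_ERROR		BIT(0)
/* future debug features would claim BIT(1), BIT(2), ... */

	uint32_t smu_debug_mask;	/* replaces bool smu_debug_mode */

	debugfs_create_u32("amdgpu_smu_debug", 0600, root,
			   &adev->smu.smu_debug_mask);

	if (unlikely(smu->smu_debug_mask & SMU_DEBUG_HALT_ON_ERROR) &&
	    res && (res != -ETIME)) {
		amdgpu_device_halt(smu->adev);
		BUG();
	}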


  struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..84016d22c075 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
smu_context *smu,
__smu_cmn_send_msg(smu, msg_index, param);
res = 0;
  Out:
+   if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
+   amdgpu_device_halt(smu->adev);
+   BUG();

[Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and

Andrey can share more insights with you about that.

Do we still need the "BUG()" then?

The BUG() is used to explicitly notify users something went wrong.
Otherwise userspace may not know immediately.
The FW guys requested this in the ticket.

[Quan, Evan] Won't drm_dev_unplug() and pci_disable_device(), used in
amdgpu_device_halt(), throw some errors (on the user's further attempts to
communicate with our driver)?
Also, if the purpose is to raise the user's attention, WARN() may be a gentler way?

 From my testing and observation, it depends on what the driver will do next.
Probably trigger a page 

RE: [PATCH V4 03/17] drm/amd/pm: do not expose power implementation details to display

2021-12-09 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, December 9, 2021 8:05 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Feng, Kenneth 
> Subject: Re: [PATCH V4 03/17] drm/amd/pm: do not expose power
> implementation details to display
> 
> 
> 
> On 12/3/2021 8:35 AM, Evan Quan wrote:
> > Display is another client of our power APIs. It's not proper to spike
> > into power implementation details there.
> >
> > Signed-off-by: Evan Quan 
> > Change-Id: Ic897131e16473ed29d3d7586d822a55c64e6574a
> > ---
> >   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   6 +-
> >   .../amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c  | 246 +++-
> --
> >   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 218
> 
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  38 +++
> >   4 files changed, 344 insertions(+), 164 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 7837e0613717..2c6c840e14a1 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -2136,12 +2136,8 @@ static void s3_handle_mst(struct drm_device
> > *dev, bool suspend)
> >
> >   static int amdgpu_dm_smu_write_watermarks_table(struct
> amdgpu_device *adev)
> >   {
> > -   struct smu_context *smu = &adev->smu;
> > int ret = 0;
> >
> > -   if (!is_support_sw_smu(adev))
> > -   return 0;
> > -
> > /* This interface is for dGPU Navi1x.Linux dc-pplib interface depends
> >  * on window driver dc implementation.
> >  * For Navi1x, clock settings of dcn watermarks are fixed. the
> > settings @@ -2180,7 +2176,7 @@ static int
> amdgpu_dm_smu_write_watermarks_table(struct amdgpu_device *adev)
> > return 0;
> > }
> >
> > -   ret = smu_write_watermarks_table(smu);
> > +   ret = amdgpu_dpm_write_watermarks_table(adev);
> > if (ret) {
> > DRM_ERROR("Failed to update WMTABLE!\n");
> > return ret;
> > diff --git
> a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
> > index eba270121698..46550811da00 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
> > +++
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
> > @@ -99,10 +99,7 @@ bool dm_pp_apply_display_requirements(
> > adev->pm.pm_display_cfg.displays[i].controller_id =
> dc_cfg->pipe_idx + 1;
> > }
> >
> > -   if (adev->powerplay.pp_funcs && adev-
> >powerplay.pp_funcs->display_configuration_change)
> > -   adev->powerplay.pp_funcs-
> >display_configuration_change(
> > -   adev->powerplay.pp_handle,
> > > -   &adev->pm.pm_display_cfg);
> > > +   amdgpu_dpm_display_configuration_change(adev,
> > > +   &adev->pm.pm_display_cfg);
> >
> > amdgpu_pm_compute_clocks(adev);
> > }
> > @@ -298,31 +295,25 @@ bool dm_pp_get_clock_levels_by_type(
> > struct dm_pp_clock_levels *dc_clks)
> >   {
> > struct amdgpu_device *adev = ctx->driver_context;
> > -   void *pp_handle = adev->powerplay.pp_handle;
> > struct amd_pp_clocks pp_clks = { 0 };
> > struct amd_pp_simple_clock_info validation_clks = { 0 };
> > uint32_t i;
> >
> > -   if (adev->powerplay.pp_funcs && adev->powerplay.pp_funcs-
> >get_clock_by_type) {
> > -   if (adev->powerplay.pp_funcs-
> >get_clock_by_type(pp_handle,
> > > -   dc_to_pp_clock_type(clk_type), &pp_clks)) {
> > -   /* Error in pplib. Provide default values. */
> > -   get_default_clock_levels(clk_type, dc_clks);
> > -   return true;
> > -   }
> > +   if (amdgpu_dpm_get_clock_by_type(adev,
> > > +   dc_to_pp_clock_type(clk_type), &pp_clks)) {
> > +   /* Error in pplib. Provide default values. */
> > +   get_default_clock_levels(clk_type, dc_clks);
> > +   return true;
> > }
> >
> > > pp_to_dc_clock_levels(&pp_clks, dc_clks, clk_type);
> >
> > -   if (adev->powerplay.pp_funcs && adev->powerplay.pp_funcs-
> >get_display_mode_validation_clocks) {
> > -   if (adev->powerplay.pp_funcs-
> >get_display_mode_validation_clocks(
> > > -   pp_handle, &validation_clks))
> {
> > -   /* Error in pplib. Provide default values. */
> > -   DRM_INFO("DM_PPLIB: Warning: using default
> validation clocks!\n");
> > -   validation_clks.engine_max_clock = 72000;
> > > -   validation_clks.memory_max_clock = 80000;
> > -   validation_clks.level = 0;
> > -   }
> > +   if (amdgpu_dpm_get_display_mode_validation_clks(adev,
> > &validation_clks)) {
> > +   /* Error in pplib. Provide default values. */
> > +   DRM_INFO("DM_PPLIB: 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Christian König

On 09.12.21 at 19:28, Felix Kuehling wrote:

On 2021-12-09 at 10:30 a.m., Christian König wrote:

That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.


I think it's fine for the whole amdgpu stack, my concern is more about 
radeon, nouveau and the ARM stacks which are using this as well.


That blew up so nicely the last time we tried to change it and I know of 
at least one case where radeon was/is used with BOs in a child process.


Regards,
Christian.



Regards,
   Felix



Regards,
Christian.

On 09.12.21 at 16:29, Bhardwaj, Rajneesh wrote:

Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
you!

On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly
a good idea.


The child cannot access the BOs mapped by the parent anyway with
access restrictions applied

exactly that is not correct. That behavior is actively used by some
userspace stacks as far as I know.

Regards,
Christian.

On 09.12.21 at 16:23, Bhardwaj, Rajneesh wrote:

Thanks Christian. Would it make it less intrusive if I just use the
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
this patch? For our use case, just the ttm_bo_mmap_obj change
should suffice and we don't want to put any more work arounds in
the user space (thunk, in our case).

The child cannot access the BOs mapped by the parent anyway with
access restrictions applied so I wonder why even inherit the vma?

On 12/9/2021 2:54 AM, Christian König wrote:

On 08.12.21 at 21:53, Rajneesh Bhardwaj wrote:

When an application that has open file access to a node forks, its
shared mappings also get reflected in the address space of the child
process, even though the child cannot access them with the object
permissions applied. Given the existing permission checks on the gem
objects, it might be reasonable to also create the VMAs with the
VM_DONTCOPY flag, so a user space application doesn't need to
explicitly call the madvise(addr, len, MADV_DONTFORK) system call to
prevent the pages in the mapped range from appearing in the address
space of the child process. It also prevents the memory leaks caused
by the additional reference counts on the mapped BOs in the child
process, which kept the parent from freeing the memory; we had worked
around that earlier in user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint/restore
an application that had such inherited mappings in the child, which
confuse CRIU when it mmaps them on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this,
so this is needed only for the render nodes.

Unfortunately that is most likely a NAK. We already tried
something similar.

While it is illegal by the OpenGL specification and doesn't work
for most userspace stacks, we do have some implementations which
call fork() with a GL context open and expect it to work.

Regards,
Christian.


Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
   drivers/gpu/drm/drm_gem.c   | 3 ++-
   drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
   2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object
*obj, unsigned long obj_size,
   goto err_drm_gem_object_put;
   }
   -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND |
VM_DONTDUMP;
+    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+    | VM_DONTDUMP | VM_DONTCOPY;
   vma->vm_page_prot =
pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
   vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
   }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct
*vma, struct ttm_buffer_object *bo)
     vma->vm_private_data = bo;
   -    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
   vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
   return 0;
   }
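
For context, the user space workaround mentioned in the commit message
amounts to this (sketch; bo_size, bo_mmap_offset and render_fd are
placeholders):

#include <sys/mman.h>

/* Map the BO through the render node, then opt the VMA out of fork(). */
void *bo = mmap(NULL, bo_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		render_fd, bo_mmap_offset);
if (bo != MAP_FAILED)
	madvise(bo, bo_size, MADV_DONTFORK);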




Re: [PATCH V4 02/17] drm/amd/pm: do not expose power implementation details to amdgpu_pm.c

2021-12-09 Thread Lazar, Lijo




On 12/10/2021 10:50 AM, Quan, Evan wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Thursday, December 9, 2021 7:58 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian
; Feng, Kenneth 
Subject: Re: [PATCH V4 02/17] drm/amd/pm: do not expose power
implementation details to amdgpu_pm.c



On 12/3/2021 8:35 AM, Evan Quan wrote:

amdgpu_pm.c holds all the user sysfs/hwmon interfaces. It's another
client of our power APIs. It's not proper to spike into power
implementation details there.

Signed-off-by: Evan Quan 
Change-Id: I397853ddb13eacfce841366de2a623535422df9a
--
v1->v2:
- drop unneeded "return;" in

amdgpu_dpm_get_current_power_state(Guchun)

---
   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 456

++-

   drivers/gpu/drm/amd/pm/amdgpu_pm.c| 519 --
   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 160 +++
   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |   3 -
   4 files changed, 707 insertions(+), 431 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c

b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c

index 54abdf7080de..2c789eb5d066 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -1453,7 +1453,9 @@ static void

amdgpu_dpm_change_power_state_locked(struct amdgpu_device *adev)

if (equal)
return;

-   amdgpu_dpm_set_power_state(adev);
+   if (adev->powerplay.pp_funcs->set_power_state)
+   adev->powerplay.pp_funcs->set_power_state(adev->powerplay.pp_handle);
+
amdgpu_dpm_post_set_power_state(adev);

adev->pm.dpm.current_active_crtcs = adev->pm.dpm.new_active_crtcs;
@@ -1704,3 +1706,455 @@ int amdgpu_dpm_get_ecc_info(struct

amdgpu_device *adev,


return smu_get_ecc_info(&adev->smu, umc_ecc);
   }
+
+struct amd_vce_state *amdgpu_dpm_get_vce_clock_state(struct

amdgpu_device *adev,

+uint32_t idx)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->get_vce_clock_state)
+   return NULL;
+
+   return pp_funcs->get_vce_clock_state(adev->powerplay.pp_handle,
+idx);
+}
+
+void amdgpu_dpm_get_current_power_state(struct amdgpu_device

*adev,

+   enum amd_pm_state_type *state)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->get_current_power_state) {
+   *state = adev->pm.dpm.user_state;
+   return;
+   }
+
+   *state = pp_funcs->get_current_power_state(adev->powerplay.pp_handle);
+   if (*state < POWER_STATE_TYPE_DEFAULT ||
+   *state > POWER_STATE_TYPE_INTERNAL_3DPERF)
+   *state = adev->pm.dpm.user_state;
+}
+
+void amdgpu_dpm_set_power_state(struct amdgpu_device *adev,
+   enum amd_pm_state_type state)
+{
+   adev->pm.dpm.user_state = state;
+
+   if (adev->powerplay.pp_funcs->dispatch_tasks)
+   amdgpu_dpm_dispatch_task(adev,

AMD_PP_TASK_ENABLE_USER_STATE, &state);

+   else
+   amdgpu_pm_compute_clocks(adev);
+}
+
+enum amd_dpm_forced_level

amdgpu_dpm_get_performance_level(struct amdgpu_device *adev)

+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   enum amd_dpm_forced_level level;
+
+   if (pp_funcs->get_performance_level)
+   level = pp_funcs->get_performance_level(adev->powerplay.pp_handle);
+   else
+   level = adev->pm.dpm.forced_level;
+
+   return level;
+}
+
+int amdgpu_dpm_force_performance_level(struct amdgpu_device

*adev,

+  enum amd_dpm_forced_level level)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (pp_funcs->force_performance_level) {
+   if (adev->pm.dpm.thermal_active)
+   return -EINVAL;
+
+   if (pp_funcs->force_performance_level(adev->powerplay.pp_handle,
+ level))
+   return -EINVAL;
+   }
+
+   adev->pm.dpm.forced_level = level;
+


If the function is not implemented, why change the force level and
return success?

[Quan, Evan] Thanks! Will update that.
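
The fix Lijo is asking for would look something like this (a sketch; the
exact error code is a judgment call):

int amdgpu_dpm_force_performance_level(struct amdgpu_device *adev,
				       enum amd_dpm_forced_level level)
{
	const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;

	/* Don't silently succeed or cache a level without a backend hook. */
	if (!pp_funcs->force_performance_level)
		return -EOPNOTSUPP;

	if (adev->pm.dpm.thermal_active)
		return -EINVAL;

	if (pp_funcs->force_performance_level(adev->powerplay.pp_handle,
					      level))
		return -EINVAL;

	adev->pm.dpm.forced_level = level;

	return 0;
}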



+   return 0;
+}
+
+int amdgpu_dpm_get_pp_num_states(struct amdgpu_device *adev,
+struct pp_states_info *states)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->get_pp_num_states)
+   return -EOPNOTSUPP;
+
+   return pp_funcs->get_pp_num_states(adev->powerplay.pp_handle,

states);

+}
+
+int amdgpu_dpm_dispatch_task(struct amdgpu_device *adev,
+ enum amd_pp_task task_id,
+ enum amd_pm_state_type *user_state)
+{
+ 

RE: [PATCH V4 02/17] drm/amd/pm: do not expose power implementation details to amdgpu_pm.c

2021-12-09 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, December 9, 2021 7:58 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Feng, Kenneth 
> Subject: Re: [PATCH V4 02/17] drm/amd/pm: do not expose power
> implementation details to amdgpu_pm.c
> 
> 
> 
> On 12/3/2021 8:35 AM, Evan Quan wrote:
> > amdgpu_pm.c holds all the user sysfs/hwmon interfaces. It's another
> > client of our power APIs. It's not proper to spike into power
> > implementation details there.
> >
> > Signed-off-by: Evan Quan 
> > Change-Id: I397853ddb13eacfce841366de2a623535422df9a
> > --
> > v1->v2:
> >- drop unneeded "return;" in
> amdgpu_dpm_get_current_power_state(Guchun)
> > ---
> >   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 456
> ++-
> >   drivers/gpu/drm/amd/pm/amdgpu_pm.c| 519 --
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 160 +++
> >   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |   3 -
> >   4 files changed, 707 insertions(+), 431 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > index 54abdf7080de..2c789eb5d066 100644
> > --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > @@ -1453,7 +1453,9 @@ static void
> amdgpu_dpm_change_power_state_locked(struct amdgpu_device *adev)
> > if (equal)
> > return;
> >
> > -   amdgpu_dpm_set_power_state(adev);
> > +   if (adev->powerplay.pp_funcs->set_power_state)
> > +   adev->powerplay.pp_funcs->set_power_state(adev-
> >powerplay.pp_handle);
> > +
> > amdgpu_dpm_post_set_power_state(adev);
> >
> > adev->pm.dpm.current_active_crtcs = adev-
> >pm.dpm.new_active_crtcs;
> > @@ -1704,3 +1706,455 @@ int amdgpu_dpm_get_ecc_info(struct
> amdgpu_device *adev,
> >
> > return smu_get_ecc_info(&adev->smu, umc_ecc);
> >   }
> > +
> > +struct amd_vce_state *amdgpu_dpm_get_vce_clock_state(struct
> amdgpu_device *adev,
> > +uint32_t idx)
> > +{
> > +   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
> > +
> > +   if (!pp_funcs->get_vce_clock_state)
> > +   return NULL;
> > +
> > +   return pp_funcs->get_vce_clock_state(adev->powerplay.pp_handle,
> > +idx);
> > +}
> > +
> > +void amdgpu_dpm_get_current_power_state(struct amdgpu_device
> *adev,
> > +   enum amd_pm_state_type *state)
> > +{
> > +   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
> > +
> > +   if (!pp_funcs->get_current_power_state) {
> > +   *state = adev->pm.dpm.user_state;
> > +   return;
> > +   }
> > +
> > +   *state = pp_funcs->get_current_power_state(adev-
> >powerplay.pp_handle);
> > +   if (*state < POWER_STATE_TYPE_DEFAULT ||
> > +   *state > POWER_STATE_TYPE_INTERNAL_3DPERF)
> > +   *state = adev->pm.dpm.user_state;
> > +}
> > +
> > +void amdgpu_dpm_set_power_state(struct amdgpu_device *adev,
> > +   enum amd_pm_state_type state)
> > +{
> > +   adev->pm.dpm.user_state = state;
> > +
> > +   if (adev->powerplay.pp_funcs->dispatch_tasks)
> > +   amdgpu_dpm_dispatch_task(adev,
> AMD_PP_TASK_ENABLE_USER_STATE, &state);
> > +   else
> > +   amdgpu_pm_compute_clocks(adev);
> > +}
> > +
> > +enum amd_dpm_forced_level
> amdgpu_dpm_get_performance_level(struct amdgpu_device *adev)
> > +{
> > +   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
> > +   enum amd_dpm_forced_level level;
> > +
> > +   if (pp_funcs->get_performance_level)
> > +   level = pp_funcs->get_performance_level(adev-
> >powerplay.pp_handle);
> > +   else
> > +   level = adev->pm.dpm.forced_level;
> > +
> > +   return level;
> > +}
> > +
> > +int amdgpu_dpm_force_performance_level(struct amdgpu_device
> *adev,
> > +  enum amd_dpm_forced_level level)
> > +{
> > +   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
> > +
> > +   if (pp_funcs->force_performance_level) {
> > +   if (adev->pm.dpm.thermal_active)
> > +   return -EINVAL;
> > +
> > +   if (pp_funcs->force_performance_level(adev-
> >powerplay.pp_handle,
> > + level))
> > +   return -EINVAL;
> > +   }
> > +
> > +   adev->pm.dpm.forced_level = level;
> > +
> 
> If the function is not implemented, why change the force level and
> return success?
[Quan, Evan] Thanks! Will update that.
> 
> > +   return 0;
> > +}
> > +
> > +int amdgpu_dpm_get_pp_num_states(struct amdgpu_device *adev,
> > +struct pp_states_info *states)
> > +{
> > +   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
> > +
> > +   if (!pp_funcs->get_pp_num_states)
> > +   return -EOPNOTSUPP;
> > +
> > +   return 

Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Guilherme G. Piccoli
Thanks again Alex! Some comments inlined below:

On 09/12/2021 15:06, Alex Deucher wrote:
> Not really in a generic way.  It's asic and platform specific.  In
> addition most modern displays require link training to bring up the
> display, so you can't just save and restore registers.

Oh sure, I understand that. My question is more like: is there a way,
inside amdgpu driver, to save this state before taking
over/overwriting/reprogramming the device? So we could (again, from
inside the amdgpu driver) dump this pre-saved state in the shutdown
handler, for example, having the device in a "pre-OS" state when the new
kexec'ed kernel starts.

> 
> The drivers are asic and platform specific.  E.g., the driver for
> vangogh is different from renoir is different from skylake, etc.  The
> display programming interfaces are asic specific.

Cool, that makes sense! But if you (or anybody here) know some of these
GOP drivers, e.g. for the qemu/qxl device, I'm just curious to
see/understand how complex is the FW driver to just put the
device/screen in a usable state.

Cheers,


Guilherme


Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Alistair Popple
On Friday, 10 December 2021 3:54:31 AM AEDT Sierra Guiza, Alejandro (Alex) 
wrote:
> 
> On 12/9/2021 10:29 AM, Felix Kuehling wrote:
> >> On 2021-12-09 at 5:53 a.m., Alistair Popple wrote:
> >> On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro 
> >> (Alex) wrote:
> >>> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
>  On 2021-12-08 at 11:58 a.m., Felix Kuehling wrote:
> > On 2021-12-08 at 6:31 a.m., Alistair Popple wrote:
> >> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> >>> Avoid long term pinning for Coherent device type pages. This could
> >>> interfere with their own device memory manager.
> >>> If caller tries to get user device coherent pages with PIN_LONGTERM 
> >>> flag
> >>> set, those pages will be migrated back to system memory.
> >>>
> >>> Signed-off-by: Alex Sierra
> >>> ---
> >>>mm/gup.c | 32 ++--
> >>>1 file changed, 30 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 886d6148d3d0..1572eacf07f4 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
> >>>#endif /* CONFIG_ELF_CORE */
> >>>
> >>>#ifdef CONFIG_MIGRATION
> >>> +static int migrate_device_page(unsigned long address,
> >>> + struct page *page)
> >>> +{
> >>> + struct vm_area_struct *vma = find_vma(current->mm, address);
> >>> + struct vm_fault vmf = {
> >>> + .vma = vma,
> >>> + .address = address & PAGE_MASK,
> >>> + .flags = FAULT_FLAG_USER,
> >>> + .pgoff = linear_page_index(vma, address),
> >>> + .gfp_mask = GFP_KERNEL,
> >>> + .page = page,
> >>> + };
> >>> + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> >>> + return page->pgmap->ops->migrate_to_ram(&vmf);
> >> How does this synchronise against pgmap being released? As I 
> >> understand things
> >> at this point we're not holding a reference on either the page or 
> >> pgmap, so
> >> the page and therefore the pgmap may have been freed.
> >>
> >> I think a similar problem exists for device private fault handling as 
> >> well and
> >> it has been on my list of things to fix for a while. I think the 
> >> solution is to
> >> call try_get_page(), except it doesn't work with device pages due to 
> >> the whole
> >> refcount thing. That issue is blocking a fair bit of work now so I've 
> >> started
> >> looking into it.
> > At least the page should have been pinned by the __get_user_pages_locked
> > call in __gup_longterm_locked. That refcount is dropped in
> > check_and_migrate_movable_pages when it returns 0 or an error.
>  Never mind. We unpin the pages first. Alex, would the migration work if
>  we unpinned them afterwards? Also, the normal CPU page fault code path
>  seems to make sure the page is locked (check in pfn_swap_entry_to_page)
>  before calling migrate_to_ram.
> >> I don't think that's true. The check in pfn_swap_entry_to_page() is only 
> >> for
> >> migration entries:
> >>
> >>BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> >>
> >> As this is coherent memory though why do we have to call into a device 
> >> driver
> >> to do the migration? Couldn't this all be done in the kernel?
> > I think you're right. I hadn't thought of that mainly because I'm even
> > less familiar with the non-device migration code. Alex, can you give
> > that a try? As long as the driver still gets a page-free callback when
> > the device page is freed, it should work.

Yes, you should still get the page-free callback when the migration code drops
the last page reference.

> ACK.Will do

There is currently not really any support for migrating device pages based on
pfn. What I think is needed is something like migrate_pages(), but that API
won't work for a couple of reasons - main one being that it relies on pages
being LRU pages.

I've been working on a series to implement an equivalent of migrate_pages() for
device-private (and by extension device-coherent) pages. It might also be useful
here so I will try and get it posted as an RFC next week.

 - Alistair

> Alex Sierra
> 
> > Regards,
> >Felix
> >
> >
> >>> No, you cannot unpin after migration, due to the expected_count vs.
> >>> page_count check in migrate_page_move_mapping() during the
> >>> migrate_page() call.
> >>>
> >>> Regards,
> >>> Alex Sierra
> >>>
>  Regards,
>  Felix
> 
> 
> >>






Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

2021-12-09 Thread Lang Yu
On 12/10/ , Quan, Evan wrote:
> [AMD Official Use Only]
> 
> 
> 
> > -Original Message-
> > From: Yu, Lang 
> > Sent: Friday, December 10, 2021 10:34 AM
> > To: Quan, Evan 
> > Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
> > ; Lazar, Lijo ; Huang,
> > Ray ; Deucher, Alexander
> > ; Koenig, Christian
> > 
> > Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > 
> > On 12/10/ , Quan, Evan wrote:
> > > [AMD Official Use Only]
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: amd-gfx  On Behalf Of
> > > > Lang Yu
> > > > Sent: Thursday, December 9, 2021 4:49 PM
> > > > To: amd-gfx@lists.freedesktop.org
> > > > Cc: Grodzovsky, Andrey ; Lazar, Lijo
> > > > ; Huang, Ray ; Deucher,
> > > > Alexander ; Yu, Lang
> > ;
> > > > Koenig, Christian 
> > > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > >
> > > > SMU firmware guys expect the driver to maintain error context and
> > > > not interact with the SMU any more once an SMU error has occurred.
> > > > That will aid in debugging SMU firmware issues.
> > > >
> > > > Add SMU debug option support for this request; it can be enabled or
> > > > disabled via the amdgpu_smu_debug debugfs file.
> > > > When enabled, it brings the hardware to a kind of halt state so that
> > > > no one can touch it any more in the event of SMU errors.
> > > >
> > > > Currently, the driver interacts with the SMU by sending messages.
> > > > And there are three ways to send messages to the SMU.
> > > > Handle them respectively as follows:
> > > >
> > > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > > >
> > > >   Halt on any error.
> > > >
> > > > 2,
> > smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > > > for longer timeout cases
> > > >
> > > >   Halt on errors apart from ETIME. Otherwise this way won't work.
> > > >
> > > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > > >
> > > >   Halt on errors apart from ETIME. Otherwise second way won't work.
> > > >
> > > > After halting, use BUG() to explicitly notify users.
> > > >
> > > > == Command Guide ==
> > > >
> > > > 1, enable SMU debug option
> > > >
> > > >  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > >
> > > > 2, disable SMU debug option
> > > >
> > > >  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > >
> > > > v4:
> > > >  - Set to halt state instead of a simple hang.(Christian)
> > > >
> > > > v3:
> > > >  - Use debugfs_create_bool().(Christian)
> > > >  - Put variable into smu_context struct.
> > > >  - Don't resend command when timeout.
> > > >
> > > > v2:
> > > >  - Resend command when timeout.(Lijo)
> > > >  - Use debugfs file instead of module parameter.
> > > >
> > > > Signed-off-by: Lang Yu 
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> > > >  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h |  5 +
> > > >  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 20
> > > > +++-
> > > >  3 files changed, 27 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > index 164d6a9e9fbb..86cd888c7822 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> > amdgpu_device
> > > > *adev)
> > > > if (!debugfs_initialized())
> > > > return 0;
> > > >
> > > > +   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > > > +   &adev->smu.smu_debug_mode);
> > > > +
> > > > ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> > > >   &fops_ib_preempt);
> > > > if (IS_ERR(ent)) {
> > > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > index f738f7dc20c9..50dbf5594a9d 100644
> > > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > @@ -569,6 +569,11 @@ struct smu_context
> > > > struct smu_user_dpm_profile user_dpm_profile;
> > > >
> > > > struct stb_context stb_context;
> > > > +   /*
> > > > +* When enabled, it makes SMU errors fatal.
> > > > +* (0 = disabled (default), 1 = enabled)
> > > > +*/
> > > > +   bool smu_debug_mode;
> > > [Quan, Evan] Can you expand this to a bit mask (as ppfeaturemask)? So that
> > in the future we can add support for other debug features.
> > > >  };
> > 
> > OK.
> > 
> > > >
> > > >  struct i2c_adapter;
> > > > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > index 048ca1673863..84016d22c075 100644
> > > > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > > 

[PATCH] drm/amdgpu: disable default navi2x co-op kernel support

2021-12-09 Thread Jonathan Kim
This patch reverts the following:
'commit fc547b2b1816 ("drm/amdkfd: add Navi2x to GWS init conditions")'

Disable GWS usage in default settings for now due to FW bugs.

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 67dd94b0b9a7..facc28f58c1f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -368,10 +368,7 @@ static int kfd_gws_init(struct kfd_dev *kfd)
(KFD_GC_VERSION(kfd) == IP_VERSION(9, 4, 1)
&& kfd->mec2_fw_version >= 0x30)   ||
(KFD_GC_VERSION(kfd) == IP_VERSION(9, 4, 2)
-   && kfd->mec2_fw_version >= 0x28)   ||
-   (KFD_GC_VERSION(kfd) >= IP_VERSION(10, 3, 0)
-   && KFD_GC_VERSION(kfd) <= IP_VERSION(10, 3, 5)
-   && kfd->mec2_fw_version >= 0x42
+   && kfd->mec2_fw_version >= 0x28
ret = amdgpu_amdkfd_alloc_gws(kfd->adev,
kfd->adev->gds.gws_size, &kfd->gws);
 
-- 
2.25.1



RE: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

2021-12-09 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Yu, Lang 
> Sent: Friday, December 10, 2021 10:34 AM
> To: Quan, Evan 
> Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
> ; Lazar, Lijo ; Huang,
> Ray ; Deucher, Alexander
> ; Koenig, Christian
> 
> Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> 
> On 12/10/ , Quan, Evan wrote:
> > [AMD Official Use Only]
> >
> >
> >
> > > -Original Message-
> > > From: amd-gfx  On Behalf Of
> > > Lang Yu
> > > Sent: Thursday, December 9, 2021 4:49 PM
> > > To: amd-gfx@lists.freedesktop.org
> > > Cc: Grodzovsky, Andrey ; Lazar, Lijo
> > > ; Huang, Ray ; Deucher,
> > > Alexander ; Yu, Lang
> ;
> > > Koenig, Christian 
> > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > >
> > > SMU firmware guys expect the driver to maintain error context and
> > > not interact with the SMU any more once an SMU error has occurred.
> > > That will aid in debugging SMU firmware issues.
> > >
> > > Add SMU debug option support for this request; it can be enabled or
> > > disabled via the amdgpu_smu_debug debugfs file.
> > > When enabled, it brings the hardware to a kind of halt state so that
> > > no one can touch it any more in the event of SMU errors.
> > >
> > > Currently, the driver interacts with the SMU by sending messages.
> > > And there are three ways to send messages to the SMU.
> > > Handle them respectively as follows:
> > >
> > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > >
> > >   Halt on any error.
> > >
> > > 2,
> smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > > for longer timeout cases
> > >
> > >   Halt on errors apart from ETIME. Otherwise this way won't work.
> > >
> > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > >
> > >   Halt on errors apart from ETIME. Otherwise second way won't work.
> > >
> > > After halting, use BUG() to explicitly notify users.
> > >
> > > == Command Guide ==
> > >
> > > 1, enable SMU debug option
> > >
> > >  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > >
> > > 2, disable SMU debug option
> > >
> > >  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > >
> > > v4:
> > >  - Set to halt state instead of a simple hang.(Christian)
> > >
> > > v3:
> > >  - Use debugfs_create_bool().(Christian)
> > >  - Put variable into smu_context struct.
> > >  - Don't resend command when timeout.
> > >
> > > v2:
> > >  - Resend command when timeout.(Lijo)
> > >  - Use debugfs file instead of module parameter.
> > >
> > > Signed-off-by: Lang Yu 
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> > >  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h |  5 +
> > >  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 20
> > > +++-
> > >  3 files changed, 27 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > index 164d6a9e9fbb..86cd888c7822 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> amdgpu_device
> > > *adev)
> > >   if (!debugfs_initialized())
> > >   return 0;
> > >
> > > + debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > > +   &adev->smu.smu_debug_mode);
> > > +
> > >   ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> > > &fops_ib_preempt);
> > >   if (IS_ERR(ent)) {
> > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > index f738f7dc20c9..50dbf5594a9d 100644
> > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > @@ -569,6 +569,11 @@ struct smu_context
> > >   struct smu_user_dpm_profile user_dpm_profile;
> > >
> > >   struct stb_context stb_context;
> > > + /*
> > > +  * When enabled, it makes SMU errors fatal.
> > > +  * (0 = disabled (default), 1 = enabled)
> > > +  */
> > > + bool smu_debug_mode;
> > [Quan, Evan] Can you expand this to a bit mask (as ppfeaturemask)? So that
> in the future we can add support for other debug features.
> > >  };
> 
> OK.
> 
> > >
> > >  struct i2c_adapter;
> > > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > index 048ca1673863..84016d22c075 100644
> > > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > > smu_context *smu,
> > >   __smu_cmn_send_msg(smu, msg_index, param);
> > >   res = 0;
> > >  Out:
> > > + if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > + amdgpu_device_halt(smu->adev);
> > > + BUG();
> > [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and
> Andrey can share more insights with you about that.
> > Do we still need 

RE: [PATCH v2] drm/amdgpu: fix incorrect VCN revision in SRIOV

2021-12-09 Thread Chen, Guchun
[Public]

Re: We can probably just drop the conditional here and just clear the high bits 
for everything.

It's addressed in v3 by Leslie.

Regards,
Guchun

-Original Message-
From: Alex Deucher  
Sent: Friday, December 10, 2021 12:02 AM
To: Shi, Leslie 
Cc: Lazar, Lijo ; amd-gfx list 
; Chen, Guchun 
Subject: Re: [PATCH v2] drm/amdgpu: fix incorrect VCN revision in SRIOV

On Thu, Dec 9, 2021 at 12:18 AM Leslie Shi  wrote:
>
> Guest OS will set up VCN instance 1, which is disabled, as an enabled
> instance and execute initialization work on it, but this causes a VCN ib
> ring test failure on the disabled VCN instance during modprobe:
>
> amdgpu :00:08.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 5 on hub 
> 1 amdgpu :00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test 
> failed on vcn_dec_0 (-110).
> amdgpu :00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test 
> failed on vcn_enc_0.0 (-110).
> [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test 
> failed (-110).
>
> v2: drop amdgpu_discovery_get_vcn_version and rename sriov_config to 
> vcn_config
>
> Fixes: 36b7d5646476 ("drm/amdgpu: handle SRIOV VCN revision parsing")
> Signed-off-by: Leslie Shi 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 13 +++--  
> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   | 15 ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  2 +-
>  4 files changed, 8 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 552031950518..53ff1bbe8bd6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -380,6 +380,9 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
> *adev)
>   ip->revision);
>
> if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
> +   adev->vcn.vcn_config[adev->vcn.num_vcn_inst] =
> +   ip->revision & 0xc0;
> +
> if (amdgpu_sriov_vf(adev)) {

We can probably just drop the conditional here and just clear the high bits for 
everything.

Alex

> /* SR-IOV modifies each VCN's revision (uint8)
>  * Bit [5:0]: original revision value
> @@ -388,8 +391,6 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device *adev)
>  * 0b10 : encode is disabled
>  * 0b01 : decode is disabled
>  */
> -   adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
> -   (ip->revision & 0xc0) >> 6;
> ip->revision &= ~0xc0;
> }
> adev->vcn.num_vcn_inst++;
> @@ -485,14 +486,6 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int n
> return -EINVAL;
>  }
>
> -
> -int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int 
> vcn_instance,
> -int *major, int *minor, int *revision)
> -{
> -   return amdgpu_discovery_get_ip_version(adev, VCN_HWID,
> -  vcn_instance, major, minor, 
> revision);
> -}
> -
>  void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)  {
> struct binary_header *bhdr;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> index 0ea029e3b850..14537cec19db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> @@ -33,8 +33,6 @@ void amdgpu_discovery_harvest_ip(struct 
> amdgpu_device *adev);  int amdgpu_discovery_get_ip_version(struct 
> amdgpu_device *adev, int hw_id, int number_instance,
>  int *major, int *minor, int 
> *revision);
>
> -int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int 
> vcn_instance,
> -int *major, int *minor, int *revision);
>  int amdgpu_discovery_get_gfx_info(struct amdgpu_device *adev);  int 
> amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> index 2658414c503d..38036cbf6203 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> @@ -284,20 +284,13 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device 
> *adev)  bool amdgpu_vcn_is_disabled_vcn(struct amdgpu_device *adev, 
> enum vcn_ring_type type, uint32_t vcn_instance)  {
> bool ret = false;
> +  
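Given the bit layout quoted above, later consumers of vcn_config only need
two masks to tell which ring type survived on an instance. A minimal sketch
of that idea (the mask and helper names here are illustrative, not the
final upstream identifiers):

/* Bits [7:6] of the VCN revision byte, as cached in adev->vcn.vcn_config */
#define VCN_ENCODE_DISABLED_MASK	0x80	/* 0b10 << 6 */
#define VCN_DECODE_DISABLED_MASK	0x40	/* 0b01 << 6 */

static bool vcn_inst_ring_enabled(struct amdgpu_device *adev,
				  uint32_t vcn_instance, bool encode)
{
	uint8_t cfg = adev->vcn.vcn_config[vcn_instance];

	return encode ? !(cfg & VCN_ENCODE_DISABLED_MASK) :
			!(cfg & VCN_DECODE_DISABLED_MASK);
}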

Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

2021-12-09 Thread Lang Yu
On 12/10/ , Quan, Evan wrote:
> [AMD Official Use Only]
> 
> 
> 
> > -Original Message-
> > From: amd-gfx  On Behalf Of Lang
> > Yu
> > Sent: Thursday, December 9, 2021 4:49 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Grodzovsky, Andrey ; Lazar, Lijo
> > ; Huang, Ray ; Deucher,
> > Alexander ; Yu, Lang ;
> > Koenig, Christian 
> > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > 
> > SMU firmware guys expect the driver to maintain error context
> > and not interact with the SMU any more once an SMU error has
> > occurred. That will aid in debugging SMU firmware issues.
> > 
> > Add SMU debug option support for this request; it can be
> > enabled or disabled via the amdgpu_smu_debug debugfs file.
> > When enabled, it brings the hardware to a kind of halt state
> > so that no one can touch it any more in the event of SMU
> > errors.
> > 
> > Currently, the driver interacts with the SMU by sending messages.
> > And there are three ways to send messages to the SMU.
> > Handle them respectively as follows:
> > 
> > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > 
> >   Halt on any error.
> > 
> > 2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > for longer timeout cases
> > 
> >   Halt on errors apart from ETIME. Otherwise this way won't work.
> > 
> > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > 
> >   Halt on errors apart from ETIME. Otherwise second way won't work.
> > 
> > After halting, use BUG() to explicitly notify users.
> > 
> > == Command Guide ==
> > 
> > 1, enable SMU debug option
> > 
> >  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > 
> > 2, disable SMU debug option
> > 
> >  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > 
> > v4:
> >  - Set to halt state instead of a simple hang.(Christian)
> > 
> > v3:
> >  - Use debugfs_create_bool().(Christian)
> >  - Put variable into smu_context struct.
> >  - Don't resend command when timeout.
> > 
> > v2:
> >  - Resend command when timeout.(Lijo)
> >  - Use debugfs file instead of module parameter.
> > 
> > Signed-off-by: Lang Yu 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> >  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h |  5 +
> >  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 20
> > +++-
> >  3 files changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > index 164d6a9e9fbb..86cd888c7822 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
> > *adev)
> > if (!debugfs_initialized())
> > return 0;
> > 
> > +   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > + &adev->smu.smu_debug_mode);
> > +
> > ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> >   &fops_ib_preempt);
> > if (IS_ERR(ent)) {
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > index f738f7dc20c9..50dbf5594a9d 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > @@ -569,6 +569,11 @@ struct smu_context
> > struct smu_user_dpm_profile user_dpm_profile;
> > 
> > struct stb_context stb_context;
> > +   /*
> > +* When enabled, it makes SMU errors fatal.
> > +* (0 = disabled (default), 1 = enabled)
> > +*/
> > +   bool smu_debug_mode;
> [Quan, Evan] Can you expand this to a bit mask (as ppfeaturemask)? So that in 
> future we can add support for other debug features.
> >  };

OK.

> > 
> >  struct i2c_adapter;
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > index 048ca1673863..84016d22c075 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > smu_context *smu,
> > __smu_cmn_send_msg(smu, msg_index, param);
> > res = 0;
> >  Out:
> > +   if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > +   amdgpu_device_halt(smu->adev);
> > +   BUG();
> [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and 
> Andrey can share you more insights about that.
> Do we still need the "BUG()" then? 

The BUG() is used to explicitly notify users that something went
wrong. Otherwise userspace may not know immediately.
The FW guys requested this in the ticket.
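Since the same halt-and-notify sequence now appears at three call sites, one
option (purely a sketch, not part of the posted patch) would be to factor it
into a small helper in smu_cmn.c so the three message paths stay in sync:

/* Sketch only: centralize the debug-mode halt.  Skipping -ETIME keeps
 * the longer-timeout paths working, as described in the commit message.
 */
static void smu_cmn_debug_halt(struct smu_context *smu, int res)
{
	if (likely(!smu->smu_debug_mode) || !res || res == -ETIME)
		return;

	amdgpu_device_halt(smu->adev);
	BUG();
}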

Regards,
Lang

> BR
> Evan
> > +   }
> > +
> > return res;
> >  }
> > 
> > @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> > smu_context *smu,
> >  int smu_cmn_wait_for_response(struct smu_context *smu)
> >  {
> > u32 reg;
> > +   int res;
> > 
> > reg = __smu_cmn_poll_stat(smu);
> > -   return __smu_cmn_reg2errno(smu, reg);
> > +  

RE: [PATCH] drm/amd/pm: skip gfx cgpg in the s0ix suspend-resume

2021-12-09 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: amd-gfx  On Behalf Of
> Prike Liang
> Sent: Thursday, December 9, 2021 9:51 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Liang, Prike
> ; Huang, Ray ; Limonciello,
> Mario 
> Subject: [PATCH] drm/amd/pm: skip gfx cgpg in the s0ix suspend-resume
> 
> In the s0ix entry we need to retain gfx in the gfxoff state. We don't
> disable gfx cgpg in the suspend, so there is also no need to enable
> gfx cgpg in the s0ix resume.
> 
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 5839918..185269f 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1607,7 +1607,8 @@ static int smu_resume(void *handle)
>   return ret;
>   }
> 
> - if (smu->is_apu)
> + /* skip gfx cgpg in the s0ix suspend-resume case*/
> + if (smu->is_apu && !adev->in_s0ix)
> >   smu_set_gfx_cgpg(&adev->smu, true);
[Quan, Evan] I was wondering whether we can move the "!adev->in_s0ix" check into
the ->set_gfx_cgpg implementation (for now, only smu_v12_0_set_gfx_cgpg() is
supported, by Renoir)?
Also, considering this is only supported by Renoir, we may be able to drop the 
"smu->is_apu" check.

BR
Evan
> 
>   smu->disable_uclk_switch = 0;
> --
> 2.7.4


RE: [PATCH 2/2] drm/amdgpu: add support for SMU debug option

2021-12-09 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: amd-gfx  On Behalf Of Lang
> Yu
> Sent: Thursday, December 9, 2021 4:49 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Grodzovsky, Andrey ; Lazar, Lijo
> ; Huang, Ray ; Deucher,
> Alexander ; Yu, Lang ;
> Koenig, Christian 
> Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> 
> SMU firmware guys expect the driver to maintain error context
> and not interact with the SMU any more once an SMU error has
> occurred. That will aid in debugging SMU firmware issues.
> 
> Add SMU debug option support for this request; it can be
> enabled or disabled via the amdgpu_smu_debug debugfs file.
> When enabled, it brings the hardware to a kind of halt state
> so that no one can touch it any more in the event of SMU
> errors.
> 
> Currently, the driver interacts with the SMU by sending messages.
> And there are three ways to send messages to the SMU.
> Handle them respectively as follows:
> 
> 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> 
>   Halt on any error.
> 
> 2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> for longer timeout cases
> 
>   Halt on errors apart from ETIME. Otherwise this way won't work.
> 
> 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> 
>   Halt on errors apart from ETIME. Otherwise second way won't work.
> 
> After halting, use BUG() to explicitly notify users.
> 
> == Command Guide ==
> 
> 1, enable SMU debug option
> 
>  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> 2, disable SMU debug option
> 
>  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> v4:
>  - Set to halt state instead of a simple hang.(Christian)
> 
> v3:
>  - Use debugfs_create_bool().(Christian)
>  - Put variable into smu_context struct.
>  - Don't resend command when timeout.
> 
> v2:
>  - Resend command when timeout.(Lijo)
>  - Use debugfs file instead of module parameter.
> 
> Signed-off-by: Lang Yu 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
>  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h |  5 +
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 20
> +++-
>  3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 164d6a9e9fbb..86cd888c7822 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
> *adev)
>   if (!debugfs_initialized())
>   return 0;
> 
> + debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> +   &adev->smu.smu_debug_mode);
> +
>   ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> &fops_ib_preempt);
>   if (IS_ERR(ent)) {
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index f738f7dc20c9..50dbf5594a9d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -569,6 +569,11 @@ struct smu_context
>   struct smu_user_dpm_profile user_dpm_profile;
> 
>   struct stb_context stb_context;
> + /*
> +  * When enabled, it makes SMU errors fatal.
> +  * (0 = disabled (default), 1 = enabled)
> +  */
> + bool smu_debug_mode;
[Quan, Evan] Can you expand this to a bit mask (as ppfeaturemask)? So that in 
future we can add support for other debug features.
>  };
> 
>  struct i2c_adapter;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 048ca1673863..84016d22c075 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> smu_context *smu,
>   __smu_cmn_send_msg(smu, msg_index, param);
>   res = 0;
>  Out:
> + if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> + amdgpu_device_halt(smu->adev);
> + BUG();
[Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and Andrey 
can share you more insights about that.
Do we still need the "BUG()" then? 

BR
Evan
> + }
> +
>   return res;
>  }
> 
> @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> smu_context *smu,
>  int smu_cmn_wait_for_response(struct smu_context *smu)
>  {
>   u32 reg;
> + int res;
> 
>   reg = __smu_cmn_poll_stat(smu);
> - return __smu_cmn_reg2errno(smu, reg);
> + res = __smu_cmn_reg2errno(smu, reg);
> +
> + if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> + amdgpu_device_halt(smu->adev);
> + BUG();
> + }
> +
> + return res;
>  }
> 
>  /**
> @@ -357,6 +370,11 @@ int smu_cmn_send_smc_msg_with_param(struct
> smu_context *smu,
>   if (read_arg)
>   smu_cmn_read_arg(smu, read_arg);
>  Out:
> + if 

[PATCH 2/3] Documentation/gpu: include description of AMDGPU hardware structure

2021-12-09 Thread Yann Dirson
This is Alex' description from the "gpu block diagram" thread, edited to
fit as ReST.

Originally-by: Alex Deucher 
Signed-off-by: Yann Dirson 
---
 Documentation/gpu/amdgpu/driver-core.rst | 81 
 1 file changed, 81 insertions(+)

diff --git a/Documentation/gpu/amdgpu/driver-core.rst 
b/Documentation/gpu/amdgpu/driver-core.rst
index 97f9a9b68924..909b13fad6a8 100644
--- a/Documentation/gpu/amdgpu/driver-core.rst
+++ b/Documentation/gpu/amdgpu/driver-core.rst
@@ -2,6 +2,87 @@
  Core Driver Infrastructure
 
 
+GPU hardware structure
+==
+
+Each asic is a collection of hardware blocks.  We refer to them as
+"IPs" (Intellectual Property blocks).  Each IP encapsulates certain
+functionality. IPs are versioned and can also be mixed and matched.
+E.g., you might have two different asics that both have SDMA 5.x IPs.
+The driver is arranged by IPs.  There are driver components to handle
+the initialization and operation of each IP.  There are also a bunch
+of smaller IPs that don't really need much if any driver interaction.
+Those end up getting lumped into the common stuff in the soc files.
+The soc files (e.g., vi.c, soc15.c, nv.c) contain code for aspects of
+the SoC itself rather than specific IPs.  E.g., things like GPU resets
+and register access functions are SoC dependent.
+
+An APU contains more than just CPU and GPU, it also contains all of
+the platform stuff (audio, usb, gpio, etc.).  Also, a lot of
+components are shared between the CPU, platform, and the GPU (e.g.,
+SMU, PSP, etc.).  Specific components (CPU, GPU, etc.) usually have
+their interface to interact with those common components.  For things
+like S0i3 there is a ton of coordination required across all the
+components, but that is probably a bit beyond the scope of this
+section.
+
+With respect to the GPU, we have the following major IPs:
+
+GMC (Graphics Memory Controller)
+This was a dedicated IP on older pre-vega chips, but has since
+become somewhat decentralized on vega and newer chips.  They now
+have dedicated memory hubs for specific IPs or groups of IPs.  We
+still treat it as a single component in the driver however since
+the programming model is still pretty similar.  This is how the
+different IPs on the GPU get the memory (VRAM or system memory).
+It also provides the support for per process GPU virtual address
+spaces.
+
+IH (Interrupt Handler)
+This is the interrupt controller on the GPU.  All of the IPs feed
+their interrupts into this IP and it aggregates them into a set of
+ring buffers that the driver can parse to handle interrupts from
+different IPs.
+
+PSP (Platform Security Processor)
+This handles security policy for the SoC and executes trusted
+applications, and validates and loads firmwares for other blocks.
+
+SMU (System Management Unit)
+This is the power management microcontroller.  It manages the entire
+SoC.  The driver interacts with it to control power management
+features like clocks, voltages, power rails, etc.
+
+DCN (Display Controller Next)
+This is the display controller.  It handles the display hardware.
+
+SDMA (System DMA)
+This is a multi-purpose DMA engine.  The kernel driver uses it for
+various things including paging and GPU page table updates.  It's also
+exposed to userspace for use by user mode drivers (OpenGL, Vulkan,
+etc.)
+
+GC (Graphics and Compute)
+This is the graphics and compute engine, i.e., the block that
+encompasses the 3D pipeline and shader blocks.  This is by far the
+largest block on the GPU.  The 3D pipeline has tons of sub-blocks.  In
+addition to that, it also contains the CP microcontrollers (ME, PFP,
+CE, MEC) and the RLC microcontroller.  It's exposed to userspace for
+user mode drivers (OpenGL, Vulkan, OpenCL, etc.)
+
+VCN (Video Core Next)
+This is the multi-media engine.  It handles video and image encode and
+decode.  It's exposed to userspace for user mode drivers (VA-API,
+OpenMAX, etc.)
+
+Driver structure
+================
+
+In general, the driver has a list of all of the IPs on a particular
+SoC and for things like init/fini/suspend/resume, more or less just
+walks the list and handles each IP.
+
+
 .. _amdgpu_memory_domains:
 
 Memory Domains
-- 
2.31.1
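To make the "walks the list" paragraph above concrete, the init leg of that
walk looks roughly like the following (a simplified sketch; the real loop in
amdgpu_device_ip_init() also handles error unwinding and special-case
ordering):

static int ip_walk_hw_init_sketch(struct amdgpu_device *adev)
{
	int i, r;

	for (i = 0; i < adev->num_ip_blocks; i++) {
		/* skip IPs that were disabled or harvested on this asic */
		if (!adev->ip_blocks[i].status.valid)
			continue;

		r = adev->ip_blocks[i].version->funcs->hw_init((void *)adev);
		if (r)
			return r;

		adev->ip_blocks[i].status.hw = true;
	}

	return 0;
}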



[PATCH 3/3] Documentation/gpu: include description of some of the GC microcontrollers

2021-12-09 Thread Yann Dirson
This is Alex' description from the "Looking for clarifications around 
gfx/kcq/kiq"
thread, edited to fit as ReST.

Originally-by: Alex Deucher 
Signed-off-by: Yann Dirson 
---
 Documentation/gpu/amdgpu/driver-core.rst | 35 
 1 file changed, 35 insertions(+)

diff --git a/Documentation/gpu/amdgpu/driver-core.rst 
b/Documentation/gpu/amdgpu/driver-core.rst
index 909b13fad6a8..453566c280c5 100644
--- a/Documentation/gpu/amdgpu/driver-core.rst
+++ b/Documentation/gpu/amdgpu/driver-core.rst
@@ -75,6 +75,28 @@ VCN (Video Core Next)
 decode.  It's exposed to userspace for user mode drivers (VA-API,
 OpenMAX, etc.)
 
+Graphics and Compute microcontrollers
+-------------------------------------
+
+CP (Command Processor)
+The name for the hardware block that encompasses the front end of the
+GFX/Compute pipeline.  Consists mainly of a bunch of microcontrollers
+(PFP, ME, CE, MEC).  The firmware that runs on these microcontrollers
+provides the driver interface to interact with the GFX/Compute engine.
+
+MEC (MicroEngine Compute)
+This is the microcontroller that controls the compute queues on the
+GFX/compute engine.
+
+MES (MicroEngine Scheduler)
+This is a new engine for managing queues.  This is currently unused.
+
+RLC (RunList Controller)
+This is another microcontroller in the GFX/Compute engine.  It handles
+power management related functionality within the GFX/Compute engine.
+The name is a vestige of old hardware where it was originally added
+and doesn't really have much relation to what the engine does now.
+
 Driver structure
 
 
@@ -82,6 +104,19 @@ In general, the driver has a list of all of the IPs on a 
particular
 SoC and for things like init/fini/suspend/resume, more or less just
 walks the list and handles each IP.
 
+Some useful constructs:
+
+KIQ (Kernel Interface Queue)
+This is a control queue used by the kernel driver to manage other gfx
+and compute queues on the GFX/compute engine.  You can use it to
+map/unmap additional queues, etc.
+
+IB (Indirect Buffer)
+A command buffer for a particular engine.  Rather than writing
+commands directly to the queue, you can write the commands into a
+piece of memory and then put a pointer to the memory into the queue.
+The hardware will then follow the pointer and execute the commands in
+the memory, then return to the rest of the commands in the ring.
 
 .. _amdgpu_memory_domains:
 
-- 
2.31.1
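As a thumbnail of the IB mechanism described above: submission boils down to
writing the commands somewhere else and placing only a pointer packet on the
ring. A schematic sketch (struct ib_sketch, struct ring_sketch and
ring_emit() are invented for illustration; PACKET3_INDIRECT_BUFFER is the
real packet type):

struct ib_sketch {
	uint64_t gpu_addr;	/* GPU address of the command memory */
	uint32_t length_dw;	/* number of command dwords written there */
};

static void submit_ib_sketch(struct ring_sketch *ring,
			     const struct ib_sketch *ib)
{
	/* the ring itself carries only the pointer and the size;
	 * the hardware fetches and executes the commands at gpu_addr,
	 * then resumes with the packets that follow on the ring
	 */
	ring_emit(ring, PACKET3(PACKET3_INDIRECT_BUFFER, 2));
	ring_emit(ring, lower_32_bits(ib->gpu_addr));
	ring_emit(ring, upper_32_bits(ib->gpu_addr));
	ring_emit(ring, ib->length_dw);
}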



[PATCH 1/3] Documentation/gpu: split amdgpu/index for readability

2021-12-09 Thread Yann Dirson
This starts to make the formatted index much more manageable to the reader.

Signed-off-by: Yann Dirson 
---
 Documentation/gpu/amdgpu/driver-core.rst  |  65 
 Documentation/gpu/amdgpu/driver-misc.rst  | 112 +++
 Documentation/gpu/amdgpu/index.rst| 298 +-
 .../gpu/amdgpu/module-parameters.rst  |   7 +
 Documentation/gpu/amdgpu/ras.rst  |  62 
 Documentation/gpu/amdgpu/thermal.rst  |  65 
 6 files changed, 324 insertions(+), 285 deletions(-)
 create mode 100644 Documentation/gpu/amdgpu/driver-core.rst
 create mode 100644 Documentation/gpu/amdgpu/driver-misc.rst
 create mode 100644 Documentation/gpu/amdgpu/module-parameters.rst
 create mode 100644 Documentation/gpu/amdgpu/ras.rst
 create mode 100644 Documentation/gpu/amdgpu/thermal.rst

diff --git a/Documentation/gpu/amdgpu/driver-core.rst 
b/Documentation/gpu/amdgpu/driver-core.rst
new file mode 100644
index ..97f9a9b68924
--- /dev/null
+++ b/Documentation/gpu/amdgpu/driver-core.rst
@@ -0,0 +1,65 @@
+==========================
+ Core Driver Infrastructure
+==========================
+
+.. _amdgpu_memory_domains:
+
+Memory Domains
+==
+
+.. kernel-doc:: include/uapi/drm/amdgpu_drm.h
+   :doc: memory domains
+
+Buffer Objects
+==
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+   :doc: amdgpu_object
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+   :internal:
+
+PRIME Buffer Sharing
+
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+   :doc: PRIME Buffer Sharing
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+   :internal:
+
+MMU Notifier
+
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+   :doc: MMU Notifier
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+   :internal:
+
+AMDGPU Virtual Memory
+=
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+   :doc: GPUVM
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+   :internal:
+
+Interrupt Handling
+==
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+   :doc: Interrupt Handling
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+   :internal:
+
+IP Blocks
+=
+
+.. kernel-doc:: drivers/gpu/drm/amd/include/amd_shared.h
+   :doc: IP Blocks
+
+.. kernel-doc:: drivers/gpu/drm/amd/include/amd_shared.h
+   :identifiers: amd_ip_block_type amd_ip_funcs
diff --git a/Documentation/gpu/amdgpu/driver-misc.rst 
b/Documentation/gpu/amdgpu/driver-misc.rst
new file mode 100644
index ..e3d6b2fa2493
--- /dev/null
+++ b/Documentation/gpu/amdgpu/driver-misc.rst
@@ -0,0 +1,112 @@
+===============================
+ Misc AMDGPU driver information
+===============================
+
+GPU Product Information
+=======================
+
+Information about the GPU can be obtained on certain cards
+via sysfs
+
+product_name
+------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+   :doc: product_name
+
+product_number
+--------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+   :doc: product_name
+
+serial_number
+-------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+   :doc: serial_number
+
+unique_id
+---------
+
+.. kernel-doc:: drivers/gpu/drm/amd/pm/amdgpu_pm.c
+   :doc: unique_id
+
+GPU Memory Usage Information
+============================
+
+Various memory accounting can be accessed via sysfs
+
+mem_info_vram_total
+-------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+   :doc: mem_info_vram_total
+
+mem_info_vram_used
+------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+   :doc: mem_info_vram_used
+
+mem_info_vis_vram_total
+-----------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+   :doc: mem_info_vis_vram_total
+
+mem_info_vis_vram_used
+----------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+   :doc: mem_info_vis_vram_used
+
+mem_info_gtt_total
+------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+   :doc: mem_info_gtt_total
+
+mem_info_gtt_used
+-----------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+   :doc: mem_info_gtt_used
+
+PCIe Accounting Information
+===========================
+
+pcie_bw
+-------
+
+.. kernel-doc:: drivers/gpu/drm/amd/pm/amdgpu_pm.c
+   :doc: pcie_bw
+
+pcie_replay_count
+-----------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+   :doc: pcie_replay_count
+
+GPU SmartShift Information
+==========================
+
+GPU SmartShift information via sysfs
+
+smartshift_apu_power
+--------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/pm/amdgpu_pm.c
+   :doc: smartshift_apu_power
+
+smartshift_dgpu_power
+---------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/pm/amdgpu_pm.c
+   :doc: smartshift_dgpu_power
+

[PATCH 0/3] Enrich amdgpu docs from recent threads

2021-12-09 Thread Yann Dirson
This series starts by splitting the amdgpu/index file to make some
room for additional contents, but I'm not really happy with the
result, we can certainly do better.

The rest is basically bringing Alex' descriptions of the hardware and
driver internals into the doc.

Yann Dirson (3):
  Documentation/gpu: split amdgpu/index for readability
  Documentation/gpu: include description of AMDGPU hardware structure
  Documentation/gpu: include description of some of the GC
microcontrollers

 Documentation/gpu/amdgpu/driver-core.rst  | 181 +++
 Documentation/gpu/amdgpu/driver-misc.rst  | 112 +++
 Documentation/gpu/amdgpu/index.rst| 298 +-
 .../gpu/amdgpu/module-parameters.rst  |   7 +
 Documentation/gpu/amdgpu/ras.rst  |  62 
 Documentation/gpu/amdgpu/thermal.rst  |  65 
 6 files changed, 440 insertions(+), 285 deletions(-)
 create mode 100644 Documentation/gpu/amdgpu/driver-core.rst
 create mode 100644 Documentation/gpu/amdgpu/driver-misc.rst
 create mode 100644 Documentation/gpu/amdgpu/module-parameters.rst
 create mode 100644 Documentation/gpu/amdgpu/ras.rst
 create mode 100644 Documentation/gpu/amdgpu/thermal.rst

-- 
2.31.1



Re: Potential Bug in drm/amd/display/dc_link

2021-12-09 Thread Harry Wentland


On 2021-12-09 03:02, Yizhuo Zhai wrote:
> Hi All:
> I just found a bug in drm/amd/display using a static analysis tool, but
> not sure if this could happen in reality, could you please advise me
> here? Thanks for your attention : ) And please ignore the last one
> with HTML format if you did not filter it out.
> 
> In function enable_stream_features(), the variable
> "old_downspread.raw" could be uninitialized if core_link_read_dpcd
> fails(), however, it is used in the later if statement, and further,
> core_link_write_dpcd() may write random value, which is potentially
> unsafe. But this function does not return the error code to the up
> caller and I got stuck in drafting the patch, could you please advise
> me here?
> 

Thanks for highlighting this.

Unfortunately we frequently ignore DPCD error codes.

In this case I would do a memset as shown below.

> The related code:
> static void enable_stream_features(struct pipe_ctx *pipe_ctx)
> {
>  union down_spread_ctrl old_downspread;

memset(&old_downspread, 0, sizeof(old_downspread));

> core_link_read_dpcd(link, DP_DOWNSPREAD_CTRL,
>  &old_downspread.raw, sizeof(old_downspread));
> 
> //old_downspread.raw used here
> if (new_downspread.raw != old_downspread.raw) {
>    core_link_write_dpcd(link, DP_DOWNSPREAD_CTRL,
>  &new_downspread.raw, sizeof(new_downspread));
> }
> }
> enum dc_status core_link_read_dpcd(
> struct dc_link *link,
> uint32_t address,
> uint8_t *data,
> uint32_t size)
> {
> //data could be uninitialized if the helpers fails and log
> some error info
> if (!dm_helpers_dp_read_dpcd(link->ctx,
>  link, address, data, size))
>   return DC_ERROR_UNEXPECTED;
> return DC_OK;
> }
> 
> The same issue in function wait_for_training_aux_rd_interval() in
> drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c

I don't see this. Do you mean this one?

> void dp_wait_for_training_aux_rd_interval(
>   struct dc_link *link,
>   uint32_t wait_in_micro_secs)
> {
> #if defined(CONFIG_DRM_AMD_DC_DCN)
>   if (wait_in_micro_secs > 16000)
>   msleep(wait_in_micro_secs/1000);
>   else
>   udelay(wait_in_micro_secs);
> #else
>   udelay(wait_in_micro_secs);
> #endif
> 
>   DC_LOG_HW_LINK_TRAINING("%s:\n wait = %d\n",
>   __func__,
>   wait_in_micro_secs);
> }

Thanks,
Harry

> 
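A complementary, belt-and-braces variant (a sketch only, inside
enable_stream_features() and using the helper signatures quoted above) is to
also honour the status that core_link_read_dpcd() already returns and skip
the write entirely on failure:

union down_spread_ctrl old_downspread;

/* defensive default, as suggested above */
memset(&old_downspread, 0, sizeof(old_downspread));

if (core_link_read_dpcd(link, DP_DOWNSPREAD_CTRL,
			&old_downspread.raw,
			sizeof(old_downspread)) != DC_OK)
	return;	/* don't write back a value derived from garbage */

if (new_downspread.raw != old_downspread.raw)
	core_link_write_dpcd(link, DP_DOWNSPREAD_CTRL,
			     &new_downspread.raw, sizeof(new_downspread));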





Re: [PATCH v2 2/2] drm/amdgpu: Reduce SG bo memory usage for mGPUs

2021-12-09 Thread Felix Kuehling
On 2021-12-09 at 10:47 a.m., Philip Yang wrote:
> For a userptr bo, if the adev is not in IOMMU isolation mode, RAM is
> direct-mapped to the GPU and multiple GPUs use the same system memory
> dma mapping address, so they can share the original mem->bo in the
> attachment to reduce dma address array memory usage.
>
> Signed-off-by: Philip Yang 

The series is

Reviewed-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index b8490789eef4..f9bab963a948 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -708,10 +708,12 @@ static int kfd_mem_attach(struct amdgpu_device *adev, 
> struct kgd_mem *mem,
>   pr_debug("\t add VA 0x%llx - 0x%llx to vm %p\n", va,
>va + bo_size, vm);
>  
> - if (adev == bo_adev || (mem->domain == AMDGPU_GEM_DOMAIN_VRAM &&
> - amdgpu_xgmi_same_hive(adev, bo_adev))) {
> - /* Mappings on the local GPU and VRAM mappings in the
> -  * local hive share the original BO
> + if (adev == bo_adev ||
> +(amdgpu_ttm_tt_get_usermm(mem->bo->tbo.ttm) && 
> adev->ram_is_direct_mapped) ||
> +(mem->domain == AMDGPU_GEM_DOMAIN_VRAM && 
> amdgpu_xgmi_same_hive(adev, bo_adev))) {
> + /* Mappings on the local GPU, or VRAM mappings in the
> +  * local hive, or userptr mapping IOMMU direct map mode
> +  * share the original BO
>*/
>   attachment[i]->type = KFD_MEM_ATT_SHARED;
>   bo[i] = mem->bo;


Re: [PATCH] drm/amdkfd: explicitly create/destroy queue attributes under /sys

2021-12-09 Thread Felix Kuehling
On 2021-12-09 at 5:14 p.m., Chen, Xiaogang wrote:
>
> On 12/9/2021 12:40 PM, Felix Kuehling wrote:
>> On 2021-12-09 at 2:49 a.m., Xiaogang.Chen wrote:
>>> From: Xiaogang Chen 
>>>
>>> When an application is about to finish, it destroys the queues it has
>>> created via an ioctl. The driver deletes the queue
>>> entry (/sys/class/kfd/kfd/proc/pid/queues/queueid/), which is a
>>> directory containing all of this queue's attributes. Low-level kernel
>>> code deletes all attributes under this directory. The lock from the
>>> kernel is on the queue entry, not its attributes. Meanwhile, another
>>> user space application can read the attributes. It is possible for the
>>> application to hold/read the attributes while the kernel is deleting
>>> the queue entry, causing the application to make an invalid memory
>>> access and then be killed by the kernel.
>>>
>>> Driver changes: explicitly create/destroy each attribute for each
>>> queue, and let the kernel put a lock on each attribute too.
>> Is this working around a bug in kobject_del? Shouldn't that code take
>> care of the necessary locking itself?
>>
>> Regards,
>>    Felix
>
> The patches do not change kobject/kernfs, which are too low level and
> would involve deeper discussions.
> Changes were made at a higher level (kfd) instead.
>
> Have tested with MSF tool overnight.

OK. I'm OK with your changes. The patch is

Reviewed-by: Felix Kuehling 

But I think we should let the kernfs folks know that there is a problem
anyway. It might save someone else a lot of time and headaches down the
line. Ideally we'd come up with a small reproducer (dummy driver and a
user mode tool (could just be a bash script)) that doesn't require
special AMD hardware and the whole ROCm stack.
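For the user-mode half of such a reproducer, even a trivial C reader
hammering one queue attribute while the test application exits would likely
do (the sysfs path is an example and changes per process and queue):

/* usage: ./reader /sys/class/kfd/kfd/proc/<pid>/queues/<id>/size */
#include <stdio.h>

int main(int argc, char **argv)
{
	char buf[64];

	if (argc < 2)
		return 1;

	for (;;) {	/* race against queue teardown in the driver */
		FILE *f = fopen(argv[1], "r");

		if (f) {
			if (!fgets(buf, sizeof(buf), f))
				buf[0] = '\0';	/* attribute may be gone */
			fclose(f);
		}
	}

	return 0;
}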

Regards,
  Felix


>
> Thanks
> Xiaogang
>
>>
>>> Signed-off-by: Xiaogang Chen 
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  3 +++
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 33
>>> +++-
>>>   2 files changed, 13 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> index 0c3f911e3bf4..045da300749e 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> @@ -546,6 +546,9 @@ struct queue {
>>>     /* procfs */
>>>   struct kobject kobj;
>>> +    struct attribute attr_guid;
>>> +    struct attribute attr_size;
>>> +    struct attribute attr_type;
>>>   };
>>>     enum KFD_MQD_TYPE {
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 9158f9754a24..04a5638f9196 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -73,6 +73,8 @@ static void evict_process_worker(struct
>>> work_struct *work);
>>>   static void restore_process_worker(struct work_struct *work);
>>>     static void kfd_process_device_destroy_cwsr_dgpu(struct
>>> kfd_process_device *pdd);
>>> +static void kfd_sysfs_create_file(struct kobject *kobj, struct
>>> attribute *attr,
>>> +    char *name);
>>>     struct kfd_procfs_tree {
>>>   struct kobject *kobj;
>>> @@ -441,35 +443,12 @@ static ssize_t kfd_sysfs_counters_show(struct
>>> kobject *kobj,
>>>   return 0;
>>>   }
>>>   -static struct attribute attr_queue_size = {
>>> -    .name = "size",
>>> -    .mode = KFD_SYSFS_FILE_MODE
>>> -};
>>> -
>>> -static struct attribute attr_queue_type = {
>>> -    .name = "type",
>>> -    .mode = KFD_SYSFS_FILE_MODE
>>> -};
>>> -
>>> -static struct attribute attr_queue_gpuid = {
>>> -    .name = "gpuid",
>>> -    .mode = KFD_SYSFS_FILE_MODE
>>> -};
>>> -
>>> -static struct attribute *procfs_queue_attrs[] = {
>>> -    &attr_queue_size,
>>> -    &attr_queue_type,
>>> -    &attr_queue_gpuid,
>>> -    NULL
>>> -};
>>> -
>>>   static const struct sysfs_ops procfs_queue_ops = {
>>>   .show = kfd_procfs_queue_show,
>>>   };
>>>     static struct kobj_type procfs_queue_type = {
>>>   .sysfs_ops = &procfs_queue_ops,
>>> -    .default_attrs = procfs_queue_attrs,
>>>   };
>>>     static const struct sysfs_ops procfs_stats_ops = {
>>> @@ -511,6 +490,10 @@ int kfd_procfs_add_queue(struct queue *q)
>>>   return ret;
>>>   }
>>>   +    kfd_sysfs_create_file(&q->kobj, &q->attr_guid, "guid");
>>> +    kfd_sysfs_create_file(&q->kobj, &q->attr_size, "size");
>>> +    kfd_sysfs_create_file(&q->kobj, &q->attr_type, "type");
>>> +
>>>   return 0;
>>>   }
>>>   @@ -655,6 +638,10 @@ void kfd_procfs_del_queue(struct queue *q)
>>>   if (!q)
>>>   return;
>>>   +    sysfs_remove_file(&q->kobj, &q->attr_guid);
>>> +    sysfs_remove_file(&q->kobj, &q->attr_size);
>>> +    sysfs_remove_file(&q->kobj, &q->attr_type);
>>> +
>>>   kobject_del(&q->kobj);
>>>   kobject_put(&q->kobj);
>>>   }
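The patch only forward-declares kfd_sysfs_create_file(); a helper of that
shape would presumably look something like this sketch (not necessarily the
verbatim function from kfd_process.c):

static void kfd_sysfs_create_file(struct kobject *kobj,
				  struct attribute *attr, char *name)
{
	int ret;

	attr->name = name;
	attr->mode = KFD_SYSFS_FILE_MODE;
	sysfs_attr_init(attr);	/* sets up the per-attribute lockdep key */

	ret = sysfs_create_file(kobj, attr);
	if (ret)
		pr_warn("Create sysfs %s/%s failed %d\n",
			kobj->name, name, ret);
}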


Re: [PATCH] drm/amdkfd: explicitly create/destroy queue attributes under /sys

2021-12-09 Thread Chen, Xiaogang



On 12/9/2021 12:40 PM, Felix Kuehling wrote:

On 2021-12-09 at 2:49 a.m., Xiaogang.Chen wrote:

From: Xiaogang Chen 

When an application is about to finish, it destroys the queues it has
created via an ioctl. The driver deletes the queue
entry (/sys/class/kfd/kfd/proc/pid/queues/queueid/), which is a
directory containing all of this queue's attributes. Low-level kernel
code deletes all attributes under this directory. The lock from the
kernel is on the queue entry, not its attributes. Meanwhile, another
user space application can read the attributes. It is possible for the
application to hold/read the attributes while the kernel is deleting
the queue entry, causing the application to make an invalid memory
access and then be killed by the kernel.

Driver changes: explicitly create/destroy each attribute for each
queue, and let the kernel put a lock on each attribute too.

Is this working around a bug in kobject_del? Shouldn't that code take
care of the necessary locking itself?

Regards,
   Felix


The patches do not change kobject/kernfs, which are too low level and would 
involve deeper discussions.
Changes were made at a higher level (kfd) instead.

Have tested with MSF tool overnight.

Thanks
Xiaogang




Signed-off-by: Xiaogang Chen 
---
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h|  3 +++
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 33 +++-
  2 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 0c3f911e3bf4..045da300749e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -546,6 +546,9 @@ struct queue {
  
  	/* procfs */

struct kobject kobj;
+   struct attribute attr_guid;
+   struct attribute attr_size;
+   struct attribute attr_type;
  };
  
  enum KFD_MQD_TYPE {

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 9158f9754a24..04a5638f9196 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -73,6 +73,8 @@ static void evict_process_worker(struct work_struct *work);
  static void restore_process_worker(struct work_struct *work);
  
  static void kfd_process_device_destroy_cwsr_dgpu(struct kfd_process_device *pdd);

+static void kfd_sysfs_create_file(struct kobject *kobj, struct attribute *attr,
+   char *name);
  
  struct kfd_procfs_tree {

struct kobject *kobj;
@@ -441,35 +443,12 @@ static ssize_t kfd_sysfs_counters_show(struct kobject 
*kobj,
return 0;
  }
  
-static struct attribute attr_queue_size = {

-   .name = "size",
-   .mode = KFD_SYSFS_FILE_MODE
-};
-
-static struct attribute attr_queue_type = {
-   .name = "type",
-   .mode = KFD_SYSFS_FILE_MODE
-};
-
-static struct attribute attr_queue_gpuid = {
-   .name = "gpuid",
-   .mode = KFD_SYSFS_FILE_MODE
-};
-
-static struct attribute *procfs_queue_attrs[] = {
-   &attr_queue_size,
-   &attr_queue_type,
-   &attr_queue_gpuid,
-   NULL
-};
-
  static const struct sysfs_ops procfs_queue_ops = {
.show = kfd_procfs_queue_show,
  };
  
  static struct kobj_type procfs_queue_type = {

	.sysfs_ops = &procfs_queue_ops,
-   .default_attrs = procfs_queue_attrs,
  };
  
  static const struct sysfs_ops procfs_stats_ops = {

@@ -511,6 +490,10 @@ int kfd_procfs_add_queue(struct queue *q)
return ret;
}
  
+	kfd_sysfs_create_file(&q->kobj, &q->attr_guid, "guid");

+   kfd_sysfs_create_file(&q->kobj, &q->attr_size, "size");
+   kfd_sysfs_create_file(&q->kobj, &q->attr_type, "type");
+
return 0;
  }
  
@@ -655,6 +638,10 @@ void kfd_procfs_del_queue(struct queue *q)

if (!q)
return;
  
+	sysfs_remove_file(&q->kobj, &q->attr_guid);

+   sysfs_remove_file(&q->kobj, &q->attr_size);
+   sysfs_remove_file(&q->kobj, &q->attr_type);
+
	kobject_del(&q->kobj);
	kobject_put(&q->kobj);
  }


[PATCH v2] drm/amdkfd: fix svm_bo release invalid wait context warning

2021-12-09 Thread Philip Yang
Add svm_range_bo_unref_async to schedule work that waits for the svm_bo
eviction work to finish and then frees the svm_bo. __do_munmap put_page
runs in atomic context; call svm_range_bo_unref_async there to avoid an
invalid-wait-context warning. Other, non-atomic contexts call
svm_range_bo_unref.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 31 +---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  3 ++-
 3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 9731151b67d6..d5d2cf2ee788 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -550,7 +550,7 @@ static void svm_migrate_page_free(struct page *page)
 
if (svm_bo) {
pr_debug_ratelimited("ref: %d\n", kref_read(_bo->kref));
-   svm_range_bo_unref(svm_bo);
+   svm_range_bo_unref_async(svm_bo);
}
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index c178d56361d6..b216842b5fe2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -332,6 +332,8 @@ static void svm_range_bo_release(struct kref *kref)
struct svm_range_bo *svm_bo;
 
svm_bo = container_of(kref, struct svm_range_bo, kref);
+   pr_debug("svm_bo 0x%p\n", svm_bo);
+
	spin_lock(&svm_bo->list_lock);
	while (!list_empty(&svm_bo->range_list)) {
struct svm_range *prange =
@@ -365,12 +367,33 @@ static void svm_range_bo_release(struct kref *kref)
kfree(svm_bo);
 }
 
-void svm_range_bo_unref(struct svm_range_bo *svm_bo)
+static void svm_range_bo_wq_release(struct work_struct *work)
 {
-   if (!svm_bo)
-   return;
+   struct svm_range_bo *svm_bo;
+
+   svm_bo = container_of(work, struct svm_range_bo, release_work);
	svm_range_bo_release(&svm_bo->kref);
+}
+
+static void svm_range_bo_release_async(struct kref *kref)
+{
+   struct svm_range_bo *svm_bo;
+
+   svm_bo = container_of(kref, struct svm_range_bo, kref);
+   pr_debug("svm_bo 0x%p\n", svm_bo);
+   INIT_WORK(&svm_bo->release_work, svm_range_bo_wq_release);
+   schedule_work(&svm_bo->release_work);
+}
 
-   kref_put(&svm_bo->kref, svm_range_bo_release);
+void svm_range_bo_unref_async(struct svm_range_bo *svm_bo)
+{
+   kref_put(&svm_bo->kref, svm_range_bo_release_async);
+}
+
+static void svm_range_bo_unref(struct svm_range_bo *svm_bo)
+{
+   if (svm_bo)
+   kref_put(&svm_bo->kref, svm_range_bo_release);
 }
 
 static bool
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 6dc91c33e80f..2f8a95e86dcb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -48,6 +48,7 @@ struct svm_range_bo {
struct work_struct  eviction_work;
struct svm_range_list   *svms;
uint32_tevicting;
+   struct work_struct  release_work;
 };
 
 enum svm_work_list_ops {
@@ -195,7 +196,7 @@ void svm_range_list_lock_and_flush_work(struct 
svm_range_list *svms, struct mm_s
  */
 #define KFD_IS_SVM_API_SUPPORTED(dev) ((dev)->pgmap.type != 0)
 
-void svm_range_bo_unref(struct svm_range_bo *svm_bo);
+void svm_range_bo_unref_async(struct svm_range_bo *svm_bo);
 #else
 
 struct kfd_process;
-- 
2.17.1
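Stripped of the svm specifics, the pattern this patch applies (punting the
final kref release to a workqueue because the last put may come from atomic
context) reduces to the following generic sketch:

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct obj {
	struct kref kref;
	struct work_struct release_work;
};

static void obj_release_work(struct work_struct *work)
{
	struct obj *o = container_of(work, struct obj, release_work);

	/* safe to sleep here: flush other work, take mutexes, etc. */
	kfree(o);
}

static void obj_release_async(struct kref *kref)
{
	struct obj *o = container_of(kref, struct obj, kref);

	INIT_WORK(&o->release_work, obj_release_work);
	schedule_work(&o->release_work);
}

/* callable from atomic context */
static void obj_put_async(struct obj *o)
{
	kref_put(&o->kref, obj_release_async);
}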



Re: [PATCH v4 0/6] Expand display core documentation

2021-12-09 Thread Yann Dirson


> Thanks for this. It's really good to see this.
> 
> Reviewed-by: Harry Wentland 

Heartily seconded, let's get this rolling :)

Reviewed-by: Yann Dirson 

> 
> Harry
> 
> On 2021-12-09 09:20, Rodrigo Siqueira wrote:
> > Display Core (DC) is one of the components under amdgpu, and it has
> > multiple features directly related to the KMS API. Unfortunately,
> > we
> > don't have enough documentation about DC in the upstream, which
> > makes
> > the life of some external contributors a little bit more
> > challenging.
> > For these reasons, this patchset reworks part of the DC
> > documentation
> > and introduces a new set of details on how the display core works
> > on DCN
> > IP. Another improvement that this documentation effort tries to
> > bring is
> > making explicit some of our hardware-specific details to guide
> > user-space developers better.
> > 
> > In my view, it is easier to review this series if you apply it in
> > your
> > local kernel and build the HTML version (make htmldocs). I'm
> > suggesting
> > this approach because I added a few SVG diagrams that will be
> > easier to
> > see in the HTML version. If you cannot build the documentation, try
> > to
> > open the SVG images while reviewing the content. In summary, in
> > this
> > series, you will find:
> > 
> > 1. Patch 1: Re-arrange of display core documentation. This is
> >preparation work for the other patches, but it is also a way to
> >expand
> >this documentation.
> > 2. Patch 2 to 4: Document some common debug options related to
> > display.
> > 3. Patch 5: This patch provides an overview of how our display core
> > next
> >works and a brief explanation of each component.
> > 4. Patch 6: We use a lot of acronyms in our driver; for this
> > reason, we
> >exposed a glossary with common terms used by display core.
> > 
> > Please let us know what you think we can improve this series and
> > what
> > kind of content you want to see for the next series.
> > 
> > Changes since V3:
> >  - Add new acronyms to amdgpu glossary
> >  - Add link between dc and amdgpu glossary
> > Changes since V2:
> >  - Add a comment about MMHUBBUB
> > Changes since V1:
> >  - Group amdgpu documentation together.
> >  - Create index pages.
> >  - Mirror display folder in the documentation.
> >  - Divide glossary based on driver context.
> >  - Make terms more consistent and update CPLIB
> >  - Add new acronyms to the glossary
> > 
> > Thanks
> > Siqueira
> > 
> > Rodrigo Siqueira (6):
> >   Documentation/gpu: Reorganize DC documentation
> >   Documentation/gpu: Document amdgpu_dm_visual_confirm debugfs
> >   entry
> >   Documentation/gpu: Document pipe split visual confirmation
> >   Documentation/gpu: How to collect DTN log
> >   Documentation/gpu: Add basic overview of DC pipeline
> >   Documentation/gpu: Add amdgpu and dc glossary
> > 
> >  Documentation/gpu/amdgpu-dc.rst   |   74 --
> >  Documentation/gpu/amdgpu/amdgpu-glossary.rst  |   87 ++
> >  .../gpu/amdgpu/display/config_example.svg |  414 ++
> >  Documentation/gpu/amdgpu/display/dc-debug.rst |   77 ++
> >  .../gpu/amdgpu/display/dc-glossary.rst|  237 
> >  .../amdgpu/display/dc_pipeline_overview.svg   | 1125
> >  +
> >  .../gpu/amdgpu/display/dcn-overview.rst   |  171 +++
> >  .../gpu/amdgpu/display/display-manager.rst|   42 +
> >  .../gpu/amdgpu/display/global_sync_vblank.svg |  485 +++
> >  Documentation/gpu/amdgpu/display/index.rst|   29 +
> >  .../gpu/{amdgpu.rst => amdgpu/index.rst}  |   25 +-
> >  Documentation/gpu/drivers.rst |3 +-
> >  12 files changed, 2690 insertions(+), 79 deletions(-)
> >  delete mode 100644 Documentation/gpu/amdgpu-dc.rst
> >  create mode 100644 Documentation/gpu/amdgpu/amdgpu-glossary.rst
> >  create mode 100644
> >  Documentation/gpu/amdgpu/display/config_example.svg
> >  create mode 100644 Documentation/gpu/amdgpu/display/dc-debug.rst
> >  create mode 100644
> >  Documentation/gpu/amdgpu/display/dc-glossary.rst
> >  create mode 100644
> >  Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg
> >  create mode 100644
> >  Documentation/gpu/amdgpu/display/dcn-overview.rst
> >  create mode 100644
> >  Documentation/gpu/amdgpu/display/display-manager.rst
> >  create mode 100644
> >  Documentation/gpu/amdgpu/display/global_sync_vblank.svg
> >  create mode 100644 Documentation/gpu/amdgpu/display/index.rst
> >  rename Documentation/gpu/{amdgpu.rst => amdgpu/index.rst} (95%)
> > 
> 
> 


Re: [bisected][regression] Applications that need amdgpu doesn't run after waking up from suspend

2021-12-09 Thread w...@kernel.org
Hi,

thank you for the report!

> No issues in Kernel 5.13.13 and the issues exist in 5.14 to 5.15.7 .So
> I bisected the bug with
> git(https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux).first
> bad commit: [5a7b95fb993ec399c8a685552aa6a8fc995c40bd] i2c: core:
> support bus regulator controlling in adapter

Have you tried reverting the commit and see if things work again?

Kind regards,

   Wolfram





Re: [PATCH v2 1/2] drm/amdgpu: Detect if amdgpu in IOMMU direct map mode

2021-12-09 Thread Alex Deucher
On Thu, Dec 9, 2021 at 12:02 PM Philip Yang  wrote:
>
> If the host and amdgpu IOMMU are not enabled, or the IOMMU is in pass-through
> mode, set the adev->ram_is_direct_mapped flag, which will be used to optimize
> memory usage for multi-GPU mappings.
>
> Signed-off-by: Philip Yang 

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++
>  2 files changed, 21 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 54c882a6b433..0ec19c83a203 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1097,6 +1097,8 @@ struct amdgpu_device {
>
> struct amdgpu_reset_control *reset_cntl;
> uint32_t
> ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE];
> +
> +   boolram_is_direct_mapped;
>  };
>
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index ce9bdef185c0..3318d92de8eb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -3381,6 +3382,22 @@ static int 
> amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
> return ret;
>  }
>
> +/**
> + * amdgpu_device_check_iommu_direct_map - check if RAM direct mapped to GPU
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * RAM direct mapped to GPU if IOMMU is not enabled or is pass through mode
> + */
> +static void amdgpu_device_check_iommu_direct_map(struct amdgpu_device *adev)
> +{
> +   struct iommu_domain *domain;
> +
> +   domain = iommu_get_domain_for_dev(adev->dev);
> +   if (!domain || domain->type == IOMMU_DOMAIN_IDENTITY)
> +   adev->ram_is_direct_mapped = true;
> +}
> +
>  static const struct attribute *amdgpu_dev_attributes[] = {
> &dev_attr_product_name.attr,
> &dev_attr_product_number.attr,
> @@ -3784,6 +3801,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> queue_delayed_work(system_wq, &mgpu_info.delayed_reset_work,
>msecs_to_jiffies(AMDGPU_RESUME_MS));
>
> +   amdgpu_device_check_iommu_direct_map(adev);
> +
> return 0;
>
>  release_ras_con:
> --
> 2.17.1
>


Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Alex Deucher
On Thu, Dec 9, 2021 at 1:18 PM Guilherme G. Piccoli  wrote:
>
> Thanks again Alex! Some comments inlined below:
>
> On 09/12/2021 15:06, Alex Deucher wrote:
> > Not really in a generic way.  It's asic and platform specific.  In
> > addition most modern displays require link training to bring up the
> > display, so you can't just save and restore registers.
>
> Oh sure, I understand that. My question is more like: is there a way,
> inside amdgpu driver, to save this state before taking
> over/overwriting/reprogramming the device? So we could (again, from
> inside the amdgpu driver) dump this pre-saved state in the shutdown
> handler, for example, having the device in a "pre-OS" state when the new
> kexec'ed kernel starts.

Sure, it could be done, it's just a fair amount of work.  Things like
legacy vga text mode are a bit more of a challenge, but that tends to
be less relevant as non-legacy UEFI becomes more pervasive.

>
> >
> > The drivers are asic and platform specific.  E.g., the driver for
> > vangogh is different from renoir is different from skylake, etc.  The
> > display programming interfaces are asic specific.
>
> Cool, that makes sense! But if you (or anybody here) know some of these
> GOP drivers, e.g. for the qemu/qxl device, I'm just curious to
> see/understand how complex is the FW driver to just put the
> device/screen in a usable state.

Most of the asic init and display setup on AMD GPUs is handled via
atombios command tables (basically little scripts stored in the
vbios) which are shared by the driver and the GOP driver for most
programming sequences.  In our case, the GOP driver is pretty simple.
Take a look at the pre-DC display code in amdgpu to see what a basic
display driver would look like (e.g., dce_v11_0.c).  The GOP driver
would call the atombios asic_init table to make sure the chip itself
is initialized (e.g., memory controller, etc.), then walk the display
data tables in the vbios to determine the display configuration
specific to this board, then probe the displays and use the atombios
display command tables to light them up.

Alex


Re: [PATCH 2/2] drm/amdkfd: Use prange->update_list head for remove_list

2021-12-09 Thread philip yang

  


On 2021-12-08 7:03 p.m., Felix Kuehling wrote:


  The remove_list head was only used for keeping track of existing ranges
that are to be removed from the svms->list. The update_list was used for
new or existing ranges that need updated attributes. These two cases are
mutually exclusive (i.e. the same range will never be on both lists).
Therefore we can use the update_list head to track the remove_list and
save another 16 bytes in the svm_range struct.

Signed-off-by: Felix Kuehling 

one nit-pick below.
Reviewed-by: Philip Yang 
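The mutual exclusivity described in the commit message above is exactly what
makes one head reusable; in miniature, the same trick looks like this (a
generic sketch, not kfd code):

struct item {
	/* sits on update_list OR remove_list, never both at once */
	struct list_head update_list;
};

static void classify_item(struct item *item, bool remove,
			  struct list_head *update_list,
			  struct list_head *remove_list)
{
	/* one embedded list_head serves both lists because membership
	 * is mutually exclusive per item
	 */
	list_add(&item->update_list, remove ? remove_list : update_list);
}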

  
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 ++---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 2 --
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index dea7c6236be5..ee7e1eb7394a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -295,7 +295,6 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
 	prange->last = last;
	INIT_LIST_HEAD(&prange->list);
	INIT_LIST_HEAD(&prange->update_list);
-	INIT_LIST_HEAD(&prange->remove_list);
	INIT_LIST_HEAD(&prange->svm_bo_list);
	INIT_LIST_HEAD(&prange->deferred_list);
	INIT_LIST_HEAD(&prange->child_list);
@@ -1878,7 +1877,7 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 goto out;
 			}
 
-			list_add(&old->remove_list, remove_list);
+			list_add(&old->update_list, remove_list);
			list_add(&prange->list, insert_list);
			list_add(&prange->update_list, update_list);
 
@@ -3225,7 +3224,7 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 		/* TODO: unmap ranges from GPU that lost access */
 	}
	list_for_each_entry_safe(prange, next, &remove_list,
-remove_list) {
+update_list) {

This line can be combined with the previous line.

  
 		pr_debug("unlink old 0x%p prange 0x%p [0x%lx 0x%lx]\n",
 			 prange->svms, prange, prange->start,
 			 prange->last);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index c3738bd35a3e..5edbd7dccad0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -75,7 +75,6 @@ struct svm_work_list_item {
  *  aligned, page size is (last - start + 1)
  * @list:   link list node, used to scan all ranges of svms
  * @update_list:link list node used to add to update_list
- * @remove_list:link list node used to add to remove list
  * @mapping:bo_va mapping structure to create and update GPU page table
  * @npages: number of pages
  * @dma_addr:   dma mapping address on each GPU for system memory physical page
@@ -111,7 +110,6 @@ struct svm_range {
 	struct interval_tree_node	it_node;
 	struct list_head		list;
 	struct list_head		update_list;
-	struct list_head		remove_list;
 	uint64_t			npages;
 	dma_addr_t			*dma_addr[MAX_GPU_INSTANCE];
 	struct ttm_resource		*ttm_res;


  



Re: [PATCH 1/2] drm/amdkfd: Use prange->list head for insert_list

2021-12-09 Thread philip yang

  


On 2021-12-08 7:03 p.m., Felix Kuehling wrote:


  There are seven list_heads in struct svm_range: list, update_list,
remove_list, insert_list, svm_bo_list, deferred_list, child_list. This
patch and the next one remove two of them that are redundant.

The insert_list head was only used for new ranges that are not on the
svms->list yet. So we can use that list head for keeping track of
new ranges before they get added, and use list_move_tail to move them
to the svms->list when ready.


prange->insert_list was added to handle the rollback case if migration
failed, to avoid corrupting insert_list; now this is not needed as we
changed the rollback logic.

Reviewed-by: Philip Yang 


  
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 17 -
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  2 --
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index c178d56361d6..dea7c6236be5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -107,7 +107,7 @@ static void svm_range_add_to_svms(struct svm_range *prange)
 	pr_debug("svms 0x%p prange 0x%p [0x%lx 0x%lx]\n", prange->svms,
 		 prange, prange->start, prange->last);
 
-	list_add_tail(&prange->list, &prange->svms->list);
+	list_move_tail(&prange->list, &prange->svms->list);
	prange->it_node.start = prange->start;
	prange->it_node.last = prange->last;
	interval_tree_insert(&prange->it_node, &prange->svms->objects);
@@ -296,7 +296,6 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
	INIT_LIST_HEAD(&prange->list);
	INIT_LIST_HEAD(&prange->update_list);
	INIT_LIST_HEAD(&prange->remove_list);
-	INIT_LIST_HEAD(&prange->insert_list);
	INIT_LIST_HEAD(&prange->svm_bo_list);
	INIT_LIST_HEAD(&prange->deferred_list);
	INIT_LIST_HEAD(&prange->child_list);
@@ -995,7 +994,7 @@ svm_range_split_tail(struct svm_range *prange,
	int r = svm_range_split(prange, prange->start, new_last, &tail);
 
 	if (!r)
-		list_add(&tail->insert_list, insert_list);
+		list_add(&tail->list, insert_list);
 	return r;
 }
 
@@ -1007,7 +1006,7 @@ svm_range_split_head(struct svm_range *prange,
 	int r = svm_range_split(prange, new_start, prange->last, &head);
 
 	if (!r)
-		list_add(&head->insert_list, insert_list);
+		list_add(&head->list, insert_list);
 	return r;
 }
 
@@ -1880,7 +1879,7 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 			}
 
 			list_add(&old->remove_list, remove_list);
-			list_add(&prange->insert_list, insert_list);
+			list_add(&prange->list, insert_list);
 			list_add(&prange->update_list, update_list);
 
 			if (node->start < start) {
@@ -1912,7 +1911,7 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 goto out;
 			}
 
-			list_add(&prange->insert_list, insert_list);
+			list_add(&prange->list, insert_list);
 			list_add(&prange->update_list, update_list);
 		}
 
@@ -1927,13 +1926,13 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 			r = -ENOMEM;
 			goto out;
 		}
-		list_add(&prange->insert_list, insert_list);
+		list_add(&prange->list, insert_list);
 		list_add(&prange->update_list, update_list);
 	}
 
 out:
 	if (r)
-		list_for_each_entry_safe(prange, tmp, insert_list, insert_list)
+		list_for_each_entry_safe(prange, tmp, insert_list, list)
 			svm_range_free(prange);
 
 	return r;
@@ -3217,7 +3216,7 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 		goto out;
 	}
 	/* Apply changes as a transaction */
-	list_for_each_entry_safe(prange, next, &insert_list, insert_list) {
+	list_for_each_entry_safe(prange, next, &insert_list, list) {
 		svm_range_add_to_svms(prange);
 		svm_range_add_notifier_locked(mm, prange);
 	}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 6dc91c33e80f..c3738bd35a3e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -76,7 +76,6 @@ struct svm_work_list_item {
  * @list:   link list node, used to scan all ranges of svms
  * @update_list:link list node used to add to update_list
  * @remove_list:link list node used to add to remove list
- * @insert_list:link list node used to add to insert list
  * @mapping:bo_va mapping structure to create and update GPU page table
  * @npages: number of pages
  * @dma_addr:   dma mapping address on each GPU for system memory physical page
@@ -113,7 +112,6 @@ struct svm_range {
 	struct list_head		list;
 	struct list_head		update_list;
 	struct list_head		remove_list;
-	struct list_head		insert_list;
 	uint64_t			npages;
 	dma_addr_t			*dma_addr[MAX_GPU_INSTANCE];
 	struct ttm_resource		*ttm_res;


  



RE: [PATCH] drm/amdkfd: add Navi2x to GWS init conditions

2021-12-09 Thread Kim, Jonathan
[AMD Official Use Only]

> -Original Message-
> From: Sider, Graham 
> Sent: December 9, 2021 1:33 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Kim, Jonathan ; Kuehling, Felix
> ; Sider, Graham 
> Subject: [PATCH] drm/amdkfd: add Navi2x to GWS init conditions
>
> Initialize GWS on Navi2x with mec2_fw_version >= 0x42.
>
> Signed-off-by: Graham Sider 

Reviewed-and-tested-by: Jonathan Kim 

> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index facc28f58c1f..67dd94b0b9a7 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -368,7 +368,10 @@ static int kfd_gws_init(struct kfd_dev *kfd)
>   (KFD_GC_VERSION(kfd) == IP_VERSION(9, 4, 1)
>   && kfd->mec2_fw_version >= 0x30)   ||
>   (KFD_GC_VERSION(kfd) == IP_VERSION(9, 4, 2)
> - && kfd->mec2_fw_version >= 0x28))))
> + && kfd->mec2_fw_version >= 0x28)   ||
> + (KFD_GC_VERSION(kfd) >= IP_VERSION(10, 3, 0)
> + && KFD_GC_VERSION(kfd) <= IP_VERSION(10, 3, 5)
> + && kfd->mec2_fw_version >= 0x42))))
>   ret = amdgpu_amdkfd_alloc_gws(kfd->adev,
>   kfd->adev->gds.gws_size, &kfd->gws);
>
> --
> 2.25.1



RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

2021-12-09 Thread Liu, Shaoyun
[AMD Official Use Only]

Sounds reasonable.
This patch is Reviewed-by: Shaoyun.liu 

Regards
Shaoyun.liu

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 9, 2021 1:33 PM
To: Liu, Shaoyun ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

[AMD Official Use Only]

I wanted to keep the order the same as in amdgpu_device_lock_adev() (Set flag 
then acquire lock) to prevent any weird race conditions.

Thanks,
Victor

-Original Message-
From: Liu, Shaoyun 
Sent: Thursday, December 9, 2021 1:25 PM
To: Skvortsov, Victor ; amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

[AMD Official Use Only]

I think it's a good catch for reset_sem. Any reason to change
adev->in_gpu_reset?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Victor 
Skvortsov
Sent: Thursday, December 9, 2021 12:02 PM
To: amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

Host initiated VF FLR may fail if someone else is already holding a read_lock. 
Change from down_write_trylock to down_write to guarantee the reset goes 
through.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +++--  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index cd2719bc0139..e4365c97adaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -252,11 +252,12 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 2bc93808469a..1cde70c72e54 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -281,11 +281,12 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
--
2.25.1


Re: [PATCH] drm/amdkfd: explicitly create/destroy queue attributes under /sys

2021-12-09 Thread Felix Kuehling
On 2021-12-09 at 2:49 a.m., Xiaogang.Chen wrote:
> From: Xiaogang Chen 
>
> When an application is about to finish, it destroys the queues it has
> created via an ioctl. The driver deletes the queue entry
> (/sys/class/kfd/kfd/proc/pid/queues/queueid/), which is a directory
> holding all of this queue's attributes; low-level kernel code then
> deletes all attributes under this directory. The lock taken by the
> kernel is on the queue entry, not on its attributes. Meanwhile another
> user space application can read the attributes, so it is possible for
> an application to hold/read the attributes while the kernel is deleting
> the queue entry, causing the application to make an invalid memory
> access and be killed by the kernel.
>
> Driver changes: explicitly create/destroy each attribute for each
> queue, so the kernel puts a lock on each attribute too.

Is this working around a bug in kobject_del? Shouldn't that code take
care of the necessary locking itself?

Regards,
  Felix


>
> Signed-off-by: Xiaogang Chen 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h|  3 +++
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 33 +++-
>  2 files changed, 13 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index 0c3f911e3bf4..045da300749e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -546,6 +546,9 @@ struct queue {
>  
>   /* procfs */
>   struct kobject kobj;
> + struct attribute attr_guid;
> + struct attribute attr_size;
> + struct attribute attr_type;
>  };
>  
>  enum KFD_MQD_TYPE {
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 9158f9754a24..04a5638f9196 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -73,6 +73,8 @@ static void evict_process_worker(struct work_struct *work);
>  static void restore_process_worker(struct work_struct *work);
>  
>  static void kfd_process_device_destroy_cwsr_dgpu(struct kfd_process_device 
> *pdd);
> +static void kfd_sysfs_create_file(struct kobject *kobj, struct attribute 
> *attr,
> + char *name);
>  
>  struct kfd_procfs_tree {
>   struct kobject *kobj;
> @@ -441,35 +443,12 @@ static ssize_t kfd_sysfs_counters_show(struct kobject 
> *kobj,
>   return 0;
>  }
>  
> -static struct attribute attr_queue_size = {
> - .name = "size",
> - .mode = KFD_SYSFS_FILE_MODE
> -};
> -
> -static struct attribute attr_queue_type = {
> - .name = "type",
> - .mode = KFD_SYSFS_FILE_MODE
> -};
> -
> -static struct attribute attr_queue_gpuid = {
> - .name = "gpuid",
> - .mode = KFD_SYSFS_FILE_MODE
> -};
> -
> -static struct attribute *procfs_queue_attrs[] = {
> - &attr_queue_size,
> - &attr_queue_type,
> - &attr_queue_gpuid,
> - NULL
> -};
> -
>  static const struct sysfs_ops procfs_queue_ops = {
>   .show = kfd_procfs_queue_show,
>  };
>  
>  static struct kobj_type procfs_queue_type = {
>   .sysfs_ops = &procfs_queue_ops,
> - .default_attrs = procfs_queue_attrs,
>  };
>  
>  static const struct sysfs_ops procfs_stats_ops = {
> @@ -511,6 +490,10 @@ int kfd_procfs_add_queue(struct queue *q)
>   return ret;
>   }
>  
> + kfd_sysfs_create_file(&q->kobj, &q->attr_guid, "guid");
> + kfd_sysfs_create_file(&q->kobj, &q->attr_size, "size");
> + kfd_sysfs_create_file(&q->kobj, &q->attr_type, "type");
> +
>   return 0;
>  }
>  
> @@ -655,6 +638,10 @@ void kfd_procfs_del_queue(struct queue *q)
>   if (!q)
>   return;
>  
> + sysfs_remove_file(&q->kobj, &q->attr_guid);
> + sysfs_remove_file(&q->kobj, &q->attr_size);
> + sysfs_remove_file(&q->kobj, &q->attr_type);
> +
>   kobject_del(&q->kobj);
>   kobject_put(&q->kobj);
>  }
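
For context, the kfd_sysfs_create_file() helper the patch declares and
reuses follows the usual sysfs pattern; a rough sketch of what such a
helper does (not the exact upstream code) is:

static void kfd_sysfs_create_file(struct kobject *kobj, struct attribute *attr,
				  char *name)
{
	int ret;

	/* attr is embedded in struct queue, so (re)initialize it */
	attr->name = name;
	attr->mode = KFD_SYSFS_FILE_MODE;
	sysfs_attr_init(attr);

	ret = sysfs_create_file(kobj, attr);
	if (ret)
		pr_warn("Create sysfs %s/%s failed %d", kobj->name, name, ret);
}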


[PATCH] drm/amdkfd: add Navi2x to GWS init conditions

2021-12-09 Thread Graham Sider
Initialize GWS on Navi2x with mec2_fw_version >= 0x42.

Signed-off-by: Graham Sider 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index facc28f58c1f..67dd94b0b9a7 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -368,7 +368,10 @@ static int kfd_gws_init(struct kfd_dev *kfd)
(KFD_GC_VERSION(kfd) == IP_VERSION(9, 4, 1)
&& kfd->mec2_fw_version >= 0x30)   ||
(KFD_GC_VERSION(kfd) == IP_VERSION(9, 4, 2)
-   && kfd->mec2_fw_version >= 0x28))))
+   && kfd->mec2_fw_version >= 0x28)   ||
+   (KFD_GC_VERSION(kfd) >= IP_VERSION(10, 3, 0)
+   && KFD_GC_VERSION(kfd) <= IP_VERSION(10, 3, 5)
+   && kfd->mec2_fw_version >= 0x42))))
ret = amdgpu_amdkfd_alloc_gws(kfd->adev,
kfd->adev->gds.gws_size, &kfd->gws);
 
-- 
2.25.1



RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

2021-12-09 Thread Skvortsov, Victor
[AMD Official Use Only]

I wanted to keep the order the same as in amdgpu_device_lock_adev() 
(Set flag then acquire lock) to prevent any weird race conditions.

Thanks,
Victor
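
For reference, the ordering being mirrored looks roughly like this (a
sketch of amdgpu_device_lock_adev() from this era, details elided):

static bool amdgpu_device_lock_adev(struct amdgpu_device *adev,
				    struct amdgpu_hive_info *hive)
{
	/* set the flag first, so a concurrent reset attempt backs off... */
	if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
		return false;

	/* ...then take the lock, possibly blocking on readers */
	down_write(&adev->reset_sem);

	/* per-device reset bookkeeping elided */
	return true;
}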

-Original Message-
From: Liu, Shaoyun  
Sent: Thursday, December 9, 2021 1:25 PM
To: Skvortsov, Victor ; amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

[AMD Official Use Only]

I think it's a good catch for reset_sem. Any reason to change
adev->in_gpu_reset?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Victor 
Skvortsov
Sent: Thursday, December 9, 2021 12:02 PM
To: amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

Host initiated VF FLR may fail if someone else is already holding a read_lock. 
Change from down_write_trylock to down_write to guarantee the reset goes 
through.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +++--  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index cd2719bc0139..e4365c97adaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -252,11 +252,12 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 2bc93808469a..1cde70c72e54 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -281,11 +281,12 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
--
2.25.1


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Felix Kuehling
On 2021-12-09 at 10:30 a.m., Christian König wrote:
> That still won't work.
>
> But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.

Regards,
  Felix


>
> Regards,
> Christian.
>
> On 09.12.21 at 16:29, Bhardwaj, Rajneesh wrote:
>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
>> you!
>>
>> On 12/9/2021 10:27 AM, Christian König wrote:
>>> Hi Rajneesh,
>>>
>>> yes, separating this from the drm_gem_mmap_obj() change is certainly
>>> a good idea.
>>>
 The child cannot access the BOs mapped by the parent anyway with
 access restrictions applied
>>>
>>> exactly that is not correct. That behavior is actively used by some
>>> userspace stacks as far as I know.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 09.12.21 at 16:23, Bhardwaj, Rajneesh wrote:
 Thanks Christian. Would it make it less intrusive if I just use the
 flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
 this patch? For our use case, just the ttm_bo_mmap_obj change
 should suffice and we don't want to put any more workarounds in
 the user space (thunk, in our case).

 The child cannot access the BOs mapped by the parent anyway with
 access restrictions applied so I wonder why even inherit the vma?

 On 12/9/2021 2:54 AM, Christian König wrote:
> On 08.12.21 at 21:53, Rajneesh Bhardwaj wrote:
>> When an application having open file access to a node forks, its
>> shared
>> mappings also get reflected in the address space of child process
>> even
>> though it cannot access them with the object permissions applied.
>> With the
>> existing permission checks on the gem objects, it might be
>> reasonable to
>> also create the VMAs with VM_DONTCOPY flag so a user space
>> application
>> doesn't need to explicitly call the madvise(addr, len,
>> MADV_DONTFORK)
>> system call to prevent the pages in the mapped range to appear in
>> the
>> address space of the child process. It also prevents the memory
>> leaks
>> due to additional reference counts on the mapped BOs in the child
>> process that prevented freeing the memory in the parent for which
>> we had
>> worked around earlier in the user space inside the thunk library.
>>
>> Additionally, we faced this issue when using CRIU to checkpoint
>> restore
>> an application that had such inherited mappings in the child which
>> confuse CRIU when it mmaps on restore. Having this flag set for the
>> render node VMAs helps. VMAs mapped via KFD already take care of
>> this so
>> this is needed only for the render nodes.
>
> Unfortunately that is most likely a NAK. We already tried
> something similar.
>
> While it is illegal by the OpenGL specification and doesn't work
> for most userspace stacks, we do have some implementations which
> call fork() with a GL context open and expect it to work.
>
> Regards,
> Christian.
>
>>
>> Cc: Felix Kuehling 
>>
>> Signed-off-by: David Yat Sin 
>> Signed-off-by: Rajneesh Bhardwaj 
>> ---
>>   drivers/gpu/drm/drm_gem.c   | 3 ++-
>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
>>   2 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
>> index 09c820045859..d9c4149f36dd 100644
>> --- a/drivers/gpu/drm/drm_gem.c
>> +++ b/drivers/gpu/drm/drm_gem.c
>> @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object
>> *obj, unsigned long obj_size,
>>   goto err_drm_gem_object_put;
>>   }
>>   -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND |
>> VM_DONTDUMP;
>> +    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
>> +    | VM_DONTDUMP | VM_DONTCOPY;
>>   vma->vm_page_prot =
>> pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
>>   vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>>   }
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 33680c94127c..420a4898fdd2 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct
>> *vma, struct ttm_buffer_object *bo)
>>     vma->vm_private_data = bo;
>>   -    vma->vm_flags |= VM_PFNMAP;
>> +    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
>>   vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>>   return 0;
>>   }
>
>>>
>


RE: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

2021-12-09 Thread Liu, Shaoyun
[AMD Official Use Only]

I think it's a good catch for reset_sem. Any reason to change
adev->in_gpu_reset?

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Victor 
Skvortsov
Sent: Thursday, December 9, 2021 12:02 PM
To: amd-gfx@lists.freedesktop.org
Cc: Skvortsov, Victor 
Subject: [PATCH] drm/amdgpu: SRIOV flr_work should use down_write

Host initiated VF FLR may fail if someone else is already holding a read_lock. 
Change from down_write_trylock to down_write to guarantee the reset goes 
through.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +++--  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index cd2719bc0139..e4365c97adaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -252,11 +252,12 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 2bc93808469a..1cde70c72e54 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -281,11 +281,12 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
--
2.25.1


Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Alex Deucher
On Thu, Dec 9, 2021 at 1:00 PM Guilherme G. Piccoli  wrote:
>
> On 09/12/2021 14:31, Alex Deucher wrote:
> > [...]
> > Once the driver takes over, none of the pre-driver state is retained.
> > You'll need to load the driver in the new kernel to initialize the
> > displays.  Note the efifb doesn't actually have the ability to program
> > any hardware, it just takes over the memory region that was used for
> > the pre-OS framebuffer and whatever display timing was set up by the
> > GOP driver prior to the OS loading.  Once that OS driver has loaded
> > the area is gone and the display configuration may have changed.
> >
>
> Hi Christian and Alex, thanks for the clarifications!
>
> Is there any way to save/retain this state before amdgpu takes over?

Not really in a generic way.  It's asic and platform specific.  In
addition most modern displays require link training to bring up the
display, so you can't just save and restore registers.

> Would simpledrm be able to program the device again, to a working state?

No.  You need an asic specific driver that knows how to program the
specific hardware.  It's also platform specific in that you need to
determine platform specific details such as the number and type of
display connectors and encoders that are present on the system.

>
> Finally, do you have any example of such a GOP driver (open source) so I
> can take a look? I tried to find something like that in Tianocore
> project, but didn't find anything that seemed useful for my issue.

The drivers are asic and platform specific.  E.g., the driver for
vangogh is different from renoir is different from skylake, etc.  The
display programming interfaces are asic specific.

Alex


Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Guilherme G. Piccoli
On 09/12/2021 14:31, Alex Deucher wrote:
> [...] 
> Once the driver takes over, none of the pre-driver state is retained.
> You'll need to load the driver in the new kernel to initialize the
> displays.  Note the efifb doesn't actually have the ability to program
> any hardware, it just takes over the memory region that was used for
> the pre-OS framebuffer and whatever display timing was set up by the
> GOP driver prior to the OS loading.  Once that OS driver has loaded
> the area is gone and the display configuration may have changed.
> 

Hi Christian and Alex, thanks for the clarifications!

Is there any way to save/retain this state before amdgpu takes over?
Would simpledrm be able to program the device again, to a working state?

Finally, do you have any example of such a GOP driver (open source) so I
can take a look? I tried to find something like that in Tianocore
project, but didn't find anything that seemed useful for my issue.

Thanks again!


Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Alex Deucher
On Thu, Dec 9, 2021 at 12:04 PM Guilherme G. Piccoli
 wrote:
>
> Hi all, I have a question about the possibility of reusing a framebuffer
> after a regular (or panic) kexec - my case is with amdgpu (APU, aka, not
> a separate GPU hardware), but I guess the question is kinda generic
> hence I've looped in most of the lists / people I think it makes sense to
> include (apologies for duplicates).
>
>
> The context is: we have a hardware that has an amdgpu-controlled device
> (Vangogh model) and as soon as the machine boots, efifb is providing
> graphics - I understand the UEFI/GRUB outputs rely in EFI framebuffer as
> well. As soon amdgpu module is available, kernel loads it and it takes
> over the GPU, providing graphics. The kexec_file_load syscall allows to
> pass a valid screen_info structure, so by kexec'ing a new kernel, we
> have again efifb taking over on boot time, but this time I see nothing
> in the screen. I've manually blacklisted amdgpu in this new kexec'ed
> kernel, I'd like to rely on the simple framebuffer - the goal is to have
> a tiny kernel kexec'ed. I'm using kernel version 5.16.0-rc4.
>
> I've done some other experiments, for example: I've forced screen_info
> model to match VLFB, so vesafb took over after the kexec, with the same
> result. Also noticed that BusMaster bit was off after kexec, in the AMD
> APU PCIe device, so I've set it on efifb before probe, and finally
> tested the same things in qemu, with qxl, all with the same result
> (blank screen).
> The most interesting result I got (both with amdgpu and qemu/qxl) is
> that if I blacklist these drivers and let the machine continue using
> efifb since the beginning, after kexec the efifb is still able to
> produce graphics.
>
> Which then led me to think that likely there's something fundamentally
> "blocking" the reuse of the simple framebuffer after kexec, like maybe
> DRM stack is destroying the old framebuffer somehow? What kind of
> preparation is required at firmware level to make the simple EFI VGA
> framebuffer work, and could we perform this in a kexec (or "save it"
> before the amdgpu/qxl drivers take over and reuse later)?
>

Once the driver takes over, none of the pre-driver state is retained.
You'll need to load the driver in the new kernel to initialize the
displays.  Note the efifb doesn't actually have the ability to program
any hardware, it just takes over the memory region that was used for
the pre-OS framebuffer and whatever display timing was set up by the
GOP driver prior to the OS loading.  Once that OS driver has loaded
the area is gone and the display configuration may have changed.

Alex


> Any advice is greatly appreciated!
> Thanks in advance,
>
>
> Guilherme


A Potential Bug in drm/amd/display/dc_link.c

2021-12-09 Thread Yizhuo Zhai
Hi All:
I just found a bug in drm/amd/display/dc_link.c using a static analysis
tool, but I am not sure if this could happen in reality - could you please
advise here? Thanks for your attention : )

In function enable_stream_features(), the variable "old_downspread.raw"
could be uninitialized if core_link_read_dpcd() fails. However, it is used
in the later if statement, and further, core_link_write_dpcd() may write a
random value, which is potentially unsafe. But this function does not
return an error code to its caller, and I got stuck drafting the patch -
could you please advise me here?

The related code:

static void enable_stream_features(struct pipe_ctx *pipe_ctx)
{
	union down_spread_ctrl old_downspread;

	core_link_read_dpcd(link, DP_DOWNSPREAD_CTRL,
			&old_downspread.raw, sizeof(old_downspread));

	// old_downspread.raw used here
	if (new_downspread.raw != old_downspread.raw) {
		core_link_write_dpcd(link, DP_DOWNSPREAD_CTRL,
				&new_downspread.raw, sizeof(new_downspread));
	}
}
enum dc_status core_link_read_dpcd(
	struct dc_link *link,
	uint32_t address,
	uint8_t *data,
	uint32_t size)
{
	// data could be uninitialized if the helper fails and logs
	// some error info
	if (!dm_helpers_dp_read_dpcd(link->ctx,
			link,
			address, data, size))
		return DC_ERROR_UNEXPECTED;

	return DC_OK;
}
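
One minimal way to avoid the uninitialized read - a sketch only, using the
names from the snippet above (link and new_downspread come from the elided
parts of the function), untested against the actual tree - is to zero the
local and skip the write when the read fails:

static void enable_stream_features(struct pipe_ctx *pipe_ctx)
{
	union down_spread_ctrl old_downspread;

	memset(&old_downspread, 0, sizeof(old_downspread));

	/* bail out instead of comparing against uninitialized data */
	if (core_link_read_dpcd(link, DP_DOWNSPREAD_CTRL,
				&old_downspread.raw,
				sizeof(old_downspread)) != DC_OK)
		return;

	if (new_downspread.raw != old_downspread.raw)
		core_link_write_dpcd(link, DP_DOWNSPREAD_CTRL,
				     &new_downspread.raw,
				     sizeof(new_downspread));
}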


-- 
Kind Regards,

Yizhuo Zhai

Computer Science, Graduate Student
University of California, Riverside


Re: [PATCH v10 08/10] dyndbg: add print-to-tracefs, selftest with it - RFC

2021-12-09 Thread Vincent Whitchurch
On Wed, Dec 08, 2021 at 06:16:10AM +0100, jim.cro...@gmail.com wrote:
> are you planning to dust this patchset off and resubmit it ?
> 
> I've been playing with it and learning ftrace (a decade+ late),
> I found your boot-line example very helpful as first steps
> (still haven't even tried the filtering)
> 
> 
> with these adjustments (voiced partly to test my understanding)
> I would support it, and rework my patchset to use it.
> 
> - change flag to -e, good mnemonics for event/trace-event
>   T is good too, but uppercase, no need to go there.

Any flag name works for me.

> - include/trace/events/dyndbg.h - separate file, not mixed with print.h
>   dyndbg class, so trace_event=dyndbg:*
> 
> - 1 event type per pr_debug, dev_dbg, netdev_dbg ? ibdev_dbg ?
>   with the extra args: descriptor that Steven wanted,
>   probably also struct <|net|ib>dev

For my use cases I don't see much value in having separate events for
the different debug functions, but since all of them can be easily
enabled (dyndbg:*, as you noted), that works for me too.

> If youre too busy for a while, I'd eventually take a (slow) run at it.

You're welcome to have a go.  I think you've already rebased the
patchset, but here's a diff top of v5.16-rc4 for reference.  I noticed a
bug inside the CONFIG_JUMP_LABEL handling (also present in the last
version I posted) which should be fixed as part of the diff below (I've
added a comment).  Proper tests for this, like the ones you are adding
in your patchset, would certainly be a good idea.  Thanks.
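
With the patch applied, usage would look roughly like this (a sketch; the
flag letter and event naming were still under discussion at this point):

  # route a callsite's output to the printk:dyndbg trace event ('x' flag)
  echo 'module amdgpu +x' > /sys/kernel/debug/dynamic_debug/control
  # enable the event and watch it
  echo 1 > /sys/kernel/debug/tracing/events/printk/dyndbg/enable
  cat /sys/kernel/debug/tracing/trace_pipe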

8<-
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst 
b/Documentation/admin-guide/dynamic-debug-howto.rst
index a89cfa083155..b9c4e808befc 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -228,6 +228,7 @@ of the characters::
 The flags are::
 
  p    enables the pr_debug() callsite.
+  x    enables trace to the printk:dyndbg event
  f    Include the function name in the printed message
  l    Include line number in the printed message
  m    Include module name in the printed message
diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h
index dce631e678dd..bc21bfb0fdc6 100644
--- a/include/linux/dynamic_debug.h
+++ b/include/linux/dynamic_debug.h
@@ -27,7 +27,7 @@ struct _ddebug {
 * writes commands to /dynamic_debug/control
 */
 #define _DPRINTK_FLAGS_NONE0
-#define _DPRINTK_FLAGS_PRINT   (1<<0) /* printk() a message using the format */
+#define _DPRINTK_FLAGS_PRINTK  (1<<0) /* printk() a message using the format */
 #define _DPRINTK_FLAGS_INCL_MODNAME(1<<1)
 #define _DPRINTK_FLAGS_INCL_FUNCNAME   (1<<2)
 #define _DPRINTK_FLAGS_INCL_LINENO (1<<3)
@@ -37,8 +37,11 @@ struct _ddebug {
(_DPRINTK_FLAGS_INCL_MODNAME | _DPRINTK_FLAGS_INCL_FUNCNAME |\
 _DPRINTK_FLAGS_INCL_LINENO  | _DPRINTK_FLAGS_INCL_TID)
 
+#define _DPRINTK_FLAGS_TRACE   (1<<5)
+#define _DPRINTK_FLAGS_ENABLE  (_DPRINTK_FLAGS_PRINTK | \
+_DPRINTK_FLAGS_TRACE)
 #if defined DEBUG
-#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT
+#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINTK
 #else
 #define _DPRINTK_FLAGS_DEFAULT 0
 #endif
@@ -120,10 +123,10 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor,
 
 #ifdef DEBUG
 #define DYNAMIC_DEBUG_BRANCH(descriptor) \
-   likely(descriptor.flags & _DPRINTK_FLAGS_PRINT)
+   likely(descriptor.flags & _DPRINTK_FLAGS_ENABLE)
 #else
 #define DYNAMIC_DEBUG_BRANCH(descriptor) \
-   unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)
+   unlikely(descriptor.flags & _DPRINTK_FLAGS_ENABLE)
 #endif
 
 #endif /* CONFIG_JUMP_LABEL */
diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h
index 13d405b2fd8b..1f78bd237a91 100644
--- a/include/trace/events/printk.h
+++ b/include/trace/events/printk.h
@@ -7,7 +7,7 @@
 
 #include <linux/tracepoint.h>
 
-TRACE_EVENT(console,
+DECLARE_EVENT_CLASS(printk,
TP_PROTO(const char *text, size_t len),
 
TP_ARGS(text, len),
@@ -31,6 +31,16 @@ TRACE_EVENT(console,
 
TP_printk("%s", __get_str(msg))
 );
+
+DEFINE_EVENT(printk, console,
+   TP_PROTO(const char *text, size_t len),
+   TP_ARGS(text, len)
+);
+
+DEFINE_EVENT(printk, dyndbg,
+   TP_PROTO(const char *text, size_t len),
+   TP_ARGS(text, len)
+);
 #endif /* _TRACE_PRINTK_H */
 
 /* This part must be outside protection */
diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c
index dd7f56af9aed..161454fa0af8 100644
--- a/lib/dynamic_debug.c
+++ b/lib/dynamic_debug.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include <trace/events/printk.h>
 
 #include 
 
@@ -86,11 +87,12 @@ static inline const char *trim_prefix(const char *path)
 }
 
 static struct { unsigned flag:8; char opt_char; } opt_array[] = {
-   { _DPRINTK_FLAGS_PRINT, 'p' },
+   { _DPRINTK_FLAGS_PRINTK, 'p' },
{ _DPRINTK_FLAGS_INCL_MODNAME, 'm' },
{ 

Re: [BUG] gpu: drm: amd: amdgpu: possible ABBA deadlock in amdgpu_set_power_dpm_force_performance_level() and amdgpu_debugfs_process_reg_op()

2021-12-09 Thread Jia-Ju Bai

Hello,

Could you please provide feedback on my previous report?
Thanks a lot :)


Best wishes,
Jia-Ju Bai

On 2021/9/15 17:39, Jia-Ju Bai wrote:

Hello,

My static analysis tool reports a possible ABBA deadlock in the amdgpu 
driver in Linux 5.10:


amdgpu_debugfs_process_reg_op()
  mutex_lock(&adev->grbm_idx_mutex); --> Line 250 (Lock A)
  mutex_lock(&adev->pm.mutex); --> Line 259 (Lock B)

amdgpu_set_power_dpm_force_performance_level()
  mutex_lock(&adev->pm.mutex); --> Line 381 (Lock B)
    pp_dpm_force_performance_level() --> function pointer via 
"amdgpu_dpm_force_performance_level()"

  pp_dpm_en_umd_pstate()
    amdgpu_device_ip_set_clockgating_state()
  gfx_v7_0_set_clockgating_state() --> function pointer via 
"funcs->set_clockgating_state()"

    gfx_v7_0_enable_mgcg()
  mutex_lock(&adev->grbm_idx_mutex); --> Line 3646 (Lock A)
  mutex_lock(&adev->grbm_idx_mutex); --> Line 3697 (Lock A)

When amdgpu_debugfs_process_reg_op() and 
amdgpu_set_power_dpm_force_performance_level() are concurrently 
executed, the deadlock can occur.


I am not quite sure whether this possible deadlock is real and how to 
fix it if it is real.

Any feedback would be appreciated, thanks :)

Reported-by: TOTE Robot 


Best wishes,
Jia-Ju Bai




[PATCH] drm/amdgpu: Fix reference leak in psp_xgmi_reflect_topology_info()

2021-12-09 Thread Jianglei Nie
In line 1138 (#1), amdgpu_get_xgmi_hive() increases the kobject reference
counter of the hive it returns. The hive returned by
amdgpu_get_xgmi_hive() should be released with the help of
amdgpu_put_xgmi_hive() to balance its kobject reference counter properly.
Forgetting the amdgpu_put_xgmi_hive() operation will result in a reference
leak.

We can fix it by calling amdgpu_put_xgmi_hive() before the end of the
function (#2).

1128 static void psp_xgmi_reflect_topology_info(struct psp_context *psp,
1129				struct psp_xgmi_node_info node_info)
1130 {

1138	hive = amdgpu_get_xgmi_hive(psp->adev);
	// #1: kzalloc space reference increment
1139	list_for_each_entry(mirror_adev, &hive->device_list, gmc.xgmi.head) {
1140		struct psp_xgmi_topology_info *mirror_top_info;
1141		int j;

1143		if (mirror_adev->gmc.xgmi.node_id != dst_node_id)
1144			continue;

1146		mirror_top_info = &mirror_adev->psp.xgmi_context.top_info;
1147		for (j = 0; j < mirror_top_info->num_nodes; j++) {
1148			if (mirror_top_info->nodes[j].node_id != src_node_id)
1149				continue;

1151			mirror_top_info->nodes[j].num_hops = dst_num_hops;

1157			if (dst_num_links)
1158				mirror_top_info->nodes[j].num_links = dst_num_links;

1160			break;
1161		}

1163		break;
1164	}
	// #2: missing reference decrement
1165 }

Signed-off-by: Jianglei Nie 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index c641f84649d6..f6362047ed71 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1162,6 +1162,7 @@ static void psp_xgmi_reflect_topology_info(struct 
psp_context *psp,
 
break;
}
+   amdgpu_put_xgmi_hive(hive);
 }
 
 int psp_xgmi_get_topology_info(struct psp_context *psp,
-- 
2.25.1
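
The general shape of the balanced pattern the fix restores (a sketch, not
the full function):

	hive = amdgpu_get_xgmi_hive(psp->adev);	/* takes a kobject reference */
	if (!hive)
		return;

	list_for_each_entry(mirror_adev, &hive->device_list, gmc.xgmi.head) {
		/* ... reflect topology info into the mirror device ... */
	}

	amdgpu_put_xgmi_hive(hive);	/* drop the reference on every path */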



Potential Bug in drm/amd/display/dc_link

2021-12-09 Thread Yizhuo Zhai
Hi All:
I just found a bug in drm/amd/display/dc_link.c using a static analysis
tool, but I am not sure if this could happen in reality - could you
please advise me here? Thanks for your attention : ) And please ignore
the last one, in HTML format, if you did not filter it out.

In function enable_stream_features(), the variable
"old_downspread.raw" could be uninitialized if core_link_read_dpcd()
fails. However, it is used in the later if statement, and further,
core_link_write_dpcd() may write a random value, which is potentially
unsafe. But this function does not return an error code to its caller,
and I got stuck drafting the patch - could you please advise
me here?

The related code:
static void enable_stream_features(struct pipe_ctx *pipe_ctx)
{
	union down_spread_ctrl old_downspread;

	core_link_read_dpcd(link, DP_DOWNSPREAD_CTRL,
			&old_downspread.raw, sizeof(old_downspread));

	// old_downspread.raw used here
	if (new_downspread.raw != old_downspread.raw) {
		core_link_write_dpcd(link, DP_DOWNSPREAD_CTRL,
				&new_downspread.raw, sizeof(new_downspread));
	}
}
enum dc_status core_link_read_dpcd(
	struct dc_link *link,
	uint32_t address,
	uint8_t *data,
	uint32_t size)
{
	// data could be uninitialized if the helper fails and logs
	// some error info
	if (!dm_helpers_dp_read_dpcd(link->ctx,
			link, address, data, size))
		return DC_ERROR_UNEXPECTED;
	return DC_OK;
}

The same issue exists in function wait_for_training_aux_rd_interval() in
drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c.
-- 
Kind Regards,

Yizhuo Zhai

Computer Science, Graduate Student
University of California, Riverside


Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Guilherme G. Piccoli
Hi all, I have a question about the possibility of reusing a framebuffer
after a regular (or panic) kexec - my case is with amdgpu (APU, aka, not
a separate GPU hardware), but I guess the question is kinda generic
hence I've looped in most of the lists / people I think it makes sense to
include (apologies for duplicates).


The context is: we have a hardware that has an amdgpu-controlled device
(Vangogh model) and as soon as the machine boots, efifb is providing
graphics - I understand the UEFI/GRUB outputs rely on the EFI framebuffer as
well. As soon amdgpu module is available, kernel loads it and it takes
over the GPU, providing graphics. The kexec_file_load syscall allows to
pass a valid screen_info structure, so by kexec'ing a new kernel, we
have again efifb taking over on boot time, but this time I see nothing
in the screen. I've manually blacklisted amdgpu in this new kexec'ed
kernel, I'd like to rely on the simple framebuffer - the goal is to have
a tiny kernel kexec'ed. I'm using kernel version 5.16.0-rc4.

I've done some other experiments, for example: I've forced screen_info
model to match VLFB, so vesafb took over after the kexec, with the same
result. Also noticed that BusMaster bit was off after kexec, in the AMD
APU PCIe device, so I've set it on efifb before probe, and finally
tested the same things in qemu, with qxl, all with the same result
(blank screen).
The most interesting result I got (both with amdgpu and qemu/qxl) is
that if I blacklist these drivers and let the machine continue using
efifb since the beginning, after kexec the efifb is still able to
produce graphics.

Which then led me to think that likely there's something fundamentally
"blocking" the reuse of the simple framebuffer after kexec, like maybe
DRM stack is destroying the old framebuffer somehow? What kind of
preparation is required at firmware level to make the simple EFI VGA
framebuffer work, and could we perform this in a kexec (or "save it"
before the amdgpu/qxl drivers take over and reuse later)?

Any advice is greatly appreciated!
Thanks in advance,


Guilherme


Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Jason Gunthorpe
On Thu, Dec 09, 2021 at 12:45:24PM +1100, Alistair Popple wrote:
> On Thursday, 9 December 2021 12:53:45 AM AEDT Jason Gunthorpe wrote:
> > > I think a similar problem exists for device private fault handling as 
> > > well and
> > > it has been on my list of things to fix for a while. I think the solution 
> > > is to
> > > call try_get_page(), except it doesn't work with device pages due to the 
> > > whole
> > > refcount thing. That issue is blocking a fair bit of work now so I've 
> > > started
> > > looking into it.
> > 
> > Where is this?
>  
> Nothing posted yet. I've been going through the mailing list and the old
> thread[1] to get an understanding of what is left to do. If you have any
> suggestions they would be welcome.

Oh, that

Joao's series here is the first step:

https://lore.kernel.org/linux-mm/20211202204422.26777-1-joao.m.mart...@oracle.com/

I already sent a patch to remove the DRM usage of PUD/PMD -
0d979509539e ("drm/ttm: remove ttm_bo_vm_insert_huge()")

Next, someone needs to change FSDAX to have a folio covering the
ZONE_DEVICE pages before it installs a PUD or PMD. I don't know
anything about FS's to know how to do this at all.

Thus all PUD/PMD entries will point at a head page or larger of a
compound. This is important because all the existing machinery for THP
assumes 1 PUD/PMD means 1 struct page to manipulate.

Then, consolidate all the duplicated code that runs when a page is
removed from a PTE/PMD/PUD etc into a function. Figure out why the
duplications are different to make them the same (I have some rough
patches for this step)

Start with PUD and have zap on PUD call the consolidated function and
make vmf_insert_pfn_pud_prot() accept a struct page not pfn and incr
the refcount. PUD is easy because there is no THP

Then do the same to PMD without breaking the THP code

Then make the PTE also incr the refcount on insert and zap

Exterminate vma_is_special_huge() along the way, there is no such
thing as a special huge VMA without a pud/pmd_special flag so all
things installed here must be struct page and not special.

Then the patches that are already posted are applicable and we can
kill the refcount == 1 stuff. No 0 ref count pages installed in page
tables.

Once all of that is done it is fairly straightforward to remove
pud/pmd/pte_devmap entirely and the pgmap stuff from gup.c

Jason


Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Alistair Popple
On Thursday, 9 December 2021 12:53:45 AM AEDT Jason Gunthorpe wrote:
> > I think a similar problem exists for device private fault handling as well 
> > and
> > it has been on my list of things to fix for a while. I think the solution 
> > is to
> > call try_get_page(), except it doesn't work with device pages due to the 
> > whole
> > refcount thing. That issue is blocking a fair bit of work now so I've 
> > started
> > looking into it.
> 
> Where is this?
 
Nothing posted yet. I've been going through the mailing list and the old
thread[1] to get an understanding of what is left to do. If you have any
suggestions they would be welcome.

[1] https://lore.kernel.org/all/20211014153928.16805-3-alex.sie...@amd.com/





[PATCH] drm:amdgpu:remove unneeded variable

2021-12-09 Thread cgel . zte
From: chiminghao 

Return the value directly instead of
storing it in a redundant intermediate variable.

Reported-by: Zeal Robot 
Signed-off-by: chiminghao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c | 5 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 6 ++
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c
index 5cf142e849bb..fb92f827eeb7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c
@@ -37,12 +37,9 @@
 long amdgpu_kms_compat_ioctl(struct file *filp, unsigned int cmd, unsigned 
long arg)
 {
unsigned int nr = DRM_IOCTL_NR(cmd);
-   int ret;
 
if (nr < DRM_COMMAND_BASE)
return drm_compat_ioctl(filp, cmd, arg);
 
-   ret = amdgpu_drm_ioctl(filp, cmd, arg);
-
-   return ret;
+   return amdgpu_drm_ioctl(filp, cmd, arg);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index cb82404df534..269a7b04b7e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1742,7 +1742,7 @@ static int gmc_v9_0_hw_init(void *handle)
 {
struct amdgpu_device *adev = (struct amdgpu_device *)handle;
bool value;
-   int r, i;
+   int i;
 
/* The sequence of these two function calls matters.*/
gmc_v9_0_init_golden_registers(adev);
@@ -1777,9 +1777,7 @@ static int gmc_v9_0_hw_init(void *handle)
if (adev->umc.funcs && adev->umc.funcs->init_registers)
adev->umc.funcs->init_registers(adev);
 
-   r = gmc_v9_0_gart_enable(adev);
-
-   return r;
+   return gmc_v9_0_gart_enable(adev);
 }
 
 /**
-- 
2.25.1



Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Alistair Popple
On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) 
wrote:
> 
> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
> > On 2021-12-08 at 11:58 a.m., Felix Kuehling wrote:
> >> On 2021-12-08 at 6:31 a.m., Alistair Popple wrote:
> >>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
>  Avoid long term pinning for Coherent device type pages. This could
>  interfere with their own device memory manager.
>  If caller tries to get user device coherent pages with PIN_LONGTERM flag
>  set, those pages will be migrated back to system memory.
> 
>  Signed-off-by: Alex Sierra 
>  ---
>    mm/gup.c | 32 ++--
>    1 file changed, 30 insertions(+), 2 deletions(-)
> 
>  diff --git a/mm/gup.c b/mm/gup.c
>  index 886d6148d3d0..1572eacf07f4 100644
>  --- a/mm/gup.c
>  +++ b/mm/gup.c
>  @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>    #endif /* CONFIG_ELF_CORE */
>    
>    #ifdef CONFIG_MIGRATION
>  +static int migrate_device_page(unsigned long address,
>  +struct page *page)
>  +{
>  +struct vm_area_struct *vma = find_vma(current->mm, address);
>  +struct vm_fault vmf = {
>  +.vma = vma,
>  +.address = address & PAGE_MASK,
>  +.flags = FAULT_FLAG_USER,
>  +.pgoff = linear_page_index(vma, address),
>  +.gfp_mask = GFP_KERNEL,
>  +.page = page,
>  +};
>  +if (page->pgmap && page->pgmap->ops->migrate_to_ram)
>  +return page->pgmap->ops->migrate_to_ram(&vmf);
> >>> How does this synchronise against pgmap being released? As I understand 
> >>> things
> >>> at this point we're not holding a reference on either the page or pgmap, 
> >>> so
> >>> the page and therefore the pgmap may have been freed.
> >>>
> >>> I think a similar problem exists for device private fault handling as 
> >>> well and
> >>> it has been on my list of things to fix for a while. I think the solution 
> >>> is to
> >>> call try_get_page(), except it doesn't work with device pages due to the 
> >>> whole
> >>> refcount thing. That issue is blocking a fair bit of work now so I've 
> >>> started
> >>> looking into it.
> >> At least the page should have been pinned by the __get_user_pages_locked
> >> call in __gup_longterm_locked. That refcount is dropped in
> >> check_and_migrate_movable_pages when it returns 0 or an error.
> > Never mind. We unpin the pages first. Alex, would the migration work if
> > we unpinned them afterwards? Also, the normal CPU page fault code path
> > seems to make sure the page is locked (check in pfn_swap_entry_to_page)
> > before calling migrate_to_ram.

I don't think that's true. The check in pfn_swap_entry_to_page() is only for
migration entries:

BUG_ON(is_migration_entry(entry) && !PageLocked(p));

As this is coherent memory though why do we have to call into a device driver
to do the migration? Couldn't this all be done in the kernel?

> No, you cannot unpin after migration, due to the expected_count vs
> page_count condition at migrate_page_move_mapping(), during the migrate_page() call.
> 
> Regards,
> Alex Sierra
> 
> > Regards,
> >Felix
> >
> >
> 






[PATCH] drm/amdgpu: SRIOV flr_work should use down_write

2021-12-09 Thread Victor Skvortsov
Host initiated VF FLR may fail if someone else
is already holding a read_lock. Change from
down_write_trylock to down_write to guarantee
the reset goes through.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +++--
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index cd2719bc0139..e4365c97adaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -252,11 +252,12 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 2bc93808469a..1cde70c72e54 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -281,11 +281,12 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 * otherwise the mailbox msg will be ruined/reseted by
 * the VF FLR.
 */
-   if (!down_write_trylock(&adev->reset_sem))
+   if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return;
 
+   down_write(&adev->reset_sem);
+
amdgpu_virt_fini_vf2pf_work_item(adev);
-   atomic_set(&adev->in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
-- 
2.25.1



Re: [PATCH v2 09/10] drm/amdgpu: remove unnecessary variables

2021-12-09 Thread Felix Kuehling
On 2021-12-09 at 10:47 a.m., Isabella Basso wrote:
> This fixes the warnings below, and also drops the display_count
> variable, as it's unused.
>
>  In function 'svm_range_map_to_gpu':
>  warning: variable 'bo_va' set but not used [-Wunused-but-set-variable]
>  1172 | struct amdgpu_bo_va bo_va;
>   | ^
>  ...
>  In function 'dcn201_update_clocks':
>  warning: variable 'enter_display_off' set but not used 
> [-Wunused-but-set-variable]
>  132 | bool enter_display_off = false;
>  |  ^
>
> Changes since v1:
> - As suggested by Rodrigo Siqueira:
>   1. Drop display_count variable.
> - As suggested by Felix Kuehling:
>   1. Remove block surrounding amdgpu_xgmi_same_hive.
>
> Signed-off-by: Isabella Basso 

The kfd_svm.c portion is

Reviewed-by: Felix Kuehling 

Thank you!


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c   | 4 
>  .../gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c | 7 +--
>  2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 82cb45e30197..835f202dc23d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1169,7 +1169,6 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct 
> amdgpu_vm *vm,
>unsigned long npages, bool readonly, dma_addr_t *dma_addr,
>struct amdgpu_device *bo_adev, struct dma_fence **fence)
>  {
> - struct amdgpu_bo_va bo_va;
>   bool table_freed = false;
>   uint64_t pte_flags;
>   unsigned long last_start;
> @@ -1182,9 +1181,6 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct 
> amdgpu_vm *vm,
>   pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
>last_start, last_start + npages - 1, readonly);
>  
> - if (prange->svm_bo && prange->ttm_res)
> - bo_va.is_xgmi = amdgpu_xgmi_same_hive(adev, bo_adev);
> -
>   for (i = offset; i < offset + npages; i++) {
>   last_domain = dma_addr[i] & SVM_RANGE_VRAM_DOMAIN;
>   dma_addr[i] &= ~SVM_RANGE_VRAM_DOMAIN;
> diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c 
> b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c
> index 8c20a0fb1e4f..fbdd0a92d146 100644
> --- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c
> +++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c
> @@ -90,10 +90,8 @@ static void dcn201_update_clocks(struct clk_mgr 
> *clk_mgr_base,
>   struct clk_mgr_internal *clk_mgr = TO_CLK_MGR_INTERNAL(clk_mgr_base);
>   struct dc_clocks *new_clocks = &context->bw_ctx.bw.dcn.clk;
>   struct dc *dc = clk_mgr_base->ctx->dc;
> - int display_count;
>   bool update_dppclk = false;
>   bool update_dispclk = false;
> - bool enter_display_off = false;
>   bool dpp_clock_lowered = false;
>   bool force_reset = false;
>   bool p_state_change_support;
> @@ -109,10 +107,7 @@ static void dcn201_update_clocks(struct clk_mgr 
> *clk_mgr_base,
>   dcn2_read_clocks_from_hw_dentist(clk_mgr_base);
>   }
>  
> - display_count = clk_mgr_helper_get_active_display_cnt(dc, context);
> -
> - if (display_count == 0)
> - enter_display_off = true;
> + clk_mgr_helper_get_active_display_cnt(dc, context);
>  
>   if (should_set_clock(safe_to_lower, new_clocks->phyclk_khz, 
> clk_mgr_base->clks.phyclk_khz))
>   clk_mgr_base->clks.phyclk_khz = new_clocks->phyclk_khz;


[PATCH 1/2] drm/amdgpu: Separate vf2pf work item init from virt data exchange

2021-12-09 Thread Victor Skvortsov
We want to be able to call virt data exchange conditionally
after gmc sw init to reserve bad pages as early as possible.
Since this is a conditional call, we will need to call
it again unconditionally later in the init sequence.

Refactor the data exchange function so it can be
called multiple times without re-initializing the work item.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 20 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 42 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  5 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  2 +-
 5 files changed, 45 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ce9bdef185c0..3992c4086d26 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2181,7 +2181,7 @@ static int amdgpu_device_ip_early_init(struct 
amdgpu_device *adev)
 
/*get pf2vf msg info at it's earliest time*/
if (amdgpu_sriov_vf(adev))
-   amdgpu_virt_init_data_exchange(adev);
+   amdgpu_virt_exchange_data(adev);
 
}
}
@@ -2345,8 +2345,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
}
}
 
-   if (amdgpu_sriov_vf(adev))
-   amdgpu_virt_init_data_exchange(adev);
+   if (amdgpu_sriov_vf(adev)) {
+   amdgpu_virt_exchange_data(adev);
+   amdgpu_virt_init_vf2pf_work_item(adev);
+   }
 
r = amdgpu_ib_pool_init(adev);
if (r) {
@@ -2949,7 +2951,7 @@ int amdgpu_device_ip_suspend(struct amdgpu_device *adev)
int r;
 
if (amdgpu_sriov_vf(adev)) {
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
amdgpu_virt_request_full_gpu(adev, false);
}
 
@@ -3839,7 +3841,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 * */
if (amdgpu_sriov_vf(adev)) {
amdgpu_virt_request_full_gpu(adev, false);
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
}
 
/* disable all interrupts */
@@ -4317,7 +4319,9 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,
if (r)
goto error;
 
-   amdgpu_virt_init_data_exchange(adev);
+   amdgpu_virt_exchange_data(adev);
+   amdgpu_virt_init_vf2pf_work_item(adev);
+
/* we need recover gart prior to run SMC/CP/SDMA resume */
 	amdgpu_gtt_mgr_recover(ttm_manager_type(&adev->mman.bdev, TTM_PL_TT));
 
@@ -4495,7 +4499,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
 
if (amdgpu_sriov_vf(adev)) {
/* stop the data exchange thread */
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
}
 
/* block all schedulers and reset given job's ring */
@@ -4898,7 +4902,7 @@ static void amdgpu_device_recheck_guilty_jobs(
 retry:
/* do hw reset */
if (amdgpu_sriov_vf(adev)) {
-   amdgpu_virt_fini_data_exchange(adev);
+   amdgpu_virt_fini_vf2pf_work_item(adev);
r = amdgpu_device_reset_sriov(adev, false);
if (r)
adev->asic_reset_res = r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3fc49823f527..b6e3d379a86a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -611,16 +611,7 @@ static void amdgpu_virt_update_vf2pf_work_item(struct 
work_struct *work)
schedule_delayed_work(&(adev->virt.vf2pf_work), 
adev->virt.vf2pf_update_interval_ms);
 }
 
-void amdgpu_virt_fini_data_exchange(struct amdgpu_device *adev)
-{
-   if (adev->virt.vf2pf_update_interval_ms != 0) {
-   DRM_INFO("clean up the vf2pf work item\n");
-   cancel_delayed_work_sync(&adev->virt.vf2pf_work);
-   adev->virt.vf2pf_update_interval_ms = 0;
-   }
-}
-
-void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)
+void amdgpu_virt_exchange_data(struct amdgpu_device *adev)
 {
uint64_t bp_block_offset = 0;
uint32_t bp_block_size = 0;
@@ -628,11 +619,8 @@ void amdgpu_virt_init_data_exchange(struct amdgpu_device 
*adev)
 
adev->virt.fw_reserve.p_pf2vf = NULL;
adev->virt.fw_reserve.p_vf2pf = NULL;
-   adev->virt.vf2pf_update_interval_ms = 0;
 
if (adev->mman.fw_vram_usage_va != NULL) {
-   adev->virt.vf2pf_update_interval_ms = 2000;
-
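
The truncated hunk above does not show the new helper itself. As a hedged
sketch only: based on the removed lines and the new call sites, the
split-out work-item init presumably looks something like this (the body
here is an assumption, not the actual patch content):

void amdgpu_virt_init_vf2pf_work_item(struct amdgpu_device *adev)
{
	/* Re-arm only the periodic vf2pf update; the data exchange is
	 * now a separate step that can run multiple times. */
	if (adev->mman.fw_vram_usage_va != NULL) {
		adev->virt.vf2pf_update_interval_ms = 2000;
		INIT_DELAYED_WORK(&adev->virt.vf2pf_work,
				  amdgpu_virt_update_vf2pf_work_item);
		schedule_delayed_work(&adev->virt.vf2pf_work,
				      adev->virt.vf2pf_update_interval_ms);
	}
}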

[PATCH 2/2] drm/amd/display: Reduce stack size for dml31 UseMinimumDCFCLK

2021-12-09 Thread Michel Dänzer
From: Michel Dänzer 

Use the struct display_mode_lib pointer instead of passing lots of large
arrays as parameters by value.

Addresses this warning (resulting in failure to build a RHEL debug kernel
with Werror enabled):

../drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c: In 
function ‘UseMinimumDCFCLK’:
../drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:7478:1:
 warning: the frame size of 2128 bytes is larger than 2048 bytes 
[-Wframe-larger-than=]

NOTE: AFAICT this function previously had no observable effect, since it
only modified parameters passed by value and doesn't return anything.
Now it may modify some values in struct display_mode_lib passed in by
reference.

Signed-off-by: Michel Dänzer 
---
 .../dc/dml/dcn31/display_mode_vba_31.c| 304 --
 1 file changed, 69 insertions(+), 235 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
index 8965f9af9d0a..6feb23432f8d 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
@@ -422,62 +422,8 @@ static void CalculateUrgentBurstFactor(
 
 static void UseMinimumDCFCLK(
struct display_mode_lib *mode_lib,
-   int MaxInterDCNTileRepeaters,
int MaxPrefetchMode,
-   double FinalDRAMClockChangeLatency,
-   double SREnterPlusExitTime,
-   int ReturnBusWidth,
-   int RoundTripPingLatencyCycles,
-   int ReorderingBytes,
-   int PixelChunkSizeInKByte,
-   int MetaChunkSize,
-   bool GPUVMEnable,
-   int GPUVMMaxPageTableLevels,
-   bool HostVMEnable,
-   int NumberOfActivePlanes,
-   double HostVMMinPageSize,
-   int HostVMMaxNonCachedPageTableLevels,
-   bool DynamicMetadataVMEnabled,
-   enum immediate_flip_requirement ImmediateFlipRequirement,
-   bool ProgressiveToInterlaceUnitInOPP,
-   double 
MaxAveragePercentOfIdealFabricAndSDPPortBWDisplayCanUseInNormalSystemOperation,
-   double PercentOfIdealFabricAndSDPPortBWReceivedAfterUrgLatency,
-   int VTotal[],
-   int VActive[],
-   int DynamicMetadataTransmittedBytes[],
-   int DynamicMetadataLinesBeforeActiveRequired[],
-   bool Interlace[],
-   double RequiredDPPCLK[][2][DC__NUM_DPP__MAX],
-   double RequiredDISPCLK[][2],
-   double UrgLatency[],
-   unsigned int NoOfDPP[][2][DC__NUM_DPP__MAX],
-   double ProjectedDCFCLKDeepSleep[][2],
-   double MaximumVStartup[][2][DC__NUM_DPP__MAX],
-   double TotalVActivePixelBandwidth[][2],
-   double TotalVActiveCursorBandwidth[][2],
-   double TotalMetaRowBandwidth[][2],
-   double TotalDPTERowBandwidth[][2],
-   unsigned int TotalNumberOfActiveDPP[][2],
-   unsigned int TotalNumberOfDCCActiveDPP[][2],
-   int dpte_group_bytes[],
-   double PrefetchLinesY[][2][DC__NUM_DPP__MAX],
-   double PrefetchLinesC[][2][DC__NUM_DPP__MAX],
-   int swath_width_luma_ub_all_states[][2][DC__NUM_DPP__MAX],
-   int swath_width_chroma_ub_all_states[][2][DC__NUM_DPP__MAX],
-   int BytePerPixelY[],
-   int BytePerPixelC[],
-   int HTotal[],
-   double PixelClock[],
-   double PDEAndMetaPTEBytesPerFrame[][2][DC__NUM_DPP__MAX],
-   double DPTEBytesPerRow[][2][DC__NUM_DPP__MAX],
-   double MetaRowBytes[][2][DC__NUM_DPP__MAX],
-   bool DynamicMetadataEnable[],
-   double VActivePixelBandwidth[][2][DC__NUM_DPP__MAX],
-   double VActiveCursorBandwidth[][2][DC__NUM_DPP__MAX],
-   double ReadBandwidthLuma[],
-   double ReadBandwidthChroma[],
-   double DCFCLKPerState[],
-   double DCFCLKState[][2]);
+   int ReorderingBytes);
 
 static void CalculatePixelDeliveryTimes(
unsigned int NumberOfActivePlanes,
@@ -5175,66 +5121,8 @@ void dml31_ModeSupportAndSystemConfigurationFull(struct 
display_mode_lib *mode_l
}
}
 
-   if (v->UseMinimumRequiredDCFCLK == true) {
-   UseMinimumDCFCLK(
-   mode_lib,
-   v->MaxInterDCNTileRepeaters,
-   MaxPrefetchMode,
-   v->DRAMClockChangeLatency,
-   v->SREnterPlusExitTime,
-   v->ReturnBusWidth,
-   v->RoundTripPingLatencyCycles,
-   

RE: [PATCH v4 4/6] drm: implement a method to free unused pages

2021-12-09 Thread Paneer Selvam, Arunpravin
[Public]

Hi Matthew,

Ping?

Regards,
Arun
-Original Message-
From: Paneer Selvam, Arunpravin  
Sent: Wednesday, December 1, 2021 10:10 PM
To: dri-de...@lists.freedesktop.org; intel-...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org
Cc: matthew.a...@intel.com; dan...@ffwll.ch; Koenig, Christian 
; Deucher, Alexander ; 
tzimmerm...@suse.de; jani.nik...@linux.intel.com; Paneer Selvam, Arunpravin 

Subject: [PATCH v4 4/6] drm: implement a method to free unused pages

On contiguous allocation, we round up the size to the *next* power of 2 and
implement a function to free the unused pages after the newly allocated block.

v2(Matthew Auld):
  - replace function name 'drm_buddy_free_unused_pages' with
drm_buddy_block_trim
  - replace input argument name 'actual_size' with 'new_size'
  - add more validation checks for input arguments
  - add overlaps check to avoid needless searching and splitting
  - merged the below patch to see the feature in action
- add free unused pages support to i915 driver
  - lock drm_buddy_block_trim() as the mark_free/mark_split it calls
    are globally visible

v3:
  - remove drm_buddy_block_trim() error handling and
print a warn message if it fails

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/drm_buddy.c   | 72 ++-
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.c | 10 +++
 include/drm/drm_buddy.h   |  4 ++
 3 files changed, 83 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index eddc1eeda02e..707efc82216d 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -434,7 +434,8 @@ alloc_from_freelist(struct drm_buddy_mm *mm,
 static int __alloc_range(struct drm_buddy_mm *mm,
 struct list_head *dfs,
 u64 start, u64 size,
-struct list_head *blocks)
+struct list_head *blocks,
+bool trim_path)
 {
struct drm_buddy_block *block;
struct drm_buddy_block *buddy;
@@ -480,8 +481,20 @@ static int __alloc_range(struct drm_buddy_mm *mm,
 
if (!drm_buddy_block_is_split(block)) {
err = split_block(mm, block);
-   if (unlikely(err))
+   if (unlikely(err)) {
+   if (trim_path)
+   /*
+* In the trim case, return here instead of
+* taking the split failure path: that path
+* removes the block from the original list
+* and may also free it. Worst case we are
+* left with some internal fragmentation,
+* and the decision is left to the user.
+*/
+   return err;
+
goto err_undo;
+   }
}
 
list_add(&block->right->tmp_link, dfs);
@@ -535,8 +548,61 @@ static int __drm_buddy_alloc_range(struct drm_buddy_mm *mm,
for (i = 0; i < mm->n_roots; ++i)
list_add_tail(&mm->roots[i]->tmp_link, &dfs);
 
-   return __alloc_range(mm, &dfs, start, size, blocks);
+   return __alloc_range(mm, &dfs, start, size, blocks, 0);
 }
+
+/**
+ * drm_buddy_block_trim - free unused pages
+ *
+ * @mm: DRM buddy manager
+ * @new_size: original size requested
+ * @blocks: output list head to add allocated blocks
+ *
+ * For contiguous allocations, the size is rounded up to the nearest
+ * power of two; drivers consume the *actual* size, so the remaining
+ * portion is unused and can be freed.
+ *
+ * Returns:
+ * 0 on success, error code on failure.
+ */
+int drm_buddy_block_trim(struct drm_buddy_mm *mm,
+u64 new_size,
+struct list_head *blocks)
+{
+   struct drm_buddy_block *block;
+   u64 new_start;
+   LIST_HEAD(dfs);
+
+   if (!list_is_singular(blocks))
+   return -EINVAL;
+
+   block = list_first_entry(blocks,
+struct drm_buddy_block,
+link);
+
+   if (!drm_buddy_block_is_allocated(block))
+   return -EINVAL;
+
+   if (new_size > drm_buddy_block_size(mm, block))
+   return -EINVAL;
+
+   if (!new_size || !IS_ALIGNED(new_size, mm->chunk_size))
+   return -EINVAL;
+
+   if (new_size == drm_buddy_block_size(mm, block))
+   return 0;
+
+   list_del(&block->link);
+
+   new_start = drm_buddy_block_offset(block);
+
+   mark_free(mm, block);
+
+   list_add(&block->tmp_link, &dfs);
+
+   return __alloc_range(mm, &dfs, new_start, new_size, blocks, 1);
 }
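
For context, a hypothetical caller sketch (not part of this patch): trim
a contiguous allocation that was rounded up to a power of two back down
to the size actually requested, warning instead of failing as the v3
note suggests:

static void trim_to_requested(struct drm_buddy_mm *mm, u64 requested_size,
			      struct list_head *blocks)
{
	int err = drm_buddy_block_trim(mm, requested_size, blocks);

	/* Trim failure is not fatal: worst case the unused tail is
	 * kept as internal fragmentation. */
	if (err)
		pr_warn("drm_buddy_block_trim failed: %d\n", err);
}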

Re: [PATCH v4 0/6] Expand display core documentation

2021-12-09 Thread Harry Wentland
Thanks for this. It's really good to see this.

Reviewed-by: Harry Wentland 

Harry

On 2021-12-09 09:20, Rodrigo Siqueira wrote:
> Display Core (DC) is one of the components under amdgpu, and it has
> multiple features directly related to the KMS API. Unfortunately, we
> don't have enough documentation about DC in the upstream, which makes
> the life of some external contributors a little bit more challenging.
> For these reasons, this patchset reworks part of the DC documentation
> and introduces a new set of details on how the display core works on DCN
> IP. Another improvement that this documentation effort tries to bring is
> making explicit some of our hardware-specific details to guide
> user-space developers better.
> 
> In my view, it is easier to review this series if you apply it in your
> local kernel and build the HTML version (make htmldocs). I'm suggesting
> this approach because I added a few SVG diagrams that will be easier to
> see in the HTML version. If you cannot build the documentation, try to
> open the SVG images while reviewing the content. In summary, in this
> series, you will find:
> 
> 1. Patch 1: Re-arrange of display core documentation. This is
>preparation work for the other patches, but it is also a way to expand
>this documentation.
> 2. Patch 2 to 4: Document some common debug options related to display.
> 3. Patch 5: This patch provides an overview of how our display core next
>works and a brief explanation of each component.
> 4. Patch 6: We use a lot of acronyms in our driver; for this reason, we
>exposed a glossary with common terms used by display core.
> 
> Please let us know what you think we can improve in this series and
> what kind of content you want to see in the next series.
> 
> Changes since V3:
>  - Add new acronyms to amdgpu glossary
>  - Add link between dc and amdgpu glossary
> Changes since V2:
>  - Add a comment about MMHUBBUB
> Changes since V1:
>  - Group amdgpu documentation together.
>  - Create index pages.
>  - Mirror display folder in the documentation.
>  - Divide glossary based on driver context.
>  - Make terms more consistent and update CPLIB
>  - Add new acronyms to the glossary
> 
> Thanks
> Siqueira
> 
> Rodrigo Siqueira (6):
>   Documentation/gpu: Reorganize DC documentation
>   Documentation/gpu: Document amdgpu_dm_visual_confirm debugfs entry
>   Documentation/gpu: Document pipe split visual confirmation
>   Documentation/gpu: How to collect DTN log
>   Documentation/gpu: Add basic overview of DC pipeline
>   Documentation/gpu: Add amdgpu and dc glossary
> 
>  Documentation/gpu/amdgpu-dc.rst   |   74 --
>  Documentation/gpu/amdgpu/amdgpu-glossary.rst  |   87 ++
>  .../gpu/amdgpu/display/config_example.svg |  414 ++
>  Documentation/gpu/amdgpu/display/dc-debug.rst |   77 ++
>  .../gpu/amdgpu/display/dc-glossary.rst|  237 
>  .../amdgpu/display/dc_pipeline_overview.svg   | 1125 +
>  .../gpu/amdgpu/display/dcn-overview.rst   |  171 +++
>  .../gpu/amdgpu/display/display-manager.rst|   42 +
>  .../gpu/amdgpu/display/global_sync_vblank.svg |  485 +++
>  Documentation/gpu/amdgpu/display/index.rst|   29 +
>  .../gpu/{amdgpu.rst => amdgpu/index.rst}  |   25 +-
>  Documentation/gpu/drivers.rst |3 +-
>  12 files changed, 2690 insertions(+), 79 deletions(-)
>  delete mode 100644 Documentation/gpu/amdgpu-dc.rst
>  create mode 100644 Documentation/gpu/amdgpu/amdgpu-glossary.rst
>  create mode 100644 Documentation/gpu/amdgpu/display/config_example.svg
>  create mode 100644 Documentation/gpu/amdgpu/display/dc-debug.rst
>  create mode 100644 Documentation/gpu/amdgpu/display/dc-glossary.rst
>  create mode 100644 Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg
>  create mode 100644 Documentation/gpu/amdgpu/display/dcn-overview.rst
>  create mode 100644 Documentation/gpu/amdgpu/display/display-manager.rst
>  create mode 100644 Documentation/gpu/amdgpu/display/global_sync_vblank.svg
>  create mode 100644 Documentation/gpu/amdgpu/display/index.rst
>  rename Documentation/gpu/{amdgpu.rst => amdgpu/index.rst} (95%)
> 



[PATCH v2 09/10] drm/amdgpu: remove unnecessary variables

2021-12-09 Thread Isabella Basso
This fixes the warnings below, and also drops the display_count
variable, as it's unused.

 In function 'svm_range_map_to_gpu':
 warning: variable 'bo_va' set but not used [-Wunused-but-set-variable]
 1172 | struct amdgpu_bo_va bo_va;
  | ^
 ...
 In function 'dcn201_update_clocks':
 warning: variable 'enter_display_off' set but not used 
[-Wunused-but-set-variable]
 132 | bool enter_display_off = false;
 |  ^

Changes since v1:
- As suggested by Rodrigo Siqueira:
  1. Drop display_count variable.
- As suggested by Felix Kuehling:
  1. Remove block surrounding amdgpu_xgmi_same_hive.

Signed-off-by: Isabella Basso 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c   | 4 
 .../gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c | 7 +--
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 82cb45e30197..835f202dc23d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1169,7 +1169,6 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct 
amdgpu_vm *vm,
 unsigned long npages, bool readonly, dma_addr_t *dma_addr,
 struct amdgpu_device *bo_adev, struct dma_fence **fence)
 {
-   struct amdgpu_bo_va bo_va;
bool table_freed = false;
uint64_t pte_flags;
unsigned long last_start;
@@ -1182,9 +1181,6 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct 
amdgpu_vm *vm,
pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
 last_start, last_start + npages - 1, readonly);
 
-   if (prange->svm_bo && prange->ttm_res)
-   bo_va.is_xgmi = amdgpu_xgmi_same_hive(adev, bo_adev);
-
for (i = offset; i < offset + npages; i++) {
last_domain = dma_addr[i] & SVM_RANGE_VRAM_DOMAIN;
dma_addr[i] &= ~SVM_RANGE_VRAM_DOMAIN;
diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c 
b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c
index 8c20a0fb1e4f..fbdd0a92d146 100644
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn201/dcn201_clk_mgr.c
@@ -90,10 +90,8 @@ static void dcn201_update_clocks(struct clk_mgr 
*clk_mgr_base,
struct clk_mgr_internal *clk_mgr = TO_CLK_MGR_INTERNAL(clk_mgr_base);
struct dc_clocks *new_clocks = &context->bw_ctx.bw.dcn.clk;
struct dc *dc = clk_mgr_base->ctx->dc;
-   int display_count;
bool update_dppclk = false;
bool update_dispclk = false;
-   bool enter_display_off = false;
bool dpp_clock_lowered = false;
bool force_reset = false;
bool p_state_change_support;
@@ -109,10 +107,7 @@ static void dcn201_update_clocks(struct clk_mgr 
*clk_mgr_base,
dcn2_read_clocks_from_hw_dentist(clk_mgr_base);
}
 
-   display_count = clk_mgr_helper_get_active_display_cnt(dc, context);
-
-   if (display_count == 0)
-   enter_display_off = true;
+   clk_mgr_helper_get_active_display_cnt(dc, context);
 
if (should_set_clock(safe_to_lower, new_clocks->phyclk_khz, 
clk_mgr_base->clks.phyclk_khz))
clk_mgr_base->clks.phyclk_khz = new_clocks->phyclk_khz;
-- 
2.34.1



[PATCH 2/2] drm/amdgpu: Reserve Bad pages early for SRIOV VF

2021-12-09 Thread Victor Skvortsov
Add a pf-vf exchange right after GMC sw init in
order to reserve bad pages as early as possible

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3992c4086d26..a146a55c9864 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2315,6 +2315,9 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 
/* need to do gmc hw init early so we can allocate gpu mem */
if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
+   if (amdgpu_sriov_vf(adev))
+   amdgpu_virt_exchange_data(adev);
+
r = amdgpu_device_vram_scratch_init(adev);
if (r) {
DRM_ERROR("amdgpu_vram_scratch_init failed 
%d\n", r);
-- 
2.25.1



Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Sierra Guiza, Alejandro (Alex)


On 12/9/2021 10:29 AM, Felix Kuehling wrote:

Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple:

On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) 
wrote:

On 12/8/2021 11:30 AM, Felix Kuehling wrote:

Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:

Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:

On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:

Avoid long term pinning for Coherent device type pages. This could
interfere with their own device memory manager.
If caller tries to get user device coherent pages with PIN_LONGTERM flag
set, those pages will be migrated back to system memory.

Signed-off-by: Alex Sierra
---
   mm/gup.c | 32 ++--
   1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..1572eacf07f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
   #endif /* CONFIG_ELF_CORE */
   
   #ifdef CONFIG_MIGRATION

+static int migrate_device_page(unsigned long address,
+   struct page *page)
+{
+   struct vm_area_struct *vma = find_vma(current->mm, address);
+   struct vm_fault vmf = {
+   .vma = vma,
+   .address = address & PAGE_MASK,
+   .flags = FAULT_FLAG_USER,
+   .pgoff = linear_page_index(vma, address),
+   .gfp_mask = GFP_KERNEL,
+   .page = page,
+   };
+   if (page->pgmap && page->pgmap->ops->migrate_to_ram)
+   return page->pgmap->ops->migrate_to_ram(&vmf);

How does this synchronise against pgmap being released? As I understand things
at this point we're not holding a reference on either the page or pgmap, so
the page and therefore the pgmap may have been freed.

I think a similar problem exists for device private fault handling as well and
it has been on my list of things to fix for a while. I think the solution is to
call try_get_page(), except it doesn't work with device pages due to the whole
refcount thing. That issue is blocking a fair bit of work now so I've started
looking into it.

At least the page should have been pinned by the __get_user_pages_locked
call in __gup_longterm_locked. That refcount is dropped in
check_and_migrate_movable_pages when it returns 0 or an error.

Never mind. We unpin the pages first. Alex, would the migration work if
we unpinned them afterwards? Also, the normal CPU page fault code path
seems to make sure the page is locked (check in pfn_swap_entry_to_page)
before calling migrate_to_ram.

I don't think that's true. The check in pfn_swap_entry_to_page() is only for
migration entries:

BUG_ON(is_migration_entry(entry) && !PageLocked(p));

As this is coherent memory though why do we have to call into a device driver
to do the migration? Couldn't this all be done in the kernel?

I think you're right. I hadn't thought of that mainly because I'm even
less familiar with the non-device migration code. Alex, can you give
that a try? As long as the driver still gets a page-free callback when
the device page is freed, it should work.


ACK. Will do.

Alex Sierra


Regards,
   Felix



No, you cannot unpin after migration, due to the expected_count vs
page_count condition at migrate_page_move_mapping, during the migrate_page call.

Regards,
Alex Sierra


Regards,
Felix




[PATCH 1/2] drm/amd/display: Reduce stack size for dml31_ModeSupportAndSystemConfigurationFull

2021-12-09 Thread Michel Dänzer
From: Michel Dänzer 

Move code using the Pipe struct to a new helper function.

Works around[0] this warning (resulting in failure to build a RHEL debug
kernel with Werror enabled):

../drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c: In 
function ‘dml31_ModeSupportAndSystemConfigurationFull’:
../drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:5740:1:
 warning: the frame size of 2144 bytes is larger than 2048 bytes 
[-Wframe-larger-than=]

The culprit seems to be the Pipe struct, so pull the relevant block out
into its own sub-function. (This is porting
a62427ef9b55 "drm/amd/display: Reduce stack size for 
dml21_ModeSupportAndSystemConfigurationFull"
from dml21 to dml31)

[0] AFAICT this doesn't actually reduce the total amount of stack which
can be used, just moves some of it from
dml31_ModeSupportAndSystemConfigurationFull to the new helper function,
so the former happens to no longer exceed the limit for a single
function.

Signed-off-by: Michel Dänzer 
---
 .../dc/dml/dcn31/display_mode_vba_31.c| 185 ++
 1 file changed, 99 insertions(+), 86 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
index 7e937bdcea00..8965f9af9d0a 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
@@ -3949,6 +3949,102 @@ static double TruncToValidBPP(
return BPP_INVALID;
 }
 
+static noinline void CalculatePrefetchSchedulePerPlane(
+   struct display_mode_lib *mode_lib,
+   double HostVMInefficiencyFactor,
+   int i,
+   unsigned j,
+   unsigned k)
+{
+   struct vba_vars_st *v = &mode_lib->vba;
+   Pipe myPipe;
+
+   myPipe.DPPCLK = v->RequiredDPPCLK[i][j][k];
+   myPipe.DISPCLK = v->RequiredDISPCLK[i][j];
+   myPipe.PixelClock = v->PixelClock[k];
+   myPipe.DCFCLKDeepSleep = v->ProjectedDCFCLKDeepSleep[i][j];
+   myPipe.DPPPerPlane = v->NoOfDPP[i][j][k];
+   myPipe.ScalerEnabled = v->ScalerEnabled[k];
+   myPipe.SourceScan = v->SourceScan[k];
+   myPipe.BlockWidth256BytesY = v->Read256BlockWidthY[k];
+   myPipe.BlockHeight256BytesY = v->Read256BlockHeightY[k];
+   myPipe.BlockWidth256BytesC = v->Read256BlockWidthC[k];
+   myPipe.BlockHeight256BytesC = v->Read256BlockHeightC[k];
+   myPipe.InterlaceEnable = v->Interlace[k];
+   myPipe.NumberOfCursors = v->NumberOfCursors[k];
+   myPipe.VBlank = v->VTotal[k] - v->VActive[k];
+   myPipe.HTotal = v->HTotal[k];
+   myPipe.DCCEnable = v->DCCEnable[k];
+   myPipe.ODMCombineIsEnabled = v->ODMCombineEnablePerState[i][k] == 
dm_odm_combine_mode_4to1
+   || v->ODMCombineEnablePerState[i][k] == 
dm_odm_combine_mode_2to1;
+   myPipe.SourcePixelFormat = v->SourcePixelFormat[k];
+   myPipe.BytePerPixelY = v->BytePerPixelY[k];
+   myPipe.BytePerPixelC = v->BytePerPixelC[k];
+   myPipe.ProgressiveToInterlaceUnitInOPP = 
v->ProgressiveToInterlaceUnitInOPP;
+   v->NoTimeForPrefetch[i][j][k] = CalculatePrefetchSchedule(
+   mode_lib,
+   HostVMInefficiencyFactor,
+   &myPipe,
+   v->DSCDelayPerState[i][k],
+   v->DPPCLKDelaySubtotal + v->DPPCLKDelayCNVCFormater,
+   v->DPPCLKDelaySCL,
+   v->DPPCLKDelaySCLLBOnly,
+   v->DPPCLKDelayCNVCCursor,
+   v->DISPCLKDelaySubtotal,
+   v->SwathWidthYThisState[k] / v->HRatio[k],
+   v->OutputFormat[k],
+   v->MaxInterDCNTileRepeaters,
+   dml_min(v->MaxVStartup, v->MaximumVStartup[i][j][k]),
+   v->MaximumVStartup[i][j][k],
+   v->GPUVMMaxPageTableLevels,
+   v->GPUVMEnable,
+   v->HostVMEnable,
+   v->HostVMMaxNonCachedPageTableLevels,
+   v->HostVMMinPageSize,
+   v->DynamicMetadataEnable[k],
+   v->DynamicMetadataVMEnabled,
+   v->DynamicMetadataLinesBeforeActiveRequired[k],
+   v->DynamicMetadataTransmittedBytes[k],
+   v->UrgLatency[i],
+   v->ExtraLatency,
+   v->TimeCalc,
+   v->PDEAndMetaPTEBytesPerFrame[i][j][k],
+   v->MetaRowBytes[i][j][k],
+   v->DPTEBytesPerRow[i][j][k],
+   v->PrefetchLinesY[i][j][k],
+   v->SwathWidthYThisState[k],
+   v->PrefillY[k],
+   v->MaxNumSwY[k],
+   v->PrefetchLinesC[i][j][k],
+   v->SwathWidthCThisState[k],
+   v->PrefillC[k],
+   v->MaxNumSwC[k],
+   v->swath_width_luma_ub_this_state[k],
+   v->swath_width_chroma_ub_this_state[k],
+   v->SwathHeightYThisState[k],
+   v->SwathHeightCThisState[k],
+ 
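
As a toy, self-contained illustration of the technique (none of this is
DML code): hoisting a large local into a noinline helper moves it out of
the caller's frame, so each function stays under the
-Wframe-larger-than= limit on its own:

#include <string.h>

struct big { char buf[1500]; };

__attribute__((noinline))	/* keep the large local out of the caller */
static void helper(void)
{
	struct big b;		/* lives in helper's frame only */

	memset(b.buf, 0, sizeof(b.buf));
}

void caller(void)
{
	helper();		/* caller's own frame stays small */
}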

[PATCH v2 1/2] drm/amdgpu: Detect if amdgpu in IOMMU direct map mode

2021-12-09 Thread Philip Yang
If the host and amdgpu IOMMU are not enabled, or the IOMMU is in
pass-through mode, set the adev->ram_is_direct_mapped flag, which will
be used to optimize memory usage for multi-GPU mappings.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++
 2 files changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 54c882a6b433..0ec19c83a203 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1097,6 +1097,8 @@ struct amdgpu_device {
 
struct amdgpu_reset_control *reset_cntl;
uint32_t
ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE];
+
+   boolram_is_direct_mapped;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ce9bdef185c0..3318d92de8eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include <linux/iommu.h>
 
 #include 
 #include 
@@ -3381,6 +3382,22 @@ static int amdgpu_device_get_job_timeout_settings(struct 
amdgpu_device *adev)
return ret;
 }
 
+/**
+ * amdgpu_device_check_iommu_direct_map - check if RAM direct mapped to GPU
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * RAM direct mapped to GPU if IOMMU is not enabled or is pass through mode
+ */
+static void amdgpu_device_check_iommu_direct_map(struct amdgpu_device *adev)
+{
+   struct iommu_domain *domain;
+
+   domain = iommu_get_domain_for_dev(adev->dev);
+   if (!domain || domain->type == IOMMU_DOMAIN_IDENTITY)
+   adev->ram_is_direct_mapped = true;
+}
+
 static const struct attribute *amdgpu_dev_attributes[] = {
&dev_attr_product_name.attr,
&dev_attr_product_number.attr,
@@ -3784,6 +3801,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
queue_delayed_work(system_wq, &mgpu_info.delayed_reset_work,
   msecs_to_jiffies(AMDGPU_RESUME_MS));
 
+   amdgpu_device_check_iommu_direct_map(adev);
+
return 0;
 
 release_ras_con:
-- 
2.17.1
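
The detection pattern generalizes beyond amdgpu. A minimal sketch for an
arbitrary struct device, mirroring the helper added above (the
multi-GPU optimization that consumes the flag is not part of this
patch):

#include <linux/iommu.h>

static bool dev_ram_is_direct_mapped(struct device *dev)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

	/* No translation happens when there is no IOMMU domain at all,
	 * or when the domain is identity (pass-through) mapped. */
	return !domain || domain->type == IOMMU_DOMAIN_IDENTITY;
}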



Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Christian König

Hi Guilherme,

Am 09.12.21 um 17:00 schrieb Guilherme G. Piccoli:

Hi all, I have a question about the possibility of reusing a framebuffer
after a regular (or panic) kexec - my case is with amdgpu (an APU, i.e.
not separate GPU hardware), but I guess the question is kinda generic,
hence I've looped in most of the lists / people I think make sense
(apologies for duplicates).


The context is: we have hardware with an amdgpu-controlled device
(Vangogh model), and as soon as the machine boots, efifb provides
graphics - I understand the UEFI/GRUB outputs rely on the EFI
framebuffer as well. As soon as the amdgpu module is available, the
kernel loads it and it takes over the GPU, providing graphics. The
kexec_file_load syscall allows passing a valid screen_info structure, so
by kexec'ing a new kernel, we again have efifb taking over at boot time,
but this time I see nothing on the screen. I've manually blacklisted
amdgpu in this new kexec'ed kernel since I'd like to rely on the simple
framebuffer - the goal is to have a tiny kexec'ed kernel. I'm using
kernel version 5.16.0-rc4.
I've done some other experiments, for example: I've forced the
screen_info model to match VLFB, so vesafb took over after the kexec,
with the same result. I also noticed that the BusMaster bit was off
after kexec in the AMD APU PCIe device, so I set it in efifb before
probe, and finally tested the same things in qemu with qxl, all with the
same result (blank screen).
The most interesting result I got (both with amdgpu and qemu/qxl) is
that if I blacklist these drivers and let the machine continue using
efifb since the beginning, after kexec the efifb is still able to
produce graphics.

Which then led me to think that likely there's something fundamentally
"blocking" the reuse of the simple framebuffer after kexec, like maybe
DRM stack is destroying the old framebuffer somehow? What kind of
preparation is required at firmware level to make the simple EFI VGA
framebuffer work, and could we perform this in a kexec (or "save it"
before the amdgpu/qxl drivers take over and reuse later)?


unfortunately what you try here will most likely not work easily.

During bootup the ASIC is initialized in a VGA compatibility mode by the 
VBIOS which also allows efifb to display something. And among the first 
things amdgpu does is to disable this compatibility mode :)


What you need to do to get this working again is to issue a PCIe reset 
of the GPU and then re-init the ASIC with the VBIOS tables.


Alex should know more details about how to do this.

Regards,
Christian.



Any advice is greatly appreciated!
Thanks in advance,


Guilherme




Re: [PATCH v2] drm/amdgpu: fix incorrect VCN revision in SRIOV

2021-12-09 Thread Alex Deucher
On Thu, Dec 9, 2021 at 12:18 AM Leslie Shi  wrote:
>
> Guest OS will set up VCN instance 1, which is disabled, as an enabled
> instance and execute initialization work on it, but this causes a VCN IB
> ring test failure on the disabled VCN instance during modprobe:
>
> amdgpu :00:08.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 5 on hub 1
> amdgpu :00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test 
> failed on vcn_dec_0 (-110).
> amdgpu :00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test 
> failed on vcn_enc_0.0 (-110).
> [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test 
> failed (-110).
>
> v2: drop amdgpu_discovery_get_vcn_version and rename sriov_config to
> vcn_config
>
> Fixes: 36b7d5646476 ("drm/amdgpu: handle SRIOV VCN revision parsing")
> Signed-off-by: Leslie Shi 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 13 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   | 15 ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  2 +-
>  4 files changed, 8 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 552031950518..53ff1bbe8bd6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -380,6 +380,9 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
> *adev)
>   ip->revision);
>
> if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
> +   adev->vcn.vcn_config[adev->vcn.num_vcn_inst] =
> +   ip->revision & 0xc0;
> +
> if (amdgpu_sriov_vf(adev)) {

We can probably just drop the conditional here and just clear the high
bits for everything.

Alex

> /* SR-IOV modifies each VCN’s 
> revision (uint8)
>  * Bit [5:0]: original revision value
> @@ -388,8 +391,6 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
> *adev)
>  * 0b10 : encode is disabled
>  * 0b01 : decode is disabled
>  */
> -   
> adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
> -   (ip->revision & 0xc0) >> 6;
> ip->revision &= ~0xc0;
> }
> adev->vcn.num_vcn_inst++;
> @@ -485,14 +486,6 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device 
> *adev, int hw_id, int n
> return -EINVAL;
>  }
>
> -
> -int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int 
> vcn_instance,
> -int *major, int *minor, int *revision)
> -{
> -   return amdgpu_discovery_get_ip_version(adev, VCN_HWID,
> -  vcn_instance, major, minor, 
> revision);
> -}
> -
>  void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
>  {
> struct binary_header *bhdr;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> index 0ea029e3b850..14537cec19db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> @@ -33,8 +33,6 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device 
> *adev);
>  int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, 
> int number_instance,
>  int *major, int *minor, int *revision);
>
> -int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int 
> vcn_instance,
> -int *major, int *minor, int *revision);
>  int amdgpu_discovery_get_gfx_info(struct amdgpu_device *adev);
>  int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> index 2658414c503d..38036cbf6203 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> @@ -284,20 +284,13 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device *adev)
>  bool amdgpu_vcn_is_disabled_vcn(struct amdgpu_device *adev, enum 
> vcn_ring_type type, uint32_t vcn_instance)
>  {
> bool ret = false;
> +   int vcn_config = adev->vcn.vcn_config[vcn_instance];
>
> -   int major;
> -   int minor;
> -   int revision;
> -
> -   /* if cannot find IP data, then this VCN does not exist */
> -   if (amdgpu_discovery_get_vcn_version(adev, vcn_instance, , 
> , ) != 0)
> -   return true;
> -
> -   if ((type == VCN_ENCODE_RING) && (revision & 
> 
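
For reference, an illustrative decode of the SR-IOV VCN revision byte
described in the quoted comment (the helper and variable names here are
made up; v2 of the patch caches the unshifted top bits as vcn_config):

static void decode_sriov_vcn_revision(u8 raw)
{
	u8 vcn_config = raw & 0xc0;	/* bits [7:6]: en/decode config */
	u8 revision = raw & 0x3f;	/* bits [5:0]: original revision */
	bool encode_disabled = vcn_config & 0x80;	/* config 0b10 */
	bool decode_disabled = vcn_config & 0x40;	/* config 0b01 */

	(void)revision; (void)encode_disabled; (void)decode_disabled;
}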

Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

2021-12-09 Thread Christian König

Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:


On 2021-12-09 4:00 a.m., Christian König wrote:



Am 09.12.21 um 09:49 schrieb Lang Yu:

It is useful to maintain error context when debugging
SW/FW issues. We introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compared to a simple hang, the system will stay stable
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

Suggested-by: Christian Koenig 
Suggested-by: Andrey Grodzovsky 
Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 
++

  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index c5cfe2926ca1..3f5f8f62aa5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct 
amdgpu_device *adev,

  void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
  struct amdgpu_ring *ring);
  +void amdgpu_device_halt(struct amdgpu_device *adev);
+
  /* atpx handler */
  #if defined(CONFIG_VGA_SWITCHEROO)
  void amdgpu_register_atpx_handler(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index a1c14466f23d..62216627cc83 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct 
amdgpu_device *adev,

    amdgpu_asic_invalidate_hdp(adev, ring);
  }
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. It helps to maintain error context when an error occurs.
+ * Compared to a simple hang, the system will stay stable at least for SSH
+ * access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as follows:
+ *
+ * 1. drm_dev_unplug() makes the device inaccessible to user space
+ *    (IOCTLs, etc), clears all CPU mappings to the device, and disallows
+ *    remappings through page faults
+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 5. pci_disable_device() and pci_wait_for_pending_transaction()
+ *    flush any in flight DMA operations
+ * 6. set adev->no_hw_access to true
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+    struct pci_dev *pdev = adev->pdev;
+    struct drm_device *ddev = &adev->ddev;
+
+    drm_dev_unplug(ddev);
+
+    amdgpu_irq_disable_all(adev);
+
+    amdgpu_fence_driver_hw_fini(adev);
+
+    amdgpu_device_unmap_mmio(adev);



Note that this one will cause a page fault on any subsequent MMIO access
(through registers or by direct VRAM access)





+
+    pci_disable_device(pdev);
+    pci_wait_for_pending_transaction(pdev);
+
+    adev->no_hw_access = true;


I think we need to reorder this, e.g. set adev->no_hw_access much 
earlier for example. Andrey what do you think?



Earlier can be OK, but at least after the last HW configuration we
actually want to do, like disabling IRQs.


My thinking was to at least do this before we unmap the MMIO to avoid 
the crash.


Additionally to that we maybe don't even want to do this for this case.

Christian.




Andrey



Apart from that sounds like the right idea to me.

Regards,
Christian.


+}
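
A hedged sketch of the reordering discussed here (an assumption, not the
final patch): raise no_hw_access after the last piece of HW configuration
but before the MMIO unmap, so any late register accessor bails out
instead of faulting on the unmapped BAR:

void amdgpu_device_halt(struct amdgpu_device *adev)
{
	struct pci_dev *pdev = adev->pdev;
	struct drm_device *ddev = &adev->ddev;

	drm_dev_unplug(ddev);
	amdgpu_irq_disable_all(adev);
	amdgpu_fence_driver_hw_fini(adev);

	adev->no_hw_access = true;	/* moved up from the tail */

	amdgpu_device_unmap_mmio(adev);
	pci_disable_device(pdev);
	pci_wait_for_pending_transaction(pdev);
}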






[PATCH v2 10/10] drm/amdgpu: re-format file header comments

2021-12-09 Thread Isabella Basso
Fix the warning below:

 warning: Cannot understand  * \file amdgpu_ioc32.c
 on line 2 - I thought it was a doc line

Changes since v1:
- As suggested by Alexander Deucher:
  1. Reduce diff to minimum as this DOC section doesn't provide much
 value.

Signed-off-by: Isabella Basso 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c
index 5cf142e849bb..a7efca6354b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c
@@ -1,4 +1,4 @@
-/**
+/*
  * \file amdgpu_ioc32.c
  *
  * 32-bit ioctl compatibility routines for the AMDGPU DRM.
-- 
2.34.1
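
For contrast, a comment that is meant for kernel-doc keeps the
double-asterisk opener and follows the expected shape; scripts/kernel-doc
only parses comments starting with /**, which is why downgrading the
Doxygen-style \file header to a plain /* silences the warning (the
function below is hypothetical):

/**
 * amdgpu_foo_do_thing() - one-line summary of the function
 * @adev: amdgpu device pointer
 *
 * Free-form description goes after the parameter list.
 */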



[PATCH v4 1/6] Documentation/gpu: Reorganize DC documentation

2021-12-09 Thread Rodrigo Siqueira
Display core documentation is not well organized, and it is hard to find
information due to the lack of sections. This commit reorganizes the
documentation layout, and it is preparation work for future changes.

Changes since V1:
- Christian: Group amdgpu documentation together.
- Daniel: Drop redundant amdgpu prefix.
- Jani: Create index pages.
- Yann: Mirror display folder in the documentation.

Signed-off-by: Rodrigo Siqueira 
---
 Documentation/gpu/amdgpu-dc.rst   | 74 ---
 Documentation/gpu/amdgpu/display/dc-debug.rst |  4 +
 .../gpu/amdgpu/display/display-manager.rst| 42 +++
 Documentation/gpu/amdgpu/display/index.rst| 29 
 .../gpu/{amdgpu.rst => amdgpu/index.rst}  | 18 -
 Documentation/gpu/drivers.rst |  3 +-
 6 files changed, 91 insertions(+), 79 deletions(-)
 delete mode 100644 Documentation/gpu/amdgpu-dc.rst
 create mode 100644 Documentation/gpu/amdgpu/display/dc-debug.rst
 create mode 100644 Documentation/gpu/amdgpu/display/display-manager.rst
 create mode 100644 Documentation/gpu/amdgpu/display/index.rst
 rename Documentation/gpu/{amdgpu.rst => amdgpu/index.rst} (96%)

diff --git a/Documentation/gpu/amdgpu-dc.rst b/Documentation/gpu/amdgpu-dc.rst
deleted file mode 100644
index f7ff7e1309de..
--- a/Documentation/gpu/amdgpu-dc.rst
+++ /dev/null
@@ -1,74 +0,0 @@
-===
-drm/amd/display - Display Core (DC)
-===
-
-*placeholder - general description of supported platforms, what dc is, etc.*
-
-Because it is partially shared with other operating systems, the Display Core
-Driver is divided in two pieces.
-
-1. **Display Core (DC)** contains the OS-agnostic components. Things like
-   hardware programming and resource management are handled here.
-2. **Display Manager (DM)** contains the OS-dependent components. Hooks to the
-   amdgpu base driver and DRM are implemented here.
-
-It doesn't help that the entire package is frequently referred to as DC. But
-with the context in mind, it should be clear.
-
-When CONFIG_DRM_AMD_DC is enabled, DC will be initialized by default for
-supported ASICs. To force disable, set `amdgpu.dc=0` on kernel command line.
-Likewise, to force enable on unsupported ASICs, set `amdgpu.dc=1`.
-
-To determine if DC is loaded, search dmesg for the following entry:
-
-``Display Core initialized with ``
-
-AMDgpu Display Manager
-==
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
-   :doc: overview
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
-   :internal:
-
-Lifecycle
--
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
-   :doc: DM Lifecycle
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
-   :functions: dm_hw_init dm_hw_fini
-
-Interrupts
---
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
-   :doc: overview
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
-   :internal:
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
-   :functions: register_hpd_handlers dm_crtc_high_irq dm_pflip_high_irq
-
-Atomic Implementation
--
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
-   :doc: atomic
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
-   :functions: amdgpu_dm_atomic_check amdgpu_dm_atomic_commit_tail
-
-Display Core
-
-
-**WIP**
-
-FreeSync Video
---
-
-.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
-   :doc: FreeSync Video
diff --git a/Documentation/gpu/amdgpu/display/dc-debug.rst 
b/Documentation/gpu/amdgpu/display/dc-debug.rst
new file mode 100644
index ..bbb8c3fc8eee
--- /dev/null
+++ b/Documentation/gpu/amdgpu/display/dc-debug.rst
@@ -0,0 +1,4 @@
+Display Core Debug tools
+
+
+TODO
diff --git a/Documentation/gpu/amdgpu/display/display-manager.rst 
b/Documentation/gpu/amdgpu/display/display-manager.rst
new file mode 100644
index ..7ce31f89d9a0
--- /dev/null
+++ b/Documentation/gpu/amdgpu/display/display-manager.rst
@@ -0,0 +1,42 @@
+==
+AMDgpu Display Manager
+==
+
+.. contents:: Table of Contents
+:depth: 3
+
+.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+   :doc: overview
+
+.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
+   :internal:
+
+Lifecycle
+=
+
+.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+   :doc: DM Lifecycle
+
+.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+   :functions: dm_hw_init dm_hw_fini
+
+Interrupts
+==
+
+.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
+   :doc: overview
+
+.. kernel-doc:: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
+   :internal:
+
+.. kernel-doc:: 

[PATCH v2 06/10] drm/amd/display: fix function scopes

2021-12-09 Thread Isabella Basso
This turns previously global functions into static ones, thus removing
compile-time warnings such as:

  warning: no previous prototype for 'get_highest_allowed_voltage_level'
  [-Wmissing-prototypes]
  742 | unsigned int get_highest_allowed_voltage_level(uint32_t chip_family, 
uint32_t hw_internal_rev, uint32_t pci_revision_id)
  |  ^
  warning: no previous prototype for 'rv1_vbios_smu_send_msg_with_param'
  [-Wmissing-prototypes]
  102 | int rv1_vbios_smu_send_msg_with_param(struct clk_mgr_internal *clk_mgr, 
unsigned int msg_id, unsigned int param)
  | ^

Changes since v1:
- As suggested by Rodrigo Siqueira:
  1. Rewrite function signatures to make them more readable.
  2. Get rid of unused functions in order to remove 'defined but not
 used' warnings.

Signed-off-by: Isabella Basso 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 ++--
 .../gpu/drm/amd/display/dc/calcs/dcn_calcs.c  |  4 +-
 .../display/dc/clk_mgr/dcn10/rv1_clk_mgr.c|  2 +-
 .../display/dc/clk_mgr/dcn20/dcn20_clk_mgr.c  |  2 +-
 .../dc/clk_mgr/dcn201/dcn201_clk_mgr.c| 36 ---
 .../amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c | 23 +
 .../dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.c   |  4 +-
 .../display/dc/clk_mgr/dcn301/dcn301_smu.c|  6 +-
 .../display/dc/clk_mgr/dcn301/vg_clk_mgr.c| 20 +---
 .../display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c  |  7 +-
 .../amd/display/dc/clk_mgr/dcn31/dcn31_smu.c  |  6 +-
 drivers/gpu/drm/amd/display/dc/core/dc_link.c |  3 +-
 .../gpu/drm/amd/display/dc/dcn10/dcn10_dpp.c  |  8 --
 .../drm/amd/display/dc/dcn10/dcn10_dpp_dscl.c | 97 ---
 .../amd/display/dc/dcn10/dcn10_hw_sequencer.c | 29 +++---
 .../gpu/drm/amd/display/dc/dcn10/dcn10_opp.c  | 30 --
 .../gpu/drm/amd/display/dc/dcn10/dcn10_optc.c | 20 +---
 .../drm/amd/display/dc/dcn10/dcn10_resource.c | 18 ++--
 .../gpu/drm/amd/display/dc/dcn20/dcn20_dpp.c  | 14 ---
 .../drm/amd/display/dc/dcn20/dcn20_dwb_scl.c  |  4 +-
 .../gpu/drm/amd/display/dc/dcn20/dcn20_hubp.c |  7 +-
 .../drm/amd/display/dc/dcn20/dcn20_hwseq.c|  6 +-
 .../gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c  |  9 +-
 .../gpu/drm/amd/display/dc/dcn20/dcn20_optc.c | 57 +--
 .../drm/amd/display/dc/dcn201/dcn201_dccg.c   |  3 +-
 .../drm/amd/display/dc/dcn201/dcn201_hubp.c   |  7 +-
 .../display/dc/dcn201/dcn201_link_encoder.c   |  6 +-
 .../amd/display/dc/dcn201/dcn201_resource.c   | 16 ++-
 .../drm/amd/display/dc/dcn21/dcn21_hubbub.c   |  2 +-
 .../gpu/drm/amd/display/dc/dcn21/dcn21_hubp.c | 15 +--
 .../amd/display/dc/dcn21/dcn21_link_encoder.c |  9 +-
 .../drm/amd/display/dc/dcn21/dcn21_resource.c | 31 +++---
 .../dc/dcn30/dcn30_dio_stream_encoder.c   | 18 +---
 .../gpu/drm/amd/display/dc/dcn30/dcn30_dpp.c  | 36 ++-
 .../drm/amd/display/dc/dcn30/dcn30_mmhubbub.c |  2 +-
 .../gpu/drm/amd/display/dc/dcn30/dcn30_mpc.c  |  2 +-
 .../drm/amd/display/dc/dcn30/dcn30_resource.c | 12 +--
 .../amd/display/dc/dcn301/dcn301_panel_cntl.c | 10 +-
 .../amd/display/dc/dcn301/dcn301_resource.c   | 45 -
 .../gpu/drm/amd/display/dc/dcn31/dcn31_dccg.c |  2 +-
 .../display/dc/dcn31/dcn31_dio_link_encoder.c |  2 +-
 .../amd/display/dc/dcn31/dcn31_panel_cntl.c   | 10 +-
 .../drm/amd/display/dc/dcn31/dcn31_resource.c |  2 +-
 .../dc/dml/dcn21/display_rq_dlg_calc_21.c |  8 --
 .../display/dc/irq/dcn10/irq_service_dcn10.c  |  7 +-
 .../dc/irq/dcn201/irq_service_dcn201.c|  7 +-
 .../display/dc/irq/dcn21/irq_service_dcn21.c  |  7 +-
 .../display/dc/irq/dcn31/irq_service_dcn31.c  |  7 +-
 48 files changed, 179 insertions(+), 517 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 26c65c72eb75..3fe8a26dbfa0 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -632,7 +632,8 @@ static void dm_dcn_vertical_interrupt0_high_irq(void 
*interrupt_params)
  * Copies dmub notification to DM which is to be read by AUX command.
  * issuing thread and also signals the event to wake up the thread.
  */
-void dmub_aux_setconfig_callback(struct amdgpu_device *adev, struct 
dmub_notification *notify)
+static void dmub_aux_setconfig_callback(struct amdgpu_device *adev,
+   struct dmub_notification *notify)
 {
if (adev->dm.dmub_notify)
memcpy(adev->dm.dmub_notify, notify, sizeof(struct 
dmub_notification));
@@ -648,7 +649,8 @@ void dmub_aux_setconfig_callback(struct amdgpu_device 
*adev, struct dmub_notific
  * Dmub Hpd interrupt processing callback. Gets displayindex through the
  * ink index and calls helper to do the processing.
  */
-void dmub_hpd_callback(struct amdgpu_device *adev, struct dmub_notification 
*notify)
+static void dmub_hpd_callback(struct amdgpu_device *adev,
+ struct dmub_notification *notify)
 {
struct 
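
The pattern behind all of these hunks, in miniature (names made up): a
function with external linkage triggers -Wmissing-prototypes unless some
header declares it, while internal linkage removes that expectation:

/* before: "warning: no previous prototype for 'my_helper'" */
/* after: internal linkage, no separate prototype needed */
static void my_helper(int x)
{
	(void)x;	/* body elided; the linkage change is the point */
}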

Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

2021-12-09 Thread Felix Kuehling
Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple:
> On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) 
> wrote:
>> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
>>> Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
 Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
>> Avoid long term pinning for Coherent device type pages. This could
>> interfere with their own device memory manager.
>> If caller tries to get user device coherent pages with PIN_LONGTERM flag
>> set, those pages will be migrated back to system memory.
>>
>> Signed-off-by: Alex Sierra 
>> ---
>>   mm/gup.c | 32 ++--
>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 886d6148d3d0..1572eacf07f4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>>   #endif /* CONFIG_ELF_CORE */
>>   
>>   #ifdef CONFIG_MIGRATION
>> +static int migrate_device_page(unsigned long address,
>> +struct page *page)
>> +{
>> +struct vm_area_struct *vma = find_vma(current->mm, address);
>> +struct vm_fault vmf = {
>> +.vma = vma,
>> +.address = address & PAGE_MASK,
>> +.flags = FAULT_FLAG_USER,
>> +.pgoff = linear_page_index(vma, address),
>> +.gfp_mask = GFP_KERNEL,
>> +.page = page,
>> +};
>> +if (page->pgmap && page->pgmap->ops->migrate_to_ram)
>> +return page->pgmap->ops->migrate_to_ram(&vmf);
> How does this synchronise against pgmap being released? As I understand 
> things
> at this point we're not holding a reference on either the page or pgmap, 
> so
> the page and therefore the pgmap may have been freed.
>
> I think a similar problem exists for device private fault handling as 
> well and
> it has been on my list of things to fix for a while. I think the solution 
> is to
> call try_get_page(), except it doesn't work with device pages due to the 
> whole
> refcount thing. That issue is blocking a fair bit of work now so I've 
> started
> looking into it.
 At least the page should have been pinned by the __get_user_pages_locked
 call in __gup_longterm_locked. That refcount is dropped in
 check_and_migrate_movable_pages when it returns 0 or an error.
>>> Never mind. We unpin the pages first. Alex, would the migration work if
>>> we unpinned them afterwards? Also, the normal CPU page fault code path
>>> seems to make sure the page is locked (check in pfn_swap_entry_to_page)
>>> before calling migrate_to_ram.
> I don't think that's true. The check in pfn_swap_entry_to_page() is only for
> migration entries:
>
>   BUG_ON(is_migration_entry(entry) && !PageLocked(p));
>
> As this is coherent memory though why do we have to call into a device driver
> to do the migration? Couldn't this all be done in the kernel?

I think you're right. I hadn't thought of that mainly because I'm even
less familiar with the non-device migration code. Alex, can you give
that a try? As long as the driver still gets a page-free callback when
the device page is freed, it should work.

Regards,
  Felix


>
>> No, you cannot unpin after migration, due to the expected_count vs
>> page_count condition at migrate_page_move_mapping, during the migrate_page call.
>>
>> Regards,
>> Alex Sierra
>>
>>> Regards,
>>>Felix
>>>
>>>
>
>


[PATCH v2 03/10] drm/amdgpu: fix amdgpu_ras_mca_query_error_status scope

2021-12-09 Thread Isabella Basso
This commit fixes the compile-time warning below:

 warning: no previous prototype for ‘amdgpu_ras_mca_query_error_status’
 [-Wmissing-prototypes]

Changes since v1:
- As suggested by Alexander Deucher:
  1. Make function static instead of adding prototype.

Signed-off-by: Isabella Basso 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 00f94f6b5287..dc2a8d58d578 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -867,9 +867,9 @@ static int amdgpu_ras_enable_all_features(struct 
amdgpu_device *adev,
 /* feature ctl end */
 
 
-void amdgpu_ras_mca_query_error_status(struct amdgpu_device *adev,
-  struct ras_common_if *ras_block,
-  struct ras_err_data  *err_data)
+static void amdgpu_ras_mca_query_error_status(struct amdgpu_device *adev,
+ struct ras_common_if *ras_block,
+ struct ras_err_data  *err_data)
 {
switch (ras_block->sub_block_index) {
case AMDGPU_RAS_MCA_BLOCK__MP0:
-- 
2.34.1



Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Christian König

That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

Regards,
Christian.

Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank 
you!


On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly 
a good idea.


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied


exactly that is not correct. That behavior is actively used by some 
userspace stacks as far as I know.


Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
Thanks Christian. Would it make it less intrusive if I just use the 
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from 
this patch? For our use case, just the ttm_bo_mmap_obj change should 
suffice and we don't want to put any more work arounds in the user 
space (thunk, in our case).


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied so I wonder why even inherit the vma?


On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
When an application having open file access to a node forks, its shared
mappings also get reflected in the address space of the child process,
even though it cannot access them with the object permissions applied.
With the existing permission checks on the gem objects, it might be
reasonable to also create the VMAs with the VM_DONTCOPY flag, so a user
space application doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK) system call to prevent the pages in the mapped range from
appearing in the address space of the child process. It also prevents
memory leaks due to additional reference counts on the mapped BOs in the
child process that prevented freeing the memory in the parent - something
we had previously worked around in user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint/restore
an application that had such inherited mappings in the child, which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this,
so this is needed only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something 
similar.


While it is illegal by the OpenGL specification and doesn't work 
for most userspace stacks, we do have some implementations which 
call fork() with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object 
*obj, unsigned long obj_size,

  goto err_drm_gem_object_put;
  }
  -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | 
VM_DONTDUMP;

+    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+    | VM_DONTDUMP | VM_DONTCOPY;
  vma->vm_page_prot = 
pgprot_writecombine(vm_get_page_prot(vma->vm_flags));

  vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
  }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
b/drivers/gpu/drm/ttm/ttm_bo_vm.c

index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct 
*vma, struct ttm_buffer_object *bo)

    vma->vm_private_data = bo;
  -    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
  vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
  return 0;
  }
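
For reference, a sketch of the userspace workaround referenced in the
commit message (illustrative; addr and len would come from wherever the
BO was mapped): without VM_DONTCOPY set by the kernel, a library such as
the thunk has to mark each mapped range itself so fork() does not
duplicate it:

#include <stdio.h>
#include <sys/mman.h>

static void dont_fork_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_DONTFORK) != 0)
		perror("madvise(MADV_DONTFORK)");
}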








Re: [PATCH] drm/amdgpu: don't skip runtime pm get on A+A config

2021-12-09 Thread Deucher, Alexander
[Public]

No objections from me.
Acked-by: Alex Deucher 

From: Christian König 
Sent: Thursday, December 9, 2021 10:34 AM
To: Quan, Evan ; Deucher, Alexander 

Cc: amd-gfx@lists.freedesktop.org 
Subject: Re: [PATCH] drm/amdgpu: don't skip runtime pm get on A+A config

Am 07.12.21 um 08:40 schrieb Quan, Evan:
> [AMD Official Use Only]
>> -Original Message-
>> From: Christian König 
>> Sent: Tuesday, December 7, 2021 3:03 PM
>> To: Quan, Evan ; Deucher, Alexander
>> 
>> Cc: amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: don't skip runtime pm get on A+A config
>>
>> You are looking at outdated code, that stuff is gone by now.
>> amd-staging-drm-next probably needs a rebase.
> Yep, I can see it in the vanilla kernel.
> The patch is acked-by: Evan Quan 

Thanks.

Alex any objections that I push this to drm-misc-next? It was found
while working on changes already upstream in that function and would
conflict if we push it through amd-staging-drm-next.

Regards,
Christian.

>
> BR
> Evan
>> And this code was what the check was initially good for. Just skipping the PM
>> stuff as well on A+A was unintentional.
>>
>> Regards,
>> Christian.
>>
>> Am 07.12.21 um 02:58 schrieb Quan, Evan:
>>> [AMD Official Use Only]
>>>
>>> It seems more jobs (below) than just bumping the runpm counter are
>> performed.
>>> Are they desired also?
>>>
>>>  r = __dma_resv_make_exclusive(bo->tbo.base.resv);
>>>  if (r)
>>>  goto out;
>>>
>>>  bo->prime_shared_count++;
>>>
>>> BR
>>> Evan
 -Original Message-
 From: amd-gfx  On Behalf Of
 Christian König
 Sent: Monday, December 6, 2021 4:46 PM
 To: Deucher, Alexander 
 Cc: amd-gfx@lists.freedesktop.org
 Subject: [PATCH] drm/amdgpu: don't skip runtime pm get on A+A config

 The runtime PM get was incorrectly added after the check.

 Signed-off-by: Christian König 
 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 3 ---
1 file changed, 3 deletions(-)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
 index ae6ab93c868b..4896c876ffec 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
 @@ -61,9 +61,6 @@ static int amdgpu_dma_buf_attach(struct dma_buf
 *dmabuf,
 if (pci_p2pdma_distance_many(adev->pdev, &attach->dev, 1, true) < 0)
 attach->peer2peer = false;

 -  if (attach->dev->driver == adev->dev->driver)
 -  return 0;
 -
 r = pm_runtime_get_sync(adev_to_drm(adev)->dev);
 if (r < 0)
 goto out;
 --
 2.25.1
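
As an aside, a minimal sketch (generic struct device, not tied to this
patch) of why the runtime PM reference must be dropped even when
pm_runtime_get_sync() fails, which is the pattern the function above uses:

#include <linux/pm_runtime.h>

/* pm_runtime_get_sync() bumps the usage count even on failure, so every
 * path after the call, including the error path, must drop it again. */
static int example_attach(struct device *dev)
{
	int r;

	r = pm_runtime_get_sync(dev);
	if (r < 0)
		goto out;

	/* ... talk to the hardware ... */
	r = 0;
out:
	/* Drop the reference taken above, also on the error path. */
	pm_runtime_mark_last_busy(dev);
	pm_runtime_put_autosuspend(dev);
	return r;
}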



Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Bhardwaj, Rajneesh

Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank you!

On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly a 
good idea.


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied


exactly that is not correct. That behavior is actively used by some 
userspace stacks as far as I know.


Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
Thanks Christian. Would it make it less intrusive if I just use the 
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from this 
patch? For our use case, just the ttm_bo_mmap_obj change should 
suffice and we don't want to put any more workarounds in the user 
space (thunk, in our case).


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied so I wonder why even inherit the vma?


On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
When an application having open file access to a node forks, its shared
mappings also get reflected in the address space of the child process,
even though it cannot access them with the object permissions applied.
With the existing permission checks on the gem objects, it might be
reasonable to also create the VMAs with the VM_DONTCOPY flag so a user
space application doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK) system call to prevent the pages in the mapped range from
appearing in the address space of the child process. It also prevents
memory leaks due to additional reference counts on the mapped BOs in the
child process that prevented freeing the memory in the parent, which we
had previously worked around in user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint/restore
an application that had such inherited mappings in the child, which
confuse CRIU when it mmaps them on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this,
so this is needed only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something 
similar.


While it is illegal by the OpenGL specification and doesn't work for 
most userspace stacks, we do have some implementations which call 
fork() with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
 			goto err_drm_gem_object_put;
 		}
 
-		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+				| VM_DONTDUMP | VM_DONTCOPY;
 		vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 		vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
 	vma->vm_private_data = bo;
 
-	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
 	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
 	return 0;
 }






Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

2021-12-09 Thread Andrey Grodzovsky



On 2021-12-09 4:00 a.m., Christian König wrote:



Am 09.12.21 um 09:49 schrieb Lang Yu:

It is useful to maintain error context when debugging
SW/FW issues. We introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compared to a simple hang, the system will keep stable
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

Suggested-by: Christian Koenig 
Suggested-by: Andrey Grodzovsky 
Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++
  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c5cfe2926ca1..3f5f8f62aa5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
  void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
  struct amdgpu_ring *ring);
  +void amdgpu_device_halt(struct amdgpu_device *adev);
+
  /* atpx handler */
  #if defined(CONFIG_VGA_SWITCHEROO)
  void amdgpu_register_atpx_handler(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a1c14466f23d..62216627cc83 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
    amdgpu_asic_invalidate_hdp(adev, ring);
  }
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. It will help to maintain error context when an error occurs.
+ * Compared to a simple hang, the system will keep stable at least for SSH
+ * access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as follows:
+ *
+ * 1. drm_dev_unplug() makes the device inaccessible to user space (IOCTLs,
+ *    etc), clears all CPU mappings to the device, disallows remappings
+ *    through page faults
+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 5. pci_disable_device() and pci_wait_for_pending_transaction()
+ *    flush any in-flight DMA operations
+ * 6. set adev->no_hw_access to true
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+    struct pci_dev *pdev = adev->pdev;
+    struct drm_device *ddev = &adev->ddev;
+
+    drm_dev_unplug(ddev);
+
+    amdgpu_irq_disable_all(adev);
+
+    amdgpu_fence_driver_hw_fini(adev);
+
+    amdgpu_device_unmap_mmio(adev);



Note that this one will cause a page fault on any subsequent MMIO access
(through registers or by direct VRAM access)





+
+    pci_disable_device(pdev);
+    pci_wait_for_pending_transaction(pdev);
+
+    adev->no_hw_access = true;


I think we need to reorder this, e.g. set adev->no_hw_access much
earlier. Andrey, what do you think?



Earlier can be ok, but at least after the last HW configuration we 
actually want to do, like disabling IRQs.


Andrey



Apart from that sounds like the right idea to me.

Regards,
Christian.


+}
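
For reference, a minimal sketch (hypothetical helper, not part of this
patch) of the drm_dev_enter()/drm_dev_exit() guard that callers need once
drm_dev_unplug() has been issued, which addresses the MMIO page fault
concern above:

#include <linux/io.h>
#include <drm/drm_drv.h>

/* Hypothetical register read that stays safe after drm_dev_unplug():
 * the critical section is skipped once the device is unplugged, so the
 * unmapped MMIO space is never touched. 'offset' is in bytes. */
static u32 example_rreg(struct amdgpu_device *adev, u32 offset)
{
	u32 val = 0;
	int idx;

	if (drm_dev_enter(adev_to_drm(adev), &idx)) {
		val = readl(adev->rmmio + offset);
		drm_dev_exit(idx);
	}
	return val;
}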




Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Bhardwaj, Rajneesh
Thanks Christian. Would it make it less intrusive if I just use the flag 
for ttm bo mmap and remove the drm_gem_mmap_obj change from this patch? 
For our use case, just the ttm_bo_mmap_obj change should suffice and we 
don't want to put any more workarounds in the user space (thunk, in our 
case).


The child cannot access the BOs mapped by the parent anyway with access 
restrictions applied so I wonder why even inherit the vma?


On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:

When an application having open file access to a node forks, its shared
mappings also get reflected in the address space of the child process,
even though it cannot access them with the object permissions applied.
With the existing permission checks on the gem objects, it might be
reasonable to also create the VMAs with the VM_DONTCOPY flag so a user
space application doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK) system call to prevent the pages in the mapped range from
appearing in the address space of the child process. It also prevents
memory leaks due to additional reference counts on the mapped BOs in the
child process that prevented freeing the memory in the parent, which we
had previously worked around in user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint/restore
an application that had such inherited mappings in the child, which
confuse CRIU when it mmaps them on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this,
so this is needed only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something 
similar.


While it is illegal by the OpenGL specification and doesn't work for 
most userspace stacks, we do have some implementations which call 
fork() with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
 			goto err_drm_gem_object_put;
 		}
 
-		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+				| VM_DONTDUMP | VM_DONTCOPY;
 		vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 		vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
 	vma->vm_private_data = bo;
 
-	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
 	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
 	return 0;
 }




[PATCH v4 2/6] Documentation/gpu: Document amdgpu_dm_visual_confirm debugfs entry

2021-12-09 Thread Rodrigo Siqueira
Display core provides a feature that makes it easy for users to debug
Multiple planes by enabling a visual notification at the bottom of each
plane. This commit introduces how to use such a feature.

Signed-off-by: Rodrigo Siqueira 
---
 Documentation/gpu/amdgpu/display/dc-debug.rst | 34 ++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/Documentation/gpu/amdgpu/display/dc-debug.rst 
b/Documentation/gpu/amdgpu/display/dc-debug.rst
index bbb8c3fc8eee..532cbbd64863 100644
--- a/Documentation/gpu/amdgpu/display/dc-debug.rst
+++ b/Documentation/gpu/amdgpu/display/dc-debug.rst
@@ -1,4 +1,36 @@
+
 Display Core Debug tools
 ========================
 
-TODO
+DC Debugfs
+==========
+
+Multiple Planes Debug
+---------------------
+
+If you want to enable or debug multiple planes in a specific user-space
+application, you can leverage a debug feature named visual confirm. For
+enabling it, you will need::
+
+  echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
+
+You need to reload your GUI to see the visual confirmation. When the plane
+configuration changes or a full update occurs, there will be a colored bar at
+the bottom of each hardware plane being drawn on the screen.
+
+* The color indicates the format - For example, red is AR24 and green is NV12
+* The height of the bar indicates the index of the plane
+* Pipe split can be observed if there are two bars with a difference in height
+  covering the same plane
+
+Consider the video playback case in which a video is played in a specific
+plane, and the desktop is drawn in another plane. The video plane should
+feature one or two green bars at the bottom of the video depending on pipe
+split configuration.
+
+* There should **not** be any visual corruption
+* There should **not** be any underflow or screen flashes
+* There should **not** be any black screens
+* There should **not** be any cursor corruption
+* Multiple planes **may** be briefly disabled during window transitions or
+  resizing but should come back after the action has finished
-- 
2.25.1



RE: [PATCH v4 2/6] drm: improve drm_buddy_alloc function

2021-12-09 Thread Paneer Selvam, Arunpravin
[AMD Official Use Only]

Hi Matthew,

Ping on this?

Regards,
Arun
-Original Message-
From: amd-gfx  On Behalf Of Arunpravin
Sent: Wednesday, December 1, 2021 10:10 PM
To: dri-de...@lists.freedesktop.org; intel-...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org
Cc: dan...@ffwll.ch; Paneer Selvam, Arunpravin 
; jani.nik...@linux.intel.com; 
matthew.a...@intel.com; tzimmerm...@suse.de; Deucher, Alexander 
; Koenig, Christian 
Subject: [PATCH v4 2/6] drm: improve drm_buddy_alloc function

- Make drm_buddy_alloc a single function to handle
  range allocation and non-range allocation demands

- Implemented a new function alloc_range() which allocates
  the requested power-of-two block comply with range limitations

- Moved order computation and memory alignment logic from
  i915 driver to drm buddy

v2:
  merged below changes to keep the build unbroken
   - drm_buddy_alloc_range() becomes obsolete and may be removed
   - enable ttm range allocation (fpfn / lpfn) support in i915 driver
   - apply enhanced drm_buddy_alloc() function to i915 driver

v3(Matthew Auld):
  - Fix alignment issues and remove unnecessary list_empty check
  - add more validation checks for input arguments
  - make alloc_range() block allocations as bottom-up
  - optimize order computation logic
  - replace uint64_t with u64, which is preferred in the kernel

v4(Matthew Auld):
  - keep drm_buddy_alloc_range() function implementation for generic
actual range allocations
  - keep alloc_range() implementation for end bias allocations

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/drm_buddy.c   | 316 +-
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  67 ++--
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   2 +
 include/drm/drm_buddy.h   |  22 +-
 4 files changed, 285 insertions(+), 122 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 9340a4b61c5a..7f47632821f4 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -280,23 +280,97 @@ void drm_buddy_free_list(struct drm_buddy_mm *mm, struct list_head *objects)
 }
 EXPORT_SYMBOL(drm_buddy_free_list);
 
-/**
- * drm_buddy_alloc - allocate power-of-two blocks
- *
- * @mm: DRM buddy manager to allocate from
- * @order: size of the allocation
- *
- * The order value here translates to:
- *
- * 0 = 2^0 * mm->chunk_size
- * 1 = 2^1 * mm->chunk_size
- * 2 = 2^2 * mm->chunk_size
- *
- * Returns:
- * allocated ptr to the _buddy_block on success
- */
-struct drm_buddy_block *
-drm_buddy_alloc(struct drm_buddy_mm *mm, unsigned int order)
+static inline bool overlaps(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+	return s1 <= e2 && e1 >= s2;
+}
+
+static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+	return s1 <= s2 && e1 >= e2;
+}
+
+static struct drm_buddy_block *
+alloc_range_bias(struct drm_buddy_mm *mm,
+		 u64 start, u64 end,
+		 unsigned int order)
+{
+	struct drm_buddy_block *block;
+	struct drm_buddy_block *buddy;
+	LIST_HEAD(dfs);
+	int err;
+	int i;
+
+	end = end - 1;
+
+	for (i = 0; i < mm->n_roots; ++i)
+		list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+	do {
+		u64 block_start;
+		u64 block_end;
+
+		block = list_first_entry_or_null(&dfs,
+						 struct drm_buddy_block,
+						 tmp_link);
+		if (!block)
+			break;
+
+		list_del(&block->tmp_link);
+
+		if (drm_buddy_block_order(block) < order)
+			continue;
+
+		block_start = drm_buddy_block_offset(block);
+		block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
+		if (!overlaps(start, end, block_start, block_end))
+			continue;
+
+		if (drm_buddy_block_is_allocated(block))
+			continue;
+
+		if (contains(start, end, block_start, block_end) &&
+		    order == drm_buddy_block_order(block)) {
+			/*
+			 * Find the free block within the range.
+			 */
+			if (drm_buddy_block_is_free(block))
+				return block;
+
+			continue;
+		}
+
+		if (!drm_buddy_block_is_split(block)) {
+			err = split_block(mm, block);
+			if (unlikely(err))
+				goto err_undo;
+		}
+
+		list_add(&block->right->tmp_link, &dfs);
+		list_add(&block->left->tmp_link, &dfs);
+	} while (1);
+
+	return ERR_PTR(-ENOSPC);
+
+err_undo:
+	/*
+	 * We really don't want to leave around a bunch of split blocks, since
+	 * bigger is better, so make sure we merge everything back before
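
A quick standalone sanity check (user-space sketch, not part of the patch)
of the inclusive-end convention used above: alloc_range_bias() rewrites the
caller's exclusive end with "end = end - 1", so overlaps() and contains()
compare inclusive ranges on both sides:

#include <assert.h>

typedef unsigned long long u64;

/* Same predicate as in the patch, on inclusive ranges [s, e]. */
static int overlaps(u64 s1, u64 e1, u64 s2, u64 e2)
{
	return s1 <= e2 && e1 >= s2;
}

int main(void)
{
	/* Inclusive ranges [0, 4095] and [4096, 8191] do not overlap... */
	assert(!overlaps(0, 4095, 4096, 8191));
	/* ...but they would if the first end were left exclusive (4096). */
	assert(overlaps(0, 4096, 4096, 8191));
	return 0;
}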

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Christian König

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly a 
good idea.


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied


exactly that is not correct. That behavior is actively used by some 
userspace stacks as far as I know.


Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
Thanks Christian. Would it make it less intrusive if I just use the 
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from this 
patch? For our use case, just the ttm_bo_mmap_obj change should 
suffice and we don't want to put any more workarounds in the user 
space (thunk, in our case).


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied so I wonder why even inherit the vma?


On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:

When an application having open file access to a node forks, its shared
mappings also get reflected in the address space of the child process,
even though it cannot access them with the object permissions applied.
With the existing permission checks on the gem objects, it might be
reasonable to also create the VMAs with the VM_DONTCOPY flag so a user
space application doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK) system call to prevent the pages in the mapped range from
appearing in the address space of the child process. It also prevents
memory leaks due to additional reference counts on the mapped BOs in the
child process that prevented freeing the memory in the parent, which we
had previously worked around in user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint/restore
an application that had such inherited mappings in the child, which
confuse CRIU when it mmaps them on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this,
so this is needed only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something 
similar.


While it is illegal by the OpenGL specification and doesn't work for 
most userspace stacks, we do have some implementations which call 
fork() with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
 			goto err_drm_gem_object_put;
 		}
 
-		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+				| VM_DONTDUMP | VM_DONTCOPY;
 		vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 		vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
 	vma->vm_private_data = bo;
 
-	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
 	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
 	return 0;
 }






[PATCH v2 2/2] drm/amdgpu: Reduce SG bo memory usage for mGPUs

2021-12-09 Thread Philip Yang
For a userptr BO, if the adev is not in IOMMU isolation mode, RAM is
direct-mapped to the GPU and multiple GPUs use the same system memory
DMA mapping address, so they can share the original mem->bo in the
attachment to reduce DMA address array memory usage.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index b8490789eef4..f9bab963a948 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -708,10 +708,12 @@ static int kfd_mem_attach(struct amdgpu_device *adev, 
struct kgd_mem *mem,
pr_debug("\t add VA 0x%llx - 0x%llx to vm %p\n", va,
 va + bo_size, vm);
 
-   if (adev == bo_adev || (mem->domain == AMDGPU_GEM_DOMAIN_VRAM &&
-   amdgpu_xgmi_same_hive(adev, bo_adev))) {
-   /* Mappings on the local GPU and VRAM mappings in the
-* local hive share the original BO
+	if (adev == bo_adev ||
+	   (amdgpu_ttm_tt_get_usermm(mem->bo->tbo.ttm) && adev->ram_is_direct_mapped) ||
+	   (mem->domain == AMDGPU_GEM_DOMAIN_VRAM && amdgpu_xgmi_same_hive(adev, bo_adev))) {
+		/* Mappings on the local GPU, or VRAM mappings in the
+		 * local hive, or userptr mapping IOMMU direct map mode
+		 * share the original BO
+		 */
attachment[i]->type = KFD_MEM_ATT_SHARED;
bo[i] = mem->bo;
-- 
2.17.1
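
The three-way condition reads more easily pulled out into a predicate; a
sketch only (the helper name is hypothetical, the logic mirrors the patch):

static bool can_share_original_bo(struct amdgpu_device *adev,
				  struct amdgpu_device *bo_adev,
				  struct kgd_mem *mem)
{
	/* Mappings on the local GPU always share the original BO. */
	if (adev == bo_adev)
		return true;
	/* Userptr BOs share when RAM is direct-mapped to the GPU. */
	if (amdgpu_ttm_tt_get_usermm(mem->bo->tbo.ttm) &&
	    adev->ram_is_direct_mapped)
		return true;
	/* VRAM mappings share within the same XGMI hive. */
	return mem->domain == AMDGPU_GEM_DOMAIN_VRAM &&
	       amdgpu_xgmi_same_hive(adev, bo_adev);
}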



[PATCH v4 6/6] Documentation/gpu: Add amdgpu and dc glossary

2021-12-09 Thread Rodrigo Siqueira
In the DC driver, we have multiple acronyms that are not obvious most of
the time; the same idea is valid for amdgpu. This commit introduces a DC
and amdgpu glossary in order to make it easier to navigate through our
driver.

Changes since V3:
 - Yann: Add new acronyms to amdgpu glossary
 - Daniel: Add link between dc and amdgpu glossary

Changes since V2:
 - Add MMHUB

Changes since V1:
 - Yann: Divide glossary based on driver context.
 - Alex: Make terms more consistent and update CPLIB
 - Add new acronyms to the glossary

Signed-off-by: Rodrigo Siqueira 
---
 Documentation/gpu/amdgpu/amdgpu-glossary.rst  |  87 +++
 .../gpu/amdgpu/display/dc-glossary.rst| 237 ++
 Documentation/gpu/amdgpu/display/index.rst|   1 +
 Documentation/gpu/amdgpu/index.rst|   7 +
 4 files changed, 332 insertions(+)
 create mode 100644 Documentation/gpu/amdgpu/amdgpu-glossary.rst
 create mode 100644 Documentation/gpu/amdgpu/display/dc-glossary.rst

diff --git a/Documentation/gpu/amdgpu/amdgpu-glossary.rst 
b/Documentation/gpu/amdgpu/amdgpu-glossary.rst
new file mode 100644
index ..859dcec6c6f9
--- /dev/null
+++ b/Documentation/gpu/amdgpu/amdgpu-glossary.rst
@@ -0,0 +1,87 @@
+===============
+AMDGPU Glossary
+===============
+
+Here you can find some generic acronyms used in the amdgpu driver. Notice that
+we have a dedicated glossary for Display Core at
+'Documentation/gpu/amdgpu/display/dc-glossary.rst'.
+
+.. glossary::
+
+CP
+  Command Processor
+
+CPLIB
+  Content Protection Library
+
+DFS
+  Digital Frequency Synthesizer
+
+ECP
+  Enhanced Content Protection
+
+EOP
+  End Of Pipe/Pipeline
+
+GC
+  Graphics and Compute
+
+GMC
+  Graphic Memory Controller
+
+IH
+  Interrupt Handler
+
+HQD
+  Hardware Queue Descriptor
+
+IB
+  Indirect Buffer
+
+IP
+  Intellectual Property blocks
+
+KCQ
+  Kernel Compute Queue
+
+KGQ
+  Kernel Graphics Queue
+
+KIQ
+  Kernel Interface Queue
+
+MEC
+  MicroEngine Compute
+
+MES
+  MicroEngine Scheduler
+
+MMHUB
+  Multi-Media HUB
+
+MQD
+  Memory Queue Descriptor
+
+PPLib
+  PowerPlay Library - PowerPlay is the power management component.
+
+PSP
+  Platform Security Processor
+
+RCL
+  RunList Controller
+
+SDMA
+  System DMA
+
+SMU
+  System Management Unit
+
+SS
+  Spread Spectrum
+
+VCE
+  Video Compression Engine
+
+VCN
+  Video Codec Next
diff --git a/Documentation/gpu/amdgpu/display/dc-glossary.rst 
b/Documentation/gpu/amdgpu/display/dc-glossary.rst
new file mode 100644
index ..116f5f0942fd
--- /dev/null
+++ b/Documentation/gpu/amdgpu/display/dc-glossary.rst
@@ -0,0 +1,237 @@
+===========
+DC Glossary
+===========
+
+On this page, we try to keep track of acronyms related to the display
+component. If you do not find what you are looking for, look at the
+'Documentation/gpu/amdgpu/amdgpu-glossary.rst'; if you cannot find it anywhere,
+consider asking on the amd-gfx mailing list and updating this page.
+
+.. glossary::
+
+ABM
+  Adaptive Backlight Modulation
+
+APU
+  Accelerated Processing Unit
+
+ASIC
+  Application-Specific Integrated Circuit
+
+ASSR
+  Alternate Scrambler Seed Reset
+
+AZ
+  Azalia (HD audio DMA engine)
+
+BPC
+  Bits Per Colour/Component
+
+BPP
+  Bits Per Pixel
+
+Clocks
+  * PCLK: Pixel Clock
+  * SYMCLK: Symbol Clock
+  * SOCCLK: GPU Engine Clock
+  * DISPCLK: Display Clock
+  * DPPCLK: DPP Clock
+  * DCFCLK: Display Controller Fabric Clock
+  * REFCLK: Real Time Reference Clock
+  * PPLL: Pixel PLL
+  * FCLK: Fabric Clock
+  * MCLK: Memory Clock
+
+CRC
+  Cyclic Redundancy Check
+
+CRTC
+  Cathode Ray Tube Controller - commonly called "Controller" - Generates
+  raw stream of pixels, clocked at pixel clock
+
+CVT
+  Coordinated Video Timings
+
+DAL
+  Display Abstraction layer
+
+DC (Software)
+  Display Core
+
+DC (Hardware)
+  Display Controller
+
+DCC
+  Delta Colour Compression
+
+DCE
+  Display Controller Engine
+
+DCHUB
+  Display Controller HUB
+
+ARB
+  Arbiter
+
+VTG
+  Vertical Timing Generator
+
+DCN
+  Display Core Next
+
+DCCG
+  Display Clock Generator block
+
+DDC
+  Display Data Channel
+
+DIO
+  Display IO
+
+DPP
+  Display Pipes and Planes
+
+DSC
+  Display Stream Compression (Reduce the amount of bits to represent pixel
+  count while at the same pixel clock)
+
+dGPU
+  discrete GPU
+
+DMIF
+  Display Memory Interface
+
+DML
+  Display Mode Library
+
+DMCU
+  Display Micro-Controller Unit
+
+DMCUB
+  Display Micro-Controller Unit, version B
+
+DPCD
+  DisplayPort Configuration Data
+
+DPM(S)
+  

[PATCH v2 00/10] drm/amd: fix various compilation warnings

2021-12-09 Thread Isabella Basso
This patchset aims at fixing various compilation warnings in the AMD GPU
driver. All warnings were generated using gcc and the W=1 flag. I
decided to deal with them in the same order as the issues were presented
in the log, with the exception of those that were about the lack of
prototypes, which were gathered by a script [1].

Some of these patches were already applied [2], so not all are being
sent in this new version.

Changes since v1:
- Made amdgpu_ras_mca_query_error_status static instead of prototyping
  it in patch 3/10
- Rewrote function signatures in patch 6/10
- Removed unused functions in patch 6/10
- Removed more unnecessary code in patch 9/10
- Reduced patch 10/10 to a minimum

[1] - https://pad.riseup.net/p/ZMkzoeO89Kt7R_IC4iAo-keep
[2] - https://patchwork.freedesktop.org/series/97701

Isabella Basso (10):
  drm/amd: Mark IP_BASE definition as __maybe_unused
  drm/amd: fix improper docstring syntax
  drm/amdgpu: fix amdgpu_ras_mca_query_error_status scope
  drm/amdgpu: fix function scopes
  drm/amdkfd: fix function scopes
  drm/amd/display: fix function scopes
  drm/amd: append missing includes
  drm/amdgpu: fix location of prototype for amdgpu_kms_compat_ioctl
  drm/amdgpu: remove unnecessary variables
  drm/amdgpu: re-format file header comments

 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  2 -
 .../gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c  |  4 +-
 .../gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c  |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.h   |  3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ioc32.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_pll.c   |  2 +
 .../gpu/drm/amd/amdgpu/amdgpu_preempt_mgr.c   |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 12 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c   |  1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c  |  4 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c|  2 -
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c|  2 -
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c  |  4 +-
 .../drm/amd/amdkfd/kfd_packet_manager_vi.c|  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  |  5 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 11 ++-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 ++--
 .../amd/display/amdgpu_dm/amdgpu_dm_color.c   |  4 +
 .../gpu/drm/amd/display/dc/calcs/dcn_calcs.c  |  4 +-
 .../display/dc/clk_mgr/dcn10/rv1_clk_mgr.c|  2 +-
 .../dc/clk_mgr/dcn10/rv1_clk_mgr_vbios_smu.c  |  2 +
 .../display/dc/clk_mgr/dcn20/dcn20_clk_mgr.c  |  2 +-
 .../dc/clk_mgr/dcn201/dcn201_clk_mgr.c| 43 +---
 .../amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c | 23 +
 .../dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.c   |  6 +-
 .../display/dc/clk_mgr/dcn301/dcn301_smu.c|  6 +-
 .../display/dc/clk_mgr/dcn301/vg_clk_mgr.c| 20 +---
 .../display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c  |  7 +-
 .../amd/display/dc/clk_mgr/dcn31/dcn31_smu.c  |  6 +-
 drivers/gpu/drm/amd/display/dc/core/dc_link.c |  3 +-
 .../display/dc/dce110/dce110_hw_sequencer.c   |  2 +
 .../gpu/drm/amd/display/dc/dcn10/dcn10_dpp.c  |  8 --
 .../drm/amd/display/dc/dcn10/dcn10_dpp_dscl.c | 97 ---
 .../amd/display/dc/dcn10/dcn10_hw_sequencer.c | 29 +++---
 .../gpu/drm/amd/display/dc/dcn10/dcn10_opp.c  | 30 --
 .../gpu/drm/amd/display/dc/dcn10/dcn10_optc.c | 20 +---
 .../drm/amd/display/dc/dcn10/dcn10_resource.c | 18 ++--
 .../gpu/drm/amd/display/dc/dcn20/dcn20_dpp.c  | 14 ---
 .../drm/amd/display/dc/dcn20/dcn20_dwb_scl.c  |  4 +-
 .../gpu/drm/amd/display/dc/dcn20/dcn20_hubp.c |  7 +-
 .../drm/amd/display/dc/dcn20/dcn20_hwseq.c|  6 +-
 .../gpu/drm/amd/display/dc/dcn20/dcn20_init.c |  2 +
 .../gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c  |  9 +-
 .../gpu/drm/amd/display/dc/dcn20/dcn20_optc.c | 57 +--
 .../drm/amd/display/dc/dcn201/dcn201_dccg.c   |  3 +-
 .../drm/amd/display/dc/dcn201/dcn201_hubp.c   |  7 +-
 .../display/dc/dcn201/dcn201_link_encoder.c   |  6 +-
 .../amd/display/dc/dcn201/dcn201_resource.c   | 16 ++-
 .../drm/amd/display/dc/dcn21/dcn21_hubbub.c   |  2 +-
 .../gpu/drm/amd/display/dc/dcn21/dcn21_hubp.c | 15 +--
 .../gpu/drm/amd/display/dc/dcn21/dcn21_init.c |  2 +
 .../amd/display/dc/dcn21/dcn21_link_encoder.c |  9 +-
 .../drm/amd/display/dc/dcn21/dcn21_resource.c | 31 +++---
 .../dc/dcn30/dcn30_dio_stream_encoder.c   | 18 +---
 .../gpu/drm/amd/display/dc/dcn30/dcn30_dpp.c  | 36 ++-
 .../gpu/drm/amd/display/dc/dcn30/dcn30_init.c |  2 +
 .../drm/amd/display/dc/dcn30/dcn30_mmhubbub.c |  2 +-
 .../gpu/drm/amd/display/dc/dcn30/dcn30_mpc.c  |  2 +-
 .../drm/amd/display/dc/dcn30/dcn30_resource.c | 12 +--
 .../drm/amd/display/dc/dcn301/dcn301_init.c   |  2 +
 .../amd/display/dc/dcn301/dcn301_panel_cntl.c | 10 +-
 .../amd/display/dc/dcn301/dcn301_resource.c   | 45 -
 .../drm/amd/display/dc/dcn302/dcn302_init.c   |  2 +
 .../drm/amd/display/dc/dcn303/dcn303_init.c   |  2 +
 

[PATCH v4 4/6] Documentation/gpu: How to collect DTN log

2021-12-09 Thread Rodrigo Siqueira
Introduce how to collect DTN log from debugfs.

Signed-off-by: Rodrigo Siqueira 
---
 Documentation/gpu/amdgpu/display/dc-debug.rst | 17 +
 1 file changed, 17 insertions(+)

diff --git a/Documentation/gpu/amdgpu/display/dc-debug.rst 
b/Documentation/gpu/amdgpu/display/dc-debug.rst
index 6dbd21f7f59e..40c55a618918 100644
--- a/Documentation/gpu/amdgpu/display/dc-debug.rst
+++ b/Documentation/gpu/amdgpu/display/dc-debug.rst
@@ -58,3 +58,20 @@ In this case, if you have a pipe split, you will see one small red bar at the
 bottom of the display covering the entire display width and another bar
 covering the second pipe. In other words, you will see a slightly higher bar
 in the second pipe.
+
+DTN Debug
+=========
+
+DC (DCN) provides an extensive log that dumps multiple details from our
+hardware configuration. You can capture those status values via the
+Display Test Next (DTN) log in debugfs by using::
+
+  cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log
+
+Since this log is updated accordingly with DCN status, you can also follow the
+change in real-time by using something like::
+
+  sudo watch -d cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log
+
+When reporting a bug related to DC, consider attaching this log before and
+after you reproduce the bug.
-- 
2.25.1



[PATCH v4 5/6] Documentation/gpu: Add basic overview of DC pipeline

2021-12-09 Thread Rodrigo Siqueira
This commit describes how DCN works by providing high-level diagrams
with an explanation of each component. In particular, it details the
Global Sync signals.

Change since V2:
 - Add a comment about MMHUBBUB.

Signed-off-by: Rodrigo Siqueira 
---
 .../gpu/amdgpu/display/config_example.svg |  414 ++
 .../amdgpu/display/dc_pipeline_overview.svg   | 1125 +
 .../gpu/amdgpu/display/dcn-overview.rst   |  171 +++
 .../gpu/amdgpu/display/global_sync_vblank.svg |  485 +++
 Documentation/gpu/amdgpu/display/index.rst|   23 +-
 5 files changed, 2206 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/gpu/amdgpu/display/config_example.svg
 create mode 100644 Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg
 create mode 100644 Documentation/gpu/amdgpu/display/dcn-overview.rst
 create mode 100644 Documentation/gpu/amdgpu/display/global_sync_vblank.svg

diff --git a/Documentation/gpu/amdgpu/display/config_example.svg 
b/Documentation/gpu/amdgpu/display/config_example.svg
new file mode 100644
index ..cdac9858601c
--- /dev/null
+++ b/Documentation/gpu/amdgpu/display/config_example.svg
@@ -0,0 +1,414 @@
+  [414 lines of SVG markup, mangled beyond repair in this archive.
+   Recoverable labels from the diagram: "Configurations", "A", "B", "C",
+   "Old config", "VUpdate", "UpdateLock", "Register update Pending
+   Status", "Buf 0", "Buf 1".]
diff --git a/Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg 
b/Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg
new file mode 100644
index ..9adecebfe65b
--- /dev/null
+++ b/Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg
@@ -0,0 +1,1125 @@
+  [1125 lines of SVG markup, mangled beyond repair in this archive.
+   Recoverable labels from the DC pipeline diagram: "DCHUB", "HUBP(n)",
+   "DPP(n)", "MPC", "OPTC", "OPP", "DIO", "DCCG", "DMU", "AZ",
+   "MMHUBBUB", "DWB(n)", "SDP", "Monitor"; arrow legend: "Global sync",
+   "Pixel data", "Sideband signal", "Config. Bus"; code structs:
+   "dc_plane", "dc_stream", "dc_state", "dc_link"; notes: "Floating
+   point calculation", "bit-depth reduction/dither".]
diff --git a/Documentation/gpu/amdgpu/display/dcn-overview.rst 
b/Documentation/gpu/amdgpu/display/dcn-overview.rst
new file mode 100644
index ..f98624d7828e
--- /dev/null
+++ b/Documentation/gpu/amdgpu/display/dcn-overview.rst
@@ -0,0 +1,171 @@

[PATCH v4 0/6] Expand display core documentation

2021-12-09 Thread Rodrigo Siqueira
Display Core (DC) is one of the components under amdgpu, and it has
multiple features directly related to the KMS API. Unfortunately, we
don't have enough documentation about DC in the upstream, which makes
the life of some external contributors a little bit more challenging.
For these reasons, this patchset reworks part of the DC documentation
and introduces a new set of details on how the display core works on DCN
IP. Another improvement that this documentation effort tries to bring is
making explicit some of our hardware-specific details to guide
user-space developers better.

In my view, it is easier to review this series if you apply it in your
local kernel and build the HTML version (make htmldocs). I'm suggesting
this approach because I added a few SVG diagrams that will be easier to
see in the HTML version. If you cannot build the documentation, try to
open the SVG images while reviewing the content. In summary, in this
series, you will find:

1. Patch 1: Re-arrange of display core documentation. This is
   preparation work for the other patches, but it is also a way to expand
   this documentation.
2. Patch 2 to 4: Document some common debug options related to display.
3. Patch 5: This patch provides an overview of how our display core next
   works and a brief explanation of each component.
4. Patch 6: We use a lot of acronyms in our driver; for this reason, we
   exposed a glossary with common terms used by display core.

Please let us know what you think we can improve this series and what
kind of content you want to see for the next series.

Changes since V3:
 - Add new acronyms to amdgpu glossary
 - Add link between dc and amdgpu glossary
Changes since V2:
 - Add a comment about MMHUBBUB
Changes since V1:
 - Group amdgpu documentation together.
 - Create index pages.
 - Mirror display folder in the documentation.
 - Divide glossary based on driver context.
 - Make terms more consistent and update CPLIB
 - Add new acronyms to the glossary

Thanks
Siqueira

Rodrigo Siqueira (6):
  Documentation/gpu: Reorganize DC documentation
  Documentation/gpu: Document amdgpu_dm_visual_confirm debugfs entry
  Documentation/gpu: Document pipe split visual confirmation
  Documentation/gpu: How to collect DTN log
  Documentation/gpu: Add basic overview of DC pipeline
  Documentation/gpu: Add amdgpu and dc glossary

 Documentation/gpu/amdgpu-dc.rst   |   74 --
 Documentation/gpu/amdgpu/amdgpu-glossary.rst  |   87 ++
 .../gpu/amdgpu/display/config_example.svg |  414 ++
 Documentation/gpu/amdgpu/display/dc-debug.rst |   77 ++
 .../gpu/amdgpu/display/dc-glossary.rst|  237 
 .../amdgpu/display/dc_pipeline_overview.svg   | 1125 +
 .../gpu/amdgpu/display/dcn-overview.rst   |  171 +++
 .../gpu/amdgpu/display/display-manager.rst|   42 +
 .../gpu/amdgpu/display/global_sync_vblank.svg |  485 +++
 Documentation/gpu/amdgpu/display/index.rst|   29 +
 .../gpu/{amdgpu.rst => amdgpu/index.rst}  |   25 +-
 Documentation/gpu/drivers.rst |3 +-
 12 files changed, 2690 insertions(+), 79 deletions(-)
 delete mode 100644 Documentation/gpu/amdgpu-dc.rst
 create mode 100644 Documentation/gpu/amdgpu/amdgpu-glossary.rst
 create mode 100644 Documentation/gpu/amdgpu/display/config_example.svg
 create mode 100644 Documentation/gpu/amdgpu/display/dc-debug.rst
 create mode 100644 Documentation/gpu/amdgpu/display/dc-glossary.rst
 create mode 100644 Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg
 create mode 100644 Documentation/gpu/amdgpu/display/dcn-overview.rst
 create mode 100644 Documentation/gpu/amdgpu/display/display-manager.rst
 create mode 100644 Documentation/gpu/amdgpu/display/global_sync_vblank.svg
 create mode 100644 Documentation/gpu/amdgpu/display/index.rst
 rename Documentation/gpu/{amdgpu.rst => amdgpu/index.rst} (95%)

-- 
2.25.1



[PATCH v4 3/6] Documentation/gpu: Document pipe split visual confirmation

2021-12-09 Thread Rodrigo Siqueira
Display core provides a feature that makes it easy for users to debug
Pipe Split. This commit introduces how to use such a debug option.

Signed-off-by: Rodrigo Siqueira 
---
 Documentation/gpu/amdgpu/display/dc-debug.rst | 28 +--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/Documentation/gpu/amdgpu/display/dc-debug.rst 
b/Documentation/gpu/amdgpu/display/dc-debug.rst
index 532cbbd64863..6dbd21f7f59e 100644
--- a/Documentation/gpu/amdgpu/display/dc-debug.rst
+++ b/Documentation/gpu/amdgpu/display/dc-debug.rst
@@ -2,8 +2,18 @@
 Display Core Debug tools
 ========================
 
-DC Debugfs
-==========
+DC Visual Confirmation
+======================
+
+Display core provides a feature named visual confirmation, which is a set of
+bars added at the scanout time by the driver to convey some specific
+information. In general, you can enable this debug option by using::
+
+  echo <N> > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
+
+Where `N` is an integer selecting the specific scenario the developer
+wants to enable. You will see some of these debug cases in the following
+subsections.
 
 Multiple Planes Debug
 ---------------------
@@ -34,3 +44,17 @@ split configuration.
 * There should **not** be any cursor corruption
 * Multiple planes **may** be briefly disabled during window transitions or
   resizing but should come back after the action has finished
+
+Pipe Split Debug
+----------------
+
+Sometimes we need to debug if DCN is splitting pipes correctly, and visual
+confirmation is also handy for this case. Similar to the MPO case, you can use
+the below command to enable visual confirmation::
+
+  echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
+
+In this case, if you have a pipe split, you will see one small red bar at the
+bottom of the display covering the entire display width and another bar
+covering the second pipe. In other words, you will see a slightly higher bar
+in the second pipe.
-- 
2.25.1



Re: [PATCH] drm/amdgpu: don't skip runtime pm get on A+A config

2021-12-09 Thread Christian König

Am 07.12.21 um 08:40 schrieb Quan, Evan:

[AMD Official Use Only]

-Original Message-
From: Christian König 
Sent: Tuesday, December 7, 2021 3:03 PM
To: Quan, Evan ; Deucher, Alexander

Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: don't skip runtime pm get on A+A config

You are looking at outdated code, that stuff is gone by now.
amd-staging-drm-next probably needs a rebase.

Yep, I can see it in the vanilla kernel.
The patch is acked-by: Evan Quan 


Thanks.

Alex any objections that I push this to drm-misc-next? It was found 
while working on changes already upstream in that function and would 
conflict if we push it through amd-staging-drm-next.


Regards,
Christian.



BR
Evan

And this code was what the check was initially good for. Just skipping the PM
stuff as well on A+A was unintentional.

Regards,
Christian.

Am 07.12.21 um 02:58 schrieb Quan, Evan:

[AMD Official Use Only]

It seems more jobs (below) than just bumping the runpm counter are
performed.

Are they desired also?

r = __dma_resv_make_exclusive(bo->tbo.base.resv);
if (r)
goto out;

bo->prime_shared_count++;

BR
Evan

-Original Message-
From: amd-gfx  On Behalf Of
Christian König
Sent: Monday, December 6, 2021 4:46 PM
To: Deucher, Alexander 
Cc: amd-gfx@lists.freedesktop.org
Subject: [PATCH] drm/amdgpu: don't skip runtime pm get on A+A config

The runtime PM get was incorrectly added after the check.

Signed-off-by: Christian König 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 3 ---
   1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index ae6ab93c868b..4896c876ffec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -61,9 +61,6 @@ static int amdgpu_dma_buf_attach(struct dma_buf
*dmabuf,
	if (pci_p2pdma_distance_many(adev->pdev, &attach->dev, 1, true) < 0)
attach->peer2peer = false;

-   if (attach->dev->driver == adev->dev->driver)
-   return 0;
-
r = pm_runtime_get_sync(adev_to_drm(adev)->dev);
if (r < 0)
goto out;
--
2.25.1




Re: [PATCH V4 14/17] drm/amd/pm: relocate the power related headers

2021-12-09 Thread Lazar, Lijo




On 12/3/2021 8:35 AM, Evan Quan wrote:

Instead of centralizing all headers in the same folder, separate them into
different folders and place them next to the source files that really
need them.

Signed-off-by: Evan Quan 
Change-Id: Id74cb4c7006327ca7ecd22daf17321e417c4aa71
---
  drivers/gpu/drm/amd/pm/Makefile   | 10 +++---
  drivers/gpu/drm/amd/pm/legacy-dpm/Makefile| 32 +++
  .../pm/{powerplay => legacy-dpm}/cik_dpm.h|  0
  .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.c |  0
  .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.h |  0
  .../amd/pm/{powerplay => legacy-dpm}/kv_smc.c |  0
  .../pm/{powerplay => legacy-dpm}/legacy_dpm.c |  0
  .../pm/{powerplay => legacy-dpm}/legacy_dpm.h |  0
  .../amd/pm/{powerplay => legacy-dpm}/ppsmc.h  |  0
  .../pm/{powerplay => legacy-dpm}/r600_dpm.h   |  0
  .../amd/pm/{powerplay => legacy-dpm}/si_dpm.c |  0
  .../amd/pm/{powerplay => legacy-dpm}/si_dpm.h |  0
  .../amd/pm/{powerplay => legacy-dpm}/si_smc.c |  0
  .../{powerplay => legacy-dpm}/sislands_smc.h  |  0
  drivers/gpu/drm/amd/pm/powerplay/Makefile |  6 +---
  .../pm/{ => powerplay}/inc/amd_powerplay.h|  0
  .../drm/amd/pm/{ => powerplay}/inc/cz_ppsmc.h |  0
  .../amd/pm/{ => powerplay}/inc/fiji_ppsmc.h   |  0
  .../pm/{ => powerplay}/inc/hardwaremanager.h  |  0
  .../drm/amd/pm/{ => powerplay}/inc/hwmgr.h|  0
  .../{ => powerplay}/inc/polaris10_pwrvirus.h  |  0
  .../amd/pm/{ => powerplay}/inc/power_state.h  |  0
  .../drm/amd/pm/{ => powerplay}/inc/pp_debug.h |  0
  .../amd/pm/{ => powerplay}/inc/pp_endian.h|  0
  .../amd/pm/{ => powerplay}/inc/pp_thermal.h   |  0
  .../amd/pm/{ => powerplay}/inc/ppinterrupt.h  |  0
  .../drm/amd/pm/{ => powerplay}/inc/rv_ppsmc.h |  0
  .../drm/amd/pm/{ => powerplay}/inc/smu10.h|  0
  .../pm/{ => powerplay}/inc/smu10_driver_if.h  |  0
  .../pm/{ => powerplay}/inc/smu11_driver_if.h  |  0
  .../gpu/drm/amd/pm/{ => powerplay}/inc/smu7.h |  0
  .../drm/amd/pm/{ => powerplay}/inc/smu71.h|  0
  .../pm/{ => powerplay}/inc/smu71_discrete.h   |  0
  .../drm/amd/pm/{ => powerplay}/inc/smu72.h|  0
  .../pm/{ => powerplay}/inc/smu72_discrete.h   |  0
  .../drm/amd/pm/{ => powerplay}/inc/smu73.h|  0
  .../pm/{ => powerplay}/inc/smu73_discrete.h   |  0
  .../drm/amd/pm/{ => powerplay}/inc/smu74.h|  0
  .../pm/{ => powerplay}/inc/smu74_discrete.h   |  0
  .../drm/amd/pm/{ => powerplay}/inc/smu75.h|  0
  .../pm/{ => powerplay}/inc/smu75_discrete.h   |  0
  .../amd/pm/{ => powerplay}/inc/smu7_common.h  |  0
  .../pm/{ => powerplay}/inc/smu7_discrete.h|  0
  .../amd/pm/{ => powerplay}/inc/smu7_fusion.h  |  0
  .../amd/pm/{ => powerplay}/inc/smu7_ppsmc.h   |  0
  .../gpu/drm/amd/pm/{ => powerplay}/inc/smu8.h |  0
  .../amd/pm/{ => powerplay}/inc/smu8_fusion.h  |  0
  .../gpu/drm/amd/pm/{ => powerplay}/inc/smu9.h |  0
  .../pm/{ => powerplay}/inc/smu9_driver_if.h   |  0
  .../{ => powerplay}/inc/smu_ucode_xfer_cz.h   |  0
  .../{ => powerplay}/inc/smu_ucode_xfer_vi.h   |  0
  .../drm/amd/pm/{ => powerplay}/inc/smumgr.h   |  0
  .../amd/pm/{ => powerplay}/inc/tonga_ppsmc.h  |  0
  .../amd/pm/{ => powerplay}/inc/vega10_ppsmc.h |  0
  .../inc/vega12/smu9_driver_if.h   |  0
  .../amd/pm/{ => powerplay}/inc/vega12_ppsmc.h |  0
  .../amd/pm/{ => powerplay}/inc/vega20_ppsmc.h |  0
  .../amd/pm/{ => swsmu}/inc/aldebaran_ppsmc.h  |  0
  .../drm/amd/pm/{ => swsmu}/inc/amdgpu_smu.h   |  0
  .../amd/pm/{ => swsmu}/inc/arcturus_ppsmc.h   |  0
  .../inc/smu11_driver_if_arcturus.h|  0
  .../inc/smu11_driver_if_cyan_skillfish.h  |  0
  .../{ => swsmu}/inc/smu11_driver_if_navi10.h  |  0
  .../inc/smu11_driver_if_sienna_cichlid.h  |  0
  .../{ => swsmu}/inc/smu11_driver_if_vangogh.h |  0
  .../amd/pm/{ => swsmu}/inc/smu12_driver_if.h  |  0
  .../inc/smu13_driver_if_aldebaran.h   |  0
  .../inc/smu13_driver_if_yellow_carp.h |  0
  .../pm/{ => swsmu}/inc/smu_11_0_cdr_table.h   |  0
  .../drm/amd/pm/{ => swsmu}/inc/smu_types.h|  0
  .../drm/amd/pm/{ => swsmu}/inc/smu_v11_0.h|  0
  .../pm/{ => swsmu}/inc/smu_v11_0_7_ppsmc.h|  0
  .../pm/{ => swsmu}/inc/smu_v11_0_7_pptable.h  |  0
  .../amd/pm/{ => swsmu}/inc/smu_v11_0_ppsmc.h  |  0
  .../pm/{ => swsmu}/inc/smu_v11_0_pptable.h|  0
  .../amd/pm/{ => swsmu}/inc/smu_v11_5_pmfw.h   |  0
  .../amd/pm/{ => swsmu}/inc/smu_v11_5_ppsmc.h  |  0
  .../amd/pm/{ => swsmu}/inc/smu_v11_8_pmfw.h   |  0
  .../amd/pm/{ => swsmu}/inc/smu_v11_8_ppsmc.h  |  0
  .../drm/amd/pm/{ => swsmu}/inc/smu_v12_0.h|  0
  .../amd/pm/{ => swsmu}/inc/smu_v12_0_ppsmc.h  |  0
  .../drm/amd/pm/{ => swsmu}/inc/smu_v13_0.h|  0
  .../amd/pm/{ => swsmu}/inc/smu_v13_0_1_pmfw.h |  0
  .../pm/{ => swsmu}/inc/smu_v13_0_1_ppsmc.h|  0
  .../pm/{ => swsmu}/inc/smu_v13_0_pptable.h|  0
  .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  1 -
  .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  1 -
  87 files changed, 39 insertions(+), 11 deletions(-)
  create 

Re: [PATCH V4 11/17] drm/amd/pm: correct the usage for amdgpu_dpm_dispatch_task()

2021-12-09 Thread Lazar, Lijo




On 12/3/2021 8:35 AM, Evan Quan wrote:

We should avoid having multi-function APIs. It should be up to the caller
to determine when or whether to call amdgpu_dpm_dispatch_task().

Signed-off-by: Evan Quan 
Change-Id: I78ec4eb8ceb6e526a4734113d213d15a5fbaa8a4
---
  drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 18 ++
  drivers/gpu/drm/amd/pm/amdgpu_pm.c  | 26 --
  2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 6d9db2e2cbd3..32bf1247fb60 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -554,8 +554,6 @@ void amdgpu_dpm_set_power_state(struct amdgpu_device *adev,
enum amd_pm_state_type state)
  {
adev->pm.dpm.user_state = state;
-
-	amdgpu_dpm_dispatch_task(adev, AMD_PP_TASK_ENABLE_USER_STATE, &state);
  }
  
  enum amd_dpm_forced_level amdgpu_dpm_get_performance_level(struct amdgpu_device *adev)

@@ -723,13 +721,7 @@ int amdgpu_dpm_set_sclk_od(struct amdgpu_device *adev, 
uint32_t value)
if (!pp_funcs->set_sclk_od)
return -EOPNOTSUPP;
  
-	pp_funcs->set_sclk_od(adev->powerplay.pp_handle, value);
-
-	amdgpu_dpm_dispatch_task(adev,
-				 AMD_PP_TASK_READJUST_POWER_STATE,
-				 NULL);
-
-	return 0;
+	return pp_funcs->set_sclk_od(adev->powerplay.pp_handle, value);
  }
  
  int amdgpu_dpm_get_mclk_od(struct amdgpu_device *adev)

@@ -749,13 +741,7 @@ int amdgpu_dpm_set_mclk_od(struct amdgpu_device *adev, 
uint32_t value)
if (!pp_funcs->set_mclk_od)
return -EOPNOTSUPP;
  
-	pp_funcs->set_mclk_od(adev->powerplay.pp_handle, value);
-
-	amdgpu_dpm_dispatch_task(adev,
-				 AMD_PP_TASK_READJUST_POWER_STATE,
-				 NULL);
-
-	return 0;
+	return pp_funcs->set_mclk_od(adev->powerplay.pp_handle, value);
  }
  
  int amdgpu_dpm_get_power_profile_mode(struct amdgpu_device *adev,

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index fa2f4e11e94e..89e1134d660f 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -187,6 +187,10 @@ static ssize_t amdgpu_set_power_dpm_state(struct device 
*dev,
  
  	amdgpu_dpm_set_power_state(adev, state);
  
+	amdgpu_dpm_dispatch_task(adev,
+				 AMD_PP_TASK_ENABLE_USER_STATE,
+				 &state);
+
pm_runtime_mark_last_busy(ddev->dev);
pm_runtime_put_autosuspend(ddev->dev);
  
@@ -1278,7 +1282,16 @@ static ssize_t amdgpu_set_pp_sclk_od(struct device *dev,

return ret;
}
  
-	amdgpu_dpm_set_sclk_od(adev, (uint32_t)value);

+   ret = amdgpu_dpm_set_sclk_od(adev, (uint32_t)value);


amdgpu_set_pp_sclk_od has essentially the same API as amdgpu_dpm_set_sclk_od,
and one would expect the latter to handle everything required to set the clock.

If locking is the problem, then it should be handled differently. This
kind of mixing is not the right way.


Thanks,
Lijo


+   if (ret) {
+   pm_runtime_mark_last_busy(ddev->dev);
+   pm_runtime_put_autosuspend(ddev->dev);
+   return ret;
+   }
+
+	amdgpu_dpm_dispatch_task(adev,
+				 AMD_PP_TASK_READJUST_POWER_STATE,
+				 NULL);
  
  	pm_runtime_mark_last_busy(ddev->dev);

pm_runtime_put_autosuspend(ddev->dev);
@@ -1340,7 +1353,16 @@ static ssize_t amdgpu_set_pp_mclk_od(struct device *dev,
return ret;
}
  
-	amdgpu_dpm_set_mclk_od(adev, (uint32_t)value);

+   ret = amdgpu_dpm_set_mclk_od(adev, (uint32_t)value);
+   if (ret) {
+   pm_runtime_mark_last_busy(ddev->dev);
+   pm_runtime_put_autosuspend(ddev->dev);
+   return ret;
+   }
+
+	amdgpu_dpm_dispatch_task(adev,
+				 AMD_PP_TASK_READJUST_POWER_STATE,
+				 NULL);
  
  	pm_runtime_mark_last_busy(ddev->dev);

pm_runtime_put_autosuspend(ddev->dev);
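
For comparison, a sketch of the shape Lijo seems to be asking for
(hypothetical, not in the patch): keep the dispatch inside
amdgpu_dpm_set_sclk_od() so sysfs callers need a single call, and solve
any locking concern inside the helper instead:

int amdgpu_dpm_set_sclk_od(struct amdgpu_device *adev, uint32_t value)
{
	const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
	int ret;

	if (!pp_funcs->set_sclk_od)
		return -EOPNOTSUPP;

	ret = pp_funcs->set_sclk_od(adev->powerplay.pp_handle, value);
	if (ret)
		return ret;

	/* Readjust the power state here so every caller gets it. */
	amdgpu_dpm_dispatch_task(adev,
				 AMD_PP_TASK_READJUST_POWER_STATE,
				 NULL);
	return 0;
}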



Re: [PATCH V4 09/17] drm/amd/pm: optimize the amdgpu_pm_compute_clocks() implementations

2021-12-09 Thread Lazar, Lijo




On 12/3/2021 8:35 AM, Evan Quan wrote:

Drop cross callings and multi-function APIs. Also avoid exposing
internal implementations details.

Signed-off-by: Evan Quan 
Change-Id: I55e5ab3da6a70482f5f5d8c256eed2f754feae20
--
v1->v2:
   - add back the adev->pm.dpm_enabled check(Lijo)
---
  .../gpu/drm/amd/include/kgd_pp_interface.h|   2 +-
  drivers/gpu/drm/amd/pm/Makefile   |   2 +-
  drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 225 +++---
  drivers/gpu/drm/amd/pm/amdgpu_dpm_internal.c  |  94 
  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |   2 -
  .../gpu/drm/amd/pm/inc/amdgpu_dpm_internal.h  |  32 +++
  .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  |  39 ++-
  drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c |   6 +-
  drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.c |  60 -
  drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.h |   3 +-
  drivers/gpu/drm/amd/pm/powerplay/si_dpm.c |  41 +++-
  11 files changed, 296 insertions(+), 210 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/pm/amdgpu_dpm_internal.c
  create mode 100644 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm_internal.h

diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index cdf724dcf832..7919e96e772b 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -404,7 +404,7 @@ struct amd_pm_funcs {
int (*get_dpm_clock_table)(void *handle,
   struct dpm_clocks *clock_table);
int (*get_smu_prv_buf_details)(void *handle, void **addr, size_t *size);
-   int (*change_power_state)(void *handle);
+   void (*pm_compute_clocks)(void *handle);
  };
  
  struct metrics_table_header {

diff --git a/drivers/gpu/drm/amd/pm/Makefile b/drivers/gpu/drm/amd/pm/Makefile
index 8cf6eff1ea93..d35ffde387f1 100644
--- a/drivers/gpu/drm/amd/pm/Makefile
+++ b/drivers/gpu/drm/amd/pm/Makefile
@@ -40,7 +40,7 @@ AMD_PM = $(addsuffix /Makefile,$(addprefix $(FULL_AMD_PATH)/pm/,$(PM_LIBS)))
  
  include $(AMD_PM)
  
-PM_MGR = amdgpu_dpm.o amdgpu_pm.o
+PM_MGR = amdgpu_dpm.o amdgpu_pm.o amdgpu_dpm_internal.o
  
  AMD_PM_POWER = $(addprefix $(AMD_PM_PATH)/,$(PM_MGR))
  
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index fe6bf5d950c2..952fd865db13 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -37,73 +37,6 @@
 #define amdgpu_dpm_enable_bapm(adev, e) \
 		((adev)->powerplay.pp_funcs->enable_bapm((adev)->powerplay.pp_handle, (e)))
  
-static void amdgpu_dpm_get_active_displays(struct amdgpu_device *adev)
-{
-	struct drm_device *ddev = adev_to_drm(adev);
-	struct drm_crtc *crtc;
-	struct amdgpu_crtc *amdgpu_crtc;
-
-	adev->pm.dpm.new_active_crtcs = 0;
-	adev->pm.dpm.new_active_crtc_count = 0;
-	if (adev->mode_info.num_crtc && adev->mode_info.mode_config_initialized) {
-		list_for_each_entry(crtc, &ddev->mode_config.crtc_list, head) {
-			amdgpu_crtc = to_amdgpu_crtc(crtc);
-			if (amdgpu_crtc->enabled) {
-				adev->pm.dpm.new_active_crtcs |= (1 << amdgpu_crtc->crtc_id);
-				adev->pm.dpm.new_active_crtc_count++;
-			}
-		}
-	}
-}
-
-u32 amdgpu_dpm_get_vblank_time(struct amdgpu_device *adev)
-{
-	struct drm_device *dev = adev_to_drm(adev);
-	struct drm_crtc *crtc;
-	struct amdgpu_crtc *amdgpu_crtc;
-	u32 vblank_in_pixels;
-	u32 vblank_time_us = 0xffffffff; /* if the displays are off, vblank time is max */
-
-	if (adev->mode_info.num_crtc && adev->mode_info.mode_config_initialized) {
-		list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
-			amdgpu_crtc = to_amdgpu_crtc(crtc);
-			if (crtc->enabled && amdgpu_crtc->enabled && amdgpu_crtc->hw_mode.clock) {
-				vblank_in_pixels =
-					amdgpu_crtc->hw_mode.crtc_htotal *
-					(amdgpu_crtc->hw_mode.crtc_vblank_end -
-					amdgpu_crtc->hw_mode.crtc_vdisplay +
-					(amdgpu_crtc->v_border * 2));
-
-				vblank_time_us = vblank_in_pixels * 1000 / amdgpu_crtc->hw_mode.clock;
-				break;
-			}
-		}
-	}
-
-	return vblank_time_us;
-}
-
-static u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device *adev)
-{
-	struct drm_device *dev = adev_to_drm(adev);
-	struct drm_crtc *crtc;
-	struct amdgpu_crtc *amdgpu_crtc;
-	u32 vrefresh = 0;
-
-	if (adev->mode_info.num_crtc && adev->mode_info.mode_config_initialized) {
-

Re: [PATCH v3] drm/amdgpu: fix incorrect VCN revision in SRIOV

2021-12-09 Thread Lazar, Lijo




On 12/9/2021 2:46 PM, Chen, Guchun wrote:

[Public]

Hi Lijo,

The check is not necessary. It is guarded by the for loop in the caller.

for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
...
if (amdgpu_vcn_is_disabled_vcn(adev, VCN_ENCODE_RING, i)) {
..
}



Thanks for the clarification Guchun.
Reviewed-by: Lijo Lazar 


Regards,
Guchun

-Original Message-
From: Lazar, Lijo 
Sent: Thursday, December 9, 2021 4:53 PM
To: Shi, Leslie ; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun 
Subject: Re: [PATCH v3] drm/amdgpu: fix incorrect VCN revision in SRIOV



On 12/9/2021 1:56 PM, Leslie Shi wrote:

Guest OS will set up VCN instance 1, which is disabled, as an enabled
instance and execute initialization work on it, but this causes a VCN ib
ring test failure on the disabled VCN instance during modprobe:

amdgpu 0000:00:08.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 5 on hub 1
amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_dec_0 (-110).
amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc_0.0 (-110).
[drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).

v2: drop amdgpu_discovery_get_vcn_version and rename sriov_config to
vcn_config
v3: modify VCN's revision in SR-IOV and bare-metal

Fixes: 36b7d5646476 ("drm/amdgpu: handle SRIOV VCN revision parsing")
Signed-off-by: Leslie Shi 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 29 ++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 --
   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   | 15 +++---
   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  2 +-
   4 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 552031950518..f31bc0187394 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -380,18 +380,15 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
*adev)
  ip->revision);
   
 			if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
-				if (amdgpu_sriov_vf(adev)) {
-					/* SR-IOV modifies each VCN’s revision (uint8)
-					 * Bit [5:0]: original revision value
-					 * Bit [7:6]: en/decode capability:
-					 * 0b00 : VCN function normally
-					 * 0b10 : encode is disabled
-					 * 0b01 : decode is disabled
-					 */
-					adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
-						(ip->revision & 0xc0) >> 6;
-					ip->revision &= ~0xc0;
-				}
+				/* Bit [5:0]: original revision value
+				 * Bit [7:6]: en/decode capability:
+				 * 0b00 : VCN function normally
+				 * 0b10 : encode is disabled
+				 * 0b01 : decode is disabled
+				 */
+				adev->vcn.vcn_config[adev->vcn.num_vcn_inst] =
+					ip->revision & 0xc0;
+				ip->revision &= ~0xc0;
 				adev->vcn.num_vcn_inst++;
 			}
 			if (le16_to_cpu(ip->hw_id) == SDMA0_HWID ||
@@ -485,14 +482,6 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int n
 	return -EINVAL;
 }
 
-
-int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
-				     int *major, int *minor, int *revision)
-{
-	return amdgpu_discovery_get_ip_version(adev, VCN_HWID,
-					       vcn_instance, major, minor, revision);
-}
-
   void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
   {
struct binary_header *bhdr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
index 0ea029e3b850..14537cec19db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
@@ -33,8 +33,6 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev);
 int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int number_instance,
 				    int *major, int *minor, int *revision);
 
-int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
-				     int *major, int *minor, int *revision);
   int 

RE: [PATCH v3] drm/amdgpu: fix incorrect VCN revision in SRIOV

2021-12-09 Thread Chen, Guchun
[Public]

Hi Lijo,

The check is not necessary. It is guarded by the for loop in the caller.

for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
...
if (amdgpu_vcn_is_disabled_vcn(adev, VCN_ENCODE_RING, i)) {
..
}
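
Spelled out slightly more: assuming the caller's loop looks like the
quote above, vcn_instance is always bounded by num_vcn_inst before
amdgpu_vcn_is_disabled_vcn() indexes vcn_config (a sketch; the elided
loop body here is an assumption):

	/* Sketch of the guarding loop described above */
	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
		/* i < num_vcn_inst here, so vcn_config[i] is always valid */
		if (amdgpu_vcn_is_disabled_vcn(adev, VCN_ENCODE_RING, i))
			continue;	/* skip rings on the disabled instance */
		/* ... ring init work ... */
	}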

Regards,
Guchun

-Original Message-
From: Lazar, Lijo  
Sent: Thursday, December 9, 2021 4:53 PM
To: Shi, Leslie ; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun 
Subject: Re: [PATCH v3] drm/amdgpu: fix incorrect VCN revision in SRIOV



On 12/9/2021 1:56 PM, Leslie Shi wrote:
> Guest OS will set up VCN instance 1, which is disabled, as an enabled
> instance and execute initialization work on it, but this causes a VCN ib
> ring test failure on the disabled VCN instance during modprobe:
> 
> amdgpu 0000:00:08.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 5 on hub 1
> amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_dec_0 (-110).
> amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc_0.0 (-110).
> [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
> 
> v2: drop amdgpu_discovery_get_vcn_version and rename sriov_config to 
> vcn_config
> v3: modify VCN's revision in SR-IOV and bare-metal
> 
> Fixes: 36b7d5646476 ("drm/amdgpu: handle SRIOV VCN revision parsing")
> Signed-off-by: Leslie Shi 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 29 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 --
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   | 15 +++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  2 +-
>   4 files changed, 14 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 552031950518..f31bc0187394 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -380,18 +380,15 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
> *adev)
> ip->revision);
>   
> 			if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
> -				if (amdgpu_sriov_vf(adev)) {
> -					/* SR-IOV modifies each VCN’s revision (uint8)
> -					 * Bit [5:0]: original revision value
> -					 * Bit [7:6]: en/decode capability:
> -					 * 0b00 : VCN function normally
> -					 * 0b10 : encode is disabled
> -					 * 0b01 : decode is disabled
> -					 */
> -					adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
> -						(ip->revision & 0xc0) >> 6;
> -					ip->revision &= ~0xc0;
> -				}
> +				/* Bit [5:0]: original revision value
> +				 * Bit [7:6]: en/decode capability:
> +				 * 0b00 : VCN function normally
> +				 * 0b10 : encode is disabled
> +				 * 0b01 : decode is disabled
> +				 */
> +				adev->vcn.vcn_config[adev->vcn.num_vcn_inst] =
> +					ip->revision & 0xc0;
> +				ip->revision &= ~0xc0;
>  				adev->vcn.num_vcn_inst++;
>  			}
>  			if (le16_to_cpu(ip->hw_id) == SDMA0_HWID ||
> @@ -485,14 +482,6 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int n
>  	return -EINVAL;
>  }
>   
> -
> -int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
> -				     int *major, int *minor, int *revision)
> -{
> -	return amdgpu_discovery_get_ip_version(adev, VCN_HWID,
> -					       vcn_instance, major, minor, revision);
> -}
> -
>   void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
>   {
>   struct binary_header *bhdr;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> index 0ea029e3b850..14537cec19db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> @@ -33,8 +33,6 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev);
>  int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int number_instance,
>  				    int *major, int *minor, int *revision);
>   
> -int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
> -				     int *major, int *minor, int

[PATCH 2/2] drm/amdgpu: add support for SMU debug option

2021-12-09 Thread Lang Yu
SMU firmware guys expect the driver to maintain the error context
and not to interact with the SMU any more when SMU errors occur.
That will aid in debugging SMU firmware issues.

Add SMU debug option support for this request; it can be
enabled or disabled via the amdgpu_smu_debug debugfs file.
When enabled, it brings hardware to a kind of halt state
so that no one can touch it any more in the event of SMU
errors.

Currently, the driver interacts with the SMU by sending messages,
and there are three ways of sending messages to the SMU.
Handle them respectively as follows:

1, smu_cmn_send_smc_msg_with_param() for normal timeout cases

  Halt on any error.

2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
for longer timeout cases

  Halt on errors apart from ETIME. Otherwise this way won't work.

3, smu_cmn_send_msg_without_waiting() for no waiting cases

  Halt on errors apart from ETIME. Otherwise second way won't work.

After halting, use BUG() to explicitly notify users.

== Command Guide ==

1, enable SMU debug option

 # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

 # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
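
3, query the current setting (a usage sketch; debugfs_create_bool(),
used in the patch below, exposes the value as Y or N on read)

 # cat /sys/kernel/debug/dri/0/amdgpu_smu_debug
 N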

v4:
 - Set to halt state instead of a simple hang.(Christian)

v3:
 - Use debugfs_create_bool().(Christian)
 - Put variable into smu_context struct.
 - Don't resend command when timeout.

v2:
 - Resend command when timeout.(Lijo)
 - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h |  5 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 20 +++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
if (!debugfs_initialized())
return 0;
 
+	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+			    &adev->smu.smu_debug_mode);
+
 	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
 				  &fops_ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;
 
struct stb_context stb_context;
+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
 };
 
 struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..84016d22c075 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
__smu_cmn_send_msg(smu, msg_index, param);
res = 0;
 Out:
+   if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
+   amdgpu_device_halt(smu->adev);
+   BUG();
+   }
+
return res;
 }
 
@@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
 int smu_cmn_wait_for_response(struct smu_context *smu)
 {
u32 reg;
+   int res;
 
reg = __smu_cmn_poll_stat(smu);
-   return __smu_cmn_reg2errno(smu, reg);
+   res = __smu_cmn_reg2errno(smu, reg);
+
+   if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
+   amdgpu_device_halt(smu->adev);
+   BUG();
+   }
+
+   return res;
 }
 
 /**
@@ -357,6 +370,11 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
 Out:
+   if (unlikely(smu->smu_debug_mode) && res) {
+   amdgpu_device_halt(smu->adev);
+   BUG();
+   }
+
 	mutex_unlock(&smu->message_lock);
return res;
 }
-- 
2.25.1



Re: [PATCH] drm/amdgpu: Handle fault with same timestamp

2021-12-09 Thread Christian König

On 08.12.21 at 21:27, Alex Deucher wrote:

On Wed, Dec 8, 2021 at 3:25 PM Alex Deucher  wrote:

On Wed, Dec 8, 2021 at 3:17 PM Philip Yang  wrote:

Remove the non-unique timestamp WARNING, since interrupts with the same
timestamp do happen on some chips.

Draining faults needs to wait for the processed_timestamp to be truly
greater than the checkpoint, or for the ring to be empty, to be sure no
stale faults are handled.

Signed-off-by: Philip Yang 

Maybe add the link to the bug when you push this?

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1818


With that done Reviewed-by: Christian König 



Alex


Alex


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c  | 4 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 3 ---
  2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index 8050f7ba93ad..3df146579ad9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -188,8 +188,8 @@ int amdgpu_ih_wait_on_checkpoint_process_ts(struct amdgpu_device *adev,
 	checkpoint_ts = amdgpu_ih_decode_iv_ts(adev, ih, checkpoint_wptr, -1);
 
 	return wait_event_interruptible_timeout(ih->wait_process,
-		!amdgpu_ih_ts_after(ih->processed_timestamp, checkpoint_ts),
-		timeout);
+		amdgpu_ih_ts_after(checkpoint_ts, ih->processed_timestamp) ||
+		ih->rptr == amdgpu_ih_get_wptr(adev, ih), timeout);
  }

  /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index e031f0cf93a2..571b7992 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -522,9 +522,6 @@ void amdgpu_irq_dispatch(struct amdgpu_device *adev,
 if (!handled)
 amdgpu_amdkfd_interrupt(adev, entry.iv_entry);

-   dev_WARN_ONCE(adev->dev, ih->processed_timestamp == entry.timestamp,
- "IH timestamps are not unique");
-
 if (amdgpu_ih_ts_after(ih->processed_timestamp, entry.timestamp))
 ih->processed_timestamp = entry.timestamp;
  }
--
2.17.1





Re: [PATCH V4 05/17] drm/amd/pm: do not expose those APIs used internally only in si_dpm.c

2021-12-09 Thread Lazar, Lijo




On 12/3/2021 8:35 AM, Evan Quan wrote:

Move them to si_dpm.c instead.

Signed-off-by: Evan Quan 
Change-Id: I288205cfd7c6ba09cfb22626ff70360d61ff0c67
--
v1->v2:
   - rename the API with "si_" prefix(Alex)
v2->v3:
   - rename other data structures used only in si_dpm.c(Lijo)
---
  drivers/gpu/drm/amd/pm/amdgpu_dpm.c   |  25 -
  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  25 -
  drivers/gpu/drm/amd/pm/powerplay/si_dpm.c | 106 +++---
  drivers/gpu/drm/amd/pm/powerplay/si_dpm.h |  15 ++-
  4 files changed, 83 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 72a8cb70d36b..b31858ad9b83 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -894,31 +894,6 @@ void amdgpu_add_thermal_controller(struct amdgpu_device 
*adev)
}
  }
  
-enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct amdgpu_device *adev,
-						 u32 sys_mask,
-						 enum amdgpu_pcie_gen asic_gen,
-						 enum amdgpu_pcie_gen default_gen)
-{
-   switch (asic_gen) {
-   case AMDGPU_PCIE_GEN1:
-   return AMDGPU_PCIE_GEN1;
-   case AMDGPU_PCIE_GEN2:
-   return AMDGPU_PCIE_GEN2;
-   case AMDGPU_PCIE_GEN3:
-   return AMDGPU_PCIE_GEN3;
-   default:
-   if ((sys_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3) &&
-   (default_gen == AMDGPU_PCIE_GEN3))
-   return AMDGPU_PCIE_GEN3;
-   else if ((sys_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2) &&
-(default_gen == AMDGPU_PCIE_GEN2))
-   return AMDGPU_PCIE_GEN2;
-   else
-   return AMDGPU_PCIE_GEN1;
-   }
-   return AMDGPU_PCIE_GEN1;
-}
-
  struct amd_vce_state*
  amdgpu_get_vce_clock_state(void *handle, u32 idx)
  {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index 6681b878e75f..f43b96dfe9d8 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -45,19 +45,6 @@ enum amdgpu_int_thermal_type {
THERMAL_TYPE_KV,
  };
  
-enum amdgpu_dpm_auto_throttle_src {

-   AMDGPU_DPM_AUTO_THROTTLE_SRC_THERMAL,
-   AMDGPU_DPM_AUTO_THROTTLE_SRC_EXTERNAL
-};
-
-enum amdgpu_dpm_event_src {
-   AMDGPU_DPM_EVENT_SRC_ANALOG = 0,
-   AMDGPU_DPM_EVENT_SRC_EXTERNAL = 1,
-   AMDGPU_DPM_EVENT_SRC_DIGITAL = 2,
-   AMDGPU_DPM_EVENT_SRC_ANALOG_OR_EXTERNAL = 3,
-   AMDGPU_DPM_EVENT_SRC_DIGIAL_OR_EXTERNAL = 4
-};
-
  struct amdgpu_ps {
u32 caps; /* vbios flags */
u32 class; /* vbios flags */
@@ -252,13 +239,6 @@ struct amdgpu_dpm_fan {
bool ucode_fan_control;
  };
  
-enum amdgpu_pcie_gen {

-   AMDGPU_PCIE_GEN1 = 0,
-   AMDGPU_PCIE_GEN2 = 1,
-   AMDGPU_PCIE_GEN3 = 2,
-   AMDGPU_PCIE_GEN_INVALID = 0x
-};
-
  #define amdgpu_dpm_reset_power_profile_state(adev, request) \
((adev)->powerplay.pp_funcs->reset_power_profile_state(\
(adev)->powerplay.pp_handle, request))
@@ -403,11 +383,6 @@ void amdgpu_free_extended_power_table(struct amdgpu_device 
*adev);
  
  void amdgpu_add_thermal_controller(struct amdgpu_device *adev);
  
-enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct amdgpu_device *adev,
-						 u32 sys_mask,
-						 enum amdgpu_pcie_gen asic_gen,
-						 enum amdgpu_pcie_gen default_gen);
-
  struct amd_vce_state*
  amdgpu_get_vce_clock_state(void *handle, u32 idx);
  
diff --git a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c

index 81f82aa05ec2..ab0fa6c79255 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
@@ -96,6 +96,19 @@ union pplib_clock_info {
struct _ATOM_PPLIB_SI_CLOCK_INFO si;
  };
  
+enum si_dpm_auto_throttle_src {
+	DPM_AUTO_THROTTLE_SRC_THERMAL,
+	DPM_AUTO_THROTTLE_SRC_EXTERNAL
+};
+


Since the final usage is something like (1 << 
DPM_AUTO_THROTTLE_SRC_EXTERNAL), it's better to carry the SI context 
in these names as well - SI_DPM_AUTO_THROTTLE_SRC_EXTERNAL - to denote 
that they are SI-specific values.


Thanks,
Lijo
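
A quick sketch of the suggested naming, assuming only the identifiers
change (illustrative, not a posted revision):

	/* Hypothetical SI-prefixed variant of the enum above */
	enum si_dpm_auto_throttle_src {
		SI_DPM_AUTO_THROTTLE_SRC_THERMAL,
		SI_DPM_AUTO_THROTTLE_SRC_EXTERNAL
	};

	/* the final usage then reads: */
	/* sources |= (1 << SI_DPM_AUTO_THROTTLE_SRC_EXTERNAL); */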


+enum si_dpm_event_src {
+   DPM_EVENT_SRC_ANALOG = 0,
+   DPM_EVENT_SRC_EXTERNAL = 1,
+   DPM_EVENT_SRC_DIGITAL = 2,
+   DPM_EVENT_SRC_ANALOG_OR_EXTERNAL = 3,
+   DPM_EVENT_SRC_DIGIAL_OR_EXTERNAL = 4
+};
+
  static const u32 r600_utc[R600_PM_NUMBER_OF_TC] =
  {
R600_UTC_DFLT_00,
@@ -3718,25 +3731,25 @@ static void si_set_dpm_event_sources(struct amdgpu_device *adev, u32 sources)
  {
struct rv7xx_power_info *pi = rv770_get_pi(adev);

RE: [PATCH v3] drm/amdgpu: fix incorrect VCN revision in SRIOV

2021-12-09 Thread Chen, Guchun
[Public]

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: Shi, Leslie  
Sent: Thursday, December 9, 2021 4:27 PM
To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun ; Shi, Leslie 
Subject: [PATCH v3] drm/amdgpu: fix incorrect VCN revision in SRIOV

Guest OS will set up VCN instance 1, which is disabled, as an enabled instance and 
execute initialization work on it, but this causes a VCN ib ring test failure on 
the disabled VCN instance during modprobe:

amdgpu 0000:00:08.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 5 on hub 1
amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_dec_0 (-110).
amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc_0.0 (-110).
[drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).

v2: drop amdgpu_discovery_get_vcn_version and rename sriov_config to vcn_config
v3: modify VCN's revision in SR-IOV and bare-metal

Fixes: 36b7d5646476 ("drm/amdgpu: handle SRIOV VCN revision parsing")
Signed-off-by: Leslie Shi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 29 ++-  
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   | 15 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  2 +-
 4 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 552031950518..f31bc0187394 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -380,18 +380,15 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
*adev)
  ip->revision);
 
 			if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
-				if (amdgpu_sriov_vf(adev)) {
-					/* SR-IOV modifies each VCN’s revision (uint8)
-					 * Bit [5:0]: original revision value
-					 * Bit [7:6]: en/decode capability:
-					 * 0b00 : VCN function normally
-					 * 0b10 : encode is disabled
-					 * 0b01 : decode is disabled
-					 */
-					adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
-						(ip->revision & 0xc0) >> 6;
-					ip->revision &= ~0xc0;
-				}
+				/* Bit [5:0]: original revision value
+				 * Bit [7:6]: en/decode capability:
+				 * 0b00 : VCN function normally
+				 * 0b10 : encode is disabled
+				 * 0b01 : decode is disabled
+				 */
+				adev->vcn.vcn_config[adev->vcn.num_vcn_inst] =
+					ip->revision & 0xc0;
+				ip->revision &= ~0xc0;
 				adev->vcn.num_vcn_inst++;
 			}
 			if (le16_to_cpu(ip->hw_id) == SDMA0_HWID ||
@@ -485,14 +482,6 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int n
 	return -EINVAL;
 }
 
-
-int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
-				     int *major, int *minor, int *revision)
-{
-	return amdgpu_discovery_get_ip_version(adev, VCN_HWID,
-					       vcn_instance, major, minor, revision);
-}
-
 void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
 {
struct binary_header *bhdr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
index 0ea029e3b850..14537cec19db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
@@ -33,8 +33,6 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev);
 int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int number_instance,
 				    int *major, int *minor, int *revision);
 
-int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
-				     int *major, int *minor, int *revision);
 int amdgpu_discovery_get_gfx_info(struct amdgpu_device *adev);
 int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 2658414c503d..38036cbf6203 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ 

Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

2021-12-09 Thread Christian König




On 09.12.21 at 09:49, Lang Yu wrote:

It is useful to maintain error context when debugging
SW/FW issues. We introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compared to a simple hang, the system will keep stable
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

Suggested-by: Christian Koenig 
Suggested-by: Andrey Grodzovsky 
Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++
  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c5cfe2926ca1..3f5f8f62aa5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
  void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
struct amdgpu_ring *ring);
  
+void amdgpu_device_halt(struct amdgpu_device *adev);

+
  /* atpx handler */
  #if defined(CONFIG_VGA_SWITCHEROO)
  void amdgpu_register_atpx_handler(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a1c14466f23d..62216627cc83 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
  
  	amdgpu_asic_invalidate_hdp(adev, ring);

  }
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. It will help to maintain error context when error occurred.
+ * Compared to a simple hang, the system will keep stable at least for SSH
+ * access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as following:
+ *
+ * 1. drm_dev_unplug() makes device inaccessible to user space (IOCTLs, etc),
+ *    clears all CPU mappings to device, disallows remappings through page faults
+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 5. pci_disable_device() and pci_wait_for_pending_transaction()
+ *    flush any in-flight DMA operations
+ * 6. set adev->no_hw_access to true
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+   struct pci_dev *pdev = adev->pdev;
+	struct drm_device *ddev = &adev->ddev;
+
+   drm_dev_unplug(ddev);
+
+   amdgpu_irq_disable_all(adev);
+
+   amdgpu_fence_driver_hw_fini(adev);
+
+   amdgpu_device_unmap_mmio(adev);
+
+   pci_disable_device(pdev);
+   pci_wait_for_pending_transaction(pdev);
+
+   adev->no_hw_access = true;


I think we need to reorder this, e.g. set adev->no_hw_access much 
earlier. Andrey, what do you think?


Apart from that sounds like the right idea to me.

Regards,
Christian.


+}
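
A minimal sketch of the reordering suggested above, assuming
no_hw_access is simply flipped first so the later teardown steps already
see the device as untouchable (illustrative only; the exact ordering is
still to be confirmed):

	/* Hypothetical reordering of the posted amdgpu_device_halt():
	 * mark the device inaccessible before tearing anything down.
	 */
	void amdgpu_device_halt(struct amdgpu_device *adev)
	{
		struct pci_dev *pdev = adev->pdev;
		struct drm_device *ddev = &adev->ddev;

		adev->no_hw_access = true;	/* moved up, per review */

		drm_dev_unplug(ddev);
		amdgpu_irq_disable_all(adev);
		amdgpu_fence_driver_hw_fini(adev);
		amdgpu_device_unmap_mmio(adev);

		pci_disable_device(pdev);
		pci_wait_for_pending_transaction(pdev);
	}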




[PATCH v3] drm/amdgpu: fix incorrect VCN revision in SRIOV

2021-12-09 Thread Leslie Shi
Guest OS will set up VCN instance 1, which is disabled, as an enabled instance and
execute initialization work on it, but this causes a VCN ib ring test failure
on the disabled VCN instance during modprobe:

amdgpu 0000:00:08.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 5 on hub 1
amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_dec_0 (-110).
amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc_0.0 (-110).
[drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).

v2: drop amdgpu_discovery_get_vcn_version and rename sriov_config to
vcn_config
v3: modify VCN's revision in SR-IOV and bare-metal

Fixes: 36b7d5646476 ("drm/amdgpu: handle SRIOV VCN revision parsing")
Signed-off-by: Leslie Shi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 29 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   | 15 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  2 +-
 4 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 552031950518..f31bc0187394 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -380,18 +380,15 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device *adev)
 							ip->revision);
 
 			if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
-				if (amdgpu_sriov_vf(adev)) {
-					/* SR-IOV modifies each VCN's revision (uint8)
-					 * Bit [5:0]: original revision value
-					 * Bit [7:6]: en/decode capability:
-					 * 0b00 : VCN function normally
-					 * 0b10 : encode is disabled
-					 * 0b01 : decode is disabled
-					 */
-					adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
-						(ip->revision & 0xc0) >> 6;
-					ip->revision &= ~0xc0;
-				}
+				/* Bit [5:0]: original revision value
+				 * Bit [7:6]: en/decode capability:
+				 * 0b00 : VCN function normally
+				 * 0b10 : encode is disabled
+				 * 0b01 : decode is disabled
+				 */
+				adev->vcn.vcn_config[adev->vcn.num_vcn_inst] =
+					ip->revision & 0xc0;
+				ip->revision &= ~0xc0;
 				adev->vcn.num_vcn_inst++;
 			}
 			if (le16_to_cpu(ip->hw_id) == SDMA0_HWID ||
@@ -485,14 +482,6 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int n
 	return -EINVAL;
 }
 
-
-int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
-				     int *major, int *minor, int *revision)
-{
-	return amdgpu_discovery_get_ip_version(adev, VCN_HWID,
-					       vcn_instance, major, minor, revision);
-}
-
 void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
 {
struct binary_header *bhdr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
index 0ea029e3b850..14537cec19db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
@@ -33,8 +33,6 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev);
 int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int number_instance,
 				    int *major, int *minor, int *revision);
 
-int amdgpu_discovery_get_vcn_version(struct amdgpu_device *adev, int vcn_instance,
-				     int *major, int *minor, int *revision);
 int amdgpu_discovery_get_gfx_info(struct amdgpu_device *adev);
 int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 2658414c503d..38036cbf6203 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -284,20 +284,13 @@ int amdgpu_vcn_sw_fini(struct amdgpu_device *adev)
 bool amdgpu_vcn_is_disabled_vcn(struct amdgpu_device *adev, enum vcn_ring_type type, uint32_t vcn_instance)
 {
 	bool ret = false;
+	int vcn_config = adev->vcn.vcn_config[vcn_instance];
 
-   int major;
-  
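
A small illustration of the vcn_config decode described in the comment
above; the masks follow from vcn_config = revision & 0xc0 (bit 7 set
when encode is disabled, bit 6 set when decode is disabled), but the
helper names here are hypothetical:

	/* Hypothetical helpers, derived from the bit layout quoted above */
	static inline bool vcn_encode_is_disabled(int vcn_config)
	{
		return vcn_config & 0x80;	/* 0b10 in bits [7:6] */
	}

	static inline bool vcn_decode_is_disabled(int vcn_config)
	{
		return vcn_config & 0x40;	/* 0b01 in bits [7:6] */
	}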
