Re: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

2024-02-15 Thread Christian König




Am 15.02.24 um 15:43 schrieb Christian König:

Am 15.02.24 um 15:36 schrieb Alex Deucher:

On Thu, Feb 15, 2024 at 2:53 AM Christian König
 wrote:

Well using this is in sysfs is a bug to begin with. This would prevent
starting new applications and crashing applications which don't expect
to get an -EPERM in return here.

If we need to make operations mutual exclusive with resets then we need
to take the appropriate locks and *not* work around by abusing
amdgpu_in_reset().

The functionality of amdgpu_in_reset() is just to check in lower level
functions if we are inside the higher level reset thread and *not*
protect anybody from concurrent access.

I think we should probably completely nuke the underlying flag and 
using

the thread owner of the lock to prevent such an abuse.

Can we land some variant of this for now?


I don't think so, it most likely will break existing use cases.

What we might be able to do is to frame this with 
amdgpu_device_lock_reset_domain() / amdgpu_device_unlock_reset_domain().


It fixes known issues and it's the same behavior we have in sysfs and 
debugfs already.


Yeah, as I said that is broken to begin with. It's just that for sysfs 
and debugfs nobody notices.


Wait a second, debugfs is actually doing the right thing in some functions:

    /* Avoid accidently unparking the sched thread during GPU reset */
    r = down_read_killable(>reset_domain->sem);
    if (r)
    goto pro_end;
...

    up_read(>reset_domain->sem);

This needs to replace amdgpu_in_reset() in pretty much all debugfs and 
sysfs function.


Probably best to wrap that in some inline amdgpu_reset_* functions and 
document why those needs to be used.


Regards,
Christian.




Regards,
Christian.


   It's not
clear to me how this should best be handled.  We basically want to
block any access to the GPU (registers, firmwares, etc.) while the GPU
is going through a reset.  Just locking the reset domain doesn't seem
like the right solution.

Alex


Regards,
Christian.

Am 12.02.24 um 21:56 schrieb Deucher, Alexander:

[AMD Official Use Only - General]

Ping?


-Original Message-
From: Deucher, Alexander 
Sent: Monday, January 29, 2024 10:56 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in 
reset


This avoids queries to read registers or query the SMU for 
telemetry data while

the GPU is in reset. This mirrors what we already do for sysfs.

Signed-off-by: Alex Deucher 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index a2df3025a754..d522e99c6f81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -607,6 +607,9 @@ int amdgpu_info_ioctl(struct drm_device *dev, 
void

*data, struct drm_file *filp)
    int i, found, ret;
    int ui32_size = sizeof(ui32);

+ if (amdgpu_in_reset(adev))
+ return -EPERM;
+
    if (!info->return_size || !info->return_pointer)
    return -EINVAL;

--
2.42.0






Re: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

2024-02-15 Thread Christian König

Am 15.02.24 um 15:36 schrieb Alex Deucher:

On Thu, Feb 15, 2024 at 2:53 AM Christian König
 wrote:

Well using this is in sysfs is a bug to begin with. This would prevent
starting new applications and crashing applications which don't expect
to get an -EPERM in return here.

If we need to make operations mutual exclusive with resets then we need
to take the appropriate locks and *not* work around by abusing
amdgpu_in_reset().

The functionality of amdgpu_in_reset() is just to check in lower level
functions if we are inside the higher level reset thread and *not*
protect anybody from concurrent access.

I think we should probably completely nuke the underlying flag and using
the thread owner of the lock to prevent such an abuse.

Can we land some variant of this for now?


I don't think so, it most likely will break existing use cases.

What we might be able to do is to frame this with 
amdgpu_device_lock_reset_domain() / amdgpu_device_unlock_reset_domain().



It fixes known issues and it's the same behavior we have in sysfs and debugfs 
already.


Yeah, as I said that is broken to begin with. It's just that for sysfs 
and debugfs nobody notices.


Regards,
Christian.


   It's not
clear to me how this should best be handled.  We basically want to
block any access to the GPU (registers, firmwares, etc.) while the GPU
is going through a reset.  Just locking the reset domain doesn't seem
like the right solution.

Alex


Regards,
Christian.

Am 12.02.24 um 21:56 schrieb Deucher, Alexander:

[AMD Official Use Only - General]

Ping?


-Original Message-
From: Deucher, Alexander 
Sent: Monday, January 29, 2024 10:56 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

This avoids queries to read registers or query the SMU for telemetry data while
the GPU is in reset. This mirrors what we already do for sysfs.

Signed-off-by: Alex Deucher 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index a2df3025a754..d522e99c6f81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -607,6 +607,9 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
*data, struct drm_file *filp)
int i, found, ret;
int ui32_size = sizeof(ui32);

+ if (amdgpu_in_reset(adev))
+ return -EPERM;
+
if (!info->return_size || !info->return_pointer)
return -EINVAL;

--
2.42.0




Re: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

2024-02-15 Thread Alex Deucher
On Thu, Feb 15, 2024 at 2:53 AM Christian König
 wrote:
>
> Well using this is in sysfs is a bug to begin with. This would prevent
> starting new applications and crashing applications which don't expect
> to get an -EPERM in return here.
>
> If we need to make operations mutual exclusive with resets then we need
> to take the appropriate locks and *not* work around by abusing
> amdgpu_in_reset().
>
> The functionality of amdgpu_in_reset() is just to check in lower level
> functions if we are inside the higher level reset thread and *not*
> protect anybody from concurrent access.
>
> I think we should probably completely nuke the underlying flag and using
> the thread owner of the lock to prevent such an abuse.

Can we land some variant of this for now?  It fixes known issues and
it's the same behavior we have in sysfs and debugfs already.  It's not
clear to me how this should best be handled.  We basically want to
block any access to the GPU (registers, firmwares, etc.) while the GPU
is going through a reset.  Just locking the reset domain doesn't seem
like the right solution.

Alex

>
> Regards,
> Christian.
>
> Am 12.02.24 um 21:56 schrieb Deucher, Alexander:
> > [AMD Official Use Only - General]
> >
> > Ping?
> >
> >> -Original Message-
> >> From: Deucher, Alexander 
> >> Sent: Monday, January 29, 2024 10:56 AM
> >> To: amd-gfx@lists.freedesktop.org
> >> Cc: Deucher, Alexander 
> >> Subject: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset
> >>
> >> This avoids queries to read registers or query the SMU for telemetry data 
> >> while
> >> the GPU is in reset. This mirrors what we already do for sysfs.
> >>
> >> Signed-off-by: Alex Deucher 
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> index a2df3025a754..d522e99c6f81 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> >> @@ -607,6 +607,9 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> >> *data, struct drm_file *filp)
> >>int i, found, ret;
> >>int ui32_size = sizeof(ui32);
> >>
> >> + if (amdgpu_in_reset(adev))
> >> + return -EPERM;
> >> +
> >>if (!info->return_size || !info->return_pointer)
> >>return -EINVAL;
> >>
> >> --
> >> 2.42.0
>


Re: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

2024-02-14 Thread Christian König
Well using this is in sysfs is a bug to begin with. This would prevent 
starting new applications and crashing applications which don't expect 
to get an -EPERM in return here.


If we need to make operations mutual exclusive with resets then we need 
to take the appropriate locks and *not* work around by abusing 
amdgpu_in_reset().


The functionality of amdgpu_in_reset() is just to check in lower level 
functions if we are inside the higher level reset thread and *not* 
protect anybody from concurrent access.


I think we should probably completely nuke the underlying flag and using 
the thread owner of the lock to prevent such an abuse.


Regards,
Christian.

Am 12.02.24 um 21:56 schrieb Deucher, Alexander:

[AMD Official Use Only - General]

Ping?


-Original Message-
From: Deucher, Alexander 
Sent: Monday, January 29, 2024 10:56 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

This avoids queries to read registers or query the SMU for telemetry data while
the GPU is in reset. This mirrors what we already do for sysfs.

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index a2df3025a754..d522e99c6f81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -607,6 +607,9 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
*data, struct drm_file *filp)
   int i, found, ret;
   int ui32_size = sizeof(ui32);

+ if (amdgpu_in_reset(adev))
+ return -EPERM;
+
   if (!info->return_size || !info->return_pointer)
   return -EINVAL;

--
2.42.0




RE: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

2024-02-12 Thread Deucher, Alexander
[AMD Official Use Only - General]

Ping?

> -Original Message-
> From: Deucher, Alexander 
> Sent: Monday, January 29, 2024 10:56 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset
>
> This avoids queries to read registers or query the SMU for telemetry data 
> while
> the GPU is in reset. This mirrors what we already do for sysfs.
>
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index a2df3025a754..d522e99c6f81 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -607,6 +607,9 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> *data, struct drm_file *filp)
>   int i, found, ret;
>   int ui32_size = sizeof(ui32);
>
> + if (amdgpu_in_reset(adev))
> + return -EPERM;
> +
>   if (!info->return_size || !info->return_pointer)
>   return -EINVAL;
>
> --
> 2.42.0