Re: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-27 Thread Christian König

Hi Shaoyun,

yes, absolutely. If you say that this is ok then I'm fine with that as well.

Thanks,
Christian.

Am 26.10.22 um 20:13 schrieb Liu, Shaoyun:

[AMD Official Use Only - General]

The SRIOV already has its own reset routine amdgpu_device_reset_sriov,  we try 
to put the sriov specific sequence  inside this function. For the rest 
part(re-submitting etc ) we should try to have the same  behavior as bare-metal.
Can  we just don't do the re-submission for all kind of reset since kernel 
already signal the reset event  to user level (at least for compute stack) ?

Regard
Sshaoyun.liu

-Original Message-
From: Koenig, Christian 
Sent: Wednesday, October 26, 2022 1:27 PM
To: Liu, Shaoyun ; Tuikov, Luben ; Prosyak, 
Vitaly ; Deucher, Alexander ; 
daniel.vet...@ffwll.ch; amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Subject: Re: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

The problem is that this re-submitting is currently an integral part of how 
SRIOV works.

The host can send a function level reset request to the clients when it sees 
that some schedule switching didn't worked as expected and in this case (and 
only this case) the hardware has actually never started to even work on the 
IBs. So the re-submission is actually save from this side.

But in general you are right, the sw side is just completely broken because we 
came up with a bunch of rather strict rules for the dma_fence implementation 
(and those rules are perfectly valid and necessary).

Regards,
Christian.

Am 26.10.22 um 18:10 schrieb Liu, Shaoyun:

[AMD Official Use Only - General]

The  user space  shouldn't care about  SRIOV or not ,  I don't think we need to 
keep the re-submission for SRIOV as well.  The reset from SRIOV could trigger 
the  host do a whole GPU reset which will have the same issue as bare metal.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of
Christian König
Sent: Wednesday, October 26, 2022 11:36 AM
To: Tuikov, Luben ; Prosyak, Vitaly
; Deucher, Alexander
; daniel.vet...@ffwll.ch;
amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian 
Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal
reset

Re-submitting IBs by the kernel has many problems because pre- requisite state 
is not automatically re-created as well. In other words neither binary 
semaphores nor things like ring buffer pointers are in the state they should be 
when the hardware starts to work on the IBs again.

Additional to that even after more than 5 years of developing this feature it 
is still not stable and we have massively problems getting the reference counts 
right.

As discussed with user space developers this behavior is not helpful in the 
first place. For graphics and multimedia workloads it makes much more sense to 
either completely re-create the context or at least re-submitting the IBs from 
userspace.

For compute use cases re-submitting is also not very helpful since userspace 
must rely on the accuracy of the result.

Because of this we stop this practice and instead just properly note that the 
fence submission was canceled. The only use case we keep the re-submission for 
now is SRIOV and function level resets.

Signed-off-by: Christian König 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4584e577b51..39e94feba1ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  continue;

  /* No point to resubmit jobs if we didn't HW reset*/
-   if (!tmp_adev->asic_reset_res && !job_signaled)
+   if (!tmp_adev->asic_reset_res && !job_signaled &&
+   amdgpu_sriov_vf(tmp_adev))

drm_sched_resubmit_jobs(>sched);

  drm_sched_start(>sched,
!tmp_adev->asic_reset_res);
--
2.25.1





RE: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-26 Thread Liu, Shaoyun
[AMD Official Use Only - General]

The SRIOV already has its own reset routine amdgpu_device_reset_sriov,  we try 
to put the sriov specific sequence  inside this function. For the rest 
part(re-submitting etc ) we should try to have the same  behavior as bare-metal.
Can  we just don't do the re-submission for all kind of reset since kernel 
already signal the reset event  to user level (at least for compute stack) ?

Regard
Sshaoyun.liu

-Original Message-
From: Koenig, Christian 
Sent: Wednesday, October 26, 2022 1:27 PM
To: Liu, Shaoyun ; Tuikov, Luben ; 
Prosyak, Vitaly ; Deucher, Alexander 
; daniel.vet...@ffwll.ch; 
amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Subject: Re: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

The problem is that this re-submitting is currently an integral part of how 
SRIOV works.

The host can send a function level reset request to the clients when it sees 
that some schedule switching didn't worked as expected and in this case (and 
only this case) the hardware has actually never started to even work on the 
IBs. So the re-submission is actually save from this side.

But in general you are right, the sw side is just completely broken because we 
came up with a bunch of rather strict rules for the dma_fence implementation 
(and those rules are perfectly valid and necessary).

Regards,
Christian.

Am 26.10.22 um 18:10 schrieb Liu, Shaoyun:
> [AMD Official Use Only - General]
>
> The  user space  shouldn't care about  SRIOV or not ,  I don't think we need 
> to keep the re-submission for SRIOV as well.  The reset from SRIOV could 
> trigger the  host do a whole GPU reset which will have the same issue as bare 
> metal.
>
> Regards
> Shaoyun.liu
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Christian König
> Sent: Wednesday, October 26, 2022 11:36 AM
> To: Tuikov, Luben ; Prosyak, Vitaly
> ; Deucher, Alexander
> ; daniel.vet...@ffwll.ch;
> amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
> Cc: Koenig, Christian 
> Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal
> reset
>
> Re-submitting IBs by the kernel has many problems because pre- requisite 
> state is not automatically re-created as well. In other words neither binary 
> semaphores nor things like ring buffer pointers are in the state they should 
> be when the hardware starts to work on the IBs again.
>
> Additional to that even after more than 5 years of developing this feature it 
> is still not stable and we have massively problems getting the reference 
> counts right.
>
> As discussed with user space developers this behavior is not helpful in the 
> first place. For graphics and multimedia workloads it makes much more sense 
> to either completely re-create the context or at least re-submitting the IBs 
> from userspace.
>
> For compute use cases re-submitting is also not very helpful since userspace 
> must rely on the accuracy of the result.
>
> Because of this we stop this practice and instead just properly note that the 
> fence submission was canceled. The only use case we keep the re-submission 
> for now is SRIOV and function level resets.
>
> Signed-off-by: Christian König 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d4584e577b51..39e94feba1ac 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>  continue;
>
>  /* No point to resubmit jobs if we didn't HW reset*/
> -   if (!tmp_adev->asic_reset_res && !job_signaled)
> +   if (!tmp_adev->asic_reset_res && !job_signaled &&
> +   amdgpu_sriov_vf(tmp_adev))
>
> drm_sched_resubmit_jobs(>sched);
>
>  drm_sched_start(>sched,
> !tmp_adev->asic_reset_res);
> --
> 2.25.1
>



Re: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-26 Thread Christian König
The problem is that this re-submitting is currently an integral part of 
how SRIOV works.


The host can send a function level reset request to the clients when it 
sees that some schedule switching didn't worked as expected and in this 
case (and only this case) the hardware has actually never started to 
even work on the IBs. So the re-submission is actually save from this side.


But in general you are right, the sw side is just completely broken 
because we came up with a bunch of rather strict rules for the dma_fence 
implementation (and those rules are perfectly valid and necessary).


Regards,
Christian.

Am 26.10.22 um 18:10 schrieb Liu, Shaoyun:

[AMD Official Use Only - General]

The  user space  shouldn't care about  SRIOV or not ,  I don't think we need to 
keep the re-submission for SRIOV as well.  The reset from SRIOV could trigger 
the  host do a whole GPU reset which will have the same issue as bare metal.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Wednesday, October 26, 2022 11:36 AM
To: Tuikov, Luben ; Prosyak, Vitaly ; 
Deucher, Alexander ; daniel.vet...@ffwll.ch; 
amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian 
Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

Re-submitting IBs by the kernel has many problems because pre- requisite state 
is not automatically re-created as well. In other words neither binary 
semaphores nor things like ring buffer pointers are in the state they should be 
when the hardware starts to work on the IBs again.

Additional to that even after more than 5 years of developing this feature it 
is still not stable and we have massively problems getting the reference counts 
right.

As discussed with user space developers this behavior is not helpful in the 
first place. For graphics and multimedia workloads it makes much more sense to 
either completely re-create the context or at least re-submitting the IBs from 
userspace.

For compute use cases re-submitting is also not very helpful since userspace 
must rely on the accuracy of the result.

Because of this we stop this practice and instead just properly note that the 
fence submission was canceled. The only use case we keep the re-submission for 
now is SRIOV and function level resets.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4584e577b51..39e94feba1ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 continue;

 /* No point to resubmit jobs if we didn't HW reset*/
-   if (!tmp_adev->asic_reset_res && !job_signaled)
+   if (!tmp_adev->asic_reset_res && !job_signaled &&
+   amdgpu_sriov_vf(tmp_adev))
 drm_sched_resubmit_jobs(>sched);

 drm_sched_start(>sched, 
!tmp_adev->asic_reset_res);
--
2.25.1





RE: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-26 Thread Liu, Shaoyun
[AMD Official Use Only - General]

The  user space  shouldn't care about  SRIOV or not ,  I don't think we need to 
keep the re-submission for SRIOV as well.  The reset from SRIOV could trigger 
the  host do a whole GPU reset which will have the same issue as bare metal.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Wednesday, October 26, 2022 11:36 AM
To: Tuikov, Luben ; Prosyak, Vitaly 
; Deucher, Alexander ; 
daniel.vet...@ffwll.ch; amd-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org
Cc: Koenig, Christian 
Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

Re-submitting IBs by the kernel has many problems because pre- requisite state 
is not automatically re-created as well. In other words neither binary 
semaphores nor things like ring buffer pointers are in the state they should be 
when the hardware starts to work on the IBs again.

Additional to that even after more than 5 years of developing this feature it 
is still not stable and we have massively problems getting the reference counts 
right.

As discussed with user space developers this behavior is not helpful in the 
first place. For graphics and multimedia workloads it makes much more sense to 
either completely re-create the context or at least re-submitting the IBs from 
userspace.

For compute use cases re-submitting is also not very helpful since userspace 
must rely on the accuracy of the result.

Because of this we stop this practice and instead just properly note that the 
fence submission was canceled. The only use case we keep the re-submission for 
now is SRIOV and function level resets.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4584e577b51..39e94feba1ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
continue;

/* No point to resubmit jobs if we didn't HW reset*/
-   if (!tmp_adev->asic_reset_res && !job_signaled)
+   if (!tmp_adev->asic_reset_res && !job_signaled &&
+   amdgpu_sriov_vf(tmp_adev))
drm_sched_resubmit_jobs(>sched);

drm_sched_start(>sched, 
!tmp_adev->asic_reset_res);
--
2.25.1