Re: [PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control

2023-11-09 Thread Luben Tuikov
On 2023-11-09 19:57, Luben Tuikov wrote:
> On 2023-11-09 19:16, Danilo Krummrich wrote:
[snip]
>> @@ -667,6 +771,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>   * drm_sched_job_init - init a scheduler job
>>   * @job: scheduler job to init
>>   * @entity: scheduler entity to use
>> + * @credits: the number of credits this job contributes to the schedulers
>> + * credit limit
>>   * @owner: job owner for debugging
>>   *
>>   * Refer to drm_sched_entity_push_job() documentation
>> @@ -684,7 +790,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>   */
>>  int drm_sched_job_init(struct drm_sched_job *job,
>> struct drm_sched_entity *entity,
>> -   void *owner)
>> +   u32 credits, void *owner)
>>  {
>>  if (!entity->rq) {
>>  /* This will most likely be followed by missing frames
>> @@ -695,7 +801,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>  return -ENOENT;
>>  }
>>  
>> +if (unlikely(!credits))
>> +return -EINVAL;
>> +
> 
> This will most likely result in bad user experience (read: blank screen),
> and debugging this would be really hard without something to go by
> in the kernel log.
> 
> (This was exactly the case with amdgpu when 56e449603f0ac5
> ("drm/sched: Convert the GPU scheduler to variable number of run-queues") 
> was being worked on and merged. Without the drm_err() on missing rq in
> the lines immediately before the hunk above returning -ENOENT, there
> was no indication why setting up an fb was failing very early on (blank 
> screen).)
> 
> So it is best to print a "[drm] *ERROR* "-equivalent string in the logs,
> so that we can make a note of this, without relying on drivers, old and new, 
> logging
> that drm_sched_job_init() failed.

If you add _exactly_ this,

if (unlikely(!credits)) {
pr_err("*ERROR* %s: credits cannot be 0!\n", __func__)
return -EINVAL;
}

You can add my,

Reviewed-by: Luben Tuikov 

and push it.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control

2023-11-09 Thread Luben Tuikov
On 2023-11-09 19:16, Danilo Krummrich wrote:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
> Signed-off-by: Danilo Krummrich 
> ---
> Changes in V2:
> ==
>   - fixed up influence on scheduling fairness due to consideration of a job's
> size
> - If we reach a ready entity in drm_sched_select_entity() but can't 
> actually
>   queue a job from it due to size limitations, just give up and go to 
> sleep
>   until woken up due to a pending job finishing, rather than continue to 
> try
>   other entities.
>   - added a callback to dynamically update a job's credits (Boris)
>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
> (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's 
> proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures credit fields and function parameters to 
> the
> following scheme
> - drm_sched_job::credits
> - drm_gpu_scheduler::credit_limit
> - drm_gpu_scheduler::credit_count
> - drm_sched_init(..., u32 credit_limit, ...)
> - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
> 
> Changes in V4:
> ==
>   - address Lubens comments regarding documentation
>   - switch to drm_WARN() variants
>   - WARN_ON() drivers passing in zero job credits for both 
> drm_sched_job_init()
> and the update_job_credits() callback
>   - don't retry with another runq if job doesn't fit on the ring to prevent
> priority inversion
>   - rebase onto drm-misc-next (will probably land before Matt's series)
> 
> Changes in V5:
> ==
>   - fix panfrost, lima and etnaviv build
>   - add proposed comments regarding how the code avoids runq priority 
> inversion
>   - address Lubens feedback regarding wording
>   - rebase onto latest drm-misc-next (XE scheduler patches)
> 
> Changes in V6:
> ==
>   - rebase due to conflicts introduced meanwhile
>   - drm_sched_job_init(): fail with EINVAL, rather than WARN() if job->credits
> is zero
>   - drm_sched_can_queue: truncate job->credits if they exceed the scheduler's
> credit limit to guarantee forward progress
> 
> Patch also available in [1].
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>  Documentation/gpu/drm-mm.rst  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gpu.c |   2 +-
>  drivers/gpu/drm/lima/lima_device.c|   2 +-
>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c   |   2 +-
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 168 ++
>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>  include/drm/gpu_scheduler.h   |  28 ++-
>  14 files changed, 173 insertions(+), 51 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>  
> +Flow Control
> +
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>  Scheduler Function References
>  -
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

[PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control

2023-11-09 Thread Danilo Krummrich
Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers it can easily happen that job
submissions not even filling the ring by 1% can block subsequent
submissions, which, in the worst case, can lead to the ring run dry.

In order to overcome this issue, allow for tracking the actual job size
instead of the number of jobs. Therefore, add a field to track a job's
credit count, which represents the number of credits a job contributes
to the scheduler's credit limit.

Signed-off-by: Danilo Krummrich 
---
Changes in V2:
==
  - fixed up influence on scheduling fairness due to consideration of a job's
size
- If we reach a ready entity in drm_sched_select_entity() but can't actually
  queue a job from it due to size limitations, just give up and go to sleep
  until woken up due to a pending job finishing, rather than continue to try
  other entities.
  - added a callback to dynamically update a job's credits (Boris)
  - renamed 'units' to 'credits'
  - fixed commit message and comments as requested by Luben

Changes in V3:
==
  - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
  - move up drm_sched_can_queue() instead of adding a forward declaration
(Boris)
  - add a drm_sched_available_credits() helper (Boris)
  - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
  - re-phrase a few comments and fix a typo (Luben)
  - change naming of all structures credit fields and function parameters to the
following scheme
- drm_sched_job::credits
- drm_gpu_scheduler::credit_limit
- drm_gpu_scheduler::credit_count
- drm_sched_init(..., u32 credit_limit, ...)
- drm_sched_job_init(..., u32 credits, ...)
  - add proper documentation for the scheduler's job-flow control mechanism

Changes in V4:
==
  - address Lubens comments regarding documentation
  - switch to drm_WARN() variants
  - WARN_ON() drivers passing in zero job credits for both drm_sched_job_init()
and the update_job_credits() callback
  - don't retry with another runq if job doesn't fit on the ring to prevent
priority inversion
  - rebase onto drm-misc-next (will probably land before Matt's series)

Changes in V5:
==
  - fix panfrost, lima and etnaviv build
  - add proposed comments regarding how the code avoids runq priority inversion
  - address Lubens feedback regarding wording
  - rebase onto latest drm-misc-next (XE scheduler patches)

Changes in V6:
==
  - rebase due to conflicts introduced meanwhile
  - drm_sched_job_init(): fail with EINVAL, rather than WARN() if job->credits
is zero
  - drm_sched_can_queue: truncate job->credits if they exceed the scheduler's
credit limit to guarantee forward progress

Patch also available in [1].

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
---
 Documentation/gpu/drm-mm.rst  |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gpu.c |   2 +-
 drivers/gpu/drm/lima/lima_device.c|   2 +-
 drivers/gpu/drm/lima/lima_sched.c |   2 +-
 drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c   |   2 +-
 .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c| 168 ++
 drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
 include/drm/gpu_scheduler.h   |  28 ++-
 14 files changed, 173 insertions(+), 51 deletions(-)

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index 602010cb6894..acc5901ac840 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,6 +552,12 @@ Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Overview
 
+Flow Control
+
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Flow Control
+
 Scheduler Function References
 -
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 1f357198533f..62bb7fc7448a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
amdgpu_vm *vm,
if (!entity)
return 0;
 
-   return