Re: [PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control
On 2023-11-09 19:57, Luben Tuikov wrote: > On 2023-11-09 19:16, Danilo Krummrich wrote: [snip] >> @@ -667,6 +771,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs); >> * drm_sched_job_init - init a scheduler job >> * @job: scheduler job to init >> * @entity: scheduler entity to use >> + * @credits: the number of credits this job contributes to the schedulers >> + * credit limit >> * @owner: job owner for debugging >> * >> * Refer to drm_sched_entity_push_job() documentation >> @@ -684,7 +790,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs); >> */ >> int drm_sched_job_init(struct drm_sched_job *job, >> struct drm_sched_entity *entity, >> - void *owner) >> + u32 credits, void *owner) >> { >> if (!entity->rq) { >> /* This will most likely be followed by missing frames >> @@ -695,7 +801,11 @@ int drm_sched_job_init(struct drm_sched_job *job, >> return -ENOENT; >> } >> >> +if (unlikely(!credits)) >> +return -EINVAL; >> + > > This will most likely result in bad user experience (read: blank screen), > and debugging this would be really hard without something to go by > in the kernel log. > > (This was exactly the case with amdgpu when 56e449603f0ac5 > ("drm/sched: Convert the GPU scheduler to variable number of run-queues") > was being worked on and merged. Without the drm_err() on missing rq in > the lines immediately before the hunk above returning -ENOENT, there > was no indication why setting up an fb was failing very early on (blank > screen).) > > So it is best to print a "[drm] *ERROR* "-equivalent string in the logs, > so that we can make a note of this, without relying on drivers, old and new, > logging > that drm_sched_job_init() failed. If you add _exactly_ this, if (unlikely(!credits)) { pr_err("*ERROR* %s: credits cannot be 0!\n", __func__) return -EINVAL; } You can add my, Reviewed-by: Luben Tuikov and push it. -- Regards, Luben OpenPGP_0x4C15479431A334AF.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: [PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control
On 2023-11-09 19:16, Danilo Krummrich wrote: > Currently, job flow control is implemented simply by limiting the number > of jobs in flight. Therefore, a scheduler is initialized with a credit > limit that corresponds to the number of jobs which can be sent to the > hardware. > > This implies that for each job, drivers need to account for the maximum > job size possible in order to not overflow the ring buffer. > > However, there are drivers, such as Nouveau, where the job size has a > rather large range. For such drivers it can easily happen that job > submissions not even filling the ring by 1% can block subsequent > submissions, which, in the worst case, can lead to the ring run dry. > > In order to overcome this issue, allow for tracking the actual job size > instead of the number of jobs. Therefore, add a field to track a job's > credit count, which represents the number of credits a job contributes > to the scheduler's credit limit. > > Signed-off-by: Danilo Krummrich > --- > Changes in V2: > == > - fixed up influence on scheduling fairness due to consideration of a job's > size > - If we reach a ready entity in drm_sched_select_entity() but can't > actually > queue a job from it due to size limitations, just give up and go to > sleep > until woken up due to a pending job finishing, rather than continue to > try > other entities. > - added a callback to dynamically update a job's credits (Boris) > - renamed 'units' to 'credits' > - fixed commit message and comments as requested by Luben > > Changes in V3: > == > - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt > - move up drm_sched_can_queue() instead of adding a forward declaration > (Boris) > - add a drm_sched_available_credits() helper (Boris) > - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's > proposal > - re-phrase a few comments and fix a typo (Luben) > - change naming of all structures credit fields and function parameters to > the > following scheme > - drm_sched_job::credits > - drm_gpu_scheduler::credit_limit > - drm_gpu_scheduler::credit_count > - drm_sched_init(..., u32 credit_limit, ...) > - drm_sched_job_init(..., u32 credits, ...) > - add proper documentation for the scheduler's job-flow control mechanism > > Changes in V4: > == > - address Lubens comments regarding documentation > - switch to drm_WARN() variants > - WARN_ON() drivers passing in zero job credits for both > drm_sched_job_init() > and the update_job_credits() callback > - don't retry with another runq if job doesn't fit on the ring to prevent > priority inversion > - rebase onto drm-misc-next (will probably land before Matt's series) > > Changes in V5: > == > - fix panfrost, lima and etnaviv build > - add proposed comments regarding how the code avoids runq priority > inversion > - address Lubens feedback regarding wording > - rebase onto latest drm-misc-next (XE scheduler patches) > > Changes in V6: > == > - rebase due to conflicts introduced meanwhile > - drm_sched_job_init(): fail with EINVAL, rather than WARN() if job->credits > is zero > - drm_sched_can_queue: truncate job->credits if they exceed the scheduler's > credit limit to guarantee forward progress > > Patch also available in [1]. > > [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/ > --- > Documentation/gpu/drm-mm.rst | 6 + > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +- > drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c | 2 +- > drivers/gpu/drm/etnaviv/etnaviv_gpu.c | 2 +- > drivers/gpu/drm/lima/lima_device.c| 2 +- > drivers/gpu/drm/lima/lima_sched.c | 2 +- > drivers/gpu/drm/msm/msm_gem_submit.c | 2 +- > drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +- > drivers/gpu/drm/panfrost/panfrost_drv.c | 2 +- > drivers/gpu/drm/panfrost/panfrost_job.c | 2 +- > .../gpu/drm/scheduler/gpu_scheduler_trace.h | 2 +- > drivers/gpu/drm/scheduler/sched_main.c| 168 ++ > drivers/gpu/drm/v3d/v3d_gem.c | 2 +- > include/drm/gpu_scheduler.h | 28 ++- > 14 files changed, 173 insertions(+), 51 deletions(-) > > diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst > index 602010cb6894..acc5901ac840 100644 > --- a/Documentation/gpu/drm-mm.rst > +++ b/Documentation/gpu/drm-mm.rst > @@ -552,6 +552,12 @@ Overview > .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c > :doc: Overview > > +Flow Control > + > + > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c > + :doc: Flow Control > + > Scheduler Function References > - > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
[PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control
Currently, job flow control is implemented simply by limiting the number of jobs in flight. Therefore, a scheduler is initialized with a credit limit that corresponds to the number of jobs which can be sent to the hardware. This implies that for each job, drivers need to account for the maximum job size possible in order to not overflow the ring buffer. However, there are drivers, such as Nouveau, where the job size has a rather large range. For such drivers it can easily happen that job submissions not even filling the ring by 1% can block subsequent submissions, which, in the worst case, can lead to the ring run dry. In order to overcome this issue, allow for tracking the actual job size instead of the number of jobs. Therefore, add a field to track a job's credit count, which represents the number of credits a job contributes to the scheduler's credit limit. Signed-off-by: Danilo Krummrich --- Changes in V2: == - fixed up influence on scheduling fairness due to consideration of a job's size - If we reach a ready entity in drm_sched_select_entity() but can't actually queue a job from it due to size limitations, just give up and go to sleep until woken up due to a pending job finishing, rather than continue to try other entities. - added a callback to dynamically update a job's credits (Boris) - renamed 'units' to 'credits' - fixed commit message and comments as requested by Luben Changes in V3: == - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt - move up drm_sched_can_queue() instead of adding a forward declaration (Boris) - add a drm_sched_available_credits() helper (Boris) - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal - re-phrase a few comments and fix a typo (Luben) - change naming of all structures credit fields and function parameters to the following scheme - drm_sched_job::credits - drm_gpu_scheduler::credit_limit - drm_gpu_scheduler::credit_count - drm_sched_init(..., u32 credit_limit, ...) - drm_sched_job_init(..., u32 credits, ...) - add proper documentation for the scheduler's job-flow control mechanism Changes in V4: == - address Lubens comments regarding documentation - switch to drm_WARN() variants - WARN_ON() drivers passing in zero job credits for both drm_sched_job_init() and the update_job_credits() callback - don't retry with another runq if job doesn't fit on the ring to prevent priority inversion - rebase onto drm-misc-next (will probably land before Matt's series) Changes in V5: == - fix panfrost, lima and etnaviv build - add proposed comments regarding how the code avoids runq priority inversion - address Lubens feedback regarding wording - rebase onto latest drm-misc-next (XE scheduler patches) Changes in V6: == - rebase due to conflicts introduced meanwhile - drm_sched_job_init(): fail with EINVAL, rather than WARN() if job->credits is zero - drm_sched_can_queue: truncate job->credits if they exceed the scheduler's credit limit to guarantee forward progress Patch also available in [1]. [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/ --- Documentation/gpu/drm-mm.rst | 6 + drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +- drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c | 2 +- drivers/gpu/drm/etnaviv/etnaviv_gpu.c | 2 +- drivers/gpu/drm/lima/lima_device.c| 2 +- drivers/gpu/drm/lima/lima_sched.c | 2 +- drivers/gpu/drm/msm/msm_gem_submit.c | 2 +- drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +- drivers/gpu/drm/panfrost/panfrost_drv.c | 2 +- drivers/gpu/drm/panfrost/panfrost_job.c | 2 +- .../gpu/drm/scheduler/gpu_scheduler_trace.h | 2 +- drivers/gpu/drm/scheduler/sched_main.c| 168 ++ drivers/gpu/drm/v3d/v3d_gem.c | 2 +- include/drm/gpu_scheduler.h | 28 ++- 14 files changed, 173 insertions(+), 51 deletions(-) diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst index 602010cb6894..acc5901ac840 100644 --- a/Documentation/gpu/drm-mm.rst +++ b/Documentation/gpu/drm-mm.rst @@ -552,6 +552,12 @@ Overview .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c :doc: Overview +Flow Control + + +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c + :doc: Flow Control + Scheduler Function References - diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index 1f357198533f..62bb7fc7448a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm, if (!entity) return 0; - return