Re: [PATCH] drm/sched: Re-queue run job worker when drm_sched_entity_pop_job() returns NULL

2024-02-05 Thread Luben Tuikov
On 2024-02-05 08:33, Rodrigo Vivi wrote:
> On Mon, Feb 05, 2024 at 09:44:56AM +0100, Christian König wrote:
>> Am 02.02.24 um 22:58 schrieb Rodrigo Vivi:
>>> On Tue, Jan 30, 2024 at 08:05:29AM +0100, Christian König wrote:
>>>> Am 30.01.24 um 04:04 schrieb Matthew Brost:
>>>>> Rather than loop over entities until one with a ready job is found,
>>>>> re-queue the run job worker when drm_sched_entity_pop_job() returns NULL.
>>>>>
>>>>> Fixes: 6dbd9004a55 ("drm/sched: Drain all entities in DRM sched run job 
>>>>> worker")
>>> First of all there's a small typo in this Fixes tag that needs to be fixed.
>>> The correct one is:
>>>
>>> Fixes: 66dbd9004a55 ("drm/sched: Drain all entities in DRM sched run job 
>>> worker")
> 
> Cc: Dave Airlie 
> 
>>>
>>> But I couldn't apply this right now in any of our drm-tip trees because it
>>> is not clear where this is coming from originally.
>>>
>>> likely amd tree?!
>>
>> No, this comes from Matthew's work on the DRM scheduler.
>>
>> Matthew's patches were most likely merged through drm-misc.
> 
> the original is not there in drm-misc-next.
> it looks like Dave had taken that one directly to drm-next.
> So we either need the drm-misc maintainers to have a backmerge or
> Dave to take this through the drm-fixes directly.

This is indeed the case.

I was going to push this patch through drm-misc-next, but the original/base patch
(<20240124210811.1639040-1-matthew.br...@intel.com>) isn't there.

If drm-misc maintainers back merge drm-fixes into drm-misc-next,
I'll push this patch into drm-misc-next right away, so that it is complete,
and people can run and test it fully.

Else, Dave will have to pull this patch directly into drm-fixes with our tags,
as was done for the base patch.

Reviewed-by: Luben Tuikov 

Regards,
Luben

> 
>>
>> Regards,
>> Christian.
>>
>>>
>>>>> Signed-off-by: Matthew Brost 
>>>> Reviewed-by: Christian König 
>>> Christian, if this came from the amd tree, could you please apply it there and
>>> propagate through your fixes flow?
>>>
>>> Thanks,
>>> Rodrigo.
>>>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++++------
>>>>>1 file changed, 9 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 8acbef7ae53d..7e90c9f95611 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1178,21 +1178,24 @@ static void drm_sched_run_job_work(struct 
>>>>> work_struct *w)
>>>>>   struct drm_sched_entity *entity;
>>>>>   struct dma_fence *fence;
>>>>>   struct drm_sched_fence *s_fence;
>>>>> - struct drm_sched_job *sched_job = NULL;
>>>>> + struct drm_sched_job *sched_job;
>>>>>   int r;
>>>>>   if (READ_ONCE(sched->pause_submit))
>>>>>   return;
>>>>>   /* Find entity with a ready job */
>>>>> - while (!sched_job && (entity = drm_sched_select_entity(sched))) {
>>>>> - sched_job = drm_sched_entity_pop_job(entity);
>>>>> - if (!sched_job)
>>>>> - complete_all(&entity->entity_idle);
>>>>> - }
>>>>> + entity = drm_sched_select_entity(sched);
>>>>>   if (!entity)
>>>>>   return; /* No more work */
>>>>> + sched_job = drm_sched_entity_pop_job(entity);
>>>>> + if (!sched_job) {
>>>>> + complete_all(&entity->entity_idle);
>>>>> + drm_sched_run_job_queue(sched);
>>>>> + return;
>>>>> + }
>>>>> +
>>>>>   s_fence = sched_job->s_fence;
>>>>>   atomic_add(sched_job->credits, &sched->credit_count);
>>

-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Re-queue run job worker when drm_sched_entity_pop_job() returns NULL

2024-02-05 Thread Luben Tuikov
On 2024-01-29 22:04, Matthew Brost wrote:
> Rather than loop over entities until one with a ready job is found,
> re-queue the run job worker when drm_sched_entity_pop_job() returns NULL.
> 
> Fixes: 6dbd9004a55 ("drm/sched: Drain all entities in DRM sched run job 
> worker")
> Signed-off-by: Matthew Brost 

Indeed, we cannot have any loops in the GPU scheduler work items, as we need to
bounce between submit and free in the same work queue. (This comes from the
original design, before work items/queues were introduced.)
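
To make the constraint concrete: the run-job and free-job work items share the
scheduler's submit workqueue, so neither may loop internally; each does a single
unit of work and then re-queues itself, which lets the other run in between.
Below is a minimal kernel-style sketch of that pattern. All names are
hypothetical and the logic is deliberately simplified; this is an illustration
of the idiom, not the scheduler code.

#include <linux/atomic.h>
#include <linux/workqueue.h>

static struct workqueue_struct *wq;	/* shared (ordered) workqueue */
static struct work_struct run_work, free_work;
static atomic_t ready_jobs;		/* stand-in for "jobs ready to submit" */
static atomic_t done_jobs;		/* stand-in for "jobs ready to free" */

static void run_fn(struct work_struct *w)
{
	if (atomic_dec_if_positive(&ready_jobs) < 0)
		return;			/* nothing ready: go idle, await a wakeup */
	/* ... submit exactly one job here ... */
	queue_work(wq, &run_work);	/* re-queue self; free_fn can run in between */
}

static void free_fn(struct work_struct *w)
{
	if (atomic_dec_if_positive(&done_jobs) < 0)
		return;
	/* ... free exactly one finished job here ... */
	queue_work(wq, &free_work);
}

/* Setup, e.g. at init time:
 *	wq = alloc_ordered_workqueue("sched-sim", 0);
 *	INIT_WORK(&run_work, run_fn);
 *	INIT_WORK(&free_work, free_fn);
 */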

Thanks for fixing this, Matt!

Reviewed-by: Luben Tuikov 
-- 
Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 8acbef7ae53d..7e90c9f95611 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1178,21 +1178,24 @@ static void drm_sched_run_job_work(struct work_struct 
> *w)
>   struct drm_sched_entity *entity;
>   struct dma_fence *fence;
>   struct drm_sched_fence *s_fence;
> - struct drm_sched_job *sched_job = NULL;
> + struct drm_sched_job *sched_job;
>   int r;
>  
>   if (READ_ONCE(sched->pause_submit))
>   return;
>  
>   /* Find entity with a ready job */
> - while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> - sched_job = drm_sched_entity_pop_job(entity);
> - if (!sched_job)
> - complete_all(&entity->entity_idle);
> - }
> + entity = drm_sched_select_entity(sched);
>   if (!entity)
>   return; /* No more work */
>  
> + sched_job = drm_sched_entity_pop_job(entity);
> + if (!sched_job) {
> + complete_all(&entity->entity_idle);
> + drm_sched_run_job_queue(sched);
> + return;
> + }
> +
>   s_fence = sched_job->s_fence;
>  
>   atomic_add(sched_job->credits, &sched->credit_count);
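
For context, drm_sched_run_job_queue() (not visible in the hunk above) simply
re-queues the run-job work item on the scheduler's submit workqueue. From
sched_main.c of that era it is roughly the following; treat the exact body as
approximate:

static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
{
	if (!READ_ONCE(sched->pause_submit))
		queue_work(sched->submit_wq, &sched->work_run_job);
}

So returning after complete_all() no longer risks going idle while other
entities still have ready jobs: the worker simply runs again and selects the
next entity.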


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Add Matthew Brost to maintainers

2024-02-05 Thread Luben Tuikov
On 2024-02-05 19:06, Luben Tuikov wrote:
> On 2024-02-01 07:56, Christian König wrote:
>> Am 31.01.24 um 18:11 schrieb Daniel Vetter:
>>> On Tue, Jan 30, 2024 at 07:03:02PM -0800, Matthew Brost wrote:
>>>> Add Matthew Brost to DRM scheduler maintainers.
>>>>
>>>> Cc: Luben Tuikov 
>>>> Cc: Daniel Vetter 
>>>> Cc: Dave Airlie 
>>>> Cc: Christian König 
>>>> Signed-off-by: Matthew Brost 
>>> Definitely need more people taking care of drm/sched, so thanks for
>>> volunteering!
>>>
>>> Acked-by: Daniel Vetter 
> 
> Yeah, that's a good idea.
> 
> Acked-by: Luben Tuikov 

This patch has been pushed to drm-misc-next.

Regards,
Luben

> 
> Regards,
> Luben
> 
>>>
>>> I think this also needs an ack from Luben and Christian. And you also need
>>> drm-misc commit rights first, or it's going to be a bit tricky to take
>>> care of maintainer duties for merging patches. But since your sched
>>> patches now have landed in upstream this should be just a formality.
>>
>> Ack from my side, I don't have time to look into scheduler stuff anyway.
>>
>> Maybe I can get somebody from Leo's team to volunteer as another 
>> reviewer for scheduler related stuff.
>>
>> Cheers,
>> Christian.
>>
>>>
>>> Cheers, Sima
>>>
>>>> ---
>>>>   MAINTAINERS | 1 +
>>>>   1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 5c00fad59e91..e968d68a96c8 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -7308,6 +7308,7 @@ F:   drivers/gpu/drm/xlnx/
>>>>   
>>>>   DRM GPU SCHEDULER
>>>>   M:   Luben Tuikov 
>>>> +M:Matthew Brost 
>>>>   L:   dri-devel@lists.freedesktop.org
>>>>   S:   Maintained
>>>>   T:   git git://anongit.freedesktop.org/drm/drm-misc
>>>> -- 
>>>> 2.34.1
>>>>
>>
> 

-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Add Matthew Brost to maintainers

2024-02-05 Thread Luben Tuikov
On 2024-02-01 07:56, Christian König wrote:
> Am 31.01.24 um 18:11 schrieb Daniel Vetter:
>> On Tue, Jan 30, 2024 at 07:03:02PM -0800, Matthew Brost wrote:
>>> Add Matthew Brost to DRM scheduler maintainers.
>>>
>>> Cc: Luben Tuikov 
>>> Cc: Daniel Vetter 
>>> Cc: Dave Airlie 
>>> Cc: Christian König 
>>> Signed-off-by: Matthew Brost 
>> Definitely need more people taking care of drm/sched, so thanks for
>> volunteering!
>>
>> Acked-by: Daniel Vetter 

Yeah, that's a good idea.

Acked-by: Luben Tuikov 

Regards,
Luben

>>
>> I think this also needs an ack from Luben and Christian. And you also need
>> drm-misc commit rights first, or it's going to be a bit tricky to take
>> care of maintainer duties for merging patches. But since your sched
>> patches now have landed in upstream this should be just a formality.
> 
> Ack from my side, I don't have time to look into scheduler stuff anyway.
> 
> Maybe I can get somebody from Leo's team to volunteer as another 
> reviewer for scheduler related stuff.
> 
> Cheers,
> Christian.
> 
>>
>> Cheers, Sima
>>
>>> ---
>>>   MAINTAINERS | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 5c00fad59e91..e968d68a96c8 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -7308,6 +7308,7 @@ F:drivers/gpu/drm/xlnx/
>>>   
>>>   DRM GPU SCHEDULER
>>>   M:Luben Tuikov 
>>> +M: Matthew Brost 
>>>   L:dri-devel@lists.freedesktop.org
>>>   S:Maintained
>>>   T:git git://anongit.freedesktop.org/drm/drm-misc
>>> -- 
>>> 2.34.1
>>>
> 

-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker

2024-01-29 Thread Luben Tuikov
On 2024-01-29 02:44, Christian König wrote:
> Am 26.01.24 um 17:29 schrieb Matthew Brost:
>> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
 On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>> All entities must be drained in the DRM scheduler run job worker to
>> avoid the following case. An entity found that is ready, no job found
>> ready on entity, and run job worker goes idle with other entities + jobs
>> ready. Draining all ready entities (i.e. loop over all ready entities)
>> in the run job worker ensures all job that are ready will be scheduled.
> That doesn't make sense. drm_sched_select_entity() only returns entities
> which are "ready", e.g. have a job to run.
>
 That is what I thought too, hence my original design but it is not
 exactly true. Let me explain.

 drm_sched_select_entity() returns an entity with a non-empty spsc queue
 (job in queue) and no *current* waiting dependencies [1]. Dependencies for
 an entity can be added when drm_sched_entity_pop_job() is called [2][3]
 returning a NULL job. Thus we can get into a scenario where 2 entities
 A and B both have jobs and no current dependencies. A's job is waiting on
 B's job, entity A gets selected first, a dependency gets installed in
 drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
>>> And here is the real problem. run work doesn't go idle at that moment.
>>>
>>> drm_sched_run_job_work() should restart itself until there is either no
>>> more space in the ring buffer or it can't find a ready entity any more.
>>>
>>> At least that was the original design when that was all still driven by a
>>> kthread.
>>>
>>> It may well be that we messed this up when switching from kthread to a
>>> work item.
>>>
>> Right, that's what this patch does - the run worker does not go idle until
>> no ready entities are found. That was incorrect in the original patch
>> and fixed here. Do you have any issues with this fix? It has been tested
>> three times and clearly fixes the issue.
> 
> Ah! Yes, in this case this patch is a little bit ugly as well.
> 
> The original idea was that run_job restarts itself, so that we are able to
> pause the submission thread without searching for an entity to submit more.
> 
> I strongly suggest replacing the while loop with a call to
> drm_sched_run_job_queue(), so that when the entity can't provide a job we
> just restart the queuing work.

I agree with Christian. This more closely preserves the original design
of the GPU schedulers, so we should go with that.
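
To make the failure sequence easy to replay, here is a small userspace model of
the two behaviours (hypothetical names and deliberately simplified logic; a
sketch of the race, not the kernel code). With requeue = false it reproduces
the missed wakeup; with requeue = true, entity B's job still gets scheduled:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct entity {
	const char *name;
	bool has_job;		/* non-empty job queue */
	bool dep_known;		/* a waiting dependency is already installed */
	bool dep_on_pop;	/* pop_job() will discover a dependency */
};

/* Models drm_sched_select_entity(): a queued job and no *known* dependency. */
static struct entity *select_entity(struct entity *e, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (e[i].has_job && !e[i].dep_known)
			return &e[i];
	return NULL;
}

/* Models drm_sched_entity_pop_job(): may install a dependency, return NULL. */
static bool pop_job(struct entity *e)
{
	if (e->dep_on_pop) {
		e->dep_on_pop = false;
		e->dep_known = true;	/* the dependency is installed here */
		return false;		/* i.e. a NULL job */
	}
	e->has_job = false;
	return true;
}

int main(void)
{
	struct entity ents[] = {
		{ "A", true, false, true },	/* A's job waits on B's job */
		{ "B", true, false, false },
	};
	bool requeue = true;	/* false models the worker before this fix */

	for (;;) {
		struct entity *e = select_entity(ents, 2);
		if (!e)
			return 0;	/* no ready entity: idling is correct here
					 * (A is re-armed by a fence callback once
					 * its dependency signals) */
		if (!pop_job(e)) {
			printf("%s: pop_job() returned NULL\n", e->name);
			if (!requeue) {
				printf("worker idle, B still ready: stuck\n");
				return 1;
			}
			continue;	/* the fix: run the selection again */
		}
		printf("%s: job scheduled\n", e->name);
	}
}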
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker

2024-01-28 Thread Luben Tuikov
On 2024-01-26 11:29, Matthew Brost wrote:
> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>>
>>>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>>>>> All entities must be drained in the DRM scheduler run job worker to

Hi Matt,

Thanks for the patch. Under close review, let's use "checked" instead of
"drained", to read as follows,

All entities must be checked in the DRM scheduler run job worker to ...

>>>>> avoid the following case. An entity found that is ready, no job found

Continue with the example given by using a colon, as follows,

... avoid the following case: an entity is found which is ready, yet
no job is returned for that entity when calling drm_sched_entity_pop_job(entity).
This causes the job worker to go idle. The correct behaviour is to loop
over all ready entities, until drm_sched_entity_pop_job(entity) returns non-NULL,
or there are no more ready entities.

>>>>> ready on entity, and run job worker goes idle with other entities + jobs
>>>>> ready. Draining all ready entities (i.e. loop over all ready entities)

You see here how "drain" isn't clear enough, and you clarify in parentheses
that we in fact "loop over all ready entities". So, perhaps let's not use the
verb "drain" and simply use the sentence in the paragraph I've redacted above.

Also, let's please not use "drain" in the title, as it is confusing and makes me
think of capacitors, transistors, or buckets with water and Archimedes screws
and siphons, and instead say,

[PATCH]: drm/sched: Really find a ready entity and job in DRM sched run-job worker

Which makes for a really simple and accessible description. :-)

>>>>> in the run job worker ensures all job that are ready will be scheduled.
>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>> which are "ready", e.g. have a job to run.
>>>>
>>> That is what I thought too, hence my original design but it is not
>>> exactly true. Let me explain.
>>>
>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>> (job in queue) and no *current* waiting dependencies [1]. Dependencies for
>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3]
>>> returning a NULL job. Thus we can get into a scenario where 2 entities
>>> A and B both have jobs and no current dependencies. A's job is waiting on
>>> B's job, entity A gets selected first, a dependency gets installed in
>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
>>
>> And here is the real problem. run work doesn't go idle at that moment.
>>
>> drm_sched_run_job_work() should restart itself until there is either no
>> more space in the ring buffer or it can't find a ready entity any more.
>>
>> At least that was the original design when that was all still driven by a
>> kthread.
>>
>> It may well be that we messed this up when switching from kthread to a
>> work item.
>>
> 
> Right, that's what this patch does - the run worker does not go idle until
> no ready entities are found. That was incorrect in the original patch
> and fixed here. Do you have any issues with this fix? It has been tested
> three times and clearly fixes the issue.

Thanks for following up with Christian.

I agree, the fix makes sense and achieves the original intention as described
by Christian. Also, thanks to all who tested it. Good work, thanks!

With the above changes to the patch title and text addressed, this patch would
then be,

Reviewed-by: Luben Tuikov 

-- 
Regards,
Luben

 
> 
> Matt
> 
>> Regards,
>> Christian.
>>
>>>
>>> The proper solution is to loop over all ready entities until one with a
>>> job is found via drm_sched_entity_pop_job() and then requeue the run
>>> job worker. Or loop over all entities until drm_sched_select_entity()
>>> returns NULL and then let the run job worker go idle. This is what the
>>> old threaded design did too [4]. Hope this clears everything up.
>>>
>>> Matt
>>>
>>> [1] 
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
>>> [2] 
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
>>> [3] 
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
>>> [4] 
>>> https://elixir.bootlin.

Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker

2024-01-28 Thread Luben Tuikov
On 2024-01-24 16:08, Matthew Brost wrote:
> All entities must be drained in the DRM scheduler run job worker to
> avoid the following case. An entity found that is ready, no job found
> ready on entity, and run job worker goes idle with other entities + jobs
> ready. Draining all ready entities (i.e. loop over all ready entities)
> in the run job worker ensures all job that are ready will be scheduled.
> 
> Cc: Thorsten Leemhuis 
> Reported-by: Mikhail Gavrilov 
> Closes: 
> https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuowtoeee+q26z...@mail.gmail.com/
> Reported-and-tested-by: Mario Limonciello 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> Link: 
> https://lore.kernel.org/all/20240123021155.2775-1-mario.limoncie...@amd.com/
> Reported-by: Vlastimil Babka 
> Closes: 
> https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd15...@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> Signed-off-by: Matthew Brost 

Hi Matthew,

Thanks for working on this and sending the patch.

Could we add a fixes-tag to the tag list,

Fixes: f7fe64ad0f22 ("drm/sched: Split free_job into own work item")

This really drives the point home, as shown here,
https://gitlab.freedesktop.org/drm/amd/-/issues/3124
which is mentioned in a Closes tag--thanks!
-- 
Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 550492a7a031..85f082396d42 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct 
> *w)
>   struct drm_sched_entity *entity;
>   struct dma_fence *fence;
>   struct drm_sched_fence *s_fence;
> - struct drm_sched_job *sched_job;
> + struct drm_sched_job *sched_job = NULL;
>   int r;
>  
>   if (READ_ONCE(sched->pause_submit))
>   return;
>  
> - entity = drm_sched_select_entity(sched);
> + /* Find entity with a ready job */
> + while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> + sched_job = drm_sched_entity_pop_job(entity);
> + if (!sched_job)
> + complete_all(&entity->entity_idle);
> + }
>   if (!entity)
> - return;
> -
> - sched_job = drm_sched_entity_pop_job(entity);
> - if (!sched_job) {
> - complete_all(&entity->entity_idle);
>   return; /* No more work */
> - }
>  
>   s_fence = sched_job->s_fence;
>  


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/2] drm/sched: Return an error code only as a constant in drm_sched_init()

2024-01-07 Thread Luben Tuikov
On 2023-12-26 10:58, Markus Elfring wrote:
> From: Markus Elfring 
> Date: Tue, 26 Dec 2023 16:37:37 +0100
> 
> Return an error code without storing it in an intermediate variable.
> 
> Signed-off-by: Markus Elfring 

Thank you Markus for this patch.

Reviewed-by: Luben Tuikov 

Pushed to drm-misc-next.
-- 
Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index b99d4e9ff109..1abbcdf38430 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1249,7 +1249,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  long timeout, struct workqueue_struct *timeout_wq,
>  atomic_t *score, const char *name, struct device *dev)
>  {
> - int i, ret;
> + int i;
> 
>   sched->ops = ops;
>   sched->credit_limit = credit_limit;
> @@ -1285,7 +1285,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> 
>   sched->own_submit_wq = true;
>   }
> - ret = -ENOMEM;
> +
>   sched->sched_rq = kmalloc_array(num_rqs, sizeof(*sched->sched_rq),
>   GFP_KERNEL | __GFP_ZERO);
>   if (!sched->sched_rq)
> @@ -1321,7 +1321,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   if (sched->own_submit_wq)
>   destroy_workqueue(sched->submit_wq);
>   drm_err(sched, "%s: Failed to setup GPU scheduler--out of memory\n", 
> __func__);
> - return ret;
> + return -ENOMEM;
>  }
>  EXPORT_SYMBOL(drm_sched_init);
> 
> --
> 2.43.0
> 


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] drm/sched: One function call less in drm_sched_init() after error detection

2024-01-07 Thread Luben Tuikov
On 2023-12-26 10:56, Markus Elfring wrote:
> From: Markus Elfring 
> Date: Tue, 26 Dec 2023 16:30:25 +0100
> 
> The kfree() function was called in one case by the
> drm_sched_init() function during error handling
> even if the passed data structure member contained a null pointer.
> This issue was detected by using the Coccinelle software.
> 
> Thus adjust a jump target.
> 
> Signed-off-by: Markus Elfring 

Thank you Markus for this patch.

Reviewed-by: Luben Tuikov 

Pushed to drm-misc-next.
-- 
Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 550492a7a031..b99d4e9ff109 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1289,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   sched->sched_rq = kmalloc_array(num_rqs, sizeof(*sched->sched_rq),
>   GFP_KERNEL | __GFP_ZERO);
>   if (!sched->sched_rq)
> - goto Out_free;
> + goto Out_check_own;
>   sched->num_rqs = num_rqs;
>   for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>   sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), 
> GFP_KERNEL);
> @@ -1314,9 +1314,10 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  Out_unroll:
>   for (--i ; i >= DRM_SCHED_PRIORITY_KERNEL; i--)
>   kfree(sched->sched_rq[i]);
> -Out_free:
> +
>   kfree(sched->sched_rq);
>   sched->sched_rq = NULL;
> +Out_check_own:
>   if (sched->own_submit_wq)
>   destroy_workqueue(sched->submit_wq);
>   drm_err(sched, "%s: Failed to setup GPU scheduler--out of memory\n", 
> __func__);
> --
> 2.43.0
> 
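
The general idiom the patch restores is the unwind ladder, where each label
undoes only what was successfully set up before the failure point. A condensed
sketch with hypothetical names, mirroring the corrected flow rather than the
actual drm_sched_init():

#include <linux/slab.h>
#include <linux/workqueue.h>

struct foo_rq { int prio; };

struct foo {
	struct foo_rq **rq;
	bool own_wq;
	struct workqueue_struct *wq;
};

static int foo_init(struct foo *f, int n)
{
	int i;

	f->rq = kmalloc_array(n, sizeof(*f->rq), GFP_KERNEL | __GFP_ZERO);
	if (!f->rq)
		goto out_check_own;	/* nothing to kfree() yet */

	for (i = 0; i < n; i++) {
		f->rq[i] = kzalloc(sizeof(*f->rq[i]), GFP_KERNEL);
		if (!f->rq[i])
			goto out_unroll;
	}
	return 0;

out_unroll:
	for (--i; i >= 0; i--)
		kfree(f->rq[i]);
	kfree(f->rq);
	f->rq = NULL;
out_check_own:
	if (f->own_wq)			/* only tear down what this function owns */
		destroy_workqueue(f->wq);
	return -ENOMEM;
}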


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/scheduler: Unwrap job dependencies

2023-12-09 Thread Luben Tuikov
Hi,

On 2023-12-05 14:02, Rob Clark wrote:
> From: Rob Clark 
> 
> Container fences have burner contexts, which makes the trick to store at
> most one fence per context somewhat useless if we don't unwrap array or
> chain fences.
> 
> Signed-off-by: Rob Clark 

Link: https://lore.kernel.org/all/2023034403.35742-1-robdcl...@gmail.com/

Let's include a link to the original thread, as the main discussion can be found
therein.

Christian, could you review this patch please?

Thanks!
-- 
Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 47 ++++++++++++++++++++++++++++++++---------------
>  1 file changed, 32 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 9762464e3f99..16b550949c57 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -52,6 +52,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -684,27 +685,14 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>  }
>  EXPORT_SYMBOL(drm_sched_job_arm);
>  
> -/**
> - * drm_sched_job_add_dependency - adds the fence as a job dependency
> - * @job: scheduler job to add the dependencies to
> - * @fence: the dma_fence to add to the list of dependencies.
> - *
> - * Note that @fence is consumed in both the success and error cases.
> - *
> - * Returns:
> - * 0 on success, or an error on failing to expand the array.
> - */
> -int drm_sched_job_add_dependency(struct drm_sched_job *job,
> -  struct dma_fence *fence)
> +static int drm_sched_job_add_single_dependency(struct drm_sched_job *job,
> +struct dma_fence *fence)
>  {
>   struct dma_fence *entry;
>   unsigned long index;
>   u32 id = 0;
>   int ret;
>  
> - if (!fence)
> - return 0;
> -
>   /* Deduplicate if we already depend on a fence from the same context.
>* This lets the size of the array of deps scale with the number of
>* engines involved, rather than the number of BOs.
> @@ -728,6 +716,35 @@ int drm_sched_job_add_dependency(struct drm_sched_job 
> *job,
>  
>   return ret;
>  }
> +
> +/**
> + * drm_sched_job_add_dependency - adds the fence as a job dependency
> + * @job: scheduler job to add the dependencies to
> + * @fence: the dma_fence to add to the list of dependencies.
> + *
> + * Note that @fence is consumed in both the success and error cases.
> + *
> + * Returns:
> + * 0 on success, or an error on failing to expand the array.
> + */
> +int drm_sched_job_add_dependency(struct drm_sched_job *job,
> +  struct dma_fence *fence)
> +{
> + struct dma_fence_unwrap iter;
> + struct dma_fence *f;
> + int ret = 0;
> +
> + dma_fence_unwrap_for_each(f, &iter, fence) {
> + dma_fence_get(f);
> + ret = drm_sched_job_add_single_dependency(job, f);
> + if (ret)
> + break;
> + }
> +
> + dma_fence_put(fence);
> +
> + return ret;
> +}
>  EXPORT_SYMBOL(drm_sched_job_add_dependency);
>  
>  /**


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 22:36, Luben Tuikov wrote:
> On 2023-11-29 15:49, Alex Deucher wrote:
>> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
>>>
>>> Actually I think I see the problem.  I'll try and send out a patch
>>> later today to test.
>>
>> Does the attached patch fix it?
> 
> Thanks for the patch, Alex.
> 
> Is it possible for AMD to also reproduce this issue and test this patch on a 
> Navi23 system?
> 
>> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
>> From: Alex Deucher 
>> Date: Wed, 29 Nov 2023 15:44:25 -0500
>> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
>>
>> We need to make disable this after the last eviction
> 
> "make disable" --> "disable"
> 
>> call, but before we disable the SDMA IP.
>>
>> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
>> Link: 
>> https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html
> 
> Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net
> 
> Let's link the start of the thread.
> 
> Regards,
> Luben
> 
>> Signed-off-by: Alex Deucher 
>> Cc: Phillip Susi 
>> Cc: Luben Tuikov 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index b5edf40b5d03..78553e027db4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
>> fbcon)
>>  
>>  amdgpu_ras_suspend(adev);
>>  
>> -amdgpu_ttm_set_buffer_funcs_status(adev, false);
>> -
>>  amdgpu_device_ip_suspend_phase1(adev);
>>  
>>  if (!adev->in_s0ix)
>> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
>> fbcon)
>>  if (r)
>>  return r;
>>  
>> +amdgpu_ttm_set_buffer_funcs_status(adev, false);
>> +

If you're moving this past phase 1, there's another instance in
amdgpu_device_ip_suspend(), which may need to be moved down.

Regards,
Luben

>>  amdgpu_fence_driver_hw_fini(adev);
>>  
>>  amdgpu_device_ip_suspend_phase2(adev);
> 
>>
>> Alex
>>
>>>
>>> Alex
>>>
>>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
>>>>
>>>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
>>>>>
>>>>> On 2023-11-29 10:22, Alex Deucher wrote:
>>>>>> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On 2023-11-28 17:13, Alex Deucher wrote:
>>>>>>>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Alex Deucher  writes:
>>>>>>>>>>
>>>>>>>>>>>> In that case those are the already known problems with the 
>>>>>>>>>>>> scheduler
>>>>>>>>>>>> changes, aren't they?
>>>>>>>>>>>
>>>>>>>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>>>>>>>> misunderstanding what the original report was actually testing.  If 
>>>>>>>>>>> it
>>>>>>>>>>> was 6.7, then try reverting:
>>>>>>>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>>>>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>>>>>>>
>>>>>>>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>>>>>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>>>>>>>> today and still has the problem.  Is there a known issue I could 
>>>>>>>>>> track?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> At this point, unless there are any objections, I think we should just
>>>>

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 15:49, Alex Deucher wrote:
> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
>>
>> Actually I think I see the problem.  I'll try and send out a patch
>> later today to test.
> 
> Does the attached patch fix it?

Thanks for the patch, Alex.

Is it possible for AMD to also reproduce this issue and test this patch on a 
Navi23 system?

> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
> From: Alex Deucher 
> Date: Wed, 29 Nov 2023 15:44:25 -0500
> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
> 
> We need to make disable this after the last eviction

"make disable" --> "disable"

> call, but before we disable the SDMA IP.
> 
> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
> Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html

Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net

Let's link the start of the thread.

Regards,
Luben

> Signed-off-by: Alex Deucher 
> Cc: Phillip Susi 
> Cc: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b5edf40b5d03..78553e027db4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>  
>   amdgpu_ras_suspend(adev);
>  
> - amdgpu_ttm_set_buffer_funcs_status(adev, false);
> -
>   amdgpu_device_ip_suspend_phase1(adev);
>  
>   if (!adev->in_s0ix)
> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>   if (r)
>   return r;
>  
> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
> +
>   amdgpu_fence_driver_hw_fini(adev);
>  
>   amdgpu_device_ip_suspend_phase2(adev);

> 
> Alex
> 
>>
>> Alex
>>
>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
>>>
>>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
>>>>
>>>> On 2023-11-29 10:22, Alex Deucher wrote:
>>>>> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
>>>>> wrote:
>>>>>>
>>>>>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
>>>>>> wrote:
>>>>>>>
>>>>>>> On 2023-11-28 17:13, Alex Deucher wrote:
>>>>>>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Alex Deucher  writes:
>>>>>>>>>
>>>>>>>>>>> In that case those are the already known problems with the scheduler
>>>>>>>>>>> changes, aren't they?
>>>>>>>>>>
>>>>>>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>>>>>>> misunderstanding what the original report was actually testing.  If 
>>>>>>>>>> it
>>>>>>>>>> was 6.7, then try reverting:
>>>>>>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>>>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>>>>>>
>>>>>>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>>>>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>>>>>>> today and still has the problem.  Is there a known issue I could 
>>>>>>>>> track?
>>>>>>>>>
>>>>>>>>
>>>>>>>> At this point, unless there are any objections, I think we should just
>>>>>>>> revert the two patches
>>>>>>> Uhm, no.
>>>>>>>
>>>>>>> Why "the two" patches?
>>>>>>>
>>>>>>> This email, part of this thread,
>>>>>>>
>>>>>>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>>>>>>
>>>>>>> clearly states that reverting *only* this commit,
>>>>>>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number 
>>>>>>> of run-queues
>>>>>>> *does not* mi

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 10:22, Alex Deucher wrote:
> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  wrote:
>>
>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
>>>
>>> On 2023-11-28 17:13, Alex Deucher wrote:
>>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>>>>>
>>>>> Alex Deucher  writes:
>>>>>
>>>>>>> In that case those are the already known problems with the scheduler
>>>>>>> changes, aren't they?
>>>>>>
>>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>>> misunderstanding what the original report was actually testing.  If it
>>>>>> was 6.7, then try reverting:
>>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>>
>>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>>> today and still has the problem.  Is there a known issue I could track?
>>>>>
>>>>
>>>> At this point, unless there are any objections, I think we should just
>>>> revert the two patches
>>> Uhm, no.
>>>
>>> Why "the two" patches?
>>>
>>> This email, part of this thread,
>>>
>>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>>
>>> clearly states that reverting *only* this commit,
>>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
>>> run-queues
>>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
>>> really change
>>> anything operational, other than using an allocated array, instead of a 
>>> static one, in DRM,
>>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>>
>>> Leaving us with only this change,
>>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>>> to be at fault, as the kernel log attached in the linked email above shows.
>>>
>>> The conclusion is that only b70438004a14f4 needs reverting.
>>
>> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
>> 56e449603f0ac5 breaks amdgpu.
> 
> We can try and re-enable it in the next kernel.  I'm just not sure
> we'll be able to fix this in time for 6.7 with the holidays and all
> and I don't want to cause a lot of scheduler churn at the end of the
> 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> best short term solution.

A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened
the opportunity for a 1-to-1 relationship between an entity and a scheduler.
(Should've always been the case, from the outset. Not sure why it was coded as
a fixed-size array.)

Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the problem
is wholly contained in amdgpu, and no other driver has this problem, there is
no reason to have to "churn", i.e. go back and forth in DRM, only to cover up
an init bug in amdgpu. See the response I just sent in this thread:
https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com

And it's not like this issue is unknown. I first posted about it on 2023-10-16. 

Ideally, amdgpu would just fix their init code.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 08:50, Alex Deucher wrote:
> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
>>
>> On 2023-11-28 17:13, Alex Deucher wrote:
>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>>>>
>>>> Alex Deucher  writes:
>>>>
>>>>>> In that case those are the already known problems with the scheduler
>>>>>> changes, aren't they?
>>>>>
>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>> misunderstanding what the original report was actually testing.  If it
>>>>> was 6.7, then try reverting:
>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>
>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>> today and still has the problem.  Is there a known issue I could track?
>>>>
>>>
>>> At this point, unless there are any objections, I think we should just
>>> revert the two patches
>> Uhm, no.
>>
>> Why "the two" patches?
>>
>> This email, part of this thread,
>>
>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>
>> clearly states that reverting *only* this commit,
>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
>> run-queues
>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
>> really change
>> anything operational, other than using an allocated array, instead of a 
>> static one, in DRM,
>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>
>> Leaving us with only this change,
>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>> to be at fault, as the kernel log attached in the linked email above shows.
>>
>> The conclusion is that only b70438004a14f4 needs reverting.
> 
> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> 56e449603f0ac5 breaks amdgpu.

It doesn't "break" it, amdgpu just needs to be fixed.

I know we put in a Fixes tag in
b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level"
pointing to 56e449603f0ac5 "drm/sched: Convert the GPU scheduler to variable number of run-queues",
but given the testing Phillip has done, the culprit is wholly contained in
the amdgpu driver code.

No other driver has this problem since commit 56e449603f0ac5.

The Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level"
should ideally have pointed to an amdgpu-driver code commit only (perhaps an
old-old commit), and I was a bit uncomfortable putting in a Fixes tag which
pointed to DRM code, but we did it so that the amdgpu commit follows the
changes in DRM. In retrospect, the Fixes tag should have pointed to the
amdgpu-driver commit where that amdgpu code was originally written.

I remember that the problem was really that amdgpu called drm_sched_entity_init()
in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized the
scheduler used therein. For instance, the code before commit b70438004a14f4
looked like this:

void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
{
	struct ttm_resource_manager *man = ttm_manager_type(&adev->mman.bdev,
							    TTM_PL_VRAM);
	uint64_t size;
	int r;

	if (!adev->mman.initialized || amdgpu_in_reset(adev) ||
	    adev->mman.buffer_funcs_enabled == enable)
		return;

	if (enable) {
		struct amdgpu_ring *ring;
		struct drm_gpu_scheduler *sched;

		ring = adev->mman.buffer_funcs_ring;
		sched = &ring->sched;            <-- LT: No one has initialized this scheduler
		r = drm_sched_entity_init(&adev->mman.entity,  <-- Oopses, now that sched->sched_rq is not a static array
					  DRM_SCHED_PRIORITY_KERNEL, &sched,
					  1, NULL);
		if (r) {
			DRM_ERROR("Failed setting up TTM BO move entity (%d)\n",
				  r);
			return;
		}


Before commit 56e449603f0ac5, amdgpu was getting away with this, because the
sched->sched_rq was a static array.
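
A tiny userspace illustration of why the embedded (static) array masked the
bug while the allocated array oopses; the types are hypothetical, just to show
the mechanism:

#include <stdio.h>

struct rq { int prio; };

/* Before 56e449603f0ac5: run-queues embedded in the scheduler struct. */
struct sched_embedded { struct rq sched_rq[4]; };

/* After 56e449603f0ac5: run-queues allocated by drm_sched_init(). */
struct sched_allocated { struct rq **sched_rq; };

int main(void)
{
	struct sched_embedded s1 = {0};	/* zeroed, but never properly initialized */
	struct sched_allocated s2 = {0};

	s1.sched_rq[0].prio = 1;	/* "works" by accident: valid storage exists */
	printf("embedded rq[0].prio = %d\n", s1.sched_rq[0].prio);

	s2.sched_rq[0]->prio = 1;	/* NULL dereference: the kernel-side oops */
	return 0;
}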

Ideally, amdgpu code would be fixed.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-28 Thread Luben Tuikov
On 2023-11-28 17:13, Alex Deucher wrote:
> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>>
>> Alex Deucher  writes:
>>
 In that case those are the already known problems with the scheduler
 changes, aren't they?
>>>
>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>> misunderstanding what the original report was actually testing.  If it
>>> was 6.7, then try reverting:
>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>> b70438004a14f4d0f9890b3297cd66248728546c
>>
>> At some point it was suggested that I file a gitlab issue, but I took
>> this to mean it was already known and being worked on.  -rc3 came out
>> today and still has the problem.  Is there a known issue I could track?
>>
> 
> At this point, unless there are any objections, I think we should just
> revert the two patches
Uhm, no.

Why "the two" patches?

This email, part of this thread,

https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/

clearly states that reverting *only* this commit,
56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of run-queues
*does not* mitigate the failed suspend. (Furthermore, this commit doesn't really
change anything operational, other than using an allocated array, instead of a
static one, in DRM, while the 2nd patch is solely contained within the amdgpu
driver code.)

Leaving us with only this change,
b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
to be at fault, as the kernel log attached in the linked email above shows.

The conclusion is that only b70438004a14f4 needs reverting.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Partial revert of "Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()"

2023-11-28 Thread Luben Tuikov
On 2023-11-27 11:09, Bert Karwatzki wrote:
> Commit f3123c2590005c, in combination with the use of work queues by the GPU
> scheduler, leads to random lock-ups of the GUI.
> 
> This is a partial revert of commit f3123c2590005c since drm_sched_wakeup()
> still
> needs its entity argument to pass it to drm_sched_can_queue().
> 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> Link: 
> https://lists.freedesktop.org/archives/dri-devel/2023-November/431606.html
> Fixes: f3123c2590005c ("drm/sched: Qualify drm_sched_wakeup() by 
> drm_sched_entity_is_ready()")
> 
> Signed-off-by: Bert Karwatzki 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 682aebe96db7..550492a7a031 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> struct drm_sched_entity *entity)
>  {
> - if (drm_sched_entity_is_ready(entity))
> - if (drm_sched_can_queue(sched, entity))
> - drm_sched_run_job_queue(sched);
> + if (drm_sched_can_queue(sched, entity))
> + drm_sched_run_job_queue(sched);
>  }
> 
>  /**
> --
> 2.43.0
> 

Reviewed-by: Luben Tuikov 

Pushed to drm-misc-next.

Thanks!
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] Revert "drm/sched: Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()"

2023-11-27 Thread Luben Tuikov
Hi Bert,

# The title of the patch should be:

drm/sched: Partial revert of "Qualify drm_sched_wakeup() by 
drm_sched_entity_is_ready()"

On 2023-11-27 08:30, Bert Karwatzki wrote:
> Commit f3123c25 (in combination with the use of work queues by the gpu

Commit f3123c2590005c, in combination with the use of work queues by the GPU
scheduler, leads to random lock-ups of the GUI.

> scheduler) leads to random lock ups of the GUI [1,2].
> 
> This is not a complete revert of commit f3123c25 as drm_sched_wakeup

This is a partial revert of commit f3123c2590005c since drm_sched_wakeup()

> still needs its entity argument to pass it to drm_sched_can_queue.

... drm_sched_can_queue().

# Don't forget a SoB line!

Signed-off-by: Bert ...

>> [1] https://gitlab.freedesktop.org/drm/amd/-/issues/2994

# Use a Link: tag instead, like this:
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2994

> [2] https://lists.freedesktop.org/archives/dri-devel/2023-November/431606.html

# Use a Link: tag instead, like this:
Link: https://lists.freedesktop.org/archives/dri-devel/2023-November/431606.html

> 
> This reverts commit f3123c2590005c5ff631653d31428e40cd10c618.

# The line above is *not* necessary, since this is a partial revert. Instead we
# need a Fixes: line, like this:

Fixes: f3123c2590005c ("drm/sched: Qualify drm_sched_wakeup() by 
drm_sched_entity_is_ready()")

###---

Then after you do "git format-patch", post it like this:

git send-email \
--in-reply-to=c5292d06-2e37-4715-96dc-699f36911...@gmail.com \
--to=ltuiko...@gmail.com \
--cc=christian.koe...@amd.com \
--cc=d...@redhat.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=matthew.br...@intel.com \
--cc=spassw...@web.de \
--cc=tvrtko.ursu...@intel.com \
/path/to/PATCH

This follows your thread where all the information is stored.

Thanks!
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] drm/sched: Rename priority MIN to LOW

2023-11-27 Thread Luben Tuikov
On 2023-11-27 09:20, Christian König wrote:
> Am 27.11.23 um 15:13 schrieb Luben Tuikov:
>> On 2023-11-27 08:55, Christian König wrote:
>>> Hi Luben,
>>>
>>> Am 24.11.23 um 08:57 schrieb Christian König:
>>>> Am 24.11.23 um 06:27 schrieb Luben Tuikov:
>>>>> Rename DRM_SCHED_PRIORITY_MIN to DRM_SCHED_PRIORITY_LOW.
>>>>>
>>>>> This mirrors DRM_SCHED_PRIORITY_HIGH, for a list of DRM scheduler
>>>>> priorities
>>>>> in ascending order,
>>>>>     DRM_SCHED_PRIORITY_LOW,
>>>>>     DRM_SCHED_PRIORITY_NORMAL,
>>>>>     DRM_SCHED_PRIORITY_HIGH,
>>>>>     DRM_SCHED_PRIORITY_KERNEL.
>>>>>
>>>>> Cc: Rob Clark 
>>>>> Cc: Abhinav Kumar 
>>>>> Cc: Dmitry Baryshkov 
>>>>> Cc: Danilo Krummrich 
>>>>> Cc: Alex Deucher 
>>>>> Cc: Christian König 
>>>>> Cc: linux-arm-...@vger.kernel.org
>>>>> Cc: freedr...@lists.freedesktop.org
>>>>> Cc: dri-devel@lists.freedesktop.org
>>>>> Signed-off-by: Luben Tuikov 
>>>> Reviewed-by: Christian König 
>>> Looks like you missed one usage in Nouveau:
>>>
>>> drivers/gpu/drm/nouveau/nouveau_sched.c:21:41: error:
>>> ‘DRM_SCHED_PRIORITY_MIN’ undeclared here (not in a function); did you
>>> mean ‘DRM_SCHED_PRIORITY_LOW’?
>>>      21 | NOUVEAU_SCHED_PRIORITY_SINGLE = DRM_SCHED_PRIORITY_MIN,
>>>     | ^~
>>>     | DRM_SCHED_PRIORITY_LOW
>>>
>>> This now results in a build error on drm-misc-next.
>> I'm waiting for someone to R-B the fix I posted two days ago:
>> https://lore.kernel.org/r/20231125192246.87268-2-ltuiko...@gmail.com
> 
> There must be something wrong with the dri-devel mailing list (or my 
> gmail, but I doubt it). I don't see this mail in my inbox anywhere.
> 
> Feel free to add my rb and push it.

Done.

Thanks.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] drm/sched: Rename priority MIN to LOW

2023-11-27 Thread Luben Tuikov
On 2023-11-27 08:55, Christian König wrote:
> Hi Luben,
> 
> Am 24.11.23 um 08:57 schrieb Christian König:
>> Am 24.11.23 um 06:27 schrieb Luben Tuikov:
>>> Rename DRM_SCHED_PRIORITY_MIN to DRM_SCHED_PRIORITY_LOW.
>>>
>>> This mirrors DRM_SCHED_PRIORITY_HIGH, for a list of DRM scheduler 
>>> priorities
>>> in ascending order,
>>>    DRM_SCHED_PRIORITY_LOW,
>>>    DRM_SCHED_PRIORITY_NORMAL,
>>>    DRM_SCHED_PRIORITY_HIGH,
>>>    DRM_SCHED_PRIORITY_KERNEL.
>>>
>>> Cc: Rob Clark 
>>> Cc: Abhinav Kumar 
>>> Cc: Dmitry Baryshkov 
>>> Cc: Danilo Krummrich 
>>> Cc: Alex Deucher 
>>> Cc: Christian König 
>>> Cc: linux-arm-...@vger.kernel.org
>>> Cc: freedr...@lists.freedesktop.org
>>> Cc: dri-devel@lists.freedesktop.org
>>> Signed-off-by: Luben Tuikov 
>>
>> Reviewed-by: Christian König 
> 
> Looks like you missed one usage in Nouveau:
> 
> drivers/gpu/drm/nouveau/nouveau_sched.c:21:41: error: 
> ‘DRM_SCHED_PRIORITY_MIN’ undeclared here (not in a function); did you 
> mean ‘DRM_SCHED_PRIORITY_LOW’?
>     21 | NOUVEAU_SCHED_PRIORITY_SINGLE = DRM_SCHED_PRIORITY_MIN,
>    | ^~
>    | DRM_SCHED_PRIORITY_LOW
> 
> This now results in a build error on drm-misc-next.

I'm waiting for someone to R-B the fix I posted two days ago:
https://lore.kernel.org/r/20231125192246.87268-2-ltuiko...@gmail.com
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: linux-next: build failure after merge of the drm-misc tree

2023-11-26 Thread Luben Tuikov
On 2023-11-26 18:38, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the drm-misc tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> drivers/gpu/drm/nouveau/nouveau_sched.c:21:41: error: 
> 'DRM_SCHED_PRIORITY_MIN' undeclared here (not in a function); did you mean 
> 'DRM_SCHED_PRIORITY_LOW'?
>21 | NOUVEAU_SCHED_PRIORITY_SINGLE = DRM_SCHED_PRIORITY_MIN,
>   | ^~
>   | DRM_SCHED_PRIORITY_LOW
> 
> Caused by commit
> 
>   fe375c74806d ("drm/sched: Rename priority MIN to LOW")
> 
> I have used the drm-misc tree from next-20231124 for today.

I posted a fix for this yesterday:
https://lore.kernel.org/r/20231125192246.87268-2-ltuiko...@gmail.com
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Fix compilation issues with DRM priority rename

2023-11-26 Thread Luben Tuikov
On 2023-11-25 14:22, Luben Tuikov wrote:
> Fix compilation issues with DRM scheduler priority rename MIN to LOW.
> 
> Signed-off-by: Luben Tuikov 
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202311252109.wgbjsskg-...@intel.com/
> Cc: Danilo Krummrich 
> Cc: Frank Binns 
> Cc: Donald Robson 
> Cc: Matt Coster 
> Cc: Direct Rendering Infrastructure - Development 
> 
> Fixes: fe375c74806dbd ("drm/sched: Rename priority MIN to LOW")
Fixes: 38f922a563aac3 ("drm/sched: Reverse run-queue priority enumeration")
> Fixes: 5f03a507b29e44 ("drm/nouveau: implement 1:1 scheduler - entity 
> relationship")
> ---

Added an additional Fixes tag as shown above, to complete the set.
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: drm scheduler redesign causes deadlocks [extended repost]

2023-11-25 Thread Luben Tuikov
On 2023-11-24 04:38, Bert Karwatzki wrote:
> Am Mittwoch, dem 22.11.2023 um 18:02 -0500 schrieb Luben Tuikov:
>> On 2023-11-21 04:00, Bert Karwatzki wrote:
>>> Since linux-next-20231115 my linux system (debian sid on an msi alpha 15
>>> laptop) suffers from random deadlocks which can occur after 30-180 min of
>>> usage. These deadlocks can be actively provoked by creating high system load
>>> (usually by compiling a kernel with make -j NRCPUS) and then opening
>>> instances of libreoffice --writer until the system GUI locks (the mouse
>>> cursor can still be moved but the screen is frozen). In this state ssh'ing
>>> into the machine is still possible and at least sometimes log messages
>>> about hung tasks appear in /var/log/kern.log.
>>>
>>> More info can be found here:
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
>>>
>>> Using the method described to trigger the bug I bisected the problem in the
>>> linux-next and drm-misc trees to commit f3123c2590005.
>>> As this simple patch fixes the problem
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 044a8c4875ba..25b97db1b623 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>>>   struct drm_sched_entity *entity)
>>>  {
>>> -   if (drm_sched_entity_is_ready(entity))
>>> -   if (drm_sched_can_queue(sched, entity))
>>> -   drm_sched_run_job_queue(sched);
>>> +   if (drm_sched_can_queue(sched, entity))
>>> +   drm_sched_run_job_queue(sched);
>>>  }
>>>  
>>>  /**
>>>
>>> the problem might be in the entity->dependency branch of drm_sched_entity_is_ready()
>>> (some kind of circular dependency ...).
>>>
>>> To see if the change to drm_sched_wakeup is the actual cause of the problem,
>>> or if this problem has been caused by the redesign of the drm scheduler in
>>> linux-next-20231115+, I created the following patch for linux-6.6.0:
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index a42763e1429d..dc2abd299aeb 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>>  container_of(cb, struct drm_sched_entity, cb);
>>>
>>>  drm_sched_entity_clear_dep(f, cb);
>>> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  }
>>>
>>>  /**
>>> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job
>>> *sched_job)
>>>  if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>  drm_sched_rq_update_fifo(entity, submit_ts);
>>>
>>> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  }
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_entity_push_job);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 5a3a622fc672..bbe06403b33d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct
>>> drm_gpu_scheduler
>>> *sched)
>>>   *
>>>   * Wake up the scheduler if we can queue jobs.
>>>   */
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
>>> drm_sched_entity *entity)
>>>  {
>>> - if (drm_sched_can_queue(sched))
>>> - wake_up_interruptible(&sched->wake_up_worker);
>>> + if(drm_sched_entity_is_ready(entity))
>>> + if (drm_sched_can_queue(sched))
>>> + wake_up_interruptible(&sched->wake_up_worker);
>>>  }
>>>
>>>  /**
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index ac65f0626cfc..6cfe3d193e69 100644
>>> 

[PATCH] drm/sched: Fix compilation issues with DRM priority rename

2023-11-25 Thread Luben Tuikov
Fix compilation issues with DRM scheduler priority rename MIN to LOW.

Signed-off-by: Luben Tuikov 
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202311252109.wgbjsskg-...@intel.com/
Cc: Danilo Krummrich 
Cc: Frank Binns 
Cc: Donald Robson 
Cc: Matt Coster 
Cc: Direct Rendering Infrastructure - Development 

Fixes: fe375c74806dbd ("drm/sched: Rename priority MIN to LOW")
Fixes: 5f03a507b29e44 ("drm/nouveau: implement 1:1 scheduler - entity 
relationship")
---
 drivers/gpu/drm/imagination/pvr_queue.c | 2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/imagination/pvr_queue.c 
b/drivers/gpu/drm/imagination/pvr_queue.c
index d65c3fbedf5ac4..5ed9c98fb599c8 100644
--- a/drivers/gpu/drm/imagination/pvr_queue.c
+++ b/drivers/gpu/drm/imagination/pvr_queue.c
@@ -1292,7 +1292,7 @@ struct pvr_queue *pvr_queue_create(struct pvr_context 
*ctx,
goto err_release_ufo;
 
	err = drm_sched_entity_init(&queue->entity,
-   DRM_SCHED_PRIORITY_MIN,
+   DRM_SCHED_PRIORITY_KERNEL,
				    &sched, 1, &queue->faulty);
if (err)
goto err_sched_fini;
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c 
b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 3393647bd94423..dd98f6910f9cab 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -18,7 +18,7 @@
  * index to the run-queue array.
  */
 enum nouveau_sched_priority {
-   NOUVEAU_SCHED_PRIORITY_SINGLE = DRM_SCHED_PRIORITY_MIN,
+   NOUVEAU_SCHED_PRIORITY_SINGLE = DRM_SCHED_PRIORITY_KERNEL,
NOUVEAU_SCHED_PRIORITY_COUNT,
 };
 
@@ -423,7 +423,7 @@ nouveau_sched_init(struct nouveau_sched *sched, struct 
nouveau_drm *drm,
if (ret)
goto fail_wq;
 
-   /* Using DRM_SCHED_PRIORITY_MIN, since that's what we're required to use
+   /* Using DRM_SCHED_PRIORITY_KERNEL, since that's what we're required to 
use
 * when we want to have a single run-queue only.
 *
 * It's not documented, but one will find out when trying to use any
@@ -433,7 +433,7 @@ nouveau_sched_init(struct nouveau_sched *sched, struct 
nouveau_drm *drm,
 * Can't use NOUVEAU_SCHED_PRIORITY_SINGLE either, because it's not
 * matching the enum type used in drm_sched_entity_init().
 */
-   ret = drm_sched_entity_init(entity, DRM_SCHED_PRIORITY_MIN,
+   ret = drm_sched_entity_init(entity, DRM_SCHED_PRIORITY_KERNEL,
				    &drm_sched, 1, NULL);
if (ret)
goto fail_sched;

base-commit: 38f922a563aac3148ac73e73689805917f034cb5
-- 
2.43.0



Re: [PATCH 2/2] drm/sched: Reverse run-queue priority enumeration

2023-11-24 Thread Luben Tuikov
On 2023-11-24 04:38, Christian König wrote:
> Am 24.11.23 um 09:22 schrieb Luben Tuikov:
>> On 2023-11-24 03:04, Christian König wrote:
>>> Am 24.11.23 um 06:27 schrieb Luben Tuikov:
>>>> Reverse run-queue priority enumeration such that the highest priority is 
>>>> now 0,
>>>> and for each consecutive integer the priority diminishes.
>>>>
>>>> Run-queues correspond to priorities. To an external observer a scheduler
>>>> created with a single run-queue, and another created with
>>>> DRM_SCHED_PRIORITY_COUNT number of run-queues, should always schedule
>>>> sched->sched_rq[0] with the same "priority", as that index run-queue 
>>>> exists in
>>>> both schedulers, i.e. a scheduler with one run-queue or many. This patch 
>>>> makes
>>>> it so.
>>>>
>>>> In other words, the "priority" of sched->sched_rq[n], n >= 0, is the same 
>>>> for
>>>> any scheduler created with any allowable number of run-queues 
>>>> (priorities), 0
>>>> to DRM_SCHED_PRIORITY_COUNT.
>>>>
>>>> Cc: Rob Clark 
>>>> Cc: Abhinav Kumar 
>>>> Cc: Dmitry Baryshkov 
>>>> Cc: Danilo Krummrich 
>>>> Cc: Alex Deucher 
>>>> Cc: Christian König 
>>>> Cc: linux-arm-...@vger.kernel.org
>>>> Cc: freedr...@lists.freedesktop.org
>>>> Cc: dri-devel@lists.freedesktop.org
>>>> Signed-off-by: Luben Tuikov 
>>>> ---
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  2 +-
>>>>drivers/gpu/drm/msm/msm_gpu.h|  2 +-
>>>>drivers/gpu/drm/scheduler/sched_entity.c |  7 ---
>>>>drivers/gpu/drm/scheduler/sched_main.c   | 15 +++
>>>>include/drm/gpu_scheduler.h  |  6 +++---
>>>>5 files changed, 16 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 1a25931607c514..71a5cf37b472d4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -325,7 +325,7 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
>>>> drm_gpu_scheduler *sched)
>>>>int i;
>>>>
>>>>/* Signal all jobs not yet scheduled */
>>>> -  for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_LOW; i--) {
>>>> +  for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>>>struct drm_sched_rq *rq = sched->sched_rq[i];
>>>>	spin_lock(&rq->lock);
>>>>	list_for_each_entry(s_entity, &rq->entities, list) {
>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>>> index eb0c97433e5f8a..2bfcb222e35338 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>> @@ -347,7 +347,7 @@ struct msm_gpu_perfcntr {
>>>> * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
>>>> * cases, so we don't use it (no need for kernel generated jobs).
>>>> */
>>>> -#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - 
>>>> DRM_SCHED_PRIORITY_LOW)
>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_LOW - 
>>>> DRM_SCHED_PRIORITY_HIGH)
>>>>
>>>>/**
>>>> * struct msm_file_private - per-drm_file context
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index cb7445be3cbb4e..6e2b02e45e3a32 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -81,14 +81,15 @@ int drm_sched_entity_init(struct drm_sched_entity 
>>>> *entity,
>>>> */
>>>>pr_warn("%s: called with uninitialized scheduler\n", 
>>>> __func__);
>>>>} else if (num_sched_list) {
>>>> -  /* The "priority" of an entity cannot exceed the number
>>>> -   * of run-queues of a scheduler.
>>>> +  /* The "priority" of an entity cannot exceed the number of
>>>> +   * run-queues of a scheduler. Choose the lowest p

Re: linux-next: Signed-off-by missing for commit in the drm-misc tree

2023-11-24 Thread Luben Tuikov
On 2023-11-24 08:20, Jani Nikula wrote:
> On Wed, 22 Nov 2023, Luben Tuikov  wrote:
>> On 2023-11-22 07:00, Maxime Ripard wrote:
>>> Hi Luben,
>>>
>>> On Thu, Nov 16, 2023 at 09:27:58AM +0100, Daniel Vetter wrote:
>>>> On Thu, Nov 16, 2023 at 09:11:43AM +0100, Maxime Ripard wrote:
>>>>> On Tue, Nov 14, 2023 at 06:46:21PM -0500, Luben Tuikov wrote:
>>>>>> On 2023-11-13 22:08, Stephen Rothwell wrote:
>>>>>>> BTW, cherry picking commits does not avoid conflicts - in fact it can
>>>>>>> cause conflicts if there are further changes to the files affected by
>>>>>>> the cherry picked commit in either the tree/branch the commit was
>>>>>>> cherry picked from or the destination tree/branch (I have to deal with
>>>>>>> these all the time when merging the drm trees in linux-next).  Much
>>>>>>> better is to cross merge the branches so that the patch only appears
>>>>>>> once or have a shared branches that are merged by any other branch that
>>>>>>> needs the changes.
>>>>>>>
>>>>>>> I understand that things are not done like this in the drm trees :-(
>>>>>>
>>>>>> Hi Stephen,
>>>>>>
>>>>>> Thank you for the clarification--understood. I'll be more careful in the 
>>>>>> future.
>>>>>> Thanks again! :-)
>>>>>
>>>>> In this case, the best thing to do would indeed have been to ask the
>>>>> drm-misc maintainers to merge drm-misc-fixes into drm-misc-next.
>>>>>
>>>>> We're doing that all the time, but we're not ubiquitous so you need to
>>>>> ask us :)
>>>>>
>>>>> Also, dim should have caught that when you pushed the branch. Did you
>>>>> use it?
>>>>
>>>> Yeah dim must be used, exactly to avoid these issues. Both for applying
>>>> patches (so not git am directly, or cherry-picking from your own
>>>> development branch), and for pushing. The latter is even checked for by
>>>> the server (dim sets a special push flag which is very long and contains a
>>>> very clear warning if you bypass it).
>>>>
>>>> If dim was used, this would be a bug in the dim script that we need to
>>>> fix.
>>>
>>> It would be very useful for you to explain what happened here so we
>>> improve the tooling or doc and can try to make sure it doesn't happen
>>> again
>>>
>>> Maxime
>>
>> There is no problem with the tooling--I just forced the commit in.
> 
> Wait what?
> 
> What do you mean by forcing the commit in? Bypass dim?
> 
> If yes, please *never* do that when you're dealing with dim managed
> branches. That's part of the deal for getting commit access, along with
> following all the other maintainer tools documentation.

Hi Jani,

I only use dim, ever.
-- 
Regards,
Luben




Re: [PATCH 2/2] drm/sched: Reverse run-queue priority enumeration

2023-11-24 Thread Luben Tuikov
On 2023-11-24 03:04, Christian König wrote:
> Am 24.11.23 um 06:27 schrieb Luben Tuikov:
>> Reverse run-queue priority enumeration such that the highest priority is now 
>> 0,
>> and for each consecutive integer the priority diminishes.
>>
>> Run-queues correspond to priorities. To an external observer a scheduler
>> created with a single run-queue, and another created with
>> DRM_SCHED_PRIORITY_COUNT number of run-queues, should always schedule
>> sched->sched_rq[0] with the same "priority", as that index run-queue exists 
>> in
>> both schedulers, i.e. a scheduler with one run-queue or many. This patch 
>> makes
>> it so.
>>
>> In other words, the "priority" of sched->sched_rq[n], n >= 0, is the same for
>> any scheduler created with any allowable number of run-queues (priorities), 0
>> to DRM_SCHED_PRIORITY_COUNT.
>>
>> Cc: Rob Clark 
>> Cc: Abhinav Kumar 
>> Cc: Dmitry Baryshkov 
>> Cc: Danilo Krummrich 
>> Cc: Alex Deucher 
>> Cc: Christian König 
>> Cc: linux-arm-...@vger.kernel.org
>> Cc: freedr...@lists.freedesktop.org
>> Cc: dri-devel@lists.freedesktop.org
>> Signed-off-by: Luben Tuikov 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  2 +-
>>   drivers/gpu/drm/msm/msm_gpu.h|  2 +-
>>   drivers/gpu/drm/scheduler/sched_entity.c |  7 ---
>>   drivers/gpu/drm/scheduler/sched_main.c   | 15 +++
>>   include/drm/gpu_scheduler.h  |  6 +++---
>>   5 files changed, 16 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 1a25931607c514..71a5cf37b472d4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -325,7 +325,7 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
>> drm_gpu_scheduler *sched)
>>  int i;
>>   
>>  /* Signal all jobs not yet scheduled */
>> -for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_LOW; i--) {
>> +for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>  struct drm_sched_rq *rq = sched->sched_rq[i];
>>	spin_lock(&rq->lock);
>>	list_for_each_entry(s_entity, &rq->entities, list) {
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>> index eb0c97433e5f8a..2bfcb222e35338 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>> @@ -347,7 +347,7 @@ struct msm_gpu_perfcntr {
>>* DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
>>* cases, so we don't use it (no need for kernel generated jobs).
>>*/
>> -#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - 
>> DRM_SCHED_PRIORITY_LOW)
>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_LOW - 
>> DRM_SCHED_PRIORITY_HIGH)
>>   
>>   /**
>>* struct msm_file_private - per-drm_file context
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index cb7445be3cbb4e..6e2b02e45e3a32 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -81,14 +81,15 @@ int drm_sched_entity_init(struct drm_sched_entity 
>> *entity,
>>   */
>>  pr_warn("%s: called with uninitialized scheduler\n", __func__);
>>  } else if (num_sched_list) {
>> -/* The "priority" of an entity cannot exceed the number
>> - * of run-queues of a scheduler.
>> +/* The "priority" of an entity cannot exceed the number of
>> + * run-queues of a scheduler. Choose the lowest priority
>> + * available.
>>   */
>>  if (entity->priority >= sched_list[0]->num_rqs) {
>>  drm_err(sched_list[0], "entity with out-of-bounds 
>> priority:%u num_rqs:%u\n",
>>  entity->priority, sched_list[0]->num_rqs);
>>  entity->priority = max_t(s32, (s32) 
>> sched_list[0]->num_rqs - 1,
>> - (s32) DRM_SCHED_PRIORITY_LOW);
>> + (s32) 
>> DRM_SCHED_PRIORITY_KERNEL);
> 
> That seems to be a no-op. You basically say max_T(.., num_rqs - 1, 0), 
> this will always be num_rqs - 1

This protects against num_rqs being equal to 0, in which case we select KERNEL 
(0).
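
(Concretely, assuming DRM_SCHED_PRIORITY_KERNEL == 0 after the reversal: with
num_rqs == 0 the expression becomes max_t(s32, -1, 0), which evaluates to 0,
the KERNEL run-queue index.)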

This comes from "[PATCH] drm/sched: Fix bounds limiting when given a malformed 
entity"
which I sent yesterday (Message-ID: 
<20231123122422.167832-2-ltuiko...@gmail.com>).

Could you R-B that patch too?

> 
> Apart from that looks good to me.

Okay, could you R-B this patch then.
-- 
Regards,
Luben




[PATCH 1/2] drm/sched: Rename priority MIN to LOW

2023-11-23 Thread Luben Tuikov
Rename DRM_SCHED_PRIORITY_MIN to DRM_SCHED_PRIORITY_LOW.

This mirrors DRM_SCHED_PRIORITY_HIGH, for a list of DRM scheduler priorities
in ascending order,
  DRM_SCHED_PRIORITY_LOW,
  DRM_SCHED_PRIORITY_NORMAL,
  DRM_SCHED_PRIORITY_HIGH,
  DRM_SCHED_PRIORITY_KERNEL.

Cc: Rob Clark 
Cc: Abhinav Kumar 
Cc: Dmitry Baryshkov 
Cc: Danilo Krummrich 
Cc: Alex Deucher 
Cc: Christian König 
Cc: linux-arm-...@vger.kernel.org
Cc: freedr...@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c  |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  2 +-
 drivers/gpu/drm/msm/msm_gpu.h|  2 +-
 drivers/gpu/drm/scheduler/sched_entity.c |  2 +-
 drivers/gpu/drm/scheduler/sched_main.c   | 10 +-
 include/drm/gpu_scheduler.h  |  2 +-
 6 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index e2ae9ba147ba97..5cb33ac99f7089 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -73,10 +73,10 @@ amdgpu_ctx_to_drm_sched_prio(int32_t ctx_prio)
return DRM_SCHED_PRIORITY_NORMAL;
 
case AMDGPU_CTX_PRIORITY_VERY_LOW:
-   return DRM_SCHED_PRIORITY_MIN;
+   return DRM_SCHED_PRIORITY_LOW;
 
case AMDGPU_CTX_PRIORITY_LOW:
-   return DRM_SCHED_PRIORITY_MIN;
+   return DRM_SCHED_PRIORITY_LOW;
 
case AMDGPU_CTX_PRIORITY_NORMAL:
return DRM_SCHED_PRIORITY_NORMAL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 62bb7fc7448ad9..1a25931607c514 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -325,7 +325,7 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
drm_gpu_scheduler *sched)
int i;
 
/* Signal all jobs not yet scheduled */
-   for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+   for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_LOW; i--) {
struct drm_sched_rq *rq = sched->sched_rq[i];
	spin_lock(&rq->lock);
	list_for_each_entry(s_entity, &rq->entities, list) {
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 4252e3839fbc83..eb0c97433e5f8a 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -347,7 +347,7 @@ struct msm_gpu_perfcntr {
  * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
  * cases, so we don't use it (no need for kernel generated jobs).
  */
-#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - 
DRM_SCHED_PRIORITY_MIN)
+#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - 
DRM_SCHED_PRIORITY_LOW)
 
 /**
  * struct msm_file_private - per-drm_file context
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 20c9c561843ce1..cb7445be3cbb4e 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -88,7 +88,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
drm_err(sched_list[0], "entity with out-of-bounds 
priority:%u num_rqs:%u\n",
entity->priority, sched_list[0]->num_rqs);
entity->priority = max_t(s32, (s32) 
sched_list[0]->num_rqs - 1,
-(s32) DRM_SCHED_PRIORITY_MIN);
+(s32) DRM_SCHED_PRIORITY_LOW);
}
entity->rq = sched_list[0]->sched_rq[entity->priority];
}
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 044a8c4875ba64..b6d7bc49ff6ef4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1052,7 +1052,7 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
int i;
 
/* Kernel run queue has higher priority than normal run queue*/
-   for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+   for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_LOW; i--) {
entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
drm_sched_rq_select_entity_fifo(sched, 
sched->sched_rq[i]) :
drm_sched_rq_select_entity_rr(sched, 
sched->sched_rq[i]);
@@ -1291,7 +1291,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
if (!sched->sched_rq)
goto Out_free;
sched->num_rqs = num_rqs;
-   for (i = DRM_SCHED_PRIORITY_MIN; i < sched->num_rqs; i++) {
+   for (i = DRM_SCHED_PRIORITY_LOW; i < sched->num_rqs; i++) {
sched->sched_rq[i] = kzalloc(sizeof(*sched->sch

[PATCH 2/2] drm/sched: Reverse run-queue priority enumeration

2023-11-23 Thread Luben Tuikov
Reverse run-queue priority enumeration such that the highest priority is now 0,
and for each consecutive integer the priority diminishes.

Run-queues correspond to priorities. To an external observer a scheduler
created with a single run-queue, and another created with
DRM_SCHED_PRIORITY_COUNT number of run-queues, should always schedule
sched->sched_rq[0] with the same "priority", as that index run-queue exists in
both schedulers, i.e. a scheduler with one run-queue or many. This patch makes
it so.

In other words, the "priority" of sched->sched_rq[n], n >= 0, is the same for
any scheduler created with any allowable number of run-queues (priorities), 0
to DRM_SCHED_PRIORITY_COUNT.
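
For illustration, the resulting ascending enumeration would look roughly like
this (a sketch of the intent; the authoritative definition lives in
include/drm/gpu_scheduler.h):

enum drm_sched_priority {
	DRM_SCHED_PRIORITY_KERNEL,	/* 0 -- the highest priority */
	DRM_SCHED_PRIORITY_HIGH,	/* 1 */
	DRM_SCHED_PRIORITY_NORMAL,	/* 2 */
	DRM_SCHED_PRIORITY_LOW,		/* 3 -- the lowest priority */

	DRM_SCHED_PRIORITY_COUNT
};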

Cc: Rob Clark 
Cc: Abhinav Kumar 
Cc: Dmitry Baryshkov 
Cc: Danilo Krummrich 
Cc: Alex Deucher 
Cc: Christian König 
Cc: linux-arm-...@vger.kernel.org
Cc: freedr...@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  2 +-
 drivers/gpu/drm/msm/msm_gpu.h|  2 +-
 drivers/gpu/drm/scheduler/sched_entity.c |  7 ---
 drivers/gpu/drm/scheduler/sched_main.c   | 15 +++
 include/drm/gpu_scheduler.h  |  6 +++---
 5 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 1a25931607c514..71a5cf37b472d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -325,7 +325,7 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
drm_gpu_scheduler *sched)
int i;
 
/* Signal all jobs not yet scheduled */
-   for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_LOW; i--) {
+   for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
struct drm_sched_rq *rq = sched->sched_rq[i];
	spin_lock(&rq->lock);
	list_for_each_entry(s_entity, &rq->entities, list) {
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index eb0c97433e5f8a..2bfcb222e35338 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -347,7 +347,7 @@ struct msm_gpu_perfcntr {
  * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
  * cases, so we don't use it (no need for kernel generated jobs).
  */
-#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - 
DRM_SCHED_PRIORITY_LOW)
+#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_LOW - 
DRM_SCHED_PRIORITY_HIGH)
 
 /**
  * struct msm_file_private - per-drm_file context
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index cb7445be3cbb4e..6e2b02e45e3a32 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -81,14 +81,15 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 */
pr_warn("%s: called with uninitialized scheduler\n", __func__);
} else if (num_sched_list) {
-   /* The "priority" of an entity cannot exceed the number
-* of run-queues of a scheduler.
+   /* The "priority" of an entity cannot exceed the number of
+* run-queues of a scheduler. Choose the lowest priority
+* available.
 */
if (entity->priority >= sched_list[0]->num_rqs) {
drm_err(sched_list[0], "entity with out-of-bounds 
priority:%u num_rqs:%u\n",
entity->priority, sched_list[0]->num_rqs);
entity->priority = max_t(s32, (s32) 
sched_list[0]->num_rqs - 1,
-(s32) DRM_SCHED_PRIORITY_LOW);
+(s32) 
DRM_SCHED_PRIORITY_KERNEL);
}
entity->rq = sched_list[0]->sched_rq[entity->priority];
}
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index b6d7bc49ff6ef4..682aebe96db781 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1051,8 +1051,9 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
struct drm_sched_entity *entity;
int i;
 
-   /* Kernel run queue has higher priority than normal run queue*/
-   for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_LOW; i--) {
+   /* Start with the highest priority.
+*/
+   for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
drm_sched_rq_select_entity_fifo(sched, 
sched->sched_rq[i]) :
drm_sched_rq_select_entity_rr(sched, 
sched->sched_rq[i]);
@@ -1291,7 +1292,7 @@ int drm

[PATCH 0/2] Make scheduling of the same index, the same

2023-11-23 Thread Luben Tuikov
The first patch renames priority MIN to LOW.

The second patch makes the "priority" of the same run-queue index the same in
any two schedulers.

This series sits on top of this fix
https://patchwork.freedesktop.org/patch/568723/ which I sent yesterday.

Luben Tuikov (2):
  drm/sched: Rename priority MIN to LOW
  drm/sched: Reverse run-queue priority enumeration

 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c  |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  2 +-
 drivers/gpu/drm/msm/msm_gpu.h|  2 +-
 drivers/gpu/drm/scheduler/sched_entity.c |  7 ---
 drivers/gpu/drm/scheduler/sched_main.c   | 15 +++
 include/drm/gpu_scheduler.h  |  6 +++---
 6 files changed, 18 insertions(+), 18 deletions(-)

Cc: Rob Clark 
Cc: Abhinav Kumar 
Cc: Dmitry Baryshkov 
Cc: Danilo Krummrich 
Cc: Alex Deucher 
Cc: Christian König 
Cc: linux-arm-...@vger.kernel.org
Cc: freedr...@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org

base-commit: e4d983ac270ccee417445a69b9ed198658b1
prerequisite-patch-id: d0fec7c91768937b5e22ce9508017e5b9d462000
-- 
2.43.0



[PATCH] drm/sched: Fix bounds limiting when given a malformed entity

2023-11-23 Thread Luben Tuikov
If we're given a malformed entity in drm_sched_entity_init()--shouldn't
happen, but we verify--with an out-of-bounds priority value, we set it to an
allowed value. Fix the expression which sets this limit.
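
To see why the new expression is the right clamp, here is some illustrative
arithmetic (not part of the patch), assuming DRM_SCHED_PRIORITY_MIN == 0:

	num_rqs == 4:  old: max_t(u32, 4, 0) == 4   /* out of bounds for sched_rq[0..3] */
	               new: max_t(s32, 3, 0) == 3   /* last valid index */
	num_rqs == 0:  new: max_t(s32, -1, 0) == 0  /* fall back to MIN */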

Signed-off-by: Luben Tuikov 
Fixes: 56e449603f0ac5 ("drm/sched: Convert the GPU scheduler to variable number 
of run-queues")
---
 drivers/gpu/drm/scheduler/sched_entity.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 4d42b1e4daa67f..20c9c561843ce1 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -84,9 +84,12 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
/* The "priority" of an entity cannot exceed the number
 * of run-queues of a scheduler.
 */
-   if (entity->priority >= sched_list[0]->num_rqs)
-   entity->priority = max_t(u32, sched_list[0]->num_rqs,
-DRM_SCHED_PRIORITY_MIN);
+   if (entity->priority >= sched_list[0]->num_rqs) {
+   drm_err(sched_list[0], "entity with out-of-bounds 
priority:%u num_rqs:%u\n",
+   entity->priority, sched_list[0]->num_rqs);
+   entity->priority = max_t(s32, (s32) 
sched_list[0]->num_rqs - 1,
+(s32) DRM_SCHED_PRIORITY_MIN);
+   }
entity->rq = sched_list[0]->sched_rq[entity->priority];
}
 

base-commit: b3c5a7de9aeb51cb19160f3f61343ed87487abde
-- 
2.43.0



Re: Radeon regression in 6.6 kernel

2023-11-22 Thread Luben Tuikov
On 2023-11-21 17:05, Phillip Susi wrote:
> Alex Deucher  writes:
> 
>> Does reverting 56e449603f0ac580700621a356d35d5716a62ce5 alone fix it?
>> Can you also attach your full dmesg log for the failed suspend?
> 
> No, it doesn't.  Here is the full syslog from the boot with only that
> revert:
> 

Thank you Phillip for verifying this.

BTW, luben.tui...@amd.com should absolutely bounce for everyone sending emails 
to it. Not sure why it is still active.
My new email is the one this email is coming from.
-- 
Regards,
Luben




Re: drm scheduler redesign causes deadlocks [extended repost]

2023-11-22 Thread Luben Tuikov
On 2023-11-21 04:00, Bert Karwatzki wrote:
> Since linux-next-20231115 my linux system (debian sid on msi alpha 15 laptop)
> suffers from random deadlocks which can occur after 30 - 180 min of usage. 
> These
> deadlocks can be actively provoked by creating high system load (usually by
> compiling a kernel with make -j NRCPUS) and then opening instances of 
> libreoffice
> --writer until the system GUI locks (the mouse cursor can still be moved but 
> the
> screen is frozen). In this state ssh'ing into the machine is still possible 
> and
> at least sometimes log messages about hung tasks appear in /var/log/kern.log.
> 
> More info can be found here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> 
> Using the method described to trigger the bug I bisected the problem in the
> linux-next and drm-misc trees to give commit f3123c2590005 as the problem.
> As this simple patch fixes the problem
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..25b97db1b623 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>   struct drm_sched_entity *entity)
>  {
> -   if (drm_sched_entity_is_ready(entity))
> -   if (drm_sched_can_queue(sched, entity))
> -   drm_sched_run_job_queue(sched);
> +   if (drm_sched_can_queue(sched, entity))
> +   drm_sched_run_job_queue(sched);
>  }
>  
>  /**
> 
> the problem might be in the entity->dependency branch of drm_sched_entity_is_ready
> (some kind of circular dependencies ...).
> 
> To see if the change to drm_sched_wakeup is the actual cause of the problem or
> if this problem has been caused by the redesign of the drm scheduler in linux
> next-20231115+, I created the following patch for linux-6.6.0:
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index a42763e1429d..dc2abd299aeb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  container_of(cb, struct drm_sched_entity, cb);
> 
>  drm_sched_entity_clear_dep(f, cb);
> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
> 
>  /**
> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job
> *sched_job)
>  if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  drm_sched_rq_update_fifo(entity, submit_ts);
> 
> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a3a622fc672..bbe06403b33d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler
> *sched)
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
> drm_sched_entity *entity)
>  {
> - if (drm_sched_can_queue(sched))
> - wake_up_interruptible(&sched->wake_up_worker);
> + if (drm_sched_entity_is_ready(entity))
> + if (drm_sched_can_queue(sched))
> + wake_up_interruptible(&sched->wake_up_worker);
>  }
> 
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index ac65f0626cfc..6cfe3d193e69 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity
> *entity,
> unsigned int num_sched_list);
> 
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
> drm_sched_entity *entity);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job
> *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> 
> This brings the extra check to the old scheduler and has so far not caused any
> trouble (using the same stress test described above), so chances are that the
> error is somewhere else in the redesigned scheduler.
> 
> 
> Bert Karwatzki

Hi Bert,

Thanks for looking into this.

As an afterthought, removing the "entity_is_ready()" qualifier in wake-up makes
the scheduling more opportunistic, and I agree that that is the more correct
approach.
Commit f3123c2590005, basically made the code as close to the way it 

Re: linux-next: Signed-off-by missing for commit in the drm-misc tree

2023-11-22 Thread Luben Tuikov
On 2023-11-22 07:00, Maxime Ripard wrote:
> Hi Luben,
> 
> On Thu, Nov 16, 2023 at 09:27:58AM +0100, Daniel Vetter wrote:
>> On Thu, Nov 16, 2023 at 09:11:43AM +0100, Maxime Ripard wrote:
>>> On Tue, Nov 14, 2023 at 06:46:21PM -0500, Luben Tuikov wrote:
>>>> On 2023-11-13 22:08, Stephen Rothwell wrote:
>>>>> BTW, cherry picking commits does not avoid conflicts - in fact it can
>>>>> cause conflicts if there are further changes to the files affected by
>>>>> the cherry picked commit in either the tree/branch the commit was
>>>>> cheery picked from or the destination tree/branch (I have to deal with
>>>>> these all the time when merging the drm trees in linux-next).  Much
>>>>> better is to cross merge the branches so that the patch only appears
>>>>> once or have a shared branches that are merged by any other branch that
>>>>> needs the changes.
>>>>>
>>>>> I understand that things are not done like this in the drm trees :-(
>>>>
>>>> Hi Stephen,
>>>>
>>>> Thank you for the clarification--understood. I'll be more careful in the 
>>>> future.
>>>> Thanks again! :-)
>>>
>>> In this case, the best thing to do would indeed have been to ask the
>>> drm-misc maintainers to merge drm-misc-fixes into drm-misc-next.
>>>
>>> We're doing that all the time, but we're not ubiquitous so you need to
>>> ask us :)
>>>
>>> Also, dim should have caught that when you pushed the branch. Did you
>>> use it?
>>
>> Yeah dim must be used, exactly to avoid these issues. Both for applying
>> patches (so not git am directly, or cherry-picking from your own
>> development branch), and for pushing. The latter is even checked for by
>> the server (dim sets a special push flag which is very long and contains a
>> very clear warning if you bypass it).
>>
>> If dim was used, this would be a bug in the dim script that we need to
>> fix.
> 
> It would be very useful for you to explain what happened here so we
> improve the tooling or doc and can try to make sure it doesn't happen
> again
> 
> Maxime

There is no problem with the tooling--I just forced the commit in.
-- 
Regards,
Luben




Re: [PATCH 1/2] drm/scheduler: improve GPU scheduler documentation v2

2023-11-17 Thread Luben Tuikov
Hi,

On 2023-11-16 09:15, Christian König wrote:
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.
> 
> v2: Some improvements suggested by Danilo, add section about error
> handling.
> 
> Signed-off-by: Christian König 
> ---
>  Documentation/gpu/drm-mm.rst   |  36 +
>  drivers/gpu/drm/scheduler/sched_main.c | 174 +
>  2 files changed, 188 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index acc5901ac840..112463fa9f3a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,12 +552,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>  
> +Job Object
> +--
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +-
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +--
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +---
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  
>  
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Flow Control
>  
> +Error and Timeout handling
> +--
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..026123497b0e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,122 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the 
> entities
> - * from the run queue using a FIFO. The scheduler provides dependency 
> handling
> - * features among jobs. The driver is supposed to provide callback functions 
> for
> - * backend operations to the scheduler like submitting a job to hardware run 
> queue,
> - * returning the dependencies of a job etc.
> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *(e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *the hardware.
> - *
> - * The jobs in a entity are always scheduled in the order that they were 
> pushed.
> - *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced 
> anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command submission
> + * to push next to the hardware. Another major use case of the GPU scheduler

You can't start a 2nd sentence with "Another major use ...", not unless you'd 
been
discussing the other major use for at least a few paragraphs, and certainly not
after just one sentence.

Get rid of "some" in "some logic", and just say "implements logic to ..."

Then 2nd sentence should say, "The GPU scheduler also enforces correct ..."

> + * is to enforce correct driver behavior around those command submissions.
> + * Because of this it's also used by drivers which don't need the actual
> + * scheduling functionality.
> + *
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This 
> especially

"deadlock-free"

Link to "DMA-fence signaling rules" would be nice to have. Can't mention them,
and not provide a link. Naturally someone reading this would immediately ask 
themselves,
"What are the ``DMA-fence signaling rules''?", and if they don't need to ask 
themselves
this, then they probably mostly know all of this here too.

> + * means that for normal operation no memory can be allocated in a callback.

What callback? Perhaps say, "callback into the driver", or name it/them,
as they're in the code.

> + * All memory which is needed for pushing the job to the hardware must be

"pushing _a_ job"

> + * allocated before arming a 

[PATCH] drm/print: Handle NULL drm device in __drm_printk()

2023-11-16 Thread Luben Tuikov
drm_{err,warn,...}() use __drm_printk(), which takes a drm device pointer and
uses the embedded device pointer to print the device. This facility handles a
NULL device pointer, but not a NULL drm device pointer. This patch makes
__drm_printk() also handle a NULL drm device pointer. The printed output is
identical to what it would be if drm->dev had been NULL.

Signed-off-by: Luben Tuikov 
---
 include/drm/drm_print.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
index a93a387f8a1a15..dd4883df876a6d 100644
--- a/include/drm/drm_print.h
+++ b/include/drm/drm_print.h
@@ -453,7 +453,7 @@ void __drm_dev_dbg(struct _ddebug *desc, const struct 
device *dev,
 
 /* Helper for struct drm_device based logging. */
 #define __drm_printk(drm, level, type, fmt, ...)   \
-   dev_##level##type((drm)->dev, "[drm] " fmt, ##__VA_ARGS__)
+   dev_##level##type((drm) ? (drm)->dev : NULL, "[drm] " fmt, 
##__VA_ARGS__)
 
 
 #define drm_info(drm, fmt, ...)\
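
With this in place, a device-less call site can do, for instance (an
illustrative, hypothetical call, not part of the patch):

	drm_err(NULL, "*ERROR* failed to initialize run-queues\n");

and get the same "(NULL device *): [drm] ..." style output as when drm->dev is
NULL, instead of dereferencing a NULL drm pointer.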

base-commit: 3b434a3445fff3149128db0169da864d67057325
-- 
2.42.1



Re: linux-next: Signed-off-by missing for commit in the drm-misc tree

2023-11-16 Thread Luben Tuikov
On 2023-11-16 04:22, Maxime Ripard wrote:
> Hi,
> 
> On Mon, Nov 13, 2023 at 09:56:32PM -0500, Luben Tuikov wrote:
>> On 2023-11-13 21:45, Stephen Rothwell wrote:
>>> Hi Luben,
>>>
>>> On Mon, 13 Nov 2023 20:32:40 -0500 Luben Tuikov  wrote:
>>>>
>>>> On 2023-11-13 20:08, Luben Tuikov wrote:
>>>>> On 2023-11-13 15:55, Stephen Rothwell wrote:  
>>>>>> Hi all,
>>>>>>
>>>>>> Commit
>>>>>>
>>>>>>   0da611a87021 ("dma-buf: add dma_fence_timestamp helper")
>>>>>>
>>>>>> is missing a Signed-off-by from its committer.
>>>>>>  
>>>>>
>>>>> In order to merge the scheduler changes necessary for the Xe driver, 
>>>>> those changes
>>>>> were based on drm-tip, which included this change from drm-misc-fixes, 
>>>>> but which
>>>>> wasn't present in drm-misc-next.
>>>>>
>>>>> I didn't want to create a merge conflict between drm-misc-next and 
>>>>> drm-misc-fixes,
>>>>> when pulling that change from drm-misc-next to drm-misc-fixes, so that I 
>>>>> can apply  
>>>>
>>>> ... when pulling that change from drm-misc-fixes into drm-misc-next, 
>>>> so that I can apply...
>>>>
>>>>> the Xe scheduler changes on top of drm-misc-next.  
>>>>
>>>> The change in drm-misc-fixes is b83ce9cb4a465b. The latter is contained
>>>> in linus-master, and in drm-misc-fixes, while the former is in 
>>>> drm-misc-next.
>>>> When we merge linus-master/drm-misc-fixes into drm-misc-next, or whichever 
>>>> way
>>>> it happens, I'd like to avoid a merge conflict, but wanted to expedite the 
>>>> changes
>>>> for Xe.
>>>
>>> None of that is relevant ... if you commit a patch to a tree that will
>>> be in the linux kernel tree, you must add your Signed-off-by to the commit.
>>
>> Noted!
>>
>> So I always do this when I do git-am and such, but wasn't sure for this one 
>> single cherry-pick whose
>> original author was the committer in drm-misc-fixes, but will add my 
>> Signed-off-by in those
>> rare circumstances.
>>
>> Thanks for the clarification!
> 
> In order to move forward with this, can you provide your SoB here for
> that patch so that we can at least point to it in the drm-misc-next PR?
> 
> Maxime

Signed-off-by: Luben Tuikov 

-- 
Regards,
Luben




Re: [PATCH] drm/sched: Define pr_fmt() for DRM using pr_*()

2023-11-16 Thread Luben Tuikov
On 2023-11-15 03:24, Jani Nikula wrote:
> On Tue, 14 Nov 2023, Luben Tuikov  wrote:
>> diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
>> index a93a387f8a1a15..ce784118e4f762 100644
>> --- a/include/drm/drm_print.h
>> +++ b/include/drm/drm_print.h
>> @@ -453,7 +453,7 @@ void __drm_dev_dbg(struct _ddebug *desc, const struct 
>> device *dev,
>>  
>>  /* Helper for struct drm_device based logging. */
>>  #define __drm_printk(drm, level, type, fmt, ...)   \
>> -   dev_##level##type((drm)->dev, "[drm] " fmt, ##__VA_ARGS__)
>> +   dev_##level##type(drm ? (drm)->dev : NULL, "[drm] " fmt, 
>> ##__VA_ARGS__)
> 
> I think that would be an improvement that stands on its own merits.
> 
> Please also wrap the first drm in parens (drm).

Okay.

> 
>> The output would be similar to that if drm->dev were NULL.
> 
> Yes. I don't know how people will feel about intentionally using
> drm_err(NULL, ...) all over the place, but that's another matter. ;)

:-)

-- 
Regards,
Luben




Re: [PATCH] drm/scheduler: improve GPU scheduler documentation

2023-11-16 Thread Luben Tuikov
On 2023-11-13 07:38, Christian König wrote:
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.

Thanks Christian for doing this--much needed.

> 
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 126 -
>  1 file changed, 104 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 506371c42745..36a7c5dc852d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,110 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the 
> entities
> - * from the run queue using a FIFO. The scheduler provides dependency 
> handling
> - * features among jobs. The driver is supposed to provide callback functions 
> for
> - * backend operations to the scheduler like submitting a job to hardware run 
> queue,
> - * returning the dependencies of a job etc.

So, I don't mind this paragraph, as it provides an overview of the relationship 
between
a DRM GPU scheduler, entities, run-queues, and jobs.

> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *(e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *the hardware.

This is also good, and shouldn't have been deleted.

> - *
> - * The jobs in a entity are always scheduled in the order that they were 
> pushed.

I'd have said here "jobs within an entity". Again shouldn't have been deleted.
This is good overall/overview information.

> - *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced 
> anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command submission
> + * to push next to the hardware. Another major use case for the GPU scheduler
> + * is to enforce correct driver behavior around those command submission.

The GPU scheduler also enforces correct driver behaviour around those command 
submissions.

> + * Because of this it's also used by drivers which don't need the actual
> + * scheduling functionality.

... but need to push jobs into their firmware/hardware and maintain/keep correct
DRM dependencies in the form of "fences".

> + *
> + * To fulfill this task the GPU scheduler uses of the following objects:
> + *
> + * 1. The job object which contains a bunch of dependencies in the form of

Drop "which".

Instead of listing what it contains, how it is used, what it does, explain
what it is: work/task to be executed by the GPU. _Then_ you can start listing
what it contains, how it is used, what it does.

> + *DMA-fence objects. Drivers can also implement an optional prepare_job
> + *callback which returns additional dependencies as DMA-fence objects.

"can also"? This would usually follow if the other callbacks/etc., have been 
described
and they haven't, so I'd say "Drivers implement an optional prepare_job 
callback,..."

> + *It's important to note that this callback must follow the DMA-fence 
> rules,
> + *so it can't easily allocate memory or grab locks under which memory is
> + *allocated. Drivers should use this as base class for an object which
> + *contains the necessary state to push the command submission to the
> + *hardware.
> + *
> + *The lifetime of the job object should at least be from pushing it into 
> the
> + *scheduler until the scheduler notes through the free callback that a 
> job
> + *isn't needed any more. Drivers can of course keep their job object 
> alive
> + *longer than that, but that's outside of the scope of the scheduler
> + *component.

[New paragraph starts describing the job initialization.]

Add:  Job initialization is split into two parts,
> + *drm_sched_job_init() and drm_sched_job_arm().

Perhaps we should mention briefly what each one does..?

Add:  It's important to note that
> + *after arming a job drivers must follow the DMA-fence rules and can't
> + *easily allocate memory or takes locks under which memory is allocated.
> + *
> + * 2. The entity object which is a container for jobs which should execute

Drop "which". "The entity object is a container of ..."

> + *sequentially. Drivers should create an entity for each individual 
> context
> + *they maintain for command 

Re: [PATCH] drm/sched: Define pr_fmt() for DRM using pr_*()

2023-11-14 Thread Luben Tuikov
On 2023-11-14 07:20, Jani Nikula wrote:
> On Mon, 13 Nov 2023, Luben Tuikov  wrote:
>> Hi Jani,
>>
>> On 2023-11-10 07:40, Jani Nikula wrote:
>>> On Thu, 09 Nov 2023, Luben Tuikov  wrote:
>>>> Define pr_fmt() as "[drm] " for DRM code using pr_*() facilities, 
>>>> especially
>>>> when no devices are available. This makes it easier to browse kernel logs.
>>>
>>> Please do not merge patches before people have actually had a chance to
>>> look at them. This was merged *way* too quickly.
>>>
>>> This does not do what you think it does, and it's not robust enough.
>>>
>>> The drm_print.[ch] facilities use very few pr_*() calls directly. The
>>> users of pr_*() calls do not necessarily include  at
>>> all, and really don't have to.
>>>
>>> Even the ones that do include it, usually have  includes
>>> first, and  includes next. Notably,  includes
>>> .
>>>
>>> And, of course,  defines pr_fmt() itself if not already
>>> defined.
>>>
>>>> Signed-off-by: Luben Tuikov 
>>>> ---
>>>>  include/drm/drm_print.h | 14 ++
>>>>  1 file changed, 14 insertions(+)
>>>>
>>>> diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
>>>> index a93a387f8a1a15..e8fe60d0eb8783 100644
>>>> --- a/include/drm/drm_print.h
>>>> +++ b/include/drm/drm_print.h
>>>> @@ -26,6 +26,20 @@
>>>>  #ifndef DRM_PRINT_H_
>>>>  #define DRM_PRINT_H_
>>>>  
>>>> +/* Define this before including linux/printk.h, so that the format
>>>> + * string in pr_*() macros is correctly set for DRM. If a file wants
>>>> + * to define this to something else, it should do so before including
>>>> + * this header file.
>>>
>>> The only way this would work is by including  as the
>>> very first header, and that's fragile at best.
>>>
>>>> + *
>>>> + * It is encouraged code using pr_err() to prefix their format with
>>>> + * the string "*ERROR* ", to make it easier to scan kernel logs. For
>>>> + * instance,
>>>> + *   pr_err("*ERROR* ", args).
>>>
>>> No, it's encouraged not to use pr_*() at all, and prefer drm device
>>> based logging, or device based logging.
>>>
>>> I'd rather this whole thing was just reverted.
>>
>> The revert has been pushed--thanks for R-B-ing it.
>>
>> FWIW, I wanted a device-less DRM print, with a prefix "[drm] *ERROR* ",
>> because this is what we scan for, especially when we get a blank screen at 
>> boot/modprobe.
>> There are a few cases in DRM where, when we return -E..., it's most likely a 
>> blank screen result,
>> as was the case with a recent debug session I had with amdgpu when pushing 
>> the variable sched->rq.
>>
>> So then I went by this, in linux/printk.h:
>>
>> /**
>>  * pr_fmt - used by the pr_*() macros to generate the printk format string
>>  * @fmt: format string passed from a pr_*() macro
>>  *
>>  * This macro can be used to generate a unified format string for pr_*()
>>  * macros. A common use is to prefix all pr_*() messages in a file with a 
>> common
>>  * string. For example, defining this at the top of a source file:
>>  *
>>  *#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>>  *
>>  * would prefix all pr_info, pr_emerg... messages in the file with the module
>>  * name.
>>  */
>> #ifndef pr_fmt
>> #define pr_fmt(fmt) fmt
>> #endif
>>
>> Any suggestions as to a device-less DRM print with prefix "[drm] *ERROR* "?
> 
> I don't think there's a way to do that using pr_fmt for an entire driver
> or subsystem. That really only works for individual compilation units.
> 
> We have DRM_ERROR() which does the trick, but the documentation says
> it's been deprecated in favor pr_err()... though I think drm_err()
> should be preferred over pr_err() where possible.

Yes, that's what made me use pr_err() and the pr_fmt() modification...

> 
> Maybe we should extend 7911902129a8 ("drm/print: Handle potentially NULL
> drm_devices in drm_dbg_*") to __drm_printk() and handle NULL drm device
> gracefully.

Yeah, that actually would work.

> 
> With just "(drm) ? (drm)->dev : NULL" the output will have "(NULL device
> *)" which works but is a bit meh, but maybe something like this is
> possible (untested):

So, I don't mind 

Re: linux-next: Signed-off-by missing for commit in the drm-misc tree

2023-11-14 Thread Luben Tuikov
On 2023-11-13 22:08, Stephen Rothwell wrote:
> Hi Luben,
> 
> BTW, cherry picking commits does not avoid conflicts - in fact it can
> cause conflicts if there are further changes to the files affected by
> the cherry picked commit in either the tree/branch the commit was
> cherry picked from or the destination tree/branch (I have to deal with
> these all the time when merging the drm trees in linux-next).  Much
> better is to cross merge the branches so that the patch only appears
> once or have a shared branches that are merged by any other branch that
> needs the changes.
> 
> I understand that things are not done like this in the drm trees :-(

Hi Stephen,

Thank you for the clarification--understood. I'll be more careful in the future.
Thanks again! :-)
-- 
Regards,
Luben




Re: linux-next: Signed-off-by missing for commit in the drm-misc tree

2023-11-13 Thread Luben Tuikov
On 2023-11-13 21:45, Stephen Rothwell wrote:
> Hi Luben,
> 
> On Mon, 13 Nov 2023 20:32:40 -0500 Luben Tuikov  wrote:
>>
>> On 2023-11-13 20:08, Luben Tuikov wrote:
>>> On 2023-11-13 15:55, Stephen Rothwell wrote:  
>>>> Hi all,
>>>>
>>>> Commit
>>>>
>>>>   0da611a87021 ("dma-buf: add dma_fence_timestamp helper")
>>>>
>>>> is missing a Signed-off-by from its committer.
>>>>  
>>>
>>> In order to merge the scheduler changes necessary for the Xe driver, those 
>>> changes
>>> were based on drm-tip, which included this change from drm-misc-fixes, but 
>>> which
>>> wasn't present in drm-misc-next.
>>>
>>> I didn't want to create a merge conflict between drm-misc-next and 
>>> drm-misc-fixes,
>>> when pulling that change from drm-misc-next to drm-misc-fixes, so that I 
>>> can apply  
>>
>> ... when pulling that change from drm-misc-fixes into drm-misc-next, so 
>> that I can apply...
>>
>>> the Xe scheduler changes on top of drm-misc-next.  
>>
>> The change in drm-misc-fixes is b83ce9cb4a465b. The latter is contained
>> in linus-master, and in drm-misc-fixes, while the former is in drm-misc-next.
>> When we merge linus-master/drm-misc-fixes into drm-misc-next, or whichever 
>> way
>> it happens, I'd like to avoid a merge conflict, but wanted to expedite the 
>> changes
>> for Xe.
> 
> None of that is relevant ... if you commit a patch to a tree that will
> be in the linux kernel tree, you must add your Signed-off-by to the commit.

Hi Stephen,

Noted!

So I always do this when I do git-am and such, but wasn't sure for this one 
single cherry-pick whose
original author was the committer in drm-misc-fixes, but will add my 
Signed-off-by in those
rare circumstances.

Thanks for the clarification!
-- 
Regards,
Luben




Re: linux-next: Signed-off-by missing for commit in the drm-misc tree

2023-11-13 Thread Luben Tuikov
On 2023-11-13 20:08, Luben Tuikov wrote:
> On 2023-11-13 15:55, Stephen Rothwell wrote:
>> Hi all,
>>
>> Commit
>>
>>   0da611a87021 ("dma-buf: add dma_fence_timestamp helper")
>>
>> is missing a Signed-off-by from its committer.
>>
> 
> In order to merge the scheduler changes necessary for the Xe driver, those 
> changes
> were based on drm-tip, which included this change from drm-misc-fixes, but 
> which
> wasn't present in drm-misc-next.
> 
> I didn't want to create a merge conflict between drm-misc-next and 
> drm-misc-fixes,
> when pulling that change from drm-misc-next to drm-misc-fixes, so that I can 
> apply

... when pulling that change from drm-misc-fixes into drm-misc-next, so 
that I can apply...

> the Xe scheduler changes on top of drm-misc-next.

The change in drm-misc-fixes is b83ce9cb4a465b. The latter is contained
in linus-master, and in drm-misc-fixes, while the former is in drm-misc-next.
When we merge linus-master/drm-misc-fixes into drm-misc-next, or whichever way
it happens, I'd like to avoid a merge conflict, but wanted to expedite the 
changes
for Xe.
-- 
Regards,
Luben




Re: linux-next: Signed-off-by missing for commit in the drm-misc tree

2023-11-13 Thread Luben Tuikov
On 2023-11-13 15:55, Stephen Rothwell wrote:
> Hi all,
> 
> Commit
> 
>   0da611a87021 ("dma-buf: add dma_fence_timestamp helper")
> 
> is missing a Signed-off-by from its committer.
> 

In order to merge the scheduler changes necessary for the Xe driver, those 
changes
were based on drm-tip, which included this change from drm-misc-fixes, but which
wasn't present in drm-misc-next.

I didn't want to create a merge conflict between drm-misc-next and 
drm-misc-fixes,
when pulling that change from drm-misc-next to drm-misc-fixes, so that I can 
apply
the Xe scheduler changes on top of drm-misc-next.
-- 
Regards,
Luben




Re: [PATCH] drm/sched: Define pr_fmt() for DRM using pr_*()

2023-11-13 Thread Luben Tuikov
Hi Jani,

On 2023-11-10 07:40, Jani Nikula wrote:
> On Thu, 09 Nov 2023, Luben Tuikov  wrote:
>> Define pr_fmt() as "[drm] " for DRM code using pr_*() facilities, especially
>> when no devices are available. This makes it easier to browse kernel logs.
> 
> Please do not merge patches before people have actually had a chance to
> look at them. This was merged *way* too quickly.
> 
> This does not do what you think it does, and it's not robust enough.
> 
> The drm_print.[ch] facilities use very few pr_*() calls directly. The
> users of pr_*() calls do not necessarily include <drm/drm_print.h> at
> all, and really don't have to.
> 
> Even the ones that do include it, usually have <linux/...> includes
> first, and <drm/...> includes next. Notably, <linux/kernel.h> includes
> <linux/printk.h>.
> 
> And, of course, <linux/printk.h> defines pr_fmt() itself if not already
> defined.
> 
>> Signed-off-by: Luben Tuikov 
>> ---
>>  include/drm/drm_print.h | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
>> index a93a387f8a1a15..e8fe60d0eb8783 100644
>> --- a/include/drm/drm_print.h
>> +++ b/include/drm/drm_print.h
>> @@ -26,6 +26,20 @@
>>  #ifndef DRM_PRINT_H_
>>  #define DRM_PRINT_H_
>>  
>> +/* Define this before including linux/printk.h, so that the format
>> + * string in pr_*() macros is correctly set for DRM. If a file wants
>> + * to define this to something else, it should do so before including
>> + * this header file.
> 
> The only way this would work is by including <drm/drm_print.h> as the
> very first header, and that's fragile at best.
> 
>> + *
>> + * It is encouraged code using pr_err() to prefix their format with
>> + * the string "*ERROR* ", to make it easier to scan kernel logs. For
>> + * instance,
>> + *   pr_err("*ERROR* ", args).
> 
> No, it's encouraged not to use pr_*() at all, and prefer drm device
> based logging, or device based logging.
> 
> I'd rather this whole thing was just reverted.

The revert has been pushed--thanks for R-B-ing it.

FWIW, I wanted a device-less DRM print, with a prefix "[drm] *ERROR* ",
because this is what we scan for, especially when we get a blank screen at
boot/modprobe. There are a few cases in DRM where, when we return -E..., the
most likely result is a blank screen, as was the case with a recent debug I
had with amdgpu when pushing the variable sched->rq.

So then I went by this, in linux/printk.h:

/**
 * pr_fmt - used by the pr_*() macros to generate the printk format string
 * @fmt: format string passed from a pr_*() macro
 *
 * This macro can be used to generate a unified format string for pr_*()
 * macros. A common use is to prefix all pr_*() messages in a file with a common
 * string. For example, defining this at the top of a source file:
 *
 *#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 *
 * would prefix all pr_info, pr_emerg... messages in the file with the module
 * name.
 */
#ifndef pr_fmt
#define pr_fmt(fmt) fmt
#endif

Any suggestions as to a device-less DRM print with prefix "[drm] *ERROR* "?
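
For illustration, here is a minimal sketch of the kind of device-less print
I mean, going by the printk.h comment above. The check_rq() helper is
hypothetical, only to show the resulting log format, and note that the
define would prefix every pr_*() call in the file, not just pr_err():

	/* Must come before any include that pulls in <linux/printk.h>. */
	#define pr_fmt(fmt) "[drm] *ERROR* " fmt

	#include <linux/printk.h>

	/* Hypothetical helper, only to show the output format. */
	static int check_rq(const void *rq)
	{
		if (!rq) {
			/* Logs: "[drm] *ERROR* entity has no rq!" */
			pr_err("entity has no rq!\n");
			return -ENOENT;
		}
		return 0;
	}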
-- 
Regards,
Luben


Re: [PATCH] Revert "drm/sched: Define pr_fmt() for DRM using pr_*()"

2023-11-13 Thread Luben Tuikov
On 2023-11-11 06:33, Jani Nikula wrote:
> On Sat, 11 Nov 2023, Luben Tuikov  wrote:
>> From Jani:
>> The drm_print.[ch] facilities use very few pr_*() calls directly. The
>> users of pr_*() calls do not necessarily include <drm/drm_print.h> at
>> all, and really don't have to.
>>
>> Even the ones that do include it, usually have <linux/...> includes
>> first, and <drm/...> includes next. Notably, <linux/kernel.h> includes
>> <linux/printk.h>.
>>
>> And, of course, <linux/printk.h> defines pr_fmt() itself if not already
>> defined.
>>
>> No, it's encouraged not to use pr_*() at all, and prefer drm device
>> based logging, or device based logging.
>>
>> This reverts commit 36245bd02e88e68ac5955c2958c968879d7b75a9.
>>
>> Signed-off-by: Luben Tuikov 
>> Link: https://patchwork.freedesktop.org/patch/msgid/878r75wzm9@intel.com
> 
> Reviewed-by: Jani Nikula 
> 
> 
>> ---
>>  include/drm/drm_print.h | 14 --
>>  1 file changed, 14 deletions(-)
>>
>> diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
>> index e8fe60d0eb8783..a93a387f8a1a15 100644
>> --- a/include/drm/drm_print.h
>> +++ b/include/drm/drm_print.h
>> @@ -26,20 +26,6 @@
>>  #ifndef DRM_PRINT_H_
>>  #define DRM_PRINT_H_
>>  
>> -/* Define this before including linux/printk.h, so that the format
>> - * string in pr_*() macros is correctly set for DRM. If a file wants
>> - * to define this to something else, it should do so before including
>> - * this header file.
>> - *
>> - * It is encouraged code using pr_err() to prefix their format with
>> - * the string "*ERROR* ", to make it easier to scan kernel logs. For
>> - * instance,
>> - *   pr_err("*ERROR* ", args).
>> - */
>> -#ifndef pr_fmt
>> -#define pr_fmt(fmt) "[drm] " fmt
>> -#endif
>> -
>>  #include <linux/compiler.h>
>>  #include <linux/printk.h>
>>  #include <linux/seq_file.h>
>>
>> base-commit: 540527b1385fb203cc4513ca838b4de60bbbc49a
> 

Pushed.
-- 
Regards,
Luben


[PATCH] Revert "drm/sched: Define pr_fmt() for DRM using pr_*()"

2023-11-11 Thread Luben Tuikov
From Jani:
The drm_print.[ch] facilities use very few pr_*() calls directly. The
users of pr_*() calls do not necessarily include <drm/drm_print.h> at
all, and really don't have to.

Even the ones that do include it, usually have <linux/...> includes
first, and <drm/...> includes next. Notably, <linux/kernel.h> includes
<linux/printk.h>.

And, of course, <linux/printk.h> defines pr_fmt() itself if not already
defined.

No, it's encouraged not to use pr_*() at all, and prefer drm device
based logging, or device based logging.

This reverts commit 36245bd02e88e68ac5955c2958c968879d7b75a9.

Signed-off-by: Luben Tuikov 
Link: https://patchwork.freedesktop.org/patch/msgid/878r75wzm9@intel.com
---
 include/drm/drm_print.h | 14 --
 1 file changed, 14 deletions(-)

diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
index e8fe60d0eb8783..a93a387f8a1a15 100644
--- a/include/drm/drm_print.h
+++ b/include/drm/drm_print.h
@@ -26,20 +26,6 @@
 #ifndef DRM_PRINT_H_
 #define DRM_PRINT_H_
 
-/* Define this before including linux/printk.h, so that the format
- * string in pr_*() macros is correctly set for DRM. If a file wants
- * to define this to something else, it should do so before including
- * this header file.
- *
- * It is encouraged code using pr_err() to prefix their format with
- * the string "*ERROR* ", to make it easier to scan kernel logs. For
- * instance,
- *   pr_err("*ERROR* ", args).
- */
-#ifndef pr_fmt
-#define pr_fmt(fmt) "[drm] " fmt
-#endif
-
 #include <linux/compiler.h>
 #include <linux/printk.h>
 #include <linux/seq_file.h>

base-commit: 540527b1385fb203cc4513ca838b4de60bbbc49a
-- 
2.42.1



[PATCH] Revert "drm/sched: Define pr_fmt() for DRM using pr_*()"

2023-11-10 Thread Luben Tuikov
This reverts commit 36245bd02e88e68ac5955c2958c968879d7b75a9.

Signed-off-by: Luben Tuikov 
---
 include/drm/drm_print.h | 14 --
 1 file changed, 14 deletions(-)

diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
index e8fe60d0eb8783..a93a387f8a1a15 100644
--- a/include/drm/drm_print.h
+++ b/include/drm/drm_print.h
@@ -26,20 +26,6 @@
 #ifndef DRM_PRINT_H_
 #define DRM_PRINT_H_
 
-/* Define this before including linux/printk.h, so that the format
- * string in pr_*() macros is correctly set for DRM. If a file wants
- * to define this to something else, it should do so before including
- * this header file.
- *
- * It is encouraged code using pr_err() to prefix their format with
- * the string "*ERROR* ", to make it easier to scan kernel logs. For
- * instance,
- *   pr_err("*ERROR* ", args).
- */
-#ifndef pr_fmt
-#define pr_fmt(fmt) "[drm] " fmt
-#endif
-
 #include <linux/compiler.h>
 #include <linux/printk.h>
 #include <linux/seq_file.h>

base-commit: 540527b1385fb203cc4513ca838b4de60bbbc49a
-- 
2.42.1



Re: [PATCH] drm/sched: Define pr_fmt() for DRM using pr_*()

2023-11-10 Thread Luben Tuikov
On 2023-11-10 07:40, Jani Nikula wrote:
> On Thu, 09 Nov 2023, Luben Tuikov  wrote:
>> Define pr_fmt() as "[drm] " for DRM code using pr_*() facilities, especially
>> when no devices are available. This makes it easier to browse kernel logs.
> 
> Please do not merge patches before people have actually had a chance to
> look at them. This was merged *way* too quickly.

Agreed.

> 
> This does not do what you think it does, and it's not robust enough.
> 
> The drm_print.[ch] facilities use very few pr_*() calls directly. The
> users of pr_*() calls do not necessarily include <drm/drm_print.h> at
> all, and really don't have to.
> 
> Even the ones that do include it, usually have <linux/...> includes
> first, and <drm/...> includes next. Notably, <linux/kernel.h> includes
> <linux/printk.h>.
> 
> And, of course, <linux/printk.h> defines pr_fmt() itself if not already
> defined.
> 
>> Signed-off-by: Luben Tuikov 
>> ---
>>  include/drm/drm_print.h | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
>> index a93a387f8a1a15..e8fe60d0eb8783 100644
>> --- a/include/drm/drm_print.h
>> +++ b/include/drm/drm_print.h
>> @@ -26,6 +26,20 @@
>>  #ifndef DRM_PRINT_H_
>>  #define DRM_PRINT_H_
>>  
>> +/* Define this before including linux/printk.h, so that the format
>> + * string in pr_*() macros is correctly set for DRM. If a file wants
>> + * to define this to something else, it should do so before including
>> + * this header file.
> 
> The only way this would work is by including <drm/drm_print.h> as the
> very first header, and that's fragile at best.
> 
>> + *
>> + * It is encouraged code using pr_err() to prefix their format with
>> + * the string "*ERROR* ", to make it easier to scan kernel logs. For
>> + * instance,
>> + *   pr_err("*ERROR* ", args).
> 
> No, it's encouraged not to use pr_*() at all, and prefer drm device
> based logging, or device based logging.
> 
> I'd rather this whole thing was just reverted.

Agreed.

Do I have your R-B for a revert?
-- 
Regards,
Luben


Re: [PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control

2023-11-09 Thread Luben Tuikov
On 2023-11-09 19:57, Luben Tuikov wrote:
> On 2023-11-09 19:16, Danilo Krummrich wrote:
[snip]
>> @@ -667,6 +771,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>   * drm_sched_job_init - init a scheduler job
>>   * @job: scheduler job to init
>>   * @entity: scheduler entity to use
>> + * @credits: the number of credits this job contributes to the schedulers
>> + * credit limit
>>   * @owner: job owner for debugging
>>   *
>>   * Refer to drm_sched_entity_push_job() documentation
>> @@ -684,7 +790,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>   */
>>  int drm_sched_job_init(struct drm_sched_job *job,
>> struct drm_sched_entity *entity,
>> -   void *owner)
>> +   u32 credits, void *owner)
>>  {
>>  if (!entity->rq) {
>>  /* This will most likely be followed by missing frames
>> @@ -695,7 +801,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>  return -ENOENT;
>>  }
>>  
>> +if (unlikely(!credits))
>> +return -EINVAL;
>> +
> 
> This will most likely result in bad user experience (read: blank screen),
> and debugging this would be really hard without something to go by
> in the kernel log.
> 
> (This was exactly the case with amdgpu when 56e449603f0ac5
> ("drm/sched: Convert the GPU scheduler to variable number of run-queues") 
> was being worked on and merged. Without the drm_err() on missing rq in
> the lines immediately before the hunk above returning -ENOENT, there
> was no indication why setting up an fb was failing very early on (blank 
> screen).)
> 
> So it is best to print a "[drm] *ERROR* "-equivalent string in the logs,
> so that we can make a note of this, without relying on drivers, old and
> new, logging that drm_sched_job_init() failed.

If you add _exactly_ this,

	if (unlikely(!credits)) {
		pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
		return -EINVAL;
	}

You can add my,

Reviewed-by: Luben Tuikov 

and push it.
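
For reference, the check would then sit in drm_sched_job_init() roughly
like this--a sketch of the reviewed shape based on the hunk quoted above,
not necessarily the final committed form:

	int drm_sched_job_init(struct drm_sched_job *job,
			       struct drm_sched_entity *entity,
			       u32 credits, void *owner)
	{
		if (!entity->rq) {
			/* ... leave a trail in the logs, as above ... */
			return -ENOENT;
		}

		/* Zero credits can never fit any credit limit; reject loudly. */
		if (unlikely(!credits)) {
			pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
			return -EINVAL;
		}

		/* ... rest of the initialization ... */
		return 0;
	}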
-- 
Regards,
Luben


Re: [PATCH drm-misc-next v6] drm/sched: implement dynamic job-flow control

2023-11-09 Thread Luben Tuikov
On 2023-11-09 19:16, Danilo Krummrich wrote:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
> Signed-off-by: Danilo Krummrich 
> ---
> Changes in V2:
> ==
>   - fixed up influence on scheduling fairness due to consideration of a
>     job's size
>     - If we reach a ready entity in drm_sched_select_entity() but can't
>       actually queue a job from it due to size limitations, just give up
>       and go to sleep until woken up due to a pending job finishing,
>       rather than continue to try other entities.
>   - added a callback to dynamically update a job's credits (Boris)
>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
> (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's
>     proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures' credit fields and function parameters
>     to the following scheme
> - drm_sched_job::credits
> - drm_gpu_scheduler::credit_limit
> - drm_gpu_scheduler::credit_count
> - drm_sched_init(..., u32 credit_limit, ...)
> - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
> 
> Changes in V4:
> ==
>   - address Lubens comments regarding documentation
>   - switch to drm_WARN() variants
>   - WARN_ON() drivers passing in zero job credits for both
>     drm_sched_job_init() and the update_job_credits() callback
>   - don't retry with another runq if job doesn't fit on the ring to prevent
> priority inversion
>   - rebase onto drm-misc-next (will probably land before Matt's series)
> 
> Changes in V5:
> ==
>   - fix panfrost, lima and etnaviv build
>   - add proposed comments regarding how the code avoids runq priority 
> inversion
>   - address Lubens feedback regarding wording
>   - rebase onto latest drm-misc-next (XE scheduler patches)
> 
> Changes in V6:
> ==
>   - rebase due to conflicts introduced meanwhile
>   - drm_sched_job_init(): fail with EINVAL, rather than WARN() if job->credits
> is zero
>   - drm_sched_can_queue: truncate job->credits if they exceed the scheduler's
> credit limit to guarantee forward progress
> 
> Patch also available in [1].
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>  Documentation/gpu/drm-mm.rst  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gpu.c |   2 +-
>  drivers/gpu/drm/lima/lima_device.c|   2 +-
>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c   |   2 +-
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 168 ++
>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>  include/drm/gpu_scheduler.h   |  28 ++-
>  14 files changed, 173 insertions(+), 51 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>  
> +Flow Control
> +
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>  Scheduler Function References
>  -
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

[PATCH] drm/sched: Define pr_fmt() for DRM using pr_*()

2023-11-09 Thread Luben Tuikov
Define pr_fmt() as "[drm] " for DRM code using pr_*() facilities, especially
when no devices are available. This makes it easier to browse kernel logs.

Signed-off-by: Luben Tuikov 
---
 include/drm/drm_print.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
index a93a387f8a1a15..e8fe60d0eb8783 100644
--- a/include/drm/drm_print.h
+++ b/include/drm/drm_print.h
@@ -26,6 +26,20 @@
 #ifndef DRM_PRINT_H_
 #define DRM_PRINT_H_
 
+/* Define this before including linux/printk.h, so that the format
+ * string in pr_*() macros is correctly set for DRM. If a file wants
+ * to define this to something else, it should do so before including
+ * this header file.
+ *
+ * It is encouraged code using pr_err() to prefix their format with
+ * the string "*ERROR* ", to make it easier to scan kernel logs. For
+ * instance,
+ *   pr_err("*ERROR* ", args).
+ */
+#ifndef pr_fmt
+#define pr_fmt(fmt) "[drm] " fmt
+#endif
+
 #include <linux/compiler.h>
 #include <linux/printk.h>
 #include <linux/seq_file.h>

base-commit: f3123c2590005c5ff631653d31428e40cd10c618
-- 
2.42.1



[PATCH] drm/sched: Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()

2023-11-09 Thread Luben Tuikov
Don't "wake up" the GPU scheduler unless the entity is ready, as well as we
can queue to the scheduler, i.e. there is no point in waking up the scheduler
for the entity unless the entity is ready.

Signed-off-by: Luben Tuikov 
Fixes: bc8d6a9df99038 ("drm/sched: Don't disturb the entity when in RR-mode 
scheduling")
---
 drivers/gpu/drm/scheduler/sched_entity.c | 4 ++--
 drivers/gpu/drm/scheduler/sched_main.c   | 8 +---
 include/drm/gpu_scheduler.h  | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index f1db63cc819812..4d42b1e4daa67f 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
container_of(cb, struct drm_sched_entity, cb);
 
drm_sched_entity_clear_dep(f, cb);
-   drm_sched_wakeup(entity->rq->sched);
+   drm_sched_wakeup(entity->rq->sched, entity);
 }
 
 /**
@@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job 
*sched_job)
if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
drm_sched_rq_update_fifo(entity, submit_ts);
 
-   drm_sched_wakeup(entity->rq->sched);
+   drm_sched_wakeup(entity->rq->sched, entity);
}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index cd0dc3f81d05f0..8f5e466bd58239 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -925,10 +925,12 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
*sched)
  *
  * Wake up the scheduler if we can queue jobs.
  */
-void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
+void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
+ struct drm_sched_entity *entity)
 {
-   if (drm_sched_can_queue(sched))
-   drm_sched_run_job_queue(sched);
+   if (drm_sched_entity_is_ready(entity))
+   if (drm_sched_can_queue(sched))
+   drm_sched_run_job_queue(sched);
 }
 
 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 754fd2217334e5..09916c84703f59 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -559,7 +559,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity 
*entity,
 
 void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
-void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
+void drm_sched_wakeup(struct drm_gpu_scheduler *sched, struct drm_sched_entity 
*entity);
 bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);

base-commit: f415a6078f640ab15bae34d3c6a1d8e6071363de
-- 
2.42.1



Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-09 Thread Luben Tuikov
On 2023-11-09 18:41, Danilo Krummrich wrote:
> On 11/9/23 20:24, Danilo Krummrich wrote:
>> On 11/9/23 07:52, Luben Tuikov wrote:
>>> Hi,
>>>
>>> On 2023-11-07 19:41, Danilo Krummrich wrote:
>>>> On 11/7/23 05:10, Luben Tuikov wrote:
>>>>> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In 
>>>>> fact,
>>>>> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and 
>>>>> let
>>>>> it do just that, schedule the work item for execution.
>>>>>
>>>>> The problem is that drm_sched_run_job_queue() calls 
>>>>> drm_sched_select_entity()
>>>>> to determine if the scheduler has an entity ready in one of its 
>>>>> run-queues,
>>>>> and in the case of the Round-Robin (RR) scheduling, the function
>>>>> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
>>>>> which is ready, sets up the run-queue and completion and returns that
>>>>> entity. The FIFO scheduling algorithm is unaffected.
>>>>>
>>>>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), 
>>>>> then
>>>>> in the case of RR scheduling, that would result in 
>>>>> drm_sched_select_entity()
>>>>> having been called twice, which may result in skipping a ready entity if 
>>>>> more
>>>>> than one entity is ready. This commit fixes this by eliminating the call 
>>>>> to
>>>>> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it 
>>>>> only
>>>>> in drm_sched_run_job_work().
>>>>>
>>>>> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>>>>>   Add fixes-tag. (Tvrtko)
>>>>>
>>>>> Signed-off-by: Luben Tuikov 
>>>>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
>>>>> ---
>>>>>    drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
>>>>>    1 file changed, 3 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 27843e37d9b769..cd0dc3f81d05f0 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq 
>>>>> *rq)
>>>>>    }
>>>>>    /**
>>>>> - * __drm_sched_run_job_queue - enqueue run-job work
>>>>> + * drm_sched_run_job_queue - enqueue run-job work
>>>>>     * @sched: scheduler instance
>>>>>     */
>>>>> -static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    if (!READ_ONCE(sched->pause_submit))
>>>>>    queue_work(sched->submit_wq, &sched->work_run_job);
>>>>> @@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct 
>>>>> drm_gpu_scheduler *sched)
>>>>>    void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    if (drm_sched_can_queue(sched))
>>>>> -    __drm_sched_run_job_queue(sched);
>>>>> +    drm_sched_run_job_queue(sched);
>>>>>    }
>>>>>    /**
>>>>> @@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
>>>>> **sched_list,
>>>>>    }
>>>>>    EXPORT_SYMBOL(drm_sched_pick_best);
>>>>> -/**
>>>>> - * drm_sched_run_job_queue - enqueue run-job work if there are ready 
>>>>> entities
>>>>> - * @sched: scheduler instance
>>>>> - */
>>>>> -static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>> -{
>>>>> -    if (drm_sched_select_entity(sched))
>>>>
>>>> Hm, now that I rebase my patch to implement dynamic job-flow control I 
>>>> recognize that
>>>> we probably need the peek semantics here. If we do not select an entity 
>>>> here, we also
>>>> do not check whether the corresponding job fits on the ring.
>>>>
>>>> Alternatively, we simply can't do this check in drm_sched_wakeup(). The
>>>> consequence would be that we don't detect that we need to wait for
>>>> credits to free up before the run work is already executing and the run
>>>> work selects an entity.

Re: [PATCH] drm/sched: fix potential page fault in drm_sched_job_init()

2023-11-09 Thread Luben Tuikov
On 2023-11-09 14:55, Danilo Krummrich wrote:
> On 11/9/23 01:09, Danilo Krummrich wrote:
>> On 11/8/23 06:46, Luben Tuikov wrote:
>>> Hi,
>>>
>>> Could you please use my gmail address, the one I'm responding from--I
>>> don't want to miss any DRM scheduler patches. BTW, the
>>> luben.tui...@amd.com email should bounce as undeliverable.
>>>
>>> On 2023-11-07 21:26, Danilo Krummrich wrote:
>>>> Commit 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable
>>>> number of run-queues") introduces drm_err() in drm_sched_job_init(), in
>>>> order to indicate that the given entity has no runq, however at this
>>>> time job->sched is not yet set, likely to be NULL initialized, and hence
>>>> shouldn't be used.
>>>>
>>>> Replace the corresponding drm_err() call with pr_err() to avoid a
>>>> potential page fault.
>>>>
>>>> While at it, extend the documentation of drm_sched_job_init() to
>>>> indicate that job->sched is not a valid pointer until
>>>> drm_sched_job_arm() has been called.
>>>>
>>>> Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable 
>>>> number of run-queues")
>>>> Signed-off-by: Danilo Krummrich 
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 5 -
>>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 27843e37d9b7..dd28389f0ddd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -680,6 +680,9 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>>    * This function returns -ENOENT in this case (which probably should be 
>>>> -EIO as
>>>>    * a more meanigful return value).
>>>>    *
>>>> + * Note that job->sched is not a valid pointer until drm_sched_job_arm() 
>>>> has
>>>> + * been called.
>>>> + *
>>>
>>> Good catch!
>>>
>>> Did you actually get this to page-fault and have a kernel log?
>>
>> No, I just found it because I was about to make the same mistake.
>>
>>>
>>> I'm asking because we see it correctly set in this kernel log coming from 
>>> AMD,
>>
>> I think that's because amdgpu just sets job->sched to *some* scheduler 
>> instance after
>> job allocation [1].
>>
>> [1] 
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L108
>>
>>>
>>> [   11.886024] amdgpu :0a:00.0: [drm] *ERROR* drm_sched_job_init: 
>>> entity has no rq!
>>>
>>> in this email,
>>> https://lore.kernel.org/r/CADnq5_PS64jYS_Y3kGW27m-kuWP+FQFiaVcOaZiB=jlsgpn...@mail.gmail.com
>>>
>>>>    * Returns 0 for success, negative error code otherwise.
>>>>    */
>>>>   int drm_sched_job_init(struct drm_sched_job *job,
>>>> @@ -691,7 +694,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>>    * or worse--a blank screen--leave a trail in the
>>>>    * logs, so this can be debugged easier.
>>>>    */
>>>> -    drm_err(job->sched, "%s: entity has no rq!\n", __func__);
>>>> +    pr_err("%s: entity has no rq!\n", __func__);
>>>
>>> Is it feasible to do something like the following?
>>>
>>>     dev_err(job->sched ? job->sched->dev : NULL,
>>>             "%s: entity has no rq!\n", __func__);
>>
>> I don't think that's a good idea. Although I'd assume that every driver 
>> zero-initializes its job
>> structures, I can't see a rule enforcing that. Hence, job->sched can be a 
>> random value until
>> drm_sched_job_arm() is called.
>>
>> However, I notice there are quite a view more fields of struct drm_sched_job 
>> that are never
>> initialized, hence there are either a couple more potential bugs or missing 
>> documentation that
>> drivers *must* ensure that a job is zero-initialized.
> 
> Any opinions on that? Otherwise I'd probably go ahead and send a fix for the 
> other bugs too.

Send the patches.

Will those patches also add pr_fmt() for DRM?

I'm asking because you said you'll add pr_fmt() in a "separate" patch, and I
thought it was okay being self-contained in your patch as per the version I
sent.
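
As an aside, to illustrate the zero-initialization point from above: a
driver that embeds struct drm_sched_job and allocates it with kzalloc()
gets job->sched as NULL rather than random memory until drm_sched_job_arm()
sets it. A sketch, with a hypothetical my_job type and my_job_create()
helper:

	struct my_job {
		struct drm_sched_job base;
		/* driver-private members ... */
	};

	static int my_job_create(struct drm_sched_entity *entity, void *owner,
				 struct my_job **out)
	{
		struct my_job *job;
		int ret;

		/* kzalloc() zeroes the whole struct, including base.sched. */
		job = kzalloc(sizeof(*job), GFP_KERNEL);
		if (!job)
			return -ENOMEM;

		ret = drm_sched_job_init(&job->base, entity, owner);
		if (ret) {
			kfree(job);
			return ret;
		}

		*out = job;
		return 0;
	}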
-- 
Regards,
Luben


Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-08 Thread Luben Tuikov
Hi,

On 2023-11-07 19:41, Danilo Krummrich wrote:
> On 11/7/23 05:10, Luben Tuikov wrote:
>> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
>> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
>> it do just that, schedule the work item for execution.
>>
>> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
>> to determine if the scheduler has an entity ready in one of its run-queues,
>> and in the case of the Round-Robin (RR) scheduling, the function
>> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
>> which is ready, sets up the run-queue and completion and returns that
>> entity. The FIFO scheduling algorithm is unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), 
>> then
>> in the case of RR scheduling, that would result in drm_sched_select_entity()
>> having been called twice, which may result in skipping a ready entity if more
>> than one entity is ready. This commit fixes this by eliminating the call to
>> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
>> in drm_sched_run_job_work().
>>
>> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>>  Add fixes-tag. (Tvrtko)
>>
>> Signed-off-by: Luben Tuikov 
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
>>   1 file changed, 3 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 27843e37d9b769..cd0dc3f81d05f0 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq 
>> *rq)
>>   }
>>   
>>   /**
>> - * __drm_sched_run_job_queue - enqueue run-job work
>> + * drm_sched_run_job_queue - enqueue run-job work
>>* @sched: scheduler instance
>>*/
>> -static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>   {
>>  if (!READ_ONCE(sched->pause_submit))
>>  queue_work(sched->submit_wq, &sched->work_run_job);
>> @@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
>> *sched)
>>   void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>>   {
>>  if (drm_sched_can_queue(sched))
>> -__drm_sched_run_job_queue(sched);
>> +drm_sched_run_job_queue(sched);
>>   }
>>   
>>   /**
>> @@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
>> **sched_list,
>>   }
>>   EXPORT_SYMBOL(drm_sched_pick_best);
>>   
>> -/**
>> - * drm_sched_run_job_queue - enqueue run-job work if there are ready 
>> entities
>> - * @sched: scheduler instance
>> - */
>> -static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>> -{
>> -if (drm_sched_select_entity(sched))
> 
> Hm, now that I rebase my patch to implement dynamic job-flow control I 
> recognize that
> we probably need the peek semantics here. If we do not select an entity here, 
> we also
> do not check whether the corresponding job fits on the ring.
> 
> Alternatively, we simply can't do this check in drm_sched_wakeup(). The 
> consequence would
> be that we don't detect that we need to wait for credits to free up before 
> the run work is
> already executing and the run work selects an entity.

So I rebased v5 on top of the latest drm-misc-next, and looked around and
found out that drm_sched_wakeup() is missing drm_sched_entity_is_ready().
It should look like the following,

void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
  struct drm_sched_entity *entity)
{
if (drm_sched_entity_is_ready(entity))
if (drm_sched_can_queue(sched, entity))
drm_sched_run_job_queue(sched);
}

See the attached patch. (Currently running with base-commit and the attached 
patch.)
-- 
Regards,
Luben
From 65b8b8be52e8c112d7350397cb54b4fb3470b008 Mon Sep 17 00:00:00 2001
From: Danilo Krummrich 
Date: Thu, 2 Nov 2023 01:10:34 +0100
Subject: [PATCH] drm/sched: implement dynamic job-flow control

Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

Re: [PATCH] drm/sched: fix potential page fault in drm_sched_job_init()

2023-11-08 Thread Luben Tuikov
On 2023-11-08 19:09, Danilo Krummrich wrote:
> On 11/8/23 06:46, Luben Tuikov wrote:
>> Hi,
>>
>> Could you please use my gmail address, the one I'm responding from--I
>> don't want to miss any DRM scheduler patches. BTW, the
>> luben.tui...@amd.com email should bounce as undeliverable.
>>
>> On 2023-11-07 21:26, Danilo Krummrich wrote:
>>> Commit 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable
>>> number of run-queues") introduces drm_err() in drm_sched_job_init(), in
>>> order to indicate that the given entity has no runq, however at this
>>> time job->sched is not yet set, likely to be NULL initialized, and hence
>>> shouldn't be used.
>>>
>>> Replace the corresponding drm_err() call with pr_err() to avoid a
>>> potential page fault.
>>>
>>> While at it, extend the documentation of drm_sched_job_init() to
>>> indicate that job->sched is not a valid pointer until
>>> drm_sched_job_arm() has been called.
>>>
>>> Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable 
>>> number of run-queues")
>>> Signed-off-by: Danilo Krummrich 
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 5 -
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 27843e37d9b7..dd28389f0ddd 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -680,6 +680,9 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>* This function returns -ENOENT in this case (which probably should be 
>>> -EIO as
>>>* a more meanigful return value).
>>>*
>>> + * Note that job->sched is not a valid pointer until drm_sched_job_arm() 
>>> has
>>> + * been called.
>>> + *
>>
>> Good catch!
>>
>> Did you actually get this to page-fault and have a kernel log?
> 
> No, I just found it because I was about to make the same mistake.
> 
>>
>> I'm asking because we see it correctly set in this kernel log coming from 
>> AMD,
> 
> I think that's because amdgpu just sets job->sched to *some* scheduler 
> instance after
> job allocation [1].
> 
> [1] 
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L108
> 
>>
>> [   11.886024] amdgpu :0a:00.0: [drm] *ERROR* drm_sched_job_init: entity 
>> has no rq!
>>
>> in this email,
>> https://lore.kernel.org/r/CADnq5_PS64jYS_Y3kGW27m-kuWP+FQFiaVcOaZiB=jlsgpn...@mail.gmail.com
>>
>>>* Returns 0 for success, negative error code otherwise.
>>>*/
>>>   int drm_sched_job_init(struct drm_sched_job *job,
>>> @@ -691,7 +694,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>  * or worse--a blank screen--leave a trail in the
>>>  * logs, so this can be debugged easier.
>>>  */
>>> -   drm_err(job->sched, "%s: entity has no rq!\n", __func__);
>>> +   pr_err("%s: entity has no rq!\n", __func__);
>>
>> Is it feasible to do something like the following?
>>
>>  dev_err(job->sched ? job->sched->dev : NULL,
>>          "%s: entity has no rq!\n", __func__);
> 
> I don't think that's a good idea. Although I'd assume that every driver 
> zero-initializes its job
> structures, I can't see a rule enforcing that. Hence, job->sched can be a 
> random value until
> drm_sched_job_arm() is called.

Okay. However, when using pr_err() we're losing the "[drm] *ERROR* " prefix,
and we scan for that in the logs to quickly find the cause of the error.

Perhaps we can define pr_fmt() and also include "*ERROR*", so that we can get
the desired result, as the attached patch shows?
-- 
Regards,
Luben
From 1f3ed97947a406a555a3efea05cab67da94172e7 Mon Sep 17 00:00:00 2001
From: Danilo Krummrich 
Date: Wed, 8 Nov 2023 03:26:07 +0100
Subject: [PATCH] drm/sched: fix potential page fault in drm_sched_job_init()

Commit 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable
number of run-queues") introduces drm_err() in drm_sched_job_init(), in
order to indicate that the given entity has no runq, however at this
time job->sched is not yet set, likely to be NULL initialized, and hence
shouldn't be used.

Replace the corresponding drm_err() call with pr_err() to avoid a
potential page fault.

While at it, extend the documentation of drm_sched_job_init() to
indicate that job->sched is not a valid pointer until
drm_sched_job_arm() has been called.

Re: [PATCH] drm/sched: fix potential page fault in drm_sched_job_init()

2023-11-07 Thread Luben Tuikov
On 2023-11-08 00:46, Luben Tuikov wrote:
> Hi,
> 
> Could you please use my gmail address, the one I'm responding from--I
> don't want to miss any DRM scheduler patches. BTW, the
> luben.tui...@amd.com email should bounce as undeliverable.
> 
> On 2023-11-07 21:26, Danilo Krummrich wrote:
>> Commit 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable
>> number of run-queues") introduces drm_err() in drm_sched_job_init(), in
>> order to indicate that the given entity has no runq, however at this
>> time job->sched is not yet set, likely to be NULL initialized, and hence
>> shouldn't be used.
>>
>> Replace the corresponding drm_err() call with pr_err() to avoid a
>> potential page fault.
>>
>> While at it, extend the documentation of drm_sched_job_init() to
>> indicate that job->sched is not a valid pointer until
>> drm_sched_job_arm() has been called.
>>
>> Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable 
>> number of run-queues")
>> Signed-off-by: Danilo Krummrich 
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 27843e37d9b7..dd28389f0ddd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -680,6 +680,9 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>   * This function returns -ENOENT in this case (which probably should be 
>> -EIO as
>>   * a more meanigful return value).
>>   *
>> + * Note that job->sched is not a valid pointer until drm_sched_job_arm() has
>> + * been called.
>> + *
> 
> Good catch!
> 
> Did you actually get this to page-fault and have a kernel log?
> 
> I'm asking because we see it correctly set in this kernel log coming from AMD,
> 
> [   11.886024] amdgpu :0a:00.0: [drm] *ERROR* drm_sched_job_init: entity 
> has no rq!
> 
> in this email,
> https://lore.kernel.org/r/CADnq5_PS64jYS_Y3kGW27m-kuWP+FQFiaVcOaZiB=jlsgpn...@mail.gmail.com
> 
>>   * Returns 0 for success, negative error code otherwise.
>>   */
>>  int drm_sched_job_init(struct drm_sched_job *job,
>> @@ -691,7 +694,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>   * or worse--a blank screen--leave a trail in the
>>   * logs, so this can be debugged easier.
>>   */
>> -drm_err(job->sched, "%s: entity has no rq!\n", __func__);
>> +pr_err("%s: entity has no rq!\n", __func__);
> 
> Is it feasible to do something like the following?
> 
>   dev_err(job->sched ? job->sched->dev : NULL,
>           "%s: entity has no rq!\n", __func__);

Sorry, that was meant to be like this to make the print look just like the
original,

	dev_err(job->sched ? job->sched->dev : NULL,
		"[drm] *ERROR* %s: entity has no rq!\n", __func__);

> 
>>  return -ENOENT;
>>  }
>>  
>>
>> base-commit: c015fb6d01adb616fb54824feb55ce5ab18e8ca1
> 
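
A side note on the ternary: dev_err() tolerates a NULL struct device and
prints a "(NULL device *):" prefix in that case, so--assuming the job struct
was zero-initialized by the driver--a sketch like the following would at
least log rather than fault before drm_sched_job_arm() has set job->sched:

	/* NULL-safe: dev_err(NULL, ...) prints "(NULL device *): ...". */
	struct device *dev = job->sched ? job->sched->dev : NULL;

	dev_err(dev, "[drm] *ERROR* %s: entity has no rq!\n", __func__);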

-- 
Regards,
Luben


Re: [PATCH] drm/sched: fix potential page fault in drm_sched_job_init()

2023-11-07 Thread Luben Tuikov
Hi,

Could you please use my gmail address, the one I'm responding from--I don't
want to miss any DRM scheduler patches. BTW, the luben.tui...@amd.com email
should bounce as undeliverable.

On 2023-11-07 21:26, Danilo Krummrich wrote:
> Commit 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable
> number of run-queues") introduces drm_err() in drm_sched_job_init(), in
> order to indicate that the given entity has no runq, however at this
> time job->sched is not yet set, likely to be NULL initialized, and hence
> shouldn't be used.
> 
> Replace the corresponding drm_err() call with pr_err() to avoid a
> potential page fault.
> 
> While at it, extend the documentation of drm_sched_job_init() to
> indicate that job->sched is not a valid pointer until
> drm_sched_job_arm() has been called.
> 
> Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable number 
> of run-queues")
> Signed-off-by: Danilo Krummrich 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 27843e37d9b7..dd28389f0ddd 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -680,6 +680,9 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>   * This function returns -ENOENT in this case (which probably should be -EIO 
> as
>   * a more meanigful return value).
>   *
> + * Note that job->sched is not a valid pointer until drm_sched_job_arm() has
> + * been called.
> + *

Good catch!

Did you actually get this to page-fault and have a kernel log?

I'm asking because we see it correctly set in this kernel log coming from AMD,

[   11.886024] amdgpu :0a:00.0: [drm] *ERROR* drm_sched_job_init: entity 
has no rq!

in this email,
https://lore.kernel.org/r/CADnq5_PS64jYS_Y3kGW27m-kuWP+FQFiaVcOaZiB=jlsgpn...@mail.gmail.com

>   * Returns 0 for success, negative error code otherwise.
>   */
>  int drm_sched_job_init(struct drm_sched_job *job,
> @@ -691,7 +694,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>* or worse--a blank screen--leave a trail in the
>* logs, so this can be debugged easier.
>*/
> - drm_err(job->sched, "%s: entity has no rq!\n", __func__);
> + pr_err("%s: entity has no rq!\n", __func__);

Is it feasible to do something like the following?

	dev_err(job->sched ? job->sched->dev : NULL,
		"%s: entity has no rq!\n", __func__);

>   return -ENOENT;
>   }
>  
> 
> base-commit: c015fb6d01adb616fb54824feb55ce5ab18e8ca1

-- 
Regards,
Luben


Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-07 Thread Luben Tuikov
On 2023-11-07 12:53, Danilo Krummrich wrote:
> On 11/7/23 05:10, Luben Tuikov wrote:
>> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
>> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
>> it do just that, schedule the work item for execution.
>>
>> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
>> to determine if the scheduler has an entity ready in one of its run-queues,
>> and in the case of the Round-Robin (RR) scheduling, the function
>> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
>> which is ready, sets up the run-queue and completion and returns that
>> entity. The FIFO scheduling algorithm is unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), 
>> then
>> in the case of RR scheduling, that would result in drm_sched_select_entity()
>> having been called twice, which may result in skipping a ready entity if more
>> than one entity is ready. This commit fixes this by eliminating the call to
>> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
>> in drm_sched_run_job_work().
>>
>> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>>  Add fixes-tag. (Tvrtko)
>>
>> Signed-off-by: Luben Tuikov 
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
> 
> Reviewed-by: Danilo Krummrich 

Thank you, sir!
-- 
Regards,
Luben


Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-07 Thread Luben Tuikov
On 2023-11-07 06:48, Matthew Brost wrote:
> On Mon, Nov 06, 2023 at 11:10:21PM -0500, Luben Tuikov wrote:
>> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
>> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
>> it do just that, schedule the work item for execution.
>>
>> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
>> to determine if the scheduler has an entity ready in one of its run-queues,
>> and in the case of the Round-Robin (RR) scheduling, the function
>> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
>> which is ready, sets up the run-queue and completion and returns that
>> entity. The FIFO scheduling algorithm is unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), 
>> then
>> in the case of RR scheduling, that would result in drm_sched_select_entity()
>> having been called twice, which may result in skipping a ready entity if more
>> than one entity is ready. This commit fixes this by eliminating the call to
>> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
>> in drm_sched_run_job_work().
>>
>> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>> Add fixes-tag. (Tvrtko)
>>
>> Signed-off-by: Luben Tuikov 
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
> 
> Reviewed-by: Matthew Brost 

Thank you, sir!
-- 
Regards,
Luben


[PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-06 Thread Luben Tuikov
Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov 
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
 drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 }
 
 /**
- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
  * @sched: scheduler instance
  */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
 {
if (!READ_ONCE(sched->pause_submit))
queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
*sched)
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 {
if (drm_sched_can_queue(sched))
-   __drm_sched_run_job_queue(sched);
+   drm_sched_run_job_queue(sched);
 }
 
 /**
@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
**sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-   if (drm_sched_select_entity(sched))
-   __drm_sched_run_job_queue(sched);
-}
-
 /**
  * drm_sched_free_job_work - worker to call free_job
  *

base-commit: 27d9620e9a9a6bc27a646b464b85860d91e21af3
-- 
2.42.1



Re: [PATCH 0/5] Some drm scheduler internal renames

2023-11-06 Thread Luben Tuikov
On 2023-11-06 07:41, Tvrtko Ursulin wrote:
> 
> On 05/11/2023 01:51, Luben Tuikov wrote:
>> On 2023-11-02 06:55, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin 
>>>
>>> I found some of the naming a bit inconsistent and unclear so just a small
>>> attempt to clarify and tidy some of them. See what people think if my first
>>> stab improves things or not.
>>>
>>> Cc: Luben Tuikov 
>>> Cc: Matthew Brost 
>>>
>>> Tvrtko Ursulin (5):
>>>drm/sched: Rename drm_sched_get_cleanup_job to be more descriptive
>>>drm/sched: Move free worker re-queuing out of the if block
>>>drm/sched: Rename drm_sched_free_job_queue to be more descriptive
>>>drm/sched: Rename drm_sched_run_job_queue_if_ready and clarify
>>>  kerneldoc
>>>drm/sched: Drop suffix from drm_sched_wakeup_if_can_queue
>>>
>>>   drivers/gpu/drm/scheduler/sched_entity.c |  4 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c   | 53 ++++
>>>   include/drm/gpu_scheduler.h  |  2 +-
>>>   3 files changed, 29 insertions(+), 30 deletions(-)
>>>
>>
>> Series is,
>>
>> Reviewed-by: Luben Tuikov 
>>
>> and pushed to drm-misc-next.
> 
> Oh thanks, I definitely did not expect that to happen so quickly, 
> especially since it conflicts with your fix for RR and there are some 
> other opens. But it is fine, all that can be worked on top.

Yeah, it does conflict, and it does make some changes obsolete,
but your series was fine and an improvement, so might as well push it.

I'll rebase my patch on top of yours.
-- 
Regards,
Luben


Re: [PATCH 0/5] Some drm scheduler internal renames

2023-11-04 Thread Luben Tuikov
On 2023-11-02 06:55, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> I found some of the naming a bit inconsistent and unclear so just a small
> attempt to clarify and tidy some of them. See what people think if my first
> stab improves things or not.
> 
> Cc: Luben Tuikov 
> Cc: Matthew Brost 
> 
> Tvrtko Ursulin (5):
>   drm/sched: Rename drm_sched_get_cleanup_job to be more descriptive
>   drm/sched: Move free worker re-queuing out of the if block
>   drm/sched: Rename drm_sched_free_job_queue to be more descriptive
>   drm/sched: Rename drm_sched_run_job_queue_if_ready and clarify
> kerneldoc
>   drm/sched: Drop suffix from drm_sched_wakeup_if_can_queue
> 
>  drivers/gpu/drm/scheduler/sched_entity.c |  4 +-
>  drivers/gpu/drm/scheduler/sched_main.c   | 53 
>  include/drm/gpu_scheduler.h  |  2 +-
>  3 files changed, 29 insertions(+), 30 deletions(-)
> 

Series is,

Reviewed-by: Luben Tuikov 

and pushed to drm-misc-next.

Thanks!
-- 
Regards,
Luben


Re: [PATCH 0/5] Some drm scheduler internal renames

2023-11-04 Thread Luben Tuikov
Hi Tvrtko,

I only now saw this patch and I had to look for it...

Do you get a bounce from luben.tui...@amd.com? No? You should have.

Please don't use luben.tui...@amd.com.

Please use ltuiko...@gmail.com, this email.
 
Regards,
Luben

On 2023-11-02 06:55, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> I found some of the naming a bit inconsistent and unclear so just a small
> attempt to clarify and tidy some of them. See what people think if my first
> stab improves things or not.
> 
> Cc: Luben Tuikov 
> Cc: Matthew Brost 
> 
> Tvrtko Ursulin (5):
>   drm/sched: Rename drm_sched_get_cleanup_job to be more descriptive
>   drm/sched: Move free worker re-queuing out of the if block
>   drm/sched: Rename drm_sched_free_job_queue to be more descriptive
>   drm/sched: Rename drm_sched_run_job_queue_if_ready and clarify
> kerneldoc
>   drm/sched: Drop suffix from drm_sched_wakeup_if_can_queue
> 
>  drivers/gpu/drm/scheduler/sched_entity.c |  4 +-
>  drivers/gpu/drm/scheduler/sched_main.c   | 53 
>  include/drm/gpu_scheduler.h  |  2 +-
>  3 files changed, 29 insertions(+), 30 deletions(-)
> 


Re: [PATCH] drm/sched: Eliminate drm_sched_run_job_queue_if_ready()

2023-11-03 Thread Luben Tuikov
Hi Tvrtko,

On 2023-11-03 06:39, Tvrtko Ursulin wrote:
> 
> On 02/11/2023 22:46, Luben Tuikov wrote:
>> Eliminate drm_sched_run_job_queue_if_ready() and instead just call
>> drm_sched_run_job_queue() in drm_sched_free_job_work(). The problem is that
>> the former function uses drm_sched_select_entity() to determine if the
>> scheduler had an entity ready in one of its run-queues, and in the case of 
>> the
>> Round-Robin (RR) scheduling, the function drm_sched_rq_select_entity_rr() 
>> does
>> just that, selects the _next_ entity which is ready, sets up the run-queue 
>> and
>> completion and returns that entity. The FIFO scheduling algorithm is 
>> unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), 
>> then
>> in the case of RR scheduling, that would result in calling select_entity()
>> twice, which may result in skipping a ready entity if more than one entity is
>> ready. This commit fixes this by eliminating the if_ready() variant.
> 
> Fixes: is missing since the regression already landed.

Ah, yes, thank you for pointing that out. :-)
I'll add one.

> 
>>
>> Signed-off-by: Luben Tuikov 
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 14 ++
>>   1 file changed, 2 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 98b2ad54fc7071..05816e7cae8c8b 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
>> **sched_list,
>>   }
>>   EXPORT_SYMBOL(drm_sched_pick_best);
>>   
>> -/**
>> - * drm_sched_run_job_queue_if_ready - enqueue run-job work if ready
>> - * @sched: scheduler instance
>> - */
>> -static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler 
>> *sched)
>> -{
>> -if (drm_sched_select_entity(sched))
>> -drm_sched_run_job_queue(sched);
>> -}
>> -
>>   /**
>>* drm_sched_free_job_work - worker to call free_job
>>*
>> @@ -1069,7 +1059,7 @@ static void drm_sched_free_job_work(struct work_struct 
>> *w)
>>  sched->ops->free_job(cleanup_job);
>>   
>>  drm_sched_free_job_queue_if_done(sched);
>> -drm_sched_run_job_queue_if_ready(sched);
>> +drm_sched_run_job_queue(sched);
> 
> It works but is a bit wasteful causing needless CPU wake ups with a 

I'd not worry about "waking up the CPU" as the CPU scheduler would most likely
put the wq on the same CPU by instruction cache locality.

> potentially empty queue, both here and in drm_sched_run_job_work below.

That's true, but if you were to look at the typical execution of this code,
you'd see we get a string of function entries when the incoming queue is
non-empty, followed by one empty entry, only to be taken off the CPU. So,
it really isn't a breaker.

That said, there's a way to mitigate this in drm_sched_run_job_work(). I'll
see that it makes it into the next version of the patch.
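
For illustration, the shape I have in mind--a sketch only, not the actual
patch, reusing the existing names from sched_main.c:

    /* Sketch: the worker consumes one job per invocation and re-queues
     * itself; when nothing is ready it returns early, so a burst of real
     * work is trailed by at most one "empty" invocation.
     */
    static void drm_sched_run_job_work(struct work_struct *w)
    {
            struct drm_gpu_scheduler *sched =
                    container_of(w, struct drm_gpu_scheduler, work_run_job);
            struct drm_sched_entity *entity;

            if (READ_ONCE(sched->pause_submit))
                    return;

            entity = drm_sched_select_entity(sched);
            if (!entity)
                    return; /* the one empty entry; the worker goes idle */

            /* ... pop the job from the entity, arm and run it ... */

            drm_sched_run_job_queue(sched); /* come back for the next one */
    }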

Thanks!
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Eliminate drm_sched_run_job_queue_if_ready()

2023-11-03 Thread Luben Tuikov
Hi Matt, :-)

On 2023-11-03 11:13, Matthew Brost wrote:
> On Thu, Nov 02, 2023 at 06:46:54PM -0400, Luben Tuikov wrote:
>> Eliminate drm_sched_run_job_queue_if_ready() and instead just call
>> drm_sched_run_job_queue() in drm_sched_free_job_work(). The problem is that
>> the former function uses drm_sched_select_entity() to determine if the
>> scheduler had an entity ready in one of its run-queues, and in the case of
>> the Round-Robin (RR) scheduling, the function drm_sched_rq_select_entity_rr()
>> does just that, selects the _next_ entity which is ready, sets up the
>> run-queue and completion and returns that entity. The FIFO scheduling
>> algorithm is unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(),
>> then in the case of RR scheduling, that would result in calling
>> select_entity() twice, which may result in skipping a ready entity if more
>> than one entity is ready. This commit fixes this by eliminating the
>> if_ready() variant.
>>
> 
> Ah, yes I guess we both missed this. What about reviving the peek
> argument [1]? This would avoid unnecessary re-queues.

So, I really am not too fond of "peek-then-get-and-do" (scheduling)
organizations, because they don't scale. As we've seen in our case, RR
selection has a side effect, as Tvrtko pointed out (thanks!), and in the
future this "peek first, then go again"-type of organization would only
prevent us from doing more interesting things.

Also, with the GPU scheduler organization, mixing in a "peek" means we get to
carry it around through many a function, only for it to be used in a leaf
function and exported way back up (because we don't know the rq at that level).

I'd much rather we just did "consume-until-empty", and if we have one last
empty check (or first), then that's not a breaker. (I mean, we have a
drm_sched_pick_best() which has time complexity O(n), and we execute it every 
time
we arm a job, so it's not that big of a deal.) Plus, it makes the code concise
and compact.
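
To spell out the RR side effect for the record--simplified from
drm_sched_rq_select_entity_rr(), not a verbatim copy:

    /* Selecting an entity advances the round-robin cursor, so a "peek"
     * is not side-effect free: a second call skips a ready entity.
     */
    entity = rq->current_entity;
    list_for_each_entry_continue(entity, &rq->entities, list) {
            if (drm_sched_entity_is_ready(entity)) {
                    rq->current_entity = entity;    /* cursor moves here */
                    reinit_completion(&entity->entity_idle);
                    return entity;
            }
    }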

Let me reconstitute the patch and I'll send it for your review.
-- 
Regards,
Luben




Re: [PATCH v8 3/5] drm/sched: Split free_job into own work item

2023-11-02 Thread Luben Tuikov
On 2023-11-02 07:13, Tvrtko Ursulin wrote:
> 
> On 31/10/2023 03:24, Matthew Brost wrote:
>> Rather than call free_job and run_job in same work item have a dedicated
>> work item for each. This aligns with the design and intended use of work
>> queues.
>>
>> v2:
>> - Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting
>>   timestamp in free_job() work item (Danilo)
>> v3:
>>- Drop forward dec of drm_sched_select_entity (Boris)
>>- Return in drm_sched_run_job_work if entity NULL (Boris)
>> v4:
>>- Replace dequeue with peek and invert logic (Luben)
>>- Wrap to 100 lines (Luben)
>>- Update comments for *_queue / *_queue_if_ready functions (Luben)
>> v5:
>>- Drop peek argument, blindly reinit idle (Luben)
>>- s/drm_sched_free_job_queue_if_ready/drm_sched_free_job_queue_if_done 
>> (Luben)
>>- Update work_run_job & work_free_job kernel doc (Luben)
>> v6:
>>- Do not move drm_sched_select_entity in file (Luben)
>>
>> Signed-off-by: Matthew Brost 
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 146 +
>>   include/drm/gpu_scheduler.h|   4 +-
>>   2 files changed, 101 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index d1ae05bded15..3b1b2f8eafe8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -265,6 +265,32 @@ static void drm_sched_run_job_queue(struct 
>> drm_gpu_scheduler *sched)
>>  queue_work(sched->submit_wq, &sched->work_run_job);
>>   }
>>   
>> +/**
>> + * drm_sched_free_job_queue - enqueue free-job work
>> + * @sched: scheduler instance
>> + */
>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
>> +{
>> +if (!READ_ONCE(sched->pause_submit))
>> +queue_work(sched->submit_wq, &sched->work_free_job);
>> +}
>> +
>> +/**
>> + * drm_sched_free_job_queue_if_done - enqueue free-job work if ready
>> + * @sched: scheduler instance
>> + */
>> +static void drm_sched_free_job_queue_if_done(struct drm_gpu_scheduler 
>> *sched)
>> +{
>> +struct drm_sched_job *job;
>> +
>> +spin_lock(&sched->job_list_lock);
>> +job = list_first_entry_or_null(&sched->pending_list,
>> +   struct drm_sched_job, list);
>> +if (job && dma_fence_is_signaled(&job->s_fence->finished))
>> +drm_sched_free_job_queue(sched);
>> +spin_unlock(&sched->job_list_lock);
>> +}
>> +
>>   /**
>>* drm_sched_job_done - complete a job
>>* @s_job: pointer to the job which is done
>> @@ -284,7 +310,7 @@ static void drm_sched_job_done(struct drm_sched_job 
>> *s_job, int result)
>>  dma_fence_get(&s_fence->finished);
>>  drm_sched_fence_finished(s_fence, result);
>>  dma_fence_put(&s_fence->finished);
>> -drm_sched_run_job_queue(sched);
>> +drm_sched_free_job_queue(sched);
>>   }
>>   
>>   /**
>> @@ -943,8 +969,10 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler 
>> *sched)
>>  typeof(*next), list);
>>   
>>  if (next) {
>> -next->s_fence->scheduled.timestamp =
>> -dma_fence_timestamp(&job->s_fence->finished);
>> +if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>> + &next->s_fence->scheduled.flags))
>> +next->s_fence->scheduled.timestamp =
>> +dma_fence_timestamp(&job->s_fence->finished);
>>  /* start TO timer for next job */
>>  drm_sched_start_timeout(sched);
>>  }
>> @@ -994,7 +1022,40 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
>> **sched_list,
>>   EXPORT_SYMBOL(drm_sched_pick_best);
>>   
>>   /**
>> - * drm_sched_run_job_work - main scheduler thread
>> + * drm_sched_run_job_queue_if_ready - enqueue run-job work if ready
>> + * @sched: scheduler instance
>> + */
>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler 
>> *sched)
>> +{
>> +if (drm_sched_select_entity(sched))
>> +drm_sched_run_job_queue(sched);
>> +}
>> +
>> +/**
>> + * drm_sched_free_job_work - worker to call free_job
>> + *
>> + * @w: free job work
>> + */
>> +static void drm_sched_free_job_work(struct work_struct *w)
>> +{
>> +struct drm_gpu_scheduler *sched =
>> +container_of(w, struct drm_gpu_scheduler, work_free_job);
>> +struct drm_sched_job *cleanup_job;
>> +
>> +if (READ_ONCE(sched->pause_submit))
>> +return;
>> +
>> +cleanup_job = drm_sched_get_cleanup_job(sched);
>> +if (cleanup_job) {
>> +sched->ops->free_job(cleanup_job);
>> +
>> +drm_sched_free_job_queue_if_done(sched);
>> +drm_sched_run_job_queue_if_ready(sched);
> 
> Are finished jobs now disturbing the round robin selection?
> 
> Every time this cleans up a job we get:
> 
> drm_sched_run_job_queue_if_ready
>   -> 

[PATCH] drm/sched: Eliminate drm_sched_run_job_queue_if_ready()

2023-11-02 Thread Luben Tuikov
Eliminate drm_sched_run_job_queue_if_ready() and instead just call
drm_sched_run_job_queue() in drm_sched_free_job_work(). The problem is that
the former function uses drm_sched_select_entity() to determine if the
scheduler had an entity ready in one of its run-queues, and in the case of the
Round-Robin (RR) scheduling, the function drm_sched_rq_select_entity_rr() does
just that, selects the _next_ entity which is ready, sets up the run-queue and
completion and returns that entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in calling select_entity()
twice, which may result in skipping a ready entity if more than one entity is
ready. This commit fixes this by eliminating the if_ready() variant.

Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/scheduler/sched_main.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 98b2ad54fc7071..05816e7cae8c8b 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
**sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_run_job_queue_if_ready - enqueue run-job work if ready
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
-{
-   if (drm_sched_select_entity(sched))
-   drm_sched_run_job_queue(sched);
-}
-
 /**
  * drm_sched_free_job_work - worker to call free_job
  *
@@ -1069,7 +1059,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
sched->ops->free_job(cleanup_job);
 
drm_sched_free_job_queue_if_done(sched);
-   drm_sched_run_job_queue_if_ready(sched);
+   drm_sched_run_job_queue(sched);
}
 }
 
@@ -1127,7 +1117,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
}
 
wake_up(&sched->job_scheduled);
-   drm_sched_run_job_queue_if_ready(sched);
+   drm_sched_run_job_queue(sched);
 }
 
 /**

base-commit: 6fd9487147c4f18ad77eea00bd8c9189eec74a3e
-- 
2.42.1



Re: [PATCH v8 0/5] DRM scheduler changes for Xe

2023-11-02 Thread Luben Tuikov
On 2023-10-30 23:24, Matthew Brost wrote:
>   As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first.
> 
> This is a continuation of an RFC [3] with all comments addressed, ready for
> a full review, and hopefully in a state which can be merged in the near
> future. More details of this series can be found in the cover letter of the
> RFC [3].
> 
> These changes have been tested with the Xe driver. Based on drm-tip branch.
> 
> A follow-up series will be posted to address some of dakr's requests for
> kernel doc changes.
> 
> v2:
>  - Break run job, free job, and process message in own work items
>  - This might break other drivers as run job and free job now can run in
>parallel, can fix up if needed
> 
> v3:
>  - Include missing patch 'drm/sched: Add drm_sched_submit_* helpers'
>  - Fix issue with setting timestamp too early
>  - Don't dequeue jobs for single entity after calling entity fini
>  - Flush pending jobs on entity fini
>  - Add documentation for entity teardown
>  - Add Matthew Brost to maintainers of DRM scheduler
> 
> v4:
>  - Drop message interface
>  - Drop 'Flush pending jobs on entity fini'
>  - Drop 'Add documentation for entity teardown'
>  - Address all feedback
> 
> v5:
>  - Address Luben's feedback
>  - Drop starting TDR after calling run_job()
>  - Drop adding Matthew Brost to maintainers of DRM scheduler
> 
> v6:
>  - Address Luben's feedback
>  - Include base commit
> 
> v7:
>  - Drop SINGLE_ENTITY mode rather pull in Luben's patch for dynamic run queues
>  - Address Luben's feedback for free_job work item patch
> 
> v8:
>  - Rebase on drm-tip which includes Luben's patch for dynamic run queues
>  - Don't adjust comments, change variable names, function names twice in 
> series
>  - Don't move existing code to different places in a file to preserve git 
> history
> 
> Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/116055/
> 
> Matthew Brost (5):
>   drm/sched: Add drm_sched_wqueue_* helpers
>   drm/sched: Convert drm scheduler to use a work queue rather than
> kthread
>   drm/sched: Split free_job into own work item
>   drm/sched: Add drm_sched_start_timeout_unlocked helper
>   drm/sched: Add a helper to queue TDR immediately
> 
>  .../drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c   |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  15 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  14 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c   |   2 +-
>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>  drivers/gpu/drm/msm/adreno/adreno_device.c|   6 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c   |   2 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 301 --
>  drivers/gpu/drm/v3d/v3d_sched.c   |  10 +-
>  include/drm/gpu_scheduler.h   |  20 +-
>  12 files changed, 248 insertions(+), 130 deletions(-)
> 
> 
> base-commit: b560681c6bf623db41064ac486dd148d6c103e53

Hi Matthew,

I've pushed this series into drm-misc-next--I've tested and am running live 
with it.
Make sure to use "dim update-branches" to get all the resolutions, etc.

Thank you for working through this. Have a nice rest of your week. :-)
-- 
Regards,
Luben




Re: [PATCH v8 3/5] drm/sched: Split free_job into own work item

2023-11-02 Thread Luben Tuikov
On 2023-10-30 23:24, Matthew Brost wrote:
> Rather than call free_job and run_job in same work item have a dedicated
> work item for each. This aligns with the design and intended use of work
> queues.
> 
> v2:
>- Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting
>  timestamp in free_job() work item (Danilo)
> v3:
>   - Drop forward dec of drm_sched_select_entity (Boris)
>   - Return in drm_sched_run_job_work if entity NULL (Boris)
> v4:
>   - Replace dequeue with peek and invert logic (Luben)
>   - Wrap to 100 lines (Luben)
>   - Update comments for *_queue / *_queue_if_ready functions (Luben)
> v5:
>   - Drop peek argument, blindly reinit idle (Luben)
>   - s/drm_sched_free_job_queue_if_ready/drm_sched_free_job_queue_if_done 
> (Luben)
>   - Update work_run_job & work_free_job kernel doc (Luben)
> v6:
>   - Do not move drm_sched_select_entity in file (Luben)
> 
> Signed-off-by: Matthew Brost 

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 146 +
>  include/drm/gpu_scheduler.h|   4 +-
>  2 files changed, 101 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index d1ae05bded15..3b1b2f8eafe8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -265,6 +265,32 @@ static void drm_sched_run_job_queue(struct 
> drm_gpu_scheduler *sched)
>   queue_work(sched->submit_wq, &sched->work_run_job);
>  }
>  
> +/**
> + * drm_sched_free_job_queue - enqueue free-job work
> + * @sched: scheduler instance
> + */
> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> +{
> + if (!READ_ONCE(sched->pause_submit))
> + queue_work(sched->submit_wq, &sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_free_job_queue_if_done - enqueue free-job work if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_free_job_queue_if_done(struct drm_gpu_scheduler *sched)
> +{
> + struct drm_sched_job *job;
> +
> + spin_lock(&sched->job_list_lock);
> + job = list_first_entry_or_null(&sched->pending_list,
> +struct drm_sched_job, list);
> + if (job && dma_fence_is_signaled(&job->s_fence->finished))
> + drm_sched_free_job_queue(sched);
> + spin_unlock(&sched->job_list_lock);
> +}
> +
>  /**
>   * drm_sched_job_done - complete a job
>   * @s_job: pointer to the job which is done
> @@ -284,7 +310,7 @@ static void drm_sched_job_done(struct drm_sched_job 
> *s_job, int result)
> + dma_fence_get(&s_fence->finished);
> + drm_sched_fence_finished(s_fence, result);
> + dma_fence_put(&s_fence->finished);
> - drm_sched_run_job_queue(sched);
> + drm_sched_free_job_queue(sched);
>  }
>  
>  /**
> @@ -943,8 +969,10 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler 
> *sched)
>   typeof(*next), list);
>  
>   if (next) {
> - next->s_fence->scheduled.timestamp =
> - dma_fence_timestamp(&job->s_fence->finished);
> + if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
> +  &next->s_fence->scheduled.flags))
> + next->s_fence->scheduled.timestamp =
> + dma_fence_timestamp(&job->s_fence->finished);
>   /* start TO timer for next job */
>   drm_sched_start_timeout(sched);
>   }
> @@ -994,7 +1022,40 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
> **sched_list,
>  EXPORT_SYMBOL(drm_sched_pick_best);
>  
>  /**
> - * drm_sched_run_job_work - main scheduler thread
> + * drm_sched_run_job_queue_if_ready - enqueue run-job work if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> + if (drm_sched_select_entity(sched))
> + drm_sched_run_job_queue(sched);
> +}
> +
> +/**
> + * drm_sched_free_job_work - worker to call free_job
> + *
> + * @w: free job work
> + */
> +static void drm_sched_free_job_work(struct work_struct *w)
> +{
> + struct drm_gpu_scheduler *sched =
> + container_of(w, struct drm_gpu_scheduler, work_free_job);
> + struct drm_sched_job *cleanup_job;
> +
> + if (READ_ONCE(sched->pause_submit))
> + return;
> +
> + cleanup_job = drm_sched_get_cleanup_job(sched);
> + if (cleanup_job) {

Re: [PATCH drm-misc-next v5] drm/sched: implement dynamic job-flow control

2023-11-01 Thread Luben Tuikov
On 2023-11-01 20:10, Danilo Krummrich wrote:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
> Signed-off-by: Danilo Krummrich 

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
> Changes in V2:
> ==
>   - fixed up influence on scheduling fairness due to consideration of a job's
> size
> - If we reach a ready entity in drm_sched_select_entity() but can't 
> actually
>   queue a job from it due to size limitations, just give up and go to 
> sleep
>   until woken up due to a pending job finishing, rather than continue to 
> try
>   other entities.
>   - added a callback to dynamically update a job's credits (Boris)
>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
> (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's 
> proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures credit fields and function parameters to 
> the
> following scheme
> - drm_sched_job::credits
> - drm_gpu_scheduler::credit_limit
> - drm_gpu_scheduler::credit_count
> - drm_sched_init(..., u32 credit_limit, ...)
> - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
> 
> Changes in V4:
> ==
>   - address Lubens comments regarding documentation
>   - switch to drm_WARN() variants
>   - WARN_ON() drivers passing in zero job credits for both 
> drm_sched_job_init()
> and the update_job_credits() callback
>   - don't retry with another runq if job doesn't fit on the ring to prevent
> priority inversion
>   - rebase onto drm-misc-next (will probably land before Matt's series)
> 
> Changes in V5:
> ==
>   - fix panfrost, lima and etnaviv build
>   - add proposed comments regarding how the code avoids runq priority 
> inversion
>   - address Lubens feedback regarding wording
>   - rebase onto latest drm-misc-next (XE scheduler patches)
> 
> Patch also available in [1].
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>  Documentation/gpu/drm-mm.rst  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gpu.c |   2 +-
>  drivers/gpu/drm/lima/lima_device.c|   2 +-
>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c   |   2 +-
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>  drivers/gpu/drm/scheduler/sched_entity.c  |   4 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 166 ++
>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>  include/drm/gpu_scheduler.h   |  31 +++-
>  15 files changed, 174 insertions(+), 55 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>  
> +Flow Control
> +
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>  Scheduler Function References
>  -

Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-11-01 Thread Luben Tuikov
On 2023-10-31 22:23, Danilo Krummrich wrote:
> Hi Luben,
> 

[snip]
>>> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq 
>>> *rq,
>>>  /**
>>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a 
>>> job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Try to find a ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> + struct drm_sched_rq *rq)
>>>  {
>>> struct drm_sched_entity *entity;
>>>  
>>> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> if (entity) {
>>> list_for_each_entry_continue(entity, &rq->entities, list) {
>>> if (drm_sched_entity_is_ready(entity)) {
>>> +   /* If we can't queue yet, preserve the current
>>> +* entity in terms of fairness.
>>> +*/
>>> +   if (!drm_sched_can_queue(sched, entity)) {
>>> +   spin_unlock(&rq->lock);
>>> +   return ERR_PTR(-ENOSPC);
>>> +   }
>>> +
>>> rq->current_entity = entity;
>>> reinit_completion(&entity->entity_idle);
>>> spin_unlock(&rq->lock);
>>> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> }
>>>  
>>> list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>> if (drm_sched_entity_is_ready(entity)) {
>>> +   /* If we can't queue yet, preserve the current entity in
>>> +* terms of fairness.
>>> +*/
>>> +   if (!drm_sched_can_queue(sched, entity)) {
>>> +   spin_unlock(&rq->lock);
>>> +   return ERR_PTR(-ENOSPC);
>>> +   }
>>> +
>>> rq->current_entity = entity;
>>> reinit_completion(&entity->entity_idle);
>>> spin_unlock(&rq->lock);
>>> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  /**
>>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job 
>>> to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Find oldest waiting ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +   struct drm_sched_rq *rq)
>>>  {
>>> struct rb_node *rb;
>>>  
>>> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq 
>>> *rq)
>>>  
>>> entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>> if (drm_sched_entity_is_ready(entity)) {
>>> +   /* If we can't queue yet, preserve the current entity in
>>> +* terms of fairness.
>>> +*/
>>> +   if (!drm_sched_can_queue(sched, entity)) {
>>> +   spin_unlock(>lock);
>>> +   return ERR_PTR(-ENOSPC);
>>> +   }
>>> +
>>
>> So, this kinda did this abrupt "return" in v2, then it was fixed fine in v3,
>> and now I see we return an error, which doesn't seem to be used
>> by anyone. In fact, in drm_sched_select_entity(), we do this,
>>
>> -return entity;
>> +return IS_ERR(entity) ? NULL : entity;
>>
>> So, it's perhaps best to leave this as it was in v3, and if in the future
>> we need to distinguish between the type of error, then that future patch
>> would do that and also show how this is used with new code and logic.
>>
>> There is little value to over-engineer this right now, given that in
>> the future, the logic may be more esoteric than we can think of.
> 
> Ok, maybe what I do here is a little bit subtle and requires a comment. Let me
> explain.
> 
> The reason I return an ERR_PTR() instead of NULL is to indicate to
> drm_sched_select_entity() to break out of the runqueue loop
> (drm_sched_select_entity() breaks the loop when the returned entity is not
> NULL).
> 
> Without that, we'd continue the runqueue loop in drm_sched_select_entity() and
> retry with the next lower priority. This could lead to priority inversion of
> the runqueues, since a lower priority runqueue with jobs with fewer credits
> could stall a higher priority runqueue with jobs with more credits.
> 
> Hence, in drm_sched_select_entity() we need to be able to distinguish between
> the case where drm_sched_rq_select_entity_fifo()/drm_sched_rq_select_entity_rr()
> can't find an entity and the case where they *can* find an entity,
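
Right--a simplified sketch of the resulting control flow in
drm_sched_select_entity() (FIFO path shown, RR is analogous; condensed from
the patch, not a verbatim copy):

    struct drm_sched_entity *entity = NULL;
    int i;

    /* Walk the run-queues from highest to lowest priority and stop at the
     * first non-NULL return; ERR_PTR(-ENOSPC) also stops the walk, so a
     * full ring cannot let a lower-priority run-queue overtake.
     */
    for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
            entity = drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]);
            if (entity)
                    break;
    }

    return IS_ERR(entity) ? NULL : entity;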

Re: [PATCH] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-31 Thread Luben Tuikov
On 2023-10-31 09:33, Danilo Krummrich wrote:
> 
> On 10/26/23 19:25, Luben Tuikov wrote:
>> On 2023-10-26 12:39, Danilo Krummrich wrote:
>>> On 10/23/23 05:22, Luben Tuikov wrote:
>>>> The GPU scheduler has now a variable number of run-queues, which are set
>>>> up at drm_sched_init() time. This way, each driver announces how many
>>>> run-queues it requires (supports) per each GPU scheduler it creates. Note
>>>> that run-queues correspond to scheduler "priorities", thus if the number
>>>> of run-queues is set to 1 at drm_sched_init(), then that scheduler
>>>> supports a single run-queue, i.e. a single "priority". If a driver
>>>> further sets a single entity per run-queue, then this creates a 1-to-1
>>>> correspondence between a scheduler and a scheduled entity.
>>>
>>> Generally, I'm fine with this patch and how it replaces / generalizes the
>>> single entity approach.
>>
>> Great!
>>
>>> However, I'm not quite sure how to properly use this. What is a driver
>>> supposed to do, which previously took advantage of
>>> DRM_SCHED_POLICY_SINGLE_ENTITY?
>>>
>>> Is it supposed to call drm_sched_init() with num_rqs=1? If so, what's the 
>>> correct way
>>
>> Yes, you call drm_sched_init() with num_rqs set to 1.
>>
>>> to initialize the drm_sched_entity then? Calling drm_sched_entity_init() 
>>> with priority=0?
>>
>> Yes, with priority set to 0.
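
For concreteness, a hypothetical sketch of that 1-to-1 setup (the my_* names
are made up; drm_sched_init() takes several more arguments, so it is only
referenced in the comment):

    struct drm_gpu_scheduler *sched_list[] = { &my_sched };
    int ret;

    /* First: drm_sched_init(&my_sched, ..., num_rqs = 1, ...);
     * with the remaining arguments elided here.
     */
    ret = drm_sched_entity_init(&my_entity, 0 /* the only priority */,
                                sched_list, 1, NULL);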
>>
>> One unfortunate fact I noticed when doing this patch is that the numerical
>> values assigned to enum drm_sched_priority are upside down: the names run
>> opposite to the values. Instead of min being 0, normal:1, high:2, kernel:3,
>> it should've been kernel:0 (highest), high:1, normal:2, low:3, and so on.
>>
>> The reason for this is absolutely clear: if you had a single priority, it
>> would be 0, the kernel one, the highest one. This is similar to how lanes
>> on a highway are counted: you always have lane 1. Similarly to nice(1) and
>> kernel priorities...
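
For reference, a sketch of the enum as it stands versus that ordering:

    /* As it stands (the value ascends with the priority): */
    enum drm_sched_priority {
            DRM_SCHED_PRIORITY_MIN,         /* 0, lowest */
            DRM_SCHED_PRIORITY_NORMAL,      /* 1 */
            DRM_SCHED_PRIORITY_HIGH,        /* 2 */
            DRM_SCHED_PRIORITY_KERNEL,      /* 3, highest */

            DRM_SCHED_PRIORITY_COUNT
    };
    /* The counting argued for above would instead start at the top,
     * kernel:0, high:1, normal:2, low:3, so that a scheduler with a
     * single run-queue naturally uses index 0.
     */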
>>
>>> Any other priority consequently faults in drm_sched_job_arm().
>>
>> drm_sched_job_arm() faults on !ENTITY, but the "priority" is just
>> assigned to s_priority:
>>  job->s_priority = entity->priority;
>>
>>
>>> While I might sound like a broken record (sorry for that), I really think
>>> everything related to Matt's series needs documentation, as in:
>>
>> Yes, I agree.
> 
> Great! Do you plan to send a subsequent patch adding some documentation for
> this one? I think it'd be good to get all the above documented.

A lot of this would be the magic sauce of drivers and hardware--as we've seen
with Xe--and it would be presumptuous of me to write down in detail what and
how this and that should be used.

So long as things are dynamic--as we've seen with the latest change in
sched_rq--we let drivers and hardware set the numbers and do their magic.

Having said this, if something fundamental comes to mind, I'd be sure to add
a comment therein--and this applies to everyone else too: don't be shy to
post a patch adding comments where you think there should be some.
-- 
Regards,
Luben




Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-31 Thread Luben Tuikov
Hi,

(PSA: luben.tui...@amd.com should've bounced :-) I'm removing it from the To: 
field.)

On 2023-10-30 20:26, Danilo Krummrich wrote:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
> Signed-off-by: Danilo Krummrich 
> ---
> Changes in V2:
> ==
>   - fixed up influence on scheduling fairness due to consideration of a job's
> size
> - If we reach a ready entity in drm_sched_select_entity() but can't 
> actually
>   queue a job from it due to size limitations, just give up and go to 
> sleep
>   until woken up due to a pending job finishing, rather than continue to 
> try
>   other entities.
>   - added a callback to dynamically update a job's credits (Boris)
>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
> (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's 
> proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures credit fields and function parameters to 
> the
> following scheme
> - drm_sched_job::credits
> - drm_gpu_scheduler::credit_limit
> - drm_gpu_scheduler::credit_count
> - drm_sched_init(..., u32 credit_limit, ...)
> - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
> 
> Changes in V4:
> ==
>   - address Lubens comments regarding documentation
>   - switch to drm_WARN() variants
>   - WARN_ON() drivers passing in zero job credits for both 
> drm_sched_job_init()
> and the update_job_credits() callback
>   - don't retry with another runq if job doesn't fit on the ring to prevent
> priority inversion
>   - rebase onto drm-misc-next (will probably land before Matt's series)
> 
> Patch also available in [1].
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>  Documentation/gpu/drm-mm.rst  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>  drivers/gpu/drm/scheduler/sched_entity.c  |   4 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 148 ++
>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>  include/drm/gpu_scheduler.h   |  31 +++-
>  12 files changed, 156 insertions(+), 49 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>  
> +Flow Control
> +
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>  Scheduler Function References
>  -
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 1f357198533f..62bb7fc7448a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
> amdgpu_vm *vm,
>   if (!entity)
>   return 0;
>  
> - return drm_sched_job_init(&(*job)->base, entity, owner);
> + return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>  }
>  
>  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c 
> b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> index 

Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

2023-10-27 Thread Luben Tuikov
On 2023-10-27 12:41, Boris Brezillon wrote:
> On Fri, 27 Oct 2023 10:32:52 -0400
> Luben Tuikov  wrote:
> 
>> On 2023-10-27 04:25, Boris Brezillon wrote:
>>> Hi Danilo,
>>>
>>> On Thu, 26 Oct 2023 18:13:00 +0200
>>> Danilo Krummrich  wrote:
>>>   
>>>> Currently, job flow control is implemented simply by limiting the number
>>>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>>>> limit that corresponds to the number of jobs which can be sent to the
>>>> hardware.
>>>>
>>>> This implies that for each job, drivers need to account for the maximum
>>>> job size possible in order to not overflow the ring buffer.
>>>>
>>>> However, there are drivers, such as Nouveau, where the job size has a
>>>> rather large range. For such drivers it can easily happen that job
>>>> submissions not even filling the ring by 1% can block subsequent
>>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>>
>>>> In order to overcome this issue, allow for tracking the actual job size
>>>> instead of the number of jobs. Therefore, add a field to track a job's
>>>> credit count, which represents the number of credits a job contributes
>>>> to the scheduler's credit limit.
>>>>
>>>> Signed-off-by: Danilo Krummrich 
>>>> ---
>>>> Changes in V2:
>>>> ==
>>>>   - fixed up influence on scheduling fairness due to consideration of a 
>>>> job's
>>>> size
>>>> - If we reach a ready entity in drm_sched_select_entity() but can't 
>>>> actually
>>>>   queue a job from it due to size limitations, just give up and go to 
>>>> sleep
>>>>   until woken up due to a pending job finishing, rather than continue 
>>>> to try
>>>>   other entities.
>>>>   - added a callback to dynamically update a job's credits (Boris)  
>>>
>>> This callback seems controversial. I'd suggest dropping it, so the
>>> patch can be merged.  
>>
>> Sorry, why is it controversial? (I did read the back-and-forth above, but
>> it wasn't clear why it is /controversial/.)
> 
> That's a question for Christian, I guess. I didn't quite get what he
> was worried about, other than this hook introducing a new way for
> drivers to screw things up by returning funky/invalid credits (which we

It's up to the driver--they can test, test, test, and fix their code and so on.
We can only do so much and shouldn't be baby-sitting drivers ad nauseam. A
driver can also simply not define this callback. :-)
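
And on the scheduler's side, honoring the optional hook is cheap--a sketch,
not the literal patch:

    /* Only consult the hook when the driver provides one; in this
     * revision a return value of 0 means "no update".
     */
    if (sched->ops->update_job_credits) {
            u32 credits = sched->ops->update_job_credits(sched_job);

            if (credits)
                    sched_job->credits = credits;
    }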

> can report with WARN_ON()s). But let's be honest, there's probably a
> hundred different ways (if not more) drivers can shoot themselves in the
> foot with drm_sched already...

Yes, that's true. So there's no worries with this hook.

> 
>>
>> I believe only drivers are privy to changes in the credit availability as
>> their firmware and hardware executes new jobs and finishes others, and so
>> this "update" here is essential--leaving it only to prepare_job() wouldn't
>> quite fulfill the vision of why the credit mechanism was introduced by this
>> patch in the first place.
> 
> I kinda agree with you, even if I wouldn't so pessimistic as to how
> useful this patch would be without the ->update_job_credits() hook
> (it already makes the situation a lot better for Nouveau and probably
> other drivers too).

Sure, and that's a good thing.

The heart of the dynamic credit scheme this patch is introducing *is* the
update_job_credits() callback. Without it, it's just about like the current
job flow-control scheme we have, only with varied job weights (credits).
Remember, it is an optional callback, and a driver can choose NOT to define
it--simple. :-)
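
To make this concrete, a hypothetical driver-side sketch--every my_* name
below is made up for illustration:

    /* Re-evaluate how much ring space the job still needs at the moment
     * the scheduler considers it for execution: native fences that have
     * signaled meanwhile shrink the job's footprint.
     */
    static u32 my_update_job_credits(struct drm_sched_job *sched_job)
    {
            struct my_job *job = to_my_job(sched_job);

            return my_job_ring_space(job);
    }

    static const struct drm_sched_backend_ops my_sched_ops = {
            .run_job            = my_run_job,
            .timedout_job       = my_timedout_job,
            .free_job           = my_free_job,
            .update_job_credits = my_update_job_credits,    /* optional */
    };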

So, I'm very excited about this, and see a wide range of applications and
tricks drivers can do with the credit scheme (albeit had it been an "int"
bwha-ha-ha-ha ]:-> ).

Have a good weekend everyone!
-- 
Regards,
Luben




Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

2023-10-27 Thread Luben Tuikov
On 2023-10-27 12:31, Boris Brezillon wrote:
> On Fri, 27 Oct 2023 16:23:24 +0200
> Danilo Krummrich  wrote:
> 
>> On 10/27/23 10:25, Boris Brezillon wrote:
>>> Hi Danilo,
>>>
>>> On Thu, 26 Oct 2023 18:13:00 +0200
>>> Danilo Krummrich  wrote:
>>>   
 Currently, job flow control is implemented simply by limiting the number
 of jobs in flight. Therefore, a scheduler is initialized with a credit
 limit that corresponds to the number of jobs which can be sent to the
 hardware.

 This implies that for each job, drivers need to account for the maximum
 job size possible in order to not overflow the ring buffer.

 However, there are drivers, such as Nouveau, where the job size has a
 rather large range. For such drivers it can easily happen that job
 submissions not even filling the ring by 1% can block subsequent
 submissions, which, in the worst case, can lead to the ring run dry.

 In order to overcome this issue, allow for tracking the actual job size
 instead of the number of jobs. Therefore, add a field to track a job's
 credit count, which represents the number of credits a job contributes
 to the scheduler's credit limit.

 Signed-off-by: Danilo Krummrich 
 ---
 Changes in V2:
 ==
- fixed up influence on scheduling fairness due to consideration of a 
 job's
  size
  - If we reach a ready entity in drm_sched_select_entity() but can't 
 actually
queue a job from it due to size limitations, just give up and go to 
 sleep
until woken up due to a pending job finishing, rather than continue 
 to try
other entities.
- added a callback to dynamically update a job's credits (Boris)  
>>>
>>> This callback seems controversial. I'd suggest dropping it, so the
>>> patch can be merged.  
>>
>> I don't think we should drop it just for the sake of moving forward. If 
>> there are objections
>> we'll discuss them. I've seen good reasons why the drivers you are working 
>> on require this,
>> while, following the discussion, I have *not* seen any reasons to drop it. 
>> It helps simplifying
>> driver code and doesn't introduce any complexity or overhead to existing 
>> drivers.
> 
> Up to you. I'm just saying, moving one step in the right direction is
> better than being stuck, and it's not like adding this callback in a
> follow-up patch is super complicated either. If you're confident that
> we can get all parties to agree on keeping this hook, fine by me.

I'd rather have it in now, as it is really *the vision* of this patch. There's 
no point
in pushing in something half-baked.
-- 
Regards,
Luben



Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

2023-10-27 Thread Luben Tuikov
Hi,

On 2023-10-27 12:26, Boris Brezillon wrote:
> On Fri, 27 Oct 2023 16:34:26 +0200
> Danilo Krummrich  wrote:
> 
>> On 10/27/23 09:17, Boris Brezillon wrote:
>>> Hi Danilo,
>>>
>>> On Thu, 26 Oct 2023 18:13:00 +0200
>>> Danilo Krummrich  wrote:
>>>   
 +
 +  /**
 +   * @update_job_credits: Called once the scheduler is considering this
 +   * job for execution.
 +   *
 +   * Drivers may use this to update the job's submission credits, which is
 +   * useful to e.g. deduct the number of native fences which have been
 +   * signaled meanwhile.
 +   *
 +   * The callback must either return the new number of submission credits
 +   * for the given job, or zero if no update is required.
 +   *
 +   * This callback is optional.
 +   */
 +  u32 (*update_job_credits)(struct drm_sched_job *sched_job);  
>>>
>>> I'm copying my late reply to v2 here so it doesn't get lost:
>>>
>>> I keep thinking it'd be simpler to make this a void function that
>>> updates s_job->submission_credits directly. I also don't see the
>>> problem with doing a sanity check on job->submission_credits. I mean,
>>> if the driver is doing something silly, you can't do much to prevent it
>>> anyway, except warn the user that something wrong has happened. If you
>>> want to
>>>
>>> WARN_ON(job->submission_credits == 0 ||
>>> job->submission_credits > job_old_submission_credits);
>>>
>>> that's fine. But none of this sanity checking has to do with the
>>> function prototype/semantics, and I'm still not comfortable with this 0  
>>> => no-change. If there's no change, we should just leave  
>>> job->submission_credits unchanged (or return job->submission_credits)
>>> instead of inventing a new special case.  
>>
>> If we can avoid letting drivers change fields of generic structures directly
>> without any drawbacks I think we should avoid it. Currently, drivers 
>> shouldn't
>> have the need to mess with job->credits directly. The initial value is set
>> through drm_sched_job_init() and is updated through the return value of
>> update_job_credits().
> 
> Fair enough. I do agree that keeping internal fields out of driver
> hands is a good thing in general, it's just that it's already
> free-for-all in so many places in drm_sched (like the fact drivers

"Free-for-all" doesn't mean we need to follow suit. We should keep
good programming practices, as this patch strives to.

> iterate the pending list in their stop-queue handling) that I didn't
> really see it as an issue. Note that's there's always the option of
> providing drm_sched_job_{update,get}_credits() helpers, with the update
> helper making sure the new credits value is consistent (smaller or
> equal to the old one, and not zero).
> 
>>
>> I'm fine getting rid of the 0 => no-change semantics though. Instead we can 
>> just
>> WARN() on 0.
> 
> Yeah, I think that's preferable. It's pretty easy to return the old
> value if the driver has a way to detect when nothing changed (with a
> get helper if you don't want drivers to touch the credits field).
> 
>> However, if we do that I'd also want to change it for
>> drm_sched_job_init() (where 0 currently defaults to 1) such that we accept 
>> 0, but
>> WARN() accordingly.
> 
> Sure. You update all drivers anyway, so passing 1 instead of 0 is not a
> big deal, I would say.

At this point in time, we should consider 1 as normal and 0 as out of spec:
WARN on it, but carry on and (perhaps) reset it to 1. Drivers in the future
may see a need (i.e. do tricks) to return 0, at which point they'll submit a
patch which does two things: 1) removes the WARN, 2) removes the reset from 0
to 1, and explains why they need to return 0 to allow (one more) job. But
we're nowhere near that yet, so status quo for now.

I don't see how it makes sense to call drm_sched_job_init(credits:0), and I
believe the code is correct to default to 1 in that case--which falls back to
the current flow control we have, which is what we want.
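
In other words--sketching the agreed semantics, not quoting the patch:

    /* drm_sched_job_init(..., u32 credits, ...): 0 is out of spec. */
    if (WARN_ON(credits == 0))
            credits = 1;    /* back to one-credit-per-job flow control */
    job->credits = credits;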

> 
>>
>> I think it's consequent to either consistently give 0 a different meaning or 
>> just
>> accept it but WARN() on it.
> 
> Using default as a default value makes sense when you're passing

I suppose you meant "using zero as a default value".

> zero-initialized objects that are later extended with new fields, but
> here you update the function prototype and all the call sites, so we're
> better off considering 0 as an invalid value, IMHO.

Yes, absolutely.

You never want to give 0 a meaning, since as you pointed out, it is zero-ed
memory, and as such, can have any meaning you'd like. So yes: WARN on 0;
1 is good and normal.

Regards,
Luben




Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

2023-10-27 Thread Luben Tuikov
Hi Danilo,

On 2023-10-27 10:45, Danilo Krummrich wrote:
> Hi Luben,
> 
> On 10/26/23 23:13, Luben Tuikov wrote:
>> On 2023-10-26 12:13, Danilo Krummrich wrote:
>>> Currently, job flow control is implemented simply by limiting the number
>>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>>> limit that corresponds to the number of jobs which can be sent to the
>>> hardware.
>>>
>>> This implies that for each job, drivers need to account for the maximum
>>> job size possible in order to not overflow the ring buffer.
>>>
>>> However, there are drivers, such as Nouveau, where the job size has a
>>> rather large range. For such drivers it can easily happen that job
>>> submissions not even filling the ring by 1% can block subsequent
>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>
>>> In order to overcome this issue, allow for tracking the actual job size
>>> instead of the number of jobs. Therefore, add a field to track a job's
>>> credit count, which represents the number of credits a job contributes
>>> to the scheduler's credit limit.
>>>
>>> Signed-off-by: Danilo Krummrich 
>>> ---
>>> Changes in V2:
>>> ==
>>>- fixed up influence on scheduling fairness due to consideration of a 
>>> job's
>>>  size
>>>  - If we reach a ready entity in drm_sched_select_entity() but can't 
>>> actually
>>>queue a job from it due to size limitations, just give up and go to 
>>> sleep
>>>until woken up due to a pending job finishing, rather than continue 
>>> to try
>>>other entities.
>>>- added a callback to dynamically update a job's credits (Boris)
>>>- renamed 'units' to 'credits'
>>>- fixed commit message and comments as requested by Luben
>>>
>>> Changes in V3:
>>> ==
>>>- rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>>>- move up drm_sched_can_queue() instead of adding a forward declaration
>>>  (Boris)
>>>- add a drm_sched_available_credits() helper (Boris)
>>>- adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's 
>>> proposal
>>>- re-phrase a few comments and fix a typo (Luben)
>>>- change naming of all structures credit fields and function parameters 
>>> to the
>>>  following scheme
>>>  - drm_sched_job::credits
>>>  - drm_gpu_scheduler::credit_limit
>>>  - drm_gpu_scheduler::credit_count
>>>  - drm_sched_init(..., u32 credit_limit, ...)
>>>  - drm_sched_job_init(..., u32 credits, ...)
>>>- add proper documentation for the scheduler's job-flow control mechanism
>>>
>>> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
>>> provides a branch based on drm-misc-next, with the named series and this 
>>> patch
>>> on top of it.
>>>
>>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
>>> ---
>>>   Documentation/gpu/drm-mm.rst  |   6 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>>>   drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>>   drivers/gpu/drm/lima/lima_sched.c |   2 +-
>>>   drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>>>   drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>>>   drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>>>   .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>>   drivers/gpu/drm/scheduler/sched_entity.c  |   4 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c| 142 ++
>>>   drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>>>   include/drm/gpu_scheduler.h   |  33 +++-
>>>   12 files changed, 152 insertions(+), 49 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>>> index 602010cb6894..acc5901ac840 100644
>>> --- a/Documentation/gpu/drm-mm.rst
>>> +++ b/Documentation/gpu/drm-mm.rst
>>> @@ -552,6 +552,12 @@ Overview
>>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>  :doc: Overview
>>>   
>>> +Flow Control
>>> +
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Flow Control

Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

2023-10-27 Thread Luben Tuikov
On 2023-10-27 04:25, Boris Brezillon wrote:
> Hi Danilo,
> 
> On Thu, 26 Oct 2023 18:13:00 +0200
> Danilo Krummrich  wrote:
> 
>> Currently, job flow control is implemented simply by limiting the number
>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>> limit that corresponds to the number of jobs which can be sent to the
>> hardware.
>>
>> This implies that for each job, drivers need to account for the maximum
>> job size possible in order to not overflow the ring buffer.
>>
>> However, there are drivers, such as Nouveau, where the job size has a
>> rather large range. For such drivers it can easily happen that job
>> submissions not even filling the ring by 1% can block subsequent
>> submissions, which, in the worst case, can lead to the ring run dry.
>>
>> In order to overcome this issue, allow for tracking the actual job size
>> instead of the number of jobs. Therefore, add a field to track a job's
>> credit count, which represents the number of credits a job contributes
>> to the scheduler's credit limit.
>>
>> Signed-off-by: Danilo Krummrich 
>> ---
>> Changes in V2:
>> ==
>>   - fixed up influence on scheduling fairness due to consideration of a job's
>> size
>> - If we reach a ready entity in drm_sched_select_entity() but can't 
>> actually
>>   queue a job from it due to size limitations, just give up and go to 
>> sleep
>>   until woken up due to a pending job finishing, rather than continue to 
>> try
>>   other entities.
>>   - added a callback to dynamically update a job's credits (Boris)
> 
> This callback seems controversial. I'd suggest dropping it, so the
> patch can be merged.

Sorry, why is it controversial? (I did read the back-and-forth above, but it
wasn't clear why it is /controversial/.)

I believe only drivers are privy to changes in the credit availability as
their firmware and hardware executes new jobs and finishes others, and so
this "update" here is essential--leaving it only to prepare_job() wouldn't
quite fulfill the vision of why the credit mechanism was introduced by this
patch in the first place.
-- 
Regards,
Luben




Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

2023-10-26 Thread Luben Tuikov
On 2023-10-26 17:13, Luben Tuikov wrote:
> On 2023-10-26 12:13, Danilo Krummrich wrote:
>> Currently, job flow control is implemented simply by limiting the number
>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>> limit that corresponds to the number of jobs which can be sent to the
>> hardware.
>>
>> This implies that for each job, drivers need to account for the maximum
>> job size possible in order to not overflow the ring buffer.
>>
>> However, there are drivers, such as Nouveau, where the job size has a
>> rather large range. For such drivers it can easily happen that job
>> submissions not even filling the ring by 1% can block subsequent
>> submissions, which, in the worst case, can lead to the ring run dry.
>>
>> In order to overcome this issue, allow for tracking the actual job size
>> instead of the number of jobs. Therefore, add a field to track a job's
>> credit count, which represents the number of credits a job contributes
>> to the scheduler's credit limit.
>>
>> Signed-off-by: Danilo Krummrich 
>> ---
>> Changes in V2:
>> ==
>>   - fixed up influence on scheduling fairness due to consideration of a job's
>> size
>> - If we reach a ready entity in drm_sched_select_entity() but can't 
>> actually
>>   queue a job from it due to size limitations, just give up and go to 
>> sleep
>>   until woken up due to a pending job finishing, rather than continue to 
>> try
>>   other entities.
>>   - added a callback to dynamically update a job's credits (Boris)
>>   - renamed 'units' to 'credits'
>>   - fixed commit message and comments as requested by Luben
>>
>> Changes in V3:
>> ==
>>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>>   - move up drm_sched_can_queue() instead of adding a forward declaration
>> (Boris)
>>   - add a drm_sched_available_credits() helper (Boris)
>>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's 
>> proposal
>>   - re-phrase a few comments and fix a typo (Luben)
>>   - change naming of all structures credit fields and function parameters to 
>> the
>> following scheme
>> - drm_sched_job::credits
>> - drm_gpu_scheduler::credit_limit
>> - drm_gpu_scheduler::credit_count
>> - drm_sched_init(..., u32 credit_limit, ...)
>> - drm_sched_job_init(..., u32 credits, ...)
>>   - add proper documentation for the scheduler's job-flow control mechanism
>>
>> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
>> provides a branch based on drm-misc-next, with the named series and this 
>> patch
>> on top of it.
>>
>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
>> ---
>>  Documentation/gpu/drm-mm.rst  |   6 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>  drivers/gpu/drm/scheduler/sched_entity.c  |   4 +-
>>  drivers/gpu/drm/scheduler/sched_main.c| 142 ++
>>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>>  include/drm/gpu_scheduler.h   |  33 +++-
>>  12 files changed, 152 insertions(+), 49 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>> index 602010cb6894..acc5901ac840 100644
>> --- a/Documentation/gpu/drm-mm.rst
>> +++ b/Documentation/gpu/drm-mm.rst
>> @@ -552,6 +552,12 @@ Overview
>>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> :doc: Overview
>>  
>> +Flow Control
>> +
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Flow Control
>> +
>>  Scheduler Function References
>>  -
>>  
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 1f357198533f..62bb7fc7448a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
>> amdgpu_v

Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

2023-10-26 Thread Luben Tuikov
On 2023-10-26 12:13, Danilo Krummrich wrote:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring running dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
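(To make the credit scheme concrete before the diff: a rough sketch of the
accounting, using the credit_limit/credit_count/credits names from the V3
changelog below. Illustrative only -- not the patch's exact code.)

	/* Scheduler side: how much room is left on the ring, in credits. */
	u32 available = sched->credit_limit - atomic_read(&sched->credit_count);

	if (sched_job->credits > available)
		return NULL;	/* job doesn't fit yet; wait for completions */

	/* On pushing the job to the hardware: */
	atomic_add(sched_job->credits, &sched->credit_count);

	/* And on job completion (free_job): */
	atomic_sub(sched_job->credits, &sched->credit_count);
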
> Signed-off-by: Danilo Krummrich 
> ---
> Changes in V2:
> ==
>   - fixed up influence on scheduling fairness due to consideration of a job's
> size
> - If we reach a ready entity in drm_sched_select_entity() but can't 
> actually
>   queue a job from it due to size limitations, just give up and go to 
> sleep
>   until woken up due to a pending job finishing, rather than continue to 
> try
>   other entities.
>   - added a callback to dynamically update a job's credits (Boris)
>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
> (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's 
> proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures credit fields and function parameters to 
> the
> following scheme
> - drm_sched_job::credits
> - drm_gpu_scheduler::credit_limit
> - drm_gpu_scheduler::credit_count
> - drm_sched_init(..., u32 credit_limit, ...)
> - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
> 
> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
> provides a branch based on drm-misc-next, with the named series and this patch
> on top of it.
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>  Documentation/gpu/drm-mm.rst  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>  drivers/gpu/drm/scheduler/sched_entity.c  |   4 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 142 ++
>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>  include/drm/gpu_scheduler.h   |  33 +++-
>  12 files changed, 152 insertions(+), 49 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>  
> +Flow Control
> +
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>  Scheduler Function References
>  -
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 1f357198533f..62bb7fc7448a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
> amdgpu_vm *vm,
>   if (!entity)
>   return 0;
>  
> - return drm_sched_job_init(&(*job)->base, entity, owner);
> + return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>  }
>  
>  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c 
> b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> index 2416c526f9b0..3d0f8d182506 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void 
> *data,
>  
>   ret = drm_sched_job_init(&submit->sched_job,
>&ctx->sched_entity[args->pipe],
> -  submit->ctx);

[PATCH] MAINTAINERS: Update the GPU Scheduler email

2023-10-26 Thread Luben Tuikov
Update the GPU Scheduler maintainer email.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Daniel Vetter 
Cc: Dave Airlie 
Cc: AMD Graphics 
Cc: Direct Rendering Infrastructure - Development 

Signed-off-by: Luben Tuikov 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4452508bc1b040..f13e476ed8038b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7153,7 +7153,7 @@ F:Documentation/devicetree/bindings/display/xlnx/
 F: drivers/gpu/drm/xlnx/
 
 DRM GPU SCHEDULER
-M: Luben Tuikov 
+M: Luben Tuikov 
 L: dri-devel@lists.freedesktop.org
 S: Maintained
 T: git git://anongit.freedesktop.org/drm/drm-misc

base-commit: 56e449603f0ac580700621a356d35d5716a62ce5
-- 
2.42.0



Re: [PATCH] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-26 Thread Luben Tuikov
On 2023-10-26 12:39, Danilo Krummrich wrote:
> On 10/23/23 05:22, Luben Tuikov wrote:
>> The GPU scheduler now has a variable number of run-queues, which are set up 
>> at
>> drm_sched_init() time. This way, each driver announces how many run-queues it
>> requires (supports) per each GPU scheduler it creates. Note, that run-queues
>> correspond to scheduler "priorities", thus if the number of run-queues is set
>> to 1 at drm_sched_init(), then that scheduler supports a single run-queue,
>> i.e. single "priority". If a driver further sets a single entity per
>> run-queue, then this creates a 1-to-1 correspondence between a scheduler and
>> a scheduled entity.
> 
> Generally, I'm fine with this patch and how it replaces / generalizes the 
> single
> entity approach.

Great!

> However, I'm not quite sure how to properly use this. What is a driver 
> supposed to
> do, which previously took advantage of DRM_SCHED_POLICY_SINGLE_ENTITY?
> 
> Is it supposed to call drm_sched_init() with num_rqs=1? If so, what's the 
> correct way

Yes, you call drm_sched_init() with num_rqs set to 1.

> to initialize the drm_sched_entity then? Calling drm_sched_entity_init() with 
> priority=0?

Yes, with priority set to 0.
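For example, a minimal sketch following the drm_sched_init() signature from
this patch -- my_sched_ops, hw_submission and dev stand in for the driver's
own values, and error handling is elided:

	struct drm_gpu_scheduler sched = {};
	struct drm_sched_entity entity = {};
	struct drm_gpu_scheduler *sched_list[] = { &sched };
	int ret;

	/* num_rqs == 1: the scheduler has a single run-queue ("priority" 0). */
	ret = drm_sched_init(&sched, &my_sched_ops,
			     1,				/* num_rqs */
			     hw_submission, 0,		/* hang_limit */
			     msecs_to_jiffies(500),	/* timeout */
			     NULL, NULL,		/* timeout_wq, score */
			     "my-sched", dev);

	/* The single entity then feeds that single run-queue at priority 0. */
	ret = drm_sched_entity_init(&entity, 0 /* == DRM_SCHED_PRIORITY_MIN */,
				    sched_list, 1, NULL);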

One unfortunate fact I noticed when doing this patch is that the numerical
values assigned to enum drm_sched_priority are upside down.
Instead of min being 0, normal:1, high:2, kernel:3, it should've been kernel:0
(highest), high:1, normal:2, low:3, and so on.

The reason for this is absolutely clear: if you had a single priority, it would
be 0--the kernel one, the highest one. This is similar to how lanes in a highway
are counted: you always have lane 1. Similarly for nice(1) and kernel priorities...
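
That is, the enum would then read something like this (illustrative only --
not what the code has today, where min is 0 and kernel is 3):

	enum drm_sched_priority {
		DRM_SCHED_PRIORITY_KERNEL,	/* 0 -- highest, always present */
		DRM_SCHED_PRIORITY_HIGH,	/* 1 */
		DRM_SCHED_PRIORITY_NORMAL,	/* 2 */
		DRM_SCHED_PRIORITY_LOW,		/* 3 */

		DRM_SCHED_PRIORITY_COUNT
	};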

> Any other priority consequently faults in drm_sched_job_arm().

drm_sched_job_arm() faults on !ENTITY, but the "priority" is just
assigned to s_priority:
job->s_priority = entity->priority;


> While I might sound like a broken record (sorry for that), I really think 
> everything
> related to Matt's series needs documentation, as in:

Yes, I agree.
 
> - What is the typical application of the single entity / variable run queue 
> design?
>How do drivers make use of it?

I believe most drivers in the future would want to have a single sched_rq
(i.e. a single "priority"), then attach a single entity to it, and do
prioritization internally in their firmware/hardware.

> - How to tear down a scheduler instance properly?
> - Motivation and implications of the workqueue topology (default workqueue, 
> external
>driver workqueue, free job work, run job work, etc.).
> 
> But also the existing scheduler infrastructure requires more documentation.
> 
> The scheduler barely has some documentation to describe its basic topology of
> struct drm_gpu_scheduler, struct drm_sched_entity and struct drm_sched_job.
> Plus a few hints regarding run queue priorities, which, with this patch, seem 
> to be
> (partially) out-dated or at least incomplete.
> 
> I think Sima also mentioned that we really need to put some efforts here. [1]

Yes, that's true.

Regards,
Luben

> 
> - Danilo
> 
> [1] 
> https://lore.kernel.org/all/20231017150958.838613-1-matthew.br...@intel.com/T/#m330335b44bdb7ae93ac6ebdedd65706df5a0f03d
> 
>>
>> Cc: Lucas Stach 
>> Cc: Russell King 
>> Cc: Qiang Yu 
>> Cc: Rob Clark 
>> Cc: Abhinav Kumar 
>> Cc: Dmitry Baryshkov 
>> Cc: Danilo Krummrich 
>> Cc: Matthew Brost 
>> Cc: Boris Brezillon 
>> Cc: Alex Deucher 
>> Cc: Christian König 
>> Cc: Emma Anholt 
>> Cc: etna...@lists.freedesktop.org
>> Cc: l...@lists.freedesktop.org
>> Cc: linux-arm-...@vger.kernel.org
>> Cc: freedr...@lists.freedesktop.org
>> Cc: nouv...@lists.freedesktop.org
>> Cc: dri-devel@lists.freedesktop.org
>> Signed-off-by: Luben Tuikov 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +-
>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c|  1 +
>>   drivers/gpu/drm/lima/lima_sched.c  |  4 +-
>>   drivers/gpu/drm/msm/msm_ringbuffer.c   |  5 +-
>>   drivers/gpu/drm/nouveau/nouveau_sched.c|  1 +
>>   drivers/gpu/drm/panfrost/panfrost_job.c|  1 +
>>   drivers/gpu/drm/scheduler/sched_entity.c   | 18 +-
>>   drivers/gpu/drm/scheduler/sched_main.c | 74 ++
>>   drivers/gpu/drm/v3d/v3d_sched.c|  5 ++
>>   include/drm/gpu_scheduler.h|  9 ++-
>>   11 files changed, 98 insertions(+), 25 deletions(-)
>>
>> d

Re: [PATCH] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-26 Thread Luben Tuikov
Hi,

I've pushed this commit as I got a verbal Acked-by from Christian in our kernel 
meeting this morning.

Matt, please rebase your patches to drm-misc-next.

Regards,
Luben

On 2023-10-26 11:20, Luben Tuikov wrote:
> Ping!
> 
> On 2023-10-22 23:22, Luben Tuikov wrote:
>> The GPU scheduler now has a variable number of run-queues, which are set up 
>> at
>> drm_sched_init() time. This way, each driver announces how many run-queues it
>> requires (supports) per each GPU scheduler it creates. Note, that run-queues
>> correspond to scheduler "priorities", thus if the number of run-queues is set
>> to 1 at drm_sched_init(), then that scheduler supports a single run-queue,
>> i.e. single "priority". If a driver further sets a single entity per
>> run-queue, then this creates a 1-to-1 correspondence between a scheduler and
>> a scheduled entity.
>>
>> Cc: Lucas Stach 
>> Cc: Russell King 
>> Cc: Qiang Yu 
>> Cc: Rob Clark 
>> Cc: Abhinav Kumar 
>> Cc: Dmitry Baryshkov 
>> Cc: Danilo Krummrich 
>> Cc: Matthew Brost 
>> Cc: Boris Brezillon 
>> Cc: Alex Deucher 
>> Cc: Christian König 
>> Cc: Emma Anholt 
>> Cc: etna...@lists.freedesktop.org
>> Cc: l...@lists.freedesktop.org
>> Cc: linux-arm-...@vger.kernel.org
>> Cc: freedr...@lists.freedesktop.org
>> Cc: nouv...@lists.freedesktop.org
>> Cc: dri-devel@lists.freedesktop.org
>> Signed-off-by: Luben Tuikov 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +-
>>  drivers/gpu/drm/etnaviv/etnaviv_sched.c|  1 +
>>  drivers/gpu/drm/lima/lima_sched.c  |  4 +-
>>  drivers/gpu/drm/msm/msm_ringbuffer.c   |  5 +-
>>  drivers/gpu/drm/nouveau/nouveau_sched.c|  1 +
>>  drivers/gpu/drm/panfrost/panfrost_job.c|  1 +
>>  drivers/gpu/drm/scheduler/sched_entity.c   | 18 +-
>>  drivers/gpu/drm/scheduler/sched_main.c | 74 ++
>>  drivers/gpu/drm/v3d/v3d_sched.c|  5 ++
>>  include/drm/gpu_scheduler.h|  9 ++-
>>  11 files changed, 98 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 2b8356699f235d..251995a90bbe69 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2280,6 +2280,7 @@ static int amdgpu_device_init_schedulers(struct 
>> amdgpu_device *adev)
>>  }
>>  
>> r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>> +   DRM_SCHED_PRIORITY_COUNT,
>> ring->num_hw_submission, 0,
>> timeout, adev->reset_domain->wq,
>> ring->sched_score, ring->name,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 78476bc75b4e1d..1f357198533f3e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -325,8 +325,8 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
>> drm_gpu_scheduler *sched)
>>  int i;
>>  
>>  /* Signal all jobs not yet scheduled */
>> -for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; 
>> i--) {
>> -struct drm_sched_rq *rq = &sched->sched_rq[i];
>> +for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>> +struct drm_sched_rq *rq = sched->sched_rq[i];
>>  spin_lock(&rq->lock);
>>  list_for_each_entry(s_entity, &rq->entities, list) {
>>  while ((s_job = 
>> to_drm_sched_job(spsc_queue_pop(&s_entity->job_queue)))) {
>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
>> b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> index 345fec6cb1a4c1..9b79f218e21afc 100644
>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> @@ -135,6 +135,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>>  int ret;
>>  
>>  ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
>> + DRM_SCHED_PRIORITY_COUNT,
>>   etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>>   msecs_to_jiffies(500), NULL, NULL,
>>   dev_name(gpu->dev), gpu->dev);
>> diff --git a/drivers/gpu/drm/lima/lima_sched.c 
>> b/drivers/gpu/

Re: [PATCH] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-26 Thread Luben Tuikov
Ping!

On 2023-10-22 23:22, Luben Tuikov wrote:
> The GPU scheduler now has a variable number of run-queues, which are set up at
> drm_sched_init() time. This way, each driver announces how many run-queues it
> requires (supports) per each GPU scheduler it creates. Note, that run-queues
> correspond to scheduler "priorities", thus if the number of run-queues is set
> to 1 at drm_sched_init(), then that scheduler supports a single run-queue,
> i.e. single "priority". If a driver further sets a single entity per
> run-queue, then this creates a 1-to-1 correspondence between a scheduler and
> a scheduled entity.
> 
> Cc: Lucas Stach 
> Cc: Russell King 
> Cc: Qiang Yu 
> Cc: Rob Clark 
> Cc: Abhinav Kumar 
> Cc: Dmitry Baryshkov 
> Cc: Danilo Krummrich 
> Cc: Matthew Brost 
> Cc: Boris Brezillon 
> Cc: Alex Deucher 
> Cc: Christian König 
> Cc: Emma Anholt 
> Cc: etna...@lists.freedesktop.org
> Cc: l...@lists.freedesktop.org
> Cc: linux-arm-...@vger.kernel.org
> Cc: freedr...@lists.freedesktop.org
> Cc: nouv...@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org
> Signed-off-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c|  1 +
>  drivers/gpu/drm/lima/lima_sched.c  |  4 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c   |  5 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c|  1 +
>  drivers/gpu/drm/panfrost/panfrost_job.c|  1 +
>  drivers/gpu/drm/scheduler/sched_entity.c   | 18 +-
>  drivers/gpu/drm/scheduler/sched_main.c | 74 ++
>  drivers/gpu/drm/v3d/v3d_sched.c|  5 ++
>  include/drm/gpu_scheduler.h|  9 ++-
>  11 files changed, 98 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2b8356699f235d..251995a90bbe69 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2280,6 +2280,7 @@ static int amdgpu_device_init_schedulers(struct 
> amdgpu_device *adev)
>   }
>  
>   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
> +DRM_SCHED_PRIORITY_COUNT,
>  ring->num_hw_submission, 0,
>  timeout, adev->reset_domain->wq,
>  ring->sched_score, ring->name,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 78476bc75b4e1d..1f357198533f3e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -325,8 +325,8 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
> drm_gpu_scheduler *sched)
>   int i;
>  
>   /* Signal all jobs not yet scheduled */
> - for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; 
> i--) {
> - struct drm_sched_rq *rq = &sched->sched_rq[i];
> + for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> + struct drm_sched_rq *rq = sched->sched_rq[i];
>   spin_lock(&rq->lock);
>   list_for_each_entry(s_entity, &rq->entities, list) {
>   while ((s_job = 
> to_drm_sched_job(spsc_queue_pop(&s_entity->job_queue)))) {
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
> b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index 345fec6cb1a4c1..9b79f218e21afc 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -135,6 +135,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>   int ret;
>  
> ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
> +  DRM_SCHED_PRIORITY_COUNT,
>etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>msecs_to_jiffies(500), NULL, NULL,
>dev_name(gpu->dev), gpu->dev);
> diff --git a/drivers/gpu/drm/lima/lima_sched.c 
> b/drivers/gpu/drm/lima/lima_sched.c
> index ffd91a5ee29901..295f0353a02e58 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -488,7 +488,9 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, 
> const char *name)
>  
>   INIT_WORK(>recover_work, lima_sched_recover_work);
>  
> - return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
> + return drm_sched_init(&pipe->base, &lima_sched_ops,
> +   DRM_SCHED_PRIORITY_COUNT,
> +   1,
> l

Re: [PATCH v7 3/6] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-26 Thread Luben Tuikov
Also note that there were no complaints from the "kernel test robot"
when I posted my patch (this patch), but there is now, which further shows
that there are unwarranted changes. Just follow the steps I outlined below,
and we should all be good.

Thanks!

Regards,
Luben

On 2023-10-26 05:36, Luben Tuikov wrote:
> Hi,
> 
> On 2023-10-26 02:33, kernel test robot wrote:
>> Hi Matthew,
>>
>> kernel test robot noticed the following build warnings:
>>
>> [auto build test WARNING on 201c8a7bd1f3f415920a2df4b8a8817e973f42fe]
>>
>> url:
>> https://github.com/intel-lab-lkp/linux/commits/Matthew-Brost/drm-sched-Add-drm_sched_wqueue_-helpers/20231026-121313
>> base:   201c8a7bd1f3f415920a2df4b8a8817e973f42fe
>> patch link:
>> https://lore.kernel.org/r/20231026041236.1273694-4-matthew.brost%40intel.com
>> patch subject: [PATCH v7 3/6] drm/sched: Convert the GPU scheduler to 
>> variable number of run-queues
>> config: m68k-allyesconfig 
>> (https://download.01.org/0day-ci/archive/20231026/202310261439.3rbateob-...@intel.com/config)
>> compiler: m68k-linux-gcc (GCC) 13.2.0
>> reproduce (this is a W=1 build): 
>> (https://download.01.org/0day-ci/archive/20231026/202310261439.3rbateob-...@intel.com/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version 
>> of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot 
>> | Closes: 
>> https://lore.kernel.org/oe-kbuild-all/202310261439.3rbateob-...@intel.com/
>>
>> All warnings (new ones prefixed by >>):
>>
>>drivers/gpu/drm/etnaviv/etnaviv_sched.c: In function 'etnaviv_sched_init':
>>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c:138:30: warning: passing argument 
>>>> 3 of 'drm_sched_init' makes pointer from integer without a cast 
>>>> [-Wint-conversion]
>>  138 |  DRM_SCHED_PRIORITY_COUNT, NULL,
>>  |  ^~~~
>>  |  |
>>  |  int
>>In file included from drivers/gpu/drm/etnaviv/etnaviv_drv.h:20,
>> from drivers/gpu/drm/etnaviv/etnaviv_sched.c:8:
>>include/drm/gpu_scheduler.h:530:45: note: expected 'struct 
>> workqueue_struct *' but argument is of type 'int'
>>  530 |struct workqueue_struct *submit_wq,
>>  |~^
>>In file included from include/uapi/linux/posix_types.h:5,
>> from include/uapi/linux/types.h:14,
>> from include/linux/types.h:6,
>> from include/linux/kasan-checks.h:5,
>> from include/asm-generic/rwonce.h:26,
>> from ./arch/m68k/include/generated/asm/rwonce.h:1,
>> from include/linux/compiler.h:246,
>> from include/linux/build_bug.h:5,
>> from include/linux/init.h:5,
>> from include/linux/moduleparam.h:5,
>> from drivers/gpu/drm/etnaviv/etnaviv_sched.c:6:
> 
> The reason for this compilation failure is that this patch is completely 
> mangled and nothing like the patch I posted.
> 
> My patch is: 
> https://lore.kernel.org/all/20231023032251.164775-1-luben.tui...@amd.com/
> 
> Save it raw to your disk from this link: 
> https://lore.kernel.org/all/20231023032251.164775-1-luben.tui...@amd.com/raw
> 
And apply it with "git am <file>" on top of your clean tree, e.g. 
> drm-misc-next. THEN, after that,
> apply your patches.
> 
> It should then compile without any problems.
> 
> Just looking at the first hunk in my patch:
> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 2b8356699f235d..251995a90bbe69 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2280,6 +2280,7 @@ static int amdgpu_device_init_schedulers(struct 
>> amdgpu_device *adev)
>> }
>>  
>> r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>> +  DRM_SCHED_PRIORITY_COUNT,
>>ring->num_hw_submission, 0,
>>timeout, adev->reset_domain->wq,
>>ring->sched_score, ring->name,
> 
> While this looks like this in the version you posted of my patch:
> 
&

Re: [PATCH v7 3/6] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-26 Thread Luben Tuikov
Hi,

On 2023-10-26 02:33, kernel test robot wrote:
> Hi Matthew,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on 201c8a7bd1f3f415920a2df4b8a8817e973f42fe]
> 
> url:
> https://github.com/intel-lab-lkp/linux/commits/Matthew-Brost/drm-sched-Add-drm_sched_wqueue_-helpers/20231026-121313
> base:   201c8a7bd1f3f415920a2df4b8a8817e973f42fe
> patch link:
> https://lore.kernel.org/r/20231026041236.1273694-4-matthew.brost%40intel.com
> patch subject: [PATCH v7 3/6] drm/sched: Convert the GPU scheduler to 
> variable number of run-queues
> config: m68k-allyesconfig 
> (https://download.01.org/0day-ci/archive/20231026/202310261439.3rbateob-...@intel.com/config)
> compiler: m68k-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): 
> (https://download.01.org/0day-ci/archive/20231026/202310261439.3rbateob-...@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202310261439.3rbateob-...@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
>drivers/gpu/drm/etnaviv/etnaviv_sched.c: In function 'etnaviv_sched_init':
>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c:138:30: warning: passing argument 3 
>>> of 'drm_sched_init' makes pointer from integer without a cast 
>>> [-Wint-conversion]
>  138 |  DRM_SCHED_PRIORITY_COUNT, NULL,
>  |  ^~~~
>  |  |
>  |  int
>In file included from drivers/gpu/drm/etnaviv/etnaviv_drv.h:20,
> from drivers/gpu/drm/etnaviv/etnaviv_sched.c:8:
>include/drm/gpu_scheduler.h:530:45: note: expected 'struct 
> workqueue_struct *' but argument is of type 'int'
>  530 |struct workqueue_struct *submit_wq,
>  |~^
>In file included from include/uapi/linux/posix_types.h:5,
> from include/uapi/linux/types.h:14,
> from include/linux/types.h:6,
> from include/linux/kasan-checks.h:5,
> from include/asm-generic/rwonce.h:26,
> from ./arch/m68k/include/generated/asm/rwonce.h:1,
> from include/linux/compiler.h:246,
> from include/linux/build_bug.h:5,
> from include/linux/init.h:5,
> from include/linux/moduleparam.h:5,
> from drivers/gpu/drm/etnaviv/etnaviv_sched.c:6:

The reason for this compilation failure is that this patch is completely 
mangled and nothing like the patch I posted.

My patch is: 
https://lore.kernel.org/all/20231023032251.164775-1-luben.tui...@amd.com/

Save it raw to your disk from this link: 
https://lore.kernel.org/all/20231023032251.164775-1-luben.tui...@amd.com/raw

And apply it with "git am <file>" on top of your clean tree, e.g. 
drm-misc-next. THEN, after that,
apply your patches.

It should then compile without any problems.
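
For example (the lore URL is the one given above, left obfuscated as in the
archive; the patch file name is just a local scratch name):

	$ git checkout drm-misc-next
	$ curl -o sched-rqs.patch \
	    'https://lore.kernel.org/all/20231023032251.164775-1-luben.tui...@amd.com/raw'
	$ git am sched-rqs.patch
	$ git am your-series/*.patch	# then your own patches on top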

Just looking at the first hunk in my patch:

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2b8356699f235d..251995a90bbe69 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2280,6 +2280,7 @@ static int amdgpu_device_init_schedulers(struct 
> amdgpu_device *adev)
> }
>  
> r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
> +  DRM_SCHED_PRIORITY_COUNT,
>ring->num_hw_submission, 0,
>timeout, adev->reset_domain->wq,
>ring->sched_score, ring->name,

While this looks like this in the version you posted of my patch:

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b54c4d771104..94d073bfbd13 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2280,6 +2280,7 @@ static int amdgpu_device_init_schedulers(struct 
> amdgpu_device *adev)
>   }
>  
> r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, NULL,
> +DRM_SCHED_PRIORITY_COUNT,
>  ring->num_hw_submission, 0,
>  timeout, adev->reset_domain->wq,
>  ring->sched_score, ring->name,

What's that "NULL" doing as the 3rd argument???

And the rest is similarly mangled as well.

Please apply my patch AS IS, no local changes, and then apply your patches on 
top. That should ensure compilation is correct for all,
and a more precise review can be had.

FWIW, we 

Re: [PATCH v7 4/6] drm/sched: Split free_job into own work item

2023-10-25 Thread Luben Tuikov
On 2023-10-26 00:12, Matthew Brost wrote:
> Rather than calling free_job and run_job in the same work item, have a dedicated
> work item for each. This aligns with the design and intended use of work
> queues.
> 
> v2:
>- Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting
>  timestamp in free_job() work item (Danilo)
> v3:
>   - Drop forward dec of drm_sched_select_entity (Boris)
>   - Return in drm_sched_run_job_work if entity NULL (Boris)
> v4:
>   - Replace dequeue with peek and invert logic (Luben)
>   - Wrap to 100 lines (Luben)
>   - Update comments for *_queue / *_queue_if_ready functions (Luben)
> v5:
>   - Drop peek argument, blindly reinit idle (Luben)
>   - s/drm_sched_free_job_queue_if_ready/drm_sched_free_job_queue_if_done 
> (Luben)
>   - Update work_run_job & work_free_job kernel doc (Luben)
> 
> Signed-off-by: Matthew Brost 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 241 +++--
>  include/drm/gpu_scheduler.h|   8 +-
>  2 files changed, 151 insertions(+), 98 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index b22157e920d4..3d89420d4ffb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -257,12 +257,89 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>  
>  /**
> - * drm_sched_run_job_queue - enqueue scheduler submission
> + * drm_sched_run_job_queue - enqueue run-job work
>   * @sched: scheduler instance
>   */

No, please, see you're introducing this in patch 2, and it is there where
you should do all changes pertaining to that function.

Basically, you want to normalize your patches--don't spread changes to the
same new thing you're introducing across many patches--keep them in one
single patch and in no other patch.

This makes it easy to review, easy to see what is being changed, in one
place, and not in many places, especially when it is part of the same patchset.

And for the rest of the functions.

Thanks!

Regards,
Luben

>  static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>  {
>   if (!READ_ONCE(sched->pause_submit))
> - queue_work(sched->submit_wq, &sched->work_submit);
> + queue_work(sched->submit_wq, &sched->work_run_job);
> +}
> +
> +/**
> + * drm_sched_can_queue -- Can we queue more to the hardware?
> + * @sched: scheduler instance
> + *
> + * Return true if we can push more jobs to the hw, otherwise false.
> + */
> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> +{
> + return atomic_read(&sched->hw_rq_count) <
> + sched->hw_submission_limit;
> +}
> +
> +/**
> + * drm_sched_select_entity - Select next entity to process
> + *
> + * @sched: scheduler instance
> + *
> + * Returns the entity to process or NULL if none are found.
> + */
> +static struct drm_sched_entity *
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> +{
> + struct drm_sched_entity *entity;
> + int i;
> +
> + if (!drm_sched_can_queue(sched))
> + return NULL;
> +
> + /* Kernel run queue has higher priority than normal run queue*/
> + for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> + entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> + drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
> + drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
> + if (entity)
> + break;
> + }
> +
> + return entity;
> +}
> +
> +/**
> + * drm_sched_run_job_queue_if_ready - enqueue run-job work if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> + if (drm_sched_select_entity(sched))
> + drm_sched_run_job_queue(sched);
> +}
> +
> +/**
> + * drm_sched_free_job_queue - enqueue free-job work
> + * @sched: scheduler instance
> + */
> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> +{
> + if (!READ_ONCE(sched->pause_submit))
> + queue_work(sched->submit_wq, &sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_free_job_queue_if_done - enqueue free-job work if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_free_job_queue_if_done(struct drm_gpu_scheduler *sched)
> +{
> + struct drm_sched_job *job;
> +
> + spin_lock(&sched->job_list_lock);
> + job = list_first_entry_or_null(&sched->pending_list,
> +struct drm_sched_job, list);
> + if (job && dma_fence_is_signaled(&job->s_fence->finished))
> + drm_sched_free_job_queue(sched);
> + spin_unlock(&sched->job_list_lock);
>  }
>  
>  /**
> @@ -284,7 +361,7 @@ static void drm_sched_job_done(struct drm_sched_job 
> *s_job, int result)
>   dma_fence_get(&s_fence->finished);
>   drm_sched_fence_finished(s_fence, result);
>   dma_fence_put(&s_fence->finished);
> - drm_sched_run_job_queue(sched);
> + 

Re: [PATCH v7 6/6] drm/sched: Add a helper to queue TDR immediately

2023-10-25 Thread Luben Tuikov
On 2023-10-26 00:12, Matthew Brost wrote:
> Add a helper whereby a driver can invoke TDR immediately.
> 
> v2:
>  - Drop timeout args, rename function, use mod delayed work (Luben)
> v3:
>  - s/XE/Xe (Luben)
>  - present tense in commit message (Luben)
>  - Adjust comment for drm_sched_tdr_queue_imm (Luben)
> v4:
>  - Adjust commit message (Luben)
> 
> Cc: Luben Tuikov 
> Signed-off-by: Matthew Brost 
> Reviewed-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 18 +-
>  include/drm/gpu_scheduler.h|  1 +
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index ae66cabc3162..246213963928 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -389,7 +389,7 @@ static void drm_sched_start_timeout(struct 
> drm_gpu_scheduler *sched)
>  
>   if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>   !list_empty(>pending_list))
> - queue_delayed_work(sched->timeout_wq, &sched->work_tdr, 
> sched->timeout);
> + mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 
> sched->timeout);
>  }
>  
>  static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
> @@ -399,6 +399,22 @@ static void drm_sched_start_timeout_unlocked(struct 
> drm_gpu_scheduler *sched)
>   spin_unlock(&sched->job_list_lock);
>  }
>  
> +/**
> + * drm_sched_tdr_queue_imm: - immediately start job timeout handler

No need for a colon char (:) after the name.

Regards,
Luben

> + *
> + * @sched: scheduler for which the timeout handling should be started.
> + *
> + * Start timeout handling immediately for the named scheduler.
> + */
> +void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched)
> +{
> + spin_lock(&sched->job_list_lock);
> + sched->timeout = 0;
> + drm_sched_start_timeout(sched);
> + spin_unlock(&sched->job_list_lock);
> +}
> +EXPORT_SYMBOL(drm_sched_tdr_queue_imm);
> +
>  /**
>   * drm_sched_fault - immediately start timeout handler
>   *
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 37749f561866..e5a6166eb152 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -557,6 +557,7 @@ void drm_sched_entity_modify_sched(struct 
> drm_sched_entity *entity,
>   struct drm_gpu_scheduler **sched_list,
> unsigned int num_sched_list);
>  
> +void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>  void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>  bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
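
As a usage sketch (hypothetical driver code, not part of this series): a
driver notified of a hang by its firmware could kick the TDR at once rather
than waiting out sched->timeout:

	struct my_gpu {
		struct drm_gpu_scheduler sched;
		/* ... */
	};

	static void my_gpu_hang_notify(struct my_gpu *gpu)
	{
		/* Zeroes sched->timeout and mod_delayed_work()s the TDR. */
		drm_sched_tdr_queue_imm(&gpu->sched);
	}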



Re: [PATCH v7 4/6] drm/sched: Split free_job into own work item

2023-10-25 Thread Luben Tuikov
On 2023-10-26 00:12, Matthew Brost wrote:
> Rather than calling free_job and run_job in the same work item, have a dedicated
> work item for each. This aligns with the design and intended use of work
> queues.
> 
> v2:
>- Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting
>  timestamp in free_job() work item (Danilo)
> v3:
>   - Drop forward dec of drm_sched_select_entity (Boris)
>   - Return in drm_sched_run_job_work if entity NULL (Boris)
> v4:
>   - Replace dequeue with peek and invert logic (Luben)
>   - Wrap to 100 lines (Luben)
>   - Update comments for *_queue / *_queue_if_ready functions (Luben)
> v5:
>   - Drop peek argument, blindly reinit idle (Luben)
>   - s/drm_sched_free_job_queue_if_ready/drm_sched_free_job_queue_if_done 
> (Luben)
>   - Update work_run_job & work_free_job kernel doc (Luben)
> 
> Signed-off-by: Matthew Brost 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 241 +++--
>  include/drm/gpu_scheduler.h|   8 +-
>  2 files changed, 151 insertions(+), 98 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index b22157e920d4..3d89420d4ffb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -257,12 +257,89 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>  
>  /**
> - * drm_sched_run_job_queue - enqueue scheduler submission
> + * drm_sched_run_job_queue - enqueue run-job work
>   * @sched: scheduler instance
>   */
>  static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>  {
>   if (!READ_ONCE(sched->pause_submit))
> - queue_work(sched->submit_wq, &sched->work_submit);
> + queue_work(sched->submit_wq, &sched->work_run_job);
> +}
> +
> +/**
> + * drm_sched_can_queue -- Can we queue more to the hardware?
> + * @sched: scheduler instance
> + *
> + * Return true if we can push more jobs to the hw, otherwise false.
> + */
> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> +{
> + return atomic_read(&sched->hw_rq_count) <
> + sched->hw_submission_limit;
> +}
> +
> +/**
> + * drm_sched_select_entity - Select next entity to process
> + *
> + * @sched: scheduler instance
> + *
> + * Returns the entity to process or NULL if none are found.
> + */
> +static struct drm_sched_entity *
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> +{
> + struct drm_sched_entity *entity;
> + int i;
> +
> + if (!drm_sched_can_queue(sched))
> + return NULL;
> +
> + /* Kernel run queue has higher priority than normal run queue*/
> + for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> + entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> + drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
> + drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
> + if (entity)
> + break;
> + }
> +
> + return entity;
> +}
> +

Could you please rewrite the patch so as to not do the unnecessary move
of drm_sched_select_entity(). This function has no changes in it,
and moving it a few lines up only creates noise when doing "git blame"
and similar data-mining.

So, just rewrite the patch without moving unchanged code up/down, so
that only the essential differences are shown by git-diff(1). This makes
it easy to review, easy to do data-mining down the road, and so on.

The rest of the patch looks good.

Thanks!

Regards,
Luben

> +/**
> + * drm_sched_run_job_queue_if_ready - enqueue run-job work if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> + if (drm_sched_select_entity(sched))
> + drm_sched_run_job_queue(sched);
> +}
> +
> +/**
> + * drm_sched_free_job_queue - enqueue free-job work
> + * @sched: scheduler instance
> + */
> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> +{
> + if (!READ_ONCE(sched->pause_submit))
> + queue_work(sched->submit_wq, &sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_free_job_queue_if_done - enqueue free-job work if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_free_job_queue_if_done(struct drm_gpu_scheduler *sched)
> +{
> + struct drm_sched_job *job;
> +
> + spin_lock(&sched->job_list_lock);
> + job = list_first_entry_or_null(&sched->pending_list,
> +struct drm_sched_job, list);
> + if (job && dma_fence_is_signaled(&job->s_fence->finished))
> + drm_sched_free_job_queue(sched);
> + spin_unlock(&sched->job_list_lock);
>  }
>  
>  /**
> @@ -284,7 +361,7 @@ static void drm_sched_job_done(struct drm_sched_job 
> *s_job, int result)
>   dma_fence_get(&s_fence->finished);
>   drm_sched_fence_finished(s_fence, result);
>   dma_fence_put(&s_fence->finished);
> - drm_sched_run_job_queue(sched);
> + 

Re: [PATCH v7 3/6] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-25 Thread Luben Tuikov
On 2023-10-26 00:12, Matthew Brost wrote:
> From: Luben Tuikov 
> 
> The GPU scheduler now has a variable number of run-queues, which are set up at
> drm_sched_init() time. This way, each driver announces how many run-queues it
> requires (supports) per each GPU scheduler it creates. Note, that run-queues
> correspond to scheduler "priorities", thus if the number of run-queues is set
> to 1 at drm_sched_init(), then that scheduler supports a single run-queue,
> i.e. single "priority". If a driver further sets a single entity per
> run-queue, then this creates a 1-to-1 correspondence between a scheduler and
> a scheduled entity.
> 
> Cc: Lucas Stach 
> Cc: Russell King 
> Cc: Qiang Yu 
> Cc: Rob Clark 
> Cc: Abhinav Kumar 
> Cc: Dmitry Baryshkov 
> Cc: Danilo Krummrich 
> Cc: Matthew Brost 
> Cc: Boris Brezillon 
> Cc: Alex Deucher 
> Cc: Christian König 
> Cc: Emma Anholt 
> Cc: etna...@lists.freedesktop.org
> Cc: l...@lists.freedesktop.org
> Cc: linux-arm-...@vger.kernel.org
> Cc: freedr...@lists.freedesktop.org
> Cc: nouv...@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org
> Signed-off-by: Luben Tuikov 
> Signed-off-by: Matthew Brost 

Normally, you'd add your R-B.

You should add your S-O-B tag if you've co-authored/contributed to the commit
or are the actual committer to drm-*, as "dim" requires it--but dim will tell
you, so generally I don't do that unless the tool tells me to. :-)

So here, feel free to R-B the patch instead of S-O-B, on a patch post.

Regards,
Luben

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c|  3 +-
>  drivers/gpu/drm/lima/lima_sched.c  |  3 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c   |  3 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c|  1 +
>  drivers/gpu/drm/panfrost/panfrost_job.c|  3 +-
>  drivers/gpu/drm/scheduler/sched_entity.c   | 18 +-
>  drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>  drivers/gpu/drm/v3d/v3d_sched.c| 14 ++--
>  include/drm/gpu_scheduler.h|  9 ++-
>  11 files changed, 102 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b54c4d771104..94d073bfbd13 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2280,6 +2280,7 @@ static int amdgpu_device_init_schedulers(struct 
> amdgpu_device *adev)
>   }
>  
>   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, NULL,
> +DRM_SCHED_PRIORITY_COUNT,
>  ring->num_hw_submission, 0,
>  timeout, adev->reset_domain->wq,
>  ring->sched_score, ring->name,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 78476bc75b4e..1f357198533f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -325,8 +325,8 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
> drm_gpu_scheduler *sched)
>   int i;
>  
>   /* Signal all jobs not yet scheduled */
> - for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; 
> i--) {
> - struct drm_sched_rq *rq = &sched->sched_rq[i];
> + for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> + struct drm_sched_rq *rq = sched->sched_rq[i];
>   spin_lock(&rq->lock);
>   list_for_each_entry(s_entity, &rq->entities, list) {
>   while ((s_job = 
> to_drm_sched_job(spsc_queue_pop(&s_entity->job_queue)))) {
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
> b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index 618a804ddc34..396334984e4d 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -134,7 +134,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>  {
>   int ret;
>  
> - ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
> + ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
> +  DRM_SCHED_PRIORITY_COUNT, NULL,
>etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>msecs_to_jiffies(500), NULL, NULL,
>dev_name(gpu->dev), gpu->dev);
> diff --git a/drivers/gpu/drm/lima/lima_sched.c 
> b/drivers/gpu/drm/lima/lima_sched.c
> index 8d858aed0e56..23

Re: [PATCH v7 0/6] DRM scheduler changes for Xe

2023-10-25 Thread Luben Tuikov
Hi,

On 2023-10-26 00:12, Matthew Brost wrote:
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first.
> 
> This a continuation of a RFC [3] with all comments addressed, ready for
> a full review, and hopefully in state which can merged in the near
> future. More details of this series can found in the cover letter of the
> RFC [3].
> 
> These changes have been tested with the Xe driver. Based on drm-tip branch.
> 
> A follow-up series will be posted to address some of dakr's requests for
> kernel doc changes.
> 
> v2:
>  - Break run job, free job, and process message in own work items
>  - This might break other drivers as run job and free job now can run in
>parallel, can fix up if needed
> 
> v3:
>  - Include missing patch 'drm/sched: Add drm_sched_submit_* helpers'
>  - Fix issue with setting timestamp too early
>  - Don't dequeue jobs for single entity after calling entity fini
>  - Flush pending jobs on entity fini
>  - Add documentation for entity teardown
>  - Add Matthew Brost to maintainers of DRM scheduler
> 
> v4:
>  - Drop message interface
>  - Drop 'Flush pending jobs on entity fini'
>  - Drop 'Add documentation for entity teardown'
>  - Address all feedback
> 
> v5:
>  - Address Luben's feedback
>  - Drop starting TDR after calling run_job()
>  - Drop adding Matthew Brost to maintainers of DRM scheduler
> 
> v6:
>  - Address Luben's feedback
>  - Include base commit
> 
> v7:
>  - Drop SINGLE_ENTITY mode rather pull in Luben's patch for dynamic run queues
>  - Address Luben's feedback for free_job work item patch
> 
> Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/116055/
> 
> Luben Tuikov (1):
>   drm/sched: Convert the GPU scheduler to variable number of run-queues
> 
> Matthew Brost (5):
>   drm/sched: Add drm_sched_wqueue_* helpers
>   drm/sched: Convert drm scheduler to use a work queue rather than
> kthread
>   drm/sched: Split free_job into own work item
>   drm/sched: Add drm_sched_start_timeout_unlocked helper
>   drm/sched: Add a helper to queue TDR immediately
> 
>  .../drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c   |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  15 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  15 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   4 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c   |   1 +
>  drivers/gpu/drm/lima/lima_sched.c |   3 +-
>  drivers/gpu/drm/msm/adreno/adreno_device.c|   6 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c  |   6 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   3 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c   |   1 +
>  drivers/gpu/drm/scheduler/sched_entity.c  |  18 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 444 --
>  drivers/gpu/drm/v3d/v3d_sched.c   |  10 +-
>  include/drm/gpu_scheduler.h   |  29 +-
>  14 files changed, 373 insertions(+), 184 deletions(-)
> 
> 
> base-commit: 201c8a7bd1f3f415920a2df4b8a8817e973f42fe

I get a "bad object" doing a lookup for this object. drm-tip isn't very
"stable".
-- 
Regards,
Luben



Re: [PATCH v6 4/7] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy

2023-10-25 Thread Luben Tuikov
Hi Matt,

On 2023-10-25 11:13, Matthew Brost wrote:
> On Mon, Oct 23, 2023 at 11:50:26PM -0400, Luben Tuikov wrote:
>> Hi,
>>
>> On 2023-10-17 11:09, Matthew Brost wrote:
>>> DRM_SCHED_POLICY_SINGLE_ENTITY creates a 1 to 1 relationship between
>>> scheduler and entity. No priorities or run queue used in this mode.
>>> Intended for devices with firmware schedulers.
>>>
>>> v2:
>>>   - Drop sched / rq union (Luben)
>>> v3:
>>>   - Don't pick entity if stopped in drm_sched_select_entity (Danilo)
>>> v4:
>>>   - Rework if statement in drm_sched_entity_init (Luben)
>>>   - Update comment for drm_sched_entity_to_scheduler (Luben)
>>>   - Reword a few things in DOC comment (Luben)
>>>   - Do not check sched policy in for statement (Luben)
>>> v5:
>>>   - Fix extra blank lines (Luben / Checkpatch)
>>>
>>> Signed-off-by: Matthew Brost 
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_entity.c | 69 +++
>>>  drivers/gpu/drm/scheduler/sched_fence.c  |  2 +-
>>>  drivers/gpu/drm/scheduler/sched_main.c   | 87 ++--
>>>  include/drm/gpu_scheduler.h  |  8 +++
>>>  4 files changed, 130 insertions(+), 36 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index cf42e2265d64..940f63dd6965 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -83,6 +83,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>> memset(entity, 0, sizeof(struct drm_sched_entity));
>>> INIT_LIST_HEAD(&entity->list);
>>> entity->rq = NULL;
>>> +   entity->single_sched = NULL;
>>> entity->guilty = guilty;
>>> entity->num_sched_list = num_sched_list;
>>> entity->priority = priority;
>>> @@ -90,8 +91,17 @@ int drm_sched_entity_init(struct drm_sched_entity 
>>> *entity,
>>> RCU_INIT_POINTER(entity->last_scheduled, NULL);
>>> RB_CLEAR_NODE(&entity->rb_tree_node);
>>>  
>>> -   if(num_sched_list)
>>> -   entity->rq = &sched_list[0]->sched_rq[entity->priority];
>>> +   if (num_sched_list) {
>>> +   if (sched_list[0]->sched_policy !=
>>> +   DRM_SCHED_POLICY_SINGLE_ENTITY) {
>>> +   entity->rq = &sched_list[0]->sched_rq[entity->priority];
>>> +   } else if (num_sched_list == 1 && 
>>> !sched_list[0]->single_entity) {
>>> +   sched_list[0]->single_entity = entity;
>>> +   entity->single_sched = sched_list[0];
>>
>> To simplify the rest of the GPU scheduler design and generalize the logic,
>> we can do
>>  entity->rq = sched_list[0]->sched_rq[entity->priority];
>> where the "priority" is 0, thus having a single rq.
>>
>> We shouldn't splice scheduler and entity, but rather make it so that
>> we can set the number of rqs to 1, thus resulting in a single rq.
>>
>> (https://lore.kernel.org/r/20231023032251.164775-1-luben.tui...@amd.com)
>>
> 
> I pulled out this patch [1] + previous one [2] and included your [3] to
> test this in Xe [4].
> 
> It seems to work with just one rq per scheduler. We can go with this
> approach in feel like this is the route. My next post will include your
> patch [3] if we agree.

Yeah, this is good. Thanks!

I feel that the sched_rq[] static array should've been a dynamic one
from the outset--one of the hallmarks of a flexible scheduler. Then
let each driver announce how many priorities they care about. A scheduler
shouldn't be as rigid as to force drivers to care and/or support so and
so many run-queues (priorities).

For a good code design, we want to allow for more driver implementations
with minimal (or no) DRM changes--all accommodated by drm_sched_init(), yet,
accommodating a varied behaviour, and having the sched_rq be dynamic and let
each driver announce how many run-queues it wants is the minimal change we
can do now (and easiest), with best outcome for Xe and new drivers.

Regards,
Luben

> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/patch/563094/?series=121745=8
> [2] https://patchwork.freedesktop.org/patch/563093/?series=121745=8
> [3] https://patchwork.freedesktop.org/patch/563817/?series=125433=1
> [4] https://patchwork.freedesktop.org/series/125540/
> 
>>> +   } else {
>>> +   return -EINVAL;
>>> +   }
>

Re: [PATCH v6 4/7] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy

2023-10-23 Thread Luben Tuikov
Hi,

On 2023-10-17 11:09, Matthew Brost wrote:
> DRM_SCHED_POLICY_SINGLE_ENTITY creates a 1 to 1 relationship between
> scheduler and entity. No priorities or run queue used in this mode.
> Intended for devices with firmware schedulers.
> 
> v2:
>   - Drop sched / rq union (Luben)
> v3:
>   - Don't pick entity if stopped in drm_sched_select_entity (Danilo)
> v4:
>   - Rework if statement in drm_sched_entity_init (Luben)
>   - Update comment for drm_sched_entity_to_scheduler (Luben)
>   - Reword a few things in DOC comment (Luben)
>   - Do not check sched policy in for statement (Luben)
> v5:
>   - Fix extra blank lines (Luben / Checkpatch)
> 
> Signed-off-by: Matthew Brost 
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 69 +++
>  drivers/gpu/drm/scheduler/sched_fence.c  |  2 +-
>  drivers/gpu/drm/scheduler/sched_main.c   | 87 ++--
>  include/drm/gpu_scheduler.h  |  8 +++
>  4 files changed, 130 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index cf42e2265d64..940f63dd6965 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -83,6 +83,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>   memset(entity, 0, sizeof(struct drm_sched_entity));
>   INIT_LIST_HEAD(&entity->list);
>   entity->rq = NULL;
> + entity->single_sched = NULL;
>   entity->guilty = guilty;
>   entity->num_sched_list = num_sched_list;
>   entity->priority = priority;
> @@ -90,8 +91,17 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>   RCU_INIT_POINTER(entity->last_scheduled, NULL);
>   RB_CLEAR_NODE(&entity->rb_tree_node);
>  
> - if(num_sched_list)
> - entity->rq = &sched_list[0]->sched_rq[entity->priority];
> + if (num_sched_list) {
> + if (sched_list[0]->sched_policy !=
> + DRM_SCHED_POLICY_SINGLE_ENTITY) {
> + entity->rq = &sched_list[0]->sched_rq[entity->priority];
> + } else if (num_sched_list == 1 && 
> !sched_list[0]->single_entity) {
> + sched_list[0]->single_entity = entity;
> + entity->single_sched = sched_list[0];

To simplify the rest of the GPU scheduler design and generalize the logic,
we can do
entity->rq = sched_list[0]->sched_rq[entity->priority];
where the "priority" is 0, thus having a single rq.

We shouldn't splice scheduler and entity, but rather make it so that
we can set the number of rqs to 1, thus resulting in a single rq.

(https://lore.kernel.org/r/20231023032251.164775-1-luben.tui...@amd.com)

> + } else {
> + return -EINVAL;
> + }
> + }
>  
>   init_completion(&entity->entity_idle);
>  
> @@ -124,7 +134,8 @@ void drm_sched_entity_modify_sched(struct 
> drm_sched_entity *entity,
>   struct drm_gpu_scheduler **sched_list,
>   unsigned int num_sched_list)
>  {
> - WARN_ON(!num_sched_list || !sched_list);
> + WARN_ON(!num_sched_list || !sched_list ||
> + !!entity->single_sched);

We wouldn't need this check.

>  
>   entity->sched_list = sched_list;
>   entity->num_sched_list = num_sched_list;
> @@ -231,13 +242,15 @@ static void drm_sched_entity_kill(struct 
> drm_sched_entity *entity)
>  {
>   struct drm_sched_job *job;
>   struct dma_fence *prev;
> + bool single_entity = !!entity->single_sched;
>  
> - if (!entity->rq)
> + if (!entity->rq && !single_entity)
>   return;
>  
>   spin_lock(&entity->rq_lock);
>   entity->stopped = true;
> - drm_sched_rq_remove_entity(entity->rq, entity);
> + if (!single_entity)
> + drm_sched_rq_remove_entity(entity->rq, entity);
>   spin_unlock(&entity->rq_lock);

We should be able to carry over the existing infrastructure when
having num_rqs = 1, with dynamic rqs. So this shouldn't warrant
any changes here.

>  
>   /* Make sure this entity is not used by the scheduler at the moment */
> @@ -259,6 +272,20 @@ static void drm_sched_entity_kill(struct 
> drm_sched_entity *entity)
>   dma_fence_put(prev);
>  }
>  
> +/**
> + * drm_sched_entity_to_scheduler - Schedule entity to GPU scheduler
> + * @entity: scheduler entity
> + *
> + * Returns GPU scheduler for the entity
> + */
> +struct drm_gpu_scheduler *
> +drm_sched_entity_to_scheduler(struct drm_sched_entity *entity)
> +{
> + bool single_entity = !!entity->single_sched;
> +
> + return single_entity ? entity->single_sched : entity->rq->sched;

It would be "entity->rq->sched" for any and all cases which simplifies things.

> +}
> +
>  /**
>   * drm_sched_entity_flush - Flush a context entity
>   *
> @@ -276,11 +303,12 @@ long drm_sched_entity_flush(struct drm_sched_entity 
> *entity, long timeout)
>   struct drm_gpu_scheduler *sched;
>   struct task_struct *last_user;
> 

Re: [PATCH v6 3/7] drm/sched: Move schedule policy to scheduler

2023-10-23 Thread Luben Tuikov
On 2023-10-17 11:09, Matthew Brost wrote:
> Rather than a global modparam for scheduling policy, move the scheduling
> policy to scheduler so user can control each scheduler policy.
> 
> v2:
>   - s/DRM_SCHED_POLICY_MAX/DRM_SCHED_POLICY_COUNT (Luben)
>   - Only include policy in scheduler (Luben)
> v3:
>   - use a ternary operator as opposed to an if-control (Luben)
>   - s/DRM_SCHED_POLICY_DEFAULT/DRM_SCHED_POLICY_UNSET/ (Luben)
>   - s/default_drm_sched_policy/drm_sched_policy_default/ (Luben)
>   - Update commit message (Boris)
>   - Fix v3d build (CI)
>   - s/bad_policies/drm_sched_policy_mismatch/ (Luben)
>   - Don't update modparam doc (Luben)
> v4:
>   - Fix alignment in msm_ringbuffer_new (Luben / checkpatch)
> 
> Signed-off-by: Matthew Brost 
> Reviewed-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c|  3 ++-
>  drivers/gpu/drm/lima/lima_sched.c  |  3 ++-
>  drivers/gpu/drm/msm/msm_ringbuffer.c   |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c|  3 ++-
>  drivers/gpu/drm/panfrost/panfrost_job.c|  3 ++-
>  drivers/gpu/drm/scheduler/sched_entity.c   | 24 ++
>  drivers/gpu/drm/scheduler/sched_main.c | 19 -
>  drivers/gpu/drm/v3d/v3d_sched.c| 15 +-
>  include/drm/gpu_scheduler.h| 20 --
>  10 files changed, 68 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b54c4d771104..e4e6f91450a4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2283,6 +2283,7 @@ static int amdgpu_device_init_schedulers(struct 
> amdgpu_device *adev)
>  ring->num_hw_submission, 0,
>  timeout, adev->reset_domain->wq,
>  ring->sched_score, ring->name,
> +DRM_SCHED_POLICY_UNSET,
>  adev->dev);

I think we should drop this patch.

Drivers shouldn't be able to select their own policy when there's a kernel
parameter which says what the scheduling policy should be. Imagine you're a
user setting the policy on the kernel command line, only to learn that some
driver has decided (programmed, mind you) to ignore your choice and set
whatever it pleases. Plus, this opens Pandora's box for other drivers to do
the same, and it's not a good software engineering direction.

For the 1-1-1 case in S-R-E (sched-rq-entity), which Xe is using, the DRM
scheduling policy is irrelevant. We can see this as follows:
  a) In the case of RR, we always sample the only entity there, namely the
     single entity in the single run-queue.
  b) In the case of FIFO, we always pick the only node in the RB tree, namely
     the single entity in the single run-queue.

So whether it is RR or FIFO, the 1-1-1 case always works as expected, and the
actual selection policy (scheduling policy) is irrelevant. This is a good
design.
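
To sketch the degenerate case (illustrative only--both pickers are static in
sched_main.c, so a driver wouldn't call them directly):

	struct drm_sched_entity *entity;

	/* rq is the single run-queue holding exactly one entity E. */
	entity = drm_sched_rq_select_entity_rr(rq);   /* list walk visits only E */
	entity = drm_sched_rq_select_entity_fifo(rq); /* rb_first_cached() yields only E */
	/* Either way entity == E: the RR-vs-FIFO choice cannot change the outcome. */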

However, what prevents the Xe driver from doing this cleanly is this ghastly
array of "priorities" in the scheduler struct,

struct drm_gpu_scheduler {
...
struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT];
...

which is very speculative and ambitious... Why should the scheduler have a
fixed, constant number of "priorities", no more and no less? Who said that
those are _the_only_ options, and why? It makes for a very rigid design that
doesn't lend itself to novel and varied implementations, and that's not a
sign of a good design.

With the "[PATCH] drm/sched: Convert the GPU scheduler to variable number
of run-queues" 
(https://lore.kernel.org/r/20231023032251.164775-1-luben.tui...@amd.com)
drivers can specify the number of run-queues in the S-R-E relationship.

Note that in the S-R-E relationship, the driver already controls the number
of E's. With the patch above, drivers will also control the number of R's,
and I think that will have much more flexible implications for drivers to
play with than keeping the R constant.

The idea is that the Xe driver would specify, at drm_sched_init() time, that
it is using a scheduler with a single R, and then attach a single E to it,
leaving much of the rest of the code alone. This keeps the common code
general, which is a good design--see the sketch below.
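
A minimal sketch of that setup, assuming the variable run-queue
drm_sched_init() signature from the patch above (the ops, submission limit,
timeout, and name below are hypothetical placeholders):

	/* Assumes <drm/gpu_scheduler.h> and <linux/jiffies.h>. */
	static struct drm_gpu_scheduler sched;
	static struct drm_sched_entity entity;

	static int one_to_one_queue_init(struct device *dev,
					 const struct drm_sched_backend_ops *ops)
	{
		struct drm_gpu_scheduler *sched_list[] = { &sched };
		int ret;

		/* A single run-queue: the RR-vs-FIFO policy choice becomes moot. */
		ret = drm_sched_init(&sched, ops,
				     1,                     /* num_rqs */
				     64,                    /* hw submission limit */
				     0,                     /* hang limit */
				     msecs_to_jiffies(500), NULL, NULL,
				     "1to1-queue", dev);
		if (ret)
			return ret;

		/* ...and a single entity attached to that one run-queue. */
		return drm_sched_entity_init(&entity, DRM_SCHED_PRIORITY_MIN,
					     sched_list, 1, NULL);
	}

With that in place, the existing RR/FIFO selection code runs unchanged; it
simply always finds the same entity.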

There's a bug in the amdgpu code: when using dynamic rqs it oopses, because
it was using a scheduler without ever calling drm_sched_init()--the only
reason amdgpu was getting away with that was that the array was statically
defined. I've posted a patch for that.
(https://lore.kernel.org/r/202310230

Re: [PATCH drm-misc-next v2] drm/sched: implement dynamic job-flow control

2023-10-23 Thread Luben Tuikov
On 2023-10-23 18:35, Danilo Krummrich wrote:
> On Wed, Oct 11, 2023 at 09:52:36PM -0400, Luben Tuikov wrote:
>> Hi,
>>
>> Thanks for fixing the title and submitting a v2 of this patch. Comments 
>> inlined below.
>>
>> On 2023-10-09 18:35, Danilo Krummrich wrote:
>>> Currently, job flow control is implemented simply by limiting the number
>>> of jobs in flight. Therefore, a scheduler is initialized with a
>>> submission limit that corresponds to the number of jobs which can be
>>> sent to the hardware.
>>>
>>> This implies that for each job, drivers need to account for the maximum
>>> job size possible in order to not overflow the ring buffer.
>>>
>>> However, there are drivers, such as Nouveau, where the job size has a
>>> rather large range. For such drivers it can easily happen that job
>>> submissions not even filling the ring by 1% can block subsequent
>>> submissions, which, in the worst case, can lead to the ring running dry.
>>>
>>> In order to overcome this issue, allow for tracking the actual job size
>>> instead of the number of jobs. Therefore, add a field to track a job's
>>> submission credits, which represents the number of credits a job
>>> contributes to the scheduler's submission limit.
>>>
>>> Signed-off-by: Danilo Krummrich 
>>> ---
>>> Changes in V2:
>>> ==
>>>   - fixed up influence on scheduling fairness due to consideration of a 
>>> job's
>>> size
>>> - If we reach a ready entity in drm_sched_select_entity() but can't 
>>> actually
>>>   queue a job from it due to size limitations, just give up and go to 
>>> sleep
>>>   until woken up due to a pending job finishing, rather than continue 
>>> to try
>>>   other entities.
>>>   - added a callback to dynamically update a job's credits (Boris)
>>>   - renamed 'units' to 'credits'
>>>   - fixed commit message and comments as requested by Luben
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>>>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>>>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>>>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>>>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>>>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>>  drivers/gpu/drm/scheduler/sched_entity.c  |   5 +-
>>>  drivers/gpu/drm/scheduler/sched_main.c| 101 +-
>>>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>>>  include/drm/gpu_scheduler.h   |  33 --
>>>  11 files changed, 115 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 78476bc75b4e..d54daaf64bf1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
>>> amdgpu_vm *vm,
>>> if (!entity)
>>> return 0;
>>>  
>>> -   return drm_sched_job_init(&(*job)->base, entity, owner);
>>> +   return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>>>  }
>>>  
>>>  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c 
>>> b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> index 45403ea38906..74a446711207 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> @@ -538,7 +538,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, 
>>> void *data,
>>>  
>>> ret = drm_sched_job_init(&submit->sched_job,
>>>  &ctx->sched_entity[args->pipe],
>>> -submit->ctx);
>>> +1, submit->ctx);
>>> if (ret)
>>> goto err_submit_put;
>>>  
>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c 
>>> b/drivers/gpu/drm/lima/lima_sched.c
>>> index 50c2075228aa..5dc6678e1eb9 100644
>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct li

Re: [PATCH drm-misc-next v2] drm/sched: implement dynamic job-flow control

2023-10-23 Thread Luben Tuikov
On 2023-10-23 18:57, Danilo Krummrich wrote:
> On Tue, Oct 10, 2023 at 09:41:51AM +0200, Boris Brezillon wrote:
>> On Tue, 10 Oct 2023 00:35:53 +0200
>> Danilo Krummrich  wrote:
>>
>>> Currently, job flow control is implemented simply by limiting the number
>>> of jobs in flight. Therefore, a scheduler is initialized with a
>>> submission limit that corresponds to the number of jobs which can be
>>> sent to the hardware.
>>>
>>> This implies that for each job, drivers need to account for the maximum
>>> job size possible in order to not overflow the ring buffer.
>>>
>>> However, there are drivers, such as Nouveau, where the job size has a
>>> rather large range. For such drivers it can easily happen that job
>>> submissions not even filling the ring by 1% can block subsequent
>>> submissions, which, in the worst case, can lead to the ring running dry.
>>>
>>> In order to overcome this issue, allow for tracking the actual job size
>>> instead of the number of jobs. Therefore, add a field to track a job's
>>> submission credits, which represents the number of credits a job
>>> contributes to the scheduler's submission limit.
>>>
>>> Signed-off-by: Danilo Krummrich 
>>> ---
>>> Changes in V2:
>>> ==
>>>   - fixed up influence on scheduling fairness due to consideration of a 
>>> job's
>>> size
>>> - If we reach a ready entity in drm_sched_select_entity() but can't 
>>> actually
>>>   queue a job from it due to size limitations, just give up and go to 
>>> sleep
>>>   until woken up due to a pending job finishing, rather than continue 
>>> to try
>>>   other entities.
>>>   - added a callback to dynamically update a job's credits (Boris)
>>>   - renamed 'units' to 'credits'
>>>   - fixed commit message and comments as requested by Luben
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
>>>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>>  drivers/gpu/drm/lima/lima_sched.c |   2 +-
>>>  drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
>>>  drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
>>>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
>>>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>>  drivers/gpu/drm/scheduler/sched_entity.c  |   5 +-
>>>  drivers/gpu/drm/scheduler/sched_main.c| 101 +-
>>>  drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
>>>  include/drm/gpu_scheduler.h   |  33 --
>>>  11 files changed, 115 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 78476bc75b4e..d54daaf64bf1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
>>> amdgpu_vm *vm,
>>> if (!entity)
>>> return 0;
>>>  
>>> -   return drm_sched_job_init(&(*job)->base, entity, owner);
>>> +   return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>>>  }
>>>  
>>>  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c 
>>> b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> index 45403ea38906..74a446711207 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> @@ -538,7 +538,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, 
>>> void *data,
>>>  
>>> ret = drm_sched_job_init(&submit->sched_job,
>>>  &ctx->sched_entity[args->pipe],
>>> -submit->ctx);
>>> +1, submit->ctx);
>>> if (ret)
>>> goto err_submit_put;
>>>  
>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c 
>>> b/drivers/gpu/drm/lima/lima_sched.c
>>> index 50c2075228aa..5dc6678e1eb9 100644
>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>>> for (i = 0; i < num_bos; i++)
>>> drm_gem_object_get(&bos[i]->base.base);
>>>  
>>> -   err = drm_sched_job_init(&task->base, &context->base, vm);
>>> +   err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>>> if (err) {
>>> kfree(task->bos);
>>> return err;
>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c 
>>> b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> index 3f1aa4de3b87..6d230c38e4f5 100644
>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct 
>>> drm_device *dev,
>>> return ERR_PTR(ret);
>>> }
>>>  
>>> -   ret = drm_sched_job_init(&submit->base, queue->entity, queue);
>>> +   ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>>> if (ret) {
>>> kfree(submit->hw_fence);
>>> kfree(submit);
>>> diff --git 

[PATCH] drm/sched: Convert the GPU scheduler to variable number of run-queues

2023-10-22 Thread Luben Tuikov
The GPU scheduler has now a variable number of run-queues, which are set up at
drm_sched_init() time. This way, each driver announces how many run-queues it
requires (supports) per each GPU scheduler it creates. Note, that run-queues
correspond to scheduler "priorities", thus if the number of run-queues is set
to 1 at drm_sched_init(), then that scheduler supports a single run-queue,
i.e. single "priority". If a driver further sets a single entity per
run-queue, then this creates a 1-to-1 correspondence between a scheduler and
a scheduled entity.

Cc: Lucas Stach 
Cc: Russell King 
Cc: Qiang Yu 
Cc: Rob Clark 
Cc: Abhinav Kumar 
Cc: Dmitry Baryshkov 
Cc: Danilo Krummrich 
Cc: Matthew Brost 
Cc: Boris Brezillon 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Emma Anholt 
Cc: etna...@lists.freedesktop.org
Cc: l...@lists.freedesktop.org
Cc: linux-arm-...@vger.kernel.org
Cc: freedr...@lists.freedesktop.org
Cc: nouv...@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c|  1 +
 drivers/gpu/drm/lima/lima_sched.c  |  4 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c   |  5 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c|  1 +
 drivers/gpu/drm/panfrost/panfrost_job.c|  1 +
 drivers/gpu/drm/scheduler/sched_entity.c   | 18 +-
 drivers/gpu/drm/scheduler/sched_main.c | 74 ++
 drivers/gpu/drm/v3d/v3d_sched.c|  5 ++
 include/drm/gpu_scheduler.h|  9 ++-
 11 files changed, 98 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2b8356699f235d..251995a90bbe69 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2280,6 +2280,7 @@ static int amdgpu_device_init_schedulers(struct 
amdgpu_device *adev)
}
 
r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+  DRM_SCHED_PRIORITY_COUNT,
   ring->num_hw_submission, 0,
   timeout, adev->reset_domain->wq,
   ring->sched_score, ring->name,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 78476bc75b4e1d..1f357198533f3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -325,8 +325,8 @@ void amdgpu_job_stop_all_jobs_on_sched(struct 
drm_gpu_scheduler *sched)
int i;
 
/* Signal all jobs not yet scheduled */
-   for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
-   struct drm_sched_rq *rq = &sched->sched_rq[i];
+   for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+   struct drm_sched_rq *rq = sched->sched_rq[i];
spin_lock(&rq->lock);
list_for_each_entry(s_entity, &rq->entities, list) {
while ((s_job = to_drm_sched_job(spsc_queue_pop(&s_entity->job_queue)))) {
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 345fec6cb1a4c1..9b79f218e21afc 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -135,6 +135,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
int ret;
 
ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
+DRM_SCHED_PRIORITY_COUNT,
 etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
 msecs_to_jiffies(500), NULL, NULL,
 dev_name(gpu->dev), gpu->dev);
diff --git a/drivers/gpu/drm/lima/lima_sched.c 
b/drivers/gpu/drm/lima/lima_sched.c
index ffd91a5ee29901..295f0353a02e58 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -488,7 +488,9 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, 
const char *name)
 
INIT_WORK(>recover_work, lima_sched_recover_work);
 
-   return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
+   return drm_sched_init(&pipe->base, &lima_sched_ops,
+ DRM_SCHED_PRIORITY_COUNT,
+ 1,
  lima_job_hang_limit,
  msecs_to_jiffies(timeout), NULL,
  NULL, name, pipe->ldev->dev);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c 
b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 40c0bc35a44cee..95257ab0185dc4 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -95,8 +95,9 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu 
*gpu, int id,
sched_timeout = MAX_SCHEDULE_TIMEOUT;
 

Re: [PATCH v6 5/7] drm/sched: Split free_job into own work item

2023-10-21 Thread Luben Tuikov
Hi,

On 2023-10-19 12:55, Matthew Brost wrote:
> On Wed, Oct 18, 2023 at 09:25:36PM -0400, Luben Tuikov wrote:
>> Hi,
>>
>> On 2023-10-17 11:09, Matthew Brost wrote:
>>> Rather than call free_job and run_job in same work item have a dedicated
>>> work item for each. This aligns with the design and intended use of work
>>> queues.
>>>
>>> v2:
>>>- Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting
>>>  timestamp in free_job() work item (Danilo)
>>> v3:
>>>   - Drop forward dec of drm_sched_select_entity (Boris)
>>>   - Return in drm_sched_run_job_work if entity NULL (Boris)
>>> v4:
>>>   - Replace dequeue with peek and invert logic (Luben)
>>>   - Wrap to 100 lines (Luben)
>>>   - Update comments for *_queue / *_queue_if_ready functions (Luben)
>>>
>>> Signed-off-by: Matthew Brost 
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 287 +++--
>>>  include/drm/gpu_scheduler.h|   8 +-
>>>  2 files changed, 178 insertions(+), 117 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 273e0fbc4eab..b1b8d9f96da5 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq 
>>> *rq,
>>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a 
>>> job to run
>>>   *
>>>   * @rq: scheduler run queue to check.
>>> + * @peek: Just find, don't set to current.
>>
>> The "peek" rename is good--thanks!
>>
>>>   *
>>>   * Try to find a ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool peek)
>>>  {
>>> struct drm_sched_entity *entity;
>>>  
>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> if (entity) {
>>> list_for_each_entry_continue(entity, &rq->entities, list) {
>>> if (drm_sched_entity_is_ready(entity)) {
>>> -   rq->current_entity = entity;
>>> -   reinit_completion(&entity->entity_idle);
>>> +   if (!peek) {
>>> +   rq->current_entity = entity;
>>> +   reinit_completion(&entity->entity_idle);
>>> +   }
>>> spin_unlock(&rq->lock);
>>> return entity;
>>> }
>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> list_for_each_entry(entity, &rq->entities, list) {
>>>  
>>> if (drm_sched_entity_is_ready(entity)) {
>>> -   rq->current_entity = entity;
>>> -   reinit_completion(&entity->entity_idle);
>>> +   if (!peek) {
>>> +   rq->current_entity = entity;
>>> +   reinit_completion(&entity->entity_idle);
>>> +   }
>>> spin_unlock(&rq->lock);
>>> return entity;
>>> }
>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job 
>>> to run
>>>   *
>>>   * @rq: scheduler run queue to check.
>>> + * @peek: Just find, don't set to current.
>>>   *
>>>   * Find oldest waiting ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool peek)
>>>  {
>>> struct rb_node *rb;
>>>  
>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq 
>>> *rq)
>>>  
>>> entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>> if (drm_sched_entity_is_ready(entity)) {
>>> -   rq->current_entity = entity;
>>> -   reinit_completion(&entity->entity_idle);
&

Re: [PATCH] drm/amdgpu: Remove redundant call to priority_is_valid()

2023-10-21 Thread Luben Tuikov
On 2023-10-20 12:37, Alex Deucher wrote:
> On Tue, Oct 17, 2023 at 9:22 PM Luben Tuikov  wrote:
>>
>> Remove a redundant call to amdgpu_ctx_priority_is_valid() from
>> amdgpu_ctx_priority_permit(), which is called from amdgpu_ctx_init() which is
>> called from amdgpu_ctx_alloc() which is called from amdgpu_ctx_ioctl(), where
>> we've already called amdgpu_ctx_priority_is_valid() first thing in the
>> function.
>>
>> Cc: Alex Deucher 
>> Cc: Christian König 
>> Signed-off-by: Luben Tuikov 
> 
> Please push this to drm-misc since it depends on your previous patches.

Done!

Pushed to drm-misc-fixes, where the other two landed.

Regards,
Luben

> 
> Alex
> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 15 ---
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 68db924161ef66..4c6ffca97c4512 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -56,6 +56,10 @@ bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio)
>> return true;
>> default:
>> case AMDGPU_CTX_PRIORITY_UNSET:
>> +   /* UNSET priority is not valid and we don't carry that
>> +* around, but set it to NORMAL in the only place this
>> +* function is called, amdgpu_ctx_ioctl().
>> +*/
>> return false;
>> }
>>  }
>> @@ -96,9 +100,6 @@ amdgpu_ctx_to_drm_sched_prio(int32_t ctx_prio)
>>  static int amdgpu_ctx_priority_permit(struct drm_file *filp,
>>   int32_t priority)
>>  {
>> -   if (!amdgpu_ctx_priority_is_valid(priority))
>> -   return -EINVAL;
>> -
>> /* NORMAL and below are accessible by everyone */
>> if (priority <= AMDGPU_CTX_PRIORITY_NORMAL)
>> return 0;
>> @@ -625,8 +626,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>> return 0;
>>  }
>>
>> -
>> -
>>  static int amdgpu_ctx_stable_pstate(struct amdgpu_device *adev,
>> struct amdgpu_fpriv *fpriv, uint32_t id,
>> bool set, u32 *stable_pstate)
>> @@ -669,8 +668,10 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>> id = args->in.ctx_id;
>> priority = args->in.priority;
>>
>> -   /* For backwards compatibility reasons, we need to accept
>> -* ioctls with garbage in the priority field */
>> +   /* For backwards compatibility, we need to accept ioctls with garbage
>> +* in the priority field. Garbage values in the priority field result
>> +* in the priority being set to NORMAL.
>> +*/
>> if (!amdgpu_ctx_priority_is_valid(priority))
>> priority = AMDGPU_CTX_PRIORITY_NORMAL;
>>
>>
>> base-commit: 915718484b8fa1eede4499a939e2e4fc0d85caa4
>> prerequisite-patch-id: a36f628997d923f66da5342e760e8b45ff959fb8
>> prerequisite-patch-id: f15148c302329c0c60d86040571c61d367bd05e7
>> --
>> 2.42.0
>>


