Re: [RFC v8 00/21] DRM scheduling cgroup controller

2025-09-30 Thread Philipp Stanner
+Cc Sima, Dave On Mon, 2025-09-29 at 16:07 +0200, Danilo Krummrich wrote: > On Wed Sep 3, 2025 at 5:23 PM CEST, Tvrtko Ursulin wrote: > > This is another respin of this old work^1 which since v7 is a total rewrite > > and > > completely changes how the control is done. > > I only got some of the

Re: [RFC PATCH] rust: sync: Add dma_fence abstractions

2025-09-28 Thread Philipp Stanner
On Sun, 2025-09-28 at 16:34 +0200, Christian König wrote: > On 27.09.25 11:01, Philipp Stanner wrote: > > On Fri, 2025-09-26 at 09:10 -0700, Boqun Feng wrote: > > > On Thu, Sep 18, 2025 at 02:30:59PM +0200, Philipp Stanner wrote: > > > > dma_fence is a synchronizatio

Re: [RFC PATCH] rust: sync: Add dma_fence abstractions

2025-09-27 Thread Philipp Stanner
On Fri, 2025-09-26 at 09:10 -0700, Boqun Feng wrote: > On Thu, Sep 18, 2025 at 02:30:59PM +0200, Philipp Stanner wrote: > > dma_fence is a synchronization mechanism which is needed by virtually > > all GPU drivers. > > > > A dma_fence offers many features, among wh

[PATCH] drm/sched: Mandate usage of drm_sched_job_cleanup()

2025-09-26 Thread Philipp Stanner
drm_sched_job_cleanup()'s documentation so far uses relatively soft language, only "recommending" usage of the function. To avoid memory leaks and, potentially, other bugs, however, the function has to be used. Demand usage of the function explicitly. Signed-off-by: Philipp Stanne

Re: [RFC v8 07/12] drm/sched: Account entity GPU time

2025-09-26 Thread Philipp Stanner
On Thu, 2025-09-25 at 12:52 +0100, Tvrtko Ursulin wrote: > > On 24/09/2025 10:11, Philipp Stanner wrote: > > On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote: > > > To implement fair scheduling we need a view into the GPU time consumed by > > > entities. Pr

Re: [RFC PATCH] rust: sync: Add dma_fence abstractions

2025-09-26 Thread Philipp Stanner
On Thu, 2025-09-18 at 15:52 +0200, Boqun Feng wrote: > On Thu, Sep 18, 2025 at 02:30:59PM +0200, Philipp Stanner wrote: > [...] > > --- > > So. ¡Hola! > > > > This is a highly WIP RFC. It's obviously at many places not yet > > conforming very well to Ru

Re: [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler

2025-09-24 Thread Philipp Stanner
is to do things like general improvements by renaming variables (see my comments in the previous patch) in a separate cleanup-patch following this one. Few comments below > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Bros

Re: [RFC v8 06/12] drm/sched: Free all finished jobs at once

2025-09-24 Thread Philipp Stanner
eted jobs as soon as possible so the metric is most up to date when > view from the submission side of things. > > Signed-off-by: Tvrtko Ursulin Looks like a good patch to me. > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner

Re: [PATCH] drm/sched/tests: Remove relict of done_list

2025-09-19 Thread Philipp Stanner
On Fri, 2025-09-19 at 08:33 +0100, Tvrtko Ursulin wrote: > > On 19/09/2025 07:44, Philipp Stanner wrote: > > A rework of the scheduler unit tests removed the done_list. That list is > > still mentioned in the mock test header. > > > > Remove that relict. > >

[PATCH] drm/sched/tests: Remove relict of done_list

2025-09-18 Thread Philipp Stanner
A rework of the scheduler unit tests removed the done_list. That list is still mentioned in the mock test header. Remove that relict. Fixes: 4576de9b7977 ("drm/sched/tests: Implement cancel_job() callback") Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/tests/sched_t

[RFC PATCH] rust: sync: Add dma_fence abstractions

2025-09-18 Thread Philipp Stanner
should_ be relatively trivial to implement, though. Signed-off-by: Philipp Stanner --- So. ¡Hola! This is a highly WIP RFC. It's obviously at many places not yet conforming very well to Rust's standards. Nevertheless, it has progressed enough that I want to request comments from the community.

Re: [RFC PATCH] rust: sync: Add dma_fence abstractions

2025-09-18 Thread Philipp Stanner
ences more > robust regarding context lifetime. > > On 18.09.25 14:30, Philipp Stanner wrote: > > dma_fence is a synchronization mechanism which is needed by virtually > > all GPU drivers. > > > > A dma_fence offers many features, among which the most important ones >

Re: [PATCH v2] drm/sched: Extend and update documentation

2025-09-18 Thread Philipp Stanner
On Tue, 2025-09-02 at 13:12 +0200, Philipp Stanner wrote: > From: Philipp Stanner > > The various objects and their memory lifetime used by the GPU scheduler > are currently not fully documented. > > Add documentation describing the scheduler's objects. Improve the >

Re: [PATCH v2] drm/sched: struct member doc fix

2025-09-17 Thread Philipp Stanner
On Mon, 2025-09-15 at 21:23 +0800, Luc Ma wrote: > The mentioned function has been renamed since commit 180fc134d712 > ("drm/scheduler: Rename cleanup functions v2."), so let it refer to > the current one. > > v2: use proper pattern for function cross-reference > > Signed-off-by: Luc Ma Applied

Re: [RFC v8 04/12] drm/sched: Consolidate entity run queue management

2025-09-16 Thread Philipp Stanner
On Thu, 2025-09-11 at 15:55 +0100, Tvrtko Ursulin wrote: > > On 11/09/2025 15:20, Philipp Stanner wrote: > > On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote: > > > Move the code dealing with entities entering and exiting run queues to > > > helpers to l

Re: [PATCH] drm/sched: struct member doc fix

2025-09-15 Thread Philipp Stanner
On Fri, 2025-09-12 at 21:44 +0800, Luc Ma wrote: > The mentioned function has been renamed since commit 180fc134d712 > ("drm/scheduler: Rename cleanup functions v2."), so let it refer to > the current one. > > Signed-off-by: Luc Ma Thx for the patch. > --- >  include/drm/gpu_scheduler.h | 2 +-

Re: [RFC v8 08/12] drm/sched: Remove idle entity from tree

2025-09-11 Thread Philipp Stanner
work. Or could it be made generic for the current in-tree scheduler? > > Apart from that, the upcoming fair scheduling algorithm will rely on the > tree only containing runnable entities. > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich >

Re: [RFC v8 04/12] drm/sched: Consolidate entity run queue management

2025-09-11 Thread Philipp Stanner
patches or could it be branched out? P. > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner > --- >  drivers/gpu/drm/scheduler/sched_entity.c   | 64 ++- >  drivers/gpu/drm/scheduler/s

Re: [PATCH v2] Revert "drm/nouveau: Remove waitque for sched teardown"

2025-09-04 Thread Philipp Stanner
On Thu, 2025-09-04 at 12:27 +0200, Christian König wrote: > On 01.09.25 10:31, Philipp Stanner wrote: > > This reverts: > > > > commit bead88002227 ("drm/nouveau: Remove waitque for sched teardown") > > commit 5f46f5c7af8c ("drm/nouveau: Add new callback

Re: [PATCH v2] Revert "drm/nouveau: Remove waitque for sched teardown"

2025-09-04 Thread Philipp Stanner
On Thu, 2025-09-04 at 13:56 +0200, Christian König wrote: > On 04.09.25 13:12, Philipp Stanner wrote: > > On Thu, 2025-09-04 at 12:27 +0200, Christian König wrote: > > > On 01.09.25 10:31, Philipp Stanner wrote: > > > > This reverts: > > > > > >

Re: [PATCH 2/2] dma-buf: add warning when dma_fence is signaled from IOCTL

2025-09-04 Thread Philipp Stanner
On Tue, 2025-08-12 at 16:34 +0200, Christian König wrote: > From: Christian König > > We have the re-occurring problem that people try to invent a > DMA-fences implementation which signals fences based on an userspace > IOCTL. > > This is well known as source of hard to track down crashes and is

Re: [PATCH v1 2/2] drm/sched: limit sched score update to jobs change

2025-09-02 Thread Philipp Stanner
On Mon, 2025-09-01 at 15:14 +0200, Pierre-Eric Pelloux-Prayer wrote: > > > Le 25/08/2025 à 15:13, Philipp Stanner a écrit : > > On Fri, 2025-08-22 at 15:43 +0200, Pierre-Eric Pelloux-Prayer wrote: > > > Currently, the scheduler score is incremented when a job is pushe

Re: [PATCH v2] drm/sched: Fix racy access to drm_sched_entity.dependency

2025-09-02 Thread Philipp Stanner
On Tue, 2025-09-02 at 10:22 +0200, Philipp Stanner wrote: > On Tue, 2025-09-02 at 10:18 +0200, Philipp Stanner wrote: > > On Tue, 2025-09-02 at 08:59 +0100, Tvrtko Ursulin wrote: > > > > > > On 02/09/2025 08:27, Philipp Stanner wrote: > > > > On Mon, 2025-09

[PATCH v2] drm/sched: Extend and update documentation

2025-09-02 Thread Philipp Stanner
From: Philipp Stanner The various objects and their memory lifetime used by the GPU scheduler are currently not fully documented. Add documentation describing the scheduler's objects. Improve the general documentation at a few other places. Co-developed-by: Christian König Signed-o

Re: [PATCH v2] drm/sched: Fix racy access to drm_sched_entity.dependency

2025-09-02 Thread Philipp Stanner
On Tue, 2025-09-02 at 10:18 +0200, Philipp Stanner wrote: > On Tue, 2025-09-02 at 08:59 +0100, Tvrtko Ursulin wrote: > > > > On 02/09/2025 08:27, Philipp Stanner wrote: > > > On Mon, 2025-09-01 at 14:40 +0200, Pierre-Eric Pelloux-Prayer wrote: > > > > The drm

Re: [PATCH v2] drm/sched: Fix racy access to drm_sched_entity.dependency

2025-09-02 Thread Philipp Stanner
On Tue, 2025-09-02 at 08:59 +0100, Tvrtko Ursulin wrote: > > On 02/09/2025 08:27, Philipp Stanner wrote: > > On Mon, 2025-09-01 at 14:40 +0200, Pierre-Eric Pelloux-Prayer wrote: > > > The drm_sched_job_unschedulable trace point can access > > > entity->depen

Re: [PATCH v2] drm/sched: Fix racy access to drm_sched_entity.dependency

2025-09-02 Thread Philipp Stanner
On Mon, 2025-09-01 at 14:40 +0200, Pierre-Eric Pelloux-Prayer wrote: > The drm_sched_job_unschedulable trace point can access > entity->dependency after it was cleared by the callback > installed in drm_sched_entity_add_dependency_cb, causing: > > BUG: kernel NULL pointer dereference, address: 000

[PATCH v2] Revert "drm/nouveau: Remove waitque for sched teardown"

2025-09-01 Thread Philipp Stanner
ove waitque for sched teardown") Suggested-by: Danilo Krummrich Signed-off-by: Philipp Stanner --- Changes in v2: - Don't revert commit 89b2675198ab ("drm/nouveau: Make fence container helper usable driver-wide") - Add Fixes-tag --- drivers/gpu/drm/nouveau/nouveau_fence.c | 15 --

[PATCH] Revert "drm/nouveau: Remove waitque for sched teardown"

2025-08-29 Thread Philipp Stanner
ll patches related to the waitqueue removal. Suggested-by: Danilo Krummrich Signed-off-by: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +--- drivers/gpu/drm/nouveau/nouveau_fence.h | 6 -- drivers/gpu/drm/nouveau/nouveau_sched.c | 20 ++

Re: [PATCH v2] drm/sched: Document race condition in drm_sched_fini()

2025-08-28 Thread Philipp Stanner
On Wed, 2025-08-13 at 14:58 +0200, Danilo Krummrich wrote: > On Wed Aug 13, 2025 at 10:56 AM CEST, Philipp Stanner wrote: > > In drm_sched_fini() all entities are marked as stopped - without taking > > the appropriate lock, because that would deadlock. That means that > >

Re: drm/sched/tests: Remove redundant header files

2025-08-28 Thread Philipp Stanner
On Mon, 2025-08-25 at 12:48 +0200, Markus Elfring wrote: > > > > The header file is already included on line 8. Remove > > > > the > > > > redundant include. > > > > > > You would like to omit a duplicate #include directive, don't you? > > The change intention is probably clear. > > > > > Wil

Re: [PATCH] drm/sched: Remove mention of indirect buffers

2025-08-28 Thread Philipp Stanner
-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner Applied to drm-misc-next Thx P. > --- >  drivers/gpu/drm/scheduler/sched_entity.c | 14 +++--- >  1 file changed, 7 insertions(+), 7 deletions(-) >

Re: [PATCH v1 2/2] drm/sched: limit sched score update to jobs change

2025-08-25 Thread Philipp Stanner
On Fri, 2025-08-22 at 15:43 +0200, Pierre-Eric Pelloux-Prayer wrote: > Currently, the scheduler score is incremented when a job is pushed to an > entity and when an entity is attached to the scheduler. It's indeed awkward why attaching is treated equivalently to job submission. Can you expand the

Re: [PATCH] drm/sched/tests: Remove redundant header files

2025-08-25 Thread Philipp Stanner
On Tue, 2025-08-19 at 18:15 +0200, Markus Elfring wrote: > > The header file is already included on line 8. Remove the > > redundant include. > > You would like to omit a duplicate #include directive, don't you? > Will a corresponding refinement become helpful for the summary phrase > and change

Re: [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency

2025-08-25 Thread Philipp Stanner
On Wed, 2025-08-20 at 11:06 +0200, Pierre-Eric Pelloux-Prayer wrote: > > > Le 21/07/2025 à 17:18, Pierre-Eric Pelloux-Prayer a écrit : > > > > > > Le 26/06/2025 à 16:05, Tvrtko Ursulin a écrit : > > > > > > On 26/06/2025 14:43, Pierre-Eric Pelloux-Prayer wrote: > > > > Hi, > > > > > > > > Le

Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue.

2025-08-14 Thread Philipp Stanner
On Thu, 2025-08-14 at 12:45 +0100, Tvrtko Ursulin wrote: > > On 14/08/2025 11:42, Tvrtko Ursulin wrote: > > > > On 21/07/2025 08:52, Philipp Stanner wrote: > > > +Cc Tvrtko, who's currently reworking FIFO and RR. > > > > > > On Sun, 2025-07-20

[PATCH v2] drm/sched: Document race condition in drm_sched_fini()

2025-08-13 Thread Philipp Stanner
associated with a scheduler must be torn down first. Then, however, the locking should be removed from drm_sched_fini() altogether with an appropriate comment. Reported-by: James Flowers Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2...@fastmail.com/ Signed-off-by: Philipp

Re: [PATCH 2/2] dma-buf: add warning when dma_fence is signaled from IOCTL

2025-08-13 Thread Philipp Stanner
On Tue, 2025-08-12 at 16:34 +0200, Christian König wrote: > From: Christian König Is this the correct mail addr? :) > > We have the re-occurring problem that people try to invent a > DMA-fences implementation which signals fences based on an userspace > IOCTL. > > This is well known as source

Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill

2025-08-12 Thread Philipp Stanner
On Tue, 2025-08-12 at 08:58 +0200, Christian König wrote: > On 12.08.25 08:37, Liu01, Tong (Esther) wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > Hi Christian, > > > > If a job is submitted into a stopped entity, in addition to an error log, > > it will also cause t

Re: [PATCH] drm/sched: Extend and update documentation

2025-08-11 Thread Philipp Stanner
On Thu, 2025-08-07 at 16:15 +0200, Christian König wrote: > On 05.08.25 12:22, Philipp Stanner wrote: > > On Tue, 2025-08-05 at 11:05 +0200, Christian König wrote: > > > On 24.07.25 17:07, Philipp Stanner wrote: > > > > > +/** > >

Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill

2025-08-11 Thread Philipp Stanner
On Mon, 2025-08-11 at 10:18 +0200, Philipp Stanner wrote: > Hi, > > title: this patch changes nothing in amdgpu. > > Thus, the prefix must be drm/sched: Fix […] > > > Furthermore, please use scripts/get_maintainer. A few relevant folks > are missing. +Cc Danilo, Ma

Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill

2025-08-11 Thread Philipp Stanner
Hi, title: this patch changes nothing in amdgpu. Thus, the prefix must be drm/sched: Fix […] Furthermore, please use scripts/get_maintainer. A few relevant folks are missing. +Cc Danilo, Matthew On Mon, 2025-08-11 at 15:20 +0800, Liu01 Tong wrote: > During process kill, drm_sched_entity_flush

[PATCH] MAINTAINERS: Add website of Nova GPU driver

2025-08-07 Thread Philipp Stanner
The Nova GPU driver has a sub-website on the Rust-for-Linux website which so far was missing from the respective section in MAINTAINERS. Add the Nova website. Signed-off-by: Philipp Stanner --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index

Re: [PATCH] drm/sched: Extend and update documentation

2025-08-05 Thread Philipp Stanner
On Tue, 2025-08-05 at 11:05 +0200, Christian König wrote: > On 24.07.25 17:07, Philipp Stanner wrote: > > > +/** > > > + * DOC: Scheduler Fence Object > > > + * > > > + * The scheduler fence object (&struct drm_sched_fence) encapsulates the > > >

Re: [PATCH] drm/nouveau: Remove surplus struct member

2025-08-01 Thread Philipp Stanner
On Fri, 2025-08-01 at 15:42 +, Timur Tabi wrote: > On Fri, 2025-08-01 at 17:12 +0200, Danilo Krummrich wrote: > > On Fri Aug 1, 2025 at 4:50 PM CEST, Timur Tabi wrote: > > > Does mean that the TODO has been done, or that someone completely forgot > > > and now your patch > > > is > > > remove

[PATCH] drm/nouveau: Remove surplus struct member

2025-08-01 Thread Philipp Stanner
struct nouveau_channel contains the member 'accel_done' and a forgotten TODO which hints at that mechanism being removed in the "near future". Since that variable is read nowhere anymore, this "near future" is now. Remove the variable and the TODO. Signed-off-by:

[PATCH] drm/sched: Document race condition in drm_sched_fini()

2025-07-31 Thread Philipp Stanner
associated with a scheduler must be torn down first. Then, however, the locking should be removed from drm_sched_fini() altogether with an appropriate comment. Reported-by: James Flowers Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2...@fastmail.com/ Signed-off-by: Philipp

Re: [RFC v7 10/12] drm/sched: Break submission patterns with some randomness

2025-07-30 Thread Philipp Stanner
> > loosely called random. Under the assumption it will not always be the > > > same > > > entity which is re-joining the queue under these circumstances. > > > > > > Another way to look at this is that it is adding a little bit of limited > > > random

Re: [PATCH] drm/sched: Extend and update documentation

2025-07-24 Thread Philipp Stanner
Two comments from myself to open up room for discussion: On Thu, 2025-07-24 at 16:01 +0200, Philipp Stanner wrote: > From: Philipp Stanner > > The various objects and their memory lifetime used by the GPU scheduler > are currently not fully documented. > > Add documentat

[PATCH] drm/sched: Extend and update documentation

2025-07-24 Thread Philipp Stanner
From: Philipp Stanner The various objects and their memory lifetime used by the GPU scheduler are currently not fully documented. Add documentation describing the scheduler's objects. Improve the general documentation at a few other places. Co-developed-by: Christian König Signed-o

Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue.

2025-07-23 Thread Philipp Stanner
Hello, On Tue, 2025-07-22 at 13:05 -0700, James wrote: > On Mon, Jul 21, 2025, at 1:16 AM, Philipp Stanner wrote: > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > +Cc Tvrtko, who's currently reworking FIFO and RR. > > > > > > On Sun,

Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue.

2025-07-22 Thread Philipp Stanner
On Tue, 2025-07-22 at 01:45 -0700, Matthew Brost wrote: > On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote: > > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote: > > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > > > > On M

Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue.

2025-07-22 Thread Philipp Stanner
On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > > On

Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue.

2025-07-21 Thread Philipp Stanner
On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > +Cc Tvrtko, who's currently reworking FIFO and RR. > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > Fixes an issue where entities are added to the run queue in > > drm_sched_rq_update_fifo

Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue.

2025-07-21 Thread Philipp Stanner
+Cc Tvrtko, who's currently reworking FIFO and RR. On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > Fixes an issue where entities are added to the run queue in > drm_sched_rq_update_fifo_locked after being killed, causing a > slab-use-after-free error. > > Signed-off-by: James Flowers >

Re: [PATCH] drm/sched: Avoid double re-lock on the job free path

2025-07-18 Thread Philipp Stanner
On Fri, 2025-07-18 at 10:35 +0100, Tvrtko Ursulin wrote: > > On 18/07/2025 10:31, Philipp Stanner wrote: > > On Fri, 2025-07-18 at 08:13 +0100, Tvrtko Ursulin wrote: > > > > > > On 16/07/2025 21:44, Maíra Canal wrote: > > > > Hi Tvrtko, > > >

Re: [PATCH] drm/sched: Avoid double re-lock on the job free path

2025-07-18 Thread Philipp Stanner
> > > > in the > > > > > > > queue we can simply add the signaled check and have it return the > > > > > > > presence > > > > > > > of more jobs to be freed to the caller. That way the work item > > > > > > >

Re: [PATCH v2] drm/sched: Remove optimization that causes hang when killing dependent jobs

2025-07-17 Thread Philipp Stanner
On Thu, 2025-07-17 at 16:44 +0800, Lin.Cao wrote: > When application A submits jobs and application B submits a job with > a > dependency on A's fence, the normal flow wakes up the scheduler after > processing each job. However, the optimization in > drm_sched_entity_add_dependency_cb() uses a call

Re: [PATCH] drm/sched: Fix a race in DRM_GPU_SCHED_STAT_NO_HANG test

2025-07-17 Thread Philipp Stanner
e > job and add a new flag to verify that the code path had executed as > expected. > > Signed-off-by: Tvrtko Ursulin > Fixes: 1472e7549f84 ("drm/sched: Add new test for > DRM_GPU_SCHED_STAT_NO_HANG") > Cc: Maíra Canal > Cc: Philipp Stanner Applied to drm-misc

Re: [PATCH] drm/sched: Fix a race in DRM_GPU_SCHED_STAT_NO_HANG test

2025-07-16 Thread Philipp Stanner
> > > > from the > > > > mock scheduler job list and the drm_mock_sched_advance() call > > > > in the > > > > test > > > > will fail. > > > > > > > > Fix it by making the "don't reset" flag persist for the > > >

Re: [PATCH] drm/sched: Remove optimization that causes hang when killing dependent jobs

2025-07-16 Thread Philipp Stanner
On Wed, 2025-07-16 at 14:05 +0200, Greg Kroah-Hartman wrote: > On Wed, Jul 16, 2025 at 01:32:42PM +0200, Philipp Stanner wrote: > > On Wed, 2025-07-16 at 13:15 +0200, Greg Kroah-Hartman wrote: > > > On Wed, Jul 16, 2025 at 12:58:28PM +0200, Christian König wrote: > > >

Re: [PATCH] drm/sched: Remove optimization that causes hang when killing dependent jobs

2025-07-16 Thread Philipp Stanner
On Wed, 2025-07-16 at 13:15 +0200, Greg Kroah-Hartman wrote: > On Wed, Jul 16, 2025 at 12:58:28PM +0200, Christian König wrote: > > On 16.07.25 12:46, Philipp Stanner wrote: > > > +Cc Greg, Sasha > > > > > > On Wed, 2025-07-16 at 12:40 +0200, Michel Dän

Re: [PATCH] drm/sched: Remove optimization that causes hang when killing dependent jobs

2025-07-16 Thread Philipp Stanner
+Cc Greg, Sasha On Wed, 2025-07-16 at 12:40 +0200, Michel Dänzer wrote: > On 16.07.25 11:57, Philipp Stanner wrote: > > On Wed, 2025-07-16 at 09:43 +, cao, lin wrote: > > > > > > Hi Philipp, > > > > > > > > > Thank you for the review.

Re: [PATCH] drm/sched: Remove optimization that causes hang when killing dependent jobs

2025-07-16 Thread Philipp Stanner
: sta...@vger.kernel.org # v4.6+ Fixes: 777dbd458c89 ("drm/amdgpu: drop a dummy wakeup scheduler") P. > > > Thanks, > Lin > > > From: Philipp Stanner > Sent: Wednesday, July 16, 2025 16:33 > To: cao, lin ; dri-devel@lists.freedesktop.org > >

Re: [PATCH] drm/sched: Remove optimization that causes hang when killing dependent jobs

2025-07-16 Thread Philipp Stanner
On Tue, 2025-07-15 at 21:50 +0800, Lin.Cao wrote: > When application A submits jobs and application B submits a job with > a > dependency on A's fence, the normal flow wakes up the scheduler after > processing each job. However, the optimization in > drm_sched_entity_add_dependency_cb() uses a call

Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-15 Thread Philipp Stanner
On Tue, 2025-07-15 at 14:32 +0200, Christian König wrote: > On 15.07.25 14:20, Philipp Stanner wrote: > > On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote: > > > On 15.07.25 12:27, Philipp Stanner wrote: > > > > On Tue, 2025-07-15 at 09:51 +, cao, lin wr

Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-15 Thread Philipp Stanner
On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote: > On 15.07.25 12:27, Philipp Stanner wrote: > > On Tue, 2025-07-15 at 09:51 +, cao, lin wrote: > > > > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > >

Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-15 Thread Philipp Stanner
"optimization" consists of the work item not being scheduled. I think that was the piece of the puzzle I was missing. I / DRM tools will also include a link to this thread, so I think that will then be sufficient. Thx P. > > Thanks, > Lin > > > > >

Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-15 Thread Philipp Stanner
ence_scheduled() > > > Thanks, > Lin > > > From: Koenig, Christian > Sent: Monday, July 14, 2025 21:39 > To: pha...@kernel.org ; cao, lin ; > dri-devel@lists.freedesktop.org > Cc: Yin, ZhenGuo (Chris) ; Deng, Emily > ; d...@kernel.org ; > matthew.br...@in

Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-14 Thread Philipp Stanner
On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote: > > > On 14.07.25 14:46, Philipp Stanner wrote: > > regarding the patch subject: the prefix we use for the scheduler > > is: > > drm/sched: > > > > > > On Mon, 2025-07-14 at 14:23 +0800,

Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-14 Thread Philipp Stanner
regarding the patch subject: the prefix we use for the scheduler is: drm/sched: On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote: > When Application A submits jobs (a1, a2, a3) and application B submits s/Application/application > job b1 with a dependency on a2's scheduler fence, killing appli

Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-07-14 Thread Philipp Stanner
On Mon, 2025-07-14 at 11:23 +0200, Christian König wrote: > On 13.07.25 21:03, Maíra Canal wrote: > > Hi Christian, > > > > On 11/07/25 12:20, Christian König wrote: > > > On 11.07.25 15:37, Philipp Stanner wrote: > > > > On Fri, 2025-0

Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-07-11 Thread Philipp Stanner
On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote: > > > On 08.07.25 15:25, Maíra Canal wrote: > > When the DRM scheduler times out, it's possible that the GPU isn't hung; > > instead, a job just took unusually long (longer than the timeout) but is > > still running, and there is, thus, no

Re: [PATCH] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-11 Thread Philipp Stanner
nce, -ESRCH); >     WARN_ON(job->s_fence->parent); >     job->sched->ops->free_job(job); > -- > > > Thanks, > Lin > > > > > > From: Koenig, Christian > Sent: Thursday, July 10, 2025 15:52 > To: cao, lin ; dri-devel@lists.freedesktop.org >

Re: [PATCH] drm/sched: Avoid double re-lock on the job free path

2025-07-11 Thread Philipp Stanner
t way the work item does not optional nit: s/to free/to be freed Reads a bit more cleanly. > have > to lock the list again and repeat the signaled check. > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Phi

Re: [PATCH v4 0/8] drm/sched: Fix memory leaks with cancel_job() callback

2025-07-10 Thread Philipp Stanner
On Thu, 2025-07-10 at 14:54 +0200, Philipp Stanner wrote: > Changes in v4: >   - Change dev_err() to dev_warn() in pending_list emptiness check. > > Changes in v3: >   - Remove forgotten copy-paste artifacts. (Tvrtko) >   - Remove forgotten done_list struct member. (Tvrtko) >

Re: [PATCH v4 1/8] drm/panfrost: Fix scheduler workqueue bug

2025-07-10 Thread Philipp Stanner
On Thu, 2025-07-10 at 14:54 +0200, Philipp Stanner wrote: > When the GPU scheduler was ported to using a struct for its > initialization parameters, it was overlooked that panfrost creates a > distinct workqueue for timeout handling. > > The pointer to this new workqueue is not ini

[PATCH v4 8/8] drm/nouveau: Remove waitque for sched teardown

2025-07-10 Thread Philipp Stanner
nouveau_sched_cancel_job() the waitque is not necessary anymore. Remove the waitque. Signed-off-by: Philipp Stanner Acked-by: Danilo Krummrich --- drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +++-- drivers/gpu/drm/nouveau/nouveau_uvmm.c

[PATCH v4 6/8] drm/nouveau: Make fence container helper usable driver-wide

2025-07-10 Thread Philipp Stanner
: Philipp Stanner Acked-by: Danilo Krummrich --- drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_fence.h | 6 ++ 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm

[PATCH v4 7/8] drm/nouveau: Add new callback for scheduler teardown

2025-07-10 Thread Philipp Stanner
There is a new callback for always tearing the scheduler down in a leak-free, deadlock-free manner. Port Nouveau as its first user by providing the scheduler with a callback that ensures the fence context gets killed in drm_sched_fini(). Signed-off-by: Philipp Stanner Acked-by: Danilo Krummrich

[PATCH v4 5/8] drm/sched: Warn if pending_list is not empty

2025-07-10 Thread Philipp Stanner
drm_sched_fini() can leak jobs under certain circumstances. Warn if that happens. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/sched_main.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c

[PATCH v4 4/8] drm/sched/tests: Add unit test for cancel_job()

2025-07-10 Thread Philipp Stanner
The scheduler unit tests now provide a new callback, cancel_job(). This callback gets used by drm_sched_fini() for all still pending jobs to cancel them. Implement a new unit test to test this. Signed-off-by: Philipp Stanner Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/scheduler/tests

[PATCH v4 3/8] drm/sched/tests: Implement cancel_job() callback

2025-07-10 Thread Philipp Stanner
the code where necessary. Signed-off-by: Philipp Stanner --- .../gpu/drm/scheduler/tests/mock_scheduler.c | 68 +++ drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 - 2 files changed, 25 insertions(+), 44 deletions(-) diff --git a/drivers/gpu/drm/scheduler/tests

[PATCH v4 2/8] drm/sched: Avoid memory leaks with cancel_job() callback

2025-07-10 Thread Philipp Stanner
-tvrtko.ursu...@igalia.com/ Signed-off-by: Philipp Stanner Reviewed-by: Maíra Canal --- drivers/gpu/drm/scheduler/sched_main.c | 34 -- include/drm/gpu_scheduler.h| 18 ++ 2 files changed, 39 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm

[PATCH v4 1/8] drm/panfrost: Fix scheduler workqueue bug

2025-07-10 Thread Philipp Stanner
-4d55-aa47-c35cd7861...@igalia.com/ Signed-off-by: Philipp Stanner --- drivers/gpu/drm/panfrost/panfrost_job.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c index 5657106c2f7d..15e2d505550f 10

[PATCH v4 0/8] drm/sched: Fix memory leaks with cancel_job() callback

2025-07-10 Thread Philipp Stanner
e still in drm_sched.pending_list. This series solves the leaks in a backwards-compatible manner by adding a new, optional callback. If that callback is implemented, the scheduler uses it to cancel all jobs from pending_list and then frees them. Philipp Stanner (8): drm/panfrost: Fix scheduler workqueue

Re: [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG

2025-07-09 Thread Philipp Stanner
> [3] > https://lore.kernel.org/dri-devel/20250430210643.57924-1-mca...@igalia.com/T/ > > Best Regards, > - Maíra > > --- > v1 -> v2: > > - Fix several grammar nits across the documentation and commit messages. > - Drop "drm/sched: Always free the job after the

Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-07-09 Thread Philipp Stanner
>ops->free_job()` - leading to a memory leak. > > To solve these problems, create a new `drm_gpu_sched_stat`, called > DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. > The > new status will indicate that the job must be reinserted into > `sched->pending_list`, an

[PATCH v3 7/7] drm/nouveau: Remove waitque for sched teardown

2025-07-09 Thread Philipp Stanner
nouveau_sched_cancel_job() the waitque is not necessary anymore. Remove the waitque. Signed-off-by: Philipp Stanner Acked-by: Danilo Krummrich --- drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +++-- drivers/gpu/drm/nouveau/nouveau_uvmm.c

[PATCH v3 6/7] drm/nouveau: Add new callback for scheduler teardown

2025-07-09 Thread Philipp Stanner
There is a new callback for always tearing the scheduler down in a leak-free, deadlock-free manner. Port Nouveau as its first user by providing the scheduler with a callback that ensures the fence context gets killed in drm_sched_fini(). Signed-off-by: Philipp Stanner Acked-by: Danilo Krummrich

[PATCH v3 4/7] drm/sched: Warn if pending_list is not empty

2025-07-09 Thread Philipp Stanner
drm_sched_fini() can leak jobs under certain circumstances. Warn if that happens. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/sched_main.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c

[PATCH v3 5/7] drm/nouveau: Make fence container helper usable driver-wide

2025-07-09 Thread Philipp Stanner
: Philipp Stanner Acked-by: Danilo Krummrich --- drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_fence.h | 6 ++ 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm

[PATCH v3 3/7] drm/sched/tests: Add unit test for cancel_job()

2025-07-09 Thread Philipp Stanner
The scheduler unit tests now provide a new callback, cancel_job(). This callback gets used by drm_sched_fini() for all still pending jobs to cancel them. Implement a new unit test to test this. Signed-off-by: Philipp Stanner Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/scheduler/tests

[PATCH v3 2/7] drm/sched/tests: Implement cancel_job() callback

2025-07-09 Thread Philipp Stanner
the code where necessary. Signed-off-by: Philipp Stanner --- .../gpu/drm/scheduler/tests/mock_scheduler.c | 68 +++ drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 - 2 files changed, 25 insertions(+), 44 deletions(-) diff --git a/drivers/gpu/drm/scheduler/tests

[PATCH v3 1/7] drm/sched: Avoid memory leaks with cancel_job() callback

2025-07-09 Thread Philipp Stanner
-tvrtko.ursu...@igalia.com/ Signed-off-by: Philipp Stanner Reviewed-by: Maíra Canal --- drivers/gpu/drm/scheduler/sched_main.c | 34 -- include/drm/gpu_scheduler.h| 18 ++ 2 files changed, 39 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm

[PATCH v3 0/7] drm/sched: Fix memory leaks with cancel_job() callback

2025-07-09 Thread Philipp Stanner
manner by adding a new, optional callback. If that callback is implemented, the scheduler uses it to cancel all jobs from pending_list and then frees them. Philipp Stanner (7): drm/sched: Avoid memory leaks with cancel_job() callback drm/sched/tests: Implement cancel_job() callback drm/sched/

[PATCH] drm/panfrost: Fix scheduler workqueue bug

2025-07-09 Thread Philipp Stanner
-4d55-aa47-c35cd7861...@igalia.com/ Signed-off-by: Philipp Stanner --- drivers/gpu/drm/panfrost/panfrost_job.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c index 5657106c2f7d..15e2d505550f 10

Re: [PATCH v4] drm/sched: Use struct for drm_sched_init() params

2025-07-09 Thread Philipp Stanner
On Tue, 2025-07-08 at 14:02 +0100, Tvrtko Ursulin wrote: > > > On 11/02/2025 11:14, Philipp Stanner wrote: > > drm_sched_init() has a great many parameters and upcoming new > > functionality for the scheduler might add even more. Generally, the > > great number of par

Re: [PATCH] drm/sched: Consolidate drm_sched_rq_select_entity_rr

2025-07-09 Thread Philipp Stanner
On Tue, 2025-07-08 at 13:21 +0100, Tvrtko Ursulin wrote: > Extract out two copies of the identical code to function epilogue to > make > it smaller and more readable. > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Bros

Re: [PATCH v6 14/15] drm/sched: Queue all free credits in one worker invocation

2025-07-08 Thread Philipp Stanner
> > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner > --- >  drivers/gpu/drm/scheduler/sched_internal.h |   2 - >  drivers/gpu/drm/scheduler/sched_main.c | 132 ++- > -- >  
