Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-22 Thread Matthew Brost
On Fri, Nov 22, 2024 at 05:24:03PM +0100, Philipp Stanner wrote:
> On Thu, 2024-11-21 at 10:05 -0800, Matthew Brost wrote:
> > On Tue, Oct 29, 2024 at 08:22:22AM +0100, Philipp Stanner wrote:
> > > Christian, Sima?
> > > 
> > > Matthew? (+CC)
> > > 
> > 
> > Sorry, catching up here. Internally in Xe we ref count the scheduler
> > as the scheduler is embedded into our user queue structure
> > (xe_exec_queue). Jobs ref the queue once they are armed. Upon queue
> > killing we set TDR to zero, which will flush out all jobs / signal
> > all the jobs' fences. Once the ref count of the queue is zero we do
> > hardware / firmware teardown of the queue and then finally call
> > drm_sched_fini before freeing the queue's memory (and thus the
> > scheduler).
> > 
> > This flow seems to work pretty well but largely bypasses the
> > scheduler functions to implement this. Not sure if this flow could
> > be normalized at all, but I would expect usage models with a 1:1
> > relationship between queue and scheduler to look similar to Xe's
> > flow.
> 
One of the solutions I proposed basically does the same: jobs added to
gpu_scheduler.pending_list refcount the scheduler.

If we went for that solution, I assume the separate refcounting could
be removed from Xe, for example.
> 

That should work. We likely need a vfunc for when the refcount goes to
zero which could either free the scheduler or initiate the hw / fw
teardown process. In Xe the queue and scheduler actually live for a
while after the refcount is zero due to several ping / pongs with the
fw. As long as we can still support that, moving the refcount from our
queues to the scheduler seems reasonable. Ofc we need buy-in from the
other scheduler use cases too.
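As a toy sketch of that idea (userspace C; every name here is invented, not drm_sched or Xe API, and real kernel code would be a kref with proper locking), the refcount-plus-release-vfunc shape could look like:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only -- not drm_sched API. One reference per pending job
 * plus one for the driver; release() is the vfunc run when the count
 * hits zero (free the scheduler, or kick off hw/fw teardown). */
struct toy_sched {
	int refcount;
	void (*release)(struct toy_sched *sched);
};

static int released;	/* observable side effect for the example */

static void toy_release(struct toy_sched *sched)
{
	released = 1;
	free(sched);
}

static struct toy_sched *toy_sched_create(void)
{
	struct toy_sched *sched = calloc(1, sizeof(*sched));

	sched->refcount = 1;	/* the driver's own reference */
	sched->release = toy_release;
	return sched;
}

static void toy_sched_get(struct toy_sched *sched)	/* job armed */
{
	sched->refcount++;
}

static void toy_sched_put(struct toy_sched *sched)	/* fence signaled */
{
	if (--sched->refcount == 0)
		sched->release(sched);
}
```

The point being: teardown can only run once the pending list has drained, because the last put is what triggers it.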

> 
> >  Maybe we could
> > write this down as a design guideline so other drivers don't have to
> > figure this out - it took me a bit to land on this.
> 
> I already improved the documentation a while ago [1], but didn't detail
> how drivers are supposed to solve this. The reason is that I don't want
> to encourage more separate solutions while we're working on solving it
> centrally.

Let me read this and add comments there if needed.

Matt

> 
> > With that, in general I agree with Christian's patch here complaining
> > about pending jobs if drm_sched_fini is called.
> 
> Yes, firing a WARN_ON there is also fine from my POV.
> 
> P.
> 
> 
> [1] 
> https://lore.kernel.org/dri-devel/20241105143137.71893-2-pstan...@redhat.com/
> 
> 
> > 
> > > Opinions on the below?
> > > 
> > > tl;dr:
> > > I still think it's a good thing to detectably block in
> > > drm_sched_fini(), or at the very least drm_sched_flush(), because
> > > then
> > 
> > See above. I don't think drm_sched_fini should block, rather just
> > complain as this patch does, which says 'go fix your driver'.
> > 
> > Matt
> > 
> > > you'll find out that the driver is broken and can repair it.
> > > 
> > > P.
> > > 
> > > 
> > > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König
> > > > > wrote:
> > > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > > > wrote:
> > > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König
> > > > > > > > > > wrote:
> > > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian
> > > > > > > > > > > > König
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > Tearing down the scheduler with jobs still on
> > > > > > > > > > > > > the
> > > > > > > > > > > > > pending
> > > > > > > > > > > > > list
> > > > > > > > > > > > > can
> > > > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > > > drivers try
> > > > > > > > > > > > > to
> > > > > > > > > > > > > destroy a scheduler which still has work pushed
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the HW.
> > > > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > > > waitque-
> > > > > > > > > > > > solution?
> > > > > > > > > > > I don't remember seeing anything. What have I
> > > > > > > > > > > missed?
> > > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > > > > > > > > Mhm, I didn't get that in my inbox for some reason.
> > > > > > > > > 
> > > > > > > > > Interesting approach, I'm just not sure if we can or
> > > > > > > > > should
> > > > > > > > > wait in
> > > > > > > > > drm_sched_fini().
> > > > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > > > starts
> > > > > > > is
> > > > > > > always a bug, right?
> > > > > > 
> > > > > > Correct, the question is how

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-22 Thread Philipp Stanner
On Thu, 2024-11-21 at 10:05 -0800, Matthew Brost wrote:
> On Tue, Oct 29, 2024 at 08:22:22AM +0100, Philipp Stanner wrote:
> > Christian, Sima?
> > 
> > Matthew? (+CC)
> > 
> 
> Sorry, catching up here. Internally in Xe we ref count the scheduler
> as the scheduler is embedded into our user queue structure
> (xe_exec_queue). Jobs ref the queue once they are armed. Upon queue
> killing we set TDR to zero, which will flush out all jobs / signal
> all the jobs' fences. Once the ref count of the queue is zero we do
> hardware / firmware teardown of the queue and then finally call
> drm_sched_fini before freeing the queue's memory (and thus the
> scheduler).
> 
> This flow seems to work pretty well but largely bypasses the scheduler
> functions to implement this. Not sure if this flow could be normalized
> at all, but I would expect usage models with a 1:1 relationship
> between queue and scheduler to look similar to Xe's flow.

One of the solutions I proposed basically does the same: jobs added to
gpu_scheduler.pending_list refcount the scheduler.

If we went for that solution, I assume the separate refcounting could
be removed from Xe, for example.


>  Maybe we could
> write this down as a design guideline so other drivers don't have to
> figure this out - it took me a bit to land on this.

I already improved the documentation a while ago [1], but didn't detail
how drivers are supposed to solve this. The reason is that I don't want
to encourage more separate solutions while we're working on solving it
centrally.

> With that, in general I agree with Christian's patch here complaining
> about pending jobs if drm_sched_fini is called.

Yes, firing a WARN_ON there is also fine from my POV.

P.


[1] 
https://lore.kernel.org/dri-devel/20241105143137.71893-2-pstan...@redhat.com/


> 
> > Opinions on the below?
> > 
> > tl;dr:
> > I still think it's a good thing to detectably block in
> > drm_sched_fini(), or at the very least drm_sched_flush(), because
> > then
> 
> See above. I don't think drm_sched_fini should block, rather just
> complain as this patch does, which says 'go fix your driver'.
> 
> Matt
> 
> > you'll find out that the driver is broken and can repair it.
> > 
> > P.
> > 
> > 
> > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König
> > > > wrote:
> > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > > wrote:
> > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian
> > > > > > > > > > > König
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Tearing down the scheduler with jobs still on
> > > > > > > > > > > > the
> > > > > > > > > > > > pending
> > > > > > > > > > > > list
> > > > > > > > > > > > can
> > > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > > drivers try
> > > > > > > > > > > > to
> > > > > > > > > > > > destroy a scheduler which still has work pushed
> > > > > > > > > > > > to
> > > > > > > > > > > > the HW.
> > > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > > waitque-
> > > > > > > > > > > solution?
> > > > > > > > > > I don't remember seeing anything. What have I
> > > > > > > > > > missed?
> > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > > > > > > > Mhm, I didn't get that in my inbox for some reason.
> > > > > > > > 
> > > > > > > > Interesting approach, I'm just not sure if we can or
> > > > > > > > should
> > > > > > > > wait in
> > > > > > > > drm_sched_fini().
> > > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > > starts
> > > > > > is
> > > > > > always a bug, right?
> > > > > 
> > > > > Correct, the question is how to avoid that.
> > > > > 
> > > > > > If so, what are the disadvantages of waiting in
> > > > > > drm_sched_fini()?
> > > > > > We
> > > > > > could block buggy drivers as I see it. Which wouldn't be
> > > > > > good,
> > > > > > but
> > > > > > could then be fixed on drivers' side.
> > > > > 
> > > > > Sima explained that pretty well: Don't block in fops->close,
> > > > > do
> > > > > that in
> > > > > fops->flush instead.
> > > > 
> > > > I agree that we shouldn't block in close(), but this
> > > > effectively
> > > > means that we
> > > > need to reference count the scheduler, right?
> > > > 
> > > > Otherwise, if we allow to just skip / interrupt the teardown,
> > > > we
> > > > can
> > > > still leak
> > > > memory.
> > > 
> > > Having thought about it, I agree with Danilo. Having something
>

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-21 Thread Matthew Brost
On Thu, Nov 21, 2024 at 10:06:54AM +0100, Philipp Stanner wrote:
> +CC Thomas Hellström
> 
> 
> On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote:
> >  Am 15.11.24 um 17:07 schrieb Philipp Stanner:
> >  
> > >  
> > > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> > >  
> > > >  
> > > >  Am 15.11.24 um 14:55 schrieb Philipp Stanner:
> > > >  
> > > >  
> > > > >  
> > > > > [SNIP] 
> > > > >  
> > > >  
> > >  
> > > >  
> > > > >  
> > > > > But you wouldn't want to aim for getting rid of struct
> > > > > drm_sched_job
> > > > > completely, or would you?
> > > > >  
> > > > >  
> > > >  
> > > >  
> > > >  No, the drm_sched_job structure was added to aid the single
> > > > producer
> > > > single consumer queue and so made it easier to put the jobs into
> > > > a
> > > > container.
> > > >  
> > > >  
> > > >  In my mind the full solution for running jobs looks like this:
> > > >  
> > > >  1. Driver enqueues with drm_sched_job_arm()
> > > >  2. job ends up in pending_list
> > > >  3. Sooner or later scheduler calls run_job()
> > > >  4. In return scheduler gets a dma_fence representing the
> > > > resulting
> > > > hardware operation.
> > > >  5. This fence is put into a container to keep around what the hw
> > > > is
> > > > actually executing.
> > > >  6. At some point the fence gets signaled informing the scheduler
> > > > that the next job can be pushed if enough credits are now
> > > > available.
> > > >  
> > > >  There is no free_job callback any more because after run_job is
> > > > called the scheduler is done with the job and only the dma_fence
> > > > which represents the actual HW operation is the object the
> > > > scheduler now works with.
> > > >  
> > > >  We would still need something like a kill_job callback which is
> > > > used
> > > > when an entity is released without flushing all jobs (see
> > > > drm_sched_entity_kill()), but that is then only used for a
> > > > specific
> > > > corner case.
> > > >  
> > >  
> > > Can't we just limit the scheduler's responsibility to telling the
> > > driver that it has to flush, and if it doesn't it's a bug?
> > >  
> >  
> >  We still need to remove the pending jobs from the scheduler if
> > flushing times out.
> >  
> >  Without timing out flushing and/or aborting when the process was
> > killed we run into the same problem again that we don't want to block
> > on _fini().
> >    
> > >  
> > > >  
> > > >  Blocking for the cleanup in drm_sched_fini() then becomes a
> > > > trivial
> > > > dma_fence_wait() on the remaining unsignaled HW fences and
> > > > eventually
> > > > on the latest killed fence.
> > > >  
> > >  
> > > But that results in exactly the same situation as my waitque
> > > solution,
> > > doesn't it?
> > >  
> >  
> >  The key part is that dma_fence's have a documented requirement to
> > never block infinitely.
> >  
> >  
> > >  
> > > Blocking infinitely in drm_sched_fini():
> > > 
> > > If the driver doesn't guarantee that all fences will get signaled,
> > > then
> > > wait_event() in the waitque solution will block forever. The same
> > > with
> > > dma_fence_wait().
> > > 
> > > Or are you aiming at an interruptible dma_fence_wait(), thereby not
> > > blocking SIGKILL?
> > >  
> >  
> >  That is basically what drm_sched_entity_flush() is already doing.
> >  
> >  
> > >  
> > > That then would still result in leaks. So basically the same
> > > situation
> > > as with an interruptible drm_sched_flush() function.
> > > 
> > > (Although I agree that would probably be more elegant)
> > >  
> >  
> >  If the wait_event really would just wait for dma_fences, then yes.
> >  
> >  The problem with the waitqueue approach is that we need to wait for
> > the free_job callback and that callback is normally called from a
> > work item without holding any additional locks.
> >  
> >  When drm_sched_fini starts to block for that we create a chain of
> > new dependencies, since that one is most likely called from a file
> > descriptor destruction.
> >   
> >  
> > >  
> > > 
> > >  
> > > >  
> > > >  
> > > >  The problem with this approach is that the idea of re-submitting
> > > > jobs in a GPU reset by the scheduler is then basically dead. But
> > > > to
> > > > be honest that never worked correctly.
> > > >  
> > > >  See the deprecated warning I already put on
> > > > drm_sched_resubmit_jobs(). The job lifetime is not really well
> > > > defined because of this, see the free_guilty hack as well.
> > > >  
> > >  
> > > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So
> > > if
> > > we go for this approach we have to port them, first. That doesn't
> > > sound
> > > trivial to me
> > >  
> >  
> >  Yeah, completely agree. I think the scheduler should provide
> > something like drm_sched_for_each_pending_fence() which helps to
> > iterate over all the pending HW fences.
> >  
> >  The individual drivers could then decide by themselves what to do,
> > e.g. upcast the HW fence into the job and push it again or similar.

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-21 Thread Matthew Brost
On Tue, Oct 29, 2024 at 08:22:22AM +0100, Philipp Stanner wrote:
> Christian, Sima?
> 
> Matthew? (+CC)
> 

Sorry, catching up here. Internally in Xe we ref count the scheduler as
the scheduler is embedded into our user queue structure (xe_exec_queue).
Jobs ref the queue once they are armed. Upon queue killing we set TDR to
zero, which will flush out all jobs / signal all the jobs' fences. Once
the ref count of the queue is zero we do hardware / firmware teardown of
the queue and then finally call drm_sched_fini before freeing the
queue's memory (and thus the scheduler).

This flow seems to work pretty well but largely bypasses the scheduler
functions to implement this. Not sure if this flow could be normalized
at all, but I would expect usage models with a 1:1 relationship between
queue and scheduler to look similar to Xe's flow. Maybe we could write
this down as a design guideline so other drivers don't have to figure
this out - it took me a bit to land on this.
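A toy sketch of that teardown ordering (all names invented, not actual Xe code; single-threaded, no locking):

```c
#include <assert.h>

/* Toy model of the Xe-style teardown ordering described above; none of
 * these names are real Xe or drm_sched symbols. */
struct toy_queue {
	int refcount;		/* one ref per armed job + one for the driver */
	int pending_jobs;
	int torn_down;		/* stands in for hw/fw teardown + drm_sched_fini */
};

static void toy_queue_put(struct toy_queue *q)
{
	if (--q->refcount > 0)
		return;
	/* Last ref gone: hw/fw teardown, then drm_sched_fini(), then free
	 * the queue memory. The pending list is guaranteed empty here. */
	assert(q->pending_jobs == 0);
	q->torn_down = 1;
}

/* "Set TDR to zero": every pending job times out immediately, its
 * fence gets signaled, and the job drops its queue reference. */
static void toy_queue_kill(struct toy_queue *q)
{
	while (q->pending_jobs > 0) {
		q->pending_jobs--;
		toy_queue_put(q);
	}
	toy_queue_put(q);	/* finally drop the driver's own reference */
}
```

The invariant the flow buys you is the assert: by the time teardown runs, no pending jobs can exist, so drm_sched_fini never sees one.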

With that, in general I agree with Christian's patch here complaining
about pending jobs if drm_sched_fini is called.

> Opinions on the below?
> 
> tl;dr:
> I still think it's a good thing to detectably block in
> drm_sched_fini(), or at the very least drm_sched_flush(), because then

See above. I don't think drm_sched_fini should block, rather just
complain as this patch does, which says 'go fix your driver'.

Matt

> you'll find out that the driver is broken and can repair it.
> 
> P.
> 
> 
> On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > wrote:
> > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König
> > > > > > > > > > wrote:
> > > > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > > > pending
> > > > > > > > > > > list
> > > > > > > > > > > can
> > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > drivers try
> > > > > > > > > > > to
> > > > > > > > > > > destroy a scheduler which still has work pushed to
> > > > > > > > > > > the HW.
> > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > waitque-
> > > > > > > > > > solution?
> > > > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > > > > > > Mhm, I didn't get that in my inbox for some reason.
> > > > > > > 
> > > > > > > Interesting approach, I'm just not sure if we can or should
> > > > > > > wait in
> > > > > > > drm_sched_fini().
> > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > starts
> > > > > is
> > > > > always a bug, right?
> > > > 
> > > > Correct, the question is how to avoid that.
> > > > 
> > > > > If so, what are the disadvantages of waiting in
> > > > > drm_sched_fini()?
> > > > > We
> > > > > could block buggy drivers as I see it. Which wouldn't be good,
> > > > > but
> > > > > could then be fixed on drivers' side.
> > > > 
> > > > Sima explained that pretty well: Don't block in fops->close, do
> > > > that in
> > > > fops->flush instead.
> > > 
> > > I agree that we shouldn't block in close(), but this effectively
> > > means that we
> > > need to reference count the scheduler, right?
> > > 
> > > Otherwise, if we allow to just skip / interrupt the teardown, we
> > > can
> > > still leak
> > > memory.
> > 
> > Having thought about it, I agree with Danilo. Having something that
> > shall wait on green light, but can be interrupted, is no guarantee
> > and
> > therefore not a feasible solution.
> > 
> > To break down the solution space, these seem to be our options:
> >    1. We have something (either drm_sched_fini() or a helper, e.g.,
> >   drm_sched_flush()) that definitely blocks until the pending
> > list
> >   has become empty.
> >    2. We have jobs reference-count the scheduler, so the latter can
> >   outlive the driver and will be freed some time later.
> > 
> > Can anyone think of a third solution?
> > 
> > 
> > Solution #1 has the problem of obviously blocking unconditionally if
> > the driver didn't make sure that all fences will be signaled within
> > reasonable time. In my opinion, this would actually be an advantage,
> > because it will be *very* noticable and force users to repair their
> > driver. The driver *has* to guarantee that all fences will be
> > signaled.
> > If the driver has to do fishy things, having the blocking outsourced
> > to
> > the helper drm_sched_flush()

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-21 Thread Philipp Stanner
+CC Thomas Hellström


On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote:
>  Am 15.11.24 um 17:07 schrieb Philipp Stanner:
>  
> >  
> > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> >  
> > >  
> > >  Am 15.11.24 um 14:55 schrieb Philipp Stanner:
> > >  
> > >  
> > > >  
> > > > [SNIP] 
> > > >  
> > >  
> >  
> > >  
> > > >  
> > > > But you wouldn't want to aim for getting rid of struct
> > > > drm_sched_job
> > > > completely, or would you?
> > > >  
> > > >  
> > >  
> > >  
> > >  No, the drm_sched_job structure was added to aid the single
> > > producer
> > > single consumer queue and so made it easier to put the jobs into
> > > a
> > > container.
> > >  
> > >  
> > >  In my mind the full solution for running jobs looks like this:
> > >  
> > >  1. Driver enqueues with drm_sched_job_arm()
> > >  2. job ends up in pending_list
> > >  3. Sooner or later scheduler calls run_job()
> > >  4. In return scheduler gets a dma_fence representing the
> > > resulting
> > > hardware operation.
> > >  5. This fence is put into a container to keep around what the hw
> > > is
> > > actually executing.
> > >  6. At some point the fence gets signaled informing the scheduler
> > > that the next job can be pushed if enough credits are now
> > > available.
> > >  
> > >  There is no free_job callback any more because after run_job is
> > > called the scheduler is done with the job and only the dma_fence
> > > which represents the actual HW operation is the object the
> > > scheduler now works with.
> > >  
> > >  We would still need something like a kill_job callback which is
> > > used
> > > when an entity is released without flushing all jobs (see
> > > drm_sched_entity_kill()), but that is then only used for a
> > > specific
> > > corner case.
> > >  
> >  
> > Can't we just limit the scheduler's responsibility to telling the
> > driver that it has to flush, and if it doesn't it's a bug?
> >  
>  
>  We still need to remove the pending jobs from the scheduler if
> flushing times out.
>  
>  Without timing out flushing and/or aborting when the process was
> killed we run into the same problem again that we don't want to block
> on _fini().
>    
> >  
> > >  
> > >  Blocking for the cleanup in drm_sched_fini() then becomes a
> > > trivial
> > > dma_fence_wait() on the remaining unsignaled HW fences and
> > > eventually
> > > on the latest killed fence.
> > >  
> >  
> > But that results in exactly the same situation as my waitque
> > solution,
> > doesn't it?
> >  
>  
>  The key part is that dma_fence's have a documented requirement to
> never block infinitely.
>  
>  
> >  
> > Blocking infinitely in drm_sched_fini():
> > 
> > If the driver doesn't guarantee that all fences will get signaled,
> > then
> > wait_event() in the waitque solution will block forever. The same
> > with
> > dma_fence_wait().
> > 
> > Or are you aiming at an interruptible dma_fence_wait(), thereby not
> > blocking SIGKILL?
> >  
>  
>  That is basically what drm_sched_entity_flush() is already doing.
>  
>  
> >  
> > That then would still result in leaks. So basically the same
> > situation
> > as with an interruptible drm_sched_flush() function.
> > 
> > (Although I agree that would probably be more elegant)
> >  
>  
>  If the wait_event really would just wait for dma_fences, then yes.
>  
>  The problem with the waitqueue approach is that we need to wait for
> the free_job callback and that callback is normally called from a
> work item without holding any additional locks.
>  
>  When drm_sched_fini starts to block for that we create a chain of
> new dependencies, since that one is most likely called from a file
> descriptor destruction.
>   
>  
> >  
> > 
> >  
> > >  
> > >  
> > >  The problem with this approach is that the idea of re-submitting
> > > jobs in a GPU reset by the scheduler is then basically dead. But
> > > to
> > > be honest that never worked correctly.
> > >  
> > >  See the deprecated warning I already put on
> > > drm_sched_resubmit_jobs(). The job lifetime is not really well
> > > defined because of this, see the free_guilty hack as well.
> > >  
> >  
> > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So
> > if
> > we go for this approach we have to port them, first. That doesn't
> > sound
> > trivial to me
> >  
>  
>  Yeah, completely agree. I think the scheduler should provide
> something like drm_sched_for_each_pending_fence() which helps to
> iterate over all the pending HW fences.
>  
>  The individual drivers could then decide by themselves what to do,
> e.g. upcast the HW fence into the job and push it again or similar.

I have talked with Dave and we would be interested in exploring the
direction of getting rid of backend_ops.free_job() and doing the
modifications this implies.

Matthew, Thomas, any hard NACKs on principle, or can we look into this
in a future patch series?


P.





>  
>  But what we really need to get away from are those fence
> pre-requisite violations Sima pointed out. For example we can't
> allocate memory for run_job.

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-18 Thread Christian König

Am 15.11.24 um 17:07 schrieb Philipp Stanner:

On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:

  Am 15.11.24 um 14:55 schrieb Philipp Stanner:
  

[SNIP]
But you wouldn't want to aim for getting rid of struct
drm_sched_job
completely, or would you?
  
  
  No, the drm_sched_job structure was added to aid the single producer

single consumer queue and so made it easier to put the jobs into a
container.
  
  
  In my mind the full solution for running jobs looks like this:
  
  1. Driver enqueues with drm_sched_job_arm()

  2. job ends up in pending_list
  3. Sooner or later scheduler calls run_job()
  4. In return scheduler gets a dma_fence representing the resulting
hardware operation.
  5. This fence is put into a container to keep around what the hw is
actually executing.
  6. At some point the fence gets signaled informing the scheduler
that the next job can be pushed if enough credits are now available.
  
  There is no free_job callback any more because after run_job is

called the scheduler is done with the job and only the dma_fence
which represents the actual HW operation is the object the
scheduler now works with.
  
  We would still need something like a kill_job callback which is used

when an entity is released without flushing all jobs (see
drm_sched_entity_kill()), but that is then only used for a specific
corner case.

Can't we just limit the scheduler's responsibility to telling the
driver that it has to flush, and if it doesn't it's a bug?


We still need to remove the pending jobs from the scheduler if flushing 
times out.


Without timing out flushing and/or aborting when the process was killed 
we run into the same problem again that we don't want to block on _fini().

  Blocking for the cleanup in drm_sched_fini() then becomes a trivial
dma_fence_wait() on the remaining unsignaled HW fences and eventually
on the latest killed fence.

But that results in exactly the same situation as my waitque solution,
doesn't it?


The key part is that dma_fence's have a documented requirement to never 
block infinitely.



Blocking infinitely in drm_sched_fini():

If the driver doesn't guarantee that all fences will get signaled, then
wait_event() in the waitque solution will block forever. The same with
dma_fence_wait().

Or are you aiming at an interruptible dma_fence_wait(), thereby not
blocking SIGKILL?


That is basically what drm_sched_entity_flush() is already doing.


That then would still result in leaks. So basically the same situation
as with an interruptible drm_sched_flush() function.

(Although I agree that would probably be more elegant)


If the wait_event really would just wait for dma_fences, then yes.

The problem with the waitqueue approach is that we need to wait for the 
free_job callback and that callback is normally called from a work item 
without holding any additional locks.


When drm_sched_fini starts to block for that we create a chain of new
dependencies, since that one is most likely called from a file
descriptor destruction.




  
  The problem with this approach is that the idea of re-submitting

jobs in a GPU reset by the scheduler is then basically dead. But to
be honest that never worked correctly.
  
  See the deprecated warning I already put on

drm_sched_resubmit_jobs(). The job lifetime is not really well
defined because of this, see the free_guilty hack as well.

drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
we go for this approach we have to port them, first. That doesn't sound
trivial to me


Yeah, completely agree. I think the scheduler should provide something 
like drm_sched_for_each_pending_fence() which helps to iterate over all 
the pending HW fences.


The individual drivers could then decide by themselves what to do, e.g.
upcast the HW fence into the job and push it again or similar.
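A toy sketch of that shape (invented names, not a real API proposal): the scheduler keeps only the hw fences returned by run_job(), a for_each helper lets drivers walk them, and fini reduces to waiting for the unsignaled ones (here it just fails instead of blocking).

```c
#include <assert.h>

/* Toy model: the scheduler tracks hw fences instead of jobs. All names
 * are invented; real code would use dma_fence and dma_fence_wait(). */
struct toy_fence { int signaled; };

#define TOY_MAX_PENDING 16

struct toy_sched {
	struct toy_fence *pending[TOY_MAX_PENDING];
	int npending;
};

/* What run_job() hands back is all the scheduler keeps. */
static void toy_sched_track(struct toy_sched *s, struct toy_fence *f)
{
	s->pending[s->npending++] = f;
}

/* The drm_sched_for_each_pending_fence() idea: let the driver walk the
 * pending hw fences and decide what to do with each one. */
#define toy_for_each_pending_fence(s, i, f) \
	for ((i) = 0; (i) < (s)->npending && ((f) = (s)->pending[(i)]); (i)++)

/* fini becomes trivial: succeed only once every fence has signaled.
 * (The real version would dma_fence_wait() instead of returning -1.) */
static int toy_sched_fini(struct toy_sched *s)
{
	int i;
	struct toy_fence *f;

	toy_for_each_pending_fence(s, i, f)
		if (!f->signaled)
			return -1;
	s->npending = 0;
	return 0;
}
```

Note there is no free_job() anywhere in this model: once run_job() has returned, the fence is the only thing the scheduler holds.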


But what we really need to get away from are those fence pre-requisite 
violations Sima pointed out. For example we can't allocate memory for 
run_job.


Regards,
Christian.




P.

  
  Regards,

  Christian.
  
  
  



Grüße,
P.

  
  


Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-15 Thread Philipp Stanner
On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
>  Am 15.11.24 um 14:55 schrieb Philipp Stanner:
>  
> > [SNIP] 
> > >  
> > > A good bunch of the problems we have here are caused by abusing
> > > the
> > > job 
> > > as state for executing the submission on the hardware.
> > > 
> > > The original design idea of the scheduler instead was to only
> > > have
> > > the 
> > > job as intermediate state between submission and picking what to
> > > execute 
> > > next. E.g. the scheduler in that view is just a pipeline where you
> > > push 
> > > jobs in on one end and jobs come out on the other end.
> > >  
> >  
> > So let's get a bit more precise about this:
> >    1. Driver enqueues with drm_sched_job_arm()
> >    2. job ends up in pending_list
> >    3. Sooner or later scheduler calls run_job()
> >    4. Job is pushed to hardware
> >    5. Fence gets signaled
> >    6. ???
> > 
> > What would the "come out on the other end" part you describe look
> > like?
> > 
> > How would jobs get removed from pending_list and, accordingly, how
> > would we avoid leaks?
> >  
>  
>  Let me describe the full solution a bit further down.
>  
>  
>  
> >  
> > >  
> > > In that approach the object which represents the hardware
> > > execution
> > > is 
> > > the dma_fence instead of the job. And the dma_fence has a well
> > > defined 
> > > lifetime, error handling, etc...
> > > 
> > > So when we go back to this original approach it would mean that
> > > we
> > > don't 
> > > need to wait for any cleanup because the scheduler wouldn't be
> > > involved 
> > > in the execution of the jobs on the hardware any more.
> > >  
> >  
> > It would be involved (to speak precisely) in the sense of the
> > scheduler
> > still being the one who pushes the job onto the hardware, agreed?
> >  
>  
>  Yes, exactly.
>  
>  See in the original design the "job" wasn't even a defined
> structure, but rather just a void*.
>  
>  So basically what the driver did was telling the scheduler here I
> have a bunch of different void* please tell me which one to run
> first.
>  
>  And apart from being this identifier this void* had absolutely no
> meaning for the scheduler.

Interesting..

>  
> > >  
> > > The worst thing that could happen is that the driver messes
> > > things up
> > > and still has an unexecuted job in an entity,
> > >  
> >  
> > I can't fully follow.
> > 
> > So in your mind, the driver would personally set the dma_fence
> > callback
> > and thereby be informed about the job being completed, right?
> >  
>  
>  No. The run_job callback would still return the hw fence so that the
> scheduler can install the callback and so gets informed when the next
> job could be run.
>  
>  
> >  
> > But you wouldn't want to aim for getting rid of struct
> > drm_sched_job
> > completely, or would you?
> >  
>  
>  No, the drm_sched_job structure was added to aid the single producer
> single consumer queue and so made it easier to put the jobs into a
> container.
>  
>  
>  In my mind the full solution for running jobs looks like this:
>  
>  1. Driver enqueues with drm_sched_job_arm()
>  2. job ends up in pending_list
>  3. Sooner or later scheduler calls run_job()
>  4. In return scheduler gets a dma_fence representing the resulting
> hardware operation.
>  5. This fence is put into a container to keep around what the hw is
> actually executing.
>  6. At some point the fence gets signaled informing the scheduler
> that the next job can be pushed if enough credits are now available.
>  
>  There is no free_job callback any more because after run_job is
> called the scheduler is done with the job and only the dma_fence
> which represents the actual HW operation is the object the
> scheduler now works with.
>  
>  We would still need something like a kill_job callback which is used
> when an entity is released without flushing all jobs (see
> drm_sched_entity_kill()), but that is then only used for a specific
> corner case.

Can't we just limit the scheduler's responsibility to telling the
driver that it has to flush, and if it doesn't it's a bug?
 
>  Blocking for the cleanup in drm_sched_fini() then becomes a trivial
> dma_fence_wait() on the remaining unsignaled HW fences and eventually
> on the latest killed fence.

But that results in exactly the same situation as my waitque solution,
doesn't it?

Blocking infinitely in drm_sched_fini():

If the driver doesn't guarantee that all fences will get signaled, then
wait_event() in the waitque solution will block forever. The same with
dma_fence_wait().

Or are you aiming at an interruptible dma_fence_wait(), thereby not
blocking SIGKILL?

That then would still result in leaks. So basically the same situation
as with an interruptible drm_sched_flush() function.

(Although I agree that would probably be more elegant)

>  
>  The problem with this approach is that the idea of re-submitting
> jobs in a GPU reset by the scheduler is then basically dead. But to
> be honest that never worked correctly.

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-15 Thread Christian König

On 15.11.24 at 14:55, Philipp Stanner wrote:

[SNIP]

A good bunch of the problems we have here are caused by abusing the
job
as state for executing the submission on the hardware.

The original design idea of the scheduler instead was to only have
the
job as intermediate state between submission and picking what to
execute
next. E.g. the scheduler in that view is just a pipeline where you
push
jobs in on one end and jobs come out on the other end.

So let's get a bit more precise about this:
1. Driver enqueues with drm_sched_job_arm()
2. job ends up in pending_list
3. Sooner or later scheduler calls run_job()
4. Job is pushed to hardware
5. Fence gets signaled
6. ???

What would the "come out on the other end" part you describe look like?

How would jobs get removed from pending_list and, accordingly, how
would we avoid leaks?


Let me describe the full solution a bit further down.


In that approach the object which represents the hardware execution
is
the dma_fence instead of the job. And the dma_fence has a well
defined
lifetime, error handling, etc...

So when we go back to this original approach it would mean that we
don't
need to wait for any cleanup because the scheduler wouldn't be
involved
in the execution of the jobs on the hardware any more.

It would be involved (to speak precisely) in the sense of the scheduler
still being the one who pushes the job onto the hardware, agreed?


Yes, exactly.

See in the original design the "job" wasn't even a defined structure, 
but rather just a void*.


So basically what the driver did was telling the scheduler here I have a 
bunch of different void* please tell me which one to run first.


And apart from being this identifier this void* had absolutely no 
meaning for the scheduler.



The worst thing that could happen is that the driver messes things up
and still has an unexecuted job in an entity,

I can't fully follow.

So in your mind, the driver would personally set the dma_fence callback
and thereby be informed about the job being completed, right?


No. The run_job callback would still return the hw fence so that the 
scheduler can install the callback and so gets informed when the next 
job could be run.



But you wouldn't want to aim for getting rid of struct drm_sched_job
completely, or would you?


No, the drm_sched_job structure was added to aid the single producer 
single consumer queue and so made it easier to put the jobs into a 
container.



In my mind the full solution for running jobs looks like this:

1. Driver enqueues with drm_sched_job_arm()
2. job ends up in pending_list
3. Sooner or later scheduler calls run_job()
4. In return scheduler gets a dma_fence representing the resulting 
hardware operation.
5. This fence is put into a container to keep around what the hw is 
actually executing.
6. At some point the fence gets signaled informing the scheduler that 
the next job can be pushed if enough credits are now available.


There is no free_job callback any more because after run_job is called 
the scheduler is done with the job and only the dma_fence which 
represents the actual HW operation is the object the scheduler now 
works with.


We would still need something like a kill_job callback which is used 
when an entity is released without flushing all jobs (see 
drm_sched_entity_kill()), but that is then only used for a specific 
corner case.


Blocking for the cleanup in drm_sched_fini() then becomes a trivial 
dma_fence_wait() on the remaining unsignaled HW fences and eventually on 
the latest killed fence.
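The pipeline described above (run_job() hands back a hardware fence, the fence goes into a container, and teardown reduces to waiting for the remaining fences) can be modeled in a few lines of userspace C. All toy_* names below are invented for this sketch and are not part of the real DRM scheduler API.

```c
/* Toy model of the fence-container design: after run_job() the
 * scheduler only tracks the returned hardware fences, not the jobs,
 * and drm_sched_fini() becomes a wait for this container to drain. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FENCES 8

struct toy_fence {
	bool signaled;		/* set when the hardware completes the job */
};

struct toy_sched {
	/* the "container" holding what the hw is actually executing */
	struct toy_fence *inflight[TOY_MAX_FENCES];
	size_t n_inflight;
};

/* Steps 4/5: run_job() handed back a fence; remember it. */
static void toy_sched_track(struct toy_sched *s, struct toy_fence *f)
{
	assert(s->n_inflight < TOY_MAX_FENCES);
	s->inflight[s->n_inflight++] = f;
}

/* Step 6: only unsignaled fences still block credits (or teardown). */
static size_t toy_sched_unsignaled(const struct toy_sched *s)
{
	size_t i, n = 0;

	for (i = 0; i < s->n_inflight; i++)
		if (!s->inflight[i]->signaled)
			n++;
	return n;
}

/* Teardown: the trivial "dma_fence_wait()" is just this draining. */
static bool toy_sched_fini_ready(const struct toy_sched *s)
{
	return toy_sched_unsignaled(s) == 0;
}
```

Note that in this model nothing but the fences survives run_job(), which is exactly why a free_job callback has no place left.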


The problem with this approach is that the idea of re-submitting jobs in 
a GPU reset by the scheduler is then basically dead. But to be honest 
that never worked correctly.


See the deprecated warning I already put on drm_sched_resubmit_jobs(). 
The job lifetime is not really well defined because of this, see the 
free_guilty hack as well.


Regards,
Christian.




Regards,
P.


Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-15 Thread Philipp Stanner
On Fri, 2024-11-15 at 12:26 +0100, Christian König wrote:
> Interesting, those mails just showed up in my inbox as unread. More
> than 
> 14 days after you sent it.
> 
> So sorry for the late reply.

I smell google mail :p

> 
> On 29.10.24 at 08:22, Philipp Stanner wrote:
> > Christian, Sima?
> > 
> > Matthew? (+CC)
> > 
> > Opinions on the below?
> > 
> > tl;dr:
> > I still think it's a good thing to detectably block in
> > drm_sched_fini(), or at the very least drm_sched_flush(), because
> > then
> > you'll find out that the driver is broken and can repair it.
> 
> As discussed in private mail as well, the third option not mentioned
> so 
> far is to completely drop the free_job callback.

If it's really possible to do that I think this sounds like a great
solution at first glance because we could decrease complexity quite
substantially.

> 
> A good bunch of the problems we have here are caused by abusing the
> job 
> as state for executing the submission on the hardware.
> 
> The original design idea of the scheduler instead was to only have
> the 
> job as intermediate state between submission and picking what to
> execute 
> next. E.g. the scheduler in that view is just a pipeline where you
> push 
> jobs in on one end and jobs come out on the other end.

So let's get a bit more precise about this:
   1. Driver enqueues with drm_sched_job_arm()
   2. job ends up in pending_list
   3. Sooner or later scheduler calls run_job()
   4. Job is pushed to hardware
   5. Fence gets signaled
   6. ???

What would the "come out on the other end" part you describe look like?

How would jobs get removed from pending_list and, accordingly, how
would we avoid leaks?

> 
> In that approach the object which represents the hardware execution
> is 
> the dma_fence instead of the job. And the dma_fence has a well
> defined 
> lifetime, error handling, etc...
> 
> So when we go back to this original approach it would mean that we
> don't 
> need to wait for any cleanup because the scheduler wouldn't be
> involved 
> in the execution of the jobs on the hardware any more.

It would be involved (to speak precisely) in the sense of the scheduler
still being the one who pushes the job onto the hardware, agreed?

> 
> The worst thing that could happen is that the driver messes things up
> and still has an unexecuted job in an entity,

I can't fully follow.

So in your mind, the driver would personally set the dma_fence callback
and thereby be informed about the job being completed, right?

But you wouldn't want to aim for getting rid of struct drm_sched_job
completely, or would you?


Regards,
P.


> but in that case we could
> just warn, block for the hardware fence and then kill all pending 
> submissions with an error. And this blocking on the hardware fence 
> can't deadlock because we know that it is a hardware fence with
> certain 
> properties and not some random free_job callback where we don't have
> any 
> idea what locks the driver needs.
> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > 
> > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König
> > > > wrote:
> > > > > On 25.09.24 at 16:53, Philipp Stanner wrote:
> > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > > wrote:
> > > > > > > > On 20.09.24 at 15:26, Philipp Stanner wrote:
> > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > On 20.09.24 at 10:57, Philipp Stanner wrote:
> > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian
> > > > > > > > > > > König
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Tearing down the scheduler with jobs still on
> > > > > > > > > > > > the
> > > > > > > > > > > > pending
> > > > > > > > > > > > list
> > > > > > > > > > > > can
> > > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > > drivers try
> > > > > > > > > > > > to
> > > > > > > > > > > > destroy a scheduler which still has work pushed
> > > > > > > > > > > > to
> > > > > > > > > > > > the HW.
> > > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > > waitque-
> > > > > > > > > > > solution?
> > > > > > > > > > I don't remember seeing anything. What have I
> > > > > > > > > > missed?
> > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > > > > > > > Mhm, I didn't get that in my inbox for some reason.
> > > > > > > > 
> > > > > > > > Interesting approach, I'm just not sure if we can or
> > > > > > > > should
> > > > > > > > wait in
> > > > > > > > drm_sched_fini().
> > > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > > starts
> > > > > > is
> > > > > > always a bug, right?
> > > > > Correct, the question is how to avoid that.
>

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-15 Thread Christian König
Interesting, those mails just showed up in my inbox as unread. More than 
14 days after you sent it.


So sorry for the late reply.

On 29.10.24 at 08:22, Philipp Stanner wrote:

Christian, Sima?

Matthew? (+CC)

Opinions on the below?

tl;dr:
I still think it's a good thing to detectably block in
drm_sched_fini(), or at the very least drm_sched_flush(), because then
you'll find out that the driver is broken and can repair it.


As discussed in private mail as well, the third option not mentioned so 
far is to completely drop the free_job callback.


A good bunch of the problems we have here are caused by abusing the job 
as state for executing the submission on the hardware.


The original design idea of the scheduler instead was to only have the 
job as intermediate state between submission and picking what to execute 
next. E.g. the scheduler in that view is just a pipeline where you push 
jobs in on one end and jobs come out on the other end.


In that approach the object which represents the hardware execution is 
the dma_fence instead of the job. And the dma_fence has a well defined 
lifetime, error handling, etc...


So when we go back to this original approach it would mean that we don't 
need to wait for any cleanup because the scheduler wouldn't be involved 
in the execution of the jobs on the hardware any more.


The worst thing that could happen is that the driver messes things up 
and still has an unexecuted job in an entity, but in that case we could 
just warn, block for the hardware fence and then kill all pending 
submissions with an error. And this blocking on the hardware fence 
can't deadlock because we know that it is a hardware fence with certain 
properties and not some random free_job callback where we don't have any 
idea what locks the driver needs.


Regards,
Christian.



P.


On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:

On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:

On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:

On 25.09.24 at 16:53, Philipp Stanner wrote:

On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:

On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
wrote:

On 20.09.24 at 15:26, Philipp Stanner wrote:

On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:

On 20.09.24 at 10:57, Philipp Stanner wrote:

On Wed, 2024-09-18 at 15:39 +0200, Christian König
wrote:

Tearing down the scheduler with jobs still on the
pending
list
can
lead to use after free issues. Add a warning if
drivers try
to
destroy a scheduler which still has work pushed to
the HW.

Did you have time yet to look into my proposed
waitque-
solution?

I don't remember seeing anything. What have I missed?

https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/

Mhm, I didn't get that in my inbox for some reason.

Interesting approach, I'm just not sure if we can or should
wait in
drm_sched_fini().

We do agree that jobs still pending when drm_sched_fini()
starts
is
always a bug, right?

Correct, the question is how to avoid that.


If so, what are the disadvantages of waiting in
drm_sched_fini()?
We
could block buggy drivers as I see it. Which wouldn't be good,
but
could then be fixed on drivers' side.

Sima explained that pretty well: Don't block in fops->close, do
that in
fops->flush instead.

I agree that we shouldn't block in close(), but this effectively
means that we
need to reference count the scheduler, right?

Otherwise, if we allow to just skip / interrupt the teardown, we
can
still leak
memory.

Having thought about it, I agree with Danilo. Having something that
shall wait on green light, but can be interrupted, is no guarantee
and
therefore not a feasible solution.

To break down the solution space, these seem to be our options:
    1. We have something (either drm_sched_fini() or a helper, e.g.,
   drm_sched_flush()) that definitely blocks until the pending
list
   has become empty.
    2. We have jobs reference-count the scheduler, so the latter can
   outlive the driver and will be freed some time later.

Can anyone think of a third solution?


Solution #1 has the problem of obviously blocking unconditionally if
the driver didn't make sure that all fences will be signaled within
reasonable time. In my opinion, this would actually be an advantage,
because it will be *very* noticeable and force users to repair their
driver. The driver *has* to guarantee that all fences will be
signaled.
If the driver has to do fishy things, having the blocking outsourced
to
the helper drm_sched_flush() would allow them to circumvent that.

Solution #2 has the problem of backend_ops.free_job() potentially
using
driver-data after the driver is gone, causing UAF. So with this
solution, all drivers would have to be aware of the issue and handle 
it
through one of DRM's primitives dedicated to such problems.
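Solution #2, jobs reference-counting the scheduler, can be sketched in a handful of lines. This is a userspace toy model with invented toy_* names, not the kernel's kref API; it only illustrates the lifetime rule: the last pending job, not the driver, decides when the scheduler is released.

```c
/* Toy model of solution #2: each pending job holds a reference on the
 * scheduler, so the scheduler can outlive the driver's teardown and is
 * only released once the last job is freed. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_sched {
	atomic_int refcount;	/* one ref for the driver, one per job */
	bool released;		/* stands in for the release callback */
};

static void toy_sched_init(struct toy_sched *s)
{
	atomic_init(&s->refcount, 1);	/* the driver's own reference */
	s->released = false;
}

/* Taken when a job is armed and lands on the pending list. */
static void toy_sched_get(struct toy_sched *s)
{
	atomic_fetch_add(&s->refcount, 1);
}

/* Dropped when a job is freed, or when the driver lets go. */
static void toy_sched_put(struct toy_sched *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		s->released = true;	/* last ref: free / tear down hw */
}
```

The UAF concern raised above maps onto this model directly: the release step may run long after the driver dropped its reference, so it must not touch driver-owned data.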


Currently, all drivers either work around the problem internally or
simply ignore it, it seems.

So I'd argue that both solutions are an improvement over the existing 
situation. My preference would be #1.

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-12 Thread Christian König

On 08.11.24 at 05:00, Dave Airlie wrote:

On Wed, 18 Sept 2024 at 23:48, Christian König
  wrote:

Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

When there are still entities with jobs, the situation is even worse,
since the dma_fences for those jobs can never signal: we can only
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that badly,
let them run into a BUG_ON().

I've been talking a bit to Phillip about this offline.

I'm not sure we should ever be BUG_ON here, I think it might be an
extreme answer, considering we are saying blocking userspace to let
things finish is bad, I think hitting this would be much worse.


Yeah, completely agree that this is the most extreme response.

The problem is that the scheduler doesn't have much of a choice at 
that point any more. What we can do is the following:


1. Try to block until the queued-up jobs in the entities have been 
processed and their hardware fences have signaled.


    There is no guarantee that this won't deadlock and we are 
potentially blocking a SIGKILL here.

    We already tried it before and that turned out to be quite unstable.

2. Signal all pending fences without actually executing anything.

    Because of the dma_fence design (pipelining work and keeping only 
the latest fence for each context in the containers), that can 
potentially lead to random memory corruptions.


3. Don't signal anything, just free up all jobs.

    That can trivially result in the core memory management waiting 
forever for things to make progress, leading to complete system stops.


So the choices you have are all really bad for the system as a whole.


I'd rather we WARN_ON, and consider just saying screw it and block
userspace close.


Going to make that a WARN_ON() if you insist, but IIRC Linus or Greg 
once summarized it as "A BUG() is only justified if you prevent even 
worse things from happening". If you ask me that's exactly the case here.



If we really want to avoid the hang on close possibility (though I'm
mostly fine with that) then I think Sima's option to just keep a
reference on the driver, let userspace close and try and clean things
up on fences in the driver later.

I think this should be at least good for rust lifetimes.


The problem with the rust lifetime is that it got the direction of the 
dependency wrong.


A dma_fence represents an operation on the hardware, e.g. a direct 
access to some memory. This means that the dma_fence can only signal if 
the hardware tells the driver that the operation is completed.


If a dma_fence is supposed to signal because some object is not 
referenced any more, for example because userspace closed it, then that 
clearly points to a fundamentally broken design.



Having an explicit memory leak is bad, having a managed memory leak is
less bad, because at least all the memory is still pointed to by
something and managed, at a guess we'd have to keep the whole driver
and scheduler around, to avoid having to make special free functions.
Unless there is some concept like TTM ghost objects we could get away
with, but I think having to signal the fence means we should keep all
the pieces.


Well I think memory leaks are more or less harmless. What we really 
really need to prevent are random memory corruptions.


Those can run undetected under the radar for a very long time and cause 
massive damage without anybody being able to pinpoint where corruptions 
came from.


Regards,
Christian.



Dave.


Signed-off-by: Christian König
---
  drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
  1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)

 drm_sched_wqueue_stop(sched);

+   /*
+* Tearing down the scheduler while there are still unprocessed jobs can
+* lead to use after free issues in the scheduler fence.
+*/
+   WARN_ON(!list_empty(&sched->pending_list));
+
 for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 struct drm_sched_rq *rq = sched->sched_rq[i];

 spin_lock(&rq->lock);
-   list_for_each_entry(s_entity, &rq->entities, list)
+   list_for_each_entry(s_entity, &rq->entities, list) {
+   /*
+* The justification for this BUG_ON() is that tearing
+* down the scheduler while jobs are pending leaves
+* dma_fences unsignaled. Since we have dependencies
+* from the core memory manageme

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-11-07 Thread Dave Airlie
On Wed, 18 Sept 2024 at 23:48, Christian König
 wrote:
>
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs, the situation is even worse,
> since the dma_fences for those jobs can never signal: we can only
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that badly,
> let them run into a BUG_ON().

I've been talking a bit to Phillip about this offline.

I'm not sure we should ever be BUG_ON here, I think it might be an
extreme answer, considering we are saying blocking userspace to let
things finish is bad, I think hitting this would be much worse.

I'd rather we WARN_ON, and consider just saying screw it and block
userspace close.

If we really want to avoid the hang on close possibility (though I'm
mostly fine with that) then I think Sima's option to just keep a
reference on the driver, let userspace close and try and clean things
up on fences in the driver later.

I think this should be at least good for rust lifetimes.

Having an explicit memory leak is bad, having a managed memory leak is
less bad, because at least all the memory is still pointed to by
something and managed, at a guess we'd have to keep the whole driver
and scheduler around, to avoid having to make special free functions.
Unless there is some concept like TTM ghost objects we could get away
with, but I think having to signal the fence means we should keep all
the pieces.

Dave.

>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>
> drm_sched_wqueue_stop(sched);
>
> +   /*
> +* Tearing down the scheduler while there are still unprocessed jobs 
> can
> +* lead to use after free issues in the scheduler fence.
> +*/
> +   WARN_ON(!list_empty(&sched->pending_list));
> +
> for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> struct drm_sched_rq *rq = sched->sched_rq[i];
>
> spin_lock(&rq->lock);
> -   list_for_each_entry(s_entity, &rq->entities, list)
> +   list_for_each_entry(s_entity, &rq->entities, list) {
> +   /*
> +* The justification for this BUG_ON() is that tearing
> +* down the scheduler while jobs are pending leaves
> +* dma_fences unsignaled. Since we have dependencies
> +* from the core memory management to eventually 
> signal
> +* dma_fences this can trivially lead to a system wide
> +* stop because of a locked up memory management.
> +*/
> +   BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
> /*
>  * Prevents reinsertion and marks job_queue as idle,
>  * it will removed from rq in drm_sched_entity_fini
>  * eventually
>  */
> s_entity->stopped = true;
> +   }
> spin_unlock(&rq->lock);
> kfree(sched->sched_rq[i]);
> }
> --
> 2.34.1
>


Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-10-29 Thread Philipp Stanner
Christian, Sima?

Matthew? (+CC)

Opinions on the below?

tl;dr:
I still think it's a good thing to detectably block in
drm_sched_fini(), or at the very least drm_sched_flush(), because then
you'll find out that the driver is broken and can repair it.

P.


On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > > On 25.09.24 at 16:53, Philipp Stanner wrote:
> > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > wrote:
> > > > > > On 20.09.24 at 15:26, Philipp Stanner wrote:
> > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > > On 20.09.24 at 10:57, Philipp Stanner wrote:
> > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > > pending
> > > > > > > > > > list
> > > > > > > > > > can
> > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > drivers try
> > > > > > > > > > to
> > > > > > > > > > destroy a scheduler which still has work pushed to
> > > > > > > > > > the HW.
> > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > waitque-
> > > > > > > > > solution?
> > > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > > > > > Mhm, I didn't get that in my inbox for some reason.
> > > > > > 
> > > > > > Interesting approach, I'm just not sure if we can or should
> > > > > > wait in
> > > > > > drm_sched_fini().
> > > > We do agree that jobs still pending when drm_sched_fini()
> > > > starts
> > > > is
> > > > always a bug, right?
> > > 
> > > Correct, the question is how to avoid that.
> > > 
> > > > If so, what are the disadvantages of waiting in
> > > > drm_sched_fini()?
> > > > We
> > > > could block buggy drivers as I see it. Which wouldn't be good,
> > > > but
> > > > could then be fixed on drivers' side.
> > > 
> > > Sima explained that pretty well: Don't block in fops->close, do
> > > that in
> > > fops->flush instead.
> > 
> > I agree that we shouldn't block in close(), but this effectively
> > means that we
> > need to reference count the scheduler, right?
> > 
> > Otherwise, if we allow to just skip / interrupt the teardown, we
> > can
> > still leak
> > memory.
> 
> Having thought about it, I agree with Danilo. Having something that
> shall wait on green light, but can be interrupted, is no guarantee
> and
> therefore not a feasible solution.
> 
> To break down the solution space, these seem to be our options:
>    1. We have something (either drm_sched_fini() or a helper, e.g.,
>   drm_sched_flush()) that definitely blocks until the pending
> list
>   has become empty.
>    2. We have jobs reference-count the scheduler, so the latter can
>   outlive the driver and will be freed some time later.
> 
> Can anyone think of a third solution?
> 
> 
> Solution #1 has the problem of obviously blocking unconditionally if
> the driver didn't make sure that all fences will be signaled within
> reasonable time. In my opinion, this would actually be an advantage,
> because it will be *very* noticeable and force users to repair their
> driver. The driver *has* to guarantee that all fences will be
> signaled.
> If the driver has to do fishy things, having the blocking outsourced
> to
> the helper drm_sched_flush() would allow them to circumvent that.
> 
> Solution #2 has the problem of backend_ops.free_job() potentially
> using
> driver-data after the driver is gone, causing UAF. So with this
> solution, all drivers would have to be aware of the issue and handle
> it
> through one of DRM's primitives dedicated to such problems.
> 
> 
> Currently, all drivers either work around the problem internally or
> simply ignore it, it seems.
> 
> So I'd argue that both solutions are an improvement over the existing
> situation. My preference would be #1.
> 
> 
> Opinions?
> 
> P.
> 
> > 
> > > 
> > > One issue this solves is that when you send a SIGTERM the tear
> > > down
> > > handling
> > > first flushes all the FDs and then closes them.
> > > 
> > > So if flushing the FDs blocks because the process initiated
> > > sending
> > > a
> > > terabyte of data over a 300bps line (for example) you can still
> > > throw a
> > > SIGKILL and abort that as well.
> > > 
> > > If you would block in fops-close() that SIGKILL won't have any
> > > effect any
> > > more because by the time close() is called the process is gone
> > > and
> > > signals
> > > are already blocked.
> > > 
> > > And yes, when I learned about that issue I was also baffled that
> > > handling like this in the UNIX design is nearly 50 years old and
> > > still applies today.
> > > > > > Probably bette

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-10-18 Thread Philipp Stanner
On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > On 25.09.24 at 16:53, Philipp Stanner wrote:
> > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > wrote:
> > > > > On 20.09.24 at 15:26, Philipp Stanner wrote:
> > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > On 20.09.24 at 10:57, Philipp Stanner wrote:
> > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König
> > > > > > > > wrote:
> > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > pending
> > > > > > > > > list
> > > > > > > > > can
> > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > drivers try
> > > > > > > > > to
> > > > > > > > > destroy a scheduler which still has work pushed to
> > > > > > > > > the HW.
> > > > > > > > Did you have time yet to look into my proposed waitque-
> > > > > > > > solution?
> > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > > > > Mhm, I didn't get that in my inbox for some reason.
> > > > > 
> > > > > Interesting approach, I'm just not sure if we can or should
> > > > > wait in
> > > > > drm_sched_fini().
> > > We do agree that jobs still pending when drm_sched_fini() starts
> > > is
> > > always a bug, right?
> > 
> > Correct, the question is how to avoid that.
> > 
> > > If so, what are the disadvantages of waiting in drm_sched_fini()?
> > > We
> > > could block buggy drivers as I see it. Which wouldn't be good,
> > > but
> > > could then be fixed on drivers' site.
> > 
> > Sima explained that pretty well: Don't block in fops->close, do
> > that in
> > fops->flush instead.
> 
> I agree that we shouldn't block in close(), but this effectively
> means that we
> need to reference count the scheduler, right?
> 
> Otherwise, if we allow to just skip / interrupt the teardown, we can
> still leak
> memory.

Having thought about it, I agree with Danilo. Having something that
is supposed to wait for a green light, but can be interrupted, provides
no guarantee and is therefore not a feasible solution.

To break down the solution space, these seem to be our options:
   1. We have something (either drm_sched_fini() or a helper, e.g.,
  drm_sched_flush()) that definitely blocks until the pending list
  has become empty.
   2. We have jobs reference-count the scheduler, so the latter can
  outlive the driver and will be freed some time later.

Can anyone think of a third solution?


Solution #1 has the problem of obviously blocking unconditionally if
the driver didn't make sure that all fences will be signaled within
reasonable time. In my opinion, this would actually be an advantage,
because it will be *very* noticeable and force users to repair their
driver. The driver *has* to guarantee that all fences will be signaled.
If the driver has to do fishy things, having the blocking outsourced to
the helper drm_sched_flush() would allow them to circumvent that.

Solution #2 has the problem of backend_ops.free_job() potentially using
driver-data after the driver is gone, causing UAF. So with this
solution all drivers would have to be aware of the issue and handle it
through one of DRM's primitives dedicated to such problems.


Currently, all drivers either work around the problem internally or
simply ignore it, it seems.

So I'd argue that both solutions are an improvement over the existing
situation. My preference would be #1.
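For concreteness, here is a minimal userspace model of what the blocking
semantics of #1 amount to. This is only a sketch under stated assumptions:
pthreads stand in for the kernel's waitqueue, a plain counter stands in for
the pending list, and all names (sched_model_*) are hypothetical, not
drm_sched API:

```c
/*
 * Hypothetical userspace model of option #1, not kernel code: a flush
 * that blocks until the pending list is empty.
 */
#include <pthread.h>
#include <unistd.h>

struct sched_model {
	pthread_mutex_t lock;
	pthread_cond_t drained;
	int pending;	/* stand-in for the length of the pending list */
};

static void sched_model_init(struct sched_model *s, int pending)
{
	pthread_mutex_init(&s->lock, NULL);
	pthread_cond_init(&s->drained, NULL);
	s->pending = pending;
}

/* The driver's fence-signaling path would call this once per finished job. */
static void sched_model_job_done(struct sched_model *s)
{
	pthread_mutex_lock(&s->lock);
	if (--s->pending == 0)
		pthread_cond_broadcast(&s->drained);
	pthread_mutex_unlock(&s->lock);
}

/* Blocks until all pending jobs have signaled; returns 0 on success. */
static int sched_model_flush(struct sched_model *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->pending > 0)
		pthread_cond_wait(&s->drained, &s->lock);
	pthread_mutex_unlock(&s->lock);
	return 0;
}

static void *signal_all(void *arg)
{
	struct sched_model *s = arg;

	for (int i = 0; i < 3; i++) {
		usleep(1000);
		sched_model_job_done(s);
	}
	return NULL;
}

/* Demo: flush blocks until a worker has signaled all three pending jobs. */
int sched_model_demo(void)
{
	struct sched_model s;
	pthread_t t;
	int ret;

	sched_model_init(&s, 3);
	pthread_create(&t, NULL, signal_all, &s);
	ret = sched_model_flush(&s);	/* returns only once pending == 0 */
	pthread_join(t, NULL);
	return ret;
}
```

The point the model makes is that sched_model_flush() returns only once
every "fence" has signaled; if the driver never signals, the flush blocks
forever, which is exactly the very noticeable hang described above.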


Opinions?

P.

> 
> > 
> > One issue this solves is that when you send a SIGTERM the tear down
> > handling
> > first flushes all the FDs and then closes them.
> > 
> > So if flushing the FDs blocks because the process initiated sending
> > a
> > terabyte of data over a 300bps line (for example) you can still
> > throw a
> > SIGKILL and abort that as well.
> > 
> > If you would block in fops-close() that SIGKILL won't have any
> > effect any
> > more because by the time close() is called the process is gone and
> > signals
> > are already blocked.
> > 
> > And yes when I learned about that issue I was also baffled that
> > handling like
> > this in the UNIX design is nearly 50 years old and still applies to
> > today.
> > > > > Probably better to make that a separate function, something
> > > > > like
> > > > > drm_sched_flush() or similar.
> > > We could do that. Such a function could then be called by drivers
> > > which
> > > are not sure whether all jobs are done before they start tearing
> > > down.
> > 
> > Yes exactly that's the idea. And give that flush function a return
> > code so
> > that it can return -EINTR.
> > 
> > > > Yeah I don't think we should smash this into drm_sched_fini
> > > > unconditionally. I think conceptually there's about three
> > > > cases:
> > > > 
> > > > - Ringbuffer schedules. Probably want everything as-is, becaus

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-10-14 Thread Danilo Krummrich
On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > > > Tearing down the scheduler with jobs still on the pending
> > > > > > > > list
> > > > > > > > can
> > > > > > > > lead to use after free issues. Add a warning if drivers try
> > > > > > > > to
> > > > > > > > destroy a scheduler which still has work pushed to the HW.
> > > > > > > Did you have time yet to look into my proposed
> > > > > > > waitqueue solution?
> > > > > > I don't remember seeing anything. What have I missed?
> > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > > > Mhm, I didn't get that in my inbox for some reason.
> > > > 
> > > > Interesting approach, I'm just not sure if we can or should wait in
> > > > drm_sched_fini().
> > We do agree that jobs still pending when drm_sched_fini() starts is
> > always a bug, right?
> 
> Correct, the question is how to avoid that.
> 
> > If so, what are the disadvantages of waiting in drm_sched_fini()? We
> > could block buggy drivers as I see it. Which wouldn't be good, but
> > could then be fixed on drivers' site.
> 
> Sima explained that pretty well: Don't block in fops->close, do that in
> fops->flush instead.

I agree that we shouldn't block in close(), but this effectively means that we
need to reference count the scheduler, right?

Otherwise, if we allow to just skip / interrupt the teardown, we can still leak
memory.
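A toy userspace model of that refcounting idea: every job on the pending
list would hold a reference on the scheduler, and the final put frees it.
All names here (sched_ref_*) are made up for illustration and are not DRM
API:

```c
/*
 * Hypothetical userspace model of the refcounting approach, not kernel
 * code: jobs pin the scheduler, the last put frees it.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct sched_ref {
	atomic_int refcount;
	bool *freed;		/* lets the demo observe the free */
};

static struct sched_ref *sched_ref_create(bool *freed)
{
	struct sched_ref *s = malloc(sizeof(*s));

	atomic_init(&s->refcount, 1);	/* the driver's own reference */
	s->freed = freed;
	*freed = false;
	return s;
}

/* A job being added to the pending list takes a reference. */
static void sched_ref_get(struct sched_ref *s)
{
	atomic_fetch_add(&s->refcount, 1);
}

/* free_job() or driver teardown drops a reference; last one frees. */
static void sched_ref_put(struct sched_ref *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1) {
		*s->freed = true;	/* last ref: scheduler may be freed */
		free(s);
	}
}

/*
 * Demo: the driver's "fini" drops its reference while a job is still in
 * flight; the scheduler outlives the driver and is freed only when the
 * job completes.
 */
bool refcount_demo(void)
{
	bool freed;
	bool survived;
	struct sched_ref *s = sched_ref_create(&freed);

	sched_ref_get(s);	/* one job in flight */
	sched_ref_put(s);	/* driver teardown: scheduler must survive */
	survived = !freed;
	sched_ref_put(s);	/* the job's free_job() drops the last ref */
	return survived && freed;
}
```

The key property is the middle step: the driver can drop its own reference
during teardown without blocking, and the scheduler memory survives until
the last in-flight job drops its reference, which is what prevents the
leak.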

> 
> One issue this solves is that when you send a SIGTERM the tear down handling
> first flushes all the FDs and then closes them.
> 
> So if flushing the FDs blocks because the process initiated sending a
> terabyte of data over a 300bps line (for example) you can still throw a
> SIGKILL and abort that as well.
> 
> If you would block in fops-close() that SIGKILL won't have any effect any
> more because by the time close() is called the process is gone and signals
> are already blocked.
> 
> > And yes when I learned about that issue I was also baffled that handling like
> this in the UNIX design is nearly 50 years old and still applies to today.
> > > > Probably better to make that a separate function, something like
> > > > drm_sched_flush() or similar.
> > We could do that. Such a function could then be called by drivers which
> > are not sure whether all jobs are done before they start tearing down.
> 
> Yes exactly that's the idea. And give that flush function a return code so
> that it can return -EINTR.
> 
> > > Yeah I don't think we should smash this into drm_sched_fini
> > > unconditionally. I think conceptually there's about three cases:
> > > 
> > > - Ringbuffer schedules. Probably want everything as-is, because
> > >    drm_sched_fini is called long after all the entities are gone in
> > >    drm_device cleanup.
> > > 
> > > - fw scheduler hardware with preemption support. There we probably
> > > want to
> > >    nuke the context by setting the tdr timeout to zero (or maybe just
> > > as
> > >    long as context preemption takes to be efficient), and relying on
> > > the
> > >    normal gpu reset flow to handle things. drm_sched_entity_flush
> > > kinda
> > >    does this, except not really and it's a lot more focused on the
> > >    ringbuffer context. So maybe we want a new drm_sched_entity_kill.
> > > 
> > >    For this case calling drm_sched_fini() after the 1:1 entity is gone
> > >    should not find any lingering jobs, it would actually be a bug
> > > somewhere if
> > >    there's a job lingering. Maybe a sanity check that there's not just
> > > no
> > >    jobs lingering, but also no entity left would be good here?
> > The check for lingering ones is in Christian's patch here, IIUC.
> > At which position would you imagine the check for the entity being
> > performed?
> > 
> > > - fw scheduler without preemption support. There we kinda need the
> > >    drm_sched_flush, except blocking in fops->close is not great. So
> > > instead
> > >    I think the following is better:
> > >    1. drm_sched_entity_stopped, which only stops new submissions (for
> > >    paranoia) but doesn't tear down the entity
> > Who would call that function?
> > Drivers using it voluntarily could just as well stop accepting new jobs
> > from userspace to their entities, couldn't they?
> > 
> > >    2. drm_dev_get
> > >    3. launch a worker which does a) drm_sched_flush (or
> > >    drm_sched_entity_flush or whatever we call it) b)
> > > drm_sched_entity_fini
> > >    + drm_sched_fini c) drm_dev_put
> > > 
> > >    Note that semantically this implements the refcount in the other
> > > path
> > >    from Phillip:

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-10-14 Thread Danilo Krummrich
On Tue, Sep 24, 2024 at 01:18:25PM +0200, Simona Vetter wrote:
> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > Tearing down the scheduler with jobs still on the pending list
> > > > > > can
> > > > > > lead to use after free issues. Add a warning if drivers try to
> > > > > > destroy a scheduler which still has work pushed to the HW.
> > > > > Did you have time yet to look into my proposed waitqueue solution?
> > > > I don't remember seeing anything. What have I missed?
> > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > 
> > Mhm, I didn't get that in my inbox for some reason.
> > 
> > Interesting approach, I'm just not sure if we can or should wait in
> > drm_sched_fini().
> > 
> > Probably better to make that a separate function, something like
> > drm_sched_flush() or similar.
> 
> Yeah I don't think we should smash this into drm_sched_fini
> unconditionally. I think conceptually there's about three cases:
> 
> - Ringbuffer schedules. Probably want everything as-is, because
>   drm_sched_fini is called long after all the entities are gone in
>   drm_device cleanup.

Even if entities are long gone we can still have jobs on the pending list of the
scheduler. Calling drm_sched_fini() would still leak those jobs.

It's just that for this case it's unlikely, but I don't think we want to rely on
"unlikely".

Do I miss anything?

> 
> - fw scheduler hardware with preemption support. There we probably want to
>   nuke the context by setting the tdr timeout to zero (or maybe just as
>   long as context preemption takes to be efficient), and relying on the
>   normal gpu reset flow to handle things. drm_sched_entity_flush kinda
>   does this, except not really and it's a lot more focused on the
>   ringbuffer context. So maybe we want a new drm_sched_entity_kill.
> 
>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>   should not find any lingering jobs, it would actually be a bug somewhere if
>   there's a job lingering. Maybe a sanity check that there's not just no
>   jobs lingering, but also no entity left would be good here?

This is a timing assumption too, isn't it? What if we reach drm_sched_fini()
quicker than all the free() callbacks from the jobs on the pending list are
processed?

I know Xe uses the "tdr to zero trick", so I'm not aware of all implications --
in Nouveau I had to do it differently.

> 
> - fw scheduler without preemption support. There we kinda need the
>   drm_sched_flush, except blocking in fops->close is not great. So instead
>   I think the following is better:
>   1. drm_sched_entity_stopped, which only stops new submissions (for
>   paranoia) but doesn't tear down the entity
>   2. drm_dev_get
>   3. launch a worker which does a) drm_sched_flush (or
>   drm_sched_entity_flush or whatever we call it) b) drm_sched_entity_fini

I think it would be drm_sched_flush(); the jobs on the pending list are not
associated with any entity any more. See also 96c7c2f4d5bd ("drm/scheduler: set
entity to NULL in drm_sched_entity_pop_job()").

>   + drm_sched_fini c) drm_dev_put
> 
>   Note that semantically this implements the refcount in the other path
>   from Phillip:
> 
>   https://lore.kernel.org/all/20240903094531.29893-2-pstan...@redhat.com/
>   
>   Except it doesn't impose refcount on everyone else who doesn't need it,
>   and it doesn't even impose refcounting on drivers that do need it
>   because we use drm_sched_flush and a worker to achieve the same.
> 
> Essentially helper functions for the common use-cases instead of trying to
> solve them all by putting drm_sched_flush as a potentially very blocking
> function into drm_sched_fini.
> 
> 
> > > > > > When there are still entities with jobs the situation is even
> > > > > > worse
> > > > > > since the dma_fences for those jobs can never signal we can just
> > > > > > choose between potentially locking up core memory management and
> > > > > > random memory corruption. When drivers really mess it up that
> > > > > > well
> > > > > > let them run into a BUG_ON().
> > > > > > 
> > > > > > Signed-off-by: Christian König
> > > > > > ---
> > > > > >    drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
> > > > > >    1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > drm_gpu_scheduler
> > > > > > *sched)
> > > > > I agree with Sima that it should first be docume

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-27 Thread Christian König

Am 25.09.24 um 16:53 schrieb Philipp Stanner:

On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:

On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:

Am 20.09.24 um 15:26 schrieb Philipp Stanner:

On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:

Am 20.09.24 um 10:57 schrieb Philipp Stanner:

On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:

Tearing down the scheduler with jobs still on the pending
list
can
lead to use after free issues. Add a warning if drivers try
to
destroy a scheduler which still has work pushed to the HW.

Did you have time yet to look into my proposed
waitqueue solution?

I don't remember seeing anything. What have I missed?

https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/

Mhm, I didn't get that in my inbox for some reason.

Interesting approach, I'm just not sure if we can or should wait in
drm_sched_fini().

We do agree that jobs still pending when drm_sched_fini() starts is
always a bug, right?


Correct, the question is how to avoid that.


If so, what are the disadvantages of waiting in drm_sched_fini()? We
could block buggy drivers as I see it. Which wouldn't be good, but
could then be fixed on drivers' site.


Sima explained that pretty well: Don't block in fops->close, do that in 
fops->flush instead.


One issue this solves is that when you send a SIGTERM the tear down 
handling first flushes all the FDs and then closes them.


So if flushing the FDs blocks because the process initiated sending a 
terabyte of data over a 300bps line (for example) you can still throw a 
SIGKILL and abort that as well.


If you would block in fops-close() that SIGKILL won't have any effect 
any more because by the time close() is called the process is gone and 
signals are already blocked.


And yes when I learned about that issue I was also baffled that handling
like this in the UNIX design is nearly 50 years old and still applies to 
today.

Probably better to make that a separate function, something like
drm_sched_flush() or similar.

We could do that. Such a function could then be called by drivers which
are not sure whether all jobs are done before they start tearing down.


Yes exactly that's the idea. And give that flush function a return code 
so that it can return -EINTR.



Yeah I don't think we should smash this into drm_sched_fini
unconditionally. I think conceptually there's about three cases:

- Ringbuffer schedules. Probably want everything as-is, because
   drm_sched_fini is called long after all the entities are gone in
   drm_device cleanup.

- fw scheduler hardware with preemption support. There we probably
want to
   nuke the context by setting the tdr timeout to zero (or maybe just
as
   long as context preemption takes to be efficient), and relying on
the
   normal gpu reset flow to handle things. drm_sched_entity_flush
kinda
   does this, except not really and it's a lot more focused on the
   ringbuffer context. So maybe we want a new drm_sched_entity_kill.

   For this case calling drm_sched_fini() after the 1:1 entity is gone
   should not find any lingering jobs, it would actually be a bug
somewhere if
   there's a job lingering. Maybe a sanity check that there's not just
no
   jobs lingering, but also no entity left would be good here?

The check for lingering ones is in Christian's patch here, IIUC.
At which position would you imagine the check for the entity being
performed?


- fw scheduler without preemption support. There we kinda need the
   drm_sched_flush, except blocking in fops->close is not great. So
instead
   I think the following is better:
   1. drm_sched_entity_stopped, which only stops new submissions (for
   paranoia) but doesn't tear down the entity

Who would call that function?
Drivers using it voluntarily could just as well stop accepting new jobs
from userspace to their entities, couldn't they?


   2. drm_dev_get
   3. launch a worker which does a) drm_sched_flush (or
   drm_sched_entity_flush or whatever we call it) b)
drm_sched_entity_fini
   + drm_sched_fini c) drm_dev_put

   Note that semantically this implements the refcount in the other
path
   from Phillip:

  
https://lore.kernel.org/all/20240903094531.29893-2-pstan...@redhat.com/
   
   Except it doesn't impose refcount on everyone else who doesn't need

it,
   and it doesn't even impose refcounting on drivers that do need it
   because we use drm_sched_flush and a worker to achieve the same.

I indeed wasn't happy with the refcount approach for that reason,
agreed.


Essentially helper functions for the common use-cases instead of
trying to
solve them all by putting drm_sched_flush as a potentially very
blocking
function into drm_sched_fini.

I'm still not able to see why its blocking would be undesired – as far
as I can see, it is only invoked on driver teardown, so not during
active operation. Teardown doesn't happen that often, and it can (if
implemented correctly) only block until the driver's code has signaled
the last 

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-25 Thread Philipp Stanner
On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > Tearing down the scheduler with jobs still on the pending
> > > > > > list
> > > > > > can
> > > > > > lead to use after free issues. Add a warning if drivers try
> > > > > > to
> > > > > > destroy a scheduler which still has work pushed to the HW.
> > > > > Did you have time yet to look into my proposed
> > > > > waitqueue solution?
> > > > I don't remember seeing anything. What have I missed?
> > > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> > 
> > Mhm, I didn't get that in my inbox for some reason.
> > 
> > Interesting approach, I'm just not sure if we can or should wait in
> > drm_sched_fini().

We do agree that jobs still pending when drm_sched_fini() starts is
always a bug, right?

If so, what are the disadvantages of waiting in drm_sched_fini()? We
could block buggy drivers as I see it. Which wouldn't be good, but
could then be fixed on drivers' site.

> > 
> > Probably better to make that a separate function, something like
> > drm_sched_flush() or similar.

We could do that. Such a function could then be called by drivers which
are not sure whether all jobs are done before they start tearing down.

> 
> Yeah I don't think we should smash this into drm_sched_fini
> unconditionally. I think conceptually there's about three cases:
> 
> - Ringbuffer schedules. Probably want everything as-is, because
>   drm_sched_fini is called long after all the entities are gone in
>   drm_device cleanup.
> 
> - fw scheduler hardware with preemption support. There we probably
> want to
>   nuke the context by setting the tdr timeout to zero (or maybe just
> as
>   long as context preemption takes to be efficient), and relying on
> the
>   normal gpu reset flow to handle things. drm_sched_entity_flush
> kinda
>   does this, except not really and it's a lot more focused on the
>   ringbuffer context. So maybe we want a new drm_sched_entity_kill.
> 
>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>   should not find any lingering jobs, it would actually be a bug
> somewhere if
>   there's a job lingering. Maybe a sanity check that there's not just
> no
>   jobs lingering, but also no entity left would be good here?

The check for lingering ones is in Christian's patch here, IIUC.
At which position would you imagine the check for the entity being
performed?

> 
> - fw scheduler without preemption support. There we kinda need the
>   drm_sched_flush, except blocking in fops->close is not great. So
> instead
>   I think the following is better:
>   1. drm_sched_entity_stopped, which only stops new submissions (for
>   paranoia) but doesn't tear down the entity

Who would call that function?
Drivers using it voluntarily could just as well stop accepting new jobs
from userspace to their entities, couldn't they?

>   2. drm_dev_get
>   3. launch a worker which does a) drm_sched_flush (or
>   drm_sched_entity_flush or whatever we call it) b)
> drm_sched_entity_fini
>   + drm_sched_fini c) drm_dev_put
> 
>   Note that semantically this implements the refcount in the other
> path
>   from Phillip:
> 
>  
> https://lore.kernel.org/all/20240903094531.29893-2-pstan...@redhat.com/
>   
>   Except it doesn't impose refcount on everyone else who doesn't need
> it,
>   and it doesn't even impose refcounting on drivers that do need it
>   because we use drm_sched_flush and a worker to achieve the same.

I indeed wasn't happy with the refcount approach for that reason,
agreed.

> 
> Essentially helper functions for the common use-cases instead of
> trying to
> solve them all by putting drm_sched_flush as a potentially very
> blocking
> function into drm_sched_fini.

I'm still not able to see why its blocking would be undesired – as far
as I can see, it is only invoked on driver teardown, so not during
active operation. Teardown doesn't happen that often, and it can (if
implemented correctly) only block until the driver's code has signaled
the last fences. If that doesn't happen, the block would reveal a bug.

But don't get me wrong: I don't want to *push* this solution. I just
want to understand when it could become a problem.


Wouldn't an explicitly blocking, separate function like
drm_sched_flush() or drm_sched_fini_flush() be a small, doable step
towards the right direction?


> 
> 
> > > > > > When there are still entities with jobs the situation is
> > > > > > even
> > > > > > worse
> > > > > > since the dma_fences for those jobs can never signal we can
> > > > > > just
> > > > > > choose between potentially locking up core memory
> > > > > > management and
> > > > > > random memory corruption. When drivers really mess 

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-24 Thread Simona Vetter
On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > Tearing down the scheduler with jobs still on the pending list
> > > > > can
> > > > > lead to use after free issues. Add a warning if drivers try to
> > > > > destroy a scheduler which still has work pushed to the HW.
> > > > Did you have time yet to look into my proposed waitqueue solution?
> > > I don't remember seeing anything. What have I missed?
> > https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/
> 
> Mhm, I didn't get that in my inbox for some reason.
> 
> Interesting approach, I'm just not sure if we can or should wait in
> drm_sched_fini().
> 
> Probably better to make that a separate function, something like
> drm_sched_flush() or similar.

Yeah I don't think we should smash this into drm_sched_fini
unconditionally. I think conceptually there's about three cases:

- Ringbuffer schedules. Probably want everything as-is, because
  drm_sched_fini is called long after all the entities are gone in
  drm_device cleanup.

- fw scheduler hardware with preemption support. There we probably want to
  nuke the context by setting the tdr timeout to zero (or maybe just as
  long as context preemption takes to be efficient), and relying on the
  normal gpu reset flow to handle things. drm_sched_entity_flush kinda
  does this, except not really and it's a lot more focused on the
  ringbuffer context. So maybe we want a new drm_sched_entity_kill.

  For this case calling drm_sched_fini() after the 1:1 entity is gone
  should not find any lingering jobs, it would actually be a bug somewhere if
  there's a job lingering. Maybe a sanity check that there's not just no
  jobs lingering, but also no entity left would be good here?

- fw scheduler without preemption support. There we kinda need the
  drm_sched_flush, except blocking in fops->close is not great. So instead
  I think the following is better:
  1. drm_sched_entity_stopped, which only stops new submissions (for
  paranoia) but doesn't tear down the entity
  2. drm_dev_get
  3. launch a worker which does a) drm_sched_flush (or
  drm_sched_entity_flush or whatever we call it) b) drm_sched_entity_fini
  + drm_sched_fini c) drm_dev_put

  Note that semantically this implements the refcount in the other path
  from Phillip:

  https://lore.kernel.org/all/20240903094531.29893-2-pstan...@redhat.com/
  
  Except it doesn't impose refcount on everyone else who doesn't need it,
  and it doesn't even impose refcounting on drivers that do need it
  because we use drm_sched_flush and a worker to achieve the same.

Essentially helper functions for the common use-cases instead of trying to
solve them all by putting drm_sched_flush as a potentially very blocking
function into drm_sched_fini.
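Sketching that third case in userspace, with a pthread standing in for the
kernel worker and purely hypothetical names (sched_flush(), sched_fini(),
dev_put()): close() takes a device reference, hands teardown to a worker,
and returns immediately; the worker then does flush, fini, put in order:

```c
/*
 * Hypothetical userspace model of the worker-based teardown, not kernel
 * code: fops->close() never blocks on the flush itself.
 */
#include <pthread.h>
#include <string.h>

static char order[32];
static pthread_mutex_t order_lock = PTHREAD_MUTEX_INITIALIZER;

static void log_step(const char *step)
{
	pthread_mutex_lock(&order_lock);
	strcat(order, step);
	pthread_mutex_unlock(&order_lock);
}

static void sched_flush(void) { log_step("flush,"); } /* may block for long */
static void sched_fini(void)  { log_step("fini,"); }
static void dev_put(void)     { log_step("put,"); }

static void *teardown_worker(void *arg)
{
	(void)arg;
	sched_flush();	/* a) wait for pending jobs, off the close() path */
	sched_fini();	/* b) pending list is now empty, fini is safe */
	dev_put();	/* c) drop the reference taken before launching */
	return NULL;
}

/*
 * Models fops->close(): takes a device reference (dev_get() would happen
 * here), launches the worker, and returns instead of blocking on flush.
 */
const char *close_demo(void)
{
	pthread_t t;

	log_step("close-returns,");
	pthread_create(&t, NULL, teardown_worker, NULL);
	pthread_join(t, NULL);	/* only the demo joins; real code would not */
	return order;
}
```

The join at the end only makes the demo deterministic; in the real flow
the worker outlives close(), which is exactly why the device reference
from step 2 is needed.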


> > > > > When there are still entities with jobs the situation is even
> > > > > worse
> > > > > since the dma_fences for those jobs can never signal we can just
> > > > > choose between potentially locking up core memory management and
> > > > > random memory corruption. When drivers really mess it up that
> > > > > well
> > > > > let them run into a BUG_ON().
> > > > > 
> > > > > Signed-off-by: Christian König
> > > > > ---
> > > > >    drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
> > > > >    1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > drm_gpu_scheduler
> > > > > *sched)
> > > > I agree with Sima that it should first be documented in the
> > > > function's
> > > > docstring what the user is expected to have done before calling the
> > > > function.
> > > Good point, going to update the documentation as well.
> > Cool thing, thx.
> > Would be great if everything (not totally trivial) necessary to be done
> > before _fini() is mentioned.
> > 
> > One could also think about providing a hint at how the driver can do
> > that. AFAICS the only way for the driver to ensure that is to maintain
> > its own, separate list of submitted jobs.
> 
> Even with a duplicated pending list it's actually currently impossible to do
> this fully cleanly.
> 
> The problem is that the dma_fence object gives no guarantee when callbacks
> are processed, e.g. they can be both processed from interrupt context as
> well as from a CPU which called dma_fence_is_signaled().
> 
> So when a driver (or drm_sched_fini) waits for the last submitted fence it
> actually can be that the drm_sched object still needs to do some proce

Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-23 Thread Christian König

Am 20.09.24 um 15:26 schrieb Philipp Stanner:

On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:

Am 20.09.24 um 10:57 schrieb Philipp Stanner:

On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:

Tearing down the scheduler with jobs still on the pending list
can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

Did you have time yet to look into my proposed waitqueue solution?

I don't remember seeing anything. What have I missed?

https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/


Mhm, I didn't get that in my inbox for some reason.

Interesting approach, I'm just not sure if we can or should wait in 
drm_sched_fini().


Probably better to make that a separate function, something like 
drm_sched_flush() or similar.



When there are still entities with jobs the situation is even
worse
since the dma_fences for those jobs can never signal we can just
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that
well
let them run into a BUG_ON().

Signed-off-by: Christian König
---
   drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
   1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
drm_gpu_scheduler
*sched)

I agree with Sima that it should first be documented in the
function's
docstring what the user is expected to have done before calling the
function.

Good point, going to update the documentation as well.

Cool thing, thx.
Would be great if everything (not totally trivial) necessary to be done
before _fini() is mentioned.

One could also think about providing a hint at how the driver can do
that. AFAICS the only way for the driver to ensure that is to maintain
its own, separate list of submitted jobs.


Even with a duplicated pending list it's actually currently impossible 
to do this fully cleanly.


The problem is that the dma_fence object gives no guarantee when 
callbacks are processed, e.g. they can be both processed from interrupt 
context as well as from a CPU which called dma_fence_is_signaled().


So when a driver (or drm_sched_fini) waits for the last submitted fence 
it actually can be that the drm_sched object still needs to do some 
processing. See the hack in amdgpu_vm_tlb_seq() for more background on 
the problem.


Regards,
Christian.



P.


Thanks,
Christian.


P.

   
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler wile there are still unprocessed jobs can
+	 * lead to use after free issues in the scheduler fence.
+	 */
+	WARN_ON(!list_empty(&sched->pending_list));
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can trivially lead to a system wide
+			 * stop because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 		kfree(sched->sched_rq[i]);
 	}


Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-20 Thread Philipp Stanner
On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> On 20.09.24 at 10:57, Philipp Stanner wrote:
> > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > Tearing down the scheduler with jobs still on the pending list
> > > can
> > > lead to use after free issues. Add a warning if drivers try to
> > > destroy a scheduler which still has work pushed to the HW.
> > Did you have time yet to look into my proposed waitque-solution?
> 
> I don't remember seeing anything. What have I missed?

https://lore.kernel.org/all/20240903094446.29797-2-pstan...@redhat.com/

> 
> > 
> > > When there are still entities with jobs the situation is even
> > > worse
> > > since the dma_fences for those jobs can never signal we can just
> > > choose between potentially locking up core memory management and
> > > random memory corruption. When drivers really mess it up that
> > > well
> > > let them run into a BUG_ON().
> > > 
> > > Signed-off-by: Christian König 
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
> > >   1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index f093616fe53c..8a46fab5cdc8 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > I agree with Sima that it should first be documented in the function's
> > docstring what the user is expected to have done before calling the
> > function.
> 
> Good point, going to update the documentation as well.

Cool thing, thx.
Would be great if everything (not totally trivial) necessary to be done
before _fini() is mentioned.

One could also think about providing a hint at how the driver can do
that. AFAICS the only way for the driver to ensure that is to maintain
its own, separate list of submitted jobs.

P.

> 
> Thanks,
> Christian.
> 
> > 
> > P.
> > 
> > >  
> > >  	drm_sched_wqueue_stop(sched);
> > >  
> > > +	/*
> > > +	 * Tearing down the scheduler while there are still unprocessed jobs can
> > > +	 * lead to use after free issues in the scheduler fence.
> > > +	 */
> > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > +
> > >  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > >  		struct drm_sched_rq *rq = sched->sched_rq[i];
> > >  
> > >  		spin_lock(&rq->lock);
> > > -		list_for_each_entry(s_entity, &rq->entities, list)
> > > +		list_for_each_entry(s_entity, &rq->entities, list) {
> > > +			/*
> > > +			 * The justification for this BUG_ON() is that tearing
> > > +			 * down the scheduler while jobs are pending leaves
> > > +			 * dma_fences unsignaled. Since we have dependencies
> > > +			 * from the core memory management to eventually signal
> > > +			 * dma_fences this can trivially lead to a system wide
> > > +			 * stop because of a locked up memory management.
> > > +			 */
> > > +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
> > > +
> > >  			/*
> > >  			 * Prevents reinsertion and marks job_queue as idle,
> > >  			 * it will removed from rq in drm_sched_entity_fini
> > >  			 * eventually
> > >  			 */
> > >  			s_entity->stopped = true;
> > > +		}
> > >  		spin_unlock(&rq->lock);
> > >  		kfree(sched->sched_rq[i]);
> > >  	}
> 



Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-20 Thread Christian König

On 20.09.24 at 10:57, Philipp Stanner wrote:

On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:

Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

Did you have time yet to look into my proposed waitque-solution?


I don't remember seeing anything. What have I missed?




When there are still entities with jobs the situation is even worse
since the dma_fences for those jobs can never signal we can just
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that well
let them run into a BUG_ON().

Signed-off-by: Christian König 
---
  drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
  1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)

I agree with Sima that it should first be documented in the function's
docstring what the user is expected to have done before calling the
function.


Good point, going to update the documentation as well.

Thanks,
Christian.



P.

  
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler while there are still unprocessed jobs can
+	 * lead to use after free issues in the scheduler fence.
+	 */
+	WARN_ON(!list_empty(&sched->pending_list));
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can trivially lead to a system wide
+			 * stop because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 		kfree(sched->sched_rq[i]);
 	}




Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-20 Thread Philipp Stanner
On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.

Did you have time yet to look into my proposed waitque-solution?

> 
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().
> 
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)

I agree with Sima that it should first be documented in the function's
docstring what the user is expected to have done before calling the
function.

P.

>  
>  	drm_sched_wqueue_stop(sched);
>  
> +	/*
> +	 * Tearing down the scheduler while there are still unprocessed jobs can
> +	 * lead to use after free issues in the scheduler fence.
> +	 */
> +	WARN_ON(!list_empty(&sched->pending_list));
> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>  		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			/*
> +			 * The justification for this BUG_ON() is that tearing
> +			 * down the scheduler while jobs are pending leaves
> +			 * dma_fences unsignaled. Since we have dependencies
> +			 * from the core memory management to eventually signal
> +			 * dma_fences this can trivially lead to a system wide
> +			 * stop because of a locked up memory management.
> +			 */
> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>  			/*
>  			 * Prevents reinsertion and marks job_queue as idle,
>  			 * it will removed from rq in drm_sched_entity_fini
>  			 * eventually
>  			 */
>  			s_entity->stopped = true;
> +		}
>  		spin_unlock(&rq->lock);
>  		kfree(sched->sched_rq[i]);
>  	}



Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-18 Thread Christian König

On 18.09.24 at 16:41, Jani Nikula wrote:

On Wed, 18 Sep 2024, "Christian König"  wrote:

Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

When there are still entities with jobs the situation is even worse
since the dma_fences for those jobs can never signal we can just
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that well
let them run into a BUG_ON().

Signed-off-by: Christian König 
---
  drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
  1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler while there are still unprocessed jobs can
+	 * lead to use after free issues in the scheduler fence.
+	 */
+	WARN_ON(!list_empty(&sched->pending_list));

drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
would be helpful in multi-GPU systems.


Good point, going to fix that.

Regards,
Christian.



BR,
Jani.


+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can trivially lead to a system wide
+			 * stop because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 		kfree(sched->sched_rq[i]);
 	}




Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

2024-09-18 Thread Jani Nikula
On Wed, 18 Sep 2024, "Christian König"  wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().
>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  
>   drm_sched_wqueue_stop(sched);
>  
> + /*
> +	 * Tearing down the scheduler while there are still unprocessed jobs can
> +  * lead to use after free issues in the scheduler fence.
> +  */
> + WARN_ON(!list_empty(&sched->pending_list));

drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
would be helpful in multi-GPU systems.

BR,
Jani.

> +
>   for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>   struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>   spin_lock(&rq->lock);
> - list_for_each_entry(s_entity, &rq->entities, list)
> + list_for_each_entry(s_entity, &rq->entities, list) {
> + /*
> +  * The justification for this BUG_ON() is that tearing
> +  * down the scheduler while jobs are pending leaves
> +  * dma_fences unsignaled. Since we have dependencies
> +  * from the core memory management to eventually signal
> +  * dma_fences this can trivially lead to a system wide
> +  * stop because of a locked up memory management.
> +  */
> + BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>   /*
>* Prevents reinsertion and marks job_queue as idle,
>* it will removed from rq in drm_sched_entity_fini
>* eventually
>*/
>   s_entity->stopped = true;
> + }
>   spin_unlock(&rq->lock);
>   kfree(sched->sched_rq[i]);
>   }

-- 
Jani Nikula, Intel