Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 1/2] drm/doc/rfc: i915 GuC submission / DRM scheduler

2021-06-04 Thread Dave Airlie
On Sat, 5 Jun 2021 at 03:39, Daniel Vetter  wrote:
>
> On Wed, May 26, 2021 at 04:33:56PM -0700, Matthew Brost wrote:
> > Add entry for i915 GuC submission / DRM scheduler integration plan.
> > Follow up patch with details of new parallel submission uAPI to come.
> >
> > v2:
> >  (Daniel Vetter)
> >   - Expand explanation of why bonding isn't supported for GuC
> > submission
> >   - CC some of the DRM scheduler maintainers
> >   - Add priority inheritance / boosting use case
> >   - Add reasoning for removing in order assumptions
> >  (Daniel Stone)
> >   - Add links to priority spec
> >
> > Cc: Christian König 
> > Cc: Luben Tuikov 
> > Cc: Alex Deucher 
> > Cc: Steven Price 
> > Cc: Jon Bloomfield 
> > Cc: Jason Ekstrand 
> > Cc: Dave Airlie 
> > Cc: Daniel Vetter 
> > Cc: Jason Ekstrand 
> > Cc: dri-de...@lists.freedesktop.org
> > Signed-off-by: Matthew Brost 
>
> You have a one-line hunk in the next patch that probably should be here.
> With that moved.
>
> Reviewed-by: Daniel Vetter 

Acked-by: Dave Airlie 

And yes having the todos for GuC tracked would be good externally, so
pressure can be applied for new fw releases with those features.

Dave.


Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-06-04 Thread Daniel Vetter
On Wed, May 26, 2021 at 04:33:57PM -0700, Matthew Brost wrote:
> Add entry for i915 new parallel submission uAPI plan.
> 
> v2:
>  (Daniel Vetter):
>   - Expand logical order explanation
>   - Add dummy header
>   - Only allow N BBs in execbuf IOCTL
>   - Configure parallel submission per slot not per gem context
> v3:
>  (Marcin Ślusarz):
>   - Lots of typos / bad English fixed
>  (Tvrtko Ursulin):
>   - Consistent pseudo code, clean up wording in descriptions
> 
> Cc: Tvrtko Ursulin 
> Cc: Tony Ye 
> CC: Carl Zhang 
> Cc: Daniel Vetter 
> Cc: Jason Ekstrand 
> Signed-off-by: Matthew Brost 
> ---
>  Documentation/gpu/rfc/i915_parallel_execbuf.h | 145 ++
>  Documentation/gpu/rfc/i915_scheduler.rst  |  55 ++-
>  2 files changed, 198 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> 
> diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> new file mode 100644
> index ..20de206e3ab4
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> @@ -0,0 +1,145 @@
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> +
> +/*
> + * i915_context_engines_parallel_submit:

So the idea is to make these kerneldoc and pull them into the rfc section.
Then when we merge, move them to the real uapi section, like what Matt has
done for lmem.

> + *
> + * Setup a slot in the context engine map to allow multiple BBs to be
> + * submitted in a single execbuf IOCTL. Those BBs will then be scheduled to
> + * run on the GPU in parallel. Multiple hardware contexts are created
> + * internally in the i915 to run these BBs. Once a slot is configured for N
> + * BBs only N BBs can be submitted in each execbuf IOCTL and this is implicit
> + * behavior, e.g. the user doesn't tell the execbuf IOCTL there are N BBs,
> + * the execbuf IOCTL knows how many BBs there are based on the slot's
> + * configuration. The N BBs are the last N buffer objects for first N if
> + * I915_EXEC_BATCH_FIRST is set.

s/for/or/
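For illustration, a minimal sketch of what that implicit behavior looks like from userspace, assuming a slot already configured with width=2. The execbuf structures (drm_i915_gem_execbuffer2, drm_i915_gem_exec_object2) and I915_EXEC_BATCH_FIRST are existing i915 uAPI; the fd, handles and the helper name are made up. Note that there is no "number of BBs" field anywhere: the kernel derives it from the slot configuration.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Hypothetical helper: submit 2 BBs to engine-map slot 0, previously
 * configured with width=2. Without I915_EXEC_BATCH_FIRST the last 2
 * objects (objects[2], objects[3]) are taken as the batch buffers.
 */
static int submit_two_batches(int fd, uint32_t ctx_id, uint32_t handles[4])
{
	struct drm_i915_gem_exec_object2 objects[4];
	struct drm_i915_gem_execbuffer2 execbuf;
	int i;

	memset(objects, 0, sizeof(objects));
	for (i = 0; i < 4; i++)
		objects[i].handle = handles[i];

	memset(&execbuf, 0, sizeof(execbuf));
	execbuf.buffers_ptr = (uintptr_t)objects;
	execbuf.buffer_count = 4;
	execbuf.flags = 0;		/* engine-map slot 0 */
	execbuf.rsvd1 = ctx_id;		/* context holding the parallel slot */

	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}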

> + *
> + * There are two currently defined ways to control the placement of the
> + * hardware contexts on physical engines: default behavior (no flags) and
> + * I915_PARALLEL_IMPLICIT_BONDS (a flag). More flags may be added in the
> + * future as new hardware / use cases arise. Details of how to use this
> + * interface are documented above the flags field in this structure.
> + *
> + * Returns -EINVAL if the hardware context placement configuration is invalid
> + * or if the placement configuration isn't supported on the platform /
> + * submission interface.
> + * Returns -ENODEV if the extension isn't supported on the platform /
> + * submission interface.
> + */
> +struct i915_context_engines_parallel_submit {
> + struct i915_user_extension base;
> +
> + __u16 engine_index; /* slot for parallel engine */

Kernel doc here for the inline comments too.
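For reference, the inline kerneldoc being asked for would look roughly like this, using the standard "/** @member: ... */" form that kernel-doc picks up. The descriptions below are just the existing comments from this hunk, nothing new:

struct i915_context_engines_parallel_submit {
	/** @base: base user extension */
	struct i915_user_extension base;

	/** @engine_index: slot for parallel engine */
	__u16 engine_index;

	/** @width: number of contexts per parallel engine */
	__u16 width;

	/** @num_siblings: number of siblings per context */
	__u16 num_siblings;

	/** @mbz16: reserved, must be zero */
	__u16 mbz16;

	/* remaining members as in the rest of this patch */
};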

> + __u16 width;/* number of contexts per parallel engine */
> + __u16 num_siblings; /* number of siblings per context */
> + __u16 mbz16;
> +/*
> + * Default placement behavior (currently unsupported):
> + *
> + * Allow BBs to be placed on any available engine instance. In this case each
> + * context's engine mask indicates where that context can be placed. It is
> + * implied in this mode that all contexts have mutually exclusive placement
> + * (e.g. if one context is running on CSX[0] no other contexts can run on
> + * CSX[0]).
> + *
> + * Example 1 pseudo code:
> + * CSX,Y[N] = generic engine class X or Y, logical instance N
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID)
> + * set_parallel(engine_index=0, width=2, num_siblings=2,
> + *   engines=CSX[0],CSX[1],CSY[0],CSY[1])
> + *
> + * Results in the following valid placements:
> + * CSX[0], CSY[0]
> + * CSX[0], CSY[1]
> + * CSX[1], CSY[0]
> + * CSX[1], CSY[1]
> + *
> + * This can also be thought of as 2 virtual engines described by a 2-D array
> + * in the engines field:
> + * VE[0] = CSX[0], CSX[1]
> + * VE[1] = CSY[0], CSY[1]
> + *
> + * Example 2 pseudo code:
> + * CSX[N] = generic engine of same class X, logical instance N
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID)
> + * set_parallel(engine_index=0, width=2, num_siblings=3,
> + *   engines=CSX[0],CSX[1],CSX[2],CSX[0],CSX[1],CSX[2])
> + *
> + * Results in the following valid placements:
> + * CSX[0], CSX[1]
> + * CSX[0], CSX[2]
> + * CSX[1], CSX[0]
> + * CSX[1], CSX[2]
> + * CSX[2], CSX[0]
> + * CSX[2], CSX[1]
> + *
> + * This can also be thought of as 2 virtual engines described by a 2-D array
> + * in the engines field:
> + * VE[0] = CSX[0], CSX[1], CSX[2]
> + * VE[1] = CSX[0], CSX[1], CSX[2]
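To illustrate Example 2 with concrete uAPI types: struct i915_engine_class_instance is existing i915 uAPI, and the render class stands in for the generic "CSX" above. The layout of the trailing engines[] member (width * num_siblings entries, one virtual engine per row) is an assumption based on the description in this comment, since that part of the header is not visible in this hunk.

/* Example 2: one parallel slot, width=2, num_siblings=3, both virtual
 * engines backed by instances 0..2 of the same engine class.
 */
struct i915_engine_class_instance engines[2 * 3] = {
	/* VE[0] = CSX[0], CSX[1], CSX[2] */
	{ .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 0 },
	{ .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 1 },
	{ .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 2 },
	/* VE[1] = CSX[0], CSX[1], CSX[2] */
	{ .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 0 },
	{ .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 1 },
	{ .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 2 },
};

struct i915_context_engines_parallel_submit parallel = {
	.base.name	= I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
	.engine_index	= 0,
	.width		= 2,
	.num_siblings	= 3,
	/* the engines[] array above would follow the fixed-size members, and
	 * the whole extension chains into the context's set_engines parameter
	 */
};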
> + *
> + * This enables a use case where all engines are created equally, we don't
> + * care where they are scheduled, we 

Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 1/2] drm/doc/rfc: i915 GuC submission / DRM scheduler

2021-06-04 Thread Daniel Vetter
On Wed, May 26, 2021 at 04:33:56PM -0700, Matthew Brost wrote:
> Add entry for i915 GuC submission / DRM scheduler integration plan.
> Follow up patch with details of new parallel submission uAPI to come.
> 
> v2:
>  (Daniel Vetter)
>   - Expand explanation of why bonding isn't supported for GuC
> submission
>   - CC some of the DRM scheduler maintainers
>   - Add priority inheritance / boosting use case
>   - Add reasoning for removing in order assumptions
>  (Daniel Stone)
>   - Add links to priority spec
> 
> Cc: Christian König 
> Cc: Luben Tuikov 
> Cc: Alex Deucher 
> Cc: Steven Price 
> Cc: Jon Bloomfield 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: Daniel Vetter 
> Cc: Jason Ekstrand 
> Cc: dri-de...@lists.freedesktop.org
> Signed-off-by: Matthew Brost 

You have a one-line hunk in the next patch that probably should be here.
With that moved.

Reviewed-by: Daniel Vetter 

Also did you get an ack from Carl on these?
-Daniel

> ---
>  Documentation/gpu/rfc/i915_scheduler.rst | 85 
>  Documentation/gpu/rfc/index.rst  |  4 ++
>  2 files changed, 89 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst
> 
> diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
> new file mode 100644
> index ..7faa46cde088
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> @@ -0,0 +1,85 @@
> +=========================================
> +I915 GuC Submission/DRM Scheduler Section
> +=========================================
> +
> +Upstream plan
> +=============
> +For upstream the overall plan for landing GuC submission and integrating the
> +i915 with the DRM scheduler is:
> +
> +* Merge basic GuC submission
> + * Basic submission support for all gen11+ platforms
> + * Not enabled by default on any current platforms but can be enabled via
> +   modparam enable_guc
> + * Lots of rework will need to be done to integrate with DRM scheduler so
> +   no need to nit pick everything in the code, it just should be
> +   functional, no major coding style / layering errors, and not regress
> +   execlists
> + * Update IGTs / selftests as needed to work with GuC submission
> + * Enable CI on supported platforms for a baseline
> + * Rework / get CI healthy for GuC submission in place as needed
> +* Merge new parallel submission uAPI
> + * Bonding uAPI completely incompatible with GuC submission, plus it has
> +   severe design issues in general, which is why we want to retire it no
> +   matter what
> + * New uAPI adds I915_CONTEXT_ENGINES_EXT_PARALLEL context setup step
> +   which configures a slot with N contexts 
> + * After I915_CONTEXT_ENGINES_EXT_PARALLEL a user can submit N batches to
> +   a slot in a single execbuf IOCTL and the batches run on the GPU in
> +   parallel
> + * Initially only for GuC submission but execlists can be supported if
> +   needed
> +* Convert the i915 to use the DRM scheduler
> + * GuC submission backend fully integrated with DRM scheduler
> + * All request queues removed from backend (e.g. all backpressure
> +   handled in DRM scheduler)
> + * Resets / cancels hook in DRM scheduler
> + * Watchdog hooks into DRM scheduler
> + * Lots of complexity of the GuC backend can be pulled out once
> +   integrated with DRM scheduler (e.g. state machine gets
> +   simpler, locking gets simpler, etc...)
> + * Execlist backend will do the minimum required to hook in the DRM
> +   scheduler so it can live next to the fully integrated GuC backend
> + * Legacy interface
> + * Features like timeslicing / preemption / virtual engines would
> +   be difficult to integrate with the DRM scheduler and these
> +   features are not required for GuC submission as the GuC does
> +   these things for us
> + * ROI low on fully integrating into DRM scheduler
> + * Fully integrating would add lots of complexity to DRM
> +   scheduler
> + * Port i915 priority inheritance / boosting feature in DRM scheduler
> + * Used for i915 page flip, may be useful to other DRM drivers as
> +   well
> + * Will be an optional feature in the DRM scheduler
> + * Remove in-order completion assumptions from DRM scheduler
> + * Even when using the DRM scheduler the backends will handle
> +   preemption, timeslicing, etc... so it is possible for jobs to
> +   finish out of order
> + * Pull out i915 priority levels and use DRM priority levels
> + * Optimize DRM scheduler as needed
> +
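To make the DRM scheduler integration items above slightly more concrete: a GuC backend sitting behind drm/scheduler essentially has to provide the drm_sched_backend_ops hooks, with backpressure, resets and the watchdog all flowing through them. A rough sketch follows; the drm_sched_backend_ops shape is the existing drm/scheduler interface (exact signatures vary by kernel version), while all i915/GuC helper names here are hypothetical.

#include <drm/gpu_scheduler.h>

static struct dma_fence *i915_sched_run_job(struct drm_sched_job *sched_job)
{
	/* Hand the request to the GuC backend; the returned HW fence signals
	 * on completion, so backpressure is handled entirely in drm_sched.
	 */
	return i915_guc_submit_request(to_i915_sched_job(sched_job));	/* hypothetical */
}

static enum drm_gpu_sched_stat
i915_sched_timedout_job(struct drm_sched_job *sched_job)
{
	/* Watchdog / TDR hook: reset the context through the GuC and let the
	 * scheduler deal with resubmitting or cancelling pending jobs.
	 */
	i915_guc_reset_context(to_i915_sched_job(sched_job));		/* hypothetical */
	return DRM_GPU_SCHED_STAT_NOMINAL;
}

static void i915_sched_free_job(struct drm_sched_job *sched_job)
{
	i915_sched_job_cleanup(sched_job);				/* hypothetical */
}

static const struct drm_sched_backend_ops i915_sched_ops = {
	.run_job	= i915_sched_run_job,
	.timedout_job	= i915_sched_timedout_job,
	.free_job	= i915_sched_free_job,
};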
> +New uAPI for basic GuC submission
> +=================================
> +No major changes are required to the uAPI for basic GuC submission. The only
> +change is a new 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Christian König

On 04.06.21 10:57, Daniel Vetter wrote:

On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:

On 02.06.21 21:19, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:

On 02.06.21 20:48, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.
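As an illustration of the read-only part of that idea: the kernel can hand out the mapping of the sync-object page with write permission stripped, so only the kernel or the hardware can advance the values. The VM_WRITE/VM_MAYWRITE + vm_insert_page pattern below is standard kernel code; the surrounding structure and function names are made up.

#include <linux/mm.h>

/* Hypothetical pool of memory-based sync objects, written only by the
 * kernel / the GPU.
 */
struct userfence_pool {
	struct page *page;
};

/* mmap handler: userspace only ever gets a read-only view. */
static int userfence_mmap(struct userfence_pool *pool,
			  struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_WRITE)
		return -EPERM;

	/* Also forbid mprotect() back to writable. */
	vma->vm_flags &= ~(VM_WRITE | VM_MAYWRITE);

	return vm_insert_page(vma, vma->vm_start, pool->page);
}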


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)

Well to be honest I would just push back on our hardware/firmware guys that
we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).

Have you seen that one here:
https://www.spinics.net/lists/amd-gfx/msg63101.html :)

I've rejected it because I think polling for 6 seconds on a TLB flush which
can block interrupts as well is just madness.

Hm but I thought you had like 2 tlb flush modes, the shitty one (with
retrying page faults) and the not so shitty one?


Yeah, we call this the lightweight and the heavyweight tlb flush.

The lightweight can be used when you are sure that you don't have any of 
the PTEs currently in flight in the 3D/DMA engine and you just need to 
invalidate the TLB.


The heavyweight must be used when you need to invalidate the TLB *AND* 
make sure that no concurrent operation moves new stuff into the TLB.


The problem is for this use case we have to use the heavyweight one.


But yeah at that point I think you just have to bite one of the bullets.


Yeah, completely agree. We can choose which way we want to die, but it's 
certainly not going to be nice whatever we do.




The thing is with hmm/userspace memory fence model this will be even
worse, because you will _have_ to do this tlb flush deep down in core mm
functions, so this is going to be userptr, but worse.
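For reference, "deep down in core mm functions" concretely means the mmu interval notifier invalidate path, roughly like the sketch below. The notifier API is the real one; the driver structure and the zap/flush helper are made up. The painful part is exactly that the heavyweight flush has to happen synchronously here, with no dma_fence to hand the wait off to.

#include <linux/mmu_notifier.h>

struct my_userptr {			/* hypothetical driver object */
	struct mmu_interval_notifier notifier;
	/* GPU VA range, page array, ... */
};

void my_driver_zap_and_flush(struct my_userptr *up);	/* hypothetical */

static bool my_userptr_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
{
	struct my_userptr *up = container_of(mni, struct my_userptr, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mmu_interval_set_seq(mni, cur_seq);

	/* Zap the GPU PTEs and do the heavyweight TLB flush synchronously,
	 * inside core mm invalidation.
	 */
	my_driver_zap_and_flush(up);

	return true;
}

static const struct mmu_interval_notifier_ops my_userptr_notifier_ops = {
	.invalidate = my_userptr_invalidate,
};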

With the dma_resv/dma_fence bo memory management model you can at least
wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
separate thread or something like that. If the hw really is that slow.

Somewhat aside: Personally I think that sriov needs to move over to the
compute model, i.e. indefinite timeouts, no tdr, because everything takes
too long. At least looking around sriov timeouts tend to be 10x bare
metal, across the board.

But for stuff like cloud gaming that's serious amounts of heavy lifting
since it brings us right back to "the entire linux/android 3d stack is built
on top of dma_fence right now".


The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
blindly trusts userspace to set up everything else, and even just wraps
dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete and tdr fires, a) the
kernel shoots the entire context and b) userspace recovers by setting up a
new ringbuffer

- memory management is done using ttm only, you 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Daniel Vetter
On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > 
> > > Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > > > 
> > > > > > Yes, we can't break anything because we don't want to complicate 
> > > > > > things
> > > > > > for us. It's pretty much all NAK'd already. We are trying to gather 
> > > > > > more
> > > > > > knowledge and then make better decisions.
> > > > > > 
> > > > > > The idea we are considering is that we'll expose memory-based sync 
> > > > > > objects
> > > > > > to userspace for read only, and the kernel or hw will strictly 
> > > > > > control the
> > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > userspace can decide not to signal a job, so even if userspace can't
> > > > > > overwrite memory-based sync object states arbitrarily, it can still 
> > > > > > decide
> > > > > > not to signal them, and then a future fence is born.
> > > > > > 
> > > > > This would actually be treated as a GPU hang caused by that context, 
> > > > > so it
> > > > > should be fine.
> > > > This is practically what I proposed already, except you're not doing it 
> > > > with
> > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > what you want for that compute model.
> > > > 
> > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of 
> > > > work
> > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > actually drive the hardware like it should be for compute or direct
> > > > display.
> > > > 
> > > > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > interesting to know what doesn't work there instead of amd folks going off
> > > > into internal again and then coming back with another rfc from out of
> > > > nowhere :-)
> > > Well to be honest I would just push back on our hardware/firmware guys 
> > > that
> > > we need to keep kernel queues forever before going down that route.
> > I looked again, and you said the model won't work because preemption is way
> > too slow, even when the context is idle.
> > 
> > I guess at that point I got maybe too fed up and just figured "not my
> > problem", but if preempt is too slow as the unload fence, you can do it
> > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > otherwise your hw is just garbage and won't even be fast for direct submit
> > compute workloads).
> 
> Have you seen that one here:
> https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> 
> I've rejected it because I think polling for 6 seconds on a TLB flush which
> can block interrupts as well is just madness.

Hm but I thought you had like 2 tlb flush modes, the shitty one (with
retrying page faults) and the not so shitty one?

But yeah at that point I think you just have to bite one of the bullets.

The thing is with hmm/userspace memory fence model this will be even
worse, because you will _have_ to do this tlb flush deep down in core mm
functions, so this is going to be userptr, but worse.

With the dma_resv/dma_fence bo memory management model you can at least
wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
separate thread or something like that. If the hw really is that slow.
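A rough sketch of the "wrap that tlb flush into a dma_fence and push the waiting onto a separate thread" idea. The dma_fence API used here (dma_fence_init/dma_fence_signal and the ops) is the real one; the TLB-flush polling helper and the surrounding structure are made up for illustration.

#include <linux/dma-fence.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

void my_hw_heavyweight_tlb_flush_and_wait(void);	/* hypothetical */

struct tlb_flush_fence {
	struct dma_fence base;
	spinlock_t lock;
	struct work_struct work;
};

static const char *tlb_fence_driver_name(struct dma_fence *f)
{
	return "example";
}

static const char *tlb_fence_timeline_name(struct dma_fence *f)
{
	return "tlb-flush";
}

static const struct dma_fence_ops tlb_fence_ops = {
	.get_driver_name   = tlb_fence_driver_name,
	.get_timeline_name = tlb_fence_timeline_name,
};

static void tlb_flush_work(struct work_struct *work)
{
	struct tlb_flush_fence *f = container_of(work, typeof(*f), work);

	/* Poll for the shootdown in process context, so nothing sits with
	 * interrupts off while this potentially takes seconds.
	 */
	my_hw_heavyweight_tlb_flush_and_wait();

	dma_fence_signal(&f->base);
	dma_fence_put(&f->base);	/* drop the worker's reference */
}

/* Returns a fence that signals once the TLB shootdown has completed. */
static struct dma_fence *queue_tlb_flush(void)
{
	struct tlb_flush_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return ERR_PTR(-ENOMEM);

	spin_lock_init(&f->lock);
	dma_fence_init(&f->base, &tlb_fence_ops, &f->lock,
		       dma_fence_context_alloc(1), 1);
	INIT_WORK(&f->work, tlb_flush_work);

	dma_fence_get(&f->base);	/* reference for the worker */
	schedule_work(&f->work);

	return &f->base;
}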

Somewhat aside: Personally I think that sriov needs to move over to the
compute model, i.e. indefinite timeouts, no tdr, because everything takes
too long. At least looking around sriov timeouts tend to be 10x bare
metal, across the board.

But for stuff like cloud gaming that's serious amounts of heavy lifting
since it brings us right back to "the entire linux/android 3d stack is built
on top of dma_fence right now".

> > The only thing that you need to do when you use pte clearing + tlb
> > shootdown instead of preemption as the unload fence for buffers that get
> > moved is that if you get any gpu page fault, you don't serve that, but
> > instead treat it as a tdr and shoot the context permanently.
> > 
> > So summarizing the model I proposed:
> > 
> > - you allow userspace to directly write into the ringbuffer, and also
> >write the fences directly
> > 
> > - actual submit is done by the kernel, using drm/scheduler. The kernel
> >blindly trusts userspace to set up everything else, and even just wraps
> >dma_fences around the userspace memory fences.
> > 
> > - the only check is tdr. If a fence doesn't complete and tdr fires, a) the
> >    kernel shoots the entire context and b) userspace recovers by setting up a
> >new ringbuffer
> > 
> > - memory management is done using ttm only, you still need to supply the
> >buffer list (ofc that list includes the always present 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Christian König

On 02.06.21 21:19, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:


On 02.06.21 20:48, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)

Well to be honest I would just push back on our hardware/firmware guys that
we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).


Have you seen that one here: 
https://www.spinics.net/lists/amd-gfx/msg63101.html :)


I've rejected it because I think polling for 6 seconds on a TLB flush 
which can block interrupts as well is just madness.






The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
   write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
   blindly trusts userspace to set up everything else, and even just wraps
   dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete and tdr fires, a) the
   kernel shoots the entire context and b) userspace recovers by setting up a
   new ringbuffer

- memory management is done using ttm only, you still need to supply the
   buffer list (ofc that list includes the always present ones, so CS will
   only get the list of special buffers like today). If your hw can't turn off
   gpu page faults and you ever get one we pull up the same old solution:
   Kernel shoots the entire context.

   The important thing is that from the gpu pov memory management works
   exactly like a compute workload with direct submit, except that you just
   terminate the context on _any_ page fault, instead of only on those that go
   somewhere where there's really no mapping while repairing the others.

   Also I guess from reading the old thread this means you'd disable page
   fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fence waits for all
   userspace memory fences (represented as dma_fence) to complete, with
   maybe some fudge to busy-spin until we've reached the actual end of the
   ringbuffer (maybe you have an IB tail there after the memory fence write,
   we have that on intel hw), and it waits for the memory to get
   "unloaded". This is either preemption, or pte clearing + tlb shootdown,
   or whatever else your hw provides which is a) used for dynamic memory
   management b) fast enough for actual memory management.

- any time a context dies we force-complete all its pending fences,
   in-order ofc
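Putting the dma_fence-wrapping, tdr-only-check and force-complete pieces of that list together, the kernel-side bookkeeping could look roughly like the sketch below. Only the dma_fence calls are the real API; the context structure and the ban helper are hypothetical. The point is just that the kernel, not userspace, guarantees the dma_fence "always signals" contract.

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct uctx {				/* hypothetical per-context state */
	spinlock_t lock;
	struct list_head pending;	/* wrapped fences, submission order */
};

struct ufence {				/* dma_fence wrapping a userspace memory fence */
	struct dma_fence base;
	struct list_head link;
};

void my_hw_ban_context(struct uctx *ctx);	/* hypothetical */

/* TDR / context-death path: ban the context, then force-complete all of its
 * pending fences in order with an error, so nothing upstream waits forever
 * even though userspace never signalled its memory fences.
 */
static void uctx_kill(struct uctx *ctx)
{
	struct ufence *uf, *tmp;

	my_hw_ban_context(ctx);

	spin_lock(&ctx->lock);
	list_for_each_entry_safe(uf, tmp, &ctx->pending, link) {
		dma_fence_set_error(&uf->base, -EIO);
		dma_fence_signal(&uf->base);
		list_del(&uf->link);
		dma_fence_put(&uf->base);
	}
	spin_unlock(&ctx->lock);
}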

So from hw pov this looks 99% like direct userspace submit, with the exact
same mappings, command sequences and everything else. The only difference
is that the ringbuffer head/tail updates happen from drm/scheduler, instead
of directly from userspace.

None of this stuff needs funny tricks where the kernel controls the
writes to