Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Jason Ekstrand
On Tue, May 4, 2021 at 12:16 PM Marek Olšák  wrote:
>
> I see some mentions of XNACK and recoverable page faults. Note that all 
> gaming AMD hw that has userspace queues doesn't have XNACK, so there is no 
> overhead in compute units. My understanding is that recoverable page faults 
> are still supported without XNACK, but instead of the compute unit replaying 
> the faulting instruction, the L1 cache does that. Anyway, the point is that 
> XNACK is totally irrelevant here.
>
> Marek
>
> On Tue., May 4, 2021, 08:48 Christian König, 
>  wrote:
>>
>> On 04.05.21 at 13:13, Daniel Vetter wrote:
>> > On Tue, May 4, 2021 at 12:53 PM Christian König
>> >  wrote:
>> >> On 04.05.21 at 11:47, Daniel Vetter wrote:
>> >>> [SNIP]
>>  Yeah, it just takes too long for the preemption to complete to be really
>>  useful for the feature we are discussing here.
>> 
>>  As I said, when the kernel requests to preempt a queue we can easily 
>>  expect a
>>  timeout of ~100ms until that comes back. For compute that is even in the
>>  multiple seconds range.
>> >>> 100ms for preempting an idle request sounds like broken hw to me. Of
>> >>> course preempting something that actually runs takes a while, that's
>> >>> nothing new. But it's also not the thing we're talking about here. Is 
>> >>> this
>> >>> 100ms an actual number from hw for an actual idle ringbuffer?
>> >> Well 100ms is just an example of the scheduler granularity. Let me
>> >> explain in a wider context.
>> >>
>> >> The hardware can have X queues mapped at the same time and every Y time
>> >> interval the hardware scheduler checks if those queues have changed;
>> >> only if they have changed are the necessary steps to reload them started.
>> >>
>> >> Multiple queues can be rendering at the same time, so you can have X as
>> >> a high-priority queue active and just waiting for a signal to start,
>> >> the client rendering one frame after another, and a third background
>> >> compute task mining bitcoins for you.
>> >>
>> >> As long as everything is static this is perfectly performant. Adding a
>> >> queue to the list of active queues is also relatively simple, but taking
>> >> one down requires you to wait until we are sure the hardware has seen
>> >> the change and reloaded the queues.
>> >>
>> >> Think of it as an RCU grace period. This is simply not something which
>> >> is made to be used constantly, but rather just at process termination.
>> > Uh ... that indeed sounds rather broken.
>>
>> Well I wouldn't call it broken. It's just not made for the use case we
>> are trying to abuse it for.
>>
>> > Otoh it's just a dma_fence that we'd inject as this unload-fence.
>>
>> Yeah, exactly that's why it isn't much of a problem for process
>> termination or freeing memory.
>>
>> > So by and large everyone should already be able to cope with it taking a
>> > bit longer. So from a design pov I don't see a huge problem, but I
>> > guess you guys won't be happy since it means on amd hw there will be
>> > random unsightly stalls in desktop linux usage.
>> >
>>  The "preemption" feature is really called suspend and made just for the 
>>  case
>>  when we want to put a process to sleep or need to forcefully kill it for
>>  misbehavior or stuff like that. It is not meant to be used in normal
>>  operation.
>> 
>>  If we only attach it on ->move then yeah maybe a last resort 
>>  possibility to
>>  do it this way, but I think in that case we could rather stick with 
>>  kernel
>>  submissions.
>> >>> Well this is a hybrid userspace ring + kernel augmented submit mode, so 
>> >>> you
>> >>> can keep dma-fences working. Because the dma-fence stuff won't work with
>> >>> pure userspace submit, I think that conclusion is rather solid. Once more
>> >>> even after this long thread here.
>> >> When assisted with unload fences, then yes. Problem is that I can't see
>> >> how we could currently implement those in a performant way.
>> > Is there really no way to fix fw here? Like if process start/teardown
>> > takes 100ms, that's going to suck no matter what.
>>
>> As I said, adding the queue is unproblematic and teardown just results in
>> a bit more waiting to free things up.
>>
>> More problematic are overcommit swapping and OOM situations, which need to
>> wait for the hw scheduler to come back and tell us that the queue is now
>> unmapped.
>>
>> > Also, if userspace lies to us and keeps pushing crap into the ring
>> > after it's supposed to be idle: Userspace is already allowed to waste
>> > gpu time. If you're too worried about this set a fairly aggressive
>> > preempt timeout on the unload fence, and kill the context if it takes
>> > longer than what preempting an idle ring should take (because that
>> > would indicate broken/evil userspace).
>>  I think you have the wrong expectation here. It is perfectly valid and
>>  expected for userspace to keep writing commands into the ring buffer.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Marek Olšák
I see some mentions of XNACK and recoverable page faults. Note that all
gaming AMD hw that has userspace queues doesn't have XNACK, so there is no
overhead in compute units. My understanding is that recoverable page faults
are still supported without XNACK, but instead of the compute unit
replaying the faulting instruction, the L1 cache does that. Anyway, the
point is that XNACK is totally irrelevant here.

Marek

On Tue., May 4, 2021, 08:48 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> On 04.05.21 at 13:13, Daniel Vetter wrote:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> >  wrote:
> >> On 04.05.21 at 11:47, Daniel Vetter wrote:
> >>> [SNIP]
>  Yeah, it just takes too long for the preemption to complete to be
> really
>  useful for the feature we are discussing here.
> 
>  As I said, when the kernel requests to preempt a queue we can easily
> expect a
>  timeout of ~100ms until that comes back. For compute that is even in
> the
>  multiple seconds range.
> >>> 100ms for preempting an idle request sounds like broken hw to me. Of
> >>> course preempting something that actually runs takes a while, that's
> >>> nothing new. But it's also not the thing we're talking about here. Is
> this
> >>> 100ms an actual number from hw for an actual idle ringbuffer?
> >> Well 100ms is just an example of the scheduler granularity. Let me
> >> explain in a wider context.
> >>
> >> The hardware can have X queues mapped at the same time and every Y time
> >> interval the hardware scheduler checks if those queues have changed;
> >> only if they have changed are the necessary steps to reload them
> started.
> >>
> >> Multiple queues can be rendering at the same time, so you can have X as
> >> a high-priority queue active and just waiting for a signal to start,
> >> the client rendering one frame after another, and a third background
> >> compute task mining bitcoins for you.
> >>
> >> As long as everything is static this is perfectly performant. Adding a
> >> queue to the list of active queues is also relatively simple, but taking
> >> one down requires you to wait until we are sure the hardware has seen
> >> the change and reloaded the queues.
> >>
> >> Think of it as an RCU grace period. This is simply not something which
> >> is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
>
> Well I wouldn't call it broken. It's just not made for the use case we
> are trying to abuse it for.
>
> > Otoh it's just a dma_fence that we'd inject as this unload-fence.
>
> Yeah, exactly that's why it isn't much of a problem for process
> termination or freeing memory.
>
> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys won't be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> >
>  The "preemption" feature is really called suspend and made just for
> the case
>  when we want to put a process to sleep or need to forcefully kill it
> for
>  misbehavior or stuff like that. It is not meant to be used in normal
>  operation.
> 
>  If we only attach it on ->move then yeah maybe a last resort
> possibility to
>  do it this way, but I think in that case we could rather stick with
> kernel
>  submissions.
> >>> Well this is a hybrid userspace ring + kernel augmented submit mode, so
> you
> >>> can keep dma-fences working. Because the dma-fence stuff won't work with
> >>> pure userspace submit, I think that conclusion is rather solid. Once
> more
> >>> even after this long thread here.
> >> When assisted with unload fences, then yes. Problem is that I can't see
> >> how we could currently implement those in a performant way.
> > Is there really no way to fix fw here? Like if process start/teardown
> > takes 100ms, that's going to suck no matter what.
>
> As I said, adding the queue is unproblematic and teardown just results in
> a bit more waiting to free things up.
>
> More problematic are overcommit swapping and OOM situations, which need to
> wait for the hw scheduler to come back and tell us that the queue is now
> unmapped.
>
> > Also, if userspace lies to us and keeps pushing crap into the ring
> > after it's supposed to be idle: Userspace is already allowed to waste
> > gpu time. If you're too worried about this set a fairly aggressive
> > preempt timeout on the unload fence, and kill the context if it takes
> > longer than what preempting an idle ring should take (because that
> > would indicate broken/evil userspace).
>  I think you have the wrong expectation here. It is perfectly valid and
>  expected for userspace to keep writing commands into the ring buffer.
> 
>  After all when one frame is completed they want to immediately start
>  rendering the next one.
> >>> Sure, for the true userspace direct 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 02:48:35PM +0200, Christian König wrote:
> On 04.05.21 at 13:13, Daniel Vetter wrote:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> >  wrote:
> > > On 04.05.21 at 11:47, Daniel Vetter wrote:
> > > > [SNIP]
> > > > > Yeah, it just takes too long for the preemption to complete to be 
> > > > > really
> > > > > useful for the feature we are discussing here.
> > > > > 
> > > > > As I said, when the kernel requests to preempt a queue we can easily 
> > > > > expect a
> > > > > timeout of ~100ms until that comes back. For compute that is even in 
> > > > > the
> > > > > multiple seconds range.
> > > > 100ms for preempting an idle request sounds like broken hw to me. Of
> > > > course preempting something that actually runs takes a while, that's
> > > > nothing new. But it's also not the thing we're talking about here. Is 
> > > > this
> > > > 100ms an actual number from hw for an actual idle ringbuffer?
> > > Well 100ms is just an example of the scheduler granularity. Let me
> > > explain in a wider context.
> > > 
> > > The hardware can have X queues mapped at the same time and every Y time
> > > interval the hardware scheduler checks if those queues have changed;
> > > only if they have changed are the necessary steps to reload them started.
> > > 
> > > Multiple queues can be rendering at the same time, so you can have X as
> > > a high-priority queue active and just waiting for a signal to start,
> > > the client rendering one frame after another, and a third background
> > > compute task mining bitcoins for you.
> > > 
> > > As long as everything is static this is perfectly performant. Adding a
> > > queue to the list of active queues is also relatively simple, but taking
> > > one down requires you to wait until we are sure the hardware has seen
> > > the change and reloaded the queues.
> > > 
> > > Think of it as an RCU grace period. This is simply not something which
> > > is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
> 
> Well I wouldn't call it broken. It's just not made for the use case we are
> trying to abuse it for.
> 
> > Otoh it's just a dma_fence that we'd inject as this unload-fence.
> 
> Yeah, exactly that's why it isn't much of a problem for process termination
> or freeing memory.

Ok so your hw really hates the unload fence. On ours the various queues
are a bit more explicit, so largely unload/preempt is the same as context
switch and pretty quick. Afaik at least.

Still baffled that you can't fix this in fw, but oh well. Judging from how
fast our fw team moves I'm not surprised :-/

Anyway so next plan: Make this work exactly like hmm:
1. wait for the user fence as a dma-fence fake thing, tdr makes this safe
2. remove pte
3. do synchronous tlb flush

Tada, no more 100ms stall in your buffer move callbacks. And feel free to
pack up 2&3 into an async worker or something if it takes too long and
treating it as a bo move dma_fence works better. Also that way you might be
able to batch up the tlb flushing if it's too damn expensive, by
collecting them all under a single dma_fence (and starting a new tlb flush
cycle every time ->enable_signaling gets called).
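
To make that concrete, here's a minimal sketch in kernel-style C. Only
dma_fence_wait is the real API; the gpu_vm_*/gpu_bo helpers are invented
stand-ins for the driver-specific pieces:

/* Hypothetical eviction path following steps 1-3 above. */
static long evict_bo(struct gpu_vm *vm, struct gpu_bo *bo)
{
	long ret;

	/* 1. Wait on the userspace fence wrapped as a fake dma-fence.
	 * Untrusted on its own, but tdr makes it safe: a fence that
	 * never signals times out and the context gets shot down. */
	ret = dma_fence_wait(bo->user_fence, false);
	if (ret)
		return ret;

	/* 2. Remove the PTEs so the GPU can no longer reach the pages. */
	gpu_vm_clear_ptes(vm, bo->gpu_addr, bo->num_pages);

	/* 3. Synchronous TLB flush; afterwards no stale translation
	 * survives. 2&3 could instead run async behind a bo-move
	 * dma_fence, batching several flushes under one fence. */
	gpu_vm_flush_tlb_sync(vm);
	return 0;
}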

As long as you nack any gpu faults and don't try to fill them for these
legacy contexts that support dma-fence there's no harm in using the hw
facilities.

Ofc if you're now telling me your synchronous tlb flush is also 100ms,
then maybe just throw the hw out the window, and accept that the
millisecond anything evicts anything (good luck with userptr) the screen
freezes for a bit.

> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys won't be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> > 
> > > > > The "preemption" feature is really called suspend and made just for 
> > > > > the case
> > > > > when we want to put a process to sleep or need to forcefully kill it 
> > > > > for
> > > > > misbehavior or stuff like that. It is not meant to be used in normal
> > > > > operation.
> > > > > 
> > > > > If we only attach it on ->move then yeah maybe a last resort 
> > > > > possibility to
> > > > > do it this way, but I think in that case we could rather stick with 
> > > > > kernel
> > > > > submissions.
> > > > Well this is a hybrid userspace ring + kernel augmented submit mode, so 
> > > > you
> > > > can keep dma-fences working. Because the dma-fence stuff won't work with
> > > > pure userspace submit, I think that conclusion is rather solid. Once 
> > > > more
> > > > even after this long thread here.
> > > When assisted with unload fences, then yes. Problem is that I can't see
> > > how we could currently implement those in a performant way.
> > Is there really no way to fix fw here? Like if process start/teardown
> > takes 100ms, that's going to suck no matter what.
> 
> As I said 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 13:13, Daniel Vetter wrote:

On Tue, May 4, 2021 at 12:53 PM Christian König
 wrote:

On 04.05.21 at 11:47, Daniel Vetter wrote:

[SNIP]

Yeah, it just takes too long for the preemption to complete to be really
useful for the feature we are discussing here.

As I said, when the kernel requests to preempt a queue we can easily expect a
timeout of ~100ms until that comes back. For compute that is even in the
multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preempting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms an actual number from hw for an actual idle ringbuffer?

Well 100ms is just an example of the scheduler granularity. Let me
explain in a wider context.

The hardware can have X queues mapped at the same time and every Y time
interval the hardware scheduler checks if those queues have changed;
only if they have changed are the necessary steps to reload them started.

Multiple queues can be rendering at the same time, so you can have X as
a high-priority queue active and just waiting for a signal to start,
the client rendering one frame after another, and a third background
compute task mining bitcoins for you.

As long as everything is static this is perfectly performant. Adding a
queue to the list of active queues is also relatively simple, but taking
one down requires you to wait until we are sure the hardware has seen
the change and reloaded the queues.

Think of it as an RCU grace period. This is simply not something which
is made to be used constantly, but rather just at process termination.

Uh ... that indeed sounds rather broken.


Well I wouldn't call it broken. It's just not made for the use case we 
are trying to abuse it for.



Otoh it's just a dma_fence that we'd inject as this unload-fence.


Yeah, exactly that's why it isn't much of a problem for process 
termination or freeing memory.



So by and large everyone should already be able to cope with it taking a
bit longer. So from a design pov I don't see a huge problem, but I
guess you guys won't be happy since it means on amd hw there will be
random unsightly stalls in desktop linux usage.


The "preemption" feature is really called suspend and made just for the case
when we want to put a process to sleep or need to forcefully kill it for
misbehavior or stuff like that. It is not meant to be used in normal
operation.

If we only attach it on ->move then yeah maybe a last resort possibility to
do it this way, but I think in that case we could rather stick with kernel
submissions.

Well this is a hybrid userspace ring + kernel augmented submit mode, so you
can keep dma-fences working. Because the dma-fence stuff won't work with
pure userspace submit, I think that conclusion is rather solid. Once more
even after this long thread here.

When assisted with unload fences, then yes. Problem is that I can't see
how we could currently implement those in a performant way.

Is there really no way to fix fw here? Like if process start/teardown
takes 100ms, that's going to suck no matter what.


As I said, adding the queue is unproblematic and teardown just results in 
a bit more waiting to free things up.


More problematic are overcommit swapping and OOM situations, which need to 
wait for the hw scheduler to come back and tell us that the queue is now 
unmapped.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start
rendering the next one.

Sure, for the true userspace direct submit model. But with that you don't
get dma-fence, which means this gpu will not work for 3d accel on any
current linux desktop.

I'm not sure of that. I've looked a bit into how we could add user
fences to dma_resv objects and that isn't that hard after all.

I think as a proof of concept it's fine, but as an actual solution ...
pls no. Two reasons:
- implicit sync is bad


Well can't disagree with that :) But I think we can't avoid supporting it.


- this doesn't fix anything for explicit sync using dma_fence in terms
of sync_file or drm_syncobj.


Exactly.

Whether we do implicit sync or explicit sync is orthogonal to the problem 
that sync must be made reliable somehow.


So when we sync and time out, the waiter should just continue, but whoever 
failed to signal will be punished.


But since this isn't solved on Windows I don't see how we can solve it 
on Linux either.



So if we go with the route of 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 4, 2021 at 12:53 PM Christian König
 wrote:
>
> On 04.05.21 at 11:47, Daniel Vetter wrote:
> > [SNIP]
> >> Yeah, it just takes too long for the preemption to complete to be really
> >> useful for the feature we are discussing here.
> >>
> >> As I said, when the kernel requests to preempt a queue we can easily expect 
> >> a
> >> timeout of ~100ms until that comes back. For compute that is even in the
> >> multiple seconds range.
> > 100ms for preempting an idle request sounds like broken hw to me. Of
> > course preempting something that actually runs takes a while, that's
> > nothing new. But it's also not the thing we're talking about here. Is this
> > 100ms an actual number from hw for an actual idle ringbuffer?
>
> Well 100ms is just an example of the scheduler granularity. Let me
> explain in a wider context.
>
> The hardware can have X queues mapped at the same time and every Y time
> interval the hardware scheduler checks if those queues have changed;
> only if they have changed are the necessary steps to reload them started.
>
> Multiple queues can be rendering at the same time, so you can have X as
> a high-priority queue active and just waiting for a signal to start,
> the client rendering one frame after another, and a third background
> compute task mining bitcoins for you.
>
> As long as everything is static this is perfectly performant. Adding a
> queue to the list of active queues is also relatively simple, but taking
> one down requires you to wait until we are sure the hardware has seen
> the change and reloaded the queues.
>
> Think of it as an RCU grace period. This is simply not something which
> is made to be used constantly, but rather just at process termination.

Uh ... that indeed sounds rather broken.

Otoh it's just a dma_fence that we'd inject as this unload-fence. So
by and large everyone should already be able to cope with it taking a
bit longer. So from a design pov I don't see a huge problem, but I
guess you guys won't be happy since it means on amd hw there will be
random unsightly stalls in desktop linux usage.

> >> The "preemption" feature is really called suspend and made just for the 
> >> case
> >> when we want to put a process to sleep or need to forcefully kill it for
> >> misbehavior or stuff like that. It is not meant to be used in normal
> >> operation.
> >>
> >> If we only attach it on ->move then yeah maybe a last resort possibility to
> >> do it this way, but I think in that case we could rather stick with kernel
> >> submissions.
> > Well this is a hybrid userspace ring + kernel augmented submit mode, so you
> > can keep dma-fences working. Because the dma-fence stuff won't work with
> > pure userspace submit, I think that conclusion is rather solid. Once more
> > even after this long thread here.
>
> When assisted with unload fences, then yes. Problem is that I can't see
> how we could currently implement those in a performant way.

Is there really no way to fix fw here? Like if process start/teardown
takes 100ms, that's going to suck no matter what.

> >>> Also, if userspace lies to us and keeps pushing crap into the ring
> >>> after it's supposed to be idle: Userspace is already allowed to waste
> >>> gpu time. If you're too worried about this set a fairly aggressive
> >>> preempt timeout on the unload fence, and kill the context if it takes
> >>> longer than what preempting an idle ring should take (because that
> >>> would indicate broken/evil userspace).
> >> I think you have the wrong expectation here. It is perfectly valid and
> >> expected for userspace to keep writing commands into the ring buffer.
> >>
> >> After all when one frame is completed they want to immediately start
> >> rendering the next one.
> > Sure, for the true userspace direct submit model. But with that you don't
> > get dma-fence, which means this gpu will not work for 3d accel on any
> > current linux desktop.
>
> I'm not sure of that. I've looked a bit into how we could add user
> fences to dma_resv objects and that isn't that hard after all.

I think as a proof of concept it's fine, but as an actual solution ...
pls no. Two reasons:
- implicit sync is bad
- this doesn't fix anything for explicit sync using dma_fence in terms
of sync_file or drm_syncobj.

So if we go with the route of papering over this in the kernel, then
it'll be a ton more work than just hacking something into dma_resv.

> > Which sucks, hence some hybrid model of using the userspace ring and
> > kernel augmented submit is needed. Which was my idea.
>
> Yeah, I think when our firmware folks would really remove the kernel
> queue and we still don't have

Yeah I think the kernel queue can be removed. But the price is that you
need reasonably fast preempt of idle contexts.

I really can't understand how this can take multiple ms, something
feels very broken in the design of the fw (since obviously the hw can
preempt an idle context to another one pretty fast, or you'd render
any multi-client desktop as a 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 11:47, Daniel Vetter wrote:

[SNIP]

Yeah, it just takes too long for the preemption to complete to be really
useful for the feature we are discussing here.

As I said, when the kernel requests to preempt a queue we can easily expect a
timeout of ~100ms until that comes back. For compute that is even in the
multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preempting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms an actual number from hw for an actual idle ringbuffer?


Well 100ms is just an example of the scheduler granularity. Let me 
explain in a wider context.


The hardware can have X queues mapped at the same time and every Y time 
interval the hardware scheduler checks if those queues have changed; 
only if they have changed are the necessary steps to reload them started.


Multiple queues can be rendering at the same time, so you can have X as 
a high-priority queue active and just waiting for a signal to start, 
the client rendering one frame after another, and a third background 
compute task mining bitcoins for you.


As long as everything is static this is perfectly performant. Adding a 
queue to the list of active queues is also relatively simple, but taking 
one down requires you to wait until we are sure the hardware has seen 
the change and reloaded the queues.


Think of it as an RCU grace period. This is simply not something which 
is made to be used constantly, but rather just at process termination.
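
To sketch that grace-period behaviour in C (every name here is made up;
this is not the real amdgpu scheduler interface):

/* Unmap a queue: publish the change, then wait out one full scan
 * interval so the hw scheduler is guaranteed to have seen it. */
static int hws_unmap_queue(struct hw_scheduler *hws, struct user_queue *q)
{
	u64 pass = READ_ONCE(hws->scan_pass);

	list_del(&q->node); /* drop from the list the scheduler polls */

	/* Worst case this blocks for a whole Y interval (the ~100ms
	 * above): fine at process termination, far too slow for
	 * eviction or OOM fast paths. */
	if (!wait_event_timeout(hws->scan_wq,
				READ_ONCE(hws->scan_pass) > pass,
				msecs_to_jiffies(HWS_GRACE_MS)))
		return -ETIMEDOUT;
	return 0;
}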



The "preemption" feature is really called suspend and made just for the case
when we want to put a process to sleep or need to forcefully kill it for
misbehavior or stuff like that. It is not meant to be used in normal
operation.

If we only attach it on ->move then yeah maybe a last resort possibility to
do it this way, but I think in that case we could rather stick with kernel
submissions.

Well this is a hybrid userspace ring + kernel augmented submit mode, so you
can keep dma-fences working. Because the dma-fence stuff won't work with
pure userspace submit, I think that conclusion is rather solid. Once more
even after this long thread here.


When assisted with unload fences, then yes. Problem is that I can't see 
how we could currently implement those in a performant way.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start
rendering the next one.

Sure, for the true userspace direct submit model. But with that you don't
get dma-fence, which means this gpu will not work for 3d accel on any
current linux desktop.


I'm not sure of that. I've looked a bit into how we could add user 
fences to dma_resv objects and that isn't that hard after all.



Which sucks, hence some hybrid model of using the userspace ring and
kernel augmented submit is needed. Which was my idea.


Yeah, I think when our firmware folks would really remove the kernel 
queue and we still don't have





[SNIP]
Can't find that off hand either, but see the amdgpu_noretry module option.

It basically tells the hardware if retry page faults should be supported or
not, because this whole TLB shootdown thing when they are supported is
extremely costly.
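
(For reference, that option is set like any other module parameter; the
exact value semantics depend on the kernel version:)

# /etc/modprobe.d/amdgpu.conf
# disable retry page faults (XNACK) and the costly TLB shootdowns
options amdgpu noretry=1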

Hm so synchronous tlb shootdown is a lot more costly when you allow
retrying of page faults?


Partially correct, yes.

See, when you have retry page faults enabled and unmap something you need 
to make sure that everybody who could have potentially translated that 
page and has a TLB entry is either invalidated or waited on until the 
access is completed.


Since every CU could be using a memory location, waiting for accesses to 
complete takes ages compared to the normal invalidation where you just invalidate 
the L1/L2 and are done.


In addition to that, the recovery adds some extra overhead to every memory 
access, so even without a fault you are quite a bit slower if this is 
enabled.



That sounds bad, because for full hmm mode you need to be able to retry
pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
hang your gpu for a long time while it's waiting for the va->pa lookup
response to return. So retrying lookups shouldn't be any different really.

And you also need fairly fast synchronous tlb shootdown for hmm. So if
your hw has a problem with both together that sounds bad.


Completely agree. And since it was my job to validate the implementation 
on Vega10 I was also the first one to 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 11:14:06AM +0200, Christian König wrote:
> On 04.05.21 at 10:27, Daniel Vetter wrote:
> > On Tue, May 4, 2021 at 10:09 AM Christian König
> >  wrote:
> > > On 04.05.21 at 09:32, Daniel Vetter wrote:
> > > > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> > > > > Unfortunately as I pointed out to Daniel as well this won't work 100%
> > > > > reliably either.
> > > > You're claiming this, but there's no clear reason why really, and you
> > > > didn't reply to my last mail on that sub-thread, so I really don't get
> > > > where exactly you're seeing a problem.
> > > Yeah, it's rather hard to explain without pointing out how the hardware
> > > works in detail.
> > > 
> > > > > See the signal on the ring buffer needs to be protected from 
> > > > > manipulation by
> > > > > userspace so that we can guarantee that the hardware really has 
> > > > > finished
> > > > > executing when it fires.
> > > > Nope you don't. Userspace is already allowed to submit all kinds of 
> > > > random
> > > > garbage, the only thing the kernel has to guarantee is:
> > > > - the dma-fence DAG stays a DAG
> > > > - dma-fence completes in finite time
> > > > 
> > > > Everything else is not the kernel's problem, and if userspace mixes 
> > > > stuff
> > > > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > > > already.
> > > > 
> > > > > Protecting memory by immediate page table updates is a good first 
> > > > > step, but
> > > > > unfortunately not sufficient (and we would need to restructure large 
> > > > > parts
> > > > > of the driver to make this happen).
> > > > This is why you need the unload-fence on top, because indeed you can't
> > > > just rely on the fences created from the userspace ring, those are
> > > > unreliable for memory management.
> > > And exactly that's the problem! We can't provide a reliable unload-fence
> > > and the user fences are unreliable for that.
> > > 
> > > I've talked this through at length with our hardware/firmware guy last
> > > Thursday but couldn't find a solution either.
> > > 
> > > We can have a preemption fence for the kernel which says: Hey this queue
> > > was scheduled away; you can touch its hardware descriptor, control
> > > registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> > > But that one is only triggered on preemption and then we have the same
> > > ordering problems once more.
> > > 
> > > Or we can have an end-of-operation fence for userspace which says: Hey
> > > this queue has finished its batch of execution, but this one is
> > > manipulable from userspace in both finishing too early (very very bad for
> > > invalidations and memory management) or finishing too late/never (deadlock
> > > prone but fixable by timeout).
> > > 
> > > What we could do is to use the preemption fence to emulate the unload
> > > fence, e.g. something like:
> > > 1. Preempt the queue in fixed intervals (let's say 100ms).
> > > 2. While preempted check if we have reached the checkpoint in question
> > > by looking at the hardware descriptor.
> > > 3. If we have reached the checkpoint signal the unload fence.
> > > 4. If we haven't reached the checkpoint resume the queue again.
> > > 
> > > The problem is that this might introduce a maximum of 100ms delay before
> > > signaling the unload fence and preempt/resume has such a hefty overhead
> > > that we waste a horrible amount of time on it.
> > So your hw can preempt? That's good enough.
> > 
> > The unload fence is just
> > 1. wait for all dma_fence that are based on the userspace ring. This
> > is unreliable, but we don't care because tdr will make it reliable.
> > And once tdr shot down a context we'll force-unload and thrash it
> > completely, which solves the problem.
> > 2. preempt the context, which /should/ now be stuck waiting for more
> > commands to be stuffed into the ringbuffer. Which means your
> > preemption is hopefully fast enough to not matter. If your hw takes
> > forever to preempt an idle ring, I can't help you :-)
> 
> Yeah, it just takes too long for the preemption to complete to be really
> useful for the feature we are discussing here.
> 
> As I said, when the kernel requests to preempt a queue we can easily expect a
> timeout of ~100ms until that comes back. For compute that is even in the
> multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
> course preempting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
> 100ms an actual number from hw for an actual idle ringbuffer?

> The "preemption" feature is really called suspend and made just for the case
> when we want to put a process to sleep or need to forcefully kill it for
> misbehavior or stuff like that. It is not meant to be used in normal
> operation.
> 
> If we only attach it on ->move then yeah maybe a last resort possibility to
> do it this way, but I think in that case we 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 10:27, Daniel Vetter wrote:

On Tue, May 4, 2021 at 10:09 AM Christian König
 wrote:

On 04.05.21 at 09:32, Daniel Vetter wrote:

On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:

Unfortunately as I pointed out to Daniel as well this won't work 100%
reliably either.

You're claiming this, but there's no clear reason why really, and you
didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.

Yeah, it's rather hard to explain without pointing out how the hardware
works in detail.


See the signal on the ring buffer needs to be protected from manipulation by
userspace so that we can guarantee that the hardware really has finished
executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.


Protecting memory by immediate page table updates is a good first step, but
unfortunately not sufficient (and we would need to restructure large parts
of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.

And exactly that's the problem! We can't provide a reliable unload-fence
and the user fences are unreliable for that.

I've talked this through at length with our hardware/firmware guy last
Thursday but couldn't find a solution either.

We can have a preemption fence for the kernel which says: Hey this queue
was scheduled away; you can touch its hardware descriptor, control
registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
But that one is only triggered on preemption and then we have the same
ordering problems once more.

Or we can have an end-of-operation fence for userspace which says: Hey
this queue has finished its batch of execution, but this one is
manipulable from userspace in both finishing too early (very very bad for
invalidations and memory management) or finishing too late/never (deadlock
prone but fixable by timeout).

What we could do is to use the preemption fence to emulate the unload
fence, e.g. something like:
1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted check if we have reached the checkpoint in question
by looking at the hardware descriptor.
3. If we have reached the checkpoint signal the unload fence.
4. If we haven't reached the checkpoint resume the queue again.

The problem is that this might introduce a maximum of 100ms delay before
signaling the unload fence and preempt/resume has such a hefty overhead
that we waste a horrible amount of time on it.

So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fence that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr shot down a context we'll force-unload and thrash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer. Which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)


Yeah, it just takes too long for the preemption to complete to be really 
useful for the feature we are discussing here.


As I said, when the kernel requests to preempt a queue we can easily 
expect a timeout of ~100ms until that comes back. For compute that is 
even in the multiple seconds range.


The "preemption" feature is really called suspend and made just for the 
case when we want to put a process to sleep or need to forcefully kill 
it for misbehavior or stuff like that. It is not meant to be used in 
normal operation.


If we only attach it on ->move then yeah maybe a last resort possibility 
to do it this way, but I think in that case we could rather stick with 
kernel submissions.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).


I think you have the wrong expectation here. It is perfectly valid and 
expected for userspace to keep writing commands into the ring buffer.


After all when one frame is completed they want to immediately start 
rendering the next one.



Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient performance :-P

Of 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 4, 2021 at 10:09 AM Christian König
 wrote:
>
> On 04.05.21 at 09:32, Daniel Vetter wrote:
> > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> >> Unfortunately as I pointed out to Daniel as well this won't work 100%
> >> reliably either.
> > You're claiming this, but there's no clear reason why really, and you
> > didn't reply to my last mail on that sub-thread, so I really don't get
> > where exactly you're seeing a problem.
>
> Yeah, it's rather hard to explain without pointing out how the hardware
> works in detail.
>
> >> See the signal on the ring buffer needs to be protected from manipulation 
> >> by
> >> userspace so that we can guarantee that the hardware really has finished
> >> executing when it fires.
> > Nope you don't. Userspace is already allowed to submit all kinds of random
> > garbage, the only thing the kernel has to guarantee is:
> > - the dma-fence DAG stays a DAG
> > - dma-fence completes in finite time
> >
> > Everything else is not the kernel's problem, and if userspace mixes stuff
> > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > already.
> >
> >> Protecting memory by immediate page table updates is a good first step, but
> >> unfortunately not sufficient (and we would need to restructure large parts
> >> of the driver to make this happen).
> > This is why you need the unload-fence on top, because indeed you can't
> > just rely on the fences created from the userspace ring, those are
> > unreliable for memory management.
>
> And exactly that's the problem! We can't provide a reliable unload-fence
> and the user fences are unreliable for that.
>
> I've talked this through at length with our hardware/firmware guy last
> Thursday but couldn't find a solution either.
>
> We can have a preemption fence for the kernel which says: Hey this queue
> was scheduled away; you can touch its hardware descriptor, control
> registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> But that one is only triggered on preemption and then we have the same
> ordering problems once more.
>
> Or we can have an end-of-operation fence for userspace which says: Hey
> this queue has finished its batch of execution, but this one is
> manipulable from userspace in both finishing too early (very very bad for
> invalidations and memory management) or finishing too late/never (deadlock
> prone but fixable by timeout).
>
> What we could do is to use the preemption fence to emulate the unload
> fence, e.g. something like:
> 1. Preempt the queue in fixed intervals (let's say 100ms).
> 2. While preempted check if we have reached the checkpoint in question
> by looking at the hardware descriptor.
> 3. If we have reached the checkpoint signal the unload fence.
> 4. If we haven't reached the checkpoint resume the queue again.
>
> The problem is that this might introduce a maximum of 100ms delay before
> signaling the unload fence and preempt/resume has such a hefty overhead
> that we waste a horrible amount of time on it.

So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fence that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr shot down a context we'll force-unload and thrash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer. Which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)
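
As a sketch (dma_fence_wait_timeout is the real API; the gpu_ctx_*
helpers and TDR_TIMEOUT_MS are invented for illustration):

static long unload_ctx(struct gpu_ctx *ctx)
{
	long ret;

	/* 1. Wait for the dma_fences based on the userspace ring. tdr
	 * is the backstop: on timeout the context is killed and
	 * force-unloaded, which makes the wait reliable overall. */
	ret = dma_fence_wait_timeout(ctx->last_ring_fence, false,
				     msecs_to_jiffies(TDR_TIMEOUT_MS));
	if (ret <= 0) /* 0 == timed out, < 0 == error */
		return gpu_ctx_force_unload(ctx);

	/* 2. Preempt the now-idle context; the ring should merely be
	 * waiting for more commands, so this ought to be quick. */
	return gpu_ctx_preempt(ctx);
}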

Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient performance :-P

Of course if you issue the preempt context request before the
userspace fences have finished (or tdr cleaned up the mess) like you
do in your proposal, then it will be ridiculously expensive and/or
won't work. So just don't do that.

> > btw I thought some more, and I think it's probably best if we only attach
> > the unload-fence in the ->move(_notify) callbacks. Kinda like we already
> > do for async copy jobs. So the overall buffer move sequence would be:
> >
> > 1. wait for (untrusted for kernel, but necessary for userspace
> > correctness) fake dma-fence that relies on the userspace ring
> >
> > 2. unload ctx
> >
> > 3. copy buffer
> >
> > Ofc 2&3 would be done async behind a dma_fence.
> >
> >> On older hardware we often had the situation that for reliable invalidation
> >> we need the guarantee that every previous 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 09:32, Daniel Vetter wrote:

On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:

Unfortunately as I pointed out to Daniel as well this won't work 100%
reliably either.

You're claiming this, but there's no clear reason why really, and you
didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.


Yeah, it's rather hard to explain without pointing out how the hardware 
works in detail.



See the signal on the ring buffer needs to be protected from manipulation by
userspace so that we can guarantee that the hardware really has finished
executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.


Protecting memory by immediate page table updates is a good first step, but
unfortunately not sufficient (and we would need to restructure large parts
of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.


And exactly that's the problem! We can't provide a reliable unload-fence 
and the user fences are unreliable for that.


I've talked this through at length with our hardware/firmware guy last 
Thursday but couldn't find a solution either.


We can have a preemption fence for the kernel which says: Hey this queue 
was scheduled away; you can touch its hardware descriptor, control 
registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again. 
But that one is only triggered on preemption and then we have the same 
ordering problems once more.


Or we can have an end-of-operation fence for userspace which says: Hey 
this queue has finished its batch of execution, but this one is 
manipulable from userspace in both finishing too early (very very bad for 
invalidations and memory management) or finishing too late/never (deadlock 
prone but fixable by timeout).


What we could do is to use the preemption fence to emulate the unload 
fence, e.g. something like:

1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted check if we have reached the checkpoint in question 
by looking at the hardware descriptor.

3. If we have reached the checkpoint signal the unload fence.
4. If we haven't reached the checkpoint resume the queue again.

The problem is that this might introduce a maximum of 100ms delay before 
signaling the unload fence and preempt/resume has such a hefty overhead 
that we waste a horrible amount of time on it.
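
As a sketch of that polling loop (all types and hw_queue_* helpers here
are hypothetical; only the dma_fence and workqueue calls are real):

static void unload_poll_work(struct work_struct *work)
{
	struct unload_poller *p = container_of(to_delayed_work(work),
					       struct unload_poller, dwork);

	/* 1. Preempt the queue on a fixed ~100ms cadence. */
	hw_queue_preempt(p->queue);

	/* 2. While preempted, read the checkpoint from the hw descriptor. */
	if (hw_queue_checkpoint(p->queue) >= p->target) {
		/* 3. Checkpoint reached: signal the emulated unload fence. */
		dma_fence_signal(&p->unload_fence);
		return;
	}

	/* 4. Not there yet: resume the queue and poll again later. This
	 * is where the cost bites: up to 100ms extra latency per cycle
	 * plus the hefty preempt/resume overhead. */
	hw_queue_resume(p->queue);
	schedule_delayed_work(&p->dwork, msecs_to_jiffies(100));
}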




btw I thought some more, and I think it's probably best if we only attach
the unload-fence in the ->move(_notify) callbacks. Kinda like we already
do for async copy jobs. So the overall buffer move sequence would be:

1. wait for (untrusted for kernel, but necessary for userspace
correctness) fake dma-fence that relies on the userspace ring

2. unload ctx

3. copy buffer

Ofc 2&3 would be done async behind a dma_fence.


On older hardware we often had the situation that for reliable invalidation
we need the guarantee that every previous operation has finished executing.
It's not so much of a problem when the next operation has already started,
since then we had the opportunity to do things in between the last and the
next operation. Just see cache invalidation and VM switching for example.

If you have gpu page faults you generally have synchronous tlb
invalidation,


Please tell that to our hardware engineers :)

We have two modes of operation, see the whole XNACK on/off discussion on 
the amd-gfx mailing list.



so this also shouldn't be a big problem. Combined with the
unload fence at least. If you don't have synchronous tlb invalidate it
gets a bit more nasty and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.


In addition to that it doesn't really buy us anything, e.g. there is not much
advantage to this. Writing the ring buffer in userspace and then ringing the
doorbell in the kernel has the same overhead as doing everything in the kernel in the
first place.

It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also since you have the hw already designed
for ringbuffer in userspace it would be silly to copy that through the cs
ioctl, that's just overhead.

Also I thought the problem you're having is that all the kernel ringbuf
stuff is going away, so the old cs ioctl won't work anymore for sure?


We still have a bit more time for this. As I learned from our firmware 
engineer 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> Unfortunately as I pointed out to Daniel as well this won't work 100%
> reliably either.

You're claiming this, but there's no clear reason why really, and you
> didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.

> See the signal on the ring buffer needs to be protected from manipulation by
> userspace so that we can guarantee that the hardware really has finished
> executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
> garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.

> Protecting memory by immediate page table updates is a good first step, but
> unfortunately not sufficient (and we would need to restructure large parts
> of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.

btw I thought some more, and I think it's probably best if we only attach
the unload-fence in the ->move(_notify) callbacks. Kinda like we already
do for async copy jobs. So the overall buffer move sequence would be:

1. wait for (untrusted for kernel, but necessary for userspace
correctness) fake dma-fence that relies on the userspace ring

2. unload ctx

3. copy buffer

Ofc 2&3 would be done async behind a dma_fence.
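
A sketch of that sequence as a move callback (names are illustrative,
not the actual amdgpu/ttm code):

static long bo_move(struct gpu_bo *bo, struct gpu_place *dst)
{
	long ret;

	/* 1. Wait for the fake dma-fence built on the userspace ring:
	 * untrusted for the kernel, needed for userspace correctness,
	 * backstopped by tdr. */
	ret = dma_fence_wait(bo->user_ring_fence, false);
	if (ret)
		return ret;

	/* 2&3. Unload the context and copy the buffer, kicked off
	 * asynchronously behind a single dma_fence that gets attached
	 * to the bo as its move fence. */
	return gpu_bo_queue_unload_and_copy(bo, dst);
}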

> On older hardware we often had the situation that for reliable invalidation
> we need the guarantee that every previous operation has finished executing.
> It's not so much of a problem when the next operation has already started,
> since then we had the opportunity to do things in between the last and the
> next operation. Just see cache invalidation and VM switching for example.

If you have gpu page faults you generally have synchronous tlb
invalidation, so this also shouldn't be a big problem. Combined with the
unload fence at least. If you don't have synchronous tlb invalidate it
gets a bit more nasty and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.

> In addition to that it doesn't really buy us anything, e.g. there is not much
> advantage to this. Writing the ring buffer in userspace and then ringing the
> doorbell in the kernel has the same overhead as doing everything in the kernel in the
> first place.

It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also since you have the hw already designed
for ringbuffer in userspace it would be silly to copy that through the cs
ioctl, that's just overhead.

Also I thought the problem you're having is that all the kernel ringbuf
stuff is going away, so the old cs ioctl won't work anymore for sure?

Maybe also pick up that other subthread which ended with my last reply.

Cheers, Daniel


> 
> Christian.
> 
> > On 04.05.21 at 05:11, Marek Olšák wrote:
> > Proposal for a new CS ioctl, kernel pseudo code:
> > 
> > lock(_lock);
> > serial = get_next_serial(dev);
> > add_wait_command(ring, serial - 1);
> > add_exec_cmdbuf(ring, user_cmdbuf);
> > add_signal_command(ring, serial);
> > *ring->doorbell = FIRE;
> > unlock(_lock);
> > 
> > See? Just like userspace submit, but in the kernel without
> > concurrency/preemption. Is this now safe enough for dma_fence?
> > 
> > Marek
> > 
> > On Mon, May 3, 2021 at 4:36 PM Marek Olšák wrote:
> > 
> > What about direct submit from the kernel where the process still
> > has write access to the GPU ring buffer but doesn't use it? I
> > think that solves your preemption example, but leaves a potential
> > backdoor for a process to overwrite the signal commands, which
> > shouldn't be a problem since we are OK with timeouts.
> > 
> > Marek
> > 
> > On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand
> > mailto:ja...@jlekstrand.net>> wrote:
> > 
> > On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
> > mailto:b...@basnieuwenhuizen.nl>> wrote:
> > >
> > > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand
> > mailto:ja...@jlekstrand.net>> wrote:
> > > >
> > > > Sorry for the top-post but there's no good thing to reply
> > to here...
> > > >
> > > > One of the things pointed out to me recently by Daniel
> > Vetter that I
> > > > didn't fully understand before is that dma_buf has a very
> > subtle
> > > > second requirement beyond finite time completion:  Nothing
> > required
> 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König
Unfortunately as I pointed out to Daniel as well this won't work 100% 
reliably either.


See the signal on the ring buffer needs to be protected from manipulation 
by userspace so that we can guarantee that the hardware really has 
finished executing when it fires.


Protecting memory by immediate page table updates is a good first step, 
but unfortunately not sufficient (and we would need to restructure large 
parts of the driver to make this happen).


On older hardware we often had the situation that for reliable 
invalidation we need the guarantee that every previous operation has 
finished executing. It's not so much of a problem when the next 
operation has already started, since then we had the opportunity to do 
things in between the last and the next operation. Just see cache 
invalidation and VM switching for example.


In addition to that it doesn't really buy us anything, e.g. there is not
much advantage to this. Writing the ring buffer in userspace and then
ringing the doorbell in the kernel has the same overhead as doing
everything in the kernel in the first place.


Christian.

On 04.05.21 at 05:11, Marek Olšák wrote:

Proposal for a new CS ioctl, kernel pseudo code:

lock(&global_lock);
serial = get_next_serial(dev);
add_wait_command(ring, serial - 1);
add_exec_cmdbuf(ring, user_cmdbuf);
add_signal_command(ring, serial);
*ring->doorbell = FIRE;
unlock(&global_lock);

See? Just like userspace submit, but in the kernel without 
concurrency/preemption. Is this now safe enough for dma_fence?


Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák wrote:


What about direct submit from the kernel where the process still
has write access to the GPU ring buffer but doesn't use it? I
think that solves your preemption example, but leaves a potential
backdoor for a process to overwrite the signal commands, which
shouldn't be a problem since we are OK with timeouts.

Marek

On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand <ja...@jlekstrand.net> wrote:

On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
>
> On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <ja...@jlekstrand.net> wrote:
> >
> > Sorry for the top-post but there's no good thing to reply
to here...
> >
> > One of the things pointed out to me recently by Daniel
Vetter that I
> > didn't fully understand before is that dma_buf has a very
subtle
> > second requirement beyond finite time completion:  Nothing
required
> > for signaling a dma-fence can allocate memory. Why? 
Because the act
> > of allocating memory may wait on your dma-fence.  This, as
it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals
we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace
involvement with
> > dma-fence by asking the kernel for a next serial and the
kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up. 
There's simply
> > no way for the kernel to trust userspace to not do
ANYTHING which
> > might allocate memory.  I don't even think there's a way
userspace can
> > trust itself there.  It also blows up my plan of moving
the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Honestly the more I look at things I think
userspace-signalable fences
> with a timeout sound like they are a valid solution for
these issues.
> Especially since (as has been mentioned countless times in
this email
> thread) userspace already has a lot of ways to cause
> > timeouts and/or
> GPU hangs through GPU work already.
>
> Adding a timeout on the signaling side of a dma_fence would
ensure:
>
> - The dma_fence signals in finite time
> > - If the timeout case does not allocate memory then memory
allocation
> is not a blocker for signaling.
>
> Of course you lose the full dependency graph and we need to
make sure
> garbage collection of fences works correctly when we have
cycles.
> However, the latter sounds very doable and the first sounds
like it is
> to some extent inevitable.
>
> I feel like I'm missing some requirement here given that we
> immediately went to much more complicated things but can't
find it.
> Thoughts?

Timeouts are sufficient to protect the kernel but they make
the fences
unpredictable and unreliable 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Marek Olšák
Proposal for a new CS ioctl, kernel pseudo code:

lock(&global_lock);
serial = get_next_serial(dev);
add_wait_command(ring, serial - 1);
add_exec_cmdbuf(ring, user_cmdbuf);
add_signal_command(ring, serial);
*ring->doorbell = FIRE;
unlock(&global_lock);

See? Just like userspace submit, but in the kernel without
concurrency/preemption. Is this now safe enough for dma_fence?
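
For reference, the same flow as a commented sketch. The ring helpers are
the placeholders from the pseudo code above; alloc_fence()/publish_fence()
and the fence handling are an illustrative addition, following the rule
raised elsewhere in this thread that a dma_fence must be fully allocated
before the submission is made visible:

static int cs_ioctl(struct gpu_device *dev, void *user_cmdbuf)
{
        struct dma_fence *fence = alloc_fence();  /* allocate while failing is ok */
        u64 serial;

        if (!fence)
                return -ENOMEM;

        lock(&global_lock);                    /* no concurrency, no preemption */
        serial = get_next_serial(dev);
        add_wait_command(ring, serial - 1);    /* order after the previous job */
        add_exec_cmdbuf(ring, user_cmdbuf);    /* run the user's commands */
        add_signal_command(ring, serial);      /* publish completion */
        *ring->doorbell = FIRE;                /* the kernel rings, not userspace */
        publish_fence(dev, serial, fence);     /* nothing may allocate past here */
        unlock(&global_lock);
        return 0;
}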

Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák  wrote:

> What about direct submit from the kernel where the process still has write
> access to the GPU ring buffer but doesn't use it? I think that solves your
> preemption example, but leaves a potential backdoor for a process to
> overwrite the signal commands, which shouldn't be a problem since we are OK
> with timeouts.
>
> Marek
>
> On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand 
> wrote:
>
>> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>>  wrote:
>> >
>> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand 
>> wrote:
>> > >
>> > > Sorry for the top-post but there's no good thing to reply to here...
>> > >
>> > > One of the things pointed out to me recently by Daniel Vetter that I
>> > > didn't fully understand before is that dma_buf has a very subtle
>> > > second requirement beyond finite time completion:  Nothing required
>> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
>> > > of allocating memory may wait on your dma-fence.  This, as it turns
>> > > out, is a massively more strict requirement than finite time
>> > > completion and, I think, throws out all of the proposals we have so
>> > > far.
>> > >
>> > > Take, for instance, Marek's proposal for userspace involvement with
>> > > dma-fence by asking the kernel for a next serial and the kernel
>> > > trusting userspace to signal it.  That doesn't work at all if
>> > > allocating memory to trigger a dma-fence can blow up.  There's simply
>> > > no way for the kernel to trust userspace to not do ANYTHING which
>> > > might allocate memory.  I don't even think there's a way userspace can
>> > > trust itself there.  It also blows up my plan of moving the fences to
>> > > transition boundaries.
>> > >
>> > > Not sure where that leaves us.
>> >
>> > Honestly the more I look at things I think userspace-signalable fences
>> > with a timeout sound like they are a valid solution for these issues.
>> > Especially since (as has been mentioned countless times in this email
>> > thread) userspace already has a lot of ways to cause timeouts and/or
>> > GPU hangs through GPU work already.
>> >
>> > Adding a timeout on the signaling side of a dma_fence would ensure:
>> >
>> > - The dma_fence signals in finite time
>> > - If the timeout case does not allocate memory then memory allocation
>> > is not a blocker for signaling.
>> >
>> > Of course you lose the full dependency graph and we need to make sure
>> > garbage collection of fences works correctly when we have cycles.
>> > However, the latter sounds very doable and the first sounds like it is
>> > to some extent inevitable.
>> >
>> > I feel like I'm missing some requirement here given that we
>> > immediately went to much more complicated things but can't find it.
>> > Thoughts?
>>
>> Timeouts are sufficient to protect the kernel but they make the fences
>> unpredictable and unreliable from a userspace PoV.  One of the big
>> problems we face is that, once we expose a dma_fence to userspace,
>> we've allowed for some pretty crazy potential dependencies that
>> neither userspace nor the kernel can sort out.  Say you have marek's
>> "next serial, please" proposal and a multi-threaded application.
>> Between the time you ask the kernel for a serial and get a dma_fence
>> and submit the work to signal that serial, your process may get
>> preempted, something else shoved in which allocates memory, and then
>> we end up blocking on that dma_fence.  There's no way userspace can
>> predict and defend itself from that.
>>
>> So I think where that leaves us is that there is no safe place to
>> create a dma_fence except for inside the ioctl which submits the work
>> and only after any necessary memory has been allocated.  That's a
>> pretty stiff requirement.  We may still be able to interact with
>> userspace a bit more explicitly but I think it throws any notion of
>> userspace direct submit out the window.
>>
>> --Jason
>>
>>
>> > - Bas
>> > >
>> > > --Jason
>> > >
>> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher 
>> wrote:
>> > > >
>> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák 
>> wrote:
>> > > > >
>> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer 
>> wrote:
>> > > > >>
>> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
>> > > > >> > Hi Dave,
>> > > > >> >
>> > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> > > > >> >> Supporting interop with any device is always possible. It
>> depends on which drivers we need to interoperate with and update them.
>> We've already found the path forward for amdgpu. We just need to find out
>> how many other 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Marek Olšák
What about direct submit from the kernel where the process still has write
access to the GPU ring buffer but doesn't use it? I think that solves your
preemption example, but leaves a potential backdoor for a process to
overwrite the signal commands, which shouldn't be a problem since we are OK
with timeouts.

Marek

On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand  wrote:

> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>  wrote:
> >
> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand 
> wrote:
> > >
> > > Sorry for the top-post but there's no good thing to reply to here...
> > >
> > > One of the things pointed out to me recently by Daniel Vetter that I
> > > didn't fully understand before is that dma_buf has a very subtle
> > > second requirement beyond finite time completion:  Nothing required
> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > > of allocating memory may wait on your dma-fence.  This, as it turns
> > > out, is a massively more strict requirement than finite time
> > > completion and, I think, throws out all of the proposals we have so
> > > far.
> > >
> > > Take, for instance, Marek's proposal for userspace involvement with
> > > dma-fence by asking the kernel for a next serial and the kernel
> > > trusting userspace to signal it.  That doesn't work at all if
> > > allocating memory to trigger a dma-fence can blow up.  There's simply
> > > no way for the kernel to trust userspace to not do ANYTHING which
> > > might allocate memory.  I don't even think there's a way userspace can
> > > trust itself there.  It also blows up my plan of moving the fences to
> > > transition boundaries.
> > >
> > > Not sure where that leaves us.
> >
> > Honestly the more I look at things I think userspace-signalable fences
> > with a timeout sound like they are a valid solution for these issues.
> > Especially since (as has been mentioned countless times in this email
> > thread) userspace already has a lot of ways to cause timeouts and/or
> > GPU hangs through GPU work already.
> >
> > Adding a timeout on the signaling side of a dma_fence would ensure:
> >
> > - The dma_fence signals in finite time
> > - If the timeout case does not allocate memory then memory allocation
> > is not a blocker for signaling.
> >
> > Of course you lose the full dependency graph and we need to make sure
> > garbage collection of fences works correctly when we have cycles.
> > However, the latter sounds very doable and the first sounds like it is
> > to some extent inevitable.
> >
> > I feel like I'm missing some requirement here given that we
> > immediately went to much more complicated things but can't find it.
> > Thoughts?
>
> Timeouts are sufficient to protect the kernel but they make the fences
> unpredictable and unreliable from a userspace PoV.  One of the big
> problems we face is that, once we expose a dma_fence to userspace,
> we've allowed for some pretty crazy potential dependencies that
> neither userspace nor the kernel can sort out.  Say you have marek's
> "next serial, please" proposal and a multi-threaded application.
> Between the time you ask the kernel for a serial and get a dma_fence
> and submit the work to signal that serial, your process may get
> preempted, something else shoved in which allocates memory, and then
> we end up blocking on that dma_fence.  There's no way userspace can
> predict and defend itself from that.
>
> So I think where that leaves us is that there is no safe place to
> create a dma_fence except for inside the ioctl which submits the work
> and only after any necessary memory has been allocated.  That's a
> pretty stiff requirement.  We may still be able to interact with
> userspace a bit more explicitly but I think it throws any notion of
> userspace direct submit out the window.
>
> --Jason
>
>
> > - Bas
> > >
> > > --Jason
> > >
> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher 
> wrote:
> > > >
> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer 
> wrote:
> > > > >>
> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > > >> > Hi Dave,
> > > > >> >
> > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > > >> >> Supporting interop with any device is always possible. It
> depends on which drivers we need to interoperate with and update them.
> We've already found the path forward for amdgpu. We just need to find out
> how many other drivers need to be updated and evaluate the cost/benefit
> aspect.
> > > > >> >>
> > > > >> >> Marek
> > > > >> >>
> > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
> > > > >> >>
> > > > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> > > > >> >> <ckoenig.leichtzumer...@gmail.com> wrote:
> > > > >> >> >
> > > > >> >> > Correct, we wouldn't have synchronization between devices
> with and without user queues any more.
> > > > >> >> >
> > > > >> >> > That could 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Jason Ekstrand
On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
 wrote:
>
> On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand  wrote:
> >
> > Sorry for the top-post but there's no good thing to reply to here...
> >
> > One of the things pointed out to me recently by Daniel Vetter that I
> > didn't fully understand before is that dma_buf has a very subtle
> > second requirement beyond finite time completion:  Nothing required
> > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > of allocating memory may wait on your dma-fence.  This, as it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace involvement with
> > dma-fence by asking the kernel for a next serial and the kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up.  There's simply
> > no way for the kernel to trust userspace to not do ANYTHING which
> > might allocate memory.  I don't even think there's a way userspace can
> > trust itself there.  It also blows up my plan of moving the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Honestly the more I look at things I think userspace-signalable fences
> with a timeout sound like they are a valid solution for these issues.
> Especially since (as has been mentioned countless times in this email
> thread) userspace already has a lot of ways to cause timeouts and/or
> GPU hangs through GPU work already.
>
> Adding a timeout on the signaling side of a dma_fence would ensure:
>
> - The dma_fence signals in finite time
> - If the timeout case does not allocate memory then memory allocation
> is not a blocker for signaling.
>
> Of course you lose the full dependency graph and we need to make sure
> garbage collection of fences works correctly when we have cycles.
> However, the latter sounds very doable and the first sounds like it is
> to some extent inevitable.
>
> I feel like I'm missing some requirement here given that we
> immediately went to much more complicated things but can't find it.
> Thoughts?

Timeouts are sufficient to protect the kernel but they make the fences
unpredictable and unreliable from a userspace PoV.  One of the big
problems we face is that, once we expose a dma_fence to userspace,
we've allowed for some pretty crazy potential dependencies that
neither userspace nor the kernel can sort out.  Say you have marek's
"next serial, please" proposal and a multi-threaded application.
Between the time you ask the kernel for a serial and get a dma_fence
and submit the work to signal that serial, your process may get
preempted, something else shoved in which allocates memory, and then
we end up blocking on that dma_fence.  There's no way userspace can
predict and defend itself from that.

So I think where that leaves us is that there is no safe place to
create a dma_fence except for inside the ioctl which submits the work
and only after any necessary memory has been allocated.  That's a
pretty stiff requirement.  We may still be able to interact with
userspace a bit more explicitly but I think it throws any notion of
userspace direct submit out the window.
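
A sketch of the only pattern that survives this constraint; the struct and
the pin_and_map_buffers()/queue_to_hw() helpers are hypothetical, while
dma_fence_init() and the error-handling shape are the usual kernel idiom:

struct hw_fence {
        struct dma_fence base;
        /* driver-private bookkeeping would live here */
};

static int submit_ioctl(struct gpu_ctx *ctx, struct submit_args *args)
{
        struct hw_fence *f;
        int ret;

        /* Do every allocation up front, while we can still fail cleanly. */
        f = kzalloc(sizeof(*f), GFP_KERNEL);
        if (!f)
                return -ENOMEM;

        ret = pin_and_map_buffers(ctx, args);      /* may allocate, may fail */
        if (ret) {
                kfree(f);
                return ret;
        }

        /* Only now create and publish the fence; nothing below this point
         * is allowed to allocate memory or to fail. */
        dma_fence_init(&f->base, &hw_fence_ops, &ctx->fence_lock,
                       ctx->fence_context, ++ctx->seqno);
        queue_to_hw(ctx, args, f);
        return 0;
}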

--Jason


> - Bas
> >
> > --Jason
> >
> > On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
> > >
> > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  
> > > > wrote:
> > > >>
> > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > >> > Hi Dave,
> > > >> >
> > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > >> >> Supporting interop with any device is always possible. It depends 
> > > >> >> on which drivers we need to interoperate with and update them. 
> > > >> >> We've already found the path forward for amdgpu. We just need to 
> > > >> >> find out how many other drivers need to be updated and evaluate the 
> > > >> >> cost/benefit aspect.
> > > >> >>
> > > >> >> Marek
> > > >> >>
> > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
> > > >> >>
> > > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> > > >> >> <ckoenig.leichtzumer...@gmail.com> wrote:
> > > >> >> >
> > > >> >> > Correct, we wouldn't have synchronization between devices with
> > > >> >> and without user queues any more.
> > > >> >> >
> > > >> >> > That could only be a problem for A+I Laptops.
> > > >> >>
> > > >> >> Since I think you mentioned you'd only be enabling this on newer
> > > >> >> chipsets, won't it be a problem for A+A where one A is a 
> > > >> >> generation
> > > >> >> behind the other?
> > > >> >>
> > > >> >
> > > >> > Crap, that is a good point as well.
> > > >> >
> > > >> >>
> > > >> >> I'm not really liking where this is going btw, seems like an ill
> > > >> >> thought out concept, if AMD 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Bas Nieuwenhuizen
On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand  wrote:
>
> Sorry for the top-post but there's no good thing to reply to here...
>
> One of the things pointed out to me recently by Daniel Vetter that I
> didn't fully understand before is that dma_buf has a very subtle
> second requirement beyond finite time completion:  Nothing required
> for signaling a dma-fence can allocate memory.  Why?  Because the act
> of allocating memory may wait on your dma-fence.  This, as it turns
> out, is a massively more strict requirement than finite time
> completion and, I think, throws out all of the proposals we have so
> far.
>
> Take, for instance, Marek's proposal for userspace involvement with
> dma-fence by asking the kernel for a next serial and the kernel
> trusting userspace to signal it.  That doesn't work at all if
> allocating memory to trigger a dma-fence can blow up.  There's simply
> no way for the kernel to trust userspace to not do ANYTHING which
> might allocate memory.  I don't even think there's a way userspace can
> trust itself there.  It also blows up my plan of moving the fences to
> transition boundaries.
>
> Not sure where that leaves us.

Honestly the more I look at things I think userspace-signalable fences
with a timeout sound like they are a valid solution for these issues.
Especially since (as has been mentioned countless times in this email
thread) userspace already has a lot of ways to cause timeouts and/or
GPU hangs through GPU work already.

Adding a timeout on the signaling side of a dma_fence would ensure:

- The dma_fence signals in finite time
- If the timeout case does not allocate memory then memory allocation
is not a blocker for signaling (see the sketch right after this list).
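
A minimal sketch of such a fence, assuming the timeout is armed when the
fence is created; timeout_fence and its helpers are made-up names, the
timer and dma_fence calls are the real kernel APIs:

struct timeout_fence {
        struct dma_fence base;
        struct timer_list timer;
};

static void timeout_fence_expired(struct timer_list *t)
{
        struct timeout_fence *f = from_timer(f, t, timer);

        /* The timeout path allocates nothing, so it can always make
         * forward progress. */
        dma_fence_set_error(&f->base, -ETIMEDOUT);
        dma_fence_signal(&f->base);
}

static void timeout_fence_arm(struct timeout_fence *f, unsigned int ms)
{
        timer_setup(&f->timer, timeout_fence_expired, 0);
        mod_timer(&f->timer, jiffies + msecs_to_jiffies(ms));
}

If userspace signals in time, the normal path would simply del_timer()
before signaling.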

Of course you lose the full dependency graph and we need to make sure
garbage collection of fences works correctly when we have cycles.
However, the latter sounds very doable and the first sounds like it is
to some extent inevitable.

I feel like I'm missing some requirement here given that we
immediately went to much more complicated things but can't find it.
Thoughts?

- Bas
>
> --Jason
>
> On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
> >
> > On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> > >
> > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
> > >>
> > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > >> > Hi Dave,
> > >> >
> > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > >> >> Supporting interop with any device is always possible. It depends on 
> > >> >> which drivers we need to interoperate with and update them. We've 
> > >> >> already found the path forward for amdgpu. We just need to find out 
> > >> >> how many other drivers need to be updated and evaluate the 
> > >> >> cost/benefit aspect.
> > >> >>
> > >> >> Marek
> > >> >>
> > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
> > >> >>
> > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> > >> >> <ckoenig.leichtzumer...@gmail.com> wrote:
> > >> >> >
> > >> >> > Correct, we wouldn't have synchronization between devices with
> > >> >> and without user queues any more.
> > >> >> >
> > >> >> > That could only be a problem for A+I Laptops.
> > >> >>
> > >> >> Since I think you mentioned you'd only be enabling this on newer
> > >> >> chipsets, won't it be a problem for A+A where one A is a 
> > >> >> generation
> > >> >> behind the other?
> > >> >>
> > >> >
> > >> > Crap, that is a good point as well.
> > >> >
> > >> >>
> > >> >> I'm not really liking where this is going btw, seems like an ill
> > >> >> thought out concept, if AMD is really going down the road of 
> > >> >> designing
> > >> >> hw that is currently Linux incompatible, you are going to have to
> > >> >> accept a big part of the burden in bringing this support in to 
> > >> >> more
> > >> >> than just amd drivers for upcoming generations of gpu.
> > >> >>
> > >> >
> > >> > Well we don't really like that either, but we have no other option as 
> > >> > far as I can see.
> > >>
> > >> I don't really understand what "future hw may remove support for kernel 
> > >> queues" means exactly. While the per-context queues can be mapped to 
> > >> userspace directly, they don't *have* to be, do they? I.e. the kernel 
> > >> driver should be able to either intercept userspace access to the 
> > >> queues, or in the worst case do it all itself, and provide the existing 
> > >> synchronization semantics as needed?
> > >>
> > >> Surely there are resource limits for the per-context queues, so the 
> > >> kernel driver needs to do some kind of virtualization / multi-plexing 
> > >> anyway, or we'll get sad user faces when there's no queue available for 
> > >> .
> > >>
> > >> I'm probably missing something though, awaiting enlightenment. :)
> > >
> > >
> > > The hw interface for userspace is that the ring buffer is mapped to the 
> > > process address space alongside a doorbell aperture 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Jason Ekstrand
On Mon, May 3, 2021 at 10:03 AM Christian König
 wrote:
>
> On 03.05.21 at 16:59, Jason Ekstrand wrote:
> > Sorry for the top-post but there's no good thing to reply to here...
> >
> > One of the things pointed out to me recently by Daniel Vetter that I
> > didn't fully understand before is that dma_buf has a very subtle
> > second requirement beyond finite time completion:  Nothing required
> > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > of allocating memory may wait on your dma-fence.  This, as it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace involvement with
> > dma-fence by asking the kernel for a next serial and the kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up.  There's simply
> > no way for the kernel to trust userspace to not do ANYTHING which
> > might allocate memory.  I don't even think there's a way userspace can
> > trust itself there.  It also blows up my plan of moving the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Well at least I was perfectly aware of that :)

I'd have been a bit disappointed if this had been news to you. :-P
However, there are a number of us plebeians on the thread who need
things spelled out sometimes. :-)

> I'm currently experimenting with some sample code which would allow
> implicit sync with user fences.
>
> Not that I'm pushing hard into that directly, but I just want to make
> clear how simple or complex the whole thing would be.

I'd like to see that.  It'd be good to know what our options are.
Honestly, if we can get implicit sync somehow without tying our hands
w.r.t. how fences work in modern drivers, that opens a lot of
doors.

--Jason

> Christian.
>
> >
> > --Jason
> >
> > On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
> >> On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> >>> On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
>  On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> >
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on 
> >> which drivers we need to interoperate with and update them. We've 
> >> already found the path forward for amdgpu. We just need to find out 
> >> how many other drivers need to be updated and evaluate the 
> >> cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
> >>
> >>  On Tue, 27 Apr 2021 at 22:06, Christian König
> >>  <ckoenig.leichtzumer...@gmail.com> wrote:
> >>  >
> >>  > Correct, we wouldn't have synchronization between devices with
> >> and without user queues any more.
> >>  >
> >>  > That could only be a problem for A+I Laptops.
> >>
> >>  Since I think you mentioned you'd only be enabling this on newer
> >>  chipsets, won't it be a problem for A+A where one A is a 
> >> generation
> >>  behind the other?
> >>
> > Crap, that is a good point as well.
> >
> >>  I'm not really liking where this is going btw, seems like an ill
> >>  thought out concept, if AMD is really going down the road of 
> >> designing
> >>  hw that is currently Linux incompatible, you are going to have to
> >>  accept a big part of the burden in bringing this support in to 
> >> more
> >>  than just amd drivers for upcoming generations of gpu.
> >>
> > Well we don't really like that either, but we have no other option as 
> > far as I can see.
>  I don't really understand what "future hw may remove support for kernel 
>  queues" means exactly. While the per-context queues can be mapped to 
>  userspace directly, they don't *have* to be, do they? I.e. the kernel 
>  driver should be able to either intercept userspace access to the 
>  queues, or in the worst case do it all itself, and provide the existing 
>  synchronization semantics as needed?
> 
>  Surely there are resource limits for the per-context queues, so the 
>  kernel driver needs to do some kind of virtualization / multi-plexing 
>  anyway, or we'll get sad user faces when there's no queue available for 
>  .
> 
>  I'm probably missing something though, awaiting enlightenment. :)
> >>>
> >>> The hw interface for userspace is that the ring buffer is mapped to the 
> >>> process address space alongside a doorbell aperture (4K page) that isn't 
> >>> real memory, but when the CPU writes into it, it tells the hw scheduler 
> >>> that there are new GPU commands in the ring buffer. Userspace inserts all 
> >>> the wait, 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Christian König

On 03.05.21 at 16:59, Jason Ekstrand wrote:

Sorry for the top-post but there's no good thing to reply to here...

One of the things pointed out to me recently by Daniel Vetter that I
didn't fully understand before is that dma_buf has a very subtle
second requirement beyond finite time completion:  Nothing required
for signaling a dma-fence can allocate memory.  Why?  Because the act
of allocating memory may wait on your dma-fence.  This, as it turns
out, is a massively more strict requirement than finite time
completion and, I think, throws out all of the proposals we have so
far.

Take, for instance, Marek's proposal for userspace involvement with
dma-fence by asking the kernel for a next serial and the kernel
trusting userspace to signal it.  That doesn't work at all if
allocating memory to trigger a dma-fence can blow up.  There's simply
no way for the kernel to trust userspace to not do ANYTHING which
might allocate memory.  I don't even think there's a way userspace can
trust itself there.  It also blows up my plan of moving the fences to
transition boundaries.

Not sure where that leaves us.


Well at least I was perfectly aware of that :)

I'm currently experimenting with some sample code which would allow 
implicit sync with user fences.


Not that I'm pushing hard into that directly, but I just want to make 
clear how simple or complex the whole thing would be.


Christian.



--Jason

On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:

On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:

On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:

On 2021-04-28 8:59 a.m., Christian König wrote:

Hi Dave,

On 27.04.21 at 21:23, Marek Olšák wrote:

Supporting interop with any device is always possible. It depends on which 
drivers we need to interoperate with and update them. We've already found the 
path forward for amdgpu. We just need to find out how many other drivers need 
to be updated and evaluate the cost/benefit aspect.

Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:

 On Tue, 27 Apr 2021 at 22:06, Christian König <ckoenig.leichtzumer...@gmail.com> wrote:
 >
 > Correct, we wouldn't have synchronization between devices with and
without user queues any more.
 >
 > That could only be a problem for A+I Laptops.

 Since I think you mentioned you'd only be enabling this on newer
 chipsets, won't it be a problem for A+A where one A is a generation
 behind the other?


Crap, that is a good point as well.


 I'm not really liking where this is going btw, seems like an ill
 thought out concept, if AMD is really going down the road of designing
 hw that is currently Linux incompatible, you are going to have to
 accept a big part of the burden in bringing this support in to more
 than just amd drivers for upcoming generations of gpu.


Well we don't really like that either, but we have no other option as far as I 
can see.

I don't really understand what "future hw may remove support for kernel queues" 
means exactly. While the per-context queues can be mapped to userspace directly, they 
don't *have* to be, do they? I.e. the kernel driver should be able to either intercept 
userspace access to the queues, or in the worst case do it all itself, and provide the 
existing synchronization semantics as needed?

Surely there are resource limits for the per-context queues, so the kernel driver 
needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user 
faces when there's no queue available for .

I'm probably missing something though, awaiting enlightenment. :)


The hw interface for userspace is that the ring buffer is mapped to the process address 
space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU 
writes into it, it tells the hw scheduler that there are new GPU commands in the ring 
buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer 
and then "rings" the doorbell. It's my understanding that the ring buffer and 
the doorbell are always mapped in the same GPU address space as the process, which makes 
it very difficult to emulate the current protected ring buffers in the kernel. The VMID 
of the ring buffer is also not changeable.


The doorbell does not have to be mapped into the process's GPU virtual
address space.  The CPU could write to it directly.  Mapping it into
the GPU's virtual address space, however, would allow you to have a device
kick off work rather than the CPU.  E.g., the GPU could kick off
its own work, or multiple devices could kick off work without CPU
involvement.

Alex



The hw scheduler doesn't do any synchronization and it doesn't see any 
dependencies. It only chooses which queue to execute, so it's really just a 
simple queue manager handling the virtualization aspect and not much else.

Marek

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Jason Ekstrand
Sorry for the top-post but there's no good thing to reply to here...

One of the things pointed out to me recently by Daniel Vetter that I
didn't fully understand before is that dma_buf has a very subtle
second requirement beyond finite time completion:  Nothing required
for signaling a dma-fence can allocate memory.  Why?  Because the act
of allocating memory may wait on your dma-fence.  This, as it turns
out, is a massively more strict requirement than finite time
completion and, I think, throws out all of the proposals we have so
far.

Take, for instance, Marek's proposal for userspace involvement with
dma-fence by asking the kernel for a next serial and the kernel
trusting userspace to signal it.  That doesn't work at all if
allocating memory to trigger a dma-fence can blow up.  There's simply
no way for the kernel to trust userspace to not do ANYTHING which
might allocate memory.  I don't even think there's a way userspace can
trust itself there.  It also blows up my plan of moving the fences to
transition boundaries.

Not sure where that leaves us.

--Jason

On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
>
> On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> >
> > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
> >>
> >> On 2021-04-28 8:59 a.m., Christian König wrote:
> >> > Hi Dave,
> >> >
> >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> >> Supporting interop with any device is always possible. It depends on 
> >> >> which drivers we need to interoperate with and update them. We've 
> >> >> already found the path forward for amdgpu. We just need to find out how 
> >> >> many other drivers need to be updated and evaluate the cost/benefit 
> >> >> aspect.
> >> >>
> >> >> Marek
> >> >>
> >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
> >> >>
> >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> >> >> <ckoenig.leichtzumer...@gmail.com> wrote:
> >> >> >
> >> >> > Correct, we wouldn't have synchronization between devices with and
> >> >> without user queues any more.
> >> >> >
> >> >> > That could only be a problem for A+I Laptops.
> >> >>
> >> >> Since I think you mentioned you'd only be enabling this on newer
> >> >> chipsets, won't it be a problem for A+A where one A is a generation
> >> >> behind the other?
> >> >>
> >> >
> >> > Crap, that is a good point as well.
> >> >
> >> >>
> >> >> I'm not really liking where this is going btw, seems like an ill
> >> >> thought out concept, if AMD is really going down the road of 
> >> >> designing
> >> >> hw that is currently Linux incompatible, you are going to have to
> >> >> accept a big part of the burden in bringing this support in to more
> >> >> than just amd drivers for upcoming generations of gpu.
> >> >>
> >> >
> >> > Well we don't really like that either, but we have no other option as 
> >> > far as I can see.
> >>
> >> I don't really understand what "future hw may remove support for kernel 
> >> queues" means exactly. While the per-context queues can be mapped to 
> >> userspace directly, they don't *have* to be, do they? I.e. the kernel 
> >> driver should be able to either intercept userspace access to the queues, 
> >> or in the worst case do it all itself, and provide the existing 
> >> synchronization semantics as needed?
> >>
> >> Surely there are resource limits for the per-context queues, so the kernel 
> >> driver needs to do some kind of virtualization / multi-plexing anyway, or 
> >> we'll get sad user faces when there's no queue available for  >> game>.
> >>
> >> I'm probably missing something though, awaiting enlightenment. :)
> >
> >
> > The hw interface for userspace is that the ring buffer is mapped to the 
> > process address space alongside a doorbell aperture (4K page) that isn't 
> > real memory, but when the CPU writes into it, it tells the hw scheduler 
> > that there are new GPU commands in the ring buffer. Userspace inserts all 
> > the wait, draw, and signal commands into the ring buffer and then "rings" 
> > the doorbell. It's my understanding that the ring buffer and the doorbell 
> > are always mapped in the same GPU address space as the process, which makes 
> > it very difficult to emulate the current protected ring buffers in the 
> > kernel. The VMID of the ring buffer is also not changeable.
> >
>
> The doorbell does not have to be mapped into the process's GPU virtual
> address space.  The CPU could write to it directly.  Mapping it into
> the GPU's virtual address space, however, would allow you to have a device
> kick off work rather than the CPU.  E.g., the GPU could kick off
> its own work, or multiple devices could kick off work without CPU
> involvement.
>
> Alex
>
>
> > The hw scheduler doesn't do any synchronization and it doesn't see any 
> > dependencies. It only chooses which queue to execute, so it's really just a 
> > simple queue manager handling the virtualization 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Alex Deucher
On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
>
> On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
>>
>> On 2021-04-28 8:59 a.m., Christian König wrote:
>> > Hi Dave,
>> >
>> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> >> Supporting interop with any device is always possible. It depends on 
>> >> which drivers we need to interoperate with and update them. We've already 
>> >> found the path forward for amdgpu. We just need to find out how many 
>> >> other drivers need to be updated and evaluate the cost/benefit aspect.
>> >>
>> >> Marek
>> >>
>> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
>> >>
>> >> On Tue, 27 Apr 2021 at 22:06, Christian König
>> >> <ckoenig.leichtzumer...@gmail.com> wrote:
>> >> >
>> >> > Correct, we wouldn't have synchronization between devices with and
>> >> without user queues any more.
>> >> >
>> >> > That could only be a problem for A+I Laptops.
>> >>
>> >> Since I think you mentioned you'd only be enabling this on newer
>> >> chipsets, won't it be a problem for A+A where one A is a generation
>> >> behind the other?
>> >>
>> >
>> > Crap, that is a good point as well.
>> >
>> >>
>> >> I'm not really liking where this is going btw, seems like an ill
>> >> thought out concept, if AMD is really going down the road of designing
>> >> hw that is currently Linux incompatible, you are going to have to
>> >> accept a big part of the burden in bringing this support in to more
>> >> than just amd drivers for upcoming generations of gpu.
>> >>
>> >
>> > Well we don't really like that either, but we have no other option as far 
>> > as I can see.
>>
>> I don't really understand what "future hw may remove support for kernel 
>> queues" means exactly. While the per-context queues can be mapped to 
>> userspace directly, they don't *have* to be, do they? I.e. the kernel driver 
>> should be able to either intercept userspace access to the queues, or in the 
>> worst case do it all itself, and provide the existing synchronization 
>> semantics as needed?
>>
>> Surely there are resource limits for the per-context queues, so the kernel 
>> driver needs to do some kind of virtualization / multi-plexing anyway, or 
>> we'll get sad user faces when there's no queue available for > game>.
>>
>> I'm probably missing something though, awaiting enlightenment. :)
>
>
> The hw interface for userspace is that the ring buffer is mapped to the 
> process address space alongside a doorbell aperture (4K page) that isn't real 
> memory, but when the CPU writes into it, it tells the hw scheduler that there 
> are new GPU commands in the ring buffer. Userspace inserts all the wait, 
> draw, and signal commands into the ring buffer and then "rings" the doorbell. 
> It's my understanding that the ring buffer and the doorbell are always mapped 
> in the same GPU address space as the process, which makes it very difficult 
> to emulate the current protected ring buffers in the kernel. The VMID of the 
> ring buffer is also not changeable.
>

The doorbell does not have to be mapped into the process's GPU virtual
address space.  The CPU could write to it directly.  Mapping it into
the GPU's virtual address space, however, would allow you to have a device
kick off work rather than the CPU.  E.g., the GPU could kick off
its own work, or multiple devices could kick off work without CPU
involvement.
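
For illustration, the CPU path could be as simple as the sketch below;
doorbell_base, queue_id and DOORBELL_STRIDE are assumptions about the
mapping layout, writel() is the real kernel accessor:

/* dev->doorbell_base was ioremap()ed from the doorbell BAR at init. */
static void ring_doorbell_cpu(struct gpu_device *dev, u32 queue_id, u32 wptr)
{
        writel(wptr, dev->doorbell_base + queue_id * DOORBELL_STRIDE);
}

Mapping the same page into a GPU address space instead just means the
store that wakes the scheduler comes from a shader or another device.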

Alex


> The hw scheduler doesn't do any synchronization and it doesn't see any 
> dependencies. It only chooses which queue to execute, so it's really just a 
> simple queue manager handling the virtualization aspect and not much else.
>
> Marek


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-01 Thread Marek Olšák
On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:

> On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> >
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on
> which drivers we need to interoperate with and update them. We've already
> found the path forward for amdgpu. We just need to find out how many other
> drivers need to be updated and evaluate the cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
> >>
> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> >> <ckoenig.leichtzumer...@gmail.com> wrote:
> >> >
> >> > Correct, we wouldn't have synchronization between devices with and
> without user queues any more.
> >> >
> >> > That could only be a problem for A+I Laptops.
> >>
> >> Since I think you mentioned you'd only be enabling this on newer
> >> chipsets, won't it be a problem for A+A where one A is a generation
> >> behind the other?
> >>
> >
> > Crap, that is a good point as well.
> >
> >>
> >> I'm not really liking where this is going btw, seems like an ill
> >> thought out concept, if AMD is really going down the road of
> designing
> >> hw that is currently Linux incompatible, you are going to have to
> >> accept a big part of the burden in bringing this support in to more
> >> than just amd drivers for upcoming generations of gpu.
> >>
> >
> > Well we don't really like that either, but we have no other option as
> far as I can see.
>
> I don't really understand what "future hw may remove support for kernel
> queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the queues,
> or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
>
> Surely there are resource limits for the per-context queues, so the kernel
> driver needs to do some kind of virtualization / multi-plexing anyway, or
> we'll get sad user faces when there's no queue available for  game>.
>
> I'm probably missing something though, awaiting enlightenment. :)
>

The hw interface for userspace is that the ring buffer is mapped to the
process address space alongside a doorbell aperture (4K page) that isn't
real memory, but when the CPU writes into it, it tells the hw scheduler
that there are new GPU commands in the ring buffer. Userspace inserts all
the wait, draw, and signal commands into the ring buffer and then "rings"
the doorbell. It's my understanding that the ring buffer and the doorbell
are always mapped in the same GPU address space as the process, which makes
it very difficult to emulate the current protected ring buffers in the
kernel. The VMID of the ring buffer is also not changeable.
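
Purely as an illustration of that model, a userspace-side sketch; the
struct layout is an assumption, and the packet encoding plus ring
wrap-around handling are omitted:

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct user_queue {
        uint32_t *ring;               /* mapped ring buffer */
        volatile uint64_t *doorbell;  /* mapped doorbell page, not real memory */
        uint64_t wptr;                /* write pointer, in dwords */
        uint64_t ring_mask;           /* ring size in dwords, minus one */
};

/* Submit a packet of n dwords (wrap-around handling omitted). */
static void submit(struct user_queue *q, const uint32_t *pkt, size_t n)
{
        memcpy(&q->ring[q->wptr & q->ring_mask], pkt, n * sizeof(uint32_t));
        q->wptr += n;
        atomic_thread_fence(memory_order_release);  /* ring contents first */
        *q->doorbell = q->wptr;                     /* then wake the hw scheduler */
}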

The hw scheduler doesn't do any synchronization and it doesn't see any
dependencies. It only chooses which queue to execute, so it's really just a
simple queue manager handling the virtualization aspect and not much else.

Marek


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-30 Thread Daniel Stone
Hi,

On Fri, 30 Apr 2021 at 10:35, Daniel Vetter  wrote:
> On Fri, Apr 30, 2021 at 11:08 AM Christian König
>  wrote:
> > This doesn't work in hardware. We at least need to set up a few registers
> > and memory locations from inside the VM which userspace shouldn't have
> > access to when we want the end of batch fence and ring buffer start to
> > be reliable.
>
> The thing is, we don't care whether it's reliable or not. Userspace is
> allowed to lie, not signal, signal the wrong thing, out of order,
> everything.
>
> The design assumes all this is possible.
>
> So unless you can't signal at all from userspace, this works. And for
> the "can't signal at all" it just means something needs to do a cpu
> busy wait and burn down lots of cpu time. I hope that's not your hw
> design :-)

I've been sitting this one out so far because what other-Dan's
proposed seems totally sensible and workable for me, so I'll let him
argue it rather than confuse it.

But - yes. Our threat model does not care about a malicious client
which deliberately submits garbage and then gets the compositor to
display garbage. If that's the attack then you could just emit noise
from your frag shader.

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-30 Thread Daniel Vetter
On Fri, Apr 30, 2021 at 11:08 AM Christian König
 wrote:
>
> On 30.04.21 at 10:58, Daniel Vetter wrote:
> > [SNIP]
> >>> When the user allocates usermode queues, the kernel driver sets up a
> >>> queue descriptor in the kernel which defines the location of the queue
> >>> in memory, what priority it has, what page tables it should use, etc.
> >>> User mode can then start writing commands to its queues.  When they
> >>> are ready for the hardware to start executing them, they ring a
> >>> doorbell which signals the scheduler and it maps the queue descriptors
> >>> to HW queue slots and they start executing.  The user only has access
> >>> to its queues and any buffers it has mapped in its GPU virtual
> >>> address space.  While the queues are scheduled, the user can keep
> >>> submitting work to them and they will keep executing unless they get
> >>> preempted by the scheduler due to oversubscription or a priority call
> >>> or a request from the kernel driver to preempt, etc.
> >> Yeah, works like with our stuff.
> >>
> >> I don't see a problem tbh. It's slightly silly going the detour with the
> >> kernel ioctl, and it's annoying that you still have to use drm/scheduler
> >> to resolve dependencies instead of gpu semaphores and all that. But this
> >> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
> >> use the full power. Just needs a flag or something when setting up the
> >> context.
> >>
> >> And best part is that from hw pov this really is indistinguishable from
> >> the full on userspace submit model.
> >>
> >> The thing where it gets annoying is when you use one of these new cpu
> >> instructions which do direct submit to hw and pass along the pasid id
> >> behind the scenes. That's truly something you can't intercept anymore in
> >> the kernel and fake the legacy dma_fence world.
> >>
> >> But what you're describing here sounds like bog standard stuff, and also
> >> pretty easy to keep working with exactly the current model.
> >>
> >> Ofc we'll want to push forward a more modern model that better suits
> >> modern gpus, but I don't see any hard requirement here from the hw side.
> > Adding a bit more detail on what I have in mind:
> >
> > - memory management works like amdgpu does today, so all buffers are
> > pre-bound to the gpu vm, we keep the entire bo set marked as busy with
> > the bulk lru trick for every command submission.
> >
> > - for the ringbuffer, userspace allocates a suitably sized bo for
> > ringbuffer, ring/tail/seqno and whatever else it needs
> >
> > - userspace then asks the kernel to make that into a hw context, with
> > all the privileges set up. Doorbell will only be mapped into kernel
> > (hw can't tell the difference anyway), but if it happens to also be
> > visible to userspace that's no problem. We assume userspace can ring
> > the doorbell anytime it wants to.
>
> This doesn't work in hardware. We at least need to set up a few registers
> and memory locations from inside the VM which userspace shouldn't have
> access to when we want the end of batch fence and ring buffer start to
> be reliable.

The thing is, we don't care whether it's reliable or not. Userspace is
allowed to lie, not signal, signal the wrong thing, out of order,
everything.

The design assumes all this is possible.

So unless you can't signal at all from userspace, this works. And for
the "can't signal at all" it just means something needs to do a cpu
busy wait and burn down lots of cpu time. I hope that's not your hw
design :-)
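
A sketch of that worst case, with hypothetical fence bookkeeping
(pop_fence_upto() and the structs are made up, the kthread and dma_fence
calls are real): a kernel thread polls the seqno userspace is supposed to
write and signals the dma_fences on its behalf.

static int seqno_poll_thread(void *data)
{
        struct user_ctx *ctx = data;

        while (!kthread_should_stop()) {
                /* seqno lives in memory userspace can scribble over. */
                u64 seqno = READ_ONCE(*ctx->seqno_ptr);
                struct hw_fence *f;

                /* Signal everything the observed seqno has passed. */
                while ((f = pop_fence_upto(ctx, seqno)))
                        dma_fence_signal(&f->base);

                cpu_relax();  /* the "burn down lots of cpu time" part */
        }
        return 0;
}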

> > - we do double memory management: One dma_fence works similar to the
> > amdkfd preempt fence, except it doesn't preempt but does anything
> > required to make the hw context unrunnable and take it out of the hw
> > scheduler entirely. This might involve unmapping the doorbell if
> > userspace has access to it.
> >
> > - but we also do classic end-of-batch fences, so that implicit fencing
> > and all that keeps working. The "make hw ctx unrunnable" fence must
> > also wait for all of these pending submissions to complete.
>
> This together doesn't work from the software side: you can either have
> preemption fences or end-of-batch fences but never both, or your
> end-of-batch fences would have another dependency on the preemption
> fences, which we currently can't express in the dma_fence framework.

It's _not_ a preempt fence. It's a ctx unload fence. Not the same
thing. A normal preempt fence would indeed fail.
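
To pin down the difference, a hedged sketch of what the unload fence
could cover (every helper here is hypothetical):

/* Runs from a work item; the unload fence signals when this finishes. */
static void ctx_unload_work(struct hw_ctx *ctx)
{
        unmap_doorbell(ctx);                  /* no new userspace submissions */
        remove_from_hw_scheduler(ctx);        /* ctx becomes unrunnable */
        wait_for_pending_batches(ctx);        /* drain in-flight work */
        dma_fence_signal(ctx->unload_fence);  /* memory can now be taken away */
}

Unlike a preempt fence, the context is not expected to ever run again
without being fully re-created.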

> In addition to that it can't work from the hardware side, because we have
> a separation between engine and scheduler there. So we
> can't reliably get a signal inside the kernel that a batch has completed.
>
> What we could do is to get this signal in userspace, e.g. userspace
> inserts the packets into the ring buffer and then the kernel can read
> the fence value and get the IV.
>
> But this has the same problem as user fences because it requires the
> cooperation of userspace.

Nope. Read the thing again, I'm 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-30 Thread Christian König

On 30.04.21 at 10:58, Daniel Vetter wrote:

[SNIP]

When the user allocates usermode queues, the kernel driver sets up a
queue descriptor in the kernel which defines the location of the queue
in memory, what priority it has, what page tables it should use, etc.
User mode can then start writing commands to its queues.  When they
are ready for the hardware to start executing them, they ring a
doorbell which signals the scheduler and it maps the queue descriptors
to HW queue slots and they start executing.  The user only has access
to its queues and any buffers it has mapped in its GPU virtual
address space.  While the queues are scheduled, the user can keep
submitting work to them and they will keep executing unless they get
preempted by the scheduler due to oversubscription or a priority call
or a request from the kernel driver to preempt, etc.

Yeah, works like with our stuff.

I don't see a problem tbh. It's slightly silly going the detour with the
kernel ioctl, and it's annoying that you still have to use drm/scheduler
to resolve dependencies instead of gpu semaphores and all that. But this
only applies to legacy winsys mode, compute (e.g. vk without winsys) can
use the full power. Just needs a flag or something when setting up the
context.

And best part is that from hw pov this really is indistinguishable from
the full on userspace submit model.

The thing where it gets annoying is when you use one of these new cpu
instructions which do direct submit to hw and pass along the pasid id
behind the scenes. That's truly something you can't intercept anymore in
the kernel and fake the legacy dma_fence world.

But what you're describing here sounds like bog standard stuff, and also
pretty easy to keep working with exactly the current model.

Ofc we'll want to push forward a more modern model that better suits
modern gpus, but I don't see any hard requirement here from the hw side.

Adding a bit more detail on what I have in mind:

- memory management works like amdgpu does today, so all buffers are
pre-bound to the gpu vm, we keep the entire bo set marked as busy with
the bulk lru trick for every command submission.

- for the ringbuffer, userspace allocates a suitably sized bo for
ringbuffer, ring/tail/seqno and whatever else it needs

- userspace then asks the kernel to make that into a hw context, with
all the privileges set up. Doorbell will only be mapped into kernel
(hw can't tell the difference anyway), but if it happens to also be
visible to userspace that's no problem. We assume userspace can ring
the doorbell anytime it wants to.


This doesn't work in hardware. We at least need to set up a few registers
and memory locations from inside the VM which userspace shouldn't have 
access to when we want the end of batch fence and ring buffer start to 
be reliable.



- we do double memory management: One dma_fence works similar to the
amdkfd preempt fence, except it doesn't preempt but does anything
required to make the hw context unrunnable and take it out of the hw
scheduler entirely. This might involve unmapping the doorbell if
userspace has access to it.

- but we also do classic end-of-batch fences, so that implicit fencing
and all that keeps working. The "make hw ctx unrunnable" fence must
also wait for all of these pending submissions to complete.


This together doesn't work from the software side: you can either have
preemption fences or end-of-batch fences but never both, or your
end-of-batch fences would have another dependency on the preemption
fences, which we currently can't express in the dma_fence framework.


In addition to that it can't work from the hardware side, because we have
a separation between engine and scheduler there. So we
can't reliably get a signal inside the kernel that a batch has completed.


What we could do is to get this signal in userspace, e.g. userspace 
inserts the packets into the ring buffer and then the kernel can read 
the fence value and get the IV.


But this has the same problem as user fences because it requires the 
cooperation of userspace.


Just yesterday we had a meeting with the firmware developers to discuss
the possible options, and I now have even stronger doubts that this is
doable.


We either have user queues, where userspace writes the necessary commands
directly to the ring buffer, or we have kernel queues. A mixture of both
is supported by neither the hardware nor the firmware.


Regards,
Christian.



- for the actual end-of-batchbuffer dma_fence it's almost all faked,
but with some checks in the kernel to keep up the guarantees. The cs
flow is roughly:

1. userspace directly writes into the userspace ringbuffer. It needs
to follow the kernel's rules for this if it wants things to work
correctly, but we assume evil userspace is allowed to write whatever
it wants to the ring, and to change it whenever it wants. Userspace
does not update the ring head/tail pointers.

2. cs ioctl just contains: a) head (the thing userspace 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

On 28.04.21 at 15:34, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:

On 28.04.21 at 14:26, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:

On 28.04.21 at 12:05, Daniel Vetter wrote:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how quickly the legacy hardware support
sticks around for each vendor.

Yeah, but hw scheduling doesn't mean the hw has to be built so that the
ringbuffer can't be isolated at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.
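For what it's worth, amdgpu's existing uapi should already be able to
express such a read-only GPU mapping, by leaving out the WRITEABLE flag
when binding a bo into the GPU VM. A minimal sketch (error handling
elided; fd is an open render node, ring_handle a GEM handle for the
ringbuffer bo):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/amdgpu_drm.h>	/* kernel uapi header */

static int map_ring_readonly(int fd, uint32_t ring_handle,
			     uint64_t gpu_va, uint64_t size)
{
	struct drm_amdgpu_gem_va va;

	memset(&va, 0, sizeof(va));
	va.handle = ring_handle;
	va.operation = AMDGPU_VA_OP_MAP;
	va.flags = AMDGPU_VM_PAGE_READABLE;	/* no WRITEABLE: r/o for the gpu */
	va.va_address = gpu_va;
	va.offset_in_bo = 0;
	va.map_size = size;

	return ioctl(fd, DRM_IOCTL_AMDGPU_GEM_VA, &va);
}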

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a
must have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues,
which are essentially the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.

Yeah, allowing userspace to lie about completion fences in this model is
ok. I haven't thought through the full consequences of that, but I think
it's not any worse than userspace lying about which buffers/addresses it
uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing is still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required; it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

Thinking more about that approach, I don't think it will work correctly.

See, we not only need to write the fence as a signal that an IB is
submitted, but also adjust a bunch of privileged hardware registers.

If userspace could do that from its IBs as well, then there would be
nothing blocking it from reprogramming the page table base address, for
example.

We could do those writes with the CPU as well, but that would be a huge
performance drop because of the additional latency.

That's not what I'm suggesting. I'm suggesting you have the queue and
everything in userspace, like on windows. Fences are handled exactly like
on windows too. The difference is:

- All new additions to the ringbuffer are done through a kernel ioctl
   call, using the drm/scheduler to resolve dependencies.

- Memory management is also done like today in that ioctl.

- TDR makes sure that if userspace abuses the contract (which it can, but
   it can do that already today because there's also no command parser to
   e.g. stop gpu semaphores) the entire context is shot and terminally
   killed. Userspace has to then set up a new one. This isn't how amdgpu
   recovery works right now, but i915 supports it and I think it's also the
   better model for userspace error 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

On 28.04.21 at 14:26, Daniel Vetter wrote:

[SNIP]


Thinking more about that approach, I don't think it will work correctly.

See, we not only need to write the fence as a signal that an IB is
submitted, but also adjust a bunch of privileged hardware registers.

If userspace could do that from its IBs as well, then there would be
nothing blocking it from reprogramming the page table base address, for
example.

We could do those writes with the CPU as well, but that would be a huge
performance drop because of the additional latency.


Christian.



So I think even if we have hw with a 100% userspace-submit-only model we
should still be fine. It's ofc silly, because instead of using userspace
fences and the gpu semaphores the hw scheduler understands, we still take
the detour through drm/scheduler, but at least it's not a break-the-world
event.

Also no page fault support, userptr invalidates still stall until
end-of-batch instead of just preempting it, and all that too. But I mean
there needs to be some motivation to fix this and roll out explicit sync
:-)
-Daniel


Or do I miss something here?


For now and for the next hardware we are safe to support the old submission
model, but the functionality of kernel queues will sooner or later go away
if it is only there for Linux.

So we need to work on something that works in the long term and gets us
away from this implicit sync.

Yeah I think we have pretty clear 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Alex Deucher
On Wed, Apr 28, 2021 at 6:31 AM Christian König wrote:
> [SNIP]
> For now and the next hardware we are safe to support the old submission
> model, but the functionality of kernel queues will sooner or later go
> away if it is only for Linux.

Even if it didn't go away completely, no one else would be using it.
That leaves a lot of under-validated execution paths that lead to
subtle bugs.  When everyone else moved to KIQ for queue management, we
stuck with MMIO for a while in Linux, and we ran into tons of subtle
bugs that disappeared once we moved to KIQ.  There were lots of
assumptions about how software would (or wouldn't) use the different
firmware interfaces, which impacted lots of interactions with clock
and power gating, to name a few.  On top of that, you need to use the
scheduler to utilize features like preemption properly.  Also, if you
want to do things like gang scheduling (the UMD scheduling multiple
queues together), that's really hard to do with kernel software
schedulers.

Alex



Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Simon Ser
On Wednesday, April 28th, 2021 at 2:21 PM, Daniel Vetter wrote:

> Yeah I think we have pretty clear consensus on that goal, just no one yet
> volunteered to get going with the winsys/wayland work to plumb drm_syncobj
> through, and the kernel/mesa work to make that optionally a userspace
> fence underneath. And it's for sure a lot of work.

I'm interested in helping with the winsys/wayland bits, assuming the
following:

- We are pretty confident that drm_syncobj won't be superseded by
  something else in the near future. It seems to me like a lot of
  effort has gone into plumbing sync_file stuff all over, and it
  already needs replacing (I mean, it'll keep working, but we have a
  better replacement now. So compositors which have decided to ignore
  explicit sync for all this time won't have to do the work twice.)
- Plumbing drm_syncobj solves the synchronization issues with upcoming
  AMD hardware, and all of this works fine in cross-vendor multi-GPU
  setups.
- Someone is willing to spend a bit of time bearing with me and
  explaining how this all works. (I only know about sync_file for now,
  I'll start reading the Vulkan bits.)

Are these points something we can agree on?

Thanks,

Simon
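For reference, the timeline drm_syncobj primitives this plumbing would
build on are already exposed by libdrm; a minimal sketch (error handling
elided):

#include <stdint.h>
#include <xf86drm.h>

void timeline_example(int fd)
{
	uint32_t syncobj, first;
	uint64_t point = 1;

	drmSyncobjCreate(fd, 0, &syncobj);

	/* Producer side: signal point 1 once a submission completes (a
	 * driver would attach this to a job instead of signaling from
	 * the CPU). */
	drmSyncobjTimelineSignal(fd, &syncobj, &point, 1);

	/* Consumer side: wait for point 1, even if the producer has not
	 * yet materialized a fence for it. */
	drmSyncobjTimelineWait(fd, &syncobj, &point, 1, INT64_MAX,
			       DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
			       &first);

	drmSyncobjDestroy(fd, syncobj);
}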


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > >  wrote:
> > > > > 
> > > > > > > Ok. So that would only make the following use cases broken for 
> > > > > > > now:
> > > > > > > 
> > > > > > > - amd render -> external gpu
> > > > > > > - amd video encode -> network device
> > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > unhappy
> > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > AMD hardware.
> > > > > 
> > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > error, and not bad results on screen because nothing is synchronized
> > > > > anymore.
> > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > start seeing this across all GPU vendors that support windows.  I
> > > > think the timing depends on how quickly the legacy hardware support
> > > > sticks around for each vendor.
> > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > support isolating the ringbuffer at all.
> > > 
> > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > pte flags. Otherwise the entire "share address space with cpu side,
> > > seamlessly" thing is out of the window.
> > > 
> > > And with that r/o bit on the ringbuffer you can once more force submit
> > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > working. And we don't have to invent some horrendous userspace fence based
> > > implicit sync mechanism in the kernel, but can instead do this transition
> > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > 
> > > At least I think you'd have to work extra hard to create a gpu which
> > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > support userspace direct submit only.
> > > 
> > > Or are your hw engineers more creative here and we're screwed?
> > 
> > The upcomming hardware generation will have this hardware scheduler as a
> > must have, but there are certain ways we can still stick to the old
> > approach:
> > 
> > 1. The new hardware scheduler currently still supports kernel queues which
> > essentially is the same as the old hardware ring buffer.
> > 
> > 2. Mapping the top level ring buffer into the VM at least partially solves
> > the problem. This way you can't manipulate the ring buffer content, but the
> > location for the fence must still be writeable.
> 
> Yeah allowing userspace to lie about completion fences in this model is
> ok. Though I haven't thought through full consequences of that, but I
> think it's not any worse than userspace lying about which buffers/address
> it uses in the current model - we rely on hw vm ptes to catch that stuff.
> 
> Also it might be good to switch to a non-recoverable ctx model for these.
> That's already what we do in i915 (opt-in, but all current umd use that
> mode). So any hang/watchdog just kills the entire ctx and you don't have
> to worry about userspace doing something funny with it's ringbuffer.
> Simplifies everything.
> 
> Also ofc userspace fencing still disallowed, but since userspace would
> queu up all writes to its ringbuffer through the drm/scheduler, we'd
> handle dependencies through that still. Not great, but workable.
> 
> Thinking about this, not even mapping the ringbuffer r/o is required, it's
> just that we must queue things throug the kernel to resolve dependencies
> and everything without breaking dma_fence. If userspace lies, tdr will
> shoot it and the kernel stops running that context entirely.
> 
> So I think even if we have hw with 100% userspace submit model only we
> should be still fine. It's ofc silly, because instead of using userspace
> fences and gpu semaphores the hw scheduler understands we still take the
> detour through drm/scheduler, but at least it's not a break-the-world
> event.

Also no page fault support, userptr invalidates still stall until
end-of-batch instead of just preempting it, and all that too. But I mean
there needs to be some motivation to fix this and roll out explicit sync
:-)
-Daniel

> 
> Or do I miss something here?
> 
> > For now and the next hardware we are safe to support the old submission
> > model, but the functionality of kernel queues will sooner or later go away
> > if it is only for Linux.
> > 
> > So we need to work on something which works in the long term and get us away
> > from this implicit sync.
> 
> Yeah I think we have pretty clear consensus on that goal, just no one yet
> volunteered to get going with the winsys/wayland work to plumb drm_syncobj
> through, and the kernel/mesa work to make that optionally a userspace
> fence underneath. And it's for sure a lot of work.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > >  wrote:
> > > > 
> > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > 
> > > > > > - amd render -> external gpu
> > > > > > - amd video encode -> network device
> > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > unhappy
> > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > AMD hardware.
> > > > 
> > > > Note, if this brokenness can't be avoided, I'd prefer to get a clear
> > > > error, and not bad results on screen because nothing is synchronized
> > > > anymore.
> > > It's an upcoming requirement for Windows[1], so you are likely to
> > > start seeing this across all GPU vendors that support Windows.  I
> > > think the timing depends on how long the legacy hardware support
> > > sticks around for each vendor.
> > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > support isolating the ringbuffer at all.
> > 
> > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > pte flags. Otherwise the entire "share address space with cpu side,
> > seamlessly" thing is out of the window.
> > 
> > And with that r/o bit on the ringbuffer you can once more force submit
> > through kernel space, and all the legacy dma_fence based stuff keeps
> > working. And we don't have to invent some horrendous userspace fence based
> > implicit sync mechanism in the kernel, but can instead do this transition
> > properly with drm_syncobj timeline explicit sync and protocol revving.
> > 
> > At least I think you'd have to work extra hard to create a gpu which
> > cannot possibly be intercepted by the kernel, even when it's designed to
> > support userspace direct submit only.
> > 
> > Or are your hw engineers more creative here and we're screwed?
> 
> The upcoming hardware generation will have this hardware scheduler as a
> must-have, but there are certain ways we can still stick to the old
> approach:
> 
> 1. The new hardware scheduler currently still supports kernel queues, which
> are essentially the same as the old hardware ring buffer.
> 
> 2. Mapping the top level ring buffer into the VM at least partially solves
> the problem. This way you can't manipulate the ring buffer content, but the
> location for the fence must still be writeable.

Yeah, allowing userspace to lie about completion fences in this model is
ok. I haven't thought through the full consequences of that, but I
think it's not any worse than userspace lying about which buffers/addresses
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umds use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing is still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

So I think even if we have hw with 100% userspace submit model only we
should still be fine. It's ofc silly, because instead of using userspace
fences and gpu semaphores the hw scheduler understands we still take the
detour through drm/scheduler, but at least it's not a break-the-world
event.

Or do I miss something here?
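
To make the shape of this concrete: a minimal sketch, with entirely invented
structure and function names (this is not amdgpu or i915 code), of what
"queue things through the kernel anyway" could look like on hw that only has
userspace ring submission. A real driver would resolve dependencies
asynchronously via drm/scheduler callbacks rather than blocking in the ioctl:

#include <linux/dma-fence.h>
#include <linux/io.h>
#include <linux/string.h>

/* Hypothetical per-process queue state; all field names are invented. */
struct user_queue {
    void __iomem *doorbell;
    u8 *ring_cpu_addr;
    u32 wptr;
    u32 ring_mask;
};

struct mediated_submit {
    struct dma_fence **in_fences;  /* dependencies resolved in the kernel */
    unsigned int num_in;
    const void *commands;
    size_t size;
};

/* Called from a CS ioctl instead of letting userspace touch the ring. */
static int submit_via_kernel(struct user_queue *q, struct mediated_submit *s)
{
    unsigned int i;
    signed long ret;

    /* 1. Wait for every dma_fence dependency (simplified: blocking). */
    for (i = 0; i < s->num_in; i++) {
        ret = dma_fence_wait(s->in_fences[i], true);
        if (ret < 0)
            return ret;
    }

    /* 2. Only now copy the commands into the ring and ring the doorbell.
     * If userspace lied about its fences, the TDR/watchdog kills the
     * whole (non-recoverable) context.
     */
    memcpy(q->ring_cpu_addr + q->wptr, s->commands, s->size);
    q->wptr = (q->wptr + s->size) & q->ring_mask;
    writel(q->wptr, q->doorbell);
    return 0;
}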

> For now and the next hardware we are safe to support the old submission
> model, but the functionality of kernel queues will sooner or later go away
> if it is only for Linux.
> 
> So we need to work on something which works in the long term and get us away
> from this implicit sync.

Yeah I think we have pretty clear consensus on that goal, just no one yet
volunteered to get going with the winsys/wayland work to plumb drm_syncobj
through, and the kernel/mesa work to make that optionally a userspace
fence underneath. And it's for sure a lot of work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for Windows[1], so you are likely to
start seeing this across all GPU vendors that support Windows.  I
think the timing depends on how long the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol revving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?


The upcoming hardware generation will have this hardware scheduler as a
must-have, but there are certain ways we can still stick to the old
approach:


1. The new hardware scheduler currently still supports kernel queues,
which are essentially the same as the old hardware ring buffer.


2. Mapping the top level ring buffer into the VM at least partially 
solves the problem. This way you can't manipulate the ring buffer 
content, but the location for the fence must still be writeable.


For now and the next hardware we are safe to support the old submission
model, but the functionality of kernel queues will sooner or later go
away if it is only for Linux.


So we need to work on something which works in the long term and get us 
away from this implicit sync.


Christian.


-Daniel




Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> >
> > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> >  wrote:
> >
> > > > Ok. So that would only make the following use cases broken for now:
> > > >
> > > > - amd render -> external gpu
> > > > - amd video encode -> network device
> > >
> > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > unhappy
> >
> > I concur. I have quite a few users with a multi-GPU setup involving
> > AMD hardware.
> >
> > Note, if this brokenness can't be avoided, I'd prefer to get a clear
> > error, and not bad results on screen because nothing is synchronized
> > anymore.
> 
> It's an upcoming requirement for Windows[1], so you are likely to
> start seeing this across all GPU vendors that support Windows.  I
> think the timing depends on how long the legacy hardware support
> sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol revving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 06:27:27PM +, Simon Ser wrote:
> On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher  
> wrote:
> 
> > It's an upcoming requirement for Windows[1], so you are likely to
> > start seeing this across all GPU vendors that support Windows. I
> > think the timing depends on how long the legacy hardware support
> > sticks around for each vendor.
> 
> Hm, okay.
> 
> Will using the existing explicit synchronization APIs make it work
> properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync +
> EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)

If you have hw which really _only_ supports userspace direct submission
(i.e. the ringbuffer has to be in the same gpu vm as everything else by
design, and can't be protected at all with e.g. read-only pte entries)
then all that stuff would be broken.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 11:07:09AM +0200, Michel Dänzer wrote:
> On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> > 
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on which 
> >> drivers we need to interoperate with and update them. We've already found 
> >> the path forward for amdgpu. We just need to find out how many other 
> >> drivers need to be updated and evaluate the cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:
> >>
> >> On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
> >> >
> >> > Correct, we wouldn't have synchronization between devices with and
> >> > without user queues anymore.
> >> >
> >> > That could only be a problem for A+I Laptops.
> >>
> >> Since I think you mentioned you'd only be enabling this on newer
> >> chipsets, won't it be a problem for A+A where one A is a generation
> >> behind the other?
> >>
> > 
> > Crap, that is a good point as well.
> > 
> >>
> >> I'm not really liking where this is going btw, seems like an ill
> >> thought out concept, if AMD is really going down the road of designing
> >> hw that is currently Linux incompatible, you are going to have to
> >> accept a big part of the burden in bringing this support into more
> >> than just amd drivers for upcoming generations of gpu.
> >>
> > 
> > Well we don't really like that either, but we have no other option as far 
> > as I can see.
> 
> I don't really understand what "future hw may remove support for kernel
> queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the
> queues, or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
> 
> Surely there are resource limits for the per-context queues, so the
> kernel driver needs to do some kind of virtualization / multiplexing
> anyway, or we'll get sad user faces when there's no queue available for
> .
> 
> I'm probably missing something though, awaiting enlightenment. :)

Yeah in all this discussion what's unclear to me is, is this a hard amdgpu
requirement going forward, in which case you need a time machine and lots
of people to retroactively fix this because this ain't fast to get fixed.

Or is this just musings for an ecosystem that better fits current
hw, for which I think we all agree where the rough direction is?

The former is quite a glorious situation, and I'm with Dave here that if
your hw engineers really removed the bit to not map the ringbuffers to
userspace, then amd gets to eat a big chunk of the cost here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 08:59:47AM +0200, Christian König wrote:
> Hi Dave,
> 
> Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > Supporting interop with any device is always possible. It depends on
> > which drivers we need to interoperate with and update them. We've
> > already found the path forward for amdgpu. We just need to find out how
> > many other drivers need to be updated and evaluate the cost/benefit
> > aspect.
> > 
> > Marek
> > 
> > On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:
> > 
> > On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
> > >
> > > Correct, we wouldn't have synchronization between devices with
> > > and without user queues anymore.
> > >
> > > That could only be a problem for A+I Laptops.
> > 
> > Since I think you mentioned you'd only be enabling this on newer
> > chipsets, won't it be a problem for A+A where one A is a generation
> > behind the other?
> > 
> 
> Crap, that is a good point as well.
> 
> > 
> > I'm not really liking where this is going btw, seems like an ill
> > thought out concept, if AMD is really going down the road of designing
> > hw that is currently Linux incompatible, you are going to have to
> > accept a big part of the burden in bringing this support into more
> > than just amd drivers for upcoming generations of gpu.
> > 
> 
> Well we don't really like that either, but we have no other option as far as
> I can see.
> 
> I have a couple of ideas for how to handle this in the kernel without
> dma_fences, but they always require more or less changes to all existing
> drivers.

Yeah one horrible idea is to essentially do the plan we hashed out for
adding userspace fences to drm_syncobj timelines. And then add drm_syncobj
as another implicit fencing thing to dma-buf.

But:
- This is horrible. We're all agreeing that implicit sync is not a great
  idea, building an entire new world on this flawed thing doesn't sound
  like a good path forward.

- It's kernel uapi, so it's going to be forever.

- It's only fixing the correctness issue, since you have to stall for
  future/indefinite fences at the beginning of the CS ioctl. Or at the
  beginning of the atomic modeset ioctl, which kinda defeats the point of
  nonblocking.

- You still have to touch all kmd drivers.

- For performance, you still have to glue a submit thread onto all gl
  drivers.

It is horrendous.
-Daniel
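
For reference, the drm_syncobj timeline interface this keeps coming back to
already exists as explicit-sync uapi. From userspace, via real libdrm calls
(error handling elided; the "horrible idea" above would be bolting points
like these onto dma-buf as implicit sync, which the sketch does not show):

#include <stdint.h>
#include <xf86drm.h>

/* Wait for 'point' on a timeline syncobj. WAIT_FOR_SUBMIT is what makes
 * future fences expressible: the wait may begin before the work that
 * will signal this point has even been submitted. */
static int wait_timeline_point(int fd, uint32_t syncobj, uint64_t point)
{
    return drmSyncobjTimelineWait(fd, &syncobj, &point, 1,
                                  INT64_MAX /* timeout_nsec */,
                                  DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                  NULL /* first_signaled */);
}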

> 
> Christian.
> 
> > 
> > Dave.
> > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Michel Dänzer
On 2021-04-28 8:59 a.m., Christian König wrote:
> Hi Dave,
> 
> Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> Supporting interop with any device is always possible. It depends on which 
>> drivers we need to interoperate with and update them. We've already found 
>> the path forward for amdgpu. We just need to find out how many other drivers 
>> need to be updated and evaluate the cost/benefit aspect.
>>
>> Marek
>>
>> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:
>>
>> On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
>> >
>> > Correct, we wouldn't have synchronization between devices with and
>> > without user queues anymore.
>> >
>> > That could only be a problem for A+I Laptops.
>>
>> Since I think you mentioned you'd only be enabling this on newer
>> chipsets, won't it be a problem for A+A where one A is a generation
>> behind the other?
>>
> 
> Crap, that is a good point as well.
> 
>>
>> I'm not really liking where this is going btw, seems like an ill
>> thought out concept, if AMD is really going down the road of designing
>> hw that is currently Linux incompatible, you are going to have to
>> accept a big part of the burden in bringing this support into more
>> than just amd drivers for upcoming generations of gpu.
>>
> 
> Well we don't really like that either, but we have no other option as far as 
> I can see.

I don't really understand what "future hw may remove support for kernel queues" 
means exactly. While the per-context queues can be mapped to userspace 
directly, they don't *have* to be, do they? I.e. the kernel driver should be 
able to either intercept userspace access to the queues, or in the worst case 
do it all itself, and provide the existing synchronization semantics as needed?

Surely there are resource limits for the per-context queues, so the kernel 
driver needs to do some kind of virtualization / multiplexing anyway, or we'll
get sad user faces when there's no queue available for .

I'm probably missing something though, awaiting enlightenment. :)
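
A hedged sketch of the multiplexing alluded to above, with entirely invented
names; the point is only that the expensive step is waiting for the hw
scheduler to acknowledge an unmap, which is why constantly bouncing queues
in and out of the hw slots is slow:

#include <stdbool.h>

/* Invented structures; not from any real driver. */
struct sw_queue {
    bool mapped;   /* currently occupies one of the hw slots */
    int  hw_slot;  /* -1 when not resident */
};

/* Hypothetical hw-scheduler hooks. */
void hw_sched_map(int slot, struct sw_queue *q);
void hw_sched_unmap(int slot);
void hw_sched_wait_idle(int slot);   /* the ~100ms-class wait */

/* Steal 'victim's slot for 'q'. The wait_idle step is the RCU-grace-
 * period-like delay discussed earlier in the thread. */
static int map_queue(struct sw_queue *q, struct sw_queue *victim)
{
    if (victim->mapped) {
        hw_sched_unmap(victim->hw_slot);
        hw_sched_wait_idle(victim->hw_slot);
        victim->mapped = false;
    }
    q->hw_slot = victim->hw_slot;
    hw_sched_map(q->hw_slot, q);
    q->mapped = true;
    return 0;
}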


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Hi Dave,

Am 27.04.21 um 21:23 schrieb Marek Olšák:
Supporting interop with any device is always possible. It depends on 
which drivers we need to interoperate with and update them. We've 
already found the path forward for amdgpu. We just need to find out 
how many other drivers need to be updated and evaluate the 
cost/benefit aspect.


Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:


On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
>
> Correct, we wouldn't have synchronization between devices with
> and without user queues anymore.
>
> That could only be a problem for A+I Laptops.

Since I think you mentioned you'd only be enabling this on newer
chipsets, won't it be a problem for A+A where one A is a generation
behind the other?



Crap, that is a good point as well.



I'm not really liking where this is going btw, seems like an ill
thought out concept, if AMD is really going down the road of designing
hw that is currently Linux incompatible, you are going to have to
accept a big part of the burden in bringing this support into more
than just amd drivers for upcoming generations of gpu.



Well we don't really like that either, but we have no other option as 
far as I can see.


I have a couple of ideas for how to handle this in the kernel without
dma_fences, but they always require more or less changes to all existing
drivers.


Christian.



Dave.





Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
On Wed., Apr. 28, 2021, 00:01 Jason Ekstrand,  wrote:

> On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák  wrote:
> >
> > Jason, both memory-based signalling as well as interrupt-based
> signalling to the CPU would be supported by amdgpu. External devices don't
> need to support memory-based sync objects. The only limitation is that they
> can't convert amdgpu sync objects to dma_fence.
>
> Sure.  I'm not worried about the mechanism.  We just need a word that
> means "the new fence thing" and I've been throwing "memory fence"
> around for that.  Other mechanisms may work as well.
>
> > The sad thing is that "external -> amdgpu" dependencies are really
> "external <-> amdgpu" dependencies due to mutually-exclusive access
> required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
> only interop that would initially work with those buffers. Explicitly
> sync'd buffers also won't work if other drivers convert explicit fences to
> dma_fence. Thus, both implicit sync and explicit sync might not work with
> other drivers at all. The only interop that would initially work is
> explicit fences with memory-based waiting and signalling on the external
> device to keep the kernel out of the picture.
>
> Yup.  This is where things get hard.  That said, I'm not quite ready
> to give up on memory/interrupt fences just yet.
>
> One thought that came to mind which might help would be if we added an
> extremely strict concept of memory ownership.  The idea would be that
> any given BO would be in one of two states at any given time:
>
>  1. legacy: dma_fences and implicit sync works as normal but it cannot
> be resident in any "modern" (direct submission, ULLS, whatever you
> want to call it) context
>
>  2. modern: In this mode they should not be used by any legacy
> context.  We can't strictly prevent this, unfortunately, but maybe we
> can say reading produces garbage and writes may be discarded.  In this
> mode, they can be bound to modern contexts.
>
> In theory, when in "modern" mode, you could bind the same buffer in
> multiple modern contexts at a time.  However, when that's the case, it
> makes ownership really tricky to track.  Therefore, we might want some
> sort of dma-buf create flag for "always modern" vs. "switchable" and
> only allow binding to one modern context at a time when it's
> switchable.
>
> If we did this, we may be able to move any dma_fence shenanigans to
> the ownership transition points.  We'd still need some sort of "wait
> for fence and transition" which has a timeout.  However, then we'd be
> fairly well guaranteed that the application (not just Mesa!) has
> really and truly decided it's done with the buffer and we wouldn't (I
> hope!) end up with the accidental edges in the dependency graph.
>
> Of course, I've not yet proven any of this correct so feel free to
> tell me why it won't work. :-)  It was just one of those "about to go
> to bed and had a thunk" type thoughts.
>

We'd like to keep userspace outside of Mesa drivers intact and working
except for interop where we don't have much choice. At the same time,
future hw may remove support for kernel queues, so we might not have much
choice there either, depending on what the hw interface will look like.

The idea is to have an ioctl for querying a timeline semaphore buffer
associated with a shared BO, and an ioctl for querying the next wait and
signal number (e.g. n and n+1) for that semaphore. Waiting for n would be
like mutex lock and signaling would be like mutex unlock. The next process
would use the same ioctl and get n+1 and n+2, etc. There is a deadlock
condition because one process can do lock A, lock B, and another can do
lock B, lock A, which can be prevented such that the ioctl that returns the
numbers would return them for multiple buffers at once. This solution needs
no changes to userspace outside of Mesa drivers, and we'll also keep the BO
wait ioctl for GPU-CPU sync.
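>
> [A sketch of what such a uapi could look like; nothing like this exists in
> amdgpu today, and every name and number below is invented for illustration:
>
> #include <stdint.h>
> #include <drm.h>   /* from libdrm: DRM_IOWR, DRM_COMMAND_BASE */
>
> /* One entry per shared BO touched by the submission. */
> struct drm_hypothetical_bo_sem {
>     uint32_t bo_handle;
>     uint32_t pad;
>     uint64_t wait_value;    /* out: value to wait for (n)    */
>     uint64_t signal_value;  /* out: value to signal (n + 1)  */
> };
>
> struct drm_hypothetical_bo_sem_query {
>     uint64_t sems;      /* pointer to struct drm_hypothetical_bo_sem[] */
>     uint32_t num_sems;  /* all BOs of the submission, in one call      */
>     uint32_t pad;
> };
>
> /* Handing out the numbers for *all* BOs atomically is the deadlock-
>  * avoidance trick: two processes can never observe the A/B pairs in
>  * opposite orders. The ioctl number 0x5f is made up. */
> #define DRM_IOCTL_HYPOTHETICAL_BO_SEM_QUERY \
>     DRM_IOWR(DRM_COMMAND_BASE + 0x5f, struct drm_hypothetical_bo_sem_query)
> ]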

Marek


> --Jason
>
> P.S.  Daniel was 100% right when he said this discussion needs a glossary.
>
>
> > Marek
> >
> >
> > On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand 
> wrote:
> >>
> >> Trying to figure out which e-mail in this mess is the right one to
> reply to
> >>
> >> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> >> > > Ok. So that would only make the following use cases broken for now:
> >> > > - amd render -> external gpu
> >>
> >> Assuming said external GPU doesn't support memory fences.  If we do
> >> amdgpu and i915 at the same time, that covers basically most of the
> >> external GPU use-cases.  Of course, we'd want to convert nouveau as
> >> well for the rest.
> >>
> >> > > - amd video encode -> network device
> >> >
> >> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> >> > unhappy, as we have some cases where we are combining an AMD APU with
> a
> >> > FPGA based graphics card. I can't go into the specifics of this use-
> >> > case too much but basically the AMD graphics is rendering content that
> >> > gets composited on top of a live video pipeline running through the
> >> > FPGA.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Jason Ekstrand
On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák  wrote:
>
> Jason, both memory-based signalling as well as interrupt-based signalling to 
> the CPU would be supported by amdgpu. External devices don't need to support 
> memory-based sync objects. The only limitation is that they can't convert 
> amdgpu sync objects to dma_fence.

Sure.  I'm not worried about the mechanism.  We just need a word that
means "the new fence thing" and I've been throwing "memory fence"
around for that.  Other mechanisms may work as well.

> The sad thing is that "external -> amdgpu" dependencies are really "external 
> <-> amdgpu" dependencies due to mutually-exclusive access required by 
> non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the only interop 
> that would initially work with those buffers. Explicitly sync'd buffers also 
> won't work if other drivers convert explicit fences to dma_fence. Thus, both 
> implicit sync and explicit sync might not work with other drivers at all. The 
> only interop that would initially work is explicit fences with memory-based 
> waiting and signalling on the external device to keep the kernel out of the 
> picture.

Yup.  This is where things get hard.  That said, I'm not quite ready
to give up on memory/interrupt fences just yet.

One thought that came to mind which might help would be if we added an
extremely strict concept of memory ownership.  The idea would be that
any given BO would be in one of two states at any given time:

 1. legacy: dma_fences and implicit sync works as normal but it cannot
be resident in any "modern" (direct submission, ULLS, whatever you
want to call it) context

 2. modern: In this mode they should not be used by any legacy
context.  We can't strictly prevent this, unfortunately, but maybe we
can say reading produces garbage and writes may be discarded.  In this
mode, they can be bound to modern contexts.

In theory, when in "modern" mode, you could bind the same buffer in
multiple modern contexts at a time.  However, when that's the case, it
makes ownership really tricky to track.  Therefore, we might want some
sort of dma-buf create flag for "always modern" vs. "switchable" and
only allow binding to one modern context at a time when it's
switchable.

If we did this, we may be able to move any dma_fence shenanigans to
the ownership transition points.  We'd still need some sort of "wait
for fence and transition" which has a timeout.  However, then we'd be
fairly well guaranteed that the application (not just Mesa!) has
really and truly decided it's done with the buffer and we wouldn't (I
hope!) end up with the accidental edges in the dependency graph.

Of course, I've not yet proven any of this correct so feel free to
tell me why it won't work. :-)  It was just one of those "about to go
to bed and had a thunk" type thoughts.
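
Sketching the transition point in today's terms: the dma_resv helper below is
real (in a 2021 kernel), but the BO wrapper, the mode enum, and the policy
around them are invented for illustration, and real code would also need the
proper dma_resv locking:

#include <linux/dma-resv.h>
#include <linux/errno.h>

enum bo_sync_mode { BO_LEGACY, BO_MODERN };

/* Invented wrapper around a shared BO. */
struct shared_bo {
    enum bo_sync_mode mode;
    struct dma_resv *resv;   /* legacy-side fences live here */
};

/* Transition legacy -> modern: wait, with a timeout so a stuck fence
 * can't wedge us forever, for *all* legacy fences, then flip the mode.
 * After this, by contract, legacy contexts get garbage reads and
 * discarded writes. */
static int bo_make_modern(struct shared_bo *bo, unsigned long timeout)
{
    long ret;

    if (bo->mode == BO_MODERN)
        return 0;

    ret = dma_resv_wait_timeout_rcu(bo->resv, true /* wait_all */,
                                    true /* intr */, timeout);
    if (ret == 0)
        return -ETIMEDOUT;
    if (ret < 0)
        return ret;

    bo->mode = BO_MODERN;
    return 0;
}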

--Jason

P.S.  Daniel was 100% right when he said this discussion needs a glossary.


> Marek
>
>
> On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand  wrote:
>>
>> Trying to figure out which e-mail in this mess is the right one to reply 
>> to
>>
>> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach  wrote:
>> >
>> > Hi,
>> >
>> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
>> > > Ok. So that would only make the following use cases broken for now:
>> > > - amd render -> external gpu
>>
>> Assuming said external GPU doesn't support memory fences.  If we do
>> amdgpu and i915 at the same time, that covers basically most of the
>> external GPU use-cases.  Of course, we'd want to convert nouveau as
>> well for the rest.
>>
>> > > - amd video encode -> network device
>> >
>> > FWIW, "only" breaking amd render -> external gpu will make us pretty
>> > unhappy, as we have some cases where we are combining an AMD APU with a
>> > FPGA based graphics card. I can't go into the specifics of this use-
>> > case too much but basically the AMD graphics is rendering content that
>> > gets composited on top of a live video pipeline running through the
>> > FPGA.
>>
> I think it's worth taking a step back and asking what's being proposed
> here before we freak out too much.  If we do go this route, it doesn't mean
>> that your FPGA use-case can't work, it just means it won't work
>> out-of-the box anymore.  You'll have to separate execution and memory
>> dependencies inside your FPGA driver.  That's still not great but it's
>> not as bad as you maybe made it sound.
>>
>> > > What about the case when we get a buffer from an external device and
>> > > we're supposed to make it "busy" when we are using it, and the
>> > > external device wants to wait until we stop using it? Is it something
>> > > that can happen, thus turning "external -> amd" into "external <->
>> > > amd"?
>> >
>> > Zero-copy texture sampling from a video input certainly appreciates
>> > this very much. Trying to pass the render fence through the various
>> > layers of userspace to be able to tell when the video input can reuse a
>> > buffer is a great experience in yak shaving. Allowing the video input
>> > to reuse the buffer as soon as the read dma_fence from the GPU is
>> > signaled is much more straightforward.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Jason, both memory-based signalling as well as interrupt-based signalling
to the CPU would be supported by amdgpu. External devices don't need to
support memory-based sync objects. The only limitation is that they can't
convert amdgpu sync objects to dma_fence.

The sad thing is that "external -> amdgpu" dependencies are really
"external <-> amdgpu" dependencies due to mutually-exclusive access
required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
only interop that would initially work with those buffers. Explicitly
sync'd buffers also won't work if other drivers convert explicit fences to
dma_fence. Thus, both implicit sync and explicit sync might not work with
other drivers at all. The only interop that would initially work is
explicit fences with memory-based waiting and signalling on the external
device to keep the kernel out of the picture.

Marek


On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand  wrote:

> Trying to figure out which e-mail in this mess is the right one to reply
> to
>
> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach 
> wrote:
> >
> > Hi,
> >
> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> > > Ok. So that would only make the following use cases broken for now:
> > > - amd render -> external gpu
>
> Assuming said external GPU doesn't support memory fences.  If we do
> amdgpu and i915 at the same time, that covers basically most of the
> external GPU use-cases.  Of course, we'd want to convert nouveau as
> well for the rest.
>
> > > - amd video encode -> network device
> >
> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > unhappy, as we have some cases where we are combining an AMD APU with a
> > FPGA based graphics card. I can't go into the specifics of this use-
> > case too much but basically the AMD graphics is rendering content that
> > gets composited on top of a live video pipeline running through the
> > FPGA.
>
> I think it's worth taking a step back and asking what's being proposed
> here before we freak out too much.  If we do go this route, it doesn't mean
> that your FPGA use-case can't work, it just means it won't work
> out-of-the box anymore.  You'll have to separate execution and memory
> dependencies inside your FPGA driver.  That's still not great but it's
> not as bad as you maybe made it sound.
>
> > > What about the case when we get a buffer from an external device and
> > > we're supposed to make it "busy" when we are using it, and the
> > > external device wants to wait until we stop using it? Is it something
> > > that can happen, thus turning "external -> amd" into "external <->
> > > amd"?
> >
> > Zero-copy texture sampling from a video input certainly appreciates
> > this very much. Trying to pass the render fence through the various
> > layers of userspace to be able to tell when the video input can reuse a
> > buffer is a great experience in yak shaving. Allowing the video input
> > to reuse the buffer as soon as the read dma_fence from the GPU is
> > signaled is much more straight forward.
>
> Oh, it's definitely worse than that.  Every window system interaction
> is bi-directional.  The X server has to wait on the client before
> compositing from it and the client has to wait on X before re-using
> that back-buffer.  Of course, we can break that latter dependency by
> doing a full CPU wait but that's going to mean either more latency or
> reserving more back buffers.  There's no good clean way to claim that
> any of this is one-directional.
>
> --Jason
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Jason Ekstrand
On Tue, Apr 27, 2021 at 1:38 PM Dave Airlie  wrote:
>
> On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
> >
> > Correct, we wouldn't have synchronization between devices with and without
> > user queues anymore.
> >
> > That could only be a problem for A+I Laptops.
>
> Since I think you mentioned you'd only be enabling this on newer
> chipsets, won't it be a problem for A+A where one A is a generation
> behind the other?
>
> I'm not really liking where this is going btw, seems like an ill
> thought out concept, if AMD is really going down the road of designing
> hw that is currently Linux incompatible, you are going to have to
> accept a big part of the burden in bringing this support into more
> than just amd drivers for upcoming generations of gpu.

In case my previous e-mail sounded too enthusiastic, I'm also pensive
about this direction.  I'm not sure I'm ready to totally give up on
all of Linux WSI just yet.  We definitely want to head towards memory
fences and direct submission but I'm not convinced that throwing out
all of interop is necessary.  It's certainly a very big hammer and we
should try to figure out something less destructive, if that's
possible.  (I don't know for sure that it is.)

--Jason


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Jason Ekstrand
Trying to figure out which e-mail in this mess is the right one to reply to

On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach  wrote:
>
> Hi,
>
> Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> > Ok. So that would only make the following use cases broken for now:
> > - amd render -> external gpu

Assuming said external GPU doesn't support memory fences.  If we do
amdgpu and i915 at the same time, that covers basically most of the
external GPU use-cases.  Of course, we'd want to convert nouveau as
well for the rest.

> > - amd video encode -> network device
>
> FWIW, "only" breaking amd render -> external gpu will make us pretty
> unhappy, as we have some cases where we are combining an AMD APU with a
> FPGA based graphics card. I can't go into the specifics of this use-
> case too much but basically the AMD graphics is rendering content that
> gets composited on top of a live video pipeline running through the
> FPGA.

I think it's worth taking a step back and asking what's being proposed
here before we freak out too much.  If we do go this route, it doesn't mean
that your FPGA use-case can't work, it just means it won't work
out-of-the box anymore.  You'll have to separate execution and memory
dependencies inside your FPGA driver.  That's still not great but it's
not as bad as you maybe made it sound.

> > What about the case when we get a buffer from an external device and
> > we're supposed to make it "busy" when we are using it, and the
> > external device wants to wait until we stop using it? Is it something
> > that can happen, thus turning "external -> amd" into "external <->
> > amd"?
>
> Zero-copy texture sampling from a video input certainly appreciates
> this very much. Trying to pass the render fence through the various
> layers of userspace to be able to tell when the video input can reuse a
> buffer is a great experience in yak shaving. Allowing the video input
> to reuse the buffer as soon as the read dma_fence from the GPU is
>> signaled is much more straightforward.

Oh, it's definitely worse than that.  Every window system interaction
is bi-directional.  The X server has to wait on the client before
compositing from it and the client has to wait on X before re-using
that back-buffer.  Of course, we can break that latter dependency by
doing a full CPU wait but that's going to mean either more latency or
reserving more back buffers.  There's no good clean way to claim that
any of this is one-directional.

--Jason


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Supporting interop with any device is always possible. It depends on which
drivers we need to interoperate with and update them. We've already found
the path forward for amdgpu. We just need to find out how many other
drivers need to be updated and evaluate the cost/benefit aspect.

Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  wrote:

> On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
> >
> > Correct, we wouldn't have synchronization between devices with and
> > without user queues anymore.
> >
> > That could only be a problem for A+I Laptops.
>
> Since I think you mentioned you'd only be enabling this on newer
> chipsets, won't it be a problem for A+A where one A is a generation
> behind the other?
>
> I'm not really liking where this is going btw, seems like an ill
> thought out concept, if AMD is really going down the road of designing
> hw that is currently Linux incompatible, you are going to have to
> accept a big part of the burden in bringing this support into more
> than just amd drivers for upcoming generations of gpu.
>
> Dave.
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Dave Airlie
On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
>
> Correct, we wouldn't have synchronization between devices with and without
> user queues anymore.
>
> That could only be a problem for A+I Laptops.

Since I think you mentioned you'd only be enabling this on newer
chipsets, won't it be a problem for A+A where one A is a generation
behind the other?

I'm not really liking where this is going btw, seems like an ill
thought out concept, if AMD is really going down the road of designing
hw that is currently Linux incompatible, you are going to have to
accept a big part of the burden in bringing this support into more
than just amd drivers for upcoming generations of gpu.

Dave.


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Simon Ser
On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher wrote:

> It's an upcoming requirement for Windows[1], so you are likely to
> start seeing this across all GPU vendors that support Windows. I
> think the timing depends on how long the legacy hardware support
> sticks around for each vendor.

Hm, okay.

Will using the existing explicit synchronization APIs make it work
properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync +
EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)
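
In principle, yes; IN_FENCE_FD and OUT_FENCE_PTR are real KMS properties, and
the shape from userspace is roughly the following (real libdrm calls; the
property IDs are assumed to have been looked up already, error handling
elided). Whether this keeps working on userspace-submit-only hw is exactly
the open question in this thread:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* render_done_fd: sync_file from the GPU (e.g. EGL_ANDROID_native_fence).
 * out_fence_fd:   receives a sync_file that signals when the commit has
 *                 completed and the old buffer is free for reuse. */
static int commit_with_fences(int fd, uint32_t plane_id, uint32_t crtc_id,
                              uint32_t in_fence_prop, uint32_t out_fence_prop,
                              int render_done_fd, int *out_fence_fd)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    int ret;

    drmModeAtomicAddProperty(req, plane_id, in_fence_prop, render_done_fd);
    drmModeAtomicAddProperty(req, crtc_id, out_fence_prop,
                             (uint64_t)(uintptr_t)out_fence_fd);
    ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
    drmModeAtomicFree(req);
    return ret;
}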


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Alex Deucher
On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
>
> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach wrote:
>
> > > Ok. So that would only make the following use cases broken for now:
> > >
> > > - amd render -> external gpu
> > > - amd video encode -> network device
> >
> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > unhappy
>
> I concur. I have quite a few users with a multi-GPU setup involving
> AMD hardware.
>
> Note, if this brokenness can't be avoided, I'd prefer to get a clear
> error, and not bad results on screen because nothing is synchronized
> anymore.

It's an upcoming requirement for Windows[1], so you are likely to
start seeing this across all GPU vendors that support Windows.  I
think the timing depends on how long the legacy hardware support
sticks around for each vendor.

Alex


[1] - 
https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/




Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Simon Ser
On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach wrote:

> > Ok. So that would only make the following use cases broken for now:
> >
> > - amd render -> external gpu
> > - amd video encode -> network device
>
> FWIW, "only" breaking amd render -> external gpu will make us pretty
> unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Lucas Stach
Hi,

Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> Ok. So that would only make the following use cases broken for now:
> - amd render -> external gpu
> - amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy, as we have some cases where we are combining an AMD APU with a
FPGA based graphics card. I can't go into the specifics of this use-
case too much but basically the AMD graphics is rendering content that
gets composited on top of a live video pipeline running through the
FPGA.

> What about the case when we get a buffer from an external device and
> we're supposed to make it "busy" when we are using it, and the
> external device wants to wait until we stop using it? Is it something
> that can happen, thus turning "external -> amd" into "external <->
> amd"?

Zero-copy texture sampling from a video input certainly appreciates
this very much. Trying to pass the render fence through the various
layers of userspace to be able to tell when the video input can reuse a
buffer is a great experience in yak shaving. Allowing the video input
to reuse the buffer as soon as the read dma_fence from the GPU is
signaled is much more straightforward.
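
If the GPU's read fence is handed over as a sync_file fd (e.g. via
EGL_ANDROID_native_fence_sync), the capture side really can be this simple;
a minimal sketch:

#include <errno.h>
#include <poll.h>

/* Returns 0 once the fence has signaled (buffer reusable), -ETIMEDOUT
 * on timeout, or -errno on error. sync_file fds report POLLIN when the
 * underlying fence has signaled. */
static int wait_read_fence(int sync_file_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = sync_file_fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret > 0)
        return 0;
    return ret == 0 ? -ETIMEDOUT : -errno;
}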

Regards,
Lucas

> Marek
> 
> On Tue., Apr. 27, 2021, 08:50 Christian König, < 
> ckoenig.leichtzumer...@gmail.com> wrote:
> >  Only amd -> external.
> >  
> >  We can easily install something in a user queue which waits for a
> > dma_fence in the kernel.
> >  
> >  But we can't easily wait for a user queue as a dependency of a
> > dma_fence.
> >  
> >  The good thing is we have this wait before signal case on Vulkan
> > timeline semaphores which have the same problem in the kernel.
> >  
> >  The good news is I think we can relatively easily convert i915 and
> > older amdgpu devices to something which is compatible with user
> > fences.
> >  
> >  So yes, getting that fixed case by case should work.
> >  
> >  Christian
> >  
> > Am 27.04.21 um 14:46 schrieb Marek Olšák:
> >  
> > > I'll defer to Christian and Alex to decide whether dropping sync
> > > with non-amd devices (GPUs, cameras etc.) is acceptable.
> > > 
> > > Rewriting those drivers to this new sync model could be done on a
> > > case by case basis.
> > > 
> > > For now, would we only lose the "amd -> external" dependency? Or
> > > the "external -> amd" dependency too?
> > > 
> > > Marek
> > > 
> > > On Tue., Apr. 27, 2021, 08:15 Daniel Vetter, 
> > > wrote:
> > >  
> > > > On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák 
> > > > wrote:
> > > >  > Ok. I'll interpret this as "yes, it will work, let's do it".
> > > >  
> > > >  It works if all you care about is drm/amdgpu. I'm not sure
> > > > that's a
> > > >  reasonable approach for upstream, but it definitely is an
> > > > approach :-)
> > > >  
> > > >  We've already gone somewhat through the pain of drm/amdgpu
> > > > redefining
> > > >  how implicit sync works without sufficiently talking with
> > > > other
> > > >  people, maybe we should avoid a repeat of this ...
> > > >  -Daniel
> > > >  
> > > >  >
> > > >  > Marek
> > > >  >
> > > >  > On Tue., Apr. 27, 2021, 08:06 Christian König,
> > > >  wrote:
> > > >  >>
> > > >  >> Correct, we wouldn't have synchronization between devices with
> > > >  >> and without user queues anymore.
> > > >  >>
> > > >  >> That could only be a problem for A+I Laptops.
> > > >  >>
> > > >  >> Memory management will just work with preemption fences
> > > > which
> > > > pause the user queues of a process before evicting something.
> > > > That will be a dma_fence, but also a well known approach.
> > > >  >>
> > > >  >> Christian.
> > > >  >>
> > > >  >> Am 27.04.21 um 13:49 schrieb Marek Olšák:
> > > >  >>
> > > >  >> If we don't use future fences for DMA fences at all, e.g.
> > > > we
> > > > don't use them for memory management, it can work, right?
> > > > Memory
> > > > management can suspend user queues anytime. It doesn't need to
> > > > use DMA fences. There might be something that I'm missing here.
> > > >  >>
> > > >  >> What would we lose without DMA fences? Just inter-device
> > > > synchronization? I think that might be acceptable.
> > > >  >>
> > > >  >> The only case when the kernel will wait on a future fence
> > > > is
> > > > before a page flip. Everything today already depends on
> > > > userspace
> > > > not hanging the gpu, which makes everything a future fence.
> > > >  >>
> > > >  >> Marek
> > > >  >>
> > > >  >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,
> > > >  wrote:
> > > >  >>>
> > > >  >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák
> > > > wrote:
> > > >  >>> > Thanks everybody. The initial proposal is dead. Here are
> > > > some thoughts on
> > > >  >>> > how to do it differently.
> > > >  >>> >
> > > >  >>> > I think we can have direct command submission from
> > > > userspace via
> > > >  >>> > memory-mapped queues ("user queues") without changing
> > > > window systems.
> > > >  >>> >
> > > >  >>> > 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König
Uff good question. DMA-buf certainly supports that use case, but I have 
no idea if that is actually used somewhere.


Daniel do you know any case?

Christian.

Am 27.04.21 um 15:26 schrieb Marek Olšák:

Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device

What about the case when we get a buffer from an external device and 
we're supposed to make it "busy" when we are using it, and the 
external device wants to wait until we stop using it? Is it something 
that can happen, thus turning "external -> amd" into "external <-> amd"?


Marek

On Tue., Apr. 27, 2021, 08:50 Christian König wrote:


Only amd -> external.

We can easily install something in a user queue which waits for a
dma_fence in the kernel.

But we can't easily wait for a user queue as a dependency of a
dma_fence.

The good thing is we have this wait before signal case on Vulkan
timeline semaphores which have the same problem in the kernel.

The good news is I think we can relatively easily convert i915 and
older amdgpu devices to something which is compatible with user fences.

So yes, getting that fixed case by case should work.

Christian

Am 27.04.21 um 14:46 schrieb Marek Olšák:

I'll defer to Christian and Alex to decide whether dropping sync
with non-amd devices (GPUs, cameras etc.) is acceptable.

Rewriting those drivers to this new sync model could be done on a
case by case basis.

For now, would we only lose the "amd -> external" dependency? Or
the "external -> amd" dependency too?

Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter wrote:

On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure
that's a
reasonable approach for upstream, but it definitely is an
approach :-)

We've already gone somewhat through the pain of drm/amdgpu
redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König wrote:
>>
>> Correct, we wouldn't have synchronization between devices
>> with and without user queues anymore.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences
which pause the user queues of a process before evicting
something. That will be a dma_fence, but also a well known
approach.
>>
>> Christian.
>>
>> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>>
>> If we don't use future fences for DMA fences at all, e.g.
we don't use them for memory management, it can work, right?
Memory management can suspend user queues anytime. It doesn't
need to use DMA fences. There might be something that I'm
missing here.
>>
>> What would we lose without DMA fences? Just inter-device
synchronization? I think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence
is before a page flip. Everything today already depends on
userspace not hanging the gpu, which makes everything a
future fence.
>>
>> Marek
>>
>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here
are some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from
userspace via
>>> > memory-mapped queues ("user queues") without changing
window systems.
>>> >
>>> > The memory management doesn't have to use GPU page
faults like HMM.
>>> > Instead, it can wait for user queues of a specific
process to go idle and
>>> > then unmap the queues, so that userspace can't submit
anything. Buffer
>>> > evictions, pinning, etc. can be executed when all
queues are unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline
semaphores. Userspace will
>>> > query the wait and signal value for a shared buffer
from the kernel. The
>>> > kernel will keep a history of those queries to know
which process is
>>> > 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device

What about the case when we get a buffer from an external device and we're
supposed to make it "busy" when we are using it, and the external device
wants to wait until we stop using it? Is it something that can happen, thus
turning "external -> amd" into "external <-> amd"?

Marek

On Tue., Apr. 27, 2021, 08:50 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Only amd -> external.
>
> We can easily install something in a user queue which waits for a
> dma_fence in the kernel.
>
> But we can't easily wait for a user queue as a dependency of a dma_fence.
>
> The good thing is we have this wait before signal case on Vulkan timeline
> semaphores which have the same problem in the kernel.
>
> The good news is I think we can relatively easily convert i915 and older
> amdgpu devices to something which is compatible with user fences.
>
> So yes, getting that fixed case by case should work.
>
> Christian
>
> Am 27.04.21 um 14:46 schrieb Marek Olšák:
>
> I'll defer to Christian and Alex to decide whether dropping sync with
> non-amd devices (GPUs, cameras etc.) is acceptable.
>
> Rewriting those drivers to this new sync model could be done on a case by
> case basis.
>
> For now, would we only lose the "amd -> external" dependency? Or the
> "external -> amd" dependency too?
>
> Marek
>
> On Tue., Apr. 27, 2021, 08:15 Daniel Vetter,  wrote:
>
>> On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
>> > Ok. I'll interpret this as "yes, it will work, let's do it".
>>
>> It works if all you care about is drm/amdgpu. I'm not sure that's a
>> reasonable approach for upstream, but it definitely is an approach :-)
>>
>> We've already gone somewhat through the pain of drm/amdgpu redefining
>> how implicit sync works without sufficiently talking with other
>> people, maybe we should avoid a repeat of this ...
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue., Apr. 27, 2021, 08:06 Christian König, <
>> ckoenig.leichtzumer...@gmail.com> wrote:
>> >>
>> >> Correct, we wouldn't have synchronization between devices with and
>> without user queues any more.
>> >>
>> >> That could only be a problem for A+I Laptops.
>> >>
>> >> Memory management will just work with preemption fences which pause
>> the user queues of a process before evicting something. That will be a
>> dma_fence, but also a well known approach.
>> >>
>> >> Christian.
>> >>
>> >> On 27.04.21 13:49, Marek Olšák wrote:
>> >>
>> >> If we don't use future fences for DMA fences at all, e.g. we don't use
>> them for memory management, it can work, right? Memory management can
>> suspend user queues anytime. It doesn't need to use DMA fences. There might
>> be something that I'm missing here.
>> >>
>> >> What would we lose without DMA fences? Just inter-device
>> synchronization? I think that might be acceptable.
>> >>
>> >> The only case when the kernel will wait on a future fence is before a
>> page flip. Everything today already depends on userspace not hanging the
>> gpu, which makes everything a future fence.
>> >>
>> >> Marek
>> >>
>> >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>> >>>
>> >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> >>> > Thanks everybody. The initial proposal is dead. Here are some
>> thoughts on
>> >>> > how to do it differently.
>> >>> >
>> >>> > I think we can have direct command submission from userspace via
>> >>> > memory-mapped queues ("user queues") without changing window
>> systems.
>> >>> >
>> >>> > The memory management doesn't have to use GPU page faults like HMM.
>> >>> > Instead, it can wait for user queues of a specific process to go
>> idle and
>> >>> > then unmap the queues, so that userspace can't submit anything.
>> Buffer
>> >>> > evictions, pinning, etc. can be executed when all queues are
>> unmapped
>> >>> > (suspended). Thus, no BO fences and page faults are needed.
>> >>> >
>> >>> > Inter-process synchronization can use timeline semaphores.
>> Userspace will
>> >>> > query the wait and signal value for a shared buffer from the
>> kernel. The
>> >>> > kernel will keep a history of those queries to know which process is
>> >>> > responsible for signalling which buffer. There is only the
>> wait-timeout
>> >>> > issue and how to identify the culprit. One of the solutions is to
>> have the
>> >>> > GPU send all GPU signal commands and all timed out wait commands
>> via an
>> >>> > interrupt to the kernel driver to monitor and validate userspace
>> behavior.
>> >>> > With that, it can be identified whether the culprit is the waiting
>> process
>> >>> > or the signalling process and which one. Invalid signal/wait
>> parameters can
>> >>> > also be detected. The kernel can force-signal only the semaphores
>> that time
>> >>> > out, and punish the processes which caused the timeout or used
>> invalid
>> >>> > signal/wait parameters.
>> >>> >
>> >>> > The question is whether this synchronization solution is robust
>> enough for dma_fence and whatever the kernel and window systems need.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König

Only amd -> external.

We can easily install something in a user queue which waits for a
dma_fence in the kernel.


But we can't easily wait for a user queue as a dependency of a dma_fence.
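
To make that asymmetry concrete, here is a rough sketch (every name is
invented for illustration, this is not real amdgpu ABI): the kernel can
unblock a user queue by writing a seqno that a wait packet in the queue
polls, but the reverse direction would make a dma_fence's completion
depend on unbounded userspace work, which the dma_fence rules forbid.

#include <stdint.h>

/* Hypothetical wait packet sitting in a user queue: the hw blocks the
 * queue until the 64-bit word at gpu_va reaches wait_value. */
struct uq_wait_pkt {
        uint32_t opcode;      /* made-up WAIT_MEM_GEQ opcode */
        uint64_t gpu_va;      /* page only the kernel driver writes */
        uint64_t wait_value;
};

/* Kernel side, conceptually: runs as a dma_fence callback, so the user
 * queue resumes exactly when the fence signals. */
static void unblock_user_queue(volatile uint64_t *seqno_page, uint64_t seqno)
{
        *seqno_page = seqno;
}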

The good thing is we already have this wait-before-signal case on Vulkan
timeline semaphores, which have the same problem in the kernel.


The good news is I think we can relatively easily convert i915 and older
amdgpu devices to something which is compatible with user fences.


So yes, getting that fixed case by case should work.

Christian

On 27.04.21 14:46, Marek Olšák wrote:
I'll defer to Christian and Alex to decide whether dropping sync with 
non-amd devices (GPUs, cameras etc.) is acceptable.


Rewriting those drivers to this new sync model could be done on a case 
by case basis.


For now, would we only lose the "amd -> external" dependency? Or the 
"external -> amd" dependency too?


Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter, <dan...@ffwll.ch> wrote:


On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák <mar...@gmail.com> wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König,
<ckoenig.leichtzumer...@gmail.com> wrote:
>>
>> Correct, we wouldn't have synchronization between devices with
and without user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences which
pause the user queues of a process before evicting something. That
will be a dma_fence, but also a well known approach.
>>
>> Christian.
>>
>> On 27.04.21 13:49, Marek Olšák wrote:
>>
>> If we don't use future fences for DMA fences at all, e.g. we
don't use them for memory management, it can work, right? Memory
management can suspend user queues anytime. It doesn't need to use
DMA fences. There might be something that I'm missing here.
>>
>> What would we lose without DMA fences? Just inter-device
synchronization? I think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence is
before a page flip. Everything today already depends on userspace
not hanging the gpu, which makes everything a future fence.
>>
>> Marek
>>
>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <dan...@ffwll.ch> wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here are
some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from userspace via
>>> > memory-mapped queues ("user queues") without changing window
systems.
>>> >
>>> > The memory management doesn't have to use GPU page faults
like HMM.
>>> > Instead, it can wait for user queues of a specific process
to go idle and
>>> > then unmap the queues, so that userspace can't submit
anything. Buffer
>>> > evictions, pinning, etc. can be executed when all queues are
unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline semaphores.
Userspace will
>>> > query the wait and signal value for a shared buffer from the
kernel. The
>>> > kernel will keep a history of those queries to know which
process is
>>> > responsible for signalling which buffer. There is only the
wait-timeout
>>> > issue and how to identify the culprit. One of the solutions
is to have the
>>> > GPU send all GPU signal commands and all timed out wait
commands via an
>>> > interrupt to the kernel driver to monitor and validate
userspace behavior.
>>> > With that, it can be identified whether the culprit is the
waiting process
>>> > or the signalling process and which one. Invalid signal/wait
parameters can
>>> > also be detected. The kernel can force-signal only the
semaphores that time
>>> > out, and punish the processes which caused the timeout or
used invalid
>>> > signal/wait parameters.
>>> >
>>> > The question is whether this synchronization solution is
robust enough for
>>> > dma_fence and whatever the kernel and window systems need.
>>>
>>> The proper model here is the preempt-ctx dma_fence that amdkfd
uses
>>> (without page faults). That means dma_fence for
synchronization is doa, at
>>> least as-is, and we're back to figuring out the winsys problem.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
I'll defer to Christian and Alex to decide whether dropping sync with
non-amd devices (GPUs, cameras etc.) is acceptable.

Rewriting those drivers to this new sync model could be done on a case by
case basis.

For now, would we only lose the "amd -> external" dependency? Or the
"external -> amd" dependency too?

Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter,  wrote:

> On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
> > Ok. I'll interpret this as "yes, it will work, let's do it".
>
> It works if all you care about is drm/amdgpu. I'm not sure that's a
> reasonable approach for upstream, but it definitely is an approach :-)
>
> We've already gone somewhat through the pain of drm/amdgpu redefining
> how implicit sync works without sufficiently talking with other
> people, maybe we should avoid a repeat of this ...
> -Daniel
>
> >
> > Marek
> >
> > On Tue., Apr. 27, 2021, 08:06 Christian König, <
> ckoenig.leichtzumer...@gmail.com> wrote:
> >>
> >> Correct, we wouldn't have synchronization between devices with and
> without user queues any more.
> >>
> >> That could only be a problem for A+I Laptops.
> >>
> >> Memory management will just work with preemption fences which pause the
> user queues of a process before evicting something. That will be a
> dma_fence, but also a well known approach.
> >>
> >> Christian.
> >>
> >> On 27.04.21 13:49, Marek Olšák wrote:
> >>
> >> If we don't use future fences for DMA fences at all, e.g. we don't use
> them for memory management, it can work, right? Memory management can
> suspend user queues anytime. It doesn't need to use DMA fences. There might
> be something that I'm missing here.
> >>
> >> What would we lose without DMA fences? Just inter-device
> synchronization? I think that might be acceptable.
> >>
> >> The only case when the kernel will wait on a future fence is before a
> page flip. Everything today already depends on userspace not hanging the
> gpu, which makes everything a future fence.
> >>
> >> Marek
> >>
> >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
> >>>
> >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> >>> > Thanks everybody. The initial proposal is dead. Here are some
> thoughts on
> >>> > how to do it differently.
> >>> >
> >>> > I think we can have direct command submission from userspace via
> >>> > memory-mapped queues ("user queues") without changing window systems.
> >>> >
> >>> > The memory management doesn't have to use GPU page faults like HMM.
> >>> > Instead, it can wait for user queues of a specific process to go
> idle and
> >>> > then unmap the queues, so that userspace can't submit anything.
> Buffer
> >>> > evictions, pinning, etc. can be executed when all queues are unmapped
> >>> > (suspended). Thus, no BO fences and page faults are needed.
> >>> >
> >>> > Inter-process synchronization can use timeline semaphores. Userspace
> will
> >>> > query the wait and signal value for a shared buffer from the kernel.
> The
> >>> > kernel will keep a history of those queries to know which process is
> >>> > responsible for signalling which buffer. There is only the
> wait-timeout
> >>> > issue and how to identify the culprit. One of the solutions is to
> have the
> >>> > GPU send all GPU signal commands and all timed out wait commands via
> an
> >>> > interrupt to the kernel driver to monitor and validate userspace
> behavior.
> >>> > With that, it can be identified whether the culprit is the waiting
> process
> >>> > or the signalling process and which one. Invalid signal/wait
> parameters can
> >>> > also be detected. The kernel can force-signal only the semaphores
> that time
> >>> > out, and punish the processes which caused the timeout or used
> invalid
> >>> > signal/wait parameters.
> >>> >
> >>> > The question is whether this synchronization solution is robust
> enough for
> >>> > dma_fence and whatever the kernel and window systems need.
> >>>
> >>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
> >>> (without page faults). That means dma_fence for synchronization is
> doa, at
> >>> least as-is, and we're back to figuring out the winsys problem.
> >>>
> >>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
> >>> akin to saying that we're solving deadlock issues in a locking design
> by
> >>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
> >>> avoids having to reach the reset button, but that's about it.
> >>>
> >>> And the fundamental problem is that once you throw in userspace command
> >>> submission (and syncing, at least within the userspace driver,
> otherwise
> >>> there's kinda no point if you still need the kernel for cross-engine
> sync)
> >>> means you get deadlocks if you still use dma_fence for sync under
> >>> perfectly legit use-case. We've discussed that one ad nauseam last
> summer:
> >>>
> >>>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
> >>>
> >>> 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König

On 27.04.21 14:15, Daniel Vetter wrote:

On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:

Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...


BTW: This is coming up again for the plan here.

We once more need to think about the "other" fences which don't 
participate in the implicit sync here.


Christian.


-Daniel


Marek

On Tue., Apr. 27, 2021, 08:06 Christian König, 
 wrote:

Correct, we wouldn't have synchronization between devices with and without user
queues any more.

That could only be a problem for A+I Laptops.

Memory management will just work with preemption fences which pause the user 
queues of a process before evicting something. That will be a dma_fence, but 
also a well known approach.

Christian.

On 27.04.21 13:49, Marek Olšák wrote:

If we don't use future fences for DMA fences at all, e.g. we don't use them for 
memory management, it can work, right? Memory management can suspend user 
queues anytime. It doesn't need to use DMA fences. There might be something 
that I'm missing here.

What would we lose without DMA fences? Just inter-device synchronization? I 
think that might be acceptable.

The only case when the kernel will wait on a future fence is before a page 
flip. Everything today already depends on userspace not hanging the gpu, which 
makes everything a future fence.

Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:

On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:

Thanks everybody. The initial proposal is dead. Here are some thoughts on
how to do it differently.

I think we can have direct command submission from userspace via
memory-mapped queues ("user queues") without changing window systems.

The memory management doesn't have to use GPU page faults like HMM.
Instead, it can wait for user queues of a specific process to go idle and
then unmap the queues, so that userspace can't submit anything. Buffer
evictions, pinning, etc. can be executed when all queues are unmapped
(suspended). Thus, no BO fences and page faults are needed.

Inter-process synchronization can use timeline semaphores. Userspace will
query the wait and signal value for a shared buffer from the kernel. The
kernel will keep a history of those queries to know which process is
responsible for signalling which buffer. There is only the wait-timeout
issue and how to identify the culprit. One of the solutions is to have the
GPU send all GPU signal commands and all timed out wait commands via an
interrupt to the kernel driver to monitor and validate userspace behavior.
With that, it can be identified whether the culprit is the waiting process
or the signalling process and which one. Invalid signal/wait parameters can
also be detected. The kernel can force-signal only the semaphores that time
out, and punish the processes which caused the timeout or used invalid
signal/wait parameters.

The question is whether this synchronization solution is robust enough for
dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work. It's
akin to saying that we're solving deadlock issues in a locking design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace command
submission (and syncing, at least within the userspace driver, otherwise
there's kinda no point if you still need the kernel for cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

See silly diagram at the bottom.

Now I think all isn't lost, because imo the first step to getting to this
brave new world is rebuilding the driver on top of userspace fences, and
with the adjusted cmd submit model. You probably don't want to use amdkfd,
but port that as a context flag or similar to render nodes for gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel


Marek

On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:


Hi,

On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:


The thing is, you can't do this in drm/scheduler. At least not without
splitting up the dma_fence in the kernel into separate memory fences
and sync fences

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König, 
>  wrote:
>>
>> Correct, we wouldn't have synchronization between devices with and without
>> user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences which pause the user 
>> queues of a process before evicting something. That will be a dma_fence, but 
>> also a well known approach.
>>
>> Christian.
>>
>> On 27.04.21 13:49, Marek Olšák wrote:
>>
>> If we don't use future fences for DMA fences at all, e.g. we don't use them 
>> for memory management, it can work, right? Memory management can suspend 
>> user queues anytime. It doesn't need to use DMA fences. There might be 
>> something that I'm missing here.
>>
>> What would we lose without DMA fences? Just inter-device synchronization? I 
>> think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence is before a page 
>> flip. Everything today already depends on userspace not hanging the gpu, 
>> which makes everything a future fence.
>>
>> Marek
>>
>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from userspace via
>>> > memory-mapped queues ("user queues") without changing window systems.
>>> >
>>> > The memory management doesn't have to use GPU page faults like HMM.
>>> > Instead, it can wait for user queues of a specific process to go idle and
>>> > then unmap the queues, so that userspace can't submit anything. Buffer
>>> > evictions, pinning, etc. can be executed when all queues are unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline semaphores. Userspace will
>>> > query the wait and signal value for a shared buffer from the kernel. The
>>> > kernel will keep a history of those queries to know which process is
>>> > responsible for signalling which buffer. There is only the wait-timeout
>>> > issue and how to identify the culprit. One of the solutions is to have the
>>> > GPU send all GPU signal commands and all timed out wait commands via an
>>> > interrupt to the kernel driver to monitor and validate userspace behavior.
>>> > With that, it can be identified whether the culprit is the waiting process
>>> > or the signalling process and which one. Invalid signal/wait parameters 
>>> > can
>>> > also be detected. The kernel can force-signal only the semaphores that 
>>> > time
>>> > out, and punish the processes which caused the timeout or used invalid
>>> > signal/wait parameters.
>>> >
>>> > The question is whether this synchronization solution is robust enough for
>>> > dma_fence and whatever the kernel and window systems need.
>>>
>>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>>> (without page faults). That means dma_fence for synchronization is doa, at
>>> least as-is, and we're back to figuring out the winsys problem.
>>>
>>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>>> akin to saying that we're solving deadlock issues in a locking design by
>>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>>> avoids having to reach the reset button, but that's about it.
>>>
>>> And the fundamental problem is that once you throw in userspace command
>>> submission (and syncing, at least within the userspace driver, otherwise
>>> there's kinda no point if you still need the kernel for cross-engine sync)
>>> means you get deadlocks if you still use dma_fence for sync under
>>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>>
>>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>>
>>> See silly diagram at the bottom.
>>>
>>> Now I think all isn't lost, because imo the first step to getting to this
>>> brave new world is rebuilding the driver on top of userspace fences, and
>>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>>> but port that as a context flag or similar to render nodes for gl/vk. Of
>>> course that means you can only use this mode in headless, without
>>> glx/wayland winsys support, but it's a start.
>>> -Daniel
>>>
>>> >
>>> > Marek
>>> >
>>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
>>> >
>>> > > 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 1:49 PM Marek Olšák  wrote:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use them 
> for memory management, it can work, right? Memory management can suspend user 
> queues anytime. It doesn't need to use DMA fences. There might be something 
> that I'm missing here.

Other drivers use dma_fence for their memory management. So unless
you've converted them all over to the dma_fence/memory fence split,
dma_fence fences stay memory fences. In theory this is possible, but
maybe not if you want to complete the job this decade :-)

> What would we lose without DMA fences? Just inter-device synchronization? I 
> think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page 
> flip. Everything today already depends on userspace not hanging the gpu, 
> which makes everything a future fence.

That's not quite what we defined as future fences, because TDR
(timeout detection and recovery) guarantees those complete, even if
userspace hangs. It's when you put
userspace fence waits into the cs buffer you've submitted to the
kernel (or directly to hw) where the "real" future fences kick in.
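
To make the distinction concrete, a sketch with a made-up packet layout
(not any real hw): a kernel submission yields a dma_fence that TDR will
force to complete, while a userspace fence wait baked into the command
stream has no such bound, because nothing guarantees the value is ever
written.

#include <stdint.h>

/* The "real" future fence case: the CS itself contains a wait on a
 * userspace-owned memory word.  TDR can kill a hung job, but it cannot
 * make this wait complete -- the value may simply never be written. */
struct cs_userspace_wait {
        uint32_t opcode;      /* hypothetical WAIT_REG_MEM-style packet */
        uint64_t fence_va;    /* memory fully under userspace control */
        uint64_t threshold;   /* proceed once *fence_va >= threshold */
};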
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
>> > how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle and
>> > then unmap the queues, so that userspace can't submit anything. Buffer
>> > evictions, pinning, etc. can be executed when all queues are unmapped
>> > (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace will
>> > query the wait and signal value for a shared buffer from the kernel. The
>> > kernel will keep a history of those queries to know which process is
>> > responsible for signalling which buffer. There is only the wait-timeout
>> > issue and how to identify the culprit. One of the solutions is to have the
>> > GPU send all GPU signal commands and all timed out wait commands via an
>> > interrupt to the kernel driver to monitor and validate userspace behavior.
>> > With that, it can be identified whether the culprit is the waiting process
>> > or the signalling process and which one. Invalid signal/wait parameters can
>> > also be detected. The kernel can force-signal only the semaphores that time
>> > out, and punish the processes which caused the timeout or used invalid
>> > signal/wait parameters.
>> >
>> > The question is whether this synchronization solution is robust enough for
>> > dma_fence and whatever the kernel and window systems need.
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is doa, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync)
>> means you get deadlocks if you still use dma_fence for sync under
>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See silly diagram at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
>> >
>> > > Hi,
>> > >
>> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
>> > >
>> > >> The thing is, you can't do this in drm/scheduler. At least not without
>> > >> splitting up the dma_fence in the kernel into separate memory fences
>> > >> and sync fences
>> > >
>> > >
>> > > I'm starting to think this thread needs its own glossary ...
>> > >
>> > > I propose we use 'residency fence' for execution fences which enact
>> > > memory-residency operations, e.g. faulting in a page ultimately
>> > > depending on GPU work retiring.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Ok. I'll interpret this as "yes, it will work, let's do it".

Marek

On Tue., Apr. 27, 2021, 08:06 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Correct, we wouldn't have synchronization between devices with and without
> user queues any more.
>
> That could only be a problem for A+I Laptops.
>
> Memory management will just work with preemption fences which pause the
> user queues of a process before evicting something. That will be a
> dma_fence, but also a well known approach.
>
> Christian.
>
> On 27.04.21 13:49, Marek Olšák wrote:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use
> them for memory management, it can work, right? Memory management can
> suspend user queues anytime. It doesn't need to use DMA fences. There might
> be something that I'm missing here.
>
> What would we lose without DMA fences? Just inter-device synchronization?
> I think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page
> flip. Everything today already depends on userspace not hanging the gpu,
> which makes everything a future fence.
>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts
>> on
>> > how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle
>> and
>> > then unmap the queues, so that userspace can't submit anything. Buffer
>> > evictions, pinning, etc. can be executed when all queues are unmapped
>> > (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace
>> will
>> > query the wait and signal value for a shared buffer from the kernel. The
>> > kernel will keep a history of those queries to know which process is
>> > responsible for signalling which buffer. There is only the wait-timeout
>> > issue and how to identify the culprit. One of the solutions is to have
>> the
>> > GPU send all GPU signal commands and all timed out wait commands via an
>> > interrupt to the kernel driver to monitor and validate userspace
>> behavior.
>> > With that, it can be identified whether the culprit is the waiting
>> process
>> > or the signalling process and which one. Invalid signal/wait parameters
>> can
>> > also be detected. The kernel can force-signal only the semaphores that
>> time
>> > out, and punish the processes which caused the timeout or used invalid
>> > signal/wait parameters.
>> >
>> > The question is whether this synchronization solution is robust enough
>> for
>> > dma_fence and whatever the kernel and window systems need.
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is doa, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync)
>> means you get deadlocks if you still use dma_fence for sync under
>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See silly diagram at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone 
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
>> > >
>> > >> The thing is, you can't do this in drm/scheduler. At least not
>> without
>> > >> splitting up the dma_fence in the kernel into separate memory fences
>> > >> and sync fences
>> > >
>> > >
>> > > I'm starting to think this thread needs its own glossary ...
>> > >
>> > > I propose we use 'residency fence' for execution fences which enact
>> > > memory-residency operations, e.g. faulting in a page ultimately
>> > > depending on GPU work retiring.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König
Correct, we wouldn't have synchronization between devices with and
without user queues any more.


That could only be a problem for A+I Laptops.

Memory management will just work with preemption fences which pause the 
user queues of a process before evicting something. That will be a 
dma_fence, but also a well known approach.
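
As a rough sketch of that flow (the helpers around the real
dma_fence_wait() are invented, this is not the actual amdgpu code):

#include <linux/dma-fence.h>

struct process_ctx;
struct buffer;

/* Eviction with a preemption fence: suspending the queues is
 * guaranteed to complete in bounded time, so this fence is a legal
 * dma_fence for memory management. */
static int evict_one(struct process_ctx *p, struct buffer *bo)
{
        struct dma_fence *f = suspend_user_queues(p);

        dma_fence_wait(f, false);   /* hw no longer touches p's queues */
        dma_fence_put(f);
        move_buffer(bo);            /* safe: nothing is executing */
        resume_user_queues(p);      /* queues continue where they stopped */
        return 0;
}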


Christian.

On 27.04.21 13:49, Marek Olšák wrote:
If we don't use future fences for DMA fences at all, e.g. we don't use 
them for memory management, it can work, right? Memory management can 
suspend user queues anytime. It doesn't need to use DMA fences. There 
might be something that I'm missing here.


What would we lose without DMA fences? Just inter-device 
synchronization? I think that might be acceptable.


The only case when the kernel will wait on a future fence is before a 
page flip. Everything today already depends on userspace not hanging 
the gpu, which makes everything a future fence.


Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <dan...@ffwll.ch> wrote:


On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> Thanks everybody. The initial proposal is dead. Here are some
thoughts on
> how to do it differently.
>
> I think we can have direct command submission from userspace via
> memory-mapped queues ("user queues") without changing window
systems.
>
> The memory management doesn't have to use GPU page faults like HMM.
> Instead, it can wait for user queues of a specific process to go
idle and
> then unmap the queues, so that userspace can't submit anything.
Buffer
> evictions, pinning, etc. can be executed when all queues are
unmapped
> (suspended). Thus, no BO fences and page faults are needed.
>
> Inter-process synchronization can use timeline semaphores.
Userspace will
> query the wait and signal value for a shared buffer from the
kernel. The
> kernel will keep a history of those queries to know which process is
> responsible for signalling which buffer. There is only the
wait-timeout
> issue and how to identify the culprit. One of the solutions is
to have the
> GPU send all GPU signal commands and all timed out wait commands
via an
> interrupt to the kernel driver to monitor and validate userspace
behavior.
> With that, it can be identified whether the culprit is the
waiting process
> or the signalling process and which one. Invalid signal/wait
parameters can
> also be detected. The kernel can force-signal only the
semaphores that time
> out, and punish the processes which caused the timeout or used
invalid
> signal/wait parameters.
>
> The question is whether this synchronization solution is robust
enough for
> dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is
doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work.
It's
akin to saying that we're solving deadlock issues in a locking
design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace
command
submission (and syncing, at least within the userspace driver,
otherwise
there's kinda no point if you still need the kernel for
cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last
summer:


https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences



See silly diagram at the bottom.

Now I think all isn't lost, because imo the first step to getting
to this
brave new world is rebuilding the driver on top of userspace
fences, and
with the adjusted cmd submit model. You probably don't want to use
amdkfd,
but port that as a context flag or similar to render nodes for
gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel

>
> Marek
>
> On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone
<dan...@fooishbar.org> wrote:
>
> > Hi,
> >
> > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <dan...@ffwll.ch> wrote:
> >
> >> The thing is, you can't do this in drm/scheduler. At least
not without
> >> splitting up the dma_fence in the kernel into separate memory
fences
> >> and sync fences
> >
> >
> > I'm starting to think this thread needs its own glossary ...

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
If we don't use future fences for DMA fences at all, e.g. we don't use them
for memory management, it can work, right? Memory management can suspend
user queues anytime. It doesn't need to use DMA fences. There might be
something that I'm missing here.

What would we lose without DMA fences? Just inter-device synchronization? I
think that might be acceptable.

The only case when the kernel will wait on a future fence is before a page
flip. Everything today already depends on userspace not hanging the gpu,
which makes everything a future fence.

Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:

> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
> > how to do it differently.
> >
> > I think we can have direct command submission from userspace via
> > memory-mapped queues ("user queues") without changing window systems.
> >
> > The memory management doesn't have to use GPU page faults like HMM.
> > Instead, it can wait for user queues of a specific process to go idle and
> > then unmap the queues, so that userspace can't submit anything. Buffer
> > evictions, pinning, etc. can be executed when all queues are unmapped
> > (suspended). Thus, no BO fences and page faults are needed.
> >
> > Inter-process synchronization can use timeline semaphores. Userspace will
> > query the wait and signal value for a shared buffer from the kernel. The
> > kernel will keep a history of those queries to know which process is
> > responsible for signalling which buffer. There is only the wait-timeout
> > issue and how to identify the culprit. One of the solutions is to have
> the
> > GPU send all GPU signal commands and all timed out wait commands via an
> > interrupt to the kernel driver to monitor and validate userspace
> behavior.
> > With that, it can be identified whether the culprit is the waiting
> process
> > or the signalling process and which one. Invalid signal/wait parameters
> can
> > also be detected. The kernel can force-signal only the semaphores that
> time
> > out, and punish the processes which caused the timeout or used invalid
> > signal/wait parameters.
> >
> > The question is whether this synchronization solution is robust enough
> for
> > dma_fence and whatever the kernel and window systems need.
>
> The proper model here is the preempt-ctx dma_fence that amdkfd uses
> (without page faults). That means dma_fence for synchronization is doa, at
> least as-is, and we're back to figuring out the winsys problem.
>
> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
> akin to saying that we're solving deadlock issues in a locking design by
> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
> avoids having to reach the reset button, but that's about it.
>
> And the fundamental problem is that once you throw in userspace command
> submission (and syncing, at least within the userspace driver, otherwise
> there's kinda no point if you still need the kernel for cross-engine sync)
> means you get deadlocks if you still use dma_fence for sync under
> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>
> See silly diagram at the bottom.
>
> Now I think all isn't lost, because imo the first step to getting to this
> brave new world is rebuilding the driver on top of userspace fences, and
> with the adjusted cmd submit model. You probably don't want to use amdkfd,
> but port that as a context flag or similar to render nodes for gl/vk. Of
> course that means you can only use this mode in headless, without
> glx/wayland winsys support, but it's a start.
> -Daniel
>
> >
> > Marek
> >
> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone 
> wrote:
> >
> > > Hi,
> > >
> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
> > >
> > >> The thing is, you can't do this in drm/scheduler. At least not without
> > >> splitting up the dma_fence in the kernel into separate memory fences
> > >> and sync fences
> > >
> > >
> > > I'm starting to think this thread needs its own glossary ...
> > >
> > > I propose we use 'residency fence' for execution fences which enact
> > > memory-residency operations, e.g. faulting in a page ultimately
> depending
> > > on GPU work retiring.
> > >
> > > And 'value fence' for the pure-userspace model suggested by timeline
> > > semaphores, i.e. fences being (*addr == val) rather than being able to
> look
> > > at ctx seqno.
> > >
> > > Cheers,
> > > Daniel
> > >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> Thanks everybody. The initial proposal is dead. Here are some thoughts on
> how to do it differently.
> 
> I think we can have direct command submission from userspace via
> memory-mapped queues ("user queues") without changing window systems.
> 
> The memory management doesn't have to use GPU page faults like HMM.
> Instead, it can wait for user queues of a specific process to go idle and
> then unmap the queues, so that userspace can't submit anything. Buffer
> evictions, pinning, etc. can be executed when all queues are unmapped
> (suspended). Thus, no BO fences and page faults are needed.
> 
> Inter-process synchronization can use timeline semaphores. Userspace will
> query the wait and signal value for a shared buffer from the kernel. The
> kernel will keep a history of those queries to know which process is
> responsible for signalling which buffer. There is only the wait-timeout
> issue and how to identify the culprit. One of the solutions is to have the
> GPU send all GPU signal commands and all timed out wait commands via an
> interrupt to the kernel driver to monitor and validate userspace behavior.
> With that, it can be identified whether the culprit is the waiting process
> or the signalling process and which one. Invalid signal/wait parameters can
> also be detected. The kernel can force-signal only the semaphores that time
> out, and punish the processes which caused the timeout or used invalid
> signal/wait parameters.
> 
> The question is whether this synchronization solution is robust enough for
> dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work. It's
akin to saying that we're solving deadlock issues in a locking design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.
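
Spelled out with the locking analogy (illustration only):

#include <linux/mutex.h>

static DEFINE_MUTEX(a);
static DEFINE_MUTEX(b);

/* Classic ABBA deadlock.  A hypothetical mutex_lock_timeout() makes
 * both sides eventually give up and retry -- and they can deadlock
 * again on the very next attempt.  The cycle in the dependency graph
 * is untouched; only the symptom changes from "hangs forever" to
 * "hangs repeatedly". */
static void thread_one(void)
{
        mutex_lock(&a);
        mutex_lock(&b);
        mutex_unlock(&b);
        mutex_unlock(&a);
}

static void thread_two(void)
{
        mutex_lock(&b);
        mutex_lock(&a);
        mutex_unlock(&a);
        mutex_unlock(&b);
}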

And the fundamental problem is that once you throw in userspace command
submission (and syncing, at least within the userspace driver, otherwise
there's kinda no point if you still need the kernel for cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

See silly diagram at the bottom.

Now I think all isn't lost, because imo the first step to getting to this
brave new world is rebuilding the driver on top of userspace fences, and
with the adjusted cmd submit model. You probably don't want to use amdkfd,
but port that as a context flag or similar to render nodes for gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel

> 
> Marek
> 
> On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
> 
> > Hi,
> >
> > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
> >
> >> The thing is, you can't do this in drm/scheduler. At least not without
> >> splitting up the dma_fence in the kernel into separate memory fences
> >> and sync fences
> >
> >
> > I'm starting to think this thread needs its own glossary ...
> >
> > I propose we use 'residency fence' for execution fences which enact
> > memory-residency operations, e.g. faulting in a page ultimately depending
> > on GPU work retiring.
> >
> > And 'value fence' for the pure-userspace model suggested by timeline
> > semaphores, i.e. fences being (*addr == val) rather than being able to look
> > at ctx seqno.
> >
> > Cheers,
> > Daniel
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-26 Thread Marek Olšák
Thanks everybody. The initial proposal is dead. Here are some thoughts on
how to do it differently.

I think we can have direct command submission from userspace via
memory-mapped queues ("user queues") without changing window systems.

The memory management doesn't have to use GPU page faults like HMM.
Instead, it can wait for user queues of a specific process to go idle and
then unmap the queues, so that userspace can't submit anything. Buffer
evictions, pinning, etc. can be executed when all queues are unmapped
(suspended). Thus, no BO fences and page faults are needed.

Inter-process synchronization can use timeline semaphores. Userspace will
query the wait and signal value for a shared buffer from the kernel. The
kernel will keep a history of those queries to know which process is
responsible for signalling which buffer. There is only the wait-timeout
issue and how to identify the culprit. One of the solutions is to have the
GPU send all GPU signal commands and all timed out wait commands via an
interrupt to the kernel driver to monitor and validate userspace behavior.
With that, it can be identified whether the culprit is the waiting process
or the signalling process and which one. Invalid signal/wait parameters can
also be detected. The kernel can force-signal only the semaphores that time
out, and punish the processes which caused the timeout or used invalid
signal/wait parameters.
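
As a very rough sketch of that bookkeeping (every name invented for
illustration):

#include <stdint.h>

/* One record per wait/signal value the kernel hands out for a shared
 * buffer. */
struct sem_query_record {
        uint64_t buffer_id;   /* shared buffer the semaphore guards */
        uint64_t value;       /* timeline point given to userspace */
        int      signaller;   /* pid that asked for the signal slot */
        int      waiter;      /* pid that asked for the wait value */
};

/* On a timed-out-wait interrupt, conceptually:
 *   - the value was never handed out  -> the waiter used invalid
 *                                        parameters;
 *   - no signal command was ever seen
 *     from record->signaller          -> the signaller is the culprit.
 * Then force-signal just this semaphore and punish the culprit. */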

The question is whether this synchronization solution is robust enough for
dma_fence and whatever the kernel and window systems need.

Marek

On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:

> Hi,
>
> On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
>
>> The thing is, you can't do this in drm/scheduler. At least not without
>> splitting up the dma_fence in the kernel into separate memory fences
>> and sync fences
>
>
> I'm starting to think this thread needs its own glossary ...
>
> I propose we use 'residency fence' for execution fences which enact
> memory-residency operations, e.g. faulting in a page ultimately depending
> on GPU work retiring.
>
> And 'value fence' for the pure-userspace model suggested by timeline
> semaphores, i.e. fences being (*addr == val) rather than being able to look
> at ctx seqno.
>
> Cheers,
> Daniel
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi,

On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:

> The thing is, you can't do this in drm/scheduler. At least not without
> splitting up the dma_fence in the kernel into separate memory fences
> and sync fences


I'm starting to think this thread needs its own glossary ...

I propose we use 'residency fence' for execution fences which enact
memory-residency operations, e.g. faulting in a page ultimately depending
on GPU work retiring.

And 'value fence' for the pure-userspace model suggested by timeline
semaphores, i.e. fences being (*addr == val) rather than being able to look
at ctx seqno.
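
A value fence really is just this much, modulo memory ordering (a
sketch, not a proposed ABI):

#include <stdbool.h>
#include <stdint.h>

struct value_fence {
        const volatile uint64_t *addr;   /* word in GPU/CPU shared memory */
        uint64_t val;
};

/* Timeline-style check: >= rather than a literal ==, so one word can
 * carry a whole monotonically increasing timeline.  A real
 * implementation would use an acquire load here. */
static bool value_fence_signaled(const struct value_fence *f)
{
        return *f->addr >= f->val;
}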

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 9:14 PM Daniel Stone  wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 19:54, Daniel Vetter  wrote:
>>
>> So I can mostly get behind this, except it's _not_ going to be
>> dma_fence. That thing has horrendous internal ordering constraints
>> within the kernel, and the one thing that doesn't allow you is to make
>> a dma_fence depend upon a userspace fence.
>>
>> But what we can do is use the same currently existing container
>> objects like drm_syncobj or sync_file (timeline syncobj would fit best
>> tbh), and stuff a userspace fence behind it. The only trouble is that
>> currently timeline syncobj implement vulkan's spec, which means if you
>> build a wait-before-signal deadlock, you'll wait forever. Well until
>> the user ragequits and kills your process.
>>
>> So for winsys we'd need to be able to specify the wait timeout
>> somewhere for waiting for that dma_fence to materialize (plus the
>> submit thread, but userspace needs that anyway to support timeline
>> syncobj) if you're importing an untrusted timeline syncobj. And I
>> think that's roughly it.
>
>
> Right. The only way you get to materialise a dma_fence from an execbuf is 
> that you take a hard timeout, with a penalty for not meeting that timeout. 
> When I say dma_fence I mean dma_fence, because there is no extant winsys 
> support for drm_symcobj, so this is greenfield: the winsys gets to specify 
> its terms of engagement, and again, we've been the orange/green-site enemies 
> of users for quite some time already, so we're happy to continue doing so. If 
> the actual underlying primitive is not a dma_fence, and 
> compositors/protocol/clients need to eat a bunch of typing to deal with a 
> different primitive which offers the same guarantees, then that's fine, as 
> long as there is some tangible whole-of-system benefit.

So atm sync_file doesn't support future fences, but we could add
support for those there. And since vulkan doesn't really say anything
about those, we could make the wait time out by default.

> How that timeout is actually realised is an implementation detail. Whether 
> it's a property of the last GPU job itself that the CPU-side driver can 
> observe, or that the kernel driver guarantees that there is a GPU job 
> launched in parallel which monitors the memory-fence status and reports back 
> through a mailbox/doorbell, or the CPU-side driver enqueues kqueue work for 
> $n milliseconds' time to check the value in memory and kill the context if it 
> doesn't meet expectations - whatever. I don't believe any of those choices 
> meaningfully impact on kernel driver complexity relative to the initial 
> proposal, but they do allow us to continue to provide the guarantees we do 
> today when buffers cross security boundaries.

The thing is, you can't do this in drm/scheduler. At least not without
splitting up the dma_fence in the kernel into separate memory fences
and sync fences, and the work to get there is imo just not worth it.
We've bikeshedded this ad nauseaum for vk timeline syncobj, and the
solution was to have the submit thread in the userspace driver.
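
Roughly what such a submit thread does for an untrusted imported
timeline syncobj, as far as I understand the existing pieces
(drmSyncobjTimelineWait() and the WAIT_AVAILABLE flag should be the
right libdrm bits; everything else is an elided sketch):

#include <stdint.h>
#include <xf86drm.h>

/* Block, with an absolute CLOCK_MONOTONIC deadline chosen by the
 * winsys, until the dma_fence behind (syncobj, point) has materialized
 * -- not until it has signalled -- then do the real submission. */
static int wait_materialize_then_submit(int fd, uint32_t syncobj,
                                        uint64_t point, int64_t deadline_ns)
{
        uint32_t first;
        int ret = drmSyncobjTimelineWait(fd, &syncobj, &point, 1,
                                         deadline_ns,
                                         DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
                                         &first);
        if (ret)
                return ret;   /* timed out: winsys policy decides what now */

        /* ... the actual execbuf / queue submission goes here ... */
        return 0;
}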

It won't really change anything wrt what applications can observe from
the egl/gl side of things though.

> There might well be an argument for significantly weakening those security 
> boundaries and shifting the complexity from the DRM scheduler into userspace 
> compositors. So far though, I have yet to see that argument made coherently.

Ah we've had that argument. We have moved that into userspace as part
of vk submit threads. It ain't pretty, but it's better than the other
option :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 9:17 PM Jason Ekstrand  wrote:
>
> On Tue, Apr 20, 2021 at 1:54 PM Daniel Vetter  wrote:
> >
> > On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone  wrote:
> >
> > > And something more concrete:
> > >
> > > dma_fence.
> > >
> > > This already has all of the properties described above. Kernel-wise, it 
> > > already devolves to CPU-side signaling when it crosses device boundaries. 
> > > We need to support it roughly forever since it's been plumbed so far and 
> > > so wide. Any primitive which is acceptable for winsys-like usage which 
> > > crosses so many device/subsystem/process/security boundaries has to meet 
> > > the same requirements. So why reinvent something which looks so similar, 
> > > and has the same requirements of the kernel babysitting completion, 
> > > providing little to no benefit for that difference?
> >
> > So I can mostly get behind this, except it's _not_ going to be
> > dma_fence. That thing has horrendous internal ordering constraints
> > within the kernel, and the one thing that doesn't allow you is to make
> > a dma_fence depend upon a userspace fence.
>
> Let me elaborate on this a bit.  One of the problems I mentioned
> earlier is the conflation of fence types inside the kernel.  dma_fence
> is used for solving two different semi-related but different problems:
> client command synchronization and memory residency synchronization.
> In the old implicit GL world, we conflated these two and thought we
> were providing ourselves a service.  Not so much
>
> It's all well and good to say that we should turn the memory fence
> into a dma_fence and throw a timeout on it.  However, these
> window-system sync primitives, as you said, have to be able to be
> shared across everything.  In particular, we have to be able to share
> them with drivers that don't make a good separation between command
> and memory synchronization.
>
> Let's say we're rendering on ANV with memory fences and presenting on
> some USB display adapter whose kernel driver is a bit old-school.
> When we pass that fence to the other driver via a sync_file or
> similar, that driver may shove that dma_fence into the dma_resv on
> some buffer somewhere.  Then our client, completely unaware of
> internal kernel dependencies, binds that buffer into its address space
> and kicks off another command buffer.  So i915 throws in a dependency
> on that dma_resv which contains the previously created dma_fence and
> refuses to execute any more command buffers until it signals.
> Unfortunately, unbeknownst to i915, that command buffer which the
> client kicked off after doing that bind was required for signaling the
> memory fence on which our first dma_fence depends.  Deadlock.

Nope. Because the waiting for this future fence will only happen in two places:
- driver submit thread, which is just userspace without holding
anything. From the kernel pov this can be preempted, memory
temporarily taken away, all these things. Until that's done you will
_not_ get a real dma_fence, but just another future fence.
- but what about the USB display, you're asking? Well, for that we'll
need a new atomic extension, which takes a timeline syncobj and gives
you back a timeline syncobj. And the rules are that if one of them is a
future fence/userspace fence, so will the other be (even if it's created
by the kernel).

Either way you get a timeline syncobj back which anv can then again
handle properly with its submit thread. Not a dma_fence with a funny
timeout, because there are deadlock issues with those.

So no, you won't be able to get a dma_fence out of your sleight of hand here.
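
For illustration, the submit-thread pattern amounts to something like the
sketch below, written against libdrm's timeline syncobj API. Error handling
is elided and do_real_execbuf() is a made-up stand-in for the driver's
actual submission path, so treat this as a rough shape, not how any
particular driver does it:

    #include <stdint.h>
    #include <xf86drm.h>

    int do_real_execbuf(int fd); /* stand-in for the driver's submit path */

    /* Park in the userspace submit thread until the wait point has
     * materialized into a real fence (WAIT_AVAILABLE waits for the fence
     * to exist, not to signal), then do the actual submission. Nothing
     * in the kernel blocks on the userspace fence in the meantime. */
    static int submit_when_materialized(int fd, uint32_t syncobj,
                                        uint64_t point)
    {
        uint32_t first;
        int ret = drmSyncobjTimelineWait(fd, &syncobj, &point, 1,
                                         INT64_MAX, /* finite if untrusted */
                                         DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT |
                                         DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
                                         &first);
        if (ret)
            return ret;

        return do_real_execbuf(fd);
    }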

> Sure, we put a timeout on the dma_fence and it will eventually fire
> and unblock everything.  However, there's one very important point
> that's easy to miss here:  Neither i915 nor the client did anything
> wrong in the above scenario.  The Vulkan footgun approach works
> because there are a set of rules and, if you follow those rules,
> you're guaranteed everything works.  In the above scenario, however,
> the client followed all of the rules and got a deadlock anyway.  We
> can't have that.
>
>
> > But what we can do is use the same currently existing container
> > objects like drm_syncobj or sync_file (timeline syncobj would fit best
> > tbh), and stuff a userspace fence behind it. The only trouble is that
> > currently timeline syncobj implement vulkan's spec, which means if you
> > build a wait-before-signal deadlock, you'll wait forever. Well until
> > the user ragequits and kills your process.
>
> Yeah, it may be that this approach can be made to work.  Instead of
> reusing dma_fence, maybe we can reuse syncobj and have another form of
> syncobj which is a memory fence, a value to wait on, and a timeout.

It's going to be the same container. But very much not a dma_fence.

Note the other approach is if you split the kernel's notion of what a
dma_fence is into two parts: memory fence and synchronization
primitive. The trouble is that there's 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Marek Olšák
On Tue, Apr 20, 2021 at 2:39 PM Daniel Vetter  wrote:

> On Tue, Apr 20, 2021 at 6:25 PM Marek Olšák  wrote:
> >
> > Daniel, imagine hardware that can only do what Windows does: future
> fences signalled by userspace whenever userspace wants, and no kernel
> queues like we have today.
> >
> > The only reason why current AMD GPUs work is because they have a ring
> buffer per queue with pointers to userspace command buffers followed by
> fences. What will we do if that ring buffer is removed?
>
> Well this is an entirely different problem than what you set out to
> describe. This is essentially the problem where hw does not have any
> support for privileged commands and a separate privileged command
> buffer, and direct userspace submit is the only thing that is
> available.
>
> I think if this is your problem, then you get to implement some very
> interesting compat shim. But that's an entirely different problem from
> what you've described in your mail. This pretty much assumes at the hw
> level the only thing that works is ATS/pasid, and vram is managed with
> HMM exclusively. Once you have that pure driver stack you get to fake
> it in the kernel for compat with everything that exists already. How
> exactly that will look and how exactly you best construct your
> dma_fences for compat will depend highly upon how much is still there
> in this hw (e.g. wrt interrupt generation). A lot of the
> infrastructure was also done as part of drm_syncobj. I mean we have
> entirely fake kernel drivers like vgem/vkms that create dma_fence, so
> a hw ringbuffer is really not required.
>
> So ... is this your problem underneath it all, or was that more a wild
> strawman for the discussion?
>

Yes, that's the problem.

Marek


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi,

On Tue, 20 Apr 2021 at 20:03, Bas Nieuwenhuizen 
wrote:

> On Tue, Apr 20, 2021 at 8:16 PM Daniel Stone  wrote:
>
>> It's a jarring transition. If you take a very narrow view and say 'it's
>> all GPU work in shared buffers so it should all work the same', then
>> client<->winsys looks the same as client<->client gbuffer. But this is a
>> trap.
>>
>
> I think this is where we have a serious gap in what a winsys
> or a compositor is. Like if you have only a single wayland server running
> on a physical machine this is easy. But add a VR compositor, an
> intermediate compositor (say gamescope), Xwayland and some containers/VM,
> some video capture (or, gasp, a browser that doubles as compositor) and
> this story gets seriously complicated. Like who are you protecting from
> whom? At what point is something client<->winsys vs. client<->client?
>

As I've said upthread, the line is _seriously_ blurred, and is only getting
less clear. Right now, DRI3 cannot even accept a dma_fence, let alone a
drm_syncobj, let alone a memory fence.

Crossing those boundaries is hard, and requires as much thinking as typing.
That's a good thing.

Conflating every synchronisation desire into a single
userspace-visible primitive makes this harder, because it treats game
threads the same as other game threads the same as VR compositors the same
as embedding browsers the same as compositors etc. Drawing very clear lines
between game threads and the external world, with explicit weakening as
necessary, makes those jarring transitions of privilege and expectation
clear and explicit. Which is a good thing, since we're trying to move away
from magic and implicit.

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Jason Ekstrand
On Tue, Apr 20, 2021 at 1:54 PM Daniel Vetter  wrote:
>
> On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone  wrote:
>
> > And something more concrete:
> >
> > dma_fence.
> >
> > This already has all of the properties described above. Kernel-wise, it 
> > already devolves to CPU-side signaling when it crosses device boundaries. 
> > We need to support it roughly forever since it's been plumbed so far and so 
> > wide. Any primitive which is acceptable for winsys-like usage which crosses 
> > so many device/subsystem/process/security boundaries has to meet the same 
> > requirements. So why reinvent something which looks so similar, and has the 
> > same requirements of the kernel babysitting completion, providing little to 
> > no benefit for that difference?
>
> So I can mostly get behind this, except it's _not_ going to be
> dma_fence. That thing has horrendous internal ordering constraints
> within the kernel, and the one thing it doesn't allow you to do is make
> a dma_fence depend upon a userspace fence.

Let me elaborate on this a bit.  One of the problems I mentioned
earlier is the conflation of fence types inside the kernel.  dma_fence
is used for solving two different semi-related but different problems:
client command synchronization and memory residency synchronization.
In the old implicit GL world, we conflated these two and thought we
were providing ourselves a service.  Not so much.

It's all well and good to say that we should turn the memory fence
into a dma_fence and throw a timeout on it.  However, these
window-system sync primitives, as you said, have to be able to be
shared across everything.  In particular, we have to be able to share
them with drivers that don't make a good separation between command
and memory synchronization.

Let's say we're rendering on ANV with memory fences and presenting on
some USB display adapter whose kernel driver is a bit old-school.
When we pass that fence to the other driver via a sync_file or
similar, that driver may shove that dma_fence into the dma_resv on
some buffer somewhere.  Then our client, completely unaware of
internal kernel dependencies, binds that buffer into its address space
and kicks off another command buffer.  So i915 throws in a dependency
on that dma_resv which contains the previously created dma_fence and
refuses to execute any more command buffers until it signals.
Unfortunately, unbeknownst to i915, that command buffer which the
client kicked off after doing that bind was required for signaling the
memory fence on which our first dma_fence depends.  Deadlock.
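
Spelled out, the cycle looks like this (schematic only, using the names
from the scenario above):

    /* M = memory fence, signaled only by the client's later cmdbuf B.
     * F = dma_fence created from M via sync_file, which the USB display
     *     driver stores in buffer X's dma_resv.
     * B = cmdbuf the client kicks off after binding X.
     *
     * i915 orders B after dma_resv(X)   =>  B waits on F
     * F waits on M, and only B signals M => B waits on B.  Deadlock. */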

Sure, we put a timeout on the dma_fence and it will eventually fire
and unblock everything.  However, there's one very important point
that's easy to miss here:  Neither i915 nor the client did anything
wrong in the above scenario.  The Vulkan footgun approach works
because there are a set of rules and, if you follow those rules,
you're guaranteed everything works.  In the above scenario, however,
the client followed all of the rules and got a deadlock anyway.  We
can't have that.


> But what we can do is use the same currently existing container
> objects like drm_syncobj or sync_file (timeline syncobj would fit best
> tbh), and stuff a userspace fence behind it. The only trouble is that
> currently timeline syncobj implement vulkan's spec, which means if you
> build a wait-before-signal deadlock, you'll wait forever. Well until
> the user ragequits and kills your process.

Yeah, it may be that this approach can be made to work.  Instead of
reusing dma_fence, maybe we can reuse syncobj and have another form of
syncobj which is a memory fence, a value to wait on, and a timeout.
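
As a very rough strawman of that shape (nothing like this exists in the
uapi today; every name here is invented):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* A "memory fence" flavour of syncobj: a shared 64-bit word, the
     * value to wait for, and a hard timeout after which the kernel may
     * force completion and penalize the offender. */
    struct memory_fence {
        _Atomic uint64_t *addr;       /* page mapped by both CPU and GPU */
        uint64_t          wait_value; /* signaled once *addr >= wait_value */
        uint64_t          timeout_ns; /* bound before forced completion */
    };

    static bool memory_fence_signaled(const struct memory_fence *f)
    {
        return atomic_load_explicit(f->addr, memory_order_acquire) >=
               f->wait_value;
    }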

--Jason


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi,

On Tue, 20 Apr 2021 at 19:54, Daniel Vetter  wrote:

> So I can mostly get behind this, except it's _not_ going to be
> dma_fence. That thing has horrendous internal ordering constraints
> within the kernel, and the one thing it doesn't allow you to do is make
> a dma_fence depend upon a userspace fence.
>
> But what we can do is use the same currently existing container
> objects like drm_syncobj or sync_file (timeline syncobj would fit best
> tbh), and stuff a userspace fence behind it. The only trouble is that
> currently timeline syncobj implement vulkan's spec, which means if you
> build a wait-before-signal deadlock, you'll wait forever. Well until
> the user ragequits and kills your process.
>
> So for winsys we'd need to be able to specify the wait timeout
> somewhere for waiting for that dma_fence to materialize (plus the
> submit thread, but userspace needs that anyway to support timeline
> syncobj) if you're importing an untrusted timeline syncobj. And I
> think that's roughly it.
>

Right. The only way you get to materialise a dma_fence from an execbuf is
that you take a hard timeout, with a penalty for not meeting that timeout.
When I say dma_fence I mean dma_fence, because there is no extant winsys
support for drm_syncobj, so this is greenfield: the winsys gets to specify
its terms of engagement, and again, we've been the orange/green-site
enemies of users for quite some time already, so we're happy to continue
doing so. If the actual underlying primitive is not a dma_fence, and
compositors/protocol/clients need to eat a bunch of typing to deal with a
different primitive which offers the same guarantees, then that's fine, as
long as there is some tangible whole-of-system benefit.

How that timeout is actually realised is an implementation detail. Whether
it's a property of the last GPU job itself that the CPU-side driver can
observe, or that the kernel driver guarantees that there is a GPU job
launched in parallel which monitors the memory-fence status and reports
back through a mailbox/doorbell, or the CPU-side driver enqueues kqueue
work for $n milliseconds' time to check the value in memory and kill the
context if it doesn't meet expectations - whatever. I don't believe any of
those choices meaningfully impact on kernel driver complexity relative to
the initial proposal, but they do allow us to continue to provide the
guarantees we do today when buffers cross security boundaries.
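
As a sketch of that last variant - using Linux delayed work rather than a
literal kqueue, and with fence_watch, check_fence_value() and
kill_context() all invented for illustration:

    #include <linux/types.h>
    #include <linux/workqueue.h>

    struct my_context;                               /* hypothetical */

    struct fence_watch {
        struct delayed_work dwork;
        struct my_context *ctx;
    };

    bool check_fence_value(struct fence_watch *w);   /* hypothetical */
    void kill_context(struct my_context *ctx);       /* hypothetical */

    /* Runs $n ms after submission: if the value in memory never showed
     * up, the context gets killed - the penalty discussed above. */
    static void fence_deadline_work(struct work_struct *work)
    {
        struct fence_watch *w =
            container_of(to_delayed_work(work), struct fence_watch, dwork);

        if (!check_fence_value(w))
            kill_context(w->ctx);
    }

    static void arm_fence_deadline(struct fence_watch *w, unsigned timeout_ms)
    {
        INIT_DELAYED_WORK(&w->dwork, fence_deadline_work);
        schedule_delayed_work(&w->dwork, msecs_to_jiffies(timeout_ms));
    }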

There might well be an argument for significantly weakening those security
boundaries and shifting the complexity from the DRM scheduler into
userspace compositors. So far though, I have yet to see that argument made
coherently.

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Bas Nieuwenhuizen
On Tue, Apr 20, 2021 at 8:16 PM Daniel Stone  wrote:

> On Tue, 20 Apr 2021 at 19:00, Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> Am 20.04.21 um 19:44 schrieb Daniel Stone:
>>
>> But winsys is something _completely_ different. Yes, you're using the GPU
>> to do things with buffers A, B, and C to produce buffer Z. Yes, you're
>> using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition
>> job might depend on a Chromium composition job which depends on GTA's
>> render job which depends on GTA's compute job which might take a year to
>> complete. Mutter's composition job needs to complete in 'reasonable'
>> (again, FSVO) time, no matter what. The two are compatible.
>>
>> How? Don't lump them together. Isolate them aggressively, and
>> _predictably_ in a way that you can reason about.
>>
>> What clients do in their own process space is their own business. Games
>> can deadlock themselves if they get wait-before-signal wrong. Compute jobs
>> can run for a year. Their problem. Winsys is not that, because you're
>> crossing every isolation boundary possible. Process, user, container, VM -
>> every kind of privilege boundary. Thus far, dma_fence has protected us from
>> the most egregious abuses by guaranteeing bounded-time completion; it also
>> acts as a sequencing primitive, but from the perspective of a winsys person
>> that's of secondary importance, which is probably one of the bigger
>> disconnects between winsys people and GPU driver people.
>>
>>
>> Finally somebody who understands me :)
>>
>> Well the question then is how do we get winsys and your own process space
>> together?
>>
>
> It's a jarring transition. If you take a very narrow view and say 'it's
> all GPU work in shared buffers so it should all work the same', then
> client<->winsys looks the same as client<->client gbuffer. But this is a
> trap.
>

I think this is where we have a serious gap in what a winsys
or a compositor is. Like if you have only a single wayland server running
on a physical machine this is easy. But add a VR compositor, an
intermediate compositor (say gamescope), Xwayland and some containers/VM,
some video capture (or, gasp, a browser that doubles as compositor) and
this story gets seriously complicated. Like who are you protecting from
whom? At what point is something client<->winsys vs. client<->client?



> Just because you can mmap() a file on an NFS server in New Zealand doesn't
> mean that you should have the same expectations of memory access to that
> file as you do of a pointer from alloca(). Even if the primitives look
> the same, you are crossing significant boundaries, and those do not come
> without a compromise and a penalty.
>
>
>> Anyway, one of the great things about winsys (there are some! trust me)
>> is we don't need to be as hopelessly general as for game engines, nor as
>> hyperoptimised. We place strict demands on our clients, and we literally
>> kill them every single time they get something wrong in a way that's
>> visible to us. Our demands on the GPU are so embarrassingly simple that you
>> can run every modern desktop environment on GPUs which don't have unified
>> shaders. And on certain platforms which don't share tiling formats between
>> texture/render-target/scanout ... and it all still runs fast enough that
>> people don't complain.
>>
>>
>> Ignoring everything below since that is the display pipeline I'm not
>> really interested in. My concern is how to get the buffer from the client
>> to the server without allowing the client to get the server into trouble?
>>
>> My thinking is still to use timeouts to acquire texture locks. E.g. when
>> the compositor needs to access a texture it grabs a lock and if that lock
>> isn't available in less than 20ms whoever is holding it is killed hard and
>> the lock given to the compositor.
>>
>> It's perfectly fine if a process has a hung queue, but if it tries to
>> send buffers which should be filled by that queue to the compositor it just
>> gets corrupted window content.
>>
>
> Kill the client hard. If the compositor has speculatively queued sampling
> against rendering which never completed, let it access garbage. You'll have
> one frame of garbage (outdated content, all black, random pattern; the
> failure mode is equally imperfect, because there is no perfect answer),
> then the compositor will notice the client has disappeared and remove all
> its resources.
>
> It's not possible to completely prevent this situation if the compositor
> wants to speculatively pipeline work, only ameliorate it. From a
> system-global point of view, just expose the situation and let it bubble
> up. Watch the number of fences which failed to retire in time, and destroy
> the context if there are enough of them (maybe 1, maybe 100). Watch the
> number of contexts on the file description which get forcibly destroyed, and destroy
> the file description if there are enough of them. Watch the number of
> descriptions which 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone  wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand  wrote:
>>
>> It's still early in the morning here and I'm not awake yet so sorry if
>> this comes out in bits and pieces...
>
>
> No problem, it's helpful. If I weren't on this thread I'd be attempting to 
> put together a 73-piece chest of drawers whose instructions are about as 
> clear as this so far, so I'm in the right head space anyway.
>
>>
>> IMO, there are two problems being solved here which are related in
>> very subtle and tricky ways.  They're also, admittedly, driver
>> problems, not really winsys problems.  Unfortunately, they may have
>> winsys implications.
>
>
> Yeah ... bingo.
>
>>
>> First, is better/real timelines for Vulkan and compute.  [...]
>>
>> We also want something like this for compute workloads.  [...]
>
>
> Totally understand and agree with all of this. Memory fences seem like a good 
> and useful primitive here.
>
>>
>> The second biting issue is that, in the current kernel implementation
>> of dma-fence and dma_resv, we've lumped internal synchronization for
>> memory management together with execution synchronization for
>> userspace dependency tracking.  And we have no way to tell the
>> difference between the two internally.  Even if user space is passing
>> around sync_files and trying to do explicit sync, once you get inside
>> the kernel, they're all dma-fences and it can't tell the difference.
>
>
> Funny, because 'lumped [the two] together' is exactly the crux of my issues 
> ...
>
>>
>> If we move
>
>
> Stop here, because ...
>
>>
>> to a more userspace-controlled synchronization model with
>> wait-before-signal and no timeouts unless requested, regardless of the
>> implementation, it plays really badly with dma-fence.  And, by "badly" I 
>> mean the two are nearly incompatible.
>
>
> I would go further than that, and say completely, fundamentally, 
> conceptually, incompatible.
>
>>
>> From a user space PoV, it means
>> it's tricky to provide the finite time dma-fence guarantee.  From a
>> kernel PoV, it's way worse.  Currently, the way dma-fence is
>> constructed, it's impossible to deadlock assuming everyone follows the
>> rules.  The moment we allow user space to deadlock itself and allow
>> those deadlocks to leak into the kernel, we have a problem.  Even if
>> we throw in some timeouts, we still have a scenario where user space
>> has one linearizable dependency graph for execution synchronization
>> and the kernel has a different linearizable dependency graph for
>> memory management and, when you smash them together, you may have
>> cycles in your graph.
>>
>> So how do we sort this all out?  Good question.  It's a hard problem.
>> Probably the hardest problem here is the second one: the intermixing
>> of synchronization types.  Solving that one is likely going to require
>> some user space re-plumbing because all the user space APIs we have
>> for explicit sync are built on dma-fence.
>
>
> Gotcha.
>
> Firstly, let's stop, as you say, lumping things together. Timeline semaphores 
> and compute's GPU-side spinlocks etc, are one thing. I accept those now have 
> a hard requirement on something like memory fences, where any responsibility 
> is totally abrogated. So let's run with that in our strawman: Vulkan compute 
> & graphics & transfer queues all degenerate to something spinning (hopefully 
> GPU-assisted gentle spin) on a uint64 somewhere. The kernel has (in the 
> general case) no visibility or responsibility into these things. Fine - 
> that's one side of the story.
>
> But winsys is something _completely_ different. Yes, you're using the GPU to 
> do things with buffers A, B, and C to produce buffer Z. Yes, you're using 
> vkQueuePresentKHR to schedule that work. Yes, Mutter's composition job might 
> depend on a Chromium composition job which depends on GTA's render job which 
> depends on GTA's compute job which might take a year to complete. Mutter's 
> composition job needs to complete in 'reasonable' (again, FSVO) time, no 
> matter what. The two are compatible.
>
> How? Don't lump them together. Isolate them aggressively, and _predictably_ 
> in a way that you can reason about.
>
> What clients do in their own process space is their own business. Games can 
> deadlock themselves if they get wait-before-signal wrong. Compute jobs can 
> run for a year. Their problem. Winsys is not that, because you're crossing 
> every isolation boundary possible. Process, user, container, VM - every kind 
> of privilege boundary. Thus far, dma_fence has protected us from the most 
> egregious abuses by guaranteeing bounded-time completion; it also acts as a 
> sequencing primitive, but from the perspective of a winsys person that's of 
> secondary importance, which is probably one of the bigger disconnects between 
> winsys people and GPU driver people.
>
> Anyway, one of the great things about winsys (there are some! trust me) is we 
> don't need to be 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 6:25 PM Marek Olšák  wrote:
>
> Daniel, imagine hardware that can only do what Windows does: future fences 
> signalled by userspace whenever userspace wants, and no kernel queues like we 
> have today.
>
> The only reason why current AMD GPUs work is because they have a ring buffer 
> per queue with pointers to userspace command buffers followed by fences. What 
> will we do if that ring buffer is removed?

Well this is an entirely different problem than what you set out to
describe. This is essentially the problem where hw does not have any
support for privileged commands and a separate privileged command
buffer, and direct userspace submit is the only thing that is
available.

I think if this is your problem, then you get to implement some very
interesting compat shim. But that's an entirely different problem from
what you've described in your mail. This pretty much assumes at the hw
level the only thing that works is ATS/pasid, and vram is managed with
HMM exclusively. Once you have that pure driver stack you get to fake
it in the kernel for compat with everything that exists already. How
exactly that will look and how exactly you best construct your
dma_fences for compat will depend highly upon how much is still there
in this hw (e.g. wrt interrupt generation). A lot of the
infrastructure was also done as part of drm_syncobj. I mean we have
entirely fake kernel drivers like vgem/vkms that create dma_fence, so
a hw ringbuffer is really not required.

So ... is this your problem underneath it all, or was that more a wild
strawman for the discussion?
-Daniel


> Marek
>
> On Tue, Apr 20, 2021 at 11:50 AM Daniel Stone  wrote:
>>
>> Hi,
>>
>> On Tue, 20 Apr 2021 at 16:16, Christian König 
>>  wrote:
>>>
>>> Am 20.04.21 um 17:07 schrieb Daniel Stone:
>>>
>>> If the compositor no longer has a guarantee that the buffer will be ready 
>>> for composition in a reasonable amount of time (which dma_fence gives us, 
>>> and this proposal does not appear to give us), then the compositor isn't 
>>> trying to use the buffer for compositing, it's waiting asynchronously on a 
>>> notification that the fence has signaled before it attempts to use the 
>>> buffer.
>>>
>>> Marek's initial suggestion is that the kernel signal the fence, which would 
>>> unblock composition (and presumably show garbage on screen, or at best jump 
>>> back to old content).
>>>
>>> My position is that the compositor will know the process has crashed anyway 
>>> - because its socket has been closed - at which point we destroy all the 
>>> client's resources including its windows and buffers regardless. Signaling 
>>> the fence doesn't give us any value here, _unless_ the compositor is just 
>>> blindly waiting for the fence to signal ... which it can't do because 
>>> there's no guarantee the fence will ever signal.
>>>
>>>
>>> Yeah, but that assumes that the compositor has a chance to not blindly wait 
>>> for the client to finish rendering and as Daniel explained that is rather 
>>> unrealistic.
>>>
>>> What we need is a fallback mechanism which signals the fence after a 
>>> timeout and gives a penalty to the one causing the timeout.
>>>
>>> That gives us the same functionality we have today with the in-software 
>>> scheduler inside the kernel.
>>
>>
>> OK, if that's the case then I think I'm really missing something which isn't 
>> explained in this thread, because I don't understand what the additional 
>> complexity and API change gains us (see my first reply in this thread).
>>
>> By way of example - say I have a blind-but-explicit compositor that takes a 
>> drm_syncobj along with a dmabuf with each client presentation request, but 
>> doesn't check syncobj completion, it just imports that into a VkSemaphore + 
>> VkImage and schedules work for the next frame.
>>
>> Currently, that generates an execbuf ioctl for the composition (ignore KMS 
>> for now) with a sync point to wait on, and the kernel+GPU scheduling 
>> guarantees that the composition work will not begin until the client 
>> rendering work has retired. We have a further guarantee that this work will 
>> complete in reasonable time, for some value of 'reasonable'.
>>
>> My understanding of this current proposal is that:
>> * userspace creates a 'present fence' with this new ioctl
>> * the fence becomes signaled when a value is written to a location in 
>> memory, which is visible through both CPU and GPU mappings of that page
>> * this 'present fence' is imported as a VkSemaphore (?) and the userspace 
>> Vulkan driver will somehow wait on this value  either before submitting work 
>> or as a possibly-hardware-assisted GPU-side wait (?)
>> * the kernel's scheduler is thus eliminated from the equation, and every 
>> execbuf is submitted directly to hardware, because either userspace knows 
>> that the fence has already been signaled, or it will issue a GPU-side wait 
>> (?)
>> * but the kernel is still required to monitor completion of every fence 
>> 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
On Tue, 20 Apr 2021 at 19:00, Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 20.04.21 um 19:44 schrieb Daniel Stone:
>
> But winsys is something _completely_ different. Yes, you're using the GPU
> to do things with buffers A, B, and C to produce buffer Z. Yes, you're
> using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition
> job might depend on a Chromium composition job which depends on GTA's
> render job which depends on GTA's compute job which might take a year to
> complete. Mutter's composition job needs to complete in 'reasonable'
> (again, FSVO) time, no matter what. The two are compatible.
>
> How? Don't lump them together. Isolate them aggressively, and
> _predictably_ in a way that you can reason about.
>
> What clients do in their own process space is their own business. Games
> can deadlock themselves if they get wait-before-signal wrong. Compute jobs
> can run for a year. Their problem. Winsys is not that, because you're
> crossing every isolation boundary possible. Process, user, container, VM -
> every kind of privilege boundary. Thus far, dma_fence has protected us from
> the most egregious abuses by guaranteeing bounded-time completion; it also
> acts as a sequencing primitive, but from the perspective of a winsys person
> that's of secondary importance, which is probably one of the bigger
> disconnects between winsys people and GPU driver people.
>
>
> Finally somebody who understands me :)
>
> Well the question then is how do we get winsys and your own process space
> together?
>

It's a jarring transition. If you take a very narrow view and say 'it's all
GPU work in shared buffers so it should all work the same', then
client<->winsys looks the same as client<->client gbuffer. But this is a
trap.

Just because you can mmap() a file on an NFS server in New Zealand doesn't
mean that you should have the same expectations of memory access to that
file as you do of a pointer from alloca(). Even if the primitives look
the same, you are crossing significant boundaries, and those do not come
without a compromise and a penalty.


> Anyway, one of the great things about winsys (there are some! trust me) is
> we don't need to be as hopelessly general as for game engines, nor as
> hyperoptimised. We place strict demands on our clients, and we literally
> kill them every single time they get something wrong in a way that's
> visible to us. Our demands on the GPU are so embarrassingly simple that you
> can run every modern desktop environment on GPUs which don't have unified
> shaders. And on certain platforms which don't share tiling formats between
> texture/render-target/scanout ... and it all still runs fast enough that
> people don't complain.
>
>
> Ignoring everything below since that is the display pipeline I'm not
> really interested in. My concern is how to get the buffer from the client
> to the server without allowing the client to get the server into trouble?
>
> My thinking is still to use timeouts to acquire texture locks. E.g. when
> the compositor needs to access a texture it grabs a lock and if that lock
> isn't available in less than 20ms whoever is holding it is killed hard and
> the lock given to the compositor.
>
> It's perfectly fine if a process has a hung queue, but if it tries to send
> buffers which should be filled by that queue to the compositor it just gets
> corrupted window content.
>

Kill the client hard. If the compositor has speculatively queued sampling
against rendering which never completed, let it access garbage. You'll have
one frame of garbage (outdated content, all black, random pattern; the
failure mode is equally imperfect, because there is no perfect answer),
then the compositor will notice the client has disappeared and remove all
its resources.

It's not possible to completely prevent this situation if the compositor
wants to speculatively pipeline work, only ameliorate it. From a
system-global point of view, just expose the situation and let it bubble
up. Watch the number of fences which failed to retire in time, and destroy
the context if there are enough of them (maybe 1, maybe 100). Watch the
number of contexts on the file description which get forcibly destroyed, and destroy
the file description if there are enough of them. Watch the number of
descriptions which get forcibly destroyed, and destroy the process if there
are enough of them. Watch the number of processes in a cgroup/pidns which
get forcibly destroyed, and destroy the ... etc. Whether it's the DRM
driver or an external monitor such as systemd/Flatpak/podman/Docker doing
that is pretty immaterial, as long as the concept of failure bubbling up
remains.
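
Sketched as code, with every threshold and name made up purely for
illustration:

    /* Failure bubbles up one level at a time; whether the kernel or an
     * external monitor implements each level is immaterial. */
    enum { CTX_LIMIT = 3, FD_LIMIT = 3, PROC_LIMIT = 3 }; /* 1? 100? */

    struct gpu_ctx      { int fence_timeouts; };
    struct drm_filedesc { int dead_contexts; };
    struct proc_entry   { int dead_fds; };

    void destroy_context(struct gpu_ctx *ctx);       /* invented */
    void destroy_filedesc(struct drm_filedesc *fd);  /* invented */
    void punish_process(struct proc_entry *p);       /* invented */

    void on_fence_timeout(struct gpu_ctx *ctx)
    {
        if (++ctx->fence_timeouts >= CTX_LIMIT)
            destroy_context(ctx);       /* bubbles up to the fd ... */
    }

    void on_context_destroyed(struct drm_filedesc *fd)
    {
        if (++fd->dead_contexts >= FD_LIMIT)
            destroy_filedesc(fd);       /* ... then to the process ... */
    }

    void on_filedesc_destroyed(struct proc_entry *p)
    {
        if (++p->dead_fds >= PROC_LIMIT)
            punish_process(p);          /* ... then cgroup/pidns, etc. */
    }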

(20ms is objectively the wrong answer FWIW, because we're not a hard RTOS.
But if our biggest point of disagreement is 20 vs. 200 vs. 2000 vs. 20000
ms, then this thread has been a huge success!)

Cheers,
Daniel

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
On Tue, 20 Apr 2021 at 17:25, Marek Olšák  wrote:

> Daniel, imagine hardware that can only do what Windows does: future fences
> signalled by userspace whenever userspace wants, and no kernel queues like
> we have today.
>
> The only reason why current AMD GPUs work is because they have a ring
> buffer per queue with pointers to userspace command buffers followed by
> fences. What will we do if that ring buffer is removed?
>

I can totally imagine that; memory fences are clearly a reality and we need
to make them work for functionality as well as performance. Let's imagine
that winsys joins that flying-car future of totally arbitrary sync, that we
work only on memory fences and nothing else, and that this all happens by
the time we're all vaccinated and can go cram into a room with 8000
other people at FOSDEM instead of trying to do this over email.

But the first couple of sentences of your proposal have the kernel
monitoring those synchronisation points to ensure that they complete in
bounded time. That already _completely_ destroys the purity of the simple
picture you paint. Either there are no guarantees and userspace has to
figure it out, or there are guarantees and we have to compromise that
purity.

I understand how you arrived at your proposal from your perspective as an
extremely skilled driver developer who has delivered gigantic performance
improvements to real-world clients. As a winsys person with a very
different perspective, I disagree with you on where you are drawing the
boundaries, to the point that I think your initial proposal is worse than
useless; doing glFinish() or the VkFence equivalent in clients would be
better in most cases than the first mail.

I don't want to do glFinish (which I'm right about), and you don't want to
do dma_fence (which you're right about). So let's work together to find a
middle ground which we're both happy with. That middle ground does exist,
and we as winsys people are happy to eat a significant amount of pain to
arrive at that middle ground. Your current proposal is at once too gentle
on the winsys, and far too harsh on it. I only want to move where and how
those lines are drawn, not to pretend that all the world is still a
single-context FIFO execution engine.

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König



Am 20.04.21 um 19:44 schrieb Daniel Stone:

> Hi,
>
> On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand wrote:
>
>> It's still early in the morning here and I'm not awake yet so sorry if
>> this comes out in bits and pieces...
>
> No problem, it's helpful. If I weren't on this thread I'd be
> attempting to put together a 73-piece chest of drawers whose
> instructions are about as clear as this so far, so I'm in the right
> head space anyway.
>
>> IMO, there are two problems being solved here which are related in
>> very subtle and tricky ways.  They're also, admittedly, driver
>> problems, not really winsys problems.  Unfortunately, they may have
>> winsys implications.
>
> Yeah ... bingo.
>
>> First, is better/real timelines for Vulkan and compute. [...]
>>
>> We also want something like this for compute workloads. [...]
>
> Totally understand and agree with all of this. Memory fences seem like
> a good and useful primitive here.

Completely agree.

>> The second biting issue is that, in the current kernel implementation
>> of dma-fence and dma_resv, we've lumped internal synchronization for
>> memory management together with execution synchronization for
>> userspace dependency tracking.  And we have no way to tell the
>> difference between the two internally.  Even if user space is passing
>> around sync_files and trying to do explicit sync, once you get inside
>> the kernel, they're all dma-fences and it can't tell the difference.
>
> Funny, because 'lumped [the two] together' is exactly the crux of my
> issues ...
>
>> If we move
>
> Stop here, because ...
>
>> to a more userspace-controlled synchronization model with
>> wait-before-signal and no timeouts unless requested, regardless of the
>> implementation, it plays really badly with dma-fence.  And, by "badly" I
>> mean the two are nearly incompatible.
>
> I would go further than that, and say completely, fundamentally,
> conceptually, incompatible.

+1

>> From a user space PoV, it means
>> it's tricky to provide the finite time dma-fence guarantee. From a
>> kernel PoV, it's way worse.  Currently, the way dma-fence is
>> constructed, it's impossible to deadlock assuming everyone follows the
>> rules.  The moment we allow user space to deadlock itself and allow
>> those deadlocks to leak into the kernel, we have a problem. Even if
>> we throw in some timeouts, we still have a scenario where user space
>> has one linearizable dependency graph for execution synchronization
>> and the kernel has a different linearizable dependency graph for
>> memory management and, when you smash them together, you may have
>> cycles in your graph.
>>
>> So how do we sort this all out?  Good question.  It's a hard problem.
>> Probably the hardest problem here is the second one: the intermixing
>> of synchronization types.  Solving that one is likely going to require
>> some user space re-plumbing because all the user space APIs we have
>> for explicit sync are built on dma-fence.
>
> Gotcha.
>
> Firstly, let's stop, as you say, lumping things together. Timeline
> semaphores and compute's GPU-side spinlocks etc, are one thing. I
> accept those now have a hard requirement on something like memory
> fences, where any responsibility is totally abrogated. So let's run
> with that in our strawman: Vulkan compute & graphics & transfer queues
> all degenerate to something spinning (hopefully GPU-assisted gentle
> spin) on a uint64 somewhere. The kernel has (in the general case) no
> visibility or responsibility into these things. Fine - that's one side
> of the story.

Exactly, yes.

> But winsys is something _completely_ different. Yes, you're using the
> GPU to do things with buffers A, B, and C to produce buffer Z. Yes,
> you're using vkQueuePresentKHR to schedule that work. Yes, Mutter's
> composition job might depend on a Chromium composition job which
> depends on GTA's render job which depends on GTA's compute job which
> might take a year to complete. Mutter's composition job needs to
> complete in 'reasonable' (again, FSVO) time, no matter what. The two
> are compatible.
>
> How? Don't lump them together. Isolate them aggressively, and
> _predictably_ in a way that you can reason about.
>
> What clients do in their own process space is their own
> business. Games can deadlock themselves if they get wait-before-signal
> wrong. Compute jobs can run for a year. Their problem. Winsys is not
> that, because you're crossing every isolation boundary possible.
> Process, user, container, VM - every kind of privilege boundary. Thus
> far, dma_fence has protected us from the most egregious abuses by
> guaranteeing bounded-time completion; it also acts as a sequencing
> primitive, but from the perspective of a winsys person that's of
> secondary importance, which is probably one of the bigger disconnects
> between winsys people and GPU driver people.

Finally somebody who understands me :)

Well the question then is how do we get winsys and your own process
space 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi,

On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand  wrote:

> It's still early in the morning here and I'm not awake yet so sorry if
> this comes out in bits and pieces...
>

No problem, it's helpful. If I weren't on this thread I'd be attempting to
put together a 73-piece chest of drawers whose instructions are about as
clear as this so far, so I'm in the right head space anyway.


> IMO, there are two problems being solved here which are related in
> very subtle and tricky ways.  They're also, admittedly, driver
> problems, not really winsys problems.  Unfortunately, they may have
> winsys implications.
>

Yeah ... bingo.


> First, is better/real timelines for Vulkan and compute.  [...]
>
> We also want something like this for compute workloads.  [...]
>

Totally understand and agree with all of this. Memory fences seem like a
good and useful primitive here.


> The second biting issue is that, in the current kernel implementation
> of dma-fence and dma_resv, we've lumped internal synchronization for
> memory management together with execution synchronization for
> userspace dependency tracking.  And we have no way to tell the
> difference between the two internally.  Even if user space is passing
> around sync_files and trying to do explicit sync, once you get inside
> the kernel, they're all dma-fences and it can't tell the difference.
>

Funny, because 'lumped [the two] together' is exactly the crux of my issues
...


> If we move


Stop here, because ...


> to a more userspace-controlled synchronization model with
> wait-before-signal and no timeouts unless requested, regardless of the
> implementation, it plays really badly with dma-fence.  And, by "badly" I
> mean the two are nearly incompatible.


I would go further than that, and say completely, fundamentally,
conceptually, incompatible.


> From a user space PoV, it means
> it's tricky to provide the finite time dma-fence guarantee.  From a
> kernel PoV, it's way worse.  Currently, the way dma-fence is
> constructed, it's impossible to deadlock assuming everyone follows the
> rules.  The moment we allow user space to deadlock itself and allow
> those deadlocks to leak into the kernel, we have a problem.  Even if
> we throw in some timeouts, we still have a scenario where user space
> has one linearizable dependency graph for execution synchronization
> and the kernel has a different linearizable dependency graph for
> memory management and, when you smash them together, you may have
> cycles in your graph.
>
> So how do we sort this all out?  Good question.  It's a hard problem.
> Probably the hardest problem here is the second one: the intermixing
> of synchronization types.  Solving that one is likely going to require
> some user space re-plumbing because all the user space APIs we have
> for explicit sync are built on dma-fence.
>

Gotcha.

Firstly, let's stop, as you say, lumping things together. Timeline
semaphores and compute's GPU-side spinlocks etc, are one thing. I accept
those now have a hard requirement on something like memory fences, where
any responsibility is totally abrogated. So let's run with that in our
strawman: Vulkan compute & graphics & transfer queues all degenerate to
something spinning (hopefully GPU-assisted gentle spin) on a uint64
somewhere. The kernel has (in the general case) no visibility or
responsibility into these things. Fine - that's one side of the story.
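
(A CPU-side sketch of that gentle spin; a GPU would use a wait
instruction or a polite polling loop in the queue instead, and
_mm_pause() is x86-only, used here just for illustration:)

    #include <immintrin.h>
    #include <stdatomic.h>
    #include <stdint.h>

    /* Spin until the fence word reaches the wait value, backing off
     * between loads instead of hammering the cache line. */
    static void gentle_spin(const _Atomic uint64_t *fence, uint64_t value)
    {
        while (atomic_load_explicit(fence, memory_order_acquire) < value)
            _mm_pause();
    }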

But winsys is something _completely_ different. Yes, you're using the GPU
to do things with buffers A, B, and C to produce buffer Z. Yes, you're
using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition
job might depend on a Chromium composition job which depends on GTA's
render job which depends on GTA's compute job which might take a year to
complete. Mutter's composition job needs to complete in 'reasonable'
(again, FSVO) time, no matter what. The two are compatible.

How? Don't lump them together. Isolate them aggressively, and _predictably_
in a way that you can reason about.

What clients do in their own process space is their own business. Games can
deadlock themselves if they get wait-before-signal wrong. Compute jobs can
run for a year. Their problem. Winsys is not that, because you're crossing
every isolation boundary possible. Process, user, container, VM - every
kind of privilege boundary. Thus far, dma_fence has protected us from the
most egregious abuses by guaranteeing bounded-time completion; it also acts
as a sequencing primitive, but from the perspective of a winsys person
that's of secondary importance, which is probably one of the bigger
disconnects between winsys people and GPU driver people.

Anyway, one of the great things about winsys (there are some! trust me) is
we don't need to be as hopelessly general as for game engines, nor as
hyperoptimised. We place strict demands on our clients, and we literally
kill them every single time they get something wrong in a way that's
visible to us. Our demands on the GPU are so 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Jacob Lifshay
On Tue, Apr 20, 2021, 09:25 Marek Olšák  wrote:

> Daniel, imagine hardware that can only do what Windows does: future fences
> signalled by userspace whenever userspace wants, and no kernel queues like
> we have today.
>

Hmm, that sounds kinda like what we're trying to do for Libre-SOC's gpu
which is basically where the cpu (exactly the same cores as the gpu) runs a
user-space software renderer with extra instructions to make it go fast, so
the kernel only gets involved for futex-wait or for video scan-out. This
causes problems when figuring out how to interact with dma-fences for
interoperability...
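
For concreteness, that futex-wait fallback could look roughly like this
(Linux futexes operate on 32-bit words, and the signaling side would do a
FUTEX_WAKE after storing the new value):

    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Enter the kernel only while the fence word still holds the stale
     * value we just observed; otherwise keep re-checking in userspace. */
    static void fence_wait(uint32_t *fence, uint32_t expected)
    {
        uint32_t cur;
        while ((cur = __atomic_load_n(fence, __ATOMIC_ACQUIRE)) < expected)
            syscall(SYS_futex, fence, FUTEX_WAIT, cur, NULL, NULL, 0);
    }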

Jacob Lifshay


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Marek Olšák
Daniel, imagine hardware that can only do what Windows does: future fences
signalled by userspace whenever userspace wants, and no kernel queues like
we have today.

The only reason why current AMD GPUs work is because they have a ring
buffer per queue with pointers to userspace command buffers followed by
fences. What will we do if that ring buffer is removed?

Marek

On Tue, Apr 20, 2021 at 11:50 AM Daniel Stone  wrote:

> Hi,
>
> On Tue, 20 Apr 2021 at 16:16, Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> Am 20.04.21 um 17:07 schrieb Daniel Stone:
>>
>> If the compositor no longer has a guarantee that the buffer will be ready
>> for composition in a reasonable amount of time (which dma_fence gives us,
>> and this proposal does not appear to give us), then the compositor isn't
>> trying to use the buffer for compositing, it's waiting asynchronously on a
>> notification that the fence has signaled before it attempts to use the
>> buffer.
>>
>> Marek's initial suggestion is that the kernel signal the fence, which
>> would unblock composition (and presumably show garbage on screen, or at
>> best jump back to old content).
>>
>> My position is that the compositor will know the process has crashed
>> anyway - because its socket has been closed - at which point we destroy all
>> the client's resources including its windows and buffers regardless.
>> Signaling the fence doesn't give us any value here, _unless_ the compositor
>> is just blindly waiting for the fence to signal ... which it can't do
>> because there's no guarantee the fence will ever signal.
>>
>>
>> Yeah, but that assumes that the compositor has a chance to not blindly wait
>> for the client to finish rendering and as Daniel explained that is rather
>> unrealistic.
>>
>> What we need is a fallback mechanism which signals the fence after a
>> timeout and gives a penalty to the one causing the timeout.
>>
>> That gives us the same functionality we have today with the in-software
>> scheduler inside the kernel.
>>
>
> OK, if that's the case then I think I'm really missing something which
> isn't explained in this thread, because I don't understand what the
> additional complexity and API change gains us (see my first reply in this
> thread).
>
> By way of example - say I have a blind-but-explicit compositor that takes
> a drm_syncobj along with a dmabuf with each client presentation request,
> but doesn't check syncobj completion, it just imports that into a
> VkSemaphore + VkImage and schedules work for the next frame.
>
> Currently, that generates an execbuf ioctl for the composition (ignore KMS
> for now) with a sync point to wait on, and the kernel+GPU scheduling
> guarantees that the composition work will not begin until the client
> rendering work has retired. We have a further guarantee that this work will
> complete in reasonable time, for some value of 'reasonable'.
>
> My understanding of this current proposal is that:
> * userspace creates a 'present fence' with this new ioctl
> * the fence becomes signaled when a value is written to a location in
> memory, which is visible through both CPU and GPU mappings of that page
> * this 'present fence' is imported as a VkSemaphore (?) and the userspace
> Vulkan driver will somehow wait on this value  either before submitting
> work or as a possibly-hardware-assisted GPU-side wait (?)
> * the kernel's scheduler is thus eliminated from the equation, and every
> execbuf is submitted directly to hardware, because either userspace knows
> that the fence has already been signaled, or it will issue a GPU-side wait
> (?)
> * but the kernel is still required to monitor completion of every fence
> itself, so it can forcibly complete, or penalise the client (?)
>
> Lastly, let's say we stop ignoring KMS: what happens for the
> render-with-GPU-display-on-KMS case? Do we need to do the equivalent of
> glFinish() in userspace and only submit the KMS atomic request when the GPU
> work has fully retired?
>
> Clarifying those points would be really helpful so this is less of a
> strawman. I have some further opinions, but I'm going to wait until I
> understand what I'm actually arguing against before I go too far. :) The
> last point is very salient though.
>
> Cheers,
> Daniel
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Jason Ekstrand
On Tue, Apr 20, 2021 at 9:10 AM Daniel Vetter  wrote:
>
> On Tue, Apr 20, 2021 at 1:59 PM Christian König
>  wrote:
> >
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > > the kernel's problem.
> >
> > Well, the path of inner peace begins with four words. “Not my fucking
> > problem.”
> >
> > But I'm not that much concerned about the kernel, but rather about
> > important userspace processes like X, Wayland, SurfaceFlinger etc...
> >
> > I mean attaching a page to a sync object and allowing waiting/signaling
> > from both the CPU as well as the GPU side is not so much of a problem.
> >
> > > You have to somehow handle that, e.g. perhaps with conditional
> > > rendering and just using the old frame in compositing if the new one
> > > doesn't show up in time.
> >
> > Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level.
>
> For opengl we provide all the same guarantees, so if you get one of these
> you just block until the fence is signalled. Doing that properly means
> a submit thread to support drm_syncobj, like for vulkan.
>
> For vulkan we probably want to represent these as proper vk timeline
> objects, and the vulkan way is to just let the application (well
> compositor) here deal with it. If they import timelines from untrusted
> other parties, they need to handle the potential fallback of being
> lied to. The "how" is "not vulkan's fucking problem", because that entire
> "with great power (well performance) comes great responsibility" is
> the entire vk design paradigm.

The security aspects are currently an unsolved problem in Vulkan.  The
assumption is that everyone trusts everyone else to be careful with
the scissors.  It's a great model!

I think we can do something in Vulkan to allow apps to protect
themselves a bit but it's tricky and non-obvious.

--Jason


> Glamour will just rely on GL providing a nice package of the harsh
> reality of gpus, like usual.
>
> So I guess step 1 here for GL would be to provide some kind of
> import/export of timeline syncobj, including properly handling this
> "future/indefinite fences" aspect of them with submit thread and
> everything.
>
> -Daniel
>
> >
> > Regards,
> > Christian.
> >
> > Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> > > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> > >> Daniel, are you suggesting that we should skip any deadlock prevention in
> > >> the kernel, and just let userspace wait for and signal any fence it has
> > >> access to?
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > > the kernel's problem. The only criterion is that the kernel itself must
> > > never rely on these userspace fences, except for stuff like implementing
> > > optimized cpu waits. And in those we must always guarantee that the
> > > userspace process remains interruptible.
> > >
> > > It's a completely different world from dma_fence based kernel fences,
> > > whether those are implicit or explicit.
> > >
> > >> Do you have any concern with the deprecation/removal of BO fences in the
> > >> kernel assuming userspace is only using explicit fences? Any concern with
> > >> the submit and return fences for modesetting and other 
> > >> producer<->consumer
> > >> scenarios?
> > > Let me work on the full reply for your rfc first, because there's a lot
> > > of details here and nuance.
> > > -Daniel
> > >
> > >> Thanks,
> > >> Marek
> > >>
> > >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:
> > >>
> > >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> > >>>  wrote:
> >  Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > > Not going to comment on everything on the first pass...
> > >
> > > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> > >> Hi,
> > >>
> > >> This is our initial proposal for explicit fences everywhere and new
> > >>> memory management that doesn't use BO fences. It's a redesign of how 
> > >>> Linux
> > >>> graphics drivers work, and it can coexist with what we have now.
> > >>
> > >> 1. Introduction
> > >> (skip this if you are already sold on explicit fences)
> > >>
> > >> The current Linux graphics architecture was initially designed for
> > >>> GPUs with only one graphics queue where everything was executed in the
> > >>> submission order and per-BO fences were used for memory management and
> > >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> > >>> queues were added on top, which required the introduction of implicit
> > >>> GPU-GPU synchronization between queues of different processes using 
> > >>> per-BO
> > >>> fences. Recently, even parallel execution within one queue was enabled
> > >>> where a command buffer starts draws and compute shaders, but doesn't 
> > >>> wait
> > >>> for them, enabling parallelism between back-to-back command buffers.
> > >>> Modesetting also uses per-BO fences for scheduling flips. Our GPU 
> > >>> scheduler
> > >>> was created to enable all 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Jason Ekstrand
Sorry for the mega-reply but timezones...

On Tue, Apr 20, 2021 at 6:59 AM Christian König
 wrote:
>
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem.
>
> Well, the path of inner peace begins with four words. “Not my fucking
> problem.”


> But I'm not that much concerned about the kernel, but rather about
> important userspace processes like X, Wayland, SurfaceFlinger etc...
>
> I mean attaching a page to a sync object and allowing to wait/signal
> from both CPU as well as GPU side is not so much of a problem.

Yup... Sorting out these issues is what makes this a hard problem.


> > You have to somehow handle that, e.g. perhaps with conditional
> > rendering and just using the old frame in compositing if the new one
> > doesn't show up in time.
>
> Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level.

"Just handle it with conditional rendering" is a pretty trite answer.
If we have memory fences, we could expose a Vulkan extension to allow
them to be read by conditional rendering or by a shader.  However, as
Daniel has pointed out multiple times, composition pipelines are long
and complex and cheap tricks like that aren't something we can rely on
for solving the problem.  If we're going to solve the problem, we need
to make driver-internal stuff nice while still providing something
that looks very much like a sync_file with finite time semantics to
the composition pipeline.  How?  That's the question.
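
(As an illustrative aside: with VK_EXT_conditional_rendering the predicate is a 32-bit value read from a VkBuffer, so a compute shader could first translate the memory fence value into a 0/1 predicate. A minimal sketch in C; cmd and predicate_buf are assumed to exist, and in real code the extension entry points would be fetched via vkGetDeviceProcAddr:)

#include <vulkan/vulkan.h>

/* Sketch only: record a composition pass that is skipped unless an
 * earlier dispatch wrote a nonzero predicate (derived from the memory
 * fence value) into predicate_buf. */
static void record_guarded_composite(VkCommandBuffer cmd, VkBuffer predicate_buf)
{
    VkConditionalRenderingBeginInfoEXT cond = {
        .sType = VK_STRUCTURE_TYPE_CONDITIONAL_RENDERING_BEGIN_INFO_EXT,
        .buffer = predicate_buf,
        .offset = 0,
        .flags = 0,
    };
    vkCmdBeginConditionalRenderingEXT(cmd, &cond);
    /* ... draw using the client's new buffer ... */
    vkCmdEndConditionalRenderingEXT(cmd);
    /* A second pass with VK_CONDITIONAL_RENDERING_INVERTED_BIT_EXT could
     * re-present the old frame when the fence never showed up. */
}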


> Regards,
> Christian.
>
> Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> >> Daniel, are you suggesting that we should skip any deadlock prevention in
> >> the kernel, and just let userspace wait for and signal any fence it has
> >> access to?
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem. The only criteria is that the kernel itself must
> > never rely on these userspace fences, except for stuff like implementing
> > optimized cpu waits. And in those we must always guarantee that the
> > userspace process remains interruptible.
> >
> > It's a completely different world from dma_fence based kernel fences,
> > whether those are implicit or explicit.
> >
> >> Do you have any concern with the deprecation/removal of BO fences in the
> >> kernel assuming userspace is only using explicit fences? Any concern with
> >> the submit and return fences for modesetting and other producer<->consumer
> >> scenarios?
> > Let me work on the full reply for your rfc first, because there's a lot
> > of details here and nuance.
> > -Daniel
> >
> >> Thanks,
> >> Marek
> >>
> >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:
> >>
> >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> >>>  wrote:
>  Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > Not going to comment on everything on the first pass...
> >
> > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> >> Hi,
> >>
> >> This is our initial proposal for explicit fences everywhere and new
> >>> memory management that doesn't use BO fences. It's a redesign of how Linux
> >>> graphics drivers work, and it can coexist with what we have now.
> >>
> >> 1. Introduction
> >> (skip this if you are already sold on explicit fences)
> >>
> >> The current Linux graphics architecture was initially designed for
> >>> GPUs with only one graphics queue where everything was executed in the
> >>> submission order and per-BO fences were used for memory management and
> >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> >>> queues were added on top, which required the introduction of implicit
> >>> GPU-GPU synchronization between queues of different processes using per-BO
> >>> fences. Recently, even parallel execution within one queue was enabled
> >>> where a command buffer starts draws and compute shaders, but doesn't wait
> >>> for them, enabling parallelism between back-to-back command buffers.
> >>> Modesetting also uses per-BO fences for scheduling flips. Our GPU 
> >>> scheduler
> >>> was created to enable all those use cases, and it's the only reason why 
> >>> the
> >>> scheduler exists.
> >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> >>> management, and the tracking of per-BO fences increase CPU overhead and
> >>> latency, and reduce parallelism. There is a desire to replace all of them
> >>> with something much simpler. Below is how we could do it.
> >>
> >> 2. Explicit synchronization for window systems and modesetting
> >>
> >> The producer is an application and the consumer is a compositor or a
> >>> modesetting driver.
> >> 2.1. The Present request
> >>
> >> As part of the Present request, the producer will pass 2 fences (sync
> >>> objects) to the consumer alongside the presented DMABUF BO:
> >> - The submit fence: Initially unsignalled, 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi,

On Tue, 20 Apr 2021 at 16:16, Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 20.04.21 um 17:07 schrieb Daniel Stone:
>
> If the compositor no longer has a guarantee that the buffer will be ready
> for composition in a reasonable amount of time (which dma_fence gives us,
> and this proposal does not appear to give us), then the compositor isn't
> trying to use the buffer for compositing, it's waiting asynchronously on a
> notification that the fence has signaled before it attempts to use the
> buffer.
>
> Marek's initial suggestion is that the kernel signal the fence, which
> would unblock composition (and presumably show garbage on screen, or at
> best jump back to old content).
>
> My position is that the compositor will know the process has crashed
> anyway - because its socket has been closed - at which point we destroy all
> the client's resources including its windows and buffers regardless.
> Signaling the fence doesn't give us any value here, _unless_ the compositor
> is just blindly waiting for the fence to signal ... which it can't do
> because there's no guarantee the fence will ever signal.
>
>
> Yeah, but that assumes that the compositor has changed to not blindly wait
> for the client to finish rendering, and as Daniel explained that is rather
> unrealistic.
>
> What we need is a fallback mechanism which signals the fence after a
> timeout and gives a penalty to the one causing the timeout.
>
> That gives us the same functionality we have today with the in software
> scheduler inside the kernel.
>

OK, if that's the case then I think I'm really missing something which
isn't explained in this thread, because I don't understand what the
additional complexity and API change gains us (see my first reply in this
thread).

By way of example - say I have a blind-but-explicit compositor that takes a
drm_syncobj along with a dmabuf with each client presentation request, but
doesn't check syncobj completion, it just imports that into a VkSemaphore +
VkImage and schedules work for the next frame.

Currently, that generates an execbuf ioctl for the composition (ignore KMS
for now) with a sync point to wait on, and the kernel+GPU scheduling
guarantees that the composition work will not begin until the client
rendering work has retired. We have a further guarantee that this work will
complete in reasonable time, for some value of 'reasonable'.

My understanding of this current proposal is that:
* userspace creates a 'present fence' with this new ioctl
* the fence becomes signaled when a value is written to a location in
memory, which is visible through both CPU and GPU mappings of that page
* this 'present fence' is imported as a VkSemaphore (?) and the userspace
Vulkan driver will somehow wait on this value either before submitting
work or as a possibly-hardware-assisted GPU-side wait (?)
* the kernel's scheduler is thus eliminated from the equation, and every
execbuf is submitted directly to hardware, because either userspace knows
that the fence has already been signaled, or it will issue a GPU-side wait
(?)
* but the kernel is still required to monitor completion of every fence
itself, so it can forcibly complete, or penalise the client (?)

Lastly, let's say we stop ignoring KMS: what happens for the
render-with-GPU-display-on-KMS case? Do we need to do the equivalent of
glFinish() in userspace and only submit the KMS atomic request when the GPU
work has fully retired?

Clarifying those points would be really helpful so this is less of a
strawman. I have some further opinions, but I'm going to wait until I
understand what I'm actually arguing against before I go too far. :) The
last point is very salient though.

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Jason Ekstrand
It's still early in the morning here and I'm not awake yet so sorry if
this comes out in bits and pieces...

On Tue, Apr 20, 2021 at 7:43 AM Daniel Stone  wrote:
>
> Hi Marek,
>
> On Mon, 19 Apr 2021 at 11:48, Marek Olšák  wrote:
>>
>> 2. Explicit synchronization for window systems and modesetting
>>
>> The producer is an application and the consumer is a compositor or a 
>> modesetting driver.
>>
>> 2.1. The Present request
>
>
> So the 'present request' is an ioctl, right? Not a userspace construct like 
> it is today? If so, how do we correlate the two?
>
> The terminology is pretty X11-centric so I'll assume that's what you've 
> designed against, but Wayland and even X11 carry much more auxiliary 
> information attached to a present request than just 'this buffer, this 
> swapchain'. Wayland latches a lot of data on presentation, including 
> non-graphics data such as surface geometry (so we can have resizes which 
> don't suck), window state (e.g. fullscreen or not, also so we can have 
> resizes which don't suck), and these requests can also cascade through a tree 
> of subsurfaces (so we can have embeds which don't suck). X11 mostly just 
> carries timestamps, which is more tractable.
>
> Given we don't want to move the entirety of Wayland into kernel-visible 
> objects, how do we synchronise the two streams so they aren't incoherent? 
> Taking a rough stab at it whilst assuming we do have 
> DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in 
> kernel space, which the producer would create and ?? export a FD from, that 
> the compositor would ?? import.
>
>> As part of the Present request, the producer will pass 2 fences (sync 
>> objects) to the consumer alongside the presented DMABUF BO:
>> - The submit fence: Initially unsignalled, it will be signalled when the 
>> producer has finished drawing into the presented buffer.
>
>
> We have already have this in Wayland through dma_fence. I'm relaxed about 
> this becoming drm_syncobj or drm_newmappedsyncobjthing, it's just a matter of 
> typing. X11 has patches to DRI3 to support dma_fence, but they never got 
> merged because it was far too invasive to a server which is no longer 
> maintained.
>
>>
>> - The return fence: Initially unsignalled, it will be signalled when the 
>> consumer has finished using the presented buffer.
>
>
> Currently in Wayland the return fence (again a dma_fence) is generated by the 
> compositor and sent as an event when it's done, because we can't have 
> speculative/empty/future fences. drm_syncobj would make this possible, but so 
> far I've been hesitant because I don't see the benefit to it (more below).
>
>>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This 
>> information is part of the Present request and supplied by userspace.
>
>
> Same as today with dma_fence. Less true with drm_syncobj if we're using 
> timelines.
>
>>
>> - If the producer crashes, the kernel signals the submit fence, so that the 
>> consumer can make forward progress.
>
>
> This is only a change if the producer is now allowed to submit a fence before 
> it's flushed the work which would eventually fulfill that fence. Using 
> dma_fence has so far isolated us from this.
>
>>
>> - If the consumer crashes, the kernel signals the return fence, so that the 
>> producer can reclaim the buffer.
>
>
> 'The consumer' is problematic, per below. I think the wording you want is 'if 
> no references are held to the submitted present object'.
>
>>
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU 
>> hangs.
>>
>> Other window system requests can follow the same idea.
>
>
> Which other window system requests did you have in mind? Again, moving the 
> entirety of Wayland's signaling into the kernel is a total non-starter. 
> Partly because it means our entire protocol would be subject to the kernel's 
> ABI rules, partly because the rules and interdependencies between the 
> requests are extremely complex, but mostly because the kernel is just a 
> useless proxy: it would be forced to do significant work to reason about what 
> those requests do and when they should happen, but wouldn't be able to make 
> those decisions itself so would have to just punt everything to userspace. 
> Unless we have eBPF compositors.
>
>>
>> Merged fences where one fence object contains multiple fences will be 
>> supported. A merged fence is signalled only when its fences are signalled. 
>> The consumer will have the option to redefine the unsignalled return fence 
>> to a merged fence.
>
>
> An elaboration of how this differed from drm_syncobj would be really helpful 
> here. I can make some guesses based on the rest of the mail, but I'm not sure 
> how accurate they are.
>
>>
>> 2.2. Modesetting
>>
>> Since a modesetting driver can also be the consumer, the present ioctl will 
>> contain a submit fence and a return fence too. One small problem with 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König



Am 20.04.21 um 17:07 schrieb Daniel Stone:
On Tue, 20 Apr 2021 at 15:58, Christian König  wrote:


Am 20.04.21 um 16:53 schrieb Daniel Stone:

On Mon, 19 Apr 2021 at 11:48, Marek Olšák  wrote:

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which
fence. This information is part of the Present request and
supplied by userspace.
- If the producer crashes, the kernel signals the submit
fence, so that the consumer can make forward progress.
- If the consumer crashes, the kernel signals the return
fence, so that the producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be
handled like GPU hangs.


Another thought: with completely arbitrary userspace fencing,
none of this is helpful either. If the compositor can't guarantee
that a hostile client hasn't submitted a fence which will never be
signaled, then it won't be waiting on it, so it already needs
infrastructure to handle something like this.



That already handles the crashed-client case, because if the
client crashes, then its connection will be dropped, which will
trigger the compositor to destroy all its resources anyway,
including any pending waits.


Exactly that's the problem. A compositor isn't immediately
informed that the client crashed, instead it is still referencing
the buffer and trying to use it for compositing.


If the compositor no longer has a guarantee that the buffer will be 
ready for composition in a reasonable amount of time (which dma_fence 
gives us, and this proposal does not appear to give us), then the 
compositor isn't trying to use the buffer for compositing, it's 
waiting asynchronously on a notification that the fence has signaled 
before it attempts to use the buffer.


Marek's initial suggestion is that the kernel signal the fence, which 
would unblock composition (and presumably show garbage on screen, or 
at best jump back to old content).


My position is that the compositor will know the process has crashed 
anyway - because its socket has been closed - at which point we 
destroy all the client's resources including its windows and buffers 
regardless. Signaling the fence doesn't give us any value here, 
_unless_ the compositor is just blindly waiting for the fence to 
signal ... which it can't do because there's no guarantee the fence 
will ever signal.


Yeah, but that assumes that the compositor has changed to not blindly 
wait for the client to finish rendering, and as Daniel explained that is 
rather unrealistic.


What we need is a fallback mechanism which signals the fence after a 
timeout and gives a penalty to the one causing the timeout.


That gives us the same functionality we have today with the in software 
scheduler inside the kernel.


Regards,
Christian.


Cheers,
Daniel




Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König



Am 20.04.21 um 16:53 schrieb Daniel Stone:

Hi,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák  wrote:


Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence.
This information is part of the Present request and supplied by
userspace.
- If the producer crashes, the kernel signals the submit fence, so
that the consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so
that the producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled
like GPU hangs.


Another thought: with completely arbitrary userspace fencing, none of 
this is helpful either. If the compositor can't guarantee that a 
hostile client hasn't submitted a fence which will never be signaled, 
then it won't be waiting on it, so it already needs infrastructure to 
handle something like this.


That already handles the crashed-client case, because if the client 
crashes, then its connection will be dropped, which will trigger the 
compositor to destroy all its resources anyway, including any pending 
waits.


Exactly that's the problem. A compositor isn't immediately informed that 
the client crashed, instead it is still referencing the buffer and 
trying to use it for compositing.




GPU hangs also look pretty similar; it's an infinite wait, until the 
client resubmits a new buffer which would replace (& discard) the old.


Correct. You just need to assume that all queues get destroyed and 
re-initialized when a GPU reset happens.




So signal-fence-on-process-exit isn't helpful and doesn't provide any 
extra reliability; it in fact probably just complicates things.


Well it is when you go for partial GPU resets.

Regards,
Christian.



Cheers,
Daniel



Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
On Tue, 20 Apr 2021 at 15:58, Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 20.04.21 um 16:53 schrieb Daniel Stone:
>
> On Mon, 19 Apr 2021 at 11:48, Marek Olšák  wrote:
>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This
>> information is part of the Present request and supplied by userspace.
>> - If the producer crashes, the kernel signals the submit fence, so that
>> the consumer can make forward progress.
>> - If the consumer crashes, the kernel signals the return fence, so that
>> the producer can reclaim the buffer.
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
>> hangs.
>>
>
> Another thought: with completely arbitrary userspace fencing, none of this
> is helpful either. If the compositor can't guarantee that a hostile client
> hasn't submitted a fence which will never be signaled, then it won't be
> waiting on it, so it already needs infrastructure to handle something like
> this.
>
>
> That already handles the crashed-client case, because if the client
> crashes, then its connection will be dropped, which will trigger the
> compositor to destroy all its resources anyway, including any pending waits.
>
>
> Exactly that's the problem. A compositor isn't immediately informed that
> the client crashed, instead it is still referencing the buffer and trying
> to use it for compositing.
>

If the compositor no longer has a guarantee that the buffer will be ready
for composition in a reasonable amount of time (which dma_fence gives us,
and this proposal does not appear to give us), then the compositor isn't
trying to use the buffer for compositing, it's waiting asynchronously on a
notification that the fence has signaled before it attempts to use the
buffer.

Marek's initial suggestion is that the kernel signal the fence, which would
unblock composition (and presumably show garbage on screen, or at best jump
back to old content).

My position is that the compositor will know the process has crashed anyway
- because its socket has been closed - at which point we destroy all the
client's resources including its windows and buffers regardless. Signaling
the fence doesn't give us any value here, _unless_ the compositor is just
blindly waiting for the fence to signal ... which it can't do because
there's no guarantee the fence will ever signal.

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák  wrote:

> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
> - If the producer crashes, the kernel signals the submit fence, so that
> the consumer can make forward progress.
> - If the consumer crashes, the kernel signals the return fence, so that
> the producer can reclaim the buffer.
> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
> hangs.
>

Another thought: with completely arbitrary userspace fencing, none of this
is helpful either. If the compositor can't guarantee that a hostile client
hasn't submitted a fence which will never be signaled, then it won't be
waiting on it, so it already needs infrastructure to handle something like
this. That already handles the crashed-client case, because if the client
crashes, then its connection will be dropped, which will trigger the
compositor to destroy all its resources anyway, including any pending waits.

GPU hangs also look pretty similar; it's an infinite wait, until the client
resubmits a new buffer which would replace (& discard) the old.

So signal-fence-on-process-exit isn't helpful and doesn't provide any extra
reliability; it in fact probably just complicates things.

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 1:59 PM Christian König
 wrote:
>
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem.
>
> Well, the path of inner peace begins with four words. “Not my fucking
> problem.”
>
> But I'm not that much concerned about the kernel, but rather about
> important userspace processes like X, Wayland, SurfaceFlinger etc...
>
> I mean attaching a page to a sync object and allowing to wait/signal
> from both CPU as well as GPU side is not so much of a problem.
>
> > You have to somehow handle that, e.g. perhaps with conditional
> > rendering and just using the old frame in compositing if the new one
> > doesn't show up in time.
>
> Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level.

For opengl we do all the same guarantees, so if you get one of these
you just block until the fence is signalled. Doing that properly means
submit thread to support drm_syncobj like for vulkan.

For vulkan we probably want to represent these as proper vk timeline
objects, and the vulkan way is to just let the application (well,
compositor) deal with it here. If they import timelines from untrusted
other parties, they need to handle the potential fallback of being
lied to. The "how" is "not vulkan's fucking problem", because that
entire "with great power (well, performance) comes great responsibility"
is the entire vk design paradigm.
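
(For illustration, the bounded-wait side of that with core Vulkan 1.2 timeline semaphores; a sketch only, with the imported semaphore and wait point assumed to come from whatever import path is used:)

#include <vulkan/vulkan.h>
#include <stdint.h>

/* Sketch: wait on an imported timeline point with a timeout instead of
 * trusting the other party; on VK_TIMEOUT the caller falls back, e.g.
 * to re-compositing the previous frame. */
static int wait_with_timeout(VkDevice device, VkSemaphore timeline,
                             uint64_t point, uint64_t timeout_ns)
{
    VkSemaphoreWaitInfo wait = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
        .semaphoreCount = 1,
        .pSemaphores = &timeline,
        .pValues = &point,
    };
    return vkWaitSemaphores(device, &wait, timeout_ns) == VK_SUCCESS;
}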

Glamor will just rely on GL providing a nice package over the harsh
reality of gpus, like usual.

So I guess step 1 here for GL would be to provide some kind of
import/export of timeline syncobj, including properly handling this
"future/indefinite fences" aspect of them with submit thread and
everything.
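
(A rough sketch of the libdrm plumbing such an import/export extension would wrap; error handling elided, and transport of the FD between processes assumed:)

#include <xf86drm.h>
#include <stdint.h>

/* Sketch: producer exports a syncobj as an FD, consumer re-imports it. */
static uint32_t share_syncobj(int producer_fd, int consumer_fd)
{
    uint32_t handle = 0, imported = 0;
    int syncobj_fd = -1;

    drmSyncobjCreate(producer_fd, 0, &handle);
    drmSyncobjHandleToFD(producer_fd, handle, &syncobj_fd);
    /* syncobj_fd travels to the consumer, e.g. via SCM_RIGHTS ... */
    drmSyncobjFDToHandle(consumer_fd, syncobj_fd, &imported);
    return imported;
}
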
-Daniel

>
> Regards,
> Christian.
>
> Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> >> Daniel, are you suggesting that we should skip any deadlock prevention in
> >> the kernel, and just let userspace wait for and signal any fence it has
> >> access to?
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem. The only criteria is that the kernel itself must
> > never rely on these userspace fences, except for stuff like implementing
> > optimized cpu waits. And in those we must always guarantee that the
> > userspace process remains interruptible.
> >
> > It's a completely different world from dma_fence based kernel fences,
> > whether those are implicit or explicit.
> >
> >> Do you have any concern with the deprecation/removal of BO fences in the
> >> kernel assuming userspace is only using explicit fences? Any concern with
> >> the submit and return fences for modesetting and other producer<->consumer
> >> scenarios?
> > Let me work on the full reply for your rfc first, because there's a lot
> > of details here and nuance.
> > -Daniel
> >
> >> Thanks,
> >> Marek
> >>
> >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:
> >>
> >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> >>>  wrote:
>  Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > Not going to comment on everything on the first pass...
> >
> > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> >> Hi,
> >>
> >> This is our initial proposal for explicit fences everywhere and new
> >>> memory management that doesn't use BO fences. It's a redesign of how Linux
> >>> graphics drivers work, and it can coexist with what we have now.
> >>
> >> 1. Introduction
> >> (skip this if you are already sold on explicit fences)
> >>
> >> The current Linux graphics architecture was initially designed for
> >>> GPUs with only one graphics queue where everything was executed in the
> >>> submission order and per-BO fences were used for memory management and
> >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> >>> queues were added on top, which required the introduction of implicit
> >>> GPU-GPU synchronization between queues of different processes using per-BO
> >>> fences. Recently, even parallel execution within one queue was enabled
> >>> where a command buffer starts draws and compute shaders, but doesn't wait
> >>> for them, enabling parallelism between back-to-back command buffers.
> >>> Modesetting also uses per-BO fences for scheduling flips. Our GPU 
> >>> scheduler
> >>> was created to enable all those use cases, and it's the only reason why 
> >>> the
> >>> scheduler exists.
> >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> >>> management, and the tracking of per-BO fences increase CPU overhead and
> >>> latency, and reduce parallelism. There is a desire to replace all of them
> >>> with something much simpler. Below is how we could do it.
> >>
> >> 2. Explicit synchronization for window systems and modesetting
> >>
> >> The producer is an application and the consumer is a compositor or a
> >>> modesetting driver.
> >> 2.1. 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 3:04 PM Daniel Stone  wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 13:01, Daniel Vetter  wrote:
>>
>> - We live in a post xf86-video-$vendor world, and all these other
>>   compositors rely on implicit sync. You're not going to be able to get
>>   rid of them anytime soon. What's worse, all the various EGL/vk buffer
>>   sharing things also rely on implicit sync, so you get to fix up tons of
>>   applications on top. Any plan that's realistic needs to cope with
>>   implicit/explicit at the same time; switching everyone over together
>>   won't work.
>>
>> - Absolutely infuriating, but you can't use page-faulting together with any
>>   dma_fence synchronization primitives, whether implicit or explicit. This
>>   means until the entire ecosystem has moved forward (good luck with that) we
>>   have to support dma_fence. The only sync model that works together with
>>   page faults is userspace fence based sync.
>>
>> This should get rid of the oversync issues, and since implicit sync is
>> baked in everywhere right now, you'll have to deal with implicit sync for
>> a very long time.
>
>
> Depends what you mean by 'implicit sync'. ;)
>
> Getting userspace (Vulkan WSI, EGL, Wayland compositors, browsers, media 
> clients) over to explicit sync is easy, _provided_ that the explicit sync 
> gives us the same guarantees as implicit sync, i.e. completes in bounded 
> time, GPU/display work can be flushed to the kernel predicated on fence 
> completion with the kernel handling synchronisation and scheduling. It's just 
> a matter of typing, and until now we haven't had a great reason to do that 
> typing. Now we do have that reason, so we are implementing it. Whether it's 
> dma_fence or drm_syncobj is mostly immaterial; we can encode in protocol 
> requirements that you can't try to use wait-before-signal with drm_syncobj 
> and you'll get killed if you try.
>
> Getting that userspace over to fully userspace-based sync (wait-before-signal 
> or wait-never-signal, no kernel assistance but you just have to roll your own 
> polling or signal handling on either CPU or GPU side) is not easy. It might 
> never happen, because it's an extraordinary amount of work, introduces a huge 
> amount of fragility into a super-critical path, and so far it's not clear 
> that it's a global performance improvement for the whole system, just 
> shifting performance problems from kernel to userspace, and probably (AFAICT) 
> making them worse in addition to the other problems it brings.
>
> What am I missing?

Nothing I think.

Which is why I'm arguing that kernel based sync with all the current
dma_fence guarantees is probably going to stick around for something
close to forever, and we need to assume so.

Only in specific cases does full userspace sync make sense imo:
- anything compute, excluding using compute/shaders to create
displayable buffers; compute as in: your final target is writing
some stuff to files and never interacting with any winsys. Those
really care because "run a compute kernel for a few hours" isn't
supported without userspace sync, and I don't think ever will.
- maybe vulkan direct display, once/if we have the extensions for
atomic kms wired up
- maybe someone wants to write a vulkan based compositor and deal with
all this themselves. That model I think would also imply that they
deal with all the timeouts and fallbacks, irrespective of whether
underneath we actually run on dma_fence timeline syncobjs or userspace
fence timeline syncobjs.

From about 2 years of screaming at this stuff it feels like this will
be a pretty exhaustive list for the next 10 years. Definitely doesn't
include your random linux desktop wayland compositor stack. But
there's definitely some specific areas where people care enough
for all the pain. For everyone else it's all the other pieces I laid
out.

This also means that I don't think we now have that impetus to start
typing all the explicit sync protocol/compositor bits, since:
- the main driver is compute stuff, that needs mesa work (well vk/ocl
plus all the various repainted copies of cuda)
- with the tricks to make implicit sync work more like explicit sync
the oversyncing can be largely solved without protocol work
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi,

On Tue, 20 Apr 2021 at 13:01, Daniel Vetter  wrote:

> - We live in a post xf86-video-$vendor world, and all these other
>   compositors rely on implicit sync. You're not going to be able to get
>   rid of them anytime soon. What's worse, all the various EGL/vk buffer
>   sharing things also rely on implicit sync, so you get to fix up tons of
>   applications on top. Any plan that's realistic needs to cope with
>   implicit/explicit at the same time; switching everyone over together
>   won't work.
>
> - Absolutely infuriating, but you can't use page-faulting together with any
>   dma_fence synchronization primitives, whether implicit or explicit. This
>   means until the entire ecosystem has moved forward (good luck with that) we
>   have to support dma_fence. The only sync model that works together with
>   page faults is userspace fence based sync.
>
> This should get rid of the oversync issues, and since implicit sync is
> baked in everywhere right now, you'll have to deal with implicit sync for
> a very long time.
>

Depends what you mean by 'implicit sync'. ;)

Getting userspace (Vulkan WSI, EGL, Wayland compositors, browsers, media
clients) over to explicit sync is easy, _provided_ that the explicit sync
gives us the same guarantees as implicit sync, i.e. completes in bounded
time, GPU/display work can be flushed to the kernel predicated on fence
completion with the kernel handling synchronisation and scheduling. It's
just a matter of typing, and until now we haven't had a great reason to do
that typing. Now we do have that reason, so we are implementing it. Whether
it's dma_fence or drm_syncobj is mostly immaterial; we can encode in
protocol requirements that you can't try to use wait-before-signal with
drm_syncobj and you'll get killed if you try.
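
(To make concrete how that protocol rule maps onto the uapi, as I read it: without DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, waiting on a timeline point whose fence was never even submitted fails immediately instead of blocking. A sketch:)

#include <xf86drm.h>
#include <stdint.h>

/* Sketch: a bounded compositor-side wait that refuses wait-before-signal.
 * deadline_ns is absolute (CLOCK_MONOTONIC); returns 0 on success and a
 * negative errno on timeout, or immediately if the client never
 * submitted work for this point. */
static int wait_bounded(int drm_fd, uint32_t syncobj, uint64_t point,
                        int64_t deadline_ns)
{
    return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1, deadline_ns,
                                  0 /* deliberately no WAIT_FOR_SUBMIT */,
                                  NULL);
}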

Getting that userspace over to fully userspace-based sync
(wait-before-signal or wait-never-signal, no kernel assistance but you just
have to roll your own polling or signal handling on either CPU or GPU side)
is not easy. It might never happen, because it's an extraordinary amount of
work, introduces a huge amount of fragility into a super-critical path,
and so far it's not clear that it's a global performance improvement for
the whole system, just shifting performance problems from kernel to
userspace, and probably (AFAICT) making them worse in addition to the other
problems it brings.

What am I missing?

Cheers,
Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Stone
Hi Marek,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák  wrote:

> *2. Explicit synchronization for window systems and modesetting*
>
> The producer is an application and the consumer is a compositor or a
> modesetting driver.
>
> *2.1. The Present request*
>

So the 'present request' is an ioctl, right? Not a userspace construct like
it is today? If so, how do we correlate the two?

The terminology is pretty X11-centric so I'll assume that's what you've
designed against, but Wayland and even X11 carry much more auxiliary
information attached to a present request than just 'this buffer, this
swapchain'. Wayland latches a lot of data on presentation, including
non-graphics data such as surface geometry (so we can have resizes which
don't suck), window state (e.g. fullscreen or not, also so we can have
resizes which don't suck), and these requests can also cascade through a
tree of subsurfaces (so we can have embeds which don't suck). X11 mostly
just carries timestamps, which is more tractable.

Given we don't want to move the entirety of Wayland into kernel-visible
objects, how do we synchronise the two streams so they aren't incoherent?
Taking a rough stab at it whilst assuming we do have
DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in
kernel space, which the producer would create and ?? export a FD from, that
the compositor would ?? import.
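
(Purely to make the strawman concrete, a hypothetical ABI sketch; nothing like this exists upstream, and the ioctl number and fields are made up:)

#include <drm/drm.h>

/* Hypothetical: the DRM_IOCTL_NONMODE_PRESENT strawman above. */
struct drm_nonmode_present {
	__u32 dmabuf_fd;      /* the presented buffer */
	__u32 submit_syncobj; /* signalled when the producer is done rendering */
	__u32 return_syncobj; /* signalled when the consumer is done reading */
	__u32 flags;
};
#define DRM_IOCTL_NONMODE_PRESENT \
	DRM_IOWR(0x00, struct drm_nonmode_present) /* number made up */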

As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
>

We have already have this in Wayland through dma_fence. I'm relaxed about
this becoming drm_syncobj or drm_newmappedsyncobjthing, it's just a matter
of typing. X11 has patches to DRI3 to support dma_fence, but they never got
merged because it was far too invasive to a server which is no longer
maintained.


> - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.
>

Currently in Wayland the return fence (again a dma_fence) is generated by
the compositor and sent as an event when it's done, because we can't have
speculative/empty/future fences. drm_syncobj would make this possible, but
so far I've been hesitant because I don't see the benefit to it (more
below).
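
(For reference, this is roughly what the compositor-generated return fence looks like on the client side today with the unstable linux-explicit-synchronization-v1 protocol; the generated client header is assumed:)

#include "linux-explicit-synchronization-unstable-v1-client-protocol.h"

/* Sketch: the compositor's "return fence" arrives as an event. */
static void handle_fenced_release(void *data,
        struct zwp_linux_buffer_release_v1 *release, int32_t fence_fd)
{
    /* fence_fd is a sync_file; reuse the buffer once it signals. */
}

static void handle_immediate_release(void *data,
        struct zwp_linux_buffer_release_v1 *release)
{
    /* No fence: the compositor is already done with the buffer. */
}

static const struct zwp_linux_buffer_release_v1_listener release_listener = {
    .fenced_release = handle_fenced_release,
    .immediate_release = handle_immediate_release,
};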


> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
>

Same as today with dma_fence. Less true with drm_syncobj if we're using
timelines.


> - If the producer crashes, the kernel signals the submit fence, so that
> the consumer can make forward progress.
>

This is only a change if the producer is now allowed to submit a fence
before it's flushed the work which would eventually fulfill that fence.
Using dma_fence has so far isolated us from this.


> - If the consumer crashes, the kernel signals the return fence, so that
> the producer can reclaim the buffer.
>

'The consumer' is problematic, per below. I think the wording you want is
'if no references are held to the submitted present object'.


> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
> hangs.
>
> Other window system requests can follow the same idea.
>

Which other window system requests did you have in mind? Again, moving the
entirety of Wayland's signaling into the kernel is a total non-starter.
Partly because it means our entire protocol would be subject to the
kernel's ABI rules, partly because the rules and interdependencies between
the requests are extremely complex, but mostly because the kernel is just a
useless proxy: it would be forced to do significant work to reason about
what those requests do and when they should happen, but wouldn't be able to
make those decisions itself so would have to just punt everything to
userspace. Unless we have eBPF compositors.


> Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
>

An elaboration of how this differed from drm_syncobj would be really
helpful here. I can make some guesses based on the rest of the mail, but
I'm not sure how accurate they are.
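
(My guess is it's close to what sync_file already does for merging; for comparison, the existing kernel uapi, sketched with error handling mostly elided:)

#include <linux/sync_file.h>
#include <string.h>
#include <sys/ioctl.h>

/* Sketch: today's sync_file equivalent of a "merged fence"; the returned
 * FD signals only once both inputs have signalled. */
static int merge_fences(int fence_a, int fence_b)
{
    struct sync_merge_data merge = { .fd2 = fence_b };

    strcpy(merge.name, "merged");
    if (ioctl(fence_a, SYNC_IOC_MERGE, &merge) < 0)
        return -1;
    return merge.fence; /* new sync_file FD */
}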


> *2.2. Modesetting*
>
> Since a modesetting driver can also be the consumer, the present ioctl
> will contain a submit fence and a return fence too. One small problem with
> this is that userspace can hang the modesetting driver, but in theory, any
> later present ioctl can override the previous one, so the unsignalled
> presentation is never used.
>

This is also problematic. It's not just KMS, but media codecs too - V4L
doesn't yet have explicit fencing, but given the programming model of
codecs and how deeply they interoperate, but 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König

Hi Daniel,

Am 20.04.21 um 14:01 schrieb Daniel Vetter:

On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:

Hi,

This is our initial proposal for explicit fences everywhere and new memory
management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


*1. Introduction*
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs
with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory
management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.

I get the feeling you're mixing up a lot of things here that have more
nuance, so first some lingo.

- There's kernel based synchronization, based on dma_fence. These come in
   two major variants: Implicit synchronization, where the kernel attaches
   the dma_fences to a dma-buf, and explicit synchronization, where the
   dma_fence gets passed around as a stand-alone object, either a sync_file
   or a drm_syncobj

- Then there's userspace fence synchronization, where userspace issues any
   fences directly and the kernel doesn't even know what's going on. This
   is the only model that allows you to ditch the kernel overhead, and it's
   also the model that vk uses.

   I concur with Jason that this one is the future, it's the model hw
   wants, compute wants and vk wants. Building an explicit fence world
   which doesn't aim at this is imo wasted effort.

Now you smash them into one thing by also changing the memory model, but I
think that doesn't work:

- Relying on gpu page faults across the board won't happen. I think right
   now only amd's GFX10 or so has enough pagefault support to allow this,


It's even worse. GFX9 has enough support so that in theory it can work.

Because of this Felix and his team are working on HMM support based on 
this generation.


On GFX10 some aspects of it are improved while others are totally broken 
again.



   and not even there I'm really sure. Nothing else will anytime soon, at
   least not as far as I know. So we need to support slightly more hw in
   upstream than just that.  Any plan that's realistic needs to cope with
   dma_fence for a really long time.

- Pown^WPin All The Things! is probably not a general enough memory
   management approach. We've kinda tried for years to move away from it.
   Sure we can support it as an optimization in specific workloads, and it
   will make stuff faster, but it's not going to be the default I think.

- We live in a post xf86-video-$vendor world, and all these other
   compositors rely on implicit sync. You're not going to be able to get
   rid of them anytime soon. What's worse, all the various EGL/vk buffer
   sharing things also rely on implicit sync, so you get to fix up tons of
   applications on top. Any plan that's realistic needs to cope with
   implicit/explicit at the same time; switching everyone over together
   won't work.

- Absolutely infuriating, but you can't use page-faulting together with any
   dma_fence synchronization primitives, whether implicit or explicit. This
   means until the entire ecosystem has moved forward (good luck with that) we
   have to support dma_fence. The only sync model that works together with
   page faults is userspace fence based sync.

Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
sync, at least last I checked. Currently this oversynchronizes badly
because it's left to the kernel to guess what should be synchronized, and
that gets things wrong. What you need there is explicit implicit
synchronization:

- on the cs side, userspace must state explicitly for which buffers the kernel
   should engage in implicit synchronization. That's how it works on all
   other drivers that support more explicit userspace like vk or gl drivers
   that are internally all explicit. So essentially you only set the
   implicit fence slot when you really want to, and only userspace knows
   this. Implementing this without breaking the current logic probably
   needs some flags.

- the other side isn't there yet upstream, but Jason has patches.
   

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:
> Hi,
> 
> This is our initial proposal for explicit fences everywhere and new memory
> management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> 
> 
> *1. Introduction*
> (skip this if you are already sold on explicit fences)
> 
> The current Linux graphics architecture was initially designed for GPUs
> with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> 
> The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.

I get the feeling you're mixing up a lot of things here that have more
nuance, so first some lingo.

- There's kernel based synchronization, based on dma_fence. These come in
  two major variants: Implicit synchronization, where the kernel attaches
  the dma_fences to a dma-buf, and explicit synchronization, where the
  dma_fence gets passed around as a stand-alone object, either a sync_file
  or a drm_syncobj

- Then there's userspace fence synchronization, where userspace issues any
  fences directly and the kernel doesn't even know what's going on. This
  is the only model that allows you to ditch the kernel overhead, and it's
  also the model that vk uses.

  I concur with Jason that this one is the future, it's the model hw
  wants, compute wants and vk wants. Building an explicit fence world
  which doesn't aim at this is imo wasted effort.
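
(Worth keeping in mind for the rest of this: the two explicit carriers interconvert, so a dma_fence can move freely between sync_file and drm_syncobj. A sketch with libdrm, error handling elided:)

#include <xf86drm.h>
#include <stdint.h>

/* Sketch: move a dma_fence between its two explicit containers. */
static uint32_t syncobj_from_sync_file(int drm_fd, int sync_file_fd)
{
    uint32_t handle = 0;
    drmSyncobjCreate(drm_fd, 0, &handle);
    drmSyncobjImportSyncFile(drm_fd, handle, sync_file_fd);
    return handle;
}

static int sync_file_from_syncobj(int drm_fd, uint32_t handle)
{
    int fd = -1;
    drmSyncobjExportSyncFile(drm_fd, handle, &fd);
    return fd;
}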

Now you smash them into one thing by also changing the memory model, but I
think that doesn't work:

- Relying on gpu page faults across the board won't happen. I think right
  now only amd's GFX10 or so has enough pagefault support to allow this,
  and not even there I'm really sure. Nothing else will anytime soon, at
  least not as far as I know. So we need to support slightly more hw in
  upstream than just that.  Any plan that's realistic needs to cope with
  dma_fence for a really long time.

- Pown^WPin All The Things! is probably not a general enough memory
  management approach. We've kinda tried for years to move away from it.
  Sure we can support it as an optimization in specific workloads, and it
  will make stuff faster, but it's not going to be the default I think.

- We live in a post xf86-video-$vendor world, and all these other
  compositors rely on implicit sync. You're not going to be able to get
  rid of them anytime soon. What's worse, all the various EGL/vk buffer
  sharing things also rely on implicit sync, so you get to fix up tons of
  applications on top. Any plan that's realistic needs to cope with
  implicit/explicit at the same time; switching everyone over together
  won't work.

- Absolutely infuriating, but you can't use page-faulting together with any
  dma_fence synchronization primitives, whether implicit or explicit. This
  means until the entire ecosystem has moved forward (good luck with that) we
  have to support dma_fence. The only sync model that works together with
  page faults is userspace fence based sync.

Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
sync, at least last I checked. Currently this oversynchronizes badly
because it's left to the kernel to guess what should be synchronized, and
that gets things wrong. What you need there is explicit implicit
synchronization:

- on the cs side, userspace must state explicitly for which buffers the kernel
  should engage in implicit synchronization. That's how it works on all
  other drivers that support more explicit userspace like vk or gl drivers
  that are internally all explicit. So essentially you only set the
  implicit fence slot when you really want to, and only userspace knows
  this. Implementing this without breaking the current logic probably
  needs some flags.

- the other side isn't there yet upstream, but Jason has patches.
  Essentially you also need to sample your implicit sync points at the
  right spot, to avoid oversync on later rendering by the producer.
  Jason's patch solves this by adding an ioctl to dma-buf to get the
  current set.

- without any of this things for pure explicit fencing userspace 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem.


Well, the path of inner peace begins with four words. “Not my fucking 
problem.”


But I'm not that much concerned about the kernel, but rather about 
important userspace processes like X, Wayland, SurfaceFlinger etc...


I mean attaching a page to a sync object and allowing to wait/signal 
from both CPU as well as GPU side is not so much of a problem.



You have to somehow handle that, e.g. perhaps with conditional
rendering and just using the old frame in compositing if the new one
doesn't show up in time.


Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level.

Regards,
Christian.

Am 20.04.21 um 13:16 schrieb Daniel Vetter:

On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:

Daniel, are you suggesting that we should skip any deadlock prevention in
the kernel, and just let userspace wait for and signal any fence it has
access to?

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem. The only criteria is that the kernel itself must
never rely on these userspace fences, except for stuff like implementing
optimized cpu waits. And in those we must always guarantee that the
userspace process remains interruptible.

It's a completely different world from dma_fence based kernel fences,
whether those are implicit or explicit.


Do you have any concern with the deprecation/removal of BO fences in the
kernel assuming userspace is only using explicit fences? Any concern with
the submit and return fences for modesetting and other producer<->consumer
scenarios?

Let me work on the full reply for your rfc first, because there's a lot
of details here and nuance.
-Daniel


Thanks,
Marek

On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:


On Tue, Apr 20, 2021 at 12:15 PM Christian König
 wrote:

Am 19.04.21 um 17:48 schrieb Jason Ekstrand:

Not going to comment on everything on the first pass...

On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:

Hi,

This is our initial proposal for explicit fences everywhere and new

memory management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


1. Introduction
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for

GPUs with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory

management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.


2. Explicit synchronization for window systems and modesetting

The producer is an application and the consumer is a compositor or a

modesetting driver.

2.1. The Present request

As part of the Present request, the producer will pass 2 fences (sync

objects) to the consumer alongside the presented DMABUF BO:

- The submit fence: Initially unsignalled, it will be signalled when

the producer has finished drawing into the presented buffer.

- The return fence: Initially unsignalled, it will be signalled when

the consumer has finished using the presented buffer.

I'm not sure syncobj is what we want.  In the Intel world we're trying
to go even further to something we're calling "userspace fences" which
are a timeline implemented as a single 64-bit value in some
CPU-mappable BO.  The client writes a higher value into the BO to
signal the timeline.

Well that is exactly what our Windows guys have suggested as well, but
it strongly looks like this isn't sufficient.

First of all you run into security problems when any application can
just write any value to that memory location. Just imagine an
application sets the counter to zero and X waits forever for some
rendering to finish.
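
(A minimal sketch of that model with nothing beyond C11 atomics on the CPU mapping of the BO, which also shows why the waiter has to bring its own timeout: a well-behaved signaller only ever increases the value, but nothing enforces that.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Sketch: a userspace timeline fence, one 64-bit value in a mapped BO. */
typedef struct { _Atomic uint64_t value; } user_fence;

static void fence_signal(user_fence *f, uint64_t point)
{
    /* Release ordering: buffer writes become visible before the fence. */
    atomic_store_explicit(&f->value, point, memory_order_release);
}

static bool fence_wait(user_fence *f, uint64_t point, int64_t timeout_ns)
{
    struct timespec ts = { 0, 100000 }; /* 100us poll; real code would use
                                           a futex or a GPU-side wait */
    while (atomic_load_explicit(&f->value, memory_order_acquire) < point) {
        if ((timeout_ns -= 100000) < 0)
            return false; /* never trust the other side to signal */
        nanosleep(&ts, NULL);
    }
    return true;
}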

The thing is, with userspace fences the security boundary issue moves
into userspace entirely. And it really doesn't matter whether
the event you're waiting on doesn't complete because the other app
crashed or was stupid or intentionally gave you a wrong fence point:
You have to somehow handle that, e.g. perhaps with conditional
rendering and just using the old frame 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> Daniel, are you suggesting that we should skip any deadlock prevention in
> the kernel, and just let userspace wait for and signal any fence it has
> access to?

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem. The only criteria is that the kernel itself must
never rely on these userspace fences, except for stuff like implementing
optimized cpu waits. And in those we must always guarantee that the
userspace process remains interruptible.

It's a completely different world from dma_fence based kernel fences,
whether those are implicit or explicit.

> Do you have any concern with the deprecation/removal of BO fences in the
> kernel assuming userspace is only using explicit fences? Any concern with
> the submit and return fences for modesetting and other producer<->consumer
> scenarios?

Let me work on the full replay for your rfc first, because there's a lot
of details here and nuance.
-Daniel

> 
> Thanks,
> Marek
> 
> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:
> 
> > On Tue, Apr 20, 2021 at 12:15 PM Christian König
> >  wrote:
> > >
> > > Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > > > Not going to comment on everything on the first pass...
> > > >
> > > > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> > > >> Hi,
> > > >>
> > > >> This is our initial proposal for explicit fences everywhere and new
> > memory management that doesn't use BO fences. It's a redesign of how Linux
> > graphics drivers work, and it can coexist with what we have now.
> > > >>
> > > >>
> > > >> 1. Introduction
> > > >> (skip this if you are already sold on explicit fences)
> > > >>
> > > >> The current Linux graphics architecture was initially designed for
> > GPUs with only one graphics queue where everything was executed in the
> > submission order and per-BO fences were used for memory management and
> > CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> > queues were added on top, which required the introduction of implicit
> > GPU-GPU synchronization between queues of different processes using per-BO
> > fences. Recently, even parallel execution within one queue was enabled
> > where a command buffer starts draws and compute shaders, but doesn't wait
> > for them, enabling parallelism between back-to-back command buffers.
> > Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> > was created to enable all those use cases, and it's the only reason why the
> > scheduler exists.
> > > >>
> > > >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> > management, and the tracking of per-BO fences increase CPU overhead and
> > latency, and reduce parallelism. There is a desire to replace all of them
> > with something much simpler. Below is how we could do it.
> > > >>
> > > >>
> > > >> 2. Explicit synchronization for window systems and modesetting
> > > >>
> > > >> The producer is an application and the consumer is a compositor or a
> > modesetting driver.
> > > >>
> > > >> 2.1. The Present request
> > > >>
> > > >> As part of the Present request, the producer will pass 2 fences (sync
> > objects) to the consumer alongside the presented DMABUF BO:
> > > >> - The submit fence: Initially unsignalled, it will be signalled when
> > the producer has finished drawing into the presented buffer.
> > > >> - The return fence: Initially unsignalled, it will be signalled when
> > the consumer has finished using the presented buffer.
> > > > I'm not sure syncobj is what we want.  In the Intel world we're trying
> > > > to go even further to something we're calling "userspace fences" which
> > > > are a timeline implemented as a single 64-bit value in some
> > > > CPU-mappable BO.  The client writes a higher value into the BO to
> > > > signal the timeline.
> > >
> > > Well that is exactly what our Windows guys have suggested as well, but
> > > it strongly looks like that this isn't sufficient.
> > >
> > > First of all you run into security problems when any application can
> > > just write any value to that memory location. Just imagine an
> > > application sets the counter to zero and X waits forever for some
> > > rendering to finish.
> >
> > The thing is, with userspace fences security boundary issue prevent
> > moves into userspace entirely. And it really doesn't matter whether
> > the event you're waiting on doesn't complete because the other app
> > crashed or was stupid or intentionally gave you a wrong fence point:
> > You have to somehow handle that, e.g. perhaps with conditional
> > rendering and just using the old frame in compositing if the new one
> > doesn't show up in time. Or something like that. So trying to get the
> > kernel involved but also not so much involved sounds like a bad design
> > to me.
> >
> > > Additional to that in such a model you can't determine who is the guilty
> > > queue in case of a hang and can't reset the 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Marek Olšák
Daniel, are you suggesting that we should skip any deadlock prevention in
the kernel, and just let userspace wait for and signal any fence it has
access to?

Do you have any concern with the deprecation/removal of BO fences in the
kernel assuming userspace is only using explicit fences? Any concern with
the submit and return fences for modesetting and other producer<->consumer
scenarios?

Thanks,
Marek

On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:

> On Tue, Apr 20, 2021 at 12:15 PM Christian König
>  wrote:
> >
> > Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > > Not going to comment on everything on the first pass...
> > >
> > > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> > >> Hi,
> > >>
> > >> This is our initial proposal for explicit fences everywhere and new
> memory management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> > >>
> > >>
> > >> 1. Introduction
> > >> (skip this if you are already sold on explicit fences)
> > >>
> > >> The current Linux graphics architecture was initially designed for
> GPUs with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> > >>
> > >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.
> > >>
> > >>
> > >> 2. Explicit synchronization for window systems and modesetting
> > >>
> > >> The producer is an application and the consumer is a compositor or a
> modesetting driver.
> > >>
> > >> 2.1. The Present request
> > >>
> > >> As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> > >> - The submit fence: Initially unsignalled, it will be signalled when
> the producer has finished drawing into the presented buffer.
> > >> - The return fence: Initially unsignalled, it will be signalled when
> the consumer has finished using the presented buffer.
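
A minimal sketch of what one Present request would carry, assuming the
fences travel as file descriptors; the struct and field names here are
invented for illustration and are not part of the proposal:

struct present_request {
	int dmabuf_fd;        /* the presented buffer */
	int submit_fence_fd;  /* signalled when the producer finishes drawing */
	int return_fence_fd;  /* signalled when the consumer stops using it */
};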
> > > I'm not sure syncobj is what we want.  In the Intel world we're trying
> > > to go even further to something we're calling "userspace fences" which
> > > are a timeline implemented as a single 64-bit value in some
> > > CPU-mappable BO.  The client writes a higher value into the BO to
> > > signal the timeline.
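
A minimal sketch of such a userspace fence, assuming C11 atomics (the
helper names are invented for illustration, not Intel's actual interface):

#include <stdatomic.h>
#include <stdint.h>

/* The whole timeline is one 64-bit value inside a CPU-mappable BO. */
static void userfence_signal(_Atomic uint64_t *timeline, uint64_t point)
{
	/* Release ordering: writes to the protected buffer become
	 * visible before the new timeline value does. */
	atomic_store_explicit(timeline, point, memory_order_release);
}

static int userfence_reached(_Atomic uint64_t *timeline, uint64_t point)
{
	/* A fence point is "signalled" once the timeline has passed it. */
	return atomic_load_explicit(timeline, memory_order_acquire) >= point;
}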
> >
> > Well, that is exactly what our Windows guys have suggested as well, but
> > it strongly looks like this isn't sufficient.
> >
> > First of all you run into security problems when any application can
> > just write any value to that memory location. Just imagine an
> > application sets the counter to zero and X waits forever for some
> > rendering to finish.
>
> The thing is, with userspace fences the security boundary problem moves
> into userspace entirely. And it really doesn't matter whether
> the event you're waiting on doesn't complete because the other app
> crashed or was stupid or intentionally gave you a wrong fence point:
> You have to somehow handle that, e.g. perhaps with conditional
> rendering and just using the old frame in compositing if the new one
> doesn't show up in time. Or something like that. So trying to get the
> kernel involved but also not so much involved sounds like a bad design
> to me.
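
A sketch of that userspace-side fallback, reusing the hypothetical
timeline layout from above (again an editor's illustration, not an
interface from the proposal): the compositor gives the untrusted fence a
deadline and keeps the previous frame if it isn't met.

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Poll an untrusted userspace fence until @point is reached or @deadline
 * (CLOCK_MONOTONIC) expires. Never waits forever. */
static bool userfence_wait_until(_Atomic uint64_t *timeline, uint64_t point,
				 const struct timespec *deadline)
{
	struct timespec now;

	while (atomic_load_explicit(timeline, memory_order_acquire) < point) {
		clock_gettime(CLOCK_MONOTONIC, &now);
		if (now.tv_sec > deadline->tv_sec ||
		    (now.tv_sec == deadline->tv_sec &&
		     now.tv_nsec >= deadline->tv_nsec))
			return false;	/* producer hung, crashed or lied */
		sched_yield();	/* a real compositor would futex-wait or poll */
	}
	return true;
}

On a timeout the compositor just composites the old frame again; a
misbehaving client only hurts itself.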
>
> > In addition to that, in such a model you can't determine which queue is
> > guilty in case of a hang, and you can't reset the synchronization
> > primitives in case of an error.
> >
> > Apart from that, this is rather inefficient, e.g. we don't have any way
> > to prevent priority inversion when used as a synchronization mechanism
> > between different GPU queues.
>
> Yeah, but you can't have it both ways. Either all the scheduling and
> fence handling in the kernel is a problem, or you actually want to
> schedule in the kernel. hw seems to definitely move towards the more
> stupid spinlock-in-hw model (and direct submit from userspace and all
> that), priority inversions be damned. I'm really not sure we should
> fight that - if it's really that inefficient then maybe hw will add
> support for waiting sync constructs in hardware, or at least be
> smarter about scheduling other stuff. E.g. on intel hw 
