Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Marek Olšák
Proposal for a new CS ioctl, kernel pseudo code:

lock(_lock);
serial = get_next_serial(dev);
add_wait_command(ring, serial - 1);
add_exec_cmdbuf(ring, user_cmdbuf);
add_signal_command(ring, serial);
*ring->doorbell = FIRE;
unlock(_lock);

See? Just like userspace submit, but in the kernel without
concurrency/preemption. Is this now safe enough for dma_fence?
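
Spelled out a bit more as kernel-style C, purely for illustration; the
lock name, ring helpers, and ioctl entry point below are made-up
placeholders, not existing amdgpu code:

int new_cs_ioctl(struct my_device *dev, struct my_ring *ring,
                 void __user *user_cmdbuf)
{
        u64 serial;

        mutex_lock(&dev->submission_lock);   /* assumed per-device lock */

        serial = get_next_serial(dev);       /* monotonically increasing */

        /* The kernel writes these packets itself, so userspace never
         * touches the wait/signal commands that back the dma_fence. */
        add_wait_command(ring, serial - 1);  /* order behind the previous job */
        add_exec_cmdbuf(ring, user_cmdbuf);  /* the user's command buffer */
        add_signal_command(ring, serial);    /* GPU writes 'serial' on completion */

        *ring->doorbell = FIRE;              /* kick the hw scheduler */

        mutex_unlock(&dev->submission_lock);
        return 0;
}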

Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák  wrote:

> What about direct submit from the kernel where the process still has write
> access to the GPU ring buffer but doesn't use it? I think that solves your
> preemption example, but leaves a potential backdoor for a process to
> overwrite the signal commands, which shouldn't be a problem since we are OK
> with timeouts.
>
> Marek
>
> On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand 
> wrote:
>
>> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>>  wrote:
>> >
>> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand 
>> wrote:
>> > >
>> > > Sorry for the top-post but there's no good thing to reply to here...
>> > >
>> > > One of the things pointed out to me recently by Daniel Vetter that I
>> > > didn't fully understand before is that dma_buf has a very subtle
>> > > second requirement beyond finite time completion:  Nothing required
>> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
>> > > of allocating memory may wait on your dma-fence.  This, as it turns
>> > > out, is a massively more strict requirement than finite time
>> > > completion and, I think, throws out all of the proposals we have so
>> > > far.
>> > >
>> > > Take, for instance, Marek's proposal for userspace involvement with
>> > > dma-fence by asking the kernel for a next serial and the kernel
>> > > trusting userspace to signal it.  That doesn't work at all if
>> > > allocating memory to trigger a dma-fence can blow up.  There's simply
>> > > no way for the kernel to trust userspace to not do ANYTHING which
>> > > might allocate memory.  I don't even think there's a way userspace can
>> > > trust itself there.  It also blows up my plan of moving the fences to
>> > > transition boundaries.
>> > >
>> > > Not sure where that leaves us.
>> >
>> > Honestly, the more I look at things, the more I think
>> > userspace-signalable fences with a timeout are a valid solution for
>> > these issues.  Especially since (as has been mentioned countless
>> > times in this email thread) userspace already has a lot of ways to
>> > cause timeouts and/or GPU hangs through GPU work.
>> >
>> > Adding a timeout on the signaling side of a dma_fence would ensure:
>> >
>> > - The dma_fence signals in finite time
>> > - If the timeout case does not allocate memory, then memory
>> > allocation is not a blocker for signaling.
>> >
>> > Of course you lose the full dependency graph and we need to make sure
>> > garbage collection of fences works correctly when we have cycles.
>> > However, the latter sounds very doable and the first sounds like it is
>> > to some extent inevitable.
>> >
>> > I feel like I'm missing some requirement here given that we
>> > immediately went to much more complicated things but can't find it.
>> > Thoughts?
>>
>> Timeouts are sufficient to protect the kernel but they make the fences
>> unpredictable and unreliable from a userspace PoV.  One of the big
>> problems we face is that, once we expose a dma_fence to userspace,
>> we've allowed for some pretty crazy potential dependencies that
>> neither userspace nor the kernel can sort out.  Say you have Marek's
>> "next serial, please" proposal and a multi-threaded application.
>> Between the time you ask the kernel for a serial and get a dma_fence
>> and the time you submit the work to signal that serial, your process
>> may get preempted, something else gets shoved in which allocates
>> memory, and then we end up blocking on that dma_fence.  There's no
>> way userspace can predict or defend itself against that.
>>
>> So I think where that leaves us is that there is no safe place to
>> create a dma_fence except for inside the ioctl which submits the work
>> and only after any necessary memory has been allocated.  That's a
>> pretty stiff requirement.  We may still be able to interact with
>> userspace a bit more explicitly but I think it throws any notion of
>> userspace direct submit out the window.
>>
>> --Jason
>>
>>
>> > - Bas
>> > >
>> > > --Jason
>> > >
>> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher 
>> wrote:
>> > > >
>> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák 
>> wrote:
>> > > > >
>> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer 
>> wrote:
>> > > > >>
>> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
>> > > > >> > Hi Dave,
>> > > > >> >
>> > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> > > > >> >> Supporting interop with any device is always possible. It
>> depends on which drivers we need to interoperate with and update them.
>> We've already found the path forward for amdgpu. We just need to find out
>> how many other 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Marek Olšák
What about direct submit from the kernel where the process still has write
access to the GPU ring buffer but doesn't use it? I think that solves your
preemption example, but leaves a potential backdoor for a process to
overwrite the signal commands, which shouldn't be a problem since we are OK
with timeouts.

Marek

On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand  wrote:

> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>  wrote:
> >
> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand 
> wrote:
> > >
> > > Sorry for the top-post but there's no good thing to reply to here...
> > >
> > > One of the things pointed out to me recently by Daniel Vetter that I
> > > didn't fully understand before is that dma_buf has a very subtle
> > > second requirement beyond finite time completion:  Nothing required
> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > > of allocating memory may wait on your dma-fence.  This, as it turns
> > > out, is a massively more strict requirement than finite time
> > > completion and, I think, throws out all of the proposals we have so
> > > far.
> > >
> > > Take, for instance, Marek's proposal for userspace involvement with
> > > dma-fence by asking the kernel for a next serial and the kernel
> > > trusting userspace to signal it.  That doesn't work at all if
> > > allocating memory to trigger a dma-fence can blow up.  There's simply
> > > no way for the kernel to trust userspace to not do ANYTHING which
> > > might allocate memory.  I don't even think there's a way userspace can
> > > trust itself there.  It also blows up my plan of moving the fences to
> > > transition boundaries.
> > >
> > > Not sure where that leaves us.
> >
> > Honestly, the more I look at things, the more I think
> > userspace-signalable fences with a timeout are a valid solution for
> > these issues.  Especially since (as has been mentioned countless
> > times in this email thread) userspace already has a lot of ways to
> > cause timeouts and/or GPU hangs through GPU work.
> >
> > Adding a timeout on the signaling side of a dma_fence would ensure:
> >
> > - The dma_fence signals in finite time
> > - If the timeout case does not allocate memory, then memory
> > allocation is not a blocker for signaling.
> >
> > Of course you lose the full dependency graph and we need to make sure
> > garbage collection of fences works correctly when we have cycles.
> > However, the latter sounds very doable and the first sounds like it is
> > to some extent inevitable.
> >
> > I feel like I'm missing some requirement here given that we
> > immediately went to much more complicated things but can't find it.
> > Thoughts?
>
> Timeouts are sufficient to protect the kernel but they make the fences
> unpredictable and unreliable from a userspace PoV.  One of the big
> problems we face is that, once we expose a dma_fence to userspace,
> we've allowed for some pretty crazy potential dependencies that
> neither userspace nor the kernel can sort out.  Say you have Marek's
> "next serial, please" proposal and a multi-threaded application.
> Between the time you ask the kernel for a serial and get a dma_fence
> and the time you submit the work to signal that serial, your process
> may get preempted, something else gets shoved in which allocates
> memory, and then we end up blocking on that dma_fence.  There's no
> way userspace can predict or defend itself against that.
>
> So I think where that leaves us is that there is no safe place to
> create a dma_fence except for inside the ioctl which submits the work
> and only after any necessary memory has been allocated.  That's a
> pretty stiff requirement.  We may still be able to interact with
> userspace a bit more explicitly but I think it throws any notion of
> userspace direct submit out the window.
>
> --Jason
>
>
> > - Bas
> > >
> > > --Jason
> > >
> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher 
> wrote:
> > > >
> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer 
> wrote:
> > > > >>
> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > > >> > Hi Dave,
> > > > >> >
> > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > > >> >> Supporting interop with any device is always possible. It
> depends on which drivers we need to interoperate with and update them.
> We've already found the path forward for amdgpu. We just need to find out
> how many other drivers need to be updated and evaluate the cost/benefit
> aspect.
> > > > >> >>
> > > > >> >> Marek
> > > > >> >>
> > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  > wrote:
> > > > >> >>
> > > > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> > > > >> >>  ckoenig.leichtzumer...@gmail.com>> wrote:
> > > > >> >> >
> > > > >> >> > Correct, we wouldn't have synchronization between device
> with and without user queues any more.
> > > > >> >> >
> > > > >> >> > That could 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Jason Ekstrand
On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
 wrote:
>
> On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand  wrote:
> >
> > Sorry for the top-post but there's no good thing to reply to here...
> >
> > One of the things pointed out to me recently by Daniel Vetter that I
> > didn't fully understand before is that dma_buf has a very subtle
> > second requirement beyond finite time completion:  Nothing required
> > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > of allocating memory may wait on your dma-fence.  This, as it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace involvement with
> > dma-fence by asking the kernel for a next serial and the kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up.  There's simply
> > no way for the kernel to trust userspace to not do ANYTHING which
> > might allocate memory.  I don't even think there's a way userspace can
> > trust itself there.  It also blows up my plan of moving the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Honestly, the more I look at things, the more I think
> userspace-signalable fences with a timeout are a valid solution for
> these issues.  Especially since (as has been mentioned countless
> times in this email thread) userspace already has a lot of ways to
> cause timeouts and/or GPU hangs through GPU work.
>
> Adding a timeout on the signaling side of a dma_fence would ensure:
>
> - The dma_fence signals in finite time
> - If the timeout case does not allocate memory, then memory
> allocation is not a blocker for signaling.
>
> Of course you lose the full dependency graph and we need to make sure
> garbage collection of fences works correctly when we have cycles.
> However, the latter sounds very doable and the first sounds like it is
> to some extent inevitable.
>
> I feel like I'm missing some requirement here given that we
> immediately went to much more complicated things but can't find it.
> Thoughts?

Timeouts are sufficient to protect the kernel but they make the fences
unpredictable and unreliable from a userspace PoV.  One of the big
problems we face is that, once we expose a dma_fence to userspace,
we've allowed for some pretty crazy potential dependencies that
neither userspace nor the kernel can sort out.  Say you have Marek's
"next serial, please" proposal and a multi-threaded application.
Between the time you ask the kernel for a serial and get a dma_fence
and the time you submit the work to signal that serial, your process
may get preempted, something else gets shoved in which allocates
memory, and then we end up blocking on that dma_fence.  There's no way
userspace can predict or defend itself against that.

So I think where that leaves us is that there is no safe place to
create a dma_fence except for inside the ioctl which submits the work
and only after any necessary memory has been allocated.  That's a
pretty stiff requirement.  We may still be able to interact with
userspace a bit more explicitly but I think it throws any notion of
userspace direct submit out the window.
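
Concretely, the ordering inside the submit ioctl would have to look
something like this sketch (every helper name here is made up):

int submit_ioctl(struct my_context *ctx, struct my_submit *args)
{
        struct dma_fence *fence;
        int ret;

        /* 1. Everything that can allocate or fail happens first. */
        ret = pin_backing_memory(ctx, args);
        if (ret)
                return ret;
        ret = validate_cmdbuf(ctx, args);
        if (ret)
                goto err_unpin;

        /* 2. Only now create and publish the dma_fence.  From this
         * point on, the path to signaling must not allocate memory or
         * depend on userspace doing anything. */
        fence = create_job_fence(ctx, args);
        push_job_to_ring(ctx, args, fence);
        return 0;

err_unpin:
        unpin_backing_memory(ctx, args);
        return ret;
}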

--Jason


> - Bas
> >
> > --Jason
> >
> > On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
> > >
> > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  
> > > > wrote:
> > > >>
> > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > >> > Hi Dave,
> > > >> >
> > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > >> >> Supporting interop with any device is always possible. It depends 
> > > >> >> on which drivers we need to interoperate with and update them. 
> > > >> >> We've already found the path forward for amdgpu. We just need to 
> > > >> >> find out how many other drivers need to be updated and evaluate the 
> > > >> >> cost/benefit aspect.
> > > >> >>
> > > >> >> Marek
> > > >> >>
> > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  > > >> >> > wrote:
> > > >> >>
> > > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> > > >> >>  > > >> >> > wrote:
> > > >> >> >
> > > >> >> > Correct, we wouldn't have synchronization between device with 
> > > >> >> and without user queues any more.
> > > >> >> >
> > > >> >> > That could only be a problem for A+I Laptops.
> > > >> >>
> > > >> >> Since I think you mentioned you'd only be enabling this on newer
> > > >> >> chipsets, won't it be a problem for A+A where one A is a 
> > > >> >> generation
> > > >> >> behind the other?
> > > >> >>
> > > >> >
> > > >> > Crap, that is a good point as well.
> > > >> >
> > > >> >>
> > > >> >> I'm not really liking where this is going btw, seems like a ill
> > > >> >> thought out concept, if AMD 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Bas Nieuwenhuizen
On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand  wrote:
>
> Sorry for the top-post but there's no good thing to reply to here...
>
> One of the things pointed out to me recently by Daniel Vetter that I
> didn't fully understand before is that dma_buf has a very subtle
> second requirement beyond finite time completion:  Nothing required
> for signaling a dma-fence can allocate memory.  Why?  Because the act
> of allocating memory may wait on your dma-fence.  This, as it turns
> out, is a massively more strict requirement than finite time
> completion and, I think, throws out all of the proposals we have so
> far.
>
> Take, for instance, Marek's proposal for userspace involvement with
> dma-fence by asking the kernel for a next serial and the kernel
> trusting userspace to signal it.  That doesn't work at all if
> allocating memory to trigger a dma-fence can blow up.  There's simply
> no way for the kernel to trust userspace to not do ANYTHING which
> might allocate memory.  I don't even think there's a way userspace can
> trust itself there.  It also blows up my plan of moving the fences to
> transition boundaries.
>
> Not sure where that leaves us.

Honestly, the more I look at things, the more I think
userspace-signalable fences with a timeout are a valid solution for
these issues.  Especially since (as has been mentioned countless times
in this email thread) userspace already has a lot of ways to cause
timeouts and/or GPU hangs through GPU work.

Adding a timeout on the signaling side of a dma_fence would ensure:

- The dma_fence signals in finite time
- If the timeout case does not allocate memory, then memory
allocation is not a blocker for signaling.

Of course you lose the full dependency graph and we need to make sure
garbage collection of fences works correctly when we have cycles.
However, the latter sounds very doable and the former sounds like it is
to some extent inevitable.
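
As a rough sketch of what the timeout side could look like (names are
made up; it's just a timer that force-signals the fence if userspace
never does):

struct user_fence {
        struct dma_fence base;
        struct timer_list timeout;   /* armed when the fence is created */
};

static void user_fence_timeout(struct timer_list *t)
{
        struct user_fence *uf = from_timer(uf, t, timeout);

        /* Force-signal without allocating, so the fence still meets the
         * dma_fence contract (finite time, no allocation on the
         * signaling path) even if userspace never signals it. */
        dma_fence_set_error(&uf->base, -ETIMEDOUT);
        dma_fence_signal(&uf->base);
}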

I feel like I'm missing some requirement here, given that we
immediately went to much more complicated things, but I can't find it.
Thoughts?

- Bas
>
> --Jason
>
> On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
> >
> > On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> > >
> > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
> > >>
> > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > >> > Hi Dave,
> > >> >
> > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > >> >> Supporting interop with any device is always possible. It depends on 
> > >> >> which drivers we need to interoperate with and update them. We've 
> > >> >> already found the path forward for amdgpu. We just need to find out 
> > >> >> how many other drivers need to be updated and evaluate the 
> > >> >> cost/benefit aspect.
> > >> >>
> > >> >> Marek
> > >> >>
> > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  > >> >> > wrote:
> > >> >>
> > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> > >> >>  > >> >> > wrote:
> > >> >> >
> > >> >> > Correct, we wouldn't have synchronization between device with 
> > >> >> and without user queues any more.
> > >> >> >
> > >> >> > That could only be a problem for A+I Laptops.
> > >> >>
> > >> >> Since I think you mentioned you'd only be enabling this on newer
> > >> >> chipsets, won't it be a problem for A+A where one A is a 
> > >> >> generation
> > >> >> behind the other?
> > >> >>
> > >> >
> > >> > Crap, that is a good point as well.
> > >> >
> > >> >>
> > >> >> I'm not really liking where this is going btw, seems like a ill
> > >> >> thought out concept, if AMD is really going down the road of 
> > >> >> designing
> > >> >> hw that is currently Linux incompatible, you are going to have to
> > >> >> accept a big part of the burden in bringing this support in to 
> > >> >> more
> > >> >> than just amd drivers for upcoming generations of gpu.
> > >> >>
> > >> >
> > >> > Well we don't really like that either, but we have no other option as 
> > >> > far as I can see.
> > >>
> > >> I don't really understand what "future hw may remove support for kernel 
> > >> queues" means exactly. While the per-context queues can be mapped to 
> > >> userspace directly, they don't *have* to be, do they? I.e. the kernel 
> > >> driver should be able to either intercept userspace access to the 
> > >> queues, or in the worst case do it all itself, and provide the existing 
> > >> synchronization semantics as needed?
> > >>
> > >> Surely there are resource limits for the per-context queues, so the 
> > >> kernel driver needs to do some kind of virtualization / multi-plexing 
>> > > > >> anyway, or we'll get sad user faces when there's no queue available for
>> > > > >> <game>.
> > >>
> > >> I'm probably missing something though, awaiting enlightenment. :)
> > >
> > >
> > > The hw interface for userspace is that the ring buffer is mapped to the 
> > > process address space alongside a doorbell aperture 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Jason Ekstrand
On Mon, May 3, 2021 at 10:03 AM Christian König
 wrote:
>
> Am 03.05.21 um 16:59 schrieb Jason Ekstrand:
> > Sorry for the top-post but there's no good thing to reply to here...
> >
> > One of the things pointed out to me recently by Daniel Vetter that I
> > didn't fully understand before is that dma_buf has a very subtle
> > second requirement beyond finite time completion:  Nothing required
> > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > of allocating memory may wait on your dma-fence.  This, as it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace involvement with
> > dma-fence by asking the kernel for a next serial and the kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up.  There's simply
> > no way for the kernel to trust userspace to not do ANYTHING which
> > might allocate memory.  I don't even think there's a way userspace can
> > trust itself there.  It also blows up my plan of moving the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Well at least I was perfectly aware of that :)

I'd have been a bit disappointed if this had been news to you. :-P
However, there are a number of us plebeians on the thread who need
things spelled out sometimes. :-)

> I'm currently experimenting with some sample code which would allow
> implicit sync with user fences.
>
> Not that I'm pushing hard into that directly, but I just want to make
> clear how simple or complex the whole thing would be.

I'd like to see that.  It'd be good to know what our options are.
Honestly, if we can get implicit sync somehow without tying our hands
w.r.t. how fences work in modern drivers, that opens a lot of doors.

--Jason

> Christian.
>
> >
> > --Jason
> >
> > On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
> >> On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> >>> On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
>  On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> >
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on 
> >> which drivers we need to interoperate with and update them. We've 
> >> already found the path forward for amdgpu. We just need to find out 
> >> how many other drivers need to be updated and evaluate the 
> >> cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  >> > wrote:
> >>
> >>  On Tue, 27 Apr 2021 at 22:06, Christian König
> >>   >> > wrote:
> >>  >
> >>  > Correct, we wouldn't have synchronization between device with 
> >> and without user queues any more.
> >>  >
> >>  > That could only be a problem for A+I Laptops.
> >>
> >>  Since I think you mentioned you'd only be enabling this on newer
> >>  chipsets, won't it be a problem for A+A where one A is a 
> >> generation
> >>  behind the other?
> >>
> > Crap, that is a good point as well.
> >
> >>  I'm not really liking where this is going btw, seems like a ill
> >>  thought out concept, if AMD is really going down the road of 
> >> designing
> >>  hw that is currently Linux incompatible, you are going to have to
> >>  accept a big part of the burden in bringing this support in to 
> >> more
> >>  than just amd drivers for upcoming generations of gpu.
> >>
> > Well we don't really like that either, but we have no other option as 
> > far as I can see.
>  I don't really understand what "future hw may remove support for kernel 
>  queues" means exactly. While the per-context queues can be mapped to 
>  userspace directly, they don't *have* to be, do they? I.e. the kernel 
>  driver should be able to either intercept userspace access to the 
>  queues, or in the worst case do it all itself, and provide the existing 
>  synchronization semantics as needed?
> 
>  Surely there are resource limits for the per-context queues, so the 
>  kernel driver needs to do some kind of virtualization / multi-plexing 
>  anyway, or we'll get sad user faces when there's no queue available for <game>.
> 
>  I'm probably missing something though, awaiting enlightenment. :)
> >>>
> >>> The hw interface for userspace is that the ring buffer is mapped to the 
> >>> process address space alongside a doorbell aperture (4K page) that isn't 
> >>> real memory, but when the CPU writes into it, it tells the hw scheduler 
> >>> that there are new GPU commands in the ring buffer. Userspace inserts all 
> >>> the wait, 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Christian König

Am 03.05.21 um 16:59 schrieb Jason Ekstrand:

Sorry for the top-post but there's no good thing to reply to here...

One of the things pointed out to me recently by Daniel Vetter that I
didn't fully understand before is that dma_buf has a very subtle
second requirement beyond finite time completion:  Nothing required
for signaling a dma-fence can allocate memory.  Why?  Because the act
of allocating memory may wait on your dma-fence.  This, as it turns
out, is a massively more strict requirement than finite time
completion and, I think, throws out all of the proposals we have so
far.

Take, for instance, Marek's proposal for userspace involvement with
dma-fence by asking the kernel for a next serial and the kernel
trusting userspace to signal it.  That doesn't work at all if
allocating memory to trigger a dma-fence can blow up.  There's simply
no way for the kernel to trust userspace to not do ANYTHING which
might allocate memory.  I don't even think there's a way userspace can
trust itself there.  It also blows up my plan of moving the fences to
transition boundaries.

Not sure where that leaves us.


Well at least I was perfectly aware of that :)

I'm currently experimenting with some sample code which would allow 
implicit sync with user fences.


Not that I'm pushing hard into that directly, but I just want to make 
clear how simple or complex the whole thing would be.


Christian.



--Jason

On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:

On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:

On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:

On 2021-04-28 8:59 a.m., Christian König wrote:

Hi Dave,

Am 27.04.21 um 21:23 schrieb Marek Olšák:

Supporting interop with any device is always possible. It depends on which 
drivers we need to interoperate with and update them. We've already found the 
path forward for amdgpu. We just need to find out how many other drivers need 
to be updated and evaluate the cost/benefit aspect.

Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:

 On Tue, 27 Apr 2021 at 22:06, Christian König
 <ckoenig.leichtzumer...@gmail.com> wrote:
 >
 > Correct, we wouldn't have synchronization between device with and 
without user queues any more.
 >
 > That could only be a problem for A+I Laptops.

 Since I think you mentioned you'd only be enabling this on newer
 chipsets, won't it be a problem for A+A where one A is a generation
 behind the other?


Crap, that is a good point as well.


 I'm not really liking where this is going btw, seems like an
 ill-thought-out concept. If AMD is really going down the road of
 designing hw that is currently Linux-incompatible, you are going to
 have to accept a big part of the burden in bringing this support into
 more than just amd drivers for upcoming generations of gpu.


Well we don't really like that either, but we have no other option as far as I 
can see.

I don't really understand what "future hw may remove support for kernel queues" 
means exactly. While the per-context queues can be mapped to userspace directly, they 
don't *have* to be, do they? I.e. the kernel driver should be able to either intercept 
userspace access to the queues, or in the worst case do it all itself, and provide the 
existing synchronization semantics as needed?

Surely there are resource limits for the per-context queues, so the kernel driver 
needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user 
faces when there's no queue available for <game>.

I'm probably missing something though, awaiting enlightenment. :)


The hw interface for userspace is that the ring buffer is mapped to the process address 
space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU 
writes into it, it tells the hw scheduler that there are new GPU commands in the ring 
buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer 
and then "rings" the doorbell. It's my understanding that the ring buffer and 
the doorbell are always mapped in the same GPU address space as the process, which makes 
it very difficult to emulate the current protected ring buffers in the kernel. The VMID 
of the ring buffer is also not changeable.


The doorbell does not have to be mapped into the process's GPU virtual
address space.  The CPU could write to it directly.  Mapping it into
the GPU's virtual address space would, however, allow a device rather
than the CPU to kick off work.  E.g., the GPU could kick off its own
work, or multiple devices could kick off work without CPU involvement.

Alex



The hw scheduler doesn't do any synchronization and it doesn't see any 
dependencies. It only chooses which queue to execute, so it's really just a 
simple queue manager handling the virtualization aspect and not much else.

Marek
___
dri-devel mailing list

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Jason Ekstrand
Sorry for the top-post but there's no good thing to reply to here...

One of the things pointed out to me recently by Daniel Vetter that I
didn't fully understand before is that dma_buf has a very subtle
second requirement beyond finite time completion:  Nothing required
for signaling a dma-fence can allocate memory.  Why?  Because the act
of allocating memory may wait on your dma-fence.  This, as it turns
out, is a massively more strict requirement than finite time
completion and, I think, throws out all of the proposals we have so
far.
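
To spell the hazard out (my own illustration, not something from
Daniel's mail), the problematic shape is a signaling path that
allocates:

/* Illustrative only; 'my_job', 'my_node' and record_completion() are
 * made-up placeholders. */
static void finish_job(struct my_job *job)
{
        /* GFP_KERNEL may enter direct reclaim under memory pressure... */
        struct my_node *node = kmalloc(sizeof(*node), GFP_KERNEL);

        /* ...and reclaim/eviction may end up waiting on fences,
         * potentially including job->fence, which only signals further
         * down this very path.  That's the circular wait. */
        record_completion(node, job);
        dma_fence_signal(job->fence);
}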

Take, for instance, Marek's proposal for userspace involvement with
dma-fence by asking the kernel for a next serial and the kernel
trusting userspace to signal it.  That doesn't work at all if
allocating memory to trigger a dma-fence can blow up.  There's simply
no way for the kernel to trust userspace to not do ANYTHING which
might allocate memory.  I don't even think there's a way userspace can
trust itself there.  It also blows up my plan of moving the fences to
transition boundaries.

Not sure where that leaves us.

--Jason

On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:
>
> On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> >
> > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
> >>
> >> On 2021-04-28 8:59 a.m., Christian König wrote:
> >> > Hi Dave,
> >> >
> >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> >> Supporting interop with any device is always possible. It depends on 
> >> >> which drivers we need to interoperate with and update them. We've 
> >> >> already found the path forward for amdgpu. We just need to find out how 
> >> >> many other drivers need to be updated and evaluate the cost/benefit 
> >> >> aspect.
> >> >>
> >> >> Marek
> >> >>
> >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  >> >> > wrote:
> >> >>
> >> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> >> >>  >> >> > wrote:
> >> >> >
> >> >> > Correct, we wouldn't have synchronization between device with and 
> >> >> without user queues any more.
> >> >> >
> >> >> > That could only be a problem for A+I Laptops.
> >> >>
> >> >> Since I think you mentioned you'd only be enabling this on newer
> >> >> chipsets, won't it be a problem for A+A where one A is a generation
> >> >> behind the other?
> >> >>
> >> >
> >> > Crap, that is a good point as well.
> >> >
> >> >>
> >> >> I'm not really liking where this is going btw, seems like a ill
> >> >> thought out concept, if AMD is really going down the road of 
> >> >> designing
> >> >> hw that is currently Linux incompatible, you are going to have to
> >> >> accept a big part of the burden in bringing this support in to more
> >> >> than just amd drivers for upcoming generations of gpu.
> >> >>
> >> >
> >> > Well we don't really like that either, but we have no other option as 
> >> > far as I can see.
> >>
> >> I don't really understand what "future hw may remove support for kernel 
> >> queues" means exactly. While the per-context queues can be mapped to 
> >> userspace directly, they don't *have* to be, do they? I.e. the kernel 
> >> driver should be able to either intercept userspace access to the queues, 
> >> or in the worst case do it all itself, and provide the existing 
> >> synchronization semantics as needed?
> >>
> >> Surely there are resource limits for the per-context queues, so the kernel 
> >> driver needs to do some kind of virtualization / multi-plexing anyway, or 
> >> we'll get sad user faces when there's no queue available for <game>.
> >>
> >> I'm probably missing something though, awaiting enlightenment. :)
> >
> >
> > The hw interface for userspace is that the ring buffer is mapped to the 
> > process address space alongside a doorbell aperture (4K page) that isn't 
> > real memory, but when the CPU writes into it, it tells the hw scheduler 
> > that there are new GPU commands in the ring buffer. Userspace inserts all 
> > the wait, draw, and signal commands into the ring buffer and then "rings" 
> > the doorbell. It's my understanding that the ring buffer and the doorbell 
> > are always mapped in the same GPU address space as the process, which makes 
> > it very difficult to emulate the current protected ring buffers in the 
> > kernel. The VMID of the ring buffer is also not changeable.
> >
>
> The doorbell does not have to be mapped into the process's GPU virtual
> address space.  The CPU could write to it directly.  Mapping it into
> the GPU's virtual address space would, however, allow a device rather
> than the CPU to kick off work.  E.g., the GPU could kick off its own
> work, or multiple devices could kick off work without CPU involvement.
>
> Alex
>
>
> > The hw scheduler doesn't do any synchronization and it doesn't see any 
> > dependencies. It only chooses which queue to execute, so it's really just a 
> > simple queue manager handling the virtualization 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Alex Deucher
On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
>
> On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:
>>
>> On 2021-04-28 8:59 a.m., Christian König wrote:
>> > Hi Dave,
>> >
>> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> >> Supporting interop with any device is always possible. It depends on 
>> >> which drivers we need to interoperate with and update them. We've already 
>> >> found the path forward for amdgpu. We just need to find out how many 
>> >> other drivers need to be updated and evaluate the cost/benefit aspect.
>> >>
>> >> Marek
>> >>
>> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie > >> > wrote:
>> >>
>> >> On Tue, 27 Apr 2021 at 22:06, Christian König
>> >> > >> > wrote:
>> >> >
>> >> > Correct, we wouldn't have synchronization between device with and 
>> >> without user queues any more.
>> >> >
>> >> > That could only be a problem for A+I Laptops.
>> >>
>> >> Since I think you mentioned you'd only be enabling this on newer
>> >> chipsets, won't it be a problem for A+A where one A is a generation
>> >> behind the other?
>> >>
>> >
>> > Crap, that is a good point as well.
>> >
>> >>
>> >> I'm not really liking where this is going btw, seems like an
>> >> ill-thought-out concept. If AMD is really going down the road of
>> >> designing hw that is currently Linux-incompatible, you are going to
>> >> have to accept a big part of the burden in bringing this support
>> >> into more than just amd drivers for upcoming generations of gpu.
>> >>
>> >
>> > Well we don't really like that either, but we have no other option as far 
>> > as I can see.
>>
>> I don't really understand what "future hw may remove support for kernel 
>> queues" means exactly. While the per-context queues can be mapped to 
>> userspace directly, they don't *have* to be, do they? I.e. the kernel driver 
>> should be able to either intercept userspace access to the queues, or in the 
>> worst case do it all itself, and provide the existing synchronization 
>> semantics as needed?
>>
>> Surely there are resource limits for the per-context queues, so the kernel 
>> driver needs to do some kind of virtualization / multi-plexing anyway, or 
>> we'll get sad user faces when there's no queue available for <game>.
>>
>> I'm probably missing something though, awaiting enlightenment. :)
>
>
> The hw interface for userspace is that the ring buffer is mapped to the 
> process address space alongside a doorbell aperture (4K page) that isn't real 
> memory, but when the CPU writes into it, it tells the hw scheduler that there 
> are new GPU commands in the ring buffer. Userspace inserts all the wait, 
> draw, and signal commands into the ring buffer and then "rings" the doorbell. 
> It's my understanding that the ring buffer and the doorbell are always mapped 
> in the same GPU address space as the process, which makes it very difficult 
> to emulate the current protected ring buffers in the kernel. The VMID of the 
> ring buffer is also not changeable.
>

The doorbell does not have to be mapped into the process's GPU virtual
address space.  The CPU could write to it directly.  Mapping it into
the GPU's virtual address space would, however, allow a device rather
than the CPU to kick off work.  E.g., the GPU could kick off its own
work, or multiple devices could kick off work without CPU
involvement.
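
For anyone trying to picture the userspace side Marek describes, it
boils down to something like this sketch (the packet format, sizes,
and names are invented, not the real hardware interface):

struct user_queue {
        uint32_t          *ring;      /* ring buffer mapped into the process */
        uint32_t           ring_dw;   /* ring size in dwords */
        volatile uint64_t *doorbell;  /* doorbell page mapped into the process */
        uint64_t           wptr;      /* write pointer in dwords */
};

static void submit(struct user_queue *q, const uint32_t *pkts, uint32_t count)
{
        /* Copy the wait/draw/signal packets into the ring... */
        for (uint32_t i = 0; i < count; i++)
                q->ring[(q->wptr + i) % q->ring_dw] = pkts[i];
        q->wptr += count;

        /* ...then "ring" the doorbell so the hw scheduler picks up the
         * new work. */
        *q->doorbell = q->wptr;
}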

Alex


> The hw scheduler doesn't do any synchronization and it doesn't see any 
> dependencies. It only chooses which queue to execute, so it's really just a 
> simple queue manager handling the virtualization aspect and not much else.
>
> Marek