[RFC] Explicit synchronization for Nouveau

2014-10-08 Thread Daniel Vetter
On Mon, Oct 6, 2014 at 2:25 PM, Lauri Peltonen  wrote:
>> Also, the problem is that to actually push android stuff out of staging
>> you need a use-case in upstream, which means an open-source gpu driver.
>> There's not a lot of companies who have both that and ship android, and
>> definitely not the nexus/android lead platforms.
>>
>> Display side would be easier since there's a bunch of kms drivers now
>> upstream. But given that google decided to go ahead with their own adf
>> instead of drm-kms that's also a non-starter.
>
> Hmm..  Maybe we could use TegraDRM on the display side..  That and Nouveau
> would already be two upstream drivers that support explicit sync on Tegra K1.
>
> Also, if we bring sync fds out of staging, one idea would be to add support
> for EGL_ANDROID_native_fence_sync in mesa, along with some tests.  That would
> demonstrate converting between sync fds and EGLSync objects.

Just read through the extension spec for this and this sounds
excellent. So enabling that in mesa and having some basic piglits for
the conversion should be more than good enough to fulfill the
open-source userspace driver requirement.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[RFC] Explicit synchronization for Nouveau

2014-10-06 Thread Lauri Peltonen
On Thu, Oct 02, 2014 at 10:44:05PM +0200, Daniel Vetter wrote:
> On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
> > Yes, that will probably work!  So, just to reiterate that I understood you 
> > and
> > Daniel correctly:
> > 
> > - de-stage sync_fence and its user space API (the tedious part)
> > - add dma-buf ioctls for extracting and attaching explicit fences
> > - Nouveau specific: allow flagging gem buffers for explicit sync
> >   - do not check pre-fences from explicitly synced buffers at submit 
> >   - continue to attach a shared fence after submit so that pinning and
> > unmapping continue to work
> 
> - Having explicit in/out fences for the pushbuf ioctl is missing in this
>   step, I guess?

Yes, I was missing that. :)


> I also think we need some kind of demonstration vehicle using nouveau to
> satisfy Dave Airlie's open-source userspace requirements for new
> interfaces. Might be good to chat with him to make sure we have that
> covered (and how much needs to be there really).

Agreed.


> Also, the problem is that to actually push android stuff out of staging
> you need a use-case in upstream, which means an open-source gpu driver.
> There's not a lot of companies who have both that and ship android, and
> definitely not the nexus/android lead platforms.
> 
> Display side would be easier since there's a bunch of kms drivers now
> upstream. But given that google decided to go ahead with their own adf
> instead of drm-kms that's also a non-starter.

Hmm..  Maybe we could use TegraDRM on the display side..  That and Nouveau
would already be two upstream drivers that support explicit sync on Tegra K1.

Also, if we bring sync fds out of staging, one idea would be to add support
for EGL_ANDROID_native_fence_sync in mesa, along with some tests.  That would
demonstrate converting between sync fds and EGLSync objects.
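
For illustration, a rough sketch of what that conversion could look like,
using the entry points defined by EGL_KHR_fence_sync and
EGL_ANDROID_native_fence_sync (this assumes the KHR/ANDROID functions have
already been resolved with eglGetProcAddress and are callable under their
usual names; error handling omitted):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

/* sync fd -> EGLSync: on success EGL takes ownership of the fd. */
static EGLSyncKHR sync_fd_to_eglsync(EGLDisplay dpy, int fd)
{
        EGLint attribs[] = {
                EGL_SYNC_NATIVE_FENCE_FD_ANDROID, fd,
                EGL_NONE,
        };

        return eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, attribs);
}

/* EGLSync -> sync fd: create a native fence sync with no fd, flush so the
 * fence actually reaches the GPU, then dup the fd out of the sync object. */
static int eglsync_to_sync_fd(EGLDisplay dpy)
{
        EGLSyncKHR sync = eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID,
                                           NULL);

        glFlush();
        return eglDupNativeFenceFDANDROID(dpy, sync);
}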


> Hm, usually we expose such test interfaces through debugfs - that way
> production system won't ever ship with it (since there's too many exploits in
> there, especially with secure boot). But since you need it for validation
> tests (at least for the i915 suite) it should always be there when you need
> it.
> 
> Exposing this as a configurable driver in dev is imo a no-go. But we
> should be able to easily convert this into a few debugfs files, so not too
much fuss, hopefully.

Good idea!
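
For reference, a minimal sketch of what such a debugfs test hook could look
like on the kernel side (all names here are invented for illustration; the
real file would live wherever nouveau registers its debugfs entries):

#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/seq_file.h>

static int fence_test_show(struct seq_file *m, void *unused)
{
        seq_puts(m, "explicit sync test interface\n");
        return 0;
}

static int fence_test_open(struct inode *inode, struct file *file)
{
        return single_open(file, fence_test_show, inode->i_private);
}

static const struct file_operations fence_test_fops = {
        .owner   = THIS_MODULE,
        .open    = fence_test_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = single_release,
};

/* in driver init, next to the other debugfs files:
 * debugfs_create_file("fence_test", 0444, debugfs_root, NULL,
 *                     &fence_test_fops);
 */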


> > > Aside: Will you be at XDC or linux plumbers? Either would be a perfect
> > > place to discuss plans and ideas - I'll attend both.
> > 
> > I wasn't going to, but let's see.  The former is pretty soon and the latter 
> > is
> > sold out.  At least Andy Ritger from Nvidia is coming to XDC for sure, and 
> > he's
> > been involved in our internal discussions around these topics. So I suggest 
> > you
> > have a chat with him at least!  :)
> 
> I'll definitely have a chat (and some beers) with Andy, been a while I've
> last seen him ;-)

Change of plans, I'll attend XDC, so see you there!  I'll even give a short
talk about explicit sync to get some discussions going. :)


Thanks, 
Lauri



[RFC] Explicit synchronization for Nouveau

2014-10-03 Thread Rom Lemarchand
Riley (CCed) and I will be at Plumbers in a couple weeks.

There is a session on sync planned in the Android track, and of course
we'll be available to chat.

On Thu, Oct 2, 2014 at 1:44 PM, Daniel Vetter  wrote:

> On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
> > +Rom who seems to be presenting about mainlining android sync at linux
> plumbers
>
> Also add Greg KH as fyi that we're working on de-stage one of the android
> subsystems.
>
> > On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
> > > You could neuter implicit fences by always attaching the fences as
> > > shared when explicit syncing is used. This would work correctly with
> > > eviction, and wouldn't cause any unneeded syncing. :)
> >
> > Yes, that will probably work!  So, just to reiterate that I understood
> you and
> > Daniel correctly:
> >
> > - de-stage sync_fence and its user space API (the tedious part)
> > - add dma-buf ioctls for extracting and attaching explicit fences
> > - Nouveau specific: allow flagging gem buffers for explicit sync
> >   - do not check pre-fences from explicitly synced buffers at submit
> >   - continue to attach a shared fence after submit so that pinning and
> > unmapping continue to work
>
> - Having explicit in/out fences for the pushbuf ioctl is missing in this
>   step, I guess?
>
> I also think we need some kind of demonstration vehicle using nouveau to
> satisfy Dave Airlie's open-source userspace requirements for new
> interfaces. Might be good to chat with him to make sure we have that
> covered (and how much needs to be there really).
>
> > Then working sets and getting rid of locking all buffers individually
> > can be dealt with later as an optimization.
>
> Yeah, sounds like a good plan.
>
> > On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
> > > On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
> > > > Implicit fences attached to individual buffers are one way for
> residency
> > > > management.  Do you think a working set based model could work in
> the DRM
> > > > framework?  For example, something like this:
> > > >
> > > > - Allow user space to create "working set objects" and associate
> buffers with
> > > >   them.  If the user space doesn't want to manage working sets
> explicitly, it
> > > >   could also use an implicit default working set that contains all
> buffers that
> > > >   are mapped to the channel vm (on Android we could always use the
> default
> > > >   working set since we don't need to manage residency).  The working
> sets are
> > > >   initially marked as dirty.
> > > > - User space tells which working sets are referenced by each work
> submission.
> > > >   Kernel locks these working sets, pins all buffers in dirty working
> sets, and
> > > >   resets the dirty bits.  After kicking off work, kernel stores the
> fence to
> > > >   the _working sets_, and then releases the locks (if an implicit
> default
> > > >   working set is used, then this would be roughly equivalent to
> storing a fence
> > > >   to channel vm that tells "this is the last hw operation that might
> have
> > > >   touched buffers in this address space").
> > > > - If swapping doesn't happen, then we just need to check the working
> set dirty
> > > >   bits at each submit.
> > > > - When a buffer is swapped out, all working sets that refer to it
> need to be
> > > >   marked as dirty.
> > > > - When a buffer is swapped out or unmapped, we need to wait for the
> fences from
> > > >   all working sets that refer to the buffer.
> > > >
> > > > Initially one might think of working sets as a mere optimization -
> we now need
> > > > to process a few working sets at every submit instead of many
> individual
> > > > buffers.  However, it makes a huge difference because of fences:
> fences that
> > > > are attached to buffers are used for implicitly synchronizing work
> across
> > > > different channels and engines.  They are in the performance
> critical path, and
> > > > we want to carefully manage them (that's the idea of explicit
> synchronization).
> > > > The working set fences, on the other hand, would only be used to
> guarantee that
> > > > we don't swap out or unmap something that the GPU might be
> accessing.  We never
> > > > need to wait for those fences (except when swapping or unmapping),
> so we can be
> > > > conservative without hurting performance.
> > >
> > > Yeah, within the driver (i.e. for private objects which are never
> exported
> > > to dma_buf) we can recently do stuff like this. And your above idea is
> > > roughly one of the things we're tossing around for i915.
> > >
> > > But the cool stuff with drm is that cmd submission is driver-specific,
> so
> > > you can just go wild with nouveau. Of course you have to convince the
> > > nouveau guys (and also have open-source users for the new interface).
> > >
> > > For shared buffers I think we should stick with the implicit fences
> for a
> > > while simply 

[RFC] Explicit synchronization for Nouveau

2014-10-02 Thread Daniel Vetter
On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
> +Rom who seems to be presenting about mainlining android sync at linux 
> plumbers

Also add Greg KH as fyi that we're working on de-stage one of the android
subsystems.

> On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
> > You could neuter implicit fences by always attaching the fences as
> > shared when explicit syncing is used. This would work correctly with
> > eviction, and wouldn't cause any unneeded syncing. :)
> 
> Yes, that will probably work!  So, just to reiterate that I understood you and
> Daniel correctly:
> 
> - de-stage sync_fence and its user space API (the tedious part)
> - add dma-buf ioctls for extracting and attaching explicit fences
> - Nouveau specific: allow flagging gem buffers for explicit sync
>   - do not check pre-fences from explicitly synced buffers at submit 
>   - continue to attach a shared fence after submit so that pinning and
> unmapping continue to work

- Having explicit in/out fences for the pushbuf ioctl is missing in this
  step, I guess? (A possible ioctl shape is sketched below.)
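
Purely as a sketch of that missing piece (this is not the existing nouveau
uapi; the struct, field and flag names below are invented for illustration):

#include <linux/types.h>

#define NOUVEAU_GEM_PUSHBUF_FENCE_WAIT  (1 << 0)  /* wait on fence_fd_in  */
#define NOUVEAU_GEM_PUSHBUF_FENCE_EMIT  (1 << 1)  /* return fence_fd_out  */

struct drm_nouveau_gem_pushbuf2 {       /* hypothetical */
        __u32 channel;
        __u32 nr_buffers;
        __u64 buffers;
        __u32 nr_push;
        __u64 push;
        __u32 flags;                    /* NOUVEAU_GEM_PUSHBUF_FENCE_*     */
        __s32 fence_fd_in;              /* sync fd the job waits for       */
        __s32 fence_fd_out;             /* sync fd signalled on completion */
};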

I also think we need some kind of demonstration vehicle using nouveau to
satisfy Dave Airlie's open-source userspace requirements for new
interfaces. Might be good to chat with him to make sure we have that
covered (and how much needs to be there really).

> Then working sets and getting rid of locking all buffers individually 
> can be dealt with later as an optimization.

Yeah, sounds like a good plan.

> On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
> > On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
> > > Implicit fences attached to individual buffers are one way for residency
> > > management.  Do you think a working set based model could work in the DRM
> > > framework?  For example, something like this:
> > > 
> > > - Allow user space to create "working set objects" and associate buffers 
> > > with
> > >   them.  If the user space doesn't want to manage working sets 
> > > explicitly, it
> > >   could also use an implicit default working set that contains all 
> > > buffers that
> > >   are mapped to the channel vm (on Android we could always use the default
> > >   working set since we don't need to manage residency).  The working sets 
> > > are
> > >   initially marked as dirty.
> > > - User space tells which working sets are referenced by each work 
> > > submission.
> > >   Kernel locks these working sets, pins all buffers in dirty working 
> > > sets, and
> > >   resets the dirty bits.  After kicking off work, kernel stores the fence 
> > > to
> > >   the _working sets_, and then releases the locks (if an implicit default
> > >   working set is used, then this would be roughly equivalent to storing a 
> > > fence
> > >   to channel vm that tells "this is the last hw operation that might have
> > >   touched buffers in this address space").
> > > - If swapping doesn't happen, then we just need to check the working set 
> > > dirty
> > >   bits at each submit.
> > > - When a buffer is swapped out, all working sets that refer to it need to 
> > > be
> > >   marked as dirty.
> > > - When a buffer is swapped out or unmapped, we need to wait for the 
> > > fences from
> > >   all working sets that refer to the buffer.
> > > 
> > > Initially one might think of working sets as a mere optimization - we now 
> > > need
> > > to process a few working sets at every submit instead of many individual
> > > buffers.  However, it makes a huge difference because of fences: fences 
> > > that
> > > are attached to buffers are used for implicitly synchronizing work across
> > > different channels and engines.  They are in the performance critical 
> > > path, and
> > > we want to carefully manage them (that's the idea of explicit 
> > > synchronization).
> > > The working set fences, on the other hand, would only be used to 
> > > guarantee that
> > > we don't swap out or unmap something that the GPU might be accessing.  We 
> > > never
> > > need to wait for those fences (except when swapping or unmapping), so we 
> > > can be
> > > conservative without hurting performance.
> > 
> > Yeah, within the driver (i.e. for private objects which are never exported
> > to dma_buf) we have recently become able to do stuff like this. And your above idea is
> > roughly one of the things we're tossing around for i915.
> > 
> > But the cool stuff with drm is that cmd submission is driver-specific, so
> > you can just go wild with nouveau. Of course you have to convince the
> > nouveau guys (and also have open-source users for the new interface).
> > 
> > For shared buffers I think we should stick with the implicit fences for a
> > while simply because I'm not sure whether it's really worth the fuss. And
> > reworking all the drivers and dma-buf for some working sets is a lot of
> > fuss ;-) Like Maarten said you can mostly short-circuit the implicit
> > fencing by only attaching shared fences.
> 
> Yes, I'll try to do that.
> 
> 
> > 

[RFC] Explicit synchronization for Nouveau

2014-10-02 Thread Lauri Peltonen
+Rom who seems to be presenting about mainlining android sync at linux plumbers


On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
> You could neuter implicit fences by always attaching the fences as
> shared when explicit syncing is used. This would work correctly with
> eviction, and wouldn't cause any unneeded syncing. :)

Yes, that will probably work!  So, just to reiterate that I understood you and
Daniel correctly:

- de-stage sync_fence and its user space API (the tedious part)
- add dma-buf ioctls for extracting and attaching explicit fences (a possible
  shape for these is sketched below)
- Nouveau specific: allow flagging gem buffers for explicit sync
  - do not check pre-fences from explicitly synced buffers at submit 
  - continue to attach a shared fence after submit so that pinning and
unmapping continue to work
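
A possible shape for those dma-buf ioctls, purely as a sketch (the names and
ioctl numbers are made up for illustration and not taken from any existing
header):

#include <linux/ioctl.h>
#include <linux/types.h>

struct dma_buf_fence_arg {              /* hypothetical */
        __u32 flags;    /* e.g. read vs. write access */
        __s32 fd;       /* sync fd: filled in on extract, consumed on attach */
};

#define DMA_BUF_IOCTL_EXTRACT_FENCE _IOWR('b', 0x10, struct dma_buf_fence_arg)
#define DMA_BUF_IOCTL_ATTACH_FENCE  _IOW('b',  0x11, struct dma_buf_fence_arg)

User space would call the extract ioctl on an imported dma-buf to get a sync
fd it can pass to its own submit path, and the attach ioctl before handing an
explicitly synced buffer to an implicit-sync consumer.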

Then working sets and getting rid of locking all buffers individually 
can be dealt with later as an optimization.


On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
> On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
> > Implicit fences attached to individual buffers are one way for residency
> > management.  Do you think a working set based model could work in the DRM
> > framework?  For example, something like this:
> > 
> > - Allow user space to create "working set objects" and associate buffers 
> > with
> >   them.  If the user space doesn't want to manage working sets explicitly, 
> > it
> >   could also use an implicit default working set that contains all buffers 
> > that
> >   are mapped to the channel vm (on Android we could always use the default
> >   working set since we don't need to manage residency).  The working sets 
> > are
> >   initially marked as dirty.
> > - User space tells which working sets are referenced by each work 
> > submission.
> >   Kernel locks these working sets, pins all buffers in dirty working sets, 
> > and
> >   resets the dirty bits.  After kicking off work, kernel stores the fence to
> >   the _working sets_, and then releases the locks (if an implicit default
> >   working set is used, then this would be roughly equivalent to storing a 
> > fence
> >   to channel vm that tells "this is the last hw operation that might have
> >   touched buffers in this address space").
> > - If swapping doesn't happen, then we just need to check the working set 
> > dirty
> >   bits at each submit.
> > - When a buffer is swapped out, all working sets that refer to it need to be
> >   marked as dirty.
> > - When a buffer is swapped out or unmapped, we need to wait for the fences 
> > from
> >   all working sets that refer to the buffer.
> > 
> > Initially one might think of working sets as a mere optimization - we now 
> > need
> > to process a few working sets at every submit instead of many individual
> > buffers.  However, it makes a huge difference because of fences: fences that
> > are attached to buffers are used for implicitly synchronizing work across
> > different channels and engines.  They are in the performance critical path, 
> > and
> > we want to carefully manage them (that's the idea of explicit 
> > synchronization).
> > The working set fences, on the other hand, would only be used to guarantee 
> > that
> > we don't swap out or unmap something that the GPU might be accessing.  We 
> > never
> > need to wait for those fences (except when swapping or unmapping), so we 
> > can be
> > conservative without hurting performance.
> 
> Yeah, within the driver (i.e. for private objects which are never exported
> to dma_buf) we have recently become able to do stuff like this. And your above idea is
> roughly one of the things we're tossing around for i915.
> 
> But the cool stuff with drm is that cmd submission is driver-specific, so
> you can just go wild with nouveau. Of course you have to convince the
> nouveau guys (and also have open-source users for the new interface).
> 
> For shared buffers I think we should stick with the implicit fences for a
> while simply because I'm not sure whether it's really worth the fuss. And
> reworking all the drivers and dma-buf for some working sets is a lot of
> fuss ;-) Like Maarten said you can mostly short-circuit the implicit
> fencing by only attaching shared fences.

Yes, I'll try to do that.


> In case you're curious: The idea is to have a 1:1 association between
> ppgtt address spaces and what you call the working set above, to implement
> the buffer svm model in ocl2. Mostly because we expect that applications
> won't get the more fine-grained buffer list right anyway. And this kind of
> gang-scheduling of working set sizes should be more efficient for the
> usual case where everything fits.

If I understood correctly, this would be exactly the same as what I called the
"default working set" above.  On Android we don't care much about finer grained
working sets either, because of UMA and no swapping.


> > > Imo de-staging the android syncpt stuff needs to happen first, before 
> > > drivers
> > > can use it. 

[RFC] Explicit synchronization for Nouveau

2014-10-01 Thread Daniel Vetter
On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
> Thanks Daniel for your input!
> 
> On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
> > On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
> > > (2) Stop automatically storing fences to the buffers that user space 
> > > wants to
> > > synchronize explicitly.
> > 
> > The problem with this approach is that you then need hw faulting to make
> > sure the memory is there. Implicit fences aren't just used for syncing,
> > but also to make sure that the gpu still has access to the buffer as long
> > as it needs it. So you need at least a non-exclusive fence attached for
> > each command submission.
> > 
> > Of course on Android you don't have swap (would kill the puny mmc within
> > seconds) and you don't care for letting userspace pin most of memory for
> > gfx. So you'll get away with no fences at all. But for upstream I don't
> > see a good solution unfortunately. Ideas very much welcome.
> > 
> > > (3) Allow user space to attach an explicit fence to dma-buf when 
> > > exporting to
> > > another driver that uses implicit sync.
> > > 
> > > There are still some open issues beyond these.  For example, can we skip
> > > acquiring the ww mutex for explicitly synchronized buffers?  I think we 
> > > could
> > > eventually, at least on unified memory systems where we don't need to 
> > > migrate
> > > between heaps (our downstream Tegra GPU driver does not lock any buffers 
> > > at
> > > submit, it just grabs refcounts for hw).  Another quirk is that now 
> > > Nouveau
> > > waits on the buffer fences when closing the gem object to ensure that it
> > > doesn't unmap too early.  We need to rework that for explicit sync, but 
> > > that
> > > shouldn't be difficult.
> > 
> > See above, but you can't avoid attaching fences as long as we still use a
> > buffer-object based gfx memory management model. At least afaics. Which
> > means you need the ordering guarantees imposed by ww mutexes to ensure
> > that the oddball implicit ordered client can't deadlock the kernel's
> > memory management code.
> 
> Implicit fences attached to individual buffers are one way for residency
> management.  Do you think a working set based model could work in the DRM
> framework?  For example, something like this:
> 
> - Allow user space to create "working set objects" and associate buffers with
>   them.  If the user space doesn't want to manage working sets explicitly, it
>   could also use an implicit default working set that contains all buffers 
> that
>   are mapped to the channel vm (on Android we could always use the default
>   working set since we don't need to manage residency).  The working sets are
>   initially marked as dirty.
> - User space tells which working sets are referenced by each work submission.
>   Kernel locks these working sets, pins all buffers in dirty working sets, and
>   resets the dirty bits.  After kicking off work, kernel stores the fence to
>   the _working sets_, and then releases the locks (if an implicit default
>   working set is used, then this would be roughly equivalent to storing a 
> fence
>   to channel vm that tells "this is the last hw operation that might have
>   touched buffers in this address space").
> - If swapping doesn't happen, then we just need to check the working set dirty
>   bits at each submit.
> - When a buffer is swapped out, all working sets that refer to it need to be
>   marked as dirty.
> - When a buffer is swapped out or unmapped, we need to wait for the fences 
> from
>   all working sets that refer to the buffer.
> 
> Initially one might think of working sets as a mere optimization - we now need
> to process a few working sets at every submit instead of many individual
> buffers.  However, it makes a huge difference because of fences: fences that
> are attached to buffers are used for implicitly synchronizing work across
> different channels and engines.  They are in the performance critical path, 
> and
> we want to carefully manage them (that's the idea of explicit 
> synchronization).
> The working set fences, on the other hand, would only be used to guarantee 
> that
> we don't swap out or unmap something that the GPU might be accessing.  We 
> never
> need to wait for those fences (except when swapping or unmapping), so we can 
> be
> conservative without hurting performance.

Yeah, within the driver (i.e. for private objects which are never exported
to dma_buf) we have recently become able to do stuff like this. And your above idea is
roughly one of the things we're tossing around for i915.

But the cool stuff with drm is that cmd submission is driver-specific, so
you can just go wild with nouveau. Of course you have to convince the
nouveau guys (and also have open-source users for the new interface).

For shared buffers I think we should stick with the implicit fences for a
while simply because I'm not sure whether it's really worth the fuss. And
reworking all the drivers 

[RFC] Explicit synchronization for Nouveau

2014-10-01 Thread Lauri Peltonen
Thanks Daniel for your input!

On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
> On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
> > (2) Stop automatically storing fences to the buffers that user space wants 
> > to
> > synchronize explicitly.
> 
> The problem with this approach is that you then need hw faulting to make
> sure the memory is there. Implicit fences aren't just used for syncing,
> but also to make sure that the gpu still has access to the buffer as long
> as it needs it. So you need at least a non-exclusive fence attached for
> each command submission.
> 
> Of course on Android you don't have swap (would kill the puny mmc within
> seconds) and you don't care for letting userspace pin most of memory for
> gfx. So you'll get away with no fences at all. But for upstream I don't
> see a good solution unfortunately. Ideas very much welcome.
> 
> > (3) Allow user space to attach an explicit fence to dma-buf when exporting 
> > to
> > another driver that uses implicit sync.
> > 
> > There are still some open issues beyond these.  For example, can we skip
> > acquiring the ww mutex for explicitly synchronized buffers?  I think we 
> > could
> > eventually, at least on unified memory systems where we don't need to 
> > migrate
> > between heaps (our downstream Tegra GPU driver does not lock any buffers at
> > submit, it just grabs refcounts for hw).  Another quirk is that now Nouveau
> > waits on the buffer fences when closing the gem object to ensure that it
> > doesn't unmap too early.  We need to rework that for explicit sync, but that
> > shouldn't be difficult.
> 
> See above, but you can't avoid attaching fences as long as we still use a
> buffer-object based gfx memory management model. At least afaics. Which
> means you need the ordering guarantees imposed by ww mutexes to ensure
> that the oddball implicit ordered client can't deadlock the kernel's
> memory management code.

Implicit fences attached to individual buffers are one way for residency
management.  Do you think a working set based model could work in the DRM
framework?  For example, something like this:

- Allow user space to create "working set objects" and associate buffers with
  them.  If the user space doesn't want to manage working sets explicitly, it
  could also use an implicit default working set that contains all buffers that
  are mapped to the channel vm (on Android we could always use the default
  working set since we don't need to manage residency).  The working sets are
  initially marked as dirty.
- User space tells which working sets are referenced by each work submission.
  Kernel locks these working sets, pins all buffers in dirty working sets, and
  resets the dirty bits.  After kicking off work, kernel stores the fence to
  the _working sets_, and then releases the locks (if an implicit default
  working set is used, then this would be roughly equivalent to storing a fence
  to channel vm that tells "this is the last hw operation that might have
  touched buffers in this address space").
- If swapping doesn't happen, then we just need to check the working set dirty
  bits at each submit.
- When a buffer is swapped out, all working sets that refer to it need to be
  marked as dirty.
- When a buffer is swapped out or unmapped, we need to wait for the fences from
  all working sets that refer to the buffer.

Initially one might think of working sets as a mere optimization - we now need
to process a few working sets at every submit instead of many individual
buffers.  However, it makes a huge difference because of fences: fences that
are attached to buffers are used for implicitly synchronizing work across
different channels and engines.  They are in the performance critical path, and
we want to carefully manage them (that's the idea of explicit synchronization).
The working set fences, on the other hand, would only be used to guarantee that
we don't swap out or unmap something that the GPU might be accessing.  We never
need to wait for those fences (except when swapping or unmapping), so we can be
conservative without hurting performance.
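
To sketch how the per-submit flow above might look in code (all structure and
function names are invented; "struct fence" here is the recently merged
cross-driver fence, and the pin/validate step is elided):

#include <linux/fence.h>
#include <linux/list.h>
#include <linux/mutex.h>

struct working_set {                    /* hypothetical */
        struct mutex lock;
        struct list_head buffers;       /* BOs associated with this set */
        bool dirty;                     /* a member was evicted/swapped out */
        struct fence *last_fence;       /* last submit referencing the set */
};

static void submit_with_working_sets(struct working_set **sets, int n,
                                     struct fence *job_fence)
{
        int i;

        for (i = 0; i < n; i++) {
                mutex_lock(&sets[i]->lock);
                if (sets[i]->dirty) {
                        /* pin and re-validate every buffer in the set */
                        sets[i]->dirty = false;
                }
        }

        /* ... kick off the job on the hardware ... */

        for (i = 0; i < n; i++) {
                /* only the per-set fence is stored, not per-buffer fences */
                fence_put(sets[i]->last_fence);
                sets[i]->last_fence = fence_get(job_fence);
                mutex_unlock(&sets[i]->lock);
        }
}

Swap-out would then walk the sets that reference the evicted buffer, mark
them dirty, and wait on their last_fence before touching the memory.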


> Imo de-staging the android syncpt stuff needs to happen first, before drivers
> can use it. Since non-staging stuff really shouldn't depend upon code from
> staging.

Fully agree.  I thought the best way towards that would be to show some driver
code that _would_ use it. :)


> I'm all for adding explicit syncing. Our plans are roughly:  - Add both an in
> and an out fence to execbuf to sync with other rendering and give userspace
> a fence back. Needs two different flags probably.
> 
> - Maybe add an ioctl to dma-bufs to get at the current implicit fences
>   attached to them (both an exclusive and non-exclusive version). This
>   should help with making explicit and implicit sync work together nicely.
> 
> - Add fence support to kms. Probably only worth it together with the new
>   atomic stuff. Again we need an in fence to wait for 

[RFC] Explicit synchronization for Nouveau

2014-10-01 Thread Maarten Lankhorst
Hey,

On 01-10-14 17:14, Lauri Peltonen wrote:
> Thanks Daniel for your input!
> 
> On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
>> On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
>>> (2) Stop automatically storing fences to the buffers that user space wants 
>>> to
>>> synchronize explicitly.
>>
>> The problem with this approach is that you then need hw faulting to make
>> sure the memory is there. Implicit fences aren't just used for syncing,
>> but also to make sure that the gpu still has access to the buffer as long
>> as it needs it. So you need at least a non-exclusive fence attached for
>> each command submission.
>>
>> Of course on Android you don't have swap (would kill the puny mmc within
>> seconds) and you don't care for letting userspace pin most of memory for
>> gfx. So you'll get away with no fences at all. But for upstream I don't
>> see a good solution unfortunately. Ideas very much welcome.
>>
>>> (3) Allow user space to attach an explicit fence to dma-buf when exporting 
>>> to
>>> another driver that uses implicit sync.
>>>
>>> There are still some open issues beyond these.  For example, can we skip
>>> acquiring the ww mutex for explicitly synchronized buffers?  I think we 
>>> could
>>> eventually, at least on unified memory systems where we don't need to 
>>> migrate
>>> between heaps (our downstream Tegra GPU driver does not lock any buffers at
>>> submit, it just grabs refcounts for hw).  Another quirk is that now Nouveau
>>> waits on the buffer fences when closing the gem object to ensure that it
>>> doesn't unmap too early.  We need to rework that for explicit sync, but that
>>> shouldn't be difficult.
>>
>> See above, but you can't avoid attaching fences as long as we still use a
>> buffer-object based gfx memory management model. At least afaics. Which
>> means you need the ordering guarantees imposed by ww mutexes to ensure
>> that the oddball implicit ordered client can't deadlock the kernel's
>> memory management code.
> 
> Implicit fences attached to individual buffers are one way for residency
> management.  Do you think a working set based model could work in the DRM
> framework?  For example, something like this:
> 
> - Allow user space to create "working set objects" and associate buffers with
>   them.  If the user space doesn't want to manage working sets explicitly, it
>   could also use an implicit default working set that contains all buffers 
> that
>   are mapped to the channel vm (on Android we could always use the default
>   working set since we don't need to manage residency).  The working sets are
>   initially marked as dirty.
> - User space tells which working sets are referenced by each work submission.
>   Kernel locks these working sets, pins all buffers in dirty working sets, and
>   resets the dirty bits.  After kicking off work, kernel stores the fence to
>   the _working sets_, and then releases the locks (if an implicit default
>   working set is used, then this would be roughly equivalent to storing a 
> fence
>   to channel vm that tells "this is the last hw operation that might have
>   touched buffers in this address space").
> - If swapping doesn't happen, then we just need to check the working set dirty
>   bits at each submit.
> - When a buffer is swapped out, all working sets that refer to it need to be
>   marked as dirty.
> - When a buffer is swapped out or unmapped, we need to wait for the fences 
> from
>   all working sets that refer to the buffer.
> 
> Initially one might think of working sets as a mere optimization - we now need
> to process a few working sets at every submit instead of many individual
> buffers.  However, it makes a huge difference because of fences: fences that
> are attached to buffers are used for implicitly synchronizing work across
> different channels and engines.  They are in the performance critical path, 
> and
> we want to carefully manage them (that's the idea of explicit 
> synchronization).
> The working set fences, on the other hand, would only be used to guarantee 
> that
> we don't swap out or unmap something that the GPU might be accessing.  We 
> never
> need to wait for those fences (except when swapping or unmapping), so we can 
> be
> conservative without hurting performance.
> 
> 
>> Imo de-staging the android syncpt stuff needs to happen first, before drivers
>> can use it. Since non-staging stuff really shouldn't depend upon code from
>> staging.
> 
> Fully agree.  I thought the best way towards that would be to show some driver
> code that _would_ use it. :)
> 
> 
>> I'm all for adding explicit syncing. Our plans are roughly:  - Add both an in
>> and an out fence to execbuf to sync with other rendering and give userspace
>> a fence back. Needs two different flags probably.
>>
>> - Maybe add an ioctl to dma-bufs to get at the current implicit fences
>>   attached to them (both an exclusive and non-exclusive version). This
>>   should help with making 

[RFC] Explicit synchronization for Nouveau

2014-09-30 Thread Daniel Vetter
On Mon, Sep 29, 2014 at 10:20:44AM -0700, James Jones wrote:
> Additionally, I think the goal is to move to a model where some higher-level
> object such as a working set, rather than individual buffers, is assigned
> counters or sync primitives on a per-submission basis.  Versioning off tags
> for individual buffers then moves to working set modification time.  This is
> more feasible if the only thing that needs precise fencing of individual
> surfaces is lifetime management.

Yeah, there's always ways to make the fence assignment and tracking a bit
more efficient, we're playing around with working-set tricks for i915
ourselves. But fundamentally you still have fences for each buffer object
(just can't directly access them). And for buffers exported with dma_buf
you still need the direct link I think, at least when you care about
implicit syncing somewhat.

> The trend seems to be towards establishing a relatively large working set up
> front and then submitting many command buffers against it, perhaps with
> incremental modifications to the working set along the way.  This may be
> what's referred to as the Android model above, but I view it as the
> "non-glitchy graphic" model going forward.

Nah, that's something different. Afaik Android drivers don't really bother
a lot with swap and page migration and having a working shrinker for gpu
objects. At least our android guys need to disable the lowmemory killer
since that thing just goes bananas if we drive i915 memory usage against
the wall and into swap.

I'm not really sure what you mean by "non-glitchy graphics", for me this
can mean anything from avoiding stalls to proper syncing with vblanks to
anything else really ... So might be good to elaborate here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[RFC] Explicit synchronization for Nouveau

2014-09-29 Thread Daniel Vetter
On Mon, Sep 29, 2014 at 11:42:19AM -0400, Jerome Glisse wrote:
> On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
> > On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
> > > 
> > > Hi guys,
> > > 
> > > 
> > > I'd like to start a new thread about explicit fence synchronization.  
> > > This time
> > > with a Nouveau twist. :-)
> > > 
> > > First, let me define what I understand by implicit/explicit sync:
> > > 
> > > Implicit synchronization
> > > * Fences are attached to buffers
> > > * Kernel manages fences automatically based on buffer read/write access
> > > 
> > > Explicit synchronization
> > > * Fences are passed around independently
> > > * Kernel takes and emits fences to/from user space when submitting work
> > > 
> > > Implicit synchronization is already implemented in open source drivers, 
> > > and
> > > works well for most use cases.  I don't seek to change any of that.  My
> > > proposal aims at allowing some drm drivers to operate in explicit sync 
> > > mode to
> > > get maximal performance, while still remaining fully compatible with the
> > > implicit paradigm.
> > 
> > Yeah, pretty much what we have in mind on the i915 side too. I didn't look
> > too closely at your patches, so just a few high level comments on your rfc
> > here.
> > 
> > > I will try to explain why I think we should support the explicit model as 
> > > well.
> > > 
> > > 
> > > 1. Bindless graphics
> > > 
> > > Bindless graphics is a central concept when trying to reduce the OpenGL 
> > > driver
> > > overhead.  The idea is that the application can bind a large set of 
> > > buffers to
> > > the working set up front using extensions such as 
> > > GL_ARB_bindless_texture, and
> > > they remain resident until the application releases them (note that 
> > > compute
> > > APIs have typically similar semantics).  These working sets can be huge,
> > > hundreds or even thousands of buffers, so we would like to opt out from 
> > > the
> > > per-submit overhead of acquiring locks, waiting for fences, and storing 
> > > fences.
> > > Automatically synchronizing these working sets in kernel will also prevent
> > > parallelism between channels that are sharing the working set (in fact 
> > > sharing
> > > just one buffer from the working set will cause the jobs of the two 
> > > channels to
> > > be serialized).
> > > 
> > > 2. Evolution of graphics APIs
> > > 
> > > The graphics API evolution seems to be going to a direction where game 
> > > engine
> > > and middleware vendors demand more control over work submission and
> > > synchronization.  We expect that this trend will continue, and more and 
> > > more
> > > synchronization decisions will be pushed to the API level.  OpenGL and EGL
> > > already provide good explicit command stream level synchronization 
> > > primitives:
> > > glFenceSync and EGL_KHR_wait_sync.  Their use is also encouraged - for 
> > > example
> > > EGL_KHR_image_base spec clearly states that the application is 
> > > responsible for
> > > synchronizing accesses to EGLImages.  If the API that is exposed to 
> > > developers
> > > gives the control over synchronization to the developer, then implicit 
> > > waits
> > > that are inserted by the kernel are unnecessary and unexpected, and can
> > > severely hurt performance.  It also makes it easy for the developer to 
> > > write
> > > code that happens to work on Linux because of implicit sync, but will 
> > > fail on
> > > other platforms.
> > > 
> > > 3. Suballocation
> > > 
> > > Using user space suballocation can help reduce the overhead when a large 
> > > number
> > > of small textures are used.  Synchronizing suballocated surfaces 
> > > implicitly in
> > > kernel doesn't make sense - many channels should be able to access the 
> > > same
> > > kernel-level buffer object simultaneously.
> > > 
> > > 4. Buffer sharing complications
> > > 
> > > This is not really an argument for explicit sync as such, but I'd like to 
> > > point
> > > out that sharing buffers across SoC engines is often much more complex 
> > > than
> > > just exporting and importing a dma-buf and waiting for the dma-buf fences.
> > > Sometimes we need to do color format or tiling layout conversion.  
> > > Sometimes,
> > > at least on Tegra, we need to decompress buffers when we pass them from 
> > > the GPU
> > > to an engine that doesn't support framebuffer compression.  These things 
> > > are
> > > not uncommon, particularly when we have SoC's that combine licensed IP 
> > > blocks
> > > from different vendors.  My point is that user space is already heavily
> > > involved when sharing buffers between drivers, and giving it some more 
> > > control
> > > over synchronization is not adding that much complexity.
> > > 
> > > 
> > > Because of the above arguments, I think it makes sense to let some user 
> > > space
> > > drm drivers opt out from implicit synchronization, while allowing them to 
> > > still
> > > remain fully compatible with the rest 

[RFC] Explicit synchronization for Nouveau

2014-09-29 Thread Jerome Glisse
On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
> On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
> > 
> > Hi guys,
> > 
> > 
> > I'd like to start a new thread about explicit fence synchronization.  This 
> > time
> > with a Nouveau twist. :-)
> > 
> > First, let me define what I understand by implicit/explicit sync:
> > 
> > Implicit synchronization
> > * Fences are attached to buffers
> > * Kernel manages fences automatically based on buffer read/write access
> > 
> > Explicit synchronization
> > * Fences are passed around independently
> > * Kernel takes and emits fences to/from user space when submitting work
> > 
> > Implicit synchronization is already implemented in open source drivers, and
> > works well for most use cases.  I don't seek to change any of that.  My
> > proposal aims at allowing some drm drivers to operate in explicit sync mode 
> > to
> > get maximal performance, while still remaining fully compatible with the
> > implicit paradigm.
> 
> Yeah, pretty much what we have in mind on the i915 side too. I didn't look
> too closely at your patches, so just a few high level comments on your rfc
> here.
> 
> > I will try to explain why I think we should support the explicit model as 
> > well.
> > 
> > 
> > 1. Bindless graphics
> > 
> > Bindless graphics is a central concept when trying to reduce the OpenGL 
> > driver
> > overhead.  The idea is that the application can bind a large set of buffers 
> > to
> > the working set up front using extensions such as GL_ARB_bindless_texture, 
> > and
> > they remain resident until the application releases them (note that compute
> > APIs have typically similar semantics).  These working sets can be huge,
> > hundreds or even thousands of buffers, so we would like to opt out from the
> > per-submit overhead of acquiring locks, waiting for fences, and storing 
> > fences.
> > Automatically synchronizing these working sets in kernel will also prevent
> > parallelism between channels that are sharing the working set (in fact 
> > sharing
> > just one buffer from the working set will cause the jobs of the two 
> > channels to
> > be serialized).
> > 
> > 2. Evolution of graphics APIs
> > 
> > The graphics API evolution seems to be going to a direction where game 
> > engine
> > and middleware vendors demand more control over work submission and
> > synchronization.  We expect that this trend will continue, and more and more
> > synchronization decisions will be pushed to the API level.  OpenGL and EGL
> > already provide good explicit command stream level synchronization 
> > primitives:
> > glFenceSync and EGL_KHR_wait_sync.  Their use is also encouraged - for 
> > example
> > EGL_KHR_image_base spec clearly states that the application is responsible 
> > for
> > synchronizing accesses to EGLImages.  If the API that is exposed to 
> > developers
> > gives the control over synchronization to the developer, then implicit waits
> > that are inserted by the kernel are unnecessary and unexpected, and can
> > severely hurt performance.  It also makes it easy for the developer to write
> > code that happens to work on Linux because of implicit sync, but will fail 
> > on
> > other platforms.
> > 
> > 3. Suballocation
> > 
> > Using user space suballocation can help reduce the overhead when a large 
> > number
> > of small textures are used.  Synchronizing suballocated surfaces implicitly 
> > in
> > kernel doesn't make sense - many channels should be able to access the same
> > kernel-level buffer object simultaneously.
> > 
> > 4. Buffer sharing complications
> > 
> > This is not really an argument for explicit sync as such, but I'd like to 
> > point
> > out that sharing buffers across SoC engines is often much more complex than
> > just exporting and importing a dma-buf and waiting for the dma-buf fences.
> > Sometimes we need to do color format or tiling layout conversion.  
> > Sometimes,
> > at least on Tegra, we need to decompress buffers when we pass them from the 
> > GPU
> > to an engine that doesn't support framebuffer compression.  These things are
> > not uncommon, particularly when we have SoC's that combine licensed IP 
> > blocks
> > from different vendors.  My point is that user space is already heavily
> > involved when sharing buffers between drivers, and giving it some more 
> > control
> > over synchronization is not adding that much complexity.
> > 
> > 
> > Because of the above arguments, I think it makes sense to let some user 
> > space
> > drm drivers opt out from implicit synchronization, while allowing them to 
> > still
> > remain fully compatible with the rest of the drm world that uses implicit
> > synchronization.  In practice, this would require three things:
> > 
> > (1) Support passing fences (that are not tied to buffer objects) between 
> > kernel
> > and user space.
> > 
> > (2) Stop automatically storing fences to the buffers that user space wants 
> > to
> > synchronize explicitly.

[RFC] Explicit synchronization for Nouveau

2014-09-29 Thread James Jones
On 9/29/14 8:42 AM, Jerome Glisse wrote:
> On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
>> On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
>>>
>>> Hi guys,
>>>
>>>
>>> I'd like to start a new thread about explicit fence synchronization.  This 
>>> time
>>> with a Nouveau twist. :-)
>>>
>>> First, let me define what I understand by implicit/explicit sync:
>>>
>>> Implicit synchronization
>>> * Fences are attached to buffers
>>> * Kernel manages fences automatically based on buffer read/write access
>>>
>>> Explicit synchronization
>>> * Fences are passed around independently
>>> * Kernel takes and emits fences to/from user space when submitting work
>>>
>>> Implicit synchronization is already implemented in open source drivers, and
>>> works well for most use cases.  I don't seek to change any of that.  My
>>> proposal aims at allowing some drm drivers to operate in explicit sync mode 
>>> to
>>> get maximal performance, while still remaining fully compatible with the
>>> implicit paradigm.
>>
>> Yeah, pretty much what we have in mind on the i915 side too. I didn't look
>> too closely at your patches, so just a few high level comments on your rfc
>> here.
>>
>>> I will try to explain why I think we should support the explicit model as 
>>> well.
>>>
>>>
>>> 1. Bindless graphics
>>>
>>> Bindless graphics is a central concept when trying to reduce the OpenGL 
>>> driver
>>> overhead.  The idea is that the application can bind a large set of buffers 
>>> to
>>> the working set up front using extensions such as GL_ARB_bindless_texture, 
>>> and
>>> they remain resident until the application releases them (note that compute
>>> APIs have typically similar semantics).  These working sets can be huge,
>>> hundreds or even thousands of buffers, so we would like to opt out from the
>>> per-submit overhead of acquiring locks, waiting for fences, and storing 
>>> fences.
>>> Automatically synchronizing these working sets in kernel will also prevent
>>> parallelism between channels that are sharing the working set (in fact 
>>> sharing
>>> just one buffer from the working set will cause the jobs of the two 
>>> channels to
>>> be serialized).
>>>
>>> 2. Evolution of graphics APIs
>>>
>>> The graphics API evolution seems to be going to a direction where game 
>>> engine
>>> and middleware vendors demand more control over work submission and
>>> synchronization.  We expect that this trend will continue, and more and more
>>> synchronization decisions will be pushed to the API level.  OpenGL and EGL
>>> already provide good explicit command stream level synchronization 
>>> primitives:
>>> glFenceSync and EGL_KHR_wait_sync.  Their use is also encouraged - for 
>>> example
>>> EGL_KHR_image_base spec clearly states that the application is responsible 
>>> for
>>> synchronizing accesses to EGLImages.  If the API that is exposed to 
>>> developers
>>> gives the control over synchronization to the developer, then implicit waits
>>> that are inserted by the kernel are unnecessary and unexpected, and can
>>> severely hurt performance.  It also makes it easy for the developer to write
>>> code that happens to work on Linux because of implicit sync, but will fail 
>>> on
>>> other platforms.
>>>
>>> 3. Suballocation
>>>
>>> Using user space suballocation can help reduce the overhead when a large 
>>> number
>>> of small textures are used.  Synchronizing suballocated surfaces implicitly 
>>> in
>>> kernel doesn't make sense - many channels should be able to access the same
>>> kernel-level buffer object simultaneously.
>>>
>>> 4. Buffer sharing complications
>>>
>>> This is not really an argument for explicit sync as such, but I'd like to 
>>> point
>>> out that sharing buffers across SoC engines is often much more complex than
>>> just exporting and importing a dma-buf and waiting for the dma-buf fences.
>>> Sometimes we need to do color format or tiling layout conversion.  
>>> Sometimes,
>>> at least on Tegra, we need to decompress buffers when we pass them from the 
>>> GPU
>>> to an engine that doesn't support framebuffer compression.  These things are
>>> not uncommon, particularly when we have SoC's that combine licensed IP 
>>> blocks
>>> from different vendors.  My point is that user space is already heavily
>>> involved when sharing buffers between drivers, and giving it some more 
>>> control
>>> over synchronization is not adding that much complexity.
>>>
>>>
>>> Because of the above arguments, I think it makes sense to let some user 
>>> space
>>> drm drivers opt out from implicit synchronization, while allowing them to 
>>> still
>>> remain fully compatible with the rest of the drm world that uses implicit
>>> synchronization.  In practice, this would require three things:
>>>
>>> (1) Support passing fences (that are not tied to buffer objects) between 
>>> kernel
>>>  and user space.
>>>
>>> (2) Stop automatically storing fences to the buffers that user space wants 
>>> to
>>>  

[RFC] Explicit synchronization for Nouveau

2014-09-29 Thread Daniel Vetter
On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
> 
> Hi guys,
> 
> 
> I'd like to start a new thread about explicit fence synchronization.  This 
> time
> with a Nouveau twist. :-)
> 
> First, let me define what I understand by implicit/explicit sync:
> 
> Implicit synchronization
> * Fences are attached to buffers
> * Kernel manages fences automatically based on buffer read/write access
> 
> Explicit synchronization
> * Fences are passed around independently
> * Kernel takes and emits fences to/from user space when submitting work
> 
> Implicit synchronization is already implemented in open source drivers, and
> works well for most use cases.  I don't seek to change any of that.  My
> proposal aims at allowing some drm drivers to operate in explicit sync mode to
> get maximal performance, while still remaining fully compatible with the
> implicit paradigm.

Yeah, pretty much what we have in mind on the i915 side too. I didn't look
too closely at your patches, so just a few high level comments on your rfc
here.

> I will try to explain why I think we should support the explicit model as 
> well.
> 
> 
> 1. Bindless graphics
> 
> Bindless graphics is a central concept when trying to reduce the OpenGL driver
> overhead.  The idea is that the application can bind a large set of buffers to
> the working set up front using extensions such as GL_ARB_bindless_texture, and
> they remain resident until the application releases them (note that compute
> APIs have typically similar semantics).  These working sets can be huge,
> hundreds or even thousands of buffers, so we would like to opt out from the
> per-submit overhead of acquiring locks, waiting for fences, and storing 
> fences.
> Automatically synchronizing these working sets in kernel will also prevent
> parallelism between channels that are sharing the working set (in fact sharing
> just one buffer from the working set will cause the jobs of the two channels 
> to
> be serialized).
> 
> 2. Evolution of graphics APIs
> 
> The graphics API evolution seems to be going to a direction where game engine
> and middleware vendors demand more control over work submission and
> synchronization.  We expect that this trend will continue, and more and more
> synchronization decisions will be pushed to the API level.  OpenGL and EGL
> already provide good explicit command stream level synchronization primitives:
> glFenceSync and EGL_KHR_wait_sync.  Their use is also encouraged - for example
> EGL_KHR_image_base spec clearly states that the application is responsible for
> synchronizing accesses to EGLImages.  If the API that is exposed to developers
> gives the control over synchronization to the developer, then implicit waits
> that are inserted by the kernel are unnecessary and unexpected, and can
> severely hurt performance.  It also makes it easy for the developer to write
> code that happens to work on Linux because of implicit sync, but will fail on
> other platforms.
> 
> 3. Suballocation
> 
> Using user space suballocation can help reduce the overhead when a large 
> number
> of small textures are used.  Synchronizing suballocated surfaces implicitly in
> kernel doesn't make sense - many channels should be able to access the same
> kernel-level buffer object simultaneously.
> 
> 4. Buffer sharing complications
> 
> This is not really an argument for explicit sync as such, but I'd like to 
> point
> out that sharing buffers across SoC engines is often much more complex than
> just exporting and importing a dma-buf and waiting for the dma-buf fences.
> Sometimes we need to do color format or tiling layout conversion.  Sometimes,
> at least on Tegra, we need to decompress buffers when we pass them from the 
> GPU
> to an engine that doesn't support framebuffer compression.  These things are
> not uncommon, particularly when we have SoC's that combine licensed IP blocks
> from different vendors.  My point is that user space is already heavily
> involved when sharing buffers between drivers, and giving it some more control
> over synchronization is not adding that much complexity.
> 
> 
> Because of the above arguments, I think it makes sense to let some user space
> drm drivers opt out from implicit synchronization, while allowing them to 
> still
> remain fully compatible with the rest of the drm world that uses implicit
> synchronization.  In practice, this would require three things:
> 
> (1) Support passing fences (that are not tied to buffer objects) between 
> kernel
> and user space.
> 
> (2) Stop automatically storing fences to the buffers that user space wants to
> synchronize explicitly.

The problem with this approach is that you then need hw faulting to make
sure the memory is there. Implicit fences aren't just used for syncing,
but also to make sure that the gpu still has access to the buffer as long
as it needs it. So you need at least a non-exclusive fence attached for
each command submission.

Of course on Android you don't 

[RFC] Explicit synchronization for Nouveau

2014-09-26 Thread Lauri Peltonen

Hi guys,


I'd like to start a new thread about explicit fence synchronization.  This time
with a Nouveau twist. :-)

First, let me define what I understand by implicit/explicit sync:

Implicit synchronization
* Fences are attached to buffers
* Kernel manages fences automatically based on buffer read/write access

Explicit synchronization
* Fences are passed around independently
* Kernel takes and emits fences to/from user space when submitting work

Implicit synchronization is already implemented in open source drivers, and
works well for most use cases.  I don't seek to change any of that.  My
proposal aims at allowing some drm drivers to operate in explicit sync mode to
get maximal performance, while still remaining fully compatible with the
implicit paradigm.
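
To make the difference concrete at the submission boundary, here is a tiny
illustration (both prototypes are hypothetical, not any driver's real ioctl
wrappers):

#include <stdint.h>

/* Implicit sync: the kernel derives ordering from the buffer list and
 * quietly attaches fences to every buffer behind the scenes. */
int submit_implicit(int channel, const void *pushbuf,
                    const uint32_t *bo_handles, int nr_bos);

/* Explicit sync: ordering travels with the job as fences; the caller
 * passes one in and gets one back to hand to the next consumer. */
int submit_explicit(int channel, const void *pushbuf,
                    const uint32_t *bo_handles, int nr_bos,
                    int in_fence_fd, int *out_fence_fd);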

I will try to explain why I think we should support the explicit model as well.


1. Bindless graphics

Bindless graphics is a central concept when trying to reduce the OpenGL driver
overhead.  The idea is that the application can bind a large set of buffers to
the working set up front using extensions such as GL_ARB_bindless_texture, and
they remain resident until the application releases them (note that compute
APIs typically have similar semantics).  These working sets can be huge,
hundreds or even thousands of buffers, so we would like to opt out from the
per-submit overhead of acquiring locks, waiting for fences, and storing fences.
Automatically synchronizing these working sets in kernel will also prevent
parallelism between channels that are sharing the working set (in fact sharing
just one buffer from the working set will cause the jobs of the two channels to
be serialized).
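
For reference, this is roughly what the residency contract looks like from
the application side with GL_ARB_bindless_texture (standard entry points from
that extension; a GL function loader is assumed):

#include <GL/glew.h>    /* any loader exposing GL_ARB_bindless_texture */

/* The handle is made resident once; after that shaders can reference the
 * texture across many submits without any per-draw bind or per-submit
 * buffer list entry. */
static GLuint64 make_texture_resident(GLuint texture)
{
        GLuint64 handle = glGetTextureHandleARB(texture);

        glMakeTextureHandleResidentARB(handle);
        return handle;  /* resident until glMakeTextureHandleNonResidentARB() */
}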

2. Evolution of graphics APIs

The graphics API evolution seems to be going in a direction where game engine
and middleware vendors demand more control over work submission and
synchronization.  We expect that this trend will continue, and more and more
synchronization decisions will be pushed to the API level.  OpenGL and EGL
already provide good explicit command stream level synchronization primitives:
glFenceSync and EGL_KHR_wait_sync.  Their use is also encouraged - for example
EGL_KHR_image_base spec clearly states that the application is responsible for
synchronizing accesses to EGLImages.  If the API that is exposed to developers
gives the control over synchronization to the developer, then implicit waits
that are inserted by the kernel are unnecessary and unexpected, and can
severely hurt performance.  It also makes it easy for the developer to write
code that happens to work on Linux because of implicit sync, but will fail on
other platforms.
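
As a concrete example of those primitives (standard EGL_KHR_fence_sync /
EGL_KHR_wait_sync usage for two contexts sharing an EGLImage; assumes the KHR
entry points have been resolved via eglGetProcAddress, error handling
omitted):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

/* Producer: record a fence after the last write to the shared EGLImage. */
static EGLSyncKHR producer_done(EGLDisplay dpy)
{
        EGLSyncKHR sync = eglCreateSyncKHR(dpy, EGL_SYNC_FENCE_KHR, NULL);

        glFlush();      /* make sure the fence reaches the GPU */
        return sync;
}

/* Consumer: queue a GPU-side wait on the producer's fence; the CPU does not
 * block (that is what EGL_KHR_wait_sync adds over client-side waits). */
static void consumer_wait(EGLDisplay dpy, EGLSyncKHR sync)
{
        eglWaitSyncKHR(dpy, sync, 0);
        eglDestroySyncKHR(dpy, sync);
}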

3. Suballocation

Using user space suballocation can help reduce the overhead when a large number
of small textures are used.  Synchronizing suballocated surfaces implicitly in
kernel doesn't make sense - many channels should be able to access the same
kernel-level buffer object simultaneously.

4. Buffer sharing complications

This is not really an argument for explicit sync as such, but I'd like to point
out that sharing buffers across SoC engines is often much more complex than
just exporting and importing a dma-buf and waiting for the dma-buf fences.
Sometimes we need to do color format or tiling layout conversion.  Sometimes,
at least on Tegra, we need to decompress buffers when we pass them from the GPU
to an engine that doesn't support framebuffer compression.  These things are
not uncommon, particularly when we have SoCs that combine licensed IP blocks
from different vendors.  My point is that user space is already heavily
involved when sharing buffers between drivers, and giving it some more control
over synchronization is not adding that much complexity.


Because of the above arguments, I think it makes sense to let some user space
drm drivers opt out from implicit synchronization, while allowing them to still
remain fully compatible with the rest of the drm world that uses implicit
synchronization.  In practice, this would require three things:

(1) Support passing fences (that are not tied to buffer objects) between kernel
and user space (a small user space sketch follows this list).

(2) Stop automatically storing fences to the buffers that user space wants to
synchronize explicitly.

(3) Allow user space to attach an explicit fence to dma-buf when exporting to
another driver that uses implicit sync.
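
For (1), a fence that crosses the kernel boundary as a file descriptor can be
handled with ordinary fd plumbing; for example, a simple CPU-side wait is just
a poll(), since sync fence fds are pollable in the staging Android sync
driver.  A minimal sketch:

#include <errno.h>
#include <poll.h>

/* Wait for the fence behind 'fence_fd' to signal, or time out. */
static int wait_fence_fd(int fence_fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
        int ret = poll(&pfd, 1, timeout_ms);

        if (ret > 0)
                return 0;       /* fence signalled */
        if (ret == 0)
                return -ETIME;  /* timed out */
        return -errno;          /* poll() failed */
}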

There are still some open issues beyond these.  For example, can we skip
acquiring the ww mutex for explicitly synchronized buffers?  I think we could
eventually, at least on unified memory systems where we don't need to migrate
between heaps (our downstream Tegra GPU driver does not lock any buffers at
submit, it just grabs refcounts for hw).  Another quirk is that now Nouveau
waits on the buffer fences when closing the gem object to ensure that it
doesn't unmap too early.  We need to rework that for explicit sync, but that
shouldn't be difficult.

I have written a prototype that demonstrates (1) by adding explicit sync fd
support to