[RFC] Explicit synchronization for Nouveau

2014-10-03 Thread Rom Lemarchand
Riley (CCed) and I will be at Plumbers in a couple weeks.

There is a session on sync planned in the Android track, and of course
we'll be available to chat.

On Thu, Oct 2, 2014 at 1:44 PM, Daniel Vetter  wrote:

> On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
> > +Rom who seems to be presenting about mainlining android sync at Linux
> > Plumbers
>
> Also adding Greg KH as FYI that we're working on de-staging one of the
> android subsystems.
>
> > On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
> > > You could neuter implicit fences by always attaching the fences as
> > > shared when explicit syncing is used. This would work correctly with
> > > eviction, and wouldn't cause any unneeded syncing. :)
> >
> > Yes, that will probably work!  So, just to reiterate that I understood
> > you and Daniel correctly:
> >
> > - de-stage sync_fence and its user space API (the tedious part)
> > - add dma-buf ioctls for extracting and attaching explicit fences (see
> >   the sketch below)
> > - Nouveau specific: allow flagging gem buffers for explicit sync
> >   - do not check pre-fences from explicitly synced buffers at submit
> >   - continue to attach a shared fence after submit so that pinning and
> > unmapping continue to work
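A minimal sketch of what such dma-buf fence ioctls could look like (purely
illustrative; no such uapi existed at the time of this thread, and the names
and layout below are assumptions):

    /* Hypothetical uapi sketch, not an existing kernel interface:
     * GET extracts the buffer's current fence as a sync_fence fd,
     * SET attaches a sync_fence fd as an explicit fence. */
    #include <linux/ioctl.h>
    #include <linux/types.h>

    struct dma_buf_fence {
            __s32 fd;       /* sync_fence fd: returned by GET, consumed by SET */
            __u32 flags;    /* e.g. shared (read) vs. exclusive (write) access */
    };

    #define DMA_BUF_IOCTL_GET_FENCE _IOWR('b', 0x20, struct dma_buf_fence)
    #define DMA_BUF_IOCTL_SET_FENCE _IOW('b', 0x21, struct dma_buf_fence)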
>
> - Have explicit in/out fences for the pushbuf ioctl - that one is missing
>   from this step, I guess?
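A rough sketch of the per-buffer decision this plan implies for the nouveau
pushbuf path, assuming the 3.17-era reservation_object API; the function and
its name below are illustrative, not existing nouveau code, and the explicit
in/out fences themselves would be sync_fence fds passed through new pushbuf
ioctl fields:

    /* Illustrative only: per-buffer handling at submit time. */
    static int submit_prep_bo(struct reservation_object *resv,
                              bool explicit_sync, struct fence *submit_fence)
    {
            int ret = reservation_object_reserve_shared(resv);
            if (ret)
                    return ret;

            if (!explicit_sync) {
                    /* Implicit sync: wait for the pre-fences on the buffer. */
                    long lret = reservation_object_wait_timeout_rcu(resv,
                                            true, false, MAX_SCHEDULE_TIMEOUT);
                    if (lret < 0)
                            return lret;
            }

            /*
             * Either way, attach the new fence as a *shared* fence so that
             * eviction, pinning and unmapping still see that the GPU may be
             * using the buffer, without forcing cross-channel syncing.
             */
            reservation_object_add_shared_fence(resv, submit_fence);
            return 0;
    }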
>
> I also think we need some kind of demonstration vehicle using nouveau to
> satisfy Dave Airlie's open-source userspace requirements for new
> interfaces. Might be good to chat with him to make sure we have that
> covered (and how much needs to be there really).
>
> > Then working sets and getting rid of locking all buffers individually
> > can be dealt with later as an optimization.
>
> Yeah, sounds like a good plan.
>
> > On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
> > > On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
> > > > Implicit fences attached to individual buffers are one way for
> > > > residency management.  Do you think a working set based model could
> > > > work in the DRM framework?  For example, something like this:
> > > >
> > > > - Allow user space to create "working set objects" and associate
> > > >   buffers with them.  If the user space doesn't want to manage
> > > >   working sets explicitly, it could also use an implicit default
> > > >   working set that contains all buffers that are mapped to the
> > > >   channel vm (on Android we could always use the default working set
> > > >   since we don't need to manage residency).  The working sets are
> > > >   initially marked as dirty.
> > > > - User space tells which working sets are referenced by each work
> > > >   submission.  Kernel locks these working sets, pins all buffers in
> > > >   dirty working sets, and resets the dirty bits.  After kicking off
> > > >   work, kernel stores the fence to the _working sets_, and then
> > > >   releases the locks (if an implicit default working set is used,
> > > >   then this would be roughly equivalent to storing a fence to channel
> > > >   vm that tells "this is the last hw operation that might have
> > > >   touched buffers in this address space").
> > > > - If swapping doesn't happen, then we just need to check the working
> > > >   set dirty bits at each submit.
> > > > - When a buffer is swapped out, all working sets that refer to it
> > > >   need to be marked as dirty.
> > > > - When a buffer is swapped out or unmapped, we need to wait for the
> > > >   fences from all working sets that refer to the buffer.
> > > >
> > > > Initially one might think of working sets as a mere optimization -
> > > > we now need to process a few working sets at every submit instead of
> > > > many individual buffers.  However, it makes a huge difference because
> > > > of fences: fences that are attached to buffers are used for
> > > > implicitly synchronizing work across different channels and engines.
> > > > They are in the performance critical path, and we want to carefully
> > > > manage them (that's the idea of explicit synchronization).  The
> > > > working set fences, on the other hand, would only be used to
> > > > guarantee that we don't swap out or unmap something that the GPU
> > > > might be accessing.  We never need to wait for those fences (except
> > > > when swapping or unmapping), so we can be conservative without
> > > > hurting performance.
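A compact sketch of the object this model implies; every name below is an
assumption for illustration, nothing like it exists in nouveau or the DRM
core:

    /* Illustrative working-set object, not existing DRM/nouveau code. */
    struct working_set {
            struct mutex      lock;        /* real code would likely want a
                                            * ww_mutex to lock several sets
                                            * in a consistent order */
            struct list_head  buffers;     /* buffers associated with this set */
            bool              dirty;       /* true if any member may be unpinned */
            struct fence     *last_fence;  /* last submit that may touch members */
    };

    /* At submit: pin members only when the set is dirty, then remember the
     * submit fence so that swap-out/unmap can wait on it later. */
    static void working_set_submit(struct working_set *ws,
                                   struct fence *submit_fence)
    {
            mutex_lock(&ws->lock);
            if (ws->dirty) {
                    /* pin every buffer on ws->buffers (driver specific) */
                    ws->dirty = false;
            }
            fence_put(ws->last_fence);
            ws->last_fence = fence_get(submit_fence);
            mutex_unlock(&ws->lock);
    }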
> > >
> > > Yeah, within the driver (i.e. for private objects which are never
> > > exported to dma_buf) we can recently do stuff like this. And your
> > > above idea is roughly one of the things we're tossing around for i915.
> > >
> > > But the cool stuff with drm is that cmd submission is driver-specific,
> > > so you can just go wild with nouveau. Of course you have to convince
> > > the nouveau guys (and also have open-source users for the new
> > > interface).
> > >
> > > For shared buffers I think we should stick with the implicit fences
> > > for a while simply 

[Linaro-mm-sig] thoughts of looking at android fences

2013-11-11 Thread Rom Lemarchand
I ran some benchmarks and things seem to be running about the same.
No one on our graphics team seemed concerned about the change.

The only concern I heard was about the increased complexity of the new sync
code as opposed to the old sync framework which tried to keep things
straightforward.


On Fri, Nov 8, 2013 at 3:43 AM, Maarten Lankhorst <
maarten.lankhorst at canonical.com> wrote:

> op 07-11-13 22:11, Rom Lemarchand schreef:
> > Hi Maarten, I tested your changes and needed the attached patch: behavior
> > now seems equivalent to android sync. I haven't tested performance.
> >
> > The issue resolved by this patch happens when i_b < b->num_fences and
> > i_a >= a->num_fences (or vice versa). Then pt_a is invalid, and
> > dereferencing pt_a->context causes a crash.
> >
> Yeah, I pushed my original fix. I intended to keep android userspace
> behavior the same, and I tried to keep the kernelspace API the same as
> much as I could. If performance is the same, or not noticeably worse,
> would there be any objections on the android side to renaming dma-fence
> to syncpoint and getting it in mainline?
>
> ~Maarten
>


[Linaro-mm-sig] thoughts of looking at android fences

2013-11-08 Thread Rom Lemarchand
Let me run some benchmarks today, talk to people internally, and I'll let
you know.
On Nov 8, 2013 3:43 AM, "Maarten Lankhorst" 
wrote:

> op 07-11-13 22:11, Rom Lemarchand schreef:
> > Hi Maarten, I tested your changes and needed the attached patch: behavior
> > now seems equivalent to android sync. I haven't tested performance.
> >
> > The issue resolved by this patch happens when i_b < b->num_fences and
> > i_a >= a->num_fences (or vice versa). Then pt_a is invalid, and
> > dereferencing pt_a->context causes a crash.
> >
> Yeah, I pushed my original fix. I intended to keep android userspace
> behavior the same, and I tried to keep the kernelspace API the same as
> much as I could. If performance is the same, or not noticeably worse,
> would there be any objections on the android side to renaming dma-fence
> to syncpoint and getting it in mainline?
>
> ~Maarten
>


[Linaro-mm-sig] thoughts of looking at android fences

2013-11-07 Thread Rom Lemarchand
Hi Maarten, I tested your changes and needed the attached patch: behavior
now seems equivalent to android sync. I haven't tested performance.

The issue resolved by this patch happens when i_b < b->num_fences and
i_a >= a->num_fences (or vice versa). Then pt_a is invalid, and
dereferencing pt_a->context causes a crash.


On Mon, Nov 4, 2013 at 2:31 AM, Maarten Lankhorst <
maarten.lankhorst at canonical.com> wrote:

> op 02-11-13 22:36, Colin Cross schreef:
> > On Wed, Oct 30, 2013 at 5:17 AM, Maarten Lankhorst
> >  wrote:
> >> op 24-10-13 14:13, Maarten Lankhorst schreef:
> >>> So I actually tried to implement it now. I killed all the deprecated
> >>> members and assumed a linear timeline.  This means that syncpoints can
> >>> only be added at the end, not in between. In particular it means
> >>> sw_sync might be slightly broken.
> >>>
> >>> I only tested it with a simple program I wrote called ufence.c, it's
> >>> in drivers/staging/android/ufence.c in the following tree:
> >>>
> >>> http://cgit.freedesktop.org/~mlankhorst/linux
> >>>
> >>> the "rfc: convert android to fence api" has all the changes from my
> >>> dma-fence proposal to what android would need, it also converts the
> >>> userspace fence api to use the dma-fence api.
> >>>
> >>> sync_pt is implemented as fence too. This meant not having to convert
> >>> all of android right away, though I did make some changes.  I killed
> >>> the deprecated members and made all the fence calls forward to the
> >>> sync_timeline_ops. dup and compare are no longer used.
> >>>
> >>> I haven't given this a spin on a full android kernel, only with the
> >>> components that are in mainline kernel under staging and my dumb test
> >>> program.
> >>>
> >>> ~Maarten
> >>>
> >>> PS: The nomenclature is very confusing. I want to rename dma-fence to
> >>> syncpoint, but I want some feedback from the android devs first. :)
> >>>
> >> Come on, any feedback? I want to move the discussion forward.
> >>
> >> ~Maarten
> > I experimented with it a little on a device that uses sync and came
> > across a few bugs:
> > 1.  sync_timeline_signal needs to call __fence_signal on all signaled
> > points on the timeline, not just the first
> > 2.  fence_add_callback doesn't always initialize cb.node
> > 3.  sync_fence_wait should take ms
> > 4.  sync_print_pt status printing was incorrect
> > 5.  there is a deadlock:
> >       sync_print_obj takes obj->child_list_lock
> >         sync_print_pt
> >           fence_is_signaled
> >             fence_signal takes fence->lock == obj->child_list_lock
> > 6.  freeing a timeline before all the fences holding points on that
> > timeline have timed out results in a crash
> >
> > With the attached patch to fix these issues, our libsync and sync_test
> > give the same results as with our sync code.  I haven't tested against
> > the full Android framework yet.
> >
> > The compare op and timeline ordering is critical to the efficiency of
> > sync points on Android.  The compare op is used when merging fences to
> > drop all but the latest point on the same timeline.  This is necessary
> > for example when the same buffer is submitted to the display on
> > multiple frames, like when there is a live wallpaper in the background
> > updating at 60 fps and a static screen of widgets on top of it.  The
> > static widget buffer is submitted on every frame, returning a new
> > fence each time.  The compositor merges the new fence with the fence
> > for the previous buffer, and because they are on the same timeline it
> > merges down to a single point.  I experimented with disabling the
> > merge optimization on a Nexus 10, and found that leaving the screen on
> > running a live wallpaper eventually resulted in 100k outstanding sync
> > points.
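The de-duplication Colin describes boils down to something like the sketch
below; sync_fence_add_pt and the cbs[]/num_fences fields match Maarten's
patch further down, but the comparison logic itself is an assumption, not
the actual staging code:

    /* Illustrative sketch: fold the context-sorted point arrays of two
     * fences into 'fence', keeping only the latest point per timeline. */
    static void sync_fence_merge_pts(struct sync_fence *fence, int *i,
                                     struct sync_fence *a, struct sync_fence *b)
    {
            int i_a = 0, i_b = 0;

            while (i_a < a->num_fences && i_b < b->num_fences) {
                    struct fence *pt_a = a->cbs[i_a].sync_pt;
                    struct fence *pt_b = b->cbs[i_b].sync_pt;

                    if (pt_a->context < pt_b->context) {
                            sync_fence_add_pt(fence, i, pt_a);
                            i_a++;
                    } else if (pt_a->context > pt_b->context) {
                            sync_fence_add_pt(fence, i, pt_b);
                            i_b++;
                    } else {
                            /* Same timeline: keep only the later point. */
                            if (pt_a->seqno - pt_b->seqno <= INT_MAX)
                                    sync_fence_add_pt(fence, i, pt_a);
                            else
                                    sync_fence_add_pt(fence, i, pt_b);
                            i_a++;
                            i_b++;
                    }
            }

            /* Appending the remainder of the longer array also avoids the
             * out-of-range pt_a dereference reported earlier in the thread. */
            for (; i_a < a->num_fences; i_a++)
                    sync_fence_add_pt(fence, i, a->cbs[i_a].sync_pt);
            for (; i_b < b->num_fences; i_b++)
                    sync_fence_add_pt(fence, i, b->cbs[i_b].sync_pt);
    }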
>
> Well, here I did the same for dma-fence, can you take a look?
>
> ---
>
> diff --git a/drivers/staging/android/sync.c b/drivers/staging/android/sync.c
> index 2c7fd3f2ab23..d1d89f1f8553 100644
> --- a/drivers/staging/android/sync.c
> +++ b/drivers/staging/android/sync.c
> @@ -232,39 +232,62 @@ void sync_fence_install(struct sync_fence *fence, int fd)
>  }
>  EXPORT_SYMBOL(sync_fence_install);
> 
> +static void sync_fence_add_pt(struct sync_fence *fence, int *i,
> +                              struct fence *pt)
> +{
> +        fence->cbs[*i].sync_pt = pt;
> +        fence->cbs[*i].fence = fence;
> +
> +        if (!fence_add_callback(pt, &fence->cbs[*i].cb, fence_check_cb_func)) {
> +                fence_get(pt);
> +                (*i)++;
> +        }
> +}
> +
>  struct sync_fence *sync_fence_merge(const char *name,
>                                      struct sync_fence *a, struct sync_fence *b)
>  {
>          int num_fences = a->num_fences + b->num_fences;
>          struct sync_fence *fence;
> -        int i;
> +        int i, i_a, i_b;
> 
>          fence = sync_fence_alloc(offsetof(struct sync_fence, cbs[num_fences]),
>                                   name);
>          if (fence == NULL)
>                  return NULL;
> 
> -        fence->num_fences = num_fences;
>          atomic_set(&fence->status, num_fences);
> 
> -        for (i = 0; i < 

[Linaro-mm-sig] thoughts of looking at android fences

2013-11-01 Thread Rom Lemarchand
Sorry about the delay.
Hopefully other people from Android will also chime in.
We need the ability to merge sync fences and keep the sync_pt ordered: the
idea behind sync timelines is that we promise an ordering of operations.

Our reference device is Nexus 10: we need to make sure that any new
implementation satisfies the same requirements.

You can find sample use-cases here; we also use it in our hardware composer
libraries:
https://android.googlesource.com/platform/system/core/+/refs/heads/master/libsync/
https://android.googlesource.com/platform/frameworks/native/+/master/libs/ui/Fence.cpp



On Wed, Oct 30, 2013 at 5:17 AM, Maarten Lankhorst <
maarten.lankhorst at canonical.com> wrote:

> op 24-10-13 14:13, Maarten Lankhorst schreef:
> > op 09-10-13 16:39, Maarten Lankhorst schreef:
> >> Hey,
> >>
> >>  op 08-10-13 19:37, John Stultz schreef:
> >>> On Wed, Oct 2, 2013 at 11:13 AM, Erik Gilling 
> wrote:
>  On Wed, Oct 2, 2013 at 12:35 AM, Maarten Lankhorst
>   wrote:
> > Depending on feedback I'll try reflashing my nexus 7 to stock android,
> > and work on trying to convert android syncpoints to dma-fence, which
> > I'll probably rename to syncpoints.
>  I thought the plan decided at plumbers was to investigate backing
>  dma_buf with the android sync solution not the other way around.  It
>  doesn't make sense to me to take a working, tested, end-to-end
>  solution with a released compositing system built around it, throw it
>  out, and replace it with new un-tested code to
>  support a system which is not yet built.
> >>> Hey Erik,
> >>>   Thanks for the clarifying points in your email, your insights and
> >>> feedback are critical, and I think having you and Maarten continue to
> >>> work out the details here will make this productive.
> >>>
> >>> My recollection from the discussion was that Rob was ok with trying to
> >>> pipe the sync arguments through the various interfaces in order to
> >>> support the explicit sync, but I think he did suggest having it backed
> >>> by the dma-buf fences underneath.
> >>>
> >>> I know this can be frustrating to watch things be reimplemented when
> >>> you have a pre-baked solution, but some compromise will be needed to
> >>> get things merged (and Maarten is taking the initiative here), but its
> >>> important to keep discussing this so the *right* compromises are made
> >>> that don't hurt performance, etc.
> >>>
> >>> My hope is Maarten's approach of getting the dma-fence core
> >>> integrated, and then moving the existing Android sync interface over
> >>> to the shared back-end, will allow for proper apples-to-apples
> >>> comparisons of the same interface. And if the functionality isn't
> >>> sufficient we can hold off on merging the sync interface conversion
> >>> until that gets resolved.
> >>>
> >> Yeah, I'm trying to understand the android side too. I think a unified
> >> interface would benefit both. I'm toying a bit with the sw_sync driver
> >> in staging because it's the easiest to try out on my desktop.
> >>
> >> The timeline stuff looks like it could be simplified. The main
> >> difference that there seems to be is that I didn't want to create a
> >> separate timeline struct for synchronization but let the drivers
> >> handle it.
> >>
> >> If you use rcu for reference lifetime management of timeline, the kref
> >> can be dropped. Signalling all syncpts on timeline destroy to a new
> >> destroyed state would kill the need for a destroyed member.  The active
> >> list is unneeded and can be killed if only a linear progression of
> >> child_list is allowed.
> >>
> >> Which probably leaves this nice structure:
> >> struct sync_timeline {
> >>         const struct sync_timeline_ops  *ops;
> >>         char                            name[32];
> >>
> >>         struct list_head                child_list_head;
> >>         spinlock_t                      child_list_lock;
> >>
> >>         struct list_head                sync_timeline_list;
> >> };
> >>
> >> Where name and sync_timeline_list are nice for debugging, but I guess
> >> not necessarily required, so that could be split out into a separate
> >> debugfs thing if required. I've moved the pointer to ops to the fence
> >> for dma-fence, which leaves this..
> >>
> >> struct sync_timeline {
> >>         struct list_head        child_list_head;
> >>         spinlock_t              child_list_lock;
> >>
> >>         struct sync_timeline_debug {
> >>                 struct list_head        sync_timeline_list;
> >>                 char                    name[32];
> >>         };
> >> };
> >>
> >> Hm, this looks familiar, the drm drivers had some structure for
> >> protecting the active fence list that has an identical definition, but
> >> with a slightly different list offset..
> >>
> >> struct __wait_queue_head {
> >> spinlock_t lock;
> >> struct list_head task_list;
> >> };
> >>
> >> typedef struct __wait_queue_head wait_queue_head_t;
> >>
> >> This is nicer to convert the existing drm drivers, which already
> >> implement synchronous wait with a waitqueue.
> >> The default wait op is in fact
> >>
> >> Ok