[RFC] Explicit synchronization for Nouveau
Riley (CCed) and I will be at Plumbers in a couple of weeks. There is a session on sync planned in the Android track, and of course we'll be available to chat.

On Thu, Oct 2, 2014 at 1:44 PM, Daniel Vetter wrote:
> On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
> > +Rom who seems to be presenting about mainlining android sync at linux plumbers
>
> Also add Greg KH as fyi that we're working on de-staging one of the android subsystems.
>
> > On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
> > > You could neuter implicit fences by always attaching the fences as
> > > shared when explicit syncing is used. This would work correctly with
> > > eviction, and wouldn't cause any unneeded syncing. :)
> >
> > Yes, that will probably work! So, just to reiterate that I understood you and
> > Daniel correctly:
> >
> > - de-stage sync_fence and its user space API (the tedious part)
> > - add dma-buf ioctls for extracting and attaching explicit fences
> > - Nouveau specific: allow flagging gem buffers for explicit sync
> > - do not check pre-fences from explicitly synced buffers at submit
> > - continue to attach a shared fence after submit so that pinning and
> >   unmapping continue to work
>
> - Explicit in/out fences for the pushbuf ioctl are missing in this step, I guess?
>
> I also think we need some kind of demonstration vehicle using nouveau to
> satisfy Dave Airlie's open-source userspace requirements for new
> interfaces. Might be good to chat with him to make sure we have that
> covered (and how much needs to be there really).
>
> > Then working sets and getting rid of locking all buffers individually
> > can be dealt with later as an optimization.
>
> Yeah, sounds like a good plan.
>
> > On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
> > > On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
> > > > Implicit fences attached to individual buffers are one way for residency
> > > > management.
> > > > Do you think a working set based model could work in the DRM
> > > > framework? For example, something like this:
> > > >
> > > > - Allow user space to create "working set objects" and associate buffers with
> > > >   them. If the user space doesn't want to manage working sets explicitly, it
> > > >   could also use an implicit default working set that contains all buffers that
> > > >   are mapped to the channel vm (on Android we could always use the default
> > > >   working set since we don't need to manage residency). The working sets are
> > > >   initially marked as dirty.
> > > > - User space tells which working sets are referenced by each work submission.
> > > >   Kernel locks these working sets, pins all buffers in dirty working sets, and
> > > >   resets the dirty bits. After kicking off work, kernel stores the fence to
> > > >   the _working sets_, and then releases the locks (if an implicit default
> > > >   working set is used, then this would be roughly equivalent to storing a fence
> > > >   to channel vm that tells "this is the last hw operation that might have
> > > >   touched buffers in this address space").
> > > > - If swapping doesn't happen, then we just need to check the working set dirty
> > > >   bits at each submit.
> > > > - When a buffer is swapped out, all working sets that refer to it need to be
> > > >   marked as dirty.
> > > > - When a buffer is swapped out or unmapped, we need to wait for the fences from
> > > >   all working sets that refer to the buffer.
> > > >
> > > > Initially one might think of working sets as a mere optimization - we now need
> > > > to process a few working sets at every submit instead of many individual
> > > > buffers. However, it makes a huge difference because of fences: fences that
> > > > are attached to buffers are used for implicitly synchronizing work across
> > > > different channels and engines. They are in the performance critical path, and
> > > > we want to carefully manage them (that's the idea of explicit synchronization).
> > > > The working set fences, on the other hand, would only be used to guarantee that
> > > > we don't swap out or unmap something that the GPU might be accessing. We never
> > > > need to wait for those fences (except when swapping or unmapping), so we can be
> > > > conservative without hurting performance.
> > >
> > > Yeah, within the driver (i.e. for private objects which are never exported
> > > to dma_buf) we can recently do stuff like this. And your above idea is
> > > roughly one of the things we're tossing around for i915.
> > >
> > > But the cool stuff with drm is that cmd submission is driver-specific, so
> > > you can just go wild with nouveau. Of course you have to convince the
> > > nouveau guys (and also have open-source users for the new interface).
> > >
> > > For shared buffers I think we should stick with the implicit fences for a
> > > while simply
[Linaro-mm-sig] thoughts of looking at android fences
I ran some benchmarks and things seem to be running about the same. No one on our graphics team seemed concerned about the change. The only concern I heard was about the increased complexity of the new sync code as opposed to the old sync framework, which tried to keep things straightforward.

On Fri, Nov 8, 2013 at 3:43 AM, Maarten Lankhorst <maarten.lankhorst at canonical.com> wrote:
> op 07-11-13 22:11, Rom Lemarchand schreef:
> > Hi Maarten, I tested your changes and needed the attached patch: behavior
> > now seems equivalent to android sync. I haven't tested performance.
> >
> > The issue resolved by this patch happens when i_b < b->num_fences and i_a
> > >= a->num_fences (or vice versa). Then, pt_a is invalid and so
> > dereferencing pt_a->context causes a crash.
>
> Yeah, I pushed my original fix. I intended to keep android userspace
> behavior the same, and I tried to keep the kernelspace api the same as much
> as I could. If performance is the same, or not noticeably worse, would there
> be any objections on the android side about renaming dma-fence to
> syncpoint, and getting it in mainline?
>
> ~Maarten
[Linaro-mm-sig] thoughts of looking at android fences
Let me run some benchmarks today, talk to people internally, and I'll let you know.

On Nov 8, 2013 3:43 AM, "Maarten Lankhorst" wrote:
> op 07-11-13 22:11, Rom Lemarchand schreef:
> > Hi Maarten, I tested your changes and needed the attached patch: behavior
> > now seems equivalent to android sync. I haven't tested performance.
> >
> > The issue resolved by this patch happens when i_b < b->num_fences and i_a
> > >= a->num_fences (or vice versa). Then, pt_a is invalid and so
> > dereferencing pt_a->context causes a crash.
>
> Yeah, I pushed my original fix. I intended to keep android userspace
> behavior the same, and I tried to keep the kernelspace api the same as much
> as I could. If performance is the same, or not noticeably worse, would there
> be any objections on the android side about renaming dma-fence to
> syncpoint, and getting it in mainline?
>
> ~Maarten
[Linaro-mm-sig] thoughts of looking at android fences
Hi Maarten, I tested your changes and needed the attached patch: behavior now seems equivalent to android sync. I haven't tested performance.

The issue resolved by this patch happens when i_b < b->num_fences and i_a >= a->num_fences (or vice versa). Then, pt_a is invalid and so dereferencing pt_a->context causes a crash.

On Mon, Nov 4, 2013 at 2:31 AM, Maarten Lankhorst <maarten.lankhorst at canonical.com> wrote:
> op 02-11-13 22:36, Colin Cross schreef:
> > On Wed, Oct 30, 2013 at 5:17 AM, Maarten Lankhorst wrote:
> >> op 24-10-13 14:13, Maarten Lankhorst schreef:
> >>> So I actually tried to implement it now. I killed all the deprecated
> >>> members and assumed a linear timeline. This means that syncpoints can
> >>> only be added at the end, not in between. In particular it means
> >>> sw_sync might be slightly broken.
> >>>
> >>> I only tested it with a simple program I wrote called ufence.c, it's
> >>> in drivers/staging/android/ufence.c in the following tree:
> >>>
> >>> http://cgit.freedesktop.org/~mlankhorst/linux
> >>>
> >>> the "rfc: convert android to fence api" branch has all the changes from
> >>> my dma-fence proposal to what android would need; it also converts the
> >>> userspace fence api to use the dma-fence api.
> >>>
> >>> sync_pt is implemented as a fence too. This meant not having to convert
> >>> all of android right away, though I did make some changes. I killed the
> >>> deprecated members and made all the fence calls forward to the
> >>> sync_timeline_ops. dup and compare are no longer used.
> >>>
> >>> I haven't given this a spin on a full android kernel, only with the
> >>> components that are in the mainline kernel under staging and my dumb
> >>> test program.
> >>>
> >>> ~Maarten
> >>>
> >>> PS: The nomenclature is very confusing. I want to rename dma-fence to
> >>> syncpoint, but I want some feedback from the android devs first. :)
> >>>
> >> Come on, any feedback? I want to move the discussion forward.
> >>
> >> ~Maarten
>
> > I experimented with it a little on a device that uses sync and came
> > across a few bugs:
> > 1. sync_timeline_signal needs to call __fence_signal on all signaled
> >    points on the timeline, not just the first
> > 2. fence_add_callback doesn't always initialize cb.node
> > 3. sync_fence_wait should take ms
> > 4. sync_print_pt status printing was incorrect
> > 5. there is a deadlock:
> >    sync_print_obj takes obj->child_list_lock
> >    sync_print_pt
> >    fence_is_signaled
> >    fence_signal takes fence->lock == obj->child_list_lock
> > 6. freeing a timeline before all the fences holding points on that
> >    timeline have timed out results in a crash
> >
> > With the attached patch to fix these issues, our libsync and sync_test
> > give the same results as with our sync code. I haven't tested against
> > the full Android framework yet.
> >
> > The compare op and timeline ordering is critical to the efficiency of
> > sync points on Android. The compare op is used when merging fences to
> > drop all but the latest point on the same timeline. This is necessary
> > for example when the same buffer is submitted to the display on
> > multiple frames, like when there is a live wallpaper in the background
> > updating at 60 fps and a static screen of widgets on top of it. The
> > static widget buffer is submitted on every frame, returning a new
> > fence each time. The compositor merges the new fence with the fence
> > for the previous buffer, and because they are on the same timeline it
> > merges down to a single point. I experimented with disabling the
> > merge optimization on a Nexus 10, and found that leaving the screen on
> > running a live wallpaper eventually resulted in 100k outstanding sync
> > points.
>
> Well, here I did the same for dma-fence, can you take a look?
> ---
>
> diff --git a/drivers/staging/android/sync.c b/drivers/staging/android/sync.c
> index 2c7fd3f2ab23..d1d89f1f8553 100644
> --- a/drivers/staging/android/sync.c
> +++ b/drivers/staging/android/sync.c
> @@ -232,39 +232,62 @@ void sync_fence_install(struct sync_fence *fence, int fd)
>  }
>  EXPORT_SYMBOL(sync_fence_install);
>
> +static void sync_fence_add_pt(struct sync_fence *fence, int *i, struct fence *pt)
> +{
> +	fence->cbs[*i].sync_pt = pt;
> +	fence->cbs[*i].fence = fence;
> +
> +	if (!fence_add_callback(pt, &fence->cbs[*i].cb, fence_check_cb_func)) {
> +		fence_get(pt);
> +		(*i)++;
> +	}
> +}
> +
>  struct sync_fence *sync_fence_merge(const char *name,
>  				    struct sync_fence *a, struct sync_fence *b)
>  {
>  	int num_fences = a->num_fences + b->num_fences;
>  	struct sync_fence *fence;
> -	int i;
> +	int i, i_a, i_b;
>
>  	fence = sync_fence_alloc(offsetof(struct sync_fence, cbs[num_fences]), name);
>  	if (fence == NULL)
>  		return NULL;
>
> -	fence->num_fences = num_fences;
>  	atomic_set(&fence->status, num_fences);
>
> -	for (i = 0; i <
[Linaro-mm-sig] thoughts of looking at android fences
Sorry about the delay. Hopefully other people from Android will also chime in.

We need the ability to merge sync fences and keep the sync_pts ordered: the idea behind sync timelines is that we promise an ordering of operations. Our reference device is the Nexus 10: we need to make sure that any new implementation satisfies the same requirements. You can find sample use-cases here; we also use it in our hardware composer libraries:

https://android.googlesource.com/platform/system/core/+/refs/heads/master/libsync/
https://android.googlesource.com/platform/frameworks/native/+/master/libs/ui/Fence.cpp

On Wed, Oct 30, 2013 at 5:17 AM, Maarten Lankhorst <maarten.lankhorst at canonical.com> wrote:
> op 24-10-13 14:13, Maarten Lankhorst schreef:
> > op 09-10-13 16:39, Maarten Lankhorst schreef:
> >> Hey,
> >>
> >> op 08-10-13 19:37, John Stultz schreef:
> >>> On Wed, Oct 2, 2013 at 11:13 AM, Erik Gilling wrote:
> >>>> On Wed, Oct 2, 2013 at 12:35 AM, Maarten Lankhorst wrote:
> >>>>> Depending on feedback I'll try reflashing my nexus 7 to stock android,
> >>>>> and work on trying to convert android syncpoints to dma-fence, which
> >>>>> I'll probably rename to syncpoints.
> >>>> I thought the plan decided at plumbers was to investigate backing
> >>>> dma_buf with the android sync solution, not the other way around. It
> >>>> doesn't make sense to me to take a working, tested, end-to-end
> >>>> solution with a released compositing system built around it, throw it
> >>>> out, and replace it with new untested code to support a system which
> >>>> is not yet built.
> >>> Hey Erik,
> >>> Thanks for the clarifying points in your email, your insights and
> >>> feedback are critical, and I think having you and Maarten continue to
> >>> work out the details here will make this productive.
> >>>
> >>> My recollection from the discussion was that Rob was ok with trying to
> >>> pipe the sync arguments through the various interfaces in order to
> >>> support the explicit sync, but I think he did suggest having it backed
> >>> by the dma-buf fences underneath.
> >>>
> >>> I know this can be frustrating to watch things be reimplemented when
> >>> you have a pre-baked solution, but some compromise will be needed to
> >>> get things merged (and Maarten is taking the initiative here), but it's
> >>> important to keep discussing this so the *right* compromises are made
> >>> that don't hurt performance, etc.
> >>>
> >>> My hope is Maarten's approach of getting the dma-fence core
> >>> integrated, and then moving the existing Android sync interface over
> >>> to the shared back-end, will allow for proper apples-to-apples
> >>> comparisons of the same interface. And if the functionality isn't
> >>> sufficient we can hold off on merging the sync interface conversion
> >>> until that gets resolved.
> >>>
> >> Yeah, I'm trying to understand the android side too. I think a unified
> >> interface would benefit both. I'm toying a bit with the sw_sync driver
> >> in staging because it's the easiest to try out on my desktop.
> >>
> >> The timeline stuff looks like it could be simplified. The main
> >> difference that there seems to be is that I didn't want to create a
> >> separate timeline struct for synchronization but let the drivers
> >> handle it.
> >>
> >> If you use rcu for reference lifetime management of the timeline, the
> >> kref can be dropped. Signalling all syncpts on timeline destroy to a
> >> new destroyed state would kill the need for a destroyed member.
> >> The active list is unneeded and can be killed if only a linear
> >> progression of child_list is allowed.
> >>
> >> Which probably leaves this nice structure:
> >>
> >> struct sync_timeline {
> >> 	const struct sync_timeline_ops *ops;
> >> 	char name[32];
> >>
> >> 	struct list_head child_list_head;
> >> 	spinlock_t child_list_lock;
> >>
> >> 	struct list_head sync_timeline_list;
> >> };
> >>
> >> Where name and sync_timeline_list are nice for debugging, but I guess
> >> not necessarily required, so that could be split out into a separate
> >> debugfs thing if required. I've moved the pointer to ops to the fence
> >> for dma-fence, which leaves this:
> >>
> >> struct sync_timeline {
> >> 	struct list_head child_list_head;
> >> 	spinlock_t child_list_lock;
> >>
> >> 	struct sync_timeline_debug {
> >> 		struct list_head sync_timeline_list;
> >> 		char name[32];
> >> 	};
> >> };
> >>
> >> Hm, this looks familiar, the drm drivers had some structure for
> >> protecting the active fence list that has an identical definition,
> >> but with a slightly different list offset:
> >>
> >> struct __wait_queue_head {
> >> 	spinlock_t lock;
> >> 	struct list_head task_list;
> >> };
> >>
> >> typedef struct __wait_queue_head wait_queue_head_t;
> >>
> >> This is nicer to convert the existing drm drivers, which already
> >> implement synchronous wait with a waitqueue.
> >> The default wait op is in fact
> >>
> >> Ok