Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-03-03 Thread Daniel Vetter
On Wed, Feb 19, 2014 at 02:25:59PM +0100, Maarten Lankhorst wrote:
 op 17-02-14 19:41, Christian König schreef:
 Am 17.02.2014 19:24, schrieb Rob Clark:
 On Mon, Feb 17, 2014 at 12:36 PM, Christian König
 deathsim...@vodafone.de wrote:
 Am 17.02.2014 18:27, schrieb Rob Clark:
 
 On Mon, Feb 17, 2014 at 11:56 AM, Christian König
 deathsim...@vodafone.de wrote:
 Am 17.02.2014 16:56, schrieb Maarten Lankhorst:
 
 This type of fence can be used with hardware synchronization for simple
 hardware that can block execution until the condition
 (dma_buf[offset] - value) = 0 has been met.
 
 Can't we make that just dma_buf[offset] != 0 instead? As far as I know
 this way it would match the definition M$ uses in their WDDM
 specification
 and so make it much more likely that hardware supports it.
 well 'buf[offset] = value' at least means the same slot can be used
 for multiple operations (with increasing values of 'value').. not sure
 if that is something people care about.
 
 =value seems to be possible with adreno and radeon.  I'm not really sure
 about others (although I presume it as least supported for nv desktop
 stuff).  For hw that cannot do =value, we can either have a different 
 fence
 implementation which uses the !=0 approach.  Or change seqno-fence
 implementation later if needed.  But if someone has hw that can do !=0 
 but
 not =value, speak up now ;-)
 
 Here! Radeon can only do =value on the DMA and 3D engine, but not with UVD
 or VCE. And for the 3D engine it means draining the pipe, which isn't 
 really
 a good idea.
 hmm, ok.. forgot you have a few extra rings compared to me.  Is UVD
 re-ordering from decode-order to display-order for you in hw? If not,
 I guess you need sw intervention anyways when a frame is done for
 frame re-ordering, so maybe hw-hw sync doesn't really matter as much
 as compared to gpu/3d-display.  For dma-3d interactions, seems like
 you would care more about hw-hw sync, but I guess you aren't likely
 to use GPU A to do a resolve blit for GPU B..
 
 No UVD isn't reordering, but since frame reordering is predictable you 
 usually end up with pipelining everything to the hardware. E.g. you send the 
 decode commands in decode order to the UVD block and if you have overlay 
 active one of the frames are going to be the first to display and then you 
 want to wait for it on the display side.
 
 For 3D ring, I assume you probably want a CP_WAIT_FOR_IDLE before a
 CP_MEM_WRITE to update fence value in memory (for the one signalling
 the fence).  But why would you need that before a CP_WAIT_REG_MEM (for
 the one waiting for the fence)?  I don't exactly have documentation
 for adreno version of CP_WAIT_REG_{MEM,EQ,GTE}..  but PFP and ME
 appear to be same instruction set as r600, so I'm pretty sure they
 should have similar capabilities.. CP_WAIT_REG_MEM appears to be same
 but with 32bit gpu addresses vs 64b.
 
 You shouldn't use any of the CP commands for engine synchronization (neither 
 for wait nor for signal). The PFP and ME are just the top of a quite deep 
 pipeline and when you use any of the CP_WAIT functions you block them for 
 something and that's draining the pipeline.
 
 With the semaphore and fence commands the values are just attached as 
 prerequisite to the draw command, e.g. the CP setups the draw environment 
 and issues the command, but the actual execution of it is delayed until the 
 != 0 condition hits. And in the meantime the CP already prepares the next 
 draw operation.
 
 But at least for compute queues wait semaphore aren't the perfect solution 
 either. What you need then is a GPU scheduler that uses a kernel task for 
 setting up the command submission for you when all prerequisites are meet.
 nouveau has sort of a scheduler in hardware. It can yield when waiting
 on a semaphore. And each process gets their own context and the
 timeslices can be adjusted. ;-) But I don't mind changing this patch
 when an actual user pops up. Nouveau can do a  wait for (*sema  mask)
 != 0 only on nvc0 and newer, where mask can be chosen. But it can do ==
 somevalue and = somevalue on older relevant optimus hardware, so if we
 know that it was zero before and we know the sign of the new value that
 could work too.
 
 Adding ops and a separate mask later on when users pop up is fine with
 me, the original design here was chosen so I could map the intel status
 page read-only into the process specific nvidia vm.

Yeah, I guess in the end we might end up having a pile of different
memory-based (shared through dma-bufs) based fences. But imo getting this
thing of the ground is more important, and you can always do crazy
per-platform/generation/whatever hacks and match on that specific fence
type in the interim. That'll cut it for the SoC madness, which also seems
to be what google does with the android syncpoint stuff.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line 

Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-19 Thread Maarten Lankhorst

op 17-02-14 19:41, Christian König schreef:

Am 17.02.2014 19:24, schrieb Rob Clark:

On Mon, Feb 17, 2014 at 12:36 PM, Christian König
deathsim...@vodafone.de wrote:

Am 17.02.2014 18:27, schrieb Rob Clark:


On Mon, Feb 17, 2014 at 11:56 AM, Christian König
deathsim...@vodafone.de wrote:

Am 17.02.2014 16:56, schrieb Maarten Lankhorst:


This type of fence can be used with hardware synchronization for simple
hardware that can block execution until the condition
(dma_buf[offset] - value) = 0 has been met.


Can't we make that just dma_buf[offset] != 0 instead? As far as I know
this way it would match the definition M$ uses in their WDDM
specification
and so make it much more likely that hardware supports it.

well 'buf[offset] = value' at least means the same slot can be used
for multiple operations (with increasing values of 'value').. not sure
if that is something people care about.


=value seems to be possible with adreno and radeon.  I'm not really sure
about others (although I presume it as least supported for nv desktop
stuff).  For hw that cannot do =value, we can either have a different fence
implementation which uses the !=0 approach.  Or change seqno-fence
implementation later if needed.  But if someone has hw that can do !=0 but
not =value, speak up now ;-)


Here! Radeon can only do =value on the DMA and 3D engine, but not with UVD
or VCE. And for the 3D engine it means draining the pipe, which isn't really
a good idea.

hmm, ok.. forgot you have a few extra rings compared to me.  Is UVD
re-ordering from decode-order to display-order for you in hw? If not,
I guess you need sw intervention anyways when a frame is done for
frame re-ordering, so maybe hw-hw sync doesn't really matter as much
as compared to gpu/3d-display.  For dma-3d interactions, seems like
you would care more about hw-hw sync, but I guess you aren't likely
to use GPU A to do a resolve blit for GPU B..


No UVD isn't reordering, but since frame reordering is predictable you usually 
end up with pipelining everything to the hardware. E.g. you send the decode 
commands in decode order to the UVD block and if you have overlay active one of 
the frames are going to be the first to display and then you want to wait for 
it on the display side.


For 3D ring, I assume you probably want a CP_WAIT_FOR_IDLE before a
CP_MEM_WRITE to update fence value in memory (for the one signalling
the fence).  But why would you need that before a CP_WAIT_REG_MEM (for
the one waiting for the fence)?  I don't exactly have documentation
for adreno version of CP_WAIT_REG_{MEM,EQ,GTE}..  but PFP and ME
appear to be same instruction set as r600, so I'm pretty sure they
should have similar capabilities.. CP_WAIT_REG_MEM appears to be same
but with 32bit gpu addresses vs 64b.


You shouldn't use any of the CP commands for engine synchronization (neither 
for wait nor for signal). The PFP and ME are just the top of a quite deep 
pipeline and when you use any of the CP_WAIT functions you block them for 
something and that's draining the pipeline.

With the semaphore and fence commands the values are just attached as prerequisite to the 
draw command, e.g. the CP setups the draw environment and issues the command, but the 
actual execution of it is delayed until the != 0 condition hits. And in the 
meantime the CP already prepares the next draw operation.

But at least for compute queues wait semaphore aren't the perfect solution 
either. What you need then is a GPU scheduler that uses a kernel task for 
setting up the command submission for you when all prerequisites are meet.

nouveau has sort of a scheduler in hardware. It can yield when waiting on a semaphore. 
And each process gets their own context and the timeslices can be adjusted. ;-) But I 
don't mind changing this patch when an actual user pops up. Nouveau can do a  wait for 
(*sema  mask) != 0 only on nvc0 and newer, where mask can be chosen. But it can 
do == somevalue and = somevalue on older relevant optimus hardware, so if we know 
that it was zero before and we know the sign of the new value that could work too.

Adding ops and a separate mask later on when users pop up is fine with me, the 
original design here was chosen so I could map the intel status page read-only 
into the process specific nvidia vm.

~Maarten

--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-17 Thread Maarten Lankhorst
This type of fence can be used with hardware synchronization for simple
hardware that can block execution until the condition
(dma_buf[offset] - value) = 0 has been met.

A software fallback still has to be provided in case the fence is used
with a device that doesn't support this mechanism. It is useful to expose
this for graphics cards that have an op to support this.

Some cards like i915 can export those, but don't have an option to wait,
so they need the software fallback.

I extended the original patch by Rob Clark.

v1: Original
v2: Renamed from bikeshed to seqno, moved into dma-fence.c since
not much was left of the file. Lots of documentation added.
v3: Use fence_ops instead of custom callbacks. Moved to own file
to avoid circular dependency between dma-buf.h and fence.h
v4: Add spinlock pointer to seqno_fence_init

Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
---
 Documentation/DocBook/device-drivers.tmpl |1 
 drivers/base/fence.c  |   50 +
 include/linux/seqno-fence.h   |  109 +
 3 files changed, 160 insertions(+)
 create mode 100644 include/linux/seqno-fence.h

diff --git a/Documentation/DocBook/device-drivers.tmpl 
b/Documentation/DocBook/device-drivers.tmpl
index 7a0c9ddb4818..8c85c20942c2 100644
--- a/Documentation/DocBook/device-drivers.tmpl
+++ b/Documentation/DocBook/device-drivers.tmpl
@@ -131,6 +131,7 @@ X!Edrivers/base/interface.c
 !Edrivers/base/dma-buf.c
 !Edrivers/base/fence.c
 !Iinclude/linux/fence.h
+!Iinclude/linux/seqno-fence.h
 !Edrivers/base/reservation.c
 !Iinclude/linux/reservation.h
 !Edrivers/base/dma-coherent.c
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 12df2bf62034..cd0937127a89 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -25,6 +25,7 @@
 #include linux/export.h
 #include linux/atomic.h
 #include linux/fence.h
+#include linux/seqno-fence.h
 
 #define CREATE_TRACE_POINTS
 #include trace/events/fence.h
@@ -413,3 +414,52 @@ __fence_init(struct fence *fence, const struct fence_ops 
*ops,
trace_fence_init(fence);
 }
 EXPORT_SYMBOL(__fence_init);
+
+static const char *seqno_fence_get_driver_name(struct fence *fence) {
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-get_driver_name(fence);
+}
+
+static const char *seqno_fence_get_timeline_name(struct fence *fence) {
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-get_timeline_name(fence);
+}
+
+static bool seqno_enable_signaling(struct fence *fence)
+{
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-enable_signaling(fence);
+}
+
+static bool seqno_signaled(struct fence *fence)
+{
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-signaled  seqno_fence-ops-signaled(fence);
+}
+
+static void seqno_release(struct fence *fence)
+{
+   struct seqno_fence *f = to_seqno_fence(fence);
+
+   dma_buf_put(f-sync_buf);
+   if (f-ops-release)
+   f-ops-release(fence);
+   else
+   kfree(f);
+}
+
+static long seqno_wait(struct fence *fence, bool intr, signed long timeout)
+{
+   struct seqno_fence *f = to_seqno_fence(fence);
+   return f-ops-wait(fence, intr, timeout);
+}
+
+const struct fence_ops seqno_fence_ops = {
+   .get_driver_name = seqno_fence_get_driver_name,
+   .get_timeline_name = seqno_fence_get_timeline_name,
+   .enable_signaling = seqno_enable_signaling,
+   .signaled = seqno_signaled,
+   .wait = seqno_wait,
+   .release = seqno_release,
+};
+EXPORT_SYMBOL(seqno_fence_ops);
diff --git a/include/linux/seqno-fence.h b/include/linux/seqno-fence.h
new file mode 100644
index ..952f7909128c
--- /dev/null
+++ b/include/linux/seqno-fence.h
@@ -0,0 +1,109 @@
+/*
+ * seqno-fence, using a dma-buf to synchronize fencing
+ *
+ * Copyright (C) 2012 Texas Instruments
+ * Copyright (C) 2012 Canonical Ltd
+ * Authors:
+ * Rob Clark robdcl...@gmail.com
+ *   Maarten Lankhorst maarten.lankho...@canonical.com
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see http://www.gnu.org/licenses/.
+ */
+
+#ifndef __LINUX_SEQNO_FENCE_H
+#define __LINUX_SEQNO_FENCE_H
+
+#include linux/fence.h
+#include linux/dma-buf.h
+
+struct seqno_fence {
+   struct fence base;
+
+   const struct fence_ops *ops;
+   struct dma_buf 

Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-17 Thread Rob Clark
On Mon, Feb 17, 2014 at 10:56 AM, Maarten Lankhorst
maarten.lankho...@canonical.com wrote:
 This type of fence can be used with hardware synchronization for simple
 hardware that can block execution until the condition
 (dma_buf[offset] - value) = 0 has been met.

 A software fallback still has to be provided in case the fence is used
 with a device that doesn't support this mechanism. It is useful to expose
 this for graphics cards that have an op to support this.

 Some cards like i915 can export those, but don't have an option to wait,
 so they need the software fallback.

 I extended the original patch by Rob Clark.

 v1: Original
 v2: Renamed from bikeshed to seqno, moved into dma-fence.c since
 not much was left of the file. Lots of documentation added.
 v3: Use fence_ops instead of custom callbacks. Moved to own file
 to avoid circular dependency between dma-buf.h and fence.h
 v4: Add spinlock pointer to seqno_fence_init

 Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com


Reviewed-by: Rob Clark robdcl...@gmail.com


 ---
  Documentation/DocBook/device-drivers.tmpl |1
  drivers/base/fence.c  |   50 +
  include/linux/seqno-fence.h   |  109 
 +
  3 files changed, 160 insertions(+)
  create mode 100644 include/linux/seqno-fence.h

 diff --git a/Documentation/DocBook/device-drivers.tmpl 
 b/Documentation/DocBook/device-drivers.tmpl
 index 7a0c9ddb4818..8c85c20942c2 100644
 --- a/Documentation/DocBook/device-drivers.tmpl
 +++ b/Documentation/DocBook/device-drivers.tmpl
 @@ -131,6 +131,7 @@ X!Edrivers/base/interface.c
  !Edrivers/base/dma-buf.c
  !Edrivers/base/fence.c
  !Iinclude/linux/fence.h
 +!Iinclude/linux/seqno-fence.h
  !Edrivers/base/reservation.c
  !Iinclude/linux/reservation.h
  !Edrivers/base/dma-coherent.c
 diff --git a/drivers/base/fence.c b/drivers/base/fence.c
 index 12df2bf62034..cd0937127a89 100644
 --- a/drivers/base/fence.c
 +++ b/drivers/base/fence.c
 @@ -25,6 +25,7 @@
  #include linux/export.h
  #include linux/atomic.h
  #include linux/fence.h
 +#include linux/seqno-fence.h

  #define CREATE_TRACE_POINTS
  #include trace/events/fence.h
 @@ -413,3 +414,52 @@ __fence_init(struct fence *fence, const struct fence_ops 
 *ops,
 trace_fence_init(fence);
  }
  EXPORT_SYMBOL(__fence_init);
 +
 +static const char *seqno_fence_get_driver_name(struct fence *fence) {
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-get_driver_name(fence);
 +}
 +
 +static const char *seqno_fence_get_timeline_name(struct fence *fence) {
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-get_timeline_name(fence);
 +}
 +
 +static bool seqno_enable_signaling(struct fence *fence)
 +{
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-enable_signaling(fence);
 +}
 +
 +static bool seqno_signaled(struct fence *fence)
 +{
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-signaled  
 seqno_fence-ops-signaled(fence);
 +}
 +
 +static void seqno_release(struct fence *fence)
 +{
 +   struct seqno_fence *f = to_seqno_fence(fence);
 +
 +   dma_buf_put(f-sync_buf);
 +   if (f-ops-release)
 +   f-ops-release(fence);
 +   else
 +   kfree(f);
 +}
 +
 +static long seqno_wait(struct fence *fence, bool intr, signed long timeout)
 +{
 +   struct seqno_fence *f = to_seqno_fence(fence);
 +   return f-ops-wait(fence, intr, timeout);
 +}
 +
 +const struct fence_ops seqno_fence_ops = {
 +   .get_driver_name = seqno_fence_get_driver_name,
 +   .get_timeline_name = seqno_fence_get_timeline_name,
 +   .enable_signaling = seqno_enable_signaling,
 +   .signaled = seqno_signaled,
 +   .wait = seqno_wait,
 +   .release = seqno_release,
 +};
 +EXPORT_SYMBOL(seqno_fence_ops);
 diff --git a/include/linux/seqno-fence.h b/include/linux/seqno-fence.h
 new file mode 100644
 index ..952f7909128c
 --- /dev/null
 +++ b/include/linux/seqno-fence.h
 @@ -0,0 +1,109 @@
 +/*
 + * seqno-fence, using a dma-buf to synchronize fencing
 + *
 + * Copyright (C) 2012 Texas Instruments
 + * Copyright (C) 2012 Canonical Ltd
 + * Authors:
 + * Rob Clark robdcl...@gmail.com
 + *   Maarten Lankhorst maarten.lankho...@canonical.com
 + *
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms of the GNU General Public License version 2 as published 
 by
 + * the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful, but 
 WITHOUT
 + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
 + * more details.
 + *
 + * You should have received a copy of the GNU General Public License along 
 with
 + * this program.  If not, 

Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-17 Thread Christian König

Am 17.02.2014 16:56, schrieb Maarten Lankhorst:

This type of fence can be used with hardware synchronization for simple
hardware that can block execution until the condition
(dma_buf[offset] - value) = 0 has been met.


Can't we make that just dma_buf[offset] != 0 instead? As far as I know 
this way it would match the definition M$ uses in their WDDM 
specification and so make it much more likely that hardware supports it.


Apart from that I still don't like the idea of leaking a drivers IRQ 
context outside of the driver, but without a proper GPU scheduler there 
probably isn't much alternative.


Christian.



A software fallback still has to be provided in case the fence is used
with a device that doesn't support this mechanism. It is useful to expose
this for graphics cards that have an op to support this.

Some cards like i915 can export those, but don't have an option to wait,
so they need the software fallback.

I extended the original patch by Rob Clark.

v1: Original
v2: Renamed from bikeshed to seqno, moved into dma-fence.c since
 not much was left of the file. Lots of documentation added.
v3: Use fence_ops instead of custom callbacks. Moved to own file
 to avoid circular dependency between dma-buf.h and fence.h
v4: Add spinlock pointer to seqno_fence_init

Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
---
  Documentation/DocBook/device-drivers.tmpl |1
  drivers/base/fence.c  |   50 +
  include/linux/seqno-fence.h   |  109 +
  3 files changed, 160 insertions(+)
  create mode 100644 include/linux/seqno-fence.h

diff --git a/Documentation/DocBook/device-drivers.tmpl 
b/Documentation/DocBook/device-drivers.tmpl
index 7a0c9ddb4818..8c85c20942c2 100644
--- a/Documentation/DocBook/device-drivers.tmpl
+++ b/Documentation/DocBook/device-drivers.tmpl
@@ -131,6 +131,7 @@ X!Edrivers/base/interface.c
  !Edrivers/base/dma-buf.c
  !Edrivers/base/fence.c
  !Iinclude/linux/fence.h
+!Iinclude/linux/seqno-fence.h
  !Edrivers/base/reservation.c
  !Iinclude/linux/reservation.h
  !Edrivers/base/dma-coherent.c
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 12df2bf62034..cd0937127a89 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -25,6 +25,7 @@
  #include linux/export.h
  #include linux/atomic.h
  #include linux/fence.h
+#include linux/seqno-fence.h
  
  #define CREATE_TRACE_POINTS

  #include trace/events/fence.h
@@ -413,3 +414,52 @@ __fence_init(struct fence *fence, const struct fence_ops 
*ops,
trace_fence_init(fence);
  }
  EXPORT_SYMBOL(__fence_init);
+
+static const char *seqno_fence_get_driver_name(struct fence *fence) {
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-get_driver_name(fence);
+}
+
+static const char *seqno_fence_get_timeline_name(struct fence *fence) {
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-get_timeline_name(fence);
+}
+
+static bool seqno_enable_signaling(struct fence *fence)
+{
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-enable_signaling(fence);
+}
+
+static bool seqno_signaled(struct fence *fence)
+{
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-signaled  seqno_fence-ops-signaled(fence);
+}
+
+static void seqno_release(struct fence *fence)
+{
+   struct seqno_fence *f = to_seqno_fence(fence);
+
+   dma_buf_put(f-sync_buf);
+   if (f-ops-release)
+   f-ops-release(fence);
+   else
+   kfree(f);
+}
+
+static long seqno_wait(struct fence *fence, bool intr, signed long timeout)
+{
+   struct seqno_fence *f = to_seqno_fence(fence);
+   return f-ops-wait(fence, intr, timeout);
+}
+
+const struct fence_ops seqno_fence_ops = {
+   .get_driver_name = seqno_fence_get_driver_name,
+   .get_timeline_name = seqno_fence_get_timeline_name,
+   .enable_signaling = seqno_enable_signaling,
+   .signaled = seqno_signaled,
+   .wait = seqno_wait,
+   .release = seqno_release,
+};
+EXPORT_SYMBOL(seqno_fence_ops);
diff --git a/include/linux/seqno-fence.h b/include/linux/seqno-fence.h
new file mode 100644
index ..952f7909128c
--- /dev/null
+++ b/include/linux/seqno-fence.h
@@ -0,0 +1,109 @@
+/*
+ * seqno-fence, using a dma-buf to synchronize fencing
+ *
+ * Copyright (C) 2012 Texas Instruments
+ * Copyright (C) 2012 Canonical Ltd
+ * Authors:
+ * Rob Clark robdcl...@gmail.com
+ *   Maarten Lankhorst maarten.lankho...@canonical.com
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * 

Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-17 Thread Rob Clark
On Mon, Feb 17, 2014 at 11:56 AM, Christian König
deathsim...@vodafone.de wrote:
 Am 17.02.2014 16:56, schrieb Maarten Lankhorst:

 This type of fence can be used with hardware synchronization for simple
 hardware that can block execution until the condition
 (dma_buf[offset] - value) = 0 has been met.


 Can't we make that just dma_buf[offset] != 0 instead? As far as I know
 this way it would match the definition M$ uses in their WDDM specification
 and so make it much more likely that hardware supports it.

well 'buf[offset] = value' at least means the same slot can be used
for multiple operations (with increasing values of 'value').. not sure
if that is something people care about.

=value seems to be possible with adreno and radeon.  I'm not really sure about 
others (although I presume it as least supported for nv desktop stuff).  For 
hw that cannot do =value, we can either have a different fence implementation 
which uses the !=0 approach.  Or change seqno-fence implementation later if 
needed.  But if someone has hw that can do !=0 but not =value, speak up now 
;-)


 Apart from that I still don't like the idea of leaking a drivers IRQ context
 outside of the driver, but without a proper GPU scheduler there probably
 isn't much alternative.

I guess it will be not uncommon scenario for gpu device to just need
to kick display device to write a few registers for a page flip..
probably best not to schedule a worker just for this (unless the
signalled device otherwise needs to).  I think it is better in this
case to give the signalee some rope to hang themselves, and make it
the responsibility of the callback to kick things off to a worker if
needed.

BR,
-R

 Christian.


 A software fallback still has to be provided in case the fence is used
 with a device that doesn't support this mechanism. It is useful to expose
 this for graphics cards that have an op to support this.

 Some cards like i915 can export those, but don't have an option to wait,
 so they need the software fallback.

 I extended the original patch by Rob Clark.

 v1: Original
 v2: Renamed from bikeshed to seqno, moved into dma-fence.c since
  not much was left of the file. Lots of documentation added.
 v3: Use fence_ops instead of custom callbacks. Moved to own file
  to avoid circular dependency between dma-buf.h and fence.h
 v4: Add spinlock pointer to seqno_fence_init

 Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
 ---
   Documentation/DocBook/device-drivers.tmpl |1
   drivers/base/fence.c  |   50 +
   include/linux/seqno-fence.h   |  109
 +
   3 files changed, 160 insertions(+)
   create mode 100644 include/linux/seqno-fence.h

 diff --git a/Documentation/DocBook/device-drivers.tmpl
 b/Documentation/DocBook/device-drivers.tmpl
 index 7a0c9ddb4818..8c85c20942c2 100644
 --- a/Documentation/DocBook/device-drivers.tmpl
 +++ b/Documentation/DocBook/device-drivers.tmpl
 @@ -131,6 +131,7 @@ X!Edrivers/base/interface.c
   !Edrivers/base/dma-buf.c
   !Edrivers/base/fence.c
   !Iinclude/linux/fence.h
 +!Iinclude/linux/seqno-fence.h
   !Edrivers/base/reservation.c
   !Iinclude/linux/reservation.h
   !Edrivers/base/dma-coherent.c
 diff --git a/drivers/base/fence.c b/drivers/base/fence.c
 index 12df2bf62034..cd0937127a89 100644
 --- a/drivers/base/fence.c
 +++ b/drivers/base/fence.c
 @@ -25,6 +25,7 @@
   #include linux/export.h
   #include linux/atomic.h
   #include linux/fence.h
 +#include linux/seqno-fence.h
 #define CREATE_TRACE_POINTS
   #include trace/events/fence.h
 @@ -413,3 +414,52 @@ __fence_init(struct fence *fence, const struct
 fence_ops *ops,
 trace_fence_init(fence);
   }
   EXPORT_SYMBOL(__fence_init);
 +
 +static const char *seqno_fence_get_driver_name(struct fence *fence) {
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-get_driver_name(fence);
 +}
 +
 +static const char *seqno_fence_get_timeline_name(struct fence *fence) {
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-get_timeline_name(fence);
 +}
 +
 +static bool seqno_enable_signaling(struct fence *fence)
 +{
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-enable_signaling(fence);
 +}
 +
 +static bool seqno_signaled(struct fence *fence)
 +{
 +   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
 +   return seqno_fence-ops-signaled 
 seqno_fence-ops-signaled(fence);
 +}
 +
 +static void seqno_release(struct fence *fence)
 +{
 +   struct seqno_fence *f = to_seqno_fence(fence);
 +
 +   dma_buf_put(f-sync_buf);
 +   if (f-ops-release)
 +   f-ops-release(fence);
 +   else
 +   kfree(f);
 +}
 +
 +static long seqno_wait(struct fence *fence, bool intr, signed long
 timeout)
 +{
 +   struct seqno_fence *f = to_seqno_fence(fence);
 +   return 

Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-17 Thread Christian König

Am 17.02.2014 18:27, schrieb Rob Clark:

On Mon, Feb 17, 2014 at 11:56 AM, Christian König
deathsim...@vodafone.de wrote:

Am 17.02.2014 16:56, schrieb Maarten Lankhorst:


This type of fence can be used with hardware synchronization for simple
hardware that can block execution until the condition
(dma_buf[offset] - value) = 0 has been met.


Can't we make that just dma_buf[offset] != 0 instead? As far as I know
this way it would match the definition M$ uses in their WDDM specification
and so make it much more likely that hardware supports it.

well 'buf[offset] = value' at least means the same slot can be used
for multiple operations (with increasing values of 'value').. not sure
if that is something people care about.


=value seems to be possible with adreno and radeon.  I'm not really sure about others 
(although I presume it as least supported for nv desktop stuff).  For hw that cannot 
do =value, we can either have a different fence implementation which uses the !=0 
approach.  Or change seqno-fence implementation later if needed.  But if someone has 
hw that can do !=0 but not =value, speak up now ;-)


Here! Radeon can only do =value on the DMA and 3D engine, but not with 
UVD or VCE. And for the 3D engine it means draining the pipe, which 
isn't really a good idea.


Christian.




Apart from that I still don't like the idea of leaking a drivers IRQ context
outside of the driver, but without a proper GPU scheduler there probably
isn't much alternative.

I guess it will be not uncommon scenario for gpu device to just need
to kick display device to write a few registers for a page flip..
probably best not to schedule a worker just for this (unless the
signalled device otherwise needs to).  I think it is better in this
case to give the signalee some rope to hang themselves, and make it
the responsibility of the callback to kick things off to a worker if
needed.

BR,
-R


Christian.


A software fallback still has to be provided in case the fence is used
with a device that doesn't support this mechanism. It is useful to expose
this for graphics cards that have an op to support this.

Some cards like i915 can export those, but don't have an option to wait,
so they need the software fallback.

I extended the original patch by Rob Clark.

v1: Original
v2: Renamed from bikeshed to seqno, moved into dma-fence.c since
  not much was left of the file. Lots of documentation added.
v3: Use fence_ops instead of custom callbacks. Moved to own file
  to avoid circular dependency between dma-buf.h and fence.h
v4: Add spinlock pointer to seqno_fence_init

Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
---
   Documentation/DocBook/device-drivers.tmpl |1
   drivers/base/fence.c  |   50 +
   include/linux/seqno-fence.h   |  109
+
   3 files changed, 160 insertions(+)
   create mode 100644 include/linux/seqno-fence.h

diff --git a/Documentation/DocBook/device-drivers.tmpl
b/Documentation/DocBook/device-drivers.tmpl
index 7a0c9ddb4818..8c85c20942c2 100644
--- a/Documentation/DocBook/device-drivers.tmpl
+++ b/Documentation/DocBook/device-drivers.tmpl
@@ -131,6 +131,7 @@ X!Edrivers/base/interface.c
   !Edrivers/base/dma-buf.c
   !Edrivers/base/fence.c
   !Iinclude/linux/fence.h
+!Iinclude/linux/seqno-fence.h
   !Edrivers/base/reservation.c
   !Iinclude/linux/reservation.h
   !Edrivers/base/dma-coherent.c
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 12df2bf62034..cd0937127a89 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -25,6 +25,7 @@
   #include linux/export.h
   #include linux/atomic.h
   #include linux/fence.h
+#include linux/seqno-fence.h
 #define CREATE_TRACE_POINTS
   #include trace/events/fence.h
@@ -413,3 +414,52 @@ __fence_init(struct fence *fence, const struct
fence_ops *ops,
 trace_fence_init(fence);
   }
   EXPORT_SYMBOL(__fence_init);
+
+static const char *seqno_fence_get_driver_name(struct fence *fence) {
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-get_driver_name(fence);
+}
+
+static const char *seqno_fence_get_timeline_name(struct fence *fence) {
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-get_timeline_name(fence);
+}
+
+static bool seqno_enable_signaling(struct fence *fence)
+{
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-enable_signaling(fence);
+}
+
+static bool seqno_signaled(struct fence *fence)
+{
+   struct seqno_fence *seqno_fence = to_seqno_fence(fence);
+   return seqno_fence-ops-signaled 
seqno_fence-ops-signaled(fence);
+}
+
+static void seqno_release(struct fence *fence)
+{
+   struct seqno_fence *f = to_seqno_fence(fence);
+
+   dma_buf_put(f-sync_buf);
+   if (f-ops-release)
+   f-ops-release(fence);
+   else
+   kfree(f);
+}
+
+static long 

Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-17 Thread Rob Clark
On Mon, Feb 17, 2014 at 12:36 PM, Christian König
deathsim...@vodafone.de wrote:
 Am 17.02.2014 18:27, schrieb Rob Clark:

 On Mon, Feb 17, 2014 at 11:56 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 17.02.2014 16:56, schrieb Maarten Lankhorst:

 This type of fence can be used with hardware synchronization for simple
 hardware that can block execution until the condition
 (dma_buf[offset] - value) = 0 has been met.


 Can't we make that just dma_buf[offset] != 0 instead? As far as I know
 this way it would match the definition M$ uses in their WDDM
 specification
 and so make it much more likely that hardware supports it.

 well 'buf[offset] = value' at least means the same slot can be used
 for multiple operations (with increasing values of 'value').. not sure
 if that is something people care about.

 =value seems to be possible with adreno and radeon.  I'm not really sure
 about others (although I presume it as least supported for nv desktop
 stuff).  For hw that cannot do =value, we can either have a different fence
 implementation which uses the !=0 approach.  Or change seqno-fence
 implementation later if needed.  But if someone has hw that can do !=0 but
 not =value, speak up now ;-)


 Here! Radeon can only do =value on the DMA and 3D engine, but not with UVD
 or VCE. And for the 3D engine it means draining the pipe, which isn't really
 a good idea.

hmm, ok.. forgot you have a few extra rings compared to me.  Is UVD
re-ordering from decode-order to display-order for you in hw?  If not,
I guess you need sw intervention anyways when a frame is done for
frame re-ordering, so maybe hw-hw sync doesn't really matter as much
as compared to gpu/3d-display.  For dma-3d interactions, seems like
you would care more about hw-hw sync, but I guess you aren't likely
to use GPU A to do a resolve blit for GPU B..

For 3D ring, I assume you probably want a CP_WAIT_FOR_IDLE before a
CP_MEM_WRITE to update fence value in memory (for the one signalling
the fence).  But why would you need that before a CP_WAIT_REG_MEM (for
the one waiting for the fence)?  I don't exactly have documentation
for adreno version of CP_WAIT_REG_{MEM,EQ,GTE}..  but PFP and ME
appear to be same instruction set as r600, so I'm pretty sure they
should have similar capabilities.. CP_WAIT_REG_MEM appears to be same
but with 32bit gpu addresses vs 64b.

BR,
-R

 Christian.



 Apart from that I still don't like the idea of leaking a drivers IRQ
 context
 outside of the driver, but without a proper GPU scheduler there probably
 isn't much alternative.

 I guess it will be not uncommon scenario for gpu device to just need
 to kick display device to write a few registers for a page flip..
 probably best not to schedule a worker just for this (unless the
 signalled device otherwise needs to).  I think it is better in this
 case to give the signalee some rope to hang themselves, and make it
 the responsibility of the callback to kick things off to a worker if
 needed.

 BR,
 -R

 Christian.

 A software fallback still has to be provided in case the fence is used
 with a device that doesn't support this mechanism. It is useful to
 expose
 this for graphics cards that have an op to support this.

 Some cards like i915 can export those, but don't have an option to wait,
 so they need the software fallback.

 I extended the original patch by Rob Clark.

 v1: Original
 v2: Renamed from bikeshed to seqno, moved into dma-fence.c since
   not much was left of the file. Lots of documentation added.
 v3: Use fence_ops instead of custom callbacks. Moved to own file
   to avoid circular dependency between dma-buf.h and fence.h
 v4: Add spinlock pointer to seqno_fence_init

 Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
 ---
Documentation/DocBook/device-drivers.tmpl |1
drivers/base/fence.c  |   50 +
include/linux/seqno-fence.h   |  109
 +
3 files changed, 160 insertions(+)
create mode 100644 include/linux/seqno-fence.h

 diff --git a/Documentation/DocBook/device-drivers.tmpl
 b/Documentation/DocBook/device-drivers.tmpl
 index 7a0c9ddb4818..8c85c20942c2 100644
 --- a/Documentation/DocBook/device-drivers.tmpl
 +++ b/Documentation/DocBook/device-drivers.tmpl
 @@ -131,6 +131,7 @@ X!Edrivers/base/interface.c
!Edrivers/base/dma-buf.c
!Edrivers/base/fence.c
!Iinclude/linux/fence.h
 +!Iinclude/linux/seqno-fence.h
!Edrivers/base/reservation.c
!Iinclude/linux/reservation.h
!Edrivers/base/dma-coherent.c
 diff --git a/drivers/base/fence.c b/drivers/base/fence.c
 index 12df2bf62034..cd0937127a89 100644
 --- a/drivers/base/fence.c
 +++ b/drivers/base/fence.c
 @@ -25,6 +25,7 @@
#include linux/export.h
#include linux/atomic.h
#include linux/fence.h
 +#include linux/seqno-fence.h
  #define CREATE_TRACE_POINTS
#include trace/events/fence.h
 @@ -413,3 +414,52 @@ __fence_init(struct fence *fence, const struct

Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)

2014-02-17 Thread Christian König

Am 17.02.2014 19:24, schrieb Rob Clark:

On Mon, Feb 17, 2014 at 12:36 PM, Christian König
deathsim...@vodafone.de wrote:

Am 17.02.2014 18:27, schrieb Rob Clark:


On Mon, Feb 17, 2014 at 11:56 AM, Christian König
deathsim...@vodafone.de wrote:

Am 17.02.2014 16:56, schrieb Maarten Lankhorst:


This type of fence can be used with hardware synchronization for simple
hardware that can block execution until the condition
(dma_buf[offset] - value) = 0 has been met.


Can't we make that just dma_buf[offset] != 0 instead? As far as I know
this way it would match the definition M$ uses in their WDDM
specification
and so make it much more likely that hardware supports it.

well 'buf[offset] = value' at least means the same slot can be used
for multiple operations (with increasing values of 'value').. not sure
if that is something people care about.


=value seems to be possible with adreno and radeon.  I'm not really sure
about others (although I presume it as least supported for nv desktop
stuff).  For hw that cannot do =value, we can either have a different fence
implementation which uses the !=0 approach.  Or change seqno-fence
implementation later if needed.  But if someone has hw that can do !=0 but
not =value, speak up now ;-)


Here! Radeon can only do =value on the DMA and 3D engine, but not with UVD
or VCE. And for the 3D engine it means draining the pipe, which isn't really
a good idea.

hmm, ok.. forgot you have a few extra rings compared to me.  Is UVD
re-ordering from decode-order to display-order for you in hw?  If not,
I guess you need sw intervention anyways when a frame is done for
frame re-ordering, so maybe hw-hw sync doesn't really matter as much
as compared to gpu/3d-display.  For dma-3d interactions, seems like
you would care more about hw-hw sync, but I guess you aren't likely
to use GPU A to do a resolve blit for GPU B..


No UVD isn't reordering, but since frame reordering is predictable you 
usually end up with pipelining everything to the hardware. E.g. you send 
the decode commands in decode order to the UVD block and if you have 
overlay active one of the frames are going to be the first to display 
and then you want to wait for it on the display side.



For 3D ring, I assume you probably want a CP_WAIT_FOR_IDLE before a
CP_MEM_WRITE to update fence value in memory (for the one signalling
the fence).  But why would you need that before a CP_WAIT_REG_MEM (for
the one waiting for the fence)?  I don't exactly have documentation
for adreno version of CP_WAIT_REG_{MEM,EQ,GTE}..  but PFP and ME
appear to be same instruction set as r600, so I'm pretty sure they
should have similar capabilities.. CP_WAIT_REG_MEM appears to be same
but with 32bit gpu addresses vs 64b.


You shouldn't use any of the CP commands for engine synchronization 
(neither for wait nor for signal). The PFP and ME are just the top of a 
quite deep pipeline and when you use any of the CP_WAIT functions you 
block them for something and that's draining the pipeline.


With the semaphore and fence commands the values are just attached as 
prerequisite to the draw command, e.g. the CP setups the draw 
environment and issues the command, but the actual execution of it is 
delayed until the != 0 condition hits. And in the meantime the CP 
already prepares the next draw operation.


But at least for compute queues wait semaphore aren't the perfect 
solution either. What you need then is a GPU scheduler that uses a 
kernel task for setting up the command submission for you when all 
prerequisites are meet.


Christian.


BR,
-R


Christian.



Apart from that I still don't like the idea of leaking a drivers IRQ
context
outside of the driver, but without a proper GPU scheduler there probably
isn't much alternative.

I guess it will be not uncommon scenario for gpu device to just need
to kick display device to write a few registers for a page flip..
probably best not to schedule a worker just for this (unless the
signalled device otherwise needs to).  I think it is better in this
case to give the signalee some rope to hang themselves, and make it
the responsibility of the callback to kick things off to a worker if
needed.

BR,
-R


Christian.


A software fallback still has to be provided in case the fence is used
with a device that doesn't support this mechanism. It is useful to
expose
this for graphics cards that have an op to support this.

Some cards like i915 can export those, but don't have an option to wait,
so they need the software fallback.

I extended the original patch by Rob Clark.

v1: Original
v2: Renamed from bikeshed to seqno, moved into dma-fence.c since
   not much was left of the file. Lots of documentation added.
v3: Use fence_ops instead of custom callbacks. Moved to own file
   to avoid circular dependency between dma-buf.h and fence.h
v4: Add spinlock pointer to seqno_fence_init

Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
---
Documentation/DocBook/device-drivers.tmpl |