Re: [PATCH] dma-fence: Make dma_fence_add_callback() fail if signaled with error

2018-05-16 Thread Daniel Vetter
On Tue, May 15, 2018 at 01:16:30PM +0100, Chris Wilson wrote:
> Quoting Ezequiel Garcia (2018-05-14 22:28:31)
> > On Mon, 2018-05-14 at 18:48 +0200, Daniel Vetter wrote:
> > > On Fri, May 11, 2018 at 08:27:41AM +0100, Chris Wilson wrote:
> > > > Quoting Ezequiel Garcia (2018-05-09 21:14:49)
> > > > > Change how dma_fence_add_callback() behaves when the fence
> > > > > has error-signaled by the time it is being added. After this commit,
> > > > > dma_fence_add_callback() returns the fence error, if it
> > > > > has error-signaled before dma_fence_add_callback() is called.
> > > > 
> > > > Why? What problem are you trying to solve? fence->error does not imply
> > > > that the fence has yet been signaled, and the caller wants a callback
> > > > when it is signaled.
> > > 
> > > On top of that this is inconsistent, e.g. we don't do the same for any of
> > > the other dma_fence interfaces. Plus there's the issue that you might
> > > alias errno values with fence errno values.
> > > 
> > 
> > Right.
> > 
> > > I think keeping the error codes from the functions you're calling distinct
> > > from the error code of the fence itself makes a lot of sense. The first
> > > tells you whether your request worked out (or why not), the second tells
> > > you whether the asynchronous dma operation (gpu rendering, page flip,
> > > whatever) that the dma_fence represents worked out (or why not). That's 2
> > > distinct things imo.
> > > 
> > > Might be good to show us the driver code that needs this behaviour so we
> > > can discuss how to best handle your use-case.
> > > 
> > 
> > This change arose while discussing the in-fences support for video4linux.
> > Here's the patch that calls dma_fence_add_callback():
> > https://lkml.org/lkml/2018/5/4/766
> > 
> > The code snippet currently looks something like this:
> > 
> > if (vb->in_fence) {
> >         ret = dma_fence_add_callback(vb->in_fence, &vb->fence_cb,
> >                                      vb2_qbuf_fence_cb);
> >         /* is the fence signaled? */
> >         if (ret == -ENOENT) {
> >                 dma_fence_put(vb->in_fence);
> >                 vb->in_fence = NULL;
> >         } else if (ret) {
> >                 goto unlock;
> >         }
> > }
> > 
> > In this use case, if the callback is added successfully,
> > the video4linux core defers the activation of the buffer
> > until the fence signals.
> > 
> > If the fence is signaled (currently disregarding errors)
> > then the buffer is assumed to be ready to be activated,
> > and so it gets queued for hardware usage.
> > 
> > Giving some more thought to this, I'm not so sure what the
> > right action is if a fence signaled with an error. In this case,
> > it appears to me that we shouldn't be using this buffer
> > if its in-fence is in error, but perhaps I'm missing
> > something.
> 
> What I have in mind for async errors is to skip the operation and
> propagate the error onto the next fence. Mostly because those async
> errors may include fatal errors such as unable to pin the backing
> storage for the operation, but even "trivial" errors such as an early
> operation failing means that this request is then subject to garbage-in,
> garbage-out. However, for trivial errors I would just propagate the
> error status (so the caller knows something went wrong if they care, but
> in all likelihood no one will notice) and continue on with the glitchy
> operation.

In general, there's not really any hard rule about propagating fence
errors across devices. It's mostly just used by drivers internally to keep
track of failed stuff (gpu hangs or anything else async like Chris
describes here).

For v4l I'm not sure you want to care much about this, since right now the
main use of fence errors is gpu hang recovery (whether it's the driver or
hw that's hung doesn't matter here).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
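
To make Chris's skip-and-propagate scheme above concrete, here is a minimal
sketch; my_run_request() and my_do_operation() are hypothetical, only
dma_fence_set_error() and dma_fence_signal() are existing API:

	/* If the input fence carries an error, skip the work but forward
	 * the error to the output fence, so downstream consumers can see
	 * the garbage-in/garbage-out status. */
	static void my_run_request(struct dma_fence *in, struct dma_fence *out)
	{
		int err = in ? in->error : 0;

		if (err)
			dma_fence_set_error(out, err); /* set before signaling */
		else
			my_do_operation(); /* hypothetical driver work */

		dma_fence_signal(out);
	}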


Re: [PATCH] dma-fence: Make dma_fence_add_callback() fail if signaled with error

2018-05-14 Thread Daniel Vetter
On Fri, May 11, 2018 at 08:27:41AM +0100, Chris Wilson wrote:
> Quoting Ezequiel Garcia (2018-05-09 21:14:49)
> > Change how dma_fence_add_callback() behaves when the fence
> > has error-signaled by the time it is being added. After this commit,
> > dma_fence_add_callback() returns the fence error, if it
> > has error-signaled before dma_fence_add_callback() is called.
> 
> Why? What problem are you trying to solve? fence->error does not imply
> that the fence has yet been signaled, and the caller wants a callback
> when it is signaled.

On top of that this is inconsistent, e.g. we don't do the same for any of the
other dma_fence interfaces. Plus there's the issue that you might alias errno
values with fence errno values.

I think keeping the error codes from the functions you're calling distinct
from the error code of the fence itself makes a lot of sense. The first
tells you whether your request worked out (or why not), the second tells
you whether the asynchronous dma operation (gpu rendering, page flip,
whatever) that the dma_fence represents worked out (or why not). That's 2
distinct things imo.

Might be good to show us the driver code that needs this behaviour so we
can discuss how to best handle your use-case.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
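
The distinction between the two error domains can be made concrete with a
short sketch; my_cb is hypothetical, while dma_fence_get_status() and
fence->error are existing API:

	struct dma_fence_cb cb;
	int ret = dma_fence_add_callback(fence, &cb, my_cb);

	if (ret == -ENOENT) {
		/* Request-level: fence already signaled, no callback added. */
		/* Fence-level: did the async operation itself succeed? */
		if (dma_fence_get_status(fence) < 0)
			pr_warn("async operation failed: %d\n", fence->error);
	}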


Re: [PATCH] dma-buf: Remove unneeded stubs around sync_debug interfaces

2018-05-07 Thread Daniel Vetter
On Fri, May 04, 2018 at 03:00:37PM -0300, Ezequiel Garcia wrote:
> The sync_debug.h header is internal, and only used by
> sw_sync.c. Therefore, SW_SYNC is always defined and there
> is no need for the stubs. Remove them and make the code
> simpler.
> 
> Signed-off-by: Ezequiel Garcia <ezequ...@collabora.com>

Applied, thanks.
-Daniel
> ---
>  drivers/dma-buf/sync_debug.h | 10 ----------
>  1 file changed, 10 deletions(-)
> 
> diff --git a/drivers/dma-buf/sync_debug.h b/drivers/dma-buf/sync_debug.h
> index d615a89f774c..05e33f937ad0 100644
> --- a/drivers/dma-buf/sync_debug.h
> +++ b/drivers/dma-buf/sync_debug.h
> @@ -62,8 +62,6 @@ struct sync_pt {
>   struct rb_node node;
>  };
>  
> -#ifdef CONFIG_SW_SYNC
> -
>  extern const struct file_operations sw_sync_debugfs_fops;
>  
>  void sync_timeline_debug_add(struct sync_timeline *obj);
> @@ -72,12 +70,4 @@ void sync_file_debug_add(struct sync_file *fence);
>  void sync_file_debug_remove(struct sync_file *fence);
>  void sync_dump(void);
>  
> -#else
> -# define sync_timeline_debug_add(obj)
> -# define sync_timeline_debug_remove(obj)
> -# define sync_file_debug_add(fence)
> -# define sync_file_debug_remove(fence)
> -# define sync_dump()
> -#endif
> -
>  #endif /* _LINUX_SYNC_H */
> -- 
> 2.16.3
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


[PATCH] dma-fence: Make ->enable_signaling optional

2018-05-04 Thread Daniel Vetter
Many drivers have a trivial implementation for ->enable_signaling.
Let's make it optional by assuming that signalling is already
available when the callback isn't present.

v2: Don't do the trick to set the ENABLE_SIGNAL_BIT
unconditionally; it results in an expensive spinlock take for
everyone. Instead just check if the callback is present. Suggested by
Maarten.

Also move misplaced kerneldoc hunk to the right patch.

Cc: Maarten Lankhorst <maarten.lankho...@linux.intel.com>
Reviewed-by: Christian König <christian.koe...@amd.com> (v1)
Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 drivers/dma-buf/dma-fence.c | 9 +++++----
 include/linux/dma-fence.h   | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 4edb9fd3cf47..dd01a1720be9 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -200,7 +200,8 @@ void dma_fence_enable_sw_signaling(struct dma_fence *fence)
 
	if (!test_and_set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
			      &fence->flags) &&
-	    !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+	    !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags) &&
+	    fence->ops->enable_signaling) {
trace_dma_fence_enable_signal(fence);
 
spin_lock_irqsave(fence->lock, flags);
@@ -260,7 +261,7 @@ int dma_fence_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
 
 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
 		ret = -ENOENT;
-   else if (!was_set) {
+   else if (!was_set && fence->ops->enable_signaling) {
trace_dma_fence_enable_signal(fence);
 
if (!fence->ops->enable_signaling(fence)) {
@@ -388,7 +389,7 @@ dma_fence_default_wait(struct dma_fence *fence, bool intr, signed long timeout)
 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
 		goto out;
 
-   if (!was_set) {
+   if (!was_set && fence->ops->enable_signaling) {
trace_dma_fence_enable_signal(fence);
 
if (!fence->ops->enable_signaling(fence)) {
@@ -560,7 +561,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
   spinlock_t *lock, u64 context, unsigned seqno)
 {
BUG_ON(!lock);
-	BUG_ON(!ops || !ops->wait || !ops->enable_signaling ||
-	       !ops->get_driver_name || !ops->get_timeline_name);
+	BUG_ON(!ops || !ops->wait ||
+	       !ops->get_driver_name || !ops->get_timeline_name);
 
 	kref_init(&fence->refcount);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 111aefe1c956..c053d19e1e24 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -166,7 +166,8 @@ struct dma_fence_ops {
 * released when the fence is signalled (through e.g. the interrupt
 * handler).
 *
-* This callback is mandatory.
+* This callback is optional. If this callback is not present, then the
+* driver must always have signaling enabled.
 */
bool (*enable_signaling)(struct dma_fence *fence);
 
-- 
2.17.0
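
For driver authors, the upshot is that an always-signaling fence can drop the
callback entirely. A hypothetical ops table after this patch (note ->wait is
still mandatory at this point in the series):

	static const struct dma_fence_ops my_fence_ops = {
		.get_driver_name   = my_fence_get_driver_name,
		.get_timeline_name = my_fence_get_timeline_name,
		/* no .enable_signaling: signaling is assumed always enabled,
		 * e.g. dma_fence_signal() called directly from the driver's
		 * IRQ handler */
		.wait              = dma_fence_default_wait,
	};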



[PATCH] dma-fence: Polish kernel-doc for dma-fence.c

2018-05-04 Thread Daniel Vetter
- Intro section that links to how this is exposed to userspace.
- Lots more hyperlinks.
- Minor clarifications and style polish

v2: Add misplaced hunk of kerneldoc from a different patch.

Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 Documentation/driver-api/dma-buf.rst |   6 ++
 drivers/dma-buf/dma-fence.c  | 147 ++++++++++++++++++++++++++++++++++---------------
 2 files changed, 109 insertions(+), 44 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index dc384f2f7f34..b541e97c7ab1 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -130,6 +130,12 @@ Reservation Objects
 DMA Fences
 ----------
 
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: DMA fences overview
+
+DMA Fences Functions Reference
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
:export:
 
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 7a92f85a4cec..1551ca7df394 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -38,12 +38,43 @@ EXPORT_TRACEPOINT_SYMBOL(dma_fence_enable_signal);
  */
 static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(0);
 
+/**
+ * DOC: DMA fences overview
+ *
+ * DMA fences, represented by &struct dma_fence, are the kernel internal
+ * synchronization primitive for DMA operations like GPU rendering, video
+ * encoding/decoding, or displaying buffers on a screen.
+ *
+ * A fence is initialized using dma_fence_init() and completed using
+ * dma_fence_signal(). Fences are associated with a context, allocated through
+ * dma_fence_context_alloc(), and all fences on the same context are
+ * fully ordered.
+ *
+ * Since the purpose of fences is to facilitate cross-device and
+ * cross-application synchronization, there are multiple ways to use one:
+ *
+ * - Individual fences can be exposed as a &sync_file, accessed as a file
+ *   descriptor from userspace, created by calling sync_file_create(). This is
+ *   called explicit fencing, since userspace passes around explicit
+ *   synchronization points.
+ *
+ * - Some subsystems also have their own explicit fencing primitives, like
+ *   &drm_syncobj. Compared to &sync_file, a &drm_syncobj allows the underlying
+ *   fence to be updated.
+ *
+ * - Then there's also implicit fencing, where the synchronization points are
+ *   implicitly passed around as part of shared &dma_buf instances. Such
+ *   implicit fences are stored in &struct reservation_object through the
+ *   &dma_buf.resv pointer.
+ */
+
 /**
  * dma_fence_context_alloc - allocate an array of fence contexts
- * @num:   [in]amount of contexts to allocate
+ * @num: amount of contexts to allocate
  *
- * This function will return the first index of the number of fences allocated.
- * The fence context is used for setting fence->context to a unique number.
+ * This function will return the first index of the number of fence contexts
+ * allocated.  The fence context is used for setting &dma_fence.context to a
+ * unique number by passing the context to dma_fence_init().
  */
 u64 dma_fence_context_alloc(unsigned num)
 {
@@ -59,10 +90,14 @@ EXPORT_SYMBOL(dma_fence_context_alloc);
  * Signal completion for software callbacks on a fence, this will unblock
  * dma_fence_wait() calls and run all the callbacks added with
  * dma_fence_add_callback(). Can be called multiple times, but since a fence
- * can only go from unsignaled to signaled state, it will only be effective
- * the first time.
+ * can only go from the unsignaled to the signaled state and not back, it will
+ * only be effective the first time.
+ *
+ * Unlike dma_fence_signal(), this function must be called with &dma_fence.lock
+ * held.
  *
- * Unlike dma_fence_signal, this function must be called with fence->lock held.
+ * Returns 0 on success and a negative error value when @fence has been
+ * signalled already.
  */
 int dma_fence_signal_locked(struct dma_fence *fence)
 {
@@ -102,8 +137,11 @@ EXPORT_SYMBOL(dma_fence_signal_locked);
  * Signal completion for software callbacks on a fence, this will unblock
  * dma_fence_wait() calls and run all the callbacks added with
  * dma_fence_add_callback(). Can be called multiple times, but since a fence
- * can only go from unsignaled to signaled state, it will only be effective
- * the first time.
+ * can only go from the unsignaled to the signaled state and not back, it will
+ * only be effective the first time.
+ *
+ * Returns 0 on success and a negative error value when @fence has been
+ * signalled already.
  */
 int dma_fence_signal(struct dma_fence *fence)
 {
@@ -136,9 +174,9 @@ EXPORT_SYMBOL(dma_fence_signal);
 /**
  * dma_fence_wait_timeout - sleep until the fence gets signaled
  * or until timeout elapses
- * @fence: [in]the fence to wait on
- * @intr: 
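
As an aside, the "explicit fencing" path described in the overview above
boils down to a handful of calls. A hedged sketch (the function name and
error handling are illustrative only; sync_file_create(),
get_unused_fd_flags() and fd_install() are existing API):

	int export_fence_to_fd(struct dma_fence *fence)
	{
		struct sync_file *sync = sync_file_create(fence);
		int fd;

		if (!sync)
			return -ENOMEM;

		fd = get_unused_fd_flags(O_CLOEXEC);
		if (fd < 0) {
			fput(sync->file);
			return fd;
		}

		/* userspace can now poll/merge/pass this fd around */
		fd_install(fd, sync->file);
		return fd;
	}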

Re: [PATCH 04/15] dma-fence: Make ->wait callback optional

2018-05-04 Thread Daniel Vetter
On Fri, May 04, 2018 at 03:17:08PM +0200, Christian König wrote:
> Am 04.05.2018 um 11:25 schrieb Daniel Vetter:
> > On Fri, May 4, 2018 at 11:16 AM, Chris Wilson <ch...@chris-wilson.co.uk> 
> > wrote:
> > > Quoting Daniel Vetter (2018-05-04 09:57:59)
> > > > On Fri, May 04, 2018 at 09:31:33AM +0100, Chris Wilson wrote:
> > > > > Quoting Daniel Vetter (2018-05-04 09:23:01)
> > > > > > On Fri, May 04, 2018 at 10:17:22AM +0200, Daniel Vetter wrote:
> > > > > > > On Fri, May 04, 2018 at 09:09:10AM +0100, Chris Wilson wrote:
> > > > > > > > Quoting Daniel Vetter (2018-05-03 15:25:52)
> > > > > > > > > Almost everyone uses dma_fence_default_wait.
> > > > > > > > > 
> > > > > > > > > v2: Also remove the BUG_ON(!ops->wait) (Chris).
> > > > > > > > I just don't get the rationale for implicit over explicit.
> > > > > > > Closer approximation of dwim semantics. There's been tons of 
> > > > > > > patch series
> > > > > > > all over drm and related places to get there, once we have a big 
> > > > > > > pile of
> > > > > > > implementations and know what the dwim semantics should be. 
> > > > > > > Individually
> > > > > > > they're all not much, in aggregate they substantially simplify 
> > > > > > > simple
> > > > > > > drivers.
> > > > > > I also think clearer separation between optional optimization hooks 
> > > > > > and
> > > > > > mandatory core parts is useful in itself.
> > > > > A new spelling of midlayer ;) I don't see the contradiction with a
> > > > > driver saying use the default and simplicity. (I know which one the
> > > > > compiler thinks is simpler ;)
> > > > If the compiler overhead is real then I guess it would make sense to be
> > > > explicit. I don't expect that to be a problem though for a blocking
> > > > function.
> > > > 
> > > > I disagree on this being a midlayer - you can still overwrite everything
> > > > you please to. What it does help is people doing less copypasting (and
> > > > assorted bugs), at least in the grand scheme of things. And we do have a
> > > > _lot_ more random small drivers than just a few years ago. Reducing the
> > > > amount of explicit typing just to get default behaviour has been an
> > > > ongoing theme for a few years now, and your objection here is the first
> > > > suggesting that this is not a good idea. So I'm somewhat confused.
> > > I'm just saying I don't see any rationale for this patch.
> > > 
> > >  "Almost everyone uses dma_fence_default_wait."
> > > 
> > > Why change?
> > > 
> > > Making it look simpler on the surface, so that you don't have to think
> > > about things straight away? I understand the appeal, but I do worry
> > > about it just being an illusion. (Cutting and pasting a line saying
> > > .wait = default_wait, doesn't feel that onerous, as you likely cut and
> > > paste the ops anyway, and at the very least you are reminded about some
> > > of the interactions. You could even have default initializers and/or
> > > magic macros to hide the cut and paste; maybe a simple_dma_fence [now
> > > that's a midlayer!] but I haven't looked.)
> > In really monolithic vtables like drm_driver we do use default
> > function macros, so you type 1 line, get them all. But dma_fence_ops
> > is pretty small, and most drivers only implement a few callbacks. Also
> > note that e.g. the ->release callback already works like that, so this
> > pattern is there already. I simply extended it to ->wait and
> > ->enable_signaling. Also note that I leave the EXPORT_SYMBOL in place,
> > you can still wrap dma_fence_default_wait if you wish to do so.
> > 
> > But I just realized that I didn't clean out the optional release
> > hooks, I guess I should do that too (for the few cases it's not yet
> > done) and respin.
> 
> I kind of agree with Chris here, but also see the practical problem to copy
> the default function in all the implementations.
> 
> We had the same problem in TTM and I also don't really like the result to
> always have that "if (some_callback) default(); else some_callback();".
> 
> Might be that the run time overhead is negligible, but it doesn't feels
> right from the coding style perspective.

Hm, maybe I've seen too much bad code, but the modeset helpers are chock-full
of exactly that pattern. It's imo also a trade-off. If you have a fairly
specialized library like ttm that's used by relatively few things, doing
everything explicitly is probably better. It's also where kms started out
from.

But if you have a huge pile of fairly simple drivers, imo the balance
starts to tip the other way, and a bit of additional logic in the shared
code to make all the implementations a notch simpler is good. If we
hadn't acquired quite a pile of dma_fence implementations I
wouldn't have bothered with all this.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
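
For reference, the kind of default-initializer macro Chris floats above might
look roughly like this; entirely hypothetical, not something in this series:

	/* Keeps the default explicit in each driver while avoiding
	 * copy-paste drift. */
	#define DMA_FENCE_DEFAULT_OPS \
		.wait = dma_fence_default_wait

	static const struct dma_fence_ops my_fence_ops = {
		.get_driver_name   = my_get_driver_name,
		.get_timeline_name = my_get_timeline_name,
		DMA_FENCE_DEFAULT_OPS,
	};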


Re: [PATCH 04/15] dma-fence: Make ->wait callback optional

2018-05-04 Thread Daniel Vetter
On Fri, May 4, 2018 at 11:16 AM, Chris Wilson <ch...@chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2018-05-04 09:57:59)
>> On Fri, May 04, 2018 at 09:31:33AM +0100, Chris Wilson wrote:
>> > Quoting Daniel Vetter (2018-05-04 09:23:01)
>> > > On Fri, May 04, 2018 at 10:17:22AM +0200, Daniel Vetter wrote:
>> > > > On Fri, May 04, 2018 at 09:09:10AM +0100, Chris Wilson wrote:
>> > > > > Quoting Daniel Vetter (2018-05-03 15:25:52)
>> > > > > > Almost everyone uses dma_fence_default_wait.
>> > > > > >
>> > > > > > v2: Also remove the BUG_ON(!ops->wait) (Chris).
>> > > > >
>> > > > > I just don't get the rationale for implicit over explicit.
>> > > >
>> > > > Closer approximation of dwim semantics. There's been tons of patch 
>> > > > series
>> > > > all over drm and related places to get there, once we have a big pile 
>> > > > of
>> > > > implementations and know what the dwim semantics should be. 
>> > > > Individually
>> > > > they're all not much, in aggregate they substantially simplify simple
>> > > > drivers.
>> > >
>> > > I also think clearer separation between optional optimization hooks and
>> > > mandatory core parts is useful in itself.
>> >
>> > A new spelling of midlayer ;) I don't see the contradiction with a
>> > driver saying use the default and simplicity. (I know which one the
>> > compiler thinks is simpler ;)
>>
>> If the compiler overhead is real then I guess it would make sense to be
>> explicit. I don't expect that to be a problem though for a blocking
>> function.
>>
>> I disagree on this being a midlayer - you can still overwrite everything
>> you please to. What it does help is people doing less copypasting (and
>> assorted bugs), at least in the grand scheme of things. And we do have a
>> _lot_ more random small drivers than just a few years ago. Reducing the
>> amount of explicit typing just to get default behaviour has been an
>> ongoing theme for a few years now, and your objection here is the first
>> suggesting that this is not a good idea. So I'm somewhat confused.
>
> I'm just saying I don't see any rationale for this patch.
>
> "Almost everyone uses dma_fence_default_wait."
>
> Why change?
>
> Making it look simpler on the surface, so that you don't have to think
> about things straight away? I understand the appeal, but I do worry
> about it just being an illusion. (Cutting and pasting a line saying
> .wait = default_wait, doesn't feel that onerous, as you likely cut and
> paste the ops anyway, and at the very least you are reminded about some
> of the interactions. You could even have default initializers and/or
> magic macros to hide the cut and paste; maybe a simple_dma_fence [now
> that's a midlayer!] but I haven't looked.)

In really monolithic vtables like drm_driver we do use default
function macros, so you type 1 line, get them all. But dma_fence_ops
is pretty small, and most drivers only implement a few callbacks. Also
note that e.g. the ->release callback already works like that, so this
pattern is there already. I simply extended it to ->wait and
->enable_signaling. Also note that I leave the EXPORT_SYMBOL in place,
you can still wrap dma_fence_default_wait if you wish to do so.

But I just realized that I didn't clean out the optional release
hooks, I guess I should do that too (for the few cases it's not yet
done) and respin.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [PATCH 04/15] dma-fence: Make ->wait callback optional

2018-05-04 Thread Daniel Vetter
On Fri, May 04, 2018 at 09:31:33AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2018-05-04 09:23:01)
> > On Fri, May 04, 2018 at 10:17:22AM +0200, Daniel Vetter wrote:
> > > On Fri, May 04, 2018 at 09:09:10AM +0100, Chris Wilson wrote:
> > > > Quoting Daniel Vetter (2018-05-03 15:25:52)
> > > > > Almost everyone uses dma_fence_default_wait.
> > > > > 
> > > > > v2: Also remove the BUG_ON(!ops->wait) (Chris).
> > > > 
> > > > I just don't get the rationale for implicit over explicit.
> > > 
> > > Closer approximation of dwim semantics. There's been tons of patch series
> > > all over drm and related places to get there, once we have a big pile of
> > > implementations and know what the dwim semantics should be. Individually
> > > they're all not much, in aggregate they substantially simplify simple
> > > drivers.
> > 
> > I also think clearer separation between optional optimization hooks and
> > mandatory core parts is useful in itself.
> 
> A new spelling of midlayer ;) I don't see the contradiction with a
> driver saying use the default and simplicity. (I know which one the
> compiler thinks is simpler ;)

If the compiler overhead is real then I guess it would make sense to be
explicit. I don't expect that to be a problem though for a blocking
function.

I disagree on this being a midlayer - you can still overwrite everything
you please to. What it does help is people doing less copypasting (and
assorted bugs), at least in the grand scheme of things. And we do have a
_lot_ more random small drivers than just a few years ago. Reducing the
amount of explicit typing just to get default behaviour has been an
ongoing theme for a few years now, and your objection here is the first
suggesting that this is not a good idea. So I'm somewhat confused.

It's ofc not all that useful when looking only through the i915
perspective, where we overwrite almost everything anyway. But the
ecosystem is a bit bigger than just i915.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 04/15] dma-fence: Make ->wait callback optional

2018-05-04 Thread Daniel Vetter
On Fri, May 04, 2018 at 10:17:22AM +0200, Daniel Vetter wrote:
> On Fri, May 04, 2018 at 09:09:10AM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2018-05-03 15:25:52)
> > > Almost everyone uses dma_fence_default_wait.
> > > 
> > > v2: Also remove the BUG_ON(!ops->wait) (Chris).
> > 
> > I just don't get the rationale for implicit over explicit.
> 
> Closer approximation of dwim semantics. There's been tons of patch series
> all over drm and related places to get there, once we have a big pile of
> implementations and know what the dwim semantics should be. Individually
> they're all not much, in aggregate they substantially simplify simple
> drivers.

I also think clearer separation between optional optimization hooks and
mandatory core parts is useful in itself.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 04/15] dma-fence: Make ->wait callback optional

2018-05-04 Thread Daniel Vetter
On Fri, May 04, 2018 at 09:09:10AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2018-05-03 15:25:52)
> > Almost everyone uses dma_fence_default_wait.
> > 
> > v2: Also remove the BUG_ON(!ops->wait) (Chris).
> 
> I just don't get the rationale for implicit over explicit.

Closer approximation of dwim semantics. There's been tons of patch series
all over drm and related places to get there, once we have a big pile of
implementations and know what the dwim semantics should be. Individually
they're all not much, in aggregate they substantially simplify simple
drivers.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


[PATCH 04/15] dma-fence: Make ->wait callback optional

2018-05-03 Thread Daniel Vetter
Almost everyone uses dma_fence_default_wait.

v2: Also remove the BUG_ON(!ops->wait) (Chris).

Reviewed-by: Christian König <christian.koe...@amd.com> (v1)
Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: Chris Wilson <ch...@chris-wilson.co.uk>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 drivers/dma-buf/dma-fence-array.c |  1 -
 drivers/dma-buf/dma-fence.c   |  8 +++++---
 drivers/dma-buf/sw_sync.c |  1 -
 include/linux/dma-fence.h | 13 ++++++++-----
 4 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
index dd1edfb27b61..a8c254497251 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -104,7 +104,6 @@ const struct dma_fence_ops dma_fence_array_ops = {
.get_timeline_name = dma_fence_array_get_timeline_name,
.enable_signaling = dma_fence_array_enable_signaling,
.signaled = dma_fence_array_signaled,
-   .wait = dma_fence_default_wait,
.release = dma_fence_array_release,
 };
 EXPORT_SYMBOL(dma_fence_array_ops);
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 59049375bd19..41ec19c9efc7 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -158,7 +158,10 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
return -EINVAL;
 
trace_dma_fence_wait_start(fence);
-   ret = fence->ops->wait(fence, intr, timeout);
+   if (fence->ops->wait)
+   ret = fence->ops->wait(fence, intr, timeout);
+   else
+   ret = dma_fence_default_wait(fence, intr, timeout);
trace_dma_fence_wait_end(fence);
return ret;
 }
@@ -562,8 +565,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
   spinlock_t *lock, u64 context, unsigned seqno)
 {
BUG_ON(!lock);
-	BUG_ON(!ops || !ops->wait ||
-	       !ops->get_driver_name || !ops->get_timeline_name);
+	BUG_ON(!ops || !ops->get_driver_name || !ops->get_timeline_name);
 
 	kref_init(&fence->refcount);
fence->ops = ops;
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 3d78ca89a605..53c1d6d36a64 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -188,7 +188,6 @@ static const struct dma_fence_ops timeline_fence_ops = {
.get_timeline_name = timeline_fence_get_timeline_name,
.enable_signaling = timeline_fence_enable_signaling,
.signaled = timeline_fence_signaled,
-   .wait = dma_fence_default_wait,
.release = timeline_fence_release,
.fence_value_str = timeline_fence_value_str,
.timeline_value_str = timeline_fence_timeline_value_str,
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index c053d19e1e24..02dba8cd033d 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -191,11 +191,14 @@ struct dma_fence_ops {
/**
 * @wait:
 *
-	 * Custom wait implementation, or dma_fence_default_wait.
+	 * Custom wait implementation, defaults to dma_fence_default_wait() if
+	 * not set.
 	 *
-	 * Must not be NULL, set to dma_fence_default_wait for default implementation.
-	 * the dma_fence_default_wait implementation should work for any fence, as long
-	 * as enable_signaling works correctly.
+	 * The dma_fence_default_wait implementation should work for any fence, as long
+	 * as @enable_signaling works correctly. This hook allows drivers to
+	 * have an optimized version for the case where a process context is
+	 * already available, e.g. if @enable_signaling for the general case
+	 * needs to set up a worker thread.
 	 *
 	 * Must return -ERESTARTSYS if the wait is intr = true and the wait was
 	 * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
@@ -203,7 +206,7 @@ struct dma_fence_ops {
 	 * which should be treated as if the fence is signaled. For example a hardware
 	 * lockup could be reported like that.
 	 *
-	 * This callback is mandatory.
+	 * This callback is optional.
 */
signed long (*wait)(struct dma_fence *fence,
bool intr, signed long timeout);
-- 
2.17.0
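
The caller-visible contract of dma_fence_wait_timeout() is unchanged by this
patch; a usage sketch for reference (the wrapper name is hypothetical):

	static int my_wait_100ms(struct dma_fence *fence)
	{
		/* Wait interruptibly, up to 100ms, on the fence. */
		signed long ret = dma_fence_wait_timeout(fence, true,
							 msecs_to_jiffies(100));
		if (ret < 0)
			return ret;		/* e.g. -ERESTARTSYS */
		if (ret == 0)
			return -ETIMEDOUT;	/* timed out, not signaled */
		return 0;	/* signaled; ret was the remaining jiffies */
	}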



[PATCH 03/15] dma-fence: Allow wait_any_timeout for all fences

2018-05-03 Thread Daniel Vetter
When this was introduced in

commit a519435a96597d8cd96123246fea4ae5a6c90b02
Author: Christian König <christian.koe...@amd.com>
Date:   Tue Oct 20 16:34:16 2015 +0200

dma-buf/fence: add fence_wait_any_timeout function v2

there was a restriction added that this only works if the dma-fence
uses the dma_fence_default_wait hook. Which works for amdgpu, which is
the only caller. Well, until you share some buffers with e.g. i915,
then you get an -EINVAL.

But there's really no reason for this, because all drivers must
support callbacks. The special ->wait hook is there only as an
optimization; if the driver needs to create a worker thread for an
active callback, then it can avoid doing that if it knows that there's
a process context available already. So ->wait is just an
optimization; just using the logic in dma_fence_default_wait() should
work for all drivers.

Let's remove this restriction.

Reviewed-by: Christian König <christian.koe...@amd.com>
Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: Christian König <christian.koe...@amd.com>
Cc: Alex Deucher <alexander.deuc...@amd.com>
---
 drivers/dma-buf/dma-fence.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 7b5b40d6b70e..59049375bd19 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -503,11 +503,6 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
for (i = 0; i < count; ++i) {
struct dma_fence *fence = fences[i];
 
-   if (fence->ops->wait != dma_fence_default_wait) {
-   ret = -EINVAL;
-   goto fence_rm_cb;
-   }
-
cb[i].task = current;
		if (dma_fence_add_callback(fence, &cb[i].base,
					   dma_fence_default_wait_cb)) {
-- 
2.17.0
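
With the restriction gone, a mixed set of fences (say, one amdgpu and one
i915) goes through the generic callback path. A usage sketch, assuming the
signature of this era with the optional out-parameter reporting which fence
signaled first:

	uint32_t first;
	signed long ret = dma_fence_wait_any_timeout(fences, count, true,
						     MAX_SCHEDULE_TIMEOUT,
						     &first);
	if (ret > 0)
		pr_debug("fences[%u] signaled first\n", first);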



[PATCH 15/15] dma-fence: Polish kernel-doc for dma-fence.c

2018-05-03 Thread Daniel Vetter
- Intro section that links to how this is exposed to userspace.
- Lots more hyperlinks.
- Minor clarifications and style polish

Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 Documentation/driver-api/dma-buf.rst |   6 ++
 drivers/dma-buf/dma-fence.c  | 140 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 102 insertions(+), 44 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index dc384f2f7f34..b541e97c7ab1 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -130,6 +130,12 @@ Reservation Objects
 DMA Fences
 ----------
 
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: DMA fences overview
+
+DMA Fences Functions Reference
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
:export:
 
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 41ec19c9efc7..0387c6a59055 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -38,12 +38,43 @@ EXPORT_TRACEPOINT_SYMBOL(dma_fence_enable_signal);
  */
 static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(0);
 
+/**
+ * DOC: DMA fences overview
+ *
+ * DMA fences, represented by &struct dma_fence, are the kernel internal
+ * synchronization primitive for DMA operations like GPU rendering, video
+ * encoding/decoding, or displaying buffers on a screen.
+ *
+ * A fence is initialized using dma_fence_init() and completed using
+ * dma_fence_signal(). Fences are associated with a context, allocated through
+ * dma_fence_context_alloc(), and all fences on the same context are
+ * fully ordered.
+ *
+ * Since the purpose of fences is to facilitate cross-device and
+ * cross-application synchronization, there are multiple ways to use one:
+ *
+ * - Individual fences can be exposed as a &sync_file, accessed as a file
+ *   descriptor from userspace, created by calling sync_file_create(). This is
+ *   called explicit fencing, since userspace passes around explicit
+ *   synchronization points.
+ *
+ * - Some subsystems also have their own explicit fencing primitives, like
+ *   &drm_syncobj. Compared to &sync_file, a &drm_syncobj allows the underlying
+ *   fence to be updated.
+ *
+ * - Then there's also implicit fencing, where the synchronization points are
+ *   implicitly passed around as part of shared &dma_buf instances. Such
+ *   implicit fences are stored in &struct reservation_object through the
+ *   &dma_buf.resv pointer.
+ */
+
 /**
  * dma_fence_context_alloc - allocate an array of fence contexts
- * @num:   [in]amount of contexts to allocate
+ * @num: amount of contexts to allocate
  *
- * This function will return the first index of the number of fences allocated.
- * The fence context is used for setting fence->context to a unique number.
+ * This function will return the first index of the number of fence contexts
+ * allocated.  The fence context is used for setting &dma_fence.context to a
+ * unique number by passing the context to dma_fence_init().
  */
 u64 dma_fence_context_alloc(unsigned num)
 {
@@ -59,10 +90,14 @@ EXPORT_SYMBOL(dma_fence_context_alloc);
  * Signal completion for software callbacks on a fence, this will unblock
  * dma_fence_wait() calls and run all the callbacks added with
  * dma_fence_add_callback(). Can be called multiple times, but since a fence
- * can only go from unsignaled to signaled state, it will only be effective
- * the first time.
+ * can only go from the unsignaled to the signaled state and not back, it will
+ * only be effective the first time.
  *
- * Unlike dma_fence_signal, this function must be called with fence->lock held.
+ * Unlike dma_fence_signal(), this function must be called with &dma_fence.lock
+ * held.
+ *
+ * Returns 0 on success and a negative error value when @fence has been
+ * signalled already.
  */
 int dma_fence_signal_locked(struct dma_fence *fence)
 {
@@ -102,8 +137,11 @@ EXPORT_SYMBOL(dma_fence_signal_locked);
  * Signal completion for software callbacks on a fence, this will unblock
  * dma_fence_wait() calls and run all the callbacks added with
  * dma_fence_add_callback(). Can be called multiple times, but since a fence
- * can only go from unsignaled to signaled state, it will only be effective
- * the first time.
+ * can only go from the unsignaled to the signaled state and not back, it will
+ * only be effective the first time.
+ *
+ * Returns 0 on success and a negative error value when @fence has been
+ * signalled already.
  */
 int dma_fence_signal(struct dma_fence *fence)
 {
@@ -136,9 +174,9 @@ EXPORT_SYMBOL(dma_fence_signal);
 /**
  * dma_fence_wait_timeout - sleep until the fence gets signaled
  * or until timeout elapses
- * @fence: [in]the fence to wait on
- * @intr:  [in]if true, do an interruptible wait
- * @timeout:   [in]

[PATCH 02/15] dma-fence: Make ->enable_signaling optional

2018-05-03 Thread Daniel Vetter
Many drivers have a trivial implementation for ->enable_signaling.
Let's make it optional by assuming that signalling is already
available when the callback isn't present.

Reviewed-by: Christian König <christian.koe...@amd.com>
Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 drivers/dma-buf/dma-fence.c | 13 ++++++++++++-
 include/linux/dma-fence.h   |  3 ++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 4edb9fd3cf47..7b5b40d6b70e 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -181,6 +181,13 @@ void dma_fence_release(struct kref *kref)
 }
 EXPORT_SYMBOL(dma_fence_release);
 
+/**
+ * dma_fence_free - default release function for &dma_fence.
+ * @fence: fence to release
+ *
+ * This is the default implementation for &dma_fence_ops.release. It calls
+ * kfree_rcu() on @fence.
+ */
 void dma_fence_free(struct dma_fence *fence)
 {
kfree_rcu(fence, rcu);
@@ -560,7 +567,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
   spinlock_t *lock, u64 context, unsigned seqno)
 {
BUG_ON(!lock);
-	BUG_ON(!ops || !ops->wait || !ops->enable_signaling ||
-	       !ops->get_driver_name || !ops->get_timeline_name);
+	BUG_ON(!ops || !ops->wait ||
+	       !ops->get_driver_name || !ops->get_timeline_name);
 
 	kref_init(&fence->refcount);
@@ -572,6 +579,10 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
fence->flags = 0UL;
fence->error = 0;
 
+   if (!ops->enable_signaling)
+   set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
+			&fence->flags);
+
trace_dma_fence_init(fence);
 }
 EXPORT_SYMBOL(dma_fence_init);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 111aefe1c956..c053d19e1e24 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -166,7 +166,8 @@ struct dma_fence_ops {
 * released when the fence is signalled (through e.g. the interrupt
 * handler).
 *
-* This callback is mandatory.
+* This callback is optional. If this callback is not present, then the
+* driver must always have signaling enabled.
 */
bool (*enable_signaling)(struct dma_fence *fence);
 
-- 
2.17.0



Re: [PATCH v3 0/8] R-Car DU: Support CRC calculation

2018-05-03 Thread Daniel Vetter
omputation
>>
>>  drivers/gpu/drm/rcar-du/rcar_du_crtc.c| 156 -
>>  drivers/gpu/drm/rcar-du/rcar_du_crtc.h|  15 ++
>>  drivers/gpu/drm/rcar-du/rcar_du_vsp.c |  12 +-
>>  drivers/media/platform/vsp1/Makefile  |   2 +-
>>  drivers/media/platform/vsp1/vsp1.h|  10 +-
>>  drivers/media/platform/vsp1/vsp1_brx.c|   6 +-
>>  drivers/media/platform/vsp1/vsp1_brx.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_clu.c|  71 ++--
>>  drivers/media/platform/vsp1/vsp1_clu.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_dl.c |   8 +-
>>  drivers/media/platform/vsp1/vsp1_dl.h |   6 +-
>>  drivers/media/platform/vsp1/vsp1_drm.c| 127 --
>>  drivers/media/platform/vsp1/vsp1_drm.h|  15 +-
>>  drivers/media/platform/vsp1/vsp1_drv.c|  26 ++-
>>  drivers/media/platform/vsp1/vsp1_entity.c | 103 +++-
>>  drivers/media/platform/vsp1/vsp1_entity.h |  13 +-
>>  drivers/media/platform/vsp1/vsp1_hgo.c|   6 +-
>>  drivers/media/platform/vsp1/vsp1_hgo.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_hgt.c|   6 +-
>>  drivers/media/platform/vsp1/vsp1_hgt.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_histo.c  |  65 +--
>>  drivers/media/platform/vsp1/vsp1_histo.h  |   6 +-
>>  drivers/media/platform/vsp1/vsp1_hsit.c   |   6 +-
>>  drivers/media/platform/vsp1/vsp1_hsit.h   |   6 +-
>>  drivers/media/platform/vsp1/vsp1_lif.c|  71 ++--
>>  drivers/media/platform/vsp1/vsp1_lif.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_lut.c|  71 ++--
>>  drivers/media/platform/vsp1/vsp1_lut.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_pipe.c   |   6 +-
>>  drivers/media/platform/vsp1/vsp1_pipe.h   |   6 +-
>>  drivers/media/platform/vsp1/vsp1_regs.h   |  46 -
>>  drivers/media/platform/vsp1/vsp1_rpf.c|   6 +-
>>  drivers/media/platform/vsp1/vsp1_rwpf.c   |   6 +-
>>  drivers/media/platform/vsp1/vsp1_rwpf.h   |   6 +-
>>  drivers/media/platform/vsp1/vsp1_sru.c|   6 +-
>>  drivers/media/platform/vsp1/vsp1_sru.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_uds.c|   6 +-
>>  drivers/media/platform/vsp1/vsp1_uds.h|   6 +-
>>  drivers/media/platform/vsp1/vsp1_uif.c| 271 +++
>>  drivers/media/platform/vsp1/vsp1_uif.h|  32 
>>  drivers/media/platform/vsp1/vsp1_video.c  |   6 +-
>>  drivers/media/platform/vsp1/vsp1_video.h  |   6 +-
>>  drivers/media/platform/vsp1/vsp1_wpf.c|   6 +-
>>  include/media/vsp1.h  |  45 -
>>  44 files changed, 892 insertions(+), 417 deletions(-)
>>  create mode 100644 drivers/media/platform/vsp1/vsp1_uif.c
>>  create mode 100644 drivers/media/platform/vsp1/vsp1_uif.h
>
> --
> Regards,
>
> Laurent Pinchart
>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH 01/17] dma-fence: Some kerneldoc polish for dma-fence.h

2018-05-02 Thread Daniel Vetter
On Mon, Apr 30, 2018 at 10:49:00AM -0700, Eric Anholt wrote:
> Daniel Vetter <daniel.vet...@ffwll.ch> writes:
> > +   /**
> > +* @fill_driver_data:
> > +*
> > +* Callback to fill in free-form debug info Returns amount of bytes
> > +* filled, or negative error on failure.
> 
> Maybe this "Returns" should be on a new line?  Or at least a '.' in
> between.

Indeed I've missed this, thanks for spotting it. Done both.

Thanks, Daniel

> 
> Other than that,
> 
> Reviewed-by: Eric Anholt <e...@anholt.net>
> 
> Thanks!



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 04/17] dma-fence: Allow wait_any_timeout for all fences

2018-04-30 Thread Daniel Vetter
On Sun, Apr 29, 2018 at 09:11:31AM +0200, Christian König wrote:
> Am 27.04.2018 um 08:17 schrieb Daniel Vetter:
> > When this was introduced in
> > 
> > commit a519435a96597d8cd96123246fea4ae5a6c90b02
> > Author: Christian König <christian.koe...@amd.com>
> > Date:   Tue Oct 20 16:34:16 2015 +0200
> > 
> >  dma-buf/fence: add fence_wait_any_timeout function v2
> > 
> > there was a restriction added that this only works if the dma-fence
> > uses the dma_fence_default_wait hook. Which works for amdgpu, which is
> > the only caller. Well, until you share some buffers with e.g. i915,
> > then you get an -EINVAL.
> > 
> > But there's really no reason for this, because all drivers must
> > support callbacks. The special ->wait hook is there only as an
> > optimization; if the driver needs to create a worker thread for an
> > active callback, then it can avoid doing that if it knows that there's
> > a process context available already. So ->wait is just an
> > optimization; just using the logic in dma_fence_default_wait() should
> > work for all drivers.
> > 
> > Let's remove this restriction.
> 
> Mhm, that was intentionally introduced because for radeon that is not only an
> optimization, but mandatory for correct operation.
> 
> On the other hand radeon isn't using this function, so it should be fine as
> long as the Intel driver can live with it.

Well dma-buf already requires that dma_fence_add_callback works correctly.
And so do various users of it as soon as you engage in a bit of buffer
sharing. I guess whoever cares about buffer sharing with radeon gets to
fix this (you need to spawn a kthread or whatever in ->enable_signaling
which does the same work as your optimized ->wait callback).

But yeah, I'm definitely not making things work with this series, just a
bit more obvious that there's a problem already.
-Daniel

> 
> Christian.
> 
> > 
> > Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
> > Cc: Sumit Semwal <sumit.sem...@linaro.org>
> > Cc: Gustavo Padovan <gust...@padovan.org>
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-...@lists.linaro.org
> > Cc: Christian König <christian.koe...@amd.com>
> > Cc: Alex Deucher <alexander.deuc...@amd.com>
> > ---
> >   drivers/dma-buf/dma-fence.c | 5 -----
> >   1 file changed, 5 deletions(-)
> > 
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 7b5b40d6b70e..59049375bd19 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -503,11 +503,6 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
> > for (i = 0; i < count; ++i) {
> > struct dma_fence *fence = fences[i];
> > -   if (fence->ops->wait != dma_fence_default_wait) {
> > -   ret = -EINVAL;
> > -   goto fence_rm_cb;
> > -   }
> > -
> > 		cb[i].task = current;
> > 		if (dma_fence_add_callback(fence, &cb[i].base,
> > 					   dma_fence_default_wait_cb)) {
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
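
The fallback Daniel sketches for drivers like radeon could look roughly like
this; struct my_fence and its poll_work member are hypothetical:

	static bool my_fence_enable_signaling(struct dma_fence *fence)
	{
		struct my_fence *f = container_of(fence, struct my_fence, base);

		/* The worker polls the hardware and eventually calls
		 * dma_fence_signal(&f->base), doing the job the optimized
		 * ->wait callback used to do with borrowed process context. */
		queue_work(system_unbound_wq, &f->poll_work);
		return true;
	}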


[PATCH 17/17] dma-fence: Polish kernel-doc for dma-fence.c

2018-04-27 Thread Daniel Vetter
- Intro section that links to how this is exposed to userspace.
- Lots more hyperlinks.
- Minor clarifications and style polish

Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 Documentation/driver-api/dma-buf.rst |   6 ++
 drivers/dma-buf/dma-fence.c  | 140 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 102 insertions(+), 44 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index dc384f2f7f34..b541e97c7ab1 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -130,6 +130,12 @@ Reservation Objects
 DMA Fences
 ----------
 
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: DMA fences overview
+
+DMA Fences Functions Reference
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
:export:
 
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 30fcbe415ff4..4e931e1de198 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -38,12 +38,43 @@ EXPORT_TRACEPOINT_SYMBOL(dma_fence_enable_signal);
  */
 static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(0);
 
+/**
+ * DOC: DMA fences overview
+ *
+ * DMA fences, represented by &struct dma_fence, are the kernel internal
+ * synchronization primitive for DMA operations like GPU rendering, video
+ * encoding/decoding, or displaying buffers on a screen.
+ *
+ * A fence is initialized using dma_fence_init() and completed using
+ * dma_fence_signal(). Fences are associated with a context, allocated through
+ * dma_fence_context_alloc(), and all fences on the same context are
+ * fully ordered.
+ *
+ * Since the purpose of fences is to facilitate cross-device and
+ * cross-application synchronization, there are multiple ways to use one:
+ *
+ * - Individual fences can be exposed as a &sync_file, accessed as a file
+ *   descriptor from userspace, created by calling sync_file_create(). This is
+ *   called explicit fencing, since userspace passes around explicit
+ *   synchronization points.
+ *
+ * - Some subsystems also have their own explicit fencing primitives, like
+ *   &drm_syncobj. Compared to &sync_file, a &drm_syncobj allows the underlying
+ *   fence to be updated.
+ *
+ * - Then there's also implicit fencing, where the synchronization points are
+ *   implicitly passed around as part of shared &dma_buf instances. Such
+ *   implicit fences are stored in &struct reservation_object through the
+ *   &dma_buf.resv pointer.
+ */
+
 /**
  * dma_fence_context_alloc - allocate an array of fence contexts
- * @num:   [in]amount of contexts to allocate
+ * @num: amount of contexts to allocate
  *
- * This function will return the first index of the number of fences allocated.
- * The fence context is used for setting fence->context to a unique number.
+ * This function will return the first index of the number of fence contexts
+ * allocated.  The fence context is used for setting &dma_fence.context to a
+ * unique number by passing the context to dma_fence_init().
  */
 u64 dma_fence_context_alloc(unsigned num)
 {
@@ -59,10 +90,14 @@ EXPORT_SYMBOL(dma_fence_context_alloc);
  * Signal completion for software callbacks on a fence, this will unblock
  * dma_fence_wait() calls and run all the callbacks added with
  * dma_fence_add_callback(). Can be called multiple times, but since a fence
- * can only go from unsignaled to signaled state, it will only be effective
- * the first time.
+ * can only go from the unsignaled to the signaled state and not back, it will
+ * only be effective the first time.
  *
- * Unlike dma_fence_signal, this function must be called with fence->lock held.
+ * Unlike dma_fence_signal(), this function must be called with &dma_fence.lock
+ * held.
+ *
+ * Returns 0 on success and a negative error value when @fence has been
+ * signalled already.
  */
 int dma_fence_signal_locked(struct dma_fence *fence)
 {
@@ -102,8 +137,11 @@ EXPORT_SYMBOL(dma_fence_signal_locked);
  * Signal completion for software callbacks on a fence, this will unblock
  * dma_fence_wait() calls and run all the callbacks added with
  * dma_fence_add_callback(). Can be called multiple times, but since a fence
- * can only go from unsignaled to signaled state, it will only be effective
- * the first time.
+ * can only go from the unsignaled to the signaled state and not back, it will
+ * only be effective the first time.
+ *
+ * Returns 0 on success and a negative error value when @fence has been
+ * signalled already.
  */
 int dma_fence_signal(struct dma_fence *fence)
 {
@@ -136,9 +174,9 @@ EXPORT_SYMBOL(dma_fence_signal);
 /**
  * dma_fence_wait_timeout - sleep until the fence gets signaled
  * or until timeout elapses
- * @fence: [in]the fence to wait on
- * @intr:  [in]if true, do an interruptible wait
- * @timeout:   [in]

[PATCH 05/17] dma-fence: Make ->wait callback optional

2018-04-27 Thread Daniel Vetter
Almost everyone uses dma_fence_default_wait.

Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 drivers/dma-buf/dma-fence-array.c |  1 -
 drivers/dma-buf/dma-fence.c   |  5 ++++-
 drivers/dma-buf/sw_sync.c |  1 -
 include/linux/dma-fence.h | 13 ++++++++-----
 4 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
index dd1edfb27b61..a8c254497251 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -104,7 +104,6 @@ const struct dma_fence_ops dma_fence_array_ops = {
.get_timeline_name = dma_fence_array_get_timeline_name,
.enable_signaling = dma_fence_array_enable_signaling,
.signaled = dma_fence_array_signaled,
-   .wait = dma_fence_default_wait,
.release = dma_fence_array_release,
 };
 EXPORT_SYMBOL(dma_fence_array_ops);
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 59049375bd19..30fcbe415ff4 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -158,7 +158,10 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
return -EINVAL;
 
trace_dma_fence_wait_start(fence);
-   ret = fence->ops->wait(fence, intr, timeout);
+   if (fence->ops->wait)
+   ret = fence->ops->wait(fence, intr, timeout);
+   else
+   ret = dma_fence_default_wait(fence, intr, timeout);
trace_dma_fence_wait_end(fence);
return ret;
 }
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 3d78ca89a605..53c1d6d36a64 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -188,7 +188,6 @@ static const struct dma_fence_ops timeline_fence_ops = {
.get_timeline_name = timeline_fence_get_timeline_name,
.enable_signaling = timeline_fence_enable_signaling,
.signaled = timeline_fence_signaled,
-   .wait = dma_fence_default_wait,
.release = timeline_fence_release,
.fence_value_str = timeline_fence_value_str,
.timeline_value_str = timeline_fence_timeline_value_str,
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index c730f569621a..d05496ff0d10 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -191,11 +191,14 @@ struct dma_fence_ops {
/**
 * @wait:
 *
-	 * Custom wait implementation, or dma_fence_default_wait.
+	 * Custom wait implementation, defaults to dma_fence_default_wait() if
+	 * not set.
 	 *
-	 * Must not be NULL, set to dma_fence_default_wait for default implementation.
-	 * the dma_fence_default_wait implementation should work for any fence, as long
-	 * as enable_signaling works correctly.
+	 * The dma_fence_default_wait implementation should work for any fence, as long
+	 * as @enable_signaling works correctly. This hook allows drivers to
+	 * have an optimized version for the case where a process context is
+	 * already available, e.g. if @enable_signaling for the general case
+	 * needs to set up a worker thread.
 	 *
 	 * Must return -ERESTARTSYS if the wait is intr = true and the wait was
 	 * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
@@ -203,7 +206,7 @@ struct dma_fence_ops {
 	 * which should be treated as if the fence is signaled. For example a hardware
 	 * lockup could be reported like that.
 	 *
-	 * This callback is mandatory.
+	 * This callback is optional.
 */
signed long (*wait)(struct dma_fence *fence,
bool intr, signed long timeout);
-- 
2.17.0



[PATCH 01/17] dma-fence: Some kerneldoc polish for dma-fence.h

2018-04-27 Thread Daniel Vetter
- Switch to inline member docs for dma_fence_ops.
- Mild polish all around.
- hyperlink all the things!

v2: - Remove the various [in] annotations, they seem really uncommon
in kerneldoc and look funny.

Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 include/linux/dma-fence.h | 235 +++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 154 insertions(+), 81 deletions(-)

diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 4c008170fe65..9d6f39bf2111 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -94,11 +94,11 @@ typedef void (*dma_fence_func_t)(struct dma_fence *fence,
 struct dma_fence_cb *cb);
 
 /**
- * struct dma_fence_cb - callback for dma_fence_add_callback
- * @node: used by dma_fence_add_callback to append this struct to fence::cb_list
+ * struct dma_fence_cb - callback for dma_fence_add_callback()
+ * @node: used by dma_fence_add_callback() to append this struct to fence::cb_list
  * @func: dma_fence_func_t to call
  *
- * This struct will be initialized by dma_fence_add_callback, additional
+ * This struct will be initialized by dma_fence_add_callback(), additional
  * data can be passed along by embedding dma_fence_cb in another struct.
  */
 struct dma_fence_cb {
@@ -108,75 +108,142 @@ struct dma_fence_cb {
 
 /**
  * struct dma_fence_ops - operations implemented for fence
- * @get_driver_name: returns the driver name.
- * @get_timeline_name: return the name of the context this fence belongs to.
- * @enable_signaling: enable software signaling of fence.
- * @signaled: [optional] peek whether the fence is signaled, can be null.
- * @wait: custom wait implementation, or dma_fence_default_wait.
- * @release: [optional] called on destruction of fence, can be null
- * @fill_driver_data: [optional] callback to fill in free-form debug info
- * Returns amount of bytes filled, or -errno.
- * @fence_value_str: [optional] fills in the value of the fence as a string
- * @timeline_value_str: [optional] fills in the current value of the timeline
- * as a string
  *
- * Notes on enable_signaling:
- * For fence implementations that have the capability for hw->hw
- * signaling, they can implement this op to enable the necessary
- * irqs, or insert commands into cmdstream, etc.  This is called
- * in the first wait() or add_callback() path to let the fence
- * implementation know that there is another driver waiting on
- * the signal (ie. hw->sw case).
- *
- * This function can be called from atomic context, but not
- * from irq context, so normal spinlocks can be used.
- *
- * A return value of false indicates the fence already passed,
- * or some failure occurred that made it impossible to enable
- * signaling. True indicates successful enabling.
- *
- * fence->error may be set in enable_signaling, but only when false is
- * returned.
- *
- * Calling dma_fence_signal before enable_signaling is called allows
- * for a tiny race window in which enable_signaling is called during,
- * before, or after dma_fence_signal. To fight this, it is recommended
- * that before enable_signaling returns true an extra reference is
- * taken on the fence, to be released when the fence is signaled.
- * This will mean dma_fence_signal will still be called twice, but
- * the second time will be a noop since it was already signaled.
- *
- * Notes on signaled:
- * May set fence->error if returning true.
- *
- * Notes on wait:
- * Must not be NULL, set to dma_fence_default_wait for default implementation.
- * the dma_fence_default_wait implementation should work for any fence, as long
- * as enable_signaling works correctly.
- *
- * Must return -ERESTARTSYS if the wait is intr = true and the wait was
- * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
- * timed out. Can also return other error values on custom implementations,
- * which should be treated as if the fence is signaled. For example a hardware
- * lockup could be reported like that.
- *
- * Notes on release:
- * Can be NULL, this function allows additional commands to run on
- * destruction of the fence. Can be called from irq context.
- * If pointer is set to NULL, kfree will get called instead.
  */
-
 struct dma_fence_ops {
+   /**
+* @get_driver_name:
+*
+* Returns the driver name. This is a callback to allow drivers to
+* compute the name at runtime, without having to store it permanently
+* for each fence, or build a cache of some sort.
+*
+* This callback is mandatory.
+*/
const char * (*get_driver_name)(struct dma_fence *fence);
+
+   /**
+* @get_timeline_name:
+*
+* Return the name of the context this fence belongs to. This is a
+* callback to allow drivers to compute the name at runtime, without
+
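
The embedding pattern the @node/@func kerneldoc above refers to looks
roughly like this (my_waiter and its completion payload are made up for
illustration):

struct my_waiter {
	struct dma_fence_cb cb;
	struct completion done;		/* arbitrary per-callback payload */
};

static void my_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	/* recover the embedding struct to reach the payload */
	struct my_waiter *w = container_of(cb, struct my_waiter, cb);

	complete(&w->done);
}

/* usage: init_completion(&w->done);
 *        dma_fence_add_callback(fence, &w->cb, my_fence_cb);
 */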

[PATCH 04/17] dma-fence: Allow wait_any_timeout for all fences

2018-04-27 Thread Daniel Vetter
When this was introduced in

commit a519435a96597d8cd96123246fea4ae5a6c90b02
Author: Christian König <christian.koe...@amd.com>
Date:   Tue Oct 20 16:34:16 2015 +0200

dma-buf/fence: add fence_wait_any_timeout function v2

there was a restriction added that this only works if the dma-fence
uses the dma_fence_default_wait hook. That works for amdgpu, which is
the only caller. Well, until you share some buffers with e.g. i915;
then you get an -EINVAL.

But there's really no reason for this, because all drivers must
support callbacks. The special ->wait hook is only an optimization:
if the driver needs to create a worker thread for an active callback,
then it can avoid doing that when it knows that a process context is
already available. So ->wait is just an optimization; just using the
logic in dma_fence_default_wait() should work for all drivers.

Let's remove this restriction.

Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: Christian König <christian.koe...@amd.com>
Cc: Alex Deucher <alexander.deuc...@amd.com>
---
 drivers/dma-buf/dma-fence.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 7b5b40d6b70e..59049375bd19 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -503,11 +503,6 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
for (i = 0; i < count; ++i) {
struct dma_fence *fence = fences[i];
 
-   if (fence->ops->wait != dma_fence_default_wait) {
-   ret = -EINVAL;
-   goto fence_rm_cb;
-   }
-
cb[i].task = current;
if (dma_fence_add_callback(fence, &cb[i].base,
			   dma_fence_default_wait_cb)) {
-- 
2.17.0
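
With the check gone, a caller can mix fences from different drivers in
one wait. Rough usage sketch (assuming this era's signature, which takes
an out-parameter for the index of the first signaled fence):

	uint32_t first;
	signed long ret;

	/* fences[] may now mix e.g. amdgpu and i915 fences */
	ret = dma_fence_wait_any_timeout(fences, count, true,
					 msecs_to_jiffies(100), &first);
	if (ret == 0)
		pr_debug("wait timed out\n");
	else if (ret > 0)
		pr_debug("fence %u signaled first\n", first);
	/* ret < 0: e.g. -ERESTARTSYS if interrupted */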



[PATCH 03/17] dma-fence: Make ->enable_signaling optional

2018-04-27 Thread Daniel Vetter
Many drivers have a trivial implementation for ->enable_signaling.
Let's make it optional by assuming that signalling is already
available when the callback isn't present.

Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 drivers/dma-buf/dma-fence.c | 13 -
 include/linux/dma-fence.h   |  3 ++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 4edb9fd3cf47..7b5b40d6b70e 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -181,6 +181,13 @@ void dma_fence_release(struct kref *kref)
 }
 EXPORT_SYMBOL(dma_fence_release);
 
+/**
+ * dma_fence_free - default release function for &dma_fence.
+ * @fence: fence to release
+ *
+ * This is the default implementation for &dma_fence_ops.release. It calls
+ * kfree_rcu() on @fence.
+ */
 void dma_fence_free(struct dma_fence *fence)
 {
kfree_rcu(fence, rcu);
@@ -560,7 +567,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
   spinlock_t *lock, u64 context, unsigned seqno)
 {
BUG_ON(!lock);
-   BUG_ON(!ops || !ops->wait || !ops->enable_signaling ||
+   BUG_ON(!ops || !ops->wait ||
   !ops->get_driver_name || !ops->get_timeline_name);
 
kref_init(&fence->refcount);
@@ -572,6 +579,10 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
fence->flags = 0UL;
fence->error = 0;
 
+   if (!ops->enable_signaling)
+   set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
+   &fence->flags);
+
trace_dma_fence_init(fence);
 }
 EXPORT_SYMBOL(dma_fence_init);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index f9a6848f8558..c730f569621a 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -166,7 +166,8 @@ struct dma_fence_ops {
 * released when the fence is signalled (through e.g. the interrupt
 * handler).
 *
-* This callback is mandatory.
+* This callback is optional. If this callback is not present, then the
+* driver must always have signaling enabled.
 */
bool (*enable_signaling)(struct dma_fence *fence);
 
-- 
2.17.0
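
After this patch a fence that is signaled from an always-running source
(say, an existing interrupt handler) can omit the callback. Hypothetical
sketch:

static const struct dma_fence_ops my_irq_fence_ops = {
	.get_driver_name   = my_get_driver_name,	/* hypothetical */
	.get_timeline_name = my_get_timeline_name,	/* hypothetical */
	/* no .enable_signaling: dma_fence_init() now pre-sets
	 * DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT */
	.wait = dma_fence_default_wait,	/* still mandatory at this point
					 * in the series */
};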



Re: noveau vs arm dma ops

2018-04-26 Thread Daniel Vetter
On Thu, Apr 26, 2018 at 11:09 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
>> > get_required_mask() is supposed to tell you if you are safe.  However
>> > we are missing lots of implementations of it for iommus so you might get
>> > some false negatives, improvements welcome.  It's been on my list of
>> > things to fix in the DMA API, but it is nowhere near the top.
>>
>> It hasn't come up in a while in some fireworks, so I honestly don't
>> remember exactly what the issues have been. But
>>
>> commit d766ef53006c2c38a7fe2bef0904105a793383f2
>> Author: Chris Wilson <ch...@chris-wilson.co.uk>
>> Date:   Mon Dec 19 12:43:45 2016 +
>>
>> drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping
>>
>> and the various bits of code that a
>>
>> $ git grep SWIOTLB -- drivers/gpu
>>
>> turns up is what we're doing to hack around that stuff. And in general
>> (there's some exceptions) gpus should be able to address everything,
>> so I never fully understood where that's even coming from.
>
> I'm pretty sure I've seen some oddly low dma masks in GPU drivers.  E.g.
> duplicated in various AMD files:
>
> adev->need_dma32 = false;
> dma_bits = adev->need_dma32 ? 32 : 40;
> r = pci_set_dma_mask(adev->pdev, DMA_BIT_MASK(dma_bits));
> if (r) {
> adev->need_dma32 = true;
> dma_bits = 32;
> dev_warn(adev->dev, "amdgpu: No suitable DMA available.\n");
> }
>
> synopsis:
>
> drivers/gpu/drm/bridge/synopsys/dw-hdmi-i2s-audio.c:pdevinfo.dma_mask = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);
>
> etnaviv gets it right:
>
> drivers/gpu/drm/etnaviv/etnaviv_gpu.c:  u32 dma_mask = (u32)dma_get_required_mask(gpu->dev);
>
>
> But yes, the swiotlb hackery really irks me.  I just have some more
> important and bigger fires to fight first, but I plan to get back to the
> root cause of that eventually.
>
>>
>> >> - dma api hides the cache flushing requirements from us. GPUs love
>> >>   non-snooped access, and worse, give userspace control over that. We want
>> >>   a strict separation between mapping stuff and flushing stuff. With the
>> >>   IOMMU api we mostly have the former, but for the latter arch maintainers
>> >>   regularly tell us they won't allow that. So we have drm_clflush.c.
>> >
>> > The problem is that a cache flushing API entirely separate is hard. That
>> > being said if you look at my generic dma-noncoherent API series it tries
>> > to move that way.  So far it is in early stages and apparently rather
>> > buggy unfortunately.
>>
>> I'm assuming this stuff here?
>>
>> https://lkml.org/lkml/2018/4/20/146
>>
>> Anyway got lost in all that work a bit, looks really nice.
>
> That url doesn't seem to work currently.  But I am talking about the
> thread titled '[RFC] common non-cache coherent direct dma mapping ops'
>
>> Yeah the above is pretty much what we do on x86. dma-api believes
>> everything is coherent, so dma_map_sg does the mapping we want and
>> nothing else (minus swiotlb fun). Cache flushing, allocations, all
>> done by the driver.
>
> Which sounds like the right thing to do to me.
>
>> On arm that doesn't work. The iommu api seems like a good fit, except
>> the dma-api tends to get in the way a bit (drm/msm apparently has
>> similar problems like tegra), and if you need contiguous memory
>> dma_alloc_coherent is the only way to get at contiguous memory. There
>> was a huge discussion years ago about that, and direct cma access was
>> shot down because it would have exposed too much of the caching
>> attribute mangling required (most arm platforms need wc-pages to not
>> be in the kernel's linear map apparently).
>
> Simple cma_alloc() doesn't do anything about cache handling, it
> just is a very dumb allocator for large contiguous regions inside
> a big pool.
>
> I'm not the CMA maintainer, but in general I'd love to see an
> EXPORT_SYMBOL_GPL slapped onto cma_alloc/release and drivers use
> that were needed.  Using that plus dma_map*/dma_unmap* sounds like
> a much saner interface than dma_alloc_attrs + DMA_ATTR_NON_CONSISTENT
> or DMA_ATTR_NO_KERNEL_MAPPING.
>
> You don't happen to have a pointer to tha

Re: [Linaro-mm-sig] noveau vs arm dma ops

2018-04-26 Thread Daniel Vetter
On Thu, Apr 26, 2018 at 11:24 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Thu, Apr 26, 2018 at 11:20:44AM +0200, Daniel Vetter wrote:
>> The above is already what we're implementing in i915, at least
>> conceptually (it all boils down to clflush instructions because those
>> both invalidate and flush).
>
> The clwb instruction that just writes back dirty cache lines might
> be very interesting for the x86 non-coherent dma case.  A lot of
> architectures use their equivalent to prepare for device transfers.

Iirc it didn't help much for i915 use-cases. Either data gets streamed
between cpu and gpu, and then keeping the clean cacheline around
doesn't buy you anything. In other cases we need to flush because the
gpu really wants to use non-snooped transactions (faster/lower
latency/less power required for display because you can shut down the
caches), and then there's also no benefit with keeping the cacheline
around (no one will ever need it again).

I think clwb is more for persistent memory and stuff like that, not so
much for gpus.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] noveau vs arm dma ops

2018-04-26 Thread Daniel Vetter
On Thu, Apr 26, 2018 at 12:54 AM, Russell King - ARM Linux
<li...@armlinux.org.uk> wrote:
> On Wed, Apr 25, 2018 at 08:33:12AM -0700, Christoph Hellwig wrote:
>> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
>> > - dma api hides the cache flushing requirements from us. GPUs love
>> >   non-snooped access, and worse, give userspace control over that. We want
>> >   a strict separation between mapping stuff and flushing stuff. With the
>> >   IOMMU api we mostly have the former, but for the latter arch maintainers
>> >   regularly tell us they won't allow that. So we have drm_clflush.c.
>>
>> The problem is that a cache flushing API entirely separate is hard. That
>> being said if you look at my generic dma-noncoherent API series it tries
>> to move that way.  So far it is in early stages and apparently rather
>> buggy unfortunately.
>
> And if folk want a cacheable mapping with explicit cache flushing, the
> cache flushing must not be defined in terms of "this is what CPU seems
> to need" but from the point of view of a CPU with infinite prefetching,
> infinite caching and infinite capacity to perform writebacks of dirty
> cache lines at unexpected moments when the memory is mapped in a
> cacheable mapping.
>
> (The reason for that is you're operating in a non-CPU specific space,
> so you can't make any guarantees as to how much caching or prefetching
> will occur by the CPU - different CPUs will do different amounts.)
>
> So, for example, the sequence:
>
> GPU writes to memory
> CPU reads from cacheable memory
>
> if the memory was previously dirty (iow, CPU has written), you need to
> flush the dirty cache lines _before_ the GPU writes happen, but you
> don't know whether the CPU has speculatively prefetched, so you need
> to flush any prefetched cache lines before reading from the cacheable
> memory _after_ the GPU has finished writing.
>
> Also note that "flush" there can be "clean the cache", "clean and
> invalidate the cache" or "invalidate the cache" as appropriate - some
> CPUs are able to perform those three operations, and the appropriate
> one depends on not only where in the above sequence it's being used,
> but also on what the operations are.
>
> So, the above sequence could be:
>
> CPU invalidates cache for memory
> (due to possible dirty cache lines)
> GPU writes to memory
> CPU invalidates cache for memory
> (to get rid of any speculatively prefetched
>  lines)
> CPU reads from cacheable memory
>
> Yes, in the above case, _two_ cache operations are required to ensure
> correct behaviour.  However, if you know for certain that the memory was
> previously clean, then the first cache operation can be skipped.
>
> What I'm pointing out is there's much more than just "I want to flush
> the cache" here, which is currently what DRM seems to assume at the
> moment with the code in drm_cache.c.
>
> If we can agree a set of interfaces that allows _proper_ use of these
> facilities, one which can be used appropriately, then there shouldn't
> be a problem.  The DMA API does that via it's ideas about who owns a
> particular buffer (because of the above problem) and that's something
> which would need to be carried over to such a cache flushing API (it
> should be pretty obvious that having a GPU read or write memory while
> the cache for that memory is being cleaned will lead to unexpected
> results.)
>
> Also note that things get even more interesting in a SMP environment
> if cache operations aren't broadcasted across the SMP cluster (which
> means cache operations have to be IPI'd to other CPUs.)
>
> The next issue, which I've brought up before, is that exposing cache
> flushing to userspace on architectures where it isn't already exposed
> comes at a cost.  As has been shown by Google Project Zero, this risks exposing
> those architectures to Spectre and Meltdown exploits where they weren't
> at such a risk before.  (I've pretty much shown here that you _do_
> need to control which cache lines get flushed to make these exploits
> work, and flushing the cache by reading lots of data in lieu of having
> the ability to explicitly flush bits of cache makes it very difficult
> to impossible for them to work.)

The above is already what we're implementing in i915, at least
conceptually (it all boils down to clflush instructions because those
both invalidate and flush).

One architectural guarantee we're exploiting is that prefetched (and
hence non-dirty) cachel
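
Russell's two-cache-operation sequence quoted above is exactly what the
streaming DMA API's ownership model encodes. A sketch with standard
dma_* calls (dev, buf and size are assumed driver state):

	/* first cache op: dirty lines cleaned/invalidated before the
	 * device writes */
	dma_addr_t addr = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

	/* ... GPU writes to buf through addr ... */

	/* second cache op: speculatively prefetched lines invalidated;
	 * only now may the CPU read buf */
	dma_unmap_single(dev, addr, size, DMA_FROM_DEVICE);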

Re: noveau vs arm dma ops

2018-04-26 Thread Daniel Vetter
On Thu, Apr 26, 2018 at 1:26 AM, Russell King - ARM Linux
<li...@armlinux.org.uk> wrote:
> On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
>> On arm that doesn't work. The iommu api seems like a good fit, except
>> the dma-api tends to get in the way a bit (drm/msm apparently has
>> similar problems like tegra), and if you need contiguous memory
>> dma_alloc_coherent is the only way to get at contiguous memory. There
>> was a huge discussion years ago about that, and direct cma access was
>> shot down because it would have exposed too much of the caching
>> attribute mangling required (most arm platforms need wc-pages to not
>> be in the kernel's linear map apparently).
>
> I think you completely misunderstand ARM from what you've written above,
> and this worries me greatly about giving DRM the level of control that
> is being asked for.
>
> Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache
> attributes are stored in the page tables.  These caches are inherently
> non-aliasing when there are multiple mappings (which is a great step
> forward compared to the previous aliasing caches.)
>
> As the cache attributes are stored in the page tables, this in theory
> allows different virtual mappings of the same physical memory to have
> different cache attributes.  However, there's a problem, and that's
> called speculative prefetching.
>
> Let's say you have one mapping which is cacheable, and another that is
> marked as write combining.  If a cache line is speculatively prefetched
> through the cacheable mapping of this memory, and then you read the
> same physical location through the write combining mapping, it is
> possible that you could read cached data.
>
> So, it is generally accepted that all mappings of any particular
> physical bit of memory should have the same cache attributes to avoid
> unpredictable behaviour.
>
> This presents a problem with what is generally called "lowmem" where
> the memory is mapped in kernel virtual space with cacheable
> attributes.  It can also happen with highmem if the memory is
> kmapped.
>
> This is why, on ARM, you can't use something like get_free_pages() to
> grab some pages from the system, pass it to the GPU, map it into
> userspace as write-combining, etc.  It _might_ work for some CPUs,
> but ARM CPUs vary in how much prefetching they do, and what may work
> for one particular CPU is in no way guaranteed to work for another
> ARM CPU.
>
> The official line from architecture folk is to assume that the caches
> infinitely speculate, are of infinite size, and can writeback *dirty*
> data at any moment.
>
> The way to stop things like speculative prefetches to particular
> physical memory is to, quite "simply", not have any cacheable
> mappings of that physical memory anywhere in the system.
>
> Now, cache flushes on ARM tend to be fairly expensive for GPU buffers.
> If you have, say, an 8MB buffer (for a 1080p frame) and you need to
> do a cache operation on that buffer, you'll be iterating over it
> 32 or maybe 64 bytes at a time "just in case" there's a cache line
> present.  Referring to my previous email, where I detailed the
> potential need for _two_ flushes, one before the GPU operation and
> one after, and this becomes _really_ expensive.  At that point, you're
> probably way better off using write-combine memory where you don't
> need to spend CPU cycles performing cache flushing - potentially
> across all CPUs in the system if cache operations aren't broadcasted.
>
> This isn't a simple matter of "just provide some APIs for cache
> operations" - there's much more that needs to be understood by
> all parties here, especially when we have GPU drivers that can be
> used with quite different CPUs.
>
> It may well be that for some combinations of CPUs and workloads, it's
> better to use write-combine memory without cache flushing, but for
> other CPUs that tradeoff (for the same workload) could well be
> different.
>
> Older ARMs get more interesting, because they have aliasing caches.
> That means the CPU cache aliases across different virtual space
> mappings in some way, which complicates (a) the mapping of memory
> and (b) handling the cache operations on it.
>
> It's too late for me to go into that tonight, and I probably won't
> be reading mail for the next week and a half, sorry.

I didn't know all the details well enough (nor had the time to
write a few paragraphs like you did), but the above is what I had in
mind and meant. Sorry if my sloppy reply sounded like I'm mixing stuff
up.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
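
The practical consequence of the "same attributes for all mappings" rule
is that drm drivers hand out one attribute everywhere, typically
write-combine. The usual mmap-side sketch (my_gem_mmap is hypothetical):

static int my_gem_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* keep every view write-combine; the backing pages must not
	 * retain a cacheable kernel alias */
	vma->vm_page_prot =
		pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
	return 0;	/* real handlers then set vm_ops / insert pages */
}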


Re: noveau vs arm dma ops

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 5:33 PM, Christoph Hellwig <h...@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
>> > Coordinating the backport of a trivial helper in the arm tree is not
>> > the end of the world.  Really, this cowboy attitude is a good reason
>> > why graphics folks have such a bad rep.  You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
>> > are upset.  This shouldn't be surprising.
>>
>> Not really agreeing on the cowboy thing. The fundamental problem is that
>> the dma api provides abstraction that seriously gets in the way of writing
>> a gpu driver. Some examples:
>
> So talk to other people.  Maybe people share your frustation.  Or maybe
> other people have a way to help.
>
>> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>>   hacks to fall back to a cache of pages allocated using
>>   dma_alloc_coherent if you build a kernel with bounce buffers.
>
> get_required_mask() is supposed to tell you if you are safe.  However
> we are missing lots of implementations of it for iommus so you might get
> some false negatives, improvements welcome.  It's been on my list of
> things to fix in the DMA API, but it is nowhere near the top.

It hasn't come up in a while in some fireworks, so I honestly don't
remember exactly what the issues have been. But

commit d766ef53006c2c38a7fe2bef0904105a793383f2
Author: Chris Wilson <ch...@chris-wilson.co.uk>
Date:   Mon Dec 19 12:43:45 2016 +

drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping

and the various bits of code that a

$ git grep SWIOTLB -- drivers/gpu

turns up is what we're doing to hack around that stuff. And in general
(there's some exceptions) gpus should be able to address everything,
so I never fully understood where that's even coming from.

>> - dma api hides the cache flushing requirements from us. GPUs love
>>   non-snooped access, and worse, give userspace control over that. We want
>>   a strict separation between mapping stuff and flushing stuff. With the
>>   IOMMU api we mostly have the former, but for the latter arch maintainers
>>   regularly tell us they won't allow that. So we have drm_clflush.c.
>
> The problem is that a cache flushing API entirely separate is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way.  So far it is in early stages and apparently rather
> buggy unfortunately.

I'm assuming this stuff here?

https://lkml.org/lkml/2018/4/20/146

Anyway got lost in all that work a bit, looks really nice.

>> - dma api hides how/where memory is allocated. Kinda similar problem,
>>   except now for CMA or address limits. So either we roll our own
>>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>>   allocations because that's the only way. Which sucks, because we can't
>>   directly tell CMA how to back off if there's some way to make CMA memory
>>   available through other means (gpus love to hog all of memory, so we
>>   have shrinkers and everything).
>
> If you really care about doing explicitly cache flushing anyway (see
> above) allocating your own memory and mapping it where needed is by
> far the superior solution.  On cache coherent architectures
> dma_alloc_coherent is nothing but allocate memory + dma_map_single.
> On non coherent allocations the memory might come through a special
> pool or must be used through a special virtual address mapping that
> is set up either statically or dynamically.  For that case splitting
> allocation and mapping is a good idea in many ways, and I plan to move
> towards that once the number of dma mapping implementations is down
> to a reasonable number so that it can actually be done.

Yeah the above is pretty much what we do on x86. dma-api believes
everything is coherent, so dma_map_sg does the mapping we want and
nothing else (minus swiotlb fun). Cache flushing, allocations, all
done by the driver.

On arm that doesn't work. The iommu api seems like a good fit, except
the dma-api tends to get in the way a bit (drm/msm apparently has
similar problems like tegra), and if you need contiguous memory
dma_alloc_coherent is the only way to get at contiguous memory. There
was a huge discussion years ago about that, and direct cma access was
shot down because it would have exposed too much of the caching
attribute mangling required (most arm platforms need wc-pages to not
be in the kernel's linear map apparently).

Anything that separates these 3 things more (allocation pools, mapping
through IOMMUs and flushing cpu caches) sounds like the right
direction 

Re: noveau vs arm dma ops

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
>  and linux-arm-kernel lists, but wasn't:
>  https://www.spinics.net/lists/dri-devel/msg173630.html]
> 
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code.  Drivers have no business poking
> > > into these details.
> > 
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.
> 
> The only reason the DMA ops are exported is because get_arch_dma_ops
> references (or in case of the coherent ones used to reference).  We
> don't let drivers assign random dma ops.
> 
> > 
> > > Thierry, please resend this with at least the iommu list and
> > > linux-arm-kernel in Cc to have a proper discussion on the right API.
> > 
> > I'm certainly open to help with finding a correct solution, but the
> > patch above was purposefully terse because this is something that I
> > hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> > such a backport between ARM and DRM trees does not sound like something
> > that would help getting this fixed in v4.16.
> 
> Coordinating the backport of a trivial helper in the arm tree is not
> the end of the world.  Really, this cowboy attitude is a good reason
> why graphics folks have such a bad rep.  You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
> are upset.  This shouldn't be surprising.

Not really agreeing on the cowboy thing. The fundamental problem is that
the dma api provides abstraction that seriously gets in the way of writing
a gpu driver. Some examples:

- We never want bounce buffers, ever. dma_map_sg gives us that, so there's
  hacks to fall back to a cache of pages allocated using
  dma_alloc_coherent if you build a kernel with bounce buffers.

- dma api hides the cache flushing requirements from us. GPUs love
  non-snooped access, and worse, give userspace control over that. We want
  a strict separation between mapping stuff and flushing stuff. With the
  IOMMU api we mostly have the former, but for the latter arch maintainers
  regularly tell us they won't allow that. So we have drm_clflush.c (see
  the sketch after this list).

- dma api hides how/where memory is allocated. Kinda similar problem,
  except now for CMA or address limits. So either we roll our own
  allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
  we use dma_alloc_coherent and then grab the sgt to get at the CMA
  allocations because that's the only way. Which sucks, because we can't
  directly tell CMA how to back off if there's some way to make CMA memory
  available through other means (gpus love to hog all of memory, so we
  have shrinkers and everything).
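
For illustration, the driver-side flushing that drm_clflush.c provides;
a minimal sketch (my_flush_object is a hypothetical wrapper):

#include <drm/drm_cache.h>

/* flush cpu caches for the pages backing a gpu object */
static void my_flush_object(struct page **pages, unsigned long num_pages)
{
	drm_clflush_pages(pages, num_pages);
}

On x86 this boils down to clflush(opt) loops with a wbinvd IPI fallback;
on most other architectures drm_clflush.c is essentially a stub, which is
part of the problem discussed here.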

For display drivers the dma api mostly works, until you start sharing
buffers with other devices.

So from our perspective it fairly often looks like core folks just don't
want to support gpu use-cases, so we play a bit more cowboy and get things
done some other way. Since this has been going on for years now we often
don't even bother to start a discussion first, since it tended to go
nowhere useful.
-Daniel

> > Granted, this issue could've been caught with a little more testing, but
> > in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
> > was just enabled unconditionally if it has side-effects that platforms
> > don't opt in to but have to explicitly opt out of.
> 
> Agreed on that count.  Please send a patch.

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> > Can we please not nack everything right away? Doesn't really motivate
> > me to show you all the various things we're doing in gpu to make the
> > dma layer work for us. That kind of noodling around in lower levels to
> > get them to do what we want is absolutely par for the course for gpu
> > drivers. If you just nack everything I point you at for illustrative
> > purposes, then I can't show you stuff anymore.
> 
> No, it's not.  No driver (and that includes the magic GPUs) has
> any business messing with dma ops directly.
> 
> A GPU driver imght have a very valid reason to disable the IOMMU,
> but the code to do so needs to be at least in the arch code, maybe
> in the dma-mapping/iommu code, not in the driver.
> 
> As a first step to get the discussion started we'll simply need
> to move the code Thierry wrote into a helper in arch/arm and that
> alone would be a massive improvement.  I'm not even talking about
> minor details like actually using arm_get_dma_map_ops instead
> of duplicating it.
> 
> And doing this basic trivial work really helps to get this whole
> mess under control.

Ah ok. It did sound a bit like a much more categorical NAK than an "ack
in principle, but we need to shuffle the implementation into the right
place first". In the past we generally got a principled NAK on anything
funny we've been doing with the dma api, and the dma api maintainer
steaming off telling us we're incompetent idiots. I guess I've been
branded a bit on this topic :-/

Really great that this is changing now.

On the patch itself: It might not be the right thing in all cases, since
for certain compression formats the nv gpu wants larger pages (easy to
allocate from vram, not so easy from main memory), so might need the iommu
still. But currently that's not implemented:

https://www.spinics.net/lists/dri-devel/msg173932.html

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 8:43 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote:
>> For more fun:
>>
>> https://www.spinics.net/lists/dri-devel/msg173630.html
>>
>> Yeah, sometimes we want to disable the iommu because the on-gpu
>> pagetables are faster ...
>
> I am not on this list, but remote NAK from here.  This needs an
> API from the iommu/dma-mapping code.  Drivers have no business poking
> into these details.

Can we please not nack everything right away? Doesn't really motivate
me to show you all the various things we're doing in gpu to make the
dma layer work for us. That kind of noodling around in lower levels to
get them to do what we want is absolutely par for the course for gpu
drivers. If you just nack everything I point you at for illustrative
purposes, then I can't show you stuff anymore.

Just to make it clear: I do want to get this stuff sorted, and it's
awesome that someone from core finally takes a serious look at what
gpu folks have been doing for decades (instead of just telling us
we're incompetent and doing it all wrong and then steaming off), and
how to make this work without layering violations to no end. But
stopping the world until this is fixed isn't really a good option.

Thanks, Daniel

> Thierry, please resend this with at least the iommu list and
> linux-arm-kernel in Cc to have a proper discussion on the right API.



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 8:13 AM, Daniel Vetter <dan...@ffwll.ch> wrote:
> On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig <h...@infradead.org> wrote:
>> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>>> Out of curiosity, how much virtual flushing stuff is there still out
>>> there? At least in drm we've pretty much ignored this, and seem to be
>>> getting away without a huge uproar (at least from driver developers
>>> and users, core folks are less amused about that).
>>
>> As I've just been wading through the code, the following architectures
>> have non-coherent dma that flushes by virtual address for at least some
>> platforms:
>>
>>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips, powerpc
>>
>> These have non-coherent dma ops that flush by physical address:
>>
>>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>>
>> And these do not have non-coherent dma ops at all:
>>
>>  - alpha, h8300, riscv, unicore32, x86
>>
>> [1] arm seems to do both virtually and physically based ops, further
>> audit needed.
>>
>> Note that using virtual addresses in the cache flushing interface
>> doesn't mean that the cache actually is virtually indexed, but it at
>> least allows for the possibility.
>>
>>> > I think the most important thing about such a buffer object is that
>>> > it can distinguish the underlying mapping types.  While
>>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>>> > stuff them all into a structure like struct scatterlist that has
>>> > no indication what kind of mapping you are dealing with is just
>>> > asking for trouble.
>>>
>>> Well the idea was to have 1 interface to allow all drivers to share
>>> buffers with anything else, no matter how exactly they're allocated.
>>
>> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
>> a scatterlist through the sg_table in dma_buf_map_attachment /
>> ->map_dma_buf, but looking at a few of the callers it seems like they
>> really do not even want a scatterlist to start with, but check that
>> it contains a physically contiguous range first.  So kicking the
>> scatterlist out of there will probably improve the interface in general.
>
> I think by number most drm drivers require contiguous memory (or an
> iommu that makes it look contiguous). But there's plenty others who
> have another set of pagetables on the gpu itself and can
> scatter-gather. Usually it's the former for display/video blocks, and
> the latter for rendering.

For more fun:

https://www.spinics.net/lists/dri-devel/msg173630.html

Yeah, sometimes we want to disable the iommu because the on-gpu
pagetables are faster ...
-Daniel

>>> dma-buf has all the functions for flushing, so you can have coherent
>>> mappings, non-coherent mappings and pretty much anything else. Or well
>>> could, because in practice people hack up layering violations until it
>>> works for the 2-3 drivers they care about. On top of that there's the
>>> small issue that x86 insists that dma is coherent (and that's true for
>>> most devices, including v4l drivers you might want to share stuff
>>> with), and gpus really, really really do want to make almost
>>> everything incoherent.
>>
>> How do discrete GPUs manage to be incoherent when attached over PCIe?
>
> It has a non-coherent transaction mode (which the chipset can opt to
> not implement and still flush), to make sure the AGP horror show
> doesn't happen again and GPU folks are happy with PCIe. That's at
> least my understanding from digging around in amd the last time we had
> coherency issues between intel and amd gpus. GPUs have some bits
> somewhere (in the pagetables, or in the buffer object description
> table created by userspace) to control that stuff.
>
> For anything on the SoC it's presented as pci device, but that's
> extremely fake, and we can definitely do non-snooped transactions on
> drm/i915. Again, controlled by a mix of pagetables and
> userspace-provided buffer object description tables.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>> Out of curiosity, how much virtual flushing stuff is there still out
>> there? At least in drm we've pretty much ignored this, and seem to be
>> getting away without a huge uproar (at least from driver developers
>> and users, core folks are less amused about that).
>
> As I've just been wading through the code, the following architectures
> have non-coherent dma that flushes by virtual address for at least some
> platforms:
>
>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips, powerpc
>
> These have non-coherent dma ops that flush by physical address:
>
>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>
> And these do not have non-coherent dma ops at all:
>
>  - alpha, h8300, riscv, unicore32, x86
>
> [1] arm seems to do both virtually and physically based ops, further
> audit needed.
>
> Note that using virtual addresses in the cache flushing interface
> doesn't mean that the cache actually is virtually indexed, but it at
> least allows for the possibility.
>
>> > I think the most important thing about such a buffer object is that
>> > it can distinguish the underlying mapping types.  While
>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>> > stuff them all into a structure like struct scatterlist that has
>> > no indication what kind of mapping you are dealing with is just
>> > asking for trouble.
>>
>> Well the idea was to have 1 interface to allow all drivers to share
>> buffers with anything else, no matter how exactly they're allocated.
>
> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
> a scatterlist through the sg_table in dma_buf_map_attachment /
> ->map_dma_buf, but looking at a few of the callers it seems like they
> really do not even want a scatterlist to start with, but check that
> it contains a physically contiguous range first.  So kicking the
> scatterlist out of there will probably improve the interface in general.

I think by number most drm drivers require contiguous memory (or an
iommu that makes it look contiguous). But there's plenty others who
have another set of pagetables on the gpu itself and can
scatter-gather. Usually it's the former for display/video blocks, and
the latter for rendering.

>> dma-buf has all the functions for flushing, so you can have coherent
>> mappings, non-coherent mappings and pretty much anything else. Or well
>> could, because in practice people hack up layering violations until it
>> works for the 2-3 drivers they care about. On top of that there's the
>> small issue that x86 insists that dma is coherent (and that's true for
>> most devices, including v4l drivers you might want to share stuff
>> with), and gpus really, really really do want to make almost
>> everything incoherent.
>
> How do discrete GPUs manage to be incoherent when attached over PCIe?

It has a non-coherent transaction mode (which the chipset can opt to
not implement and still flush), to make sure the AGP horror show
doesn't happen again and GPU folks are happy with PCIe. That's at
least my understanding from digging around in amd the last time we had
coherency issues between intel and amd gpus. GPUs have some bits
somewhere (in the pagetables, or in the buffer object description
table created by userspace) to control that stuff.

For anything on the SoC it's presented as pci device, but that's
extremely fake, and we can definitely do non-snooped transactions on
drm/i915. Again, controlled by a mix of pagetables and
userspace-provided buffer object description tables.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Daniel Vetter
On Tue, Apr 24, 2018 at 8:48 PM, Christoph Hellwig <h...@infradead.org> wrote:
> On Fri, Apr 20, 2018 at 05:21:11PM +0200, Daniel Vetter wrote:
>> > At the very lowest level they will need to be handled differently for
>> > many architectures, the question is at what point we'll do the
>> > branching out.
>>
>> Having at least struct page also in that list with (dma_addr_t, length)
>> pairs has a bunch of benefits for drivers in unifying buffer handling
>> code. You just pass that one single list around, use the dma_addr_t side
>> for gpu access (generally bashing it into gpu ptes). And the struct page
>> (if present) for cpu access, using kmap or vm_insert_*. We generally
>> ignore virt, if we do need a full mapping then we construct a vmap for
>> that buffer of our own.
>
> Well, for mapping a resource (which gets back to the start of the
> discussion) you will need an explicit virt pointer.  You also need
> an explicit virt pointer and not just page_address/kmap for users of
> dma_get_sgtable, because for many architectures you will need to flush
> the virtual address used to access the data, which might be a
> vmap/ioremap style mapping returned from dma_alloc_address, and not
> the directly mapped kernel address.

Out of curiosity, how much virtual flushing stuff is there still out
there? At least in drm we've pretty much ignored this, and seem to be
getting away without a huge uproar (at least from driver developers
and users, core folks are less amused about that).

And at least for gpus that seems to have been the case since forever,
or at least since AGP was a thing 20 years ago: AGP isn't coherent, so
needs explicit cache flushing, and we have our own implementations of
that in drivers/char/agp. Luckily AGP died 10 years ago, so no one has yet
proposed to port it all over to the iommu framework and hide it behind
the dma api (which really would be the "clean" way to do this, AGP is
simply an IOMMU + special socket dedicated for the add-on gpu).

> Here is another idea at the low-level dma API level:
>
>  - dma_get_sgtable goes away.  The replacement is a new
>dma_alloc_remap helper that takes the virtual address returned
>from dma_alloc_attrs/coherent and creates a dma_addr_t for the
>given new device.  If the original allocation was a coherent
>one no cache flushing is required either (because the arch
>made sure it is coherent), if the original allocation used
>DMA_ATTR_NON_CONSISTENT the new allocation will need
>dma_cache_sync calls as well.

Yeah I think that should work. dma_get_sgtable is a pretty nasty
layering violation.

>  - you never even try to share a mapping returned from
>dma_map_resource - instead each device using it creates a new
>mapping, which makes sense as no virtual addresses are involved
>at all.

Yeah the dma-buf exporter always knows what kind of backing storage it
is dealing with, and for which struct device it should set up a new
view. Hence can make sure that it calls the right functions to
establish a new mapping, whether that's dma_map_sg, dma_map_resource
or the new dma_alloc_remap (instead of the dma_get_sgtable layering
mixup). The importer doesn't know.

>> So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
>> with struct page * being optional (if it's a resource, or something else
>> that the kernel core mm isn't aware of). But that only has benefits if we
>> really roll it out everywhere, in all the subsystems and drivers, since if
>> we don't we've made the struct pages ** <-> sgt conversion fun only worse
>> by adding a 3rd representation of gpu buffer object backing storage.
>
> I think the most important thing about such a buffer object is that
> it can distinguish the underlying mapping types.  While
> dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
> dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
> back a dma_addr_t they are in no way interchangeable.  And trying to
> stuff them all into a structure like struct scatterlist that has
> no indication what kind of mapping you are dealing with is just
> asking for trouble.

Well the idea was to have 1 interface to allow all drivers to share
buffers with anything else, no matter how exactly they're allocated.
dma-buf has all the functions for flushing, so you can have coherent
mappings, non-coherent mappings and pretty much anything else. Or well
could, because in practice people hack up layering violations until it
works for the 2-3 drivers they care about. On top of that there's the
small issue that x86 insists that dma is coherent (and that's true for
most devices, including v4l drivers you might want to share stuff
with), and gpus really, really really do want to make almost
everything incoherent.

The end result is pretty epic :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Daniel Vetter
On Fri, Apr 20, 2018 at 05:46:25AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 20, 2018 at 12:44:01PM +0200, Christian König wrote:
> > > > What we need is an sg_alloc_table_from_resources(dev, resources,
> > > > num_resources) which does the handling common to all drivers.
> > > A structure that contains
> > > 
> > > {page,offset,len} + {dma_addr+dma_len}
> > > 
> > > is not a good container for storing
> > > 
> > > {virt addr, dma_addr, len}
> > > 
> > > no matter what interface you build arond it.
> > 
> > Why not? I mean at least for my use case we actually don't need the virtual
> > address.
> 
> If you don't need the virtual address you need scatterlist even less.
> 
> > What we need is {dma_addr+dma_len} in a consistent interface which can come
> > from both {page,offset,len} as well as {resource, len}.
> 
> Ok.
> 
> > What I actually don't need is separate handling for system memory and
> > resources, but that would we get exactly when we don't use sg_table.
> 
> At the very lowest level they will need to be handled differently for
> many architectures, the question is at what point we'll do the
> branching out.

Having at least struct page also in that list with (dma_addr_t, length)
pairs has a bunch of benefits for drivers in unifying buffer handling
code. You just pass that one single list around, use the dma_addr_t side
for gpu access (generally bashing it into gpu ptes). And the struct page
(if present) for cpu access, using kmap or vm_insert_*. We generally
ignore virt, if we do need a full mapping then we construct a vmap for
that buffer of our own.

If (and that would be serious amounts of work all over the tree, with lots
of drivers) we come up with a new struct for gpu buffers, then I'd also
add "force page alignement for everything" to the requirements list.
That's another mismatch we have, since gpu buffer objects (and dma-buf)
are always full pages. That mismatch motivated the addition of the
page-oriented sg iterators.

So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
with struct page * being optional (if it's a resource, or something else
that the kernel core mm isn't aware of). But that only has benefits if we
really roll it out everywhere, in all the subsystems and drivers, since if
we don't we've made the struct pages ** <-> sgt conversion fun only worse
by adding a 3rd representation of gpu buffer object backing storage.
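
A sketch of what such a chunk might look like (purely hypothetical,
nothing like this exists in the tree):

struct gpu_buf_chunk {
	struct page *page;		/* NULL for resources the core mm
					 * doesn't know about */
	dma_addr_t dma_addr;		/* device view, for gpu ptes */
	unsigned long num_pages;	/* always full pages, per the above */
};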
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Daniel Vetter
On Thu, Apr 19, 2018 at 01:16:57AM -0700, Christoph Hellwig wrote:
> On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:
> > We broke that assumption in i915 years ago. Gpu memory that is not
> > struct page backed is very real.
> > 
> > Of course we'll never feed such a strange sg table to a driver which
> > doesn't understand it, but allowing sg_page == NULL works perfectly
> > fine. At least for gpu drivers.
> 
> For GPU drivers on x86 with no dma coherency problems, sure.  But not
> all the world is x86.  We already have problems due to dmabuf's use
> of the awkward get_sgtable interface (see the comment on
> arm_dma_get_sgtable that I fully agree with), and doing this for memory
> that doesn't have a struct page at all will make things even worse.

x86 dma isn't coherent either, if you're a GPU :-) Flushing gpu caches
tends to be too expensive, so there's pci-e support and chipset support to
forgo it. Plus drivers flushing caches themselves.

The dma_get_sgtable thing is indeed fun, right solution would probably be
to push the dma-buf export down into the dma layer. The comment for
arm_dma_get_sgtable is also not a real concern, because dma-buf also
abstracts away the flushing (or well is supposed to), so there really
shouldn't be anyone calling the streaming apis on the returned sg table.
That's why dma-buf gives you an sg table that's mapped already.

> > If that's not acceptable then I guess we could go over the entire tree
> > and frob all the gpu related code to switch over to a new struct
> > sg_table_might_not_be_struct_page_backed, including all the other
> > functions we added over the past few years to iterate over sg tables.
> > But seems slightly silly, given that sg tables seem to do exactly what
> > we need.
> 
> It isn't silly.  We will have to do some surgery like that anyway
> because the current APIs don't work.  So relax, sit back and come up
> with an API that solves the existing issues and serves us well in
> the future.

So we should just implement a copy of sg table for dma-buf, since I still
think it does exactly what we need for gpus?

Yes there's a bit of a layering violation insofar as drivers really
shouldn't each have their own copy of "how do I convert a piece of dma
memory into a dma-buf", but that doesn't render the interface a bad idea.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-16 Thread Daniel Vetter
On Mon, Apr 16, 2018 at 2:39 PM, Christoph Hellwig <h...@infradead.org> wrote:
> On Tue, Apr 03, 2018 at 08:08:32PM +0200, Daniel Vetter wrote:
>> I did not mean you should dma_map_sg/page. I just meant that using
>> dma_map_resource to fill only the dma address part of the sg table seems
>> perfectly sufficient.
>
> But that is not how the interface works, especially facing sg_dma_len.
>
>> Assuming you get an sg table that's been mapped by calling dma_map_sg was
>> always a bit of a case of bending the abstraction to avoid typing code. The
>> only thing an importer ever should have done is look at the dma addresses
>> in that sg table, nothing else.
>
> The scatterlist is not a very good abstraction unfortunately, but it
> it is spread all over the kernel.  And we do expect that anyone who
> gets passed a scatterlist can use sg_page() or sg_virt() (which calls
> sg_page()) on it.  Your changes would break that, and will cause major
> trouble because of that.
>
> If you want to expose p2p memory returned from dma_map_resource in
> dmabuf do not use scatterlists for this please, but with a new interface
> that explicitly passes a virtual address, a dma address and a length
> and make it very clear that virt_to_page will not work on the virtual
> address.

We broke that assumption in i915 years ago. Gpu memory that is not
struct page backed is very real.

Of course we'll never feed such a strange sg table to a driver which
doesn't understand it, but allowing sg_page == NULL works perfectly
fine. At least for gpu drivers.

If that's not acceptable then I guess we could go over the entire tree
and frob all the gpu related code to switch over to a new struct
sg_table_might_not_be_struct_page_backed, including all the other
functions we added over the past few years to iterate over sg tables.
But seems slightly silly, given that sg tables seem to do exactly what
we need.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
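
What "sg_page == NULL works perfectly fine" means in practice, sketched
(addr would come from something like dma_map_resource(); the importer
only ever reads the dma side):

	sg_init_table(sgt->sgl, 1);
	sg_set_page(sgt->sgl, NULL, size, 0);	/* no struct page backing */
	sg_dma_address(sgt->sgl) = addr;
	sg_dma_len(sgt->sgl) = size;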


Re: [RfC PATCH] Add udmabuf misc device

2018-04-16 Thread Daniel Vetter
Ok, confusion around backend is I think cleared up. The other
confusion seems to be around dma-buf:

dma-buf is the cross subsystem zerocopy abstraction. PRIME is the
drm-specific support for it, 100% based on top of the generic struct
dma_buf.

You need a dma_buf exporter to convert a xen grant references list
into a dma_buf, which you can then import in your drm driver (using
prime), v4l, or anything else that supports dma-buf. You do _not_ need
a prime implementation, that's only the marketing name we've given to
dma-buf import/export for drm drivers.
-Daniel
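
For reference, the exporter side boils down to filling a
dma_buf_export_info and calling dma_buf_export(). A trimmed sketch
(my_buffer and the ops table are placeholders):

struct dma_buf *my_export(struct my_buffer *buf,
			  const struct dma_buf_ops *ops, size_t size)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

	exp_info.ops = ops;	/* map_dma_buf/unmap_dma_buf etc. */
	exp_info.size = size;
	exp_info.flags = O_RDWR;
	exp_info.priv = buf;

	return dma_buf_export(&exp_info);
}

Importers then call dma_buf_attach() and dma_buf_map_attachment() to get
at the backing storage, which is what PRIME wraps for drm drivers.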


On Mon, Apr 16, 2018 at 12:14 PM, Oleksandr Andrushchenko
<andr2...@gmail.com> wrote:
> On 04/16/2018 12:32 PM, Daniel Vetter wrote:
>>
>> On Mon, Apr 16, 2018 at 10:22 AM, Oleksandr Andrushchenko
>> <andr2...@gmail.com> wrote:
>>>
>>> On 04/16/2018 10:43 AM, Daniel Vetter wrote:
>>>>
>>>> On Mon, Apr 16, 2018 at 10:16:31AM +0300, Oleksandr Andrushchenko wrote:
>>>>>
>>>>> On 04/13/2018 06:37 PM, Daniel Vetter wrote:
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 08:59:32AM +0300, Oleksandr Andrushchenko
>>>>>> wrote:
>>>>>>>
>>>>>>> On 04/10/2018 08:26 PM, Dongwon Kim wrote:
>>>>>>>>
>>>>>>>> On Tue, Apr 10, 2018 at 09:37:53AM +0300, Oleksandr Andrushchenko
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 04/06/2018 09:57 PM, Dongwon Kim wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 06, 2018 at 03:36:03PM +0300, Oleksandr Andrushchenko
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 04/06/2018 02:57 PM, Gerd Hoffmann wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>   Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>>> I fail to see any common ground for xen-zcopy and udmabuf ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does the above mean you can assume that xen-zcopy and udmabuf
>>>>>>>>>>>>> can co-exist as two different solutions?
>>>>>>>>>>>>
>>>>>>>>>>>> Well, udmabuf route isn't fully clear yet, but yes.
>>>>>>>>>>>>
>>>>>>>>>>>> See also gvt (intel vgpu), where the hypervisor interface is
>>>>>>>>>>>> abstracted away into a separate kernel module even though most
>>>>>>>>>>>> of the actual vgpu emulation code is common.
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your input, I'm just trying to figure out
>>>>>>>>>>> which of the three z-copy solutions intersect and how much
>>>>>>>>>>>>>
>>>>>>>>>>>>> And what about hyper-dmabuf?
>>>>>>>>>>
>>>>>>>>>> xen z-copy solution is pretty similar fundamentally to
>>>>>>>>>> hyper_dmabuf
>>>>>>>>>> in terms of these core sharing features:
>>>>>>>>>>
>>>>>>>>>> 1. the sharing process - import prime/dmabuf from the producer ->
>>>>>>>>>> extract
>>>>>>>>>> underlying pages and get those shared -> return references for
>>>>>>>>>> shared pages
>>>>>>>>
>>>>>>>> Another thing is danvet was kind of against the idea of importing an
>>>>>>>> existing dmabuf/prime buffer and forwarding it to the other domain
>>>>>>>> due to synchronization issues. He proposed to make hyper_dmabuf only
>>>>>>>> work as an exporter so that it can have full control over the buffer.
>>>>>>>> I think we need to talk about this further as well.
>>>>>>>
>>>>>>> Yes, I saw this. But this limits the use-cases so much.
>>

Re: [RfC PATCH] Add udmabuf misc device

2018-04-16 Thread Daniel Vetter
On Mon, Apr 16, 2018 at 10:22 AM, Oleksandr Andrushchenko
<andr2...@gmail.com> wrote:
> On 04/16/2018 10:43 AM, Daniel Vetter wrote:
>>
>> On Mon, Apr 16, 2018 at 10:16:31AM +0300, Oleksandr Andrushchenko wrote:
>>>
>>> On 04/13/2018 06:37 PM, Daniel Vetter wrote:
>>>>
>>>> On Wed, Apr 11, 2018 at 08:59:32AM +0300, Oleksandr Andrushchenko wrote:
>>>>>
>>>>> On 04/10/2018 08:26 PM, Dongwon Kim wrote:
>>>>>>
>>>>>> On Tue, Apr 10, 2018 at 09:37:53AM +0300, Oleksandr Andrushchenko
>>>>>> wrote:
>>>>>>>
>>>>>>> On 04/06/2018 09:57 PM, Dongwon Kim wrote:
>>>>>>>>
>>>>>>>> On Fri, Apr 06, 2018 at 03:36:03PM +0300, Oleksandr Andrushchenko
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 04/06/2018 02:57 PM, Gerd Hoffmann wrote:
>>>>>>>>>>
>>>>>>>>>>  Hi,
>>>>>>>>>>
>>>>>>>>>>>> I fail to see any common ground for xen-zcopy and udmabuf ...
>>>>>>>>>>>
>>>>>>>>>>> Does the above mean you can assume that xen-zcopy and udmabuf
>>>>>>>>>>> can co-exist as two different solutions?
>>>>>>>>>>
>>>>>>>>>> Well, udmabuf route isn't fully clear yet, but yes.
>>>>>>>>>>
>>>>>>>>>> See also gvt (intel vgpu), where the hypervisor interface is
>>>>>>>>>> abstracted away into a separate kernel module even though most of
>>>>>>>>>> the actual vgpu emulation code is common.
>>>>>>>>>
>>>>>>>>> Thank you for your input, I'm just trying to figure out
>>>>>>>>> which of the three z-copy solutions intersect and how much
>>>>>>>>>>>
>>>>>>>>>>> And what about hyper-dmabuf?
>>>>>>>>
>>>>>>>> xen z-copy solution is pretty similar fundamentally to hyper_dmabuf
>>>>>>>> in terms of these core sharing features:
>>>>>>>>
>>>>>>>> 1. the sharing process - import prime/dmabuf from the producer ->
>>>>>>>> extract
>>>>>>>> underlying pages and get those shared -> return references for
>>>>>>>> shared pages
>>>>>>
>>>>>> Another thing is danvet was kind of against the idea of importing an
>>>>>> existing dmabuf/prime buffer and forwarding it to the other domain due
>>>>>> to synchronization issues. He proposed to make hyper_dmabuf only work
>>>>>> as an exporter so that it can have full control over the buffer. I
>>>>>> think we need to talk about this further as well.
>>>>>
>>>>> Yes, I saw this. But this limits the use-cases so much.
>>>>> For instance, running Android as a Guest (which uses ION to allocate
>>>>> buffers) means that finally HW composer will import dma-buf into
>>>>> the DRM driver. Then, in case of xen-front for example, it needs to be
>>>>> shared with the backend (Host side). Of course, we can change
>>>>> user-space to make xen-front allocate the buffers (make it the exporter),
>>>>> but what we try to avoid is changing user-space which in a normal world
>>>>> would have remained unchanged otherwise.
>>>>> So, I do think we have to support this use-case and just have to
>>>>> understand the complexity.
>>>>
>>>> Erm, why do you need importer capability for this use-case?
>>>>
>>>> guest1 -> ION -> xen-front -> hypervisor -> guest 2 -> xen-zcopy exposes
>>>> that dma-buf -> import to the real display hw
>>>>
>>>> Nowhere in this chain do you need xen-zcopy to be able to import a
>>>> dma-buf (within linux, it needs to import a bunch of pages from the
>>>> hypervisor).
>>>>
>>>> Now if your plan is to use xen-zco

Re: [RfC PATCH] Add udmabuf misc device

2018-04-16 Thread Daniel Vetter
On Mon, Apr 16, 2018 at 10:16:31AM +0300, Oleksandr Andrushchenko wrote:
> On 04/13/2018 06:37 PM, Daniel Vetter wrote:
> > On Wed, Apr 11, 2018 at 08:59:32AM +0300, Oleksandr Andrushchenko wrote:
> > > On 04/10/2018 08:26 PM, Dongwon Kim wrote:
> > > > On Tue, Apr 10, 2018 at 09:37:53AM +0300, Oleksandr Andrushchenko wrote:
> > > > > On 04/06/2018 09:57 PM, Dongwon Kim wrote:
> > > > > > On Fri, Apr 06, 2018 at 03:36:03PM +0300, Oleksandr Andrushchenko 
> > > > > > wrote:
> > > > > > > On 04/06/2018 02:57 PM, Gerd Hoffmann wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > > > I fail to see any common ground for xen-zcopy and udmabuf 
> > > > > > > > > > ...
> > > > > > > > > Does the above mean you can assume that xen-zcopy and udmabuf
> > > > > > > > > can co-exist as two different solutions?
> > > > > > > > Well, udmabuf route isn't fully clear yet, but yes.
> > > > > > > > 
> > > > > > > > See also gvt (intel vgpu), where the hypervisor interface is
> > > > > > > > abstracted away into a separate kernel module even though most
> > > > > > > > of the actual vgpu emulation code is common.
> > > > > > > Thank you for your input, I'm just trying to figure out
> > > > > > > which of the three z-copy solutions intersect and how much
> > > > > > > > > And what about hyper-dmabuf?
> > > > > > xen z-copy solution is pretty similar fundamentally to hyper_dmabuf
> > > > > > in terms of these core sharing features:
> > > > > > 
> > > > > > 1. the sharing process - import prime/dmabuf from the producer -> 
> > > > > > extract
> > > > > > underlying pages and get those shared -> return references for 
> > > > > > shared pages
> > > > Another thing is danvet was kind of against the idea of importing an
> > > > existing dmabuf/prime buffer and forwarding it to the other domain due
> > > > to synchronization issues. He proposed to make hyper_dmabuf only work
> > > > as an exporter so that it can have full control over the buffer. I
> > > > think we need to talk about this further as well.
> > > Yes, I saw this. But this limits the use-cases so much.
> > > For instance, running Android as a Guest (which uses ION to allocate
> > > buffers) means that finally HW composer will import dma-buf into
> > > the DRM driver. Then, in case of xen-front for example, it needs to be
> > > shared with the backend (Host side). Of course, we can change user-space
> > > to make xen-front allocate the buffers (make it the exporter), but what we
> > > try to avoid is changing user-space which in a normal world would have
> > > remained unchanged otherwise.
> > > So, I do think we have to support this use-case and just have to
> > > understand the complexity.
> > Erm, why do you need importer capability for this use-case?
> > 
> > guest1 -> ION -> xen-front -> hypervisor -> guest 2 -> xen-zcopy exposes
> > that dma-buf -> import to the real display hw
> > 
> > Nowhere in this chain do you need xen-zcopy to be able to import a
> > dma-buf (within linux, it needs to import a bunch of pages from the
> > hypervisor).
> > 
> > Now if your plan is to use xen-zcopy in the guest1 instead of xen-front,
> > then you indeed need to import.
> This is the exact use-case I was referring to while saying
> we need to import on Guest1 side. If hyper-dmabuf is so
> generic that there is no xen-front in the picture, then
> it needs to import a dma-buf, so it can be exported at Guest2 side.
> >   But that imo doesn't make sense:
> > - xen-front gives you clearly defined flip events you can forward to the
> >hypervisor. xen-zcopy would need to add that again.
> xen-zcopy is a helper driver which doesn't handle page flips
> and is not a KMS driver as one might think: the DRM UAPI it uses is
> just there to export a dma-buf as a PRIME buffer, but that's it.
> Flipping etc. is done by the backend [1], not xen-zcopy.
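
(For anyone following along: that PRIME export is one ioctl from userspace,
e.g. through libdrm; drm_fd/gem_handle below are placeholders.)

	#include <stdint.h>
	#include <xf86drm.h>

	/* userspace: turn a gem handle into a dma-buf fd */
	int export_prime_fd(int drm_fd, uint32_t gem_handle)
	{
		int prime_fd;
		int ret = drmPrimeHandleToFD(drm_fd, gem_handle,
					     DRM_CLOEXEC | DRM_RDWR,
					     &prime_fd);

		return ret ? ret : prime_fd;
	}
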
> >   Same for
> >hyperdmabuf (and really we're not going to shuffle struct dma_fence over
> >

Re: [RfC PATCH] Add udmabuf misc device

2018-04-13 Thread Daniel Vetter
 API that hyper_dmabuf
> > > > provides? I don't think we need different IOCTLs that do the same in 
> > > > the final
> > > > solution.
> > > > 
> > > If you think of xen-zcopy as a library (which implements Xen
> > > grant references mangling) and DRM PRIME wrapper on top of that
> > > library, we can probably define proper API for that library,
> > > so both xen-zcopy and hyper-dmabuf can use it. What is more, I am
> > > about to start upstreaming Xen para-virtualized sound device driver soon,
> > > which also uses similar code and gref passing mechanism [3].
> > > (Actually, I was about to upstream drm/xen-front, drm/xen-zcopy and
> > > snd/xen-front and then propose a Xen helper library for sharing big 
> > > buffers,
> > > so common code of the above drivers can use the same code w/o code
> > > duplication)
> > I think it is possible to use your functions for the memory sharing part in
> > hyper_dmabuf's backend (this 'backend' means the layer that does page sharing
> > and inter-vm communication in a xen-specific way), so why don't we work on
> > "Xen helper library for sharing big buffers" first while we continue our
> > discussion on the common API layer that can cover any dmabuf sharing cases.
> > 
> Well, I would love for us to reuse the code that I have, but I also
> understand that it was limited by my use-cases. So, I do not
> insist we have to ;)
> If we start designing and discussing hyper-dmabuf protocol we of course
> can work on this helper library in parallel.

Imo code reuse is overrated. Adding new uapi is what freaks me out here
:-)

If we end up with duplicated implementations, even in upstream, meh, not
great, but also ok. New uapi, and in a similar way, new hypervisor api
like the dma-buf forwarding that hyperdmabuf does is the kind of thing
that will lock us in for 10+ years (if we make a mistake).

> > > Thank you,
> > > Oleksandr
> > > 
> > > P.S. All, is it a good idea to move this out of udmabuf thread into a
> > > dedicated one?
> > Either way is fine with me.
> So, if you can start designing the protocol we may have a dedicated mail
> thread for that. I will try to help with the protocol as much as I can

Please don't start with the protocol. Instead start with the concrete
use-cases, and then figure out why exactly you need new uapi. Once we have
that answered, we can start thinking about fleshing out the details.

Cheers, Daniel

> 
> > > > > > cheers,
> > > > > >Gerd
> > > > > > 
> > > > > Thank you,
> > > > > Oleksandr
> > > > > 
> > > > > P.S. Sorry for making your original mail thread to discuss things much
> > > > > broader than your RFC...
> > > > > 
> > > [1] https://github.com/xen-troops/displ_be
> > > [2] 
> > > https://elixir.bootlin.com/linux/v4.16-rc7/source/include/xen/interface/io/displif.h#L484
> > > [3] 
> > > https://elixir.bootlin.com/linux/v4.16-rc7/source/include/xen/interface/io/sndif.h
> > > 
> [1] 
> https://elixir.bootlin.com/linux/v4.16-rc7/source/include/xen/interface/io/displif.h
> [2]
> https://lists.xenproject.org/archives/html/xen-devel/2018-04/msg00685.html

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2] Add udmabuf misc device

2018-04-09 Thread Daniel Vetter
On Fri, Apr 06, 2018 at 02:24:46PM +0200, Christian König wrote:
> On 06.04.2018 at 11:33, Gerd Hoffmann wrote:
> >Hi,
> > 
> > > The pages backing a DMA-buf are not allowed to move (at least not without a
> > > patch set I'm currently working on), but for certain MM operations to work
> > > correctly you must be able to modify the page table entries and move the
> > > pages backing them around.
> > > 
> > > For example try to use fork() with some copy on write pages with this
> > > approach. You will find that you have only two options to correctly handle
> > > this.
> > The fork() issue should go away with shared memory pages (no cow).
> > I guess this is the reason why vgem is internally backed by shmem.
> 
> Yes, exactly that is also an approach which should work fine. Just don't try
> to get this working with get_user_pages().
> 
> > 
> > Hmm.  So I could try to limit the udmabuf driver to shmem too (i.e.
> > have the ioctl take a shmem filehandle and offset instead of a virtual
> > address).
> > 
> > But maybe it is better then to just extend vgem, i.e. add support to
> > create gem objects from existing shmem.
> > 
> > Comments?
> 
> Yes, extending vgem instead of creating something new sounds like a good
> idea to me as well.

+1 on adding a vgem "import from shmem/memfd" ioctl. Sounds like a good
idea, and generally useful.
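
For context, gem objects already get shmem backing via drm_gem_object_init()
(which does shmem_file_setup() internally); an import ioctl would essentially
substitute the caller's memfd for that file. Rough sketch of the usual
allocation path, with a made-up my_gem_create():

	static int my_gem_create(struct drm_device *dev, size_t size,
				 struct drm_gem_object **out)
	{
		struct drm_gem_object *obj;
		int ret;

		obj = kzalloc(sizeof(*obj), GFP_KERNEL);
		if (!obj)
			return -ENOMEM;

		/* sets up the backing shmem file for us */
		ret = drm_gem_object_init(dev, obj, PAGE_ALIGN(size));
		if (ret) {
			kfree(obj);
			return ret;
		}
		*out = obj;
		return 0;
	}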

We might want to limit to memfd though for semantic reasons: dma-bufs have
an invariant size, shmem not so much. memfds can be locked down to not change
their size anymore. And iirc the core mm page invalidation protocol around
truncate() is about as bad as get_user_pages vs cow :-)
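
The userspace side of that would look something like this (sketch;
memfd_create() needs a recent glibc, otherwise call the raw syscall):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	int create_sealed_memfd(size_t size)
	{
		int fd = memfd_create("dmabuf-backing", MFD_ALLOW_SEALING);

		if (fd < 0 || ftruncate(fd, size) < 0)
			return -1;
		/* lock the size down, matching dma-buf's invariant-size rule */
		if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) < 0)
			return -1;
		return fd;
	}
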
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RfC PATCH] Add udmabuf misc device

2018-04-09 Thread Daniel Vetter
On Thu, Apr 05, 2018 at 05:11:17PM -0700, Matt Roper wrote:
> On Thu, Apr 05, 2018 at 10:32:04PM +0200, Daniel Vetter wrote:
> > Pulling this out of the shadows again.
> > 
> > We now also have xen-zcopy from Oleksandr and the hyper dmabuf stuff
> > from Matt and Dongwon.
> > 
> > At least from the intel side there seems to be the idea to just have 1
> > special device that can handle cross-guest/host sharing for all kinds
> > of hypervisors, so I guess you all need to work together :-)
> > 
> > Or we throw out the idea that hyper dmabuf will be cross-hypervisor
> > (not sure how useful/reasonable that is, someone please convince me
> > one way or the other).
> > 
> > Cheers, Daniel
> 
> Dongwon (DW) is the one doing all the real work on hyper_dmabuf, but I'm
> familiar with the use cases he's trying to address, and I think there
> are a couple high-level goals of his work that are worth calling out as
> we discuss the various options for sharing buffers produced in one VM
> with a consumer running in another VM:
> 
>  * We should try to keep the interface/usage separate from the
>underlying hypervisor implementation details.  I.e., in DW's design
>the sink/source drivers that handle the actual buffer passing in the
>two VM's should provide a generic interface that does not depend on a
>specific hypervisor.  Behind the scenes there could be various
>implementations for specific hypervisors (Xen, KVM, ACRN, etc.), and
>some of those backends may have additional restrictions, but it would
>be best if userspace didn't have to know the specific hypervisor
>running on the system and could just query the general capabilities
>available to it.  We've already got projects in flight that are
>wanting this functionality on Xen and ACRN today.

Two comments on this:

- Just because it's in drivers/gpu doesn't mean you can't use it for
  anything else. E.g. the xen-zcopy driver can very much be used for any
  dma-buf, there's nothing gpu specific with it - well besides that it
  reuses some useful DRM ioctls, but if that annoys you just do a #define
  TOTALLY_GENERIC DRM and be done :-)

- Especially the kvm memory and hypervisor model seems totally different
  from other hypervisors, e.g. no real use for guest-guest sharing (which
  doesn't go through the host) and other cases. So trying to make
  something 100% generic seems like a bad idea.

  Wrt making it generic: Just use generic interfaces - if you can somehow
  use xen-front for the display sharing, then a) no need for hyper-dmabuf
  and b) already fully generic since it looks like a normal drm device to
  the guest userspace.

>  * The general interface should be able to express sharing from any
>guest:guest, not just guest:host.  Arbitrary G:G sharing might be
>something some hypervisors simply aren't able to support, but the
>userspace API itself shouldn't make assumptions or restrict that.  I
>think ideally the sharing API would include some kind of
>query_targets interface that would return a list of VM's that your
>current OS is allowed to share with; that list would be depend on the
>policy established by the system integrator, but obviously wouldn't
>include targets that the hypervisor itself wouldn't be capable of
>handling.

Uh ... has a proper security architect analyzed this idea?

>  * A lot of the initial use cases are in the realm of graphics, but this
>shouldn't be a graphics-specific API.  Buffers might contain other
>types of content as well (e.g., audio).  Really the content producer
>could potentially be any driver (or userspace) running in the VM that
>knows how to import/export dma_buf's (or maybe just import given
>danvet's suggestion that we should make the sink driver do all the
>actual memory allocation for any buffers that may be shared).

See above, just because it uses drm ioctls doesn't make it gfx specific.

Otoh making it even more graphics specific might be even better, i.e. just
sharing the backend tech (grant tables or whatever), but having dedicated
front-ends for each use-case so there's less code to type.

>  * We need to be able to handle cross-VM coordination of buffer usage as
>well, so I think we'd want to include fence forwarding support in the
>API as well to signal back and forth about production/consumption
>completion.  And of course document really well what should happen
>if, for example, the entire VM you're sharing with/from dies.

Implicit fencing has been proven to be a bad idea. Please do explicit
passing of dma_fences (plus assorted protocol).
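
Concretely that means sync_file: wrap the dma_fence in a fd on the producer
side and resolve it back on the consumer side. A sketch (made-up function
name, error paths trimmed):

	#include <linux/file.h>
	#include <linux/sync_file.h>

	/* producer: hand a dma_fence to the other side as a fd */
	static int fence_to_fd(struct dma_fence *fence)
	{
		struct sync_file *sync_file;
		int fd = get_unused_fd_flags(O_CLOEXEC);

		if (fd < 0)
			return fd;
		sync_file = sync_file_create(fence);
		if (!sync_file) {
			put_unused_fd(fd);
			return -ENOMEM;
		}
		fd_install(fd, sync_file->file);
		return fd;
	}

	/* consumer: sync_file_get_fence(fd) hands the dma_fence back */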

>  * The sharing API could be used to share multiple kinds of content in a
>single system.  The sharing sink driver running in the con

Re: [RfC PATCH] Add udmabuf misc device

2018-04-09 Thread Daniel Vetter
On Fri, Apr 06, 2018 at 12:54:22PM +0200, Gerd Hoffmann wrote:
> On Fri, Apr 06, 2018 at 10:52:21AM +0100, Daniel Stone wrote:
> > Hi Gerd,
> > 
> > On 14 March 2018 at 08:03, Gerd Hoffmann <kra...@redhat.com> wrote:
> > >> Either mlock accounting (because it's mlocked de facto), and get_user_pages
> > >> won't do that for you.
> > >>
> > >> Or you write the full-blown userptr implementation, including 
> > >> mmu_notifier
> > >> support (see i915 or amdgpu), but that also requires Christian Königs
> > >> latest ->invalidate_mapping RFC for dma-buf (since atm exporting userptr
> > >> buffers is a no-go).
> > >
> > > I guess I'll look at mlock accounting for starters then.  Easier for
> > > now, and leaves the door open to switch to userptr later as this should
> > > be transparent to userspace.
> > 
> > Out of interest, do you have usecases for full userptr support? Maybe
> > another way would be to allow creation of dmabufs from memfds.
> 
> I have two things in mind.
> 
> One is vga emulation.  I have a virtual pci memory bar for the virtual
> vga.  qemu backs vga memory with anonymous pages right now, switching
> that to shmem should be easy though if that makes things easier.  The guest
> places the framebuffer somewhere in the pci bar, and I want to export the
> chunk which represents the framebuffer as a dma-buf to display it on the
> host without copying around data.  The framebuffer is linear in guest
> physical memory, so a single block only.  That is the simpler case.
> 
> The more difficult one is virtio-gpu resources.  virtio-gpu resources
> live in host memory (guest has no direct access).  The guest can
> optionally specify guest memory pages as backing storage for the
> resource.  Guest backing storage is allowed to be scattered.  Commands
> exist to copy both ways between host storage and guest backing.
> 
> With virgl (opengl acceleration) enabled the guest will send rendering
> commands to fill the framebuffer resource, so there is no need to copy
> content to the framebuffer resource.  The guest may fill other
> resources such as textures used for rendering with copy commands.
> 
> Without acceleration the guest does software-rendering to the backing
> storage, then sends a command to copy the framebuffer content from guest
> backing storage to the host resource.
> 
> Now it would be useful to allow a shared mapping, so no copying between
> guest backing storage and host resource is needed, especially for the
> software rendering case (i.e. dumb gem buffers).  Being able to export
> guest dumb buffers to other host processes would be useful too, for
> example to display guest windows seamlessly on the host wayland server.
> 
> So getting a dma-buf for the guest backing storage via udmabuf looked
> like a useful approach.  We can export the guest gem buffers to other
> host processes that way.  qemu itself could map it too, to get a linear
> representation of the scattered guest backing storage.
> 
> The other obvious approach would be to do it the other way around and
> allow the guest to map the host resource somehow.  On the host side qemu
> could use vgem to allocate resource memory, so it'll be a gem object
> already.  Mapping that into the guest isn't that straight-forward
> though.  The guest manages its physical address space, so the guest
> would need to find a free spot and ask the host to place the resource
> there.  Then the guest needs page structs covering the mapped resource,
> so it can work with it.  Didn't investigate how difficult that is.  Use
> memory hotplug maybe?  Can we easily unmap the resource then?  Also I
> think updating the guest's physical memory layout (which we would need to
> do on every resource map/unmap) isn't an exactly cheap operation ...

Generally we try to cache mappings as much as possible. And wrt finding a
slot: Create a sufficiently sized BAR on the virgl device, just for that?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RfC PATCH] Add udmabuf misc device

2018-04-05 Thread Daniel Vetter
Pulling this out of the shadows again.

We now also have xen-zcopy from Oleksandr and the hyper dmabuf stuff
from Matt and Dongwon.

At least from the intel side there seems to be the idea to just have 1
special device that can handle cross-guest/host sharing for all kinds
of hypervisors, so I guess you all need to work together :-)

Or we throw out the idea that hyper dmabuf will be cross-hypervisor
(not sure how useful/reasonable that is, someone please convince me
one way or the other).

Cheers, Daniel

On Wed, Mar 14, 2018 at 9:03 AM, Gerd Hoffmann <kra...@redhat.com> wrote:
>   Hi,
>
>> Either mlock accounting (because it's mlocked de facto), and get_user_pages
>> won't do that for you.
>>
>> Or you write the full-blown userptr implementation, including mmu_notifier
>> support (see i915 or amdgpu), but that also requires Christian Königs
>> latest ->invalidate_mapping RFC for dma-buf (since atm exporting userptr
>> buffers is a no-go).
>
> I guess I'll look at mlock accounting for starters then.  Easier for
> now, and leaves the door open to switch to userptr later as this should
> be transparent to userspace.
>
>> > Known issue:  Driver API isn't complete yet.  Need add some flags, for
>> > example to support read-only buffers.
>>
>> dma-buf has no concept of read-only. I don't think we can even enforce
>> that (not many iommus can enforce this iirc), so we pretty much need to
>> require r/w memory.
>
> Ah, ok.  Just saw the 'write' arg for get_user_pages_fast and figured we
> might support that, but if iommus can't handle that anyway it's
> pointless indeed.
>
>> > Cc: David Airlie <airl...@linux.ie>
>> > Cc: Tomeu Vizoso <tomeu.viz...@collabora.com>
>> > Signed-off-by: Gerd Hoffmann <kra...@redhat.com>
>>
>> btw there's also the hyperdmabuf stuff from the xen folks, but imo their
>> solution of forwarding the entire dma-buf api is over the top. This here
>> looks _much_ better, pls cc all the hyperdmabuf people on your next
>> version.
>
> Fun fact: googling for "hyperdmabuf" found me your mail and nothing else :-o
> (Trying "hyper dmabuf" instead worked better then).
>
> Yes, will cc them on the next version.  Not sure it'll help much on xen
> though due to the memory management being very different.  Basically xen
> owns the memory, not the kernel of the control domain (dom0), so
> creating dmabufs for guest memory chunks isn't that simple ...
>
> Also it's not clear whether they really need guest -> guest exports or
> guest -> dom0 exports.
>
>> Overall I like the idea, but too lazy to review.
>
> Cool.  General comments on the idea were all I was looking for for the
> moment.  Spare your review cycles for the next version ;)
>
>> Oh, some kselftests for this stuff would be lovely.
>
> I'll look into it.
>
> thanks,
>   Gerd
>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-03 Thread Daniel Vetter
On Tue, Apr 03, 2018 at 01:06:45PM -0400, Jerome Glisse wrote:
> On Tue, Apr 03, 2018 at 11:09:09AM +0200, Daniel Vetter wrote:
> > On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> > > On 29.03.2018 at 08:57, Daniel Vetter wrote:
> > > > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > > > Add a peer2peer flag noting that the importer can deal with device
> > > > > resources which are not backed by pages.
> > > > > 
> > > > > Signed-off-by: Christian König <christian.koe...@amd.com>
> > > > Um strictly speaking they all should, but ttm never bothered to use the
> > > > real interfaces but just hacked around the provided sg list, grabbing 
> > > > the
> > > > underlying struct pages, then rebuilding the sg list again.
> > > 
> > > Actually that isn't correct. TTM converts them to a dma address array
> > > because drivers need it like this (at least nouveau, radeon and amdgpu).
> > > 
> > > I've fixed radeon and amdgpu to be able to deal without it and mailed with
> > > Ben about nouveau, but the outcome is they don't really know.
> > > 
> > > TTM itself doesn't have any need for the pages on imported BOs (you can't
> > > mmap them anyway), the real underlying problem is that sg tables doesn't
> > > provide what drivers need.
> > > 
> > > I think we could rather easily fix sg tables, but that is a totally 
> > > separate
> > > task.
> > 
> > Looking at patch 8, the sg table seems perfectly sufficient to convey the
> > right dma addresses to the importer. Of course the exporter has to set up
> > the right kind of iommu mappings to make this work.
> > 
> > > > The entire point of using sg lists was exactly to allow this use case of
> > > > peer2peer dma (or well in general have special exporters which managed
> > > > memory/IO ranges not backed by struct page). So essentially you're 
> > > > having
> > > > a "I'm totally not broken flag" here.
> > > 
> > > No, independent of needed struct page pointers we need to note if the
> > > exporter can handle peer2peer stuff from the hardware side in general.
> > > 
> > > So what I did is just set peer2peer allowed on the importer because of
> > > the driver's needs and clear it in the exporter if the hardware can't
> > > handle that.
> > 
> > The only thing the importer seems to do is call
> > pci_peer_traffic_supported, which the exporter could call too. What am I
> > missing (since the struct_page stuff sounds like it's fixed already by
> > you)?
> > -Daniel
> 
> AFAIK Logan patchset require to register and initialize struct page
> for the device memory you want to map (export from exporter point of
> view).
> 
> With GPU this isn't something we want, struct page is >~= 2^6 bytes, so for a
> 4GB GPU = 2^6*2^32/2^12 = 2^26 = 64MB of RAM
> 8GB GPU = 2^6*2^33/2^12 = 2^27 = 128MB of RAM
> 16GB GPU = 2^6*2^34/2^12 = 2^28 = 256MB of RAM
> 32GB GPU = 2^6*2^35/2^12 = 2^29 = 512MB of RAM
> 
> All this is mostly wasted as only a small sub-set (that cannot be
> constrained to a specific range) will ever be exported at any point in
> time. For GPU workloads this is hardly justifiable, even for HMM i
> do not plan to register all those pages.
> 
> Hence why i argue that dma_map_resource() as used by Christian is
> good enough for us. People that care about SG can fix that but i'd
> rather not have to depend on that and waste system memory.

I did not mean you should dma_map_sg/page. I just meant that using
dma_map_resource to fill only the dma address part of the sg table seems
perfectly sufficient. And that's exactly why the importer gets an already
mapped sg table, so that it doesn't have to call dma_map_sg on something
that dma_map_sg can't handle.
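
I.e. something like this on the exporter side (sketch only, with a made-up
map_vram(); a real version would also check dma_mapping_error()):

	static struct sg_table *map_vram(struct device *importer,
					 phys_addr_t phys, size_t size,
					 enum dma_data_direction dir)
	{
		struct sg_table *sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);

		if (!sgt || sg_alloc_table(sgt, 1, GFP_KERNEL)) {
			kfree(sgt);
			return NULL;
		}
		/* fill only the dma side; sg_page() stays NULL on purpose */
		sg_dma_address(sgt->sgl) = dma_map_resource(importer, phys,
							    size, dir, 0);
		sg_dma_len(sgt->sgl) = size;
		return sgt;
	}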

Assuming you get an sg table that's been mapped by calling dma_map_sg was
always a bit of a case of bending the abstraction to avoid typing code. The
only thing an importer ever should have done is look at the dma addresses
in that sg table, nothing else.

And p2p seems to perfectly fit into this (surprise, it was meant to).
That's why I suggested we annotate the broken importers who assume the sg
table is mapped using dma_map_sg or has a struct_page backing the memory
(but there don't seem to be any left) instead of annotating the
ones that aren't broken with a flag that's confusing - you could also have
a dma-buf sgt that points at some other memory that doesn't have struct
pages backing it.

Aside: At least internally in i915 we've been using this forever for our
own private/stolen memory. Unfortunately no other device can access that
range of memory, which is why we don't allow it to be imported to anything
but i915 itself. But if that hw restriction doesn't exist, it'd would
work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-03 Thread Daniel Vetter
On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> On 29.03.2018 at 08:57, Daniel Vetter wrote:
> > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > Add a peer2peer flag noting that the importer can deal with device
> > > resources which are not backed by pages.
> > > 
> > > Signed-off-by: Christian König <christian.koe...@amd.com>
> > Um strictly speaking they all should, but ttm never bothered to use the
> > real interfaces but just hacked around the provided sg list, grabbing the
> > underlying struct pages, then rebuilding the sg list again.
> 
> Actually that isn't correct. TTM converts them to a dma address array
> because drivers need it like this (at least nouveau, radeon and amdgpu).
> 
> I've fixed radeon and amdgpu to be able to deal without it and mailed with
> Ben about nouveau, but the outcome is they don't really know.
> 
> TTM itself doesn't have any need for the pages on imported BOs (you can't
> mmap them anyway), the real underlying problem is that sg tables doesn't
> provide what drivers need.
> 
> I think we could rather easily fix sg tables, but that is a totally separate
> task.

Looking at patch 8, the sg table seems perfectly sufficient to convey the
right dma addresses to the importer. Of course the exporter has to set up
the right kind of iommu mappings to make this work.

> > The entire point of using sg lists was exactly to allow this use case of
> > peer2peer dma (or well in general have special exporters which managed
> > memory/IO ranges not backed by struct page). So essentially you're having
> > a "I'm totally not broken flag" here.
> 
> No, independent of needed struct page pointers we need to note if the
> exporter can handle peer2peer stuff from the hardware side in general.
> 
> So what I did is just set peer2peer allowed on the importer because of
> the driver's needs and clear it in the exporter if the hardware can't handle
> that.

The only thing the importer seems to do is call
pci_peer_traffic_supported, which the exporter could call too. What am I
missing (since the struct_page stuff sounds like it's fixed already by
you)?
-Daniel

> > I think a better approach would be if we add a requires_struct_page or so,
> > and annotate the current importers accordingly. Or we just fix them up (it
> > is all in shared ttm code after all, I think everyone else got this
> > right).
> 
> I would rather not bet on that.
> 
> Christian.
> 
> > -Daniel
> > 
> > > ---
> > >   drivers/dma-buf/dma-buf.c | 1 +
> > >   include/linux/dma-buf.h   | 4 ++++
> > >   2 files changed, 5 insertions(+)
> > > 
> > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > index ffaa2f9a9c2c..f420225f93c6 100644
> > > --- a/drivers/dma-buf/dma-buf.c
> > > +++ b/drivers/dma-buf/dma-buf.c
> > > @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
> > >   attach->dev = info->dev;
> > >   attach->dmabuf = dmabuf;
> > > + attach->peer2peer = info->peer2peer;
> > >   attach->priv = info->priv;
> > >   attach->invalidate = info->invalidate;
> > > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> > > index 15dd8598bff1..1ef50bd9bc5b 100644
> > > --- a/include/linux/dma-buf.h
> > > +++ b/include/linux/dma-buf.h
> > > @@ -313,6 +313,7 @@ struct dma_buf {
> > >* @dmabuf: buffer for this attachment.
> > >* @dev: device attached to the buffer.
> > >* @node: list of dma_buf_attachment.
> > > + * @peer2peer: true if the importer can handle peer resources without pages.
> > >* @priv: exporter specific attachment data.
> > >*
> > >   * This structure holds the attachment information between the dma_buf buffer
> > > @@ -328,6 +329,7 @@ struct dma_buf_attachment {
> > >   struct dma_buf *dmabuf;
> > >   struct device *dev;
> > >   struct list_head node;
> > > + bool peer2peer;
> > >   void *priv;
> > >   /**
> > > @@ -392,6 +394,7 @@ struct dma_buf_export_info {
> > >* @dmabuf: the exported dma_buf
> > >* @dev:the device which wants to import the attachment
> > >* @priv:   private data of importer to this attachment
> > > + * @peer2peer:   true if the importer can handle peer resources without pages
> > >*

Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-03-29 Thread Daniel Vetter
On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> Add a peer2peer flag noting that the importer can deal with device
> resources which are not backed by pages.
> 
> Signed-off-by: Christian König <christian.koe...@amd.com>

Um strictly speaking they all should, but ttm never bothered to use the
real interfaces but just hacked around the provided sg list, grabbing the
underlying struct pages, then rebuilding the sg list again.

The entire point of using sg lists was exactly to allow this use case of
peer2peer dma (or well in general have special exporters which managed
memory/IO ranges not backed by struct page). So essentially you're having
a "I'm totally not broken flag" here.

I think a better approach would be if we add a requires_struct_page or so,
and annotate the current importers accordingly. Or we just fix them up (it
is all in shared ttm code after all, I think everyone else got this
right).
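
Strawman for what that annotation could look like at attach time (the flag
below is entirely hypothetical, it exists nowhere):

	struct dma_buf_attach_info info = {
		.dmabuf	= dmabuf,
		.dev	= dev,
		/* .requires_struct_page = true,  <- hypothetical flag; the
		 * exporter would refuse the attach if it can't satisfy it */
	};

	attach = dma_buf_attach(&info);
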
-Daniel

> ---
>  drivers/dma-buf/dma-buf.c | 1 +
>  include/linux/dma-buf.h   | 4 ++++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index ffaa2f9a9c2c..f420225f93c6 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
>  
>   attach->dev = info->dev;
>   attach->dmabuf = dmabuf;
> + attach->peer2peer = info->peer2peer;
>   attach->priv = info->priv;
>   attach->invalidate = info->invalidate;
>  
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 15dd8598bff1..1ef50bd9bc5b 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -313,6 +313,7 @@ struct dma_buf {
>   * @dmabuf: buffer for this attachment.
>   * @dev: device attached to the buffer.
>   * @node: list of dma_buf_attachment.
> + * @peer2peer: true if the importer can handle peer resources without pages.
>   * @priv: exporter specific attachment data.
>   *
>   * This structure holds the attachment information between the dma_buf buffer
> @@ -328,6 +329,7 @@ struct dma_buf_attachment {
>   struct dma_buf *dmabuf;
>   struct device *dev;
>   struct list_head node;
> + bool peer2peer;
>   void *priv;
>  
>   /**
> @@ -392,6 +394,7 @@ struct dma_buf_export_info {
>   * @dmabuf:  the exported dma_buf
>   * @dev: the device which wants to import the attachment
>   * @priv:private data of importer to this attachment
> + * @peer2peer:   true if the importer can handle peer resources without pages
>   * @invalidate:  callback to use for invalidating mappings
>   *
>   * This structure holds the information required to attach to a buffer. Used
> @@ -401,6 +404,7 @@ struct dma_buf_attach_info {
>   struct dma_buf *dmabuf;
>   struct device *dev;
>   void *priv;
> + bool peer2peer;
>   void (*invalidate)(struct dma_buf_attachment *attach);
>  };
>  
> -- 
> 2.14.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-27 Thread Daniel Vetter
On Tue, Mar 27, 2018 at 10:06:04AM +0200, Christian König wrote:
> On 27.03.2018 at 09:53, Daniel Vetter wrote:
> > [SNIP]
> > > > [SNIP]
> > > > A slightly better solution is using an atomic counter:
> > > > driver_range_start() {
> > > >   atomic_inc(&drv->notifier_count);
> > > ...
> > > 
> > > Yeah, that is exactly what amdgpu is doing now. Sorry if my description
> > > didn't make that clear.
> > > 
> > > > I would like to see drivers using the same code, as it means one place to
> > > > fix issues. Doing the above conversion to the amd or radeon kernel driver
> > > > has been on my TODO list for a long time. I am pushing it up my todo list;
> > > > hopefully in the next few weeks I can send an rfc so people can have a
> > > > real sense of how it can look.
> > > Certainly a good idea, but I think we might have that separate to HMM.
> > > 
> > > TTM really suffered from feature overload, e.g. trying to do everything in
> > > a single subsystem. And it would be rather nice to have coherent userptr
> > > handling for GPUs as a separate feature.
> > TTM suffered from being a midlayer imo, not from doing too much.
> 
> Yeah, completely agree.
> 
> midlayers work as long as you concentrate on doing exactly one thing in
> your midlayer. E.g. in the case of TTM the callback for BO move handling is
> well justified.
> 
> Only all the stuff around it like address space handling etc... is really
> wrongly designed and should be separated (which is exactly what DRM MM did,
> but TTM still uses this in the wrong way).

Yeah the address space allocator part of ttm really is backwards and makes
quick driver hacks and heuristics for better allocation schemes really
hard to add. Same for tuning how/what exactly you evict.
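
For reference, the split-out allocator (drm_mm) is already trivial to drive
standalone; a sketch with made-up sizes:

	static int demo_drm_mm(u64 vram_size, u64 size)
	{
		struct drm_mm mm;
		struct drm_mm_node node = {};
		int ret;

		drm_mm_init(&mm, 0, vram_size);
		/* first-fit style search by default; the _generic variant
		 * takes alignment/color/mode for fancier placement */
		ret = drm_mm_insert_node(&mm, &node, size);
		if (ret == 0) {
			/* node.start is the allocated offset */
			drm_mm_remove_node(&node);
		}
		drm_mm_takedown(&mm);
		return ret;
	}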

> > HMM is apparently structured like a toolbox (despite its documentation 
> > claiming
> > otherwise), so you can pick freely.
> 
> That sounds good, but I would still have a better feeling if userptr
> handling were separated. That avoids mangling things together again.

Jerome said he wants to do at least one prototype conversion of one of the
"I can't fault" userptr implementation over to the suitable subset of HMM
helpers. I guess we'll see once he's submitting the patches, but it
sounded exactly like what the doctor ordered :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-27 Thread Daniel Vetter
On Tue, Mar 27, 2018 at 09:35:17AM +0200, Christian König wrote:
> On 26.03.2018 at 17:42, Jerome Glisse wrote:
> > On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote:
> > > On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
> > > > On 22.03.2018 at 08:18, Daniel Vetter wrote:
> > > > [SNIP]
> > > > Key take away from that was that you can't take any locks in either the
> > > > MMU notifier or the shrinker that you also take while calling kmalloc (or,
> > > > simply speaking, get_user_pages()).
> > > > 
> > > > Additionally, in the MMU or shrinker callback all different kinds of
> > > > locks might be held, so you basically can't assume that you can do things
> > > > like recursive page table walks or call dma_unmap_anything.
> > > That sounds like a design bug in mmu_notifiers, since it would render them
> > > useless for KVM. And they were developed for that originally. I think I'll
> > > chat with Jerome to understand this, since it's all rather confusing.
> > Doing dma_unmap() during a mmu_notifier callback should be fine, it was last
> > time i checked. However there is no formal contract that it is ok to do so.
> 
> As I said before dma_unmap() isn't the real problem here.
> 
> The issues is more that you can't take a lock in the MMU notifier which you
> would also take while allocating memory without GFP_NOIO.
> 
> That makes it rather tricky to do any command submission, e.g. you need to
> grab all the pages/memory/resources beforehand, then make sure that you don't
> have a MMU notifier running concurrently and do the submission.
> 
> If any of the prerequisites isn't fulfilled we need to restart the
> operation.

Yeah we're hitting all that epic amount of fun now, after a chat with
Jerome yesterday. I guess we'll figure out what we're coming up with.

> > [SNIP]
> > A slightly better solution is using an atomic counter:
> >driver_range_start() {
> >  atomic_inc(&drv->notifier_count);
> ...
> 
> Yeah, that is exactly what amdgpu is doing now. Sorry if my description
> didn't make that clear.
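
For the record, the counter scheme fleshed out a bit: purely a sketch, with
a made-up my_drv struct and the current mmu_notifier callback signatures
(hooked up via drv->mn.ops = &my_mn_ops; mmu_notifier_register(&drv->mn, mm)).

	struct my_drv {
		struct mmu_notifier mn;
		atomic_t notifier_count;
	};

	static void my_range_start(struct mmu_notifier *mn,
				   struct mm_struct *mm,
				   unsigned long start, unsigned long end)
	{
		struct my_drv *drv = container_of(mn, struct my_drv, mn);

		atomic_inc(&drv->notifier_count);
		/* shoot down device mappings overlapping [start, end) */
	}

	static void my_range_end(struct mmu_notifier *mn,
				 struct mm_struct *mm,
				 unsigned long start, unsigned long end)
	{
		struct my_drv *drv = container_of(mn, struct my_drv, mn);

		atomic_dec(&drv->notifier_count);
	}

	static const struct mmu_notifier_ops my_mn_ops = {
		.invalidate_range_start	= my_range_start,
		.invalidate_range_end	= my_range_end,
	};

	/* submission path: sample the counter, get_user_pages(), then
	 * recheck under the driver lock and restart on any mismatch */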
> 
> > I would like to see drivers using the same code, as it means one place to
> > fix issues. Doing the above conversion to the amd or radeon kernel driver
> > has been on my TODO list for a long time. I am pushing it up my todo list;
> > hopefully in the next few weeks I can send an rfc so people can have a real
> > sense of how it can look.
> 
> Certainly a good idea, but I think we might have that separate to HMM.
> 
> TTM really suffered from feature overload, e.g. trying to do everything in a
> single subsystem. And it would be rather nice to have coherent userptr
> handling for GPUs as a separate feature.

TTM suffered from being a midlayer imo, not from doing too much. HMM is
apparently structured like a toolbox (despite its documentation claiming
otherwise), so you can pick freely.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH] dma-buf: use parameter structure for dma_buf_attach

2018-03-26 Thread Daniel Vetter
On Mon, Mar 26, 2018 at 12:47:01PM +0200, Christian König wrote:
> On 26.03.2018 at 10:36, Daniel Vetter wrote:
> > On Sun, Mar 25, 2018 at 01:34:51PM +0200, Christian König wrote:
> [SNIP]
> > > - attach->dev = dev;
> > > + attach->dev = info->dev;
> > >   attach->dmabuf = dmabuf;
> > > + attach->priv = info->priv;
> > The ->priv field is for the exporter, not the importer. See e.g.
> > drm_gem_map_attach. You can't let the importer set this now too, so it needs
> > to be removed from the info struct.
> 
> Crap, in this case I need to add an importer_priv field because we now need
> to map from the attachment to its importer object as well.
> 
> Thanks for noticing this.

Maybe add the importer_priv field only in the series that actually adds
it, not in this prep patch. You can mention all the fields you need here
in the commit message for justification.
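
I.e. the info struct would eventually grow into something like this
(importer_priv being the new, not-yet-existing field):

	struct dma_buf_attach_info {
		struct dma_buf *dmabuf;
		struct device *dev;
		void *importer_priv;	/* hypothetical: the importer's cookie */
	};
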
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH] dma-buf: use parameter structure for dma_buf_attach

2018-03-26 Thread Daniel Vetter
> @@ -692,7 +696,7 @@ static void *vb2_dc_attach_dmabuf(struct device *dev, struct dma_buf *dbuf,
>  
>   buf->dev = dev;
>   /* create attachment for the dmabuf with the user device */
> - dba = dma_buf_attach(dbuf, buf->dev);
> + dba = dma_buf_attach(&attach_info);
>   if (IS_ERR(dba)) {
>   pr_err("failed to attach dmabuf\n");
>   kfree(buf);
> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> index 753ed3138dcc..4e61050ba87f 100644
> --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> @@ -609,6 +609,10 @@ static void vb2_dma_sg_detach_dmabuf(void *mem_priv)
>  static void *vb2_dma_sg_attach_dmabuf(struct device *dev, struct dma_buf *dbuf,
>   unsigned long size, enum dma_data_direction dma_dir)
>  {
> + struct dma_buf_attach_info attach_info = {
> + .dev = dev,
> + .dmabuf = dbuf
> + };
>   struct vb2_dma_sg_buf *buf;
>   struct dma_buf_attachment *dba;
>  
> @@ -624,7 +628,7 @@ static void *vb2_dma_sg_attach_dmabuf(struct device *dev, struct dma_buf *dbuf,
>  
>   buf->dev = dev;
>   /* create attachment for the dmabuf with the user device */
> - dba = dma_buf_attach(dbuf, buf->dev);
> + dba = dma_buf_attach(&attach_info);
>   if (IS_ERR(dba)) {
>   pr_err("failed to attach dmabuf\n");
>   kfree(buf);
> diff --git a/drivers/staging/media/tegra-vde/tegra-vde.c b/drivers/staging/media/tegra-vde/tegra-vde.c
> index c47659e96089..25d112443b0d 100644
> --- a/drivers/staging/media/tegra-vde/tegra-vde.c
> +++ b/drivers/staging/media/tegra-vde/tegra-vde.c
> @@ -529,6 +529,10 @@ static int tegra_vde_attach_dmabuf(struct device *dev,
>  size_t *size,
>  enum dma_data_direction dma_dir)
>  {
> + struct dma_buf_attach_info attach_info = {
> + .dev = dev,
> + .dmabuf = dmabuf
> + };
>   struct dma_buf_attachment *attachment;
>   struct dma_buf *dmabuf;
>   struct sg_table *sgt;
> @@ -547,7 +551,7 @@ static int tegra_vde_attach_dmabuf(struct device *dev,
>   return -EINVAL;
>   }
>  
> - attachment = dma_buf_attach(dmabuf, dev);
> + attachment = dma_buf_attach(&attach_info);
>   if (IS_ERR(attachment)) {
>   dev_err(dev, "Failed to attach dmabuf\n");
>   err = PTR_ERR(attachment);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 085db2fee2d7..2c27568d44af 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -362,6 +362,21 @@ struct dma_buf_export_info {
>   struct dma_buf_export_info name = { .exp_name = KBUILD_MODNAME, \
>.owner = THIS_MODULE }
>  
> +/**
> + * struct dma_buf_attach_info - holds information needed to attach to a dma_buf
> + * @dmabuf:  the exported dma_buf
> + * @dev: the device which wants to import the attachment
> + * @priv:private data of importer to this attachment
> + *
> + * This structure holds the information required to attach to a buffer. Used
> + * with dma_buf_attach() only.
> + */
> +struct dma_buf_attach_info {
> + struct dma_buf *dmabuf;
> + struct device *dev;
> + void *priv;
> +};
> +
>  /**
>   * get_dma_buf - convenience wrapper for get_file.
>   * @dmabuf:  [in]pointer to dma_buf
> @@ -376,8 +391,8 @@ static inline void get_dma_buf(struct dma_buf *dmabuf)
>   get_file(dmabuf->file);
>  }
>  
> -struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
> - struct device *dev);
> +struct dma_buf_attachment *
> +dma_buf_attach(const struct dma_buf_attach_info *info);
>  void dma_buf_detach(struct dma_buf *dmabuf,
>   struct dma_buf_attachment *dmabuf_attach);
>  
> -- 
> 2.14.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-26 Thread Daniel Vetter
On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
> On 22.03.2018 at 08:18, Daniel Vetter wrote:
> > On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
> > > On 21.03.2018 at 09:28, Daniel Vetter wrote:
> > > > On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> > > > > On 20.03.2018 at 15:08, Daniel Vetter wrote:
> > > > > > [SNIP]
> > > > > > For the in-driver reservation path (CS) having a slow-path that 
> > > > > > grabs a
> > > > > > temporary reference, drops the vram lock and then locks the 
> > > > > > reservation
> > > > > > normally (using the acquire context used already for the entire CS) 
> > > > > > is a
> > > > > > bit tricky, but totally feasible. Ttm doesn't do that though.
> > > > > That is exactly what we do in amdgpu as well, it's just not very 
> > > > > efficient
> > > > > nor reliable to retry getting the right pages for a submission over 
> > > > > and over
> > > > > again.
> > > > Out of curiosity, where's that code? I did read the ttm eviction code 
> > > > way
> > > > back, and that one definitely didn't do that. Would be interesting to
> > > > update my understanding.
> > > That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with
> > > grabbing, releasing and regrabbing locks in a loop.
> > > 
> > > Then in amdgpu_cs_submit() we grab a lock preventing page table updates and
> > > check if all pages are still the ones we want to have:
> > > > amdgpu_mn_lock(p->mn);
> > > > if (p->bo_list) {
> > > >         for (i = p->bo_list->first_userptr;
> > > >              i < p->bo_list->num_entries; ++i) {
> > > >                 struct amdgpu_bo *bo = p->bo_list->array[i].robj;
> > > > 
> > > >                 if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
> > > >                         amdgpu_mn_unlock(p->mn);
> > > >                         return -ERESTARTSYS;
> > > >                 }
> > > >         }
> > > > }
> > > If anything changed on the page tables we restart the whole IOCTL using
> > > -ERESTARTSYS and try again.
> > I'm not talking about userptr here, but general bo eviction. Sorry for the
> > confusion.
> > 
> > The reason I'm dragging all the general bo management into this
> > discussions is because we do seem to have fairly fundamental difference in
> > how that's done, with resulting consequences for the locking hierarchy.
> > 
> > And if this invalidate_mapping stuff should work, together with userptr
> > and everything else, I think we're required to agree on how this is all
> > supposed to nest, and how exactly we should back off for the other side
> > that needs to break the locking circle.
> > 
> > That aside, I don't entirely understand why you need to restart so much. I
> > figured that get_user_pages is ordered correctly against mmu
> > invalidations, but I get the impression you think that's not the case. How
> > does that happen?
> 
> Correct. I've had the same assumption, but both Jerome as well as our
> internal tests proved me wrong on that.
> 
> Key take away from that was that you can't take any locks in either the
> MMU notifier or the shrinker that you also take while calling kmalloc (or,
> simply speaking, get_user_pages()).
> 
> Additionally, in the MMU or shrinker callback all different kinds of
> locks might be held, so you basically can't assume that you can do things
> like recursive page table walks or call dma_unmap_anything.

That sounds like a design bug in mmu_notifiers, since it would render them
useless for KVM. And they were developed for that originally. I think I'll
chat with Jerome to understand this, since it's all rather confusing.

> Thinking about it for a moment, it actually seems to make perfect sense.
> So it doesn't matter what ordering you have between the mmap_sem and your buffer
> or allocation lock, it will simply be incorrect with other locks in the
> system anyway.

Hm, doesn't make sense to me, at least from a locking inversion pov. I
thought the only locks that are definitely part of the mmu_notifier
callback contexts are the mm locks. We're definitely fine with those, and
kmalloc didn't bite us yet. Direct reclaim is rea

Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-26 Thread Daniel Vetter
On Thu, Mar 22, 2018 at 10:37:55AM +0100, Christian König wrote:
> Am 22.03.2018 um 08:14 schrieb Daniel Vetter:
> > On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
> > > Am 21.03.2018 um 09:18 schrieb Daniel Vetter:
> > > > [SNIP]
> > > For correct operation you always need to implement invalidate_range_end as
> > > well and add some lock/completion work. Otherwise get_user_pages() can
> > > again grab the reference to the wrong page.
> > Is this really a problem?
> 
> Yes, and quite a big one.
> 
> > I figured that if a mmu_notifier invalidation is
> > going on, a get_user_pages on that mm from anywhere else (whether i915 or
> > anyone really) will serialize with the ongoing invalidate?
> 
> No, that isn't correct. Jerome can probably better explain that than I do.
> 
> > If that's not the case, then really any get_user_pages is racy, including 
> > all the
> > DIRECT_IO ones.
> 
> The key point here is that get_user_pages() grabs a reference to the page.
> So what you get is a bunch of pages which were mapped at that location at a
> specific point in time.
> 
> There is no guarantee that after get_user_pages() returns you still have the
> same pages mapped at that point; you are only guaranteed that the pages are
> not reused for something else.
> 
> That is perfectly sufficient for a task like DIRECT_IO where you can only
> have block or network I/O, but unfortunately not really for GPUs where you
> crunch results, write them back to pages and actually count on the
> CPU seeing the result in the right place.

Hm ok, I'll chat with Jerome about this. I thought we have epic amounts of
userptr tests, including thrashing the mappings vs gpu activity, so I'm
somewhat surprised that this hasn't blown up yet.

> > > [SNIP]
> > > So no matter how you put it i915 is clearly doing something wrong here :)
> > tbh I'm not entirely clear on the reasons why this works, but
> > cross-release lockdep catches these things, and it did not complain.
> > On a high-level we make sure that mm locks needed by get_user_pages do
> > _not_ nest within dev->struct_mutex. We have massive back-off slowpaths to
> > do anything that could fault outside of our own main gem locking.
> 
> I'm pretty sure that this doesn't work as intended and just hides the real
> problem.
> 
> > That was (at least in the past) a major difference with amdgpu, which
> > essentially has none of these paths. That would trivially deadlock with
> > your own gem mmap fault handler, so you had (maybe that changed) a dumb
> > retry loop, which did shut up lockdep but didn't fix any of the locking
> > inversions.
> 
> Any lock you grab in an MMU callback can't even be held when you call
> kmalloc() or get_free_page() (without GFP_NOIO).
> 
> Even simple things like drm_vm_open() violate that by using GFP_KERNEL. So I
> can 100% assure you that what you do here is not correct.

drm_vm_open isn't used by modern drivers anymore. We have validated the
locking with the cross-release stuff for a few weeks, and it didn't catch
stuff. So I'm not worried that the locking is busted, only the mmu
notifier vs. get_user_pages races concerns me.

> > So yeah, grabbing dev->struct_mutex is in principle totally fine while
> > holding all kinds of struct mm/vma locks. I'm not entirely clear why we
> > punt the actual unmapping to the worker though, maybe simply to not have a
> > constrained stack.
> 
> I strongly disagree on that. As far as I can see what TTM does looks
> actually like the right approach to the problem.
> 
> > This is re: your statement that you can't unmap sg tables from the
> > shrinker. We can, because we've actually untangled the locking dependencies
> > so that you can fully operate on gem objects from within mm/vma locks.
> > Maybe code has changed, but last time I looked at radeon/ttm a while back
> > that was totally not the case, and if you don't do all this work then yes
> > you'll deadlock.
> > 
> > Doesn't mean it's impossible, because we've done it :-)
> 
> And I'm pretty sure you didn't do it correctly :D
> 
> > Well, it actually gets the job done. We'd need to at least get to
> > per-object locking, and probably even then we'd need to rewrite the code a
> > lot. But please note that this here is only to avoid the GFP_NOIO
> > constraint, all the other bits I clarified around why we don't actually
> > have circular locking (because the entire hierarchy is inverted for us)
> > still hold even if you would only trylock here.
> 
> Well you reversed your allocation and mmap_sem lock which avoids the lock
> inversion 

Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-22 Thread Daniel Vetter
On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
> Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
> > On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> > > Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > For the in-driver reservation path (CS) having a slow-path that grabs a
> > > > temporary reference, drops the vram lock and then locks the reservation
> > > > normally (using the acquire context used already for the entire CS) is a
> > > > bit tricky, but totally feasible. Ttm doesn't do that though.
> > > That is exactly what we do in amdgpu as well, it's just not very efficient
> > > nor reliable to retry getting the right pages for a submission over and
> > > over again.
> > Out of curiosity, where's that code? I did read the ttm eviction code way
> > back, and that one definitely didn't do that. Would be interesting to
> > update my understanding.
> 
> That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with
> grabbing, releasing and regrabbing locks in a loop.
> 
> Then in amdgpu_cs_submit() we grab a lock preventing page table updates and
> check if all pages are still the ones we want to have:
> >     amdgpu_mn_lock(p->mn);
> >     if (p->bo_list) {
> >             for (i = p->bo_list->first_userptr;
> >                  i < p->bo_list->num_entries; ++i) {
> >                     struct amdgpu_bo *bo = p->bo_list->array[i].robj;
> > 
> >                     if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
> >                             amdgpu_mn_unlock(p->mn);
> >                             return -ERESTARTSYS;
> >                     }
> >             }
> >     }
> 
> If anything changed on the page tables we restart the whole IOCTL using
> -ERESTARTSYS and try again.

I'm not talking about userptr here, but general bo eviction. Sorry for the
confusion.

The reason I'm dragging all the general bo management into this
discussion is because we do seem to have a fairly fundamental difference in
how that's done, with resulting consequences for the locking hierarchy.

And if this invalidate_mapping stuff should work, together with userptr
and everything else, I think we're required to agree on how this is all
supposed to nest, and how exactly we should back off for the other side
that needs to break the locking circle.

That aside, I don't entirely understand why you need to restart so much. I
figured that get_user_pages is ordered correctly against mmu
invalidations, but I get the impression you think that's not the case. How
does that happen?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-22 Thread Daniel Vetter
On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
> Am 21.03.2018 um 09:18 schrieb Daniel Vetter:
> > [SNIP]
> > They're both in i915_gem_userptr.c, somewhat interleaved. Would be
> > interesting if you could show what you think is going wrong in there
> > compared to amdgpu_mn.c.
> 
> i915 implements only one callback:
> > static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> >     .invalidate_range_start =
> > i915_gem_userptr_mn_invalidate_range_start,
> > };
> For correct operation you always need to implement invalidate_range_end as
> well and add some lock/completion work. Otherwise get_user_pages() can again
> grab the reference to the wrong page.

Is this really a problem? I figured that if a mmu_notifier invalidation is
going on, a get_user_pages on that mm from anywhere else (whether i915 or
anyone really) will serialize with the ongoing invalidate? If that's not
the case, then really any get_user_pages is racy, including all the
DIRECT_IO ones.

> The next problem seems to be that cancel_userptr() doesn't prevent any new
> command submission. E.g.
> > i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL);
> What prevents new command submissions to use the GEM object directly after
> you finished waiting here?
> 
> > I get a feeling we're talking past each other here.
> Yeah, agree. In addition to that I don't know the i915 code very well.
> 
> > Can you perhaps explain what exactly the race is you're seeing? The i915
> > userptr code is fairly convoluted and pushes a lot of stuff to workers
> > (but then syncs with those workers again later on), so easily possible
> > you've overlooked one of these lines that might guarantee already what you
> > think needs to be guaranteed. We're definitely not aiming to allow
> > userspace to write to random pages all over.
> 
> You don't read/write to random pages; there is still a reference to the page,
> so the page can't be reused until you are done.
> 
> The problem is rather that you can't guarantee that you write to the page
> which is mapped into the process at that location. E.g. the CPU and the GPU
> might see two different things.
> 
> > > > Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of
> > > > stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least
> > > > the gpu side PTEs to make stuff inaccessible) and wreck the core kernel's
> > > > book-keeping.
> > > > 
> > > > In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
> > > > after all the mappings are really gone (both GPU PTEs and sg mapping),
> > > > guaranteeing that any stray writes from either the GPU or IOMMU will
> > > > result in faults (except bugs in the IOMMU, but can't have it all,
> > > > "IOMMU actually works" is an assumption behind device isolation).
> > > Well exactly that's the point, the handling in i915 looks incorrect to me.
> > > You need to call set_page_dirty/mark_page_accessed way before the mapping
> > > is destroyed.
> > > 
> > > To be more precise for userptrs it must be called from the
> > > invalidate_range_start, but i915 seems to delegate everything into a
> > > background worker to avoid the locking problems.
> > Yeah, and at the end of the function there's a flush_work to make sure the
> > worker has caught up.
> Ah, yes haven't seen that.
> 
> But then grabbing the obj->base.dev->struct_mutex lock in cancel_userptr()
> is rather evil. You just silenced lockdep because you offloaded that into a
> work item.
> 
> So no matter how you put it i915 is clearly doing something wrong here :)

tbh I'm not entirely clear on the reasons why this works, but
cross-release lockdep catches these things, and it did not complain.

On a high-level we make sure that mm locks needed by get_user_pages do
_not_ nest within dev->struct_mutex. We have massive back-off slowpaths to
do anything that could fault outside of our own main gem locking.

That was (at least in the past) a major difference with amdgpu, which
essentially has none of these paths. That would trivially deadlock with
your own gem mmap fault handler, so you had (maybe that changed) a dumb
retry loop, which did shut up lockdep but didn't fix any of the locking
inversions.

So yeah, grabbing dev->struct_mutex is in principle totally fine while
holding all kinds of struct mm/vma locks. I'm not entirely clear why we
punt the actual unmapping to the worker though, maybe simply to not have a
constrained stack.

Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-21 Thread Daniel Vetter
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > [SNIP]
> > For the in-driver reservation path (CS) having a slow-path that grabs a
> > temporary reference, drops the vram lock and then locks the reservation
> > normally (using the acquire context used already for the entire CS) is a
> > bit tricky, but totally feasible. Ttm doesn't do that though.
> 
> That is exactly what we do in amdgpu as well, it's just not very efficient
> nor reliable to retry getting the right pages for a submission over and over
> again.

Out of curiosity, where's that code? I did read the ttm eviction code way
back, and that one definitely didn't do that. Would be interesting to
update my understanding.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-21 Thread Daniel Vetter
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > [SNIP]
> > For the in-driver reservation path (CS) having a slow-path that grabs a
> > temporary reference, drops the vram lock and then locks the reservation
> > normally (using the acquire context used already for the entire CS) is a
> > bit tricky, but totally feasible. Ttm doesn't do that though.
> 
> That is exactly what we do in amdgpu as well, it's just not very efficient
> nor reliable to retry getting the right pages for a submission over and over
> again.
> 
> > [SNIP]
> > Note that there are 2 paths for i915 userptr. One is the mmu notifier, the
> > other one is the root-only hack we have for dubious reasons (or rather, one
> > whose point I really don't see myself).
> 
> Well I'm referring to i915_gem_userptr.c, if that isn't what you are
> exposing then just feel free to ignore this whole discussion.

They're both in i915_gem_userptr.c, somewhat interleaved. Would be
interesting if you could show what you think is going wrong in there
compared to amdgpu_mn.c.

> > > For coherent usage you need to install some lock to prevent concurrent
> > > get_user_pages(), command submission and
> > > invalidate_range_start/invalidate_range_end from the MMU notifier.
> > > 
> > > Otherwise you can't guarantee that you are actually accessing the right 
> > > page
> > > in the case of a fork() or mprotect().
> > Yeah doing that with a full lock will create endless amounts of issues,
> > but I don't see why we need that. Userspace racing stuff with itself gets
> > to keep all the pieces. This is like racing DIRECT_IO against mprotect and
> > fork.
> 
> First of all I strongly disagree on that. A thread calling fork() because it
> wants to run a command is not something we can forbid just because we have a
> gfx stack loaded. That the video driver is not capable of handling that
> correctly is certainly not the problem of userspace.
> 
> Second it's not only userspace racing here, you can get into this kind of
> issues just because of transparent huge page support where the background
> daemon tries to reallocate the page tables into bigger chunks.
> 
> And if I'm not completely mistaken you can also open up quite a bunch of
> security problems if you suddenly access the wrong page.

I get a feeling we're talking past each other here. Can you perhaps
explain what exactly the race is you're seeing? The i915 userptr code is
fairly convoluted and pushes a lot of stuff to workers (but then syncs
with those workers again later on), so easily possible you've overlooked
one of these lines that might guarantee already what you think needs to be
guaranteed. We're definitely not aiming to allow userspace to write
to random pages all over.

> > Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of
> > stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least
> > the gpu side PTEs to make stuff inaccessible) and wreck the core kernel's
> > book-keeping.
> > 
> > In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
> > after all the mappings are really gone (both GPU PTEs and sg mapping),
> > guaranteeing that any stray writes from either the GPU or IOMMU will
> > result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU
> > actually works" is an assumption behind device isolation).
> Well exactly that's the point, the handling in i915 looks incorrect to me.
> You need to call set_page_dirty/mark_page_accessed way before the mapping is
> destroyed.
> 
> To be more precise for userptrs it must be called from the
> invalidate_range_start, but i915 seems to delegate everything into a
> background worker to avoid the locking problems.

Yeah, and at the end of the function there's a flush_work to make sure the
worker has caught up.

The set_page_dirty is also there, but hidden very deep in the call chain
as part of the vma unmapping and backing storage unpinning. But I do think
we guarantee what you expect needs to happen.
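
(As a sketch of that ordering guarantee, heavily simplified and with
made-up names; the real i915 code spreads this across several workers and
helpers:)

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/swap.h>

struct my_userptr {
        struct device *dev;
        struct sg_table *sgt;
        struct page **pages;
        unsigned long npages;
        bool dirty;
};

static void my_userptr_release_pages(struct my_userptr *up)
{
        unsigned long i;

        /* 1. GPU PTEs are already torn down by the caller: the GPU can
         *    no longer address these pages. */

        /* 2. Drop the sg/IOMMU mapping: stray DMA now faults. */
        dma_unmap_sg(up->dev, up->sgt->sgl, up->sgt->nents,
                     DMA_BIDIRECTIONAL);
        sg_free_table(up->sgt);

        /* 3. Only now tell the core mm about the pages, so nothing can
         *    scribble over them after the book-keeping is updated. */
        for (i = 0; i < up->npages; i++) {
                if (up->dirty)
                        set_page_dirty(up->pages[i]);
                mark_page_accessed(up->pages[i]);
                put_page(up->pages[i]);
        }
}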

> > > Felix and I hammered for quite some time on amdgpu until all of this was
> > > handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.
> > Maybe we should have more shared code in this, it seems to be a source of
> > endless amounts of fun ...
> > 
> > > I can try to gather the lockdep splat from my mail history, but it
> > > essentially took us multiple years to get rid of all of them.
> > I'm very much interested in specifically the splat that makes it
> > impossible for you folks to remove the sg mappings. That one sounds bad.
> > And would 

Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-20 Thread Daniel Vetter
On Tue, Mar 20, 2018 at 11:54:18AM +0100, Christian König wrote:
> Am 20.03.2018 um 08:44 schrieb Daniel Vetter:
> > On Mon, Mar 19, 2018 at 5:23 PM, Christian König
> > <ckoenig.leichtzumer...@gmail.com> wrote:
> > > Am 19.03.2018 um 16:53 schrieb Chris Wilson:
> > > > Quoting Christian König (2018-03-16 14:22:32)
> > > > [snip, probably lost too much context]
> > > > > This allows for full grown pipelining, e.g. the exporter can say I need
> > > > > to move the buffer for some operation. Then let the move operation wait
> > > > > for all existing fences in the reservation object and install the fence
> > > > > of the move operation as exclusive fence.
> > > > Ok, the situation I have in mind is the non-pipelined case: revoking
> > > > dma-buf for mmu_invalidate_range or shrink_slab. I would need a
> > > > completion event that can be waited on the cpu for all the invalidate
> > > > callbacks. (Essentially an atomic_t counter plus struct completion; a
> > > > lighter version of dma_fence, I wonder where I've seen that before ;)
> > > 
> > > Actually that is harmless.
> > > 
> > > When you need to unmap a DMA-buf because of mmu_invalidate_range or
> > > shrink_slab you need to wait for its reservation object anyway.
> > reservation_object only prevents adding new fences, you still have to
> > wait for all the current ones to signal. Also, we have dma-access
> > without fences in i915. "I hold the reservation_object" does not imply
> > you can just go and nuke the backing storage.
> 
> I was not talking about taking the lock, but rather using
> reservation_object_wait_timeout_rcu().
> 
> To be more precise you actually can't take the reservation object lock in an
> mmu_invalidate_range callback and you can only trylock it in a shrink_slab
> callback.

Ah ok, and yes agreed. Kinda. See below.

> > > This needs to be done to make sure that the backing memory is now idle; it
> > > doesn't matter if the jobs were submitted by DMA-buf importers or your
> > > own driver.
> > > 
> > > The sg tables pointing to the now released memory might live a bit longer,
> > > but that is unproblematic and actually intended.
> > I think that's very problematic. One reason for an IOMMU is that you
> > have device access isolation, and a broken device can't access memory
> > it shouldn't be able to access. From that security-in-depth point of
> > view it's not cool that there's some sg tables hanging around still
> > that a broken GPU could use. And let's not pretend hw is perfect,
> > especially GPUs :-)
> 
> I completely agree on that, but there is unfortunately no other way.
> 
> See you simply can't take a reservation object lock in an mmu or slab
> callback, you can only trylock them.
> 
> For example it would require changing all allocations done while holding any
> reservation lock to GFP_NOIO.

Yeah mmu and slab can only trylock, and they need to skip to the next
object when the trylock fails. But once you have the lock you imo should
be able to clean up the entire mess still. We definitely do that for the
i915 shrinkers, and I don't see how going to importers through the
->invalidate_mapping callback changes anything with that.
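
(Concretely, the trylock-and-skip rule looks something like this -- a
hypothetical shrinker scan callback with made-up my_dev/my_obj structures
and a my_obj_purge() helper, not lifted from i915 or ttm; list locking is
elided for brevity:)

#include <linux/list.h>
#include <linux/reservation.h>
#include <linux/shrinker.h>

struct my_obj {
        struct list_head link;
        struct reservation_object *resv;
};

static struct {
        struct list_head shrink_list;   /* protected by a driver lock */
} my_dev;

/* unmap sg tables, tear down IOMMU mappings, drop backing storage */
static unsigned long my_obj_purge(struct my_obj *obj);

static unsigned long my_shrinker_scan(struct shrinker *shrinker,
                                      struct shrink_control *sc)
{
        struct my_obj *obj, *tmp;
        unsigned long freed = 0;

        list_for_each_entry_safe(obj, tmp, &my_dev.shrink_list, link) {
                /* Never block on the reservation lock from reclaim
                 * context; contended objects are simply skipped. */
                if (!reservation_object_trylock(obj->resv))
                        continue;

                /* We own the lock, so the entire mess can be cleaned up. */
                freed += my_obj_purge(obj);
                reservation_object_unlock(obj->resv);

                if (freed >= sc->nr_to_scan)
                        break;
        }
        return freed;
}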

For the in-driver reservation path (CS) having a slow-path that grabs a
temporary reference, drops the vram lock and then locks the reservation
normally (using the acquire context used already for the entire CS) is a
bit tricky, but totally feasible. Ttm doesn't do that though.

So there's completely feasible ways to make sure the sg list is all
properly released, all DMA gone and the IOMMU mappings torn down. Anything
else is just a bit shoddy device driver programming imo.

> > > If we tried to destroy the sg tables in an mmu_invalidate_range or
> > > shrink_slab callback we would run into a lockdep horror.
> > So I'm no expert on this, but I think this is exactly what we're doing
> > in i915. Kinda no other way to actually free the memory without
> > throwing all the nice isolation aspects of an IOMMU into the wind. Can
> > you please paste the lockdeps you've seen with amdgpu when trying to
> > do that?
> 
> Taking a quick look at i915 I can definitely say that what you guys do here
> is actually quite buggy.

Note that there are 2 paths for i915 userptr. One is the mmu notifier, the
other one is the root-only hack we have for dubious reasons (or rather, one
whose point I really don't see myself).

> For coherent usage you need to install some loc

Re: [Linaro-mm-sig] [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-20 Thread Daniel Vetter
On Mon, Mar 19, 2018 at 5:23 PM, Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:
> Am 19.03.2018 um 16:53 schrieb Chris Wilson:
>>
>> Quoting Christian König (2018-03-16 14:22:32)
>> [snip, probably lost too much context]
>>>
>>> This allows for full grown pipelining, e.g. the exporter can say I need
>>> to move the buffer for some operation. Then let the move operation wait
>>> for all existing fences in the reservation object and install the fence
>>> of the move operation as exclusive fence.
>>
>> Ok, the situation I have in mind is the non-pipelined case: revoking
>> dma-buf for mmu_invalidate_range or shrink_slab. I would need a
>> completion event that can be waited on the cpu for all the invalidate
>> callbacks. (Essentially an atomic_t counter plus struct completion; a
>> lighter version of dma_fence, I wonder where I've seen that before ;)
>
>
> Actually that is harmless.
>
> When you need to unmap a DMA-buf because of mmu_invalidate_range or
> shrink_slab you need to wait for its reservation object anyway.

reservation_object only prevents adding new fences, you still have to
wait for all the current ones to signal. Also, we have dma-access
without fences in i915. "I hold the reservation_object" does not imply
you can just go and nuke the backing storage.
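
(I.e., something like the following before nuking anything, with resv
being the buffer's reservation object -- a sketch of the wait Christian
points at below with reservation_object_wait_timeout_rcu():)

        long ret;

        /* Wait for *all* fences, shared and exclusive, without holding
         * the reservation lock. Interruptible, unbounded wait. */
        ret = reservation_object_wait_timeout_rcu(resv, true /* wait_all */,
                                                  true /* intr */,
                                                  MAX_SCHEDULE_TIMEOUT);
        if (ret < 0)
                return ret;     /* interrupted by a signal */
        /* ret > 0: all fences signaled, the backing memory is idle */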

> This needs to be done to make sure that the backing memory is now idle; it
> doesn't matter if the jobs were submitted by DMA-buf importers or your own
> driver.
>
> The sg tables pointing to the now released memory might live a bit longer,
> but that is unproblematic and actually intended.

I think that's very problematic. One reason for an IOMMU is that you
have device access isolation, and a broken device can't access memory
it shouldn't be able to access. From that security-in-depth point of
view it's not cool that there's some sg tables hanging around still
that a broken GPU could use. And let's not pretend hw is perfect,
especially GPUs :-)

> If we tried to destroy the sg tables in an mmu_invalidate_range or
> shrink_slab callback we would run into a lockdep horror.

So I'm no expert on this, but I think this is exactly what we're doing
in i915. Kinda no other way to actually free the memory without
throwing all the nice isolation aspects of an IOMMU into the wind. Can
you please paste the lockdeps you've seen with amdgpu when trying to
do that?
-Daniel

>
> Regards,
> Christian.
>
>>
>> Even so, it basically means passing a fence object down to the async
>> callbacks for them to signal when they are complete. Just to handle the
>> non-pipelined version. :|
>> -Chris



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: RFC: unpinned DMA-buf exporting v2

2018-03-19 Thread Daniel Vetter
On Fri, Mar 16, 2018 at 02:20:44PM +0100, Christian König wrote:
> Hi everybody,
> 
> since I've got positive feedback from Daniel I continued working on this 
> approach.
> 
> A few issues are still open:
> 1. Daniel suggested that I make the invalidate_mappings callback a parameter 
> of dma_buf_attach().
> 
> This approach unfortunately won't work because when the attachment is
> created the importer is not necessarily ready to handle invalidation
> events.

Why do you have this constraint? This sounds a bit like inverted
create/teardown sequence troubles, where you make an object "live" before
the thing is fully set up.

Can't we fix this by creating the entire ttm scaffolding you'll need for a
dma-buf upfront, and only grabbing the dma_buf attachment once you have
everything? At that point you really should be able to evict buffers
again.

Not requiring invalidate_mapping to be set together with the attachment
means we can't ever require importers to support it (e.g. to address your
concern with the userspace dma-buf userptr magic).

> E.g. in the amdgpu example we first need to set up the imported GEM/TTM
> objects and install that in the attachment.
> 
> My solution is to introduce a separate function to grab the locks and
> set the callback; this function could then be used to pin the buffer
> later on if that turns out to be necessary after all.
> 
> 2. With my example setup this currently results in a ping/pong situation
> because the exporter prefers a VRAM placement while the importer prefers
> a GTT placement.
> 
> This results in quite a performance drop, but can be fixed by a simple
> mesa patch which allows shared BOs to be placed in both VRAM and GTT.
> 
> Question is what should we do in the meantime? Accept the performance
> drop or only allow unpinned sharing with new Mesa?

Maybe the exporter should not try to move stuff back into VRAM as long as
there's an active dma-buf? I mean it's really cool that it works, but
maybe let's just do this for a tech demo :-)

Of course if it then runs out of TT then it could still try to move it
back in. And "let's not move it when it's imported" is probably too
simplistic as well, and will need to be improved again with more heuristics,
but it would at least get things off the ground.

Long term you might want to move perhaps once per 10 seconds or so, to get
idle importers to detach. Adjust 10s to match whatever benchmark/workload
you care about.
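
(A minimal sketch of such a throttle, with made-up my_bo fields; the
attachments list would of course need proper locking:)

#include <linux/jiffies.h>

static bool my_bo_may_move_to_vram(struct my_bo *bo)
{
        /* While an importer is attached, move back into VRAM at most
         * once per 10 seconds so idle importers get a chance to detach
         * before the next ping-pong. */
        if (bo->dmabuf && !list_empty(&bo->dmabuf->attachments) &&
            time_before(jiffies, bo->last_vram_move + 10 * HZ))
                return false;

        bo->last_vram_move = jiffies;
        return true;
}
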
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 1/5] dma-buf: add optional invalidate_mappings callback v2

2018-03-19 Thread Daniel Vetter
> +void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
> +  void (*cb)(struct dma_buf_attachment *))
> +{
> + reservation_object_lock(attach->dmabuf->resv, NULL);
> + attach->invalidate_mappings = cb;
> + reservation_object_unlock(attach->dmabuf->resv);
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_set_invalidate_callback);
> +
> +/**
> + * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
> + *
> + * @dmabuf:  [in]buffer which mappings should be invalidated
> + *
> + * Informs all attachments that they need to destroy and recreate all their
> + * mappings.
> + */
> +void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
> +{
> + struct dma_buf_attachment *attach;
> +
> + reservation_object_assert_held(dmabuf->resv);
> +
> + list_for_each_entry(attach, &dmabuf->attachments, node)
> + if (attach->invalidate_mappings)
> + attach->invalidate_mappings(attach);
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
> +
>  /**
>   * DOC: cpu access
>   *
> @@ -1121,10 +1179,12 @@ static int dma_buf_debug_show(struct seq_file *s, 
> void *unused)
>   seq_puts(s, "\tAttached Devices:\n");
>   attach_count = 0;
>  
> + reservation_object_lock(buf_obj->resv, NULL);
>   list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
>   seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
>   attach_count++;
>   }
> + reservation_object_unlock(buf_obj->resv);
>  
>   seq_printf(s, "Total %d devices attached\n\n",
>   attach_count);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 085db2fee2d7..70c65fcfe1e3 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -91,6 +91,18 @@ struct dma_buf_ops {
>*/
>   void (*detach)(struct dma_buf *, struct dma_buf_attachment *);
>  
> + /**
> +  * @supports_mapping_invalidation:
> +  *
> +  * True for exporters which support unpinned DMA-buf operation using
> +  * the reservation lock.
> +  *
> +  * When attachment->invalidate_mappings is set the @map_dma_buf and
> +  * @unmap_dma_buf callbacks can be called with the reservation lock
> +  * held.
> +  */
> + bool supports_mapping_invalidation;
> +
>   /**
>* @map_dma_buf:
>*
> @@ -326,6 +338,29 @@ struct dma_buf_attachment {
>   struct device *dev;
>   struct list_head node;
>   void *priv;
> +
> + /**
> +  * @invalidate_mappings:
> +  *
> +  * Optional callback provided by the importer of the attachment which
> +  * must be set before mappings are created.
> +  *
> +  * If provided the exporter can avoid pinning the backing store while
> +  * mappings exist.
> +  *
> +  * The function is called with the lock of the reservation object
> +  * associated with the dma_buf held and the mapping function must be
> +  * called with this lock held as well. This makes sure that no mapping
> +  * is created concurrently with an ongoing invalidation.
> +  *
> +  * After the callback all existing mappings are still valid until all
> +  * fences in the dma_buf's reservation object are signaled, but should be
> +  * destroyed by the importer as soon as possible.
> +  *
> +  * New mappings can be created immediately, but can't be used before the
> +  * exclusive fence in the dma_buf's reservation object is signaled.
> +  */
> + void (*invalidate_mappings)(struct dma_buf_attachment *attach);
>  };
>  
>  /**
> @@ -391,6 +426,9 @@ struct sg_table *dma_buf_map_attachment(struct 
> dma_buf_attachment *,
>   enum dma_data_direction);
>  void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
>   enum dma_data_direction);
> +void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
> +  void (*cb)(struct dma_buf_attachment *));
> +void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
>  int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
>enum dma_data_direction dir);
>  int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
> -- 
> 2.14.1
> 
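
(To illustrate the intended use from the importer side -- a sketch against
the API proposed above; my_import, its sgt_stale flag and its unmap worker
are made up:)

static void my_importer_invalidate(struct dma_buf_attachment *attach)
{
        struct my_import *imp = attach->priv;

        /* Called with attach->dmabuf->resv held by the exporter. Per the
         * kerneldoc above the existing mapping stays valid until the
         * fences signal, so just mark it stale and let a worker wait for
         * the fences and then call dma_buf_unmap_attachment(). */
        imp->sgt_stale = true;
        schedule_work(&imp->unmap_work);
}

static int my_importer_attach(struct my_import *imp, struct dma_buf *buf)
{
        imp->attach = dma_buf_attach(buf, imp->dev);
        if (IS_ERR(imp->attach))
                return PTR_ERR(imp->attach);

        imp->attach->priv = imp;
        /* Must be set before any mapping is created. */
        dma_buf_set_invalidate_callback(imp->attach, my_importer_invalidate);
        return 0;
}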

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback

2018-03-15 Thread Daniel Vetter
On Thu, Mar 15, 2018 at 10:56 AM, Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:
> Am 15.03.2018 um 10:20 schrieb Daniel Vetter:
>>
>> On Tue, Mar 13, 2018 at 06:20:07PM +0100, Christian König wrote:
>> [SNIP]
>> Take a look at the DOT graphs for atomic I've done a while ago. I think we
>> could make a formidable competition for who's doing the worst diagrams :-)
>
>
> Thanks, going to give that a try.
>
>> [SNIP]
>> amdgpu: Expects that you never hold any of the heavyweight locks while
>> waiting for a fence (since gpu resets will need them).
>>
>> i915: Happily blocks on fences while holding all kinds of locks, expects
>> gpu reset to be able to recover even in this case.
>
>
> In this case I can comfort you: the locks amdgpu needs to grab during GPU
> reset are the reservation locks of the VM page tables. I have strong doubts
> that i915 will ever hold those.

Ah good, means that very likely there's at least no huge fundamental
design issue that we run into.

> Could be that we run into problems because thread A holds lock 1 and tries
> to take lock 2, then i915 holds 2 and our reset path needs 1.

Yeah that might happen, but lockdep will catch those, and generally
those cases can be fixed with slight reordering or re-annotating of
the code to avoid upsetting lockdep. As long as we don't have a
full-on functional dependency (which is what I've feared).

>> [SNIP]
>>>
>>> Yes, except for fallback paths and bootup self tests we simply never wait
>>> for fences while holding locks.
>>
>> That's not what I meant with "are you sure". Did you enable the
>> cross-release stuff (after patching the bunch of leftover core kernel
>> issues still present), annotate dma_fence with the cross-release stuff,
>> run a bunch of multi-driver (amdgpu vs i915) dma-buf sharing tests and
>> weep?
>
>
> Ok, what exactly do you mean with cross-release checking?

Current lockdep doesn't spot deadlocks like the below:

thread A: holds mutex, waiting for completion.

thread B: acquires mutex before it will ever signal the completion A
is waiting for

->deadlock

cross-release lockdep support can catch these through new fancy
annotations. Similar waiter/signaller annotations exists for waiting
on workers and anything else, and it would be a perfect fit for
waiter/signaller code around dma_fence.

lwn has you covered as usual: https://lwn.net/Articles/709849/
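
The toy version of the deadlock above, for the record (classic lockdep
never connects the mutex to the completion; cross-release does):

#include <linux/completion.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(lock);
static DECLARE_COMPLETION(done);

static void thread_a(void)
{
        mutex_lock(&lock);
        wait_for_completion(&done);     /* blocks forever ... */
        mutex_unlock(&lock);
}

static void thread_b(void)
{
        mutex_lock(&lock);      /* ... because B never gets past here */
        /* the work that would eventually signal A */
        mutex_unlock(&lock);
        complete(&done);
}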

Cheers, Daniel

>> I didn't do the full thing yet, but just within i915 we've found tons of
>> small little deadlocks we never really considered thanks to cross release,
>> and that wasn't even including the dma_fence annotation. Luckily nothing
>> that needed a full-on driver redesign.
>>
>> I guess I need to ping core kernel maintainers about cross-release again.
>> I'd much prefer if we could validate ->invalidate_mapping and the
>> locking/fence dependency issues using that, instead of me having to read
>> and understand all the drivers.
>
> [SNIP]
>>
>> I fear that with the ->invalidate_mapping callback (which inverts the
>> control flow between importer and exporter) and tying dma_fences into all
>> this it will be a _lot_ worse. And I'm definitely too stupid to understand
>> all the dependency chains without the aid of lockdep and a full test suite
>> (we have a bunch of amdgpu/i915 dma-buf tests in igt btw).
>
>
> Yes, that is also something I worry about.
>
> Regards,
> Christian.



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback

2018-03-15 Thread Daniel Vetter
On Tue, Mar 13, 2018 at 06:20:07PM +0100, Christian König wrote:
> Am 13.03.2018 um 17:00 schrieb Daniel Vetter:
> > On Tue, Mar 13, 2018 at 04:52:02PM +0100, Christian König wrote:
> > > Am 13.03.2018 um 16:17 schrieb Daniel Vetter:
> > > [SNIP]
> > Ok, so plan is to support fully pipeline moves and everything, with the
> > old sg tables lazily cleaned up. I was thinking more about evicting stuff
> > and throwing it out, where there's not going to be any new sg list but the
> > object is going to be swapped out.
> 
> Yes, exactly. Well my example was the unlikely case when the object is
> swapped out and immediately swapped in again because somebody needs it.
> 
> > 
> > I think some state flow charts (we can do SVG or DOT) in the kerneldoc
> > would be sweet.
> Yeah, probably a good idea.
> 
> Sounds good and I find it great that you're volunteering for that :D
> 
> Ok seriously, my drawing capabilities are a bit underdeveloped. So I would
> prefer if somebody could at least help with that.

Take a look at the DOT graphs for atomic I've done a while ago. I think we
could make a formidable competition for who's doing the worst diagrams :-)

> > > > Re GPU might cause a deadlock: Isn't that already a problem if you hold
> > > > reservations of buffers used on other gpus, which want those
> > > > reservations to complete the gpu reset, but that gpu reset blocks some
> > > > fence that the reservation holder is waiting for?
> > > Correct, that's why amdgpu and TTM tries quite hard to never wait for a
> > > fence while a reservation object is locked.
> > We might have a fairly huge mismatch of expectations here :-/
> 
> What do you mean with that?

i915 expects that other drivers don't have this requirement. Our gpu reset
can proceed even if it's all locked down.

> > > The only use case I haven't fixed so far is reaping deleted objects during
> > > eviction, but that is only a matter of my free time to fix it.
> > Yeah, this is the hard one.
> 
> Actually it isn't so hard, it's just that I haven't had time so far to clean
> it up and we never hit that issue so far during our reset testing.
> 
> The main missing point is just a bit of functionality in the reservation
> object, and Chris and I already had a good idea how to implement that.
> 
> > In general the assumption is that dma_fence will get signalled no matter
> > what you're doing, assuming the only thing you need is to not block
> > interrupts. The i915 gpu reset logic to make that work is a bit a work of
> > art ...
> 
> Correct, but I don't understand why that is so hard on i915? Our GPU
> scheduler makes all of that rather trivial, e.g. fences either signal
> correctly or are aborted and set as erroneous after a timeout.

Yes, i915 does the same. It's the locking requirement we disagree on; i915
can reset while holding locks. I think right now we don't reset while
holding reservation locks, but only while holding our own locks. I think
cross-release would help us model this and uncover all the funny
dependency loops we have.

The issue I'm seeing:

amdgpu: Expects that you never hold any of the heavyweight locks while
waiting for a fence (since gpu resets will need them).

i915: Happily blocks on fences while holding all kinds of locks, expects
gpu reset to be able to recover even in this case.

Both drivers either complete the fence (with or without setting the error
status to EIO or something like that), that's not the difference. The work
of art I referenced is how we managed to complete gpu reset (including
resubmitting) while holding plenty of locks.

> > If we expect amdgpu and i915 to cooperate with shared buffers I guess one
> > has to give in. No idea how to do that best.
> 
> Again at least from amdgpu side I don't see much of an issue with that. So
> what exactly do you have in mind here?
> 
> > > > We have tons of fun with deadlocks against GPU resets, and lots of
> > > > testcases, and I kinda get the impression amdgpu is throwing a lot of
> > > > issues under the rug through trylock tricks that shut up lockdep, but
> > > > don't fix much really.
> > > Hui? Why do you think that? The only trylock I'm aware of is during
> > > eviction and there it isn't a problem.
> > mmap fault handler had one too last time I looked, and it smelled fishy.
> 
> Good point, never wrapped my head fully around that one either.
> 
> > > > btw adding cross-release lockdep annotations for fences will probably
> > > > turn up _lots_ more bugs in this area.
> > > At least for amdgpu that should be handled by now.
>

Re: [RfC PATCH] Add udmabuf misc device

2018-03-13 Thread Daniel Vetter
uf->pages);
> + kfree(ubuf);
> +err_free_iovs:
> + kfree(iovs);
> + return ret;
> +}
> +
> +static long udmabuf_ioctl(struct file *filp, unsigned int ioctl,
> +   unsigned long arg)
> +{
> + long ret;
> +
> + switch (ioctl) {
> + case UDMABUF_CREATE:
> + ret = udmabuf_ioctl_create(filp, arg);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static const struct file_operations udmabuf_fops = {
> + .owner  = THIS_MODULE,
> + .unlocked_ioctl = udmabuf_ioctl,
> +};
> +
> +static struct miscdevice udmabuf_misc = {
> + .minor  = MISC_DYNAMIC_MINOR,
> + .name   = "udmabuf",
> + .fops   = &udmabuf_fops,
> +};
> +
> +static int __init udmabuf_dev_init(void)
> +{
> + int ret;
> +
> + ret = misc_register(&udmabuf_misc);
> + if (ret)
> + return ret;
> +
> + return 0;
> +}
> +
> +static void __exit udmabuf_dev_exit(void)
> +{
> + misc_deregister(&udmabuf_misc);
> +}
> +
> +module_init(udmabuf_dev_init)
> +module_exit(udmabuf_dev_exit)
> +
> +MODULE_LICENSE("GPL v2");
> diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
> index ed3b785bae..5876b52554 100644
> --- a/drivers/dma-buf/Kconfig
> +++ b/drivers/dma-buf/Kconfig
> @@ -30,4 +30,11 @@ config SW_SYNC
> WARNING: improper use of this can result in deadlocking kernel
> drivers from userspace. Intended for test and debug only.
>  
> +config UDMABUF
> + tristate "userspace dmabuf misc driver"
> + default n
> + depends on DMA_SHARED_BUFFER
> + ---help---
> +   A driver to let userspace turn iovs into dma-bufs.
> +
>  endmenu
> diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
> index c33bf88631..0913a6ccab 100644
> --- a/drivers/dma-buf/Makefile
> +++ b/drivers/dma-buf/Makefile
> @@ -1,3 +1,4 @@
>  obj-y := dma-buf.o dma-fence.o dma-fence-array.o reservation.o seqno-fence.o
>  obj-$(CONFIG_SYNC_FILE)  += sync_file.o
>  obj-$(CONFIG_SW_SYNC)+= sw_sync.o sync_debug.o
> +obj-$(CONFIG_UDMABUF)+= udmabuf.o
> -- 
> 2.9.3
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback

2018-03-13 Thread Daniel Vetter
On Tue, Mar 13, 2018 at 04:52:02PM +0100, Christian König wrote:
> Am 13.03.2018 um 16:17 schrieb Daniel Vetter:
> > [SNIP]
> > > > I think a helper which both unmaps _and_ waits for all the fences to
> > > > clear would be best, with some guarantees that it'll either fail or all
> > > > the mappings _will_ be gone. The locking for that one will be hilarious,
> > > > since we need to figure out dmabuf->lock vs. the reservation. I kinda
> > > > prefer we throw away the dmabuf->lock and supersede it entirely by the
> > > > reservation lock.
> > > Big NAK on that. The whole API is asynchronous, e.g. we never block for
> > > any operation to finish.
> > > 
> > > Otherwise you run into big trouble with cross device GPU resets and stuff
> > > like that.
> > But how will the unmapping work then? You can't throw the sg list away
> > before the dma stopped. The dma only stops once the fence is signalled.
> > The importer can't call dma_buf_detach because the reservation lock is
> > hogged already by the exporter trying to unmap everything.
> > 
> > How is this supposed to work?
> 
> Even after invalidation the sg list stays alive until it is explicitly
> destroyed by the importer using dma_buf_unmap_attachment() which in turn is
> only allowed after all fences have signaled.
> 
> The implementation is in ttm_bo_pipeline_gutting(), basically we use the
> same functionality as for pipelined moves/evictions which hangs the old
> backing store on a dummy object and destroys it after all fences signaled.
> 
> While the old sg list is still about to be destroyed the importer can
> request a new sg list for the new location of the DMA-buf using
> dma_buf_map_attachment(). This new location becomes valid after the move
> fence in the reservation object is signaled.
> 
> So from the CPU point of view multiple sg lists could exist at the same time,
> which allows us to have a seamless transition from the old to the new
> location from the GPU point of view.

Ok, so plan is to support fully pipeline moves and everything, with the
old sg tables lazily cleaned up. I was thinking more about evicting stuff
and throwing it out, where there's not going to be any new sg list but the
object is going to be swapped out.

I think some state flow charts (we can do SVG or DOT) in the kerneldoc
would be sweet.
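
(From the importer's point of view the pipelined case then looks roughly
like this -- a simplified, synchronous sketch of the rules Christian
describes, with attach and old_sgt assumed, error handling elided:)

        struct sg_table *new_sgt;

        /* A new mapping can be created right after the invalidation, but
         * must not be used before the exclusive (move) fence in the
         * reservation object has signaled. */
        new_sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

        /* The old mapping stays valid until every fence has signaled;
         * only then may it be unmapped. */
        reservation_object_wait_timeout_rcu(attach->dmabuf->resv,
                                            true, false,
                                            MAX_SCHEDULE_TIMEOUT);
        dma_buf_unmap_attachment(attach, old_sgt, DMA_BIDIRECTIONAL);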

> > Re GPU might cause a deadlock: Isn't that already a problem if you hold
> > reservations of buffers used on other gpus, which want those reservations
> > to complete the gpu reset, but that gpu reset blocks some fence that the
> > reservation holder is waiting for?
> 
> Correct, that's why amdgpu and TTM tries quite hard to never wait for a
> fence while a reservation object is locked.

We might have a fairly huge mismatch of expectations here :-/

> The only use case I haven't fixed so far is reaping deleted objects during
> eviction, but that is only a matter of my free time to fix it.

Yeah, this is the hard one.

In general the assumption is that dma_fence will get signalled no matter
what you're doing, assuming the only thing you need is to not block
interrupts. The i915 gpu reset logic to make that work is a bit a work of
art ...

If we expect amdgpu and i915 to cooperate with shared buffers I guess one
has to give in. No idea how to do that best.

> > We have tons of fun with deadlocks against GPU resets, and lots of
> > testcases, and I kinda get the impression amdgpu is throwing a lot of
> > issues under the rug through trylock tricks that shut up lockdep, but
> > don't fix much really.
> 
> Hui? Why do you think that? The only trylock I'm aware of is during eviction
> and there it isn't a problem.

mmap fault handler had one too last time I looked, and it smelled fishy.

> > btw adding cross-release lockdep annotations for fences will probably turn
> > up _lots_ more bugs in this area.
> 
> At least for amdgpu that should be handled by now.

You're sure? :-)

Trouble is that cross-release wasn't even ever enabled, much less anyone
typed the dma_fence annotations. And just cross-release alone turned up
_lots_ of deadlocks in i915 between fences, async workers (userptr, gpu
reset) and core mm stuff.

I'd be seriously surprised if it wouldn't find an entire rats nest of
issues around dma_fence once we enable it.
-Daniel

> > > > > +  *
> > > > > +  * New mappings can be created immediately, but can't be used before the
> > > > > +  * exclusive fence in the dma_buf's reservation object is signaled.
> > > > > +  */

Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback

2018-03-13 Thread Daniel Vetter
On Mon, Mar 12, 2018 at 08:13:15PM +0100, Christian König wrote:
> Am 12.03.2018 um 18:07 schrieb Daniel Vetter:
> > On Fri, Mar 09, 2018 at 08:11:41PM +0100, Christian König wrote:
> > > [SNIP]
> > > +/**
> > > + * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
> > > + *
> > > + * @dmabuf:  [in]buffer which mappings should be invalidated
> > > + *
> > > + * Informs all attachments that they need to destroy and recreate all
> > > + * their mappings.
> > > + */
> > > +void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
> > > +{
> > > + struct dma_buf_attachment *attach;
> > > +
> > > + reservation_object_assert_held(dmabuf->resv);
> > > +
> > > + list_for_each_entry(attach, &dmabuf->attachments, node)
> > > + attach->invalidate_mappings(attach);
> > To make the locking work I think we also need to require importers to hold
> > the reservation object while attaching/detaching. Otherwise the list walk
> > above could go boom.
> 
> Oh, good point. Going, to fix this.
> 
> > [SNIP]
> > > + /**
> > > +  * @supports_mapping_invalidation:
> > > +  *
> > > +  * True for exporters which support unpinned DMA-buf operation using
> > > +  * the reservation lock.
> > > +  *
> > > +  * When attachment->invalidate_mappings is set the @map_dma_buf and
> > > +  * @unmap_dma_buf callbacks can be called with the reservation lock
> > > +  * held.
> > > +  */
> > > + bool supports_mapping_invalidation;
> > Why do we need this? Importer could simply always register with the
> > invalidate_mapping hook registered, and exporters could use it when they
> > see fit. That gives us more lockdep coverage to make sure importers use
> > their attachment callbacks correctly (aka they hold the reservation
> > object).
> 
> One sole reason: backward compatibility.
> 
> I didn't want to audit all those different drivers if they can handle
> being called with the reservation lock held.
> 
> > 
> > > +
> > >   /**
> > >* @map_dma_buf:
> > >*
> > > @@ -326,6 +338,29 @@ struct dma_buf_attachment {
> > >   struct device *dev;
> > >   struct list_head node;
> > >   void *priv;
> > > +
> > > + /**
> > > +  * @invalidate_mappings:
> > > +  *
> > > +  * Optional callback provided by the importer of the attachment which
> > > +  * must be set before mappings are created.
> > This doesn't work, it must be set before the attachment is created,
> > otherwise you race with your invalidate callback.
> 
> Another good point.
> 
> > 
> > I think the simplest option would be to add a new dma_buf_attach_dynamic
> > (well, with a less crappy name).
> 
> Well how about adding an optional invalidate_mappings parameter to the
> existing dma_buf_attach?

Not sure that's best, it might confuse dumb importers and you need to
change all the callers. But up to you.

> > > +  *
> > > +  * If provided the exporter can avoid pinning the backing store while
> > > +  * mappings exist.
> > > +  *
> > > +  * The function is called with the lock of the reservation object
> > > +  * associated with the dma_buf held and the mapping function must be
> > > +  * called with this lock held as well. This makes sure that no mapping
> > > +  * is created concurrently with an ongoing invalidation.
> > > +  *
> > > +  * After the callback all existing mappings are still valid until all
> > > +  * fences in the dma_buf's reservation object are signaled, but should be
> > > +  * destroyed by the importer as soon as possible.
> > Do we guarantee that the importer will attach a fence, after which the
> > mapping will be gone? What about re-trying? Or just best effort (i.e. only
> > useful for evicting to try to make room).
> 
> The importer should attach fences for all its operations with the DMA-buf.
> 
> > I think a helper which both unmaps _and_ waits for all the fences to clear
> > would be best, with some guarantees that it'll either fail or all the
> > mappings _will_ be gone. The locking for that one will be hilarious, since
> > we need to figure out dmabuf->lock vs. the reservation. I kinda prefer we
> > throw away the dmabuf->lock and supersede it entirely by the reservation
> > lock.
> 
> Big NAK on that. The whole API is asynchronous, e.g. we never block for

Re: RFC: unpinned DMA-buf exporting

2018-03-12 Thread Daniel Vetter
On Mon, Mar 12, 2018 at 8:15 PM, Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:
> Am 12.03.2018 um 18:24 schrieb Daniel Vetter:
>>
>> On Fri, Mar 09, 2018 at 08:11:40PM +0100, Christian König wrote:
>>>
>>> This set of patches adds an optional invalidate_mappings callback to each
>>> DMA-buf attachment which can be filled in by the importer.
>>>
>>> This callback allows the exporter to provide the DMA-buf content
>>> without pinning it. The reservation object's lock acts as synchronization
>>> point for buffer moves and creating mappings.
>>>
>>> This set includes an implementation for amdgpu which should be rather
>>> easily portable to other DRM drivers.
>>
>> Bunch of higher level comments, and one I've forgotten in reply to patch
>> 1:
>>
>> - What happens when a dma-buf is pinned (e.g. i915 loves to pin buffers
>>   for scanout)?
>
>
> When you need to pin an imported DMA-buf you need to detach and reattach
> without the invalidate_mappings callback.

I think that must both be better documented, and also somehow enforced
with checks. Atm nothing makes sure you actually manage to unmap if
you claim to be able to do so.

I think a helper to switch from pinned to unpinned would be lovely
(just need to clear/reset the ->invalidate_mapping pointer while
holding the reservation). Or do you expect to map buffers differently
depending whether you can move them or not? At least for i915 we'd
need to rework our driver quite a bit if you expect us to throw the
mapping away just to be able to pin it. Atm pinning requires that it's
mapped already (and depending upon platform the gpu might be using
that exact mapping to render, so unmapping for pinning is a bad idea
for us).
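
Something like this, say (a hypothetical helper against the proposed API,
roughly what "clear the pointer while holding the reservation" means):

static void dma_buf_attachment_make_pinned(struct dma_buf_attachment *attach)
{
        reservation_object_lock(attach->dmabuf->resv, NULL);
        /* From here on the exporter has to pin the backing store again;
         * existing mappings stay valid, which is what i915 scanout
         * needs. */
        attach->invalidate_mappings = NULL;
        reservation_object_unlock(attach->dmabuf->resv);
}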

>> - pulling the dma-buf implementations into amdgpu makes sense, that's
>>   kinda how it was meant to be anyway. The gem prime helpers are a bit too
>>   much midlayer for my taste (mostly because nvidia wanted to bypass the
>>   EXPORT_SYMBOL_GPL of core dma-buf, hooray for legal bs). We can always
>>   extract more helpers once there's more ttm based drivers doing this.
>
>
> Yeah, I thought to abstract that similarly to the AGP backend.
>
> Just moving some callbacks around in TTM should be sufficient to de-midlayer
> the whole thing.

Yeah TTM has all the abstractions needed to handle dma-bufs
"properly", it's just sometimes at the wrong level or can't be
overridden. At least per my understanding of TTM (which is most likely
... confused).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: RFC: unpinned DMA-buf exporting

2018-03-12 Thread Daniel Vetter
On Fri, Mar 09, 2018 at 08:11:40PM +0100, Christian König wrote:
> This set of patches adds an optional invalidate_mappings callback to each
> DMA-buf attachment which can be filled in by the importer.
> 
> This callback allows the exporter to provide the DMA-buf content
> without pinning it. The reservation object's lock acts as synchronization
> point for buffer moves and creating mappings.
> 
> This set includes an implementation for amdgpu which should be rather
> easily portable to other DRM drivers.

Bunch of higher level comments, and one I've forgotten in reply to patch
1:

- What happens when a dma-buf is pinned (e.g. i915 loves to pin buffers
  for scanout)?

- pulling the dma-buf implementations into amdgpu makes sense, that's
  kinda how it was meant to be anyway. The gem prime helpers are a bit too
  much midlayer for my taste (mostly because nvidia wanted to bypass the
  EXPORT_SYMBOL_GPL of core dma-buf, hooray for legal bs). We can always
  extract more helpers once there's more ttm based drivers doing this.

Overall I like, there's some details to figure out first.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 1/4] dma-buf: add optional invalidate_mappings callback

2018-03-12 Thread Daniel Vetter
Do we guarantee that the importer will attach a fence, after which the
mapping will be gone? What about re-trying? Or just best effort (i.e. only
useful for evicting to try to make room).

I think a helper which both unmaps _and_ waits for all the fences to clear
would be best, with some guarantees that it'll either fail or all the
mappings _will_ be gone. The locking for that one will be hilarious, since
we need to figure out dmabuf->lock vs. the reservation. I kinda prefer we
throw away the dmabuf->lock and supersede it entirely by the reservation
lock.


> +  *
> +  * New mappings can be created immediately, but can't be used before the
> +  * exclusive fence in the dma_buf's reservation object is signaled.
> +  */
> + void (*invalidate_mappings)(struct dma_buf_attachment *attach);

Bunch of questions about exact semantics, but I very much like this. And I
think besides those technical details, the overall approach seems sound.
-Daniel

>  };
>  
>  /**
> @@ -391,6 +426,7 @@ struct sg_table *dma_buf_map_attachment(struct 
> dma_buf_attachment *,
>   enum dma_data_direction);
>  void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
>   enum dma_data_direction);
> +void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
>  int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
>enum dma_data_direction dir);
>  int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
> -- 
> 2.14.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 1/3] dma-buf: make returning the exclusive fence optional

2018-01-17 Thread Daniel Vetter
On Tue, Jan 16, 2018 at 11:14:26AM +0100, Christian König wrote:
> Ping? Daniel you requested the patch with its user.

Acked-by: Daniel Vetter <daniel.vet...@ffwll.ch>

but might be good to get a review from one of the usual reservation stuff
folks.
-Daniel

> 
> Would be nice if I could commit this since we need it for debugging and
> cleaning up a bunch of other things as well.
> 
> Regards,
> Christian.
> 
> Am 12.01.2018 um 10:47 schrieb Christian König:
> > Change reservation_object_get_fences_rcu to make the exclusive fence
> > pointer optional.
> > 
> > If not specified the exclusive fence is put into the fence array as
> > well.
> > 
> > This is helpful for a couple of cases where we need all fences in a
> > single array.
> > 
> > Signed-off-by: Christian König <christian.koe...@amd.com>
> > ---
> >   drivers/dma-buf/reservation.c | 31 ++-
> >   1 file changed, 22 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> > index b759a569b7b8..461afa9febd4 100644
> > --- a/drivers/dma-buf/reservation.c
> > +++ b/drivers/dma-buf/reservation.c
> > @@ -374,8 +374,9 @@ EXPORT_SYMBOL(reservation_object_copy_fences);
> >* @pshared: the array of shared fence ptrs returned (array is krealloc'd to
> >* the required size, and must be freed by caller)
> >*
> > - * RETURNS
> > - * Zero or -errno
> > + * Retrieve all fences from the reservation object. If the pointer for the
> > + * exclusive fence is not specified the fence is put into the array of the
> > + * shared fences as well. Returns either zero or -ENOMEM.
> >*/
> >   int reservation_object_get_fences_rcu(struct reservation_object *obj,
> >   struct dma_fence **pfence_excl,
> > @@ -389,8 +390,8 @@ int reservation_object_get_fences_rcu(struct 
> > reservation_object *obj,
> > do {
> > struct reservation_object_list *fobj;
> > -   unsigned seq;
> > -   unsigned int i;
> > +   unsigned int i, seq;
> > +   size_t sz = 0;
> > shared_count = i = 0;
> > @@ -402,9 +403,14 @@ int reservation_object_get_fences_rcu(struct 
> > reservation_object *obj,
> > goto unlock;
> > fobj = rcu_dereference(obj->fence);
> > -   if (fobj) {
> > +   if (fobj)
> > +   sz += sizeof(*shared) * fobj->shared_max;
> > +
> > +   if (!pfence_excl && fence_excl)
> > +   sz += sizeof(*shared);
> > +
> > +   if (sz) {
> > struct dma_fence **nshared;
> > -   size_t sz = sizeof(*shared) * fobj->shared_max;
> > nshared = krealloc(shared, sz,
> >GFP_NOWAIT | __GFP_NOWARN);
> > @@ -420,13 +426,19 @@ int reservation_object_get_fences_rcu(struct 
> > reservation_object *obj,
> > break;
> > }
> > shared = nshared;
> > -   shared_count = fobj->shared_count;
> > -
> > +   shared_count = fobj ? fobj->shared_count : 0;
> > for (i = 0; i < shared_count; ++i) {
> > shared[i] = rcu_dereference(fobj->shared[i]);
> > if (!dma_fence_get_rcu(shared[i]))
> > break;
> > }
> > +
> > +   if (!pfence_excl && fence_excl) {
> > +   shared[i] = fence_excl;
> > +   fence_excl = NULL;
> > +   ++i;
> > +   ++shared_count;
> > +   }
> > }
> > if (i != shared_count || read_seqcount_retry(&obj->seq, seq)) {
> > @@ -448,7 +460,8 @@ int reservation_object_get_fences_rcu(struct 
> > reservation_object *obj,
> > *pshared_count = shared_count;
> > *pshared = shared;
> > -   *pfence_excl = fence_excl;
> > +   if (pfence_excl)
> > +   *pfence_excl = fence_excl;
> > return ret;
> >   }
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
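
For illustration, a minimal sketch of a caller that wants every fence in a
single array under the new convention (wait_all_fences() is a made-up helper;
assumes the patch above is applied, error handling trimmed):

#include <linux/dma-fence.h>
#include <linux/reservation.h>
#include <linux/slab.h>

/* Passing NULL for pfence_excl folds the exclusive fence into the
 * returned array, so one loop covers everything. */
static int wait_all_fences(struct reservation_object *obj)
{
	struct dma_fence **fences;
	unsigned int i, count;
	int ret;

	ret = reservation_object_get_fences_rcu(obj, NULL, &count, &fences);
	if (ret)
		return ret;

	for (i = 0; i < count; ++i) {
		long r = dma_fence_wait(fences[i], true);

		if (r < 0)
			ret = r;
		dma_fence_put(fences[i]);
	}
	kfree(fences);		/* the array was krealloc'd for the caller */
	return ret;
}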


Re: [PATCH] dma-buf/sw_sync: fix document of sw_sync_create_fence_data

2018-01-15 Thread Daniel Vetter
On Mon, Jan 15, 2018 at 11:47:59AM +0800, Shawn Guo wrote:
> The structure should really be sw_sync_create_fence_data rather than
> sw_sync_ioctl_create_fence which is the function name.
> 
> Signed-off-by: Shawn Guo <shawn@linaro.org>

Applied, thanks for your patch.
-Daniel

> ---
>  drivers/dma-buf/sw_sync.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
> index 24f83f9eeaed..7779bdbd18d1 100644
> --- a/drivers/dma-buf/sw_sync.c
> +++ b/drivers/dma-buf/sw_sync.c
> @@ -43,14 +43,14 @@
>   * timelines.
>   *
>   * Fences can be created with SW_SYNC_IOC_CREATE_FENCE ioctl with struct
> - * sw_sync_ioctl_create_fence as parameter.
> + * sw_sync_create_fence_data as parameter.
>   *
>   * To increment the timeline counter, SW_SYNC_IOC_INC ioctl should be used
>   * with the increment as u32. This will update the last signaled value
>   * from the timeline and signal any fence that has a seqno smaller or equal
>   * to it.
>   *
> - * struct sw_sync_ioctl_create_fence
> + * struct sw_sync_create_fence_data
>   * @value:   the seqno to initialise the fence with
>   * @name:the name of the new sync point
>   * @fence:   return the fd of the new sync_file with the created fence
> -- 
> 1.9.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
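
For context, a rough userspace sketch of the interface this comment documents
(struct layout and ioctl numbers as defined in sw_sync.c; sw_sync is a
debugfs-only test facility, so the path below depends on where debugfs is
mounted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct sw_sync_create_fence_data {
	__u32	value;		/* seqno to initialise the fence with */
	char	name[32];	/* name of the new sync point */
	__s32	fence;		/* returns the fd of the new sync_file */
};

#define SW_SYNC_IOC_MAGIC		'W'
#define SW_SYNC_IOC_CREATE_FENCE	_IOWR(SW_SYNC_IOC_MAGIC, 0, \
					      struct sw_sync_create_fence_data)
#define SW_SYNC_IOC_INC			_IOW(SW_SYNC_IOC_MAGIC, 1, __u32)

int main(void)
{
	int tl = open("/sys/kernel/debug/sync/sw_sync", O_RDWR);
	struct sw_sync_create_fence_data data = { .value = 1, .name = "test" };
	__u32 inc = 1;

	ioctl(tl, SW_SYNC_IOC_CREATE_FENCE, &data);	/* data.fence: new fd */
	ioctl(tl, SW_SYNC_IOC_INC, &inc);	/* signals every seqno <= 1 */
	return 0;
}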


Re: [Linaro-mm-sig] [PATCH] dma-buf: make returning the exclusive fence optional

2018-01-10 Thread Daniel Vetter
On Wed, Jan 10, 2018 at 02:46:32PM +0100, Christian König wrote:
> > On 10.01.2018 at 14:21, Daniel Vetter wrote:
> > On Wed, Jan 10, 2018 at 01:53:41PM +0100, Christian König wrote:
> > > Change reservation_object_get_fences_rcu to make the exclusive fence
> > > pointer optional.
> > > 
> > > If not specified the exclusive fence is put into the fence array as
> > > well.
> > > 
> > > This is helpful for a couple of cases where we need all fences in a
> > > single array.
> > > 
> > > Signed-off-by: Christian König <christian.koe...@amd.com>
> > Seeing the use-case for this would be a lot more interesting ...
> 
> Yeah, sorry the use case is a 20 patches set on amd-gfx.
> 
> Didn't want to post all those here as well.

Imo better to spam more lists instead of splitting up discussions ... It's
at least what we tend to do for i915 stuff, and no one seems to complain.
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > ---
> > >   drivers/dma-buf/reservation.c | 31 ++++++++++++++++++++++---------
> > >   1 file changed, 22 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> > > index b759a569b7b8..461afa9febd4 100644
> > > --- a/drivers/dma-buf/reservation.c
> > > +++ b/drivers/dma-buf/reservation.c
> > > @@ -374,8 +374,9 @@ EXPORT_SYMBOL(reservation_object_copy_fences);
> > >* @pshared: the array of shared fence ptrs returned (array is 
> > > krealloc'd to
> > >* the required size, and must be freed by caller)
> > >*
> > > - * RETURNS
> > > - * Zero or -errno
> > > + * Retrieve all fences from the reservation object. If the pointer for 
> > > the
> > > + * exclusive fence is not specified the fence is put into the array of 
> > > the
> > > + * shared fences as well. Returns either zero or -ENOMEM.
> > >*/
> > >   int reservation_object_get_fences_rcu(struct reservation_object *obj,
> > > struct dma_fence **pfence_excl,
> > > @@ -389,8 +390,8 @@ int reservation_object_get_fences_rcu(struct 
> > > reservation_object *obj,
> > >   do {
> > >   struct reservation_object_list *fobj;
> > > - unsigned seq;
> > > - unsigned int i;
> > > + unsigned int i, seq;
> > > + size_t sz = 0;
> > >   shared_count = i = 0;
> > > @@ -402,9 +403,14 @@ int reservation_object_get_fences_rcu(struct 
> > > reservation_object *obj,
> > >   goto unlock;
> > >   fobj = rcu_dereference(obj->fence);
> > > - if (fobj) {
> > > + if (fobj)
> > > + sz += sizeof(*shared) * fobj->shared_max;
> > > +
> > > + if (!pfence_excl && fence_excl)
> > > + sz += sizeof(*shared);
> > > +
> > > + if (sz) {
> > >   struct dma_fence **nshared;
> > > - size_t sz = sizeof(*shared) * fobj->shared_max;
> > >   nshared = krealloc(shared, sz,
> > >  GFP_NOWAIT | __GFP_NOWARN);
> > > @@ -420,13 +426,19 @@ int reservation_object_get_fences_rcu(struct 
> > > reservation_object *obj,
> > >   break;
> > >   }
> > >   shared = nshared;
> > > - shared_count = fobj->shared_count;
> > > -
> > > + shared_count = fobj ? fobj->shared_count : 0;
> > >   for (i = 0; i < shared_count; ++i) {
> > >   shared[i] = 
> > > rcu_dereference(fobj->shared[i]);
> > >   if (!dma_fence_get_rcu(shared[i]))
> > >   break;
> > >   }
> > > +
> > > + if (!pfence_excl && fence_excl) {
> > > + shared[i] = fence_excl;
> > > + fence_excl = NULL;
> > > + ++i;
> > > + ++shared_count;
> > > + }
> > >   }
> > >   if (i != shared_count || read_seqcount_retry(&obj->seq, 
> > > seq)) {
> > > @@ -448,7 +460,8 @@ int reservation_object_get_fences_rcu(struct 
> > > reservation_object *obj,
> > >   *pshared_count = shared_count;
> > >   *pshared = shared;
> > > - *pfence_excl = fence_excl;
> > > + if (pfence_excl)
> > > + *pfence_excl = fence_excl;
> > >   return ret;
> > >   }
> > > -- 
> > > 2.14.1
> > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH] dma-buf: make returning the exclusive fence optional

2018-01-10 Thread Daniel Vetter
On Wed, Jan 10, 2018 at 01:53:41PM +0100, Christian König wrote:
> Change reservation_object_get_fences_rcu to make the exclusive fence
> pointer optional.
> 
> If not specified the exclusive fence is put into the fence array as
> well.
> 
> This is helpful for a couple of cases where we need all fences in a
> single array.
> 
> Signed-off-by: Christian König <christian.koe...@amd.com>

Seeing the use-case for this would be a lot more interesting ...
-Daniel

> ---
>  drivers/dma-buf/reservation.c | 31 ++++++++++++++++++++++---------
>  1 file changed, 22 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> index b759a569b7b8..461afa9febd4 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -374,8 +374,9 @@ EXPORT_SYMBOL(reservation_object_copy_fences);
>   * @pshared: the array of shared fence ptrs returned (array is krealloc'd to
>   * the required size, and must be freed by caller)
>   *
> - * RETURNS
> - * Zero or -errno
> + * Retrieve all fences from the reservation object. If the pointer for the
> + * exclusive fence is not specified the fence is put into the array of the
> + * shared fences as well. Returns either zero or -ENOMEM.
>   */
>  int reservation_object_get_fences_rcu(struct reservation_object *obj,
> struct dma_fence **pfence_excl,
> @@ -389,8 +390,8 @@ int reservation_object_get_fences_rcu(struct 
> reservation_object *obj,
>  
>   do {
>   struct reservation_object_list *fobj;
> - unsigned seq;
> - unsigned int i;
> + unsigned int i, seq;
> + size_t sz = 0;
>  
>   shared_count = i = 0;
>  
> @@ -402,9 +403,14 @@ int reservation_object_get_fences_rcu(struct 
> reservation_object *obj,
>   goto unlock;
>  
>   fobj = rcu_dereference(obj->fence);
> - if (fobj) {
> + if (fobj)
> + sz += sizeof(*shared) * fobj->shared_max;
> +
> + if (!pfence_excl && fence_excl)
> + sz += sizeof(*shared);
> +
> + if (sz) {
>   struct dma_fence **nshared;
> - size_t sz = sizeof(*shared) * fobj->shared_max;
>  
>   nshared = krealloc(shared, sz,
>  GFP_NOWAIT | __GFP_NOWARN);
> @@ -420,13 +426,19 @@ int reservation_object_get_fences_rcu(struct 
> reservation_object *obj,
>   break;
>   }
>   shared = nshared;
> - shared_count = fobj->shared_count;
> -
> + shared_count = fobj ? fobj->shared_count : 0;
>   for (i = 0; i < shared_count; ++i) {
>   shared[i] = rcu_dereference(fobj->shared[i]);
>   if (!dma_fence_get_rcu(shared[i]))
>   break;
>   }
> +
> + if (!pfence_excl && fence_excl) {
> + shared[i] = fence_excl;
> + fence_excl = NULL;
> + ++i;
> + ++shared_count;
> + }
>   }
>  
>   if (i != shared_count || read_seqcount_retry(>seq, seq)) {
> @@ -448,7 +460,8 @@ int reservation_object_get_fences_rcu(struct 
> reservation_object *obj,
>  
>   *pshared_count = shared_count;
>   *pshared = shared;
> - *pfence_excl = fence_excl;
> + if (pfence_excl)
> + *pfence_excl = fence_excl;
>  
>   return ret;
>  }
> -- 
> 2.14.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 2/4] dma-buf/fence: Sparse wants __rcu on the object itself

2017-11-06 Thread Daniel Vetter
On Thu, Nov 02, 2017 at 10:03:34PM +0200, Ville Syrjala wrote:
> From: Chris Wilson <ch...@chris-wilson.co.uk>
> 
> In order to silent sparse in dma_fence_get_rcu_safe(), we need to mark

s/silent/silence/

On the series (assuming sparse is indeed happy now, I didn't check that):

Reviewed-by: Daniel Vetter <daniel.vet...@ffwll.ch>

> the incoming fence object as being RCU protected and not the pointer to
> the object.
> 
> Cc: Dave Airlie <airl...@redhat.com>
> Cc: Jason Ekstrand <ja...@jlekstrand.net>
> Cc: linaro-mm-...@lists.linaro.org
> Cc: linux-media@vger.kernel.org
> Cc: Alex Deucher <alexander.deuc...@amd.com>
> Cc: Christian König <christian.koe...@amd.com>
> Cc: Sumit Semwal <sumit.sem...@linaro.org>
> Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> ---
>  include/linux/dma-fence.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index efdabbb64e3c..4c008170fe65 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -242,7 +242,7 @@ static inline struct dma_fence *dma_fence_get_rcu(struct 
> dma_fence *fence)
>   * The caller is required to hold the RCU read lock.
>   */
>  static inline struct dma_fence *
> -dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep)
> +dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
>  {
>   do {
>   struct dma_fence *fence;
> -- 
> 2.13.6
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
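
To make the distinction concrete, a small sketch (struct obj is hypothetical;
per the kerneldoc quoted above the caller must hold the RCU read lock):

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>

struct obj {
	struct dma_fence __rcu *excl;	/* the fence object is RCU-managed */
};

static struct dma_fence *obj_get_excl(struct obj *o)
{
	struct dma_fence *fence;

	rcu_read_lock();
	/* matches the corrected prototype: struct dma_fence __rcu **fencep */
	fence = dma_fence_get_rcu_safe(&o->excl);
	rcu_read_unlock();

	return fence;	/* caller now holds a reference, or NULL */
}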


Re: [PATCH] dma-fence: fix dma_fence_get_rcu_safe

2017-09-20 Thread Daniel Vetter
On Mon, Sep 11, 2017 at 01:06:32PM +0200, Christian König wrote:
> On 11.09.2017 at 12:01, Chris Wilson wrote:
> > [SNIP]
> > > Yeah, but that is illegal with a fence objects.
> > >
> > > When anybody allocates fences this way it breaks at least
> > > reservation_object_get_fences_rcu(),
> > > reservation_object_wait_timeout_rcu() and
> > > reservation_object_test_signaled_single().
> > Many, many months ago I sent patches to fix them all.
>
> Found those after a bit of searching. Yeah, those patches were proposed more
> than a year ago, but never pushed upstream.
>
> Not sure if we really should go this way. dma_fence objects are shared
> between drivers and since we can't judge if it's the correct fence based on
> a criterion in the object (only the read counter, which is outside), all
> drivers need to be correct for this.
>
> I would rather go the route of changing dma_fence_release() to wrap
> fence->ops->release into call_rcu() to keep the whole RCU handling outside
> of the individual drivers.

Hm, I entirely dropped the ball on this, I kinda assumed that we managed
to get some agreement on this between i915 and dma_fence. Adding a pile
more people.

Joonas, Tvrtko, I guess we need to fix this one way or the other.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
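
A sketch of the call_rcu() variant Christian describes above - not what is
actually upstream, and it assumes struct dma_fence keeps its rcu_head:

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

static void dma_fence_release_rcu(struct rcu_head *rcu)
{
	struct dma_fence *fence = container_of(rcu, struct dma_fence, rcu);

	if (fence->ops->release)
		fence->ops->release(fence);
	else
		kfree(fence);
}

void dma_fence_release(struct kref *kref)
{
	struct dma_fence *fence = container_of(kref, struct dma_fence,
					       refcount);

	/* defer the driver's release past an RCU grace period so RCU
	 * readers can never see a freed fence */
	call_rcu(&fence->rcu, dma_fence_release_rcu);
}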


Re: DRM Format Modifiers in v4l2

2017-09-03 Thread Daniel Vetter
On Fri, Sep 1, 2017 at 2:43 PM, Rob Clark <robdcl...@gmail.com> wrote:
> On Fri, Sep 1, 2017 at 3:13 AM, Laurent Pinchart
> <laurent.pinch...@ideasonboard.com> wrote:
>> Hi Nicolas,
>>
>> On Thursday, 31 August 2017 19:12:58 EEST Nicolas Dufresne wrote:
>>> On Thursday, 31 August 2017 at 17:28 +0300, Laurent Pinchart wrote:
>>> >> e.g. if I have two devices which support MODIFIER_FOO, I could attempt
>>> >> to share a buffer between them which uses MODIFIER_FOO without
>>> >> necessarily knowing exactly what it is/does.
>>> >
>>> > Userspace could certainly set modifiers blindly, but the point of
>>> > modifiers is to generate side effects beneficial to the use case at hand
>>> > (for instance by optimizing the memory access pattern). To use them
>>> > meaningfully userspace would need to have at least an idea of the side
>>> > effects they generate.
>>>
>>> Generic userspace will basically pick some random combination.
>>
>> In that case userspace could set no modifier at all by default (except in the
>> case where unmodified formats are not supported by the hardware, but I don't
>> expect that to be the most common case).
>>
>>> To allow generically picking the optimal configuration we could indeed rely
>>> on the application knowledge, but we could also enhance the spec so that
>>> the order in the enumeration becomes meaningful.
>>
>> I'm not sure how far we should go. I could imagine a system where the API
>> would report capabilities for modifiers (e.g. this modifier lowers the
>> bandwidth, this one enhances the quality, ...), but going in that direction,
>> where do we stop ? In practice I expect userspace to know some information
>> about the hardware, so I'd rather avoid over-engineering the API.
>>
>
> I think in the (hopefully not too) long term, something like
> https://github.com/cubanismo/allocator/ is the way forward.  That
> doesn't quite solve how v4l2 kernel part sorts out w/ corresponding
> userspace .so what is preferable, but at least that is
> compartmentalized to v4l2.. on the gl/vk side of things there will ofc
> be a hardware specific userspace part that knows what it prefers.  For
> v4l2, it probably makes sense to sort out what the userspace level API
> is and work backwards from there, rather than risk trying to design a
> kernel uapi that might turn out to be the wrong thing.

I thought for kms the plan is to make the ordering meaningful, because
it doesn't necessarily match the gl/vk one. E.g. on intel gl would
prefer Y compressed, Y, X, untiled. Whereas display would be Y
compressed, X (much easier to scan out, in many cases allows more
planes to be used), Y (is necessary for 90° rotation), untiled. So if
drm_hwc really wants to use all the planes, it could prioritize the
display over rendering and request X instead of Y tiled.

I think the same would go for v4l.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: DRM Format Modifiers in v4l2

2017-08-30 Thread Daniel Vetter
On Tue, Aug 29, 2017 at 10:47:01AM +0100, Brian Starkey wrote:
> On Mon, Aug 28, 2017 at 10:49:07PM +0200, Daniel Vetter wrote:
> > On Mon, Aug 28, 2017 at 8:07 PM, Nicolas Dufresne <nico...@ndufresne.ca> 
> > wrote:
> > > On Thursday, 24 August 2017 at 13:26 +0100, Brian Starkey wrote:
> > > > > What I mean was: an application can use the modifier to give buffers 
> > > > > from
> > > > > one device to another without needing to understand it.
> > > > >
> > > > > But a generic video capture application that processes the video 
> > > > > itself
> > > > > cannot be expected to know about the modifiers. It's a custom HW 
> > > > > specific
> > > > > format that you only use between two HW devices or with software 
> > > > > written
> > > > > for that hardware.
> > > > >
> > > > 
> > > > Yes, makes sense.
> > > > 
> > > > > >
> > > > > > However, in DRM the API lets you get the supported formats for each
> > > > > > modifier as-well-as the modifier list itself. I'm not sure how 
> > > > > > exactly
> > > > > > to provide that in a control.
> > > > >
> > > > > We have support for a 'menu' of 64 bit integers: 
> > > > > V4L2_CTRL_TYPE_INTEGER_MENU.
> > > > > You use VIDIOC_QUERYMENU to enumerate the available modifiers.
> > > > >
> > > > > So enumerating these modifiers would work out-of-the-box.
> > > > 
> > > > Right. So I guess the supported set of formats could be somehow
> > > > enumerated in the menu item string. In DRM the pairs are (modifier +
> > > > bitmask) where bits represent formats in the supported formats list
> > > > (commit db1689aa61bd in drm-next). Printing a hex representation of
> > > > the bitmask would be functional but I concede not very pretty.
> > > 
> > > The problem is that the list of modifiers depends on the format
> > > selected. Having to call S_FMT to obtain this list is quite
> > > inefficient.
> > > 
> > > Also, be aware that DRM_FORMAT_MOD_SAMSUNG_64_32_TILE modifier has been
> > > implemented in V4L2 with a direct format (V4L2_PIX_FMT_NV12MT). I think
> > > another one made it the same way recently, something from Mediatek if
> > > I remember correctly. Though, unlike the Intel one, the same modifier does
> > > not have varying results depending on the hardware revision.
> > 
> > Note on the intel modifiers: On most recent platforms (iirc gen9) the
> > modifier is well defined and always describes the same byte layout. We
> > simply didn't want to rewrite our entire software stack for all the
> > old gunk platforms, hence the language. I guess we could/should
> > describe the layout in detail, but atm we're the only ones using it.
> > 
> > On your topic of v4l2 encoding the drm fourcc+modifier combo into a
> > special v4l fourcc: That's exactly the mismatch I was thinking of.
> > There's other examples of v4l2 fourcc being more specific than their
> > drm counters (e.g. specific way the different planes are laid out).
> 
> I'm not entirely clear on the v4l2 fourccs being more specific than
> DRM ones - do you mean e.g. NV12 vs NV12M? Specifically in the case of
> multi-planar formats I think it's a non-issue because modifiers are
> allowed to alter the number of planes and the meanings of them. Also
> V4L2 NV12M is a superset of NV12 - so NV12M would always be able to
> describe a DRM NV12 buffer.
> 
> I don't see the "special v4l2 format already exists" case as a problem
> either. It would be up to any drivers that already have special
> formats to decide if they want to also support it via a more generic
> modifiers API or not.
> 
> The fact is, adding special formats for each combination is
> unmanageable - we're talking dozens in the case of our hardware.

Hm right, we can just remap the special combos to the drm-fourcc +
modifier style. Bonus point if v4l does that in the core so not everyone
has to reinvent that wheel :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: DRM Format Modifiers in v4l2

2017-08-28 Thread Daniel Vetter
On Mon, Aug 28, 2017 at 8:07 PM, Nicolas Dufresne <nico...@ndufresne.ca> wrote:
> On Thursday, 24 August 2017 at 13:26 +0100, Brian Starkey wrote:
>> > What I mean was: an application can use the modifier to give buffers from
>> > one device to another without needing to understand it.
>> >
>> > But a generic video capture application that processes the video itself
>> > cannot be expected to know about the modifiers. It's a custom HW specific
>> > format that you only use between two HW devices or with software written
>> > for that hardware.
>> >
>>
>> Yes, makes sense.
>>
>> > >
>> > > However, in DRM the API lets you get the supported formats for each
>> > > modifier as-well-as the modifier list itself. I'm not sure how exactly
>> > > to provide that in a control.
>> >
>> > We have support for a 'menu' of 64 bit integers: 
>> > V4L2_CTRL_TYPE_INTEGER_MENU.
>> > You use VIDIOC_QUERYMENU to enumerate the available modifiers.
>> >
>> > So enumerating these modifiers would work out-of-the-box.
>>
>> Right. So I guess the supported set of formats could be somehow
>> enumerated in the menu item string. In DRM the pairs are (modifier +
>> bitmask) where bits represent formats in the supported formats list
>> (commit db1689aa61bd in drm-next). Printing a hex representation of
>> the bitmask would be functional but I concede not very pretty.
>
> The problem is that the list of modifiers depends on the format
> selected. Having to call S_FMT to obtain this list is quite
> inefficient.
>
> Also, be aware that DRM_FORMAT_MOD_SAMSUNG_64_32_TILE modifier has been
> implemented in V4L2 with a direct format (V4L2_PIX_FMT_NV12MT). I think
> another one made it the same way recently, something from Mediatek if
> I remember correctly. Though, unlike the Intel one, the same modifier does not
> have varying results depending on the hardware revision.

Note on the intel modifiers: On most recent platforms (iirc gen9) the
modifier is well defined and always describes the same byte layout. We
simply didn't want to rewrite our entire software stack for all the
old gunk platforms, hence the language. I guess we could/should
describe the layout in detail, but atm we're the only ones using it.

On your topic of v4l2 encoding the drm fourcc+modifier combo into a
special v4l fourcc: That's exactly the mismatch I was thinking of.
There's other examples of v4l2 fourcc being more specific than their
drm counters (e.g. specific way the different planes are laid out).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
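
For reference, a userspace sketch of the enumeration Hans describes, assuming
a hypothetical V4L2_CTRL_TYPE_INTEGER_MENU control id (V4L2_CID_DRM_MODIFIER
is made up for this example):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static void enum_modifiers(int fd, __u32 cid)
{
	struct v4l2_querymenu qm;
	__u32 i;

	/* real code should take the index range from VIDIOC_QUERYCTRL;
	 * menus may also have holes that return EINVAL */
	for (i = 0; ; i++) {
		memset(&qm, 0, sizeof(qm));
		qm.id = cid;	/* e.g. V4L2_CID_DRM_MODIFIER */
		qm.index = i;
		if (ioctl(fd, VIDIOC_QUERYMENU, &qm) < 0)
			break;
		/* for integer menus the payload is the 64-bit value field */
		printf("modifier[%u] = 0x%llx\n", i,
		       (unsigned long long)qm.value);
	}
}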


Re: [PATCH 02/15] drm: make device_type const

2017-08-25 Thread Daniel Vetter
On Sat, Aug 19, 2017 at 01:52:13PM +0530, Bhumika Goyal wrote:
> Make these const as they are only stored in the type field of a device
> structure, which is const.
> Done using Coccinelle.

I can't apply this, it's missing your s-o-b line. You can just reply with
that.

Thanks, Daniel

> ---
>  drivers/gpu/drm/drm_sysfs.c  | 2 +-
>  drivers/gpu/drm/ttm/ttm_module.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 1c5b5ce..84e4ebe 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -39,7 +39,7 @@
>   * drm_connector_unregister().
>   */
>  
> -static struct device_type drm_sysfs_device_minor = {
> +static const struct device_type drm_sysfs_device_minor = {
>   .name = "drm_minor"
>  };
>  
> diff --git a/drivers/gpu/drm/ttm/ttm_module.c 
> b/drivers/gpu/drm/ttm/ttm_module.c
> index 66fc639..e6604e0 100644
> --- a/drivers/gpu/drm/ttm/ttm_module.c
> +++ b/drivers/gpu/drm/ttm/ttm_module.c
> @@ -37,7 +37,7 @@
>  static DECLARE_WAIT_QUEUE_HEAD(exit_q);
>  static atomic_t device_released;
>  
> -static struct device_type ttm_drm_class_type = {
> +static const struct device_type ttm_drm_class_type = {
>   .name = "ttm",
>   /**
>* Add pm ops here.
> -- 
> 1.9.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCHv2 0/3] drm/i915: add DisplayPort CEC-Tunneling-over-AUX support

2017-08-21 Thread Daniel Vetter
On Sat, Aug 19, 2017 at 02:05:16PM +0200, Hans Verkuil wrote:
> On 08/12/2017 11:01 AM, Hans Verkuil wrote:
> > From: Hans Verkuil <hans.verk...@cisco.com>
> > 
> > This patch series adds support for the DisplayPort CEC-Tunneling-over-AUX
> > feature. This patch series is based on 4.13-rc4 which has all the needed cec
> > and drm 4.13 patches merged.
> > 
> > This patch series has been tested with my NUC7i5BNK and a Samsung USB-C to 
> > HDMI adapter.
> > 
> > Please note this comment at the start of drm_dp_cec.c:
> > 
> > --
> > Unfortunately it turns out that we have a chicken-and-egg situation
> > here. Quite a few active (mini-)DP-to-HDMI or USB-C-to-HDMI adapters
> > have a converter chip that supports CEC-Tunneling-over-AUX (usually the
> > Parade PS176), but they do not wire up the CEC pin, thus making CEC
> > useless.
> > 
> > Sadly there is no way for this driver to know this. What happens is 
> > that a /dev/cecX device is created that is isolated and unable to see
> > any of the other CEC devices. Quite literally the CEC wire is cut
> > (or in this case, never connected in the first place).
> > 
> > I suspect that the reason so few adapters support this is that this
> > tunneling protocol was never supported by any OS. So there was no 
> > easy way of testing it, and no incentive to correctly wire up the
> > CEC pin.
> > 
> > Hopefully by creating this driver it will be easier for vendors to 
> > finally fix their adapters and test the CEC functionality.
> > 
> > I keep a list of known working adapters here:
> > 
> > https://hverkuil.home.xs4all.nl/cec-status.txt
> > 
> > Please mail me (hverk...@xs4all.nl) if you find an adapter that works
> > and is not yet listed there.
> > --
> > 
> > I really hope that this work will provide an incentive for vendors to
> > finally connect the CEC pin. It's a shame that there are so few adapters
> > that work (I found only two USB-C to HDMI adapters that work, and no
> > (mini-)DP to HDMI adapters at all).
> > 
> > Note that a colleague who actually knows his way around a soldering iron
> > modified an UpTab DisplayPort-to-HDMI adapter for me, hooking up the CEC
> > pin. And after that change it worked. I also received confirmation that
> > this really is a chicken-and-egg situation: it is because there is no CEC
> > support for this feature in any OS that they do not hook up the CEC pin.
> > 
> > So hopefully if this gets merged there will be an incentive for vendors
> > to make adapters where this actually works. It is a very nice feature
> > for HTPC boxes.
> > 
> > Changes since v1:
> > 
> > - Incorporated Sean's review comments in patch 1/3.
> 
> Ping?
> 
> Who is supposed to merge this? Is there anything I should do? I'd love to
> get this in for 4.14...

1) you have commit rights, so only really need to find a reviewer. Not
exactly sure who'd be a good reviewer, maybe Imre or Ville?
2) 4.14 is done, this will go into 4.15.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: DRM Format Modifiers in v4l2

2017-08-21 Thread Daniel Vetter
On Mon, Aug 21, 2017 at 5:52 PM, Brian Starkey <brian.star...@arm.com> wrote:
> Hi all,
>
> I couldn't find this topic talked about elsewhere, but apologies if
> it's a duplicate - I'll be glad to be steered in the direction of a
> thread.
>
> We'd like to support DRM format modifiers in v4l2 in order to share
> the description of different (mostly proprietary) buffer formats
> between e.g. a v4l2 device and a DRM device.
>
> DRM format modifiers are defined in include/uapi/drm/drm_fourcc.h and
> are a vendor-namespaced 64-bit value used to describe various
> vendor-specific buffer layouts. They are combined with a (DRM) FourCC
> code to give a complete description of the data contained in a buffer.
>
> The same modifier definition is used in the Khronos EGL extension
> EGL_EXT_image_dma_buf_import_modifiers, and is supported in the
> Wayland linux-dmabuf protocol.
>
>
> This buffer information could of course be described in the
> vendor-specific part of V4L2_PIX_FMT_*, but this would duplicate the
> information already defined in drm_fourcc.h. Additionally, there
> would be quite a format explosion where a device supports a dozen or
> more formats, all of which can use one or more different
> layouts/compression schemes.
>
> So, I'm wondering if anyone has views on how/whether this could be
> incorporated?
>
> I spoke briefly about this to Laurent at LPC last year, and he
> suggested v4l2_control as one approach.
>
> I also wondered if it could be added in v4l2_pix_format_mplane - looks
> like there's 8 bytes left before it exceeds the 200 bytes, or could go
> in the reserved portion of v4l2_plane_pix_format.
>
> Thanks for any thoughts,

One problem is that the modifiers sometimes reference the DRM fourcc
codes. v4l has a different (and incompatible) set of fourcc codes,
whereas all the protocols and specs (you can add DRI3.1 for Xorg to
that list btw) use both drm fourcc and drm modifiers.

This might or might not make this proposal unworkable, but it's
something I'd at least review carefully.

Otherwise I think it'd be great if we could have one namespace for all
modifiers, that's pretty much why we have them. Please also note that
for drm_fourcc.h we don't require an in-kernel user for a new modifier
since a bunch of them might need to be allocated just for
userspace-to-userspace buffer sharing (e.g. in EGL/vk). One example
for this would be compressed surfaces with fast-clearing, which is
planned for i915 (but current hw can't scan it out). And we really
want to have one namespace for everything.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
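
For readers who haven't seen the encoding: the vendor namespacing works
roughly like this (simplified excerpt of drm_fourcc.h; see the real header
for the authoritative definitions):

/* the vendor id occupies the top 8 bits of the 64-bit modifier */
#define DRM_FORMAT_MOD_VENDOR_SAMSUNG	0x04

#define fourcc_mod_code(vendor, val) \
	((((__u64)DRM_FORMAT_MOD_VENDOR_##vendor) << 56) | \
	 ((val) & 0x00ffffffffffffffULL))

/* e.g. the Samsung tiled layout that comes up later in this thread: */
#define DRM_FORMAT_MOD_SAMSUNG_64_32_TILE	fourcc_mod_code(SAMSUNG, 1)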


Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly

2017-07-25 Thread Daniel Vetter
On Tue, Jul 25, 2017 at 11:11:35AM +0200, Christian König wrote:
> On 24.07.2017 at 13:57, Daniel Vetter wrote:
> > On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> > <deathsim...@vodafone.de> wrote:
> > > On 24.07.2017 at 10:33, Daniel Vetter wrote:
> > > > On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> > > > > From: Christian König <christian.koe...@amd.com>
> > > > > 
> > > > > With hardware resets in mind it is possible that all shared fences are
> > > > > signaled, but the exclusive isn't. Fix waiting for everything in this
> > > > > situation.
> > > > How did you end up with both shared and exclusive fences on the same
> > > > reservation object? At least I thought the point of exclusive was that
> > > > it's exclusive (and has an implicit barrier on all previous shared
> > > > fences). Same for shared fences, they need to wait for the exclusive one
> > > > (and replace it).
> > > > 
> > > > Is this fallout from the amdgpu trickery where by default you do all
> > > > shared fences? I thought we've aligned semantics a while back ...
> > > 
> > > No, that is perfectly normal even for other drivers. Take a look at the
> > > reservation code.
> > > 
> > > The exclusive fence replaces all shared fences, but adding a shared fence
> > > doesn't replace the exclusive fence. That actually makes sense, cause when
> > > you want to add more shared fences, those need to wait for the last 
> > > exclusive
> > > fence as well.
> > Hm right.
> > 
> > > Now normally I would agree that when you have shared fences it is 
> > > sufficient
> > > to wait for all of them cause those operations can't start before the
> > > exclusive one finishes. But with GPU reset and/or the ability to abort
> > > already submitted operations it is perfectly possible that you end up with
> > > an exclusive fence which isn't signaled and a shared fence which is 
> > > signaled
> > > in the same reservation object.
> > How does that work? The batch(es) with the shared fence are all
> > supposed to wait for the exclusive fence before they start, which
> > means even if you gpu reset and restart/cancel certain things, they
> > shouldn't be able to complete out of order.
> 
> Assume the following:
> 1. The exclusive fence is some move operation by the kernel which executes
> on a DMA engine.
> 2. The shared fence is a 3D operation submitted by userspace which executes
> on the 3D engine.
> 
> Now we find the 3D engine to be hung and in need of a reset; all currently
> submitted jobs are aborted, marked with an error code, and their fences put
> into the signaled state.
> 
> Since we only reset the 3D engine, the move operation (fortunately) isn't
> affected by this.
> 
> I think this applies to all drivers and isn't something amdgpu specific.

Not i915 because:
- At first we only had system wide gpu reset that killed everything, which
  means all requests will be completed, not just on a single engine.

- Now we have per-engine reset, but combined with replaying them (the
  offending one gets a no-op batch to avoid re-hanging), to make sure the
  dependency tree doesn't fall apart.

Now I see that doing this isn't all that simple, and either way we still
have the case where one driver resets but not the other (in multi-gpu),
but I'm not exactly sure how to best handle this.

What exactly is the downside of not dropping this assumption, i.e. why do
you want this patch? What blows up?
-Daniel


> 
> Regards,
> Christian.
> 
> > 
> > If you outright cancel a fence then you're supposed to first call
> > dma_fence_set_error(-EIO) and then complete it. Note that atm that
> > part might be slightly overengineered and I'm not sure about how we
> > expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
> > soon, has been) used by i915 for its internal book-keeping, which
> > might not be the best to leak to other consumers. But completing
> > fences (at least exported ones, where userspace or other drivers can
> > get at them) shouldn't be possible.
> > -Daniel
> 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly

2017-07-25 Thread Daniel Vetter
On Tue, Jul 25, 2017 at 02:55:14PM +0800, zhoucm1 wrote:
> 
> 
> On 2017-07-25 14:50, Daniel Vetter wrote:
> > On Tue, Jul 25, 2017 at 02:16:55PM +0800, zhoucm1 wrote:
> > > 
> > > On 2017-07-24 19:57, Daniel Vetter wrote:
> > > > On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> > > > <deathsim...@vodafone.de> wrote:
> > > > On 24.07.2017 at 10:33, Daniel Vetter wrote:
> > > > > > On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> > > > > > > From: Christian König <christian.koe...@amd.com>
> > > > > > > 
> > > > > > > With hardware resets in mind it is possible that all shared 
> > > > > > > fences are
> > > > > > > signaled, but the exclusive isn't. Fix waiting for everything in 
> > > > > > > this
> > > > > > > situation.
> > > > > > How did you end up with both shared and exclusive fences on the same
> > > > > > reservation object? At least I thought the point of exclusive was 
> > > > > > that
> > > > > > it's exclusive (and has an implicit barrier on all previous shared
> > > > > > fences). Same for shared fences, they need to wait for the 
> > > > > > exclusive one
> > > > > > (and replace it).
> > > > > > 
> > > > > > Is this fallout from the amdgpu trickery where by default you do all
> > > > > > shared fences? I thought we've aligned semantics a while back ...
> > > > > No, that is perfectly normal even for other drivers. Take a look at 
> > > > > the
> > > > > reservation code.
> > > > > 
> > > > > The exclusive fence replaces all shared fences, but adding a shared 
> > > > > fence
> > > > > doesn't replace the exclusive fence. That actually makes sense, cause 
> > > > > when
> > > > > you want to add more shared fences, those need to wait for the last 
> > > > > exclusive
> > > > > fence as well.
> > > > Hm right.
> > > > 
> > > > > Now normally I would agree that when you have shared fences it is 
> > > > > sufficient
> > > > > to wait for all of them cause those operations can't start before the
> > > > > exclusive one finishes. But with GPU reset and/or the ability to abort
> > > > > already submitted operations it is perfectly possible that you end up 
> > > > > with
> > > > > an exclusive fence which isn't signaled and a shared fence which is 
> > > > > signaled
> > > > > in the same reservation object.
> > > > How does that work? The batch(es) with the shared fence are all
> > > > supposed to wait for the exclusive fence before they start, which
> > > > means even if you gpu reset and restart/cancel certain things, they
> > > > shouldn't be able to complete out of order.
> > > Hi Daniel,
> > > 
> > > Do you mean the exclusive fence must be signalled before any shared fence? 
> > > Where
> > > could I find this restriction?
> > Yes, Christian also described it above. Could be that we should have
> > better kerneldoc to document this ...
> Is that a known assumption? If so, it doesn't matter even if we
> always wait for the exclusive fence, right? It's just one more line of checking.

The problem is that amdgpu breaks that assumption over gpu reset, and that
might have implications _everywhere_, not just in this code here. Are you
sure this case won't pull the i915 driver over the table when sharing
dma-bufs with it? Did you audit the code (plus all the other drivers too
ofc).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly

2017-07-25 Thread Daniel Vetter
On Tue, Jul 25, 2017 at 02:16:55PM +0800, zhoucm1 wrote:
> 
> 
> On 2017-07-24 19:57, Daniel Vetter wrote:
> > On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> > <deathsim...@vodafone.de> wrote:
> > > On 24.07.2017 at 10:33, Daniel Vetter wrote:
> > > > On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> > > > > From: Christian König <christian.koe...@amd.com>
> > > > > 
> > > > > With hardware resets in mind it is possible that all shared fences are
> > > > > signaled, but the exclusive isn't. Fix waiting for everything in this
> > > > > situation.
> > > > How did you end up with both shared and exclusive fences on the same
> > > > reservation object? At least I thought the point of exclusive was that
> > > > it's exclusive (and has an implicit barrier on all previous shared
> > > > fences). Same for shared fences, they need to wait for the exclusive one
> > > > (and replace it).
> > > > 
> > > > Is this fallout from the amdgpu trickery where by default you do all
> > > > shared fences? I thought we've aligned semantics a while back ...
> > > 
> > > No, that is perfectly normal even for other drivers. Take a look at the
> > > reservation code.
> > > 
> > > The exclusive fence replaces all shared fences, but adding a shared fence
> > > doesn't replace the exclusive fence. That actually makes sense, cause when
> > > you want to add more shared fences, those need to wait for the last 
> > > exclusive
> > > fence as well.
> > Hm right.
> > 
> > > Now normally I would agree that when you have shared fences it is 
> > > sufficient
> > > to wait for all of them cause those operations can't start before the
> > > exclusive one finishes. But with GPU reset and/or the ability to abort
> > > already submitted operations it is perfectly possible that you end up with
> > > an exclusive fence which isn't signaled and a shared fence which is 
> > > signaled
> > > in the same reservation object.
> > How does that work? The batch(es) with the shared fence are all
> > supposed to wait for the exclusive fence before they start, which
> > means even if you gpu reset and restart/cancel certain things, they
> > shouldn't be able to complete out of order.
> Hi Daniel,
> 
> Do you mean the exclusive fence must be signalled before any shared fence? Where
> could I find this restriction?

Yes, Christian also described it above. Could be that we should have
better kerneldoc to document this ...
-Daniel

> 
> Thanks,
> David Zhou
> > 
> > If you outright cancel a fence then you're supposed to first call
> > dma_fence_set_error(-EIO) and then complete it. Note that atm that
> > part might be slightly overengineered and I'm not sure about how we
> > expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
> > soon, has been) used by i915 for its internal book-keeping, which
> > might not be the best to leak to other consumers. But completing
> > fences (at least exported ones, where userspace or other drivers can
> > get at them) shouldn't be possible.
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly

2017-07-24 Thread Daniel Vetter
On Mon, Jul 24, 2017 at 11:51 AM, Christian König
<deathsim...@vodafone.de> wrote:
> On 24.07.2017 at 10:33, Daniel Vetter wrote:
>>
>> On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
>>>
>>> From: Christian König <christian.koe...@amd.com>
>>>
>>> With hardware resets in mind it is possible that all shared fences are
>>> signaled, but the exclusive isn't. Fix waiting for everything in this
>>> situation.
>>
>> How did you end up with both shared and exclusive fences on the same
>> reservation object? At least I thought the point of exclusive was that
>> it's exclusive (and has an implicit barrier on all previous shared
>> fences). Same for shared fences, they need to wait for the exclusive one
>> (and replace it).
>>
>> Is this fallout from the amdgpu trickery where by default you do all
>> shared fences? I thought we've aligned semantics a while back ...
>
>
> No, that is perfectly normal even for other drivers. Take a look at the
> reservation code.
>
> The exclusive fence replaces all shared fences, but adding a shared fence
> doesn't replace the exclusive fence. That actually makes sense, cause when
> you want to add more shared fences, those need to wait for the last exclusive
> fence as well.

Hm right.

> Now normally I would agree that when you have shared fences it is sufficient
> to wait for all of them cause those operations can't start before the
> exclusive one finishes. But with GPU reset and/or the ability to abort
> already submitted operations it is perfectly possible that you end up with
> an exclusive fence which isn't signaled and a shared fence which is signaled
> in the same reservation object.

How does that work? The batch(es) with the shared fence are all
supposed to wait for the exclusive fence before they start, which
means even if you gpu reset and restart/cancel certain things, they
shouldn't be able to complete out of order.

If you outright cancel a fence then you're supposed to first call
dma_fence_set_error(-EIO) and then complete it. Note that atm that
part might be slightly overengineered and I'm not sure about how we
expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
soon, has been) used by i915 for its internal book-keeping, which
might not be the best to leak to other consumers. But completing
fences (at least exported ones, where userspace or other drivers can
get at them) shouldn't be possible.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
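
The cancellation sequence described above, as a two-line sketch (fence is
whatever dma_fence the driver is tearing down):

	dma_fence_set_error(fence, -EIO);	/* record the reason first */
	dma_fence_signal(fence);		/* then complete it; waiters
						 * can retrieve the error via
						 * dma_fence_get_status() */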


Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly

2017-07-24 Thread Daniel Vetter
On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> From: Christian König <christian.koe...@amd.com>
> 
> With hardware resets in mind it is possible that all shared fences are
> signaled, but the exclusive isn't. Fix waiting for everything in this 
> situation.

How did you end up with both shared and exclusive fences on the same
reservation object? At least I thought the point of exclusive was that
it's exclusive (and has an implicit barrier on all previous shared
fences). Same for shared fences, they need to wait for the exclusive one
(and replace it).

Is this fallout from the amdgpu trickery where by default you do all
shared fences? I thought we've aligned semantics a while back ...
-Daniel

> 
> Signed-off-by: Christian König <christian.koe...@amd.com>
> ---
>  drivers/dma-buf/reservation.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> index e2eff86..ce3f9c1 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct 
> reservation_object *obj,
>   }
>   }
>  
> - if (!shared_count) {
> + if (!fence) {
>   struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>  
>   if (fence_excl &&
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
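
For reference, the implicit-sync discipline being debated here, sketched with
the reservation_object API of that era (write_fence/read_fence are whatever
fences the driver just emitted; error handling elided):

	/* writer: the new exclusive fence supersedes all shared fences */
	reservation_object_lock(obj, NULL);
	reservation_object_add_excl_fence(obj, write_fence);
	reservation_object_unlock(obj);

	/* readers: shared fences are added on top; the work behind them
	 * must itself wait for the current exclusive fence before starting */
	reservation_object_lock(obj, NULL);
	if (!reservation_object_reserve_shared(obj))
		reservation_object_add_shared_fence(obj, read_fence);
	reservation_object_unlock(obj);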


[PATCH] dma-fence: Don't BUG_ON when not absolutely needed

2017-07-20 Thread Daniel Vetter
It makes debugging a massive pain.

Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>
Cc: Sumit Semwal <sumit.sem...@linaro.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
---
 drivers/dma-buf/dma-fence.c | 4 ++--
 include/linux/dma-fence.h   | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 56e0a0e1b600..9a302799040e 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -48,7 +48,7 @@ static atomic64_t dma_fence_context_counter = 
ATOMIC64_INIT(0);
  */
 u64 dma_fence_context_alloc(unsigned num)
 {
-   BUG_ON(!num);
+   WARN_ON(!num);
return atomic64_add_return(num, &dma_fence_context_counter) - num;
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
@@ -172,7 +172,7 @@ void dma_fence_release(struct kref *kref)
 
trace_dma_fence_destroy(fence);
 
-   BUG_ON(!list_empty(&fence->cb_list));
+   WARN_ON(!list_empty(&fence->cb_list));
 
if (fence->ops->release)
fence->ops->release(fence);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 9342cf0dada4..171895072435 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -431,8 +431,8 @@ int dma_fence_get_status(struct dma_fence *fence);
 static inline void dma_fence_set_error(struct dma_fence *fence,
   int error)
 {
-   BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));
-   BUG_ON(error >= 0 || error < -MAX_ERRNO);
+   WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));
+   WARN_ON(error >= 0 || error < -MAX_ERRNO);
 
fence->error = error;
 }
-- 
2.13.2



Re: [RFC PATCH 7/7] drm/i915: add DisplayPort CEC-Tunneling-over-AUX support

2017-05-29 Thread Daniel Vetter
On Fri, May 26, 2017 at 12:20:48PM +0200, Hans Verkuil wrote:
> On 05/26/2017 09:15 AM, Daniel Vetter wrote:
> > On Thu, May 25, 2017 at 05:06:26PM +0200, Hans Verkuil wrote:
> >> From: Hans Verkuil <hans.verk...@cisco.com>
> >>
> >> Implement support for this DisplayPort feature.
> >>
> >> The cec device is created whenever it detects an adapter that
> >> has this feature. It is only removed when a new adapter is connected
> >> that does not support this. If a new adapter is connected that has
> >> different properties than the previous one, then the old cec device is
> >> unregistered and a new one is registered to replace the old one.
> >>
> >> Signed-off-by: Hans Verkuil <hans.verk...@cisco.com>
> > 
> > Some small comments below.
> > 
> >> ---
> >>  drivers/gpu/drm/i915/Kconfig    | 11 +++++++++++
> >>  drivers/gpu/drm/i915/intel_dp.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
> >>  2 files changed, 53 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> >> index a5cd5dacf055..f317b13a1409 100644
> >> --- a/drivers/gpu/drm/i915/Kconfig
> >> +++ b/drivers/gpu/drm/i915/Kconfig
> >> @@ -124,6 +124,17 @@ config DRM_I915_GVT_KVMGT
> >>  Choose this option if you want to enable KVMGT support for
> >>  Intel GVT-g.
> >>  
> >> +config DRM_I915_DP_CEC
> >> +  tristate "Enable DisplayPort CEC-Tunneling-over-AUX HDMI support"
> >> +  depends on DRM_I915 && CEC_CORE
> >> +  select DRM_DP_CEC
> >> +  help
> >> +Choose this option if you want to enable HDMI CEC support for
> >> +DisplayPort/USB-C to HDMI adapters.
> >> +
> >> +Note: not all adapters support this feature, and even those
> >> +that do support it often do not hook up the CEC pin.
> > 
> > Why Kconfig? There's not anything else optional in i915.ko (except debug
> > stuff ofc), since generally just not worth the pain. Also doesn't seem to
> > be wired up at all :-)
> 
> It selects DRM_DP_CEC, but you're right, it can be dropped.
> 
> > 
> >> +
> >>  menu "drm/i915 Debugging"
> >>  depends on DRM_I915
> >>  depends on EXPERT
> >> diff --git a/drivers/gpu/drm/i915/intel_dp.c 
> >> b/drivers/gpu/drm/i915/intel_dp.c
> >> index ee77b519835c..38e17ee2548d 100644
> >> --- a/drivers/gpu/drm/i915/intel_dp.c
> >> +++ b/drivers/gpu/drm/i915/intel_dp.c
> >> @@ -32,6 +32,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -1405,6 +1406,7 @@ static void intel_aux_reg_init(struct intel_dp 
> >> *intel_dp)
> >>  static void
> >>  intel_dp_aux_fini(struct intel_dp *intel_dp)
> >>  {
> >> +  cec_unregister_adapter(intel_dp->aux.cec_adap);
> >>kfree(intel_dp->aux.name);
> >>  }
> >>  
> >> @@ -4179,6 +4181,33 @@ intel_dp_check_mst_status(struct intel_dp *intel_dp)
> >>return -EINVAL;
> >>  }
> >>  
> >> +static bool
> >> +intel_dp_check_cec_status(struct intel_dp *intel_dp)
> >> +{
> >> +  bool handled = false;
> >> +
> >> +  for (;;) {
> >> +  u8 cec_irq;
> >> +  int ret;
> >> +
> >> +  ret = drm_dp_dpcd_readb(&intel_dp->aux,
> >> +  DP_DEVICE_SERVICE_IRQ_VECTOR_ESI1,
> >> +  &cec_irq);
> >> +  if (ret < 0 || !(cec_irq & DP_CEC_IRQ))
> >> +  return handled;
> >> +
> >> +  cec_irq &= ~DP_CEC_IRQ;
> >> +  drm_dp_cec_irq(&intel_dp->aux);
> >> +  handled = true;
> >> +
> >> +  ret = drm_dp_dpcd_writeb(&intel_dp->aux,
> >> +   DP_DEVICE_SERVICE_IRQ_VECTOR_ESI1,
> >> +   cec_irq);
> >> +  if (ret < 0)
> >> +  return handled;
> >> +  }
> >> +}
> > 
> > Shouldn't the above be a helper in the cec library? Doesn't look i915
> > specific to me at least ...
> 
> Good point, this can be moved to drm_dp_cec_irq().
> 
> > 
> >> +
> >>  static void
> >>  intel_dp_retrain_link(struct intel_dp *intel_dp)
>

Re: [RFC PATCH 6/7] drm: add support for DisplayPort CEC-Tunneling-over-AUX

2017-05-26 Thread Daniel Vetter
On Fri, May 26, 2017 at 12:34 PM, Hans Verkuil <hverk...@xs4all.nl> wrote:
>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>> index 78d7fc0ebb57..dd771ce8a3d0 100644
>>> --- a/drivers/gpu/drm/Kconfig
>>> +++ b/drivers/gpu/drm/Kconfig
>>> @@ -120,6 +120,9 @@ config DRM_LOAD_EDID_FIRMWARE
>>>default case is N. Details and instructions how to build your own
>>>EDID data are given in Documentation/EDID/HOWTO.txt.
>>>
>>> +config DRM_DP_CEC
>>> +bool
>>
>> We generally don't bother with a Kconfig for every little bit in drm, not
>> worth the trouble (yes I know there's some exceptions, but somehow they're
>> all from soc people). Just smash this into the KMS_HELPER one and life is
>> much easier for drivers. Also allows you to drop the dummy inline
>> functions.
>
> For all other CEC implementations I have placed it under a config option. The
> reason is that 1) CEC is an optional feature of HDMI and you may not actually
> want it, and 2) enabling CEC also pulls in the cec module.
>
> I still think turning this into a drm config option makes sense. This would
> replace the i915 config option I made in the next patch, i.e. this config 
> option
> is moved up one level.
>
> Your choice, though.

If there is a CEC option already, can we just reuse that one? I.e.
when it's enabled, we compile the drm dp cec helpers, if it's not, you
get the pile of dummy functions. drm_dp_cec.c should still be part of
drm_kms_helper.ko though I think (since the dp aux stuff is in there
anyway, doesn't make sense to split it).

I'm still not sold on Kconfig proliferation for optional features
(have one for the driver, that's imo enough), but if it exists already
not going to block its use in drm. As long as it's minimally invasive
on the code and drivers don't have to care at all.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [RFC PATCH 6/7] drm: add support for DisplayPort CEC-Tunneling-over-AUX

2017-05-26 Thread Daniel Vetter
return 0;
> + cec_unregister_adapter(aux->cec_adap);
> + }
> +
> + aux->cec_adap = cec_allocate_adapter(&drm_dp_cec_adap_ops,
> +  aux, name, cec_caps, num_las);
> + if (IS_ERR(aux->cec_adap)) {
> + err = PTR_ERR(aux->cec_adap);
> + aux->cec_adap = NULL;
> + return err;
> + }
> + err = cec_register_adapter(aux->cec_adap, parent);
> + if (err) {
> + cec_delete_adapter(aux->cec_adap);
> + aux->cec_adap = NULL;
> + }
> + return err;
> +}
> +EXPORT_SYMBOL(drm_dp_cec_configure_adapter);
> diff --git a/include/drm/drm_dp_helper.h b/include/drm/drm_dp_helper.h
> index 3f4ad709534e..1e373df48108 100644
> --- a/include/drm/drm_dp_helper.h
> +++ b/include/drm/drm_dp_helper.h
> @@ -843,6 +843,8 @@ struct drm_dp_aux_msg {
>   size_t size;
>  };
>  
> +struct cec_adapter;
> +
>  /**
>   * struct drm_dp_aux - DisplayPort AUX channel
>   * @name: user-visible name of this AUX channel and the I2C-over-AUX adapter
> @@ -901,6 +903,10 @@ struct drm_dp_aux {
>* @i2c_defer_count: Counts I2C DEFERs, used for DP validation.
>*/
>   unsigned i2c_defer_count;
> + /**
> +      * @cec_adap: the CEC adapter for CEC-Tunneling-over-AUX support.
> +  */
> + struct cec_adapter *cec_adap;
>  };
>  
>  ssize_t drm_dp_dpcd_read(struct drm_dp_aux *aux, unsigned int offset,
> @@ -972,4 +978,22 @@ void drm_dp_aux_unregister(struct drm_dp_aux *aux);
>  int drm_dp_start_crc(struct drm_dp_aux *aux, struct drm_crtc *crtc);
>  int drm_dp_stop_crc(struct drm_dp_aux *aux);
>  
> +#ifdef CONFIG_DRM_DP_CEC
> +int drm_dp_cec_irq(struct drm_dp_aux *aux);
> +int drm_dp_cec_configure_adapter(struct drm_dp_aux *aux, const char *name,
> +  struct device *parent);
> +#else
> +static inline int drm_dp_cec_irq(struct drm_dp_aux *aux)
> +{
> + return 0;
> +}
> +
> +static inline int drm_dp_cec_configure_adapter(struct drm_dp_aux *aux,
> +const char *name,
> +struct device *parent)
> +{
> + return -ENODEV;
> +}
> +#endif
> +
>  #endif /* _DRM_DP_HELPER_H_ */
> -- 
> 2.11.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RFC PATCH 7/7] drm/i915: add DisplayPort CEC-Tunneling-over-AUX support

2017-05-26 Thread Daniel Vetter
dp_to_dig_port(intel_dp)))
> + } else if (intel_digital_port_connected(to_i915(dev),
> + dp_to_dig_port(intel_dp))) {
>   status = intel_dp_detect_dpcd(intel_dp);
> - else
> + if (status == connector_status_connected)
> + drm_dp_cec_configure_adapter(&intel_dp->aux,
> +  intel_dp->aux.name, dev->dev);

Did you look into also wiring this up for dp mst chains?
-Daniel

> + } else {
>   status = connector_status_disconnected;
> + }
>  
>   if (status == connector_status_disconnected) {
>       memset(&intel_dp->compliance, 0, sizeof(intel_dp->compliance));
> @@ -5080,6 +5115,9 @@ intel_dp_hpd_pulse(struct intel_digital_port 
> *intel_dig_port, bool long_hpd)
>  
>   intel_display_power_get(dev_priv, intel_dp->aux_power_domain);
>  
> + if (intel_dp->aux.cec_adap)
> + intel_dp_check_cec_status(intel_dp);
> +
>   if (intel_dp->is_mst) {
>   if (intel_dp_check_mst_status(intel_dp) == -EINVAL) {
>   /*
> -- 
> 2.11.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH] dma-buf: avoid scheduling on fence status query v2

2017-05-24 Thread Daniel Vetter
On Wed, May 24, 2017 at 09:47:49AM +1000, Dave Airlie wrote:
> On 28 April 2017 at 07:27, Gustavo Padovan <gust...@padovan.org> wrote:
> > 2017-04-26 Christian König <deathsim...@vodafone.de>:
> >
> >> On 26.04.2017 at 16:46, Andres Rodriguez wrote:
> >> > When a timeout of zero is specified, the caller is only interested in
> >> > the fence status.
> >> >
> >> > In the current implementation, dma_fence_default_wait will always call
> >> > schedule_timeout() at least once for an unsignaled fence. This adds a
> >> > significant overhead to a fence status query.
> >> >
> >> > Avoid this overhead by returning early if a zero timeout is specified.
> >> >
> >> > v2: move early return after enable_signaling
> >> >
> >> > Signed-off-by: Andres Rodriguez <andre...@gmail.com>
> >>
> >> Reviewed-by: Christian König <christian.koe...@amd.com>
> >
> > pushed to drm-misc-next. Thanks all.
> 
> I don't see this patch in -rc2, where did it end up going?

Queued for 4.13. Makes imo sense since it's just a performance
improvement, not a clear bugfix. But it's in your drm-next, so if you want
to fast-track you can cherry-pick it over:

commit 03c0c5f6641533f5fc14bf4e76d2304197402552
Author: Andres Rodriguez <andre...@gmail.com>
Date:   Wed Apr 26 10:46:20 2017 -0400

dma-buf: avoid scheduling on fence status query v2

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RFC 0/4] Exynos DRM: add Picture Processor extension

2017-05-10 Thread Daniel Vetter
On Wed, May 10, 2017 at 12:29 PM, Inki Dae <inki@samsung.com> wrote:
>> This kind of contradicts the response Marek received from the DRM
>> community about his proposal. Which drivers in particular do you have in
>> mind?
>
> You can check vmw_overlay_ioctl of the vmwgfx driver and
> intel_overlay_put_image_ioctl of the i915 driver. These were all I could
> find in mainline.
> It seems the boundaries of whether we have to implement a pre/post
> processing mem2mem driver in V4L2 or DRM are really vague.

These aren't picture processors, but overlay plane support that was merged
before we had the core drm overlay plane support. Please do not emulate
them at all, your patch will be rejected :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [RFC 0/4] Exynos DRM: add Picture Processor extension

2017-05-10 Thread Daniel Vetter
s.freedesktop.org/msg146286.html
> >>>
> >>> For GPUs I generally understand the reasoning: there's a very limited 
> >>> number
> >>> of users of this API --- primarily because it's not an application
> >>> interface.
> >>>
> >>> If you have a device that however falls under the scope of V4L2 (at least
> >>> API-wise), does this continue to be the case? Will there be only one or 
> >>> two
> >>> (or so) users for this API? Is it the case here?
> >>>
> >>> Using a device specific interface definitely has some benefits: there's no
> >>> need to think how would you generalise the interface for other similar
> >>> devices. There's no need to consider backwards compatibility as it's not a
> >>> requirement. The drawback is that the applications that need to support
> >>> similar devices will bear the burden of having to support different APIs.
> >>>
> >>> I don't mean to say that you should ram whatever under V4L2 / MC
> >>> independently of how unworkable that might be, but there are also clear
> >>> advantages in using a standardised interface such as V4L2.
> >>>
> >>> V4L2 has a long history behind it and if it was designed today, I bet it
> >>> would look quite different from what it is now.
> >>
> >> It's true. There is definitely a benefit to V4L2, because V4L2 provides a
> >> standard Linux ABI - DRM, as of now, does not.
> >>
> >> However, I think that is the only benefit we could get through V4L2. Using
> >> V4L2 complicates the platform's software stack - we have to open both a
> >> video device node and a card device node just to display an image on the
> >> screen while scaling it or converting its color space, and we also need to
> >> export a DMA buffer from one side and import it into the other side using
> >> DMABUF.
> >>
> >> It may not be related to this, but even V4L2 has a performance problem -
> >> every QBUF/DQBUF request performs a mapping/unmapping of the DMA buffer,
> >> as you already know. :)
> >>
> >> In addition, display subsystems on ARM SoCs these days tend to include
> >> pre/post processing hardware in the display controller - OMAP, Exynos8895
> >> and MSM, as far as I know.
> >>
> > 
> > I agree with many of the arguments given by Inki above and earlier by
> > Marek. However, they apply to the already existing V4L2 implementation,
> > not to V4L2 as an idea in general, and I believe a comparison against a
> > completely new API that doesn't even exist in the kernel tree and
> > userspace yet (only in terms of patches on the list) is not fair.
> 
> Below is a userspace project that uses the Exynos DRM post processor driver, the IPP driver:
> https://review.tizen.org/git/?p=platform/adaptation/samsung_exynos/libtdm-exynos.git;a=blob;f=src/tdm_exynos_pp.c;h=db20e6f226d313672d1d468e06d80526ea30121c;hb=refs/heads/tizen
> 
> Marek's patch series is just a new version of this driver, which is specific
> to Exynos DRM. Marek is trying to enhance this driver.
> P.S. Other DRM drivers in mainline already have such an API or a similar one.
> 
> We will also open the userspace that uses the new API later.

Those drivers are different, because they just expose a hw-specific ABI,
like the current IPP interfaces exposed by drm/exynos.

I think you have 2 options:
- Extend the current IPP interface in drm/exynos with whatever new pixel
  processor modes you want. Of course this still means you need to have
  the userspace side open-source, but otherwise it's all private to exynos
  hardware and software.

- If you want something standardized otoh, go with v4l2. And the issues
  you point out in v4l2 aren't uapi issues, but implementation details of
  the current vbuf helpers, which can be fixed. At least that's my
  understanding. And it should be fairly easy to fix that, simply switch
  from doing a map/unmap for every q/deqbuf to caching the mappings and
  use the stream dma-api interfaces to only do the flush (if needed at
  all, should turn into a no-op) on q/deqbuf.
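
  A hedged sketch of that caching idea (error handling elided; buf, dbuf
  and dev come from the surrounding driver context, and the helpers shown
  are the regular dma-buf/dma-mapping APIs, not actual videobuf2 code):

    /* attach + map exactly once, when the buffer is set up */
    buf->attach = dma_buf_attach(dbuf, dev);
    buf->sgt = dma_buf_map_attachment(buf->attach, DMA_BIDIRECTIONAL);

    /* per-QBUF: only hand the cached mapping over to the device */
    dma_sync_sg_for_device(dev, buf->sgt->sgl, buf->sgt->nents,
                           DMA_TO_DEVICE);

    /* per-DQBUF: hand it back to the cpu, no unmap */
    dma_sync_sg_for_cpu(dev, buf->sgt->sgl, buf->sgt->nents,
                        DMA_FROM_DEVICE);

    /* unmap + detach only when the buffer is finally destroyed */
    dma_buf_unmap_attachment(buf->attach, buf->sgt, DMA_BIDIRECTIONAL);
    dma_buf_detach(dbuf, buf->attach);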

Trying to come up with a generic drm api has imo not much chance of
getting accepted anytime soon (since for the simple pixel processor
pipeline it's just duplicating v4l, and for something more generic/faster
a generic interface is always too slow).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2 0/2] rcar-du, vsp1: rcar-gen3: Add support for colorkey alpha blending

2017-05-08 Thread Daniel Vetter
On Mon, May 08, 2017 at 09:33:37AM -0700, Eric Anholt wrote:
> Alexandru Gheorghe <alexandru_gheor...@mentor.com> writes:
> 
> > Currently, rcar-du supports colorkeying only for rcar-gen2, and it uses
> > some hw capability of the display unit (DU) which is not available on gen3.
> > In order to implement colorkeying for gen3 we need to use the colorkey
> > capability of the VSPD, hence the need to change both drivers rcar-du and
> > vsp1.
> >
> > This patchset had been developed and tested on top of v4.9/rcar-3.5.1 from
> > git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas-bsp.git
> 
> A few questions:
> 
> Are other drivers interested in supporting this property?  VC4 has the
> 24-bit RGB colorkey, but I don't see YCBCR support.  Should it be
> documented in a generic location?
> 
> Does your colorkey end up forcing alpha to 1 for the plane when it's not
> matched?

I think a generic color-key for plane compositing would be nice, but I'm
not sure that's possible due to differences in how the key works.
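
If someone wants to prototype it per-driver anyway, exposing it is cheap -
a hedged sketch (the property name and range are made up here, not a
standardized DRM property):

    prop = drm_property_create_range(dev, 0, "colorkey", 0, 0xffffff);
    if (prop)
            drm_object_attach_property(&plane->base, prop, 0);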
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH v2] dma-buf: Rename dma-ops to prevent conflict with kunmap_atomic macro

2017-04-20 Thread Daniel Vetter
On Wed, Apr 19, 2017 at 01:36:10PM -0600, Logan Gunthorpe wrote:
> Seeing that the kunmap_atomic dma_buf_ops member shares its name with a
> macro in highmem.h, the former can be aliased if any dma-buf user includes
> that header.
> 
> I'm personally trying to include highmem.h inside scatterlist.h and this
> breaks the dma-buf code proper.
> 
> Christoph Hellwig suggested [1] renaming it and pushing this patch ASAP.
> 
> To maintain consistency I've renamed all four of kmap* and kunmap* to be
> map* and unmap*. (Even though only kmap_atomic presently conflicts.)
> 
> [1] https://www.spinics.net/lists/target-devel/msg15070.html
> 
> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
> Reviewed-by: Sinclair Yeh <s...@vmware.com>

Acked-by: Daniel Vetter <daniel.vet...@ffwll.ch>

Probably simplest if we pull this in through the drm-misc tree for 4.12.
Can we have an ack for the v4l side for that pls?

Thanks, Daniel
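
For the record, a hedged illustration of the clash being fixed here, with
fake_kunmap standing in for the real macro body from highmem.h:

    #define kunmap_atomic(addr) fake_kunmap(addr)  /* as in highmem.h */

    struct ops {
            void (*kunmap_atomic)(void *addr);  /* declaring it is fine... */
    };

    /*
     * ...but any call through the member breaks: cpp sees the
     * function-like macro name followed by '(' and rewrites
     *
     *         o->kunmap_atomic(vaddr);
     * into
     *         o->fake_kunmap(vaddr);
     *
     * which no longer names a member - a build error, hence the rename.
     */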

> ---
> 
> Changes since v1:
> 
> - Added the missing tegra driver (noticed by kbuild robot)
> - Rebased off of drm-intel-next to get the i915 selftest that is new
> - Fixed nits Sinclair pointed out.
> 
>  drivers/dma-buf/dma-buf.c  | 16 
>  drivers/gpu/drm/armada/armada_gem.c|  8 
>  drivers/gpu/drm/drm_prime.c|  8 
>  drivers/gpu/drm/i915/i915_gem_dmabuf.c |  8 
>  drivers/gpu/drm/i915/selftests/mock_dmabuf.c   |  8 
>  drivers/gpu/drm/omapdrm/omap_gem_dmabuf.c  |  8 
>  drivers/gpu/drm/tegra/gem.c|  8 
>  drivers/gpu/drm/udl/udl_dmabuf.c   |  8 
>  drivers/gpu/drm/vmwgfx/vmwgfx_prime.c  |  8 
>  drivers/media/v4l2-core/videobuf2-dma-contig.c |  4 ++--
>  drivers/media/v4l2-core/videobuf2-dma-sg.c |  4 ++--
>  drivers/media/v4l2-core/videobuf2-vmalloc.c|  4 ++--
>  drivers/staging/android/ion/ion.c  |  8 
>  include/linux/dma-buf.h| 22 +++---
>  14 files changed, 61 insertions(+), 61 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index f72aaac..512bdbc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -405,8 +405,8 @@ struct dma_buf *dma_buf_export(const struct 
> dma_buf_export_info *exp_info)
> || !exp_info->ops->map_dma_buf
> || !exp_info->ops->unmap_dma_buf
> || !exp_info->ops->release
> -   || !exp_info->ops->kmap_atomic
> -   || !exp_info->ops->kmap
> +   || !exp_info->ops->map_atomic
> +   || !exp_info->ops->map
> || !exp_info->ops->mmap)) {
>   return ERR_PTR(-EINVAL);
>   }
> @@ -872,7 +872,7 @@ void *dma_buf_kmap_atomic(struct dma_buf *dmabuf, 
> unsigned long page_num)
>  {
>   WARN_ON(!dmabuf);
> 
> - return dmabuf->ops->kmap_atomic(dmabuf, page_num);
> + return dmabuf->ops->map_atomic(dmabuf, page_num);
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_kmap_atomic);
> 
> @@ -889,8 +889,8 @@ void dma_buf_kunmap_atomic(struct dma_buf *dmabuf, 
> unsigned long page_num,
>  {
>   WARN_ON(!dmabuf);
> 
> - if (dmabuf->ops->kunmap_atomic)
> - dmabuf->ops->kunmap_atomic(dmabuf, page_num, vaddr);
> + if (dmabuf->ops->unmap_atomic)
> + dmabuf->ops->unmap_atomic(dmabuf, page_num, vaddr);
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_kunmap_atomic);
> 
> @@ -907,7 +907,7 @@ void *dma_buf_kmap(struct dma_buf *dmabuf, unsigned long 
> page_num)
>  {
>   WARN_ON(!dmabuf);
> 
> - return dmabuf->ops->kmap(dmabuf, page_num);
> + return dmabuf->ops->map(dmabuf, page_num);
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_kmap);
> 
> @@ -924,8 +924,8 @@ void dma_buf_kunmap(struct dma_buf *dmabuf, unsigned long 
> page_num,
>  {
>   WARN_ON(!dmabuf);
> 
> - if (dmabuf->ops->kunmap)
> - dmabuf->ops->kunmap(dmabuf, page_num, vaddr);
> + if (dmabuf->ops->unmap)
> + dmabuf->ops->unmap(dmabuf, page_num, vaddr);
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_kunmap);
> 
> diff --git a/drivers/gpu/drm/armada/armada_gem.c 
> b/drivers/gpu/drm/armada/armada_gem.c
> index 1597458..d6c2a5d 100644
> --- a/drivers/gpu/drm/armada/armada_gem.c
> +++ b/drivers/gpu/drm/armada/armada_gem.c
> @@ -529,10 +529,10 @@ static const struct dma_buf_ops 
> armada_gem_prime_dmabuf_ops = {
>   .map_dma_buf= armada_gem_p

Re: [Linaro-mm-sig] [PATCHv4 12/12] staging/android: Update Ion TODO list

2017-04-19 Thread Daniel Vetter
On Tue, Apr 18, 2017 at 11:27:14AM -0700, Laura Abbott wrote:
> Most of the items have been taken care of by a clean up series. Remove
> the completed items and add a few new ones.
> 
> Signed-off-by: Laura Abbott <labb...@redhat.com>
> ---
>  drivers/staging/android/TODO | 21 -
>  1 file changed, 4 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/staging/android/TODO b/drivers/staging/android/TODO
> index 8f3ac37..5f14247 100644
> --- a/drivers/staging/android/TODO
> +++ b/drivers/staging/android/TODO
> @@ -7,23 +7,10 @@ TODO:
>  
>  
>  ion/
> - - Remove ION_IOC_SYNC: Flushing for devices should be purely a kernel 
> internal
> -   interface on top of dma-buf. flush_for_device needs to be added to dma-buf
> -   first.
> - - Remove ION_IOC_CUSTOM: Atm used for cache flushing for cpu access in some
> -   vendor trees. Should be replaced with an ioctl on the dma-buf to expose 
> the
> -   begin/end_cpu_access hooks to userspace.
> - - Clarify the tricks ion plays with explicitly managing coherency behind the
> -   dma api's back (this is absolutely needed for high-perf gpu drivers): Add 
> an
> -   explicit coherency management mode to flush_for_device to be used by 
> drivers
> -   which want to manage caches themselves and which indicates whether cpu 
> caches
> -   need flushing.
> - - With those removed there's probably no use for ION_IOC_IMPORT anymore 
> either
> -   since ion would just be the central allocator for shared buffers.
> - - Add dt-binding to expose cma regions as ion heaps, with the rule that any
> -   such cma regions must already be used by some device for dma. I.e. ion 
> only
> -   exposes existing cma regions and doesn't reserve unecessarily memory when
> -   booting a system which doesn't use ion.
> + - Add dt-bindings for remaining heaps (chunk and carveout heaps). This would
> +   involve putting appropriate bindings in a memory node for Ion to find.
> + - Split /dev/ion up into multiple nodes (e.g. /dev/ion/heap0)
> + - Better test framework (integration with VGEM was suggested)

Found another one: Integrate the ion kernel-doc into
Documentation/gpu/ion.rst and link it up within Documentation/gpu/index.rst.
There's a lot of api and overview stuff already around, would be great to
make this more accessible.

But I wouldn't put this as a de-staging blocker, just an idea.

On the series: Acked-by: Daniel Vetter <daniel.vet...@ffwll.ch>

No full review since a bunch of stuff I'm not too familiar with, but I
like where this is going.
-Daniel

>  
>  Please send patches to Greg Kroah-Hartman <g...@kroah.com> and Cc:
>  Arve Hjønnevåg <a...@android.com> and Riley Andrews <riandr...@android.com>
> -- 
> 2.7.4
> 
> ___
> Linaro-mm-sig mailing list
> linaro-mm-...@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/linaro-mm-sig

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCHv4 10/12] staging: android: ion: Remove ion_handle and ion_client

2017-04-19 Thread Daniel Vetter
On Tue, Apr 18, 2017 at 11:27:12AM -0700, Laura Abbott wrote:
> ion_handle was introduced as an abstraction to represent a reference to
> a buffer via an ion_client. As frameworks outside of Ion evolved, the dmabuf
> emerged as the preferred standard for use in the kernel. This has made
> the ion_handle an unnecessary abstraction and prone to race
> conditions. ion_client is also now only used internally. We have enough
> mechanisms for race conditions and leaks already so just drop ion_handle
> and ion_client. This also includes ripping out most of the debugfs
> infrastructure since much of that was tied to clients and handles.
> The debugfs infrastructure was prone to give confusing data (orphaned
> allocations) so it can be replaced with something better if people
> actually want it.
> 
> Signed-off-by: Laura Abbott <labb...@redhat.com>

Yeah I think improving the dma-buf debugfs stuff (maybe with an allocator
callback to dump additional data) is the better option.

Acked-by: Daniel Vetter <daniel.vet...@ffwll.ch>
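
The resulting ABI is pleasantly small. A hedged userspace sketch of an
allocation under the patched uapi quoted below (struct layout as per the
diff; treat the snippet as illustrative, not reference code):

    struct ion_allocation_data alloc = {
            .len = len,
            .heap_id_mask = heap_id_mask,
    };
    int ion = open("/dev/ion", O_RDWR);

    if (ion < 0)
            return -1;
    if (ioctl(ion, ION_IOC_ALLOC, &alloc) < 0) {
            close(ion);
            return -1;
    }
    close(ion);         /* the dma-buf fd outlives the /dev/ion fd */
    return alloc.fd;    /* a dma-buf fd: mmap() it or import it anywhere */
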
> ---
>  drivers/staging/android/ion/ion-ioctl.c |  53 +--
>  drivers/staging/android/ion/ion.c   | 701 
> ++--
>  drivers/staging/android/ion/ion.h   |  77 +---
>  drivers/staging/android/uapi/ion.h  |  25 +-
>  4 files changed, 51 insertions(+), 805 deletions(-)
> 
> diff --git a/drivers/staging/android/ion/ion-ioctl.c 
> b/drivers/staging/android/ion/ion-ioctl.c
> index 4e7bf16..76427e4 100644
> --- a/drivers/staging/android/ion/ion-ioctl.c
> +++ b/drivers/staging/android/ion/ion-ioctl.c
> @@ -21,9 +21,7 @@
>  #include "ion.h"
>  
>  union ion_ioctl_arg {
> - struct ion_fd_data fd;
>   struct ion_allocation_data allocation;
> - struct ion_handle_data handle;
>   struct ion_heap_query query;
>  };
>  
> @@ -48,8 +46,6 @@ static int validate_ioctl_arg(unsigned int cmd, union 
> ion_ioctl_arg *arg)
>  static unsigned int ion_ioctl_dir(unsigned int cmd)
>  {
>   switch (cmd) {
> - case ION_IOC_FREE:
> - return _IOC_WRITE;
>   default:
>   return _IOC_DIR(cmd);
>   }
> @@ -57,8 +53,6 @@ static unsigned int ion_ioctl_dir(unsigned int cmd)
>  
>  long ion_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  {
> - struct ion_client *client = filp->private_data;
> - struct ion_handle *cleanup_handle = NULL;
>   int ret = 0;
>   unsigned int dir;
>   union ion_ioctl_arg data;
> @@ -86,61 +80,28 @@ long ion_ioctl(struct file *filp, unsigned int cmd, 
> unsigned long arg)
>   switch (cmd) {
>   case ION_IOC_ALLOC:
>   {
> - struct ion_handle *handle;
> + int fd;
>  
> - handle = ion_alloc(client, data.allocation.len,
> + fd = ion_alloc(data.allocation.len,
>   data.allocation.heap_id_mask,
>   data.allocation.flags);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> + if (fd < 0)
> + return fd;
>  
> - data.allocation.handle = handle->id;
> + data.allocation.fd = fd;
>  
> - cleanup_handle = handle;
> - break;
> - }
> - case ION_IOC_FREE:
> - {
> - struct ion_handle *handle;
> -
> - mutex_lock(&client->lock);
> - handle = ion_handle_get_by_id_nolock(client,
> -  data.handle.handle);
> - if (IS_ERR(handle)) {
> - mutex_unlock(&client->lock);
> - return PTR_ERR(handle);
> - }
> - ion_free_nolock(client, handle);
> - ion_handle_put_nolock(handle);
> - mutex_unlock(&client->lock);
> - break;
> - }
> - case ION_IOC_SHARE:
> - {
> - struct ion_handle *handle;
> -
> - handle = ion_handle_get_by_id(client, data.handle.handle);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> - data.fd.fd = ion_share_dma_buf_fd(client, handle);
> - ion_handle_put(handle);
> - if (data.fd.fd < 0)
> - ret = data.fd.fd;
>   break;
>   }
>   case ION_IOC_HEAP_QUERY:
> - ret = ion_query_heaps(client, &data.query);
> + ret = ion_query_heaps(&data.query);
>   break;
>   default:
>   return -ENOTTY;
>   }
>  
>   if (dir & _IOC_READ) {
> - if (copy_to_user((void __user *)arg, &data,

Re: [Linaro-mm-sig] [PATCHv4 05/12] staging: android: ion: Break the ABI in the name of forward progress

2017-04-19 Thread Daniel Vetter
On Tue, Apr 18, 2017 at 11:27:07AM -0700, Laura Abbott wrote:
> Several of the Ion ioctls were designed in such a way that they
> necessitate compat ioctls. We're breaking a bunch of other ABIs and
> cleaning stuff up anyway, so let's follow the ioctl guidelines and clean
> things up while everyone is busy converting things over. As part
> of this, also remove the useless alignment field from the allocation
> structure.
> 
> Signed-off-by: Laura Abbott <labb...@redhat.com>

Reviewed-by: Daniel Vetter <daniel.vet...@ffwll.ch>

> ---
>  drivers/staging/android/ion/Makefile |   3 -
>  drivers/staging/android/ion/compat_ion.c | 152 
> ---
>  drivers/staging/android/ion/compat_ion.h |  29 --
>  drivers/staging/android/ion/ion-ioctl.c  |   1 -
>  drivers/staging/android/ion/ion.c|   5 +-
>  drivers/staging/android/uapi/ion.h   |  19 ++--
>  6 files changed, 11 insertions(+), 198 deletions(-)
>  delete mode 100644 drivers/staging/android/ion/compat_ion.c
>  delete mode 100644 drivers/staging/android/ion/compat_ion.h
> 
> diff --git a/drivers/staging/android/ion/Makefile 
> b/drivers/staging/android/ion/Makefile
> index 66d0c4a..a892afa 100644
> --- a/drivers/staging/android/ion/Makefile
> +++ b/drivers/staging/android/ion/Makefile
> @@ -2,6 +2,3 @@ obj-$(CONFIG_ION) +=  ion.o ion-ioctl.o ion_heap.o \
>   ion_page_pool.o ion_system_heap.o \
>   ion_carveout_heap.o ion_chunk_heap.o
>  obj-$(CONFIG_ION_CMA_HEAP) += ion_cma_heap.o
> -ifdef CONFIG_COMPAT
> -obj-$(CONFIG_ION) += compat_ion.o
> -endif
> diff --git a/drivers/staging/android/ion/compat_ion.c 
> b/drivers/staging/android/ion/compat_ion.c
> deleted file mode 100644
> index 5037ddd..000
> --- a/drivers/staging/android/ion/compat_ion.c
> +++ /dev/null
> @@ -1,152 +0,0 @@
> -/*
> - * drivers/staging/android/ion/compat_ion.c
> - *
> - * Copyright (C) 2013 Google, Inc.
> - *
> - * This software is licensed under the terms of the GNU General Public
> - * License version 2, as published by the Free Software Foundation, and
> - * may be copied, distributed, and modified under those terms.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - *
> - */
> -
> -#include 
> -#include 
> -#include 
> -
> -#include "ion.h"
> -#include "compat_ion.h"
> -
> -/* See drivers/staging/android/uapi/ion.h for the definition of these 
> structs */
> -struct compat_ion_allocation_data {
> - compat_size_t len;
> - compat_size_t align;
> - compat_uint_t heap_id_mask;
> - compat_uint_t flags;
> - compat_int_t handle;
> -};
> -
> -struct compat_ion_handle_data {
> - compat_int_t handle;
> -};
> -
> -#define COMPAT_ION_IOC_ALLOC _IOWR(ION_IOC_MAGIC, 0, \
> -   struct compat_ion_allocation_data)
> -#define COMPAT_ION_IOC_FREE  _IOWR(ION_IOC_MAGIC, 1, \
> -   struct compat_ion_handle_data)
> -
> -static int compat_get_ion_allocation_data(
> - struct compat_ion_allocation_data __user *data32,
> - struct ion_allocation_data __user *data)
> -{
> - compat_size_t s;
> - compat_uint_t u;
> - compat_int_t i;
> - int err;
> -
> - err = get_user(s, &data32->len);
> - err |= put_user(s, &data->len);
> - err |= get_user(s, &data32->align);
> - err |= put_user(s, &data->align);
> - err |= get_user(u, &data32->heap_id_mask);
> - err |= put_user(u, &data->heap_id_mask);
> - err |= get_user(u, &data32->flags);
> - err |= put_user(u, &data->flags);
> - err |= get_user(i, &data32->handle);
> - err |= put_user(i, &data->handle);
> -
> - return err;
> -}
> -
> -static int compat_get_ion_handle_data(
> - struct compat_ion_handle_data __user *data32,
> - struct ion_handle_data __user *data)
> -{
> - compat_int_t i;
> - int err;
> -
> - err = get_user(i, &data32->handle);
> - err |= put_user(i, &data->handle);
> -
> - return err;
> -}
> -
> -static int compat_put_ion_allocation_data(
> - struct compat_ion_allocation_data __user *data32,
> - struct ion_allocation_data __user *data)
> -{
> - compat_size_t s;
> - compat_uint_t u;
> - compat_int_t i;
> - int err;
> -
> - err = get_user(s, &data->len);
> - err |= put_user(s, &data32->len);
>

Re: [PATCH 05/22] drm/i915: Make use of the new sg_map helper function

2017-04-18 Thread Daniel Vetter
On Thu, Apr 13, 2017 at 04:05:18PM -0600, Logan Gunthorpe wrote:
> This is a single straightforward conversion from kmap to sg_map.
> 
> Signed-off-by: Logan Gunthorpe <log...@deltatee.com>

Acked-by: Daniel Vetter <daniel.vet...@ffwll.ch>

Probably makes sense to merge through some other tree, but please be aware
of the considerable churn rate in i915 (i.e. make sure your tree is in
linux-next before you send a pull request for this). Plan B would be to
get the prep patch in first and then merge the i915 conversion one kernel
release later.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 27 ---
>  1 file changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 67b1fc5..1b1b91a 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2188,6 +2188,15 @@ static void __i915_gem_object_reset_page_iter(struct 
> drm_i915_gem_object *obj)
>   radix_tree_delete(&obj->mm.get_page.radix, iter.index);
>  }
>  
> +static void i915_gem_object_unmap(const struct drm_i915_gem_object *obj,
> +   void *ptr)
> +{
> + if (is_vmalloc_addr(ptr))
> + vunmap(ptr);
> + else
> + sg_unmap(obj->mm.pages->sgl, ptr, SG_KMAP);
> +}
> +
>  void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
>enum i915_mm_subclass subclass)
>  {
> @@ -2215,10 +2224,7 @@ void __i915_gem_object_put_pages(struct 
> drm_i915_gem_object *obj,
>   void *ptr;
>  
>   ptr = ptr_mask_bits(obj->mm.mapping);
> - if (is_vmalloc_addr(ptr))
> - vunmap(ptr);
> - else
> - kunmap(kmap_to_page(ptr));
> + i915_gem_object_unmap(obj, ptr);
>  
>   obj->mm.mapping = NULL;
>   }
> @@ -2475,8 +2481,11 @@ static void *i915_gem_object_map(const struct 
> drm_i915_gem_object *obj,
>   void *addr;
>  
>   /* A single page can always be kmapped */
> - if (n_pages == 1 && type == I915_MAP_WB)
> - return kmap(sg_page(sgt->sgl));
> + if (n_pages == 1 && type == I915_MAP_WB) {
> + addr = sg_map(sgt->sgl, SG_KMAP);
> + if (IS_ERR(addr))
> + return NULL;
> + }
>  
>   if (n_pages > ARRAY_SIZE(stack_pages)) {
>   /* Too big for stack -- allocate temporary array instead */
> @@ -2543,11 +2552,7 @@ void *i915_gem_object_pin_map(struct 
> drm_i915_gem_object *obj,
>   goto err_unpin;
>   }
>  
> - if (is_vmalloc_addr(ptr))
> - vunmap(ptr);
> - else
> - kunmap(kmap_to_page(ptr));
> -
> + i915_gem_object_unmap(obj, ptr);
>   ptr = obj->mm.mapping = NULL;
>   }
>  
> -- 
> 2.1.4
> 
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

