Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-26 Thread Russell King - ARM Linux
On Tue, Jun 25, 2013 at 11:23:21AM +0200, Daniel Vetter wrote:
 Just a quick question on your assertion that we need all four
 functions: Since we already have begin/end_cpu_access functions
 (intention here was to allow the dma_buf exporter to ensure the memory
 is pinned, e.g. for swappable gem objects, but also allow cpu cache
 flushing if required) do we still need the sync_sg_for_cpu?

Possibly not, but let's first look at the kind of scenario an importing
driver goes through:

- it attaches to a dma_buf using dma_buf_attach()
- it maps the DMA buffer using dma_buf_map_attachment().  This does
  dma_map_sg().  This possibly incurs a cache flush depending on the
  coherence properties of the architecture.
- it then calls dma_buf_begin_cpu_access().  This presumably does a
  dma_sync_sg_for_cpu().  This possibly incurs another cache flush
  depending on the coherence properties of the architecture.
- then, presumably, it calls dma_buf_kmap_atomic() or dma_buf_kmap()
  to gain a pointer to the buffer.
- at some point, dma_buf_kunmap_atomic() or dma_buf_kunmap() gets
  called.  On certain cache architectures, kunmap_atomic() results in
  a cache flush.
- dma_buf_end_cpu_access() gets called, calling through presumably to
  dma_sync_sg_for_device().  Possibly incurs a cache flush.
- dma_buf_unmap_attachment()... another cache flush.

Out of those five cache flushes, the only one which is really necessary
would be the one in kunmap_atomic().  The rest of them are completely
irrelevant to the CPU access, provided that there is no DMA access by the
device.  What's more, you can't say "well, the architecture should
optimize them!", to which I'd respond "how does the architecture know at
dma_map_sg() time that there isn't going to be a DMA access?"
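
For reference, the importer-side sequence being described looks roughly like
this (a sketch only; the begin/end_cpu_access signatures shown are the
approximate ones of this era and error handling is omitted):

#include <linux/dma-buf.h>

/*
 * Sketch of the importer path described above; the five potential cache
 * flushes are marked.  Signatures follow the dma-buf API of this era and
 * may not match later kernels exactly.
 */
static void importer_cpu_touch(struct dma_buf *dmabuf, struct device *dev)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
        void *vaddr;

        attach = dma_buf_attach(dmabuf, dev);

        /* (1) dma_map_sg() under the hood: possible cache flush */
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

        /* (2) presumably dma_sync_sg_for_cpu(): possible cache flush */
        dma_buf_begin_cpu_access(dmabuf, 0, dmabuf->size, DMA_BIDIRECTIONAL);

        vaddr = dma_buf_kmap(dmabuf, 0);
        /* ... CPU reads/writes the first page here ... */
        /* (3) the only flush pure CPU access really needs (on some caches) */
        dma_buf_kunmap(dmabuf, 0, vaddr);

        /* (4) presumably dma_sync_sg_for_device(): possible cache flush */
        dma_buf_end_cpu_access(dmabuf, 0, dmabuf->size, DMA_BIDIRECTIONAL);

        /* (5) dma_unmap_sg() under the hood: possible cache flush */
        dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);

        dma_buf_detach(dmabuf, attach);
}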

Now, if we say that it's not necessary to call dma_buf_map_attachment()..
dma_buf_unmap_attachment() around the calls to dma_buf_begin_cpu_access()..
dma_buf_end_cpu_access(), then how does dma_buf_begin_cpu_access() know
whether the DMA buffer has been dma_map_sg()'d?  It's invalid to call
dma_sync_sg_for_cpu() on a buffer which has not been dma_map_sg()'d.
Bear in mind that dma_buf_begin_cpu_access() is passed only the
struct dma_buf and not the struct dma_buf_attachment *attach, there's
no hope for the begin_cpu_access() method to know which user of the
dma_buf is asking for this call, so you can't track any state there.
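
For context, the relevant callbacks in struct dma_buf_ops (shown here only
approximately, as they look around this time) make the problem visible: the
exporter sees only the dma_buf, never an attachment that would identify the
caller:

/* Approximate form of the ops; note there is no struct dma_buf_attachment
 * argument, so the exporter cannot tell which importer is asking, nor
 * whether that importer currently has the buffer mapped for DMA.
 */
struct dma_buf_ops {
        /* ... */
        int (*begin_cpu_access)(struct dma_buf *, size_t start, size_t len,
                                enum dma_data_direction);
        void (*end_cpu_access)(struct dma_buf *, size_t start, size_t len,
                               enum dma_data_direction);
        /* ... */
};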

Thank you for pointing this out; I'm less impressed with this dma_buf
API now that I've looked deeper at those two interfaces, and even more
convinced it needs to be redesigned! :P


Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-25 Thread Daniel Vetter
On Tue, Jun 18, 2013 at 12:46 PM, Russell King - ARM Linux
li...@arm.linux.org.uk wrote:
  Note: the existing stuff does have the nice side effect of being able
  to pass buffers which do not have a struct page * associated with them
  through the dma_buf API - I think we can still preserve that by having
  dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
  but in any case we need to fix the existing API so that:
 
  dma_buf_map_attachment() becomes dma_buf_get_sg()
  dma_buf_unmap_attachment() becomes dma_buf_put_sg()
 
  both getting rid of the DMA direction argument, and then we have four
  new dma_buf calls:
 
  dma_buf_map_sg()
  dma_buf_unmap_sg()
  dma_buf_sync_sg_for_cpu()
  dma_buf_sync_sg_for_device()
 
  which do the actual sg map/unmap via the DMA API *at the appropriate
  time for DMA*.
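
Hypothetical prototypes for that proposal might look something like the
following; none of these functions exist, the names are simply taken from
the list above:

/* Illustrative only: a possible split between "get the scatterlist" and
 * "map/sync it via the DMA API at the time DMA actually happens".
 */
struct sg_table *dma_buf_get_sg(struct dma_buf_attachment *attach);
void dma_buf_put_sg(struct dma_buf_attachment *attach, struct sg_table *sgt);

int dma_buf_map_sg(struct dma_buf_attachment *attach, struct sg_table *sgt,
                   enum dma_data_direction dir);
void dma_buf_unmap_sg(struct dma_buf_attachment *attach, struct sg_table *sgt,
                      enum dma_data_direction dir);
void dma_buf_sync_sg_for_cpu(struct dma_buf_attachment *attach,
                             struct sg_table *sgt,
                             enum dma_data_direction dir);
void dma_buf_sync_sg_for_device(struct dma_buf_attachment *attach,
                                struct sg_table *sgt,
                                enum dma_data_direction dir);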

 Hm, my idea was to just add a dma_buf_sync_attachment for the device side
 syncing, since the cpu access stuff is already bracketed with the
 begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap,
 but imo mmap support for dma_buf is a bit insane anyway, so I don't care
 too much about it.

 Since such dma mappings would be really longstanding in most cases anyway
 drivers could just map with BIDIRECTIONAL and do all the real flushing
 with the new sync stuff.

 Note that the DMA API debug doesn't allow you to change the direction
 argument on an existing mapping (neither should it, again this is
 documented in the DMA API stuff in Documentation/).  This is where you
 would need the complete set of four functions I mention above which
 reflect the functionality of the DMA API.

[Been travelling a bit, hence the delay.]

Just a quick question on your assertion that we need all four
functions: Since we already have begin/end_cpu_access functions
(intention here was to allow the dma_buf exporter to ensure the memory
is pinned, e.g. for swappable gem objects, but also allow cpu cache
flushing if required) do we still need the sync_sg_for_cpu? At least
with i915 as the exporter we currently hide the clflushing behind our
begin_cpu_access callback. For device dma we currently punt on it due
to lack of the dma_buf_sync_sg_for_device interface.
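
For what it's worth, the underlying streaming DMA API pattern that such a
dma_buf_sync_sg_for_{cpu,device} pair would forward to is the usual
map-once, sync-per-access sequence (standard dma_map_sg()/dma_sync_sg_*()
usage, shown only as a reminder):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Map once with DMA_BIDIRECTIONAL, then bracket each individual access
 * with the sync calls.  As noted above, DMA API debug expects the sync
 * direction to match the direction used at map time.
 */
static void map_once_sync_per_access(struct device *dev, struct sg_table *sgt)
{
        if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL))
                return;

        /* before the CPU reads or writes the buffer */
        dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
        /* ... CPU access ... */

        /* before handing the buffer back to the device */
        dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
        /* ... device DMA ... */

        dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}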

Aside: I know that i915 doing the clflushing dance itself is a bit
ugly, but thus far we've been the only guys on the x86 block with
non-coherent dma. But it sounds like a bunch of other blocks on Atom
SoCs have similar needs, so I guess it would make sense to move that
into the dma layer.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-21 Thread Lucas Stach
Am Donnerstag, den 20.06.2013, 20:15 +0900 schrieb Inki Dae:
[...]
You already need some kind of IPC between the two tasks, as I suspect
even in your example it wouldn't make much sense to queue the buffer
over and over again in task B without task A writing anything to it. So
task A has to signal task B there is new data in the buffer to be
processed.
   
There is no need to share the buffer over and over again just to get the
two processes to work together on the same thing. Just share the fd
between both and then do out-of-band completion signaling, as you need
this anyway. Without this you'll end up with unpredictable behavior.
Just because sync allows you to access the buffer doesn't mean it's
valid for your use-case. Without completion signaling you could easily
end up overwriting your data from task A multiple times before task B
even tries to lock the buffer for processing.
   
So the valid flow is (and this already works with the current APIs):
Task A                        Task B
------                        ------
CPU access buffer
   ---completion signal--->
                              qbuf (dragging buffer into
                              device domain, flush caches,
                              reserve buffer etc.)
                                        |
                              wait for device operation to
                              complete
                                        |
                              dqbuf (dragging buffer back
                              into CPU domain, invalidate
                              caches, unreserve)
   <---completion signal---
CPU access buffer
   
  
   Correct. In case that data flow goes from A to B, it needs some kind
   of IPC between the two tasks every time, as you said. Then, without
   dmabuf-sync, what do you think about the case where two tasks share the
   same buffer but both tasks access the buffer (buf1) for write, and the
   data of the buffer (buf1) doesn't need to be shared?
  
  Sorry, I don't see the point you are trying to solve here. If you share
  a buffer and want its content to be clearly defined at every point in
  time you have to synchronize the tasks working with the buffer, not just
  the buffer accesses themselves.
  
  The easiest way to do so is to sync through userspace with out-of-band
  IPC, like in the example above.
 
 In my opinion, that's definitely not the easiest way. What I try to do is
 to avoid using *the out-of-band IPC*. As I mentioned in the document file,
 the conventional mechanism not only makes the user application
 complicated (the user process needs to understand how the device driver
 works) but also may incur performance overhead by using the
 out-of-band IPC. My example above may not be convincing to you, but
 there would be other cases able to use my approach efficiently.
 

Yeah, you'll need some knowledge and understanding about the API you are
working with to get things right. But I think it's not an unreasonable
thing to expect the programmer working directly with kernel interfaces
to read up on how things work.

Second thing: I'd rather have *one* consistent API for every subsystem,
even if they differ from each other, than having to implement this
syncpoint thing in every subsystem. Remember: a single execbuf in DRM
might reference both GEM objects backed by dma-buf as well as native SHM or
CMA backed objects. The dma-buf-mgr proposal already allows you to
handle dma-bufs during validation in much the same way as native GEM
objects.

And to get back to my original point: if you have more than one task
operating together on a buffer you absolutely need some kind of real IPC
to sync them up and do something useful. Both your syncpoints and the
proposed dma-fences only protect the buffer accesses to make sure
different tasks don't stomp on each other. There is nothing in there to
make sure that the output of your pipeline is valid. You have to take
care of that yourself in userspace. I'll reuse your example to make it
clear what I mean:

Task A                        Task B
------                        ------
dma_buf_sync_lock(buf1)
CPU write buf1
dma_buf_sync_unlock(buf1)
   ---schedule Task A again---
dma_buf_sync_lock(buf1)
CPU write buf1
dma_buf_sync_unlock(buf1)
   ------schedule Task B------
                              qbuf(buf1)
                              dma_buf_sync_lock(buf1)

This is what can happen if you don't take care of proper syncing. Task A
writes something to the buffer in expectation that Task B will take care
of it, but before Task B even gets scheduled Task A overwrites the
buffer again. Not what you wanted, is it?

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-21 Thread Lucas Stach
Hi Inki,

please refrain from sending HTML Mails, it makes proper quoting without
messing up the layout everywhere pretty hard.

Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
[...]

 Yeah, you'll some knowledge and understanding about the API
 you are
 working with to get things right. But I think it's not an
 unreasonable
 thing to expect the programmer working directly with kernel
 interfaces
 to read up on how things work.
 
 Second thing: I'll rather have *one* consistent API for every
 subsystem,
 even if they differ from each other than having to implement
 this
 syncpoint thing in every subsystem. Remember: a single execbuf
 in DRM
 might reference both GEM objects backed by dma-buf as well
 native SHM or
 CMA backed objects. The dma-buf-mgr proposal already allows
 you to
 handle dma-bufs much the same way during validation than
 native GEM
 objects.
  
 Actually, at first I had implemented a fence helper framework based on
 reservation and dma fence to provide an easy-to-use interface for device
 drivers. However, that was a wrong implementation: I had not only
 customized the dma fence but also not considered the deadlock issue.
 After that, I reimplemented it as dmabuf sync to solve the deadlock
 issue, and at that time, I realized that we first need to concentrate
 on the most basic things: the fact that CPU and CPU, CPU and DMA, or DMA
 and DMA can access the same buffer; the fact that simple is best; and the
 fact that we need not only kernel side but also user side interfaces.
 After that, I collected what is common to all subsystems, and I
 have devised this dmabuf sync framework for it. I'm not really a
 specialist in the desktop world, so a question: isn't the execbuf used
 only for the GPU? The GPU has dedicated video memory (VRAM), so it needs a
 migration mechanism between system memory and the dedicated video
 memory, and also needs to consider ordering issues while buffers are migrated.
  

Yeah, execbuf is pretty GPU specific, but I don't see how this matters
for this discussion. Also I don't see a big difference between embedded
and desktop GPUs. Buffer migration is more of a detail here. Both take
command streams that potentially reference other buffers, which might be
native GEM or dma-buf backed objects. Both have to make sure the buffers
are in the right domain (caches cleaned and address mappings set up) and
are available for the desired operation, meaning you have to sync with
other DMA engines and maybe also with CPU.

The only case where sync isn't clearly defined right now by the current
API entrypoints is when you access memory through the dma-buf fallback
mmap support, which might happen with some software processing element
in a video pipeline or something. I agree that we will need a userspace
interface here, but I think this shouldn't be yet another sync object,
but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
hooks into the existing dma-fence and reservation stuff.
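
Purely as an illustration of that suggestion, such an ioctl pair on the
dma-buf fd could look like the sketch below; none of these names or
structures exist, they are invented here only to make the idea concrete:

/* Hypothetical uapi sketch -- names invented for illustration only. */
struct dma_buf_cpu_access {
        __u64 flags;                    /* READ and/or WRITE */
};

#define DMA_BUF_CPU_ACCESS_READ         (1 << 0)
#define DMA_BUF_CPU_ACCESS_WRITE        (1 << 1)

#define DMA_BUF_IOCTL_PREPARE_CPU_ACCESS _IOW('b', 0, struct dma_buf_cpu_access)
#define DMA_BUF_IOCTL_FINISH_CPU_ACCESS  _IOW('b', 1, struct dma_buf_cpu_access)

/*
 * Userspace would bracket every CPU access to the mmap'ed dma-buf:
 *
 *     struct dma_buf_cpu_access acc = { .flags = DMA_BUF_CPU_ACCESS_WRITE };
 *     ioctl(dmabuf_fd, DMA_BUF_IOCTL_PREPARE_CPU_ACCESS, &acc);
 *     ... write through the mmap'ed pointer ...
 *     ioctl(dmabuf_fd, DMA_BUF_IOCTL_FINISH_CPU_ACCESS, &acc);
 *
 * The kernel side of "prepare" would wait on the buffer's fences via the
 * reservation object and do any cache maintenance; "finish" would flush
 * CPU writes back before the next device access.
 */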

 
 And to get back to my original point: if you have more than
 one task
 operating together on a buffer you absolutely need some kind
 of real IPC
 to sync them up and do something useful. Both you syncpoints
 and the
 proposed dma-fences only protect the buffer accesses to make
 sure
 different task don't stomp on each other. There is nothing in
 there to
 make sure that the output of your pipeline is valid. You have
 to take
 care of that yourself in userspace. I'll reuse your example to
 make it
 clear what I mean:
 
 Task A                        Task B
 ------                        ------
 dma_buf_sync_lock(buf1)
 CPU write buf1
 dma_buf_sync_unlock(buf1)
    ---schedule Task A again---
 dma_buf_sync_lock(buf1)
 CPU write buf1
 dma_buf_sync_unlock(buf1)
    ------schedule Task B------
                               qbuf(buf1)
                               dma_buf_sync_lock(buf1)
 
  This is what can happen if you don't take care of proper syncing. Task A
  writes something to the buffer in expectation that Task B will take care
  of it, but before Task B even gets scheduled Task A overwrites the
  buffer again. Not what you wanted, is it?
  
  Exactly the wrong example. I had already mentioned that. In case
  that data flow goes from A to B, it needs some kind of IPC between the
  two tasks every time. So again, your example would have no
  problem in the case that *two tasks share the same buffer but these tasks
  access the 

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-21 Thread Inki Dae
2013/6/21 Lucas Stach l.st...@pengutronix.de:
 Hi Inki,

 please refrain from sending HTML Mails, it makes proper quoting without
 messing up the layout everywhere pretty hard.


Sorry about that. I should have used text mode.

 Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
 [...]

 Yeah, you'll some knowledge and understanding about the API
 you are
 working with to get things right. But I think it's not an
 unreasonable
 thing to expect the programmer working directly with kernel
 interfaces
 to read up on how things work.

 Second thing: I'll rather have *one* consistent API for every
 subsystem,
 even if they differ from each other than having to implement
 this
 syncpoint thing in every subsystem. Remember: a single execbuf
 in DRM
 might reference both GEM objects backed by dma-buf as well
 native SHM or
 CMA backed objects. The dma-buf-mgr proposal already allows
 you to
 handle dma-bufs much the same way during validation than
 native GEM
 objects.

 Actually, at first I had implemented a fence helper framework based on
 reservation and dma fence to provide easy-use-interface for device
 drivers. However, that was wrong implemention: I had not only
 customized the dma fence but also not considered dead lock issue.
 After that, I have reimplemented it as dmabuf sync to solve dead
 issue, and at that time, I realized that we first need to concentrate
 on the most basic thing: the fact CPU and CPU, CPU and DMA, or DMA and
 DMA can access a same buffer, And the fact simple is the best, and the
 fact we need not only kernel side but also user side interfaces. After
 that, I collected what is the common part for all subsystems, and I
 have devised this dmabuf sync framework for it. I'm not really
 specialist in Desktop world. So question. isn't the execbuf used only
 for the GPU? the gpu has dedicated video memory(VRAM) so it needs
 migration mechanism between system memory and the dedicated video
 memory, and also to consider ordering issue while be migrated.


 Yeah, execbuf is pretty GPU specific, but I don't see how this matters
 for this discussion. Also I don't see a big difference between embedded
 and desktop GPUs. Buffer migration is more of a detail here. Both take
 command stream that potentially reference other buffers, which might be
 native GEM or dma-buf backed objects. Both have to make sure the buffers
 are in the right domain (caches cleaned and address mappings set up) and
 are available for the desired operation, meaning you have to sync with
 other DMA engines and maybe also with CPU.

Yeah, right. Then, in the case of a desktop GPU, doesn't it need to do
something additional when buffers are migrated from the system memory
domain to the video memory domain, or from video memory to system memory? I
guess the below members do a similar thing, and all other DMA devices
would not need them:
struct fence {
        ...
        unsigned int context, seqno;
        ...
};

And,

struct seqno_fence {
        ...
        uint32_t seqno_ofs;
        ...
};


 The only case where sync isn't clearly defined right now by the current
 API entrypoints is when you access memory through the dma-buf fallback
 mmap support, which might happen with some software processing element
 in a video pipeline or something. I agree that we will need a userspace
 interface here, but I think this shouldn't be yet another sync object,
 but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
 hooks into the existing dma-fence and reservation stuff.

I think we don't need additional ioctl commands for that. I am thinking
of using existing resources as much as possible. My idea is also similar
to yours in using the reservation stuff, because my approach also
has to use the dma-buf resource. However, my idea is that a user
process that wants buffer synchronization with another sees a sync
object as a file descriptor, like dma-buf does. The below shows my
simple idea about it:

ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, sync);

flock(sync->fd, LOCK_SH); <- LOCK_SH means a shared lock.
CPU access for read
flock(sync->fd, LOCK_UN);

Or

flock(sync->fd, LOCK_EX); <- LOCK_EX means an exclusive lock
CPU access for write
flock(sync->fd, LOCK_UN);

close(sync->fd);

As you know, that's similar to the dmabuf export feature.

In addition, a simpler idea:
flock(dmabuf_fd, LOCK_SH/EX);
CPU access for read/write
flock(dmabuf_fd, LOCK_UN);

However, I'm not sure that the above examples would work well
and be free of problems: actually, I don't fully understand the
flock mechanism, so I'm looking into it.
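
For what it's worth, the simpler variant (flock() directly on the dma-buf
fd) would look like the sketch below from userspace; flock() itself exists,
but whether its LOCK_SH/LOCK_EX semantics can actually be wired up to the
reservation machinery behind a dma-buf is exactly the open question here:

#include <stddef.h>
#include <sys/file.h>

/*
 * Userspace sketch only: plain flock() on a dma-buf fd has no
 * buffer-synchronization meaning today; the idea above is that the kernel
 * would map LOCK_SH/LOCK_EX onto shared/exclusive buffer reservations.
 */
static void cpu_read_buffer(int dmabuf_fd, const char *mapped, size_t size)
{
        flock(dmabuf_fd, LOCK_SH);      /* shared lock: CPU read access */
        /* ... read 'size' bytes through 'mapped' ... */
        flock(dmabuf_fd, LOCK_UN);
}

static void cpu_write_buffer(int dmabuf_fd, char *mapped, size_t size)
{
        flock(dmabuf_fd, LOCK_EX);      /* exclusive lock: CPU write access */
        /* ... write 'size' bytes through 'mapped' ... */
        flock(dmabuf_fd, LOCK_UN);
}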



 And to get back to my original point: if you have more than
 one task
 operating together on a buffer you absolutely need some kind
 of real IPC
   

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-21 Thread Jerome Glisse
On Fri, Jun 21, 2013 at 12:55 PM, Inki Dae daei...@gmail.com wrote:
 2013/6/21 Lucas Stach l.st...@pengutronix.de:
 Hi Inki,

 please refrain from sending HTML Mails, it makes proper quoting without
 messing up the layout everywhere pretty hard.


 Sorry about that. I should have used text mode.

 Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
 [...]

 Yeah, you'll some knowledge and understanding about the API
 you are
 working with to get things right. But I think it's not an
 unreasonable
 thing to expect the programmer working directly with kernel
 interfaces
 to read up on how things work.

 Second thing: I'll rather have *one* consistent API for every
 subsystem,
 even if they differ from each other than having to implement
 this
 syncpoint thing in every subsystem. Remember: a single execbuf
 in DRM
 might reference both GEM objects backed by dma-buf as well
 native SHM or
 CMA backed objects. The dma-buf-mgr proposal already allows
 you to
 handle dma-bufs much the same way during validation than
 native GEM
 objects.

 Actually, at first I had implemented a fence helper framework based on
 reservation and dma fence to provide easy-use-interface for device
 drivers. However, that was wrong implemention: I had not only
 customized the dma fence but also not considered dead lock issue.
 After that, I have reimplemented it as dmabuf sync to solve dead
 issue, and at that time, I realized that we first need to concentrate
 on the most basic thing: the fact CPU and CPU, CPU and DMA, or DMA and
 DMA can access a same buffer, And the fact simple is the best, and the
 fact we need not only kernel side but also user side interfaces. After
 that, I collected what is the common part for all subsystems, and I
 have devised this dmabuf sync framework for it. I'm not really
 specialist in Desktop world. So question. isn't the execbuf used only
 for the GPU? the gpu has dedicated video memory(VRAM) so it needs
 migration mechanism between system memory and the dedicated video
 memory, and also to consider ordering issue while be migrated.


 Yeah, execbuf is pretty GPU specific, but I don't see how this matters
 for this discussion. Also I don't see a big difference between embedded
 and desktop GPUs. Buffer migration is more of a detail here. Both take
 command stream that potentially reference other buffers, which might be
 native GEM or dma-buf backed objects. Both have to make sure the buffers
 are in the right domain (caches cleaned and address mappings set up) and
 are available for the desired operation, meaning you have to sync with
 other DMA engines and maybe also with CPU.

 Yeah, right. Then, in case of desktop gpu, does't it need additional
 something to do when a buffer/s is/are migrated from system memory to
 video memory domain, or from video memory to system memory domain? I
 guess the below members does similar thing, and all other DMA devices
 would not need them:
 struct fence {
   ...
   unsigned int context, seqno;
   ...
 };

 And,
struct seqno_fence {
  ...
  uint32_t seqno_ofs;
  ...
};


 The only case where sync isn't clearly defined right now by the current
 API entrypoints is when you access memory through the dma-buf fallback
 mmap support, which might happen with some software processing element
 in a video pipeline or something. I agree that we will need a userspace
 interface here, but I think this shouldn't be yet another sync object,
 but rather more a prepare/fini_cpu_access ioctl on the dma-buf which
 hooks into the existing dma-fence and reservation stuff.

 I think we don't need addition ioctl commands for that. I am thinking
 of using existing resources as possible. My idea also is similar in
 using the reservation stuff to your idea because my approach also
 should use the dma-buf resource. However, My idea is that a user
 process, that wants buffer synchronization with the other, sees a sync
 object as a file descriptor like dma-buf does. The below shows simple
 my idea about it:

 ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, sync);

 flock(sync-fd, LOCK_SH); - LOCK_SH means a shared lock.
 CPU access for read
 flock(sync-fd, LOCK_UN);

 Or

 flock(sync-fd, LOCK_EX); - LOCK_EX means an exclusive lock
 CPU access for write
 flock(sync-fd, LOCK_UN);

 close(sync-fd);

 As you know, that's similar to dmabuf export feature.

 In addition, a more simple idea,
 flock(dmabuf_fd, LOCK_SH/EX);
 CPU access for read/write
 flock(dmabuf_fd, LOCK_UN);

 However, I'm not sure that the above examples could be worked well,
 and there are no problems yet: actually, I don't fully understand
 flock mechanism, so looking into it.



 And to get back to my original point: if you have more than
  

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-20 Thread Lucas Stach
Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
 
  -Original Message-
  From: dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org
  [mailto:dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org] On
  Behalf Of Russell King - ARM Linux
  Sent: Thursday, June 20, 2013 3:29 AM
  To: Inki Dae
  Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
  Cho; linux-media@vger.kernel.org; linux-arm-ker...@lists.infradead.org
  Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
  framework
  
  On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
   On the other hand, the below shows how we could enhance the conventional
   way with my approach (just example):
  
   CPU - DMA,
   ioctl(qbuf command)  ioctl(streamon)
 |   |
 |   |
   qbuf  - dma_buf_sync_get   start streaming - syncpoint
  
   dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
  And
   the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
   accesses the sync buffer.
  
   And DMA - CPU,
   ioctl(dqbuf command)
 |
 |
   dqbuf - nothing to do
  
   Actual syncpoint is when DMA operation is completed (in interrupt
  handler):
   the syncpoint is performed by calling dma_buf_sync_unlock().
   Hence,  my approach is to move the syncpoints into just before dma
  access
   as long as possible.
  
  What you've just described does *not* work on architectures such as
  ARMv7 which do speculative cache fetches from memory at any time that
  that memory is mapped with a cacheable status, and will lead to data
  corruption.
 
 I didn't explain that enough. Sorry about that. 'nothing to do' means that a
 dmabuf sync interface isn't called but existing functions are called. So
 this may be explained again:
 ioctl(dqbuf command)
 |
 |
 dqbuf - 1. dma_unmap_sg
 2. dma_buf_sync_unlock (syncpoint)
 
 The syncpoint I mentioned means lock mechanism; not doing cache operation.
 
 In addition, please see the below more detail examples.
 
 The conventional way (without dmabuf-sync) is:
 Task A 
 
  1. CPU accesses buf  
  2. Send the buf to Task B  
  3. Wait for the buf from Task B
  4. go to 1
 
 Task B
 ---
 1. Wait for the buf from Task A
 2. qbuf the buf 
 2.1 insert the buf to incoming queue
 3. stream on
 3.1 dma_map_sg if ready, and move the buf to ready queue
 3.2 get the buf from ready queue, and dma start.
 4. dqbuf
 4.1 dma_unmap_sg after dma operation completion
 4.2 move the buf to outgoing queue
 5. back the buf to Task A
 6. go to 1
 
 In case that two tasks share buffers, and data flow goes from Task A to Task
 B, we would need IPC operation to send and receive buffers properly between
 those two tasks every time CPU or DMA access to buffers is started or
 completed.
 
 
 With dmabuf-sync:
 
 Task A 
 
 1. dma_buf_sync_lock - syncpoint (call by user side)
  2. CPU accesses buf  
  3. dma_buf_sync_unlock - syncpoint (call by user side)
  4. Send the buf to Task B (just one time)
  5. go to 1
 
 
 Task B
 ---
 1. Wait for the buf from Task A (just one time)
 2. qbuf the buf 
 2.1 insert the buf to incoming queue
 3. stream on
 3.1 dma_buf_sync_lock - syncpoint (call by kernel side)
 3.2 dma_map_sg if ready, and move the buf to ready queue
 3.3 get the buf from ready queue, and dma start.
 4. dqbuf
 4.1 dma_buf_sync_unlock - syncpoint (call by kernel side)
 4.2 dma_unmap_sg after dma operation completion
 4.3 move the buf to outgoing queue
 5. go to 1
 
 On the other hand, in case of using dmabuf-sync, as you can see the above
 example, we would need IPC operation just one time. That way, I think we
 could not only reduce performance overhead but also make user application
 simplified. Of course, this approach can be used for all DMA device drivers
 such as DRM. I'm not a specialist in v4l2 world so there may be missing
 point.
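 
 To make the ordering in Task B concrete, the kernel-side bracketing being
 proposed would look roughly like this (dma_buf_sync_lock/unlock come from
 the RFC, but their argument lists are guessed here; the rest is ordinary
 streaming DMA API usage):
 
 #include <linux/dma-mapping.h>
 #include <linux/errno.h>
 
 /* Guessed signatures -- only the names appear in this thread. */
 struct dmabuf_sync;
 extern int dma_buf_sync_lock(struct dmabuf_sync *sync);
 extern int dma_buf_sync_unlock(struct dmabuf_sync *sync);
 
 /* Task B, "stream on" path (steps 3.1 - 3.3 above) */
 static int task_b_start_dma(struct device *dev, struct dmabuf_sync *sync,
                             struct sg_table *sgt)
 {
         dma_buf_sync_lock(sync);                        /* 3.1: syncpoint */
         if (!dma_map_sg(dev, sgt->sgl, sgt->nents,      /* 3.2 */
                         DMA_BIDIRECTIONAL)) {
                 dma_buf_sync_unlock(sync);
                 return -ENOMEM;
         }
         /* 3.3: kick the DMA engine ... */
         return 0;
 }
 
 /* Task B, "dqbuf"/completion path (steps 4.1 - 4.2 above) */
 static void task_b_finish_dma(struct device *dev, struct dmabuf_sync *sync,
                               struct sg_table *sgt)
 {
         dma_buf_sync_unlock(sync);                      /* 4.1: syncpoint */
         dma_unmap_sg(dev, sgt->sgl, sgt->nents,         /* 4.2 */
                      DMA_BIDIRECTIONAL);
 }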
 

You already need some kind of IPC between the two tasks, as I suspect
even in your example it wouldn't make much sense to queue the buffer
over and over again in task B without task A writing anything to it. So
task A has to signal task B there is new data in the buffer to be
processed.

There is no need to share the buffer over and over again just to get the
two processes to work together on the same thing. Just share the fd
between both and then do out-of-band completion signaling, as you need
this anyway. Without this you'll end up with unpredictable behavior.
Just because sync allows you to access the buffer doesn't mean it's
valid for your use-case. Without completion signaling you could easily
end up overwriting your data from task A multiple times before task B
even tries to lock the buffer for processing.

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-20 Thread Russell King - ARM Linux
On Thu, Jun 20, 2013 at 09:47:07AM +0200, Lucas Stach wrote:
 Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
  
   -Original Message-
   From: dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org
   [mailto:dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org] On
   Behalf Of Russell King - ARM Linux
   Sent: Thursday, June 20, 2013 3:29 AM
   To: Inki Dae
   Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
   Cho; linux-media@vger.kernel.org; linux-arm-ker...@lists.infradead.org
   Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
   framework
   
   On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
On the other hand, the below shows how we could enhance the conventional
way with my approach (just example):
   
CPU - DMA,
ioctl(qbuf command)  ioctl(streamon)
  |   |
  |   |
qbuf  - dma_buf_sync_get   start streaming - syncpoint
   
dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
   And
the syncpoint is performed by calling dma_buf_sync_lock(), and then DMA
accesses the sync buffer.
   
And DMA - CPU,
ioctl(dqbuf command)
  |
  |
dqbuf - nothing to do
   
Actual syncpoint is when DMA operation is completed (in interrupt
   handler):
the syncpoint is performed by calling dma_buf_sync_unlock().
Hence,  my approach is to move the syncpoints into just before dma
   access
as long as possible.
   
   What you've just described does *not* work on architectures such as
   ARMv7 which do speculative cache fetches from memory at any time that
   that memory is mapped with a cacheable status, and will lead to data
   corruption.
  
  I didn't explain that enough. Sorry about that. 'nothing to do' means that a
  dmabuf sync interface isn't called but existing functions are called. So
  this may be explained again:
  ioctl(dqbuf command)
  |
  |
  dqbuf - 1. dma_unmap_sg
  2. dma_buf_sync_unlock (syncpoint)
  
  The syncpoint I mentioned means lock mechanism; not doing cache operation.
  
  In addition, please see the below more detail examples.
  
  The conventional way (without dmabuf-sync) is:
  Task A 
  
   1. CPU accesses buf  
   2. Send the buf to Task B  
   3. Wait for the buf from Task B
   4. go to 1
  
  Task B
  ---
  1. Wait for the buf from Task A
  2. qbuf the buf 
  2.1 insert the buf to incoming queue
  3. stream on
  3.1 dma_map_sg if ready, and move the buf to ready queue
  3.2 get the buf from ready queue, and dma start.
  4. dqbuf
  4.1 dma_unmap_sg after dma operation completion
  4.2 move the buf to outgoing queue
  5. back the buf to Task A
  6. go to 1
  
  In case that two tasks share buffers, and data flow goes from Task A to Task
  B, we would need IPC operation to send and receive buffers properly between
  those two tasks every time CPU or DMA access to buffers is started or
  completed.
  
  
  With dmabuf-sync:
  
  Task A 
  
   1. dma_buf_sync_lock - syncpoint (call by user side)
   2. CPU accesses buf  
   3. dma_buf_sync_unlock - syncpoint (call by user side)
   4. Send the buf to Task B (just one time)
   5. go to 1
  
  
  Task B
  ---
  1. Wait for the buf from Task A (just one time)
  2. qbuf the buf 
  2.1 insert the buf to incoming queue
  3. stream on
  3.1 dma_buf_sync_lock - syncpoint (call by kernel side)
  3.2 dma_map_sg if ready, and move the buf to ready queue
  3.3 get the buf from ready queue, and dma start.
  4. dqbuf
  4.1 dma_buf_sync_unlock - syncpoint (call by kernel side)
  4.2 dma_unmap_sg after dma operation completion
  4.3 move the buf to outgoing queue
  5. go to 1
  
  On the other hand, in case of using dmabuf-sync, as you can see the above
  example, we would need IPC operation just one time. That way, I think we
  could not only reduce performance overhead but also make user application
  simplified. Of course, this approach can be used for all DMA device drivers
  such as DRM. I'm not a specialist in v4l2 world so there may be missing
  point.
  
 
 You already need some kind of IPC between the two tasks, as I suspect
 even in your example it wouldn't make much sense to queue the buffer
 over and over again in task B without task A writing anything to it. So
 task A has to signal task B there is new data in the buffer to be
 processed.

Hang on.  Since when did dma_buf become another inter-process IPC
mechanism?  That's *not* its design goal, and there's other much
better

RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-20 Thread Inki Dae


 -Original Message-
 From: Lucas Stach [mailto:l.st...@pengutronix.de]
 Sent: Thursday, June 20, 2013 4:47 PM
 To: Inki Dae
 Cc: 'Russell King - ARM Linux'; 'Inki Dae'; 'linux-fbdev'; 'YoungJun Cho';
 'Kyungmin Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework
 
 Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
 
   -Original Message-
   From: dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org
   [mailto:dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org]
 On
   Behalf Of Russell King - ARM Linux
   Sent: Thursday, June 20, 2013 3:29 AM
   To: Inki Dae
   Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
 YoungJun
   Cho; linux-media@vger.kernel.org; linux-arm-ker...@lists.infradead.org
   Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
 synchronization
   framework
  
   On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
On the other hand, the below shows how we could enhance the
 conventional
way with my approach (just example):
   
CPU - DMA,
ioctl(qbuf command)  ioctl(streamon)
  |   |
  |   |
qbuf  - dma_buf_sync_get   start streaming - syncpoint
   
dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
   And
the syncpoint is performed by calling dma_buf_sync_lock(), and then
 DMA
accesses the sync buffer.
   
And DMA - CPU,
ioctl(dqbuf command)
  |
  |
dqbuf - nothing to do
   
Actual syncpoint is when DMA operation is completed (in interrupt
   handler):
the syncpoint is performed by calling dma_buf_sync_unlock().
Hence,  my approach is to move the syncpoints into just before dma
   access
as long as possible.
  
   What you've just described does *not* work on architectures such as
   ARMv7 which do speculative cache fetches from memory at any time that
   that memory is mapped with a cacheable status, and will lead to data
   corruption.
 
  I didn't explain that enough. Sorry about that. 'nothing to do' means
 that a
  dmabuf sync interface isn't called but existing functions are called. So
  this may be explained again:
  ioctl(dqbuf command)
  |
  |
  dqbuf - 1. dma_unmap_sg
  2. dma_buf_sync_unlock (syncpoint)
 
  The syncpoint I mentioned means lock mechanism; not doing cache
 operation.
 
  In addition, please see the below more detail examples.
 
  The conventional way (without dmabuf-sync) is:
  Task A
  
   1. CPU accesses buf
   2. Send the buf to Task B
   3. Wait for the buf from Task B
   4. go to 1
 
  Task B
  ---
  1. Wait for the buf from Task A
  2. qbuf the buf
  2.1 insert the buf to incoming queue
  3. stream on
  3.1 dma_map_sg if ready, and move the buf to ready queue
  3.2 get the buf from ready queue, and dma start.
  4. dqbuf
  4.1 dma_unmap_sg after dma operation completion
  4.2 move the buf to outgoing queue
  5. back the buf to Task A
  6. go to 1
 
  In case that two tasks share buffers, and data flow goes from Task A to
 Task
  B, we would need IPC operation to send and receive buffers properly
 between
  those two tasks every time CPU or DMA access to buffers is started or
  completed.
 
 
  With dmabuf-sync:
 
  Task A
  
  1. dma_buf_sync_lock - syncpoint (call by user side)
   2. CPU accesses buf
   3. dma_buf_sync_unlock - syncpoint (call by user side)
   4. Send the buf to Task B (just one time)
   5. go to 1
 
 
  Task B
  ---
  1. Wait for the buf from Task A (just one time)
  2. qbuf the buf
  2.1 insert the buf to incoming queue
  3. stream on
  3.1 dma_buf_sync_lock - syncpoint (call by kernel side)
  3.2 dma_map_sg if ready, and move the buf to ready queue
  3.3 get the buf from ready queue, and dma start.
  4. dqbuf
  4.1 dma_buf_sync_unlock - syncpoint (call by kernel side)
  4.2 dma_unmap_sg after dma operation completion
  4.3 move the buf to outgoing queue
  5. go to 1
 
  On the other hand, in case of using dmabuf-sync, as you can see the
 above
  example, we would need IPC operation just one time. That way, I think we
  could not only reduce performance overhead but also make user
 application
  simplified. Of course, this approach can be used for all DMA device
 drivers
  such as DRM. I'm not a specialist in v4l2 world so there may be missing
  point.
 
 
 You already need some kind of IPC between the two tasks, as I suspect
 even in your example it wouldn't make much sense to queue the buffer
 over and over again in task B without task A writing

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-20 Thread Lucas Stach
Am Donnerstag, den 20.06.2013, 09:17 +0100 schrieb Russell King - ARM
Linux:
 On Thu, Jun 20, 2013 at 09:47:07AM +0200, Lucas Stach wrote:
  Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
   
-Original Message-
From: dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org
[mailto:dri-devel-bounces+inki.dae=samsung@lists.freedesktop.org] On
Behalf Of Russell King - ARM Linux
Sent: Thursday, June 20, 2013 3:29 AM
To: Inki Dae
Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
Cho; linux-media@vger.kernel.org; linux-arm-ker...@lists.infradead.org
Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer 
synchronization
framework

On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
 On the other hand, the below shows how we could enhance the 
 conventional
 way with my approach (just example):

 CPU - DMA,
 ioctl(qbuf command)  ioctl(streamon)
   |   |
   |   |
 qbuf  - dma_buf_sync_get   start streaming - syncpoint

 dma_buf_sync_get just registers a sync buffer(dmabuf) to sync object.
And
 the syncpoint is performed by calling dma_buf_sync_lock(), and then 
 DMA
 accesses the sync buffer.

 And DMA - CPU,
 ioctl(dqbuf command)
   |
   |
 dqbuf - nothing to do

 Actual syncpoint is when DMA operation is completed (in interrupt
handler):
 the syncpoint is performed by calling dma_buf_sync_unlock().
 Hence,  my approach is to move the syncpoints into just before dma
access
 as long as possible.

What you've just described does *not* work on architectures such as
ARMv7 which do speculative cache fetches from memory at any time that
that memory is mapped with a cacheable status, and will lead to data
corruption.
   
   I didn't explain that enough. Sorry about that. 'nothing to do' means 
   that a
   dmabuf sync interface isn't called but existing functions are called. So
   this may be explained again:
   ioctl(dqbuf command)
   |
   |
   dqbuf - 1. dma_unmap_sg
   2. dma_buf_sync_unlock (syncpoint)
   
   The syncpoint I mentioned means lock mechanism; not doing cache operation.
   
   In addition, please see the below more detail examples.
   
   The conventional way (without dmabuf-sync) is:
   Task A 
   
1. CPU accesses buf  
2. Send the buf to Task B  
3. Wait for the buf from Task B
4. go to 1
   
   Task B
   ---
   1. Wait for the buf from Task A
   2. qbuf the buf 
   2.1 insert the buf to incoming queue
   3. stream on
   3.1 dma_map_sg if ready, and move the buf to ready queue
   3.2 get the buf from ready queue, and dma start.
   4. dqbuf
   4.1 dma_unmap_sg after dma operation completion
   4.2 move the buf to outgoing queue
   5. back the buf to Task A
   6. go to 1
   
   In case that two tasks share buffers, and data flow goes from Task A to 
   Task
   B, we would need IPC operation to send and receive buffers properly 
   between
   those two tasks every time CPU or DMA access to buffers is started or
   completed.
   
   
   With dmabuf-sync:
   
   Task A 
   
 1. dma_buf_sync_lock - syncpoint (call by user side)
2. CPU accesses buf  
3. dma_buf_sync_unlock - syncpoint (call by user side)
4. Send the buf to Task B (just one time)
5. go to 1
   
   
   Task B
   ---
   1. Wait for the buf from Task A (just one time)
   2. qbuf the buf 
   2.1 insert the buf to incoming queue
   3. stream on
   3.1 dma_buf_sync_lock - syncpoint (call by kernel side)
   3.2 dma_map_sg if ready, and move the buf to ready queue
   3.3 get the buf from ready queue, and dma start.
   4. dqbuf
   4.1 dma_buf_sync_unlock - syncpoint (call by kernel side)
   4.2 dma_unmap_sg after dma operation completion
   4.3 move the buf to outgoing queue
   5. go to 1
   
   On the other hand, in case of using dmabuf-sync, as you can see the above
   example, we would need IPC operation just one time. That way, I think we
   could not only reduce performance overhead but also make user application
   simplified. Of course, this approach can be used for all DMA device 
   drivers
   such as DRM. I'm not a specialist in v4l2 world so there may be missing
   point.
   
  
  You already need some kind of IPC between the two tasks, as I suspect
  even in your example it wouldn't make much sense to queue the buffer
  over and over again in task B without task A writing

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-20 Thread Lucas Stach
Am Donnerstag, den 20.06.2013, 17:24 +0900 schrieb Inki Dae:
[...]
   In addition, please see the below more detail examples.
  
   The conventional way (without dmabuf-sync) is:
   Task A
   
1. CPU accesses buf
2. Send the buf to Task B
3. Wait for the buf from Task B
4. go to 1
  
   Task B
   ---
   1. Wait for the buf from Task A
   2. qbuf the buf
   2.1 insert the buf to incoming queue
   3. stream on
   3.1 dma_map_sg if ready, and move the buf to ready queue
   3.2 get the buf from ready queue, and dma start.
   4. dqbuf
   4.1 dma_unmap_sg after dma operation completion
   4.2 move the buf to outgoing queue
   5. back the buf to Task A
   6. go to 1
  
   In case that two tasks share buffers, and data flow goes from Task A to
  Task
   B, we would need IPC operation to send and receive buffers properly
  between
   those two tasks every time CPU or DMA access to buffers is started or
   completed.
  
  
   With dmabuf-sync:
  
   Task A
   
1. dma_buf_sync_lock - syncpoint (call by user side)
2. CPU accesses buf
3. dma_buf_sync_unlock - syncpoint (call by user side)
4. Send the buf to Task B (just one time)
5. go to 1
  
  
   Task B
   ---
   1. Wait for the buf from Task A (just one time)
   2. qbuf the buf
   2.1 insert the buf to incoming queue
   3. stream on
   3.1 dma_buf_sync_lock - syncpoint (call by kernel side)
   3.2 dma_map_sg if ready, and move the buf to ready queue
   3.3 get the buf from ready queue, and dma start.
   4. dqbuf
   4.1 dma_buf_sync_unlock - syncpoint (call by kernel side)
   4.2 dma_unmap_sg after dma operation completion
   4.3 move the buf to outgoing queue
   5. go to 1
  
   On the other hand, in case of using dmabuf-sync, as you can see the
  above
   example, we would need IPC operation just one time. That way, I think we
   could not only reduce performance overhead but also make user
  application
   simplified. Of course, this approach can be used for all DMA device
  drivers
   such as DRM. I'm not a specialist in v4l2 world so there may be missing
   point.
  
  
  You already need some kind of IPC between the two tasks, as I suspect
  even in your example it wouldn't make much sense to queue the buffer
  over and over again in task B without task A writing anything to it. So
  task A has to signal task B there is new data in the buffer to be
  processed.
  
  There is no need to share the buffer over and over again just to get the
  two processes to work together on the same thing. Just share the fd
  between both and then do out-of-band completion signaling, as you need
  this anyway. Without this you'll end up with unpredictable behavior.
  Just because sync allows you to access the buffer doesn't mean it's
  valid for your use-case. Without completion signaling you could easily
  end up overwriting your data from task A multiple times before task B
  even tries to lock the buffer for processing.
  
  So the valid flow is (and this already works with the current APIs):
  Task A                        Task B
  ------                        ------
  CPU access buffer
     ---completion signal--->
                                qbuf (dragging buffer into
                                device domain, flush caches,
                                reserve buffer etc.)
                                          |
                                wait for device operation to
                                complete
                                          |
                                dqbuf (dragging buffer back
                                into CPU domain, invalidate
                                caches, unreserve)
     <---completion signal---
  CPU access buffer
  
 
 Correct. In case that data flow goes from A to B, it needs some kind
 of IPC between the two tasks every time, as you said. Then, without
 dmabuf-sync, what do you think about the case where two tasks share the
 same buffer but both tasks access the buffer (buf1) for write, and the
 data of the buffer (buf1) doesn't need to be shared?
 
Sorry, I don't see the point you are trying to solve here. If you share
a buffer and want its content to be clearly defined at every point in
time you have to synchronize the tasks working with the buffer, not just
the buffer accesses themselves.

The easiest way to do so is to sync through userspace with out-of-band
IPC, like in the example above. A more advanced way to achieve this
would be using cross-device fences to avoid going through userspace for
every syncpoint.
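
To make the out-of-band signaling concrete: a pipe (or any other IPC
primitive) between the two tasks is all that is needed once the dma-buf fd
has been shared; the qbuf/wait/dqbuf part in Task B is only indicated by
comments (queue setup and error handling omitted):

#include <string.h>
#include <unistd.h>

/*
 * a_to_b / b_to_a are pipe fds set up before the tasks are spawned; the
 * buffer fd is shared once, after that only one-byte completion tokens
 * travel between the tasks, matching the diagram above.
 */
static void task_a(int a_to_b, int b_to_a, char *buf, size_t size)
{
        char token = 0;

        for (;;) {
                memset(buf, 0x42, size);        /* CPU access to the buffer */

                write(a_to_b, &token, 1);       /* ---completion signal--> */
                read(b_to_a, &token, 1);        /* <--completion signal--- */
        }
}

static void task_b(int a_to_b, int b_to_a)
{
        char token = 0;

        for (;;) {
                read(a_to_b, &token, 1);        /* wait for Task A */

                /* qbuf: buffer into device domain (flush, reserve, ...) */
                /* wait for the device operation to complete */
                /* dqbuf: buffer back into CPU domain (invalidate, ...) */

                write(b_to_a, &token, 1);       /* hand the buffer back */
        }
}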

 
 With dmabuf-sync is:
 
  Task A
  
  1. dma_buf_sync_lock - synpoint (call by user side)
  2. CPU writes 

RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-20 Thread Inki Dae


 -Original Message-
 From: Lucas Stach [mailto:l.st...@pengutronix.de]
 Sent: Thursday, June 20, 2013 7:11 PM
 To: Inki Dae
 Cc: 'Russell King - ARM Linux'; 'Inki Dae'; 'linux-fbdev'; 'YoungJun Cho';
 'Kyungmin Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework
 
 Am Donnerstag, den 20.06.2013, 17:24 +0900 schrieb Inki Dae:
 [...]
In addition, please see the below more detail examples.
   
The conventional way (without dmabuf-sync) is:
Task A

 1. CPU accesses buf
 2. Send the buf to Task B
 3. Wait for the buf from Task B
 4. go to 1
   
Task B
---
1. Wait for the buf from Task A
2. qbuf the buf
2.1 insert the buf to incoming queue
3. stream on
3.1 dma_map_sg if ready, and move the buf to ready queue
3.2 get the buf from ready queue, and dma start.
4. dqbuf
4.1 dma_unmap_sg after dma operation completion
4.2 move the buf to outgoing queue
5. back the buf to Task A
6. go to 1
   
In case that two tasks share buffers, and data flow goes from Task A
 to
   Task
B, we would need IPC operation to send and receive buffers properly
   between
those two tasks every time CPU or DMA access to buffers is started
 or
completed.
   
   
With dmabuf-sync:
   
Task A

 1. dma_buf_sync_lock - syncpoint (call by user side)
 2. CPU accesses buf
 3. dma_buf_sync_unlock - syncpoint (call by user side)
 4. Send the buf to Task B (just one time)
 5. go to 1
   
   
Task B
---
1. Wait for the buf from Task A (just one time)
2. qbuf the buf
 2.1 insert the buf to incoming queue
3. stream on
3.1 dma_buf_sync_lock - syncpoint (call by kernel side)
3.2 dma_map_sg if ready, and move the buf to ready queue
3.3 get the buf from ready queue, and dma start.
4. dqbuf
4.1 dma_buf_sync_unlock - syncpoint (call by kernel side)
4.2 dma_unmap_sg after dma operation completion
4.3 move the buf to outgoing queue
5. go to 1
   
On the other hand, in case of using dmabuf-sync, as you can see the
   above
example, we would need IPC operation just one time. That way, I
 think we
could not only reduce performance overhead but also make user
   application
simplified. Of course, this approach can be used for all DMA device
   drivers
such as DRM. I'm not a specialist in v4l2 world so there may be
 missing
point.
   
  
   You already need some kind of IPC between the two tasks, as I suspect
   even in your example it wouldn't make much sense to queue the buffer
   over and over again in task B without task A writing anything to it.
 So
   task A has to signal task B there is new data in the buffer to be
   processed.
  
   There is no need to share the buffer over and over again just to get
 the
   two processes to work together on the same thing. Just share the fd
   between both and then do out-of-band completion signaling, as you need
   this anyway. Without this you'll end up with unpredictable behavior.
   Just because sync allows you to access the buffer doesn't mean it's
   valid for your use-case. Without completion signaling you could easily
   end up overwriting your data from task A multiple times before task B
   even tries to lock the buffer for processing.
  
   So the valid flow is (and this already works with the current APIs):
   Task A                        Task B
   ------                        ------
   CPU access buffer
      ---completion signal--->
                                 qbuf (dragging buffer into
                                 device domain, flush caches,
                                 reserve buffer etc.)
                                           |
                                 wait for device operation to
                                 complete
                                           |
                                 dqbuf (dragging buffer back
                                 into CPU domain, invalidate
                                 caches, unreserve)
      <---completion signal---
   CPU access buffer
  
 
  Correct. In case that data flow goes from A to B, it needs some kind
  of IPC between the two tasks every time as you said. Then, without
  dmabuf-sync, how do think about the case that two tasks share the same
  buffer but these tasks access the buffer(buf1) as write, and data of
  the buffer(buf1) isn't needed to be shared?
 
 Sorry, I don't see the point you are trying to solve here. If you share

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-19 Thread Lucas Stach
Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
 
  -Original Message-
  From: Lucas Stach [mailto:l.st...@pengutronix.de]
  Sent: Tuesday, June 18, 2013 6:47 PM
  To: Inki Dae
  Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
  mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
  ker...@lists.infradead.org; linux-media@vger.kernel.org
  Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
  framework
  
  Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
  [...]
  
a display device driver.  It shouldn't be used within a single driver
as a means of passing buffers between userspace and kernel space.
  
    What I try to do is not really such an ugly thing. What I try to do is to
    notify the kernel side, through the dmabuf interface, when the CPU tries
    to access a buffer. So it's not really to send the buffer to the kernel.
  
   Thanks,
   Inki Dae
  
  The most basic question about why you are trying to implement this sort
  of thing in the dma_buf framework still stands.
  
  Once you imported a dma_buf into your DRM driver it's a GEM object and
  you can and should use the native DRM ioctls to prepare/end a CPU access
  to this BO. Then internally to your driver you can use the dma_buf
  reservation/fence stuff to provide the necessary cross-device sync.
  
 
 I don't really want it to be used only for DRM drivers. We really need
 it for all other DMA devices, i.e., v4l2 based drivers. That is what I
 try to do. And my approach uses reservation to use dma-buf resources,
 but not the dma fence stuff anymore. However, I'm looking into the Radeon
 DRM driver to see why we need the dma fence stuff, and how we can use it
 if needed.
 

Still I don't see the point why you need syncpoints above dma-buf. In
both the DRM and the V4L2 world we have defined points in the API where
a buffer is allowed to change domain from device to CPU and vice versa.

In DRM if you want to access a buffer with the CPU you do a cpu_prepare.
The buffer changes back to GPU domain once you do the execbuf
validation, queue a pageflip to the buffer or similar things.

In V4L2 the syncpoints for cache operations are the queue/dequeue API
entry points. Those are also the exact points to synchronize with other
hardware thus using dma-buf reserve/fence.
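
Sketch of that bracketing on the V4L2 side, using the existing DMABUF
memory type; queue setup (VIDIOC_REQBUFS with V4L2_MEMORY_DMABUF) and
streamon are assumed to have happened already:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/*
 * QBUF hands the buffer to the device (device domain), DQBUF takes it back
 * (CPU domain); cache maintenance and fencing belong at exactly these two
 * points.  Single-planar OUTPUT queue assumed for brevity.
 */
static int process_one_buffer(int v4l2_fd, int dmabuf_fd, unsigned int index)
{
        struct v4l2_buffer buf;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
        buf.memory = V4L2_MEMORY_DMABUF;
        buf.index = index;
        buf.m.fd = dmabuf_fd;

        if (ioctl(v4l2_fd, VIDIOC_QBUF, &buf) < 0)      /* CPU -> device */
                return -1;

        if (ioctl(v4l2_fd, VIDIOC_DQBUF, &buf) < 0)     /* device -> CPU */
                return -1;

        return 0;                       /* CPU may touch the buffer again */
}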

In all this I can't see any need for a new syncpoint primitive slapped
on top of dma-buf.

Regards,
Lucas
-- 
Pengutronix e.K.   | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-19 Thread Inki Dae


 -Original Message-
 From: Lucas Stach [mailto:l.st...@pengutronix.de]
 Sent: Wednesday, June 19, 2013 7:22 PM
 To: Inki Dae
 Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
 mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework
 
 Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
 
   -Original Message-
   From: Lucas Stach [mailto:l.st...@pengutronix.de]
   Sent: Tuesday, June 18, 2013 6:47 PM
   To: Inki Dae
   Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
   mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
   ker...@lists.infradead.org; linux-media@vger.kernel.org
   Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
 synchronization
   framework
  
   Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
   [...]
   
 a display device driver.  It shouldn't be used within a single
 driver
 as a means of passing buffers between userspace and kernel space.
   
What I try to do is not really such ugly thing. What I try to do is
 to
notify that, when CPU tries to access a buffer , to kernel side
 through
dmabuf interface. So it's not really to send the buffer to kernel.
   
Thanks,
Inki Dae
   
   The most basic question about why you are trying to implement this
 sort
   of thing in the dma_buf framework still stands.
  
   Once you imported a dma_buf into your DRM driver it's a GEM object and
   you can and should use the native DRM ioctls to prepare/end a CPU
 access
   to this BO. Then internally to your driver you can use the dma_buf
   reservation/fence stuff to provide the necessary cross-device sync.
  
 
  I don't really want that is used only for DRM drivers. We really need
  it for all other DMA devices; i.e., v4l2 based drivers. That is what I
  try to do. And my approach uses reservation to use dma-buf resources
  but not dma fence stuff anymore. However, I'm looking into Radeon DRM
  driver for why we need dma fence stuff, and how we can use it if
  needed.
 
 
 Still I don't see the point why you need syncpoints above dma-buf. In
 both the DRM and the V4L2 world we have defined points in the API where
 a buffer is allowed to change domain from device to CPU and vice versa.
 
 In DRM if you want to access a buffer with the CPU you do a cpu_prepare.
 The buffer changes back to GPU domain once you do the execbuf
 validation, queue a pageflip to the buffer or similar things.
 
 In V4L2 the syncpoints for cache operations are the queue/dequeue API
 entry points. Those are also the exact points to synchronize with other
 hardware thus using dma-buf reserve/fence.


If so, what if we want to access a buffer with the CPU _in V4L2_? We should 
open a drm device node, and then do a cpu_prepare? 

Thanks,
Inki Dae

 
 In all this I can't see any need for a new syncpoint primitive slapped
 on top of dma-buf.
 
 Regards,
 Lucas
 --
 Pengutronix e.K.   | Lucas Stach |
 Industrial Linux Solutions | http://www.pengutronix.de/  |
 Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
 Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-19 Thread Lucas Stach
Am Mittwoch, den 19.06.2013, 19:44 +0900 schrieb Inki Dae:
 
  -Original Message-
  From: Lucas Stach [mailto:l.st...@pengutronix.de]
  Sent: Wednesday, June 19, 2013 7:22 PM
  To: Inki Dae
  Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
  mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
  ker...@lists.infradead.org; linux-media@vger.kernel.org
  Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
  framework
  
  Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
  
-Original Message-
From: Lucas Stach [mailto:l.st...@pengutronix.de]
Sent: Tuesday, June 18, 2013 6:47 PM
To: Inki Dae
Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
ker...@lists.infradead.org; linux-media@vger.kernel.org
Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
  synchronization
framework
   
Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
[...]

  a display device driver.  It shouldn't be used within a single
  driver
  as a means of passing buffers between userspace and kernel space.

 What I try to do is not really such an ugly thing. What I try to do is
 to notify the kernel side, through the dmabuf interface, when the CPU
 tries to access a buffer. So it's not really about sending the buffer
 to the kernel.

 Thanks,
 Inki Dae

The most basic question about why you are trying to implement this
  sort
of thing in the dma_buf framework still stands.
   
Once you imported a dma_buf into your DRM driver it's a GEM object and
you can and should use the native DRM ioctls to prepare/end a CPU
  access
to this BO. Then internally to your driver you can use the dma_buf
reservation/fence stuff to provide the necessary cross-device sync.
   
  
   I don't really want this to be used only for DRM drivers. We really need
   it for all other DMA devices, i.e. v4l2 based drivers. That is what I
   try to do. And my approach uses reservation to use dma-buf resources
   but no dma fence stuff anymore. However, I'm looking into the Radeon DRM
   driver to see why we need the dma fence stuff, and how we can use it if
   needed.
  
  
  Still I don't see the point why you need syncpoints above dma-buf. In
  both the DRM and the V4L2 world we have defined points in the API where
  a buffer is allowed to change domain from device to CPU and vice versa.
  
  In DRM if you want to access a buffer with the CPU you do a cpu_prepare.
  The buffer changes back to GPU domain once you do the execbuf
  validation, queue a pageflip to the buffer or similar things.
  
  In V4L2 the syncpoints for cache operations are the queue/dequeue API
  entry points. Those are also the exact points to synchronize with other
  hardware thus using dma-buf reserve/fence.
 
 
 If so, what if we want to access a buffer with the CPU _in V4L2_? We
 should open a drm device node, and then do a cpu_prepare? 
 
Not at all. As I said, the syncpoints are the queue/dequeue operations.
When dequeuing a buffer you explicitly drag the buffer back from the
device domain into userspace, and thus into the CPU domain.

If you are operating on an mmap of a V4L2-processed buffer it's either
before or after it got processed by the hardware, and therefore all DMA
operations on the buffer are bracketed by the V4L2 qbuf/dqbuf ioctls.
That is where cache operations and synchronization should happen. The
V4L2 driver shouldn't allow you to dequeue a buffer, and thus drag it
back into the CPU domain, while there is still DMA ongoing. Equally, the
queue ioctl should make sure caches are properly written back to memory.
The results of reading from or writing to the mmap of a V4L2 buffer
while it is enqueued to the hardware are simply undefined, and there is
nothing suggesting that this is a valid usecase.
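
To illustrate that bracketing, here is a minimal user-space sketch (not
from the original mail): the buffer setup and streaming state are assumed
to exist already, the checksum loop just stands in for any CPU access, and
error handling is omitted.

#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* CPU access to a V4L2 capture buffer is bracketed by DQBUF/QBUF; those
 * ioctls are the sync points where the driver is expected to deal with
 * cache maintenance and DMA completion. */
static unsigned long touch_one_buffer(int fd, void *mapped[], size_t length[])
{
	struct v4l2_buffer buf;
	unsigned long sum = 0;

	memset(&buf, 0, sizeof(buf));
	buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	buf.memory = V4L2_MEMORY_MMAP;

	/* Dequeue: the buffer leaves the device domain, the CPU may touch it. */
	ioctl(fd, VIDIOC_DQBUF, &buf);

	/* Any CPU read/write of the mmap()ed buffer goes here. */
	for (size_t i = 0; i < length[buf.index]; i++)
		sum += ((unsigned char *)mapped[buf.index])[i];

	/* Queue: the buffer goes back to the device domain; no further CPU
	 * access until it is dequeued again. */
	ioctl(fd, VIDIOC_QBUF, &buf);

	return sum;
}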

Regards,
Lucas

-- 
Pengutronix e.K.   | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-19 Thread Russell King - ARM Linux
On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
 On the other hand, the below shows how we could enhance the conventional
 way with my approach (just example):
 
 CPU -> DMA,
 ioctl(qbuf command)  ioctl(streamon)
   |   |
   |   |
 qbuf -> dma_buf_sync_get   start streaming -> syncpoint
 
 dma_buf_sync_get just registers a sync buffer (dmabuf) with a sync
 object, and the syncpoint is performed by calling dma_buf_sync_lock();
 after that, the DMA accesses the sync buffer.
 
 And DMA -> CPU,
 ioctl(dqbuf command)
   |
   |
 dqbuf -> nothing to do
 
 The actual syncpoint is when the DMA operation is completed (in the
 interrupt handler): the syncpoint is performed by calling
 dma_buf_sync_unlock(). Hence, my approach is to move the syncpoints as
 close to the actual DMA access as possible.
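 
 A hypothetical driver-side sketch of this flow, using the dmabuf_sync_*
 calls from the RFC example code (the my_dev structure, the my_hw_start()
 helper and the IRQ handler name are invented for illustration only):
 
 #include <linux/dma-buf.h>
 #include <linux/interrupt.h>
 
 struct my_dev {
 	struct dmabuf_sync *sync;	/* per-context sync object (proposed API) */
 	/* ... hardware state ... */
 };
 
 static int my_qbuf(struct my_dev *dev, struct dma_buf *dmabuf)
 {
 	/* qbuf: only register the buffer with the sync object. */
 	return dmabuf_sync_get(dev->sync, dmabuf,
 			       DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA);
 }
 
 static int my_streamon(struct my_dev *dev)
 {
 	/* Syncpoint just before DMA starts: take the lock, then kick HW. */
 	int ret = dmabuf_sync_lock(dev->sync);
 
 	if (ret)
 		return ret;
 	my_hw_start(dev);		/* invented helper: program and start DMA */
 	return 0;
 }
 
 static irqreturn_t my_dma_done_irq(int irq, void *data)
 {
 	struct my_dev *dev = data;
 
 	/* DMA completed: the second syncpoint, release the lock so the CPU
 	 * (or another device) may access the buffers again. */
 	dmabuf_sync_unlock(dev->sync);
 	return IRQ_HANDLED;
 }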

What you've just described does *not* work on architectures such as
ARMv7 which do speculative cache fetches from memory at any time that
that memory is mapped with a cacheable status, and will lead to data
corruption.


Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-18 Thread Daniel Vetter
On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote:
 On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote:
  2013/6/17 Russell King - ARM Linux li...@arm.linux.org.uk
  Exactly right. But that is not definitely my point. Could you please see
  the below simple example?:
  (Presume that CPU and DMA share a buffer and the buffer is mapped with user
  space as cachable)
 
  handle1 = drm_gem_fd_to_handle(a dmabuf fd);   1
   ...
  va1 = drm_gem_mmap(handle1);
  va2 = drm_gem_mmap(handle2);
  va3 = malloc(size);
   ...
 
  while (conditions) {
   memcpy(va1, some data, size);

 No!

 Well, the first thing to say here is that under the requirements of the
 DMA API, the above is immediately invalid, because you're writing to a
 buffer which under the terms of the DMA API is currently owned by the
 DMA agent, *not* by the CPU.  You're supposed to call dma_sync_sg_for_cpu()
 before you do that - but how is userspace supposed to know that requirement?
 Why should userspace even _have_ to know these requirements of the DMA
 API?

 It's also entirely possible that drm_gem_fd_to_handle() (which indirectly
 causes dma_map_sg() on the buffers scatterlist) followed by mmap'ing it
 into userspace is a bug too, as it has the potential to touch caches or
 stuff in ways that maybe the DMA or IOMMU may not expect - but I'm not
 going to make too big a deal about that, because I don't think we have
 anything that picky.

 However, the first point above is the most important one, and exposing
 the quirks of the DMA API to userland is certainly not a nice thing to be
 doing.  This needs to be fixed - we can't go and enforce an API which is
 deeply embedded within the kernel all the way out to userland.

 What we need is something along the lines of:
 (a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
 or
 (b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.

I strongly vote for (b) (plus adding a dma_buf_map_sync interface to allow
syncing to other devices/cpu without dropping the dma mappings). At least
that's been the idea behind things, but currently all (x86-based) drm
drivers cut corners here massively.

Aside: The entire reason behind hiding the dma map step in the dma-buf
attachment is to allow drivers to expose crazy iommus (e.g. the tiler unit
on omap) to other devices. So imo (a) isn't the right choice.

 and for the scatterlist to be mapped for DMA at the point where the DMA
 operation is initiated, and unmapped at the point where the DMA operation
 is complete.

 So no, the problem is not that we need more APIs and code - we need the
 existing kernel API fixed so that we don't go exposing userspace to the
 requirements of the DMA API.  Unless we do that, we're going to end
 up with a huge world of pain, where kernel architecture people need to
 audit every damned DRM userspace implementation that happens to be run
 on their platform, and that's not something arch people really can
 afford to do.

 Basically, I think the dma_buf stuff needs to be rewritten with the
 requirements of the DMA API in the forefront of whosever mind is doing
 the rewriting.

 Note: the existing stuff does have the nice side effect of being able
 to pass buffers which do not have a struct page * associated with them
 through the dma_buf API - I think we can still preserve that by having
 dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
 but in any case we need to fix the existing API so that:

 dma_buf_map_attachment() becomes dma_buf_get_sg()
 dma_buf_unmap_attachment() becomes dma_buf_put_sg()

 both getting rid of the DMA direction argument, and then we have four
 new dma_buf calls:

 dma_buf_map_sg()
 dma_buf_unmap_sg()
 dma_buf_sync_sg_for_cpu()
 dma_buf_sync_sg_for_device()

 which do the actual sg map/unmap via the DMA API *at the appropriate
 time for DMA*.

Hm, my idea was to just add a dma_buf_sync_attachment for the device side
syncing, since the cpu access stuff is already bracketed with the
begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap,
but imo mmap support for dma_buf is a bit insane anyway, so I don't care
too much about it.
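
For reference, a rough sketch (not from the mail) of the existing
CPU-access bracket being referred to; the begin/end_cpu_access signatures
shown are the ones from roughly this kernel era and may differ elsewhere,
and kmapping only the first page is a simplification:

#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/dma-buf.h>
#include <linux/dma-direction.h>

static int cpu_fill_first_page(struct dma_buf *dmabuf, size_t len, u8 pattern)
{
	void *vaddr;
	int ret;

	/* Tell the exporter the CPU is about to touch the buffer; this is
	 * where the exporter can pin memory and do cache maintenance. */
	ret = dma_buf_begin_cpu_access(dmabuf, 0, len, DMA_BIDIRECTIONAL);
	if (ret)
		return ret;

	vaddr = dma_buf_kmap(dmabuf, 0);	/* first page only, for brevity */
	if (vaddr) {
		memset(vaddr, pattern, min_t(size_t, len, PAGE_SIZE));
		dma_buf_kunmap(dmabuf, 0, vaddr);
	}

	/* CPU access finished; hand the buffer back towards the device. */
	dma_buf_end_cpu_access(dmabuf, 0, len, DMA_BIDIRECTIONAL);
	return 0;
}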

Since such dma mappings would be really longstanding in most cases anyway
drivers could just map with BIDIRECTIONAL and do all the real flushing
with the new sync stuff.

Another piece of fun will be integration with the fencing stuff Maarten is
doing. I'm not sure whether we should integrate that into the dma-buf
interface (for simple users who don't want to deal with the full
complexity at least).


 So, the summary of this is - at the moment, I regard DRM Prime and dmabuf
 to be utterly broken in design for architectures such as ARM where the
 requirements of the DMA API have to be followed if you're going to have
 a happy life.

Agreed. Unfortunately there are 

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-18 Thread Russell King - ARM Linux
On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote:
 So I'd like to ask for other DRM maintainers. How do you think about it? it
 seems like that Intel DRM (maintained by Daniel), OMAP DRM (maintained by
 Rob) and GEM CMA helper also have same issue Russell pointed out. I think
 not only the above approach but also the performance is very important.

CMA uses coherent memory to back its buffers, though that might not be
true of memory obtained from other drivers via dma_buf.  Plus, there is
no support in the CMA helper for exporting or importing these buffers.

I guess Intel i915 is only used on x86, which is a coherent platform and
requires no cache maintenance for DMA.

OMAP DRM does not support importing non-DRM buffers back into DRM.
Moreover, it will suffer from the problems I described if any attempt is
made to write to the buffer after it has been re-imported.

Lastly, I should point out that the dma_buf stuff is really only useful
when you need to export a dma buffer from one driver and import it into
another driver - for example to pass data from a camera device driver to
a display device driver.  It shouldn't be used within a single driver
as a means of passing buffers between userspace and kernel space.


RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-18 Thread Inki Dae


 -Original Message-
 From: Russell King - ARM Linux [mailto:li...@arm.linux.org.uk]
 Sent: Tuesday, June 18, 2013 5:43 PM
 To: Inki Dae
 Cc: 'Maarten Lankhorst'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing
 list'; 'Rob Clark'; 'myungjoo.ham'; 'YoungJun Cho'; 'Daniel Vetter';
 linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework
 
 On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote:
  So I'd like to ask for other DRM maintainers. How do you think about it?
 it
  seems like that Intel DRM (maintained by Daniel), OMAP DRM (maintained
 by
  Rob) and GEM CMA helper also have same issue Russell pointed out. I
 think
  not only the above approach but also the performance is very important.
 
 CMA uses coherent memory to back their buffers, though that might not be
 true of memory obtained from other drivers via dma_buf.  Plus, there is
 no support in the CMA helper for exporting or importing these buffers.
 

That's no longer the case. Please see Dave's drm-next: dmabuf support for
the CMA helper has recently been merged there.

 I guess Intel i915 is only used on x86, which is a coherent platform and
 requires no cache maintenance for DMA.
 
 OMAP DRM does not support importing non-DRM buffers back into

Correct. That is still a TODO.

 DRM.  Moreover, it will suffer from the problems I described if any
 attempt is made to write to the buffer after it has been re-imported.
 
 Lastly, I should point out that the dma_buf stuff is really only useful
 when you need to export a dma buffer from one driver and import it into
 another driver - for example to pass data from a camera device driver to

Most people know that.

 a display device driver.  It shouldn't be used within a single driver
 as a means of passing buffers between userspace and kernel space.

What I try to do is not really such an ugly thing. What I try to do is to
notify the kernel side, through the dmabuf interface, when the CPU tries
to access a buffer. So it's not really about sending the buffer to the
kernel.

Thanks,
Inki Dae



Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-18 Thread Russell King - ARM Linux
On Tue, Jun 18, 2013 at 06:04:44PM +0900, Inki Dae wrote:
 
 
  -Original Message-
  From: Russell King - ARM Linux [mailto:li...@arm.linux.org.uk]
  Sent: Tuesday, June 18, 2013 5:43 PM
  To: Inki Dae
  Cc: 'Maarten Lankhorst'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing
  list'; 'Rob Clark'; 'myungjoo.ham'; 'YoungJun Cho'; 'Daniel Vetter';
  linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
  Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
  framework
  
  On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote:
   So I'd like to ask for other DRM maintainers. How do you think about it?
  it
   seems like that Intel DRM (maintained by Daniel), OMAP DRM (maintained
  by
   Rob) and GEM CMA helper also have same issue Russell pointed out. I
  think
   not only the above approach but also the performance is very important.
  
  CMA uses coherent memory to back their buffers, though that might not be
  true of memory obtained from other drivers via dma_buf.  Plus, there is
  no support in the CMA helper for exporting or importing these buffers.
  
 
 That's no longer the case. Please see Dave's drm-next: dmabuf support for
 the CMA helper has recently been merged there.

The point stands: CMA is DMA coherent memory.  It doesn't need and must
never be dma-map-sg'd or dma-sync'd or dma-unmap'd.


Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-18 Thread Lucas Stach
Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
[...]
 
  a display device driver.  It shouldn't be used within a single driver
  as a means of passing buffers between userspace and kernel space.
 
 What I try to do is not really such an ugly thing. What I try to do is to
 notify the kernel side, through the dmabuf interface, when the CPU tries
 to access a buffer. So it's not really about sending the buffer to the
 kernel.
 
 Thanks,
 Inki Dae
 
The most basic question about why you are trying to implement this sort
of thing in the dma_buf framework still stands.

Once you imported a dma_buf into your DRM driver it's a GEM object and
you can and should use the native DRM ioctls to prepare/end a CPU access
to this BO. Then internally to your driver you can use the dma_buf
reservation/fence stuff to provide the necessary cross-device sync.

Regards,
Lucas
-- 
Pengutronix e.K.   | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-18 Thread Russell King - ARM Linux
On Tue, Jun 18, 2013 at 09:00:16AM +0200, Daniel Vetter wrote:
 On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote:
  What we need is something along the lines of:
  (a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
  or
  (b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.
 
 I strongly vote for (b) (plus adding a dma_buf_map_sync interface to allow
 syncing to other devices/cpu without dropping the dma mappings). At least
 that's been the idea behind things, but currently all (x86-based) drm
 drivers cut corners here massively.
 
 Aside: The entire reason behind hiding the dma map step in the dma-buf
 attachment is to allow drivers to expose crazy iommus (e.g. the tiler unit
 on omap) to other devices. So imo (a) isn't the right choice.

That's why I also talk below about adding the dma_buf_map/sync callbacks,
so that a dma_buf implementation can do whatever is necessary with the sg
at the point of use.

However, you are correct that this should be unnecessary, as DRM should
only be calling dma_buf_map_attachment() when it needs to know about the
memory behind the object.  The problem is that people will cache that
information - it also may be expensive to setup for the dma_buf - as it
involves counting the number of pages, memory allocation, and maybe
obtaining the set of pages from shmem.

With (a) plus the callbacks below, it means that step is only performed
once, and then the DMA API part is entirely separate - we can have our
cake and eat it in that regard.

  Note: the existing stuff does have the nice side effect of being able
  to pass buffers which do not have a struct page * associated with them
  through the dma_buf API - I think we can still preserve that by having
  dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
  but in any case we need to fix the existing API so that:
 
  dma_buf_map_attachment() becomes dma_buf_get_sg()
  dma_buf_unmap_attachment() becomes dma_buf_put_sg()
 
  both getting rid of the DMA direction argument, and then we have four
  new dma_buf calls:
 
  dma_buf_map_sg()
  dma_buf_unmap_sg()
  dma_buf_sync_sg_for_cpu()
  dma_buf_sync_sg_for_device()
 
  which do the actual sg map/unmap via the DMA API *at the appropriate
  time for DMA*.
 
 Hm, my idea was to just add a dma_buf_sync_attachment for the device side
 syncing, since the cpu access stuff is already bracketed with the
 begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap,
 but imo mmap support for dma_buf is a bit insane anyway, so I don't care
 too much about it.
 
 Since such dma mappings would be really longstanding in most cases anyway
 drivers could just map with BIDIRECTIONAL and do all the real flushing
 with the new sync stuff.

Note that the DMA API debug doesn't allow you to change the direction
argument on an existing mapping (neither should it, again this is
documented in the DMA API stuff in Documentation/).  This is where you
would need the complete set of four functions I mention above which
reflect the functionality of the DMA API.

The dma_buf implementation doesn't _have_ to implement them if they
are no-ops - for example, your dma_buf sg array contains DMA pointers,
and the memory is already coherent in some way (for example, it's a
separate chunk of memory which isn't part of system RAM.)


RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-18 Thread Inki Dae


 -Original Message-
 From: Lucas Stach [mailto:l.st...@pengutronix.de]
 Sent: Tuesday, June 18, 2013 6:47 PM
 To: Inki Dae
 Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI
 mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework
 
 Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
 [...]
 
   a display device driver.  It shouldn't be used within a single driver
   as a means of passing buffers between userspace and kernel space.
 
  What I try to do is not really such ugly thing. What I try to do is to
  notify that, when CPU tries to access a buffer , to kernel side through
  dmabuf interface. So it's not really to send the buffer to kernel.
 
  Thanks,
  Inki Dae
 
 The most basic question about why you are trying to implement this sort
 of thing in the dma_buf framework still stands.
 
 Once you imported a dma_buf into your DRM driver it's a GEM object and
 you can and should use the native DRM ioctls to prepare/end a CPU access
 to this BO. Then internally to your driver you can use the dma_buf
 reservation/fence stuff to provide the necessary cross-device sync.
 

I don't really want this to be used only for DRM drivers. We really need it
for all other DMA devices, i.e. v4l2 based drivers. That is what I try to
do. And my approach uses reservation to use dma-buf resources but no dma
fence stuff anymore. However, I'm looking into the Radeon DRM driver to see
why we need the dma fence stuff, and how we can use it if needed.

Thanks,
Inki Dae

 Regards,
 Lucas
 --
 Pengutronix e.K.   | Lucas Stach |
 Industrial Linux Solutions | http://www.pengutronix.de/  |
 Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
 Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



[RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Inki Dae
This patch adds a buffer synchronization framework based on DMA BUF[1]
and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
for lock mechanism.

The purpose of this framework is not only to couple cache operations and
buffer access control between CPU and DMA, but also to provide easy-to-use
interfaces for device drivers and, potentially, user applications
(not implemented for user space yet). This framework can be used by all
DMA devices that use system memory as a DMA buffer, especially most
ARM based SoCs.

Changelog v2:
- use atomic_add_unless to avoid potential bug.
- add a macro for checking valid access type.
- code clean.

The mechanism of this framework has the following steps,
1. Register dmabufs to a sync object - A task gets a new sync object and
can add one or more dmabufs that the task wants to access.
This registering should be performed when a device context or an event
context such as a page flip event is created or before CPU accesses a shared
buffer.

dma_buf_sync_get(a sync object, a dmabuf);

2. Lock a sync object - A task tries to lock all dmabufs added in its own
sync object. Basically, the lock mechanism uses ww-mutexes[3] to avoid
deadlock issues and to handle races between CPU and CPU, CPU and DMA, and
DMA and DMA. Taking a lock means that others cannot access any of the
locked dmabufs until the task that locked them unlocks them all.
This locking should be performed before DMA or CPU accesses these dmabufs.

dma_buf_sync_lock(a sync object);

3. Unlock a sync object - The task unlocks all dmabufs added in its own sync
object. The unlock means that the DMA or CPU accesses to the dmabufs have
been completed so that others may access them.
This unlocking should be performed after DMA or CPU has completed accesses
to the dmabufs.

dma_buf_sync_unlock(a sync object);

4. Unregister one or all dmabufs from a sync object - A task unregisters
the given dmabufs from the sync object. This means that the task doesn't
want to lock the dmabufs any more.
The unregistering should be performed after DMA or CPU has completed
accesses to the dmabufs or when dma_buf_sync_lock() fails.

dma_buf_sync_put(a sync object, a dmabuf);
dma_buf_sync_put_all(a sync object);

The described steps may be summarized as:
get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put

This framework includes the following two features.
1. read (shared) and write (exclusive) locks - A task is required to declare
the access type when the task tries to register a dmabuf;
READ, WRITE, READ DMA, or WRITE DMA.

Below is example code:
struct dmabuf_sync *sync;

sync = dmabuf_sync_init(NULL, "test sync");

dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
...

And the below can be used as access types:
DMA_BUF_ACCESS_READ,
- CPU will access a buffer for read.
DMA_BUF_ACCESS_WRITE,
- CPU will access a buffer for read or write.
DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
- DMA will access a buffer for read.
DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
- DMA will access a buffer for read or write.

2. Mandatory resource releasing - a task cannot hold a lock indefinitely.
A task might never try to unlock a buffer after taking a lock on it.
In this case, a timer handler for the corresponding sync object fires after
five seconds (the default), and the timed-out buffer is then unlocked by a
workqueue handler to avoid lockups and to reclaim the buffer's resources.

Below is how to use it (a consolidated sketch follows these steps):
1. Allocate and initialize a sync object:
struct dmabuf_sync *sync;

sync = dmabuf_sync_init(NULL, "test sync");
...

2. Add a dmabuf to the sync object when setting up dma buffer relevant
   registers:
dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
...

3. Lock all dmabufs of the sync object before DMA or CPU accesses
   the dmabufs:
dmabuf_sync_lock(sync);
...

4. Now CPU or DMA can access all dmabufs locked in step 3.

5. Unlock all dmabufs added in a sync object after DMA or CPU access
   to these dmabufs is completed:
dmabuf_sync_unlock(sync);

   And call the following functions to release all resources:
dmabuf_sync_put_all(sync);
dmabuf_sync_fini(sync);
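
The consolidated sketch mentioned above (proposed API only; the device
pointer, the my_hw_run() helper, the sync object name and the assumed
error convention of dmabuf_sync_init() are invented for illustration):

#include <linux/dma-buf.h>
#include <linux/err.h>

static int my_dev_process(struct device *dev, struct dma_buf *src,
			  struct dma_buf *dst)
{
	struct dmabuf_sync *sync;
	int ret;

	sync = dmabuf_sync_init(NULL, "my dev sync");		/* step 1 */
	if (IS_ERR_OR_NULL(sync))		/* assumed error convention */
		return -ENOMEM;

	ret = dmabuf_sync_get(sync, src,			/* step 2 */
			      DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA);
	if (ret)
		goto err_fini;
	ret = dmabuf_sync_get(sync, dst,
			      DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA);
	if (ret)
		goto err_put;

	ret = dmabuf_sync_lock(sync);				/* step 3 */
	if (ret)
		goto err_put;

	my_hw_run(dev, src, dst);	/* step 4: DMA access (invented helper) */

	dmabuf_sync_unlock(sync);				/* step 5 */

err_put:
	dmabuf_sync_put_all(sync);
err_fini:
	dmabuf_sync_fini(sync);
	return ret;
}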

You can refer to actual example codes:

https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?h=dmabuf-sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94



Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Maarten Lankhorst
Op 17-06-13 13:15, Inki Dae schreef:
 This patch adds a buffer synchronization framework based on DMA BUF[1]
 and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
 for lock mechanism.

 The purpose of this framework is not only to couple cache operations,
 and buffer access control to CPU and DMA but also to provide easy-to-use
 interfaces for device drivers and potentially user application
 (not implemented for user applications, yet). And this framework can be
 used for all dma devices using system memory as dma buffer, especially
 for most ARM based SoCs.

 Changelog v2:
 - use atomic_add_unless to avoid potential bug.
 - add a macro for checking valid access type.
 - code clean.

 The mechanism of this framework has the following steps,
 1. Register dmabufs to a sync object - A task gets a new sync object and
 can add one or more dmabufs that the task wants to access.
 This registering should be performed when a device context or an event
 context such as a page flip event is created or before CPU accesses a 
 shared
 buffer.

   dma_buf_sync_get(a sync object, a dmabuf);

 2. Lock a sync object - A task tries to lock all dmabufs added in its own
 sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid dead
 lock issue and for race condition between CPU and CPU, CPU and DMA, and 
 DMA
 and DMA. Taking a lock means that others cannot access all locked dmabufs
 until the task that locked the corresponding dmabufs, unlocks all the 
 locked
 dmabufs.
 This locking should be performed before DMA or CPU accesses these dmabufs.

   dma_buf_sync_lock(a sync object);

 3. Unlock a sync object - The task unlocks all dmabufs added in its own 
 sync
 object. The unlock means that the DMA or CPU accesses to the dmabufs have
 been completed so that others may access them.
 This unlocking should be performed after DMA or CPU has completed accesses
 to the dmabufs.

   dma_buf_sync_unlock(a sync object);

 4. Unregister one or all dmabufs from a sync object - A task unregisters
 the given dmabufs from the sync object. This means that the task dosen't
 want to lock the dmabufs.
 The unregistering should be performed after DMA or CPU has completed
 accesses to the dmabufs or when dma_buf_sync_lock() is failed.

   dma_buf_sync_put(a sync object, a dmabuf);
   dma_buf_sync_put_all(a sync object);

 The described steps may be summarized as:
   get - lock - CPU or DMA access to a buffer/s - unlock - put

 This framework includes the following two features.
 1. read (shared) and write (exclusive) locks - A task is required to 
 declare
 the access type when the task tries to register a dmabuf;
 READ, WRITE, READ DMA, or WRITE DMA.

 The below is example codes,
   struct dmabuf_sync *sync;

   sync = dmabuf_sync_init(NULL, test sync);

   dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
   ...

   And the below can be used as access types:
   DMA_BUF_ACCESS_READ,
   - CPU will access a buffer for read.
   DMA_BUF_ACCESS_WRITE,
   - CPU will access a buffer for read or write.
   DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
   - DMA will access a buffer for read
   DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
   - DMA will access a buffer for read or write.

 2. Mandatory resource releasing - a task cannot hold a lock indefinitely.
 A task may never try to unlock a buffer after taking a lock to the buffer.
 In this case, a timer handler to the corresponding sync object is called
 in five (default) seconds and then the timed-out buffer is unlocked by 
 work
 queue handler to avoid lockups and to enforce resources of the buffer.

 The below is how to use:
   1. Allocate and Initialize a sync object:
   struct dmabuf_sync *sync;

   sync = dmabuf_sync_init(NULL, test sync);
   ...

   2. Add a dmabuf to the sync object when setting up dma buffer relevant
  registers:
   dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
   ...

   3. Lock all dmabufs of the sync object before DMA or CPU accesses
  the dmabufs:
   dmabuf_sync_lock(sync);
   ...

   4. Now CPU or DMA can access all dmabufs locked in step 3.

   5. Unlock all dmabufs added in a sync object after DMA or CPU access
  to these dmabufs is completed:
   dmabuf_sync_unlock(sync);

  And call the following functions to release all resources,
   dmabuf_sync_put_all(sync);
   dmabuf_sync_fini(sync);

   You can refer to actual example codes:
   
 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/
   
 commit/?h=dmabuf-syncid=4030bdee9bab5841ad32faade528d04cc0c5fc94

  

RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Inki Dae


 -Original Message-
 From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
 Sent: Monday, June 17, 2013 8:35 PM
 To: Inki Dae
 Cc: dri-de...@lists.freedesktop.org; linux-fb...@vger.kernel.org; linux-
 arm-ker...@lists.infradead.org; linux-media@vger.kernel.org;
 dan...@ffwll.ch; robdcl...@gmail.com; kyungmin.p...@samsung.com;
 myungjoo@samsung.com; yj44@samsung.com
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework
 
 Op 17-06-13 13:15, Inki Dae schreef:
  This patch adds a buffer synchronization framework based on DMA BUF[1]
  and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
  for lock mechanism.
 
  The purpose of this framework is not only to couple cache operations,
  and buffer access control to CPU and DMA but also to provide easy-to-use
  interfaces for device drivers and potentially user application
  (not implemented for user applications, yet). And this framework can be
  used for all dma devices using system memory as dma buffer, especially
  for most ARM based SoCs.
 
  Changelog v2:
  - use atomic_add_unless to avoid potential bug.
  - add a macro for checking valid access type.
  - code clean.
 
  The mechanism of this framework has the following steps,
  1. Register dmabufs to a sync object - A task gets a new sync object
 and
  can add one or more dmabufs that the task wants to access.
  This registering should be performed when a device context or an
 event
  context such as a page flip event is created or before CPU accesses
a
 shared
  buffer.
 
  dma_buf_sync_get(a sync object, a dmabuf);
 
  2. Lock a sync object - A task tries to lock all dmabufs added in
its
 own
  sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid
 dead
  lock issue and for race condition between CPU and CPU, CPU and DMA,
 and DMA
  and DMA. Taking a lock means that others cannot access all locked
 dmabufs
  until the task that locked the corresponding dmabufs, unlocks all
the
 locked
  dmabufs.
  This locking should be performed before DMA or CPU accesses these
 dmabufs.
 
  dma_buf_sync_lock(a sync object);
 
  3. Unlock a sync object - The task unlocks all dmabufs added in its
 own sync
  object. The unlock means that the DMA or CPU accesses to the dmabufs
 have
  been completed so that others may access them.
  This unlocking should be performed after DMA or CPU has completed
 accesses
  to the dmabufs.
 
  dma_buf_sync_unlock(a sync object);
 
  4. Unregister one or all dmabufs from a sync object - A task
 unregisters
  the given dmabufs from the sync object. This means that the task
 dosen't
  want to lock the dmabufs.
  The unregistering should be performed after DMA or CPU has completed
  accesses to the dmabufs or when dma_buf_sync_lock() is failed.
 
  dma_buf_sync_put(a sync object, a dmabuf);
  dma_buf_sync_put_all(a sync object);
 
  The described steps may be summarized as:
  get - lock - CPU or DMA access to a buffer/s - unlock - put
 
  This framework includes the following two features.
  1. read (shared) and write (exclusive) locks - A task is required to
 declare
  the access type when the task tries to register a dmabuf;
  READ, WRITE, READ DMA, or WRITE DMA.
 
  The below is example codes,
  struct dmabuf_sync *sync;
 
  sync = dmabuf_sync_init(NULL, test sync);
 
  dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
  ...
 
  And the below can be used as access types:
  DMA_BUF_ACCESS_READ,
  - CPU will access a buffer for read.
  DMA_BUF_ACCESS_WRITE,
  - CPU will access a buffer for read or write.
  DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
  - DMA will access a buffer for read
  DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
  - DMA will access a buffer for read or write.
 
  2. Mandatory resource releasing - a task cannot hold a lock
 indefinitely.
  A task may never try to unlock a buffer after taking a lock to the
 buffer.
  In this case, a timer handler to the corresponding sync object is
 called
  in five (default) seconds and then the timed-out buffer is unlocked
 by work
  queue handler to avoid lockups and to enforce resources of the
buffer.
 
  The below is how to use:
  1. Allocate and Initialize a sync object:
  struct dmabuf_sync *sync;
 
  sync = dmabuf_sync_init(NULL, test sync);
  ...
 
  2. Add a dmabuf to the sync object when setting up dma buffer
 relevant
 registers:
  dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
  ...
 
  3. Lock all dmabufs of the sync object before DMA or CPU accesses
 the dmabufs:
  dmabuf_sync_lock(sync);
  ...
 
  4. Now CPU or DMA can access all dmabufs locked

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Maarten Lankhorst
Op 17-06-13 15:04, Inki Dae schreef:

 -Original Message-
 From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
 Sent: Monday, June 17, 2013 8:35 PM
 To: Inki Dae
 Cc: dri-de...@lists.freedesktop.org; linux-fb...@vger.kernel.org; linux-
 arm-ker...@lists.infradead.org; linux-media@vger.kernel.org;
 dan...@ffwll.ch; robdcl...@gmail.com; kyungmin.p...@samsung.com;
 myungjoo@samsung.com; yj44@samsung.com
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework

 Op 17-06-13 13:15, Inki Dae schreef:
 This patch adds a buffer synchronization framework based on DMA BUF[1]
 and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
 for lock mechanism.

 The purpose of this framework is not only to couple cache operations,
 and buffer access control to CPU and DMA but also to provide easy-to-use
 interfaces for device drivers and potentially user application
 (not implemented for user applications, yet). And this framework can be
 used for all dma devices using system memory as dma buffer, especially
 for most ARM based SoCs.

 Changelog v2:
 - use atomic_add_unless to avoid potential bug.
 - add a macro for checking valid access type.
 - code clean.

 The mechanism of this framework has the following steps,
 1. Register dmabufs to a sync object - A task gets a new sync object
 and
 can add one or more dmabufs that the task wants to access.
 This registering should be performed when a device context or an
 event
 context such as a page flip event is created or before CPU accesses
 a
 shared
 buffer.

 dma_buf_sync_get(a sync object, a dmabuf);

 2. Lock a sync object - A task tries to lock all dmabufs added in
 its
 own
 sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid
 dead
 lock issue and for race condition between CPU and CPU, CPU and DMA,
 and DMA
 and DMA. Taking a lock means that others cannot access all locked
 dmabufs
 until the task that locked the corresponding dmabufs, unlocks all
 the
 locked
 dmabufs.
 This locking should be performed before DMA or CPU accesses these
 dmabufs.
 dma_buf_sync_lock(a sync object);

 3. Unlock a sync object - The task unlocks all dmabufs added in its
 own sync
 object. The unlock means that the DMA or CPU accesses to the dmabufs
 have
 been completed so that others may access them.
 This unlocking should be performed after DMA or CPU has completed
 accesses
 to the dmabufs.

 dma_buf_sync_unlock(a sync object);

 4. Unregister one or all dmabufs from a sync object - A task
 unregisters
 the given dmabufs from the sync object. This means that the task
 dosen't
 want to lock the dmabufs.
 The unregistering should be performed after DMA or CPU has completed
 accesses to the dmabufs or when dma_buf_sync_lock() is failed.

 dma_buf_sync_put(a sync object, a dmabuf);
 dma_buf_sync_put_all(a sync object);

 The described steps may be summarized as:
 get - lock - CPU or DMA access to a buffer/s - unlock - put

 This framework includes the following two features.
 1. read (shared) and write (exclusive) locks - A task is required to
 declare
 the access type when the task tries to register a dmabuf;
 READ, WRITE, READ DMA, or WRITE DMA.

 The below is example codes,
 struct dmabuf_sync *sync;

 sync = dmabuf_sync_init(NULL, test sync);

 dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
 ...

 And the below can be used as access types:
 DMA_BUF_ACCESS_READ,
 - CPU will access a buffer for read.
 DMA_BUF_ACCESS_WRITE,
 - CPU will access a buffer for read or write.
 DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
 - DMA will access a buffer for read
 DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
 - DMA will access a buffer for read or write.

 2. Mandatory resource releasing - a task cannot hold a lock
 indefinitely.
 A task may never try to unlock a buffer after taking a lock to the
 buffer.
 In this case, a timer handler to the corresponding sync object is
 called
 in five (default) seconds and then the timed-out buffer is unlocked
 by work
 queue handler to avoid lockups and to enforce resources of the
 buffer.
 The below is how to use:
 1. Allocate and Initialize a sync object:
 struct dmabuf_sync *sync;

 sync = dmabuf_sync_init(NULL, test sync);
 ...

 2. Add a dmabuf to the sync object when setting up dma buffer
 relevant
registers:
 dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
 ...

 3. Lock all dmabufs of the sync object before DMA or CPU accesses
the dmabufs:
 dmabuf_sync_lock(sync);
 ...

 4. Now CPU or DMA can access all dmabufs locked in step 3.

 5. Unlock all dmabufs added in a sync object after DMA

Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Russell King - ARM Linux
On Mon, Jun 17, 2013 at 10:04:45PM +0900, Inki Dae wrote:
 It's just to implement a thin sync framework coupling cache operation. This
 approach is based on dma-buf for more generic implementation against android
 sync driver or KDS.
 
 The described steps may be summarized as:
   lock -> cache operation -> CPU or DMA access to a buffer/s -> unlock
 
 I think that there is no need to get complicated for such approach at least
 for most devices sharing system memory. Simple is best.

But hang on, doesn't the dmabuf API already provide that?

The dmabuf API already uses dma_map_sg() and dma_unmap_sg() by providers,
and the rules around the DMA API are that:

dma_map_sg()
/* DMA _ONLY_ has access, CPU should not access */
dma_unmap_sg()
/* DMA may not access, CPU can access */

It's a little more than that if you include the sync_sg_for_cpu and
sync_sg_for_device APIs too - but the above is the general idea.  What
this means from the dmabuf API point of view is that once you attach to
a dma_buf, and call dma_buf_map_attachment() to get the SG list, the CPU
doesn't have ownership of the buffer and _must_ _not_ access it via any
other means - including using the other dma_buf methods, until either
the appropriate dma_sync_sg_for_cpu() call has been made or the DMA
mapping has been removed via dma_buf_unmap_attachment().

So, the sequence should be:

dma_buf_map_attachment()
/* do DMA */
dma_buf_unmap_attachment()
/* CPU can now access the buffer */
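
A minimal sketch of that sequence using the existing dma-buf calls; the
device pointer and the run_dma_and_wait() step are placeholders, and
error handling is trimmed:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

static int do_one_dma(struct device *dev, struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	/* Maps (and, on non-coherent ARM, cleans/invalidates) the buffer:
	 * from here on the device owns it and the CPU must not touch it. */
	sgt = dma_buf_map_attachment(attach, DMA_TO_DEVICE);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		return PTR_ERR(sgt);
	}

	run_dma_and_wait(dev, sgt);	/* placeholder for the actual DMA */

	/* Ownership returns to the CPU only after the unmap. */
	dma_buf_unmap_attachment(attach, sgt, DMA_TO_DEVICE);
	dma_buf_detach(dmabuf, attach);
	return 0;
}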


Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Russell King - ARM Linux
On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote:
 2013/6/17 Russell King - ARM Linux li...@arm.linux.org.uk
 Exactly right. But that is not definitely my point. Could you please see
 the below simple example?:
 (Presume that CPU and DMA share a buffer and the buffer is mapped with user
 space as cachable)
 
 handle1 = drm_gem_fd_to_handle(a dmabuf fd);   1
  ...
 va1 = drm_gem_mmap(handle1);
 va2 = drm_gem_mmap(handle2);
 va3 = malloc(size);
  ...
 
 while (conditions) {
  memcpy(va1, some data, size);

No!

Well, the first thing to say here is that under the requirements of the
DMA API, the above is immediately invalid, because you're writing to a
buffer which under the terms of the DMA API is currently owned by the
DMA agent, *not* by the CPU.  You're supposed to call dma_sync_sg_for_cpu()
before you do that - but how is userspace supposed to know that requirement?
Why should userspace even _have_ to know these requirements of the DMA
API?

It's also entirely possible that drm_gem_fd_to_handle() (which indirectly
causes dma_map_sg() on the buffer's scatterlist) followed by mmap'ing it
into userspace is a bug too, as it has the potential to touch caches or
stuff in ways that maybe the DMA or IOMMU may not expect - but I'm not
going to make too big a deal about that, because I don't think we have
anything that picky.

However, the first point above is the most important one, and exposing
the quirks of the DMA API to userland is certainly not a nice thing to be
doing.  This needs to be fixed - we can't go and enforce an API which is
deeply embedded within the kernel all the way out to userland.

What we need is something along the lines of:
(a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
or
(b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.

and for the scatterlist to be mapped for DMA at the point where the DMA
operation is initiated, and unmapped at the point where the DMA operation
is complete.

So no, the problem is not that we need more APIs and code - we need the
existing kernel API fixed so that we don't go exposing userspace to the
requirements of the DMA API.  Unless we do that, we're going to end
up with a huge world of pain, where kernel architecture people need to
audit every damned DRM userspace implementation that happens to be run
on their platform, and that's not something arch people really can
afford to do.

Basically, I think the dma_buf stuff needs to be rewritten with the
requirements of the DMA API at the forefront of the mind of whoever is
doing the rewriting.

Note: the existing stuff does have the nice side effect of being able
to pass buffers which do not have a struct page * associated with them
through the dma_buf API - I think we can still preserve that by having
dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
but in any case we need to fix the existing API so that:

dma_buf_map_attachment() becomes dma_buf_get_sg()
dma_buf_unmap_attachment() becomes dma_buf_put_sg()

both getting rid of the DMA direction argument, and then we have four
new dma_buf calls:

dma_buf_map_sg()
dma_buf_unmap_sg()
dma_buf_sync_sg_for_cpu()
dma_buf_sync_sg_for_device()

which do the actual sg map/unmap via the DMA API *at the appropriate
time for DMA*.
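
To make the proposal concrete, hypothetical prototypes for those calls
might look something like the following; none of these exist today, and
the attachment-plus-sg_table signatures are purely an assumption:

struct sg_table *dma_buf_get_sg(struct dma_buf_attachment *attach);
void dma_buf_put_sg(struct dma_buf_attachment *attach, struct sg_table *sgt);

int dma_buf_map_sg(struct dma_buf_attachment *attach, struct sg_table *sgt,
		   enum dma_data_direction dir);
void dma_buf_unmap_sg(struct dma_buf_attachment *attach, struct sg_table *sgt,
		      enum dma_data_direction dir);
void dma_buf_sync_sg_for_cpu(struct dma_buf_attachment *attach,
			     struct sg_table *sgt,
			     enum dma_data_direction dir);
void dma_buf_sync_sg_for_device(struct dma_buf_attachment *attach,
				struct sg_table *sgt,
				enum dma_data_direction dir);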

So, the summary of this is - at the moment, I regard DRM Prime and dmabuf
to be utterly broken in design for architectures such as ARM where the
requirements of the DMA API have to be followed if you're going to have
a happy life.


Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Russell King - ARM Linux
On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote:
 On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote:
  2013/6/17 Russell King - ARM Linux li...@arm.linux.org.uk
  Exactly right. But that is not definitely my point. Could you please see
  the below simple example?:
  (Presume that CPU and DMA share a buffer and the buffer is mapped with user
  space as cachable)
  
  handle1 = drm_gem_fd_to_handle(a dmabuf fd);   1
   ...
  va1 = drm_gem_mmap(handle1);
  va2 = drm_gem_mmap(handle2);
  va3 = malloc(size);
   ...
  
  while (conditions) {
   memcpy(va1, some data, size);
 
 No!
 
 Well, the first thing to say here is that under the requirements of the
 DMA API, the above is immediately invalid, because you're writing to a
 buffer which under the terms of the DMA API is currently owned by the
 DMA agent, *not* by the CPU.  You're supposed to call dma_sync_sg_for_cpu()
 before you do that - but how is userspace supposed to know that requirement?
 Why should userspace even _have_ to know these requirements of the DMA
 API?
 
 It's also entirely possible that drm_gem_fd_to_handle() (which indirectly
 causes dma_map_sg() on the buffers scatterlist) followed by mmap'ing it
 into userspace is a bug too, as it has the potential to touch caches or
 stuff in ways that maybe the DMA or IOMMU may not expect - but I'm not
 going to make too big a deal about that, because I don't think we have
 anything that picky.
 
 However, the first point above is the most important one, and exposing
 the quirks of the DMA API to userland is certainly not a nice thing to be
 doing.  This needs to be fixed - we can't go and enforce an API which is
 deeply embedded within the kernel all the way out to userland.
 
 What we need is something along the lines of:
 (a) dma_buf_map_attachment() _not_ to map the scatterlist for DMA.
 or
 (b) drm_gem_prime_import() not to call dma_buf_map_attachment() at all.
 
 and for the scatterlist to be mapped for DMA at the point where the DMA
 operation is initiated, and unmapped at the point where the DMA operation
 is complete.
 
 So no, the problem is not that we need more APIs and code - we need the
 existing kernel API fixed so that we don't go exposing userspace to the
 requirements of the DMA API.  Unless we do that, we're going to end
 up with a huge world of pain, where kernel architecture people need to
 audit every damned DRM userspace implementation that happens to be run
 on their platform, and that's not something arch people really can
 afford to do.
 
 Basically, I think the dma_buf stuff needs to be rewritten with the
 requirements of the DMA API in the forefront of whosever mind is doing
 the rewriting.
 
 Note: the existing stuff does have the nice side effect of being able
 to pass buffers which do not have a struct page * associated with them
 through the dma_buf API - I think we can still preserve that by having
 dma_buf provide a couple of new APIs to do the SG list map/sync/unmap,
 but in any case we need to fix the existing API so that:
 
   dma_buf_map_attachment() becomes dma_buf_get_sg()
   dma_buf_unmap_attachment() becomes dma_buf_put_sg()
 
 both getting rid of the DMA direction argument, and then we have four
 new dma_buf calls:
 
   dma_buf_map_sg()
   dma_buf_unmap_sg()
   dma_buf_sync_sg_for_cpu()
   dma_buf_sync_sg_for_device()
 
 which do the actual sg map/unmap via the DMA API *at the appropriate
 time for DMA*.
 
 So, the summary of this is - at the moment, I regard DRM Prime and dmabuf
 to be utterly broken in design for architectures such as ARM where the
 requirements of the DMA API have to be followed if you're going to have
 a happy life.

An addendum to the above:

I'll also point out that the whole thing of having random buffers mapped
into userspace, and doing DMA on them is _highly_ architecture specific.
I'm told that PARISC is one architecture where this does not work (because
DMA needs to have some kind of congruence factor programmed into it, which
can be different between kernel and userspace mappings of the same
physical memory).

I ran into this when trying to come up with a cross-arch DMA-API mmap API,
and PARISC ended up killing the idea off (the remains of that attempt are
the dma_mmap_* stuff in ARM's asm/dma-mapping.h).  However, this was for
the DMA coherent stuff, not the streaming API, which is what _this_
DMA prime/dma_buf stuff is about.

What I'm saying is think _very_ _carefully_ about userspace having
interleaved access to random DMA buffers.  Arguably, DRM should _not_
allow this.


Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Russell King - ARM Linux
On Tue, Jun 18, 2013 at 02:19:04AM +0900, Inki Dae wrote:
 It seems like that all pages of the scatterlist should be mapped with
 DMA every time DMA operation  is started (or drm_xxx_set_src_dma_buffer
 function call), and the pages should be unmapped from DMA again every
 time DMA operation is completed: internally, including each cache
 operation.

Correct.

 Isn't that big overhead?

Yes, if you _have_ to do a cache operation to ensure that the DMA agent
can see the data the CPU has written.

 And If there is no problem, we should accept such overhead?

If there is no problem then why are we discussing this and why do we need
this API extension? :)

 Actually, drm_gem_fd_to_handle() includes to map a
 given dmabuf with iommu table (just logical data) of the DMA. And then, the
 device address (or iova) already mapped will be set to buffer register of
 the DMA with drm_xxx_set_src/dst_dma_buffer(handle1, ...) call.

Consider this with a PIPT cache:

dma_map_sg()- at this point, the kernel addresses of these
buffers are cleaned and invalidated for the DMA

userspace writes to the buffer, the data sits in the CPU cache
Because the cache is PIPT, this data becomes visible to the
kernel as well.

DMA is started, and it writes to the buffer

Now, at this point, which is the correct data?  The data physically in the
RAM which the DMA has written, or the data in the CPU cache?  It may be
that the answer is: they both are, and the loss of either can be a
potential data corruption issue - there is no way to tell which data
should be kept, but the system is in an inconsistent state and _one_ of
them will have to be discarded.

dma_unmap_sg()  - at this point, the kernel addresses of the
buffers are _invalidated_ and any data in those
cache lines is discarded

Which also means that the data in userspace will also be discarded with
PIPT caches.

This is precisely why we have buffer rules associated with the DMA API,
which are these:

dma_map_sg()
- the buffer transfers ownership from the CPU to the DMA agent.
- the CPU may not alter the buffer in any way.
while (cpu_wants_access) {
dma_sync_sg_for_cpu()
- the buffer transfers ownership from the DMA to the CPU.
- the DMA may not alter the buffer in any way.
dma_sync_sg_for_device()
- the buffer transfers ownership from the CPU to the DMA
- the CPU may not alter the buffer in any way.
}
dma_unmap_sg()
- the buffer transfers ownership from the DMA to the CPU.
- the DMA may not alter the buffer in any way.
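
A sketch of those rules using the streaming DMA API as it exists; the
scatterlist setup and the start_dma()/wait_for_dma()/cpu_touch_buffer()
helpers are placeholders for whatever the driver actually does:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static void shared_buffer_roundtrip(struct device *dev,
				    struct scatterlist *sg, int nents)
{
	int mapped = dma_map_sg(dev, sg, nents, DMA_BIDIRECTIONAL);

	if (!mapped)
		return;

	start_dma(dev);			/* the device owns the buffer */
	wait_for_dma(dev);

	/* Hand the buffer to the CPU before touching it... */
	dma_sync_sg_for_cpu(dev, sg, nents, DMA_BIDIRECTIONAL);
	cpu_touch_buffer();		/* placeholder CPU access */

	/* ...and back to the device before the next DMA. */
	dma_sync_sg_for_device(dev, sg, nents, DMA_BIDIRECTIONAL);
	start_dma(dev);
	wait_for_dma(dev);

	dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
}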

Any violation of that is likely to lead to data corruption.  Now, some
may say that the DMA API is only about the kernel mapping.  Yes it is,
because it takes no regard whatsoever of what happens with the
userspace mappings.  This is absolutely true when you have VIVT or
aliasing VIPT caches.

Now consider that with a PIPT cache, or a non-aliasing VIPT cache (which
is exactly the same behaviourally from this aspect) any modification of
a userspace mapping is precisely the same as modifying the kernel space
mapping - and what you find is that the above rules end up _inherently_
applying to the userspace mappings as well.

In other words, userspace applications which have mapped the buffers
must _also_ respect these rules.

And there's no way in hell that I'd ever trust userspace to get this
anywhere near right, and I *really* do not like these kinds of internal
kernel API rules being exposed to userspace.

And so the idea that userspace should be allowed to set up DMA transfers
via "set source buffer", "set destination buffer" calls into an API is
just plain ridiculous.  No, userspace should be allowed to ask for
"please copy from buffer X to buffer Y using this transform".

Also remember that drm_xxx_set_src/dst_dma_buffer() would have to
deal with the DRM object IDs for the buffer, and not the actual buffer
information themselves - that is kept within the kernel, so the kernel
knows what's happening.


RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

2013-06-17 Thread Inki Dae


 -Original Message-
 From: Russell King - ARM Linux [mailto:li...@arm.linux.org.uk]
 Sent: Tuesday, June 18, 2013 3:21 AM
 To: Inki Dae
 Cc: Maarten Lankhorst; linux-fbdev; Kyungmin Park; DRI mailing list; Rob
 Clark; myungjoo.ham; YoungJun Cho; Daniel Vetter; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization
 framework
 
 On Tue, Jun 18, 2013 at 02:19:04AM +0900, Inki Dae wrote:
  It seems that all pages of the scatterlist should be mapped for
  DMA every time a DMA operation is started (or on each
  drm_xxx_set_src_dma_buffer function call), and the pages should be
  unmapped from DMA again every time a DMA operation is completed:
  internally, including each cache operation.
 
 Correct.
 
  Isn't that a big overhead?
 
 Yes, if you _have_ to do a cache operation to ensure that the DMA agent
 can see the data the CPU has written.
 
  And if there is no problem, we should accept such overhead?
 
 If there is no problem then why are we discussing this and why do we need
 this API extension? :)

Ok, this is another issue, separate from dmabuf-sync. It sounds reasonable to
me even though the overhead is big. Besides, it seems that most DRM drivers
have the same issue. Therefore, we may need to solve this issue as below (a
rough sketch follows the list):
- do not map a dmabuf for DMA at import time; just create/update the
buffer object of the importer.
- map the buffer for DMA every time a DMA operation starts or an iommu page
fault occurs.
- unmap the buffer from DMA every time a DMA operation is completed.
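
A rough sketch of what I mean, assuming the importer keeps only the
attachment around at import time (struct xxx_import_obj and
xxx_hw_set_iova() are made-up names for illustration):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

struct xxx_import_obj {
        struct dma_buf_attachment *attach;      /* from dma_buf_attach() */
        struct sg_table *sgt;                   /* valid only while mapped */
};

static int xxx_dma_begin(struct xxx_import_obj *obj)
{
        /* map only now, when the device actually needs the buffer */
        obj->sgt = dma_buf_map_attachment(obj->attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(obj->sgt))
                return PTR_ERR(obj->sgt);

        xxx_hw_set_iova(obj);   /* placeholder: program the iova into HW */
        return 0;
}

static void xxx_dma_complete(struct xxx_import_obj *obj)
{
        /* hand the buffer back; the exporter/DMA API does any cache work */
        dma_buf_unmap_attachment(obj->attach, obj->sgt, DMA_BIDIRECTIONAL);
        obj->sgt = NULL;
}

The same xxx_dma_begin() path could also be invoked from the iommu page
fault handler.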

With the above approach, the cache operation portion of my approach,
dmabuf-sync, can be removed. However, I'm not sure that we really have to
accept the big overhead of the above approach. Of course, if we don't use the
above approach then user processes would need to do each cache operation
before a DMA operation is started and also after it is completed in some
cases: when user space has mapped the physical memory as cacheable, and CPU
and DMA share the same buffer.

So I'd like to ask the other DRM maintainers: what do you think about it? It
seems that Intel DRM (maintained by Daniel), OMAP DRM (maintained by Rob) and
the GEM CMA helper also have the same issue Russell pointed out. I think not
only the above approach but also performance is very important.

Thanks,
Inki Dae

 
  Actually, drm_gem_fd_to_handle() includes mapping a
  given dmabuf into the IOMMU table (just logical data) of the DMA.  The
  device address (or iova) established by that mapping is then written to
  the DMA's buffer register by a drm_xxx_set_src/dst_dma_buffer(handle1,
  ...) call.
 
 Consider this with a PIPT cache:
 
   dma_map_sg()- at this point, the kernel addresses of these
   buffers are cleaned and invalidated for the DMA
 
    userspace writes to the buffer; the data sits in the CPU cache.
    Because the cache is PIPT, this data becomes visible to the
    kernel as well.
 
   DMA is started, and it writes to the buffer
 
 Now, at this point, which is the correct data?  The data physically in the
 RAM which the DMA has written, or the data in the CPU cache?  It may be
 that the answer is: they both are, and the loss of either is a potential
 data corruption issue - there is no way to tell which data should be
 kept, but the system is in an inconsistent state and _one_ of them will
 have to be discarded.
 
   dma_unmap_sg()  - at this point, the kernel addresses of the
   buffers are _invalidated_ and any data in those
   cache lines is discarded
 
 Which also means that the data in userspace will also be discarded with
 PIPT caches.
 
 This is precisely why we have buffer rules associated with the DMA API,
 which are these:
 
   dma_map_sg()
   - the buffer transfers ownership from the CPU to the DMA agent.
   - the CPU may not alter the buffer in any way.
   while (cpu_wants_access) {
   dma_sync_sg_for_cpu()
   - the buffer transfers ownership from the DMA to the CPU.
   - the DMA may not alter the buffer in any way.
   dma_sync_sg_for_device()
   - the buffer transfers ownership from the CPU to the DMA
   - the CPU may not alter the buffer in any way.
   }
   dma_unmap_sg()
   - the buffer transfers ownership from the DMA to the CPU.
   - the DMA may not alter the buffer in any way.
 
 Any violation of that is likely to lead to data corruption.  Now, some
 may say that the DMA API is only about the kernel mapping.  Yes it is,
 because it takes no regard whatsoever of what happens with the
 userspace mappings.  This is absolutely true when you have VIVT or
 aliasing VIPT caches.
 
 Now consider that with a PIPT cache, or a non-aliasing VIPT cache (which
 is exactly the same behaviourally from this aspect) any modification of
 a userspace mapping is precisely the same as modifying the kernel space
 mapping - and what