Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Tue, Jun 25, 2013 at 11:23:21AM +0200, Daniel Vetter wrote:
> Just a quick question on your assertion that we need all four
> functions: Since we already have begin/end_cpu_access functions
> (intention here was to allow the dma_buf exporter to ensure the memory
> is pinned, e.g. for swappable gem objects, but also to allow cpu cache
> flushing if required) do we still need the sync_sg_for_cpu?

Possibly not, but let's first look at that kind of scenario:

- The importer attaches to a dma_buf using dma_buf_attach().
- It maps the DMA buffer using dma_buf_map_attachment(). This does
  dma_map_sg(), which possibly incurs a cache flush depending on the
  coherence properties of the architecture.
- It then calls dma_buf_begin_cpu_access(). This presumably does a
  dma_sync_sg_for_cpu(), possibly incurring another cache flush
  depending on the coherence properties of the architecture.
- Then, presumably, it calls dma_buf_kmap_atomic() or dma_buf_kmap() to
  gain a pointer to the buffer.
- At some point, dma_buf_kunmap_atomic() or dma_buf_kunmap() gets
  called. On certain cache architectures, kunmap_atomic() results in a
  cache flush.
- dma_buf_end_cpu_access() gets called, presumably calling through to
  dma_sync_sg_for_device(). Possibly another cache flush.
- dma_buf_unmap_attachment()... another cache flush.

Of those five cache flushes, the only one which is really necessary is
the one in kunmap_atomic(). The rest are completely irrelevant to the
CPU access, provided there is no DMA access by the device. What's more,
you can't say "well, the architecture should optimize them!", to which
I'd respond: how does the architecture know at dma_map_sg() time that
there isn't going to be a DMA access?

Now, if we say that it's not necessary to call dma_buf_map_attachment()..
dma_buf_unmap_attachment() around the calls to
dma_buf_begin_cpu_access().. dma_buf_end_cpu_access(), then how does
dma_buf_begin_cpu_access() know whether the DMA buffer has been
dma_map_sg()'d? It's invalid to call dma_sync_sg_for_cpu() on a buffer
which has not been dma_map_sg()'d.

Bear in mind that dma_buf_begin_cpu_access() is passed only the struct
dma_buf and not the struct dma_buf_attachment *attach, so there's no
hope for the begin_cpu_access() method to know which user of the
dma_buf is asking for this call, and you can't track any state there.

Thank you for pointing this out; I'm less impressed with this dma_buf
API now that I've looked deeper at those two interfaces, and even more
convinced it needs to be redesigned! :P
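To make the sequence concrete, here is a minimal importer-side sketch
against the dma-buf API of the time (error handling omitted; the
comments mark where a flush *may* occur, which is architecture
dependent as discussed above):

	/* given: struct dma_buf *dmabuf, struct device *dev */
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	void *vaddr;

	attach = dma_buf_attach(dmabuf, dev);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
				/* dma_map_sg(): possible flush 1 */

	dma_buf_begin_cpu_access(dmabuf, 0, dmabuf->size, DMA_FROM_DEVICE);
				/* dma_sync_sg_for_cpu(): possible flush 2 */
	vaddr = dma_buf_kmap_atomic(dmabuf, 0);
	/* ... CPU access through vaddr ... */
	dma_buf_kunmap_atomic(dmabuf, 0, vaddr);
				/* flush 3: the only necessary one */
	dma_buf_end_cpu_access(dmabuf, 0, dmabuf->size, DMA_FROM_DEVICE);
				/* dma_sync_sg_for_device(): possible flush 4 */

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
				/* dma_unmap_sg(): possible flush 5 */
	dma_buf_detach(dmabuf, attach);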
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Tue, Jun 18, 2013 at 12:46 PM, Russell King - ARM Linux
<li...@arm.linux.org.uk> wrote:
>>> Note: the existing stuff does have the nice side effect of being
>>> able to pass buffers which do not have a struct page * associated
>>> with them through the dma_buf API - I think we can still preserve
>>> that by having dma_buf provide a couple of new APIs to do the SG
>>> list map/sync/unmap, but in any case we need to fix the existing
>>> API so that:
>>>
>>>     dma_buf_map_attachment() becomes dma_buf_get_sg()
>>>     dma_buf_unmap_attachment() becomes dma_buf_put_sg()
>>>
>>> both getting rid of the DMA direction argument, and then we have
>>> four new dma_buf calls:
>>>
>>>     dma_buf_map_sg()
>>>     dma_buf_unmap_sg()
>>>     dma_buf_sync_sg_for_cpu()
>>>     dma_buf_sync_sg_for_device()
>>>
>>> which do the actual sg map/unmap via the DMA API *at the appropriate
>>> time for DMA*.
>>
>> Hm, my idea was to just add a dma_buf_sync_attachment for the device
>> side syncing, since the cpu access stuff is already bracketed with
>> the begin/end cpu access stuff. We might need a sync_for_cpu or so
>> for mmap, but imo mmap support for dma_buf is a bit insane anyway, so
>> I don't care too much about it. Since such dma mappings would be
>> really long-standing in most cases anyway, drivers could just map
>> with BIDIRECTIONAL and do all the real flushing with the new sync
>> stuff.
>
> Note that the DMA API debug doesn't allow you to change the direction
> argument on an existing mapping (neither should it, again this is
> documented in the DMA API stuff in Documentation/). This is where you
> would need the complete set of four functions I mention above which
> reflect the functionality of the DMA API.

[Been travelling a bit, hence the delay.]

Just a quick question on your assertion that we need all four
functions: Since we already have begin/end_cpu_access functions
(intention here was to allow the dma_buf exporter to ensure the memory
is pinned, e.g. for swappable gem objects, but also to allow cpu cache
flushing if required) do we still need the sync_sg_for_cpu? At least
with i915 as the exporter we currently hide the clflushing behind our
begin_cpu_access callback. For device dma we currently punt on it due
to lack of the dma_buf_sync_sg_for_device interface.

Aside: I know that i915 doing the clflushing dance itself is a bit
ugly, but thus far we've been the only guys on the x86 block with
non-coherent dma. But it sounds like a bunch of other blocks on Atom
SoCs have similar needs, so I guess it would make sense to move that
into the dma layer.

-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
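Since these six entry points are only a proposal in this thread, not an
existing interface, the following is a purely hypothetical sketch of
what their signatures might look like, modelled on the existing dma_buf
and DMA APIs (the names come from Russell's proposal; the parameter
lists are guessed here):

	/* proposal only -- not a mainline interface */
	struct sg_table *dma_buf_get_sg(struct dma_buf_attachment *attach);
	void dma_buf_put_sg(struct dma_buf_attachment *attach,
			    struct sg_table *sgt);

	/* map/unmap via the DMA API at the appropriate time for DMA */
	int dma_buf_map_sg(struct dma_buf_attachment *attach,
			   struct sg_table *sgt,
			   enum dma_data_direction dir);
	void dma_buf_unmap_sg(struct dma_buf_attachment *attach,
			      struct sg_table *sgt,
			      enum dma_data_direction dir);

	/* ownership transfer without dropping the mapping */
	void dma_buf_sync_sg_for_cpu(struct dma_buf_attachment *attach,
				     struct sg_table *sgt,
				     enum dma_data_direction dir);
	void dma_buf_sync_sg_for_device(struct dma_buf_attachment *attach,
					struct sg_table *sgt,
					enum dma_data_direction dir);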
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Am Donnerstag, den 20.06.2013, 20:15 +0900 schrieb Inki Dae:
[...]
>> You already need some kind of IPC between the two tasks, as I suspect
>> even in your example it wouldn't make much sense to queue the buffer
>> over and over again in task B without task A writing anything to it.
>> So task A has to signal task B there is new data in the buffer to be
>> processed.
>>
>> There is no need to share the buffer over and over again just to get
>> the two processes to work together on the same thing. Just share the
>> fd between both and then do out-of-band completion signaling, as you
>> need this anyway. Without this you'll end up with unpredictable
>> behavior. Just because sync allows you to access the buffer doesn't
>> mean it's valid for your use-case. Without completion signaling you
>> could easily end up overwriting your data from task A multiple times
>> before task B even tries to lock the buffer for processing.
>>
>> So the valid flow is (and this already works with the current APIs):
>>
>>   Task A                          Task B
>>   ------                          ------
>>   CPU access buffer
>>        --------completion signal-------->
>>                                   qbuf (dragging buffer into device
>>                                   domain, flush caches, reserve
>>                                   buffer etc.)
>>                                        |
>>                                   wait for device operation to
>>                                   complete
>>                                        |
>>                                   dqbuf (dragging buffer back into
>>                                   CPU domain, invalidate caches,
>>                                   unreserve)
>>        <-------completion signal---------
>>   CPU access buffer
>
> Correct. In case that data flow goes from A to B, it needs some kind
> of IPC between the two tasks every time, as you said. Then, without
> dmabuf-sync, how do you think about the case where two tasks share the
> same buffer, but these tasks access the buffer (buf1) as write, and
> the data of the buffer (buf1) doesn't need to be shared?

>> Sorry, I don't see the point you are trying to solve here. If you
>> share a buffer and want its content to be clearly defined at every
>> point in time you have to synchronize the tasks working with the
>> buffer, not just the buffer accesses themselves. The easiest way to
>> do so is to sync through userspace with out-of-band IPC, like in the
>> example above.
>
> In my opinion, that's definitely not the easiest way. What I try to do
> is to avoid using *the out-of-band IPC*. As I mentioned in the
> document file, the conventional mechanism not only makes the user
> application complicated - the user process needs to understand how the
> device driver works - but may also incur performance overhead through
> the out-of-band IPC. My example above may not be enough for you, but
> there would be other cases able to use my approach efficiently.

Yeah, you'll need some knowledge and understanding about the API you
are working with to get things right. But I think it's not an
unreasonable thing to expect the programmer working directly with
kernel interfaces to read up on how things work.

Second thing: I'd rather have *one* consistent API for every subsystem,
even if they differ from each other, than having to implement this
syncpoint thing in every subsystem. Remember: a single execbuf in DRM
might reference both GEM objects backed by dma-buf as well as native
SHM or CMA backed objects. The dma-buf-mgr proposal already allows you
to handle dma-bufs much the same way during validation as native GEM
objects.

And to get back to my original point: if you have more than one task
operating together on a buffer you absolutely need some kind of real
IPC to sync them up and do something useful. Both your syncpoints and
the proposed dma-fences only protect the buffer accesses to make sure
different tasks don't stomp on each other. There is nothing in there to
make sure that the output of your pipeline is valid. You have to take
care of that yourself in userspace.
I'll reuse your example to make it clear what I mean:

  Task A                          Task B
  ------                          ------
  dma_buf_sync_lock(buf1)
  CPU write buf1
  dma_buf_sync_unlock(buf1)
       ---------schedule Task A again---------
  dma_buf_sync_lock(buf1)
  CPU write buf1
  dma_buf_sync_unlock(buf1)
       ---------schedule Task B---------------
                                  qbuf(buf1)
                                  dma_buf_sync_lock(buf1)
                                  ...

This is what can happen if you don't take care of proper syncing. Task
A writes something to the buffer in expectation that Task B will take
care of it, but before Task B even gets scheduled Task A overwrites the
buffer again. Not what you wanted, isn't it?
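For contrast, the valid flow from the diagram further up can be
expressed with nothing but existing interfaces. A minimal sketch of
task B's side, assuming a V4L2 device as the consumer and a pair of
pipes (sig_fd/done_fd, names made up here) shared with task A as the
out-of-band completion channel; buf is prepared by the caller:

	#include <sys/ioctl.h>
	#include <linux/videodev2.h>
	#include <unistd.h>

	void task_b_loop(int v4l2_fd, int sig_fd, int done_fd,
			 struct v4l2_buffer *buf)
	{
		char token;

		for (;;) {
			/* wait for task A's "new data" signal */
			read(sig_fd, &token, 1);

			/* buffer -> device domain: flush caches, reserve */
			ioctl(v4l2_fd, VIDIOC_QBUF, buf);

			/* blocks until the device operation completes;
			 * buffer -> CPU domain: invalidate, unreserve */
			ioctl(v4l2_fd, VIDIOC_DQBUF, buf);

			/* tell task A the buffer may be written again */
			write(done_fd, &token, 1);
		}
	}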
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Hi Inki,

please refrain from sending HTML mails, it makes proper quoting without
messing up the layout everywhere pretty hard.

Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
[...]
>> Yeah, you'll need some knowledge and understanding about the API you
>> are working with to get things right. But I think it's not an
>> unreasonable thing to expect the programmer working directly with
>> kernel interfaces to read up on how things work.
>>
>> Second thing: I'd rather have *one* consistent API for every
>> subsystem, even if they differ from each other, than having to
>> implement this syncpoint thing in every subsystem. Remember: a single
>> execbuf in DRM might reference both GEM objects backed by dma-buf as
>> well as native SHM or CMA backed objects. The dma-buf-mgr proposal
>> already allows you to handle dma-bufs much the same way during
>> validation as native GEM objects.
>
> Actually, at first I had implemented a fence helper framework based on
> reservation and dma fence to provide an easy-to-use interface for
> device drivers. However, that was a wrong implementation: I had not
> only customized the dma fence but also not considered the deadlock
> issue. After that, I reimplemented it as dmabuf sync to solve the
> deadlock issue, and at that time I realized that we first need to
> concentrate on the most basic things: the fact that CPU and CPU, CPU
> and DMA, or DMA and DMA can access the same buffer; the fact that
> simple is best; and the fact that we need not only kernel-side but
> also user-side interfaces. After that, I collected what is common to
> all subsystems, and devised this dmabuf sync framework for it.
>
> I'm not really a specialist in the desktop world, so a question: isn't
> the execbuf used only for the GPU? The GPU has dedicated video memory
> (VRAM), so it needs a migration mechanism between system memory and
> the dedicated video memory, and also has to consider ordering issues
> while buffers are migrated.

Yeah, execbuf is pretty GPU specific, but I don't see how this matters
for this discussion. Also I don't see a big difference between embedded
and desktop GPUs. Buffer migration is more of a detail here. Both take
a command stream that potentially references other buffers, which might
be native GEM or dma-buf backed objects. Both have to make sure the
buffers are in the right domain (caches cleaned and address mappings
set up) and are available for the desired operation, meaning you have
to sync with other DMA engines and maybe also with the CPU.

The only case where sync isn't clearly defined right now by the current
API entry points is when you access memory through the dma-buf fallback
mmap support, which might happen with some software processing element
in a video pipeline or something. I agree that we will need a userspace
interface here, but I think this shouldn't be yet another sync object,
but rather more of a prepare/fini_cpu_access ioctl on the dma-buf which
hooks into the existing dma-fence and reservation stuff.

>> And to get back to my original point: if you have more than one task
>> operating together on a buffer you absolutely need some kind of real
>> IPC to sync them up and do something useful. Both your syncpoints and
>> the proposed dma-fences only protect the buffer accesses to make sure
>> different tasks don't stomp on each other. There is nothing in there
>> to make sure that the output of your pipeline is valid. You have to
>> take care of that yourself in userspace.
>> I'll reuse your example to make it clear what I mean:
>>
>> [...]
>>
>> This is what can happen if you don't take care of proper syncing.
>> Task A writes something to the buffer in expectation that Task B will
>> take care of it, but before Task B even gets scheduled Task A
>> overwrites the buffer again. Not what you wanted, isn't it?
>
> Exactly wrong example. I had already mentioned that. In case that data
> flow goes from A to B, it needs some kind of IPC between the two tasks
> every time. So again, your example would have no problem in the case
> that *two tasks share the same buffer, but these tasks access the
> buffer as write, and the data of the buffer isn't needed to be
> shared*.
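To illustrate the prepare/fini_cpu_access idea suggested in the message
above: no such ioctl existed at the time (mainline only gained a
similar DMA_BUF_IOCTL_SYNC years later), so the following uapi sketch
is purely hypothetical and all names are made up:

	#include <linux/ioctl.h>
	#include <linux/types.h>

	struct dma_buf_cpu_access {
		__u64 flags;			/* direction of CPU access */
	};

	#define DMA_BUF_CPU_ACCESS_READ		(1 << 0)
	#define DMA_BUF_CPU_ACCESS_WRITE	(1 << 1)

	#define DMA_BUF_IOC_PREPARE_CPU_ACCESS \
		_IOW('b', 1, struct dma_buf_cpu_access)
	#define DMA_BUF_IOC_FINI_CPU_ACCESS \
		_IOW('b', 2, struct dma_buf_cpu_access)

	/* usage around a CPU write through the dma-buf mmap: PREPARE
	 * would wait for fences and invalidate caches, FINI would write
	 * back caches and unreserve */
	struct dma_buf_cpu_access acc = {
		.flags = DMA_BUF_CPU_ACCESS_WRITE,
	};

	ioctl(dmabuf_fd, DMA_BUF_IOC_PREPARE_CPU_ACCESS, &acc);
	/* ... CPU writes through the mmap'ed pointer ... */
	ioctl(dmabuf_fd, DMA_BUF_IOC_FINI_CPU_ACCESS, &acc);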
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
2013/6/21 Lucas Stach <l.st...@pengutronix.de>:
> Hi Inki,
>
> please refrain from sending HTML mails, it makes proper quoting
> without messing up the layout everywhere pretty hard.

Sorry about that. I should have used text mode.

> Am Freitag, den 21.06.2013, 20:01 +0900 schrieb Inki Dae:
> [...]
>
> Yeah, execbuf is pretty GPU specific, but I don't see how this matters
> for this discussion. Also I don't see a big difference between
> embedded and desktop GPUs. Buffer migration is more of a detail here.
> Both take a command stream that potentially references other buffers,
> which might be native GEM or dma-buf backed objects. Both have to make
> sure the buffers are in the right domain (caches cleaned and address
> mappings set up) and are available for the desired operation, meaning
> you have to sync with other DMA engines and maybe also with the CPU.

Yeah, right. Then, in case of a desktop GPU, doesn't it need to do
something additional when a buffer is migrated from the system memory
domain to the video memory domain, or from video memory back to system
memory? I guess the below members do a similar thing, and all other DMA
devices would not need them:

	struct fence {
		...
		unsigned int context, seqno;
		...
	};

And,

	struct seqno_fence {
		...
		uint32_t seqno_ofs;
		...
	};

> The only case where sync isn't clearly defined right now by the
> current API entry points is when you access memory through the dma-buf
> fallback mmap support, which might happen with some software
> processing element in a video pipeline or something. I agree that we
> will need a userspace interface here, but I think this shouldn't be
> yet another sync object, but rather more of a prepare/fini_cpu_access
> ioctl on the dma-buf which hooks into the existing dma-fence and
> reservation stuff.

I think we don't need additional ioctl commands for that. I am thinking
of using existing resources as much as possible.
My idea is also similar to yours in using the reservation stuff,
because my approach should also use the dma-buf resource. However, my
idea is that a user process that wants buffer synchronization with
another sees a sync object as a file descriptor, like dma-buf does. The
below shows my simple idea about it:

	ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, sync);

	flock(sync->fd, LOCK_SH);    <- LOCK_SH means a shared lock
	    CPU access for read
	flock(sync->fd, LOCK_UN);

Or

	flock(sync->fd, LOCK_EX);    <- LOCK_EX means an exclusive lock
	    CPU access for write
	flock(sync->fd, LOCK_UN);

	close(sync->fd);

As you know, that's similar to the dmabuf export feature. In addition,
an even simpler idea:

	flock(dmabuf_fd, LOCK_SH/EX);
	    CPU access for read/write
	flock(dmabuf_fd, LOCK_UN);

However, I'm not sure that the above examples would work well or be
problem-free yet: actually, I don't fully understand the flock
mechanism, so I'm looking into it.

> And to get back to my original point: if you have more than one task
> operating together on a buffer you absolutely need some kind of real
> IPC
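As a self-contained illustration of the first variant:
DMA_BUF_IOC_OPEN_SYNC and the struct it fills in are hypothetical (they
exist only in this proposal, with the encoding guessed here), but
flock() itself behaves as used below - LOCK_SH admits concurrent
readers, LOCK_EX grants a single writer:

	#include <sys/file.h>
	#include <sys/ioctl.h>
	#include <string.h>
	#include <unistd.h>

	/* hypothetical, from the proposal above -- not a mainline uapi */
	struct dma_buf_sync_obj {
		int fd;		/* sync object exported as an fd */
	};

	#define DMA_BUF_IOC_OPEN_SYNC _IOR('b', 3, struct dma_buf_sync_obj)

	int cpu_write(int dmabuf_fd, void *map, const void *data, size_t n)
	{
		struct dma_buf_sync_obj sync;

		if (ioctl(dmabuf_fd, DMA_BUF_IOC_OPEN_SYNC, &sync) < 0)
			return -1;

		flock(sync.fd, LOCK_EX);	/* exclusive: CPU write */
		memcpy(map, data, n);
		flock(sync.fd, LOCK_UN);

		close(sync.fd);
		return 0;
	}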
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Fri, Jun 21, 2013 at 12:55 PM, Inki Dae <daei...@gmail.com> wrote:
[...]
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
> -----Original Message-----
> From: Russell King - ARM Linux
> Sent: Thursday, June 20, 2013 3:29 AM
> Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
> synchronization framework
>
>> On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote:
>>> On the other hand, the below shows how we could enhance the
>>> conventional way with my approach (just an example):
>>>
>>>     CPU -> DMA,
>>>         ioctl(qbuf command)           ioctl(streamon)
>>>               |                             |
>>>               |                             |
>>>         qbuf <- dma_buf_sync_get      start streaming <- syncpoint
>>>
>>> dma_buf_sync_get just registers a sync buffer (dmabuf) to the sync
>>> object. And the syncpoint is performed by calling
>>> dma_buf_sync_lock(), and then DMA accesses the sync buffer.
>>>
>>>     And DMA -> CPU,
>>>         ioctl(dqbuf command)
>>>               |
>>>               |
>>>         dqbuf <- nothing to do
>>>
>>> The actual syncpoint is when the DMA operation is completed (in the
>>> interrupt handler): the syncpoint is performed by calling
>>> dma_buf_sync_unlock(). Hence, my approach is to move the syncpoints
>>> to just before the dma access, as late as possible.
>>
>> What you've just described does *not* work on architectures such as
>> ARMv7 which do speculative cache fetches from memory at any time that
>> that memory is mapped with a cacheable status, and will lead to data
>> corruption.
>
> I didn't explain that enough. Sorry about that. 'nothing to do' means
> that a dmabuf sync interface isn't called but existing functions are
> called. So this may be explained again:
>
>     ioctl(dqbuf command)
>           |
>           |
>     dqbuf <- 1. dma_unmap_sg
>              2. dma_buf_sync_unlock (syncpoint)
>
> The syncpoint I mentioned means a lock mechanism; it does not do cache
> operations.
>
> In addition, please see the below more detailed examples.
>
> The conventional way (without dmabuf-sync) is:
>
>     Task A
>     ----------
>      1. CPU accesses buf
>      2. Send the buf to Task B
>      3. Wait for the buf from Task B
>      4. go to 1
>
>     Task B
>     ----------
>      1. Wait for the buf from Task A
>      2. qbuf the buf
>         2.1 insert the buf into the incoming queue
>      3. stream on
>         3.1 dma_map_sg if ready, and move the buf to the ready queue
>         3.2 get the buf from the ready queue, and dma start.
>      4. dqbuf
>         4.1 dma_unmap_sg after dma operation completion
>         4.2 move the buf to the outgoing queue
>      5. back the buf to Task A
>      6. go to 1
>
> In case that two tasks share buffers, and data flow goes from Task A
> to Task B, we would need IPC operations to send and receive buffers
> properly between those two tasks every time CPU or DMA access to the
> buffers is started or completed.
>
> With dmabuf-sync:
>
>     Task A
>     ----------
>      1. dma_buf_sync_lock <- syncpoint (called by the user side)
>      2. CPU accesses buf
>      3. dma_buf_sync_unlock <- syncpoint (called by the user side)
>      4. Send the buf to Task B (just one time)
>      5. go to 1
>
>     Task B
>     ----------
>      1. Wait for the buf from Task A (just one time)
>      2. qbuf the buf
>         2.1 insert the buf into the incoming queue
>      3. stream on
>         3.1 dma_buf_sync_lock <- syncpoint (called by the kernel side)
>         3.2 dma_map_sg if ready, and move the buf to the ready queue
>         3.3 get the buf from the ready queue, and dma start.
>      4. dqbuf
>         4.1 dma_buf_sync_unlock <- syncpoint (called by the kernel
>             side)
>         4.2 dma_unmap_sg after dma operation completion
>         4.3 move the buf to the outgoing queue
>      5. go to 1
>
> On the other hand, in case of using dmabuf-sync, as you can see in the
> above example, we would need the IPC operation just one time. That
> way, I think we could not only reduce performance overhead but also
> simplify the user application.
> Of course, this approach can be used for all DMA device drivers such
> as DRM. I'm not a specialist in the v4l2 world, so there may be a
> missing point.

You already need some kind of IPC between the two tasks, as I suspect
even in your example it wouldn't make much sense to queue the buffer
over and over again in task B without task A writing anything to it. So
task A has to signal task B there is new data in the buffer to be
processed.

There is no need to share the buffer over and over again just to get
the two processes to work together on the same thing. Just share the fd
between both and then do out-of-band completion signaling, as you need
this anyway. Without this you'll end up with unpredictable behavior.
Just because sync allows you to access the buffer doesn't mean it's
valid for your use-case. Without completion signaling you could easily
end up overwriting your data from task A multiple times before task B
even tries to lock the buffer for processing.
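To make the kernel half of the proposed flow concrete, here is a sketch
of how the RFC's interface might be wired into a driver's
qbuf/streamon/irq path. The dma_buf_sync_* names come from the proposal
itself; the signatures, my_ctx, and my_hw_start_dma() are made up for
illustration:

	#include <linux/dma-buf.h>
	#include <linux/interrupt.h>

	static int my_qbuf(struct my_ctx *ctx, struct dma_buf *dmabuf)
	{
		/* register the buffer with the context's sync object */
		return dma_buf_sync_get(ctx->sync, dmabuf);
	}

	static int my_start_streaming(struct my_ctx *ctx)
	{
		int ret;

		/* syncpoint: wait for other CPU/DMA users of the buffer */
		ret = dma_buf_sync_lock(ctx->sync);
		if (ret)
			return ret;

		return my_hw_start_dma(ctx);
	}

	static irqreturn_t my_irq_handler(int irq, void *data)
	{
		struct my_ctx *ctx = data;

		/* syncpoint: DMA is done, release waiters */
		dma_buf_sync_unlock(ctx->sync);
		return IRQ_HANDLED;
	}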
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Thu, Jun 20, 2013 at 09:47:07AM +0200, Lucas Stach wrote:
> Am Donnerstag, den 20.06.2013, 15:43 +0900 schrieb Inki Dae:
> [...]
>
> You already need some kind of IPC between the two tasks, as I suspect
> even in your example it wouldn't make much sense to queue the buffer
> over and over again in task B without task A writing anything to it.
> So task A has to signal task B there is new data in the buffer to be
> processed.

Hang on. Since when did dma_buf become another inter-process IPC
mechanism? That's *not* its design goal, and there's other much better
RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
-----Original Message-----
From: Lucas Stach [mailto:l.st...@pengutronix.de]
Sent: Thursday, June 20, 2013 4:47 PM
Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
synchronization framework

[...]
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Am Donnerstag, den 20.06.2013, 09:17 +0100 schrieb Russell King - ARM
Linux:
[...]
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Am Donnerstag, den 20.06.2013, 17:24 +0900 schrieb Inki Dae:
[...]
>> You already need some kind of IPC between the two tasks, as I suspect
>> even in your example it wouldn't make much sense to queue the buffer
>> over and over again in task B without task A writing anything to it.
>> So task A has to signal task B there is new data in the buffer to be
>> processed.
>>
>> [...]
>>
>> So the valid flow is (and this already works with the current APIs):
>>
>> [...]
>
> Correct. In case that data flow goes from A to B, it needs some kind
> of IPC between the two tasks every time, as you said. Then, without
> dmabuf-sync, how do you think about the case where two tasks share the
> same buffer, but these tasks access the buffer (buf1) as write, and
> the data of the buffer (buf1) doesn't need to be shared?

Sorry, I don't see the point you are trying to solve here. If you share
a buffer and want its content to be clearly defined at every point in
time you have to synchronize the tasks working with the buffer, not
just the buffer accesses themselves. The easiest way to do so is to
sync through userspace with out-of-band IPC, like in the example above.
A more advanced way to achieve this would be using cross-device fences
to avoid going through userspace for every syncpoint.

> With dmabuf-sync:
>
>     Task A
>     ----------
>      1. dma_buf_sync_lock <- syncpoint (called by the user side)
>      2. CPU writes
RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
-----Original Message-----
From: Lucas Stach [mailto:l.st...@pengutronix.de]
Sent: Thursday, June 20, 2013 7:11 PM
Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
synchronization framework

[...]
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae:
>> Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae:
>> [...]
>>>> a display device driver. It shouldn't be used within a single
>>>> driver as a means of passing buffers between userspace and kernel
>>>> space.
>>>
>>> What I try to do is not really such an ugly thing. What I try to do
>>> is to notify the kernel side, through the dmabuf interface, when the
>>> CPU tries to access a buffer. So it's not really sending the buffer
>>> to the kernel.
>>
>> The most basic question about why you are trying to implement this
>> sort of thing in the dma_buf framework still stands. Once you
>> imported a dma_buf into your DRM driver it's a GEM object and you can
>> and should use the native DRM ioctls to prepare/end a CPU access to
>> this BO. Then internally to your driver you can use the dma_buf
>> reservation/fence stuff to provide the necessary cross-device sync.
>
> I don't really want it to be used only for DRM drivers. We really need
> it for all other DMA devices, i.e. v4l2 based drivers. That is what I
> try to do. And my approach uses reservation to use dma-buf resources,
> but not the dma fence stuff anymore. However, I'm looking into the
> Radeon DRM driver for why we need the dma fence stuff, and how we can
> use it if needed.

Still I don't see the point why you need syncpoints above dma-buf. In
both the DRM and the V4L2 world we have defined points in the API where
a buffer is allowed to change domain from device to CPU and vice versa.

In DRM if you want to access a buffer with the CPU you do a
cpu_prepare. The buffer changes back to the GPU domain once you do the
execbuf validation, queue a pageflip to the buffer or similar things.

In V4L2 the syncpoints for cache operations are the queue/dequeue API
entry points. Those are also the exact points to synchronize with other
hardware, thus using dma-buf reserve/fence.

In all this I can't see any need for a new syncpoint primitive slapped
on top of dma-buf.

Regards,
Lucas
--
Pengutronix e.K.           | Lucas Stach                 |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-     |
RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
-----Original Message-----
From: Lucas Stach [mailto:l.st...@pengutronix.de]
Sent: Wednesday, June 19, 2013 7:22 PM
Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer
synchronization framework

[...]

> Still I don't see the point why you need syncpoints above dma-buf. In
> both the DRM and the V4L2 world we have defined points in the API
> where a buffer is allowed to change domain from device to CPU and vice
> versa.
>
> In DRM if you want to access a buffer with the CPU you do a
> cpu_prepare. The buffer changes back to the GPU domain once you do the
> execbuf validation, queue a pageflip to the buffer or similar things.
>
> In V4L2 the syncpoints for cache operations are the queue/dequeue API
> entry points. Those are also the exact points to synchronize with
> other hardware, thus using dma-buf reserve/fence.

If so, what if we want to access a buffer with the CPU _in V4L2_?
Should we open a drm device node, and then do a cpu_prepare?

Thanks,
Inki Dae
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Am Mittwoch, den 19.06.2013, 19:44 +0900 schrieb Inki Dae: -Original Message- From: Lucas Stach [mailto:l.st...@pengutronix.de] Sent: Wednesday, June 19, 2013 7:22 PM To: Inki Dae Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

Am Mittwoch, den 19.06.2013, 14:45 +0900 schrieb Inki Dae: -Original Message- From: Lucas Stach [mailto:l.st...@pengutronix.de] Sent: Tuesday, June 18, 2013 6:47 PM To: Inki Dae Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae: [...] a display device driver. It shouldn't be used within a single driver as a means of passing buffers between userspace and kernel space.

What I am trying to do is not really such an ugly thing. What I am trying to do is to notify the kernel, through the dmabuf interface, when the CPU is about to access a buffer. So it's not really about sending the buffer to the kernel. Thanks, Inki Dae

The most basic question about why you are trying to implement this sort of thing in the dma_buf framework still stands. Once you have imported a dma_buf into your DRM driver it's a GEM object, and you can and should use the native DRM ioctls to prepare/end a CPU access to this BO. Then, internally to your driver, you can use the dma_buf reservation/fence stuff to provide the necessary cross-device sync.

I don't really want this to be used only by DRM drivers. We really need it for all other DMA devices, i.e. V4L2-based drivers. That is what I am trying to do. And my approach uses reservation to use dma-buf resources, but no dma-fence stuff anymore. However, I'm looking into the Radeon DRM driver for why we need the dma-fence stuff, and how we can use it if needed.

Still I don't see the point why you need syncpoints above dma-buf. In both the DRM and the V4L2 world we have defined points in the API where a buffer is allowed to change domain from device to CPU and vice versa. In DRM, if you want to access a buffer with the CPU you do a cpu_prepare. The buffer changes back to the GPU domain once you do the execbuf validation, queue a pageflip to the buffer, or similar things. In V4L2 the syncpoints for cache operations are the queue/dequeue API entry points. Those are also the exact points to synchronize with other hardware, and thus to use dma-buf reserve/fence.

If so, what if we want to access a buffer with the CPU _in V4L2_? Should we open a DRM device node and then do a cpu_prepare?

Not at all. As I said, the syncpoints are the queue/dequeue operations. When dequeueing a buffer you are explicitly dragging the buffer domain back from device into userspace, and thus the CPU domain. If you are operating on an mmap of a V4L2-processed buffer, it is either before or after it got processed by the hardware, and therefore all DMA operations on the buffer are bracketed by the V4L2 qbuf/dqbuf ioctls. That is where cache operations and synchronization should happen. The V4L2 driver shouldn't allow you to dequeue a buffer, and thus drag it back into the CPU domain, while there is still DMA ongoing. Equally, the queue ioctl should make sure caches are properly written back to memory. The result of reading from or writing to the mmap of a V4L2 buffer while it is enqueued to the hardware is simply undefined, and there is nothing suggesting that this is a valid use case.

Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917- |
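A minimal userspace sketch of the bracketing Lucas describes, assuming a capture stream already set up with VIDIOC_REQBUFS and mmap(); process_frame() is a hypothetical helper and error handling is omitted:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    extern void process_frame(void *data, size_t len);  /* hypothetical CPU work */

    void handle_one_buffer(int fd, void *mapped, size_t len)
    {
        struct v4l2_buffer buf;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;

        /* DQBUF drags the buffer back into the CPU domain: the driver
         * performs whatever cache maintenance the architecture needs. */
        ioctl(fd, VIDIOC_DQBUF, &buf);

        /* The CPU may now touch the mmap'ed buffer; no DMA is ongoing. */
        process_frame(mapped, len);

        /* QBUF hands the buffer back to the device domain; caches are
         * written back before the hardware may DMA into it again. */
        ioctl(fd, VIDIOC_QBUF, &buf);
    }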
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Thu, Jun 20, 2013 at 12:10:04AM +0900, Inki Dae wrote: On the other hand, the below shows how we could enhance the conventional way with my approach (just an example):

    CPU -> DMA,
        ioctl(qbuf command)              ioctl(streamon)
              |                                |
              |                                |
        qbuf - dma_buf_sync_get      start streaming - syncpoint

dma_buf_sync_get just registers a sync buffer (dmabuf) with the sync object, and the syncpoint is performed by calling dma_buf_sync_lock(); only then does the DMA access the sync buffer. And:

    DMA -> CPU,
        ioctl(dqbuf command)
              |
              |
        dqbuf - nothing to do

The actual syncpoint is when the DMA operation is completed (in the interrupt handler): the syncpoint is performed by calling dma_buf_sync_unlock(). Hence, my approach is to move the syncpoints to just before the DMA access, as far as possible.

What you've just described does *not* work on architectures such as ARMv7, which do speculative cache fetches from memory at any time that memory is mapped with a cacheable status, and it will lead to data corruption.
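A hedged kernel-side sketch of the flow Inki describes, using the dmabuf_sync_* names from the RFC posting below; the context structure, the access-type choice, and the IRQ wiring are invented for illustration:

    #include <linux/dma-buf.h>
    #include <linux/interrupt.h>
    /* #include <linux/dmabuf-sync.h> -- assumed header name from the RFC */

    struct my_ctx {
        struct dmabuf_sync *sync;       /* hypothetical per-stream context */
    };

    /* qbuf time: register the dmabuf with a sync object (no locking yet). */
    static int my_qbuf(struct my_ctx *ctx, struct dma_buf *dmabuf)
    {
        ctx->sync = dmabuf_sync_init(NULL, "my sync");
        if (IS_ERR(ctx->sync))
            return PTR_ERR(ctx->sync);

        return dmabuf_sync_get(ctx->sync, dmabuf,
                               DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA);
    }

    /* streamon time: the first syncpoint, taken just before the DMA starts. */
    static int my_streamon(struct my_ctx *ctx)
    {
        return dmabuf_sync_lock(ctx->sync);
    }

    /* DMA-completion interrupt: the second syncpoint. */
    static irqreturn_t my_dma_done_irq(int irq, void *data)
    {
        struct my_ctx *ctx = data;

        dmabuf_sync_unlock(ctx->sync);
        dmabuf_sync_put_all(ctx->sync);
        dmabuf_sync_fini(ctx->sync);
        return IRQ_HANDLED;
    }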
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote: On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote: 2013/6/17 Russell King - ARM Linux li...@arm.linux.org.uk

Exactly right. But that is definitely not my point. Could you please look at the simple example below? (Presume that CPU and DMA share a buffer, and that the buffer is mapped into user space as cacheable.)

    handle1 = drm_gem_fd_to_handle(a dmabuf fd);
    ...
    va1 = drm_gem_mmap(handle1);
    va2 = drm_gem_mmap(handle2);
    va3 = malloc(size);
    ...
    while (conditions) {
        memcpy(va1, some data, size);

No! Well, the first thing to say here is that under the requirements of the DMA API, the above is immediately invalid, because you're writing to a buffer which, under the terms of the DMA API, is currently owned by the DMA agent, *not* by the CPU. You're supposed to call dma_sync_sg_for_cpu() before you do that - but how is userspace supposed to know that requirement? Why should userspace even _have_ to know these requirements of the DMA API?

It's also entirely possible that drm_gem_fd_to_handle() (which indirectly causes dma_map_sg() on the buffer's scatterlist) followed by mmap'ing it into userspace is a bug too, as it has the potential to touch caches or stuff in ways that the DMA or IOMMU may not expect - but I'm not going to make too big a deal about that, because I don't think we have anything that picky.

However, the first point above is the most important one, and exposing the quirks of the DMA API to userland is certainly not a nice thing to be doing. This needs to be fixed - we can't go and enforce an API which is deeply embedded within the kernel all the way out to userland. What we need is something along the lines of: (a) dma_buf_map_attachment() _not_ mapping the scatterlist for DMA, or (b) drm_gem_prime_import() not calling dma_buf_map_attachment() at all.

I strongly vote for (b) (plus adding a dma_buf_map_sync interface to allow syncing to other devices/CPU without dropping the DMA mappings). At least that's been the idea behind things, but currently all (x86-based) drm drivers cut corners here massively. Aside: the entire reason for hiding the dma map step in the dma-buf attachment is to allow drivers to expose crazy IOMMUs (e.g. the tiler unit on OMAP) to other devices. So imo (a) isn't the right choice.

and for the scatterlist to be mapped for DMA at the point where the DMA operation is initiated, and unmapped at the point where the DMA operation is complete. So no, the problem is not that we need more APIs and code - we need the existing kernel API fixed so that we don't go exposing userspace to the requirements of the DMA API. Unless we do that, we're going to end up with a huge world of pain, where kernel architecture people need to audit every damned DRM userspace implementation that happens to be run on their platform, and that's not something arch people really can afford to do. Basically, I think the dma_buf stuff needs to be rewritten with the requirements of the DMA API at the forefront of whosever mind is doing the rewriting.

Note: the existing stuff does have the nice side effect of being able to pass buffers which do not have a struct page * associated with them through the dma_buf API - I think we can still preserve that by having dma_buf provide a couple of new APIs to do the SG list map/sync/unmap, but in any case we need to fix the existing API so that: dma_buf_map_attachment() becomes dma_buf_get_sg(), and dma_buf_unmap_attachment() becomes dma_buf_put_sg() - both getting rid of the DMA direction argument - and then we have four new dma_buf calls: dma_buf_map_sg(), dma_buf_unmap_sg(), dma_buf_sync_sg_for_cpu() and dma_buf_sync_sg_for_device(), which do the actual sg map/unmap via the DMA API *at the appropriate time for DMA*.

Hm, my idea was to just add a dma_buf_sync_attachment for the device-side syncing, since the CPU access stuff is already bracketed with the begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap, but imo mmap support for dma_buf is a bit insane anyway, so I don't care too much about it. Since such dma mappings would be really long-standing in most cases anyway, drivers could just map with BIDIRECTIONAL and do all the real flushing with the new sync stuff.

Another piece of fun will be integration with the fencing stuff Maarten is doing. I'm not sure whether we should integrate that into the dma-buf interface (for simple users who don't want to deal with the full complexity, at least).

So, the summary of this is - at the moment, I regard DRM Prime and dmabuf to be utterly broken in design for architectures such as ARM where the requirements of the DMA API have to be followed if you're going to have a happy life.

Agreed. Unfortunately there are
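A hedged header-style sketch of the API split Russell proposes; none of these six entry points exist in the kernel at the time of this thread, so the signatures below are guesses modelled on the existing dma_buf and DMA API conventions:

    #include <linux/dma-buf.h>
    #include <linux/dma-direction.h>
    #include <linux/scatterlist.h>

    /* Get/put the scatterlist without touching the DMA API at all;
     * note the direction argument is gone. */
    struct sg_table *dma_buf_get_sg(struct dma_buf_attachment *attach);
    void dma_buf_put_sg(struct dma_buf_attachment *attach,
                        struct sg_table *sgt);

    /* Map/unmap for DMA only around an actual DMA operation. */
    int dma_buf_map_sg(struct dma_buf_attachment *attach,
                       struct sg_table *sgt, enum dma_data_direction dir);
    void dma_buf_unmap_sg(struct dma_buf_attachment *attach,
                          struct sg_table *sgt, enum dma_data_direction dir);

    /* Ownership transfers while the mapping stays in place. */
    void dma_buf_sync_sg_for_cpu(struct dma_buf_attachment *attach,
                                 struct sg_table *sgt,
                                 enum dma_data_direction dir);
    void dma_buf_sync_sg_for_device(struct dma_buf_attachment *attach,
                                    struct sg_table *sgt,
                                    enum dma_data_direction dir);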
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote: So I'd like to ask the other DRM maintainers: what do you think about it? It seems that Intel DRM (maintained by Daniel), OMAP DRM (maintained by Rob) and the GEM CMA helper also have the same issue Russell pointed out. I think not only the above approach but also performance is very important.

CMA uses coherent memory to back its buffers, though that might not be true of memory obtained from other drivers via dma_buf. Plus, there is no support in the CMA helper for exporting or importing these buffers.

I guess Intel i915 is only used on x86, which is a coherent platform and requires no cache maintenance for DMA.

OMAP DRM does not support importing non-DRM buffers back into DRM. Moreover, it will suffer from the problems I described if any attempt is made to write to the buffer after it has been re-imported.

Lastly, I should point out that the dma_buf stuff is really only useful when you need to export a dma buffer from one driver and import it into another driver - for example, to pass data from a camera device driver to a display device driver. It shouldn't be used within a single driver as a means of passing buffers between userspace and kernel space.
RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
-Original Message- From: Russell King - ARM Linux [mailto:li...@arm.linux.org.uk] Sent: Tuesday, June 18, 2013 5:43 PM To: Inki Dae Cc: 'Maarten Lankhorst'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing list'; 'Rob Clark'; 'myungjoo.ham'; 'YoungJun Cho'; 'Daniel Vetter'; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote: So I'd like to ask the other DRM maintainers: what do you think about it? It seems that Intel DRM (maintained by Daniel), OMAP DRM (maintained by Rob) and the GEM CMA helper also have the same issue Russell pointed out. I think not only the above approach but also performance is very important.

CMA uses coherent memory to back its buffers, though that might not be true of memory obtained from other drivers via dma_buf. Plus, there is no support in the CMA helper for exporting or importing these buffers.

That's not so. Please see Dave's drm-next; dmabuf support for the CMA helper has recently been merged there.

I guess Intel i915 is only used on x86, which is a coherent platform and requires no cache maintenance for DMA. OMAP DRM does not support importing non-DRM buffers back into

Correct. TODO yet.

DRM. Moreover, it will suffer from the problems I described if any attempt is made to write to the buffer after it has been re-imported. Lastly, I should point out that the dma_buf stuff is really only useful when you need to export a dma buffer from one driver and import it into another driver - for example, to pass data from a camera device driver to

Most people know that.

a display device driver. It shouldn't be used within a single driver as a means of passing buffers between userspace and kernel space.

What I am trying to do is not really such an ugly thing. What I am trying to do is to notify the kernel, through the dmabuf interface, when the CPU is about to access a buffer. So it's not really about sending the buffer to the kernel. Thanks, Inki Dae
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Tue, Jun 18, 2013 at 06:04:44PM +0900, Inki Dae wrote: -Original Message- From: Russell King - ARM Linux [mailto:li...@arm.linux.org.uk] Sent: Tuesday, June 18, 2013 5:43 PM To: Inki Dae Cc: 'Maarten Lankhorst'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing list'; 'Rob Clark'; 'myungjoo.ham'; 'YoungJun Cho'; 'Daniel Vetter'; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

On Tue, Jun 18, 2013 at 02:27:40PM +0900, Inki Dae wrote: So I'd like to ask the other DRM maintainers: what do you think about it? It seems that Intel DRM (maintained by Daniel), OMAP DRM (maintained by Rob) and the GEM CMA helper also have the same issue Russell pointed out. I think not only the above approach but also performance is very important.

CMA uses coherent memory to back its buffers, though that might not be true of memory obtained from other drivers via dma_buf. Plus, there is no support in the CMA helper for exporting or importing these buffers.

That's not so. Please see Dave's drm-next; dmabuf support for the CMA helper has recently been merged there.

The point stands: CMA is DMA coherent memory. It doesn't need, and must never be, dma-map-sg'd, dma-sync'd or dma-unmap'd.
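A short sketch of why that is, for anyone following along: CMA-backed GEM objects come from the coherent DMA allocator, which hands out memory that both CPU and device can access consistently for the mapping's whole lifetime, so none of the streaming-DMA ownership calls apply. The surrounding function is illustrative only:

    #include <linux/dma-mapping.h>

    static void *alloc_coherent_bo(struct device *dev, size_t size,
                                   dma_addr_t *dma_handle)
    {
        /* Coherent memory: CPU and device views stay consistent, so no
         * dma_map_sg()/dma_sync_sg_for_*()/dma_unmap_sg() is ever needed
         * (and calling them on this memory would be a bug). */
        return dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);
    }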
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae: [...] a display device driver. It shouldn't be used within a single driver as a means of passing buffers between userspace and kernel space.

What I am trying to do is not really such an ugly thing. What I am trying to do is to notify the kernel, through the dmabuf interface, when the CPU is about to access a buffer. So it's not really about sending the buffer to the kernel. Thanks, Inki Dae

The most basic question about why you are trying to implement this sort of thing in the dma_buf framework still stands. Once you have imported a dma_buf into your DRM driver it's a GEM object, and you can and should use the native DRM ioctls to prepare/end a CPU access to this BO. Then, internally to your driver, you can use the dma_buf reservation/fence stuff to provide the necessary cross-device sync.

Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917- |
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Tue, Jun 18, 2013 at 09:00:16AM +0200, Daniel Vetter wrote: On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote: What we need is something along the lines of: (a) dma_buf_map_attachment() _not_ mapping the scatterlist for DMA, or (b) drm_gem_prime_import() not calling dma_buf_map_attachment() at all.

I strongly vote for (b) (plus adding a dma_buf_map_sync interface to allow syncing to other devices/CPU without dropping the DMA mappings). At least that's been the idea behind things, but currently all (x86-based) drm drivers cut corners here massively. Aside: the entire reason for hiding the dma map step in the dma-buf attachment is to allow drivers to expose crazy IOMMUs (e.g. the tiler unit on OMAP) to other devices. So imo (a) isn't the right choice.

That's why I also talk about adding the dma_buf_map/sync callbacks below, so that a dma_buf implementation can do whatever is necessary with the sg at the point of use. However, you are correct that this should be unnecessary, as DRM should only be calling dma_buf_map_attachment() when it needs to know about the memory behind the object. The problem is that people will cache that information - it may also be expensive to set up for the dma_buf, as it involves counting the number of pages, memory allocation, and maybe obtaining the set of pages from shmem. With (a) plus the callbacks below, that step is only performed once, and then the DMA API part is entirely separate - we can have our cake and eat it in that regard.

Note: the existing stuff does have the nice side effect of being able to pass buffers which do not have a struct page * associated with them through the dma_buf API - I think we can still preserve that by having dma_buf provide a couple of new APIs to do the SG list map/sync/unmap, but in any case we need to fix the existing API so that: dma_buf_map_attachment() becomes dma_buf_get_sg(), and dma_buf_unmap_attachment() becomes dma_buf_put_sg() - both getting rid of the DMA direction argument - and then we have four new dma_buf calls: dma_buf_map_sg(), dma_buf_unmap_sg(), dma_buf_sync_sg_for_cpu() and dma_buf_sync_sg_for_device(), which do the actual sg map/unmap via the DMA API *at the appropriate time for DMA*.

Hm, my idea was to just add a dma_buf_sync_attachment for the device-side syncing, since the CPU access stuff is already bracketed with the begin/end cpu access stuff. We might need a sync_for_cpu or so for mmap, but imo mmap support for dma_buf is a bit insane anyway, so I don't care too much about it. Since such dma mappings would be really long-standing in most cases anyway, drivers could just map with BIDIRECTIONAL and do all the real flushing with the new sync stuff.

Note that DMA API debugging doesn't allow you to change the direction argument on an existing mapping (nor should it - again, this is documented in the DMA API stuff in Documentation/). This is where you would need the complete set of four functions I mention above, which reflect the functionality of the DMA API. The dma_buf implementation doesn't _have_ to implement them if they are no-ops - for example, when your dma_buf sg array contains DMA pointers, and the memory is already coherent in some way (for example, it's a separate chunk of memory which isn't part of system RAM).
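A hedged sketch of that last point: an exporter whose backing store is device-local, already-coherent memory (not CPU-cached system RAM) could wire the proposed sync hooks up as no-ops. The hook names are invented to match the proposal above; nothing like them exists in struct dma_buf_ops at the time of this thread:

    #include <linux/dma-buf.h>
    #include <linux/scatterlist.h>

    /* Hypothetical exporter-side hooks for the proposed
     * dma_buf_sync_sg_for_{cpu,device}() calls. */
    static void carveout_sync_sg_for_cpu(struct dma_buf_attachment *attach,
                                         struct sg_table *sgt)
    {
        /* No-op: the sg entries are bus addresses into a dedicated,
         * uncached carveout, so there is nothing to flush or invalidate. */
    }

    static void carveout_sync_sg_for_device(struct dma_buf_attachment *attach,
                                            struct sg_table *sgt)
    {
        /* Equally a no-op, for the same reason. */
    }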
RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
-Original Message- From: Lucas Stach [mailto:l.st...@pengutronix.de] Sent: Tuesday, June 18, 2013 6:47 PM To: Inki Dae Cc: 'Russell King - ARM Linux'; 'linux-fbdev'; 'Kyungmin Park'; 'DRI mailing list'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

Am Dienstag, den 18.06.2013, 18:04 +0900 schrieb Inki Dae: [...] a display device driver. It shouldn't be used within a single driver as a means of passing buffers between userspace and kernel space.

What I am trying to do is not really such an ugly thing. What I am trying to do is to notify the kernel, through the dmabuf interface, when the CPU is about to access a buffer. So it's not really about sending the buffer to the kernel. Thanks, Inki Dae

The most basic question about why you are trying to implement this sort of thing in the dma_buf framework still stands. Once you have imported a dma_buf into your DRM driver it's a GEM object, and you can and should use the native DRM ioctls to prepare/end a CPU access to this BO. Then, internally to your driver, you can use the dma_buf reservation/fence stuff to provide the necessary cross-device sync.

I don't really want this to be used only by DRM drivers. We really need it for all other DMA devices, i.e. V4L2-based drivers. That is what I am trying to do. And my approach uses reservation to use dma-buf resources, but no dma-fence stuff anymore. However, I'm looking into the Radeon DRM driver for why we need the dma-fence stuff, and how we can use it if needed.

Thanks, Inki Dae

Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917- |
[RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
This patch adds a buffer synchronization framework based on DMA BUF[1] and reservation[2] to use dma-buf resources, and based on ww-mutexes[3] for its locking mechanism. The purpose of this framework is not only to couple cache operations and buffer access control to CPU and DMA, but also to provide easy-to-use interfaces for device drivers and potentially user applications (not implemented for user applications yet). This framework can be used for all DMA devices using system memory as a DMA buffer, especially for most ARM-based SoCs.

Changelog v2:
- use atomic_add_unless to avoid a potential bug.
- add a macro for checking valid access types.
- code cleanup.

The mechanism of this framework has the following steps:

1. Register dmabufs to a sync object - a task gets a new sync object and can add one or more dmabufs that the task wants to access. This registering should be performed when a device context or an event context such as a page flip event is created, or before the CPU accesses a shared buffer.

    dma_buf_sync_get(a sync object, a dmabuf);

2. Lock a sync object - a task tries to lock all dmabufs added to its own sync object. Basically, the lock mechanism uses ww-mutexes[3] to avoid deadlocks and to handle race conditions between CPU and CPU, CPU and DMA, and DMA and DMA. Taking a lock means that others cannot access any of the locked dmabufs until the task that locked them unlocks them all. This locking should be performed before DMA or the CPU accesses these dmabufs.

    dma_buf_sync_lock(a sync object);

3. Unlock a sync object - the task unlocks all dmabufs added to its own sync object. The unlock means that the DMA or CPU accesses to the dmabufs have been completed, so that others may access them. This unlocking should be performed after DMA or the CPU has completed its accesses to the dmabufs.

    dma_buf_sync_unlock(a sync object);

4. Unregister one or all dmabufs from a sync object - a task unregisters the given dmabufs from the sync object. This means that the task no longer wants to lock the dmabufs. The unregistering should be performed after DMA or the CPU has completed its accesses to the dmabufs, or when dma_buf_sync_lock() fails.

    dma_buf_sync_put(a sync object, a dmabuf);
    dma_buf_sync_put_all(a sync object);

The described steps may be summarized as: get - lock - CPU or DMA access to buffer(s) - unlock - put.

This framework includes the following two features.

1. Read (shared) and write (exclusive) locks - a task is required to declare the access type when it registers a dmabuf: READ, WRITE, READ DMA, or WRITE DMA. Example code:

    struct dmabuf_sync *sync;

    sync = dmabuf_sync_init(NULL, "test sync");
    dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
    ...

The following can be used as access types:

    DMA_BUF_ACCESS_READ - CPU will access the buffer for read.
    DMA_BUF_ACCESS_WRITE - CPU will access the buffer for read or write.
    DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA - DMA will access the buffer for read.
    DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA - DMA will access the buffer for read or write.

2. Mandatory resource releasing - a task cannot hold a lock indefinitely, yet a task may never try to unlock a buffer after taking its lock. In that case, a timer handler for the corresponding sync object is called after five (the default) seconds, and the timed-out buffer is then unlocked by a workqueue handler to avoid lockups and to reclaim the buffer's resources.

How to use:

1. Allocate and initialize a sync object:

    struct dmabuf_sync *sync;

    sync = dmabuf_sync_init(NULL, "test sync");
    ...

2. Add a dmabuf to the sync object when setting up the DMA-buffer-relevant registers:

    dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
    ...

3. Lock all dmabufs of the sync object before DMA or the CPU accesses the dmabufs:

    dmabuf_sync_lock(sync);
    ...

4. Now the CPU or DMA can access all dmabufs locked in step 3.

5. Unlock all dmabufs added to the sync object after the DMA or CPU access to these dmabufs is completed:

    dmabuf_sync_unlock(sync);

And call the following functions to release all resources:

    dmabuf_sync_put_all(sync);
    dmabuf_sync_fini(sync);

You can refer to actual example code: https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?h=dmabuf-sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94
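Pulled together, the five steps look roughly like this in a using driver. This is a sketch assembled from the snippets above; the surrounding function, return-type assumptions and error handling are invented:

    /* #include <linux/dmabuf-sync.h> -- assumed header name from the RFC */

    static int my_device_run(struct dma_buf *dmabuf)
    {
        struct dmabuf_sync *sync;
        int ret;

        sync = dmabuf_sync_init(NULL, "test sync");            /* step 1 */
        if (IS_ERR(sync))
            return PTR_ERR(sync);

        ret = dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ); /* step 2 */
        if (ret)
            goto out_fini;

        ret = dmabuf_sync_lock(sync);                          /* step 3 */
        if (ret)
            goto out_put;

        /* step 4: CPU or DMA may access the locked dmabuf here */

        dmabuf_sync_unlock(sync);                              /* step 5 */
    out_put:
        dmabuf_sync_put_all(sync);
    out_fini:
        dmabuf_sync_fini(sync);
        return ret;
    }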
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Op 17-06-13 13:15, Inki Dae schreef: [...]
RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
-Original Message- From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com] Sent: Monday, June 17, 2013 8:35 PM To: Inki Dae Cc: dri-de...@lists.freedesktop.org; linux-fb...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org; dan...@ffwll.ch; robdcl...@gmail.com; kyungmin.p...@samsung.com; myungjoo@samsung.com; yj44@samsung.com Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

Op 17-06-13 13:15, Inki Dae schreef: [...]
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
Op 17-06-13 15:04, Inki Dae schreef: [...]
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Mon, Jun 17, 2013 at 10:04:45PM +0900, Inki Dae wrote: It's just to implement a thin sync framework coupling cache operations. This approach is based on dma-buf for a more generic implementation than the Android sync driver or KDS. The described steps may be summarized as: lock - cache operation - CPU or DMA access to buffer(s) - unlock. I think there is no need to get more complicated than such an approach, at least for most devices sharing system memory. Simple is best.

But hang on, doesn't the dmabuf API already provide that? The dmabuf API already has providers using dma_map_sg() and dma_unmap_sg(), and the rules around the DMA API are that:

    dma_map_sg()    /* DMA _ONLY_ has access, CPU should not access */
    dma_unmap_sg()  /* DMA may not access, CPU can access */

It's a little more than that if you include the sync_sg_for_cpu and sync_sg_for_device APIs too - but the above is the general idea. What this means from the dmabuf API point of view is that once you attach to a dma_buf, and call dma_buf_map_attachment() to get the SG list, the CPU doesn't have ownership of the buffer and _must_ _not_ access it via any other means - including using the other dma_buf methods - until either the appropriate dma_sync_sg_for_cpu() call has been made or the DMA mapping has been removed via dma_buf_unmap_attachment().

So, the sequence should be:

    dma_buf_map_attachment()
    /* do DMA */
    dma_buf_unmap_attachment()
    /* CPU can now access the buffer */
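A hedged importer-side sketch of that sequence; attach is assumed to come from dma_buf_attach(), and the two my_* helpers stand in for whatever programs and waits for the device:

    #include <linux/dma-buf.h>
    #include <linux/err.h>

    static int run_dma_on_dmabuf(struct device *dev,
                                 struct dma_buf_attachment *attach)
    {
        struct sg_table *sgt;

        /* Ownership moves to the device; any needed cache
         * maintenance happens inside the map call. */
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt))
            return PTR_ERR(sgt);

        my_start_dma(dev, sgt);     /* hypothetical: program and start DMA */
        my_wait_dma_done(dev);      /* hypothetical: wait for completion */

        /* Ownership returns to the CPU; it may access the buffer again. */
        dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
        return 0;
    }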
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Tue, Jun 18, 2013 at 12:03:31AM +0900, Inki Dae wrote: 2013/6/17 Russell King - ARM Linux li...@arm.linux.org.uk

Exactly right. But that is definitely not my point. Could you please look at the simple example below? (Presume that CPU and DMA share a buffer, and that the buffer is mapped into user space as cacheable.)

    handle1 = drm_gem_fd_to_handle(a dmabuf fd);
    ...
    va1 = drm_gem_mmap(handle1);
    va2 = drm_gem_mmap(handle2);
    va3 = malloc(size);
    ...
    while (conditions) {
        memcpy(va1, some data, size);

No! Well, the first thing to say here is that under the requirements of the DMA API, the above is immediately invalid, because you're writing to a buffer which, under the terms of the DMA API, is currently owned by the DMA agent, *not* by the CPU. You're supposed to call dma_sync_sg_for_cpu() before you do that - but how is userspace supposed to know that requirement? Why should userspace even _have_ to know these requirements of the DMA API?

It's also entirely possible that drm_gem_fd_to_handle() (which indirectly causes dma_map_sg() on the buffer's scatterlist) followed by mmap'ing it into userspace is a bug too, as it has the potential to touch caches or stuff in ways that the DMA or IOMMU may not expect - but I'm not going to make too big a deal about that, because I don't think we have anything that picky.

However, the first point above is the most important one, and exposing the quirks of the DMA API to userland is certainly not a nice thing to be doing. This needs to be fixed - we can't go and enforce an API which is deeply embedded within the kernel all the way out to userland.

What we need is something along the lines of: (a) dma_buf_map_attachment() _not_ mapping the scatterlist for DMA, or (b) drm_gem_prime_import() not calling dma_buf_map_attachment() at all, and for the scatterlist to be mapped for DMA at the point where the DMA operation is initiated, and unmapped at the point where the DMA operation is complete.

So no, the problem is not that we need more APIs and code - we need the existing kernel API fixed so that we don't go exposing userspace to the requirements of the DMA API. Unless we do that, we're going to end up with a huge world of pain, where kernel architecture people need to audit every damned DRM userspace implementation that happens to be run on their platform, and that's not something arch people really can afford to do.

Basically, I think the dma_buf stuff needs to be rewritten with the requirements of the DMA API at the forefront of whosever mind is doing the rewriting.

Note: the existing stuff does have the nice side effect of being able to pass buffers which do not have a struct page * associated with them through the dma_buf API - I think we can still preserve that by having dma_buf provide a couple of new APIs to do the SG list map/sync/unmap, but in any case we need to fix the existing API so that: dma_buf_map_attachment() becomes dma_buf_get_sg(), and dma_buf_unmap_attachment() becomes dma_buf_put_sg() - both getting rid of the DMA direction argument - and then we have four new dma_buf calls: dma_buf_map_sg(), dma_buf_unmap_sg(), dma_buf_sync_sg_for_cpu() and dma_buf_sync_sg_for_device(), which do the actual sg map/unmap via the DMA API *at the appropriate time for DMA*.

So, the summary of this is - at the moment, I regard DRM Prime and dmabuf to be utterly broken in design for architectures such as ARM, where the requirements of the DMA API have to be followed if you're going to have a happy life.
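For concreteness, this is roughly what the DMA API would demand around that memcpy() if the buffer stays mapped for DMA the whole time - exactly the burden Russell argues must not leak out to userland. It is a sketch only: dev and sgt live on the kernel side and are not even visible to the user process, which is the point.

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>
    #include <linux/string.h>

    /* va1 and some_data stand in for the pointers from the example above. */
    static void guarded_cpu_write(struct device *dev, struct sg_table *sgt,
                                  void *va1, const void *some_data, size_t size)
    {
        dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
        memcpy(va1, some_data, size);   /* CPU owns the buffer only here */
        dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
    }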
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Mon, Jun 17, 2013 at 04:42:37PM +0100, Russell King - ARM Linux wrote: [...]
An addendum to the above: I'll also point out that the whole business of having random buffers mapped into userspace and doing DMA on them is _highly_ architecture specific. I'm told that PARISC is one architecture where this does not work, because DMA needs to have some kind of congruence factor programmed into it which can be different between kernel and userspace mappings of the same physical pages. I ran into this when trying to come up with a cross-arch DMA-API mmap API, and PARISC ended up killing the idea off (the remains of that attempt are the dma_mmap_* stuff in ARM's asm/dma-mapping.h).

However, this was for the DMA coherent stuff, not the streaming API, which is what _this_ DMA prime/dma_buf stuff is about. What I'm saying is: think _very_ _carefully_ about userspace having interleaved access to random DMA buffers. Arguably, DRM should _not_ allow this.
Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
On Tue, Jun 18, 2013 at 02:19:04AM +0900, Inki Dae wrote: It seems like all pages of the scatterlist should be mapped for DMA every time a DMA operation is started (or on a drm_xxx_set_src_dma_buffer function call), and the pages should be unmapped from DMA again every time a DMA operation completes: internally, including each cache operation.

Correct.

Isn't that big overhead?

Yes, if you _have_ to do a cache operation to ensure that the DMA agent can see the data the CPU has written.

And if there is no problem, should we accept such overhead?

If there is no problem then why are we discussing this, and why do we need this API extension? :)

Actually, drm_gem_fd_to_handle() includes mapping a given dmabuf into the IOMMU table (just logical data) of the DMA. And then, the device address (or iova) already mapped will be set to the buffer register of the DMA with a drm_xxx_set_src/dst_dma_buffer(handle1, ...) call.

Consider this with a PIPT cache:

    dma_map_sg()   - at this point, the kernel addresses of these buffers
                     are cleaned and invalidated for the DMA

    userspace writes to the buffer; the data sits in the CPU cache.
    Because the cache is PIPT, this data becomes visible to the kernel
    as well.

    DMA is started, and it writes to the buffer

Now, at this point, which is the correct data? The data physically in the RAM which the DMA has written, or the data in the CPU cache? Maybe the answer is that they both are, and the loss of either can be a potential data corruption issue - there is no way to tell which data should be kept, but the system is in an inconsistent state and _one_ of them will have to be discarded.

    dma_unmap_sg() - at this point, the kernel addresses of the buffers
                     are _invalidated_ and any data in those cache lines
                     is discarded

Which also means that the data in userspace will be discarded with PIPT caches.

This is precisely why we have buffer ownership rules associated with the DMA API, which are these:

    dma_map_sg()
        - the buffer transfers ownership from the CPU to the DMA agent.
        - the CPU may not alter the buffer in any way.

    while (cpu_wants_access) {
        dma_sync_sg_for_cpu()
            - the buffer transfers ownership from the DMA to the CPU.
            - the DMA may not alter the buffer in any way.
        dma_sync_sg_for_device()
            - the buffer transfers ownership from the CPU to the DMA.
            - the CPU may not alter the buffer in any way.
    }

    dma_unmap_sg()
        - the buffer transfers ownership from the DMA to the CPU.
        - the DMA may not alter the buffer in any way.

Any violation of that is likely to lead to data corruption.

Now, some may say that the DMA API is only about the kernel mapping. Yes it is, because it takes no regard whatsoever of what happens with the userspace mappings. This is absolutely true when you have VIVT or aliasing VIPT caches.

Now consider that with a PIPT cache, or a non-aliasing VIPT cache (which is exactly the same behaviourally from this aspect), any modification of a userspace mapping is precisely the same as modifying the kernel space mapping - and what you find is that the above rules end up _inherently_ applying to the userspace mappings as well.

In other words, userspace applications which have mapped the buffers must _also_ respect these rules. And there's no way in hell that I'd ever trust userspace to get this anywhere near right, and I *really* do not like these kinds of internal kernel API rules being exposed to userspace. And so the idea that userspace should be allowed to set up DMA transfers via set source buffer / set destination buffer calls into an API is just plain ridiculous. No, userspace should be allowed to ask for please copy from buffer X to buffer Y using this transform.

Also remember that drm_xxx_set_src/dst_dma_buffer() would have to deal with the DRM object IDs for the buffers, and not the actual buffer information itself - that is kept within the kernel, so the kernel knows what's happening.
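The same rules, written out as compilable driver code. One direction (DMA_BIDIRECTIONAL) is used throughout, since - as noted elsewhere in this thread - DMA API debugging forbids changing the direction on a live mapping; the loop count and the touch_buffer_with_cpu() helper are placeholders:

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    extern void touch_buffer_with_cpu(void);    /* placeholder CPU access */

    static void ownership_dance(struct device *dev, struct scatterlist *sgl,
                                int nents, int cpu_rounds)
    {
        if (!dma_map_sg(dev, sgl, nents, DMA_BIDIRECTIONAL))
            return;                 /* device owns the buffer from here */

        while (cpu_rounds--) {
            dma_sync_sg_for_cpu(dev, sgl, nents, DMA_BIDIRECTIONAL);
            touch_buffer_with_cpu();    /* CPU owns it only inside here */
            dma_sync_sg_for_device(dev, sgl, nents, DMA_BIDIRECTIONAL);
        }                           /* device owns it again */

        dma_unmap_sg(dev, sgl, nents, DMA_BIDIRECTIONAL);
        /* CPU owns the buffer again */
    }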
RE: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework
-Original Message- From: Russell King - ARM Linux [mailto:li...@arm.linux.org.uk] Sent: Tuesday, June 18, 2013 3:21 AM To: Inki Dae Cc: Maarten Lankhorst; linux-fbdev; Kyungmin Park; DRI mailing list; Rob Clark; myungjoo.ham; YoungJun Cho; Daniel Vetter; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org Subject: Re: [RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

On Tue, Jun 18, 2013 at 02:19:04AM +0900, Inki Dae wrote: [...] If there is no problem then why are we discussing this, and why do we need this API extension? :)

OK, that is another issue, regardless of dmabuf-sync. It seems reasonable to me even though the overhead is big. Besides, it seems that most DRM drivers have the same issue. Therefore, we may need to solve this issue like below:

- do not map a dmabuf for DMA; just create/update the importer's buffer object.
- map the buffer for DMA every time the DMA starts, or when an IOMMU page fault occurs.
- unmap the buffer from DMA every time a DMA operation completes.

With the above approach, the cache-operation portion of my approach, dmabuf-sync, can be removed. However, I'm not sure that we really have to take the above approach, with its big overhead. Of course, if we don't take it, then user processes would need to do each cache operation before a DMA operation starts and after it completes in some cases: user space mapped with physical memory as cacheable, and CPU and DMA sharing the same buffer.

So I'd like to ask the other DRM maintainers: what do you think about it? It seems that Intel DRM (maintained by Daniel), OMAP DRM (maintained by Rob) and the GEM CMA helper also have the same issue Russell pointed out. I think not only the above approach but also performance is very important.

Thanks, Inki Dae

[...]
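A hedged sketch of the three-step scheme Inki floats there: keep the attachment across the buffer's lifetime, but only hold the DMA mapping across each individual operation. Every name below (my_job, my_hw_kick, the completion hook) is hypothetical:

    #include <linux/dma-buf.h>
    #include <linux/err.h>

    struct my_job {
        struct dma_buf_attachment *attach;  /* from dma_buf_attach() */
        struct sg_table *sgt;
        enum dma_data_direction dir;
    };

    /* Map only when the DMA actually starts. */
    static int my_dma_start(struct my_job *job)
    {
        job->sgt = dma_buf_map_attachment(job->attach, job->dir);
        if (IS_ERR(job->sgt))
            return PTR_ERR(job->sgt);

        my_hw_kick(job);        /* hypothetical: program the device, go */
        return 0;
    }

    /* Unmap as soon as the DMA operation completes. */
    static void my_dma_complete(struct my_job *job)
    {
        dma_buf_unmap_attachment(job->attach, job->sgt, job->dir);
        job->sgt = NULL;
    }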