Re: How to efficiently handle DMA and cache on ARMv7 ? (was Is get_user_pages() enough to prevent pages from being swapped out ?)
On Tue, 2009-08-25 at 16:17 -0700, Laurent Pinchart wrote:
> On Wednesday 26 August 2009 00:02:48 David Xiao wrote:
> > On Tue, 2009-08-25 at 05:53 -0700, Steven Walter wrote:
> > > On Thu, Aug 6, 2009 at 6:25 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote:
> > > [...]
> > > > As far as userspace DMA coherency, the only way you could do it with current kernel APIs is by using get_user_pages(), creating a scatterlist from those, and then passing it to dma_map_sg(). While the device has ownership of the SG, userspace must _not_ touch the buffer until after DMA has completed.
> > > [...]
> > >
> > > Would that work on a processor with VIVT caches? It seems not. In particular, dma_map_page uses page_address to get a virtual address to pass to map_single(). map_single() in turn uses this address to perform cache maintenance. Since page_address() returns the kernel virtual address, I don't see how any cache-lines for the userspace virtual address would get invalidated (for the DMA_FROM_DEVICE case).
> > >
> > > If that's true, then what is the correct way to allow DMA to/from a userspace buffer with a VIVT cache? If not true, what am I missing?
> >
> > page_address() is basically returning page->virtual, which records the virtual/physical mapping for both user/kernel space; and what only matters there is highmem or not.
>
> I'm not sure I get it. Are you implying that a physical page will then be mapped to the same address in all contexts (kernelspace and userspace processes)? Is that even possible? And if not, how could page->virtual store both the initial kernel map and all the userspace mappings?

Sorry for the confusion: page_address() indeed returns only the kernel virtual address. To support VIVT cache maintenance for the user space mappings, I think the dma_map_sg()/dma_map_page() functions, or even struct scatterlist, would have to be modified so that the user virtual address can be passed in.
David

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
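
[Editor's sketch] The VIVT concern raised by Steven can be illustrated with a toy userspace model. This is not kernel code and not how any real ARM cache is implemented: the struct names, the direct-mapped geometry, the two alias addresses and the function user_sees_after_dma() are all assumptions made up for illustration. The model only shows that, when lines are tagged by virtual address, invalidating through the kernel alias leaves a line cached under the user alias untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u    /* hypothetical cache line size */
#define NUM_LINES 64u    /* hypothetical direct-mapped cache */
#define PHYS_SIZE 4096u  /* one physical page of toy "RAM" */

/* One line of a virtually indexed, virtually tagged (VIVT) toy cache. */
struct vivt_line {
    bool valid;
    uint32_t vtag;              /* virtual address of the line start */
    uint8_t data[LINE_SIZE];
};

struct toy_system {
    uint8_t phys[PHYS_SIZE];
    uint32_t kernel_base;       /* kernel virtual alias of the page */
    uint32_t user_base;         /* user virtual alias of the same page */
    struct vivt_line cache[NUM_LINES];
};

/* Translate either alias to an offset into the physical page. */
static uint32_t translate(const struct toy_system *s, uint32_t va)
{
    if (va >= s->kernel_base && va - s->kernel_base < PHYS_SIZE)
        return va - s->kernel_base;
    assert(va >= s->user_base && va - s->user_base < PHYS_SIZE);
    return va - s->user_base;
}

/* CPU load: hit only if the *virtual* tag matches, else fill from RAM. */
static uint8_t cpu_read(struct toy_system *s, uint32_t va)
{
    uint32_t line_va = va & ~(LINE_SIZE - 1u);
    struct vivt_line *l = &s->cache[(line_va / LINE_SIZE) % NUM_LINES];

    if (!l->valid || l->vtag != line_va) {
        memcpy(l->data, &s->phys[translate(s, line_va)], LINE_SIZE);
        l->vtag = line_va;
        l->valid = true;
    }
    return l->data[va - line_va];
}

/* Invalidate by virtual address: only lines tagged with that VA are dropped. */
static void invalidate_range(struct toy_system *s, uint32_t va, uint32_t len)
{
    uint32_t start = va & ~(LINE_SIZE - 1u);

    for (uint32_t line_va = start; line_va < va + len; line_va += LINE_SIZE) {
        struct vivt_line *l = &s->cache[(line_va / LINE_SIZE) % NUM_LINES];
        if (l->valid && l->vtag == line_va)
            l->valid = false;
    }
}

/* Returns the byte userspace reads after a device-to-memory transfer. */
uint8_t user_sees_after_dma(bool invalidate_user_alias)
{
    struct toy_system s = { .kernel_base = 0xC0000000u, .user_base = 0x40000000u };

    memset(s.phys, 0x11, PHYS_SIZE);
    (void)cpu_read(&s, s.user_base);      /* userspace caches the old data */

    /* Cache maintenance before the transfer, through one alias only. */
    invalidate_range(&s, invalidate_user_alias ? s.user_base : s.kernel_base,
                     PHYS_SIZE);

    memset(s.phys, 0x22, PHYS_SIZE);      /* the device writes to RAM */
    return cpu_read(&s, s.user_base);
}
```

In this model, user_sees_after_dma(false) returns the stale value 0x11, while user_sees_after_dma(true) returns the fresh value 0x22, which is the behaviour Steven's question is worried about.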
Re: How to efficiently handle DMA and cache on ARMv7 ? (was Is get_user_pages() enough to prevent pages from being swapped out ?)
On Tue, 2009-08-25 at 05:53 -0700, Steven Walter wrote:
> On Thu, Aug 6, 2009 at 6:25 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote:
> [...]
> > As far as userspace DMA coherency, the only way you could do it with current kernel APIs is by using get_user_pages(), creating a scatterlist from those, and then passing it to dma_map_sg(). While the device has ownership of the SG, userspace must _not_ touch the buffer until after DMA has completed.
> [...]
>
> Would that work on a processor with VIVT caches? It seems not. In particular, dma_map_page uses page_address to get a virtual address to pass to map_single(). map_single() in turn uses this address to perform cache maintenance. Since page_address() returns the kernel virtual address, I don't see how any cache-lines for the userspace virtual address would get invalidated (for the DMA_FROM_DEVICE case).
>
> If that's true, then what is the correct way to allow DMA to/from a userspace buffer with a VIVT cache? If not true, what am I missing?

page_address() basically returns page->virtual, which records the virtual/physical mapping for both user and kernel space; the only thing that matters there is whether the page is in highmem or not.

David
Re: How to efficiently handle DMA and cache on ARMv7 ? (was Is get_user_pages() enough to prevent pages from being swapped out ?)
On Tue, 2009-08-11 at 02:31 -0700, Catalin Marinas wrote:
> On Thu, 2009-08-06 at 22:59 -0700, David Xiao wrote:
> > The V7 speculative prefetching will then probably apply to the DMA coherency issue in general, for both kernel and user space DMA. Could this be addressed by calling dma_cache_maint() inside dma_unmap_sg/single() when the direction is DMA_FROM_DEVICE/DMA_BIDIRECTIONAL, basically to invalidate the related cache lines in case any were filled by prefetching? This assumes dma_unmap_sg/single() is called after each DMA operation has completed.
>
> Theoretically, with speculative prefetching on ARMv7 and the FROM_DEVICE case we need to invalidate the corresponding D-cache lines both before and after the DMA transfer, i.e. in both dma_map_sg and dma_unmap_sg, otherwise there is a risk of stale data in the cache.

The dma_map_sg() code already calls dma_cache_maint() to invalidate the cache lines in the DMA_FROM_DEVICE and DMA_BIDIRECTIONAL cases. The suggestion was to do something similar in dma_unmap_sg() to deal with speculative prefetching on ARMv7; Russell has other postings that discuss the details, including feasibility.

Furthermore, duplicate MMU mappings in the kernel complicate this problem further, as also explained in this email chain, especially for DMA-coherent memory (dma_alloc_coherent()).

David
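
[Editor's sketch] Catalin's point about invalidating both before and after the transfer can be shown with a one-line toy cache. This is a userspace model under made-up assumptions (a single cache line, a single byte of "RAM", and the hypothetical function cpu_sees_after_dma()); it does not reflect the real dma_map_sg()/dma_unmap_sg() implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct toy_buf {
    uint8_t ram;          /* the DMA buffer as seen in memory */
    bool line_valid;      /* whether the cache holds a copy */
    uint8_t line_data;    /* the cached copy */
};

/* Fill the cache line from RAM; used for demand loads and prefetches alike. */
static void line_fill(struct toy_buf *b)
{
    assert(b != NULL);
    b->line_data = b->ram;
    b->line_valid = true;
}

static uint8_t cpu_read(struct toy_buf *b)
{
    if (!b->line_valid)
        line_fill(b);
    return b->line_data;
}

/*
 * Models one DMA_FROM_DEVICE transfer.
 * prefetch_during_dma:  the CPU speculatively fills the line while the
 *                       device still owns the buffer.
 * invalidate_on_unmap:  a second invalidate is done after the transfer.
 * Returns the value the CPU reads once the transfer is complete.
 */
uint8_t cpu_sees_after_dma(bool prefetch_during_dma, bool invalidate_on_unmap)
{
    struct toy_buf b = { .ram = 0xAA };

    (void)cpu_read(&b);            /* old data gets cached */
    b.line_valid = false;          /* invalidate at "map" time */

    if (prefetch_during_dma)
        line_fill(&b);             /* speculation pulls in old data again */

    b.ram = 0xBB;                  /* the device writes the new data */

    if (invalidate_on_unmap)
        b.line_valid = false;      /* invalidate at "unmap" time */

    return cpu_read(&b);
}
```

In this model only the combination (prefetch, no second invalidate) returns the stale 0xAA; every other combination returns 0xBB.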
Re: How to efficiently handle DMA and cache on ARMv7 ? (was Is get_user_pages() enough to prevent pages from being swapped out ?)
On Fri, 2009-08-07 at 13:28 -0700, Russell King - ARM Linux wrote:
> The kernel direct mapping maps all system (low) memory with normal memory cacheable attributes. So using vmalloc, dma_alloc_coherent, using pages in userspace all create duplicate mappings of pages.

If we do want to remove all these duplicate mappings as part of a solution to the speculative prefetching problem, one way is probably to not map all of RAM into the direct-mapped space at paging_init() time, and instead to map it on demand from the different upper layer allocation functions, such as vmalloc/dma_alloc_coherent/do_brk/kmalloc/get_free_pages/etc. The distinction between upper layer allocation functions and non-upper layer ones would then have to be made clear, though.

I know that mapping the RAM at paging_init() time can take advantage of 1M section mappings most of the time, and thus saves many 1KB L2 page tables. But a lot of memory still ends up being remapped with L2 page tables later on, and meanwhile 1KB might not be as precious as it used to be :-)

David
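
[Editor's sketch] The 1KB figure above follows from the ARM short-descriptor page table format: one L2 table covers a 1MB section with 4KB small pages and 4-byte entries. The arithmetic can be written out as below; the function names are made up for illustration, and the sketch only counts hardware L2 tables, not any extra bookkeeping the kernel may keep alongside them.

```c
#include <assert.h>
#include <stddef.h>

#define SECTION_SIZE    (1u << 20)  /* 1MB covered by one L1 entry */
#define SMALL_PAGE_SIZE (1u << 12)  /* 4KB small page */
#define PTE_SIZE        4u          /* bytes per L2 entry */

/* Size of one L2 table: 256 entries of 4 bytes. */
size_t l2_table_size(void)
{
    return (SECTION_SIZE / SMALL_PAGE_SIZE) * PTE_SIZE;
}

/* L2 table memory needed if ram_bytes is mapped with pages, not sections. */
size_t l2_overhead_bytes(size_t ram_bytes)
{
    /* The sketch handles whole sections only. */
    assert(ram_bytes % SECTION_SIZE == 0);
    return (ram_bytes / SECTION_SIZE) * l2_table_size();
}
```

For example, page-mapping 256MB of RAM instead of using sections would cost 256KB of L2 tables under these assumptions.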
Re: How to efficiently handle DMA and cache on ARMv7 ? (was Is get_user_pages() enough to prevent pages from being swapped out ?)
On Thu, 2009-08-06 at 06:06 -0700, Laurent Pinchart wrote:
> Hi Ben,
>
> On Thursday 06 August 2009 13:46:19 Ben Dooks wrote:
> > On Thu, Aug 06, 2009 at 12:08:21PM +0200, Laurent Pinchart wrote:
> > [snip]
> > > The second problem is to ensure cache coherency. As the userspace application will read data from the video buffers, those buffers will end up being cached in the processor's data cache. The driver does need to invalidate the cache before starting the DMA operation (userspace could in theory write to the buffers, but the data will be overwritten by DMA anyway, so there's no need to clean the cache).
> >
> > You'll need to clean the write buffers, otherwise the CPU may have data queued that it has yet to write back to memory.
>
> Good points, thanks.

I thought this should have been taken care of by the CPU-specific dma_inv_range routine. However, in arch/arm/mm/cache-v7.c, v7_dma_inv_range does not drain the write buffer, whereas v6_dma_inv_range does so at the end of its cache maintenance operations. So this is probably something Russell can clarify.

> > > As the cache is of the VIPT (Virtual Index Physical Tag) type, cache invalidation can either be done globally (in which case the cache is flushed instead of being invalidated) or based on virtual addresses. In the last case the processor will need to look physical addresses up, either in the TLB or through hardware table walk.
> > >
> > > I can see three solutions to the DMA/cache problem.
> > >
> > > 1. Flushing the whole data cache right before starting the DMA transfer. There's no API for that in the ARM architecture, so a whole I+D cache flush is required. This is quite costly, we're talking about around 30 flushes per second, but it doesn't involve the MMU. That's the solution that I currently use.
> > >
> > > 2. Invalidating only the cache lines that store video buffer data. This requires a TLB lookup or a hardware table walk, so the userspace application MM context needs to be available (no problem there as we're flushing in userspace context) and all pages need to be mapped properly. This can be a problem as, as Hugh pointed out, pages can still be unmapped from the userspace context after get_user_pages() returns. I have experienced one oops due to a kernel paging request failure:
> >
> > If you already know the virtual addresses of the buffers, why do you need a TLB lookup (or am I being dense here?)
>
> The virtual address is used to compute the cache line index, and the physical address is then used when comparing the cache line tag. So the processor (or actually the CP15 coprocessor if I'm not wrong) does a TLB lookup to get the physical address during cache invalidation/flushing.
>
> > > Unable to handle kernel paging request at virtual address 44e12000
> > > pgd = c8698000
> > > [44e12000] *pgd=8a4fd031, *pte=8cfda1cd, *ppte=
> > > Internal error: Oops: 817 [#1] PREEMPT
> > > PC is at v7_dma_inv_range+0x2c/0x44
> > >
> > > Fixing this requires more investigation, and I'm not sure how to proceed to find out if the page fault is really caused by pages being unmapped from the userspace context. Help would be appreciated.
> > >
> > > 3. Mark the pages as non-cacheable. Depending on how the buffers are then used by userspace, the additional cache misses might destroy any benefit I would get from not flushing the cache before DMA.
> > >
> > > I'm not sure how to mark a bunch of pages as non-cacheable though. What usually happens is that video drivers allocate DMA-coherent memory themselves, but in this case I need to deal with an arbitrary buffer allocated by userspace. If someone has any experience with this, it would be appreciated.

Another approach works from the opposite direction: the kernel allocates the non-cached buffer and then mmap()s it into user space. I have done that in a similar situation to try to achieve zero-copy.

David
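
[Editor's sketch] Laurent's explanation of why a by-address cache operation needs a translation can be modelled in a few lines. The geometry below (16KB, 4-way, 64-byte lines, hence 64 sets and a tag equal to the physical page number) is an assumption chosen so that the index bits fall inside the 4KB page offset; struct toy_mapping and vipt_locate() are hypothetical names, and the "page table" is just an array. A missing translation is reported by returning false, which stands in for the fault seen in the oops above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SHIFT 6u    /* assumed 64-byte cache lines */
#define NUM_SETS   64u   /* assumed 16KB, 4-way: 64 sets */
#define PAGE_SHIFT 12u   /* 4KB pages */

/* One entry of a toy page table: virtual page -> physical page. */
struct toy_mapping {
    uint32_t va_page;
    uint32_t pa_page;
    bool present;
};

/*
 * Locate the cache line for a virtual address in a VIPT cache.
 * The set index comes from the virtual address, the tag from the
 * physical address, so a valid translation is required.
 * Returns false if the address is not mapped.
 */
bool vipt_locate(const struct toy_mapping *map, size_t n, uint32_t va,
                 uint32_t *set, uint32_t *tag)
{
    assert(set != NULL && tag != NULL);

    for (size_t i = 0; i < n; i++) {
        if (map[i].present && map[i].va_page == (va >> PAGE_SHIFT)) {
            *set = (va >> LINE_SHIFT) & (NUM_SETS - 1u);  /* virtual index */
            *tag = map[i].pa_page;                        /* physical tag */
            return true;
        }
    }
    return false;  /* no translation: the real hardware would fault here */
}
```

With a mapping of virtual page 0x40001 to physical page 0x80123, the address 0x40001040 lands in set 1 with tag 0x80123; once the mapping is marked not present, the same lookup fails, which is the situation option 2 has to guard against.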