Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

> write-through cache for it (with a simple kernel patch) improves scrolling and moving windows performance by a 4x-5x factor (unless shadow framebuffer is used, which is also not good for performance).

What is the issue with the shadow framebuffer? It does add some extra memory traffic from the final copy, but I would have thought that everything else becoming much faster would more than make up for it.

Soren
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
On Mon, 2009-10-26 at 17:52 +0100, Soeren Sandmann wrote:

> > write-through cache for it (with a simple kernel patch) improves scrolling and moving windows performance by a 4x-5x factor (unless shadow framebuffer is used, which is also not good for performance).
>
> What is the issue with the shadow framebuffer? It does add some extra memory traffic from the final copy, but I would have thought that everything else becoming much faster would more than make up for it.

The shadow framebuffer has one additional source of overhead: the damage-region tracking it does to accumulate multiple updates into a single post-copy. If you do a lot of small writes to the shadow, the damage region gets complex and slows everything down. We have a trivial pathological reproduction case, which involves spraying random small rectangles all over the screen.

It also seems that enabling shadow forces on backing stores, which is another unnecessary memory overhead and probably involves more copying.

It is of course possible to imagine a sloppy damage tracker that reduces overhead by limiting damage-region complexity. ISTR writing one for a VNC server a long time ago, simply because I didn't want or need precise region tracking.

If we don't want to deal with uncached framebuffers in Pixman, and nobody seems entirely comfortable with cached physical framebuffers, maybe it's worth my digging my old code out and seeing if it can be adapted.

--
From: Jonathan Morton jonathan.mor...@movial.com
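To make the "sloppy damage tracker" idea above concrete, here is a minimal sketch in C. It is not Jonathan's VNC code and not the X server's damage implementation; the types `box_t` and `sloppy_damage_t` and both functions are hypothetical names made up for illustration. It assumes the simplest possible policy: all damage is merged into one bounding box, so adding a rectangle costs constant time no matter how many small writes arrive.

```c
#include <stdbool.h>

/* Hypothetical types for illustration; not an X server or pixman API. */
typedef struct {
    int x1, y1, x2, y2;    /* half-open box: [x1, x2) x [y1, y2) */
} box_t;

typedef struct {
    bool  dirty;           /* false while nothing has been damaged */
    box_t extents;         /* bounding box of all damage since last flush */
} sloppy_damage_t;

static int min_int (int a, int b) { return a < b ? a : b; }
static int max_int (int a, int b) { return a > b ? a : b; }

/* Record a damaged rectangle in O(1) by growing a single bounding box. */
static void
sloppy_damage_add (sloppy_damage_t *d, int x, int y, int w, int h)
{
    if (w <= 0 || h <= 0)
        return;                        /* empty rectangles damage nothing */

    if (!d->dirty) {
        d->extents.x1 = x;
        d->extents.y1 = y;
        d->extents.x2 = x + w;
        d->extents.y2 = y + h;
        d->dirty = true;
        return;
    }

    d->extents.x1 = min_int (d->extents.x1, x);
    d->extents.y1 = min_int (d->extents.y1, y);
    d->extents.x2 = max_int (d->extents.x2, x + w);
    d->extents.y2 = max_int (d->extents.y2, y + h);
}

/* Hand back the accumulated box and reset the tracker.
 * Returns false if nothing was damaged since the last flush. */
static bool
sloppy_damage_flush (sloppy_damage_t *d, box_t *out)
{
    if (!d->dirty)
        return false;

    *out = d->extents;
    d->dirty = false;
    return true;
}
```

The trade-off is the one the message describes: the post-copy may cover more pixels than were actually written (two small rectangles in opposite corners damage the whole screen), in exchange for a region that never becomes complex. A real tracker would probably keep a small fixed number of boxes rather than just one.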
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
On Friday 23 October 2009, Koen Kooi wrote:

> > I'm not sure about pixman_gc_t since most of the needed operations are just simple copies. What about starting with just introducing a variant of 'pixman_blt' which is overlapping aware? I created a work-in-progress branch with 'pixman_blt' function (generic C implementation for now) extended to support overlapped source/destination case. A simple test program is also included:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

First, this branch is outdated. There is a new branch with the final code :)
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2

> Would using said branch give me 'magically' a performance boost (e.g. not make firefox unusably slow as it is now on a 600MHz Cortex-A8) or would I need to patch other libs (e.g. xrender) as well?

Not really, it's just a small extension of pixman functionality. Currently the handling of the overlapped blt operation (for software rendering) is done in xorg-server. As it is the responsibility of pixman to provide CPU-specific SIMD optimizations (NEON for ARM Cortex-A8), it would be quite natural to move this work to pixman. So the next steps are to add NEON optimizations to pixman_blt and make sure that xserver takes advantage of these optimizations for the overlapped blit too.

As for improving scrolling performance (and assuming a standard fbdev driver), the most important thing is to improve framebuffer memory performance. Right now framebuffer memory is mapped as noncached writecombine on OMAP3. Enabling write-through cache for it (with a simple kernel patch) improves scrolling and moving windows performance by a 4x-5x factor (unless shadow framebuffer is used, which is also not good for performance). This works fine if nothing but the CPU can modify framebuffer memory. But if a GPU or DSP can also access framebuffer memory, or a compositing manager is used, everything gets more complicated. Cache invalidate operations will have to be inserted in appropriate places in order to ensure memory coherency and a uniform view of its content from all the units. If the default write-back cache is used instead of write-through, cache flush operations are needed too.

Unpatched firefox is also quite slow for another reason - it tries to always work with 32bpp data internally, no matter what color depth is used for the desktop.

--
Best regards,
Siarhei Siamashka
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

> First introduce something like a 'pixman_init' function. Right now CPU type detection is done on the first call to the function. It introduces some minor overhead by having an extra pointer check on each function call. Another problem is that we can't be completely sure that the CPU capabilities detection check is always fully reentrant. For example, some platforms may try to set a signal handler and expect to catch SIGILL or something like this.
>
> This initialization function would just detect CPU capabilities and set some function pointers. The whole CPU-specific implementation of 'pixman_blt' may be just called via this pointer directly by a client. Or 'pixman_blt' can be just a small thunk which does a call via function pointer, passes exactly the same arguments to it and does nothing more. In this case there will be really no excuse for the compiler for not using a tail call, see below.

Adding a pixman_init() that applications would be required to call first would not be a compatible change.

If we are designing new API, then I really think it should be done in such a way that it can be extended to handle the core rendering primitives.

It does likely make sense to make the pixman_implementation_t type public at some point (renamed to pixman_t probably) and then pass it directly to the various entry points. This would be necessary if we add hardware acceleration to pixman.

> > Also, I really don't see much potential for saving here. For a NEON implementation of blt, the callchain would be:
> >
> >     pixman_blt() -> _pixman_implementation_blt() -> neon_blt()
> >
> > and getting rid of delegates wouldn't really affect that at all. You could get rid of the _pixman_implementation_blt() call by making it a macro, but as I mentioned before, gcc turns it into a tail call that reuses the arguments on the stack, so the overhead really is minimal.
>
> On what kind of platform and with which version of gcc are you getting proper tail call here?

I meant that the _pixman_implementation_blt() -> neon_blt() would be a tail call. GCC v 4.3.2 on x86-32 produces:

    _pixman_implementation_blt:
            pushl   %ebp
            movl    %esp, %ebp
            movl    8(%ebp), %edx
            popl    %ebp
            movl    12(%edx), %ecx
            jmp     *%ecx
            .size   _pixman_implementation_blt, .-_pixman_implementation_blt
            .p2align 4,,15

> I don't see it being used and the overhead is rather hefty, which is also confirmed by benchmarking and profiling.

Well, with a microbenchmark you can make anything stand out. Ultimately, this function is called from XCopyArea(), and compared to the marshalling of the client call and the long call chain inside the X server, these 35 instructions or so really are not very significant.

I think Jonathan said that pixman_blt() was getting called once per scanline, but I'm pretty sure that's not the case. (Or if it is, that would be the first thing to fix before worrying about eliminating this call).

Soren
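The dispatch scheme debated in this message (an init step that sets a function pointer, plus a thunk that forwards its arguments unchanged) can be sketched in C as follows. Every name here (`blt_func_t`, `example_init`, `example_blt`, `have_simd`) is hypothetical; this is not pixman's internal layout, and `fast_blt` is only a stand-in for a NEON or SSE2 routine.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; not pixman's actual internals. */
typedef int (*blt_func_t) (const uint32_t *src, uint32_t *dst, int n_pixels);

/* Generic C fallback: copy n_pixels non-overlapping 32bpp pixels. */
static int
generic_blt (const uint32_t *src, uint32_t *dst, int n_pixels)
{
    memcpy (dst, src, (size_t) n_pixels * sizeof (uint32_t));
    return 1;
}

/* Stand-in for a SIMD-optimized routine (NEON, SSE2, ...). */
static int
fast_blt (const uint32_t *src, uint32_t *dst, int n_pixels)
{
    memcpy (dst, src, (size_t) n_pixels * sizeof (uint32_t));
    return 1;
}

/* Defaults to the generic routine, so callers that never call
 * example_init() still get a working (if slower) blt. */
static blt_func_t blt_impl = generic_blt;

/* One-time setup. 'have_simd' stands in for real CPU detection. */
void
example_init (int have_simd)
{
    blt_impl = have_simd ? fast_blt : generic_blt;
}

/* Thunk: forwards exactly the same arguments and does nothing else,
 * so an optimizing compiler is free to emit a jump (tail call)
 * instead of a call/return pair. */
int
example_blt (const uint32_t *src, uint32_t *dst, int n_pixels)
{
    return blt_impl (src, dst, n_pixels);
}
```

Initializing the pointer to the generic routine is one possible way to soften the compatibility concern raised above, since an application that never calls the init function keeps working. Whether the compiler really turns the thunk into a jump depends on the target, the calling convention and the optimization level, which is what the assembly listing in the message is about.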
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
On Thursday 04 June 2009, Soeren Sandmann wrote:

> Siarhei Siamashka siarhei.siamas...@gmail.com writes:
>
> > What kind of guarantees (or lack thereof) are pixman and XRender supposed to provide when dealing with overlapping parts of images?
>
> (Adding xorg-devel).
>
> See this thread:
>
>     http://lists.freedesktop.org/archives/xorg/2008-October/039346.html
>
> The guarantee that I would suggest for Render and pixman is that if any pixel is both read and written in the same request, then the result of the whole request is undefined, except that it obeys the clipping rules.
>
> > The practical use case could be scrolling of data inside of a single big image. If rendering with overlapped source and destination areas is not supported, a temporary image has to be created to achieve expected result and this is an additional performance hit.
>
> Yes, scrolling is one thing that the current pixman API doesn't really provide. 'pixman_blt()' only deals with cases where the source and destination don't overlap.
>
> I think the best solution is to move all of the X primitives (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would probably look something like this:
>
>     void
>     pixman_copy_area (pixman_image_t *src,
>                       pixman_image_t *dest,
>                       pixman_gc_t    *gc,
>                       int src_x, int src_y,
>                       int width, int height,
>                       int dest_x, int dest_y);
>
> and that would be guaranteed to handle overlapping src and dest. A pixman_gc_t would be a new type of object, corresponding to an X GC.
>
> pixman_blt() would then become a deprecated wrapper that would just call pixman_copy_area(). Same for pixman_fill() and a new pixman_fill_rectangles().

I'm not sure about pixman_gc_t since most of the needed operations are just simple copies. What about starting with just introducing a variant of 'pixman_blt' which is overlapping aware?

I created a work-in-progress branch with the 'pixman_blt' function (generic C implementation for now) extended to support the overlapped source/destination case. A simple test program is also included:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

Making use of the already existing SIMD optimized pixel copy functions should provide fast scrolling in all directions except left to right. This special case will require a SIMD optimized backwards copy.

I wonder if it makes sense to drop delegates support for pixman_blt and make the call chain shorter when introducing SIMD optimized copies? It seems to be a little bit overdesigned here.

Running the test program for the current pixman master (SSE2 optimized):

    $ test/overlapped-blt-test
    bpp=8, supported=0, normal_ok=0, overlapped_ok=0
    bpp=16, supported=1, normal_ok=1, overlapped_ok=0
    bpp=32, supported=1, normal_ok=1, overlapped_ok=0

Running the test program for the pixman from this branch with the generic C version of pixman_blt (8bpp now uses a fallback to the generic C implementation):

    $ test/overlapped-blt-test
    bpp=8, supported=1, normal_ok=1, overlapped_ok=1
    bpp=16, supported=1, normal_ok=1, overlapped_ok=0
    bpp=32, supported=1, normal_ok=1, overlapped_ok=0

--
Best regards,
Siarhei Siamashka
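As an illustration of what "overlapping aware" means for a blt, here is a minimal sketch in C. It is not the code from the branch above; `overlapped_blt32` and `pixel_at` are hypothetical helpers, the image is assumed to be 32bpp with a positive stride given in pixels, and both rectangles are assumed to lie inside the buffer. The per-row `memmove` stands in for the SIMD forward and backward copies the message talks about: it covers the same-row (horizontal) overlap, while the loop direction covers the vertical overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Address of pixel (x, y) in a 32bpp image; stride is in pixels.
 * Hypothetical helper for this sketch. */
static uint32_t *
pixel_at (uint32_t *bits, int stride, int x, int y)
{
    return bits + (ptrdiff_t) y * stride + x;
}

/* Copy a width x height block inside one image, allowing the source
 * and destination rectangles to overlap. Hypothetical function, not
 * pixman_blt(). The caller keeps both rectangles inside the image. */
static void
overlapped_blt32 (uint32_t *bits, int stride,
                  int src_x, int src_y,
                  int dst_x, int dst_y,
                  int width, int height)
{
    size_t row_bytes;
    int    y;

    if (width <= 0 || height <= 0)
        return;

    row_bytes = (size_t) width * sizeof (uint32_t);

    if (dst_y <= src_y) {
        /* Moving up or sideways: top-to-bottom order never overwrites
         * a source row that has not been read yet. */
        for (y = 0; y < height; y++)
            memmove (pixel_at (bits, stride, dst_x, dst_y + y),
                     pixel_at (bits, stride, src_x, src_y + y),
                     row_bytes);
    } else {
        /* Moving down: walk bottom-to-top for the same reason. */
        for (y = height - 1; y >= 0; y--)
            memmove (pixel_at (bits, stride, dst_x, dst_y + y),
                     pixel_at (bits, stride, src_x, src_y + y),
                     row_bytes);
    }
}
```

In an optimized version the `memmove` call would be replaced by a forward copy for most cases and by the backwards copy only when source and destination share rows and the destination lies to the right of the source, which is the left-to-right special case mentioned above.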
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
Cairo can do a transformation when copying (like a rotation or scale). I think it is best to say that no guarantee is made when the source and destination intersect. You could say it works if there is no transformation other than integer translation, but it seems a little artificial to me. It seems most consistent to say that nothing is allowed.

Also, on modern graphics systems it is probably a lot easier to write, and just as fast, to just draw the entire image over again in its new position, rather than trying to copy areas. Or if copying an area really does help, making the entire image in an off-screen surface and copying from that would work better.

Soeren Sandmann wrote:

> Siarhei Siamashka siarhei.siamas...@gmail.com writes:
>
> > What kind of guarantees (or lack thereof) are pixman and XRender supposed to provide when dealing with overlapping parts of images?

--
Bill Spitzak, Senior Software Engineer
The Foundry, 1 Wardour Street, London, W1D 6PA, UK
Tel: +44 (0)20 7434 0449 * Fax: +44 (0)20 7434 1550 * Web: www.thefoundry.co.uk
The Foundry Visionmongers Ltd * Registered in England and Wales No: 4642027
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

> What kind of guarantees (or lack thereof) are pixman and XRender supposed to provide when dealing with overlapping parts of images?

(Adding xorg-devel).

See this thread:

    http://lists.freedesktop.org/archives/xorg/2008-October/039346.html

The guarantee that I would suggest for Render and pixman is that if any pixel is both read and written in the same request, then the result of the whole request is undefined, except that it obeys the clipping rules.

> The practical use case could be scrolling of data inside of a single big image. If rendering with overlapped source and destination areas is not supported, a temporary image has to be created to achieve expected result and this is an additional performance hit.

Yes, scrolling is one thing that the current pixman API doesn't really provide. 'pixman_blt()' only deals with cases where the source and destination don't overlap.

I think the best solution is to move all of the X primitives (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would probably look something like this:

    void
    pixman_copy_area (pixman_image_t *src,
                      pixman_image_t *dest,
                      pixman_gc_t    *gc,
                      int src_x, int src_y,
                      int width, int height,
                      int dest_x, int dest_y);

and that would be guaranteed to handle overlapping src and dest. A pixman_gc_t would be a new type of object, corresponding to an X GC.

pixman_blt() would then become a deprecated wrapper that would just call pixman_copy_area(). Same for pixman_fill() and a new pixman_fill_rectangles().

> Generic path in pixman fetches lines into a temporary buffer, so everything should be fine in horizontal direction. For vertical direction, reversing the order of handling lines (and using negative stride) could be also possible. So if overlapped rendering is to be supported, technically there should not be any big problems. Performance would be worse in the case when overlapped rendering has to be sent to the generic path, but it would provide incorrect (or unexpected) results in the fast path anyway. Overlapping-aware fast path functions can be also created.

I am not necessarily opposed to making Render guarantee correct operation for overlapping sources and destinations if someone wants to fix the implementation. However,

- many of the operators aren't all that useful when used with overlapping sources and destinations;

- to be useful for scrolling windows, applications need to get GraphicsExposure events, and Render currently doesn't generate those;

- for images with transformations, you will need to make a copy anyway;

- you'll need to fix the server and drivers so that they either implement this correctly, or fall back to software.

Soren
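The quoted remark about the generic path (fetch each line into a temporary buffer, and reverse the line order with a negative stride when needed) can be sketched in C like this. It is a simplified illustration, not pixman's general compositing code: `generic_copy_lines` and `MAX_LINE_PIXELS` are hypothetical, the image is assumed to be 32bpp with a positive stride in pixels, and only a plain copy is performed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LINE_PIXELS 1024    /* arbitrary scratch size for this sketch */

/* Copy a width x height block inside one 32bpp image through a scanline
 * buffer. Hypothetical function. Returns 0 if the block is wider than
 * the scratch buffer, 1 otherwise. */
static int
generic_copy_lines (uint32_t *bits, int stride,
                    int src_x, int src_y,
                    int dst_x, int dst_y,
                    int width, int height)
{
    uint32_t  line[MAX_LINE_PIXELS];
    ptrdiff_t step = stride;    /* signed per-line step, in pixels */
    int       first = 0;        /* block row handled first */
    size_t    row_bytes;
    int       i;

    if (width <= 0 || height <= 0)
        return 1;
    if (width > MAX_LINE_PIXELS)
        return 0;

    row_bytes = (size_t) width * sizeof (uint32_t);

    if (dst_y > src_y) {
        /* Destination is below the source: start at the last line and
         * use a negative step so the same loop walks bottom-up. */
        first = height - 1;
        step = -step;
    }

    for (i = 0; i < height; i++) {
        uint32_t *src = bits + (ptrdiff_t) (src_y + first) * stride
                             + i * step + src_x;
        uint32_t *dst = bits + (ptrdiff_t) (dst_y + first) * stride
                             + i * step + dst_x;

        memcpy (line, src, row_bytes);  /* fetch the whole line first */
        memcpy (dst, line, row_bytes);  /* then store it */
    }

    return 1;
}
```

Because each line is read completely before any of it is written back, overlap within a line is harmless, which is the "fine in horizontal direction" part of the quote. The negative step handles the vertical case without a second loop, at the cost of the extra copy through the scratch buffer that makes this path slower than a dedicated fast path.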