Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-10-26 Thread Soeren Sandmann
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 write-through cache for it (with a simple kernel patch) improves scrolling
 and moving windows performance by 4x-5x factor (unless shadow framebuffer is
 used, which is also not good for performance). 

What is the issue with the shadow framebuffer? It does add some extra
memory traffic from the final copy, but I would have thought that
everything else becoming much faster would more than make up for it.


Soren
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel


Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-10-26 Thread Jonathan Morton
On Mon, 2009-10-26 at 17:52 +0100, Soeren Sandmann wrote:

  write-through cache for it (with a simple kernel patch) improves scrolling
  and moving windows performance by 4x-5x factor (unless shadow framebuffer is
  used, which is also not good for performance). 
 
 What is the issue with the shadow framebuffer? It does add some extra
 memory traffic from the final copy, but I would have thought that
 everything else becoming much faster, would more than make up for it.

The shadow framebuffer has one additional source of overhead: the
damage-region tracking it does to accumulate multiple updates into a
single post-copy.  If you do a lot of small writes to the shadow, the
damage region gets complex and slows everything down.  We have a trivial
pathological reproduction case, involving spraying random small
rectangles all over the screen.

It also seems that enabling the shadow framebuffer forces backing stores
on, which is another unnecessary memory overhead and probably involves
more copying.

It is of course possible to imagine a sloppy damage tracker, to reduce
overhead by limiting damage-region complexity.  ISTR writing one for a
VNC server a long time ago, simply because I didn't want or need precise
region tracking.
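
A minimal sketch of what such a sloppy tracker might look like, assuming
it collapses all damage into a single bounding box. The type and function
names are hypothetical and are not taken from the X server or from the
VNC code mentioned above. The cost per update stays constant no matter
how many small rectangles are written, at the price of copying more
pixels than strictly necessary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type: one bounding box instead of a precise region. */
typedef struct {
    int x1, y1, x2, y2;   /* half-open box: [x1, x2) x [y1, y2) */
    bool empty;
} sloppy_damage_t;

/* Forget all accumulated damage, e.g. after the post-copy is done. */
static void sloppy_damage_reset(sloppy_damage_t *d)
{
    d->x1 = d->y1 = d->x2 = d->y2 = 0;
    d->empty = true;
}

/* Record a write to rectangle (x, y, w, h) by growing the bounding box. */
static void sloppy_damage_add(sloppy_damage_t *d, int x, int y, int w, int h)
{
    assert(w >= 0 && h >= 0);
    if (w == 0 || h == 0)
        return;                      /* nothing was written */

    if (d->empty) {
        d->x1 = x;
        d->y1 = y;
        d->x2 = x + w;
        d->y2 = y + h;
        d->empty = false;
        return;
    }

    /* Extend each edge only if the new rectangle sticks out past it. */
    if (x < d->x1)     d->x1 = x;
    if (y < d->y1)     d->y1 = y;
    if (x + w > d->x2) d->x2 = x + w;
    if (y + h > d->y2) d->y2 = y + h;
}
```

With this approach the pathological case of many scattered small
rectangles degenerates into one large copy rather than a complex region.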

If we don't want to deal with uncached framebuffers in Pixman, and
nobody seems entirely comfortable with cached physical framebuffers,
maybe it's worth my digging my old code out and seeing if it can be
adapted.

-- 
--
From: Jonathan Morton
  jonathan.mor...@movial.com




Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-10-25 Thread Siarhei Siamashka
On Friday 23 October 2009, Koen Kooi wrote:
  I'm not sure about pixman_gc_t since most of the needed operations are just
  simple copies. What about starting with just introducing a variant
  of 'pixman_blt' which is overlapping aware?
 
  I created a work-in-progress branch with 'pixman_blt' function (generic C
  implementation for now) extended to support overlapped source/destination
  case. A simple test program is also included:
  http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
 
First, this branch is outdated. There is a new branch with the final code :)
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2

 Would using said branch give me 'magically' a performance boost (e.g. 
 not make firefox unusably slow as it is now on an 600MHz cortex a8) or 
 would I need to patch other libs (e.g. xrender) as well?

Not really, it's just a small extension of pixman functionality. Currently
the handling of overlapped blt operation (for software rendering) is done
in xorg-server. As it is the responsibility of pixman to provide CPU-specific
SIMD optimizations (NEON for ARM Cortex-A8), it would be quite natural to
move this work to pixman. So the next steps are to add NEON optimizations
to pixman_blt and make sure that xserver takes advantage of these
optimizations for the overlapped blit too.

As for improving scrolling performance (and assuming a standard fbdev driver),
the most important thing is to improve framebuffer memory performance. Right
now framebuffer memory is mapped as noncached write-combine on OMAP3. Enabling
a write-through cache for it (with a simple kernel patch) improves scrolling
and window-moving performance by a factor of 4-5 (unless a shadow framebuffer
is used, which is also not good for performance). This works fine if nothing
but the CPU can modify framebuffer memory. But if a GPU or DSP can also access
framebuffer memory, or a compositing manager is used, everything gets more
complicated. Cache invalidate operations have to be inserted in the
appropriate places to ensure memory coherency and a uniform view of the
framebuffer content from all the units. If the default write-back cache is
used instead of write-through, cache flush operations are needed too.

Unpatched firefox is also quite slow for another reason - it tries to
always work with 32bpp data internally, no matter what color depth is
used for desktop.

-- 
Best regards,
Siarhei Siamashka


Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-10-20 Thread Soeren Sandmann
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 First introduce something like 'pixman_init' function. Right now CPU type
 detection is done on the first call to the function. It introduces some
 minor overhead by having an extra pointer check on each function call.
 Another problem is that we can't be completely sure that CPU capabilities
 detection check is always fully reentrant. For example, some platforms may
 try to set a signal handler and expect to catch SIGILL or something like
 this.
 
 This initialization function would just detect CPU capabilities and set some
 function pointers. The whole CPU-specific implementation of 'pixman_blt'
 may be just called via this pointer directly by a client. Or 'pixman_blt' can
 be just a small thunk which does a call via function pointer, passes exactly
 the same arguments to it and does nothing more. In this case there will be
 really no excuse for the compiler for not using tail call, see
 below.

Adding a pixman_init() that applications would be required to call
first would not be a compatible change. If we are designing a new API,
then I really think it should be done in such a way that it can be
extended to handle the core rendering primitives.

It does likely make sense to make the pixman_implementation_t type
public at some point (renamed to pixman_t probably) and then pass it
directly to the various entry points. This would be necessary if we
add hardware acceleration to pixman.
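
For reference, the init-plus-thunk arrangement described in the quoted
proposal could be sketched roughly as below. All names here (demo_init,
demo_copy_row, copy_row_impl) are hypothetical and are not pixman API;
the generic memcpy-based routine merely stands in for a CPU-specific
implementation that a real init function would select after probing the
CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature for a row-copy routine; not pixman's real type. */
typedef void (*copy_row_fn)(uint32_t *dst, const uint32_t *src, size_t n);

/* Generic fallback, standing in for a SIMD-optimized variant. */
static void copy_row_generic(uint32_t *dst, const uint32_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Selected implementation; NULL until demo_init() has been called. */
static copy_row_fn copy_row_impl;

/* One-time setup. A real library would detect CPU capabilities here. */
static void demo_init(void)
{
    copy_row_impl = copy_row_generic;
}

/*
 * Thunk: passes exactly the same arguments on and does nothing after the
 * call, so the compiler is free to turn the call into a jump. The assert
 * only guards against a missing demo_init() and vanishes under NDEBUG.
 */
static void demo_copy_row(uint32_t *dst, const uint32_t *src, size_t n)
{
    assert(copy_row_impl != NULL);
    copy_row_impl(dst, src, n);
}
```

The sketch also shows the compatibility concern: any existing caller that
never calls demo_init() would hit the NULL pointer.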

  Also, I really don't see much potential for saving here. For a NEON
  implementation of blt, the callchain would be:
 
 pixman_blt() -> _pixman_implementation_blt() -> neon_blt()
 
  and getting rid of delegates wouldn't really affect that at all. You
  could get rid of the _pixman_implementation_blt() call by making it a
  macro, but as I mentioned before, gcc turns it into a tail call that
  reused the arguments on the stack, so the overhead really is minimal.
 
 On what kind of platform and with which version of gcc are you getting
 proper tail call here? 

I meant that the 

_pixman_implementation_blt() -> neon_blt()

would be a tail call. GCC v 4.3.2 on x86-32 produces:

_pixman_implementation_blt:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        popl    %ebp
        movl    12(%edx), %ecx
        jmp     *%ecx
        .size   _pixman_implementation_blt, .-_pixman_implementation_blt
        .p2align 4,,15

 I don't see it being used and the overhead is rather hefty, which is
 also confirmed by benchmarking and profiling.

Well, with a microbenchmark you can make anything stand out.
Ultimately, this function is called from XCopyArea(), and compared to
the marshalling of the client call and the long call chain inside the
X server, these 35 instructions or so really are not very
significant.

I think Jonathan said that pixman_blt() was getting called once per
scanline, but I'm pretty sure that's not the case. (Or if it is, that
would be the first thing to fix before worrying about eliminating this
call).


Soren


Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-10-19 Thread Siarhei Siamashka
On Thursday 04 June 2009, Soeren Sandmann wrote:
 Siarhei Siamashka siarhei.siamas...@gmail.com writes:
  What kind of guarantees (or the lack of) pixman and XRender are supposed
  to provide when dealing with overlapping parts of images?

 (Adding xorg-devel). See this thread:

 http://lists.freedesktop.org/archives/xorg/2008-October/039346.html

 The guarantee that I would suggest for Render and pixman is that if
 any pixel is both read and written in the same request, then the
 result of the whole request is undefined, except that it obeys the
 clipping rules.

  The practical use case could be scrolling of data inside of a single big
  image. If rendering with overlapped source and destination areas is not
  supported, a temporary image has to be created to achieve expected result
  and this is an additional performance hit.

 Yes, scrolling is one thing that the current pixman API doesn't really
 provide. 'pixman_blt()' only deals with cases where the source and
 destination don't overlap.

 I think the best solution is to move all of the X primitives
 (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would
 probably look something like this:

 void
 pixman_copy_area (pixman_image_t *src,
   pixman_image_t *dest,
   pixman_gc_t *gc,
   int src_x,
   int src_y,
   int width,
   int height,
   int dest_x,
   int dest_y);

 and that would be guaranteed to handle overlapping src and dest. A
 pixman_gc_t would be a new type of object, corresponding to an X GC.

 pixman_blt() would then become a deprecated wrapper that would just
 call pixman_copy_area(). Same for pixman_fill() and a new
 pixman_fill_rectangles().

I'm not sure about pixman_gc_t since most of the needed operations are just
simple copies. What about starting with just introducing a variant
of 'pixman_blt' which is overlapping aware?

I created a work-in-progress branch with 'pixman_blt' function (generic C
implementation for now) extended to support overlapped source/destination
case. A simple test program is also included:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

Making use of the already existing SIMD optimized pixel copy functions should
provide fast scrolling in all directions except left to right. That special
case requires a SIMD optimized backwards copy.
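
A minimal sketch of the idea in generic C, assuming a single 32bpp buffer
with the stride given in pixels. The function name overlapped_blt32 is
hypothetical and this is not the code from the branch above. Rows are
visited top to bottom or bottom to top depending on the vertical
direction, and memmove() stands in for the backwards copy needed when
the data moves to the right within the same row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not pixman's pixman_blt(): copy a width x height
 * block of 32bpp pixels inside one buffer, with possibly overlapping
 * source and destination. 'stride' is in pixels.
 */
static void overlapped_blt32(uint32_t *bits, int stride,
                             int src_x, int src_y,
                             int dest_x, int dest_y,
                             int width, int height)
{
    size_t row_bytes;
    int y;

    assert(width >= 0 && height >= 0 && stride >= width);
    row_bytes = (size_t)width * sizeof *bits;

    if (dest_y <= src_y) {
        /* Moving up or purely sideways: copy rows top to bottom, so a
         * source row is always read before it gets overwritten. */
        for (y = 0; y < height; y++)
            memmove(bits + (dest_y + y) * stride + dest_x,
                    bits + (src_y + y) * stride + src_x,
                    row_bytes);
    } else {
        /* Moving down: copy rows bottom to top for the same reason. */
        for (y = height - 1; y >= 0; y--)
            memmove(bits + (dest_y + y) * stride + dest_x,
                    bits + (src_y + y) * stride + src_x,
                    row_bytes);
    }
}
```

Only the case where source and destination share a row actually needs an
overlap-safe row copy; in every other case the row order alone is enough,
so the existing forward copy functions could be reused there.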

I wonder if it makes sense to drop delegates support for pixman_blt and make
call chain shorter when introducing SIMD optimized copies? It seems to be a
little bit overdesigned here.

Running test program for the current pixman master (SSE2 optimized):

$ test/overlapped-blt-test
bpp=8, supported=0, normal_ok=0, overlapped_ok=0
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0

Running test program for the pixman from this branch with generic
C version of pixman_blt (8bpp now uses a fallback to generic C
implementation):

$ test/overlapped-blt-test
bpp=8, supported=1, normal_ok=1, overlapped_ok=1
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0


-- 
Best regards,
Siarhei Siamashka


Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-06-04 Thread Bill Spitzak
Cairo can do a transformation when copying (like a rotation or scale). I 
think it is best to say that no guarantee is made when the source and 
destination intersect. You could say it works if there is no 
transformation other than integer translation but it seems a little 
artificial to me. It seems most consistent to say that nothing is allowed.

Also, on modern graphics systems it is probably a lot easier to write, and
just as fast, to simply draw the entire image again in its new
position rather than trying to copy areas. Or, if copying an area really
does help, rendering the entire image into an off-screen surface and copying
from that would work better.

Soeren Sandmann wrote:
 Siarhei Siamashka siarhei.siamas...@gmail.com writes:
 
 What kind of guarantees (or the lack of) pixman and XRender are supposed to
 provide when dealing with overlapping parts of images?


-- 
Bill Spitzak, Senior Software Engineer
The Foundry, 1 Wardour Street, London, W1D 6PA, UK
Tel: +44 (0)20 7434 0449 * Fax: +44 (0)20 7434 1550 * Web: 
www.thefoundry.co.uk
The Foundry Visionmongers Ltd * Registered in England and Wales No: 4642027



Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-06-03 Thread Soeren Sandmann
Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 What kind of guarantees (or the lack of) pixman and XRender are supposed to
 provide when dealing with overlapping parts of images?

(Adding xorg-devel). See this thread:

http://lists.freedesktop.org/archives/xorg/2008-October/039346.html

The guarantee that I would suggest for Render and pixman is that if
any pixel is both read and written in the same request, then the
result of the whole request is undefined, except that it obeys the
clipping rules.
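
Under that rule, a caller copying within a single image would have to
check for itself whether a request reads and writes the same pixel. A
minimal sketch of such a check for an untransformed, equally sized source
and destination rectangle follows; the helper name rects_overlap is
hypothetical and not part of pixman or Render.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when a width x height copy from
 * (src_x, src_y) to (dest_x, dest_y) within the same image would read
 * and write at least one common pixel.
 */
static bool rects_overlap(int src_x, int src_y,
                          int dest_x, int dest_y,
                          int width, int height)
{
    assert(width >= 0 && height >= 0);
    if (width == 0 || height == 0)
        return false;                /* empty copy touches no pixels */

    /* Two half-open intervals intersect when each starts before the
     * other one ends; the rectangles overlap when both axes do. */
    return src_x < dest_x + width  && dest_x < src_x + width &&
           src_y < dest_y + height && dest_y < src_y + height;
}
```

A caller could use the result to decide whether a temporary image is
needed before issuing the request.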

 The practical use case could be scrolling of data inside of a single big
 image. If rendering with overlapped source and destination areas is not
 supported, a temporary image has to be created to achieve expected result
 and this is an additional performance hit.

Yes, scrolling is one thing that the current pixman API doesn't really
provide. 'pixman_blt()' only deals with cases where the source and
destination don't overlap.

I think the best solution is to move all of the X primitives
(CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would
probably look something like this:

void
pixman_copy_area (pixman_image_t *src,
  pixman_image_t *dest, 
  pixman_gc_t *gc, 
  int src_x,
  int src_y,
  int width,
  int height,
  int dest_x,
  int dest_y);

and that would be guaranteed to handle overlapping src and dest. A
pixman_gc_t would be a new type of object, corresponding to an X GC.

pixman_blt() would then become a deprecated wrapper that would just
call pixman_copy_area(). Same for pixman_fill() and a new
pixman_fill_rectangles().

 Generic path in pixman fetches lines into a temporary buffer, so everything
 should be fine in horizontal direction. For vertical direction, reversing the
 order of handling lines (and using negative stride) could be also possible.
 So if overlapped rendering is to be supported, technically there should not be
 any big problems. Performance would be worse in the case when overlapped
 rendering has to be sent to the generic path, but it would provide incorrect
 (or unexpected) results in the fast path anyway. Overlapping-aware fast path
 functions can be also created.

I am not necessarily opposed to making Render guarantee correct
operation for overlapping sources and destinations if someone wants to
fix the implementation. However,

- many of the operators aren't all that useful when used with
  overlapping sources and destinations;

- to be useful for scrolling windows, applications need to get
  GraphicsExposure events, and Render currently doesn't generate
  those;

- for images with transformations, you will need to make a copy
  anyway;

- you'll need to fix the server and drivers so that they either
  implement this correctly, or fall back to software.


Soren