Re: ARMSOC X11 plugin issues

2016-06-25 Thread Siarhei Siamashka
On Mon, 4 Apr 2016 13:24:21 +0100
Daniel Stone <dan...@fooishbar.org> wrote:

> Hi,
> 
> On 4 April 2016 at 11:45, Siarhei Siamashka <siarhei.siamas...@gmail.com> 
> wrote:
> > This really must be something new, because I don't remember anything
> > like this when skimming through the EULA back in 2013. Making the EULA
> > even more restrictive seems to contradict with the promise of Jem
> > Davies to provide legally redistributable mali binary drivers that
> > we had in 2014:
> >
> > 
> > http://anandtech.com/comments/8226/ask-the-experts-arm-fellow-jem-davies-answers-your-gpu-questions/409101
> >
> > Can anyone from ARM provide an update on this?  
> 
> Getting fairly off-topic now, but the Midgard (6xx/7xx/8xx) drivers
> are now redistributable.

Hi Daniel,

I'm not sure if this is really offtopic, because it is still related to
the interoperability with the X server after all.

Thanks for your comment. Sorry for a late reply, but I have just noticed
that the ARM Mali Utgard GPU User Space Drivers page now has the
r6p0-01rel0 release with an updated EULA, which is more permissive
compared to the previous r5p0-01rel0 release:


http://malideveloper.arm.com/resources/drivers/arm-mali-utgard-gpu-user-space-drivers/

The page states that "A new and more permissive version was introduced
in January 2016 and all Midgard user-space drivers starting with r6p0
are now distributed under the new terms. The main changes are to allow
redistribution of the binaries under the same EULA, commercial use and
benchmarking. Please read the END_USER_LICENCE_AGREEMENT.txt document
included in the packages for the exact licencing terms."

There is obviously a minor mistake in this text, because it mentions
"Midgard" (most likely a copy-paste issue).

Anyway, it is good that the Utgard (Mali-400/450) binaries now got a
usable license after all these years. And the weird benchmark results
confidentiality clause has been also updated. Thanks to whoever finally
made this happen!

Still I wonder who is going to also provide binaries for the 32-bit
mode and for Mali-400 hardware (unless the userland binaries for
Mali-450 are fully backwards compatible). Also the X11 binaries are
still missing from the downloads page. Can ARM people comment on this?

Maxime, if you happen to have 32-bit redistributable userland
Mali-400 binaries, could you please share them with the others?
Maybe ARM people could even put them on their download page?

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: https://lists.x.org/mailman/listinfo/xorg-devel

Re: ARMSOC X11 plugin issues

2016-04-05 Thread Siarhei Siamashka
On Tue, 5 Apr 2016 11:12:00 -0700
Maxime Ripard <maxime.rip...@free-electrons.com> wrote:

> On Mon, Apr 04, 2016 at 02:46:19PM +0900, Michel Dänzer wrote:
> > On 01.04.2016 15:57, David Garbett wrote:  
> > > The main difference between modesetting and armsoc is that armsoc
> > > supports DRI2, and modesetting doesn't.  
> > 
> > The modesetting driver itself supports DRI2 as well. If it's not working
> > on this platform, it's probably because
> > hw/xfree86/dri2/dri2.c:dri2_probe_driver_name() doesn't know yet which
> > DRI client driver name to use for it.  
> 
> My understanding was that the modesetting driver was supporting DRI2
> through mesa, which we aren't using unfortunately (but instead the
> proprietary ARM OpenGL ES implementation)

Yes, the DRI2 spec does not define the exact interpretation of the
DRI2BUFFER data fields and leaves a lot up to the implementations.

This works fine as long as the DDX (or "plugin" as you call it)
and the client side GL library are interpreting DRI2BUFFER in
the same way and understand each other. But different DRI2
implementations are generally incompatible with each other.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: https://lists.x.org/mailman/listinfo/xorg-devel

Re: ARMSOC X11 plugin issues

2016-04-04 Thread Siarhei Siamashka
Hello Maxime,

On Sat, 2 Apr 2016 15:58:37 +0200
Maxime Ripard <maxime.rip...@free-electrons.com> wrote:

> Hello David,
> 
> Thanks for your answer.
> 
> On Fri, Apr 01, 2016 at 06:57:07AM +, David Garbett wrote:
> > The main difference between modesetting and armsoc is that armsoc
> > supports DRI2, and modesetting doesn't. This is what allows the GLES
> > driver to render to X buffers.  
> 
> Yes, that's what I figured.
> 
> > DRI2 enables any application to pass any X pixmap into the GLES/EGL
> > driver, so all buffers need to be allocated from shareable memory.
> > That's not the case with modesetting - other than the main framebuffer,
> > other allocations (for pixmaps and windows) can just be local to the X
> > server, so don't allocate from the DRM driver.
> > 
> > I've not looked at your DRM driver proposal so I can't really say why it
> > can't cope with the additional allocations. Though I do notice in your
> > armsoc patch that you don't handle scanout buffers any differently. In
> > many systems scanout buffers are a more scarce resource (perhaps they
> > need to be contiguous, or meet certain alignment restrictions).  
> 
> We do agree on that. However, it looks like the arm soc driver does
> way much more allocations, and usually smaller ones, than modesetting
> that basically just allocates a buffer for its framebuffers and that's
> it.
> 
> If I was to guess, I'd say that it allocates buffers for applications
> (and possibly part of the applications window), that eventually
> depletes the CMA pool, even though the GPU is not used.
> 
> I'd expect the DRI buffers to be allocated from this pool, because of
> the reasons you pointed out, but I would also expect that the
> applications do not hit that pool, precisely because it's a scarce
> resource.
> 
> I'm possibly missing something though, or have broken expectations :)
> 
> In our case, we don't really have any restrictions on the memory
> resource locations, and I'm not aware of any particular weird
> alignment constraints either.

Yes, this problem can be solved by using malloc for all pixmaps, and
only migrating them to CMA or any other special allocation method
when needed. We encountered this back in 2013, when xf86-video-mali
had exactly the same design issues as you see now with
xf86-video-armsoc:

http://ssvb.github.io/2013/02/01/new-xf86-video-sunxifb-ddx-driver.html

As far as I know, Daniel Drake had some fixes for xf86-video-armsoc,
but I have no idea which of the armsoc forks is the most up to date
right now.

There are also a few tricks to ensure that the 2D performance
is not regressed in Xorg by integrating the mali drivers, and
also to get zero-copy tear-free OpenGL ES rendering in windows.
This all is now available in the xf86-video-fbturbo DDX (renamed
from xf86-video-sunxifb) if anyone is interested.

And I'm happy to know that you are working on the kernel side and
doing the conversion of the Allwinner's custom fbdev based kernel
interface into a more standard DRM/KMS interface. This had to be
done by somebody and it is better late than never.

> > > > Then, we noticed (using xfce4, on debian jessie) that the systray
> > > > icons were not displayed for some reason. There's also some game
> > > > (alex4 [4]), that starts, runs, but the window content remains black
> > > > (but it remains interactive, audio plays and if we take a screenshot,
> > > > the content is on the image, but the screen remains black).  
> 
> That's what concerns me the most though. We can always work around the
> memory allocations by playing cat and mouse and allocating a bigger
> pool (even though I'd really like to avoid that).
> 
> However, those glitches are kind of weird, and not really convenient :/

Well, the mali drivers EULA kind of warns that a problem-free experience
is not to be expected. And they deliver what is promised :-)

The EULA now has an extra clause in the "RESTRICTIONS ON USE OF THE
SOFTWARE" paragraph, which basically says that any benchmark results
or reports about quality issues are not allowed to be disclosed without
an explicit permission from ARM:


http://malideveloper.arm.com/resources/drivers/arm-mali-utgard-gpu-user-space-drivers/

This really must be something new, because I don't remember anything
like this when skimming through the EULA back in 2013. Making the EULA
even more restrictive seems to contradict with the promise of Jem
Davies to provide legally redistributable mali binary drivers that
we had in 2014:


http://anandtech.com/comments/8226/ask-the-experts-arm-fellow-jem-davies-answers-your-gpu-questions/409101

Can anyone from ARM provide an update on this?

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: https://lists.x.org/mailman/listinfo/xorg-devel

Re: [PATCH 2/2] glamor: Use float matrix for render transformations

2014-08-21 Thread Siarhei Siamashka
On Mon, 18 Aug 2014 15:15:28 -0700
Keith Packard kei...@keithp.com wrote:

 Siarhei Siamashka siarhei.siamas...@gmail.com writes:
 
  Okay, now I see what's going on. The old glamor code used to convert
  the 16.16 fixed point values to 32-bit floats. This is definitely not
  great because of http://en.wikipedia.org/wiki/Rounding#Double_rounding
  on the 64-bit float - 16.16 fixed point - 32-bit float
  conversion path. Directly passing 32-bit floats to XRENDER should
  somewhat mitigate the accuracy loss, but not totally eliminate it.
 
 You need to separate the coordinate representation issues from the
 transformation representation issues. For coordinates, yes, 32-bit
 floats would lose some of the low bits. For transformation matrices,
 the values aren't coordinates; they're all scaling factors for the
 coordinates.
 
 | A B C |
 | D E F |
 | G H J |
 
 For affine transforms, A, B, D, E generally have small values, related
 to the overall image scaling factor. C and F are the translation while G
 and H are zero and J is one. The small values (A, B, D, E) are
 multiplied by coordinates, the large values (C, F) are multiplied by
 one. Having 24 bits of mantissa for all of these values offers the
 maximum precision for the resulting values.

Software development just does not work this way. You can't rely on
assumptions like For affine transforms, A, B, D, E generally have
small values, the user always provides correct input, this race
condition is so rare that it is unlikely to crash our software and
anything of this sort.

Fixed point 16.16 format at least has predictable good accuracy (less
than 0.5 pixel absolute error in the worst case) for any affine
transformation when working with the coordinates and matrices within
the valid range supported by XRENDER.

Single precision floating point format has much less reliable accuracy
for affine transformations. That's the fact.

 Let's try to construct a really bad case. That's where your matrix
 has values which are just less than 0.5 ULP (1/2**17)
 
 A * x + B * y + C
   x' =  -
 G * x + H * y + J
 
 so, the maximum error here would be where the denominator was small, and
 the maximum magnitude of the matrix values was large so as to keep us
 From scaling the matrix up to provide precision for the smallest
 members.
 
 Here's a matrix which almost reflects in X with a tiny amount of
 non-affine adjustment
 
 | -1032767 |
 |  010 |
 |  e01 |
 
 Set e to 1/2 ULP - epsilon (a tiny bit less than 1/2**17) so that the
 fixed point matrix has a zero there and the floating point matrix has a
 non-zero value. For x, y of -2000,2000:
 
   -1 * -2000 + 0 * 2000 + 3276734767
 x' =  - = --- = 35305.72
e * -2000 + 0 * 2000 + 1   0.98474
 
   -1 * -2000 + 0 * 2000 + 3276734767
 x' ~= - = --- = 34767
0 * -2000 + 0 * 2000 + 1  1
 
 This has an error of about 539 pixels.
 
  Is the accuracy actually good enough with this patch applied and solves
  the off by several pixels problem? The patch still looks messed up
  with the use of a mix of double and float.
 
 The example above is contrived, but shows that fixed point
 introduces considerable inaccuracy when used in the transformation matrix.

Let's consider that

   e = 1.0 / (1  10)

And then try to do the following transform in 3 different ways:

   | 32767 + e-32767 + e0 |   | 32767 |
   | 01 0 | * | 32767 |
   | 00 1 |   | 1 |

a) If we use double precision for everything, then the resulting value
 x = 63.998047
b) If we convert the matrix to 16.16 fixed point format and back to
   double precision, then the resulting value
 x = 63.998047
c) If we convert the matrix to single precision floating point format
   and back to double precision, then the resulting value
 x = 0.00

The example above is contrived, but shows that single precision
floating point introduces considerable inaccuracy when used in the
transformation matrix. Moreover, this is an *affine* transformation.
Unlike the *projective* transformation from your contrived example.

The vast majority of the existing XRENDER users are only ever doing
affine transformations. This is what matters in the real world.

We have only a single real world example of the use of XRENDER
projective transformations (if you know more, please tell me). And
this example is the xrandr tool, where the transformation accuracy
apparently was never considered to be a (serious) problem until now.
If the xrandr developers really cared about the accuracy, they would
have at least tried to implement correct rounding there. IMHO the ugly
protocol level hack is a bit of an overkill for the xrandr problem,
especially considering that an alternative solution exists:
http://lists.x.org

Re: [PATCH renderproto] Add floating point transforms

2014-08-21 Thread Siarhei Siamashka
On Tue, 19 Aug 2014 10:50:08 +0600
Alexander E. Patrakov patra...@gmail.com wrote:

 19.08.2014 03:02, Keith Packard wrote:
  Siarhei Siamashka siarhei.siamas...@gmail.com writes:
 
  Do you have a reproducible testcase for this problem? Being off by
  several pixels seems to be very wrong. Especially if this happens
  often.
 
  Non-affine transforms cause problems. I noticed this with RandR when
  doing keystone correction.
 
  Here's a simple example:
 
  Desired matrix:
 
33.13915858   15.17037037 -16384.
0.23.73634938 0.
-0.00057170   -0.00182142 25.70348287
 
  Rounded to 16.16 fixed matrix:
 
33.13916016   15.17036438 -16384.
0.23.73634338 0.
-0.00056458   -0.00181580 25.70347595
 
  Actual bits:
 
{ 0x00014dba, 0x98c6, 0xfd7b7d45 },
{ 0x, 0xef09, 0x },
{ 0x, 0xfffb, 0x000102d9 },
 
  Take the point 2560,1600 and transform:
 
  desired:4348.04, 1780.87
  actual: 4342.50, 1778.60
 
  error:  5.99
 
 Sorry, but I think that we have bigger problems. Please look at 
 https://bugs.freedesktop.org/show_bug.cgi?id=39949 (which was a blocker 
 for xserver-1.12 and makes mouse input on scaled displays unusable in 
 some cases). Since there was no manpower to review that bug for years, I 
 think that the best course for now is to prohibit all transforms that do 
 not amount to mere translation. You can change my opinion by reviewing 
 the patch in the bug.

Alexander, input handling is a different problem, which is not directly
related to the transform accuracy. But thanks for bringing it up.

Also I got an off-list e-mail about xrandr from another person. Which
clearly shows that there is some interest in it, but people might be
a little bit too shy to complain or remind about problems in public :-)

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: [PATCH renderproto] Add floating point transforms

2014-08-21 Thread Siarhei Siamashka
On Fri, 15 Aug 2014 08:55:18 -0700
Keith Packard kei...@keithp.com wrote:

 Fixed point coordinates don't provide reasonable precision for
 transformation operations; the resulting transforms are often off by
 several pixels.
 
 Allow clients to represent the transformation using either fixed point
 or floating point.

The protocol extension is an action with serious long lasting
consequences. XRENDER is a rather old extension and my understanding
was that it serves mostly to just allow running the existing
applications. Extending the protocol at this point needs to be
clearly justified. The current commit message is a bit vague, uses
weasel words and is not entirely accurate. So some clarifications
would be really appreciated.

* Who are the expected users of this new functionality and what are
  the practical advantages for them? Preferably something measurable
  and easy to confirm.

* Are you expecting the existing applications/toolkits to eventually
  switch from the 16.16 fixed point format to the single precision
  floating point format? Or is it only intended to be used by a few
  selected tools, such as xrandr?

* Are there any possible risks? Is there anything like like performance
  and/or accuracy drop in some use cases? Are the existing drivers and
  backends ready to provide this new functionality without exposing
  problems? Any backwards/forward compatibility concerns to care about?

* Are there any other alternatives to the proposed solution?

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


[PATCH 0/2] Projective transformation accuracy fixes for the xrandr tool

2014-08-20 Thread Siarhei Siamashka
As revealed in http://lists.x.org/archives/xorg-devel/2014-August/043624.html
by Keith Packard, there are some accuracy issues in the way how the xrandr
tool is handling projective transformations. Because being off by several
pixels is a bit too much, some xrandr users are not very happy.

The first patch fixes a very trivial and very obvious rounding problem.

The second patch implements matrix rescaling to ensure that the rounding
errors are significantly reduced specifically for the most critical G and
H matrix coefficients.

As a result, now projective transformations should be pixel accurate for
anything sane that can be done with xrandr and modern high resolution
monitors. But only testing can show. I would be very interested to know
if there are still some problematic use cases remaining.

The search for the optimal matrix scaling factors is relatively computationally
intensive and may have a lot of room for improvement. This should not be a
problem for the xrandr command line tool though, because it does not have to
deal with millions of matrices. Only a single matrix is used per run.

https://github.com/ssvb/xrandr/tree/transform-accuracy

Siarhei Siamashka (2):
  xrandr: Use rounding when converting from float to fixed point matrix
  xrandr: Rescale matrix to improve projective transformations accuracy

 xrandr.c | 185 +--
 1 file changed, 180 insertions(+), 5 deletions(-)

-- 
1.8.3.2

___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


[PATCH 2/2] xrandr: Rescale matrix to improve projective transformations accuracy

2014-08-20 Thread Siarhei Siamashka
When using projective transformations and homogenous coordinates, we
can freely multiply the transformation matrix by a constant. Optimal
selection of this constant has a huge impact on accuracy.

A naive approach is just to upscale the matrix by the largest possible
factor, only ensuring that the matrix coefficints don't excees 32767.
This somewhat reduces the relative error caused by rounding, but is
still not a silver bullet.

A more advanced approach is to focus specifically on G and H
coefficients, because they are contributing to the divisor value
and the rounding errors in them may be scaled manyfold in the
final result.

A simplified variant of this approach is to pick either G or H
coefficient, round it to the nearest value which fits 16.16 fixed
point format and use the ratio between the rounded and the original
coefficient as the constant for rescaling the matrix. After this is
done, either G or H coefficient becomes perfectly accurate and has
no rounding error on the conversion to 16.16 fixed point format.

A better solution is to try minimizing rounding errors for both
G and H at once. This patch is implementing a simple bruteforce
search for the optimal rescaling constant. The performance
should not be an issue for the xrandr command line tool, because
this is done only once per tool run.

Matrices are dumped to the console when using the '--verbose' command
line option and the accuracy can be evaluated. For example, using
the matrix coefficients from
http://lists.x.org/archives/xorg-devel/2014-August/043624.html
provides the following results:

octave:1 {paste the original_matrix and the adjusted_matrix data}

octave:3 x = [2560 ; 1600 ; 1]
octave:4 p1 = original_matrix * x ; p2 = adjusted_matrix * x
octave:5 difference_between_p1_and_p2 = p1 / p1(3) - p2 / p2(3)

   0.0046335
   0.0024114
   0.000

The top two values are showing the absolute error for X and Y
coordinates. The accuracy remains good up to something like
7000x7000, but then degrades quickly.

Signed-off-by: Siarhei Siamashka siarhei.siamas...@gmail.com
---
 xrandr.c | 173 +--
 1 file changed, 170 insertions(+), 3 deletions(-)

diff --git a/xrandr.c b/xrandr.c
index 5ba8c00..e6c3a0f 100644
--- a/xrandr.c
+++ b/xrandr.c
@@ -463,16 +463,183 @@ mode_width (XRRModeInfo *mode_info, Rotation rotation)
 }
 }
 
+/* Something that is way too small for 16.16 fixed point representation */
+#define MATRIX_EPSILON(1.0 / (1  24))
+/* The matrix coefficients can't be larger an this */
+#define MATRIX_FIXED_POINT_MAXVAL 32767.0
+/* Configures the minimal range for scale factor search */
+#define MATRIX_SCALE_RANGE1.5
+/* Bruteforce search runs this number of iterations */
+#define MATRIX_BRUTEFORCE_LOOP_COUNT  1
+
+static Bool
+is_matrix_affine (double matrix[3][3])
+{
+return (fabs (matrix[2][0])  MATRIX_EPSILON 
+   fabs (matrix[2][1])  MATRIX_EPSILON 
+   fabs (matrix[2][2] - 1.0)  MATRIX_EPSILON);
+}
+
+static inline double
+round_to_nearest_integer (double x)
+{
+return floor(x + 0.5);
+}
+
+/*
+ * This function tries to find the best suitable scale factor (from the range
+ * between 'minscale' and 'maxscale'), so that 'orig_a * scale' and
+ * 'orig_b * scale' are very close to some integer values as possible.
+ * A bruteforce search is performed, trying to minimize the squared distance.
+ *
+ * The geometric interpretation of this process is a line, originating
+ * from (0, 0) and going through the first quadrant on a Cartesian plane.
+ * The slope of this line depends on the ratio between 'orig_a' and 'orig_b'.
+ * The line passes near a large number of points with integer coordinates.
+ * The distances between the line and these points are evaluated.
+ */
+static double
+matrix_find_best_scale_factor(double orig_a, double orig_b,
+ double minscale, double maxscale)
+{
+double a = orig_a, b = orig_b;
+double best_sqr_dist = 1.0;
+double best_scale_factor = maxscale;
+double sqr_orig_a, sqr_orig_b;
+double recip_orig_a, recip_orig_b, recip_sqr_orig_a_plus_sqr_orig_b;
+double refscale;
+double refscale_step = (maxscale - minscale) / 
MATRIX_BRUTEFORCE_LOOP_COUNT;
+
+/* Ensure that 'orig_a' and 'orig_b' are both positive non-zero values */
+orig_a = fabs(orig_a);
+orig_b = fabs(orig_b);
+if (orig_a  MATRIX_EPSILON)
+   orig_a = MATRIX_EPSILON;
+if (orig_b  MATRIX_EPSILON)
+   orig_b = MATRIX_EPSILON;
+
+/* Pre-computed constants */
+sqr_orig_a = orig_a * orig_a;
+sqr_orig_b = orig_b * orig_b;
+recip_orig_a = 1.0 / orig_a;
+recip_orig_b = 1.0 / orig_b;
+recip_sqr_orig_a_plus_sqr_orig_b = 1.0 / (sqr_orig_a + sqr_orig_b);
+
+/*
+ * Try different scale factors between 'minscale' and 'maxscale'. They
+ * are only used as the initial approximations though.
+ */
+for (refscale = minscale; refscale

[PATCH 1/2] xrandr: Use rounding when converting from float to fixed point matrix

2014-08-20 Thread Siarhei Siamashka
This prevents unnecessary accuracy loss when the '--transform' option
is used. Also the preparation of a transformation matrix is now moved
to its own separate function.

Signed-off-by: Siarhei Siamashka siarhei.siamas...@gmail.com
---
 xrandr.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/xrandr.c b/xrandr.c
index 366f6dc..5ba8c00 100644
--- a/xrandr.c
+++ b/xrandr.c
@@ -463,6 +463,18 @@ mode_width (XRRModeInfo *mode_info, Rotation rotation)
 }
 }
 
+static void
+make_fixed_point_matrix (XTransform *transform, double matrix[3][3])
+{
+/* XDoubleToFixed does not provide correct rounding, so use a workaround */
+double fixed_1 = XDoubleToFixed (1);
+int k, l;
+
+for (k = 0; k  3; k++)
+   for (l = 0; l  3; l++)
+   transform-matrix[k][l] = floor (matrix[k][l] * fixed_1 + 0.5);
+}
+
 static Bool
 transform_point (XTransform *transform, double *xp, double *yp)
 {
@@ -2908,7 +2920,6 @@ main (int argc, char **argv)
}
if (!strcmp (--transform, argv[i])) {
double  transform[3][3];
-   int k, l;
if (!config_output) argerr (%s must be used after --output\n, 
argv[i]);
if (++i = argc) argerr (%s requires an argument\n, argv[i-1]);
init_transform (config_output-transform);
@@ -2921,10 +2932,7 @@ main (int argc, char **argv)
!= 9)
argerr (failed to parse '%s' as a transformation\n, 
argv[i]);
init_transform (config_output-transform);
-   for (k = 0; k  3; k++)
-   for (l = 0; l  3; l++) {
-   config_output-transform.transform.matrix[k][l] = 
XDoubleToFixed (transform[k][l]);
-   }
+   make_fixed_point_matrix(config_output-transform.transform, 
transform);
config_output-transform.filter = bilinear;
config_output-transform.nparams = 0;
config_output-transform.params = NULL;
-- 
1.8.3.2

___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: [PATCH renderproto] Add floating point transforms

2014-08-19 Thread Siarhei Siamashka
On Mon, 18 Aug 2014 14:02:13 -0700
Keith Packard kei...@keithp.com wrote:

 Siarhei Siamashka siarhei.siamas...@gmail.com writes:
 
  Do you have a reproducible testcase for this problem? Being off by
  several pixels seems to be very wrong. Especially if this happens
  often.
 
 Non-affine transforms cause problems. I noticed this with RandR when
 doing keystone correction.
 
 Here's a simple example:
 
 Desired matrix:
 
  33.13915858  15.17037037 -16384.
  0.   23.73634938 0.
  -0.00057170  -0.00182142 25.70348287
 
 Rounded to 16.16 fixed matrix:
 
  33.13916016  15.17036438 -16384.
  0.   23.73634338 0.
  -0.00056458  -0.00181580 25.70347595
 
 Actual bits:
 
  { 0x00014dba, 0x98c6, 0xfd7b7d45 },
  { 0x, 0xef09, 0x },
  { 0x, 0xfffb, 0x000102d9 },
 
 Take the point 2560,1600 and transform:
 
 desired:4348.04, 1780.87
 actual: 4342.50, 1778.60
 
 error:  5.99

It is possible to pre-multiply all the coefficients in the desired
matrix by 1.147703 before converting it to 16.16 fixed point format.
In this case we get the following rounded coefficients in the matrix:

  38.033912   17.411080   -18803.965952
  0.0027.242279   0.00
  -0.000656   -0.002090   29.499964

The following actual bits:

  { 0x002608ae, 0x0011693d, 0xb68c08b7 },
  { 0x, 0x001b3e06, 0x },
  { 0xffd5, 0xff77, 0x001d7ffe },

And the 2560,1600 point transforms to 4348.03, 1780.86

I found this particular multiplication constant by a simple bruteforce
search (minimizing the rounding errors for the m[2][0] and m[2][1]
matrix coefficients). But quite likely, a reasonably good constant
can be also found by some better performing algorithm, suitable
for realtime use in applications.

Basically, for an affine transform we just need to do correct rounding.
And for a projective transform, we also need to nudge the matrix a bit
in order to minimize rounding errors when converting to 16.16 fixed
point.

Would this be a suitable solution for the xrandr command line tool?

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: [PATCH renderproto] Add floating point transforms

2014-08-18 Thread Siarhei Siamashka
On Fri, 15 Aug 2014 08:55:18 -0700
Keith Packard kei...@keithp.com wrote:

 Fixed point coordinates don't provide reasonable precision for
 transformation operations; the resulting transforms are often off by
 several pixels.

Do you have a reproducible testcase for this problem? Being off by
several pixels seems to be very wrong. Especially if this happens often.

 Allow clients to represent the transformation using either fixed point
 or floating point.
 
 Signed-off-by: Keith Packard kei...@keithp.com

The 16.16 fixed point format should be pixel accurate for the whole
range of coordinates supported by XRENDER (from -32K to 32K). At least
for affine transformations. Projective transformations are a somewhat
special case and don't seem to be used in practice (Cairo does not
use them).

When we are doing matrix multiplications, each transformed coordinate
is a sum of three multiplications. For affine transformations it is
effectively a sum of just two multiplications. Assuming that a perfectly
accurate matrix is getting converted to the 16.16 fixed point format,
we get some rounding errors. In the worst case, the rounding error
for each of the matrix coefficients is 1/65536 (the granularity of
fixed point values) divided by two (because we are rounding to
nearest). Also in the worst case, the coordinates before transformation
can be as large as 32768. So the worst case absolute error is
1/65536/2 * 32768 * 2 = 0.5 pixels for affine transformations.
And the error is naturally smaller than this on average.

32-bit floating point values use the same number of bits as 16.16 fixed
point values for data storage. But the key difference is that the
accuracy of the floating point values is seriously non-uniform. It
tends to be better when the values are close to zero, but degrades
significantly (less bits are allocated to the fractional part) when
the pixel coordinates move further away from zero. In general, floating
point calculations are not associative and this causes additional
headaches. For example, repeating N times x += dx does not
necessarily produce the same result as x += dx * N.

Floating point format supports a wider range of values (unnecessary
if the rest of XRENDER remains restricted to 16-bit coordinates) but
has much worse accuracy in corner cases and is poorly predictable.
The 32-bit floating point format does not look like an obvious
upgrade over 16.16 fixed point.

IMHO this problem needs a bit more debugging before taking any
actions with serious consequences (such as extending the protocol).

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: [PATCH renderproto] Add floating point transforms

2014-08-18 Thread Siarhei Siamashka
On Sun, 17 Aug 2014 14:26:13 -0700
Keith Packard kei...@keithp.com wrote:

 walter harms wha...@bfs.de writes:
 
  Would it make sense to use double for internal calculation ?
  Most systems provide 64bit (or more) fpu-registers these days.
 
 In fact, pixman exposes only double matrices at this point, so the
 server implementation stores both fixed point and floating point
 variants in that format.

Double precision floating point matrices and operations with them
(pixman_f_transform_*) are not really used as a part of the pixman
rendering pipeline. They are more like just the extra bundled
convenience utility functions. When pixman needs a transformation
matrix, this matrix is provided to it using the
'pixman_image_set_transform()' function:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman.h?id=pixman-0.32.6#n806
Which uses the 16.16 fixed point format.

There are also some useless (and even harmful) utility functions, which
are implementing operations on 16.16 fixed point matrices. If anyone
tries to use them, a lot of accuracy gets lost in the intermediate
calculations. A reasonable workflow is to do all the intermediate
calculations with double precision matrices and only convert to the
16.16 fixed point format as the final step before passing them to
pixman. Of course, if somebody knows what he is doing, then it is
possible to prepare a 16.16 fixed point matrix without resorting to
floating point calculations. But special care needs to be taken to
ensure that there is no unnecessary accuracy loss.

 However, a GPU will likely want to stick with 32 bit values, which is
 why I expressed the protocol using 32 bit floats; 64-bit computation in
 the GPU is generally much slower than 32-bit.

This does not necessarily mean that the 32-bit floats should be exposed
by the protocol or the public API. If we really need improved accuracy
(do we?), then 64-bit floats are the way to go.

IMHO it is up to the implementation to decide if the 32-bit floating
point format is accurate enough for the requested operation and take
a fast path.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: [PATCH 2/2] glamor: Use float matrix for render transformations

2014-08-18 Thread Siarhei Siamashka
 to
32-bit float (which only has 24-bit mantissa). The 32-bit floating
point format is somewhat less accurate than necessary for XRENDER.

And the introduction of the 32-bit floating point format in the
protocol is going to be really harmful. If the new applications start
using it for the sake of glamor, then the existing users of software
rendering drivers (which used to be perfectly fine until now), are
going to be messed up by the 64-bit float - 32-bit float - 16.16
fixed point double rounding penalty. Using the 64-bit floating point
format in the protocol is not so bad, because it can store 16.16 fixed
point values without any accuracy loss. However there would be some
performance penalty, caused by the extra conversion. That's the usual
pick your poison situation.

Is the accuracy actually good enough with this patch applied and solves
the off by several pixels problem? The patch still looks messed up
with the use of a mix of double and float.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: xf86-video-armsoc, Mali and exynos4

2013-11-13 Thread Siarhei Siamashka
On Tue, 12 Nov 2013 16:08:27 -0600
Daniel Drake dr...@endlessm.com wrote:

 On Tue, Nov 12, 2013 at 3:49 PM, Luc Verhaegen l...@skynet.be wrote:
  Fwiw, I am happily using a derivative of the standard fbdev driver on my
  sunxi and exynos based devices. Namely fbturbo which is being done by
  Siarhei Siamashka. For sunxi, it includes some knowledge about the
  strangely implemented display engine, for exynos, it currently speeds up
  some operations using neon assembler.
 
 Yes, it does work reasonably well. Unfortunately this does not support
 xrandr resolution switching though

I was considering to get xrandr working at least on sunxi hardware.
So that dual monitor support and screen resolution switching at
runtime works in X11 desktop in a standard way transparently for the
applications and for the user. But then decided that it is not worth
the efforts. And very few people actually asked about this feature.

 which is now made available for free by the underlying exynos_drm
 display driver.

Does xf86-video-modesetting work properly with the exynos_drm kernel
driver on your hardware? It is useful to first check whether the
kernel is really functional.

A year ago just trying to load the modesetting DDX only oopsed the
kernel on my ARM Chromebook. But it is possible that things have
improved since then.

 Maybe we could add KMS support to fbturbo

Right now fbturbo is built on top of the xf86-video-fbdev chassis.
Just because it simply works everywhere and provides the best
compatibility with various hardware. Trying to hack the KMS support
into it may be a little bit invasive and the perfect compatibility
with the ump based mali400 binary blobs can't be guaranteed. The fully
open source driver for mali400 is going to provide a lot more freedom
in shaping a usable graphics stack.

The xf86-video-modesetting would be likely a preferred option instead
of xf86-video-fbdev if it worked everywhere too (at least on the
Raspberry Pi, Allwinner, Rockchip, Exynos4 and Exynos5 based hardware).

Now an interesting question is whether it makes sense to suspend all
the other activity and work on a better kms support in the kernel for
all this hardware simultaneously? Or first try to wait for decent
results on at least one platform (freedreno? armada?), learn from
this experience and then do the same for the other ARM platforms?

My primary motivation for fbturbo/sunxifb was ensuring a usable X11
desktop on the low end ARM hardware right now by fixing/workarounding
some obvious performance issues. It's a stop gap solution, which surely
needs to be replaced in the long run. But any potential replacement
must be confirmed to be better (not just assumed to be better).

 I was just interested in exploring things from the armsoc side
 as the original announcement talks about the combination of X/Mali
 integrated with exynos, DRM and KMS.
 http://lists.x.org/archives/xorg-devel/2012-May/031250.html
 Maybe that level of integration has not yet been reached in public
 code though :(

A lot may happen along the way, no matter how good or bad the original
plan was. I guess that Tom would be the best person to explain the
current status of this work.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: Where does Xorg fallback to ?

2013-09-23 Thread Siarhei Siamashka
On Mon, 23 Sep 2013 10:45:41 +0800
小飞珑 404962...@qq.com wrote:

 Hi, 
I am looking into Xorg for performance optimization, and want to do
 some software acceleration for the Xorg fallback functions when the
 hardware acceleration is not available.

The software fallbacks are partially implemented in the Xorg server
(fb subdirectory) and also partially in http://pixman.org/

 I am now using a notebook with intel display card. In the  UXA driver, I
 force the driver to use fallback funcs instead of the Accelerated funcs.
 As in uxa/uxa-accel.c - uxa_copy_area(), I made uxa_screen-force_fallback = 
 1,
 force uxa driver to call uxa_check_copy_area()-fbCopyArea(), but the
 fbCopyArea seemed NOT to be the fbCopyArea in Xorg fb/fbcopy.c ! For I added
 return NULL immediately  in the beginning of fb/fbcopy.c -
 fbCopyArea(), but made no difference, but if I comment the fbCopyArea()
 in uxa/uxa-unaccel.c the display is ruined. It seemed that the above
 two fbCopyArea() functions are not the same. Where did the Xorg
 fallback to? NOT to the fb/fbcopy.c ? I am eager to know that, could
 some body tell ? Thanks a lot!

The Intel hardware and the UXA driver seem to be a strange choice.

If this notebook is really your target hardware, then you probably
should have a look at SNA.

But if you want to improve Xorg software fallbacks just for the sake
of making them better, then running tests with xf86-video-fbdev or
xf86-video-modesetting driver would be more reasonable. The current
fb code in Xorg is implemented as just naive loops, and whenever
they are heavily used (instead of the optimized assembly in glibc or
pixman), the performance suffers a lot. A good start would be to fix
the XShmPutImage performance, because it is causing real troubles for
the users. This can be workarounded in the DDX drivers, but fixing it
in Xorg would be also nice:

https://github.com/ssvb/xf86-video-fbturbo/issues/9

Unfortunately the person, who reported this problem, has not tried
to also submit a fix to the Xorg yet. But you or anybody else is free
to pick up this job :-)

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel

Re: An iwmmxt optimised shadowUpdateRotate16_270YX, with problems.

2011-08-26 Thread Siarhei Siamashka
On Fri, Aug 26, 2011 at 2:21 PM, Marko Katić drom...@gmail.com wrote:
 Hi there!
 I'm trying to optimise shadow copies from landscape oriented 16bit shadowfb
 to portrait 16bit fb proper. This is done in shadowUpdateRotate16_270YX
 which is located in miext/shadow/shrotpackYX.h. My optimisation is aimed at
 pxa27x and xscale3 arm processors only since it uses iwmmxt asm code. I
 guess it could easily be ported to x86 mmx code too.
 As I understand, the current implementation copies a single pixel at a time,
 like this:
  *win = *sha++;
 win += WINSTEPX(winStride);
 This also means that we're stepping over entire cachelines since every pixel
 of a single shadowfb line has to be copied to a new line of fb proper.
 My patch tries to copy 4x4 pixel blocks prerotated to portrait orientation.
  Basically, it takes 4 lines of shadowfb and divides it into 4x4 blocks.
 Then it rotates them and copies them to fb proper. This way, instead of
 copying a single pixel per fb proper line, it copies four. The rotation code
 is
 done in iwmmxt asm and takes about 0.9 instructions per pixel (assuming the
 4x4 block is already in iwmmxt registers). 4x4 blocks imply that the
 rectangle
 to be copied is width and height aligned to 4 pixels. If not, the patch
 reduces the rectangle to proper alignment with single pixel copies for width
 and height.
 It doesn't really work and i can't find a reason why. The inital Xfbdev
 screen is looking fine, but when i start moving the pointer or windows, all
 i get is garbage.
 The patch was tested on kdrive 1.3.0.0 running in qemu and on a Zaurus
 C-1000.
 If anyone has any suggestions, please do tell.

To get the best performance, it is important to take cache and TLB
misses into account. The 4x4 pixel blocks may be a bit too small.

I think it may be interesting to integrate iwmmxt optimizations into
pixman and then use pixman for doing these rotations in xserver. Right
now there is more or less cache friendly C implementation for rotation
in pixman:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.c?id=pixman-0.22.2#n1620

Rotation is at least partially covered in the pixman test suite ('make
check'), so detecting and fixing the most obvious bugs could be a bit
easier than watching for image corruption on real use. Also the
'affine-test' program from the test suite can be used as an example of
doing rotations on memory buffers with pixman:
http://cgit.freedesktop.org/pixman/tree/test/affine-test.c?id=pixman-0.22.2

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel

Re: An iwmmxt optimised shadowUpdateRotate16_270YX, with problems.

2011-08-26 Thread Siarhei Siamashka
On Fri, Aug 26, 2011 at 9:54 PM, Marko Katić drom...@gmail.com wrote:
 I chose 4x4 block size for my initial implementation. First i wanted to get
 it working. The block size could easily be resized to 8x4, 8x8 or even 12x4
 maybe. TLB misses could be completely avoided or at least cut in less than
 half, i'm also working on an improved pxafb driver that splits the
 framebuffer between internal sram (256K on a pxa270) and ram. The ram part
 is placed in a section mapping, greatly reducing the number of TLB entries.
 It may be possible to place the internal sram in a section mapping too, i
 have to try this. If it's possible, the fb proper and the shadowfb would
 only need 2 tlb entries.

OK, using section mapping is definitely a good idea to eliminate TLB
misses. Still I expect that writing more than 4 pixels (8 bytes) per
scanline to the destination buffer would be useful. The write
combining buffer has such a name for a reason, it helps to utilize
burst writes and make a better use of memory bandwidth. But running
some benchmarks is always better for identifying which memory access
pattern actually works the best.

 Could alignment be the problem with my patch? The shadowfb and fb proper
 lines are probably not 64 bit aligned before calling iwmmxt_rotate_copy...

Yes, very likely. Also mov pc, lr is not thumb interworking safe and
you are clobbering callee saved registers in your code (r4-r11). See:
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/IHI0042D_aapcs.pdf
This can be solved by using push {r4, r6, r7, lr} on entry and pop
{r4, r6, r7, pc} on exit.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel

Re: [PATCH 1/6] make byte swapping macros less opaque to the optimizer

2011-04-29 Thread Siarhei Siamashka
On Fri, Apr 29, 2011 at 1:33 AM, Matt Turner matts...@gmail.com wrote:
 On Thu, Apr 28, 2011 at 4:36 PM, Siarhei Siamashka
 siarhei.siamas...@gmail.com wrote:
 On Thu, Apr 28, 2011 at 2:33 AM, Matt Turner matts...@gmail.com wrote:
 diff --git a/include/misc.h b/include/misc.h
 index 803f5ba..b7a3fd2 100644
 --- a/include/misc.h
 +++ b/include/misc.h
 @@ -240,32 +240,17 @@ xstrtokenize(const char *str, const char* separators);
  #define SwapRestL(stuff) \
     SwapLongs((CARD32 *)(stuff + 1), LengthRestL(stuff))

 -/* byte swap a 32-bit value */
 -#define swapl(x, n) { \
 -                n = ((char *) (x))[0];\
 -                ((char *) (x))[0] = ((char *) (x))[3];\
 -                ((char *) (x))[3] = n;\
 -                n = ((char *) (x))[1];\
 -                ((char *) (x))[1] = ((char *) (x))[2];\
 -                ((char *) (x))[2] = n; }
 -
 -/* byte swap a short */
 -#define swaps(x, n) { \
 -                n = ((char *) (x))[0];\
 -                ((char *) (x))[0] = ((char *) (x))[1];\
 -                ((char *) (x))[1] = n; }
 -
  /* copy 32-bit value from src to dst byteswapping on the way */
 -#define cpswapl(src, dst) { \
 -                 ((char *)(dst))[0] = ((char *) (src))[3];\
 -                 ((char *)(dst))[1] = ((char *) (src))[2];\
 -                 ((char *)(dst))[2] = ((char *) (src))[1];\
 -                 ((char *)(dst))[3] = ((char *) (src))[0]; }
 +#define cpswapl(src, dst) (dst) = lswapl((src))

  /* copy short from src to dst byteswapping on the way */
 -#define cpswaps(src, dst) { \
 -                ((char *) (dst))[0] = ((char *) (src))[1];\
 -                ((char *) (dst))[1] = ((char *) (src))[0]; }
 +#define cpswaps(src, dst) (dst) = lswaps((src))
 +
 +/* byte swap a 32-bit value */
 +#define swapl(x, n) (*(x)) = lswapl(*(x))
 +
 +/* byte swap a short */
 +#define swaps(x, n) (*(x)) = lswaps(*(x))

 This is not an equal replacement for the previous code. The difference
 is that the older variant of 'swapl' could accept any pointer having
 any alignment, but still do byteswap for exactly 4 bytes at that
 memory location.

 Just to perform an additional sanity check, the following test code can be 
 used:

 -#define swapl(x, n) (*(x)) = lswapl(*(x))
 +#define swapl(x, n) do {                    \
 +        char dummytmp1[4 - sizeof(*(x))];   \
 +        char dummytmp2[sizeof(*(x)) - 4];   \
 +        (*(x)) = lswapl(*(x));              \
 +    } while (0)

  /* byte swap a short */
 -#define swaps(x, n) (*(x)) = lswaps(*(x))
 +#define swaps(x, n) do {                    \
 +        char dummytmp1[2 - sizeof(*(x))];   \
 +        char dummytmp2[sizeof(*(x)) - 2];   \
 +        (*(x)) = lswaps(*(x));              \
 +    } while (0)

 And this extra guard check already spots at least one issue:

 swaprep.c:436:2: error: size of array 'dummytmp2' is too large

 http://cgit.freedesktop.org/xorg/xserver/tree/dix/swaprep.c?id=xorg-server-1.10.1#n423

 One more potential pitfall is data alignment. The older variant could
 work correctly with any alignment, but the new variant will fail if
 the pointer is somehow not naturally aligned. So I think the patch is
 very dangerous and every affected line of code needs to be carefully
 reviewed.

 --
 Best regards,
 Siarhei Siamashka

 Good call!

 I wonder if these misalignments are bugs in their own right?

I'm not even sure if there are any real problems related to alignment
here. Just the proposed patch now needs strict alignment and the older
code did not. It may be either fine or may potentially trigger some
alignment traps at runtime on the architectures which need strict
alignment if one of the uses for these macros happens to use unaligned
pointers. To be on a safe side, it is possible to use a trick with
__attribute__((packed)) when compiling with gcc like described in
http://lwn.net/Articles/259732/

void swapl_unaligned(void *x)
{
struct dummy
{
uint32_t data;
} __attribute__((packed));
((struct dummy *)x)-data = __builtin_bswap32(((struct dummy *)x)-data);
}

void swapl_aligned(uint32_t *x)
{
*x = __builtin_bswap32(*x);
}

The above two functions are compiled into the following code on x86-64:

 swapl_unaligned:
   0:   8b 07   mov(%rdi),%eax
   2:   0f c8   bswap  %eax
   4:   89 07   mov%eax,(%rdi)
   6:   c3  retq

0007 swapl_aligned:
   7:   8b 07   mov(%rdi),%eax
   9:   0f c8   bswap  %eax
   b:   89 07   mov%eax,(%rdi)
   d:   c3  retq

And for armv4t it is:

 swapl_unaligned:
   0:   e5d0c001ldrbip, [r0, #1]
   4:   e5d02000ldrbr2, [r0]
   8:   e5d01002ldrbr1, [r0, #2]
   c:   e5d03003ldrbr3, [r0, #3]
  10:   e182240corr r2, r2, ip, lsl #8
  14:   e1822801orr r2, r2, r1, lsl #16
  18:   e1822c03orr r2, r2

Re: [PATCH] fb: Fix memcpy abuse

2011-04-29 Thread Siarhei Siamashka
On Thu, Apr 21, 2011 at 11:37 PM, Adam Jackson a...@redhat.com wrote:
 The memcpy fast path implicitly assumes that the copy walks
 left-to-right.  That's not something memcpy guarantees, and newer glibc
 on some processors will indeed break that assumption.  Since we walk a
 line at a time, check the source and destination against the width of
 the blit to determine whether we can be sloppy enough to allow memcpy.
 (Having done this, we can remove the check for !reverse as well.)

 On an Intel Core i7-2630QM with an NVIDIA GeForce GTX 460M running in
 NoAccel, the broken code and various fixes for -copywinwin{10,100,500}
 gives (edited to fit in 80 columns):

 1: Disable the fastpath entirely
 2: Replace memcpy with memmove
 3: This fix
 4: The code before this fix

  1            2                 3                 4           Operation
 --   ---   ---   ---   
 258000   269000 (  1.04)   544000 (  2.11)   552000 (  2.14)   Copy 10x10
  21300    23000 (  1.08)    43700 (  2.05)    47100 (  2.21)   Copy 100x100
   960      962 (  1.00)     1990 (  2.09)     1990 (  2.07)   Copy 500x500

 So it's a modest performance hit, but correctness demands it, and it's
 probably worth keeping the 2x speedup from having the fast path in the
 first place.

In my opinion, still the best solution is to do this stuff in pixman,
because it has its own SIMD optimized code and can do a lot better job
than memcpy/memmove (for example, it can be improved to use MOVNTx
instructions for x86 when scrolling large areas exceeding L2/L3 cache
size, and this optimization can't be easily done by glibc if
memcpy/memmove is used separately for each individual scanline). I did
some experiments with improving scrolling performance when using
non-hardware accelerated framebuffer earlier and it showed a really
major speedup on ARM:
http://lists.x.org/archives/xorg-devel/2009-November/003536.html

I'm a bit surprised that there were no other people even trying to
push for something similar because the scrolling performance issue is
so obvious. You are more likely to see the people trying to use ARM
devices without hardware acceleration, become horrified by poor
scrolling performance, and then start searching for a way to obtain
and install the proprietary binary drivers :)

 Signed-off-by: Adam Jackson a...@redhat.com
 ---
  fb/fbblt.c |    9 +++--
  1 files changed, 7 insertions(+), 2 deletions(-)

 diff --git a/fb/fbblt.c b/fb/fbblt.c
 index a040298..b955245 100644
 --- a/fb/fbblt.c
 +++ b/fb/fbblt.c
 @@ -65,6 +65,7 @@ fbBlt (FbBits   *srcLine,
     int            n, nmiddle;
     Bool    destInvarient;
     int            startbyte, endbyte;
 +    int     careful;
     FbDeclareMergeRop ();

     if (bpp == 24  !FbCheck24Pix (pm))
 @@ -74,12 +75,16 @@ fbBlt (FbBits   *srcLine,
        return;
     }

 -    if (alu == GXcopy  pm == FB_ALLONES  !reverse 
 +    careful = !((srcLine  dstLine  srcLine + width * (bpp/8)  dstLine) ||
 +                (dstLine  srcLine  dstLine + width * (bpp/8)  srcLine)) 
 ||
 +              (bpp % 8);
 +
 +    if (alu == GXcopy  pm == FB_ALLONES  !careful 
             !(srcX  7)  !(dstX  7)  !(width  7)) {
         int i;
         CARD8 *src = (CARD8 *) srcLine;
         CARD8 *dst = (CARD8 *) dstLine;
 -
 +
         srcStride *= sizeof(FbBits);
         dstStride *= sizeof(FbBits);
         width = 3;

What is the story with the 'reverse' variable? Does it have no use
anymore and needs to be purged later?

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel

Re: [PATCH 1/6] make byte swapping macros less opaque to the optimizer

2011-04-28 Thread Siarhei Siamashka
On Thu, Apr 28, 2011 at 2:33 AM, Matt Turner matts...@gmail.com wrote:
 diff --git a/include/misc.h b/include/misc.h
 index 803f5ba..b7a3fd2 100644
 --- a/include/misc.h
 +++ b/include/misc.h
 @@ -240,32 +240,17 @@ xstrtokenize(const char *str, const char* separators);
  #define SwapRestL(stuff) \
     SwapLongs((CARD32 *)(stuff + 1), LengthRestL(stuff))

 -/* byte swap a 32-bit value */
 -#define swapl(x, n) { \
 -                n = ((char *) (x))[0];\
 -                ((char *) (x))[0] = ((char *) (x))[3];\
 -                ((char *) (x))[3] = n;\
 -                n = ((char *) (x))[1];\
 -                ((char *) (x))[1] = ((char *) (x))[2];\
 -                ((char *) (x))[2] = n; }
 -
 -/* byte swap a short */
 -#define swaps(x, n) { \
 -                n = ((char *) (x))[0];\
 -                ((char *) (x))[0] = ((char *) (x))[1];\
 -                ((char *) (x))[1] = n; }
 -
  /* copy 32-bit value from src to dst byteswapping on the way */
 -#define cpswapl(src, dst) { \
 -                 ((char *)(dst))[0] = ((char *) (src))[3];\
 -                 ((char *)(dst))[1] = ((char *) (src))[2];\
 -                 ((char *)(dst))[2] = ((char *) (src))[1];\
 -                 ((char *)(dst))[3] = ((char *) (src))[0]; }
 +#define cpswapl(src, dst) (dst) = lswapl((src))

  /* copy short from src to dst byteswapping on the way */
 -#define cpswaps(src, dst) { \
 -                ((char *) (dst))[0] = ((char *) (src))[1];\
 -                ((char *) (dst))[1] = ((char *) (src))[0]; }
 +#define cpswaps(src, dst) (dst) = lswaps((src))
 +
 +/* byte swap a 32-bit value */
 +#define swapl(x, n) (*(x)) = lswapl(*(x))
 +
 +/* byte swap a short */
 +#define swaps(x, n) (*(x)) = lswaps(*(x))

This is not an equal replacement for the previous code. The difference
is that the older variant of 'swapl' could accept any pointer having
any alignment, but still do byteswap for exactly 4 bytes at that
memory location.

Just to perform an additional sanity check, the following test code can be used:

-#define swapl(x, n) (*(x)) = lswapl(*(x))
+#define swapl(x, n) do {\
+char dummytmp1[4 - sizeof(*(x))];   \
+char dummytmp2[sizeof(*(x)) - 4];   \
+(*(x)) = lswapl(*(x));  \
+} while (0)

 /* byte swap a short */
-#define swaps(x, n) (*(x)) = lswaps(*(x))
+#define swaps(x, n) do {\
+char dummytmp1[2 - sizeof(*(x))];   \
+char dummytmp2[sizeof(*(x)) - 2];   \
+(*(x)) = lswaps(*(x));  \
+} while (0)

And this extra guard check already spots at least one issue:

swaprep.c:436:2: error: size of array 'dummytmp2' is too large

http://cgit.freedesktop.org/xorg/xserver/tree/dix/swaprep.c?id=xorg-server-1.10.1#n423

One more potential pitfall is data alignment. The older variant could
work correctly with any alignment, but the new variant will fail if
the pointer is somehow not naturally aligned. So I think the patch is
very dangerous and every affected line of code needs to be carefully
reviewed.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel

Re: Premultiplied alpha handle for driver

2010-06-01 Thread Siarhei Siamashka
On Tuesday 01 June 2010, Jonathan Morton wrote:
 On Mon, 2010-05-31 at 10:42 +0800, Huang, FrankR wrote:
  We know the premultiplied alpha will be used sometimes in blend. Is
  that used in the graphics driver?

 XRender always expects premultiplied alpha in images.  Your previous
 example, with ARGB values of (100,255,0,0) is therefore an invalid pixel
 vector, since the R channel is greater than the A channel.

This is not necessarily invalid. It's more like additive blending feature
which is even used by some people:
http://home.comcast.net/~tom_forsyth/blog.wiki.html#[[Premultiplied%20alpha]]

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel


Re: [cairo] [PATCH, pixman] ARM NEON optimized pixman_blt overlapped stuff

2010-01-14 Thread Siarhei Siamashka
On Friday 20 November 2009, Koen Kooi wrote:
 On 18-11-09 14:41, Siarhei Siamashka wrote:
  But a bit more speed can be gained by tweaking pixman_blt to support
  overlapped source/destination areas and making use of SIMD optimizations
  there:
  http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped_blt

 Using those patches on top of 0.17.3 + the xserver patch seem to be
 working without problems on beagleboard. I see no artifacts when
 scrolling in midori or arora.

It does not just work, but should improve performance too :) On ARM Cortex-A8,
using NEON optimized functions from pixman for overlapped copies is at least
1.5x faster for the cases when the code from fbBlt in X server would be used
instead.

Looks like firefox 3.6 now also joined the list of applications which directly
benefit from fast overlapped blits. Earlier versions of firefox did not seem
to use such blits (much).

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel


[PATCH, pixman] ARM NEON optimized pixman_blt overlapped stuff

2009-11-18 Thread Siarhei Siamashka
Hello,

Here is a branch with some basic ARM NEON optimizations for pixman_blt:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=pixman_blt_neon
These are quite similar to MMX/SSE2 and don't introduce any controversy,
so I hope that they can be pushed to pixman master soon unless somebody
has objections.

But a bit more speed can be gained by tweaking pixman_blt to support
overlapped source/destination areas and making use of SIMD optimizations
there:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped_blt

Making real use of overlapped pixman_blt also requires patching xserver and
the most simple way to do it is included in the attached test patch (not
intended to be applied as-is). I wonder if it just makes sense to apply
overlapped_blt patches to pixman now? And then later bump minimal required
pixman version in xserver from current 0.15.20 to the one which supports
overlapped blt operations?

Overlapped blitting is used in many places like GTK file dialog, arora web
browser (scrolling page back) and some other applications.

Comments are very much welcome.

I did not include the changes to drop delegates support in this patchset
because they only distracted everyone from the main topic earlier when
overlapped pixman_blt had been discussed :-)

-- 
Best regards,
Siarhei Siamashka
diff --git a/fb/fbcopy.c b/fb/fbcopy.c
index 07eb663..ba394b7 100644
--- a/fb/fbcopy.c
+++ b/fb/fbcopy.c
@@ -91,8 +91,7 @@ fbCopyNtoN (DrawablePtr	pSrcDrawable,
 while (nbox--)
 {
 #ifndef FB_ACCESS_WRAPPER /* pixman_blt() doesn't support accessors yet */
-	if (pm == FB_ALLONES  alu == GXcopy  !reverse 
-	!upsidedown)
+	if (pm == FB_ALLONES  alu == GXcopy)
 	{
 	if (!pixman_blt ((uint32_t *)src, (uint32_t *)dst, srcStride, dstStride, srcBpp, dstBpp,
 			 (pbox-x1 + dx + srcXoff),


signature.asc
Description: This is a digitally signed message part.
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel


Re: [cairo] [PATCH, pixman] Eliminate fast path flags

2009-11-18 Thread Siarhei Siamashka
On Tuesday 17 November 2009, Soeren Sandmann wrote:
 Siarhei Siamashka siarhei.siamas...@gmail.com writes:
   Reviews appreciated. I'd also appreciate if people can test that it
   still compiles and works on ARM as I had to make some changes blindly
   there.
 
  It compiles fine, but requires a minor fix for ARM NEON. These two lines
  need a 'PIXMAN_a8' - 'PIXMAN_solid' change:
 
  +{ PIXMAN_OP_OVER, PIXMAN_a8r8g8b8, PIXMAN_a8,   PIXMAN_a8r8g8b8,
  neon_composite_over__n_, 0 },
  +{ PIXMAN_OP_OVER, PIXMAN_a8r8g8b8, PIXMAN_a8,   PIXMAN_x8r8g8b8,
  neon_composite_over__n_, 0 },
 
  This was spotted by just patch review.

 Thanks - I have fixed this and pushed the patch.

Great!

  But it's also worth mentioning that
  using 'blitters-test', the problem gets only detected on 3280827th
  iteration. And the test is currently set to run 200 iterations by
  default. Just the probability of getting right operation and right
  formats for source, mask and destination all at the same time is a bit
  too low.
 
  Tweaking probabilities for more uniform coverage may help a bit.
  Increasing the number of iterations that are run by default may also be
  useful. But still, an overnight run makes sense to spot some of the more
  rare problems.

 Yeah, we should probably at have at least two CRC32 values, one for
 the 2 million case, and one for a 100 million case that could run
 overnight.

 It might also be interesting to have many more than two values, and
 store them in a table where each row would contain both CRC and seed
 for 2 million iterations, for 4 million, for 6 million and so on.

 The test could then pick a range to run at random, which would
 substantially increase the coverage.

The problem is that performance of different systems may vary a lot, so 100
millions may be fine for one system, but too much for the other. I'm currently
using a variation of the attached script to test pixman fast path functions.
The longer it runs, more chances that it will spot problems (if they exist).

For modern systems it probably also makes sense to add multithreading
support to the test so that it can utilize all the available CPU cores.

-- 
Best regards,
Siarhei Siamashka


test-pixman-fast-paths.sh
Description: application/shellscript


signature.asc
Description: This is a digitally signed message part.
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel


Re: [PATCH, pixman] ARM NEON optimized pixman_blt overlapped stuff

2009-11-18 Thread Siarhei Siamashka
On Wednesday 18 November 2009, Koen Kooi wrote:
 On 18-11-09 14:41, Siarhei Siamashka wrote:
  Making real use of overlapped pixman_blt also requires patching xserver
  and the most simple way to do it is included in the attached test patch
  (not intended to be applied as-is).

 How many kittens will get killed if I provide a pixman with overlapped
 blit support and apply your patch to Xorg after that?

Everything should work fine except that there will be no way for downgrading
pixman. Patched xserver will show bad image artefacts on the use of overlapped
blits when run with older unpatched versions of pixman.

The other way around is always fine. Patches for pixman have no effect on
interoperability with any versions of xserver.

 I suspect overlapped blits will also benefit from a cached framebuffer
 as well, right?

Yes, at least for me the best type of 'non-3d-accelerated desktop' experience
on OMAP3 devices can be achieved with write-through cached 16bpp framebuffer
(and ShadowFB option explicitly disabled). But it still needs a few more NEON
optimizations to address all the bottlenecks.

-- 
Best regards,
Siarhei Siamashka


signature.asc
Description: This is a digitally signed message part.
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel


[PATCH][pixman] C fast path function for 'over_n_1_0565'

2009-11-17 Thread Siarhei Siamashka
Hello,

This patch provides more than 2x performance improvement for this particular
case (terminal, bitmap font, 16bpp desktop color depth). Benchmarked on
ARM Cortex-A8.

Tested for correctness on both big and little endian systems.

-- 
Best regards,
Siarhei Siamashka
From 712e3cc5b9eac0b3ed8dabe644fa4ea2f192c58b Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka siarhei.siamas...@nokia.com
Date: Mon, 9 Nov 2009 14:10:00 +0200
Subject: [PATCH] C fast path function for 'over_n_1_0565'
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

This function is needed to improve performance of xfce4 terminal when
using bitmap fonts and running with 16bpp desktop. Some other applications
may potentially benefit too.

After applying this patch, top functions from Xorg process in
oprofile log change from

samples  %image name   symbol name
1329629.1528  libpixman-1.so.0.17.1combine_over_u
6452 14.1466  libpixman-1.so.0.17.1fetch_scanline_r5g6b5
5516 12.0944  libpixman-1.so.0.17.1fetch_scanline_a1
2273  4.9838  libpixman-1.so.0.17.1store_scanline_r5g6b5
1741  3.8173  libpixman-1.so.0.17.1fast_composite_add_1000_1000
1718  3.7669  libc-2.9.so  memcpy

to

samples  %image name   symbol name
5594 14.7033  libpixman-1.so.0.17.1fast_composite_over_n_1_0565
4323 11.3626  libc-2.9.so  memcpy
3695  9.7119  libpixman-1.so.0.17.1fast_composite_add_1000_1000

when scrolling text in terminal (reading man page).
---
 pixman/pixman-fast-path.c |  100 +
 1 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index d2c456b..75a0c1e 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1175,6 +1175,104 @@ fast_composite_over_n_1_ (pixman_implementation_t *imp,
 }
 }
 
+static void
+fast_composite_over_n_1_0565 (pixman_implementation_t *imp,
+  pixman_op_t  op,
+  pixman_image_t * src_image,
+  pixman_image_t * mask_image,
+  pixman_image_t * dst_image,
+  int32_t  src_x,
+  int32_t  src_y,
+  int32_t  mask_x,
+  int32_t  mask_y,
+  int32_t  dest_x,
+  int32_t  dest_y,
+  int32_t  width,
+  int32_t  height)
+{
+uint32_t src, srca;
+uint16_t*dst, *dst_line;
+uint32_t*mask, *mask_line;
+int  mask_stride, dst_stride;
+uint32_t bitcache, bitmask;
+int32_t  w;
+uint32_t d;
+uint16_t src565;
+
+if (width = 0)
+	return;
+
+src = _pixman_image_get_solid (src_image, dst_image-bits.format);
+srca = src  24;
+if (src == 0)
+	return;
+
+PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t,
+   dst_stride, dst_line, 1);
+PIXMAN_IMAGE_GET_LINE (mask_image, 0, mask_y, uint32_t,
+   mask_stride, mask_line, 1);
+mask_line += mask_x  5;
+
+if (srca == 0xff)
+{
+	src565 = CONVERT__TO_0565 (src);
+	while (height--)
+	{
+	dst = dst_line;
+	dst_line += dst_stride;
+	mask = mask_line;
+	mask_line += mask_stride;
+	w = width;
+
+	bitcache = *mask++;
+	bitmask = CREATE_BITMASK (mask_x  31);
+
+	while (w--)
+	{
+		if (bitmask == 0)
+		{
+		bitcache = *mask++;
+		bitmask = CREATE_BITMASK (0);
+		}
+		if (bitcache  bitmask)
+		*dst = src565;
+		bitmask = UPDATE_BITMASK (bitmask);
+		dst++;
+	}
+	}
+}
+else
+{
+	while (height--)
+	{
+	dst = dst_line;
+	dst_line += dst_stride;
+	mask = mask_line;
+	mask_line += mask_stride;
+	w = width;
+
+	bitcache = *mask++;
+	bitmask = CREATE_BITMASK (mask_x  31);
+
+	while (w--)
+	{
+		if (bitmask == 0)
+		{
+		bitcache = *mask++;
+		bitmask = CREATE_BITMASK (0);
+		}
+		if (bitcache  bitmask)
+		{
+		d = over (src, CONVERT_0565_TO_0888 (*dst));
+		*dst = CONVERT__TO_0565 (d);
+		}
+		bitmask = UPDATE_BITMASK (bitmask);
+		dst++;
+	}
+	}
+}
+}
+
 /*
  * Simple bitblt
  */
@@ -1261,6 +1359,8 @@ static const pixman_fast_path_t c_fast_paths[] =
 { PIXMAN_OP_OVER, PIXMAN_solid,PIXMAN_a1,   PIXMAN_x8r8g8b8, fast_composite_over_n_1_, },
 { PIXMAN_OP_OVER, PIXMAN_solid,PIXMAN_a1,   PIXMAN_a8b8g8r8, fast_composite_over_n_1_, },
 { PIXMAN_OP_OVER, PIXMAN_solid,PIXMAN_a1,   PIXMAN_x8b8g8r8, fast_composite_over_n_1_

Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-10-25 Thread Siarhei Siamashka
On Friday 23 October 2009, Koen Kooi wrote:
  I'm not sure about pixman_gc_t since most of the needed operations are just
  simple copies. What about starting with just introducing a variant
  of 'pixman_blt' which is overlapping aware?
 
  I created a work-in-progress branch with 'pixman_blt' function (generic C
  implementation for now) extended to support overlapped source/destination
  case. A simple test program is also included:
  http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
 
First, this branch is outdated. There is a new branch with the final code :)
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2

 Would using said branch give me 'magically' a performance boost (e.g. 
 not make firefox unusably slow as it is now on an 600MHz cortex a8) or 
 would I need to patch other libs (e.g. xrender) as well?

Not really, it's just a small extension of pixman functionality. Currently
the handling of overlapped blt operation (for software rendering) is done
in xorg-server. As it is the responsibility of pixman to provide CPU-specific
SIMD optimizations (NEON for ARM Cortex-A8), it would be quite natural to
move this work to pixman. So the next steps are to add NEON optimizations
to pixman_plt and make sure that xserver takes advantage of these
optimizations for the overlapped blit too.

As for improving scrolling performance (and assuming a standard fbdev driver),
the most important thing is to improve framebuffer memory performance. Right
now framebuffer memory is mapped as noncached writecombine on OMAP3. Enabling
write-through cache for it (with a simple kernel patch) improves scrolling
and moving windows performance by 4x-5x factor (unless shadow framebuffer is
used, which is also not good for performance). This works fine if nothing
but CPU can modify framebuffer memory. But if GPU or DSP can also access
framebuffer memory or compositing manager is used, everything gets more
complicated. Cache invalidate operations will have to be inserted in
appropriate places in order to ensure memory coherency and uniform view
of its content from all the units. If default write-back cache is used
instead of write-through, cache flush operations are needed too.

Unpatched firefox is also quite slow for another reason - it tries to
always work with 32bpp data internally, no matter what color depth is
used for desktop.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel


Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data

2009-10-19 Thread Siarhei Siamashka
On Thursday 04 June 2009, Soeren Sandmann wrote:
 Siarhei Siamashka siarhei.siamas...@gmail.com writes:
  What kind of guarantees (or the lack of) pixman and XRender are supposed
  to provide when dealing with overlapping parts of images?

 (Adding xorg-devel). See this thread:

 http://lists.freedesktop.org/archives/xorg/2008-October/039346.html

 The guarantee that I would suggest for Render and pixman is that if
 any pixel is both read and written in the same request, then the
 result of the whole request is undefined, except that it obeys the
 clipping rules.

  The practical use case could be scrolling of data inside of a single big
  image. If rendering with overlapped source and destination areas is not
  supported, a temporary image has to be created to achieve expected result
  and this is an additional performance hit.

 Yes, scrolling is one thing that the current pixman API doesn't really
 provide. 'pixman_blt()' only deals with cases where the source and
 destination don't overlap.

 I think the best solution is to move all of the X primitives
 (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would
 probably look something like this:

 void
 pixman_copy_area (pixman_image_t *src,
   pixman_image_t *dest,
   pixman_gc_t *gc,
   int src_x,
   int src_y,
   int width,
   int height,
   int dest_x,
   int dest_y);

 and that would be guaranteed to handle overlapping src and dest. A
 pixman_gc_t would be a new type of object, corresponding to an X GC.

 pixman_blt() would then become a deprecated wrapper that would just
 call pixman_copy_area(). Same for pixman_fill() and a new
 pixman_fill_rectangles().

I'm not sure about pixman_gc_t since most of the needed operations are just
simple copies. What about starting with just introducing a variant
of 'pixman_blt' which is overlapping aware?

I created a work-in-progress branch with 'pixman_blt' function (generic C
implementation for now) extended to support overlapped source/destination
case. A simple test program is also included:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

Making use of the already existing SIMD optimized pixel copy functions should
provide fast scrolling in all the directions except for from left to right.
This special case will require a SIMD optimized backwards copy.

I wonder if it makes sense to drop delegates support for pixman_blt and make
call chain shorter when introducing SIMD optimized copies? It seems to be a
little bit overdesigned here.

Running test program for the current pixman master (SSE2 optimized):

$ test/overlapped-blt-test
bpp=8, supported=0, normal_ok=0, overlapped_ok=0
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0

Running test program for the pixman from this branch with generic
C version of pixman_blt (8bpp now uses a fallback to generic C
implementation):

$ test/overlapped-blt-test
bpp=8, supported=1, normal_ok=1, overlapped_ok=1
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0


-- 
Best regards,
Siarhei Siamashka
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel