Re: ARMSOC X11 plugin issues
On Mon, 4 Apr 2016 13:24:21 +0100 Daniel Stone <dan...@fooishbar.org> wrote:
> Hi,
>
> On 4 April 2016 at 11:45, Siarhei Siamashka <siarhei.siamas...@gmail.com> wrote:
> > This really must be something new, because I don't remember anything
> > like this when skimming through the EULA back in 2013. Making the EULA
> > even more restrictive seems to contradict the promise of Jem Davies
> > to provide legally redistributable Mali binary drivers that we had
> > in 2014:
> >
> > http://anandtech.com/comments/8226/ask-the-experts-arm-fellow-jem-davies-answers-your-gpu-questions/409101
> >
> > Can anyone from ARM provide an update on this?
>
> Getting fairly off-topic now, but the Midgard (6xx/7xx/8xx) drivers
> are now redistributable.

Hi Daniel,

I'm not sure if this is really off-topic, because it is still related to interoperability with the X server after all. Thanks for your comment.

Sorry for the late reply, but I have just noticed that the ARM Mali Utgard GPU User Space Drivers page now has the r6p0-01rel0 release with an updated EULA, which is more permissive than the previous r5p0-01rel0 release:

http://malideveloper.arm.com/resources/drivers/arm-mali-utgard-gpu-user-space-drivers/

The page states that "A new and more permissive version was introduced in January 2016 and all Midgard user-space drivers starting with r6p0 are now distributed under the new terms. The main changes are to allow redistribution of the binaries under the same EULA, commercial use and benchmarking. Please read the END_USER_LICENCE_AGREEMENT.txt document included in the packages for the exact licencing terms."

There is obviously a minor mistake in this text, because it mentions "Midgard" (most likely a copy-paste issue). Anyway, it is good that the Utgard (Mali-400/450) binaries finally got a usable license after all these years. And the weird benchmark results confidentiality clause has also been updated. Thanks to whoever finally made this happen!
Still, I wonder who is also going to provide binaries for the 32-bit mode and for Mali-400 hardware (unless the userland binaries for Mali-450 are fully backwards compatible). Also, the X11 binaries are still missing from the downloads page. Can ARM people comment on this?

Maxime, if you happen to have 32-bit redistributable userland Mali-400 binaries, could you please share them with the others? Maybe ARM people could even put them on their download page?

--
Best regards,
Siarhei Siamashka
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: https://lists.x.org/mailman/listinfo/xorg-devel
Re: ARMSOC X11 plugin issues
On Tue, 5 Apr 2016 11:12:00 -0700 Maxime Ripard <maxime.rip...@free-electrons.com> wrote:
> On Mon, Apr 04, 2016 at 02:46:19PM +0900, Michel Dänzer wrote:
> > On 01.04.2016 15:57, David Garbett wrote:
> > > The main difference between modesetting and armsoc is that armsoc
> > > supports DRI2, and modesetting doesn't.
> >
> > The modesetting driver itself supports DRI2 as well. If it's not working
> > on this platform, it's probably because
> > hw/xfree86/dri2/dri2.c:dri2_probe_driver_name() doesn't know yet which
> > DRI client driver name to use for it.
>
> My understanding was that the modesetting driver was supporting DRI2
> through Mesa, which we aren't using unfortunately (but instead the
> proprietary ARM OpenGL ES implementation).

Yes, the DRI2 spec does not define the exact interpretation of the DRI2BUFFER data fields and leaves a lot up to the implementations. This works fine as long as the DDX (or "plugin", as you call it) and the client-side GL library interpret DRI2BUFFER in the same way and understand each other. But different DRI2 implementations are generally incompatible with each other.

--
Best regards,
Siarhei Siamashka
Re: ARMSOC X11 plugin issues
Hello Maxime,

On Sat, 2 Apr 2016 15:58:37 +0200 Maxime Ripard <maxime.rip...@free-electrons.com> wrote:
> Hello David,
>
> Thanks for your answer.
>
> On Fri, Apr 01, 2016 at 06:57:07AM +0000, David Garbett wrote:
> > The main difference between modesetting and armsoc is that armsoc
> > supports DRI2, and modesetting doesn't. This is what allows the GLES
> > driver to render to X buffers.
>
> Yes, that's what I figured.
>
> > DRI2 enables any application to pass any X pixmap into the GLES/EGL
> > driver, so all buffers need to be allocated from shareable memory.
> > That's not the case with modesetting - other than the main framebuffer,
> > other allocations (for pixmaps and windows) can just be local to the X
> > server, so don't allocate from the DRM driver.
> >
> > I've not looked at your DRM driver proposal so I can't really say why it
> > can't cope with the additional allocations. Though I do notice in your
> > armsoc patch that you don't handle scanout buffers any differently. In
> > many systems scanout buffers are a more scarce resource (perhaps they
> > need to be contiguous, or meet certain alignment restrictions).
>
> We do agree on that. However, it looks like the armsoc driver does many
> more allocations, and usually smaller ones, than modesetting, which
> basically just allocates a buffer for its framebuffers and that's it.
>
> If I was to guess, I'd say that it allocates buffers for applications
> (and possibly parts of the application windows), which eventually
> depletes the CMA pool, even though the GPU is not used.
>
> I'd expect the DRI buffers to be allocated from this pool, because of
> the reasons you pointed out, but I would also expect that the
> applications do not hit that pool, precisely because it's a scarce
> resource.
> I'm possibly missing something though, or have broken expectations :)
>
> In our case, we don't really have any restrictions on the memory
> resource locations, and I'm not aware of any particular weird
> alignment constraints either.

Yes, this problem can be solved by using malloc for all pixmaps, and only migrating them to CMA (or any other special allocation method) when needed. We encountered this back in 2013, when xf86-video-mali had exactly the same design issues as you see now with xf86-video-armsoc:

http://ssvb.github.io/2013/02/01/new-xf86-video-sunxifb-ddx-driver.html

As far as I know, Daniel Drake had some fixes for xf86-video-armsoc, but I have no idea which of the armsoc forks is the most up to date right now.

There are also a few tricks to ensure that 2D performance in Xorg does not regress when integrating the Mali drivers, and also to get zero-copy tear-free OpenGL ES rendering in windows. All of this is now available in the xf86-video-fbturbo DDX (renamed from xf86-video-sunxifb) if anyone is interested.

And I'm happy to know that you are working on the kernel side and converting Allwinner's custom fbdev-based kernel interface into a more standard DRM/KMS interface. This had to be done by somebody, and it is better late than never.

> > > Then, we noticed (using xfce4, on debian jessie) that the systray
> > > icons were not displayed for some reason. There's also some game
> > > (alex4 [4]), that starts, runs, but the window content remains black
> > > (but it remains interactive, audio plays and if we take a screenshot,
> > > the content is on the image, but the screen remains black).
>
> That's what concerns me the most though. We can always work around the
> memory allocations by playing cat and mouse and allocating a bigger
> pool (even though I'd really like to avoid that).
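The malloc-first strategy described above can be sketched roughly like this (all names and structures here are hypothetical illustrations, not code from any of the mentioned drivers):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical pixmap record: it starts life in ordinary malloc'ed
 * memory and is migrated to a scarce DRM/CMA buffer only when a client
 * actually needs it to be GPU/scanout visible (e.g. via DRI2). */
struct pixmap {
    int width, height, bpp;
    void *sys_mem;  /* malloc-backed storage (cheap, plentiful) */
    void *cma_mem;  /* CMA/DRM-backed storage (scarce), NULL until needed */
};

static struct pixmap *pixmap_create(int width, int height, int bpp)
{
    struct pixmap *p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    p->width = width;
    p->height = height;
    p->bpp = bpp;
    /* Default path: plain system memory, the CMA pool stays untouched */
    p->sys_mem = calloc((size_t)width * height, bpp / 8);
    return p;
}

/* Called only when the pixmap must become visible to the GPU or the
 * display engine (e.g. on a DRI2GetBuffers request). The malloc here is
 * a stand-in for a real DRM dumb-buffer or CMA allocation. */
static int pixmap_migrate_to_cma(struct pixmap *p)
{
    size_t size = (size_t)p->width * p->height * (p->bpp / 8);

    if (p->cma_mem)
        return 0; /* already migrated */
    p->cma_mem = malloc(size);
    if (!p->cma_mem)
        return -1;
    memcpy(p->cma_mem, p->sys_mem, size);
    free(p->sys_mem);
    p->sys_mem = NULL;
    return 0;
}
```

With this layout, ordinary desktop applications never touch the scarce pool at all; only pixmaps that are actually exported to the GLES/EGL driver pay the CMA cost.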
> However, those glitches are kind of weird, and not really convenient :/

Well, the Mali drivers EULA kind of warns that a problem-free experience is not to be expected. And they deliver what is promised :-)

The EULA now has an extra clause in the "RESTRICTIONS ON USE OF THE SOFTWARE" paragraph, which basically says that benchmark results or reports about quality issues may not be disclosed without explicit permission from ARM:

http://malideveloper.arm.com/resources/drivers/arm-mali-utgard-gpu-user-space-drivers/

This really must be something new, because I don't remember anything like this when skimming through the EULA back in 2013. Making the EULA even more restrictive seems to contradict the promise of Jem Davies to provide legally redistributable Mali binary drivers that we had in 2014:

http://anandtech.com/comments/8226/ask-the-experts-arm-fellow-jem-davies-answers-your-gpu-questions/409101

Can anyone from ARM provide an update on this?

--
Best regards,
Siarhei Siamashka
Re: [PATCH 2/2] glamor: Use float matrix for render transformations
On Mon, 18 Aug 2014 15:15:28 -0700 Keith Packard <kei...@keithp.com> wrote:
> Siarhei Siamashka <siarhei.siamas...@gmail.com> writes:
> > Okay, now I see what's going on. The old glamor code used to convert
> > the 16.16 fixed point values to 32-bit floats. This is definitely not
> > great because of
> > http://en.wikipedia.org/wiki/Rounding#Double_rounding
> > on the 64-bit float -> 16.16 fixed point -> 32-bit float conversion
> > path. Directly passing 32-bit floats to XRENDER should somewhat
> > mitigate the accuracy loss, but not totally eliminate it.
>
> You need to separate the coordinate representation issues from the
> transformation representation issues. For coordinates, yes, 32-bit
> floats would lose some of the low bits. For transformation matrices,
> the values aren't coordinates; they're all scaling factors for the
> coordinates.
>
>     | A B C |
>     | D E F |
>     | G H J |
>
> For affine transforms, A, B, D, E generally have small values, related
> to the overall image scaling factor. C and F are the translation, while
> G and H are zero and J is one. The small values (A, B, D, E) are
> multiplied by coordinates, the large values (C, F) are multiplied by
> one. Having 24 bits of mantissa for all of these values offers the
> maximum precision for the resulting values.

Software development just does not work this way. You can't rely on assumptions like "for affine transforms, A, B, D, E generally have small values", "the user always provides correct input", "this race condition is so rare that it is unlikely to crash our software", or anything of this sort.

The 16.16 fixed point format at least has predictable good accuracy (less than 0.5 pixel absolute error in the worst case) for any affine transformation when working with coordinates and matrices within the valid range supported by XRENDER. The single precision floating point format has much less reliable accuracy for affine transformations. That's the fact.

> Let's try to construct a really bad case.
> That's where your matrix has values which are just less than 0.5 ULP
> (1/2**17).
>
>          A * x + B * y + C
>     x' = -----------------
>          G * x + H * y + J
>
> So, the maximum error here would be where the denominator was small,
> and the maximum magnitude of the matrix values was large, so as to
> keep us from scaling the matrix up to provide precision for the
> smallest members. Here's a matrix which almost reflects in X with a
> tiny amount of non-affine adjustment:
>
>     | -1  0  32767 |
>     |  0  1      0 |
>     |  e  0      1 |
>
> Set e to 1/2 ULP - epsilon (a tiny bit less than 1/2**17), so that the
> fixed point matrix has a zero there and the floating point matrix has
> a non-zero value. For x, y of -2000, 2000:
>
>          -1 * -2000 + 0 * 2000 + 32767    34767
>     x' = ----------------------------- = ------- = 35305.72
>           e * -2000 + 0 * 2000 + 1       0.98474
>
>           -1 * -2000 + 0 * 2000 + 32767    34767
>     x' ~= ----------------------------- = ------- = 34767
>            0 * -2000 + 0 * 2000 + 1          1
>
> This has an error of about 539 pixels.

Is the accuracy actually good enough with this patch applied, and does it solve the "off by several pixels" problem? The patch still looks messed up with the use of a mix of double and float.

> The example above is contrived, but shows that fixed point introduces
> considerable inaccuracy when used in the transformation matrix.

Let's consider that e = 1.0 / (1 << 10), and then try to do the following transform in 3 different ways:

    | 32767 + e  -32767 + e  0 |   | 32767 |
    | 0           1          0 | * | 32767 |
    | 0           0          1 |   | 1     |

a) If we use double precision for everything, then the resulting value is x = 63.998047.

b) If we convert the matrix to 16.16 fixed point format and back to double precision, then the resulting value is x = 63.998047.

c) If we convert the matrix to single precision floating point format and back to double precision, then the resulting value is x = 0.000000.

The example above is contrived, but shows that single precision floating point introduces considerable inaccuracy when used in the transformation matrix. Moreover, this is an *affine* transformation, unlike the *projective* transformation from your contrived example.
The vast majority of the existing XRENDER users are only ever doing affine transformations. This is what matters in the real world.

We have only a single real-world example of the use of XRENDER projective transformations (if you know more, please tell me). And this example is the xrandr tool, where the transformation accuracy apparently was never considered to be a (serious) problem until now. If the xrandr developers really cared about the accuracy, they would have at least tried to implement correct rounding there.

IMHO the ugly protocol level hack is a bit of an overkill for the xrandr problem, especially considering that an alternative solution exists:

http://lists.x.org
Re: [PATCH renderproto] Add floating point transforms
On Tue, 19 Aug 2014 10:50:08 +0600 Alexander E. Patrakov <patra...@gmail.com> wrote:
> 19.08.2014 03:02, Keith Packard wrote:
> > Siarhei Siamashka <siarhei.siamas...@gmail.com> writes:
> > > Do you have a reproducible testcase for this problem? Being off by
> > > several pixels seems to be very wrong. Especially if this happens
> > > often.
> >
> > Non-affine transforms cause problems. I noticed this with RandR when
> > doing keystone correction. Here's a simple example:
> >
> > Desired matrix:
> >
> >     33.13915858  15.17037037  -16384.00000000
> >      0.00000000  23.73634938       0.00000000
> >     -0.00057170  -0.00182142      25.70348287
> >
> > Rounded to 16.16 fixed matrix:
> >
> >     33.13916016  15.17036438  -16384.00000000
> >      0.00000000  23.73634338       0.00000000
> >     -0.00056458  -0.00181580      25.70347595
> >
> > Actual bits:
> >
> >     { 0x00014dba, 0x000098c6, 0xfd7b7d45 },
> >     { 0x00000000, 0x0000ef09, 0x00000000 },
> >     { 0xffffffff, 0xfffffffb, 0x000102d9 },
> >
> > Take the point 2560,1600 and transform:
> >
> >     desired: 4348.04, 1780.87
> >     actual:  4342.50, 1778.60
> >     error:   5.99
>
> Sorry, but I think that we have bigger problems. Please look at
> https://bugs.freedesktop.org/show_bug.cgi?id=39949 (which was a
> blocker for xserver-1.12 and makes mouse input on scaled displays
> unusable in some cases). Since there was no manpower to review that
> bug for years, I think that the best course for now is to prohibit
> all transforms that do not amount to mere translation. You can change
> my opinion by reviewing the patch in the bug.

Alexander, input handling is a different problem, which is not directly related to the transform accuracy. But thanks for bringing it up.

Also I got an off-list e-mail about xrandr from another person, which clearly shows that there is some interest in it, but people might be a little bit too shy to complain or remind about problems in public :-)

--
Best regards,
Siarhei Siamashka
Re: [PATCH renderproto] Add floating point transforms
On Fri, 15 Aug 2014 08:55:18 -0700 Keith Packard <kei...@keithp.com> wrote:
> Fixed point coordinates don't provide reasonable precision for
> transformation operations; the resulting transforms are often off by
> several pixels.
>
> Allow clients to represent the transformation using either fixed point
> or floating point.

A protocol extension is an action with serious long-lasting consequences. XRENDER is a rather old extension, and my understanding was that it serves mostly to just allow running the existing applications. Extending the protocol at this point needs to be clearly justified. The current commit message is a bit vague, uses weasel words and is not entirely accurate. So some clarifications would be really appreciated.

* Who are the expected users of this new functionality and what are the practical advantages for them? Preferably something measurable and easy to confirm.

* Are you expecting the existing applications/toolkits to eventually switch from the 16.16 fixed point format to the single precision floating point format? Or is it only intended to be used by a few selected tools, such as xrandr?

* Are there any possible risks? Is there anything like a performance and/or accuracy drop in some use cases? Are the existing drivers and backends ready to provide this new functionality without exposing problems? Any backwards/forward compatibility concerns to care about?

* Are there any other alternatives to the proposed solution?

--
Best regards,
Siarhei Siamashka
[PATCH 0/2] Projective transformation accuracy fixes for the xrandr tool
As revealed in
http://lists.x.org/archives/xorg-devel/2014-August/043624.html
by Keith Packard, there are some accuracy issues in the way the xrandr tool handles projective transformations. Because being off by several pixels is a bit too much, some xrandr users are not very happy.

The first patch fixes a very trivial and very obvious rounding problem. The second patch implements matrix rescaling to ensure that the rounding errors are significantly reduced, specifically for the most critical G and H matrix coefficients.

As a result, projective transformations should now be pixel accurate for anything sane that can be done with xrandr and modern high resolution monitors. But only testing can show. I would be very interested to know if there are still some problematic use cases remaining.

The search for the optimal matrix scaling factors is relatively computationally intensive and may have a lot of room for improvement. This should not be a problem for the xrandr command line tool though, because it does not have to deal with millions of matrices. Only a single matrix is used per run.

https://github.com/ssvb/xrandr/tree/transform-accuracy

Siarhei Siamashka (2):
  xrandr: Use rounding when converting from float to fixed point matrix
  xrandr: Rescale matrix to improve projective transformations accuracy

 xrandr.c | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 180 insertions(+), 5 deletions(-)

-- 
1.8.3.2
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel
[PATCH 2/2] xrandr: Rescale matrix to improve projective transformations accuracy
When using projective transformations and homogeneous coordinates, we can freely multiply the transformation matrix by a constant. Optimal selection of this constant has a huge impact on accuracy.

A naive approach is just to upscale the matrix by the largest possible factor, only ensuring that the matrix coefficients don't exceed 32767. This somewhat reduces the relative error caused by rounding, but is still not a silver bullet.

A more advanced approach is to focus specifically on the G and H coefficients, because they contribute to the divisor value and the rounding errors in them may be scaled manyfold in the final result. A simplified variant of this approach is to pick either the G or the H coefficient, round it to the nearest value which fits the 16.16 fixed point format, and use the ratio between the rounded and the original coefficient as the constant for rescaling the matrix. After this is done, either the G or the H coefficient becomes perfectly accurate and has no rounding error on the conversion to 16.16 fixed point format. A better solution is to try minimizing the rounding errors for both G and H at once.

This patch implements a simple bruteforce search for the optimal rescaling constant. The performance should not be an issue for the xrandr command line tool, because this is done only once per tool run.

Matrices are dumped to the console when using the '--verbose' command line option, and the accuracy can be evaluated. For example, using the matrix coefficients from
http://lists.x.org/archives/xorg-devel/2014-August/043624.html
provides the following results:

    octave:1> {paste the original_matrix and the adjusted_matrix data}
    octave:3> x = [2560 ; 1600 ; 1]
    octave:4> p1 = original_matrix * x ; p2 = adjusted_matrix * x
    octave:5> difference_between_p1_and_p2 = p1 / p1(3) - p2 / p2(3)

       0.0046335
       0.0024114
       0.0000000

The top two values are showing the absolute error for the X and Y coordinates. The accuracy remains good up to something like 7000x7000, but then degrades quickly.
Signed-off-by: Siarhei Siamashka <siarhei.siamas...@gmail.com>
---
 xrandr.c | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 170 insertions(+), 3 deletions(-)

diff --git a/xrandr.c b/xrandr.c
index 5ba8c00..e6c3a0f 100644
--- a/xrandr.c
+++ b/xrandr.c
@@ -463,16 +463,183 @@ mode_width (XRRModeInfo *mode_info, Rotation rotation)
     }
 }
 
+/* Something that is way too small for 16.16 fixed point representation */
+#define MATRIX_EPSILON                (1.0 / (1 << 24))
+/* The matrix coefficients can't be larger than this */
+#define MATRIX_FIXED_POINT_MAXVAL     32767.0
+/* Configures the minimal range for scale factor search */
+#define MATRIX_SCALE_RANGE            1.5
+/* Bruteforce search runs this number of iterations */
+#define MATRIX_BRUTEFORCE_LOOP_COUNT  10000
+
+static Bool
+is_matrix_affine (double matrix[3][3])
+{
+    return (fabs (matrix[2][0]) < MATRIX_EPSILON &&
+            fabs (matrix[2][1]) < MATRIX_EPSILON &&
+            fabs (matrix[2][2] - 1.0) < MATRIX_EPSILON);
+}
+
+static inline double
+round_to_nearest_integer (double x)
+{
+    return floor(x + 0.5);
+}
+
+/*
+ * This function tries to find the best suitable scale factor (from the range
+ * between 'minscale' and 'maxscale'), so that 'orig_a * scale' and
+ * 'orig_b * scale' are as close to some integer values as possible.
+ * A bruteforce search is performed, trying to minimize the squared distance.
+ *
+ * The geometric interpretation of this process is a line, originating
+ * from (0, 0) and going through the first quadrant on a Cartesian plane.
+ * The slope of this line depends on the ratio between 'orig_a' and 'orig_b'.
+ * The line passes near a large number of points with integer coordinates.
+ * The distances between the line and these points are evaluated.
+ */
+static double
+matrix_find_best_scale_factor(double orig_a, double orig_b,
+                              double minscale, double maxscale)
+{
+    double a = orig_a, b = orig_b;
+    double best_sqr_dist = 1.0;
+    double best_scale_factor = maxscale;
+    double sqr_orig_a, sqr_orig_b;
+    double recip_orig_a, recip_orig_b, recip_sqr_orig_a_plus_sqr_orig_b;
+    double refscale;
+    double refscale_step = (maxscale - minscale) / MATRIX_BRUTEFORCE_LOOP_COUNT;
+
+    /* Ensure that 'orig_a' and 'orig_b' are both positive non-zero values */
+    orig_a = fabs(orig_a);
+    orig_b = fabs(orig_b);
+    if (orig_a < MATRIX_EPSILON)
+        orig_a = MATRIX_EPSILON;
+    if (orig_b < MATRIX_EPSILON)
+        orig_b = MATRIX_EPSILON;
+
+    /* Pre-computed constants */
+    sqr_orig_a = orig_a * orig_a;
+    sqr_orig_b = orig_b * orig_b;
+    recip_orig_a = 1.0 / orig_a;
+    recip_orig_b = 1.0 / orig_b;
+    recip_sqr_orig_a_plus_sqr_orig_b = 1.0 / (sqr_orig_a + sqr_orig_b);
+
+    /*
+     * Try different scale factors between 'minscale' and 'maxscale'. They
+     * are only used as the initial approximations though.
+     */
+    for (refscale = minscale; refscale
[PATCH 1/2] xrandr: Use rounding when converting from float to fixed point matrix
This prevents unnecessary accuracy loss when the '--transform' option is used. Also, the preparation of a transformation matrix is now moved to its own separate function.

Signed-off-by: Siarhei Siamashka <siarhei.siamas...@gmail.com>
---
 xrandr.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/xrandr.c b/xrandr.c
index 366f6dc..5ba8c00 100644
--- a/xrandr.c
+++ b/xrandr.c
@@ -463,6 +463,18 @@ mode_width (XRRModeInfo *mode_info, Rotation rotation)
     }
 }
 
+static void
+make_fixed_point_matrix (XTransform *transform, double matrix[3][3])
+{
+    /* XDoubleToFixed does not provide correct rounding, so use a workaround */
+    double fixed_1 = XDoubleToFixed (1);
+    int k, l;
+
+    for (k = 0; k < 3; k++)
+	for (l = 0; l < 3; l++)
+	    transform->matrix[k][l] = floor (matrix[k][l] * fixed_1 + 0.5);
+}
+
 static Bool
 transform_point (XTransform *transform, double *xp, double *yp)
 {
@@ -2908,7 +2920,6 @@ main (int argc, char **argv)
 	    }
 	    if (!strcmp ("--transform", argv[i]))
 	    {
 		double transform[3][3];
-		int k, l;
 		if (!config_output)
 		    argerr ("%s must be used after --output\n", argv[i]);
 		if (++i >= argc)
 		    argerr ("%s requires an argument\n", argv[i-1]);
 		init_transform (&config_output->transform);
@@ -2921,10 +2932,7 @@ main (int argc, char **argv)
 		    != 9)
 		    argerr ("failed to parse '%s' as a transformation\n", argv[i]);
 		init_transform (&config_output->transform);
-		for (k = 0; k < 3; k++)
-		    for (l = 0; l < 3; l++) {
-			config_output->transform.transform.matrix[k][l] = XDoubleToFixed (transform[k][l]);
-		    }
+		make_fixed_point_matrix(&config_output->transform.transform, transform);
 		config_output->transform.filter = "bilinear";
 		config_output->transform.nparams = 0;
 		config_output->transform.params = NULL;
-- 
1.8.3.2
___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: http://lists.x.org/mailman/listinfo/xorg-devel
Re: [PATCH renderproto] Add floating point transforms
On Mon, 18 Aug 2014 14:02:13 -0700 Keith Packard <kei...@keithp.com> wrote:
> Siarhei Siamashka <siarhei.siamas...@gmail.com> writes:
> > Do you have a reproducible testcase for this problem? Being off by
> > several pixels seems to be very wrong. Especially if this happens
> > often.
>
> Non-affine transforms cause problems. I noticed this with RandR when
> doing keystone correction. Here's a simple example:
>
> Desired matrix:
>
>     33.13915858  15.17037037  -16384.00000000
>      0.00000000  23.73634938       0.00000000
>     -0.00057170  -0.00182142      25.70348287
>
> Rounded to 16.16 fixed matrix:
>
>     33.13916016  15.17036438  -16384.00000000
>      0.00000000  23.73634338       0.00000000
>     -0.00056458  -0.00181580      25.70347595
>
> Actual bits:
>
>     { 0x00014dba, 0x000098c6, 0xfd7b7d45 },
>     { 0x00000000, 0x0000ef09, 0x00000000 },
>     { 0xffffffff, 0xfffffffb, 0x000102d9 },
>
> Take the point 2560,1600 and transform:
>
>     desired: 4348.04, 1780.87
>     actual:  4342.50, 1778.60
>     error:   5.99

It is possible to pre-multiply all the coefficients in the desired matrix by 1.147703 before converting it to 16.16 fixed point format. In this case we get the following rounded coefficients in the matrix:

    38.033912  17.411080  -18803.965952
     0.000000  27.242279       0.000000
    -0.000656  -0.002090      29.499964

The following actual bits:

    { 0x002608ae, 0x0011693d, 0xb68c08b7 },
    { 0x00000000, 0x001b3e06, 0x00000000 },
    { 0xffffffd5, 0xffffff77, 0x001d7ffe },

And the 2560,1600 point transforms to 4348.03, 1780.86.

I found this particular multiplication constant by a simple bruteforce search (minimizing the rounding errors for the m[2][0] and m[2][1] matrix coefficients). But quite likely, a reasonably good constant can also be found by some better performing algorithm, suitable for realtime use in applications.

Basically, for an affine transform we just need to do correct rounding. And for a projective transform, we also need to nudge the matrix a bit in order to minimize the rounding errors when converting to 16.16 fixed point.

Would this be a suitable solution for the xrandr command line tool?
--
Best regards,
Siarhei Siamashka
Re: [PATCH renderproto] Add floating point transforms
On Fri, 15 Aug 2014 08:55:18 -0700 Keith Packard <kei...@keithp.com> wrote:
> Fixed point coordinates don't provide reasonable precision for
> transformation operations; the resulting transforms are often off by
> several pixels.

Do you have a reproducible testcase for this problem? Being off by several pixels seems to be very wrong. Especially if this happens often.

> Allow clients to represent the transformation using either fixed point
> or floating point.
>
> Signed-off-by: Keith Packard <kei...@keithp.com>

The 16.16 fixed point format should be pixel accurate for the whole range of coordinates supported by XRENDER (from -32K to +32K). At least for affine transformations. Projective transformations are a somewhat special case and don't seem to be used in practice (Cairo does not use them).

When we are doing matrix multiplications, each transformed coordinate is a sum of three multiplications. For affine transformations it is effectively a sum of just two multiplications. Assuming that a perfectly accurate matrix is getting converted to the 16.16 fixed point format, we get some rounding errors. In the worst case, the rounding error for each of the matrix coefficients is 1/65536 (the granularity of fixed point values) divided by two (because we are rounding to nearest). Also in the worst case, the coordinates before transformation can be as large as 32768. So the worst case absolute error is 1/65536/2 * 32768 * 2 = 0.5 pixels for affine transformations. And the error is naturally smaller than this on average.

32-bit floating point values use the same number of bits as 16.16 fixed point values for data storage. But the key difference is that the accuracy of the floating point values is seriously non-uniform. It tends to be better when the values are close to zero, but degrades significantly (fewer bits are allocated to the fractional part) when the pixel coordinates move further away from zero.
In general, floating point calculations are not associative and this causes additional headaches. For example, repeating "x += dx" N times does not necessarily produce the same result as "x += dx * N".

The floating point format supports a wider range of values (unnecessary if the rest of XRENDER remains restricted to 16-bit coordinates) but has much worse accuracy in corner cases and is poorly predictable. The 32-bit floating point format does not look like an obvious upgrade over 16.16 fixed point.

IMHO this problem needs a bit more debugging before taking any actions with serious consequences (such as extending the protocol).

--
Best regards,
Siarhei Siamashka
Re: [PATCH renderproto] Add floating point transforms
On Sun, 17 Aug 2014 14:26:13 -0700 Keith Packard <kei...@keithp.com> wrote:
> walter harms <wha...@bfs.de> writes:
> > Would it make sense to use double for internal calculation? Most
> > systems provide 64-bit (or more) FPU registers these days.
>
> In fact, pixman exposes only double matrices at this point, so the
> server implementation stores both fixed point and floating point
> variants in that format.

Double precision floating point matrices and operations with them (pixman_f_transform_*) are not really used as a part of the pixman rendering pipeline. They are more like extra bundled convenience utility functions.

When pixman needs a transformation matrix, this matrix is provided to it using the pixman_image_set_transform() function:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman.h?id=pixman-0.32.6#n806

which uses the 16.16 fixed point format. There are also some useless (and even harmful) utility functions, which implement operations on 16.16 fixed point matrices. If anyone tries to use them, a lot of accuracy gets lost in the intermediate calculations.

A reasonable workflow is to do all the intermediate calculations with double precision matrices and only convert to the 16.16 fixed point format as the final step before passing them to pixman. Of course, if somebody knows what he is doing, then it is possible to prepare a 16.16 fixed point matrix without resorting to floating point calculations. But special care needs to be taken to ensure that there is no unnecessary accuracy loss.

> However, a GPU will likely want to stick with 32-bit values, which is
> why I expressed the protocol using 32-bit floats; 64-bit computation
> in the GPU is generally much slower than 32-bit.

This does not necessarily mean that 32-bit floats should be exposed by the protocol or the public API. If we really need improved accuracy (do we?), then 64-bit floats are the way to go.
IMHO it is up to the implementation to decide whether the 32-bit floating point format is accurate enough for the requested operation, and to take a fast path accordingly.

--
Best regards,
Siarhei Siamashka
Re: [PATCH 2/2] glamor: Use float matrix for render transformations
to 32-bit float (which only has a 24-bit mantissa). The 32-bit floating point format is somewhat less accurate than what XRENDER needs. And introducing the 32-bit floating point format in the protocol is going to be really harmful: if new applications start using it for the sake of glamor, then the existing users of software rendering drivers (which used to be perfectly fine until now) are going to suffer from the 64-bit float -> 32-bit float -> 16.16 fixed point double rounding penalty.

Using the 64-bit floating point format in the protocol is not so bad, because it can store 16.16 fixed point values without any accuracy loss. However there would be some performance penalty caused by the extra conversion. That's the usual "pick your poison" situation.

Is the accuracy actually good enough with this patch applied, and does it solve the "off by several pixels" problem? The patch still looks messed up with its mix of double and float.

--
Best regards,
Siarhei Siamashka
Re: xf86-video-armsoc, Mali and exynos4
On Tue, 12 Nov 2013 16:08:27 -0600 Daniel Drake <dr...@endlessm.com> wrote:

> On Tue, Nov 12, 2013 at 3:49 PM, Luc Verhaegen <l...@skynet.be> wrote:
> > Fwiw, I am happily using a derivative of the standard fbdev driver
> > on my sunxi and exynos based devices, namely fbturbo, which is being
> > done by Siarhei Siamashka. For sunxi, it includes some knowledge
> > about the strangely implemented display engine; for exynos, it
> > currently speeds up some operations using neon assembler.
>
> Yes, it does work reasonably well. Unfortunately this does not
> support xrandr resolution switching though

I was considering getting xrandr working, at least on sunxi hardware, so that dual monitor support and runtime screen resolution switching would work in an X11 desktop in the standard way, transparently for the applications and for the user. But then I decided that it was not worth the effort, and very few people actually asked for this feature.

> which is now made available for free by the underlying exynos_drm
> display driver.

Does xf86-video-modesetting work properly with the exynos_drm kernel driver on your hardware? It is useful to first check whether the kernel side is really functional. A year ago, just trying to load the modesetting DDX oopsed the kernel on my ARM Chromebook. But it is possible that things have improved since then.

> Maybe we could add KMS support to fbturbo

Right now fbturbo is built on top of the xf86-video-fbdev chassis, simply because it works everywhere and provides the best compatibility with various hardware. Trying to hack KMS support into it could be rather invasive, and perfect compatibility with the ump based mali400 binary blobs could not be guaranteed. A fully open source driver for mali400 is going to provide a lot more freedom in shaping a usable graphics stack. xf86-video-modesetting would likely be the preferred option over xf86-video-fbdev if it also worked everywhere (at least on Raspberry Pi, Allwinner, Rockchip, Exynos4 and Exynos5 based hardware).
Now an interesting question is whether it makes sense to suspend all the other activity and work on better KMS support in the kernel for all this hardware simultaneously. Or should we first wait for decent results on at least one platform (freedreno? armada?), learn from that experience, and then do the same for the other ARM platforms?

My primary motivation for fbturbo/sunxifb was ensuring a usable X11 desktop on low end ARM hardware right now, by fixing or working around some obvious performance issues. It's a stop-gap solution, which surely needs to be replaced in the long run. But any potential replacement must be confirmed to be better (not just assumed to be better).

> I was just interested in exploring things from the armsoc side as the
> original announcement talks about the combination of "X/Mali
> integrated with exynos, DRM and KMS".
>
> http://lists.x.org/archives/xorg-devel/2012-May/031250.html
>
> Maybe that level of integration has not yet been reached in public
> code though :(

A lot may happen along the way, no matter how good or bad the original plan was. I guess that Tom would be the best person to explain the current status of this work.

--
Best regards,
Siarhei Siamashka
Re: Where does Xorg fallback to ?
On Mon, 23 Sep 2013 10:45:41 +0800 小飞珑 <404962...@qq.com> wrote:

> Hi, I am looking into Xorg for performance optimization, and want to
> do some software acceleration for the Xorg fallback functions when
> hardware acceleration is not available.

The software fallbacks are partially implemented in the Xorg server (the fb subdirectory) and partially in http://pixman.org/

> I am now using a notebook with an Intel display card. In the UXA
> driver, I forced the driver to use the fallback funcs instead of the
> accelerated funcs. As in uxa/uxa-accel.c -> uxa_copy_area(), I set
> uxa_screen->force_fallback = 1, forcing the uxa driver to call
> uxa_check_copy_area() -> fbCopyArea(), but fbCopyArea seemed NOT to
> be the fbCopyArea in Xorg fb/fbcopy.c! I added a "return" immediately
> at the beginning of fb/fbcopy.c -> fbCopyArea(), but it made no
> difference; yet if I comment out the fbCopyArea() call in
> uxa/uxa-unaccel.c, the display is ruined. It seems that the above two
> fbCopyArea() functions are not the same. Where does Xorg fall back
> to? NOT to fb/fbcopy.c? I am eager to know, could somebody tell me?
> Thanks a lot!

The Intel hardware and the UXA driver seem to be a strange choice. If this notebook is really your target hardware, then you should probably have a look at SNA. But if you want to improve the Xorg software fallbacks just for the sake of making them better, then running tests with the xf86-video-fbdev or xf86-video-modesetting driver would be more reasonable.

The current fb code in Xorg is implemented as just naive loops, and whenever these are heavily used (instead of the optimized assembly in glibc or pixman), performance suffers a lot. A good start would be to fix the XShmPutImage performance, because it is causing real trouble for users. This can be worked around in the DDX drivers, but fixing it in Xorg would also be nice:

https://github.com/ssvb/xf86-video-fbturbo/issues/9

Unfortunately the person who reported this problem has not tried to submit a fix to Xorg yet.
But you or anybody else is free to pick up this job :-)

--
Best regards,
Siarhei Siamashka
Re: An iwmmxt optimised shadowUpdateRotate16_270YX, with problems.
On Fri, Aug 26, 2011 at 2:21 PM, Marko Katić <drom...@gmail.com> wrote:

> Hi there! I'm trying to optimise shadow copies from a landscape
> oriented 16bit shadowfb to the portrait 16bit fb proper. This is done
> in shadowUpdateRotate16_270YX, which is located in
> miext/shadow/shrotpackYX.h. My optimisation is aimed at pxa27x and
> xscale3 arm processors only, since it uses iwmmxt asm code. I guess
> it could easily be ported to x86 mmx code too.
>
> As I understand it, the current implementation copies a single pixel
> at a time, like this:
>
>     *win = *sha++;
>     win += WINSTEPX(winStride);
>
> This also means that we're stepping over entire cachelines, since
> every pixel of a single shadowfb line has to be copied to a new line
> of the fb proper. My patch tries to copy 4x4 pixel blocks prerotated
> to portrait orientation. Basically, it takes 4 lines of shadowfb and
> divides them into 4x4 blocks. Then it rotates them and copies them to
> the fb proper. This way, instead of copying a single pixel per fb
> proper line, it copies four. The rotation code is done in iwmmxt asm
> and takes about 0.9 instructions per pixel (assuming the 4x4 block is
> already in iwmmxt registers).
>
> 4x4 blocks imply that the rectangle to be copied is width and height
> aligned to 4 pixels. If not, the patch reduces the rectangle to
> proper alignment with single pixel copies for width and height.
>
> It doesn't really work and I can't find a reason why. The initial
> Xfbdev screen looks fine, but when I start moving the pointer or
> windows, all I get is garbage. The patch was tested on kdrive 1.3.0.0
> running in qemu and on a Zaurus C-1000. If anyone has any
> suggestions, please do tell.

To get the best performance, it is important to take cache and TLB misses into account. The 4x4 pixel blocks may be a bit too small. I think it may be interesting to integrate the iwmmxt optimizations into pixman and then use pixman for doing these rotations in the X server.
Right now there is a more or less cache friendly C implementation of rotation in pixman:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.c?id=pixman-0.22.2#n1620

Rotation is at least partially covered by the pixman test suite ('make check'), so detecting and fixing the most obvious bugs could be a bit easier than watching for image corruption in real use. Also, the 'affine-test' program from the test suite can be used as an example of doing rotations on memory buffers with pixman:

http://cgit.freedesktop.org/pixman/tree/test/affine-test.c?id=pixman-0.22.2

--
Best regards,
Siarhei Siamashka
Re: An iwmmxt optimised shadowUpdateRotate16_270YX, with problems.
On Fri, Aug 26, 2011 at 9:54 PM, Marko Katić <drom...@gmail.com> wrote:

> I chose the 4x4 block size for my initial implementation; first I
> wanted to get it working. The block size could easily be resized to
> 8x4, 8x8 or maybe even 12x4. TLB misses could be completely avoided
> or at least cut to less than half. I'm also working on an improved
> pxafb driver that splits the framebuffer between internal sram (256K
> on a pxa270) and ram. The ram part is placed in a section mapping,
> greatly reducing the number of TLB entries. It may be possible to
> place the internal sram in a section mapping too, I have to try this.
> If it's possible, the fb proper and the shadowfb would only need 2
> TLB entries.

OK, using section mappings is definitely a good idea to eliminate TLB misses. Still, I expect that writing more than 4 pixels (8 bytes) per scanline to the destination buffer would be useful. The write combining buffer has such a name for a reason: it helps to utilize burst writes and make better use of memory bandwidth. But running some benchmarks is always better for identifying which memory access pattern actually works best.

> Could alignment be the problem with my patch? The shadowfb and fb
> proper lines are probably not 64 bit aligned before calling
> iwmmxt_rotate_copy...

Yes, very likely. Also "mov pc, lr" is not thumb interworking safe, and you are clobbering callee saved registers in your code (r4-r11). See:

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/IHI0042D_aapcs.pdf

This can be solved by using "push {r4, r6, r7, lr}" on entry and "pop {r4, r6, r7, pc}" on exit.

--
Best regards,
Siarhei Siamashka
Re: [PATCH 1/6] make byte swapping macros less opaque to the optimizer
On Fri, Apr 29, 2011 at 1:33 AM, Matt Turner <matts...@gmail.com> wrote:
On Thu, Apr 28, 2011 at 4:36 PM, Siarhei Siamashka <siarhei.siamas...@gmail.com> wrote:
On Thu, Apr 28, 2011 at 2:33 AM, Matt Turner <matts...@gmail.com> wrote:

diff --git a/include/misc.h b/include/misc.h
index 803f5ba..b7a3fd2 100644
--- a/include/misc.h
+++ b/include/misc.h
@@ -240,32 +240,17 @@ xstrtokenize(const char *str, const char* separators);
 #define SwapRestL(stuff) \
     SwapLongs((CARD32 *)(stuff + 1), LengthRestL(stuff))
 
-/* byte swap a 32-bit value */
-#define swapl(x, n) { \
-    n = ((char *) (x))[0]; \
-    ((char *) (x))[0] = ((char *) (x))[3]; \
-    ((char *) (x))[3] = n; \
-    n = ((char *) (x))[1]; \
-    ((char *) (x))[1] = ((char *) (x))[2]; \
-    ((char *) (x))[2] = n; }
-
-/* byte swap a short */
-#define swaps(x, n) { \
-    n = ((char *) (x))[0]; \
-    ((char *) (x))[0] = ((char *) (x))[1]; \
-    ((char *) (x))[1] = n; }
-
 /* copy 32-bit value from src to dst byteswapping on the way */
-#define cpswapl(src, dst) { \
-    ((char *)(dst))[0] = ((char *) (src))[3]; \
-    ((char *)(dst))[1] = ((char *) (src))[2]; \
-    ((char *)(dst))[2] = ((char *) (src))[1]; \
-    ((char *)(dst))[3] = ((char *) (src))[0]; }
+#define cpswapl(src, dst) (dst) = lswapl((src))
 
 /* copy short from src to dst byteswapping on the way */
-#define cpswaps(src, dst) { \
-    ((char *) (dst))[0] = ((char *) (src))[1]; \
-    ((char *) (dst))[1] = ((char *) (src))[0]; }
+#define cpswaps(src, dst) (dst) = lswaps((src))
+
+/* byte swap a 32-bit value */
+#define swapl(x, n) (*(x)) = lswapl(*(x))
+
+/* byte swap a short */
+#define swaps(x, n) (*(x)) = lswaps(*(x))

This is not an equal replacement for the previous code. The difference is that the older variant of 'swapl' could accept a pointer with any alignment, but still byteswap exactly 4 bytes at that memory location.
Just to perform an additional sanity check, the following test code can be used:

-#define swapl(x, n) (*(x)) = lswapl(*(x))
+#define swapl(x, n) do { \
+        char dummytmp1[4 - sizeof(*(x))]; \
+        char dummytmp2[sizeof(*(x)) - 4]; \
+        (*(x)) = lswapl(*(x)); \
+    } while (0)
 
 /* byte swap a short */
-#define swaps(x, n) (*(x)) = lswaps(*(x))
+#define swaps(x, n) do { \
+        char dummytmp1[2 - sizeof(*(x))]; \
+        char dummytmp2[sizeof(*(x)) - 2]; \
+        (*(x)) = lswaps(*(x)); \
+    } while (0)

And this extra guard check already spots at least one issue:

swaprep.c:436:2: error: size of array 'dummytmp2' is too large
http://cgit.freedesktop.org/xorg/xserver/tree/dix/swaprep.c?id=xorg-server-1.10.1#n423

One more potential pitfall is data alignment. The older variant could work correctly with any alignment, but the new variant will fail if the pointer is somehow not naturally aligned. So I think the patch is very dangerous, and every affected line of code needs to be carefully reviewed.

--
Best regards,
Siarhei Siamashka

> Good call! I wonder if these misalignments are bugs in their own
> right?

I'm not even sure whether there are any real problems related to alignment here. It's just that the proposed patch requires strict alignment and the older code did not. It may be either fine, or it may potentially trigger alignment traps at runtime on the architectures which need strict alignment, if one of the uses of these macros happens to involve unaligned pointers.
To be on the safe side, it is possible to use a trick with __attribute__((packed)) when compiling with gcc, as described in http://lwn.net/Articles/259732/

void swapl_unaligned(void *x)
{
    struct dummy { uint32_t data; } __attribute__((packed));
    ((struct dummy *)x)->data = __builtin_bswap32(((struct dummy *)x)->data);
}

void swapl_aligned(uint32_t *x)
{
    *x = __builtin_bswap32(*x);
}

The above two functions are compiled into the following code on x86-64:

swapl_unaligned:
   0:   8b 07       mov    (%rdi),%eax
   2:   0f c8       bswap  %eax
   4:   89 07       mov    %eax,(%rdi)
   6:   c3          retq

0007 swapl_aligned:
   7:   8b 07       mov    (%rdi),%eax
   9:   0f c8       bswap  %eax
   b:   89 07       mov    %eax,(%rdi)
   d:   c3          retq

And for armv4t it is:

swapl_unaligned:
   0:   e5d0c001    ldrb   ip, [r0, #1]
   4:   e5d02000    ldrb   r2, [r0]
   8:   e5d01002    ldrb   r1, [r0, #2]
   c:   e5d03003    ldrb   r3, [r0, #3]
  10:   e182240c    orr    r2, r2, ip, lsl #8
  14:   e1822801    orr    r2, r2, r1, lsl #16
  18:   e1822c03    orr    r2, r2
Re: [PATCH] fb: Fix memcpy abuse
On Thu, Apr 21, 2011 at 11:37 PM, Adam Jackson <a...@redhat.com> wrote:

> The memcpy fast path implicitly assumes that the copy walks
> left-to-right. That's not something memcpy guarantees, and newer
> glibc on some processors will indeed break that assumption. Since we
> walk a line at a time, check the source and destination against the
> width of the blit to determine whether we can be sloppy enough to
> allow memcpy. (Having done this, we can remove the check for
> !reverse as well.)
>
> On an Intel Core i7-2630QM with an NVIDIA GeForce GTX 460M running in
> NoAccel, the broken code and various fixes for -copywinwin{10,100,500}
> give (edited to fit in 80 columns):
>
> 1: Disable the fastpath entirely
> 2: Replace memcpy with memmove
> 3: This fix
> 4: The code before this fix
>
>      1               2               3               4       Operation
> ------  --------------  --------------  --------------  ------------
> 258000  269000 ( 1.04)  544000 ( 2.11)  552000 ( 2.14)  Copy 10x10
>  21300   23000 ( 1.08)   43700 ( 2.05)   47100 ( 2.21)  Copy 100x100
>    960     962 ( 1.00)    1990 ( 2.09)    1990 ( 2.07)  Copy 500x500
>
> So it's a modest performance hit, but correctness demands it, and
> it's probably worth keeping the 2x speedup from having the fast path
> in the first place.

In my opinion, the best solution is still to do this stuff in pixman, because it has its own SIMD optimized code and can do a much better job than memcpy/memmove. For example, it can be improved to use MOVNTx instructions on x86 when scrolling large areas exceeding the L2/L3 cache size, an optimization which can't easily be done by glibc when memcpy/memmove is called separately for each individual scanline.

I did some experiments with improving scrolling performance on a non-hardware-accelerated framebuffer earlier, and it showed a really major speedup on ARM:

http://lists.x.org/archives/xorg-devel/2009-November/003536.html

I'm a bit surprised that nobody else has even tried to push for something similar, because the scrolling performance issue is so obvious.
You are more likely to see people trying to use ARM devices without hardware acceleration become horrified by the poor scrolling performance, and then start searching for a way to obtain and install the proprietary binary drivers :)

> Signed-off-by: Adam Jackson <a...@redhat.com>
> ---
>  fb/fbblt.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fb/fbblt.c b/fb/fbblt.c
> index a040298..b955245 100644
> --- a/fb/fbblt.c
> +++ b/fb/fbblt.c
> @@ -65,6 +65,7 @@ fbBlt (FbBits *srcLine,
>      int     n, nmiddle;
>      Bool    destInvarient;
>      int     startbyte, endbyte;
> +    int     careful;
>      FbDeclareMergeRop ();
>
>      if (bpp == 24 && !FbCheck24Pix (pm))
> @@ -74,12 +75,16 @@ fbBlt (FbBits *srcLine,
>          return;
>      }
>
> -    if (alu == GXcopy && pm == FB_ALLONES && !reverse &&
> +    careful = !((srcLine < dstLine && srcLine + width * (bpp/8) > dstLine) ||
> +                (dstLine < srcLine && dstLine + width * (bpp/8) > srcLine)) ||
> +              (bpp % 8);
> +
> +    if (alu == GXcopy && pm == FB_ALLONES && !careful &&
>          !(srcX & 7) && !(dstX & 7) && !(width & 7)) {
>          int i;
>          CARD8 *src = (CARD8 *) srcLine;
>          CARD8 *dst = (CARD8 *) dstLine;
> -
> +
>          srcStride *= sizeof(FbBits);
>          dstStride *= sizeof(FbBits);
>          width >>= 3;

What is the story with the 'reverse' variable? Does it have no use anymore and need to be purged later?

--
Best regards,
Siarhei Siamashka
Re: [PATCH 1/6] make byte swapping macros less opaque to the optimizer
On Thu, Apr 28, 2011 at 2:33 AM, Matt Turner <matts...@gmail.com> wrote:

diff --git a/include/misc.h b/include/misc.h
index 803f5ba..b7a3fd2 100644
--- a/include/misc.h
+++ b/include/misc.h
@@ -240,32 +240,17 @@ xstrtokenize(const char *str, const char* separators);
 #define SwapRestL(stuff) \
     SwapLongs((CARD32 *)(stuff + 1), LengthRestL(stuff))
 
-/* byte swap a 32-bit value */
-#define swapl(x, n) { \
-    n = ((char *) (x))[0]; \
-    ((char *) (x))[0] = ((char *) (x))[3]; \
-    ((char *) (x))[3] = n; \
-    n = ((char *) (x))[1]; \
-    ((char *) (x))[1] = ((char *) (x))[2]; \
-    ((char *) (x))[2] = n; }
-
-/* byte swap a short */
-#define swaps(x, n) { \
-    n = ((char *) (x))[0]; \
-    ((char *) (x))[0] = ((char *) (x))[1]; \
-    ((char *) (x))[1] = n; }
-
 /* copy 32-bit value from src to dst byteswapping on the way */
-#define cpswapl(src, dst) { \
-    ((char *)(dst))[0] = ((char *) (src))[3]; \
-    ((char *)(dst))[1] = ((char *) (src))[2]; \
-    ((char *)(dst))[2] = ((char *) (src))[1]; \
-    ((char *)(dst))[3] = ((char *) (src))[0]; }
+#define cpswapl(src, dst) (dst) = lswapl((src))
 
 /* copy short from src to dst byteswapping on the way */
-#define cpswaps(src, dst) { \
-    ((char *) (dst))[0] = ((char *) (src))[1]; \
-    ((char *) (dst))[1] = ((char *) (src))[0]; }
+#define cpswaps(src, dst) (dst) = lswaps((src))
+
+/* byte swap a 32-bit value */
+#define swapl(x, n) (*(x)) = lswapl(*(x))
+
+/* byte swap a short */
+#define swaps(x, n) (*(x)) = lswaps(*(x))

This is not an equal replacement for the previous code. The difference is that the older variant of 'swapl' could accept a pointer with any alignment, but still byteswap exactly 4 bytes at that memory location.
Just to perform an additional sanity check, the following test code can be used:

-#define swapl(x, n) (*(x)) = lswapl(*(x))
+#define swapl(x, n) do { \
+        char dummytmp1[4 - sizeof(*(x))]; \
+        char dummytmp2[sizeof(*(x)) - 4]; \
+        (*(x)) = lswapl(*(x)); \
+    } while (0)
 
 /* byte swap a short */
-#define swaps(x, n) (*(x)) = lswaps(*(x))
+#define swaps(x, n) do { \
+        char dummytmp1[2 - sizeof(*(x))]; \
+        char dummytmp2[sizeof(*(x)) - 2]; \
+        (*(x)) = lswaps(*(x)); \
+    } while (0)

And this extra guard check already spots at least one issue:

swaprep.c:436:2: error: size of array 'dummytmp2' is too large
http://cgit.freedesktop.org/xorg/xserver/tree/dix/swaprep.c?id=xorg-server-1.10.1#n423

One more potential pitfall is data alignment. The older variant could work correctly with any alignment, but the new variant will fail if the pointer is somehow not naturally aligned. So I think the patch is very dangerous, and every affected line of code needs to be carefully reviewed.

--
Best regards,
Siarhei Siamashka
Re: Premultiplied alpha handle for driver
On Tuesday 01 June 2010, Jonathan Morton wrote:

> On Mon, 2010-05-31 at 10:42 +0800, Huang, FrankR wrote:
> > We know that premultiplied alpha will sometimes be used in blending.
> > Is that used in the graphics driver?
>
> XRender always expects premultiplied alpha in images. Your previous
> example, with ARGB values of (100,255,0,0), is therefore an invalid
> pixel vector, since the R channel is greater than the A channel.

This is not necessarily invalid. It's more like an additive blending feature, which is even used by some people:

http://home.comcast.net/~tom_forsyth/blog.wiki.html#[[Premultiplied%20alpha]]

--
Best regards,
Siarhei Siamashka
Re: [cairo] [PATCH, pixman] ARM NEON optimized pixman_blt overlapped stuff
On Friday 20 November 2009, Koen Kooi wrote:

> On 18-11-09 14:41, Siarhei Siamashka wrote:
> > But a bit more speed can be gained by tweaking pixman_blt to support
> > overlapped source/destination areas and making use of SIMD
> > optimizations there:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped_blt
>
> Using those patches on top of 0.17.3 + the xserver patch seems to be
> working without problems on my beagleboard. I see no artifacts when
> scrolling in midori or arora.

It does not just work, it should improve performance too :) On ARM Cortex-A8, using the NEON optimized functions from pixman for overlapped copies is at least 1.5x faster in the cases where the code from fbBlt in the X server would otherwise be used.

Looks like firefox 3.6 has now also joined the list of applications which directly benefit from fast overlapped blits. Earlier versions of firefox did not seem to use such blits (much).

--
Best regards,
Siarhei Siamashka
[PATCH, pixman] ARM NEON optimized pixman_blt overlapped stuff
Hello,

Here is a branch with some basic ARM NEON optimizations for pixman_blt:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=pixman_blt_neon

These are quite similar to MMX/SSE2 and don't introduce any controversy, so I hope that they can be pushed to pixman master soon unless somebody has objections.

But a bit more speed can be gained by tweaking pixman_blt to support overlapped source/destination areas and making use of SIMD optimizations there:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped_blt

Making real use of overlapped pixman_blt also requires patching the X server, and the most simple way of doing it is included in the attached test patch (not intended to be applied as-is). I wonder whether it makes sense to apply the overlapped_blt patches to pixman now, and then later bump the minimal required pixman version in the X server from the current 0.15.20 to one which supports overlapped blt operations? Overlapped blitting is used in many places, such as the GTK file dialog, the arora web browser (scrolling a page back) and some other applications.

Comments are very much welcome. I did not include the changes to drop delegates support in this patchset, because they only distracted everyone from the main topic earlier when overlapped pixman_blt was discussed :-)

--
Best regards,
Siarhei Siamashka

diff --git a/fb/fbcopy.c b/fb/fbcopy.c
index 07eb663..ba394b7 100644
--- a/fb/fbcopy.c
+++ b/fb/fbcopy.c
@@ -91,8 +91,7 @@ fbCopyNtoN (DrawablePtr pSrcDrawable,
     while (nbox--)
     {
 #ifndef FB_ACCESS_WRAPPER  /* pixman_blt() doesn't support accessors yet */
-       if (pm == FB_ALLONES && alu == GXcopy && !reverse &&
-           !upsidedown)
+       if (pm == FB_ALLONES && alu == GXcopy)
        {
            if (!pixman_blt ((uint32_t *)src, (uint32_t *)dst,
                             srcStride, dstStride, srcBpp, dstBpp,
                             (pbox->x1 + dx + srcXoff),
Re: [cairo] [PATCH, pixman] Eliminate fast path flags
On Tuesday 17 November 2009, Soeren Sandmann wrote:

> Siarhei Siamashka <siarhei.siamas...@gmail.com> writes:
> > > Reviews appreciated. I'd also appreciate it if people can test
> > > that it still compiles and works on ARM, as I had to make some
> > > changes blindly there.
> >
> > It compiles fine, but requires a minor fix for ARM NEON. These two
> > lines need a 'PIXMAN_a8' -> 'PIXMAN_solid' change:
> >
> > +    { PIXMAN_OP_OVER, PIXMAN_a8r8g8b8, PIXMAN_a8, PIXMAN_a8r8g8b8, neon_composite_over_8888_n_8888, 0 },
> > +    { PIXMAN_OP_OVER, PIXMAN_a8r8g8b8, PIXMAN_a8, PIXMAN_x8r8g8b8, neon_composite_over_8888_n_8888, 0 },
> >
> > This was spotted by just patch review.
>
> Thanks - I have fixed this and pushed the patch.

Great!

> > But it's also worth mentioning that, using 'blitters-test', the
> > problem only gets detected on the 3280827th iteration, and the test
> > is currently set to run far fewer iterations by default. The
> > probability of getting the right operation and the right formats
> > for source, mask and destination all at the same time is just a bit
> > too low. Tweaking probabilities for more uniform coverage may help
> > a bit. Increasing the number of iterations that are run by default
> > may also be useful. But still, an overnight run makes sense to spot
> > some of the rarer problems.
>
> Yeah, we should probably have at least two CRC32 values, one for the
> 2 million case, and one for a 100 million case that could run
> overnight.
>
> It might also be interesting to have many more than two values, and
> store them in a table where each row would contain both CRC and seed
> for 2 million iterations, for 4 million, for 6 million and so on. The
> test could then pick a range to run at random, which would
> substantially increase the coverage.

The problem is that the performance of different systems may vary a lot, so 100 million iterations may be fine for one system but too much for another. I'm currently using a variation of the attached script to test pixman fast path functions. The longer it runs, the more chances it has to spot problems (if they exist).
For modern systems, it probably also makes sense to add multithreading support to the test, so that it can utilize all the available CPU cores.

--
Best regards,
Siarhei Siamashka

[attachment: test-pixman-fast-paths.sh]
Re: [PATCH, pixman] ARM NEON optimized pixman_blt overlapped stuff
On Wednesday 18 November 2009, Koen Kooi wrote:

> On 18-11-09 14:41, Siarhei Siamashka wrote:
> > Making real use of overlapped pixman_blt also requires patching
> > xserver, and the most simple way to do it is included in the
> > attached test patch (not intended to be applied as-is).
>
> How many kittens will get killed if I provide a pixman with
> overlapped blit support and apply your patch to Xorg after that?

Everything should work fine, except that there will be no way to downgrade pixman. A patched xserver will show bad image artefacts on the use of overlapped blits when run with older, unpatched versions of pixman. The other way around is always fine: patches for pixman have no effect on interoperability with any version of xserver.

> I suspect overlapped blits will also benefit from a cached
> framebuffer as well, right?

Yes. At least for me, the best kind of 'non-3d-accelerated desktop' experience on OMAP3 devices can be achieved with a write-through cached 16bpp framebuffer (and the ShadowFB option explicitly disabled). But it still needs a few more NEON optimizations to address all the bottlenecks.

--
Best regards,
Siarhei Siamashka
[PATCH][pixman] C fast path function for 'over_n_1_0565'
Hello,

This patch provides more than a 2x performance improvement for this particular case (terminal, bitmap font, 16bpp desktop color depth). Benchmarked on ARM Cortex-A8. Tested for correctness on both big and little endian systems.

--
Best regards,
Siarhei Siamashka

From 712e3cc5b9eac0b3ed8dabe644fa4ea2f192c58b Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamas...@nokia.com>
Date: Mon, 9 Nov 2009 14:10:00 +0200
Subject: [PATCH] C fast path function for 'over_n_1_0565'
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

This function is needed to improve the performance of xfce4 terminal when using bitmap fonts and running with a 16bpp desktop. Some other applications may potentially benefit too. After applying this patch, the top functions from the Xorg process in the oprofile log change from

samples  %        image name             symbol name
13296    29.1528  libpixman-1.so.0.17.1  combine_over_u
6452     14.1466  libpixman-1.so.0.17.1  fetch_scanline_r5g6b5
5516     12.0944  libpixman-1.so.0.17.1  fetch_scanline_a1
2273      4.9838  libpixman-1.so.0.17.1  store_scanline_r5g6b5
1741      3.8173  libpixman-1.so.0.17.1  fast_composite_add_1000_1000
1718      3.7669  libc-2.9.so            memcpy

to

samples  %        image name             symbol name
5594     14.7033  libpixman-1.so.0.17.1  fast_composite_over_n_1_0565
4323     11.3626  libc-2.9.so            memcpy
3695      9.7119  libpixman-1.so.0.17.1  fast_composite_add_1000_1000

when scrolling text in a terminal (reading a man page).
---
 pixman/pixman-fast-path.c |  100 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index d2c456b..75a0c1e 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1175,6 +1175,104 @@ fast_composite_over_n_1_8888 (pixman_implementation_t *imp,
     }
 }
 
+static void
+fast_composite_over_n_1_0565 (pixman_implementation_t *imp,
+                              pixman_op_t              op,
+                              pixman_image_t *         src_image,
+                              pixman_image_t *         mask_image,
+                              pixman_image_t *         dst_image,
+                              int32_t                  src_x,
+                              int32_t                  src_y,
+                              int32_t                  mask_x,
+                              int32_t                  mask_y,
+                              int32_t                  dest_x,
+                              int32_t                  dest_y,
+                              int32_t                  width,
+                              int32_t                  height)
+{
+    uint32_t     src, srca;
+    uint16_t    *dst, *dst_line;
+    uint32_t    *mask, *mask_line;
+    int          mask_stride, dst_stride;
+    uint32_t     bitcache, bitmask;
+    int32_t      w;
+    uint32_t     d;
+    uint16_t     src565;
+
+    if (width <= 0)
+        return;
+
+    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    srca = src >> 24;
+    if (src == 0)
+        return;
+
+    PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t,
+                           dst_stride, dst_line, 1);
+    PIXMAN_IMAGE_GET_LINE (mask_image, 0, mask_y, uint32_t,
+                           mask_stride, mask_line, 1);
+    mask_line += mask_x >> 5;
+
+    if (srca == 0xff)
+    {
+        src565 = CONVERT_8888_TO_0565 (src);
+        while (height--)
+        {
+            dst = dst_line;
+            dst_line += dst_stride;
+            mask = mask_line;
+            mask_line += mask_stride;
+            w = width;
+
+            bitcache = *mask++;
+            bitmask = CREATE_BITMASK (mask_x & 31);
+
+            while (w--)
+            {
+                if (bitmask == 0)
+                {
+                    bitcache = *mask++;
+                    bitmask = CREATE_BITMASK (0);
+                }
+                if (bitcache & bitmask)
+                    *dst = src565;
+                bitmask = UPDATE_BITMASK (bitmask);
+                dst++;
+            }
+        }
+    }
+    else
+    {
+        while (height--)
+        {
+            dst = dst_line;
+            dst_line += dst_stride;
+            mask = mask_line;
+            mask_line += mask_stride;
+            w = width;
+
+            bitcache = *mask++;
+            bitmask = CREATE_BITMASK (mask_x & 31);
+
+            while (w--)
+            {
+                if (bitmask == 0)
+                {
+                    bitcache = *mask++;
+                    bitmask = CREATE_BITMASK (0);
+                }
+                if (bitcache & bitmask)
+                {
+                    d = over (src, CONVERT_0565_TO_0888 (*dst));
+                    *dst = CONVERT_8888_TO_0565 (d);
+                }
+                bitmask = UPDATE_BITMASK (bitmask);
+                dst++;
+            }
+        }
+    }
+}
+
 /*
  * Simple bitblt
  */
@@ -1261,6 +1359,8 @@ static const pixman_fast_path_t c_fast_paths[] =
     { PIXMAN_OP_OVER, PIXMAN_solid, PIXMAN_a1, PIXMAN_x8r8g8b8, fast_composite_over_n_1_8888, },
     { PIXMAN_OP_OVER, PIXMAN_solid, PIXMAN_a1, PIXMAN_a8b8g8r8, fast_composite_over_n_1_8888, },
     { PIXMAN_OP_OVER, PIXMAN_solid, PIXMAN_a1, PIXMAN_x8b8g8r8, fast_composite_over_n_1_8888
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
On Friday 23 October 2009, Koen Kooi wrote:
> > I'm not sure about pixman_gc_t since most of the needed operations are
> > just simple copies. What about starting with just introducing a variant
> > of 'pixman_blt' which is overlapping aware?
> >
> > I created a work-in-progress branch with the 'pixman_blt' function
> > (generic C implementation for now) extended to support the overlapped
> > source/destination case. A simple test program is also included:
> >
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

First, this branch is outdated. There is a new branch with the final code :)

http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2

> Would using said branch 'magically' give me a performance boost (e.g.
> not make firefox unusably slow as it is now on a 600MHz Cortex-A8), or
> would I need to patch other libs (e.g. xrender) as well?

Not really, it's just a small extension of pixman functionality. Currently
the handling of the overlapped blt operation (for software rendering) is
done in xorg-server. As it is the responsibility of pixman to provide
CPU-specific SIMD optimizations (NEON for ARM Cortex-A8), it would be quite
natural to move this work to pixman. So the next steps are to add NEON
optimizations to pixman_blt and make sure that the X server takes advantage
of these optimizations for the overlapped blit too.

As for improving scrolling performance (assuming a standard fbdev driver),
the most important thing is to improve framebuffer memory performance.
Right now framebuffer memory is mapped as noncached writecombine on OMAP3.
Enabling write-through caching for it (with a simple kernel patch) improves
scrolling and window-moving performance by a factor of 4x-5x (unless a
shadow framebuffer is used, which is also not good for performance). This
works fine if nothing but the CPU can modify framebuffer memory. But if the
GPU or DSP can also access framebuffer memory, or a compositing manager is
used, everything gets more complicated. Cache invalidate operations have to
be inserted in the appropriate places to ensure memory coherency and a
uniform view of the contents from all the units. If the default write-back
cache is used instead of write-through, cache flush operations are needed
too.

Unpatched Firefox is also quite slow for another reason - it always works
with 32bpp data internally, no matter what color depth is used for the
desktop.

-- 
Best regards,
Siarhei Siamashka
___
xorg-devel mailing list
xorg-devel@lists.x.org
http://lists.x.org/mailman/listinfo/xorg-devel
Re: [cairo] [RFC] Pixman compositing with overlapping source and destination pixel data
On Thursday 04 June 2009, Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamas...@gmail.com> writes:
>
> > What kind of guarantees (or the lack of them) are pixman and XRender
> > supposed to provide when dealing with overlapping parts of images?
>
> (Adding xorg-devel). See this thread:
>
> http://lists.freedesktop.org/archives/xorg/2008-October/039346.html
>
> The guarantee that I would suggest for Render and pixman is that if any
> pixel is both read and written in the same request, then the result of
> the whole request is undefined, except that it obeys the clipping rules.
>
> > The practical use case could be scrolling of data inside a single big
> > image. If rendering with overlapped source and destination areas is
> > not supported, a temporary image has to be created to achieve the
> > expected result, and this is an additional performance hit.
>
> Yes, scrolling is one thing that the current pixman API doesn't really
> provide. 'pixman_blt()' only deals with cases where the source and
> destination don't overlap.
>
> I think the best solution is to move all of the X primitives (CopyArea,
> DrawLine, DrawArc, etc.) into pixman. For CopyArea it would probably
> look something like this:
>
>     void
>     pixman_copy_area (pixman_image_t *src,
>                       pixman_image_t *dest,
>                       pixman_gc_t    *gc,
>                       int             src_x,
>                       int             src_y,
>                       int             width,
>                       int             height,
>                       int             dest_x,
>                       int             dest_y);
>
> and that would be guaranteed to handle overlapping src and dest. A
> pixman_gc_t would be a new type of object, corresponding to an X GC.
>
> pixman_blt() would then become a deprecated wrapper that would just
> call pixman_copy_area(). Same for pixman_fill() and a new
> pixman_fill_rectangles().

I'm not sure about pixman_gc_t since most of the needed operations are
just simple copies. What about starting with just introducing a variant
of 'pixman_blt' which is overlapping aware?

I created a work-in-progress branch with the 'pixman_blt' function
(generic C implementation for now) extended to support the overlapped
source/destination case. A simple test program is also included:

http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

Making use of the already existing SIMD optimized pixel copy functions
should provide fast scrolling in all directions except from left to
right. This special case will require a SIMD optimized backwards copy.

I wonder if it makes sense to drop delegates support for pixman_blt and
make the call chain shorter when introducing SIMD optimized copies? It
seems to be a little bit overdesigned here.

Running the test program for the current pixman master (SSE2 optimized):

$ test/overlapped-blt-test
bpp=8, supported=0, normal_ok=0, overlapped_ok=0
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0

Running the test program for the pixman from this branch with the generic
C version of pixman_blt (8bpp now uses a fallback to the generic C
implementation):

$ test/overlapped-blt-test
bpp=8, supported=1, normal_ok=1, overlapped_ok=1
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0

-- 
Best regards,
Siarhei Siamashka