Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Knut Petersen



The special case for s_pitch == 2 saves about 270 ms system time (2120 ->
1850ms)
with a 16x30 font.
   


Compared to what? How much is the function call overhead?

 

Your version of the inline code inserted after an if (idx==2) in 
bit_putcs against my version of the

inline code.

cu,
knut

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

> I added the multiply back because gcc (v. 3.3.4) does generate the fastest
> code
> if I write it this way.

The multiply is not generally faster, so your version may be the fastest, 
but in other situations it will be a lot slower. My version is generally 
fast.

> The special case for s_pitch == 2 saves about 270 ms system time (2120 ->
> 1850ms)
> with a 16x30 font.

Compared to what? How much is the function call overhead?

> It´s as fast/slow as your previous version, the measurements are almost
> identical.

The generated code is a bit smaller, so it mostly depends on the icache to 
see a difference.

bye, Roman

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Knut Petersen

 Hi Roman!


+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
u32 s_pitch, u32 height)
+{
+   int i, j;
+
+   if (likely(s_pitch==1))
+   for(i=0; i < height; i++)
+   dst[d_pitch*i] = src[i];
   



I added the multiply back because gcc (v. 3.3.4) does generate the 
fastest code
if I write it this way. I compiled, inspected the generated assembly and 
benchmarked

about a  dozend variations of the code (benchmark as previously described).

The special case for s_pitch == 1 saves about 10 ms system time (770 ms 
-> 760 ms)


The special case for s_pitch == 2 saves about 270 ms system time (2120 
-> 1850ms)

with a 16x30 font.

The third case is for even bigger fonts ... I believe that it will not 
often be used but

something like that must be present.

You have now 3 slightly different variants of the same, which isn't really 
an improvement. In my example I showed you how to generate the first and 
last version from the same source.


The first version will only be generated when gcc can be sure that 
s_pitch is 1.
Therefore you had to explicitly call __fb_pad_aligned_buffer with that 
value:


if (likely(idx == 1))
__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
else
fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);

With the version I propose it´s enough to write

__fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);

instead, and you will get good performance for all cases. If the value of
idx/s_pitch is know at compile time, the compiler can and will ignore the
other cases.

If you also want to optimize for other sizes, you might want to always 
inline the function, if the function call overhead is the largest part 
anyway, the special case for 2 bytes might not be needed anymore.
 

fb_pad_aligned_buffer() is  usefull to save some space in cases like 
softcursor.
It´s also used by some  drivers (nvidia and riva), but the authors of 
those drivers
have to decide if they prefer the inlined version or the version fixed 
in fbmem.



BTW this version saves another condition:

static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 
s_pitch, u32 height)
{
int i, j;

d_pitch -= s_pitch;
i = height;
do {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
j = s_pitch;
do
*dst++ = *src++;
while (--j > 0);
dst += d_pitch;
} while (--i > 0);
}


I tested that code, together with the followig code in but_putcs():

if(idx==1)
   __fb_pad_aligned_buffer(dst, pitch, src, 1, 
image.height);

else if (idx==2)
   __fb_pad_aligned_buffer(dst, pitch, src, 2, 
image.height);

else
   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);

dst += width;

It´s as fast/slow as your previous version, the measurements are almost 
identical.


Let´s summarize:

Your version of __fb_pad_aligned_buffer looks much better, but it needs 
not so nice
conditionals when used. My version looks bad, but it is easier to use 
and it is

_faster_.

cu,
knut



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

> +static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
> u32 s_pitch, u32 height)
> +{
> + int i, j;
> +
> + if (likely(s_pitch==1))
> + for(i=0; i < height; i++)
> + dst[d_pitch*i] = src[i];
> + else if (s_pitch==2)
> + for(i=0; i < height; i++) {
> + *(u16 *)dst = ((u16 *)src)[i];
> + dst += d_pitch;
> + }
> + else {
> + d_pitch -= s_pitch;
> + for (i = height; i--; ) {
> + for (j = 0; j < s_pitch; j++)
> + *dst++ = *src++;
> + dst += d_pitch;
> + }
> + }
> +}

Why did you add the multiply back?
You have now 3 slightly different variants of the same, which isn't really 
an improvement. In my example I showed you how to generate the first and 
last version from the same source.
If you also want to optimize for other sizes, you might want to always 
inline the function, if the function call overhead is the largest part 
anyway, the special case for 2 bytes might not be needed anymore.

BTW this version saves another condition:

static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 
s_pitch, u32 height)
{
int i, j;

d_pitch -= s_pitch;
i = height;
do {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
j = s_pitch;
do
*dst++ = *src++;
while (--j > 0);
dst += d_pitch;
} while (--i > 0);
}

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Antonino A. Daplas wrote:

> Roman, okay if you have a 'Signed-off-by' line?

Okay.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Knut Petersen


Something like below, which has the advantange that there is still only 
one implementation of the function



True, that´s a great advantage.


and if it's still slower, we really need to check the compiler
 

Please have a look at the following patch. It takes your idea of 
inlining but moves
the special cases into the macro, speeding things up for the very likely 
case of
s_pitch == 1 and the less likely case of s_pitch of 2. Treating s_pitch 
== 2 special
gives a still significant performance improvement of more than 10 % for 
16x30

fonts.

This way also bit_putcs looks better again ...

Andrew, as this way is better than and still as fast as my first 
approach I think

framebuffer-bit_putcs-optimization-for-8x.patch should be reverted and the
following patch should be applied instead.

Antonino, Roman, Geert, do you agree?


cu,
knut

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c   2005-08-29 01:41:01.0 
+0200
+++ linux/drivers/video/console/bitblit.c   2005-08-31 10:06:22.0 
+0200
@@ -175,7 +175,7 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}

-   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
+   __fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
dst += width;
}
}
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/drivers/video/fbmem.c linux/drivers/video/fbmem.c
--- linuxorig/drivers/video/fbmem.c 2005-08-29 01:41:01.0 +0200
+++ linux/drivers/video/fbmem.c 2005-08-31 13:36:16.0 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
 */
void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 
height)
{
-   int i, j;
-
-   for (i = height; i--; ) {
-   /* s_pitch is a few bytes at the most, memcpy is suboptimal */
-   for (j = 0; j < s_pitch; j++)
-   dst[j] = src[j];
-   src += s_pitch;
-   dst += d_pitch;
-   }
+   __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
}
EXPORT_SYMBOL(fb_pad_aligned_buffer);

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/include/linux/fb.h linux/include/linux/fb.h
--- linuxorig/include/linux/fb.h2005-08-29 01:41:01.0 +0200
+++ linux/include/linux/fb.h2005-08-31 12:45:08.0 +0200
@@ -824,6 +824,38 @@ extern int fb_get_color_depth(struct fb_
extern int fb_get_options(char *name, char **option);
extern int fb_new_modelist(struct fb_info *info);

+
+/*
+ * Don't change without testing performance of framebuffer
+ * bitblitting. Inlining is necessary for performance reasons.
+ * Although the code might not _look_ fast because of some
+ * multiplications, it really _is_ fast as it is easier for gcc
+ * to optimize well.
+ */
+
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, 
+		u32 s_pitch, u32 height)

+{
+   int i, j;
+
+   if (likely(s_pitch==1))
+   for(i=0; i < height; i++)
+   dst[d_pitch*i] = src[i];
+   else if (s_pitch==2)
+   for(i=0; i < height; i++) {
+   *(u16 *)dst = ((u16 *)src)[i];
+   dst += d_pitch;
+   }
+   else {
+   d_pitch -= s_pitch;
+   for (i = height; i--; ) {
+   for (j = 0; j < s_pitch; j++)
+   *dst++ = *src++;
+   dst += d_pitch;
+   }
+   }
+}
+
extern struct fb_info *registered_fb[FB_MAX];
extern int num_registered_fb;





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Antonino A. Daplas
Roman Zippel wrote:
> Hi,
> 
> On Wed, 31 Aug 2005, Knut Petersen wrote:
> 
>> How could I make it an inline function? It is used in console/bitblit.c,
>> nvidia/nvidia.c,
>> riva/fbdev.c and softcursor.c.
> 
> Something like below, which has the advantange that there is still only 
> one implementation of the function and if it's still slower, we really 
> need to check the compiler.
> 

I do get better numbers with this, not much, but better than Knut's (5ms), and
definitely much better than the old uninlined one (100ms).

Andrew, don't get this yet.  I'll incorporate this with the bit_putcs()
breakup that I'm currently doing.

Roman, okay if you have a 'Signed-off-by' line?

Tony
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Antonino A. Daplas
Roman Zippel wrote:
 Hi,
 
 On Wed, 31 Aug 2005, Knut Petersen wrote:
 
 How could I make it an inline function? It is used in console/bitblit.c,
 nvidia/nvidia.c,
 riva/fbdev.c and softcursor.c.
 
 Something like below, which has the advantange that there is still only 
 one implementation of the function and if it's still slower, we really 
 need to check the compiler.
 

I do get better numbers with this, not much, but better than Knut's (5ms), and
definitely much better than the old uninlined one (100ms).

Andrew, don't get this yet.  I'll incorporate this with the bit_putcs()
breakup that I'm currently doing.

Roman, okay if you have a 'Signed-off-by' line?

Tony
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Knut Petersen


Something like below, which has the advantange that there is still only 
one implementation of the function



True, that´s a great advantage.


and if it's still slower, we really need to check the compiler
 

Please have a look at the following patch. It takes your idea of 
inlining but moves
the special cases into the macro, speeding things up for the very likely 
case of
s_pitch == 1 and the less likely case of s_pitch of 2. Treating s_pitch 
== 2 special
gives a still significant performance improvement of more than 10 % for 
16x30

fonts.

This way also bit_putcs looks better again ...

Andrew, as this way is better than and still as fast as my first 
approach I think

framebuffer-bit_putcs-optimization-for-8x.patch should be reverted and the
following patch should be applied instead.

Antonino, Roman, Geert, do you agree?


cu,
knut

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c   2005-08-29 01:41:01.0 
+0200
+++ linux/drivers/video/console/bitblit.c   2005-08-31 10:06:22.0 
+0200
@@ -175,7 +175,7 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}

-   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
+   __fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
dst += width;
}
}
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/drivers/video/fbmem.c linux/drivers/video/fbmem.c
--- linuxorig/drivers/video/fbmem.c 2005-08-29 01:41:01.0 +0200
+++ linux/drivers/video/fbmem.c 2005-08-31 13:36:16.0 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
 */
void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 
height)
{
-   int i, j;
-
-   for (i = height; i--; ) {
-   /* s_pitch is a few bytes at the most, memcpy is suboptimal */
-   for (j = 0; j  s_pitch; j++)
-   dst[j] = src[j];
-   src += s_pitch;
-   dst += d_pitch;
-   }
+   __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
}
EXPORT_SYMBOL(fb_pad_aligned_buffer);

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/include/linux/fb.h linux/include/linux/fb.h
--- linuxorig/include/linux/fb.h2005-08-29 01:41:01.0 +0200
+++ linux/include/linux/fb.h2005-08-31 12:45:08.0 +0200
@@ -824,6 +824,38 @@ extern int fb_get_color_depth(struct fb_
extern int fb_get_options(char *name, char **option);
extern int fb_new_modelist(struct fb_info *info);

+
+/*
+ * Don't change without testing performance of framebuffer
+ * bitblitting. Inlining is necessary for performance reasons.
+ * Although the code might not _look_ fast because of some
+ * multiplications, it really _is_ fast as it is easier for gcc
+ * to optimize well.
+ */
+
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, 
+		u32 s_pitch, u32 height)

+{
+   int i, j;
+
+   if (likely(s_pitch==1))
+   for(i=0; i  height; i++)
+   dst[d_pitch*i] = src[i];
+   else if (s_pitch==2)
+   for(i=0; i  height; i++) {
+   *(u16 *)dst = ((u16 *)src)[i];
+   dst += d_pitch;
+   }
+   else {
+   d_pitch -= s_pitch;
+   for (i = height; i--; ) {
+   for (j = 0; j  s_pitch; j++)
+   *dst++ = *src++;
+   dst += d_pitch;
+   }
+   }
+}
+
extern struct fb_info *registered_fb[FB_MAX];
extern int num_registered_fb;





-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Antonino A. Daplas wrote:

 Roman, okay if you have a 'Signed-off-by' line?

Okay.

bye, Roman
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

 +static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
 u32 s_pitch, u32 height)
 +{
 + int i, j;
 +
 + if (likely(s_pitch==1))
 + for(i=0; i  height; i++)
 + dst[d_pitch*i] = src[i];
 + else if (s_pitch==2)
 + for(i=0; i  height; i++) {
 + *(u16 *)dst = ((u16 *)src)[i];
 + dst += d_pitch;
 + }
 + else {
 + d_pitch -= s_pitch;
 + for (i = height; i--; ) {
 + for (j = 0; j  s_pitch; j++)
 + *dst++ = *src++;
 + dst += d_pitch;
 + }
 + }
 +}

Why did you add the multiply back?
You have now 3 slightly different variants of the same, which isn't really 
an improvement. In my example I showed you how to generate the first and 
last version from the same source.
If you also want to optimize for other sizes, you might want to always 
inline the function, if the function call overhead is the largest part 
anyway, the special case for 2 bytes might not be needed anymore.

BTW this version saves another condition:

static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 
s_pitch, u32 height)
{
int i, j;

d_pitch -= s_pitch;
i = height;
do {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
j = s_pitch;
do
*dst++ = *src++;
while (--j  0);
dst += d_pitch;
} while (--i  0);
}

bye, Roman

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Knut Petersen

 Hi Roman!


+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
u32 s_pitch, u32 height)
+{
+   int i, j;
+
+   if (likely(s_pitch==1))
+   for(i=0; i  height; i++)
+   dst[d_pitch*i] = src[i];
   



I added the multiply back because gcc (v. 3.3.4) does generate the 
fastest code
if I write it this way. I compiled, inspected the generated assembly and 
benchmarked

about a  dozend variations of the code (benchmark as previously described).

The special case for s_pitch == 1 saves about 10 ms system time (770 ms 
- 760 ms)


The special case for s_pitch == 2 saves about 270 ms system time (2120 
- 1850ms)

with a 16x30 font.

The third case is for even bigger fonts ... I believe that it will not 
often be used but

something like that must be present.

You have now 3 slightly different variants of the same, which isn't really 
an improvement. In my example I showed you how to generate the first and 
last version from the same source.


The first version will only be generated when gcc can be sure that 
s_pitch is 1.
Therefore you had to explicitly call __fb_pad_aligned_buffer with that 
value:


if (likely(idx == 1))
__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
else
fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);

With the version I propose it´s enough to write

__fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);

instead, and you will get good performance for all cases. If the value of
idx/s_pitch is know at compile time, the compiler can and will ignore the
other cases.

If you also want to optimize for other sizes, you might want to always 
inline the function, if the function call overhead is the largest part 
anyway, the special case for 2 bytes might not be needed anymore.
 

fb_pad_aligned_buffer() is  usefull to save some space in cases like 
softcursor.
It´s also used by some  drivers (nvidia and riva), but the authors of 
those drivers
have to decide if they prefer the inlined version or the version fixed 
in fbmem.



BTW this version saves another condition:

static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 
s_pitch, u32 height)
{
int i, j;

d_pitch -= s_pitch;
i = height;
do {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
j = s_pitch;
do
*dst++ = *src++;
while (--j  0);
dst += d_pitch;
} while (--i  0);
}


I tested that code, together with the followig code in but_putcs():

if(idx==1)
   __fb_pad_aligned_buffer(dst, pitch, src, 1, 
image.height);

else if (idx==2)
   __fb_pad_aligned_buffer(dst, pitch, src, 2, 
image.height);

else
   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);

dst += width;

It´s as fast/slow as your previous version, the measurements are almost 
identical.


Let´s summarize:

Your version of __fb_pad_aligned_buffer looks much better, but it needs 
not so nice
conditionals when used. My version looks bad, but it is easier to use 
and it is

_faster_.

cu,
knut



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

 I added the multiply back because gcc (v. 3.3.4) does generate the fastest
 code
 if I write it this way.

The multiply is not generally faster, so your version may be the fastest, 
but in other situations it will be a lot slower. My version is generally 
fast.

 The special case for s_pitch == 2 saves about 270 ms system time (2120 -
 1850ms)
 with a 16x30 font.

Compared to what? How much is the function call overhead?

 It´s as fast/slow as your previous version, the measurements are almost
 identical.

The generated code is a bit smaller, so it mostly depends on the icache to 
see a difference.

bye, Roman

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-31 Thread Knut Petersen



The special case for s_pitch == 2 saves about 270 ms system time (2120 -
1850ms)
with a 16x30 font.
   


Compared to what? How much is the function call overhead?

 

Your version of the inline code inserted after an if (idx==2) in 
bit_putcs against my version of the

inline code.

cu,
knut

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Antonino A. Daplas
Knut Petersen wrote:
> This trivial patch gives a performance boost to the framebuffer console
> 
> Constructing the bitmaps that are given to the bitblit functions of the
> framebuffer
> drivers is time consuming. Here we avoide a call to the slow
> fb_pad_aligned_buffer().
> The patch replaces that call with a simple but much more efficient
> bytewise copy.
> 
> The kernel spends a significant time at this place if you use 8x* fonts.
> Every
> pixel displayed on your screen is prepared here.
> 
> Some benchmark results:
> 
> Displaying a file of 2000 lines with 160 characters each takes 889 ms
> system
> time using  cyblafb on my system  (I´m using a 1280x1024 video mode,
> resulting in a 160x64 character console)
> 
> Displaying the same file with the enclosed patch applied to 2.6.13 only
> takes
> 760 ms system time, saving 129 ms or 14.5%.

Where did this 14.5% come from?  Is it bit_putcs alone or is more
real world, such as 'time cat text_file'? If I'm to guess, a large
percent of the improvement is caused by the inlining of the code.

I'm not against the patch, it will benefit drivers with very fast
imageblits.  However, since most drivers have imageblits done in software,
a large proportion of the processing time will go to the imageblit itself,
so I don't think you'll get that high a number (I get only a 4%
improvement, and this is in a driver with accelerated blits, and it will
probably be lower, ie, in vesafb).

Anyway, with the addition of your patch, bit_putcs has now reached an
'ugliness threshhold' for a revamp.

Tony
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Antonino A. Daplas
Knut Petersen wrote:
> fb_pad_aligned_buffer() is also slower for those cases. But does anybody
> use such fonts?

Yes, there are 16x30 fonts out there in the wild. 

Tony
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

> How could I make it an inline function? It is used in console/bitblit.c,
> nvidia/nvidia.c,
> riva/fbdev.c and softcursor.c.

Something like below, which has the advantange that there is still only 
one implementation of the function and if it's still slower, we really 
need to check the compiler.

bye, Roman

 drivers/video/console/bitblit.c |5 -
 drivers/video/fbmem.c   |   10 +-
 include/linux/fb.h  |   13 +
 3 files changed, 18 insertions(+), 10 deletions(-)

Index: linux-2.6/drivers/video/fbmem.c
===
--- linux-2.6.orig/drivers/video/fbmem.c2005-08-30 21:17:44.0 
+0200
+++ linux-2.6/drivers/video/fbmem.c 2005-08-31 01:20:37.0 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
  */
 void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 
height)
 {
-   int i, j;
-
-   for (i = height; i--; ) {
-   /* s_pitch is a few bytes at the most, memcpy is suboptimal */
-   for (j = 0; j < s_pitch; j++)
-   dst[j] = src[j];
-   src += s_pitch;
-   dst += d_pitch;
-   }
+   __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
 }
 EXPORT_SYMBOL(fb_pad_aligned_buffer);
 
Index: linux-2.6/drivers/video/console/bitblit.c
===
--- linux-2.6.orig/drivers/video/console/bitblit.c  2005-08-30 
01:55:20.0 +0200
+++ linux-2.6/drivers/video/console/bitblit.c   2005-08-31 01:25:30.0 
+0200
@@ -175,7 +175,10 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}
 
-   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
+   if (likely(idx == 1))
+   __fb_pad_aligned_buffer(dst, pitch, 
src, 1, image.height);
+   else
+   fb_pad_aligned_buffer(dst, pitch, src, 
idx, image.height);
dst += width;
}
}
Index: linux-2.6/include/linux/fb.h
===
--- linux-2.6.orig/include/linux/fb.h   2005-08-30 01:56:29.0 +0200
+++ linux-2.6/include/linux/fb.h2005-08-31 01:21:04.0 +0200
@@ -824,6 +824,19 @@ extern int fb_get_color_depth(struct fb_
 extern int fb_get_options(char *name, char **option);
 extern int fb_new_modelist(struct fb_info *info);
 
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 
s_pitch, u32 height)
+{
+   int i, j;
+
+   d_pitch -= s_pitch;
+   for (i = height; i--; ) {
+   /* s_pitch is a few bytes at the most, memcpy is suboptimal */
+   for (j = 0; j < s_pitch; j++)
+   *dst++ = *src++;
+   dst += d_pitch;
+   }
+}
+
 extern struct fb_info *registered_fb[FB_MAX];
 extern int num_registered_fb;
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Knut Petersen

Hi Roman,

Could you try the patch below, for a few extra cycles you might want to 
make it an inline function.
 

No, it does not help. If there is any difference, it is too small to be 
measured on

my system ... and my system does run at 1000 Hz.

After 2.6.12 fb_pad_aligned_buffer() was changed to use memcpy() instead 
of a
bytewise copy. That slowed things down a lot, some weeks ago that was 
reverted.
fb_pad_aligned _buffer() isn´t that slow, it´s just an external function 
to be called

and that means a lot of unnecessary code.

How could I make it an inline function? It is used in console/bitblit.c, 
nvidia/nvidia.c,

riva/fbdev.c and softcursor.c.

cu,
Knut
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Geert Uytterhoeven
On Tue, 30 Aug 2005, Knut Petersen wrote:
> > Probably you can make it even faster by avoiding the multiplication, like
> > 
> >unsigned int offset = 0;
> >for (i = 0; i < image.height; i++) {
> > dst[offset] = src[i];
> > offset += pitch;
> >}
> 
> More than two decades ago I learned to avoid mul and imul. Use shifts, add and
> lea instead,
> that was the credo those days. The name of the game was CP/M 80/86, a86, d86
> and ddt ;-)
> 
> But let�s get serious again.

On modern CPUs, a multiplication indeed takes 1 cycle, just like an addition.
But on older CPUs (still supported by Linux), this is not true.

> Your proposed change of the patch results in a 21 ms performance decrease on
> my system.
> Yes, I do know that this is hard to believe. I tested a similar variation
> before, and the results
> were even worse.
> 
> Avoiding mul is a good idea in assembly language today, but often it is better
> to write a
> multiplication  with the loop counter in C and not to introduce an extra
> variable instead. The
> compiler will optimize the code and it�s easier for gcc without that extra
> variable.

But you are right. On actual inspection of the generated assembly code for a
very simple test case, it turns out both (m68k-linux-)gcc 2.95.2 and 3.3.3 are
smart enough to convert the multiplication to an addition...

And interestingly, if I avoid the multiplication explicitly, gcc 2.95.2 still
generates the same code, but 3.3.3 adds a few extra instructions to
save/restore local vars. So this probably explains why it turned out to be
slower for you. Ugh...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Roman Zippel
Hi,

On Tue, 30 Aug 2005, Knut Petersen wrote:

> > Probably you can make it even faster by avoiding the multiplication, like
> > 
> >unsigned int offset = 0;
> >for (i = 0; i < image.height; i++) {
> > dst[offset] = src[i];
> > offset += pitch;
> >}
> > 
> 
> More than two decades ago I learned to avoid mul and imul. Use shifts, add and
> lea instead,
> that was the credo those days. The name of the game was CP/M 80/86, a86, d86
> and ddt ;-)
> 
> But let´s get serious again.
> 
> Your proposed change of the patch results in a 21 ms performance decrease on
> my system.
> Yes, I do know that this is hard to believe. I tested a similar variation
> before, and the results
> were even worse.

Could you try the patch below, for a few extra cycles you might want to 
make it an inline function.

bye, Roman

Index: linux-2.6/drivers/video/fbmem.c
===
--- linux-2.6.orig/drivers/video/fbmem.c2005-08-30 01:55:18.0 
+0200
+++ linux-2.6/drivers/video/fbmem.c 2005-08-30 21:10:25.705462837 +0200
@@ -82,11 +82,11 @@ void fb_pad_aligned_buffer(u8 *dst, u32 
 {
int i, j;
 
+   d_pitch -= s_pitch;
for (i = height; i--; ) {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
for (j = 0; j < s_pitch; j++)
-   dst[j] = src[j];
-   src += s_pitch;
+   *dst++ = *src++;
dst += d_pitch;
}
 }

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Knut Petersen



Probably you can make it even faster by avoiding the multiplication, like

   unsigned int offset = 0;
   for (i = 0; i < image.height; i++) {
dst[offset] = src[i];
offset += pitch;
   }



More than two decades ago I learned to avoid mul and imul. Use shifts, 
add and lea instead,
that was the credo those days. The name of the game was CP/M 80/86, a86, 
d86 and ddt ;-)


But let´s get serious again.

Your proposed change of the patch results in a 21 ms performance 
decrease on my system.
Yes, I do know that this is hard to believe. I tested a similar 
variation before, and the results

were even worse.

Avoiding mul is a good idea in assembly language today, but often it is 
better to write a
multiplication  with the loop counter in C and not to introduce an extra 
variable instead. The
compiler will optimize the code and it´s easier for gcc without that 
extra variable.


More interesting would be the question what should be done for idx==2 or 
idx==3. Probably
fb_pad_aligned_buffer() is also slower for those cases. But does anybody 
use such fonts?


cu,
knut
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Geert Uytterhoeven
On Tue, 30 Aug 2005, Knut Petersen wrote:
> linux/drivers/video/console/bitblit.c
> --- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.0
> +0200
> +++ linux/drivers/video/console/bitblit.c 2005-08-30 17:19:57.0
> +0200
> @@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
>   unsigned int scan_align = info->pixmap.scan_align - 1;
>   unsigned int buf_align = info->pixmap.buf_align - 1;
>   unsigned int shift_low = 0, mod = vc->vc_font.width % 8;
> - unsigned int shift_high = 8, pitch, cnt, size, k;
> + unsigned int shift_high = 8, pitch, cnt, size, i, k;
>   unsigned int idx = vc->vc_font.width >> 3;
>   unsigned int attribute = get_attribute(info, scr_readw(s));
>   struct fb_image image;
> @@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
>   src = buf;
>   }
> 
> - fb_pad_aligned_buffer(dst, pitch, src, idx,
> image.height);
> + if (idx == 1)
> + for(i=0; i < image.height; i++)
> + dst[pitch*i] = src[i];
^^^
> + else
> + fb_pad_aligned_buffer(dst, pitch, src,
> idx, image.height);
>   dst += width;
>   }
>   }

Probably you can make it even faster by avoiding the multiplication, like

unsigned int offset = 0;
for (i = 0; i < image.height; i++) {
dst[offset] = src[i];
offset += pitch;
}

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Knut Petersen

This trivial patch gives a performance boost to the framebuffer console

Constructing the bitmaps that are given to the bitblit functions of the 
framebuffer
drivers is time consuming. Here we avoide a call to the slow 
fb_pad_aligned_buffer().
The patch replaces that call with a simple but much more efficient 
bytewise copy.


The kernel spends a significant time at this place if you use 8x* fonts. 
Every

pixel displayed on your screen is prepared here.

Some benchmark results:

Displaying a file of 2000 lines with 160 characters each takes 889 ms 
system

time using  cyblafb on my system  (I´m using a 1280x1024 video mode,
resulting in a 160x64 character console)

Displaying the same file with the enclosed patch applied to 2.6.13 only 
takes

760 ms system time, saving 129 ms or 14.5%.

Font widths other than 8 are not affected.

The advantage and correctness of this patch should be obvious.

Signed-off-by: Knut Petersen <[EMAIL PROTECTED]>

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c   2005-08-29 01:41:01.0 
+0200
+++ linux/drivers/video/console/bitblit.c   2005-08-30 17:19:57.0 
+0200
@@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
unsigned int scan_align = info->pixmap.scan_align - 1;
unsigned int buf_align = info->pixmap.buf_align - 1;
unsigned int shift_low = 0, mod = vc->vc_font.width % 8;
-   unsigned int shift_high = 8, pitch, cnt, size, k;
+   unsigned int shift_high = 8, pitch, cnt, size, i, k;
unsigned int idx = vc->vc_font.width >> 3;
unsigned int attribute = get_attribute(info, scr_readw(s));
struct fb_image image;
@@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}

-   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
+   if (idx == 1)
+   for(i=0; i < image.height; i++)
+   dst[pitch*i] = src[i];
+   else
+   fb_pad_aligned_buffer(dst, pitch, src, 
idx, image.height);
dst += width;
}
}


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Knut Petersen

This trivial patch gives a performance boost to the framebuffer console

Constructing the bitmaps that are given to the bitblit functions of the 
framebuffer
drivers is time consuming. Here we avoide a call to the slow 
fb_pad_aligned_buffer().
The patch replaces that call with a simple but much more efficient 
bytewise copy.


The kernel spends a significant time at this place if you use 8x* fonts. 
Every

pixel displayed on your screen is prepared here.

Some benchmark results:

Displaying a file of 2000 lines with 160 characters each takes 889 ms 
system

time using  cyblafb on my system  (I´m using a 1280x1024 video mode,
resulting in a 160x64 character console)

Displaying the same file with the enclosed patch applied to 2.6.13 only 
takes

760 ms system time, saving 129 ms or 14.5%.

Font widths other than 8 are not affected.

The advantage and correctness of this patch should be obvious.

Signed-off-by: Knut Petersen [EMAIL PROTECTED]

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' 
linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c   2005-08-29 01:41:01.0 
+0200
+++ linux/drivers/video/console/bitblit.c   2005-08-30 17:19:57.0 
+0200
@@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
unsigned int scan_align = info-pixmap.scan_align - 1;
unsigned int buf_align = info-pixmap.buf_align - 1;
unsigned int shift_low = 0, mod = vc-vc_font.width % 8;
-   unsigned int shift_high = 8, pitch, cnt, size, k;
+   unsigned int shift_high = 8, pitch, cnt, size, i, k;
unsigned int idx = vc-vc_font.width  3;
unsigned int attribute = get_attribute(info, scr_readw(s));
struct fb_image image;
@@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}

-   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
+   if (idx == 1)
+   for(i=0; i  image.height; i++)
+   dst[pitch*i] = src[i];
+   else
+   fb_pad_aligned_buffer(dst, pitch, src, 
idx, image.height);
dst += width;
}
}


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Geert Uytterhoeven
On Tue, 30 Aug 2005, Knut Petersen wrote:
 linux/drivers/video/console/bitblit.c
 --- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.0
 +0200
 +++ linux/drivers/video/console/bitblit.c 2005-08-30 17:19:57.0
 +0200
 @@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
   unsigned int scan_align = info-pixmap.scan_align - 1;
   unsigned int buf_align = info-pixmap.buf_align - 1;
   unsigned int shift_low = 0, mod = vc-vc_font.width % 8;
 - unsigned int shift_high = 8, pitch, cnt, size, k;
 + unsigned int shift_high = 8, pitch, cnt, size, i, k;
   unsigned int idx = vc-vc_font.width  3;
   unsigned int attribute = get_attribute(info, scr_readw(s));
   struct fb_image image;
 @@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
   src = buf;
   }
 
 - fb_pad_aligned_buffer(dst, pitch, src, idx,
 image.height);
 + if (idx == 1)
 + for(i=0; i  image.height; i++)
 + dst[pitch*i] = src[i];
^^^
 + else
 + fb_pad_aligned_buffer(dst, pitch, src,
 idx, image.height);
   dst += width;
   }
   }

Probably you can make it even faster by avoiding the multiplication, like

unsigned int offset = 0;
for (i = 0; i  image.height; i++) {
dst[offset] = src[i];
offset += pitch;
}

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Knut Petersen



Probably you can make it even faster by avoiding the multiplication, like

   unsigned int offset = 0;
   for (i = 0; i  image.height; i++) {
dst[offset] = src[i];
offset += pitch;
   }



More than two decades ago I learned to avoid mul and imul. Use shifts, 
add and lea instead,
that was the credo those days. The name of the game was CP/M 80/86, a86, 
d86 and ddt ;-)


But let´s get serious again.

Your proposed change of the patch results in a 21 ms performance 
decrease on my system.
Yes, I do know that this is hard to believe. I tested a similar 
variation before, and the results

were even worse.

Avoiding mul is a good idea in assembly language today, but often it is 
better to write a
multiplication  with the loop counter in C and not to introduce an extra 
variable instead. The
compiler will optimize the code and it´s easier for gcc without that 
extra variable.


More interesting would be the question what should be done for idx==2 or 
idx==3. Probably
fb_pad_aligned_buffer() is also slower for those cases. But does anybody 
use such fonts?


cu,
knut
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Roman Zippel
Hi,

On Tue, 30 Aug 2005, Knut Petersen wrote:

  Probably you can make it even faster by avoiding the multiplication, like
  
 unsigned int offset = 0;
 for (i = 0; i  image.height; i++) {
  dst[offset] = src[i];
  offset += pitch;
 }
  
 
 More than two decades ago I learned to avoid mul and imul. Use shifts, add and
 lea instead,
 that was the credo those days. The name of the game was CP/M 80/86, a86, d86
 and ddt ;-)
 
 But let´s get serious again.
 
 Your proposed change of the patch results in a 21 ms performance decrease on
 my system.
 Yes, I do know that this is hard to believe. I tested a similar variation
 before, and the results
 were even worse.

Could you try the patch below, for a few extra cycles you might want to 
make it an inline function.

bye, Roman

Index: linux-2.6/drivers/video/fbmem.c
===
--- linux-2.6.orig/drivers/video/fbmem.c2005-08-30 01:55:18.0 
+0200
+++ linux-2.6/drivers/video/fbmem.c 2005-08-30 21:10:25.705462837 +0200
@@ -82,11 +82,11 @@ void fb_pad_aligned_buffer(u8 *dst, u32 
 {
int i, j;
 
+   d_pitch -= s_pitch;
for (i = height; i--; ) {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
for (j = 0; j  s_pitch; j++)
-   dst[j] = src[j];
-   src += s_pitch;
+   *dst++ = *src++;
dst += d_pitch;
}
 }

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Geert Uytterhoeven
On Tue, 30 Aug 2005, Knut Petersen wrote:
  Probably you can make it even faster by avoiding the multiplication, like
  
 unsigned int offset = 0;
 for (i = 0; i  image.height; i++) {
  dst[offset] = src[i];
  offset += pitch;
 }
 
 More than two decades ago I learned to avoid mul and imul. Use shifts, add and
 lea instead,
 that was the credo those days. The name of the game was CP/M 80/86, a86, d86
 and ddt ;-)
 
 But let�s get serious again.

On modern CPUs, a multiplication indeed takes 1 cycle, just like an addition.
But on older CPUs (still supported by Linux), this is not true.

 Your proposed change of the patch results in a 21 ms performance decrease on
 my system.
 Yes, I do know that this is hard to believe. I tested a similar variation
 before, and the results
 were even worse.
 
 Avoiding mul is a good idea in assembly language today, but often it is better
 to write a
 multiplication  with the loop counter in C and not to introduce an extra
 variable instead. The
 compiler will optimize the code and it�s easier for gcc without that extra
 variable.

But you are right. On actual inspection of the generated assembly code for a
very simple test case, it turns out both (m68k-linux-)gcc 2.95.2 and 3.3.3 are
smart enough to convert the multiplication to an addition...

And interestingly, if I avoid the multiplication explicitly, gcc 2.95.2 still
generates the same code, but 3.3.3 adds a few extra instructions to
save/restore local vars. So this probably explains why it turned out to be
slower for you. Ugh...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Knut Petersen

Hi Roman,

Could you try the patch below, for a few extra cycles you might want to 
make it an inline function.
 

No, it does not help. If there is any difference, it is too small to be 
measured on

my system ... and my system does run at 1000 Hz.

After 2.6.12 fb_pad_aligned_buffer() was changed to use memcpy() instead 
of a
bytewise copy. That slowed things down a lot, some weeks ago that was 
reverted.
fb_pad_aligned _buffer() isn´t that slow, it´s just an external function 
to be called

and that means a lot of unnecessary code.

How could I make it an inline function? It is used in console/bitblit.c, 
nvidia/nvidia.c,

riva/fbdev.c and softcursor.c.

cu,
Knut
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Roman Zippel
Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

 How could I make it an inline function? It is used in console/bitblit.c,
 nvidia/nvidia.c,
 riva/fbdev.c and softcursor.c.

Something like below, which has the advantange that there is still only 
one implementation of the function and if it's still slower, we really 
need to check the compiler.

bye, Roman

 drivers/video/console/bitblit.c |5 -
 drivers/video/fbmem.c   |   10 +-
 include/linux/fb.h  |   13 +
 3 files changed, 18 insertions(+), 10 deletions(-)

Index: linux-2.6/drivers/video/fbmem.c
===
--- linux-2.6.orig/drivers/video/fbmem.c2005-08-30 21:17:44.0 
+0200
+++ linux-2.6/drivers/video/fbmem.c 2005-08-31 01:20:37.0 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
  */
 void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 
height)
 {
-   int i, j;
-
-   for (i = height; i--; ) {
-   /* s_pitch is a few bytes at the most, memcpy is suboptimal */
-   for (j = 0; j  s_pitch; j++)
-   dst[j] = src[j];
-   src += s_pitch;
-   dst += d_pitch;
-   }
+   __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
 }
 EXPORT_SYMBOL(fb_pad_aligned_buffer);
 
Index: linux-2.6/drivers/video/console/bitblit.c
===
--- linux-2.6.orig/drivers/video/console/bitblit.c  2005-08-30 
01:55:20.0 +0200
+++ linux-2.6/drivers/video/console/bitblit.c   2005-08-31 01:25:30.0 
+0200
@@ -175,7 +175,10 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}
 
-   fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
+   if (likely(idx == 1))
+   __fb_pad_aligned_buffer(dst, pitch, 
src, 1, image.height);
+   else
+   fb_pad_aligned_buffer(dst, pitch, src, 
idx, image.height);
dst += width;
}
}
Index: linux-2.6/include/linux/fb.h
===
--- linux-2.6.orig/include/linux/fb.h   2005-08-30 01:56:29.0 +0200
+++ linux-2.6/include/linux/fb.h2005-08-31 01:21:04.0 +0200
@@ -824,6 +824,19 @@ extern int fb_get_color_depth(struct fb_
 extern int fb_get_options(char *name, char **option);
 extern int fb_new_modelist(struct fb_info *info);
 
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 
s_pitch, u32 height)
+{
+   int i, j;
+
+   d_pitch -= s_pitch;
+   for (i = height; i--; ) {
+   /* s_pitch is a few bytes at the most, memcpy is suboptimal */
+   for (j = 0; j  s_pitch; j++)
+   *dst++ = *src++;
+   dst += d_pitch;
+   }
+}
+
 extern struct fb_info *registered_fb[FB_MAX];
 extern int num_registered_fb;
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Antonino A. Daplas
Knut Petersen wrote:
 fb_pad_aligned_buffer() is also slower for those cases. But does anybody
 use such fonts?

Yes, there are 16x30 fonts out there in the wild. 

Tony
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

2005-08-30 Thread Antonino A. Daplas
Knut Petersen wrote:
 This trivial patch gives a performance boost to the framebuffer console
 
 Constructing the bitmaps that are given to the bitblit functions of the
 framebuffer
 drivers is time consuming. Here we avoide a call to the slow
 fb_pad_aligned_buffer().
 The patch replaces that call with a simple but much more efficient
 bytewise copy.
 
 The kernel spends a significant time at this place if you use 8x* fonts.
 Every
 pixel displayed on your screen is prepared here.
 
 Some benchmark results:
 
 Displaying a file of 2000 lines with 160 characters each takes 889 ms
 system
 time using  cyblafb on my system  (I´m using a 1280x1024 video mode,
 resulting in a 160x64 character console)
 
 Displaying the same file with the enclosed patch applied to 2.6.13 only
 takes
 760 ms system time, saving 129 ms or 14.5%.

Where did this 14.5% come from?  Is it bit_putcs alone or is more
real world, such as 'time cat text_file'? If I'm to guess, a large
percent of the improvement is caused by the inlining of the code.

I'm not against the patch, it will benefit drivers with very fast
imageblits.  However, since most drivers have imageblits done in software,
a large proportion of the processing time will go to the imageblit itself,
so I don't think you'll get that high a number (I get only a 4%
improvement, and this is in a driver with accelerated blits, and it will
probably be lower, ie, in vesafb).

Anyway, with the addition of your patch, bit_putcs has now reached an
'ugliness threshhold' for a revamp.

Tony
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/