Re: rasops(9): revert changes in rasops32_putchar()?

2019-03-23 Thread Theo de Raadt
Mateusz Guzik  wrote:

> You may notice pagezero uses rep. Background page zeroing was removed
> quite some time ago. I see OpenBSD still has it, but it is most likely
> detrimental by now. If it is to be kept, a pagezero variant utilizing
> non-temporal stores still makes sense but a different function should be
> created for on-demand zeroing.

I believe our memory utilization is completely different from other
operating systems.  We are attempting to apply randomization at every
level, and this means we have very different (higher) costs.

Our kernel pool subsystem works differently, different recycling
paradigm.  Our page allocation strategy is not linear.  And finally our
userland malloc stores tracking data in seperate pages and is very
aggressive about munmap, which means we have a much greater amount of
fresh allocation occuring in processes, there is much more zfod mmap()
occuring (we accept the cost of the mmap, but obviously those pages
need to be zero'd by someone).

Everything is different in subtle ways.

Their measurements will probably be very different from ours.



Re: rasops(9): revert changes in rasops32_putchar()?

2019-03-23 Thread Mark Kettenis
> From: Mateusz Guzik 
> Date: Sat, 23 Mar 2019 15:29:39 +0100
> 
> On 3/18/19, Frederic Cambus  wrote:
> > Hi tech@,
> >
> > Now that efifb(4) supports remapping the framebuffer in write combining
> > mode, it's on par with inteldrm regarding display performance as for as
> > rasops(9) is concerned.
> >
> > Therefore, I'm proposing reverting changes which were introduced in
> > rasops32_putchar() in revision 1.8 [1] as they are now also slowing
> > things down on efifb(4).
> >
> > I used a 6.3M text file for testing display performance:
> > ftp https://norvig.com/big.txt
> >
> > The inteldrm and efifb tests were done on the same machine, the
> > radeondrm ones were done on another (faster) machine.
> >
> > Screen size: 1920x1080
> > Font size: 16x32
> >
> > Here are the results (3 runs each) of running: time cat big.txt
> >
> > inteldrm:
> >
> > 2m39.52s real 0m00.00s user 2m39.52s system
> > 2m39.74s real 0m00.00s user 2m39.84s system
> > 2m39.74s real 0m00.00s user 2m39.77s system
> >
> > inteldrm (with revert diff applied):
> >
> > 1m37m76s real 0m00.00s user 1m37m60s system
> > 1m37m80s real 0m00.00s user 1m37m56s system
> > 1m37m43s real 0m00.00s user 1m37m47s system
> >
> > efifb:
> >
> > 2m40.46s real 0m00.00s user 2m39.43s system
> > 2m39.49s real 0m00.00s user 2m39.52s system
> > 2m39.45s real 0m00.00s user 2m39.48s system
> >
> > efifb (with revert diff applied):
> >
> > 1m37m66s real 0m00.00s user 1m37m19s system
> > 1m37m17s real 0m00.00s user 1m37m22s system
> > 1m37m15s real 0m00.00s user 1m37m20s system
> >
> > radeondrm:
> >
> > 4m40.75s real 0m00.00s user 4m39m75s system
> > 4m39.84s real 0m00.00s user 4m39m85s system
> > 4m39.68s real 0m00.00s user 4m39m71s system
> >
> > radeondrm (with revert diff applied):
> >
> > 0m21.08s real 0m00.00s user 0m21.08s system
> > 0m21.15s real 0m00.00s user 0m21.05s system
> > 0m21.10s real 0m00.00s user 0m21.06s system
> >
> > [1]
> > http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7=1.8
> >
> > Comments? OK?
> >
> > Index: sys/dev/rasops/rasops32.c
> > ===
> > RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v
> > retrieving revision 1.8
> > diff -u -p -r1.8 rasops32.c
> > --- sys/dev/rasops/rasops32.c   20 Feb 2017 15:35:05 -  1.8
> > +++ sys/dev/rasops/rasops32.c   18 Mar 2019 08:15:18 -
> > @@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row,
> > struct rasops_info *ri;
> > int32_t *dp, *rp;
> > u_char *fr;
> > -   uint32_t buffer[64];
> >
> > ri = (struct rasops_info *)cookie;
> >
> > @@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row,
> > clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf];
> >
> > if (uc == ' ') {
> > -   for (cnt = 0; cnt < width; cnt++)
> > -   buffer[cnt] = clr[0];
> > while (height--) {
> > dp = rp;
> > DELTA(rp, ri->ri_stride, int32_t *);
> >
> > -   memcpy(dp, buffer, width << 2);
> > +   for (cnt = width; cnt; cnt--)
> > +   *dp++ = clr[0];
> > }
> > } else {
> > uc -= ri->ri_font->firstchar;
> > @@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row,
> > fr += fs;
> > DELTA(rp, ri->ri_stride, int32_t *);
> >
> > -   for (cnt = 0; cnt < width; cnt++) {
> > -   buffer[cnt] = clr[(fb >> 31) & 1];
> > +   for (cnt = width; cnt; cnt--) {
> > +   *dp++ = clr[(fb >> 31) & 1];
> > fb <<= 1;
> > }
> > -   memcpy(dp, buffer, width << 2);
> > }
> > }
> >
> >
> >
> 
> So the majority of the win stems from not doing memcpy, but this is only
> true because the current implementation is heavily pessimized by always
> doing rep movsq + rep movsb which is particular harmful performance-wise
> for small sizes.
> 
> FreeBSD also had this problem and several changes were made to remedy it,
> see in particular:
> 
> https://reviews.freebsd.org/D17398
> https://reviews.freebsd.org/D17661
> 
> Apart from that other important routines were also updated (copyin,
> copyout, copyinstr and so on). Note this code is still not the fastest it
> can be, but it's mostly diminishing returns from here.
> 
> Also note userspace code could use SIMD. I plan to import variants from
> bionic (BSD-licensed).
> 
> See https://svnweb.freebsd.org/base/head/sys/amd64/amd64/support.S .
> 
> You may notice pagezero uses rep. Background page zeroing was removed
> quite some time ago. I see OpenBSD still has it, but it is most likely
> detrimental by now. If 

Re: rasops(9): revert changes in rasops32_putchar()?

2019-03-23 Thread Mateusz Guzik
On 3/18/19, Frederic Cambus  wrote:
> Hi tech@,
>
> Now that efifb(4) supports remapping the framebuffer in write combining
> mode, it's on par with inteldrm regarding display performance as for as
> rasops(9) is concerned.
>
> Therefore, I'm proposing reverting changes which were introduced in
> rasops32_putchar() in revision 1.8 [1] as they are now also slowing
> things down on efifb(4).
>
> I used a 6.3M text file for testing display performance:
> ftp https://norvig.com/big.txt
>
> The inteldrm and efifb tests were done on the same machine, the
> radeondrm ones were done on another (faster) machine.
>
> Screen size: 1920x1080
> Font size: 16x32
>
> Here are the results (3 runs each) of running: time cat big.txt
>
> inteldrm:
>
> 2m39.52s real 0m00.00s user 2m39.52s system
> 2m39.74s real 0m00.00s user 2m39.84s system
> 2m39.74s real 0m00.00s user 2m39.77s system
>
> inteldrm (with revert diff applied):
>
> 1m37m76s real 0m00.00s user 1m37m60s system
> 1m37m80s real 0m00.00s user 1m37m56s system
> 1m37m43s real 0m00.00s user 1m37m47s system
>
> efifb:
>
> 2m40.46s real 0m00.00s user 2m39.43s system
> 2m39.49s real 0m00.00s user 2m39.52s system
> 2m39.45s real 0m00.00s user 2m39.48s system
>
> efifb (with revert diff applied):
>
> 1m37m66s real 0m00.00s user 1m37m19s system
> 1m37m17s real 0m00.00s user 1m37m22s system
> 1m37m15s real 0m00.00s user 1m37m20s system
>
> radeondrm:
>
> 4m40.75s real 0m00.00s user 4m39m75s system
> 4m39.84s real 0m00.00s user 4m39m85s system
> 4m39.68s real 0m00.00s user 4m39m71s system
>
> radeondrm (with revert diff applied):
>
> 0m21.08s real 0m00.00s user 0m21.08s system
> 0m21.15s real 0m00.00s user 0m21.05s system
> 0m21.10s real 0m00.00s user 0m21.06s system
>
> [1]
> http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7=1.8
>
> Comments? OK?
>
> Index: sys/dev/rasops/rasops32.c
> ===
> RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v
> retrieving revision 1.8
> diff -u -p -r1.8 rasops32.c
> --- sys/dev/rasops/rasops32.c 20 Feb 2017 15:35:05 -  1.8
> +++ sys/dev/rasops/rasops32.c 18 Mar 2019 08:15:18 -
> @@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row,
>   struct rasops_info *ri;
>   int32_t *dp, *rp;
>   u_char *fr;
> - uint32_t buffer[64];
>
>   ri = (struct rasops_info *)cookie;
>
> @@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row,
>   clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf];
>
>   if (uc == ' ') {
> - for (cnt = 0; cnt < width; cnt++)
> - buffer[cnt] = clr[0];
>   while (height--) {
>   dp = rp;
>   DELTA(rp, ri->ri_stride, int32_t *);
>
> - memcpy(dp, buffer, width << 2);
> + for (cnt = width; cnt; cnt--)
> + *dp++ = clr[0];
>   }
>   } else {
>   uc -= ri->ri_font->firstchar;
> @@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row,
>   fr += fs;
>   DELTA(rp, ri->ri_stride, int32_t *);
>
> - for (cnt = 0; cnt < width; cnt++) {
> - buffer[cnt] = clr[(fb >> 31) & 1];
> + for (cnt = width; cnt; cnt--) {
> + *dp++ = clr[(fb >> 31) & 1];
>   fb <<= 1;
>   }
> - memcpy(dp, buffer, width << 2);
>   }
>   }
>
>
>

So the majority of the win stems from not doing memcpy, but this is only
true because the current implementation is heavily pessimized by always
doing rep movsq + rep movsb which is particular harmful performance-wise
for small sizes.

FreeBSD also had this problem and several changes were made to remedy it,
see in particular:

https://reviews.freebsd.org/D17398
https://reviews.freebsd.org/D17661

Apart from that other important routines were also updated (copyin,
copyout, copyinstr and so on). Note this code is still not the fastest it
can be, but it's mostly diminishing returns from here.

Also note userspace code could use SIMD. I plan to import variants from
bionic (BSD-licensed).

See https://svnweb.freebsd.org/base/head/sys/amd64/amd64/support.S .

You may notice pagezero uses rep. Background page zeroing was removed
quite some time ago. I see OpenBSD still has it, but it is most likely
detrimental by now. If it is to be kept, a pagezero variant utilizing
non-temporal stores still makes sense but a different function should be
created for on-demand zeroing.

That said, it should be easy to lift a lot of this code. Could you please
benchmark with memcpy as implemented above? Not saying this 

Re: rasops(9): revert changes in rasops32_putchar()?

2019-03-22 Thread Mark Kettenis
> Date: Mon, 18 Mar 2019 14:49:48 +0100
> From: Frederic Cambus 
> 
> Hi tech@,
> 
> Now that efifb(4) supports remapping the framebuffer in write combining
> mode, it's on par with inteldrm regarding display performance as for as
> rasops(9) is concerned.
> 
> Therefore, I'm proposing reverting changes which were introduced in
> rasops32_putchar() in revision 1.8 [1] as they are now also slowing
> things down on efifb(4).
> 
> I used a 6.3M text file for testing display performance:
> ftp https://norvig.com/big.txt
> 
> The inteldrm and efifb tests were done on the same machine, the
> radeondrm ones were done on another (faster) machine.
> 
> Screen size: 1920x1080
> Font size: 16x32
> 
> Here are the results (3 runs each) of running: time cat big.txt
> 
> inteldrm:
> 
> 2m39.52s real 0m00.00s user 2m39.52s system
> 2m39.74s real 0m00.00s user 2m39.84s system
> 2m39.74s real 0m00.00s user 2m39.77s system
> 
> inteldrm (with revert diff applied):
> 
> 1m37m76s real 0m00.00s user 1m37m60s system
> 1m37m80s real 0m00.00s user 1m37m56s system
> 1m37m43s real 0m00.00s user 1m37m47s system
> 
> efifb:
> 
> 2m40.46s real 0m00.00s user 2m39.43s system
> 2m39.49s real 0m00.00s user 2m39.52s system
> 2m39.45s real 0m00.00s user 2m39.48s system
> 
> efifb (with revert diff applied):
> 
> 1m37m66s real 0m00.00s user 1m37m19s system
> 1m37m17s real 0m00.00s user 1m37m22s system
> 1m37m15s real 0m00.00s user 1m37m20s system
> 
> radeondrm:
> 
> 4m40.75s real 0m00.00s user 4m39m75s system
> 4m39.84s real 0m00.00s user 4m39m85s system
> 4m39.68s real 0m00.00s user 4m39m71s system
> 
> radeondrm (with revert diff applied):
> 
> 0m21.08s real 0m00.00s user 0m21.08s system
> 0m21.15s real 0m00.00s user 0m21.05s system
> 0m21.10s real 0m00.00s user 0m21.06s system
> 
> [1] 
> http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7=1.8
> 
> Comments? OK?

Those differences are significant!  So yes, let's give this a try and
commit this.

ok kettenis@

> Index: sys/dev/rasops/rasops32.c
> ===
> RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v
> retrieving revision 1.8
> diff -u -p -r1.8 rasops32.c
> --- sys/dev/rasops/rasops32.c 20 Feb 2017 15:35:05 -  1.8
> +++ sys/dev/rasops/rasops32.c 18 Mar 2019 08:15:18 -
> @@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row, 
>   struct rasops_info *ri;
>   int32_t *dp, *rp;
>   u_char *fr;
> - uint32_t buffer[64];
>  
>   ri = (struct rasops_info *)cookie;
>  
> @@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row, 
>   clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf];
>  
>   if (uc == ' ') {
> - for (cnt = 0; cnt < width; cnt++)
> - buffer[cnt] = clr[0];
>   while (height--) {
>   dp = rp;
>   DELTA(rp, ri->ri_stride, int32_t *);
>  
> - memcpy(dp, buffer, width << 2);
> + for (cnt = width; cnt; cnt--)
> + *dp++ = clr[0];
>   }
>   } else {
>   uc -= ri->ri_font->firstchar;
> @@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row, 
>   fr += fs;
>   DELTA(rp, ri->ri_stride, int32_t *);
>  
> - for (cnt = 0; cnt < width; cnt++) {
> - buffer[cnt] = clr[(fb >> 31) & 1];
> + for (cnt = width; cnt; cnt--) {
> + *dp++ = clr[(fb >> 31) & 1];
>   fb <<= 1;
>   }
> - memcpy(dp, buffer, width << 2);
>   }
>   }
>  
> 
> 



rasops(9): revert changes in rasops32_putchar()?

2019-03-18 Thread Frederic Cambus
Hi tech@,

Now that efifb(4) supports remapping the framebuffer in write combining
mode, it's on par with inteldrm regarding display performance as for as
rasops(9) is concerned.

Therefore, I'm proposing reverting changes which were introduced in
rasops32_putchar() in revision 1.8 [1] as they are now also slowing
things down on efifb(4).

I used a 6.3M text file for testing display performance:
ftp https://norvig.com/big.txt

The inteldrm and efifb tests were done on the same machine, the
radeondrm ones were done on another (faster) machine.

Screen size: 1920x1080
Font size: 16x32

Here are the results (3 runs each) of running: time cat big.txt

inteldrm:

2m39.52s real 0m00.00s user 2m39.52s system
2m39.74s real 0m00.00s user 2m39.84s system
2m39.74s real 0m00.00s user 2m39.77s system

inteldrm (with revert diff applied):

1m37m76s real 0m00.00s user 1m37m60s system
1m37m80s real 0m00.00s user 1m37m56s system
1m37m43s real 0m00.00s user 1m37m47s system

efifb:

2m40.46s real 0m00.00s user 2m39.43s system
2m39.49s real 0m00.00s user 2m39.52s system
2m39.45s real 0m00.00s user 2m39.48s system

efifb (with revert diff applied):

1m37m66s real 0m00.00s user 1m37m19s system
1m37m17s real 0m00.00s user 1m37m22s system
1m37m15s real 0m00.00s user 1m37m20s system

radeondrm:

4m40.75s real 0m00.00s user 4m39m75s system
4m39.84s real 0m00.00s user 4m39m85s system
4m39.68s real 0m00.00s user 4m39m71s system

radeondrm (with revert diff applied):

0m21.08s real 0m00.00s user 0m21.08s system
0m21.15s real 0m00.00s user 0m21.05s system
0m21.10s real 0m00.00s user 0m21.06s system

[1] 
http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7=1.8

Comments? OK?

Index: sys/dev/rasops/rasops32.c
===
RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v
retrieving revision 1.8
diff -u -p -r1.8 rasops32.c
--- sys/dev/rasops/rasops32.c   20 Feb 2017 15:35:05 -  1.8
+++ sys/dev/rasops/rasops32.c   18 Mar 2019 08:15:18 -
@@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row, 
struct rasops_info *ri;
int32_t *dp, *rp;
u_char *fr;
-   uint32_t buffer[64];
 
ri = (struct rasops_info *)cookie;
 
@@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row, 
clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf];
 
if (uc == ' ') {
-   for (cnt = 0; cnt < width; cnt++)
-   buffer[cnt] = clr[0];
while (height--) {
dp = rp;
DELTA(rp, ri->ri_stride, int32_t *);
 
-   memcpy(dp, buffer, width << 2);
+   for (cnt = width; cnt; cnt--)
+   *dp++ = clr[0];
}
} else {
uc -= ri->ri_font->firstchar;
@@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row, 
fr += fs;
DELTA(rp, ri->ri_stride, int32_t *);
 
-   for (cnt = 0; cnt < width; cnt++) {
-   buffer[cnt] = clr[(fb >> 31) & 1];
+   for (cnt = width; cnt; cnt--) {
+   *dp++ = clr[(fb >> 31) & 1];
fb <<= 1;
}
-   memcpy(dp, buffer, width << 2);
}
}