> From: Mateusz Guzik <mjgu...@gmail.com>
> Date: Sat, 23 Mar 2019 15:29:39 +0100
> 
> On 3/18/19, Frederic Cambus <f...@statdns.com> wrote:
> > Hi tech@,
> >
> > Now that efifb(4) supports remapping the framebuffer in write combining
> > mode, it's on par with inteldrm regarding display performance as for as
> > rasops(9) is concerned.
> >
> > Therefore, I'm proposing reverting changes which were introduced in
> > rasops32_putchar() in revision 1.8 [1] as they are now also slowing
> > things down on efifb(4).
> >
> > I used a 6.3M text file for testing display performance:
> > ftp https://norvig.com/big.txt
> >
> > The inteldrm and efifb tests were done on the same machine, the
> > radeondrm ones were done on another (faster) machine.
> >
> > Screen size: 1920x1080
> > Font size: 16x32
> >
> > Here are the results (3 runs each) of running: time cat big.txt
> >
> > inteldrm:
> >
> >     2m39.52s real     0m00.00s user     2m39.52s system
> >     2m39.74s real     0m00.00s user     2m39.84s system
> >     2m39.74s real     0m00.00s user     2m39.77s system
> >
> > inteldrm (with revert diff applied):
> >
> >     1m37m76s real     0m00.00s user     1m37m60s system
> >     1m37m80s real     0m00.00s user     1m37m56s system
> >     1m37m43s real     0m00.00s user     1m37m47s system
> >
> > efifb:
> >
> >     2m40.46s real     0m00.00s user     2m39.43s system
> >     2m39.49s real     0m00.00s user     2m39.52s system
> >     2m39.45s real     0m00.00s user     2m39.48s system
> >
> > efifb (with revert diff applied):
> >
> >     1m37m66s real     0m00.00s user     1m37m19s system
> >     1m37m17s real     0m00.00s user     1m37m22s system
> >     1m37m15s real     0m00.00s user     1m37m20s system
> >
> > radeondrm:
> >
> >     4m40.75s real     0m00.00s user     4m39m75s system
> >     4m39.84s real     0m00.00s user     4m39m85s system
> >     4m39.68s real     0m00.00s user     4m39m71s system
> >
> > radeondrm (with revert diff applied):
> >
> >     0m21.08s real     0m00.00s user     0m21.08s system
> >     0m21.15s real     0m00.00s user     0m21.05s system
> >     0m21.10s real     0m00.00s user     0m21.06s system
> >
> > [1]
> > http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7&r2=1.8
> >
> > Comments? OK?
> >
> > Index: sys/dev/rasops/rasops32.c
> > ===================================================================
> > RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v
> > retrieving revision 1.8
> > diff -u -p -r1.8 rasops32.c
> > --- sys/dev/rasops/rasops32.c       20 Feb 2017 15:35:05 -0000      1.8
> > +++ sys/dev/rasops/rasops32.c       18 Mar 2019 08:15:18 -0000
> > @@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row,
> >     struct rasops_info *ri;
> >     int32_t *dp, *rp;
> >     u_char *fr;
> > -   uint32_t buffer[64];
> >
> >     ri = (struct rasops_info *)cookie;
> >
> > @@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row,
> >     clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf];
> >
> >     if (uc == ' ') {
> > -           for (cnt = 0; cnt < width; cnt++)
> > -                   buffer[cnt] = clr[0];
> >             while (height--) {
> >                     dp = rp;
> >                     DELTA(rp, ri->ri_stride, int32_t *);
> >
> > -                   memcpy(dp, buffer, width << 2);
> > +                   for (cnt = width; cnt; cnt--)
> > +                           *dp++ = clr[0];
> >             }
> >     } else {
> >             uc -= ri->ri_font->firstchar;
> > @@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row,
> >                     fr += fs;
> >                     DELTA(rp, ri->ri_stride, int32_t *);
> >
> > -                   for (cnt = 0; cnt < width; cnt++) {
> > -                           buffer[cnt] = clr[(fb >> 31) & 1];
> > +                   for (cnt = width; cnt; cnt--) {
> > +                           *dp++ = clr[(fb >> 31) & 1];
> >                             fb <<= 1;
> >                     }
> > -                   memcpy(dp, buffer, width << 2);
> >             }
> >     }
> >
> >
> >
> 
> So the majority of the win stems from not doing memcpy, but this is only
> true because the current implementation is heavily pessimized by always
> doing rep movsq + rep movsb which is particular harmful performance-wise
> for small sizes.
> 
> FreeBSD also had this problem and several changes were made to remedy it,
> see in particular:
> 
> https://reviews.freebsd.org/D17398
> https://reviews.freebsd.org/D17661
> 
> Apart from that other important routines were also updated (copyin,
> copyout, copyinstr and so on). Note this code is still not the fastest it
> can be, but it's mostly diminishing returns from here.
> 
> Also note userspace code could use SIMD. I plan to import variants from
> bionic (BSD-licensed).
> 
> See https://svnweb.freebsd.org/base/head/sys/amd64/amd64/support.S .
> 
> You may notice pagezero uses rep. Background page zeroing was removed
> quite some time ago. I see OpenBSD still has it, but it is most likely
> detrimental by now. If it is to be kept, a pagezero variant utilizing
> non-temporal stores still makes sense but a different function should be
> created for on-demand zeroing.
> 
> That said, it should be easy to lift a lot of this code. Could you please
> benchmark with memcpy as implemented above? Not saying this particular
> patch is wrong, but that the bigger problem should be addressed first.

Valid points.  However, I'd argue that using memcpy here is actually
wrong because a framebuffer may not be "normal" memory.  An optimized
memcpy might be slower, or worse, introduce artifacts because of
limitations of the underlying bus protocols (this happens on some ARM
designs).

Reply via email to