> From: Jeremie Courreges-Anglas <[email protected]>
> Date: Fri, 23 Jul 2021 13:54:31 +0200
>
> On Fri, Jul 23 2021, Mark Kettenis <[email protected]> wrote:
> >> From: Jeremie Courreges-Anglas <[email protected]>
> >> Date: Fri, 23 Jul 2021 11:54:51 +0200
> >> Content-Type: text/plain
> >>
> >>
> >> I've been using a variation of this diff on my hifive unmatched for
> >> a few days.  The goal is to at least optimize the aligned cases by using
> >> 8- or 4-byte loads/stores.  On this hifive unmatched, I found that
> >> unaligned 8- or 4-byte load/store loops are utterly slow, much slower
> >> than equivalent 1-byte loads/stores (say 40x slower).
> >>
> >> This improves e.g. i/o throughput and shaves off between 10 and 15s out of
> >> a total 11m30s in ''make clean; make -j4'' kernel builds.
> >>
> >> I have another diff that tries to re-align initially unaligned addresses
> >> if possible but it's uglier and it's hard to tell whether it makes any
> >> difference in real life.
> >>
> >> ok?
> >>
> >>
> >> Index: copy.S
> >> ===================================================================
> >> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> >> retrieving revision 1.6
> >> diff -u -p -p -u -r1.6 copy.S
> >> --- copy.S	28 Jun 2021 18:53:10 -0000	1.6
> >> +++ copy.S	23 Jul 2021 07:45:16 -0000
> >> @@ -49,8 +49,38 @@ ENTRY(copyin)
> >>  	SWAP_FAULT_HANDLER(a3, a4, a5)
> >>  	ENTER_USER_ACCESS(a4)
> >>
> >> -// XXX optimize?
> >>  .Lcopyio:
> >> +.Lcopy8:
> >> +	li	a5, 8
> >> +	bltu	a2, a5, .Lcopy4
> >> +
> >> +	or	a7, a0, a1
> >> +	andi	a7, a7, 7
> >> +	bnez	a7, .Lcopy4
> >> +
> >> +1:	ld	a4, 0(a0)
> >> +	addi	a0, a0, 8
> >> +	sd	a4, 0(a1)
> >> +	addi	a1, a1, 8
> >> +	addi	a2, a2, -8
> >> +	bgtu	a2, a5, 1b
> >
> > Shouldn't this be
> >
> > 	bgeu	a2, a5, 1b
>
> Yes, that's better indeed, thanks!  Updated diff.
ok kettenis@

> Index: copy.S
> ===================================================================
> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> retrieving revision 1.6
> diff -u -p -p -u -r1.6 copy.S
> --- copy.S	28 Jun 2021 18:53:10 -0000	1.6
> +++ copy.S	23 Jul 2021 11:52:54 -0000
> @@ -49,8 +49,38 @@ ENTRY(copyin)
>  	SWAP_FAULT_HANDLER(a3, a4, a5)
>  	ENTER_USER_ACCESS(a4)
>
> -// XXX optimize?
>  .Lcopyio:
> +.Lcopy8:
> +	li	a5, 8
> +	bltu	a2, a5, .Lcopy4
> +
> +	or	a7, a0, a1
> +	andi	a7, a7, 7
> +	bnez	a7, .Lcopy4
> +
> +1:	ld	a4, 0(a0)
> +	addi	a0, a0, 8
> +	sd	a4, 0(a1)
> +	addi	a1, a1, 8
> +	addi	a2, a2, -8
> +	bgeu	a2, a5, 1b
> +
> +.Lcopy4:
> +	li	a5, 4
> +	bltu	a2, a5, .Lcopy1
> +
> +	andi	a7, a7, 3
> +	bnez	a7, .Lcopy1
> +
> +1:	lw	a4, 0(a0)
> +	addi	a0, a0, 4
> +	sw	a4, 0(a1)
> +	addi	a1, a1, 4
> +	addi	a2, a2, -4
> +	bgeu	a2, a5, 1b
> +
> +.Lcopy1:
> +	beqz	a2, .Lcopy0
>  1:	lb	a4, 0(a0)
>  	addi	a0, a0, 1
>  	sb	a4, 0(a1)
> @@ -58,6 +88,7 @@ ENTRY(copyin)
>  	addi	a2, a2, -1
>  	bnez	a2, 1b
>
> +.Lcopy0:
>  	EXIT_USER_ACCESS(a4)
>  	SET_FAULT_HANDLER(a3, a4)
>  .Lcopyiodone:
>
>
> --
> jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
>
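
For readers following the bgtu/bgeu exchange: as far as the quoted diffs show, the choice of branch affects only how many bytes the 8-byte loop handles, assuming the smaller loops pick up the remainder as in the updated diff. Below is a minimal C sketch of that loop's trip count for a copy whose pointers are already 8-byte aligned. The function name bytes_in_copy8 and its strict flag are hypothetical, introduced only for illustration; this is a model of the control flow, not kernel code.

```c
#include <stddef.h>

/*
 * Hypothetical model of the .Lcopy8 loop for an aligned copy of len bytes.
 * Returns the number of bytes moved by the 8-byte loop.
 * strict != 0 models the first diff (bgtu), strict == 0 the updated one (bgeu).
 */
static size_t
bytes_in_copy8(size_t len, int strict)
{
	size_t done = 0;

	if (len < 8)		/* bltu a2, a5, .Lcopy4 */
		return 0;

	do {
		done += 8;	/* one ld/sd pair */
		len -= 8;	/* addi a2, a2, -8 */
	} while (strict ? len > 8 : len >= 8);	/* bgtu vs. bgeu */

	return done;
}
```

In this model a 16-byte aligned copy moves all 16 bytes in the 8-byte loop with bgeu, but only 8 with bgtu, which would leave the final 8 bytes to the narrower loops.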
