Re: riscv64: slightly optimized copyin/copyout/kcopy
> From: Jeremie Courreges-Anglas
> Date: Fri, 23 Jul 2021 13:54:31 +0200
>
> On Fri, Jul 23 2021, Mark Kettenis wrote:
> >> From: Jeremie Courreges-Anglas
> >> Date: Fri, 23 Jul 2021 11:54:51 +0200
> >> Content-Type: text/plain
> >>
> >>
> >> I've been using a variation of this diff on my hifive unmatched for
> >> a few days.  The goal is to optimize at least the aligned cases by
> >> using 8- or 4-byte loads/stores.  On this hifive unmatched, I found
> >> that loops doing unaligned 8- or 4-byte loads/stores are utterly
> >> slow, much slower than equivalent 1-byte loads/stores (say 40x
> >> slower).
> >>
> >> This improves e.g. I/O throughput and shaves between 10 and 15s off
> >> a total of 11m30s in 'make clean; make -j4' kernel builds.
> >>
> >> I have another diff that tries to re-align initially unaligned
> >> addresses if possible, but it's uglier and it's hard to tell
> >> whether it makes any difference in real life.
> >>
> >> ok?
> >>
> >>
> >> Index: copy.S
> >> ===================================================================
> >> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> >> retrieving revision 1.6
> >> diff -u -p -p -u -r1.6 copy.S
> >> --- copy.S	28 Jun 2021 18:53:10 -0000	1.6
> >> +++ copy.S	23 Jul 2021 07:45:16 -0000
> >> @@ -49,8 +49,38 @@ ENTRY(copyin)
> >>  	SWAP_FAULT_HANDLER(a3, a4, a5)
> >>  	ENTER_USER_ACCESS(a4)
> >>
> >> -	// XXX optimize?
> >>  .Lcopyio:
> >> +.Lcopy8:
> >> +	li	a5, 8
> >> +	bltu	a2, a5, .Lcopy4
> >> +
> >> +	or	a7, a0, a1
> >> +	andi	a7, a7, 7
> >> +	bnez	a7, .Lcopy4
> >> +
> >> +1:	ld	a4, 0(a0)
> >> +	addi	a0, a0, 8
> >> +	sd	a4, 0(a1)
> >> +	addi	a1, a1, 8
> >> +	addi	a2, a2, -8
> >> +	bgtu	a2, a5, 1b
> >
> > Shouldn't this be
> >
> > 	bgeu	a2, a5, 1b
>
> Yes, that's better indeed, thanks!  Updated diff.

ok kettenis@

> Index: copy.S
> ===================================================================
> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> retrieving revision 1.6
> diff -u -p -p -u -r1.6 copy.S
> --- copy.S	28 Jun 2021 18:53:10 -0000	1.6
> +++ copy.S	23 Jul 2021 11:52:54 -0000
> @@ -49,8 +49,38 @@ ENTRY(copyin)
>  	SWAP_FAULT_HANDLER(a3, a4, a5)
>  	ENTER_USER_ACCESS(a4)
>
> -	// XXX optimize?
>  .Lcopyio:
> +.Lcopy8:
> +	li	a5, 8
> +	bltu	a2, a5, .Lcopy4
> +
> +	or	a7, a0, a1
> +	andi	a7, a7, 7
> +	bnez	a7, .Lcopy4
> +
> +1:	ld	a4, 0(a0)
> +	addi	a0, a0, 8
> +	sd	a4, 0(a1)
> +	addi	a1, a1, 8
> +	addi	a2, a2, -8
> +	bgeu	a2, a5, 1b
> +
> +.Lcopy4:
> +	li	a5, 4
> +	bltu	a2, a5, .Lcopy1
> +
> +	andi	a7, a7, 3
> +	bnez	a7, .Lcopy1
> +
> +1:	lw	a4, 0(a0)
> +	addi	a0, a0, 4
> +	sw	a4, 0(a1)
> +	addi	a1, a1, 4
> +	addi	a2, a2, -4
> +	bgeu	a2, a5, 1b
> +
> +.Lcopy1:
> +	beqz	a2, .Lcopy0
>  1:	lb	a4, 0(a0)
>  	addi	a0, a0, 1
>  	sb	a4, 0(a1)
> @@ -58,6 +88,7 @@ ENTRY(copyin)
>  	addi	a2, a2, -1
>  	bnez	a2, 1b
>
> +.Lcopy0:
>  	EXIT_USER_ACCESS(a4)
>  	SET_FAULT_HANDLER(a3, a4)
>  .Lcopyiodone:
>
>
> --
> jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
>
Re: riscv64: slightly optimized copyin/copyout/kcopy
On Fri, Jul 23 2021, Mark Kettenis wrote:
>> From: Jeremie Courreges-Anglas
>> Date: Fri, 23 Jul 2021 11:54:51 +0200
>> Content-Type: text/plain
>>
>>
>> I've been using a variation of this diff on my hifive unmatched for
>> a few days.  The goal is to optimize at least the aligned cases by
>> using 8- or 4-byte loads/stores.  On this hifive unmatched, I found
>> that loops doing unaligned 8- or 4-byte loads/stores are utterly
>> slow, much slower than equivalent 1-byte loads/stores (say 40x
>> slower).
>>
>> This improves e.g. I/O throughput and shaves between 10 and 15s off
>> a total of 11m30s in 'make clean; make -j4' kernel builds.
>>
>> I have another diff that tries to re-align initially unaligned
>> addresses if possible, but it's uglier and it's hard to tell whether
>> it makes any difference in real life.
>>
>> ok?
>>
>>
>> Index: copy.S
>> ===================================================================
>> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
>> retrieving revision 1.6
>> diff -u -p -p -u -r1.6 copy.S
>> --- copy.S	28 Jun 2021 18:53:10 -0000	1.6
>> +++ copy.S	23 Jul 2021 07:45:16 -0000
>> @@ -49,8 +49,38 @@ ENTRY(copyin)
>>  	SWAP_FAULT_HANDLER(a3, a4, a5)
>>  	ENTER_USER_ACCESS(a4)
>>
>> -	// XXX optimize?
>>  .Lcopyio:
>> +.Lcopy8:
>> +	li	a5, 8
>> +	bltu	a2, a5, .Lcopy4
>> +
>> +	or	a7, a0, a1
>> +	andi	a7, a7, 7
>> +	bnez	a7, .Lcopy4
>> +
>> +1:	ld	a4, 0(a0)
>> +	addi	a0, a0, 8
>> +	sd	a4, 0(a1)
>> +	addi	a1, a1, 8
>> +	addi	a2, a2, -8
>> +	bgtu	a2, a5, 1b
>
> Shouldn't this be
>
> 	bgeu	a2, a5, 1b

Yes, that's better indeed, thanks!  Updated diff.


Index: copy.S
===================================================================
RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
retrieving revision 1.6
diff -u -p -p -u -r1.6 copy.S
--- copy.S	28 Jun 2021 18:53:10 -0000	1.6
+++ copy.S	23 Jul 2021 11:52:54 -0000
@@ -49,8 +49,38 @@ ENTRY(copyin)
 	SWAP_FAULT_HANDLER(a3, a4, a5)
 	ENTER_USER_ACCESS(a4)
 
-	// XXX optimize?
 .Lcopyio:
+.Lcopy8:
+	li	a5, 8
+	bltu	a2, a5, .Lcopy4
+
+	or	a7, a0, a1
+	andi	a7, a7, 7
+	bnez	a7, .Lcopy4
+
+1:	ld	a4, 0(a0)
+	addi	a0, a0, 8
+	sd	a4, 0(a1)
+	addi	a1, a1, 8
+	addi	a2, a2, -8
+	bgeu	a2, a5, 1b
+
+.Lcopy4:
+	li	a5, 4
+	bltu	a2, a5, .Lcopy1
+
+	andi	a7, a7, 3
+	bnez	a7, .Lcopy1
+
+1:	lw	a4, 0(a0)
+	addi	a0, a0, 4
+	sw	a4, 0(a1)
+	addi	a1, a1, 4
+	addi	a2, a2, -4
+	bgeu	a2, a5, 1b
+
+.Lcopy1:
+	beqz	a2, .Lcopy0
 1:	lb	a4, 0(a0)
 	addi	a0, a0, 1
 	sb	a4, 0(a1)
@@ -58,6 +88,7 @@ ENTRY(copyin)
 	addi	a2, a2, -1
 	bnez	a2, 1b
 
+.Lcopy0:
 	EXIT_USER_ACCESS(a4)
 	SET_FAULT_HANDLER(a3, a4)
 .Lcopyiodone:


-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
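[Editorial note on the diff above: the alignment test handles both pointers at once by OR-ing them, since a low bit can only be clear in a0 | a1 if it is clear in both addresses. A C equivalent of the check at .Lcopy8 (hypothetical helper, not part of the diff):

#include <stdint.h>

/*
 * Equivalent of "or a7, a0, a1; andi a7, a7, 7; bnez a7, .Lcopy4":
 * (src | dst) & 7 is zero exactly when the low three bits of both
 * addresses are zero, i.e. when both are 8-byte aligned.
 */
static inline int
both_aligned8(uintptr_t src, uintptr_t dst)
{
	return ((src | dst) & 7) == 0;
}

The same trick with mask 3 gates the 4-byte loop at .Lcopy4, reusing the already-masked a7.]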
Re: riscv64: slightly optimized copyin/copyout/kcopy
> From: Jeremie Courreges-Anglas
> Date: Fri, 23 Jul 2021 11:54:51 +0200
> Content-Type: text/plain
>
>
> I've been using a variation of this diff on my hifive unmatched for
> a few days.  The goal is to optimize at least the aligned cases by
> using 8- or 4-byte loads/stores.  On this hifive unmatched, I found
> that loops doing unaligned 8- or 4-byte loads/stores are utterly
> slow, much slower than equivalent 1-byte loads/stores (say 40x
> slower).
>
> This improves e.g. I/O throughput and shaves between 10 and 15s off
> a total of 11m30s in 'make clean; make -j4' kernel builds.
>
> I have another diff that tries to re-align initially unaligned
> addresses if possible, but it's uglier and it's hard to tell whether
> it makes any difference in real life.
>
> ok?
>
>
> Index: copy.S
> ===================================================================
> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> retrieving revision 1.6
> diff -u -p -p -u -r1.6 copy.S
> --- copy.S	28 Jun 2021 18:53:10 -0000	1.6
> +++ copy.S	23 Jul 2021 07:45:16 -0000
> @@ -49,8 +49,38 @@ ENTRY(copyin)
>  	SWAP_FAULT_HANDLER(a3, a4, a5)
>  	ENTER_USER_ACCESS(a4)
>
> -	// XXX optimize?
>  .Lcopyio:
> +.Lcopy8:
> +	li	a5, 8
> +	bltu	a2, a5, .Lcopy4
> +
> +	or	a7, a0, a1
> +	andi	a7, a7, 7
> +	bnez	a7, .Lcopy4
> +
> +1:	ld	a4, 0(a0)
> +	addi	a0, a0, 8
> +	sd	a4, 0(a1)
> +	addi	a1, a1, 8
> +	addi	a2, a2, -8
> +	bgtu	a2, a5, 1b

Shouldn't this be

	bgeu	a2, a5, 1b

> +
> +.Lcopy4:
> +	li	a5, 4
> +	bltu	a2, a5, .Lcopy1
> +
> +	andi	a7, a7, 3
> +	bnez	a7, .Lcopy1
> +
> +1:	lw	a4, 0(a0)
> +	addi	a0, a0, 4
> +	sw	a4, 0(a1)
> +	addi	a1, a1, 4
> +	addi	a2, a2, -4
> +	bgtu	a2, a5, 1b

Same here?

> +
> +.Lcopy1:
> +	beqz	a2, .Lcopy0
>  1:	lb	a4, 0(a0)
>  	addi	a0, a0, 1
>  	sb	a4, 0(a1)
> @@ -58,6 +88,7 @@ ENTRY(copyin)
>  	addi	a2, a2, -1
>  	bnez	a2, 1b
>
> +.Lcopy0:
>  	EXIT_USER_ACCESS(a4)
>  	SET_FAULT_HANDLER(a3, a4)
>  .Lcopyiodone:
>
>
> --
> jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
>
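[Editorial note: the bgtu/bgeu question is about efficiency at the loop boundary rather than correctness. With bgtu the loop exits once the remaining count drops to exactly 8, so a 16-byte aligned copy would do one doubleword iteration and push its last 8 bytes through the slower 4- and 1-byte tails. A C sketch of the doubleword loop (hypothetical names; the authoritative code is the assembly above) makes the boundary case visible:

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the .Lcopy8 loop; entered only when len >= 8 and both
 * pointers are 8-byte aligned.  "while (len >= 8)" corresponds to
 * bgeu; bgtu would be "while (len > 8)" and would strand one final
 * full doubleword for the word/byte tails.
 */
static size_t
copy_dwords(uint64_t **dst, const uint64_t **src, size_t len)
{
	do {
		*(*dst)++ = *(*src)++;	/* one ld/sd pair */
		len -= 8;
	} while (len >= 8);		/* bgeu a2, a5, 1b */
	return len;			/* 0..7 tail bytes left */
}
]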
riscv64: slightly optimized copyin/copyout/kcopy
I've been using a variation of this diff on my hifive unmatched for
a few days.  The goal is to optimize at least the aligned cases by
using 8- or 4-byte loads/stores.  On this hifive unmatched, I found
that loops doing unaligned 8- or 4-byte loads/stores are utterly slow,
much slower than equivalent 1-byte loads/stores (say 40x slower).

This improves e.g. I/O throughput and shaves between 10 and 15s off
a total of 11m30s in 'make clean; make -j4' kernel builds.

I have another diff that tries to re-align initially unaligned
addresses if possible, but it's uglier and it's hard to tell whether
it makes any difference in real life.

ok?


Index: copy.S
===================================================================
RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
retrieving revision 1.6
diff -u -p -p -u -r1.6 copy.S
--- copy.S	28 Jun 2021 18:53:10 -0000	1.6
+++ copy.S	23 Jul 2021 07:45:16 -0000
@@ -49,8 +49,38 @@ ENTRY(copyin)
 	SWAP_FAULT_HANDLER(a3, a4, a5)
 	ENTER_USER_ACCESS(a4)
 
-	// XXX optimize?
 .Lcopyio:
+.Lcopy8:
+	li	a5, 8
+	bltu	a2, a5, .Lcopy4
+
+	or	a7, a0, a1
+	andi	a7, a7, 7
+	bnez	a7, .Lcopy4
+
+1:	ld	a4, 0(a0)
+	addi	a0, a0, 8
+	sd	a4, 0(a1)
+	addi	a1, a1, 8
+	addi	a2, a2, -8
+	bgtu	a2, a5, 1b
+
+.Lcopy4:
+	li	a5, 4
+	bltu	a2, a5, .Lcopy1
+
+	andi	a7, a7, 3
+	bnez	a7, .Lcopy1
+
+1:	lw	a4, 0(a0)
+	addi	a0, a0, 4
+	sw	a4, 0(a1)
+	addi	a1, a1, 4
+	addi	a2, a2, -4
+	bgtu	a2, a5, 1b
+
+.Lcopy1:
+	beqz	a2, .Lcopy0
 1:	lb	a4, 0(a0)
 	addi	a0, a0, 1
 	sb	a4, 0(a1)
@@ -58,6 +88,7 @@ ENTRY(copyin)
 	addi	a2, a2, -1
 	bnez	a2, 1b
 
+.Lcopy0:
 	EXIT_USER_ACCESS(a4)
 	SET_FAULT_HANDLER(a3, a4)
 .Lcopyiodone:


-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
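[Editorial note: the realignment diff jca mentions was never posted in this thread. Purely as a hedged illustration of the idea, here is one plausible shape for it in C: all names are invented, the pointer casts stand in for the assembly's ld/sd, and it assumes realignment only pays off when src and dst share the same misalignment:

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not jca's unposted diff: copy single bytes
 * until dst is 8-byte aligned, then copy doublewords.  When the two
 * misalignments differ, realignment cannot line up both pointers,
 * so fall back to a plain byte loop.
 */
static void
copy_realign(unsigned char *dst, const unsigned char *src, size_t len)
{
	if (((uintptr_t)src & 7) == ((uintptr_t)dst & 7)) {
		while (len > 0 && ((uintptr_t)dst & 7) != 0) {
			*dst++ = *src++;
			len--;
		}
		while (len >= 8) {
			/* both pointers are 8-byte aligned here */
			*(uint64_t *)dst = *(const uint64_t *)src;
			dst += 8;
			src += 8;
			len -= 8;
		}
	}
	while (len > 0) {
		*dst++ = *src++;
		len--;
	}
}
]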