Re: riscv64: slightly optimized copyin/copyout/kcopy

2021-07-23 Thread Mark Kettenis
> From: Jeremie Courreges-Anglas 
> Date: Fri, 23 Jul 2021 13:54:31 +0200
> 
> On Fri, Jul 23 2021, Mark Kettenis  wrote:
> >> From: Jeremie Courreges-Anglas 
> >> Date: Fri, 23 Jul 2021 11:54:51 +0200
> >> Content-Type: text/plain
> >> 
> >> 
> >> I've been using a variation of this diff on my hifive unmatched for a few
> >> days.  The goal is to at least optimize the aligned cases by using 8- or
> >> 4-byte loads/stores.  On this hifive unmatched, I found that loops doing
> >> unaligned 8- or 4-byte loads/stores are utterly slow, much slower than
> >> equivalent 1-byte loads/stores (say 40x slower).
> >> 
> >> This improves e.g. I/O throughput and shaves between 10 and 15 seconds off
> >> a total of 11m30s for ''make clean; make -j4'' kernel builds.
> >> 
> >> I have another diff that tries to re-align initially unaligned addresses
> >> if possible but it's uglier and it's hard to tell whether it makes any
> >> difference in real life.
> >> 
> >> ok?
> >> 
> >> 
> >> Index: copy.S
> >> ===
> >> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> >> retrieving revision 1.6
> >> diff -u -p -p -u -r1.6 copy.S
> >> --- copy.S 28 Jun 2021 18:53:10 -  1.6
> >> +++ copy.S 23 Jul 2021 07:45:16 -
> >> @@ -49,8 +49,38 @@ ENTRY(copyin)
> >>SWAP_FAULT_HANDLER(a3, a4, a5)
> >>ENTER_USER_ACCESS(a4)
> >>  
> >> -// XXX optimize?
> >>  .Lcopyio:
> >> +.Lcopy8:
> >> +  li  a5, 8
> >> +  bltu  a2, a5, .Lcopy4
> >> +
> >> +  or  a7, a0, a1
> >> +  andi  a7, a7, 7
> >> +  bnez  a7, .Lcopy4
> >> +
> >> +1: ld  a4, 0(a0)
> >> +  addi  a0, a0, 8
> >> +  sd  a4, 0(a1)
> >> +  addi  a1, a1, 8
> >> +  addi  a2, a2, -8
> >> +  bgtu  a2, a5, 1b
> >
> > Shouldn't this be
> >
> > bgeu  a2, a5, 1b
> 
> Yes, that's better indeed, thanks!  Updated diff.

ok kettenis@

> Index: copy.S
> ===
> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> retrieving revision 1.6
> diff -u -p -p -u -r1.6 copy.S
> --- copy.S28 Jun 2021 18:53:10 -  1.6
> +++ copy.S23 Jul 2021 11:52:54 -
> @@ -49,8 +49,38 @@ ENTRY(copyin)
>   SWAP_FAULT_HANDLER(a3, a4, a5)
>   ENTER_USER_ACCESS(a4)
>  
> -// XXX optimize?
>  .Lcopyio:
> +.Lcopy8:
> + li  a5, 8
> + bltu  a2, a5, .Lcopy4
> +
> + or  a7, a0, a1
> + andi  a7, a7, 7
> + bnez  a7, .Lcopy4
> +
> +1:   ld  a4, 0(a0)
> + addi  a0, a0, 8
> + sd  a4, 0(a1)
> + addi  a1, a1, 8
> + addi  a2, a2, -8
> + bgeu  a2, a5, 1b
> +
> +.Lcopy4:
> + li  a5, 4
> + bltu  a2, a5, .Lcopy1
> +
> + andi  a7, a7, 3
> + bnez  a7, .Lcopy1
> +
> +1:   lw  a4, 0(a0)
> + addi  a0, a0, 4
> + sw  a4, 0(a1)
> + addi  a1, a1, 4
> + addi  a2, a2, -4
> + bgeu  a2, a5, 1b
> +
> +.Lcopy1:
> + beqz  a2, .Lcopy0
>  1:   lb  a4, 0(a0)
>   addi  a0, a0, 1
>   sb  a4, 0(a1)
> @@ -58,6 +88,7 @@ ENTRY(copyin)
>   addi  a2, a2, -1
>   bnez  a2, 1b
>  
> +.Lcopy0:
>   EXIT_USER_ACCESS(a4)
>   SET_FAULT_HANDLER(a3, a4)
>  .Lcopyiodone:
> 
> 
> -- 
> jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE
> 



Re: riscv64: slightly optimized copyin/copyout/kcopy

2021-07-23 Thread Jeremie Courreges-Anglas
On Fri, Jul 23 2021, Mark Kettenis  wrote:
>> From: Jeremie Courreges-Anglas 
>> Date: Fri, 23 Jul 2021 11:54:51 +0200
>> Content-Type: text/plain
>> 
>> 
>> I've been using a variation of this diff on my hifive unmatched for a few
>> days.  The goal is to at least optimize the aligned cases by using 8- or
>> 4-byte loads/stores.  On this hifive unmatched, I found that loops doing
>> unaligned 8- or 4-byte loads/stores are utterly slow, much slower than
>> equivalent 1-byte loads/stores (say 40x slower).
>> 
>> This improves e.g. I/O throughput and shaves between 10 and 15 seconds off
>> a total of 11m30s for ''make clean; make -j4'' kernel builds.
>> 
>> I have another diff that tries to re-align initially unaligned addresses
>> if possible but it's uglier and it's hard to tell whether it makes any
>> difference in real life.
>> 
>> ok?
>> 
>> 
>> Index: copy.S
>> ===
>> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
>> retrieving revision 1.6
>> diff -u -p -p -u -r1.6 copy.S
>> --- copy.S   28 Jun 2021 18:53:10 -  1.6
>> +++ copy.S   23 Jul 2021 07:45:16 -
>> @@ -49,8 +49,38 @@ ENTRY(copyin)
>>  SWAP_FAULT_HANDLER(a3, a4, a5)
>>  ENTER_USER_ACCESS(a4)
>>  
>> -// XXX optimize?
>>  .Lcopyio:
>> +.Lcopy8:
>> +li  a5, 8
>> +bltu  a2, a5, .Lcopy4
>> +
>> +or  a7, a0, a1
>> +andi  a7, a7, 7
>> +bnez  a7, .Lcopy4
>> +
>> +1:  ld  a4, 0(a0)
>> +addi  a0, a0, 8
>> +sd  a4, 0(a1)
>> +addi  a1, a1, 8
>> +addi  a2, a2, -8
>> +bgtu  a2, a5, 1b
>
> Shouldn't this be
>
>   bgeu  a2, a5, 1b

Yes, that's better indeed, thanks!  Updated diff.


Index: copy.S
===
RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
retrieving revision 1.6
diff -u -p -p -u -r1.6 copy.S
--- copy.S  28 Jun 2021 18:53:10 -  1.6
+++ copy.S  23 Jul 2021 11:52:54 -
@@ -49,8 +49,38 @@ ENTRY(copyin)
SWAP_FAULT_HANDLER(a3, a4, a5)
ENTER_USER_ACCESS(a4)
 
-// XXX optimize?
 .Lcopyio:
+.Lcopy8:
+   li  a5, 8
+   bltu  a2, a5, .Lcopy4
+
+   or  a7, a0, a1
+   andi  a7, a7, 7
+   bnez  a7, .Lcopy4
+
+1: ld  a4, 0(a0)
+   addi  a0, a0, 8
+   sd  a4, 0(a1)
+   addi  a1, a1, 8
+   addi  a2, a2, -8
+   bgeu  a2, a5, 1b
+
+.Lcopy4:
+   li  a5, 4
+   bltu  a2, a5, .Lcopy1
+
+   andi  a7, a7, 3
+   bnez  a7, .Lcopy1
+
+1: lw  a4, 0(a0)
+   addi  a0, a0, 4
+   sw  a4, 0(a1)
+   addi  a1, a1, 4
+   addi  a2, a2, -4
+   bgeu  a2, a5, 1b
+
+.Lcopy1:
+   beqz  a2, .Lcopy0
 1: lb  a4, 0(a0)
addi  a0, a0, 1
sb  a4, 0(a1)
@@ -58,6 +88,7 @@ ENTRY(copyin)
addi  a2, a2, -1
bnez  a2, 1b
 
+.Lcopy0:
EXIT_USER_ACCESS(a4)
SET_FAULT_HANDLER(a3, a4)
 .Lcopyiodone:


-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: riscv64: slightly optimized copyin/copyout/kcopy

2021-07-23 Thread Mark Kettenis
> From: Jeremie Courreges-Anglas 
> Date: Fri, 23 Jul 2021 11:54:51 +0200
> Content-Type: text/plain
> 
> 
> I've been using a variation of this diff on my hifive unmatched for a few
> days.  The goal is to at least optimize the aligned cases by using 8- or
> 4-byte loads/stores.  On this hifive unmatched, I found that loops doing
> unaligned 8- or 4-byte loads/stores are utterly slow, much slower than
> equivalent 1-byte loads/stores (say 40x slower).
> 
> This improves e.g. I/O throughput and shaves between 10 and 15 seconds off
> a total of 11m30s for ''make clean; make -j4'' kernel builds.
> 
> I have another diff that tries to re-align initially unaligned addresses
> if possible but it's uglier and it's hard to tell whether it makes any
> difference in real life.
> 
> ok?
> 
> 
> Index: copy.S
> ===
> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> retrieving revision 1.6
> diff -u -p -p -u -r1.6 copy.S
> --- copy.S28 Jun 2021 18:53:10 -  1.6
> +++ copy.S23 Jul 2021 07:45:16 -
> @@ -49,8 +49,38 @@ ENTRY(copyin)
>   SWAP_FAULT_HANDLER(a3, a4, a5)
>   ENTER_USER_ACCESS(a4)
>  
> -// XXX optimize?
>  .Lcopyio:
> +.Lcopy8:
> + li  a5, 8
> + bltu  a2, a5, .Lcopy4
> +
> + or  a7, a0, a1
> + andi  a7, a7, 7
> + bnez  a7, .Lcopy4
> +
> +1:   ld  a4, 0(a0)
> + addi  a0, a0, 8
> + sd  a4, 0(a1)
> + addi  a1, a1, 8
> + addi  a2, a2, -8
> + bgtu  a2, a5, 1b

Shouldn't this be

bgeu  a2, a5, 1b
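
With bgtu the loop drops out one iteration early whenever the remaining
count lands exactly on 8, and the last doubleword then falls through to
the narrower loops below.  For a 16-byte aligned copy, roughly:

  bgtu: copy 8, a2 = 8, 8 > 8 fails   -> last 8 bytes handled by the slower loops
  bgeu: copy 8, a2 = 8, 8 >= 8 holds  -> one more 8-byte iteration and we're done

Still correct either way, just not as fast.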

> +
> +.Lcopy4:
> + li  a5, 4
> + bltu  a2, a5, .Lcopy1
> +
> + andi  a7, a7, 3
> + bnez  a7, .Lcopy1
> +
> +1:   lw  a4, 0(a0)
> + addi  a0, a0, 4
> + sw  a4, 0(a1)
> + addi  a1, a1, 4
> + addi  a2, a2, -4
> + bgtu  a2, a5, 1b

Same here?

> +
> +.Lcopy1:
> + beqz  a2, .Lcopy0
>  1:   lb  a4, 0(a0)
>   addi  a0, a0, 1
>   sb  a4, 0(a1)
> @@ -58,6 +88,7 @@ ENTRY(copyin)
>   addi  a2, a2, -1
>   bnez  a2, 1b
>  
> +.Lcopy0:
>   EXIT_USER_ACCESS(a4)
>   SET_FAULT_HANDLER(a3, a4)
>  .Lcopyiodone:
> 
> 
> -- 
> jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE
> 
> 



riscv64: slightly optimized copyin/copyout/kcopy

2021-07-23 Thread Jeremie Courreges-Anglas


I've been using a variation of this diff on my hifive unmatched for a few
days.  The goal is to at least optimize the aligned cases by using 8- or
4-byte loads/stores.  On this hifive unmatched, I found that loops doing
unaligned 8- or 4-byte loads/stores are utterly slow, much slower than
equivalent 1-byte loads/stores (say 40x slower).
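
In rough C terms the copy path in this diff looks like the sketch below
(logic only, with made-up names, not the actual code; a0/a1/a2 are the
source pointer, destination pointer and byte count):

    #include <stddef.h>
    #include <stdint.h>

    void
    copy_sketch(const char *src /* a0 */, char *dst /* a1 */, size_t n /* a2 */)
    {
        /* 8-byte loop, taken only when src and dst are both 8-byte aligned */
        if (n >= 8 && (((uintptr_t)src | (uintptr_t)dst) & 7) == 0) {
            do {
                *(uint64_t *)dst = *(const uint64_t *)src;
                src += 8; dst += 8; n -= 8;
            } while (n >= 8);
        }
        /* 4-byte loop picks up a remaining aligned tail */
        if (n >= 4 && (((uintptr_t)src | (uintptr_t)dst) & 3) == 0) {
            do {
                *(uint32_t *)dst = *(const uint32_t *)src;
                src += 4; dst += 4; n -= 4;
            } while (n >= 4);
        }
        /* 1-byte loop handles whatever is left */
        while (n > 0) {
            *dst++ = *src++;
            n--;
        }
    }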

This improves e.g. I/O throughput and shaves between 10 and 15 seconds off
a total of 11m30s for ''make clean; make -j4'' kernel builds.

I have another diff that tries to re-align initially unaligned addresses
if possible but it's uglier and it's hard to tell whether it makes any
difference in real life.

ok?


Index: copy.S
===
RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
retrieving revision 1.6
diff -u -p -p -u -r1.6 copy.S
--- copy.S  28 Jun 2021 18:53:10 -  1.6
+++ copy.S  23 Jul 2021 07:45:16 -
@@ -49,8 +49,38 @@ ENTRY(copyin)
SWAP_FAULT_HANDLER(a3, a4, a5)
ENTER_USER_ACCESS(a4)
 
-// XXX optimize?
 .Lcopyio:
+.Lcopy8:
+   li  a5, 8
+   bltu  a2, a5, .Lcopy4
+
+   or  a7, a0, a1
+   andi  a7, a7, 7
+   bnez  a7, .Lcopy4
+
+1: ld  a4, 0(a0)
+   addi  a0, a0, 8
+   sd  a4, 0(a1)
+   addi  a1, a1, 8
+   addi  a2, a2, -8
+   bgtu  a2, a5, 1b
+
+.Lcopy4:
+   li  a5, 4
+   bltu  a2, a5, .Lcopy1
+
+   andi  a7, a7, 3
+   bnez  a7, .Lcopy1
+
+1: lw  a4, 0(a0)
+   addi  a0, a0, 4
+   sw  a4, 0(a1)
+   addi  a1, a1, 4
+   addi  a2, a2, -4
+   bgtu  a2, a5, 1b
+
+.Lcopy1:
+   beqz  a2, .Lcopy0
 1: lb  a4, 0(a0)
addi  a0, a0, 1
sb  a4, 0(a1)
@@ -58,6 +88,7 @@ ENTRY(copyin)
addi  a2, a2, -1
bnez  a2, 1b
 
+.Lcopy0:
EXIT_USER_ACCESS(a4)
SET_FAULT_HANDLER(a3, a4)
 .Lcopyiodone:


-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE