I've been running a variation of this diff on my HiFive Unmatched for
a few days. The goal is to at least optimize the aligned cases by using
8- or 4-byte loads/stores. On this HiFive Unmatched, I found that loops
of unaligned 8- or 4-byte loads/stores are utterly slow, much slower
than the equivalent 1-byte loads/stores (say 40x slower).
This improves e.g. I/O throughput and shaves between 10 and 15s off
a total of 11m30s in "make clean; make -j4" kernel builds.
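For readers who prefer C, the dispatch in the diff below can be modeled
roughly as follows. This is only a sketch under assumptions: copy_model
is a made-up name, memcpy() of a fixed size stands in for the ld/sd and
lw/sw instructions, and the fault handler and user access window of the
real copyin/copyout path are left out entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical C model of the dispatch in the diff: use the widest
 * load/store only when both pointers are aligned to that width.
 * Not the kernel code: no fault handler, no user access window.
 */
static void
copy_model(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	/* A low bit set in either address means at least one is unaligned. */
	uintptr_t misalign = (uintptr_t)s | (uintptr_t)d;

	if ((misalign & 7) == 0) {
		while (len >= 8) {
			uint64_t w;
			memcpy(&w, s, 8);	/* stands in for ld */
			memcpy(d, &w, 8);	/* stands in for sd */
			s += 8; d += 8; len -= 8;
		}
	}
	/* The 8-byte loop advances by multiples of 8, so the check still holds. */
	if ((misalign & 3) == 0) {
		while (len >= 4) {
			uint32_t w;
			memcpy(&w, s, 4);	/* stands in for lw */
			memcpy(d, &w, 4);	/* stands in for sw */
			s += 4; d += 4; len -= 4;
		}
	}
	/* Unaligned buffers and any leftover tail go byte by byte. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

The point of ORing the two addresses is that one mask test covers both
source and destination; anything that fails the test falls through to a
narrower loop instead of issuing unaligned wide accesses.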
I have another diff that tries to re-align initially unaligned addresses
if possible, but it's uglier and it's hard to tell whether it makes any
difference in real life.
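To illustrate what re-aligning could mean here, one possible shape is
sketched below in C. This is a guess at the general idea and not the
other diff mentioned above: copy_realign_model is a hypothetical name,
and the sketch assumes re-alignment is only possible when source and
destination share the same offset within an 8-byte word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch of head re-alignment; NOT the author's other diff.
 * If src and dst have the same offset within an 8-byte word, a few
 * byte copies bring both pointers to an 8-byte boundary at once.
 */
static void
copy_realign_model(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;

	if ((((uintptr_t)s ^ (uintptr_t)d) & 7) == 0) {
		/* Byte-copy the head until both pointers are 8-byte aligned. */
		while (len > 0 && ((uintptr_t)s & 7) != 0) {
			*d++ = *s++;
			len--;
		}
		/* Both pointers are aligned here (or len is 0). */
		while (len >= 8) {
			uint64_t w;
			memcpy(&w, s, 8);	/* stands in for ld */
			memcpy(d, &w, 8);	/* stands in for sd */
			s += 8; d += 8; len -= 8;
		}
	}
	/* Tail bytes, or the whole buffer when the offsets differ. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

When the two offsets differ, no amount of head copying aligns both
pointers, which is one reason the benefit is hard to judge in practice.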
ok?
Index: copy.S
===================================================================
RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
retrieving revision 1.6
diff -u -p -p -u -r1.6 copy.S
--- copy.S 28 Jun 2021 18:53:10 -0000 1.6
+++ copy.S 23 Jul 2021 07:45:16 -0000
@@ -49,8 +49,38 @@ ENTRY(copyin)
SWAP_FAULT_HANDLER(a3, a4, a5)
ENTER_USER_ACCESS(a4)
-// XXX optimize?
.Lcopyio:
+.Lcopy8:
+ or a7, a0, a1 // set before the branch: .Lcopy4 reads a7
+ li a5, 8
+ bltu a2, a5, .Lcopy4
+
+ andi a7, a7, 7
+ bnez a7, .Lcopy4
+
+1: ld a4, 0(a0)
+ addi a0, a0, 8
+ sd a4, 0(a1)
+ addi a1, a1, 8
+ addi a2, a2, -8
+ bgeu a2, a5, 1b
+
+.Lcopy4:
+ li a5, 4
+ bltu a2, a5, .Lcopy1
+
+ andi a7, a7, 3
+ bnez a7, .Lcopy1
+
+1: lw a4, 0(a0)
+ addi a0, a0, 4
+ sw a4, 0(a1)
+ addi a1, a1, 4
+ addi a2, a2, -4
+ bgeu a2, a5, 1b
+
+.Lcopy1:
+ beqz a2, .Lcopy0
1: lb a4, 0(a0)
addi a0, a0, 1
sb a4, 0(a1)
@@ -58,6 +88,7 @@ ENTRY(copyin)
addi a2, a2, -1
bnez a2, 1b
+.Lcopy0:
EXIT_USER_ACCESS(a4)
SET_FAULT_HANDLER(a3, a4)
.Lcopyiodone:
--
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE