Re: [fpc-devel] Guidance for code generation for shift operations for AVR target

2019-08-24 Thread Christo Crause
On Mon, Aug 19, 2019 at 10:36 PM Florian Klaempfl wrote:

> On 19.08.2019 at 22:20, Christo Crause wrote:
> > I'm interested in trying to improve the code generated for shift
> > operations (in particular involving a compile time constant shift) for
> > the AVR target.
> The idea looks good to me; actually, I once planned to do the same, but
> got stuck for time reasons at the beginnings you discovered.
>

Attached for comment is a patch that optimizes SHL and SHR for a constant
right operand.  Note that I haven't implemented support for SAR, ROL or ROR;
those operations should fall through to the current code path.  A heuristic
that decides when loop unrolling is favoured, taking -Os into account, is
also included.  Some of the details behind the optimizations are discussed
here:
https://github.com/ccrause/freepascal/wiki/Optimizing-code-generation-for-shift-with-compile-time-constant
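
As an illustration (a minimal sketch, not part of the patch), a test case
like the following exercises the two situations the heuristic has to choose
between; the expected instruction mix follows from a div 8 byte moves plus
a mod 8 residual single-bit shifts:

program shifttest;

var
  w: word;
  d: dword;

begin
  w := $1234;
  d := $12345678;
  // 10 = 1 whole byte (10 div 8) plus 2 residual bits (10 mod 8):
  // one byte move and two unrolled LSL/ROL steps are expected here.
  w := w shl 10;
  // 19 = 2 whole bytes plus 3 residual bits over a 4-byte value:
  // here the counted LSR/ROR loop may win, depending on -Os and the cost estimate.
  d := d shr 19;
  writeln(w, ' ', d);
end.

By my reading of the cost expressions in the patch, the word case above
always unrolls (roughly 2 instructions for the residual bits versus 4 for
the loop), while the dword case uses the counted loop under -Os and unrolls
otherwise.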

One remaining optimization I would like to implement is to skip loading part
of a variable into a register when that part is shifted out entirely by the
shift amount.  The loading of the variable and the saving of the result
from/to memory happen before tcgavr.a_op_const_reg_reg gets called.  Is there
a way to remove such a redundant load instruction from within
tcgavr.a_op_const_reg_reg?  Or at least to mark it for removal at a later
stage (e.g. the peephole optimizer)?

I assume it would make sense to also include SAR support in this patch,
since it could be generated by div?
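
For reference, a minimal sketch of where SAR could come from (I haven't
verified how the current AVR code generator lowers this): signed division by
a power of two can be implemented as an arithmetic shift right plus a small
correction for negative operands, so that the result still truncates towards
zero.

program sardemo;

var
  s: smallint;

begin
  s := -1000;
  // Candidate for SAR-based code: div by a power of two on a signed value.
  s := s div 8;
  writeln(s);  // -125
end.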

best wishes,
Christo
diff --git compiler/avr/cgcpu.pas compiler/avr/cgcpu.pas
index 4be4f4823e..ba6a06427e 100644
--- compiler/avr/cgcpu.pas
+++ compiler/avr/cgcpu.pas
@@ -438,6 +438,11 @@ unit cgcpu;
 
 
  procedure tcgavr.a_op_const_reg_reg(list: TAsmList; op: TOpCg; size: tcgsize; a: tcgint; src, dst: tregister);
+   var
+ tmpSrc, tmpDst, countreg: TRegister;
+ b, b2, i, j: byte;
+ s1, s2, t1: integer;
+ l1: TAsmLabel;
begin
  if (op in [OP_MUL,OP_IMUL]) and (size in [OS_16,OS_S16]) and (a in [2,4,8]) then
begin
@@ -451,6 +456,88 @@ unit cgcpu;
  a:=a shr 1;
end;
end
+ else if (op in [OP_SHL, OP_SHR]) and (a > 0) then // a = 0 gets eliminated later by tcg.optimize_op_const
+   begin
+ b := a div 8;  // number of bytes to shift
+
+ // copy from src to dst accounting for shift offset
+ for i := 0 to (tcgsize2size[size]-b-1) do
+   if op = OP_SHL then
+ a_load_reg_reg(list, OS_8, OS_8,
+   GetOffsetReg64(src, NR_NO, i),
+   GetOffsetReg64(dst, NR_NO, i+b))
+   else
+ a_load_reg_reg(list, OS_8, OS_8,
+   GetOffsetReg64(src, NR_NO, i+b),
+   GetOffsetReg64(dst, NR_NO, i));
+
+ b2 := a mod 8; // remaining bit shifts
+ if b2 > 0 then
+   begin
+ // Cost of loop
+ s1 := 3 + tcgsize2size[size] - b;
+ t1 := b2*(tcgsize2size[size] - b + 3);
+ // Cost of loop unrolling, t2 = s2
+ s2 := b2*(tcgsize2size[size] - b);
+
+ if ((cs_opt_size in current_settings.optimizerswitches) and (s1 < s2)) or
+(((s2 - s1) - t1/s2) > 0) then
+   begin
+ // Shift non-moved bytes in loop
+ current_asmdata.getjumplabel(l1);
+ countreg:=getintregister(list,OS_8);
+ a_load_const_reg(list,OS_8,b2,countreg);
+ cg.a_label(list,l1);
+ if op = OP_SHL then
+   list.concat(taicpu.op_reg(A_LSL,GetOffsetReg64(dst, NR_NO, b)))
+ else
+   list.concat(taicpu.op_reg(A_LSR,GetOffsetReg64(dst, NR_NO, tcgsize2size[size]-1-b)));
+
+ if size in [OS_S16,OS_16,OS_S32,OS_32,OS_S64,OS_64] then
+   begin
+ for i:=2+b to tcgsize2size[size] do
+   if op = OP_SHL then
+ list.concat(taicpu.op_reg(A_ROL,GetOffsetReg64(dst,NR_NO,i-1)))
+   else
+ list.concat(taicpu.op_reg(A_ROR,GetOffsetReg64(dst, NR_NO,tcgsize2size[size]-i)));
+   end;
+ list.concat(taicpu.op_reg(A_DEC,countreg));
+ a_jmp_flags(list,F_NE,l1);
+ // keep registers alive
+ list.concat(taicpu.op_reg_reg(A_MOV,countreg,countreg));
+   end
+ else
+   begin
+ // Unroll shift loop over non-moved bytes
+ for j := 1 to b2 do
+ begin
+   if op = OP_SHL then
+ list.concat(taicpu.op_reg(A_LSL,
+ GetOffsetReg64(dst, NR_NO, b)))
+  

Re: [fpc-devel] Guidance for code generation for shift operations for AVR target

2019-08-19 Thread Florian Klaempfl
On 19.08.2019 at 22:20, Christo Crause wrote:
> I'm interested in trying to improve the code generated for shift
> operations (in particular involving a compile time constant shift) for
> the AVR target.  The AVR processor doesn't have a barrel shifter;
> instead it can only shift by a single bit position per clock cycle.
> Currently the compiler by default generates a bit shift loop where the
> loop is executed n times to shift a value by n bits.  The only
> optimization I noticed for the case of shifting a value by a compile
> time constant is in Tcg.a_op_const_reg_reg where an 8 bit shift of a 16
> bit value is converted to copying the low byte of the left operand into
> the high byte of the result, and setting the low byte of the result to 0.
> 
> I would like to extend this type of optimization to cover more cases -
> the obvious extension is to convert all shifts by multiples of 8 bits
> into corresponding byte moves. A more general approach (which I've got
> working for shl as a proof of concept) is to first convert all multiples
> of 8 bits into byte moves, then do the last few bit shifts (if any)
> either as an unrolled loop (e.g. as implemented in
> tcgavr.a_op_const_reg_internal) or by generating the conventional shift
> loop (as implemented in tcgavr.a_op_reg_reg_internal). At the moment I've
> implemented this logic in tcgavr.a_op_const_reg_reg.  I first check
> whether I can generate smaller code than a shift loop; if not, the code
> calls the inherited method a_op_const_reg_reg, which basically follows
> the existing path (see also the attached patch).
> 
> The code generator is complex for me to follow, since the functionality
> is somewhat normalized and distributed across generic and CPU-specific
> parts, and code flow jumps up and down the inheritance chain.

Yes, nice OOP ;)

> I
> therefore have some questions about this proposed modification and its
> implementation in the code generator, on which I would like some
> guidance:
> * Is tcgavr.a_op_const_reg_reg the correct place for this type of
> functionality?
> * Am I messing up something else by effectively moving most of the code
> generation for this case higher up the call chain?
> * Am I missing some other path which could also benefit from this
> optimization?
> * Should I try and generate different code depending on whether -Os is
> specified or not (e.g. perform more loop unrolling if -Os is not specified)?
> * Any comments on the patch, which is a work in progress?

The idea looks good to me; actually, I once planned to do the same, but
got stuck for time reasons at the beginnings you discovered.

Whether such a patch really works can normally only be checked by testing.
Besides simple benchmarking I often do another thing: I build without the
patch with -al and copy all .s files to a temporary directory, then I build
with the patch with -al and copy all resulting .s files to another temporary
directory, and then I look at the differences with WinMerge. It becomes
obvious pretty quickly whether such a patch works well.
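
A minimal sketch of how that comparison step could be automated (an
illustration only; the directory names asm-before and asm-after are
placeholders, and any diff tool over the two directories does the same job):

program cmpasm;

{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

// Load a whole text file so two listings can be compared as strings.
function LoadText(const FileName: string): string;
var
  sl: TStringList;
begin
  sl := TStringList.Create;
  try
    sl.LoadFromFile(FileName);
    Result := sl.Text;
  finally
    sl.Free;
  end;
end;

var
  sr: TSearchRec;
  dirA, dirB, fA, fB: string;
begin
  dirA := 'asm-before';  // listings built without the patch (-al)
  dirB := 'asm-after';   // listings built with the patch (-al)
  if FindFirst(IncludeTrailingPathDelimiter(dirA) + '*.s', faAnyFile, sr) = 0 then
  begin
    repeat
      fA := IncludeTrailingPathDelimiter(dirA) + sr.Name;
      fB := IncludeTrailingPathDelimiter(dirB) + sr.Name;
      if not FileExists(fB) then
        writeln('missing in ', dirB, ': ', sr.Name)
      else if LoadText(fA) <> LoadText(fB) then
        writeln('differs: ', sr.Name);   // open this pair in WinMerge/diff
    until FindNext(sr) <> 0;
    FindClose(sr);
  end;
end.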


[fpc-devel] Guidance for code generation for shift operations for AVR target

2019-08-19 Thread Christo Crause
I'm interested in trying to improve the code generated for shift operations
(in particular involving a compile time constant shift) for the AVR
target.  The AVR processor doesn't have a barrel shifter; instead it can
only shift by a single bit position per clock cycle. Currently the compiler
by default generates a bit shift loop where the loop is executed n times to
shift a value by n bits.  The only optimization I noticed for the case of
shifting a value by a compile time constant is in Tcg.a_op_const_reg_reg
where an 8 bit shift of a 16 bit value is converted to copying the low byte
of the left operand into the high byte of the result, and setting the low
byte of the result to 0.

I would like to extend this type of optimization to cover more cases - the
obvious extension is to convert all shifts by multiples of 8 bits into
corresponding byte moves. A more general approach (which I've got working
for shl as a proof of concept) is to first convert all multiples of 8 bits
into byte moves, then do the last few bit shifts (if any) either as an
unrolled loop (e.g. as implemented in tcgavr.a_op_const_reg_internal) or by
generating the conventional shift loop (as implemented in
tcgavr.a_op_reg_reg_internal). At the moment I've implemented this logic
in tcgavr.a_op_const_reg_reg.  I first check whether I can generate smaller
code than a shift loop; if not, the code calls the inherited method
a_op_const_reg_reg, which basically follows the existing path (see also
the attached patch).
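
As a pure-Pascal model of that decomposition (an illustration only, not
compiler code; ShlConst32 is just a hypothetical helper name), the
whole-byte part corresponds to byte moves and zero-filling, and the
remainder to the unrolled or looped LSL/ROL steps:

program shlmodel;

{$mode objfpc}

// Models x shl a for a dword as: whole-byte moves first, then 0..7
// residual single-bit shifts, mirroring the proposed AVR code.
function ShlConst32(x: dword; a: byte): dword;
var
  bytes, bits: byte;
begin
  bytes := a div 8;             // handled by byte moves and zero bytes
  bits  := a mod 8;             // handled by single-bit shift steps
  Result := x shl (bytes * 8);  // byte-move part
  Result := Result shl bits;    // residual bit shifts
end;

begin
  // $1234 shl 11 = $0091A000 either way
  writeln(hexstr(ShlConst32($1234, 11), 8));
end.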

The code generator is complex for me to follow, since the functionality is
somewhat normalized and distributed across generic and CPU-specific parts,
and code flow jumps up and down the inheritance chain.  I therefore have
some questions about this proposed modification and its implementation in
the code generator, on which I would like some guidance:
* Is tcgavr.a_op_const_reg_reg the correct place for this type of
functionality?
* Am I messing up something else by effectively moving most of the code
generation for this case higher up the call chain?
* Am I missing some other path which could also benefit from this
optimization?
* Should I try and generate different code depending on whether -Os is
specified or not (e.g. perform more loop unrolling if -Os is not specified)?
* Any comments on the patch, which is a work in progress?

best wishes,
Christo
diff --git compiler/avr/cgcpu.pas compiler/avr/cgcpu.pas
index 4be4f4823e..355e204bf7 100644
--- compiler/avr/cgcpu.pas
+++ compiler/avr/cgcpu.pas
@@ -438,6 +438,9 @@ unit cgcpu;
 
 
  procedure tcgavr.a_op_const_reg_reg(list: TAsmList; op: TOpCg; size: tcgsize; a: tcgint; src, dst: tregister);
+   var
+ tmpSrc, tmpDst: TRegister;
+ b, i, j: byte;
begin
  if (op in [OP_MUL,OP_IMUL]) and (size in [OS_16,OS_S16]) and (a in [2,4,8]) then
begin
@@ -451,6 +454,43 @@ unit cgcpu;
  a:=a shr 1;
end;
end
+ // Word size wins up to remainder = 3
+ // Dword size wins up to remainder = 2
+ else if (op = OP_SHL) and (a > 0) // a = 0 gets eliminated later by tcg.optimize_op_const
+  and ((a mod 8) * tcgsize2size[size] <= 8) then // remaining shifts short enough for unrolled loop
+   begin
+ b := a div 8;  // number of bytes to shift
+ // first fill the b least significant bytes with 0
+ tmpSrc := src;
+ tmpDst := dst;
+ for i := 1 to b do
+ begin
+   a_load_const_reg(list,OS_8,0,tmpDst);
+   tmpDst := GetNextReg(tmpDst);
+ end;
+
+ for i := b+1 to tcgsize2size[size] do
+ begin
+   a_load_reg_reg(list,OS_8,OS_8,tmpSrc,tmpDst);
+   if i < tcgsize2size[size] then // don't retrieve next reg if on last iteration
+   begin
+ tmpSrc := GetNextReg(tmpSrc);
+ tmpDst := GetNextReg(tmpDst);
+   end;
+ end;
+
+ b := a mod 8;
+ if b > 0 then
+ begin
+   for j := 1 to b do
+   begin
+ list.concat(taicpu.op_reg(A_LSL, dst));
+ if not(size in [OS_8, OS_S8]) then
+   for i := 2 to tcgsize2size[size] do
+ list.concat(taicpu.op_reg(A_ROL,GetOffsetReg64(dst, NR_NO, i-1)));
+   end;
+ end;
+   end
  else
inherited a_op_const_reg_reg(list,op,size,a,src,dst);
end;