Re: [PATCH] powerpc: Force inlining of csum_add()

2021-06-17 Thread Michael Ellerman
On Tue, 11 May 2021 06:08:06 +0000 (UTC), Christophe Leroy wrote:
> Commit 328e7e487a46 ("powerpc: force inlining of csum_partial() to
> avoid multiple csum_partial() with GCC10") inlined csum_partial().
> 
> Now that csum_partial() is inlined, GCC outlines csum_add() when
> called by csum_partial().
> 
> c064fb28 <csum_add>:
> c064fb28: 7c 63 20 14 addc    r3,r3,r4
> c064fb2c: 7c 63 01 94 addze   r3,r3
> c064fb30: 4e 80 00 20 blr
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc: Force inlining of csum_add()
  https://git.kernel.org/powerpc/c/4423eff71ca6b8f2c5e0fc4cea33d8cdfe3c3740

cheers


Re: [PATCH] powerpc: Force inlining of csum_add()

2021-05-12 Thread Segher Boessenkool
On Wed, May 12, 2021 at 04:43:33PM +0200, Christophe Leroy wrote:
> On 12/05/2021 at 16:31, Segher Boessenkool wrote:
> >On Wed, May 12, 2021 at 02:56:56PM +0200, Christophe Leroy wrote:
> >>On 11/05/2021 at 12:51, Segher Boessenkool wrote:
> >>>Something seems to have decided this asm is more expensive than it is.
> >>>That isn't always avoidable -- the compiler cannot look inside asms --
> >>>but it seems it could be improved here.
> >>>
> >>>Do you have (or can make) a self-contained testcase?
> >>
> >>I have not tried, and I fear it might be difficult, because on a kernel
> >>build with dozens of calls to csum_add(), only ip6_tunnel.o exhibits such
> >>an issue.
> >
> >Yeah.  Sometimes you can force some of the decisions, but that usually
> >requires knowing too many GCC internals :-/
> >
> >>>>And there is even one completely unused instance of csum_add().
> >>>
> >>>That is strange, that should never happen.
> >>
> >>It seems that several .o include unused versions of csum_add. After the
> >>final link, one remains (in addition to the used one) in vmlinux.
> >
> >But it is a static function, so it should not end up in any object file
> >where it isn't used.
> 
> Well... did I dream ?
> 
> Now I only find one extra .o with unused csum_add() : That's 
> net/ipv6/exthdrs.o
> It matches the one found in vmlinux.
> 
> Are you interested in -fdump-tree-einline-all for that one as well ?

Sure.  Hopefully it will show more :-)


Segher


Re: [PATCH] powerpc: Force inlining of csum_add()

2021-05-12 Thread Christophe Leroy




On 12/05/2021 at 16:31, Segher Boessenkool wrote:
> On Wed, May 12, 2021 at 02:56:56PM +0200, Christophe Leroy wrote:
>> On 11/05/2021 at 12:51, Segher Boessenkool wrote:
>>> Something seems to have decided this asm is more expensive than it is.
>>> That isn't always avoidable -- the compiler cannot look inside asms --
>>> but it seems it could be improved here.
>>>
>>> Do you have (or can make) a self-contained testcase?
>>
>> I have not tried, and I fear it might be difficult, because on a kernel
>> build with dozens of calls to csum_add(), only ip6_tunnel.o exhibits such
>> an issue.
>
> Yeah.  Sometimes you can force some of the decisions, but that usually
> requires knowing too many GCC internals :-/
>
>>>> And there is even one completely unused instance of csum_add().
>>>
>>> That is strange, that should never happen.
>>
>> It seems that several .o include unused versions of csum_add. After the
>> final link, one remains (in addition to the used one) in vmlinux.
>
> But it is a static function, so it should not end up in any object file
> where it isn't used.

Well... did I dream ?

Now I only find one extra .o with unused csum_add() : That's net/ipv6/exthdrs.o
It matches the one found in vmlinux.

Are you interested in -fdump-tree-einline-all for that one as well ?

Christophe


Re: [PATCH] powerpc: Force inlining of csum_add()

2021-05-12 Thread Segher Boessenkool
On Wed, May 12, 2021 at 02:56:56PM +0200, Christophe Leroy wrote:
> On 11/05/2021 at 12:51, Segher Boessenkool wrote:
> >Something seems to have decided this asm is more expensive than it is.
> >That isn't always avoidable -- the compiler cannot look inside asms --
> >but it seems it could be improved here.
> >
> >Do you have (or can make) a self-contained testcase?
> 
> I have not tried, and I fear it might be difficult, because on a kernel 
> build with dozens of calls to csum_add(), only ip6_tunnel.o exhibits such 
> an issue.

Yeah.  Sometimes you can force some of the decisions, but that usually
requires knowing too many GCC internals :-/

> >>And there is even one completely unused instance of csum_add().
> >
> >That is strange, that should never happen.
> 
> It seems that several .o include unused versions of csum_add. After the 
> final link, one remains (in addition to the used one) in vmlinux.

But it is a static function, so it should not end up in any object file
where it isn't used.

> >>In the non-inlined version, the first sum with 0 was performed.
> >>Here it is skipped.
> >
> >That is because of how __builtin_constant_p works, most likely.  As we
> >discussed elsewhere it is evaluated before all forms of loop unrolling.
> 
> But we are not talking about loop unrolling here, are we ?

Oh, right you are, but that doesn't change much.  The
__builtin_constant_p(len) is evaluated long before the compiler sees len
is a constant here.

> It seems that the reason here is that __builtin_constant_p() is evaluated 
> long after GCC decided to not inline that call to csum_add().

Yes, it seems we do not currently do even trivial inlining except very
early in the compiler.

Thanks,


Segher
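
To make the point about __builtin_constant_p() and early inlining concrete, here is a minimal sketch in portable C. It is not the kernel's csum_add(): the type wsum_t and the name csum_add_sketch() are hypothetical, and the carry fold is written in plain C instead of the addc/addze asm shown in the thread. It assumes GCC or Clang for the builtin and the attribute.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's __wsum; not the kernel type. */
typedef uint32_t wsum_t;

/*
 * 32-bit ones' complement add: add, then fold the carry back in.
 * This computes the same value as an add-with-carry followed by an
 * add-carry-to-result, written here without any inline asm.
 */
static inline __attribute__((__always_inline__))
wsum_t csum_add_sketch(wsum_t csum, wsum_t addend)
{
	/*
	 * These shortcuts can only fire if the compiler already knows,
	 * at the point where it evaluates the builtin, that an argument
	 * is the constant 0. That requires the call to be inlined first.
	 */
	if (__builtin_constant_p(csum) && csum == 0)
		return addend;
	if (__builtin_constant_p(addend) && addend == 0)
		return csum;

	uint64_t res = (uint64_t)csum + addend;

	/* Bit 32 of res is the carry out of the 32-bit add. */
	return (wsum_t)(res + (res >> 32));
}
```

When such a helper stays out of line, its arguments are ordinary parameters and neither shortcut applies, so a caller that starts from a sum of 0 still pays for the add. Forcing the inlining lets the compiler drop that first add, which matches the behaviour described above.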


Re: [PATCH] powerpc: Force inlining of csum_add()

2021-05-12 Thread Christophe Leroy

Hi,

On 11/05/2021 at 12:51, Segher Boessenkool wrote:
> Hi!
>
> On Tue, May 11, 2021 at 06:08:06AM +0000, Christophe Leroy wrote:
>> Commit 328e7e487a46 ("powerpc: force inlining of csum_partial() to
>> avoid multiple csum_partial() with GCC10") inlined csum_partial().
>>
>> Now that csum_partial() is inlined, GCC outlines csum_add() when
>> called by csum_partial().
>
>> c064fb28 <csum_add>:
>> c064fb28:   7c 63 20 14 addc    r3,r3,r4
>> c064fb2c:   7c 63 01 94 addze   r3,r3
>> c064fb30:   4e 80 00 20 blr
>
> Could you build this with -fdump-tree-einline-all and send me the
> results?  Or open a GCC PR yourself :-)

Ok, I'll forward it to you in a minute.

> Something seems to have decided this asm is more expensive than it is.
> That isn't always avoidable -- the compiler cannot look inside asms --
> but it seems it could be improved here.
>
> Do you have (or can make) a self-contained testcase?

I have not tried, and I fear it might be difficult, because on a kernel
build with dozens of calls to csum_add(), only ip6_tunnel.o exhibits such
an issue.

>> The sum with 0 is useless, should have been skipped.
>
> That isn't something the compiler can do anything about (not sure if you
> were suggesting that); it has to be done in the user code (and it tries
> to already, see below).

I was not suggesting that, only that when properly inlined the sum with 0
is skipped (because we put the necessary stuff in csum_add() of course).

>> And there is even one completely unused instance of csum_add().
>
> That is strange, that should never happen.

It seems that several .o include unused versions of csum_add. After the
final link, one remains (in addition to the used one) in vmlinux.

>> ./arch/powerpc/include/asm/checksum.h: In function '__ip6_tnl_rcv':
>> ./arch/powerpc/include/asm/checksum.h:94:22: warning: inlining failed in call to 'csum_add': call is unlikely and code size would grow [-Winline]
>>    94 | static inline __wsum csum_add(__wsum csum, __wsum addend)
>>       |                      ^~~~
>> ./arch/powerpc/include/asm/checksum.h:172:31: note: called from here
>>   172 |                 sum = csum_add(sum, (__force __wsum)*(const u32 *)buff);
>>       |                       ^
>
> At least we say what happened.  Progress!  :-)

Lol. I have been seeing this warning for a long time, so I guess it is
nothing new.

>> In the non-inlined version, the first sum with 0 was performed.
>> Here it is skipped.
>
> That is because of how __builtin_constant_p works, most likely.  As we
> discussed elsewhere it is evaluated before all forms of loop unrolling.

But we are not talking about loop unrolling here, are we ?

It seems that the reason here is that __builtin_constant_p() is evaluated
long after GCC decided to not inline that call to csum_add().


Christophe


Re: [PATCH] powerpc: Force inlining of csum_add()

2021-05-11 Thread Segher Boessenkool
Hi!

On Tue, May 11, 2021 at 06:08:06AM +0000, Christophe Leroy wrote:
> Commit 328e7e487a46 ("powerpc: force inlining of csum_partial() to
> avoid multiple csum_partial() with GCC10") inlined csum_partial().
> 
> Now that csum_partial() is inlined, GCC outlines csum_add() when
> called by csum_partial().

> c064fb28 <csum_add>:
> c064fb28: 7c 63 20 14 addc    r3,r3,r4
> c064fb2c: 7c 63 01 94 addze   r3,r3
> c064fb30: 4e 80 00 20 blr

Could you build this with -fdump-tree-einline-all and send me the
results?  Or open a GCC PR yourself :-)

Something seems to have decided this asm is more expensive than it is.
That isn't always avoidable -- the compiler cannot look inside asms --
but it seems it could be improved here.

Do you have (or can make) a self-contained testcase?

> The sum with 0 is useless, should have been skipped.

That isn't something the compiler can do anything about (not sure if you
were suggesting that); it has to be done in the user code (and it tries
to already, see below).

> And there is even one completely unused instance of csum_add().

That is strange, that should never happen.

> ./arch/powerpc/include/asm/checksum.h: In function '__ip6_tnl_rcv':
> ./arch/powerpc/include/asm/checksum.h:94:22: warning: inlining failed in call to 'csum_add': call is unlikely and code size would grow [-Winline]
>    94 | static inline __wsum csum_add(__wsum csum, __wsum addend)
>       |                      ^~~~
> ./arch/powerpc/include/asm/checksum.h:172:31: note: called from here
>   172 |                 sum = csum_add(sum, (__force __wsum)*(const u32 *)buff);
>       |                       ^

At least we say what happened.  Progress!  :-)
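
The -Winline diagnostic quoted above is a reminder that plain "inline" is only a hint to GCC, whereas the always_inline attribute is a requirement. A rough sketch of the difference, assuming GCC or Clang; FORCE_INLINE and all function names here are hypothetical and only loosely modelled on the kernel's __always_inline, not taken from checksum.h:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local macro; combines the keyword with the attribute. */
#define FORCE_INLINE inline __attribute__((__always_inline__))

/* Hint only: the compiler may still emit an out-of-line copy and call it. */
static inline uint32_t add_fold_hint(uint32_t sum, uint32_t addend)
{
	uint64_t res = (uint64_t)sum + addend;

	return (uint32_t)(res + (res >> 32));
}

/* Forced: the compiler inlines every direct call or reports an error. */
static FORCE_INLINE uint32_t add_fold_forced(uint32_t sum, uint32_t addend)
{
	uint64_t res = (uint64_t)sum + addend;

	return (uint32_t)(res + (res >> 32));
}

/* Sum nwords 32-bit words using the hint-only helper. */
static uint32_t sum_words_hint(const uint32_t *buf, size_t nwords)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum = add_fold_hint(sum, buf[i]);
	return sum;
}

/* Same loop using the forced-inline helper. */
static uint32_t sum_words_forced(const uint32_t *buf, size_t nwords)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum = add_fold_forced(sum, buf[i]);
	return sum;
}
```

Both variants return the same value; the only difference is whether the compiler's cost heuristics get a say in the generated code. Building with -Winline reports the hint-only calls that were left out of line.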

> In the non-inlined version, the first sum with 0 was performed.
> Here it is skipped.

That is because of how __builtin_constant_p works, most likely.  As we
discussed elsewhere it is evaluated before all forms of loop unrolling.

The patch looks perfect of course :-)

Reviewed-by: Segher Boessenkool 


Segher


> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -91,7 +91,7 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
>  }
>  
>  #define HAVE_ARCH_CSUM_ADD
> -static inline __wsum csum_add(__wsum csum, __wsum addend)
> +static __always_inline __wsum csum_add(__wsum csum, __wsum addend)
>  {
>  #ifdef __powerpc64__
>   u64 res = (__force u64)csum;