Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-12 Thread Gabriel Paubert
On Thu, Aug 11, 2016 at 05:11:19PM -0500, Segher Boessenkool wrote:
> On Thu, Aug 11, 2016 at 11:34:37PM +0200, Gabriel Paubert wrote:
> > On the other hand gcc did at the time a very poor job (quite an
> > understatement) at bswapdi when compiling for 64 bit processors 
> > (see the example).
> > 
> > But what do modern compilers generate for bswapdi these days? Do they
> > still call the library or not?
> 
> Nope.

Great, could then these functions be removed from misc_32.S, or are
compilers that use libcalls still supported for kernel builds?

> 
> > After all, bswapdi on 32 bit processors only takes 6 instructions if the
> > input and output registers don't overlap.
> 
> For this testcase:
> ===
> typedef unsigned long long u64;
> u64 bs(u64 x) { return __builtin_bswap64(x); }
> ===
> 
> we get with -m32:
> ===
> bs:
>   mr 9,3
>   rotlwi 3,4,24
>   rlwimi 3,4,8,8,15
>   rlwimi 3,4,8,24,31
>   rotlwi 4,9,24
>   rlwimi 4,9,8,8,15
>   rlwimi 4,9,8,24,31
>   blr

In this case the compiler is constrained by the fact that the input and
ouput registers are the same. When inlined with other things it can
probably perform better scheduling and interleaving of operations.


> ===
> 
> and with -m64:
> ===
> .L.bs:
>   srdi 10,3,32
>   mr 9,3
>   rotlwi 3,3,24
>   rotlwi 8,10,24
>   rlwimi 3,9,8,8,15
>   rlwimi 8,10,8,8,15
>   rlwimi 3,9,8,24,31
>   rlwimi 8,10,8,24,31
>   sldi 3,3,32
>   or 3,3,8
>   blr
> ===
> 

As demonstrated here where the two halves of the 64 bit quantity
are byte swapped in an interleaved fashion. Not perfect (I think
that with proper ordering the last 2 instructions could be replaced
by a rldimi), but reasonable.

> Neither as tight as possible, but neither horrible either.
> 

Indeed.

Gabriel


Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-11 Thread Segher Boessenkool
On Thu, Aug 11, 2016 at 11:34:37PM +0200, Gabriel Paubert wrote:
> On the other hand gcc did at the time a very poor job (quite an
> understatement) at bswapdi when compiling for 64 bit processors 
> (see the example).
> 
> But what do modern compilers generate for bswapdi these days? Do they
> still call the library or not?

Nope.

> After all, bswapdi on 32 bit processors only takes 6 instructions if the
> input and output registers don't overlap.

For this testcase:
===
typedef unsigned long long u64;
u64 bs(u64 x) { return __builtin_bswap64(x); }
===

we get with -m32:
===
bs:
mr 9,3
rotlwi 3,4,24
rlwimi 3,4,8,8,15
rlwimi 3,4,8,24,31
rotlwi 4,9,24
rlwimi 4,9,8,8,15
rlwimi 4,9,8,24,31
blr
===

and with -m64:
===
.L.bs:
srdi 10,3,32
mr 9,3
rotlwi 3,3,24
rotlwi 8,10,24
rlwimi 3,9,8,8,15
rlwimi 8,10,8,8,15
rlwimi 3,9,8,24,31
rlwimi 8,10,8,24,31
sldi 3,3,32
or 3,3,8
blr
===

Neither as tight as possible, but neither horrible either.


Segher


Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-11 Thread Gabriel Paubert
On Wed, Aug 10, 2016 at 12:18:15PM +0200, Christophe Leroy wrote:
> 
> 
> Le 10/08/2016 à 10:56, Gabriel Paubert a écrit :
> >On Fri, Aug 05, 2016 at 01:28:02PM +0200, Christophe Leroy wrote:
> >>Signed-off-by: Christophe Leroy 
> >>---
> >> arch/powerpc/kernel/misc_32.S | 3 +--
> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >>diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
> >>index e025230..e18055c 100644
> >>--- a/arch/powerpc/kernel/misc_32.S
> >>+++ b/arch/powerpc/kernel/misc_32.S
> >>@@ -578,9 +578,8 @@ _GLOBAL(__bswapdi2)
> >>rlwimi  r9,r4,24,0,7
> >>rlwimi  r10,r3,24,0,7
> >>rlwimi  r9,r4,24,16,23
> >>-   rlwimi  r10,r3,24,16,23
> >>+   rlwimi  r4,r3,24,16,23
> >>mr  r3,r9
> >>-   mr  r4,r10
> >>blr
> >>
> >
> >Hmmm, are you sure that it works? rlwimi is a bit special since the
> >first operand is both an input and an output of the instruction.
> >
> >
> 
> Oops, you are right ...

I just found this: 

http://hardwarebug.org/2010/01/14/beware-the-builtins/

the bswapdi2 suggested sequence only needs a single mr instruction, the 
other one is absorbed in a rotlwi.

The scheduling looks poor, but it seems impossible to interleave the
operations between the two halves without adding another instructions,
and the routine is 8 instructions long, which happens to be exactly a
cache line on most 32 bit processors.

On the other hand gcc did at the time a very poor job (quite an
understatement) at bswapdi when compiling for 64 bit processors 
(see the example).

But what do modern compilers generate for bswapdi these days? Do they
still call the library or not?

After all, bswapdi on 32 bit processors only takes 6 instructions if the
input and output registers don't overlap.

Gabriel



Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-10 Thread Christophe Leroy



Le 10/08/2016 à 10:56, Gabriel Paubert a écrit :

On Fri, Aug 05, 2016 at 01:28:02PM +0200, Christophe Leroy wrote:

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/misc_32.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index e025230..e18055c 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -578,9 +578,8 @@ _GLOBAL(__bswapdi2)
rlwimi  r9,r4,24,0,7
rlwimi  r10,r3,24,0,7
rlwimi  r9,r4,24,16,23
-   rlwimi  r10,r3,24,16,23
+   rlwimi  r4,r3,24,16,23
mr  r3,r9
-   mr  r4,r10
blr



Hmmm, are you sure that it works? rlwimi is a bit special since the
first operand is both an input and an output of the instruction.




Oops, you are right ...


Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-10 Thread Gabriel Paubert
On Fri, Aug 05, 2016 at 01:28:02PM +0200, Christophe Leroy wrote:
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/misc_32.S | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
> index e025230..e18055c 100644
> --- a/arch/powerpc/kernel/misc_32.S
> +++ b/arch/powerpc/kernel/misc_32.S
> @@ -578,9 +578,8 @@ _GLOBAL(__bswapdi2)
>   rlwimi  r9,r4,24,0,7
>   rlwimi  r10,r3,24,0,7
>   rlwimi  r9,r4,24,16,23
> - rlwimi  r10,r3,24,16,23
> + rlwimi  r4,r3,24,16,23
>   mr  r3,r9
> - mr  r4,r10
>   blr
>  

Hmmm, are you sure that it works? rlwimi is a bit special since the
first operand is both an input and an output of the instruction.


In other words, did you test the code?

Gabriel


[PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-05 Thread Christophe Leroy
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/misc_32.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index e025230..e18055c 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -578,9 +578,8 @@ _GLOBAL(__bswapdi2)
rlwimi  r9,r4,24,0,7
rlwimi  r10,r3,24,0,7
rlwimi  r9,r4,24,16,23
-   rlwimi  r10,r3,24,16,23
+   rlwimi  r4,r3,24,16,23
mr  r3,r9
-   mr  r4,r10
blr
 
 #ifdef CONFIG_SMP
-- 
2.1.0