Re: [RFC PATCH v4 00/11] powerpc: switch VDSO to C implementation.

2020-01-20 Thread Segher Boessenkool
On Mon, Jan 20, 2020 at 06:08:23PM +0100, Christophe Leroy wrote:
> Not easy I think.
> 
> First we have the unavoidable ASM entry function that can't be dropped 
> because of the CR[SO] bit the set on error or clear on no error and that 
> can't be done in C.

Yup.

> In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts 
> are generic and read from the VDSO data.

Does that cost more than just a few cycles?

> And there is still some funny code generated by GCC (8.1), like:
> 
>  620: 7d 29 3c 30 srw r9,r9,r7
>  624: 21 87 00 20 subfic  r12,r7,32
>  628: 7d 07 3c 31 srw.r7,r8,r7
>  62c: 7d 08 60 30 slw r8,r8,r12
>  630: 7d 0b 4b 78 or  r11,r8,r9

(This can be done cheaper for fixed shifts, you can use rlwimi then).

>  634: 39 40 00 00 li  r10,0
>  638: 40 82 00 84 bne 6bc <__c_kernel_clock_gettime+0x114>
>  63c: 81 23 00 24 lwz r9,36(r3)
>  640: 81 05 00 00 lwz r8,0(r5)
> ...
>  6bc: 7d 69 5b 78 mr  r9,r11
>  6c0: 7c ea 3b 78 mr  r10,r7
>  6c4: 7d 2b 4b 78 mr  r11,r9
>  6c8: 4b ff ff 74 b   63c <__c_kernel_clock_gettime+0x94>
> 
> This branch to 6bc is totally useless:
> - copying r11 into r9 is pointless as r9 is overwritten in 63c
> - copying back r9 into r11 is pointless as r11 has not been modified 
> inbetween.

Yeah, huh, how did that happen.

> - loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is 
> pointless as well, could have directly put the result of srw. in r10.

This may be harder to make the compiler do.

But the r9/r11 thing suggests you are preventing optimisation somewhere,
maybe with some asm?  Do you have some small testcase I can compile?


Segher


Re: [RFC PATCH v4 00/11] powerpc: switch VDSO to C implementation.

2020-01-20 Thread Christophe Leroy




Le 20/01/2020 à 16:19, Segher Boessenkool a écrit :

On Mon, Jan 20, 2020 at 02:56:00PM +, Christophe Leroy wrote:

Nice!  Much better.

It should be tested on more representative hardware, too, but this looks
promising alright :-)


mpc832x (e300c2 core) at 333 MHz:

Before:

gettimeofday:vdso: 235 nsec/call
clock-gettime-realtime:vdso: 244 nsec/call

With the series:

gettimeofday:vdso: 271 nsec/call
clock-gettime-realtime:vdso: 281 nsec/call


Those are important, and degrade ~15%.  That is acceptable IMO, but do
you see a way to optimise this (later)?


Not easy I think.

First we have the unavoidable ASM entry function that can't be dropped 
because of the CR[SO] bit the set on error or clear on no error and that 
can't be done in C.


In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts 
are generic and read from the VDSO data.


And there is still some funny code generated by GCC (8.1), like:

 620:   7d 29 3c 30 srw r9,r9,r7
 624:   21 87 00 20 subfic  r12,r7,32
 628:   7d 07 3c 31 srw.r7,r8,r7
 62c:   7d 08 60 30 slw r8,r8,r12
 630:   7d 0b 4b 78 or  r11,r8,r9
 634:   39 40 00 00 li  r10,0
 638:   40 82 00 84 bne 6bc <__c_kernel_clock_gettime+0x114>
 63c:   81 23 00 24 lwz r9,36(r3)
 640:   81 05 00 00 lwz r8,0(r5)
...
 6bc:   7d 69 5b 78 mr  r9,r11
 6c0:   7c ea 3b 78 mr  r10,r7
 6c4:   7d 2b 4b 78 mr  r11,r9
 6c8:   4b ff ff 74 b   63c <__c_kernel_clock_gettime+0x94>

This branch to 6bc is totally useless:
- copying r11 into r9 is pointless as r9 is overwritten in 63c
- copying back r9 into r11 is pointless as r11 has not been modified 
inbetween.
- loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is 
pointless as well, could have directly put the result of srw. in r10.


Christophe


Re: [RFC PATCH v4 00/11] powerpc: switch VDSO to C implementation.

2020-01-20 Thread Segher Boessenkool
On Mon, Jan 20, 2020 at 02:56:00PM +, Christophe Leroy wrote:
> >Nice!  Much better.
> >
> >It should be tested on more representative hardware, too, but this looks
> >promising alright :-)
> 
> mpc832x (e300c2 core) at 333 MHz:
> 
> Before:
> 
> gettimeofday:vdso: 235 nsec/call
> clock-gettime-realtime:vdso: 244 nsec/call
> 
> With the series:
> 
> gettimeofday:vdso: 271 nsec/call
> clock-gettime-realtime:vdso: 281 nsec/call

Those are important, and degrade ~15%.  That is acceptable IMO, but do
you see a way to optimise this (later)?

Anyway, excellent results, thanks for your persistence!


Segher


Re: [RFC PATCH v4 00/11] powerpc: switch VDSO to C implementation.

2020-01-20 Thread Christophe Leroy

Hi

On 01/17/2020 08:58 AM, Segher Boessenkool wrote:

Hi!

On Thu, Jan 16, 2020 at 05:58:24PM +, Christophe Leroy wrote:

On a powerpc8xx, with current powerpc/32 ASM VDSO:

gettimeofday:vdso: 907 nsec/call
clock-getres-realtime:vdso: 484 nsec/call
clock-gettime-realtime:vdso: 899 nsec/call

The first patch adds VDSO generic C support without any changes to common code.
Performance is as follows:

gettimeofday:vdso: 1211 nsec/call
clock-getres-realtime:vdso: 722 nsec/call
clock-gettime-realtime:vdso: 1216 nsec/call

Then a few changes in the common code have allowed performance improvement. At
the end of the series we have:

gettimeofday:vdso: 974 nsec/call
clock-getres-realtime:vdso: 545 nsec/call
clock-gettime-realtime:vdso: 941 nsec/call

The final result is rather close to pure ASM VDSO:
* 7% more on gettimeofday (9 cycles)
* 5% more on clock-gettime-realtime (6 cycles)
* 12% more on clock-getres-realtime (8 cycles)


Nice!  Much better.

It should be tested on more representative hardware, too, but this looks
promising alright :-)



mpc832x (e300c2 core) at 333 MHz:

Before:

gettimeofday:vdso: 235 nsec/call
clock-getres-realtime-coarse:vdso: 1668 nsec/call
clock-gettime-realtime-coarse:vdso: 1338 nsec/call
clock-getres-realtime:vdso: 135 nsec/call
clock-gettime-realtime:vdso: 244 nsec/call
clock-getres-boottime:vdso: 1232 nsec/call
clock-gettime-boottime:vdso: 1935 nsec/call
clock-getres-tai:vdso: 1257 nsec/call
clock-gettime-tai:vdso: 1898 nsec/call
clock-getres-monotonic-raw:vdso: 1229 nsec/call
clock-gettime-monotonic-raw:vdso: 1541 nsec/call
clock-getres-monotonic-coarse:vdso: 1699 nsec/call
clock-gettime-monotonic-coarse:vdso: 1477 nsec/call
clock-getres-monotonic:vdso: 135 nsec/call
clock-gettime-monotonic:vdso: 283 nsec/call

With the series:

gettimeofday:vdso: 271 nsec/call
clock-getres-realtime-coarse:vdso: 159 nsec/call
clock-gettime-realtime-coarse:vdso: 184 nsec/call
clock-getres-realtime:vdso: 163 nsec/call
clock-gettime-realtime:vdso: 281 nsec/call
clock-getres-boottime:vdso: 169 nsec/call
clock-gettime-boottime:vdso: 274 nsec/call
clock-getres-tai:vdso: 163 nsec/call
clock-gettime-tai:vdso: 277 nsec/call
clock-getres-monotonic-raw:vdso: 166 nsec/call
clock-gettime-monotonic-raw:vdso: 302 nsec/call
clock-getres-monotonic-coarse:vdso: 159 nsec/call
clock-gettime-monotonic-coarse:vdso: 184 nsec/call
clock-getres-monotonic:vdso: 166 nsec/call
clock-gettime-monotonic:vdso: 274 nsec/call

Christophe


Re: [RFC PATCH v4 00/11] powerpc: switch VDSO to C implementation.

2020-01-17 Thread Christophe Leroy




Le 17/01/2020 à 09:58, Segher Boessenkool a écrit :

Hi!

On Thu, Jan 16, 2020 at 05:58:24PM +, Christophe Leroy wrote:

On a powerpc8xx, with current powerpc/32 ASM VDSO:

gettimeofday:vdso: 907 nsec/call
clock-getres-realtime:vdso: 484 nsec/call
clock-gettime-realtime:vdso: 899 nsec/call

The first patch adds VDSO generic C support without any changes to common code.
Performance is as follows:

gettimeofday:vdso: 1211 nsec/call
clock-getres-realtime:vdso: 722 nsec/call
clock-gettime-realtime:vdso: 1216 nsec/call

Then a few changes in the common code have allowed performance improvement. At
the end of the series we have:

gettimeofday:vdso: 974 nsec/call
clock-getres-realtime:vdso: 545 nsec/call
clock-gettime-realtime:vdso: 941 nsec/call

The final result is rather close to pure ASM VDSO:
* 7% more on gettimeofday (9 cycles)
* 5% more on clock-gettime-realtime (6 cycles)
* 12% more on clock-getres-realtime (8 cycles)


Nice!  Much better.

It should be tested on more representative hardware, too, but this looks
promising alright :-)



Yes.

Now the challenge is to get VDSO32 buildable on PPC64. The big issue is 
that in most powerpc/include/asm/*.h , CONFIG_PPC64 is used to know if 
the build is a 64 bits build or a 32 bits build, so VDSO32 build fails.


I don't know how this could be easily fixed.

Christophe


Re: [RFC PATCH v4 00/11] powerpc: switch VDSO to C implementation.

2020-01-17 Thread Segher Boessenkool
Hi!

On Thu, Jan 16, 2020 at 05:58:24PM +, Christophe Leroy wrote:
> On a powerpc8xx, with current powerpc/32 ASM VDSO:
> 
> gettimeofday:vdso: 907 nsec/call
> clock-getres-realtime:vdso: 484 nsec/call
> clock-gettime-realtime:vdso: 899 nsec/call
> 
> The first patch adds VDSO generic C support without any changes to common 
> code.
> Performance is as follows:
> 
> gettimeofday:vdso: 1211 nsec/call
> clock-getres-realtime:vdso: 722 nsec/call
> clock-gettime-realtime:vdso: 1216 nsec/call
> 
> Then a few changes in the common code have allowed performance improvement. At
> the end of the series we have:
> 
> gettimeofday:vdso: 974 nsec/call
> clock-getres-realtime:vdso: 545 nsec/call
> clock-gettime-realtime:vdso: 941 nsec/call
> 
> The final result is rather close to pure ASM VDSO:
> * 7% more on gettimeofday (9 cycles)
> * 5% more on clock-gettime-realtime (6 cycles)
> * 12% more on clock-getres-realtime (8 cycles)

Nice!  Much better.

It should be tested on more representative hardware, too, but this looks
promising alright :-)


Segher