Re: ARM public key benchmark

2013-04-04 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: I'll also try using fewer updates of the up pointer, that seems to save half a cycle, and could perhaps speed up addmul_1 too. No speedup for addmul_1, unfortunately, but a saving for submul_1. Here are new versions of both files (for mpn/arm/v6). I

Re: Some secondary asm T3,T4,T5 functions

2013-04-04 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Attached is a dive_1.asm that works for me on real hardware as well as T4 timings from: tune/speed -p1000 -s1-1000 -f1.1 -C mpn_divexact_1.3 This timing is most curious. The cost of inversion computation should be clearly visible for tiny
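For context on the "inversion computation" mentioned in this snippet: exact division by an odd limb is usually done by multiplying with the inverse of the divisor modulo the limb base, and that inverse is a fixed per-call cost that dominates for very small operands. The following is only a minimal sketch of such an inverse via Newton iteration, assuming 64-bit limbs. The name `limb_inverse` is hypothetical, and this is not the code from the attached dive_1.asm.

```c
#include <stdint.h>

/* Hypothetical helper (not GMP's code): inverse of an odd limb d
   modulo 2^64, computed by Newton iteration. */
uint64_t limb_inverse(uint64_t d)
{
    /* For odd d, d*d == 1 (mod 8), so d is its own inverse to 3 bits. */
    uint64_t inv = d;

    /* Each step doubles the number of correct low bits:
       3 -> 6 -> 12 -> 24 -> 48 -> 96, which covers all 64 bits. */
    for (int i = 0; i < 5; i++)
        inv *= 2 - d * inv;

    return inv;
}
```

With the inverse in hand, a single-limb exact quotient is one multiplication: if d divides a, then a * limb_inverse(d) equals a / d modulo 2^64.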

Re: New T3/T4 code batch

2013-04-04 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: First mul_1, renamed again, now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, then there are 2 simple things to try, indicated in a comment. This gets the expected 3 cycles per limb

Re: ARM public key benchmark

2013-04-04 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: I sometimes get better A9 performance with *discrete* pointer updates, not one-out-of-four autoincrement pointer updates like used here. I think the code you started with had that one-out-of-four trick for str, already? Right, it uses a single

Re: ARM public key benchmark

2013-04-04 Thread Richard Henderson
On 2013-04-04 06:51, Niels Möller wrote: And it's no use to even think of porting the loop mixer to arm without access to cycle-accurate timing. Looking around the web it seems that what most folks do is write a minimal kernel module that toggles the bit that allows userspace access to the

Re: ARM public key benchmark

2013-04-04 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I had on the other hand not realised David's ones complement + pre-invert carry trick. Not sure I understand what you are referring to here. I haven't been following the sparc developments very closely (and I don't remember much of sparc

Re: ARM public key benchmark

2013-04-04 Thread Torbjorn Granlund
Richard Henderson r...@twiddle.net writes: On 2013-04-04 06:51, Niels Möller wrote: And it's no use to even think of porting the loop mixer to arm without access to cycle-accurate timing. Looking around the web it seems that what most folks do is write a minimal kernel module that

Re: ARM public key benchmark

2013-04-04 Thread Niels Möller
Richard Henderson r...@twiddle.net writes: Looking around the web it seems that what most folks do is write a minimal kernel module that toggles the bit that allows userspace access to the cycle counter MSRs. I've seen some code snippets to do that too. I have a dual core system; one known

Re: ARM public key benchmark

2013-04-04 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: The newer sparc adds 64-bit carrying adds, but they still don't have corresponding subtraction instructions. So David sets carry before entering the loop, and ones complements the subtrahend. Ah. I think I even suggested that trick, for mpn_sub_n.
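The trick described in this snippet rests on the identity a - b = a + ~b + 1: the "+ 1" is supplied by setting the carry before the loop, and every limb of the subtrahend is complemented. A minimal C sketch of the idea follows, assuming 64-bit limbs and emulating add-with-carry with comparisons. The function name `sub_n_via_addc` is hypothetical, and this is not the sparc assembly under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rp[] = ap[] - bp[] over n limbs, using only
   additions with carry. Returns the borrow out (1 if a < b, else 0). */
uint64_t sub_n_via_addc(uint64_t *rp, const uint64_t *ap,
                        const uint64_t *bp, size_t n)
{
    /* Carry is set before entering the loop; it provides the "+ 1"
       of the two's complement negation. */
    uint64_t carry = 1;

    for (size_t i = 0; i < n; i++) {
        uint64_t nb = ~bp[i];          /* ones complement of the subtrahend limb */
        uint64_t s = ap[i] + nb;
        uint64_t c1 = s < nb;          /* carry out of a + ~b */
        uint64_t r = s + carry;
        uint64_t c2 = r < s;           /* carry out of adding the incoming carry */
        rp[i] = r;
        carry = c1 | c2;               /* at most one of the two can be set */
    }

    /* A final carry of 1 means no borrow, so the borrow is the inverted carry. */
    return carry ^ 1;
}
```

For example, with a = {0, 1} and b = {1, 0} (least significant limb first), the result is {UINT64_MAX, 0} with borrow 0.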

Re: ARM public key benchmark

2013-04-04 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I guess it's lowest numbered first (and lowest memory address). But a loop with use r7 ldm up!, {r4,r5,r6,r7} use r4 looks like poor scheduling between load of r4 and use of it, and the ldm can't be moved earlier since