Re: Possible new T3-T5 mul_1

2013-04-02 Thread Torbjorn Granlund
Torbjorn Granlund t...@gmplib.org writes: This version probably overschedules loads, I'll try another variant some day which fixes that. Two variants. The 1st is just the previous 3 c/l one, with a bug fix, and renamed. The 2nd is a version which I hope still runs at 3 c/l, but with a

ARM neon pseudo op

2013-04-02 Thread Niels Möller
On my pandaboard (with a cortex-a9), I run Debian GNU/Linux, and the assembler calls itself $ as --version GNU assembler (GNU Binutils for Debian) 2.22 It refuses to assemble the new shiny mpn/arm/neon/*.asm files. With somewhat confusing error messages like tmp-lshift.s: Assembler

Re: ARM neon pseudo op

2013-04-02 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: I need the first few dozen lines from the configure output to have a guess about what might go wrong. You might actually compare the output to that of panda.gmplib.org yourself.

Re: ARM public key benchmark

2013-04-02 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd like to try. That wasn't a clear win... I use addmul_1 and submul_1 as a fallback (and I always do in-place operation, so that works). Now, cnd_sub_n beats submul_1 (except for n
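
For readers unfamiliar with the two approaches compared above, here is a minimal C sketch of what a conditional in-place subtract computes, and of how a submul_1-style routine with a multiplier of 0 or 1 can stand in for it. This is not GMP's code: `limb_t`, `cnd_sub_n_sketch` and `submul_1_sketch` are hypothetical names, limbs are assumed to be 64 bits, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* hypothetical stand-in for a 64-bit limb type */

/* If cnd is nonzero, compute rp[] -= sp[] over n limbs; otherwise store
   rp[] back unchanged. Returns the borrow out of the top limb. The
   condition is applied as a mask rather than as a branch. */
static limb_t cnd_sub_n_sketch(limb_t cnd, limb_t *rp, const limb_t *sp, size_t n)
{
    limb_t mask = -(limb_t)(cnd != 0); /* all ones or all zeros */
    limb_t borrow = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        limb_t s = sp[i] & mask;   /* subtrahend, or 0 when cnd is 0 */
        limb_t d = rp[i] - s;
        limb_t b1 = rp[i] < s;     /* borrow from rp[i] - s */
        limb_t r = d - borrow;
        limb_t b2 = d < borrow;    /* borrow from applying the incoming borrow */
        rp[i] = r;
        borrow = b1 | b2;          /* at most one of the two can be set */
    }
    return borrow;
}

/* Fallback: rp[] -= sp[] * v, returning the high limb that could not be
   subtracted. With v restricted to 0 or 1 this gives the same result as
   the conditional subtract above, in place. */
static limb_t submul_1_sketch(limb_t *rp, const limb_t *sp, size_t n, limb_t v)
{
    limb_t carry = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)sp[i] * v + carry;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        limb_t r = rp[i] - lo;
        hi += rp[i] < lo;          /* fold the borrow into the next carry */
        rp[i] = r;
        carry = hi;
    }
    return carry;
}
```

The fallback performs a full multiply per limb even though the multiplier is only 0 or 1, which may be one reason a dedicated conditional subtract can be faster.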

Re: Possible new T3-T5 mul_1

2013-04-02 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Tue, 02 Apr 2013 09:38:42 +0200 Torbjorn Granlund t...@gmplib.org writes: This version probably overschedules loads, I'll try another variant some day which fixes that. Two variants. The 1st is just the previous 3 c/l one, with a bug

Re: ARM neon pseudo op

2013-04-02 Thread Michael Mohr
If your eventual target is a fat binary, using -mfpu=neon in CFLAGS is a bad idea (at least for Android). It would be far better to approach the problem as Niels did, using .fpu neon as required. That way, non-neon code can be selected at runtime if necessary for the critical code paths.
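
The runtime selection described above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not GMP's fat-binary mechanism: every function name here is hypothetical, the "NEON" variant is a plain C placeholder for a routine that would be assembled with `.fpu neon`, and the detection shown uses `getauxval(AT_HWCAP)` with `HWCAP_NEON`, which applies to 32-bit ARM Linux only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__linux__) && defined(__arm__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_NEON */
#endif

typedef void (*lshift1_fn)(uint32_t *rp, const uint32_t *up, size_t n);

/* Portable version: shift an n-limb number left by one bit.
   Safe for in-place use (rp == up). The bit shifted out is discarded. */
static void lshift1_generic(uint32_t *rp, const uint32_t *up, size_t n)
{
    uint32_t carry = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t u = up[i];
        rp[i] = (u << 1) | carry;
        carry = u >> 31;
    }
}

/* Placeholder for a routine written in NEON assembly. Only that file
   would carry the ".fpu neon" directive; the rest of the build stays
   free of NEON instructions. */
static void lshift1_neon_placeholder(uint32_t *rp, const uint32_t *up, size_t n)
{
    lshift1_generic(rp, up, n);
}

/* Report whether the running CPU advertises NEON (32-bit ARM Linux only;
   everywhere else this sketch conservatively answers no). */
static int detect_neon(void)
{
#if defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;
#endif
}

/* Choose an implementation once, e.g. at library initialisation. */
static lshift1_fn pick_lshift1(int have_neon)
{
    return have_neon ? lshift1_neon_placeholder : lshift1_generic;
}
```

A caller would do something like `lshift1_fn f = pick_lshift1(detect_neon());` once and call through `f` afterwards. Building everything with -mfpu=neon instead would let the compiler emit NEON instructions anywhere, so there would be no NEON-free path left to select.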

Re: ARM public key benchmark

2013-04-02 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd like to try. That wasn't a clear win... I use addmul_1 and submul_1 as a fallback (and I always do in-place operation,

Re: Possible new T3-T5 mul_1

2013-04-02 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can. overhead 6.00 cycles, precision 1000 units of 3.51e-10 secs, CPU freq 2847.41 MHz Darn. Is the load latency 3 cycles? The old code had a load-use schedule of 8 cycles,

Re: Some secondary asm T3,T4,T5 functions

2013-04-02 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: I started playing with these, and one problem is that the addxc/addxccc instructions do not accept an immediate field. They only accept rs1, rs2, rd arguments. Please update your compat macros to catch this. Oops. Missed that. With this

Re: Possible new T3-T5 mul_1

2013-04-02 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Tue, 02 Apr 2013 21:09:51 +0200 So what's going on in the a and b code variants? I assume the total OoO capacity was just not enough for a ld-mul-add 17 cycle chain scheduled at just 3+4 cycles. with fully scheduled loads, the OoO requirement was

Re: Possible new T3-T5 mul_1

2013-04-02 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Wed, 03 Apr 2013 01:05:05 +0200 I rescheduled the addmul_2 and mul_2. If I have not misunderstood this pipeline, we should finally reach 3.5 c/l and 3 c/l, respectively. Attached are the output of: tune/speed -p1000 -s1-1000 -f1.1 -C

Re: Possible new T3-T5 mul_1

2013-04-02 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: Attached are the output of: tune/speed -p1000 -s1-1000 -f1.1 -C mpn_mul_2.3 3.25 c/l, not 3 c/l as I had hoped. tune/speed -p1000 -s1-1000 -f1.1 -C mpn_addmul_2.3 3.75 c/l, not 3.5 c/l as I had hoped... I will accept this, since

Re: Possible new T3-T5 mul_1

2013-04-02 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Wed, 03 Apr 2013 01:26:48 +0200 I will accept this, since micro-optimisation takes more attempts than both me and you will care to try over email. :-) I can toy with it in my spare time.

mul_1 2-way on T3

2013-04-02 Thread David Miller
So I've been toying with this loop:

1:	mulx	u0, v0, %l2
	sub	n, -(2 * 8), n
	umulxhi	u0, v0, %l5
	ldx	[n + u0_off], u0
	mulx	u1, v0, %l3
	addxccc	%l2, %o5, r0
	umulxhi	u1, v0, %o5
	ldx	[n + u1_off], u1
	addxccc
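
As a reading aid for the loop excerpt above, here is a C sketch of the arithmetic a 2-way unrolled mul_1 performs: `mulx` yields the low 64 bits of a product, `umulxhi` the high 64 bits, and `addxccc` adds with carry in and carry out. This is a reference model only, not the posted assembly and not GMP's code; `limb_t` and `mul_1_2way_sketch` are hypothetical names, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* hypothetical stand-in for a 64-bit limb type */

/* rp[] = up[] * v0 over n limbs; returns the most significant limb.
   Each result limb is low(u[i]*v0) + high(u[i-1]*v0) + carry, which is
   the chain the addxccc instructions carry through the loop. */
static limb_t mul_1_2way_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v0)
{
    limb_t hi_prev = 0; /* high half of the previous product */
    limb_t carry = 0;   /* carry flag chained between additions */
    size_t i = 0;
    assert(n > 0);

    /* Two limbs per iteration, mirroring the u0/u1 pairing in the loop. */
    for (; i + 1 < n; i += 2) {
        unsigned __int128 p0 = (unsigned __int128)up[i] * v0;
        unsigned __int128 p1 = (unsigned __int128)up[i + 1] * v0;

        unsigned __int128 s0 = (unsigned __int128)(limb_t)p0 + hi_prev + carry;
        rp[i] = (limb_t)s0;
        carry = (limb_t)(s0 >> 64);

        unsigned __int128 s1 =
            (unsigned __int128)(limb_t)p1 + (limb_t)(p0 >> 64) + carry;
        rp[i + 1] = (limb_t)s1;
        carry = (limb_t)(s1 >> 64);

        hi_prev = (limb_t)(p1 >> 64);
    }

    /* Odd limb count: handle the last limb on its own. */
    if (i < n) {
        unsigned __int128 p = (unsigned __int128)up[i] * v0;
        unsigned __int128 s = (unsigned __int128)(limb_t)p + hi_prev + carry;
        rp[i] = (limb_t)s;
        carry = (limb_t)(s >> 64);
        hi_prev = (limb_t)(p >> 64);
    }

    /* Cannot overflow: an n-limb by 1-limb product fits in n + 1 limbs. */
    return hi_prev + carry;
}
```

With two limbs handled per iteration, a loop that completes in 6 cycles corresponds to 3 cycles per limb.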

Re: mul_1 2-way on T3

2013-04-02 Thread David Miller
From: David Miller da...@davemloft.net Date: Tue, 02 Apr 2013 20:04:19 -0400 (EDT) I'll keep playing with it and if I can get it to run consistently in 6 cycles per loop we should seriously consider taking this approach. Actually, I think I've figured some of this out. My variant can never be

Re: mul_1 2-way on T3

2013-04-02 Thread David Miller
From: David Miller da...@davemloft.net Date: Tue, 02 Apr 2013 20:24:51 -0400 (EDT) Only loops like mul_1a.asm (and potentially mul_1b.asm) can, because only they have enough cycles in the loop to retire multiplies without positive accumulation into the OoO buffer. Actually, mul1b.asm cannot