Torbjorn Granlund t...@gmplib.org writes:
This version probably overschedules loads, I'll try another variant some
day which fixes that.
Two variants. The 1st is just the previous 3 c/l one, with a bug fix,
and renamed. The 2nd is a version which I hope still runs at 3 c/l, but
with a
On my pandaboard (with a cortex-a9), I run Debian GNU/Linux, and the
assembler calls itself
$ as --version
GNU assembler (GNU Binutils for Debian) 2.22
It refuses to assemble the new shiny mpn/arm/neon/*.asm files. With
somewhat confusing error messages like
tmp-lshift.s: Assembler
Torbjorn Granlund t...@gmplib.org writes:
I need the first few dozen lines from the configure output to have a
guess about what might go wrong. You might actually compare the output
to that of panda.gmplib.org yourself.
ni...@lysator.liu.se (Niels Möller) writes:
I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd
like to try.
That wasn't a clear win... I use addmul_1 and submul_1 as a fallback
(and I always do in-place operation, so that works). Now, cnd_sub_n
beats submul_1 (except for n
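A minimal sketch of the two approaches compared above, assuming 32-bit limbs for portability. The names cnd_sub_n_sketch and submul_1_sketch are hypothetical and the code is not GMP's implementation; it only illustrates that a masked conditional subtract and an in-place submul_1 with a multiplier of 0 or 1 give the same result.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* assumed limb size, for illustration only */

/* If cnd is nonzero, {rp,n} = {ap,n} - {bp,n}; otherwise {rp,n} = {ap,n}.
   Returns the borrow out (0 or 1).  The same instructions run either way. */
static limb_t cnd_sub_n_sketch(limb_t cnd, limb_t *rp, const limb_t *ap,
                               const limb_t *bp, size_t n)
{
    limb_t mask = -(limb_t)(cnd != 0);   /* all ones or all zeros */
    limb_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];
        limb_t b = bp[i] & mask;         /* subtract b or 0 */
        limb_t d = a - b;
        limb_t b1 = a < b;               /* borrow from a - b */
        limb_t r = d - borrow;
        limb_t b2 = d < borrow;          /* borrow from the incoming borrow */
        rp[i] = r;
        borrow = b1 | b2;
    }
    return borrow;
}

/* In-place fallback: {rp,n} -= {up,n} * v, returning the high limb plus
   borrow.  Calling it with v equal to 0 or 1 acts as a conditional subtract. */
static limb_t submul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v + carry;
        limb_t lo = (limb_t)p;
        carry = (limb_t)(p >> 32);
        limb_t r = rp[i];
        rp[i] = r - lo;
        carry += r < lo;                 /* propagate the borrow */
    }
    return carry;
}
```

The fallback spends a full multiply per limb on a multiplier that is only ever 0 or 1, which is why a dedicated conditional routine can be expected to win once it is tuned.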
From: Torbjorn Granlund t...@gmplib.org
Date: Tue, 02 Apr 2013 09:38:42 +0200
Torbjorn Granlund t...@gmplib.org writes:
This version probably overschedules loads, I'll try another variant some
day which fixes that.
Two variants. The 1st is just the previous 3 c/l one, with a bug
If your eventual target is a fat binary, using -mfpu=neon in CFLAGS
is a bad idea (at least for Android). It would be far better to
approach the problem as Niels did, using .fpu neon as required. That
way, non-neon code can be selected at runtime if necessary for the
critical code paths.
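A minimal sketch of the runtime selection described above, assuming a Linux-style getauxval probe on ARM. The names lshift_generic, lshift_neon_standin, cpu_has_neon and select_lshift are hypothetical and this is not GMP's fat-binary mechanism; the "NEON" routine here is a plain C stand-in for code that would live in an assembly file marked with .fpu neon.

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__linux__) && defined(__arm__)
#include <sys/auxv.h>    /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>   /* HWCAP_NEON */
#endif

typedef uint32_t limb_t;   /* assumed limb size, for illustration only */
typedef limb_t (*lshift_fn)(limb_t *rp, const limb_t *up, size_t n, unsigned cnt);

/* Plain C version, safe on any CPU.  Shifts {up,n} left by cnt bits
   (0 < cnt < 32), writes the result to rp, returns the bits shifted out. */
static limb_t lshift_generic(limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];
        rp[i] = (u << cnt) | carry;
        carry = u >> (32 - cnt);
    }
    return carry;
}

/* Stand-in for a routine assembled separately with ".fpu neon".
   The rest of the program is built without -mfpu=neon. */
static limb_t lshift_neon_standin(limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
    return lshift_generic(rp, up, n, cnt);
}

/* Hypothetical feature probe: nonzero only if the kernel reports NEON. */
static int cpu_has_neon(void)
{
#if defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;
#endif
}

/* Pick an implementation once, at run time. */
static lshift_fn select_lshift(void)
{
    return cpu_has_neon() ? lshift_neon_standin : lshift_generic;
}
```

Because only the separately assembled routine needs NEON, the compiler never emits NEON instructions into the generic paths, and the binary still runs on cores without it.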
ni...@lysator.liu.se (Niels Möller) writes:
ni...@lysator.liu.se (Niels Möller) writes:
I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd
like to try.
That wasn't a clear win... I use addmul_1 and submul_1 as a fallback
(and I always do in-place operation,
David Miller da...@davemloft.net writes:
See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can.
overhead 6.00 cycles, precision 1000 units of 3.51e-10 secs, CPU freq
2847.41 MHz
Darn. Is the load latency 3 cycles?
The old code had a load-use schedule of 8 cycles,
David Miller da...@davemloft.net writes:
I started playing with these, and one problem is that the
addxc/addxccc instructions do not accept an immediate field. They
only accept rs1, rs2, rd arguments. Please update your compat macros
to catch this.
Oops. Missed that.
With this
From: Torbjorn Granlund t...@gmplib.org
Date: Tue, 02 Apr 2013 21:09:51 +0200
So what's going on in the a and b code variants? I assume the total OoO
capacity was just not enough for a ld-mul-add 17 cycle chain scheduled
at just 3+4 cycles. With fully scheduled loads, the OoO requirement was
From: Torbjorn Granlund t...@gmplib.org
Date: Wed, 03 Apr 2013 01:05:05 +0200
I rescheduled the addmul_2 and mul_2. If I have not misunderstood this
pipeline, we should finally reach 3.5 c/l and 3 c/l, respectively.
Attached are the output of:
tune/speed -p1000 -s1-1000 -f1.1 -C
David Miller da...@davemloft.net writes:
Attached are the output of:
tune/speed -p1000 -s1-1000 -f1.1 -C mpn_mul_2.3
3.25 c/l, not 3 c/l as I had hoped.
tune/speed -p1000 -s1-1000 -f1.1 -C mpn_addmul_2.3
3.75 c/l, not 3.5 c/l as I had hoped...
I will accept this, since
From: Torbjorn Granlund t...@gmplib.org
Date: Wed, 03 Apr 2013 01:26:48 +0200
I will accept this, since micro-optimisation takes more attempts than
both me and you will care to try over email.
:-) I can toy with it in my spare time.
So I've been toying with this loop:
1:
mulx u0, v0, %l2
sub n, -(2 * 8), n
umulxhi u0, v0, %l5
ldx [n + u0_off], u0
mulx u1, v0, %l3
addxccc %l2, %o5, r0
umulxhi u1, v0, %o5
ldx [n + u1_off], u1
addxccc
From: David Miller da...@davemloft.net
Date: Tue, 02 Apr 2013 20:04:19 -0400 (EDT)
I'll keep playing with it and if I can get it to run consistently in
6 cycles per loop we should seriously consider taking this approach.
Actually, I think I've figured some of this out.
My variant can never be
From: David Miller da...@davemloft.net
Date: Tue, 02 Apr 2013 20:24:51 -0400 (EDT)
Only loops like mul_1a.asm (and potentially mul_1b.asm) can, because
only they have enough cycles in the loop to retire multiplies without
positive accumulation into the OoO buffer.
Actually, mul1b.asm cannot