Torbjorn Granlund t...@gmplib.org writes:
I found the A9 manual here:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf
And for neon instructions, cycle numbers are in
ni...@lysator.liu.se (Niels Möller) writes:
And for neon instructions, cycle numbers are in
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf
Page?
Seems it should be able to do one vmull per cycle. Not sure how to
get latency from the given
Torbjorn Granlund t...@gmplib.org writes:
ni...@lysator.liu.se (Niels Möller) writes:
And for neon instructions, cycle numbers are in
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf
Page?
In Chapter 3, multiplication instructions listed in
ni...@lysator.liu.se (Niels Möller) writes:
In Chapter 3, multiplication instructions listed in a table starting on
page 3-14. But now I see I read the entry for a smaller data size. For
32-bit inputs, it's apparently 2 cycles, not 1.
It seems to be 2 cycles indeed:
.text
The corresponding code sustains one vmull.u32 per cycle on A15. That's
4 times the bandwidth of its umul implementation.
It is usually tricky to make use of SIMD operations for addmul_(k) and
friends. The well-designed ARM instructions will surely make it easier,
but it might still require many
Torbjorn Granlund t...@gmplib.org writes:
But IIUC, we are thus performing a 32 x 32 -> 64 mul per cycle.
Can one stick addition here without consuming cycles?
As I understand the manual, operations in the main cpu can be done in
parallel with the simd instructions. But it also warns about
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
But IIUC, we are thus performing a 32 x 32 -> 64 mul per cycle.
Can one stick addition here without consuming cycles?
As I understand the manual, operations in the main cpu can be done in
There are a few aspects worth noticing for prospective Neon hackers:
There are 32 64-bit registers, available both in VFPv3-D32 and Neon.
(There are IIRC at least 4 levels of FP support, VFP, VFPv2,
VFPv3-D16, and VFPv3-D32... I've seen references to VFPv4 too.)
In Neon, the registers can
ni...@lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund t...@gmplib.org writes:
I played with vmlal.u32 on A9 and A15. Surprisingly, both CPUs are very
cooperative in that the accumulation dependency is very shallow.
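For reference, the arithmetic that vmlal.u32 performs in each lane (a widening 32 x 32 -> 64 multiply added into a 64-bit accumulator) can be modelled in portable C. This is only a sketch of the semantics, not Neon code and not anything posted in the thread; the names mlal_u32_lane and mlal_u32x2 are made up for illustration.

```c
#include <stdint.h>

/* One lane: acc + a*b using the full 64-bit product.
   The sum wraps modulo 2^64, as a 64-bit accumulator would. */
static uint64_t mlal_u32_lane(uint64_t acc, uint32_t a, uint32_t b)
{
    return acc + (uint64_t)a * b;
}

/* Two lanes at once, modelling "vmlal.u32 qD, dN, dM":
   two 32-bit inputs per operand, two 64-bit accumulators. */
static void mlal_u32x2(uint64_t acc[2], const uint32_t a[2], const uint32_t b[2])
{
    for (int i = 0; i < 2; i++)
        acc[i] = mlal_u32_lane(acc[i], a[i], b[i]);
}
```

Each call depends on the previous value of acc, which is the accumulation dependency discussed above.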
Nice. Is the same true for the non-simd umaal
From: Michael Mohr m...@linuxcertified.com
Date: Mon, 14 Jan 2013 12:10:30 -0800
At least for Android this may not be an option. It provides a
cpufeatures library for use at runtime. I suggest using a configure
option which explicitly enables this, and disable it otherwise.
I really think
David Miller da...@davemloft.net writes:
From: Michael Mohr m...@linuxcertified.com
Date: Mon, 14 Jan 2013 12:10:30 -0800
At least for Android this may not be an option. It provides a
cpufeatures library for use at runtime. I suggest using a configure
option which explicitly
David Miller da...@davemloft.net writes:
IFUNC relocations allow symbols to resolve based upon run-time checks,
such as tests on the cpu type.
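For comparison, run-time selection without IFUNC can be sketched in portable C with a function pointer that is resolved on the first call. This is a minimal illustration under stated assumptions, not GMP's actual code: mul32, mul_generic, mul_tuned and the probe cpu_has_fast_mul are all hypothetical names, and the probe is a stub that always returns 0.

```c
#include <stdint.h>

typedef uint64_t (*mul_fn)(uint32_t, uint32_t);

/* Two interchangeable implementations; mul_tuned is a stand-in
   for a CPU-specific assembly routine. */
static uint64_t mul_generic(uint32_t a, uint32_t b) { return (uint64_t)a * b; }
static uint64_t mul_tuned(uint32_t a, uint32_t b)   { return (uint64_t)a * b; }

/* Hypothetical CPU probe; a real one would query the hardware or OS. */
static int cpu_has_fast_mul(void) { return 0; }

static uint64_t mul_init(uint32_t a, uint32_t b);

/* The pointer starts at the initializer and is overwritten on first use. */
static mul_fn mul_ptr = mul_init;

static uint64_t mul_init(uint32_t a, uint32_t b)
{
    mul_ptr = cpu_has_fast_mul() ? mul_tuned : mul_generic;
    return mul_ptr(a, b);
}

/* Public entry point: every call pays one extra indirect jump. */
uint64_t mul32(uint32_t a, uint32_t b)
{
    return mul_ptr(a, b);
}
```

With IFUNC the dynamic linker runs the selection once and binds the symbol directly, so the extra jump through mul_ptr disappears. The sketch ignores thread safety of the first call.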
I think that's nice. There are a couple of difficulties, though.
GMP also wants to set up machine-dependent threshold variables, i.e.,
data rather
From: Torbjorn Granlund t...@gmplib.org
Date: Mon, 14 Jan 2013 23:19:40 +0100
David Miller da...@davemloft.net writes:
From: ni...@lysator.liu.se (Niels Möller)
Date: Mon, 14 Jan 2013 22:22:28 +0100
Furthermore, gmp needs to be portable to non-glibc systems as well. We
have a
On Monday 14 January 2013 14:36:43 Torbjorn Granlund wrote:
At some point, we'd like to make the assembly code in GMP support x32.
how so ? for the most part, existing x86_64 assembly should just work for
x32. pointers tend to be where things get into trouble, but otherwise x32 is
simply
David Miller da...@davemloft.net writes:
My opinion is that IFUNC is valuable for the sake of turning what
would be two calls, into one through the PLT which is the minimum you
can get away with.
With the current GMP mechanism, you have the call through the PLT,
followed by a jump through
From: ni...@lysator.liu.se (Niels Möller)
Date: Tue, 15 Jan 2013 08:28:07 +0100
From my admittedly limited understanding of ELF linking, I think it has
to be done a bit differently if we update the pointers directly in the
PLT. For one, iirc, each shared library has its own PLT for the