Re: arm neon

2013-01-14 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: I found the A9 manual here: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf And for neon instructions, cycle numbers are in

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And for neon instructions, cycle numbers are in http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf Page? Seems it should be able to do one vmull per cycle. Not sure how to get latency from the given

Re: arm neon

2013-01-14 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: ni...@lysator.liu.se (Niels Möller) writes: And for neon instructions, cycle numbers are in http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf Page? In Chapter 3, multiplication instructions listed in

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: In Chapter 3, multiplication instructions listed in a table starting on page 3-14. But now I see I read the entry for a smaller data size. For 32-bit inputs, it's apparently 2 cycles, not 1. It seems to be 2 cycles indeed: .text

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
The corresponding code sustains one vmull.u32 per cycle on A15. That's 4 times the bandwidth of its umul implementation. It is usually tricky to make use of SIMD operations for addmul_(k) and friends. The well-designed ARM instructions will surely make it easier, but it might still require many
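For readers unfamiliar with the operation being discussed, here is a minimal portable C sketch of what an addmul_1-style loop computes with 32-bit limbs. It is a plain reference, not the Neon code from the thread. The name `ref_addmul_1` is hypothetical and chosen for illustration only. Each iteration is one 32 x 32 -> 64 multiply plus two 32-bit addends, which always fits in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not GMP source):
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb. */
uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint32_t carry = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* 32 x 32 -> 64 product plus two 32-bit addends.
           Maximum value: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so no overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;

        rp[i] = (uint32_t)t;          /* low limb goes back to the result */
        carry = (uint32_t)(t >> 32);  /* high limb propagates to the next step */
    }
    return carry;
}
```

The loop-carried dependency is the `carry` chain; a SIMD version has to break or defer that chain, which is one reason such loops are tricky to vectorize.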

Re: arm neon

2013-01-14 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: But IIUC, we are thus performing a 32 x 32 -> 64 mul per cycle. Can one stick addition here without consuming cycles? As I understand the manual, operations in the main cpu can be done in parallel with the simd instructions. But it also warns about

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: But IIUC, we are thus performing a 32 x 32 -> 64 mul per cycle. Can one stick addition here without consuming cycles? As I understand the manual, operations in the main cpu can be done in

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
There are a few aspects worth noticing for prospective Neon hackers: There are 32 64-bit registers, available both in VFPv3-D32 and Neon. (There are IIRC at least 4 levels of FP support, VFP, VFPv2, VFPv3-D16, and VFPv3-D32... I've seen references to VFPv4 too.) In Neon, the registers can

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: I played with vmlal.u32 on A9 and A15. Surprisingly, both CPUs are very cooperative in that the accumulation dependency is very shallow. Nice. Is the same true for the non-simd umaal

Re: arm neon

2013-01-14 Thread David Miller
From: Michael Mohr m...@linuxcertified.com Date: Mon, 14 Jan 2013 12:10:30 -0800 At least for Android this may not be an option. It provides a cpufeatures library for use at runtime. I suggest using a configure option which explicitly enables this, and disable it otherwise. I really think

Re: arm neon

2013-01-14 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: From: Michael Mohr m...@linuxcertified.com Date: Mon, 14 Jan 2013 12:10:30 -0800 At least for Android this may not be an option. It provides a cpufeatures library for use at runtime. I suggest using a configure option which explicitly

Re: arm neon

2013-01-14 Thread Niels Möller
David Miller da...@davemloft.net writes: IFUNC relocations allow symbols to resolve based upon run-time checks, such as tests on the cpu type. I think that's nice. There are a couple of difficulties, though. GMP also wants to set up machine-dependent threshold variables, i.e., data rather

Re: arm neon

2013-01-14 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Mon, 14 Jan 2013 23:19:40 +0100 David Miller da...@davemloft.net writes: From: ni...@lysator.liu.se (Niels Möller) Date: Mon, 14 Jan 2013 22:22:28 +0100 Furthermore, gmp needs to be portable to non-glibc systems as well. We have a

Re: [patch] add x32 support

2013-01-14 Thread Mike Frysinger
On Monday 14 January 2013 14:36:43 Torbjorn Granlund wrote: At some point, we'd like to make the assembly code in GMP support x32. how so ? for the most part, existing x86_64 assembly should just work for x32. pointers tend to be where things get into trouble, but otherwise x32 is simply

Re: arm neon

2013-01-14 Thread Niels Möller
David Miller da...@davemloft.net writes: My opinion is that IFUNC is valuable for the sake of turning what would be two calls, into one through the PLT which is the minimum you can get away with. With the current GMP mechanism, you have the call through the PLT, followed by a jump through
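As a rough illustration of the "extra indirect jump" style of run-time dispatch contrasted with IFUNC here, the following is a minimal portable C sketch using a lazily resolved function pointer. It is not GMP's actual mechanism. All names (`demo_add_n`, `demo_cpu_has_neon`, and so on) are hypothetical, and the CPU test is a stub that always reports no Neon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*add_n_fn)(uint32_t *, const uint32_t *,
                             const uint32_t *, size_t);

/* Generic implementation: rp = ap + bp over n limbs, returns carry-out. */
uint32_t demo_add_n_generic(uint32_t *rp, const uint32_t *ap,
                            const uint32_t *bp, size_t n)
{
    uint32_t carry = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)ap[i] + bp[i] + carry;
        rp[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* Stand-in for a CPU-specific variant; here it simply reuses the generic
   code. */
uint32_t demo_add_n_neon(uint32_t *rp, const uint32_t *ap,
                         const uint32_t *bp, size_t n)
{
    return demo_add_n_generic(rp, ap, bp, n);
}

/* Hypothetical feature test; a real one would query the OS or hardware. */
int demo_cpu_has_neon(void)
{
    return 0;
}

static uint32_t demo_add_n_init(uint32_t *, const uint32_t *,
                                const uint32_t *, size_t);

/* Starts out pointing at the initializer; overwritten on first call. */
static add_n_fn demo_add_n_ptr = demo_add_n_init;

static uint32_t demo_add_n_init(uint32_t *rp, const uint32_t *ap,
                                const uint32_t *bp, size_t n)
{
    demo_add_n_ptr = demo_cpu_has_neon() ? demo_add_n_neon
                                         : demo_add_n_generic;
    return demo_add_n_ptr(rp, ap, bp, n);
}

/* Public entry point: every call pays one extra indirect jump. */
uint32_t demo_add_n(uint32_t *rp, const uint32_t *ap,
                    const uint32_t *bp, size_t n)
{
    return demo_add_n_ptr(rp, ap, bp, n);
}
```

In a shared library, a call to `demo_add_n` goes through the PLT and then through `demo_add_n_ptr`. With IFUNC the dynamic linker would instead bind the PLT entry to the selected implementation directly.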

Re: arm neon

2013-01-14 Thread David Miller
From: ni...@lysator.liu.se (Niels Möller) Date: Tue, 15 Jan 2013 08:28:07 +0100 From my admittedly limited understanding of ELF linking, I think it has to be done a bit differently if we update the pointers directly in the PLT. For one, iirc, each shared library has its own PLT for the