ni...@lysator.liu.se (Niels Möller) writes:
I'll also try using fewer updates of the up pointer, that seems to save
half a cycle, and could perhaps speed up addmul_1 too.
No speedup for addmul_1, unfortunately, but a saving for submul_1. Here
are new versions of both files (for mpn/arm/v6). I
David Miller da...@davemloft.net writes:
Attached is a dive_1.asm that works for me on real hardware as
well as T4 timings from:
tune/speed -p1000 -s1-1000 -f1.1 -C mpn_divexact_1.3
This timing is most curious. The cost of inversion computation should
be clearly visible for tiny
David Miller da...@davemloft.net writes:
First mul_1, renamed again, now encoding the load scheduling. Only the
6c variant is new. Please time it. If it doesn't run at 3 c/l, then
there are 2 simple things to try, indicated in a comment.
This gets the expected 3 cycles per limb
Torbjorn Granlund t...@gmplib.org writes:
I sometimes get better A9 performance with *discrete* pointer updates,
not one-out-of-four autoincrement pointer updates like used here. I
think the code you started with had that one-out-of-four trick for str,
already?
Right, it uses a single
On 2013-04-04 06:51, Niels Möller wrote:
And it's no use to even think of porting the loop mixer to arm without
access to cycle-accurate timing.
Looking around the web it seems that what most folks do is write a
minimal kernel module that toggles the bit that allows userspace
access to the
ni...@lysator.liu.se (Niels Möller) writes:
I had on the other hand not realised David's ones complement + pre-invert
carry trick.
Not sure I understand what you are referring to here. I haven't been
following the sparc developments very closely (and I don't remember much
of sparc
Richard Henderson r...@twiddle.net writes:
On 2013-04-04 06:51, Niels Möller wrote:
And it's no use to even think of porting the loop mixer to arm without
access to cycle-accurate timing.
Looking around the web it seems that what most folks do is write a
minimal kernel module that
Richard Henderson r...@twiddle.net writes:
Looking around the web it seems that what most folks do is write a
minimal kernel module that toggles the bit that allows userspace
access to the cycle counter MSRs.
I've seen some code snippets to do that too. I have a dual core system;
one known
Torbjorn Granlund t...@gmplib.org writes:
The newer sparc adds 64-bit carrying adds, but they still don't have
corresponding subtraction instructions. So David sets carry before
entering the loop, and ones complements the subtrahend.
Ah. I think I even suggested that trick, for mpn_sub_n.
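The trick described above relies on the identity a - b = a + ~b + 1: setting carry before the loop supplies the "+1", and every later limb just adds the complemented subtrahend with the running carry. Below is a minimal C sketch of that idea, not the actual sparc assembly. The function name sub_n_via_addc and the 64-bit limb type are assumptions made for illustration; they are not GMP's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of GMP): computes rp[] = ap[] - bp[] over
   n 64-bit limbs, least significant limb first, using only additions with
   carry. Returns the borrow out (0 or 1). */
static uint64_t sub_n_via_addc(uint64_t *rp, const uint64_t *ap,
                               const uint64_t *bp, size_t n)
{
    uint64_t carry = 1;            /* carry set before entering the loop */

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t nb = ~bp[i];      /* ones' complement of the subtrahend limb */
        uint64_t s = ap[i] + nb;   /* first half of the carrying add */
        uint64_t c1 = s < nb;      /* carry out of ap[i] + ~bp[i] */
        uint64_t r = s + carry;    /* add the incoming carry */
        uint64_t c2 = r < s;       /* carry out of adding the carry in */

        rp[i] = r;
        carry = c1 | c2;           /* at most one of the two can be set */
    }
    return carry ^ 1;              /* carry out of 1 means no borrow */
}
```

In assembly the two partial adds collapse into a single add-with-carry instruction per limb; the C version spells them out only because portable C has no direct access to the carry flag.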
ni...@lysator.liu.se (Niels Möller) writes:
I guess it's lowest numbered first (and lowest memory address).
But a loop with
use r7
ldm up!, {r4,r5,r6,r7}
use r4
looks like poor scheduling between load of r4 and use of it, and the ldm
can't be moved earlier since