Re: [ft-devel] FT_MulFix assembly

2010-09-19 Thread Werner LEMBERG
> Werner: Miles' version is shorter, is only wrong by one ulp and only
> when the product overflows and is negative. My variation, called
> another() above, fixes that slight difference. Which would you
> prefer, if anything?
I tend to prefer the faster one

Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread Miles Bader
James Cloos <cl...@jhcloos.com> writes:
>>> The C version does away-from-zero rounding.
> MB> Do you have test cases that show this? I tried using random inputs,
> MB> but even up to billions of iterations, I can't seem to find a set of
> MB> inputs where my function yields different results from yours.
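The random-input comparison described here could be set up roughly as follows. This is only a sketch, not the test app used in the thread: `mulfix_fn`, `next_random`, `count_mismatches`, and the two stand-in implementations are hypothetical names, the inputs are assumed to be 32-bit values, and `>>` on a negative `int64_t` is assumed to be an arithmetic shift (as with GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical common signature for the implementations under comparison. */
typedef int64_t (*mulfix_fn)(int32_t a, int32_t b);

/* Stand-in implementation 1: add half, then shift (assumes arithmetic >>). */
static int64_t mulfix_half_up(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a * b;
    return (r + 0x8000) >> 16;
}

/* Stand-in implementation 2: shift only, no rounding term. */
static int64_t mulfix_truncate(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a * b;
    return r >> 16;
}

/* xorshift64 step: deterministic, so any mismatch found is reproducible. */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Count inputs on which f and g disagree. If first_a and first_b are both
   non-NULL, the first disagreeing pair is stored there. */
static uint64_t count_mismatches(mulfix_fn f, mulfix_fn g,
                                 uint64_t iterations, uint64_t seed,
                                 int32_t *first_a, int32_t *first_b)
{
    uint64_t state = seed ? seed : 1;   /* xorshift state must be nonzero */
    uint64_t mismatches = 0;

    for (uint64_t i = 0; i < iterations; i++) {
        uint64_t bits = next_random(&state);
        int32_t a = (int32_t)(uint32_t)(bits >> 32);  /* high word */
        int32_t b = (int32_t)(uint32_t)bits;          /* low word  */

        if (f(a, b) != g(a, b)) {
            if (mismatches == 0 && first_a && first_b) {
                *first_a = a;
                *first_b = b;
            }
            mismatches++;
        }
    }
    return mismatches;
}
```

Uniformly random inputs exercise edge cases such as exact ties only in proportion to how often they occur, so a run like this is probably best combined with a few hand-picked inputs.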

Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread James Cloos
>>>>> "MB" == Miles Bader <mi...@gnu.org> writes:
MB> Hm, are you sure that's not backwards? When I tried the git C version[*],
MB> as well as your most recent FT_MulFix_x86_64, it returned 0x8506...
Odd. Adding your algo to my test app, I get:
  7AFA8000, , 8505, 8505, 8506 #

Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread Miles Bader
James Cloos <cl...@jhcloos.com> writes:
> Since FT's C version uses longs, though, this:
>
>   int another (long a, long b) {
>       long r = (long)a * (long)b;
>       long s = r >> 31;
>       return (r + s + 0x8000) >> 16;
>   }
That's not correct though, is it? The variable s should be the all sign portion of the
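One reading of this objection is that, with a 64-bit `long`, a shift by 31 yields -1 only for products between -2^31 and -1, whereas a shift by 63 always yields 0 or -1. The sketch below illustrates that reading. It is not code from the thread: the function names are hypothetical, `int64_t` stands in for a 64-bit `long`, and `>>` on negative values is assumed to be an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Sign term taken by shifting right 31 places, mirroring the quoted code
   under the assumption that long is 64 bits wide. */
static int64_t mulfix_sign_shift31(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a * b;
    int64_t s = r >> 31;            /* -1 only for r in [-2^31, -1] */
    return (r + s + 0x8000) >> 16;
}

/* Sign term taken from the top bit of the 64-bit product: always 0 or -1. */
static int64_t mulfix_sign_shift63(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a * b;
    int64_t s = r >> 63;
    return (r + s + 0x8000) >> 16;
}

/* Hand-picked inputs: the two variants agree while the product is small
   and can differ by one once it drops below -2^31. */
static void check_sign_term(void)
{
    /* product = -0x8000, a tie: both variants round away from zero */
    assert(mulfix_sign_shift31(-1, 0x8000) == -1);
    assert(mulfix_sign_shift63(-1, 0x8000) == -1);

    /* product = -2147713023, just below -2^31: here r >> 31 is -2 */
    assert(mulfix_sign_shift63(-3, 715904341) == -32771);
    assert(mulfix_sign_shift31(-3, 715904341) == -32772);
}
```

In the second case the exact quotient is about -32771.49998, so -32771 is the correctly rounded value and the shift-by-31 variant is off by one.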

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Graham Asher
Have you done an ARM version? Forgive my inattentiveness if you've already announced one. It just struck me that this sort of optimisation is even more necessary on mobile devices.
Graham
James Cloos wrote:
> The final result for amd64 looks like:
>   static __inline__ long FT_MulFix_x86_64(

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
James Cloos <cl...@jhcloos.com> writes:
>   __asm__ __volatile__ (
>     "movq  %1, %%rax\n"
>     "imul  %2\n"
>     "addq  %%rdx, %%rax\n"
>     "addq  $0x8000, %%rax\n"
>     "sarq  $16, %%rax\n"
>     : "=a"(result)
>     : "g"(a), "g"(b)
>     : "rdx" );
The above code has a latency of 1+5+1+1+1 = 10

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Incidentally, you wrote:
> The assembly generated by the C code is 45 lines and 158 octets long,
> contains six conditional jumps, three each of explicit compares and
> tests, and still benchmarks are just as fast. Out-of-order processing
> wins out over hand-coded asm. :-/
... but when I follow

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Miles Bader <mi...@gnu.org> writes:
> The compiler generates the following assembly:
>
>     mov     %esi, %eax
>     mov     %edi, %edi
>     imulq   %rdi, %rax
>     addq    $32768, %rax
>     shrq    $16, %rax
The movs there are obviously a bit silly (compiler bug?), but that output seems

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
>>>>> "GA" == Graham Asher <graham.as...@btinternet.com> writes:
GA> Have you done an ARM version? Forgive my inattentiveness if you've
GA> already announced one. It just struck me that this sort of
GA> optimisation is even more necessary on mobile devices.
I386, arm and arm-thumb versions were already

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
>>>>> "MB" == Miles Bader <mi...@gnu.org> writes:
MB> The compiler generates the following assembly:
MB>     mov     %esi, %eax
MB>     mov     %edi, %edi
MB>     imulq   %rdi, %rax
MB>     addq    $32768, %rax
MB>     shrq    $16, %rax
That does not match the C code though; it rounds negative values wrong.
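One way a plain add-and-shift can diverge from sign-magnitude rounding on negative values is at exact ties. The sketch below shows this, with some caveats: both functions are illustrative stand-ins, not FreeType's actual code, the names are hypothetical, and `>>` on negative values is assumed to be an arithmetic shift. Whether this is the precise discrepancy meant in the message is not stated in the snippet.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-magnitude rounding: round |a*b| / 0x10000 to nearest with ties up,
   then reapply the sign, so ties on negative products move away from zero. */
static int64_t mulfix_away_from_zero(int32_t a, int32_t b)
{
    int64_t x = a, y = b;
    int sign = 1;

    if (x < 0) { x = -x; sign = -sign; }
    if (y < 0) { y = -y; sign = -sign; }

    int64_t q = (x * y + 0x8000) >> 16;
    return sign < 0 ? -q : q;
}

/* Plain add-and-shift on the signed product: ties go toward +infinity. */
static int64_t mulfix_half_up(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a * b;
    return (r + 0x8000) >> 16;
}

/* The two agree on a positive tie and differ by one on a negative tie. */
static void check_tie_cases(void)
{
    /* product = +0x8000, exactly one half */
    assert(mulfix_away_from_zero(1, 0x8000) == 1);
    assert(mulfix_half_up(1, 0x8000) == 1);

    /* product = -0x8000, exactly minus one half */
    assert(mulfix_away_from_zero(-1, 0x8000) == -1);
    assert(mulfix_half_up(-1, 0x8000) == 0);
}
```

Away from exact ties the two functions return the same value, so a difference of this kind only shows up on inputs whose product has low 16 bits equal to 0x8000.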

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
On Tue, Sep 7, 2010 at 4:28 AM, James Cloos <cl...@jhcloos.com> wrote:
>>>>>> "MB" == Miles Bader <mi...@gnu.org> writes:
> MB> The compiler generates the following assembly:
> MB>     mov     %esi, %eax
> MB>     mov     %edi, %edi
> MB>     imulq   %rdi, %rax
> MB>     addq    $32768, %rax
> MB>     shrq    $16, %rax

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
>>>>> "MB" == Miles Bader <mi...@gnu.org> writes:
>> The C version does away-from-zero rounding.
MB> Do you have test cases that show this? I tried using random inputs,
MB> but even up to billions of iterations, I can't seem to find a set of
MB> inputs where my function yields different results from yours.

Re: [ft-devel] FT_MulFix assembly

2010-09-05 Thread James Cloos
The final result for amd64 looks like:
  static __inline__ long
  FT_MulFix_x86_64( long a, long b )
  {
    register long result;
    __asm__ __volatile__ (
      "movq  %1, %%rax\n"
      "imul  %2\n"
      "addq  %%rdx, %%rax\n"
      "addq  $0x8000, %%rax\n"
      "sarq  $16,

Re: [ft-devel] FT_MulFix assembly

2010-08-12 Thread Werner LEMBERG
> I have to finish the patch, but I thought I'd offer the algorithm for
> review, if anyone wants to.
I haven't enough knowledge to comment, but thanks for working on it!
Werner
___ Freetype-devel mailing list Freetype-devel@nongnu.org

Re: [ft-devel] FT_MulFix assembly

2010-08-08 Thread James Cloos
My first cut at FT_MulFix_x86_64() is:
  static __inline__ FT_Int32
  FT_MulFix_x86_64 (FT_Int32 a, FT_Int32 b)
  {
    register FT_Int32 r;
    __asm__ __volatile__ (
      "movslq %%edx, %%rdx\n"
      "cltq\n"
      "imul   %%rdx\n"
      "addq   %%rdx, %%rax\n"
      "addq   $0x8000, %%rax\n"

Re: [ft-devel] FT_MulFix assembly

2010-08-06 Thread Werner LEMBERG
> I see implementations for ia32 and arm; would other platforms benefit
> from assembly implementations of MulFix?
As usual: patches are highly welcomed.
Werner