Re: [Info], Add support for PowerPC IEEE 128-bit floating point
On Tue, Jul 15, 2014 at 04:50:33PM -0500, Segher Boessenkool wrote:
> On Tue, Jul 15, 2014 at 05:20:31PM -0400, Michael Meissner wrote:
> > I did some timing tests to compare the new PowerPC IEEE 128-bit results
> > to the current implementation of long double using the IBM extended
> > format. The test consisted of a short loop doing the operation over
> > arrays of 1,024 elements, reading in two values, doing the operation,
> > and then storing the result back. This loop was in turn run multiple
> > times, with the idea that most of the values would be in the cache and
> > we didn't have to worry about pre-fetching, etc. The float and double
> > tests were done with vectorization disabled, while for the vector float
> > and vector double tests the compiler was allowed to do its normal
> > auto-vectorization. The number reported is how much longer the second
> > column took over the first:
>
> I assume you mean the other way around?
>
> > Generally, __float128 is 2x slower than the current IBM extended double
> > format, except for divide, where it is 5x slower. I must say, the
> > software floating-point emulation routines worked well, and once the
> > proper macros were set up, I only needed to override the type used for
> > IEEE 128-bit.
> >
> > Add loop
> > float vs double: 2.00x
>
> Why is float twice as slow as double?

Pat re-ran the tests, and now float and double are the same speed. Since I was running this on a development machine, not a dedicated one, it was probably just luck of the draw that somebody was doing a large build at the time I ran the tests.

--
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Re: [Info], Add support for PowerPC IEEE 128-bit floating point
On Tue, Jul 15, 2014 at 04:50:33PM -0500, Segher Boessenkool wrote:
> On Tue, Jul 15, 2014 at 05:20:31PM -0400, Michael Meissner wrote:
> > I did some timing tests to compare the new PowerPC IEEE 128-bit results
> > to the current implementation of long double using the IBM extended
> > format. The test consisted of a short loop doing the operation over
> > arrays of 1,024 elements, reading in two values, doing the operation,
> > and then storing the result back. This loop was in turn run multiple
> > times, with the idea that most of the values would be in the cache and
> > we didn't have to worry about pre-fetching, etc. The float and double
> > tests were done with vectorization disabled, while for the vector float
> > and vector double tests the compiler was allowed to do its normal
> > auto-vectorization. The number reported is how much longer the second
> > column took over the first:
>
> I assume you mean the other way around?
>
> > Generally, __float128 is 2x slower than the current IBM extended double
> > format, except for divide, where it is 5x slower. I must say, the
> > software floating-point emulation routines worked well, and once the
> > proper macros were set up, I only needed to override the type used for
> > IEEE 128-bit.
> >
> > Add loop
> > float vs double: 2.00x
>
> Why is float twice as slow as double?

I'm not sure; the Book IVs claim that lfsu vs. lfdu, fadds vs. fadd, and stfsu vs. stfdu each take the same number of cycles. The inner loops are fairly simple:

;; Float loop
	.p2align 6,,63
.L460:
	lfsu 12,4(10)
	lfsu 0,4(8)
	fadds 0,12,0
	stfsu 0,4(9)
	bdnz .L460

;; Double loop
	.p2align 6,,63
.L466:
	lfdu 12,8(10)
	lfdu 0,8(8)
	fadd 0,12,0
	stfdu 0,8(9)
	bdnz .L466

I would suspect the slowdown in a tight loop comes from the PowerPC internally keeping scalar floating point in double format in the registers (i.e., lfsu must load the value and convert it to double, fadds must do the add and then round the result to float precision, and stfsu must convert the double back to float format).
We've also seen cases where load with update is slower than doing the instructions separately.

--
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Re: [Info], Add support for PowerPC IEEE 128-bit floating point
I did some timing tests to compare the new PowerPC IEEE 128-bit results to the current implementation of long double using the IBM extended format. The test consisted of a short loop doing the operation over arrays of 1,024 elements, reading in two values, doing the operation, and then storing the result back. This loop was in turn run multiple times, with the idea that most of the values would be in the cache and we didn't have to worry about pre-fetching, etc. The float and double tests were done with vectorization disabled, while for the vector float and vector double tests the compiler was allowed to do its normal auto-vectorization. The number reported is how much longer the second column took over the first:

Generally, __float128 is 2x slower than the current IBM extended double format, except for divide, where it is 5x slower. I must say, the software floating-point emulation routines worked well, and once the proper macros were set up, I only needed to override the type used for IEEE 128-bit.
Add loop
========
float vs double:           2.00x
float vs vector float:     4.97x
double vs vector double:   2.63x
long double vs double:    16.85x
__float128 vs double:     23.34x
__float128 vs long double: 1.39x

Subtract loop
=============
float vs double:           1.99x
float vs vector float:     4.66x
double vs vector double:   2.63x
long double vs double:    14.47x
__float128 vs double:     27.65x
__float128 vs long double: 1.91x

Multiply loop
=============
float vs double:           2.05x
float vs vector float:     5.18x
double vs vector double:   2.59x
long double vs double:    11.58x
__float128 vs double:     27.44x
__float128 vs long double: 2.37x

Divide loop
===========
float vs double:           0.82x
float vs vector float:     2.11x
double vs vector double:   2.00x
long double vs double:     5.90x
__float128 vs double:     34.57x
__float128 vs long double: 5.86x

Maximum via comparison and ?:
=============================
float vs double:           1.74x
float vs vector float:     4.62x
double vs vector double:   2.62x
long double vs double:     5.07x
__float128 vs double:     18.02x
__float128 vs long double: 3.55x

Minimum via comparison and ?:
=============================
float vs double:           1.74x
float vs vector float:     4.52x
double vs vector double:   2.62x
long double vs double:     5.38x
__float128 vs double:     15.14x
__float128 vs long double: 2.82x

--
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Re: [Info], Add support for PowerPC IEEE 128-bit floating point
On Tue, Jul 15, 2014 at 05:20:31PM -0400, Michael Meissner wrote:
> I did some timing tests to compare the new PowerPC IEEE 128-bit results
> to the current implementation of long double using the IBM extended
> format. The test consisted of a short loop doing the operation over
> arrays of 1,024 elements, reading in two values, doing the operation,
> and then storing the result back. This loop was in turn run multiple
> times, with the idea that most of the values would be in the cache and
> we didn't have to worry about pre-fetching, etc. The float and double
> tests were done with vectorization disabled, while for the vector float
> and vector double tests the compiler was allowed to do its normal
> auto-vectorization. The number reported is how much longer the second
> column took over the first:

I assume you mean the other way around?

> Generally, __float128 is 2x slower than the current IBM extended double
> format, except for divide, where it is 5x slower. I must say, the
> software floating-point emulation routines worked well, and once the
> proper macros were set up, I only needed to override the type used for
> IEEE 128-bit.
>
> Add loop
> float vs double: 2.00x

Why is float twice as slow as double?


Segher