Hi everybody,
Good news: the GEMM calls for OpenCL on dense non-proxy matrices now call
the generator! It's a good step towards performance portability.
For now, in single precision:
850-900 GFLOP/s on HD5850
2 TFLOP/s on HD7970
600 GFLOP/s on GTX470
CPUs: results in my private memory for now ;)
#HD5850 (GFLOP/s, single precision)
#size AA TA AT TT // no good profile found for now
128 23.6299 9.74287 6.569 6.58446
256 12.5672 138.369 83.7814 62.1954
384 260.936 415.582 253.347 168.647
512 393.889 445.167 240.426 163.58
640 513.504 733.27 607.518 181.887
768 697.705 696.097 489.978 194.895
896 749.295 763.41 581.154 197.291
1024 575.656 778.215 280.314 148.031
1152 801.796 824.164 651.464 187.839
1280 840.373 799.524 602.76 187.938
1408 876.048 831.489 657.165 201.655
1536 576.592 813.76 458.038 191.022
1664 880.964 832.307 666.249 200.701
1792 852.594 837.091 667.895 195.158
1920 897.241 834.583 704.565 193.118
2048 86.9554 749.869 176.842 177.298
2176 899.283 855.277 700.905 196.739
2304 839.768 871.575 710.792 195.424
2432 911.239 867.037 706.232 195.298
All row-major.
Bad news: peaky performance.
There's no missing digit in the "2048" row: AA really does drop to about
87 GFLOP/s there. Dealing with it is fairly complicated, since it involves
having different profiles for different sizes. Since it seems to only
affect AMD hardware, I think we can just warn about the issue...
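
A minimal sketch of what such a warning could be based on, assuming we simply
flag sizes whose throughput falls far below the median of the measured sizes.
The function name find_dips and the 0.5 threshold are hypothetical choices for
illustration, not anything that exists in the code base; the numbers are the
AA column of a few rows from the HD5850 table above.

```python
from statistics import median

# AA column (GFLOP/s) for a few sizes, copied from the HD5850 table above.
aa_gflops = {
    1792: 852.594,
    1920: 897.241,
    2048: 86.9554,
    2176: 899.283,
    2304: 839.768,
}


def find_dips(results, ratio=0.5):
    """Return the sizes whose throughput is below ratio * median.

    'ratio' is an arbitrary threshold picked for this illustration.
    """
    typical = median(results.values())
    return sorted(size for size, gflops in results.items()
                  if gflops < ratio * typical)


for size in find_dips(aa_gflops):
    print(f"warning: size {size} is much slower than the other sizes")
```

On this subset only the 2048 row gets flagged. A real check would probably
have to be done per device and per variant.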
However, this is the size used by default in the blas3_bench, which made me
freak out.
What would you guys think about a more "graphical" benchmark (i.e. either
a plot or a list, like above), so that people who really care a lot about
performance can get an idea of when to use what?
*Examples:*
-> I have a matrix of internal size 384*384 => for C=A*B, I'd prefer to
either store A as column-major or use C=trans(A)*B with A row-major instead
(on my specific hardware).
-> If I want a performance increase for big problems, I'd rather go for
C=A*B in all row-major.
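
To make the "when to use what" idea concrete, here is a rough sketch of how
such a list could be queried. It assumes the benchmark output is available as
a size -> (AA, TA, AT, TT) mapping; the names table and best_variant are
hypothetical, and the two rows are copied from the HD5850 table above.

```python
# Rows copied from the HD5850 table above: size -> GFLOP/s per variant.
COLUMNS = ("AA", "TA", "AT", "TT")
table = {
    384: (260.936, 415.582, 253.347, 168.647),
    2432: (911.239, 867.037, 706.232, 195.298),
}


def best_variant(size):
    """Return (variant, gflops) for the fastest variant measured at 'size'."""
    row = table[size]
    idx = max(range(len(row)), key=row.__getitem__)
    return COLUMNS[idx], row[idx]


for n in sorted(table):
    name, gflops = best_variant(n)
    print(f"{n}: use {name} ({gflops:.1f} GFLOP/s)")
```

With these two rows it picks TA for 384 and AA for 2432, which matches the
two examples above.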
What do you think about that?
Best regards,
Philippe
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel