Hi everybody...

There are a lot of problems related to coupling the current BLAS3
implementation with the kernel generator:

- While I think I could add some range support, adding slices will be
extremely difficult, and it would probably result in bad performance
whatever kernel is used. The most efficient way to do this is probably:
  1. copy the slice to a temporary dense matrix
  2. perform the product
  3. copy the result back into the destination slice
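
To make the three steps above concrete, here is a minimal host-side sketch
in plain C++. It does not use ViennaCL: Dense, Slice, gather, scatter and
sliced_prod are hypothetical names made up for the illustration, and the
naive triple loop only stands in for the tuned GEMM kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix (not a ViennaCL type).
struct Dense {
    std::size_t rows, cols;
    std::vector<double> data;
    Dense(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Hypothetical slice descriptor: indices start, start + stride, ...
struct Slice {
    std::size_t start, stride, size;
    std::size_t at(std::size_t k) const { return start + k * stride; }
};

// Step 1: gather the strided sub-matrix into a contiguous temporary.
Dense gather(const Dense& m, Slice r, Slice c) {
    Dense t(r.size, c.size);
    for (std::size_t i = 0; i < r.size; ++i)
        for (std::size_t j = 0; j < c.size; ++j)
            t(i, j) = m(r.at(i), c.at(j));
    return t;
}

// Step 2: dense product alpha * a * b (placeholder for the fast kernel).
Dense prod(double alpha, const Dense& a, const Dense& b) {
    assert(a.cols == b.rows);
    Dense c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                c(i, j) += alpha * a(i, k) * b(k, j);
    return c;
}

// Step 3: scatter the dense result into the strided destination.
void scatter(const Dense& t, Dense& m, Slice r, Slice c) {
    assert(t.rows == r.size && t.cols == c.size);
    for (std::size_t i = 0; i < r.size; ++i)
        for (std::size_t j = 0; j < c.size; ++j)
            m(r.at(i), c.at(j)) = t(i, j);
}

// C(rc, cc) = alpha * A(ra, ca) * B, going through dense temporaries.
void sliced_prod(double alpha, const Dense& A, Slice ra, Slice ca,
                 const Dense& B, Dense& C, Slice rc, Slice cc) {
    Dense tmpA = gather(A, ra, ca);
    Dense tmpC = prod(alpha, tmpA, B);
    scatter(tmpC, C, rc, cc);
}
```

The point is that only gather and scatter ever see a stride, so the
product kernel itself stays dense and needs no slice-specific variant.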

- Kernels take forever to compile. In particular, the generated program
includes a duplicate of each kernel for each device in the associated
context:

AMD platform:
-> AMD GPU
-> Intel CPU
The CPU kernel is extremely unrolled; in this case, the program takes up
to 2-3 seconds to compile on my desktop Core i7-4770. We could use a
deliberately simple CPU profile that compiles fast to solve this problem,
at the expense of OpenCL CPU performance.

NVIDIA platform:
-> GTX 470
-> Tesla C2050
Here, each kernel takes a long time to compile, mainly because of the
#pragma unroll directive, which almost doubles the performance. On a
remote machine with a Core i7 960, it also takes several seconds to
compile. Note that this also makes using #pragma unroll in the autotuner a
bad idea, since all the kernels may end up taking forever to compile.

Each "variation" of BLAS3 generates a different program, whether it is:
C = alpha*prod(A,B)
C = alpha*prod(rangeA,B)
C = alpha*prod(A,B) + C
C = alpha*prod(A,B) + D
C = element_exp(alpha*prod(A,B))
...
C += alpha*prod(A,B)
...
This is the way kernel generation works.
I guess you see the problem here. Even though one rarely uses all the
programs, only the NVIDIA SDK caches the programs to avoid recompilation.
Clearly, packing all the possible BLAS3 kernels into one program is
impossible. I don't see any way out of this mess. Should we handle the
binaries and perform the caching ourselves?
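
As a rough idea of what caching by ourselves could look like, here is a
minimal sketch in plain C++. It is an assumption, not existing ViennaCL
code: BinaryCache, cache_key and the Compiler callback are hypothetical
names. In a real implementation the callback would build the program and
fetch the binary with clGetProgramInfo(CL_PROGRAM_BINARIES), and a cache
hit would be loaded with clCreateProgramWithBinary. The key covers the
device name, the build options and the generated source, so each device
and each BLAS3 variation gets its own entry.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Binary = std::vector<unsigned char>;

// Hypothetical callback: compiles `source` with `options`, returns the binary.
using Compiler = std::function<Binary(const std::string& source,
                                      const std::string& options)>;

// FNV-1a (64-bit) over device name, options and source; unlike std::hash,
// it gives the same value on every run. A stronger hash may be preferable.
std::string cache_key(const std::string& device, const std::string& options,
                      const std::string& source) {
    std::uint64_t h = 14695981039346656037ULL;
    auto mix = [&h](const std::string& s) {
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
        h ^= 0xff; h *= 1099511628211ULL;  // separator between the fields
    };
    mix(device);
    mix(options);
    mix(source);
    char buf[17];
    std::snprintf(buf, sizeof buf, "%016llx", static_cast<unsigned long long>(h));
    return buf;
}

class BinaryCache {
public:
    explicit BinaryCache(std::string dir) : dir_(std::move(dir)) {}

    // Looks in memory, then on disk, and compiles only as a last resort.
    Binary get(const std::string& device, const std::string& options,
               const std::string& source, const Compiler& compile) {
        const std::string key = cache_key(device, options, source);
        auto it = mem_.find(key);
        if (it != mem_.end()) return it->second;

        const std::string path = dir_ + "/" + key + ".bin";
        Binary bin;
        std::ifstream in(path, std::ios::binary);
        if (in) {
            bin.assign(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
        } else {
            bin = compile(source, options);
            std::ofstream out(path, std::ios::binary);  // best effort
            out.write(reinterpret_cast<const char*>(bin.data()),
                      static_cast<std::streamsize>(bin.size()));
        }
        mem_[key] = bin;
        return bin;
    }

private:
    std::string dir_;
    std::unordered_map<std::string, Binary> mem_;
};
```

With something like this, the several-second compilation would be paid
once per (device, options, source) triple instead of on every run. The
sketch ignores driver updates: a real cache would also need the driver
version in the key, or a way to invalidate stale binaries.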


Best regards,
Philippe
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel