Hi Xeon Phil ;-)

> There are a lot of problems related to coupling the current BLAS3
> implementation with the kernel generator:
>
> - While I think I could add some range support, adding slices will be
> extremely difficult, and it would probably result in bad performance
> whatever kernel is used. The most efficient way to do this is probably:
>   1. copy the slice to a temporary dense matrix
>   2. perform the product
>   3. copy the result back into the slice

Yeah, slices are much harder. Let's stick with the current 
implementation for ranges and slices and only use the generator for the 
full products. We can add more cleverness/performance in 1.5.1.
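
For illustration, the copy / multiply / copy-back route from the quoted list could look roughly like the sketch below. It is plain host-side C++, not ViennaCL or OpenCL code: `Dense`, `Slice`, `fits`, `copy_slice`, `dense_prod`, `write_slice` and `slice_prod` are made-up names, and the triple loop only stands in for the generated full-GEMM kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major dense matrix; not a ViennaCL type.
struct Dense {
  std::size_t rows, cols;
  std::vector<double> data;
  Dense(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Hypothetical slice descriptor: indices start, start + stride, ... (size entries).
struct Slice { std::size_t start, stride, size; };

// The largest index touched by a slice must stay inside the given extent.
bool fits(Slice s, std::size_t extent) {
  return s.size == 0 || s.start + (s.size - 1) * s.stride < extent;
}

// Step 1: gather the strided view into a contiguous temporary.
Dense copy_slice(const Dense& M, Slice r, Slice c) {
  assert(fits(r, M.rows) && fits(c, M.cols));
  Dense T(r.size, c.size);
  for (std::size_t i = 0; i < r.size; ++i)
    for (std::size_t j = 0; j < c.size; ++j)
      T(i, j) = M(r.start + i * r.stride, c.start + j * c.stride);
  return T;
}

// Step 2: plain dense product, standing in for the tuned full-GEMM kernel.
Dense dense_prod(const Dense& A, const Dense& B) {
  assert(A.cols == B.rows);
  Dense C(A.rows, B.cols);
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t k = 0; k < A.cols; ++k)
      for (std::size_t j = 0; j < B.cols; ++j)
        C(i, j) += A(i, k) * B(k, j);
  return C;
}

// Step 3: scatter the dense result back into the strided view.
void write_slice(Dense& M, Slice r, Slice c, const Dense& T) {
  assert(fits(r, M.rows) && fits(c, M.cols));
  assert(T.rows == r.size && T.cols == c.size);
  for (std::size_t i = 0; i < r.size; ++i)
    for (std::size_t j = 0; j < c.size; ++j)
      M(r.start + i * r.stride, c.start + j * c.stride) = T(i, j);
}

// slice(C) = prod(slice(A), B), done via two temporaries.
void slice_prod(Dense& C, Slice cr, Slice cc,
                const Dense& A, Slice ar, Slice ac, const Dense& B) {
  Dense tmpA = copy_slice(A, ar, ac);
  Dense tmpC = dense_prod(tmpA, B);
  write_slice(C, cr, cc, tmpC);
}
```

The two extra copies are O(n^2) against the O(n^3) product, which is presumably why this route is attractive compared to teaching every kernel about strides.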


> - Kernels take forever to compile. In particular, the generated program
> includes a duplicate of each kernel for each device in the associated
> context:
>
> AMD Platform :
> -> AMD GPU
> -> Intel CPU
> The CPU kernel is extremely unrolled; in this case, the program takes up
> to 2-3 seconds to compile on my desktop Core i7-4770... We can have some
> completely crappy profile for the CPU that compiles fast, though, to
> solve this problem at the expense of OpenCL CPU performance.

Hmm, isn't the default to use just one device per context? In that case
it takes about 1-2 seconds, which is somewhat reasonable. Maybe there is 
a 'cheaper' alternative in terms of compilation time with only slightly 
reduced performance on the CPU?



> NVidia platform:
> -> GTX 470
> -> Tesla C2050
> Here, each kernel is slow to compile, mainly because of the #pragma
> unroll directive that almost doubles the performance. Basically, on a
> remote machine with a Core i7 960, it also takes several seconds to
> compile... Note that this also makes using #pragma unroll in the
> autotuner a bad idea, since all the kernels may end up taking forever to
> compile...
>
> Each "variation" of BLAS3 generates a different program, whether it is:
> C = alpha*prod(A,B)
> C = alpha*prod(rangeA,B)
> C = alpha*prod(A,B) + C
> C = alpha*prod(A,B) + D
> C = element_exp(alpha*prod(A,B))
> ....
> C+= alpha*prod(A,B)
> ...
> This is the way kernel generation works.

Considering that among the listed kernels a user may need only one or 
two in a given program, I think we can live with this limitation.

On the other hand, what about extracting the kernel sources for the 
standard GEMM case from the generator at first request and placing them 
into a separate 'special' program?
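
A rough sketch of what I mean by "at first request": keep one program per expression signature and build it only when somebody actually asks for it. `Program`, `ProgramCache` and the string keys are hypothetical placeholders, not ViennaCL classes; in a real version the compile callback would wrap the OpenCL program build, and the source callback would call into the generator.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a compiled OpenCL program handle.
struct Program { std::string source; };

class ProgramCache {
public:
  using Compiler = std::function<Program(const std::string&)>;

  explicit ProgramCache(Compiler c) : compile_(std::move(c)) {
    assert(compile_ && "a compile callback is required");
  }

  // Returns the program registered under 'key'. The source is generated and
  // compiled only on the first request; later requests reuse the stored program.
  const Program& get(const std::string& key,
                     const std::function<std::string()>& generate_source) {
    auto it = programs_.find(key);
    if (it == programs_.end()) {
      it = programs_.emplace(key, compile_(generate_source())).first;
      ++compile_count_;
    }
    return it->second;
  }

  // Number of compilations actually performed so far.
  std::size_t compile_count() const { return compile_count_; }

private:
  Compiler compile_;
  std::map<std::string, Program> programs_;
  std::size_t compile_count_ = 0;
};
```

With this layout, the standard GEMM case would live under its own key (say "C=alpha*prod(A,B)"), so a user who only ever calls the plain product pays for exactly one compilation, and the other variations are never built unless requested.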


> I guess you see the problem here. Even though one rarely uses all the
> programs, only the NVidia SDK caches the programs to avoid
> recompilation. Clearly, packing all the possible BLAS3 kernels into one
> program is impossible. I don't see any way out of this mess. Should we
> handle the binaries and perform the caching ourselves?

OpenCL 2.0 will offer an IR, so this should cut down compilation times 
quite a bit.
Handling the binaries ourselves will be pretty messy, as we would have 
all different types of weird interactions with the enclosing operating 
system. Thus, I suggest simply accepting the compilation overhead as is,
since it's an O(1) overhead and not an issue for large runs. It may not 
feel perfect, but I think it's the most reasonable approach considering 
our resources. Plus, we can bring these limitations to the vendors'
attention with concrete examples.

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel