Hi Karl,

2013/8/12 Karl Rupp <r...@iue.tuwien.ac.at>

> Hi Xeon Phil ;-)
> > There are a lot of problems related to coupling the current BLAS3
> > implementation with the kernel generator:
> >
> > - While I think I could add some range support, adding slices will be
> > extremely difficult, and it would probably result in bad performance
> > whatever kernel is used. The most efficient way to do this is probably :
> >  > copy slice to temporary dense
> >  > perform product
> >  > slice copy of the result
> Yeah, slices are much harder. Let's stick with the current
> implementation for ranges and slices and only use the generator for the
> full products. We can add more cleverness/performance with 1.5.1.

Yes, it does require more cleverness. In particular, adding padding to
ranges does not seem possible to me. Maybe the range kernel will need to
perform some bounds checking.
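
To make the three steps quoted above concrete (copy slice to a temporary
dense matrix, perform the product, slice-copy the result back), here is a
minimal host-side sketch. It is not ViennaCL code: Matrix, Slice, gather,
dense_prod and scatter are hypothetical names made up for illustration, and
everything runs on plain row-major std::vector storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration only; not ViennaCL's API.
struct Slice { std::size_t start, stride, size; };

struct Matrix {
  std::size_t rows, cols;
  std::vector<double> data;  // row-major storage
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Step 1: copy the strided sub-matrix into a contiguous temporary.
Matrix gather(const Matrix& m, Slice r, Slice c) {
  assert(r.size == 0 || r.start + (r.size - 1) * r.stride < m.rows);
  assert(c.size == 0 || c.start + (c.size - 1) * c.stride < m.cols);
  Matrix t(r.size, c.size);
  for (std::size_t i = 0; i < r.size; ++i)
    for (std::size_t j = 0; j < c.size; ++j)
      t(i, j) = m(r.start + i * r.stride, c.start + j * c.stride);
  return t;
}

// Step 2: plain dense product on contiguous data (stand-in for the fast kernel).
Matrix dense_prod(const Matrix& a, const Matrix& b) {
  assert(a.cols == b.rows);
  Matrix c(a.rows, b.cols);
  for (std::size_t i = 0; i < a.rows; ++i)
    for (std::size_t k = 0; k < a.cols; ++k)
      for (std::size_t j = 0; j < b.cols; ++j)
        c(i, j) += a(i, k) * b(k, j);
  return c;
}

// Step 3: write the dense result back into the strided destination.
void scatter(const Matrix& t, Matrix& m, Slice r, Slice c) {
  assert(t.rows == r.size && t.cols == c.size);
  assert(r.size == 0 || r.start + (r.size - 1) * r.stride < m.rows);
  assert(c.size == 0 || c.start + (c.size - 1) * c.stride < m.cols);
  for (std::size_t i = 0; i < r.size; ++i)
    for (std::size_t j = 0; j < c.size; ++j)
      m(r.start + i * r.stride, c.start + j * c.stride) = t(i, j);
}
```

The two extra copies cost O(n^2) memory traffic against an O(n^3) product,
which is why this route may end up cheaper than a strided kernel.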

> > - Kernels take forever to compile. In particular, the generated program
> > includes a duplicate of each kernel for each device in the associated
> > context:
> >
> > AMD Platform :
> > -> AMD GPU
> > -> Intel CPU
> > The CPU kernel is extremely unrolled, in this case, the program takes up
> > to 2-3seconds to compile, on my Desktop Core i7-4770... We can have some
> > completely crappy profile for the CPU that compiles fast, though, to
> > solve this problem at the expense of OpenCL CPU performance.
> Hmm, isn't the default to use just one device per context? In such case
> it takes about 1-2 seconds, which is somewhat reasonable. Maybe there is
> a 'cheaper' alternative in terms of compilation time with only slightly
> reduced performance on the CPU?

After looking at the code, it seems that the default ocl::context created
in the backend grabs every device whose type is CL_DEVICE_TYPE_DEFAULT. Most
of the time, that only includes GPUs. In the case of AMD, the CPU is not in
the default current_context(). In the case of NVidia, however, we still
have this pretty long compilation time, but it doesn't really matter since
they cache programs.

More details about the problem on the CPU (6-7 seconds to compile on my
high-end desktop):
Each work-item computes an ms*ns block of the resulting matrix. This block
is stored in registers, and the associated ms*ks*ns operations are unrolled
in the kernel.
CPUs often map one work-item to one thread, which requires ms, ks and
ns to be pretty large. For the best kernel, we therefore have a ridiculous
ms*ks*ns = 16*8*128 operations and 16*128 registers. Poor compiler. :)
I suspect that ks is a useless parameter, already included in the #pragma
unroll. If I manage to get rid of this parameter, both the auto-tuning time
and the compiler load will benefit from it.
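
To put numbers on the figures above, here is a small sketch of the
arithmetic. BlockProfile, unrolled_ops and accumulators are hypothetical
names used only for this illustration; they are not part of the generator.
The ks = 1 case is only my guess at what dropping the parameter would look
like.

```cpp
#include <cstddef>

// Hypothetical description of a register-blocking profile (illustration only).
struct BlockProfile { std::size_t ms, ks, ns; };

// Number of multiply-add statements emitted when the ms*ks*ns block is fully unrolled.
constexpr std::size_t unrolled_ops(BlockProfile p) { return p.ms * p.ks * p.ns; }

// Number of accumulator values kept live for the ms*ns result block.
constexpr std::size_t accumulators(BlockProfile p) { return p.ms * p.ns; }

// Best CPU profile mentioned above: ms = 16, ks = 8, ns = 128.
static_assert(unrolled_ops(BlockProfile{16, 8, 128}) == 16384, "16*8*128");
static_assert(accumulators(BlockProfile{16, 8, 128}) == 2048, "16*128");

// Assumed effect of removing ks (ks = 1): the unrolled body is 8x smaller.
static_assert(unrolled_ops(BlockProfile{16, 1, 128}) == 2048, "16*1*128");
```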

> > NVidia platform:
> > -> GTX 470
> > -> Tesla C2050
> > Here, each kernel is long to compile, mainly because of the #pragma
> > unroll directive that almost doubles the performance. Basically, on a
> > remote machine with a core i7 960, it also takes several seconds to
> > compile... Note that this also makes using #pragma unroll in the
> > autotuner a bad idea, since all the kernels may end up taking forever to
> > compile...
> >
> > Each "variation" of BLAS3 generates a different program, whether it is :
> > C = alpha*prod(A,B)
> > C = alpha*prod(rangeA,B)
> > C = alpha*prod(A,B) + C
> > C = alpha*prod(A,B) + D
> > C = element_exp(alpha*prod(A,B))
> > ....
> > C+= alpha*prod(A,B)
> > ...
> > This is the way kernel generation works..
> Considering that among the listed kernels a user may need only one or
> two in a given program, I think we can live with this limitation.
> On the other hand, what about extracting the kernel sources for the
> standard GEMM case from the generator at first request and placing them
> into a separate 'special' program?

Yes, this is more or less how I have reimplemented opencl::prod(). More on
that in a follow-up mail :P

> > I guess you see the problem here. Even though one rarely uses all the
> > programs, only the NVidia SDK caches the programs to avoid
> > recompilation. Clearly, packing all the possible BLAS3 kernels into one
> > is impossible. I don't see any way out of this mess. Handling the
> > binaries and performing the caching by ourselves?
> OpenCL 2.0 will offer an IR, so this should cut down compilation times
> quite a bit.
> Handling the binaries ourselves will be pretty messy, as we would have
> all different types of weird interactions with the enclosing operating
> system. Thus, I suggest to simply accept the compilation overhead as is,
> since it's an O(1) overhead and not an issue for large runs. It may not
> feel perfect, but I think it's the most reasonable approach considering
> our resources. Plus, we can bring these limitations to attention at the
> vendors by example.

Yes, I agree. But maybe we could consider offering some primitives to let
the user handle caching himself? I could introduce things like "save_binary"
or "load_binary" in 1.6.0's generator. Since we want to make things
transparent, it could even be something like:

viennacl::ocl::save_binary(std::vector<statement> const & statements,
    std::string const & filename, viennacl::ocl::context & ctx)
bool viennacl::ocl::load_binary(std::vector<statement> const & statements,
    std::string const & filename, viennacl::ocl::context & ctx); // returns
    whether or not the binary was found.

and call clCreateProgramWithBinary. This way, the user can handle caching
himself if compilation time is a problem.
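
As a rough sketch of the file-handling half of such primitives (not the
proposed API itself): save_binary_blob and load_binary_blob below are
hypothetical names, and the OpenCL side is deliberately left out. I assume
the bytes would be obtained with clGetProgramInfo(CL_PROGRAM_BINARIES) and
handed back to clCreateProgramWithBinary on a cache hit.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): write a device binary to disk.
// Returns false if the file could not be opened or written.
bool save_binary_blob(const std::vector<unsigned char>& binary,
                      const std::string& filename) {
  std::ofstream out(filename.c_str(), std::ios::binary | std::ios::trunc);
  if (!out) return false;
  out.write(reinterpret_cast<const char*>(binary.data()),
            static_cast<std::streamsize>(binary.size()));
  return out.good();
}

// Hypothetical helper (illustration only): read a device binary back.
// Returns whether or not the binary was found, mirroring load_binary above.
bool load_binary_blob(const std::string& filename,
                      std::vector<unsigned char>& binary) {
  std::ifstream in(filename.c_str(), std::ios::binary);
  if (!in) return false;  // cache miss: caller falls back to compiling from source
  binary.assign(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
  return true;
}
```

A real version would probably also need to key the file on the device and
driver version, since a binary built for one device is not guaranteed to
load on another.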

Best regards,

> Best regards,
> Karli
> ------------------------------------------------------------------------------
> Get 100% visibility into Java/.NET code with AppDynamics Lite!
> It's a free troubleshooting tool designed for production.
> Get down to code-level detail for bottlenecks, with <2% overhead.
> Download for free and get started troubleshooting in minutes.
> http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
> _______________________________________________
> ViennaCL-devel mailing list
> ViennaCL-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/viennacl-devel