Re: [ViennaCL-devel] BLAS3, range, slice, compilation time...

Philippe Tillet Mon, 12 Aug 2013 11:56:39 -0700

Hi Karl,


2013/8/12 Karl Rupp <r...@iue.tuwien.ac.at>

> Hi Xeon Phil ;-)
>
>
> > There are a lot of problems related to coupling the current BLAS3
> > implementation with the kernel generator:
> >
> > - While I think I could add some range support, adding slices will be
> > extremely difficult, and it would probably result in bad performance
> > whatever kernel is used. The most efficient way to do this is probably :
> >  > copy slice to temporary dense
> >  > perform product
> >  > slice copy of the result
>
> Yeah, slices are much harder. Let's stick with the current
> implementation for ranges and slices and only use the generator for the
> full products. We can add more cleverness/performance with 1.5.1.
>

Yes, it does require more cleverness. In particular, adding padding to
ranges does not seem possible to me... Maybe the range kernel will need to
perform some bounds checking.
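
For reference, the three-step fallback for slices quoted above (copy slice to
a temporary dense matrix, perform the product, slice-copy the result) can be
sketched in plain C++ as below. This is only an illustration under stated
assumptions: Dense, Slice, gather, prod and scatter are hypothetical names,
not ViennaCL API, and the product is a naive host-side triple loop standing
in for whatever dense kernel would actually be used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix. Hypothetical helper type, not a ViennaCL class.
struct Dense {
  std::size_t rows, cols;
  std::vector<double> data;
  Dense(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// A slice selects the indices start, start + stride, start + 2*stride, ...
struct Slice {
  std::size_t start, stride, size;
  std::size_t operator[](std::size_t k) const { return start + k * stride; }
};

// Step 1: copy the strided sub-matrix into a contiguous temporary.
Dense gather(const Dense& m, Slice r, Slice c) {
  Dense t(r.size, c.size);
  for (std::size_t i = 0; i < r.size; ++i)
    for (std::size_t j = 0; j < c.size; ++j)
      t(i, j) = m(r[i], c[j]);
  return t;
}

// Step 2: ordinary dense product on the contiguous temporaries.
Dense prod(const Dense& a, const Dense& b) {
  assert(a.cols == b.rows);
  Dense c(a.rows, b.cols);
  for (std::size_t i = 0; i < a.rows; ++i)
    for (std::size_t k = 0; k < a.cols; ++k)
      for (std::size_t j = 0; j < b.cols; ++j)
        c(i, j) += a(i, k) * b(k, j);
  return c;
}

// Step 3: copy the dense result back into the strided destination.
void scatter(const Dense& t, Dense& m, Slice r, Slice c) {
  assert(t.rows == r.size && t.cols == c.size);
  for (std::size_t i = 0; i < r.size; ++i)
    for (std::size_t j = 0; j < c.size; ++j)
      m(r[i], c[j]) = t(i, j);
}
```

The two copies cost O(n^2) against the O(n^3) product, which is why this
route is likely cheaper than teaching the tuned kernel about strides.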


>
> > - Kernels take forever to compile. Particularly, the generated program
> > include a duplicata of each kernel for each device in the associated
> > context:
> >
> > AMD Platform :
> > -> AMD GPU
> > -> Intel CPU
> > The CPU kernel is extremely unrolled, in this case, the program takes up
> > to 2-3seconds to compile, on my Desktop Core i7-4770... We can have some
> > completely crappy profile for the CPU that compiles fast, though, to
> > solve this problem at the expense of OpenCL CPU performance.
>
> Hmm, isn't the default to use just one device per context? In such case
> it takes about 1-2 seconds, which is somewhat reasonable. Maybe there is
> a 'cheaper' alternative in terms of compilation time with only slightly
> reduced performance on the CPU?
>

After looking at the code, it seems that the default ocl::context created
in the backend grabs any device whose type is CL_DEVICE_TYPE_DEFAULT. Most
of the time, this only includes GPUs. In the case of AMD, the CPU is not in
the default current_context(). In the case of NVidia, however, we still
have this pretty long compilation time, but it doesn't really matter since
they cache programs.

More details about the problem on the CPU (6-7 seconds to compile on my
high-end desktop):
Each work-item computes an ms*ns block of the resulting matrix. This block
is stored in registers, and the ms*ks*ns associated operations are unrolled
in the kernel.
CPUs often map one work-item to one thread, which requires ms, ks and
ns to be pretty large. For the best kernel, we therefore have a ridiculous
ms*ks*ns = 16*8*128 operations and 16*128 registers. Poor compiler. :)
I suspect ks to be a useless parameter, included in the #pragma unroll. If
I manage to get rid of this parameter, both the auto-tuning time and the
compiler load will benefit from it.
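
To make the numbers concrete, here is a rough host-side C++ sketch of what
one work-item does. It is not the generated OpenCL kernel: block_product and
its parameters are hypothetical names, the matrices are assumed row-major,
and the loops are left rolled, whereas the generator would unroll them.

```cpp
#include <cassert>
#include <cstddef>

// Multiply-adds in one fully unrolled ms x ks x ns step, and the number of
// accumulators that have to live in registers.
constexpr std::size_t unrolled_ops(std::size_t ms, std::size_t ks, std::size_t ns) {
  return ms * ks * ns;
}
constexpr std::size_t accumulators(std::size_t ms, std::size_t ns) {
  return ms * ns;
}

// The profile mentioned above: ms = 16, ks = 8, ns = 128.
static_assert(unrolled_ops(16, 8, 128) == 16384, "16*8*128 multiply-adds");
static_assert(accumulators(16, 128) == 2048, "16*128 accumulators");

// One work-item: compute an MS x NS block of C = A * B (row-major),
// walking the shared dimension K in chunks of KS.
template <std::size_t MS, std::size_t KS, std::size_t NS>
void block_product(const double* A, const double* B, double* C,
                   std::size_t K, std::size_t lda, std::size_t ldb,
                   std::size_t ldc) {
  assert(K % KS == 0);
  double acc[MS][NS] = {};  // the block that is meant to stay in registers
  for (std::size_t k0 = 0; k0 < K; k0 += KS)
    for (std::size_t k = 0; k < KS; ++k)      // unrolled in the real kernel
      for (std::size_t i = 0; i < MS; ++i)    // unrolled in the real kernel
        for (std::size_t j = 0; j < NS; ++j)  // unrolled in the real kernel
          acc[i][j] += A[i * lda + k0 + k] * B[(k0 + k) * ldb + j];
  for (std::size_t i = 0; i < MS; ++i)
    for (std::size_t j = 0; j < NS; ++j)
      C[i * ldc + j] = acc[i][j];
}
```

In this sketch KS only changes how the K loop is chunked, not the result,
which fits the suspicion that it mostly inflates the unrolled body.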

>
>
> > NVidia platform:
> > -> GTX 470
> > -> Tesla C2050
> > Here, each kernel is long to compile, mainly because of the #pragma
> > unroll directive that almost doubles the performance. Basically, on a
> > remote machine with a core i7 960, it also takes several seconds to
> > compile... Note that this also makes using #pragma unroll in the
> > autotuner a bad idea, since all the kernels may end up taking forever to
> > compile...
> >
> > Each "variation" of BLAS3 generates a different program, whether it is :
> > C = alpha*prod(A,B)
> > C = alpha*prod(rangeA,B)
> > C = alpha*prod(A,B) + C
> > C = alpha*prod(A,B) + D
> > C = element_exp(alpha*prod(A,B))
> > ....
> > C+= alpha*prod(A,B)
> > ...
> > This is the way kernel generation works..
>
> Considering that among the listed kernels a user may need only one or
> two in a given program, I think we can live with this limitation.
>
> On the other hand, what about extracting the kernel sources for the
> standard GEMM case from the generator at first request and placing them
> into a separate 'special' program?
>

Yes, this is how I have reimplemented opencl::prod(), more or less. More on
it in a follow-up mail :P

>
>
> > I guess you see the problem here. Even though one rarely uses all the
> > programs, only the NVidia SDK caches the programs to avoid
> > recompilation. Clearly packing all the possibles blas3 kernels into one
> > is impossible. I don't see any way out of this mess. Handling the
> > binaries and performing the caching by ourselves?
>
> OpenCL 2.0 will offer an IR, so this should cut down compilation times
> quite a bit.
> Handling the binaries ourselves will be pretty messy, as we would have
> all different types of weird interactions with the enclosing operating
> system. Thus, I suggest to simply accept the compilation overhead as is,
> since it's an O(1) overhead and not an issue for large runs. It may not
> feel perfect, but I think it's the most reasonable approach considering
> our resources. Plus, we can bring these limitations to attention at the
> vendors by example.
>

Yes, I agree. But maybe we could consider offering some primitives to let
the user handle caching himself? I could introduce things like
"save_binary" or "load_binary" in 1.6.0's generator. Since we want to make
things transparent, it could even be something like:

  viennacl::ocl::save_binary(std::vector<statement> const & statements,
                             std::string const & filename,
                             viennacl::ocl::context & ctx);
  // returns whether or not the binary was found
  bool viennacl::ocl::load_binary(std::vector<statement> const & statements,
                                  std::string const & filename,
                                  viennacl::ocl::context & ctx);

and call clCreateProgramWithBinary. This way, the user can handle caching
himself if compilation time is a problem.
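
The file-handling half of such primitives could look roughly like the sketch
below. It is deliberately free of OpenCL calls so it stays self-contained:
save_binary_blob and load_binary_blob are hypothetical names, and the
assumption is that the bytes would come from clGetProgramInfo with
CL_PROGRAM_BINARIES and later be handed to clCreateProgramWithBinary,
followed by clBuildProgram.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not ViennaCL API: write a device binary to disk.
// Returns false if the file could not be opened or written.
bool save_binary_blob(const std::vector<unsigned char>& binary,
                      const std::string& filename) {
  std::ofstream out(filename.c_str(), std::ios::binary | std::ios::trunc);
  if (!out) return false;
  out.write(reinterpret_cast<const char*>(binary.data()),
            static_cast<std::streamsize>(binary.size()));
  return static_cast<bool>(out);
}

// Hypothetical helper, not ViennaCL API: read a cached binary back.
// Returns whether or not the binary was found; on a miss the caller would
// fall back to compiling from source and then call save_binary_blob.
bool load_binary_blob(const std::string& filename,
                      std::vector<unsigned char>& binary) {
  std::ifstream in(filename.c_str(), std::ios::binary);
  if (!in) return false;
  binary.assign(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
  return true;
}
```

A real cache would also have to key the file on the device and driver
version, since a binary built for one device is generally not usable on
another.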

Best regards,
Philippe


> Best regards,
> Karli
>
>
>
>
> _______________________________________________
> ViennaCL-devel mailing list
> ViennaCL-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/viennacl-devel
>
