Hey,

alright, we've got some issues to fight ;-)

On GPUs with 16kB of shared memory (e.g. GTX 285), the generated GEMM 
kernels now exceed the available memory:

Log: ptxas error   : Entry function 'kernel_0x207f4b0_0' uses too much 
shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max)

This is because of
     __local float lhs_buf[4128];
which alone exceeds the total 16kB of shared memory: 4128 floats * 4
bytes = 16512 bytes > 16384 bytes, and that is before accounting for
the overhead of kernel parameters, etc. Phil, could you please cut
this default down to half the work group size, i.e. half the shared
memory?
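
A minimal sketch of what I have in mind (assuming the generator emits
a single lhs buffer; the halved size is just an example value):

     __local float lhs_buf[2064];  /* 2064 floats * 4 bytes = 8256
                                      bytes, well below the limit */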

I also got a CL_INVALID_WORK_GROUP_SIZE in
blas3_prod_double-test-opencl, but this may be a follow-up issue.
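
If it turns out not to be, a first diagnostic is to compare the
requested local work size against what the kernel permits on that
device. A sketch using only standard API calls ('kernel' and 'device'
stand for the objects in question):

     size_t max_wg_size;
     clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                              sizeof(size_t), &max_wg_size, NULL);
     /* clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE when
        the local work size exceeds this value or does not divide the
        global work size evenly. */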


>     Yeah, slices are much harder. Let's stick with the current
>     implementation for ranges and slices and only use the generator for the
>     full products. We can add more cleverness/performance with 1.5.1.
>
>
> Yes, it does require more cleverness. In particular, adding padding to
> ranges does not seem possible to me... Maybe the range kernel will need
> to perform some bound-checking

Padding will actually work okay if the range happens to start at one of 
the 128-boundaries and spans all the way to the end of the rows/columns. 
In our QR factorization implementation one can actually guarantee this :-)
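
For illustration, such an aligned range would look as follows (a
sketch, assuming the usual viennacl::range/matrix_range interface and
a padding multiple of 128):

     #include "viennacl/matrix.hpp"
     #include "viennacl/matrix_proxy.hpp"

     viennacl::matrix<float> A(1024, 1024);
     viennacl::range rows(128, A.size1());  // starts at a 128-boundary...
     viennacl::range cols(0,   A.size2());  // ...and spans to the end
     viennacl::matrix_range< viennacl::matrix<float> > A_sub(A, rows, cols);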


>     Hmm, isn't the default to use just one device per context? In that
>     case it takes about 1-2 seconds, which is somewhat reasonable. Maybe
>     there is a 'cheaper' alternative in terms of compilation time with
>     only slightly reduced performance on the CPU?
>
>
> After looking at the code, it seems that the default ocl::context
> created in the backend grabs any device whose type is
> CL_DEVICE_TYPE_DEFAULT. Most of the time, it only includes GPUs. In the
> case of AMD, the CPU is not in the default current_context(). However,
> in the case of NVidia, we still have this pretty long compilation time,
> but it doesn't really matter since they cache programs.

I just double-checked this and found that there was a bug:
   cl_uint device_num
     = std::max(default_device_num_, device_id_array.size());
with default_device_num_ being set to 1 by default. Clearly, the max()
is bogus, so it is now changed to min(), finally resulting in only one
device per context. Users can use context::default_device_num() to
obtain the 'old' behavior with multiple devices per context.
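
Something along these lines should then give back the multi-device
setup (a sketch; the exact call is an assumption derived from the
member name, and it has to happen before the context initializes its
devices):

     viennacl::ocl::context ctx;
     ctx.default_device_num(2);  // grab up to two devices again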



> More details about the problem on the CPU (6-7 seconds to compile on my
> high end desktop):
> Each work-unit computes an mS*nS block in the resulting matrix. This
> block is stored in registers, and the mS*kS*nS associated operations are
> unrolled in the kernel.
> CPUs often map one work-item to one thread... It requires having mS, kS
> and nS pretty large. For the best kernel, we therefore have a ridiculous
> mS*kS*nS = 16*8*128 operations and 16*128 registers. Poor compiler. :)

Hmm, the Intel optimization guide actually suggests that one work group 
represents one thread. Some compiler-magic within a work group then 
generates SSE and AVX instructions.
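
With that mapping, a kernel like the following (a generic sketch, not
our generated GEMM) gets its work-items packed into SIMD lanes, so a
work group of 128 work-items runs as a single thread issuing SSE/AVX
instructions:

     __kernel void saxpy(__global float * y,
                         __global const float * x,
                         float alpha)
     {
       size_t i = get_global_id(0);  /* consecutive work-items end up
                                        in consecutive SIMD lanes */
       y[i] = alpha * x[i] + y[i];
     }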


> I suspect ks to be a useless parameter, included in the #pragma unroll.
> If I manage to get rid of this parameter, both the auto-tuning time and
> the compiler load will benefit from it.

Better to keep kS in, but eventually fix it to a particular value if
our tuning runs show that it doesn't impact performance, e.g. along
the lines of the sketch below.
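
To make the 'fix it to a particular value' option concrete, a sketch
of a generated inner loop with kS baked in (kS = 8 and the register
arrays lhs_val and rhs_val are placeholders):

     #define KS 8                      /* fixed at code generation time */
     #pragma unroll
     for (unsigned int k = 0; k < KS; ++k)
       acc += lhs_val[k] * rhs_val[k];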


> Yes, I agree. But maybe we could consider offering some primitives to
> let the user handle caching himself? I could introduce in 1.6.0's
> generator things like "save_binary" or "load_binary". Since we want to
> make things transparent, it could even be something like:
> viennacl::ocl::save_binary(std::vector<statement> const & statements,
> std::string const & filename, viennacl::ocl::context & ctx)
> bool viennacl::ocl::load_binary(std::vector<statement> const &
> statements, std::string const & filename, viennacl::ocl::context & ctx);
> //returns whether or not the binary was found.
>
> and call clCreateProgramWithBinary. This way, the user can handle
> caching himself if compilation time is a problem.

Fair enough, we can do that. 1.5.0 is too close already; 1.6.0 is
reasonable. By that time we will also have a better idea about OpenCL
2.0, I suppose.
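
For reference, a sketch of what such primitives could boil down to
internally, using only standard OpenCL calls (clGetProgramInfo,
clCreateProgramWithBinary, clBuildProgram). The function names mirror
your proposal; everything else, in particular the plain-file cache and
the single-device handling, is an assumption:

     #include <cstdio>
     #include <string>
     #include <vector>
     #include <CL/cl.h>

     /* Dump the device binary of an already-built program to a file. */
     bool save_binary(cl_program program, std::string const & filename)
     {
       size_t size = 0;
       if (clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                            sizeof(size), &size, NULL) != CL_SUCCESS
           || size == 0)
         return false;
       std::vector<unsigned char> binary(size);
       unsigned char * ptr = &binary[0];
       if (clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                            sizeof(ptr), &ptr, NULL) != CL_SUCCESS)
         return false;
       std::FILE * f = std::fopen(filename.c_str(), "wb");
       if (!f)
         return false;
       std::fwrite(&binary[0], 1, size, f);
       std::fclose(f);
       return true;
     }

     /* Recreate and build a program from a cached binary; returns
        false on a cache miss, so the caller can fall back to building
        from source. */
     bool load_binary(cl_context context, cl_device_id device,
                      std::string const & filename, cl_program & program)
     {
       std::FILE * f = std::fopen(filename.c_str(), "rb");
       if (!f)
         return false;
       std::fseek(f, 0, SEEK_END);
       size_t size = static_cast<size_t>(std::ftell(f));
       std::fseek(f, 0, SEEK_SET);
       std::vector<unsigned char> binary(size);
       if (std::fread(&binary[0], 1, size, f) != size)
       {
         std::fclose(f);
         return false;
       }
       std::fclose(f);
       unsigned char const * ptr = &binary[0];
       cl_int err;
       program = clCreateProgramWithBinary(context, 1, &device,
                                           &size, &ptr, NULL, &err);
       if (err != CL_SUCCESS)
         return false;
       return clBuildProgram(program, 1, &device, "", NULL, NULL)
              == CL_SUCCESS;
     }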

Best regards,
Karli

