Hey, alright, we've got some issues to fight ;-)
On GPUs with 16 kB of shared memory (e.g. GTX 285), the generated GEMM
kernels now exceed the available memory. Log:

  ptxas error : Entry function 'kernel_0x207f4b0_0' uses too much shared
  data (0x40a0 bytes + 0x10 bytes system, 0x4000 max)

This is because of

  __local float lhs_buf[4128];

which alone is more than the total 16 kB of shared memory (already
ignoring some overhead for kernel parameters, etc.). Phil, could you
please cut this default down to only half the work group size, i.e. half
the shared memory?

I also got a CL_INVALID_WORK_GROUP_SIZE in blas3_prod_double-test-opencl,
but this may be a follow-up issue.

> Yeah, slices are much harder. Let's stick with the current
> implementation for ranges and slices and only use the generator for the
> full products. We can add more cleverness/performance with 1.5.1.
>
> Yes, it does require more cleverness. In particular, adding padding to
> ranges does not seem possible to me... Maybe the range kernel will need
> to perform some bound-checking.

Padding will actually work okay if the range happens to start at one of
the 128-boundaries and spans all the way to the end of the rows/columns.
In our QR factorization implementation one can actually guarantee this :-)

> Hmm, isn't the default to use just one device per context? In such a
> case it takes about 1-2 seconds, which is somewhat reasonable. Maybe
> there is a 'cheaper' alternative in terms of compilation time with only
> slightly reduced performance on the CPU?
>
> After looking at the code, it seems that the default ocl::context
> created in the backend grabs any device whose type is
> CL_DEVICE_TYPE_DEFAULT. Most of the time, it only includes GPUs. In the
> case of AMD, the CPU is not in the default current_context(). However,
> in the case of NVidia, we still have this pretty long compilation time,
> but it doesn't really matter since they cache programs.
I just double-checked this and found that there was a bug:

  cl_uint device_num = std::max(default_device_num_, device_id_array.size());

with default_device_num_ being set to 1 by default. Clearly, the max() is
bogus, so it is now changed to min(), finally resulting in only one
device per context. Users can use context::default_device_num() to obtain
the 'old' behavior with multiple devices per context.

> More details about the problem on the CPU (6-7 seconds to compile on my
> high-end desktop):
> Each work-item computes an mS*nS block in the resulting matrix. This
> block is stored in registers, and the ms*ks*ns associated operations
> are unrolled in the kernel.
> CPUs often map one work-item to one thread... It requires having ms, ks
> and ns pretty large. For the best kernel, we therefore have a
> ridiculous ms*ks*ns = 16*8*128 operations and 16*128 registers. Poor
> compiler. :)

Hmm, the Intel optimization guide actually suggests that one work group
represents one thread. Some compiler magic within a work group then
generates SSE and AVX instructions.

> I suspect ks to be a useless parameter, included in the #pragma unroll.
> If I manage to get rid of this parameter, both the auto-tuning time and
> the compiler load will benefit from it.

Better to keep ks in, but eventually fix it to a particular value if we
find throughout our tuning runs that it doesn't impact performance.

> Yes, I agree. But maybe we could consider offering some primitives to
> let the user handle caching himself? I could introduce in 1.6.0's
> generator things like "save_binary" or "load_binary". Since we want to
> make things transparent, it could even be something like:
>
>   void viennacl::ocl::save_binary(std::vector<statement> const & statements,
>                                   std::string const & filename,
>                                   viennacl::ocl::context & ctx);
>
>   bool viennacl::ocl::load_binary(std::vector<statement> const & statements,
>                                   std::string const & filename,
>                                   viennacl::ocl::context & ctx);
>   // returns whether or not the binary was found
> and call clBuildProgramWithBinary. This way, the user can handle
> caching himself if compilation time is a problem.

Fair enough, we can do that. 1.5.0 is too close already; 1.6.0 is
reasonable. By that time we will also have a better idea about OpenCL
2.0, I suppose.

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel