Oops, I did "reply" instead of "reply to all" again. :)

---------- Forwarded message ----------
From: Philippe Tillet <phil.til...@gmail.com>
Date: 2013/8/13
Subject: Re: [ViennaCL-devel] BLAS3, range, slice, compilation time...
To: Karl Rupp <r...@iue.tuwien.ac.at>


Hey,


2013/8/13 Karl Rupp <r...@iue.tuwien.ac.at>

> Hey,
>
> alright, we've got some issues to fight ;-)
>
> On GPUs with 16kB of shared memory (e.g. GTX 285), the generated GEMM
> kernels now exceed the available memory:
>
> Log: ptxas error   : Entry function 'kernel_0x207f4b0_0' uses too much
> shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max)
>
> This is because of
>     __local float lhs_buf[4128];
> which at 4128 floats * 4 bytes = 16512 bytes is more than the total 16kB
> of shared memory (already ignoring some overhead for kernel parameters,
> etc.). Phil, could you please cut this default down to only half the
> work group size, i.e. half the shared memory?
>
> I also got a CL_INVALID_WORK_GROUP_SIZE in
> blas3_prod_double-test-opencl, but this may be a follow-up issue.
>
>

Okay, I will do that :) This brings back another thing I wanted to discuss.
Since any device from a given vendor may have as little as 16kB of shared
memory, the vendor defaults will have to be very conservative. One way to
solve this is to have some "generation defaults"... the problem is that
this is pretty difficult to achieve without parsing the device name, which
is a bit dirty in my opinion... Do you think this is a good idea?
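
For illustration, another option would be to size the defaults from the
device's actual local-memory budget instead of parsing names. A minimal
host-side sketch (plain OpenCL API; the function and names are
illustrative, not actual generator code):

    #include <CL/cl.h>
    #include <cstddef>

    // Derive a safe default buffer size (in floats) from the device's
    // real local-memory budget instead of parsing device names.
    std::size_t max_lhs_buf_floats(cl_device_id device)
    {
      cl_ulong local_mem = 16384;  // conservative fallback: 16kB
      clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                      sizeof(local_mem), &local_mem, NULL);
      // Use at most half the budget, as suggested above, leaving room
      // for the rhs buffer and kernel-parameter overhead.
      return static_cast<std::size_t>(local_mem / 2) / sizeof(float);
    }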


>
>>     Yeah, slices are much harder. Let's stick with the current
>>     implementation for ranges and slices and only use the generator for
>>     the full products. We can add more cleverness/performance with 1.5.1.
>>
>>
>> Yes, it does require more cleverness. In particular, adding padding to
>> ranges does not seem possible to me... Maybe the range kernel will need
>> to perform some bound-checking.
>>
>
> Padding will actually work okay if the range happens to start at one of
> the 128-boundaries and spans all the way to the end of the rows/columns. In
> our QR factorization implementation one can actually guarantee this :-)


Yes, but the user may, for example, want to prune just a few rows/columns
of a matrix...
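
To make the bound-checking idea concrete, the generated kernel could guard
its accesses roughly like this (a naive, unblocked OpenCL C sketch; names
are illustrative, this is not what the generator currently emits):

    // Naive bound-checked GEMM over an arbitrary M x N range of C;
    // work-items outside the range simply return, so no padding is needed.
    __kernel void gemm_range_sketch(__global float const * A,
                                    __global float const * B,
                                    __global float * C,
                                    unsigned M, unsigned N, unsigned K,
                                    unsigned lda, unsigned ldb, unsigned ldc)
    {
      unsigned row = get_global_id(0);
      unsigned col = get_global_id(1);
      if (row >= M || col >= N)
        return;  // out-of-range work-items write nothing
      float acc = 0.0f;
      for (unsigned k = 0; k < K; ++k)
        acc += A[row * lda + k] * B[k * ldb + col];
      C[row * ldc + col] = acc;
    }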

>
>
>
>>     Hmm, isn't the default to use just one device per context? In such
>>     a case it takes about 1-2 seconds, which is somewhat reasonable.
>>     Maybe there is a 'cheaper' alternative in terms of compilation time
>>     with only slightly reduced performance on the CPU?
>>
>>
>> After looking at the code, it seems that the default ocl::context
>> created in the backend grabs any device whose type is
>> CL_DEVICE_TYPE_DEFAULT. Most of the time, this only includes GPUs. In
>> the case of AMD, the CPU is not in the default current_context().
>> However, in the case of NVidia, we still have this pretty long
>> compilation time, but it doesn't really matter since they cache
>> programs.
>>
>
> I just double-checked this and found that there was a bug:
>   cl_uint device_num
>    = std::max(default_device_num_, device_id_array.size());
> with default_device_num_ being set to 1 by default. Clearly, the max() is
> bogus, so it's now changed to min(), finally resulting in only one device
> per context. Users can use context::default_device_num() to obtain the
> 'old' behavior with multiple devices per context.
>

Alright, good!
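
Just to spell the fix out for the archives (a sketch; the names follow
your snippet, but the wrapping function is hypothetical):

    #include <CL/cl.h>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    std::size_t devices_for_context(std::size_t default_device_num,
                                    std::vector<cl_device_id> const & device_id_array)
    {
      // max() was bogus: it always selected every available device.
      // min() caps the context at default_device_num (1 by default).
      return std::min(default_device_num, device_id_array.size());
    }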


>
>
>
>> More details about the problem on the CPU (6-7 seconds to compile on my
>> high-end desktop):
>> Each work-unit computes an mS*nS block of the resulting matrix. This
>> block is stored in registers, and the mS*kS*nS associated operations are
>> unrolled in the kernel.
>> CPUs often map one work-item to one thread... It requires having mS, kS
>> and nS pretty large. For the best kernel, we therefore have a ridiculous
>> mS*kS*nS = 16*8*128 = 16384 unrolled operations and 16*128 = 2048
>> registers. Poor compiler. :)
>>
>
> Hmm, the Intel optimization guide actually suggests that one work group
> represents one thread. Some compiler-magic within a work group then
> generates SSE and AVX instructions.



Hmm, I think it depends; I remember reading somewhere that one work-item
was usually mapped to one thread. I also remember their SSE/AVX
auto-vectorization module; they even say that they "scalarize" all the
code and then auto-vectorize it again.
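
Either way, to illustrate why the compiler struggles: the generated kernel
essentially holds an mS x nS accumulator tile in registers and fully
unrolls the kS loop, roughly like this (an OpenCL C sketch with small
illustrative sizes, not the generator's actual output):

    // Register-blocked GEMM micro-kernel shape. With mS = 16, kS = 8,
    // nS = 128 as above, the unrolled body expands to 16*8*128 = 16384
    // multiply-adds plus a 16*128 = 2048-element accumulator tile.
    #define MS 4
    #define KS 2
    #define NS 4

    __kernel void gemm_unrolled_sketch(__global float const * A,
                                       __global float const * B,
                                       __global float * C,
                                       unsigned K, unsigned lda,
                                       unsigned ldb, unsigned ldc)
    {
      unsigned row0 = get_global_id(0) * MS;
      unsigned col0 = get_global_id(1) * NS;
      float acc[MS][NS] = {{0.0f}};  // per-work-item register tile
      for (unsigned k = 0; k < K; k += KS)
      {
        #pragma unroll
        for (unsigned kk = 0; kk < KS; ++kk)
          for (unsigned i = 0; i < MS; ++i)
            for (unsigned j = 0; j < NS; ++j)
              acc[i][j] += A[(row0 + i) * lda + (k + kk)]
                         * B[(k + kk) * ldb + (col0 + j)];
      }
      for (unsigned i = 0; i < MS; ++i)
        for (unsigned j = 0; j < NS; ++j)
          C[(row0 + i) * ldc + (col0 + j)] = acc[i][j];
    }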

>
>
>
>> I suspect kS is a useless parameter, subsumed by the #pragma unroll.
>> If I manage to get rid of this parameter, both the auto-tuning time and
>> the compiler load will benefit from it.
>>
>
> Better to keep kS in, but eventually fix it to a particular value if we
> find during our tuning runs that it doesn't impact performance.


>
>
>> Yes, I agree. But maybe we could consider offering some primitives to
>> let the user handle caching himself? I could introduce, in 1.6.0's
>> generator, things like "save_binary" or "load_binary". Since we want to
>> make things transparent, it could even be something like:
>> void viennacl::ocl::save_binary(std::vector<statement> const & statements,
>>   std::string const & filename, viennacl::ocl::context & ctx);
>> bool viennacl::ocl::load_binary(std::vector<statement> const & statements,
>>   std::string const & filename, viennacl::ocl::context & ctx);
>> // returns whether or not the binary was found
>>
>> and call clCreateProgramWithBinary. This way, the user can handle
>> caching himself if compilation time is a problem.
>>
>
> Fair enough, we can do that. 1.5.0 is too close, but 1.6.0 is
> reasonable. By that time we will also have a better idea about OpenCL
> 2.0, I suppose.
>

Alright :)
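
For reference, the host-side mechanics those primitives would wrap look
roughly like this (plain OpenCL, single device, error handling omitted;
file handling and names are illustrative):

    #include <CL/cl.h>
    #include <cstddef>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Dump the binary of an already-built program to 'filename'.
    void save_binary_sketch(cl_program program, std::string const & filename)
    {
      std::size_t size = 0;
      clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                       sizeof(size), &size, NULL);
      std::vector<unsigned char> binary(size);
      unsigned char * data = binary.empty() ? NULL : &binary[0];
      clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                       sizeof(data), &data, NULL);
      std::ofstream file(filename.c_str(), std::ios::binary);
      file.write(reinterpret_cast<char const *>(data), size);
    }

    // Rebuild a program from a cached binary; returns NULL if not found.
    cl_program load_binary_sketch(cl_context ctx, cl_device_id dev,
                                  std::string const & filename)
    {
      std::ifstream file(filename.c_str(), std::ios::binary);
      if (!file)
        return NULL;
      std::vector<unsigned char> binary((std::istreambuf_iterator<char>(file)),
                                        std::istreambuf_iterator<char>());
      std::size_t size = binary.size();
      unsigned char const * data = &binary[0];
      cl_program program = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                     &data, NULL, NULL);
      clBuildProgram(program, 1, &dev, NULL, NULL, NULL);  // still required
      return program;
    }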

Best regards,
Philippe


> Best regards,
> Karli
>
>