Hi hi,
Yes, the default NVIDIA profile for double precision uses a work group size
of 1024. All of this is checked during the autotuning procedure, so the
profile is guaranteed to work on the hardware it was tuned for.
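
For illustration, the guard boils down to something like this (a rough
sketch in plain OpenCL, not the actual autotuner code; 'device' is a
placeholder handle):

    size_t tuned_size = 1024;   /* work group size from the profile */
    size_t device_max = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_max), &device_max, NULL);
    if (tuned_size > device_max)
        tuned_size = device_max;   /* fall back to what the device supports */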
Meh, it seems we need a couple of additional levels of abstraction to be
on the safe side.
Best regards,
Philippe
2013/8/13 Karl Rupp <r...@iue.tuwien.ac.at>
> Hi again,
>
> thanks, the compilation problem is fixed. Unfortunately, there's still the
> invalid work group size error showing up. Output from viennacl-info:
>
> Address Bits: 32
> Available: 1
> Compiler Available: 1
> Endian Little: 1
> Error Correction Support: 0
> Execution Capabilities: CL_EXEC_KERNEL
> Extensions: cl_khr_byte_addressable_store cl_khr_icd
> cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query
> cl_nv_pragma_unroll cl_khr_global_int32_base_atomics
> cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
> cl_khr_local_int32_extended_atomics cl_khr_fp64
> Global Mem Cache Size: 0 Bytes
> Global Mem Cache Type: CL_NONE
> Global Mem Cacheline Size: 0 Bytes
> Global Mem Size: 1073414144 Bytes
> Host Unified Memory: 0
> Image Support: 1
> Image2D Max Height: 16383
> Image2D Max Width: 4096
> Image3D Max Depth: 2048
> Image3D Max Height: 2048
> Image3D Max Width: 2048
> Local Mem Size: 16384 Bytes
> Local Mem Type: CL_LOCAL
> Max Clock Frequency: 1476 MHz
> Max Compute Units: 30
> Max Constant Args: 9
> Max Constant Buffer Size: 65536 Bytes
> Max Mem Alloc Size: 268353536 Bytes
> Max Parameter Size: 4352 Bytes
> Max Read Image Args: 128
> Max Samplers: 16
> Max Work Group Size: 512
> Max Work Item Dimensions: 3
> Max Work Item Sizes: 512 512 64
> Max Write Image Args: 8
> Mem Base Addr Align: 2048
> Min Data Type Align Size: 128 Bytes
> Name: GeForce GTX 285
> Native Vector Width char: 1
> Native Vector Width short: 1
> Native Vector Width int: 1
> Native Vector Width long: 1
> Native Vector Width float: 1
> Native Vector Width double: 1
> Native Vector Width half: 0
> OpenCL C Version: OpenCL C 1.1
> Platform: 0xbf45c0
> Preferred Vector Width char: 1
> Preferred Vector Width short: 1
> Preferred Vector Width int: 1
> Preferred Vector Width long: 1
> Preferred Vector Width float: 1
> Preferred Vector Width double: 1
> Preferred Vector Width half: 0
> Profile: FULL_PROFILE
> Profiling Timer Resolution: 1000 ns
> Queue Properties: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
> CL_QUEUE_PROFILING_ENABLE
> Single FP Config: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
> CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
> Type: GPU
> Vendor: NVIDIA Corporation
> Vendor ID: 4318
> Version: OpenCL 1.0 CUDA
> Driver Version: 304.43
>
>
> Maybe the work group size exceeds the device maximum of 512? It works
> well on the GTX 470, though...
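>
> A quick way to check would be (sketch using the plain OpenCL API;
> 'kernel' and 'device' are placeholder handles):
>
>     size_t kernel_max = 0;
>     clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
>                              sizeof(kernel_max), &kernel_max, NULL);
>     /* enqueueing with a local size above kernel_max (or above the
>      * device-wide limit of 512) yields CL_INVALID_WORK_GROUP_SIZE */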
>
> Best regards,
> Karli
>
>
>
> On 08/13/2013 11:01 AM, Philippe Tillet wrote:
>
>> Hi hi,
>>
>>
>> 2013/8/13 Karl Rupp <r...@iue.tuwien.ac.at>
>>
>>
>> Hi,
>>
>> > On GPUs with 16kB of shared memory (e.g. GTX 285), the generated
>> > GEMM kernels now exceed the available memory:
>> >
>> > Log: ptxas error : Entry function 'kernel_0x207f4b0_0' uses too
>> > much shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max)
>> >
>> > This is because of
>> > __local float lhs_buf[4128];
>> > which alone takes 4128 floats * 4 bytes = 16512 bytes, more than the
>> > total 16kB (16384 bytes) of shared memory, even before any overhead
>> > for kernel parameters etc. Phil, could you please cut this default
>> > down to only half the work group size, i.e. half the shared memory?
>> >
>> > I also got a CL_INVALID_WORK_GROUP_SIZE in
>> > blas3_prod_double-test-opencl, but this may be a follow-up issue.
>> >
>> > Okay, I will do that :) This brings back another thing I wanted to
>> > discuss. Since some devices of any given vendor may have only 16kB
>> > of shared memory, the vendor-wide defaults will have to be very
>> > conservative. One way to solve this would be some "generation
>> > defaults"... the problem is that this is pretty difficult to achieve
>> > without parsing the device name, which is a bit dirty in my
>> > opinion... Do you think this is a good idea?
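>> >
>> > ("Parsing the device name" would mean something like the sketch
>> > below; illustration only, error handling omitted:)
>> >
>> >     char name[256];
>> >     clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
>> >     /* string-match e.g. "GTX 2", "GTX 4" to guess the generation */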
>>
>> We can directly query the available local device memory (which is the
>> reason why I added all this buffering to the device class). Am I
>> missing something?
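>>
>> For reference, the underlying OpenCL query is just (minimal sketch;
>> the device class caches the same value):
>>
>>     cl_ulong local_mem = 0;
>>     clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
>>                     sizeof(local_mem), &local_mem, NULL);
>>     /* e.g. 16384 bytes on the GTX 285 above */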
>>
>>
>> Yes, we could. But keying the defaults on the combination {vendor,
>> local memory} seems a bit weird to me; I think {vendor, generation}
>> makes more sense, don't you think?
>>
>> Best regards,
>> Philippe
>>
>
>