Hi hi,

Yes, the default NVIDIA profile for double precision uses a work group size
of 1024... All of this is checked during the autotuning procedure, so the
result will work on the hardware it's tuned for...
Meh, it seems we need a couple of additional levels of abstraction to be
safe.
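To make the problem concrete, here is a minimal sketch of the kind of validity check the tuner would have to run against the limits a device actually reports. The profile fields, the 32x32 split of the 1024 work items, and the profile_fits() helper are all hypothetical, not ViennaCL's actual tuning API:

```python
# Illustrative sketch only: the profile layout and profile_fits() are
# hypothetical, not ViennaCL's actual data structures.

def profile_fits(profile, device):
    """Check a tuned GEMM profile against the limits the device reports."""
    work_group = profile["local_size_0"] * profile["local_size_1"]
    local_mem = (profile["lhs_tile_elems"] + profile["rhs_tile_elems"]) \
                * profile["scalar_size"]
    return (work_group <= device["max_work_group_size"]
            and local_mem <= device["local_mem_size"])

# Limits the GTX 285 reports (from the viennacl-info dump below):
gtx285 = {"max_work_group_size": 512, "local_mem_size": 16384}

# The failing default from this thread: 1024 work items (assumed here to
# be a 32 x 32 split) and the 4128-float lhs buffer from the ptxas error,
# i.e. 4128 * 4 = 16512 bytes of local memory.
default_profile = {"local_size_0": 32, "local_size_1": 32,
                   "lhs_tile_elems": 4128, "rhs_tile_elems": 0,
                   "scalar_size": 4}

print(profile_fits(default_profile, gtx285))  # False on this device
```

On the GTX 285 both constraints fail: 1024 work items against a device maximum of 512, and 16512 bytes of local memory against the 16384 available.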

Best regards,
Philippe


2013/8/13 Karl Rupp <r...@iue.tuwien.ac.at>

> Hi again,
>
> thanks, the compilation problem is fixed. Unfortunately, there's still the
> invalid work group size error showing up. Output from viennacl-info:
>
> Address Bits:                  32
> Available:                     1
> Compiler Available:            1
> Endian Little:                 1
> Error Correction Support:      0
> Execution Capabilities:        CL_EXEC_KERNEL
> Extensions:                    cl_khr_byte_addressable_store cl_khr_icd
> cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query
> cl_nv_pragma_unroll cl_khr_global_int32_base_atomics
> cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
> cl_khr_local_int32_extended_atomics cl_khr_fp64
> Global Mem Cache Size:         0 Bytes
> Global Mem Cache Type:         CL_NONE
> Global Mem Cacheline Size:     0 Bytes
> Global Mem Size:               1073414144 Bytes
> Host Unified Memory:           0
> Image Support:                 1
> Image2D Max Height:            16383
> Image2D Max Width:             4096
> Image3D Max Depth:             2048
> Image3D Max Height:            2048
> Image3D Max Width:             2048
> Local Mem Size:                16384 Bytes
> Local Mem Type:                CL_LOCAL
> Max Clock Frequency:           1476 MHz
> Max Compute Units:             30
> Max Constant Args:             9
> Max Constant Buffer Size:      65536 Bytes
> Max Mem Alloc Size:            268353536 Bytes
> Max Parameter Size:            4352 Bytes
> Max Read Image Args:           128
> Max Samplers:                  16
> Max Work Group Size:           512
> Max Work Item Dimensions:      3
> Max Work Item Sizes:           512 512 64
> Max Write Image Args:          8
> Mem Base Addr Align:           2048
> Min Data Type Align Size:      128 Bytes
> Name:                          GeForce GTX 285
> Native Vector Width char:      1
> Native Vector Width short:     1
> Native Vector Width int:       1
> Native Vector Width long:      1
> Native Vector Width float:     1
> Native Vector Width double:    1
> Native Vector Width half:      0
> OpenCL C Version:              OpenCL C 1.1
> Platform:                      0xbf45c0
> Preferred Vector Width char:   1
> Preferred Vector Width short:  1
> Preferred Vector Width int:    1
> Preferred Vector Width long:   1
> Preferred Vector Width float:  1
> Preferred Vector Width double: 1
> Preferred Vector Width half:   0
> Profile:                       FULL_PROFILE
> Profiling Timer Resolution:    1000 ns
> Queue Properties:             CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
> CL_QUEUE_PROFILING_ENABLE
> Single FP Config:              CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
> CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
> Type:                          GPU
> Vendor:                        NVIDIA Corporation
> Vendor ID:                     4318
> Version:                       OpenCL 1.0 CUDA
> Driver Version:                304.43
>
>
> Maybe the work group size exceeds 512? It works well on the GTX 470,
> though...
>
> Best regards,
> Karli
>
>
>
> On 08/13/2013 11:01 AM, Philippe Tillet wrote:
>
>> Hi hi,
>>
>>
>> 2013/8/13 Karl Rupp <r...@iue.tuwien.ac.at>
>>
>>
>>     Hi,
>>
>>      >     On GPUs with 16kB of shared memory (e.g. GTX 285), the
>> generated
>>      >     GEMM kernels now exceed the available memory:
>>      >
>>      >     Log: ptxas error   : Entry function 'kernel_0x207f4b0_0' uses
>> too
>>      >     much shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max)
>>      >
>>      >     This is because of
>>      >          __local float lhs_buf[4128];
>>      >     which alone is 16512 bytes, more than the total 16 kB of
>>      >     shared memory (already ignoring some overhead for kernel
>>      >     parameters, etc.). Phil, could you please cut this default
>>      >     down to half the work group size, i.e. half the shared memory?
>>      >     I also got a CL_INVALID_WORK_GROUP_SIZE in
>>      >     blas3_prod_double-test-opencl, but this may be a follow-up
>> issue.
>>      >
>>      >
>>      >
>>      > Okay, I will do that :) This brings back another thing I wanted to
>>      > discuss. Since some devices from a given vendor have only 16 kB of
>>      > shared memory, the vendor defaults will have to be very
>>      > conservative. One way to solve this is to have "generation
>>      > defaults"... the problem is that this is pretty difficult to
>>      > achieve without parsing the device name, which is a bit dirty in my
>>      > opinion... Do you think this is a good idea?
>>
>>     We can directly query the available local device memory (which is the
>>     reason why I added all this buffering to the device class). Am I
>> missing
>>     something?
>>
>>
>> Yes, we could. But having the combination {vendor, local memory} seems a
>> bit weird to me; I think {vendor, generation} makes more sense, don't
>> you think?
>>
>> Best regards,
>> Philippe
>>
>
>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
