Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

Karl Rupp Tue, 13 Aug 2013 09:43:08 -0700

Hi again,

thanks, the compilation problem is fixed. Unfortunately, there's still 
the invalid work group size error showing up. Output from viennacl-info:


Address Bits:                  32
Available:                     1
Compiler Available:            1
Endian Little:                 1
Error Correction Support:      0
Execution Capabilities:        CL_EXEC_KERNEL
Extensions:                    cl_khr_byte_addressable_store cl_khr_icd
                               cl_khr_gl_sharing cl_nv_compiler_options
                               cl_nv_device_attribute_query cl_nv_pragma_unroll
                               cl_khr_global_int32_base_atomics
                               cl_khr_global_int32_extended_atomics
                               cl_khr_local_int32_base_atomics
                               cl_khr_local_int32_extended_atomics cl_khr_fp64
Global Mem Cache Size:         0 Bytes
Global Mem Cache Type:         CL_NONE
Global Mem Cacheline Size:     0 Bytes
Global Mem Size:               1073414144 Bytes
Host Unified Memory:           0
Image Support:                 1
Image2D Max Height:            16383
Image2D Max Width:             4096
Image3D Max Depth:             2048
Image3D Max Height:            2048
Image3D Max Width:             2048
Local Mem Size:                16384 Bytes
Local Mem Type:                CL_LOCAL
Max Clock Frequency:           1476 MHz
Max Compute Units:             30
Max Constant Args:             9
Max Constant Buffer Size:      65536 Bytes
Max Mem Alloc Size:            268353536 Bytes
Max Parameter Size:            4352 Bytes
Max Read Image Args:           128
Max Samplers:                  16
Max Work Group Size:           512
Max Work Item Dimensions:      3
Max Work Item Sizes:           512 512 64
Max Write Image Args:          8
Mem Base Addr Align:           2048
Min Data Type Align Size:      128 Bytes
Name:                          GeForce GTX 285
Native Vector Width char:      1
Native Vector Width short:     1
Native Vector Width int:       1
Native Vector Width long:      1
Native Vector Width float:     1
Native Vector Width double:    1
Native Vector Width half:      0
OpenCL C Version:              OpenCL C 1.1
Platform:                      0xbf45c0
Preferred Vector Width char:   1
Preferred Vector Width short:  1
Preferred Vector Width int:    1
Preferred Vector Width long:   1
Preferred Vector Width float:  1
Preferred Vector Width double: 1
Preferred Vector Width half:   0
Profile:                       FULL_PROFILE
Profiling Timer Resolution:    1000 ns
Queue Properties:              CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
                               CL_QUEUE_PROFILING_ENABLE
Single FP Config:              CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
                               CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
Type:                          GPU
Vendor:                        NVIDIA Corporation
Vendor ID:                     4318
Version:                       OpenCL 1.0 CUDA
Driver Version:                304.43


Maybe the work group size exceeds 512? It works well on the GTX 470, 
though...
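
One way to test this hypothesis on the host side is to compare the launch configuration against the limits printed above before enqueueing the kernel. The following is only a minimal sketch: `device_limits`, `launch_status` and `check_launch` are hypothetical names, not ViennaCL API, and the limits are assumed to be filled in from the viennacl-info listing. In OpenCL 1.x, CL_INVALID_WORK_GROUP_SIZE is reported when the work group exceeds the device maximum or when the global size is not a multiple of the local size; a single dimension that is too large is reported separately as CL_INVALID_WORK_ITEM_SIZE.

```cpp
#include <cassert>
#include <cstddef>

// Limits as printed by viennacl-info (hypothetical struct, not ViennaCL API).
struct device_limits
{
  std::size_t max_work_group_size;     // "Max Work Group Size"
  std::size_t max_work_item_sizes[3];  // "Max Work Item Sizes"
};

enum launch_status
{
  launch_ok,
  dim_too_large,    // would map to CL_INVALID_WORK_ITEM_SIZE
  not_divisible,    // would map to CL_INVALID_WORK_GROUP_SIZE
  group_too_large   // would map to CL_INVALID_WORK_GROUP_SIZE
};

// Checks a 3D launch configuration against the device limits.
launch_status check_launch(device_limits const & dev,
                           std::size_t const global[3],
                           std::size_t const local[3])
{
  std::size_t group_size = 1;
  for (int d = 0; d < 3; ++d)
  {
    assert(local[d] > 0);
    if (local[d] > dev.max_work_item_sizes[d])
      return dim_too_large;
    if (global[d] % local[d] != 0)
      return not_divisible;
    group_size *= local[d];  // total work items per work group
  }
  if (group_size > dev.max_work_group_size)
    return group_too_large;
  return launch_ok;
}
```

With the GTX 285 limits (512 and 512 x 512 x 64), a local size of 32 x 32 x 1 is rejected as `group_too_large`, whereas 16 x 16 x 1 passes. The check does not cover the per-kernel limit reported by clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE, which can be lower than the device maximum.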

Best regards,
Karli


On 08/13/2013 11:01 AM, Philippe Tillet wrote:
> Hi hi,
>
>
> 2013/8/13 Karl Rupp <r...@iue.tuwien.ac.at>
>
>     Hi,
>
>      >     On GPUs with 16kB of shared memory (e.g. GTX 285), the generated
>      >     GEMM kernels now exceed the available memory:
>      >
>      >     Log: ptxas error   : Entry function 'kernel_0x207f4b0_0' uses too
>      >     much shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max)
>      >
>      >     This is because of
>      >          __local float lhs_buf[4128];
>      >     which is more than the total 16kB of shared memory (already
>      >     ignoring some overhead for kernel parameters, etc.). Phil, could
>      >     you please cut this default down to only half the work group
>      >     size, i.e. half the shared memory?
>      >
>      >     I also got a CL_INVALID_WORK_GROUP_SIZE in
>      >     blas3_prod_double-test-opencl, but this may be a follow-up issue.
>      >
>      >
>      >
>      > Okay, I will do that :) This brings back another thing I wanted to
>      > discuss. Since any device for a given vendor can have 16kB of shared
>      > memory, this means that the vendor defaults will actually have to be
>      > very conservative. A way to solve this issue is to have some
>      > "generation defaults"... the problem is that it is pretty difficult
>      > to achieve without parsing the device name, which is a bit dirty in
>      > my opinion... Do you think this is a good idea?
>
>     We can directly query the available local device memory (which is the
>     reason why I added all this buffering to the device class). Am I missing
>     something?
>
>
> Yes, we could. But having the combination {vendor, local memory} seems a
> bit weird to me, I think {vendor, generation} makes more sense, don't
> you think?
>
> Best regards,
> Philippe
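
For reference, the arithmetic behind the ptxas error quoted above: `lhs_buf` alone takes 4128 * 4 = 16512 bytes (0x4080), which already exceeds the 0x4000 = 16384 bytes of local memory on the GTX 285. Selecting the buffer size from the queried local memory, as suggested in the thread, could look roughly like the sketch below. The helper names and the 64-byte reserve for kernel parameters are assumptions made for illustration, not ViennaCL code; on the host, the local memory size would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by a __local buffer holding 'elements' entries.
inline std::size_t local_buffer_bytes(std::size_t elements,
                                      std::size_t element_size)
{
  return elements * element_size;
}

// Hypothetical helper (not ViennaCL API): halve the requested buffer until
// the buffer plus a reserve fits into the local memory reported by the
// device. Returns the number of elements that fit (possibly 0).
inline std::size_t fit_local_buffer(std::size_t requested_elements,
                                    std::size_t element_size,
                                    std::size_t local_mem_size,
                                    std::size_t reserved_bytes)
{
  assert(element_size > 0);
  std::size_t n = requested_elements;
  while (n > 0 &&
         local_buffer_bytes(n, element_size) + reserved_bytes > local_mem_size)
    n /= 2;
  return n;
}
```

With the 16384 bytes from the listing, `fit_local_buffer(4128, 4, 16384, 64)` yields 2064 floats, i.e. half the current default, and 1032 elements for double. On a device reporting 49152 bytes of local memory (an assumed value, not taken from the listing), the full 4128 floats fit unchanged.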


