Thank you for your reply. When I get a chance I'll give it a shot. As it 
is, I converted my OpenCL kernel to CUDA, and it ran flawlessly there 
and was quite a bit faster as the problem size increased. Maybe that's 
an Nvidia driver issue.

Nevertheless, for now I have taken ViennaCL out of the loop in this 
project due to a very strange error in the output of the kernel, which 
did not appear under CUDA. In particular, as I increase the problem 
size, the edges of my block of data get corrupted, i.e. the indices 
associated with small values of get_global_id(0) and get_global_id(1).

When I get a chance I will try to put together a small non-proprietary 
example of the effect I'm seeing. I suppose I should also try your 
flat memory copy idea. I do know that I was not satisfied with the 
performance of my kernel and attributed that to all the memory copies, 
but it could be that Nvidia nerfs their OpenCL drivers; I'm not sure.

Actually, I might take a look at your CUDA back end, since I could use 
a tuned matrix multiply, and using cuBLAS is a bit of a pain.

On 09/24/2015 11:49 AM, Karl Rupp wrote:
> Hi Matthew,
>
>> I am at a loss as to how I can use a custom kernel that operates on a
>> component of a matrix.
>> I cannot seem to get a device pointer to the internal buffer in any
>> obvious way (maybe it's somewhere in the docs, but I can't find it).
>> I notice that you can do strange things like this, from the
>> custom-kernels.cpp example.
>
> OpenCL does not have the concept of a device pointer; it only provides 
> device handles. This is where OpenCL differs from CUDA by more than 
> just a change of terminology. You can, however, use the CUDA 
> backend of ViennaCL if you prefer to work with pointer arithmetic.
>
>>
>> viennacl::ocl::enqueue(my_kernel_mul(vec1, vec2, result_mul,
>> static_cast<cl_uint>(vec1.size())));
>>
>> In theory vec1 and vec2 should be float* pointers to a device buffer,
>> according to the type signature of my_kernel_mul().
>> However, vec1, vec2, and result_mul are all ViennaCL vectors. This
>> pattern works for matrices in custom kernels as well.
>
> What happens at the kernel call is that the OpenCL handle is extracted 
> from the object. This is achieved by calling .handle() on the element 
> to get the backend-agnostic ViennaCL memory handle, and fetching the 
> OpenCL handle from it via .opencl_handle(). In short, to get the 
> OpenCL handle from 'vec1', call
>  vec1.handle().opencl_handle()
>
> (use vec1.handle().cuda_handle() if you use the CUDA backend)
>
>
>
>> Unfortunately I cannot work on a sub-block of the matrix. This doesn't
>> work:
>> float *ptr = &(matrix(i,j));
>>
>> In fact I can't even do this:
>> float *ptr = matrix;
>
> Nope, OpenCL defines its own memory space, on which you cannot perform 
> any pointer arithmetic from the host. Also, ViennaCL objects 'live' in 
> main memory, whereas device memory is a separate memory domain, so the 
> address operator will always return the address of the respective proxy 
> object in main memory.
>
>
>> So I'm not really understanding the internals of how this works. I'm
>> more familiar with CUDA where you can work with device pointers and
>> pointer arithmetic.
>> My kernel needs to fill in sub-blocks of the output matrix, so I need
>> to pass in a pointer that has an offset. Is there any mechanism for 
>> this?
>
> I recommend passing the base OpenCL handle to the kernel and working 
> with explicit offsets from there (this is how ViennaCL deals with 
> arbitrary submatrices with strides possibly different from 1). The 
> alternative is to create a sub-buffer with the appropriate offsets 
> directly via the OpenCL API and pass that to the kernel.
>
>
>> My other problem with ViennaCL is that the zero padding and current API
>> seem to force at least three copies to get dense matrix data into and
>> out of the device buffer from a flat float host array with no zero
>> padding. It's such a common use case that I'm surprised there isn't a
>> more efficient method or function to support this.
>
> You can create a matrix without padding from a linear memory buffer. 
> To do so, copy your matrix data into a viennacl::vector v via
>  viennacl::fast_copy(input_begin, input_end, v.begin());
> with appropriate input iterators 'input_begin' and 'input_end'. Then 
> wrap the copied data into the base type 'matrix_base' of a 
> viennacl::matrix:
>
>  viennacl::matrix_base<T> A(v.handle(),
>                             M, 0, 1, M,
>                             N, 0, 1, N);
>
> for a matrix with M rows and N columns. The 'zeros' denote the row and 
> column offsets, respectively, so you could also specify nonzero values 
> for working on a submatrix (this is how viennacl::matrix_range<> and 
> viennacl::matrix_slice<> work).
>
> Why all this zero-padding by default? In most cases one wants to run 
> FLOP-intensive operations on dense matrices on the GPU, so the 
> overhead of the extra data conversion is often small compared to gains 
> from the improved data layout (keep in mind that DRAM bandwidth is at 
> least a factor of 2 better than PCI-Express bandwidth!).
>
> Best regards,
> Karli
>


_______________________________________________
ViennaCL-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-support
