Thank you for your reply. When I get a chance, I'll give it a shot. As it is, I converted my OpenCL kernel to CUDA and it ran flawlessly there, and it was quite a bit faster as the problem size increased. Maybe that's an Nvidia driver issue.
Nevertheless, for now, I have ViennaCL out of the loop in this project due to a very strange error in the output of the kernel, which was not seen in CUDA. In particular, as I increase the problem size, the edges of my block of data get corrupted, i.e. the indices associated with small values of get_global_id(0) and get_global_id(1). When I get a chance I will try to put together a small non-proprietary example of the effects I'm seeing. I suppose I should also try your flat memory copy idea. I do know that I was not satisfied with the performance of my kernel and attributed it to all the memory copies, but it could be that Nvidia nerfs their OpenCL drivers; not sure. Actually, I might take a look at your CUDA backend, since I could use a tuned matrix multiply, and using cuBLAS is a bit of a pain.

On 09/24/2015 11:49 AM, Karl Rupp wrote:
> Hi Matthew,
>
>> I am at a loss how I can use a custom kernel that operates on a
>> component of a matrix. I cannot seem to get a device pointer to the
>> internal buffer in any obvious way (maybe it's somewhere in the docs,
>> but I can't find it). I notice that you can do strange things like
>> this, from the custom-kernels.cpp example.
>
> OpenCL does not have the concept of a device pointer, it only provides
> device handles. This is where OpenCL differs by more than just a
> change of terminology from CUDA. You can, however, use the CUDA
> backend of ViennaCL if you prefer to work with pointer arithmetic.
>
>> viennacl::ocl::enqueue(my_kernel_mul(vec1, vec2, result_mul,
>>                        static_cast<cl_uint>(vec1.size())));
>>
>> In theory vec1 and vec2 should be float * pointers to a device
>> buffer, according to the type signature of my_kernel_mul(). However
>> vec1, vec2, and result_mul are all ViennaCL vectors. This pattern
>> works for matrices as well in custom kernels.
>
> What happens at the kernel call is that the OpenCL handle is extracted
> from the object.
> This is achieved by calling .handle() on the element to get the
> backend-agnostic ViennaCL memory handle, and fetching the OpenCL
> handle from it via .opencl_handle(). In short, to get the OpenCL
> handle from 'vec1', call
>
>   vec1.handle().opencl_handle()
>
> (use vec1.handle().cuda_handle() if you use the CUDA backend)
>
>> Unfortunately I cannot work on a sub-block of the matrix. This
>> doesn't work:
>>
>>   float *ptr = &(matrix(i,j));
>>
>> In fact I can't even do this:
>>
>>   float *ptr = matrix;
>
> Nope, OpenCL defines its own memory space for which you cannot perform
> any pointer arithmetic. Also, ViennaCL objects 'live' in main memory,
> whereas device memory is a separate memory domain, so the address
> operator will always return the address of the respective proxy
> objects in main memory.
>
>> So I'm not really understanding the internals of how this works. I'm
>> more familiar with CUDA, where you can work with device pointers and
>> pointer arithmetic. My kernel needs to fill in sub-blocks of the
>> output matrix, so I need to pass in a pointer that has an offset. Is
>> there any mechanism for this?
>
> I recommend passing the base OpenCL handle to the kernel and working
> with explicit offsets from there (this is how ViennaCL deals with
> arbitrary submatrices with strides possibly different from 1). The
> alternative is to create a subbuffer with appropriate offsets directly
> via the OpenCL API and pass that to the kernel.
>
>> My other problem with ViennaCL is that the zero padding and current
>> API seem to force at least 3 copies to get dense matrix data into
>> and out of the device buffer from a flat float host array with no
>> zero padding. It's such a common use case that I'm surprised there
>> isn't some more efficient method or function to support this.
>
> You can create a matrix without padding from a linear memory buffer.
> To do so, copy your matrix data into a viennacl::vector v via
>
>   viennacl::fast_copy(input_begin, input_end, v.begin());
>
> with appropriate input iterators 'input_begin' and 'input_end'. Then
> wrap the copied data into the base type 'matrix_base' of a
> viennacl::matrix:
>
>   viennacl::matrix_base<T> A(v.handle(),
>                              M, 0, 1, M,
>                              N, 0, 1, N);
>
> for a matrix with M rows and N columns. The zeros denote the row and
> column offsets, respectively, so you could also specify nonzero
> values for working on a submatrix (this is how
> viennacl::matrix_range<> and viennacl::matrix_slice<> work).
>
> Why all this zero padding by default? In most cases one wants to run
> FLOP-intensive operations on dense matrices on the GPU, so the
> overhead of the extra data conversion is often small compared to the
> gains from the improved data layout (keep in mind that GPU DRAM
> bandwidth is at least a factor of 2 better than PCI-Express
> bandwidth!).
>
> Best regards,
> Karli

_______________________________________________
ViennaCL-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-support
