Hi again,

> So fast_copy still copies the memory and has copying overhead, even
> with MAIN_MEMORY context?
Yes. It's a copy() operation, so it just does what the name suggests.

> Is there a way to do shallow copying (i.e. just pointer initialization)
> to the matrix data buffer? Isn't it what some constructors of matrix or
> matrix_base do?

Yes, you can pass your pointer via the constructors, e.g.
https://github.com/viennacl/viennacl-dev/blob/master/viennacl/matrix.hpp#L721

> What i am getting at, it looks like i am getting a significant overhead
> for just copying -- actually, it seems i am getting double overhead --
> once when i prepare padding and all as required by the internal_size?(),
> and then i pass it into the fast_copy() which apparently does copying
> again, even if we are using host memory matrices.

If you want to wrap your data in a ViennaCL matrix, pass the pointer to
the constructors. If you want to quickly copy your data over to memory
managed by a ViennaCL matrix, use copy() or fast_copy(). From your
description it looks like you are now looking for the constructor calls,
but from your earlier email I thought that you were looking for a
fast_copy().

> all in all, by my estimates this copying back and forth (which, granted,
> is not greatly optimized on our side) takes ~15..17 seconds out of 60
> seconds total when multiplying 10k x 10k dense arguments via ViennaCL.
> I also optimize to -march=haswell and use -ffast-math; without those i
> seem to fall too far behind what R + openblas can do in this test. Then,
> my processing time swells up to 2 minutes without optimizing for
> non-compliant arithmetics.

15 seconds of copying for a 10k-by-10k matrix looks way too much. A
10k-by-10k matrix is 800 MB of data in double precision, so copying it
should not take much more than 100 ms even on a low-range laptop
(10 GB/sec memory bandwidth). Even with multiple matrices and copies you
should stay in the one-second regime.

> If i can wrap the buffer and avoid copying for MAIN_MEMORY context, i'd
> be shaving off another 10% or so of the execution time.
> Which would make me happier, as i probably would be able to beat
> openblas given custom cpu architecture flags.

Why do you expect to beat OpenBLAS? Their kernels are really well
optimized, and for large dense matrix-matrix products you are always
FLOP-limited.

> On the other hand, bidmat (which allegedly uses mkl) does the same test,
> double precision, in under 10 seconds. I can't fathom how, but it does.
> I have a haswell-E platform.

Multiplication of two 10k-by-10k matrices amounts to about 2 TFLOP of
compute in double precision (2 * n^3 with n = 10^4). A Haswell-E machine
provides that in the order of ten seconds, depending on the number of
cores: 2.4 GHz * 4 doubles with AVX * 2 for FMA * 2 FMA units gives a
peak of 38.4 GFLOP/sec per core, and MKL sustains a large fraction of
that. ViennaCL's host backend is not strong on dense matrix-matrix
multiplications (even though we've got some improvements in a pull
request), so for this particular operation you will get better
performance from MKL, OpenBLAS, or libflame.

Best regards,
Karli

> On Tue, Jul 12, 2016 at 9:27 AM, Karl Rupp <[email protected]> wrote:
>
> > Hi,
> >
> > > One question: you mentioned padding for the `matrix` type. When i
> > > initialize the `matrix` instance, i only specify dimensions. how do
> > > I know padding values?
> >
> > If you want to provide your own padded dimensions, consider using
> > matrix_base directly. If you want to query the padded dimensions, use
> > internal_size1() and internal_size2() for the internal number of rows
> > and columns.
> >
> > http://viennacl.sourceforge.net/doc/manual-types.html#manual-types-matrix
> >
> > Best regards,
> > Karli
> >
> > > On Tue, Jul 12, 2016 at 5:53 AM, Karl Rupp <[email protected]> wrote:
> > >
> > > > Hi Dmitriy,
> > > >
> > > > On 07/12/2016 07:17 AM, Dmitriy Lyubimov wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am trying to create some elementary wrappers for VCL in javacpp.
> > > > > Everything goes fine, except i really would rather not use those
> > > > > "cpu" types (std::map, std::vector) and rather initialize
> > > > > matrices directly by feeding row-major or CCS formats.
> > > > >
> > > > > I see that the matrix() constructor accepts this form of
> > > > > initialization, but it really states that it does "wrapping" for
> > > > > the device memory.
> > > >
> > > > Yes, the constructors either create their own memory buffer
> > > > (zero-initialized) or wrap an existing buffer. These are the only
> > > > reasonable options.
> > > >
> > > > > Now, i can create a host matrix() using host memory and row-major
> > > > > packing. This works ok it seems.
> > > > >
> > > > > However, these are still host instances. Can i copy host
> > > > > instances to instances on an opencl context?
> > > >
> > > > Did you look at viennacl::copy() or viennacl::fast_copy()?
> > > >
> > > > > That might be one way of bypassing unnecessary (in my case)
> > > > > complexities of working with the std::vector and std::map
> > > > > classes from the java side.
> > > > >
> > > > > But it looks like there's no copy() variation that would accept
> > > > > a matrix-on-host and a matrix-on-opencl argument (or rather, it
> > > > > of course declares those to be ambiguous, since two methods fit).
> > > >
> > > > If you want to copy your OpenCL data into a viennacl::matrix, you
> > > > may wrap the memory handle (obtained with .elements()) into a
> > > > vector and copy that. If you have plain host data, use
> > > > viennacl::fast_copy() and mind the data layout (padding of
> > > > rows/columns!).
> > > >
> > > > > For compressed_matrix, there seems to be a set() method, but i
> > > > > guess this also requires CCS arrays in the device memory if I
> > > > > use it. Same question: is there a way to send-and-wrap CCS
> > > > > arrays to an opencl device instance of compressed_matrix without
> > > > > using std::map?
> > > >
> > > > Currently you have to use .set() if you want to bypass
> > > > viennacl::copy() and std::map.
> > > >
> > > > I acknowledge that the C++ type system is a pain when interfacing
> > > > from other languages. We will make this much more convenient in
> > > > ViennaCL 2.0.
> > > > The existing interface in ViennaCL 1.x is too hard to fix without
> > > > breaking lots of user code, so we won't invest time in that
> > > > (contributions welcome, though :-) )
> > > >
> > > > Best regards,
> > > > Karli

_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
