Hey,

>>> There is some trickery going on with transpositions and layout, but it
>>> works for every transpose/layout combination. One can also link A's blas
>>> to his own gemm function, provided a tiny wrapper (essentially to ensure
>>> signature compatibility)
>>
>> Cool!
>
> It is actually interesting to point out that only 4 GEMM kernels are
> needed for any implementation: NN, NT, TN, TT. Then, one can use the
> equivalence Row-Major+N <=> Col-Major+T, and C = AB <=> C^T = B^T.A^T.
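To make the equivalence quoted above concrete, here is a minimal C++ sketch. It is not ViennaCL code: the names Trans, gemm_col_major and gemm_row_major are hypothetical, and a single flag-driven reference loop stands in for the four tuned kernels (NN, NT, TN, TT). The only point is that a row-major GEMM needs no kernels of its own, because a row-major buffer read as column-major is the transposed matrix, and C = AB <=> C^T = B^T.A^T.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical transposition flag (illustration only, not a ViennaCL type).
enum class Trans { N, T };

// Reference column-major GEMM: C = op(A) * op(B), where C is M x N and the
// inner dimension is K. The two flags select one of the cases NN, NT, TN, TT.
void gemm_col_major(Trans ta, Trans tb,
                    std::size_t M, std::size_t N, std::size_t K,
                    const double* A, std::size_t lda,
                    const double* B, std::size_t ldb,
                    double* C, std::size_t ldc)
{
  assert(A != nullptr && B != nullptr && C != nullptr);
  for (std::size_t j = 0; j < N; ++j) {
    for (std::size_t i = 0; i < M; ++i) {
      double acc = 0.0;
      for (std::size_t k = 0; k < K; ++k) {
        // op(A)(i,k) and op(B)(k,j) in column-major storage
        const double a = (ta == Trans::N) ? A[i + k * lda] : A[k + i * lda];
        const double b = (tb == Trans::N) ? B[k + j * ldb] : B[j + k * ldb];
        acc += a * b;
      }
      C[i + j * ldc] = acc;
    }
  }
}

// Row-major GEMM expressed through the column-major one: swap the operands
// (and their flags) and swap M with N, which computes C^T = op(B)^T * op(A)^T
// directly into the row-major buffer of C.
void gemm_row_major(Trans ta, Trans tb,
                    std::size_t M, std::size_t N, std::size_t K,
                    const double* A, std::size_t lda,
                    const double* B, std::size_t ldb,
                    double* C, std::size_t ldc)
{
  gemm_col_major(tb, ta, N, M, K, B, ldb, A, lda, C, ldc);
}
```

Note that the flags are passed through unchanged but in swapped order, so every row-major transpose/layout combination lands on one of the same four column-major cases.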

Yeah, this is one of the things I want to apply for the backend
implementations. The current implementation is too repetitive and
stresses the compiler unnecessarily hard (more on that in my reply to
your email today).

>> For our native CUDA implementation it's probably only a matter of
>> porting the results from the OpenCL tuner over. Unfortunately I
>> don't see a good way of doing this with CUDA without a significant
>> penalty on compilation times, because there is no concept of runtime
>> kernel selection in CUDA so far. The performance difference for GEMM
>> of our CPU backend is not surprising, this was never subject to
>> optimization ;-)
>
> That's exactly the point of this feature! Optimizing GEMM for CPU is
> pretty complicated, and linking with external BLAS libraries allows us
> not to focus too much on these problems, and to just provide a fallback
> implementation for the sake of code portability.

The main 'problem' actually is that the really fast BLAS implementations
for CPUs go down to the assembly level, which we cannot reasonably
support across different compilers. As we've seen from our benchmarking
with MKL, we can sometimes get surprisingly close to peak performance
with OpenCL, but it's going to be very hard to do the same with a pure,
portable C/C++ implementation. Just-in-time compilation might help a lot
(LLVM), but this would still require a bunch of effort to get it going.

> Yes, you're right. However, the types for .blas() are as of now
> different across the backends. This is because I chose a low-level
> interface for the BLAS wrappers, therefore the signatures of the
> functions are slightly different (T const * A, vcl_size_t
> A_internal_size1, ... versus cl_mem const A, vcl_size_t
> A_internal_size1, ...). I can easily change the signature to a
> higher-level one (viennacl::matrix<T> A, ...). This is probably
> better, right?

Indeed. We should not let implementation details propagate to the user
API unless we have good reasons for it.
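
As a rough illustration of keeping the low-level signature out of the user API, here is a small C++ sketch. Everything in it is an assumption made for the example: host_matrix is a hypothetical stand-in (not viennacl::matrix<T>), and raw_gemm_fn, naive_gemm and gemm are made-up names whose parameter lists only loosely follow the pointer-plus-internal-size style quoted above. The user-facing function takes matrices only, unpacks the raw pointers and padded sizes itself, and uses a portable fallback when no external implementation is supplied.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical dense row-major host matrix with padded rows.
template <typename T>
struct host_matrix {
  std::size_t size1;           // logical number of rows
  std::size_t size2;           // logical number of columns
  std::size_t internal_size2;  // padded row length (leading dimension)
  std::vector<T> data;

  host_matrix(std::size_t rows, std::size_t cols, std::size_t padding = 0)
    : size1(rows), size2(cols), internal_size2(cols + padding),
      data(rows * (cols + padding), T(0)) {}

  T& operator()(std::size_t i, std::size_t j) {
    assert(i < size1 && j < size2);
    return data[i * internal_size2 + j];
  }
  T const& operator()(std::size_t i, std::size_t j) const {
    assert(i < size1 && j < size2);
    return data[i * internal_size2 + j];
  }
};

// Low-level signature (illustrative): raw pointers plus internal sizes.
template <typename T>
using raw_gemm_fn = void (*)(T const* A, std::size_t A_internal_size2,
                             T const* B, std::size_t B_internal_size2,
                             T* C, std::size_t C_internal_size2,
                             std::size_t M, std::size_t N, std::size_t K);

// Portable fallback with the low-level signature: C = A * B, row-major.
template <typename T>
void naive_gemm(T const* A, std::size_t A_internal_size2,
                T const* B, std::size_t B_internal_size2,
                T* C, std::size_t C_internal_size2,
                std::size_t M, std::size_t N, std::size_t K)
{
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j) {
      T acc = T(0);
      for (std::size_t k = 0; k < K; ++k)
        acc += A[i * A_internal_size2 + k] * B[k * B_internal_size2 + j];
      C[i * C_internal_size2 + j] = acc;
    }
}

// User-facing signature: matrices only. A null impl selects the fallback.
template <typename T>
void gemm(host_matrix<T> const& A, host_matrix<T> const& B, host_matrix<T>& C,
          raw_gemm_fn<T> impl = nullptr)
{
  if (A.size2 != B.size1 || C.size1 != A.size1 || C.size2 != B.size2)
    throw std::invalid_argument("gemm: incompatible matrix dimensions");
  if (impl == nullptr)
    impl = &naive_gemm<T>;
  impl(A.data.data(), A.internal_size2,
       B.data.data(), B.internal_size2,
       C.data.data(), C.internal_size2,
       A.size1, B.size2, A.size2);
}
```

An external BLAS would be hooked in by passing a tiny wrapper with the raw_gemm_fn signature as the last argument; the caller's code does not change otherwise.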

>> I don't know whether .blas() is the best name for this, because in
>> the future we might also have more non-BLAS operations such as
>> sorting or FFT - maybe we use .operations() to better reflect the
>> operations table?
>
> Yes, I also thought about it... I'm not sure how to handle the default
> case, A.operations().gemm(NULL), but I guess
> A.operations().gemm(viennacl::backend::default()) could work, where a
> proper overload would set the pointer to NULL internally.

I prefer the second because the first does not reflect that a default
implementation is used.

>> It seems to me that this is going in a very fruitful direction. Any
>> objections to pushing and extending this for the 1.6.0 release?
>> 1.5.0 is essentially done, I'm currently writing the last bits of
>> documentation and resolving some minor warnings on Visual Studio.
>
> Yes. This is already pushed in a feature branch, I can try to extend
> it to allow for the list implementation you suggested. There are also a
> couple of changes in the generator on another feature branch, so I'll
> have a lot of stuff to merge :P

I had a couple of fixes (mostly silencing warnings) to apply to the
current generator in master, so merging will require a bit of effort...
Let's coordinate the further steps after 1.5.0 is out (maybe tonight).

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel