Hey,

>         There is some trickery going on with transpositions and layout,
>         but it works for every transpose/layout combination. One can also
>         link A's blas to his own gemm function, provided a tiny wrapper
>         (essentially to ensure signature compatibility)
>
>
>     Cool!
>
>
>
> It is actually interesting to point out that only 4 GEMM kernels are
> needed for any implementation: NN, NT, TN, TT. Then, one can use the
> equivalence Row-Major+N <=> Col-Major+T, and C = AB <=> C^T = B^T.A^T.

Yeah, this is one of the things I want to apply to the backend 
implementations. The current implementation is too repetitive and 
stresses the compiler unnecessarily hard (more on that in my reply to 
your email today).
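
Just to spell out the reduction (a rough sketch only; none of these names 
exist in ViennaCL, it's merely how I picture the dispatch):

#include <algorithm>  // std::swap
#include <utility>    // std::pair, std::make_pair

// Reduce any transpose/layout combination to one of the four
// column-major kernels NN, NT, TN, TT.
enum Trans { N, T };

inline Trans flip(Trans t) { return t == N ? T : N; }

// Returns the (op(A), op(B)) pair for the column-major kernel to call.
// row_major_X says whether X is stored in row-major order.
std::pair<Trans, Trans> pick_kernel(bool row_major_C,
                                    bool row_major_A, Trans tA,
                                    bool row_major_B, Trans tB)
{
  // Row-Major + op  <=>  Col-Major + flipped op
  if (row_major_A) tA = flip(tA);
  if (row_major_B) tB = flip(tB);

  // If C is row-major, compute its column-major view instead:
  // C = A*B  <=>  C^T = B^T * A^T  (exchange A and B, flip both ops)
  if (row_major_C)
  {
    std::swap(tA, tB);
    tA = flip(tA);
    tB = flip(tB);
  }
  return std::make_pair(tA, tB);   // one of NN, NT, TN, TT
}

With that, the per-layout code paths collapse into a single dispatch plus 
four kernels, which should also help with the compilation times.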


>     For our native CUDA implementation it's probably only a matter of
>     porting the results from the OpenCL tuner over. Unfortunately I
>     don't see a good way of doing this with CUDA without a significant
>     penalty on compilation times, because there is no concept of runtime
>     kernel selection in CUDA so far. The performance difference for GEMM
>     of our CPU backend is not surprising, this was never subject to
>     optimization ;-)
>
>
> That's exactly the point of this feature! Optimizing GEMM for CPU is
> pretty complicated, and linking with external BLAS libraries allows us
> to not focus too much on these problems and to just provide a fallback
> implementation for the sake of code portability.

The main 'problem' actually is that the really fast BLAS implementations 
for CPUs go down to the assembly level, which we cannot reasonably 
support across different compilers. As we've seen from our benchmarking 
with MKL, we can sometimes get surprisingly close to peak performance 
with OpenCL, but it's going to be very hard to do the same with a pure, 
portable C/C++ implementation. Just-in-time compilation (e.g. via LLVM) 
might help a lot, but it would still take quite a bit of effort to get going.


> Yes, you're right. However, the types for .blas() currently differ
> across the backends. This is because I chose a low-level interface for
> the Blas wrappers, so the function signatures are slightly different
> (T const * A, vcl_size_t A_internal_size1 ... versus cl_mem
> const A, vcl_size_t A_internal_size1 ...). I can easily change the
> signature to a higher-level one (viennacl::matrix<T> A ...). This is
> probably better, right?

Indeed. We should not let implementation details propagate to the user 
API unless we have good reasons for it.
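
Something along these lines is what I have in mind; just a sketch (blas_t, 
gemm_fn and the exact parameter list are placeholders, not what's in your 
branch):

#include "viennacl/matrix.hpp"

// Sketch of a higher-level hook: the user-provided GEMM receives ViennaCL
// matrices, so sizes, strides and the raw buffer (pointer vs. cl_mem)
// remain an internal detail.
template <typename NumericT>
struct blas_t
{
  // user-provided GEMM: C = alpha * op(A) * op(B) + beta * C
  typedef void (*gemm_fn)(NumericT alpha,
                          viennacl::matrix<NumericT> const & A, bool trans_A,
                          viennacl::matrix<NumericT> const & B, bool trans_B,
                          NumericT beta,
                          viennacl::matrix<NumericT>       & C);

  void    gemm(gemm_fn f)    { gemm_ = f; }   // register a user kernel
  gemm_fn gemm()       const { return gemm_; }

private:
  gemm_fn gemm_;
};

The tiny wrapper a user writes then only adapts this signature to whatever 
BLAS call they want to plug in (extracting sizes/strides and calling e.g. 
cblas_sgemm), without ever touching internal_size1() and friends.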

>     I don't know whether .blas() is the best name for this, because in
>     the future we might also have more non-BLAS operations such as
>     sorting or FFT - maybe we use .operations() to better reflect the
>     operations table?
>
>
> Yes, I also thought about it... I'm not sure how to handle the default
> case: either A.operations().gemm(NULL), or
> A.operations().gemm(viennacl::backend::default()), where a proper
> overload would set the pointer to NULL internally.

I prefer the second because the first does not reflect that a default 
implementation is used.
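
To sketch what I mean (placeholder names throughout; 'default' itself is a 
reserved keyword in C++, so the tag would need a different name):

// Sketch only, all names are placeholders.
namespace viennacl { namespace backend {
  struct default_impl {};          // tag: "use the built-in fallback kernel"
} }

struct operations_table
{
  typedef void (*gemm_fn)();       // stand-in for the real GEMM hook signature

  operations_table() : gemm_(0) {}

  void gemm(gemm_fn f)                       { gemm_ = f; }  // user kernel
  void gemm(viennacl::backend::default_impl) { gemm_ = 0; }  // back to fallback

  gemm_fn gemm_;
};

// e.g.  A.operations().gemm(viennacl::backend::default_impl());

That way the NULL pointer stays an implementation detail and the call site 
clearly states that the built-in implementation is requested.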


>     It seems to me that this is going in a very fruitful direction. Any
>     objections to pushing and extending this for the 1.6.0 release?
>     1.5.0 is essentially done; I'm currently writing the last bits of
>     documentation and resolving some minor warnings in Visual Studio.
>
>
> Yes. This is already pushed in a feature branch; I can try to extend
> it to allow for the list implementation you suggested. There are also a
> couple of changes to the generator on another feature branch, so I'll
> have a lot of stuff to merge :P

I had a couple of fixes (mostly silencing warnings) to apply to the 
current generator in master, so merging will require a bit of effort... 
Let's coordinate the further steps after 1.5.0 is out (maybe tonight).

Best regards,
Karli

