Hello,

Looking at the roadmap:
https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap

I focused on four items:
(1) Hook in external BLAS libraries and use them as a computing backend
(2) Distributed vectors and matrices (multiple devices, possibly mixed
CUDA/OpenCL/OpenMP)
(3) Support for reductions (vector reduction, row-wise reduction, column-wise
reduction). Naive OpenMP/CUDA implementations, but integrated into the kernel
generator for OpenCL.
(4) Full integration of the micro-scheduler and the generator.

Needless to say, this seems overly ambitious!
I had done a prototype for (1), but quickly realized that it would be
pretty complicated to make it stable and robust with respect to devices,
contexts, etc. Plus, the generator now gives the same (DENSE!) performance
as CuBlas on NVidia GPUs (for Fermi, at least) and clAmdBlas on AMD GPUs.
Linking external libraries could still give us very good performance on
OpenMP/CUDA, as well as for sparse linear algebra on OpenCL. This is
interesting, but it is also a good amount of work!
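Just to make (1) concrete, here is roughly what the hook boils down to for the
host (OpenMP) backend: if the operands live in host memory, forward the GEMM to
an external BLAS instead of our own kernel. The function name and the dispatch
point are my own illustration, not an existing ViennaCL interface; the hard part
is doing the equivalent robustly for OpenCL/CUDA buffers across devices and
contexts.

    // Hypothetical sketch: route C = alpha*A*B + beta*C on host buffers
    // to an external CBLAS implementation (OpenBLAS, ATLAS, MKL, ...).
    #include <cstddef>
    #include <cblas.h>

    void external_blas_gemm(std::size_t M, std::size_t N, std::size_t K,
                            double alpha,
                            double const * A,   // row-major M x K host buffer
                            double const * B,   // row-major K x N host buffer
                            double beta,
                            double       * C)   // row-major M x N host buffer
    {
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  static_cast<int>(M), static_cast<int>(N), static_cast<int>(K),
                  alpha, A, static_cast<int>(K),
                         B, static_cast<int>(N),
                  beta,  C, static_cast<int>(N));
    }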

(2) will also require a huge amount of work. Plus, I think it is dangerous
to do that when we're not even sure how we handle ViennaCL on a single
device (considering input-dependent kernels, for example). I'd say we
should postpone this.

I'll do (3). It's not a lot of work and the kernel generator already
supports it. We just need to add an API.
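To give an idea of the "naive OpenMP" part of (3), here is a minimal sketch of a
row-wise reduction y_i = sum_j A(i,j) on a row-major host buffer. The function
name is illustrative only; the OpenCL path would instead go through the kernel
generator, and the actual public API is exactly what still needs to be decided.

    #include <cstddef>
    #include <vector>

    // naive OpenMP row-wise sum of a row-major M x N matrix stored in A
    std::vector<double> row_sum_naive(std::vector<double> const & A,
                                      std::size_t M, std::size_t N)
    {
      std::vector<double> y(M, 0.0);
      #pragma omp parallel for
      for (long i = 0; i < static_cast<long>(M); ++i)
      {
        double acc = 0.0;
        for (std::size_t j = 0; j < N; ++j)
          acc += A[static_cast<std::size_t>(i) * N + j];
        y[static_cast<std::size_t>(i)] = acc;
      }
      return y;
    }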

(4) is where I've spent and will spend most of my time. The kernel
generator is now fully integrated for all the vector operations, all the
matrix-vector operations (except rank-1 updates) and most of the dense
matrix operations (all but LU, FFT and in-place triangular substitution).
While the database is not populated yet, recent benchmarks suggest very good
performance (on par with CuBlas on a GTX 470, and 80% of peak on an R9 290x).
I think it is necessary to push forward in this direction and make ViennaCL
1.6 a performance-based release.
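For readers who haven't followed the generator work, here is the kind of
user-level expression this covers, using only standard ViennaCL types (sizes and
values are arbitrary); the point is that a compound right-hand side like the
first statement can be emitted as a single generated kernel rather than a chain
of temporaries.

    #include <viennacl/vector.hpp>
    #include <viennacl/matrix.hpp>
    #include <viennacl/linalg/prod.hpp>

    int main()
    {
      std::size_t N = 1024;
      viennacl::vector<double> x(N), y(N), z(N);
      viennacl::matrix<double> A(N, N);

      // vector operation: one AXPY-like kernel for the whole right-hand side
      x = 2.0 * y + 3.0 * z;

      // matrix-vector operation (GEMV); rank-1 updates are the noted exception
      y = viennacl::linalg::prod(A, x);

      return 0;
    }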

I've been very motivated to work on the kernel generator recently, and
simply don't feel like working on (1) or (2) at the moment. Now, there are
two options for (4):
4.1 - Implementing the kernel fusion mechanism inside the scheduler.
4.2 - Input-dependent kernels, and performance prediction.

While I could help with 4.1, I don't feel like I could do this task alone,
because I don't have sufficient knowledge of the backend. Plus, it
implies getting rid of op_executor(), and I'm not sure how I could do that,
either!
I do feel up to 4.2, though. I feel like ViennaCL 1.6 should be a
performance-oriented release, and an (input + device)-dependent kernel
selection mechanism is something we have to do!
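To make 4.2 a bit more concrete, here is a rough sketch of what such a selection
could look like: a database keyed by (device name, problem-size bucket) that
returns the generator profile to use, with a conservative fallback. All type and
field names here are hypothetical and only meant to illustrate the idea; the
real mechanism would live inside the generator and its device database.

    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>

    struct kernel_profile
    {
      std::size_t local_size;   // work-group size
      std::size_t simd_width;   // vector width used by the generated kernel
    };

    // bucket problem sizes so that nearby sizes reuse the same profile
    std::size_t size_bucket(std::size_t n)
    {
      return n < (1 << 14) ? 0 : (n < (1 << 20) ? 1 : 2);
    }

    typedef std::map<std::pair<std::string, std::size_t>, kernel_profile>
            profile_database;

    kernel_profile select_profile(profile_database const & db,
                                  std::string const & device_name,
                                  std::size_t problem_size)
    {
      profile_database::const_iterator it
        = db.find(std::make_pair(device_name, size_bucket(problem_size)));
      if (it != db.end())
        return it->second;
      kernel_profile fallback = { 128, 1 };   // conservative default profile
      return fallback;
    }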

Any thoughts on how the roadmap could/should be rearranged?

Philippe