Hello! I have finally started my work at the crossroads between Machine Learning and HPC. This is an excellent example of how PyViennaCL and ViennaCL can interact.
Goal
--------------------
We want to execute a routine (GEMM, GEMV, DOT, FFT, etc.) on some hardware and a set of inputs. For now, the auto-tuner / generator optimizes the routine only with respect to the hardware. I'm working on optimizing it with respect to the properties of the inputs as well (in the case of GEMM: the three sizes involved).

Solution
--------------------
The idea is to run a large enough number of auto-tuning procedures and to record the best profiles for different given inputs (different M, N, K for GEMM). One can then use supervised learning to find the most suitable profile to execute (i.e. the kernel to generate) for new inputs, without re-running the auto-tuning procedure.

Experiments
--------------------
I have carried out about 30 carefully-chosen GEMM auto-tuning procedures for SGEMM on Hawaii, and I can tell that both the size and the shape matter. Basically, if you use the wrong kernel, you may lose up to 20-30% of the performance. Still, 30 is an extremely small number considering that we are spanning three dimensions. I obtain 13 different optimal kernels, and in many cases an optimal kernel appears only once. Things get better if we allow different inputs to share the same optimum when it doesn't hurt performance too much (say, by no more than 5%). For now, the results I have obtained with an SVM classifier seem to make sense, but I think we need between 50 and 100 examples to make it work properly. This is not very tractable as of now, but another part of my research is to find a way to speed up the auto-tuning procedure. Altogether, this is an interesting research direction which could lead to nice performance improvements (on average). I'm not sure whether an SVM is the most appropriate classifier for this, but it is what makes the most sense to me in this particular case.

Discussion
--------------------
There are two very distinct steps in this procedure, which I'll recall for those who don't have an ML background:

-> The training step: this is where the parameters of the classifier are found. All the auto-tuning procedures execute here, and this is what potentially takes forever (a couple of days, perhaps), so we don't care about the overhead. The point is that this is also a separate routine, so there's absolutely no reason to write it in C++! Plus, the whole Machine Learning community uses Python. What we want to do here is to provide a few wrappers in PyViennaCL to generate a kernel for a given profile. From that point, we can re-use the existing work of other researchers to speed up the auto-tuning procedure, and *train* a classifier for input-dependent kernel generation. Once the classifier is trained, we can export the model to a file (most ML libraries allow this). Ideally, we could replace the vendor-specific model file by some header-only C++ source code.

-> The prediction step: this is executed every time a matrix multiplication is carried out. A prediction is made at run-time given the inputs, the hardware, and the model created during the training step. This triggers the generation/compilation of a hardware- and input-specific kernel for optimal performance. I'm not afraid of the prediction overhead if we use C++, since the input is only 3-dimensional.

This is imho a perfect example of how PyViennaCL could be used to increase productivity on the core.
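To make the two steps a bit more concrete, here is a minimal sketch of what the training and prediction logic could look like on the Python side. It assumes scikit-learn and joblib are available; the placeholder (M, N, K) shapes, profile indices, model file name, and the log2 size normalization are purely illustrative assumptions of mine and not part of any existing PyViennaCL API. In the real setup, the shape-to-profile examples would come from the auto-tuner, and the prediction would ultimately live on the C++ side with the exported model; the predict() call below only shows the interface of the trained classifier.

    # Minimal sketch, not PyViennaCL API. Assumes scikit-learn + joblib.
    import joblib
    import numpy as np
    from sklearn.svm import SVC

    # --- Training step (offline, Python) ---------------------------------
    # Each example maps a GEMM input shape (M, N, K) to the index of the
    # best profile found by the auto-tuner for that shape (placeholders).
    X_train = np.array([[1024, 1024, 1024],
                        [4096,  128, 4096],
                        [ 256, 4096,  256]])       # input sizes (M, N, K)
    y_train = np.array([3, 7, 3])                  # best profile indices

    clf = SVC(kernel='rbf')                        # SVM classifier, as discussed
    clf.fit(np.log2(X_train), y_train)             # log2: one plausible size scaling

    joblib.dump(clf, 'gemm_profiles_hawaii.model') # export the trained model

    # --- Prediction step (at run time) ------------------------------------
    clf = joblib.load('gemm_profiles_hawaii.model')
    profile_index = int(clf.predict(np.log2([[2048, 512, 2048]]))[0])
    # profile_index would then select which kernel to generate/compile.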
@Toby: Do you think it would be possible to provide a wrapper for the class inside viennacl/generator/generate.hpp? I would love to do it myself, but I don't know much about how Python wrappers are done...

Philippe