Hi Charles :)
The BLAS kernels for CUDA and OpenCL are entirely different, actually.
OpenCL kernels rely on a code-generator, and have been auto-tuned. As far
as I know, the CUDA kernels have not been auto-tuned, and don't rely on the
same generation engine as the OpenCL ones. While for BLAS1-2,
Hi,
Such row-wise / column-wise reductions could be generated by the OpenCL
backend, but this won't work on the Host or CUDA backends. Plus, this is not
really maintained at the moment. I would recommend Karl's solution, even
though it won't be optimal when the vector does not fit in the L2
Hey :-)
Worked well on my laptop :-) A couple of suggestions:
- Maybe use layout N-T for GEMM, or perhaps it is already possible to
choose? From my experience, NT-col major (TN row major) always leads to
higher performance on GEMM.
- The plots were hard to read because they were rather small on my laptop.
Hey :)
2014-11-09 10:06 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi guys,
I've updated our roadmap taking into account the latest release:
https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap
Feel free to add your topics and post your wishes :-)
Awesome! Is it like a
I remember us already having a problem with strlen on the cache with your
NVidia SDK, which disappeared when you rebooted. Didn't we?
2014-11-05 16:25 GMT-05:00 Toby St Clere Smithe m...@tsmithe.net:
Toby St Clere Smithe m...@tsmithe.net
writes:
The segfault happens when calling (in
Hey,
Sorry for the late answer. I've been extremely busy with my stats homework
lately.
The caching mechanism indeed doesn't account for the device. This is pretty
easy to add, i.e., append the device name + platform version + platform name
when doing the hashing.
Philippe
2014-11-04 16:12
,
Philippe Tillet phil.til...@gmail.com
writes:
Sorry for the late answer. I've been extremely busy with my stats homework
lately.
The caching mechanism indeed doesn't account for the device. This is pretty
easy to add, i.e., append the device name + platform version + platform name
when
Hey Namik,
Congratulations! :-)
Yes, we really hope that you'll stay with us in this adventure. I personally
really like open-source development because (1) it's really educational, and
(2) it makes me feel free. I think that research/jobs can put a lot of
pressure on me, to the point that it can
Hey,
2014-08-17 11:52 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi,
So it seems like most of the features are ready for ViennaCL 1.6. My
merge from a few days ago (finally) fully integrated the use of
device-specific kernels for BLAS1, BLAS2, BLAS3.
hurray! :-)
The reduction API
https://github.com/viennacl/viennacl-dev/issues/71
https://github.com/viennacl/viennacl-dev/issues/77
https://github.com/viennacl/viennacl-dev/issues/66
https://github.com/viennacl/viennacl-dev/issues/2
Philippe
2014-08-17 19:36 GMT+02:00 Philippe Tillet phil.til...@gmail.com:
So the dense benchmark suite
Hey Namik,
The code looks fine. As a small tip, I would advise using
blas3MatrixSize{A,B,C} = {M, N, K}; it's much more conventional. I would
also suggest removing LU from the benchmark. I only achieve 11 GFLOP/s on
my machine (GEMM peaks at 120 GFLOP/s). It will smash the overall score if
you
Hey!
So it seems like most of the features are ready for ViennaCL 1.6. My merge
from a few days ago (finally) fully integrated the use of device-specific
kernels for BLAS1, BLAS2, BLAS3. The reduction API is still missing,
though, but I think that the priority should be to polish the code, and to
Hey,
The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks
involved. This gets hard to test, so I thought it could be a good idea to
discuss this. Basically, here is how it works:
A = [A1 A2; A3 A4]
B = [B1 B2; B3 B4]
C = [C1 C2; C3 C4]
Where each block is divided according
Hello !
This all looks pretty good. Good job!
2014-08-12 3:40 GMT+02:00 Namik Karovic namik.karo...@gmail.com:
Hi Karl,
I'm fine with splitting things into something like Basic Benchmark and
Expert Benchmark ('view' sounds inappropriate), but as long as both
benchmarks do the same thing,
Hey Toby,
My two cents:
Don't forget that matrix-vector multiplication will still introduce
some round-off errors. I.e., when you are computing
y = A*[1,1,...]
then you are actually computing something like
y' = A*([1,1,...] + eps).
GEMV is backward stable, so you are sure that y' will be
Hi,
It's horrible! As soon as I want to introduce some vectorized types in an
opencl template as simple as AXPY, everything starts exploding.
Well, first things first, I probably need to justify why I think that we
cannot do without double2, float4 in all of our dense kernel templates:
- From my
Hi guys,
So I expect ViennaCL 1.6 to offer some really good performance on CPUs with
the OpenCL backend -- possibly 80% of OpenBLAS / MKL on a Core i7 4770, for
example. As the OpenCL kernel generator and the auto-tuner will get better,
we can hope for further improvements.
This will create a
Hey,
I've noted that the console benchmarks for ViennaCL were quite outdated;
performance for AXPY is reported in FLOP/s, for example. I think it'd be
great to have something compact, all incorporated in a single benchmarking
executable:
===
BLAS [float, full]
Hi Namik,
Good job! It all looks very appealing. I don't have much to say. Just a few
comments:
- I'd rather use the median instead of the average, indeed.
- As for the latency in the expert section, it would be great to also have
an execution time vs size plot, in order to show until when the
if
caching is disabled.
Philippe
2014-07-09 17:53 GMT+02:00 Philippe Tillet phil.til...@gmail.com:
Hey hey,
2014-07-09 14:47 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hey,
Philippe, did you by chance check the impact of the generator
integration on kernel latency? We only have
Hello,
Looking at the roadmap:
https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap
I was concerned with 4 elements:
(1) Hook in external BLAS libraries and use them as a computing backend
(2) Distributed vectors and matrices (multiple devices, possibly mixed
CUDA/OpenCL/OpenMP)
(3)
Hi,
2014-07-08 20:59 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi Philippe,
Looking at the roadmap:
https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap
argl, I forgot to update this after our IRC meeting. The protocol here
defines features for 1.6.0 which are far more
Hey,
After some investigations it looks like the problem is not with the GEMM
kernel but with the way the kernel is enqueued. It fails when A and B are
associated with the same handle in C = alpha*op(A)*op(A) + beta*C... (this
handle-checking feature is to allow for some optimizations in other
Until this is fixed, I disable the use of the generator for GEMM.
2014-07-07 15:00 GMT+02:00 Philippe Tillet phil.til...@gmail.com:
Hey,
After some investigations it looks like the problem is not with the GEMM
kernel but with the way the kernel is enqueued. It fails when A and B
is
less than 12.5% over the ideal case already, but at the same time the
kernel still works for older GPUs with limited amounts of shared memory.
Best regards,
Karli
On 06/26/2014 11:09 PM, Philippe Tillet wrote:
I'll add something. I assume that multiple kernels are launched thanks
Hi,
Unfortunately I won't be available until Tuesday for a meeting. Python and
CUDA-based libraries are widely used by the Machine Learning community. I
also want to push OpenCL forwards, but supporting CUDA through PyViennaCL
would be a very good thing to do, since a lot of researchers think that
I'll be available from Tuesday afternoon on. What about Wednesday, 13:00 UTC
and 15:00 UTC?
2014-06-27 18:30 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hey,
Unfortunately I won't be available until Tuesday for a meeting. Python
and CUDA-based libraries are widely used by the Machine
Hello!
I noticed this in the implementation of multi_inner_prod:
switch (vec_tuple.const_size() - current_index)
{
case 7:
case 6:
case 5:
case 4:
//do stuff
However, there is a test for 5,6,7 so I assume that these
Hey
2014-06-24 12:29 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hey,
If yes, I think that it should be changed
because this easily violates the axioms of a norm: we can have
norm(alpha*v) != |alpha|*norm(v) because of the rounding.
This will usually be the case
sometimes it may coincide)
Philippe
2014-06-17 10:29 GMT+02:00 Toby St Clere Smithe m...@tsmithe.net:
Hey Philippe,
Philippe Tillet phil.til...@gmail.com
writes:
The integration of the generator is going on slowly but safely. Vector
kernels are fully integrated and I'm about to support some
Hey Namik,
2014-05-06 19:43 GMT+02:00 Namik Karovic namik.karo...@gmail.com:
Hello,
Apologies for not replying earlier, I've been quite busy these last two
days.
Don't worry ;)
So far I have been exploring the advantages/disadvantages of using
QML/QtQuick vs traditional widget based
Hi,
2014-05-06 9:38 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi,
Why is data pointless? I'd rather have only a few datapoints on new
hardware out there rather than having absolutely no data at all.
I mean, the data is pretty useful because it tells us about the best
default
Hi,
2014-05-05 9:18 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi,
(CC-ing viennacl-devel, as this is developer-talk ;-) )
Either way, I want to let you know that the generator/auto-tuner is
undergoing significant changes, and that you will, actually, not have to
worry about it for
Hi hi,
2014-05-05 21:49 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi,
Well, I think this is not entirely unrelated. The purpose of the GUI
is still to allow a broader community to feed us with benchmark
data, so somehow the loop over all possible configurations is still
Hi,
2014-04-29 15:59 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi,
So I can't help but bring up this topic :) Is there any reason why
we're using the OpenCL C API instead of the C++ one?
Yes, the reason is simple: The C++ API was standardized quite some time
*after* the
Hi,
2014-04-29 16:54 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi,
It seems like we could save several thousand lines of code (and gain
a lot of clarity) by using the C++ API directly.
Well, I'm not so sure about that actually. I'd more conservatively
Hey Namik,
Congratulations for your acceptance to the GSoC!
I don't know to what extent this blog is customizable, but it would be
nice to have some sub-sections related to some sub-parts of the project, to
clearly distinguish your updates/ideas on the GUI itself from those you'll
have on e.g.
Hello everybody,
After two years in the making, the final specifications for WebCL 1.0 were
released a couple of days ago. It is logically based on OpenCL 1.1 since
ViennaCL doesn't support anything more.
I don't see any clear applications of ViennaCL with that, and I'm
incredibly bad with everything
Hey everybody,
My recent advances on auto-tuning gave birth to a new GSoC idea in my mind.
More exactly, I've come up with something more complete around
(crowd-sourced) auto-tuning and the GUI.
This would include:
- Developing a portable auto-tuning GUI (as of now : BLAS1 / Dense BLAS2 /
Dense
Hi,
I completely agree, concerning matrix-free implementations of the linear
solver.
Their absence is the very reason why I had to reimplement solvers for
UMinTL. Furthermore, some other fancy stopping criteria may be provided.
For example, some algorithms in unconstrained optimization use CG
Hi Karl,
Wow, that's really neat!
I'll fix the warnings for Clang and for generator_blas1-opencl
Philippe
2014-02-14 10:38 GMT+01:00 Karl Rupp r...@iue.tuwien.ac.at:
Hi guys,
in the past few days we worked here in Vienna on setting up an automated
nightly build system based on CTest and
Hello,
So, as of now, the generation of row-wise reduction can be triggered
through the interface:
viennacl::reduce<op_add>(viennacl::row_wise(Mat))
viennacl::reduce<op_max>(viennacl::col_wise(Mat))
viennacl::reduce<op_min>(Vec)
This plugs into a statement under the form:
Hi,
I'll be once more available as a mentor :)
I'll be myself pretty busy with some BLAS2/BLAS3 tuning for Hawaii. I'm
also in favor of ideas of projects which don't require a strong knowledge
of the current codebase, such as the GUI autotuning/benchmarking tool. I
think that ViennaCL could also
Hey,
I think we agree on everything now! Okay, I will generate all the kernels,
this will actually lead to 16 kernels for each cpu-gpu scalar combination,
so 64 small kernels in total. This took time but it was a fruitful
discussion :)
Anyways, my ideas are much clearer now, thanks!
Best
Hello everyone,
I have found this relatively new and interesting PDF file :
http://www.altera.com/literature/hb/opencl-sdk/aocl_optimization_guide.pdf.
I'll read it overnight. This is of course for a mid/long-term
perspective, but there are some remarkable points within, for example (some
teasing
Hey hey Karl,
2014/1/25 Karl Rupp r...@iue.tuwien.ac.at
Hi Phil,
Oh, I get it better now. I am not entirely convinced, though ;)
From my experience, the overhead of the jit launch is negligible
compared to the compilation of one kernel. I'm not sure whether
compiling two kernels
Hello,
I am a bit confused, is there any reason for using reciprocal and
flip_sign, instead of just changing the scalar accordingly?
Best regards,
Philippe
Hi Karl,
2014/1/24 Karl Rupp r...@iue.tuwien.ac.at
Hey,
I am a bit confused, is there any reason for using reciprocal and
flip_sign, instead of just changing the scalar accordingly?
yes (with a drawback I'll discuss at the end): Consider the family of
operations
x = +- y OP1 a +-
Hey,
2014/1/24 Karl Rupp r...@iue.tuwien.ac.at
Hi,
I was in fact wondering why one passed reciprocal_alpha and flip_sign
into the kernel. After thinking more about it, I have noticed that this
permits us to do the corresponding inversion/multiplication within the
kernel, and therefore
Hey hey,
2014/1/25 Karl Rupp r...@iue.tuwien.ac.at
Hi,
I prefer option 3. This would allow for something like :
if (size(x) > 1e5 && stride == 1 && start == 0) {
Here we also need to check the internal_size to fit the vector width
//The following steps are costly for small vectors
Hey,
2014/1/25 Karl Rupp r...@iue.tuwien.ac.at
Hey hey hey,
Convergence depends on what is inside generate_execute() ;-) How is
the problem with alpha and beta residing on the GPU addressed? How
will the batch-compilation look like? The important point is that
for the
Hi,
Oh, I get it better now. I am not entirely convinced, though ;)
From my experience, the overhead of the jit launch is negligible compared
to the compilation of one kernel. I'm not sure whether compiling two
kernels in the same program or two different program creates a big
difference. Plus,
, they are duplicated between
opencl/cuda/openmp . Once this is done, I will probably work towards the
full integration of the micro-scheduler. Can we get rid of op_executor?
Best regards,
Philippe
2013/12/27 Philippe Tillet phil.til...@gmail.com
Hey,
Sorry for the late reply :P I'm supposed to defend my
Hey Karl,
So today I went back to ViennaCL. I tried to move the equivalence
columntrans = rownotrans upwards in the dispatching mechanism but it
turns out to be impossible, because matrix<T, row_major> is not (and should
not be) convertible to matrix<T, column_major>, rendering the underlying
Hey,
Sorry for the late reply :P I'm supposed to defend my MSc in 2 weeks, and I
am yet to start writing my thesis... (I won't have a lot of time to give to
ViennaCL until everything is sorted out)
2013/12/23 Karl Rupp r...@iue.tuwien.ac.at
Hi guys,
Now as 1.5.0 is out, I spent some thoughts
Hey,
I've started back on the generator today, and realized how ugly the
dispatching mechanism was, to take advantage of the equivalencies based on
the fact that
RowMajor + Trans = ColMajor + NoTrans
Actually, I've been wondering : why wouldn't we do this on the whole
codebase? We could
*Sneeks in*
(Seems like it's time to hide a if( rand() < RAND_MAX/2) return;
somewhere in the code where Karl won't find it !)
:D
Philippe
2013/12/19 Karl Rupp r...@iue.tuwien.ac.at
Hi Toby,
please allow for ~1 more day, then 1.5.0 is out and I'm available for
testing :-)
Best
Hey,
2013/12/18 Karl Rupp r...@iue.tuwien.ac.at
Hi.
A short update : I've implemented linkage to CBlas and CuBlas with
dynamic selection.
If activated through VIENNACL_WITH_CUBLAS, one can go back and forth
between cublas and the original backend by doing:
A.blas().gemm(NULL);
Hey Toby,
Excellent ! Thank you !
I'm installing it right away, and I'll test it later tonight.
Philippe
2013/12/17 Toby St Clere Smithe m...@tsmithe.net
Toby St Clere Smithe m...@tsmithe.net
writes:
Yep, looks like the build was successful, so I'll go ahead and make sure
it's all
that option 2 is better, considering that there is already
cuda_handle(), opencl_handle(), cpu_handle() or something similar, if I'm
correct. Any advice?
Best regards,
Philippe
2013/12/15 Philippe Tillet phil.til...@gmail.com
Hi,
2013/12/15 Karl Rupp r...@iue.tuwien.ac.at
Hi,
Yeah
Hey,
2013/12/15 Karl Rupp r...@iue.tuwien.ac.at
Hi again,
While we're at it, let's discuss the dynamic dispatching mechanism we'd
ideally want. I see two options:
(1) A global function pointer table. So, one could for example set:
viennacl::internal_blas::sgemv_ptr =
Hi,
2013/12/15 Karl Rupp r...@iue.tuwien.ac.at
Hey,
I agree. However, it seems to me that setting the implementation for
each matrix would end up being tedious... one table per memory backend
seems to make sense conceptually to me, since the performance (and the
portability) of each
Hi,
2013/12/15 Karl Rupp r...@iue.tuwien.ac.at
Hi,
Yeah, it certainly is a bit tedious. Feel free to only do this for
matrix-matrix multiplications for now, a full operation table is
presumably too much of a refactoring for ViennaCL 1.x.y, but much
better suited for
Hello,
I've just realized that most BLAS implementations don't provide any way to do
strided matrix accesses in the non-leading dimension...! Is this correct?
I was hoping that we could have avoided such special cases, but it seems
like a couple of tests will need to be made.
Philippe
Hello everybody,
I am done implementing :
x = viennacl::reduce<op>(viennacl::rows(A));
x = viennacl::reduce<op>(viennacl::cols(A));
s = viennacl::reduce<op>(x);
In the generator. For now, the ops supported are: add, mult, max, min. I
can't support them all, because I need to provide their neutral
Hello,
I had not noticed that only the first reduction would be executed in this
case, so my arguments were indeed invalid :)
However, I am now even more worried than before ;)
This makes the assumption that the 2-way reduction will always be the best
way to compute an inner-product on any OpenCL
Hi hi,
2013/10/27 Karl Rupp r...@iue.tuwien.ac.at
Hi,
This makes the assumption that the 2-way reduction will always be the
best way to compute an inner-product on any OpenCL device. We want the
reduction-based programs to be device-specific, so these sometimes
truncated operations
Hello,
Now that I'm back to some C++ coding, I want to finish the integration of
viennacl::op_reduce.
I've noticed a lot of different operator overloads for
viennacl::scalar_expression, with basically different implicit
conversions to raw scalar. I'm a bit skeptical here :)
This allows one to handle
A clearer classification :
OPERATION_FUNCTION_SUB_TYPE_FAMILY (norm, prod, inner_prod, etc...)
OPERATION_ELEMENT_FUNCTION_SUB_TYPE_FAMILY (abs, pow, etc)
OPERATION_ELEMENT_OPERATOR_SUB_TYPE_FAMILY (+, ==, <, etc...)
Philippe
2013/10/18 Philippe Tillet phil.til...@gmail.com
Hello,
Currently
Hey,
While we're at it, I'm implementing reductions now.
There are two options here:
template<class OP, class VectorType> reduce(VectorType const & v) {
return scalar_expression<VectorType, OP, reduce_type>(v, OP());
}
or
template<class OP, class VectorType> reduce(VectorType const & v) {
to the same end-tree anyway, which will lead to the same
problem inside the statement...
Philippe
2013/10/18 Philippe Tillet phil.til...@gmail.com
Hey,
While we're at it, I'm implementing reductions now.
There are two options here:
template<class OP, class VectorType> reduce(VectorType
Hi,
It seems like the behavior of scalar_vector, unit_vector etc has changed a
bit since the appearance of the kernel generator.
I am currently extending the API of the generator, with relational
operators. I want to design a specific kernel which checks for X[i] < 0.42,
for all i.
Since operator
Hi hi,
2013/10/16 Karl Rupp r...@iue.tuwien.ac.at
Hi,
It seems like the behavior of scalar_vector, unit_vector etc has changed
a bit since the appearance of the kernel generator.
I am currently extending the API of the generator, with relational
operators. I want to design a specific
Hey hey,
Well, the main problem I have with incorporating implicit_vector_base
inside vector_base is that this sounds like replacing inheritance with
switches on enum :P
However, I think I have found a solution which will satisfy both of us:
viennacl::vector_base already has this constructor:
Hey,
I'll be there!
Philippe
2013/10/2 Karl Rupp r...@iue.tuwien.ac.at
Hi guys,
we haven't had an IRC meeting for quite a while now. I'm finally done
with most of my relocation from the US back to Austria, so I propose to
have our next IRC meeting on Friday, October 4, at 15:00 UTC. Is
have to provide a fair comparison in order to orient the scientists
that are looking for a high-level GPGPU solution.
Philippe
2013/10/3 Toby St Clere Smithe m...@tsmithe.net
Yep, so will I.
Toby
Philippe Tillet phil.til...@gmail.com
writes:
Hey,
I'll be there!
Philippe
Hi everybody :)
Okay, so in the roadmap I've added Reductions support for ViennaCL 1.6 ...
I plan to take care of it for the three backends, but there are several
things to consider here. For now, I will call them reduce, reduce_rows,
reduce_cols. A convenience layer such that reduce(mat.rows())
Hi hi,
2013/8/30 Karl Rupp r...@iue.tuwien.ac.at
Hi Philippe,
About 6 months ago I had heard of a library that also performed
autotuning (http://raijincl.org), but that offered the same performance
as ours back then.
Since then, the performance has *greatly* improved, largely
, 2013 4:14 PM, Philippe Tillet phil.til...@gmail.com wrote:
Hello everybody,
For providing good default GEMM kernels for the Kepler architecture, I
need the help of the community! :)
I'm looking for someone with an NVidia GeForce Kepler graphics card... If
there is such a person here, would he
Hey everyone,
It seems to me that most of the differences between CUDA and OpenCL come
from the respective APIs, but that the kernel code is very similar in the
two cases.
Do you guys think it's possible to easily translate the generated kernel
from OpenCL to CUDA, by just doing one-to-one
Hi,
2013/8/16 Karl Rupp r...@iue.tuwien.ac.at
Hi guys,
the scheduler for kernel fusion makes good progress. Toby, you should be
able to use all of the fundamental dense linear algebra operations now.
There should be only be two blocks of functionality missing:
- Sparse matrices (i.e.
. This should now allow you to build with
`make -j4`
on weaker machines with limited RAM.
Best regards,
Karli
On 08/01/2013 08:35 PM, Philippe Tillet wrote:
Hi everybody,
I have had trouble compiling matrix-test-* for quite some time, but it
has gotten worse over time. The compilation process
Hey everybody,
For a few days, I've been playing around with AMD's CodeXL, the HD5850 and
the generator/autotuner:
- First of all, I want to share something that made me completely crazy.
Avoid:
vector += scalar*vector
in a compute-bound context. After replacing the above by:
vector.s0 +=
Hi again !
The generator code is pushed on the master branch.
2013/7/28 Karl Rupp r...@iue.tuwien.ac.at
Hey,
My preferred option is to pad by default and either to make the
padding a multiple of four or sixteen. However, we need to maintain
a full set of unpadded
Hello everybody,
I'm proud to announce that after about 3 weeks, I've recoded the OpenCL
code generator from scratch to integrate it fully with
viennacl::scheduler::statement.
That being said, I'm entering the point where I need to inquire your
opinion for (many) further design choices. Sorted by