Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
Hi Charles :)

The BLAS kernels for CUDA and OpenCL are actually entirely different. The OpenCL kernels rely on a code generator and have been auto-tuned; as far as I know, the CUDA kernels have not been auto-tuned and don't rely on the same generation engine as the OpenCL ones. While the difference should not be very significant for BLAS1-2, for GEMM it is entirely possible to observe a huge difference.

Philippe

2015-07-31 12:04 GMT-07:00 Charles Determan cdeterma...@gmail.com:

Greetings,

Brief background: I am developing a series of R packages to bring ViennaCL to the R community. I have had success with the development of my gpuR package (https://github.com/cdeterman/gpuR), which relies on the OpenCL backend of ViennaCL (housed in the package RViennaCL). I am hoping to submit to CRAN in the coming weeks now that the latest stable ViennaCL version has just been released.

Naturally, I wanted a companion package with a CUDA backend. This is now the gpuRcuda package (https://github.com/cdeterman/gpuRcuda). It appeared to work successfully, as most of the code is the same. However, my initial benchmarks are showing very dismal performance with the CUDA backend. I was wondering if someone on this list would be willing to have a look at my code to see why the CUDA code performs so much worse. I had thought that, given an NVIDIA card (GeForce GTX 970), CUDA would provide improved speed, but the benchmarks are showing performance at least 5-fold slower than CPU-based R multiplication. Even the 'float' type matrix multiplication is slower than R (which only has double type support!).

The sgemm CUDA file is (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu) and the associated C++ file is (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp).

On another note, I have tried making the two packages completely independent and the performance is still very poor with CUDA. I really appreciate any help others could provide troubleshooting this. I have truly run out of ideas as to why the code has such poor performance.

Regards, Charles
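To isolate the ViennaCL GEMM call outside of R, a minimal timing sketch along the following lines can be compiled once with VIENNACL_WITH_OPENCL and once with VIENNACL_WITH_CUDA to compare the two backends on the same card (the sizes and the scalar_matrix initialization are illustrative choices, not Charles's actual benchmark):

#include <viennacl/matrix.hpp>
#include <viennacl/linalg/prod.hpp>
#include <viennacl/backend/memory.hpp>

#include <chrono>
#include <iostream>

int main()
{
  std::size_t N = 2048;
  // scalar_matrix is just a cheap way to get initialized operands
  viennacl::matrix<float> A = viennacl::scalar_matrix<float>(N, N, 1.0f);
  viennacl::matrix<float> B = viennacl::scalar_matrix<float>(N, N, 2.0f);
  viennacl::matrix<float> C(N, N);

  C = viennacl::linalg::prod(A, B);   // warm-up (triggers the OpenCL JIT)
  viennacl::backend::finish();        // wait until the device is idle

  std::chrono::high_resolution_clock::time_point t0 = std::chrono::high_resolution_clock::now();
  C = viennacl::linalg::prod(A, B);
  viennacl::backend::finish();        // wait for the kernel to complete
  double sec = std::chrono::duration<double>(std::chrono::high_resolution_clock::now() - t0).count();

  std::cout << 2.0 * N * N * N / sec * 1e-9 << " GFLOP/s" << std::endl;
  return 0;
}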
Re: [ViennaCL-devel] Column-wise kernels?
Hi,

Such row-wise / column-wise reductions could be generated by the OpenCL backend, but this won't work on the host or CUDA backends. Plus, this is not really maintained at the moment. I would recommend Karl's solution, even though it won't be optimal when the vector does not fit in the L2 cache of the OpenCL device (Maxwell, for example, has 2MB of L2 cache), as the current algorithm for GEMV accesses the entire vector get_num_groups(0) times.

Philippe

2015-07-27 9:40 GMT-07:00 Karl Rupp r...@iue.tuwien.ac.at:

Excellent, thank you. I thought that would be the way to go initially, but I hesitated because of concerns about additional temporary objects taking up memory when matrices begin to get larger. It certainly is simpler this way.

Just pushed: https://github.com/viennacl/viennacl-dev/commit/4063c941235d46804cd448db7ddecf0c3238548f

Yeah, it's a bit of a trade-off: sure, one could optimize the summation kernel, but this also implies more code to maintain. On the other hand, I'm not aware (which, of course, does not rule out a possible existence) of a scenario where such summation routines are the performance bottleneck.

Glad to hear that 1.7.0 is nearly completed. Does that mean we should expect a formal release soon?

Yep. Expect the release on Wednesday.

Best regards, Karli

On Mon, Jul 27, 2015 at 9:57 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:

Hi Charles,

I am working on writing some additional OpenCL kernels (potentially to incorporate into ViennaCL) which involve column-wise reductions. A simple case would simply be the sum of each column of a matrix. However, I am having an extremely difficult time getting my kernel correct (reductions are tricky to me). That said, after searching for some resources I came across an old post on SourceForge referring to column-wise kernels (http://sourceforge.net/p/viennacl/mailman/message/27542552/) with ViennaCL. This leads me to my primary question: are there such kernels already in ViennaCL that I have overlooked?

Yes ;-) Have a look here at how row-wise sums reduce to a standard matrix-vector product: https://sourceforge.net/p/viennacl/discussion/1143678/thread/38e942a0/

That is, in order to compute a row-sum and a column-sum you can use

row_sum = prod(A, ones);
col_sum = prod(trans(A), ones);

In an hour or two I will push convenience functions for summation, fixing the only remaining issue for the 1.7.0 release: https://github.com/viennacl/viennacl-dev/issues/127

If not, are there any examples or resources you would recommend to help learn this topic? I have tried searching further, but the only thing I can really find is a reduction of an entire matrix (which is relatively simple), as opposed to by column or row.

At this point I can only recommend thinking about how such operations can be recast in terms of (standard) linear algebra. For example, row- and column-wise updates to a matrix are special cases of the more general

A += outer_prod(u, v);

operation (rank-1 updates). I'll improve the documentation in that direction.

Best regards, Karli
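In code, the recipe above boils down to the following sketch (the float element type and the scalar_vector initializer for the ones-vector are just one possible setup):

#include <viennacl/matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/prod.hpp>

// Row- and column-wise sums recast as matrix-vector products.
void sums(viennacl::matrix<float> const & A)   // A is M x N, assumed filled
{
  viennacl::vector<float> ones_n = viennacl::scalar_vector<float>(A.size2(), 1.0f);
  viennacl::vector<float> ones_m = viennacl::scalar_vector<float>(A.size1(), 1.0f);

  // row_sum[i] = sum_j A(i,j);  col_sum[j] = sum_i A(i,j)
  viennacl::vector<float> row_sum = viennacl::linalg::prod(A, ones_n);
  viennacl::vector<float> col_sum = viennacl::linalg::prod(viennacl::trans(A), ones_m);
}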
Re: [ViennaCL-devel] ViennaCL Benchmark GUI 1.0.0 Release Candidate
Hey :-)

Worked well on my laptop :-) A couple of suggestions:

- Maybe use layout N-T for GEMM, or perhaps it is already possible to choose? From my experience, NT col-major (TN row-major) always leads to higher performance on GEMM.
- The plots were hard to read because they are rather small on my laptop. I would love to be able to make the plot fullscreen, or to display the data as a table when I click on a curve. I don't know how easy this is to do with Qt Creator, though...

Apart from this, I'm impressed! This is very user-friendly and detailed. This is my first try of the benchmark GUI in a long, long time, so I hope it brought some perspective.

Philippe

2014-11-24 2:23 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi, sorry, my email got stuck in the queue over night. Thanks to Namik for already fixing the 'automatic upload' box. :-)

Best regards, Karli

On 11/23/2014 09:51 PM, Karl Rupp wrote:

Hi guys,

a release candidate for the benchmark GUI is available for download; I'd appreciate any testing - particularly of the Windows version:

** Windows **
http://viennaclbenchmark.sourceforge.net/ViennaCLBenchmark-1.0.0-RC.zip
This is a self-contained package which is ready to launch after unzipping. It only requires OpenCL to be installed system-wide or to be available in the PATH environment variable.

** Linux **
http://viennaclbenchmark.sourceforge.net/ViennaCLBenchmark-1.0.0-Linux-RC.gz
Requires qt4 to be available on the system (Ubuntu: apt-get install libqt4, Arch Linux: pacman -Su qt4). On some distributions the webkit component needs to be installed separately (Ubuntu: apt-get install libqtwebkit4, Arch Linux: pacman -Su qtwebkit). Make sure libOpenCL.so can be found system-wide, or run ldconfig accordingly.

A few smaller issues are still left and will be addressed tomorrow. @Namik: Do you have some time to fix the layout on the start screen for multiple OpenCL devices (see screenshot)? Adding a 'scrollable' property might do the trick already...

Best regards, Karli
Re: [ViennaCL-devel] Roadmap update
Hey :)

2014-11-09 10:06 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi guys, I've updated our roadmap taking into account the latest release: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap Feel free to add your topics and post your wishes :-)

Awesome! Is it like a Christmas present list? Can we post any wish? I'd like a pony, actually. :D

The 1.6.1 release is scheduled for the week of November 17-21, for which we will provide a new fast kernel right when it is presented at the Supercomputing conference. I had hoped I could get my hands on a GTX 970 or GTX 980, but I wasn't able to. If any developer has access to such hardware, it would be great to let us know, so that we can get optimized kernels for this hardware, and possibly compare against cuBLAS, before SC14.

My personal main goal for 1.7.0 is to reduce the use of Boost.uBLAS as much as possible and to have a fast, entirely GPU-based AMG preconditioner (similar to what is in CUSP). At the same time, I'd like to promote shorter release cycles: 1.6.0 was released about a year after 1.5.0, which keeps quite a number of completed features stuck in the pipeline for too long.

I've added mine. Rather modest: better auto-tuning, and more devices supported. I am directing my efforts towards my specialization for dense BLAS on OpenCL, which will hopefully get integrated in the 2.0.0 release.

Maybe there will be a 1.8.0 release as well, which will still follow the current header-only model. However, we may also switch to ViennaCL 2.0.0 right after the 1.7.x series in order to better target languages other than C++ (most notably C and Fortran, due to their widespread use in HPC).

I will post what I think is reasonable, although most of my thoughts go towards ViennaCL 2.0. As I said, I started today to rewrite the OpenCL layer of ViennaCL using CL/cl.hpp and dynamic layout + datatype (the rationale behind this choice is that OpenCL is already not type-safe anyway, and so clAmdBlas is not type-safe either). It will be interesting to see the influence it will have on the compilation time.

Philippe

Any thoughts and input are - as always - welcome :-)

Best regards, Karli
Re: [ViennaCL-devel] More weird problems
I remember us already having a problem with strlen on the cache with your NVidia SDK, which disappeared when you rebooted. Didn't we?

2014-11-05 16:25 GMT-05:00 Toby St Clere Smithe m...@tsmithe.net:

Toby St Clere Smithe m...@tsmithe.net writes:

The segfault happens when calling (in ocl/context.hpp):

443   err = clGetProgramBuildInfo(temp, devices_[0].id(), CL_PROGRAM_BUILD_LOG, 0, NULL, &ret_val_size);

Oh, and the segfault happens in nVidia's OpenCL when it calls strlen somewhere...

Oh, wait... Apparently my nvidia module was still loaded, so I take back the beignet comment (the system defaults to nvidia if it's available; what I thought was beignet was therefore not). And reloading the nvidia module seems to solve this for nvidia, too. Very strange. But the matrix_operations problems still remain!
Re: [ViennaCL-devel] Segfault running PyViennaCL direct solver tests
Hey,

Sorry for the late answer. I've been extremely busy with my stats homework lately. The caching mechanism indeed doesn't account for the device. This is pretty easy to add, i.e., append the device name + platform version + platform name when doing the hashing.

Philippe

2014-11-04 16:12 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi Toby,

thanks for the reports. I'll run the respective functions through a valgrind-like environment today, but I don't expect anything to show up at this point. The direct-solve kernels for dense matrices have been unchanged for quite some time and haven't shown anything suspicious in the nightly tests for *months* now. Thus, I'm very tempted to assume that this is a problem with beignet - yet I'll double-check.

Yes, I think so, too, now. But it is weird that I received a segfault on nVidia initially, too. I haven't studied the kernel caching mechanism: at the moment, the PyViennaCL cache directory is versioned, but should it also be separate for different devices? (And I will need to remember to clear out the cache directory for different ViennaCL git revisions, or add a mechanism to include the git reference...)

The caching mechanism computes a hash of the source code and uses that hash to access the binary object. I doubt that there is binary compatibility across different OpenCL SDKs.

Yes, having now updated my beignet installation to the latest point release and tested various combinations of stale and clean caches, it seems like the tests pass successfully and without segfaults when there is no overlap of the cached objects across devices.

Thanks, this is good news. I'm fighting with some low-level hardware here right now, pretty challenging to get this to work properly :-(

Philippe, does the caching mechanism take that into account and store separate binaries for each OpenCL SDK? Or is the SDK name part of the hash? It seems not to do so. I can change this in PyViennaCL, but I don't know if it might be good to do as you suggest and have the SDK name as part of the hash in the core. I can always change it in PyViennaCL for this release, and this could be postponed for the core till later.

A change of the OpenCL platform is fairly unlikely, so we may be able to go without. But on the other hand, it may lead to some hard-to-debug failures, just like you observed now. I leave this decision up to you...

Best regards, Karli
Re: [ViennaCL-devel] Segfault running PyViennaCL direct solver tests
I cannot reproduce the bug on my machine, so it's probably better if you patch it :) Because a context may be attached to multiple devices, the fix should concatenate the information of each device. Around line 402:

std::string prefix;
for (std::vector<viennacl::ocl::device>::const_iterator it = devices_.begin(); it != devices_.end(); ++it)
  prefix += it->name() + it->vendor() + it->driver_version();
std::string sha1 = prefix + tools::sha1(source);

I can't think of any other place where a change would be necessary.

2014-11-04 16:32 GMT-05:00 Toby St Clere Smithe m...@tsmithe.net:

Hej Philippe,

Philippe Tillet phil.til...@gmail.com writes:

Sorry for the late answer. I've been extremely busy with my stats homework lately. The caching mechanism indeed doesn't account for the device. This is pretty easy to add, i.e., append the device name + platform version + platform name when doing the hashing.

Yes -- this is precisely what I was thinking of doing. If no one gets there first, I'll knock up a patch early tomorrow afternoon (CET).

Cheers, Toby

2014-11-04 16:12 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi Toby,

thanks for the reports. I'll run the respective functions through a valgrind-like environment today, but I don't expect anything to show up at this point. The direct-solve kernels for dense matrices have been unchanged for quite some time and haven't shown anything suspicious in the nightly tests for *months* now. Thus, I'm very tempted to assume that this is a problem with beignet - yet I'll double-check.

Yes, I think so, too, now. But it is weird that I received a segfault on nVidia initially, too. I haven't studied the kernel caching mechanism: at the moment, the PyViennaCL cache directory is versioned, but should it also be separate for different devices? (And I will need to remember to clear out the cache directory for different ViennaCL git revisions, or add a mechanism to include the git reference...)

The caching mechanism computes a hash of the source code and uses that hash to access the binary object. I doubt that there is binary compatibility across different OpenCL SDKs.

Yes, having now updated my beignet installation to the latest point release and tested various combinations of stale and clean caches, it seems like the tests pass successfully and without segfaults when there is no overlap of the cached objects across devices.

Thanks, this is good news. I'm fighting with some low-level hardware here right now, pretty challenging to get this to work properly :-(

Philippe, does the caching mechanism take that into account and store separate binaries for each OpenCL SDK? Or is the SDK name part of the hash? It seems not to do so. I can change this in PyViennaCL, but I don't know if it might be good to do as you suggest and have the SDK name as part of the hash in the core. I can always change it in PyViennaCL for this release, and this could be postponed for the core till later.

A change of the OpenCL platform is fairly unlikely, so we may be able to go without. But on the other hand, it may lead to some hard-to-debug failures, just like you observed now. I leave this decision up to you...
Best regards, Karli

--
Toby St Clere Smithe
http://tsmithe.net
Re: [ViennaCL-devel] Benchmark GUI - GSoC Closing Words and Future Plans
Hey Namik,

Congratulations! :-) Yes, we very much hope that you'll stay with us in this adventure. I personally really like open-source development because (1) it's really educative, and (2) it makes me feel free. I think that research/jobs can put a lot of pressure on me, to the point that it can become somewhat alienating. Having a time window to develop my personal projects somehow keeps me optimistic :p

Open-source software is actually not only about coding. I think you could further improve your GUI by clearly defining when it should be used, and when it shouldn't. Assume that your GUI ends up being (mis?)used by some technical journalists: how would you like them to comment on the results? If you don't tell them the limits of your GUI, they can't know! If you want my take on the topic:

- The GUI indicates the performance of an average program (not tuned for any particular architecture) on different devices. This can reveal some information such as "It's hard to optimize code for this device, but if you do, maybe you'll get some amazing results"; I don't know.
- The GUI *does not* compare the peak performance of two different devices. Whoever uses the GUI has to be extremely careful about this. Presenting it that way is exactly what NVidia/AMD/Intel/WhicheverVendor does with an eye-candy slide that says: "oh, look how much better our GPU is for numerical computing". A lot of researchers/journalists fall into this trap, and this is pretty sad.

I guess that these two examples give you a clear direction in which you could document your code. Don't hesitate to add a usage section in the GUI, to give some guidelines on how the results should be interpreted.

Philippe

2014-08-25 20:21 GMT+02:00 Namik Karovic namik.karo...@gmail.com:

Hi Karl,

thanks, Namik! Congratulations on successfully completing the GSoC project. I hope you got a good insight into how open-source projects are done and how much fun they can be (although at some point one also needs to make sure 'things get done' by dealing with not-so-fun stuff).

Thanks. I must say it felt damn good to finally work on something that's big and important :)

The important next step is to finalize the first release. I don't think there's much left to be done feature-wise; now it's mostly a matter of cleaning up and packaging. We hope to have you with us not only for this step, but also for the later future. The central idea of GSoC is to grow the community of open-source projects, so we hope and encourage you to stay with us to the extent possible considering your other constraints such as course work.

How long I'll stick around depends on how much free time I'll have. I'm currently looking for a job, and if I manage to find one, I'm afraid I won't have a lot of free time.

In your case the documentation part wasn't that urgent, because the GUI is mainly a matter of fusing available functionality from ViennaCL together. The two 'TODOs' with respect to documentation are:

- Document the source code using Doxygen-style comments, just like in the ViennaCL source tree. Ideally, this is done right when writing code, because then any assumptions on function arguments are clear.
- Write a user manual on how the GUI works (including some screenshots, etc.). This last part, however, should be written right before the release in order to make sure that the screenshots are up-to-date.

Alright, I'll get down to writing documentation now.
Regards, Namik

On Mon, Aug 25, 2014 at 1:33 PM, Karl Rupp r...@iue.tuwien.ac.at wrote:

Hi Namik,

I'd like to send a big thanks to Karl and Philippe for the positive GSoC final evaluation mark. And a big thanks to everyone for helping me with my project. Also, congrats to Toby for successfully completing his GSoC project.

thanks, Namik! Congratulations on successfully completing the GSoC project. I hope you got a good insight into how open-source projects are done and how much fun they can be (although at some point one also needs to make sure 'things get done' by dealing with not-so-fun stuff).

It's been a great experience and a pleasure to work with you guys. I plan to continue working on the Benchmark GUI, at least until it's in respectable shape. I'd also like to offer my help if you plan on making the benchmark result website a reality. Of course, I won't be as active as I was during GSoC.

The important next step is to finalize the first release. I don't think there's much left to be done feature-wise; now it's mostly a matter of cleaning up and packaging. We hope to have you with us not only for this step, but also for the later future. The central idea of GSoC is to grow the community of open-source projects, so we hope and encourage you to stay with us to the extent possible considering your other constraints such as course work.

Also, there's one thing still unclear to me. What about documentation? Was I supposed to have written it by
Re: [ViennaCL-devel] Roadmap to 1.6 : Cleaning the code, refurbishing the test suite, the benchmark suite, etc...
Hey,

2014-08-17 11:52 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi,

So it seems like most of the features are ready for ViennaCL 1.6. My merge from a few days ago (finally) fully integrated the use of device-specific kernels for BLAS1, BLAS2, BLAS3.

Hurray! :-)

The reduction API is still missing, though, but I think that the priority should be to polish the code, and to ensure ViennaCL is still stable despite the migration to device-specific kernels. Specifically, I think that we should spend the next few weeks cleaning the code base.

Agreed. The full set of nightly test machines should be back to operational tomorrow, as we can finally move back into our offices in Vienna. These older systems should give us some more confidence in the stability.

I can list a few points that have caught my eye:

- I've rewritten the GEMM test from scratch. It now uses ublas::prod 27 times, instead of 2500, and thanks to a few macros the file size is substantially smaller (~250 lines vs. ~850). The test now completes about 15 times faster using the single-threaded ViennaCL implementation, and in the blink of an eye (~10 seconds) when OpenCL is used. Hurray! More importantly, this new version allowed me to spot the bug which was responsible for the failure of libviennacl-blas3 in tonight's dash. The culprit was the blas3_prod test choosing slice1==slice2, while the slices were bugged for C=row-major+sliced... This is frightening, because some other similar glitches may be hidden here and there in the test suite. For example, the matrix_vector test passes, but the libviennacl-blas2 test fails - probably due to some stride issues for row-major matrices. Things get more complicated now that the col-major kernels are used for the row-major cases. Anyhow, I think that we should somehow ensure that there is no such glitch remaining in the test suite before shipping ViennaCL 1.6 (i.e., all matrix slices/ranges use different offsets/strides in each direction).

Cool, thanks, that looks indeed a lot more compact now. With regard to tweaking the tests towards full coverage of row/column strides: this is something you have to add to the tests, because you know the internals of the kernels best and can design the tests towards corner cases. What caught my attention was the use of {start_M, start_N, start_K} as well as {stride_M, stride_N, stride_K}: wouldn't it be better to use separate strides for A, B, C in both dimensions, making it six parameters rather than three?

It would probably be better, indeed.

- I really think that we should rewrite the benchmarks for the 1.6 release, especially since it would showcase the substantial performance improvement that this release will bring. I can start writing a condensed benchmark including copy, axpy, dot, gemv, gemm. I think it would be cool to have sparse, solver, qr also included in that routine. I won't have the time to carry this out; I'm moving to the United States in 1 week :-p

I can take care of that, particularly as I'll have to adjust the tests in the benchmark GUI as well. However, I don't think that we should merge the sparse and solver routines into the same executable; these are two distinct fields of application (dense vs. sparse linear algebra). Merging too many different things into one executable also has some disadvantages if one piece of functionality does not work on a certain machine for whatever reason.

Cool! Yes, you're right, dense and sparse routines are not used for the same purposes. Which operations should be included in each executable, then?
One for dense benchmarks, and one for sparse benchmarks?

- I've noticed some unsafe/faulty legacy code dating back to when the layout was made a runtime parameter.

* nmf only implements matrix<T>, but in principle matrix_base<T> should work (since no custom kernel is called, I believe)

NMF uses a custom kernel and thus only works with OpenCL. A generalization to matrix_base should be straightforward, yes.

I should be able to do it for the release. The kernel it uses is:

template<typename StringType>
void generate_nmf_el_wise_mul_div(StringType & source, std::string const & numeric_string)
{
  source.append("__kernel void el_wise_mul_div( \n");
  source.append("  __global "); source.append(numeric_string); source.append(" * matrix1, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix2, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix3, \n");
  source.append("  unsigned int size) \n");
  source.append("{ \n");
  source.append("  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0)) \n");
  source.append("  { \n");
  source.append("    "); source.append(numeric_string); source.append(" val = matrix1[i] * matrix2[i]; \n");
  source.append("    "); source.append(numeric_string); source.append(" divisor = matrix3[i]; \n");
  source.append("    matrix1[i] = (divisor > ("); source.append(numeric_string); source.append(")0.1) ? (val / divisor) : ("); source.append(numeric_string); source.append(")0; \n");
  source.append("  } \n");
  source.append("} \n");
}
Re: [ViennaCL-devel] Roadmap to 1.6 : Cleaning the code, refurbishing the test suite, the benchmark suite, etc...
Hey,

The nasty bug on strided GEMV got solved. I'm available on Wednesday for the code uniformization session. We should be on IRC at the same time, though, in case we face a situation we had not discussed. I have a couple of questions regarding a standardized way of naming the numeric type of a matrix/vector. Sometimes it's NumericT, sometimes it's T, sometimes it's TYPE... What about NumericType everywhere? Anyway, some similar questions could arise, so it's probably better to be able to chat in real time while making the code style uniform.

We must also remember to sort out
https://github.com/viennacl/viennacl-dev/issues/71
https://github.com/viennacl/viennacl-dev/issues/77
https://github.com/viennacl/viennacl-dev/issues/66
https://github.com/viennacl/viennacl-dev/issues/2

Philippe

2014-08-17 19:36 GMT+02:00 Philippe Tillet phil.til...@gmail.com:

So the dense benchmark suite got refurbished here: https://github.com/viennacl/viennacl-dev/commit/73f46e36cfa4104628f831195e4da25a62f9ef66 The same template using macros can be used for any benchmark. It's pretty concise and maintainable!

Philippe

2014-08-17 13:50 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi,

* nmf only implements matrix<T>, but in principle matrix_base<T> should work (since no custom kernel is called, I believe)

NMF uses a custom kernel and thus only works with OpenCL. A generalization to matrix_base should be straightforward, yes.

I should be able to do it for the release. The kernel it uses is:

template<typename StringType>
void generate_nmf_el_wise_mul_div(StringType & source, std::string const & numeric_string)
{
  source.append("__kernel void el_wise_mul_div( \n");
  source.append("  __global "); source.append(numeric_string); source.append(" * matrix1, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix2, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix3, \n");
  source.append("  unsigned int size) \n");
  source.append("{ \n");
  source.append("  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0)) \n");
  source.append("  { \n");
  source.append("    "); source.append(numeric_string); source.append(" val = matrix1[i] * matrix2[i]; \n");
  source.append("    "); source.append(numeric_string); source.append(" divisor = matrix3[i]; \n");
  source.append("    matrix1[i] = (divisor > ("); source.append(numeric_string); source.append(")0.1) ? (val / divisor) : ("); source.append(numeric_string); source.append(")0; \n");
  source.append("  } \n");
  source.append("} \n");
}

So, the layout of the matrix shouldn't matter, indeed. It would be pretty easy to have this kernel generated by the generator, too, as it can be represented by the expression tree matrix1 = select(matrix3 > 0.1, element_div(element_prod(matrix1, matrix2), matrix3), cast<T>(0)). However, we're running out of time, so I wouldn't port it. But we have to keep in mind that this would be a trivial thing to do.

The same student who ported the FFT code to multiple backends will take care of porting NMF to multiple backends. He's pretty quick already, so it should be done by the release. However, I'd refrain from integrating this into the generator for now because it is totally non-critical in terms of overall performance. We can port that under perfect control within the OpenCL backend later, when we have more confidence in the stability of the generator (no pun intended).

- We should definitely have a discussion on matrix padding, which is no longer required anywhere in ViennaCL, as far as I know.
I am in favor of making size()==internal_size() by default. That's not the point of the e-mail, but we should have a discussion on what we should do with it!

Getting rid of the padding would certainly remove the traps of using fast_copy() on a matrix. Other than that, I don't think it has a substantial influence on the code, because internal_size() is still needed for dealing with ranges. There may be an influence on certain bandwidth-limited operations, though, as for example a matrix addition may lead to bank conflicts (or channel conflicts, whatever...) when accessing GPU RAM for certain matrix sizes. Before making a decision on the padding issue, we should run some benchmarks to see whether there is an impact.

Well, one thing I'm sure of is that we should offer the option of using no padding if needed (for memory constraints), or (probably even better) of choosing the padding size. Apparently it is not an easy choice for us to pick the default because of the many things to consider. Thus
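To make the fast_copy() trap concrete: fast_copy() transfers the raw padded buffer, so the host array must be sized by internal_size(), not size(). A small sketch (sizes illustrative):

#include <viennacl/matrix.hpp>
#include <vector>

void padded_copy_example()
{
  viennacl::matrix<float> A(100, 100);          // internal sizes may be padded (e.g. to 128)
  std::vector<float> host(A.internal_size());   // NOT A.size1() * A.size2()
  viennacl::fast_copy(A, &host[0]);             // device -> host, padded row-major layout
}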
Re: [ViennaCL-devel] Benchmark GUI Expert Mode
Hey Namik,

The code looks fine. As a small tip, I would advise using blas3MatrixSize{A,B,C} = {M, N, K}; it's much more conventional. I would also suggest removing LU from the benchmark. I only achieve 11 GFLOP/s on my machine (GEMM peaks at 120 GFLOP/s). It will smash the overall score if you keep it enabled!

Philippe (Not sleeping either :-p)

2014-08-17 23:28 GMT+02:00 Namik Karovic namik.karo...@gmail.com:

Hi all,

I just pushed the first working version of the expert (custom) benchmark mode. Selecting custom sparse matrices is yet to be implemented, but all other benchmark configs are working. Except blas3, that is. I think I got the sizes wrong. I'd appreciate it if someone could check if I did it right:

// blas3MatrixSizeA,B = size1,2
// blas3MatrixSizeB,C = size2,3
viennacl::matrix<ScalarType> vcl_A(blas3MatrixSizeA, blas3MatrixSizeB);
viennacl::matrix<ScalarType> vcl_B(blas3MatrixSizeB, blas3MatrixSizeC);
viennacl::matrix<ScalarType> vcl_C(blas3MatrixSizeA, blas3MatrixSizeC);

// Fill the matrix
for (unsigned int i = 0; i < blas3MatrixSizeA; ++i)
  for (unsigned int j = 0; j < blas3MatrixSizeB; ++j)
    stl_A[i*blas3MatrixSizeA + j] = random<ScalarType>();

for (unsigned int i = 0; i < blas3MatrixSizeB; ++i)
  for (unsigned int j = 0; j < blas3MatrixSizeC; ++j)
    stl_B[i + j*blas3MatrixSizeC] = random<ScalarType>();

// using ranges
viennacl::range r(blas3MatrixSizeB/4, 3 * blas3MatrixSizeB/4);
// using slices
viennacl::slice s(0, 2, blas3MatrixSizeB/2);

The benchmark crashes on test 4 (LU factorization). I don't know if I messed up somewhere before test 4 (in the code written above), or somewhere else.

Regards, Namik
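For reference, the indexing in the quoted fill loops looks like a plausible source of the crash: the row stride of a row-major A-by-B matrix is B (its number of columns), not A. A restatement with the M, N, K naming suggested above (a hedged sketch, since the surrounding buffer allocations are not shown in the original):

// C = A * B with A: M x K, B: K x N, C: M x N
viennacl::matrix<ScalarType> vcl_A(M, K), vcl_B(K, N), vcl_C(M, N);

std::vector<ScalarType> stl_A(M * K), stl_B(K * N);

for (unsigned int i = 0; i < M; ++i)        // row-major fill: row stride is K
  for (unsigned int j = 0; j < K; ++j)
    stl_A[i * K + j] = random<ScalarType>();

for (unsigned int i = 0; i < K; ++i)        // column-major fill: column stride is K
  for (unsigned int j = 0; j < N; ++j)
    stl_B[i + j * K] = random<ScalarType>();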
[ViennaCL-devel] Roadmap to 1.6 : Cleaning the code, refurbishing the test suite, the benchmark suite, etc...
Hey!

So it seems like most of the features are ready for ViennaCL 1.6. My merge from a few days ago (finally) fully integrated the use of device-specific kernels for BLAS1, BLAS2, BLAS3. The reduction API is still missing, though, but I think that the priority should be to polish the code, and to ensure ViennaCL is still stable despite the migration to device-specific kernels. Specifically, I think that we should spend the next few weeks cleaning the code base. I can list a few points that have caught my eye:

- I've rewritten the GEMM test from scratch. It now uses ublas::prod 27 times, instead of 2500, and thanks to a few macros the file size is substantially smaller (~250 lines vs. ~850). The test now completes about 15 times faster using the single-threaded ViennaCL implementation, and in the blink of an eye (~10 seconds) when OpenCL is used. Hurray! More importantly, this new version allowed me to spot the bug which was responsible for the failure of libviennacl-blas3 in tonight's dash. The culprit was the blas3_prod test choosing slice1==slice2, while the slices were bugged for C=row-major+sliced... This is frightening, because some other similar glitches may be hidden here and there in the test suite. For example, the matrix_vector test passes, but the libviennacl-blas2 test fails - probably due to some stride issues for row-major matrices. Things get more complicated now that the col-major kernels are used for the row-major cases. Anyhow, I think that we should somehow ensure that there is no such glitch remaining in the test suite before shipping ViennaCL 1.6 (i.e., all matrix slices/ranges use different offsets/strides in each direction).

- I really think that we should rewrite the benchmarks for the 1.6 release, especially since it would showcase the substantial performance improvement that this release will bring. I can start writing a condensed benchmark including copy, axpy, dot, gemv, gemm. I think it would be cool to have sparse, solver, qr also included in that routine. I won't have the time to carry this out; I'm moving to the United States in 1 week :-p

- I've noticed some unsafe/faulty legacy code dating back to when the layout was made a runtime parameter.
* nmf only implements matrix<T>, but in principle matrix_base<T> should work (since no custom kernel is called, I believe)
* There was a faulty row_major(is_row_major<BaseType>::value) in matrix_range and matrix_slice. This caused matrix_range<matrix_base<T> > to be column-major no matter what. More generally, there are a couple of places using the static is_row_major or alignment traits. I thought that it could be a good idea to delete these traits, to be sure that there can be no such faulty code anywhere else. Am I overlooking any side effect?

- We should definitely have a discussion on matrix padding, which is no longer required anywhere in ViennaCL, as far as I know. I am in favor of making size()==internal_size() by default. That's not the point of the e-mail, but we should have a discussion on what we should do with it!

- Finally, there is a performance regression for GEMM with slices, due to my fallback being too extreme (one element computed per work-unit). I'm on it, so don't worry if you get something like 3 GFLOP/s on slices in the current blas3 benchmark.

Okay, that's pretty much everything I'm worried about, I think!

Philippe
[ViennaCL-devel] Testing GEMM
Hey,

The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks involved. This gets hard to test, so I thought it could be a good idea to discuss it. Basically, here is how it works:

A = [A1 A2; A3 A4]
B = [B1 B2; B3 B4]
C = [C1 C2; C3 C4]

where each block is divided according to the corresponding block size of the template. For example, A1 is the largest sub-block whose sizes are multiples of the tuple (ML, KL), where ML is the number of rows computed by each work group and KL the width step for computing the inner products. (If the kernel uses local memory, it will load successive blocks of size ML*KL in each work group.) A few kernels are enqueued so that:

C1 = A1*B1 [optimized kernel]
C1 += A2*B3 [fallback] if needed
C2 = A1*B2 [fallback] if needed
C2 += A2*B4 [fallback] if needed
etc...

Basically, one optimized kernel does the bulk of the work, and the other ones do the clean-up. This works well for full matrices and ranges. When slices are involved, things get more complicated. If the stride is on the non-leading dimension (stride2 for column-major matrices), then it can be incorporated into the optimized kernel (by appending ld *= stride2 at the beginning of the kernel). However, if stride1 > 1, then we need to use the fallback kernel. This is a reasonable thing to do: in most applications I know of, only one stride is accessed at a time (we want a subset of the rows/columns of a given matrix).

However, this becomes really messy to test! Basically, I think that, to have an exhaustive enough test suite, we should go for:

- Matrices of complicated arbitrary sizes (143, 284, 395). It is important to space them by more than 128, to be sure that A1, B1 and C1 are not square.
- Ranges of similarly complicated sizes.
- An optimized range: (128, 256, 384), for example.
- Matrix row-wise slices, matrix col-wise slices, and matrix slices in both directions.

I am ready to rewrite the GEMM tests accordingly, but any thoughts on the procedure would be appreciated!

Philippe
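A hedged sketch of what such a test could look like (names, sizes, and stride choices are illustrative, not the actual test-suite code):

#include <viennacl/matrix.hpp>
#include <viennacl/matrix_proxy.hpp>
#include <viennacl/linalg/prod.hpp>

void test_gemm_slices()
{
  std::size_t M = 143, N = 284, K = 395;   // deliberately awkward sizes, spaced by more than 128

  viennacl::matrix<float> A(2*M + 4, 2*K + 4), B(2*K + 4, 2*N + 4), C(2*M + 4, 2*N + 4);

  // Distinct offsets and strides in every direction, so that no two slice
  // parameters can accidentally coincide (the slice1==slice2 glitch from
  // the test-suite discussion). slice(start, stride, size):
  viennacl::slice rows_A(0, 2, M), cols_A(2, 2, K);
  viennacl::slice rows_B(1, 2, K), cols_B(0, 2, N);
  viennacl::slice rows_C(1, 2, M), cols_C(3, 2, N);

  viennacl::matrix_slice<viennacl::matrix<float> > As(A, rows_A, cols_A);
  viennacl::matrix_slice<viennacl::matrix<float> > Bs(B, rows_B, cols_B);
  viennacl::matrix_slice<viennacl::matrix<float> > Cs(C, rows_C, cols_C);

  Cs = viennacl::linalg::prod(As, Bs);   // compare against a host reference GEMM
}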
Re: [ViennaCL-devel] Benchmark GUI Feedback Needed
Hello!

This all looks pretty good. Good job!

2014-08-12 3:40 GMT+02:00 Namik Karovic namik.karo...@gmail.com:

Hi Karl,

I'm fine with splitting things into something like Basic Benchmark and Expert Benchmark ('view' sounds inappropriate), but as long as both benchmarks do the same thing, I don't see the problem why the expert version cannot be a refinement of the basic version. Could you please elaborate?

The entire issue comes down to this: should basic mode be able to run the benchmark with expert mode's settings? Or should it always run using the default settings, no matter what? My motivation for bringing this up is that one could first do a basic benchmark, then continue on to playing with the expert mode. The basic mode can then be used for quick reference, as it will not be altered by expert mode runs. So both modes will have their own results, and their own settings. I will prevent users from running both modes at the same time, of course. I hope it's clearer now.

I would rather lean towards re-using the expert settings for the basic benchmarks, and providing some reset button, so that if one messes things up, one can still retrieve the original basic results.

Looks great, this is a really useful graph (something is fishy with the values on the y-axis, though...) :-) Can you please draw the x- and the y-axis in logarithmic scale and make the vector increment a multiplicative factor (2 by default)?

The axis labels are fishy because they aren't properly set up yet :) Sure, I can make them logarithmic. What about the default number of increments? I currently have it set to increment by 1 million from 1M to 15M, so 14 increments. Should there be more increment steps? I need to know so I can calculate the optimum min and max vector size for a x2 increment factor.

It's actually important to have finer-grained data for small vectors, and more widely spaced points as the data grows bigger: this is why it is better to choose the sizes according to an a^x law than an a*x one. You can experiment with values of a other than 2, if you want. If I were you, I'd probably go with something like [int(1.5**x) for x in range(30, 45)], that is, a multiplicative increment of 1.5 from ~190,000 to ~55,000,000.

This looks quite okay, actually.

Alright, if you say so. But note that in fullscreen it will be a lot more stretched, and thus a lot less visually appealing. I'll do some more thinking to try and make it a bit more organized.

There should be a third size for the Blas3 part. This will then also make all four boxes (Blas3, Sparse, Copy, Vector) equally high, which should improve the visual appearance.

So x, y, z dimensions for Blas3? Blas3 currently uses 2D matrices, so I'll have to modify the benchmark to use 3D matrices?

Blas3 multiplies two matrices: A(size1, size2) * B(size2, size3), hence the three sizes required :-p Not sure about what kind of 3D matrices you are referring to! ;)

In any case, great job!

Philippe

I don't see a problem with making the string conversion routines public, so I just pushed a commit for doing so. :-)

Thanks. Appreciate it.

Regards, Namik

On Mon, Aug 11, 2014 at 8:18 PM, Karl Rupp r...@iue.tuwien.ac.at wrote:

Hi Namik,

I'm starting work on the expert view and would appreciate some feedback before I get into it more seriously.

thanks for the latest bunch of features :-) I've got quite a lot of questions, so bear with me please. Here we go:

- Should basic and expert views be changed to independent benchmarking modes, or remain different views of the same benchmark backend?
I initially imagined basic and expert views as differently detailed presentations of the same benchmark instance (one could run the basic benchmark, and switch to the expert view after it's done to examine the results in more detail). However, now I'm thinking it would be better not to mix them. Let basic mode be a simple benchmark with default settings, and let expert mode be fully customizable and independent. That way the basic mode would be unaffected by expert mode's settings. This would allow basic mode to act as a safe reference mode. It would also allow easier usage of benchmark profiles (saving the user's expert mode config for later usage), but that's a story for another time. It's worth mentioning that it's easier to implement two independent modes than to have them share a single benchmark mode. So, which version am I to develop?

I'm fine with splitting things into something like Basic Benchmark and Expert Benchmark ('view' sounds inappropriate), but as long as both benchmarks do the same thing, I don't see the problem why the expert version cannot be a refinement of the basic version. Could you please elaborate?

- I've implemented line plotting of copy vector benchmarks. There are still some minor tweaks to be done, but the main functionality is ready. Here's a screenshot for quick reference:
Re: [ViennaCL-devel] Tolerances for tests
Hey Toby,

My two cents: don't forget that matrix-vector multiplication will still introduce some round-off errors. I.e., when you are computing y = A*[1,1,...], then you are actually computing something like y' = A*([1,1,...]+eps). GEMV is backward stable, so you are sure that y' will be close to y. This being said, I don't know much about the stability/backward stability of GMRES, but if it's not backward stable, you won't be able to get your result close to [1,1,...]. In other words, you're probably better off comparing y' with A*x', where x' is the output of the GMRES procedure, rather than x with x'.

Philippe

2014-08-05 22:10 GMT+02:00 Toby St Clere Smithe m...@tsmithe.net:

Hi all,

I've now implemented a test for the iterative solvers and preconditioners, using generate_fdm_laplace. This is good because it gives consistent results, though compared to using a randomly generated system matrix it means that the solvers are only tested on one set of input data.

My test works by constructing the system matrix, choosing a solution vector that is just a vector of 1.0s of the correct size, and multiplying to find the RHS to put into the solver. I then run the solver and compare the output to my vector of 1.0s. I report a failure if the error is greater than some tolerance value specific to the datatype.

Of course, this absolute error tolerance has a different definition from that in (for instance) the GMRES solver, where the solver quits if ||r|| < tolerance * ||r_initial|| obtains. This means that the solver might return successfully, and yet cause a false test failure. I have a crude work-around for this. It seems to suffice to set the solver tolerance to 1e-1 times the test tolerance, which strictly might be stronger than is ideally warranted. I think this should suffice for the purposes of this test, however; else I'll have to do silly solver-specific things like computing ||r_initial||.

Currently, I use a test tolerance of 1e-2 for single precision, which means a solver tolerance of 1e-3. This seems less precise than I'd like; machine epsilon should be around 1e-7, and so I feel like I should be able to use a test tolerance of 1e-3 and a solver tolerance of 1e-4. However, using GMRES, this gives incorrect results (regardless of max_iterations) -- wildly so when combined with some preconditioners. I suspect that this is caused by rounding errors, but I'm not sure. I tried to check what the ViennaCL test does, but I couldn't find one for the iterative solvers!

Cheers, Toby

--
Toby St Clere Smithe
http://tsmithe.net
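In ViennaCL terms, Philippe's suggestion amounts to a residual-based check (a minimal sketch; x_computed and b stand for the solver output and the constructed right-hand side):

#include <viennacl/compressed_matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/prod.hpp>
#include <viennacl/linalg/norm_2.hpp>

// Backward-error style test: measure ||A*x' - b|| / ||b|| instead of
// comparing x' element-wise against the known solution.
bool solution_acceptable(viennacl::compressed_matrix<double> const & A,
                         viennacl::vector<double> const & x_computed,
                         viennacl::vector<double> const & b,
                         double tol)
{
  viennacl::vector<double> r = viennacl::linalg::prod(A, x_computed);
  r -= b;
  double nr = viennacl::linalg::norm_2(r);   // device reduction, copied to host
  double nb = viennacl::linalg::norm_2(b);
  return nr / nb < tol;
}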
[ViennaCL-devel] On the use of vector types in viennacl's opencl kernels
Hi,

It's horrible! As soon as I want to introduce some vectorized types in an OpenCL template as simple as AXPY, everything starts exploding. Well, first things first, I probably need to justify why I think that we cannot do without double2, float4 in all of our dense kernel templates:

- From my own experience, it turns out that some element-wise expressions can easily be compute-bound. In statistics it is pretty easy to encounter complicated element-wise transforms when evaluating a probability density function. I've personally had to use SSE on my CPU a couple of times to alleviate this problem.
- Some vendors explicitly state in their optimization guide that loads of 16 bytes will result in better bandwidth.

On the other hand, using stride != 1 will prevent the use of vectorized loads in any kernel (AXPY, GEMM, etc.). We're definitely facing a dilemma here, where we have to choose between higher JIT overhead (the programs can be cached, however) and potentially higher execution time. My belief is that we should provide a fallback program for stride != 1, which will be compiled only if strided accesses are used. Note that even this wouldn't solve all our problems: how to handle offsets that are not a multiple of 4? How to handle sizes that are not a multiple of 4? We could use the same fallback, or provide a different optimized kernel.

http://paste.ubuntu.com/7915787/

optimized_1 should be able to handle the remaining cases quite well, while optimized_0 should be faster because it doesn't have to check for alignment (contrary to vload4) and doesn't have to do any clean-up. In the case of AXPY, I'd expect optimized_1 to be the better option. For GEMM, however, I'd prefer the clean-up to be done in some other kernel calls.

Seriously, what a headache!! But discarding vector types for everything but GEMM just sounds wrong to me...

Philippe
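Since the paste may not survive, here is a hedged reconstruction of the two AXPY flavors being contrasted (kernel names and signatures are illustrative only, not the generator's actual output):

// optimized_0-style: float4 pointers; requires a 16-byte-aligned offset,
// size divisible by 4, and stride == 1.
__kernel void axpy_float4(__global float4 * x, __global const float4 * y,
                          float alpha, unsigned int size4)
{
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    x[i] += alpha * y[i];
}

// optimized_1-style: vload4/vstore4 tolerate unaligned starts; a scalar
// clean-up loop handles the size % 4 tail.
__kernel void axpy_vload4(__global float * x, __global const float * y,
                          float alpha, unsigned int size)
{
  unsigned int size4 = size / 4;
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    vstore4(vload4(i, x) + alpha * vload4(i, y), i, x);
  if (get_global_id(0) == 0)                 // tail elements
    for (unsigned int i = 4 * size4; i < size; ++i)
      x[i] += alpha * y[i];
}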
[ViennaCL-devel] OpenMP Matrix Multiplication
Hi guys,

So I expect ViennaCL 1.6 to offer some really good performance on CPUs with the OpenCL backend -- possibly 80% of OpenBLAS / MKL on a Core i7 4770, for example. As the OpenCL kernel generator and the auto-tuner get better, we can hope for further improvements. This will create a huge gap with the fallback OpenMP version, which hardly reaches 0.5 GFLOP/s.

What would you think about extracting the assembly output of the Intel OpenCL compiler? I'm not familiar *at all* with assembly code. How would we handle multi-threading in such a setting?

Philippe
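For perspective on the 0.5 GFLOP/s figure: even a simple cache-blocked OpenMP loop nest typically lands far above that on a modern CPU. This is a generic sketch, not the ViennaCL host backend (row-major buffers; the block size is a tuning assumption):

#include <algorithm>

// C += A * B with A: M x K, B: K x N, C: M x N, all row-major.
void gemm_blocked(const float * A, const float * B, float * C,
                  int M, int N, int K)
{
  const int BS = 64;                       // block size: tune for the L1/L2 caches
  #pragma omp parallel for collapse(2)     // each thread owns disjoint C tiles
  for (int i0 = 0; i0 < M; i0 += BS)
    for (int j0 = 0; j0 < N; j0 += BS)
      for (int k0 = 0; k0 < K; k0 += BS)
        for (int i = i0; i < std::min(i0 + BS, M); ++i)
          for (int k = k0; k < std::min(k0 + BS, K); ++k)
          {
            const float a = A[i * K + k];  // hoist A(i,k); inner loop streams B and C
            for (int j = j0; j < std::min(j0 + BS, N); ++j)
              C[i * N + j] += a * B[k * N + j];
          }
}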
[ViennaCL-devel] ViennaCL console benchmarks
Hey,

I've noticed that the console benchmarks for ViennaCL are quite outdated; performance for AXPY is reported in FLOP/s, for example, even though it is a bandwidth-bound operation. I think it'd be great to have something compact, all incorporated in a single benchmarking executable:

===
BLAS [float, full]
-
AXPY : ... (GB/s)
DOT : ... (GB/s)
GEMV : ... (GB/s)
GEMM-NN : ... (GFLOP/s)
GEMM-TN : ... (GFLOP/s)
GEMM-NT : ... (GFLOP/s)
GEMM-TT : ... (GFLOP/s)
... solver, perhaps some other things

BLAS [float, ranges]
-
...
===

I can't really think of a case where one would only be interested in the performance of one single operation! Do you have any other ideas to make the benchmarks more concise/readable/informative?

Philippe
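For reference, the units above follow directly from how each operation is bound; the conversion is just bookkeeping (N, M, K and time denote whatever the benchmark measured):

// AXPY (x = a*y + x) moves three vectors through memory: read x, read y, write x.
double axpy_gb_per_s = 3.0 * N * sizeof(float) / time / 1e9;   // bandwidth-bound
// DOT reads two vectors.
double dot_gb_per_s  = 2.0 * N * sizeof(float) / time / 1e9;
// GEMM performs 2*M*N*K floating-point operations.
double gemm_gflops   = 2.0 * M * N * K / time / 1e9;           // compute-bound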
Re: [ViennaCL-devel] Benchmark GUI First Look
Hi Namik,

Good job! It all looks very appealing. I don't have much to say, just a few comments:

- I'd rather use the median instead of the average, indeed.
- As for the latency in the expert section, it would be great to also have an execution-time-vs-size plot, in order to show up to which sizes latency dominates the routines.

I'll be here at 16:00 UTC.

Philippe

2014-07-11 12:50 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:

Hey,

(...) I'll skip the pause functionality then.

Ok :-)

Consider using the median rather than the average. In some cases one can observe infrequent outliers, which would distort the average.

Would the median value really be a good choice to represent all benchmark sub-tests? I mean, surely those outliers should have some impact on the final result. Performance bottlenecks can sometimes be that thin line that separates high-end from low-end products. I'm not so sure we should just ignore them.

I hope we're talking about the same thing here. Imagine you're benchmarking the vector addition; then you may get the following individual timings:

0.1 sec
0.1 sec
0.1 sec
0.5 sec  -- some unexpected other load on the system here

The median is 0.1 here, whereas the average is 0.2. I'd consider the median to provide a more reliable indicator for system performance here.

Rather than a single score, what about providing multiple scores for the main quantities of interest? One score for GFLOPs, one score for memory bandwidth, one score for latency? Is this too technical already?

Well, a multi-score is fine by me. However, memory bandwidth in terms of GB/s is only measured by the copy benchmark, as far as I've seen. We should consider adding some more bandwidth benchmarks in that case.

The classic benchmark for this is the STREAM benchmark (http://www.cs.virginia.edu/stream/), which covers the following four vector operations:

x <- y (copy)
x <- a * y (scale)
x <- y + z (sum)
x <- y + alpha * z (triad)

All four of them are easy to reproduce. If you are amazed at the simplicity of the benchmark, you'll be surprised at how much it can tell you regarding performance for a huge number of applications.

As for latency, I don't think our average users would care that much about it. Whenever I overclocked my computer, I would primarily focus on achieving higher memory bandwidth instead of lower latency. That is why I would rather see memory bandwidth instead of latency. It would certainly be a useful addition, but not *that* important to have a dedicated score.

Valid point; latency and friends should go to the expert results section.

Live updates would certainly be cool. However, you need to make sure that the plots are only drawn in between different benchmark tests; otherwise the plotting might induce a certain load which will interfere with the benchmark. I think this can be accomplished without too much effort.

Well, each benchmark is run in its own separate thread. I don't think the main GUI thread can interfere with a benchmark's thread.

Sure it can. Not in terms of data, but in terms of eating CPU cycles while the benchmark thread is running. Keep in mind that ultimately we're also running tests with OpenMP on the CPU.

But if you're referring to data transfer between the CPU and GPU, then I can't say for sure. To the best of my knowledge, Qt widgets utilize the CPU and RAM; there's no GPU or modern OpenGL involved. As for communication between running benchmarks and the GUI, all messages coming from benchmarks are sent between sub-tests.
All benchmarks' sub-tests are intact, so there shouldn't be any problems regarding message emitting. Messages between CPU and GPU are less of an issue. Still, the GUI should do as little work as possible while the benchmark is running. It's okay to update in between sub-benchmarks. At this point we should certainly define a reasonable grouping of the results. For example, our current LU factorization is fairly slow because of the way it is implemented, not because the hardware is poor. Are you available for an IRC session on this tomorrow, e.g. 16:00 UTC? Sure, I'm available at 16:00 UTC tomorrow (I guess that's today now :D ) Yes, Friday, 16:00 UTC. I quickly adjusted the one I designed some time ago, see attachment. I can commit the GIMP file if you like it. Of course we can alter it further as needed and appropriate. Thanks, I implemented it right away and pushed to GitHub. Looks good. Yeah, I'd like the GIMP file. I'll play around with it and see if anything can be improved. Ok, I'll push it right after sending the email. Don't forget to pull. ;-) Btw: Could you please consider rearranging the folder hierarchy such that src/ only contains actual code? Put things like the splash screen
Re: [ViennaCL-devel] ViennaCL 1.6 Roadmap
Hi, I'd like to add something, to point out that input-dependent kernels are pointless without kernel caching (both would use an environment variable and the filesystem). Indeed, each program will contain multiple versions of a given operation, which can make the compilation time very long if caching is disabled. Philippe 2014-07-09 17:53 GMT+02:00 Philippe Tillet phil.til...@gmail.com: Hey hey, 2014-07-09 14:47 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, Philippe, did you by chance check the impact of the generator integration on kernel latency? We only have a 1-10us margin to work with, which I haven't checked yet. Don't worry about the overhead. It used to be fine. I'll re-check to see whether everything is still fine, but when the program name and the kernel name prefix are known in advance (i.e. for the pre-compiled programs), I don't see where a significant overhead could come from! I'll benchmark this ASAP, once some other modifications are done. The overhead could come from too many indirections in memory accesses, i.e. if too many lookups in maps and string comparisons are involved. Since you know the implementation better than me and don't think this is an issue, it should be fine. Either way, it needs to be checked, as it is such a fundamental quantity for the onset of scaling behavior of almost all 'higher-level' algorithms. The process of enqueueing the generator is extremely lightweight; there is no map involved. It does basically two things: - Parse the statement to retrieve some quantities (e.g. M, N, K in the case of GEMM) - Recursively enqueue the elements of the statement (matrix, vector, scalar, etc.) When the program name is known in advance, there is no need to build the representation of the statement (which fills a char*), but even this should be fast enough. I remember having measured, some time ago, a total overhead of ~10 microseconds when building this representation. But I'll re-evaluate this ASAP. I've been very motivated to work on the kernel generator recently, and simply don't feel like working on (1) or (2) at the moment. Now, there are two different options for (4): 4.1 - Implementing the kernel fusion mechanism inside the scheduler. 4.2 - Input-dependent kernels, and performance prediction. While I could help with 4.1, I don't feel like I could do this task alone, because I don't have sufficient knowledge of the backend. Plus, it implies getting rid of op_executor(), and I'm not sure how I could do this, either! I feel operational, though, for 4.2. I feel like ViennaCL 1.6 should be a performance-oriented release, and having an (input+device)-dependent kernel selection mechanism is something we have to do! I think we should not go for 4.1 with a 1.6.0 release, simply because it would delay the release cycle. We should provide features to our users fairly quickly after they are stabilized, not have them hanging around in the developer repository for too long. We have enough features for 1.6.0 already ;-) Some work from your side on 4.2 would be good, so if you have some resources left, please focus on that. Sure. 4.2 is part of my (future) PhD work, so I can't expect to have everything working flawlessly for ViennaCL 1.6.0. As always, it's better to have a smaller set of reliable features in a release rather than a larger set of broken features ;-) But I feel like I should be able to create the backbone for this release: a simple environment-variable-based mechanism that points to a folder containing the files spit out by the Python auto-tuner.
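A minimal sketch of such an environment-variable-based lookup (purely illustrative: the variable name matches the VIENNACL_MODELS_PATH proposal that follows, while the file layout and fallback behavior are assumptions, not ViennaCL code):

    #include <algorithm>
    #include <cctype>
    #include <cstdlib>
    #include <fstream>
    #include <iterator>
    #include <string>

    // Standardize a device name: lower-case, spaces replaced by dashes.
    std::string standardized_name(std::string name)
    {
      std::transform(name.begin(), name.end(), name.begin(), ::tolower);
      std::replace(name.begin(), name.end(), ' ', '-');
      return name;
    }

    // Returns true and fills 'model' if VIENNACL_MODELS_PATH is set and a
    // file for this device can be opened; otherwise the caller falls back
    // on the built-in, input-agnostic database.
    bool try_load_model(std::string const & device_name, std::string & model)
    {
      char const * path = std::getenv("VIENNACL_MODELS_PATH");
      if (!path)
        return false;
      std::ifstream file((std::string(path) + "/" + standardized_name(device_name)).c_str());
      if (!file)
        return false;
      model.assign(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>());
      return true;
    }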
I'd like an environment-variable-based extension, as it can be easily exploited by advanced users in C++, and generalized by pyviennacl (since Python has a portable filesystem framework)! Here's my idea. We could have VIENNACL_MODELS_PATH pointing to a directory containing standardized device names (lower-case, spaces replaced by dashes). At runtime, we check if the environment variable is set and if we can open the corresponding file. If not, we fall back on the built-in, input-agnostic database. This sounds to me much more like a researcher's facility rather than something an average user wants to be exposed to. Keep in mind that whenever something needs to go through the file system, it is subject to additional problems: these can be permission problems, problems with blanks (or umlauts, etc.), random IO errors, or tricky problems in batch systems on supercomputers. Since I'm part of the PETSc developer team, I've learned about so many problems on machines 'out there', where Murphy's law is constantly in action. Can we focus on populating the built-in database for the 1.6.0 release instead? A standard user with a standard GPU should
[ViennaCL-devel] ViennaCL 1.6 Roadmap
Hello, Looking at the roadmap: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap I was concerned with 4 elements: (1) Hook in external BLAS libraries and use them as a computing backend (2) Distributed vectors and matrices (multiple devices, possibly mixed CUDA/OpenCL/OpenMP) (3) Support for reductions (vector reduction, row-wise reduction, col-wise reduction). Naive OpenMP/CUDA implementation, but integrated in the kernel generator for OpenCL. (4) Full integration of the micro-scheduler and the generator. Needless to say, this seems overly ambitious! I had done a prototype for (1), but realized quickly that it would be pretty complicated to make it stable and robust with respect to devices, contexts, etc. Plus, the generator now gives the same (DENSE!) performance as cuBLAS on NVIDIA GPUs (for Fermi, at least), and clAmdBlas on AMD GPUs. Linking could allow us to have very good performance on OpenMP/CUDA, as well as sparse linear algebra on OpenCL. This is interesting, but it is also a good amount of work! (2) will also require a huge amount of work. Plus, I think it is dangerous to do that when we're not even sure of how we handle ViennaCL on a single device (considering input-dependent kernels, for example). I'd say we should postpone this. I'll do (3). It's not a lot of work and the kernel generator already supports it. We just need to add an API. (4) is where I've spent and will spend most of my time. The kernel generator is now fully integrated for all the vector operations, all the matrix-vector operations (except rank-1 updates) and most of the dense matrix operations (all but LU, FFT, in-place triangular substitution). While the database is not populated yet, recent benchmarks suggest very good performance (like cuBLAS on a GTX 470, and 80% of the peak on an R9 290X). I think it is necessary to push forward in this direction, and make ViennaCL 1.6 a BIG DATA BIG DATA BIG DATA BIG DATA performance-based release. I've been very motivated to work on the kernel generator recently, and simply don't feel like working on (1) or (2) at the moment. Now, there are two different options for (4): 4.1 - Implementing the kernel fusion mechanism inside the scheduler. 4.2 - Input-dependent kernels, and performance prediction. While I could help with 4.1, I don't feel like I could do this task alone, because I don't have sufficient knowledge of the backend. Plus, it implies getting rid of op_executor(), and I'm not sure how I could do this, either! I feel operational, though, for 4.2. I feel like ViennaCL 1.6 should be a performance-oriented release, and having an (input+device)-dependent kernel selection mechanism is something we have to do! Any thoughts on how the roadmap could/should be rearranged? Philippe
Re: [ViennaCL-devel] ViennaCL 1.6 Roadmap
Hi, 2014-07-08 20:59 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi Philippe, Looking at the roadmap: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap argl, I forgot to update this after our IRC meeting. The protocol here defines features for 1.6.0 which are far more reasonable: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Developer-Meetings I was concerned with 4 elements: (1) Hook in external BLAS libraries and use them as a computing backend (2) Distributed vectors and matrices (multiple devices, possibly mixed CUDA/OpenCL/OpenMP) (3) Support for reductions (vector reduction, row-wise reduction, col-wise reduction). Naive OpenMP/CUDA implementation, but integrated in the kernel generator for OpenCL. (4) Full integration of the micro-scheduler and the generator. Needless to say, this seems overly ambitious! I had done a prototype for (1), but realized quickly that it would be pretty complicated to make it stable and robust with respect to devices, contexts, etc. Plus, the generator now gives the same (DENSE!) performance as cuBLAS on NVIDIA GPUs (for Fermi, at least), and clAmdBlas on AMD GPUs. Linking could allow us to have very good performance on OpenMP/CUDA, as well as sparse linear algebra on OpenCL. This is interesting, but it is also a good amount of work! We postponed that and instead agreed to focus on the full scheduler integration. (2) will also require a huge amount of work. Plus, I think it is dangerous to do that when we're not even sure of how we handle ViennaCL on a single device (considering input-dependent kernels, for example). I'd say we should postpone this. Certainly postpone this. Today I got notice that we will have funding for a PhD student working on this. It's still hard to find a good candidate, but at least we have the funding now ;-) I'll do (3). It's not a lot of work and the kernel generator already supports it. We just need to add an API. Today there was a user requesting this on sourceforge. I'll also have time in the next days to work on this, but since you volunteered for it, I'll go for the iterative solver optimizations first. (4) is where I've spent and will spend most of my time. The kernel generator is now fully integrated for all the vector operations, all the matrix-vector operations (except rank-1 updates) and most of the dense matrix operations (all but LU, FFT, in-place triangular substitution). While the database is not populated yet, recent benchmarks suggest very good performance (like cuBLAS on a GTX 470, and 80% of the peak on an R9 290X). I think it is necessary to push forward in this direction, and make ViennaCL 1.6 a BIG DATA BIG DATA BIG DATA BIG DATA performance-based release. I'll help with stripping the op_executor beast, so that everything interfaces the scheduler directly. Philippe, did you by chance check the impact of the generator integration on kernel latency? We only have a 1-10us margin to work with, which I haven't checked yet. Don't worry about the overhead. It used to be fine. I'll re-check to see whether everything is still fine, but when the program name and the kernel name prefix are known in advance (i.e. for the pre-compiled programs), I don't see where a significant overhead could come from! I'll benchmark this ASAP, once some other modifications are done. I've been very motivated to work on the kernel generator recently, and simply don't feel like working on (1) or (2) at the moment. Now, there are two different options for (4): 4.1 - Implementing the kernel fusion mechanism inside the scheduler.
4.2 - Input-dependent kernels, and performance prediction. While I could help with 4.1, I don't feel like I could do this task alone, because I don't have sufficient knowledge of the backend. Plus, it implies getting rid of op_executor(), and I'm not sure how I could do this, either! I feel operational, though, for 4.2. I feel like ViennaCL 1.6 should be a performance-oriented release, and having an (input+device)-dependent kernel selection mechanism is something we have to do! I think we should not go for 4.1 with a 1.6.0 release, simply because it would delay the release cycle. We should provide features to our users fairly quickly after they are stabilized, not have them hanging around in the developer repository for too long. We have enough features for 1.6.0 already ;-) Some work from your side on 4.2 would be good, so if you have some resources left, please focus on that. Sure. 4.2 is part of my (future) PhD work, so I can't expect to have everything working flawlessly for ViennaCL 1.6.0. But I feel like I should be able to create the backbone for this release: a simple environment-variable-based mechanism that points to a folder containing the files spit out by the Python auto-tuner. I'd like an environment-variable-based extension, as it can be easily exploited by advanced users in C++, and
Re: [ViennaCL-devel] GEMM broken in Nightly builds
Hey, After some investigation it looks like the problem is not with the GEMM kernel but with the way the kernel is enqueued. It fails when A and B are associated with the same handle in C = alpha*op(A)*op(A) + beta*C... (this handle-checking feature is there to allow for some optimizations in other kernels, such as those which compute complicated elementwise functions: y = element_prod(element_exp(x), x)). This bug seems simple to fix at first sight, but it's going to be hard to provide a good fix. I could tell the kernel to ignore the handle values for GEMM and always consider C = alpha*op(A)*op(B) + beta*D, but then it prevents me from doing pointer arithmetic with C, and it'll have an influence on register usage. I'll have to implement a GEMM-specific handle-binding policy. I'll try to do it ASAP, but it may not come before tomorrow since I have other things to do today, like getting the database ready for Toby's benchmarks... It looks like blas3_prod-test-opencl fails with a timeout on centos5. This is just SO strange. Philippe 2014-07-07 11:38 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, our Nightly tests report new issues with some examples, which are probably all due to GEMM: http://viennastar.iue.tuwien.ac.at/CDash/index.php?project=ViennaCL (also look at the previous day) Philippe, I see a bunch of recent commits. Is it possible that this got fixed in the meanwhile? Best regards, Karli
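To illustrate the aliasing case described above (a hypothetical snippet; the surrounding setup is mine, not the actual failing test):

    // Both operands of the product are the same object, so both kernel
    // arguments refer to the same underlying OpenCL memory handle:
    viennacl::matrix<float> A(N, N);
    viennacl::matrix<float> C(N, N);
    C = viennacl::linalg::prod(A, viennacl::trans(A)); // C = op(A)*op(A)
    // An argument-binding scheme that deduplicates buffers by handle then
    // binds one buffer where the GEMM kernel expects two distinct arguments.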
Re: [ViennaCL-devel] GEMM broken in Nightly builds
Until this is fixed, I have disabled the use of the generator for GEMM. 2014-07-07 15:00 GMT+02:00 Philippe Tillet phil.til...@gmail.com: Hey, After some investigation it looks like the problem is not with the GEMM kernel but with the way the kernel is enqueued. It fails when A and B are associated with the same handle in C = alpha*op(A)*op(A) + beta*C... (this handle-checking feature is there to allow for some optimizations in other kernels, such as those which compute complicated elementwise functions: y = element_prod(element_exp(x), x)). This bug seems simple to fix at first sight, but it's going to be hard to provide a good fix. I could tell the kernel to ignore the handle values for GEMM and always consider C = alpha*op(A)*op(B) + beta*D, but then it prevents me from doing pointer arithmetic with C, and it'll have an influence on register usage. I'll have to implement a GEMM-specific handle-binding policy. I'll try to do it ASAP, but it may not come before tomorrow since I have other things to do today, like getting the database ready for Toby's benchmarks... It looks like blas3_prod-test-opencl fails with a timeout on centos5. This is just SO strange. Philippe 2014-07-07 11:38 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, our Nightly tests report new issues with some examples, which are probably all due to GEMM: http://viennastar.iue.tuwien.ac.at/CDash/index.php?project=ViennaCL (also look at the previous day) Philippe, I see a bunch of recent commits. Is it possible that this got fixed in the meanwhile? Best regards, Karli
Re: [ViennaCL-devel] Implementation of multi_inner_prod
Ok, thanks! This sounds reasonable indeed. Philippe 2014-06-26 23:51 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, the cases 5, 6, and 7 are handled by running a kernel for four vectors, then subtracting 4 and running a dedicated kernel on the remaining 1, 2, or 3 vectors. This could also be handled by a generated kernel, yes, but I haven't implemented this for two reasons: 1. fewer kernels to compile 2. less implementation effort One single kernel is not possible for an arbitrary number of vectors. Eight vectors turned out to be a reasonable upper bound because the overhead is less than 12.5% over the ideal case already, but at the same time the kernel still works for older GPUs with limited amounts of shared memory. Best regards, Karli On 06/26/2014 11:09 PM, Philippe Tillet wrote: I'll add something. I assume that multiple kernels are launched thanks to current_index. Wouldn't it be better to launch one single kernel? I think that a lot of users would prefer to have better performance in exchange for a perhaps slightly longer JIT overhead (since we'll provide a caching mechanism). Philippe 2014-06-26 23:07 GMT+02:00 Philippe Tillet phil.til...@gmail.com mailto:phil.til...@gmail.com: Hello! I noticed this in the implementation of multi_inner_prod: switch (vec_tuple.const_size() - current_index) { case 7: case 6: case 5: case 4: // do stuff However, there is a test for 5, 6, 7, so I assume that these have to be implemented somehow. Could I have more details on why there is no specific kernel for these three cases? NB: This is the very last thing that has to be done before I can push the new device-specific OpenCL backend. All the tests pass except multi_inner_prod for tuple_size = 5. :) Philippe
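The dispatch Karl describes boils down to the following chunking loop (an illustrative sketch; the run_kernel_* helpers are hypothetical stand-ins for the dedicated kernels):

    // Process four vectors at a time; cases 5, 6, and 7 thus reduce to
    // 4 + {1, 2, 3}, and only remainders of 1, 2, or 3 vectors need
    // dedicated kernels of their own.
    std::size_t current_index = 0;
    while (vec_tuple.const_size() - current_index >= 4)
    {
      run_kernel_4(vec_tuple, current_index); // hypothetical helper
      current_index += 4;
    }
    switch (vec_tuple.const_size() - current_index)
    {
      case 3: run_kernel_3(vec_tuple, current_index); break;
      case 2: run_kernel_2(vec_tuple, current_index); break;
      case 1: run_kernel_1(vec_tuple, current_index); break;
      default: break;
    }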
Re: [ViennaCL-devel] PyViennaCL midterm
Hi, Unfortunately I won't be available until Tuesday for a meeting. Python and CUDA-based libraries are widely used by the machine learning community. I also want to push OpenCL forwards, but supporting CUDA through PyViennaCL would be a very good thing to do, since a lot of researchers think that CUDA is faster (or will at least want to compare). Philippe 2014-06-26 18:57 GMT+02:00 Toby St Clere Smithe m...@tsmithe.net: Karl Rupp r...@iue.tuwien.ac.at writes: fine with me. My schedule is very much in flux in the next ~10 days or so, so I might be unavailable on short notice. Rather than having one big IRC meeting with all topics crushed together, I suggest we have a couple of smaller topic-oriented meetings. To start out with, I suggest two topics+dates: If it's inconvenient, I can wait! * Friday, June 27, 18:00 UTC on PyViennaCL But, if not, this time is fine for me :) We can also use VoIP technology in addition to IRC to speed things up if desired. Toby, do you have questions for Andreas on PyOpenCL or PyCUDA? I'm in contact with him regarding the best way to deal with padding of ViennaCL vectors / matrices. For my manual tests, I work with very small dimensions (to make it easy to spot errors), but of course these are padded out. So if I have a 3x3 matrix, I (say) really have a 128x128 matrix, with a lot of 0 entries. I have a convenience function which takes a matrix or vector to a PyOpenCL 'Array' object (which is just an array-like wrapper around an OpenCL buffer). Clearly, the underlying PyOpenCL buffer object should point to the whole 128x128 memory, but should I have the Array only expose the 3x3 matrix (like ViennaCL does), or the whole padded matrix? What I'm working on right now is having the Array work more like the ViennaCL matrix, so that if the user wants to print out the object (for example), they only get the non-padded entries. It's easy to get the whole buffer from an Array, and of course if the user wants to work with it in OpenCL, then they'll need to deal with the padding... So perhaps I should leave the padding visible when the user takes a ViennaCL object to a PyOpenCL one, to make it obvious? There are a couple of bits of API that need some more work. For instance, with regard to structured matrices, the Vandermonde matrix is missing (I had some API incompatibility with my current code which needs looking at), and there is some thinking to be done about operations involving two structured matrices; the logic for computing the result type needs work here. Bandwidth reduction is also missing, because I don't have a PyViennaCL wrapper type for the std::vector<std::map<T, U> > type used here (is generic sparse support on the agenda? Does it matter right now?). I also want to add support for casting between numeric types, as we were discussing earlier. By and large, however, the body of work here is done, and the remaining bits shouldn't take more than an afternoon. A summer student will soon extend the FFT to the multi-backend case, which should also make the structured matrices available for multiple backends. As for the std::vector<std::map<T, U> > type: Are there any standard sparse matrix formats used in Python/NumPy/etc.? Anything in e.g. CSR format? I think it makes most sense to provide convenience conversions for these and not worry about std::vector<std::map<T, U> >. Well, I want to spend some time investigating bringing the PyViennaCL and SciPy sparse types closer together. But of course PyViennaCL supports the ViennaCL compressed / coordinate / etc. types right now.
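For readers unfamiliar with the padding under discussion: ViennaCL distinguishes the logical size of a matrix from the padded size of its underlying buffer. A small illustration (the concrete padded sizes depend on the configuration; 128 echoes the example above):

    viennacl::matrix<float> A(3, 3);
    // Logical size, as exposed to the user:
    //   A.size1() == 3, A.size2() == 3
    // Padded size of the underlying buffer (e.g. rounded up to 128):
    //   A.internal_size1(), A.internal_size2()
    // Entries outside the logical 3x3 block are zero and must stay zero.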
A similar situation arises in the context of supporting multiple back-end platforms. I have implemented a generic Context object, and it is now possible to construct any PyViennaCL type in any given Context (for instance, in host memory, or on a specific OpenCL device with a given OpenCL queue). I'd like to give credit to Andreas for his work on PyOpenCL, which has made my life here fairly easy. Meaningful exceptions are raised if you try to execute an operation involving objects in different contexts. Perfect! Did you stumble over any problems in which the context isn't used correctly? There may be some corner cases where this isn't handled correctly, so please don't hesitate to report if you run into issues. Hmm. As far as I'm aware, the only failures I've seen so far have been my fault: for instance, if I try to do A+B with A and B having contexts on different devices or with different associated queues. I now test the vcl::ocl::context equality operator and throw an exception if I get false (or if A and B have different memory domains entirely, etc.). My next job is to write a simple example involving a custom PyOpenCL kernel interacting with ViennaCL objects and operations, which I hope to have by the end of the week. Subsequently, I need to prepare some simple benchmarks and my paper for
Re: [ViennaCL-devel] PyViennaCL midterm
I'll be available from Tuesday afternoon on. What about Wednesday, 13:00 UTC and 15:00 UTC? 2014-06-27 18:30 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, Unfortunately I won't be available until Tuesday for a meeting. Python and CUDA-based libraries are widely used by the machine learning community. I also want to push OpenCL forwards, but supporting CUDA through PyViennaCL would be a very good thing to do, since a lot of researchers think that CUDA is faster (or will at least want to compare). Okay, if you're unavailable until Tuesday then we just postpone it until next week (both topic meetings). Let's use the time to arrange good time slots: Suggestions? Best regards, Karli
[ViennaCL-devel] Implementation of multi_inner_prod
Hello! I noticed this in the implementation of multi_inner_prod: switch (vec_tuple.const_size() - current_index) { case 7: case 6: case 5: case 4: // do stuff However, there is a test for 5, 6, 7, so I assume that these have to be implemented somehow. Could I have more details on why there is no specific kernel for these three cases? NB: This is the very last thing that has to be done before I can push the new device-specific OpenCL backend. All the tests pass except multi_inner_prod for tuple_size = 5. :) Philippe
Re: [ViennaCL-devel] Behavior of norm_* on vector<int>
Hey, 2014-06-24 12:29 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, If yes, I think that it should be changed, because this easily violates the axioms of a norm: we can have norm(alpha*v) != |alpha|*norm(v) because of the rounding. This will usually be the case even if we change it. There are good reasons why Clang emits warnings when using != or == for floating point comparisons ;-) I know that ;) But I'd say that the error we make for norm_2 using float is still stable. For integers, I doubt it is :p What do you consider as 'stable' here? Even integer results can be stable in the sense that your error will be less than one in modulus ;-) Let's not talk too much about this, we'll worry about machine epsilons later :P I think that norm_*(vector<int>) should be changed to float norm_*(vector<int>). Any thoughts? There is no need to change anything for norm_1 and norm_inf. So, the only relevant implementation case is norm_2, for which ublas uses the same type convention we use now (at least that's what I found when I looked it up). Although a floating point return type is probably more often desired than an integer type, it would certainly complicate the implementation. Moreover, it would introduce inconsistency, which I'm not very fond of. The other thing, of course, is that it complicates the implementation considerably (which floating point type to return? float is not great in terms of precision, but double may not be available on the GPU...). I'm open to using a different approach than what we have now, but I'd like to hear solid arguments in favor of a change ;-) Well, this implementation problem already exists! The sqrt() function only takes float/double as input (except C++11's sqrt, which casts to double). As a result, norm_2() is actually disabled for integers (I had not noticed this in my first e-mail) ;). It makes a lot of sense to disable it, indeed. This leads to another question, though. Should we add to the todo list some casting operators such as viennacl::norm_2(viennacl::cast<float>(v_int))? For OpenCL, these can be easily handled by the generator. Hmm, do we have enough use cases for this? I'd rather handle this through an explicit conversion step, i.e. viennacl::vector<float> v_float(v_int); viennacl::norm_2(v_float); instead of integrating more complexity into how we handle the various operations. The explicit conversion can be provided in a controlled manner (the number of any-to-any conversions is still manageable), whereas the introduction of casting would immediately blow up the possible input combinations for *all* the operations. For example, the operation x = y + z with the current requirement of the same scalar types for x, y, and z results in ~10 different input combinations (four signed integer types, four unsigned types, two floating point types). If we allow arbitrary combinations through casting, this would make it 1000 combinations. OpenCL would certainly help with jit-ting this, but compilation times for CUDA and OpenMP would certainly explode if we want to provide these operations through a shared library interface later. In contrast, an explicit any-to-any conversion can be covered with 10x10=100 kernels, ultimately resulting in the same functionality for our users. I expect that the lower performance due to an explicit conversion/duplication is negligible in typical use cases. This sounds reasonable indeed.
I need a casting operation_node_type in the generator to control explicit casting within a generated kernel, but it sounds very reasonable to only allow such constructors indeed. For example, max(int, abs(int)) won't compile in OpenCL, because abs returns an unsigned int, therefore leading to an ambiguous call of max. So I have to modify the expression tree a bit to invoke the generator with max(int, (int)abs(int)). Now, adding an interface to allow x = viennacl::element_cast<float>(y) + z for OpenCL only seems superfluous indeed. There are not enough use cases to justify an interface divergence. However, having a very explicit constructor viennacl::vector<float> x(viennacl::element_cast<float>(x_int)); may be better than viennacl::vector<float> x(x_int); especially when a vector is constructed using operator=. I can add this to the todo list for ViennaCL 1.6. It is a great feature which could permit some great bandwidth savings as well as some mixed-precision implementations. Philippe Best regards, Karli
Re: [ViennaCL-devel] Are {op_row, op_diag, op_column} unary or binary?
I want to be more precise, actually: the problem I see with this is that, mathematically, diag or row remain unary operators. The fact that it is binary has to do with our implementation. So, actually, I think that we should never stick to the mathematical meaning of the operators (even though sometimes the two may coincide). Philippe 2014-06-17 10:29 GMT+02:00 Toby St Clere Smithe m...@tsmithe.net: Hey Philippe, Philippe Tillet phil.til...@gmail.com writes: The integration of the generator is going on slowly but safely. Vector kernels are fully integrated and I'm about to support some matrix kernels as well (excluding FFT, LU, and a few others). I have one metaphysical question, though. There are two possible interpretations for op_row, for example. Either we consider it a unary operator because it acts on a single matrix, or we consider it a binary operator because it also requires a uint as its RHS. I'm leaning towards the latter interpretation; I'd be glad to have your opinion on this! I'd go with binary, for exactly your reason, I think. Cheers, Toby
Re: [ViennaCL-devel] Benchmark GUI warmup
Hey Namik, 2014-05-06 19:43 GMT+02:00 Namik Karovic namik.karo...@gmail.com: Hello, Apologies for not replying earlier; I've been quite busy these last two days. Don't worry ;) So far I have been exploring the advantages/disadvantages of using QML/QtQuick vs. a traditional widget-based GUI. QML has some great design features that could improve the overall user experience and aren't easily implemented when using widgets. I was originally planning to develop some parts using QML (animations and charts) and integrate them with the main widget-based GUI. However, I am now exploring the possibility of doing the entire GUI in QML. Suggestions on which approach to choose are welcome. Unfortunately, I don't know much about Qt, so I probably couldn't help here. However, keep in mind that we aim for maximum portability. I would tend to think that QML is more portable across languages, and so I would say go for it, as long as you don't lose portability elsewhere. Reading through your discussion about expert benchmark settings, I see that I probably should have spent more time studying the autotuner and benchmark codes :/ I understand that there is a great need for expert benchmark customization, and I hope to succeed in making that part as detailed as possible, but there should be a certain limit to the extent of the details. What I'm saying is I'd rather not spend time developing features that will be used only a couple of times. Surely there are some details that aren't of critical importance? It would be great if you guys could agree on which expert details are of greatest priority. I'm going to start studying the autotuner and benchmark codes so I can better understand what needs to be done. I think that the most important part of the project is the intuitiveness/functionality of the GUI. Keep in mind that most of your userbase will have a limited amount of time, and that anything beyond double-click+coffee break will probably be ignored ;) I really believe that there is no need to read any code related to the auto-tuner, as it is disappearing. Re-implementing an exhaustive search for one particular size for the GUI will not be a huge challenge, so don't worry too much about it. This thread is exclusively dedicated to possible features in the expert tab, which is not a priority for now (but it's still good to have some mid-term perspective when starting a project). That being said, I believe that the basic options should include: - Benchmarking of as many routines as possible: BLAS, FFT, solvers, etc. - Simple exhaustive-search auto-tuning for whatever supports it: what could this hardware ideally give on this problem? - Export of the benchmark results to an open database. I don't think you should worry about anything else as of now. I'll be working rather actively on a command-line interface to some advanced auto-tuning features. Philippe Best regards, Namik On Tue, May 6, 2014 at 9:38 AM, Karl Rupp r...@iue.tuwien.ac.at wrote: Hi, Why is the data pointless? I'd rather have only a few datapoints on new hardware out there rather than having absolutely no data at all. I mean, the data is pretty useful because it tells us about the best default kernel for large square matrices, but it is not very useful if we want to build a general input-dependent model, as that requires, in my experience, more than 1000 data points. This is true. So this calls for a hierarchical approach: Level 1: Just a couple of known kernels for a given data size, which are compared on the target machine.
Level 2: A full tuning set for one data size on the target. Level 3: All ~1000 points for building a model. Execution times between these levels vary significantly: while almost all users will go through Level 1 anyway, only a few will have the patience to wait for results on Level 2. Level 3 will be mostly for us, to have a 'normalized' process for building performance models. Either way, if others join (the machine learning community?), that would be great! I'd rather refrain from running Python scripts from the benchmark GUI. This is intended to be an end-user tool. Those interested in running from Python should take the Python code (i.e. PyViennaCL) directly. Are you sure? It would not take a lot of effort to have an optional way to call the Python script with the proper arguments from the auto-tuner, as long as the user provides the path and has all the necessary dependencies. The second half of the last sentence is the problem. I expect 80% of users to run on Windows, where anything but a 'double click installer' is a non-standard process. If Namik has time left by the end of the summer, we can look into that, but we first need to focus on our target audience. Such cases are probably only interesting for the 'expert settings' tab in the GUI, as these parameters only make sense to people who *really* know what
Re: [ViennaCL-devel] Benchmark GUI warmup
Hi, 2014-05-06 9:38 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, Why is the data pointless? I'd rather have only a few datapoints on new hardware out there rather than having absolutely no data at all. I mean, the data is pretty useful because it tells us about the best default kernel for large square matrices, but it is not very useful if we want to build a general input-dependent model, as that requires, in my experience, more than 1000 data points. This is true. So this calls for a hierarchical approach: Level 1: Just a couple of known kernels for a given data size, which are compared on the target machine. Level 2: A full tuning set for one data size on the target. Level 3: All ~1000 points for building a model. Execution times between these levels vary significantly: while almost all users will go through Level 1 anyway, only a few will have the patience to wait for results on Level 2. Level 3 will be mostly for us, to have a 'normalized' process for building performance models. Either way, if others join (the machine learning community?), that would be great! I'd rather refrain from running Python scripts from the benchmark GUI. This is intended to be an end-user tool. Those interested in running from Python should take the Python code (i.e. PyViennaCL) directly. Are you sure? It would not take a lot of effort to have an optional way to call the Python script with the proper arguments from the auto-tuner, as long as the user provides the path and has all the necessary dependencies. The second half of the last sentence is the problem. I expect 80% of users to run on Windows, where anything but a 'double click installer' is a non-standard process. If Namik has time left by the end of the summer, we can look into that, but we first need to focus on our target audience. I think you're right. Namik's GUI should only provide Levels 1 and 2, which do not require any Python. Since Level 3 would be an internal tool, as you correctly pointed out, we could stick to a Python command-line interface, or a rudimentary PyQt GUI. Philippe Such cases are probably only interesting for the 'expert settings' tab in the GUI, as these parameters only make sense to people who *really* know what they are doing (and are willing to invest the time). For bloggers, journalists, etc., who just want to quickly get some performance datapoints for the very latest hardware, this is usually not of interest. We need to focus on serving the main audience first and then watch out for fruitful directions in which to extend it further. Of course! I've been referring to the expert settings tab from the beginning :) Ah, please say so :-) Best regards, Karli
Re: [ViennaCL-devel] Benchmark GUI warmup
Hi, 2014-05-05 9:18 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, (CC-ing viennacl-devel, as this is developer-talk ;-) ) Either way, I want to let you know that the generator/auto-tuner is undergoing significant changes, and that you will, actually, not have to worry about it for your GSoC project. The generator will be used transparently via the viennacl::linalg:: functions, and the auto-tuner will be entirely moved to pyviennacl. Well, I think this is not entirely unrelated. The purpose of the GUI is still to allow a broader community to feed us with benchmark data, so somehow the loop over all possible configurations is still essential. With an interface to Python I assume that an API to do exactly that will still be available ;-) Well, looping over all the possible configurations for one particular problem size is good for benchmarking purposes only; the data generated this way will not be re-usable unless we can make some assumptions on the input-data size. That is, if the GUI only auto-tunes GEMV/GEMM for large square matrices, then we will collect a lot of pointless data. Instead, the GUI should export a model which, given some input data sizes and a hardware configuration, is able to predict the optimal kernel. This is why the auto-tuner is being moved to pyviennacl. However, the GUI could/should indeed still be able to execute the corresponding Python scripts. There is, however, one additional point I'd like to discuss. The performance of all the algorithms you'll have to benchmark is highly dependent on the characteristics of the input data. For example, matrix products will behave very differently according to the size/shape of the input matrices. This is very important: it means that a good benchmarking GUI could help the users design their system. Here's an example. Suppose that someone wants to solve the linear system A x = y. If, for his particular application, A is a 50,000x50,000 sparse matrix, then he could be greatly interested in knowing how he could pad A to achieve better performance. In that case, the benchmarking GUI could randomly explore R^2 beyond (50,000; 50,000), and potentially tell the user that, if he makes A a (50,500; 50,500) matrix, then he could improve his performance by, say, 10 or 20%. For sparse matrices I don't believe in random patterns. The user usually has a particular application in mind, so I consider it more important to a) allow users to feed the tuner with their own sparse matrix, and b) allow users to select sparse matrices from the Florida matrix market. The second option is important for benchmark purposes and for comparison with data in the literature. We can also add a third option for random matrices, but it's certainly far less important. We could also try to describe a sparse matrix by a few parameters (number of rows/cols, format, sparsity pattern, etc.) and use machine learning to predict the optimal kernel given an arbitrary sparse matrix. For the training data, we could use the Florida matrix market, indeed. In the case of dense matrix products, one may even be able to double his performance by slightly altering the size of the input matrices. Okay, this is only about adjusting the padding parameter and should be transparently included in the tuning process anyway, shouldn't it? This is not exactly what I meant. Suppose that someone wants to compute the dense matrix product A*B, where A is in R^{238 x 2031} and B is in R^{2031 x 1240}.
Then, the auto-tuner should indeed find the optimal padding size, and A and B would be transparently padded to multiples of 128: {256, 2048} and {2048, 1280}. However, for some reason, using matrices of size {256, 2176} and {2176, 1280} may be worth it for SGEMM (but not for DGEMM), because 2048 could trigger a lot of bank conflicts. Similarly, one might fall on a sweet spot of his GPU for {256, 2560}x{2560, 1408}. I don't think that ViennaCL should handle this. I can think of some applications in the field of artificial neural networks, where one may want to resize the layers of his neural network so as to fall on some sweet spots of his GPU. Philippe Best regards, Karli
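For concreteness, the transparent padding mentioned above is just a round-up to the next multiple of the block size (a trivial sketch; 128 is the example value used in this thread):

    #include <cstddef>

    // Round a dimension up to the next multiple of 'multiple'.
    std::size_t pad(std::size_t n, std::size_t multiple = 128)
    {
      return ((n + multiple - 1) / multiple) * multiple;
    }
    // pad(238) == 256, pad(2031) == 2048, pad(1240) == 1280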
Re: [ViennaCL-devel] Benchmark GUI warmup
Hi hi, 2014-05-05 21:49 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, Well, I think this is not entirely unrelated. The purpose of the GUI is still to allow a broader community to feed us with benchmark data, so somehow the loop over all possible configurations is still essential. With an interface to Python I assume that an API to do exactly that will still be available ;-) Well, looping over all the possible configurations for one particular problem size is good for benchmarking purposes only; the data generated this way will not be re-usable unless we can make some assumptions on the input-data size. The data is reusable, of course, assuming that one knows the matrix sizes it has been obtained for. That is, if the GUI only auto-tunes GEMV/GEMM for large square matrices, then we will collect a lot of pointless data. Why is the data pointless? I'd rather have only a few datapoints on new hardware out there rather than having absolutely no data at all. I mean, the data is pretty useful because it tells us about the best default kernel for large square matrices, but it is not very useful if we want to build a general input-dependent model, as that requires, in my experience, more than 1000 data points. Instead, the GUI should export a model which, given some input data sizes and a hardware configuration, is able to predict the optimal kernel. This is why the auto-tuner is being moved to pyviennacl. However, the GUI could/should indeed still be able to execute the corresponding Python scripts. I'd rather refrain from running Python scripts from the benchmark GUI. This is intended to be an end-user tool. Those interested in running from Python should take the Python code (i.e. PyViennaCL) directly. Are you sure? It would not take a lot of effort to have an optional way to call the Python script with the proper arguments from the auto-tuner, as long as the user provides the path and has all the necessary dependencies. For sparse matrices I don't believe in random patterns. The user usually has a particular application in mind, so I consider it more important to a) allow users to feed the tuner with their own sparse matrix, and b) allow users to select sparse matrices from the Florida matrix market. The second option is important for benchmark purposes and for comparison with data in the literature. We can also add a third option for random matrices, but it's certainly far less important. We could also try to describe a sparse matrix by a few parameters (number of rows/cols, format, sparsity pattern, etc.) and use machine learning to predict the optimal kernel given an arbitrary sparse matrix. For the training data, we could use the Florida matrix market, indeed. I agree with this approach. Everything is better than using a fixed work group size as we do now (even though this is how other libraries deal with the problem as well). In the case of dense matrix products, one may even be able to double his performance by slightly altering the size of the input matrices. Okay, this is only about adjusting the padding parameter and should be transparently included in the tuning process anyway, shouldn't it? This is not exactly what I meant. Suppose that someone wants to compute the dense matrix product A*B, where A is in R^{238 x 2031} and B is in R^{2031 x 1240}. Then, the auto-tuner should indeed find the optimal padding size, and A and B would be transparently padded to multiples of 128: {256, 2048} and {2048, 1280}.
However, for some reason, using matrices of size {256, 2176} and {2176, 1280} may be worth it for SGEMM (but not for DGEMM), because 2048 could trigger a lot of bank conflicts. Similarly, one might fall on a sweet spot of his GPU for {256, 2560}x{2560, 1408}. I don't think that ViennaCL should handle this. I can think of some applications in the field of artificial neural networks, where one may want to resize the layers of his neural network so as to fall on some sweet spots of his GPU. Such cases are probably only interesting for the 'expert settings' tab in the GUI, as these parameters only make sense to people who *really* know what they are doing (and are willing to invest the time). For bloggers, journalists, etc., who just want to quickly get some performance datapoints for the very latest hardware, this is usually not of interest. We need to focus on serving the main audience first and then watch out for fruitful directions in which to extend it further. Of course! I've been referring to the expert settings tab from the beginning :) Philippe Best regards, Karli
Re: [ViennaCL-devel] OpenCL C++ API
Hi, 2014-04-29 15:59 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, So I can't help but bring up this topic :) Is there any reason why we're using the OpenCL C API instead of the C++ one? Yes, the reason is simple: the C++ API was standardized quite some time *after* the development of ViennaCL started. It seems like we could save several thousands of lines of code (and gain a lot of clarity) by using the C++ API directly. Well, I'm not so sure about that, actually. I'd more conservatively estimate it in the range of hundreds. Actually, I think that most of the code bloat comes from the clGet*Info() calls. There are dozens of methods in ocl::device to handle this, while it could all be handled using a single properly templated method. I notice that I already did that work in viennacl/ocl/infos.hpp, but that it is somehow not used. Is there any drawback to using viennacl::ocl::info<CL_DEVICE_NAME>(viennacl::ocl::device const & d); instead of viennacl::ocl::device::name()? Perhaps we could rename it so that the call becomes viennacl::ocl::info<VIENNACL_DEVICE_NAME>(viennacl::ocl::device const &), so that VIENNACL_DEVICE_NAME could be used with CUDA/OpenMP devices too, if ever needed. Of course, we would keep the ocl::device/context/whatever. I see a couple of reasons why we would want to do that: - For now, we have to use VIENNACL_ERR_CHECK(err) after each internal call to the C API. Since it's cumbersome, I've noticed that there are a lot of places in the code where we just ignore some error checking for the sake of clarity. If we used the C++ API, then we could use C++ exceptions internally and wouldn't have to bother with all this error checking. As you certainly know, I'd like the ViennaCL core to open up to other languages with version 2.0.0, eventually making it a C shared library with a C++ layer on top. Introducing C++ exception-based error handling at this stage looks like a step in the wrong direction under these long-term perspectives. Oh, yes, that's right. ViennaCL is already using an exception-based mechanism, but it would of course be easier to change it ourselves if we ever need to. - It seems like the list of ViennaCL exceptions is not up to date. If we forget to handle a case, then we get ocl::unknown_error, while the error is properly handled by the OpenCL C++ API. Which cases aren't covered now? I have just noticed that these may be extensions, such as CL_PLATFORM_NOT_FOUND_KHR. I'll look into that. - We could save literally thousands of lines of code. I'm ready to bet that it could relieve the compiler and reduce the compilation times. Hundreds ;-) I bet against your bet that compilation times reduce notably, for two reasons: - the OpenCL C++ API just adds yet another type hierarchy - most of the compilation time is spent in other parts of the library. However, feel free to prove me wrong ;-) It's true, most of the time is spent in template instantiation rather than parsing. I remember having benchmarked it on an older version of ViennaCL. I think that 1000 lines could be saved by using viennacl::ocl::infos instead of viennacl::ocl::{device|program|kernel|chocolatebar}::*(). - It would be easier to maintain. There is a whole community bugfixing the cl.hpp file, and we are more likely to have bugs in our C calls than in our C++ calls. How do you intend to deal with bugs or warnings obtained from cl.hpp? We can handle all bugs and warnings in our code, but we might have a hard time dealing with warnings or even errors in older versions of cl.hpp.
Yes, it won't be noticed by 99% of our users, but I also care about the remaining one percent. We actually had a warning in cl.h recently (or was it in cl.hpp?), didn't we? :P So we might have to deal with warnings in cl.h as well. I agree that we don't want a riot with angry people shouting "We are the 1%". - It doesn't add any external dependency. True. - I think that when we need to deal with more complicated things such as multiple devices, we'll gain in productivity and robustness by using the C++ API. Robustness might be true. In which way are we going to gain in productivity? I understand that you may gain in productivity when dealing with the generator, but most other parts of ViennaCL are sitting on top of a backend-agnostic layer, not getting in touch with OpenCL directly. So the question is whether you consider the additional effort of replacing the current use of the C API worth it. Actually, I came across a lot of inconveniences while plugging a simple caching mechanism into viennacl::ocl::context::add_program(). It's not at all about the generator ;) The inconveniences, however, had to do with unintegrated use of STL vectors and cumbersome clGet*Info calls. Adding methods would not have solved the problem, because the corresponding ViennaCL objects were not created at this point (but the cl handles were).
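As an illustration of how a single templated method can subsume dozens of per-property accessors (a hypothetical sketch in the spirit of viennacl/ocl/infos.hpp, whose actual implementation presumably differs; only string-valued properties are handled, and error checking is omitted):

    #include <CL/cl.h>
    #include <cstddef>
    #include <string>

    // One templated wrapper around clGetDeviceInfo instead of one method
    // per property; a production version would check the returned cl_int
    // of each call.
    template <cl_device_info Param>
    std::string device_info_string(cl_device_id dev)
    {
      std::size_t bytes = 0;
      clGetDeviceInfo(dev, Param, 0, NULL, &bytes);          // query size
      std::string result(bytes, '\0');
      clGetDeviceInfo(dev, Param, bytes, &result[0], NULL);  // fetch value
      if (!result.empty() && result[result.size() - 1] == '\0')
        result.resize(result.size() - 1);                    // drop trailing NUL
      return result;
    }
    // Usage: std::string name = device_info_string<CL_DEVICE_NAME>(dev);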
Re: [ViennaCL-devel] OpenCL C++ API
Hi, 2014-04-29 16:54 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, It seems like we could save several thousands of lines of code (and gain a lot of clarity) by using the C++ API directly. Well, I'm not so sure about that, actually. I'd more conservatively estimate it in the range of hundreds. Actually, I think that most of the code bloat comes from the clGet*Info() calls. There are dozens of methods in ocl::device() to handle this, while this could be simply handled using a proper templated method. I notice that I already did that work in viennacl/ocl/infos.hpp, but that it is somehow not used. Is there any drawback to using viennacl::ocl::info<CL_DEVICE_NAME>(viennacl::ocl::device const & d); instead of viennacl::ocl::device::name()? Yes, we discussed that. The reason is that with info() one cannot buffer the results. Perhaps we could rename it so that the call becomes viennacl::ocl::info<VIENNACL_DEVICE_NAME>(viennacl::ocl::device const &), so that VIENNACL_DEVICE_NAME could be used with CUDA/OpenMP devices too if ever needed. I think that's never needed. - We could save literally thousands of lines of code. I'm ready to bet that it could relieve the compiler and reduce the compilation times. Hundreds ;-) I bet against your bet that compilation times reduce notably, for two reasons: - the OpenCL C++ API just adds yet another type hierarchy - most of the compilation time is spent in other parts of the library However, feel free to prove me wrong ;-) It's true, most of the time is spent in template instantiation rather than parsing. I remember having benchmarked it on an older version of ViennaCL. I think that 1000 lines could be saved by using viennacl::ocl::infos instead of viennacl::ocl::{device|program|kernel|chocolatebar}::*(). Yes and no. We might save some milliseconds from this, but it gets harder for us to document, and users will have a harder time finding it. Also, return values cannot be buffered, see above. That was the point of dropping infos. I can dig out the other thread if you like. - It would be easier to maintain. There is a whole community bugfixing the cl.hpp file, and we are more likely to have bugs in our C calls than in our C++ calls. How do you intend to deal with bugs or warnings obtained from cl.hpp? We can handle all bugs and warnings in our code, but we might have a hard time dealing with warnings or even errors in older versions of cl.hpp. Yes, it won't be noticed by 99% of our users, but I also care about the remaining one percent. We actually had a warning in cl.h recently (or was it in cl.hpp?), didn't we? :P So we might have to deal with warnings in cl.h as well. I agree that we don't want a riot with angry people shouting "We are the 1%". Yes, it was a warning in the OpenCL 1.1 headers, which is fixed in OpenCL 1.2. This was in a trivial C header. You can imagine how things can get worse with C++. - I think that when we need to deal with more complicated things such as multiple devices, we'll gain in productivity and robustness by using the C++ API. Robustness might be true. In which way are we going to gain in productivity? I understand that you may gain in productivity when dealing with the generator, but most other parts of ViennaCL are sitting on top of a backend-agnostic layer, not getting in touch with OpenCL directly. So the question is whether you consider the additional effort of replacing the current use of the C API worth it.
Actually, I came across a lot of inconveniences while plugging a simple caching mechanism into viennacl::ocl::context::add_program(). It's not at all about the generator ;) The inconveniences, however, had to do with unintegrated use of STL vectors and cumbersome clGet*Info. Adding methods would not have solved the problem, because the corresponding ViennaCL objects were not created at this point (but the cl handles were). This sounds to me like the OpenCL C++ API won't make a difference here ;-) On the other hand, I cannot see any functionality in the C API which isn't wrapped in the C++ one. It sure would take quite a bit of work, but I'm ready to handle it myself if there is no objection. I'm not objecting to a change, but I also want to submit that I don't think it is worth the effort. If you're fine with the effort (including the testing and bugfixes to bring it to a state comparable to the current one), please go ahead :-) I think you're right... Plus, we need information such as the device macro-architecture, which will probably never be provided by the OpenCL standards. To get information such as NUMA domains, we might even have to resort
Re: [ViennaCL-devel] Benchmark GUI Project Overview
Hey Namik, Congratulations on your acceptance to the GSoC! I don't know to what extent this blog is customizable, but it would be nice to have some sub-sections related to the sub-parts of the project, to clearly distinguish your updates/ideas on the GUI itself from those you'll have on e.g. the code architecture. It will make communication easier. Philippe 2014-04-27 19:16 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, thanks for putting everything in shape and posting it. Just some thoughts on the blog itself: * Blog title: Try to use more descriptive titles. GSoC 2014 Warmup is better suited than just Warmup, because it contains the important keyword GSoC and you don't get collisions in the following years. * The title/header bar is quite big, using a lot of vertical space (300px) without much information. If you reduce that to about 2/3, you can fit more interesting information onto the first screen on lower-resolution displays. Best regards, Karli On 04/27/2014 12:36 AM, Namik Karovic wrote: Greetings, I've uploaded the benchmark GUI project proposal on my blog. You can read it at http://zalomiga.ba/blog/warmup/ . Any feedback is welcome. Regards, Namik
[ViennaCL-devel] WebCL 1.0 final specifications released
Hello everybody, After two years pending, the final specifications for WebCL 1.0 were released a couple of days ago. It is, logically enough, based on OpenCL 1.1, which suits ViennaCL, since we don't require anything beyond that anyway. I don't see any clear applications of it for ViennaCL, and I'm incredibly bad with everything more or less related to web development. I just wanted to keep you guys updated :) Philippe
Re: [ViennaCL-devel] Ideas for Google Summer of Code?
Hey everybody, My recent advances on auto-tuning gave birth to a new GSoC idea in my mind. More exactly, I've come up with something more complete around (crowd-sourced) auto-tuning and the GUI. This would include: - Developing a portable auto-tuning GUI (as of now: BLAS1 / dense BLAS2 / dense BLAS3) - Building a better database representation for our profiles; I feel like manually filling a map will become awful as we obtain more data. Ideally, it would be about integrating ViennaCL with cTuning. This would allow us to build a centralized database, to easily build statistics/graphs of our data, and in the future to create machine learning models for these datasets. - Filling the database with as many devices as possible. What do you think about it? Best regards, Philippe 2014-02-16 21:02 GMT+01:00 Evan Bollig bol...@gmail.com: No real preference. I do want to play with some of the new features in CUDA 6 this summer. Aside from that, SpMM is of interest to the other groups I work with. I think PETSc bindings would be a good project. I met with a CFD group on Friday to discuss optimization and scaling improvements in their PETSc code. Seems to be a big item of interest for many. -E On Saturday, February 15, 2014, Karl Rupp r...@iue.tuwien.ac.at wrote: Hi Evan, Fyi, I'm willing to contribute to the GSoC effort as well. I can sponsor access to the variety of hardware we have at the Minnesota Supercomputing Institute, including single and multi-GPU nodes and workstations (M2070s, K20s, K2s, and Quadro variants), single and multi Xeon Phi 5110P nodes, and Nehalem and Sandy Bridge nodes. Let me know if the need arises. I'm also open to mentoring if needed. Awesome, Evan, that would be absolutely great. Do you have any particular project you'd like to see addressed? With multiple nodes available it would also be interesting to work on the PETSc-ViennaCL bindings. (This would certainly require a fairly experienced student.) Best regards, Karli On Saturday, February 15, 2014, Karl Rupp r...@iue.tuwien.ac.at mailto:r...@iue.tuwien.ac.at wrote: Hi Philippe, I completely agree concerning matrix-free implementations of the linear solvers. Their absence is the very reason why I had to reimplement solvers for UMinTL. I assume you are aware that you can overload viennacl::linalg::prod() for whatever custom 'matrix' type you pass to solve()? Furthermore, some other fancy stopping criteria may be provided. For example, some algorithms in unconstrained optimization use CG on an indefinite matrix, and abort the solver once p^T A p < 0. There are also some probabilistic stopping criteria for CG when the matrix-free implementation is an estimator of the true matrix-vector product. In the end, the CG I ended up with for UMinTL is pretty big and flexible, and I think it would be a good thing to have the same thing within ViennaCL. The monitoring capabilities for the iterative solvers in ViennaCL are indeed poor, not providing any feedback to the outside about the current residual and/or custom stopping criteria, convergence reasons, etc. Improving this is desirable, among many other things to do as well. Again a matter of priorities... ;-)
Re: [ViennaCL-devel] Ideas for Google Summer of Code?
Hi, I completely agree concerning matrix-free implementations of the linear solvers. Their absence is the very reason why I had to reimplement solvers for UMinTL. Furthermore, some other fancy stopping criteria may be provided. For example, some algorithms in unconstrained optimization use CG on an indefinite matrix, and abort the solver once p^T A p < 0. There are also some probabilistic stopping criteria for CG when the matrix-free implementation is an estimator of the true matrix-vector product. In the end, the CG I ended up with for UMinTL is pretty big and flexible, and I think it would be a good thing to have the same thing within ViennaCL. Best regards, Philippe 2014-02-15 9:32 GMT+01:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, As long as you're a student, you're eligible to apply for GSoC. ;-) However, I don't give any guarantees; your application will be treated equally. You certainly have an advantage with respect to knowing how things work, but no other student should be excluded upfront. It would definitely be great to be able to finish what I've started! A preliminary list of ideas is available here: http://www.iue.tuwien.ac.at/cse/index.php/gsoc/2014.html I'm thinking about adding an OpenMP tuning project, since more and more users seem to get in touch with it. I was thinking about another nice-to-have feature. I'd quite like to make it possible (if it is in fact possible..) to play with prototyping implementations of other algorithms for ViennaCL in PyViennaCL using PyOpenCL and PyCUDA; for instance (just picking a recently discussed example) to implement using PyOpenCL a cl_program for sparse matrix multiplication, and be able to use the resultant buffer like any other PyViennaCL matrix object. I don't know if it would be worthwhile to hook into the generator or scheduler at this point. One thing that will certainly be of interest to a bunch of people is the ability to provide custom matrix-vector products for the iterative solvers (matrix-free implementations). Andreas Kloeckner is also looking forward to providing any help needed. Hooking this into the scheduler is possible to some degree, at least for the common applications of an operator to a vector or matrix. Just personally, I'd quite like this functionality, because I do find rapid development easier in Python than C++, and I would like to play with implementing matrix algorithms at some point in the future. Python *is* more suitable for rapid prototyping than C++. You will quickly find that for rapid prototyping it is important to have broad support for all basic operations (including elementwise operations, etc.), so this should have highest priority. We should be careful with implementing additional algorithms in PyViennaCL for anything other than tutorial purposes, because then we would quickly end up maintaining multiple versions of the same functionality. Yes; I'll need to investigate this. At the moment, I quite enjoy the object-oriented nature of ViennaCL, and there are parts of PyViennaCL which are inelegantly not-OOP. So I'd probably want to think about which way is going to be most elegant overall. Presumably, the C++ API isn't going to disappear? The object-oriented nature will not disappear. Even if the core is going to be a shared (C-)library for ViennaCL 2.0.0, it will still have the same spirit as the current C++ API. Actually, I expect that ViennaCL 2.0.0 will still have a (more lightweight) C++ layer on top of the shared library, so its API won't change much.
Best regards, Karli
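As an illustration of the matrix-free mechanism mentioned above (overloading viennacl::linalg::prod() for a custom 'matrix' type passed to solve()), here is a minimal sketch; the operator class, its trivial apply(), and the exact requirements of the solver interface are assumptions:

    #include <cstddef>
    #include "viennacl/vector.hpp"
    #include "viennacl/linalg/cg.hpp"

    // a custom 'matrix' that applies A*x on the fly, without ever assembling A
    class MyOperator
    {
    public:
      MyOperator(std::size_t n) : n_(n) {}
      std::size_t size() const { return n_; }
      void apply(viennacl::vector<double> const & x, viennacl::vector<double> & y) const
      {
        y = 2.0 * x; // stand-in for a real matrix-free operator kernel
      }
    private:
      std::size_t n_;
    };

    namespace viennacl { namespace linalg {
      // overload picked up by the iterative solvers
      inline viennacl::vector<double> prod(MyOperator const & A,
                                           viennacl::vector<double> const & x)
      {
        viennacl::vector<double> y(x.size());
        A.apply(x, y);
        return y;
      }
    } }

    // usage:
    //   viennacl::vector<double> x = viennacl::linalg::solve(A, rhs, viennacl::linalg::cg_tag());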
Re: [ViennaCL-devel] More extensive Nightly Tests booting...
Hi Karl, Wow, that's really neat! I'll fix the warnings for Clang and for generator_blas1-opencl. Philippe 2014-02-14 10:38 GMT+01:00 Karl Rupp r...@iue.tuwien.ac.at: Hi guys, in the past few days we worked here in Vienna on setting up an automated nightly build system based on CTest and CDash. It isn't fully completed yet, but it already starts to pay off: http://jwein2.iue.tuwien.ac.at:5/CDash/index.php?project=ViennaCL Philippe, could you please have a look at the new warnings obtained for the generator on Clang? Some of the warnings look pretty ugly and are likely to cause wrong runtime behavior. Also, generator_blas1-opencl now fails due to a missing check for double precision. An older CentOS 5.x system, a Linux Mint machine, and a Windows machine still need to be integrated. Usually I'll take care of warnings, but it certainly helps if you bookmark that page and also check the results in (ir)regular intervals. Automated email notifications are possible, just let me know if I should sign you up. Best regards, Karli
[ViennaCL-devel] viennacl::reduce and viennacl::row/col_wise()
Hello, So, as of now, the generation of row-wise reductions can be triggered through the interface:

    viennacl::reduce<op_add>(viennacl::row_wise(Mat))
    viennacl::reduce<op_max>(viennacl::col_wise(Mat))
    viennacl::reduce<op_min>(Vec)

This plugs into a statement of the form:

    Node 1:
      type     : COMPOUND_OPERATION
      operator : OPERATION_UNARY_REDUCTION
      lhs      : Node 2
      rhs      : OPERATION_BINARY_ADD_TYPE

    Node 2:
      type     : COMPOUND_OPERATION
      operator : OPERATION_UNARY_ROW_WISE_TYPE
      lhs      : MATRIX
      rhs      : UNDEFINED

I think that an operator is a symbolic entity, and that the difference between an elementwise addition and a vector summation should not be encoded at the level of the addition operation. This is why in both cases the same OPERATION_BINARY_ADD_TYPE is involved. I think that the statement representation is nice enough, but the UI may not be optimal from a compilation-time perspective. On the other hand, calling viennacl::reduce(A, VIENNACL_ADD_ROW_WISE); seems not so great at all, since this would involve a *lot* of duplication in the scheduler and the generator. Yet, I don't see any way of having a dynamic interface (no expression templates) while preserving the flexibility of the statement mentioned above. Any ideas? Best regards, Philippe
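For reference, a minimal sketch of what the tag-based interface above could look like (the proxy types are illustrative, not the actual scheduler API); the point is that op_add itself carries no reduction semantics, since the enclosing reduce node does:

    struct op_add {}; // plain addition tag, shared with elementwise operations
    struct op_max {};
    struct op_min {};

    template <typename MatrixT>
    struct row_wise_proxy { MatrixT const & mat; };        // -> OPERATION_UNARY_ROW_WISE_TYPE

    template <typename OpT, typename OperandT>
    struct reduction_proxy { OperandT const & operand; };  // -> OPERATION_UNARY_REDUCTION + OpT

    template <typename MatrixT>
    row_wise_proxy<MatrixT> row_wise(MatrixT const & m)
    {
      row_wise_proxy<MatrixT> proxy = { m };
      return proxy;
    }

    template <typename OpT, typename OperandT>
    reduction_proxy<OpT, OperandT> reduce(OperandT const & operand)
    {
      reduction_proxy<OpT, OperandT> proxy = { operand };
      return proxy;
    }

    // usage (an assignment operator would then translate the proxies into
    // the two statement nodes shown above):
    //   viennacl::vector<float> row_sums = reduce<op_add>(row_wise(Mat));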
Re: [ViennaCL-devel] Ideas for Google Summer of Code?
Hi, I'll be once more available as a mentor :) I'll myself be pretty busy with some BLAS2/BLAS3 tuning for Hawaii. I'm also in favor of project ideas which don't require strong knowledge of the current codebase, such as the GUI autotuning/benchmarking tool. I think that ViennaCL could also hugely benefit from parameterized FFT kernels... Of course, PyViennaCL is another option. Best regards, Philippe 2014-02-03 Karl Rupp r...@iue.tuwien.ac.at: Hi guys, the Google Summer of Code [1] is approaching. It attracted some great contributors in the past, most notably Philippe and Toby, and I hope there are more to come. So, guys, please provide your project ideas. My experience is that good projects are those which don't require the student to understand large parts of the existing code and are easy to formulate (but not necessarily easy to accomplish :-P). Although I don't see work items on our roadmap nicely fulfilling this optimum, we still have at least two neat things to work on: The first project I have in mind is the benchmarking GUI we brainstormed about in IRC. It's probably a good idea to push out a first working version in the next weeks and then let the student work on refinements such as a visualization of the results, etc. Second, I think PyViennaCL will benefit from another push. Toby, how's your availability for a release in February? I have more time now for assisting you with the final polishing. Something rather generic is the implementation of algorithms such as additional iterative solvers, etc. Now that the standard matrix and vector operations are pretty mature and support multiple backends, this should be a pretty fun piece of work :-) @Philippe, Toby: How about mentoring this year? Best regards, Karli [1] http://www.google-melange.com/gsoc/homepage/google/gsoc2014
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, I think we agree on everything now! Okay, I will generate all the kernels; this will actually lead to 16 kernels for each cpu-gpu scalar combination, so 64 small kernels in total. This took time, but it was a fruitful discussion :) Anyway, my ideas are much clearer now, thanks! Best regards, Philippe 2014-01-26 Karl Rupp r...@iue.tuwien.ac.at Hey,

    (programs / kernels each)    execution time [s]
    (  1/128)                     1.4
    (  2/ 64)                     2.0
    (  4/ 32)                     3.2
    (  8/ 16)                     5.6
    ( 16/  8)                    10.5
    ( 32/  4)                    20.0
    ( 64/  2)                    39.5
    (128/  1)                    80.6

Thus, jit launch overhead is in the order of a second! Okay, it seems like 1 program for all the kernels is the way to go. From your hard facts, though, it seems like generating 16 kernels inside the same program would have practically the same cost as generating only one, since the execution time is largely dominated by the jit launch overhead. The jit launch overhead seems to be roughly 80.6/128 ≈ 0.63 s, which leads to a kernel compilation time of roughly (1.4 − 0.63)/128 ≈ 6 ms. Considering that the flip_sign and reciprocal trick cannot be applied for unsigned integers, this is the way to go then. The increase in the number of kernels should be somewhat compensated by the fact that each of the kernels is shorter. All we need to do is to have an interface to the generator where we can just extract the axpy-kernels. The generator should not do any OpenCL program and kernel management. I don't see any problem with extracting the source code from the generator in order to create this program (it is already done for GEMM), but the generator doesn't handle reciprocal and flip_sign. As I said earlier, this feature is cool because it may prevent the transfer of several GPU scalars in order to invert/negate the value. On the other hand, though, it is incompatible with the clBLAS interface and the kernel generator (both of which are fed with cl_float and cl_double). Modifying the generator to handle x = y/a - w/b - z*c internally as x = y*a + w*b + z*c + option_a + option_b + option_c sounds like a very dangerous idea to me. It could have a lot of undesirable side effects if made general, and making an axpy-specific tree parsing would lead to a huge amount of code bloat. This is actually the reason why I am so reluctant to integrate reciprocal and flip_sign within the generator... Okay, let's not propagate reciprocal and flip_sign into the generator then. Also, feel free to eliminate the second reduction stage for scalars, which is encoded into the option value. It is currently unused and makes the generator integration harder than necessary. We can revisit that later if all other optimizations are exhausted ;-)

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {
      // Vectors are padded; wouldn't it be confounding/unnecessary to check
      // for the internal size to fit the width?
      // The following steps are costly for small vectors
      cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the scalar is on global device memory

Never copy device scalars back unless requested by the user. These reads block the command queue, preventing overlaps of host and device computations.

      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

Let's just generate all the needed kernels and only dispatch into the correct kernel.
      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
      {
        dynamically_generated_program::init();
        ambm_kernel(x, cpu_alpha, y, cpu_beta, z);
      }
    }
    else
    {
      statically_generated_program::init();
      ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha,
                  y, beta, reciprocal_beta, flip_beta, z);
    }

What is the difference between dynamically_generated_program::init(); and statically_generated_program::init();? Why aren't they the same? Also, mind the coding style regarding the placement of curly braces and spaces ;-) Wouldn't this solve all of our issues? I (really) hope we're converging now! :) I think we can safely use dynamically_generated_program::init(); in both cases, which contains all the kernels which are currently in the statically generated program. I don't believe it is our task to implement such a cache. This is way too much of a source of errors and of messing with the filesystem for ViennaCL, which is supposed to run with user permissions. An OpenCL SDK is installed into the system and thus has much better options to deal with the location of the cache, etc. Also, why is only NVIDIA able to provide such a cache, even though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an extended amount of time. Agreed. I was just suggesting this because PyOpenCL already provides this, but Python comes with a set of dynamic libraries, so
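To make the agreed-upon batch generation concrete, here is a sketch of emitting all 16 flip/reciprocal variants into a single program string, so only one jit pass is paid per scalar combination; the naming scheme and kernel signature are illustrative, not ViennaCL's actual ones:

    #include <sstream>
    #include <string>

    std::string generate_avbv_variants()
    {
      std::ostringstream src;
      for (int flip_a = 0; flip_a < 2; ++flip_a)
        for (int recip_a = 0; recip_a < 2; ++recip_a)
          for (int flip_b = 0; flip_b < 2; ++flip_b)
            for (int recip_b = 0; recip_b < 2; ++recip_b)
            {
              // bake the sign/reciprocal handling into the source of each variant
              std::string a = recip_a ? "(1.0f/alpha)" : "alpha";
              if (flip_a) a = "(-" + a + ")";
              std::string b = recip_b ? "(1.0f/beta)" : "beta";
              if (flip_b) b = "(-" + b + ")";
              src << "__kernel void avbv_" << flip_a << recip_a << flip_b << recip_b
                  << "(unsigned int N, __global float * x, __global const float * y,"
                  << " __global const float * z, float alpha, float beta) {\n"
                  << "  for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))\n"
                  << "    x[i] = " << a << " * y[i] + " << b << " * z[i];\n"
                  << "}\n";
            }
      return src.str(); // compile once, then extract the 16 kernels by name
    }

The remaining cpu/gpu scalar combinations (the factor of 4 giving 64 kernels) would be generated analogously, with the scalar arguments replaced by __global pointers where needed.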
[ViennaCL-devel] Altera OpenCL optimization guide
Hello everyone, I have found this relatively new and interesting PDF file: http://www.altera.com/literature/hb/opencl-sdk/aocl_optimization_guide.pdf. I'll read it overnight. This is of course for a mid/long-term perspective, but there are some remarkable points within, for example (some teasing :D): The AOC implements local memory in FPGAs very differently than in GPUs. If your OpenCL kernel contains code to avoid GPU-specific local memory bank conflicts, remove that code because the AOC generates hardware that avoids local memory bank conflicts automatically whenever possible. Anyway, this is of course not a priority (we don't even have any hardware to test it on), but it might be useful to get some insight into how Altera's OpenCL behaves... Best regards, Philippe
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey hey Karl, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi Phil, Oh, I get it better now. I am not entirely convinced, though ;) From my experience, the overhead of the jit launch is negligible compared to the compilation of one kernel. I'm not sure whether compiling two kernels in the same program or in two different programs makes a big difference. Okay, time to feed you with some hard facts ;-) Scenario: compilation of 128 kernels, in configurations of x programs with y kernels each (x*y = 128). Execution times:

    (programs / kernels each)    execution time [s]
    (  1/128)                     1.4
    (  2/ 64)                     2.0
    (  4/ 32)                     3.2
    (  8/ 16)                     5.6
    ( 16/  8)                    10.5
    ( 32/  4)                    20.0
    ( 64/  2)                    39.5
    (128/  1)                    80.6

Thus, jit launch overhead is in the order of a second! Okay, it seems like 1 program for all the kernels is the way to go. From your hard facts, though, it seems like generating 16 kernels inside the same program would have practically the same cost as generating only one, since the execution time is largely dominated by the jit launch overhead. The jit launch overhead seems to be roughly 80.6/128 ≈ 0.63 s, which leads to a kernel compilation time of roughly (1.4 − 0.63)/128 ≈ 6 ms. Plus, ideally, in the case of linear solvers, the generator could be used to generate fused kernels, provided that the scheduler is fully operational. Sure, kernel fusion is a bonus of the micro-scheduler, but we still need to have a fast default behavior for scenarios where kernel fusion is disabled. I fear that any solution to the aforementioned problem would destroy this precious ability... Ideally, once we enable it, the generate_execute() mentioned above would just be replaced by generate() (or enqueue_for_generation, which is more explicit). All we need to do is to have an interface to the generator where we can just extract the axpy-kernels. The generator should not do any OpenCL program and kernel management. I don't see any problem with extracting the source code from the generator in order to create this program (it is already done for GEMM), but the generator doesn't handle reciprocal and flip_sign. As I said earlier, this feature is cool because it may prevent the transfer of several GPU scalars in order to invert/negate the value. On the other hand, though, it is incompatible with the clBLAS interface and the kernel generator (both of which are fed with cl_float and cl_double). Modifying the generator to handle x = y/a - w/b - z*c internally as x = y*a + w*b + z*c + option_a + option_b + option_c sounds like a very dangerous idea to me. It could have a lot of undesirable side effects if made general, and making an axpy-specific tree parsing would lead to a huge amount of code bloat. This is actually the reason why I am so reluctant to integrate reciprocal and flip_sign within the generator...

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {
      // Vectors are padded; wouldn't it be confounding/unnecessary to check
      // for the internal size to fit the width?
      // The following steps are costly for small vectors
      cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the scalar is on global device memory
      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
      {
        dynamically_generated_program::init();
        ambm_kernel(x, cpu_alpha, y, cpu_beta, z);
      }
    }
    else
    {
      statically_generated_program::init();
      ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha,
                  y, beta, reciprocal_beta, flip_beta, z);
    }

Wouldn't this solve all of our issues? I (really) hope we're converging now!
:) That put aside, I'm not sure we should give that much importance to jit-compilation overhead, since the binaries can be cached. If I remember correctly, Denis Demidov implemented such a caching mechanism for VexCL. What if we replace distributed vector/matrix with an optional automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of time :P)? The drawback is that the filesystem library would have to be dynamically linked, though, but after all OpenCL itself also has to be dynamically linked. I don't believe it is our task to implement such a cache. This is way too much of a source of errors and of messing with the filesystem for ViennaCL, which is supposed to run with user permissions. An OpenCL SDK is installed into the system and thus has much better options to deal with the location of the cache, etc. Also, why is only NVIDIA able to provide such a cache, even though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an extended amount of time. Agreed. I was just suggesting this because PyOpenCL already provides this, but Python comes with a set of dynamic libraries, so this is probably not the same context. Best regards, Philippe Best regards, Karli
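For reference, a simple timing model consistent with the measurements quoted above (with x·y = 128 kernels in total):

    T(x, y) ≈ x · t_jit + 128 · t_kernel
    t_jit    ≈ 80.6 s / 128 ≈ 0.63 s            (from the 128-program run)
    t_kernel ≈ (1.4 s − 0.63 s) / 128 ≈ 6 ms    (from the 1-program run)

    sanity check: T(16, 8) ≈ 16 · 0.63 s + 0.77 s ≈ 10.8 s   (measured: 10.5 s)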
[ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hello, I am a bit confused: is there any reason for using reciprocal and flip_sign, instead of just changing the scalar accordingly? Best regards, Philippe
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi Karl, 2014/1/24 Karl Rupp r...@iue.tuwien.ac.at Hey, I am a bit confused: is there any reason for using reciprocal and flip_sign, instead of just changing the scalar accordingly? Yes (with a drawback I'll discuss at the end): Consider the family of operations

    x = ±y OP1 a ± z OP2 b

where x, y, and z are vectors, OP1 and OP2 are either multiplication or division, and a, b are host scalars. If I did the math correctly, these are 16 different kernels when coded explicitly. Hence, if you put all these into separate OpenCL kernels, you'll get fairly long compilation times. Note that you cannot do this if a and b stem from device scalars, because then the manipulation of a and b would result in additional buffer allocations and kernel launches - way too slow. For floating-point operations, one can reduce the number of kernels a lot when (± OP1 a) and (± OP2 b) are computed once in a preprocessing step. Then, only the kernel

    x = y * a' + z * b'

is needed, cutting the number of OpenCL kernels from 16 to 1. Since (-a) and (1/a) cannot be computed outside the kernel if a is a GPU scalar, this is always computed in a preprocessing step inside the OpenCL kernel for unification purposes. I think we can even apply some more cleverness here if we delegate all the work to a suitable implementation function. And now for the drawback: When using integers, the operation n/m is no longer the same as n * (1/m). Even worse, for unsigned integers it is also no longer possible to replace n - m by n + (-m). Thus, we certainly have to bite the bullet and generate kernels for all 16 combinations when using unsigned integers. However, I'm reluctant to generate all 16 combinations for floating-point arguments if this is not needed... Thanks for the clarification. I absolutely don't want to generate the 16 kernels either! I was in fact wondering why one passed reciprocal_alpha and flip_sign into the kernel. After thinking more about it, I have noticed that this permits us to do the corresponding inversion/negation within the kernel, and therefore avoid some latency penalty / kernel launch overhead when the scalar resides on the device. That's smart! On the other hand, modifying the generator to not actually generate a specific kernel would be absurd imho. This brings another question, then: how could ambm benefit from the auto-tuning environment? I propose the following solution: check the size of the matrices/vectors. If the computation is dominated by the kernel launch time (say, less than 100,000 elements), then we use the current ambm kernel. Otherwise, we transfer the scalars to the CPU, perform the corresponding a' = ± OP a, b' = ± OP b, and either generate the kernel or use a BLAS library. This way, we benefit from kernel-launch-time optimization for small data, and high bandwidth for large data. Does this sound good? Best regards, Philippe Best regards, Karli
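A minimal sketch of the host-side folding described above, valid for floating-point scalars only (the function name is illustrative); after folding, the single kernel x = a'*y + b'*z covers all 16 sign/operation combinations:

    template <typename T>
    T fold_scalar(T value, bool flip_sign, bool reciprocal)
    {
      if (reciprocal) value = T(1) / value; // division folded into a multiplication by 1/a
      if (flip_sign)  value = -value;       // leading minus sign folded into the scalar
      return value;
    }

    // enqueue the one unified kernel with, e.g.:
    //   fold_scalar(alpha, flip_alpha, reciprocal_alpha)
    //   fold_scalar(beta,  flip_beta,  reciprocal_beta)
    // (float/double only; unsigned integers keep their 16 dedicated kernels, as noted above)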
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, 2014/1/24 Karl Rupp r...@iue.tuwien.ac.at Hi, I was in fact wondering why one passed reciprocal_alpha and flip_sign into the kernel. After thinking more about it, I have noticed that this permits us to do the corresponding inversion/negation within the kernel, and therefore avoid some latency penalty / kernel launch overhead when the scalar resides on the device. That's smart! On the other hand, modifying the generator to not actually generate a specific kernel would be absurd imho. This brings another question, then: how could ambm benefit from the auto-tuning environment? I propose the following solution: check the size of the matrices/vectors. If the computation is dominated by the kernel launch time (say, less than 100,000 elements), then we use the current ambm kernel. Otherwise, we transfer the scalars to the CPU, perform the corresponding a' = ± OP a, b' = ± OP b, and either generate the kernel or use a BLAS library. This way, we benefit from kernel-launch-time optimization for small data, and high bandwidth for large data. Does this sound good? In terms of execution time, this is probably the best solution. On the other hand, it does not solve the problem of compilation overhead: If we only dispatch into the generator for large data, we still have to generate the respective kernels and go through the OpenCL jit-compiler each time. The compilation overhead of this is even likely to dominate any gains we get from a faster execution. Instead, what about opening up the generator a bit? It is enough if we have some mechanism to access a batch-generation of axpy-like operations; for all other operations the generator can remain as-is. Another option is to move only the axpy-template from the generator over to linalg/opencl/kernels/*, because the generation of these kernels is fairly light-weight. Sure, it is a little bit of code duplication, but it will keep the generator clean. Another possible improvement is to separate operations on full vectors from operations on ranges and slices. For full vectors we can use the built-in vector types in OpenCL, which allow further optimizations not possible with ranges and strides, where we cannot use vector types in general. What do you think? I prefer option 3. This would allow for something like:

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {
      // The following steps are costly for small vectors
      NumericT cpu_alpha = alpha; // copy back to host when the scalar is on global device memory
      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
        generate_execute(x = cpu_alpha * y + cpu_beta * z);
    }
    else
    {
      // fallback
    }

This way, we generate at most two kernels: one for small vectors, designed to optimize latency, and one for big vectors, designed to optimize bandwidth. Are we converging? :) Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey hey, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi, I prefer option 3. This would allow for something like:

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {

Here we also need to check the internal_size to fit the vector width.

      // The following steps are costly for small vectors
      NumericT cpu_alpha = alpha; // copy back to host when the scalar is on global device memory
      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
        generate_execute(x = cpu_alpha * y + cpu_beta * z);
    }
    else
    {
      // fallback
    }

This way, we generate at most two kernels: one for small vectors, designed to optimize latency, and one for big vectors, designed to optimize bandwidth. Are we converging? :) Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch compilation look? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. ;) In this case, generate_execute() will just trigger the compilation - on the first call only - of the kernel x = cpu_alpha*y + cpu_beta*z:

    __kernel void avbv(unsigned int N,
                       __global float4 * x,
                       __global const float4 * y,
                       __global const float4 * z,
                       float alpha, float beta)
    {
      for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
        x[i] = alpha * y[i] + beta * z[i];
    }

with of course an appropriate compute profile. Note to self: Collect some numbers on the costs of jit-compilation for different OpenCL SDKs. Best regards, Karli Best regards, Philippe
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hey hey hey, Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch compilation look? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. ;) In this case, generate_execute() will just trigger the compilation - on the first call only - of the kernel x = cpu_alpha*y + cpu_beta*z:

    __kernel void avbv(unsigned int N,
                       __global float4 * x,
                       __global const float4 * y,
                       __global const float4 * z,
                       float alpha, float beta)
    {
      for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
        x[i] = alpha * y[i] + beta * z[i];
    }

I'm afraid this is not suitable then. A simple conjugate gradient solver would then go through ~10 OpenCL compilations, making it awfully slow on the first run... With the AMD and Intel SDKs, which to my knowledge still do not buffer kernels, this would mean that each time a process is started, this large overhead will be visible. I don't understand why this would go through more than one compilation... This kernel is compiled only once; the value of flip_sign and reciprocal only changes the dynamic value of the argument, not the source code. This would eventually result in:

    if (alpha_reciprocal)
      avbv(N, x, y, z, 1/alpha, beta);

Am I missing something? Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi, Oh, I get it better now. I am not entirely convinced, though ;) From my experience, the overhead of the jit launch is negligible compared to the compilation of one kernel. I'm not sure whether compiling two kernels in the same program or in two different programs makes a big difference. Plus, ideally, in the case of linear solvers, the generator could be used to generate fused kernels, provided that the scheduler is fully operational. I fear that any solution to the aforementioned problem would destroy this precious ability... Ideally, once we enable it, the generate_execute() mentioned above would just be replaced by generate() (or enqueue_for_generation, which is more explicit). That put aside, I'm not sure we should give that much importance to jit-compilation overhead, since the binaries can be cached. If I remember correctly, Denis Demidov implemented such a caching mechanism for VexCL. What if we replace distributed vector/matrix with an optional automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of time :P)? The drawback is that the filesystem library would have to be dynamically linked, though, but after all OpenCL itself also has to be dynamically linked. Best regards, Philippe 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi Philippe, I don't understand why this would go through more than one compilation... This kernel is compiled only once; the value of flip_sign and reciprocal only changes the dynamic value of the argument, not the source code. This would eventually result in:

    if (alpha_reciprocal)
      avbv(N, x, y, z, 1/alpha, beta);

Am I missing something? I think so ;-) It's not about a single kernel, it's about the compilation unit (i.e. the OpenCL program). For conjugate gradients we roughly have the following vector operations (random variable names):

    x = y;
    x += alpha * y;
    x = z + alpha * z;
    x = y - alpha * z;
    x = inner_prod(y, z);

BiCGStab and GMRES add a few more of them. If we use the generator as-is now, then each of the operations creates a separate OpenCL program the first time it is encountered, and we pay the jit-compiler launch overhead multiple times. With the current non-generator model, all vector kernels are in the same OpenCL program and we pay the jit overhead only once. I'd like to stick with the current model of having just one OpenCL program for all the basic kernels, but get the target-optimized sources from the generator. Sorry if I wasn't clear enough in my earlier mails. Best regards, Karli
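For illustration, a host-side sketch of this one-program model (error handling omitted; the kernel names are illustrative): all sources are concatenated and built in a single jit pass, after which extracting individual kernels is cheap:

    #include <CL/cl.h>

    cl_program build_vector_program(cl_context ctx, cl_device_id dev,
                                    const char * all_sources) // all basic kernels concatenated
    {
      cl_int err;
      cl_program prog = clCreateProgramWithSource(ctx, 1, &all_sources, NULL, &err);
      clBuildProgram(prog, 1, &dev, "", NULL, NULL); // single jit invocation
      return prog;
    }

    // afterwards, per-kernel extraction costs almost nothing:
    //   cl_kernel k_assign = clCreateKernel(prog, "assign",     &err);
    //   cl_kernel k_avbv   = clCreateKernel(prog, "avbv",       &err);
    //   cl_kernel k_ip     = clCreateKernel(prog, "inner_prod", &err);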
Re: [ViennaCL-devel] Roadmap update after 1.5.0 release
Hey, I'm slowly getting back to ViennaCL. I have added one bullet point to the roadmap: * Full integration of the micro-scheduler and the generator I think that we should work towards the full integration of this feature if we don't want the codebase to eventually get too messy. I will be working on cleaning GEMM (i.e., better integration of the multiple BLAS backends, and harmonizing the kernels using the column-trans / row-notrans identity) until I go back to France in 1 week. I have also noticed that the size checks could be moved upwards in the dispatching mechanism; for now, they are duplicated between opencl/cuda/openmp. Once this is done, I will probably work towards the full integration of the micro-scheduler. Can we get rid of op_executor? Best regards, Philippe
[ViennaCL-devel] Blas linking and internal design
Hey Karl, So today I went back to ViennaCL. I tried to move the equivalence column-trans = row-notrans upwards in the dispatching mechanism, but it turns out to be impossible, because matrix<T, row_major> is not (and should not be) convertible to matrix<T, column_major>, rendering the underlying signature inappropriate... I am quite unsure how to handle this problem. I am thinking about changing the internal signature to something more low-level:

    void gemm(bool /*is_A_trans*/, bool /*is_B_trans*/,
              const vcl_size_t /*M*/, const vcl_size_t /*N*/, const vcl_size_t /*K*/,
              const T /*alpha*/,
              viennacl::backend::mem_handle const & /*A*/,
              const vcl_size_t /*A_internal_size1*/, const vcl_size_t /*A_internal_size2*/,
              const vcl_size_t /*A_start1*/, const vcl_size_t /*A_start2*/,
              const vcl_size_t /*A_inc1*/, const vcl_size_t /*A_inc2*/,
              viennacl::backend::mem_handle const & /*B*/,
              const vcl_size_t /*B_internal_size1*/, const vcl_size_t /*B_internal_size2*/,
              const vcl_size_t /*B_start1*/, const vcl_size_t /*B_start2*/,
              const vcl_size_t /*B_inc1*/, const vcl_size_t /*B_inc2*/,
              const T /*beta*/,
              viennacl::backend::mem_handle & /*C*/,
              const vcl_size_t /*C_internal_size1*/, const vcl_size_t /*C_internal_size2*/,
              const vcl_size_t /*C_start1*/, const vcl_size_t /*C_start2*/,
              const vcl_size_t /*C_inc1*/, const vcl_size_t /*C_inc2*/);

where all the layouts would be assumed to be column-major, like in the standard BLAS interface. While this solution is acceptable to me, I fear that it will introduce a lack of harmony, considering that some other functions will otherwise keep their current form, like:

    template <typename NumericT, typename F, typename ScalarType1>
    void am(matrix_base<NumericT, F> & mat1,
            matrix_base<NumericT, F> const & mat2,
            ScalarType1 const & alpha,
            vcl_size_t len_alpha, bool reciprocal_alpha, bool flip_sign_alpha);

The only reasonable solution I see is to clearly separate, in the code, the functions which could be linked with BLAS (and give them a lower-level signature) from the other ones. For example, putting them in two separate files... is there any problem with doing this? Best regards, Philippe
Re: [ViennaCL-devel] Roadmap update after 1.5.0 release
Hey, Sorry for the late reply :P I'm supposed to defend my MSc in 2 weeks, and I am yet to start writing my thesis... (I won't have a lot of time to give to ViennaCL until everything is sorted out) 2013/12/23 Karl Rupp r...@iue.tuwien.ac.at Hi guys, Now as 1.5.0 is out, I spent some thoughts on the roadmap: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap Rather than having one major update per year, I'd like to go with smaller updates (1.6.0, 1.7.0, etc.) every couple of months, with eventual bugfix and performance improvements in between (1.5.1, 1.6.1, etc.). Thus, the list of features for 1.6.0 was stripped down and the not-so-urgent features are postponed to 1.7.0. We still need to gather more experiences before we are ready to finally fix some design errors in 2.0.0. Due to recent developments by Philippe, support for external BLAS libraries as a backend was added to the roadmap for 1.6.0. Any comments on further rearrangements of the roadmap? This seems fine for me! I would be against overloading the roadmap at this point. If we can achieve both stable distributed data-structures and the full integration of the scheduler in the coming couple of months, then the performance improvements should be noticeable enough to justify a new release, I think. Best regards, Philippe Best regards, Karli
[ViennaCL-devel] Handling Layout/Transpose ASAP for GEMM/GEMV ?
Hey, I've started back on the generator today, and realized how ugly the dispatching mechanism was that takes advantage of the equivalencies based on the fact that RowMajor + Trans = ColMajor + NoTrans. Actually, I've been wondering: why wouldn't we do this across the whole codebase? We could presumably focus solely on providing a simple BLAS interface (all column-major), and do the additional trickery at some point beforehand. I see a couple of advantages to this: => This would enable us to maintain only 4 GEMM and 2 GEMV kernels, instead of 32 GEMM and 4 GEMV kernels. => This would enormously increase the consistency between the default implementations, the BLAS backends and the kernel generator (because all these implementations could then focus on providing just a simple column-major BLAS interface). Am I missing something? If not, at which point should such a dispatching mechanism take place? Best regards, Philippe
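To make the folding concrete, here is a minimal sketch (hypothetical names, plain host code rather than generated kernels) of how every layout/transpose combination reduces to the four column-major cases NN, NT, TN, TT:

    #include <cstddef>

    enum layout_t { row_major_t, column_major_t };

    // Reference column-major kernel: C = alpha * op(A) * op(B) + beta * C,
    // with C of size M x N and inner dimension K.
    void gemm_col_major(bool trans_A, bool trans_B,
                        std::size_t M, std::size_t N, std::size_t K,
                        double alpha,
                        double const * A, std::size_t lda,
                        double const * B, std::size_t ldb,
                        double beta,
                        double * C, std::size_t ldc)
    {
      for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < M; ++i)
        {
          double acc = 0;
          for (std::size_t k = 0; k < K; ++k)
          {
            double a = trans_A ? A[k + i * lda] : A[i + k * lda];
            double b = trans_B ? B[j + k * ldb] : B[k + j * ldb];
            acc += a * b;
          }
          C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }

    // Dispatcher: any layout, any transpose flags.
    void gemm_any(layout_t layout_A, bool trans_A,
                  layout_t layout_B, bool trans_B,
                  layout_t layout_C,
                  std::size_t M, std::size_t N, std::size_t K,
                  double alpha,
                  double const * A, std::size_t lda,
                  double const * B, std::size_t ldb,
                  double beta,
                  double * C, std::size_t ldc)
    {
      // RowMajor + Trans = ColMajor + NoTrans: a row-major operand is just a
      // transposed column-major one, so toggle its transpose flag.
      if (layout_A == row_major_t) trans_A = !trans_A;
      if (layout_B == row_major_t) trans_B = !trans_B;

      if (layout_C == column_major_t)
        gemm_col_major(trans_A, trans_B, M, N, K,
                       alpha, A, lda, B, ldb, beta, C, ldc);
      else
        // Row-major C: compute C^T = op(B)^T * op(A)^T instead (swap the
        // operands, flip both flags, swap M and N; ldc is then the row stride).
        gemm_col_major(!trans_B, !trans_A, N, M, K,
                       alpha, B, ldb, A, lda, beta, C, ldc);
    }

Note that the buffers and leading dimensions are reused unchanged in every case; only the flags and the M/N extents are remapped.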
Re: [ViennaCL-devel] Call for testing: PyViennaCL on Ubuntu
*Sneaks in* (Seems like it's time to hide an if (rand() < RAND_MAX/2) return; somewhere in the code where Karl won't find it!) :D Philippe 2013/12/19 Karl Rupp r...@iue.tuwien.ac.at Hi Toby, please allow for ~1 more day, then 1.5.0 is out and I'm available for testing :-) Best regards, Karli On 12/17/2013 08:14 AM, Toby St Clere Smithe wrote: Toby St Clere Smithe m...@tsmithe.net writes: Yep, looks like the build was successful, so I'll go ahead and make sure it's all working on older distributions now. And so, after a little over a day backporting the package build system from Python 3 to Python 2 (yay!), we have packages for Debian / Ubuntu versions at least as old as Ubuntu 12.04! If you want to try them out, add the PPA[1] to your system; if you're on Ubuntu 12.04, 12.10 or 13.04, you can only install `python-pyviennacl` (for Python 2), but if you're on Ubuntu 13.10 or 14.04, you can also install `python3-pyviennacl` (for Python 3). Oh, and everyone gets `pyviennacl-doc` (which contains the HTML docs). I'll set the version number to 1.0.0 once ViennaCL 1.5.0 is released, and after any remaining little bugs get ironed out. [1] https://launchpad.net/~tsmithe/+archive/pyviennacl/ Cheers, Toby
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hey, 2013/12/18 Karl Rupp r...@iue.tuwien.ac.at Hi. A short update: I've implemented linkage to CBLAS and cuBLAS with dynamic selection. If activated through VIENNACL_WITH_CUBLAS, one can go back and forth between cuBLAS and the original backend by doing:

    A.blas().gemm(NULL);
    A.blas().gemm(viennacl::backend::blas::cublas_functions<value_type>::gemm);

(and similarly for CBLAS.) Nice, thanks! I think we can shorten the second call to something like A.blas().gemm(viennacl::backend::cublas); for convenience. There is some trickery going on with transpositions and layout, but it works for every transpose/layout combination. One can also link A's blas to one's own gemm function, provided a tiny wrapper (essentially to ensure signature compatibility). Cool! It is actually interesting to point out that only 4 GEMM kernels are needed for any implementation: NN, NT, TN, TT. Then, one can use the equivalences Row-Major+N = Col-Major+T and C = A.B <=> C^T = B^T.A^T. A very good piece of news is that this allows ViennaCL to work very well on very recent NVIDIA hardware, until our autotuning engine is fully operational. On my laptop, cublasSgemm is about 5 times faster than the current CUDA implementation, and 20% faster than the OpenCL kernel found by the autotuner (120 GFLOP/s vs 25 GFLOP/s vs 95 GFLOP/s). Also, linking with OpenBLAS leads to a HUGE performance boost on the CPU (0.02 GFLOP/s vs 70 GFLOP/s)...! For our native CUDA implementation it's probably only a matter of porting the results from the OpenCL tuner over. Unfortunately I don't see a good way of doing this with CUDA without a significant penalty on compilation times, because there is no concept of runtime kernel selection in CUDA so far. The performance difference for GEMM of our CPU backend is not surprising; this was never subject to optimization ;-) That's exactly the point of this feature! Optimizing GEMM for the CPU is pretty complicated, and linking with external BLAS libraries allows us not to focus too much on these problems, and to just provide a fallback implementation for the sake of code portability. A little question remains. For now, the behavior is really weird when one defines both VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS. How to handle this? I am not very familiar with the multiple backends and I don't know to what extent they can be combined. Therefore, I see multiple options, but can't tell which one is better. 1 - trigger a preprocessor error when both macros are defined together 2 - slightly modify the API: A.cuda_blas(), A.host_blas(), A.cl_blas() I think that option 2 is better, considering that there are already cuda_handle(), opencl_handle(), cpu_handle() or something similar, if I'm correct. Any advice? The reason why cuda_handle(), opencl_handle() and cpu_handle() exist under different names is that they return different types (i.e. the memory buffer). For the BLAS backends I don't want to have different member names, because this gets annoying for users. For example, if a user wants to cycle through the backends for e.g. benchmark purposes, she would have to write

    if (my_constant == CUDA) A.cuda_blas()...
    else if (my_constant == HOST) A.host_blas()...
    else A.cl_blas()...

Yes, you're right. However, the types returned by .blas() currently differ across the backends. This is because I chose a low-level interface for the BLAS wrappers, so the function signatures are slightly different (T const * A, vcl_size_t A_internal_size1, ... versus cl_mem const & A, vcl_size_t A_internal_size1, ...). I can easily change the signature to a higher-level one (viennacl::matrix<T> & A, ...). This is probably better, right? ...so making the code longer than necessary. I suggest querying some central registry where the backends are registered and then cycling through them:

    SomeListType blas_list = viennacl::blas_implementations_available();
    for (it = blas_list.begin(); ...)
    {
      A.blas(*it);
      do_something(A);
    }

I don't know whether .blas() is the best name for this, because in the future we might also have more non-BLAS operations such as sorting or FFT - maybe we use .operations() to better reflect the operations table? Yes, I also thought about it... I'm not sure how to handle the default case, A.operations().gemm(NULL), but I guess A.operations().gemm(viennacl::backend::default()) would do, where a proper overload would set the pointer to NULL internally. --- It seems to me that this is going in a very fruitful direction. Any objections to pushing and extending this for the 1.6.0 release? 1.5.0 is essentially done, I'm currently writing the last bits of documentation and resolving some minor warnings on Visual Studio. Yes. This is already pushed in a feature branch; I can try to extend it to allow for the list implementation you suggested. There are also a couple of changes in the generator on another feature
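For the record, the per-object selection used above boils down to something like this simplified sketch (hypothetical names, not the actual ViennaCL code); a NULL entry means "use the built-in kernel":

    #include <cstddef>

    // Signature assumed to be shared by all gemm backends (column-major, float).
    typedef void (*gemm_fn)(bool trans_A, bool trans_B,
                            std::size_t M, std::size_t N, std::size_t K,
                            float alpha, float const * A, std::size_t lda,
                            float const * B, std::size_t ldb,
                            float beta, float * C, std::size_t ldc);

    // Per-object operations table.
    class blas_table
    {
    public:
      blas_table() : gemm_(NULL) {}
      void gemm(gemm_fn f) { gemm_ = f; }     // select an implementation
      gemm_fn gemm() const { return gemm_; }  // query the current one
    private:
      gemm_fn gemm_;
    };

    // Usage, mirroring the calls above:
    //   A.blas().gemm(NULL);                     // built-in ViennaCL kernel
    //   A.blas().gemm(&my_cblas_sgemm_wrapper);  // user-provided wrapper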
Re: [ViennaCL-devel] Call for testing: PyViennaCL on Ubuntu
Hey Toby, Excellent! Thank you! I'm installing it right away, and I'll test it later tonight. Philippe 2013/12/17 Toby St Clere Smithe m...@tsmithe.net Toby St Clere Smithe m...@tsmithe.net writes: Yep, looks like the build was successful, so I'll go ahead and make sure it's all working on older distributions now. And so, after a little over a day backporting the package build system from Python 3 to Python 2 (yay!), we have packages for Debian / Ubuntu versions at least as old as Ubuntu 12.04! If you want to try them out, add the PPA[1] to your system; if you're on Ubuntu 12.04, 12.10 or 13.04, you can only install `python-pyviennacl` (for Python 2), but if you're on Ubuntu 13.10 or 14.04, you can also install `python3-pyviennacl` (for Python 3). Oh, and everyone gets `pyviennacl-doc` (which contains the HTML docs). I'll set the version number to 1.0.0 once ViennaCL 1.5.0 is released, and after any remaining little bugs get ironed out. [1] https://launchpad.net/~tsmithe/+archive/pyviennacl/ Cheers, Toby
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hi, A short update: I've implemented linkage to CBLAS and cuBLAS with dynamic selection. If activated through VIENNACL_WITH_CUBLAS, one can go back and forth between cuBLAS and the original backend by doing:

    A.blas().gemm(NULL);
    A.blas().gemm(viennacl::backend::blas::cublas_functions<value_type>::gemm);

(and similarly for CBLAS.) There is some trickery going on with transpositions and layout, but it works for every transpose/layout combination. One can also link A's blas to one's own gemm function, provided a tiny wrapper (essentially to ensure signature compatibility). A very good piece of news is that this allows ViennaCL to work very well on very recent NVIDIA hardware, until our autotuning engine is fully operational. On my laptop, cublasSgemm is about 5 times faster than the current CUDA implementation, and 20% faster than the OpenCL kernel found by the autotuner (120 GFLOP/s vs 25 GFLOP/s vs 95 GFLOP/s). Also, linking with OpenBLAS leads to a HUGE performance boost on the CPU (0.02 GFLOP/s vs 70 GFLOP/s)...! A little question remains. For now, the behavior is really weird when one defines both VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS. How to handle this? I am not very familiar with the multiple backends and I don't know to what extent they can be combined. Therefore, I see multiple options, but can't tell which one is better. 1 - trigger a preprocessor error when both macros are defined together 2 - slightly modify the API: A.cuda_blas(), A.host_blas(), A.cl_blas() I think that option 2 is better, considering that there are already cuda_handle(), opencl_handle(), cpu_handle() or something similar, if I'm correct. Any advice? Best regards, Philippe 2013/12/15 Philippe Tillet phil.til...@gmail.com Hi, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hi, Yeah, it certainly is a bit tedious. Feel free to only do this for matrix-matrix multiplications for now; a full operation table is presumably too much of a refactoring for ViennaCL 1.x.y, but much better suited for 2.0.0. Yes. It's actually a pretty complicated problem, because of the different signatures of the different BLAS functions... It seems like the cleanest way to do it would be using std::function and std::bind, which may indeed be widely available by the time ViennaCL 2.0.0 comes out. I hadn't seen this coming. The interfacing problem is just a matter of wrapping everything behind a common function interface and then using function pointers appropriately. C++11 is not an option for me for a few more years to come, mostly because this is the usual timeframe on large-scale clusters. (Our test system now includes a CentOS 5.10 machine with GCC 4.1.2...) Yep, sometimes reinventing the wheel makes sense because the car is too old :D Wouldn't a classic preprocessor directive, but with better BLAS support (as I have it implemented now: cpy, swap, asum, norm2, gemv, gemm), be more interesting feature-wise than a dynamic gemm-only dispatch, in the end? What would that look like? Do you mean a classic #ifdef? If right now we are only interested in GEMM, then yes, a simple static dispatch is enough. It just shouldn't start growing if we don't see this as the right way to go in the future. Oh, something like:

    #define VIENNACL_WITH_CBLAS

or

    #define VIENNACL_WITH_CUDA
    #define VIENNACL_WITH_CUBLAS

which would dispatch cpy, swap, asum, norm2, gemv, gemm (for the other ones, I think that the temporary saving of ViennaCL is beneficial) for float and double, and when the non-leading dimension of a matrix is strided.
I can add a set of more specific switches if necessary:

    #define VIENNACL_WITH_CUBLAS_GEMV
    #define VIENNACL_WITH_CUBLAS_GEMM

etc... Plus, it seems like the dynamic dispatch will be much more interesting in the context of ViennaCL 2.0.0, where more things will be dynamic, with possibly already kernel dispatch for the generator based on the input sizes (I'm thinking about it)... Absolutely. I think it's important to have directions for the future (being more dynamic is apparently one of them), but from the 1.5.0 delay I have learned the hard way that one should not start too many changes at the same time... ;-) Well, yes, I had the same problems on a couple of projects... However, kernel generation should be the main topic of my internship and my (hopefully) Ph.D., so I hope I'll have time for these things! Best regards, Karli Best regards, Philippe
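To illustrate the transposition trickery mentioned above: with the column-major (legacy) cuBLAS API, a row-major C = A.B is obtained by swapping the operands, without touching the data. A sketch, assuming packed row-major storage and device pointers:

    #include <cublas.h>  // legacy cuBLAS API of that era

    // Row-major C (m x n) = A (m x k) * B (k x n), all device pointers.
    // cuBLAS is column-major, so we compute C^T = B^T * A^T by swapping the
    // operands; the row-major buffers are reused unchanged.
    void sgemm_row_major_nn(int m, int n, int k, float alpha,
                            float const * A,   // lda = k (packed rows)
                            float const * B,   // ldb = n
                            float beta,
                            float * C)         // ldc = n
    {
      cublasSgemm('N', 'N', n, m, k, alpha, B, n, A, k, beta, C, n);
    }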
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hey, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hi again, While we're at it, let's discuss the dynamic dispatching mechanism we'd ideally want. I see two options: (1) A global function pointer table. So, one could for example set:

    viennacl::internal_blas::sgemv_ptr = viennacl::cblas_wrapper;

where cblas_wrapper essentially checks for the stride in the non-leading dimension and forwards to CBLAS if this stride is one. Of course, if the current backend is different, cblas_wrapper is not defined, and cublas_wrapper can be defined instead. I'd prefer to have this function table per object or per memory backend rather than global, otherwise this will sooner or later bite us in a multi-threaded setting. We (or a user) might want to use one implementation of a certain operation for smaller or skinny matrices and other implementations for larger/square matrices, in which case things are much easier if tied to the particular object. I agree. However, it seems to me that setting the implementation for each matrix would end up being tedious... One table per memory backend seems to make sense conceptually to me, since the performance (and the portability) of each BLAS implementation is determined by the underlying memory system. If there is no objection, I think I will go for that neat solution. Now, another question: how to set the default? I think that a preprocessor directive would be fine here. We already need the preprocessor's #ifdef to define the includes (and some wrappers) anyway. So using it to initialize that table seems reasonable to me (i.e. VIENNACL_WITH_CBLAS would enable some internal definitions and would initialize the table). Best regards, Philippe I like this solution a lot, since it allows one to mix multiple BLAS implementations in the same program. This can be useful in some cases (OpenBLAS is faster than MKL for BLAS3, but MKL is supposedly faster for all the rest). HOWEVER, this requires linkage if we want to avoid multiple definitions of that global pointer table. That's another reason why it shouldn't be global ;-) Since we now provide a libviennacl.so, though, we could include the global table therein, and one would link with it to use the additional functionality. Plus, if one has one's own BLAS function to benchmark against ours, for example, then this solution is very convenient. The shared library is available in addition to the header-only implementation; it's not compulsory. We might change that for ViennaCL 2.0.0, but 1.x.y will stay header-only. (2) A template parameter, so that one would write viennacl::prod<CBlasBackend>(...), similarly to what I did with UMinTL. However, I am not very fond of this solution for ViennaCL, because it will create huge bloat in the code, since templates essentially need to propagate, and it might screw up a bit the template deduction mechanism of some compilers (since prod is already templated with the underlying ScalarType...). Same here, I consider this to be a wrong use of templates for the reasons you mentioned. Fortunately we don't have to worry about performance for something tiny like 3x3 matrices, so a bit of runtime logic is not an issue. Best regards, Karli
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hi, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hey, I agree. However, it seems to me that setting the implementation for each matrix would end up being tedious... One table per memory backend seems to make sense conceptually to me, since the performance (and the portability) of each BLAS implementation is determined by the underlying memory system. If there is no objection, I think I will go for that neat solution. Yeah, it certainly is a bit tedious. Feel free to only do this for matrix-matrix multiplications for now; a full operation table is presumably too much of a refactoring for ViennaCL 1.x.y, but much better suited for 2.0.0. Yes. It's actually a pretty complicated problem, because of the different signatures of the different BLAS functions... It seems like the cleanest way to do it would be using std::function and std::bind, which may indeed be widely available by the time ViennaCL 2.0.0 comes out. I hadn't seen this coming. Wouldn't a classic preprocessor directive, but with better BLAS support (as I have it implemented now: cpy, swap, asum, norm2, gemv, gemm), be more interesting feature-wise than a dynamic gemm-only dispatch, in the end? Plus, it seems like the dynamic dispatch will be much more interesting in the context of ViennaCL 2.0.0, where more things will be dynamic, with possibly already kernel dispatch for the generator based on the input sizes (I'm thinking about it)... Best regards, Philippe Now, another question: how to set the default? I think that a preprocessor directive would be fine here. We already need the preprocessor's #ifdef to define the includes (and some wrappers) anyway. So using it to initialize that table seems reasonable to me (i.e. VIENNACL_WITH_CBLAS would enable some internal definitions and would initialize the table). Yes, that makes sense. The same is already done for the default backend: CUDA has priority over OpenCL, which has priority over the fall-back host implementation. The rationale is that the more specific an enabled backend is, the more likely it is that a user wants to use just that by default. This should apply equally well to a CBLAS interface. Best regards, Karli
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hi, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hi, Yeah, it certainly is a bit tedious. Feel free to only do this for matrix-matrix multiplications for now; a full operation table is presumably too much of a refactoring for ViennaCL 1.x.y, but much better suited for 2.0.0. Yes. It's actually a pretty complicated problem, because of the different signatures of the different BLAS functions... It seems like the cleanest way to do it would be using std::function and std::bind, which may indeed be widely available by the time ViennaCL 2.0.0 comes out. I hadn't seen this coming. The interfacing problem is just a matter of wrapping everything behind a common function interface and then using function pointers appropriately. C++11 is not an option for me for a few more years to come, mostly because this is the usual timeframe on large-scale clusters. (Our test system now includes a CentOS 5.10 machine with GCC 4.1.2...) Yep, sometimes reinventing the wheel makes sense because the car is too old :D Wouldn't a classic preprocessor directive, but with better BLAS support (as I have it implemented now: cpy, swap, asum, norm2, gemv, gemm), be more interesting feature-wise than a dynamic gemm-only dispatch, in the end? What would that look like? Do you mean a classic #ifdef? If right now we are only interested in GEMM, then yes, a simple static dispatch is enough. It just shouldn't start growing if we don't see this as the right way to go in the future. Oh, something like:

    #define VIENNACL_WITH_CBLAS

or

    #define VIENNACL_WITH_CUDA
    #define VIENNACL_WITH_CUBLAS

which would dispatch cpy, swap, asum, norm2, gemv, gemm (for the other ones, I think that the temporary saving of ViennaCL is beneficial) for float and double, and when the non-leading dimension of a matrix is strided. I can add a set of more specific switches if necessary:

    #define VIENNACL_WITH_CUBLAS_GEMV
    #define VIENNACL_WITH_CUBLAS_GEMM

etc... Plus, it seems like the dynamic dispatch will be much more interesting in the context of ViennaCL 2.0.0, where more things will be dynamic, with possibly already kernel dispatch for the generator based on the input sizes (I'm thinking about it)... Absolutely. I think it's important to have directions for the future (being more dynamic is apparently one of them), but from the 1.5.0 delay I have learned the hard way that one should not start too many changes at the same time... ;-) Well, yes, I had the same problems on a couple of projects... However, kernel generation should be the main topic of my internship and my (hopefully) Ph.D., so I hope I'll have time for these things! Best regards, Karli Best regards, Philippe
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hello, I've just realized that most BLAS implementations don't provide any way to do strided matrix accesses in the non-leading dimension...! Is this correct? I was hoping that we could have avoided such special cases, but it seems like a couple of tests will need to be made. Philippe 2013/12/14 Karl Rupp r...@iue.tuwien.ac.at Hey, Okay. I'll probably do it statically at first, and I'll keep in mind that we want it dynamic at the end of the day (well, not at the end of today :D). Once everything works statically, I think we can discuss the details of the API we want. Fine with me. This way we can first collect a bit of experience on some details which might be important for the runtime layer later. Best regards, Karli
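To be clear about what is and isn't expressible: the leading dimension does let standard BLAS address a sub-block of a larger buffer, e.g. with CBLAS (a sketch):

    #include <cblas.h>

    // What standard BLAS *can* express: an M x K sub-block A of a larger
    // rows_A-row column-major buffer, via lda = rows_A (and similarly for B
    // and C). What it *cannot* express is a stride between consecutive
    // elements within a column (inc1 > 1) -- there is no parameter for it,
    // hence the special-case fallback.
    void block_dgemm(int M, int N, int K,
                     double const * A, int rows_A,
                     double const * B, int rows_B,
                     double       * C, int rows_C)
    {
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                  M, N, K, 1.0, A, rows_A, B, rows_B, 0.0, C, rows_C);
    }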
[ViennaCL-devel] Generator's repmat API
Hello everybody, I am done implementing

    x = viennacl::reduce<op>(viennacl::rows(A));
    x = viennacl::reduce<op>(viennacl::cols(A));
    s = viennacl::reduce<op>(x);

in the generator. For now, the supported ops are: add, mult, max, min. I can't support them all, because I need to provide their neutral element for kernel generation (so that the shared memory can be initialized with the neutral element). I am now working on repmat. About this, I am not sure what the return type of the API function should be. I am planning to go for something like

    matrix_expression<matrix, viennacl::tuple<int, int>, op_repmat>(A, make_tuple(repsize1, repsize2))

where the tuple would get translated by the scheduler into a binary tree with operator OP_TUPLE. Does this sound reasonable? @Toby: There might be some changes of this type in the way the scheduler's expression tree is generated (for the needs of the kernel generator). I'll try to keep a list of the changes updated, so that the Python wrapper does not diverge too much from the core :) Philippe
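For reference, the semantics I have in mind for repmat, written as plain host code (this is only a sketch of the meaning, not the expression-template API):

    #include <cstddef>
    #include <vector>

    // repmat(A, r1, r2) tiles the m x n matrix A (row-major here, for
    // simplicity) into an (r1*m) x (r2*n) result.
    std::vector<double> repmat(std::vector<double> const & A,
                               std::size_t m, std::size_t n,
                               std::size_t r1, std::size_t r2)
    {
      std::vector<double> R(r1 * m * r2 * n);
      std::size_t cols = r2 * n;
      for (std::size_t i = 0; i < r1 * m; ++i)
        for (std::size_t j = 0; j < cols; ++j)
          R[i * cols + j] = A[(i % m) * n + (j % n)];
      return R;
    }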
Re: [ViennaCL-devel] implicit GPU-to-CPU scalar conversion of viennacl::scalar_expression...
Hello, I had not noticed that only the first reduction would be executed in this case, so my arguments were indeed invalid :) However, I am now even more worried than before ;) This makes the assumption that the two-stage reduction will always be the best way to compute an inner product on any OpenCL device. We want the reduction-based programs to be device-specific, so these sometimes-truncated operations will have to be forwarded somehow to the kernel generator, and therefore to the expression tree. Does it mean that we need an additional parameter in the statement which basically says "don't execute the last kernel!"? This would introduce a lot of complexity in the scheduler and the generator, for too little benefit imho. What about input-dependent kernels? For small inputs, where the second kernel would not be negligible, we would actually be better off performing the full reduction in one big work group. I think that, for small vectors, this is also more cache-efficient than the first kernel of the dual-reduction approach plus a final reduction on the CPU... This would preserve the benefit of saving one kernel launch, and at the same time integrate more smoothly within the scheduler/generator framework... Philippe 2013/10/27 Karl Rupp r...@iue.tuwien.ac.at Hi, Now that I'm back to some C++ coding, I want to finish the integration of viennacl::op_reduce. I've noticed a lot of different operator overloads for viennacl::scalar_expression, with basically different implicit conversions to a raw scalar. I'm a bit skeptical here :) This allows handling the (imho impractical) cases such as: cpu_scal = inner_prod(x1,x2)*5.0 This is *very* practical. Without implicit conversion, this would a) not work at all and require instead

    gpu_scalar = inner_prod(x1, x2);
    copy(gpu_scalar, cpu_scal);
    cpu_scal *= 5.0;

Clearly, this would not result in generic code at all... b) be less efficient: with the above, two reductions on the GPU are required in order to then copy a single value to the host. With the implicit conversion, this is just one reduction on the GPU, then a copy of the reduced values (no extra overhead, this is only latency-limited), and finally the reduction on the CPU at no significant cost. While the extra kernel launch does not really matter for large sizes, it is an issue for vector sizes between ~10k and ~500k, particularly for AMD and Intel accelerators where the latency is high(er). I think that such expressions should be forbidden. I think that every conversion involving host-device data movement should be explicit, since they trigger a flush of the scheduler's queue. Furthermore, we are heading towards multi-device computations, and these implicit conversions will then become even more troublesome: an implicit inner_prod-to-scalar conversion would then need to sum the results obtained on each device... Hmm, I don't see a reason why this should not work for multi-device scenarios... Basically, I think that we should forbid any implicit conversion other than the viennacl::scalar<T> -> T one... Do you agree? I don't want to give away the benefit of saving one kernel launch for reduction operations when the result is needed on the host... This would force rewriting the examples above as:

    gpu_scal = (vcl_scal1 + vcl_scal2)*5.0;
    cpu_scal = gpu_scal;

which is, I think, more explicit and efficient than the previous approach :) For pure scalar operations there is no chance of getting any efficiency out of it. Yes, it is more explicit, but at the same time less convenient.
Overall, we would trade convenience for ... what? ;-) Simpler implementation? Best regards, Karli
Re: [ViennaCL-devel] implicit GPU-to-CPU scalar conversion of viennacl::scalar_expression...
Hi hi, 2013/10/27 Karl Rupp r...@iue.tuwien.ac.at Hi, This makes the assumption that the two-stage reduction will always be the best way to compute an inner product on any OpenCL device. We want the reduction-based programs to be device-specific, so these sometimes-truncated operations will have to be forwarded somehow to the kernel generator, and therefore to the expression tree. Does it mean that we need an additional parameter in the statement which basically says "don't execute the last kernel!"? This would introduce a lot of complexity in the scheduler and the generator, for too little benefit imho. You are right, this is indeed a bit tricky. There is preparation for this case already in the 'standard' vector kernels, where each GPU scalar argument may include an additional 'mini reduction' before computing the actual operation. However, this functionality is currently unused. The motivation for this was operations of the type z = inner_prod(u,v) * w; where the second reduction could go into the z = alpha * w assignment. Oh I see :) When the kernels are generated, this is actually what happens, i.e. z = inner_prod(u,v) * w leads to two kernels. What about input-dependent kernels? For small inputs, where the second kernel would not be negligible, we would actually be better off performing the full reduction in one big work group. I think that, for small vectors, this is also more cache-efficient than the first kernel of the dual-reduction approach plus a final reduction on the CPU... This would preserve the benefit of saving one kernel launch, and at the same time integrate more smoothly within the scheduler/generator framework... Yes, I thought about that already. I think we don't need separate kernels, only a proper kernel calling logic. What is quite tricky is to get the cross-over point right, because that depends not only on the device performance, but also on the latency, which is OS-specific. Ah... This gets tricky indeed. Are there any measures of how the OS affects the latency? Specifically, if the OS-dependence is independent from the device-dependence, there should be static ways out of this mess... Another simple way out is to have a reasonable cross-over size value, and to integrate such platform-specific information into the autotuning software. Then, the user could override the default for optimal results, using a #define typically... until we are able to interact at runtime with the autotuner's results (using some IO mechanism). Best regards, Karli Philippe
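Concretely, the kernel calling logic could look like this (a sketch; the crossover value and the override macro are pure assumptions - in practice they would come from the autotuner or a user #define):

    #include <cstddef>

    enum reduction_strategy { single_workgroup, two_stage };

    // Size-based dispatch for inner products: below the crossover, one work
    // group does the whole reduction in a single kernel launch; above it,
    // the classical two-stage reduction wins.
    inline reduction_strategy pick_reduction(std::size_t vector_size)
    {
    #ifdef VIENNACL_REDUCTION_CROSSOVER   // hypothetical user override
      std::size_t const crossover = VIENNACL_REDUCTION_CROSSOVER;
    #else
      std::size_t const crossover = 10000;  // assumed placeholder value
    #endif
      return vector_size < crossover ? single_workgroup : two_stage;
    }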
[ViennaCL-devel] implicit GPU-to-CPU scalar conversion of viennacl::scalar_expression...
Hello, Now that I'm back to some C++ coding, I want to finish the integration of viennacl::op_reduce. I've noticed a lot of different operator overloads for viennacl::scalar_expression, with basically different implicit conversions to a raw scalar. I'm a bit skeptical here :) This allows handling the (imho impractical) cases such as:

    cpu_scal = inner_prod(x1,x2)*5.0

or

    cpu_scal = (vcl_scal1 + vcl_scal2)*5.0

I think that such expressions should be forbidden. I think that every conversion involving host-device data movement should be explicit, since they trigger a flush of the scheduler's queue. Furthermore, we are heading towards multi-device computations, and these implicit conversions will then become even more troublesome: an implicit inner_prod-to-scalar conversion would then need to sum the results obtained on each device... Basically, I think that we should forbid any implicit conversion other than the viennacl::scalar<T> -> T one... Do you agree? This would force rewriting the examples above as:

    gpu_scal = (vcl_scal1 + vcl_scal2)*5.0;
    cpu_scal = gpu_scal;

which is, I think, more explicit and efficient than the previous approach :) Philippe
Re: [ViennaCL-devel] Adding op_element.subfamily_type into the scheduler
A clearer classification:

    OPERATION_FUNCTION_SUB_TYPE_FAMILY (norm, prod, inner_prod, etc...)
    OPERATION_ELEMENT_FUNCTION_SUB_TYPE_FAMILY (abs, pow, etc...)
    OPERATION_ELEMENT_OPERATOR_SUB_TYPE_FAMILY (+, ==, <, etc...)

Philippe 2013/10/18 Philippe Tillet phil.til...@gmail.com Hello, Currently, there are only two families: UNARY_FAMILY and BINARY_FAMILY. In the generator, I have to do silly checks using giant ORs to check whether the operator is a product operator, an elementwise operator, an elementwise function... I'm thinking about introducing a subfamily_type, which contains this information. Does this sound reasonable to you? Philippe
Re: [ViennaCL-devel] Adding op_element.subfamily_type into the scheduler
Hey, While we're at it: I'm implementing reductions now. There are two options here:

    template<class OP, class VectorType>
    scalar_expression<VectorType, OP, reduce_type>
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, OP, reduce_type>(v, OP());
    }

or

    template<class OP, class VectorType>
    scalar_expression<VectorType, VectorType, reduce_type<OP> >
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, VectorType, reduce_type<OP> >(v, v);
    }

The first one is scheduler-friendly, but the second one is more metaprogramming-friendly. I'm really confused here; any advice on which one to choose? I would prefer the first one, since handling the second one in the scheduler would clearly be a pain, considering that I don't want to introduce REDUCE_ADD, REDUCE_MAX, etc., but just a single REDUCE operator, and reuse the existing ones. I think several other similar cases will arise, such as three-argument functions, where it is preferable to create the appropriate tree structure directly from the function call. Am I right? Philippe 2013/10/18 Karl Rupp r...@iue.tuwien.ac.at Hey, OPERATION_FUNCTION_SUB_TYPE_FAMILY (norm, prod, inner_prod, etc...) OPERATION_ELEMENT_FUNCTION_SUB_TYPE_FAMILY (abs, pow, etc...) OPERATION_ELEMENT_OPERATOR_SUB_TYPE_FAMILY (+, ==, <, etc...) I assume they are all within the same enum - go for it :-) Best regards, Karli
Re: [ViennaCL-devel] Adding op_element.subfamily_type into the scheduler
Okay, this approach has a problem at the OP() stage, because *_expression will store a reference to a temporary object, and because it creates problems for the *element* part of the statement. On the other hand, scalar_expression<VectorType, VectorType, reduce_type<OP> >(v, v) would need to be converted to the same end-tree anyway, which will lead to the same problem inside the statement... Philippe 2013/10/18 Philippe Tillet phil.til...@gmail.com Hey, While we're at it: I'm implementing reductions now. There are two options here:

    template<class OP, class VectorType>
    scalar_expression<VectorType, OP, reduce_type>
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, OP, reduce_type>(v, OP());
    }

or

    template<class OP, class VectorType>
    scalar_expression<VectorType, VectorType, reduce_type<OP> >
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, VectorType, reduce_type<OP> >(v, v);
    }

The first one is scheduler-friendly, but the second one is more metaprogramming-friendly. I'm really confused here; any advice on which one to choose? I would prefer the first one, since handling the second one in the scheduler would clearly be a pain, considering that I don't want to introduce REDUCE_ADD, REDUCE_MAX, etc., but just a single REDUCE operator, and reuse the existing ones. I think several other similar cases will arise, such as three-argument functions, where it is preferable to create the appropriate tree structure directly from the function call. Am I right? Philippe 2013/10/18 Karl Rupp r...@iue.tuwien.ac.at Hey, OPERATION_FUNCTION_SUB_TYPE_FAMILY (norm, prod, inner_prod, etc...) OPERATION_ELEMENT_FUNCTION_SUB_TYPE_FAMILY (abs, pow, etc...) OPERATION_ELEMENT_OPERATOR_SUB_TYPE_FAMILY (+, ==, <, etc...) I assume they are all within the same enum - go for it :-) Best regards, Karli
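To spell out the lifetime problem with the first option (a minimal sketch, not the actual ViennaCL class):

    // Expression templates store references to their operands. In option (1),
    // the RHS is the temporary OP() created inside reduce(), so rhs_ dangles
    // as soon as the full expression has been assembled:
    template<class LHS, class RHS, class OpTag>
    struct scalar_expression
    {
      scalar_expression(LHS const & lhs, RHS const & rhs)
        : lhs_(lhs), rhs_(rhs) {}
      LHS const & lhs_;
      RHS const & rhs_;  // binds to the temporary OP() -> dangling reference
    };

    // In option (2), both stored references point to v, whose lifetime is
    // controlled by the caller, so nothing dangles -- at the price of a
    // redundant second operand.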
[ViennaCL-devel] Common base for implicit_vector_base and vector_base...makes sense?
Hi, It seems like the behavior of scalar_vector, unit_vector etc. has changed a bit since the appearance of the kernel generator. I am currently extending the API of the generator with relational operators. I want to design a specific kernel which checks for X[i] < 0.42, for all i. Since operator< would be misleading, I am using the more verbose but clearer approach:

    element_less_than(X, scalar_vector<NumericType>(X.size(), 0.42));

The verbosity is a problem, but a minor one, so I will ignore it for now (I think end users can live with that :P). Anyway, my problem is that the VIENNACL_GENERATE_BINARY_ELEMENTOPERATION_OVERLOADS macro, which generates functions and function overloads for element_*, does not handle implicit_vector. For now, I can just add the proper overloads. But my problem is actually that, right now, implicit_vector's meaning has diverged with the use of the kernel generator.

    scalar_vector<float> v(N, 0);
    x += v;
    y = element_less_than(x, v);

makes perfect sense when using the OpenCL backend, but does not make sense with the OpenMP and CUDA backends. How to handle this divergence? It does look extremely complicated to me. Philippe
Re: [ViennaCL-devel] Common base for implicit_vector_base and vector_base...makes sense?
Hi hi, 2013/10/16 Karl Rupp r...@iue.tuwien.ac.at Hi, It seems like the behavior of scalar_vector, unit_vector etc. has changed a bit since the appearance of the kernel generator. I am currently extending the API of the generator with relational operators. I want to design a specific kernel which checks for X[i] < 0.42, for all i. Since operator< would be misleading, I am using the more verbose but clearer approach: element_less_than(X, scalar_vector<NumericType>(X.size(), 0.42)). The verbosity is a problem, but a minor one, so I will ignore it for now (I think end users can live with that :P). Most of all, it is consistent with element_prod(), element_div(), and friends. It may be a bit verbose, yes, but it could be way worse ;-) Yes, I think that Eigen's approach of having proxy objects (arrays) for elementwise operations, x.array() * y.array(), is a lot of work for barely any gain, so I'm clearly in favor of keeping things unambiguous and simple, albeit slightly verbose :) Anyway, my problem is that the VIENNACL_GENERATE_BINARY_ELEMENTOPERATION_OVERLOADS macro, which generates functions and function overloads for element_*, does not handle implicit_vector. For now, I can just add the proper overloads. But my problem is actually that, right now, implicit_vector's meaning has diverged with the use of the kernel generator. scalar_vector<float> v(N, 0); x += v; y = element_less_than(x, v); makes perfect sense when using the OpenCL backend, but does not make sense with the OpenMP and CUDA backends. How to handle this divergence? It does look extremely complicated to me. Since we don't know until runtime which backend is in use, the only clean approach is to throw an exception for cases where there is no implementation in the other backends. Rather than introducing yet another base class, what about allowing implicit vectors in vector_base by suitable constructor arguments? This will also keep compilation times under control :-) I'm a bit confused; this solution would then allocate memory in the case of element_less_than(X, vector<NumericType>(scalar_vector<NumericType>(X.size(), 0.42))), wouldn't it? If I want to normalize a vector by subtracting a constant c, simply writing

    y = x - scalar_vector<NumericType>(x.size(), c);

results in a single OpenCL kernel, and more importantly only N reads instead of 2N. In my opinion, it would be a bit sad to remove this functionality, but on the other hand I have no intention of duplicating all the operator overloads for implicit_vector_base and implicit_matrix_base :P I also thought about using enable_if to check for vector_base or implicit_vector_base (or only vector_base #ifndef VIENNACL_WITH_OPENCL), but I'm a bit afraid of the consequences on compilation time, so I thought that providing a common base class in the OpenCL case would be a good solution, wouldn't it? Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] Common base for implicit_vector_base and vector_base...makes sense?
Hey hey, Well, the main problem I have with incorporating implicit_vector_base inside vector_base is that this sounds like replacing inheritance with switches on an enum :P However, I think I have found a solution which will satisfy both of us: viennacl::vector_base already has this constructor:

    explicit vector_base(size_type vec_size, viennacl::context ctx = viennacl::context())

Actually, what I want to do is make implicit_vector_base inherit from vector_base and use this constructor. I don't really know why I thought about a common base rather than this much better approach. Sleepily, Philippe 2013/10/17 Karl Rupp r...@iue.tuwien.ac.at Hey, After thinking more about it, I see a conceptual flaw in that approach: since implicit_vector cannot be used as an l-value, while vector_base can, it would lead to very misleading code, where implicit_vectors would have (empty, or exception-throwing) operator overloads... The risk here is that vector_base would become a holdall. Well, you can disable operator= in all symbolic types directly, thus preventing any l-value problems :-) What is so bad about having a common base which would hold the size and the OpenCL context, and then two separate sub-classes? Am I missing something? It is the combinatorial explosion for tests and such. Already now we have to split up the tests into one per numeric type to keep memory consumption under control. When using {implicit_vector_base, vector_base} \times {implicit_matrix_base, matrix_base}, compilation times for the tests will grow by another factor of four. This is a clear disadvantage of having two separate base classes. On the other hand, I don't see a real advantage. Maybe I miss something? Best regards, Karli
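In code, the idea is roughly this (a heavily simplified sketch; everything here is hypothetical except the size-only construction discussed above):

    #include <cstddef>

    // Heavily simplified: the real vector_base also takes a viennacl::context.
    template<typename NumericT>
    class vector_base
    {
    public:
      explicit vector_base(std::size_t size) : size_(size) {} // no allocation
      std::size_t size() const { return size_; }
    private:
      std::size_t size_;
    };

    // implicit_vector_base reuses the size-only constructor, so no buffer is
    // ever allocated; it merely carries the implicit value.
    template<typename NumericT>
    class implicit_vector_base : public vector_base<NumericT>
    {
    public:
      implicit_vector_base(std::size_t size, NumericT value)
        : vector_base<NumericT>(size), value_(value) {}
      NumericT value() const { return value_; }
    private:
      NumericT value_;
      // operator= deliberately not provided: implicit vectors are not l-values.
    };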
Re: [ViennaCL-devel] IRC meeting on Friday, 15:00 UTC?
Hey, I'll be there! Philippe 2013/10/2 Karl Rupp r...@iue.tuwien.ac.at Hi guys, we haven't had an IRC meeting for quite a while now. I'm finally done with most of my relocation from the US back to Austria, so I propose to have our next IRC meeting on Friday, October 4, at 15:00 UTC. Is this okay for everybody interested in joining? Potential topics: - Final things to be completed for the upcoming releases of ViennaCL 1.5.0 and PyViennaCL (1.0.0?) - Roadmap towards 1.6.0 - Defining GUI functionality for Autotuning Best regards, Karli
Re: [ViennaCL-devel] IRC meeting on Friday, 15:00 UTC?
Oh, yes, what do you guys think about adding another topic, more specifically how PyViennaCL compares to the alternative solutions, in particular Theano? It is a question I have actually already been asked (by a former colleague of one of the Theano creators...), and we will sooner or later have to provide a fair comparison in order to orient the scientists who are looking for a high-level GPGPU solution. Philippe 2013/10/3 Toby St Clere Smithe m...@tsmithe.net Yep, so will I. Toby Philippe Tillet phil.til...@gmail.com writes: Hey, I'll be there! Philippe 2013/10/2 Karl Rupp r...@iue.tuwien.ac.at Hi guys, we haven't had an IRC meeting for quite a while now. I'm finally done with most of my relocation from the US back to Austria, so I propose to have our next IRC meeting on Friday, October 4, at 15:00 UTC. Is this okay for everybody interested in joining? Potential topics: - Final things to be completed for the upcoming releases of ViennaCL 1.5.0 and PyViennaCL (1.0.0?) - Roadmap towards 1.6.0 - Defining GUI functionality for Autotuning Best regards, Karli
[ViennaCL-devel] Incorporating reductions in ViennaCL
Hi everybody :) Okay, so in the roadmap I've added reductions support for ViennaCL 1.6... I plan to take care of it for the three backends, but there are several things to consider here. For now, I will call them reduce, reduce_rows, reduce_cols. A convenience layer such that reduce(mat.rows()) or reduce(mat.cols()) would be better, but this is a completely different problem :P From a certain point of view, if one_vector is a vector full of ones (optimized away at compile time), we have the following equivalences: - reduce<OP>(lhs) uses the same kernel as inner_prod(lhs, one_vector), except that the reduction operator is OP and not ADD. - reduce_rows<OP>(mat) uses the same kernel as prod(mat, one_vector), except that the reduction operator is OP and not ADD. - similarly, reduce_cols<OP>(mat) uses the same kernel as prod(trans(mat), one_vector). While there is a slight conceptual difference in the kernels (the gemv kernel has to take care of the reuse of the vector data, the reduce kernel doesn't), they do show very strong similarities... I see two options here: (1) - Ignore that slight conceptual difference, and use the same kernel/backend. This makes sense imho, because the reuse of the vector data doesn't matter much (the vector is orders of magnitude smaller in memory than the matrix). We delegate reduce_impl<OP>(vec) to a generalized inner_prod_impl<OP>(vec, one_vector()) (the standard inner product being the OP = ADD case). (2) - Have specific reduce_impl, reduce_rows_impl and reduce_cols_impl implementations. I am clearly against this, which would lead to a lot of duplication for not much gain, but well, I bring it up for the sake of Descartes' systematic doubt :) Best regards, Philippe
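The second equivalence, spelled out as plain host code (a sketch; OP = ADD shown, other OPs would swap the accumulation operator inside the kernel):

    #include <cstddef>
    #include <vector>

    // reduce_rows<ADD>(mat) for a row-major m x n matrix is exactly
    // prod(mat, one_vector): a matrix-vector product against a vector of ones.
    std::vector<double> reduce_rows_add(std::vector<double> const & mat,
                                        std::size_t m, std::size_t n)
    {
      std::vector<double> ones(n, 1.0), result(m, 0.0);
      for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
          result[i] += mat[i * n + j] * ones[j];  // the gemv access pattern
      return result;
    }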
Re: [ViennaCL-devel] Auto-Tuner, GEMM, GEMV... : Integrating RaijinCL into the generator
Hi hi, 2013/8/30 Karl Rupp r...@iue.tuwien.ac.at Hi Philippe, About 6 months ago I had heard of a library that also performs autotuning (http://raijincl.org), but that offered the same performance as ours back then. Since then, its performance has *greatly* improved, largely outperforming our autotuner: - Over 3 TFLOP/s on the HD7970 - Over 1.3 TFLOP/s on the HD5850 - About the same performance as cuBLAS on Kepler and Fermi. A technical report is available here: http://www.sable.mcgill.ca/publications/techreports/2013-1/sable-tr-2013-1.pdf The code is open-source, released under the Apache License. It would be silly imho to keep struggling to improve our autotuner when there seems to be a better one already out there. Plus, open source is more about collaboration than competition. The difference, however, is that while RaijinCL focuses on raw GEMM and GEMV autotuning, ViennaCL's generator focuses on temporaries removal. I am a bit confused, however, as to how to merge the two works, i.e. using temporary removal along with RaijinCL's profiles/autotuner. What exactly are the restrictions on using Apache-licensed code in MIT-licensed code? I know that both licenses are permissive, but I don't know the details... When merging two licenses, the rules are simple: the more restrictive license applies. In order not to taint the current MIT license of ViennaCL, I'm thus considerably more concerned about integrating RaijinCL than you are. Since our generator is skeleton-based anyway, what about having a look at the best performing kernels in RaijinCL and then extending the current generator accordingly such that these kernels are covered as well? I consider this to be *far* less painful than trying to merge in RaijinCL - as you certainly know, it's not that easy to 'just interface with a kernel generator', particularly if this is supposed to happen at runtime and in a reliable way. Even just within ViennaCL this took us (at least) three iterations to come up with a model that is useful in practice... Yes, probably. Plus, we don't need all of RaijinCL's functionality (images, for example). I have made contact with Rahul (the author of RaijinCL). I just want to make sure that RaijinCL gets the credit it deserves (3 TFLOP/s on the HD7970 is a lot!), and maybe join our expertise to get even better performance :) Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] Call to those with an NVidia GeForce Kepler graphic card : autotuning
Hi Evan, Thanks for your answer! Thanks to Denis Demidov, we already have some (disappointing... 950 GFLOP/s on SGEMM, 450 GFLOP/s on DGEMM) results for the Tesla K20! I am actually looking for a GeForce 780, to see whether the problem is specific to the GK110 or not... A slightly less high-end GPU such as a GTX 660, 670, 680... would be ideal. If the performance is there, the next release will offer pretty good performance on these particular chips. Philippe 2013/8/19 Evan Bollig bol...@gmail.com Philippe, I have you covered: Kepler K20. Let me know what you need. -Evan Bollig On Aug 19, 2013 4:14 PM, Philippe Tillet phil.til...@gmail.com wrote: Hello everybody, To provide good default GEMM kernels for the Kepler architecture, I need the help of the community! :) I'm looking for someone with an NVidia GeForce Kepler graphics card... If there is such a person here, would he/she be willing to run a small GEMM autotuning program? I will add detailed instructions if someone is up for it! Thanks and best regards, Philippe
[ViennaCL-devel] OpenCL to CUDA kernel translation
Hey everyone, It seems to me that most of the differences between CUDA and OpenCL come from the respective APIs, and that the kernel code is very similar in the two cases. Do you guys think it's possible to easily translate the generated kernels from OpenCL to CUDA by just doing one-to-one replacements of the keywords (__local -> __shared__, __global -> __device__, ...), or is there any particular difficulty I've missed? Best regards, Philippe
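As a rough illustration of the idea, a purely textual pass could look like the sketch below. This is not ViennaCL code; the replacement table, its ordering, and the function name are my own assumptions, and work-item functions (get_global_id(), ...) would still need real rewriting:

    // Hypothetical sketch of a keyword-level OpenCL -> CUDA source translation.
    // Order matters: '__global' must be removed before the '__kernel' rule
    // introduces '__global__' into the output.
    #include <regex>
    #include <string>
    #include <utility>
    #include <vector>

    std::string opencl_to_cuda(std::string src)
    {
      std::vector<std::pair<std::string, std::string>> const table = {
        { "__global",   "" },            // pointer address-space qualifier is implicit in CUDA
        { "__kernel",   "__global__" },
        { "__local",    "__shared__" },
        { "__constant", "__constant__" },
        { "barrier\\(CLK_LOCAL_MEM_FENCE\\)", "__syncthreads()" }
      };
      for (auto const & kv : table)
        src = std::regex_replace(src, std::regex(kv.first), kv.second);
      return src;
    }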
Re: [ViennaCL-devel] Scheduler progresses
Hi, 2013/8/16 Karl Rupp r...@iue.tuwien.ac.at Hi guys, the scheduler for kernel fusion is making good progress. Toby, you should be able to use all of the fundamental dense linear algebra operations now. There should only be two blocks of functionality missing: - Sparse matrices (i.e. matrix-vector products) - Some cases where += and -= may not work (e.g. matrix-vector product). Compilation times are moderate, but there is also some room for improvement left. Matrix-matrix products are unnecessarily heavy on the compiler. The good news for today is that things are finally growing together: via the scheduler, Toby can make the fast kernels from Philippe's generator available to the Python community :-) Thanks Karl! On my side, a minor improvement of 5% on the GEMM kernels, resulting in an additional 100 GFLOP/s on the HD7970. I have also reverted to CUDA 4.0 to have access to the visual profiler, whose OpenCL compatibility was removed in CUDA 5... ($@^§), so I will see what I can do for the kernels (probably after the 1.5 release) :) Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] Compilation load of matrix-test-*
Hi Karl, I've just realized I had forgotten to answer! My computer is no longer laggy in single-threaded mode, which is already a good thing :) It still cannot bear make -j4 even though it has 4 GB of RAM; my desktop computer can without any issue, though. I'll update this when I have cleaned and reinstalled my system in a few days :P Best regards, Philippe 2013/8/2 Karl Rupp r...@iue.tuwien.ac.at Hi Phil, the tests are now split into more light-weight units by separating single and double precision. matrix-test was additionally split into row-major and column-major tests. This should now allow you to build with `make -j4` on weaker machines with limited RAM. Best regards, Karli On 08/01/2013 08:35 PM, Philippe Tillet wrote: Hi everybody, I have had trouble compiling matrix-test-* for quite some time, and it has gotten worse over time. The compilation process appears to eat up one core at 100% (I have a Core i5!) and over 1 GB of RAM, which is enough to freeze my computer for 20-25 seconds. I have the same problem with the other matrix-test-* benchmarks. I went completely crazy and turned -j4 on, which totally froze my computer and forced me to hard reboot :D Anyway, I am running gcc 4.7 (the default Ubuntu 13.04 version). Is anybody else experiencing similar issues? Best regards, Philippe
[ViennaCL-devel] On Autotuning GEMM
Hey everybody, For a few days, I've been playing around with AMD's CodeXL, the HD5850 and the generator/autotuner:

- First of all, I want to share something that made me completely crazy. Avoid

    vector += scalar*vector

in a compute-bound context. After replacing the above with

    vector.s0 += scalar*vector.s0
    vector.s1 += scalar*vector.s1
    vector.s2 += scalar*vector.s2
    vector.s3 += scalar*vector.s3

performance jumped from 900 GFLOP/s to 2.3 TFLOP/s on the HD7970 (which is of the same order of magnitude as the best existing kernel so far, presented by Matsumoto et al.). Only a ~10% improvement on the HD5850, though. It seems the AMD OpenCL compiler does not properly translate the first operation. A more optimistic view is that it does a very good job at translating the second one :)

- I can make my HD5850 peak at ~920 GFLOP/s, which is around 45% of the theoretical peak. Some people in the literature managed to get 75% out of an HD5870 (they reach ~2 TFLOP/s out of ~2.8 TFLOP/s), which is truly impressive. They had to use some assembly-like language, though. This is because the HD5xxx series uses VLIW bundles of 5 instructions, not 4. CodeXL shows ALUPacking = 80.59%, which is in my opinion a direct consequence of packing instructions 4 by 4 instead of 5 by 5. It seems to me that this problem has more to do with the OpenCL compiler than with my code. Since the autotuner can find a spot with a 95% cache hit rate and 0% local memory conflicts, I assume that the problem in the kernel comes from the way the ALUs are used, rather than from bandwidth issues. Does anybody know if some other architectures use unusual VLIW lengths? AMD KernelAnalyzer gives the ISA output for the HD5850, but I am not experienced enough to make sense of it.

- Very weird behavior: the initial kernel for C = A*B was something like:

    __kernel void gemm(uint M, uint K, uint N,
                       __global float4* A, __global float4* B, __global float4* C){
      uint Mv = M/4; // internal size of A, which is row-major
      uint Nv = N/4; // same thing for B
      //...
      // use Mv and Nv to compute addresses into A and B, rather than M and N
    }

When replacing it by

    __kernel void gemm(uint M, uint K, uint N,
                       __global float4* A, __global float4* B, __global float4* C){
      // use inline M/4 and N/4 to compute addresses into A and B, rather than M and N
    }

I got a ~10% performance improvement on the HD5850, and no change on the HD7970. Don't ask me why. I actually think registers are a very precious resource on AMD devices. Since the computation of M/4 and N/4 occurs pretty rarely, it seems to me that saving a register is usually the better choice in these cases. Furthermore, since all these registers are probably taken from the vector register pool, it may be that a uint occupies a whole 128-bit wide register. I am not sure, though.

- A loop such as for(unsigned int bs = 0 ; bs < 32 ; ++bs) seems not to be unrolled by default. Adding #pragma unroll 32 improves performance on NVidia hardware (it almost doubles it), but kills it by a factor of 10 on AMD hardware, for the GEMM case at least. I am confused about it. More on this later if I find an answer to that mystery. If not, I'll just enable full #pragma unroll by default, and disable it on AMD hardware.

=== ON THE AUTOTUNING PROCEDURE ===

Well... While OpenCL is mostly guaranteed to be thread-safe (except for clSetKernelArg, which is thread-safe only as long as we set arguments for different kernels in parallel), it seems that parallel compilations are processed serially.
I observed this behavior when compiling multiple programs in the same context, but someone else observed it using different contexts, etc.: http://stackoverflow.com/questions/14544802/threading-opencl-compiling . Since compilation is a bottleneck of the autotuner (at least when the matrix size is 1024*1024... more on this later), it seems to me that parallelizing it would be worthwhile. In the end, I thought the simplest way to handle the problem is to partition the search space and pass a partition index as an argument. That way, for a 4-way partitioning:

    ./blas3_tuning 0
    ./blas3_tuning 1
    ./blas3_tuning 2
    ./blas3_tuning 3

We may observe some speed-up (since the above Stack Overflow link reports that using fork() resolves the issue). Or maybe we should use fork internally? Does anyone know if make -j 4 uses fork or multi-threading? We could have, for example, some ./blas3_tuning -j 4, as in the sketch below. However, for big matrix sizes, the tuning time seems to be dominated by the execution of the crappy kernels...

There are still quite a few things I need to do before talking about the autotuning procedure itself :) Best regards, Philippe
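A minimal sketch of what such an internal fork-based driver could look like (POSIX-only; the tune_partition helper and the partitioning scheme are hypothetical, not the actual blas3_tuning interface):

    // Hypothetical sketch: run the autotuning search space in N child processes,
    // since OpenCL program compilation appears to serialize across threads but
    // not across processes.
    #include <sys/wait.h>
    #include <unistd.h>

    void tune_partition(int partition, int num_partitions);  // assumed: benchmarks every
                                                             // profile whose index % num_partitions == partition

    int main()
    {
      int const num_partitions = 4;
      for (int p = 0; p < num_partitions; ++p)
      {
        pid_t pid = fork();
        if (pid == 0)              // child: handle one slice of the search space
        {
          tune_partition(p, num_partitions);
          _exit(0);
        }
      }
      while (wait(NULL) > 0)       // parent: wait for all children to finish
        ;
      return 0;
    }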
Re: [ViennaCL-devel] Kernel Generator wrap-up
Hi again! The generator code is pushed on the master branch. 2013/7/28 Karl Rupp r...@iue.tuwien.ac.at Hey, My preferred option is to pad by default and either to make the padding a multiple of four or sixteen. However, we need to maintain a full set of unpadded operations, because user-provided buffers need not be padded (and a subsequent padding may be too expensive). I think always making it a multiple of 16 is a good option, because we can reasonably assume that optimal performance is rarely obtained when a work item performs (unrolls) more than 16*16 operations, on most of the kernels. However, we have to have a clear and easily extensible dispatch mechanism that dispatches some sizes to some specific kernel, which is what I was talking about. Best {m, k, n} big block sizes for the GEMM kernel (row-major * row-major):

- AMD: 16 * 64 * 256
- NVidia: 16 * 128 * 128
- Intel CPU: 64 * 64 * 128

I expect this to also be dependent on the hardware generation. The best approach that comes to my mind is to introduce some hardware descriptor, which provides nicely preprocessed information from the OpenCL backend. A rather simple

- Vendor: [AMD, INTEL, NVIDIA, ...]
- Type: [CPU, GPU, MIC, etc.]
- Generation: [Southern Islands, Fermi, Kepler, ..., UNKNOWN]

should give us enough dispatch possibilities for the hardware 'out there'. If the detection of the hardware generation fails, we just use some compatibility kernel (and eventually ask the user to submit hardware information when running the tuner). Right. I'll do that :) Of course, it is bound to be device-specific rather than vendor-specific, and once the autotuning procedure works better we might have block sizes such as 96, 112, etc. Furthermore, for the kernel to be correct, each size has to be a multiple of the block size (3 constraints). We can never expect the user to call the kernel with the proper sizes. Problem: the padding in ViennaCL is static, while this block size is only known at runtime... Should we just write somewhere in the documentation what the best kernels are? The padding is no longer 'static'. The 'ALIGNMENT' template parameter is now ignored (vector_base no longer holds an ALIGNMENT parameter), so we can introduce a runtime padding without breaking old code. Thus, we can pick a proper padding entirely at runtime, tailored to the underlying device. Oh, true. This padding has to be the smallest one compatible with all profiles, some sort of least common multiple, which I hope is not going to grow ridiculously big... Even though the number of possible kernel variations is large (though finite), there's only a limited set which actually gives good performance. These are the important kernels to be tested thoroughly. Yes, but this limited set is device/program-specific, and it is hard to know in advance (that's what autotuning is for). I don't think anyone could tell me explicitly which combination of {alignment, ml, kl, nl, ms, ks, ns, use_lhs_shared, use_rhs_shared, unroll} gives good performance ;) And even if I choose only two values for each parameter, it leads to 2¹⁰ = 1024 tests per layout/transposition combination, i.e. 32 768 tests in total, which is ridiculously high :D What about integrating the test procedure into the autotuning procedure? It's not intuitive but I see no better way. Yes, a good autotuning procedure should verify the correctness of the results obtained anyway. There may be compiler or hardware bugs which can lead to fast, but erroneous kernels.
A two-stage scheme seems best here:
- First, find the fastest kernel (either without checking, or just checking for a particular size).
- Second, verify this kernel for a couple of different sizes. If this fails, pick the next kernel, etc.

Ok, I'll do that (see the sketch below). However, there are things to test in the way the generator behaves, rather than the profiles. All the operations in tests/vector.cpp have to be compatible with the generator. Should the corresponding tests be in the same vector.cpp file (in some #ifdef VIENNACL_WITH_OPENCL) or should they be in a separate file? Sooner or later we will have to go for the runtime option anyway. I don't see any benefit in being overly pessimistic with 16 kB if we have the true local memory size available at runtime. Right, it's not over-complicated to do. The problem is more about knowing the right optimization profile used at runtime (the local memory used by the to-be-compiled kernel). Ok, it means that this optimization profile should not change (since I think we cannot really use global objects), so that this local memory value stays consistent over time. Only the autotuner will be allowed to play with optimization profiles, then, which is fine with me. There is no reason to expect that the hardware changes during the execution of a process. Even if a piece of hardware falls off the bus because it overheats, it doesn't come back.
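A minimal sketch of that two-stage selection loop (profile, benchmark, matches_reference and the test sizes are hypothetical placeholders, not ViennaCL API; a real implementation would also cache benchmark results instead of re-measuring inside the comparator):

    // Hypothetical sketch: pick the fastest candidate profile, then verify it at
    // several matrix sizes; discard fast-but-wrong profiles and try the next one.
    #include <algorithm>
    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    struct profile { /* alignment, ml, kl, nl, ms, ks, ns, ... */ };
    double benchmark(profile const & p);                          // assumed: GFLOP/s at one size
    bool matches_reference(profile const & p, std::size_t size);  // assumed: compares against a CPU reference

    profile select_profile(std::vector<profile> candidates)
    {
      while (!candidates.empty())
      {
        // Stage 1: fastest remaining candidate, benchmarked at a single size.
        auto best = std::max_element(candidates.begin(), candidates.end(),
            [](profile const & a, profile const & b) { return benchmark(a) < benchmark(b); });

        // Stage 2: verify correctness at a couple of different sizes.
        bool ok = true;
        for (std::size_t s : {128, 512, 1024, 2048})
          ok = ok && matches_reference(*best, s);

        if (ok)
          return *best;
        candidates.erase(best);  // fast but erroneous: pick the next kernel
      }
      throw std::runtime_error("no correct profile found");
    }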
[ViennaCL-devel] Kernel Generator wrap-up
Hello everybody, I'm proud to announce that after about 3 weeks, I've recoded the OpenCL code generator from scratch to integrate it fully with viennacl::scheduler::statement. That being said, I'm entering the point where I need your opinion on (many) further design choices. Sorted by priority:

1) How to handle padding? For example, the best kernels for a given operation may use float4, in which case an alignment of 4 is required. For GEMM, though, the kernel internally uses blocking. Since the iteration over the blocks is unrolled, I prefer to keep the loop boundary static (known at OpenCL compile time), so padding inside a kernel is not really an option here. How to handle this? Should we have a plethora of kernels optimized for a large number of block sizes? If yes, how to choose the block sizes? (See the padding sketch at the end of this mail.)

2) For each operation (BLAS1/BLAS2/BLAS3 for now), an infinite number of kernels can be generated. Designing a proper test suite in such a situation is a challenging task. I've thought about testing a fixed number of randomly chosen kernels. We also have to choose multiple sizes for the tests (because of 1)... Finally, multiple operations can be packed together (multiple SAXPYs, multiple scalar reductions/inner products, multiple vector reductions/GEMVs). If the number of packed operations is too high, the local memory usage will be too high and the OpenCL kernel may not even *compile*. Should we provide a mechanism to evaluate this upper bound at runtime (doable), or just use a very conservative value for now (the OpenCL standard guarantees 16 kB of local memory, and the kernel generator guarantees an upper bound on the amount of local memory used)? I prefer the second option.

3) There are several expression nodes that should be supported only by the generator for now (even though they are not yet implemented):
- reduce<op>(vector_expression)
- reduce_rows<op>(matrix_expression)
- reduce_cols<op>(matrix_expression)
- elementwise relational operators: operator<, operator<=, operator>, operator>=, operator==, operator!=
- repmat(mat or vector, row_tiling, col_tiling)
- vector expression: diag(Mat)
- matrix expression: diag(vec)
My question is: how do we provide user access to OpenCL-specific features that are not available (yet) for the other backends? Another possibility is to keep this issue for ViennaCL 1.5.

4) I want to maintain explicit specifications of the generator (apart from the hard-coded bool-returning C++ function): what operations it supports and what it doesn't support. Are you interested? If yes, what format would you prefer?

Best regards, Philippe
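Regarding question 1, a minimal sketch of the runtime-padding direction (pad_size is an illustrative name, not ViennaCL API; the pad multiple is assumed to come from the selected kernel profile):

    // Hypothetical sketch: round a runtime matrix dimension up to the next
    // multiple of the block size required by the chosen kernel profile, so the
    // unrolled loop boundary stays static inside the kernel.
    #include <cstddef>

    std::size_t pad_size(std::size_t n, std::size_t block_size)  // e.g. block_size = 16
    {
      return ((n + block_size - 1) / block_size) * block_size;
    }

    // Example: pad_size(1000, 16) == 1008; the extra rows/columns would be
    // zero-filled and ignored when reading results back.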