Re: [ViennaCL-devel] CUDA slower than OpenCL in new R implementation?
Hi Charles :)

The BLAS kernels for CUDA and OpenCL are actually entirely different. The OpenCL kernels rely on a code generator and have been auto-tuned; as far as I know, the CUDA kernels have not been auto-tuned and don't rely on the same generation engine as the OpenCL ones. While the difference should not be very significant for BLAS1-2, for GEMM it is entirely possible to observe a huge difference.

Philippe

2015-07-31 12:04 GMT-07:00 Charles Determan cdeterma...@gmail.com:

Greetings,

Brief background: I am developing a series of R packages to bring ViennaCL to the R community. I have had success with the development of my gpuR package (https://github.com/cdeterman/gpuR), which relies on the OpenCL backend of ViennaCL (housed in the package RViennaCL). I am hoping to submit to CRAN in the coming weeks now that the latest stable ViennaCL version has just been released.

Naturally, I wanted a companion package with a CUDA backend. This is now the gpuRcuda package (https://github.com/cdeterman/gpuRcuda). It appeared to work successfully, as most of the code is the same. However, my initial benchmarks are showing very dismal performance with the CUDA backend. I was wondering if someone on this list would be willing to have a look at my code to see why the CUDA code performs so much worse. I had thought that, given an NVIDIA card (GeForce GTX 970), CUDA would provide improved speed, but the benchmarks are showing performance at least 5-fold slower than CPU-based R multiplication. Even the 'float' type matrix multiplication is slower than R (which only has double type support!).

The sgemm CUDA file is (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_sgemm.cu) and the associated C++ file is (https://github.com/cdeterman/gpuRcuda/blob/master/src/vcl_cudaMatrix_gemm.cpp).

On another note, I have tried making the two packages completely independent and the performance is still very poor with CUDA. I really appreciate any help others could provide troubleshooting this. I have truly run out of ideas as to why the code has such poor performance.

Regards, Charles
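To isolate the ViennaCL GEMM call outside of R, a minimal timing sketch along the following lines can be compiled once with VIENNACL_WITH_OPENCL and once with VIENNACL_WITH_CUDA to compare the two backends on the same card (the sizes and the scalar_matrix initialization are illustrative choices, not Charles's actual benchmark):

#include <viennacl/matrix.hpp>
#include <viennacl/linalg/prod.hpp>
#include <viennacl/backend/memory.hpp>

#include <chrono>
#include <iostream>

int main()
{
  std::size_t N = 2048;
  // scalar_matrix is just a cheap way to get initialized operands
  viennacl::matrix<float> A = viennacl::scalar_matrix<float>(N, N, 1.0f);
  viennacl::matrix<float> B = viennacl::scalar_matrix<float>(N, N, 2.0f);
  viennacl::matrix<float> C(N, N);

  C = viennacl::linalg::prod(A, B);   // warm-up (triggers the OpenCL JIT)
  viennacl::backend::finish();        // wait until the device is idle

  std::chrono::high_resolution_clock::time_point t0 = std::chrono::high_resolution_clock::now();
  C = viennacl::linalg::prod(A, B);
  viennacl::backend::finish();        // wait for the kernel to complete
  double sec = std::chrono::duration<double>(std::chrono::high_resolution_clock::now() - t0).count();

  std::cout << 2.0 * N * N * N / sec * 1e-9 << " GFLOP/s" << std::endl;
  return 0;
}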
Re: [ViennaCL-devel] Column-wise kernels?
Hi,

Such row-wise / column-wise reductions could be generated by the OpenCL backend, but this won't work on the host or CUDA backends. Plus, this is not really maintained at the moment. I would recommend Karl's solution, even though it won't be optimal when the vector does not fit in the L2 cache of the OpenCL device (Maxwell, for example, has 2MB of L2 cache), as the current algorithm for GEMV accesses the entire vector get_num_groups(0) times.

Philippe

2015-07-27 9:40 GMT-07:00 Karl Rupp r...@iue.tuwien.ac.at:

Excellent, thank you. I thought that would be the way to go initially, but I hesitated because of concerns about additional temporary objects taking up memory when matrices begin to get larger. It certainly is simpler this way.

Just pushed: https://github.com/viennacl/viennacl-dev/commit/4063c941235d46804cd448db7ddecf0c3238548f

Yeah, it's a bit of a trade-off: sure, one could optimize the summation kernel, but this also implies more code to maintain. On the other hand, I'm not aware (which, of course, does not rule out a possible existence) of a scenario where such summation routines are the performance bottleneck.

Glad to hear that 1.7.0 is nearly completed. Does that mean we should expect a formal release soon?

Yep. Expect the release on Wednesday.

Best regards, Karli

On Mon, Jul 27, 2015 at 9:57 AM, Karl Rupp r...@iue.tuwien.ac.at wrote:

Hi Charles,

I am working on writing some additional OpenCL kernels (potentially to incorporate into ViennaCL) which involve column-wise reductions. A simple case would simply be the sum of each column of a matrix. However, I am having an extremely difficult time getting my kernel correct (reductions are tricky to me). That said, after searching for some resources I came across an old post on SourceForge referring to column-wise kernels (http://sourceforge.net/p/viennacl/mailman/message/27542552/) with ViennaCL. This leads me to my primary question: are there such kernels already in ViennaCL that I have overlooked?

Yes ;-) Have a look here at how row-wise sums reduce to a standard matrix-vector product: https://sourceforge.net/p/viennacl/discussion/1143678/thread/38e942a0/

That is, in order to compute a row-sum and a column-sum you can use

row_sum = prod(A, ones);
col_sum = prod(trans(A), ones);

In an hour or two I will push convenience functions for summation, fixing the only remaining issue for the 1.7.0 release: https://github.com/viennacl/viennacl-dev/issues/127

If not, are there any examples or resources you would recommend to help learn this topic? I have tried searching further, but the only thing I can really find is a reduction of an entire matrix (which is relatively simple), as opposed to by column or row.

At this point I can only recommend thinking about how such operations can be recast in terms of (standard) linear algebra. For example, row- and column-wise updates to a matrix are special cases of the more general

A += outer_prod(u, v);

operation (rank-1 updates). I'll improve the documentation in that direction.

Best regards, Karli
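In code, the recipe above boils down to the following sketch (the float element type and the scalar_vector initializer for the ones-vector are just one possible setup):

#include <viennacl/matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/prod.hpp>

// Row- and column-wise sums recast as matrix-vector products.
void sums(viennacl::matrix<float> const & A)   // A is M x N, assumed filled
{
  viennacl::vector<float> ones_n = viennacl::scalar_vector<float>(A.size2(), 1.0f);
  viennacl::vector<float> ones_m = viennacl::scalar_vector<float>(A.size1(), 1.0f);

  // row_sum[i] = sum_j A(i,j);  col_sum[j] = sum_i A(i,j)
  viennacl::vector<float> row_sum = viennacl::linalg::prod(A, ones_n);
  viennacl::vector<float> col_sum = viennacl::linalg::prod(viennacl::trans(A), ones_m);
}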
Re: [ViennaCL-devel] ViennaCL Benchmark GUI 1.0.0 Release Candidate
Hey :-)

Worked well on my laptop :-) A couple of suggestions:

- Maybe use layout N-T for GEMM, or perhaps it is already possible to choose? From my experience, NT col-major (TN row-major) always leads to higher performance on GEMM.
- The plots were hard to read because they are rather small on my laptop. I would love to be able to make the plot fullscreen, or to display the data as a table when I click on a curve. I don't know how easy this is to do with Qt Creator, though...

Apart from this, I'm impressed! This is very user-friendly and detailed. This is my first try of the benchmark GUI in a long, long time, so I hope it brought some perspective.

Philippe

2014-11-24 2:23 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi, sorry, my email got stuck in the queue over night. Thanks to Namik for already fixing the 'automatic upload' box. :-)

Best regards, Karli

On 11/23/2014 09:51 PM, Karl Rupp wrote:

Hi guys,

a release candidate for the benchmark GUI is available for download; I'd appreciate any testing - particularly of the Windows version:

** Windows **
http://viennaclbenchmark.sourceforge.net/ViennaCLBenchmark-1.0.0-RC.zip
This is a self-contained package which is ready to launch after unzipping. It only requires OpenCL to be installed system-wide or to be available in the PATH environment variable.

** Linux **
http://viennaclbenchmark.sourceforge.net/ViennaCLBenchmark-1.0.0-Linux-RC.gz
Requires qt4 to be available on the system (Ubuntu: apt-get install libqt4, Arch Linux: pacman -Su qt4). On some distributions the webkit component needs to be installed separately (Ubuntu: apt-get install libqtwebkit4, Arch Linux: pacman -Su qtwebkit). Make sure libOpenCL.so can be found system-wide, or run ldconfig accordingly.

A few smaller issues are still left and will be addressed tomorrow. @Namik: Do you have some time to fix the layout on the start screen for multiple OpenCL devices (see screenshot)? Adding a 'scrollable' property might do the trick already...

Best regards, Karli
Re: [ViennaCL-devel] Roadmap update
Hey :)

2014-11-09 10:06 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi guys, I've updated our roadmap taking into account the latest release: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap Feel free to add your topics and post your wishes :-)

Awesome! Is it like a Christmas present list? Can we post any wish? I'd like a pony, actually. :D

The 1.6.1 release is scheduled for the week of November 17-21, for which we will provide a new fast kernel right when it is presented at the Supercomputing conference. I had hoped I could get my hands on a GTX 970 or GTX 980, but I wasn't able to. If any developer has access to such hardware, it would be great to let us know, so that we can get optimized kernels for this hardware, and possibly compare against cuBLAS, before SC14.

My personal main goal for 1.7.0 is to reduce the use of Boost.uBLAS as much as possible and to have a fast, entirely GPU-based AMG preconditioner (similar to what is in CUSP). At the same time, I'd like to promote shorter release cycles: 1.6.0 was released about a year after 1.5.0, which keeps quite a number of completed features stuck in the pipeline for too long.

I've added mine. Rather modest: better auto-tuning, and more devices supported. I am directing my efforts towards my specialization for dense BLAS on OpenCL, which will hopefully get integrated in the 2.0.0 release.

Maybe there will be a 1.8.0 release as well, which will still follow the current header-only model. However, we may also switch to ViennaCL 2.0.0 right after the 1.7.x series in order to better target languages other than C++ (most notably C and Fortran, due to their widespread use in HPC).

I will post what I think is reasonable, although most of my thoughts go towards ViennaCL 2.0. As I said, I started today to rewrite the OpenCL layer of ViennaCL using CL/cl.hpp and dynamic layout + datatype (the rationale behind this choice is that OpenCL is already not type-safe anyway, and so clAmdBlas is not type-safe either). It will be interesting to see the influence it will have on the compilation time.

Philippe

Any thoughts and input are - as always - welcome :-)

Best regards, Karli
Re: [ViennaCL-devel] More weird problems
I remember us already having a problem with strlen on the cache with your NVidia SDK, which disappeared when you rebooted. Didn't we?

2014-11-05 16:25 GMT-05:00 Toby St Clere Smithe m...@tsmithe.net:

Toby St Clere Smithe m...@tsmithe.net writes:

The segfault happens when calling (in ocl/context.hpp):

443   err = clGetProgramBuildInfo(temp, devices_[0].id(), CL_PROGRAM_BUILD_LOG, 0, NULL, &ret_val_size);

Oh, and the segfault happens in nVidia's OpenCL when it calls strlen somewhere...

Oh, wait... Apparently my nvidia module was still loaded, so I take back the beignet comment (the system defaults to nvidia if it's available; what I thought was beignet was therefore not). And reloading the nvidia module seems to solve this for nvidia, too. Very strange. But the matrix_operations problems still remain!
Re: [ViennaCL-devel] Segfault running PyViennaCL direct solver tests
Hey,

Sorry for the late answer. I've been extremely busy with my stats homework lately. The caching mechanism indeed doesn't account for the device. This is pretty easy to add, i.e., append the device name + platform version + platform name when doing the hashing.

Philippe

2014-11-04 16:12 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi Toby,

thanks for the reports. I'll run the respective functions through a valgrind-like environment today, but I don't expect anything to show up at this point. The direct-solve kernels for dense matrices have been unchanged for quite some time and haven't shown anything suspicious in the nightly tests for *months* now. Thus, I'm very tempted to assume that this is a problem with beignet - yet I'll double-check.

Yes, I think so, too, now. But it is weird that I received a segfault on nVidia initially, too. I haven't studied the kernel caching mechanism: at the moment, the PyViennaCL cache directory is versioned, but should it also be separate for different devices? (And I will need to remember to clear out the cache directory for different ViennaCL git revisions, or add a mechanism to include the git reference...)

The caching mechanism computes a hash of the source code and uses that hash to access the binary object. I doubt that there is binary compatibility across different OpenCL SDKs.

Yes, having now updated my beignet installation to the latest point release and tested various combinations of stale and clean caches, it seems like the tests pass successfully and without segfaults when there is no overlap of the cached objects across devices.

Thanks, this is good news. I'm fighting with some low-level hardware here right now, pretty challenging to get this to work properly :-(

Philippe, does the caching mechanism take that into account and store separate binaries for each OpenCL SDK? Or is the SDK name part of the hash? It seems not to do so. I can change this in PyViennaCL, but I don't know if it might be good to do as you suggest and have the SDK name as part of the hash in the core. I can always change it in PyViennaCL for this release, and this could be postponed for the core till later.

A change of the OpenCL platform is fairly unlikely, so we may be able to go without. But on the other hand, it may lead to some hard-to-debug failures, just like you observed now. I leave this decision up to you...

Best regards, Karli
Re: [ViennaCL-devel] Segfault running PyViennaCL direct solver tests
I cannot reproduce the bug on my machine, so it's probably better if you patch it :) Because a context may be attached to multiple devices, the fix should concatenate the information of each device. Around line 402:

std::string prefix;
for (std::vector<viennacl::ocl::device>::const_iterator it = devices_.begin(); it != devices_.end(); ++it)
  prefix += it->name() + it->vendor() + it->driver_version();
std::string sha1 = prefix + tools::sha1(source);

I can't think of any other place where a change would be necessary.

2014-11-04 16:32 GMT-05:00 Toby St Clere Smithe m...@tsmithe.net:

Hej Philippe,

Philippe Tillet phil.til...@gmail.com writes:

Sorry for the late answer. I've been extremely busy with my stats homework lately. The caching mechanism indeed doesn't account for the device. This is pretty easy to add, i.e., append the device name + platform version + platform name when doing the hashing.

Yes -- this is precisely what I was thinking of doing. If no one gets there first, I'll knock up a patch early tomorrow afternoon (CET).

Cheers, Toby

2014-11-04 16:12 GMT-05:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi Toby,

thanks for the reports. I'll run the respective functions through a valgrind-like environment today, but I don't expect anything to show up at this point. The direct-solve kernels for dense matrices have been unchanged for quite some time and haven't shown anything suspicious in the nightly tests for *months* now. Thus, I'm very tempted to assume that this is a problem with beignet - yet I'll double-check.

Yes, I think so, too, now. But it is weird that I received a segfault on nVidia initially, too. I haven't studied the kernel caching mechanism: at the moment, the PyViennaCL cache directory is versioned, but should it also be separate for different devices? (And I will need to remember to clear out the cache directory for different ViennaCL git revisions, or add a mechanism to include the git reference...)

The caching mechanism computes a hash of the source code and uses that hash to access the binary object. I doubt that there is binary compatibility across different OpenCL SDKs.

Yes, having now updated my beignet installation to the latest point release and tested various combinations of stale and clean caches, it seems like the tests pass successfully and without segfaults when there is no overlap of the cached objects across devices.

Thanks, this is good news. I'm fighting with some low-level hardware here right now, pretty challenging to get this to work properly :-(

Philippe, does the caching mechanism take that into account and store separate binaries for each OpenCL SDK? Or is the SDK name part of the hash? It seems not to do so. I can change this in PyViennaCL, but I don't know if it might be good to do as you suggest and have the SDK name as part of the hash in the core. I can always change it in PyViennaCL for this release, and this could be postponed for the core till later.

A change of the OpenCL platform is fairly unlikely, so we may be able to go without. But on the other hand, it may lead to some hard-to-debug failures, just like you observed now. I leave this decision up to you...
Best regards, Karli

--
Toby St Clere Smithe
http://tsmithe.net
Re: [ViennaCL-devel] Benchmark GUI - GSoC Closing Words and Future Plans
Hey Namik,

Congratulations! :-) Yes, we very much hope that you'll stay with us in this adventure. I personally really like open-source development because (1) it's really educative, and (2) it makes me feel free. I think that research/jobs can put a lot of pressure on me, to the point that it can become somewhat alienating. Having a time window to develop my personal projects somehow keeps me optimistic :p

Open-source software is actually not only about coding. I think you could further improve your GUI by clearly defining when it should be used, and when it shouldn't. Assume that your GUI ends up being (mis?)used by some technical journalists: how would you like them to comment on the results? If you don't tell them the limits of your GUI, they can't know! If you want my take on the topic:

- The GUI indicates the performance of an average program (not tuned for any particular architecture) on different devices. This can reveal some information such as "It's hard to optimize code for this device, but if you do, maybe you'll get some amazing results"; I don't know.
- The GUI *does not* compare the peak performance of two different devices. Whoever uses the GUI has to be extremely careful about this. Presenting it that way is exactly what NVidia/AMD/Intel/WhicheverVendor does with an eye-candy slide that says: "oh, look how much better our GPU is for numerical computing". A lot of researchers/journalists fall into this trap, and this is pretty sad.

I guess that these two examples give you a clear direction in which you could document your code. Don't hesitate to add a usage section in the GUI, to give some guidelines on how the results should be interpreted.

Philippe

2014-08-25 20:21 GMT+02:00 Namik Karovic namik.karo...@gmail.com:

Hi Karl,

thanks, Namik! Congratulations on successfully completing the GSoC project. I hope you got a good insight into how open-source projects are done and how much fun they can be (although at some point one also needs to make sure 'things get done' by dealing with not-so-fun stuff).

Thanks. I must say it felt damn good to finally work on something that's big and important :)

The important next step is to finalize the first release. I don't think there's much left to be done feature-wise; now it's mostly a matter of cleaning up and packaging. We hope to have you with us not only for this step, but also for the later future. The central idea of GSoC is to grow the community of open-source projects, so we hope and encourage you to stay with us to the extent possible considering your other constraints such as course work.

How long I'll stick around depends on how much free time I'll have. I'm currently looking for a job, and if I manage to find one, I'm afraid I won't have a lot of free time.

In your case the documentation part wasn't that urgent, because the GUI is mainly a matter of fusing available functionality from ViennaCL together. The two 'TODOs' with respect to documentation are:

- Document the source code using Doxygen-style comments, just like in the ViennaCL source tree. Ideally, this is done right when writing code, because then any assumptions on function arguments are clear.
- Write a user manual on how the GUI works (including some screenshots, etc.). This last part, however, should be written right before the release in order to make sure that the screenshots are up-to-date.

Alright, I'll get down to writing documentation now.
Regards, Namik

On Mon, Aug 25, 2014 at 1:33 PM, Karl Rupp r...@iue.tuwien.ac.at wrote:

Hi Namik,

I'd like to send a big thanks to Karl and Philippe for the positive GSoC final evaluation mark. And a big thanks to everyone for helping me with my project. Also, congrats to Toby for successfully completing his GSoC project.

thanks, Namik! Congratulations on successfully completing the GSoC project. I hope you got a good insight into how open-source projects are done and how much fun they can be (although at some point one also needs to make sure 'things get done' by dealing with not-so-fun stuff).

It's been a great experience and a pleasure to work with you guys. I plan to continue working on the Benchmark GUI, at least until it's in respectable shape. I'd also like to offer my help if you plan on making the benchmark result website a reality. Of course, I won't be as active as I was during GSoC.

The important next step is to finalize the first release. I don't think there's much left to be done feature-wise; now it's mostly a matter of cleaning up and packaging. We hope to have you with us not only for this step, but also for the later future. The central idea of GSoC is to grow the community of open-source projects, so we hope and encourage you to stay with us to the extent possible considering your other constraints such as course work.

Also, there's one thing still unclear to me. What about documentation? Was I supposed to have written it by
Re: [ViennaCL-devel] Roadmap to 1.6 : Cleaning the code, refurbishing the test suite, the benchmark suite, etc...
Hey,

2014-08-17 11:52 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi,

So it seems like most of the features are ready for ViennaCL 1.6. My merge from a few days ago (finally) fully integrated the use of device-specific kernels for BLAS1, BLAS2, BLAS3.

Hurray! :-)

The reduction API is still missing, though, but I think that the priority should be to polish the code, and to ensure ViennaCL is still stable despite the migration to device-specific kernels. Specifically, I think that we should spend the next few weeks cleaning the code base.

Agreed. The full set of nightly test machines should be back to operational tomorrow, as we can finally move back into our offices in Vienna. These older systems should give us some more confidence in the stability.

I can list a few points that have caught my eye:

- I've rewritten the GEMM test from scratch. It now uses ublas::prod 27 times, instead of 2500, and thanks to a few macros the file size is substantially smaller (~250 lines vs. ~850). The test now completes about 15 times faster using the single-threaded ViennaCL implementation, and in the blink of an eye (~10 seconds) when OpenCL is used. Hurray! More importantly, this new version allowed me to spot the bug which was responsible for the failure of libviennacl-blas3 in tonight's dash. The culprit was the blas3_prod test choosing slice1==slice2, while the slices were bugged for C=row-major+sliced... This is frightening, because some other similar glitches may be hidden here and there in the test suite. For example, the matrix_vector test passes, but the libviennacl-blas2 test fails - probably due to some stride issues for row-major matrices. Things get more complicated now that the col-major kernels are used for the row-major cases. Anyhow, I think that we should somehow ensure that there is no such glitch remaining in the test suite before shipping ViennaCL 1.6 (i.e., all matrix slices/ranges use different offsets/strides in each direction).

Cool, thanks, that looks indeed a lot more compact now. With regard to tweaking the tests towards full coverage of row/column strides: this is something you have to add to the tests, because you know the internals of the kernels best and can design the tests towards corner cases. What caught my attention was the use of {start_M, start_N, start_K} as well as {stride_M, stride_N, stride_K}: wouldn't it be better to use separate strides for A, B, C in both dimensions, making it six parameters rather than three?

It would probably be better, indeed.

- I really think that we should rewrite the benchmarks for the 1.6 release, especially since it would showcase the substantial performance improvement that this release will bring. I can start writing a condensed benchmark including copy, axpy, dot, gemv, gemm. I think it would be cool to have sparse, solver, qr also included in that routine. I won't have the time to carry this out; I'm moving to the United States in 1 week :-p

I can take care of that, particularly as I'll have to adjust the tests in the benchmark GUI as well. However, I don't think that we should merge the sparse and solver routines into the same executable; these are two distinct fields of application (dense vs. sparse linear algebra). Merging too many different things into one executable also has some disadvantages if one piece of functionality does not work on a certain machine for whatever reason.

Cool! Yes, you're right, dense and sparse routines are not used for the same purposes. Which operations should be included in each executable, then?
One for dense benchmarks, and one for sparse benchmarks?

- I've noticed some unsafe/faulty legacy code dating back to when the layout was made a runtime parameter.

* nmf only implements matrix<T>, but in principle matrix_base<T> should work (since no custom kernel is called, I believe)

NMF uses a custom kernel and thus only works with OpenCL. A generalization to matrix_base should be straightforward, yes.

I should be able to do it for the release. The kernel it uses is:

template<typename StringType>
void generate_nmf_el_wise_mul_div(StringType & source, std::string const & numeric_string)
{
  source.append("__kernel void el_wise_mul_div( \n");
  source.append("  __global "); source.append(numeric_string); source.append(" * matrix1, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix2, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix3, \n");
  source.append("  unsigned int size) \n");
  source.append("{ \n");
  source.append("  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0)) \n");
  source.append("  { \n");
  source.append("    "); source.append(numeric_string); source.append(" val = matrix1[i] * matrix2[i]; \n");
  source.append("    "); source.append(numeric_string); source.append(" divisor = matrix3[i]; \n");
  source.append("    matrix1[i] = (divisor > ("); source.append(numeric_string); source.append(")0.1) ? (val / divisor) : ("); source.append(numeric_string); source.append(")0; \n");
  source.append("  } \n");
  source.append("} \n");
}
Re: [ViennaCL-devel] Roadmap to 1.6 : Cleaning the code, refurbishing the test suite, the benchmark suite, etc...
Hey,

The nasty bug on strided GEMV got solved. I'm available on Wednesday for the code uniformization session. We should be on IRC at the same time, though, in case we face a situation we had not discussed. I have a couple of questions regarding a standardized way of naming the numeric type of a matrix/vector. Sometimes it's NumericT, sometimes it's T, sometimes it's TYPE... What about NumericType everywhere? Anyway, some similar questions could arise, so it's probably better to be able to chat in real time while making the code style uniform.

We must also remember to sort out
https://github.com/viennacl/viennacl-dev/issues/71
https://github.com/viennacl/viennacl-dev/issues/77
https://github.com/viennacl/viennacl-dev/issues/66
https://github.com/viennacl/viennacl-dev/issues/2

Philippe

2014-08-17 19:36 GMT+02:00 Philippe Tillet phil.til...@gmail.com:

So the dense benchmark suite got refurbished here: https://github.com/viennacl/viennacl-dev/commit/73f46e36cfa4104628f831195e4da25a62f9ef66 The same template using macros can be used for any benchmark. It's pretty concise and maintainable!

Philippe

2014-08-17 13:50 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:

Hi,

* nmf only implements matrix<T>, but in principle matrix_base<T> should work (since no custom kernel is called, I believe)

NMF uses a custom kernel and thus only works with OpenCL. A generalization to matrix_base should be straightforward, yes.

I should be able to do it for the release. The kernel it uses is:

template<typename StringType>
void generate_nmf_el_wise_mul_div(StringType & source, std::string const & numeric_string)
{
  source.append("__kernel void el_wise_mul_div( \n");
  source.append("  __global "); source.append(numeric_string); source.append(" * matrix1, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix2, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix3, \n");
  source.append("  unsigned int size) \n");
  source.append("{ \n");
  source.append("  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0)) \n");
  source.append("  { \n");
  source.append("    "); source.append(numeric_string); source.append(" val = matrix1[i] * matrix2[i]; \n");
  source.append("    "); source.append(numeric_string); source.append(" divisor = matrix3[i]; \n");
  source.append("    matrix1[i] = (divisor > ("); source.append(numeric_string); source.append(")0.1) ? (val / divisor) : ("); source.append(numeric_string); source.append(")0; \n");
  source.append("  } \n");
  source.append("} \n");
}

So, the layout of the matrix shouldn't matter, indeed. It would be pretty easy to have this kernel generated by the generator, too, as it can be represented by the expression tree matrix1 = select(matrix3 > 0.1, element_div(element_prod(matrix1, matrix2), matrix3), cast<T>(0)). However, we're running out of time, so I wouldn't port it. But we have to keep in mind that this would be a trivial thing to do.

The same student who ported the FFT code to multiple backends will take care of porting NMF to multiple backends. He's pretty quick already, so it should be done by the release. However, I'd refrain from integrating this into the generator for now because it is totally non-critical in terms of overall performance. We can port that under perfect control within the OpenCL backend later, when we have more confidence in the stability of the generator (no pun intended).

- We should definitely have a discussion on matrix padding, which is no longer required anywhere in ViennaCL, as far as I know.
I am in favor of making size()==internal_size() by default. That's not the point of the e-mail, but we should have a discussion on what we should do with it!

Getting rid of the padding would certainly remove the traps of using fast_copy() on a matrix. Other than that, I don't think it has a substantial influence on the code, because internal_size() is still needed for dealing with ranges. There may be an influence on certain bandwidth-limited operations, though, as for example a matrix addition may lead to bank conflicts (or channel conflicts, whatever...) when accessing GPU RAM for certain matrix sizes. Before making a decision on the padding issue, we should run some benchmarks to see whether there is an impact.

Well, one thing I'm sure of is that we should offer the option of using no padding if needed (for memory constraints), or (probably even better) of choosing the padding size. Apparently it is not an easy choice for us to pick the default because of the many things to consider. Thus
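To make the fast_copy() trap concrete: fast_copy() transfers the raw padded buffer, so the host array must be sized by internal_size(), not size(). A small sketch (sizes illustrative):

#include <viennacl/matrix.hpp>
#include <vector>

void padded_copy_example()
{
  viennacl::matrix<float> A(100, 100);          // internal sizes may be padded (e.g. to 128)
  std::vector<float> host(A.internal_size());   // NOT A.size1() * A.size2()
  viennacl::fast_copy(A, &host[0]);             // device -> host, padded row-major layout
}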
Re: [ViennaCL-devel] Benchmark GUI Expert Mode
Hey Namik,

The code looks fine. As a small tip, I would advise using blas3MatrixSize{A,B,C} = {M, N, K}; it's much more conventional. I would also suggest removing LU from the benchmark. I only achieve 11 GFLOP/s on my machine (GEMM peaks at 120 GFLOP/s). It will smash the overall score if you keep it enabled!

Philippe (Not sleeping either :-p)

2014-08-17 23:28 GMT+02:00 Namik Karovic namik.karo...@gmail.com:

Hi all,

I just pushed the first working version of the expert (custom) benchmark mode. Selecting custom sparse matrices is yet to be implemented, but all other benchmark configs are working. Except blas3, that is. I think I got the sizes wrong. I'd appreciate it if someone could check if I did it right:

// blas3MatrixSizeA,B = size1,2
// blas3MatrixSizeB,C = size2,3
viennacl::matrix<ScalarType> vcl_A(blas3MatrixSizeA, blas3MatrixSizeB);
viennacl::matrix<ScalarType> vcl_B(blas3MatrixSizeB, blas3MatrixSizeC);
viennacl::matrix<ScalarType> vcl_C(blas3MatrixSizeA, blas3MatrixSizeC);

// Fill the matrix
for (unsigned int i = 0; i < blas3MatrixSizeA; ++i)
  for (unsigned int j = 0; j < blas3MatrixSizeB; ++j)
    stl_A[i*blas3MatrixSizeA + j] = random<ScalarType>();

for (unsigned int i = 0; i < blas3MatrixSizeB; ++i)
  for (unsigned int j = 0; j < blas3MatrixSizeC; ++j)
    stl_B[i + j*blas3MatrixSizeC] = random<ScalarType>();

// using ranges
viennacl::range r(blas3MatrixSizeB/4, 3 * blas3MatrixSizeB/4);
// using slices
viennacl::slice s(0, 2, blas3MatrixSizeB/2);

The benchmark crashes on test 4 (LU factorization). I don't know if I messed up somewhere before test 4 (in the code written above), or somewhere else.

Regards, Namik
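For reference, the indexing in the quoted fill loops looks like a plausible source of the crash: the row stride of a row-major A-by-B matrix is B (its number of columns), not A. A restatement with the M, N, K naming suggested above (a hedged sketch, since the surrounding buffer allocations are not shown in the original):

// C = A * B with A: M x K, B: K x N, C: M x N
viennacl::matrix<ScalarType> vcl_A(M, K), vcl_B(K, N), vcl_C(M, N);

std::vector<ScalarType> stl_A(M * K), stl_B(K * N);

for (unsigned int i = 0; i < M; ++i)        // row-major fill: row stride is K
  for (unsigned int j = 0; j < K; ++j)
    stl_A[i * K + j] = random<ScalarType>();

for (unsigned int i = 0; i < K; ++i)        // column-major fill: column stride is K
  for (unsigned int j = 0; j < N; ++j)
    stl_B[i + j * K] = random<ScalarType>();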
[ViennaCL-devel] Roadmap to 1.6 : Cleaning the code, refurbishing the test suite, the benchmark suite, etc...
Hey!

So it seems like most of the features are ready for ViennaCL 1.6. My merge from a few days ago (finally) fully integrated the use of device-specific kernels for BLAS1, BLAS2, BLAS3. The reduction API is still missing, though, but I think that the priority should be to polish the code, and to ensure ViennaCL is still stable despite the migration to device-specific kernels. Specifically, I think that we should spend the next few weeks cleaning the code base. I can list a few points that have caught my eye:

- I've rewritten the GEMM test from scratch. It now uses ublas::prod 27 times, instead of 2500, and thanks to a few macros the file size is substantially smaller (~250 lines vs. ~850). The test now completes about 15 times faster using the single-threaded ViennaCL implementation, and in the blink of an eye (~10 seconds) when OpenCL is used. Hurray! More importantly, this new version allowed me to spot the bug which was responsible for the failure of libviennacl-blas3 in tonight's dash. The culprit was the blas3_prod test choosing slice1==slice2, while the slices were bugged for C=row-major+sliced... This is frightening, because some other similar glitches may be hidden here and there in the test suite. For example, the matrix_vector test passes, but the libviennacl-blas2 test fails - probably due to some stride issues for row-major matrices. Things get more complicated now that the col-major kernels are used for the row-major cases. Anyhow, I think that we should somehow ensure that there is no such glitch remaining in the test suite before shipping ViennaCL 1.6 (i.e., all matrix slices/ranges use different offsets/strides in each direction).

- I really think that we should rewrite the benchmarks for the 1.6 release, especially since it would showcase the substantial performance improvement that this release will bring. I can start writing a condensed benchmark including copy, axpy, dot, gemv, gemm. I think it would be cool to have sparse, solver, qr also included in that routine. I won't have the time to carry this out; I'm moving to the United States in 1 week :-p

- I've noticed some unsafe/faulty legacy code dating back to when the layout was made a runtime parameter.
* nmf only implements matrix<T>, but in principle matrix_base<T> should work (since no custom kernel is called, I believe)
* There was a faulty row_major(is_row_major<BaseType>::value) in matrix_range and matrix_slice. This caused matrix_range<matrix_base<T> > to be column-major no matter what. More generally, there are a couple of places using the static is_row_major or alignment traits. I thought that it could be a good idea to delete these traits, to be sure that there can be no such faulty code anywhere else. Am I overlooking any side effect?

- We should definitely have a discussion on matrix padding, which is no longer required anywhere in ViennaCL, as far as I know. I am in favor of making size()==internal_size() by default. That's not the point of the e-mail, but we should have a discussion on what we should do with it!

- Finally, there is a performance regression for GEMM with slices, due to my fallback being too extreme (one element computed per work-unit). I'm on it, so don't worry if you get something like 3 GFLOP/s on slices in the current blas3 benchmark.

Okay, that's pretty much everything I'm worried about, I think!

Philippe
[ViennaCL-devel] Testing GEMM
Hey,

The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks involved. This gets hard to test, so I thought it could be a good idea to discuss it. Basically, here is how it works:

A = [A1 A2; A3 A4]
B = [B1 B2; B3 B4]
C = [C1 C2; C3 C4]

where each block is divided according to the corresponding block size of the template. For example, A1 is the largest sub-block whose sizes are multiples of the tuple (ML, KL), where ML is the number of rows computed by each work group and KL the width step for computing the inner products. (If the kernel uses local memory, it will load successive blocks of size ML*KL in each work group.) A few kernels are enqueued so that:

C1 = A1*B1 [optimized kernel]
C1 += A2*B3 [fallback] if needed
C2 = A1*B2 [fallback] if needed
C2 += A2*B4 [fallback] if needed
etc...

Basically, one optimized kernel does the bulk of the work, and the other ones do the clean-up. This works well for full matrices and ranges. When slices are involved, things get more complicated. If the stride is on the non-leading dimension (stride2 for column-major matrices), then it can be incorporated into the optimized kernel (by appending ld *= stride2 at the beginning of the kernel). However, if stride1 > 1, then we need to use the fallback kernel. This is a reasonable thing to do: in most applications I know of, only one stride is accessed at a time (we want a subset of the rows/columns of a given matrix).

However, this becomes really messy to test! Basically, I think that, to have an exhaustive enough test suite, we should go for:

- Matrices of complicated arbitrary sizes (143, 284, 395). It is important to space them by more than 128, to be sure that A1, B1 and C1 are not square.
- Ranges of similarly complicated sizes.
- An optimized range: (128, 256, 384), for example.
- Matrix row-wise slices, matrix col-wise slices, and matrix slices in both directions.

I am ready to rewrite the GEMM tests accordingly, but any thoughts on the procedure would be appreciated!

Philippe
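A hedged sketch of what such a test could look like (names, sizes, and stride choices are illustrative, not the actual test-suite code):

#include <viennacl/matrix.hpp>
#include <viennacl/matrix_proxy.hpp>
#include <viennacl/linalg/prod.hpp>

void test_gemm_slices()
{
  std::size_t M = 143, N = 284, K = 395;   // deliberately awkward sizes, spaced by more than 128

  viennacl::matrix<float> A(2*M + 4, 2*K + 4), B(2*K + 4, 2*N + 4), C(2*M + 4, 2*N + 4);

  // Distinct offsets and strides in every direction, so that no two slice
  // parameters can accidentally coincide (the slice1==slice2 glitch from
  // the test-suite discussion). slice(start, stride, size):
  viennacl::slice rows_A(0, 2, M), cols_A(2, 2, K);
  viennacl::slice rows_B(1, 2, K), cols_B(0, 2, N);
  viennacl::slice rows_C(1, 2, M), cols_C(3, 2, N);

  viennacl::matrix_slice<viennacl::matrix<float> > As(A, rows_A, cols_A);
  viennacl::matrix_slice<viennacl::matrix<float> > Bs(B, rows_B, cols_B);
  viennacl::matrix_slice<viennacl::matrix<float> > Cs(C, rows_C, cols_C);

  Cs = viennacl::linalg::prod(As, Bs);   // compare against a host reference GEMM
}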
Re: [ViennaCL-devel] Benchmark GUI Feedback Needed
Hello!

This all looks pretty good. Good job!

2014-08-12 3:40 GMT+02:00 Namik Karovic namik.karo...@gmail.com:

Hi Karl,

I'm fine with splitting things into something like Basic Benchmark and Expert Benchmark ('view' sounds inappropriate), but as long as both benchmarks do the same thing, I don't see the problem why the expert version cannot be a refinement of the basic version. Could you please elaborate?

The entire issue comes down to this: should basic mode be able to run the benchmark with expert mode's settings? Or should it always run using the default settings, no matter what? My motivation for bringing this up is that one could first do a basic benchmark, then continue on to playing with the expert mode. The basic mode can then be used for quick reference, as it will not be altered by expert mode runs. So both modes will have their own results, and their own settings. I will prevent users from running both modes at the same time, of course. I hope it's clearer now.

I would rather lean towards re-using the expert settings for the basic benchmarks, and providing some reset button, so that if one messes things up, one can still retrieve the original basic results.

Looks great, this is a really useful graph (something is fishy with the values on the y-axis, though...) :-) Can you please draw the x- and the y-axis in logarithmic scale and make the vector increment a multiplicative factor (2 by default)?

The axis labels are fishy because they aren't properly set up yet :) Sure, I can make them logarithmic. What about the default number of increments? I currently have it set to increment by 1 million from 1M to 15M, so 14 increments. Should there be more increment steps? I need to know so I can calculate the optimum min and max vector size for a x2 increment factor.

It's actually important to have finer-grained data for small vectors, and more widely spaced points as the data grows bigger: this is why it is better to choose the sizes according to an a^x law than an a*x one. You can experiment with values of a other than 2, if you want. If I were you, I'd probably go with something like [int(1.5**x) for x in range(30, 45)], that is, a multiplicative increment of 1.5 from ~190,000 to ~55,000,000.

This looks quite okay, actually.

Alright, if you say so. But note that in fullscreen it will be a lot more stretched, and thus a lot less visually appealing. I'll do some more thinking to try and make it a bit more organized.

There should be a third size for the Blas3 part. This will then also make all four boxes (Blas3, Sparse, Copy, Vector) equally high, which should improve the visual appearance.

So x, y, z dimensions for Blas3? Blas3 currently uses 2D matrices, so I'll have to modify the benchmark to use 3D matrices?

Blas3 multiplies two matrices: A(size1, size2) * B(size2, size3), hence the three sizes required :-p Not sure about what kind of 3D matrices you are referring to! ;)

In any case, great job!

Philippe

I don't see a problem with making the string conversion routines public, so I just pushed a commit for doing so. :-)

Thanks. Appreciate it.

Regards, Namik

On Mon, Aug 11, 2014 at 8:18 PM, Karl Rupp r...@iue.tuwien.ac.at wrote:

Hi Namik,

I'm starting work on the expert view and would appreciate some feedback before I get into it more seriously.

thanks for the latest bunch of features :-) I've got quite a lot of questions, so bear with me please. Here we go:

- Should basic and expert views be changed to independent benchmarking modes, or remain different views of the same benchmark backend?
I initially imagined basic and expert views as differently detailed presentations of the same benchmark instance (one could run the basic benchmark, and switch to the expert view after it's done to examine the results in more detail). However, now I'm thinking it would be better not to mix them. Let basic mode be a simple benchmark with default settings, and let expert mode be fully customizable and independent. That way the basic mode would be unaffected by expert mode's settings. This would allow basic mode to act as a safe reference mode. It would also allow easier usage of benchmark profiles (saving the user's expert mode config for later usage), but that's a story for another time. It's worth mentioning that it's easier to implement two independent modes than to have them share a single benchmark mode. So, which version am I to develop?

I'm fine with splitting things into something like Basic Benchmark and Expert Benchmark ('view' sounds inappropriate), but as long as both benchmarks do the same thing, I don't see the problem why the expert version cannot be a refinement of the basic version. Could you please elaborate?

- I've implemented line plotting of copy vector benchmarks. There are still some minor tweaks to be done, but the main functionality is ready. Here's a screenshot for quick reference:
Re: [ViennaCL-devel] Tolerances for tests
Hey Toby,

My two cents: don't forget that matrix-vector multiplication will still introduce some round-off errors. I.e., when you are computing y = A*[1,1,...], then you are actually computing something like y' = A*([1,1,...]+eps). GEMV is backward stable, so you are sure that y' will be close to y. This being said, I don't know much about the stability/backward stability of GMRES, but if it's not backward stable, you won't be able to get your result close to [1,1,...]. In other words, you're probably better off comparing y' with A*x', where x' is the output of the GMRES procedure, rather than x with x'.

Philippe

2014-08-05 22:10 GMT+02:00 Toby St Clere Smithe m...@tsmithe.net:

Hi all,

I've now implemented a test for the iterative solvers and preconditioners, using generate_fdm_laplace. This is good because it gives consistent results, though compared to using a randomly generated system matrix it means that the solvers are only tested on one set of input data.

My test works by constructing the system matrix, choosing a solution vector that is just a vector of 1.0s of the correct size, and multiplying to find the RHS to put into the solver. I then run the solver and compare the output to my vector of 1.0s. I report a failure if the error is greater than some tolerance value specific to the datatype.

Of course, this absolute error tolerance has a different definition from that in (for instance) the GMRES solver, where the solver quits if ||r|| < tolerance * ||r_initial|| obtains. This means that the solver might return successfully, and yet cause a false test failure. I have a crude work-around for this. It seems to suffice to set the solver tolerance to 1e-1 times the test tolerance, which strictly might be stronger than is ideally warranted. I think this should suffice for the purposes of this test, however; else I'll have to do silly solver-specific things like computing ||r_initial||.

Currently, I use a test tolerance of 1e-2 for single precision, which means a solver tolerance of 1e-3. This seems less precise than I'd like; machine epsilon should be around 1e-7, and so I feel like I should be able to use a test tolerance of 1e-3 and a solver tolerance of 1e-4. However, using GMRES, this gives incorrect results (regardless of max_iterations) -- wildly so when combined with some preconditioners. I suspect that this is caused by rounding errors, but I'm not sure. I tried to check what the ViennaCL test does, but I couldn't find one for the iterative solvers!

Cheers, Toby

--
Toby St Clere Smithe
http://tsmithe.net
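In ViennaCL terms, Philippe's suggestion amounts to a residual-based check (a minimal sketch; x_computed and b stand for the solver output and the constructed right-hand side):

#include <viennacl/compressed_matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/prod.hpp>
#include <viennacl/linalg/norm_2.hpp>

// Backward-error style test: measure ||A*x' - b|| / ||b|| instead of
// comparing x' element-wise against the known solution.
bool solution_acceptable(viennacl::compressed_matrix<double> const & A,
                         viennacl::vector<double> const & x_computed,
                         viennacl::vector<double> const & b,
                         double tol)
{
  viennacl::vector<double> r = viennacl::linalg::prod(A, x_computed);
  r -= b;
  double nr = viennacl::linalg::norm_2(r);   // device reduction, copied to host
  double nb = viennacl::linalg::norm_2(b);
  return nr / nb < tol;
}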
[ViennaCL-devel] On the use of vector types in viennacl's opencl kernels
Hi,

It's horrible! As soon as I want to introduce some vectorized types in an OpenCL template as simple as AXPY, everything starts exploding. Well, first things first, I probably need to justify why I think that we cannot do without double2, float4 in all of our dense kernel templates:

- From my own experience, it turns out that some element-wise expressions can easily be compute-bound. In statistics it is pretty easy to encounter complicated element-wise transforms when evaluating a probability density function. I've personally had to use SSE on my CPU a couple of times to alleviate this problem.
- Some vendors explicitly state in their optimization guide that loads of 16 bytes will result in better bandwidth.

On the other hand, using stride != 1 will prevent the use of vectorized loads in any kernel (AXPY, GEMM, etc.). We're definitely facing a dilemma here, where we have to choose between higher JIT overhead (the programs can be cached, however) and potentially higher execution time. My belief is that we should provide a fallback program for stride != 1, which will be compiled only if strided accesses are used. Note that even this wouldn't solve all our problems: how to handle offsets that are not a multiple of 4? How to handle sizes that are not a multiple of 4? We could use the same fallback, or provide a different optimized kernel.

http://paste.ubuntu.com/7915787/

optimized_1 should be able to handle the remaining cases quite well, while optimized_0 should be faster because it doesn't have to check for alignment (contrary to vload4) and doesn't have to do any clean-up. In the case of AXPY, I'd expect optimized_1 to be the better option. For GEMM, however, I'd prefer the clean-up to be done in some other kernel calls.

Seriously, what a headache!! But discarding vector types for everything but GEMM just sounds wrong to me...

Philippe
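Since the paste may not survive, here is a hedged reconstruction of the two AXPY flavors being contrasted (kernel names and signatures are illustrative only, not the generator's actual output):

// optimized_0-style: float4 pointers; requires a 16-byte-aligned offset,
// size divisible by 4, and stride == 1.
__kernel void axpy_float4(__global float4 * x, __global const float4 * y,
                          float alpha, unsigned int size4)
{
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    x[i] += alpha * y[i];
}

// optimized_1-style: vload4/vstore4 tolerate unaligned starts; a scalar
// clean-up loop handles the size % 4 tail.
__kernel void axpy_vload4(__global float * x, __global const float * y,
                          float alpha, unsigned int size)
{
  unsigned int size4 = size / 4;
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    vstore4(vload4(i, x) + alpha * vload4(i, y), i, x);
  if (get_global_id(0) == 0)                 // tail elements
    for (unsigned int i = 4 * size4; i < size; ++i)
      x[i] += alpha * y[i];
}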
[ViennaCL-devel] OpenMP Matrix Multiplication
Hi guys,

So I expect ViennaCL 1.6 to offer some really good performance on CPUs with the OpenCL backend -- possibly 80% of OpenBLAS / MKL on a Core i7 4770, for example. As the OpenCL kernel generator and the auto-tuner get better, we can hope for further improvements. This will create a huge gap with the fallback OpenMP version, which hardly reaches 0.5 GFLOP/s.

What would you think about extracting the assembly output of the Intel OpenCL compiler? I'm not familiar *at all* with assembly code. How would we handle multi-threading in such a setting?

Philippe
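For perspective on the 0.5 GFLOP/s figure: even a simple cache-blocked OpenMP loop nest typically lands far above that on a modern CPU. This is a generic sketch, not the ViennaCL host backend (row-major buffers; the block size is a tuning assumption):

#include <algorithm>

// C += A * B with A: M x K, B: K x N, C: M x N, all row-major.
void gemm_blocked(const float * A, const float * B, float * C,
                  int M, int N, int K)
{
  const int BS = 64;                       // block size: tune for the L1/L2 caches
  #pragma omp parallel for collapse(2)     // each thread owns disjoint C tiles
  for (int i0 = 0; i0 < M; i0 += BS)
    for (int j0 = 0; j0 < N; j0 += BS)
      for (int k0 = 0; k0 < K; k0 += BS)
        for (int i = i0; i < std::min(i0 + BS, M); ++i)
          for (int k = k0; k < std::min(k0 + BS, K); ++k)
          {
            const float a = A[i * K + k];  // hoist A(i,k); inner loop streams B and C
            for (int j = j0; j < std::min(j0 + BS, N); ++j)
              C[i * N + j] += a * B[k * N + j];
          }
}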
[ViennaCL-devel] ViennaCL console benchmarks
Hey,

I've noticed that the console benchmarks for ViennaCL are quite outdated; performance for AXPY is reported in FLOP/s, for example, even though it is a bandwidth-bound operation. I think it'd be great to have something compact, all incorporated in a single benchmarking executable:

===
BLAS [float, full]
-
AXPY : ... (GB/s)
DOT : ... (GB/s)
GEMV : ... (GB/s)
GEMM-NN : ... (GFLOP/s)
GEMM-TN : ... (GFLOP/s)
GEMM-NT : ... (GFLOP/s)
GEMM-TT : ... (GFLOP/s)
... solver, perhaps some other things

BLAS [float, ranges]
-
...
===

I can't really think of a case where one would only be interested in the performance of one single operation! Do you have any other ideas to make the benchmarks more concise/readable/informative?

Philippe
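For reference, the units above follow directly from how each operation is bound; the conversion is just bookkeeping (N, M, K and time denote whatever the benchmark measured):

// AXPY (x = a*y + x) moves three vectors through memory: read x, read y, write x.
double axpy_gb_per_s = 3.0 * N * sizeof(float) / time / 1e9;   // bandwidth-bound
// DOT reads two vectors.
double dot_gb_per_s  = 2.0 * N * sizeof(float) / time / 1e9;
// GEMM performs 2*M*N*K floating-point operations.
double gemm_gflops   = 2.0 * M * N * K / time / 1e9;           // compute-bound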
Re: [ViennaCL-devel] Benchmark GUI First Look
Hi Namik,

Good job! It all looks very appealing. I don't have much to say, just a few comments:

- I'd rather use the median instead of the average, indeed.
- As for the latency in the expert section, it would be great to also have an execution-time-vs-size plot, in order to show up to which sizes latency dominates the routines.

I'll be here at 16:00 UTC.

Philippe

2014-07-11 12:50 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at:

Hey,

(...) I'll skip the pause functionality then.

Ok :-)

Consider using the median rather than the average. In some cases one can observe infrequent outliers, which would distort the average.

Would the median value really be a good choice to represent all benchmark sub-tests? I mean, surely those outliers should have some impact on the final result. Performance bottlenecks can sometimes be that thin line that separates high-end from low-end products. I'm not so sure we should just ignore them.

I hope we're talking about the same thing here. Imagine you're benchmarking the vector addition; then you may get the following individual timings:

0.1 sec
0.1 sec
0.1 sec
0.5 sec  -- some unexpected other load on the system here

The median is 0.1 here, whereas the average is 0.2. I'd consider the median to provide a more reliable indicator for system performance here.

Rather than a single score, what about providing multiple scores for the main quantities of interest? One score for GFLOPs, one score for memory bandwidth, one score for latency? Is this too technical already?

Well, a multi-score is fine by me. However, memory bandwidth in terms of GB/s is only measured by the copy benchmark, as far as I've seen. We should consider adding some more bandwidth benchmarks in that case.

The classic benchmark for this is the STREAM benchmark (http://www.cs.virginia.edu/stream/), which covers the following four vector operations:

x <- y (copy)
x <- a * y (scale)
x <- y + z (sum)
x <- y + alpha * z (triad)

All four of them are easy to reproduce. If you are amazed at the simplicity of the benchmark, you'll be surprised at how much it can tell you regarding performance for a huge number of applications.

As for latency, I don't think our average users would care that much about it. Whenever I overclocked my computer, I would primarily focus on achieving higher memory bandwidth instead of lower latency. That is why I would rather see memory bandwidth instead of latency. It would certainly be a useful addition, but not *that* important to have a dedicated score.

Valid point; latency and friends should go to the expert results section.

Live updates would certainly be cool. However, you need to make sure that the plots are only drawn in between different benchmark tests; otherwise the plotting might induce a certain load which will interfere with the benchmark. I think this can be accomplished without too much effort.

Well, each benchmark is run in its own separate thread. I don't think the main GUI thread can interfere with a benchmark's thread.

Sure it can. Not in terms of data, but in terms of eating CPU cycles while the benchmark thread is running. Keep in mind that ultimately we're also running tests with OpenMP on the CPU.

But if you're referring to data transfer between the CPU and GPU, then I can't say for sure. To the best of my knowledge, Qt widgets utilize the CPU and RAM; there's no GPU or modern OpenGL involved. As for communication between running benchmarks and the GUI, all messages coming from benchmarks are sent between sub-tests.
All benchmarks' sub-tests are intact, so there shouldn't be any problems regarding message emitting. Messages between CPU and GPU are less of an issue. Still, the GUI should do as little work as possible while the benchmark is running. It's okay to update in between sub-benchmarks. At this point we should certainly define a reasonable grouping of the results. For example, our current LU factorization is fairly slow because of the way it is implemented, not because the hardware is poor. Are you available for an IRC session on this tomorrow, e.g. 16:00 UTC? Sure, I'm available at 16:00 UTC tomorrow (I guess that's today now :D ) Yes, Friday, 16:00 UTC. I quickly adjusted the one I designed some time ago, see attachment. I can commit the GIMP file if you like it. Of course we can alter it further as needed and appropriate. Thanks, I implemented it right away and pushed to GitHub. Looks good. Yeah, I'd like the GIMP file. I'll play around with it and see if anything can be improved. Ok, I'll push it right after sending the email. Don't forget to pull. ;-) Btw: Could you please consider rearranging the folder hierarchy such that src/ only contains actual code? Put things like the splash screen
Re: [ViennaCL-devel] ViennaCL 1.6 Roadmap
Hi, I'd like to add something, to point out that input-dependent kernels are pointless without kernel caching (both would use an environment variable and the filesystem). Indeed, each program will contain multiple versions of a given operation, which can make the compilation time very long if caching is disabled. Philippe 2014-07-09 17:53 GMT+02:00 Philippe Tillet phil.til...@gmail.com: Hey hey, 2014-07-09 14:47 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, Philippe, did you by chance check the impact of the generator integration on kernel latency? We only have a 1-10us margin to work with, which I haven't checked yet. Don't worry about the overhead. It used to be fine. I'll re-check to see whether everything is still fine, but when the program name and the kernel name prefix are known in advance (i.e. for the pre-compiled programs), I don't see where a significant overhead could come from! I'll benchmark this ASAP, once some other modifications are done. The overhead could come from too many indirections in memory accesses, i.e. if too many lookups in maps and string comparisons are involved. Since you know the implementation better than me and don't think this is an issue, it should be fine. Either way, it needs to be checked, as it is such a fundamental quantity for the onset of scaling behavior of almost all 'higher-level' algorithms. The process of enqueueing the generator is extremely lightweight; there is no map involved. It does basically two things: - Parse the statement to retrieve some quantities (e.g. M, N, K in the case of GEMM) - Recursively enqueue the elements of the statement (matrix, vector, scalar, etc.) When the program name is known in advance, there is no need to build the representation of the statement (which fills a char*), but even this should be fast enough. I remember having measured, some time ago, a total overhead of ~10 microseconds when building this representation. But I'll re-evaluate this ASAP. I've been very motivated to work on the kernel generator recently, and simply don't feel like working on (1) or (2) at the moment. Now, there are two different options for (4): 4.1 - Implementing the kernel fusion mechanism inside the scheduler. 4.2 - Input-dependent kernels, and performance prediction. While I could help with 4.1, I don't feel like I could do this task alone, because I don't have sufficient knowledge of the backend. Plus, it implies getting rid of op_executor(), and I'm not sure how I could do this, either! I feel operational, though, for 4.2. I feel like ViennaCL 1.6 should be a performance-oriented release, and having an (input+device)-dependent kernel selection mechanism is something we have to do! I think we should not go for 4.1 with a 1.6.0 release, simply because it would delay the release cycle. We should provide features to our users fairly quickly after they are stabilized, not have them hanging around in the developer repository for too long. We have enough features for 1.6.0 already ;-) Some work from your side on 4.2 would be good, so if you have some resources left, please focus on that. Sure. 4.2 is part of my (future) PhD work, so I can't expect to have everything working flawlessly for ViennaCL 1.6.0. As always, it's better to have a smaller set of reliable features in a release rather than a larger set of broken features ;-) But I feel like I should be able to create the backbone for this release: a simple environment-variable-based mechanism that points to a folder containing the files spit out by the Python auto-tuner.
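A minimal sketch of such an environment-variable-based lookup (purely illustrative: the variable name matches the VIENNACL_MODELS_PATH proposal that follows, while the file layout and fallback behavior are assumptions, not ViennaCL code):

    #include <algorithm>
    #include <cctype>
    #include <cstdlib>
    #include <fstream>
    #include <iterator>
    #include <string>

    // Standardize a device name: lower-case, spaces replaced by dashes.
    std::string standardized_name(std::string name)
    {
      std::transform(name.begin(), name.end(), name.begin(), ::tolower);
      std::replace(name.begin(), name.end(), ' ', '-');
      return name;
    }

    // Returns true and fills 'model' if VIENNACL_MODELS_PATH is set and a
    // file for this device can be opened; otherwise the caller falls back
    // on the built-in, input-agnostic database.
    bool try_load_model(std::string const & device_name, std::string & model)
    {
      char const * path = std::getenv("VIENNACL_MODELS_PATH");
      if (!path)
        return false;
      std::ifstream file((std::string(path) + "/" + standardized_name(device_name)).c_str());
      if (!file)
        return false;
      model.assign(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>());
      return true;
    }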
I'd like an environment-variable-based extension, as it can be easily exploited by advanced users in C++, and generalized by pyviennacl (since Python has a portable filesystem framework)! Here's my idea. We could have VIENNACL_MODELS_PATH pointing to a directory containing standardized device names (lower-case, spaces replaced by dashes). At runtime, we check if the environment variable is set and if we can open the corresponding file. If not, we fall back on the built-in, input-agnostic database. This sounds to me much more like a researcher's facility rather than something an average user wants to be exposed to. Keep in mind that whenever something needs to go through the file system, it is subject to additional problems: these can be permission problems, problems with blanks (or umlauts, etc.), random IO errors, or tricky problems in batch systems on supercomputers. Since I'm part of the PETSc developer team, I've learned about so many problems on machines 'out there', where Murphy's law is constantly in action. Can we focus on populating the built-in database for the 1.6.0 release instead? A standard user with a standard GPU should
[ViennaCL-devel] ViennaCL 1.6 Roadmap
Hello, Looking at the roadmap: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap I was concerned with 4 elements: (1) Hook in external BLAS libraries and use them as a computing backend (2) Distributed vectors and matrices (multiple devices, possibly mixed CUDA/OpenCL/OpenMP) (3) Support for reductions (vector reduction, row-wise reduction, col-wise reduction). Naive OpenMP/CUDA implementation, but integrated in the kernel generator for OpenCL. (4) Full integration of the micro-scheduler and the generator. Needless to say, this seems overly ambitious! I had done a prototype for (1), but realized quickly that it would be pretty complicated to make it stable and robust with respect to devices, contexts, etc. Plus, the generator now gives the same (DENSE!) performance as cuBLAS on NVIDIA GPUs (for Fermi, at least), and clAmdBlas on AMD GPUs. Linking could allow us to have very good performance on OpenMP/CUDA, as well as sparse linear algebra on OpenCL. This is interesting, but it is also a good amount of work! (2) will also require a huge amount of work. Plus, I think it is dangerous to do that when we're not even sure of how we handle ViennaCL on a single device (considering input-dependent kernels, for example). I'd say we should postpone this. I'll do (3). It's not a lot of work and the kernel generator already supports it. We just need to add an API. (4) is where I've spent and will spend most of my time. The kernel generator is now fully integrated for all the vector operations, all the matrix-vector operations (except rank-1 updates) and most of the dense matrix operations (all but LU, FFT, in-place triangular substitution). While the database is not populated yet, recent benchmarks suggest very good performance (like cuBLAS on a GTX 470, and 80% of the peak on an R9 290X). I think it is necessary to push forward in this direction, and make ViennaCL 1.6 a BIG DATA BIG DATA BIG DATA BIG DATA performance-based release. I've been very motivated to work on the kernel generator recently, and simply don't feel like working on (1) or (2) at the moment. Now, there are two different options for (4): 4.1 - Implementing the kernel fusion mechanism inside the scheduler. 4.2 - Input-dependent kernels, and performance prediction. While I could help with 4.1, I don't feel like I could do this task alone, because I don't have sufficient knowledge of the backend. Plus, it implies getting rid of op_executor(), and I'm not sure how I could do this, either! I feel operational, though, for 4.2. I feel like ViennaCL 1.6 should be a performance-oriented release, and having an (input+device)-dependent kernel selection mechanism is something we have to do! Any thoughts on how the roadmap could/should be rearranged? Philippe
Re: [ViennaCL-devel] ViennaCL 1.6 Roadmap
Hi, 2014-07-08 20:59 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi Philippe, Looking at the roadmap: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap argl, I forgot to update this after our IRC meeting. The protocol here defines features for 1.6.0 which are far more reasonable: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Developer-Meetings I was concerned with 4 elements: (1) Hook in external BLAS libraries and use them as a computing backend (2) Distributed vectors and matrices (multiple devices, possibly mixed CUDA/OpenCL/OpenMP) (3) Support for reductions (vector reduction, row-wise reduction, col-wise reduction). Naive OpenMP/CUDA implementation, but integrated in the kernel generator for OpenCL. (4) Full integration of the micro-scheduler and the generator. Needless to say, this seems overly ambitious! I had done a prototype for (1), but realized quickly that it would be pretty complicated to make it stable and robust with respect to devices, contexts, etc. Plus, the generator now gives the same (DENSE!) performance as cuBLAS on NVIDIA GPUs (for Fermi, at least), and clAmdBlas on AMD GPUs. Linking could allow us to have very good performance on OpenMP/CUDA, as well as sparse linear algebra on OpenCL. This is interesting, but it is also a good amount of work! We postponed that and instead agreed to focus on the full scheduler integration. (2) will also require a huge amount of work. Plus, I think it is dangerous to do that when we're not even sure of how we handle ViennaCL on a single device (considering input-dependent kernels, for example). I'd say we should postpone this. Certainly postpone this. Today I got notice that we will have funding for a PhD student working on this. It's still hard to find a good candidate, but at least we have the funding now ;-) I'll do (3). It's not a lot of work and the kernel generator already supports it. We just need to add an API. Today there was a user requesting this on sourceforge. I'll also have time in the next days to work on this, but since you volunteered for it, I'll go for the iterative solver optimizations first. (4) is where I've spent and will spend most of my time. The kernel generator is now fully integrated for all the vector operations, all the matrix-vector operations (except rank-1 updates) and most of the dense matrix operations (all but LU, FFT, in-place triangular substitution). While the database is not populated yet, recent benchmarks suggest very good performance (like cuBLAS on a GTX 470, and 80% of the peak on an R9 290X). I think it is necessary to push forward in this direction, and make ViennaCL 1.6 a BIG DATA BIG DATA BIG DATA BIG DATA performance-based release. I'll help with stripping the op_executor beast, so that everything interfaces the scheduler directly. Philippe, did you by chance check the impact of the generator integration on kernel latency? We only have a 1-10us margin to work with, which I haven't checked yet. Don't worry about the overhead. It used to be fine. I'll re-check to see whether everything is still fine, but when the program name and the kernel name prefix are known in advance (i.e. for the pre-compiled programs), I don't see where a significant overhead could come from! I'll benchmark this ASAP, once some other modifications are done. I've been very motivated to work on the kernel generator recently, and simply don't feel like working on (1) or (2) at the moment. Now, there are two different options for (4): 4.1 - Implementing the kernel fusion mechanism inside the scheduler.
4.2 - Input-dependent kernels, and performance prediction. While I could help with 4.1, I don't feel like I could do this task alone, because I don't have sufficient knowledge of the backend. Plus, it implies getting rid of op_executor(), and I'm not sure how I could do this, either! I feel operational, though, for 4.2. I feel like ViennaCL 1.6 should be a performance-oriented release, and having an (input+device)-dependent kernel selection mechanism is something we have to do! I think we should not go for 4.1 with a 1.6.0 release, simply because it would delay the release cycle. We should provide features to our users fairly quickly after they are stabilized, not have them hanging around in the developer repository for too long. We have enough features for 1.6.0 already ;-) Some work from your side on 4.2 would be good, so if you have some resources left, please focus on that. Sure. 4.2 is part of my (future) PhD work, so I can't expect to have everything working flawlessly for ViennaCL 1.6.0. But I feel like I should be able to create the backbone for this release: a simple environment-variable-based mechanism that points to a folder containing the files spit out by the Python auto-tuner. I'd like an environment-variable-based extension, as it can be easily exploited by advanced users in C++, and
Re: [ViennaCL-devel] GEMM broken in Nightly builds
Hey, After some investigation it looks like the problem is not with the GEMM kernel but with the way the kernel is enqueued. It fails when A and B are associated with the same handle in C = alpha*op(A)*op(A) + beta*C... (this handle-checking feature is there to allow for some optimizations in other kernels, such as those which compute complicated elementwise functions: y = element_prod(element_exp(x), x)). This bug seems simple to fix at first sight, but it's going to be hard to provide a good fix. I could tell the kernel to ignore the handle values for GEMM and always consider C = alpha*op(A)*op(B) + beta*D, but then it prevents me from doing pointer arithmetic with C, and it'll have an influence on register usage. I'll have to implement a GEMM-specific handle-binding policy. I'll try to do it ASAP, but it may not come before tomorrow since I have other things to do today, like getting the database ready for Toby's benchmarks... It looks like blas3_prod-test-opencl fails with a timeout on centos5. This is just SO strange. Philippe 2014-07-07 11:38 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, our Nightly tests report new issues with some examples, which are probably all due to GEMM: http://viennastar.iue.tuwien.ac.at/CDash/index.php?project=ViennaCL (also look at the previous day) Philippe, I see a bunch of recent commits. Is it possible that this got fixed in the meanwhile? Best regards, Karli
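To illustrate the aliasing case described above (a hypothetical snippet; the surrounding setup is mine, not the actual failing test):

    // Both operands of the product are the same object, so both kernel
    // arguments refer to the same underlying OpenCL memory handle:
    viennacl::matrix<float> A(N, N);
    viennacl::matrix<float> C(N, N);
    C = viennacl::linalg::prod(A, viennacl::trans(A)); // C = op(A)*op(A)
    // An argument-binding scheme that deduplicates buffers by handle then
    // binds one buffer where the GEMM kernel expects two distinct arguments.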
Re: [ViennaCL-devel] GEMM broken in Nightly builds
Until this is fixed, I have disabled the use of the generator for GEMM. 2014-07-07 15:00 GMT+02:00 Philippe Tillet phil.til...@gmail.com: Hey, After some investigation it looks like the problem is not with the GEMM kernel but with the way the kernel is enqueued. It fails when A and B are associated with the same handle in C = alpha*op(A)*op(A) + beta*C... (this handle-checking feature is there to allow for some optimizations in other kernels, such as those which compute complicated elementwise functions: y = element_prod(element_exp(x), x)). This bug seems simple to fix at first sight, but it's going to be hard to provide a good fix. I could tell the kernel to ignore the handle values for GEMM and always consider C = alpha*op(A)*op(B) + beta*D, but then it prevents me from doing pointer arithmetic with C, and it'll have an influence on register usage. I'll have to implement a GEMM-specific handle-binding policy. I'll try to do it ASAP, but it may not come before tomorrow since I have other things to do today, like getting the database ready for Toby's benchmarks... It looks like blas3_prod-test-opencl fails with a timeout on centos5. This is just SO strange. Philippe 2014-07-07 11:38 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, our Nightly tests report new issues with some examples, which are probably all due to GEMM: http://viennastar.iue.tuwien.ac.at/CDash/index.php?project=ViennaCL (also look at the previous day) Philippe, I see a bunch of recent commits. Is it possible that this got fixed in the meanwhile? Best regards, Karli
Re: [ViennaCL-devel] Implementation of multi_inner_prod
Ok, thanks! This sounds reasonable indeed. Philippe 2014-06-26 23:51 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, the cases 5, 6, and 7 are handled by running a kernel for four vectors, then subtracting 4 and running a dedicated kernel on the remaining 1, 2, or 3 vectors. This could also be handled by a generated kernel, yes, but I haven't implemented this for two reasons: 1. fewer kernels to compile 2. less implementation effort One single kernel is not possible for an arbitrary number of vectors. Eight vectors turned out to be a reasonable upper bound because the overhead is less than 12.5% over the ideal case already, but at the same time the kernel still works for older GPUs with limited amounts of shared memory. Best regards, Karli On 06/26/2014 11:09 PM, Philippe Tillet wrote: I'll add something. I assume that multiple kernels are launched thanks to current_index. Wouldn't it be better to launch one single kernel? I think that a lot of users would prefer to have better performance in exchange for a perhaps slightly longer JIT overhead (since we'll provide a caching mechanism). Philippe 2014-06-26 23:07 GMT+02:00 Philippe Tillet phil.til...@gmail.com mailto:phil.til...@gmail.com: Hello! I noticed this in the implementation of multi_inner_prod: switch (vec_tuple.const_size() - current_index) { case 7: case 6: case 5: case 4: // do stuff However, there is a test for 5, 6, 7, so I assume that these have to be implemented somehow. Could I have more details on why there is no specific kernel for these three cases? NB: This is the very last thing that has to be done before I can push the new device-specific OpenCL backend. All the tests pass except multi_inner_prod for tuple_size = 5. :) Philippe
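The dispatch Karl describes boils down to the following chunking loop (an illustrative sketch; the run_kernel_* helpers are hypothetical stand-ins for the dedicated kernels):

    // Process four vectors at a time; cases 5, 6, and 7 thus reduce to
    // 4 + {1, 2, 3}, and only remainders of 1, 2, or 3 vectors need
    // dedicated kernels of their own.
    std::size_t current_index = 0;
    while (vec_tuple.const_size() - current_index >= 4)
    {
      run_kernel_4(vec_tuple, current_index); // hypothetical helper
      current_index += 4;
    }
    switch (vec_tuple.const_size() - current_index)
    {
      case 3: run_kernel_3(vec_tuple, current_index); break;
      case 2: run_kernel_2(vec_tuple, current_index); break;
      case 1: run_kernel_1(vec_tuple, current_index); break;
      default: break;
    }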
Re: [ViennaCL-devel] PyViennaCL midterm
Hi, Unfortunately I won't be available until Tuesday for a meeting. Python and CUDA-based libraries are widely used by the machine learning community. I also want to push OpenCL forwards, but supporting CUDA through PyViennaCL would be a very good thing to do, since a lot of researchers think that CUDA is faster (or will at least want to compare). Philippe 2014-06-26 18:57 GMT+02:00 Toby St Clere Smithe m...@tsmithe.net: Karl Rupp r...@iue.tuwien.ac.at writes: fine with me. My schedule is very much in flux in the next ~10 days or so, so I might be unavailable on short notice. Rather than having one big IRC meeting with all topics crushed together, I suggest we have a couple of smaller topic-oriented meetings. To start out with, I suggest two topics+dates: If it's inconvenient, I can wait! * Friday, June 27, 18:00 UTC on PyViennaCL But, if not, this time is fine for me :) We can also use VoIP technology in addition to IRC to speed things up if desired. Toby, do you have questions for Andreas on PyOpenCL or PyCUDA? I'm in contact with him regarding the best way to deal with padding of ViennaCL vectors / matrices. For my manual tests, I work with very small dimensions (to make it easy to spot errors), but of course these are padded out. So if I have a 3x3 matrix, I (say) really have a 128x128 matrix, with a lot of 0 entries. I have a convenience function which takes a matrix or vector to a PyOpenCL 'Array' object (which is just an array-like wrapper around an OpenCL buffer). Clearly, the underlying PyOpenCL buffer object should point to the whole 128x128 memory, but should I have the Array only expose the 3x3 matrix (like ViennaCL does), or the whole padded matrix? What I'm working on right now is having the Array work more like the ViennaCL matrix, so that if the user wants to print out the object (for example), they only get the non-padded entries. It's easy to get the whole buffer from an Array, and of course if the user wants to work with it in OpenCL, then they'll need to deal with the padding... So perhaps I should leave the padding visible when the user takes a ViennaCL object to a PyOpenCL one, to make it obvious? There are a couple of bits of API that need some more work. For instance, with regard to structured matrices, the Vandermonde matrix is missing (I had some API incompatibility with my current code which needs looking at), and there is some thinking to be done about operations involving two structured matrices; the logic for computing the result type needs work here. Bandwidth reduction is also missing, because I don't have a PyViennaCL wrapper type for the std::vector<std::map<T, U> > type used here (is generic sparse support on the agenda? Does it matter right now?). I also want to add support for casting between numeric types, as we were discussing earlier. By and large, however, the body of work here is done, and the remaining bits shouldn't take more than an afternoon. A summer student will soon extend the FFT to the multi-backend case, which should also make the structured matrices available for multiple backends. As for the std::vector<std::map<T, U> > type: Are there any standard sparse matrix formats used in Python/NumPy/etc.? Anything in e.g. CSR format? I think it makes most sense to provide convenience conversions for these and not worry about std::vector<std::map<T, U> >. Well, I want to spend some time investigating bringing the PyViennaCL and SciPy sparse types closer together. But of course PyViennaCL supports the ViennaCL compressed / coordinate / etc. types right now.
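For readers unfamiliar with the padding under discussion: ViennaCL distinguishes the logical size of a matrix from the padded size of its underlying buffer. A small illustration (the concrete padded sizes depend on the configuration; 128 echoes the example above):

    viennacl::matrix<float> A(3, 3);
    // Logical size, as exposed to the user:
    //   A.size1() == 3, A.size2() == 3
    // Padded size of the underlying buffer (e.g. rounded up to 128):
    //   A.internal_size1(), A.internal_size2()
    // Entries outside the logical 3x3 block are zero and must stay zero.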
A similar situation arises in the context of supporting multiple back-end platforms. I have implemented a generic Context object, and it is now possible to construct any PyViennaCL type in any given Context (for instance, in host memory, or on a specific OpenCL device with a given OpenCL queue). I'd like to give credit to Andreas for his work on PyOpenCL, which has made my life here fairly easy. Meaningful exceptions are raised if you try to execute an operation involving objects in different contexts. Perfect! Did you stumble over any problems in which the context isn't used correctly? There may be some corner cases where this isn't handled correctly, so please don't hesitate to report if you run into issues. Hmm. As far as I'm aware, the only failures I've seen so far have been my fault: for instance, if I try to do A+B with A and B having contexts on different devices or with different associated queues. I now test the vcl::ocl::context equality operator and throw an exception if I get false (or if A and B have different memory domains entirely, etc.). My next job is to write a simple example involving a custom PyOpenCL kernel interacting with ViennaCL objects and operations, which I hope to have by the end of the week. Subsequently, I need to prepare some simple benchmarks and my paper for
Re: [ViennaCL-devel] PyViennaCL midterm
I'll be available from Tuesday afternoon on. What about Wednesday, 13:00 UTC and 15:00 UTC? 2014-06-27 18:30 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, Unfortunately I won't be available until Tuesday for a meeting. Python and CUDA-based libraries are widely used by the machine learning community. I also want to push OpenCL forwards, but supporting CUDA through PyViennaCL would be a very good thing to do, since a lot of researchers think that CUDA is faster (or will at least want to compare). Okay, if you're unavailable until Tuesday then we just postpone it until next week (both topic meetings). Let's use the time to arrange good time slots: Suggestions? Best regards, Karli
[ViennaCL-devel] Implementation of multi_inner_prod
Hello! I noticed this in the implementation of multi_inner_prod: switch (vec_tuple.const_size() - current_index) { case 7: case 6: case 5: case 4: // do stuff However, there is a test for 5, 6, 7, so I assume that these have to be implemented somehow. Could I have more details on why there is no specific kernel for these three cases? NB: This is the very last thing that has to be done before I can push the new device-specific OpenCL backend. All the tests pass except multi_inner_prod for tuple_size = 5. :) Philippe
Re: [ViennaCL-devel] Behavior of norm_* on vector<int>
Hey, 2014-06-24 12:29 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hey, If yes, I think that it should be changed, because this easily violates the axioms of a norm: we can have norm(alpha*v) != |alpha|*norm(v) because of the rounding. This will usually be the case even if we change it. There are good reasons why Clang emits warnings when using != or == for floating point comparisons ;-) I know that ;) But I'd say that the error we make for norm_2 using float is still stable. For integers, I doubt it is :p What do you consider as 'stable' here? Even integer results can be stable in the sense that your error will be less than one in modulus ;-) Let's not talk too much about this, we'll worry about machine epsilons later :P I think that norm_*(vector<int>) should be changed to float norm_*(vector<int>). Any thoughts? There is no need to change anything for norm_1 and norm_inf. So, the only relevant implementation case is norm_2, for which ublas uses the same type convention we use now (at least that's what I found when I looked it up). Although a floating point return type is probably more often desired than an integer type, it would certainly complicate the implementation. Moreover, it would introduce inconsistency, which I'm not very fond of. The other thing, of course, is that it complicates the implementation considerably (which floating point type to return? float is not great in terms of precision, but double may not be available on the GPU...). I'm open to using a different approach than what we have now, but I'd like to hear solid arguments in favor of a change ;-) Well, this implementation problem already exists! The sqrt() function only takes float/double as input (except C++11's sqrt, which casts to double). As a result, norm_2() is actually disabled for integers (I had not noticed this in my first e-mail) ;). It makes a lot of sense to disable it, indeed. This leads to another question, though. Should we add to the todo list some casting operators such as viennacl::norm_2(viennacl::cast<float>(v_int))? For OpenCL, these can be easily handled by the generator. Hmm, do we have enough use cases for this? I'd rather handle this through an explicit conversion step, i.e. viennacl::vector<float> v_float(v_int); viennacl::norm_2(v_float); instead of integrating more complexity into how we handle the various operations. The explicit conversion can be provided in a controlled manner (the number of any-to-any conversions is still manageable), whereas the introduction of casting would immediately blow up the possible input combinations for *all* the operations. For example, the operation x = y + z with the current requirement of the same scalar types for x, y, and z results in ~10 different input combinations (four signed integer types, four unsigned types, two floating point types). If we allow arbitrary combinations through casting, this would make it 1000 combinations. OpenCL would certainly help with jit-ting this, but compilation times for CUDA and OpenMP would certainly explode if we want to provide these operations through a shared library interface later. In contrast, an explicit any-to-any conversion can be covered with 10x10=100 kernels, ultimately resulting in the same functionality for our users. I expect that the lower performance due to an explicit conversion/duplication is negligible in typical use cases. This sounds reasonable indeed.
I need a casting operation_node_type in the generator to control explicit casting within a generated kernel, but it sounds very reasonable to only allow such constructors indeed. For example, max(int, abs(int)) won't compile in OpenCL, because abs returns an unsigned int, therefore leading to an ambiguous call of max. So I have to modify the expression tree a bit to invoke the generator with max(int, (int)abs(int)). Now, adding an interface to allow x = viennacl::element_cast<float>(y) + z for OpenCL only seems superfluous indeed. There are not enough use cases to justify an interface divergence. However, having a very explicit constructor viennacl::vector<float> x(viennacl::element_cast<float>(x_int)); may be better than viennacl::vector<float> x(x_int); especially when a vector is constructed using operator=. I can add this to the todo list for ViennaCL 1.6. It is a great feature which could permit some great bandwidth savings as well as some mixed-precision implementations. Philippe Best regards, Karli
Re: [ViennaCL-devel] Are {op_row, op_diag, op_column} unary or binary?
I want to be more precise, actually: the problem I see with this is that, mathematically, diag or row remain unary operators. The fact that it is binary has to do with our implementation. So, actually, I think that we should never stick to the mathematical meaning of the operators (even though sometimes the two may coincide). Philippe 2014-06-17 10:29 GMT+02:00 Toby St Clere Smithe m...@tsmithe.net: Hey Philippe, Philippe Tillet phil.til...@gmail.com writes: The integration of the generator is going on slowly but safely. Vector kernels are fully integrated and I'm about to support some matrix kernels as well (excluding FFT, LU, and a few others). I have one metaphysical question, though. There are two possible interpretations for op_row, for example. Either we consider it a unary operator because it acts on a single matrix, or we consider it a binary operator because it also requires a uint as its RHS. I'm leaning towards the latter interpretation; I'd be glad to have your opinion on this! I'd go with binary, for exactly your reason, I think. Cheers, Toby
Re: [ViennaCL-devel] Benchmark GUI warmup
Hey Namik, 2014-05-06 19:43 GMT+02:00 Namik Karovic namik.karo...@gmail.com: Hello, Apologies for not replying earlier; I've been quite busy these last two days. Don't worry ;) So far I have been exploring the advantages/disadvantages of using QML/QtQuick vs. a traditional widget-based GUI. QML has some great design features that could improve the overall user experience and aren't easily implemented when using widgets. I was originally planning to develop some parts using QML (animations and charts) and integrate them with the main widget-based GUI. However, I am now exploring the possibility of doing the entire GUI in QML. Suggestions on which approach to choose are welcome. Unfortunately, I don't know much about Qt, so I probably couldn't help here. However, keep in mind that we aim for maximum portability. I would tend to think that QML is more portable across languages, and so I would say go for it, as long as you don't lose portability elsewhere. Reading through your discussion about expert benchmark settings, I see that I probably should have spent more time studying the autotuner and benchmark codes :/ I understand that there is a great need for expert benchmark customization, and I hope to succeed in making that part as detailed as possible, but there should be a certain limit to the extent of the details. What I'm saying is I'd rather not spend time developing features that will be used only a couple of times. Surely there are some details that aren't of critical importance? It would be great if you guys could agree on which expert details are of greatest priority. I'm going to start studying the autotuner and benchmark codes so I can better understand what needs to be done. I think that the most important part of the project is the intuitiveness/functionality of the GUI. Keep in mind that most of your userbase will have a limited amount of time, and that anything beyond double-click+coffee break will probably be ignored ;) I really believe that there is no need to read any code related to the auto-tuner, as it is disappearing. Re-implementing an exhaustive search for one particular size for the GUI will not be a huge challenge, so don't worry too much about it. This thread is exclusively dedicated to possible features in the expert tab, which is not a priority for now (but it's still good to have some mid-term perspective when starting a project). That being said, I believe that the basic options should include: - Benchmarking of as many routines as possible: BLAS, FFT, solvers, etc. - Simple exhaustive-search auto-tuning for whatever supports it: what could this hardware ideally give on this problem? - Export of the benchmark results to an open database. I don't think you should worry about anything else as of now. I'll be working rather actively on a command-line interface to some advanced auto-tuning features. Philippe Best regards, Namik On Tue, May 6, 2014 at 9:38 AM, Karl Rupp r...@iue.tuwien.ac.at wrote: Hi, Why is the data pointless? I'd rather have only a few datapoints on new hardware out there rather than having absolutely no data at all. I mean, the data is pretty useful because it tells us about the best default kernel for large square matrices, but it is not very useful if we want to build a general input-dependent model, as that requires, in my experience, more than 1000 data points. This is true. So this calls for a hierarchical approach: Level 1: Just a couple of known kernels for a given data size, which are compared on the target machine.
Level 2: A full tuning set for one data size on the target. Level 3: All ~1000 points for building a model. Execution times between these levels vary significantly: while almost all users will go through Level 1 anyway, only a few will have the patience to wait for results on Level 2. Level 3 will be mostly for us, to have a 'normalized' process for building performance models. Either way, if others join (the machine learning community?), that would be great! I'd rather refrain from running Python scripts from the benchmark GUI. This is intended to be an end-user tool. Those interested in running from Python should take the Python code (i.e. PyViennaCL) directly. Are you sure? It would not take a lot of effort to have an optional way to call the Python script with the proper arguments from the auto-tuner, as long as the user provides the path and has all the necessary dependencies. The second half of the last sentence is the problem. I expect 80% of users to run on Windows, where anything but a 'double click installer' is a non-standard process. If Namik has time left by the end of the summer, we can look into that, but we first need to focus on our target audience. Such cases are probably only interesting for the 'expert settings' tab in the GUI, as these parameters only make sense to people who *really* know what
Re: [ViennaCL-devel] Benchmark GUI warmup
Hi, 2014-05-06 9:38 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, Why is the data pointless? I'd rather have only a few datapoints on new hardware out there rather than having absolutely no data at all. I mean, the data is pretty useful because it tells us about the best default kernel for large square matrices, but it is not very useful if we want to build a general input-dependent model, as that requires, in my experience, more than 1000 data points. This is true. So this calls for a hierarchical approach: Level 1: Just a couple of known kernels for a given data size, which are compared on the target machine. Level 2: A full tuning set for one data size on the target. Level 3: All ~1000 points for building a model. Execution times between these levels vary significantly: while almost all users will go through Level 1 anyway, only a few will have the patience to wait for results on Level 2. Level 3 will be mostly for us, to have a 'normalized' process for building performance models. Either way, if others join (the machine learning community?), that would be great! I'd rather refrain from running Python scripts from the benchmark GUI. This is intended to be an end-user tool. Those interested in running from Python should take the Python code (i.e. PyViennaCL) directly. Are you sure? It would not take a lot of effort to have an optional way to call the Python script with the proper arguments from the auto-tuner, as long as the user provides the path and has all the necessary dependencies. The second half of the last sentence is the problem. I expect 80% of users to run on Windows, where anything but a 'double click installer' is a non-standard process. If Namik has time left by the end of the summer, we can look into that, but we first need to focus on our target audience. I think you're right. Namik's GUI should only provide Levels 1 and 2, which do not require any Python. Since Level 3 would be an internal tool, as you correctly pointed out, we could stick to a Python command-line interface, or a rudimentary PyQt GUI. Philippe Such cases are probably only interesting for the 'expert settings' tab in the GUI, as these parameters only make sense to people who *really* know what they are doing (and are willing to invest the time). For bloggers, journalists, etc., who just want to quickly get some performance datapoints for the very latest hardware, this is usually not of interest. We need to focus on serving the main audience first and then watch out for fruitful directions in which to extend it further. Of course! I've been referring to the expert settings tab from the beginning :) Ah, please say so :-) Best regards, Karli
Re: [ViennaCL-devel] Benchmark GUI warmup
Hi, 2014-05-05 9:18 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, (CC-ing viennacl-devel, as this is developer-talk ;-) ) Either way, I want to let you know that the generator/auto-tuner is undergoing significant changes, and that you will, actually, not have to worry about it for your GSoC project. The generator will be used transparently via the viennacl::linalg:: functions, and the auto-tuner will be entirely moved to pyviennacl. Well, I think this is not entirely unrelated. The purpose of the GUI is still to allow a broader community to feed us with benchmark data, so somehow the loop over all possible configurations is still essential. With an interface to Python I assume that an API to do exactly that will still be available ;-) Well, looping over all the possible configurations for one particular problem size is good for benchmarking purposes only; the data generated this way will not be re-usable unless we can make some assumptions on the input-data size. That is, if the GUI only auto-tunes GEMV/GEMM for large square matrices, then we will collect a lot of pointless data. Instead, the GUI should export a model which, given some input data sizes and a hardware configuration, is able to predict the optimal kernel. This is why the auto-tuner is being moved to pyviennacl. However, the GUI could/should indeed still be able to execute the corresponding Python scripts. There is, however, one additional point I'd like to discuss. The performance of all the algorithms you'll have to benchmark is highly dependent on the characteristics of the input data. For example, matrix products will behave very differently according to the size/shape of the input matrices. This is very important: it means that a good benchmarking GUI could help the users design their system. Here's an example. Suppose that someone wants to solve the linear system A x = y. If, for his particular application, A is a 50,000x50,000 sparse matrix, then he could be greatly interested in knowing how he could pad A to achieve better performance. In that case, the benchmarking GUI could randomly explore R^2 beyond (50,000; 50,000), and potentially tell the user that, if he makes A a (50,500; 50,500) matrix, then he could improve his performance by, say, 10 or 20%. For sparse matrices I don't believe in random patterns. The user usually has a particular application in mind, so I consider it more important to a) allow users to feed the tuner with their own sparse matrix, and b) allow users to select sparse matrices from the Florida matrix market. The second option is important for benchmark purposes and for comparison with data in the literature. We can also add a third option for random matrices, but it's certainly far less important. We could also try to describe a sparse matrix by a few parameters (number of rows/cols, format, sparsity pattern, etc.) and use machine learning to predict the optimal kernel given an arbitrary sparse matrix. For the training data, we could use the Florida matrix market, indeed. In the case of dense matrix products, one may even be able to double his performance by slightly altering the size of the input matrices. Okay, this is only about adjusting the padding parameter and should be transparently included in the tuning process anyway, shouldn't it? This is not exactly what I meant. Suppose that someone wants to compute the dense matrix product A*B, where A is in R^{238 x 2031} and B is in R^{2031 x 1240}.
Then, the auto-tuner should indeed find the optimal padding size, and A and B would be transparently padded to multiples of 128: {256, 2048} and {2048, 1280}. However, for some reason, using matrices of size {256, 2176} and {2176, 1280} may be worth it for SGEMM (but not for DGEMM), because 2048 could trigger a lot of bank conflicts. Similarly, one might fall on a sweet spot of his GPU for {256, 2560}x{2560, 1408}. I don't think that ViennaCL should handle this. I can think of some applications in the field of artificial neural networks, where one may want to resize the layers of his neural network so as to fall on some sweet spots of his GPU. Philippe Best regards, Karli
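For concreteness, the transparent padding mentioned above is just a round-up to the next multiple of the block size (a trivial sketch; 128 is the example value used in this thread):

    #include <cstddef>

    // Round a dimension up to the next multiple of 'multiple'.
    std::size_t pad(std::size_t n, std::size_t multiple = 128)
    {
      return ((n + multiple - 1) / multiple) * multiple;
    }
    // pad(238) == 256, pad(2031) == 2048, pad(1240) == 1280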
Re: [ViennaCL-devel] Benchmark GUI warmup
Hi hi, 2014-05-05 21:49 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, Well, I think this is not entirely unrelated. The purpose of the GUI is still to allow a broader community to feed us with benchmark data, so somehow the loop over all possible configurations is still essential. With an interface to Python I assume that an API to do exactly that will still be available ;-) Well, looping over all the possible configurations for one particular problem size is good for benchmarking purposes only; the data generated this way will not be re-usable unless we can make some assumptions on the input-data size. The data is reusable, of course, assuming that one knows the matrix sizes it has been obtained for. That is, if the GUI only auto-tunes GEMV/GEMM for large square matrices, then we will collect a lot of pointless data. Why is the data pointless? I'd rather have only a few datapoints on new hardware out there rather than having absolutely no data at all. I mean, the data is pretty useful because it tells us about the best default kernel for large square matrices, but it is not very useful if we want to build a general input-dependent model, as that requires, in my experience, more than 1000 data points. Instead, the GUI should export a model which, given some input data sizes and a hardware configuration, is able to predict the optimal kernel. This is why the auto-tuner is being moved to pyviennacl. However, the GUI could/should indeed still be able to execute the corresponding Python scripts. I'd rather refrain from running Python scripts from the benchmark GUI. This is intended to be an end-user tool. Those interested in running from Python should take the Python code (i.e. PyViennaCL) directly. Are you sure? It would not take a lot of effort to have an optional way to call the Python script with the proper arguments from the auto-tuner, as long as the user provides the path and has all the necessary dependencies. For sparse matrices I don't believe in random patterns. The user usually has a particular application in mind, so I consider it more important to a) allow users to feed the tuner with their own sparse matrix, and b) allow users to select sparse matrices from the Florida matrix market. The second option is important for benchmark purposes and for comparison with data in the literature. We can also add a third option for random matrices, but it's certainly far less important. We could also try to describe a sparse matrix by a few parameters (number of rows/cols, format, sparsity pattern, etc.) and use machine learning to predict the optimal kernel given an arbitrary sparse matrix. For the training data, we could use the Florida matrix market, indeed. I agree with this approach. Everything is better than using a fixed work group size as we do now (even though this is how other libraries deal with the problem as well). In the case of dense matrix products, one may even be able to double his performance by slightly altering the size of the input matrices. Okay, this is only about adjusting the padding parameter and should be transparently included in the tuning process anyway, shouldn't it? This is not exactly what I meant. Suppose that someone wants to compute the dense matrix product A*B, where A is in R^{238 x 2031} and B is in R^{2031 x 1240}. Then, the auto-tuner should indeed find the optimal padding size, and A and B would be transparently padded to multiples of 128: {256, 2048} and {2048, 1280}.
However, for some reason, using matrices of size {256, 2176} and {2176, 1280} may be worth it for SGEMM (but not for DGEMM), because 2048 could trigger a lot of bank conflicts. Similarly, one might fall on a sweet spot of his GPU for {256, 2560}x{2560, 1408}. I don't think that ViennaCL should handle this. I can think of some applications in the field of artificial neural networks, where one may want to resize the layers of his neural network so as to fall on some sweet spots of his GPU. Such cases are probably only interesting for the 'expert settings' tab in the GUI, as these parameters only make sense to people who *really* know what they are doing (and are willing to invest the time). For bloggers, journalists, etc., who just want to quickly get some performance datapoints for the very latest hardware, this is usually not of interest. We need to focus on serving the main audience first and then watch out for fruitful directions in which to extend it further. Of course! I've been referring to the expert settings tab from the beginning :) Philippe Best regards, Karli
Re: [ViennaCL-devel] OpenCL C++ API
Hi, 2014-04-29 15:59 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, So I can't help but bring up this topic :) Is there any reason why we're using the OpenCL C API instead of the C++ one? Yes, the reason is simple: the C++ API was standardized quite some time *after* the development of ViennaCL started. It seems like we could save several thousands of lines of code (and gain a lot of clarity) by using the C++ API directly. Well, I'm not so sure about that, actually. I'd more conservatively estimate it in the range of hundreds. Actually, I think that most of the code bloat comes from the clGet*Info() calls. There are dozens of methods in ocl::device to handle this, while it could all be handled using a single properly templated method. I notice that I already did that work in viennacl/ocl/infos.hpp, but that it is somehow not used. Is there any drawback to using viennacl::ocl::info<CL_DEVICE_NAME>(viennacl::ocl::device const & d); instead of viennacl::ocl::device::name()? Perhaps we could rename it so that the call becomes viennacl::ocl::info<VIENNACL_DEVICE_NAME>(viennacl::ocl::device const &), so that VIENNACL_DEVICE_NAME could be used with CUDA/OpenMP devices too, if ever needed. Of course, we would keep the ocl::device/context/whatever. I see a couple of reasons why we would want to do that: - For now, we have to use VIENNACL_ERR_CHECK(err) after each internal call to the C API. Since it's cumbersome, I've noticed that there are a lot of places in the code where we just ignore some error checking for the sake of clarity. If we used the C++ API, then we could use C++ exceptions internally and wouldn't have to bother with all this error checking. As you certainly know, I'd like the ViennaCL core to open up to other languages with version 2.0.0, eventually making it a C shared library with a C++ layer on top. Introducing C++ exception-based error handling at this stage looks like a step in the wrong direction under these long-term perspectives. Oh, yes, that's right. ViennaCL is already using an exception-based mechanism, but it would of course be easier to change it ourselves if we ever need to. - It seems like the list of ViennaCL exceptions is not up to date. If we forget to handle a case, then we get ocl::unknown_error, while the error is properly handled by the OpenCL C++ API. Which cases aren't covered now? I have just noticed that these may be extensions, such as CL_PLATFORM_NOT_FOUND_KHR. I'll look into that. - We could save literally thousands of lines of code. I'm ready to bet that it could relieve the compiler and reduce the compilation times. Hundreds ;-) I bet against your bet that compilation times reduce notably, for two reasons: - the OpenCL C++ API just adds yet another type hierarchy - most of the compilation time is spent in other parts of the library. However, feel free to prove me wrong ;-) It's true, most of the time is spent in template instantiation rather than parsing. I remember having benchmarked it on an older version of ViennaCL. I think that 1000 lines could be saved by using viennacl::ocl::infos instead of viennacl::ocl::{device|program|kernel|chocolatebar}::*(). - It would be easier to maintain. There is a whole community bugfixing the cl.hpp file, and we are more likely to have bugs in our C calls than in our C++ calls. How do you intend to deal with bugs or warnings obtained from cl.hpp? We can handle all bugs and warnings in our code, but we might have a hard time dealing with warnings or even errors in older versions of cl.hpp.
Yes, it won't be noticed by 99% of our users, but I also care about the remaining one percent. We actually had a warning in cl.h recently (or was it in cl.hpp?), didn't we? :P So we might have to deal with warnings in cl.h as well. I agree that we don't want a riot with angry people shouting "We are the 1%". - It doesn't add any external dependency. True. - I think that when we need to deal with more complicated things such as multiple devices, we'll gain in productivity and robustness by using the C++ API. Robustness might be true. In which way are we going to gain in productivity? I understand that you may gain in productivity when dealing with the generator, but most other parts of ViennaCL are sitting on top of a backend-agnostic layer, not getting in touch with OpenCL directly. So the question is whether you consider the additional effort of replacing the current use of the C API worth it. Actually, I came across a lot of inconveniences while plugging a simple caching mechanism into viennacl::ocl::context::add_program(). It's not at all about the generator ;) The inconveniences, however, had to do with unintegrated use of STL vectors and cumbersome clGet*Info calls. Adding methods would not have solved the problem, because the corresponding ViennaCL objects were not created at this point (but the cl handles were).
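As an illustration of how a single templated method can subsume dozens of per-property accessors (a hypothetical sketch in the spirit of viennacl/ocl/infos.hpp, whose actual implementation presumably differs; only string-valued properties are handled, and error checking is omitted):

    #include <CL/cl.h>
    #include <cstddef>
    #include <string>

    // One templated wrapper around clGetDeviceInfo instead of one method
    // per property; a production version would check the returned cl_int
    // of each call.
    template <cl_device_info Param>
    std::string device_info_string(cl_device_id dev)
    {
      std::size_t bytes = 0;
      clGetDeviceInfo(dev, Param, 0, NULL, &bytes);          // query size
      std::string result(bytes, '\0');
      clGetDeviceInfo(dev, Param, bytes, &result[0], NULL);  // fetch value
      if (!result.empty() && result[result.size() - 1] == '\0')
        result.resize(result.size() - 1);                    // drop trailing NUL
      return result;
    }
    // Usage: std::string name = device_info_string<CL_DEVICE_NAME>(dev);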
Re: [ViennaCL-devel] OpenCL C++ API
Hi, 2014-04-29 16:54 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, It seems like we could save several thousands of lines of code (and gain a lot of clarity) by using the C++ API directly. Well, I'm not so sure about that, actually. I'd more conservatively estimate it in the range of hundreds. Actually, I think that most of the code bloat comes from the clGet*Info() calls. There are dozens of methods in ocl::device() to handle this, while this could be simply handled using a proper templated method. I notice that I already did that work in viennacl/ocl/infos.hpp, but that it is somehow not used. Is there any drawback to using viennacl::ocl::info<CL_DEVICE_NAME>(viennacl::ocl::device const & d); instead of viennacl::ocl::device::name()? Yes, we discussed that. The reason is that with info() one cannot buffer the results. Perhaps we could rename it so that the call becomes viennacl::ocl::info<VIENNACL_DEVICE_NAME>(viennacl::ocl::device const &), so that VIENNACL_DEVICE_NAME could be used with CUDA/OpenMP devices too if ever needed. I think that's never needed. - We could save literally thousands of lines of code. I'm ready to bet that it could relieve the compiler and reduce the compilation times. Hundreds ;-) I bet against your bet that compilation times reduce notably, for two reasons: - the OpenCL C++ API just adds yet another type hierarchy - most of the compilation time is spent in other parts of the library However, feel free to prove me wrong ;-) It's true, most of the time is spent in template instantiation rather than parsing. I remember having benchmarked it on an older version of ViennaCL. I think that 1000 lines could be saved by using viennacl::ocl::infos instead of viennacl::ocl::{device|program|kernel|chocolatebar}::*(). Yes and no. We might save some milliseconds from this, but it gets harder for us to document, and users will have a harder time finding it. Also, return values cannot be buffered, see above. That was the point of dropping infos. I can dig out the other thread if you like. - It would be easier to maintain. There is a whole community bugfixing the cl.hpp file, and we are more likely to have bugs in our C calls than in our C++ calls. How do you intend to deal with bugs or warnings obtained from cl.hpp? We can handle all bugs and warnings in our code, but we might have a hard time dealing with warnings or even errors in older versions of cl.hpp. Yes, it won't be noticed by 99% of our users, but I also care about the remaining one percent. We actually had a warning in cl.h recently (or was it in cl.hpp?), didn't we? :P So we might have to deal with warnings in cl.h as well. I agree that we don't want a riot with angry people shouting "We are the 1%". Yes, it was a warning in the OpenCL 1.1 headers, which is fixed in OpenCL 1.2. This was in a trivial C header. You can imagine how things can get worse with C++. - I think that when we need to deal with more complicated things such as multiple devices, we'll gain in productivity and robustness by using the C++ API. Robustness might be true. In which way are we going to gain in productivity? I understand that you may gain in productivity when dealing with the generator, but most other parts of ViennaCL are sitting on top of a backend-agnostic layer, not getting in touch with OpenCL directly. So the question is whether you consider the additional effort of replacing the current use of the C API worth it.
Actually, I came across a lot of inconveniences while plugging a simple caching mechanism into viennacl::ocl::context::add_program(). It's not at all about the generator ;) The inconveniences, however, had to do with unintegrated use of STL vectors and cumbersome clGet*Info. Adding methods would not have solved the problem, because the corresponding ViennaCL objects were not created at this point (but the cl handles were). This sounds to me like the OpenCL C++ API won't make a difference here ;-) On the other hand, I cannot see any functionality in the C API which isn't wrapped in the C++ one. It sure would take quite a bit of work, but I'm ready to handle it myself if there is no objection. I'm not objecting to a change, but I also want to submit that I don't think it is worth the effort. If you're fine with the effort (including the testing and bugfixes to bring it to a state comparable to the current one), please go ahead :-) I think you're right... Plus, we need information such as the device macro-architecture, which will probably never be provided by the OpenCL standards. To get information such as NUMA domains, we might even have to resort
Re: [ViennaCL-devel] Benchmark GUI Project Overview
Hey Namik, Congratulations on your acceptance to the GSoC! I don't know to what extent this blog is customizable, but it would be nice to have some sub-sections related to the sub-parts of the project, to clearly distinguish your updates/ideas on the GUI itself from those you'll have on e.g. the code architecture. It will make communication easier. Philippe 2014-04-27 19:16 GMT+02:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, thanks for putting everything in shape and posting it. Just some thoughts on the blog itself: * Blog title: Try to use more descriptive titles. GSoC 2014 Warmup is better suited than just Warmup, because it contains the important keyword GSoC and you don't get collisions in the following years. * The title/header bar is quite big, using a lot of vertical space (300px) without much information. If you reduce that to about 2/3, you can fit more interesting information onto the first screen on lower-resolution displays. Best regards, Karli On 04/27/2014 12:36 AM, Namik Karovic wrote: Greetings, I've uploaded the benchmark GUI project proposal on my blog. You can read it at http://zalomiga.ba/blog/warmup/ . Any feedback is welcome. Regards, Namik
[ViennaCL-devel] WebCL 1.0 final specifications released
Hello everybody, After two years pending, the final specifications for WebCL 1.0 were released a couple of days ago. It is, logically enough, based on OpenCL 1.1, which suits ViennaCL, since we don't require anything beyond that anyway. I don't see any clear applications of it for ViennaCL, and I'm incredibly bad with everything more or less related to web development. I just wanted to keep you guys updated :) Philippe
Re: [ViennaCL-devel] Ideas for Google Summer of Code?
Hey everybody, My recent advances on auto-tuning gave birth to a new GSoC idea in my mind. More exactly, I've come up with something more complete around (crowd-sourced) auto-tuning and the GUI. This would include: - Developing a portable auto-tuning GUI (as of now: BLAS1 / dense BLAS2 / dense BLAS3) - Building a better database representation for our profiles; I feel like manually filling a map will become awful as we obtain more data. Ideally, it would be about integrating ViennaCL with cTuning. This would allow us to build a centralized database, to easily build statistics/graphs of our data, and in the future to create machine learning models for these datasets. - Filling the database with as many devices as possible. What do you think about it? Best regards, Philippe 2014-02-16 21:02 GMT+01:00 Evan Bollig bol...@gmail.com: No real preference. I do want to play with some of the new features in CUDA 6 this summer. Aside from that, SpMM is of interest to the other groups I work with. I think PETSc bindings would be a good project. I met with a CFD group on Friday to discuss optimization and scaling improvements in their PETSc code. Seems to be a big item of interest for many. -E On Saturday, February 15, 2014, Karl Rupp r...@iue.tuwien.ac.at wrote: Hi Evan, Fyi, I'm willing to contribute to the GSoC effort as well. I can sponsor access to the variety of hardware we have at the Minnesota Supercomputing Institute, including single and multi-GPU nodes and workstations (M2070s, K20s, K2s, and Quadro variants), single and multi Xeon Phi 5110P nodes, and Nehalem and Sandy Bridge nodes. Let me know if the need arises. I'm also open to mentoring if needed. Awesome, Evan, that would be absolutely great. Do you have any particular project you'd like to see addressed? With multiple nodes available it would also be interesting to work on the PETSc-ViennaCL bindings. (This would certainly require a fairly experienced student.) Best regards, Karli On Saturday, February 15, 2014, Karl Rupp r...@iue.tuwien.ac.at mailto:r...@iue.tuwien.ac.at wrote: Hi Philippe, I completely agree concerning matrix-free implementations of the linear solvers. Their absence is the very reason why I had to reimplement solvers for UMinTL. I assume you are aware that you can overload viennacl::linalg::prod() for whatever custom 'matrix' type you pass to solve()? Furthermore, some other fancy stopping criteria may be provided. For example, some algorithms in unconstrained optimization use CG on an indefinite matrix, and abort the solver once p^T A p < 0. There are also some probabilistic stopping criteria for CG when the matrix-free implementation is an estimator of the true matrix-vector product. In the end, the CG I ended up with for UMinTL is pretty big and flexible, and I think it would be a good thing to have the same thing within ViennaCL. The monitoring capabilities for the iterative solvers in ViennaCL are indeed poor, not providing any feedback to the outside about the current residual and/or custom stopping criteria, convergence reasons, etc. Improving this is desirable, among many other things to do as well. Again a matter of priorities... ;-)
Re: [ViennaCL-devel] Ideas for Google Summer of Code?
Hi, I completely agree concerning matrix-free implementations of the linear solvers. Their absence is the very reason why I had to reimplement solvers for UMinTL. Furthermore, some other fancy stopping criteria may be provided. For example, some algorithms in unconstrained optimization use CG on an indefinite matrix, and abort the solver once p^T A p < 0. There are also some probabilistic stopping criteria for CG when the matrix-free implementation is an estimator of the true matrix-vector product. In the end, the CG I ended up with for UMinTL is pretty big and flexible, and I think it would be a good thing to have the same thing within ViennaCL. Best regards, Philippe 2014-02-15 9:32 GMT+01:00 Karl Rupp r...@iue.tuwien.ac.at: Hi, As long as you're a student, you're eligible to apply for GSoC. ;-) However, I don't give any guarantees; your application will be treated equally. You certainly have an advantage with respect to knowing how things work, but no other student should be excluded upfront. It would definitely be great to be able to finish what I've started! A preliminary list of ideas is available here: http://www.iue.tuwien.ac.at/cse/index.php/gsoc/2014.html I'm thinking about adding an OpenMP tuning project, since more and more users seem to get in touch with it. I was thinking about another nice-to-have feature. I'd quite like to make it possible (if it is in fact possible..) to play with prototyping implementations of other algorithms for ViennaCL in PyViennaCL using PyOpenCL and PyCUDA; for instance (just picking a recently discussed example) to implement using PyOpenCL a cl_program for sparse matrix multiplication, and be able to use the resultant buffer like any other PyViennaCL matrix object. I don't know if it would be worthwhile to hook into the generator or scheduler at this point. One thing that will certainly be of interest to a bunch of people is the ability to provide custom matrix-vector products for the iterative solvers (matrix-free implementations). Andreas Kloeckner is also looking forward to providing any help needed. Hooking this into the scheduler is possible to some degree, at least for the common applications of an operator to a vector or matrix. Just personally, I'd quite like this functionality, because I do find rapid development easier in Python than C++, and I would like to play with implementing matrix algorithms at some point in the future. Python *is* more suitable for rapid prototyping than C++. You will quickly find that for rapid prototyping it is important to have broad support for all basic operations (including elementwise operations, etc.), so this should have highest priority. We should be careful with implementing additional algorithms in PyViennaCL for anything other than tutorial purposes, because then we would quickly end up maintaining multiple versions of the same functionality. Yes; I'll need to investigate this. At the moment, I quite enjoy the object-oriented nature of ViennaCL, and there are parts of PyViennaCL which are inelegantly not-OOP. So I'd probably want to think about which way is going to be most elegant overall. Presumably, the C++ API isn't going to disappear? The object-oriented nature will not disappear. Even if the core is going to be a shared (C-)library for ViennaCL 2.0.0, it will still have the same spirit as the current C++ API. Actually, I expect that ViennaCL 2.0.0 will still have a (more lightweight) C++ layer on top of the shared library, so its API won't change much.
Best regards, Karli
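As an illustration of the matrix-free mechanism mentioned above (overloading viennacl::linalg::prod() for a custom 'matrix' type passed to solve()), here is a minimal sketch; the operator class, its trivial apply(), and the exact requirements of the solver interface are assumptions:

    #include <cstddef>
    #include "viennacl/vector.hpp"
    #include "viennacl/linalg/cg.hpp"

    // a custom 'matrix' that applies A*x on the fly, without ever assembling A
    class MyOperator
    {
    public:
      MyOperator(std::size_t n) : n_(n) {}
      std::size_t size() const { return n_; }
      void apply(viennacl::vector<double> const & x, viennacl::vector<double> & y) const
      {
        y = 2.0 * x; // stand-in for a real matrix-free operator kernel
      }
    private:
      std::size_t n_;
    };

    namespace viennacl { namespace linalg {
      // overload picked up by the iterative solvers
      inline viennacl::vector<double> prod(MyOperator const & A,
                                           viennacl::vector<double> const & x)
      {
        viennacl::vector<double> y(x.size());
        A.apply(x, y);
        return y;
      }
    } }

    // usage:
    //   viennacl::vector<double> x = viennacl::linalg::solve(A, rhs, viennacl::linalg::cg_tag());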
Re: [ViennaCL-devel] More extensive Nightly Tests booting...
Hi Karl, Wow, that's really neat! I'll fix the warnings for Clang and for generator_blas1-opencl. Philippe 2014-02-14 10:38 GMT+01:00 Karl Rupp r...@iue.tuwien.ac.at: Hi guys, in the past few days we worked here in Vienna on setting up an automated nightly build system based on CTest and CDash. It isn't fully completed yet, but it already starts to pay off: http://jwein2.iue.tuwien.ac.at:5/CDash/index.php?project=ViennaCL Philippe, could you please have a look at the new warnings obtained for the generator on Clang? Some of the warnings look pretty ugly and are likely to cause wrong runtime behavior. Also, generator_blas1-opencl now fails due to a missing check for double precision. An older CentOS 5.x system, a Linux Mint machine, and a Windows machine still need to be integrated. Usually I'll take care of warnings, but it certainly helps if you bookmark that page and also check the results in (ir)regular intervals. Automated email notifications are possible, just let me know if I should sign you up. Best regards, Karli
[ViennaCL-devel] viennacl::reduce and viennacl::row/col_wise()
Hello, So, as of now, the generation of row-wise reductions can be triggered through the interface:

    viennacl::reduce<op_add>(viennacl::row_wise(Mat))
    viennacl::reduce<op_max>(viennacl::col_wise(Mat))
    viennacl::reduce<op_min>(Vec)

This plugs into a statement of the form:

    Node 1:
      type     : COMPOUND_OPERATION
      operator : OPERATION_UNARY_REDUCTION
      lhs      : Node 2
      rhs      : OPERATION_BINARY_ADD_TYPE

    Node 2:
      type     : COMPOUND_OPERATION
      operator : OPERATION_UNARY_ROW_WISE_TYPE
      lhs      : MATRIX
      rhs      : UNDEFINED

I think that an operator is a symbolic entity, and that the difference between an elementwise addition and a vector summation should not be encoded at the level of the addition operation. This is why in both cases the same OPERATION_BINARY_ADD_TYPE is involved. I think that the statement representation is nice enough, but the UI may not be optimal from a compilation-time perspective. On the other hand, calling viennacl::reduce(A, VIENNACL_ADD_ROW_WISE); seems not so great at all, since this would involve a *lot* of duplication in the scheduler and the generator. Yet, I don't see any way of having a dynamic interface (no expression templates) while preserving the flexibility of the statement mentioned above. Any ideas? Best regards, Philippe
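For reference, a minimal sketch of what the tag-based interface above could look like (the proxy types are illustrative, not the actual scheduler API); the point is that op_add itself carries no reduction semantics, since the enclosing reduce node does:

    struct op_add {}; // plain addition tag, shared with elementwise operations
    struct op_max {};
    struct op_min {};

    template <typename MatrixT>
    struct row_wise_proxy { MatrixT const & mat; };        // -> OPERATION_UNARY_ROW_WISE_TYPE

    template <typename OpT, typename OperandT>
    struct reduction_proxy { OperandT const & operand; };  // -> OPERATION_UNARY_REDUCTION + OpT

    template <typename MatrixT>
    row_wise_proxy<MatrixT> row_wise(MatrixT const & m)
    {
      row_wise_proxy<MatrixT> proxy = { m };
      return proxy;
    }

    template <typename OpT, typename OperandT>
    reduction_proxy<OpT, OperandT> reduce(OperandT const & operand)
    {
      reduction_proxy<OpT, OperandT> proxy = { operand };
      return proxy;
    }

    // usage (an assignment operator would then translate the proxies into
    // the two statement nodes shown above):
    //   viennacl::vector<float> row_sums = reduce<op_add>(row_wise(Mat));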
Re: [ViennaCL-devel] Ideas for Google Summer of Code?
Hi, I'll be once more available as a mentor :) I'll myself be pretty busy with some BLAS2/BLAS3 tuning for Hawaii. I'm also in favor of project ideas which don't require strong knowledge of the current codebase, such as the GUI autotuning/benchmarking tool. I think that ViennaCL could also hugely benefit from parameterized FFT kernels... Of course, PyViennaCL is another option. Best regards, Philippe 2014-02-03 Karl Rupp r...@iue.tuwien.ac.at: Hi guys, the Google Summer of Code [1] is approaching. It attracted some great contributors in the past, most notably Philippe and Toby, and I hope there are more to come. So, guys, please provide your project ideas. My experience is that good projects are those which don't require the student to understand large parts of the existing code and are easy to formulate (but not necessarily easy to accomplish :-P). Although I don't see work items on our roadmap nicely fulfilling this optimum, we still have at least two neat things to work on: The first project I have in mind is the benchmarking GUI we brainstormed about in IRC. It's probably a good idea to push out a first working version in the next weeks and then let the student work on refinements such as a visualization of the results, etc. Second, I think PyViennaCL will benefit from another push. Toby, how's your availability for a release in February? I have more time now for assisting you with the final polishing. Something rather generic is the implementation of algorithms such as additional iterative solvers, etc. Now that the standard matrix and vector operations are pretty mature and support multiple backends, this should be a pretty fun piece of work :-) @Philippe, Toby: How about mentoring this year? Best regards, Karli [1] http://www.google-melange.com/gsoc/homepage/google/gsoc2014
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, I think we agree on everything now! Okay, I will generate all the kernels; this will actually lead to 16 kernels for each cpu-gpu scalar combination, so 64 small kernels in total. This took time, but it was a fruitful discussion :) Anyway, my ideas are much clearer now, thanks! Best regards, Philippe 2014-01-26 Karl Rupp r...@iue.tuwien.ac.at Hey,

    (programs / kernels each)    execution time [s]
    (  1/128)                     1.4
    (  2/ 64)                     2.0
    (  4/ 32)                     3.2
    (  8/ 16)                     5.6
    ( 16/  8)                    10.5
    ( 32/  4)                    20.0
    ( 64/  2)                    39.5
    (128/  1)                    80.6

Thus, jit launch overhead is in the order of a second! Okay, it seems like 1 program for all the kernels is the way to go. From your hard facts, though, it seems like generating 16 kernels inside the same program would have practically the same cost as generating only one, since the execution time is largely dominated by the jit launch overhead. The jit launch overhead seems to be roughly 80.6/128 ≈ 0.63 s, which leads to a kernel compilation time of roughly (1.4 − 0.63)/128 ≈ 6 ms. Considering that the flip_sign and reciprocal trick cannot be applied for unsigned integers, this is the way to go then. The increase in the number of kernels should be somewhat compensated by the fact that each of the kernels is shorter. All we need to do is to have an interface to the generator where we can just extract the axpy-kernels. The generator should not do any OpenCL program and kernel management. I don't see any problem with extracting the source code from the generator in order to create this program (it is already done for GEMM), but the generator doesn't handle reciprocal and flip_sign. As I said earlier, this feature is cool because it may prevent the transfer of several GPU scalars in order to invert/negate the value. On the other hand, though, it is incompatible with the clBLAS interface and the kernel generator (both of which are fed with cl_float and cl_double). Modifying the generator to handle x = y/a - w/b - z*c internally as x = y*a + w*b + z*c + option_a + option_b + option_c sounds like a very dangerous idea to me. It could have a lot of undesirable side effects if made general, and making an axpy-specific tree parsing would lead to a huge amount of code bloat. This is actually the reason why I am so reluctant to integrate reciprocal and flip_sign within the generator... Okay, let's not propagate reciprocal and flip_sign into the generator then. Also, feel free to eliminate the second reduction stage for scalars, which is encoded into the option value. It is currently unused and makes the generator integration harder than necessary. We can revisit that later if all other optimizations are exhausted ;-)

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {
      // Vectors are padded; wouldn't it be confounding/unnecessary to check
      // for the internal size to fit the width?
      // The following steps are costly for small vectors
      cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the scalar is on global device memory

Never copy device scalars back unless requested by the user. These reads block the command queue, preventing overlaps of host and device computations.

      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

Let's just generate all the needed kernels and only dispatch into the correct kernel.
      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
      {
        dynamically_generated_program::init();
        ambm_kernel(x, cpu_alpha, y, cpu_beta, z);
      }
    }
    else
    {
      statically_generated_program::init();
      ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha,
                  y, beta, reciprocal_beta, flip_beta, z);
    }

What is the difference between dynamically_generated_program::init(); and statically_generated_program::init();? Why aren't they the same? Also, mind the coding style regarding the placement of curly braces and spaces ;-) Wouldn't this solve all of our issues? I (really) hope we're converging now! :) I think we can safely use dynamically_generated_program::init(); in both cases, which contains all the kernels which are currently in the statically generated program. I don't believe it is our task to implement such a cache. This is way too much of a source of errors and of messing with the filesystem for ViennaCL, which is supposed to run with user permissions. An OpenCL SDK is installed into the system and thus has much better options to deal with the location of the cache, etc. Also, why is only NVIDIA able to provide such a cache, even though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an extended amount of time. Agreed. I was just suggesting this because PyOpenCL already provides this, but Python comes with a set of dynamic libraries, so
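To make the agreed-upon batch generation concrete, here is a sketch of emitting all 16 flip/reciprocal variants into a single program string, so only one jit pass is paid per scalar combination; the naming scheme and kernel signature are illustrative, not ViennaCL's actual ones:

    #include <sstream>
    #include <string>

    std::string generate_avbv_variants()
    {
      std::ostringstream src;
      for (int flip_a = 0; flip_a < 2; ++flip_a)
        for (int recip_a = 0; recip_a < 2; ++recip_a)
          for (int flip_b = 0; flip_b < 2; ++flip_b)
            for (int recip_b = 0; recip_b < 2; ++recip_b)
            {
              // bake the sign/reciprocal handling into the source of each variant
              std::string a = recip_a ? "(1.0f/alpha)" : "alpha";
              if (flip_a) a = "(-" + a + ")";
              std::string b = recip_b ? "(1.0f/beta)" : "beta";
              if (flip_b) b = "(-" + b + ")";
              src << "__kernel void avbv_" << flip_a << recip_a << flip_b << recip_b
                  << "(unsigned int N, __global float * x, __global const float * y,"
                  << " __global const float * z, float alpha, float beta) {\n"
                  << "  for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))\n"
                  << "    x[i] = " << a << " * y[i] + " << b << " * z[i];\n"
                  << "}\n";
            }
      return src.str(); // compile once, then extract the 16 kernels by name
    }

The remaining cpu/gpu scalar combinations (the factor of 4 giving 64 kernels) would be generated analogously, with the scalar arguments replaced by __global pointers where needed.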
[ViennaCL-devel] Altera OpenCL optimization guide
Hello everyone, I have found this relatively new and interesting PDF file: http://www.altera.com/literature/hb/opencl-sdk/aocl_optimization_guide.pdf. I'll read it overnight. This is of course for a mid/long-term perspective, but there are some remarkable points within, for example (some teasing :D): The AOC implements local memory in FPGAs very differently than in GPUs. If your OpenCL kernel contains code to avoid GPU-specific local memory bank conflicts, remove that code because the AOC generates hardware that avoids local memory bank conflicts automatically whenever possible. Anyway, this is of course not a priority (we don't even have any hardware to test it on), but it might be useful to get some insight into how Altera's OpenCL behaves... Best regards, Philippe
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey hey Karl, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi Phil, Oh, I get it better now. I am not entirely convinced, though ;) From my experience, the overhead of the jit launch is negligible compared to the compilation of one kernel. I'm not sure whether compiling two kernels in the same program or in two different programs makes a big difference. Okay, time to feed you with some hard facts ;-) Scenario: compilation of 128 kernels, in configurations of x programs with y kernels each (x*y = 128). Execution times:

    (programs / kernels each)    execution time [s]
    (  1/128)                     1.4
    (  2/ 64)                     2.0
    (  4/ 32)                     3.2
    (  8/ 16)                     5.6
    ( 16/  8)                    10.5
    ( 32/  4)                    20.0
    ( 64/  2)                    39.5
    (128/  1)                    80.6

Thus, jit launch overhead is in the order of a second! Okay, it seems like 1 program for all the kernels is the way to go. From your hard facts, though, it seems like generating 16 kernels inside the same program would have practically the same cost as generating only one, since the execution time is largely dominated by the jit launch overhead. The jit launch overhead seems to be roughly 80.6/128 ≈ 0.63 s, which leads to a kernel compilation time of roughly (1.4 − 0.63)/128 ≈ 6 ms. Plus, ideally, in the case of linear solvers, the generator could be used to generate fused kernels, provided that the scheduler is fully operational. Sure, kernel fusion is a bonus of the micro-scheduler, but we still need to have a fast default behavior for scenarios where kernel fusion is disabled. I fear that any solution to the aforementioned problem would destroy this precious ability... Ideally, once we enable it, the generate_execute() mentioned above would just be replaced by generate() (or enqueue_for_generation, which is more explicit). All we need to do is to have an interface to the generator where we can just extract the axpy-kernels. The generator should not do any OpenCL program and kernel management. I don't see any problem with extracting the source code from the generator in order to create this program (it is already done for GEMM), but the generator doesn't handle reciprocal and flip_sign. As I said earlier, this feature is cool because it may prevent the transfer of several GPU scalars in order to invert/negate the value. On the other hand, though, it is incompatible with the clBLAS interface and the kernel generator (both of which are fed with cl_float and cl_double). Modifying the generator to handle x = y/a - w/b - z*c internally as x = y*a + w*b + z*c + option_a + option_b + option_c sounds like a very dangerous idea to me. It could have a lot of undesirable side effects if made general, and making an axpy-specific tree parsing would lead to a huge amount of code bloat. This is actually the reason why I am so reluctant to integrate reciprocal and flip_sign within the generator...

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {
      // Vectors are padded; wouldn't it be confounding/unnecessary to check
      // for the internal size to fit the width?
      // The following steps are costly for small vectors
      cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the scalar is on global device memory
      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
      {
        dynamically_generated_program::init();
        ambm_kernel(x, cpu_alpha, y, cpu_beta, z);
      }
    }
    else
    {
      statically_generated_program::init();
      ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha,
                  y, beta, reciprocal_beta, flip_beta, z);
    }

Wouldn't this solve all of our issues? I (really) hope we're converging now!
:) That put aside, I'm not sure we should give that much importance to jit-compilation overhead, since the binaries can be cached. If I remember correctly, Denis Demidov implemented such a caching mechanism for VexCL. What if we replace distributed vector/matrix with an optional automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of time :P)? The drawback is that the filesystem library would have to be dynamically linked, though, but after all OpenCL itself also has to be dynamically linked. I don't believe it is our task to implement such a cache. This is way too much of a source of errors and of messing with the filesystem for ViennaCL, which is supposed to run with user permissions. An OpenCL SDK is installed into the system and thus has much better options to deal with the location of the cache, etc. Also, why is only NVIDIA able to provide such a cache, even though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an extended amount of time. Agreed. I was just suggesting this because PyOpenCL already provides this, but Python comes with a set of dynamic libraries, so this is probably not the same context. Best regards, Philippe Best regards, Karli
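For reference, a simple timing model consistent with the measurements quoted above (with x·y = 128 kernels in total):

    T(x, y) ≈ x · t_jit + 128 · t_kernel
    t_jit    ≈ 80.6 s / 128 ≈ 0.63 s            (from the 128-program run)
    t_kernel ≈ (1.4 s − 0.63 s) / 128 ≈ 6 ms    (from the 1-program run)

    sanity check: T(16, 8) ≈ 16 · 0.63 s + 0.77 s ≈ 10.8 s   (measured: 10.5 s)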
[ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hello, I am a bit confused: is there any reason for using reciprocal and flip_sign, instead of just changing the scalar accordingly? Best regards, Philippe
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi Karl, 2014/1/24 Karl Rupp r...@iue.tuwien.ac.at Hey, I am a bit confused: is there any reason for using reciprocal and flip_sign, instead of just changing the scalar accordingly? Yes (with a drawback I'll discuss at the end): Consider the family of operations

    x = ±y OP1 a ± z OP2 b

where x, y, and z are vectors, OP1 and OP2 are either multiplication or division, and a, b are host scalars. If I did the math correctly, these are 16 different kernels when coded explicitly. Hence, if you put all these into separate OpenCL kernels, you'll get fairly long compilation times. Note that you cannot do this if a and b stem from device scalars, because then the manipulation of a and b would result in additional buffer allocations and kernel launches - way too slow. For floating-point operations, one can reduce the number of kernels a lot when (± OP1 a) and (± OP2 b) are computed once in a preprocessing step. Then, only the kernel

    x = y * a' + z * b'

is needed, cutting the number of OpenCL kernels from 16 to 1. Since (-a) and (1/a) cannot be computed outside the kernel if a is a GPU scalar, this is always computed in a preprocessing step inside the OpenCL kernel for unification purposes. I think we can even apply some more cleverness here if we delegate all the work to a suitable implementation function. And now for the drawback: When using integers, the operation n/m is no longer the same as n * (1/m). Even worse, for unsigned integers it is also no longer possible to replace n - m by n + (-m). Thus, we certainly have to bite the bullet and generate kernels for all 16 combinations when using unsigned integers. However, I'm reluctant to generate all 16 combinations for floating-point arguments if this is not needed... Thanks for the clarification. I absolutely don't want to generate the 16 kernels either! I was in fact wondering why one passed reciprocal_alpha and flip_sign into the kernel. After thinking more about it, I have noticed that this permits us to do the corresponding inversion/negation within the kernel, and therefore avoid some latency penalty / kernel launch overhead when the scalar resides on the device. That's smart! On the other hand, modifying the generator to not actually generate a specific kernel would be absurd imho. This brings another question, then: how could ambm benefit from the auto-tuning environment? I propose the following solution: check the size of the matrices/vectors. If the computation is dominated by the kernel launch time (say, less than 100,000 elements), then we use the current ambm kernel. Otherwise, we transfer the scalars to the CPU, perform the corresponding a' = ± OP a, b' = ± OP b, and either generate the kernel or use a BLAS library. This way, we benefit from kernel-launch-time optimization for small data, and high bandwidth for large data. Does this sound good? Best regards, Philippe Best regards, Karli
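A minimal sketch of the host-side folding described above, valid for floating-point scalars only (the function name is illustrative); after folding, the single kernel x = a'*y + b'*z covers all 16 sign/operation combinations:

    template <typename T>
    T fold_scalar(T value, bool flip_sign, bool reciprocal)
    {
      if (reciprocal) value = T(1) / value; // division folded into a multiplication by 1/a
      if (flip_sign)  value = -value;       // leading minus sign folded into the scalar
      return value;
    }

    // enqueue the one unified kernel with, e.g.:
    //   fold_scalar(alpha, flip_alpha, reciprocal_alpha)
    //   fold_scalar(beta,  flip_beta,  reciprocal_beta)
    // (float/double only; unsigned integers keep their 16 dedicated kernels, as noted above)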
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, 2014/1/24 Karl Rupp r...@iue.tuwien.ac.at Hi, I was in fact wondering why one passed reciprocal_alpha and flip_sign into the kernel. After thinking more about it, I have noticed that this permits us to do the corresponding inversion/negation within the kernel, and therefore avoid some latency penalty / kernel launch overhead when the scalar resides on the device. That's smart! On the other hand, modifying the generator to not actually generate a specific kernel would be absurd imho. This brings another question, then: how could ambm benefit from the auto-tuning environment? I propose the following solution: check the size of the matrices/vectors. If the computation is dominated by the kernel launch time (say, less than 100,000 elements), then we use the current ambm kernel. Otherwise, we transfer the scalars to the CPU, perform the corresponding a' = ± OP a, b' = ± OP b, and either generate the kernel or use a BLAS library. This way, we benefit from kernel-launch-time optimization for small data, and high bandwidth for large data. Does this sound good? In terms of execution time, this is probably the best solution. On the other hand, it does not solve the problem of compilation overhead: If we only dispatch into the generator for large data, we still have to generate the respective kernels and go through the OpenCL jit-compiler each time. The compilation overhead of this is even likely to dominate any gains we get from a faster execution. Instead, what about opening up the generator a bit? It is enough if we have some mechanism to access a batch-generation of axpy-like operations; for all other operations the generator can remain as-is. Another option is to move only the axpy-template from the generator over to linalg/opencl/kernels/*, because the generation of these kernels is fairly light-weight. Sure, it is a little bit of code duplication, but it will keep the generator clean. Another possible improvement is to separate operations on full vectors from operations on ranges and slices. For full vectors we can use the built-in vector types in OpenCL, which allow further optimizations not possible with ranges and strides, where we cannot use vector types in general. What do you think? I prefer option 3. This would allow for something like:

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {
      // The following steps are costly for small vectors
      NumericT cpu_alpha = alpha; // copy back to host when the scalar is on global device memory
      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
        generate_execute(x = cpu_alpha * y + cpu_beta * z);
    }
    else
    {
      // fallback
    }

This way, we generate at most two kernels: one for small vectors, designed to optimize latency, and one for big vectors, designed to optimize bandwidth. Are we converging? :) Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey hey, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi, I prefer option 3. This would allow for something like:

    if (size(x) > 1e5 && stride == 1 && start == 0)
    {

Here we also need to check the internal_size to fit the vector width.

      // The following steps are costly for small vectors
      NumericT cpu_alpha = alpha; // copy back to host when the scalar is on global device memory
      if (alpha_flip) cpu_alpha *= -1;
      if (reciprocal) cpu_alpha = 1 / cpu_alpha;
      // ... same for beta

      // Optimized routines
      if (external_blas)
        call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
      else
        generate_execute(x = cpu_alpha * y + cpu_beta * z);
    }
    else
    {
      // fallback
    }

This way, we generate at most two kernels: one for small vectors, designed to optimize latency, and one for big vectors, designed to optimize bandwidth. Are we converging? :) Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch compilation look? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. ;) In this case, generate_execute() will just trigger the compilation - on the first call only - of the kernel x = cpu_alpha*y + cpu_beta*z:

    __kernel void avbv(unsigned int N,
                       __global float4 * x,
                       __global const float4 * y,
                       __global const float4 * z,
                       float alpha, float beta)
    {
      for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
        x[i] = alpha * y[i] + beta * z[i];
    }

with of course an appropriate compute profile. Note to self: Collect some numbers on the costs of jit-compilation for different OpenCL SDKs. Best regards, Karli Best regards, Philippe
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hey hey hey, Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch compilation look? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. ;) In this case, generate_execute() will just trigger the compilation - on the first call only - of the kernel x = cpu_alpha*y + cpu_beta*z:

    __kernel void avbv(unsigned int N,
                       __global float4 * x,
                       __global const float4 * y,
                       __global const float4 * z,
                       float alpha, float beta)
    {
      for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
        x[i] = alpha * y[i] + beta * z[i];
    }

I'm afraid this is not suitable then. A simple conjugate gradient solver would then go through ~10 OpenCL compilations, making it awfully slow on the first run... With the AMD and Intel SDKs, which to my knowledge still do not buffer kernels, this would mean that each time a process is started, this large overhead will be visible. I don't understand why this would go through more than one compilation... This kernel is compiled only once; the value of flip_sign and reciprocal only changes the dynamic value of the argument, not the source code. This would eventually result in:

    if (alpha_reciprocal)
      avbv(N, x, y, z, 1/alpha, beta);

Am I missing something? Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi, Oh, I get it better now. I am not entirely convinced, though ;) From my experience, the overhead of the jit launch is negligible compared to the compilation of one kernel. I'm not sure whether compiling two kernels in the same program or in two different programs makes a big difference. Plus, ideally, in the case of linear solvers, the generator could be used to generate fused kernels, provided that the scheduler is fully operational. I fear that any solution to the aforementioned problem would destroy this precious ability... Ideally, once we enable it, the generate_execute() mentioned above would just be replaced by generate() (or enqueue_for_generation, which is more explicit). That put aside, I'm not sure we should give that much importance to jit-compilation overhead, since the binaries can be cached. If I remember correctly, Denis Demidov implemented such a caching mechanism for VexCL. What if we replace distributed vector/matrix with an optional automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of time :P)? The drawback is that the filesystem library would have to be dynamically linked, though, but after all OpenCL itself also has to be dynamically linked. Best regards, Philippe 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi Philippe, I don't understand why this would go through more than one compilation... This kernel is compiled only once; the value of flip_sign and reciprocal only changes the dynamic value of the argument, not the source code. This would eventually result in:

    if (alpha_reciprocal)
      avbv(N, x, y, z, 1/alpha, beta);

Am I missing something? I think so ;-) It's not about a single kernel, it's about the compilation unit (i.e. the OpenCL program). For conjugate gradients we roughly have the following vector operations (random variable names):

    x = y;
    x += alpha * y;
    x = z + alpha * z;
    x = y - alpha * z;
    x = inner_prod(y, z);

BiCGStab and GMRES add a few more of them. If we use the generator as-is now, then each of the operations creates a separate OpenCL program the first time it is encountered, and we pay the jit-compiler launch overhead multiple times. With the current non-generator model, all vector kernels are in the same OpenCL program and we pay the jit overhead only once. I'd like to stick with the current model of having just one OpenCL program for all the basic kernels, but get the target-optimized sources from the generator. Sorry if I wasn't clear enough in my earlier mails. Best regards, Karli
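For illustration, a host-side sketch of this one-program model (error handling omitted; the kernel names are illustrative): all sources are concatenated and built in a single jit pass, after which extracting individual kernels is cheap:

    #include <CL/cl.h>

    cl_program build_vector_program(cl_context ctx, cl_device_id dev,
                                    const char * all_sources) // all basic kernels concatenated
    {
      cl_int err;
      cl_program prog = clCreateProgramWithSource(ctx, 1, &all_sources, NULL, &err);
      clBuildProgram(prog, 1, &dev, "", NULL, NULL); // single jit invocation
      return prog;
    }

    // afterwards, per-kernel extraction costs almost nothing:
    //   cl_kernel k_assign = clCreateKernel(prog, "assign",     &err);
    //   cl_kernel k_avbv   = clCreateKernel(prog, "avbv",       &err);
    //   cl_kernel k_ip     = clCreateKernel(prog, "inner_prod", &err);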
Re: [ViennaCL-devel] Roadmap update after 1.5.0 release
Hey, I'm slowly getting back to ViennaCL. I have added one bullet point to the roadmap: * Full integration of the micro-scheduler and the generator I think that we should work towards the full integration of this feature if we don't want the codebase to eventually get too messy. I will be working on cleaning GEMM (i.e., better integration of the multiple BLAS backends, and harmonizing the kernels using the column-trans / row-notrans identity) until I go back to France in 1 week. I have also noticed that the size checks could be moved upwards in the dispatching mechanism; for now, they are duplicated between opencl/cuda/openmp. Once this is done, I will probably work towards the full integration of the micro-scheduler. Can we get rid of op_executor? Best regards, Philippe
[ViennaCL-devel] Blas linking and internal design
Hey Karl, So today I went back to ViennaCL. I tried to move the equivalence column-trans = row-notrans upwards in the dispatching mechanism, but it turns out to be impossible, because matrix<T, row_major> is not (and should not be) convertible to matrix<T, column_major>, rendering the underlying signature inappropriate... I am quite unsure how to handle this problem. I am thinking about changing the internal signature to something more low-level:

    void gemm(bool /*is_A_trans*/, bool /*is_B_trans*/,
              const vcl_size_t /*M*/, const vcl_size_t /*N*/, const vcl_size_t /*K*/,
              const T /*alpha*/,
              viennacl::backend::mem_handle const & /*A*/,
              const vcl_size_t /*A_internal_size1*/, const vcl_size_t /*A_internal_size2*/,
              const vcl_size_t /*A_start1*/, const vcl_size_t /*A_start2*/,
              const vcl_size_t /*A_inc1*/, const vcl_size_t /*A_inc2*/,
              viennacl::backend::mem_handle const & /*B*/,
              const vcl_size_t /*B_internal_size1*/, const vcl_size_t /*B_internal_size2*/,
              const vcl_size_t /*B_start1*/, const vcl_size_t /*B_start2*/,
              const vcl_size_t /*B_inc1*/, const vcl_size_t /*B_inc2*/,
              const T /*beta*/,
              viennacl::backend::mem_handle & /*C*/,
              const vcl_size_t /*C_internal_size1*/, const vcl_size_t /*C_internal_size2*/,
              const vcl_size_t /*C_start1*/, const vcl_size_t /*C_start2*/,
              const vcl_size_t /*C_inc1*/, const vcl_size_t /*C_inc2*/);

where all the layouts would be assumed to be column-major, like in the standard BLAS interface. While this solution is acceptable to me, I fear that it will introduce a lack of harmony, considering that some other functions will otherwise keep their current form, like:

    template <typename NumericT, typename F, typename ScalarType1>
    void am(matrix_base<NumericT, F> & mat1,
            matrix_base<NumericT, F> const & mat2,
            ScalarType1 const & alpha,
            vcl_size_t len_alpha, bool reciprocal_alpha, bool flip_sign_alpha);

The only reasonable solution I see is to clearly separate, in the code, the functions which could be linked with BLAS (and give them a lower-level signature) from the other ones. For example, putting them in two separate files... is there any problem with doing this? Best regards, Philippe
Re: [ViennaCL-devel] Roadmap update after 1.5.0 release
Hey, Sorry for the late reply :P I'm supposed to defend my MSc in 2 weeks, and I am yet to start writing my thesis... (I won't have a lot of time to give to ViennaCL until everything is sorted out) 2013/12/23 Karl Rupp r...@iue.tuwien.ac.at Hi guys, Now as 1.5.0 is out, I spent some thoughts on the roadmap: https://github.com/viennacl/viennacl-dev/wiki/ViennaCL-Roadmap Rather than having one major update per year, I'd like to go with smaller updates (1.6.0, 1.7.0, etc.) every couple of months, with eventual bugfix and performance improvements in between (1.5.1, 1.6.1, etc.). Thus, the list of features for 1.6.0 was stripped down and the not-so-urgent features are postponed to 1.7.0. We still need to gather more experiences before we are ready to finally fix some design errors in 2.0.0. Due to recent developments by Philippe, support for external BLAS libraries as a backend was added to the roadmap for 1.6.0. Any comments on further rearrangements of the roadmap? This seems fine for me! I would be against overloading the roadmap at this point. If we can achieve both stable distributed data-structures and the full integration of the scheduler in the coming couple of months, then the performance improvements should be noticeable enough to justify a new release, I think. Best regards, Philippe Best regards, Karli
[ViennaCL-devel] Handling Layout/Transpose ASAP for GEMM/GEMV ?
Hey, I've started back on the generator today, and realized how ugly the dispatching mechanism was that takes advantage of the equivalencies based on the fact that RowMajor + Trans = ColMajor + NoTrans. Actually, I've been wondering: why wouldn't we do this across the whole codebase? We could presumably focus solely on providing a simple BLAS interface (all column-major), and do the additional trickery at some point beforehand. I see a couple of advantages to this: => This would enable us to maintain only 4 GEMM and 2 GEMV kernels, instead of 32 GEMM and 4 GEMV kernels. => This would enormously increase the consistency between the default implementations, the BLAS backends and the kernel generator (because all these implementations could then focus on providing just a simple column-major BLAS interface). Am I missing something? If not, at which point should such a dispatching mechanism take place? Best regards, Philippe
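To make the folding concrete, here is a minimal sketch (hypothetical names, plain host code rather than generated kernels) of how every layout/transpose combination reduces to the four column-major cases NN, NT, TN, TT:

    #include <cstddef>

    enum layout_t { row_major_t, column_major_t };

    // Reference column-major kernel: C = alpha * op(A) * op(B) + beta * C,
    // with C of size M x N and inner dimension K.
    void gemm_col_major(bool trans_A, bool trans_B,
                        std::size_t M, std::size_t N, std::size_t K,
                        double alpha,
                        double const * A, std::size_t lda,
                        double const * B, std::size_t ldb,
                        double beta,
                        double * C, std::size_t ldc)
    {
      for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < M; ++i)
        {
          double acc = 0;
          for (std::size_t k = 0; k < K; ++k)
          {
            double a = trans_A ? A[k + i * lda] : A[i + k * lda];
            double b = trans_B ? B[j + k * ldb] : B[k + j * ldb];
            acc += a * b;
          }
          C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }

    // Dispatcher: any layout, any transpose flags.
    void gemm_any(layout_t layout_A, bool trans_A,
                  layout_t layout_B, bool trans_B,
                  layout_t layout_C,
                  std::size_t M, std::size_t N, std::size_t K,
                  double alpha,
                  double const * A, std::size_t lda,
                  double const * B, std::size_t ldb,
                  double beta,
                  double * C, std::size_t ldc)
    {
      // RowMajor + Trans = ColMajor + NoTrans: a row-major operand is just a
      // transposed column-major one, so toggle its transpose flag.
      if (layout_A == row_major_t) trans_A = !trans_A;
      if (layout_B == row_major_t) trans_B = !trans_B;

      if (layout_C == column_major_t)
        gemm_col_major(trans_A, trans_B, M, N, K,
                       alpha, A, lda, B, ldb, beta, C, ldc);
      else
        // Row-major C: compute C^T = op(B)^T * op(A)^T instead (swap the
        // operands, flip both flags, swap M and N; ldc is then the row stride).
        gemm_col_major(!trans_B, !trans_A, N, M, K,
                       alpha, B, ldb, A, lda, beta, C, ldc);
    }

Note that the buffers and leading dimensions are reused unchanged in every case; only the flags and the M/N extents are remapped.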
Re: [ViennaCL-devel] Call for testing: PyViennaCL on Ubuntu
*Sneaks in* (Seems like it's time to hide an if (rand() < RAND_MAX/2) return; somewhere in the code where Karl won't find it!) :D Philippe 2013/12/19 Karl Rupp r...@iue.tuwien.ac.at Hi Toby, please allow for ~1 more day, then 1.5.0 is out and I'm available for testing :-) Best regards, Karli On 12/17/2013 08:14 AM, Toby St Clere Smithe wrote: Toby St Clere Smithe m...@tsmithe.net writes: Yep, looks like the build was successful, so I'll go ahead and make sure it's all working on older distributions now. And so, after a little over a day backporting the package build system from Python 3 to Python 2 (yay!), we have packages for Debian / Ubuntu versions at least as old as Ubuntu 12.04! If you want to try them out, add the PPA[1] to your system; if you're on Ubuntu 12.04, 12.10 or 13.04, you can only install `python-pyviennacl` (for Python 2), but if you're on Ubuntu 13.10 or 14.04, you can also install `python3-pyviennacl` (for Python 3). Oh, and everyone gets `pyviennacl-doc` (which contains the HTML docs). I'll set the version number to 1.0.0 once ViennaCL 1.5.0 is released, and after any remaining little bugs get ironed out. [1] https://launchpad.net/~tsmithe/+archive/pyviennacl/ Cheers, Toby
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hey, 2013/12/18 Karl Rupp r...@iue.tuwien.ac.at Hi. A short update: I've implemented linkage to CBLAS and cuBLAS with dynamic selection. If activated through VIENNACL_WITH_CUBLAS, one can go back and forth between cuBLAS and the original backend by doing:

    A.blas().gemm(NULL);
    A.blas().gemm(viennacl::backend::blas::cublas_functions<value_type>::gemm);

(and similarly for CBLAS.) Nice, thanks! I think we can shorten the second call to something like A.blas().gemm(viennacl::backend::cublas); for convenience. There is some trickery going on with transpositions and layout, but it works for every transpose/layout combination. One can also link A's blas to one's own gemm function, provided a tiny wrapper (essentially to ensure signature compatibility). Cool! It is actually interesting to point out that only 4 GEMM kernels are needed for any implementation: NN, NT, TN, TT. Then, one can use the equivalences Row-Major+N = Col-Major+T and C = A.B <=> C^T = B^T.A^T. A very good piece of news is that this allows ViennaCL to work very well on very recent NVIDIA hardware, until our autotuning engine is fully operational. On my laptop, cublasSgemm is about 5 times faster than the current CUDA implementation, and 20% faster than the OpenCL kernel found by the autotuner (120 GFLOP/s vs 25 GFLOP/s vs 95 GFLOP/s). Also, linking with OpenBLAS leads to a HUGE performance boost on the CPU (0.02 GFLOP/s vs 70 GFLOP/s)...! For our native CUDA implementation it's probably only a matter of porting the results from the OpenCL tuner over. Unfortunately I don't see a good way of doing this with CUDA without a significant penalty on compilation times, because there is no concept of runtime kernel selection in CUDA so far. The performance difference for GEMM of our CPU backend is not surprising; this was never subject to optimization ;-) That's exactly the point of this feature! Optimizing GEMM for the CPU is pretty complicated, and linking with external BLAS libraries allows us not to focus too much on these problems, and to just provide a fallback implementation for the sake of code portability. A little question remains. For now, the behavior is really weird when one defines both VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS. How to handle this? I am not very familiar with the multiple backends and I don't know to what extent they can be combined. Therefore, I see multiple options, but can't tell which one is better. 1 - trigger a preprocessor error when both macros are defined together 2 - slightly modify the API: A.cuda_blas(), A.host_blas(), A.cl_blas() I think that option 2 is better, considering that there are already cuda_handle(), opencl_handle(), cpu_handle() or something similar, if I'm correct. Any advice? The reason why cuda_handle(), opencl_handle() and cpu_handle() exist under different names is that they return different types (i.e. the memory buffer). For the BLAS backends I don't want to have different member names, because this gets annoying for users. For example, if a user wants to cycle through the backends for e.g. benchmark purposes, she would have to write

    if (my_constant == CUDA) A.cuda_blas()...
    else if (my_constant == HOST) A.host_blas()...
    else A.cl_blas()...

Yes, you're right. However, the types returned by .blas() currently differ across the backends. This is because I chose a low-level interface for the BLAS wrappers, so the function signatures are slightly different (T const * A, vcl_size_t A_internal_size1, ... versus cl_mem const & A, vcl_size_t A_internal_size1, ...). I can easily change the signature to a higher-level one (viennacl::matrix<T> & A, ...). This is probably better, right? ...so making the code longer than necessary. I suggest querying some central registry where the backends are registered and then cycling through them:

    SomeListType blas_list = viennacl::blas_implementations_available();
    for (it = blas_list.begin(); ...)
    {
      A.blas(*it);
      do_something(A);
    }

I don't know whether .blas() is the best name for this, because in the future we might also have more non-BLAS operations such as sorting or FFT - maybe we use .operations() to better reflect the operations table? Yes, I also thought about it... I'm not sure how to handle the default case, A.operations().gemm(NULL), but I guess A.operations().gemm(viennacl::backend::default()) would do, where a proper overload would set the pointer to NULL internally. --- It seems to me that this is going in a very fruitful direction. Any objections to pushing and extending this for the 1.6.0 release? 1.5.0 is essentially done, I'm currently writing the last bits of documentation and resolving some minor warnings on Visual Studio. Yes. This is already pushed in a feature branch; I can try to extend it to allow for the list implementation you suggested. There are also a couple of changes in the generator on another feature
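For the record, the per-object selection used above boils down to something like this simplified sketch (hypothetical names, not the actual ViennaCL code); a NULL entry means "use the built-in kernel":

    #include <cstddef>

    // Signature assumed to be shared by all gemm backends (column-major, float).
    typedef void (*gemm_fn)(bool trans_A, bool trans_B,
                            std::size_t M, std::size_t N, std::size_t K,
                            float alpha, float const * A, std::size_t lda,
                            float const * B, std::size_t ldb,
                            float beta, float * C, std::size_t ldc);

    // Per-object operations table.
    class blas_table
    {
    public:
      blas_table() : gemm_(NULL) {}
      void gemm(gemm_fn f) { gemm_ = f; }     // select an implementation
      gemm_fn gemm() const { return gemm_; }  // query the current one
    private:
      gemm_fn gemm_;
    };

    // Usage, mirroring the calls above:
    //   A.blas().gemm(NULL);                     // built-in ViennaCL kernel
    //   A.blas().gemm(&my_cblas_sgemm_wrapper);  // user-provided wrapper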
Re: [ViennaCL-devel] Call for testing: PyViennaCL on Ubuntu
Hey Toby, Excellent! Thank you! I'm installing it right away, and I'll test it later tonight. Philippe 2013/12/17 Toby St Clere Smithe m...@tsmithe.net Toby St Clere Smithe m...@tsmithe.net writes: Yep, looks like the build was successful, so I'll go ahead and make sure it's all working on older distributions now. And so, after a little over a day backporting the package build system from Python 3 to Python 2 (yay!), we have packages for Debian / Ubuntu versions at least as old as Ubuntu 12.04! If you want to try them out, add the PPA[1] to your system; if you're on Ubuntu 12.04, 12.10 or 13.04, you can only install `python-pyviennacl` (for Python 2), but if you're on Ubuntu 13.10 or 14.04, you can also install `python3-pyviennacl` (for Python 3). Oh, and everyone gets `pyviennacl-doc` (which contains the HTML docs). I'll set the version number to 1.0.0 once ViennaCL 1.5.0 is released, and after any remaining little bugs get ironed out. [1] https://launchpad.net/~tsmithe/+archive/pyviennacl/ Cheers, Toby
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hi, A short update: I've implemented linkage to CBLAS and cuBLAS with dynamic selection. If activated through VIENNACL_WITH_CUBLAS, one can go back and forth between cuBLAS and the original backend by doing:

    A.blas().gemm(NULL);
    A.blas().gemm(viennacl::backend::blas::cublas_functions<value_type>::gemm);

(and similarly for CBLAS.) There is some trickery going on with transpositions and layout, but it works for every transpose/layout combination. One can also link A's blas to one's own gemm function, provided a tiny wrapper (essentially to ensure signature compatibility). A very good piece of news is that this allows ViennaCL to work very well on very recent NVIDIA hardware, until our autotuning engine is fully operational. On my laptop, cublasSgemm is about 5 times faster than the current CUDA implementation, and 20% faster than the OpenCL kernel found by the autotuner (120 GFLOP/s vs 25 GFLOP/s vs 95 GFLOP/s). Also, linking with OpenBLAS leads to a HUGE performance boost on the CPU (0.02 GFLOP/s vs 70 GFLOP/s)...! A little question remains. For now, the behavior is really weird when one defines both VIENNACL_WITH_CBLAS and VIENNACL_WITH_CUBLAS. How to handle this? I am not very familiar with the multiple backends and I don't know to what extent they can be combined. Therefore, I see multiple options, but can't tell which one is better. 1 - trigger a preprocessor error when both macros are defined together 2 - slightly modify the API: A.cuda_blas(), A.host_blas(), A.cl_blas() I think that option 2 is better, considering that there are already cuda_handle(), opencl_handle(), cpu_handle() or something similar, if I'm correct. Any advice? Best regards, Philippe 2013/12/15 Philippe Tillet phil.til...@gmail.com Hi, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hi, Yeah, it certainly is a bit tedious. Feel free to only do this for matrix-matrix multiplications for now; a full operation table is presumably too much of a refactoring for ViennaCL 1.x.y, but much better suited for 2.0.0. Yes. It's actually a pretty complicated problem, because of the different signatures of the different BLAS functions... It seems like the cleanest way to do it would be using std::function and std::bind, which may indeed be widely available by the time ViennaCL 2.0.0 comes out. I hadn't seen this coming. The interfacing problem is just a matter of wrapping everything behind a common function interface and then using function pointers appropriately. C++11 is not an option for me for a few more years to come, mostly because this is the usual timeframe on large-scale clusters. (Our test system now includes a CentOS 5.10 machine with GCC 4.1.2...) Yep, sometimes reinventing the wheel makes sense because the car is too old :D Wouldn't a classic preprocessor directive, but with better BLAS support (as I have it implemented now: cpy, swap, asum, norm2, gemv, gemm), be more interesting feature-wise than a dynamic gemm-only dispatch, in the end? What would that look like? Do you mean a classic #ifdef? If right now we are only interested in GEMM, then yes, a simple static dispatch is enough. It just shouldn't start growing if we don't see this as the right way to go in the future. Oh, something like:

    #define VIENNACL_WITH_CBLAS

or

    #define VIENNACL_WITH_CUDA
    #define VIENNACL_WITH_CUBLAS

which would dispatch cpy, swap, asum, norm2, gemv, gemm (for the other ones, I think that the temporary saving of ViennaCL is beneficial) for float and double, and when the non-leading dimension of a matrix is strided.
I can add a set of more specific switches if necessary:

    #define VIENNACL_WITH_CUBLAS_GEMV
    #define VIENNACL_WITH_CUBLAS_GEMM

etc... Plus, it seems like the dynamic dispatch will be much more interesting in the context of ViennaCL 2.0.0, where more things will be dynamic, with possibly already kernel dispatch for the generator based on the input sizes (I'm thinking about it)... Absolutely. I think it's important to have directions for the future (being more dynamic is apparently one of them), but from the 1.5.0 delay I have learned the hard way that one should not start too many changes at the same time... ;-) Well, yes, I had the same problems on a couple of projects... However, kernel generation should be the main topic of my internship and my (hopefully) Ph.D., so I hope I'll have time for these things! Best regards, Karli Best regards, Philippe
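To illustrate the transposition trickery mentioned above: with the column-major (legacy) cuBLAS API, a row-major C = A.B is obtained by swapping the operands, without touching the data. A sketch, assuming packed row-major storage and device pointers:

    #include <cublas.h>  // legacy cuBLAS API of that era

    // Row-major C (m x n) = A (m x k) * B (k x n), all device pointers.
    // cuBLAS is column-major, so we compute C^T = B^T * A^T by swapping the
    // operands; the row-major buffers are reused unchanged.
    void sgemm_row_major_nn(int m, int n, int k, float alpha,
                            float const * A,   // lda = k (packed rows)
                            float const * B,   // ldb = n
                            float beta,
                            float * C)         // ldc = n
    {
      cublasSgemm('N', 'N', n, m, k, alpha, B, n, A, k, beta, C, n);
    }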
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hey, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hi again, While we're at it, let's discuss the dynamic dispatching mechanism we'd ideally want. I see two options: (1) A global function pointer table. So, one could for example set:

    viennacl::internal_blas::sgemv_ptr = viennacl::cblas_wrapper;

where cblas_wrapper essentially checks for the stride in the non-leading dimension and forwards to CBLAS if this stride is one. Of course, if the current backend is different, cblas_wrapper is not defined, and cublas_wrapper can be defined instead. I'd prefer to have this function table per object or per memory backend rather than global, otherwise this will sooner or later bite us in a multi-threaded setting. We (or a user) might want to use one implementation of a certain operation for smaller or skinny matrices and other implementations for larger/square matrices, in which case things are much easier if tied to the particular object. I agree. However, it seems to me that setting the implementation for each matrix would end up being tedious... One table per memory backend seems to make sense conceptually to me, since the performance (and the portability) of each BLAS implementation is determined by the underlying memory system. If there is no objection, I think I will go for that neat solution. Now, another question: how to set the default? I think that a preprocessor directive would be fine here. We already need the preprocessor's #ifdef to define the includes (and some wrappers) anyway. So using it to initialize that table seems reasonable to me (i.e. VIENNACL_WITH_CBLAS would enable some internal definitions and would initialize the table). Best regards, Philippe I like this solution a lot, since it allows one to mix multiple BLAS implementations in the same program. This can be useful in some cases (OpenBLAS is faster than MKL for BLAS3, but MKL is supposedly faster for all the rest). HOWEVER, this requires linkage if we want to avoid multiple definitions of that global pointer table. That's another reason why it shouldn't be global ;-) Since we now provide a libviennacl.so, though, we could include the global table therein, and one would link with it to use the additional functionality. Plus, if one has one's own BLAS function to benchmark against ours, for example, then this solution is very convenient. The shared library is available in addition to the header-only implementation; it's not compulsory. We might change that for ViennaCL 2.0.0, but 1.x.y will stay header-only. (2) A template parameter, so that one would write viennacl::prod<CBlasBackend>(...), similarly to what I did with UMinTL. However, I am not very fond of this solution for ViennaCL, because it will create huge bloat in the code, since templates essentially need to propagate, and it might screw up a bit the template deduction mechanism of some compilers (since prod is already templated with the underlying ScalarType...). Same here, I consider this to be a wrong use of templates for the reasons you mentioned. Fortunately we don't have to worry about performance for something tiny like 3x3 matrices, so a bit of runtime logic is not an issue. Best regards, Karli
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hi, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hey, I agree. However, it seems to me that setting the implementation for each matrix would end up being tedious... One table per memory backend seems to make sense conceptually to me, since the performance (and the portability) of each BLAS implementation is determined by the underlying memory system. If there is no objection, I think I will go for that neat solution. Yeah, it certainly is a bit tedious. Feel free to only do this for matrix-matrix multiplications for now; a full operation table is presumably too much of a refactoring for ViennaCL 1.x.y, but much better suited for 2.0.0. Yes. It's actually a pretty complicated problem, because of the different signatures of the different BLAS functions... It seems like the cleanest way to do it would be using std::function and std::bind, which may indeed be widely available by the time ViennaCL 2.0.0 comes out. I hadn't seen this coming. Wouldn't a classic preprocessor directive, but with better BLAS support (as I have it implemented now: cpy, swap, asum, norm2, gemv, gemm), be more interesting feature-wise than a dynamic gemm-only dispatch, in the end? Plus, it seems like the dynamic dispatch will be much more interesting in the context of ViennaCL 2.0.0, where more things will be dynamic, with possibly already kernel dispatch for the generator based on the input sizes (I'm thinking about it)... Best regards, Philippe Now, another question: how to set the default? I think that a preprocessor directive would be fine here. We already need the preprocessor's #ifdef to define the includes (and some wrappers) anyway. So using it to initialize that table seems reasonable to me (i.e. VIENNACL_WITH_CBLAS would enable some internal definitions and would initialize the table). Yes, that makes sense. The same is already done for the default backend: CUDA has priority over OpenCL, which has priority over the fall-back host implementation. The rationale is that the more specific an enabled backend is, the more likely it is that a user wants to use just that by default. This should apply equally well to a CBLAS interface. Best regards, Karli
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hi, 2013/12/15 Karl Rupp r...@iue.tuwien.ac.at Hi, Yeah, it certainly is a bit tedious. Feel free to only do this for matrix-matrix multiplications for now; a full operation table is presumably too much of a refactoring for ViennaCL 1.x.y, but much better suited for 2.0.0. Yes. It's actually a pretty complicated problem, because of the different signatures of the different BLAS functions... It seems like the cleanest way to do it would be using std::function and std::bind, which may indeed be widely available by the time ViennaCL 2.0.0 comes out. I hadn't seen this coming. The interfacing problem is just a matter of wrapping everything behind a common function interface and then using function pointers appropriately. C++11 is not an option for me for a few more years to come, mostly because this is the usual timeframe on large-scale clusters. (Our test system now includes a CentOS 5.10 machine with GCC 4.1.2...) Yep, sometimes reinventing the wheel makes sense because the car is too old :D Wouldn't a classic preprocessor directive, but with better BLAS support (as I have it implemented now: cpy, swap, asum, norm2, gemv, gemm), be more interesting feature-wise than a dynamic gemm-only dispatch, in the end? What would that look like? Do you mean a classic #ifdef? If right now we are only interested in GEMM, then yes, a simple static dispatch is enough. It just shouldn't start growing if we don't see this as the right way to go in the future. Oh, something like:

    #define VIENNACL_WITH_CBLAS

or

    #define VIENNACL_WITH_CUDA
    #define VIENNACL_WITH_CUBLAS

which would dispatch cpy, swap, asum, norm2, gemv, gemm (for the other ones, I think that the temporary saving of ViennaCL is beneficial) for float and double, and when the non-leading dimension of a matrix is strided. I can add a set of more specific switches if necessary:

    #define VIENNACL_WITH_CUBLAS_GEMV
    #define VIENNACL_WITH_CUBLAS_GEMM

etc... Plus, it seems like the dynamic dispatch will be much more interesting in the context of ViennaCL 2.0.0, where more things will be dynamic, with possibly already kernel dispatch for the generator based on the input sizes (I'm thinking about it)... Absolutely. I think it's important to have directions for the future (being more dynamic is apparently one of them), but from the 1.5.0 delay I have learned the hard way that one should not start too many changes at the same time... ;-) Well, yes, I had the same problems on a couple of projects... However, kernel generation should be the main topic of my internship and my (hopefully) Ph.D., so I hope I'll have time for these things! Best regards, Karli Best regards, Philippe
Re: [ViennaCL-devel] Linking ViennaCL (CUDA backend) to cuBLAS ...?
Hello, I've just realized that most BLAS implementations don't provide any way to do strided matrix accesses in the non-leading dimension...! Is this correct? I was hoping that we could have avoided such special cases, but it seems like a couple of tests will need to be made. Philippe 2013/12/14 Karl Rupp r...@iue.tuwien.ac.at Hey, Okay. I'll probably do it statically at first, and I'll keep in mind that we want it dynamic at the end of the day (well, not at the end of today :D). Once everything works statically, I think we can discuss the details of the API we want. Fine with me. This way we can first collect a bit of experience on some details which might be important for the runtime layer later. Best regards, Karli
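To be clear about what is and isn't expressible: the leading dimension does let standard BLAS address a sub-block of a larger buffer, e.g. with CBLAS (a sketch):

    #include <cblas.h>

    // What standard BLAS *can* express: an M x K sub-block A of a larger
    // rows_A-row column-major buffer, via lda = rows_A (and similarly for B
    // and C). What it *cannot* express is a stride between consecutive
    // elements within a column (inc1 > 1) -- there is no parameter for it,
    // hence the special-case fallback.
    void block_dgemm(int M, int N, int K,
                     double const * A, int rows_A,
                     double const * B, int rows_B,
                     double       * C, int rows_C)
    {
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                  M, N, K, 1.0, A, rows_A, B, rows_B, 0.0, C, rows_C);
    }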
[ViennaCL-devel] Generator's repmat API
Hello everybody, I am done implementing

    x = viennacl::reduce<op>(viennacl::rows(A));
    x = viennacl::reduce<op>(viennacl::cols(A));
    s = viennacl::reduce<op>(x);

in the generator. For now, the supported ops are: add, mult, max, min. I can't support them all, because I need to provide their neutral element for kernel generation (so that the shared memory can be initialized with the neutral element). I am now working on repmat. About this, I am not sure what the return type of the API function should be. I am planning to go for something like

    matrix_expression<matrix, viennacl::tuple<int, int>, op_repmat>(A, make_tuple(repsize1, repsize2))

where the tuple would get translated by the scheduler into a binary tree with operator OP_TUPLE. Does this sound reasonable? @Toby: There might be some changes of this type in the way the scheduler's expression tree is generated (for the needs of the kernel generator). I'll try to keep a list of the changes updated, so that the Python wrapper does not diverge too much from the core :) Philippe
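For reference, the semantics I have in mind for repmat, written as plain host code (this is only a sketch of the meaning, not the expression-template API):

    #include <cstddef>
    #include <vector>

    // repmat(A, r1, r2) tiles the m x n matrix A (row-major here, for
    // simplicity) into an (r1*m) x (r2*n) result.
    std::vector<double> repmat(std::vector<double> const & A,
                               std::size_t m, std::size_t n,
                               std::size_t r1, std::size_t r2)
    {
      std::vector<double> R(r1 * m * r2 * n);
      std::size_t cols = r2 * n;
      for (std::size_t i = 0; i < r1 * m; ++i)
        for (std::size_t j = 0; j < cols; ++j)
          R[i * cols + j] = A[(i % m) * n + (j % n)];
      return R;
    }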
Re: [ViennaCL-devel] implicit GPU-to-CPU scalar conversion of viennacl::scalar_expression...
Hello, I had not noticed that only the first reduction would be executed in this case, so my arguments were indeed invalid :) However, I am now even more worried than before ;) This makes the assumption that the two-stage reduction will always be the best way to compute an inner product on any OpenCL device. We want the reduction-based programs to be device-specific, so these sometimes-truncated operations will have to be forwarded somehow to the kernel generator, and therefore to the expression tree. Does it mean that we need an additional parameter in the statement which basically says "don't execute the last kernel!"? This would introduce a lot of complexity in the scheduler and the generator, for too little benefit imho. What about input-dependent kernels? For small inputs, where the second kernel would not be negligible, we would actually be better off performing the full reduction in one big work group. I think that, for small vectors, this is also more cache-efficient than the first kernel of the dual-reduction approach plus a final reduction on the CPU... This would preserve the benefit of saving one kernel launch, and at the same time integrate more smoothly within the scheduler/generator framework... Philippe 2013/10/27 Karl Rupp r...@iue.tuwien.ac.at Hi, Now that I'm back to some C++ coding, I want to finish the integration of viennacl::op_reduce. I've noticed a lot of different operator overloads for viennacl::scalar_expression, with basically different implicit conversions to a raw scalar. I'm a bit skeptical here :) This allows handling the (imho impractical) cases such as: cpu_scal = inner_prod(x1,x2)*5.0 This is *very* practical. Without implicit conversion, this would a) not work at all and require instead

    gpu_scalar = inner_prod(x1, x2);
    copy(gpu_scalar, cpu_scal);
    cpu_scal *= 5.0;

Clearly, this would not result in generic code at all... b) be less efficient: with the above, two reductions on the GPU are required in order to then copy a single value to the host. With the implicit conversion, this is just one reduction on the GPU, then a copy of the reduced values (no extra overhead, this is only latency-limited), and finally the reduction on the CPU at no significant cost. While the extra kernel launch does not really matter for large sizes, it is an issue for vector sizes between ~10k and ~500k, particularly for AMD and Intel accelerators where the latency is high(er). I think that such expressions should be forbidden. I think that every conversion involving host-device data movement should be explicit, since they trigger a flush of the scheduler's queue. Furthermore, we are heading towards multi-device computations, and these implicit conversions will then become even more troublesome: an implicit inner_prod-to-scalar conversion would then need to sum the results obtained on each device... Hmm, I don't see a reason why this should not work for multi-device scenarios... Basically, I think that we should forbid any implicit conversion other than the viennacl::scalar<T> -> T one... Do you agree? I don't want to give away the benefit of saving one kernel launch for reduction operations when the result is needed on the host... This would force rewriting the examples above as:

    gpu_scal = (vcl_scal1 + vcl_scal2)*5.0;
    cpu_scal = gpu_scal;

which is, I think, more explicit and efficient than the previous approach :) For pure scalar operations there is no chance of getting any efficiency out of it. Yes, it is more explicit, but at the same time less convenient.
Overall, we would trade convenience for ... what? ;-) Simpler implementation? Best regards, Karli
Re: [ViennaCL-devel] implicit GPU-to-CPU scalar conversion of viennacl::scalar_expression...
Hi hi, 2013/10/27 Karl Rupp r...@iue.tuwien.ac.at Hi, This makes the assumption that the two-stage reduction will always be the best way to compute an inner product on any OpenCL device. We want the reduction-based programs to be device-specific, so these sometimes-truncated operations will have to be forwarded somehow to the kernel generator, and therefore to the expression tree. Does it mean that we need an additional parameter in the statement which basically says "don't execute the last kernel!"? This would introduce a lot of complexity in the scheduler and the generator, for too little benefit imho. You are right, this is indeed a bit tricky. There is preparation for this case already in the 'standard' vector kernels, where each GPU scalar argument may include an additional 'mini reduction' before computing the actual operation. However, this functionality is currently unused. The motivation for this was operations of the type z = inner_prod(u,v) * w; where the second reduction could go into the z = alpha * w assignment. Oh I see :) When the kernels are generated, this is actually what happens, i.e. z = inner_prod(u,v) * w leads to two kernels. What about input-dependent kernels? For small inputs, where the second kernel would not be negligible, we would actually be better off performing the full reduction in one big work group. I think that, for small vectors, this is also more cache-efficient than the first kernel of the dual-reduction approach plus a final reduction on the CPU... This would preserve the benefit of saving one kernel launch, and at the same time integrate more smoothly within the scheduler/generator framework... Yes, I thought about that already. I think we don't need separate kernels, only a proper kernel calling logic. What is quite tricky is to get the cross-over point right, because that depends not only on the device performance, but also on the latency, which is OS-specific. Ah... This gets tricky indeed. Are there any measures of how the OS affects the latency? Specifically, if the OS-dependence is independent from the device-dependence, there should be static ways out of this mess... Another simple way out is to have a reasonable cross-over size value, and to integrate such platform-specific information into the autotuning software. Then, the user could override the default for optimal results, using a #define typically... until we are able to interact at runtime with the autotuner's results (using some IO mechanism). Best regards, Karli Philippe
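Concretely, the kernel calling logic could look like this (a sketch; the crossover value and the override macro are pure assumptions - in practice they would come from the autotuner or a user #define):

    #include <cstddef>

    enum reduction_strategy { single_workgroup, two_stage };

    // Size-based dispatch for inner products: below the crossover, one work
    // group does the whole reduction in a single kernel launch; above it,
    // the classical two-stage reduction wins.
    inline reduction_strategy pick_reduction(std::size_t vector_size)
    {
    #ifdef VIENNACL_REDUCTION_CROSSOVER   // hypothetical user override
      std::size_t const crossover = VIENNACL_REDUCTION_CROSSOVER;
    #else
      std::size_t const crossover = 10000;  // assumed placeholder value
    #endif
      return vector_size < crossover ? single_workgroup : two_stage;
    }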
[ViennaCL-devel] implicit GPU-to-CPU scalar conversion of viennacl::scalar_expression...
Hello, Now that I'm back to some C++ coding, I want to finish the integration of viennacl::op_reduce. I've noticed a lot of different operator overloads for viennacl::scalar_expression, with basically different implicit conversions to a raw scalar. I'm a bit skeptical here :) This allows handling the (imho impractical) cases such as:

    cpu_scal = inner_prod(x1,x2)*5.0

or

    cpu_scal = (vcl_scal1 + vcl_scal2)*5.0

I think that such expressions should be forbidden. I think that every conversion involving host-device data movement should be explicit, since they trigger a flush of the scheduler's queue. Furthermore, we are heading towards multi-device computations, and these implicit conversions will then become even more troublesome: an implicit inner_prod-to-scalar conversion would then need to sum the results obtained on each device... Basically, I think that we should forbid any implicit conversion other than the viennacl::scalar<T> -> T one... Do you agree? This would force rewriting the examples above as:

    gpu_scal = (vcl_scal1 + vcl_scal2)*5.0;
    cpu_scal = gpu_scal;

which is, I think, more explicit and efficient than the previous approach :) Philippe
Re: [ViennaCL-devel] Adding op_element.subfamily_type into the scheduler
A clearer classification:

    OPERATION_FUNCTION_SUB_TYPE_FAMILY (norm, prod, inner_prod, etc...)
    OPERATION_ELEMENT_FUNCTION_SUB_TYPE_FAMILY (abs, pow, etc...)
    OPERATION_ELEMENT_OPERATOR_SUB_TYPE_FAMILY (+, ==, <, etc...)

Philippe 2013/10/18 Philippe Tillet phil.til...@gmail.com Hello, Currently, there are only two families: UNARY_FAMILY and BINARY_FAMILY. In the generator, I have to do silly checks using giant ORs to check whether the operator is a product operator, an elementwise operator, an elementwise function... I'm thinking about introducing a subfamily_type, which contains this information. Does this sound reasonable to you? Philippe
Re: [ViennaCL-devel] Adding op_element.subfamily_type into the scheduler
Hey, While we're at it: I'm implementing reductions now. There are two options here:

    template<class OP, class VectorType>
    scalar_expression<VectorType, OP, reduce_type>
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, OP, reduce_type>(v, OP());
    }

or

    template<class OP, class VectorType>
    scalar_expression<VectorType, VectorType, reduce_type<OP> >
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, VectorType, reduce_type<OP> >(v, v);
    }

The first one is scheduler-friendly, but the second one is more metaprogramming-friendly. I'm really confused here; any advice on which one to choose? I would prefer the first one, since handling the second one in the scheduler would clearly be a pain, considering that I don't want to introduce REDUCE_ADD, REDUCE_MAX, etc., but just a single REDUCE operator, and reuse the existing ones. I think several other similar cases will arise, such as three-argument functions, where it is preferable to create the appropriate tree structure directly from the function call. Am I right? Philippe 2013/10/18 Karl Rupp r...@iue.tuwien.ac.at Hey, OPERATION_FUNCTION_SUB_TYPE_FAMILY (norm, prod, inner_prod, etc...) OPERATION_ELEMENT_FUNCTION_SUB_TYPE_FAMILY (abs, pow, etc...) OPERATION_ELEMENT_OPERATOR_SUB_TYPE_FAMILY (+, ==, <, etc...) I assume they are all within the same enum - go for it :-) Best regards, Karli
Re: [ViennaCL-devel] Adding op_element.subfamily_type into the scheduler
Okay, this approach has a problem at the OP() stage, because *_expression will store a reference to a temporary object, and because it creates problems for the *element* part of the statement. On the other hand, scalar_expression<VectorType, VectorType, reduce_type<OP> >(v, v) would need to be converted to the same end-tree anyway, which will lead to the same problem inside the statement... Philippe 2013/10/18 Philippe Tillet phil.til...@gmail.com Hey, While we're at it: I'm implementing reductions now. There are two options here:

    template<class OP, class VectorType>
    scalar_expression<VectorType, OP, reduce_type>
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, OP, reduce_type>(v, OP());
    }

or

    template<class OP, class VectorType>
    scalar_expression<VectorType, VectorType, reduce_type<OP> >
    reduce(VectorType const & v)
    {
      return scalar_expression<VectorType, VectorType, reduce_type<OP> >(v, v);
    }

The first one is scheduler-friendly, but the second one is more metaprogramming-friendly. I'm really confused here; any advice on which one to choose? I would prefer the first one, since handling the second one in the scheduler would clearly be a pain, considering that I don't want to introduce REDUCE_ADD, REDUCE_MAX, etc., but just a single REDUCE operator, and reuse the existing ones. I think several other similar cases will arise, such as three-argument functions, where it is preferable to create the appropriate tree structure directly from the function call. Am I right? Philippe 2013/10/18 Karl Rupp r...@iue.tuwien.ac.at Hey, OPERATION_FUNCTION_SUB_TYPE_FAMILY (norm, prod, inner_prod, etc...) OPERATION_ELEMENT_FUNCTION_SUB_TYPE_FAMILY (abs, pow, etc...) OPERATION_ELEMENT_OPERATOR_SUB_TYPE_FAMILY (+, ==, <, etc...) I assume they are all within the same enum - go for it :-) Best regards, Karli
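To spell out the lifetime problem with the first option (a minimal sketch, not the actual ViennaCL class):

    // Expression templates store references to their operands. In option (1),
    // the RHS is the temporary OP() created inside reduce(), so rhs_ dangles
    // as soon as the full expression has been assembled:
    template<class LHS, class RHS, class OpTag>
    struct scalar_expression
    {
      scalar_expression(LHS const & lhs, RHS const & rhs)
        : lhs_(lhs), rhs_(rhs) {}
      LHS const & lhs_;
      RHS const & rhs_;  // binds to the temporary OP() -> dangling reference
    };

    // In option (2), both stored references point to v, whose lifetime is
    // controlled by the caller, so nothing dangles -- at the price of a
    // redundant second operand.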
[ViennaCL-devel] Common base for implicit_vector_base and vector_base...makes sense?
Hi, It seems like the behavior of scalar_vector, unit_vector etc. has changed a bit since the appearance of the kernel generator. I am currently extending the API of the generator with relational operators. I want to design a specific kernel which checks for X[i] < 0.42, for all i. Since operator< would be misleading, I am using the more verbose but clearer approach:

    element_less_than(X, scalar_vector<NumericType>(X.size(), 0.42));

The verbosity is a problem, but a minor one, so I will ignore it for now (I think end users can live with that :P). Anyway, my problem is that the VIENNACL_GENERATE_BINARY_ELEMENTOPERATION_OVERLOADS macro, which generates functions and function overloads for element_*, does not handle implicit_vector. For now, I can just add the proper overloads. But my problem is actually that, right now, implicit_vector's meaning has diverged with the use of the kernel generator.

    scalar_vector<float> v(N, 0);
    x += v;
    y = element_less_than(x, v);

makes perfect sense when using the OpenCL backend, but does not make sense with the OpenMP and CUDA backends. How to handle this divergence? It does look extremely complicated to me. Philippe
Re: [ViennaCL-devel] Common base for implicit_vector_base and vector_base...makes sense?
Hi hi, 2013/10/16 Karl Rupp r...@iue.tuwien.ac.at Hi, It seems like the behavior of scalar_vector, unit_vector etc. has changed a bit since the appearance of the kernel generator. I am currently extending the API of the generator with relational operators. I want to design a specific kernel which checks for X[i] < 0.42, for all i. Since operator< would be misleading, I am using the more verbose but clearer approach: element_less_than(X, scalar_vector<NumericType>(X.size(), 0.42)). The verbosity is a problem, but a minor one, so I will ignore it for now (I think end users can live with that :P). Most of all, it is consistent with element_prod(), element_div(), and friends. It may be a bit verbose, yes, but it could be way worse ;-) Yes, I think that Eigen's approach of having proxy objects (arrays) for elementwise operations, x.array() * y.array(), is a lot of work for barely any gain, so I'm clearly in favor of keeping things unambiguous and simple, albeit slightly verbose :) Anyway, my problem is that the VIENNACL_GENERATE_BINARY_ELEMENTOPERATION_OVERLOADS macro, which generates functions and function overloads for element_*, does not handle implicit_vector. For now, I can just add the proper overloads. But my problem is actually that, right now, implicit_vector's meaning has diverged with the use of the kernel generator. scalar_vector<float> v(N, 0); x += v; y = element_less_than(x, v); makes perfect sense when using the OpenCL backend, but does not make sense with the OpenMP and CUDA backends. How to handle this divergence? It does look extremely complicated to me. Since we don't know until runtime which backend is in use, the only clean approach is to throw an exception for cases where there is no implementation in the other backends. Rather than introducing yet another base class, what about allowing implicit vectors in vector_base by suitable constructor arguments? This will also keep compilation times under control :-) I'm a bit confused; this solution would then allocate memory in the case of element_less_than(X, vector<NumericType>(scalar_vector<NumericType>(X.size(), 0.42))), wouldn't it? If I want to normalize a vector by subtracting a constant c, simply writing

    y = x - scalar_vector<NumericType>(x.size(), c);

results in a single OpenCL kernel, and more importantly only N reads instead of 2N. In my opinion, it would be a bit sad to remove this functionality, but on the other hand I have no intention of duplicating all the operator overloads for implicit_vector_base and implicit_matrix_base :P I also thought about using enable_if to check for vector_base or implicit_vector_base (or only vector_base #ifndef VIENNACL_WITH_OPENCL), but I'm a bit afraid of the consequences on compilation time, so I thought that providing a common base class in the OpenCL case would be a good solution, wouldn't it? Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] Common base for implicit_vector_base and vector_base...makes sense?
Hey hey, Well, the main problem I have with incorporating implicit_vector_base inside vector_base is that this sounds like replacing inheritance with switches on an enum :P However, I think I have found a solution which will satisfy both of us: viennacl::vector_base already has this constructor:

    explicit vector_base(size_type vec_size, viennacl::context ctx = viennacl::context())

Actually, what I want to do is make implicit_vector_base inherit from vector_base and use this constructor. I don't really know why I thought about a common base rather than this much better approach. Sleepily, Philippe 2013/10/17 Karl Rupp r...@iue.tuwien.ac.at Hey, After thinking more about it, I see a conceptual flaw in that approach: since implicit_vector cannot be used as an l-value, while vector_base can, it would lead to very misleading code, where implicit_vectors would have (empty, or exception-throwing) operator overloads... The risk here is that vector_base would become a holdall. Well, you can disable operator= in all symbolic types directly, thus preventing any l-value problems :-) What is so bad about having a common base which would hold the size and the OpenCL context, and then two separate sub-classes? Am I missing something? It is the combinatorial explosion for tests and such. Already now we have to split up the tests into one per numeric type to keep memory consumption under control. When using {implicit_vector_base, vector_base} \times {implicit_matrix_base, matrix_base}, compilation times for the tests will grow by another factor of four. This is a clear disadvantage of having two separate base classes. On the other hand, I don't see a real advantage. Maybe I miss something? Best regards, Karli
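In code, the idea is roughly this (a heavily simplified sketch; everything here is hypothetical except the size-only construction discussed above):

    #include <cstddef>

    // Heavily simplified: the real vector_base also takes a viennacl::context.
    template<typename NumericT>
    class vector_base
    {
    public:
      explicit vector_base(std::size_t size) : size_(size) {} // no allocation
      std::size_t size() const { return size_; }
    private:
      std::size_t size_;
    };

    // implicit_vector_base reuses the size-only constructor, so no buffer is
    // ever allocated; it merely carries the implicit value.
    template<typename NumericT>
    class implicit_vector_base : public vector_base<NumericT>
    {
    public:
      implicit_vector_base(std::size_t size, NumericT value)
        : vector_base<NumericT>(size), value_(value) {}
      NumericT value() const { return value_; }
    private:
      NumericT value_;
      // operator= deliberately not provided: implicit vectors are not l-values.
    };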
Re: [ViennaCL-devel] IRC meeting on Friday, 15:00 UTC?
Hey, I'll be there! Philippe 2013/10/2 Karl Rupp r...@iue.tuwien.ac.at Hi guys, we haven't had an IRC meeting for quite a while now. I'm finally done with most of my relocation from the US back to Austria, so I propose to have our next IRC meeting on Friday, October 4, at 15:00 UTC. Is this okay for everybody interested in joining? Potential topics: - Final things to be completed for the upcoming releases of ViennaCL 1.5.0 and PyViennaCL (1.0.0?) - Roadmap towards 1.6.0 - Defining GUI functionality for Autotuning Best regards, Karli
Re: [ViennaCL-devel] IRC meeting on Friday, 15:00 UTC?
Oh, yes, what do you guys think about adding another topic, more specifically how PyViennaCL compares to the alternative solutions, in particular Theano? It is a question I have actually already been asked (by a former colleague of one of the Theano creators...), and we will sooner or later have to provide a fair comparison in order to orient the scientists who are looking for a high-level GPGPU solution. Philippe 2013/10/3 Toby St Clere Smithe m...@tsmithe.net Yep, so will I. Toby Philippe Tillet phil.til...@gmail.com writes: Hey, I'll be there! Philippe 2013/10/2 Karl Rupp r...@iue.tuwien.ac.at Hi guys, we haven't had an IRC meeting for quite a while now. I'm finally done with most of my relocation from the US back to Austria, so I propose to have our next IRC meeting on Friday, October 4, at 15:00 UTC. Is this okay for everybody interested in joining? Potential topics: - Final things to be completed for the upcoming releases of ViennaCL 1.5.0 and PyViennaCL (1.0.0?) - Roadmap towards 1.6.0 - Defining GUI functionality for Autotuning Best regards, Karli
[ViennaCL-devel] Incorporating reductions in ViennaCL
Hi everybody :) Okay, so in the roadmap I've added reductions support for ViennaCL 1.6... I plan to take care of it for the three backends, but there are several things to consider here. For now, I will call them reduce, reduce_rows, reduce_cols. A convenience layer such that reduce(mat.rows()) or reduce(mat.cols()) would be better, but this is a completely different problem :P From a certain point of view, if one_vector is a vector full of ones (optimized away at compile time), we have the following equivalences: - reduce<OP>(lhs) uses the same kernel as inner_prod(lhs, one_vector), except that the reduction operator is OP and not ADD. - reduce_rows<OP>(mat) uses the same kernel as prod(mat, one_vector), except that the reduction operator is OP and not ADD. - similarly, reduce_cols<OP>(mat) uses the same kernel as prod(trans(mat), one_vector). While there is a slight conceptual difference in the kernels (the gemv kernel has to take care of the reuse of the vector data, the reduce kernel doesn't), they do show very strong similarities... I see two options here: (1) - Ignore that slight conceptual difference, and use the same kernel/backend. This makes sense imho, because the reuse of the vector data doesn't matter much (the vector is orders of magnitude smaller in memory than the matrix). We delegate reduce_impl<OP>(vec) to a generalized inner_prod_impl<OP>(vec, one_vector()) (the standard inner product being the OP = ADD case). (2) - Have specific reduce_impl, reduce_rows_impl and reduce_cols_impl implementations. I am clearly against this, which would lead to a lot of duplication for not much gain, but well, I bring it up for the sake of Descartes' systematic doubt :) Best regards, Philippe
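The second equivalence, spelled out as plain host code (a sketch; OP = ADD shown, other OPs would swap the accumulation operator inside the kernel):

    #include <cstddef>
    #include <vector>

    // reduce_rows<ADD>(mat) for a row-major m x n matrix is exactly
    // prod(mat, one_vector): a matrix-vector product against a vector of ones.
    std::vector<double> reduce_rows_add(std::vector<double> const & mat,
                                        std::size_t m, std::size_t n)
    {
      std::vector<double> ones(n, 1.0), result(m, 0.0);
      for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
          result[i] += mat[i * n + j] * ones[j];  // the gemv access pattern
      return result;
    }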
Re: [ViennaCL-devel] Auto-Tuner, GEMM, GEMV... : Integrating RaijinCL into the generator
Hi hi, 2013/8/30 Karl Rupp r...@iue.tuwien.ac.at Hi Philippe, About 6 months ago I had heard of a library that also performs autotuning (http://raijincl.org), but that offered the same performance as ours back then. Since then, its performance has *greatly* improved, largely outperforming our autotuner: - Over 3 TFLOP/s on the HD7970 - Over 1.3 TFLOP/s on the HD5850 - About the same performance as cuBLAS on Kepler and Fermi. A technical report is available here: http://www.sable.mcgill.ca/publications/techreports/2013-1/sable-tr-2013-1.pdf The code is open-source, released under the Apache License. It would be silly imho to keep struggling to improve our autotuner when there seems to be a better one already out there. Plus, open source is more about collaboration than competition. The difference, however, is that while RaijinCL focuses on raw GEMM and GEMV autotuning, ViennaCL's generator focuses on temporaries removal. I am a bit confused, however, as to how to merge the two works, i.e. using temporary removal along with RaijinCL's profiles/autotuner. What exactly are the restrictions on using Apache-licensed code in MIT-licensed code? I know that both licenses are permissive, but I don't know the details... When merging two licenses, the rules are simple: the more restrictive license applies. In order not to taint the current MIT license of ViennaCL, I'm thus considerably more concerned about integrating RaijinCL than you are. Since our generator is skeleton-based anyway, what about having a look at the best performing kernels in RaijinCL and then extending the current generator accordingly such that these kernels are covered as well? I consider this to be *far* less painful than trying to merge in RaijinCL - as you certainly know, it's not that easy to 'just interface with a kernel generator', particularly if this is supposed to happen at runtime and in a reliable way. Even just within ViennaCL this took us (at least) three iterations to come up with a model that is useful in practice... Yes, probably. Plus, we don't need all of RaijinCL's functionality (images, for example). I have made contact with Rahul (the author of RaijinCL). I just want to make sure that RaijinCL gets the credit it deserves (3 TFLOP/s on the HD7970 is a lot!), and maybe join our expertise to get even better performance :) Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] Call to those with an NVidia GeForce Kepler graphic card : autotuning
Hi Evan, Thanks for your answer! Thanks to Denis Demidov, we already have some (disappointing... 950 GFLOP/s on SGEMM, 450 GFLOP/s on DGEMM) results for the Tesla K20! I am actually looking for a GeForce 780, to see whether the problem is specific to the GK110 or not... A slightly less high-end GPU such as a GTX 660, 670, 680... would be ideal. If the performance is there, the next release will offer pretty good performance on these particular chips. Philippe 2013/8/19 Evan Bollig bol...@gmail.com Philippe, I have you covered: Kepler K20. Let me know what you need. -Evan Bollig On Aug 19, 2013 4:14 PM, Philippe Tillet phil.til...@gmail.com wrote: Hello everybody, To provide good default GEMM kernels for the Kepler architecture, I need the help of the community! :) I'm looking for someone with an NVidia GeForce Kepler graphics card... If there is such a person here, would he/she be willing to run a small GEMM autotuning program? I will add detailed instructions if someone is up for it! Thanks and best regards, Philippe
[ViennaCL-devel] OpenCL to CUDA kernel translation
Hey everyone, It seems to me that most of the differences between CUDA and OpenCL come from the respective APIs, and that the kernel code is very similar in the two cases. Do you guys think it's possible to easily translate the generated kernels from OpenCL to CUDA by just doing one-to-one replacements of the keywords (__local -> __shared__, __global -> __device__, ...), or is there any particular difficulty I've missed? Best regards, Philippe
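As a rough illustration of the idea, a purely textual pass could look like the sketch below. This is not ViennaCL code; the replacement table, its ordering, and the function name are my own assumptions, and work-item functions (get_global_id(), ...) would still need real rewriting:

    // Hypothetical sketch of a keyword-level OpenCL -> CUDA source translation.
    // Order matters: '__global' must be removed before the '__kernel' rule
    // introduces '__global__' into the output.
    #include <regex>
    #include <string>
    #include <utility>
    #include <vector>

    std::string opencl_to_cuda(std::string src)
    {
      std::vector<std::pair<std::string, std::string>> const table = {
        { "__global",   "" },            // pointer address-space qualifier is implicit in CUDA
        { "__kernel",   "__global__" },
        { "__local",    "__shared__" },
        { "__constant", "__constant__" },
        { "barrier\\(CLK_LOCAL_MEM_FENCE\\)", "__syncthreads()" }
      };
      for (auto const & kv : table)
        src = std::regex_replace(src, std::regex(kv.first), kv.second);
      return src;
    }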
Re: [ViennaCL-devel] Scheduler progresses
Hi, 2013/8/16 Karl Rupp r...@iue.tuwien.ac.at Hi guys, the scheduler for kernel fusion is making good progress. Toby, you should be able to use all of the fundamental dense linear algebra operations now. There should only be two blocks of functionality missing: - Sparse matrices (i.e. matrix-vector products) - Some cases where += and -= may not work (e.g. matrix-vector product). Compilation times are moderate, but there is also some room for improvement left. Matrix-matrix products are unnecessarily heavy on the compiler. The good news for today is that things are finally growing together: via the scheduler, Toby can make the fast kernels from Philippe's generator available to the Python community :-) Thanks Karl! On my side, a minor improvement of 5% on the GEMM kernels, resulting in an additional 100 GFLOP/s on the HD7970. I have also reverted to CUDA 4.0 to have access to the visual profiler, whose OpenCL compatibility was removed in CUDA 5... ($@^§), so I will see what I can do for the kernels (probably after the 1.5 release) :) Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] Compilation load of matrix-test-*
Hi Karl, I've just realized I had forgotten to answer! My computer is no longer laggy in single-threaded mode, which is already a good thing :) It still cannot bear make -j4 even though it has 4 GB of RAM; my desktop computer can without any issue, though. I'll update this when I have cleaned and reinstalled my system in a few days :P Best regards, Philippe 2013/8/2 Karl Rupp r...@iue.tuwien.ac.at Hi Phil, the tests are now split into more light-weight units by separating single and double precision. matrix-test was additionally split into row-major and column-major tests. This should now allow you to build with `make -j4` on weaker machines with limited RAM. Best regards, Karli On 08/01/2013 08:35 PM, Philippe Tillet wrote: Hi everybody, I have had trouble compiling matrix-test-* for quite some time, and it has gotten worse over time. The compilation process appears to eat up one core at 100% (I have a Core i5!) and over 1 GB of RAM, which is enough to freeze my computer for 20-25 seconds. I have the same problem with the other matrix-test-* benchmarks. I went completely crazy and turned -j4 on, which totally froze my computer and forced me to hard reboot :D Anyway, I am running gcc 4.7 (the default Ubuntu 13.04 version). Is anybody else experiencing similar issues? Best regards, Philippe
[ViennaCL-devel] On Autotuning GEMM
Hey everybody, For a few days, I've been playing around with AMD's CodeXL, the HD5850 and the generator/autotuner:

- First of all, I want to share something that made me completely crazy. Avoid

    vector += scalar*vector

in a compute-bound context. After replacing the above with

    vector.s0 += scalar*vector.s0
    vector.s1 += scalar*vector.s1
    vector.s2 += scalar*vector.s2
    vector.s3 += scalar*vector.s3

performance jumped from 900 GFLOP/s to 2.3 TFLOP/s on the HD7970 (which is of the same order of magnitude as the best existing kernel so far, presented by Matsumoto et al.). Only a ~10% improvement on the HD5850, though. It seems the AMD OpenCL compiler does not properly translate the first operation. A more optimistic view is that it does a very good job at translating the second one :)

- I can make my HD5850 peak at ~920 GFLOP/s, which is around 45% of the theoretical peak. Some people in the literature managed to get 75% out of an HD5870 (they reach ~2 TFLOP/s out of ~2.8 TFLOP/s), which is truly impressive. They had to use some assembly-like language, though. This is because the HD5xxx series uses VLIW bundles of 5 instructions, not 4. CodeXL shows ALUPacking = 80.59%, which is in my opinion a direct consequence of packing instructions 4 by 4 instead of 5 by 5. It seems to me that this problem has more to do with the OpenCL compiler than with my code. Since the autotuner can find a spot with a 95% cache hit rate and 0% local memory conflicts, I assume that the problem in the kernel comes from the way the ALUs are used, rather than from bandwidth issues. Does anybody know if some other architectures use unusual VLIW lengths? AMD KernelAnalyzer gives the ISA output for the HD5850, but I am not experienced enough to make sense of it.

- Very weird behavior: the initial kernel for C = A*B was something like:

    __kernel void gemm(uint M, uint K, uint N,
                       __global float4* A, __global float4* B, __global float4* C){
      uint Mv = M/4; // internal size of A, which is row-major
      uint Nv = N/4; // same thing for B
      //...
      // use Mv and Nv to compute addresses into A and B, rather than M and N
    }

When replacing it by

    __kernel void gemm(uint M, uint K, uint N,
                       __global float4* A, __global float4* B, __global float4* C){
      // use inline M/4 and N/4 to compute addresses into A and B, rather than M and N
    }

I got a ~10% performance improvement on the HD5850, and no change on the HD7970. Don't ask me why. I actually think registers are a very precious resource on AMD devices. Since the computation of M/4 and N/4 occurs pretty rarely, it seems to me that saving a register is usually the better choice in these cases. Furthermore, since all these registers are probably taken from the vector register pool, it may be that a uint occupies a whole 128-bit wide register. I am not sure, though.

- A loop such as for(unsigned int bs = 0 ; bs < 32 ; ++bs) seems not to be unrolled by default. Adding #pragma unroll 32 improves performance on NVidia hardware (it almost doubles it), but kills it by a factor of 10 on AMD hardware, for the GEMM case at least. I am confused about it. More on this later if I find an answer to that mystery. If not, I'll just enable full #pragma unroll by default, and disable it on AMD hardware.

=== ON THE AUTOTUNING PROCEDURE ===

Well... While OpenCL is mostly guaranteed to be thread-safe (except for clSetKernelArg, which is thread-safe only as long as we set arguments for different kernels in parallel), it seems that parallel compilations are processed serially.
I observed this behavior when compiling multiple programs in the same context, but someone else observed it using different contexts, etc.: http://stackoverflow.com/questions/14544802/threading-opencl-compiling . Since compilation is a bottleneck of the autotuner (at least when the matrix size is 1024*1024... more on this later), it seems to me that parallelizing it would be worthwhile. In the end, I thought the simplest way to handle the problem is to partition the search space and pass a partition index as an argument. That way, for a 4-way partitioning:

    ./blas3_tuning 0
    ./blas3_tuning 1
    ./blas3_tuning 2
    ./blas3_tuning 3

We may observe some speed-up (since the above Stack Overflow link reports that using fork() resolves the issue). Or maybe we should use fork internally? Does anyone know if make -j 4 uses fork or multi-threading? We could have, for example, some ./blas3_tuning -j 4, as in the sketch below. However, for big matrix sizes, the tuning time seems to be dominated by the execution of the crappy kernels...

There are still quite a few things I need to do before talking about the autotuning procedure itself :) Best regards, Philippe
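A minimal sketch of what such an internal fork-based driver could look like (POSIX-only; the tune_partition helper and the partitioning scheme are hypothetical, not the actual blas3_tuning interface):

    // Hypothetical sketch: run the autotuning search space in N child processes,
    // since OpenCL program compilation appears to serialize across threads but
    // not across processes.
    #include <sys/wait.h>
    #include <unistd.h>

    void tune_partition(int partition, int num_partitions);  // assumed: benchmarks every
                                                             // profile whose index % num_partitions == partition

    int main()
    {
      int const num_partitions = 4;
      for (int p = 0; p < num_partitions; ++p)
      {
        pid_t pid = fork();
        if (pid == 0)              // child: handle one slice of the search space
        {
          tune_partition(p, num_partitions);
          _exit(0);
        }
      }
      while (wait(NULL) > 0)       // parent: wait for all children to finish
        ;
      return 0;
    }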
Re: [ViennaCL-devel] Kernel Generator wrap-up
Hi again! The generator code is pushed on the master branch. 2013/7/28 Karl Rupp r...@iue.tuwien.ac.at Hey, My preferred option is to pad by default and either to make the padding a multiple of four or sixteen. However, we need to maintain a full set of unpadded operations, because user-provided buffers need not be padded (and a subsequent padding may be too expensive). I think always making it a multiple of 16 is a good option, because we can reasonably assume that optimal performance is rarely obtained when a work item performs (unrolls) more than 16*16 operations, on most of the kernels. However, we have to have a clear and easily extensible dispatch mechanism that dispatches some sizes to some specific kernel, which is what I was talking about. Best {m, k, n} big block sizes for the GEMM kernel (row-major * row-major):

- AMD: 16 * 64 * 256
- NVidia: 16 * 128 * 128
- Intel CPU: 64 * 64 * 128

I expect this to also be dependent on the hardware generation. The best approach that comes to my mind is to introduce some hardware descriptor, which provides nicely preprocessed information from the OpenCL backend. A rather simple

- Vendor: [AMD, INTEL, NVIDIA, ...]
- Type: [CPU, GPU, MIC, etc.]
- Generation: [Southern Islands, Fermi, Kepler, ..., UNKNOWN]

should give us enough dispatch possibilities for the hardware 'out there'. If the detection of the hardware generation fails, we just use some compatibility kernel (and eventually ask the user to submit hardware information when running the tuner). Right. I'll do that :) Of course, it is bound to be device-specific rather than vendor-specific, and once the autotuning procedure works better we might have block sizes such as 96, 112, etc. Furthermore, for the kernel to be correct, each size has to be a multiple of the block size (3 constraints). We can never expect the user to call the kernel with the proper sizes. Problem: the padding in ViennaCL is static, while this block size is only known at runtime... Should we just write somewhere in the documentation what the best kernels are? The padding is no longer 'static'. The 'ALIGNMENT' template parameter is now ignored (vector_base no longer holds an ALIGNMENT parameter), so we can introduce a runtime padding without breaking old code. Thus, we can pick a proper padding entirely at runtime, tailored to the underlying device. Oh, true. This padding has to be the smallest one compatible with all profiles, some sort of least common multiple, which I hope is not going to grow ridiculously big... Even though the number of possible kernel variations is large (though finite), there's only a limited set which actually gives good performance. These are the important kernels to be tested thoroughly. Yes, but this limited set is device/program-specific, and it is hard to know in advance (that's what autotuning is for). I don't think anyone could tell me explicitly which combination of {alignment, ml, kl, nl, ms, ks, ns, use_lhs_shared, use_rhs_shared, unroll} gives good performance ;) And even if I choose only two values for each parameter, it leads to 2¹⁰ = 1024 tests per layout/transposition combination, i.e. 32 768 tests in total, which is ridiculously high :D What about integrating the test procedure into the autotuning procedure? It's not intuitive but I see no better way. Yes, a good autotuning procedure should verify the correctness of the results obtained anyway. There may be compiler or hardware bugs which can lead to fast, but erroneous kernels.
A two-stage scheme seems best here:
- First, find the fastest kernel (either without checking, or just checking for a particular size).
- Second, verify this kernel for a couple of different sizes. If this fails, pick the next kernel, etc.

Ok, I'll do that (see the sketch below). However, there are things to test in the way the generator behaves, rather than the profiles. All the operations in tests/vector.cpp have to be compatible with the generator. Should the corresponding tests be in the same vector.cpp file (in some #ifdef VIENNACL_WITH_OPENCL) or should they be in a separate file? Sooner or later we will have to go for the runtime option anyway. I don't see any benefit in being overly pessimistic with 16 kB if we have the true local memory size available at runtime. Right, it's not over-complicated to do. The problem is more about knowing the right optimization profile used at runtime (the local memory used by the to-be-compiled kernel). Ok, it means that this optimization profile should not change (since I think we cannot really use global objects), so that this local memory value stays consistent over time. Only the autotuner will be allowed to play with optimization profiles, then, which is fine with me. There is no reason to expect that the hardware changes during the execution of a process. Even if a piece of hardware falls off the bus because it overheats, it doesn't come back.
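A minimal sketch of that two-stage selection loop (profile, benchmark, matches_reference and the test sizes are hypothetical placeholders, not ViennaCL API; a real implementation would also cache benchmark results instead of re-measuring inside the comparator):

    // Hypothetical sketch: pick the fastest candidate profile, then verify it at
    // several matrix sizes; discard fast-but-wrong profiles and try the next one.
    #include <algorithm>
    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    struct profile { /* alignment, ml, kl, nl, ms, ks, ns, ... */ };
    double benchmark(profile const & p);                          // assumed: GFLOP/s at one size
    bool matches_reference(profile const & p, std::size_t size);  // assumed: compares against a CPU reference

    profile select_profile(std::vector<profile> candidates)
    {
      while (!candidates.empty())
      {
        // Stage 1: fastest remaining candidate, benchmarked at a single size.
        auto best = std::max_element(candidates.begin(), candidates.end(),
            [](profile const & a, profile const & b) { return benchmark(a) < benchmark(b); });

        // Stage 2: verify correctness at a couple of different sizes.
        bool ok = true;
        for (std::size_t s : {128, 512, 1024, 2048})
          ok = ok && matches_reference(*best, s);

        if (ok)
          return *best;
        candidates.erase(best);  // fast but erroneous: pick the next kernel
      }
      throw std::runtime_error("no correct profile found");
    }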
[ViennaCL-devel] Kernel Generator wrap-up
Hello everybody, I'm proud to announce that after about 3 weeks, I've recoded the OpenCL code generator from scratch to integrate it fully with viennacl::scheduler::statement. That being said, I'm entering the point where I need your opinion on (many) further design choices. Sorted by priority:

1) How to handle padding? For example, the best kernels for a given operation may use float4, in which case an alignment of 4 is required. For GEMM, though, the kernel internally uses blocking. Since the iteration over the blocks is unrolled, I prefer to keep the loop boundary static (known at OpenCL compile time), so padding inside a kernel is not really an option here. How to handle this? Should we have a plethora of kernels optimized for a large number of block sizes? If yes, how to choose the block sizes? (See the padding sketch at the end of this mail.)

2) For each operation (BLAS1/BLAS2/BLAS3 for now), an infinite number of kernels can be generated. Designing a proper test suite in such a situation is a challenging task. I've thought about testing a fixed number of randomly chosen kernels. We also have to choose multiple sizes for the tests (because of 1)... Finally, multiple operations can be packed together (multiple SAXPYs, multiple scalar reductions/inner products, multiple vector reductions/GEMVs). If the number of packed operations is too high, the local memory usage will be too high and the OpenCL kernel may not even *compile*. Should we provide a mechanism to evaluate this upper bound at runtime (doable), or just use a very conservative value for now (the OpenCL standard guarantees 16 kB of local memory, and the kernel generator guarantees an upper bound on the amount of local memory used)? I prefer the second option.

3) There are several expression nodes that should be supported only by the generator for now (even though they are not yet implemented):
- reduce<op>(vector_expression)
- reduce_rows<op>(matrix_expression)
- reduce_cols<op>(matrix_expression)
- elementwise relational operators: operator<, operator<=, operator>, operator>=, operator==, operator!=
- repmat(mat or vector, row_tiling, col_tiling)
- vector expression: diag(Mat)
- matrix expression: diag(vec)
My question is: how do we provide user access to OpenCL-specific features that are not available (yet) for the other backends? Another possibility is to keep this issue for ViennaCL 1.5.

4) I want to maintain explicit specifications of the generator (apart from the hard-coded bool-returning C++ function): what operations it supports and what it doesn't support. Are you interested? If yes, what format would you prefer?

Best regards, Philippe
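Regarding question 1, a minimal sketch of the runtime-padding direction (pad_size is an illustrative name, not ViennaCL API; the pad multiple is assumed to come from the selected kernel profile):

    // Hypothetical sketch: round a runtime matrix dimension up to the next
    // multiple of the block size required by the chosen kernel profile, so the
    // unrolled loop boundary stays static inside the kernel.
    #include <cstddef>

    std::size_t pad_size(std::size_t n, std::size_t block_size)  // e.g. block_size = 16
    {
      return ((n + block_size - 1) / block_size) * block_size;
    }

    // Example: pad_size(1000, 16) == 1008; the extra rows/columns would be
    // zero-filled and ignored when reading results back.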