Re: [ViennaCL-devel] Kernel Generator wrap-up
Hi again! The generator code is pushed on the master branch.

2013/7/28 Karl Rupp <r...@iue.tuwien.ac.at>

> Hey,
>
> My preferred option is to pad by default and to make the padding either a multiple of four or of sixteen. However, we need to maintain a full set of unpadded operations, because user-provided buffers need not be padded (and a subsequent padding may be too expensive).
>
> I think always making the padding a multiple of 16 is a good option, because we can reasonably assume that optimal performance is rarely obtained when a work item performs (unrolls) more than 16*16 operations, for most of the kernels. However, we need a clear and easily extensible dispatch mechanism that dispatches certain sizes to specific kernels, which is what I was talking about. Best {m, k, n} big block sizes for the GEMM kernel (Row-Major * Row-Major):
>   AMD       : 16 * 64 * 256
>   NVidia    : 16 * 128 * 128
>   Intel CPU : 64 * 64 * 128
>
> I expect this to also depend on the hardware generation. The best approach that comes to my mind is to introduce some hardware descriptor, which provides nicely preprocessed information from the OpenCL backend. A rather simple
>   Vendor: [AMD, INTEL, NVIDIA, ...]
>   Type: [CPU, GPU, MIC, etc.]
>   Generation: [Southern Islands, Fermi, Kepler, ..., UNKNOWN]
> should give us enough dispatch possibilities for the hardware 'out there'. If the detection of the hardware generation fails, we just use some compatibility kernel (and eventually ask the user to submit hardware information when running the tuner).

Right, I'll do that :)

> Of course, it is bound to be device-specific rather than vendor-specific, and once the autotuning procedure works better we might have block sizes such as 96, 112, etc. Furthermore, for the kernel to be correct, each size has to be a multiple of the block size (3 constraints). We can never expect the user to call the kernel on the proper sizes. Problem: the padding in ViennaCL is static, while this block size is only known at runtime...
> Should we just write somewhere in the documentation what the best kernels are?
>
> The padding is no longer 'static'. The 'ALIGNMENT' template parameter is now ignored (vector_base no longer holds an ALIGNMENT parameter), so we can introduce a runtime padding without breaking old code. Thus, we can pick a proper padding entirely at runtime, tailored to the underlying device.

Oh, true. This padding then has to be the smallest one compatible with all profiles, some sort of least common multiple, which I hope is not going to grow ridiculously big...

> Even though the number of possible kernel variations is large (though finite), there's only a limited set which actually gives good performance. These are the important kernels to be tested thoroughly.
>
> Yes, but this limited set is device/program-specific, and it is hard to know (that's what autotuning is for). I don't think anyone could tell me explicitly which combination of {alignment, ml, kl, nl, ms, ks, ns, use_lhs_shared, use_rhs_shared, unroll} gives good performance ;) And even if I choose only two values for each parameter, it leads to 2¹⁰ = 1024 tests per layout/transposition combination, i.e. 32 768 tests in total, which is ridiculously high :D What about integrating the test procedure into the autotuning procedure? It's not intuitive, but I see no better way.
>
> Yes, a good autotuning procedure should verify the correctness of the obtained results anyway. There may be compiler or hardware bugs which can lead to fast, but erroneous kernels. A two-stage scheme seems best here:
> - First, find the fastest kernel (either without checking, or just checking for one particular size).
> - Second, verify this kernel for a couple of different sizes. If this fails, pick the next kernel, etc.

Ok, I'll do that. However, there are also things to test in the way the generator behaves, rather than in the profiles: all the operations in tests/vector.cpp have to be compatible with the generator.
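The runtime-padding idea above can be made concrete with a short sketch (hypothetical helper names, not actual ViennaCL code): sizes are rounded up to the next multiple of the required padding, and a padding compatible with several profiles is the least common multiple of their block sizes.

```cpp
#include <cstddef>

// Sketch only: round a runtime size up to the next multiple of the
// padding required by a kernel profile.
std::size_t round_up(std::size_t size, std::size_t multiple)
{
  return ((size + multiple - 1) / multiple) * multiple;
}

// A padding compatible with two profiles is the least common multiple
// of their individual block sizes (computed via the Euclidean gcd).
std::size_t lcm2(std::size_t a, std::size_t b)
{
  std::size_t x = a, y = b;
  while (y) { std::size_t t = x % y; x = y; y = t; }  // x = gcd(a, b)
  return (a / x) * b;
}
```

For power-of-two block sizes the least common multiple stays small (e.g. profiles requiring 4 and 16 need only a padding of 16), which supports the hope that it does not grow ridiculously big.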
Should the corresponding tests be in the same vector.cpp file (inside some #ifdef VIENNACL_WITH_OPENCL), or should they be in a separate file?

> Sooner or later we will have to go for the runtime option anyway. I don't see any benefit in being overly pessimistic with 16 kB if we have the true local memory size available at runtime.
>
> Right, it's not over-complicated to do. The problem is more about knowing the right optimization profile used at runtime (i.e. the local memory used by the to-be-compiled kernel).
>
> Ok, it means that this optimization profile should not change (since I think we cannot really use global objects), so that this local memory value is consistent over time. Only the autotuner will be allowed to play with optimization profiles, then, which is fine with me.

There is no reason to expect that the hardware changes during the execution of a process. Even if a device falls off the bus because it overheats, it doesn't come back without rebooting the machine.
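The two-stage autotuning scheme discussed in this thread could be sketched as follows (hypothetical types and names, purely an illustration of the control flow, not the actual autotuner API): rank candidate kernels by measured time, then accept the fastest one that verifies on every test size.

```cpp
#include <cstddef>
#include <algorithm>
#include <vector>

// Hypothetical candidate record: benchmarked time plus a correctness check.
struct candidate
{
  double seconds;                  // measured execution time
  bool (*verify)(std::size_t);     // correctness check for one problem size
};

// Returns the index of the fastest candidate passing verification on all
// sizes, or -1 if every candidate fails (e.g. compiler/hardware bugs).
int pick_kernel(const std::vector<candidate> &cands,
                const std::vector<std::size_t> &sizes)
{
  // Stage 1: order candidate indices from fastest to slowest.
  std::vector<std::size_t> order(cands.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b)
            { return cands[a].seconds < cands[b].seconds; });

  // Stage 2: accept the first (fastest) candidate verifying on all sizes;
  // on failure, fall through to the next-fastest one.
  for (std::size_t k = 0; k < order.size(); ++k)
  {
    bool ok = true;
    for (std::size_t s = 0; s < sizes.size() && ok; ++s)
      ok = cands[order[k]].verify(sizes[s]);
    if (ok) return static_cast<int>(order[k]);
  }
  return -1;
}

// Trivial stand-in checks for demonstration purposes.
static bool pass(std::size_t) { return true; }
static bool fail(std::size_t) { return false; }
```

The key property is that a fast-but-wrong kernel (e.g. index 0 above when it fails verification) is skipped in favor of the next-fastest correct one.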
[ViennaCL-devel] Kernel Generator wrap-up
Hello everybody,

I'm proud to announce that after about 3 weeks, I've recoded from scratch the OpenCL code generator to integrate it fully with viennacl::scheduler::statement. That being said, I've reached the point where I need to ask your opinion on (many) further design choices. Sorted by priority:

1. How to handle padding? For example, the best kernels for a given operation may use float4, in which case an alignment of 4 is required. For GEMM, though, the kernel internally uses blocking. Since the iteration over the blocks is unrolled, I prefer to keep the loop boundary static (known at OpenCL compile time), so padding inside the kernel is not really an option here. How to handle this? Should we have a plethora of kernels optimized for a large number of block sizes? If yes, how to choose the block sizes?

2. For each operation (BLAS1/BLAS2/BLAS3 for now), an infinite number of kernels can be generated. Designing a proper test suite in such a situation is a challenging task. I've thought about testing a fixed number of randomly chosen kernels. We also have to choose multiple sizes for the tests (because of 1)... Finally, multiple operations can be packed together (multiple SAXPY, multiple scalar reductions/inner products, multiple vector reductions/gemv). If the number of packed operations is too high, the local memory usage will be too high and the OpenCL kernel may not *compile*. Should we provide a mechanism to evaluate this upper bound at runtime (doable), or just use a very conservative value for now (the OpenCL standard guarantees 16 kB of local memory, and the kernel generator guarantees an upper bound on the amount of local memory used)? I prefer the second option.
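For question 2, the runtime variant could be as small as the following sketch (hypothetical helper, not generator code): given the generator's guaranteed per-operation upper bound on local memory, the number of operations that can safely be packed into one kernel follows by division.

```cpp
#include <cstddef>

// Sketch: how many operations can be packed into one kernel without
// exceeding the device's local memory? 'per_op_bound' is the generator's
// guaranteed local-memory upper bound per packed operation;
// 'device_local_mem' would come from
// clGetDeviceInfo(..., CL_DEVICE_LOCAL_MEM_SIZE, ...) at runtime,
// or be fixed to the conservative 16 kB value mentioned above.
std::size_t max_packed_ops(std::size_t per_op_bound,
                           std::size_t device_local_mem)
{
  if (per_op_bound == 0)
    return 0;                              // guard against division by zero
  return device_local_mem / per_op_bound;  // truncation keeps us in budget
}
```

With the conservative value, e.g. a 2 kB per-operation bound would allow packing 8 operations; querying the real local memory size at runtime only relaxes this limit.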
3. There are several expression nodes that should be supported only by the generator for now (even though not yet implemented):
- reduce<op>(vector_expression)
- reduce_rows<op>(matrix_expression)
- reduce_cols<op>(matrix_expression)
- elementwise relational operators: operator<, operator<=, operator>, operator>=, operator==, operator!=
- repmat(mat or vector, row_tiling, col_tiling)
- vector expression: diag(Mat)
- matrix expression: diag(vec)
My question is: how to provide user access to OpenCL-specific content that is not available (yet) for the other backends? Another possibility is to keep this issue for ViennaCL version 1.5.

4. I want to maintain explicit specifications of the generator (apart from the hard-coded bool-returning C++ function): what operations it supports, and what it doesn't support. Are you interested? If yes, what format would you prefer?

Best regards,
Philippe

___
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] Kernel Generator wrap-up
Hey,

> I'm proud to announce that after about 3 weeks, I've recoded from scratch the OpenCL code generator to integrate it fully with viennacl::scheduler::statement.

Hurray :-) With the changes to the generator I pushed yesterday, there is now a clear spot where to hand the expression over to the generator.

> That being said, I've reached the point where I need to ask your opinion on (many) further design choices. Sorted by priority:
>
> 1. How to handle padding? For example, the best kernels for a given operation may use float4, in which case an alignment of 4 is required. For GEMM, though, the kernel internally uses blocking. Since the iteration over the blocks is unrolled, I prefer to keep the loop boundary static (known at OpenCL compile time), so padding inside the kernel is not really an option here. How to handle this? Should we have a plethora of kernels optimized for a large number of block sizes? If yes, how to choose the block sizes?

My preferred option is to pad by default and to make the padding either a multiple of four or of sixteen. However, we need to maintain a full set of unpadded operations, because user-provided buffers need not be padded (and a subsequent padding may be too expensive).

> 2. For each operation (BLAS1/BLAS2/BLAS3 for now), an infinite number of kernels can be generated. Designing a proper test suite in such a situation is a challenging task. I've thought about testing a fixed number of randomly chosen kernels.

Please no random tests. This makes it awfully complicated to fix failures, because eventually one may not even be able to reproduce a previous failure. Even though the number of possible kernel variations is large (though finite), there's only a limited set which actually gives good performance. These are the important kernels to be tested thoroughly.

> We also have to choose multiple sizes for the tests (because of 1)...

Sure. Keeping the sizes moderately small will give us a sufficiently fast test procedure.
> Finally, multiple operations can be packed together (multiple SAXPY, multiple scalar reductions/inner products, multiple vector reductions/gemv). If the number of packed operations is too high, the local memory usage will be too high and the OpenCL kernel may not *compile*. Should we provide a mechanism to evaluate this upper bound at runtime (doable), or just use a very conservative value for now (the OpenCL standard guarantees 16 kB of local memory, and the kernel generator guarantees an upper bound on the amount of local memory used)? I prefer the second option.

Sooner or later we will have to go for the runtime option anyway. I don't see any benefit in being overly pessimistic with 16 kB if we have the true local memory size available at runtime.

> 3. There are several expression nodes that should be supported only by the generator for now (even though not yet implemented):
> - reduce<op>(vector_expression)
> - reduce_rows<op>(matrix_expression)
> - reduce_cols<op>(matrix_expression)
> - elementwise relational operators: operator<, operator<=, operator>, operator>=, operator==, operator!=
> - repmat(mat or vector, row_tiling, col_tiling)
> - vector expression: diag(Mat)
> - matrix expression: diag(vec)
> My question is: how to provide user access to OpenCL-specific content that is not available (yet) for the other backends? Another possibility is to keep this issue for ViennaCL version 1.5.

After the 1.5.0 release. There's too much other new functionality, so the release is already overdue. This gives us more time to design the API properly rather than coming up with some quick fix.

> 4. I want to maintain explicit specifications of the generator (apart from the hard-coded bool-returning C++ function): what operations it supports, and what it doesn't support. Are you interested? If yes, what format would you prefer?

I'm not sure what you mean by 'explicit specifications'. Could you please elaborate?
Best regards,
Karli
Re: [ViennaCL-devel] Kernel Generator wrap-up
Hey,

> My preferred option is to pad by default and to make the padding either a multiple of four or of sixteen. However, we need to maintain a full set of unpadded operations, because user-provided buffers need not be padded (and a subsequent padding may be too expensive).
>
> I think always making the padding a multiple of 16 is a good option, because we can reasonably assume that optimal performance is rarely obtained when a work item performs (unrolls) more than 16*16 operations, for most of the kernels. However, we need a clear and easily extensible dispatch mechanism that dispatches certain sizes to specific kernels, which is what I was talking about. Best {m, k, n} big block sizes for the GEMM kernel (Row-Major * Row-Major):
>   AMD       : 16 * 64 * 256
>   NVidia    : 16 * 128 * 128
>   Intel CPU : 64 * 64 * 128

I expect this to also depend on the hardware generation. The best approach that comes to my mind is to introduce some hardware descriptor, which provides nicely preprocessed information from the OpenCL backend. A rather simple
  Vendor: [AMD, INTEL, NVIDIA, ...]
  Type: [CPU, GPU, MIC, etc.]
  Generation: [Southern Islands, Fermi, Kepler, ..., UNKNOWN]
should give us enough dispatch possibilities for the hardware 'out there'. If the detection of the hardware generation fails, we just use some compatibility kernel (and eventually ask the user to submit hardware information when running the tuner).

> Of course, it is bound to be device-specific rather than vendor-specific, and once the autotuning procedure works better we might have block sizes such as 96, 112, etc. Furthermore, for the kernel to be correct, each size has to be a multiple of the block size (3 constraints). We can never expect the user to call the kernel on the proper sizes. Problem: the padding in ViennaCL is static, while this block size is only known at runtime... Should we just write somewhere in the documentation what the best kernels are?

The padding is no longer 'static'.
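The hardware-descriptor dispatch could look roughly like the following sketch (all names here are illustrative, not the actual ViennaCL API), using the block sizes quoted in this thread and a conservative compatibility profile as fallback:

```cpp
#include <cstddef>

// Hypothetical descriptor enums -- NOT the actual ViennaCL API.
enum vendor_type { VENDOR_AMD, VENDOR_INTEL, VENDOR_NVIDIA, VENDOR_UNKNOWN };
enum device_kind { DEVICE_CPU, DEVICE_GPU, DEVICE_MIC, DEVICE_UNKNOWN };

// {m, k, n} big block sizes for a GEMM profile.
struct gemm_profile { std::size_t m, k, n; };

// Dispatch the block sizes quoted in this thread from a simple hardware
// descriptor; unknown hardware falls back to a compatibility profile.
gemm_profile dispatch_gemm(vendor_type vendor, device_kind kind)
{
  if (kind == DEVICE_CPU)      return gemm_profile{64, 64, 128};  // Intel CPU
  if (vendor == VENDOR_AMD)    return gemm_profile{16, 64, 256};
  if (vendor == VENDOR_NVIDIA) return gemm_profile{16, 128, 128};
  return gemm_profile{16, 16, 16};  // conservative compatibility kernel
}
```

A real version would add the Generation axis (Southern Islands, Fermi, Kepler, ...) as a further dispatch key, which keeps the scheme easily extensible as tuned profiles accumulate.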
The 'ALIGNMENT' template parameter is now ignored (vector_base no longer holds an ALIGNMENT parameter), so we can introduce a runtime padding without breaking old code. Thus, we can pick a proper padding entirely at runtime, tailored to the underlying device.

> Even though the number of possible kernel variations is large (though finite), there's only a limited set which actually gives good performance. These are the important kernels to be tested thoroughly.
>
> Yes, but this limited set is device/program-specific, and it is hard to know (that's what autotuning is for). I don't think anyone could tell me explicitly which combination of {alignment, ml, kl, nl, ms, ks, ns, use_lhs_shared, use_rhs_shared, unroll} gives good performance ;) And even if I choose only two values for each parameter, it leads to 2¹⁰ = 1024 tests per layout/transposition combination, i.e. 32 768 tests in total, which is ridiculously high :D What about integrating the test procedure into the autotuning procedure? It's not intuitive, but I see no better way.

Yes, a good autotuning procedure should verify the correctness of the obtained results anyway. There may be compiler or hardware bugs which can lead to fast, but erroneous kernels. A two-stage scheme seems best here:
- First, find the fastest kernel (either without checking, or just checking for one particular size).
- Second, verify this kernel for a couple of different sizes. If this fails, pick the next kernel, etc.

> Sooner or later we will have to go for the runtime option anyway. I don't see any benefit in being overly pessimistic with 16 kB if we have the true local memory size available at runtime.
>
> Right, it's not over-complicated to do. The problem is more about knowing the right optimization profile used at runtime (i.e. the local memory used by the to-be-compiled kernel).
>
> Ok, it means that this optimization profile should not change (since I think we cannot really use global objects), so that this local memory value is consistent over time.
> Only the autotuner will be allowed to play with optimization profiles, then, which is fine with me.

There is no reason to expect that the hardware changes during the execution of a process. Even if a device falls off the bus because it overheats, it doesn't come back without rebooting the machine (verified with two SDKs).

> After the 1.5.0 release. There's too much other new functionality, so the release is already overdue. This gives us more time to design the API properly rather than coming up with some quick fix.
>
> Ok :) However, I need these for my research, so I'll make it work for OpenCL just after the 1.5.0 release :)

It's very easy to add operations to the statement objects, so there's no problem adding more any time after the release.

> I'm not sure what you mean by 'explicit specifications'. Could you please elaborate?
>
> Hmm, something like a set of all the