Re: [ViennaCL-devel] Kernel Generator wrap-up

2013-07-29 Thread Philippe Tillet
Hi again!

The generator code is pushed on the master branch.



2013/7/28 Karl Rupp r...@iue.tuwien.ac.at

 Hey,



  My preferred option is to pad by default and to make the padding
 a multiple of either four or sixteen. However, we need to maintain
 a full set of unpadded operations, because user-provided buffers
 need not be padded (and a subsequent padding may be too expensive)


 I think always making it a multiple of 16 is a good option, because we
 can reasonably assume that optimal performance is rarely obtained when
 a work item performs (unrolls) more than 16*16 operations, for most
 kernels.
 However, we need a clear and easily extensible dispatch mechanism that
 dispatches certain sizes to specific kernels, which is what I was
 talking about:
 Best {m, k, n} big block sizes for the GEMM kernel:

 GEMM Row-Major * Row-Major
 AMD : 16 * 64 * 256
 NVidia : 16 * 128 * 128
 Intel CPU : 64 * 64 * 128.


 I expect this to be also dependent on the hardware generation. The best
 approach that comes to my mind is to introduce some hardware descriptor,
 which provides nicely preprocessed information from the OpenCL backend. A
 rather simple

   Vendor: [AMD, INTEL, NVIDIA, ...]
   Type:   [CPU, GPU, MIC, etc.]
   Generation: [Southern Island, Fermi, Kepler, ... , UNKNOWN]

 should give us enough dispatch possibilities for the hardware 'out there'.
 If the detection of the hardware generation fails, we just use some
 compatibility kernel (and eventually ask the user to submit hardware
 information when running the tuner).


Right. I'll do that :)




  Of course, it is bound to be device-specific rather than
 vendor-specific, and once the autotuning procedure works better we might
 have block sizes such as 96, 112, etc... Furthermore, for the kernel to
 be correct, each size has to be a multiple of the block size (3
 constraints). We can never expect the user to call the kernel with the
 proper sizes. The problem: the padding in ViennaCL is static, while this
 block size is known at runtime... Should we just write somewhere in the
 documentation what the best kernels are?


 The padding is no longer 'static'. The 'ALIGNMENT' template parameter is
 now ignored (vector_base no longer holds an ALIGNMENT parameter), so we can
 introduce a runtime padding without breaking old code. Thus, we can pick a
 proper padding entirely at runtime, tailored to the underlying device.


Oh, true. This padding has to be the smallest one compatible with all
profiles, some sort of least common multiple, which I hope is not going
grow ridiculously big...
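As a minimal sketch of that idea (the helper names are hypothetical, not existing ViennaCL API): the smallest padding compatible with all profiles is the least common multiple of their per-profile padding requirements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// gcd/lcm written out by hand, since std::lcm only arrived in C++17.
inline std::size_t gcd_(std::size_t a, std::size_t b) {
  while (b) { std::size_t t = a % b; a = b; b = t; }
  return a;
}
inline std::size_t lcm_(std::size_t a, std::size_t b) {
  return (a / gcd_(a, b)) * b;
}

// Smallest padding compatible with every optimization profile:
// the least common multiple of all per-profile paddings.
inline std::size_t common_padding(std::vector<std::size_t> const & paddings) {
  std::size_t result = 1;
  for (std::size_t i = 0; i < paddings.size(); ++i)
    result = lcm_(result, paddings[i]);
  return result;
}
```

For power-of-two paddings (4, 16, 64, ...) the lcm is just the maximum, so the common padding only grows "ridiculously big" if profiles mix coprime factors.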




  Even though the number of possible kernel variations is large
 (though finite), there's only a limited set which actually gives
 good performance. These are the important kernels to be tested
 thoroughly.


 Yes, but this limited set is device/program-specific, and it is hard
 to know (that's what autotuning is for). I don't think anyone could tell
 me explicitly which combination of {alignment, ml, kl, nl, ms, ks, ns,
 use_lhs_shared, use_rhs_shared, unroll} gives good performance ;) And
 even if I choose only two values for each parameter, that leads to 2¹⁰ =
 1024 tests per layout/transposition combination = 32 768 tests, which is
 ridiculously high :D
 What about integrating the test procedure into the autotuning procedure?
 It's not intuitive but I see no better way.


 Yes, a good autotuning procedure should verify the correctness of the
 results obtained anyway. There may be compiler or hardware bugs which can
 lead to fast, but erroneous kernels.

 A two-stage scheme seems best here:
 - First, find the fastest kernel (either without checking, or just
 checking for a particular size).
 - Second, verify this kernel for a couple of different sizes. If this
 fails, pick the next kernel, etc.


Ok, I'll do that.
However, there are things to test in the way the generator behaves, rather
than in the profiles.
All the operations in tests/vector.cpp have to be compatible with the
generator. Should the corresponding tests be in the same vector.cpp file
(inside some #ifdef VIENNACL_WITH_OPENCL), or should they be in a separate file?





  Sooner or later we will have to go for the runtime option anyway. I
 don't see any benefit of being overly pessimistic with 16kB if we
 have the true local memory available at runtime.


 Right, it's not over-complicated to do. The problem is more about
 knowing the right optimization profile used at runtime (the local memory
 used by the to-be-compiled kernel). Ok, it means that this optimization
 profile should not change (since I think we cannot really use global
 objects), so that this local memory value is consistent over time. Only
 the autotuner will be allowed to play with optimization profiles, then,
 which is fine for me.


 There is no reason to expect that the hardware changes during the
 execution of a process. Even if a device falls off the bus because it
 overheats, it doesn't come 

[ViennaCL-devel] Kernel Generator wrap-up

2013-07-28 Thread Philippe Tillet
Hello everybody,

I'm proud to announce that after about 3 weeks, I've recoded from scratch
the OpenCL code generator to integrate it fully with
viennacl::scheduler::statement.

That being said, I'm reaching the point where I need your opinion on
(many) further design choices. Sorted by priority:

1) How to handle padding? For example, the best kernels for a given
operation may use float4, in which case an alignment of 4 is required. For
GEMM, though, the kernel internally uses blocking. Since the iteration over
the blocks is unrolled, I prefer to keep the loop boundary static (known at
OpenCL compile time), so padding inside a kernel is not really an
option here. How should we handle this?
Should we have a plethora of kernels optimized for a large number of
block sizes? If yes, how do we choose the block sizes?
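For illustration, padding a runtime size up to the required alignment might look like the following sketch (the function name is invented, not part of ViennaCL):

```cpp
#include <cassert>
#include <cstddef>

// Round a runtime buffer length up to the next multiple of the
// vectorization width (e.g. 4 for float4), so that an unrolled kernel
// can assume the loop boundary divides evenly.
inline std::size_t pad_to_multiple(std::size_t n, std::size_t width) {
  return ((n + width - 1) / width) * width;
}
```

The extra padded elements would be zero-filled so they do not affect reductions or GEMM results.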

2) For each operation (BLAS1/BLAS2/BLAS3 for now), an infinite number of
kernels can be generated. Designing a proper test suite in such a situation
is a challenging task. I've thought about testing a fixed number of
randomly chosen kernels.
We also have to choose multiple sizes for the test (because of 1)...
Finally, multiple operations can be packed together (multiple SAXPY,
multiple scalar reductions/inner products, multiple vector reductions/gemv).
If the number of packed operations is too high, the local memory usage
will be too high and the OpenCL kernel may not *compile*. Should we provide
a mechanism to evaluate this upper bound at runtime (doable), or just use a
very conservative value for now? (The OpenCL standard guarantees 16kB of
local memory, and the kernel generator guarantees an upper bound on the
amount of local memory used.) I prefer the second option.

3) There are several expression nodes that should be supported only by the
generator for now (even though not yet implemented):
   - reduce<op>(vector_expression)
   - reduce_rows<op>(matrix_expression)
   - reduce_cols<op>(matrix_expression)
   - elementwise relational operators: operator<, operator<=, operator>,
operator>=, operator==, operator!=
   - repmat(mat or vector, row_tiling, col_tiling)
   - vector expression: diag(Mat)
   - matrix expression: diag(vec)
My question is: how do we give the user access to OpenCL-specific
content that is not (yet) available for other backends?
Another possibility is to keep this issue for ViennaCL version > 1.5.

4) I want to maintain explicit specifications of the generator (apart from
the hard-coded bool-returning C++ function): what operations it supports
and what it doesn't. Are you interested? If yes, what format would you
prefer?

Best regards,
Philippe
--
See everything from the browser to the database with AppDynamics
Get end-to-end visibility with application monitoring from AppDynamics
Isolate bottlenecks and diagnose root cause in seconds.
Start your free trial of AppDynamics Pro today!
http://pubads.g.doubleclick.net/gampad/clk?id=48808831iu=/4140/ostg.clktrk
___
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel


Re: [ViennaCL-devel] Kernel Generator wrap-up

2013-07-28 Thread Karl Rupp
Hey,

 I'm proud to announce that after about 3 weeks, I've recoded from scratch
 the OpenCL code generator to integrate it fully with
 viennacl::scheduler::statement.

hurray :-) With the changes to the generator I pushed yesterday, there is 
now a clear spot where the expression is handed over to the generator.


 That being said, I'm entering the point where I need to inquire your
 opinion for (many) further design choices. Sorted by priority :

 1) How to handle padding? For example, the best kernels for a given
 operation may use float4, in which case an alignment of 4 is required.
 For GEMM, though, the kernel internally uses blocking. Since the
 iteration over the blocks is unrolled, I prefer to keep the loop
 boundary static (known at OpenCL compile time), so padding inside a
 kernel is not really an option here. How should we handle this?
 Should we have a plethora of kernels optimized for a large number of
 block sizes? If yes, how do we choose the block sizes?

My preferred option is to pad by default and to make the padding a 
multiple of either four or sixteen. However, we need to maintain a full 
set of unpadded operations, because user-provided buffers need not be 
padded (and a subsequent padding may be too expensive)



 2  For each operation (BLAS1/BLAS2/BLAS3 for now), an infinite number
 of kernels can be generated. Designing a proper test suite in such a
 situation is a challenging task. I've thought about testing a fixed
 number of randomly chosen kernels.

Please no random tests. This makes it awfully complicated to fix, 
because eventually one may not even be able to reproduce a previous failure.

Even though the number of possible kernel variations is large (though 
finite), there's only a limited set which actually gives good 
performance. These are the important kernels to be tested thoroughly.


 We also have to choose multiple sizes for the test (because of 1)...

Sure. Keeping the sizes moderately small will give us a sufficiently 
fast test procedure.


 Finally, multiple operations can be packed together (multiple SAXPY,
 multiple scalar reductions/inner products, multiple vector
 reductions/gemv). If the number of packed operations is too high, the
 local memory usage will be too high and the OpenCL kernel may not
 *compile*. Should we provide a mechanism to evaluate this upper bound at
 runtime (doable), or just use a very conservative value for now? (The
 OpenCL standard guarantees 16kB of local memory, and the kernel generator
 guarantees an upper bound on the amount of local memory used.) I prefer
 the second option.

Sooner or later we will have to go for the runtime option anyway. I 
don't see any benefit of being overly pessimistic with 16kB if we have 
the true local memory available at runtime.
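As a sketch of the runtime option (hypothetical names; the cost model is invented for illustration): query the device's actual local memory, e.g. via clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE, and bound the number of packed operations by it instead of by the 16kB floor.

```cpp
#include <cassert>
#include <cstddef>

// Given the device's local memory (queried at runtime from the OpenCL
// backend) and the generator's guaranteed per-operation upper bound on
// local memory usage, compute how many operations can safely be packed
// into a single generated kernel.
inline std::size_t max_packed_ops(std::size_t local_mem_bytes,
                                  std::size_t bytes_per_op) {
  return bytes_per_op ? local_mem_bytes / bytes_per_op : 0;
}
```

A device reporting 32kB with, say, 4kB needed per packed reduction could then pack twice as many operations as the conservative 16kB assumption allows.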



 3) There are several expression nodes that should be supported only by
 the generator for now (even though not yet implemented):
 - reduce<op>(vector_expression)
 - reduce_rows<op>(matrix_expression)
 - reduce_cols<op>(matrix_expression)
 - elementwise relational operators: operator<, operator<=,
 operator>, operator>=, operator==, operator!=
 - repmat(mat or vector, row_tiling, col_tiling)
 - vector expression: diag(Mat)
 - matrix expression: diag(vec)
 My question is: how do we give the user access to OpenCL-specific
 content that is not (yet) available for other backends?
 Another possibility is to keep this issue for ViennaCL version > 1.5.

After the 1.5.0 release. There's too much other new functionality, so 
the release is already overdue. This gives us more time to design the 
API properly rather than coming up with some quick fix.


 4) I want to maintain explicit specifications of the generator (apart
 from the hard-coded bool-returning C++ function): what operations it
 supports and what it doesn't. Are you interested? If yes, what
 format would you prefer?

I'm not sure about what you mean by 'explicit specifications'. Could you 
please elaborate?

Best regards,
Karli




Re: [ViennaCL-devel] Kernel Generator wrap-up

2013-07-28 Thread Karl Rupp
Hey,


 My preferred option is to pad by default and to make the padding
 a multiple of either four or sixteen. However, we need to maintain
 a full set of unpadded operations, because user-provided buffers
 need not be padded (and a subsequent padding may be too expensive)


 I think always making it a multiple of 16 is a good option, because we
 can reasonably assume that optimal performance is rarely obtained when
 a work item performs (unrolls) more than 16*16 operations, for most
 kernels.
 However, we need a clear and easily extensible dispatch mechanism that
 dispatches certain sizes to specific kernels, which is what I was
 talking about:
 Best {m, k, n} big block sizes for the GEMM kernel:

 GEMM Row-Major * Row-Major
 AMD : 16 * 64 * 256
 NVidia : 16 * 128 * 128
 Intel CPU : 64 * 64 * 128.

I expect this to be also dependent on the hardware generation. The best 
approach that comes to my mind is to introduce some hardware descriptor, 
which provides nicely preprocessed information from the OpenCL backend. 
A rather simple

   Vendor: [AMD, INTEL, NVIDIA, ...]
   Type:   [CPU, GPU, MIC, etc.]
   Generation: [Southern Island, Fermi, Kepler, ... , UNKNOWN]

should give us enough dispatch possibilities for the hardware 'out 
there'. If the detection of the hardware generation fails, we just use 
some compatibility kernel (and eventually ask the user to submit 
hardware information when running the tuner).
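A minimal sketch of such a descriptor-based dispatch (all names are hypothetical, not existing ViennaCL API; the block sizes are the ones quoted above, and the fallback profile is invented):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative hardware descriptor, to be filled from the OpenCL backend.
enum vendor_id     { AMD, INTEL, NVIDIA, UNKNOWN_VENDOR };
enum device_type_t { CPU, GPU, MIC, UNKNOWN_TYPE };

struct hardware_descriptor {
  vendor_id     vendor;
  device_type_t type;
  // generation (Southern Islands, Fermi, Kepler, ...) omitted for brevity
};

struct gemm_block_sizes { std::size_t m, k, n; };

// Dispatch to the best known {m, k, n} big block sizes for
// row-major * row-major GEMM; unknown hardware falls back to a
// conservative compatibility profile.
inline gemm_block_sizes best_gemm_blocks(hardware_descriptor const & hw) {
  if (hw.vendor == AMD    && hw.type == GPU) { gemm_block_sizes b = {16,  64, 256}; return b; }
  if (hw.vendor == NVIDIA && hw.type == GPU) { gemm_block_sizes b = {16, 128, 128}; return b; }
  if (hw.vendor == INTEL  && hw.type == CPU) { gemm_block_sizes b = {64,  64, 128}; return b; }
  gemm_block_sizes fallback = {16, 16, 16};  // compatibility kernel
  return fallback;
}
```

Adding a generation field later only extends the dispatch conditions, without touching callers.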


 Of course, it is bound to be device-specific rather than
 vendor-specific, and once the autotuning procedure works better we might
 have block sizes such as 96, 112, etc... Furthermore, for the kernel to
 be correct, each size has to be a multiple of the block size (3
 constraints). We can never expect the user to call the kernel with the
 proper sizes. The problem: the padding in ViennaCL is static, while this
 block size is known at runtime... Should we just write somewhere in the
 documentation what the best kernels are?

The padding is no longer 'static'. The 'ALIGNMENT' template parameter is 
now ignored (vector_base no longer holds an ALIGNMENT parameter), so we 
can introduce a runtime padding without breaking old code. Thus, we can 
pick a proper padding entirely at runtime, tailored to the underlying 
device.


 Even though the number of possible kernel variations is large
 (though finite), there's only a limited set which actually gives
 good performance. These are the important kernels to be tested
 thoroughly.


 Yes, but this limited set is device/program-specific, and it is hard
 to know (that's what autotuning is for). I don't think anyone could tell
 me explicitly which combination of {alignment, ml, kl, nl, ms, ks, ns,
 use_lhs_shared, use_rhs_shared, unroll} gives good performance ;) And
 even if I choose only two values for each parameter, that leads to 2¹⁰ =
 1024 tests per layout/transposition combination = 32 768 tests, which is
 ridiculously high :D
 What about integrating the test procedure into the autotuning procedure?
 It's not intuitive but I see no better way.

Yes, a good autotuning procedure should verify the correctness of the 
results obtained anyway. There may be compiler or hardware bugs which 
can lead to fast, but erroneous kernels.

A two-stage scheme seems best here:
- First, find the fastest kernel (either without checking, or just 
checking for a particular size).
- Second, verify this kernel for a couple of different sizes. If this 
fails, pick the next kernel, etc.
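The two-stage scheme could be sketched like this (entirely hypothetical interfaces; a real implementation would compile, launch, and time OpenCL kernels rather than call function pointers):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a tuned kernel candidate: its measured time and a
// verification hook that would run the kernel at a given problem size
// and compare against a reference result.
struct kernel_candidate {
  double time_ms;                    // measured in stage one
  bool (*verify)(std::size_t size);  // correctness check for stage two
};

// Stage 1 is assumed done: candidates arrive sorted by measured time,
// fastest first. Stage 2: return the index of the first candidate that
// passes verification at all requested sizes, or -1 if none does.
inline int select_kernel(std::vector<kernel_candidate> const & by_speed,
                         std::vector<std::size_t> const & sizes) {
  for (std::size_t i = 0; i < by_speed.size(); ++i) {
    bool ok = true;
    for (std::size_t j = 0; j < sizes.size(); ++j)
      if (!by_speed[i].verify(sizes[j])) { ok = false; break; }
    if (ok) return static_cast<int>(i);
  }
  return -1;
}
```

This naturally catches the "fast but erroneous" case: a buggy fastest kernel is skipped and the next-fastest correct one wins.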



 Sooner or later we will have to go for the runtime option anyway. I
 don't see any benefit of being overly pessimistic with 16kB if we
 have the true local memory available at runtime.


 Right, it's not over-complicated to do. The problem is more about
 knowing the right optimization profile used at runtime (the local memory
 used by the to-be-compiled kernel). Ok, it means that this optimization
 profile should not change (since I think we cannot really use global
 objects), so that this local memory value is consistent over time. Only
 the autotuner will be allowed to play with optimization profiles, then,
 which is fine for me.

There is no reason to expect that the hardware changes during the 
execution of a process. Even if a device falls off the bus because it 
overheats, it doesn't come back without rebooting the machine (verified 
with two SDKs).


 After the 1.5.0 release. There's too much other new functionality,
 so the release is already over-due. This gives us more time to
 design the API properly rather than coming up with some quick-fix.


 Ok :) However, I need these for my research, so I'll make it work for
 OpenCL just after the 1.5.0 release :)

It's very easy to add operations to the statement objects, so there's no 
problem adding more any time after the release.


 I'm not sure about what you mean by 'explicit specifications'. Could
 you please elaborate?


 Hmm, something like a set of all the