Hi hi,
2013/8/2 Karl Rupp <r...@iue.tuwien.ac.at>
> Hi,
>
> > I've been thinking a bit about dynamically zero-padding
> > viennacl::matrix<> for full hardware use (best bandwidth for BLAS1 and
> > BLAS2, best performance for BLAS3).
> >
> > Basically, the big problem arising is that the blocking parameter is not
> > dependent on the hardware or the matrix, but rather on the operation, i.e.:
> > x = prod(A,y)
> > C0 = prod(A,B0)
> > C1 = prod(trans(A), B1)
> > C2 = prod(B2, A)
> > all use the same matrix A but require different padding sizes. If we
> > want to handle all these cases properly, it seems like we will need an
> > intelligence layer to dynamically update the matrix A.
> >
> > However, I believe that resizing A to fit the right padding sizes is not
> > an option, because the corresponding overhead would outweigh the
> > benefits of padding for the bandwidth-limited kernel.
>
> I don't think that resizing is a serious option, since it may
> unnecessarily overload the GPU RAM if the initial matrix is already large.
>
>
> > Another option would be to pad A with the biggest block size required.
> > I.e., if on the current hardware:
> > BLAS1 Matrix[**] requires padding 4*4
> > GEMM AA requires LHS padded by 32*128, RHS by 128*16
> > GEMM TA requires LHS padded by 16*256, RHS by 256*64
> > ... etc.
> > then each matrix would be zero-padded 256*128, so that it can fit all
> > the contexts.
>
> Since there are a few kernels with good performance around the 'sweet
> spot', is it reasonable to restrict the padding to multiples of 128?
>
Hmm, I'm not completely sure.
The best GEMM performances are generally not located "around" (distance-wise
in the parameter space) the sweet spot, since perturbing one parameter can
result in disastrous performance.
However, from what I have observed, it turns out that 128 is often the
biggest block size found by the autotuner; the HD7970's optimal parameters
do reach 256, but there the induced overhead may just outweigh the benefits.
That said, the autotuning procedure is not optimal, and once it is working
perfectly we should (according to some papers) find block sizes such as 96,
112, 192, or whatever.
On some hardware, the biggest block size might be only 64, for example. If
we can query the execution profile for GEMM (it should be attached to each
viennacl::ocl::context) at runtime, we might be able to save some
space/performance. As said in a previous email, we must not allow the user
to modify the profiles, or it would invalidate the padding of all the
matrices allocated up to this point...
Long story short, I agree with restricting padding to at most 128 and
making the profile database read-only. :)
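To make the idea concrete, here is a minimal sketch of how the padded sizes
could be derived from such a read-only profile; every name here is made up,
nothing of this is in the current API:
--------------------------
#include <cstddef>
#include <algorithm>

inline std::size_t round_up(std::size_t n, std::size_t multiple)
{
  return ((n + multiple - 1) / multiple) * multiple;
}

//largest_block_size would be queried from the (read-only) GEMM profiles
//attached to the viennacl::ocl::context
inline std::size_t padded_size(std::size_t n, std::size_t largest_block_size)
{
  //cap the padding multiple at 128, as agreed above
  return round_up(n, std::min<std::size_t>(largest_block_size, 128));
}
//e.g. padded_size(1000, 256) == 1024, i.e. 2.4% extra rows (or columns)
--------------------------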
> For very thin/skinny matrices this clearly requires extra thought, but
> in such cases the optimal kernels are also different. The overhead of 128
> is still affordable: Considering that matrices below ~1000-by-1000 don't
> show good performance anyway, the overhead is at most 20% in memory
> consumption and at most 30% in FLOPs for GEMM. For the more interesting
> range above 4000-by-4000 it's already below 10% in total FLOPs in worst
> case, but gives much more uniform speed.
>
>
> > After that, at each operation, internal_size*() would be updated to fit
> > the correct block sizes. Are we really sure that the initial zeros of
> > the padding will remain zeros throughout the whole execution process?
>
> Yes, we simply require this as a guarantee. Bounds-checking on the
> result matrix is much cheaper than bounds-checking on the factors.
>
Hmm, I agree that bounds-checking the result is cheap, but there is a
semantic problem imho if we bounds-check size() instead of internal_size()
in the kernels: the buffers may be float4* or even float16*, and casting a
float16* to a float* has led to a segfault in all of my attempts (both on
AMD and on NVidia). That is, the check *if(index < size)* does not suffice
anymore, because size then behaves like roundUpToNextMultiple(size,
vector_size), which means that size ends up having internal_size
semantics anyway...
As an example:
--------------------------
viennacl::vector<float> x, y, u, v;
//fill y, v
x = y + scalar_vector(1); //padding zeros altered between size and
                          //size + optimal_vector_size - 1, even with bounds checking
u = v + scalar_vector(1); //same thing
viennacl::scalar<float> s = inner_product(x, u); //whoops - outputs something
                          //like correct_result + optimal_vector_size - 1
--------------------------
--------------------------
viennacl::matrix<float> A, B, C, U, V;
//fill U, V
A = U + scalar_matrix<float>(1);
B = V + scalar_matrix<float>(1);
C = prod(A, B); //oops...
--------------------------
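To see why a simple *if(index < size)* guard cannot save us on vectorized
buffers, here is a minimal kernel sketch (my own sketch, not the actual
ViennaCL source):
--------------------------
//x and y are viewed as float4*, so each work item touches 4 consecutive
//floats at once
__kernel void add_one(__global float4 * x,
                      __global const float4 * y,
                      unsigned int size) //logical size, in floats
{
  for (unsigned int i = get_global_id(0); 4 * i < size; i += get_global_size(0))
    x[i] = y[i] + (float4)(1.0f); //the last float4 also overwrites the
                                  //padding zeros between size and
                                  //roundUpToNextMultiple(size, 4)
}
--------------------------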
Furthermore, it is not an option to go for "manual padding" in the kernel;
we would indeed have to do something like:
--------------------------
if (index + 1 >= size)
  res.s1 = 0;
if (index + 2 >= size)
  res.s2 = 0;
...
if (index + n - 1 >= size)
  res.sn = 0;
--------------------------
In the case of 2D padding, this wouldn't be just (n-1) if-statements, but
(n-1)² if-statements and branch divergences, which is cumbersome, not
maintainable, and harmful for performance.
Last but not least, I am completely lost in the following case, which is
not in the current API, but which we have to take into consideration for
the design of our padding policy:
--------------------------
viennacl::vector<float> x;
//fill x with >0 values
viennacl::scalar<float> s = reduce<min_type>(x); //s is equal to zero here,
        //because the values between size and size + optimal_vector_size - 1
        //are zeros! x should be padded with +INFINITY and not zeros!
viennacl::scalar<float> t = reduce<prod_type>(x); //t is equal to zero here!
        //x should be padded with ones.
--------------------------
The best way out that I can think of is just to run some O(BlockSize)
(resp. O(BlockSize*size1 + BlockSize*size2)) 1-D (resp. 2-D) kernel before
each such operation. In the case where BlockSize << size1 and
BlockSize << size2, the overhead should not be noticeable. The main
advantage is that we can pad with the appropriate element: -INFINITY, 0, 1,
INFINITY, 42...
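Such a re-padding kernel could look roughly like this (again just a sketch;
the kernel name and signature are made up):
--------------------------
//fill the pad region [size, internal_size) with the identity element of
//the upcoming reduction: 0 for sums, 1 for products, +INFINITY for min,
//-INFINITY for max, ...
__kernel void repad(__global float * x,
                    unsigned int size,
                    unsigned int internal_size,
                    float pad_value)
{
  for (unsigned int i = size + get_global_id(0); i < internal_size; i += get_global_size(0))
    x[i] = pad_value;
}
--------------------------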
How do other linear algebra libraries handle this? Is that what they do?
>
> > [**] I've come to the conclusion that having a specific BLAS1 kernel for
> > matrices was necessary, in order to allow maximum bandwidth on operations
> > such as:
> > A = B + repmat(x, 1, traits::size2(B));
> > A = B + eye(size1,size2)...
> > These operations are not implemented yet, but the most difficult
> > part is precisely to have a kernel using vector4 for these operations,
> > for example.
> > Do you agree or do you think we should use a 1D kernel for matrix blas1
> > too? Since both views are equivalent, the autotuner should also be able
> > to find bandwidth-optimal 2D parameters.
>
> I'd actually also prefer a larger padding in order to align properly
> with hardware warp/wavefront sizes. 32 would allow for dealing well with
> a single warp using a standard scalar type (float,double). AMD hardware
> seems to prefer vector4, so with 32 threads we are again at 128 entries.
> This would match well with the suggestion for matrix-matrix
> multiplication above.
>
Actually, according to the AMD optimization guide, the 7000 series has a
wavefront size of 64! The guide also says that each compute unit possesses
4 vector units of 16 processing elements each, which means enough resources
for 1 wavefront. In order to hide latency properly, each compute unit can
execute multiple wavefronts, which is why the group size should be chosen
as a multiple of the wavefront size. However, the GEMM kernel works this way:
- Each work item processes mS * nS elements of the result matrix
- Each work group processes mL * nL elements
That is, the work group sizes are:
size1 = mL / mS;
size2 = nL / nS;
On the HD7970, the optimum for GEMM AA row-major*row-major is:
mL = 16,  mS = 4 => work_group_size1 = 4;
nL = 256, nS = 4 => work_group_size2 = 64;
(I'm starting to understand why this kernel performs so well: the
computation of nL rows is pipelined in the same compute unit, which hides
latency well!)
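Just to spell out the arithmetic with the numbers above:
--------------------------
//numbers taken from the HD7970 profile quoted above
unsigned int mL = 16, mS = 4, nL = 256, nS = 4;
unsigned int work_group_size1 = mL / mS; //  4
unsigned int work_group_size2 = nL / nS; // 64
//4 * 64 = 256 work items per group, i.e. 4 full wavefronts of 64
--------------------------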
Anyhow, this suggests indeed that we do need large padding to both occupy
as many ALUs as possible and hide latency.
Best regards,
Philippe
PS: The emails are getting long... :)
> Best regards,
> Karli