Hey,

 >
> Hmm, I'm not completely sure.
> The best GEMM performances are not located "around" (distance-wise in the
> parameter space) the sweet spot, generally, since perturbing one
> parameter can result in disastrous performance.

Yeah, I agree, the sweet spot may not be defined 'distance-wise', but 
performance-wise. From what I remember, the best ~5 kernels are within a 
margin of a few percent of each other.


> However, it turns out that, from what I have observed, 128 is often the
> biggest block size found by the autotuner; even though the HD7970's
> optimal parameters reach 256, the induced overhead may just outweigh the
> benefits. That said, the autotuning procedure is not optimal, and once
> it is working perfectly we should (according to some papers) find block
> sizes such as 96, 112, 192, or whatever.

Good point, it may not be a power of two, but rather a multiple of 16 
(cf. Fermi Teslas). I suggest that we determine the final padding size 
by studying the tuner results. Giving up a few percent for GEMM may 
result in much better performance for the memory-bound kernels.


> On some hardware, the biggest block size might be only 64, for example.
> If we can query the execution profile for GEMM (should be attached to
> each viennacl::ocl::context) at runtime, we might be able to save some
> space/performance. As said in a previous email, we must not allow the
> user to modify the profiles, or it will invalidate the padding for all
> the matrices allocated until this point...
> Long story short, I agree with restricting padding to at most 128 and
> making the profile database read-only. :)

:-)

> Hmm, I agree that bounds-checking the result is cheap, but there is a
> semantic problem IMHO if we bounds-check size() instead of
> internal_size() in the kernels: the buffers may be float4* or even
> float16*, and casting a float16* to a float* has led to a segfault in
> all of my attempts (both on AMD and on NVIDIA). That is, the check
> if (index < size) does not suffice anymore, because size then behaves
> like roundUpToNextMultiple(size, vector_size), which means that size
> will end up having internal_size semantics anyway...
>
> As an example:
> --------------------------
> viennacl::vector<float> x, y, u, v;
> // fill y, v
> x = y + scalar_vector(1); // padding zeros altered between size and
> size + optimal_vector_size - 1, even with bounds checking.

This would be a crappy kernel then. There is no problem in filling a 
float4/float16 with some zero entries if it happens to lie on the size 
boundary. Thread divergence is not an issue here, because this affects 
only one work group.


> --------------------------
> viennacl::matrix<float> A,B,C,U,V;
> //fill U,V
> A = U + scalar_matrix<float>(1);
> B = V + scalar_matrix<double>(1);

Again, you're assuming a bad kernel here. ;-)


>
> Furthermore, it is not an option to go for "manual padding" in the
> kernel; we would indeed have to do something like:
>  > if (index + 1 > size)
>  >   res.s1 = 0;
>  > if (index + 2 > size)
>  >   res.s2 = 0;
>  > ...
>  > if (index + n - 1 > size)
>  >   res.sn = 0;
> In the case of 2D padding, this wouldn't be just (n-1) if-statements, but
> (n-1)² if-statements and branch divergences, which is cumbersome, not
> maintainable, and harmful for performance.

Consider x = y + z:

  if (index + vector_length > size)
  {
    // fine-grained check here (spelled out in the sketch below)
  }
  else
    x[index] = y[index] + z[index];

This is no more expensive than any 'traditional' way of bounds-checking.
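
Spelled out for float4, a minimal sketch could look like the following 
(kernel and argument names are made up for illustration, not the actual 
ViennaCL kernels; 'size' is the logical size(), not internal_size()). 
Only the work item hitting the size boundary takes the fine-grained 
path, so divergence is confined to a single work group:

  __kernel void vec_add(__global float4 * x,
                        __global const float4 * y,
                        __global const float4 * z,
                        unsigned int size)   // logical size(), not internal_size()
  {
    for (unsigned int i = get_global_id(0); i < (size + 3) / 4; i += get_global_size(0))
    {
      float4 result = y[i] + z[i];
      if (4 * i + 4 > size)   // last, partially filled float4
      {
        // fine-grained check: zero the lanes beyond 'size' so the padding stays zero
        if (4 * i + 1 >= size) result.s1 = 0.0f;
        if (4 * i + 2 >= size) result.s2 = 0.0f;
        if (4 * i + 3 >= size) result.s3 = 0.0f;
      }
      x[i] = result;
    }
  }

The same pattern keeps the padding at zero for x = y + scalar_vector(1) 
and friends, because the out-of-bounds lanes are overwritten with zeros 
before the store.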

For 2D-padding on BLAS levels 1 and 2, we assign workgroups to 
individual rows (or columns) rather than small blocks anyway (except 
for matrix transposition), so the check is no more expensive than for 
vectors (a sketch follows below). For GEMM one obtains zeros for the 
respective locations in the result matrix anyway.
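
For illustration, a minimal sketch of a row-wise element-wise operation 
with 2D padding (again with made-up kernel and argument names, assuming 
row-major storage, internal_size2 a multiple of 4, and one work group 
launched per logical row so that padding rows are never touched):

  __kernel void mat_add(__global float4 * A,
                        __global const float4 * B,
                        __global const float4 * C,
                        unsigned int size2,            // logical number of columns
                        unsigned int internal_size2)   // padded number of columns
  {
    // one work group per logical row; offset in float4 units
    unsigned int row_offset = get_group_id(0) * (internal_size2 / 4);

    for (unsigned int j = get_local_id(0); j < (size2 + 3) / 4; j += get_local_size(0))
    {
      float4 result = B[row_offset + j] + C[row_offset + j];
      if (4 * j + 4 > size2)   // last float4 of the row: keep the column padding at zero
      {
        if (4 * j + 1 >= size2) result.s1 = 0.0f;
        if (4 * j + 2 >= size2) result.s2 = 0.0f;
        if (4 * j + 3 >= size2) result.s3 = 0.0f;
      }
      A[row_offset + j] = result;
    }
  }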



> Last, but not least, I am completely lost in the following case, which
> is not in the current API, but which we have to take into
> consideration for the design of our padding policy:
>
>  > viennacl::matrix<float> x;
>  > //fill x with >0 values
>  > viennacl::scalar<float> s = reduce<min_type>(x); //s is equal to zero
> here, because values between size and optimal_vector_size - 1 are
> zeros! x should be padded with INFINITY and not zeros!
>  > viennacl::scalar<float> s = reduce<prod_type>(x); //s is equal to zero
> here! x should be padded with ones.

The same if-check as above works: one if as a cheap check of whether 
you're running past the size bound, then write the value appropriate 
for the kernel at hand (INFINITY for a minimum, 1 for a product) to the 
local variable; a sketch follows below.
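
A minimal sketch for the minimum reduction (kernel and buffer names are 
invented, and the actual ViennaCL reduction kernels may be organized 
differently): out-of-bounds lanes simply get the neutral element 
INFINITY before the usual reduction in local memory.

  __kernel void partial_min(__global const float4 * x,
                            unsigned int size,              // logical size(), not internal_size()
                            __global float * group_results, // one partial result per work group
                            __local float * shared)         // work group size assumed a power of two
  {
    unsigned int lid = get_local_id(0);
    float thread_min = INFINITY;

    for (unsigned int i = get_global_id(0); i < (size + 3) / 4; i += get_global_size(0))
    {
      float4 val = x[i];
      if (4 * i + 4 > size)   // last, partially filled float4: neutralize the padding lanes
      {
        if (4 * i + 1 >= size) val.s1 = INFINITY;
        if (4 * i + 2 >= size) val.s2 = INFINITY;
        if (4 * i + 3 >= size) val.s3 = INFINITY;
      }
      thread_min = fmin(fmin(val.s0, val.s1), fmin(fmin(val.s2, val.s3), thread_min));
    }

    // standard tree reduction in local memory
    shared[lid] = thread_min;
    for (unsigned int stride = get_local_size(0) / 2; stride > 0; stride /= 2)
    {
      barrier(CLK_LOCAL_MEM_FENCE);
      if (lid < stride)
        shared[lid] = fmin(shared[lid], shared[lid + stride]);
    }
    if (lid == 0)
      group_results[get_group_id(0)] = shared[0];
  }

For reduce<prod_type> the same structure works with 1.0f as the neutral 
element and a multiplication instead of fmin, so no extra padding pass 
before each operation is needed.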


> The best way out that I can think of is just to have some O(BlockSize)
> (resp. O(BlockSize*size1 + BlockSize*size2)) 1-D (resp. 2-D) kernel
> before each operation. In the case where BlockSize << size1 && BlockSize
> << size2, the overhead should not be noticeable. The main advantage is
> that we can pad with the appropriate element: -INFINITY, 0, 1,
> INFINITY, 42...
> How do other linear algebra libraries handle this? Is that what they do?

Keep in mind that you don't have vector types on the CPU, so the 
casting problem never shows up there.



> Actually, according to the AMD optimization guide, the 7000s have a
> wavefront size of 64!

Oh, right, the newer ones have a larger wavefront size - thanks!

> The guide also says that each compute unit possesses
> 4 vector units of 16 processing elements each, which means enough
> resources for 1 wavefront. In order to properly hide latency, each
> compute unit can execute multiple wavefronts, which is why the group
> size should be chosen as a multiple of the wavefront size. However, the
> GEMM kernel works this way:
> - Each work-item processes mS * nS elements of the result matrix
> - Each work group processes mL * nL elements
> That is, the work group size is:
> size1 = mL / mS;
> size2 = nL / nS;
>
> On the HD7970, the optimal configuration for GEMM AA row-major*row-major is
>
> mL = 16, mS = 4 => work_group_size1 = 4;
> nL = 256, nS = 4 => work_group_size2 = 64;
> (I'm starting to understand why this kernel performs so well: the
> computation of nL rows is pipelined in the same compute unit, which
> hides latency well!)
>
> Anyhow, this indeed suggests that we do need large padding to both
> occupy as many ALUs as possible and hide latency.

A padding of 256 looks pretty expensive to me, resulting in a lot of 
unnecessary FLOPs in the worst case. Can you please assemble a list of 
all GEMM kernel configuration parameters and their execution times for 
the GTX 470, Tesla C2050, HD 7970, and HD 5850? mL, nL, and kL are of 
main interest, but for future reference it's better to just dump all 
parameters. I think I can also run on a K20X here; I need to beg a 
colleague. I'll also open up a separate repository on GitHub for 
collecting results; hopefully it will grow with more device results 
over time.

Best regards,
Karli

> PS: The emails are getting long... :)

No problem with that ;-)


