Hi,

> I can't think of any such case where one would want to have control over
> this. This would require knowledge of our implementations to make
> appropriate choices anyway. In order to have a reasonable decision
> process, we need to come up with some heuristics...
> My first idea would be to compute a degree of "compute-boundness", and
> to dispatch according to it.
>
> T_{compute} = 2.\frac{MKN}{Peak_{compute}}
>
> T_{bandwidth} = S.\frac{MK + KN}{Peak_{bandwidth}}
>
> k = \frac{T_{compute}}{T_{bandwidth}} = \alpha.\frac{MN}{M+N}
>
> \alpha = 2.\frac{Peak_{bandwidth}}{S.Peak_{compute}}
>
> Now, for the dispatching, we can proceed the following way :
>
> k < k_1 \Rightarrow \textrm{Multiple Inner Product}
> k_1 \leq k \leq k_2 \Rightarrow \textrm{Multiple Matrix-Vector Product}
> k_2 < k \Rightarrow \textrm{Matrix-Matrix Product}
>
> Since the bandwidth and the compute power are device-dependent, theory
> would suggest that this "compute-boundness" degree should also be device
> dependent.
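For concreteness, the quoted heuristic could be sketched like this (the function names, and the thresholds k_1 and k_2, are purely illustrative placeholders, not ViennaCL API):

```cpp
#include <cassert>
#include <cmath>

// Which kernel family to dispatch to (names are illustrative).
enum class Kernel { InnerProducts, MatrixVectorProducts, MatrixMatrix };

// k = alpha * M*N / (M + N), with alpha = 2 * Peak_bw / (S * Peak_compute),
// following the equations in the quoted text.
double compute_boundness(double M, double N,
                         double peak_bandwidth, // bytes/s
                         double peak_compute,   // FLOP/s
                         double S)              // bytes per scalar
{
  double alpha = 2.0 * peak_bandwidth / (S * peak_compute);
  return alpha * M * N / (M + N);
}

// k < k_1            -> multiple inner products
// k_1 <= k <= k_2    -> multiple matrix-vector products
// k_2 < k            -> matrix-matrix product
Kernel dispatch(double k, double k1, double k2)
{
  if (k < k1)  return Kernel::InnerProducts;
  if (k <= k2) return Kernel::MatrixVectorProducts;
  return Kernel::MatrixMatrix;
}
```

With the order-of-magnitude numbers used further below (100 GB/s, 1 TFLOP/s, S = 4 bytes, M = N = 40), `compute_boundness` returns k = 1.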

This would be one option, yes. On the other hand, for the current device 
kernels to work well it is crucial to have enough 'worker blocks' in the 
result matrix, within which the individual work groups operate (i.e. 
load data and compute, with 'compute' being more intensive than 'load'). 
Performance drops only if:
  a) Many worker blocks work on the 'boundary' of the matrix, i.e. they 
read a lot of padded data (zeros) and thus run unnecessary computations. 
Let's call them 'not fully utilized', whereas worker blocks loading at 
least one full block (i.e. no padded data) are called 'fully utilized'.
  b) Not enough 'fully utilized' worker blocks are active.

This suggests the following criterion for kernel dispatch: if the 
number of fully utilized worker blocks is less than, say, 4 times 
the number of physically available work groups, we switch to a 
matrix-vector product or multiple inner products. I think this can be 
put nicely on top of the current dispatch for the kernel, so we don't 
need a second device-specific dispatch here.
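A minimal sketch of that criterion, assuming square worker blocks of edge length 'block_size' in the result matrix (all names here are illustrative, not the actual ViennaCL dispatch code):

```cpp
#include <cassert>
#include <cstddef>

// Decide whether enough 'fully utilized' worker blocks exist to keep the
// matrix-matrix kernel busy. A worker block is 'fully utilized' if it loads
// at least one full block, i.e. no padded zeros.
bool use_matrix_matrix_kernel(std::size_t M, std::size_t N,
                              std::size_t block_size,
                              std::size_t physical_work_groups)
{
  // Blocks that fit entirely inside the result matrix in each direction:
  std::size_t full_rows = M / block_size;
  std::size_t full_cols = N / block_size;
  std::size_t fully_utilized = full_rows * full_cols;

  // Switch away from the matrix-matrix kernel if fewer than ~4x the
  // physically available work groups can be kept fully busy.
  return fully_utilized >= 4 * physical_work_groups;
}
```

For example, with 64x64 worker blocks and 32 physical work groups, a 1024x1024 result has 256 fully utilized blocks and stays on the matrix-matrix kernel, while a 100x100 result (a single full block) would fall back to a matrix-vector product or inner products.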


> However, I think we can reason for now in terms of order of
> magnitude, assuming 100GB/s for 1TFLOP/s... With these numbers, square
> matrices M=N, and S=4 bytes (float), we get:
>
> k = 0.025 . N
>
> now the choice of k_1 and k_2 seems purely empirical...
>
> The problem with this model is that it doesn't take the second size K
> into account, nor the kernel launch overhead, which prevents
> us from computing too many inner products: taking k_1 = 1 leads to
> inner products being used up to N=40, which involves the computation
> of 1600 inner products... Now, even if we can pack inner products
> together, it's still too many to be practical.
> It's just a first shot, of course, and any idea/hint is more than welcome :P
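As a quick numeric sanity check of the order-of-magnitude example above, under the stated assumptions (100 GB/s, 1 TFLOP/s, S = 4 bytes, M = N), alpha works out to 0.05, so k = 0.025 * N and the threshold k_1 = 1 is indeed reached at N = 40, i.e. 40 * 40 = 1600 inner products:

```cpp
#include <cassert>
#include <cmath>

// k for square matrices M = N under the assumed peaks:
// alpha = 2 * 100e9 / (4 * 1e12) = 0.05, k = alpha * N^2 / (2N) = 0.025 * N
inline double k_square(double N)
{
  double alpha = 2.0 * 100e9 / (4.0 * 1e12); // = 0.05
  return alpha * N * N / (N + N);            // = 0.025 * N
}
```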

If I understand this correctly, then the dispatch based on fully 
utilized worker blocks does not have these issues? The crucial point is to 
get the worker blocks busy, because this is the one thing that makes a 
kernel compute-limited rather than bandwidth-limited.

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
