Hi guys,

I should have done this before...

I checked out the latest snapshot from trunk and my test works fine now.

Cheers :)

Richard Gomes
M: +44(77)9955-6813
http://tinyurl.com/frgomes
twitter: frgomes

JQuantLib is a library for Quantitative Finance written in Java.
http://www.jquantlib.org/
twitter: jquantlib

On 08/12/10 21:57, Richard Gomes wrote:
> Hi guys,
>
> I have several tests which have copied/pasted code like this:
>
> static def calc(
>           p : Place, n : Int,
>           params : RemoteArray[Float]{home==p,rank==1},
>           result : RemoteArray[Float]{home==p,rank==1}) {
>       val blocks  = p.isCUDA() ? 480 : 1;
>       val threads = 512;
>       finish async at (p) @CUDA @CUDADirectParams {
>           finish for ([block] in 0..blocks-1) async {
>               clocked finish for ([thread] in 0..threads-1) clocked async {
>                   val tid  = block * threads + thread;
>                   val tids = blocks * threads;
>                   for (var i:Int = tid; i < n; i += tids) {
>                       val d = params(i);
>                       result(i) = d * d;
>                   }
>               }
>           }
>       }
> }
>
> ... which works fine.
>
>
> One of the tests calls the "calc" function above, more or less like this:
>
>
> finish {
>       for (gpu in gpus.values()) async at (cpu) {
>           ...
>           //--- First step : allocate device arrays
>           val gpuDatum  = CUDAUtilities.makeRemoteArray[Float]
>                   (gpu, len, (j:Int) =>  cpuDatum(size/n * i + j));
>           val gpuResult = CUDAUtilities.makeRemoteArray[Float]
>                   (gpu, len, (j:Int) =>  0.0 as Float);
>
>           //--- Second step : call kernel function
>           calc(gpu, len, gpuDatum, gpuResult);
>
>           ...
>       }
> }
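>
> (Aside: I believe the device arrays should be freed when the test is
> done; if I remember the samples correctly, the call is
> CUDAUtilities.deleteRemoteArray, e.g.:
>
>       //--- Final step : free device arrays
>       CUDAUtilities.deleteRemoteArray(gpuDatum);
>       CUDAUtilities.deleteRemoteArray(gpuResult);
>
> Please correct me if that name changed in trunk.)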
>
>
>
> This example "works" but does not offer any coordination between gpus
> connected to "cpu" (=here).
>
> Since I have only one gpu at the moment, I defined
>
>       export X10RT_ACCELS=CUDA0,CUDA0,CUDA0,CUDA0
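>
> With this setting, "here" gets four child CUDA places, all backed by the
> same physical device. A quick way to list them, using only the Place API
> already shown above (a hedged sketch):
>
>       for (gpu in gpus.values()) {
>           Console.OUT.println("place " + gpu.id + " isCUDA=" + gpu.isCUDA());
>       }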
>
> When I tried to create another example employing teams like KMeansCUDA
> does, it got stuck because all gpus share the same parent (=here).
>
> So it looks like (I guess) Team is good for coordination between
> different cpus, since the code involved is typically host code and not
> kernel code.
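>
> In other words, I would expect the intended use to be one Team member
> per host place, something like this (a hedged sketch; the Team method
> names are from my memory of the 2.1 samples):
>
>       // import x10.util.Team;
>       finish for (p in Place.places()) async at (p) {
>           // ... host-side work at place p ...
>           Team.WORLD.barrier(here.id);   // one role per host place
>       }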
>
> OK. Then I tried coordination using clocks in different ways. For
> reference, the basic clocked pattern I am imitating looks like this in
> plain X10 (a hedged sketch, no CUDA involved):
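>
>       finish async {
>           val c = Clock.make();
>           for ([i] in 0..3) async clocked (c) {
>               Console.OUT.println("task " + i + ": step 1");
>               c.next();    // barrier: nobody starts step 2 early
>               Console.OUT.println("task " + i + ": step 2");
>           }
>           // the spawning async ends here, implicitly dropping c
>       }
>
> In the example below I apply the same idea, explicitly declaring a clock
> to coordinate the per-gpu tasks: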
>
>
> finish async {
>       val c = Clock.make();
>       for (gpu in gpus.values()) async clocked (c) {
>
>           // index of this gpu, and its share of the data
>           val i   = (gpu==cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
>           val len = size/n + ( i+1==n ? size%n : 0 );
>
>           //--- First step : allocate device arrays
>
>           c.next();
>           val gpuDatum  =  ...
>           val gpuResult =  ...
>
>           //--- Second step : call kernel function
>
>           c.next();
>           calc(gpu, len, gpuDatum, gpuResult);
>
>           ...
>       }
> }
>
>
> Executing this example, I got the following message:
>
>       X10RT: async 37 is not a CUDA kernel.
>
> If I'm not wrong, this message comes from the kernel function, since it
> disappears when I comment out the call to the kernel function.
>
> So it looks like there is some relationship between the "finish" in
> the host code and the "finish" in the kernel code.
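>
> My guess (and it is only a guess) is that the backend accepts an async
> shipped to a CUDA place only when its body has exactly the kernel shape
> used in "calc" above, i.e.:
>
>       finish async at (p) @CUDA {
>           finish for ([block] in 0..blocks-1) async {
>               clocked finish for ([thread] in 0..threads-1) clocked async {
>                   // kernel body only; nothing else at this level
>               }
>           }
>       }
>
> so perhaps the surrounding clocked constructs change how that shape is
> recognized.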
>
> The documentation on Finish (pg 160) says that "finish S" waits for the
> termination of all activities spawned by S. I'm certainly confused by
> the implications of this statement, so I tried to simplify the code
> above, like this:
>
>
> clocked finish {
>       for (gpu in gpus.values()) clocked async {
>
>           val i   = (gpu==cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
>           val len = size/n + ( i+1==n ? size%n : 0 );
>
>           //--- First step : allocate device arrays
>
>           next;
>           val gpuDatum  =  ...
>           val gpuResult =  ...
>
>           //--- Second step : call kernel function
>
>           next;
>           calc(gpu, len, gpuDatum, gpuResult);
>
>           ...
>       }
> }
>
>
> When I execute it, the result is absolutely the same:
>
>       X10RT: async 37 is not a CUDA kernel.
>
>
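> Just to state my (possibly wrong) mental model of finish, in plain X10
> terms:
>
>       finish {
>           async Console.OUT.println("child");
>           async { async Console.OUT.println("grandchild"); }  // awaited too
>       }
>       Console.OUT.println("printed only after both asyncs terminate");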
>
> So, could you guys guide me about this?
>
> 1. Am I correct to think that I cannot employ Team when 2 or more GPUs
> belong to the same place?
>
> 2. What is the relationship between a finish in the host code and a
> finish in the kernel code? Or should this question be about clocks
> instead of finishes?
>
> 3. Would you recommend an explicit clock in order to avoid conflicts
> with the clock in the kernel function?
>
> Thanks a lot :)
>
