Hi guys,

I have several tests that share copied-and-pasted code like this:

static def calc(
         p : Place, n : Int,
         params : RemoteArray[Float]{home==p,rank==1},
         result : RemoteArray[Float]{home==p,rank==1}) {
     val blocks  = p.isCUDA() ? 480 : 1;
     val threads = 512;
     finish async at (p) @CUDA @CUDADirectParams {
          for ([block] in 0..blocks-1) async {
             clocked finish
             for ([thread] in 0..threads-1) clocked async {
                 val tid  = block * threads + thread;
                 val tids = blocks * threads;
                 for (var i:int = tid; i < n; i += tids) {
                     val d = params(i);
                     result(i) = d * d;
                 }
             }
          }
     }
}

... which works fine.
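
For reference, the indexing in "calc" (each of the blocks * threads workers starts at its own tid and advances by tids) can be checked on the host. What follows is only a minimal sketch in plain sequential Java, not X10 and not a CUDA kernel. The names StrideCheck, countVisits and coversExactlyOnce are hypothetical and are not part of X10 or of the tests. The sketch just replays the loop nest and confirms that every index in 0..n-1 is visited exactly once.

```java
public class StrideCheck {
    // Sequentially replays the loop nest of calc and counts
    // how often each index in 0..n-1 is visited.
    public static int[] countVisits(int blocks, int threads, int n) {
        int[] visits = new int[n];
        int tids = blocks * threads;                 // total number of workers
        for (int block = 0; block < blocks; block++) {
            for (int thread = 0; thread < threads; thread++) {
                int tid = block * threads + thread;  // global worker id
                for (int i = tid; i < n; i += tids) {
                    visits[i]++;
                }
            }
        }
        return visits;
    }

    // True when every index was visited exactly once.
    public static boolean coversExactlyOnce(int blocks, int threads, int n) {
        for (int v : countVisits(blocks, threads, n)) {
            if (v != 1) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Same shapes as in calc: 480 blocks on a GPU, 1 block otherwise.
        System.out.println(coversExactlyOnce(480, 512, 1000000));
        System.out.println(coversExactlyOnce(1, 512, 1000));
    }
}
```

Both calls print true, which matches the observation that the kernel itself behaves correctly with either value of blocks.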

One of the tests calls the "calc" function above more or less like this:

finish {
for (gpu in gpus.values()) async at (cpu) {
     //--- First step : allocate device arrays
     val gpuDatum  = CUDAUtilities.makeRemoteArray[Float]
             (gpu, len, (j:int) => cpuDatum(size/n * i + j));
     val gpuResult = CUDAUtilities.makeRemoteArray[Float]
             (gpu, len, (j:int) => 0.0 as Float);

     //--- Second step : call kernel function
     calc(gpu, len, gpuDatum, gpuResult);


This example "works" but does not offer any coordination between the GPUs
attached to "cpu" (= here).
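
The snippets in this message split cpuDatum into n chunks: chunk i starts at size/n * i, and its length is size/n, with the last chunk also taking the remainder size%n. As a sanity check of that arithmetic only, here is a small sketch in plain Java (not X10). The class ChunkSplit and its methods are hypothetical names, and the values size = 10 and n = 3 are made up for illustration.

```java
public class ChunkSplit {
    // Offset of chunk i, as in the initializer cpuDatum(size/n * i + j).
    public static int offset(int size, int n, int i) {
        return size / n * i;
    }

    // Length of chunk i: size/n elements, plus the remainder for the last chunk.
    public static int length(int size, int n, int i) {
        return size / n + (i + 1 == n ? size % n : 0);
    }

    public static void main(String[] args) {
        int size = 10, n = 3;   // illustrative values only
        int covered = 0;
        for (int i = 0; i < n; i++) {
            System.out.println("chunk " + i + ": offset " + offset(size, n, i)
                    + ", length " + length(size, n, i));
            covered += length(size, n, i);
        }
        System.out.println("covered " + covered + " of " + size);
    }
}
```

With these values the chunks are (offset 0, length 3), (offset 3, length 3) and (offset 6, length 4), so all 10 elements are covered without overlap.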

Since I still have only one GPU at the moment, I defined


When I tried to write another example that uses teams, as KMeansCUDA
does, it got stuck because all GPUs share the same parent (= here).

So it looks like (I guess) Team is meant for coordination between
different CPUs, since the code involved is typically host code and not
kernel code.

OK. Then I tried coordinating with clocks in different ways.
In the example below I explicitly declare a clock and use it to
coordinate the tasks.

finish async {
     val c = Clock.make();
     for (gpu in gpus.values()) async clocked (c) {

        val i   = (gpu==cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
        val len = size/n + ( i+1==n ? size%n : 0 );

        //--- First step : allocate device arrays

        val gpuDatum  =  ...
        val gpuResult =  ...

        //--- Second step : call kernel function

        calc(gpu, len, gpuDatum, gpuResult);


When I execute this example, I get the following message:

        X10RT: async 37 is not a CUDA kernel.

If I'm not wrong, this message comes from the kernel function, since it
disappears when I comment out the call to the kernel function.

So it looks like there is some relationship between the "finish" in
the host code and the "finish" in the kernel code.

The documentation on Finish (pg 160) says that a "finish" waits for the
termination of all activities spawned by "S". I'm clearly confused by
the implications of this statement, so I tried to simplify the code
above, like this:

clocked finish {
     for (gpu in gpus.values()) async {

        val i   = (gpu==cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
        val len = size/n + ( i+1==n ? size%n : 0 );

        //--- First step : allocate device arrays

        val gpuDatum  =  ...
        val gpuResult =  ...

        //--- Second step : call kernel function

        calc(gpu, len, gpuDatum, gpuResult);


When I execute it, the result is exactly the same:

        X10RT: async 37 is not a CUDA kernel.

So, could you guys give me some guidance on this?

1. Am I correct to think that I cannot employ Team when 2 or more GPUs 
belong to the same place?

2. What is the relationship between a finish in the host code and a
finish in the kernel code? Or should this question be about "clock"s
instead of "finish"es?

3. Would you recommend an explicit clock in order to avoid conflict with
the clock in the kernel function?

Thanks a lot :)

Richard Gomes
M: +44(77)9955-6813
twitter: frgomes

JQuantLib is a library for Quantitative Finance written in Java.
twitter: jquantlib

X10-users mailing list
