Hi guys, I have several tests that share copied-and-pasted code like this:
    static def calc(p : Place, n : Int,
                    params : RemoteArray[Float]{home==p,rank==1},
                    result : RemoteArray[Float]{home==p,rank==1}) {
        val blocks = p.isCUDA() ? 480 : 1;
        val threads = 512;
        finish async at (p) @CUDA @CUDADirectParams {
            finish for ([block] in 0..blocks-1) async {
                clocked finish for ([thread] in 0..threads-1) clocked async {
                    val tid  = block * threads + thread;
                    val tids = blocks * threads;
                    for (var i:int = tid; i < n; i += tids) {
                        val d = params(i);
                        result(i) = d * d;
                    }
                }
            }
        }
    }

... which works fine. One of the tests calls the "calc" function above more or less like this:

    finish {
        for (gpu in gpus.values()) async at (cpu) {
            ...
            //--- First step : allocate device arrays
            val gpuDatum  = CUDAUtilities.makeRemoteArray[Float](gpu, len, (j:int) => cpuDatum(size/n * i + j));
            val gpuResult = CUDAUtilities.makeRemoteArray[Float](gpu, len, (j:int) => 0.0 as Float);
            //--- Second step : call kernel function
            calc(gpu, len, gpuDatum, gpuResult);
            ...
        }
    }

This example "works", but it does not offer any coordination between the GPUs connected to "cpu" (= here). Since I still have only one GPU at the moment, I defined

    export X10RT_ACCELS=CUDA0,CUDA0,CUDA0,CUDA0

When I tried to create another example employing teams, as KMeansCUDA does, it got stuck because all the GPUs share the same parent (= here). So it looks like (I guess) Team is good for coordination between different CPUs, since that code is typically host code rather than kernel code.

OK. Then I tried to coordinate with clocks in a few different ways. In the example below I explicitly declare a clock and use it to coordinate the tasks:

    finish async {
        val c = Clock.make();
        for (gpu in gpus.values()) async clocked (c) {
            val i   = (gpu == cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
            val len = size/n + (i+1 == n ? size%n : 0);

            //--- First step : allocate device arrays
            c.next();
            val gpuDatum  = ...
            val gpuResult = ...

            //--- Second step : call kernel function
            c.next();
            calc(gpu, len, gpuDatum, gpuResult);
            ...
        }
    }

Executing this example, I get the following message:

    X10RT: async 37 is not a CUDA kernel.

If I'm not wrong, this message comes from the kernel function, since it disappears when I comment out the call to the kernel function. So it looks like there is some relationship between the "finish" in the host code and the "finish" in the kernel code.

The documentation on finish (p. 160) says that "finish S" waits for termination of all activities spawned by S. I'm certainly confused by the implications of this statement, so I tried to simplify the code above like this:

    clocked finish {
        for (gpu in gpus.values()) async {
            val i   = (gpu == cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
            val len = size/n + (i+1 == n ? size%n : 0);

            //--- First step : allocate device arrays
            next;
            val gpuDatum  = ...
            val gpuResult = ...

            //--- Second step : call kernel function
            next;
            calc(gpu, len, gpuDatum, gpuResult);
            ...
        }
    }

When I execute it, the result is exactly the same:

    X10RT: async 37 is not a CUDA kernel.

So, could you guys guide me on this?

1. Am I correct to think that I cannot employ Team when two or more GPUs belong to the same place?

2. What is the relationship between a "finish" in the host code and a "finish" in the kernel code? Or maybe this question should be about "clock"s instead of "finish"es?

3. Would you recommend an explicit clock in order to avoid a conflict with the clock in the kernel function?

Thanks a lot :)
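P.S. In case it helps to see what I expect to happen, here is a stripped-down, host-only sketch of the clocked-finish pattern I have in mind (no GPUs and no kernel call; the class name and the printed messages are just made up for illustration):

    // Toy example only: two host activities advancing in lock step on the
    // implicit clock of a "clocked finish". Nothing here touches CUDA.
    public class ClockedToy {
        public static def main(args:Array[String](1)) {
            clocked finish {
                for ([i] in 0..1) clocked async {
                    Console.OUT.println("worker " + i + ": step 1");  // e.g. allocate arrays
                    next;                                             // wait until every worker finished step 1
                    Console.OUT.println("worker " + i + ": step 2");  // e.g. call the kernel
                }
            }
            // Reached only after both clocked asyncs have terminated.
            Console.OUT.println("all workers done");
        }
    }

If my reading of the documentation is right, the last println should only run after both workers finish step 2. What I cannot figure out is whether the same reasoning still holds when step 2 is the @CUDA kernel, which has its own "finish"es and its own clock inside.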
--
Richard Gomes
M: +44(77)9955-6813
http://tinyurl.com/frgomes
twitter: frgomes

JQuantLib is a library for Quantitative Finance written in Java.
http://www.jquantlib.org/
twitter: jquantlib