I think the example I gave was too simple for what I was trying to get at, and 
I apologize for that. Suppose that instead of vectors of i's and j's we have a 
100 x 500 matrix X, and what we want is a final vector whose i-th entry is 
T.sum(T.nlinalg.matrix_inverse(T.dot(T.transpose(Y), Y))), where Y = 
X[:, [i, i+1, i+2]]. Assuming I have written a scan function called RunFunc 
that runs the code in the previous sentence, and I call 
RunFunc(range(0, 100)), what I want is for each of my 100 GPU cores to run 
the function independently and place its result in the proper position of my 
final results vector. Under ideal conditions with no compilation time, if one 
call of the function (e.g., RunFunc(8) or RunFunc(25)) takes 1 second per GPU 
core, I would like the 100-valued output vector in 1 second instead of 100 
seconds. The latter (100 seconds) is what the scan function seems to give, 
since it runs on the GPU the way a loop runs on a single-core CPU.
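For what it's worth, here is a minimal NumPy sketch of the computation above in batched (vectorized-over-i) form, which is the shape of work a GPU can parallelize without scan's sequential loop. The shapes and the random X are illustrative assumptions, not taken from the thread:

```python
import numpy as np

# Illustrative data: X is 100 x 500, as in the post; values are arbitrary.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))

# Y_i = X[:, [i, i+1, i+2]] for i = 0..99, stacked into shape (100, 100, 3):
# axis 0 indexes i, axis 1 the rows of X, axis 2 the 3-column window.
idx = np.arange(100)[:, None] + np.arange(3)[None, :]   # (100, 3) column indices
Y = X[:, idx].transpose(1, 0, 2)                        # (100, 100, 3)

# Batched Gram matrices Y_i^T Y_i, shape (100, 3, 3), inverted in one call.
gram = np.einsum('nij,nik->njk', Y, Y)

# result[i] = sum of the entries of (Y_i^T Y_i)^{-1}
result = np.linalg.inv(gram).sum(axis=(1, 2))           # shape (100,)
```

Expressing the whole batch this way (rather than one i per scan step) is what lets the 100 independent 3x3 inversions run concurrently.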

On Thursday, October 27, 2016 at 1:43:32 PM UTC-4, Jesse Livezey wrote:
>
> There's an example here for addition which will look very similar to 
> multiplication:
>
> http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-matrices
>
> On Thursday, October 27, 2016 at 10:42:38 AM UTC-7, Jesse Livezey wrote:
>>
>> It will look similar to creating two numpy arrays and multiplying them 
>> elementwise, except it will perform the multiplications in parallel on the 
>> gpu.
>>
>> On Thursday, October 27, 2016 at 10:41:33 AM UTC-7, Jesse Livezey wrote:
>>>
>>> If you can create a theano vector that has all of the i's and a second 
>>> theano vector that has all of the j's, then you can just do i*j, and it 
>>> will perform all of the multiplications in parallel.
>>>
>>> On Wednesday, October 26, 2016 at 11:48:06 PM UTC-7, kd...@cornell.edu 
>>> wrote:
>>>>
>>>> I would like to compute the result of i*j for a number of i's and j's, 
>>>> and I would like to do so concurrently. If I use the scan function over my 
>>>> sequence of i's and j's, I will get my desired result, but it will not 
>>>> perform the operations concurrently. If I have 100 cores in my single GPU, 
>>>> I would like there to be 100 asynchronous computations (technically more 
>>>> since each core has multiple threads) of the multiplication and final 
>>>> assignment to one vector that will be returned. This is similar to how 
>>>> multiprocessing works in base python with CPU cores. The Theano tutorial 
>>>> claims that it uses GPU asynchronous capabilities, but I am not sure of 
>>>> that, as I have run scan functions and they seem to go as fast as or 
>>>> slower than the CPU.
>>>>
>>>> Should I not use scan? Can this even be done in Theano? Do I have to 
>>>> use PyCUDA?
>>>>
>>>
