Hi,

  I am holding an array on the GPU (in a shared variable) and sampling 
random minibatches from it, but it seems a HostFromGpu call is made for 
every index, which adds significant delay.  Is there a way to avoid this?

  Here is a minimal code example, plus the debugprint and profiling output.  
The same thing happens if I use theano.map (a sketch of that variant follows 
the code example below).  The problem is much worse in my actual code, which 
uses multiple levels of indexing: even though the data arrays there are much 
larger, the time spent in the many HostFromGpu calls still dominates.


Code example: 

import theano
import theano.tensor as T
import numpy as np

H = W = 3
N = 10
B = 3

# N source items of shape (H, W), stored on the GPU in a shared variable.
src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
# Destination buffer for a minibatch of B items.
dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
# Minibatch indices, passed in from the host.
idxs = T.ivector('idxs')

# Select B items one index at a time, stack them, and write them into dest.
selections = [src[idxs[i]] for i in range(B)]
new_dest = T.stack(selections)
updates = [(dest, new_dest)]
f = theano.function(inputs=[idxs], updates=updates)

np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
print(dest.get_value())
f(np_idxs)
print(dest.get_value())

theano.printing.debugprint(f)
# Call several more times so the profiler has something to aggregate.
for _ in range(10):
    f(np_idxs)
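

For reference, the theano.map variant I mentioned looks roughly like this (a 
sketch reusing the same src, dest, and idxs as above, not my exact code):

# Sketch: select the rows with theano.map instead of a Python loop.
# theano.map returns (outputs, updates); the scan updates are empty here.
selections_map, map_updates = theano.map(lambda i: src[i], sequences=[idxs])
f_map = theano.function(inputs=[idxs], updates=[(dest, selections_map)])
f_map(np_idxs)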


Debugprint (note the separate HostFromGpu node, each with its own ID, 
feeding every ScalarFromTensor):

GpuJoin [id A] ''   16
 |TensorConstant{0} [id B]
 |InplaceGpuDimShuffle{x,0,1} [id C] ''   15
 | |GpuSubtensor{int32} [id D] ''   14
 |   |src [id E]
 |   |ScalarFromTensor [id F] ''   13
 |     |HostFromGpu(gpuarray) [id G] ''   12
 |       |GpuSubtensor{int64} [id H] ''   11
 |         |GpuFromHost<None> [id I] ''   0
 |         | |idxs [id J]
 |         |Constant{0} [id K]
 |InplaceGpuDimShuffle{x,0,1} [id L] ''   10
 | |GpuSubtensor{int32} [id M] ''   9
 |   |src [id E]
 |   |ScalarFromTensor [id N] ''   8
 |     |HostFromGpu(gpuarray) [id O] ''   7
 |       |GpuSubtensor{int64} [id P] ''   6
 |         |GpuFromHost<None> [id I] ''   0
 |         |Constant{1} [id Q]
 |InplaceGpuDimShuffle{x,0,1} [id R] ''   5
   |GpuSubtensor{int32} [id S] ''   4
     |src [id E]
     |ScalarFromTensor [id T] ''   3
       |HostFromGpu(gpuarray) [id U] ''   2
         |GpuSubtensor{int64} [id V] ''   1
           |GpuFromHost<None> [id I] ''   0
           |Constant{2} [id W]



Theano profile over 10 calls to the function (note 10 calls to GpuFromHost 
but 30 calls to HostFromGpu, i.e. one per index per call):

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  38.9%    38.9%       0.001s       5.27e-05s     C       10       1   theano.gpuarray.basic_ops.GpuJoin
  31.5%    70.4%       0.000s       1.42e-05s     C       30       3   theano.gpuarray.basic_ops.HostFromGpu
  15.0%    85.4%       0.000s       2.03e-05s     C       10       1   theano.gpuarray.basic_ops.GpuFromHost
   7.4%    92.8%       0.000s       1.67e-06s     C       60       6   theano.gpuarray.subtensor.GpuSubtensor
   6.0%    98.8%       0.000s       2.69e-06s     C       30       3   theano.gpuarray.elemwise.GpuDimShuffle
   1.2%   100.0%       0.000s       5.56e-07s     C       30       3   theano.tensor.basic.ScalarFromTensor
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)



Appreciate any tips! Thanks!
Adam




  
