Hi, I am holding an array on the GPU (in a shared variable) and sampling random minibatches from it, but there seems to be a HostFromGpu call for every index, which adds significant delay. Is there a way to avoid this?
Here is a minimal code example, plus the debugprint and profiling output. The same thing happens if I use theano.map. The problem is much worse in my actual code, which uses multiple levels of indexing: even though the data arrays there are much larger, the time spent in the many calls to HostFromGpu still dominates.

Code example:

import theano
import theano.tensor as T
import numpy as np

H = W = 3
N = 10
B = 3

src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")

idxs = T.ivector('idxs')
selections = [src[idxs[i]] for i in range(B)]
new_dest = T.stack(selections)
updates = [(dest, new_dest)]
f = theano.function(inputs=[idxs], updates=updates)

np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
print(dest.get_value())
f(np_idxs)
print(dest.get_value())

theano.printing.debugprint(f)

for _ in range(10):
    f(np_idxs)

Debugprint (notice the separate HostFromGpu node, each with its own id, leading up to each ScalarFromTensor):

GpuJoin [id A] ''   16
 |TensorConstant{0} [id B]
 |InplaceGpuDimShuffle{x,0,1} [id C] ''   15
 | |GpuSubtensor{int32} [id D] ''   14
 | | |src [id E]
 | | |ScalarFromTensor [id F] ''   13
 | | | |HostFromGpu(gpuarray) [id G] ''   12
 | | | | |GpuSubtensor{int64} [id H] ''   11
 | | | | | |GpuFromHost<None> [id I] ''   0
 | | | | | | |idxs [id J]
 | | | | | |Constant{0} [id K]
 |InplaceGpuDimShuffle{x,0,1} [id L] ''   10
 | |GpuSubtensor{int32} [id M] ''   9
 | | |src [id E]
 | | |ScalarFromTensor [id N] ''   8
 | | | |HostFromGpu(gpuarray) [id O] ''   7
 | | | | |GpuSubtensor{int64} [id P] ''   6
 | | | | | |GpuFromHost<None> [id I] ''   0
 | | | | | |Constant{1} [id Q]
 |InplaceGpuDimShuffle{x,0,1} [id R] ''   5
 | |GpuSubtensor{int32} [id S] ''   4
 | | |src [id E]
 | | |ScalarFromTensor [id T] ''   3
 | | | |HostFromGpu(gpuarray) [id U] ''   2
 | | | | |GpuSubtensor{int64} [id V] ''   1
 | | | | | |GpuFromHost<None> [id I] ''   0
 | | | | | |Constant{2} [id W]

Theano profile over 10 calls to the function (notice 10 calls to GpuFromHost but 30 calls to HostFromGpu):

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
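For reference, the loop-and-stack above is just a fancy-indexing gather: selecting B rows of src by an integer index vector. Here is a plain NumPy sketch of that equivalence (same names and shapes as the example; NumPy stands in for the Theano semantics here — I'm assuming Theano's src[idxs] with a vector index means the same gather, though this snippet doesn't show what graph it compiles to):

```python
import numpy as np

H = W = 3
N = 10
B = 3

src = np.random.rand(N, H, W).astype(np.float32)
np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)

# Per-index gather, mirroring the graph above: one lookup per minibatch row.
looped = np.stack([src[np_idxs[i]] for i in range(B)])

# Single advanced-indexing gather: the whole minibatch in one operation.
gathered = src[np_idxs]

assert np.array_equal(looped, gathered)
print(gathered.shape)  # (3, 3, 3)
```
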
  38.9%    38.9%       0.001s       5.27e-05s     C       10        1   theano.gpuarray.basic_ops.GpuJoin
  31.5%    70.4%       0.000s       1.42e-05s     C       30        3   theano.gpuarray.basic_ops.HostFromGpu
  15.0%    85.4%       0.000s       2.03e-05s     C       10        1   theano.gpuarray.basic_ops.GpuFromHost
   7.4%    92.8%       0.000s       1.67e-06s     C       60        6   theano.gpuarray.subtensor.GpuSubtensor
   6.0%    98.8%       0.000s       2.69e-06s     C       30        3   theano.gpuarray.elemwise.GpuDimShuffle
   1.2%   100.0%       0.000s       5.56e-07s     C       30        3   theano.tensor.basic.ScalarFromTensor
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Appreciate any tips! Thanks!

Adam

--
---
You received this message because you are subscribed to the Google Groups "theano-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to theano-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.