Re: [theano-users] Avoiding HostFromGPU at every Index into Shared Variable?

2018-02-07 Thread Frédéric Bastien
On the GPU, not all indexing is fast. Slices are fast (they just return a
view). For advanced indexing, only this pattern has been well optimized:

a_tensor[a_vector_of_int]

From memory, the vector of ints can be used on any of the dimensions, but it
is certainly well optimized on the first dimension.

We have code that supports more advanced indexing on the GPU, but it is
sometimes slower and sometimes faster, so it is not activated by default.
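
In your example below, each idxs[i] is a scalar tensor that ends up on the GPU
after the GpuFromHost, and the Subtensor index must be brought back to the
host as a scalar (the ScalarFromTensor), which is why you see one HostFromGpu
per index. Replacing the Python loop of B scalar indexes with a single vector
index should hit the optimized case instead. An untested sketch based on your
posted code:

import theano
import theano.tensor as T
import numpy as np

# Same shapes as in the posted example.
N, H, W, B = 10, 3, 3, 3
src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
idxs = T.ivector('idxs')

# Index once with the whole vector instead of B scalar indexes. This
# compiles to a single advanced-subtensor op that stays on the GPU, so
# there is no per-element ScalarFromTensor/HostFromGpu round trip and
# no GpuJoin to re-stack the rows.
f = theano.function(inputs=[idxs], updates=[(dest, src[idxs])])

The result of src[idxs] already has shape (B, H, W), so the T.stack is not
needed either.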

For the "other computation being slow". It will depend what is that
computation. Without seeing the profile of that part, I can't comment. But
we didn't spend a good amount of time optimizing those type of computation.
So I'm not suprised that there is case when the generated code isn't very
optimized.
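
If you can, compile just that computation with profiling enabled and post the
breakdown. A sketch, reusing the names from your example (assuming f is the
compiled function for the slow part):

# Enable the per-function profiler at compile time.
f = theano.function(inputs=[idxs], updates=updates, profile=True)
for _ in range(10):
    f(np_idxs)
# Print the per-op and per-class timing breakdown for this function only.
f.profile.summary()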

Frédéric


On Fri, Jan 19, 2018 at 3:42 PM Adam Stooke wrote:

> Hi,
>
>   I am holding an array on the GPU (in a shared variable), and I'm
> sampling random minibatches from it, but it seems there is a call to
> HostFromGpu at every index, which causes significant delay.  Is there a way
> to avoid this?
>
>   Here is a minimal code example, plus the debug and profiling printouts.
> The same thing happens if I use theano.map.  The problem is much worse in
> my actual code, which uses multiple levels of indexing--despite also using
> much larger data arrays, the time in the many calls to HostFromGpu
> dominates.
>
>
> Code example:
>
> import theano
> import theano.tensor as T
> import numpy as np
>
> H = W = 3
> N = 10
> B = 3
>
> src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
> dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
> idxs = T.ivector('idxs')
>
> selections = [src[idxs[i]] for i in range(B)]
> new_dest = T.stack(selections)
> updates = [(dest, new_dest)]
> f = theano.function(inputs=[idxs], updates=updates)
>
> np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
> print(dest.get_value())
> f(np_idxs)
> print(dest.get_value())
>
> theano.printing.debugprint(f)
> for _ in range(10):
>     f(np_idxs)
>
>
> Debugprint (notice the separate HostFromGpu nodes, one feeding each
> ScalarFromTensor):
>
> GpuJoin [id A] ''   16
>  |TensorConstant{0} [id B]
>  |InplaceGpuDimShuffle{x,0,1} [id C] ''   15
>  | |GpuSubtensor{int32} [id D] ''   14
>  |   |src [id E]
>  |   |ScalarFromTensor [id F] ''   13
>  |     |HostFromGpu(gpuarray) [id G] ''   12
>  |       |GpuSubtensor{int64} [id H] ''   11
>  |         |GpuFromHost [id I] ''   0
>  |         | |idxs [id J]
>  |         |Constant{0} [id K]
>  |InplaceGpuDimShuffle{x,0,1} [id L] ''   10
>  | |GpuSubtensor{int32} [id M] ''   9
>  |   |src [id E]
>  |   |ScalarFromTensor [id N] ''   8
>  |     |HostFromGpu(gpuarray) [id O] ''   7
>  |       |GpuSubtensor{int64} [id P] ''   6
>  |         |GpuFromHost [id I] ''   0
>  |         |Constant{1} [id Q]
>  |InplaceGpuDimShuffle{x,0,1} [id R] ''   5
>    |GpuSubtensor{int32} [id S] ''   4
>      |src [id E]
>      |ScalarFromTensor [id T] ''   3
>        |HostFromGpu(gpuarray) [id U] ''   2
>          |GpuSubtensor{int64} [id V] ''   1
>            |GpuFromHost [id I] ''   0
>            |Constant{2} [id W]
>
>
>
> Theano profile (over 10 calls to the function--notice 10 calls to
> GpuFromHost but 30 calls to HostFromGpu):
>
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
>   38.9%    38.9%       0.001s       5.27e-05s     C       10        1   theano.gpuarray.basic_ops.GpuJoin
>   31.5%    70.4%       0.000s       1.42e-05s     C       30        3   theano.gpuarray.basic_ops.HostFromGpu
>   15.0%    85.4%       0.000s       2.03e-05s     C       10        1   theano.gpuarray.basic_ops.GpuFromHost
>    7.4%    92.8%       0.000s       1.67e-06s     C       60        6   theano.gpuarray.subtensor.GpuSubtensor
>    6.0%    98.8%       0.000s       2.69e-06s     C       30        3   theano.gpuarray.elemwise.GpuDimShuffle
>    1.2%   100.0%       0.000s       5.56e-07s     C       30        3   theano.tensor.basic.ScalarFromTensor
>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>
>
>
> Appreciate any tips! Thanks!
> Adam
>



[theano-users] Avoiding HostFromGPU at every Index into Shared Variable?

2018-01-19 Thread Adam Stooke
Hi,

  I am holding an array on the GPU (in a shared variable), and I'm sampling 
random minibatches from it, but it seems there is a call to HostFromGpu at 
every index, which causes significant delay.  Is there a way to avoid this?

  Here is a minimal code example, plus the debug and profiling printouts.  
The same thing happens if I use theano.map.  The problem is much worse in 
my actual code, which uses multiple levels of indexing--despite also using 
much larger data arrays, the time in the many calls to HostFromGpu 
dominates.  


Code example: 

import theano
import theano.tensor as T
import numpy as np

H = W = 3
N = 10
B = 3

src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
idxs = T.ivector('idxs')

selections = [src[idxs[i]] for i in range(B)]
new_dest = T.stack(selections)
updates = [(dest, new_dest)]
f = theano.function(inputs=[idxs], updates=updates)

np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
print(dest.get_value())
f(np_idxs)
print(dest.get_value())

theano.printing.debugprint(f)
for _ in range(10):
    f(np_idxs)


Debugprint (notice the separate HostFromGpu nodes, one feeding each
ScalarFromTensor):

GpuJoin [id A] ''   16
 |TensorConstant{0} [id B]
 |InplaceGpuDimShuffle{x,0,1} [id C] ''   15
 | |GpuSubtensor{int32} [id D] ''   14
 |   |src [id E]
 |   |ScalarFromTensor [id F] ''   13
 |     |HostFromGpu(gpuarray) [id G] ''   12
 |       |GpuSubtensor{int64} [id H] ''   11
 |         |GpuFromHost [id I] ''   0
 |         | |idxs [id J]
 |         |Constant{0} [id K]
 |InplaceGpuDimShuffle{x,0,1} [id L] ''   10
 | |GpuSubtensor{int32} [id M] ''   9
 |   |src [id E]
 |   |ScalarFromTensor [id N] ''   8
 |     |HostFromGpu(gpuarray) [id O] ''   7
 |       |GpuSubtensor{int64} [id P] ''   6
 |         |GpuFromHost [id I] ''   0
 |         |Constant{1} [id Q]
 |InplaceGpuDimShuffle{x,0,1} [id R] ''   5
   |GpuSubtensor{int32} [id S] ''   4
     |src [id E]
     |ScalarFromTensor [id T] ''   3
       |HostFromGpu(gpuarray) [id U] ''   2
         |GpuSubtensor{int64} [id V] ''   1
           |GpuFromHost [id I] ''   0
           |Constant{2} [id W]



Theano profile (over 10 calls to the function--notice 10 calls to GpuFromHost
but 30 calls to HostFromGpu):

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  38.9%    38.9%       0.001s       5.27e-05s     C       10        1   theano.gpuarray.basic_ops.GpuJoin
  31.5%    70.4%       0.000s       1.42e-05s     C       30        3   theano.gpuarray.basic_ops.HostFromGpu
  15.0%    85.4%       0.000s       2.03e-05s     C       10        1   theano.gpuarray.basic_ops.GpuFromHost
   7.4%    92.8%       0.000s       1.67e-06s     C       60        6   theano.gpuarray.subtensor.GpuSubtensor
   6.0%    98.8%       0.000s       2.69e-06s     C       30        3   theano.gpuarray.elemwise.GpuDimShuffle
   1.2%   100.0%       0.000s       5.56e-07s     C       30        3   theano.tensor.basic.ScalarFromTensor
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)



Appreciate any tips! Thanks!
Adam




  
