I realize now the above example might seem strange, in that I build the 
"selections" as an explicit list rather than just feeding "idxs" directly 
into "src".  The reason is that I actually need to get a slice (of fixed 
size) at each index; a tiny NumPy sketch of the target follows the list 
below.  The script further down contains the full problem, including three 
possible solutions:
1) explicitly construct the list of slices, 
2) use theano.map to get the slices, 
3) build all the individual indexes corresponding to the slice elements, 
gather them in one go, and then reshape (each slice becomes its own unit of 
data, separated along another dimension).  

My observations in testing:  
For a small batch size, like 32, method 3 (idx) is fastest, followed by 
method 1 (list).  For a large batch size, like 2048, method 2 (map) is 
fastest, and method 1 (list) doesn't compile, at least not after several 
minutes.  Still, a significant portion of the time in both methods 1 and 2 
is spent in HostFromGpu, transferring the indexes.  The scan op appears to 
run on the CPU.  Even so, I think grabbing whole slices, rather than each 
and every index, may be what gives the better performance at large batch 
size.  
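
For anyone who wants to reproduce that breakdown, I'm reading the 
per-op-class numbers from Theano's built-in profiler, roughly as below 
(reusing names from the full script, and going from memory on the exact 
summary call, so treat it as a sketch):

import sys

# recompile one of the functions (the list version here) with profiling on
f_prof = theano.function(inputs=[idxs_0, idxs_1], updates=updates_list,
                         name="list_prof", profile=True)
for _ in range(LOOPS):
    f_prof(np_idxs_0, np_idxs_1)
# per-op-class breakdown, which is where the HostFromGpu share shows up
f_prof.profile.summary(file=sys.stdout)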

So the question stands:  how can I gather indexes/slices from a shared 
variable without a HostFromGpu transfer happening for every index?
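
In case it helps to see the pattern stripped of the slicing, here is the 
smallest contrast I can write down between per-element scalar indexing 
(which is where the per-index HostFromGpu shows up in the debugprint from 
my first message) and a single advanced-indexing call over the whole index 
vector (the pattern method 3 builds on).  This is a hypothetical side 
example, not part of the benchmark script:

import numpy as np
import theano
import theano.tensor as T

E, N, H, W, B = 4, 2000, 200, 200, 32
src = theano.shared(np.random.rand(E, N, H, W).astype(np.float32), name="src")
idxs = T.lvector('idxs')

# pattern A: one scalar subtensor per element; each idxs[i] goes through
# ScalarFromTensor, and that is where the per-index HostFromGpu appears
per_index = T.stack([src[0, idxs[i]] for i in range(B)])

# pattern B: one advanced-indexing call with whole index vectors
# (what method 3 builds on, after expanding the slice indexes)
vectorized = src[T.zeros((B,), dtype='int64'), idxs]

f_per = theano.function([idxs], per_index, name="per_index")
f_vec = theano.function([idxs], vectorized, name="vectorized")
theano.printing.debugprint(f_per)  # per-index ScalarFromTensor <- HostFromGpu
theano.printing.debugprint(f_vec)  # expecting a single gather, no per-index transfer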

Please help! :)


import theano
import theano.tensor as T
import numpy as np
import time

E = 4
H = W = 200
N = 2000
B = 32  # 256, 2048
S = 4
LOOPS = 10

LIST = True
MAP = True
IDX = True

np_src = np.random.rand(E, N, H, W).astype(np.float32)
src = theano.shared(np_src, name="src")
np_dest_zeros = np.zeros([B, S, H, W], dtype=np.float32)
idxs_0 = T.lvector('idxs_0')
idxs_1 = T.lvector('idxs_1')

np_idxs_0 = np.random.randint(low=0, high=E, size=B)
np_idxs_1 = np.random.randint(low=0, high=N - S, size=B)  # .astype(np.int32)
np_answer = np.stack([np_src[e, i:i + S] for e, i in zip(np_idxs_0, np_idxs_1)])


# Fixed list of states method ############
if LIST:
    dest_list = theano.shared(np.zeros([B, S, H, W], dtype=np.float32),
                              name="dest_list")
    selections_list = [src[idxs_0[i], idxs_1[i]:idxs_1[i] + S]
                       for i in range(B)]
    new_dest_list = T.stack(selections_list)
    updates_list = [(dest_list, new_dest_list)]
    f_list = theano.function(inputs=[idxs_0, idxs_1], updates=updates_list,
                             name="list")

    # print(dest_list.get_value())
    f_list(np_idxs_0, np_idxs_1)
    # print(dest_list.get_value())
    theano.printing.debugprint(f_list)
    # time.sleep(1)
    # t0_list = time.time()
    for _ in range(LOOPS):
        f_list(np_idxs_0, np_idxs_1)
    # x = dest_list.get_value()
    # t_list = time.time() - t0_list


# mapped list of states method ###########
if MAP:

    # s = theano.shared(S, name="S")
    # print("s.dtype: ", s.dtype, "s.get_value: ", s.get_value())
    dest_map = theano.shared(np_dest_zeros, name="dest_map")


    def get_state(idx_0, idx_1, data):
        # tried using a shared variable in place of "S" here--no effect
        return data[idx_0, idx_1:idx_1 + S]
        # return data[idx_0, slice(idx_1, idx_1 + S)]


    # theano.map returns (outputs, updates); get_state has no internal shared
    # updates, so the scan updates should be empty and are not reused below
    states_map, scan_updates_map = theano.map(
        fn=get_state,
        sequences=[idxs_0, idxs_1],
        non_sequences=src,
        )
    # concatenating a single array is effectively a pass-through
    new_dest_map = T.concatenate([states_map])
    updates_map = [(dest_map, new_dest_map)]
    f_map = theano.function(inputs=[idxs_0, idxs_1], updates=updates_map,
                            name="map")

    # print(dest_map.get_value())
    f_map(np_idxs_0, np_idxs_1)
    # print(dest_map.get_value())
    print("\n\n")
    theano.printing.debugprint(f_map)
    # time.sleep(1)
    # t0_map = time.time()
    for _ in range(LOOPS):
        f_map(np_idxs_0, np_idxs_1)
    # x = dest_map.get_value()
    # t_map = time.time() - t0_map


# full idx list reshaping method ########
if IDX:
    dest_idx = theano.shared(np_dest_zeros, name="dest_idx")

    # expand each start index into S consecutive step indexes: (B,) -> (B, S)
    step_idxs_col = T.reshape(idxs_1, (-1, 1))
    step_idxs_tile = T.tile(step_idxs_col, (1, S))
    step_idxs_rang = step_idxs_tile + T.arange(S)
    step_idxs_flat = step_idxs_rang.reshape([-1])   # flatten to (B * S,)
    env_idxs_repeat = T.repeat(idxs_0, S)           # match: (B,) -> (B * S,)

    # one advanced-indexing gather for all B * S rows, then regroup the slices
    selections_idx = src[env_idxs_repeat, step_idxs_flat]
    new_dest_idx = selections_idx.reshape([-1, S, H, W])
    updates_idx = [(dest_idx, new_dest_idx)]
    f_idx = theano.function(inputs=[idxs_0, idxs_1], updates=updates_idx,
                            name="idx")

    # print(dest_idx.get_value())
    f_idx(np_idxs_0, np_idxs_1)
    # print(dest_idx.get_value())
    print("\n\n")
    theano.printing.debugprint(f_idx)
    # time.sleep(1)
    # t0_idx = time.time()
    for _ in range(LOOPS):
        f_idx(np_idxs_0, np_idxs_1)
    # x = dest_idx.get_value()
    # t_idx = time.time() - t0_idx


###################################################
if LIST:
    print("Theano list values pass: ", np.allclose(np_answer, 
dest_list.get_value()))
    # print("list time: ", t_list)
if MAP:
    print("Theano map values pass: ", np.allclose(np_answer, 
dest_map.get_value()))
    # print("map time: ", t_map)
if IDX:
    print("Theano idx values pass: ", np.allclose(np_answer, 
dest_idx.get_value()))
    # print("idx time: ", t_idx)








On Friday, January 19, 2018 at 12:42:16 PM UTC-8, Adam Stooke wrote:
>
> Hi,
>
>   I am holding an array on the GPU (in a shared variable), and I'm 
> sampling random minibatches from it, but it seems there is a call to 
> HostFromGpu at every index, which causes significant delay.  Is there a way 
> to avoid this?
>
>   Here is a minimal code example, plus the debug and profiling printouts.  
> The same thing happens if I use theano.map.  The problem is much worse in 
> my actual code, which uses multiple levels of indexing--despite also using 
> much larger data arrays, the time in the many calls to HostFromGpu 
> dominates.  
>
>
> Code example: 
>
> import theano
> import theano.tensor as T
> import numpy as np
>
> H = W = 3
> N = 10
> B = 3
>
> src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
> dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
> idxs = T.ivector('idxs')
>
> selections = [src[idxs[i]] for i in range(B)]
> new_dest = T.stack(selections)
> updates = [(dest, new_dest)]
> f = theano.function(inputs=[idxs], updates=updates)
>
> np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
> print(dest.get_value())
> f(np_idxs)
> print(dest.get_value())
>
> theano.printing.debugprint(f)
> for _ in range(10):
>     f(np_idxs)
>
>
> Debugprint (notice the HostFromGpu listed with unique ID leading up to 
> each ScalarFromTensor):
>
> GpuJoin [id A] ''   16
>  |TensorConstant{0} [id B]
>  |InplaceGpuDimShuffle{x,0,1} [id C] ''   15
>  | |GpuSubtensor{int32} [id D] ''   14
>  |   |src [id E]
>  |   |ScalarFromTensor [id F] ''   13
>  |     |HostFromGpu(gpuarray) [id G] ''   12
>  |       |GpuSubtensor{int64} [id H] ''   11
>  |         |GpuFromHost<None> [id I] ''   0
>  |         | |idxs [id J]
>  |         |Constant{0} [id K]
>  |InplaceGpuDimShuffle{x,0,1} [id L] ''   10
>  | |GpuSubtensor{int32} [id M] ''   9
>  |   |src [id E]
>  |   |ScalarFromTensor [id N] ''   8
>  |     |HostFromGpu(gpuarray) [id O] ''   7
>  |       |GpuSubtensor{int64} [id P] ''   6
>  |         |GpuFromHost<None> [id I] ''   0
>  |         |Constant{1} [id Q]
>  |InplaceGpuDimShuffle{x,0,1} [id R] ''   5
>    |GpuSubtensor{int32} [id S] ''   4
>      |src [id E]
>      |ScalarFromTensor [id T] ''   3
>        |HostFromGpu(gpuarray) [id U] ''   2
>          |GpuSubtensor{int64} [id V] ''   1
>            |GpuFromHost<None> [id I] ''   0
>            |Constant{2} [id W]
>
>
>
> Theano profile (in 10 calls to the function--notice 10 calls to 
> GpuFromHost but 30 calls to HostFromGPU):
>
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
>   38.9%    38.9%       0.001s       5.27e-05s     C       10       1   theano.gpuarray.basic_ops.GpuJoin
>   31.5%    70.4%       0.000s       1.42e-05s     C       30       3   theano.gpuarray.basic_ops.HostFromGpu
>   15.0%    85.4%       0.000s       2.03e-05s     C       10       1   theano.gpuarray.basic_ops.GpuFromHost
>    7.4%    92.8%       0.000s       1.67e-06s     C       60       6   theano.gpuarray.subtensor.GpuSubtensor
>    6.0%    98.8%       0.000s       2.69e-06s     C       30       3   theano.gpuarray.elemwise.GpuDimShuffle
>    1.2%   100.0%       0.000s       5.56e-07s     C       30       3   theano.tensor.basic.ScalarFromTensor
>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>
>
>
> Appreciate any tips! Thanks!
> Adam
>
>
>
>
>   
>
