Never mind.
The point of this setup was to speed up some computation by keeping the raw
data on the GPU, then selecting minibatches from it, with some reshaping and
other light pre-processing (e.g. sometimes certain subtensors need to be set
to 0), into another shared variable on which the function computes. In my
case this did save the ~10% of function time spent in GpuFromHost, and the
~20% of overall time spent building the input array with numpy on the CPU,
but the total time ended up more than doubling, because the functions that
manipulate data on the GPU are very slow. I'm not sure whether this is more
to do with pygpu or with GPUs in general. I'm content with the CPU-memory
solution.
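
For anyone finding this thread later, here is a minimal sketch of what I mean
by the CPU-memory solution (the shapes and the dummy sum are illustrative, not
my actual model):

import numpy as np
import theano
import theano.tensor as T

np_src = np.random.rand(2000, 4, 40, 40).astype(np.float32)  # stays in host memory

x = T.tensor4('x')
f = theano.function([x], (x ** 2).sum())  # one GpuFromHost per call, nothing else

batch_idxs = np.random.randint(low=0, high=2000, size=32)
minibatch = np_src[batch_idxs]            # numpy fancy indexing on the CPU
print(f(minibatch))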
In cases where a random minibatch can be gathered by simply indexing into the
data array (no slice needed at each index), I have seen overall speed
improvements from keeping the same kind of data on the GPU.
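And here is a sketch of the "simple indexing" case that did give a speedup for
me: the whole array sits on the GPU and one integer-vector index gathers the
minibatch, which (as far as I understand) compiles to a single
advanced-subtensor op on the gpuarray backend, with no per-index transfers.
Again the sum is just a stand-in for the real computation:

import numpy as np
import theano
import theano.tensor as T

src = theano.shared(np.random.rand(2000, 40, 40).astype(np.float32), name="src")
idxs = T.lvector('idxs')
f = theano.function([idxs], (src[idxs] ** 2).sum(), name="gpu_gather")

print(f(np.random.randint(low=0, high=2000, size=32).astype(np.int64)))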
Would be interested to read about any of your experiences.
Thanks,
Adam
On Tuesday, January 23, 2018 at 4:19:13 PM UTC-8, Adam Stooke wrote:
>
> I realize now that the above example might seem strange, since I build the
> "selections" as an explicit list rather than just feeding "idxs" directly
> into "src". The reason is that I actually need to get a slice (of fixed
> size) at each index. The script below contains the full problem, including
> three possible solutions:
> 1) explicitly construct the list of slices,
> 2) use theano.map to get the slices,
> 3) build all the individual indexes corresponding to the slice elements,
> gather them all at once, and then reshape so that each slice becomes its own
> unit of data along a separate dimension (a small numpy sketch of this index
> construction follows below).
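>
> For concreteness, here is the numpy equivalent of the index construction in
> method 3 (small illustrative values; the Theano version in the script below
> uses T.tile, T.arange, and T.repeat to do the same thing):
>
> import numpy as np
>
> S = 4                                # slice length
> idxs_0 = np.array([2, 0])            # first-axis index per batch element
> idxs_1 = np.array([10, 37])          # slice start per batch element
>
> # Expand each start into S consecutive step indexes, then flatten, and
> # repeat the first-axis index S times so the two index vectors line up.
> step_idxs = (idxs_1[:, None] + np.arange(S)).reshape(-1)  # [10 11 12 13 37 38 39 40]
> env_idxs = np.repeat(idxs_0, S)                           # [ 2  2  2  2  0  0  0  0]
>
> # src[env_idxs, step_idxs] has shape (B*S, H, W); reshaping to (B, S, H, W)
> # puts each slice in its own row.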
>
> My observations in testing:
> For a small batch size, like 32, method 3 (idx) is fastest, followed by
> method 1 (list). For a large batch size, like 2048, method 2 (map) is
> fastest, and method 1 (list) fails to compile, at least within several
> minutes. Still, a significant portion of the time in both methods 1 and 2
> is spent in HostFromGpu, related to the indexes. The scan op appears to run
> on the CPU. However, I think the efficiency of grabbing full slices, rather
> than each and every index, might explain the better performance at large
> batch size.
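>
> (In case it helps anyone reproduce these numbers: the per-class timing
> tables in this thread come from Theano's built-in profiler. A minimal way
> to get the same breakdown, if I remember the API correctly, is to compile
> with profile=True and print the summary:)
>
> import numpy as np
> import theano
> import theano.tensor as T
>
> x = theano.shared(np.random.rand(100, 10).astype(np.float32), name="x")
> idxs = T.lvector('idxs')
> f = theano.function([idxs], x[idxs].sum(), profile=True)
>
> f(np.arange(32, dtype=np.int64))
> f.profile.summary()   # prints the <% time> / <#call> / <Class name> tables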
>
> So the question stands: how can I collect indexes/slices from a shared
> variable without HostFromGpu being called for every index?
>
> Please help! :)
>
>
> import theano
> import theano.tensor as T
> import numpy as np
> import time
>
> E = 4
> H = W = 200
> N = 2000
> B = 32  # 256, 2048
> S = 4
> LOOPS = 10
>
> LIST = True
> MAP = True
> IDX = True
>
> np_src = np.random.rand(E, N, H, W).astype(np.float32)
> src = theano.shared(np_src, name="src")
> np_dest_zeros = np.zeros([B, S, H, W], dtype=np.float32)
> idxs_0 = T.lvector('idxs_0')
> idxs_1 = T.lvector('idxs_1')
>
> np_idxs_0 = np.random.randint(low=0, high=E, size=B)
> np_idxs_1 = np.random.randint(low=0, high=N - S, size=B)  # .astype(np.int32)
> np_answer = np.stack([np_src[e, i:i + S] for e, i in zip(np_idxs_0, np_idxs_1)])
>
>
> # Fixed list of states method ############
> if LIST:
>     dest_list = theano.shared(np.zeros([B, S, H, W], dtype=np.float32), name="dest_list")
>     selections_list = [src[idxs_0[i], idxs_1[i]:idxs_1[i] + S] for i in range(B)]
>     new_dest_list = T.stack(selections_list)
>     updates_list = [(dest_list, new_dest_list)]
>     f_list = theano.function(inputs=[idxs_0, idxs_1], updates=updates_list, name="list")
>
>     # print(dest_list.get_value())
>     f_list(np_idxs_0, np_idxs_1)
>     # print(dest_list.get_value())
>     theano.printing.debugprint(f_list)
>     # time.sleep(1)
>     # t0_list = time.time()
>     for _ in range(LOOPS):
>         f_list(np_idxs_0, np_idxs_1)
>         # x = dest_list.get_value()
>     # t_list = time.time() - t0_list
>
>
> # mapped list of states method ###########
> if MAP:
>     # s = theano.shared(S, name="S")
>     # print("s.dtype: ", s.dtype, "s.get_value: ", s.get_value())
>     dest_map = theano.shared(np_dest_zeros, name="dest_map")
>
>     def get_state(idx_0, idx_1, data):
>         # tried using a shared variable in place of "S" here--no effect
>         return data[idx_0, idx_1:idx_1 + S]
>         # return data[idx_0, slice(idx_1, idx_1 + S)]
>
>     states_map, updates_map = theano.map(
>         fn=get_state,
>         sequences=[idxs_0, idxs_1],
>         non_sequences=src,
>     )
>     new_dest_map = T.concatenate([states_map])
>     updates_map = [(dest_map, new_dest_map)]
>     f_map = theano.function(inputs=[idxs_0, idxs_1], updates=updates_map, name="map")
>
>     # print(dest_map.get_value())
>     f_map(np_idxs_0, np_idxs_1)
>     # print(dest_map.get_value())
>     print("\n\n")
>     theano.printing.debugprint(f_map)
>     # time.sleep(1)
>     # t0_map = time.time()
>     for _ in range(LOOPS):
>         f_map(np_idxs_0, np_idxs_1)
>         # x = dest_map.get_value()
>     # t_map = time.time() - t0_map
>
>
> # full idx list reshaping method ########
> if IDX:
>     dest_idx = theano.shared(np_dest_zeros, name="dest_idx")
>
>     step_idxs_col = T.reshape(idxs_1, (-1, 1))
>     step_idxs_tile = T.tile(step_idxs_col, (1, S))
>     step_idxs_rang = step_idxs_tile + T.arange(S)
>     step_idxs_flat = step_idxs_rang.reshape([-1])
>     env_idxs_repeat = T.repeat(idxs_0, S)
>
>     selections_idx = src[env_idxs_repeat, step_idxs_flat]
>     new_dest_idx = selections_idx.reshape([-1, S, H, W])
>     updates_idx = [(dest_idx, new_dest_idx)]
>     f_idx = theano.function(inputs=[idxs_0, idxs_1], updates=updates_idx, name="idx")
>
>     # print(dest_idx.get_value())
>     f_idx(np_idxs_0, np_idxs_1)
>     # print(dest_idx.get_value())
>     print("\n\n")
>     theano.printing.debugprint(f_idx)
>     # time.sleep(1)
>     # t0_idx = time.time()
>     for _ in range(LOOPS):
>         f_idx(np_idxs_0, np_idxs_1)
>         # x = dest_idx.get_value()
>     # t_idx = time.time() - t0_idx
>
>
> ###################################################
> if LIST:
>     print("Theano list values pass: ", np.allclose(np_answer, dest_list.get_value()))
>     # print("list time: ", t_list)
> if MAP:
>     print("Theano map values pass: ", np.allclose(np_answer, dest_map.get_value()))
>     # print("map time: ", t_map)
> if IDX:
>     print("Theano idx values pass: ", np.allclose(np_answer, dest_idx.get_value()))
>     # print("idx time: ", t_idx)
>
>
> On Friday, January 19, 2018 at 12:42:16 PM UTC-8, Adam Stooke wrote:
>>
>> Hi,
>>
>> I am holding an array on the GPU (in a shared variable), and I'm
>> sampling random minibatches from it, but it seems there is a call to
>> HostFromGpu at every index, which causes significant delay. Is there a way
>> to avoid this?
>>
>> Here is a minimal code example, along with the debugprint and profiling
>> output. The same thing happens if I use theano.map. The problem is much
>> worse in my actual code, which uses multiple levels of indexing: even
>> though the data arrays there are much larger, the time spent in the many
>> calls to HostFromGpu still dominates.
>>
>>
>> Code example:
>>
>> import theano
>> import theano.tensor as T
>> import numpy as np
>>
>> H = W = 3
>> N = 10
>> B = 3
>>
>> src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
>> dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
>> idxs = T.ivector('idxs')
>>
>> selections = [src[idxs[i]] for i in range(B)]
>> new_dest = T.stack(selections)
>> updates = [(dest, new_dest)]
>> f = theano.function(inputs=[idxs], updates=updates)
>>
>> np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
>> print(dest.get_value())
>> f(np_idxs)
>> print(dest.get_value())
>>
>> theano.printing.debugprint(f)
>> for _ in range(10):
>>     f(np_idxs)
>>
>>
>> Debugprint (notice that a separate HostFromGpu node, each with its own ID,
>> leads into every ScalarFromTensor):
>>
>> GpuJoin [id A] '' 16
>> |TensorConstant{0} [id B]
>> |InplaceGpuDimShuffle{x,0,1} [id C] '' 15
>> | |GpuSubtensor{int32} [id D] '' 14
>> | |src [id E]
>> | |ScalarFromTensor [id F] '' 13
>> | |HostFromGpu(gpuarray) [id G] '' 12
>> | |GpuSubtensor{int64} [id H] '' 11
>> | |GpuFromHost<None> [id I] '' 0
>> | | |idxs [id J]
>> | |Constant{0} [id K]
>> |InplaceGpuDimShuffle{x,0,1} [id L] '' 10
>> | |GpuSubtensor{int32} [id M] '' 9
>> | |src [id E]
>> | |ScalarFromTensor [id N] '' 8
>> | |HostFromGpu(gpuarray) [id O] '' 7
>> | |GpuSubtensor{int64} [id P] '' 6
>> | |GpuFromHost<None> [id I] '' 0
>> | |Constant{1} [id Q]
>> |InplaceGpuDimShuffle{x,0,1} [id R] '' 5
>> |GpuSubtensor{int32} [id S] '' 4
>> |src [id E]
>> |ScalarFromTensor [id T] '' 3
>> |HostFromGpu(gpuarray) [id U] '' 2
>> |GpuSubtensor{int64} [id V] '' 1
>> |GpuFromHost<None> [id I] '' 0
>> |Constant{2} [id W]
>>
>>
>>
>> Theano profile over 10 calls to the function (notice 10 calls to
>> GpuFromHost but 30 calls to HostFromGpu):
>>
>> Class
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
>>   38.9%    38.9%       0.001s       5.27e-05s     C       10        1   theano.gpuarray.basic_ops.GpuJoin
>>   31.5%    70.4%       0.000s       1.42e-05s     C       30        3   theano.gpuarray.basic_ops.HostFromGpu
>>   15.0%    85.4%       0.000s       2.03e-05s     C       10        1   theano.gpuarray.basic_ops.GpuFromHost
>>    7.4%    92.8%       0.000s       1.67e-06s     C       60        6   theano.gpuarray.subtensor.GpuSubtensor
>>    6.0%    98.8%       0.000s       2.69e-06s     C       30        3   theano.gpuarray.elemwise.GpuDimShuffle
>>    1.2%   100.0%       0.000s       5.56e-07s     C       30        3   theano.tensor.basic.ScalarFromTensor
>>    ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>>
>>
>>
>> Appreciate any tips! Thanks!
>> Adam
>>
>