Hi,

I am running an RNN/GRU model on a fairly large dataset for sequence 
prediction. When I profiled the code, I found that a single GpuFromHost 
node accounts for ~30% of the computation time. Part of the profiling 
output is below:

<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> 
<Gflops/s> <Apply name>  
  30.2%    73.0%     462.776s       3.71e-01s   1248   221                 
    GpuFromHost(Subtensor{:int64:}.0)
    input 0: dtype=float32, shape=(512, 1024, 2048), strides=(-4096, 4, 
2097152) 
    output 0: dtype=float32, shape=(512, 1024, 2048), strides=(2097152, 
2048, 1) 
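(For context, the profile above was collected with Theano's built-in profiler enabled roughly like this; `train_gru.py` stands in for my actual training script:)

```shell
# Enable Theano's profiler via config flags; alternatively pass
# profile=True to theano.function() when compiling.
THEANO_FLAGS='profile=True,device=gpu,floatX=float32' python train_gru.py
```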

theano.printing.debugprint shows that the call is generated during the 
gradient calculation; see the snippet below. There is also a HostFromGpu 
a few nodes further down.

 | | | | |GpuFromHost [id FN] ''   221
 | | | |   |Subtensor{:int64:} [id FO] ''   220
 | | | |     |Subtensor{::int64} [id FP] ''   219
 | | | |     | |InplaceDimShuffle{1,2,0} [id FQ] ''   218
 | | | |     | | |Reshape{3} [id FR] ''   217
 | | | |     | |   |CrossentropyCategorical1HotGrad [id FS] ''   216
 | | | |     | |   | |Elemwise{Second}[(0, 0)] [id FT] ''   215
 | | | |     | |   | | |CrossentropyCategorical1Hot [id FU] ''   209
 | | | |     | |   | | | |HostFromGpu [id FV] ''   206

I have heard about the cost of GpuFromHost (and its counterpart 
HostFromGpu), and I had already moved almost all of my data onto the GPU 
via shared variables, so I don't understand why the transfer is needed. 
In particular, I don't understand:

1. If all my data are already on the GPU and Theano optimizes the graph 
for the GPU, why is the GpuFromHost node generated at all?
2. Is the call generated because the tensor is too large? It moves 
512 x 1024 x 2048 x 4 bytes, i.e. about 4.3 GB (4 GiB). But my Tesla K80 
has 12 GB of memory, so a capacity problem seems unlikely on the surface. 
Overall memory consumption also looks fine in the profile.
3. Does the call have anything to do with CrossentropyCategorical1Hot? I 
assumed CrossentropyCategorical1Hot had been optimized for the GPU, but 
the graph shows a HostFromGpu applied right before 
CrossentropyCategorical1Hot. I am not sure whether 
CrossentropyCategorical1Hot has any memory-layout requirement (e.g., 
C-contiguity).
4. Should I add any GPU assertions to help debug the root cause of the 
problem?
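One workaround I am considering, in case CrossentropyCategorical1Hot really has no GPU implementation in my Theano version, is to express the cost via advanced indexing (the Theano form would be something like `-T.mean(T.log(p_y)[T.arange(y.shape[0]), y])`), which I believe the optimizer can keep on the GPU. The two formulations should be mathematically identical; here is a small NumPy sketch I used to convince myself (all names are made up for illustration):

```python
import numpy as np

def xent_onehot(p, y_onehot):
    # 1-hot crossentropy: -mean over rows of log-probability at the true class,
    # selected by multiplying with the one-hot targets
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

def xent_indexed(p, y):
    # indexing formulation: pick log p[i, y[i]] directly with integer targets
    return -np.mean(np.log(p)[np.arange(p.shape[0]), y])

rng = np.random.RandomState(0)
p = rng.dirichlet(np.ones(5), size=4).astype('float32')  # 4 rows, each sums to 1
y = np.array([0, 2, 4, 1])                               # integer class targets
onehot = np.eye(5, dtype='float32')[y]                   # one-hot version of y

print(abs(xent_onehot(p, onehot) - xent_indexed(p, y)) < 1e-5)  # True
```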

Any hint is appreciated.

Thank you,
Haining 

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"theano-users" group.