My guess is that you are using the old GPU backend. Can you confirm that
you use the Theano flag device=gpu, and that you have float64 values in the
graph? The old backend doesn't support float64, so any node that uses it is
computed on the CPU, which forces the transfers you see. I suggest that you
install the just-released 0.10 beta and use the new backend with
device=cuda.
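
For example, a minimal sketch (the same flags can also go in your
~/.theanorc) that selects the new backend before Theano is imported:

    import os
    # Pick the new libgpuarray backend and make float32 the default
    # dtype; this must be set before the first "import theano".
    os.environ["THEANO_FLAGS"] = "device=cuda,floatX=float32"

    import theano
    print(theano.config.device)  # should report 'cuda', not 'gpu'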

Also, you can use the flag warn_float64=pdb to find where the float64
variables are created and make sure they are float32. This will be faster.
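
As a sketch of how a float64 can slip in (the NumPy scalar below is just an
illustration; plain Python float constants don't upcast, but explicit
float64 arrays and scalars do):

    import os
    # warn_float64 accepts ignore/warn/raise/pdb; set it before theano
    # is imported (it is a regular Theano config flag).
    os.environ["THEANO_FLAGS"] = "floatX=float32,warn_float64=pdb"

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.vector('x')     # float32, because floatX=float32
    w = np.float64(0.5)   # an explicit double-precision scalar
    y = x * w             # upcasts to float64 and triggers the pdb break
    print(y.dtype)        # 'float64'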

Fred

On Mon, Jul 31, 2017 at 14:42, Haining Yu <hainin...@gmail.com> wrote:

> Hi,
>
> I am running an RNN/GRU model on a fairly large dataset with the goal of
> sequence prediction. When I profiled my code, I found that one GpuFromHost
> takes ~30% of the computation time. See part of the profiling results
> below:
>
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
>   30.2%    73.0%     462.776s       3.71e-01s   1248   221   GpuFromHost(Subtensor{:int64:}.0)
>     input 0: dtype=float32, shape=(512, 1024, 2048), strides=(-4096, 4, 2097152)
>     output 0: dtype=float32, shape=(512, 1024, 2048), strides=(2097152, 2048, 1)
>
> theano.printing.debugprint shows that the call is generated during the
> gradient calculation; see the snippet below. There is also a HostFromGpu a
> couple of levels further down.
>
>  | | | | |GpuFromHost [id FN] ''   221
>  | | | |   |Subtensor{:int64:} [id FO] ''   220
>  | | | |     |Subtensor{::int64} [id FP] ''   219
>  | | | |     | |InplaceDimShuffle{1,2,0} [id FQ] ''   218
>  | | | |     | | |Reshape{3} [id FR] ''   217
>  | | | |     | |   |CrossentropyCategorical1HotGrad [id FS] ''   216
>  | | | |     | |   | |Elemwise{Second}[(0, 0)] [id FT] ''   215
>  | | | |     | |   | | |CrossentropyCategorical1Hot [id FU] ''   209
>  | | | |     | |   | | | |HostFromGpu [id FV] ''   206
>
> I have heard about the cost of GpuFromHost (and its counterpart
> HostFromGpu) and have moved almost all of my data to the GPU (via shared
> variables), so I don't understand why the call is needed. In particular I
> don't understand:
>
> 1. If all my data are on the GPU and Theano is optimized for the GPU, why
> is the GpuFromHost even generated?
> 2. Is the call generated because the memory is too large? The call tries
> to move 512 x 1024 x 2048 x 4 bytes ≈ 4.3GB of memory. But my Tesla K80
> should have 12GB of memory, so on the surface the need to move data seems
> remote. Overall memory consumption seems OK under profiling.
> 3. Does the call have anything to do with CrossentropyCategorical1Hot? I
> assume CrossentropyCategorical1Hot has been optimized for the GPU. But the
> debugprint shows that a HostFromGpu is called before
> CrossentropyCategorical1Hot is applied. I am not sure whether
> CrossentropyCategorical1Hot has any memory layout requirement (e.g.,
> c-contiguity).
> 4. Should I try any GPU assertion to debug the root cause of the problem?
>
> Any hint is appreciated.
>
> Thank you,
> Haining
>
