My guess is that you are using the old GPU backend. Can you confirm that you use the Theano flag device=gpu, and that you have float64 in the graph? The old backend doesn't support float64. I suggest that you install the just-released 0.10 beta and that you use the new backend with device=cuda.
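A common source of float64 is numpy itself: its default constructors and its type-promotion rules produce double precision, which then propagates into the graph. A minimal numpy-only sketch of the problem (no Theano required; the variable names are just illustrative):

```python
import numpy as np

# Default numpy constructors produce float64.
weights = np.random.randn(3, 4)
print(weights.dtype)            # float64

# Cast once, up front, so everything downstream stays float32.
weights32 = weights.astype(np.float32)
print(weights32.dtype)          # float32

# Combining a float32 array with a float64 array silently promotes
# the result back to float64 -- the same kind of promotion that puts
# float64 nodes into a Theano graph.
bias = np.zeros(4)              # float64 again
out = weights32 + bias
print(out.dtype)                # float64
```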
Also, you can use the flag warn_float64=pdb to find where you have them and make sure they are float32. This will be faster.

Fred

On Mon., 31 Jul. 2017 at 14:42, Haining Yu <hainin...@gmail.com> wrote:
> Hi,
>
> I am running an RNN/GRU model on a fairly large dataset with the goal of
> sequence prediction. When I profile my code, I found that one GpuFromHost
> takes ~30% of computation time. See part of the profiling results below:
>
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
>   30.2%  73.0%  462.776s  3.71e-01s  1248  221  GpuFromHost(Subtensor{:int64:}.0)
>     input 0: dtype=float32, shape=(512, 1024, 2048), strides=(-4096, 4, 2097152)
>     output 0: dtype=float32, shape=(512, 1024, 2048), strides=(2097152, 2048, 1)
>
> theano.printing.debugprint shows that the call is generated in the gradient
> calculation; see the snippet below. There is also a HostFromGpu a couple of
> layers below.
>
> | | | | |GpuFromHost [id FN] '' 221
> | | | | |Subtensor{:int64:} [id FO] '' 220
> | | | | |Subtensor{::int64} [id FP] '' 219
> | | | | | |InplaceDimShuffle{1,2,0} [id FQ] '' 218
> | | | | | | |Reshape{3} [id FR] '' 217
> | | | | | | |CrossentropyCategorical1HotGrad [id FS] '' 216
> | | | | | | | |Elemwise{Second}[(0, 0)] [id FT] '' 215
> | | | | | | | | |CrossentropyCategorical1Hot [id FU] '' 209
> | | | | | | | | | |HostFromGpu [id FV] '' 206
>
> I have heard about the cost of using GpuFromHost (and its counterpart
> HostFromGpu) and had moved almost all data to the GPU (via shared
> variables). So I don't understand why the call is needed. In particular
> I don't understand:
>
> 1. If all my data are on the GPU and Theano is optimized for the GPU, why
> is the GpuFromHost even generated?
> 2. Is the call generated because the memory is too large? The call tries
> to move 512 x 1024 x 2048 x 4 = 4.2 GB of memory. But my Tesla K80 should
> have 12 GB of memory, so the need to move seems remote on the surface.
> Overall memory consumption seems OK under profiling.
> 3. Does the call have anything to do with CrossentropyCategorical1Hot? I
> assume CrossentropyCategorical1Hot has been optimized for the GPU, but the
> code shows that a HostFromGpu is called before CrossentropyCategorical1Hot
> is applied. I am not sure whether CrossentropyCategorical1Hot has any
> memory requirement (e.g., c-contiguity).
> 4. Should I try any GPU assertion to debug the root cause of the problem?
>
> Any hint is appreciated.
>
> Thank you,
> Haining
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "theano-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to theano-users+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.