Thanks very much for the information. From the profiling log, the CPU does quite well here because there are lots of data operations such as split and join, which are almost 100X faster on the CPU.
Your model's topology includes a huge number of small GEMM and Elemwise ops, so I think the large CPU cache helps on the CPU side. And, as the thread title suggests, parallel branches would be a very good idea for independent compute flows. Have you tried Intel MKL as the GEMM backend? It should show better performance. By the way, I can't open the .p file, any suggestions? On Thursday, April 20, 2017 at 5:43:19 PM UTC+8, Sharapolas wrote: > > Guys thanks for your feedback. > > For the past week I have been trying to optimize my solver as much as > possible and I optimized so much that the CPU is twice faster than the GPU > now :D Extremelly puzzled with this result and I hope you could shed some > light on that. > > Wider story: > In my initial version, I arranged the tensors such that I do not need > to do slicing. Then I noticed that GPU load is directly proportional to the > size of the tensors being used, thus I decided to use smaller tensors but > lump them together and then slice in the few cases where I need it. As a > result the GPU code turned to be more than 4 times slower, but CPU code > almost rivals my first GPU version. I tried using different version of > indexing (eg. A[:,i], T.take(A, i, 1), T.split ) but all resulted in > similar timings. > > Do you have suggestions how I could speed up my GPU code? Otherwise, I > might as well just run on multicode CPU and prob become even faster than > GPU :/ > > > GPU version. Flags: > os.environ['THEANO_FLAGS'] = > ",mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True' > os.environ['CUDA_LAUNCH_BLOCKING'] = '1' > Pickled version: > https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM > Graph: > https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU > Profile: > Function profiling > ================== > Time in 1000 calls to Function.__call__: 2.170000e+01s > Time in Function.fn.__call__: 2.166000e+01s (99.816%) > Time in thunks: 2.150321e+01s (99.093%) > Total compile time: 1.809000e+00s > Number of Apply nodes: 276 > Theano Optimizer time: 1.099000e+00s > Theano validate time: 2.069981e-01s > Theano Linker time (includes C, CUDA code generation/compiling): > 2.370000e-01s > Import time 3.000021e-03s > Node make_thunk time 2.260001e-01s > Node GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, > 0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0, > CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s > Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0}, > TensorConstant{(2L,) of 1}) time 2.000093e-03s > Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0}, > TensorConstant{(2L,) of 1}) time 2.000093e-03s > Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, > convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0}) > time 2.000093e-03s > Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, > TensorConstant{(4L,) of 1}) time 2.000093e-03s > > Time in all call to theano.grad() 0.000000e+00s > Time since theano import 101.753s > Class > --- > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> > <Class name> > 38.0% 38.0% 8.176s 1.57e-04s C 52000 52 > theano.sandbox.cuda.blas.GpuDot22 > 16.9% 54.9% 3.627s 4.37e-05s C 83000 83 > theano.sandbox.cuda.basic_ops.GpuElemwise > 14.7% 69.6% 3.169s 1.76e-04s Py 18000 18 > theano.sandbox.cuda.basic_ops.GpuSplit > 13.8% 83.4% 2.970s 1.65e-04s C 18000 18 > theano.sandbox.cuda.basic_ops.GpuJoin > 12.4% 95.9% 2.674s 1.57e-04s C 17000 17 > theano.sandbox.cuda.blas.GpuGemm > 3.5% 99.4% 0.751s 4.17e-05s C 18000 18 > 
theano.sandbox.cuda.basic_ops.GpuCAReduce > 0.6% 100.0% 0.137s 1.96e-06s C 70000 70 > theano.sandbox.cuda.basic_ops.GpuDimShuffle > ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) > > Ops > --- > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op > name> > 38.0% 38.0% 8.176s 1.57e-04s C 52000 52 > GpuDot22 > 13.8% 51.8% 2.970s 1.65e-04s C 18000 18 > GpuJoin > 12.4% 64.3% 2.674s 1.57e-04s C 17000 17 > GpuGemm{inplace} > 7.7% 71.9% 1.649s 2.36e-04s Py 7000 7 > GpuSplit{4} > 6.1% 78.1% 1.317s 4.39e-05s C 30000 30 > GpuElemwise{Mul}[(0, 1)] > 5.4% 83.5% 1.167s 1.30e-04s Py 9000 9 > GpuSplit{2} > 3.6% 87.0% 0.766s 4.26e-05s C 18000 18 > GpuElemwise{mul,no_inplace} > 3.5% 90.6% 0.763s 4.24e-05s C 18000 18 > GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)] > 3.5% 94.1% 0.751s 4.17e-05s C 18000 18 > GpuCAReduce{add}{0,1} > 1.9% 95.9% 0.399s 4.99e-05s C 8000 8 > GpuElemwise{Mul}[(0, 0)] > 1.6% 97.6% 0.353s 1.76e-04s Py 2000 2 > GpuSplit{3} > 1.1% 98.7% 0.247s 4.12e-05s C 6000 6 > GpuElemwise{Add}[(0, 2)] > 0.6% 99.4% 0.133s 2.56e-06s C 52000 52 > GpuDimShuffle{1,0} > 0.4% 99.8% 0.094s 4.70e-05s C 2000 2 > GpuElemwise{Add}[(0, 1)] > 0.2% 100.0% 0.041s 4.10e-05s C 1000 1 > GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)] > 0.0% 100.0% 0.004s 2.22e-07s C 18000 18 > GpuDimShuffle{0,x} > ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) > > Apply > ------ > <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> > 1.2% 1.2% 0.259s 2.59e-04s 1000 14 > GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,) > of 1}) > 1.1% 2.3% 0.246s 2.46e-04s 1000 9 > GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1}) > 1.1% 3.5% 0.245s 2.45e-04s 1000 236 > GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, > GpuElemwise{Add}[(0, 1)].0) > 1.1% 4.6% 0.239s 2.39e-04s 1000 239 > GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, > GpuElemwise{Add}[(0, 2)].0) > 1.1% 5.7% 0.233s 2.33e-04s 1000 8 > GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,) > of 1}) > 1.1% 6.8% 0.232s 2.32e-04s 1000 5 > GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of > 1}) > 1.1% 7.8% 0.228s 2.28e-04s 1000 0 > GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of > 1}) > 1.1% 8.9% 0.227s 2.27e-04s 1000 2 > GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,) > of 1}) > 1.0% 9.9% 0.225s 2.25e-04s 1000 238 > GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, > GpuElemwise{Add}[(0, 2)].0) > 1.0% 11.0% 0.224s 2.24e-04s 1000 4 > GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,) > of 1}) > 1.0% 12.0% 0.223s 2.23e-04s 1000 260 > GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, > GpuElemwise{Add}[(0, 2)].0) > 1.0% 13.0% 0.221s 2.21e-04s 1000 271 > GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) + > i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0, > GpuElemwise{Add}[(0, 2)].0) > 1.0% 14.0% 0.218s 2.18e-04s 1000 261 > GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, > GpuElemwise{Add}[(0, 2)].0) > 0.9% 15.0% 0.203s 2.03e-04s 1000 237 > GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, > GpuElemwise{Add}[(0, 1)].0) > 0.9% 15.8% 0.184s 1.84e-04s 1000 146 > GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0) > 0.8% 16.7% 0.181s 1.81e-04s 1000 84 > 
GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0) > 0.8% 17.5% 0.179s 1.79e-04s 1000 134 > GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0) > 0.8% 18.4% 0.179s 1.79e-04s 1000 16 > GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0}, > TensorConstant{(3L,) of 1}) > 0.8% 19.2% 0.175s 1.75e-04s 1000 83 > GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0) > 0.8% 20.0% 0.174s 1.74e-04s 1000 11 > GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0}, > TensorConstant{(3L,) of 1}) > ... (remaining 256 Apply instances account for 80.03%(17.21s) of the > runtime) > > > Some info useful for gpu: > > Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and > 0.000s(0.00%) transfert Op > > Theano function input that are float64 > <fct name> <input name> <input type> <str input> > > List of apply that don't have float64 as input but have float64 in > outputs > (Useful to know if we forgot some cast when using floatX=float32 or > gpu code) > <Apply> <Apply position> <fct name> <inputs type> <outputs type> > > Here are tips to potentially make your code run faster > (if you think of new ones, suggest them on the mailing > list). > Test them first, as they are not guaranteed to always > provide a speedup. > Sorry, no tip for today. > > The CPU version. Flags: > os.environ['THEANO_FLAGS'] = > ',mode=FAST_RUN,floatX=float32,device=cpu,profile=True' > Graph: > https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk > Pickled function: > https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU > Profile: > Function profiling > ================== > Time in 1000 calls to Function.__call__: 5.470006e+00s > Time in Function.fn.__call__: 5.422005e+00s (99.122%) > Time in thunks: 5.277404e+00s (96.479%) > Total compile time: 9.329998e-01s > Number of Apply nodes: 285 > Theano Optimizer time: 7.650001e-01s > Theano validate time: 1.880007e-01s > Theano Linker time (includes C, CUDA code generation/compiling): > 1.140001e-01s > Import time 0.000000e+00s > Node make_thunk time 1.020000e-01s > Node InplaceDimShuffle{x,0}(Sum{axis=[0], acc_dtype=float64}.0) > time 1.000166e-03s > Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, > InplaceDimShuffle{1,0}.0) time 1.000166e-03s > Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, > InplaceDimShuffle{1,0}.0) time 1.000166e-03s > Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, > InplaceDimShuffle{1,0}.0) time 1.000166e-03s > Node Gemm{inplace}(Dot22.0, TensorConstant{1.0}, > Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time > 1.000166e-03s > > Time in all call to theano.grad() 0.000000e+00s > Time since theano import 62.174s > Class > --- > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> > <Class name> > 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52 > theano.tensor.blas.Dot22 > 18.9% 93.2% 0.996s 5.86e-05s C 17000 17 > theano.tensor.blas.Gemm > 2.8% 95.9% 0.146s 1.59e-06s C 92000 92 > theano.tensor.elemwise.Elemwise > 1.6% 97.6% 0.085s 4.72e-06s C 18000 18 > theano.tensor.elemwise.Sum > 1.1% 98.7% 0.058s 3.22e-06s C 18000 18 > theano.tensor.basic.Join > 1.0% 99.7% 0.053s 2.94e-06s C 18000 18 > theano.tensor.basic.Split > 0.3% 100.0% 0.018s 2.57e-07s C 70000 70 > theano.tensor.elemwise.DimShuffle > ... 
(remaining 0 Classes account for 0.00%(0.00s) of the runtime) > > Ops > --- > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op > name> > 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52 > Dot22 > 18.9% 93.2% 0.996s 5.86e-05s C 17000 17 > Gemm{inplace} > 1.6% 94.8% 0.085s 4.72e-06s C 18000 18 > Sum{axis=[0], acc_dtype=float64} > 1.4% 96.2% 0.076s 4.22e-06s C 18000 18 > Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)] > 1.1% 97.3% 0.058s 3.22e-06s C 18000 18 > Join > 0.7% 98.0% 0.038s 2.11e-06s C 18000 18 > Elemwise{mul,no_inplace} > 0.5% 98.5% 0.025s 3.56e-06s C 7000 7 > Split{4} > 0.4% 98.9% 0.021s 2.34e-06s C 9000 9 > Split{2} > 0.2% 99.2% 0.013s 2.50e-07s C 52000 52 > InplaceDimShuffle{1,0} > 0.2% 99.4% 0.012s 3.08e-07s C 39000 39 > Elemwise{Mul}[(0, 1)] > 0.2% 99.6% 0.011s 1.83e-06s C 6000 6 > Elemwise{Add}[(0, 2)] > 0.1% 99.7% 0.007s 3.51e-06s C 2000 2 > Split{3} > 0.1% 99.8% 0.005s 5.56e-07s C 9000 9 > Elemwise{Mul}[(0, 0)] > 0.1% 99.9% 0.005s 2.77e-07s C 18000 18 > InplaceDimShuffle{x,0} > 0.1% 100.0% 0.004s 2.00e-06s C 2000 2 > Elemwise{Add}[(0, 1)] > ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) > > Apply > ------ > <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> > 2.0% 2.0% 0.106s 1.06e-04s 1000 110 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 2.0% 4.0% 0.104s 1.04e-04s 1000 107 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.8% 5.7% 0.093s 9.30e-05s 1000 188 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.8% 7.5% 0.093s 9.30e-05s 1000 78 > Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3) > 1.8% 9.3% 0.093s 9.29e-05s 1000 146 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 11.0% 0.092s 9.20e-05s 1000 135 > Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3) > 1.7% 12.8% 0.092s 9.20e-05s 1000 105 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 14.5% 0.092s 9.19e-05s 1000 164 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 16.2% 0.090s 9.03e-05s 1000 177 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 17.9% 0.090s 8.99e-05s 1000 178 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 19.6% 0.089s 8.90e-05s 1000 159 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 21.3% 0.089s 8.90e-05s 1000 168 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 23.0% 0.089s 8.90e-05s 1000 157 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.7% 24.6% 0.088s 8.80e-05s 1000 73 > Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3) > 1.6% 26.3% 0.087s 8.71e-05s 1000 121 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.6% 27.9% 0.087s 8.70e-05s 1000 193 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.6% 29.6% 0.086s 8.60e-05s 1000 170 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.6% 31.2% 0.085s 8.50e-05s 1000 166 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.6% 32.8% 0.084s 8.40e-05s 1000 155 > Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3) > 1.6% 34.3% 0.083s 8.30e-05s 1000 140 > Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3) > ... (remaining 265 Apply instances account for 65.66%(3.46s) of the > runtime) > > Here are tips to potentially make your code run faster > (if you think of new ones, suggest them on the mailing > list). > Test them first, as they are not guaranteed to always > provide a speedup. > Sorry, no tip for today. > > On Thursday, April 20, 2017 at 4:07:45 AM UTC+3, Patric wrote: >> >> Could you share your model with us? We'd like to take a look :) >> >> On Tuesday, April 18, 2017 at 5:24:30 PM UTC+8, Sharapolas wrote: >>> >>> I have a computation tree and am implementing leaf node evalutions. 
In >>> theano graph do paralle branches get evaluated in parallel on the GPU? >>> >>
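
Regarding the .p file I mentioned above: here is a minimal sketch of how I would try to load such a pickled Theano function, assuming it is a plain Python pickle (the filename and the latin1 workaround are my assumptions, not something stated in the original post):

import pickle

# Hypothetical filename; replace with the file downloaded from the Drive link.
path = "solver_function.p"

with open(path, "rb") as f:
    try:
        # Works when the pickle was written by the same Python major version.
        fn = pickle.load(f)
    except UnicodeDecodeError:
        # Common workaround when reading a Python 2 pickle under Python 3.
        f.seek(0)
        fn = pickle.load(f, encoding="latin1")

print(type(fn))  # should be a compiled theano.function if unpickling succeeds

Note also that a function pickled with device=gpu usually needs a comparable Theano setup (CUDA available, same floatX) to unpickle, which may be why the file refuses to open on my side.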

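On the MKL question, a quick sketch of how to check which BLAS Theano will use and how to point it at MKL explicitly through blas.ldflags. The install path and -lmkl_rt below are placeholders for a typical MKL installation, so adjust them to your machine:

import os

# THEANO_FLAGS must be set before theano is imported.
os.environ['THEANO_FLAGS'] = (
    'mode=FAST_RUN,floatX=float32,device=cpu,'
    'blas.ldflags=-L/opt/intel/mkl/lib/intel64 -lmkl_rt'
)

import numpy
import theano

numpy.__config__.show()            # BLAS that NumPy was built against
print(theano.config.blas.ldflags)  # BLAS that Theano links Gemm/Dot22 against

If Theano does link against MKL, the many small Dot22/Gemm calls that dominate your CPU profile should get faster, and MKL_NUM_THREADS / OMP_NUM_THREADS then control how many cores they use.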