My system:
Windows 8.1 Enterprise x64
Anaconda Python 2.7.12 x64
Theano 0.9.0rc4.dev-44f7578c16e7b991c06e373d470d9889c2729844
Geforce GTX 1070
i7-4790K @ 4 GHz, 16 GB RAM
On Thursday, April 20, 2017 at 12:43:19 PM UTC+3, Sharapolas wrote:
>
> Guys thanks for your feedback.
>
> For the past week I have been trying to optimize my solver as much as
> possible, and I optimized it so much that the CPU is now twice as fast as
> the GPU :D I am extremely puzzled by this result and hope you could shed
> some light on it.
>
> Wider story:
> In my initial version, I arranged the tensors so that I did not need
> to do any slicing. Then I noticed that GPU load is directly proportional
> to the size of the tensors being used, so I decided to use smaller
> tensors, lump them together, and slice in the few cases where I need to.
> As a result the GPU code turned out to be more than 4 times slower, while
> the CPU code almost rivals my first GPU version. I tried different ways of
> indexing (e.g. A[:,i], T.take(A, i, 1), T.split), but all resulted in
> similar timings.
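>
> (For concreteness, a minimal sketch of the indexing variants I compared;
> A and i are placeholder names, and the T.split case assumes a 4-column
> matrix split into single columns:)
>
> import theano
> import theano.tensor as T
>
> A = T.fmatrix('A')   # a lumped float32 matrix (placeholder name)
> i = 1                # column to extract (placeholder index)
>
> col_subscript = A[:, i]                  # plain subscript slicing
> col_take = T.take(A, i, axis=1)          # the same column via T.take
> pieces = T.split(A, [1] * 4, 4, axis=1)  # split a 4-column A into single columns
> col_split = pieces[i]                    # i-th piece, shape (rows, 1)
>
> f = theano.function([A], [col_subscript, col_take, col_split])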
>
> Do you have any suggestions on how I could speed up my GPU code?
> Otherwise, I might as well just run on a multicore CPU and probably end
> up even faster than the GPU :/
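>
> (If I do go the CPU route, this is roughly what I would try in order to
> use several cores; a sketch only, the thread count is an arbitrary
> example, and most of the gain would have to come from a multithreaded
> BLAS:)
>
> import os
>
> # Set before importing theano: openmp=True parallelizes some C ops
> # (e.g. elemwise); OMP_NUM_THREADS also controls OpenMP-based BLAS builds.
> os.environ['THEANO_FLAGS'] = ',mode=FAST_RUN,floatX=float32,device=cpu,openmp=True'
> os.environ['OMP_NUM_THREADS'] = '4'  # example: 4 threads
>
> import theano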
>
>
> GPU version. Flags:
> os.environ['THEANO_FLAGS'] =
> ",mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True'
> os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
> Pickled version:
> https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM
> Graph:
> https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU
> Profile:
> Function profiling
> ==================
> Time in 1000 calls to Function.__call__: 2.170000e+01s
> Time in Function.fn.__call__: 2.166000e+01s (99.816%)
> Time in thunks: 2.150321e+01s (99.093%)
> Total compile time: 1.809000e+00s
> Number of Apply nodes: 276
> Theano Optimizer time: 1.099000e+00s
> Theano validate time: 2.069981e-01s
> Theano Linker time (includes C, CUDA code generation/compiling):
> 2.370000e-01s
> Import time 3.000021e-03s
> Node make_thunk time 2.260001e-01s
> Node GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
> 0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0,
> CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s
> Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0},
> TensorConstant{(2L,) of 1}) time 2.000093e-03s
> Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0},
> TensorConstant{(2L,) of 1}) time 2.000093e-03s
> Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0},
> convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
> time 2.000093e-03s
> Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0},
> TensorConstant{(4L,) of 1}) time 2.000093e-03s
>
> Time in all call to theano.grad() 0.000000e+00s
> Time since theano import 101.753s
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
> <Class name>
> 38.0% 38.0% 8.176s 1.57e-04s C 52000 52
> theano.sandbox.cuda.blas.GpuDot22
> 16.9% 54.9% 3.627s 4.37e-05s C 83000 83
> theano.sandbox.cuda.basic_ops.GpuElemwise
> 14.7% 69.6% 3.169s 1.76e-04s Py 18000 18
> theano.sandbox.cuda.basic_ops.GpuSplit
> 13.8% 83.4% 2.970s 1.65e-04s C 18000 18
> theano.sandbox.cuda.basic_ops.GpuJoin
> 12.4% 95.9% 2.674s 1.57e-04s C 17000 17
> theano.sandbox.cuda.blas.GpuGemm
> 3.5% 99.4% 0.751s 4.17e-05s C 18000 18
> theano.sandbox.cuda.basic_ops.GpuCAReduce
> 0.6% 100.0% 0.137s 1.96e-06s C 70000 70
> theano.sandbox.cuda.basic_ops.GpuDimShuffle
> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
> name>
> 38.0% 38.0% 8.176s 1.57e-04s C 52000 52
> GpuDot22
> 13.8% 51.8% 2.970s 1.65e-04s C 18000 18
> GpuJoin
> 12.4% 64.3% 2.674s 1.57e-04s C 17000 17
> GpuGemm{inplace}
> 7.7% 71.9% 1.649s 2.36e-04s Py 7000 7
> GpuSplit{4}
> 6.1% 78.1% 1.317s 4.39e-05s C 30000 30
> GpuElemwise{Mul}[(0, 1)]
> 5.4% 83.5% 1.167s 1.30e-04s Py 9000 9
> GpuSplit{2}
> 3.6% 87.0% 0.766s 4.26e-05s C 18000 18
> GpuElemwise{mul,no_inplace}
> 3.5% 90.6% 0.763s 4.24e-05s C 18000 18
> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
> 3.5% 94.1% 0.751s 4.17e-05s C 18000 18
> GpuCAReduce{add}{0,1}
> 1.9% 95.9% 0.399s 4.99e-05s C 8000 8
> GpuElemwise{Mul}[(0, 0)]
> 1.6% 97.6% 0.353s 1.76e-04s Py 2000 2
> GpuSplit{3}
> 1.1% 98.7% 0.247s 4.12e-05s C 6000 6
> GpuElemwise{Add}[(0, 2)]
> 0.6% 99.4% 0.133s 2.56e-06s C 52000 52
> GpuDimShuffle{1,0}
> 0.4% 99.8% 0.094s 4.70e-05s C 2000 2
> GpuElemwise{Add}[(0, 1)]
> 0.2% 100.0% 0.041s 4.10e-05s C 1000 1
> GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)]
> 0.0% 100.0% 0.004s 2.22e-07s C 18000 18
> GpuDimShuffle{0,x}
> ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
>
> Apply
> ------
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
> 1.2% 1.2% 0.259s 2.59e-04s 1000 14
> GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.1% 2.3% 0.246s 2.46e-04s 1000 9
> GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1})
> 1.1% 3.5% 0.245s 2.45e-04s 1000 236
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 1)].0)
> 1.1% 4.6% 0.239s 2.39e-04s 1000 239
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.1% 5.7% 0.233s 2.33e-04s 1000 8
> GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.1% 6.8% 0.232s 2.32e-04s 1000 5
> GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of
> 1})
> 1.1% 7.8% 0.228s 2.28e-04s 1000 0
> GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of
> 1})
> 1.1% 8.9% 0.227s 2.27e-04s 1000 2
> GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.0% 9.9% 0.225s 2.25e-04s 1000 238
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.0% 11.0% 0.224s 2.24e-04s 1000 4
> GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.0% 12.0% 0.223s 2.23e-04s 1000 260
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.0% 13.0% 0.221s 2.21e-04s 1000 271
> GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) +
> i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.0% 14.0% 0.218s 2.18e-04s 1000 261
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 0.9% 15.0% 0.203s 2.03e-04s 1000 237
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 1)].0)
> 0.9% 15.8% 0.184s 1.84e-04s 1000 146
> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
> 0.8% 16.7% 0.181s 1.81e-04s 1000 84
> GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0)
> 0.8% 17.5% 0.179s 1.79e-04s 1000 134
> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
> 0.8% 18.4% 0.179s 1.79e-04s 1000 16
> GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0},
> TensorConstant{(3L,) of 1})
> 0.8% 19.2% 0.175s 1.75e-04s 1000 83
> GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0)
> 0.8% 20.0% 0.174s 1.74e-04s 1000 11
> GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0},
> TensorConstant{(3L,) of 1})
> ... (remaining 256 Apply instances account for 80.03%(17.21s) of the
> runtime)
>
>
> Some info useful for gpu:
>
> Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and
> 0.000s(0.00%) transfert Op
>
> Theano function input that are float64
> <fct name> <input name> <input type> <str input>
>
> List of apply that don't have float64 as input but have float64 in
> outputs
> (Useful to know if we forgot some cast when using floatX=float32 or
> gpu code)
> <Apply> <Apply position> <fct name> <inputs type> <outputs type>
>
> Here are tips to potentially make your code run faster
> (if you think of new ones, suggest them on the mailing
> list).
> Test them first, as they are not guaranteed to always
> provide a speedup.
> Sorry, no tip for today.
>
> The CPU version. Flags:
> os.environ['THEANO_FLAGS'] =
> ',mode=FAST_RUN,floatX=float32,device=cpu,profile=True'
> Graph:
> https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk
> Pickled function:
> https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU
> Profile:
> Function profiling
> ==================
> Time in 1000 calls to Function.__call__: 5.470006e+00s
> Time in Function.fn.__call__: 5.422005e+00s (99.122%)
> Time in thunks: 5.277404e+00s (96.479%)
> Total compile time: 9.329998e-01s
> Number of Apply nodes: 285
> Theano Optimizer time: 7.650001e-01s
> Theano validate time: 1.880007e-01s
> Theano Linker time (includes C, CUDA code generation/compiling):
> 1.140001e-01s
> Import time 0.000000e+00s
> Node make_thunk time 1.020000e-01s
> Node InplaceDimShuffle{x,0}(Sum{axis=[0], acc_dtype=float64}.0)
> time 1.000166e-03s
> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
> Node Gemm{inplace}(Dot22.0, TensorConstant{1.0},
> Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time
> 1.000166e-03s
>
> Time in all call to theano.grad() 0.000000e+00s
> Time since theano import 62.174s
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
> <Class name>
> 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
> theano.tensor.blas.Dot22
> 18.9% 93.2% 0.996s 5.86e-05s C 17000 17
> theano.tensor.blas.Gemm
> 2.8% 95.9% 0.146s 1.59e-06s C 92000 92
> theano.tensor.elemwise.Elemwise
> 1.6% 97.6% 0.085s 4.72e-06s C 18000 18
> theano.tensor.elemwise.Sum
> 1.1% 98.7% 0.058s 3.22e-06s C 18000 18
> theano.tensor.basic.Join
> 1.0% 99.7% 0.053s 2.94e-06s C 18000 18
> theano.tensor.basic.Split
> 0.3% 100.0% 0.018s 2.57e-07s C 70000 70
> theano.tensor.elemwise.DimShuffle
> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
> name>
> 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
> Dot22
> 18.9% 93.2% 0.996s 5.86e-05s C 17000 17
> Gemm{inplace}
> 1.6% 94.8% 0.085s 4.72e-06s C 18000 18
> Sum{axis=[0], acc_dtype=float64}
> 1.4% 96.2% 0.076s 4.22e-06s C 18000 18
> Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
> 1.1% 97.3% 0.058s 3.22e-06s C 18000 18
> Join
> 0.7% 98.0% 0.038s 2.11e-06s C 18000 18
> Elemwise{mul,no_inplace}
> 0.5% 98.5% 0.025s 3.56e-06s C 7000 7
> Split{4}
> 0.4% 98.9% 0.021s 2.34e-06s C 9000 9
> Split{2}
> 0.2% 99.2% 0.013s 2.50e-07s C 52000 52
> InplaceDimShuffle{1,0}
> 0.2% 99.4% 0.012s 3.08e-07s C 39000 39
> Elemwise{Mul}[(0, 1)]
> 0.2% 99.6% 0.011s 1.83e-06s C 6000 6
> Elemwise{Add}[(0, 2)]
> 0.1% 99.7% 0.007s 3.51e-06s C 2000 2
> Split{3}
> 0.1% 99.8% 0.005s 5.56e-07s C 9000 9
> Elemwise{Mul}[(0, 0)]
> 0.1% 99.9% 0.005s 2.77e-07s C 18000 18
> InplaceDimShuffle{x,0}
> 0.1% 100.0% 0.004s 2.00e-06s C 2000 2
> Elemwise{Add}[(0, 1)]
> ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
>
> Apply
> ------
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
> 2.0% 2.0% 0.106s 1.06e-04s 1000 110
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 2.0% 4.0% 0.104s 1.04e-04s 1000 107
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.8% 5.7% 0.093s 9.30e-05s 1000 188
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.8% 7.5% 0.093s 9.30e-05s 1000 78
> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
> 1.8% 9.3% 0.093s 9.29e-05s 1000 146
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 11.0% 0.092s 9.20e-05s 1000 135
> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
> 1.7% 12.8% 0.092s 9.20e-05s 1000 105
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 14.5% 0.092s 9.19e-05s 1000 164
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 16.2% 0.090s 9.03e-05s 1000 177
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 17.9% 0.090s 8.99e-05s 1000 178
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 19.6% 0.089s 8.90e-05s 1000 159
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 21.3% 0.089s 8.90e-05s 1000 168
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 23.0% 0.089s 8.90e-05s 1000 157
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 24.6% 0.088s 8.80e-05s 1000 73
> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
> 1.6% 26.3% 0.087s 8.71e-05s 1000 121
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 27.9% 0.087s 8.70e-05s 1000 193
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 29.6% 0.086s 8.60e-05s 1000 170
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 31.2% 0.085s 8.50e-05s 1000 166
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 32.8% 0.084s 8.40e-05s 1000 155
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 34.3% 0.083s 8.30e-05s 1000 140
> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
> ... (remaining 265 Apply instances account for 65.66%(3.46s) of the
> runtime)
>
> Here are tips to potentially make your code run faster
> (if you think of new ones, suggest them on the mailing
> list).
> Test them first, as they are not guaranteed to always
> provide a speedup.
> Sorry, no tip for today.
>
> On Thursday, April 20, 2017 at 4:07:45 AM UTC+3, Patric wrote:
>>
>> Could you share your model with us? We'd like to take a look :)
>>
>> On Tuesday, April 18, 2017 at 5:24:30 PM UTC+8, Sharapolas wrote:
>>>
>>> I have a computation tree and am implementing leaf node evaluations. In
>>> a Theano graph, do parallel branches get evaluated in parallel on the GPU?
>>>
>>