Thank you for investigating. I ran and profiled the example code on the CPU
and attached the results below. Alloc is much less of an issue on the CPU
(about 10% of thunk time, versus roughly 90% for GpuAlloc on the GPU), so I
am not sure your hypothesis still holds.
Regarding the reshaping, I suspect the following line is to blame:

    reach = cases[group] * T.tile(reach.T, (cases[group].shape[0], 1))

Here reach.T is a row vector, and I multiply each row of cases[group] by it.
AFAIK the broadcasted version should look like

    reach = (cases[group].T * reach).T

but it throws a shape-mismatch error. Am I misunderstanding broadcasting?
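In case it is useful, here is a self-contained sketch of the two variants
with hypothetical stand-in names (cases_g plays the role of cases[group],
shape (n, m); reach is an (m, 1) column vector). The only broadcast variant
I could get past the shape check is one where the length-1 axis is flagged
broadcastable explicitly:

    import numpy as np
    import theano
    import theano.tensor as T

    # Hypothetical stand-ins: cases_g plays the role of cases[group],
    # shape (n, m); reach is an (m, 1) column vector.
    cases_g = T.matrix('cases_g')
    reach = T.matrix('reach')

    # Current version: tile the row vector up to (n, m) before multiplying.
    tiled = cases_g * T.tile(reach.T, (cases_g.shape[0], 1))

    # Broadcast version: Theano only broadcasts axes that are flagged
    # broadcastable at compile time, so the length-1 axis of a plain
    # matrix has to be declared explicitly.
    bcast = cases_g * T.addbroadcast(reach, 1).T   # (n, m) * (1, m)

    f = theano.function([cases_g, reach], [tiled, bcast])
    a = np.ones((4, 3), dtype=theano.config.floatX)
    b = np.arange(3, dtype=theano.config.floatX).reshape(3, 1)
    t, bc = f(a, b)
    assert np.allclose(t, bc)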
For the rest, most of the operations are element-wise multiply, divide, and
add. Could those be wasting time allocating space for intermediate results?
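To check that, I assume inspecting the optimized graph of the compiled
function would show which Alloc and Reshape nodes survive optimization (fn
below is a hypothetical handle to the function being profiled):

    import theano

    # Print the optimized graph of the compiled function; any remaining
    # Alloc/Reshape nodes and the elemwise fusions are visible directly.
    theano.printing.debugprint(fn)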
CPU profiled code:
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 20000 calls to Function.__call__: 5.398000e+01s
Time in Function.fn.__call__: 5.358499e+01s (99.268%)
Time in thunks: 5.327701e+01s (98.698%)
Total compile time: 5.962000e+00s
Number of Apply nodes: 21
Theano Optimizer time: 4.749999e-01s
Theano validate time: 6.999969e-03s
Theano Linker time (includes C, CUDA code generation/compiling):
4.271000e+00s
Import time 6.200051e-02s
Node make_thunk time 4.267000e+00s
Node Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) +
maximum(i0, i1)))}}(InplaceDimShuffle{1,0}.0, TensorConstant{(1L, 1L) of
0.0}, InplaceDimShuffle{1,0}.0) time 8.690000e-01s
Node Elemwise{mul,no_inplace}(<TensorType(float32, matrix)>,
InplaceDimShuffle{0,x}.0) time 8.640001e-01s
Node Elemwise{Composite{(i0 + (i1 * i2))}}[(0,
0)](InplaceDimShuffle{0,x}.0, TensorConstant{(1L, 1L) of 2.0},
InplaceDimShuffle{0,x}.0) time 8.090000e-01s
Node Elemwise{Neg}[(0, 0)](InplaceDimShuffle{0,x}.0) time
7.180002e-01s
Node Alloc(TensorConstant{1.0}, Shape_i{0}.0, TensorConstant{1},
TensorConstant{1}, TensorConstant{1}) time 5.160000e-01s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 61.639s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
<Class name>
59.9% 59.9% 31.927s 1.33e-04s C 240000 24
theano.tensor.elemwise.Elemwise
29.9% 89.9% 15.945s 3.19e-04s C 50000 5
theano.tensor.elemwise.Sum
9.9% 99.7% 5.262s 1.32e-04s C 40000 4
theano.tensor.basic.Alloc
0.1% 99.8% 0.049s 4.89e-07s C 100000 10
theano.tensor.elemwise.DimShuffle
0.1% 99.9% 0.047s 5.93e-07s C 80000 8
theano.compile.ops.Shape_i
0.1% 100.0% 0.027s 6.74e-07s C 40000 4
theano.tensor.basic.Reshape
0.0% 100.0% 0.020s 5.01e-07s C 40000 4
theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
name>
47.8% 47.8% 25.449s 6.36e-04s C 40000 4
Elemwise{Mul}[(0, 1)]
29.9% 77.7% 15.945s 3.19e-04s C 50000 5
Sum{axis=[1], acc_dtype=float64}
9.9% 87.6% 5.262s 1.32e-04s C 40000 4
Alloc
6.4% 93.9% 3.394s 3.39e-04s C 10000 1
Elemwise{Mul}[(0, 2)]
5.1% 99.0% 2.718s 2.72e-04s C 10000 1
Elemwise{mul,no_inplace}
0.1% 99.2% 0.068s 6.80e-06s C 10000 1
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) +
maximum(i4, i3))) + (i5 * (maximum(i4, i3) / (maximum(i2, i3) + maximum(i4,
i3)))))}}
0.1% 99.3% 0.058s 5.80e-06s C 10000 1
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) +
maximum(i4, i3))) + ((i5 * i6 * maximum(i4, i3)) / (maximum(i2, i3) +
maximum(i4, i3))))}}
0.1% 99.4% 0.054s 2.71e-06s C 20000 2
Elemwise{TrueDiv}[(0, 0)]
0.1% 99.5% 0.052s 1.30e-06s C 40000 4
Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
0.1% 99.5% 0.030s 1.50e-06s C 20000 2
Elemwise{maximum,no_inplace}
0.1% 99.6% 0.028s 2.80e-06s C 10000 1
Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) + maximum(i0, i1)))}}
0.0% 99.6% 0.025s 5.03e-07s C 50000 5
Shape_i{0}
0.0% 99.7% 0.025s 4.17e-07s C 60000 6
InplaceDimShuffle{0,x}
0.0% 99.7% 0.024s 5.98e-07s C 40000 4
InplaceDimShuffle{1,0}
0.0% 99.8% 0.022s 7.43e-07s C 30000 3
Shape_i{1}
0.0% 99.8% 0.021s 1.05e-06s C 20000 2
Elemwise{Neg}[(0, 0)]
0.0% 99.8% 0.021s 2.10e-06s C 10000 1
Elemwise{Composite{(i0 + (i1 * i2))}}[(0, 0)]
0.0% 99.9% 0.020s 5.01e-07s C 40000 4
MakeVector{dtype='int64'}
0.0% 99.9% 0.019s 1.90e-06s C 10000 1
Elemwise{add,no_inplace}
0.0% 100.0% 0.019s 6.32e-07s C 30000 3
Reshape{2}
... (remaining 2 Ops account for 0.04%(0.02s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
22.0% 22.0% 11.705s 1.17e-03s 10000 10
Elemwise{Mul}[(0, 1)](<TensorType(float32, matrix)>, Reshape{2}.0)
21.7% 43.7% 11.554s 1.16e-03s 10000 12
Elemwise{Mul}[(0, 1)](<TensorType(float32, matrix)>, Elemwise{Mul}[(0,
1)].0)
6.4% 50.0% 3.394s 3.39e-04s 10000 28
Elemwise{Mul}[(0, 2)](<TensorType(float32, matrix)>, <TensorType(float32,
matrix)>, Reshape{2}.0)
6.1% 56.1% 3.247s 3.25e-04s 10000 21 Sum{axis=[1],
acc_dtype=float64}(Elemwise{mul,no_inplace}.0)
6.0% 62.1% 3.189s 3.19e-04s 10000 29 Sum{axis=[1],
acc_dtype=float64}(Elemwise{Mul}[(0, 1)].0)
6.0% 68.1% 3.184s 3.18e-04s 10000 30 Sum{axis=[1],
acc_dtype=float64}(Elemwise{Mul}[(0, 2)].0)
6.0% 74.0% 3.171s 3.17e-04s 10000 14 Sum{axis=[1],
acc_dtype=float64}(Elemwise{Mul}[(0, 1)].0)
5.9% 80.0% 3.154s 3.15e-04s 10000 11 Sum{axis=[1],
acc_dtype=float64}(Elemwise{Mul}[(0, 1)].0)
5.1% 85.1% 2.718s 2.72e-04s 10000 16
Elemwise{mul,no_inplace}(<TensorType(float32, matrix)>,
InplaceDimShuffle{0,x}.0)
4.1% 89.1% 2.173s 2.17e-04s 10000 27
Elemwise{Mul}[(0, 1)](<TensorType(float32, matrix)>, Reshape{2}.0)
3.3% 92.5% 1.782s 1.78e-04s 10000 17
Alloc(Elemwise{TrueDiv}[(0, 0)].0, Shape_i{0}.0, TensorConstant{1},
Shape_i{1}.0, Shape_i{0}.0)
3.2% 95.7% 1.731s 1.73e-04s 10000 18
Alloc(Elemwise{TrueDiv}[(0, 0)].0, Shape_i{0}.0, TensorConstant{1},
Shape_i{1}.0, Shape_i{0}.0)
3.2% 99.0% 1.720s 1.72e-04s 10000 6
Alloc(Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) + maximum(i0,
i1)))}}.0, Shape_i{0}.0, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
0.1% 99.1% 0.068s 6.80e-06s 10000 34
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) +
maximum(i4, i3))) + (i5 * (maximum(i4, i3) / (maximum(i2, i3) + maximum(i4,
i3)))))}}(TensorConstant{(1L, 1L) of -1.0}, InplaceDimShuffle{0,x}.0,
<TensorType(float32, matrix)>, TensorConstant{(1L, 1L) of 0.0},
<TensorType(float32, matrix)>, Elemwise{Composite{(i0 + (i1 * i2))}}[(0,
0)].0)
0.1% 99.2% 0.058s 5.80e-06s 10000 16
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) +
maximum(i4, i3))) + ((i5 * i6 * maximum(i4, i3)) / (maximum(i2, i3) +
maximum(i4, i3))))}}(TensorConstant{(1L, 1L) of -1.0},
InplaceDimShuffle{0,x}.0, <TensorType(float32, matrix)>,
TensorConstant{(1L, 1L) of 0.0}, <TensorType(float32, matrix)>,
TensorConstant{(1L, 1L) of 2.0}, InplaceDimShuffle{0,x}.0)
0.1% 99.3% 0.035s 3.52e-06s 10000 14
Elemwise{TrueDiv}[(0, 0)](Elemwise{maximum,no_inplace}.0,
Elemwise{add,no_inplace}.0)
0.1% 99.3% 0.028s 2.80e-06s 10000 5
Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) + maximum(i0,
i1)))}}(InplaceDimShuffle{1,0}.0, TensorConstant{(1L, 1L) of 0.0},
InplaceDimShuffle{1,0}.0)
0.1% 99.4% 0.028s 2.80e-06s 10000 9
Alloc(TensorConstant{1.0}, Shape_i{0}.0, TensorConstant{1},
TensorConstant{1}, TensorConstant{1})
0.0% 99.4% 0.021s 2.10e-06s 10000 33
Elemwise{Composite{(i0 + (i1 * i2))}}[(0, 0)](InplaceDimShuffle{0,x}.0,
TensorConstant{(1L, 1L) of 2.0}, InplaceDimShuffle{0,x}.0)
0.0% 99.4% 0.019s 1.90e-06s 10000 12
Elemwise{add,no_inplace}(Elemwise{maximum,no_inplace}.0,
Elemwise{maximum,no_inplace}.0)
... (remaining 39 Apply instances account for 0.56%(0.30s) of the
runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing
list).
Test them first, as they are not guaranteed to always
provide a speedup.
Sorry, no tip for today.
On Friday, 10 February 2017 20:29:17 UTC+2, nouiz wrote:
>
> I think I understand the problem. To confirm where it is, can you
> profile it on the CPU? If I'm right, most of the time should still be spent
> in Alloc.
>
> The problem is that the graph has a more complicated version of part of
> the code, one that one of our optimizations doesn't recognize. The
> optimization that gets disabled is the one that removes the alloc. The
> allocs don't get removed because there is a reshape between the elemwise
> and the alloc.
>
> Ideally, we should remove the reshape from the graph, and then the opt
> should do its work.
>
> You can probably work around that by finding where you do the reshapes
> and removing them, making sure the alloc has the right shape directly. Or
> you can probably remove the alloc and use broadcasting directly to get the
> result you want. Mostly, the opt that didn't apply will make use of the
> broadcasting.
>
> Keep us updated; if you can confirm that the Alloc is still slow on the
> CPU, that would confirm the problem.
>
> Fred
>
> On Fri, Feb 3, 2017 at 4:22 AM Šarūnas S. <[email protected]>
> wrote:
>
>> I wrote a script in Theano and started profiling it. What I noticed is
>> that the GPU spends most of the time in GpuAlloc.
>>
>> Could somebody explain to me why this is happening and how I could reduce
>> it? In C or C++ I would preallocate the memory, but I am not sure how to
>> do this in Theano.
>>
>> I am running on Windows 8.1 with Nvidia GTX 1070 with Theano
>> @ 0.9.0dev4.dev-3c0be3d94102ac6864b2e5ab52ae96d07c6375c6
>>
>>
>> I am attaching extensive profile result below:
>>
>> Function profiling
>> ==================
>> Message: Sum of all(2) printed profiles at exit excluding Scan op
>> profile.
>> Time in 200 calls to Function.__call__: 3.463001e+00s
>> Time in Function.fn.__call__: 3.451001e+00s (99.653%)
>> Time in thunks: 3.425293e+00s (98.911%)
>> Total compile time: 1.413800e+01s
>> Number of Apply nodes: 590
>> Theano Optimizer time: 1.158200e+01s
>> Theano validate time: 9.390018e-01s
>> Theano Linker time (includes C, CUDA code generation/compiling):
>> 2.107000e+00s
>> Import time 3.500128e-02s
>> Node make_thunk time 2.042000e+00s
>> Node GpuCAReduce{add}{0,1}(GpuElemwise{Composite{(i0 * (i1 *
>> i2))}}[(0, 2)].0) time 9.000063e-03s
>> Node GpuCAReduce{add}{0,1}(GpuElemwise{Mul}[(0, 1)].0) time
>> 7.999897e-03s
>> Node GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0) time
>> 6.999969e-03s
>> Node Shape_i{1}(<CudaNdarrayType(float32, matrix)>) time
>> 4.999876e-03s
>> Node GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[[ 240.]]},
>> GpuDimShuffle{0,x}.0) time 4.999876e-03s
>>
>>
>> Time in all call to theano.grad() 0.000000e+00s
>> Time since theano import 41.580s
>> Class
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
>> <Class name>
>> 90.5% 90.5% 3.100s 3.37e-04s C 9200 92
>> theano.sandbox.cuda.basic_ops.GpuAlloc
>> 7.4% 97.9% 0.254s 4.19e-06s C 60600 606
>> theano.sandbox.cuda.basic_ops.GpuElemwise
>> 1.0% 98.9% 0.034s 2.77e-06s C 12200 122
>> theano.sandbox.cuda.basic_ops.GpuCAReduce
>> 0.5% 99.4% 0.017s 1.84e-06s C 9200 92
>> theano.sandbox.cuda.basic_ops.GpuReshape
>> 0.5% 99.9% 0.016s 7.45e-07s C 21400 214
>> theano.sandbox.cuda.basic_ops.GpuDimShuffle
>> 0.1% 99.9% 0.003s 1.57e-06s C 1900 19
>> theano.tensor.elemwise.Elemwise
>> 0.1% 100.0% 0.002s 5.24e-07s C 3800 38
>> theano.compile.ops.Shape_i
>> 0.0% 100.0% 0.000s 0.00e+00s C 1900 19
>> theano.tensor.opt.MakeVector
>> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>>
>>
>> Ops
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
>> <Op name>
>> 90.5% 90.5% 3.100s 3.37e-04s C 9200 92
>> GpuAlloc
>> 1.7% 92.2% 0.058s 4.41e-06s C 13100 131
>> GpuElemwise{Mul}[(0, 1)]
>> 1.0% 93.2% 0.034s 3.21e-06s C 10600 106
>> GpuElemwise{maximum,no_inplace}
>> 1.0% 94.2% 0.034s 2.77e-06s C 12200 122
>> GpuCAReduce{add}{0,1}
>> 0.7% 94.8% 0.023s 3.54e-06s C 6500 65
>> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>> 0.5% 95.4% 0.018s 3.27e-06s C 5500 55
>> GpuElemwise{mul,no_inplace}
>> 0.5% 95.9% 0.018s 4.61e-06s C 3900 39
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
>> 0.5% 96.4% 0.017s 1.84e-06s C 9200 92
>> GpuReshape{2}
>> 0.4% 96.8% 0.014s 4.33e-06s C 3200 32
>> GpuElemwise{Composite{(i0 * (i1 * i2))}}[(0, 2)]
>> 0.2% 97.0% 0.008s 8.69e-07s C 9200 92
>> GpuDimShuffle{1,0}
>> 0.2% 97.3% 0.008s 5.33e-06s C 1500 15
>> GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}
>> 0.2% 97.5% 0.008s 6.52e-07s C 12200 122
>> GpuDimShuffle{0,x}
>> 0.2% 97.7% 0.007s 4.38e-06s C 1600 16
>> GpuElemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) +
>> maximum(i4, i3))) + ((i5 * i6 * maximum(i4, i3)) / (maximum(i2, i3) +
>> maximum(i4, i3))))},no_inplace}
>> 0.2% 97.9% 0.007s 2.92e-06s C 2400 24
>> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)},no_inplace}
>> 0.2% 98.1% 0.007s 8.75e-06s C 800 8
>> GpuElemwise{Composite{((i0 * i1 * i2) / i3)}}[(0, 2)]
>> 0.2% 98.3% 0.007s 8.73e-06s C 800 8
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 0)]
>> 0.2% 98.5% 0.006s 3.54e-06s C 1700 17
>> GpuElemwise{true_div,no_inplace}
>> 0.1% 98.6% 0.005s 5.02e-06s C 1000 10
>> GpuElemwise{Composite{(i0 * (i1 + i2))},no_inplace}
>> 0.1% 98.8% 0.005s 9.99e-06s C 500 5
>> GpuElemwise{Composite{(((i0 + i1) + i2) + i3)},no_inplace}
>> 0.1% 98.9% 0.004s 6.65e-06s C 600 6
>> GpuElemwise{Composite{(((i0 * (maximum(i1, i2) / Composite{((i0 + i1) +
>> i2)}(maximum(i1, i2), maximum(i3, i2), maximum(i4, i2)))) + ((i5 * i6 *
>> maximum(i3, i2)) / Composite{((i0 + i1) + i2)}(maximum(i1, i2), maximum(
>> i3, i2), maximum(i4, i2)))) + ((i7 * i8 * maximum(i4, i2)) / Composite{((i0
>> + i1) + i2)}(maximum(i1, i2), maximum(i3, i2), maximum(i4, i2))))},
>> no_inplace}
>> ... (remaining 33 Ops account for 1.11%(0.04s) of the runtime)
>>
>>
>> Apply
>> ------
>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>> 1.6% 1.6% 0.055s 5.50e-04s 100 188 GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}.0, TensorConstant{
>> 1326}, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>> 1.6% 3.2% 0.055s 5.50e-04s 100 217 GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, TensorConstant{1326},
>> TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>> 1.6% 4.8% 0.055s 5.50e-04s 100 224 GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1 * i2) / i3)}}[(0, 2)].0, TensorConstant{
>> 1326}, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>> 1.6% 6.4% 0.055s 5.50e-04s 100 183 GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, TensorConstant{1326},
>> TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>> 1.6% 8.0% 0.054s 5.39e-04s 100 186 GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}.0, TensorConstant{
>> 1326}, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>> 1.5% 9.5% 0.053s 5.30e-04s 100 154 GpuAlloc(
>> GpuElemwise{true_div
>>
>