Thank you for your investigation. I ran and profiled the example code on 
the CPU and attached the results below. Alloc is much less of an issue on 
the CPU, so I am not sure your hypothesis still holds.

Regarding the reshaping, I suspect the following line is to blame:
reach = cases[group] * T.tile(reach.T, (cases[group].shape[0], 1))
reach.T is a row vector, and I multiply each row of cases[group] by it. 
AFAIK the broadcasted version should look like
reach = (cases[group].T * reach).T
but it throws a shape-mismatch error. Or am I misunderstanding 
broadcasting?
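
In case it helps, here is a self-contained sketch of what I mean (the 
names are placeholders, and I am assuming reach is stored as an (n, 1) 
float32 matrix). Flagging the length-1 axis of reach as broadcastable with 
T.addbroadcast lets a plain elementwise multiply broadcast the row over 
cases[group], which I would expect to replace the T.tile (and the Alloc it 
generates):

import theano
import theano.tensor as T
import numpy as np

cases_g = T.fmatrix('cases_g')  # stands in for cases[group], shape (m, n)
reach = T.fmatrix('reach')      # column vector stored as an (n, 1) matrix

# current version: materialize the tiled (m, n) matrix, then multiply
tiled = cases_g * T.tile(reach.T, (cases_g.shape[0], 1))

# broadcast version: mark axis 1 of reach broadcastable, transpose to a
# (1, n) row, and let the elementwise multiply broadcast it over cases_g
bcast = cases_g * T.addbroadcast(reach, 1).T

f = theano.function([cases_g, reach], [tiled, bcast])
c = np.random.rand(4, 3).astype('float32')
r = np.random.rand(3, 1).astype('float32')
out_tiled, out_bcast = f(c, r)
assert np.allclose(out_tiled, out_bcast)

If that is the right way to express it, the profile should lose the Alloc 
that T.tile introduces.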

For the rest, most of the operations are element-wise multiplies, divides, 
and adds. Could those be wasting time allocating space for intermediate 
results?
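
For reference, a minimal way to check this (a toy graph, not my actual 
code): theano.printing.debugprint on a compiled function shows whether an 
elementwise chain was fused into a single Composite node, which would mean 
no buffers are allocated for the intermediates:

import theano
import theano.tensor as T

x = T.fmatrix('x')
y = T.fmatrix('y')
z = (x * y + x) / y  # a chain of elementwise ops

f = theano.function([x, y], z)
theano.printing.debugprint(f)
# a fused chain prints as one Elemwise{Composite{...}} node, so no
# intermediate buffers are allocated; unfused ops print as separate
# Elemwise nodes, each writing its result into its own buffer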


CPU profiled code:
Function profiling
==================
  Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
  Time in 20000 calls to Function.__call__: 5.398000e+01s
  Time in Function.fn.__call__: 5.358499e+01s (99.268%)
  Time in thunks: 5.327701e+01s (98.698%)
  Total compile time: 5.962000e+00s
    Number of Apply nodes: 21
    Theano Optimizer time: 4.749999e-01s
       Theano validate time: 6.999969e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 
4.271000e+00s
       Import time 6.200051e-02s
       Node make_thunk time 4.267000e+00s
           Node Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) + 
maximum(i0, i1)))}}(InplaceDimShuffle{1,0}.0, TensorConstant{(1L, 1L) of 
0.0}, InplaceDimShuffle{1,0}.0) time 8.690000e-01s
           Node Elemwise{mul,no_inplace}(<TensorType(float32, matrix)>, 
InplaceDimShuffle{0,x}.0) time 8.640001e-01s
           Node Elemwise{Composite{(i0 + (i1 * i2))}}[(0, 
0)](InplaceDimShuffle{0,x}.0, TensorConstant{(1L, 1L) of 2.0}, 
InplaceDimShuffle{0,x}.0) time 8.090000e-01s
           Node Elemwise{Neg}[(0, 0)](InplaceDimShuffle{0,x}.0) time 
7.180002e-01s
           Node Alloc(TensorConstant{1.0}, Shape_i{0}.0, TensorConstant{1}, 
TensorConstant{1}, TensorConstant{1}) time 5.160000e-01s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 61.639s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
<Class name>
  59.9%    59.9%      31.927s       1.33e-04s     C   240000      24   
theano.tensor.elemwise.Elemwise
  29.9%    89.9%      15.945s       3.19e-04s     C    50000       5   
theano.tensor.elemwise.Sum
   9.9%    99.7%       5.262s       1.32e-04s     C    40000       4   
theano.tensor.basic.Alloc
   0.1%    99.8%       0.049s       4.89e-07s     C   100000      10   
theano.tensor.elemwise.DimShuffle
   0.1%    99.9%       0.047s       5.93e-07s     C    80000       8   
theano.compile.ops.Shape_i
   0.1%   100.0%       0.027s       6.74e-07s     C    40000       4   
theano.tensor.basic.Reshape
   0.0%   100.0%       0.020s       5.01e-07s     C    40000       4   
theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op 
name>
  47.8%    47.8%      25.449s       6.36e-04s     C     40000        4   
Elemwise{Mul}[(0, 1)]
  29.9%    77.7%      15.945s       3.19e-04s     C     50000        5   
Sum{axis=[1], acc_dtype=float64}
   9.9%    87.6%       5.262s       1.32e-04s     C     40000        4   
Alloc
   6.4%    93.9%       3.394s       3.39e-04s     C     10000        1   
Elemwise{Mul}[(0, 2)]
   5.1%    99.0%       2.718s       2.72e-04s     C     10000        1   
Elemwise{mul,no_inplace}
   0.1%    99.2%       0.068s       6.80e-06s     C     10000        1   
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) + 
maximum(i4, i3))) + (i5 * (maximum(i4, i3) / (maximum(i2, i3) + maximum(i4, 
i3)))))}}
   0.1%    99.3%       0.058s       5.80e-06s     C     10000        1   
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) + 
maximum(i4, i3))) + ((i5 * i6 * maximum(i4, i3)) / (maximum(i2, i3) + 
maximum(i4, i3))))}}
   0.1%    99.4%       0.054s       2.71e-06s     C     20000        2   
Elemwise{TrueDiv}[(0, 0)]
   0.1%    99.5%       0.052s       1.30e-06s     C     40000        4   
Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
   0.1%    99.5%       0.030s       1.50e-06s     C     20000        2   
Elemwise{maximum,no_inplace}
   0.1%    99.6%       0.028s       2.80e-06s     C     10000        1   
Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) + maximum(i0, i1)))}}
   0.0%    99.6%       0.025s       5.03e-07s     C     50000        5   
Shape_i{0}
   0.0%    99.7%       0.025s       4.17e-07s     C     60000        6   
InplaceDimShuffle{0,x}
   0.0%    99.7%       0.024s       5.98e-07s     C     40000        4   
InplaceDimShuffle{1,0}
   0.0%    99.8%       0.022s       7.43e-07s     C     30000        3   
Shape_i{1}
   0.0%    99.8%       0.021s       1.05e-06s     C     20000        2   
Elemwise{Neg}[(0, 0)]
   0.0%    99.8%       0.021s       2.10e-06s     C     10000        1   
Elemwise{Composite{(i0 + (i1 * i2))}}[(0, 0)]
   0.0%    99.9%       0.020s       5.01e-07s     C     40000        4   
MakeVector{dtype='int64'}
   0.0%    99.9%       0.019s       1.90e-06s     C     10000        1   
Elemwise{add,no_inplace}
   0.0%   100.0%       0.019s       6.32e-07s     C     30000        3   
Reshape{2}
   ... (remaining 2 Ops account for   0.04%(0.02s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  22.0%    22.0%      11.705s       1.17e-03s   10000    10   
Elemwise{Mul}[(0, 1)](<TensorType(float32, matrix)>, Reshape{2}.0)
  21.7%    43.7%      11.554s       1.16e-03s   10000    12   
Elemwise{Mul}[(0, 1)](<TensorType(float32, matrix)>, Elemwise{Mul}[(0, 
1)].0)
   6.4%    50.0%       3.394s       3.39e-04s   10000    28   
Elemwise{Mul}[(0, 2)](<TensorType(float32, matrix)>, <TensorType(float32, 
matrix)>, Reshape{2}.0)
   6.1%    56.1%       3.247s       3.25e-04s   10000    21   Sum{axis=[1], 
acc_dtype=float64}(Elemwise{mul,no_inplace}.0)
   6.0%    62.1%       3.189s       3.19e-04s   10000    29   Sum{axis=[1], 
acc_dtype=float64}(Elemwise{Mul}[(0, 1)].0)
   6.0%    68.1%       3.184s       3.18e-04s   10000    30   Sum{axis=[1], 
acc_dtype=float64}(Elemwise{Mul}[(0, 2)].0)
   6.0%    74.0%       3.171s       3.17e-04s   10000    14   Sum{axis=[1], 
acc_dtype=float64}(Elemwise{Mul}[(0, 1)].0)
   5.9%    80.0%       3.154s       3.15e-04s   10000    11   Sum{axis=[1], 
acc_dtype=float64}(Elemwise{Mul}[(0, 1)].0)
   5.1%    85.1%       2.718s       2.72e-04s   10000    16   
Elemwise{mul,no_inplace}(<TensorType(float32, matrix)>, 
InplaceDimShuffle{0,x}.0)
   4.1%    89.1%       2.173s       2.17e-04s   10000    27   
Elemwise{Mul}[(0, 1)](<TensorType(float32, matrix)>, Reshape{2}.0)
   3.3%    92.5%       1.782s       1.78e-04s   10000    17   
Alloc(Elemwise{TrueDiv}[(0, 0)].0, Shape_i{0}.0, TensorConstant{1}, 
Shape_i{1}.0, Shape_i{0}.0)
   3.2%    95.7%       1.731s       1.73e-04s   10000    18   
Alloc(Elemwise{TrueDiv}[(0, 0)].0, Shape_i{0}.0, TensorConstant{1}, 
Shape_i{1}.0, Shape_i{0}.0)
   3.2%    99.0%       1.720s       1.72e-04s   10000     6   
Alloc(Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) + maximum(i0, 
i1)))}}.0, Shape_i{0}.0, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
   0.1%    99.1%       0.068s       6.80e-06s   10000    34   
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) + 
maximum(i4, i3))) + (i5 * (maximum(i4, i3) / (maximum(i2, i3) + maximum(i4, 
i3)))))}}(TensorConstant{(1L, 1L) of -1.0}, InplaceDimShuffle{0,x}.0, 
<TensorType(float32, matrix)>, TensorConstant{(1L, 1L) of 0.0}, 
<TensorType(float32, matrix)>, Elemwise{Composite{(i0 + (i1 * i2))}}[(0, 
0)].0)
   0.1%    99.2%       0.058s       5.80e-06s   10000    16   
Elemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) + 
maximum(i4, i3))) + ((i5 * i6 * maximum(i4, i3)) / (maximum(i2, i3) + 
maximum(i4, i3))))}}(TensorConstant{(1L, 1L) of -1.0}, 
InplaceDimShuffle{0,x}.0, <TensorType(float32, matrix)>, 
TensorConstant{(1L, 1L) of 0.0}, <TensorType(float32, matrix)>, 
TensorConstant{(1L, 1L) of 2.0}, InplaceDimShuffle{0,x}.0)
   0.1%    99.3%       0.035s       3.52e-06s   10000    14   
Elemwise{TrueDiv}[(0, 0)](Elemwise{maximum,no_inplace}.0, 
Elemwise{add,no_inplace}.0)
   0.1%    99.3%       0.028s       2.80e-06s   10000     5   
Elemwise{Composite{(maximum(i0, i1) / (maximum(i2, i1) + maximum(i0, 
i1)))}}(InplaceDimShuffle{1,0}.0, TensorConstant{(1L, 1L) of 0.0}, 
InplaceDimShuffle{1,0}.0)
   0.1%    99.4%       0.028s       2.80e-06s   10000     9   
Alloc(TensorConstant{1.0}, Shape_i{0}.0, TensorConstant{1}, 
TensorConstant{1}, TensorConstant{1})
   0.0%    99.4%       0.021s       2.10e-06s   10000    33   
Elemwise{Composite{(i0 + (i1 * i2))}}[(0, 0)](InplaceDimShuffle{0,x}.0, 
TensorConstant{(1L, 1L) of 2.0}, InplaceDimShuffle{0,x}.0)
   0.0%    99.4%       0.019s       1.90e-06s   10000    12   
Elemwise{add,no_inplace}(Elemwise{maximum,no_inplace}.0, 
Elemwise{maximum,no_inplace}.0)
   ... (remaining 39 Apply instances account for 0.56%(0.30s) of the 
runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing 
list).
                 Test them first, as they are not guaranteed to always 
provide a speedup.
  Sorry, no tip for today.

On Friday, 10 February 2017 20:29:17 UTC+2, nouiz wrote:
>
> I think I understand the problem. To confirm where the problem is, can you 
> profile it on the CPU? If I'm right, most of the time should still be spent 
> in Alloc.
>
> The problem is that the graph has a more complicated version of part of 
> the code, one that our optimizations don't recognize. The optimization 
> that gets disabled is the one that removes the allocs. They don't get 
> removed because there is a reshape between the elemwise and the alloc.
>
> Ideally, we should remove the reshape from the graph, and then the opt 
> should do its work.
>
> You can probably work around that by finding where you do the reshapes 
> and removing them, making sure the alloc has the right shape directly. Or 
> you can probably remove the alloc and use broadcasting directly to get 
> the result you want. Mostly, the opt that didn't apply will make use of 
> the broadcasting.
>
> Keep us updated, and if you can confirm that the Alloc is still slow on 
> the CPU, that would confirm the problem.
>
> Fred
>
> On Fri, Feb 3, 2017 at 4:22 AM Šarūnas S. <[email protected]> 
> wrote:
>
>> I wrote a script in Theano and started profiling it. What I noticed is 
>> that the GPU spends most of the time in GpuAlloc.
>>
>> Could somebody explain to me why this is happening and how I could reduce it?
>> In C or C++ I would preallocate it, but I am not sure how to do this in 
>> Theano.
>>
>> I am running on Windows 8.1 with an Nvidia GTX 1070, with Theano 
>> @ 0.9.0dev4.dev-3c0be3d94102ac6864b2e5ab52ae96d07c6375c6
>>
>>
>> I am attaching extensive profile result below:
>>
>> Function profiling
>> ==================
>>   Message: Sum of all(2) printed profiles at exit excluding Scan op 
>> profile.
>>   Time in 200 calls to Function.__call__: 3.463001e+00s
>>   Time in Function.fn.__call__: 3.451001e+00s (99.653%)
>>   Time in thunks: 3.425293e+00s (98.911%)
>>   Total compile time: 1.413800e+01s
>>     Number of Apply nodes: 590
>>     Theano Optimizer time: 1.158200e+01s
>>        Theano validate time: 9.390018e-01s
>>     Theano Linker time (includes C, CUDA code generation/compiling): 
>> 2.107000e+00s
>>        Import time 3.500128e-02s
>>        Node make_thunk time 2.042000e+00s
>>            Node GpuCAReduce{add}{0,1}(GpuElemwise{Composite{(i0 * (i1 * 
>> i2))}}[(0, 2)].0) time 9.000063e-03s
>>            Node GpuCAReduce{add}{0,1}(GpuElemwise{Mul}[(0, 1)].0) time 
>> 7.999897e-03s
>>            Node GpuDimShuffle{0,x}(GpuCAReduce{add}{0,1}.0) time 
>> 6.999969e-03s
>>            Node Shape_i{1}(<CudaNdarrayType(float32, matrix)>) time 
>> 4.999876e-03s
>>            Node GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[[ 240.]]}, 
>> GpuDimShuffle{0,x}.0) time 4.999876e-03s
>>
>>
>> Time in all call to theano.grad() 0.000000e+00s
>> Time since theano import 41.580s
>> Class
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>> <Class name>
>>   90.5%    90.5%       3.100s       3.37e-04s     C     9200      92   
>> theano.sandbox.cuda.basic_ops.GpuAlloc
>>    7.4%    97.9%       0.254s       4.19e-06s     C    60600     606   
>> theano.sandbox.cuda.basic_ops.GpuElemwise
>>    1.0%    98.9%       0.034s       2.77e-06s     C    12200     122   
>> theano.sandbox.cuda.basic_ops.GpuCAReduce
>>    0.5%    99.4%       0.017s       1.84e-06s     C     9200      92   
>> theano.sandbox.cuda.basic_ops.GpuReshape
>>    0.5%    99.9%       0.016s       7.45e-07s     C    21400     214   
>> theano.sandbox.cuda.basic_ops.GpuDimShuffle
>>    0.1%    99.9%       0.003s       1.57e-06s     C     1900      19   
>> theano.tensor.elemwise.Elemwise
>>    0.1%   100.0%       0.002s       5.24e-07s     C     3800      38   
>> theano.compile.ops.Shape_i
>>    0.0%   100.0%       0.000s       0.00e+00s     C     1900      19   
>> theano.tensor.opt.MakeVector
>>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>>
>>
>> Ops
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>> <Op name>
>>   90.5%    90.5%       3.100s       3.37e-04s     C     9200       92   
>> GpuAlloc
>>    1.7%    92.2%       0.058s       4.41e-06s     C     13100      131   
>> GpuElemwise{Mul}[(0, 1)]
>>    1.0%    93.2%       0.034s       3.21e-06s     C     10600      106   
>> GpuElemwise{maximum,no_inplace}
>>    1.0%    94.2%       0.034s       2.77e-06s     C     12200      122   
>> GpuCAReduce{add}{0,1}
>>    0.7%    94.8%       0.023s       3.54e-06s     C     6500       65   
>> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>>    0.5%    95.4%       0.018s       3.27e-06s     C     5500       55   
>> GpuElemwise{mul,no_inplace}
>>    0.5%    95.9%       0.018s       4.61e-06s     C     3900       39   
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
>>    0.5%    96.4%       0.017s       1.84e-06s     C     9200       92   
>> GpuReshape{2}
>>    0.4%    96.8%       0.014s       4.33e-06s     C     3200       32   
>> GpuElemwise{Composite{(i0 * (i1 * i2))}}[(0, 2)]
>>    0.2%    97.0%       0.008s       8.69e-07s     C     9200       92   
>> GpuDimShuffle{1,0}
>>    0.2%    97.3%       0.008s       5.33e-06s     C     1500       15   
>> GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}
>>    0.2%    97.5%       0.008s       6.52e-07s     C     12200      122   
>> GpuDimShuffle{0,x}
>>    0.2%    97.7%       0.007s       4.38e-06s     C     1600       16   
>> GpuElemwise{Composite{(((i0 * i1 * maximum(i2, i3)) / (maximum(i2, i3) + 
>> maximum(i4, i3))) + ((i5 * i6 * maximum(i4, i3)) / (maximum(i2, i3) + 
>> maximum(i4, i3))))},no_inplace}
>>    0.2%    97.9%       0.007s       2.92e-06s     C     2400       24   
>> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)},no_inplace}
>>    0.2%    98.1%       0.007s       8.75e-06s     C      800        8   
>> GpuElemwise{Composite{((i0 * i1 * i2) / i3)}}[(0, 2)]
>>    0.2%    98.3%       0.007s       8.73e-06s     C      800        8   
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 0)]
>>    0.2%    98.5%       0.006s       3.54e-06s     C     1700       17   
>> GpuElemwise{true_div,no_inplace}
>>    0.1%    98.6%       0.005s       5.02e-06s     C     1000       10   
>> GpuElemwise{Composite{(i0 * (i1 + i2))},no_inplace}
>>    0.1%    98.8%       0.005s       9.99e-06s     C      500        5   
>> GpuElemwise{Composite{(((i0 + i1) + i2) + i3)},no_inplace}
>>    0.1%    98.9%       0.004s       6.65e-06s     C      600        6   
>> GpuElemwise{Composite{(((i0 * (maximum(i1, i2) / Composite{((i0 + i1) + 
>> i2)}(maximum(i1, i2), maximum(i3, i2), maximum(i4, i2)))) + ((i5 * i6 * 
>> maximum(i3, i2)) / Composite{((i0 + i1) + i2)}(maximum(i1, i2), maximum(
>> i3, i2), maximum(i4, i2)))) + ((i7 * i8 * maximum(i4, i2)) / Composite{((i0 
>> + i1) + i2)}(maximum(i1, i2), maximum(i3, i2), maximum(i4, i2))))},
>> no_inplace}
>>    ... (remaining 33 Ops account for   1.11%(0.04s) of the runtime)
>>
>>
>> Apply
>> ------
>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>    1.6%     1.6%       0.055s       5.50e-04s    100   188   GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}.0, TensorConstant{
>> 1326}, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>>    1.6%     3.2%       0.055s       5.50e-04s    100   217   GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, TensorConstant{1326}, 
>> TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>>    1.6%     4.8%       0.055s       5.50e-04s    100   224   GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1 * i2) / i3)}}[(0, 2)].0, TensorConstant{
>> 1326}, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>>    1.6%     6.4%       0.055s       5.50e-04s    100   183   GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, TensorConstant{1326}, 
>> TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>>    1.6%     8.0%       0.054s       5.39e-04s    100   186   GpuAlloc(
>> GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}.0, TensorConstant{
>> 1326}, TensorConstant{1}, Shape_i{1}.0, Shape_i{0}.0)
>>    1.5%     9.5%       0.053s       5.30e-04s    100   154   GpuAlloc(
>> GpuElemwise{true_div
>>
>
