I'm learning about neural nets, Theano, and GPU vs. CPU performance. I happened
to have a 48-CUDA-core GeForce GT 610 around, so I added it to my 8-core AMD
machine with 32 GB of RAM, installed the software, and ran the logistic
regression example, tweaked with additional progress prints and with N, the
number of features, set to 4000 instead of 400. The CPU version took about 33
seconds; the GPU version took 139 seconds!
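For reference, the computation being timed is essentially the tutorial's logistic-regression SGD loop. Here is a rough NumPy sketch of it (my reconstruction; the variable names, initialization, and step count are illustrative, not the exact script):

```python
import time
import numpy as np

rng = np.random.default_rng(0)

n_examples, n_feats = 400, 4000        # the tweak: 4000 features instead of 400
x = rng.standard_normal((n_examples, n_feats)).astype(np.float32)
y = (rng.random(n_examples) > 0.5).astype(np.float32)
w = rng.standard_normal(n_feats).astype(np.float32)
b = np.float32(0.0)
lr = 0.1
n_steps = 1000                          # the run above loops 10000 times

t0 = time.time()
for i in range(n_steps):
    z = x @ w + b                                # the Gemv that dominates both profiles
    cost = float(np.sum(np.logaddexp(0, z) - y * z))      # stable summed cross-entropy
    g = (np.exp(-np.logaddexp(0, -z)) - y) / n_examples   # sigmoid(z) - y, averaged
    w = 0.998 * w - lr * (x.T @ g)               # 0.998 = the L2-decay constant in the profile
    b -= lr * g.sum()
    if i % 100 == 0:
        print(i, cost)
print("Looping %d times took %f seconds" % (n_steps, time.time() - t0))
```

The two CGemv/GpuGemv calls in the profiles correspond to the forward `x @ w` and the gradient `x.T @ g` products here.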
So now I'm trying to understand the profile information; I attach the two
profiles below. In the Class section, the GPU spends
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  62.7%  62.7%  84.714s  2.12e-03s  C  40000  4  theano.sandbox.cuda.basic_ops.GpuFromHost
  34.9%  97.6%  47.197s  2.36e-03s  C  20000  2  theano.sandbox.cuda.blas.GpuGemv
   1.4%  99.0%   1.918s  2.74e-05s  C  70000  7  theano.sandbox.cuda.basic_ops.GpuElemwise
the CPU spends
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  75.5%  75.5%  24.083s  1.20e-03s  C  20000  2  theano.tensor.blas_c.CGemv
  23.9%  99.4%   7.612s  9.52e-05s  C  80000  8  theano.tensor.elemwise.Elemwise
   0.3%  99.7%   0.103s  1.03e-05s  C  10000  1  theano.tensor.elemwise.Sum
How does one interpret this? Are there any other aspects one should focus on
to learn from?
From the program:
gpu
0 1.57595e+06
1000 1893.22
2000 1888.27
3000 1888.27
4000 1888.27
5000 1888.27
6000 1888.27
7000 1888.27
8000 1888.27
9000 1888.27
Looping 10000 times took 139.552690 seconds
cpu
0 1.5654e+06
1000 1881.41
2000 1878.16
3000 1878.15
4000 1878.15
5000 1878.15
6000 1878.16
7000 1878.16
8000 1878.15
9000 1878.15
Looping 10000 times took 33.479546 seconds
thanks
--
---
You received this message because you are subscribed to the Google Groups
"theano-users" group.
karve@erie:~$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True python
TimedLogistic.py
Initial model:
cpu
0 1.5654e+06
1000 1881.41
2000 1878.16
3000 1878.15
4000 1878.15
5000 1878.15
6000 1878.16
7000 1878.16
8000 1878.15
9000 1878.15
Looping 10000 times took 33.479546 seconds
Final model:
diff between act and pred
1203.0
Function profiling
==================
Message: TimedLogistic.py:59
Time in 10000 calls to Function.__call__: 3.276280e+01s
Time in Function.fn.__call__: 3.218840e+01s (98.247%)
Time in thunks: 3.188320e+01s (97.315%)
Total compile time: 1.659312e+00s
Number of Apply nodes: 17
Theano Optimizer time: 1.203010e+00s
Theano validate time: 5.726814e-03s
Theano Linker time (includes C, CUDA code generation/compiling):
4.421425e-02s
Import time 1.099539e-02s
Time in all call to theano.grad() 5.375147e-02s
Time since theano import 35.817s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  75.5%  75.5%  24.083s  1.20e-03s  C  20000  2  theano.tensor.blas_c.CGemv
  23.9%  99.4%   7.612s  9.52e-05s  C  80000  8  theano.tensor.elemwise.Elemwise
   0.3%  99.7%   0.103s  1.03e-05s  C  10000  1  theano.tensor.elemwise.Sum
   0.2%  99.9%   0.049s  1.62e-06s  C  30000  3  theano.tensor.elemwise.DimShuffle
   0.1%  99.9%   0.019s  1.90e-06s  C  10000  1  theano.tensor.basic.AllocEmpty
   0.1% 100.0%   0.017s  8.61e-07s  C  20000  2  theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
75.5% 75.5% 24.083s 1.20e-03s C 20000 2
CGemv{inplace}
13.0% 88.5% 4.142s 4.14e-04s C 10000 1
Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))}}[(0, 4)]
7.0% 95.5% 2.231s 2.23e-04s C 10000 1
Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) -
((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
3.4% 99.0% 1.095s 1.09e-04s C 10000 1
Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
0.3% 99.3% 0.103s 1.03e-05s C 10000 1
Sum{acc_dtype=float64}
0.2% 99.5% 0.054s 5.45e-06s C 10000 1
Elemwise{sub,no_inplace}
0.1% 99.6% 0.047s 4.67e-06s C 10000 1
Elemwise{neg,no_inplace}
0.1% 99.7% 0.037s 1.83e-06s C 20000 2
InplaceDimShuffle{x}
0.1% 99.8% 0.026s 2.56e-06s C 10000 1
Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.1% 99.9% 0.019s 1.90e-06s C 10000 1
AllocEmpty{dtype='float32'}
0.1% 99.9% 0.017s 8.61e-07s C 20000 2
Shape_i{0}
0.0% 99.9% 0.012s 1.21e-06s C 10000 1
InplaceDimShuffle{1,0}
0.0% 100.0% 0.010s 9.51e-07s C 10000 1
Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.0% 100.0% 0.008s 8.30e-07s C 10000 1
Elemwise{Cast{float32}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
39.8% 39.8% 12.698s 1.27e-03s 10000 7
CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w,
TensorConstant{0.0})
35.7% 75.5% 11.385s 1.14e-03s 10000 15 CGemv{inplace}(w,
TensorConstant{-0.10000000149011612}, x.T,
Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid(
(-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.9980000257492065})
13.0% 88.5% 4.142s 4.14e-04s 10000 12
Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0
, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0,
Elemwise{neg,no_inplace}.0)
7.0% 95.5% 2.231s 2.23e-04s 10000 13
Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) -
((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0)
- i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0,
Elemwise{sub,no_inplace}.0)
3.4% 99.0% 1.095s 1.09e-04s 10000 11
Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0,
TensorConstant{(1,) of 0.5})
0.3% 99.3% 0.103s 1.03e-05s 10000 14
Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) /
i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
0.2% 99.5% 0.054s 5.45e-06s 10000 4
Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.1% 99.6% 0.047s 4.67e-06s 10000 10
Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.1% 99.7% 0.027s 2.72e-06s 10000 0
InplaceDimShuffle{x}(b)
0.1% 99.8% 0.026s 2.56e-06s 10000 9
Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0,
InplaceDimShuffle{x}.0)
0.1% 99.8% 0.019s 1.90e-06s 10000 5
AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.0% 99.9% 0.012s 1.21e-06s 10000 2
InplaceDimShuffle{1,0}(x)
0.0% 99.9% 0.010s 9.51e-07s 10000 16
Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b,
TensorConstant{0.10000000149011612}, Sum{acc_dtype=float64}.0)
0.0% 99.9% 0.009s 9.50e-07s 10000 3 Shape_i{0}(y)
0.0% 99.9% 0.009s 9.29e-07s 10000 6
InplaceDimShuffle{x}(Shape_i{0}.0)
0.0% 100.0% 0.008s 8.30e-07s 10000 8
Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.0% 100.0% 0.008s 7.72e-07s 10000 1 Shape_i{0}(x)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide
a speedup.
- Try installing amdlibm and set the Theano flag lib.amdlibm=True. This
speeds up only some Elemwise operation.
Function profiling
==================
Message: TimedLogistic.py:60
Time in 1 calls to Function.__call__: 1.614809e-03s
Time in Function.fn.__call__: 1.566648e-03s (97.018%)
Time in thunks: 1.549006e-03s (95.925%)
Total compile time: 6.419921e-02s
Number of Apply nodes: 5
Theano Optimizer time: 5.166340e-02s
Theano validate time: 6.103516e-04s
Theano Linker time (includes C, CUDA code generation/compiling):
4.971504e-03s
Import time 6.563663e-04s
Time in all call to theano.grad() 5.375147e-02s
Time since theano import 35.822s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class
name>
92.2% 92.2% 0.001s 1.43e-03s C 1 1
theano.tensor.blas_c.CGemv
7.4% 99.7% 0.000s 1.15e-04s C 1 1
theano.tensor.elemwise.Elemwise
0.2% 99.9% 0.000s 3.10e-06s C 1 1
theano.tensor.elemwise.DimShuffle
0.1% 99.9% 0.000s 1.19e-06s C 1 1
theano.tensor.basic.AllocEmpty
0.1% 100.0% 0.000s 9.54e-07s C 1 1
theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
92.2% 92.2% 0.001s 1.43e-03s C 1 1
CGemv{inplace}
7.4% 99.7% 0.000s 1.15e-04s C 1 1
Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
0.2% 99.9% 0.000s 3.10e-06s C 1 1
InplaceDimShuffle{x}
0.1% 99.9% 0.000s 1.19e-06s C 1 1
AllocEmpty{dtype='float32'}
0.1% 100.0% 0.000s 9.54e-07s C 1 1
Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
92.2% 92.2% 0.001s 1.43e-03s 1 3
CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w,
TensorConstant{0.0})
7.4% 99.7% 0.000s 1.15e-04s 1 4
Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0,
InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
0.2% 99.9% 0.000s 3.10e-06s 1 0
InplaceDimShuffle{x}(b)
0.1% 99.9% 0.000s 1.19e-06s 1 2
AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.1% 100.0% 0.000s 9.54e-07s 1 1 Shape_i{0}(x)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide
a speedup.
- Try installing amdlibm and set the Theano flag lib.amdlibm=True. This
speeds up only some Elemwise operation.
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 3.276441e+01s
Time in Function.fn.__call__: 3.218997e+01s (98.247%)
Time in thunks: 3.188475e+01s (97.315%)
Total compile time: 1.723511e+00s
Number of Apply nodes: 17
Theano Optimizer time: 1.254673e+00s
Theano validate time: 6.337166e-03s
Theano Linker time (includes C, CUDA code generation/compiling):
4.918575e-02s
Import time 1.165175e-02s
Time in all call to theano.grad() 5.375147e-02s
Time since theano import 35.824s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class
name>
75.5% 75.5% 24.085s 1.20e-03s C 20001 3
theano.tensor.blas_c.CGemv
23.9% 99.4% 7.613s 9.52e-05s C 80001 9
theano.tensor.elemwise.Elemwise
0.3% 99.7% 0.103s 1.03e-05s C 10000 1
theano.tensor.elemwise.Sum
0.2% 99.9% 0.049s 1.62e-06s C 30001 4
theano.tensor.elemwise.DimShuffle
0.1% 99.9% 0.019s 1.90e-06s C 10001 2
theano.tensor.basic.AllocEmpty
0.1% 100.0% 0.017s 8.61e-07s C 20001 3
theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
75.5% 75.5% 24.085s 1.20e-03s C 20001 3
CGemv{inplace}
13.0% 88.5% 4.142s 4.14e-04s C 10000 1
Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))}}[(0, 4)]
7.0% 95.5% 2.231s 2.23e-04s C 10000 1
Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) -
((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
3.4% 99.0% 1.095s 1.09e-04s C 10000 1
Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
0.3% 99.3% 0.103s 1.03e-05s C 10000 1
Sum{acc_dtype=float64}
0.2% 99.5% 0.054s 5.45e-06s C 10000 1
Elemwise{sub,no_inplace}
0.1% 99.6% 0.047s 4.67e-06s C 10000 1
Elemwise{neg,no_inplace}
0.1% 99.7% 0.037s 1.83e-06s C 20001 3
InplaceDimShuffle{x}
0.1% 99.8% 0.026s 2.56e-06s C 10000 1
Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.1% 99.9% 0.019s 1.90e-06s C 10001 2
AllocEmpty{dtype='float32'}
0.1% 99.9% 0.017s 8.61e-07s C 20001 3
Shape_i{0}
0.0% 99.9% 0.012s 1.21e-06s C 10000 1
InplaceDimShuffle{1,0}
0.0% 100.0% 0.010s 9.51e-07s C 10000 1
Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.0% 100.0% 0.008s 8.30e-07s C 10000 1
Elemwise{Cast{float32}}
0.0% 100.0% 0.000s 1.15e-04s C 1 1
Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
39.8% 39.8% 12.698s 1.27e-03s 10000 7
CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w,
TensorConstant{0.0})
35.7% 75.5% 11.385s 1.14e-03s 10000 15 CGemv{inplace}(w,
TensorConstant{-0.10000000149011612}, x.T,
Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid(
(-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.9980000257492065})
13.0% 88.5% 4.142s 4.14e-04s 10000 12
Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0
, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0,
Elemwise{neg,no_inplace}.0)
7.0% 95.5% 2.231s 2.23e-04s 10000 13
Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) -
((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0)
- i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0,
Elemwise{sub,no_inplace}.0)
3.4% 99.0% 1.095s 1.09e-04s 10000 11
Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0,
TensorConstant{(1,) of 0.5})
0.3% 99.3% 0.103s 1.03e-05s 10000 14
Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) /
i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
0.2% 99.4% 0.054s 5.45e-06s 10000 4
Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.1% 99.6% 0.047s 4.67e-06s 10000 10
Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.1% 99.7% 0.027s 2.72e-06s 10000 0
InplaceDimShuffle{x}(b)
0.1% 99.8% 0.026s 2.56e-06s 10000 9
Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0,
InplaceDimShuffle{x}.0)
0.1% 99.8% 0.019s 1.90e-06s 10000 5
AllocEmpty{dtype='float32'}(Shape_i{0}.0)
0.0% 99.9% 0.012s 1.21e-06s 10000 2
InplaceDimShuffle{1,0}(x)
0.0% 99.9% 0.010s 9.51e-07s 10000 16
Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b,
TensorConstant{0.10000000149011612}, Sum{acc_dtype=float64}.0)
0.0% 99.9% 0.009s 9.50e-07s 10000 3 Shape_i{0}(y)
0.0% 99.9% 0.009s 9.29e-07s 10000 6
InplaceDimShuffle{x}(Shape_i{0}.0)
0.0% 100.0% 0.008s 8.30e-07s 10000 8
Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.0% 100.0% 0.008s 7.72e-07s 10000 1 Shape_i{0}(x)
0.0% 100.0% 0.001s 1.43e-03s 1 3
CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w,
TensorConstant{0.0})
0.0% 100.0% 0.000s 1.15e-04s 1 4
Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0,
InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.000s 3.10e-06s 1 0
InplaceDimShuffle{x}(b)
... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide
a speedup.
- Try installing amdlibm and set the Theano flag lib.amdlibm=True. This
speeds up only some Elemwise operation.
karve@erie:~$ nano .theanorc
karve@erie:~$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True python
TimedLogistic.py
Using gpu device 0: GeForce GT 610 (CNMeM is disabled, cuDNN not available)
Initial model:
gpu
0 1.57595e+06
1000 1893.22
2000 1888.27
3000 1888.27
4000 1888.27
5000 1888.27
6000 1888.27
7000 1888.27
8000 1888.27
9000 1888.27
Looping 10000 times took 139.552690 seconds
Final model:
diff between act and pred
1219.0
Function profiling
==================
Message: TimedLogistic.py:59
Time in 10000 calls to Function.__call__: 1.386716e+02s
Time in Function.fn.__call__: 1.380846e+02s (99.577%)
Time in thunks: 1.351576e+02s (97.466%)
Total compile time: 4.957566e-01s
Number of Apply nodes: 25
Theano Optimizer time: 4.296634e-01s
Theano validate time: 7.941723e-03s
Theano Linker time (includes C, CUDA code generation/compiling):
4.485345e-02s
Import time 1.086354e-02s
Time in all call to theano.grad() 6.365490e-02s
Time since theano import 142.024s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  62.7%  62.7%  84.714s  2.12e-03s  C  40000  4  theano.sandbox.cuda.basic_ops.GpuFromHost
  34.9%  97.6%  47.197s  2.36e-03s  C  20000  2  theano.sandbox.cuda.blas.GpuGemv
   1.4%  99.0%   1.918s  2.74e-05s  C  70000  7  theano.sandbox.cuda.basic_ops.GpuElemwise
   0.6%  99.6%   0.777s  2.59e-05s  C  30000  3  theano.sandbox.cuda.basic_ops.HostFromGpu
   0.2%  99.8%   0.326s  3.26e-05s  C  10000  1  theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%  99.9%   0.109s  1.09e-05s  C  10000  1  theano.sandbox.cuda.basic_ops.GpuAllocEmpty
   0.0% 100.0%   0.055s  2.77e-06s  C  20000  2  theano.tensor.elemwise.Elemwise
   0.0% 100.0%   0.032s  1.58e-06s  C  20000  2  theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0% 100.0%   0.019s  9.47e-07s  C  20000  2  theano.compile.ops.Shape_i
   0.0% 100.0%   0.009s  9.42e-07s  C  10000  1  theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
62.7% 62.7% 84.714s 2.12e-03s C 40000 4
GpuFromHost
34.9% 97.6% 47.197s 2.36e-03s C 20000 2
GpuGemv{inplace}
0.6% 98.2% 0.777s 2.59e-05s C 30000 3
HostFromGpu
0.3% 98.5% 0.388s 3.88e-05s C 10000 1
GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))},no_inplace}
0.3% 98.7% 0.375s 3.75e-05s C 10000 1
GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5)
/ i3))}}[(0, 0)]
0.2% 99.0% 0.326s 3.26e-05s C 10000 1
GpuCAReduce{add}{1}
0.2% 99.2% 0.310s 3.10e-05s C 10000 1
GpuElemwise{sub,no_inplace}
0.2% 99.4% 0.266s 2.66e-05s C 10000 1
GpuElemwise{neg,no_inplace}
0.2% 99.6% 0.217s 2.17e-05s C 10000 1
GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.2% 99.7% 0.205s 2.05e-05s C 10000 1
GpuElemwise{ScalarSigmoid}[(0, 0)]
0.1% 99.8% 0.157s 1.57e-05s C 10000 1
GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.1% 99.9% 0.109s 1.09e-05s C 10000 1
GpuAllocEmpty
0.0% 99.9% 0.045s 4.49e-06s C 10000 1
Elemwise{gt,no_inplace}
0.0% 100.0% 0.019s 9.47e-07s C 20000 2
Shape_i{0}
0.0% 100.0% 0.016s 1.59e-06s C 10000 1
GpuDimShuffle{x}
0.0% 100.0% 0.016s 1.58e-06s C 10000 1
GpuDimShuffle{1,0}
0.0% 100.0% 0.011s 1.05e-06s C 10000 1
Elemwise{Cast{float32}}
0.0% 100.0% 0.009s 9.42e-07s C 10000 1
InplaceDimShuffle{x}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
61.1% 61.1% 82.519s 8.25e-03s 10000 1 GpuFromHost(x)
22.3% 83.3% 30.092s 3.01e-03s 10000 10
GpuGemv{inplace}(GpuAllocEmpty.0, TensorConstant{1.0}, GpuFromHost.0, w,
TensorConstant{0.0})
12.7% 96.0% 17.105s 1.71e-03s 10000 21
GpuGemv{inplace}(w, TensorConstant{-0.10000000149011612}, GpuDimShuffle{1,0}.0,
GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i
3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0, TensorConstant{0.9980000257492065})
1.2% 97.2% 1.678s 1.68e-04s 10000 3 GpuFromHost(y)
0.3% 97.5% 0.388s 3.88e-05s 10000 15
GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))},no_inplace}(GpuFromHost.0, GpuElemwise{Composite{((-
i0) - i1)}}[(0, 0)].0, CudaNdarrayConstant{[-1.]},
GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
0.3% 97.8% 0.375s 3.75e-05s 10000 18
GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5)
/ i3))}}[(0, 0)](GpuElemwise{Composite{((-i0) - i1)}}[(0, 0
)].0, CudaNdarrayConstant{[-1.]}, GpuFromHost.0, GpuFromHost.0,
GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0)
0.3% 98.0% 0.360s 3.60e-05s 10000 0 GpuFromHost(b)
0.2% 98.3% 0.331s 3.31e-05s 10000 19
HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
0.2% 98.5% 0.326s 3.26e-05s 10000 20
GpuCAReduce{add}{1}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) /
i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
0.2% 98.8% 0.310s 3.10e-05s 10000 8
GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, GpuFromHost.0)
0.2% 99.0% 0.301s 3.01e-05s 10000 17
HostFromGpu(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))},no_inplace}.0)
0.2% 99.2% 0.266s 2.66e-05s 10000 14
GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.2% 99.3% 0.217s 2.17e-05s 10000 12
GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)](GpuGemv{inplace}.0,
GpuDimShuffle{x}.0)
0.2% 99.5% 0.205s 2.05e-05s 10000 16
GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
0.1% 99.6% 0.157s 1.57e-05s 10000 23
GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](GpuFromHost.0,
CudaNdarrayConstant{0.10000000149011612}, GpuCAReduce{add}{1}.0)
0.1% 99.7% 0.157s 1.57e-05s 10000 13
GpuFromHost(Elemwise{Cast{float32}}.0)
0.1% 99.8% 0.145s 1.45e-05s 10000 24
HostFromGpu(GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)].0)
0.1% 99.9% 0.109s 1.09e-05s 10000 7
GpuAllocEmpty(Shape_i{0}.0)
0.0% 99.9% 0.045s 4.49e-06s 10000 22
Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.016s 1.59e-06s 10000 5
GpuDimShuffle{x}(GpuFromHost.0)
... (remaining 5 Apply instances account for 0.04%(0.05s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide
a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: TimedLogistic.py:60
Time in 1 calls to Function.__call__: 1.181531e-02s
Time in Function.fn.__call__: 1.176643e-02s (99.586%)
Time in thunks: 1.154184e-02s (97.685%)
Total compile time: 9.768987e-02s
Number of Apply nodes: 9
Theano Optimizer time: 7.542181e-02s
Theano validate time: 1.331329e-03s
Theano Linker time (includes C, CUDA code generation/compiling):
1.152706e-02s
Import time 8.077621e-04s
Time in all call to theano.grad() 6.365490e-02s
Time since theano import 142.030s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class
name>
71.9% 71.9% 0.008s 4.15e-03s C 2 2
theano.sandbox.cuda.basic_ops.GpuFromHost
25.9% 97.9% 0.003s 3.00e-03s C 1 1
theano.sandbox.cuda.blas.GpuGemv
1.5% 99.4% 0.000s 1.76e-04s C 1 1
theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.3% 99.7% 0.000s 3.29e-05s C 1 1
theano.sandbox.cuda.basic_ops.HostFromGpu
0.2% 99.9% 0.000s 2.69e-05s C 1 1
theano.sandbox.cuda.basic_ops.GpuElemwise
0.1% 100.0% 0.000s 5.96e-06s C 1 1
theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.000s 9.54e-07s C 1 1
theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.000s 0.00e+00s C 1 1
theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.9% 71.9% 0.008s 4.15e-03s C 2 2
GpuFromHost
25.9% 97.9% 0.003s 3.00e-03s C 1 1
GpuGemv{inplace}
1.5% 99.4% 0.000s 1.76e-04s C 1 1
GpuAllocEmpty
0.3% 99.7% 0.000s 3.29e-05s C 1 1
HostFromGpu
0.2% 99.9% 0.000s 2.69e-05s C 1 1
GpuElemwise{Composite{scalar_sigmoid((-((-i0) - i1)))}}[(0, 0)]
0.1% 100.0% 0.000s 5.96e-06s C 1 1
Elemwise{gt,no_inplace}
0.0% 100.0% 0.000s 9.54e-07s C 1 1
GpuDimShuffle{x}
0.0% 100.0% 0.000s 0.00e+00s C 1 1
Shape_i{0}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
71.7% 71.7% 0.008s 8.28e-03s 1 1 GpuFromHost(x)
25.9% 97.7% 0.003s 3.00e-03s 1 5
GpuGemv{inplace}(GpuAllocEmpty.0, TensorConstant{1.0}, GpuFromHost.0, w,
TensorConstant{0.0})
1.5% 99.2% 0.000s 1.76e-04s 1 4
GpuAllocEmpty(Shape_i{0}.0)
0.3% 99.5% 0.000s 3.29e-05s 1 7
HostFromGpu(GpuElemwise{Composite{scalar_sigmoid((-((-i0) - i1)))}}[(0, 0)].0)
0.2% 99.7% 0.000s 2.69e-05s 1 6
GpuElemwise{Composite{scalar_sigmoid((-((-i0) - i1)))}}[(0,
0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
0.2% 99.9% 0.000s 2.50e-05s 1 0 GpuFromHost(b)
0.1% 100.0% 0.000s 5.96e-06s 1 8
Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.000s 9.54e-07s 1 3
GpuDimShuffle{x}(GpuFromHost.0)
0.0% 100.0% 0.000s 0.00e+00s 1 2 Shape_i{0}(x)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide
a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
Time in 10001 calls to Function.__call__: 1.386834e+02s
Time in Function.fn.__call__: 1.380964e+02s (99.577%)
Time in thunks: 1.351691e+02s (97.466%)
Total compile time: 5.934465e-01s
Number of Apply nodes: 25
Theano Optimizer time: 5.050852e-01s
Theano validate time: 9.273052e-03s
Theano Linker time (includes C, CUDA code generation/compiling):
5.638051e-02s
Import time 1.167130e-02s
Time in all call to theano.grad() 6.365490e-02s
Time since theano import 142.032s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class
name>
62.7% 62.7% 84.723s 2.12e-03s C 40002 6
theano.sandbox.cuda.basic_ops.GpuFromHost
34.9% 97.6% 47.200s 2.36e-03s C 20001 3
theano.sandbox.cuda.blas.GpuGemv
1.4% 99.0% 1.918s 2.74e-05s C 70001 8
theano.sandbox.cuda.basic_ops.GpuElemwise
0.6% 99.6% 0.777s 2.59e-05s C 30001 4
theano.sandbox.cuda.basic_ops.HostFromGpu
0.2% 99.8% 0.326s 3.26e-05s C 10000 1
theano.sandbox.cuda.basic_ops.GpuCAReduce
0.1% 99.9% 0.109s 1.09e-05s C 10001 2
theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.0% 100.0% 0.055s 2.77e-06s C 20001 3
theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.032s 1.58e-06s C 20001 3
theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.019s 9.47e-07s C 20001 3
theano.compile.ops.Shape_i
0.0% 100.0% 0.009s 9.42e-07s C 10000 1
theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
62.7% 62.7% 84.723s 2.12e-03s C 40002 6
GpuFromHost
34.9% 97.6% 47.200s 2.36e-03s C 20001 3
GpuGemv{inplace}
0.6% 98.2% 0.777s 2.59e-05s C 30001 4
HostFromGpu
0.3% 98.5% 0.388s 3.88e-05s C 10000 1
GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))},no_inplace}
0.3% 98.7% 0.375s 3.75e-05s C 10000 1
GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5)
/ i3))}}[(0, 0)]
0.2% 99.0% 0.326s 3.26e-05s C 10000 1
GpuCAReduce{add}{1}
0.2% 99.2% 0.310s 3.10e-05s C 10000 1
GpuElemwise{sub,no_inplace}
0.2% 99.4% 0.266s 2.66e-05s C 10000 1
GpuElemwise{neg,no_inplace}
0.2% 99.6% 0.217s 2.17e-05s C 10000 1
GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]
0.2% 99.7% 0.205s 2.05e-05s C 10000 1
GpuElemwise{ScalarSigmoid}[(0, 0)]
0.1% 99.8% 0.157s 1.57e-05s C 10000 1
GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.1% 99.9% 0.109s 1.09e-05s C 10001 2
GpuAllocEmpty
0.0% 99.9% 0.045s 4.49e-06s C 10001 2
Elemwise{gt,no_inplace}
0.0% 100.0% 0.019s 9.47e-07s C 20001 3
Shape_i{0}
0.0% 100.0% 0.016s 1.59e-06s C 10001 2
GpuDimShuffle{x}
0.0% 100.0% 0.016s 1.58e-06s C 10000 1
GpuDimShuffle{1,0}
0.0% 100.0% 0.011s 1.05e-06s C 10000 1
Elemwise{Cast{float32}}
0.0% 100.0% 0.009s 9.42e-07s C 10000 1
InplaceDimShuffle{x}
0.0% 100.0% 0.000s 2.69e-05s C 1 1
GpuElemwise{Composite{scalar_sigmoid((-((-i0) - i1)))}}[(0, 0)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
61.0% 61.0% 82.519s 8.25e-03s 10000 1 GpuFromHost(x)
22.3% 83.3% 30.092s 3.01e-03s 10000 10
GpuGemv{inplace}(GpuAllocEmpty.0, TensorConstant{1.0}, GpuFromHost.0, w,
TensorConstant{0.0})
12.7% 96.0% 17.105s 1.71e-03s 10000 21
GpuGemv{inplace}(w, TensorConstant{-0.10000000149011612}, GpuDimShuffle{1,0}.0,
GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i
3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0, TensorConstant{0.9980000257492065})
1.2% 97.2% 1.678s 1.68e-04s 10000 3 GpuFromHost(y)
0.3% 97.5% 0.388s 3.88e-05s 10000 15
GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))},no_inplace}(GpuFromHost.0, GpuElemwise{Composite{((-
i0) - i1)}}[(0, 0)].0, CudaNdarrayConstant{[-1.]},
GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
0.3% 97.8% 0.375s 3.75e-05s 10000 18
GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5)
/ i3))}}[(0, 0)](GpuElemwise{Composite{((-i0) - i1)}}[(0, 0
)].0, CudaNdarrayConstant{[-1.]}, GpuFromHost.0, GpuFromHost.0,
GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0)
0.3% 98.0% 0.360s 3.60e-05s 10000 0 GpuFromHost(b)
0.2% 98.3% 0.331s 3.31e-05s 10000 19
HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
0.2% 98.5% 0.326s 3.26e-05s 10000 20
GpuCAReduce{add}{1}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) /
i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
0.2% 98.8% 0.310s 3.10e-05s 10000 8
GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, GpuFromHost.0)
0.2% 99.0% 0.301s 3.01e-05s 10000 17
HostFromGpu(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 *
scalar_softplus(i4)))},no_inplace}.0)
0.2% 99.2% 0.266s 2.66e-05s 10000 14
GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
0.2% 99.3% 0.217s 2.17e-05s 10000 12
GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)](GpuGemv{inplace}.0,
GpuDimShuffle{x}.0)
0.2% 99.5% 0.205s 2.05e-05s 10000 16
GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
0.1% 99.6% 0.157s 1.57e-05s 10000 23
GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](GpuFromHost.0,
CudaNdarrayConstant{0.10000000149011612}, GpuCAReduce{add}{1}.0)
0.1% 99.7% 0.157s 1.57e-05s 10000 13
GpuFromHost(Elemwise{Cast{float32}}.0)
0.1% 99.8% 0.145s 1.45e-05s 10000 24
HostFromGpu(GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)].0)
0.1% 99.9% 0.109s 1.09e-05s 10000 7
GpuAllocEmpty(Shape_i{0}.0)
0.0% 99.9% 0.045s 4.49e-06s 10000 22
Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
0.0% 100.0% 0.016s 1.59e-06s 10000 5
GpuDimShuffle{x}(GpuFromHost.0)
... (remaining 14 Apply instances account for 0.05%(0.07s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide
a speedup.
Sorry, no tip for today.
karve@erie:~$