Guys, thanks for your feedback.

For the past week I have been trying to optimize my solver as much as 
possible, and I optimized it so much that the CPU is now twice as fast as 
the GPU :D I am extremely puzzled by this result and hope you can shed 
some light on it. 

Wider story:
    In my initial version, I arranged the tensors so that I did not need 
to do any slicing. Then I noticed that GPU load is directly proportional 
to the size of the tensors being used, so I decided to use smaller 
tensors, lump them together, and slice them apart in the few cases where 
I need to. As a result, the GPU code turned out to be more than 4 times 
slower, while the CPU code almost rivals my first GPU version. I tried 
different indexing variants (e.g. A[:,i], T.take(A, i, 1), T.split; a 
minimal sketch is below), but all resulted in similar timings. 
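
For reference, here is a minimal, self-contained sketch of the indexing 
variants I timed (shapes and names are made up for illustration):

    import numpy as np
    import theano
    import theano.tensor as T

    A = T.fmatrix('A')   # the "lumped" tensor; columns are the sub-tensors
    i = 1                # which column to pull out

    col_subtensor = A[:, i]                            # basic subtensor
    col_take      = T.take(A, i, axis=1)               # take along axis 1
    col_split     = T.split(A, [1] * 4, 4, axis=1)[i]  # split into unit columns

    f = theano.function([A], [col_subtensor, col_take, col_split])
    print(f(np.ones((3, 4), dtype=np.float32)))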

Do you have any suggestions for speeding up my GPU code? Otherwise, I 
might as well just run on a multicore CPU and probably end up even faster 
than the GPU :/ 
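
In case I do go the CPU route, this is roughly the multicore setup I would 
try (just a sketch: it assumes a BLAS that honours OMP_NUM_THREADS, such as 
OpenBLAS or MKL, and openmp=True only parallelizes Theano's elemwise C code):

    import os
    # Both must be set before theano is imported to take effect.
    os.environ['OMP_NUM_THREADS'] = '4'  # threads for the BLAS library
    os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=cpu,openmp=True'
    import theano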


GPU version. Flags:
    os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True'
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
Pickled version:
    https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM
Graph:
    https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU
Profile:
Function profiling
==================
  Time in 1000 calls to Function.__call__: 2.170000e+01s
  Time in Function.fn.__call__: 2.166000e+01s (99.816%)
  Time in thunks: 2.150321e+01s (99.093%)
  Total compile time: 1.809000e+00s
    Number of Apply nodes: 276
    Theano Optimizer time: 1.099000e+00s
       Theano validate time: 2.069981e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.370000e-01s
       Import time 3.000021e-03s
       Node make_thunk time 2.260001e-01s
           Node GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0, CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s
           Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0}, TensorConstant{(2L,) of 1}) time 2.000093e-03s
           Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0}, TensorConstant{(2L,) of 1}) time 2.000093e-03s
           Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0}) time 2.000093e-03s
           Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,) of 1}) time 2.000093e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 101.753s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  38.0%    38.0%       8.176s       1.57e-04s     C    52000      52   theano.sandbox.cuda.blas.GpuDot22
  16.9%    54.9%       3.627s       4.37e-05s     C    83000      83   theano.sandbox.cuda.basic_ops.GpuElemwise
  14.7%    69.6%       3.169s       1.76e-04s     Py   18000      18   theano.sandbox.cuda.basic_ops.GpuSplit
  13.8%    83.4%       2.970s       1.65e-04s     C    18000      18   theano.sandbox.cuda.basic_ops.GpuJoin
  12.4%    95.9%       2.674s       1.57e-04s     C    17000      17   theano.sandbox.cuda.blas.GpuGemm
   3.5%    99.4%       0.751s       4.17e-05s     C    18000      18   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.6%   100.0%       0.137s       1.96e-06s     C    70000      70   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  38.0%    38.0%       8.176s       1.57e-04s     C     52000       52   GpuDot22
  13.8%    51.8%       2.970s       1.65e-04s     C     18000       18   GpuJoin
  12.4%    64.3%       2.674s       1.57e-04s     C     17000       17   GpuGemm{inplace}
   7.7%    71.9%       1.649s       2.36e-04s     Py    7000        7   GpuSplit{4}
   6.1%    78.1%       1.317s       4.39e-05s     C     30000       30   GpuElemwise{Mul}[(0, 1)]
   5.4%    83.5%       1.167s       1.30e-04s     Py    9000        9   GpuSplit{2}
   3.6%    87.0%       0.766s       4.26e-05s     C     18000       18   GpuElemwise{mul,no_inplace}
   3.5%    90.6%       0.763s       4.24e-05s     C     18000       18   GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
   3.5%    94.1%       0.751s       4.17e-05s     C     18000       18   GpuCAReduce{add}{0,1}
   1.9%    95.9%       0.399s       4.99e-05s     C     8000        8   GpuElemwise{Mul}[(0, 0)]
   1.6%    97.6%       0.353s       1.76e-04s     Py    2000        2   GpuSplit{3}
   1.1%    98.7%       0.247s       4.12e-05s     C     6000        6   GpuElemwise{Add}[(0, 2)]
   0.6%    99.4%       0.133s       2.56e-06s     C     52000       52   GpuDimShuffle{1,0}
   0.4%    99.8%       0.094s       4.70e-05s     C     2000        2   GpuElemwise{Add}[(0, 1)]
   0.2%   100.0%       0.041s       4.10e-05s     C     1000        1   GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)]
   0.0%   100.0%       0.004s       2.22e-07s     C     18000       18   GpuDimShuffle{0,x}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   1.2%     1.2%       0.259s       2.59e-04s   1000    14   GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,) of 1})
   1.1%     2.3%       0.246s       2.46e-04s   1000     9   GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1})
   1.1%     3.5%       0.245s       2.45e-04s   1000   236   GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 1)].0)
   1.1%     4.6%       0.239s       2.39e-04s   1000   239   GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0)
   1.1%     5.7%       0.233s       2.33e-04s   1000     8   GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,) of 1})
   1.1%     6.8%       0.232s       2.32e-04s   1000     5   GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of 1})
   1.1%     7.8%       0.228s       2.28e-04s   1000     0   GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of 1})
   1.1%     8.9%       0.227s       2.27e-04s   1000     2   GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,) of 1})
   1.0%     9.9%       0.225s       2.25e-04s   1000   238   GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0)
   1.0%    11.0%       0.224s       2.24e-04s   1000     4   GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,) of 1})
   1.0%    12.0%       0.223s       2.23e-04s   1000   260   GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0)
   1.0%    13.0%       0.221s       2.21e-04s   1000   271   GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0, GpuElemwise{Add}[(0, 2)].0)
   1.0%    14.0%       0.218s       2.18e-04s   1000   261   GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0)
   0.9%    15.0%       0.203s       2.03e-04s   1000   237   GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 1)].0)
   0.9%    15.8%       0.184s       1.84e-04s   1000   146   GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
   0.8%    16.7%       0.181s       1.81e-04s   1000    84   GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0)
   0.8%    17.5%       0.179s       1.79e-04s   1000   134   GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
   0.8%    18.4%       0.179s       1.79e-04s   1000    16   GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0}, TensorConstant{(3L,) of 1})
   0.8%    19.2%       0.175s       1.75e-04s   1000    83   GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0)
   0.8%    20.0%       0.174s       1.74e-04s   1000    11   GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0}, TensorConstant{(3L,) of 1})
   ... (remaining 256 Apply instances account for 80.03%(17.21s) of the runtime)


Some info useful for gpu:

    Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and 0.000s(0.00%) transfert Op

    Theano function input that are float64
    <fct name> <input name> <input type> <str input>

    List of apply that don't have float64 as input but have float64 in outputs
    (Useful to know if we forgot some cast when using floatX=float32 or gpu code)
    <Apply> <Apply position> <fct name> <inputs type> <outputs type>

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.

The CPU version. Flags:
    os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=cpu,profile=True'
Graph: 
    https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk
Pickled function:
    https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU
Profile:
Function profiling
==================
  Time in 1000 calls to Function.__call__: 5.470006e+00s
  Time in Function.fn.__call__: 5.422005e+00s (99.122%)
  Time in thunks: 5.277404e+00s (96.479%)
  Total compile time: 9.329998e-01s
    Number of Apply nodes: 285
    Theano Optimizer time: 7.650001e-01s
       Theano validate time: 1.880007e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.140001e-01s
       Import time 0.000000e+00s
       Node make_thunk time 1.020000e-01s
           Node InplaceDimShuffle{x,0}(Sum{axis=[0], acc_dtype=float64}.0) time 1.000166e-03s
           Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, InplaceDimShuffle{1,0}.0) time 1.000166e-03s
           Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, InplaceDimShuffle{1,0}.0) time 1.000166e-03s
           Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, InplaceDimShuffle{1,0}.0) time 1.000166e-03s
           Node Gemm{inplace}(Dot22.0, TensorConstant{1.0}, Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time 1.000166e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 62.174s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  74.3%    74.3%       3.921s       7.54e-05s     Py   52000      52   theano.tensor.blas.Dot22
  18.9%    93.2%       0.996s       5.86e-05s     C    17000      17   theano.tensor.blas.Gemm
   2.8%    95.9%       0.146s       1.59e-06s     C    92000      92   theano.tensor.elemwise.Elemwise
   1.6%    97.6%       0.085s       4.72e-06s     C    18000      18   theano.tensor.elemwise.Sum
   1.1%    98.7%       0.058s       3.22e-06s     C    18000      18   theano.tensor.basic.Join
   1.0%    99.7%       0.053s       2.94e-06s     C    18000      18   theano.tensor.basic.Split
   0.3%   100.0%       0.018s       2.57e-07s     C    70000      70   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  74.3%    74.3%       3.921s       7.54e-05s     Py    52000       52   Dot22
  18.9%    93.2%       0.996s       5.86e-05s     C     17000       17   Gemm{inplace}
   1.6%    94.8%       0.085s       4.72e-06s     C     18000       18   Sum{axis=[0], acc_dtype=float64}
   1.4%    96.2%       0.076s       4.22e-06s     C     18000       18   Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
   1.1%    97.3%       0.058s       3.22e-06s     C     18000       18   Join
   0.7%    98.0%       0.038s       2.11e-06s     C     18000       18   Elemwise{mul,no_inplace}
   0.5%    98.5%       0.025s       3.56e-06s     C     7000        7   Split{4}
   0.4%    98.9%       0.021s       2.34e-06s     C     9000        9   Split{2}
   0.2%    99.2%       0.013s       2.50e-07s     C     52000       52   InplaceDimShuffle{1,0}
   0.2%    99.4%       0.012s       3.08e-07s     C     39000       39   Elemwise{Mul}[(0, 1)]
   0.2%    99.6%       0.011s       1.83e-06s     C     6000        6   Elemwise{Add}[(0, 2)]
   0.1%    99.7%       0.007s       3.51e-06s     C     2000        2   Split{3}
   0.1%    99.8%       0.005s       5.56e-07s     C     9000        9   Elemwise{Mul}[(0, 0)]
   0.1%    99.9%       0.005s       2.77e-07s     C     18000       18   InplaceDimShuffle{x,0}
   0.1%   100.0%       0.004s       2.00e-06s     C     2000        2   Elemwise{Add}[(0, 1)]
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   2.0%     2.0%       0.106s       1.06e-04s   1000   110   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   2.0%     4.0%       0.104s       1.04e-04s   1000   107   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.8%     5.7%       0.093s       9.30e-05s   1000   188   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.8%     7.5%       0.093s       9.30e-05s   1000    78   Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
   1.8%     9.3%       0.093s       9.29e-05s   1000   146   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    11.0%       0.092s       9.20e-05s   1000   135   Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
   1.7%    12.8%       0.092s       9.20e-05s   1000   105   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    14.5%       0.092s       9.19e-05s   1000   164   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    16.2%       0.090s       9.03e-05s   1000   177   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    17.9%       0.090s       8.99e-05s   1000   178   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    19.6%       0.089s       8.90e-05s   1000   159   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    21.3%       0.089s       8.90e-05s   1000   168   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    23.0%       0.089s       8.90e-05s   1000   157   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.7%    24.6%       0.088s       8.80e-05s   1000    73   Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
   1.6%    26.3%       0.087s       8.71e-05s   1000   121   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.6%    27.9%       0.087s       8.70e-05s   1000   193   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.6%    29.6%       0.086s       8.60e-05s   1000   170   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.6%    31.2%       0.085s       8.50e-05s   1000   166   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.6%    32.8%       0.084s       8.40e-05s   1000   155   Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
   1.6%    34.3%       0.083s       8.30e-05s   1000   140   Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
   ... (remaining 265 Apply instances account for 65.66%(3.46s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.

On Thursday, April 20, 2017 at 4:07:45 AM UTC+3, Patric wrote:
>
> Could you share your model with us? We'd like to take a look :)
>
> On Tuesday, April 18, 2017 at 5:24:30 PM UTC+8, Sharapolas wrote:
>>
>> I have a computation tree and am implementing leaf node evaluations. In 
>> a Theano graph, do parallel branches get evaluated in parallel on the GPU?
>>
>
