Dear Patric, 

Thank you for your help and comments. Coincidentally, soon after posting I 
came across MKL, and I find it pretty criminal that it's not included by 
default in Anaconda! :) 
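
For anyone else who finds this thread, here is the minimal check I would use 
to confirm which BLAS is actually being picked up (just a sketch; the exact 
output depends on the NumPy/Theano install): 

    # Sketch: confirm which BLAS backend NumPy and Theano are linked against.
    # On an MKL-enabled environment the listed libraries should include "mkl".
    import numpy as np
    import theano

    np.show_config()                    # BLAS/LAPACK libraries NumPy was built with
    print(theano.config.blas.ldflags)   # link flags Theano passes to its BLAS ops
    # Theano also ships a GEMM benchmark script, theano/misc/check_blas.py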

The CPU version is now either much faster (when I reduce the internal 
matrices from 1000x1000 to 200x1000) or on par with my GPU version. So the 
CPU is better able to exploit my fundamental optimizations of the problem 
itself. I'm pretty curious how this would look on a server-class multi-core CPU. 
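
If I do get to try a bigger machine, the plan would be roughly the sketch 
below; MKL_NUM_THREADS/OMP_NUM_THREADS are the standard MKL/OpenMP controls, 
and the core count is of course hypothetical: 

    # Sketch: control CPU threading before importing theano.
    import os
    os.environ['MKL_NUM_THREADS'] = '16'   # hypothetical server core count
    os.environ['OMP_NUM_THREADS'] = '16'
    # openmp=True additionally lets Theano parallelise its own elemwise loops
    os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=cpu,openmp=True'
    import theano  # import after the flags are set so they take effect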

Regarding parallel branches: even aside from my specific problem, I see more 
and more papers coming out with multiple inputs, forks and merges within 
models. These structures would benefit greatly from parallel branch 
execution. Thinking more about it, such parallelism could be achieved 
manually by splitting the graph at nodes with many inputs. One would just 
create shared variables that link the sub-graphs to the trunk graph; then, 
because Theano dispatches GPU work asynchronously, the branches would 
effectively run in parallel. 
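
Here is a rough sketch of that idea (names, shapes and the branch 
computations are made up purely for illustration; the point is just that 
each branch becomes its own compiled function writing into a shared 
variable, and the trunk reads only those shared variables): 

    # Illustrative sketch: split a graph into branch functions linked to the
    # trunk only through shared variables. Names and shapes are hypothetical.
    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    rng = np.random.RandomState(0)
    W_a = theano.shared(rng.randn(1000, 1000).astype(theano.config.floatX))
    W_b = theano.shared(rng.randn(1000, 1000).astype(theano.config.floatX))

    # Shared variables that receive the branch outputs and feed the trunk graph.
    branch_a_out = theano.shared(np.zeros((200, 1000), dtype=theano.config.floatX))
    branch_b_out = theano.shared(np.zeros((200, 1000), dtype=theano.config.floatX))

    # Each branch is compiled on its own; its result lands in its shared variable.
    branch_a = theano.function([x], [], updates=[(branch_a_out, T.dot(x, W_a))])
    branch_b = theano.function([x], [],
                               updates=[(branch_b_out, T.maximum(T.dot(x, W_b), 0.))])

    # The trunk only ever sees the shared variables, so it is an independent sub-graph.
    trunk = theano.function([], T.sum(branch_a_out * branch_b_out))

    data = rng.randn(200, 1000).astype(theano.config.floatX)
    branch_a(data)   # on the GPU these calls can return before the kernels finish
    branch_b(data)
    result = trunk() # synchronises only when the trunk value is actually needed

Whether the two branch calls really overlap on a single GPU would of course 
depend on how Theano queues the kernels, so I would treat this only as a 
starting point. 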


New CPU profile:
Function profiling
==================
  Message: D:\PK scripts\sgd_solver\utils\GameForest.py:137
  Time in 1000 calls to Function.__call__: 9.868994e+00s
  Time in Function.fn.__call__: 9.794995e+00s (99.250%)
  Time in thunks: 9.372134e+00s (94.965%)
  Total compile time: 8.510001e-01s
    Number of Apply nodes: 276
    Theano Optimizer time: 6.790001e-01s
       Theano validate time: 1.619997e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 
1.150000e-01s
       Import time 1.000166e-03s
       Node make_thunk time 1.029999e-01s
           Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 
0)](raw_p:cc/cc/cc/cr0r0r0r0a, Join.0, InplaceDimShuffle{0,x}.0, 
TensorConstant{(1L, 1L) of 0.0}) time 1.999855e-03s
           Node Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0) 
time 1.000166e-03s
           Node Elemwise{Mul}[(0, 1)](Elemwise{Mul}[(0, 1)].0, 
InplaceDimShuffle{1,0}.0) time 1.000166e-03s
           Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 
0)](raw_p:cc/cc/cc/cr1r0a, Join.0, InplaceDimShuffle{0,x}.0, 
TensorConstant{(1L, 1L) of 0.0}) time 1.000166e-03s
           Node Gemm{inplace}(Dot22.0, TensorConstant{1.0}, 
convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0, TensorConstant{1.0}) 
time 1.000166e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 74.134s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
<Class name>
  65.8%    65.8%       6.164s       1.19e-04s     Py   52000      52   
theano.tensor.blas.Dot22
  25.3%    91.0%       2.369s       1.39e-04s     C    17000      17   
theano.tensor.blas.Gemm
   3.5%    94.5%       0.325s       3.91e-06s     C    83000      83   
theano.tensor.elemwise.Elemwise
   2.1%    96.6%       0.197s       1.09e-05s     C    18000      18   
theano.tensor.basic.Split
   1.9%    98.5%       0.174s       9.65e-06s     C    18000      18   
theano.tensor.basic.Join
   1.1%    99.6%       0.104s       5.77e-06s     C    18000      18   
theano.tensor.elemwise.Sum
   0.4%   100.0%       0.040s       5.71e-07s     C    70000      70   
theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op 
name>
  65.8%    65.8%       6.164s       1.19e-04s     Py    52000       52   
Dot22
  25.3%    91.0%       2.369s       1.39e-04s     C     17000       17   
Gemm{inplace}
   1.9%    92.9%       0.174s       9.65e-06s     C     18000       18   
Join
   1.5%    94.4%       0.139s       7.72e-06s     C     18000       18   
Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
   1.2%    95.6%       0.113s       1.25e-05s     C     9000        9   
Split{2}
   1.1%    96.7%       0.104s       5.77e-06s     C     18000       18   
Sum{axis=[1], acc_dtype=float64}
   0.8%    97.5%       0.077s       4.28e-06s     C     18000       18   
Elemwise{mul,no_inplace}
   0.8%    98.3%       0.074s       1.06e-05s     C     7000        7   
Split{4}
   0.4%    98.8%       0.042s       1.40e-06s     C     30000       30   
Elemwise{Mul}[(0, 1)]
   0.3%    99.1%       0.030s       4.99e-06s     C     6000        6   
Elemwise{Add}[(0, 2)]
   0.3%    99.3%       0.025s       4.80e-07s     C     52000       52   
InplaceDimShuffle{1,0}
   0.2%    99.5%       0.018s       2.25e-06s     C     8000        8   
Elemwise{Mul}[(0, 0)]
   0.2%    99.7%       0.015s       8.33e-07s     C     18000       18   
InplaceDimShuffle{0,x}
   0.1%    99.8%       0.010s       4.99e-06s     C     2000        2   
Split{3}
   0.1%    99.9%       0.010s       4.99e-06s     C     2000        2   
Elemwise{Add}[(0, 1)]
   0.1%   100.0%       0.009s       8.99e-06s     C     1000        1   
Elemwise{Add}[(0, 0)]
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   2.5%     2.5%       0.237s       2.37e-04s   1000    84   
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
   2.1%     4.6%       0.194s       1.94e-04s   1000    83   
Dot22(ranges_r=3, InplaceDimShuffle{1,0}.0)
   1.9%     6.5%       0.182s       1.82e-04s   1000   150   
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
   1.7%     8.2%       0.157s       1.57e-04s   1000    71   
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
   1.7%     9.9%       0.156s       1.56e-04s   1000    94   
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
   1.7%    11.6%       0.156s       1.56e-04s   1000   153   
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
   1.6%    13.2%       0.154s       1.54e-04s   1000   119   
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
   1.6%    14.8%       0.152s       1.52e-04s   1000   126   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
   1.6%    16.4%       0.150s       1.50e-04s   1000   134   
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
   1.6%    18.0%       0.150s       1.50e-04s   1000   165   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
   1.6%    19.6%       0.149s       1.49e-04s   1000   164   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
   1.6%    21.2%       0.147s       1.47e-04s   1000   184   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
   1.6%    22.7%       0.146s       1.46e-04s   1000   160   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
   1.5%    24.3%       0.145s       1.45e-04s   1000   113   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
   1.5%    25.8%       0.145s       1.45e-04s   1000    85   
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
   1.5%    27.4%       0.142s       1.42e-04s   1000   172   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
   1.5%    28.9%       0.142s       1.42e-04s   1000   188   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
   1.5%    30.4%       0.141s       1.41e-04s   1000   183   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
   1.5%    31.8%       0.137s       1.37e-04s   1000   193   
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
   1.5%    33.3%       0.137s       1.37e-04s   1000    72   
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
   ... (remaining 256 Apply instances account for 66.70%(6.25s) of the 
runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing 
list).
                 Test them first, as they are not guaranteed to always 
provide a speedup.
  Sorry, no tip for today.




On Friday, April 21, 2017 at 8:14:59 AM UTC+3, Patric wrote:
>
> Many thanks for the information. 
>
> From the profiling log, the CPU is quite good, since there are lots of data 
> operations such as split and join, which are almost 100X faster on the CPU.
>
> The topology of your model includes a huge number of small GEMM and Elemwise 
> ops, so I think the big cache will be helpful on the CPU side. And, as the 
> title says, parallel branches would be a very good idea for independent compute flows.
>
> Have you used Intel MKL as the backend for GEMM? It should show better 
> performance.
>
> By the way, I can't open the .p file; any suggestions?
>
>
>
> On Thursday, April 20, 2017 at 5:43:19 PM UTC+8, Sharapolas wrote:
>>
>> Guys, thanks for your feedback. 
>>
>> For the past week I have been trying to optimize my solver as much as 
>> possible, and I optimized it so much that the CPU is now twice as fast as 
>> the GPU :D I am extremely puzzled by this result and hope you could shed 
>> some light on it. 
>>
>> Wider story:
>>      In my initial version, I arranged the tensors such that I did not 
>> need to do any slicing. Then I noticed that GPU load is directly proportional 
>> to the size of the tensors being used, so I decided to use smaller 
>> tensors, lump them together, and then slice in the few cases where I need 
>> it. As a result, the GPU code turned out to be more than 4 times slower, but 
>> the CPU code almost rivals my first GPU version. I tried different kinds of 
>> indexing (e.g. A[:,i], T.take(A, i, 1), T.split), but all resulted in 
>> similar timings. 
>>
>> Do you have any suggestions on how I could speed up my GPU code? Otherwise, I 
>> might as well just run on a multi-core CPU and probably end up even faster 
>> than the GPU :/ 
>>
>>
>> GPU version. Flags:
>>     os.environ['THEANO_FLAGS'] = 
>> ',mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True'
>>     os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
>> Pickled version:
>>     https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM
>> Graph:
>>     https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU
>> Profile:
>> Function profiling
>> ==================
>>   Time in 1000 calls to Function.__call__: 2.170000e+01s
>>   Time in Function.fn.__call__: 2.166000e+01s (99.816%)
>>   Time in thunks: 2.150321e+01s (99.093%)
>>   Total compile time: 1.809000e+00s
>>     Number of Apply nodes: 276
>>     Theano Optimizer time: 1.099000e+00s
>>        Theano validate time: 2.069981e-01s
>>     Theano Linker time (includes C, CUDA code generation/compiling): 
>> 2.370000e-01s
>>        Import time 3.000021e-03s
>>        Node make_thunk time 2.260001e-01s
>>            Node GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 
>> 0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0, 
>> CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s
>>            Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0}, 
>> TensorConstant{(2L,) of 1}) time 2.000093e-03s
>>            Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0}, 
>> TensorConstant{(2L,) of 1}) time 2.000093e-03s
>>            Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, 
>> convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0}) 
>> time 2.000093e-03s
>>            Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, 
>> TensorConstant{(4L,) of 1}) time 2.000093e-03s
>>
>> Time in all call to theano.grad() 0.000000e+00s
>> Time since theano import 101.753s
>> Class
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>> <Class name>
>>   38.0%    38.0%       8.176s       1.57e-04s     C    52000      52   
>> theano.sandbox.cuda.blas.GpuDot22
>>   16.9%    54.9%       3.627s       4.37e-05s     C    83000      83   
>> theano.sandbox.cuda.basic_ops.GpuElemwise
>>   14.7%    69.6%       3.169s       1.76e-04s     Py   18000      18   
>> theano.sandbox.cuda.basic_ops.GpuSplit
>>   13.8%    83.4%       2.970s       1.65e-04s     C    18000      18   
>> theano.sandbox.cuda.basic_ops.GpuJoin
>>   12.4%    95.9%       2.674s       1.57e-04s     C    17000      17   
>> theano.sandbox.cuda.blas.GpuGemm
>>    3.5%    99.4%       0.751s       4.17e-05s     C    18000      18   
>> theano.sandbox.cuda.basic_ops.GpuCAReduce
>>    0.6%   100.0%       0.137s       1.96e-06s     C    70000      70   
>> theano.sandbox.cuda.basic_ops.GpuDimShuffle
>>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>>
>> Ops
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op 
>> name>
>>   38.0%    38.0%       8.176s       1.57e-04s     C     52000       52   
>> GpuDot22
>>   13.8%    51.8%       2.970s       1.65e-04s     C     18000       18   
>> GpuJoin
>>   12.4%    64.3%       2.674s       1.57e-04s     C     17000       17   
>> GpuGemm{inplace}
>>    7.7%    71.9%       1.649s       2.36e-04s     Py    7000        7   
>> GpuSplit{4}
>>    6.1%    78.1%       1.317s       4.39e-05s     C     30000       30   
>> GpuElemwise{Mul}[(0, 1)]
>>    5.4%    83.5%       1.167s       1.30e-04s     Py    9000        9   
>> GpuSplit{2}
>>    3.6%    87.0%       0.766s       4.26e-05s     C     18000       18   
>> GpuElemwise{mul,no_inplace}
>>    3.5%    90.6%       0.763s       4.24e-05s     C     18000       18   
>> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>>    3.5%    94.1%       0.751s       4.17e-05s     C     18000       18   
>> GpuCAReduce{add}{0,1}
>>    1.9%    95.9%       0.399s       4.99e-05s     C     8000        8   
>> GpuElemwise{Mul}[(0, 0)]
>>    1.6%    97.6%       0.353s       1.76e-04s     Py    2000        2   
>> GpuSplit{3}
>>    1.1%    98.7%       0.247s       4.12e-05s     C     6000        6   
>> GpuElemwise{Add}[(0, 2)]
>>    0.6%    99.4%       0.133s       2.56e-06s     C     52000       52   
>> GpuDimShuffle{1,0}
>>    0.4%    99.8%       0.094s       4.70e-05s     C     2000        2   
>> GpuElemwise{Add}[(0, 1)]
>>    0.2%   100.0%       0.041s       4.10e-05s     C     1000        1   
>> GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)]
>>    0.0%   100.0%       0.004s       2.22e-07s     C     18000       18   
>> GpuDimShuffle{0,x}
>>    ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
>>
>> Apply
>> ------
>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>    1.2%     1.2%       0.259s       2.59e-04s   1000    14   
>> GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,) 
>> of 1})
>>    1.1%     2.3%       0.246s       2.46e-04s   1000     9   
>> GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1})
>>    1.1%     3.5%       0.245s       2.45e-04s   1000   236   
>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>> GpuElemwise{Add}[(0, 1)].0)
>>    1.1%     4.6%       0.239s       2.39e-04s   1000   239   
>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>> GpuElemwise{Add}[(0, 2)].0)
>>    1.1%     5.7%       0.233s       2.33e-04s   1000     8   
>> GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,) 
>> of 1})
>>    1.1%     6.8%       0.232s       2.32e-04s   1000     5   
>> GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of 
>> 1})
>>    1.1%     7.8%       0.228s       2.28e-04s   1000     0   
>> GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of 
>> 1})
>>    1.1%     8.9%       0.227s       2.27e-04s   1000     2   
>> GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,) 
>> of 1})
>>    1.0%     9.9%       0.225s       2.25e-04s   1000   238   
>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>> GpuElemwise{Add}[(0, 2)].0)
>>    1.0%    11.0%       0.224s       2.24e-04s   1000     4   
>> GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,) 
>> of 1})
>>    1.0%    12.0%       0.223s       2.23e-04s   1000   260   
>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>> GpuElemwise{Add}[(0, 2)].0)
>>    1.0%    13.0%       0.221s       2.21e-04s   1000   271   
>> GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) + 
>> i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0, 
>> GpuElemwise{Add}[(0, 2)].0)
>>    1.0%    14.0%       0.218s       2.18e-04s   1000   261   
>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>> GpuElemwise{Add}[(0, 2)].0)
>>    0.9%    15.0%       0.203s       2.03e-04s   1000   237   
>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>> GpuElemwise{Add}[(0, 1)].0)
>>    0.9%    15.8%       0.184s       1.84e-04s   1000   146   
>> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
>>    0.8%    16.7%       0.181s       1.81e-04s   1000    84   
>> GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0)
>>    0.8%    17.5%       0.179s       1.79e-04s   1000   134   
>> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
>>    0.8%    18.4%       0.179s       1.79e-04s   1000    16   
>> GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0}, 
>> TensorConstant{(3L,) of 1})
>>    0.8%    19.2%       0.175s       1.75e-04s   1000    83   
>> GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0)
>>    0.8%    20.0%       0.174s       1.74e-04s   1000    11   
>> GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0}, 
>> TensorConstant{(3L,) of 1})
>>    ... (remaining 256 Apply instances account for 80.03%(17.21s) of the 
>> runtime)
>>
>>
>> Some info useful for gpu:
>>
>>     Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and 
>> 0.000s(0.00%) transfert Op
>>
>>     Theano function input that are float64
>>     <fct name> <input name> <input type> <str input>
>>
>>     List of apply that don't have float64 as input but have float64 in 
>> outputs
>>     (Useful to know if we forgot some cast when using floatX=float32 or 
>> gpu code)
>>     <Apply> <Apply position> <fct name> <inputs type> <outputs type>
>>
>> Here are tips to potentially make your code run faster
>>                  (if you think of new ones, suggest them on the mailing 
>> list).
>>                  Test them first, as they are not guaranteed to always 
>> provide a speedup.
>>   Sorry, no tip for today.
>>
>> The CPU version. Flags:
>>     os.environ['THEANO_FLAGS'] = 
>> ',mode=FAST_RUN,floatX=float32,device=cpu,profile=True'
>> Graph: 
>>     https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk
>> Pickled function:
>>     https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU
>> Profile:
>> Function profiling
>> ==================
>>   Time in 1000 calls to Function.__call__: 5.470006e+00s
>>   Time in Function.fn.__call__: 5.422005e+00s (99.122%)
>>   Time in thunks: 5.277404e+00s (96.479%)
>>   Total compile time: 9.329998e-01s
>>     Number of Apply nodes: 285
>>     Theano Optimizer time: 7.650001e-01s
>>        Theano validate time: 1.880007e-01s
>>     Theano Linker time (includes C, CUDA code generation/compiling): 
>> 1.140001e-01s
>>        Import time 0.000000e+00s
>>        Node make_thunk time 1.020000e-01s
>>            Node InplaceDimShuffle{x,0}(Sum{axis=[0], 
>> acc_dtype=float64}.0) time 1.000166e-03s
>>            Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, 
>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>            Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, 
>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>            Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, 
>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>            Node Gemm{inplace}(Dot22.0, TensorConstant{1.0}, 
>> Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time 
>> 1.000166e-03s
>>
>> Time in all call to theano.grad() 0.000000e+00s
>> Time since theano import 62.174s
>> Class
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>> <Class name>
>>   74.3%    74.3%       3.921s       7.54e-05s     Py   52000      52   
>> theano.tensor.blas.Dot22
>>   18.9%    93.2%       0.996s       5.86e-05s     C    17000      17   
>> theano.tensor.blas.Gemm
>>    2.8%    95.9%       0.146s       1.59e-06s     C    92000      92   
>> theano.tensor.elemwise.Elemwise
>>    1.6%    97.6%       0.085s       4.72e-06s     C    18000      18   
>> theano.tensor.elemwise.Sum
>>    1.1%    98.7%       0.058s       3.22e-06s     C    18000      18   
>> theano.tensor.basic.Join
>>    1.0%    99.7%       0.053s       2.94e-06s     C    18000      18   
>> theano.tensor.basic.Split
>>    0.3%   100.0%       0.018s       2.57e-07s     C    70000      70   
>> theano.tensor.elemwise.DimShuffle
>>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>>
>> Ops
>> ---
>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op 
>> name>
>>   74.3%    74.3%       3.921s       7.54e-05s     Py    52000       52   
>> Dot22
>>   18.9%    93.2%       0.996s       5.86e-05s     C     17000       17   
>> Gemm{inplace}
>>    1.6%    94.8%       0.085s       4.72e-06s     C     18000       18   
>> Sum{axis=[0], acc_dtype=float64}
>>    1.4%    96.2%       0.076s       4.22e-06s     C     18000       18   
>> Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>>    1.1%    97.3%       0.058s       3.22e-06s     C     18000       18   
>> Join
>>    0.7%    98.0%       0.038s       2.11e-06s     C     18000       18   
>> Elemwise{mul,no_inplace}
>>    0.5%    98.5%       0.025s       3.56e-06s     C     7000        7   
>> Split{4}
>>    0.4%    98.9%       0.021s       2.34e-06s     C     9000        9   
>> Split{2}
>>    0.2%    99.2%       0.013s       2.50e-07s     C     52000       52   
>> InplaceDimShuffle{1,0}
>>    0.2%    99.4%       0.012s       3.08e-07s     C     39000       39   
>> Elemwise{Mul}[(0, 1)]
>>    0.2%    99.6%       0.011s       1.83e-06s     C     6000        6   
>> Elemwise{Add}[(0, 2)]
>>    0.1%    99.7%       0.007s       3.51e-06s     C     2000        2   
>> Split{3}
>>    0.1%    99.8%       0.005s       5.56e-07s     C     9000        9   
>> Elemwise{Mul}[(0, 0)]
>>    0.1%    99.9%       0.005s       2.77e-07s     C     18000       18   
>> InplaceDimShuffle{x,0}
>>    0.1%   100.0%       0.004s       2.00e-06s     C     2000        2   
>> Elemwise{Add}[(0, 1)]
>>    ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
>>
>> Apply
>> ------
>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>    2.0%     2.0%       0.106s       1.06e-04s   1000   110   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    2.0%     4.0%       0.104s       1.04e-04s   1000   107   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.8%     5.7%       0.093s       9.30e-05s   1000   188   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.8%     7.5%       0.093s       9.30e-05s   1000    78   
>> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
>>    1.8%     9.3%       0.093s       9.29e-05s   1000   146   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    11.0%       0.092s       9.20e-05s   1000   135   
>> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
>>    1.7%    12.8%       0.092s       9.20e-05s   1000   105   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    14.5%       0.092s       9.19e-05s   1000   164   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    16.2%       0.090s       9.03e-05s   1000   177   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    17.9%       0.090s       8.99e-05s   1000   178   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    19.6%       0.089s       8.90e-05s   1000   159   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    21.3%       0.089s       8.90e-05s   1000   168   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    23.0%       0.089s       8.90e-05s   1000   157   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.7%    24.6%       0.088s       8.80e-05s   1000    73   
>> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
>>    1.6%    26.3%       0.087s       8.71e-05s   1000   121   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.6%    27.9%       0.087s       8.70e-05s   1000   193   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.6%    29.6%       0.086s       8.60e-05s   1000   170   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.6%    31.2%       0.085s       8.50e-05s   1000   166   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.6%    32.8%       0.084s       8.40e-05s   1000   155   
>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>    1.6%    34.3%       0.083s       8.30e-05s   1000   140   
>> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
>>    ... (remaining 265 Apply instances account for 65.66%(3.46s) of the 
>> runtime)
>>
>> Here are tips to potentially make your code run faster
>>                  (if you think of new ones, suggest them on the mailing 
>> list).
>>                  Test them first, as they are not guaranteed to always 
>> provide a speedup.
>>   Sorry, no tip for today.
>>
>> On Thursday, April 20, 2017 at 4:07:45 AM UTC+3, Patric wrote:
>>>
>>> Could you share your model with us? We'd like to take a look :)
>>>
>>> On Tuesday, April 18, 2017 at 5:24:30 PM UTC+8, Sharapolas wrote:
>>>>
>>>> I have a computation tree and am implementing leaf node evaluations. In 
>>>> a Theano graph, do parallel branches get evaluated in parallel on the GPU?
>>>>
>>>
