I have investigated the Theano runtime and I think I could achieve what I 
want with a custom linker. Before I do that, I would like to get your 
feedback. 

As far as I understand, the Theano graph is traversed at runtime by linkers: 
the nodes are sorted into the order in which they should be executed, and 
then their thunks are executed one by one. A simple linker that I've found is:

class Loop(VM):
    """
    Unconditional start-to-finish program execution in Python.
    No garbage collection is allowed on intermediate results.
    """
    # Some other parts of Theano query this information
    allow_gc = False

    def __call__(self):
        if self.time_thunks:
            for cont in self.pre_call_clear:
                cont[0] = None
            try:
                for i, (thunk, node) in enumerate(zip(self.thunks,
                                                      self.nodes)):
                    t0 = time.time()
                    thunk()
                    t1 = time.time()
                    self.call_counts[i] += 1
                    self.call_times[i] += t1 - t0
            except:
                link.raise_with_op(node, thunk)
        else:
            for cont in self.pre_call_clear:
                cont[0] = None
            try:
                for thunk, node in zip(self.thunks, self.nodes):
                    thunk()
            except:
                link.raise_with_op(node, thunk)

Here the thunks are processed sequentially. Now suppose all my thunks are 
independent (say, many updates of many independent variables); then I could 
run all of them in parallel. In the case where some of the thunks are 
dependent, I could still run them in parallel as long as I make sure that, by 
the time a thunk is run, its inputs are ready. I imagine the latter could be 
done, but before doing anything I would like to ask whether I understand the 
situation correctly. 
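
To make the idea concrete, here is a rough sketch of the scheduling I have in 
mind (my own illustration, not anything that exists in Theano): it takes the 
same self.thunks / self.nodes pairs that the Loop VM iterates over, derives 
the dependencies from node.inputs / node.outputs, and submits a thunk to a 
thread pool only once all thunks producing its inputs have finished. The name 
run_thunks_parallel and the use of concurrent.futures are just choices for 
illustration, and the sketch assumes the thunks release the GIL while they 
run (C-implemented ops and BLAS calls should; pure-Python thunks would not), 
otherwise the threads would not actually overlap.

import concurrent.futures as cf
from collections import defaultdict


def run_thunks_parallel(thunks, nodes, max_workers=4):
    """Run each thunk once all thunks producing its inputs have finished."""
    # Map every variable to the index of the node that produces it.
    producer = {}
    for i, node in enumerate(nodes):
        for out in node.outputs:
            producer[out] = i

    # deps[j] = indices of nodes that must run before node j;
    # consumers[i] = indices of nodes waiting on node i.
    deps = defaultdict(set)
    consumers = defaultdict(set)
    for j, node in enumerate(nodes):
        for inp in node.inputs:
            i = producer.get(inp)
            if i is not None and i != j:
                deps[j].add(i)
                consumers[i].add(j)

    remaining = {j: len(deps[j]) for j in range(len(nodes))}

    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Start every node whose inputs are graph inputs or constants.
        running = {pool.submit(thunks[j]): j
                   for j, n in remaining.items() if n == 0}
        while running:
            done, _ = cf.wait(running, return_when=cf.FIRST_COMPLETED)
            for fut in done:
                j = running.pop(fut)
                fut.result()  # re-raise any exception from the thunk
                for k in consumers[j]:
                    remaining[k] -= 1
                    if remaining[k] == 0:
                        running[pool.submit(thunks[k])] = k

Inside a VM this would replace the sequential loop in __call__; whether it 
pays off probably depends on the per-thunk scheduling overhead versus the 
size of the individual ops.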



On Friday, April 21, 2017 at 10:04:02 AM UTC+3, Sharapolas wrote:
>
> Dear Patric, 
>
> Thank you for your help and comments. Coincidentally, soon after posting I 
> came across MKL, and I find it pretty criminal that it's not the default in 
> anaconda! :) 
>
> The CPU version is now either much faster (when I reduce the internal 
> matrices from 1000x1000 to 200x1000) or equal to my GPU version. So the CPU 
> is better able to exploit my fundamental optimizations of the problem 
> itself. Pretty curious how this would look on a server-type multi-core CPU. 
>
> Regarding the parallel branches: even aside from my specific problem, I see 
> more and more papers coming out with multiple inputs, forks and merges 
> within models. These structures would benefit greatly from parallel 
> branches. Now, thinking more about it, such parallelism could be achieved 
> manually by splitting the graph at nodes with many inputs. One would just 
> create shared variables that link the sub-graphs with the trunk graph. 
> Then, because Theano drives the GPU asynchronously, the branches would 
> effectively run in parallel. 
>
>
> New CPU profile:
> Function profiling
> ==================
>   Message: D:\PK scripts\sgd_solver\utils\GameForest.py:137
>   Time in 1000 calls to Function.__call__: 9.868994e+00s
>   Time in Function.fn.__call__: 9.794995e+00s (99.250%)
>   Time in thunks: 9.372134e+00s (94.965%)
>   Total compile time: 8.510001e-01s
>     Number of Apply nodes: 276
>     Theano Optimizer time: 6.790001e-01s
>        Theano validate time: 1.619997e-01s
>     Theano Linker time (includes C, CUDA code generation/compiling): 
> 1.150000e-01s
>        Import time 1.000166e-03s
>        Node make_thunk time 1.029999e-01s
>            Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 
> 0)](raw_p:cc/cc/cc/cr0r0r0r0a, Join.0, InplaceDimShuffle{0,x}.0, 
> TensorConstant{(1L, 1L) of 0.0}) time 1.999855e-03s
>            Node Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0) 
> time 1.000166e-03s
>            Node Elemwise{Mul}[(0, 1)](Elemwise{Mul}[(0, 1)].0, 
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>            Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 
> 0)](raw_p:cc/cc/cc/cr1r0a, Join.0, InplaceDimShuffle{0,x}.0, 
> TensorConstant{(1L, 1L) of 0.0}) time 1.000166e-03s
>            Node Gemm{inplace}(Dot22.0, TensorConstant{1.0}, 
> convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0, TensorConstant{1.0}) 
> time 1.000166e-03s
>
> Time in all call to theano.grad() 0.000000e+00s
> Time since theano import 74.134s
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
> <Class name>
>   65.8%    65.8%       6.164s       1.19e-04s     Py   52000      52   
> theano.tensor.blas.Dot22
>   25.3%    91.0%       2.369s       1.39e-04s     C    17000      17   
> theano.tensor.blas.Gemm
>    3.5%    94.5%       0.325s       3.91e-06s     C    83000      83   
> theano.tensor.elemwise.Elemwise
>    2.1%    96.6%       0.197s       1.09e-05s     C    18000      18   
> theano.tensor.basic.Split
>    1.9%    98.5%       0.174s       9.65e-06s     C    18000      18   
> theano.tensor.basic.Join
>    1.1%    99.6%       0.104s       5.77e-06s     C    18000      18   
> theano.tensor.elemwise.Sum
>    0.4%   100.0%       0.040s       5.71e-07s     C    70000      70   
> theano.tensor.elemwise.DimShuffle
>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op 
> name>
>   65.8%    65.8%       6.164s       1.19e-04s     Py    52000       52   
> Dot22
>   25.3%    91.0%       2.369s       1.39e-04s     C     17000       17   
> Gemm{inplace}
>    1.9%    92.9%       0.174s       9.65e-06s     C     18000       18   
> Join
>    1.5%    94.4%       0.139s       7.72e-06s     C     18000       18   
> Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>    1.2%    95.6%       0.113s       1.25e-05s     C     9000        9   
> Split{2}
>    1.1%    96.7%       0.104s       5.77e-06s     C     18000       18   
> Sum{axis=[1], acc_dtype=float64}
>    0.8%    97.5%       0.077s       4.28e-06s     C     18000       18   
> Elemwise{mul,no_inplace}
>    0.8%    98.3%       0.074s       1.06e-05s     C     7000        7   
> Split{4}
>    0.4%    98.8%       0.042s       1.40e-06s     C     30000       30   
> Elemwise{Mul}[(0, 1)]
>    0.3%    99.1%       0.030s       4.99e-06s     C     6000        6   
> Elemwise{Add}[(0, 2)]
>    0.3%    99.3%       0.025s       4.80e-07s     C     52000       52   
> InplaceDimShuffle{1,0}
>    0.2%    99.5%       0.018s       2.25e-06s     C     8000        8   
> Elemwise{Mul}[(0, 0)]
>    0.2%    99.7%       0.015s       8.33e-07s     C     18000       18   
> InplaceDimShuffle{0,x}
>    0.1%    99.8%       0.010s       4.99e-06s     C     2000        2   
> Split{3}
>    0.1%    99.9%       0.010s       4.99e-06s     C     2000        2   
> Elemwise{Add}[(0, 1)]
>    0.1%   100.0%       0.009s       8.99e-06s     C     1000        1   
> Elemwise{Add}[(0, 0)]
>    ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
>
> Apply
> ------
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>    2.5%     2.5%       0.237s       2.37e-04s   1000    84   
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
>    2.1%     4.6%       0.194s       1.94e-04s   1000    83   
> Dot22(ranges_r=3, InplaceDimShuffle{1,0}.0)
>    1.9%     6.5%       0.182s       1.82e-04s   1000   150   
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
>    1.7%     8.2%       0.157s       1.57e-04s   1000    71   
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
>    1.7%     9.9%       0.156s       1.56e-04s   1000    94   
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
>    1.7%    11.6%       0.156s       1.56e-04s   1000   153   
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
>    1.6%    13.2%       0.154s       1.54e-04s   1000   119   
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
>    1.6%    14.8%       0.152s       1.52e-04s   1000   126   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
>    1.6%    16.4%       0.150s       1.50e-04s   1000   134   
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
>    1.6%    18.0%       0.150s       1.50e-04s   1000   165   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
>    1.6%    19.6%       0.149s       1.49e-04s   1000   164   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
>    1.6%    21.2%       0.147s       1.47e-04s   1000   184   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
>    1.6%    22.7%       0.146s       1.46e-04s   1000   160   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
>    1.5%    24.3%       0.145s       1.45e-04s   1000   113   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
>    1.5%    25.8%       0.145s       1.45e-04s   1000    85   
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
>    1.5%    27.4%       0.142s       1.42e-04s   1000   172   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
>    1.5%    28.9%       0.142s       1.42e-04s   1000   188   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
>    1.5%    30.4%       0.141s       1.41e-04s   1000   183   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
>    1.5%    31.8%       0.137s       1.37e-04s   1000   193   
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3, 
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
>    1.5%    33.3%       0.137s       1.37e-04s   1000    72   
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
>    ... (remaining 256 Apply instances account for 66.70%(6.25s) of the 
> runtime)
>
> Here are tips to potentially make your code run faster
>                  (if you think of new ones, suggest them on the mailing 
> list).
>                  Test them first, as they are not guaranteed to always 
> provide a speedup.
>   Sorry, no tip for today.
>
>
>
>
> On Friday, April 21, 2017 at 8:14:59 AM UTC+3, Patric wrote:
>>
>> Many thanks for the information. 
>>
>> From the profiling log, the CPU is quite good, since there are lots of 
>> data operations such as split and join, which are almost 100X faster on 
>> the CPU.
>>
>> The topology of your model includes a huge number of small GEMM and 
>> Elemwise ops, so I think the big cache will help on the CPU side. And, as 
>> the title says, parallel branches would be a very good idea for 
>> independent compute flows.
>>
>> Have you used Intel MKL as the backend for GEMM? It should show better 
>> performance.
>>
>> btw, I can't open the .p file, any suggestions?
>>
>>
>>
>> On Thursday, April 20, 2017 at 5:43:19 PM UTC+8, Sharapolas wrote:
>>>
>>> Guys thanks for your feedback. 
>>>
>>> For the past week I have been trying to optimize my solver as much as 
>>> possible, and I have optimized it so much that the CPU is now twice as 
>>> fast as the GPU :D I'm extremely puzzled by this result and hope you can 
>>> shed some light on it. 
>>>
>>> Wider story:
>>>      In my initial version, I arranged the tensors such that I did not 
>>> need to do any slicing. Then I noticed that GPU load is directly 
>>> proportional to the size of the tensors being used, so I decided to use 
>>> smaller tensors, lump them together, and then slice in the few cases 
>>> where I need to. As a result the GPU code turned out to be more than 4 
>>> times slower, while the CPU code almost rivals my first GPU version. I 
>>> tried different kinds of indexing (e.g. A[:,i], T.take(A, i, 1), 
>>> T.split), but all resulted in similar timings. 
>>>
>>> Do you have any suggestions for how I could speed up my GPU code? 
>>> Otherwise, I might as well just run on a multicore CPU and probably end 
>>> up even faster than the GPU :/ 
>>>
>>>
>>> GPU version. Flags:
>>>     os.environ['THEANO_FLAGS'] = 
>>> ",mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True'
>>>     os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
>>> Pickled version:
>>>     https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM
>>> Graph:
>>>     https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU
>>> Profile:
>>> Function profiling
>>> ==================
>>>   Time in 1000 calls to Function.__call__: 2.170000e+01s
>>>   Time in Function.fn.__call__: 2.166000e+01s (99.816%)
>>>   Time in thunks: 2.150321e+01s (99.093%)
>>>   Total compile time: 1.809000e+00s
>>>     Number of Apply nodes: 276
>>>     Theano Optimizer time: 1.099000e+00s
>>>        Theano validate time: 2.069981e-01s
>>>     Theano Linker time (includes C, CUDA code generation/compiling): 
>>> 2.370000e-01s
>>>        Import time 3.000021e-03s
>>>        Node make_thunk time 2.260001e-01s
>>>            Node GpuElemwise{Composite{maximum(((i0 + i1) - i2), 
>>> i3)}}[(0, 0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0, 
>>> CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s
>>>            Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0}, 
>>> TensorConstant{(2L,) of 1}) time 2.000093e-03s
>>>            Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0}, 
>>> TensorConstant{(2L,) of 1}) time 2.000093e-03s
>>>            Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, 
>>> convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0}) 
>>> time 2.000093e-03s
>>>            Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, 
>>> TensorConstant{(4L,) of 1}) time 2.000093e-03s
>>>
>>> Time in all call to theano.grad() 0.000000e+00s
>>> Time since theano import 101.753s
>>> Class
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>>> <Class name>
>>>   38.0%    38.0%       8.176s       1.57e-04s     C    52000      52   
>>> theano.sandbox.cuda.blas.GpuDot22
>>>   16.9%    54.9%       3.627s       4.37e-05s     C    83000      83   
>>> theano.sandbox.cuda.basic_ops.GpuElemwise
>>>   14.7%    69.6%       3.169s       1.76e-04s     Py   18000      18   
>>> theano.sandbox.cuda.basic_ops.GpuSplit
>>>   13.8%    83.4%       2.970s       1.65e-04s     C    18000      18   
>>> theano.sandbox.cuda.basic_ops.GpuJoin
>>>   12.4%    95.9%       2.674s       1.57e-04s     C    17000      17   
>>> theano.sandbox.cuda.blas.GpuGemm
>>>    3.5%    99.4%       0.751s       4.17e-05s     C    18000      18   
>>> theano.sandbox.cuda.basic_ops.GpuCAReduce
>>>    0.6%   100.0%       0.137s       1.96e-06s     C    70000      70   
>>> theano.sandbox.cuda.basic_ops.GpuDimShuffle
>>>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>>>
>>> Ops
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>>> <Op name>
>>>   38.0%    38.0%       8.176s       1.57e-04s     C     52000       52   
>>> GpuDot22
>>>   13.8%    51.8%       2.970s       1.65e-04s     C     18000       18   
>>> GpuJoin
>>>   12.4%    64.3%       2.674s       1.57e-04s     C     17000       17   
>>> GpuGemm{inplace}
>>>    7.7%    71.9%       1.649s       2.36e-04s     Py    7000        7   
>>> GpuSplit{4}
>>>    6.1%    78.1%       1.317s       4.39e-05s     C     30000       30   
>>> GpuElemwise{Mul}[(0, 1)]
>>>    5.4%    83.5%       1.167s       1.30e-04s     Py    9000        9   
>>> GpuSplit{2}
>>>    3.6%    87.0%       0.766s       4.26e-05s     C     18000       18   
>>> GpuElemwise{mul,no_inplace}
>>>    3.5%    90.6%       0.763s       4.24e-05s     C     18000       18   
>>> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>>>    3.5%    94.1%       0.751s       4.17e-05s     C     18000       18   
>>> GpuCAReduce{add}{0,1}
>>>    1.9%    95.9%       0.399s       4.99e-05s     C     8000        8   
>>> GpuElemwise{Mul}[(0, 0)]
>>>    1.6%    97.6%       0.353s       1.76e-04s     Py    2000        2   
>>> GpuSplit{3}
>>>    1.1%    98.7%       0.247s       4.12e-05s     C     6000        6   
>>> GpuElemwise{Add}[(0, 2)]
>>>    0.6%    99.4%       0.133s       2.56e-06s     C     52000       52   
>>> GpuDimShuffle{1,0}
>>>    0.4%    99.8%       0.094s       4.70e-05s     C     2000        2   
>>> GpuElemwise{Add}[(0, 1)]
>>>    0.2%   100.0%       0.041s       4.10e-05s     C     1000        1   
>>> GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)]
>>>    0.0%   100.0%       0.004s       2.22e-07s     C     18000       18   
>>> GpuDimShuffle{0,x}
>>>    ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
>>>
>>> Apply
>>> ------
>>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>>    1.2%     1.2%       0.259s       2.59e-04s   1000    14   
>>> GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,) 
>>> of 1})
>>>    1.1%     2.3%       0.246s       2.46e-04s   1000     9   
>>> GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1})
>>>    1.1%     3.5%       0.245s       2.45e-04s   1000   236   
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>>> GpuElemwise{Add}[(0, 1)].0)
>>>    1.1%     4.6%       0.239s       2.39e-04s   1000   239   
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>>> GpuElemwise{Add}[(0, 2)].0)
>>>    1.1%     5.7%       0.233s       2.33e-04s   1000     8   
>>> GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,) 
>>> of 1})
>>>    1.1%     6.8%       0.232s       2.32e-04s   1000     5   
>>> GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of 
>>> 1})
>>>    1.1%     7.8%       0.228s       2.28e-04s   1000     0   
>>> GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of 
>>> 1})
>>>    1.1%     8.9%       0.227s       2.27e-04s   1000     2   
>>> GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,) 
>>> of 1})
>>>    1.0%     9.9%       0.225s       2.25e-04s   1000   238   
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>>> GpuElemwise{Add}[(0, 2)].0)
>>>    1.0%    11.0%       0.224s       2.24e-04s   1000     4   
>>> GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,) 
>>> of 1})
>>>    1.0%    12.0%       0.223s       2.23e-04s   1000   260   
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>>> GpuElemwise{Add}[(0, 2)].0)
>>>    1.0%    13.0%       0.221s       2.21e-04s   1000   271   
>>> GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) + 
>>> i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0, 
>>> GpuElemwise{Add}[(0, 2)].0)
>>>    1.0%    14.0%       0.218s       2.18e-04s   1000   261   
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>>> GpuElemwise{Add}[(0, 2)].0)
>>>    0.9%    15.0%       0.203s       2.03e-04s   1000   237   
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0, 
>>> GpuElemwise{Add}[(0, 1)].0)
>>>    0.9%    15.8%       0.184s       1.84e-04s   1000   146   
>>> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
>>>    0.8%    16.7%       0.181s       1.81e-04s   1000    84   
>>> GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0)
>>>    0.8%    17.5%       0.179s       1.79e-04s   1000   134   
>>> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
>>>    0.8%    18.4%       0.179s       1.79e-04s   1000    16   
>>> GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0}, 
>>> TensorConstant{(3L,) of 1})
>>>    0.8%    19.2%       0.175s       1.75e-04s   1000    83   
>>> GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0)
>>>    0.8%    20.0%       0.174s       1.74e-04s   1000    11   
>>> GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0}, 
>>> TensorConstant{(3L,) of 1})
>>>    ... (remaining 256 Apply instances account for 80.03%(17.21s) of the 
>>> runtime)
>>>
>>>
>>> Some info useful for gpu:
>>>
>>>     Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and 
>>> 0.000s(0.00%) transfert Op
>>>
>>>     Theano function input that are float64
>>>     <fct name> <input name> <input type> <str input>
>>>
>>>     List of apply that don't have float64 as input but have float64 in 
>>> outputs
>>>     (Useful to know if we forgot some cast when using floatX=float32 or 
>>> gpu code)
>>>     <Apply> <Apply position> <fct name> <inputs type> <outputs type>
>>>
>>> Here are tips to potentially make your code run faster
>>>                  (if you think of new ones, suggest them on the mailing 
>>> list).
>>>                  Test them first, as they are not guaranteed to always 
>>> provide a speedup.
>>>   Sorry, no tip for today.
>>>
>>> The CPU version. Flags:
>>>     os.environ['THEANO_FLAGS'] = 
>>> ',mode=FAST_RUN,floatX=float32,device=cpu,profile=True'
>>> Graph: 
>>>     https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk
>>> Pickled function:
>>>     https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU
>>> Profile:
>>> Function profiling
>>> ==================
>>>   Time in 1000 calls to Function.__call__: 5.470006e+00s
>>>   Time in Function.fn.__call__: 5.422005e+00s (99.122%)
>>>   Time in thunks: 5.277404e+00s (96.479%)
>>>   Total compile time: 9.329998e-01s
>>>     Number of Apply nodes: 285
>>>     Theano Optimizer time: 7.650001e-01s
>>>        Theano validate time: 1.880007e-01s
>>>     Theano Linker time (includes C, CUDA code generation/compiling): 
>>> 1.140001e-01s
>>>        Import time 0.000000e+00s
>>>        Node make_thunk time 1.020000e-01s
>>>            Node InplaceDimShuffle{x,0}(Sum{axis=[0], 
>>> acc_dtype=float64}.0) time 1.000166e-03s
>>>            Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, 
>>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>>            Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, 
>>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>>            Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0, 
>>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>>            Node Gemm{inplace}(Dot22.0, TensorConstant{1.0}, 
>>> Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time 
>>> 1.000166e-03s
>>>
>>> Time in all call to theano.grad() 0.000000e+00s
>>> Time since theano import 62.174s
>>> Class
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>>> <Class name>
>>>   74.3%    74.3%       3.921s       7.54e-05s     Py   52000      52   
>>> theano.tensor.blas.Dot22
>>>   18.9%    93.2%       0.996s       5.86e-05s     C    17000      17   
>>> theano.tensor.blas.Gemm
>>>    2.8%    95.9%       0.146s       1.59e-06s     C    92000      92   
>>> theano.tensor.elemwise.Elemwise
>>>    1.6%    97.6%       0.085s       4.72e-06s     C    18000      18   
>>> theano.tensor.elemwise.Sum
>>>    1.1%    98.7%       0.058s       3.22e-06s     C    18000      18   
>>> theano.tensor.basic.Join
>>>    1.0%    99.7%       0.053s       2.94e-06s     C    18000      18   
>>> theano.tensor.basic.Split
>>>    0.3%   100.0%       0.018s       2.57e-07s     C    70000      70   
>>> theano.tensor.elemwise.DimShuffle
>>>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>>>
>>> Ops
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>>> <Op name>
>>>   74.3%    74.3%       3.921s       7.54e-05s     Py    52000       52   
>>> Dot22
>>>   18.9%    93.2%       0.996s       5.86e-05s     C     17000       17   
>>> Gemm{inplace}
>>>    1.6%    94.8%       0.085s       4.72e-06s     C     18000       18   
>>> Sum{axis=[0], acc_dtype=float64}
>>>    1.4%    96.2%       0.076s       4.22e-06s     C     18000       18   
>>> Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>>>    1.1%    97.3%       0.058s       3.22e-06s     C     18000       18   
>>> Join
>>>    0.7%    98.0%       0.038s       2.11e-06s     C     18000       18   
>>> Elemwise{mul,no_inplace}
>>>    0.5%    98.5%       0.025s       3.56e-06s     C     7000        7   
>>> Split{4}
>>>    0.4%    98.9%       0.021s       2.34e-06s     C     9000        9   
>>> Split{2}
>>>    0.2%    99.2%       0.013s       2.50e-07s     C     52000       52   
>>> InplaceDimShuffle{1,0}
>>>    0.2%    99.4%       0.012s       3.08e-07s     C     39000       39   
>>> Elemwise{Mul}[(0, 1)]
>>>    0.2%    99.6%       0.011s       1.83e-06s     C     6000        6   
>>> Elemwise{Add}[(0, 2)]
>>>    0.1%    99.7%       0.007s       3.51e-06s     C     2000        2   
>>> Split{3}
>>>    0.1%    99.8%       0.005s       5.56e-07s     C     9000        9   
>>> Elemwise{Mul}[(0, 0)]
>>>    0.1%    99.9%       0.005s       2.77e-07s     C     18000       18   
>>> InplaceDimShuffle{x,0}
>>>    0.1%   100.0%       0.004s       2.00e-06s     C     2000        2   
>>> Elemwise{Add}[(0, 1)]
>>>    ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
>>>
>>> Apply
>>> ------
>>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>>    2.0%     2.0%       0.106s       1.06e-04s   1000   110   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    2.0%     4.0%       0.104s       1.04e-04s   1000   107   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.8%     5.7%       0.093s       9.30e-05s   1000   188   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.8%     7.5%       0.093s       9.30e-05s   1000    78   
>>> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
>>>    1.8%     9.3%       0.093s       9.29e-05s   1000   146   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    11.0%       0.092s       9.20e-05s   1000   135   
>>> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
>>>    1.7%    12.8%       0.092s       9.20e-05s   1000   105   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    14.5%       0.092s       9.19e-05s   1000   164   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    16.2%       0.090s       9.03e-05s   1000   177   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    17.9%       0.090s       8.99e-05s   1000   178   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    19.6%       0.089s       8.90e-05s   1000   159   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    21.3%       0.089s       8.90e-05s   1000   168   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    23.0%       0.089s       8.90e-05s   1000   157   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.7%    24.6%       0.088s       8.80e-05s   1000    73   
>>> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
>>>    1.6%    26.3%       0.087s       8.71e-05s   1000   121   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.6%    27.9%       0.087s       8.70e-05s   1000   193   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.6%    29.6%       0.086s       8.60e-05s   1000   170   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.6%    31.2%       0.085s       8.50e-05s   1000   166   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.6%    32.8%       0.084s       8.40e-05s   1000   155   
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>>    1.6%    34.3%       0.083s       8.30e-05s   1000   140   
>>> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
>>>    ... (remaining 265 Apply instances account for 65.66%(3.46s) of the 
>>> runtime)
>>>
>>> Here are tips to potentially make your code run faster
>>>                  (if you think of new ones, suggest them on the mailing 
>>> list).
>>>                  Test them first, as they are not guaranteed to always 
>>> provide a speedup.
>>>   Sorry, no tip for today.
>>>
>>> On Thursday, April 20, 2017 at 4:07:45 AM UTC+3, Patric wrote:
>>>>
>>>> Could you share your model with us? We'd like to take a look :)
>>>>
>>>> On Tuesday, April 18, 2017 at 5:24:30 PM UTC+8, Sharapolas wrote:
>>>>>
>>>>> I have a computation tree and am implementing leaf node evaluations. In 
>>>>> a Theano graph, do parallel branches get evaluated in parallel on the 
>>>>> GPU?
>>>>>
>>>>
