I have looked into the Theano runtime and I think I could achieve what I want
with a custom linker. Before I do that I would like to get your feedback.
As far as I understand, the Theano graph is traversed at runtime by linkers:
the nodes are sorted into the order in which they should execute, and then
their thunks are run one after another. The simplest linker I've found is the
Loop VM:
class Loop(VM):
    """
    Unconditional start-to-finish program execution in Python.
    No garbage collection is allowed on intermediate results.
    """
    # Some other part of Theano query that information
    allow_gc = False

    def __call__(self):
        if self.time_thunks:
            for cont in self.pre_call_clear:
                cont[0] = None
            try:
                for i, (thunk, node) in enumerate(zip(self.thunks,
                                                      self.nodes)):
                    t0 = time.time()
                    thunk()
                    t1 = time.time()
                    self.call_counts[i] += 1
                    self.call_times[i] += t1 - t0
            except:
                link.raise_with_op(node, thunk)
        else:
            for cont in self.pre_call_clear:
                cont[0] = None
            try:
                for thunk, node in zip(self.thunks, self.nodes):
                    thunk()
            except:
                link.raise_with_op(node, thunk)
Here the thunks are processed sequentially. Now suppose all my thunks are
independent (say many updates of many independent variables); then I could run
all of them in parallel. When some thunks depend on each other I could still
run them in parallel, as long as I make sure that by the time a thunk runs its
inputs are ready. I imagine the latter could be done, but before doing anything
I would like to ask you whether I understand the situation correctly.
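For concreteness, here is a rough sketch of the kind of loop I have in mind.
The ParallelLoop class, its constructor arguments, and the dependency
bookkeeping are all made up by me for illustration (none of this is existing
Theano API), and I am assuming the thunks are thread-safe and that enough of
their work releases the GIL for threads to help at all:

import concurrent.futures


class ParallelLoop(object):
    """Sketch: run each thunk as soon as all of its inputs are ready."""

    allow_gc = False

    def __init__(self, thunks, nodes, dependencies):
        # dependencies[i] = indices of the nodes that must run before node i
        # (this would be derived from each node's inputs).
        self.thunks = thunks
        self.nodes = nodes
        self.dependencies = dependencies
        # Reverse map: which nodes are waiting on node i.
        self.dependents = [[] for _ in nodes]
        for i, deps in enumerate(dependencies):
            for d in deps:
                self.dependents[d].append(i)

    def __call__(self, max_workers=4):
        # (pre_call_clear / gc handling from the Loop VM omitted for brevity)
        remaining = [len(deps) for deps in self.dependencies]
        ready = [i for i, n in enumerate(remaining) if n == 0]
        with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
            futures = {pool.submit(self.thunks[i]): i for i in ready}
            while futures:
                done, _ = concurrent.futures.wait(
                    futures, return_when=concurrent.futures.FIRST_COMPLETED)
                for fut in done:
                    i = futures.pop(fut)
                    fut.result()  # re-raise any error from the thunk
                    for j in self.dependents[i]:
                        remaining[j] -= 1
                        if remaining[j] == 0:
                            futures[pool.submit(self.thunks[j])] = j

The execution order is still a valid topological order; the only change is that
a thunk is submitted as soon as its dependency count drops to zero instead of
waiting for its turn in the sequential list. Whether this actually buys
anything will presumably depend on how much of each thunk's work happens
outside the GIL.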
On Friday, April 21, 2017 at 10:04:02 AM UTC+3, Sharapolas wrote:
>
> Dear Patric,
>
> Thank you for your help and comments. Coincidentally, soon after posting I
> came across MKL, and I find it pretty criminal that it's not the default in
> Anaconda! :)
>
> The CPU version is now either much faster (when I reduce the internal
> matrices from 1000x1000 to 200x1000) or on par with my GPU version. So the
> CPU is better able to exploit my fundamental optimizations of the problem
> itself. Pretty curious how this would look on a server-type multi-core CPU.
>
> Regarding the parallel branches, even aside from my specific problem I see
> more and more papers coming out with multiple inputs, forks and merges
> within models. These structures would benefit greatly from parallel
> branches. Now, thinking more about it, such parallelism could be achieved
> manually by splitting the graph at nodes with many inputs: one would create
> shared variables that link the sub-graphs with the trunk graph, and because
> Theano drives the GPU asynchronously one would get the overlap, roughly as
> sketched below.
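> For example (just a sketch with made-up shapes and names; the real seam
> would sit at one of the many-input nodes of my graph), one branch could be
> compiled into its own function that writes into a shared variable, which the
> trunk function then reads:
>
> import numpy as np
> import theano
> import theano.tensor as T
>
> x = T.matrix('x')
> y = T.matrix('y')
> # Shared variable acting as the seam between the branch and the trunk.
> branch_out = theano.shared(np.zeros((200, 200), dtype='float32'))
>
> # Branch: writes its result into the shared variable.
> f_branch = theano.function([x], [], updates=[(branch_out, T.dot(x, x.T))])
> # Trunk: reads the shared variable.
> f_trunk = theano.function([y], T.dot(branch_out, y))
>
> # Because GPU kernels are queued asynchronously, f_trunk can be launched
> # from Python while f_branch's kernels may still be running on the device.
> f_branch(np.random.rand(200, 1000).astype('float32'))
> out = f_trunk(np.random.rand(200, 300).astype('float32'))
>
> Whether the kernels of two such functions actually overlap on a single GPU
> is something I would still have to measure, of course.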
>
>
> New CPU profile:
> Function profiling
> ==================
> Message: D:\PK scripts\sgd_solver\utils\GameForest.py:137
> Time in 1000 calls to Function.__call__: 9.868994e+00s
> Time in Function.fn.__call__: 9.794995e+00s (99.250%)
> Time in thunks: 9.372134e+00s (94.965%)
> Total compile time: 8.510001e-01s
> Number of Apply nodes: 276
> Theano Optimizer time: 6.790001e-01s
> Theano validate time: 1.619997e-01s
> Theano Linker time (includes C, CUDA code generation/compiling):
> 1.150000e-01s
> Import time 1.000166e-03s
> Node make_thunk time 1.029999e-01s
> Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
> 0)](raw_p:cc/cc/cc/cr0r0r0r0a, Join.0, InplaceDimShuffle{0,x}.0,
> TensorConstant{(1L, 1L) of 0.0}) time 1.999855e-03s
> Node Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
> time 1.000166e-03s
> Node Elemwise{Mul}[(0, 1)](Elemwise{Mul}[(0, 1)].0,
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
> Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
> 0)](raw_p:cc/cc/cc/cr1r0a, Join.0, InplaceDimShuffle{0,x}.0,
> TensorConstant{(1L, 1L) of 0.0}) time 1.000166e-03s
> Node Gemm{inplace}(Dot22.0, TensorConstant{1.0},
> convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
> time 1.000166e-03s
>
> Time in all call to theano.grad() 0.000000e+00s
> Time since theano import 74.134s
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
> <Class name>
> 65.8% 65.8% 6.164s 1.19e-04s Py 52000 52
> theano.tensor.blas.Dot22
> 25.3% 91.0% 2.369s 1.39e-04s C 17000 17
> theano.tensor.blas.Gemm
> 3.5% 94.5% 0.325s 3.91e-06s C 83000 83
> theano.tensor.elemwise.Elemwise
> 2.1% 96.6% 0.197s 1.09e-05s C 18000 18
> theano.tensor.basic.Split
> 1.9% 98.5% 0.174s 9.65e-06s C 18000 18
> theano.tensor.basic.Join
> 1.1% 99.6% 0.104s 5.77e-06s C 18000 18
> theano.tensor.elemwise.Sum
> 0.4% 100.0% 0.040s 5.71e-07s C 70000 70
> theano.tensor.elemwise.DimShuffle
> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
> name>
> 65.8% 65.8% 6.164s 1.19e-04s Py 52000 52
> Dot22
> 25.3% 91.0% 2.369s 1.39e-04s C 17000 17
> Gemm{inplace}
> 1.9% 92.9% 0.174s 9.65e-06s C 18000 18
> Join
> 1.5% 94.4% 0.139s 7.72e-06s C 18000 18
> Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
> 1.2% 95.6% 0.113s 1.25e-05s C 9000 9
> Split{2}
> 1.1% 96.7% 0.104s 5.77e-06s C 18000 18
> Sum{axis=[1], acc_dtype=float64}
> 0.8% 97.5% 0.077s 4.28e-06s C 18000 18
> Elemwise{mul,no_inplace}
> 0.8% 98.3% 0.074s 1.06e-05s C 7000 7
> Split{4}
> 0.4% 98.8% 0.042s 1.40e-06s C 30000 30
> Elemwise{Mul}[(0, 1)]
> 0.3% 99.1% 0.030s 4.99e-06s C 6000 6
> Elemwise{Add}[(0, 2)]
> 0.3% 99.3% 0.025s 4.80e-07s C 52000 52
> InplaceDimShuffle{1,0}
> 0.2% 99.5% 0.018s 2.25e-06s C 8000 8
> Elemwise{Mul}[(0, 0)]
> 0.2% 99.7% 0.015s 8.33e-07s C 18000 18
> InplaceDimShuffle{0,x}
> 0.1% 99.8% 0.010s 4.99e-06s C 2000 2
> Split{3}
> 0.1% 99.9% 0.010s 4.99e-06s C 2000 2
> Elemwise{Add}[(0, 1)]
> 0.1% 100.0% 0.009s 8.99e-06s C 1000 1
> Elemwise{Add}[(0, 0)]
> ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
>
> Apply
> ------
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
> 2.5% 2.5% 0.237s 2.37e-04s 1000 84
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
> 2.1% 4.6% 0.194s 1.94e-04s 1000 83
> Dot22(ranges_r=3, InplaceDimShuffle{1,0}.0)
> 1.9% 6.5% 0.182s 1.82e-04s 1000 150
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
> 1.7% 8.2% 0.157s 1.57e-04s 1000 71
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
> 1.7% 9.9% 0.156s 1.56e-04s 1000 94
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
> 1.7% 11.6% 0.156s 1.56e-04s 1000 153
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
> 1.6% 13.2% 0.154s 1.54e-04s 1000 119
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
> 1.6% 14.8% 0.152s 1.52e-04s 1000 126
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
> 1.6% 16.4% 0.150s 1.50e-04s 1000 134
> Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
> 1.6% 18.0% 0.150s 1.50e-04s 1000 165
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
> 1.6% 19.6% 0.149s 1.49e-04s 1000 164
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
> 1.6% 21.2% 0.147s 1.47e-04s 1000 184
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
> 1.6% 22.7% 0.146s 1.46e-04s 1000 160
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
> 1.5% 24.3% 0.145s 1.45e-04s 1000 113
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
> 1.5% 25.8% 0.145s 1.45e-04s 1000 85
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
> 1.5% 27.4% 0.142s 1.42e-04s 1000 172
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
> 1.5% 28.9% 0.142s 1.42e-04s 1000 188
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
> 1.5% 30.4% 0.141s 1.41e-04s 1000 183
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
> 1.5% 31.8% 0.137s 1.37e-04s 1000 193
> Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
> Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
> 1.5% 33.3% 0.137s 1.37e-04s 1000 72
> Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
> ... (remaining 256 Apply instances account for 66.70%(6.25s) of the
> runtime)
>
> Here are tips to potentially make your code run faster
> (if you think of new ones, suggest them on the mailing
> list).
> Test them first, as they are not guaranteed to always
> provide a speedup.
> Sorry, no tip for today.
>
>
>
>
> On Friday, April 21, 2017 at 8:14:59 AM UTC+3, Patric wrote:
>>
>> Many thanks for the information.
>>
>> From the profiling log the CPU looks quite good, since there are lots of
>> data operations such as split and join which are almost 100X faster on the
>> CPU.
>>
>> The topology of your model includes a huge number of small GEMM and
>> Elemwise ops, so I think the big cache helps on the CPU side. And, as the
>> title says, parallel branches would be a very good idea for independent
>> compute flows.
>>
>> Have you used Intel MKL as the backend for GEMM? It should show better
>> performance.
>>
>> btw, I can't open the .p file, any suggestions?
>>
>>
>>
>> On Thursday, April 20, 2017 at 5:43:19 PM UTC+8, Sharapolas wrote:
>>>
>>> Guys thanks for your feedback.
>>>
>>> For the past week I have been trying to optimize my solver as much as
>>> possible, and I optimized it so much that the CPU is now twice as fast as
>>> the GPU :D I am extremely puzzled by this result and hope you can shed
>>> some light on it.
>>>
>>> Wider story:
>>> In my initial version I arranged the tensors so that I did not need to do
>>> any slicing. Then I noticed that the GPU load is directly proportional to
>>> the size of the tensors being used, so I decided to use smaller tensors,
>>> lump them together, and slice in the few cases where I need it. As a
>>> result the GPU code turned out to be more than 4 times slower, while the
>>> CPU code almost rivals my first GPU version. I tried different kinds of
>>> indexing (e.g. A[:,i], T.take(A, i, 1), T.split) but all resulted in
>>> similar timings.
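>>> (For reference, the three indexing variants I compared, written out on a
>>> dummy matrix with a made-up column index; the real shapes come from my
>>> model:)
>>>
>>> import theano
>>> import theano.tensor as T
>>>
>>> A = T.matrix('A')
>>> i = 3  # column to extract
>>> col_subscript = A[:, i]      # plain subscripting
>>> col_take = T.take(A, i, 1)   # take along axis 1
>>> sizes = T.stack([i, A.shape[1] - i])
>>> parts = T.split(A, sizes, 2, axis=1)
>>> col_split = parts[1][:, 0]   # first column of the second piece
>>>
>>> f = theano.function([A], [col_subscript, col_take, col_split])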
>>>
>>> Do you have any suggestions for how I could speed up my GPU code?
>>> Otherwise I might as well just run on a multi-core CPU and probably end up
>>> even faster than the GPU :/
>>>
>>>
>>> GPU version. Flags:
>>> os.environ['THEANO_FLAGS'] =
>>> ",mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True'
>>> os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
>>> Pickled version:
>>> https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM
>>> Graph:
>>> https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU
>>> Profile:
>>> Function profiling
>>> ==================
>>> Time in 1000 calls to Function.__call__: 2.170000e+01s
>>> Time in Function.fn.__call__: 2.166000e+01s (99.816%)
>>> Time in thunks: 2.150321e+01s (99.093%)
>>> Total compile time: 1.809000e+00s
>>> Number of Apply nodes: 276
>>> Theano Optimizer time: 1.099000e+00s
>>> Theano validate time: 2.069981e-01s
>>> Theano Linker time (includes C, CUDA code generation/compiling):
>>> 2.370000e-01s
>>> Import time 3.000021e-03s
>>> Node make_thunk time 2.260001e-01s
>>> Node GpuElemwise{Composite{maximum(((i0 + i1) - i2),
>>> i3)}}[(0, 0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0,
>>> CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s
>>> Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0},
>>> TensorConstant{(2L,) of 1}) time 2.000093e-03s
>>> Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0},
>>> TensorConstant{(2L,) of 1}) time 2.000093e-03s
>>> Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0},
>>> convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
>>> time 2.000093e-03s
>>> Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0},
>>> TensorConstant{(4L,) of 1}) time 2.000093e-03s
>>>
>>> Time in all call to theano.grad() 0.000000e+00s
>>> Time since theano import 101.753s
>>> Class
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
>>> <Class name>
>>> 38.0% 38.0% 8.176s 1.57e-04s C 52000 52
>>> theano.sandbox.cuda.blas.GpuDot22
>>> 16.9% 54.9% 3.627s 4.37e-05s C 83000 83
>>> theano.sandbox.cuda.basic_ops.GpuElemwise
>>> 14.7% 69.6% 3.169s 1.76e-04s Py 18000 18
>>> theano.sandbox.cuda.basic_ops.GpuSplit
>>> 13.8% 83.4% 2.970s 1.65e-04s C 18000 18
>>> theano.sandbox.cuda.basic_ops.GpuJoin
>>> 12.4% 95.9% 2.674s 1.57e-04s C 17000 17
>>> theano.sandbox.cuda.blas.GpuGemm
>>> 3.5% 99.4% 0.751s 4.17e-05s C 18000 18
>>> theano.sandbox.cuda.basic_ops.GpuCAReduce
>>> 0.6% 100.0% 0.137s 1.96e-06s C 70000 70
>>> theano.sandbox.cuda.basic_ops.GpuDimShuffle
>>> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>>>
>>> Ops
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
>>> <Op name>
>>> 38.0% 38.0% 8.176s 1.57e-04s C 52000 52
>>> GpuDot22
>>> 13.8% 51.8% 2.970s 1.65e-04s C 18000 18
>>> GpuJoin
>>> 12.4% 64.3% 2.674s 1.57e-04s C 17000 17
>>> GpuGemm{inplace}
>>> 7.7% 71.9% 1.649s 2.36e-04s Py 7000 7
>>> GpuSplit{4}
>>> 6.1% 78.1% 1.317s 4.39e-05s C 30000 30
>>> GpuElemwise{Mul}[(0, 1)]
>>> 5.4% 83.5% 1.167s 1.30e-04s Py 9000 9
>>> GpuSplit{2}
>>> 3.6% 87.0% 0.766s 4.26e-05s C 18000 18
>>> GpuElemwise{mul,no_inplace}
>>> 3.5% 90.6% 0.763s 4.24e-05s C 18000 18
>>> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>>> 3.5% 94.1% 0.751s 4.17e-05s C 18000 18
>>> GpuCAReduce{add}{0,1}
>>> 1.9% 95.9% 0.399s 4.99e-05s C 8000 8
>>> GpuElemwise{Mul}[(0, 0)]
>>> 1.6% 97.6% 0.353s 1.76e-04s Py 2000 2
>>> GpuSplit{3}
>>> 1.1% 98.7% 0.247s 4.12e-05s C 6000 6
>>> GpuElemwise{Add}[(0, 2)]
>>> 0.6% 99.4% 0.133s 2.56e-06s C 52000 52
>>> GpuDimShuffle{1,0}
>>> 0.4% 99.8% 0.094s 4.70e-05s C 2000 2
>>> GpuElemwise{Add}[(0, 1)]
>>> 0.2% 100.0% 0.041s 4.10e-05s C 1000 1
>>> GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)]
>>> 0.0% 100.0% 0.004s 2.22e-07s C 18000 18
>>> GpuDimShuffle{0,x}
>>> ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
>>>
>>> Apply
>>> ------
>>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>> 1.2% 1.2% 0.259s 2.59e-04s 1000 14
>>> GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,)
>>> of 1})
>>> 1.1% 2.3% 0.246s 2.46e-04s 1000 9
>>> GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1})
>>> 1.1% 3.5% 0.245s 2.45e-04s 1000 236
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
>>> GpuElemwise{Add}[(0, 1)].0)
>>> 1.1% 4.6% 0.239s 2.39e-04s 1000 239
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
>>> GpuElemwise{Add}[(0, 2)].0)
>>> 1.1% 5.7% 0.233s 2.33e-04s 1000 8
>>> GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,)
>>> of 1})
>>> 1.1% 6.8% 0.232s 2.32e-04s 1000 5
>>> GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of
>>> 1})
>>> 1.1% 7.8% 0.228s 2.28e-04s 1000 0
>>> GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of
>>> 1})
>>> 1.1% 8.9% 0.227s 2.27e-04s 1000 2
>>> GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,)
>>> of 1})
>>> 1.0% 9.9% 0.225s 2.25e-04s 1000 238
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
>>> GpuElemwise{Add}[(0, 2)].0)
>>> 1.0% 11.0% 0.224s 2.24e-04s 1000 4
>>> GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,)
>>> of 1})
>>> 1.0% 12.0% 0.223s 2.23e-04s 1000 260
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
>>> GpuElemwise{Add}[(0, 2)].0)
>>> 1.0% 13.0% 0.221s 2.21e-04s 1000 271
>>> GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) +
>>> i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0,
>>> GpuElemwise{Add}[(0, 2)].0)
>>> 1.0% 14.0% 0.218s 2.18e-04s 1000 261
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
>>> GpuElemwise{Add}[(0, 2)].0)
>>> 0.9% 15.0% 0.203s 2.03e-04s 1000 237
>>> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
>>> GpuElemwise{Add}[(0, 1)].0)
>>> 0.9% 15.8% 0.184s 1.84e-04s 1000 146
>>> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
>>> 0.8% 16.7% 0.181s 1.81e-04s 1000 84
>>> GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0)
>>> 0.8% 17.5% 0.179s 1.79e-04s 1000 134
>>> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
>>> 0.8% 18.4% 0.179s 1.79e-04s 1000 16
>>> GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0},
>>> TensorConstant{(3L,) of 1})
>>> 0.8% 19.2% 0.175s 1.75e-04s 1000 83
>>> GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0)
>>> 0.8% 20.0% 0.174s 1.74e-04s 1000 11
>>> GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0},
>>> TensorConstant{(3L,) of 1})
>>> ... (remaining 256 Apply instances account for 80.03%(17.21s) of the
>>> runtime)
>>>
>>>
>>> Some info useful for gpu:
>>>
>>> Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and
>>> 0.000s(0.00%) transfert Op
>>>
>>> Theano function input that are float64
>>> <fct name> <input name> <input type> <str input>
>>>
>>> List of apply that don't have float64 as input but have float64 in
>>> outputs
>>> (Useful to know if we forgot some cast when using floatX=float32 or
>>> gpu code)
>>> <Apply> <Apply position> <fct name> <inputs type> <outputs type>
>>>
>>> Here are tips to potentially make your code run faster
>>> (if you think of new ones, suggest them on the mailing
>>> list).
>>> Test them first, as they are not guaranteed to always
>>> provide a speedup.
>>> Sorry, no tip for today.
>>>
>>> The CPU version. Flags:
>>> os.environ['THEANO_FLAGS'] =
>>> ',mode=FAST_RUN,floatX=float32,device=cpu,profile=True'
>>> Graph:
>>> https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk
>>> Pickled function:
>>> https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU
>>> Profile:
>>> Function profiling
>>> ==================
>>> Time in 1000 calls to Function.__call__: 5.470006e+00s
>>> Time in Function.fn.__call__: 5.422005e+00s (99.122%)
>>> Time in thunks: 5.277404e+00s (96.479%)
>>> Total compile time: 9.329998e-01s
>>> Number of Apply nodes: 285
>>> Theano Optimizer time: 7.650001e-01s
>>> Theano validate time: 1.880007e-01s
>>> Theano Linker time (includes C, CUDA code generation/compiling):
>>> 1.140001e-01s
>>> Import time 0.000000e+00s
>>> Node make_thunk time 1.020000e-01s
>>> Node InplaceDimShuffle{x,0}(Sum{axis=[0],
>>> acc_dtype=float64}.0) time 1.000166e-03s
>>> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
>>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
>>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
>>> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
>>> Node Gemm{inplace}(Dot22.0, TensorConstant{1.0},
>>> Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time
>>> 1.000166e-03s
>>>
>>> Time in all call to theano.grad() 0.000000e+00s
>>> Time since theano import 62.174s
>>> Class
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
>>> <Class name>
>>> 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
>>> theano.tensor.blas.Dot22
>>> 18.9% 93.2% 0.996s 5.86e-05s C 17000 17
>>> theano.tensor.blas.Gemm
>>> 2.8% 95.9% 0.146s 1.59e-06s C 92000 92
>>> theano.tensor.elemwise.Elemwise
>>> 1.6% 97.6% 0.085s 4.72e-06s C 18000 18
>>> theano.tensor.elemwise.Sum
>>> 1.1% 98.7% 0.058s 3.22e-06s C 18000 18
>>> theano.tensor.basic.Join
>>> 1.0% 99.7% 0.053s 2.94e-06s C 18000 18
>>> theano.tensor.basic.Split
>>> 0.3% 100.0% 0.018s 2.57e-07s C 70000 70
>>> theano.tensor.elemwise.DimShuffle
>>> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>>>
>>> Ops
>>> ---
>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
>>> <Op name>
>>> 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
>>> Dot22
>>> 18.9% 93.2% 0.996s 5.86e-05s C 17000 17
>>> Gemm{inplace}
>>> 1.6% 94.8% 0.085s 4.72e-06s C 18000 18
>>> Sum{axis=[0], acc_dtype=float64}
>>> 1.4% 96.2% 0.076s 4.22e-06s C 18000 18
>>> Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
>>> 1.1% 97.3% 0.058s 3.22e-06s C 18000 18
>>> Join
>>> 0.7% 98.0% 0.038s 2.11e-06s C 18000 18
>>> Elemwise{mul,no_inplace}
>>> 0.5% 98.5% 0.025s 3.56e-06s C 7000 7
>>> Split{4}
>>> 0.4% 98.9% 0.021s 2.34e-06s C 9000 9
>>> Split{2}
>>> 0.2% 99.2% 0.013s 2.50e-07s C 52000 52
>>> InplaceDimShuffle{1,0}
>>> 0.2% 99.4% 0.012s 3.08e-07s C 39000 39
>>> Elemwise{Mul}[(0, 1)]
>>> 0.2% 99.6% 0.011s 1.83e-06s C 6000 6
>>> Elemwise{Add}[(0, 2)]
>>> 0.1% 99.7% 0.007s 3.51e-06s C 2000 2
>>> Split{3}
>>> 0.1% 99.8% 0.005s 5.56e-07s C 9000 9
>>> Elemwise{Mul}[(0, 0)]
>>> 0.1% 99.9% 0.005s 2.77e-07s C 18000 18
>>> InplaceDimShuffle{x,0}
>>> 0.1% 100.0% 0.004s 2.00e-06s C 2000 2
>>> Elemwise{Add}[(0, 1)]
>>> ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
>>>
>>> Apply
>>> ------
>>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>> 2.0% 2.0% 0.106s 1.06e-04s 1000 110
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 2.0% 4.0% 0.104s 1.04e-04s 1000 107
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.8% 5.7% 0.093s 9.30e-05s 1000 188
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.8% 7.5% 0.093s 9.30e-05s 1000 78
>>> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
>>> 1.8% 9.3% 0.093s 9.29e-05s 1000 146
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 11.0% 0.092s 9.20e-05s 1000 135
>>> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
>>> 1.7% 12.8% 0.092s 9.20e-05s 1000 105
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 14.5% 0.092s 9.19e-05s 1000 164
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 16.2% 0.090s 9.03e-05s 1000 177
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 17.9% 0.090s 8.99e-05s 1000 178
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 19.6% 0.089s 8.90e-05s 1000 159
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 21.3% 0.089s 8.90e-05s 1000 168
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 23.0% 0.089s 8.90e-05s 1000 157
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.7% 24.6% 0.088s 8.80e-05s 1000 73
>>> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
>>> 1.6% 26.3% 0.087s 8.71e-05s 1000 121
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.6% 27.9% 0.087s 8.70e-05s 1000 193
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.6% 29.6% 0.086s 8.60e-05s 1000 170
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.6% 31.2% 0.085s 8.50e-05s 1000 166
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.6% 32.8% 0.084s 8.40e-05s 1000 155
>>> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
>>> 1.6% 34.3% 0.083s 8.30e-05s 1000 140
>>> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
>>> ... (remaining 265 Apply instances account for 65.66%(3.46s) of the
>>> runtime)
>>>
>>> Here are tips to potentially make your code run faster
>>> (if you think of new ones, suggest them on the mailing
>>> list).
>>> Test them first, as they are not guaranteed to always
>>> provide a speedup.
>>> Sorry, no tip for today.
>>>
>>> On Thursday, April 20, 2017 at 4:07:45 AM UTC+3, Patric wrote:
>>>>
>>>> Could you share your model with us? We'd like to take a look :)
>>>>
>>>> On Tuesday, April 18, 2017 at 5:24:30 PM UTC+8, Sharapolas wrote:
>>>>>
>>>>> I have a computation tree and am implementing leaf node evaluations. In
>>>>> a Theano graph, do parallel branches get evaluated in parallel on the GPU?
>>>>>
>>>>