My system:
Windows 8.1 Enterprise x64
Anaconda Python 2.7.12 x64
Theano 0.9.0rc4.dev-44f7578c16e7b991c06e373d470d9889c2729844
Geforce GTX 1070
i7-4790K @ 4 GHz, 16 GB RAM
On Thursday, April 20, 2017 at 12:43:19 PM UTC+3, Sharapolas wrote:
>
> Guys thanks for your feedback.
>
> For the past week I have been trying to optimize my solver as much as
> possible, and I optimized it so much that the CPU is now twice as fast as
> the GPU :D I am extremely puzzled by this result and hope you could shed
> some light on it.
>
> Wider story:
> In my initial version, I arranged the tensors so that I did not need
> to do any slicing. Then I noticed that GPU load is directly proportional
> to the size of the tensors being used, so I decided to use smaller
> tensors, lump them together, and slice in the few cases where I need to.
> As a result the GPU code turned out to be more than 4 times slower, while
> the CPU code almost rivals my first GPU version. I tried different ways of
> indexing (e.g. A[:,i], T.take(A, i, 1), T.split), but all resulted in
> similar timings.
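>
> (For concreteness, a minimal sketch of the indexing variants I compared;
> A and i are placeholder names, and the T.split case assumes a 4-column
> matrix split into single columns:)
>
> import theano
> import theano.tensor as T
>
> A = T.fmatrix('A')   # a lumped float32 matrix (placeholder name)
> i = 1                # column to extract (placeholder index)
>
> col_subscript = A[:, i]                  # plain subscript slicing
> col_take = T.take(A, i, axis=1)          # the same column via T.take
> pieces = T.split(A, [1] * 4, 4, axis=1)  # split a 4-column A into single columns
> col_split = pieces[i]                    # i-th piece, shape (rows, 1)
>
> f = theano.function([A], [col_subscript, col_take, col_split])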
>
> Do you have any suggestions on how I could speed up my GPU code?
> Otherwise, I might as well just run on a multicore CPU and probably end
> up even faster than the GPU :/
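>
> (If I do go the CPU route, this is roughly what I would try in order to
> use several cores; a sketch only, the thread count is an arbitrary
> example, and most of the gain would have to come from a multithreaded
> BLAS:)
>
> import os
>
> # Set before importing theano: openmp=True parallelizes some C ops
> # (e.g. elemwise); OMP_NUM_THREADS also controls OpenMP-based BLAS builds.
> os.environ['THEANO_FLAGS'] = ',mode=FAST_RUN,floatX=float32,device=cpu,openmp=True'
> os.environ['OMP_NUM_THREADS'] = '4'  # example: 4 threads
>
> import theano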
>
>
> GPU version. Flags:
> os.environ['THEANO_FLAGS'] =
> ",mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True'
> os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
> Pickled version:
> https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM
> Graph:
> https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU
> Profile:
> Function profiling
> ==================
> Time in 1000 calls to Function.__call__: 2.170000e+01s
> Time in Function.fn.__call__: 2.166000e+01s (99.816%)
> Time in thunks: 2.150321e+01s (99.093%)
> Total compile time: 1.809000e+00s
> Number of Apply nodes: 276
> Theano Optimizer time: 1.099000e+00s
> Theano validate time: 2.069981e-01s
> Theano Linker time (includes C, CUDA code generation/compiling):
> 2.370000e-01s
> Import time 3.000021e-03s
> Node make_thunk time 2.260001e-01s
> Node GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
> 0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0,
> CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s
> Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0},
> TensorConstant{(2L,) of 1}) time 2.000093e-03s
> Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0},
> TensorConstant{(2L,) of 1}) time 2.000093e-03s
> Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0},
> convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
> time 2.000093e-03s
> Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0},
> TensorConstant{(4L,) of 1}) time 2.000093e-03s
>
> Time in all call to theano.grad() 0.000000e+00s
> Time since theano import 101.753s
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
> <Class name>
> 38.0% 38.0% 8.176s 1.57e-04s C 52000 52
> theano.sandbox.cuda.blas.GpuDot22
> 16.9% 54.9% 3.627s 4.37e-05s C 83000 83
> theano.sandbox.cuda.basic_ops.GpuElemwise
> 14.7% 69.6% 3.169s 1.76e-04s Py 18000 18
> theano.sandbox.cuda.basic_ops.GpuSplit
> 13.8% 83.4% 2.970s 1.65e-04s C 18000 18
> theano.sandbox.cuda.basic_ops.GpuJoin
> 12.4% 95.9% 2.674s 1.57e-04s C 17000 17
> theano.sandbox.cuda.blas.GpuGemm
> 3.5% 99.4% 0.751s 4.17e-05s C 18000 18
> theano.sandbox.cuda.basic_ops.GpuCAReduce
> 0.6% 100.0% 0.137s 1.96e-06s C 70000 70
> theano.sandbox.cuda.basic_ops.GpuDimShuffle
> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
> name>
> 38.0% 38.0% 8.176s 1.57e-04s C 52000 52
> GpuDot22
> 13.8% 51.8% 2.970s 1.65e-04s C 18000 18
> GpuJoin
> 12.4% 64.3% 2.674s 1.57e-04s C 17000 17
> GpuGemm{inplace}
> 7.7% 71.9% 1.649s 2.36e-04s Py 7000 7
> GpuSplit{4}
> 6.1% 78.1% 1.317s 4.39e-05s C 30000 30
> GpuElemwise{Mul}[(0, 1)]
> 5.4% 83.5% 1.167s 1.30e-04s Py 9000 9
> GpuSplit{2}
> 3.6% 87.0% 0.766s 4.26e-05s C 18000 18
> GpuElemwise{mul,no_inplace}
> 3.5% 90.6% 0.763s 4.24e-05s C 18000 18
> GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
> 3.5% 94.1% 0.751s 4.17e-05s C 18000 18
> GpuCAReduce{add}{0,1}
> 1.9% 95.9% 0.399s 4.99e-05s C 8000 8
> GpuElemwise{Mul}[(0, 0)]
> 1.6% 97.6% 0.353s 1.76e-04s Py 2000 2
> GpuSplit{3}
> 1.1% 98.7% 0.247s 4.12e-05s C 6000 6
> GpuElemwise{Add}[(0, 2)]
> 0.6% 99.4% 0.133s 2.56e-06s C 52000 52
> GpuDimShuffle{1,0}
> 0.4% 99.8% 0.094s 4.70e-05s C 2000 2
> GpuElemwise{Add}[(0, 1)]
> 0.2% 100.0% 0.041s 4.10e-05s C 1000 1
> GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)]
> 0.0% 100.0% 0.004s 2.22e-07s C 18000 18
> GpuDimShuffle{0,x}
> ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
>
> Apply
> ------
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
> 1.2% 1.2% 0.259s 2.59e-04s 1000 14
> GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.1% 2.3% 0.246s 2.46e-04s 1000 9
> GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1})
> 1.1% 3.5% 0.245s 2.45e-04s 1000 236
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 1)].0)
> 1.1% 4.6% 0.239s 2.39e-04s 1000 239
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.1% 5.7% 0.233s 2.33e-04s 1000 8
> GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.1% 6.8% 0.232s 2.32e-04s 1000 5
> GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of
> 1})
> 1.1% 7.8% 0.228s 2.28e-04s 1000 0
> GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of
> 1})
> 1.1% 8.9% 0.227s 2.27e-04s 1000 2
> GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.0% 9.9% 0.225s 2.25e-04s 1000 238
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.0% 11.0% 0.224s 2.24e-04s 1000 4
> GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,)
> of 1})
> 1.0% 12.0% 0.223s 2.23e-04s 1000 260
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.0% 13.0% 0.221s 2.21e-04s 1000 271
> GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) +
> i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0,
> GpuElemwise{Add}[(0, 2)].0)
> 1.0% 14.0% 0.218s 2.18e-04s 1000 261
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 2)].0)
> 0.9% 15.0% 0.203s 2.03e-04s 1000 237
> GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
> GpuElemwise{Add}[(0, 1)].0)
> 0.9% 15.8% 0.184s 1.84e-04s 1000 146
> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
> 0.8% 16.7% 0.181s 1.81e-04s 1000 84
> GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0)
> 0.8% 17.5% 0.179s 1.79e-04s 1000 134
> GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
> 0.8% 18.4% 0.179s 1.79e-04s 1000 16
> GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0},
> TensorConstant{(3L,) of 1})
> 0.8% 19.2% 0.175s 1.75e-04s 1000 83
> GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0)
> 0.8% 20.0% 0.174s 1.74e-04s 1000 11
> GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0},
> TensorConstant{(3L,) of 1})
> ... (remaining 256 Apply instances account for 80.03%(17.21s) of the
> runtime)
>
>
> Some info useful for gpu:
>
> Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and
> 0.000s(0.00%) transfert Op
>
> Theano function input that are float64
> <fct name> <input name> <input type> <str input>
>
> List of apply that don't have float64 as input but have float64 in
> outputs
> (Useful to know if we forgot some cast when using floatX=float32 or
> gpu code)
> <Apply> <Apply position> <fct name> <inputs type> <outputs type>
>
> Here are tips to potentially make your code run faster
> (if you think of new ones, suggest them on the mailing
> list).
> Test them first, as they are not guaranteed to always
> provide a speedup.
> Sorry, no tip for today.
>
> The CPU version. Flags:
> os.environ['THEANO_FLAGS'] =
> ',mode=FAST_RUN,floatX=float32,device=cpu,profile=True'
> Graph:
> https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk
> Pickled function:
> https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU
> Profile:
> Function profiling
> ==================
> Time in 1000 calls to Function.__call__: 5.470006e+00s
> Time in Function.fn.__call__: 5.422005e+00s (99.122%)
> Time in thunks: 5.277404e+00s (96.479%)
> Total compile time: 9.329998e-01s
> Number of Apply nodes: 285
> Theano Optimizer time: 7.650001e-01s
> Theano validate time: 1.880007e-01s
> Theano Linker time (includes C, CUDA code generation/compiling):
> 1.140001e-01s
> Import time 0.000000e+00s
> Node make_thunk time 1.020000e-01s
> Node InplaceDimShuffle{x,0}(Sum{axis=[0], acc_dtype=float64}.0)
> time 1.000166e-03s
> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
> Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
> InplaceDimShuffle{1,0}.0) time 1.000166e-03s
> Node Gemm{inplace}(Dot22.0, TensorConstant{1.0},
> Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time
> 1.000166e-03s
>
> Time in all call to theano.grad() 0.000000e+00s
> Time since theano import 62.174s
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
> <Class name>
> 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
> theano.tensor.blas.Dot22
> 18.9% 93.2% 0.996s 5.86e-05s C 17000 17
> theano.tensor.blas.Gemm
> 2.8% 95.9% 0.146s 1.59e-06s C 92000 92
> theano.tensor.elemwise.Elemwise
> 1.6% 97.6% 0.085s 4.72e-06s C 18000 18
> theano.tensor.elemwise.Sum
> 1.1% 98.7% 0.058s 3.22e-06s C 18000 18
> theano.tensor.basic.Join
> 1.0% 99.7% 0.053s 2.94e-06s C 18000 18
> theano.tensor.basic.Split
> 0.3% 100.0% 0.018s 2.57e-07s C 70000 70
> theano.tensor.elemwise.DimShuffle
> ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
>
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
> name>
> 74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
> Dot22
> 18.9% 93.2% 0.996s 5.86e-05s C 17000 17
> Gemm{inplace}
> 1.6% 94.8% 0.085s 4.72e-06s C 18000 18
> Sum{axis=[0], acc_dtype=float64}
> 1.4% 96.2% 0.076s 4.22e-06s C 18000 18
> Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
> 1.1% 97.3% 0.058s 3.22e-06s C 18000 18
> Join
> 0.7% 98.0% 0.038s 2.11e-06s C 18000 18
> Elemwise{mul,no_inplace}
> 0.5% 98.5% 0.025s 3.56e-06s C 7000 7
> Split{4}
> 0.4% 98.9% 0.021s 2.34e-06s C 9000 9
> Split{2}
> 0.2% 99.2% 0.013s 2.50e-07s C 52000 52
> InplaceDimShuffle{1,0}
> 0.2% 99.4% 0.012s 3.08e-07s C 39000 39
> Elemwise{Mul}[(0, 1)]
> 0.2% 99.6% 0.011s 1.83e-06s C 6000 6
> Elemwise{Add}[(0, 2)]
> 0.1% 99.7% 0.007s 3.51e-06s C 2000 2
> Split{3}
> 0.1% 99.8% 0.005s 5.56e-07s C 9000 9
> Elemwise{Mul}[(0, 0)]
> 0.1% 99.9% 0.005s 2.77e-07s C 18000 18
> InplaceDimShuffle{x,0}
> 0.1% 100.0% 0.004s 2.00e-06s C 2000 2
> Elemwise{Add}[(0, 1)]
> ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
>
> Apply
> ------
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
> 2.0% 2.0% 0.106s 1.06e-04s 1000 110
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 2.0% 4.0% 0.104s 1.04e-04s 1000 107
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.8% 5.7% 0.093s 9.30e-05s 1000 188
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.8% 7.5% 0.093s 9.30e-05s 1000 78
> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
> 1.8% 9.3% 0.093s 9.29e-05s 1000 146
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 11.0% 0.092s 9.20e-05s 1000 135
> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
> 1.7% 12.8% 0.092s 9.20e-05s 1000 105
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 14.5% 0.092s 9.19e-05s 1000 164
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 16.2% 0.090s 9.03e-05s 1000 177
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 17.9% 0.090s 8.99e-05s 1000 178
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 19.6% 0.089s 8.90e-05s 1000 159
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 21.3% 0.089s 8.90e-05s 1000 168
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 23.0% 0.089s 8.90e-05s 1000 157
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.7% 24.6% 0.088s 8.80e-05s 1000 73
> Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
> 1.6% 26.3% 0.087s 8.71e-05s 1000 121
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 27.9% 0.087s 8.70e-05s 1000 193
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 29.6% 0.086s 8.60e-05s 1000 170
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 31.2% 0.085s 8.50e-05s 1000 166
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 32.8% 0.084s 8.40e-05s 1000 155
> Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
> 1.6% 34.3% 0.083s 8.30e-05s 1000 140
> Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
> ... (remaining 265 Apply instances account for 65.66%(3.46s) of the
> runtime)
>
> Here are tips to potentially make your code run faster
> (if you think of new ones, suggest them on the mailing
> list).
> Test them first, as they are not guaranteed to always
> provide a speedup.
> Sorry, no tip for today.
>
> On Thursday, April 20, 2017 at 4:07:45 AM UTC+3, Patric wrote:
>>
>> Could you share your model with us? We'd like to take a look :)
>>
>> On Tuesday, April 18, 2017 at 5:24:30 PM UTC+8, Sharapolas wrote:
>>>
>>> I have a computation tree and am implementing leaf node evaluations. In
>>> a Theano graph, do parallel branches get evaluated in parallel on the GPU?
>>>
>>