OK, I think it's getting clearer now with your help. Thanks.

As far as I understand, on each ifelse call the condition gets evaluated on the 
CPU and then the branches on the GPU. But if the condition is based on some 
variable that lives on the GPU, it would then have to transfer data back to the 
CPU, evaluate the condition there, and go back to the GPU to process the 
branches, right? 

Thus this implementation sounds like it could easily bottleneck the whole 
computation.
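
(For example, this is the sort of situation I have in mind; a rough sketch with 
made-up names, where the condition depends on a value computed on the GPU:)

import numpy as np
import theano as th
import theano.tensor as T
from theano.ifelse import ifelse

x = T.matrix('x')
# with device=gpu the reduction below runs on the GPU ...
gpu_stat = T.sum(x ** 2)
# ... but its scalar value presumably has to come back to the host so the
# lazy ifelse can decide which branch to run
cond = T.gt(gpu_stat, np.float32(100.0))
out = ifelse(cond, T.tanh(x), T.zeros_like(x))
f = th.function([x], out)

In my real graphs there are many such nodes per call (the profile below shows 
2148 calls to if{inplace,gpu}), so I worry the round trips add up.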

On Tuesday, 11 April 2017 18:02:21 UTC+3, nouiz wrote:
>
> ifelse works on the GPU. The "PY" just means it uses the Python interface, 
> but it still works on the GPU. Only the condition stays on the CPU; both 
> branches are moved to the GPU.
>
> If you want to make sure of that, put a Python breakpoint in the file 
> ifelse.py, in the method thunk(). You will see that the input data isn't a 
> numpy.ndarray.
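>
> Another quick check (a sketch, assuming your compiled function is called f): 
> print the optimized graph and look at which ops the branches use.
>
> import theano
> # For a compiled function this prints the graph after optimization; the
> # branch computations should appear as Gpu* ops and the lazy node as
> # if{inplace,gpu}, matching what the profile shows.
> theano.printing.debugprint(f)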
>
> Fred
>
> On Sun, Mar 26, 2017 at 10:51 AM Šarūnas S. <[email protected]> wrote:
>
>> Indeed, this was my first approach, but due to the many small variations the 
>> number of graphs is a bit too big to manage. Currently I precompile a few 
>> trees and, for the remaining variations, I use an ifelse variant with boolean 
>> operations, which reduces the number of trees at the cost of computational 
>> inefficiency.
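>>
>> (By "an ifelse variant with boolean operations" I mean something along these 
>> lines; a rough sketch with made-up names, where both branches are always 
>> computed and a flag just selects the result:)
>>
>> import numpy as np
>> import theano as th
>> import theano.tensor as T
>>
>> use_a = th.shared(np.float32(1), name='use_a')  # hypothetical selector flag
>> x = T.matrix('x')
>> branch_a = x ** 2           # stand-in for one sub-tree
>> branch_b = T.zeros_like(x)  # stand-in for another sub-tree
>> # Both branches are evaluated; the flag only chooses which result is kept.
>> result = use_a * branch_a + (1 - use_a) * branch_b
>> # (T.switch(T.gt(use_a, 0), branch_a, branch_b) behaves the same way:
>> # it is element-wise, not lazy.)
>> f = th.function([x], result)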
>>
>> But I am most interested in the status of ifelse, since with it I could use a 
>> single tree and, with lazy evaluation, get full computational efficiency. 
>> That's what GPUs and Theano are all about, right? 
>>
>>
>> On Sunday, 26 March 2017 05:26:37 UTC+2, Jesse Livezey wrote:
>>>
>>>> I have decided to precompile a general graph in which all the possible 
>>>> graphs are nested. Then during realtime I would set which parts of the 
>>>> general graph to use using the *allowed_branch* variables and *if* nodes. 
>>>> Since afaik ifs are evaluated lazily, in each case I would only be using the 
>>>> relevant part of the graph, so my computational cost is minimal.
>>>
>>>
>>> Have you considered precompiling all the possible graphs individually and 
>>> then just using Python conditionals to choose a graph? Maybe this won't work 
>>> for your system, but it might be easier to get right.
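>>>
>>> (Roughly what I have in mind; a minimal sketch with made-up variant names:)
>>>
>>> import theano as th
>>> import theano.tensor as T
>>>
>>> x = T.matrix('x')
>>> # One precompiled function per variant; keys and expressions are placeholders.
>>> variants = {
>>>     'square':   th.function([x], x ** 2),
>>>     'identity': th.function([x], x),
>>> }
>>>
>>> def run(name, data):
>>>     # A plain Python lookup/conditional picks the precompiled graph at runtime.
>>>     return variants[name](data)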
>>>
>>> On Saturday, March 25, 2017 at 2:21:27 AM UTC-7, Šarūnas S. wrote:
>>>>
>>>> Nouiz, sorry, now I understand what you were referring to with "the 
>>>> condition is constant". I misled you with my example. 
>>>>
>>>> This is a more realistic example:
>>>>
>>>> import numpy as np
>>>> import theano as th
>>>> import theano.ifelse  # make sure the ifelse submodule is loaded
>>>> import theano.tensor as T
>>>>
>>>> allowed_branch = th.shared(np.cast['float32'](0))
>>>>
>>>> x = T.matrix('x')
>>>> y = T.matrix('y')
>>>> f = x ** 2 + y ** 2 + 2 * x * y
>>>>
>>>> result = th.ifelse.ifelse(T.gt(allowed_branch, T.constant(0)), f,
>>>>                           T.zeros((2, 2)))
>>>>
>>>>
>>>> I am working on a realtime system which, in a given situation, will 
>>>> construct the relevant computational graph, compute its result and display 
>>>> it. However, the graphs are relatively big and each compilation takes too 
>>>> long, so I can't compile in real time. Thus I have to precompile somehow. 
>>>>
>>>> I have decided to precompile a general graph in which all the possible 
>>>> graphs are nested. Then during realtime I would set which parts of the 
>>>> general graph to use using the *allowed_branch* variables and *if* nodes. 
>>>> Since afaik ifs are evaluated lazily, in each case I would only be using the 
>>>> relevant part of the graph, so my computational cost is minimal.
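>>>>
>>>> (In code, roughly, continuing the example above; a sketch, so the values 
>>>> are illustrative:)
>>>>
>>>> # compile the general graph once, up front
>>>> compute = th.function([x, y], result)
>>>>
>>>> # at runtime, just flip the flag; with lazy ifelse only the selected
>>>> # branch should actually be evaluated
>>>> allowed_branch.set_value(np.cast['float32'](1))
>>>> out = compute(np.ones((2, 2), dtype='float32'),
>>>>               np.ones((2, 2), dtype='float32'))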
>>>>
>>>>
>>>> On Saturday, 25 March 2017 10:04:21 UTC+1, Šarūnas S. wrote:
>>>>>
>>>>> I suspect that ifelse is running on the GPU, because this is the profile I 
>>>>> get:
>>>>>
>>>>> ==================
>>>>>   Message: Sum of all(44) printed profiles at exit excluding Scan op 
>>>>> profile.
>>>>>   Time in 95 calls to Function.__call__: 2.309995e-01s
>>>>>   Time in Function.fn.__call__: 2.299995e-01s (99.567%)
>>>>>   Time in thunks: 2.307765e-01s (99.903%)
>>>>>   Total compile time: 1.360100e+01s
>>>>>     Number of Apply nodes: 416
>>>>>     Theano Optimizer time: 6.314001e+00s
>>>>>        Theano validate time: 9.200015e-01s
>>>>>     Theano Linker time (includes C, CUDA code generation/compiling): 
>>>>> 1.169000e+00s
>>>>>        Import time 2.799892e-02s
>>>>>        Node make_thunk time 1.108999e+00s
>>>>>            Node GpuElemwise{Composite{(i0 * ((i1 * i2) + (i1 * 
>>>>> i3)))}}[(0, 2)](CudaNdarrayConstant{0.5}, 
>>>>> CudaNdarrayConstant{0.833333313465}, GpuCAReduce{add}{1,1}.0, 
>>>>> GpuCAReduce{add}{1,1}.0) time 6.999969e-03s
>>>>>            Node GpuElemwise{Composite{(-minimum(i0, 
>>>>> maximum(minimum(i0, (maximum((i1 - i2), i3) + i2)), (((i1 + i2) * i4) + 
>>>>> i1))))},no_inplace}(<CudaNdarrayType(float32, scalar)>, 
>>>>> <CudaNdarrayType(float32, scalar)>, <CudaNdarrayType(float32, scalar)>, 
>>>>> CudaNdarrayConstant{120.0}, <CudaNdarrayType(float32, scalar)>) time 
>>>>> 4.999876e-03s
>>>>>            Node GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, 
>>>>> matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0) time 4.000187e-03s
>>>>>            Node HostFromGpu(<CudaNdarrayType(float32, scalar)>) time 
>>>>> 3.999949e-03s
>>>>>            Node GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{x,x}.0, 
>>>>> GpuDimShuffle{x,0}.0) time 3.999949e-03s
>>>>>
>>>>> Time in all call to theano.grad() 0.000000e+00s
>>>>> Time since theano import 28.959s
>>>>> Class
>>>>> ---
>>>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>>>>> <Class name>
>>>>>   55.4%    55.4%       0.128s       8.71e-05s     C     1468     301   
>>>>> theano.sandbox.cuda.basic_ops.GpuElemwise
>>>>>   25.6%    81.0%       0.059s       1.03e-04s     C      571     106   
>>>>> theano.sandbox.cuda.basic_ops.GpuCAReduce
>>>>>    9.1%    90.1%       0.021s       3.72e-05s     C      564     150   
>>>>> theano.sandbox.cuda.basic_ops.HostFromGpu
>>>>>    5.6%    95.7%       0.013s       6.04e-06s     Py    2148     168   
>>>>> theano.ifelse.IfElse
>>>>>    3.5%    99.1%       0.008s       2.16e-04s     C       37       4   
>>>>> theano.compile.ops.DeepCopyOp
>>>>>    0.4%    99.6%       0.001s       1.60e-06s     C      623     122   
>>>>> theano.sandbox.cuda.basic_ops.GpuDimShuffle
>>>>>    0.4%   100.0%       0.001s       1.97e-06s     C      506     110   
>>>>> theano.tensor.elemwise.Elemwise
>>>>>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>>>>>
>>>>> Ops
>>>>> ---
>>>>> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> 
>>>>> <Op name>
>>>>>   16.9%    16.9%       0.039s       1.22e-04s     C      319       58 
>>>>>   GpuElemwise{mul,no_inplace}
>>>>>   10.0%    26.9%       0.023s       1.49e-04s     C      155       30 
>>>>>   GpuCAReduce{add}{1,0}
>>>>>    9.1%    36.0%       0.021s       3.72e-05s     C      564      150 
>>>>>   HostFromGpu
>>>>>    8.2%    44.2%       0.019s       1.23e-04s     C      154       30 
>>>>>   GpuCAReduce{add}{0,1}
>>>>>    6.9%    51.1%       0.016s       6.61e-05s     C      242       44 
>>>>>   GpuElemwise{Mul}[(0, 1)]
>>>>>    6.5%    57.6%       0.015s       6.20e-05s     C      242       44 
>>>>>   GpuElemwise{maximum,no_inplace}
>>>>>    6.5%    64.1%       0.015s       6.19e-05s     C      242       44 
>>>>>   GpuCAReduce{maximum}{1}
>>>>>    5.6%    69.7%       0.013s       6.04e-06s     Py    2148      168 
>>>>>   if{inplace,gpu}
>>>>>    3.5%    73.2%       0.008s       5.59e-05s     C      143       26 
>>>>>   GpuElemwise{TrueDiv}[(0, 0)]
>>>>>    3.5%    76.7%       0.008s       2.16e-04s     C       37        4 
>>>>>   DeepCopyOp
>>>>>    2.6%    79.3%       0.006s       8.95e-05s     C       67       16 
>>>>>   GpuElemwise{Mul}[(0, 2)]
>>>>>    2.2%    81.4%       0.005s       1.25e-04s     C       40        4 
>>>>>   GpuElemwise{Maximum}[(0, 0)]
>>>>>    1.7%    83.2%       0.004s       2.00e-04s     C       20        2 
>>>>>   GpuElemwise{Composite{maximum(i0, maximum(i1, maximum(i2, i3)))}}[(0, 
>>>>> 0)]
>>>>>    1.7%    84.9%       0.004s       4.93e-04s     C        8        8 
>>>>>   GpuElemwise{neg,no_inplace}
>>>>>    1.3%    86.2%       0.003s       1.36e-04s     C       22        4 
>>>>>   GpuElemwise{Composite{((i0 + i1) + i2)},no_inplace}
>>>>>    1.3%    87.5%       0.003s       2.50e-04s     C       12        3 
>>>>>   GpuElemwise{Composite{minimum(i0, maximum(minimum(i0, (maximum((i1 - 
>>>>> i2), 
>>>>> i3) + i2)), ((i4 * i5) + i1)))}}[(0, 4)]
>>>>>    1.3%    88.8%       0.003s       9.08e-05s     C       33        6 
>>>>>   GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 0)]
>>>>>    0.9%    89.6%       0.002s       3.03e-05s     C       66       12 
>>>>>   GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)]
>>>>>    0.9%    90.5%       0.002s       1.00e-04s     C       20        2 
>>>>>   GpuCAReduce{add}{1,1}
>>>>>    0.9%    91.4%       0.002s       2.50e-04s     C        8        3 
>>>>>   GpuElemwise{Composite{minimum(i0, maximum(minimum(i0, (maximum((i1 - 
>>>>> i2), 
>>>>> i3) + i2)), (((i2 + i1) * i4) + i1)))},no_inplace}
>>>>>    ... (remaining 28 Ops account for   8.62%(0.02s) of the runtime)
>>>>>
>>>>> Apply
>>>>> ------
>>>>> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>>>>>    1.7%     1.7%       0.004s       4.00e-04s     10   365   
>>>>> GpuElemwise{Maximum}[(0, 0)](if{inplace,gpu}.0, if{inplace,gpu}.0)
>>>>>    1.3%     3.0%       0.003s       3.00e-04s     10   105   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{TrueDiv}[(0, 0)].0)
>>>>>    1.3%     4.3%       0.003s       3.00e-04s     10   356   
>>>>> GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{x,x}.0, GpuDimShuffle{0,x}.0)
>>>>>    1.3%     5.6%       0.003s       3.00e-04s     10   143   
>>>>> GpuCAReduce{add}{1,0}(GpuElemwise{mul,no_inplace}.0)
>>>>>    1.3%     6.9%       0.003s       3.00e-04s     10   112   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{TrueDiv}[(0, 0)].0)
>>>>>    1.3%     8.2%       0.003s       3.00e-04s     10   169   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)].0)
>>>>>    1.3%     9.5%       0.003s       3.00e-04s     10   136   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{TrueDiv}[(0, 0)].0)
>>>>>    1.3%    10.8%       0.003s       3.00e-04s     10   217   
>>>>> GpuCAReduce{add}{0,1}(GpuElemwise{mul,no_inplace}.0)
>>>>>    1.3%    12.1%       0.003s       3.00e-04s     10   184   
>>>>> GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 0)](GpuElemwise{TrueDiv}[(0, 
>>>>> 0)].0, GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
>>>>>    1.3%    13.4%       0.003s       5.96e-04s      5     1   
>>>>> HostFromGpu(GpuElemwise{Composite{minimum(i0, maximum(minimum(i0, 
>>>>> (maximum((i1 - i2), i3) + i2)), (((i2 + i1) * i4) + i1)))},no_inplace}.0)
>>>>>    0.9%    14.3%       0.002s       1.69e-04s     12     0   
>>>>> DeepCopyOp(<CudaNdarrayType(float32, scalar)>)
>>>>>    0.9%    15.2%       0.002s       2.00e-04s     10   148   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)].0)
>>>>>    0.9%    16.0%       0.002s       2.00e-04s     10   153   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)].0)
>>>>>    0.9%    16.9%       0.002s       2.00e-04s     10   126   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{TrueDiv}[(0, 0)].0)
>>>>>    0.9%    17.8%       0.002s       2.00e-04s     10   412   
>>>>> GpuCAReduce{add}{1,1}(GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 
>>>>> 0)].0)
>>>>>    0.9%    18.6%       0.002s       2.00e-04s     10   103   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{TrueDiv}[(0, 0)].0)
>>>>>    0.9%    19.5%       0.002s       2.00e-04s     10    89   
>>>>> GpuElemwise{TrueDiv}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, 
>>>>> GpuElemwise{Composite{((i0 + i1) + i2)},no_inplace}.0)
>>>>>    0.9%    20.4%       0.002s       2.00e-04s     10     3   
>>>>> GpuElemwise{maximum,no_inplace}(<CudaNdarrayType(float32, col)>, 
>>>>> CudaNdarrayConstant{[[ 0.001]]})
>>>>>    0.9%    21.2%       0.002s       2.00e-04s     10   134   
>>>>> GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, 
>>>>> GpuElemwise{TrueDiv}[(0, 0)].0)
>>>>>    0.9%    22.1%       0.002s       2.00e-04s     10   300   
>>>>> GpuElemwise{Mul}[(0, 1)](GpuElemwise{Composite{minimum(i0, 
>>>>> maximum(minimum(i0, (maximum((i1 - i2), i3) + i2)), ((i4 * i5) + 
>>>>> i1)))},no_inplace}.0, GpuDimShuffle{x,0}.0)
>>>>>    ... (remaining 941 Apply instances account for 77.89%(0.18s) of the 
>>>>> runtime)
>>>>>
>>>>> Here are tips to potentially make your code run faster
>>>>>                  (if you think of new ones, suggest them on the 
>>>>> mailing list).
>>>>>                  Test them first, as they are not guaranteed to always 
>>>>> provide a speedup.
>>>>>   Sorry, no tip for today.
>>>>>
>>>>> And as you can see, ifelse is shown as a Py operation, which I would 
>>>>> presume runs on the CPU. So where does it run? Also, what did you mean by 
>>>>> "the condition is constant"? 
>>>>>
>>>>> P.S. In case you need them, these are my Theano flags:
>>>>>
>>>>> import os
>>>>>
>>>>> # (set before importing theano, otherwise the flags are ignored)
>>>>> os.environ['THEANO_FLAGS'] = ',optimizer=fast_run,floatX=float32,device=gpu,linker=cvm'
>>>>> os.environ['THEANO_FLAGS'] += ',allow_gc=False,'
>>>>> os.environ['THEANO_FLAGS'] += ',lib.cnmem=0.3'
>>>>> os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
>>>>> os.environ['THEANO_FLAGS'] += ',profile=true'
>>>>>
>>>>>
>>>>> On Friday, 24 March 2017 23:09:11 UTC+1, nouiz wrote:
>>>>>>
>>>>>> What tells you that the ifelse is on the CPU?
>>>>>>
>>>>>> Anyway, as the condition is constant, Theano will remove it during 
>>>>>> the compilation.
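>>>>>>
>>>>>> (That is, with both arguments of T.gt constant, the comparison is folded 
>>>>>> at compile time and only the T.ones branch remains in the compiled 
>>>>>> function. A sketch of a condition that survives compilation, using a 
>>>>>> shared variable as a hypothetical runtime flag:)
>>>>>>
>>>>>> import numpy as np
>>>>>> import theano as th
>>>>>> import theano.tensor as T
>>>>>> from theano.ifelse import ifelse
>>>>>>
>>>>>> flag = th.shared(np.float32(1), name='flag')  # value unknown at compile time
>>>>>> retval = ifelse(T.gt(flag, 0), T.ones((500, 1)), T.zeros((250, 1)))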
>>>>>>
>>>>>> Fred
>>>>>>
>>>>>> On Fri, Mar 24, 2017 at 12:41, Šarūnas S. <[email protected]> wrote:
>>>>>>
>>>>>>> Please find a code example:
>>>>>>>
>>>>>>> import theano as th
>>>>>>> import theano.ifelse
>>>>>>> import theano.tensor as T
>>>>>>>
>>>>>>> retval = th.ifelse.ifelse(T.gt(T.constant(2.0), T.constant(1.0)),
>>>>>>>                           T.ones((500, 1)), T.zeros((250, 1)))
>>>>>>>
>>>>>>> On Friday, 24 March 2017 17:33:59 UTC+1, Šarūnas S. wrote:
>>>>>>>>
>>>>>>>> I am using Theano version 0.9.0.rc2.dev.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Friday, 24 March 2017 17:32:33 UTC+1, Šarūnas S. wrote:
>>>>>>>>>
>>>>>>>>> In my graph I have a few IfElse nodes and I am wondering how and where 
>>>>>>>>> they are executed. 
>>>>>>>>>
>>>>>>>>> At first I ran the code with linker=cvm in my THEANO_FLAGS, but after 
>>>>>>>>> profiling it looked like the ifelse was being executed on the CPU. Then 
>>>>>>>>> I forced linker=c to check whether the IfElse would go through, and I 
>>>>>>>>> got NotImplementedError: if{inplace, gpu} cannot produce C code. Btw, 
>>>>>>>>> removing the inline optimization did not help; it still gave the same 
>>>>>>>>> error. 
>>>>>>>>>
>>>>>>>>> So does IfElse have a GPU implementation? If yes, how do I use it? 
>>>>>>>>> Also, does it do lazy evaluation or not? 
>>>>>>>>>