ifelse works on the GPU. The "Py" in the profile just means it uses the Python interface, but it still works on the GPU. Only the condition stays on the CPU; both branches are moved to the GPU.
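For example, a quick sketch, assuming a GPU-enabled install with device=gpu and the old sandbox.cuda backend, that compiles an ifelse and prints the ops in the optimized graph. On a GPU build the branches typically appear as GpuElemwise nodes and the lazy conditional as if{inplace,gpu}:

import numpy as np
import theano
import theano.tensor as T
from theano.ifelse import ifelse

# The condition is a host-side scalar; both branches are elementwise graphs.
flag = theano.shared(np.float32(0))
x = T.matrix('x')
y = T.matrix('y')
branch_a = x ** 2 + y ** 2 + 2 * x * y
branch_b = T.zeros_like(x)
out = ifelse(T.gt(flag, 0), branch_a, branch_b)

f = theano.function([x, y], out)

# Inspect where each op ended up after optimization.
for node in f.maker.fgraph.toposort():
    print(node.op)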
If you want to make sure of that, you can also put a Python breakpoint in the file ifelse.py, in the method thunk(). You will see that the input data isn't a numpy.ndarray.

Fred

On Sun, Mar 26, 2017 at 10:51 AM Šarūnas S. <[email protected]> wrote:

Indeed, this was my first approach, but due to many small variations the number of graphs is a bit too big to manage. Currently I precompile a few trees, and for the remaining variations I use an ifelse variant with boolean operations, which reduces the number of trees at the cost of computational inefficiency. But I am most interested in the ifelse status, since with a single tree and lazy evaluation I would have full computational efficiency. That's what GPUs and Theano are all about, right?

On Sunday, 26 March 2017 05:26:37 UTC+2, Jesse Livezey wrote:

I have decided to precompile a general graph in which all the possible graphs are nested. Then during realtime I would set which parts of the general graph to use, using the *allowed_branch* variables and *if* nodes. Since afaik ifs are evaluated lazily, in each case I would only be using the relevant part of the graph, so my computational cost is minimal.

Have you considered precompiling all possible graphs individually and then just using Python conditionals to choose a graph? Maybe this won't work for your system, but it might be easier to get right.

On Saturday, March 25, 2017 at 2:21:27 AM UTC-7, Šarūnas S. wrote:

Nouiz, sorry, now I understand what you were referring to by the condition being constant. I misled you with my example. This is a more realistic example:

import numpy as np
import theano as th
import theano.ifelse
import theano.tensor as T

allowed_branch = th.shared(np.cast['float32'](0))
x = T.matrix('x')
y = T.matrix('y')
f = x ** 2 + y ** 2 + 2 * x * y
result = th.ifelse.ifelse(T.gt(allowed_branch, T.constant(0)), f, T.zeros((2, 2)))

I am working on a realtime system which, in a given situation, constructs a relevant computational graph, computes its result and displays it. However, the graphs are relatively big and each compilation takes too long, so I can't compile in realtime. Thus I have to precompile somehow. I have decided to precompile a general graph in which all the possible graphs are nested. Then during realtime I would set which parts of the general graph to use, using the *allowed_branch* variables and *if* nodes. Since afaik ifs are evaluated lazily, in each case I would only be using the relevant part of the graph, so my computational cost is minimal.

On Saturday, 25 March 2017 10:04:21 UTC+1, Šarūnas S. wrote:

I suspect that ifelse is running on the GPU, because this is the profile I get:

==================
Message: Sum of all(44) printed profiles at exit excluding Scan op profile.
Time in 95 calls to Function.__call__: 2.309995e-01s
Time in Function.fn.__call__: 2.299995e-01s (99.567%)
Time in thunks: 2.307765e-01s (99.903%)
Total compile time: 1.360100e+01s
  Number of Apply nodes: 416
  Theano Optimizer time: 6.314001e+00s
    Theano validate time: 9.200015e-01s
  Theano Linker time (includes C, CUDA code generation/compiling): 1.169000e+00s
    Import time 2.799892e-02s
    Node make_thunk time 1.108999e+00s
      Node GpuElemwise{Composite{(i0 * ((i1 * i2) + (i1 * i3)))}}[(0, 2)](CudaNdarrayConstant{0.5}, CudaNdarrayConstant{0.833333313465}, GpuCAReduce{add}{1,1}.0, GpuCAReduce{add}{1,1}.0) time 6.999969e-03s
      Node GpuElemwise{Composite{(-minimum(i0, maximum(minimum(i0, (maximum((i1 - i2), i3) + i2)), (((i1 + i2) * i4) + i1))))},no_inplace}(<CudaNdarrayType(float32, scalar)>, <CudaNdarrayType(float32, scalar)>, <CudaNdarrayType(float32, scalar)>, CudaNdarrayConstant{120.0}, <CudaNdarrayType(float32, scalar)>) time 4.999876e-03s
      Node GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0) time 4.000187e-03s
      Node HostFromGpu(<CudaNdarrayType(float32, scalar)>) time 3.999949e-03s
      Node GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{x,x}.0, GpuDimShuffle{x,0}.0) time 3.999949e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 28.959s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  55.4%   55.4%   0.128s   8.71e-05s   C   1468   301   theano.sandbox.cuda.basic_ops.GpuElemwise
  25.6%   81.0%   0.059s   1.03e-04s   C    571   106   theano.sandbox.cuda.basic_ops.GpuCAReduce
   9.1%   90.1%   0.021s   3.72e-05s   C    564   150   theano.sandbox.cuda.basic_ops.HostFromGpu
   5.6%   95.7%   0.013s   6.04e-06s   Py  2148   168   theano.ifelse.IfElse
   3.5%   99.1%   0.008s   2.16e-04s   C     37     4   theano.compile.ops.DeepCopyOp
   0.4%   99.6%   0.001s   1.60e-06s   C    623   122   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.4%  100.0%   0.001s   1.97e-06s   C    506   110   theano.tensor.elemwise.Elemwise
   ...
(remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  16.9%   16.9%   0.039s   1.22e-04s   C    319    58   GpuElemwise{mul,no_inplace}
  10.0%   26.9%   0.023s   1.49e-04s   C    155    30   GpuCAReduce{add}{1,0}
   9.1%   36.0%   0.021s   3.72e-05s   C    564   150   HostFromGpu
   8.2%   44.2%   0.019s   1.23e-04s   C    154    30   GpuCAReduce{add}{0,1}
   6.9%   51.1%   0.016s   6.61e-05s   C    242    44   GpuElemwise{Mul}[(0, 1)]
   6.5%   57.6%   0.015s   6.20e-05s   C    242    44   GpuElemwise{maximum,no_inplace}
   6.5%   64.1%   0.015s   6.19e-05s   C    242    44   GpuCAReduce{maximum}{1}
   5.6%   69.7%   0.013s   6.04e-06s   Py  2148   168   if{inplace,gpu}
   3.5%   73.2%   0.008s   5.59e-05s   C    143    26   GpuElemwise{TrueDiv}[(0, 0)]
   3.5%   76.7%   0.008s   2.16e-04s   C     37     4   DeepCopyOp
   2.6%   79.3%   0.006s   8.95e-05s   C     67    16   GpuElemwise{Mul}[(0, 2)]
   2.2%   81.4%   0.005s   1.25e-04s   C     40     4   GpuElemwise{Maximum}[(0, 0)]
   1.7%   83.2%   0.004s   2.00e-04s   C     20     2   GpuElemwise{Composite{maximum(i0, maximum(i1, maximum(i2, i3)))}}[(0, 0)]
   1.7%   84.9%   0.004s   4.93e-04s   C      8     8   GpuElemwise{neg,no_inplace}
   1.3%   86.2%   0.003s   1.36e-04s   C     22     4   GpuElemwise{Composite{((i0 + i1) + i2)},no_inplace}
   1.3%   87.5%   0.003s   2.50e-04s   C     12     3   GpuElemwise{Composite{minimum(i0, maximum(minimum(i0, (maximum((i1 - i2), i3) + i2)), ((i4 * i5) + i1)))}}[(0, 4)]
   1.3%   88.8%   0.003s   9.08e-05s   C     33     6   GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 0)]
   0.9%   89.6%   0.002s   3.03e-05s   C     66    12   GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)]
   0.9%   90.5%   0.002s   1.00e-04s   C     20     2   GpuCAReduce{add}{1,1}
   0.9%   91.4%   0.002s   2.50e-04s   C      8     3   GpuElemwise{Composite{minimum(i0, maximum(minimum(i0, (maximum((i1 - i2), i3) + i2)), (((i2 + i1) * i4) + i1)))},no_inplace}
   ... (remaining 28 Ops account for 8.62%(0.02s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   1.7%    1.7%   0.004s   4.00e-04s   10   365   GpuElemwise{Maximum}[(0, 0)](if{inplace,gpu}.0, if{inplace,gpu}.0)
   1.3%    3.0%   0.003s   3.00e-04s   10   105   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0)
   1.3%    4.3%   0.003s   3.00e-04s   10   356   GpuElemwise{Mul}[(0, 1)](GpuDimShuffle{x,x}.0, GpuDimShuffle{0,x}.0)
   1.3%    5.6%   0.003s   3.00e-04s   10   143   GpuCAReduce{add}{1,0}(GpuElemwise{mul,no_inplace}.0)
   1.3%    6.9%   0.003s   3.00e-04s   10   112   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0)
   1.3%    8.2%   0.003s   3.00e-04s   10   169   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)].0)
   1.3%    9.5%   0.003s   3.00e-04s   10   136   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0)
   1.3%   10.8%   0.003s   3.00e-04s   10   217   GpuCAReduce{add}{0,1}(GpuElemwise{mul,no_inplace}.0)
   1.3%   12.1%   0.003s   3.00e-04s   10   184   GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 0)](GpuElemwise{TrueDiv}[(0, 0)].0, GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
   1.3%   13.4%   0.003s   5.96e-04s    5     1   HostFromGpu(GpuElemwise{Composite{minimum(i0, maximum(minimum(i0, (maximum((i1 - i2), i3) + i2)), (((i2 + i1) * i4) + i1)))},no_inplace}.0)
   0.9%   14.3%   0.002s   1.69e-04s   12     0   DeepCopyOp(<CudaNdarrayType(float32, scalar)>)
   0.9%   15.2%   0.002s   2.00e-04s   10   148   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)].0)
   0.9%   16.0%   0.002s   2.00e-04s   10   153   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{Composite{(i0 * (i1 / i2))}}[(0, 1)].0)
   0.9%   16.9%   0.002s   2.00e-04s   10   126   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0)
   0.9%   17.8%   0.002s   2.00e-04s   10   412   GpuCAReduce{add}{1,1}(GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)].0)
   0.9%   18.6%   0.002s   2.00e-04s   10   103   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0)
   0.9%   19.5%   0.002s   2.00e-04s   10    89   GpuElemwise{TrueDiv}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{Composite{((i0 + i1) + i2)},no_inplace}.0)
   0.9%   20.4%   0.002s   2.00e-04s   10     3   GpuElemwise{maximum,no_inplace}(<CudaNdarrayType(float32, col)>, CudaNdarrayConstant{[[ 0.001]]})
   0.9%   21.2%   0.002s   2.00e-04s   10   134   GpuElemwise{mul,no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{TrueDiv}[(0, 0)].0)
   0.9%   22.1%   0.002s   2.00e-04s   10   300   GpuElemwise{Mul}[(0, 1)](GpuElemwise{Composite{minimum(i0, maximum(minimum(i0, (maximum((i1 - i2), i3) + i2)), ((i4 * i5) + i1)))},no_inplace}.0, GpuDimShuffle{x,0}.0)
   ... (remaining 941 Apply instances account for 77.89%(0.18s) of the runtime)

Here are tips to potentially make your code run faster (if you think of new ones, suggest them on the mailing list). Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.

And as you can see, ifelse is shown as a Py operation, which I would presume runs on the CPU. So where does it run? Also, what do you mean by the condition being constant?

P.S. In case you need them, these are my Theano flags:

os.environ['THEANO_FLAGS'] = ",optimizer=fast_run,floatX=float32,device=gpu,linker=cvm"
os.environ['THEANO_FLAGS'] += ',allow_gc=False,'
os.environ['THEANO_FLAGS'] += ',lib.cnmem=0.3'
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['THEANO_FLAGS'] += ',profile=true'

On Friday, 24 March 2017 23:09:11 UTC+1, nouiz wrote:

What tells you that the ifelse is on the CPU? Anyway, I'll add that if the condition is constant, Theano will remove it during compilation.

Fred

On Fri, Mar 24, 2017 at 12:41, Šarūnas S. <[email protected]> wrote:

Please find a code example:

import theano as th
import theano.ifelse
import theano.tensor as T

retval = th.ifelse.ifelse(T.gt(T.constant(2.0), T.constant(1.0)),
                          T.ones((500, 1)), T.zeros((250, 1)))

On Friday, 24 March 2017 17:33:59 UTC+1, Šarūnas S. wrote:

I am using Theano version 0.9.0.rc2.dev.

On Friday, 24 March 2017 17:32:33 UTC+1, Šarūnas S. wrote:

In my graph I have a few IfElse nodes and I am wondering how and where they are executed. At first I ran the code with linker=cvm in my THEANO_FLAGS, but after profiling it looked like the ifelse was being executed on the CPU. Then I forced linker=c to check whether the IfElse would go through, and I got the NotImplementedError: if{inplace,gpu} cannot produce C code. Btw, removing the inline optimization did not help; it still gave the same error.

So does IfElse have a GPU implementation? If yes, how do I use it? Also, does it do lazy evaluation or not?
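For reference, a minimal sketch of the lazy-evaluation check I have in mind, along the lines of the switch vs. ifelse comparison in the Theano docs (it assumes a working Theano install and a lazy linker such as vm or cvm):

import time
import numpy as np
import theano
import theano.tensor as T
from theano.ifelse import ifelse

a, b = T.scalars('a', 'b')
x, y = T.matrices('x', 'y')

# switch is elementwise and always computes both branches;
# ifelse is lazy and should compute only the selected branch,
# provided a lazy linker (vm or cvm) is used.
z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

f_switch = theano.function([a, b, x, y], z_switch,
                           mode=theano.Mode(linker='vm'))
f_lazy = theano.function([a, b, x, y], z_lazy,
                         mode=theano.Mode(linker='vm'))

big1 = np.ones((10000, 1000), dtype=theano.config.floatX)
big2 = np.ones((10000, 1000), dtype=theano.config.floatX)

n = 10
tic = time.time()
for _ in range(n):
    f_switch(0., 1., big1, big2)
print('switch (both branches computed): %f s' % (time.time() - tic))

tic = time.time()
for _ in range(n):
    f_lazy(0., 1., big1, big2)
print('ifelse (one branch computed):    %f s' % (time.time() - tic))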
