Thanks for the reply!

I made some progress toward solving the issue, but there are still some 
problems.

The latest code for the Op definitions is here 
<https://github.com/khaotik/Theano/blob/padop/theano/tensor/padding.py>. 
The code can *only run on the "khaotik/padop" branch*, because a lot of 
hacks were made to tensor/elemwise.py. For now, only the "ElemIdx" and 
"ElemAt" Ops are functional.

It seems that for large elemwise graphs, fusion is still not done as 
expected.

For example, this 
<https://gist.github.com/khaotik/a79c6248506f9e324adf256f4a2ff58c> script 
will compile into a fully fused graph.
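
(For reference, the gist boils down to something like the sketch below; it 
assumes the idx/at_idx helpers from padding.py, and the variable names may 
differ from the actual script.)

import theano
import theano.tensor as T
from theano.tensor.padding import idx, at_idx

x = T.fmatrix()
s = T.iscalar()
# roll along axis 0: y[i, j] = x[(i + s) % x.shape[0], j]
row = (idx(x, 0) + s) % x.shape[0]
col = idx(x, 1)
fn_new = theano.function([x, s], at_idx(x, row, col))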

khaotik@KKPC:~/tmp$ thcpu python3 -i test_roll.py 
old: 0.010861 new: 0.087167
>>> th.printing.debugprint(fn_new)
Elemwise{Composite{ElemAt{nin=3}(i0, ((ElemIdx{axis=0}(i0) + i1) % i2), 
ElemIdx{axis=1}(i0))}} [id A] ''   3
 |<TensorType(float32, matrix)> [id B]
 |InplaceDimShuffle{x,x} [id C] ''   1
 | |<TensorType(int32, scalar)> [id D]
 |InplaceDimShuffle{x,x} [id E] ''   2
   |Shape_i{0} [id F] ''   0
     |<TensorType(float32, matrix)> [id B]

However, for a larger elemwise graph, like this 
<https://gist.github.com/khaotik/e648b990b9799dd05e002887573f0b02> script, 
the graph is not fully fused: there are multiple Composite instances.
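
(Roughly, that script builds a 3x3 circular convolution as a sum of nine 
shifted reads, along these lines; again a sketch with the same assumed 
helpers, not the gist verbatim.)

import theano
import theano.tensor as T
from theano.tensor.padding import idx, at_idx

x = T.fmatrix()
w = T.fvector()  # the 9 kernel coefficients, flattened
h, v = x.shape[0], x.shape[1]
i0, i1 = idx(x, 0), idx(x, 1)
terms = []
for k, (di, dj) in enumerate([(a, b) for a in (-1, 0, 1)
                              for b in (-1, 0, 1)]):
    # read x[(i + di) % h, (j + dj) % v], scaled by the k-th coefficient
    terms.append(at_idx(x, (i0 + di) % h, (i1 + dj) % v) * w[k])
fn_new = theano.function([x, w], sum(terms))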

khaotik@KKPC:~/tmp$ thcpu python3 -i test_circconv.py 
old: 0.046644 new: 0.375561
>>> th.printing.debugprint(fn_new)
Elemwise{Composite{((ElemAt{nin=3}(i0, Composite{((i0 + i1) % i2)}(i1, ElemIdx{axis=0}(i0), i2), Composite{((i0 + i1) % i2)}(i1, ElemIdx{axis=1}(i0), i3)) * i4)
 + (ElemAt{nin=3}(i0, (ElemIdx{axis=0}(i0) % i2), Composite{((i0 + i1) % i2)}(i1, ElemIdx{axis=1}(i0), i3)) * i5)
 + (ElemAt{nin=3}(i0, Composite{((i0 + i1) % i2)}(i6, ElemIdx{axis=0}(i0), i2), Composite{((i0 + i1) % i2)}(i1, ElemIdx{axis=1}(i0), i3)) * i7)
 + (ElemAt{nin=3}(i0, Composite{((i0 + i1) % i2)}(i1, ElemIdx{axis=0}(i0), i2), (ElemIdx{axis=1}(i0) % i3)) * i8)
 + (ElemAt{nin=3}(i0, (ElemIdx{axis=0}(i0) % i2), (ElemIdx{axis=1}(i0) % i3)) * i9)
 + (ElemAt{nin=3}(i0, Composite{((i0 + i1) % i2)}(i6, ElemIdx{axis=0}(i0), i2), (ElemIdx{axis=1}(i0) % i3)) * i10)
 + (ElemAt{nin=3}(i0, Composite{((i0 + i1) % i2)}(i1, ElemIdx{axis=0}(i0), i2), Composite{((i0 + i1) % i2)}(i6, ElemIdx{axis=1}(i0), i3)) * i11)
 + (ElemAt{nin=3}(i0, (ElemIdx{axis=0}(i0) % i2), Composite{((i0 + i1) % i2)}(i6, ElemIdx{axis=1}(i0), i3)) * i12)
 + (ElemAt{nin=3}(i0, Composite{((i0 + i1) % i2)}(i6, ElemIdx{axis=0}(i0), i2), Composite{((i0 + i1) % i2)}(i6, ElemIdx{axis=1}(i0), i3)) * i13))}} [id A] ''   22
 |<TensorType(float32, matrix)> [id B]
 |TensorConstant{(1, 1) of -1} [id C]
 |InplaceDimShuffle{x,x} [id D] ''   21
 | |Shape_i{0} [id E] ''   10
 | ...
 < 30+ lines omitted >

Furthermore, even when the elemwise graph is fully fused, as in the first 
example, there is still a roughly 8x slowdown (0.087167 vs. 0.010861 in 
the timings above). There are probably other optimization problems in the 
current implementation.
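
Theano's built-in profiler should help narrow down where the time goes; 
something like this generic snippet, reusing the hypothetical graph from 
the first sketch above (x_val and shift stand for any concrete test 
inputs):

fn_new = theano.function([x, s], at_idx(x, row, col), profile=True)
fn_new(x_val, shift)       # run once or more to collect timings
fn_new.profile.summary()   # prints a per-Op breakdown of compute time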

On Tuesday, March 14, 2017 at 8:53:28 AM UTC+8, nouiz wrote:
>
> With this code snippet, I'm not able to run it. There are other problems 
> with the code that just make it not work. For example, what is "int32" in 
> your code? I tried a few things, but it didn't run.
>
> Can you give a full working example?
>
> Why do you make do_constant_folding() return False?
>
> There are a few reasons that could deactivate the fusion of elemwise, but 
> I don't see one that would apply in your case. One of them is not having C 
> code (you have it). Another is if the node is used by more than 1 other 
> node in the graph. We don't want to duplicate computation, so we don't 
> fuse them. I would need a working example to investigate more.
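>
> For example, in something like this minimal illustration (not your code), 
> exp(x) has two clients, so it will not be merged into either consumer:
>
> import theano
> import theano.tensor as T
>
> x = T.fmatrix()
> y = T.exp(x)   # y feeds both outputs below
> z1 = y + 1
> z2 = y * 2
> # exp(x) has two clients, so fusion will not merge it into either
> # consumer; that would duplicate its computation
> fn = theano.function([x], [z1, z2])
> theano.printing.debugprint(fn)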
>
> Fred
>
> On Thu, Mar 9, 2017 at 8:51 AM Adam Becker <junkk...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm close to a working PoC for the generalized elemwise Op (CPU for now). 
>> However, it appears the Op is not getting properly fused with other 
>> elemwise Ops.
>>
>> There are two new scalar Ops, ElemIdx and ElemAt, with respective 
>> Elemwise subclasses: TensorIdx and TensorAt.
>>
>> The definitions of the new Ops:
>>
>> class ElemIdx(ScalarOp):
>>     '''
>>     Gives tensor indices along an axis. All indices are computed on the
>>     fly during elemwise, so memory consumption is much lower.
>>     This operates on a tensor object while still being able to fuse
>>     with elemwise. It is similar to threadIdx.* in CUDA.
>>
>>     '''
>>     # TODO
>>     # - finish DOCS
>>     # - should be 0 inps -> 1 outs, like constant,
>>     #   however theano is not happy with 0 inps for now
>>     # - support negative axis
>>     # - make axis symbolic var?
>>     # - implement numpy.intp for output type?
>>     __props__ = ('axis',)
>>     nin = 1
>>     nout = 1
>>
>>
>>     def __init__(self, axis, **kwargs):
>>         super(ElemIdx, self).__init__(**kwargs)
>>         self.axis = axis
>>
>>     def c_code(self, node, name, inputs, outputs, sub):
>>         inp, = inputs
>>         out, = outputs
>>         axis = self.axis
>>         # protect substitutions at Elemwise
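>>         # ('%(l_sub)s' maps to itself through the locals() substitution
>>         # below, so the placeholder survives for Elemwise's later pass)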
>>         l_sub = '%(l_sub)s'
>>         r_sub = '%(r_sub)s'
>>         idx_var = 'IDX_%(inp)s_%(axis)d' % locals()
>>         code = '''
>>         #ifdef TENSOR_ELEMWISE
>>         %(out)s = %(l_sub)s%(idx_var)s%(r_sub)s;
>>         #endif
>>         ''' % locals()
>>         return code
>>
>>     # TODO def c_code_contiguous(self):
>>     def c_code_cache_version(self):
>>         return (0,)
>>
>>     def do_constant_folding(self, node):
>>         return False
>>
>>     def output_types(self, *inp_types):
>>         return (int32,)
>>
>> class ElemAt(ScalarOp):
>>     '''
>>     Similar to advanced subtensor, but works with elemwise.
>>     This is the opposite of ElemIdx.
>>     '''
>>     # TODO finish DOCS
>>     nout = 1
>>
>>
>>     def __init__(self, ndim, **kwargs):
>>         super(ElemAt, self).__init__(**kwargs)
>>         self.nin = 1+ndim
>>
>>     def c_code(self, node, name, inputs, outputs, sub):
>>         inp = inputs[0]
>>         out, = outputs
>>         idxs = inputs[1:]
>>         code = '%(out)s = %(inp)ster[' % locals()
>>         terms = []
>>         # protect nested substitutions at Elemwise
>>         l_sub = '%(l_sub)s'
>>         r_sub = '%(r_sub)s'
>>         for axis, idx in enumerate(idxs):
>>             strd_var = 'STRD_%(inp)s_%(axis)d' % locals()
>>             terms.append(
>>                 '%(idx)s*%(l_sub)s%(strd_var)s%(r_sub)s' % locals())
>>         code += ' + '.join(terms) + '];\n'
>>         return '''
>>         #ifdef TENSOR_ELEMWISE
>>         %s
>>         #endif\n''' % code
>>
>>     def c_code_cache_version(self):
>>         return (0,)
>>
>>     def do_constant_folding(self, node):
>>         return False
>>
>>     def output_types(self, inp_types):
>>         # pdb.set_trace()
>>         return inp_types[:1]
>>
>> class TensorIdx(Elemwise):
>>     # TODO DOCS
>>     __props__ = Elemwise.__props__
>>     def __init__(self, axis, **kwargs):
>>         super(TensorIdx, self).__init__(
>>             scalar_op=ElemIdx(axis),
>>             **kwargs)
>>
>>     def __str__(self):
>>         name = 'idx' if self.name is None else self.name
>>         axis = self.scalar_op.axis
>>         return '%(name)s{%(axis)d}' % locals()
>>
>>     def do_constant_folding(self, node):
>>         return False
>>
>> class TensorAt(Elemwise):
>>     # TODO DOCS
>>     __props__ = Elemwise.__props__
>>     def __init__(self, ndim, **kwargs):
>>         super(TensorAt, self).__init__(
>>             scalar_op=ElemAt(ndim),
>>             **kwargs)
>>
>>     def __str__(self):
>>         name = 'at' if self.name is None else self.name
>>         ndim = self.scalar_op.nin - 1
>>         return '%(name)s{%(ndim)dD}' % locals()
>>
>>     def do_constant_folding(self, node):
>>         return False
>>
>> def idx(x, axis):
>>     if not isinstance(axis, int):
>>         raise TypeError('axis must be integer')
>>     return TensorIdx(axis)(x)
>>
>> def at_idx(x, *idxs):
>>     return TensorAt(x.ndim)(x, *idxs)
>>
>> There are also many hacks done to elemwise.py and elemwise_cgen.py to 
>> make this work (link to branch 
>> <https://github.com/khaotik/Theano/blob/padop/theano/tensor/elemwise.py>; 
>> highly hacky/unstable though).
>>
>> When building graph:
>>
>> x = T.fmatrix()
>> i0 = idx(x, 0)
>> i1 = idx(x, 1)
>>
>>
>> fn0 = theano.function([x], i0+i1)
>> fn1 = theano.function([x], idx(i0+i1, 0)) # doesn't make sense, just for 
>> testing
>> fn2 = theano.function([x], at_idx(x, i0, i1))
>>
>>
>> dp = theano.printing.debugprint
>> dp(fn0)
>> dp(fn1)
>> dp(fn2)
>>
>> This gives:
>>
>> Elemwise{Composite{(ElemIdx{axis=0}(i0) + ElemIdx{axis=1}(i0))}} [id A] 
>> ''   0
>>  |<TensorType(float32, matrix)> [id B]
>>
>>
>> idx{0} [id A] ''   1
>>  |Elemwise{Composite{(ElemIdx{axis=0}(i0) + ElemIdx{axis=1}(i0))}} [id B] 
>> ''   0
>>    |<TensorType(float32, matrix)> [id C]
>>
>>
>> at{2D} [id A] ''   2
>>  |<TensorType(float32, matrix)> [id B]
>>  |idx{0} [id C] ''   1
>>  | |<TensorType(float32, matrix)> [id B]
>>  |idx{1} [id D] ''   0
>>    |<TensorType(float32, matrix)> [id B]
>>
>> Looks like the custom Op won't fuse its own subtree, though it can be 
>> fused as a child of a built-in elemwise Op. Any clue as to the cause of 
>> this?
>>
>> Thanks.
>>
>
