Thanks Jesse,

After a lot of experimenting I ended up wrapping *numpy.einsum* in a custom 
Op and running it on the CPU. In my case the CPU + einsum version is faster 
than the GPU + multiply + sum version. 

This is the wrapper I'm using:

import os
import numpy as np


# Theano flags must be set before theano is imported.
os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=cpu,openmp=True,openmp_elemwise_minsize=10'
os.environ['THEANO_FLAGS'] += ',allow_gc=False'
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'


import theano as th
import theano.tensor as T


class Einsum(th.Op):
    # Include the einsum string in __props__ so that Ops built with
    # different subscripts are not treated as equal by Theano.
    __props__ = ('_code',)

    # Input/output types are determined per call in make_node, so the
    # itypes/otypes shortcuts are left unset.
    # itypes = [ T.ftensor3, T.ftensor3 ]
    # otypes = [ T.fmatrix ]
    itypes = None
    otypes = None

    def __init__(self, code):
        # The einsum subscript string, e.g. 'ijk,ijk->ik'.  An explicit
        # output specification ('->') is required.
        self._code = code

    def make_node(self, *inputs):
        # Note: using x.type() here would be dangerous, as it copies x's
        # broadcasting behaviour.
        inputs = [T.as_tensor_variable(x) for x in inputs]

        # The rank of the output is the number of index letters after '->'.
        out_ndim = len(self._code[self._code.rindex('->') + 2:])
        if out_ndim == 0:
            outputs = [T.fscalar()]
        elif out_ndim == 1:
            outputs = [T.fvector()]
        elif out_ndim == 2:
            outputs = [T.fmatrix()]
        elif out_ndim == 3:
            outputs = [T.ftensor3()]
        else:
            raise NotImplementedError

        return th.Apply(self, inputs, outputs)

    def perform(self, node, inputs, output_storage):
        # Run the contraction with numpy on the CPU; works for any number
        # of operands, not just two.
        output_storage[0][0] = np.einsum(self._code, *inputs)


def einsum(code, *inputs):
    """Symbolic wrapper around numpy.einsum for the given subscripts."""
    return Einsum(code)(*inputs)
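
For reference, this is roughly how I call the wrapper for the contraction 
from earlier in the thread (just a minimal sketch; the shapes and variable 
names are the ones from my earlier example):

# Symbolic inputs, float32 to match floatX above.
prob = T.ftensor3('prob')
cases = T.ftensor3('cases')

# Same contraction as np.einsum('ijk,ijk->ik', prob, cases) in numpy.
result = einsum('ijk,ijk->ik', prob, cases)
f = th.function([prob, cases], result)

prob_val = np.random.random((1, 1000, 50)).astype(np.float32)
cases_val = np.random.random((1000, 1000, 50)).astype(np.float32)
out = f(prob_val, cases_val)
# out.shape == (1000, 50); the middle axis can be added back with out[:, None, :]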


On Tuesday, May 9, 2017 at 2:22:38 AM UTC+3, Jesse Livezey wrote:
>
> I see, you can use batched_dot for that. I wrote a gist which compares the 
> numpy matmul, theano batched_dot, and theano multiply and sum approaches.
> https://gist.github.com/JesseLivezey/42cabcf87aa0033410f7520933942127
>
> On GPU, the multiply and sum seems to be fastest, but it will also use 
> more memory.
>
>
> On Monday, May 8, 2017 at 1:30:33 AM UTC-7, Šarūnas S. wrote:
>>
>> Currently, I have 3 approaches that are portable to theano:
>>
>> # 3D example
>> axis = 0
>> prob = np.random.random( ( 1, 1000, 50 ) )
>> cases = np.random.random( ( 1000, 1000, 50 ) )
>>
>> # Elementwise + sum
>> for i in xrange( 100 ):
>>     result = ( cases * prob ).sum( axis=1-axis, keepdims=True )
>>
>> # Loop version
>> result = np.zeros( ( 1000, 1, 50 ) )
>> for i in xrange( 50 ):
>>     result[ :, :, i ] = np.dot( cases[ :, :, i ], prob[ :, :, i ].T )
>>
>> # Block diagonal sparse dot version
>> prob_big = np.zeros( ( 1, 1000, 50, 50 ) )
>> cases_big = np.zeros( ( 1000, 1000, 50, 50 ) )
>>
>> for i in xrange( 50 ):
>>     prob_big[ :, :, i, i ] = prob[ :, :, i ]
>>     cases_big[ :, :, i, i ] = cases[ :, :, i ]
>>
>> intermediate = np.tensordot( prob_big, cases_big, axes=[ [ 0 ], [ 1 ] ] )
>> result = np.zeros( ( 1000, 1, 50 ) )
>> for i in range( 50 ):
>>     result[ :, :, i ] = intermediate[ :, :, i, i ]
>>
>> I think the one which structures this as a sparse block-diagonal matrix 
>> would work best, since I've seen some support for block sparse matrices. 
>> However, it looks like I would still need some loop for blocksparse to 
>> iterate over all the blocks. Is there a way to somehow do all the blocks 
>> at once and collect the diagonal without using scan? 
>>
>> On Saturday, 6 May 2017 10:41:06 UTC+3, Šarūnas S. wrote:
>>>
>>> I have tried that, but to no avail. The problem is that I have to 
>>> multiply on 2 axes, but sum only on 1. 
>>>
>>> On Friday, 5 May 2017 19:23:12 UTC+3, Jesse Livezey wrote:
>>>>
>>>> I think tensordot should do what you want
>>>>
>>>> http://deeplearning.net/software/theano/library/tensor/basic.html#theano.tensor.tensordot
>>>> something like
>>>> result = T.tensordot(prob, cases, axes=1)
>>>>
>>>>
>>>>
>>>> On Friday, May 5, 2017 at 3:17:14 AM UTC-7, Šarūnas S. wrote:
>>>>>
>>>>> I was shown that in *numpy* I could speed it up in the following way:
>>>>>
>>>>> result = np.einsum('ijk,ijk->ik', prob, cases)[:,None,:]
>>>>>
>>>>>
>>>>> result = np.matmul(prob.transpose(2,0,1), cases.T).T
>>>>>
>>>>>
>>>>> Both give me the expected speedup in *numpy*, but neither is 
>>>>> implemented in *Theano*. Is there a way to do the same in *Theano* on 
>>>>> the *GPU*?
>>>>>
>>>>>
>>>>>
>>>>> On Friday, 5 May 2017 11:15:26 UTC+3, Šarūnas S. wrote:
>>>>>>
>>>>>> In my current theano script the bottleneck is equivalent to the 
>>>>>> following numpy code:
>>>>>>
>>>>>> import time
>>>>>> import numpy as np
>>>>>>
>>>>>> # 3D example
>>>>>> axis = 0
>>>>>> prob = np.random.random( ( 1, 1000, 50 ) )
>>>>>> cases = np.random.random( ( 1000, 1000, 50 ) )
>>>>>>
>>>>>> start = time.time(  )
>>>>>> for i in xrange( 1000 ):
>>>>>>     result = ( cases * prob ).sum( axis=1-axis, keepdims=True )
>>>>>> print '3D naive method took {} seconds'.format( time.time() - start )
>>>>>> print result.shape
>>>>>> print
>>>>>>
>>>>>> I had seen in the 2D case that replacing elementwise + sum with a dot 
>>>>>> product gave me a 5x speedup. Are there any theano matrix operations 
>>>>>> that could help me out here? 
>>>>>>
>>>>>
