Thanks Jesse,
After experimenting a lot I ended up wrapping *einsum* in a custom Op and
running it on the CPU. In my case the CPU + einsum version is faster than
the GPU + multiply + sum version.
This is the wrapper I'm using:
import os
import numpy as np

os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=cpu,openmp=True,openmp_elemwise_minsize=10'
os.environ['THEANO_FLAGS'] += ',allow_gc=False'
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'

import theano as th
import theano.tensor as T


class Einsum(th.Op):
    __props__ = ()
    # itypes = [ T.ftensor3, T.ftensor3 ]
    # otypes = [ T.fmatrix ]
    itypes = None
    otypes = None

    def __init__(self, code):
        self._code = code

    def make_node(self, *inputs):
        # x = th.tensor.as_tensor_variable( inputs[ 0 ] )
        # Note: using x_.type() is dangerous, as it copies x's broadcasting
        # behaviour
        # Infer the output type from the number of indices after '->'.
        out_ndim = len(self._code[self._code.rindex('->') + 2:])
        outputs = []
        if out_ndim == 0:
            outputs.append(T.fscalar())
        elif out_ndim == 1:
            outputs.append(T.fvector())
        elif out_ndim == 2:
            outputs.append(T.fmatrix())
        elif out_ndim == 3:
            outputs.append(T.ftensor3())
        else:
            raise NotImplementedError
        return th.Apply(self, inputs, outputs)

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        y = inputs[1]
        z = output_storage[0]
        z[0] = np.einsum(self._code, x, y)


def einsum(code, *inputs):
    out = Einsum(code)(*inputs)
    return out
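
For reference, a minimal usage sketch of the wrapper (the variable names are
just for illustration), matching the shapes from the earlier example:

prob_t = T.ftensor3('prob')
cases_t = T.ftensor3('cases')
result_t = einsum('ijk,ijk->ik', prob_t, cases_t)
f = th.function([prob_t, cases_t], result_t)

prob = np.random.random((1, 1000, 50)).astype(np.float32)
cases = np.random.random((1000, 1000, 50)).astype(np.float32)
out = f(prob, cases)        # shape (1000, 50)
result = out[:, None, :]    # shape (1000, 1, 50), as in the numpy version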
On Tuesday, May 9, 2017 at 2:22:38 AM UTC+3, Jesse Livezey wrote:
>
> I see, you can use batched_dot for that. I wrote a gist which compares the
> numpy matmul, theano batched_dot, and theano multiply and sum approaches.
> https://gist.github.com/JesseLivezey/42cabcf87aa0033410f7520933942127
>
> On GPU, the multiply and sum seems to be fastest, but it will also use
> more memory.
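>
> Roughly, the batched_dot version for the shapes below (prob (1, 1000, 50),
> cases (1000, 1000, 50)) could look like this (a sketch, not the exact code
> from the gist):
>
> prob_t = T.ftensor3('prob')
> cases_t = T.ftensor3('cases')
> # batch over the last axis: (50, 1, 1000) x (50, 1000, 1000) -> (50, 1, 1000)
> result_t = T.batched_dot(prob_t.dimshuffle(2, 0, 1),
>                          cases_t.dimshuffle(2, 1, 0)).dimshuffle(2, 1, 0)
> # result_t has shape (1000, 1, 50), like (cases * prob).sum(axis=1, keepdims=True)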
>
>
> On Monday, May 8, 2017 at 1:30:33 AM UTC-7, Šarūnas S. wrote:
>>
>> Currently, I have 3 approaches that are portable to theano:
>>
>> # 3D example
>> axis = 0
>> prob = np.random.random( ( 1, 1000, 50 ) )
>> cases = np.random.random( ( 1000, 1000, 50 ) )
>>
>> # Elementwise + sum
>> for i in xrange( 100 ):
>>     result = ( cases * prob ).sum( axis=1-axis, keepdims=True )
>>
>> # Loop version
>> result = np.zeros( ( 1000, 1, 50 ) )
>> for i in xrange( 50 ):
>>     result[ :, :, i ] = np.dot( cases[ :, :, i ], prob[ :, :, i ].T )
>>
>> # Block diagonal sparse dot version
>> prob_big = np.zeros( ( 1, 1000, 50, 50 ) )
>> cases_big = np.zeros( ( 1000, 1000, 50, 50 ) )
>>
>> for i in xrange( 50 ):
>>     prob_big[ :, :, i, i ] = prob[ :, :, i ]
>>     cases_big[ :, :, i, i ] = cases[ :, :, i ]
>>
>> intermediate = np.tensordot( prob_big, cases_big, axes=[ [ 0 ], [ 1 ] ] )
>> result = np.zeros( ( 1000, 1, 50 ) )
>> for i in range( 50 ):
>>     result[ :, :, i ] = intermediate[ :, :, i, i ]
>>
>> I think the one which structures this as a sparse block-diagonal matrix
>> would work best, since I've seen some support for block sparse matrices.
>> However, it looks like I would still need some loop for blocksparse to
>> iterate over all the blocks. Is there a way to somehow do all the blocks
>> at once and collect the diagonal without using scan?
>>
>> On Saturday, 6 May 2017 10:41:06 UTC+3, Šarūnas S. wrote:
>>>
>>> I have tried that, but to no avail. The problem is that I have to
>>> multiply on 2 axes, but sum only on 1.
>>>
>>> On Friday, 5 May 2017 19:23:12 UTC+3, Jesse Livezey wrote:
>>>>
>>>> I think tensordot should do what you want
>>>>
>>>> http://deeplearning.net/software/theano/library/tensor/basic.html#theano.tensor.tensordot
>>>> something like
>>>> result = T.tensordot(prob, cases, axes=1)
>>>>
>>>>
>>>>
>>>> On Friday, May 5, 2017 at 3:17:14 AM UTC-7, Šarūnas S. wrote:
>>>>>
>>>>> I was shown that in *numpy* I could speed it up in the following ways:
>>>>>
>>>>> result = np.einsum('ijk,ijk->ik', prob, cases)[:,None,:]
>>>>>
>>>>>
>>>>> result = np.matmul(prob.transpose(2,0,1), cases.T).T
>>>>>
>>>>>
>>>>> Both give me the expected speedup in *numpy*, but neither is
>>>>> implemented in *Theano*. Is there a way to do the same in *Theano* on
>>>>> the *GPU*?
>>>>>
>>>>>
>>>>>
>>>>> On Friday, 5 May 2017 11:15:26 UTC+3, Šarūnas S. wrote:
>>>>>>
>>>>>> In my current theano script the bottleneck is equivalent to the
>>>>>> following numpy code:
>>>>>>
>>>>>> import time
>>>>>> import numpy as np
>>>>>>
>>>>>> # 3D example
>>>>>> axis = 0
>>>>>> prob = np.random.random( ( 1, 1000, 50 ) )
>>>>>> cases = np.random.random( ( 1000, 1000, 50 ) )
>>>>>>
>>>>>> start = time.time( )
>>>>>> for i in xrange( 1000 ):
>>>>>>     result = ( cases * prob ).sum( axis=1-axis, keepdims=True )
>>>>>> print '3D naive method took {} seconds'.format( time.time() - start )
>>>>>> print result.shape
>>>>>> print
>>>>>>
>>>>>> I had seen in the 2D case that replacing elementwise + sum with a dot
>>>>>> product gave me a 5x speedup. Are there any theano matrix operations
>>>>>> that could help me out here?
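>>>>>>
>>>>>> For reference, the 2D analogue I mean is roughly this (a sketch with
>>>>>> made-up shapes):
>>>>>>
>>>>>> A = np.random.random( ( 1000, 1000 ) )
>>>>>> b = np.random.random( ( 1, 1000 ) )
>>>>>> slow = ( A * b ).sum( axis=1, keepdims=True )  # elementwise + sum
>>>>>> fast = np.dot( A, b.T )                        # same result via dot
>>>>>> assert np.allclose( slow, fast )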
>>>>>>
>>>>>