I've been implementing a multilayer GRU with an attention mechanism and it's 
taking forever to compile the gradients.  I haven't timed it exactly, but it's 
in the tens of minutes, if not over an hour -- much longer than I've ever 
seen with Theano otherwise.

The network itself is too large to show here, but I'll include the 
GRU-attention layer (it's partially based on some code I found elsewhere on 
this list).  The full network is three of these stacked plus a few softmaxes 
coming off them; there's a rough sketch of how they fit together after the 
layer code below.

I would appreciate it if anyone could tell me whether such long compile 
times are just to be expected or whether there is something wrong.  I can 
confirm that it really is the gradients that slow down the compilation: 
when I compile only the forward pass, compile times are much shorter.
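
Concretely, this is the kind of check I mean (a minimal sketch with a tiny 
stand-in graph; in practice the cost, parameters, and inputs come from the 
full model).  Comparing against mode='FAST_COMPILE', which applies far fewer 
graph optimizations, also helps separate optimizer time from C-code 
generation:

import time
import numpy as np
import theano
import theano.tensor as T

# stand-in graph -- swap in the real model's symbolic inputs, loss and
# shared parameters
x = T.matrix('x')
W = theano.shared(np.zeros((16, 8), dtype=theano.config.floatX), name='W')
cost = T.nnet.sigmoid(T.dot(x, W)).sum()
params = [W]

t0 = time.time()
fwd_fn = theano.function([x], cost)
print('forward-only compile: %.1fs' % (time.time() - t0))

t0 = time.time()
grad_fn = theano.function([x], T.grad(cost, params))
print('gradient compile: %.1fs' % (time.time() - t0))

# fewer graph optimizations -- useful for telling whether the time goes
# into the optimizer or into generating and compiling C code
t0 = time.time()
fast_fn = theano.function([x], T.grad(cost, params), mode='FAST_COMPILE')
print('gradient compile (FAST_COMPILE): %.1fs' % (time.time() - t0))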


import theano
import theano.tensor as T


def make_recurrent_attention(s0, x, W, V, Va_align, Vb_align, b):
    """
    Build one GRU layer with a soft-attention input.  s0 is the initial
    hidden state (mbs x nh), x is the input sequence (T x mbs x nin),
    W projects the inputs to 3*nh alignment features, V is the
    hidden-to-hidden weight (nh x 3*nh), Va_align and Vb_align are the
    alignment parameters, and b is the candidate bias (nh,).
    """
    # This is the recurrent step.  It is passed to theano.scan, so the order
    # of arguments matters: any input sequences come first, then the previous
    # outputs of the function, then the non-sequences (see below).
    def GRU(s_tm1, W, V, Va_align, Vb_align, b):
        # size of the hidden layer
        nh = s_tm1.shape[1]
        # attention-weighted input features for this step (M x 3*nh);
        # W itself is only used outside, through prev_layer, but is passed
        # as a non-sequence so it shows up explicitly in the scan graph
        xW_t = softalign(s_tm1, Va_align, Vb_align)
        # we compute the hidden-to-hidden contribution to z, r and h in one
        # operation, so out = s_tm1 . (Vz, Vr, V) as in the normal GRU
        out = T.dot(s_tm1, V)
        # update and reset gates
        z = T.nnet.sigmoid(xW_t[:, :nh] + out[:, :nh])
        r = T.nnet.sigmoid(xW_t[:, nh:2*nh] + out[:, nh:2*nh])
        # candidate state; the reset gate scales the recurrent contribution
        h = T.tanh(xW_t[:, 2*nh:] + out[:, 2*nh:] * r + b)
        # interpolate between the previous state and the candidate
        s_t = (1 - z) * h + z * s_tm1
        return s_t
    # the input multiplications do not need to be done sequentially, so we do
    # them all at once here; prev_layer is T x M x N (with N = 3*nh)
    prev_layer = T.dot(x, W)
    Ts, M, N = prev_layer.shape
    # make this M x T x N -- this is the way the align code we used was written
    prev_layer = prev_layer.dimshuffle(1, 0, 2)
    def softalign(s_tm1, Va_align, Vb_align):
        # defined after GRU but only called once scan runs, so the closure
        # over prev_layer, M and Ts works
        # project the previous state into the alignment space (M x N)
        prev_t_to_align = T.dot(s_tm1, Va_align)
        # add a broadcastable time axis so it combines with prev_layer
        prev_t_to_align = prev_t_to_align.dimshuffle(0, 'x', 1)  # M x 1 x N
        # alignment scores for every time step
        e = T.dot(T.tanh(prev_layer + prev_t_to_align), Vb_align)
        e = T.reshape(e, (M, Ts))  # drop the singleton dimension
        alpha = T.nnet.softmax(e)
        # M x T x 1 so it broadcasts against prev_layer
        alpha = alpha.dimshuffle(0, 1, 'x')
        # attention-weighted sum over time: M x N
        weighted_inp = T.sum(alpha * prev_layer, axis=1)
        return weighted_inp
    # Because we use attention we don't pass the last layer in as a sequence;
    # rather we just walk through that many (Ts) steps and let the attention
    # mechanism learn which inputs we really want at each step.  Note that
    # n_steps uses Ts, the sequence length taken before the dimshuffle above.
    h, _ = theano.scan(fn=GRU,
                       outputs_info=s0,
                       non_sequences=[W, V, Va_align, Vb_align, b],
                       n_steps=Ts)
                       # strict=True could also be passed here
    return h
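

For completeness, this is roughly how I build and compile a stack of these 
(a minimal sketch only -- the sizes, the dummy cost, and the single softmax 
head are placeholders, and the real network has three layers rather than 
two):

import numpy as np
import theano
import theano.tensor as T

# placeholder sizes, not the real ones
nin, nh = 50, 64
rng = np.random.RandomState(0)

def shared(shape, name):
    vals = 0.01 * rng.standard_normal(shape)
    return theano.shared(vals.astype(theano.config.floatX), name=name)

def layer_params(nin, nh, prefix):
    # W: input -> 3*nh features, V: hidden -> hidden (z, r, h blocks),
    # Va/Vb: alignment parameters, b: candidate bias
    return [shared((nin, 3 * nh), prefix + '_W'),
            shared((nh, 3 * nh), prefix + '_V'),
            shared((nh, 3 * nh), prefix + '_Va'),
            shared((3 * nh,), prefix + '_Vb'),
            shared((nh,), prefix + '_b')]

x = T.tensor3('x')   # T x mbs x nin
s0 = T.matrix('s0')  # mbs x nh

# stacked GRU-attention layers: the hidden sequence of one layer is the
# input sequence of the next
p1 = layer_params(nin, nh, 'l1')
p2 = layer_params(nh, nh, 'l2')
h1 = make_recurrent_attention(s0, x, *p1)
h2 = make_recurrent_attention(s0, h1, *p2)

# a softmax head off the top layer's final state, with a dummy cost
Wout = shared((nh, 10), 'Wout')
bout = shared((10,), 'bout')
y = T.nnet.softmax(T.dot(h2[-1], Wout) + bout)
cost = -T.log(y[:, 0]).mean()

params = p1 + p2 + [Wout, bout]
grad_fn = theano.function([x, s0], T.grad(cost, params))

It's this last theano.function call on the gradients that takes all the time 
for me; the forward-only version compiles much faster.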

