I've been implementing a multilayer GRU with an attention mechanism and it's
taking forever to compile the gradients. I don't have an exact figure, but
it's in the tens of minutes, possibly over an hour -- much longer than I've
ever seen with Theano before.
The network itself is too large to show here, but I'll include the
GRU-attention layer (it's partially based on some code I found elsewhere on
this list). The full network is three of these layers plus a few softmaxes
coming off them.
I would appreciate it if anyone could tell me whether such long compile
times are to be expected or whether something is wrong. I can confirm that
it's really the gradients that are slowing down the compilation: when I
compile just the forward pass, compilation is much faster.
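Roughly, I'm comparing something like the following (a minimal sketch with a
toy scan step standing in for the real network; toy_step and the shapes are
just placeholders, not my actual code):

import time
import numpy
import theano
import theano.tensor as T

x = T.tensor3('x')  # (time, batch, features), placeholder shapes
W = theano.shared(numpy.random.randn(50, 50).astype(theano.config.floatX))

def toy_step(x_t, s_tm1, W):
    # stand-in for the real GRU/attention step
    return T.tanh(T.dot(x_t, W) + T.dot(s_tm1, W))

s0 = T.zeros((x.shape[1], 50))
h, _ = theano.scan(fn=toy_step, sequences=x, outputs_info=s0,
                   non_sequences=[W])
cost = h[-1].sum()

t0 = time.time()
f_fwd = theano.function([x], cost)              # forward pass only
print('forward compile: %.1fs' % (time.time() - t0))

t0 = time.time()
f_grad = theano.function([x], T.grad(cost, W))  # forward + gradient
print('gradient compile: %.1fs' % (time.time() - t0))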
import theano
import theano.tensor as T

def make_recurrent_attention(s0, x, W, V, Va_align, Vb_align, b):
    """
    Build a GRU layer with a soft-attention ('softalign') input. s0 is the
    initial hidden state (mbs x nh), x is the sequence coming from the
    previous layer, W projects that sequence, V holds the recurrent weights
    for z, r and h stacked side by side, Va_align/Vb_align are the alignment
    parameters and b is the bias for the candidate state.
    """
    # This is the recurrent step, passed to theano.scan, so the order of
    # arguments is important -- first the input sequences, then the previous
    # returns from the function, then the non_sequences (see below).
    def GRU(s_tm1, W, V, Va_align, Vb_align, b):
        # size of hidden layer
        nh = s_tm1.shape[1]
        # attention-weighted input for this step
        xW_t = softalign(s_tm1, Va_align, Vb_align)
        # we compute the input to z, r and h in one operation,
        # so out = s_tm1 . (Vz, Vr, V) in the normal GRU
        out = T.nnet.sigmoid(T.dot(s_tm1, V))
        z = T.nnet.sigmoid(xW_t[:, :nh] + out[:, :nh])
        r = T.nnet.sigmoid(xW_t[:, nh:2 * nh] + out[:, nh:2 * nh])
        h = T.tanh(xW_t[:, 2 * nh:] + out[:, 2 * nh:] * r + b)
        s_t = (1 - z) * h + z * s_tm1
        return s_t

    # The input multiplications do not need to be done sequentially, so we
    # do them all at once here.
    prev_layer = T.dot(x, W)
    Ts, M, N = prev_layer.shape
    # make this M x T x N -- this is the way the align code we used was written
    prev_layer = prev_layer.dimshuffle(1, 0, 2)

    def softalign(s_tm1, Va_align, Vb_align):
        # generate M x N input features based on the last state
        prev_t_to_align = T.dot(s_tm1, Va_align)  # M x N
        # needed to combine with the previous-layer info
        prev_t_to_align = prev_t_to_align.dimshuffle(0, 'x', 1)  # M x 1 x N
        e = T.dot(T.tanh(prev_layer + prev_t_to_align), Vb_align)
        e = T.reshape(e, (M, Ts))  # get rid of the singleton dim
        alpha = T.nnet.softmax(e)
        # this will be M x T x 1
        alpha = alpha.dimshuffle(0, 1, 'x')
        # this will be M x N
        weighted_inp = T.sum(alpha * prev_layer, axis=1)
        return weighted_inp

    # Because we use attention we don't pass the last layer in as a sequence;
    # rather we just walk through that many steps and use the attention
    # mechanism to learn which inputs we really want at each step.
    h, _ = theano.scan(fn=GRU,
                       outputs_info=s0,
                       non_sequences=[W, V, Va_align, Vb_align, b],
                       n_steps=prev_layer.shape[0])
                       # strict=True)
    return h
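For context, this is roughly how one of these layers gets used -- a rough
sketch only, with made-up sizes and a placeholder init() that is not my real
initialisation; the full network stacks three such layers:

import numpy
import theano
import theano.tensor as T

def init(*shape):
    # placeholder initialisation, not the real one
    vals = 0.01 * numpy.random.randn(*shape)
    return theano.shared(vals.astype(theano.config.floatX))

nin, nh, nclasses = 100, 128, 10  # made-up sizes
x = T.tensor3('x')                # (time, batch, nin)
y = T.ivector('y')                # targets for the softmax readout
s0 = T.zeros((x.shape[1], nh))

# one GRU-attention layer; the full network stacks three of these
W = init(nin, 3 * nh)
V = init(nh, 3 * nh)
Va_align = init(nh, 3 * nh)
Vb_align = init(3 * nh)
b = init(nh)
h = make_recurrent_attention(s0, x, W, V, Va_align, Vb_align, b)

# a softmax readout coming off the last state of the layer
Wo = init(nh, nclasses)
p = T.nnet.softmax(T.dot(h[-1], Wo))
cost = -T.mean(T.log(p[T.arange(y.shape[0]), y]))

params = [W, V, Va_align, Vb_align, b, Wo]
grads = T.grad(cost, params)      # building these is the slow part
lr = numpy.asarray(0.01, dtype=theano.config.floatX)
train = theano.function([x, y], cost,
                        updates=[(p_, p_ - lr * g)
                                 for p_, g in zip(params, grads)])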