I tried to post a response earlier but I think it got lost. I managed to get the compile time down to something much shorter by using attention on only one of my three GRU layers. In retrospect this is perfectly reasonable, since multiple attention layers are probably overkill anyway. I might still try the bleeding-edge version to see whether it helps further, but for now this is fine for me. Thanks for the responses!
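For anyone who lands on this thread later, here is a minimal sketch of what that kind of stacking can look like: two plain GRU layers with attention only on the top one. The plain layer make_recurrent_plain, the weight helper, and the toy sizes are illustrative placeholders, not the actual network from this thread; make_recurrent_attention is the function quoted further down.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX
    rng = np.random.RandomState(0)

    def shared_w(*shape):
        # toy random weights, only so the sketch is executable
        return theano.shared(0.01 * rng.randn(*shape).astype(floatX))

    nin, nh = 16, 32   # toy sizes, not the real network

    def make_recurrent_plain(s0, x, W, U, b):
        """Hypothetical plain GRU layer (no attention); x is (time, batch, nin)."""
        xW = T.dot(x, W)   # precompute all input projections: (time, batch, 3*nh)

        def step(xW_t, s_tm1, U, b):
            nh = s_tm1.shape[1]
            pre = T.dot(s_tm1, U)                                    # (batch, 3*nh)
            z = T.nnet.sigmoid(xW_t[:, :nh] + pre[:, :nh])           # update gate
            r = T.nnet.sigmoid(xW_t[:, nh:2*nh] + pre[:, nh:2*nh])   # reset gate
            h = T.tanh(xW_t[:, 2*nh:] + r * pre[:, 2*nh:] + b)       # candidate state
            return (1 - z) * h + z * s_tm1

        s, _ = theano.scan(step, sequences=xW, outputs_info=s0,
                           non_sequences=[U, b])
        return s           # (time, batch, nh)

    x = T.tensor3('x')     # (time, batch, nin)
    s0 = T.alloc(np.asarray(0., dtype=floatX), x.shape[1], nh)

    # two plain GRU layers, attention only on the top one
    # (make_recurrent_attention is the function quoted below)
    h1 = make_recurrent_plain(s0, x,  shared_w(nin, 3 * nh), shared_w(nh, 3 * nh), shared_w(nh))
    h2 = make_recurrent_plain(s0, h1, shared_w(nh, 3 * nh),  shared_w(nh, 3 * nh), shared_w(nh))
    h3 = make_recurrent_attention(s0, h2,
                                  shared_w(nh, 3 * nh),   # W
                                  shared_w(nh, 3 * nh),   # V
                                  shared_w(nh, 3 * nh),   # Va_align
                                  shared_w(3 * nh),       # Vb_align
                                  shared_w(nh))           # b

Precomputing T.dot(x, W) outside the scan, as the quoted code also does, keeps the per-step graph small, which helps both compile time and run time.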
On Tuesday, July 19, 2016 at 3:58:45 PM UTC+2, nouiz wrote:
>
> Updating to the Theano development version should speed up the compilation:
>
> http://www.deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions
>
> Also, the first time you compile on a computer it will take more time. This
> is normal, as we compile C code, but we cache it, so subsequent compilations
> on the same computer (in fact, on computers sharing the same home directory)
> will be faster.
>
> We are working on making the optimization of Theano faster.
>
> Fred
>
> On Mon, Jul 18, 2016 at 9:07 AM, Doug <[email protected]> wrote:
>
>> Compile times of that length aren't out of the question; I've run across
>> them from time to time. They almost always come from advanced recurrent
>> functions -- multiple GRU layers with attention would fit that bill -- but
>> anything with a significant graph is going to take longer to compile (I've
>> run into long compile times with large convolutional ladder networks).
>> General advice is to make sure you aren't doing a scan inside of another
>> scan, and make sure you are on the latest code from GitHub, because scan
>> has received a lot of work in recent releases. You can also do a few runs
>> with fast_compile mode to make sure you don't have any issues with the
>> graph (bad shapes, etc.), then switch back to fast_run when you want to
>> let the network run for a while.
>>
>> On Sunday, July 17, 2016 at 7:39:43 PM UTC-4, [email protected] wrote:
>>>
>>> I've been implementing a multilayer GRU with an attention mechanism and
>>> it's taking forever to compile the gradients. I'm not sure of the exact
>>> time, but it's in the tens of minutes if not over an hour -- much longer
>>> than I've ever had with Theano otherwise.
>>>
>>> The network itself is too large to show here, but I'll include the
>>> GRU-attention layer (it's partially based on some code I found elsewhere
>>> on this list). The full network is three of these plus a few softmaxes
>>> coming off them.
>>>
>>> I would appreciate it if anyone could tell me whether such long compile
>>> times are to be expected or whether there's something wrong. I can
>>> confirm that it's really the gradients that slow down the compilation:
>>> when I just compute the forward pass, compilation is much faster.
>>>
>>> def make_recurrent_attention(s0, x, W, V, Va_align, Vb_align, b):
>>>     """
>>>     Make a recurrent network with an embedding 'emb', embed->hidden Wx, a
>>>     hidden-hidden Wh, and output params W, b. The network has nh hidden
>>>     states and a mbsize mbs.
>>>     """
>>>     # This is the recurrent step. This function is passed to theano.scan,
>>>     # so the order of arguments is important -- first the input sequences,
>>>     # then the previous returns from the function (see below).
>>>     def GRU(s_tm1, W, V, Va_align, Vb_align, b):
>>>         # size of hidden layer
>>>         nh = s_tm1.shape[1]
>>>         xW_t = softalign(s_tm1, Va_align, Vb_align)
>>>         # we compute the input to z, r and h in one operation,
>>>         # i.e. out = s_tm1 . (Vz, Vr, V) in the normal GRU
>>>         out = T.nnet.sigmoid(T.dot(s_tm1, V))
>>>         z = T.nnet.sigmoid(xW_t[:, :nh] + out[:, :nh])
>>>         r = T.nnet.sigmoid(xW_t[:, nh:2*nh] + out[:, nh:2*nh])
>>>         h = T.tanh(xW_t[:, 2*nh:] + out[:, 2*nh:] * r + b)
>>>         s_t = (1 - z) * h + z * s_tm1
>>>         return s_t
>>>
>>>     # The input multiplications do not need to be done sequentially,
>>>     # so we do them all at once here.
>>>     prev_layer = T.dot(x, W)
>>>     Ts, M, N = prev_layer.shape
>>>     # make this M x T x N -- this is the way the align code we used was written
>>>     prev_layer = prev_layer.dimshuffle(1, 0, 2)
>>>
>>>     def softalign(s_tm1, Va_align, Vb_align):
>>>         # generate M x N input features based on the last state
>>>         prev_t_to_align = T.dot(s_tm1, Va_align)  # M x N
>>>         # needed to combine with prev-layer info
>>>         prev_t_to_align = prev_t_to_align.dimshuffle(0, 'x', 1)  # M x 1 x N
>>>         e = T.dot(T.tanh(prev_layer + prev_t_to_align), Vb_align)
>>>         e = T.reshape(e, (M, Ts))  # get rid of singleton dim
>>>         alpha = T.nnet.softmax(e)
>>>         # this will be M x T x 1
>>>         alpha = alpha.dimshuffle(0, 1, 'x')
>>>         # this will be M x N
>>>         weighted_inp = T.sum(alpha * prev_layer, axis=1)
>>>         return weighted_inp
>>>
>>>     # Because we use attention we don't pass the last layer in as a
>>>     # sequence; rather we just walk through that many steps and let the
>>>     # attention mechanism learn which inputs we really want at each step.
>>>     h, _ = theano.scan(fn=GRU,
>>>                        outputs_info=s0,
>>>                        non_sequences=[W, V, Va_align, Vb_align, b],
>>>                        n_steps=prev_layer.shape[0])
>>>                        # , strict=True)
>>>     return h
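Following up on the fast_compile suggestion quoted above, here is a rough sketch of that workflow: compile the gradients once in FAST_COMPILE mode (almost no graph optimization) to catch shape or graph errors cheaply, then recompile in FAST_RUN for the real runs. It assumes the quoted make_recurrent_attention is in scope; the toy sizes, the weight helper w, and the placeholder cost are illustrative, not taken from the original network.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX

    def w(*shape):
        # toy weights, only to exercise the graph
        return theano.shared(0.01 * np.random.randn(*shape).astype(floatX))

    nin, nh = 16, 32
    params = [w(nin, 3 * nh),   # W
              w(nh, 3 * nh),    # V
              w(nh, 3 * nh),    # Va_align
              w(3 * nh),        # Vb_align
              w(nh)]            # b

    x = T.tensor3('x')          # (time, batch, nin)
    s0 = T.alloc(np.asarray(0., dtype=floatX), x.shape[1], nh)

    h = make_recurrent_attention(s0, x, *params)   # the layer quoted above
    cost = T.mean(h[-1] ** 2)                      # placeholder scalar cost
    grads = T.grad(cost, wrt=params)

    # quick compile with almost no optimization, to catch shape/graph errors cheaply
    f_check = theano.function([x], [cost] + grads, mode='FAST_COMPILE')
    f_check(np.random.randn(5, 4, nin).astype(floatX))

    # once that runs cleanly, recompile with full optimization for the long runs
    f_train = theano.function([x], [cost] + grads, mode='FAST_RUN')

The same switch can also be made without editing the script, by setting THEANO_FLAGS=mode=FAST_COMPILE (and later mode=FAST_RUN) in the environment.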
