Hi all,

I am trying to implement a vanilla Seq2Seq model with a GRU. The code works fine
in the non-mini-batch version, but after incorporating mini-batches the model
never converges, so something is definitely wrong. I presume it is either the
part where I try to ignore padded tokens or the way I am computing the cost.
Here is the relevant code of the decoder (the encoder is straightforward):


        def recurrence(msk, h_tm_prev, y_tm_prev):
            x_z = T.dot(self.emb[y_tm_prev], self.W_z) + self.b_z
            x_r = T.dot(self.emb[y_tm_prev], self.W_r) + self.b_r
            x_h = (T.dot(self.emb[y_tm_prev], self.W) +
                   T.dot(self.enc_h, self.c_h) + self.bh)

            z_t = self.inner_activation(x_z + T.dot(h_tm_prev, self.U_z))
            r_t = self.inner_activation(x_r + T.dot(h_tm_prev, self.U_r))
            hh_t = self.activation(x_h + T.dot(r_t * h_tm_prev, self.U))
            h_t = (T.ones_like(z_t) - z_t) * hh_t + z_t * h_tm_prev

            # needed to back-propagate errors
            y_d_t = (T.dot(h_t, self.V) + T.dot(self.enc_h, self.c_y) +
                     T.dot(self.emb[y_tm_prev], self.y_t1) + self.by)
            # ignore padded tokens, is this correct?
            y_d_t = T.batched_dot(y_d_t, msk)
            # y_d_t = y_d_t * msk.dimshuffle(0, 'x')
            y_d = T.clip(T.nnet.softmax(y_d_t), 0.0001, 0.9999)
            y_t = T.argmax(y_d, axis=1)
            return h_t, y_d, T.cast(y_t.flatten(), 'int32')


        [_, y_dist, y], _ = theano.scan(
            fn=recurrence,
            sequences=mask.dimshuffle(1, 0),  # ugly, but we have to go till the end (will go till max_len)
            outputs_info=[T.alloc(self.h0, self.enc_h.shape[0], hidden_dim),
                          None,
                          T.alloc(self.y0, self.enc_h.shape[0])]
        )


        self.y = y.dimshuffle(1, 0)
        self.y_dist = y_dist.dimshuffle(1, 0, 2)
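
One thing I am unsure about is whether the mask should instead be applied to the
hidden state inside recurrence, so that padded steps simply carry the previous
state forward. A rough sketch of what I mean, reusing the variable names from
above (I have not verified that this is the right fix):

            # sketch only: right after computing h_t inside recurrence
            msk_col = msk.dimshuffle(0, 'x')                    # (batch_size,) -> (batch_size, 1)
            h_t = msk_col * h_t + (1.0 - msk_col) * h_tm_prev   # padded steps keep the previous state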


    def negative_log_likelihood(self, y):

        def compute_cost(y_dist, target):
            return T.sum(T.nnet.categorical_crossentropy(y_dist, target))

        batched_cost, _ = theano.scan(
            fn=compute_cost,
            sequences=[self.y_dist, y],
            outputs_info=None
        )

        return T.mean(batched_cost)
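
For comparison, the alternative I have been considering is to leave the logits
untouched and instead apply the mask directly in the cost, roughly like the
standalone sketch below (masked_negative_log_likelihood is just an illustrative
name; y_dist, y and mask have the shapes listed further down). I am not sure
this is the right way either:

    import theano.tensor as T

    def masked_negative_log_likelihood(y_dist, y, mask):
        # flatten batch and time so categorical_crossentropy sees a matrix of
        # distributions and a vector of integer targets
        probs = y_dist.reshape((y_dist.shape[0] * y_dist.shape[1], y_dist.shape[2]))
        targets = y.flatten()
        ce = T.nnet.categorical_crossentropy(probs, targets)  # (batch_size * max_len,)
        ce = ce * mask.flatten()                               # zero out padded positions
        return T.sum(ce) / T.sum(mask)                         # average over real tokens only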


Dimensions of the relevant variables: mask -> (batch_size, max_len),
enc_h -> (batch_size, hidden_dim), X, Y -> (batch_size, max_len).
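
To make those shapes concrete, a toy padded batch would look like this (values
purely illustrative; I assume index 0 is the padding token):

    import numpy as np

    # 3 sequences padded to max_len = 5, with 0 as the padding index
    Y = np.array([[4, 9, 2, 0, 0],
                  [7, 3, 5, 8, 2],
                  [6, 2, 0, 0, 0]], dtype='int32')  # (batch_size, max_len)
    mask = (Y != 0).astype('float32')               # (batch_size, max_len), 1.0 on real tokens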

Any clues would be highly appreciated. Thanks!

In case anyone is interested, the complete code is here:
<https://github.com/uyaseen/neural-converse>
