Hi,

I'm currently trying to implement a simple neural network with built-in support for maxout (Goodfellow et al. 2013). The way I'm implementing it is by storing the list of k weight matrices of a layer as a single 3D array of size k×n×m, where n is the size of the input and m is the number of units in the layer, along with a corresponding k×m bias matrix. I then compute the dot product between the D×n input matrix and the 3D weight array, yielding a D×k×m array, add the bias matrix, and finally take the max over the second dimension, producing a D×m output matrix.
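To be concrete, the per-layer computation is just the following (a plain NumPy sketch of the forward pass with made-up sizes and names; the actual implementation uses the Theano equivalents):

import numpy as np

D, n, m, k = 4, 3, 5, 2        # batch size, input size, layer size, number of pieces
X = np.random.randn(D, n)      # input batch: D x n
W = np.random.randn(k, n, m)   # weights:     k x n x m
b = np.random.randn(k, m)      # biases:      k x m

Z = np.dot(X, W) + b           # dot sums over the last axis of X and the
                               # second-to-last axis of W, giving D x k x m
out = Z.max(axis=1)            # max over the k pieces: D x m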
Note that in the case where k = 1, this is equivalent to a simple linear layer, with the max operation simply collapsing the D×1×m array into a D×m matrix. I'm trying to exploit this fact to support activation functions other than maxout without having to decide, as a special case, whether to use a 2D or a 3D weight array.

However, I'm noticing that if I combine the maxout operation with a ReLU activation function, even with k = 1 (which should be equivalent to a standard ReLU layer), I often get NaNs in the gradients of the weight arrays during gradient descent.

I've created a minimal example demonstrating the issue here:
https://gist.github.com/arvidfm/2208c09865e731d8929bce100db83152#file-theano_maxout_test-py

The example constructs a 2-5-2 network with ReLU hidden units and a softmax output layer; the weight arrays of the two layers are of size 1×2×5 and 1×5×2 respectively. Standard batch gradient descent is performed on randomly generated data to minimize the cross-entropy, and training aborts as soon as a NaN value is detected in any of the network parameters.

I've tried running the example on two computers, with two different installations of Theano, on both Python 2 and 3, and using both the CPU and the GPU; in every case NaN values are generally detected after a few thousand iterations. The NaNs always seem to appear in the gradients of the weight and bias arrays of the first (hidden) layer, never those of the output layer. If I simply remove the first dimension from the weight and bias arrays and skip the max operation, everything works as expected. Similarly, if I leave out the ReLU operation, everything works (though in that case k needs to be set to a value larger than 1 for the network to be able to separate the classes).

What could be the issue here? Am I doing something wrong, and if not, is there a way to work around the problem? I've tried debugging the example as best I can, but interpreting the computation graph of the gradient is a bit beyond my current ability. For what it's worth, I've included the output from a run with NanGuardMode enabled in the gist, in case it provides any useful information.

Regards,
Arvid Fahlström Myrman
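P.S. For reference, the kind of setup the example uses looks roughly like this (a simplified sketch with illustrative names, sizes and initialization, not the exact code; see the gist for the full version):

import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
rng = np.random.RandomState(1234)

def maxout_dense(X, n_in, n_out, k):
    # weights stored as (k, n_in, n_out), biases as (k, n_out)
    W = theano.shared(rng.normal(scale=0.1,
                                 size=(k, n_in, n_out)).astype(floatX))
    b = theano.shared(np.zeros((k, n_out), dtype=floatX))
    z = T.dot(X, W) + b.dimshuffle('x', 0, 1)  # (D, k, n_out)
    return T.max(z, axis=1), [W, b]            # (D, n_out)

X = T.matrix('X')
y = T.ivector('y')

# 2-5-2 network: k = 1 maxout "collapse" followed by ReLU / softmax
h_pre, hidden_params = maxout_dense(X, 2, 5, k=1)
h = T.maximum(0., h_pre)
out_pre, output_params = maxout_dense(h, 5, 2, k=1)
p_y = T.nnet.softmax(out_pre)

params = hidden_params + output_params
cost = T.mean(T.nnet.categorical_crossentropy(p_y, y))
updates = [(param, param - 0.01 * grad)
           for param, grad in zip(params, T.grad(cost, params))]
train = theano.function([X, y], cost, updates=updates)

# randomly generated two-class data
data = rng.normal(size=(100, 2)).astype(floatX)
labels = rng.randint(0, 2, size=100).astype('int32')

for i in range(100000):
    train(data, labels)
    if any(np.isnan(param.get_value()).any() for param in params):
        print('NaN detected at iteration %d' % i)
        break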
