Hi,

I'm currently trying to implement a simple neural network with built-in 
support for maxout (Goodfellow et al. 2013). The way I'm implementing it is 
by storing a list of k weight matrices as a 3D array of size k×n×m where n 
is the size of the input and m is the number of units in the layer, with a 
corresponding k×m bias matrix. I then take the dot product of the D×n 
input matrix with the 3D weight array, yielding a D×k×m array, add the 
bias matrix, and finally take the maximum over the k dimension, producing 
a D×m output matrix.
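In NumPy terms, the computation looks roughly like this (just to illustrate the shapes; my actual code uses the corresponding Theano symbolic ops, and the sizes here are arbitrary):

```python
import numpy as np

D, n, m, k = 4, 3, 5, 2          # batch size, input size, layer size, maxout pieces
rng = np.random.RandomState(0)
X = rng.randn(D, n)              # input batch, D x n
W = rng.randn(k, n, m)           # k stacked weight matrices, k x n x m
b = rng.randn(k, m)              # k stacked bias vectors, k x m

Z = np.einsum('dn,knm->dkm', X, W) + b   # dot product -> D x k x m, plus bias
out = Z.max(axis=1)                      # max over the k dimension -> D x m
```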

Note that in the case where k = 1, this is equivalent to a simple linear 
layer, with the max operation simply collapsing the D×1×m array into a D×m 
2D matrix. I'm trying to exploit this fact to support activation functions 
other than maxout, without having to special-case whether to use a 2D or a 
3D weight array. However, I'm noticing that 
if I try to combine the maxout operation with a ReLU activation function, 
even with k = 1 (which should be equivalent to a standard ReLU layer), I 
often get NaNs in the gradient of the weight arrays during gradient descent.
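Concretely, the k = 1 equivalence of the forward pass is easy to check numerically (a NumPy sketch; the symbolic gradient in Theano is presumably where things differ):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(4, 3)
W2 = rng.randn(3, 5)             # weights of a plain linear layer
W3 = W2[np.newaxis]              # the same weights as a 1 x 3 x 5 array
b = rng.randn(1, 5)

# standard ReLU layer
plain = np.maximum(X.dot(W2) + b[0], 0)
# k = 1 "maxout" followed by ReLU: max over a singleton axis is a no-op
maxout = np.maximum((np.einsum('dn,knm->dkm', X, W3) + b).max(axis=1), 0)
```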

I've created a minimal example demonstrating the issue here:

https://gist.github.com/arvidfm/2208c09865e731d8929bce100db83152#file-theano_maxout_test-py

The example constructs a 2-5-2 network with ReLU hidden units and a softmax 
output layer. The weight arrays of the two layers are of size 1×2×5 and 
1×5×2 respectively. Standard batch gradient descent is performed on 
randomly generated data to minimize the cross entropy. The training aborts 
once a NaN value is detected in any of the network parameters. I've tried 
running the example on two computers, with two different installations of 
Theano, on both Python 2 and 3, and using both the CPU and the GPU, and in 
each case NaN values are generally detected after a few thousand 
iterations. The NaN values seem to always appear in the gradients of the 
weight and bias arrays for the first (hidden) layer; never for the output 
layer.
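The NaN check itself is nothing fancy; roughly the following (a sketch with a hypothetical helper name, not the exact code from the gist):

```python
import numpy as np

def any_nan(param_values):
    """Return True if any array in the list contains a NaN."""
    return any(np.isnan(p).any() for p in param_values)
```

In the actual script the arrays come from reading back each shared variable after an update step.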

If I simply remove the first dimension from the weight and bias arrays and 
skip the max operation, everything works as expected. Similarly, if I leave 
out the ReLU operation, everything works (though in that case k needs to 
be larger than 1 for the network to be able to separate the classes). What 
could be the issue here? Am I doing something wrong, and if not, is there 
any way to circumvent the problem? I've tried debugging the example the 
best I can, but trying to interpret the computation graph of the gradient 
is a bit beyond my current ability. For what it's worth, I've included the 
output from running with NanGuardMode enabled in the gist, in case it 
provides any useful information.

Regards,
Arvid Fahlström Myrman

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"theano-users" group.