Hi,

Can you try with the latest development version of Theano, if you are
not doing that already?

It appears that the expression for the gradient of ReLU is numerically
unstable, and I think we fixed something related to that, possibly
since the last release.
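
In case it helps, here is a rough sketch of what I mean (I don't know
which formulation your code uses; the names below are just for
illustration):

import theano.tensor as T

x = T.matrix('x')

# Hand-written variants; the gradient graph Theano builds depends on
# the exact expression used.
relu_switch = T.switch(x > 0, x, 0)
relu_max = T.maximum(x, 0)

# Built-in helper available in recent Theano versions; it may be worth
# trying this instead of a hand-written expression.
relu_builtin = T.nnet.relu(x)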

On Thu, Aug 18, 2016, Arvid Fahlström Myrman wrote:
> Hi,
> 
> I'm currently trying to implement a simple neural network with built-in 
> support for maxout (Goodfellow et al. 2013). I'm implementing it by 
> storing a list of k weight matrices as a 3D array of size k×n×m, where n 
> is the size of the input and m is the number of units in the layer, with 
> a corresponding k×m bias matrix. I then compute the dot product between 
> the D×n input matrix and the 3D weight array, yielding a D×k×m array, 
> add the bias matrix, and take the maximum over the second dimension, 
> finally outputting a D×m matrix.
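> 
> In Theano code this roughly corresponds to the following sketch (the 
> variable names here are made up for illustration; the actual code is in 
> the gist linked below):
> 
> import numpy as np
> import theano
> import theano.tensor as T
> 
> k, n, m = 1, 2, 5     # k = 1 should reduce to a plain linear layer
> W = theano.shared(np.random.randn(k, n, m).astype('float32'), name='W')
> b = theano.shared(np.zeros((k, m), dtype='float32'), name='b')
> 
> x = T.matrix('x')     # D×n input batch
> h = T.dot(x, W) + b   # D×k×m (numpy-style dot of a 2D with a 3D array)
> y = T.max(h, axis=1)  # D×m output after the max over the second dimension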
> 
> Note that in the case where k = 1, this is equivalent to a simple linear 
> layer, with the max operation simply collapsing the D×1×m array into a 
> D×m matrix. I'm trying to exploit this fact to support activation 
> functions other than maxout, without having to decide as a special case 
> whether to use a 2D or a 3D weight array. However, I'm noticing that if 
> I combine the maxout operation with a ReLU activation function, even 
> with k = 1 (which should be equivalent to a standard ReLU layer), I 
> often get NaNs in the gradients of the weight arrays during gradient 
> descent.
> 
> I've created a minimal example demonstrating the issue here:
> 
> https://gist.github.com/arvidfm/2208c09865e731d8929bce100db83152#file-theano_maxout_test-py
> 
> The example constructs a 2-5-2 network with ReLU hidden units and a softmax 
> output layer. The weight arrays of the two layers are of size 1×2×5 and 
> 1×5×2 respectively. Standard batch gradient descent is performed on 
> randomly generated data to minimize the cross entropy. The training aborts 
> once a NaN value is detected in any of the network parameters. I've tried 
> running the example on two computers, with two different installations of 
> Theano, on both Python 2 and 3, and using both the CPU and the GPU, and in 
> each case NaN values are generally detected after a few thousand 
> iterations. The NaN values seem to always appear in the gradients of the 
> weight and bias arrays for the first (hidden) layer; never for the output 
> layer.
> 
> If I simply remove the first dimension from the weight and bias arrays and 
> skip the max operation, everything works as expected. Similarly, if I leave 
> out the ReLU operation, everything works (of course, k needs to be set to a 
> value larger than 1 to be able to separate the classes in that case). What 
> could be the issue here? Am I doing something wrong, and if not, is there 
> any way to circumvent the problem? I've tried debugging the example as 
> best I can, but interpreting the computation graph of the gradient is a 
> bit beyond my current ability. For what it's worth, I've included the 
> output from running with NanGuardMode enabled in the gist, in case it 
> provides any useful information.
> 
> Regards,
> Arvid Fahlström Myrman
> 


-- 
Pascal
