Frank McQuillan created MADLIB-1226:
---------------------------------------
             Summary: Add option for 1-hot encoding to minibatch preprocessor
                 Key: MADLIB-1226
                 URL: https://issues.apache.org/jira/browse/MADLIB-1226
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: Utilities
            Reporter: Frank McQuillan
             Fix For: v1.14

I was testing the MNIST dataset with the minibatch preprocessor + MLP and could not get it to converge. It turned out to be user error (me) and not a convergence problem at all: I forgot to 1-hot encode the dependent variable. But I am wondering whether other people might do the same thing I did and get confused.

Here's what I did. For this input data:

{code}
madlib=# \d+ public.mnist_train
                                      Table "public.mnist_train"
 Column |   Type    |                         Modifiers                         | Storage  | Stats target | Description
--------+-----------+-----------------------------------------------------------+----------+--------------+-------------
 y      | integer   |                                                           | plain    |              |
 x      | integer[] |                                                           | extended |              |
 id     | integer   | not null default nextval('mnist_train_id_seq'::regclass) | plain    |              |
{code}

I called the minibatch preprocessor:

{code}
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x'                    -- Independent variables
                                    );
{code}

then MLP:

{code}
SELECT madlib.mlp_classification(
    'mnist_train_packed',    -- Source table from preprocessor output
    'mnist_result',          -- Destination table
    'independent_varname',   -- Independent
    'dependent_varname',     -- Dependent
    ARRAY[5],                -- Hidden layer sizes
    'learning_rate_init=0.01,
     n_iterations=20,
     learning_rate_policy=exp,
     n_epochs=20,
     lambda=0.0001,
     tolerance=0',           -- Optimizer params (lambda is regularization)
    'tanh',                  -- Activation function
    '',                      -- No weights
    FALSE,                   -- No warmstart
    TRUE);                   -- Verbose
{code}

with the result:

{code}
INFO:  Iteration: 2, Loss: <-79.5295531257>
INFO:  Iteration: 3, Loss: <-79.529408892>
INFO:  Iteration: 4, Loss: <-79.5291940436>
INFO:  Iteration: 5, Loss: <-79.5288964944>
INFO:  Iteration: 6, Loss: <-79.5285051451>
INFO:  Iteration: 7, Loss: <-79.5280094708>
INFO:  Iteration: 8, Loss: <-79.5273995189>
INFO:  Iteration: 9, Loss: <-79.5266665607>
{code}

So it did not error out, but it clearly was not working on data in the right format (in hindsight, the negative loss values were the giveaway).

I suggest 2 changes:
1) Add an explicit param to the mini-batch preprocessor for 1-hot encoding of the dependent variable (this JIRA); one possible call shape is sketched below.
2) Add a check to the MLP classification code that the dependent var has been 1-hot encoded, and error out if that is not the case (JIRA xxx); a possible validation query is also sketched below.
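For reference, the step I missed can be done manually in plain SQL today. This is a minimal sketch assuming the labels are exactly the integers 0 through 9, as in MNIST; the output table and column names are illustrative:

{code}
-- Build a 1-hot encoded copy of mnist_train: e.g. y=3 becomes
-- {0,0,0,1,0,0,0,0,0,0}. Assumes the labels are exactly 0..9.
CREATE TABLE mnist_train_1hot AS
SELECT id,
       x,
       ARRAY(SELECT (y = d)::integer
             FROM generate_series(0, 9) AS d) AS y_1hot
FROM mnist_train;
{code}

This is ordinary PostgreSQL array construction, so the workaround needs no extra module; the point of change (1) is to save users from having to know they need it.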
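To make change (1) concrete, the new argument could look something like the following. This is only a sketch of one possible call shape, not a committed API: the flag name and its position after the existing optional grouping and buffer-size arguments are assumptions.

{code}
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x',                   -- Independent variables
                                     NULL,                  -- Grouping columns (default)
                                     NULL,                  -- Buffer size (default)
                                     TRUE                   -- Proposed flag: 1-hot encode an
                                                            -- integer dependent variable
                                    );
{code}

With the flag set, the packed output table would already carry the encoded dependent variable, so the mlp_classification call above would work unchanged.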
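For change (2), the check itself is cheap to express. A valid 1-hot row has exactly one element equal to 1 and the rest 0; here is a minimal sketch of the validation the MLP code could run, shown against the illustrative mnist_train_1hot table from above:

{code}
-- Returns the ids of rows whose dependent variable is NOT a valid
-- 1-hot vector. An empty result means the encoding looks right;
-- mlp_classification could raise an error when any rows come back.
SELECT id
FROM   mnist_train_1hot
WHERE  (SELECT sum(e) FROM unnest(y_1hot) AS e) <> 1
   OR  EXISTS (SELECT 1 FROM unnest(y_1hot) AS e WHERE e NOT IN (0, 1));
{code}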