[jira] [Issue Comment Deleted] (MADLIB-1084) Graph - Personalized PageRank

2018-04-10 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1084:

Comment: was deleted

(was: 
{code:sql}
SELECT pagerank(
    'vertex',                     -- Vertex table name
    'id',                         -- Vertex id column
    'edge',                       -- Edge table name
    'src=start_id, dest=end_id',  -- Edge source and dest columns
    'pagerank_out'                -- Output table with PageRank
);
{code})

> Graph - Personalized PageRank
> -----------------------------
>
> Key: MADLIB-1084
> URL: https://issues.apache.org/jira/browse/MADLIB-1084
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Himanshu Pandey
>Priority: Major
> Fix For: v1.14
>
> Attachments: GraphTest.py
>
>
> Implement Personalized PageRank, which is a variant of regular PageRank.
> Please refer to  
> [http://madlib.apache.org/docs/latest/group__grp__pagerank.html] as a 
> starting point.
> Reference:
>  Neighborhood Formation and Anomaly Detection in Bipartite Graphs
>  [http://www.cs.cmu.edu/~deepay/mywww/papers/icdm05.pdf]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MADLIB-1084) Graph - Personalized PageRank

2018-04-10 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433133#comment-16433133
 ] 

Frank McQuillan commented on MADLIB-1084:
-----------------------------------------


{code:sql}
SELECT pagerank(
    'vertex',                     -- Vertex table name
    'id',                         -- Vertex id column
    'edge',                       -- Edge table name
    'src=start_id, dest=end_id',  -- Edge source and dest columns
    'pagerank_out'                -- Output table with PageRank
);
{code}
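For reference, the personalized variant differs from the regular call above mainly in where the random surfer teleports: uniformly over all vertices in regular PageRank, but only to a user-supplied set of source vertices in the personalized variant. A minimal Python sketch of that computation (illustrative only; the function name and the uniform-over-sources teleport vector are assumptions, not MADlib's implementation):

```python
def personalized_pagerank(edges, n, sources, damping=0.85, iters=50):
    """Power iteration for personalized PageRank.

    edges:   list of (src, dest) vertex-id pairs
    n:       number of vertices (ids 0..n-1)
    sources: personalization set; teleport mass goes only to these vertices
    """
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    # Teleport vector: concentrated on the personalization vertices
    # (regular PageRank would use 1/n everywhere instead).
    teleport = [1.0 / len(sources) if v in sources else 0.0 for v in range(n)]
    rank = teleport[:]
    for _ in range(iters):
        new = [0.0] * n
        for s, d in edges:
            new[d] += damping * rank[s] / out_deg[s]
        # Mass stranded at sink vertices is also redistributed via teleport.
        sink = sum(rank[v] for v in range(n) if out_deg[v] == 0)
        for v in range(n):
            new[v] += (1.0 - damping + damping * sink) * teleport[v]
        rank = new
    return rank
```

On a 3-cycle 0→1→2→0 personalized to vertex 0, the ranks stay a probability distribution and decay with distance from the source.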

> Graph - Personalized PageRank
> -----------------------------
>
> Key: MADLIB-1084
> URL: https://issues.apache.org/jira/browse/MADLIB-1084
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Himanshu Pandey
>Priority: Major
> Fix For: v1.14
>
> Attachments: GraphTest.py
>
>
> Implement Personalized PageRank, which is a variant of regular PageRank.
> Please refer to  
> [http://madlib.apache.org/docs/latest/group__grp__pagerank.html] as a 
> starting point.
> Reference:
>  Neighborhood Formation and Anomaly Detection in Bipartite Graphs
>  [http://www.cs.cmu.edu/~deepay/mywww/papers/icdm05.pdf]





[jira] [Commented] (MADLIB-1225) Sporadic install check failures in random forest

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433092#comment-16433092
 ] 

ASF GitHub Bot commented on MADLIB-1225:


Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/258


> Sporadic install check failures in random forest
> ------------------------------------------------
>
> Key: MADLIB-1225
> URL: https://issues.apache.org/jira/browse/MADLIB-1225
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: Random Forest
>Reporter: Nandish Jayaram
>Priority: Major
> Fix For: v1.14
>
>
> Install check for random forest fails sporadically. The failure happens in 
> the install-check test that deals with variable importance.
> The error in the log when a failure happens is:
> {code}
> SELECT
>  assert(cat_var_importance[1] > con_var_importance[1], 'class should be 
> important!'),
>  assert(cat_var_importance[1] > cat_var_importance[2], 'class should be 
> important!')
> FROM train_output_group;
> psql:/tmp/madlib.WW_EyD/recursive_partitioning/test/random_forest.sql_in.tmp:158:
>  ERROR: Failed assertion: class should be important! (seg0 slice1 
> 93e250c8-8924-4a80-5c68-1464f40b0395:25432 pid=91044)
> {code}
> The last RF install-check query that was run before the error was:
> {code}
> SELECT forest_train(
>  'dt_golf', -- source table
>  'train_output', -- output model table
>  'id', -- id column
>  'class::TEXT', -- response
>  'class, windy, temperature', -- features
>  NULL, -- exclude columns
>  NULL, -- no grouping
>  10, -- num of trees
>  1, -- num of random features
>  TRUE, -- importance
>  3, -- num_permutations
>  10, -- max depth
>  1, -- min split
>  1, -- min bucket
>  8, -- number of bins per continuous variable
>  'max_surrogates=0',
>  FALSE
>  );
> SELECT * from train_output_summary;
> -[ RECORD 1 ]-+
> method | forest_train
> is_classification | t
> source_table | dt_golf
> model_table | train_output
> id_col_name | id
> dependent_varname | class::TEXT
> independent_varnames | class,windy,temperature
> cat_features | class,windy
> con_features | temperature
> grouping_cols |
> num_trees | 10
> num_random_features | 1
> max_tree_depth | 10
> min_split | 1
> min_bucket | 1
> num_splits | 8
> verbose | f
> importance | t
> num_permutations | 3
> num_all_groups | 1
> num_failed_groups | 0
> total_rows_processed | 16
> total_rows_skipped | 0
> dependent_var_levels | "Don't Play","Play"
> dependent_var_type | text
> independent_var_types | text, boolean, double precision
> null_proxy | None
> SELECT * from train_output_group;
> -[ RECORD 1 ]--+---
> gid | 1
> success | t
> cat_n_levels | {2,2}
> cat_levels_in_text | {"Don't Play",Play,False,True}
> oob_error | 0.2000
> cat_var_importance | {0.0245,0.025487012987013}
> con_var_importance | {0}
> {code}
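A plausible source of the sporadic failure is the randomness inherent in permutation-based importance: the feature column is shuffled and the resulting accuracy drop is measured, so on a tiny dataset (16 rows here) the estimate is noisy and repeated runs can rank features differently. A rough sketch of the mechanism (hypothetical helper, not MADlib's random forest code):

```python
import random

def permutation_importance(predict, X, y, col, num_permutations, seed=None):
    """Importance of feature `col`: mean drop in accuracy when that
    column is randomly shuffled. Because the shuffles are random, the
    estimate is noisy on small datasets, which is why repeated runs can
    order features differently."""
    rng = random.Random(seed)
    base = sum(predict(row) == label for row, label in zip(X, y)) / len(y)
    drops = []
    for _ in range(num_permutations):
        perm = [row[col] for row in X]
        rng.shuffle(perm)
        # Rebuild rows with the shuffled column substituted in.
        Xp = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, perm)]
        acc = sum(predict(row) == label for row, label in zip(Xp, y)) / len(y)
        drops.append(base - acc)
    return sum(drops) / num_permutations
```

For a model that depends entirely on one balanced binary column, the expected drop is about 0.5, but any single permutation can land well away from that, which mirrors the flaky assertion above.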





[jira] [Commented] (MADLIB-1226) Add option for 1-hot encoding to minibatch preprocessor

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432862#comment-16432862
 ] 

ASF GitHub Bot commented on MADLIB-1226:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/madlib/pull/259

Minibatch: Add one-hot encoding option for int

JIRA: MADLIB-1226

Integer dependent variables can be used either in regression or
classification. To use them in classification, they need to be one-hot
encoded. This commit adds an option that lets users choose whether an
integer dependent variable should be one-hot encoded. The flag is ignored
if the variable is not of integer type.

Other changes include adding an appropriate test in install-check,
code cleanup and PEP8 conformance.
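The rule the commit describes can be sketched as follows (hypothetical Python helper for illustration, not the actual preprocessor code):

```python
def encode_dependent(values, one_hot_encode_int_dep_var=False):
    """Sketch of the described rule: boolean and text-like dependent
    variables are always one-hot encoded; scalar integers are encoded
    only when the flag is set, since they may also be regression targets."""
    is_int = all(isinstance(v, int) and not isinstance(v, bool)
                 for v in values)
    if is_int and not one_hot_encode_int_dep_var:
        return list(values)  # leave integers as-is for regression use
    # One-hot encode: one column per distinct level, in a stable order.
    levels = sorted(set(values), key=str)
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]
```

So `encode_dependent([3, 7, 3])` passes the integers through unchanged, while `encode_dependent([3, 7, 3], one_hot_encode_int_dep_var=True)` yields one-hot rows over the levels {3, 7}.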

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/minibatch_one_hot_encode

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #259


commit 4729973d4e477cfef42cb21f8b8a3778171a5a3d
Author: Rahul Iyer 
Date:   2018-04-10T19:34:23Z

Minibatch: Add one-hot encoding option for int

JIRA: MADLIB-1226

Integer dependent variables can be used either in regression or
classification. To use them in classification, they need to be one-hot
encoded. This commit adds an option that lets users choose whether an
integer dependent variable should be one-hot encoded. The flag is ignored
if the variable is not of integer type.

Other changes include adding an appropriate test in install-check,
code cleanup and PEP8 conformance.




> Add option for 1-hot encoding to minibatch preprocessor
> -------------------------------------------------------
>
> Key: MADLIB-1226
> URL: https://issues.apache.org/jira/browse/MADLIB-1226
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Utilities
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.14
>
>
> I was testing the MNIST dataset with the minibatch preprocessor + MLP and 
> could not get it to converge. It turned out to be user error (me) and not a 
> problem with convergence at all, because I forgot to 1-hot encode the 
> dependent variable.
> But I am wondering if other people might do the same thing that I did and get 
> confused.
> Here's what I did.  For this input data:
> {code}
> madlib=# \d+ public.mnist_train
>                            Table "public.mnist_train"
>  Column |   Type    |                         Modifiers                         | Storage  | Stats target | Description 
> --------+-----------+-----------------------------------------------------------+----------+--------------+-------------
>  y      | integer   |                                                           | plain    |              | 
>  x      | integer[] |                                                           | extended |              | 
>  id     | integer   | not null default nextval('mnist_train_id_seq'::regclass)  | plain    |              | 
> {code}
> I called minibatch preprocessor:
> {code}
> SELECT madlib.minibatch_preprocessor(
>     'mnist_train',         -- Source table
>     'mnist_train_packed',  -- Output table
>     'y',                   -- Dependent variable
>     'x'                    -- Independent variables
> );
> {code}
> then mlp:
> {code}
> SELECT madlib.mlp_classification(
>     'mnist_train_packed',    -- Source table from preprocessor output
>     'mnist_result',          -- Destination table
>     'independent_varname',   -- Independent
>     'dependent_varname',     -- Dependent
>     ARRAY[5],                -- Hidden layer sizes
>     'learning_rate_init=0.01,
>      n_iterations=20,
>      learning_rate_policy=exp, n_epochs=20,
>      lambda=0.0001,          -- Regularization
>      tolerance=0',
>     'tanh',                  -- Activation function
>     '',                      -- No weights
>     FALSE,                   -- No warmstart
>     TRUE);                   -- Verbose
> {code}
> with the result:
> {code}
> INFO:  Iteration: 2, Loss: <-79.5295531257>
> INFO:  Iteration: 3, Loss: <-79.529408892>
> INFO:  Iteration: 4, Loss: <-79.5291940436>
> INFO:  Iteration: 5, Loss: <-79.5288964944>
> INFO:  Iteration: 6, Loss: <-79.5285051451>
> INFO:  Iteration: 7, Loss: <-79.5280094708>
> INFO:  Iteration: 8, Loss: <-79.5273995189>
> INFO:  Iteration: 9, 

[jira] [Updated] (MADLIB-1226) Add option for 1-hot encoding to minibatch preprocessor

2018-04-10 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1226:

Description: 
I was testing the MNIST dataset with the minibatch preprocessor + MLP and 
could not get it to converge. It turned out to be user error (me) and not a 
problem with convergence at all, because I forgot to 1-hot encode the 
dependent variable.

But I am wondering if other people might do the same thing that I did and get 
confused.

Here's what I did.  For this input data:

{code}
madlib=# \d+ public.mnist_train
                           Table "public.mnist_train"
 Column |   Type    |                         Modifiers                         | Storage  | Stats target | Description 
--------+-----------+-----------------------------------------------------------+----------+--------------+-------------
 y      | integer   |                                                           | plain    |              | 
 x      | integer[] |                                                           | extended |              | 
 id     | integer   | not null default nextval('mnist_train_id_seq'::regclass)  | plain    |              | 
{code}

I called minibatch preprocessor:

{code}
SELECT madlib.minibatch_preprocessor(
    'mnist_train',         -- Source table
    'mnist_train_packed',  -- Output table
    'y',                   -- Dependent variable
    'x'                    -- Independent variables
);
{code}

then mlp:

{code}
SELECT madlib.mlp_classification(
    'mnist_train_packed',    -- Source table from preprocessor output
    'mnist_result',          -- Destination table
    'independent_varname',   -- Independent
    'dependent_varname',     -- Dependent
    ARRAY[5],                -- Hidden layer sizes
    'learning_rate_init=0.01,
     n_iterations=20,
     learning_rate_policy=exp, n_epochs=20,
     lambda=0.0001,          -- Regularization
     tolerance=0',
    'tanh',                  -- Activation function
    '',                      -- No weights
    FALSE,                   -- No warmstart
    TRUE);                   -- Verbose
{code}

with the result:

{code}
INFO:  Iteration: 2, Loss: <-79.5295531257>
INFO:  Iteration: 3, Loss: <-79.529408892>
INFO:  Iteration: 4, Loss: <-79.5291940436>
INFO:  Iteration: 5, Loss: <-79.5288964944>
INFO:  Iteration: 6, Loss: <-79.5285051451>
INFO:  Iteration: 7, Loss: <-79.5280094708>
INFO:  Iteration: 8, Loss: <-79.5273995189>
INFO:  Iteration: 9, Loss: <-79.525607>
{code}

So it did not error out, but it is clearly not operating on data in the right format.

I suggest 2 changes:

1) Add an explicit param to the mini-batch preprocessor for 1-hot encoding of 
scalar integer dependent variables (this JIRA).

2) Add a check to the MLP classification code that the dependent var has been 
1-hot encoded, and error out if it has not 
(https://issues.apache.org/jira/browse/MADLIB-1227).


Proposed interface:

{code}
minibatch_preprocessor( source_table,
output_table,
dependent_varname,
independent_varname,
grouping_col,
buffer_size,
one_hot_encode_int_dep_var
)
{code}
{code}
one_hot_encode_int_dep_var (optional)
BOOLEAN. Default: FALSE. Whether to one-hot encode dependent variables that 
are scalar integers. This parameter is ignored if the dependent variable is 
not a scalar integer.

More detail: the mini-batch preprocessor automatically encodes dependent 
variables that are boolean or character types such as text, char and varchar. 
However, scalar integers are a special case because they can be used in both 
classification and regression problems, so you must tell the mini-batch 
preprocessor whether to encode them or not. If you have already encoded the 
dependent variable yourself, you can ignore this parameter. Also, if you want 
to encode float values for some reason, cast them to text first.
{code}


 



  was:
I was testing MNIST dataset with minibatch preprocessor + MLP and could not get 
it to converge.   It turned out to be user error (me) and not a problem with 
convergence at all, because I forgot to 1-hot encode the dependent variable.

But I am wondering if other people might do the same thing that I did and get 
confused.

Here's what I did.  For this input data:

{code}
madlib=# \d+ public.mnist_train

                                              Table "public.mnist_train"

 Column |   Type    |                        Modifiers                         
| Storage  | Stats target | Description 


[jira] [Updated] (MADLIB-1227) In MLP classification with mini-batch, check for 1-hot encoding of dependent variable

2018-04-10 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1227:

Description: 
Related to
https://issues.apache.org/jira/browse/MADLIB-1226

Add a check to the MLP classification code to verify that the dependent var 
has been 1-hot encoded, and error out if it has not.

This is to avoid the case where INTs that have not been 1-hot encoded are 
passed as the dep var to MLP, which then runs classification and either fails 
to converge or gives erroneous results, with no notification to the user 
about the problem.
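The proposed guard amounts to validating the shape of the dependent variable before training: every row should be a 0/1 vector with exactly one 1. A minimal sketch (hypothetical helper for illustration, not MADlib's validation code):

```python
def check_one_hot_encoded(dep_rows):
    """Raise a clear error if the dependent variable rows are not one-hot
    vectors, instead of letting classification silently fail to converge."""
    for i, row in enumerate(dep_rows):
        ok = (isinstance(row, (list, tuple))
              and all(v in (0, 1) for v in row)
              and sum(row) == 1)
        if not ok:
            raise ValueError(
                "Dependent variable is not one-hot encoded (row %d: %r); "
                "one-hot encode it before running MLP classification."
                % (i, row))
```

With this in place, passing raw integers such as `[3, 7, 3]` fails immediately with an actionable message rather than producing a non-converging run.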


  was:
Related to
https://issues.apache.org/jira/browse/MADLIB-1226

Add check to MLP classification code to check that the dependent var has been 
1-hot encoded, and error out if that is not the case.  

This is to avoid the case of passing INTs as dep var to MLP that have not been 
1-hot encoded and having it run classification and give no results, or 
erroneous results.



> In MLP classification with mini-batch, check for 1-hot encoding of dependent 
> variable
> -----------------------------------------------------------------------------
>
> Key: MADLIB-1227
> URL: https://issues.apache.org/jira/browse/MADLIB-1227
> Project: Apache MADlib
>  Issue Type: Improvement
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1226
> Add a check to the MLP classification code to verify that the dependent var 
> has been 1-hot encoded, and error out if it has not.
> This is to avoid the case where INTs that have not been 1-hot encoded are 
> passed as the dep var to MLP, which then runs classification and either 
> fails to converge or gives erroneous results, with no notification to the 
> user about the problem.


