[ 
https://issues.apache.org/jira/browse/MADLIB-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandish Jayaram updated MADLIB-1225:
------------------------------------
    Description: 
Install check seems to fail for random forest sporadically. The failure happens 
for the test which deals with variable importance in the install check.

The error in the log when a failure happens is:
{code}
SELECT
 assert(cat_var_importance[1] > con_var_importance[1], 'class should be 
important!'),
 assert(cat_var_importance[1] > cat_var_importance[2], 'class should be 
important!')
FROM train_output_group;
psql:/tmp/madlib.WW_EyD/recursive_partitioning/test/random_forest.sql_in.tmp:158:
 ERROR: Failed assertion: class should be important! (seg0 slice1 
93e250c8-8924-4a80-5c68-1464f40b0395:25432 pid=91044)
{code}
The last RF install-check query that was run before the error was:
{code}
SELECT forest_train(
 'dt_golf', -- source table
 'train_output', -- output model table
 'id', -- id column
 'class::TEXT', -- response
 'class, windy, temperature', -- features
 NULL, -- exclude columns
 NULL, -- no grouping
 10, -- num of trees
 1, -- num of random features
 TRUE, -- importance
 3, -- num_permutations
 10, -- max depth
 1, -- min split
 1, -- min bucket
 8, -- number of bins per continuous variable
 'max_surrogates=0',
 FALSE
 );
SELECT * from train_output_summary;
-[ RECORD 1 ]---------+--------------------------------
method | forest_train
is_classification | t
source_table | dt_golf
model_table | train_output
id_col_name | id
dependent_varname | class::TEXT
independent_varnames | class,windy,temperature
cat_features | class,windy
con_features | temperature
grouping_cols |
num_trees | 10
num_random_features | 1
max_tree_depth | 10
min_split | 1
min_bucket | 1
num_splits | 8
verbose | f
importance | t
num_permutations | 3
num_all_groups | 1
num_failed_groups | 0
total_rows_processed | 16
total_rows_skipped | 0
dependent_var_levels | "Don't Play","Play"
dependent_var_type | text
independent_var_types | text, boolean, double precision
null_proxy | None

SELECT * from train_output_group;
-[ RECORD 1 ]------+---------------------------------------
gid | 1
success | t
cat_n_levels | \{2,2}
cat_levels_in_text | \{"Don't Play",Play,False,True}
oob_error | 0.20000000000000000000
cat_var_importance | \{0.0244444444444445,0.025487012987013}
con_var_importance | \{0}
{code}

  was:Install check seems to fail for random forest sporadically. The failure 
happens for the test which deals with variable importance in the install check.


> Sporadic install check failures in random forest
> ------------------------------------------------
>
>                 Key: MADLIB-1225
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1225
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Nandish Jayaram
>            Priority: Major
>             Fix For: v1.14
>
>
> Install check seems to fail for random forest sporadically. The failure 
> happens for the test which deals with variable importance in the install 
> check.
> The error in the log when a failure happens is:
> {code}
> SELECT
>  assert(cat_var_importance[1] > con_var_importance[1], 'class should be 
> important!'),
>  assert(cat_var_importance[1] > cat_var_importance[2], 'class should be 
> important!')
> FROM train_output_group;
> psql:/tmp/madlib.WW_EyD/recursive_partitioning/test/random_forest.sql_in.tmp:158:
>  ERROR: Failed assertion: class should be important! (seg0 slice1 
> 93e250c8-8924-4a80-5c68-1464f40b0395:25432 pid=91044)
> {code}
> The last RF install-check query that was run before the error was:
> {code}
> SELECT forest_train(
>  'dt_golf', -- source table
>  'train_output', -- output model table
>  'id', -- id column
>  'class::TEXT', -- response
>  'class, windy, temperature', -- features
>  NULL, -- exclude columns
>  NULL, -- no grouping
>  10, -- num of trees
>  1, -- num of random features
>  TRUE, -- importance
>  3, -- num_permutations
>  10, -- max depth
>  1, -- min split
>  1, -- min bucket
>  8, -- number of bins per continuous variable
>  'max_surrogates=0',
>  FALSE
>  );
> SELECT * from train_output_summary;
> -[ RECORD 1 ]---------+--------------------------------
> method | forest_train
> is_classification | t
> source_table | dt_golf
> model_table | train_output
> id_col_name | id
> dependent_varname | class::TEXT
> independent_varnames | class,windy,temperature
> cat_features | class,windy
> con_features | temperature
> grouping_cols |
> num_trees | 10
> num_random_features | 1
> max_tree_depth | 10
> min_split | 1
> min_bucket | 1
> num_splits | 8
> verbose | f
> importance | t
> num_permutations | 3
> num_all_groups | 1
> num_failed_groups | 0
> total_rows_processed | 16
> total_rows_skipped | 0
> dependent_var_levels | "Don't Play","Play"
> dependent_var_type | text
> independent_var_types | text, boolean, double precision
> null_proxy | None
> SELECT * from train_output_group;
> -[ RECORD 1 ]------+---------------------------------------
> gid | 1
> success | t
> cat_n_levels | \{2,2}
> cat_levels_in_text | \{"Don't Play",Play,False,True}
> oob_error | 0.20000000000000000000
> cat_var_importance | \{0.0244444444444445,0.025487012987013}
> con_var_importance | \{0}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to