[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/289


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-11 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201728550
  
--- Diff: src/modules/recursive_partitioning/DT_impl.hpp ---
@@ -1512,6 +1512,9 @@ DecisionTree::computeVariableImportance(
 uint64_t maj_count = getMajorityCount(node_index);
 uint64_t min_count =  node_count - maj_count;
 
+if (min_count == 0) {
--- End diff --

I've pushed a commit to master that fixes the need to do this hack. Please 
rebase on master and remove these lines. 


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201489373
  
--- Diff: src/modules/recursive_partitioning/DT_impl.hpp ---
@@ -1512,6 +1512,9 @@ DecisionTree::computeVariableImportance(
 uint64_t maj_count = getMajorityCount(node_index);
 uint64_t min_count =  node_count - maj_count;
 
+if (min_count == 0) {
--- End diff --

We can leave some comment here to address it later


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201501719
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/random_forest.py_in ---
@@ -1291,38 +1300,64 @@ def _create_group_table(
 schema_madlib, output_table_name, oob_error_table,
 importance_table, cat_features_info_table, grp_key_to_grp_cols,
 grouping_cols, tree_terminated):
-""" Ceate the group table for random forest"""
+""" Create the group table for random forest"""
+
+cat_var_importance_str = ''
+con_var_importance_str = ''
+impurity_var_importance_str = ''
+left_join_importance_table_str = ''
+join_impurity_table_str = ''
+
+if importance_table:
+impurity_var_importance_table_name = unique_string(desp='impurity')
+plpy.execute("""
+CREATE TEMP TABLE {impurity_var_importance_table_name} AS
+SELECT
+gid,
+{schema_madlib}.array_avg(impurity_var_importance, False) 
AS impurity_var_importance
+FROM {output_table_name}
+GROUP BY gid
+""".format(**locals()))
+
+cat_var_importance_str = ", cat_var_importance AS 
oob_cat_var_importance,"
+con_var_importance_str = "con_var_importance AS 
oob_con_var_importance,"
+impurity_var_importance_str = "impurity_var_importance"
+left_join_importance_table_str = """LEFT OUTER JOIN 
{importance_table}
+USING (gid)""".format(importance_table=importance_table)
+join_impurity_table_str = """JOIN 
{impurity_var_importance_table_name} USING 
(gid)""".format(impurity_var_importance_table_name=impurity_var_importance_table_name)
+
 grouping_cols_str = ('' if grouping_cols is None
  else grouping_cols + ",")
 group_table_name = add_postfix(output_table_name, "_group")
+
 sql_create_group_table = """
 CREATE TABLE {group_table_name} AS
 SELECT
 gid,
 {grouping_cols_str}
-grp_finished as success,
+grp_finished AS success,
 cat_n_levels,
 cat_levels_in_text,
-oob_error,
-cat_var_importance,
-con_var_importance
+oob_error
+{cat_var_importance_str}
+{con_var_importance_str}
+{impurity_var_importance_str}
 FROM
 {oob_error_table}
 JOIN
 {grp_key_to_grp_cols}
 USING (gid)
 JOIN (
 SELECT
-unnest($1) as grp_key,
-unnest($2) as grp_finished
+unnest($1) AS grp_key,
+unnest($2) AS grp_finished
 ) tree_terminated
 USING (grp_key)
 JOIN
 {cat_features_info_table}
 USING (gid)
-LEFT OUTER JOIN
-{importance_table}
-USING (gid)
+{left_join_importance_table_str}
--- End diff --

We can name it oob_importance_table_* to explicitly distinguish it from 
impurity importance table


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201462356
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/random_forest.py_in ---
@@ -616,23 +628,20 @@ def forest_train(
 _calculate_oob_error(schema_madlib, oob_prediction_table,
  oob_error_table, id_col_name,
  is_classification)
-
-importance_table = unique_string()
-sql_create_empty_imp_tbl = """
-CREATE TEMP TABLE {importance_table}
-(
-gid integer,
-cat_var_importance float8[],
-con_var_importance float8[]
-);
-""".format(**locals())
-
plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl)
-plpy.execute(sql_create_empty_imp_tbl)
-
-# we populate the importance_table only if variable 
importance is to be
-# calculated, otherwise we use an empty table which will 
be used later
-# for an outer join.
+importance_table = ''
 if importance:
+importance_table = unique_string()
+sql_create_empty_imp_tbl = """
+CREATE TEMP TABLE {importance_table}
+(
+gid integer,
+cat_var_importance float8[],
+con_var_importance float8[]
+);
+""".format(**locals())
+
plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl)
--- End diff --

Do we need this notice?


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201490136
  
--- Diff: src/modules/recursive_partitioning/decision_tree.cpp ---
@@ -502,10 +502,13 @@ get_variable_importance::run(AnyType ){
ColumnVector combined_var_imp(n_cat_features + n_con_features);
combined_var_imp << cat_var_importance, con_var_importance;
 
-// Avoid divide by zero by adding a small number.
+// Avoid divide by zero by replacing with a small number if necessary
--- End diff --

Out of curiosity: what was wrong with the previous method? 


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201493117
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/random_forest.py_in ---
@@ -1333,42 +1368,69 @@ def _create_group_table(
 # -
 
 
-def _create_empty_result_table(schema_madlib, output_table_name):
+def _create_empty_result_table(schema_madlib, output_table_name, 
importance):
 """Create the result table for all trees in the forest"""
+impurity_var_imp_str = """, impurity_var_importance double 
precision[]);""" if importance else ");"
--- End diff --

Do we need the triple quotes? Also if `);` is common in both cases then we 
can move that to the query directly?


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201492311
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/random_forest.py_in ---
@@ -1291,38 +1300,64 @@ def _create_group_table(
 schema_madlib, output_table_name, oob_error_table,
 importance_table, cat_features_info_table, grp_key_to_grp_cols,
 grouping_cols, tree_terminated):
-""" Ceate the group table for random forest"""
+""" Create the group table for random forest"""
+
+cat_var_importance_str = ''
--- End diff --

Let's please change the variable names to `oob*`. It would also help to 
change the `compute_var_importance` function in decision tree to 
`compute_impurity_var_importance`. 


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-09 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/289

RF: Add impurity variable importance

JIRA: MADLIB-1205

This commit makes the following changes:
- Add impurity variable importance for random forests.
- Rename current cat_var_importance and con_var_importance measurements to
oob_cat_var_importance and oob_con_var_importance.

New impurity measurement is provided as impurity_var_importance, and 
supports
grouping. It combines the importance values for both categorical and
continuous features into a single array.

Co-authored-by: Rahul Iyer 
Co-authored-by: Jingyi Mei 
Co-authored-by: Arvind Sridhar 
Co-authored-by: Nandish Jayaram 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib rf_gini_importance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #289


commit 622d46a85f4264fdc94bd41dc66a23f1aa2c3ed6
Author: Rahul Iyer 
Date:   2018-07-10T00:34:33Z

RF: Add impurity variable importance

JIRA: MADLIB-1205

This commit makes the following changes:
- Add impurity variable importance for random forests.
- Rename current cat_var_importance and con_var_importance measurements to
oob_cat_var_importance and oob_con_var_importance.

New impurity measurement is provided as impurity_var_importance, and 
supports
grouping. It combines the importance values for both categorical and
continuous features into a single array.

Co-authored-by: Rahul Iyer 
Co-authored-by: Jingyi Mei 
Co-authored-by: Arvind Sridhar 
Co-authored-by: Nandish Jayaram 




---