[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/289 ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201728550 --- Diff: src/modules/recursive_partitioning/DT_impl.hpp --- @@ -1512,6 +1512,9 @@ DecisionTree::computeVariableImportance( uint64_t maj_count = getMajorityCount(node_index); uint64_t min_count = node_count - maj_count; +if (min_count == 0) { --- End diff -- I've pushed a commit to master that fixes the need to do this hack. Please rebase on master and remove these lines. ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201489373 --- Diff: src/modules/recursive_partitioning/DT_impl.hpp --- @@ -1512,6 +1512,9 @@ DecisionTree::computeVariableImportance( uint64_t maj_count = getMajorityCount(node_index); uint64_t min_count = node_count - maj_count; +if (min_count == 0) { --- End diff -- We can leave some comment here to address it later ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201501719 --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in --- @@ -1291,38 +1300,64 @@ def _create_group_table( schema_madlib, output_table_name, oob_error_table, importance_table, cat_features_info_table, grp_key_to_grp_cols, grouping_cols, tree_terminated): -""" Ceate the group table for random forest""" +""" Create the group table for random forest""" + +cat_var_importance_str = '' +con_var_importance_str = '' +impurity_var_importance_str = '' +left_join_importance_table_str = '' +join_impurity_table_str = '' + +if importance_table: +impurity_var_importance_table_name = unique_string(desp='impurity') +plpy.execute(""" +CREATE TEMP TABLE {impurity_var_importance_table_name} AS +SELECT +gid, +{schema_madlib}.array_avg(impurity_var_importance, False) AS impurity_var_importance +FROM {output_table_name} +GROUP BY gid +""".format(**locals())) + +cat_var_importance_str = ", cat_var_importance AS oob_cat_var_importance," +con_var_importance_str = "con_var_importance AS oob_con_var_importance," +impurity_var_importance_str = "impurity_var_importance" +left_join_importance_table_str = """LEFT OUTER JOIN {importance_table} +USING (gid)""".format(importance_table=importance_table) +join_impurity_table_str = """JOIN {impurity_var_importance_table_name} USING (gid)""".format(impurity_var_importance_table_name=impurity_var_importance_table_name) + grouping_cols_str = ('' if grouping_cols is None else grouping_cols + ",") group_table_name = add_postfix(output_table_name, "_group") + sql_create_group_table = """ CREATE TABLE {group_table_name} AS SELECT gid, {grouping_cols_str} -grp_finished as success, +grp_finished AS success, cat_n_levels, cat_levels_in_text, -oob_error, -cat_var_importance, -con_var_importance +oob_error +{cat_var_importance_str} +{con_var_importance_str} +{impurity_var_importance_str} FROM {oob_error_table} JOIN {grp_key_to_grp_cols} USING (gid) JOIN ( SELECT -unnest($1) as grp_key, -unnest($2) as grp_finished +unnest($1) AS grp_key, +unnest($2) AS grp_finished ) tree_terminated USING (grp_key) JOIN {cat_features_info_table} USING (gid) -LEFT OUTER JOIN -{importance_table} -USING (gid) +{left_join_importance_table_str} --- End diff -- We can name it oob_importance_table_* to explicitly distinguish it from impurity importance table ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201462356 --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in --- @@ -616,23 +628,20 @@ def forest_train( _calculate_oob_error(schema_madlib, oob_prediction_table, oob_error_table, id_col_name, is_classification) - -importance_table = unique_string() -sql_create_empty_imp_tbl = """ -CREATE TEMP TABLE {importance_table} -( -gid integer, -cat_var_importance float8[], -con_var_importance float8[] -); -""".format(**locals()) - plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl) -plpy.execute(sql_create_empty_imp_tbl) - -# we populate the importance_table only if variable importance is to be -# calculated, otherwise we use an empty table which will be used later -# for an outer join. +importance_table = '' if importance: +importance_table = unique_string() +sql_create_empty_imp_tbl = """ +CREATE TEMP TABLE {importance_table} +( +gid integer, +cat_var_importance float8[], +con_var_importance float8[] +); +""".format(**locals()) + plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl) --- End diff -- Do we need this notice? ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201490136 --- Diff: src/modules/recursive_partitioning/decision_tree.cpp --- @@ -502,10 +502,13 @@ get_variable_importance::run(AnyType ){ ColumnVector combined_var_imp(n_cat_features + n_con_features); combined_var_imp << cat_var_importance, con_var_importance; -// Avoid divide by zero by adding a small number. +// Avoid divide by zero by replacing with a small number if necessary --- End diff -- Out of curiosity: what was wrong with the previous method? ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201493117 --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in --- @@ -1333,42 +1368,69 @@ def _create_group_table( # - -def _create_empty_result_table(schema_madlib, output_table_name): +def _create_empty_result_table(schema_madlib, output_table_name, importance): """Create the result table for all trees in the forest""" +impurity_var_imp_str = """, impurity_var_importance double precision[]);""" if importance else ");" --- End diff -- Do we need the triple quotes? Also if `);` is common in both cases then we can move that to the query directly? ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201492311 --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in --- @@ -1291,38 +1300,64 @@ def _create_group_table( schema_madlib, output_table_name, oob_error_table, importance_table, cat_features_info_table, grp_key_to_grp_cols, grouping_cols, tree_terminated): -""" Ceate the group table for random forest""" +""" Create the group table for random forest""" + +cat_var_importance_str = '' --- End diff -- Let's please change the variable names to `oob*`. It would also help to change the `compute_var_importance` function in decision tree to `compute_impurity_var_importance`. ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/289 RF: Add impurity variable importance JIRA: MADLIB-1205 This commit makes the following changes: - Add impurity variable importance for random forests. - Rename current cat_var_importance and con_var_importance measurements to oob_cat_var_importance and oob_con_var_importance. New impurity measurement is provided as impurity_var_importance, and supports grouping. It combines the importance values for both categorical and continuous features into a single array. Co-authored-by: Rahul Iyer Co-authored-by: Jingyi Mei Co-authored-by: Arvind Sridhar Co-authored-by: Nandish Jayaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib rf_gini_importance Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #289 commit 622d46a85f4264fdc94bd41dc66a23f1aa2c3ed6 Author: Rahul Iyer Date: 2018-07-10T00:34:33Z RF: Add impurity variable importance JIRA: MADLIB-1205 This commit makes the following changes: - Add impurity variable importance for random forests. - Rename current cat_var_importance and con_var_importance measurements to oob_cat_var_importance and oob_con_var_importance. New impurity measurement is provided as impurity_var_importance, and supports grouping. It combines the importance values for both categorical and continuous features into a single array. Co-authored-by: Rahul Iyer Co-authored-by: Jingyi Mei Co-authored-by: Arvind Sridhar Co-authored-by: Nandish Jayaram ---