[GitHub] madlib pull request #289: RF: Add impurity variable importance
GitHub user orhankislal opened a pull request:

    https://github.com/apache/madlib/pull/289

    RF: Add impurity variable importance

    JIRA: MADLIB-1205

    This commit makes the following changes:
    - Add impurity variable importance for random forests.
    - Rename current cat_var_importance and con_var_importance
      measurements to oob_cat_var_importance and oob_con_var_importance.

    New impurity measurement is provided as impurity_var_importance, and
    supports grouping. It combines the importance values for both
    categorical and continuous features into a single array.

    Co-authored-by: Rahul Iyer
    Co-authored-by: Jingyi Mei
    Co-authored-by: Arvind Sridhar
    Co-authored-by: Nandish Jayaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib rf_gini_importance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #289

commit 622d46a85f4264fdc94bd41dc66a23f1aa2c3ed6
Author: Rahul Iyer
Date:   2018-07-10T00:34:33Z

    RF: Add impurity variable importance

    JIRA: MADLIB-1205

    This commit makes the following changes:
    - Add impurity variable importance for random forests.
    - Rename current cat_var_importance and con_var_importance
      measurements to oob_cat_var_importance and oob_con_var_importance.

    New impurity measurement is provided as impurity_var_importance, and
    supports grouping. It combines the importance values for both
    categorical and continuous features into a single array.

    Co-authored-by: Rahul Iyer
    Co-authored-by: Jingyi Mei
    Co-authored-by: Arvind Sridhar
    Co-authored-by: Nandish Jayaram

---
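For readers unfamiliar with the term, impurity variable importance is commonly computed as the total decrease in node impurity attributed to each feature, summed over all splits. The sketch below illustrates that general idea with Gini impurity and returns one array covering all features, as the description above mentions. It is not the MADlib implementation: the function names, the `splits` tuple layout, and the normalization to a sum of 1 are assumptions made for illustration only.

```python
def gini(class_counts):
    """Gini impurity of a node given per-class row counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def impurity_importance(splits, n_features):
    """Sum the weighted impurity decrease per feature over all splits.

    `splits` is a hypothetical flat list of tuples
    (feature_index, parent_counts, left_counts, right_counts),
    collected from every tree in the forest.
    """
    importance = [0.0] * n_features
    for feature_index, parent, left, right in splits:
        # Weight each node's impurity by the number of rows it holds.
        decrease = (sum(parent) * gini(parent)
                    - sum(left) * gini(left)
                    - sum(right) * gini(right))
        importance[feature_index] += decrease

    # Assumed normalization: scale so the values sum to 1.
    total = sum(importance)
    if total > 0:
        importance = [value / total for value in importance]
    return importance
```

A split that separates the classes perfectly contributes the parent's full weighted impurity to its feature, while a split that leaves both children as mixed as the parent contributes nothing.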
Register now for ApacheCon and save $250
Greetings, Apache software enthusiasts!

(You’re getting this because you’re on one or more dev@ or users@ lists for some Apache Software Foundation project.)

ApacheCon North America, in Montreal, is now just 80 days away, and early bird prices end in just two weeks, on July 21. Prices will be going up from $550 to $800, so register NOW to save $250, at http://apachecon.com/acna18

And don’t forget to reserve your hotel room. We have negotiated a special rate and the room block closes August 24. http://www.apachecon.com/acna18/venue.html

Our schedule includes over 100 talks, and we’ll be featuring talks from dozens of ASF projects. We have inspiring keynotes from some of the brilliant members of our community and the wider tech space, including:

* Myrle Krantz, PMC chair for Apache Fineract, and leader in the open source financing space
* Cliff Schmidt, founder of Literacy Bridge (now Amplio) and creator of the Talking Book project
* Bridget Kromhout, principal cloud developer advocate at Microsoft
* Euan McLeod, Comcast engineer, and pioneer in streaming video

We’ll also be featuring tracks for Geospatial science, Tomcat, CloudStack, and Big Data, as well as numerous other fields where Apache software is leading the way. See the full schedule at http://apachecon.com/acna18/schedule.html

As usual we’ll be running our Apache BarCamp, the traditional ApacheCon Hackathon, and the Wednesday evening Lightning Talks, too, so you’ll want to be there.

Register today at http://apachecon.com/acna18 and we’ll see you in Montreal!

--
Rich Bowen
VP, Conferences, The Apache Software Foundation
h...@apachecon.com
@ApacheCon
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/288

    Refer to this link for build results (access rights to CI server needed):
    https://builds.apache.org/job/madlib-pr-build/537/

---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200891497

    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,104 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.utilities import _assert
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)),
    +            "Invalid columns to list of features {0}".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)),
    +            "Invalid columns to output list {0}".format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        feature_list = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_list = list(all_col_set - exclude_set)
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +
    +        feature_cols = py_list_to_sql_string(
    +            list(feature_list), "text", False)
    +        filtered_list_of_features = ",".join(
    --- End diff --

    Above changes are done as suggested.

---
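The feature-selection branch in the diff above ('*' means all columns minus the excluded ones, otherwise the explicit list) can be sketched in plain Python as follows. This is an illustration rather than the code under review: `select_feature_columns` and its inner `split_names` are hypothetical names, the naive comma split stands in for MADlib's `split_quoted_delimited_str` and does not handle quoted identifiers, and the list comprehension is an assumed variation that keeps the source column order, which a set difference does not guarantee.

```python
def select_feature_columns(all_columns, list_of_features,
                           list_of_features_to_exclude=None):
    """Return the columns that would go into the feature array.

    Hypothetical helper for illustration; `all_columns` is the ordered
    list of column names in the source table.
    """
    def split_names(text):
        # Naive split on commas; assumes no quoted or nested names.
        return [name.strip() for name in (text or "").split(",")
                if name.strip()]

    if list_of_features.strip() == "*":
        excluded = set(split_names(list_of_features_to_exclude))
        # Filter in table order so the feature array layout is stable.
        return [col for col in all_columns if col not in excluded]
    return split_names(list_of_features)


columns = ["id", "age", "height", "weight"]
print(select_feature_columns(columns, "*", "id, weight"))
print(select_feature_columns(columns, "height, age"))
```

Keeping the table order matters when the resulting array is later matched positionally against feature names; whether that matters for this pull request is not stated in the thread.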