[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-09 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/289

RF: Add impurity variable importance

JIRA: MADLIB-1205

This commit makes the following changes:
- Add impurity variable importance for random forests.
- Rename current cat_var_importance and con_var_importance measurements to
oob_cat_var_importance and oob_con_var_importance.

New impurity measurement is provided as impurity_var_importance, and supports
grouping. It combines the importance values for both categorical and
continuous features into a single array.

Co-authored-by: Rahul Iyer 
Co-authored-by: Jingyi Mei 
Co-authored-by: Arvind Sridhar 
Co-authored-by: Nandish Jayaram 
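
For readers unfamiliar with the measure, here is a minimal sketch of how an
impurity-based importance can be computed. It is not the MADlib
implementation: Gini impurity, the flat `splits` list, and the normalization
to sum to 1 are assumptions made for illustration, and every name in it is
hypothetical. It only shows the idea of summing each split's weighted
impurity decrease per feature and reporting categorical and continuous
features in one array.

```python
from collections import defaultdict


def gini(counts):
    """Gini impurity of a node, given its per-class row counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_importance(splits, feature_names):
    """Sum the weighted impurity decrease of every split, per feature.

    `splits` is a hypothetical flat list of
    (feature, parent_counts, left_counts, right_counts) tuples gathered
    from all trees in the forest.
    """
    totals = defaultdict(float)
    for feature, parent, left, right in splits:
        decrease = (sum(parent) * gini(parent)
                    - sum(left) * gini(left)
                    - sum(right) * gini(right))
        totals[feature] += decrease

    grand_total = sum(totals.values())
    # One array for all features, categorical and continuous alike,
    # normalized to sum to 1 when any split reduced impurity (assumption).
    return [totals[name] / grand_total if grand_total else 0.0
            for name in feature_names]


# "a" separates the two classes perfectly; "b" does not separate them at all.
example_splits = [
    ("a", [5, 5], [5, 0], [0, 5]),
    ("b", [4, 4], [2, 2], [2, 2]),
]
print(impurity_importance(example_splits, ["a", "b"]))  # [1.0, 0.0]
```

With grouping, the same computation would presumably run once per group,
giving one such array for each group.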

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib rf_gini_importance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #289


commit 622d46a85f4264fdc94bd41dc66a23f1aa2c3ed6
Author: Rahul Iyer 
Date:   2018-07-10T00:34:33Z

RF: Add impurity variable importance

JIRA: MADLIB-1205

This commit makes the following changes:
- Add impurity variable importance for random forests.
- Rename current cat_var_importance and con_var_importance measurements to
oob_cat_var_importance and oob_con_var_importance.

New impurity measurement is provided as impurity_var_importance, and supports
grouping. It combines the importance values for both categorical and
continuous features into a single array.

Co-authored-by: Rahul Iyer 
Co-authored-by: Jingyi Mei 
Co-authored-by: Arvind Sridhar 
Co-authored-by: Nandish Jayaram 




---


Register now for ApacheCon and save $250

2018-07-09 Thread Rich Bowen

Greetings, Apache software enthusiasts!

(You’re getting this because you’re on one or more dev@ or users@ lists 
for some Apache Software Foundation project.)


ApacheCon North America, in Montreal, is now just 80 days away, and 
early bird prices end in just two weeks - on July 21. Prices will be 
going up from $550 to $800 so register NOW to save $250, at 
http://apachecon.com/acna18


And don’t forget to reserve your hotel room. We have negotiated a 
special rate and the room block closes August 24. 
http://www.apachecon.com/acna18/venue.html


Our schedule includes over 100 talks and we’ll be featuring talks from 
dozens of ASF projects. We have inspiring keynotes from some of the 
brilliant members of our community and the wider tech space, including:


 * Myrle Krantz, PMC chair for Apache Fineract, and leader in the open 
source financing space
 * Cliff Schmidt, founder of Literacy Bridge (now Amplio) and creator 
of the Talking Book project
 * Bridget Kromhout, principal cloud developer advocate at Microsoft
 * Euan McLeod, Comcast engineer, and pioneer in streaming video

We’ll also be featuring tracks for Geospatial science, Tomcat, 
CloudStack, and Big Data, as well as numerous other fields where Apache 
software is leading the way. See the full schedule at 
http://apachecon.com/acna18/schedule.html


As usual, we’ll be running our Apache BarCamp, the traditional ApacheCon 
Hackathon, and the Wednesday evening Lightning Talks, too, so you’ll want 
to be there.


Register today at http://apachecon.com/acna18 and we’ll see you in Montreal!

--
Rich Bowen
VP, Conferences, The Apache Software Foundation
h...@apachecon.com
@ApacheCon


[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

2018-07-09 Thread asfgit
GitHub user asfgit commented on the issue:

https://github.com/apache/madlib/pull/288
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/537/



---


[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

2018-07-09 Thread hpandeycodeit
GitHub user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/288#discussion_r200891497
  
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,104 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.utilities import _assert
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+    """
+    Function to validate input parameters
+    """
+    if list_of_features.strip() != '*':
+        if not (list_of_features and list_of_features.strip()):
+            plpy.error("Features to include is empty")
+        _assert(
+            columns_exist_in_table(
+                source_table, split_quoted_delimited_str(list_of_features)),
+            "Invalid columns to list of features {0}".format(list_of_features))
+
+    if cols_to_output and cols_to_output.strip() != '*':
+        _assert(
+            columns_exist_in_table(
+                source_table, _string_to_array(cols_to_output)),
+            "Invalid columns to output list {0}".format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+    """
+    Args:
+        @param schema_madlib:               Name of MADlib schema
+        @param model:                       Name of table containing the tree model
+        @param source_table:                Name of table containing prediction data
+        @param output_table:                Name of table to output the results
+        @param list_of_features:            Comma-separated string of column names or
+                                            expressions to put into feature array.
+                                            Can also be a '*' implying all columns
+                                            are to be put into feature array.
+        @param list_of_features_to_exclude: Comma-separated string of column names
+                                            to exclude from the feature array
+        @param cols_to_output:              Comma-separated string of column names
+                                            from the source table to keep in the output table,
+                                            in addition to the feature array.
+
+    Returns:
+        None
+
+    """
+
+    with MinWarning('warning'):
+        validate_cols2vec_args(source_table, output_table, list_of_features,
+                               list_of_features_to_exclude, cols_to_output, **kwargs)
+
+        all_cols = ''
+        feature_cols = ''
+        feature_list = ''
+        if list_of_features.strip() == '*':
+            all_cols = get_cols(source_table, schema_madlib)
+            all_col_set = set(list(all_cols))
+            exclude_set = set(split_quoted_delimited_str(
+                list_of_features_to_exclude))
+            feature_list = list(all_col_set - exclude_set)
+        else:
+            feature_list = split_quoted_delimited_str(list_of_features)
+
+        feature_cols = py_list_to_sql_string(
+            list(feature_list), "text", False)
+        filtered_list_of_features = ",".join(
--- End diff --

The changes above have been made as suggested.
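
For reference, the feature-selection step in the diff above can be sketched
on its own, outside the database. This is a simplified illustration, not the
code under review: `resolve_feature_list` and `split_names` are hypothetical
names, and the plain comma split stands in for MADlib's
`split_quoted_delimited_str`, so quoted identifiers that contain commas are
not handled.

```python
def resolve_feature_list(all_columns, list_of_features,
                         list_of_features_to_exclude=None):
    """Return the columns that would go into the feature array.

    all_columns is the table's column list in table order (a stand-in for
    what get_cols() returns in the diff).
    """
    def split_names(text):
        # Simplified splitter: no support for quoted identifiers.
        return [name.strip() for name in (text or "").split(",")
                if name.strip()]

    if list_of_features.strip() == "*":
        excluded = set(split_names(list_of_features_to_exclude))
        # Filter in table order rather than using a set difference.
        return [col for col in all_columns if col not in excluded]
    return split_names(list_of_features)


print(resolve_feature_list(["id", "a", "b", "c"], "*", "id, c"))  # ['a', 'b']
```

One difference from the diff is deliberate: `list(all_col_set - exclude_set)`
does not guarantee any column order, whereas filtering the original list
keeps the features in table order, which may matter when the position in the
resulting array is meaningful.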


---