[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/295
  
@fmcquillan only impurity, I don't think we scale oob to 100.


---


[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread fmcquillan
Github user fmcquillan commented on the issue:

https://github.com/apache/madlib/pull/295
  
Would this apply to oob too?
Or just impurity?


---


[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/madlib/pull/295
  
Considering the above situation, I suggest the variable importance values 
not be scaled to sum to 100. We can make the normalization within 
`get_var_importance` just for the reporting (which is the behavior in rpart). 
In other words, the output table would keep the original values (for DT and RF) 
but the helper function would rescale during the report for ease in reading the 
values. 


---


[GitHub] madlib pull request #291: Feature: Vector to Columns

2018-07-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/291#discussion_r203890181
  
--- Diff: src/ports/postgres/modules/utilities/transform_vec_cols.py_in ---
@@ -0,0 +1,492 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+from control import MinWarning
+from internal.db_utils import is_col_1d_array
+from internal.db_utils import quote_literal
+from utilities import _assert
+from utilities import add_postfix
+from utilities import ANY_ARRAY
+from utilities import is_valid_psql_type
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from validate_args import is_var_valid
+from validate_args import get_cols
+from validate_args import get_expr_type
+from validate_args import input_tbl_valid
+from validate_args import output_tbl_valid
+from validate_args import table_exists
+
+class vec_cols_helper:
+def __init__(self):
+self.all_cols = None
+
+def get_cols_as_list(self, cols_to_process, source_table=None, 
exclude_cols=None):
+"""
+Get a list of columns based on the value of cols_to_process
+Args:
+@param cols_to_process: str, Either a * or a comma-separated 
list of col names
+@param source_table: str, optional. Source table name
+@param exclude_cols: str, optional. Comma-separated list of 
the col(s) to exclude
+ from the source table, only used if 
cols_to_process is *
+Returns:
+A list of column names (or an empty list)
+"""
+# If cols_to_process is empty/None, return empty list
+if not cols_to_process:
+return []
+if cols_to_process.strip() != "*":
+# If cols_to_process is a comma separated list of names, 
return list
+# of column names in cols_to_process.
+return [col for col in 
split_quoted_delimited_str(cols_to_process)
+if col not in split_quoted_delimited_str(exclude_cols)]
+if source_table:
+if not self.all_cols:
+self.all_cols = get_cols(source_table)
+return [col for col in self.all_cols
+if col not in split_quoted_delimited_str(exclude_cols)]
+return []
+
+class vec2cols:
+def __init__(self):
+self.get_cols_helper = vec_cols_helper()
+self.module_name = self.__class__.__name__
+
+def validate_args(self, source_table, output_table, vector_col, 
feature_names,
+  cols_to_output):
+"""
+Validate args for vec2cols
+"""
+input_tbl_valid(source_table, self.module_name)
+output_tbl_valid(output_table, self.module_name)
+# cols_to_validate = 
self.get_cols_helper.get_cols_as_list(cols_to_output) + [vector_col]
--- End diff --

Guess we can remove this commented line.


---


[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread fmcquillan99
Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/295
  
Another run I got
```
grp 0   grp1
31.01364943 31.6576
22.85881741 33.3245
13.70257438 0
6.344527751 3.33304
26.0804244  11.6654
total   99.9336 79.9806
```
so this does seem to be about trees contributing or not.


---


[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread fmcquillan99
Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/295
  
Should impurity_var_importance always add up to 100?
From the regression example in the user docs:

```
DROP TABLE IF EXISTS mt_imp_output;
SELECT madlib.get_var_importance('mt_cars_output','mt_imp_output');
SELECT am, impurity_var_importance FROM mt_imp_output ORDER BY am, 
impurity_var_importance DESC;
```
results in
```

 am | impurity_var_importance 
+-
  0 |35.7664683110879
  0 |24.7481977075922
  0 |12.4401197123678
  0 |12.1559096708347
  0 |4.88929809351791
  1 |31.7259035495099
  1 |29.6146492693988
  1 |14.9602257795489
  1 |7.01369118455985
  1 |6.68552870777581
(10 rows)
```
which does not add up to 100
```
grp 0   grp 1
35.76646831 31.72590355
24.74819771 29.61464927
12.44011971 14.96022578
12.15590967 7.013691185
4.889298094 6.685528708
total   89.935  89.9849
```


---


[GitHub] madlib issue #291: Feature: Vector to Columns

2018-07-19 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/291
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/582/



---


[GitHub] madlib issue #291: Feature: Vector to Columns

2018-07-19 Thread fmcquillan99
Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/291
  
After the above 2 issues I mentioned are fixed, I will have 1 more commit 
on user docs to this PR



---


[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/295
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/580/



---


[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/295
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/579/



---


[GitHub] madlib issue #291: Feature: Vector to Columns

2018-07-19 Thread fmcquillan99
Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/291
  
In cols2vec,

For this table:
```
CREATE TABLE golf (
id integer NOT NULL,
"OUTLOOK" text,
temperature double precision,
humidity double precision,
"Temp_Humidity" double precision[],
clouds_airquality text[],
windy boolean,
class text,
observation_weight double precision
);
```

this fails:
```
SELECT madlib.cols2vec(
'golf',
'cols2vec_result',
'id, temperature'
);
```
because `id` is INT and `temperature` is FLOAT.

It forces the user to do:
```
SELECT madlib.cols2vec(
'golf',
'cols2vec_result',
'id::FLOAT, temperature'
);
```
but this is inconvenient especially if you have a big 
table and are using '*' to get all columns into the feature
vector and they are a mix of numeric types.  

Also a mix of VARCHAR and TEXT fails in a similar way
but should not.

Use PostgreSQL precendence rules to fix this please.





---


[GitHub] madlib pull request #295: Recursive Partitioning: Add function to report imp...

2018-07-19 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/295#discussion_r203812338
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -2327,6 +2328,110 @@ def _tree_error(schema_madlib, source_table, 
dependent_varname,
 plpy.execute(sql)
 # 
 
+def _validate_var_importance_input(model_table, summary_table, 
output_table):
+_assert(table_exists(model_table),
+"Recursive Partitioning: Model table does not exist.")
+_assert(table_exists(summary_table),
+"Recursive Partitioning: Model summary table does not exist.")
+_assert(not table_exists(output_table),
+"Recursive Partitioning: Output table already exists.")
+
+def _is_model_for_RF(summary_table):
+# Only an RF model (and not DT) would have num_trees column in summary
+return columns_exist_in_table(summary_table, ['num_trees'])
+
+def _is_RF_model_with_imp_pre_1_15(group_table, summary_table):
--- End diff --

Since goal of function is to check if `impurity_var_importance` exists in 
group table, let's name it to reflect that. 


---


[GitHub] madlib pull request #295: Recursive Partitioning: Add function to report imp...

2018-07-19 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/295#discussion_r203811384
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -2327,6 +2328,110 @@ def _tree_error(schema_madlib, source_table, 
dependent_varname,
 plpy.execute(sql)
 # 
 
+def _validate_var_importance_input(model_table, summary_table, 
output_table):
+_assert(table_exists(model_table),
+"Recursive Partitioning: Model table does not exist.")
+_assert(table_exists(summary_table),
+"Recursive Partitioning: Model summary table does not exist.")
+_assert(not table_exists(output_table),
+"Recursive Partitioning: Output table already exists.")
+
+def _is_model_for_RF(summary_table):
+# Only an RF model (and not DT) would have num_trees column in summary
+return columns_exist_in_table(summary_table, ['num_trees'])
+
+def _is_RF_model_with_imp_pre_1_15(group_table, summary_table):
+"""
+Check if the RF model is from MADlib < 1.15. The group table for
+>= 1.15 RF models should have a column named 
impurity_var_importance
+if it was learnt with importance param True.
+"""
+_assert(table_exists(group_table),
+"Recursive Partitioning: Model group table does not exist.")
+# this flag has to be set to true for RF to report importance scores.
+isImportance = plpy.execute("SELECT importance FROM {summary_table}".
--- End diff --

Our convention is to use snake case for Python (i.e. `is_importance`). I 
also suggest changing it to `is_importance_set` to make it more explicit. 


---


[GitHub] madlib pull request #295: Recursive Partitioning: Add function to report imp...

2018-07-19 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/295#discussion_r203811750
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -2327,6 +2328,110 @@ def _tree_error(schema_madlib, source_table, 
dependent_varname,
 plpy.execute(sql)
 # 
 
+def _validate_var_importance_input(model_table, summary_table, 
output_table):
+_assert(table_exists(model_table),
+"Recursive Partitioning: Model table does not exist.")
+_assert(table_exists(summary_table),
+"Recursive Partitioning: Model summary table does not exist.")
+_assert(not table_exists(output_table),
+"Recursive Partitioning: Output table already exists.")
+
+def _is_model_for_RF(summary_table):
+# Only an RF model (and not DT) would have num_trees column in summary
+return columns_exist_in_table(summary_table, ['num_trees'])
--- End diff --

We've a `method_name` in summary table that makes this clearer. 


---


[GitHub] madlib pull request #295: Recursive Partitioning: Add function to report imp...

2018-07-19 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/295#discussion_r203816091
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -2327,6 +2328,110 @@ def _tree_error(schema_madlib, source_table, 
dependent_varname,
 plpy.execute(sql)
 # 
 
+def _validate_var_importance_input(model_table, summary_table, 
output_table):
+_assert(table_exists(model_table),
+"Recursive Partitioning: Model table does not exist.")
+_assert(table_exists(summary_table),
+"Recursive Partitioning: Model summary table does not exist.")
+_assert(not table_exists(output_table),
+"Recursive Partitioning: Output table already exists.")
+
+def _is_model_for_RF(summary_table):
+# Only an RF model (and not DT) would have num_trees column in summary
+return columns_exist_in_table(summary_table, ['num_trees'])
+
+def _is_RF_model_with_imp_pre_1_15(group_table, summary_table):
+"""
+Check if the RF model is from MADlib < 1.15. The group table for
+>= 1.15 RF models should have a column named 
impurity_var_importance
+if it was learnt with importance param True.
+"""
+_assert(table_exists(group_table),
+"Recursive Partitioning: Model group table does not exist.")
+# this flag has to be set to true for RF to report importance scores.
+isImportance = plpy.execute("SELECT importance FROM {summary_table}".
+format(**locals()))[0]['importance']
+_assert(isImportance, """Recursive Partitioning: The model does """
++ """not have the importance information.""")
+if columns_exist_in_table(group_table, ['impurity_var_importance']):
+# If this column exists, then the RF model is >=1.15.
+return False
+else:
+return True
+
+def get_var_importance(schema_madlib, model_table, output_table, **kwargs):
+""" Create table capturing importance scores for each feature.
+For DT, this function will record the impurity importance score if 
it exists.
+For RF, this function will record the oob variable importance and 
impurity
+importance (only for models learnt with 1.15 onwards) for each 
variable.
+
+Args:
+@param schema_madlib: str, MADlib schema name
+@param model_table: str, Model table name
+@param output_table: str, Output table name
+
+"""
+# Validate parameters
+summary_table = add_postfix(model_table, "_summary")
+_validate_var_importance_input(model_table, summary_table, 
output_table)
+grouping_cols = plpy.execute(
+"SELECT grouping_cols FROM {summary_table}".
+format(**locals()))[0]['grouping_cols']
+grouping_cols_comma = ''
+if grouping_cols :
+grouping_cols_comma = add_postfix(grouping_cols,", ")
+is_RF = _is_model_for_RF(summary_table)
+if is_RF:
+group_table = add_postfix(model_table, "_group")
+is_pre_1_15_RF_model = _is_RF_model_with_imp_pre_1_15(
+group_table, summary_table)
+
+# Query to add oob variable importance for categorical vars
+if is_pre_1_15_RF_model:
+# In < 1.15 RF model, the variable importance was captured 
using two
+# different columns named 'cat_var_importance'
+plpy.execute(
+""" CREATE TABLE {output_table} AS
+SELECT {grouping_cols_comma}
+unnest(regexp_split_to_array(cat_features, ',')) 
AS feature,
+unnest(cat_var_importance) AS var_importance
+FROM {group_table}, {summary_table}
+""".format(**locals()))
--- End diff --

IMO, it's cleaner to see the two queries `UNION`ed together to get the 
output. 


---


[GitHub] madlib issue #296: DT/RF: Ensure cat features are recorded per group

2018-07-19 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/296
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/577/



---