[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

2018-03-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/248#discussion_r177503935
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
  "one value are dropped from the tree model.")
 cat_features = [feature for feature in cat_features if feature 
in use_cat_features]
 
-grp_to_col_to_row = dict((grp_key, dict(
-(row['colname'], row['levels']) for row in items))
-for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key')))
-
+# grp_col_to_levels is a list of tuples (pairs) with
+#   first value = group value,
+#   second value = a dict mapping a categorical column to its 
levels in data
+#   (these levels are specific to the group and can be 
different
+#   for different groups)
+#   The list of tuples can be converted to a dict, but the ordering
+#   will be lost.
+# eg. grp_col_to_levels =
+#   [
+#   ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+#   ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+#   ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+#   ]
+grp_to_col_to_levels = [
+(grp_key, dict((row['colname'], row['levels']) for row in 
items))
+for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key'))]
 if cat_features:
-cat_items_list = [rows[col] for col in cat_features
-  for grp_key, rows in grp_to_col_to_row.items() 
if col in rows]
+# Below statements collect the grp_to_col_to_levels into multiple 
variables
+# From above eg.
+#   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+#   cat_n = [2, 3, 2, 2, 2, 3]
+#   cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
--- End diff --

Yes, thanks for the catch. I'll update before merging. 


---


[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

2018-03-27 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/248#discussion_r177495864
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
  "one value are dropped from the tree model.")
 cat_features = [feature for feature in cat_features if feature 
in use_cat_features]
 
-grp_to_col_to_row = dict((grp_key, dict(
-(row['colname'], row['levels']) for row in items))
-for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key')))
-
+# grp_col_to_levels is a list of tuples (pairs) with
+#   first value = group value,
+#   second value = a dict mapping a categorical column to its 
levels in data
+#   (these levels are specific to the group and can be 
different
+#   for different groups)
+#   The list of tuples can be converted to a dict, but the ordering
+#   will be lost.
+# eg. grp_col_to_levels =
+#   [
+#   ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+#   ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+#   ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+#   ]
+grp_to_col_to_levels = [
+(grp_key, dict((row['colname'], row['levels']) for row in 
items))
+for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key'))]
 if cat_features:
-cat_items_list = [rows[col] for col in cat_features
-  for grp_key, rows in grp_to_col_to_row.items() 
if col in rows]
+# Below statements collect the grp_to_col_to_levels into multiple 
variables
+# From above eg.
+#   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+#   cat_n = [2, 3, 2, 2, 2, 3]
+#   cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
+#   grp_key_cat = ['3', '4', '5']
--- End diff --

+1 for the examples, very helpful.


---


[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

2018-03-22 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/madlib/pull/248

DT: Ensure proper quoting in grouping coalesce

JIRA: MADLIB-1217

Grouping column value is coalesced with null_proxy to get the right null
identifier when null_as_category is True. The null_proxy needs to be
quoted appropriately as a string.

Closes #248

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
bugfix/null_category_grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #248


commit ffd6355334bedaf4a6d502685330500abd3735d7
Author: Rahul Iyer 
Date:   2018-03-22T23:48:53Z

DT: Ensure proper quoting in grouping coalesce

JIRA: MADLIB-1217

Grouping column value is coalesced with null_proxy to get the right null
identifier when null_as_category is True. The null_proxy needs to be
quoted appropriately as a string.

Closes #248




---