[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/248#discussion_r177503935 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in --- @@ -970,16 +970,35 @@ def _get_bins_grps( "one value are dropped from the tree model.") cat_features = [feature for feature in cat_features if feature in use_cat_features] -grp_to_col_to_row = dict((grp_key, dict( -(row['colname'], row['levels']) for row in items)) -for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))) - +# grp_col_to_levels is a list of tuples (pairs) with +# first value = group value, +# second value = a dict mapping a categorical column to its levels in data +# (these levels are specific to the group and can be different +# for different groups) +# The list of tuples can be converted to a dict, but the ordering +# will be lost. +# eg. grp_col_to_levels = +# [ +# ('3', {'vs': [0, 1], 'cyl': [4,6,8]}), +# ('4', {'vs': [0, 1], 'cyl': [4,6]}), +# ('5', {'vs': [0, 1], 'cyl': [4,6,8]}) +# ] +grp_to_col_to_levels = [ +(grp_key, dict((row['colname'], row['levels']) for row in items)) +for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))] if cat_features: -cat_items_list = [rows[col] for col in cat_features - for grp_key, rows in grp_to_col_to_row.items() if col in rows] +# Below statements collect the grp_to_col_to_levels into multiple variables +# From above eg. +# cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]] +# cat_n = [2, 3, 2, 2, 2, 3] +# cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8] --- End diff -- Yes, thanks for the catch. I'll update before merging. ---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/248#discussion_r177495864 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in --- @@ -970,16 +970,35 @@ def _get_bins_grps( "one value are dropped from the tree model.") cat_features = [feature for feature in cat_features if feature in use_cat_features] -grp_to_col_to_row = dict((grp_key, dict( -(row['colname'], row['levels']) for row in items)) -for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))) - +# grp_col_to_levels is a list of tuples (pairs) with +# first value = group value, +# second value = a dict mapping a categorical column to its levels in data +# (these levels are specific to the group and can be different +# for different groups) +# The list of tuples can be converted to a dict, but the ordering +# will be lost. +# eg. grp_col_to_levels = +# [ +# ('3', {'vs': [0, 1], 'cyl': [4,6,8]}), +# ('4', {'vs': [0, 1], 'cyl': [4,6]}), +# ('5', {'vs': [0, 1], 'cyl': [4,6,8]}) +# ] +grp_to_col_to_levels = [ +(grp_key, dict((row['colname'], row['levels']) for row in items)) +for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))] if cat_features: -cat_items_list = [rows[col] for col in cat_features - for grp_key, rows in grp_to_col_to_row.items() if col in rows] +# Below statements collect the grp_to_col_to_levels into multiple variables +# From above eg. +# cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]] +# cat_n = [2, 3, 2, 2, 2, 3] +# cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8] +# grp_key_cat = ['3', '4', '5'] --- End diff -- +1 for the examples, very helpful. ---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
GitHub user iyerr3 opened a pull request: https://github.com/apache/madlib/pull/248 DT: Ensure proper quoting in grouping coalesce JIRA: MADLIB-1217 Grouping column value is coalesced with null_proxy to get the right null identifier when null_as_category is True. The null_proxy needs to be quoted appropriately as a string. Closes #248 You can merge this pull request into a Git repository by running: $ git pull https://github.com/iyerr3/incubator-madlib bugfix/null_category_grouping Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/248.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #248 commit ffd6355334bedaf4a6d502685330500abd3735d7 Author: Rahul IyerDate: 2018-03-22T23:48:53Z DT: Ensure proper quoting in grouping coalesce JIRA: MADLIB-1217 Grouping column value is coalesced with null_proxy to get the right null identifier when null_as_category is True. The null_proxy needs to be quoted appropriately as a string. Closes #248 ---