[GitHub] madlib pull request #251: MLP: Simplify initialization of model coefficients

2018-03-27 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/madlib/pull/251

MLP: Simplify initialization of model coefficients

Changes:

1. Model initialization now happens in the C++ code instead of being
passed via Python.
2. Warm start coefficients are still passed from Python. These
coefficients are copied into the model. The vectorized form of the copy
raises an Eigen assertion when used with a MappedMatrix, possibly due
to the actual block size not being available via the Map.
3. The distinction between a "TaskState" and an "AlgoState" within a
State type has been removed in the MLPMiniBatchState. This demarcation
seemed to be confusing and didn't provide much abstraction.

Closes #251

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
bugfix/minibatch_init_debug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #251






---


[GitHub] madlib pull request #249: RF: Use NULL::integer[] when no continuous feature...

2018-03-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/249


---


[GitHub] madlib pull request #246: DT and RF user doc updates

2018-03-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/246


---


[GitHub] madlib issue #249: RF: Use NULL::integer[] when no continuous features

2018-03-27 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/249
  
The changes look good. +1 for adding an install check test.

I noticed that `con_features` is populated by 
`decision_tree.py:_classify_features` which does something like `con_types = 
['real', 'float8', 'double precision', 'numeric']`. Is it better to call the 
utilities function `is_psql_numeric_type`  ? 


---


[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

2018-03-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/248#discussion_r177503935
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
  "one value are dropped from the tree model.")
 cat_features = [feature for feature in cat_features if feature 
in use_cat_features]
 
-grp_to_col_to_row = dict((grp_key, dict(
-(row['colname'], row['levels']) for row in items))
-for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key')))
-
+# grp_col_to_levels is a list of tuples (pairs) with
+#   first value = group value,
+#   second value = a dict mapping a categorical column to its 
levels in data
+#   (these levels are specific to the group and can be 
different
+#   for different groups)
+#   The list of tuples can be converted to a dict, but the ordering
+#   will be lost.
+# eg. grp_col_to_levels =
+#   [
+#   ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+#   ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+#   ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+#   ]
+grp_to_col_to_levels = [
+(grp_key, dict((row['colname'], row['levels']) for row in 
items))
+for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key'))]
 if cat_features:
-cat_items_list = [rows[col] for col in cat_features
-  for grp_key, rows in grp_to_col_to_row.items() 
if col in rows]
+# Below statements collect the grp_to_col_to_levels into multiple 
variables
+# From above eg.
+#   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+#   cat_n = [2, 3, 2, 2, 2, 3]
+#   cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
--- End diff --

Yes, thanks for the catch. I'll update before merging. 


---


[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

2018-03-27 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/248#discussion_r177495864
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
  "one value are dropped from the tree model.")
 cat_features = [feature for feature in cat_features if feature 
in use_cat_features]
 
-grp_to_col_to_row = dict((grp_key, dict(
-(row['colname'], row['levels']) for row in items))
-for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key')))
-
+# grp_col_to_levels is a list of tuples (pairs) with
+#   first value = group value,
+#   second value = a dict mapping a categorical column to its 
levels in data
+#   (these levels are specific to the group and can be 
different
+#   for different groups)
+#   The list of tuples can be converted to a dict, but the ordering
+#   will be lost.
+# eg. grp_col_to_levels =
+#   [
+#   ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+#   ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+#   ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+#   ]
+grp_to_col_to_levels = [
+(grp_key, dict((row['colname'], row['levels']) for row in 
items))
+for grp_key, items in groupby(all_levels, 
key=itemgetter('grp_key'))]
 if cat_features:
-cat_items_list = [rows[col] for col in cat_features
-  for grp_key, rows in grp_to_col_to_row.items() 
if col in rows]
+# Below statements collect the grp_to_col_to_levels into multiple 
variables
+# From above eg.
+#   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+#   cat_n = [2, 3, 2, 2, 2, 3]
+#   cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
+#   grp_key_cat = ['3', '4', '5']
--- End diff --

+1 for the examples, very helpful.


---