[GitHub] madlib pull request #246: DT user doc updates

2018-03-23 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r176660561
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
@@ -127,18 +128,20 @@ tree_train(
 
   grouping_cols (optional)
  TEXT, default: NULL. Comma-separated list of column names to group the
-  data by. This will result in multiple decision trees, one for
+  data by. This will produce multiple decision trees, one for
   each group. 
 
   weights (optional)
-  TEXT. Column name containing numerical weights for each observation.
-  Can be any value greater than 0 (does not need to be
-  an integer).  
+  TEXT. Column name containing numerical weights for 
+  each observation.  Can be any value greater 
+  than 0 (does not need to be an integer).  
   This can be used to handle the case of unbalanced data sets.
-  For classification the row's vote is multiplied by the weight, 
-  and for regression we perform a weighted average at each node.
-  If this parameter is not set, all observations (tuples)
-  are treated equally with a weight of 1.0.
+  The weights are used to compute a weighted average in 
+  the output leaf node. For classification, the contribution 
+  of a row towards the vote of it's corresponding level 
--- End diff --

`s/it's/its`. I had a typo in my comment as well. 


---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-23 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r176661267
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in ---
@@ -97,327 +264,220 @@ forest_train(training_table_name,
 
 
   is_classification
-  boolean. True if it is a classification model.
+  BOOLEAN. True if it is a classification model, false
+  if for regression.
 
 
 
   source_table
-  text. Data source table name.
+  TEXT. Data source table name.
 
 
 
   model_table
-  text. Model table name.
+  TEXT. Model table name.
 
 
 
   id_col_name
-text. The ID column name.
+TEXT. The ID column name.
 
 
 
   dependent_varname
-  text. Dependent variable.
+  TEXT. Dependent variable.
 
 
 
-  independent_varname
-  text. Independent variables
+  independent_varnames
+  TEXT. Independent variables
 
 
 
   cat_features
-  text. Categorical feature names.
+  TEXT. List of categorical features 
+  as a comma-separated string.
 
 
 
   con_features
-  text. Continuous feature names.
+  TEXT. List of continuous feature
+  as a comma-separated string.
 
 
 
-  grouping_col
-  int. Names of grouping columns.
+  grouping_cols
+  INTEGER. Names of grouping columns.
 
 
 
   num_trees
-  int. Number of trees grown by the model.
+  INTEGER. Number of trees grown by the model.
 
 
 
   num_random_features
-  int. Number of features randomly selected for each split.
+  INTEGER. Number of features randomly selected for each split.
 
 
 
   max_tree_depth
-  int. Maximum depth of any tree in the random forest model_table.
+  INTEGER. Maximum depth of any tree in the random forest model_table.
 
 
 
   min_split
-  int. Minimum number of observations in a node for it to be split.
+  INTEGER. Minimum number of observations in a node for it to be split.
 
 
 
   min_bucket
-  int. Minimum number of observations in any terminal node.
+  INTEGER. Minimum number of observations in any terminal node.
 
 
 
   num_splits
-  int. Number of buckets for continuous variables.
+  INTEGER. Number of buckets for continuous variables.
 
 
 
   verbose
-  boolean. Whether or not to display debug info.
+  BOOLEAN. Whether or not to display debug info.
 
 
 
   importance
-  boolean. Whether or not to calculate variable importance.
+  BOOLEAN. Whether or not to calculate variable importance.
 
 
 
   num_permutations
-  int. Number of times feature values are permuted while calculating
-  variable importance. The default value is 1.
+  INTEGER. Number of times feature values are permuted while calculating
+  variable importance.
 
 
 
 num_all_groups
-int. Number of groups during forest training.
+INTEGER. Number of groups during forest training.
 
 
 
 num_failed_groups
-int. Number of failed groups during forest training.
+INTEGER. Number of failed groups during forest training.
 
 
 
   total_rows_processed
-  bigint. Total numbers of rows processed in all groups.
+  BIG INTEGER. Total numbers of rows processed in all groups.
--- End diff --

This should be `BIGINT`. Postgres does not spell out the `INT` part as `INTEGER` for this type. The same comment applies to the next item as well.


---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-21 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r176152043
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
@@ -418,7 +468,10 @@ tree_predict(tree_model,
   new_data_table
   TEXT. Name of the table containing prediction data. This table is
  expected to contain the same features that were used during training. The table
-  should also contain id_col_name used for identifying each row.
+  should also contain id_col_name used for identifying each row.
+
+  If the new_data_table contains categorical variables
--- End diff --

Ok, I will remove this line.


---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-21 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r176150844
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
@@ -127,7 +132,11 @@ tree_train(
 
   weights (optional)
   TEXT. Column name containing numerical weights for each observation.
+  Can be any value greater than 0 (does not need to be
+  an integer).  
   This can be used to handle the case of unbalanced data sets.
+  For classification the row's vote is multiplied by the weight, 
--- End diff --

ok


---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r175927937
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
@@ -418,7 +468,10 @@ tree_predict(tree_model,
   new_data_table
   TEXT. Name of the table containing prediction data. This table is
  expected to contain the same features that were used during training. The table
-  should also contain id_col_name used for identifying each row.
+  should also contain id_col_name used for identifying each row.
+
+  If the new_data_table contains categorical variables
--- End diff --

Are we sure about this? In most cases we follow the majority branch when the
feature does not provide a path.
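
To make the majority-branch idea concrete, here is a minimal Python sketch. It is an illustration only and not MADlib's implementation: the `Node` structure, its field names, and the `predict` helper are all hypothetical, and it assumes a binary split on a single categorical feature. When a row's value matches neither side of the split (for example, a level not seen in training, or a missing value), the row follows the child that received more training rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical tree node: leaves set `prediction`, internal nodes set a split.
    prediction: Optional[str] = None
    feature: Optional[str] = None
    left_levels: frozenset = frozenset()   # categorical levels routed to the left child
    right_levels: frozenset = frozenset()  # categorical levels routed to the right child
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    n_left: int = 0    # number of training rows that went left
    n_right: int = 0   # number of training rows that went right


def predict(node: Node, row: dict) -> str:
    """Walk the tree; fall back to the majority branch when no path matches."""
    while node.prediction is None:
        value = row.get(node.feature)
        if value in node.left_levels:
            node = node.left
        elif value in node.right_levels:
            node = node.right
        else:
            # The feature gives no path (unseen level or missing value):
            # follow the branch that saw more training rows.
            node = node.left if node.n_left >= node.n_right else node.right
    return node.prediction


tree = Node(
    feature="color",
    left_levels=frozenset({"red"}),
    right_levels=frozenset({"blue"}),
    left=Node(prediction="A"),
    right=Node(prediction="B"),
    n_left=3,
    n_right=7,
)
```

With this toy tree, a row with `color = "green"` (never seen in training) ends up in the right child, because 7 of the 10 training rows went right.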


---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r175924018
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
@@ -127,7 +132,11 @@ tree_train(
 
   weights (optional)
   TEXT. Column name containing numerical weights for each observation.
+  Can be any value greater than 0 (does not need to be
+  an integer).  
   This can be used to handle the case of unbalanced data sets.
+  For classification the row's vote is multiplied by the weight, 
--- End diff --

I suggest rephrasing this as:

> The `weights` is used to compute a weighted average in the output leaf
> node. For classification, the contribution of a row towards the vote of it's
> corresponding level is multiplied by the weight (weighted mode). For
> regression, the output value of the row is multiplied by the weight (weighted
> mean).
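
As a rough illustration of the weighted mode and weighted mean described above, here is a minimal Python sketch. It is not MADlib's implementation; the function names `weighted_mode` and `weighted_mean` and the sample rows are made up for this example, and it assumes every weight is greater than 0.

```python
from collections import defaultdict


def weighted_mode(labels, weights):
    """Classification: each row's vote for its label is multiplied by its weight."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    # The label with the largest total weight wins the vote.
    return max(totals, key=totals.get)


def weighted_mean(values, weights):
    """Regression: each row's output value is multiplied by its weight."""
    total_weight = sum(weights)
    return sum(w * y for y, w in zip(values, weights)) / total_weight


# Three rows that land in the same leaf (hypothetical data).
labels = ["yes", "no", "no"]
values = [10.0, 20.0, 40.0]
weights = [2.5, 1.0, 1.0]   # weights need not be integers

leaf_class = weighted_mode(labels, weights)
leaf_value = weighted_mean(values, weights)
```

With equal weights the leaf would predict "no" (two votes to one). With the weights above, the single "yes" row carries a total weight of 2.5 against 2.0 for "no", so the leaf predicts "yes". This is how weighting can compensate for an unbalanced data set.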


---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-20 Thread fmcquillan99
GitHub user fmcquillan99 opened a pull request:

https://github.com/apache/madlib/pull/246

DT user doc updates

@rahiyer please review DT user doc updates

Will start working on RF in parallel.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fmcquillan99/apache-madlib doc-tree-1dot14

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #246


commit 7d04df1718408852d642c743aca1eef721a77a83
Author: Frank McQuillan 
Date:   2018-03-20T18:56:58Z

DT user doc updates




---