[GitHub] madlib pull request #246: DT user doc updates
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/246#discussion_r176660561

    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
    @@ -127,18 +128,20 @@ tree_train( grouping_cols (optional) TEXT, default: NULL. Comma-separated list of column names to group the
    - data by. This will result in multiple decision trees, one for
    + data by. This will produce multiple decision trees, one for
      each group. weights (optional)
    - TEXT. Column name containing numerical weights for each observation.
    - Can be any value greater than 0 (does not need to be
    - an integer).
    + TEXT. Column name containing numerical weights for
    + each observation. Can be any value greater
    + than 0 (does not need to be an integer).
      This can be used to handle the case of unbalanced data sets.
    - For classification the row's vote is multiplied by the weight,
    - and for regression we perform a weighted average at each node.
    - If this parameter is not set, all observations (tuples)
    - are treated equally with a weight of 1.0.
    + The weights are used to compute a weighted average in
    + the output leaf node. For classification, the contribution
    + of a row towards the vote of it's corresponding level
    --- End diff --

    `s/it's/its`. I had a typo in my comment as well.

---
[GitHub] madlib pull request #246: DT user doc updates
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/246#discussion_r176661267

    --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in ---
    @@ -97,327 +264,220 @@ forest_train(training_table_name, is_classification
    - boolean. True if it is a classification model.
    + BOOLEAN. True if it is a classification model, false
    + if for regression.
      source_table
    - text. Data source table name.
    + TEXT. Data source table name.
      model_table
    - text. Model table name.
    + TEXT. Model table name.
      id_col_name
    -text. The ID column name.
    +TEXT. The ID column name.
      dependent_varname
    - text. Dependent variable.
    + TEXT. Dependent variable.
    - independent_varname
    - text. Independent variables
    + independent_varnames
    + TEXT. Independent variables
      cat_features
    - text. Categorical feature names.
    + TEXT. List of categorical features
    + as a comma-separated string.
      con_features
    - text. Continuous feature names.
    + TEXT. List of continuous feature
    + as a comma-separated string.
    - grouping_col
    - int. Names of grouping columns.
    + grouping_cols
    + INTEGER. Names of grouping columns.
      num_trees
    - int. Number of trees grown by the model.
    + INTEGER. Number of trees grown by the model.
      num_random_features
    - int. Number of features randomly selected for each split.
    + INTEGER. Number of features randomly selected for each split.
      max_tree_depth
    - int. Maximum depth of any tree in the random forest model_table.
    + INTEGER. Maximum depth of any tree in the random forest model_table.
      min_split
    - int. Minimum number of observations in a node for it to be split.
    + INTEGER. Minimum number of observations in a node for it to be split.
      min_bucket
    - int. Minimum number of observations in any terminal node.
    + INTEGER. Minimum number of observations in any terminal node.
      num_splits
    - int. Number of buckets for continuous variables.
    + INTEGER. Number of buckets for continuous variables.
      verbose
    - boolean. Whether or not to display debug info.
    + BOOLEAN. Whether or not to display debug info.
      importance
    - boolean. Whether or not to calculate variable importance.
    + BOOLEAN. Whether or not to calculate variable importance.
      num_permutations
    - int. Number of times feature values are permuted while calculating
    - variable importance. The default value is 1.
    + INTEGER. Number of times feature values are permuted while calculating
    + variable importance.
      num_all_groups
    -int. Number of groups during forest training.
    +INTEGER. Number of groups during forest training.
      num_failed_groups
    -int. Number of failed groups during forest training.
    +INTEGER. Number of failed groups during forest training.
      total_rows_processed
    - bigint. Total numbers of rows processed in all groups.
    + BIG INTEGER. Total numbers of rows processed in all groups.
    --- End diff --

    This is `BIGINT`. Postgres doesn't expand on the `INT` for this type. Same comment for the next item as well.

---
[GitHub] madlib pull request #246: DT user doc updates
Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/246#discussion_r176152043

    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
    @@ -418,7 +468,10 @@ tree_predict(tree_model, new_data_table TEXT. Name of the table containing prediction data. This table is expected to contain the same features that were used during training. The table
    - should also contain id_col_name used for identifying each row.
    + should also contain id_col_name used for identifying each row.
    +
    + If the new_data_table contains categorical variables
    --- End diff --

    Ok, I will remove this line.

---
[GitHub] madlib pull request #246: DT user doc updates
Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/246#discussion_r176150844

    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
    @@ -127,7 +132,11 @@ tree_train( weights (optional) TEXT. Column name containing numerical weights for each observation.
    + Can be any value greater than 0 (does not need to be
    + an integer).
      This can be used to handle the case of unbalanced data sets.
    + For classification the row's vote is multiplied by the weight,
    --- End diff --

    ok

---
[GitHub] madlib pull request #246: DT user doc updates
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/246#discussion_r175927937

    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
    @@ -418,7 +468,10 @@ tree_predict(tree_model, new_data_table TEXT. Name of the table containing prediction data. This table is expected to contain the same features that were used during training. The table
    - should also contain id_col_name used for identifying each row.
    + should also contain id_col_name used for identifying each row.
    +
    + If the new_data_table contains categorical variables
    --- End diff --

    Are we sure of this? We use majority branch in most cases when the feature does not provide a path.

---
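For readers unfamiliar with the term, the "majority branch" idea mentioned in the comment can be illustrated with a minimal Python sketch. This is not MADlib's implementation: the `Leaf` and `Split` classes, the `majority_left` flag, and the `predict` function are all hypothetical names made up for this example. The assumption is that each split remembers which child received most of the training rows, and that a row whose feature value matches neither child (for example an unseen categorical level, or a missing value) is sent down that child.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str


@dataclass
class Split:
    feature: str              # name of the categorical feature tested here
    left_levels: frozenset    # levels routed to the left child
    right_levels: frozenset   # levels routed to the right child
    left: "Node"
    right: "Node"
    majority_left: bool       # assumption: True if most training rows went left


Node = Union[Leaf, Split]


def predict(node, row):
    """Walk the tree; fall back to the majority branch when no path matches."""
    while isinstance(node, Split):
        value = row.get(node.feature)
        if value in node.left_levels:
            node = node.left
        elif value in node.right_levels:
            node = node.right
        else:
            # Unseen level or missing value: follow the majority branch.
            node = node.left if node.majority_left else node.right
    return node.prediction


tree = Split(
    feature="color",
    left_levels=frozenset({"red"}),
    right_levels=frozenset({"blue"}),
    left=Leaf("yes"),
    right=Leaf("no"),
    majority_left=False,
)
```

With this toy tree, a row with `color = "green"` (a level not seen in either child) is routed to the right child because `majority_left` is false, rather than failing to produce a prediction.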
[GitHub] madlib pull request #246: DT user doc updates
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/246#discussion_r175924018

    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
    @@ -127,7 +132,11 @@ tree_train( weights (optional) TEXT. Column name containing numerical weights for each observation.
    + Can be any value greater than 0 (does not need to be
    + an integer).
      This can be used to handle the case of unbalanced data sets.
    + For classification the row's vote is multiplied by the weight,
    --- End diff --

    I suggest rephrasing this as:
    > The `weights` is used to compute a weighted average in the output leaf node. For classification, the contribution of a row towards the vote of it's corresponding level is multiplied by the weight (weighted mode). For regression, the output value of the row is multiplied by the weight (weighted mean).

---
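The "weighted mode" and "weighted mean" in the suggested wording can be made concrete with a minimal Python sketch. It only illustrates the arithmetic for the rows that land in a single leaf; it is not MADlib code, and the function names `weighted_mode` and `weighted_mean` are hypothetical. The check that weights are greater than 0 follows the parameter description in the diff.

```python
from collections import defaultdict


def weighted_mode(labels, weights):
    """Classification leaf: each row adds its weight to its class's vote."""
    votes = defaultdict(float)
    for label, w in zip(labels, weights):
        if w <= 0:
            raise ValueError("weights must be greater than 0")
        votes[label] += w
    if not votes:
        raise ValueError("leaf contains no rows")
    # Class with the largest total weight; ties go to the class seen first.
    return max(votes, key=votes.get)


def weighted_mean(values, weights):
    """Regression leaf: weight-scaled sum of outputs divided by total weight."""
    total = 0.0
    weight_sum = 0.0
    for v, w in zip(values, weights):
        if w <= 0:
            raise ValueError("weights must be greater than 0")
        total += w * v
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("leaf contains no rows")
    return total / weight_sum
```

For example, with labels `["a", "b", "b"]` and weights `[3.0, 1.0, 1.0]`, class `"a"` wins with a vote of 3.0 against 2.0, even though `"b"` has more rows. With every weight equal to 1.0, both functions reduce to the ordinary mode and mean, which matches the statement that unweighted observations are treated equally.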
[GitHub] madlib pull request #246: DT user doc updates
GitHub user fmcquillan99 opened a pull request:

    https://github.com/apache/madlib/pull/246

    DT user doc updates

    @rahiyer please review DT user doc updates

    Will start working on RF in parallel.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fmcquillan99/apache-madlib doc-tree-1dot14

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #246

commit 7d04df1718408852d642c743aca1eef721a77a83
Author: Frank McQuillan
Date: 2018-03-20T18:56:58Z

    DT user doc updates

---