This is an automated email from the ASF dual-hosted git repository.

fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git
The following commit(s) were added to refs/heads/master by this push:
     new 20c87fa  add sections to RF and DT user docs on run-time and memory usage
20c87fa is described below

commit 20c87faefd3a166c5456112fba1c8b6ab107ad18
Author: Frank McQuillan <fmcquil...@pivotal.io>
AuthorDate: Fri Apr 19 17:23:51 2019 -0700

    add sections to RF and DT user docs on run-time and memory usage
---
 .../deep_learning/keras_model_arch_table.sql_in    |  2 +-
 .../recursive_partitioning/decision_tree.sql_in    | 34 +++++++++++++----
 .../recursive_partitioning/random_forest.sql_in    | 43 +++++++++++++++++----
 .../modules/regress/clustered_variance.sql_in      |  6 +--
 .../postgres/modules/sample/balance_sample.sql_in  |  2 +-
 src/ports/postgres/modules/svm/svm.sql_in          |  4 +-
 6 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
index bb734ab..16037c2 100644
--- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
+++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
@@ -129,7 +129,7 @@ model.add(Dense(3, name='dense_2'))
 model.to_json
 </pre>
 This is represented by the following JSON:
-<pre class="example">
+<pre class="result">
 '{"class_name": "Sequential", "keras_version": "2.1.6", "config":
 [{"class_name": "Dense", "config": {"kernel_initializer": {"class_name":
 "VarianceScaling", "config": {"distribution": "uniform",
diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
index 8ad7a9d..bf1c883 100644
--- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
@@ -17,6 +17,7 @@ m4_include(`SQLCommon.m4')
 <div class="toc"><b>Contents</b><ul>
 <li class="level1"><a href="#train">Training Function</a></li>
+<li class="level1"><a href="#runtime">Run-time and Memory Usage</a></li>
 <li class="level1"><a href="#predict">Prediction Function</a></li>
 <li class="level1"><a href="#display">Tree Display</a></li>
 <li class="level1"><a href="#display_importance">Importance Display</a></li>
@@ -109,7 +110,7 @@ tree_train(
   by their value.
   </DD>
 
-  <DT>list_of_features_to_exclude</DT>
+  <DT>list_of_features_to_exclude (optional)</DT>
   <DD>TEXT. Comma-separated string of column names to exclude from the
   predictors list. If the <em>dependent_variable</em> is an expression
   (including cast of a column name), then this list should include the columns present in the
@@ -118,7 +119,7 @@ tree_train(
   The names in this parameter should be identical to the names used in the
   table and quoted appropriately.
   </DD>
 
-  <DT>split_criterion</DT>
+  <DT>split_criterion (optional)</DT>
   <DD>TEXT, default = 'gini' for classification, 'mse' for regression.
   Impurity function to compute the feature to use to split a node.
   Supported criteria are 'gini', 'entropy', 'misclassification' for
@@ -148,7 +149,8 @@ tree_train(
   <DD>INTEGER, default: 7. Maximum depth of any node of the final tree,
   with the root node counted as depth 0. A deeper tree can lead to better
   prediction but will also result in
-  longer processing time and higher memory usage.</DD>
+  longer processing time and higher memory usage.
+  Current allowed maximum is 100.</DD>
 
   <DT>min_split (optional)</DT>
   <DD>INTEGER, default: 20. Minimum number of observations that must exist
@@ -475,11 +477,27 @@ provided <em>cp</em> and explore all possible sub-trees (up to a single-node tree)
 to compute the optimal sub-tree. The optimal sub-tree and the 'cp' corresponding
 to this optimal sub-tree is placed in the <em>output_table</em>, with the
 columns named as <em>tree</em> and <em>pruning_cp</em> respectively.
-- The main parameters that affect memory usage are: depth of
-tree (‘max_depth’), number of features, number of values per
-categorical feature, and number of bins for continuous features (‘num_splits’).
-If you are hitting memory limits, consider reducing one or
-more of these parameters.
+
+@anchor runtime
+@par Run-time and Memory Usage
+
+The number of features and the number of class values per categorical feature have a direct
+impact on run-time and memory. In addition, here is a summary of the main parameters
+in the training function that affect run-time and memory:
+
+| Parameter | Run-time | Memory | Notes |
+| :------ | :------ | :------ | :------ |
+| 'max_depth' | High | High | Deeper trees can take longer to run and use more memory. |
+| 'min_split' | No or little effect, unless very small. | No or little effect, unless very small. | If too small, can impact run-time by building trees that are very thick. |
+| 'min_bucket' | No or little effect, unless very small. | No or little effect, unless very small. | If too small, can impact run-time by building trees that are very thick. |
+| 'num_splits' | High | High | Depends on number of continuous variables. Effectively adds more features as the binning becomes more granular. |
+
+If you experience long run-times or are hitting memory limits, consider reducing one or
+more of these parameters. One approach when building a decision tree model is to start
+with a low maximum depth value and use suggested defaults for
+other parameters. This will give you a sense of run-time and test set accuracy.
+Then you can change maximum depth in a systematic way as required
+to improve accuracy.
 
 @anchor predict
 @par Prediction Function
diff --git a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
index ba0049b..251dfbc 100644
--- a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
@@ -17,6 +17,7 @@ m4_include(`SQLCommon.m4')
 <div class="toc"><b>Contents</b><ul>
 <li class="level1"><a href="#train">Training Function</a></li>
+<li class="level1"><a href="#runtime">Run-time and Memory Usage</a></li>
 <li class="level1"><a href="#predict">Prediction Function</a></li>
 <li class="level1"><a href="#get_tree">Tree Display</a></li>
 <li class="level1"><a href="#get_importance">Importance Display</a></li>
@@ -139,7 +140,8 @@ forest_train(training_table_name,
   <DT>num_random_features (optional)</DT>
   <DD>INTEGER, default: sqrt(n) for classification, n/3
-  for regression. This is the number of features to randomly
+  for regression, where n is the number of features.
+  This parameter is the number of features to randomly
   select at each split.</DD>
 
   <DT>importance (optional)</DT>
@@ -154,7 +156,8 @@ forest_train(training_table_name,
   <DT>num_permutations (optional)</DT>
   <DD>INTEGER, default: 1. Number of times to permute each feature value while
-  calculating the out-of-bag variable importance.
+  calculating the out-of-bag variable importance. Only applies when
+  the 'importance' parameter is set to true.
 
 @note Variable importance for a feature is determined by permuting the variable
 and computing the drop in predictive accuracy using out-of-bag samples [1].
@@ -174,7 +177,10 @@ forest_train(training_table_name,
   <DD>INTEGER, default: 7. Maximum depth of any node of a tree, with the
   root node counted as depth 0. A deeper tree can lead to better
   prediction but will also result in
-  longer processing time and higher memory usage.</DD>
+  longer processing time and higher memory usage.
+  Current allowed maximum is 15. Note that since random forest
+  is an ensemble method, individual trees typically do not need
+  to be deep.</DD>
 
   <DT>min_split (optional)</DT>
   <DD>INTEGER, default: 20. Minimum number of observations that must exist
@@ -477,11 +483,30 @@ forest_train(training_table_name,
 </DD>
 </DL>
 
-@note The main parameters that affect memory usage are: depth of
-tree (‘max_tree_depth’), number of features, number of values per
-categorical feature, and number of bins for continuous features (‘num_splits’).
-If you are hitting memory limits, consider reducing one or
-more of these parameters.
+@anchor runtime
+@par Run-time and Memory Usage
+
+The number of features and the number of class values per categorical feature have a direct
+impact on run-time and memory. In addition, here is a summary of the main parameters
+in the training function that affect run-time and memory:
+
+| Parameter | Run-time | Memory | Notes |
+| :------ | :------ | :------ | :------ |
+| 'num_trees' | High | No or little effect. | Linear with number of trees. Note that trees train sequentially one after another, though each tree is trained in parallel. |
+| 'importance' | Moderate | No or little effect. | Depends on number of features and 'num_permutations' parameter. |
+| 'num_permutations' | Moderate | No or little effect. | Depends on number of features. |
+| 'max_tree_depth' | High | High | Deeper trees can take longer to run and use more memory. |
+| 'min_split' | No or little effect, unless very small. | No or little effect, unless very small. | If too small, can impact run-time by building trees that are very thick. |
+| 'min_bucket' | No or little effect, unless very small. | No or little effect, unless very small. | If too small, can impact run-time by building trees that are very thick. |
+| 'num_splits' | High | High | Depends on number of continuous variables. Effectively adds more features as the binning becomes more granular. |
+| 'sample_ratio' | High | High | Reduces run time by using only some of the data. |
+
+If you experience long run-times or are hitting memory limits, consider reducing one or
+more of these parameters. One approach when building a random forest model is to start
+with a small number of trees and a low maximum depth value, and use suggested defaults for
+other parameters. This will give you a sense of run-time and test set accuracy.
+Then you can change number of trees and maximum depth in a systematic way as required
+to improve accuracy.
 
 @anchor predict
 @par Prediction Function
@@ -1446,7 +1471,7 @@ File random_forest.sql_in documenting the training function
  * are to be used as predictors (except the ones included in
  * the next argument). Boolean, integer, and text columns are
  * considered categorical columns.
- * @param list_of_features_to_exclude OPTIONAL. List of column names
+ * @param list_of_features_to_exclude List of column names
  *        (comma-separated string) to exlude from the predictors list.
  * @param grouping_cols OPTIONAL. List of column names (comma-separated
 *        string) to group the data by. This will lead to creating
diff --git a/src/ports/postgres/modules/regress/clustered_variance.sql_in b/src/ports/postgres/modules/regress/clustered_variance.sql_in
index afd83d0..f05630d 100644
--- a/src/ports/postgres/modules/regress/clustered_variance.sql_in
+++ b/src/ports/postgres/modules/regress/clustered_variance.sql_in
@@ -291,7 +291,7 @@ SELECT madlib.clustered_variance_linregr();
 
 -# Run the linear regression function and view the results.
 <pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
 SELECT madlib.clustered_variance_linregr( 'abalone',
                                           'out_table',
                                           'rings',
@@ -309,7 +309,7 @@ SELECT madlib.clustered_variance_logregr();
 
 -# Run the logistic regression function and view the results.
 <pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
 SELECT madlib.clustered_variance_logregr( 'abalone',
                                           'out_table',
                                           'rings < 10',
@@ -326,7 +326,7 @@ SELECT madlib.clustered_variance_mlogregr();
 
 -# Run the multinomial logistic regression and view the results.
 <pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
 SELECT madlib.clustered_variance_mlogregr( 'abalone',
                                            'out_table',
                                            'CASE WHEN rings < 10 THEN 1 ELSE 0 END',
diff --git a/src/ports/postgres/modules/sample/balance_sample.sql_in b/src/ports/postgres/modules/sample/balance_sample.sql_in
index eea73aa..15d86e6 100644
--- a/src/ports/postgres/modules/sample/balance_sample.sql_in
+++ b/src/ports/postgres/modules/sample/balance_sample.sql_in
@@ -185,7 +185,7 @@ The following table shows how the parameters 'class_size' and
 'output_table_size' work together:
 
 | Case | 'class_size' | 'output_table_size' | Result |
-| :---- | :---- | :---- | :---- |
+| :------ | :------ | :----------------- | :-------- |
 | 1 | 'uniform' | NULL | Resample for uniform class size with output size = input size (i.e., balanced). |
 | 2 | 'uniform' | 10000 | Resample for uniform class size with output size = 10K (i.e., balanced). |
 | 3 | NULL | NULL | Resample for uniform class size with output size = input size (i.e., balanced). Class_size=NULL has same behavior as ‘uniform’. |
diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in
index ddfa134..a55fd5f 100644
--- a/src/ports/postgres/modules/svm/svm.sql_in
+++ b/src/ports/postgres/modules/svm/svm.sql_in
@@ -414,8 +414,8 @@ resulting \e init_stepsize can be run on the whole dataset.
 <DT>tolerance</dt>
 <DD>Default: 1e-10.
 The criterion to end iterations. The training stops whenever
-<the difference between the training models of two consecutive iterations is
-<smaller than \e tolerance or the iteration number is larger than \e max_iter.
+the difference between the training models of two consecutive iterations is
+smaller than \e tolerance or the iteration number is larger than \e max_iter.
 </DD>
 
 <DT>lambda</dt>
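
To illustrate the tuning advice in the new decision tree run-time section above, here is a minimal
first-run sketch. This is a sketch only: the source table 'houses' and its columns are hypothetical
placeholders (not part of this commit), and the positional arguments follow the tree_train order
documented in decision_tree.sql_in (features to exclude, split criterion, grouping columns, weights,
then max_depth).
<pre class="example">
DROP TABLE IF EXISTS train_output, train_output_summary;
SELECT madlib.tree_train('houses',               -- source table (hypothetical)
                         'train_output',         -- output model table
                         'id',                   -- id column
                         'price',                -- dependent variable
                         'bedroom, bath, size',  -- features (hypothetical columns)
                         NULL,                   -- no columns to exclude
                         'mse',                  -- split criterion for regression
                         NULL,                   -- no grouping
                         NULL,                   -- no weights
                         3                       -- max_depth kept low for a first run
                         );
</pre>
Once the run-time and test set accuracy of this shallow tree are known, max_depth can be
increased in a systematic way as the new section suggests, with the other parameters left at
their defaults.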
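
The same idea applies to the new random forest run-time section: start with a small number of
shallow trees and the suggested defaults. Again a sketch only, with hypothetical table and column
names, following the forest_train argument order documented in random_forest.sql_in (grouping
columns, num_trees, num_random_features, importance, num_permutations, then max_tree_depth).
<pre class="example">
DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
SELECT madlib.forest_train('houses',               -- source table (hypothetical)
                           'rf_output',            -- output model table
                           'id',                   -- id column
                           'price',                -- dependent variable
                           'bedroom, bath, size',  -- features (hypothetical columns)
                           NULL,                   -- no columns to exclude
                           NULL,                   -- no grouping
                           10,                     -- num_trees: start small
                           NULL,                   -- num_random_features: use default
                           TRUE,                   -- importance
                           1,                      -- num_permutations
                           3                       -- max_tree_depth kept low for a first run
                           );
</pre>
Number of trees and maximum depth can then be increased in a systematic way as described in the
new section, once run-time and test set accuracy for this small forest are known.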