Repository: madlib
Updated Branches:
  refs/heads/master fa6d53a42 -> b6f4fa1f5

Docs: update KNN, DT and RF docs to match recent commits

Closes #235

Project: http://git-wip-us.apache.org/repos/asf/madlib/repo
Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/b6f4fa1f
Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/b6f4fa1f
Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/b6f4fa1f

Branch: refs/heads/master
Commit: b6f4fa1f508e0c51f0e86c114c294f9448f55d99
Parents: fa6d53a
Author: Frank McQuillan <fmcquil...@pivotal.io>
Authored: Tue Feb 13 16:16:57 2018 -0800
Committer: Nandish Jayaram <njaya...@apache.org>
Committed: Fri Feb 23 12:23:29 2018 -0800

----------------------------------------------------------------------
 src/ports/postgres/modules/knn/knn.sql_in       | 19 ++++++++++---
 .../recursive_partitioning/decision_tree.sql_in | 13 +++++++++
 .../recursive_partitioning/random_forest.sql_in | 28 +++++++++++++++-----
 3 files changed, 50 insertions(+), 10 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/madlib/blob/b6f4fa1f/src/ports/postgres/modules/knn/knn.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/knn/knn.sql_in b/src/ports/postgres/modules/knn/knn.sql_in
index 3139c15..1a90652 100644
--- a/src/ports/postgres/modules/knn/knn.sql_in
+++ b/src/ports/postgres/modules/knn/knn.sql_in
@@ -147,9 +147,18 @@ The following distance functions can be used:
 <li><b>user defined function</b> with signature <tt>DOUBLE PRECISION[] x, DOUBLE PRECISION[] y -> DOUBLE PRECISION</tt></li></ul></dd>
 <dt>weighted_avg (optional)</dt>
-<dd>BOOLEAN, default: FALSE. Calculates the Regression or classication
-of k-NN using the weighted average method.
-
+<dd>BOOLEAN, default: FALSE. Calculates classification or
+regression values using a weighted average. The idea is to
+weigh the contribution of each of the k neighbors according
+to their distance to the test point, giving greater influence
+to closer neighbors. The distance function 'fn_dist' specified
+above is used.
+
+For classification, majority voting weighs a neighbor
+according to inverse distance.
+
+For regression, the inverse distance weighting approach is
+used from Shepard [4].
 </dl>
@@ -392,6 +401,10 @@ is assigned to the test point.
 [3] Gongde Guo1, Hui Wang, David Bell, Yaxin Bi, Kieran Greer: KNN Model-Based Approach in Classification,
 https://ai2-s2-pdfs.s3.amazonaws.com/a7e2/814ec5db800d2f8c4313fd436e9cf8273821.pdf
+@anchor knn-lit-4
+[4] Shepard, Donald (1968). "A two-dimensional interpolation function for
+irregularly-spaced data". Proceedings of the 1968 ACM National Conference. pp. 517–524.
+
 @internal
 @sa namespace knn (documenting the implementation in Python)
 @endinternal

http://git-wip-us.apache.org/repos/asf/madlib/blob/b6f4fa1f/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
index 0878b10..eb0e760 100644
--- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
@@ -355,6 +355,19 @@ tree_train(
 <th>independent_var_types</th>
 <td>TEXT. A comma separated string for the types of independent variables.</td>
 </tr>
+
+ <tr>
+ <th>n_folds</th>
+ <td>BIGINT. Number of cross-validation folds used.</td>
+ </tr>
+
+ <tr>
+ <th>null_proxy</th>
+ <td>TEXT. Describes how NULLs are handled. If NULL is not
+ treated as a separate categorical variable, this will be NULL.
+ If NULL is treated as a separate categorical value, this will be
+ set to "__NULL__"</td>
+ </tr>
 </table>
 </DD>
 </DL>

http://git-wip-us.apache.org/repos/asf/madlib/blob/b6f4fa1f/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
index b74288a..cc228ac 100644
--- a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
@@ -208,13 +208,26 @@ forest_train(training_table_name,
 <tr>
 <th>dependent_var_levels</th>
- <td>itext. For classification, the distinct levels of the dependent variable.</td>
+ <td>text. For classification, the distinct levels of the dependent variable.</td>
 </tr>
 <tr>
 <th>dependent_var_type</th>
 <td>text. The type of dependent variable.</td>
 </tr>
+
+ <tr>
+ <th>independent_var_types</th>
+ <td>text. A comma separated string for the types of independent variables.</td>
+ </tr>
+
+ <tr>
+ <th>null_proxy</th>
+ <td>text. Describes how NULLs are handled. If NULL is not
+ treated as a separate categorical variable, this will be NULL.
+ If NULL is treated as a separate categorical value, this will be
+ set to "__NULL__"</td>
+ </tr>
 </table>
 A group table named <em> \<model_table\>_group</em> is created, which has the following columns:
@@ -374,7 +387,7 @@ forest_train(training_table_name,
 variable comes into use when the primary predictior value is NULL.
 </tr>
 <tr>
- <th>null_as_special_cat</th>
+ <th>null_as_category</th>
 <td>Default: FALSE. Whether to treat NULL as a special categorical value.
 If this is set to TRUE, NULL values are considered a categorical
@@ -564,7 +577,7 @@ dependent_varname | class
 independent_varnames | "OUTLOOK",windy,temperature,humidity
 cat_features | "OUTLOOK",windy
 con_features | temperature,humidity
-grouping_cols |
+grouping_cols |
 num_trees | 20
 num_random_features | 2
 max_tree_depth | 8
@@ -581,6 +594,7 @@ total_rows_skipped | 0
 dependent_var_levels | "Don't Play","Play"
 dependent_var_type | text
 independent_var_types | text, text, double precision, double precision
+null_proxy | None
 </pre>
 View the group table output:
 <pre class="example">
@@ -592,10 +606,10 @@ Result:
 gid | 1
 success | t
 cat_n_levels | {3,2}
-cat_levels_in_text | {overcast,rain,sunny,false,true}
-oob_error | 0.50000000000000000000
-cat_var_importance | {-0.206309523809524,-0.234345238095238}
-con_var_importance | {-0.308690476190476,-0.272678571428571}
+cat_levels_in_text | {overcast,sunny,rain,false,true}
+oob_error | 0.42857142857142857143
+cat_var_importance | {0.0305555555555556,0.0626984126984127}
+con_var_importance | {0,0.0243650793650794}
 </pre>
 -# Obtain a dot format display of a single tree