[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/342 @reductionista thank you for the comments. The existing `minibatch_preprocessor` module outputs new columns called `dependent_varname` and `independent_varname` instead of the column names from the input table. We did the same here purely to conform with what is already in the other module. The other module allows expressions as input params (which may have been the reason behind a different column name in its output table), while this module does not explicitly support expressions. So I do agree with your point about the output table column names, but I am just not sure how odd it would be to have a difference between the two modules. Maybe other folks could also weigh in to help us decide. Also, this module (`minibatch_preprocessor_dl`) is at an early stage of development, so this is a great time to try out options. Regarding your comment on the ordering of the two input params (`x` and `y`): this follows the convention we have in every other MADlib module, namely, the dependent variable comes first, followed by the independent variable, in the input parameters list. If you'd like it to be the opposite, it might be a good idea to start a separate thread on the community mailing list to discuss it. It would break conformity if we changed the order of the two variables only in this module. BTW, the `2.0` release will be a good time to change it, since that release would break backward compatibility. ---
[GitHub] madlib pull request #342: Minibatch Preprocessor for Deep learning
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/342 Minibatch Preprocessor for Deep learning The minibatch preprocessor we currently have in MADlib is bloated for DL tasks. This feature adds a simplified version of creating buffers, and divides each element of the independent array by a normalizing constant for standardization (which is 255.0 by default). This is standard practice with image data. Co-authored-by: Arvind Sridhar Co-authored-by: Domino Valdano You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib deep-learning/minibatch-preprocessor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/342.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #342 commit c983aafcd5e31bab5dbc278178ff9e2e17942ea1 Author: Nandish Jayaram Date: 2018-12-18T01:54:42Z Minibatch Preprocessor for Deep learning The minibatch preprocessor we currently have in MADlib is bloated for DL tasks. This feature adds a simplified version of creating buffers, and divides each element of the independent array by a normalizing constant for standardization (which is 255.0 by default). This is standard practice with image data. Co-authored-by: Arvind Sridhar Co-authored-by: Domino Valdano ---
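The normalization step described in the PR is simple enough to sketch in a few lines of plain Python. This is an illustrative stand-alone version (the function name is hypothetical, not MADlib's actual implementation):

```python
def normalize_independent_var(x, normalizing_const=255.0):
    """Divide every element of the independent variable array by the
    normalizing constant. The default of 255.0 matches the maximum value
    of 8-bit image channels, so pixel data lands in [0, 1]."""
    return [float(v) / normalizing_const for v in x]
```

For example, `normalize_independent_var([0, 127.5, 255])` yields `[0.0, 0.5, 1.0]`.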
[GitHub] madlib pull request #341: Minibatch Preprocessor for Deep learning
Github user njayaram2 closed the pull request at: https://github.com/apache/madlib/pull/341 ---
[GitHub] madlib pull request #341: Minibatch Preprocessor for Deep learning
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/341 Minibatch Preprocessor for Deep learning The minibatch preprocessor we currently have in MADlib is bloated for DL tasks. This feature adds a simplified version of creating buffers, and divides each element of the independent array by a normalizing constant for standardization (which is 255.0 by default). This is standard practice with image data. Co-authored-by: Arvind Sridhar Co-authored-by: Domino Valdano You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib deep-learning/minibatch-preprocessor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/341.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #341 commit 78430bc8586ae0256a24de2472392564a15f7f8e Author: Nandish Jayaram Date: 2018-12-18T01:54:42Z Minibatch Preprocessor for Deep learning The minibatch preprocessor we currently have in MADlib is bloated for DL tasks. This feature adds a simplified version of creating buffers, and divides each element of the independent array by a normalizing constant for standardization (which is 255.0 by default). This is standard practice with image data. Co-authored-by: Arvind Sridhar Co-authored-by: Domino Valdano ---
[GitHub] madlib pull request #338: Install/Dev check: Add new test cases for some mod...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/338 Install/Dev check: Add new test cases for some modules Some modules such as array_ops and pmml did not have any install check files, while stemmer did not have any test files. This commit adds some basic test cases for these modules. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib ic-pmml-stemmer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/338.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #338 commit c351f176b305fb44bd87bc6a4f79c099a3d6fbe3 Author: Nandish Jayaram Date: 2018-09-29T00:15:40Z Install/Dev check: Add new test cases for some modules Some modules such as array_ops and pmml did not have any install check files, while stemmer did not have any test files. This commit adds some basic test cases for these modules. ---
[GitHub] madlib pull request #334: Minibatch Preprocessor: Update online doc
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/334 Minibatch Preprocessor: Update online doc The online doc is outdated. This commit adds two new parameters that have been introduced since the last time the doc was edited. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib doc/minibatch-preprocessor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/334.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #334 commit 7e95fc7d936f25e74ceceb74dfa7473c4eda45c8 Author: Nandish Jayaram Date: 2018-10-23T17:35:02Z Minibatch Preprocessor: Update online doc The online doc is outdated. This commit adds two new parameters that have been introduced since the last time the doc was edited. ---
[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...
Github user njayaram2 closed the pull request at: https://github.com/apache/madlib/pull/326 ---
[GitHub] madlib pull request #327: Upgrade: Fix issue with upgrading RPM to 1.15.1
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/327 Upgrade: Fix issue with upgrading RPM to 1.15.1 JIRA: MADLIB-1278 During RPM upgrade, rpm_post.sh is run first, followed by rpm_post_uninstall.sh. So any uninstallation-specific steps must be gated on whether the current operation is an uninstall or an upgrade. This commit makes the necessary change to remove symlinks only during uninstallation, and not while upgrading. Co-authored-by: Domino Valdano You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib rpm-upgrade-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/327.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #327 commit 8234449be7d8522aa6d351bf504a573558cad4d3 Author: Nandish Jayaram Date: 2018-10-01T21:32:44Z Upgrade: Fix issue with upgrading RPM to 1.15.1 JIRA: MADLIB-1278 During RPM upgrade, rpm_post.sh is run first, followed by rpm_post_uninstall.sh. So any uninstallation-specific steps must be gated on whether the current operation is an uninstall or an upgrade. This commit makes the necessary change to remove symlinks only during uninstallation, and not while upgrading. Co-authored-by: Domino Valdano ---
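The mechanism behind this fix is the standard RPM scriptlet argument: rpm passes each scriptlet the number of package instances that will remain after the operation, so in `%postun` the argument is `0` for a true uninstall and `1` or more for an upgrade. A minimal Python sketch of the gating logic (illustrative only, not the actual contents of rpm_post_uninstall.sh):

```python
def should_remove_symlinks(postun_arg):
    """Decide whether symlink cleanup should run in a %postun scriptlet.
    rpm passes the count of package instances remaining after the
    operation: 0 means a true uninstall, >= 1 means an upgrade, where
    cleanup must be skipped so the upgraded install keeps its symlinks."""
    return int(postun_arg) == 0
```

So during an RPM upgrade to 1.15.1 the scriptlet sees `1` and leaves the symlinks alone.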
[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/326 Install/Dev check: Add new test cases for some modules Some modules such as array_ops and pmml did not have any install check files, while stemmer did not have any test files. This commit adds some basic test cases for these modules. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib ic-pmml-stemmer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/326.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #326 commit 0928be85ac504ae90ec6a2f2de694cd5aa2440f7 Author: Nandish Jayaram Date: 2018-09-29T00:15:40Z Install/Dev check: Add new test cases for some modules Some modules such as array_ops and pmml did not have any install check files, while stemmer did not have any test files. This commit adds some basic test cases for these modules. ---
[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/315 @fmcquillan99 thanks for testing this out. I can have a look at this issue. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r215464110 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) --- End diff -- @fmcquillan99 yes, I was looking at the alternatives. @hpandeycodeit I can take care of this. I will push a commit to this PR (you may have to merge it, since I am not a co-author on your branch). ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214485090 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) --- End diff -- `point_column_name` and `test_column_name` params must be an array as this if check suggests. If it's not an array it fails further down when the distance function (such as `squared_dist_norm2`) is called. I don't think the function `is_var_valid()` checks for these being arrays. You may have to check them after the new asserts, using a new helper function (`is_col_array()` cannot be used as is for expressions, and `is_var_valid()` does not check for an array, but just the validity of the expression) ---
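One way to check that an arbitrary expression (not just a bare column) evaluates to an array is to ask PostgreSQL directly via `pg_typeof`, whose text form ends in `[]` for array types. A hypothetical helper along those lines, shown here only as a query builder; the real module would run the query through `plpy.execute` and inspect the boolean result:

```python
def is_expr_array_query(source_table, expression):
    """Build a SQL query that reports whether `expression` evaluates to
    an array type when run against `source_table`. pg_typeof(...)::text
    renders array types with a trailing '[]', e.g. 'double precision[]'."""
    return ("SELECT pg_typeof({0})::text LIKE '%[]' AS is_array "
            "FROM {1} LIMIT 1").format(expression, source_table)
```

This covers both plain array columns and expressions such as `arr_col[1:3]` or `array_append(arr_col, 0)` with a single code path.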
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214487281 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(is_var_valid(point_source, point_column_name), +"kNN error: Invalid point_column_name expression") --- End diff -- Can we change the error message to indicate invalid column name or expression? How about `"kNN error: {0} is an invalid column name or expression for point_column_name param".format(point_column_name)`? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214484071 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') --- End diff -- `point_id` and `test_id` params are not validated anymore. This should still be done, since the new asserts only check for `point_column_name` and `test_column_name` being valid expressions. This validation is required to catch invalid values such as `NULL`, `''`, invalid or non-existing column name etc. for `point_id` and `test_id` params. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214485621 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) --- End diff -- I think we will need to keep these asserts too. They seem to be checking for a different validation. If the helper function `array_col_has_no_null()` does not work for expressions, please go ahead and change that, or create a new helper function to handle the scenario (both expressions and column names). ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214487426 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(is_var_valid(point_source, point_column_name), +"kNN error: Invalid point_column_name expression") + +_assert(is_var_valid(test_source, test_column_name), +"kNN error: Invalid test_column_name expression") --- End diff -- Similar to the above comment, can we change the error message to: `"kNN error: {0} is an invalid column name or expression for test_column_name param".format(test_column_name)`? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214482318 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +p_col_name, +t_col_name) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM +( +SELECT {point_id} , {point_column_name} as p_col_name , {label_column_name} from {point_source} +) train, +( +SELECT {test_id} ,{test_column_name} as t_col_name from {test_source} +) test --- End diff -- Can you please use variables with unique strings for `train`, `test`, `p_col_name` and `t_col_name`. If the train or test table is named any of those, the query would fail I guess. While you are at it, could you also do the same for other variables in this query: `train_id`, `r`, `dist_inverse` and others I may have missed listing out? ---
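The collision risk called out here is usually avoided by generating aliases with a unique suffix, in the spirit of MADlib's `unique_string()` utility. A minimal stand-alone sketch (names and format are illustrative, not the utility's actual output):

```python
import uuid

def unique_alias(prefix):
    """Generate a collision-resistant SQL alias so that user tables named
    'train', 'test', etc. cannot shadow the query's internal aliases."""
    return "__{0}_{1}__".format(prefix, uuid.uuid4().hex[:10])
```

Each call yields a fresh identifier, e.g. two calls with prefix `train` produce two distinct aliases.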
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r213806554 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(point_column_name, "KNN error: Invalid point_column_name expression") + +_assert(test_column_name, "KNN error: Invalid test_column_name expression") --- End diff -- `KNN` in the error message is different from `kNN` used in other error messages (capital `K`). Please keep it consistent as `kNN`. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r213791885 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(point_column_name, "KNN error: Invalid point_column_name expression") + +_assert(test_column_name, "KNN error: Invalid test_column_name expression") --- End diff -- Since the original asserts are removed, this results in the function call not exiting gracefully when we have incorrect param values. You may have to use function `is_var_valid()` in `validate_args.py_in` to validate `point_column_name` and `test_column_name`. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r213792935 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +train.{point_col_name_temp}, +test.{test_col_name_temp}) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM + {point_source_temp_table} as train, + {test_source_temp_table} as test --- End diff -- Please use subqueries, instead of tables: ``` (select {point_id} , {point_column_name} as {point_col_name_temp} , {label_column_name} from {point_source}) train, (select {test_id}, {test_column_name} as {test_col_name_temp} from {test_source}) test ``` ---
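The suggested subquery rewrite can be sketched as a small query builder. The default alias names `__p_col__`/`__t_col__` here are illustrative placeholders, not the module's actual identifiers; wrapping the sources in subqueries lets an expression be aliased once and then reused by the distance function:

```python
def build_knn_from_clause(point_source, test_source, point_id, test_id,
                          point_column_name, test_column_name,
                          label_column_name,
                          point_col_alias="__p_col__",
                          test_col_alias="__t_col__"):
    """Build the FROM clause for the kNN distance query using subqueries,
    so point_column_name/test_column_name may be expressions rather than
    bare column names."""
    train_sq = ("(SELECT {0}, {1} AS {2}, {3} FROM {4}) train"
                .format(point_id, point_column_name, point_col_alias,
                        label_column_name, point_source))
    test_sq = ("(SELECT {0}, {1} AS {2} FROM {3}) test"
               .format(test_id, test_column_name, test_col_alias,
                       test_source))
    return "FROM {0}, {1}".format(train_sq, test_sq)
```

The distance function then refers to `train.__p_col__` and `test.__t_col__` regardless of whether the user passed a column or an expression.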
[GitHub] madlib issue #314: Ubuntu support: Enable creation of gppkg on Ubuntu
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/314 Thank you for the comments @reductionista , I have updated the comment. Please do have a look at it. ---
[GitHub] madlib pull request #314: Ubuntu support: Enable creation of gppkg on Ubuntu
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/314 Ubuntu support: Enable creation of gppkg on Ubuntu This commit makes necessary changes to create a gppkg on Ubuntu. The default behavior when MADlib is built on Ubuntu is to create a .deb installer. If we want to create a gppkg, then we need an RPM due to limitations in gppkg. We now create an RPM on Ubuntu (assuming package alien is installed on Ubuntu) if the right cmake flag is specified. Once an RPM is created on `make package`, we can now go ahead and create the gppkg using `make gppkg`. The cmake flag to use if we want to create an .rpm instead of .deb on Ubuntu when we run `make package` is: -DCREATE_RPM_FOR_UBUNTU=True Co-authored-by: Orhan Kislal You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib ubuntu-gppkg-support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/314.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #314 commit bd85bae33473e697d7c26fdac9ee3253ff9a82b3 Author: Nandish Jayaram Date: 2018-08-06T22:19:18Z Ubuntu support: Enable creation of gppkg on Ubuntu This commit makes necessary changes to create a gppkg on Ubuntu. The default behavior when MADlib is built on Ubuntu is to create a .deb installer. If we want to create a gppkg, then we need an RPM due to limitations in gppkg. We now create an RPM on Ubuntu (assuming package alien is installed on Ubuntu) if the right cmake flag is specified. Once an RPM is created on `make package`, we can now go ahead and create the gppkg using `make gppkg`. The cmake flag to use if we want to create an .rpm instead of .deb on Ubuntu when we run `make package` is: -DCREATE_RPM_FOR_UBUNTU=True Co-authored-by: Orhan Kislal ---
[GitHub] madlib pull request #295: Recursive Partitioning: Add function to report imp...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/295#discussion_r204593622 --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in --- @@ -1564,69 +1578,69 @@ def get_var_importance(schema_madlib, model_table, output_table, **kwargs): """ # Validate parameters summary_table = add_postfix(model_table, "_summary") -_validate_var_importance_input(model_table, summary_table, output_table) -grouping_cols = plpy.execute( -"SELECT grouping_cols FROM {summary_table}". -format(**locals()))[0]['grouping_cols'] -grouping_cols_comma = '' -if grouping_cols : -grouping_cols_comma = add_postfix(grouping_cols,", ") -is_RF = _is_model_for_RF(summary_table) +_validate_var_importance_input(model_table, + summary_table, + output_table) +is_RF = _is_random_forest_model(summary_table) + +# 'importance_model_table' is the table containing variable importance values. +# For RF, it is placed in <model_table>_group as opposed to +# <model_table> for DT. +importance_model_table = (model_table if not is_RF else + add_postfix(model_table, "_group")) +grouping_cols = plpy.execute("SELECT grouping_cols FROM {summary_table}". + format(**locals()))[0]['grouping_cols'] +if grouping_cols: +grouping_cols_comma = add_postfix(grouping_cols, ", ") +else: +grouping_cols_comma = '' +is_impurity_imp_col_present = _is_impurity_importance_in_model( +importance_model_table, summary_table, is_RF=is_RF) + +# convert importance to percentages +normalization_target = 100.0 + +def _unnest_normalize(input_array_str): +return (""" +unnest({0}.normalize_sum_array({1}::double precision[], + {2}::double precision)) +""".format(schema_madlib, input_array_str, normalization_target)) + if is_RF: -group_table = add_postfix(model_table, "_group") -is_impurity_imp_col_present = _is_impurity_importance_in_group_table( -group_table, summary_table) -# The group table for >= 1.15 RF models should have a column named -# impurity_var_importance if it was learnt with importance param True. -# So set is_pre_1_15_RF_model to False if the column exists, and to -# True if the column does not exist. -is_pre_1_15_RF_model = False if is_impurity_imp_col_present else True - -# Query to add oob variable importance for categorical vars -if is_pre_1_15_RF_model: -# In < 1.15 RF model, the variable importance was captured using two -# different columns named 'cat_var_importance' -plpy.execute( -""" CREATE TABLE {output_table} AS -- Add oob variable importance for categorical vars -SELECT {grouping_cols_comma} -unnest(regexp_split_to_array(cat_features, ',')) AS feature, -unnest(cat_var_importance) AS var_importance -FROM {group_table}, {summary_table} -UNION -- Add oob variable importance for continuous vars -SELECT {grouping_cols_comma} -unnest(regexp_split_to_array(con_features, ',')) AS feature, -unnest(con_var_importance) AS var_importance -FROM {group_table}, {summary_table} -""".format(**locals())) +if is_impurity_imp_col_present: +# In versions >= 1.15, the OOB variable importance is captured +# in a single column: 'oob_var_importance'. +oob_var_importance_str = ( +"{0} AS oob_var_importance,". +format(_unnest_normalize('oob_var_importance'))) +impurity_var_importance_str = ( +"{0} AS impurity_var_importance". +format(_unnest_normalize('impurity_var_importance'))) else: -# In >= 1.15 RF models, the variable importance and impurity -# importance scores are captured in columns 'oob_var_importance' -# and 'impurity_var_importance' respectively. -plpy.execute( -
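The `normalize_sum_array` helper referenced in the diff scales importance values so they sum to a target (100.0, i.e. percentages). A plain-Python sketch of that behavior (an assumption about the SQL helper's semantics, not its actual implementation):

```python
def normalize_sum_array(values, target=100.0):
    """Scale an array so its elements sum to `target`; with the default
    of 100.0 each element becomes a percentage of the total. An all-zero
    input is returned as zeros to avoid dividing by zero."""
    total = float(sum(values))
    if total == 0:
        return [0.0] * len(values)
    return [v * target / total for v in values]
```

For example, raw importances `[1, 1, 2]` normalize to `[25.0, 25.0, 50.0]`.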
[GitHub] madlib pull request #291: Feature: Vector-Column Transformations
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/291#discussion_r204589559 --- Diff: src/ports/postgres/modules/utilities/transform_vec_cols.py_in --- @@ -0,0 +1,513 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import plpy +from control import MinWarning +from internal.db_utils import is_col_1d_array +from internal.db_utils import quote_literal +from utilities import _assert +from utilities import add_postfix +from utilities import ANY_ARRAY +from utilities import is_psql_boolean_type +from utilities import is_psql_char_type +from utilities import is_psql_numeric_type +from utilities import is_valid_psql_type +from utilities import py_list_to_sql_string +from utilities import split_quoted_delimited_str +from validate_args import is_var_valid +from validate_args import get_cols +from validate_args import get_cols_and_types +from validate_args import get_expr_type +from validate_args import input_tbl_valid +from validate_args import output_tbl_valid +from validate_args import table_exists + +class vec_cols_helper: +def __init__(self): +self.all_cols = None + +def get_cols_as_list(self, cols_to_process, source_table=None, exclude_cols=None): +""" +Get a list of columns based on the value of cols_to_process +Args: +@param cols_to_process: str, Either a * or a comma-separated list of col names +@param source_table: str, optional. Source table name +@param exclude_cols: str, optional. Comma-separated list of the col(s) to exclude + from the source table, only used if cols_to_process is * +Returns: +A list of column names (or an empty list) +""" +# If cols_to_process is empty/None, return empty list +if not cols_to_process: +return [] +if cols_to_process.strip() != "*": +# If cols_to_process is a comma separated list of names, return list +# of column names in cols_to_process. +return [col for col in split_quoted_delimited_str(cols_to_process) +if col not in split_quoted_delimited_str(exclude_cols)] +if source_table: +if not self.all_cols: +self.all_cols = get_cols(source_table) +return [col for col in self.all_cols +if col not in split_quoted_delimited_str(exclude_cols)] +return [] + +def get_type_class(self, arg): +if is_psql_numeric_type(arg): +return "double precision" +elif is_psql_char_type(arg): +return "text" +else: +return arg + +class vec2cols: +def __init__(self): +self.get_cols_helper = vec_cols_helper() +self.module_name = self.__class__.__name__ + +def validate_args(self, source_table, output_table, vector_col, feature_names, + cols_to_output): +""" +Validate args for vec2cols +""" +input_tbl_valid(source_table, self.module_name) +output_tbl_valid(output_table, self.module_name) +is_var_valid(source_table, cols_to_output) +is_var_valid(source_table, vector_col) +_assert(is_valid_psql_type(get_expr_type(vector_col, source_table), ANY_ARRAY), +"{0}: vector_col should refer to an array.".format(self.module_name)) +_assert(is_col_1d_array(source_table, vector_col), +"{0}: vector_col must be a 1-dimensional array.".format(self.module_name)) + +def get_names_for_split_output_cols(self, source_table, vector_col, feature_names): +"&
[GitHub] madlib pull request #297: Madpack: Fix various schema related bugs
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/297#discussion_r204572428

--- Diff: src/madpack/madpack.py ---
@@ -995,19 +996,20 @@ def run_unit_tests(args, testcase):
     Run unit tests.
     """
     if not _is_madlib_installation_valid_for_tests(args['schema'],
-                                                   args['db_madlib_ver']):
+                                                   args['db_madlib_ver'],
+                                                   'unit-tests'):
         return
     info_(this, "> Running unit-test scripts for:", verbose)
     modset = _get_modset_for_tests(testcase, 'test_')
     # Loop through all modules and run unit tests
-    _process_py_sql_files_in_modules(modset, {'madpack_cmd':'unit-test'})
+    _process_py_sql_files_in_modules(modset, {'madpack_cmd':'Unit-test'})
--- End diff --

Changing `unit-test` to `Unit-test` causes a failure when we try to run madpack with the `unit-test` option. It's probably because the check for the madpack command in function `_process_py_sql_files_in_modules` is case-sensitive. ---
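The failure mode flagged in the review above comes down to a case-sensitive string comparison; a minimal sketch with a hypothetical `dispatch` function (not the actual madpack code):

```python
def dispatch(madpack_cmd):
    # An exact string comparison, like the check inside
    # _process_py_sql_files_in_modules, means 'Unit-test' silently
    # fails to match 'unit-test'.
    if madpack_cmd == 'unit-test':
        return 'running unit tests'
    return 'unknown command'
```

`dispatch('unit-test')` matches, while `dispatch('Unit-test')` falls through, which is why the capitalized literal in the diff breaks the option.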
[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/295 @fmcquillan only impurity; I don't think we scale OOB to 100. ---
[GitHub] madlib pull request #291: Feature: Vector to Columns
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/291#discussion_r203890181

--- Diff: src/ports/postgres/modules/utilities/transform_vec_cols.py_in ---
@@ -0,0 +1,492 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+from control import MinWarning
+from internal.db_utils import is_col_1d_array
+from internal.db_utils import quote_literal
+from utilities import _assert
+from utilities import add_postfix
+from utilities import ANY_ARRAY
+from utilities import is_valid_psql_type
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from validate_args import is_var_valid
+from validate_args import get_cols
+from validate_args import get_expr_type
+from validate_args import input_tbl_valid
+from validate_args import output_tbl_valid
+from validate_args import table_exists
+
+class vec_cols_helper:
+    def __init__(self):
+        self.all_cols = None
+
+    def get_cols_as_list(self, cols_to_process, source_table=None, exclude_cols=None):
+        """
+        Get a list of columns based on the value of cols_to_process
+        Args:
+            @param cols_to_process: str, Either a * or a comma-separated list of col names
+            @param source_table: str, optional. Source table name
+            @param exclude_cols: str, optional. Comma-separated list of the col(s) to exclude
+                                 from the source table, only used if cols_to_process is *
+        Returns:
+            A list of column names (or an empty list)
+        """
+        # If cols_to_process is empty/None, return empty list
+        if not cols_to_process:
+            return []
+        if cols_to_process.strip() != "*":
+            # If cols_to_process is a comma separated list of names, return list
+            # of column names in cols_to_process.
+            return [col for col in split_quoted_delimited_str(cols_to_process)
+                    if col not in split_quoted_delimited_str(exclude_cols)]
+        if source_table:
+            if not self.all_cols:
+                self.all_cols = get_cols(source_table)
+            return [col for col in self.all_cols
+                    if col not in split_quoted_delimited_str(exclude_cols)]
+        return []
+
+class vec2cols:
+    def __init__(self):
+        self.get_cols_helper = vec_cols_helper()
+        self.module_name = self.__class__.__name__
+
+    def validate_args(self, source_table, output_table, vector_col, feature_names,
+                      cols_to_output):
+        """
+        Validate args for vec2cols
+        """
+        input_tbl_valid(source_table, self.module_name)
+        output_tbl_valid(output_table, self.module_name)
+        # cols_to_validate = self.get_cols_helper.get_cols_as_list(cols_to_output) + [vector_col]
--- End diff --

Guess we can remove this commented line. ---
[GitHub] madlib issue #294: Pagerank: Remove duplicate entries from grouping output
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/294 Thank you for the comments @jingyimei , have pushed a commit with a new dev-check test. ---
[GitHub] madlib pull request #290: madpack: Add madpack option to run unit tests.
Github user njayaram2 closed the pull request at: https://github.com/apache/madlib/pull/290 ---
[GitHub] madlib pull request #294: Pagerank: Remove duplicate entries from grouping o...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/294 Pagerank: Remove duplicate entries from grouping output JIRA: MADLIB-1229 JIRA: MADLIB-1253 Fixes the missing output for complete graphs bug as well. Co-authored-by: Nandish Jayaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bug/pagerank-dup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/294.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #294 commit 1b55acac9d5550e0a74fa46ec0ab4842d089ac1c Author: Orhan Kislal Date: 2018-07-14T00:09:11Z Pagerank: Remove duplicate entries from grouping output JIRA: MADLIB-1229 JIRA: MADLIB-1253 Fixes the missing output for complete graphs bug as well. Co-authored-by: Nandish Jayaram ---
[GitHub] madlib issue #290: madpack: Add madpack option to run unit tests.
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/290 Thank you for the comments @iyerr3 , will do the needful. ---
[GitHub] madlib pull request #290: madpack: Add madpack option to run unit tests.
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/290 madpack: Add madpack option to run unit tests. JIRA: MADLIB-1252 Unit tests in MADlib are written in python files, that are located in the .../test/unit_tests/ folders, whose names begin with the prefix "test_". This commit adds a new madpack option to run unit tests similar to how we run install and dev checks. - The new option added is: `unit-test`. - Sample usage (on a postgres database, with MADlib installed on database `madlib`): * Run unit tests on all modules that have it defined: src/bin/madpack -p postgres -c /madlib unit-test * Run unit tests only for the `convex` module: src/bin/madpack -p postgres -c /madlib unit-test -t convex * Run unit tests only for the `convex` and decision trees module: src/bin/madpack -p postgres -c /madlib unit-test -t convex,recursive_partitioning/decision_tree - Add command to run all unit tests in Jenkins build script. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib madpack/unit-test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/290.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #290 commit 5f4395bc8548a5f74ba9a8fcb716f2a39ec51162 Author: Nandish Jayaram Date: 2018-07-11T00:32:24Z madpack: Add madpack option to run unit tests. JIRA: MADLIB-1252 Unit tests in MADlib are written in python files, that are located in the .../test/unit_tests/ folders, whose names begin with the prefix "test_". This commit adds a new madpack option to run unit tests similar to how we run install and dev checks. - The new option added is: `unit-test`. 
- Sample usage (on a postgres database, with MADlib installed on database `madlib`): * Run unit tests on all modules that have it defined: src/bin/madpack -p postgres -c /madlib unit-test * Run unit tests only for the `convex` module: src/bin/madpack -p postgres -c /madlib unit-test -t convex * Run unit tests only for the `convex` and decision trees module: src/bin/madpack -p postgres -c /madlib unit-test -t convex,recursive_partitioning/decision_tree - Add command to run all unit tests in Jenkins build script. ---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/288#discussion_r200444598

--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+
+def validate_cols2vec_args(source_table, output_table,
+                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+    """
+    Function to validate input parameters
+    """
+    if list_of_features.strip() != '*':
+        if not (list_of_features and list_of_features.strip()):
+            plpy.error("Features to include is empty")
+
+    if list_of_features.strip() != '*':
+        if not columns_exist_in_table(
+                source_table, split_quoted_delimited_str(list_of_features)):
+            plpy.error(
+                "Invalid columns to list_of_features ({0})".format(list_of_features))
+
+    if cols_to_output and cols_to_output.strip() != '*':
+        if not columns_exist_in_table(
+                source_table, _string_to_array(cols_to_output)):
+            plpy.error("Invalid columns to output list ({0})".
+                       format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+    """
+    Args:
+        @param schema_madlib: Name of MADlib schema
+        @param model: Name of table containing the tree model
+        @param source_table: Name of table containing prediction data
+        @param output_table: Name of table to output the results
+        @param list_of_features: Comma-separated string of column names or
+                                 expressions to put into feature array.
+                                 Can also be a '*' implying all columns
+                                 are to be put into feature array.
+        @param list_of_features_to_exclude: Comma-separated string of column names
+                                            to exclude from the feature array
+        @param cols_to_output: Comma-separated string of column names
+                               from the source table to keep in the output table,
+                               in addition to the feature array.
+
+    Returns:
+        None
+
+    """
+
+    with MinWarning('warning'):
+        validate_cols2vec_args(source_table, output_table, list_of_features,
+                               list_of_features_to_exclude, cols_to_output, **kwargs)
+
+        all_cols = ''
+        feature_cols = ''
+        if list_of_features.strip() == '*':
+            all_cols = get_cols(source_table, schema_madlib)
+            all_col_set = set(list(all_cols))
+            exclude_set = set(split_quoted_delimited_str(
+                list_of_features_to_exclude))
+            feature_set = all_col_set - exclude_set
+            feature_cols = py_list_to_sql_string(
+                list(feature_set), "text", False)
+            filtered_list_of_features = ",".join(
+                feat for feat in list(feature_set))
+        else:
+            feature_list = split_quoted_delimited_str(list_of_features)
+            feature_exclude = split_quoted_delimited_str(
+                list_of_features_to_exclude)
+            return_set = set(feature_list) - set(feature_exclude)
+            feature_cols = py_list_to_sql_string(
+                list(return_set), "text", False)
+            filtered_list_of_features = ",".join(
+                feat for feat in feature_list if feat in return_set)
+
+        output_cols = ''
+        if cols_to_output:
+            output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
+
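The include/exclude handling in the quoted `cols2vec` diff reduces to a set difference over column names; a minimal standalone sketch (hypothetical column names, not the actual MADlib helpers):

```python
def filter_features(list_of_features, list_of_features_to_exclude):
    # Split the comma-separated inputs and drop the excluded columns,
    # keeping the original order of the include list (the quoted code
    # goes through a set, which does not preserve ordering).
    features = [f.strip() for f in list_of_features.split(',') if f.strip()]
    exclude = {f.strip() for f in (list_of_features_to_exclude or '').split(',')}
    return [f for f in features if f not in exclude]
```

For example, `filter_features('height, weight, age, id', 'id, age')` keeps only `height` and `weight` in the feature array.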
[GitHub] madlib issue #284: SVM: Fix flaky dev-check failure
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/284 Thank you for the comment @iyerr3 , will relax the constraint as suggested. ---
[GitHub] madlib pull request #284: SVM: Fix flaky dev-check failure
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/284 SVM: Fix flaky dev-check failure JIRA: MADLIB-1232 SVM has a dev-check query that is flaky on a large cluster. This commit relaxes the assert condition for that query. Closes #284 You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/svm-flaky-dev-check Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/284.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #284 commit 9c556816bcca990e9b168cf556ce0da0cacf935a Author: Nandish Jayaram Date: 2018-06-27T19:40:19Z SVM: Fix flaky dev-check failure JIRA: MADLIB-1232 SVM has a dev-check query that is flaky on a large cluster. This commit relaxes the assert condition for that query. Closes #284 ---
[GitHub] madlib issue #283: Bugfix: Fix failing dev check in CRF
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/283 Thank you for the comments @kaknikhil . I moved out the jenkins build script to a different commit. ---
[GitHub] madlib pull request #283: Bugfix: Fix failing dev check in CRF
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/283 Bugfix: Fix failing dev check in CRF This commit has the following changes: - A couple of dev check files in CRF did not have the label table creation in it. But the label table was consumed by one of the queries that led to dev-check failure. - Run dev check on Jenkins build instead of install check. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/crf-dev-check Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/283.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #283 commit deb206b7d1c1e7ce87d6e33c7a1dff91b3adb82b Author: Nandish Jayaram Date: 2018-06-27T18:25:46Z Bugfix: Fix failing dev check in CRF This commit has the following changes: - A couple of dev check files in CRF did not have the label table creation in it. But the label table was consumed by one of the queries that led to dev-check failure. - Run dev check on Jenkins build instead of install check. ---
[GitHub] madlib pull request #277: DT: Add impurity importance metric
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/277#discussion_r196150565 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in --- @@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features, "$3", "$2", null_proxy) -# The arguments of the aggregate (in the same order): -# 1. current tree state, madlib.bytea8 -# 2. categorical features (integer format) in a single array -# 3. continuous features in a single array -# 4. weight value -# 5. categorical sorted levels (integer format) in a combined array -# 6. continuous splits -# 7. number of dependent levels train_sql = """ SELECT (result).* from ( SELECT -{schema_madlib}._dt_apply($1, +{schema_madlib}._dt_apply( +$1, {schema_madlib}._compute_leaf_stats( -$1, -{cat_features_str}, -{con_features_str}, +$1, -- current tree state, madlib.bytea8 +{cat_features_str}, -- categorical features in an array +{con_features_str}, -- continuous features in an array {dep_var}, -{weights}, -$2, -$4, -{dep_n_levels}::smallint, -{subsample}::boolean +{weights}, -- weight value +$2, -- categorical sorted levels in a combined array +$4, -- continuous splits +{dep_n_levels}::smallint, -- number of dependent levels +{subsample}::boolean -- should we use a subsample of data --- End diff -- Oh okay, thank you. I think a comment will be useful. ---
[GitHub] madlib issue #276: Feature/dev check
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/276 Thank you for the comments @iyerr3, I will make the changes you have requested.

Having one IC file for each module makes sense, but on Greenplum, for some modules the IC run time is still quite high. For example, if a module had 5 IC files, where each ran for 10 seconds, the user would see IC progressing every 10 seconds. But if those were combined into one IC file, the progression would happen only after 50 seconds. It seemed a little odd (longer IC run times for some modules in terms of user experience) when we were trying it out, so we decided to keep multiple IC files for such modules. ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r194151705 --- Diff: doc/design/modules/neural-network.tex --- @@ -117,6 +122,26 @@ \subsubsection{Backpropagation} \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\] where $k = 1,...,N-1$, and $j = 1,...,n_{k}$. +\paragraph{Momentum updates.} +Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\ +Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. +The velocity is the same size as the coefficient and is accumulated in the direction of persistent reduction, which speeds up the optimization. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\ +In classical momentum we first correct the velocity, and then update the model with that velocity, whereas in Nesterov momentum, we first move the model in the direction of momentum*velocity , then correct the velocity and finally use the updated model to calculate the gradient. The main difference being that in classical momentum, we compute the gradient before updating the model whereas in nesterov we first update the model and then compute the gradient from the updated position.\\ --- End diff -- `momentum*velocity ,` -> `momentum*velocity,`. The extra space before the comma is moving the `,` to the next line in the pdf. ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r194151826 --- Diff: doc/design/modules/neural-network.tex --- @@ -196,17 +221,28 @@ \subsubsection{The $\mathit{Gradient}$ Function} \end{algorithmic} \end{algorithm} -\begin{algorithm}[mlp-train-iteration$(X, Y, \eta)$] \label{alg:mlp-train-iteration} +\begin{algorithm}[mlp-train-iteration$(X, Y, \eta, mu)$] \label{alg:mlp-train-iteration} --- End diff -- `mu` -> `\mu` ---
[GitHub] madlib pull request #275: Madpack: Fix error with dropping user after IC fai...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/275 Madpack: Fix error with dropping user after IC failure. JIRA: MADLIB-1182 Previously, when install check did not fail gracefully, the user created by madpack hung around and disturbed IC attempts within other databases. We fixed this by: 1) Renaming the test user using the specific database that the IC test was run from, making the test user name database-specific. 2) Dropping schema and user in a try-finally block, so that it's executed even on a failed IC run. Co-authored-by: Arvind Sridhar You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/madpack-ic-user Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/275.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #275 commit d817973e78fa419dd7b83b077049c8c6b1911911 Author: Nandish Jayaram Date: 2018-06-05T19:08:25Z Madpack: Fix error with dropping user after IC failure. JIRA: MADLIB-1182 Previously, when install check did not fail gracefully, the user created by madpack hung around and disturbed IC attempts within other databases. We fixed this by: 1) Renaming the test user using the specific database that the IC test was run from, making the test user name database-specific. 2) Dropping schema and user in a try-finally block, so that it's executed even on a failed IC run. Co-authored-by: Arvind Sridhar ---
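The fix described in point 2 of this PR is the standard try/finally cleanup pattern; a minimal sketch with hypothetical helper names (not the actual madpack functions):

```python
def run_install_check(db_name, create_test_user, run_tests, drop_test_user):
    # Point 1: make the test user name database-specific so a leftover
    # user from one database cannot collide with IC runs in another.
    test_user = 'madlib_installcheck_%s' % db_name
    create_test_user(test_user)
    try:
        run_tests(test_user)
    finally:
        # Point 2: cleanup runs even if the IC run raised an error,
        # so the test user never hangs around after a failure.
        drop_test_user(test_user)
```

With this shape, a test failure still propagates to the caller, but the schema/user drop is guaranteed to have executed first.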
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192246910 --- Diff: doc/design/modules/neural-network.tex --- @@ -117,6 +117,24 @@ \subsubsection{Backpropagation} \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\] where $k = 1,...,N-1$, and $j = 1,...,n_{k}$. +\paragraph{Momentum updates.} +Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\ +Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\ +In classical momentum you first correct the velocity and step with that velocity, whereas in Nesterov momentum you first step in the velocity direction then make a correction to the velocity vector based on the new location. \\ +Classic momentum update +\[\begin{aligned} +\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$ +\mathit{u} \set \mathit{u} + \mathit{v} \\ +\end{aligned}\] + +Nesterov momentum update +\[\begin{aligned} +\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ (nesterov's initial coefficient update )} \\ +\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$ +\mathit{u} \set \mathit{u} - \eta * \mathit{gradient} \\ --- End diff -- Can we left-align these and other momentum related equations? ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192246463 --- Diff: doc/design/modules/neural-network.tex --- @@ -117,6 +117,24 @@ \subsubsection{Backpropagation} \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\] where $k = 1,...,N-1$, and $j = 1,...,n_{k}$. +\paragraph{Momentum updates.} +Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\ +Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\ --- End diff -- We only describe the range for momentum, but nothing about velocity before the next line begins. Can we say a bit about what velocity is before we describe how it is used? ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192251198 --- Diff: doc/design/modules/neural-network.tex --- @@ -117,6 +117,24 @@ \subsubsection{Backpropagation} \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\] --- End diff -- Please add your name to the list of authors in the beginning of the file. Maybe add the history for this file as well? ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192245605 --- Diff: doc/design/modules/neural-network.tex --- @@ -117,6 +117,24 @@ \subsubsection{Backpropagation} \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\] where $k = 1,...,N-1$, and $j = 1,...,n_{k}$. +\paragraph{Momentum updates.} +Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\ +Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\ +In classical momentum you first correct the velocity and step with that velocity, whereas in Nesterov momentum you first step in the velocity direction then make a correction to the velocity vector based on the new location. \\ --- End diff -- `step with that velocity` is a little confusing to me. Do we have some source where it is defined this way? If it's any better, can we use the following text to say what the difference between momentum and NAG is (source is http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf): ``` ... the key difference between momentum and Nesterov’s accelerated gradient is that momentum computes the gradient before applying the velocity, while Nesterov’s accelerated gradient computes the gradient after doing so. ``` If `step with that velocity` is a standard way of defining it, then I am okay with it. This comment applies to user and online docs too. ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192248589 --- Diff: doc/design/modules/neural-network.tex --- @@ -117,6 +117,24 @@ \subsubsection{Backpropagation} \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\] where $k = 1,...,N-1$, and $j = 1,...,n_{k}$. +\paragraph{Momentum updates.} +Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\ +Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\ +In classical momentum you first correct the velocity and step with that velocity, whereas in Nesterov momentum you first step in the velocity direction then make a correction to the velocity vector based on the new location. \\ +Classic momentum update +\[\begin{aligned} +\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$ +\mathit{u} \set \mathit{u} + \mathit{v} \\ +\end{aligned}\] + +Nesterov momentum update +\[\begin{aligned} +\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ (nesterov's initial coefficient update )} \\ +\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$ --- End diff -- Instead of `gradient`, can we replace it with the actual math notation? I think the gradient at this line is: `\frac{\partial f}{\partial u_{k-1}^{sj}}` The gradient in the next line should be w.r.t `v` instead of `u`. ---
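The classical and Nesterov update rules quoted in this review thread can be sketched in plain Python. This is an illustrative toy (a 1-D quadratic objective and made-up hyperparameters), not MADlib's MLP code:

```python
def grad(u):
    # Gradient of the toy objective f(u) = u**2 / 2.
    return u

def classical_momentum_step(u, v, mu, eta):
    # Classical momentum: compute the gradient at the current position,
    # fold it into the velocity, then step with that velocity.
    v = mu * v - eta * grad(u)
    u = u + v
    return u, v

def nesterov_step(u, v, mu, eta):
    # Nesterov: first move in the direction of momentum * velocity,
    # then compute the gradient at the look-ahead position and use it
    # for both the velocity update and the final coefficient update.
    u = u + mu * v        # look-ahead (initial coefficient update)
    g = grad(u)           # gradient after the partial update
    v = mu * v - eta * g  # velocity update
    u = u - eta * g       # final coefficient update
    return u, v
```

Both variants drive the toy objective toward its minimum at u = 0; the only difference is whether the gradient is evaluated before or after applying the accumulated velocity, which is exactly the distinction discussed in the review.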
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192186350

--- Diff: src/madpack/madpack.py ---
@@ -131,10 +141,73 @@ def _get_relative_maddir(maddir, port):
     return maddir
 # --
+def _cleanup_comments_in_sqlfile(output_filename, upgrade):
+    """
+    @brief Remove comments in the sql script
+    """
+    if not upgrade:
+        with open(output_filename, 'r+') as output_filehandle:
+            full_sql = output_filehandle.read()
+        pattern = re.compile(r"""(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))""")
+        res = ''
+        lines = re.split(r'[\r\n]+', full_sql)
+        for line in lines:
+            tmp = line
+            if not tmp.strip().startswith("E'"):
+                line = re.sub(pattern, '', line)
+            res += line + '\n'
+        full_sql = res.strip()
+        full_sql = re.sub(pattern, '', full_sql).strip()
+        # Re-write the cleaned-up sql to a new file. Python does not let us
+        # erase all the content of a file and rewrite the same file again.
+        cleaned_output_filename = output_filename+'.tmp'
+        with open(cleaned_output_filename, 'w') as output_filehandle:
+            _write_to_file(output_filehandle, full_sql)
+        # Move the cleaned output file to the old one.
+        os.rename(cleaned_output_filename, output_filename)
+
+def _run_m4_and_append(schema, maddir_mod_py, module, sqlfile,
+                       output_filehandle, pre_sql=None):
+    """
+    Function to process a sql file with M4.
+    """
+    # Check if the SQL file exists
+    if not os.path.isfile(sqlfile):
+        error_(this, "Missing module SQL file (%s)" % sqlfile, False)
--- End diff --

We removed the error message from `ValueError`, while we still print the message from `error_()`. ---
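The comment-stripping pattern in the diff above can be exercised on its own; a simplified sketch that mimics the per-line pass followed by the whole-text pass (the sample SQL is made up):

```python
import re

# Same pattern as in the quoted diff: /* ... */ blocks and -- line comments.
pattern = re.compile(r"""(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))""")

def strip_sql_comments(full_sql):
    res = ''
    for line in re.split(r'[\r\n]+', full_sql):
        # Lines starting with E'...' are skipped so escaped string
        # literals are not mangled, as in the quoted code.
        if not line.strip().startswith("E'"):
            line = re.sub(pattern, '', line)
        res += line + '\n'
    # A second pass over the whole text catches /* ... */ blocks that
    # span multiple lines, which the per-line pass cannot match.
    return re.sub(pattern, '', res.strip()).strip()

sql = "SELECT 1; -- trailing comment\n/* multi\nline */ SELECT 2;"
cleaned = strip_sql_comments(sql)
```

After the call, `cleaned` retains both statements with the trailing `--` comment and the multi-line `/* ... */` block removed.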
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192174800 --- Diff: src/madpack/upgrade_util.py --- @@ -1299,18 +1303,19 @@ def _clean_function(self): pattern = re.compile(r"""CREATE(\s+)FUNCTION""", re.DOTALL | re.IGNORECASE) self._sql = re.sub(pattern, 'CREATE OR REPLACE FUNCTION', self._sql) -def cleanup(self, sql): +def cleanup(self, sql, algoname): --- End diff -- Good catch, this is dead code. ---
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192193089 --- Diff: src/madpack/madpack.py --- @@ -559,71 +650,59 @@ def _db_rename_schema(from_schema, to_schema): # -- -def _db_create_schema(schema): +def _db_create_schema(schema, is_schema_in_db, filehandle): """ Create schema @param from_schema name of the schema to rename +@param is_schema_in_db flag to indicate if schema is already present @param to_schema new name for the schema """ -info_(this, "> Creating %s schema" % schema.upper(), True) --- End diff -- This is useful and it was preserved in `main`. ---
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192205268 --- Diff: src/madpack/madpack.py --- @@ -987,275 +1276,42 @@ def main(argv): error_(this, "Missing -p/--platform parameter.", True) if not con_args: error_(this, "Unknown problem with database connection string: %s" % con_args, True) +# Completed "Get and validate arguments" - # COMMAND: version if args.command[0] == 'version': _print_revs(rev, dbrev, con_args, schema) -# COMMAND: uninstall/reinstall -if args.command[0] in ('uninstall', 'reinstall'): -if get_rev_num(dbrev) == [0]: -info_(this, "Nothing to uninstall. No version found in schema %s." % schema.upper(), True) -return - -# Find any potential data to lose -affected_objects = _internal_run_query(""" -SELECT -n1.nspname AS schema, -relname AS relation, -attname AS column, -typname AS type -FROM -pg_attribute a, -pg_class c, -pg_type t, -pg_namespace n, -pg_namespace n1 -WHERE -n.nspname = '%s' -AND t.typnamespace = n.oid -AND a.atttypid = t.oid -AND c.oid = a.attrelid -AND c.relnamespace = n1.oid -AND c.relkind = 'r' -ORDER BY -n1.nspname, relname, attname, typname""" % schema.lower(), True) - -info_(this, "*** Uninstalling MADlib ***", True) -info_(this, "***", True) -info_(this, "* Schema %s and all database objects depending on it will be dropped!" % schema.upper(), True) -if affected_objects: -info_(this, "* If you continue the following data will be lost (schema : table.column : type):", True) -for ao in affected_objects: -info_(this, '* - ' + ao['schema'] + ' : ' + ao['relation'] + '.' + - ao['column'] + ' : ' + ao['type'], True) -info_(this, "***", True) -info_(this, "Would you like to continue? [Y/N]", True) -go = raw_input('>>> ').upper() -while go != 'Y' and go != 'N': -go = raw_input('Yes or No >>> ').upper() - -# 2) Do the uninstall/drop -if go == 'N': -info_(this, 'No problem. 
Nothing dropped.', True) -return - -elif go == 'Y': -info_(this, "> dropping schema %s" % schema.upper(), verbose) -try: -_internal_run_query("DROP SCHEMA %s CASCADE;" % (schema), True) -except: -error_(this, "Cannot drop schema %s." % schema.upper(), True) - -info_(this, 'Schema %s (and all dependent objects) has been dropped.' % schema.upper(), True) -info_(this, 'MADlib uninstalled successfully.', True) - -else: -return - -# COMMAND: install/reinstall -if args.command[0] in ('install', 'reinstall'): -# Refresh MADlib version in DB, None for GP/PG -if args.command[0] == 'reinstall': -print "Setting MADlib database version to be None for reinstall" -dbrev = None - -info_(this, "*** Installing MADlib ***", True) - -# 1) Compare OS and DB versions. -# noop if OS <= DB. -_print_revs(rev, dbrev, con_args, schema) -if is_rev_gte(get_rev_num(dbrev), get_rev_num(rev)): -info_(this, "Current MADlib version already up to date.", True) -return -# proceed to create objects if nothing installed in DB -elif dbrev is None: -pass -# error and refer to upgrade if OS > DB -else: -error_(this, """Aborting installation: existing MADlib version detected in {0} schema -To upgrade the {0} schema to MADlib v{1} please run the following command: -madpack upgrade -s {0} -p {2} [-c ...] -""".forma
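The install path above gates on comparing the filesystem and in-database MADlib versions (`is_rev_gte(get_rev_num(dbrev), get_rev_num(rev))`). A minimal sketch of that comparison — the helper names come from the diff, but the implementation below is illustrative and ignores details like `-dev` suffixes that the real madpack parser may handle:

```python
def get_rev_num(rev):
    """Parse a version string like '1.15.1' into [1, 15, 1].

    A missing or unparseable version maps to [0], matching how the
    diff treats "nothing installed in the schema".
    """
    if not rev:
        return [0]
    try:
        return [int(part) for part in rev.split('.')]
    except ValueError:
        return [0]

def is_rev_gte(left, right):
    """True if version list `left` >= version list `right`."""
    # Pad the shorter list with zeros so [1, 15] compares equal to [1, 15, 0].
    length = max(len(left), len(right))
    left = left + [0] * (length - len(left))
    right = right + [0] * (length - len(right))
    return left >= right
```

With this, the flow above reads: if the DB version is greater than or equal to the OS version, installation is a no-op; if no DB version exists, install fresh; otherwise refer the user to `madpack upgrade`.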
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192175486 --- Diff: src/madpack/upgrade_util.py --- @@ -1299,18 +1303,19 @@ def _clean_function(self): pattern = re.compile(r"""CREATE(\s+)FUNCTION""", re.DOTALL | re.IGNORECASE) self._sql = re.sub(pattern, 'CREATE OR REPLACE FUNCTION', self._sql) -def cleanup(self, sql): +def cleanup(self, sql, algoname): """ @brief Entry function for cleaning the sql script """ self._sql = sql -self._clean_comment() -self._clean_type() -self._clean_cast() -self._clean_operator() -self._clean_opclass() -self._clean_aggregate() -self._clean_function() +if algoname not in self.get_change_handler().newmodule: --- End diff -- This if check was originally done in `madpack.py` before `cleanup()` function was called. We moved it to this function instead. Will add the comment, but keep the if logic as is, since we will anyway have to return at the end. ---
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192204168 --- Diff: src/madpack/madpack.py --- @@ -824,6 +873,246 @@ def parse_arguments(): # Get the arguments return parser.parse_args() +def run_install_check(args, testcase): +schema = args['schema'] +dbrev = args['dbrev'] +# 1) Compare OS and DB versions. Continue if OS = DB. +if get_rev_num(dbrev) != get_rev_num(rev): +_print_revs(rev, dbrev, con_args, schema) +info_(this, "Versions do not match. Install-check stopped.", True) +return + +# Create install-check user +test_user = ('madlib_' + + rev.replace('.', '').replace('-', '_') + + '_installcheck') +try: +_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), False) +except: +_internal_run_query("DROP OWNED BY %s CASCADE;" % (test_user), True) +_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), True) +_internal_run_query("CREATE USER %s;" % (test_user), True) + +_internal_run_query("GRANT USAGE ON SCHEMA %s TO %s;" % (schema, test_user), True) + +# 2) Run test SQLs +info_(this, "> Running test scripts for:", verbose) + +caseset = (set([test.strip() for test in testcase.split(',')]) + if testcase != "" else set()) + +modset = {} +for case in caseset: +if case.find('/') > -1: +[mod, algo] = case.split('/') +if mod not in modset: +modset[mod] = [] +if algo not in modset[mod]: +modset[mod].append(algo) +else: +modset[case] = [] + +# Loop through all modules +for moduleinfo in portspecs['modules']: + +# Get module name +module = moduleinfo['name'] + +# Skip if doesn't meet specified modules +if modset is not None and len(modset) > 0 and module not in modset: +continue +# JIRA: MADLIB-1078 fix +# Skip pmml during install-check (when run without the -t option). +# We can still run install-check on pmml with '-t' option. 
+if not modset and module in ['pmml']: +continue +info_(this, "> - %s" % module, verbose) + +# Make a temp dir for this module (if doesn't exist) +cur_tmpdir = tmpdir + '/' + module + '/test' # tmpdir is a global variable +_make_dir(cur_tmpdir) + +# Find the Python module dir (platform specific or generic) +if os.path.isdir(maddir + "/ports/" + portid + "/" + dbver + "/modules/" + module): +maddir_mod_py = maddir + "/ports/" + portid + "/" + dbver + "/modules" +else: +maddir_mod_py = maddir + "/modules" + +# Find the SQL module dir (platform specific or generic) +if os.path.isdir(maddir + "/ports/" + portid + "/modules/" + module): +maddir_mod_sql = maddir + "/ports/" + portid + "/modules" +else: +maddir_mod_sql = maddir + "/modules" + +# Prepare test schema +test_schema = "madlib_installcheck_%s" % (module) +_internal_run_query("DROP SCHEMA IF EXISTS %s CASCADE; CREATE SCHEMA %s;" % +(test_schema, test_schema), True) +_internal_run_query("GRANT ALL ON SCHEMA %s TO %s;" % +(test_schema, test_user), True) + +# Switch to test user and prepare the search_path +pre_sql = '-- Switch to test user:\n' \ + 'SET ROLE %s;\n' \ + '-- Set SEARCH_PATH for install-check:\n' \ + 'SET search_path=%s,%s;\n' \ + % (test_user, test_schema, schema) + +# Loop through all test SQL files for this module +sql_files = maddir_mod_sql + '/' + module + '/test/*.sql_in' +for sqlfile in sorted(glob.glob(sql_files), reverse=True): +algoname = os.path.basename(sqlfile).split('.')[0] +# run only algo specified +if (module in modset and modset[module] and +algoname not in modset[module]): +continue + +# Set file names +tmpfile = cur_tmpdir + '/' + os.path.basename(sqlfile) + '.tmp' +
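The `-t` test-case parsing in the quoted diff maps a comma-separated spec like `mlp/mlp_igd,svm` onto a module-to-algorithms dict, where an empty list means "run every test in the module". Extracted as a standalone sketch (the behavior follows the diff; the function name is ours):

```python
def parse_testcases(testcase):
    """Map 'mod1/algo1,mod2' to {'mod1': ['algo1'], 'mod2': []}.

    An empty algorithm list means "run every test SQL file in the module".
    """
    modset = {}
    if not testcase:
        return modset
    for case in set(test.strip() for test in testcase.split(',')):
        if '/' in case:
            mod, algo = case.split('/')
            # Collect algorithms per module, avoiding duplicates.
            modset.setdefault(mod, [])
            if algo not in modset[mod]:
                modset[mod].append(algo)
        else:
            modset[case] = []
    return modset
```

The loop over `portspecs['modules']` then skips any module absent from a non-empty `modset`, and within a module skips any `.sql_in` file whose basename is not in the module's algorithm list.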
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192182069 --- Diff: src/madpack/madpack.py --- @@ -95,6 +95,16 @@ def _internal_run_query(sql, show_error): return run_query(sql, con_args, show_error) # -- +def _write_to_file(handle, sql, show_error=False): +handle.write(sql) +handle.write('\n') +# -- + + +def _merge_to_file(handle, other_file, show_error=False): --- End diff -- It was used very early in the story, I think; it has been removed now. ---
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192193755 --- Diff: src/madpack/madpack.py --- @@ -824,6 +873,246 @@ def parse_arguments(): # Get the arguments return parser.parse_args() +def run_install_check(args, testcase): +schema = args['schema'] --- End diff -- This is install check code which is not refactored in this commit, and should be done as part of a different commit. We only moved `install-check` specific code from `main` to this function, since `main` was very huge. ---
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192181634 --- Diff: src/madpack/madpack.py --- @@ -95,6 +95,16 @@ def _internal_run_query(sql, show_error): return run_query(sql, con_args, show_error) # -- +def _write_to_file(handle, sql, show_error=False): --- End diff -- Moved this function to `madlib/utilities.py`. ---
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/271#discussion_r192178673 --- Diff: src/madpack/utilities.py --- @@ -33,6 +33,23 @@ this = os.path.basename(sys.argv[0])# name of this script +class AtomicFileOpen: --- End diff -- It's not used, will remove it. ---
[GitHub] madlib issue #271: Madpack: Make install, reinstall and upgrade atomic
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/271 Thank you for the comments @kaknikhil , will address them. ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191575057 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -1781,7 +1799,7 @@ class MLPMinibatchPreProcessor: summary_table_columns = summary_table_columns[0] required_columns = (self.DEPENDENT_VARNAME, self.INDEPENDENT_VARNAME, -self.CLASS_VALUES, self.GROUPING_COL) +self.CLASS_VALUES, self.GROUPING_COL, self.DEPENDENT_VARTYPE) if set(required_columns) <= set(summary_table_columns): self.preprocessed_summary_dict = summary_table_columns else: --- End diff -- Can you please update the optimizer params in the online docs? ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191533727 --- Diff: src/modules/convex/type/model.hpp --- @@ -126,45 +129,96 @@ struct MLPModel { for (k = 0; k < N; k ++) { size += static_cast((n[k] + 1) * (n[k+1])); } +//TODO conditionally assign size based on momentum --- End diff -- Is this TODO still valid? ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191574947 --- Diff: src/ports/postgres/modules/convex/mlp.sql_in --- @@ -1474,13 +1480,15 @@ CREATE AGGREGATE MADLIB_SCHEMA.mlp_minibatch_step( /* warm_start_coeff */DOUBLE PRECISION[], /* lambda */ DOUBLE PRECISION, /* batch_size */ INTEGER, -/* n_epochs */INTEGER +/* n_epochs */INTEGER, +/* momentum */DOUBLE PRECISION, +/* is_nesterov */ BOOLEAN )( STYPE=DOUBLE PRECISION[], SFUNC=MADLIB_SCHEMA.mlp_minibatch_transition, m4_ifdef(`__POSTGRESQL__', `', `prefunc=MADLIB_SCHEMA.mlp_minibatch_merge,') FINALFUNC=MADLIB_SCHEMA.mlp_minibatch_final, -INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0}' +INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0,0,0}' ); - --- End diff -- Can you please update the user docs with momentum and nesterov related optimizer params? Is it also a good time to update the MLP design doc? ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191539168 --- Diff: src/modules/convex/task/mlp.hpp --- @@ -197,6 +244,7 @@ MLP::loss( const model_type, const independent_variables_type, const dependent_variable_type _true) { + --- End diff -- The loss computation code in this function is duplicated in `getLoss()` function. Can we call that function to remove the code duplication? ---
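The duplicated loss computation being discussed is, per the surrounding diffs, a clipped cross-entropy for classification and half squared error for regression. A sketch of the kind of shared helper the comment asks for — `get_loss` matches the `getLoss()` name mentioned above, the clip constant `1.e-10` comes from the diff, but this Python version is only an illustration of the C++ logic:

```python
import math

def get_loss(y_true, y_est, is_classification, clip=1.e-10):
    """Shared loss helper: clipped cross-entropy or half squared error."""
    if is_classification:
        total = 0.0
        for t, e in zip(y_true, y_est):
            # Mirrors cwiseMax(clip).cwiseMin(1.-clip) in the diff.
            e = min(max(e, clip), 1.0 - clip)
            total += -(t * math.log(e) + (1.0 - t) * math.log(1.0 - e))
        return total
    # Regression: 0.5 * ||y_est - y_true||^2
    return 0.5 * sum((e - t) ** 2 for t, e in zip(y_true, y_est))
```

Calling one helper from both `loss()` and `getLossAndUpdateModel()` removes the duplication the review comment points at.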
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191537654 --- Diff: src/modules/convex/task/mlp.hpp --- @@ -126,68 +157,84 @@ MLP::getLossAndUpdateModel( const Matrix _true_batch, const double ) { -uint16_t num_layers = static_cast(model.u.size()); // assuming nu. of layers >= 1 -Index num_rows_in_batch = x_batch.rows(); double total_loss = 0.; +// model is updated with the momentum step (i.e. velocity vector) +// if Nesterov Accelerated Gradient is enabled +model.nesterovUpdatePosition(); -// gradient added over the batch -std::vector total_gradient_per_layer(num_layers); -for (Index k=0; k < num_layers; ++k) +// initialize gradient vector +std::vector total_gradient_per_layer(model.num_layers); +for (Index k=0; k < model.num_layers; ++k) { total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(), model.u[k].cols()); +} +std::vector net, o, delta; +Index num_rows_in_batch = x_batch.rows(); for (Index i=0; i < num_rows_in_batch; i++){ +// gradient and loss added over the batch ColumnVector x = x_batch.row(i); ColumnVector y_true = y_true_batch.row(i); -std::vector net, o, delta; feedForward(model, x, net, o); backPropogate(y_true, o.back(), net, model, delta); -for (Index k=0; k < num_layers; k++){ +// compute the gradient +for (Index k=0; k < model.num_layers; k++){ total_gradient_per_layer[k] += o[k] * delta[k].transpose(); } -// loss computation -ColumnVector y_estimated = o.back(); -if(model.is_classification){ -double clip = 1.e-10; -y_estimated = y_estimated.cwiseMax(clip).cwiseMin(1.-clip); -total_loss += - (y_true.array()*y_estimated.array().log() - + (-y_true.array()+1)*(-y_estimated.array()+1).log()).sum(); -} -else{ -total_loss += 0.5 * (y_estimated - y_true).squaredNorm(); -} +// compute the loss +total_loss += getLoss(y_true, o.back(), model.is_classification); } -for (Index k=0; k < num_layers; k++){ + +// convert gradient to a gradient update vector +// 1. 
normalize to per row update +// 2. discount by stepsize +// 3. add regularization +// 4. make negative +for (Index k=0; k < model.num_layers; k++){ Matrix regularization = MLP::lambda * model.u[k]; regularization.row(0).setZero(); // Do not update bias -model.u[k] -= -stepsize * -(total_gradient_per_layer[k] / static_cast(num_rows_in_batch) + - regularization); +total_gradient_per_layer[k] = -stepsize * (total_gradient_per_layer[k] / static_cast(num_rows_in_batch) + + regularization); +model.updateVelocity(total_gradient_per_layer[k], k); +model.updatePosition(total_gradient_per_layer[k], k); } + return total_loss; + } + template void MLP::gradientInPlace( model_type , const independent_variables_type, const dependent_variable_type _true, -const double) { -size_t N = model.u.size(); // assuming nu. of layers >= 1 +const double) +{ +model.nesterovUpdatePosition(); + std::vector net, o, delta; feedForward(model, x, net, o); backPropogate(y_true, o.back(), net, model, delta); -for (size_t k=0; k < N; k++){ +for (Index k=0; k < model.num_layers; k++){ Matrix regularization = MLP::lambda*model.u[k]; regularization.row(0).setZero(); // Do not update bias -model.u[k] -= stepsize * (o[k] * delta[k].transpose() + regularization); +if (model.momentum > 0){ +Matrix gradient = -stepsize * (o[k] * delta[k].transpose() + regularization); +model.updateVelocity(gradient, k); +model.updatePosition(gradient, k); +} +else { +// Updating model inline instead of using updatePosition since +// updatePosition creates a copy of the gradient making it slower. --- End diff -- This comment is a little confusing. We do pass along the gradi
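The velocity/position split discussed above (classical momentum vs. Nesterov accelerated gradient) can be sketched with scalars. This is one plausible reading of the diff's `updateVelocity`/`updatePosition` structure, not the MADlib C++ code: in the diff these operate on per-layer matrices, and under Nesterov the position was already advanced by `momentum * velocity` before the gradient was computed.

```python
def momentum_step(w, v, grad, stepsize, momentum, is_nesterov):
    """One gradient step with optional momentum / Nesterov acceleration.

    w, v, grad are plain floats here for clarity; `update` plays the role
    of the already-negated, stepsize-scaled gradient in the diff.
    """
    update = -stepsize * grad
    if momentum <= 0:
        # Plain SGD: update the model in place, no velocity involved.
        return w + update, v
    v = momentum * v + update      # updateVelocity
    if is_nesterov:
        # NAG: the momentum part was applied before the gradient was
        # computed, so only the gradient part is added here.
        w = w + update
    else:
        w = w + v                  # classical momentum: updatePosition
    return w, v
```

The `momentum > 0` branch in `gradientInPlace` above follows the same shape; the in-place update in the `else` branch avoids the extra gradient copy the (truncated) review comment refers to.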
[GitHub] madlib pull request #260: minibatch preprocessor improvements
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/260#discussion_r180603956 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in --- @@ -397,8 +408,9 @@ class MiniBatchStandardizer: x_std_dev_str = self.x_std_dev_str) return query -def _get_query_for_standardizing_with_grouping(self): +def _create_table_for_standardizing_with_grouping(self): --- End diff -- Why was the method name changed? The older name seems to be more apt, since this function is still returning the query, and not executing it (the same for `_create_table_for_standardizing_without_grouping()` too). ---
[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/259#discussion_r180576675 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in --- @@ -91,6 +92,22 @@ minibatch_preprocessor( When this value is NULL, no grouping is used and a single preprocessing step is performed for the whole data set. + + one_hot_encode_int_dep_var (optional) + BOOLEAN. default: FALSE. + A flag to decide whether to one-hot encode dependent variables that are +scalar integers. This parameter is ignored if the dependent variable is not a +scalar integer. + +@note The mini-batch preprocessor automatically encodes +dependent variables that are boolean and character types such as text, char and +varchar. However, scalar integers are a special case because they can be used +in both classification and regression problems, so you must tell the mini-batch +preprocessor whether you want to encode them or not. In the case that you have +already encoded the dependent variable yourself, you can ignore this parameter. +Also, if you want to encode float values for some reason, cast them to text +first. --- End diff -- +1 for the explanation. ---
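The one-hot encoding behavior documented above is easy to illustrate standalone. The preprocessor does this in SQL over a table column; the helpers below are only a hedged Python sketch of the same idea:

```python
def one_hot_encode(value, classes):
    """Encode a scalar label as a 0/1 indicator array over the class list."""
    return [1 if value == c else 0 for c in classes]

def encode_column(values):
    """Encode a whole column; classes are the sorted distinct values."""
    classes = sorted(set(values))
    return classes, [one_hot_encode(v, classes) for v in values]
```

This also shows why the flag exists: an integer column like `[4, 6, 4]` could equally be a regression target (left as-is) or a class label (encoded as below), so the caller has to say which.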
[GitHub] madlib pull request #258: RF: Comment out assert in flaky install check quer...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/258 RF: Comment out assert in flaky install check query JIRA: MADLIB-1225 The variable importance computation involves randomization inherently. So it is hard to reproduce this error consistently. This commit comments out the assert for now (the failure rate was around 4.3%, when tested over 600 runs). Closes #258 You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/rf/flaky-install-check Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/258.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #258 commit f2059aa35de5ab0827cf1d9fa0e39179a6190b49 Author: Nandish Jayaram <njayaram@...> Date: 2018-04-07T00:16:17Z RF: Comment out assert in flaky install check query JIRA: MADLIB-1225 The variable importance computation involves randomization inherently. So it is hard to reproduce this error consistently. This commit comments out the assert for now (the failure rate was around 4.3%, when tested over 600 runs). Closes #258 ---
[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/254#discussion_r178684532 --- Diff: src/ports/postgres/modules/utilities/mean_std_dev_calculator.py_in --- @@ -40,15 +41,27 @@ class MeanStdDevCalculator: self.dimension = dimension def get_mean_and_std_dev_for_ind_var(self): -set_zero_std_to_one = True - x_scaled_vals = utils_ind_var_scales(self.source_table, self.indep_var_array_str, self.dimension, self.schema_madlib, - None, # do not dump the output to a temp table - set_zero_std_to_one) + x_mean_table = None, # do not dump the output to a temp table + set_zero_std_to_one=True) x_mean_str = _array_to_string(x_scaled_vals["mean"]) x_std_str = _array_to_string(x_scaled_vals["std"]) +if not x_mean_str or not x_std_str: +plpy.error("mean/stddev for the independent variable" + "cannot be null") + return x_mean_str, x_std_str + +def create_mean_std_table_for_ind_var_grouping(self, x_mean_table, grouping_cols): +utils_ind_var_scales_grouping(self.source_table, + self.indep_var_array_str, + self.dimension, + self.schema_madlib, + grouping_cols, + x_mean_table, + set_zero_std_to_one = True, + create_temp_table = False) --- End diff -- Could you please correct the indentation here? ---
[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/254#discussion_r178684358 --- Diff: src/ports/postgres/modules/convex/utils_regularization.py_in --- @@ -85,6 +86,8 @@ def utils_ind_var_scales_grouping(tbl_data, col_ind_var, dimension, x_mean_table, set_zero_std_to_one (optional, default is False. If set to true 0.0 standard deviation values will be set to 1.0) +create_temp_table If set to true, create a persistent instead of a temp + table, else create a temp table for x_mean --- End diff -- Shouldn't this comment say create temp table when true, and a persistent table when set to false? ---
[GitHub] madlib pull request #251: MLP: Simplify initialization of model coefficients
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/251#discussion_r178204067 --- Diff: src/modules/convex/mlp_igd.cpp --- @@ -98,23 +98,28 @@ mlp_igd_transition::run(AnyType ) { double is_classification_double = (double) is_classification; double activation_double = (double) activation; -MappedColumnVector coeff = args[10].getAs(); - state.task.model.rebind(_classification_double,_double, -()[0], numberOfStages, -[0]); - -// state.task.model.is_classification = -// static_cast(is_classification); -// state.task.model.activation = static_cast(activation); -// MappedColumnVector initial_coeff = args[10].getAs(); -// // copy initial_coeff into the model -// Index fan_in, fan_out, layer_start = 0; -// for (size_t k = 0; k < numberOfStages; ++k){ -// fan_in = numbersOfUnits[k]; -// fan_out = numbersOfUnits[k+1]; -// state.task.model.u[k] << initial_coeff.segment(layer_start, (fan_in+1)*fan_out); -// layer_start = (fan_in + 1) * fan_out; -// } +if (!args[10].isNull()){ +// initial coefficients are provided +MappedColumnVector warm_start_coeff = args[10].getAs(); + +// copy warm start into the task model +// state.reset() ensures algo.incrModel is copied from task.model +Index layer_start = 0; +for (size_t k = 0; k < numberOfStages; ++k){ +for (size_t j=0; j < state.task.model.u[k].cols(); ++j){ +for (size_t i=0; i < state.task.model.u[k].rows(); ++i){ +state.task.model.u[k](i, j) = warm_start_coeff( +layer_start + j * state.task.model.u[k].rows() + i); +} +} +layer_start = state.task.model.u[k].rows() * state.task.model.u[k].cols(); +} +} else { +// initialize the model with appropriate coefficients +state.task.model.initialize( --- End diff -- This is more of a doubt than a comment I guess. Should this be `state.algo.incrModel` instead? The reason I thought so was that `MLPIGDAlgorithm::transition(state, tuple);` uses the model in algo instead of the one task. ---
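The warm-start copy in the diff above unpacks a flat coefficient vector into per-layer matrices in column-major order (`layer_start + j * rows + i`). A standalone sketch of that unpacking — the function name and plain-list matrices are ours; note the per-layer offset must accumulate across layers, which the sketch does by advancing `layer_start` by each layer's size:

```python
def unflatten_coeff(flat, layer_shapes):
    """Split a flat list into per-layer matrices, column-major per layer.

    layer_shapes is a list of (rows, cols); rows includes the bias row.
    """
    layers = []
    layer_start = 0
    for rows, cols in layer_shapes:
        matrix = [[0.0] * cols for _ in range(rows)]
        for j in range(cols):
            for i in range(rows):
                # Same column-major indexing as the diff's inner loops.
                matrix[i][j] = flat[layer_start + j * rows + i]
        layers.append(matrix)
        layer_start += rows * cols
    return layers
```
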
[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/250#discussion_r178167691 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, dependent_varname, if is_classification: # Currently, classification doesn't accept an --- End diff -- Yes, that's correct. ---
[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/250#discussion_r178168389 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -856,8 +868,16 @@ def mlp_predict(schema_madlib, model_table, data_table, id_col_name, activation = _get_activation_index(summary['activation']) layer_sizes = PY2SQL( summary['layer_sizes'], array_type="DOUBLE PRECISION") -is_classification = int(summary["is_classification"]) is_response = int(pred_type == 'response') +is_classification = int(summary["is_classification"]) +classes = summary['classes'] +# Set a flag to indicate that it is a classification model, with an array +# as the dependent var. The only scenario where classification allows for +# an array dep var is when the user has provided a one-hot encoded dep var +# during training, and mlp_classification does not one-hot encode +# (and hence classes column in model's summary table is NULL). +array_dep_var_for_classification = int(is_classification and not classes) --- End diff -- Will change it to `is_dep_var_an_array_for_classification` ---
[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/250#discussion_r178168247 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, dependent_varname, if is_classification: # Currently, classification doesn't accept an # array for dep type in IGD -_assert("[]" not in expr_type and expr_type in classification_types, +_assert(("[]" in expr_type and expr_type[:-2] in int_types) \ +or expr_type in classification_types, "Dependent variable column should be of type: " --- End diff -- - One-hot encoding is normally expected to be an integer array, hence the check was only for int types. That assumption might be too strict, though, so I will make changes to allow other numeric types too. - Good catch. It might be a good idea to disallow anything greater than a 1-D array; I will add an error check for the same. ---
[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/250#discussion_r178168263 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, dependent_varname, if is_classification: # Currently, classification doesn't accept an # array for dep type in IGD -_assert("[]" not in expr_type and expr_type in classification_types, +_assert(("[]" in expr_type and expr_type[:-2] in int_types) \ +or expr_type in classification_types, "Dependent variable column should be of type: " --- End diff -- Will change it to cover both the cases. ---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/248#discussion_r177495864 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in --- @@ -970,16 +970,35 @@ def _get_bins_grps( "one value are dropped from the tree model.") cat_features = [feature for feature in cat_features if feature in use_cat_features] -grp_to_col_to_row = dict((grp_key, dict( -(row['colname'], row['levels']) for row in items)) -for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))) - +# grp_col_to_levels is a list of tuples (pairs) with +# first value = group value, +# second value = a dict mapping a categorical column to its levels in data +# (these levels are specific to the group and can be different +# for different groups) +# The list of tuples can be converted to a dict, but the ordering +# will be lost. +# eg. grp_col_to_levels = +# [ +# ('3', {'vs': [0, 1], 'cyl': [4,6,8]}), +# ('4', {'vs': [0, 1], 'cyl': [4,6]}), +# ('5', {'vs': [0, 1], 'cyl': [4,6,8]}) +# ] +grp_to_col_to_levels = [ +(grp_key, dict((row['colname'], row['levels']) for row in items)) +for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))] if cat_features: -cat_items_list = [rows[col] for col in cat_features - for grp_key, rows in grp_to_col_to_row.items() if col in rows] +# Below statements collect the grp_to_col_to_levels into multiple variables +# From above eg. +# cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]] +# cat_n = [2, 3, 2, 2, 2, 3] +# cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8] +# grp_key_cat = ['3', '4', '5'] --- End diff -- +1 for the examples, very helpful. ---
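The grouping step quoted above — and the worked example in its comments — can be reproduced with `itertools.groupby`. A sketch using simplified row dicts (in the real code these rows come from a catalog query, and the list comprehension matches the diff):

```python
from itertools import groupby
from operator import itemgetter

def group_levels(all_levels):
    """Pair each grp_key with a {column: levels} dict, preserving group order.

    `all_levels` must already be sorted by 'grp_key': groupby only merges
    adjacent rows with equal keys.
    """
    return [
        (grp_key, dict((row['colname'], row['levels']) for row in items))
        for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))
    ]
```

Keeping the result as a list of `(group, dict)` tuples rather than a dict of dicts is exactly the ordering point the diff comment makes: converting to a dict would lose the group order.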
[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/250 MLP: Allow one-hot encoded dependent var for classification JIRA:MADLIB-1222 MLP currently automatically encodes categorical variables for classification but does not allow already encoded arrays for dependent variables in mlp_classification. This commit lets users have an already encoded array for the dependent variable and train a model. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/mlp/support-encoded-dep-var Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/250.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #250 commit f5a87dee6bc8f27c1a13a4921ea726b391b1813d Author: Nandish Jayaram <njayaram@...> Date: 2018-03-20T22:43:25Z MLP: Allow one-hot encoded dependent var for classification JIRA:MADLIB-1222 MLP currently automatically encodes categorical variables for classification but does not allow already encoded arrays for dependent variables in mlp_classification. This commit lets users have an already encoded array for the dependent variable and train a model. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r176218740 --- Diff: src/modules/convex/mlp_igd.cpp --- @@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the multilayer perceptron minibatch transition step + * + * Called for each tuple. + */ +AnyType +mlp_minibatch_transition::run(AnyType &args) { +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +MLPMiniBatchState<MutableArrayHandle<double> > state = args[0]; + +// initialize the state if first tuple +if (state.algo.numRows == 0) { +if (!args[3].isNull()) { +MLPMiniBatchState<ArrayHandle<double> > previousState = args[3]; --- End diff -- Tried it, it was cleaner this way. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175949965 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, it_args.update({ 'group_by_clause': group_by_clause, 'using_clause': using_clause, -'grouping_str_comma': grouping_str_comma +'grouping_str_comma': grouping_str_comma, }) first_try = True temp_output_table = unique_string(desp='temp_output_table') + +layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes] + for _ in range(n_tries): +prev_state = None if not warm_start: coeff = [] -for i in range(len(layer_sizes) - 1): -fan_in = layer_sizes[i] -fan_out = layer_sizes[i + 1] +for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]): # Initialize according to Glorot and Bengio (2010) # See design doc for more info span = math.sqrt(6.0 / (fan_in + fan_out)) -dim = (layer_sizes[i] + 1) * layer_sizes[i + 1] -rand = plpy.execute("""SELECT array_agg({span}*2*(random()-0.5)) - AS random - FROM generate_series(0,{dim}) -""".format(span=span, dim=dim))[0]["random"] +dim = (fan_in + 1) * fan_out +rand = [span * (random() - 0.5) for _ in range(dim)] --- End diff -- It's supposed to be explained in the design doc as per the comment. I think these formulae are taken from a research paper. ---
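The hunk above initializes each layer's coefficients with the Glorot & Bengio (2010) uniform scheme. A minimal Python sketch of that scheme (the function name is mine; it uses the full `2 * (random() - 0.5)` factor from the original SQL so draws cover `[-span, span]`, and the `+1` accounts for the bias node):

```python
import math
import random

def glorot_uniform_coeffs(layer_sizes, rng=None):
    # One flat coefficient vector per consecutive layer pair, drawn
    # uniformly from [-span, span] with span = sqrt(6 / (fan_in + fan_out)).
    rng = rng or random.Random()
    coeffs = []
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        span = math.sqrt(6.0 / (fan_in + fan_out))
        dim = (fan_in + 1) * fan_out  # +1 for the bias term
        coeffs.append([span * 2.0 * (rng.random() - 0.5) for _ in range(dim)])
    return coeffs
```

For example, `glorot_uniform_coeffs([4, 3, 2])` yields two vectors of lengths 15 and 8.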
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175950079 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, # used, it will be an empty list if there was not grouping. groups = [t[col_grp_key] for t in res if t[col_grp_key]] losses = [t['loss'] for t in res] -loss = zip(groups, losses) if len(groups)==len(losses) \ - else losses -plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \ - ', '.join([str(l) for l in loss]) + ">") +loss = zip(groups, losses) if groups else losses +plpy.info("Iteration: {0}, Loss: <{1}>". + format(it.iteration, ', '.join(map(str, loss it.final() _update_temp_model_table(it_args, it.iteration, temp_output_table, - first_try) + is_minibatch_enabled, first_try) first_try = False -layer_sizes_str = py_list_to_sql_string( -layer_sizes, array_type="integer") -classes_str = py_list_to_sql_string( -[strip_end_quotes(cl, "'") for cl in classes], -array_type=dependent_type) +layer_sizes_str = py_list_to_sql_string(layer_sizes, +array_type="integer") + _create_summary_table(locals()) -_create_standardization_table(standardization_table, x_mean_table, - warm_start) +if is_minibatch_enabled: +# We already have the mean and std in the input standardization table +input_std_table = add_postfix(source_table, '_standardization') +_create_standardization_table(standardization_table, input_std_table, + warm_start) +else: +_create_standardization_table(standardization_table, x_mean_table, + warm_start) +# The original input table is the tab_data_scaled for mini batch. +# Do NOT drop this, it will end up dropping the original data table. +plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl_data_scaled)) +plpy.execute("DROP TABLE IF EXISTS {0}".format(x_mean_table)) --- End diff -- Yes. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175947887 --- Diff: src/modules/convex/algo/igd.hpp --- @@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state, for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) { double loss = 0.0; -for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches; - curr_batch++, curr_batch_row_index += batch_size) { - Matrix X_batch; - ColumnVector y_batch; - if (curr_batch == n_batches-1) { - // last batch - X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index); - y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index); - } else { - X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols); - y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size); - } - loss += Task::getLossAndUpdateModel( - state.task.model, X_batch, y_batch, state.task.stepsize); +int random_curr_batch[n_batches]; +for(int i=0; i<n_batches; i++) { +random_curr_batch[i] = i; +} +int curr_batch_row_index = 0; +std::random_shuffle(&random_curr_batch[0], &random_curr_batch[n_batches]); +for (int i=0; i < n_batches; i++) { +int curr_batch = random_curr_batch[i]; +int curr_batch_row_index = curr_batch * batch_size; +Matrix X_batch; +Matrix Y_batch; +if (curr_batch == n_batches-1) { + // last batch + X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index); + Y_batch = tuple.depVar.bottomRows(n_rows-curr_batch_row_index); +} else { +X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols); +Y_batch = tuple.depVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols); --- End diff -- Yes, that seems right. ---
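The hunk above walks the buffer in fixed-size slices and special-cases the last, possibly smaller, batch with `bottomRows()`. A hedged Python sketch of the same slicing logic (list slicing absorbs the last-batch special case automatically; names are mine):

```python
def iter_minibatches(X, Y, batch_size):
    # Visit the buffer in contiguous row slices. The final slice simply
    # keeps whatever rows remain, mirroring the bottomRows()/block()
    # split in the C++ hunk above.
    n_rows = len(X)
    for start in range(0, n_rows, batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]
```

With 10 rows and `batch_size=4` this yields batches of 4, 4, and 2 rows.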
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175947857 --- Diff: src/modules/convex/algo/igd.hpp --- @@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state, for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) { double loss = 0.0; -for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches; - curr_batch++, curr_batch_row_index += batch_size) { - Matrix X_batch; - ColumnVector y_batch; - if (curr_batch == n_batches-1) { - // last batch - X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index); - y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index); - } else { - X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols); - y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size); - } - loss += Task::getLossAndUpdateModel( - state.task.model, X_batch, y_batch, state.task.stepsize); +int random_curr_batch[n_batches]; +for(int i=0; i<n_batches; i++) { +random_curr_batch[i] = i; +} +int curr_batch_row_index = 0; +std::random_shuffle(&random_curr_batch[0], &random_curr_batch[n_batches]); --- End diff -- I don't think we should set a seed. Setting a seed would ensure we always get the same set of random numbers. ---
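The point of the shuffle is to visit batches in a fresh random order each epoch. A minimal Python sketch of the same idea (function name is mine; leaving `seed=None`, the default, keeps runs non-deterministic, which is the position taken in the reply above):

```python
import random

def epoch_batch_order(n_batches, seed=None):
    # Build the identity order [0, 1, ..., n_batches-1] and shuffle it,
    # as the C++ hunk does with std::random_shuffle. A fixed seed is only
    # useful for reproducing a run, e.g. in tests.
    order = list(range(n_batches))
    random.Random(seed).shuffle(order)
    return order
```

Each epoch then iterates `for b in epoch_batch_order(n_batches): ...` over the shuffled indices.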
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175950139 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -491,10 +571,28 @@ def _update_temp_model_table(args, iteration, temp_output_table, first_try): ) rel_state_subq {join_clause} """.format(insert_or_create_str=insert_or_create_str, - iteration=iteration, join_clause=join_clause, **args) + iteration=iteration, join_clause=join_clause, + internal_result_udf=internal_result_udf, **args) plpy.execute(model_table_query) +def _get_loss(schema_madlib, state, is_mini_batch): --- End diff -- Yes, must remove it. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175948252 --- Diff: src/modules/convex/mlp_igd.cpp --- @@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the multilayer perceptron minibatch transition step + * + * Called for each tuple. + */ +AnyType +mlp_minibatch_transition::run(AnyType &args) { +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +MLPMiniBatchState<MutableArrayHandle<double> > state = args[0]; + +// initialize the state if first tuple +if (state.algo.numRows == 0) { +if (!args[3].isNull()) { +MLPMiniBatchState<ArrayHandle<double> > previousState = args[3]; +state.allocate(*this, previousState.task.numberOfStages, + previousState.task.numbersOfUnits); +state = previousState; +} else { +// configuration parameters +ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >(); --- End diff -- We probably could, but there are a couple of extra arguments that only minibatch gets, and not IGD (batch_size and n_epochs). ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175949682 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, it_args.update({ 'group_by_clause': group_by_clause, 'using_clause': using_clause, -'grouping_str_comma': grouping_str_comma +'grouping_str_comma': grouping_str_comma, }) first_try = True temp_output_table = unique_string(desp='temp_output_table') + +layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes] --- End diff -- Yes, this looks like duplicated code. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175948750 --- Diff: src/modules/convex/task/mlp.hpp --- @@ -111,6 +117,57 @@ class MLP { template double MLP<Model, Tuple>::lambda = 0; +template +double +MLP<Model, Tuple>::getLossAndUpdateModel( +model_type , +const Matrix _batch, +const Matrix _true_batch, +const double ) { + +uint16_t N = model.u.size(); // assuming nu. of layers >= 1 +size_t n = x_batch.rows(); +size_t i, k; +double total_loss = 0.; + +// gradient added over the batch +std::vector total_gradient_per_layer(N); +for (k=0; k < N; ++k) +total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(), + model.u[k].cols()); + +for (i=0; i < n; i++){ +ColumnVector x = x_batch.row(i); +ColumnVector y_true = y_true_batch.row(i); + +std::vector net, o, delta; +feedForward(model, x, net, o); --- End diff -- We will have to change the design docs too for that. Apparently, the notation used here is supposed to be in sync with the design doc. ---
[GitHub] madlib issue #241: MiniBatch Pre-Processor: Add new module minibatch_preproc...
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/241 Another issue I found but forgot to mention in the review: The `__id__` column has double values instead of integers. For instance, I found values such as `0.2000` for that column in the output table. This issue also happens only when the module is used without specifying a value for the `buffer_size` param. ---
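The fractional `__id__` values (e.g. `0.2000`) suggest the buffer id is computed with float arithmetic when `buffer_size` is derived rather than user-supplied. A hypothetical sketch, not the module's actual implementation, of keeping the id column integral via integer division:

```python
def assign_buffer_ids(n_rows, buffer_size):
    # Integer division keeps the buffer id integral (0, 1, 2, ...);
    # a float computation such as row / float(buffer_size) could surface
    # values like 0.2000 in the __id__ column instead.
    return [row // buffer_size for row in range(n_rows)]
```

For example, 10 rows with `buffer_size=4` map to buffer ids `0, 0, 0, 0, 1, 1, 1, 1, 2, 2`.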
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175548350 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in --- @@ -0,0 +1,559 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ + +""" +@file minibatch_preprocessing.py_in + +""" +from math import ceil +import plpy + +from utilities import add_postfix +from utilities import _assert +from utilities import get_seg_number +from utilities import is_platform_pg +from utilities import is_psql_numeric_type +from utilities import is_string_formatted_as_array_expression +from utilities import py_list_to_sql_string +from utilities import split_quoted_delimited_str +from utilities import _string_to_array +from utilities import validate_module_input_params +from mean_std_dev_calculator import MeanStdDevCalculator +from validate_args import get_expr_type +from validate_args import output_tbl_valid +from validate_args import _tbl_dimension_rownum + +m4_changequote(`') + +# These are readonly variables, do not modify +MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname" +MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname" + +class MiniBatchPreProcessor: +""" +This class is responsible for executing the main logic of mini batch +preprocessing, which packs multiple rows of selected columns from the +source table into one row based on the buffer size +""" +def __init__(self, schema_madlib, source_table, output_table, + dependent_varname, independent_varname, buffer_size, **kwargs): +self.schema_madlib = schema_madlib +self.source_table = source_table +self.output_table = output_table +self.dependent_varname = dependent_varname +self.independent_varname = independent_varname +self.buffer_size = buffer_size + +self.module_name = "minibatch_preprocessor" +self.output_standardization_table = add_postfix(self.output_table, + "_standardization") +self.output_summary_table = add_postfix(self.output_table, "_summary") +self._validate_minibatch_preprocessor_params() + +def minibatch_preprocessor(self): +# Get array expressions for both dep and indep variables from the +# MiniBatchQueryFormatter class +dependent_var_dbtype = get_expr_type(self.dependent_varname, + self.source_table) +qry_formatter = 
MiniBatchQueryFormatter(self.source_table) +dep_var_array_str, dep_var_classes_str = qry_formatter.\ +get_dep_var_array_and_classes(self.dependent_varname, + dependent_var_dbtype) +indep_var_array_str = qry_formatter.get_indep_var_array_str( + self.independent_varname) + +standardizer = MiniBatchStandardizer(self.schema_madlib, + self.source_table, + dep_var_array_str, + indep_var_array_str, + self.output_standardization_table) +standardize_query = standardizer.get_query_for_standardizing() + +num_rows_processed, num_missing_rows_skipped = self.\ + _get_skipped_rows_processed_count( +dep_var_array_str, +indep_var_array_str) +calculated_buffer_size = MiniBatchBufferSizeCalculator.\ +
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175588969 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175593796 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175531202 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175522378 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -794,6 +794,41 @@ def collate_plpy_result(plpy_result_rows): # -- +def validate_module_input_params(source_table, output_table, independent_varname, + dependent_varname, module_name, **kwargs): --- End diff -- How about having an optional param to deal with checking for residual output tables (summary and standardization tables). We could take a list of suffixes to check for. ---
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175585050 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/243

MLP: Add minibatch gradient descent solver

JIRA: MADLIB-1206

This commit adds support for mini-batch based gradient descent for MLP. If the input table contains a 2D matrix for the independent variable, minibatch is automatically used as the solver. Two minibatch-specific optimizer params are also introduced: batch_size and n_epochs.
- batch_size defaults to min(200, buffer_size), where buffer_size is the number of original input rows packed into a single row of the matrix.
- n_epochs is the number of times all the batches in a buffer are iterated over (default 1).

Other changes include:
- The dependent variable in the minibatch solver is now also a matrix; it was initially a vector.
- Randomize the order in which batches are processed within an epoch.
- MLP minibatch currently doesn't support the weights param; an error is now thrown.
- Delete an unused type named mlp_step_result.
- Add unit tests for newly added functions in the Python file.

Co-authored-by: Rahul Iyer <ri...@apache.org>
Co-authored-by: Nikhil Kak <n...@pivotal.io>

Closes #243

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib mlp-minibatch-with-preprocessed-data-rebased

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/243.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #243

commit d9306f7c6a44f64c53df13c34759da55468c4d26
Author: Nandish Jayaram <njayaram@...>
Date: 2018-02-28T00:51:42Z

    MLP: Add minibatch gradient descent solver

---
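The batching behavior described in the PR above (batch_size defaulting to min(200, buffer_size), n_epochs passes, and randomized batch order within each epoch) can be sketched in Python. This is an illustrative sketch, not MADlib's actual C++/plpy implementation; the function name `minibatch_epochs` is made up.

```python
import random

def minibatch_epochs(buffer_x, buffer_y, batch_size=None, n_epochs=1, seed=0):
    """Yield (x, y) mini-batches from one packed buffer, n_epochs times."""
    # Default follows the PR: batch_size = min(200, buffer_size), where
    # buffer_size is the number of original input rows packed into this buffer.
    buffer_size = len(buffer_x)
    if batch_size is None:
        batch_size = min(200, buffer_size)
    rng = random.Random(seed)
    for _ in range(n_epochs):
        # Randomize the order in which batches are processed within an epoch.
        starts = list(range(0, buffer_size, batch_size))
        rng.shuffle(starts)
        for s in starts:
            yield buffer_x[s:s + batch_size], buffer_y[s:s + batch_size]
```

Note that every row is still visited exactly once per epoch; only the order of the batches is shuffled, which is what the PR describes.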
[GitHub] madlib pull request #240: MLP: Fix step size initialization based on learnin...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/240

MLP: Fix step size initialization based on learning rate policy

JIRA: MADLIB-1212

The step_size is supposed to be updated based on the learning rate policy. The formulae for the different policies depend on the current iteration number, which was not being consumed correctly. This commit fixes that by using it.iteration as the current iteration number in the update.

Co-authored-by: Nikhil Kak <n...@pivotal.io>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib bugfix/mlp-learning-rate-policy

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/240.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #240

commit 72c6c8bdf5bfb50132e0566ca9600f82e99b7700
Author: Nandish Jayaram <njayaram@...>
Date: 2018-03-06T19:10:50Z

    MLP: Fix step size initialization based on learning rate policy

---
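To illustrate why the iteration number matters in the fix above: every non-constant learning-rate policy is a function of the current iteration, so feeding it the wrong value silently changes the decay schedule. The sketch below uses common Caffe-style policy names and formulae as an assumption; MADlib's exact policy set and parameter names may differ.

```python
def step_size_for_iteration(base_lr, iteration, policy="constant",
                            gamma=0.9, power=0.5, steps=1):
    """Return the step size at a given iteration under a decay policy.

    Each non-constant formula consumes the iteration number, which is the
    value the bugfix above supplies via it.iteration."""
    if policy == "constant":
        return base_lr
    if policy == "exp":
        # geometric decay per iteration
        return base_lr * gamma ** iteration
    if policy == "inv":
        # polynomial decay in the iteration number
        return base_lr * (1.0 + iteration) ** (-power)
    if policy == "step":
        # decay once every `steps` iterations
        return base_lr * gamma ** (iteration // steps)
    raise ValueError("unknown learning rate policy: %s" % policy)
```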
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r173254181

--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name;
(25 rows)

-# To perform the balance sampling for independent groups, use the 'grouping_cols' parameter. Note below that each group (zone) has a different count of the classes (mainhue), with some groups not containing some class values.

DROP TABLE IF EXISTS output_table;
SELECT madlib.balance_sample(
    'flags',          -- Source table
    'output_table',   -- Output table
    'mainhue',        -- Class column
    NULL,             -- Uniform
    NULL,             -- Output table size
    'zone'            -- Grouping by zone
);
SELECT * FROM output_table ORDER BY zone, mainhue;

 __madlib_id__ | id |    name     | landmass | zone | area | population | language | colours | mainhue
---------------+----+-------------+----------+------+------+------------+----------+---------+---------
             6 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
             5 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
             8 | 17 | Sweden      |        3 |    1 |  450 |          8 |        6 |       2 | blue
             7 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
             2 |  7 | Denmark     |        3 |    1 |   43 |          5 |        6 |       2 | red
             1 |  6 | China       |        5 |    1 | 9561 |       1008 |        7 |       2 | red
             4 | 12 | Luxembourg  |        3 |    1 |    3 |          0 |        4 |       3 | red
             3 | 18 | Switzerland |        3 |    1 |   41 |          6 |        4 |       2 | red
             1 |  2 | Australia   |        6 |    2 | 7690 |         15 |        1 |       3 | blue
             1 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
             2 |  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
             6 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
             5 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
             4 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
            12 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
            10 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
            11 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
             1 | 19 | UK          |        3 |    4 |  245 |         56 |        1 |       3 | red
             3 |  5 | Canada      |        1 |    4 | 9976 |         24 |        1 |       2 | red
             2 | 15 | Portugal    |        3 |    4 |   92 |         10 |        6 |       5 | red
             8 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
             7 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
             9 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
(23 rows)

-# Grouping can be used with class size specification as well. Note below that 'blue=' is the only valid class-size specification here since 'blue' is the only class value that is present in every group. Further, `blue=8` in the example below is split among the four groups, resulting in two blue rows for each group.

DROP TABLE IF EXISTS output_table;
SELECT madlib.balance_sample(
    'flags',          -- Source table
    'output_table',   -- Output table
    'mainhue',        -- Class column
    'blue=8',         -- Specified class value size. The rest of the values are output as is.
    NULL,             -- Output table size
    'zone'            -- Group by zone
);
SELECT * FROM output_table ORDER BY zone, mainhue;

 __madlib_id__ | id |    name     | landmass | zone | area | population | language | colours | mainhue
---------------+----+-------------+----------+------+------+------------+----------+---------+---------
             2 | 17 | Sweden      |        3 |    1 |  450 |          8 |        6 |       2 | blue
             1 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
             3 |  3 | Austria     |        3 |    1 |   84 |          8 |        4 |       2
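The per-group behavior shown in the example above can be sketched in plain Python. This is a simplified, in-memory illustration of uniform balancing with `grouping_cols` (each group balanced independently, minority classes oversampled with replacement up to the group's majority-class count); MADlib's actual SQL implementation differs in detail, and the function name is made up.

```python
import random
from collections import defaultdict

def balance_per_group(rows, class_key, group_key, seed=0):
    """Balance each group independently by oversampling minority classes
    (with replacement) up to that group's majority-class count."""
    rng = random.Random(seed)
    by_group = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_group[row[group_key]][row[class_key]].append(row)
    out = []
    for classes in by_group.values():
        # Target size is the majority-class count *within this group only*.
        target = max(len(members) for members in classes.values())
        for members in classes.values():
            out.extend(members)
            out.extend(rng.choice(members) for _ in range(target - len(members)))
    return out
```

A class value absent from a group simply contributes nothing to that group, which mirrors the note in the diff that some groups do not contain some class values.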
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172954235

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, output_table, class_col,
     parsed_class_sizes = extract_keyvalue_params(class_sizes,
                                                  allow_duplicates=False,
                                                  lower_case_names=False)
+    distinct_levels = collate_plpy_result(
+        plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ".
+                     format(class_col, source_table)))['levels']
     if not parsed_class_sizes:
-        sampling_strategy_str = _validate_and_get_sampling_strategy(class_sizes,
-                                                                    output_table_size)
+        sampling_strategy_str = _validate_and_get_sampling_strategy(
+            class_sizes, output_table_size)
     else:
         sampling_strategy_str = None
         try:
             for each_level, each_class_size in parsed_class_sizes.items():
-                _assert(each_level in actual_level_counts,
+                _assert(each_level in distinct_levels,
                         "Sample: Invalid class value specified ({0})".
-                        format(each_level))
+                        format(each_level))
                 each_class_size = int(each_class_size)
                 _assert(each_class_size >= 1,
                         "Sample: Class size has to be greater than zero")
                 parsed_class_sizes[each_level] = each_class_size
-
-        except TypeError:
+        except ValueError:
             plpy.error("Sample: Invalid value for class_sizes ({0})".
                        format(class_sizes))

     # Get the number of rows to be sampled for each class level, based on
     # the input table, class_sizes, and output_table_size params. This also
     # includes info about the resulting sampling strategy, i.e., one of
     # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level.
-    target_class_sizes = _get_target_level_counts(sampling_strategy_str,
-                                                  parsed_class_sizes,
-                                                  actual_level_counts,
-                                                  output_table_size)
-
-    undersample_level_dict, oversample_level_dict, nosample_level_dict = \
-        _get_sampling_strategy_specific_dict(target_class_sizes)
-
-    # Get subqueries for each sampling strategy, so that they can be used
-    # together in one big query.
-
-    # Subquery that will be used to get rows as is for those class levels
-    # that need no sampling.
-    nosample_subquery = _get_nosample_subquery(
-        new_source_table, class_col, nosample_level_dict.keys())
-    # Subquery that will be used to sample those class levels that
-    # have to be oversampled.
-    oversample_subquery = _get_with_replacement_subquery(
-        schema_madlib, new_source_table, source_table_columns, class_col,
-        actual_level_counts, oversample_level_dict)
-    # Subquery that will be used to sample those class levels that
-    # have to be undersampled. Undersampling supports both with and without
-    # replacement, so fetch the appropriate subquery.
-    if with_replacement:
-        undersample_subquery = _get_with_replacement_subquery(
-            schema_madlib, new_source_table, source_table_columns, class_col,
-            actual_level_counts, undersample_level_dict)
-    else:
-        undersample_subquery = _get_without_replacement_subquery(
-            schema_madlib, new_source_table, source_table_columns, class_col,
-            actual_level_counts, undersample_level_dict)
-
-    # Merge the three subqueries using a UNION ALL clause.
-    union_all_subquery = ' UNION ALL '.join(
-        ['({0})'.format(subquery)
-         for subquery in [undersample_subquery, oversample_subquery, nosample_subquery]
-         if subquery])
-
-    final_query = """
-        CREATE TABLE {output_table} AS
-            SELECT row_number() OVER() AS {new_col_name}, *
+    grp_col_str, grp_cols = get_grouping_col_str(
+        schema_madlib, 'Balance sample', [NEW_ID_COLUMN
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172953687

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
(same hunk as in the previous comment)
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172958587

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -58,28 +60,64 @@
 NOSAMPLE = 'nosample'
 NEW_ID_COLUMN = '__madlib_id__'
 NULL_IDENTIFIER = '__madlib_null_id__'

-def _get_level_frequency_distribution(source_table, class_col):
-    """ Returns a dict containing the number of rows associated with each class
+
+def _get_level_frequency_distribution(source_table, class_col,
+                                      grp_by_cols=None):
+    """ Count the number of rows for each class, partitioned by the grp_by_cols
+
+    Returns a dict containing the number of rows associated with each class
     level. Each class level count is converted to a string using ::text.
     None is a valid key in this dict, capturing NULL value in the database.
     """
+    if grp_by_cols and grp_by_cols.lower() != 'null':
+        is_grouping = True
+        grp_by_cols_comma = grp_by_cols + ', '
+        array_grp_by_cols_comma = "array[{0}]".format(grp_by_cols) + " as group_values, "
+    else:
+        is_grouping = False
+        grp_by_cols_comma = array_grp_by_cols_comma = ""
+
+    # In below query, the inner query groups the data using grp_by_cols + classes
+    # and obtains the count for each combination. The outer query then groups
+    # again by the grp_by_cols to collect the classes and counts in an array.
     query_result = plpy.execute("""
-        SELECT {class_col}::text AS classes,
-               count(*) AS class_count
-        FROM {source_table}
-        GROUP BY {class_col}
-        """.format(**locals()))
+        SELECT
+            -- For each group get the classes and their rows counts
+            {grp_identifier} as group_values,
+            array_agg(classes) as classes,
+            array_agg(class_count) as class_count
+        FROM(
+            -- for each group and class combination present in source table
+            -- get the count of rows for that combination
+            SELECT
+                {array_grp_by_cols_comma}
+                ({class_col})::TEXT AS classes,
+                count(*) AS class_count
+            FROM {source_table}
+            GROUP BY {grp_by_cols_comma} ({class_col})
+        ) q
+        GROUP BY {grp_identifier}
--- End diff --

In Greenplum 5.x, the above query fails on install check, with the following error:
```
SELECT balance_sample('"TEST_s"', 'out_sr2', 'gr1', 'undersample ', NULL, NULL, TRUE, TRUE);
psql:/tmp/madlib.2N5sjK/sample/test/balance_sample.sql_in.tmp:111: ERROR:  plpy.SPIError: non-integer constant in GROUP BY
LINE 17:         GROUP BY true
                          ^
QUERY:
SELECT
    -- For each group get the classes and their rows counts
    true as group_values,
    array_agg(classes) as classes,
    array_agg(class_count) as class_count
FROM(
    -- for each group and class combination present in source table
    -- get the count of rows for that combination
    SELECT
        (gr1)::TEXT AS classes,
        count(*) AS class_count
    FROM "TEST_s"
    GROUP BY (gr1)
) q
GROUP BY true
CONTEXT: Traceback (most recent call last):
  PL/Python function "balance_sample", line 23, in
    return balance_sample.balance_sample(**globals())
  PL/Python function "balance_sample", line 575, in balance_sample
  PL/Python function "balance_sample", line 100, in _get_level_frequency_distribution
PL/Python function "balance_sample"
```
---
[GitHub] madlib issue #237: Bugfix: MLP predict using 1.12 model fails on later versi...
Github user njayaram2 commented on the issue: https://github.com/apache/madlib/pull/237 Thank you for the comments @iyerr3, will push another commit. ---
[GitHub] madlib pull request #237: Bugfix: MLP predict using 1.12 model fails on late...
GitHub user njayaram2 opened a pull request: https://github.com/apache/madlib/pull/237

Bugfix: MLP predict using 1.12 model fails on later versions

JIRA: MADLIB-1207

MADlib 1.12 did not support grouping in MLP. The summary table it created held the mean and std used for standardizing the independent variable. From MADlib 1.13 onwards, grouping was supported, and the mean and std were moved to the standardization table. This resulted in a failure when MADlib 1.12 MLP models were used to predict using MADlib 1.13. This commit fixes this issue.

Closes #237

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/njayaram2/madlib bug/mlp-1.12model-grouping

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/237.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #237

commit 63daad10bf8c610d60be738f79e9bdc22ddc6483
Author: Nandish Jayaram <njayaram@...>
Date: 2018-02-22T01:18:57Z

    Bugfix: MLP predict using 1.12 model fails on later versions

---
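The shape of a backward-compatibility fix like the one above is to detect which layout the model was trained under and read the standardization parameters from the right place. The sketch below is illustrative only: the column names (`x_means`, `x_stds`) and table suffixes are assumptions, not MADlib's exact schema.

```python
def standardization_source(summary_columns, model_table):
    """Decide where to read the standardization mean/std for MLP predict.

    Pre-1.13 models kept mean/std in the summary table itself; 1.13+ models
    (which support grouping) keep them in a separate standardization table.
    """
    if "x_means" in summary_columns and "x_stds" in summary_columns:
        # 1.12-style model: read mean/std directly from the summary table.
        return model_table + "_summary"
    # 1.13+ model: grouping-aware standardization table.
    return model_table + "_standardization"
```

Predict code can then build its standardization query against whichever table this helper returns, so old models keep working on new versions.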
[GitHub] madlib pull request #230: Balanced sets final
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/230#discussion_r166056096

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,748 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+m4_changequote(`<!', `!>')
+
+import math
+
+if __name__ != "__main__":
+    import plpy
+    from utilities.control import MinWarning
+    from utilities.utilities import _assert
+    from utilities.utilities import extract_keyvalue_params
+    from utilities.utilities import unique_string
+    from utilities.validate_args import columns_exist_in_table
+    from utilities.validate_args import get_cols
+    from utilities.validate_args import table_exists
+    from utilities.validate_args import table_is_empty
+else:
+    # Used only for Unit Testing
+    # FIXME: repeating a function from utilities that is needed by the unit test.
+    # This should be removed once a unittest framework is used for testing.
+    import random
+    import time
+
+    def unique_string(desp='', **kwargs):
+        """
+        Generate random temporary names for temp table and other names.
+        It has a SQL interface so both SQL and Python functions can call it.
+        """
+        r1 = random.randint(1, 100000000)
+        r2 = int(time.time())
+        r3 = int(time.time()) % random.randint(1, 100000000)
+        u_string = "__madlib_temp_" + desp + str(r1) + "_" + str(r2) + "_" + str(r3) + "__"
+        return u_string
+# --
+
+UNIFORM = 'uniform'
+UNDERSAMPLE = 'undersample'
+OVERSAMPLE = 'oversample'
+NOSAMPLE = 'nosample'
+
+NEW_ID_COLUMN = '__madlib_id__'
+NULL_IDENTIFIER = '__madlib_null_id__'
+
+def _get_frequency_distribution(source_table, class_col):
+    """ Returns a dict containing the number of rows associated with each class
+    level. Each class level value is converted to a string using ::text.
+    """
+    query_result = plpy.execute("""
+        SELECT {class_col}::text AS classes,
+               count(*) AS class_count
+        FROM {source_table}
+        GROUP BY {class_col}
+        """.format(**locals()))
+    actual_level_counts = {}
+    for each_row in query_result:
+        level = each_row['classes']
+        if level:
+            level = level.strip()
+        actual_level_counts[level] = each_row['class_count']
+    return actual_level_counts
+
+
+def _validate_and_get_sampling_strategy(sampling_strategy_str, output_table_size,
+                                        supported_strategies=None, default=UNIFORM):
+    """ Returns the sampling strategy based on the class_sizes input param.
+        @param sampling_strategy_str The sampling strategy specified by the
+                                     user (class_sizes param)
+        @returns:
+            Str. One of [UNIFORM, UNDERSAMPLE, OVERSAMPLE]. Default is UNIFORM.
+    """
+    if not sampling_strategy_str:
+        sampling_strategy_str = default
+    else:
+        if len(sampling_strategy_str) < 3:
+            # Require at least 3 characters since UNIFORM and UNDERSAMPLE have
+            # a common prefix substring
+            plpy.error("Sample: Invalid class_sizes parameter")
+
+        if not supported_strategies:
+            supported_strategies = [UNIFORM, UNDERSAMPLE, OVERSAMPLE]
+        try:
+            # allow user to specify a prefix substring of
--- End diff --

There is precedent for supporting a prefix for parameter values, in modules such as SVM. Yes, the error messages could be similar.
---
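The prefix-matching convention discussed in the comment above (accept any unambiguous prefix of a strategy name, requiring at least three characters because 'uniform' and 'undersample' share the prefix 'un') can be sketched as follows. This is an illustrative standalone version, not the module's actual `_validate_and_get_sampling_strategy`, and it raises `ValueError` where the plpy code calls `plpy.error`.

```python
UNIFORM, UNDERSAMPLE, OVERSAMPLE = 'uniform', 'undersample', 'oversample'

def resolve_strategy(value, supported=(UNIFORM, UNDERSAMPLE, OVERSAMPLE),
                     default=UNIFORM):
    """Resolve a possibly-abbreviated class_sizes strategy string."""
    if not value:
        return default
    value = value.strip().lower()
    if len(value) < 3:
        # 'uniform' and 'undersample' share the prefix 'un', so shorter
        # abbreviations would be ambiguous.
        raise ValueError("Sample: Invalid class_sizes parameter")
    matches = [s for s in supported if s.startswith(value)]
    if len(matches) != 1:
        raise ValueError("Sample: Invalid class_sizes parameter")
    return matches[0]
```

The same pattern is what makes inputs like `'undersample '` (with a trailing space, as in the failing install-check call quoted earlier) resolve cleanly after stripping.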