[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

2018-12-22 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/342
  
@reductionista thank you for the comments.
The existing `minibatch_preprocessor` module outputs new columns called 
`dependent_varname` and `independent_varname` instead of the column names from 
the input table. We did the same here purely to conform with what is already in 
the other module. The other module allows expressions as input params (which may 
be the reason behind the different column names in its output table), while this 
module does not explicitly support expressions. So I do agree with your point 
about the output table column names, but I am just not sure how odd it would be 
to have a difference between the two modules. Maybe other folks could also weigh 
in to help us decide. Also, this module (`minibatch_preprocessor_dl`) is in 
early-stage development, so this is a great time to try out options.

Regarding your comment on the ordering of the two input params (`x` and 
`y`):
This follows the convention used in every other MADlib module: the dependent 
variable comes before the independent variable in the input parameter list. If 
you'd like it to be the opposite, it might be a good idea to start a separate 
thread on the community mailing list to discuss it, since changing the order of 
the two variables only in this module would break conformity. BTW, the `2.0` 
release would be a good time to change it, since that release will break 
backward compatibility.


---


[GitHub] madlib pull request #342: Minibatch Preprocessor for Deep learning

2018-12-19 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/342

Minibatch Preprocessor for Deep learning

The minibatch preprocessor we currently have in MADlib is too heavyweight for
DL tasks. This feature adds a simplified way of creating buffers, and
divides each element of the independent array by a normalizing constant
(255.0 by default) for standardization, which is standard practice with
image data.
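As an illustrative sketch (hypothetical Python, not the actual MADlib implementation), the normalization and buffering steps described here amount to:

```python
def normalize(independent_var, normalizing_const=255.0):
    # Divide each element by the normalizing constant; 255.0 is the
    # default because 8-bit image channels range over 0..255.
    return [x / normalizing_const for x in independent_var]

def make_buffers(rows, buffer_size):
    # Pack preprocessed rows into minibatch buffers of at most buffer_size rows.
    return [rows[i:i + buffer_size] for i in range(0, len(rows), buffer_size)]

pixels = [0, 51, 102, 255]
print(normalize(pixels))  # [0.0, 0.2, 0.4, 1.0]
```

Passing a different `normalizing_const` covers non-image data where 255.0 is not the natural scale.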

Co-authored-by: Arvind Sridhar 
Co-authored-by: Domino Valdano 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib deep-learning/minibatch-preprocessor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/342.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #342


commit c983aafcd5e31bab5dbc278178ff9e2e17942ea1
Author: Nandish Jayaram 
Date:   2018-12-18T01:54:42Z

Minibatch Preprocessor for Deep learning

The minibatch preprocessor we currently have in MADlib is too heavyweight for
DL tasks. This feature adds a simplified way of creating buffers, and
divides each element of the independent array by a normalizing constant
(255.0 by default) for standardization, which is standard practice with
image data.

Co-authored-by: Arvind Sridhar 
Co-authored-by: Domino Valdano 




---


[GitHub] madlib pull request #341: Minibatch Preprocessor for Deep learning

2018-12-19 Thread njayaram2
Github user njayaram2 closed the pull request at:

https://github.com/apache/madlib/pull/341


---


[GitHub] madlib pull request #341: Minibatch Preprocessor for Deep learning

2018-12-19 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/341

Minibatch Preprocessor for Deep learning

The minibatch preprocessor we currently have in MADlib is too heavyweight for
DL tasks. This feature adds a simplified way of creating buffers, and
divides each element of the independent array by a normalizing constant
(255.0 by default) for standardization, which is standard practice with
image data.

Co-authored-by: Arvind Sridhar 
Co-authored-by: Domino Valdano 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib deep-learning/minibatch-preprocessor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/341.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #341


commit 78430bc8586ae0256a24de2472392564a15f7f8e
Author: Nandish Jayaram 
Date:   2018-12-18T01:54:42Z

Minibatch Preprocessor for Deep learning

The minibatch preprocessor we currently have in MADlib is too heavyweight for
DL tasks. This feature adds a simplified way of creating buffers, and
divides each element of the independent array by a normalizing constant
(255.0 by default) for standardization, which is standard practice with
image data.

Co-authored-by: Arvind Sridhar 
Co-authored-by: Domino Valdano 




---


[GitHub] madlib pull request #338: Install/Dev check: Add new test cases for some mod...

2018-11-15 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/338

Install/Dev check: Add new test cases for some modules

Some modules such as array_ops and pmml did not have any install check
files, while stemmer did not have any test files. This commit adds some
basic test cases for these modules.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib ic-pmml-stemmer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/338.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #338


commit c351f176b305fb44bd87bc6a4f79c099a3d6fbe3
Author: Nandish Jayaram 
Date:   2018-09-29T00:15:40Z

Install/Dev check: Add new test cases for some modules

Some modules such as array_ops and pmml did not have any install check
files, while stemmer did not have any test files. This commit adds some
basic test cases for these modules.




---


[GitHub] madlib pull request #334: Minibatch Preprocessor: Update online doc

2018-10-23 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/334

Minibatch Preprocessor: Update online doc

The online doc is outdated. This commit adds two new parameters that
have been introduced since the last time the doc was edited.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib doc/minibatch-preprocessor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/334.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #334


commit 7e95fc7d936f25e74ceceb74dfa7473c4eda45c8
Author: Nandish Jayaram 
Date:   2018-10-23T17:35:02Z

Minibatch Preprocessor: Update online doc

The online doc is outdated. This commit adds two new parameters that
have been introduced since the last time the doc was edited.




---


[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...

2018-10-02 Thread njayaram2
Github user njayaram2 closed the pull request at:

https://github.com/apache/madlib/pull/326


---


[GitHub] madlib pull request #327: Upgrade: Fix issue with upgrading RPM to 1.15.1

2018-10-01 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/327

Upgrade: Fix issue with upgrading RPM to 1.15.1

JIRA: MADLIB-1278

During an RPM upgrade, rpm_post.sh runs first, followed by
rpm_post_uninstall.sh, so the uninstall-specific steps must check whether
the current operation is an uninstall or an upgrade. This commit makes the
necessary change to remove symlinks only during uninstallation, and not
while upgrading.
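For context on the check involved: RPM passes its scriptlets an argument indicating the operation in progress, 0 for a full uninstall and 1 or more during an upgrade. A minimal sketch of the resulting decision (hypothetical Python, standing in for the shell logic in rpm_post_uninstall.sh):

```python
def should_remove_symlinks(scriptlet_arg):
    # RPM invokes the post-uninstall scriptlet with an argument:
    #   0  -> the package is being fully uninstalled
    #   >0 -> an upgrade is in progress; the new version stays installed
    return int(scriptlet_arg) == 0

print(should_remove_symlinks(0))  # True  -> plain uninstall, remove symlinks
print(should_remove_symlinks(1))  # False -> upgrading, keep symlinks
```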

Co-authored-by: Domino Valdano 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib rpm-upgrade-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #327


commit 8234449be7d8522aa6d351bf504a573558cad4d3
Author: Nandish Jayaram 
Date:   2018-10-01T21:32:44Z

Upgrade: Fix issue with upgrading RPM to 1.15.1

JIRA: MADLIB-1278

During an RPM upgrade, rpm_post.sh runs first, followed by
rpm_post_uninstall.sh, so the uninstall-specific steps must check whether
the current operation is an uninstall or an upgrade. This commit makes the
necessary change to remove symlinks only during uninstallation, and not
while upgrading.

Co-authored-by: Domino Valdano 




---


[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...

2018-10-01 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/326

Install/Dev check: Add new test cases for some modules

Some modules such as array_ops and pmml did not have any install check
files, while stemmer did not have any test files. This commit adds some
basic test cases for these modules.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib ic-pmml-stemmer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/326.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #326


commit 0928be85ac504ae90ec6a2f2de694cd5aa2440f7
Author: Nandish Jayaram 
Date:   2018-09-29T00:15:40Z

Install/Dev check: Add new test cases for some modules

Some modules such as array_ops and pmml did not have any install check
files, while stemmer did not have any test files. This commit adds some
basic test cases for these modules.




---


[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...

2018-09-07 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/315
  
@fmcquillan99 thanks for testing this out. I can have a look at this issue.


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-05 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r215464110
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
--- End diff --

@fmcquillan99 yes, I was looking at the alternatives. @hpandeycodeit I can 
take care of this. I will push a commit to this PR (you may have to merge it, 
since I am not a co-author on your branch).


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214485090
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
--- End diff --

`point_column_name` and `test_column_name` params must be arrays, as this `if` 
check suggests. If either is not an array, the call fails further down when the 
distance function (such as `squared_dist_norm2`) is invoked.

I don't think the function `is_var_valid()` checks that these are arrays. You 
may have to check them after the new asserts, using a new helper function 
(`is_col_array()` cannot be used as-is for expressions, and `is_var_valid()` 
checks only the validity of the expression, not whether it is an array).
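One possible shape for such a helper, sketched here with an injected `execute` callable standing in for `plpy.execute` (the helper name is hypothetical; the query leans on PostgreSQL's `pg_typeof`, whose text form ends in `[]` for array types):

```python
def is_expr_array(execute, source_table, expr):
    # Evaluate the expression on a single row and ask PostgreSQL whether
    # its result type is an array: pg_typeof(...)::text ends in '[]'.
    rows = execute(
        "SELECT pg_typeof({0})::text LIKE '%[]' AS is_array "
        "FROM {1} LIMIT 1".format(expr, source_table))
    return bool(rows[0]["is_array"])
```

Inside the module this would be called as `is_expr_array(plpy.execute, point_source, point_column_name)`; passing the executor in keeps the sketch testable outside the database.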


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214487281
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(is_var_valid(point_source, point_column_name),
+"kNN error: Invalid point_column_name expression")
--- End diff --

Can we change the error message to indicate invalid column name or 
expression? How about 
`"kNN error: {0} is an invalid column name or expression for 
point_column_name param".format(point_column_name)`?


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214484071
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
--- End diff --

`point_id` and `test_id` params are not validated anymore. This should still 
be done, since the new asserts only check that `point_column_name` and 
`test_column_name` are valid expressions. The validation is required to catch 
invalid values such as `NULL`, `''`, or a non-existent column name for the 
`point_id` and `test_id` params.


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214485621
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
--- End diff --

I think we will need to keep these asserts too, since they check a different 
condition. If the helper function `array_col_has_no_null()` does not work for 
expressions, please go ahead and change it, or create a new helper function 
that handles both expressions and column names.
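A sketch of what an expression-capable NULL check could look like (hypothetical helper; `execute` stands in for `plpy.execute`, and `array_position` assumes PostgreSQL 9.5+):

```python
def array_expr_has_no_null(execute, source_table, expr):
    # array_position(arr, NULL) returns the index of the first NULL element,
    # or NULL when the array has none, so this counts the offending rows.
    rows = execute(
        "SELECT count(*) AS cnt FROM {1} "
        "WHERE array_position({0}, NULL) IS NOT NULL".format(expr, source_table))
    return int(rows[0]["cnt"]) == 0
```

Because it only needs the expression to be valid SQL, it works for both plain column names and expressions.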


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214487426
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(is_var_valid(point_source, point_column_name),
+"kNN error: Invalid point_column_name expression")
+
+_assert(is_var_valid(test_source, test_column_name),
+"kNN error: Invalid test_column_name expression")
--- End diff --

Similar to the above comment, can we change the error message to:
`"kNN error: {0} is an invalid column name or expression for 
test_column_name param".format(test_column_name)`?


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214482318
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id,
 SELECT test.{test_id} AS {test_id_temp},
 train.{point_id} as train_id,
 {fn_dist}(
-train.{point_column_name},
-test.{test_column_name})
+p_col_name,
+t_col_name)
 AS dist
 {label_out}
-FROM {point_source} AS train,
- {test_source} AS test
+FROM
+(
+SELECT {point_id} , {point_column_name} as p_col_name , {label_column_name} from {point_source}
+) train,
+(
+SELECT {test_id} ,{test_column_name} as t_col_name from {test_source}
+) test
--- End diff --

Can you please use variables with unique strings for `train`, `test`, 
`p_col_name`, and `t_col_name`? If the train or test table happens to be named 
any of those, the query would likely fail.
While you are at it, could you also do the same for the other identifiers in 
this query: `train_id`, `r`, `dist_inverse`, and any others I may have missed?
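A sketch of how such collision-proof aliases can be generated; MADlib's `utilities` module provides a `unique_string` helper along these lines (this version is illustrative, not the actual implementation):

```python
import uuid

def unique_string(desp=''):
    # Build an identifier that is very unlikely to collide with any user
    # table or column name; the optional tag keeps generated SQL readable.
    return "__madlib_temp_{0}{1}__".format(desp, uuid.uuid4().hex[:12])

train_alias = unique_string('train_')
test_alias = unique_string('test_')
```

The query template would then interpolate these aliases instead of the literal names `train` and `test`.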


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r213806554
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(point_column_name, "KNN error: Invalid point_column_name expression")
+
+_assert(test_column_name, "KNN error: Invalid test_column_name expression")
--- End diff --

`KNN` in the error message is different from `kNN` used in other error 
messages (capital `K`). Please keep it consistent as `kNN`.


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r213791885
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(point_column_name, "KNN error: Invalid point_column_name expression")
+
+_assert(test_column_name, "KNN error: Invalid test_column_name expression")
--- End diff --

Since the original asserts are removed, the function no longer exits gracefully 
when given incorrect param values. You may have to use the function 
`is_var_valid()` in `validate_args.py_in` to validate `point_column_name` and 
`test_column_name`.


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r213792935
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id,
 SELECT test.{test_id} AS {test_id_temp},
 train.{point_id} as train_id,
 {fn_dist}(
-train.{point_column_name},
-test.{test_column_name})
+train.{point_col_name_temp},
+test.{test_col_name_temp})
 AS dist
 {label_out}
-FROM {point_source} AS train,
- {test_source} AS test
+FROM
+ {point_source_temp_table} as train,
+ {test_source_temp_table} as test
--- End diff --

Please use subqueries, instead of tables:
```
(select {point_id} , {point_column_name} as {point_col_name_temp} , {label_column_name} from {point_source}) train,
(select {test_id}, {test_column_name} as {test_col_name_temp} from {test_source}) test
```


---


[GitHub] madlib issue #314: Ubuntu support: Enable creation of gppkg on Ubuntu

2018-08-27 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/314
  
Thank you for the comments @reductionista , I have updated the comment. 
Please do have a look at it.


---


[GitHub] madlib pull request #314: Ubuntu support: Enable creation of gppkg on Ubuntu

2018-08-22 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/314

Ubuntu support: Enable creation of gppkg on Ubuntu

This commit makes the necessary changes to create a gppkg on Ubuntu. The
default behavior when MADlib is built on Ubuntu is to create a .deb
installer. If we want a gppkg, we need an RPM due to limitations in gppkg,
so we now create an RPM on Ubuntu (assuming the alien package is installed)
when the right cmake flag is specified. Once `make package` has produced
the RPM, the gppkg can be created with `make gppkg`.
The cmake flag that makes `make package` produce an .rpm instead of a .deb
on Ubuntu is:
-DCREATE_RPM_FOR_UBUNTU=True

Co-authored-by: Orhan Kislal 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib ubuntu-gppkg-support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #314


commit bd85bae33473e697d7c26fdac9ee3253ff9a82b3
Author: Nandish Jayaram 
Date:   2018-08-06T22:19:18Z

Ubuntu support: Enable creation of gppkg on Ubuntu

This commit makes the necessary changes to create a gppkg on Ubuntu. The
default behavior when MADlib is built on Ubuntu is to create a .deb
installer. If we want a gppkg, we need an RPM due to limitations in gppkg,
so we now create an RPM on Ubuntu (assuming the alien package is installed)
when the right cmake flag is specified. Once `make package` has produced
the RPM, the gppkg can be created with `make gppkg`.
The cmake flag that makes `make package` produce an .rpm instead of a .deb
on Ubuntu is:
-DCREATE_RPM_FOR_UBUNTU=True

Co-authored-by: Orhan Kislal 




---


[GitHub] madlib pull request #295: Recursive Partitioning: Add function to report imp...

2018-07-23 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/295#discussion_r204593622
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/random_forest.py_in ---
@@ -1564,69 +1578,69 @@ def get_var_importance(schema_madlib, model_table, 
output_table, **kwargs):
 """
 # Validate parameters
 summary_table = add_postfix(model_table, "_summary")
-_validate_var_importance_input(model_table, summary_table, 
output_table)
-grouping_cols = plpy.execute(
-"SELECT grouping_cols FROM {summary_table}".
-format(**locals()))[0]['grouping_cols']
-grouping_cols_comma = ''
-if grouping_cols :
-grouping_cols_comma = add_postfix(grouping_cols,", ")
-is_RF = _is_model_for_RF(summary_table)
+_validate_var_importance_input(model_table,
+   summary_table,
+   output_table)
+is_RF = _is_random_forest_model(summary_table)
+
+# 'importance_model_table' is the table containing variable importance values.
+# For RF, it is placed in <model_table>_group as opposed to
+# <model_table> for DT.
+importance_model_table = (model_table if not is_RF else
+  add_postfix(model_table, "_group"))
+grouping_cols = plpy.execute("SELECT grouping_cols FROM 
{summary_table}".
+ format(**locals()))[0]['grouping_cols']
+if grouping_cols:
+grouping_cols_comma = add_postfix(grouping_cols, ", ")
+else:
+grouping_cols_comma = ''
+is_impurity_imp_col_present = _is_impurity_importance_in_model(
+importance_model_table, summary_table, is_RF=is_RF)
+
+# convert importance to percentages
+normalization_target = 100.0
+
+def _unnest_normalize(input_array_str):
+return ("""
+unnest({0}.normalize_sum_array({1}::double precision[],
+   {2}::double precision))
+""".format(schema_madlib, input_array_str, 
normalization_target))
+
 if is_RF:
-group_table = add_postfix(model_table, "_group")
-is_impurity_imp_col_present = 
_is_impurity_importance_in_group_table(
-group_table, summary_table)
-# The group table for >= 1.15 RF models should have a column named
-# impurity_var_importance if it was learnt with importance param 
True.
-# So set is_pre_1_15_RF_model to False if the column exists, and to
-# True if the column does not exist.
-is_pre_1_15_RF_model = False if is_impurity_imp_col_present else 
True
-
-# Query to add oob variable importance for categorical vars
-if is_pre_1_15_RF_model:
-# In < 1.15 RF model, the variable importance was captured 
using two
-# different columns named 'cat_var_importance'
-plpy.execute(
-""" CREATE TABLE {output_table} AS
--- Add oob variable importance for categorical vars
-SELECT {grouping_cols_comma}
-unnest(regexp_split_to_array(cat_features, ',')) 
AS feature,
-unnest(cat_var_importance) AS var_importance
-FROM {group_table}, {summary_table}
-UNION
--- Add oob variable importance for continuous vars
-SELECT {grouping_cols_comma}
-unnest(regexp_split_to_array(con_features, ',')) 
AS feature,
-unnest(con_var_importance) AS var_importance
-FROM {group_table}, {summary_table}
-""".format(**locals()))
+if is_impurity_imp_col_present:
+# In versions >= 1.15, the OOB variable importance is captured
+# in a single column: 'oob_var_importance'.
+oob_var_importance_str = (
+"{0} AS oob_var_importance,".
+format(_unnest_normalize('oob_var_importance')))
+impurity_var_importance_str = (
+"{0} AS impurity_var_importance".
+format(_unnest_normalize('impurity_var_importance')))
 else:
-# In >= 1.15 RF models, the variable importance and impurity
-# importance scores are captured in columns 
'oob_var_importance'
-# and 'impurity_var_importance' respectively.
-plpy.execute(
-  

---

[GitHub] madlib pull request #291: Feature: Vector-Column Transformations

2018-07-23 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/291#discussion_r204589559
  
--- Diff: src/ports/postgres/modules/utilities/transform_vec_cols.py_in ---
@@ -0,0 +1,513 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+from control import MinWarning
+from internal.db_utils import is_col_1d_array
+from internal.db_utils import quote_literal
+from utilities import _assert
+from utilities import add_postfix
+from utilities import ANY_ARRAY
+from utilities import is_psql_boolean_type
+from utilities import is_psql_char_type
+from utilities import is_psql_numeric_type
+from utilities import is_valid_psql_type
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from validate_args import is_var_valid
+from validate_args import get_cols
+from validate_args import get_cols_and_types
+from validate_args import get_expr_type
+from validate_args import input_tbl_valid
+from validate_args import output_tbl_valid
+from validate_args import table_exists
+
+class vec_cols_helper:
+def __init__(self):
+self.all_cols = None
+
+def get_cols_as_list(self, cols_to_process, source_table=None, exclude_cols=None):
+"""
+Get a list of columns based on the value of cols_to_process
+Args:
+@param cols_to_process: str, Either a * or a comma-separated list of col names
+@param source_table: str, optional. Source table name
+@param exclude_cols: str, optional. Comma-separated list of the col(s) to exclude
+ from the source table, only used if cols_to_process is *
+Returns:
+A list of column names (or an empty list)
+"""
+# If cols_to_process is empty/None, return empty list
+if not cols_to_process:
+return []
+if cols_to_process.strip() != "*":
+# If cols_to_process is a comma separated list of names, return list
+# of column names in cols_to_process.
+return [col for col in split_quoted_delimited_str(cols_to_process)
+if col not in split_quoted_delimited_str(exclude_cols)]
+if source_table:
+if not self.all_cols:
+self.all_cols = get_cols(source_table)
+return [col for col in self.all_cols
+if col not in split_quoted_delimited_str(exclude_cols)]
+return []
+
+def get_type_class(self, arg):
+if is_psql_numeric_type(arg):
+return "double precision"
+elif is_psql_char_type(arg):
+return "text"
+else:
+return arg
+
+class vec2cols:
+def __init__(self):
+self.get_cols_helper = vec_cols_helper()
+self.module_name = self.__class__.__name__
+
+def validate_args(self, source_table, output_table, vector_col, feature_names,
+  cols_to_output):
+"""
+Validate args for vec2cols
+"""
+input_tbl_valid(source_table, self.module_name)
+output_tbl_valid(output_table, self.module_name)
+is_var_valid(source_table, cols_to_output)
+is_var_valid(source_table, vector_col)
+_assert(is_valid_psql_type(get_expr_type(vector_col, source_table), ANY_ARRAY),
+"{0}: vector_col should refer to an array.".format(self.module_name))
+_assert(is_col_1d_array(source_table, vector_col),
+"{0}: vector_col must be a 1-dimensional array.".format(self.module_name))
+
+def get_names_for_split_output_cols(self, source_table, vector_col, feature_names):
+"&

[GitHub] madlib pull request #297: Madpack: Fix various schema related bugs

2018-07-23 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/297#discussion_r204572428
  
--- Diff: src/madpack/madpack.py ---
@@ -995,19 +996,20 @@ def run_unit_tests(args, testcase):
 Run unit tests.
 """
 if not _is_madlib_installation_valid_for_tests(args['schema'],
-   args['db_madlib_ver']):
+   args['db_madlib_ver'],
+   'unit-tests'):
 return
 info_(this, "> Running unit-test scripts for:", verbose)
 modset = _get_modset_for_tests(testcase, 'test_')
 # Loop through all modules and run unit tests
-_process_py_sql_files_in_modules(modset, {'madpack_cmd':'unit-test'})
+_process_py_sql_files_in_modules(modset, {'madpack_cmd':'Unit-test'})
--- End diff --

Changing `unit-test` to `Unit-test` causes a failure when we try to run madpack with the `unit-test` option. It's probably because the check for the madpack command in function `_process_py_sql_files_in_modules` is case sensitive.


---
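The case-sensitivity pitfall described above is easy to reproduce in isolation. The sketch below is hypothetical dispatch logic (the names `madpack_cmd` and the `'unit-test'` literal follow the comment; this is not MADlib's actual code), showing why a capitalized command silently falls through a case-sensitive string comparison, and how normalizing case avoids it:

```python
# Hypothetical command dispatch mimicking the bug described in the comment.

def dispatch(madpack_cmd):
    # Case-sensitive check: 'Unit-test' does not match and is silently skipped.
    if madpack_cmd == 'unit-test':
        return 'running unit tests'
    return 'skipped'

def dispatch_normalized(madpack_cmd):
    # Lower-casing first makes the check robust to 'Unit-test'.
    if madpack_cmd.lower() == 'unit-test':
        return 'running unit tests'
    return 'skipped'

print(dispatch('Unit-test'))             # skipped
print(dispatch_normalized('Unit-test'))  # running unit tests
```

Either normalizing the command once at the parsing boundary or comparing against a single canonical constant would prevent this class of regression.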


[GitHub] madlib issue #295: Recursive Partitioning: Add function to report importance...

2018-07-19 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/295
  
@fmcquillan only impurity, I don't think we scale oob to 100.


---


[GitHub] madlib pull request #291: Feature: Vector to Columns

2018-07-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/291#discussion_r203890181
  
--- Diff: src/ports/postgres/modules/utilities/transform_vec_cols.py_in ---
@@ -0,0 +1,492 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+from control import MinWarning
+from internal.db_utils import is_col_1d_array
+from internal.db_utils import quote_literal
+from utilities import _assert
+from utilities import add_postfix
+from utilities import ANY_ARRAY
+from utilities import is_valid_psql_type
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from validate_args import is_var_valid
+from validate_args import get_cols
+from validate_args import get_expr_type
+from validate_args import input_tbl_valid
+from validate_args import output_tbl_valid
+from validate_args import table_exists
+
+class vec_cols_helper:
+def __init__(self):
+self.all_cols = None
+
+def get_cols_as_list(self, cols_to_process, source_table=None, exclude_cols=None):
+"""
+Get a list of columns based on the value of cols_to_process
+Args:
+@param cols_to_process: str, Either a * or a comma-separated list of col names
+@param source_table: str, optional. Source table name
+@param exclude_cols: str, optional. Comma-separated list of the col(s) to exclude
+ from the source table, only used if cols_to_process is *
+Returns:
+A list of column names (or an empty list)
+"""
+# If cols_to_process is empty/None, return empty list
+if not cols_to_process:
+return []
+if cols_to_process.strip() != "*":
+# If cols_to_process is a comma separated list of names, return list
+# of column names in cols_to_process.
+return [col for col in split_quoted_delimited_str(cols_to_process)
+if col not in split_quoted_delimited_str(exclude_cols)]
+if source_table:
+if not self.all_cols:
+self.all_cols = get_cols(source_table)
+return [col for col in self.all_cols
+if col not in split_quoted_delimited_str(exclude_cols)]
+return []
+
+class vec2cols:
+def __init__(self):
+self.get_cols_helper = vec_cols_helper()
+self.module_name = self.__class__.__name__
+
+def validate_args(self, source_table, output_table, vector_col, feature_names,
+  cols_to_output):
+"""
+Validate args for vec2cols
+"""
+input_tbl_valid(source_table, self.module_name)
+output_tbl_valid(output_table, self.module_name)
+# cols_to_validate = self.get_cols_helper.get_cols_as_list(cols_to_output) + [vector_col]
--- End diff --

Guess we can remove this commented line.


---


[GitHub] madlib issue #294: Pagerank: Remove duplicate entries from grouping output

2018-07-16 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/294
  
Thank you for the comments @jingyimei , have pushed a commit with a new 
dev-check test.


---


[GitHub] madlib pull request #290: madpack: Add madpack option to run unit tests.

2018-07-16 Thread njayaram2
Github user njayaram2 closed the pull request at:

https://github.com/apache/madlib/pull/290


---


[GitHub] madlib pull request #294: Pagerank: Remove duplicate entries from grouping o...

2018-07-16 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/294

Pagerank: Remove duplicate entries from grouping output

JIRA: MADLIB-1229
JIRA: MADLIB-1253

Fixes the missing output for complete graphs bug as well.

Co-authored-by: Nandish Jayaram 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bug/pagerank-dup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/294.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #294


commit 1b55acac9d5550e0a74fa46ec0ab4842d089ac1c
Author: Orhan Kislal 
Date:   2018-07-14T00:09:11Z

Pagerank: Remove duplicate entries from grouping output

JIRA: MADLIB-1229
JIRA: MADLIB-1253

Fixes the missing output for complete graphs bug as well.

Co-authored-by: Nandish Jayaram 




---


[GitHub] madlib issue #290: madpack: Add madpack option to run unit tests.

2018-07-12 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/290
  
Thank you for the comments @iyerr3 , will do the needful.


---


[GitHub] madlib pull request #290: madpack: Add madpack option to run unit tests.

2018-07-11 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/290

madpack: Add madpack option to run unit tests.

JIRA: MADLIB-1252

Unit tests in MADlib are written in Python files located in
the .../test/unit_tests/ folders; the file names begin with
the prefix "test_". This commit adds a new madpack option to run unit
tests, similar to how we run install and dev checks.

- The new option added is: `unit-test`.
- Sample usage (on a postgres database, with MADlib installed on
  database `madlib`):
  * Run unit tests on all modules that have it defined:
  src/bin/madpack -p postgres -c /madlib unit-test
  * Run unit tests only for the `convex` module:
  src/bin/madpack -p postgres -c /madlib unit-test -t convex
  * Run unit tests only for the `convex` and decision trees module:
  src/bin/madpack -p postgres -c /madlib unit-test -t
  convex,recursive_partitioning/decision_tree
- Add command to run all unit tests in Jenkins build script.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib madpack/unit-test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #290


commit 5f4395bc8548a5f74ba9a8fcb716f2a39ec51162
Author: Nandish Jayaram 
Date:   2018-07-11T00:32:24Z

madpack: Add madpack option to run unit tests.

JIRA: MADLIB-1252

Unit tests in MADlib are written in Python files located in
the .../test/unit_tests/ folders; the file names begin with
the prefix "test_". This commit adds a new madpack option to run unit
tests, similar to how we run install and dev checks.

- The new option added is: `unit-test`.
- Sample usage (on a postgres database, with MADlib installed on
  database `madlib`):
  * Run unit tests on all modules that have it defined:
  src/bin/madpack -p postgres -c /madlib unit-test
  * Run unit tests only for the `convex` module:
  src/bin/madpack -p postgres -c /madlib unit-test -t convex
  * Run unit tests only for the `convex` and decision trees module:
  src/bin/madpack -p postgres -c /madlib unit-test -t
  convex,recursive_partitioning/decision_tree
- Add command to run all unit tests in Jenkins build script.




---


[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

2018-07-05 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/288#discussion_r200444598
  
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+
+def validate_cols2vec_args(source_table, output_table, list_of_features,
+   list_of_features_to_exclude, cols_to_output, **kwargs):
+"""
+Function to validate input parameters
+"""
+if list_of_features.strip() != '*':
+if not (list_of_features and list_of_features.strip()):
+plpy.error("Features to include is empty")
+
+if list_of_features.strip() != '*':
+if not columns_exist_in_table(
+source_table, split_quoted_delimited_str(list_of_features)):
+plpy.error(
+"Invalid columns to list_of_features ({0})".format(list_of_features))
+
+if cols_to_output and cols_to_output.strip() != '*':
+if not columns_exist_in_table(
+source_table, _string_to_array(cols_to_output)):
+plpy.error("Invalid columns to output list ({0})".
+   format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+"""
+Args:
+@param schema_madlib:   Name of MADlib schema
+@param model:   Name of table containing the tree model
+@param source_table:Name of table containing prediction data
+@param output_table:Name of table to output the results
+@param list_of_features:Comma-separated string of column names or
+expressions to put into feature array.
+Can also be a '*' implying all columns
+are to be put into feature array.
+@param list_of_features_to_exclude: Comma-separated string of column names
+to exclude from the feature array
+@param cols_to_output:  Comma-separated string of column names
+from the source table to keep in the output table,
+in addition to the feature array.
+
+Returns:
+None
+
+"""
+
+with MinWarning('warning'):
+validate_cols2vec_args(source_table, output_table, list_of_features,
+   list_of_features_to_exclude, cols_to_output, **kwargs)
+
+all_cols = ''
+feature_cols = ''
+if list_of_features.strip() == '*':
+all_cols = get_cols(source_table, schema_madlib)
+all_col_set = set(list(all_cols))
+exclude_set = set(split_quoted_delimited_str(
+list_of_features_to_exclude))
+feature_set = all_col_set - exclude_set
+feature_cols = py_list_to_sql_string(
+list(feature_set), "text", False)
+filtered_list_of_features = ",".join(
+feat for feat in list(feature_set))
+else:
+feature_list = split_quoted_delimited_str(list_of_features)
+feature_exclude = split_quoted_delimited_str(
+list_of_features_to_exclude)
+return_set = set(feature_list) - set(feature_exclude)
+feature_cols = py_list_to_sql_string(
+list(return_set), "text", False)
+filtered_list_of_features = ",".join(
+feat for feat in feature_list if feat in return_set)
+
+output_cols = ''
+if cols_to_output:
+output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
+

[GitHub] madlib issue #284: SVM: Fix flaky dev-check failure

2018-06-27 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/284
  
Thank you for the comment @iyerr3 , will relax the constraint as suggested.


---


[GitHub] madlib pull request #284: SVM: Fix flaky dev-check failure

2018-06-27 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/284

SVM: Fix flaky dev-check failure

JIRA: MADLIB-1232

SVM has a dev-check query that is flaky on a large cluster. This commit
relaxes the assert condition for that query.

Closes #284

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/svm-flaky-dev-check

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/284.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #284


commit 9c556816bcca990e9b168cf556ce0da0cacf935a
Author: Nandish Jayaram 
Date:   2018-06-27T19:40:19Z

SVM: Fix flaky dev-check failure

JIRA: MADLIB-1232

SVM has a dev-check query that is flaky on a large cluster. This commit
relaxes the assert condition for that query.

Closes #284




---


[GitHub] madlib issue #283: Bugfix: Fix failing dev check in CRF

2018-06-27 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/283
  
Thank you for the comments @kaknikhil . I moved out the jenkins build 
script to a different commit.


---


[GitHub] madlib pull request #283: Bugfix: Fix failing dev check in CRF

2018-06-27 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/283

Bugfix: Fix failing dev check in CRF

This commit has the following changes:
- A couple of dev check files in CRF did not have the label table creation
in them. But the label table was consumed by one of the queries, which led
to dev-check failure.
- Run dev check on Jenkins build instead of install check.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/crf-dev-check

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/283.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #283


commit deb206b7d1c1e7ce87d6e33c7a1dff91b3adb82b
Author: Nandish Jayaram 
Date:   2018-06-27T18:25:46Z

Bugfix: Fix failing dev check in CRF

This commit has the following changes:
- A couple of dev check files in CRF did not have the label table creation
in them. But the label table was consumed by one of the queries, which led
to dev-check failure.
- Run dev check on Jenkins build instead of install check.




---


[GitHub] madlib pull request #277: DT: Add impurity importance metric

2018-06-18 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/277#discussion_r196150565
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features,
  "$3", "$2",
  null_proxy)
 
-# The arguments of the aggregate (in the same order):
-# 1. current tree state, madlib.bytea8
-# 2. categorical features (integer format) in a single array
-# 3. continuous features in a single array
-# 4. weight value
-# 5. categorical sorted levels (integer format) in a combined array
-# 6. continuous splits
-# 7. number of dependent levels
 train_sql = """
 SELECT (result).* from (
 SELECT
-{schema_madlib}._dt_apply($1,
+{schema_madlib}._dt_apply(
+$1,
 {schema_madlib}._compute_leaf_stats(
-$1,
-{cat_features_str},
-{con_features_str},
+$1,  -- current tree state, madlib.bytea8
+{cat_features_str},  -- categorical features in an array
+{con_features_str},  -- continuous features in an array
 {dep_var},
-{weights},
-$2,
-$4,
-{dep_n_levels}::smallint,
-{subsample}::boolean
+{weights},   -- weight value
+$2,  -- categorical sorted levels in a combined array
+$4,  -- continuous splits
+{dep_n_levels}::smallint, -- number of dependent levels
+{subsample}::boolean  -- should we use a subsample of data
--- End diff --

Oh okay, thank you. I think a comment will be useful.


---


[GitHub] madlib issue #276: Feature/dev check

2018-06-15 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/276
  
Thank you for the comments @iyerr3 , will make the changes you have 
requested.

Having one IC file for each module makes sense, but on Greenplum, for some 
modules the IC run time is still quite high. For example, if a module had 5 IC 
files, where each ran for 10 seconds, the user would see IC progressing every 
10 seconds. But if those were combined into one IC file, the progression would 
happen after 50 seconds. It seemed a little odd (longer IC run times for some 
modules in terms of user experience) when we were trying it out, so decided to 
keep multiple IC files for such modules.


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-06-08 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r194151705
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +122,26 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. 
momentum must be in the range 0 to 1, where 0 means no momentum.
+The velocity is the same size as the coefficient and is accumulated in the 
direction of persistent reduction, which speeds up the optimization. The 
momentum value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum we first correct the velocity, and then update the 
model with that velocity, whereas in Nesterov momentum, we first move the model 
in the direction of momentum*velocity , then correct the velocity and finally 
use the updated model to calculate the gradient. The main difference being that 
in classical momentum, we compute the gradient before updating the model 
whereas in nesterov we first update the model and then compute the gradient 
from the updated position.\\
--- End diff --

`momentum*velocity ,` -> `momentum*velocity,`. The extra space before the 
comma is moving the `,` to the next line in the pdf.


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-06-08 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r194151826
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -196,17 +221,28 @@ \subsubsection{The $\mathit{Gradient}$ Function}
 \end{algorithmic}
 \end{algorithm}
 
-\begin{algorithm}[mlp-train-iteration$(X, Y, \eta)$] \label{alg:mlp-train-iteration}
+\begin{algorithm}[mlp-train-iteration$(X, Y, \eta, mu)$] \label{alg:mlp-train-iteration}
--- End diff --

`mu` -> `\mu`


---


[GitHub] madlib pull request #275: Madpack: Fix error with dropping user after IC fai...

2018-06-05 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/275

Madpack: Fix error with dropping user after IC failure.

JIRA: MADLIB-1182

Previously, when install check did not fail gracefully, the user created
by madpack hung around and disturbed IC attempts within other databases.
We fixed this by:
1) Renaming the test user using the specific database that the IC test
was run from, making the test user name database-specific.
2) Dropping schema and user in a try-finally block, so that it's executed 
even
on a failed IC run.

Co-authored-by: Arvind Sridhar 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/madpack-ic-user

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #275


commit d817973e78fa419dd7b83b077049c8c6b1911911
Author: Nandish Jayaram 
Date:   2018-06-05T19:08:25Z

Madpack: Fix error with dropping user after IC failure.

JIRA: MADLIB-1182

Previously, when install check did not fail gracefully, the user created
by madpack hung around and disturbed IC attempts within other databases.
We fixed this by:
1) Renaming the test user using the specific database that the IC test
was run from, making the test user name database-specific.
2) Dropping schema and user in a try-finally block, so that it's executed 
even
on a failed IC run.

Co-authored-by: Arvind Sridhar 




---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192246910
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. 
momentum must be in the range 0 to 1, where 0 means no momentum. The momentum 
value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that 
velocity, whereas in Nesterov momentum you first step in the velocity direction 
then make a correction to the velocity vector based on the new location. \\
+Classic momentum update
+\[\begin{aligned}
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
+\mathit{u} \set \mathit{u} + \mathit{v} \\
+\end{aligned}\]
+
+Nesterov momentum update
+\[\begin{aligned}
+\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ (nesterov's initial coefficient update)} \\
+\mathit{v} \set \mathit{mu} * \mathit{v} -  \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
+\mathit{u} \set \mathit{u} - \eta * \mathit{gradient} \\
--- End diff --

Can we left-align these and other momentum related equations?


---
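The two update rules quoted in the diff above can be exercised directly. The sketch below minimizes f(u) = u² (gradient 2u) in one dimension; the variable names `mu` (momentum), `eta` (learning rate), and `v` (velocity) mirror the LaTeX, while the quadratic objective and the constants are illustrative choices, not taken from the PR:

```python
# Classic momentum: correct the velocity from the gradient at the current
# position, then step with that velocity.
def classic_momentum(u, steps=200, mu=0.9, eta=0.1):
    v = 0.0
    for _ in range(steps):
        v = mu * v - eta * (2.0 * u)  # velocity update
        u = u + v                     # coefficient update
    return u

# Nesterov momentum: first move in the momentum direction, then compute the
# gradient at the looked-ahead position before correcting velocity and model.
def nesterov_momentum(u, steps=200, mu=0.9, eta=0.1):
    v = 0.0
    for _ in range(steps):
        u = u + mu * v   # nesterov's initial coefficient update
        g = 2.0 * u      # gradient at the looked-ahead position
        v = mu * v - eta * g  # velocity update
        u = u - eta * g       # final coefficient update
    return u

# Both runs drive u toward the minimum at 0 from a start of 1.0.
print(classic_momentum(1.0), nesterov_momentum(1.0))
```

This makes the distinction in the review comment concrete: classic momentum computes the gradient before applying the velocity, Nesterov after.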


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192246463
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. 
momentum must be in the range 0 to 1, where 0 means no momentum. The momentum 
value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
--- End diff --

We only describe the range for momentum, but nothing about velocity before 
the next line begins. Can we say a bit about what velocity is before we 
describe how it is used?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192251198
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
--- End diff --

Please add your name to the list of authors in the beginning of the file. 
Maybe add the history for this file as well?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192245605
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. 
momentum must be in the range 0 to 1, where 0 means no momentum. The momentum 
value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that 
velocity, whereas in Nesterov momentum you first step in the velocity direction 
then make a correction to the velocity vector based on the new location. \\
--- End diff --

`step with that velocity` is a little confusing to me. Do we have some 
source where it is defined this way?
If it's any better, can we use the following text to say what the 
difference between momentum and NAG is (source is 
http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf):
```
... the key difference between momentum and Nesterov’s
accelerated gradient is that momentum computes the gradient before applying 
the velocity, while Nesterov’s
accelerated gradient computes the gradient after doing so.
```
If `step with that velocity` is a standard way of defining it, then I am 
okay with it.

This comment applies to user and online docs too.


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192248589
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. 
momentum must be in the range 0 to 1, where 0 means no momentum. The momentum 
value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that 
velocity, whereas in Nesterov momentum you first step in the velocity direction 
then make a correction to the velocity vector based on the new location. \\
+Classic momentum update
+\[\begin{aligned}
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} 
\text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
+\mathit{u} \set \mathit{u} + \mathit{v} \\
+\end{aligned}\]
+
+Nesterov momentum update
+\[\begin{aligned}
+\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ 
(Nesterov's initial coefficient update)} \\
+\mathit{v} \set \mathit{mu} * \mathit{v} -  \eta * \mathit{gradient} 
\text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
--- End diff --

Instead of `gradient`, can we replace it with the actual math notation?
I think the gradient at this line is: `\frac{\partial f}{\partial 
u_{k-1}^{sj}}`
The gradient in the next line should be w.r.t `v` instead of `u`.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192186350
  
--- Diff: src/madpack/madpack.py ---
@@ -131,10 +141,73 @@ def _get_relative_maddir(maddir, port):
 return maddir
 # 
--
 
+def _cleanup_comments_in_sqlfile(output_filename, upgrade):
+"""
+@brief Remove comments in the sql script
+"""
+if not upgrade:
+with open(output_filename, 'r+') as output_filehandle:
+full_sql = output_filehandle.read()
+pattern = 
re.compile(r"""(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))""")
+res = ''
+lines = re.split(r'[\r\n]+', full_sql)
+for line in lines:
+tmp = line
+if not tmp.strip().startswith("E'"):
+line = re.sub(pattern, '', line)
+res += line + '\n'
+full_sql = res.strip()
+full_sql = re.sub(pattern, '', full_sql).strip()
+# Re-write the cleaned-up sql to a new file. Python does not let us
+# erase all the content of a file and rewrite the same file again.
+cleaned_output_filename = output_filename+'.tmp'
+with open(cleaned_output_filename, 'w') as output_filehandle:
+_write_to_file(output_filehandle, full_sql)
+# Move the cleaned output file to the old one.
+os.rename(cleaned_output_filename, output_filename)
+
+def _run_m4_and_append(schema, maddir_mod_py, module, sqlfile,
+   output_filehandle, pre_sql=None):
+"""
+Function to process a sql file with M4.
+"""
+# Check if the SQL file exists
+if not os.path.isfile(sqlfile):
+error_(this, "Missing module SQL file (%s)" % sqlfile, False)
--- End diff --

We removed the error message from `ValueError`, while we still print the 
message from `error_()`.
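
As a side note, the comment-stripping logic from the diff can be exercised in isolation (a simplified sketch; the real `_cleanup_comments_in_sqlfile` also rewrites the file through a `.tmp` rename, which is omitted here):

```python
import re

# Mirrors the pattern in the diff: strips /* ... */ block comments and
# "--" line comments from SQL text. The per-line pass skips lines that
# begin with an escaped-string literal (E'...'); a second global pass
# then removes block comments that span multiple lines.
COMMENT_PATTERN = re.compile(r"(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))")

def strip_sql_comments(full_sql):
    cleaned_lines = []
    for line in re.split(r'[\r\n]+', full_sql):
        if not line.strip().startswith("E'"):
            line = re.sub(COMMENT_PATTERN, '', line)
        cleaned_lines.append(line)
    res = '\n'.join(cleaned_lines).strip()
    return re.sub(COMMENT_PATTERN, '', res).strip()

demo = strip_sql_comments("SELECT 1; -- trailing\n/* block */ SELECT 2;")
```

Here `demo` keeps both statements but drops both comment styles.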


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192174800
  
--- Diff: src/madpack/upgrade_util.py ---
@@ -1299,18 +1303,19 @@ def _clean_function(self):
 pattern = re.compile(r"""CREATE(\s+)FUNCTION""", re.DOTALL | 
re.IGNORECASE)
 self._sql = re.sub(pattern, 'CREATE OR REPLACE FUNCTION', 
self._sql)
 
-def cleanup(self, sql):
+def cleanup(self, sql, algoname):
--- End diff --

Good catch, this is dead code.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192193089
  
--- Diff: src/madpack/madpack.py ---
@@ -559,71 +650,59 @@ def _db_rename_schema(from_schema, to_schema):
 # 
--
 
 
-def _db_create_schema(schema):
+def _db_create_schema(schema, is_schema_in_db, filehandle):
 """
 Create schema
 @param from_schema name of the schema to rename
+@param is_schema_in_db flag to indicate if schema is already 
present
 @param to_schema new name for the schema
 """
 
-info_(this, "> Creating %s schema" % schema.upper(), True)
--- End diff --

This is useful and it was preserved in `main`.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192205268
  
--- Diff: src/madpack/madpack.py ---
@@ -987,275 +1276,42 @@ def main(argv):
 error_(this, "Missing -p/--platform parameter.", True)
 if not con_args:
 error_(this, "Unknown problem with database connection string: 
%s" % con_args, True)
+#  Completed "Get and validate arguments" 
-
 
 # COMMAND: version
 if args.command[0] == 'version':
 _print_revs(rev, dbrev, con_args, schema)
 
-# COMMAND: uninstall/reinstall
-if args.command[0] in ('uninstall', 'reinstall'):
-if get_rev_num(dbrev) == [0]:
-info_(this, "Nothing to uninstall. No version found in schema 
%s." % schema.upper(), True)
-return
-
-# Find any potential data to lose
-affected_objects = _internal_run_query("""
-SELECT
-n1.nspname AS schema,
-relname AS relation,
-attname AS column,
-typname AS type
-FROM
-pg_attribute a,
-pg_class c,
-pg_type t,
-pg_namespace n,
-pg_namespace n1
-WHERE
-n.nspname = '%s'
-AND t.typnamespace = n.oid
-AND a.atttypid = t.oid
-AND c.oid = a.attrelid
-AND c.relnamespace = n1.oid
-AND c.relkind = 'r'
-ORDER BY
-n1.nspname, relname, attname, typname""" % schema.lower(), 
True)
-
-info_(this, "*** Uninstalling MADlib ***", True)
-info_(this, 
"***",
 True)
-info_(this, "* Schema %s and all database objects depending on it 
will be dropped!" % schema.upper(), True)
-if affected_objects:
-info_(this, "* If you continue the following data will be lost 
(schema : table.column : type):", True)
-for ao in affected_objects:
-info_(this, '* - ' + ao['schema'] + ' : ' + ao['relation'] 
+ '.' +
-  ao['column'] + ' : ' + ao['type'], True)
-info_(this, 
"***",
 True)
-info_(this, "Would you like to continue? [Y/N]", True)
-go = raw_input('>>> ').upper()
-while go != 'Y' and go != 'N':
-go = raw_input('Yes or No >>> ').upper()
-
-# 2) Do the uninstall/drop
-if go == 'N':
-info_(this, 'No problem. Nothing dropped.', True)
-return
-
-elif go == 'Y':
-info_(this, "> dropping schema %s" % schema.upper(), verbose)
-try:
-_internal_run_query("DROP SCHEMA %s CASCADE;" % (schema), 
True)
-except:
-error_(this, "Cannot drop schema %s." % schema.upper(), 
True)
-
-info_(this, 'Schema %s (and all dependent objects) has been 
dropped.' % schema.upper(), True)
-info_(this, 'MADlib uninstalled successfully.', True)
-
-else:
-return
-
-# COMMAND: install/reinstall
-if args.command[0] in ('install', 'reinstall'):
-# Refresh MADlib version in DB, None for GP/PG
-if args.command[0] == 'reinstall':
-print "Setting MADlib database version to be None for 
reinstall"
-dbrev = None
-
-info_(this, "*** Installing MADlib ***", True)
-
-# 1) Compare OS and DB versions.
-# noop if OS <= DB.
-_print_revs(rev, dbrev, con_args, schema)
-if is_rev_gte(get_rev_num(dbrev), get_rev_num(rev)):
-info_(this, "Current MADlib version already up to date.", True)
-return
-# proceed to create objects if nothing installed in DB
-elif dbrev is None:
-pass
-# error and refer to upgrade if OS > DB
-else:
-error_(this, """Aborting installation: existing MADlib version 
detected in {0} schema
-To upgrade the {0} schema to MADlib v{1} please run 
the following command:
-madpack upgrade -s {0} -p {2} [-c ...]
-""".forma

[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192175486
  
--- Diff: src/madpack/upgrade_util.py ---
@@ -1299,18 +1303,19 @@ def _clean_function(self):
 pattern = re.compile(r"""CREATE(\s+)FUNCTION""", re.DOTALL | 
re.IGNORECASE)
 self._sql = re.sub(pattern, 'CREATE OR REPLACE FUNCTION', 
self._sql)
 
-def cleanup(self, sql):
+def cleanup(self, sql, algoname):
 """
 @brief Entry function for cleaning the sql script
 """
 self._sql = sql
-self._clean_comment()
-self._clean_type()
-self._clean_cast()
-self._clean_operator()
-self._clean_opclass()
-self._clean_aggregate()
-self._clean_function()
+if algoname not in self.get_change_handler().newmodule:
--- End diff --

This if check was originally done in `madpack.py` before the `cleanup()` 
function was called. We moved it into this function instead. Will add the 
comment, but keep the if logic as is, since we will have to return at 
the end anyway.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192204168
  
--- Diff: src/madpack/madpack.py ---
@@ -824,6 +873,246 @@ def parse_arguments():
 # Get the arguments
 return parser.parse_args()
 
+def run_install_check(args, testcase):
+schema = args['schema']
+dbrev = args['dbrev']
+# 1) Compare OS and DB versions. Continue if OS = DB.
+if get_rev_num(dbrev) != get_rev_num(rev):
+_print_revs(rev, dbrev, con_args, schema)
+info_(this, "Versions do not match. Install-check stopped.", True)
+return
+
+# Create install-check user
+test_user = ('madlib_' +
+ rev.replace('.', '').replace('-', '_') +
+ '_installcheck')
+try:
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), False)
+except:
+_internal_run_query("DROP OWNED BY %s CASCADE;" % (test_user), 
True)
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), True)
+_internal_run_query("CREATE USER %s;" % (test_user), True)
+
+_internal_run_query("GRANT USAGE ON SCHEMA %s TO %s;" % (schema, 
test_user), True)
+
+# 2) Run test SQLs
+info_(this, "> Running test scripts for:", verbose)
+
+caseset = (set([test.strip() for test in testcase.split(',')])
+   if testcase != "" else set())
+
+modset = {}
+for case in caseset:
+if case.find('/') > -1:
+[mod, algo] = case.split('/')
+if mod not in modset:
+modset[mod] = []
+if algo not in modset[mod]:
+modset[mod].append(algo)
+else:
+modset[case] = []
+
+# Loop through all modules
+for moduleinfo in portspecs['modules']:
+
+# Get module name
+module = moduleinfo['name']
+
+# Skip if doesn't meet specified modules
+if modset is not None and len(modset) > 0 and module not in modset:
+continue
+# JIRA: MADLIB-1078 fix
+# Skip pmml during install-check (when run without the -t option).
+# We can still run install-check on pmml with '-t' option.
+if not modset and module in ['pmml']:
+continue
+info_(this, "> - %s" % module, verbose)
+
+# Make a temp dir for this module (if doesn't exist)
+cur_tmpdir = tmpdir + '/' + module + '/test'  # tmpdir is a global 
variable
+_make_dir(cur_tmpdir)
+
+# Find the Python module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/" + dbver + 
"/modules/" + module):
+maddir_mod_py = maddir + "/ports/" + portid + "/" + dbver + 
"/modules"
+else:
+maddir_mod_py = maddir + "/modules"
+
+# Find the SQL module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/modules/" + 
module):
+maddir_mod_sql = maddir + "/ports/" + portid + "/modules"
+else:
+maddir_mod_sql = maddir + "/modules"
+
+# Prepare test schema
+test_schema = "madlib_installcheck_%s" % (module)
+_internal_run_query("DROP SCHEMA IF EXISTS %s CASCADE; CREATE 
SCHEMA %s;" %
+(test_schema, test_schema), True)
+_internal_run_query("GRANT ALL ON SCHEMA %s TO %s;" %
+(test_schema, test_user), True)
+
+# Switch to test user and prepare the search_path
+pre_sql = '-- Switch to test user:\n' \
+  'SET ROLE %s;\n' \
+  '-- Set SEARCH_PATH for install-check:\n' \
+  'SET search_path=%s,%s;\n' \
+  % (test_user, test_schema, schema)
+
+# Loop through all test SQL files for this module
+sql_files = maddir_mod_sql + '/' + module + '/test/*.sql_in'
+for sqlfile in sorted(glob.glob(sql_files), reverse=True):
+algoname = os.path.basename(sqlfile).split('.')[0]
+# run only algo specified
+if (module in modset and modset[module] and
+algoname not in modset[module]):
+continue
+
+# Set file names
+tmpfile = cur_tmpdir + '/' + os.path.basename(sqlfile) + '.tmp'
+   

[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192182069
  
--- Diff: src/madpack/madpack.py ---
@@ -95,6 +95,16 @@ def _internal_run_query(sql, show_error):
 return run_query(sql, con_args, show_error)
 # 
--
 
+def _write_to_file(handle, sql, show_error=False):
+handle.write(sql)
+handle.write('\n')
+# 
--
+
+
+def _merge_to_file(handle, other_file, show_error=False):
--- End diff --

It was used very early in the story, I think; I have removed it now.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192193755
  
--- Diff: src/madpack/madpack.py ---
@@ -824,6 +873,246 @@ def parse_arguments():
 # Get the arguments
 return parser.parse_args()
 
+def run_install_check(args, testcase):
+schema = args['schema']
--- End diff --

This is install-check code that is not refactored in this commit; that 
refactor should be done as part of a different commit. We only moved the 
`install-check`-specific code from `main` to this function, since `main` was 
very large.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192181634
  
--- Diff: src/madpack/madpack.py ---
@@ -95,6 +95,16 @@ def _internal_run_query(sql, show_error):
 return run_query(sql, con_args, show_error)
 # 
--
 
+def _write_to_file(handle, sql, show_error=False):
--- End diff --

Moved this function to `madlib/utilities.py`.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r192178673
  
--- Diff: src/madpack/utilities.py ---
@@ -33,6 +33,23 @@
 this = os.path.basename(sys.argv[0])# name of this script
 
 
+class AtomicFileOpen:
--- End diff --

It's not used, will remove it.


---


[GitHub] madlib issue #271: Madpack: Make install, reinstall and upgrade atomic

2018-05-31 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/271
  
Thank you for the comments @kaknikhil , will address them.


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191575057
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1781,7 +1799,7 @@ class MLPMinibatchPreProcessor:
 summary_table_columns = summary_table_columns[0]
 
 required_columns = (self.DEPENDENT_VARNAME, 
self.INDEPENDENT_VARNAME,
-self.CLASS_VALUES, self.GROUPING_COL)
+self.CLASS_VALUES, self.GROUPING_COL, 
self.DEPENDENT_VARTYPE)
 if set(required_columns) <= set(summary_table_columns):
 self.preprocessed_summary_dict = summary_table_columns
 else:
--- End diff --

Can you please update the optimizer params in online docs?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191533727
  
--- Diff: src/modules/convex/type/model.hpp ---
@@ -126,45 +129,96 @@ struct MLPModel {
 for (k = 0; k < N; k ++) {
 size += static_cast((n[k] + 1) * (n[k+1]));
 }
+//TODO conditionally assign size based on momentum
--- End diff --

Is this TODO still valid?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191574947
  
--- Diff: src/ports/postgres/modules/convex/mlp.sql_in ---
@@ -1474,13 +1480,15 @@ CREATE AGGREGATE MADLIB_SCHEMA.mlp_minibatch_step(
 /* warm_start_coeff */DOUBLE PRECISION[],
 /* lambda */  DOUBLE PRECISION,
 /* batch_size */  INTEGER,
-/* n_epochs */INTEGER
+/* n_epochs */INTEGER,
+/* momentum */DOUBLE PRECISION,
+/* is_nesterov */ BOOLEAN
 )(
 STYPE=DOUBLE PRECISION[],
 SFUNC=MADLIB_SCHEMA.mlp_minibatch_transition,
 m4_ifdef(`__POSTGRESQL__', `', 
`prefunc=MADLIB_SCHEMA.mlp_minibatch_merge,')
 FINALFUNC=MADLIB_SCHEMA.mlp_minibatch_final,
-INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0}'
+INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
 );
 -
 
--- End diff --

Can you please update the user docs with momentum and nesterov related 
optimizer params? Is it also a good time to update the MLP design doc?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191539168
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -197,6 +244,7 @@ MLP::loss(
 const model_type,
 const independent_variables_type,
 const dependent_variable_type   _true) {
+
--- End diff --

The loss computation code in this function is duplicated in `getLoss()` 
function. Can we call that function to remove the code duplication?
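
For reference, the duplicated loss computation amounts to the following (a plain-Python sketch of the logic in the diff, not MADlib's actual C++ `getLoss()` helper):

```python
import math

def get_loss(y_true, y_est, is_classification, clip=1.e-10):
    # Mirrors the duplicated block: clipped cross-entropy for
    # classification, half squared error for regression.
    if is_classification:
        total = 0.0
        for t, e in zip(y_true, y_est):
            # Clip estimates away from 0 and 1 so the logs stay finite.
            e = min(max(e, clip), 1.0 - clip)
            total += -(t * math.log(e) + (1.0 - t) * math.log(1.0 - e))
        return total
    return 0.5 * sum((e - t) ** 2 for t, e in zip(y_true, y_est))
```

Factoring this into one helper and calling it from both sites is exactly the deduplication suggested above.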


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191537654
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -126,68 +157,84 @@ MLP::getLossAndUpdateModel(
 const Matrix _true_batch,
 const double ) {
 
-uint16_t num_layers = static_cast<uint16_t>(model.u.size()); // 
assuming nu. of layers >= 1
-Index num_rows_in_batch = x_batch.rows();
 double total_loss = 0.;
+// model is updated with the momentum step (i.e. velocity vector)
+// if Nesterov Accelerated Gradient is enabled
+model.nesterovUpdatePosition();
 
-// gradient added over the batch
-std::vector<Matrix> total_gradient_per_layer(num_layers);
-for (Index k=0; k < num_layers; ++k)
+// initialize gradient vector
+std::vector<Matrix> total_gradient_per_layer(model.num_layers);
+for (Index k=0; k < model.num_layers; ++k) {
 total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
model.u[k].cols());
+}
 
+std::vector<ColumnVector> net, o, delta;
+Index num_rows_in_batch = x_batch.rows();
 for (Index i=0; i < num_rows_in_batch; i++){
+// gradient and loss added over the batch
 ColumnVector x = x_batch.row(i);
 ColumnVector y_true = y_true_batch.row(i);
 
-std::vector<ColumnVector> net, o, delta;
 feedForward(model, x, net, o);
 backPropogate(y_true, o.back(), net, model, delta);
 
-for (Index k=0; k < num_layers; k++){
+// compute the gradient
+for (Index k=0; k < model.num_layers; k++){
 total_gradient_per_layer[k] += o[k] * delta[k].transpose();
 }
 
-// loss computation
-ColumnVector y_estimated = o.back();
-if(model.is_classification){
-double clip = 1.e-10;
-y_estimated = y_estimated.cwiseMax(clip).cwiseMin(1.-clip);
-total_loss += - (y_true.array()*y_estimated.array().log()
-   + 
(-y_true.array()+1)*(-y_estimated.array()+1).log()).sum();
-}
-else{
-total_loss += 0.5 * (y_estimated - y_true).squaredNorm();
-}
+// compute the loss
+total_loss += getLoss(y_true, o.back(), model.is_classification);
 }
-for (Index k=0; k < num_layers; k++){
+
+// convert gradient to a gradient update vector
+//  1. normalize to per row update
+//  2. discount by stepsize
+//  3. add regularization
+//  4. make negative
+for (Index k=0; k < model.num_layers; k++){
 Matrix regularization = MLP::lambda * model.u[k];
 regularization.row(0).setZero(); // Do not update bias
-model.u[k] -=
-stepsize *
-(total_gradient_per_layer[k] / 
static_cast<double>(num_rows_in_batch) +
- regularization);
+total_gradient_per_layer[k] = -stepsize * 
(total_gradient_per_layer[k] / static_cast<double>(num_rows_in_batch) +
+  regularization);
+model.updateVelocity(total_gradient_per_layer[k], k);
+model.updatePosition(total_gradient_per_layer[k], k);
 }
+
 return total_loss;
+
 }
 
+
 template 
 void
 MLP::gradientInPlace(
 model_type  ,
 const independent_variables_type,
 const dependent_variable_type   _true,
-const double) {
-size_t N = model.u.size(); // assuming nu. of layers >= 1
+const double)
+{
+model.nesterovUpdatePosition();
+
 std::vector<ColumnVector> net, o, delta;
 
 feedForward(model, x, net, o);
 backPropogate(y_true, o.back(), net, model, delta);
 
-for (size_t k=0; k < N; k++){
+for (Index k=0; k < model.num_layers; k++){
 Matrix regularization = MLP::lambda*model.u[k];
 regularization.row(0).setZero(); // Do not update bias
-model.u[k] -= stepsize * (o[k] * delta[k].transpose() + 
regularization);
+if (model.momentum > 0){
+Matrix gradient = -stepsize * (o[k] * delta[k].transpose() + 
regularization);
+model.updateVelocity(gradient, k);
+model.updatePosition(gradient, k);
+}
+else {
+// Updating model inline instead of using updatePosition since
+// updatePosition creates a copy of the gradient making it 
slower.
--- End diff --

This comment is a little confusing. We do pass along the gradi

[GitHub] madlib pull request #260: minibatch preprocessor improvements

2018-04-10 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/260#discussion_r180603956
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -397,8 +408,9 @@ class MiniBatchStandardizer:
 x_std_dev_str = self.x_std_dev_str)
 return query
 
-def _get_query_for_standardizing_with_grouping(self):
+def _create_table_for_standardizing_with_grouping(self):
--- End diff --

Why was the method name changed? The older name seems to be more apt, since 
this function is still returning the query, and not executing it (the same for 
`_create_table_for_standardizing_without_grouping()` too).


---


[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/259#discussion_r180576675
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in ---
@@ -91,6 +92,22 @@ minibatch_preprocessor(
When this value is NULL, no grouping is used and a single preprocessing 
step
is performed for the whole data set.
   
+
+  one_hot_encode_int_dep_var (optional)
+   BOOLEAN. default: FALSE.
+  A flag to decide whether to one-hot encode dependent variables that are
+scalar integers. This parameter is ignored if the dependent variable is 
not a
+scalar integer.
+
+@note The mini-batch preprocessor automatically encodes
+dependent variables that are boolean and character types such as text, 
char and
+varchar.  However, scalar integers are a special case because they can be 
used
+in both classification and regression problems, so you must tell the 
mini-batch
+preprocessor whether you want to encode them or not. In the case that you 
have
+already encoded the dependent variable yourself,  you can ignore this 
parameter.
+Also, if you want to encode float values for some reason, cast them to text
+first.
--- End diff --

+1 for the explanation.
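
The encoding behaviour described in the quoted docs can be sketched as follows (`one_hot_encode` is a hypothetical helper for illustration, not the module's implementation):

```python
def one_hot_encode(values, classes=None):
    # Map each scalar value to a one-hot vector over the sorted distinct
    # class values, as the preprocessor does for boolean and character
    # dependent variables (and for integers only when the user passes
    # one_hot_encode_int_dep_var=TRUE).
    if classes is None:
        classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[v] == i else 0 for i in range(len(classes))]
            for v in values]
```

For example, an integer dependent variable with levels 4, 6, 8 becomes three-element indicator vectors, which is what a classification model would consume.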


---


[GitHub] madlib pull request #258: RF: Comment out assert in flaky install check quer...

2018-04-06 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/258

RF: Comment out assert in flaky install check query

JIRA: MADLIB-1225

The variable importance computation involves randomization inherently.
So it is hard to reproduce this error consistently. This commit comments
out the assert for now (the failure rate was around 4.3%, when tested over
600 runs).

Closes #258

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/rf/flaky-install-check

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #258


commit f2059aa35de5ab0827cf1d9fa0e39179a6190b49
Author: Nandish Jayaram <njayaram@...>
Date:   2018-04-07T00:16:17Z

RF: Comment out assert in flaky install check query

JIRA: MADLIB-1225

The variable importance computation involves randomization inherently.
So it is hard to reproduce this error consistently. This commit comments
out the assert for now (the failure rate was around 4.3%, when tested over
600 runs).

Closes #258




---


[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing

2018-04-03 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/254#discussion_r178684532
  
--- Diff: 
src/ports/postgres/modules/utilities/mean_std_dev_calculator.py_in ---
@@ -40,15 +41,27 @@ class MeanStdDevCalculator:
 self.dimension = dimension
 
 def get_mean_and_std_dev_for_ind_var(self):
-set_zero_std_to_one = True
-
 x_scaled_vals = utils_ind_var_scales(self.source_table,
  self.indep_var_array_str,
  self.dimension,
  self.schema_madlib,
- None, # do not dump the 
output to a temp table
- set_zero_std_to_one)
+ x_mean_table = None, # do not 
dump the output to a temp table
+ set_zero_std_to_one=True)
 x_mean_str = _array_to_string(x_scaled_vals["mean"])
 x_std_str = _array_to_string(x_scaled_vals["std"])
 
+if not x_mean_str or not x_std_str:
+plpy.error("mean/stddev for the independent variable"
+   "cannot be null")
+
 return x_mean_str, x_std_str
+
+def create_mean_std_table_for_ind_var_grouping(self, x_mean_table, 
grouping_cols):
+utils_ind_var_scales_grouping(self.source_table,
+ self.indep_var_array_str,
+ self.dimension,
+ self.schema_madlib,
+ grouping_cols,
+ x_mean_table,
+ set_zero_std_to_one = True,
+ create_temp_table = False)
--- End diff --

Could you please correct the indentation here?


---


[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing

2018-04-03 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/254#discussion_r178684358
  
--- Diff: src/ports/postgres/modules/convex/utils_regularization.py_in ---
@@ -85,6 +86,8 @@ def utils_ind_var_scales_grouping(tbl_data, col_ind_var, 
dimension,
 x_mean_table,
 set_zero_std_to_one (optional, default is False. If set to true
  0.0 standard deviation values will be set to 1.0)
+create_temp_table If set to true, create a persistent instead of a 
temp
+  table, else create a temp table for x_mean
--- End diff --

Shouldn't this comment say create temp table when true, and a persistent 
table when set to false?


---


[GitHub] madlib pull request #251: MLP: Simplify initialization of model coefficients

2018-03-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/251#discussion_r178204067
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -98,23 +98,28 @@ mlp_igd_transition::run(AnyType ) {
 
 double is_classification_double = (double) is_classification;
 double activation_double = (double) activation;
-MappedColumnVector coeff = 
args[10].getAs();
-
state.task.model.rebind(_classification_double,_double,
-()[0], numberOfStages,
-[0]);
-
-// state.task.model.is_classification =
-// static_cast(is_classification);
-// state.task.model.activation = 
static_cast(activation);
-// MappedColumnVector initial_coeff = 
args[10].getAs();
-// // copy initial_coeff into the model
-// Index fan_in, fan_out, layer_start = 0;
-// for (size_t k = 0; k < numberOfStages; ++k){
-// fan_in = numbersOfUnits[k];
-// fan_out = numbersOfUnits[k+1];
-// state.task.model.u[k] << 
initial_coeff.segment(layer_start, (fan_in+1)*fan_out);
-// layer_start = (fan_in + 1) * fan_out;
-// }
+if (!args[10].isNull()){
+// initial coefficients are provided
+MappedColumnVector warm_start_coeff = 
args[10].getAs();
+
+// copy warm start into the task model
+// state.reset() ensures algo.incrModel is copied from 
task.model
+Index layer_start = 0;
+for (size_t k = 0; k < numberOfStages; ++k){
+for (size_t j=0; j < state.task.model.u[k].cols(); 
++j){
+for (size_t i=0; i < state.task.model.u[k].rows(); 
++i){
+state.task.model.u[k](i, j) = warm_start_coeff(
+layer_start + j * 
state.task.model.u[k].rows() + i);
+}
+}
+layer_start = state.task.model.u[k].rows() * 
state.task.model.u[k].cols();
+}
+} else {
+// initialize the model with appropriate coefficients
+state.task.model.initialize(
--- End diff --

This is more of a doubt than a comment I guess. Should this be 
`state.algo.incrModel` instead?
The reason I thought so was that `MLPIGDAlgorithm::transition(state, 
tuple);` uses the model in algo instead of the one in task.


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/250#discussion_r178167691
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, 
dependent_varname,
 if is_classification:
 # Currently, classification doesn't accept an
--- End diff --

Yes, that's correct.


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/250#discussion_r178168389
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -856,8 +868,16 @@ def mlp_predict(schema_madlib, model_table, 
data_table, id_col_name,
 activation = _get_activation_index(summary['activation'])
 layer_sizes = PY2SQL(
 summary['layer_sizes'], array_type="DOUBLE PRECISION")
-is_classification = int(summary["is_classification"])
 is_response = int(pred_type == 'response')
+is_classification = int(summary["is_classification"])
+classes = summary['classes']
+# Set a flag to indicate that it is a classification model, with an 
array
+# as the dependent var. The only scenario where classification allows 
for
+# an array dep var is when the user has provided a one-hot encoded dep 
var
+# during training, and mlp_classification does not one-hot encode
+# (and hence classes column in model's summary table is NULL).
+array_dep_var_for_classification = int(is_classification and not 
classes)
--- End diff --

Will change it to `is_dep_var_an_array_for_classification`


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/250#discussion_r178168247
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, 
dependent_varname,
 if is_classification:
 # Currently, classification doesn't accept an
 # array for dep type in IGD
-_assert("[]" not in expr_type and expr_type in 
classification_types,
+_assert(("[]" in expr_type and expr_type[:-2] in int_types) \
--- End diff --

- One-hot encoding is normally expected to be an integer array, hence I was 
checking only for int types. But that assumption might be too strict, so I 
will make changes to allow other numeric types too.
- Good catch. It might be a good idea to disallow anything greater than a 
1-D array. Will make an error check for the same.
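
The relaxed validation being discussed could look roughly like this (a sketch; the type lists are assumptions standing in for MADlib's actual `int_types`/`classification_types` sets):

```python
# Assumed stand-ins for the module's type lists.
INT_TYPES = ('integer', 'smallint', 'bigint')
CLASSIFICATION_TYPES = ('boolean', 'text', 'varchar', 'char') + INT_TYPES

def is_valid_classification_dep_type(expr_type):
    # Accept a 1-D integer array (assumed to be a user-provided one-hot
    # encoding) or one of the scalar classification types. A type such
    # as 'integer[][]' falls through to False because stripping one
    # '[]' leaves 'integer[]', which is not a scalar int type.
    if expr_type.endswith('[]'):
        return expr_type[:-2] in INT_TYPES
    return expr_type in CLASSIFICATION_TYPES
```

This keeps the "no multi-dimensional arrays" restriction while still allowing a user-encoded 1-D dependent variable.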


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/250#discussion_r178168263
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, dependent_varname,
 if is_classification:
 # Currently, classification doesn't accept an
 # array for dep type in IGD
-_assert("[]" not in expr_type and expr_type in classification_types,
+_assert(("[]" in expr_type and expr_type[:-2] in int_types) \
+or expr_type in classification_types,
 "Dependent variable column should be of type: "
--- End diff --

Will change it to cover both the cases.


---


[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

2018-03-27 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/248#discussion_r177495864
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
  "one value are dropped from the tree model.")
 cat_features = [feature for feature in cat_features if feature in use_cat_features]
 
-grp_to_col_to_row = dict((grp_key, dict(
-(row['colname'], row['levels']) for row in items))
-for grp_key, items in groupby(all_levels, key=itemgetter('grp_key')))
+
+# grp_col_to_levels is a list of tuples (pairs) with
+#   first value = group value,
+#   second value = a dict mapping a categorical column to its levels in data
+#   (these levels are specific to the group and can be different
+#   for different groups)
+#   The list of tuples can be converted to a dict, but the ordering
+#   will be lost.
+# eg. grp_col_to_levels =
+#   [
+#   ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+#   ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+#   ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+#   ]
+grp_to_col_to_levels = [
+(grp_key, dict((row['colname'], row['levels']) for row in items))
+for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]
 if cat_features:
-cat_items_list = [rows[col] for col in cat_features
-  for grp_key, rows in grp_to_col_to_row.items() if col in rows]
+# Below statements collect the grp_to_col_to_levels into multiple variables
+# From above eg.
+#   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+#   cat_n = [2, 3, 2, 2, 2, 3]
+#   cat_levels_in_text = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
+#   grp_key_cat = ['3', '4', '5']
--- End diff --

+1 for the examples, very helpful.
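The grouping logic in the quoted diff can be reproduced standalone; a small sketch with toy data (group and level values taken from the comment's example), showing both the ordered (group, dict) pairs and the flattened per-group level lists:

```python
from itertools import groupby
from operator import itemgetter

# Toy rows standing in for all_levels; groupby only merges *adjacent* rows,
# so the input must already be sorted/clustered by grp_key.
all_levels = [
    {'grp_key': '3', 'colname': 'vs',  'levels': [0, 1]},
    {'grp_key': '3', 'colname': 'cyl', 'levels': [4, 6, 8]},
    {'grp_key': '4', 'colname': 'vs',  'levels': [0, 1]},
    {'grp_key': '4', 'colname': 'cyl', 'levels': [4, 6]},
]
cat_features = ['vs', 'cyl']

# List of (group value, {column: levels}) pairs, preserving group order.
grp_to_col_to_levels = [
    (grp_key, dict((row['colname'], row['levels']) for row in items))
    for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]

# Flatten per group, then per column, mirroring the commented example.
cat_items_list = [rows[col]
                  for _, rows in grp_to_col_to_levels
                  for col in cat_features if col in rows]
cat_n = [len(levels) for levels in cat_items_list]
```

Keeping a list of tuples rather than a plain dict is what preserves the group ordering the comment calls out.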


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-26 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/250

MLP: Allow one-hot encoded dependent var for classification

JIRA:MADLIB-1222

MLP currently automatically encodes categorical variables for
classification but does not allow already encoded arrays for dependent
variables in mlp_classification. This commit lets users have an already
encoded array for the dependent variable and train a model.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/mlp/support-encoded-dep-var

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #250


commit f5a87dee6bc8f27c1a13a4921ea726b391b1813d
Author: Nandish Jayaram <njayaram@...>
Date:   2018-03-20T22:43:25Z

MLP: Allow one-hot encoded dependent var for classification

JIRA:MADLIB-1222

MLP currently automatically encodes categorical variables for
classification but does not allow already encoded arrays for dependent
variables in mlp_classification. This commit lets users have an already
encoded array for the dependent variable and train a model.




---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-21 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r176218740
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last tuple
+MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
+MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
--- End diff --

Tried it, it was cleaner this way.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175949965
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
 it_args.update({
 'group_by_clause': group_by_clause,
 'using_clause': using_clause,
-'grouping_str_comma': grouping_str_comma
+'grouping_str_comma': grouping_str_comma,
 })
 
 first_try = True
 temp_output_table = unique_string(desp='temp_output_table')
+
+layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes]
+
 for _ in range(n_tries):
+prev_state = None
 if not warm_start:
 coeff = []
-for i in range(len(layer_sizes) - 1):
-fan_in = layer_sizes[i]
-fan_out = layer_sizes[i + 1]
+for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
 # Initalize according to Glorot and Bengio (2010)
 # See design doc for more info
 span = math.sqrt(6.0 / (fan_in + fan_out))
-dim = (layer_sizes[i] + 1) * layer_sizes[i + 1]
-rand = plpy.execute("""SELECT array_agg({span}*2*(random()-0.5))
-   AS random
-   FROM generate_series(0,{dim})
-""".format(span=span, dim=dim))[0]["random"]
+dim = (fan_in + 1) * fan_out
+rand = [span * (random() - 0.5) for _ in range(dim)]
--- End diff --

It's supposed to be explained in the design doc as per the comment. I think 
these formulae are taken from a research paper.
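For reference, the formula in the quoted code is the Glorot & Bengio (2010) uniform initialization: each weight is drawn from (-span, span) with span = sqrt(6 / (fan_in + fan_out)), and the `+ 1` in `dim` accounts for the bias term. A standalone sketch (the function name is mine, not MADlib's):

```python
import math
import random

def glorot_uniform_coeffs(layer_sizes, rng=random.random):
    """Flat coefficient vector for consecutive layer pairs, each weight
    drawn from U(-span, span) with span = sqrt(6 / (fan_in + fan_out)).
    The (fan_in + 1) accounts for the bias unit of each layer."""
    coeff = []
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        span = math.sqrt(6.0 / (fan_in + fan_out))
        dim = (fan_in + 1) * fan_out
        # rng() is in [0, 1), so span*2*(rng() - 0.5) is in [-span, span)
        coeff.extend(span * 2 * (rng() - 0.5) for _ in range(dim))
    return coeff
```

Computing this list in Python (as the diff does) rather than via a `generate_series` SQL round trip is the change under review.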


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175950079
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
 # used, it will be an empty list if there was not grouping.
 groups = [t[col_grp_key] for t in res if t[col_grp_key]]
 losses = [t['loss'] for t in res]
-loss = zip(groups, losses) if len(groups)==len(losses) \
-   else losses
-plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \
-  ', '.join([str(l) for l in loss]) + ">")
+loss = zip(groups, losses) if groups else losses
+plpy.info("Iteration: {0}, Loss: <{1}>".
+  format(it.iteration, ', '.join(map(str, loss))))
 it.final()
 _update_temp_model_table(it_args, it.iteration, temp_output_table,
- first_try)
+ is_minibatch_enabled, first_try)
 first_try = False
-layer_sizes_str = py_list_to_sql_string(
-layer_sizes, array_type="integer")
-classes_str = py_list_to_sql_string(
-[strip_end_quotes(cl, "'") for cl in classes],
-array_type=dependent_type)
+layer_sizes_str = py_list_to_sql_string(layer_sizes,
+array_type="integer")
+
 _create_summary_table(locals())
-_create_standardization_table(standardization_table, x_mean_table,
-  warm_start)
+if is_minibatch_enabled:
+# We already have the mean and std in the input standardization table
+input_std_table = add_postfix(source_table, '_standardization')
+_create_standardization_table(standardization_table, input_std_table,
+  warm_start)
+else:
+_create_standardization_table(standardization_table, x_mean_table,
+  warm_start)
+# The original input table is the tab_data_scaled for mini batch.
+# Do NOT drop this, it will end up dropping the original data table.
+plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl_data_scaled))
+plpy.execute("DROP TABLE IF EXISTS {0}".format(x_mean_table))
--- End diff --

Yes.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175947887
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
 double loss = 0.0;
-for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
- curr_batch++, curr_batch_row_index += batch_size) {
-   Matrix X_batch;
-   ColumnVector y_batch;
-   if (curr_batch == n_batches-1) {
-  // last batch
-  X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-  y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-   } else {
-   X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-   y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-   }
-   loss += Task::getLossAndUpdateModel(
-   state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i<n_batches; i++) {
+random_curr_batch[i] = i;
+}
+int curr_batch_row_index = 0;
+std::random_shuffle(&random_curr_batch[0], &random_curr_batch[n_batches]);
+for (int i=0; i < n_batches; i++) {
+int curr_batch = random_curr_batch[i];
+int curr_batch_row_index = curr_batch * batch_size;
+Matrix X_batch;
+Matrix Y_batch;
+if (curr_batch == n_batches-1) {
+   // last batch
+   X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
+   Y_batch = tuple.depVar.bottomRows(n_rows-curr_batch_row_index);
+} else {
+X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
+Y_batch = tuple.depVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
--- End diff --

Yes, that seems right.
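The batching scheme under discussion — fixed-size batches with the final batch absorbing the remainder rows — can be sketched in isolation. This assumes `n_batches` is the row count over `batch_size` rounded up, which may differ from how the C++ state derives it:

```python
def batch_slices(n_rows, batch_size):
    """Row-index ranges (start, end) for minibatches, mirroring the
    block()/bottomRows() split above: every batch takes batch_size rows,
    except the last, which takes all remaining rows."""
    n_batches = (n_rows + batch_size - 1) // batch_size  # ceil division
    slices = []
    for b in range(n_batches):
        start = b * batch_size
        end = n_rows if b == n_batches - 1 else start + batch_size
        slices.append((start, end))
    return slices
```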


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175947857
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
 double loss = 0.0;
-for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
- curr_batch++, curr_batch_row_index += batch_size) {
-   Matrix X_batch;
-   ColumnVector y_batch;
-   if (curr_batch == n_batches-1) {
-  // last batch
-  X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-  y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-   } else {
-   X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-   y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-   }
-   loss += Task::getLossAndUpdateModel(
-   state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i<n_batches; i++) {
+random_curr_batch[i] = i;
+}
+int curr_batch_row_index = 0;
+std::random_shuffle(&random_curr_batch[0], &random_curr_batch[n_batches]);
--- End diff --

I don't think we should set a seed. Setting a seed will ensure we will 
always get the same set of random numbers.
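The shuffle being discussed just permutes the batch visit order once per epoch, so every batch is still visited exactly once. An unseeded sketch of the same idea:

```python
import random

def epoch_batch_order(n_batches, rng=random):
    """Fresh random visit order over the batches for one epoch --
    the effect of the std::random_shuffle call above. Deliberately
    unseeded, per the reasoning in the comment: a fixed seed would
    replay the same batch order every run."""
    order = list(range(n_batches))
    rng.shuffle(order)  # in-place Fisher-Yates shuffle
    return order
```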


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175950139
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -491,10 +571,28 @@ def _update_temp_model_table(args, iteration, temp_output_table, first_try):
 ) rel_state_subq
 {join_clause}
 """.format(insert_or_create_str=insert_or_create_str,
-   iteration=iteration, join_clause=join_clause, **args)
+   iteration=iteration, join_clause=join_clause,
+   internal_result_udf=internal_result_udf, **args)
 plpy.execute(model_table_query)
 
 
+def _get_loss(schema_madlib, state, is_mini_batch):
--- End diff --

Yes, must remove it.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175948252
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last tuple
+MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
+MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
+state.allocate(*this, previousState.task.numberOfStages,
+   previousState.task.numbersOfUnits);
+state = previousState;
+} else {
+// configuration parameters
+ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >();
--- End diff --

We probably could, but there are a couple of extra arguments that only minibatch gets, and not IGD (batch_size and n_epochs).


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175949682
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
 it_args.update({
 'group_by_clause': group_by_clause,
 'using_clause': using_clause,
-'grouping_str_comma': grouping_str_comma
+'grouping_str_comma': grouping_str_comma,
 })
 
 first_try = True
 temp_output_table = unique_string(desp='temp_output_table')
+
+layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes]
--- End diff --

Yes, this looks like duplicated code.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175948750
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
 template <class Model, class Tuple>
 double MLP<Model, Tuple>::lambda = 0;
 
+template <class Model, class Tuple>
+double
+MLP<Model, Tuple>::getLossAndUpdateModel(
+model_type   &model,
+const Matrix &x_batch,
+const Matrix &y_true_batch,
+const double &stepsize) {
+
+uint16_t N = model.u.size(); // assuming nu. of layers >= 1
+size_t n = x_batch.rows();
+size_t i, k;
+double total_loss = 0.;
+
+// gradient added over the batch
+std::vector<Matrix> total_gradient_per_layer(N);
+for (k=0; k < N; ++k)
+total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
+   model.u[k].cols());
+
+for (i=0; i < n; i++){
+ColumnVector x = x_batch.row(i);
+ColumnVector y_true = y_true_batch.row(i);
+
+std::vector<ColumnVector> net, o, delta;
+feedForward(model, x, net, o);
--- End diff --

We will have to change the design docs too for that. Apparently, the notation used here is supposed to be in sync with the design doc.

---


[GitHub] madlib issue #241: MiniBatch Pre-Processor: Add new module minibatch_preproc...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/241
  
Another issue I found but forgot to mention in the review:
The `__id__` column has double values instead of integers. For instance, I 
found values such as `0.2000` for that column in the output 
table.
This issue also happens only when the module is used without specifying a 
value for the `buffer_size` param.
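A hypothetical illustration of the arithmetic behind the reported symptom: deriving the buffer id from a row number with float division yields values like `0.2000...`, while integer (floor) division keeps `__id__` integral. Function names here are mine, not the module's:

```python
def buffer_id_float(row_number, buffer_size):
    """Buggy variant: float division leaks a double into the id column."""
    return (row_number - 1) / float(buffer_size)

def buffer_id_int(row_number, buffer_size):
    """Fixed variant: floor division (or an explicit integer cast in SQL)
    maps 1-based row numbers to 0-based integer buffer ids."""
    return (row_number - 1) // buffer_size
```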


---


[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175548350
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+"""
+@file minibatch_preprocessing.py_in
+
+"""
+from math import ceil
+import plpy
+
+from utilities import add_postfix
+from utilities import _assert
+from utilities import get_seg_number
+from utilities import is_platform_pg
+from utilities import is_psql_numeric_type
+from utilities import is_string_formatted_as_array_expression
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from utilities import _string_to_array
+from utilities import validate_module_input_params
+from mean_std_dev_calculator import MeanStdDevCalculator
+from validate_args import get_expr_type
+from validate_args import output_tbl_valid
+from validate_args import _tbl_dimension_rownum
+
m4_changequote(`<!', `!>')
+
+# These are readonly variables, do not modify
+MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
+MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
+
+class MiniBatchPreProcessor:
+"""
+This class is responsible for executing the main logic of mini batch
+preprocessing, which packs multiple rows of selected columns from the
+source table into one row based on the buffer size
+"""
+def __init__(self, schema_madlib, source_table, output_table,
+  dependent_varname, independent_varname, buffer_size, **kwargs):
+self.schema_madlib = schema_madlib
+self.source_table = source_table
+self.output_table = output_table
+self.dependent_varname = dependent_varname
+self.independent_varname = independent_varname
+self.buffer_size = buffer_size
+
+self.module_name = "minibatch_preprocessor"
+self.output_standardization_table = add_postfix(self.output_table,
+   "_standardization")
+self.output_summary_table = add_postfix(self.output_table, "_summary")
+self._validate_minibatch_preprocessor_params()
+
+def minibatch_preprocessor(self):
+# Get array expressions for both dep and indep variables from the
+# MiniBatchQueryFormatter class
+dependent_var_dbtype = get_expr_type(self.dependent_varname,
+ self.source_table)
+qry_formatter = MiniBatchQueryFormatter(self.source_table)
+dep_var_array_str, dep_var_classes_str = qry_formatter.\
+get_dep_var_array_and_classes(self.dependent_varname,
+  dependent_var_dbtype)
+indep_var_array_str = qry_formatter.get_indep_var_array_str(
+  self.independent_varname)
+
+standardizer = MiniBatchStandardizer(self.schema_madlib,
+ self.source_table,
+ dep_var_array_str,
+ indep_var_array_str,
+ self.output_standardization_table)
+standardize_query = standardizer.get_query_for_standardizing()
+
+num_rows_processed, num_missing_rows_skipped = self.\
+_get_skipped_rows_processed_count(
+dep_var_array_str,
+indep_var_array_str)
+calculated_buffer_size = MiniBatchBufferSizeCalculator.\
+

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175588969
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175593796
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175531202
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175522378
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -794,6 +794,41 @@ def collate_plpy_result(plpy_result_rows):
 # ------------------------------------------------------------------------
 
 
+def validate_module_input_params(source_table, output_table, independent_varname,
+                                 dependent_varname, module_name, **kwargs):
--- End diff --

How about adding an optional param to check for residual output tables (the
summary and standardization tables)? We could take a list of suffixes to
check for.
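For illustration, here is a minimal sketch of what such a helper could look like. The signature, the `other_output_tables` parameter name, and the stubbed table-existence check are all hypothetical, not MADlib's actual API (MADlib would use `validate_args.output_tbl_valid` against the database catalog instead).

```python
# Hypothetical sketch of the suggested validation helper: besides the usual
# input checks, it verifies that no residual output tables (e.g. the
# "_summary" and "_standardization" tables) are left over from a prior run.
def validate_module_input_params(source_table, output_table, independent_varname,
                                 dependent_varname, module_name,
                                 other_output_tables=None):
    """other_output_tables: optional list of suffixes such as
    ['_summary', '_standardization'] whose derived tables must not exist."""
    existing_tables = {'old_out_summary'}  # stand-in for a catalog lookup

    def table_exists(name):
        return name in existing_tables

    for param, label in [(source_table, 'source table'),
                         (output_table, 'output table')]:
        if not param or not param.strip():
            raise ValueError("{0}: invalid {1} name".format(module_name, label))
    for suffix in (other_output_tables or []):
        derived = output_table + suffix
        if table_exists(derived):
            raise ValueError("{0}: residual table {1} already exists".
                             format(module_name, derived))

validate_module_input_params('src', 'new_out', 'x', 'y',
                             'minibatch_preprocessor', ['_summary'])
```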


---


[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175585050
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+"""
+@file minibatch_preprocessing.py_in
+
+"""
+from math import ceil
+import plpy
+
+from utilities import add_postfix
+from utilities import _assert
+from utilities import get_seg_number
+from utilities import is_platform_pg
+from utilities import is_psql_numeric_type
+from utilities import is_string_formatted_as_array_expression
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from utilities import _string_to_array
+from utilities import validate_module_input_params
+from mean_std_dev_calculator import MeanStdDevCalculator
+from validate_args import get_expr_type
+from validate_args import output_tbl_valid
+from validate_args import _tbl_dimension_rownum
+
+m4_changequote(`')
+
+# These are readonly variables, do not modify
+MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
+MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
+
+class MiniBatchPreProcessor:
+    """
+    This class is responsible for executing the main logic of mini batch
+    preprocessing, which packs multiple rows of selected columns from the
+    source table into one row based on the buffer size
+    """
+    def __init__(self, schema_madlib, source_table, output_table,
+                 dependent_varname, independent_varname, buffer_size,
+                 **kwargs):
+        self.schema_madlib = schema_madlib
+        self.source_table = source_table
+        self.output_table = output_table
+        self.dependent_varname = dependent_varname
+        self.independent_varname = independent_varname
+        self.buffer_size = buffer_size
+
+        self.module_name = "minibatch_preprocessor"
+        self.output_standardization_table = add_postfix(self.output_table,
+                                                        "_standardization")
+        self.output_summary_table = add_postfix(self.output_table, "_summary")
+        self._validate_minibatch_preprocessor_params()
+
+    def minibatch_preprocessor(self):
+        # Get array expressions for both dep and indep variables from the
+        # MiniBatchQueryFormatter class
+        dependent_var_dbtype = get_expr_type(self.dependent_varname,
+                                             self.source_table)
+        qry_formatter = MiniBatchQueryFormatter(self.source_table)
+        dep_var_array_str, dep_var_classes_str = qry_formatter.\
+            get_dep_var_array_and_classes(self.dependent_varname,
+                                          dependent_var_dbtype)
+        indep_var_array_str = qry_formatter.get_indep_var_array_str(
+            self.independent_varname)
+
+        standardizer = MiniBatchStandardizer(self.schema_madlib,
+                                             self.source_table,
+                                             dep_var_array_str,
+                                             indep_var_array_str,
+                                             self.output_standardization_table)
+        standardize_query = standardizer.get_query_for_standardizing()
+
+        num_rows_processed, num_missing_rows_skipped = self.\
+            _get_skipped_rows_processed_count(dep_var_array_str,
+                                              indep_var_array_str)
+        calculated_buffer_size = MiniBatchBufferSizeCalculator.\

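For intuition, the buffer-packing idea in the diff above can be sketched in a few lines of plain Python. The function name and in-memory representation are illustrative only; MADlib does this packing in SQL over the source table.

```python
# Illustrative sketch (not MADlib code): pack individual (x, y) rows into
# buffers of `buffer_size` rows each, the core idea behind the minibatch
# preprocessor's output table. Each output row holds a 2-D matrix of
# independent variables and the matching dependent values.
def pack_into_buffers(rows, buffer_size):
    """rows: list of (independent_vector, dependent_value) tuples."""
    buffers = []
    for start in range(0, len(rows), buffer_size):
        chunk = rows[start:start + buffer_size]
        x_matrix = [x for x, _ in chunk]  # buffer_size x n_features matrix
        y_values = [y for _, y in chunk]
        buffers.append((x_matrix, y_values))
    return buffers

rows = [([1.0, 2.0], 0), ([3.0, 4.0], 1), ([5.0, 6.0], 0), ([7.0, 8.0], 1),
        ([9.0, 0.5], 0)]
buffers = pack_into_buffers(rows, buffer_size=2)
# 5 rows with buffer_size=2 yield buffers of 2, 2 and 1 rows
```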
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-16 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/243

MLP: Add minibatch gradient descent solver

JIRA: MADLIB-1206

This commit adds support for mini-batch based gradient descent for MLP.
If the input table contains a 2D matrix for independent variable,
minibatch is automatically used as the solver. Two minibatch specific
optimizers are also introduced: batch_size and n_epochs.
- batch_size is defaulted to min(200, buffer_size), where buffer_size is
  equal to the number of original input rows packed into a single row in
  the matrix.
- n_epochs is the number of times all the batches in a buffer are
  iterated over (default 1).
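As a toy illustration of the solver loop described above, with batch_size defaulting to min(200, buffer size) and the batch order randomized within each epoch; all names here are illustrative, not MADlib's internals.

```python
import random

def run_minibatch_epochs(buffer, n_epochs=1, batch_size=None, seed=0):
    """buffer: list of packed rows; batch_size defaults to min(200, buffer
    size), mirroring the defaulting rule described above (illustrative)."""
    if batch_size is None:
        batch_size = min(200, len(buffer))
    rng = random.Random(seed)
    batches = [buffer[i:i + batch_size]
               for i in range(0, len(buffer), batch_size)]
    visited = []
    for _ in range(n_epochs):
        order = list(range(len(batches)))
        rng.shuffle(order)      # randomize batch order within an epoch
        for b in order:
            visited.append(b)   # a real solver would take a gradient step here
    return batch_size, visited

bs, visited = run_minibatch_epochs(list(range(10)), n_epochs=2, batch_size=4)
# 10 rows / batch_size 4 -> 3 batches, each visited once per epoch
```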

Other changes include:
- dependent variable in the minibatch solver is also a matrix now. It
  was initially a vector.
- Randomize the order of processing a batch within an epoch.
- MLP minibatch currently doesn't support weights param, an error is
  thrown now.
- Delete an unused type named mlp_step_result.
- Add unit tests for newly added functions in python file.

Co-authored-by: Rahul Iyer <ri...@apache.org>
Co-authored-by: Nikhil Kak <n...@pivotal.io>

Closes #243

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib mlp-minibatch-with-preprocessed-data-rebased

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #243


commit d9306f7c6a44f64c53df13c34759da55468c4d26
Author: Nandish Jayaram <njayaram@...>
Date:   2018-02-28T00:51:42Z

MLP: Add minibatch gradient descent solver

JIRA: MADLIB-1206

This commit adds support for mini-batch based gradient descent for MLP.
If the input table contains a 2D matrix for independent variable,
minibatch is automatically used as the solver. Two minibatch specific
optimizers are also introduced: batch_size and n_epochs.
- batch_size is defaulted to min(200, buffer_size), where buffer_size is
  equal to the number of original input rows packed into a single row in
  the matrix.
- n_epochs is the number of times all the batches in a buffer are
  iterated over (default 1).

Other changes include:
- dependent variable in the minibatch solver is also a matrix now. It
  was initially a vector.
- Randomize the order of processing a batch within an epoch.
- MLP minibatch currently doesn't support weights param, an error is
  thrown now.
- Delete an unused type named mlp_step_result.
- Add unit tests for newly added functions in python file.

Co-authored-by: Rahul Iyer <ri...@apache.org>
Co-authored-by: Nikhil Kak <n...@pivotal.io>

Closes #243




---


[GitHub] madlib pull request #240: MLP: Fix step size initialization based on learnin...

2018-03-09 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/240

MLP: Fix step size initialization based on learning rate policy

JIRA: MADLIB-1212

The step_size is supposed to be updated based on the learning rate. The
formulae for different policies depend on the current iteration number,
which was not consumed correctly. This commit fixes it by using
it.iteration for the current iteration number in that update.
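To see why the current iteration number matters, here is a small sketch of iteration-dependent learning-rate policies. The formulas follow common conventions (Caffe-style `exp`, `inv`, and `step` schedules) and are illustrative, not MADlib's exact API.

```python
# Illustrative learning-rate schedules that depend on the current iteration
# number, the quantity the bug fix above ensures is passed correctly.
def step_size(base_lr, policy, iteration, gamma=0.9, power=0.5, step=10):
    if policy == 'constant':
        return base_lr
    if policy == 'exp':
        return base_lr * gamma ** iteration
    if policy == 'inv':
        return base_lr * (1.0 + gamma * iteration) ** (-power)
    if policy == 'step':
        return base_lr * gamma ** (iteration // step)
    raise ValueError("unknown policy: " + policy)

# If the iteration is stuck at 0 (the bug), every policy degenerates to
# base_lr; passing the true iteration makes the rate decay as intended.
lr_bug = step_size(0.1, 'exp', 0)
lr_ok = step_size(0.1, 'exp', 5)
```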

Co-authored-by: Nikhil Kak <n...@pivotal.io>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/mlp-learning-rate-policy

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #240


commit 72c6c8bdf5bfb50132e0566ca9600f82e99b7700
Author: Nandish Jayaram <njayaram@...>
Date:   2018-03-06T19:10:50Z

MLP: Fix step size initialization based on learning rate policy

JIRA: MADLIB-1212

The step_size is supposed to be updated based on the learning rate. The
formulae for different policies depend on the current iteration number,
which was not consumed correctly. This commit fixes it by using
it.iteration for the current iteration number in that update.

Co-authored-by: Nikhil Kak <n...@pivotal.io>




---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-08 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r173254181
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the
+classes (mainhue), with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- Grouping by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |    name     | landmass | zone | area | population | language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+             6 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             5 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             8 | 17 | Sweden      |        3 |    1 |  450 |          8 |        6 |       2 | blue
+             7 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             2 |  7 | Denmark     |        3 |    1 |   43 |          5 |        6 |       2 | red
+             1 |  6 | China       |        5 |    1 | 9561 |       1008 |        7 |       2 | red
+             4 | 12 | Luxembourg  |        3 |    1 |    3 |          0 |        4 |       3 | red
+             3 | 18 | Switzerland |        3 |    1 |   41 |          6 |        4 |       2 | red
+             1 |  2 | Australia   |        6 |    2 | 7690 |         15 |        1 |       3 | blue
+             1 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+             2 |  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
+             6 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
+             5 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
+             4 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
+            12 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            10 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            11 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+             1 | 19 | UK          |        3 |    4 |  245 |         56 |        1 |       3 | red
+             3 |  5 | Canada      |        1 |    4 | 9976 |         24 |        1 |       2 | red
+             2 | 15 | Portugal    |        3 |    4 |   92 |         10 |        6 |       5 | red
+             8 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+             7 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+             9 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below that
+'blue=<size>' is the only valid class size specification, since 'blue' is the
+only class value that is present in every group. Further, 'blue=8' in the
+example below will be split evenly between the four groups, resulting in two
+blue rows per group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=8', -- Specified class value size. Rest of the values 
are outputed as is.
+NULL, -- Output table size
+'zone'-- Group by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |  name   | landmass | zone | area | population | language | colours | mainhue
+---------------+----+---------+----------+------+------+------------+----------+---------+---------
+             2 | 17 | Sweden  |        3 |    1 |  450 |          8 |        6 |       2 | blue
+             1 |  8 | Greece  |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             3 |  3 | Austria |        3 |    1 |   84 |          8 |        4 |       2
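The size-splitting arithmetic described above ('blue=8' divided across four groups, two rows per group) can be sketched as follows; the helper is purely illustrative, not MADlib's implementation.

```python
# Toy sketch of splitting a requested class size across groups, as in the
# 'blue=8' example above: 8 rows over 4 groups gives 2 per group. When the
# size does not divide evenly, the remainder is handed out one row at a time.
def split_class_size_across_groups(total, groups):
    base, rem = divmod(total, len(groups))
    return {g: base + (1 if i < rem else 0) for i, g in enumerate(groups)}

alloc = split_class_size_across_groups(8, ['zone1', 'zone2', 'zone3', 'zone4'])
# -> {'zone1': 2, 'zone2': 2, 'zone3': 2, 'zone4': 2}
```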

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172954235
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, output_table, class_col,
         parsed_class_sizes = extract_keyvalue_params(class_sizes,
                                                      allow_duplicates=False,
                                                      lower_case_names=False)
+        distinct_levels = collate_plpy_result(
+            plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ".
+                         format(class_col, source_table)))['levels']
         if not parsed_class_sizes:
-            sampling_strategy_str = _validate_and_get_sampling_strategy(class_sizes,
-                                                                        output_table_size)
+            sampling_strategy_str = _validate_and_get_sampling_strategy(
+                class_sizes, output_table_size)
         else:
             sampling_strategy_str = None
             try:
                 for each_level, each_class_size in parsed_class_sizes.items():
-                    _assert(each_level in actual_level_counts,
+                    _assert(each_level in distinct_levels,
                             "Sample: Invalid class value specified ({0})".
-                           format(each_level))
+                            format(each_level))
                     each_class_size = int(each_class_size)
                     _assert(each_class_size >= 1,
                             "Sample: Class size has to be greater than zero")
                     parsed_class_sizes[each_level] = each_class_size
-
-            except TypeError:
+            except ValueError:
                 plpy.error("Sample: Invalid value for class_sizes ({0})".
                            format(class_sizes))
 
         # Get the number of rows to be sampled for each class level, based on
         # the input table, class_sizes, and output_table_size params. This also
         # includes info about the resulting sampling strategy, i.e., one of
         # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level.
-        target_class_sizes = _get_target_level_counts(sampling_strategy_str,
-                                                      parsed_class_sizes,
-                                                      actual_level_counts,
-                                                      output_table_size)
-
-        undersample_level_dict, oversample_level_dict, nosample_level_dict = \
-            _get_sampling_strategy_specific_dict(target_class_sizes)
-
-        # Get subqueries for each sampling strategy, so that they can be used
-        # together in one big query.
-
-        # Subquery that will be used to get rows as is for those class levels
-        # that need no sampling.
-        nosample_subquery = _get_nosample_subquery(
-            new_source_table, class_col, nosample_level_dict.keys())
-        # Subquery that will be used to sample those class levels that
-        # have to be oversampled.
-        oversample_subquery = _get_with_replacement_subquery(
-            schema_madlib, new_source_table, source_table_columns, class_col,
-            actual_level_counts, oversample_level_dict)
-        # Subquery that will be used to sample those class levels that
-        # have to be undersampled. Undersampling supports both with and without
-        # replacement, so fetch the appropriate subquery.
-        if with_replacement:
-            undersample_subquery = _get_with_replacement_subquery(
-                schema_madlib, new_source_table, source_table_columns, class_col,
-                actual_level_counts, undersample_level_dict)
-        else:
-            undersample_subquery = _get_without_replacement_subquery(
-                schema_madlib, new_source_table, source_table_columns, class_col,
-                actual_level_counts, undersample_level_dict)
-
-        # Merge the three subqueries using a UNION ALL clause.
-        union_all_subquery = ' UNION ALL '.join(
-            ['({0})'.format(subquery)
-             for subquery in [undersample_subquery, oversample_subquery, nosample_subquery]
-             if subquery])
-
-        final_query = """
-            CREATE TABLE {output_table} AS
-            SELECT row_number() OVER() AS {new_col_name}, *
+        grp_col_str, grp_cols = get_grouping_col_str(
+            schema_madlib, 'Balance sample', [NEW_ID_COLUMN
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172953687
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172958587
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -58,28 +60,64 @@ NOSAMPLE = 'nosample'
 NEW_ID_COLUMN = '__madlib_id__'
 NULL_IDENTIFIER = '__madlib_null_id__'
 
-def _get_level_frequency_distribution(source_table, class_col):
-    """ Returns a dict containing the number of rows associated with each class
+
+def _get_level_frequency_distribution(source_table, class_col,
+                                      grp_by_cols=None):
+    """ Count the number of rows for each class, partitioned by the grp_by_cols
+
+    Returns a dict containing the number of rows associated with each class
     level. Each class level count is converted to a string using ::text.
     None is a valid key in this dict, capturing NULL value in the database.
     """
+    if grp_by_cols and grp_by_cols.lower() != 'null':
+        is_grouping = True
+        grp_by_cols_comma = grp_by_cols + ', '
+        array_grp_by_cols_comma = "array[{0}]".format(grp_by_cols) + " as group_values, "
+    else:
+        is_grouping = False
+        grp_by_cols_comma = array_grp_by_cols_comma = ""
+
+    # In below query, the inner query groups the data using grp_by_cols + classes
+    # and obtains the count for each combination. The outer query then groups
+    # again by the grp_by_cols to collect the classes and counts in an array.
     query_result = plpy.execute("""
-        SELECT {class_col}::text AS classes,
-               count(*) AS class_count
-        FROM {source_table}
-        GROUP BY {class_col}
-        """.format(**locals()))
+        SELECT
+            -- For each group get the classes and their rows counts
+            {grp_identifier} as group_values,
+            array_agg(classes) as classes,
+            array_agg(class_count) as class_count
+        FROM(
+            -- for each group and class combination present in source table
+            -- get the count of rows for that combination
+            SELECT
+                {array_grp_by_cols_comma}
+                ({class_col})::TEXT AS classes,
+                count(*) AS class_count
+            FROM {source_table}
+            GROUP BY {grp_by_cols_comma} ({class_col})
+        ) q
+        GROUP BY {grp_identifier}
--- End diff --

In Greenplum 5.x, the above query fails on install check, with the following error:
```
SELECT balance_sample('"TEST_s"', 'out_sr2', 'gr1', 'undersample ', NULL, NULL, TRUE, TRUE);
psql:/tmp/madlib.2N5sjK/sample/test/balance_sample.sql_in.tmp:111: ERROR:  plpy.SPIError: non-integer constant in GROUP BY
LINE 17: GROUP BY true
                  ^
QUERY:
SELECT
    -- For each group get the classes and their rows counts
    true as group_values,
    array_agg(classes) as classes,
    array_agg(class_count) as class_count
FROM(
    -- for each group and class combination present in source table
    -- get the count of rows for that combination
    SELECT
        (gr1)::TEXT AS classes,
        count(*) AS class_count
    FROM "TEST_s"
    GROUP BY (gr1)
) q
GROUP BY true

CONTEXT:  Traceback (most recent call last):
  PL/Python function "balance_sample", line 23, in <module>
    return balance_sample.balance_sample(**globals())
  PL/Python function "balance_sample", line 575, in balance_sample
  PL/Python function "balance_sample", line 100, in _get_level_frequency_distribution
PL/Python function "balance_sample"
```


---


[GitHub] madlib issue #237: Bugfix: MLP predict using 1.12 model fails on later versi...

2018-02-22 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/237
  
Thank you for the comments @iyerr3, will push another commit.


---


[GitHub] madlib pull request #237: Bugfix: MLP predict using 1.12 model fails on late...

2018-02-21 Thread njayaram2
GitHub user njayaram2 opened a pull request:

https://github.com/apache/madlib/pull/237

Bugfix: MLP predict using 1.12 model fails on later versions

JIRA: MADLIB-1207

MADlib 1.12 did not support grouping in MLP. The summary table created
used to have the mean and std used for standardizing the independent
variable. From MADlib 1.13 onwards, grouping was supported, and the mean
and std were moved to the standardization table.
This resulted in a failure when MADlib 1.12 MLP models were used to
predict using MADlib 1.13. This commit fixes this issue.
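The fallback logic described above can be sketched as follows; the table names and the dict-based stand-in for the database are hypothetical, not MADlib's actual code.

```python
# Hedged sketch of the backward-compatibility idea: prefer the
# <model>_standardization table (MADlib 1.13+ layout), and fall back to the
# mean/std columns stored in the summary table for models trained with
# MADlib 1.12, which had no standardization table.
def load_standardization(tables):
    """tables: dict mapping table name -> row dict, standing in for the DB."""
    std_table = 'mlp_model_standardization'
    summary_table = 'mlp_model_summary'
    if std_table in tables:           # 1.13+ layout
        row = tables[std_table]
    else:                             # 1.12 layout: stats live in the summary
        row = tables[summary_table]
    return row['mean'], row['std']

old_model = {'mlp_model_summary': {'mean': [0.5], 'std': [1.2]}}
new_model = {'mlp_model_summary': {},
             'mlp_model_standardization': {'mean': [0.1], 'std': [2.0]}}
```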

Closes #237

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/madlib bug/mlp-1.12model-grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #237


commit 63daad10bf8c610d60be738f79e9bdc22ddc6483
Author: Nandish Jayaram <njayaram@...>
Date:   2018-02-22T01:18:57Z

Bugfix: MLP predict using 1.12 model fails on later versions

JIRA: MADLIB-1207

MADlib 1.12 did not support grouping in MLP. The summary table created
used to have the mean and std used for standardizing the independent
variable. From MADlib 1.13 onwards, grouping was supported, and the mean
and std were moved to the standardization table.
This resulted in a failure when MADlib 1.12 MLP models were used to
predict using MADlib 1.13. This commit fixes this issue.

Closes #237




---


[GitHub] madlib pull request #230: Balanced sets final

2018-02-05 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r166056096
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,748 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+m4_changequote(`')
+
+import math
+
+if __name__ != "__main__":
+    import plpy
+    from utilities.control import MinWarning
+    from utilities.utilities import _assert
+    from utilities.utilities import extract_keyvalue_params
+    from utilities.utilities import unique_string
+    from utilities.validate_args import columns_exist_in_table
+    from utilities.validate_args import get_cols
+    from utilities.validate_args import table_exists
+    from utilities.validate_args import table_is_empty
+else:
+    # Used only for Unit Testing
+    # FIXME: repeating a function from utilities that is needed by the unit test.
+    # This should be removed once a unittest framework is used for testing.
+    import random
+    import time
+
+    def unique_string(desp='', **kwargs):
+        """
+        Generate random temporary names for temp table and other names.
+        It has a SQL interface so both SQL and Python functions can call it.
+        """
+        r1 = random.randint(1, 1)
+        r2 = int(time.time())
+        r3 = int(time.time()) % random.randint(1, 1)
+        u_string = "__madlib_temp_" + desp + str(r1) + "_" + str(r2) + "_" + str(r3) + "__"
+        return u_string
+# ------------------------------------------------------------------------
+
+UNIFORM = 'uniform'
+UNDERSAMPLE = 'undersample'
+OVERSAMPLE = 'oversample'
+NOSAMPLE = 'nosample'
+
+NEW_ID_COLUMN = '__madlib_id__'
+NULL_IDENTIFIER = '__madlib_null_id__'
+
+def _get_frequency_distribution(source_table, class_col):
+    """ Returns a dict containing the number of rows associated with each class
+    level. Each class level value is converted to a string using ::text.
+    """
+    query_result = plpy.execute("""
+        SELECT {class_col}::text AS classes,
+               count(*) AS class_count
+        FROM {source_table}
+        GROUP BY {class_col}
+        """.format(**locals()))
+    actual_level_counts = {}
+    for each_row in query_result:
+        level = each_row['classes']
+        if level:
+            level = level.strip()
+        actual_level_counts[level] = each_row['class_count']
+    return actual_level_counts
+
+
+def _validate_and_get_sampling_strategy(sampling_strategy_str, output_table_size,
+                                        supported_strategies=None, default=UNIFORM):
+    """ Returns the sampling strategy based on the class_sizes input param.
+    @param sampling_strategy_str The sampling strategy specified by the
+                                 user (class_sizes param)
+    @returns:
+        Str. One of [UNIFORM, UNDERSAMPLE, OVERSAMPLE]. Default is UNIFORM.
+    """
+    if not sampling_strategy_str:
+        sampling_strategy_str = default
+    else:
+        if len(sampling_strategy_str) < 3:
+            # Require at least 3 characters since UNIFORM and UNDERSAMPLE have
+            # common prefix substring
+            plpy.error("Sample: Invalid class_sizes parameter")
+
+        if not supported_strategies:
+            supported_strategies = [UNIFORM, UNDERSAMPLE, OVERSAMPLE]
+        try:
+            # allow user to specify a prefix substring of
--- End diff --

There is precedence for supporting prefix for parameter values, in modules 
such as SVM.

Yes, the error messages could be similar.


---


  1   2   >