[GitHub] madlib pull request #240: MLP: Fix step size initialization based on learnin...

2018-03-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/240


---


[GitHub] madlib issue #241: MiniBatch Pre-Processor: Add new module minibatch_preproc...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on the issue:

https://github.com/apache/madlib/pull/241
  
Another issue I found but forgot to mention in the review:
The `__id__` column contains double values instead of integers. For
instance, I found values such as `0.2000` in that column of the output
table.
This issue, too, occurs only when the module is used without specifying a
value for the `buffer_size` param.
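
To illustrate the symptom, here is a minimal sketch of how a fractional id can arise. It rests on an assumption that this thread does not confirm: that the id is derived by dividing a row number by a buffer size held as a float when the default is computed. The function names `fractional_id` and `integer_id` are hypothetical and are not part of the module.

```python
def fractional_id(row_number, buffer_size):
    # True division by a float buffer size keeps the fractional part,
    # which would show up as values like 0.2 in an id column.
    return row_number / float(buffer_size)


def integer_id(row_number, buffer_size):
    # Casting the buffer size to int and using floor division maps every
    # row in the same buffer to the same integer id.
    return int(row_number // int(buffer_size))


# Row 5 with a buffer size of 25:
frac = fractional_id(5, 25.0)   # 0.2
whole = integer_id(5, 25.0)     # 0
```

If the cause is along these lines, casting the computed default buffer size to an integer before it is used would keep the ids integral.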


---


[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175548350
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+"""
+@file minibatch_preprocessing.py_in
+
+"""
+from math import ceil
+import plpy
+
+from utilities import add_postfix
+from utilities import _assert
+from utilities import get_seg_number
+from utilities import is_platform_pg
+from utilities import is_psql_numeric_type
+from utilities import is_string_formatted_as_array_expression
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from utilities import _string_to_array
+from utilities import validate_module_input_params
+from mean_std_dev_calculator import MeanStdDevCalculator
+from validate_args import get_expr_type
+from validate_args import output_tbl_valid
+from validate_args import _tbl_dimension_rownum
+
+m4_changequote(`')
+
+# These are readonly variables, do not modify
+MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
+MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
+
+class MiniBatchPreProcessor:
+    """
+    This class is responsible for executing the main logic of mini batch
+    preprocessing, which packs multiple rows of selected columns from the
+    source table into one row based on the buffer size
+    """
+    def __init__(self, schema_madlib, source_table, output_table,
+                 dependent_varname, independent_varname, buffer_size,
+                 **kwargs):
+        self.schema_madlib = schema_madlib
+        self.source_table = source_table
+        self.output_table = output_table
+        self.dependent_varname = dependent_varname
+        self.independent_varname = independent_varname
+        self.buffer_size = buffer_size
+
+        self.module_name = "minibatch_preprocessor"
+        self.output_standardization_table = add_postfix(self.output_table,
+                                                        "_standardization")
+        self.output_summary_table = add_postfix(self.output_table,
+                                                "_summary")
+        self._validate_minibatch_preprocessor_params()
+
+    def minibatch_preprocessor(self):
+        # Get array expressions for both dep and indep variables from the
+        # MiniBatchQueryFormatter class
+        dependent_var_dbtype = get_expr_type(self.dependent_varname,
+                                             self.source_table)
+        qry_formatter = MiniBatchQueryFormatter(self.source_table)
+        dep_var_array_str, dep_var_classes_str = qry_formatter.\
+            get_dep_var_array_and_classes(self.dependent_varname,
+                                          dependent_var_dbtype)
+        indep_var_array_str = qry_formatter.get_indep_var_array_str(
+            self.independent_varname)
+
+        standardizer = MiniBatchStandardizer(self.schema_madlib,
+                                             self.source_table,
+                                             dep_var_array_str,
+                                             indep_var_array_str,
+                                             self.output_standardization_table)
+        standardize_query = standardizer.get_query_for_standardizing()
+
+        num_rows_processed, num_missing_rows_skipped = self.\
+            _get_skipped_rows_processed_count(
+                dep_var_array_str,
+                indep_var_array_str)
+        calculated_buffer_size = MiniBatchBufferSizeCalculator.\
+            calculate_default_buffer_size(
+                self.buffer_size,
+

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175588969
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175593796
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175531202
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175522378
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -794,6 +794,41 @@ def collate_plpy_result(plpy_result_rows):
 # 
--
 
 
+def validate_module_input_params(source_table, output_table, independent_varname,
+                                 dependent_varname, module_name, **kwargs):
--- End diff --

How about adding an optional param that checks for residual output
tables (the summary and standardization tables)? It could take a list of
suffixes to check for.
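
A minimal sketch of that suggestion, under stated assumptions: the function name `validate_residual_output_tables` and the `table_exists` callable are hypothetical stand-ins (a real version would query the database catalog and would likely build names with `add_postfix` so that quoted identifiers are handled), and plain string concatenation is used here only to keep the example self-contained.

```python
def validate_residual_output_tables(output_table, table_exists, suffixes=None):
    """Raise ValueError if a residual table derived from output_table exists.

    output_table: base name of the module's output table.
    table_exists: hypothetical callable taking a table name, returning bool.
    suffixes: optional list of suffixes, e.g. ["_summary", "_standardization"].
              When omitted, no residual tables are checked.
    """
    for suffix in suffixes or []:
        residual_table = output_table + suffix
        if table_exists(residual_table):
            raise ValueError(
                "Output table '{0}' already exists. Drop it before "
                "calling the function.".format(residual_table))


# Usage with an in-memory stand-in for the catalog lookup.
existing_tables = {"out_summary"}
try:
    validate_residual_output_tables("out", existing_tables.__contains__,
                                    ["_summary", "_standardization"])
except ValueError as err:
    message = str(err)
```

Keeping the suffix list optional means existing callers that have no residual tables would not need to change.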


---


[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-19 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175585050
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@

[GitHub] madlib issue #243: MLP: Add minibatch gradient descent solver

2018-03-19 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/243
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/383/



---


[GitHub] madlib issue #245: Reduce Install Check run time

2018-03-19 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/245
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/382/



---


[GitHub] madlib pull request #245: Reduce Install Check run time

2018-03-19 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/245

Reduce Install Check run time

To reduce the total run time of install check, we looked at the five
modules that take the longest to run and modified their install check test
cases. See each commit for details.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib reduce_IC_run_time

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #245


commit bfb3ef87302d80412c72f82194007f088e828974
Author: Jingyi Mei 
Date:   2018-03-16T00:07:44Z

PCA: Refactor IC for pca_project to reduce run time

To reduce the run time for pca_project, we removed calls to
pca_sparse_train and pca_train. Instead, we directly insert data in the
sql file.
Also, we renamed some output table names for simplicity.

Co-authored-by: Nikhil Kak 

commit a744b396cd7d21a43f7ac1bfede963a1bae318d9
Author: Jingyi Mei 
Date:   2018-03-16T00:45:23Z

Decision Tree: Modify IC to reduce run time

1. We use a smaller array dataset as input to one of the test cases, which
reduced the decision tree IC run time by 2/3.
2. We reduce n_folds from 5 to 3 in one test case.

Co-authored-by: Nikhil Kak 

commit f61bf0e88a72506e299060f3e20f062d58338ec6
Author: Jingyi Mei 
Date:   2018-03-16T18:41:15Z

Random Forest: Clean up install check

This commit reorders test cases and adds comments. It also removes
unnecessary casts in queries.

commit 74c650ba3ef56206e86f461c00f61cb1c70fb78a
Author: Jingyi Mei 
Date:   2018-03-16T18:44:26Z

Random Forest: Reduce install check time

This commit changes the number of trees from 100 to 10 in two test cases,
so that the total run time of IC will be reduced.

Co-authored-by: Nikhil Kak 

commit c5b88d55a5a9eb25c5dc9ed01f546ab31c4b8f5e
Author: Jingyi Mei and Nikhil Kak 
Date:   2018-03-19T18:31:58Z

Elastic Net: Remove cross validation test

To reduce the Install Check run time for Elastic Net, we removed the
test case with cross validation.

Co-authored-by: Jingyi Mei 
Co-authored-by: Nikhil Kak 




---


[GitHub] madlib issue #240: MLP: Fix step size initialization based on learning rate ...

2018-03-19 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/240
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/381/



---