[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

2019-01-10 Thread reductionista
Github user reductionista commented on the issue:

https://github.com/apache/madlib/pull/342
  
I've added an optional dependent_offset parameter to shift the 
dependent_var values if desired.  

Also noticed a minor documentation issue in .py_in; fixed to match 
documentation in .sql_in (and actual behavior).


---


[GitHub] madlib pull request #342: Minibatch Preprocessor for Deep learning

2018-12-21 Thread reductionista
Github user reductionista commented on a diff in the pull request:

https://github.com/apache/madlib/pull/342#discussion_r243694232
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -51,6 +51,105 @@ m4_changequote(`')
 MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
 MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
 
+class MiniBatchPreProcessorDL:
+def __init__(self, schema_madlib, source_table, output_table,
+ dependent_varname, independent_varname, buffer_size,
+ normalizing_const, **kwargs):
+self.schema_madlib = schema_madlib
+self.source_table = source_table
+self.output_table = output_table
+self.dependent_varname = dependent_varname
+self.independent_varname = independent_varname
+self.buffer_size = buffer_size
+self.normalizing_const = normalizing_const
+self.module_name = "minibatch_preprocessor_DL"
+self.output_summary_table = add_postfix(self.output_table, 
"_summary")
+self._validate_args()
+self.num_of_buffers = self._get_num_buffers()
+
+def minibatch_preprocessor_dl(self):
+norm_tbl = unique_string(desp='normalized')
+# Create a temp table that has independent var normalized.
+scalar_mult_sql = """
+CREATE TEMP TABLE {norm_tbl} AS
+SELECT {self.schema_madlib}.array_scalar_mult(
+{self.independent_varname}::REAL[], 
(1/{self.normalizing_const})::REAL) AS x_norm,
+{self.dependent_varname} AS y,
+row_number() over() AS row_id
+FROM {self.source_table}
+""".format(**locals())
+plpy.execute(scalar_mult_sql)
+# Create the mini-batched output table
+if is_platform_pg():
+distributed_by_clause = ''
+else:
+distributed_by_clause= ' DISTRIBUTED BY (buffer_id) '
+sql = """
+CREATE TABLE {self.output_table} AS
+SELECT * FROM
+(
+SELECT {self.schema_madlib}.agg_array_concat(
+ARRAY[{norm_tbl}.x_norm::REAL[]]) AS {x},
+array_agg({norm_tbl}.y) AS {y},
+({norm_tbl}.row_id%{self.num_of_buffers})::smallint AS 
buffer_id
+FROM {norm_tbl}
+GROUP BY buffer_id
+) b
+{distributed_by_clause}
+""".format(x=MINIBATCH_OUTPUT_INDEPENDENT_COLNAME,
+   y=MINIBATCH_OUTPUT_DEPENDENT_COLNAME,
--- End diff --

I don't think we should change the names of these columns while batching 
tables.  IMO, the column names in the output table should remain the same as 
whatever they were in the input table; with the extra column "buffer_id" added. 
 eg, if they were x and y to begin with, they should remain x and y.  If they 
were features and labels, they should remain features and labels.

In other words, instead of the literal strings "independent_varname" and 
"dependent_varname", the names should be self.independent_varname and 
self.dependent_varname, as specified by the user.

If there is some reason why we want to force the user to always use the 
same column names in a batched table, then I'd suggest instead calling them 
either x and y, or independent_var and dependent_var.  Naming them 
"independent_varname" and "dependent_varname" is problematic for at least two 
reasons:
1. These columns contain numeric data, not variable names.  So the column 
heading would not reflect what is in the column.
and 
2. These column names conflict with the column names in the summary table, 
which actually do refer to variable names.  It also conflicts with the names of 
the parameters being passed in by the user.  I think users will be very 
confused if we give two different things exactly the same name in the same 
function.


---


[GitHub] madlib pull request #316: Build: Disable AppendOnly if available

2018-09-13 Thread reductionista
Github user reductionista commented on a diff in the pull request:

https://github.com/apache/madlib/pull/316#discussion_r217537741
  
--- Diff: src/ports/postgres/modules/utilities/control.py_in ---
@@ -158,6 +159,61 @@ class MinWarning(ContextDecorator):
  format(oldMsgLevel=self.oldMsgLevel))
 
 
+class AOControl(ContextDecorator):
+
+"""
+@brief: A wrapper that enables/disables the AO storage option
+"""
+
+def __init__(self, enable=False):
+self.to_enable = enable
+self.was_ao_enabled = False
+self.guc_exists = True
+self.storage_options_dict = dict()
+
+def _parse_gp_default_storage_options(self, 
gp_default_storage_options_str):
+""" Parse comma separated key=value pairs
+
+Example:
+ 
appendonly=false,blocksize=32768,compresstype=none,checksum=true,orientation=row
+"""
+self.storage_options_dict = 
extract_keyvalue_params(gp_default_storage_options_str)
+self.storage_options_dict['appendonly'] = bool(
+strtobool(self.storage_options_dict['appendonly']))
+
+def _join_gp_defaut_storage_options(self):
+return ','.join(['{0}={1}'.format(k, v)
+for k, v in self.storage_options_dict.iteritems()])
+
+def __enter__(self):
+try:
+_storage_options_str = plpy.execute(
+"show 
gp_default_storage_options")[0]["gp_default_storage_options"]
+self._parse_gp_default_storage_options(_storage_options_str)
+
+# Set APPENDONLY=False after backing up existing value
--- End diff --

Rest looks good, I will approve as soon as you update this comment.


---


[GitHub] madlib issue #314: Ubuntu support: Enable creation of gppkg on Ubuntu

2018-08-30 Thread reductionista
Github user reductionista commented on the issue:

https://github.com/apache/madlib/pull/314
  
Also:  a user who downloads the madlib gppkg for Ubuntu must install Alien 
first, before running gppkg -i on it.


---


[GitHub] madlib pull request #314: Ubuntu support: Enable creation of gppkg on Ubuntu

2018-08-27 Thread reductionista
Github user reductionista commented on a diff in the pull request:

https://github.com/apache/madlib/pull/314#discussion_r213069972
  
--- Diff: deploy/CMakeLists.txt ---
@@ -27,9 +26,21 @@ elseif(UNIX)
 )
 elseif(IS_DEBIAN)
 message(STATUS "Detected Debian version ${DEB_VERSION}")
-list(APPEND CPACK_GENERATOR
-DEB
-)
+# -- If cmake flag -DCREATE_RPM_FOR_UBUNTU is set to some
+# value, then we will create an RPM on Ubuntu. If that
+# flag is not set to any value in cmake, then we create
+# only .deb artifact on Ubuntu.
+# Note that package alien must already be installed for
--- End diff --

The comment makes it sound as if setting -DCREATE_RPM_FOR_UBUNTU would 
build both .rpm and .deb while not setting it "only" builds .deb.

But looking at the code itself, only one or the other is built but never 
both. Assuming this is intentional, I would suggest removing "only" from the 
comment for clarity.


---


[GitHub] madlib pull request #314: Ubuntu support: Enable creation of gppkg on Ubuntu

2018-08-27 Thread reductionista
Github user reductionista commented on a diff in the pull request:

https://github.com/apache/madlib/pull/314#discussion_r213063053
  
--- Diff: deploy/CMakeLists.txt ---
@@ -27,9 +26,21 @@ elseif(UNIX)
 )
 elseif(IS_DEBIAN)
 message(STATUS "Detected Debian version ${DEB_VERSION}")
-list(APPEND CPACK_GENERATOR
-DEB
-)
+# -- If cmake flag -DCREATE_RPM_FOR_UBUNTU is set to some
+# value, then we will create an RPM on Ubuntu. If that
+# flag is not set to any value in cmake, then we create
+# only .deb artifact on Ubuntu.
+# Note that package alien must already be installed for
+# building an RPM on Ubuntu.
+if(CREATE_RPM_FOR_UBUNTU)
--- End diff --

The comment makes it sound as if setting -DCREATE_RPM_FOR_UBUNTU would 
build both .rpm and .deb while not setting it "only" builds .deb.

But looking at the code itself, only one or the other is built but never 
both.  Assuming this is intentional, I would suggest removing "only" from the 
comment for clarity.


---


[GitHub] madlib pull request #309: Elastic Net: Fix bug related to grouping by column...

2018-08-13 Thread reductionista
Github user reductionista commented on a diff in the pull request:

https://github.com/apache/madlib/pull/309#discussion_r209807557
  
--- Diff: 
src/ports/postgres/modules/elastic_net/elastic_net_generate_result.py_in ---
@@ -34,26 +36,30 @@ def _elastic_net_generate_result(optimizer, 
iteration_run, **args):
 col_grp_key = args['col_grp_key']
 grouping_str = args['grouping_str']
 cols_types = dict(get_cols_and_types(args["tbl_source"]))
-grouping_str1 = grouping_column + ","
+grouping_cols_list = split_quoted_delimited_str(grouping_column)
+grouping_str1 = ','.join(['{0} AS {1}'.format(g, quote_ident(g))
+ for g in grouping_cols_list])
--- End diff --

Nice!  Would not have thought of this approach.


---