[GitHub] madlib issue #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/259
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/436/



---


[GitHub] madlib issue #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/madlib/pull/259
  
I've force-pushed after the rebase. This should now reflect the 
`dependent_vartype` change from the previous PR. 


---


[GitHub] madlib issue #244: Changes for Personalized Page Rank : Jira:1084

2018-04-10 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/244
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/435/



---


[GitHub] madlib issue #244: Changes for Personalized Page Rank : Jira:1084

2018-04-10 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/244
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/434/



---


[GitHub] madlib pull request #260: minibatch preprocessor improvements

2018-04-10 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/260#discussion_r180603956
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -397,8 +408,9 @@ class MiniBatchStandardizer:
 x_std_dev_str = self.x_std_dev_str)
 return query
 
-def _get_query_for_standardizing_with_grouping(self):
+def _create_table_for_standardizing_with_grouping(self):
--- End diff --

Why was the method name changed? The older name seems to be more apt, since 
this function is still returning the query, and not executing it (the same for 
`_create_table_for_standardizing_without_grouping()` too).
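
To make the naming concern concrete, here is a minimal sketch of the distinction. The class, method names, and the `execute` callable are hypothetical stand-ins, not the actual MADlib code: a `_get_query_*` method only builds and returns SQL text, while a name starting with `create` suggests the statement is actually run.

```python
class StandardizerSketch:
    """Hypothetical stand-in for MiniBatchStandardizer; not the MADlib code."""

    def __init__(self, source_table, output_table, execute):
        self.source_table = source_table
        self.output_table = output_table
        # `execute` is any callable that runs a SQL string (an assumption;
        # inside PL/Python this role would be played by something like plpy.execute).
        self.execute = execute

    def _get_query_for_standardizing(self):
        # Builds and returns the SQL text only; nothing is executed here.
        return "CREATE TEMP TABLE {0} AS SELECT * FROM {1}".format(
            self.output_table, self.source_table)

    def create_standardized_table(self):
        # The caller decides when the query actually runs.
        self.execute(self._get_query_for_standardizing())


executed = []  # records every statement that was "run"
sketch = StandardizerSketch("iris_data", "iris_data_std", executed.append)
query = sketch._get_query_for_standardizing()  # no side effect yet
sketch.create_standardized_table()             # now the query is run
```

With this split, a method that still returns the query string would keep a `_get_query_*` name, and only the method that runs it would carry a `create` name.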


---


[GitHub] madlib pull request #256: Minibatch Preprocessing: change default buffer siz...

2018-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/256


---


[GitHub] madlib pull request #258: RF: Comment out assert in flaky install check quer...

2018-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/258


---


[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/259#discussion_r180576675
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in ---
@@ -91,6 +92,22 @@ minibatch_preprocessor(
When this value is NULL, no grouping is used and a single preprocessing step
is performed for the whole data set.
   
+
+  one_hot_encode_int_dep_var (optional)
+   BOOLEAN. default: FALSE.
+  A flag to decide whether to one-hot encode dependent variables that are
+scalar integers. This parameter is ignored if the dependent variable is not a
+scalar integer.
+
+@note The mini-batch preprocessor automatically encodes
+dependent variables that are boolean and character types such as text, char and
+varchar.  However, scalar integers are a special case because they can be used
+in both classification and regression problems, so you must tell the mini-batch
+preprocessor whether you want to encode them or not. In the case that you have
+already encoded the dependent variable yourself,  you can ignore this parameter.
+Also, if you want to encode float values for some reason, cast them to text
+first.
--- End diff --

+1 for the explanation.
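
For readers unfamiliar with the term, a minimal sketch of what one-hot encoding a scalar integer dependent variable means. This is plain Python for illustration only; `one_hot_encode` is a hypothetical helper and says nothing about how the preprocessor implements the encoding internally.

```python
def one_hot_encode(labels):
    """Return (class_values, encoded_rows) for a list of scalar labels."""
    # The sorted distinct labels define the column order of the encoding.
    class_values = sorted(set(labels))
    index = {value: i for i, value in enumerate(class_values)}
    encoded = []
    for label in labels:
        row = [0] * len(class_values)
        row[index[label]] = 1  # exactly one position is set per row
        encoded.append(row)
    return class_values, encoded


# Integer class labels treated as categories (classification),
# rather than as numeric targets (regression).
class_values, encoded = one_hot_encode([1, 2, 2, 1])
```

Left unencoded, the same integers would be read as a numeric target, which is why the flag has to be set explicitly for integer columns.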


---


[GitHub] madlib issue #258: RF: Comment out assert in flaky install check query

2018-04-10 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/258
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/433/



---


[GitHub] madlib issue #260: minibatch preprocessor improvements

2018-04-10 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/260
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/432/



---


[GitHub] madlib issue #256: Minibatch Preprocessing: change default buffer size formu...

2018-04-10 Thread fmcquillan99
Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/256
  
LGTM

Default selection looks reasonable:

(0) data
```
DROP TABLE IF EXISTS iris_data;
CREATE TABLE iris_data(
id serial,
attributes numeric[],
class_text text,
class integer,
state text
);
INSERT INTO iris_data(id, attributes, class_text, class, state) VALUES
(1,ARRAY[5.0,3.2,1.2,0.2],'Iris_setosa',1,'Alaska'),
(2,ARRAY[5.5,3.5,1.3,0.2],'Iris_setosa',1,'Alaska'),
(3,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'Alaska'),
(4,ARRAY[4.4,3.0,1.3,0.2],'Iris_setosa',1,'Alaska'),
(5,ARRAY[5.1,3.4,1.5,0.2],'Iris_setosa',1,'Alaska'),
(6,ARRAY[5.0,3.5,1.3,0.3],'Iris_setosa',1,'Alaska'),
(7,ARRAY[4.5,2.3,1.3,0.3],'Iris_setosa',1,'Alaska'),
(8,ARRAY[4.4,3.2,1.3,0.2],'Iris_setosa',1,'Alaska'),
(9,ARRAY[5.0,3.5,1.6,0.6],'Iris_setosa',1,'Alaska'),
(10,ARRAY[5.1,3.8,1.9,0.4],'Iris_setosa',1,'Alaska'),
(11,ARRAY[4.8,3.0,1.4,0.3],'Iris_setosa',1,'Alaska'),
(12,ARRAY[5.1,3.8,1.6,0.2],'Iris_setosa',1,'Alaska'),
(13,ARRAY[5.7,2.8,4.5,1.3],'Iris_versicolor',2,'Alaska'),
(14,ARRAY[6.3,3.3,4.7,1.6],'Iris_versicolor',2,'Alaska'),
(15,ARRAY[4.9,2.4,3.3,1.0],'Iris_versicolor',2,'Alaska'),
(16,ARRAY[6.6,2.9,4.6,1.3],'Iris_versicolor',2,'Alaska'),
(17,ARRAY[5.2,2.7,3.9,1.4],'Iris_versicolor',2,'Alaska'),
(18,ARRAY[5.0,2.0,3.5,1.0],'Iris_versicolor',2,'Alaska'),
(19,ARRAY[5.9,3.0,4.2,1.5],'Iris_versicolor',2,'Alaska'),
(20,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'Alaska'),
(21,ARRAY[6.1,2.9,4.7,1.4],'Iris_versicolor',2,'Alaska'),
(22,ARRAY[5.6,2.9,3.6,1.3],'Iris_versicolor',2,'Alaska'),
(23,ARRAY[6.7,3.1,4.4,1.4],'Iris_versicolor',2,'Alaska'),
(24,ARRAY[5.6,3.0,4.5,1.5],'Iris_versicolor',2,'Alaska'),
(25,ARRAY[5.8,2.7,4.1,1.0],'Iris_versicolor',2,'Alaska'),
(26,ARRAY[6.2,2.2,4.5,1.5],'Iris_versicolor',2,'Alaska'),
(27,ARRAY[5.6,2.5,3.9,1.1],'Iris_versicolor',2,'Alaska'),
(28,ARRAY[5.0,3.4,1.5,0.2],'Iris_setosa',1,'Tennessee'),
(29,ARRAY[4.4,2.9,1.4,0.2],'Iris_setosa',1,'Tennessee'),
(30,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'Tennessee'),
(31,ARRAY[5.4,3.7,1.5,0.2],'Iris_setosa',1,'Tennessee'),
(32,ARRAY[4.8,3.4,1.6,0.2],'Iris_setosa',1,'Tennessee'),
(33,ARRAY[4.8,3.0,1.4,0.1],'Iris_setosa',1,'Tennessee'),
(34,ARRAY[4.3,3.0,1.1,0.1],'Iris_setosa',1,'Tennessee'),
(35,ARRAY[5.8,4.0,1.2,0.2],'Iris_setosa',1,'Tennessee'),
(36,ARRAY[5.7,4.4,1.5,0.4],'Iris_setosa',1,'Tennessee'),
(37,ARRAY[5.4,3.9,1.3,0.4],'Iris_setosa',1,'Tennessee'),
(38,ARRAY[6.0,2.9,4.5,1.5],'Iris_versicolor',2,'Tennessee'),
(39,ARRAY[5.7,2.6,3.5,1.0],'Iris_versicolor',2,'Tennessee'),
(40,ARRAY[5.5,2.4,3.8,1.1],'Iris_versicolor',2,'Tennessee'),
(41,ARRAY[5.5,2.4,3.7,1.0],'Iris_versicolor',2,'Tennessee'),
(42,ARRAY[5.8,2.7,3.9,1.2],'Iris_versicolor',2,'Tennessee'),
(43,ARRAY[6.0,2.7,5.1,1.6],'Iris_versicolor',2,'Tennessee'),
(44,ARRAY[5.4,3.0,4.5,1.5],'Iris_versicolor',2,'Tennessee'),
(45,ARRAY[6.0,3.4,4.5,1.6],'Iris_versicolor',2,'Tennessee'),
(46,ARRAY[6.7,3.1,4.7,1.5],'Iris_versicolor',2,'Tennessee'),
(47,ARRAY[6.3,2.3,4.4,1.3],'Iris_versicolor',2,'Tennessee'),
(48,ARRAY[5.6,3.0,4.1,1.3],'Iris_versicolor',2,'Tennessee'),
(49,ARRAY[5.5,2.5,4.0,1.3],'Iris_versicolor',2,'Tennessee'),
(50,ARRAY[5.5,2.6,4.4,1.2],'Iris_versicolor',2,'Tennessee'),
(51,ARRAY[6.1,3.0,4.6,1.4],'Iris_versicolor',2,'Tennessee'),
(52,ARRAY[5.8,2.6,4.0,1.2],'Iris_versicolor',2,'Tennessee');
```


(1) no groups, 2 segments, default buffer size
```
select * from iris_data_packed_summary;

-[ RECORD 1 ]------------+------------------------------
source_table             | iris_data
output_table             | iris_data_packed
dependent_varname        | class_text
independent_varname      | attributes
buffer_size              | 26
class_values             | {Iris_setosa,Iris_versicolor}
num_rows_processed       | 52
num_missing_rows_skipped | 0
grouping_cols            | 
```


(2) no groups, 2 segments, buffer size=10
```
madlib=# select * from iris_data_packed_summary;

-[ RECORD 1 ]------------+------------------------------
source_table             | iris_data
output_table             | iris_data_packed
dependent_varname        | class_text
independent_varname      | attributes
buffer_size              | 10
class_values             | {Iris_setosa,Iris_versicolor}
num_rows_processed       | 52
num_missing_rows_skipped | 0
grouping_cols            | 
```
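
As a rough check on summaries (1) and (2), assuming rows are packed into buffers of at most `buffer_size` rows each (an assumption; the packing logic is not shown in this thread), the number of packed rows would be the ceiling of `num_rows_processed / buffer_size`. The helper name is hypothetical.

```python
import math


def expected_buffer_count(num_rows, buffer_size):
    # Assumes each buffer holds at most `buffer_size` rows.
    return math.ceil(num_rows / buffer_size)


default_case = expected_buffer_count(52, 26)  # case (1): default buffer size
small_case = expected_buffer_count(52, 10)    # case (2): buffer size = 10
```

Under that assumption, case (1) gives 2 buffers, one per segment, and case (2) gives 6.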


(3) groups, 2 segments, default buffer size
```
select * from iris_data_packed_summary;

-[ RECORD 1 ]+--
source_table | iris_data
output_table   
```

[GitHub] madlib pull request #260: minibatch preprocessor improvements

2018-04-10 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/260

minibatch preprocessor improvements

This PR makes two improvements to the preprocessor code:

1. Check for all character types for dependent col
2. Create temp table for standardization.

See the commit for more details
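
A minimal sketch of what the first improvement amounts to, assuming a helper that compares a column's type name against the PostgreSQL character type names. The function names and the exact type lists are illustrative assumptions, not the functions added in this PR.

```python
# PostgreSQL character and boolean type names, including common aliases.
CHARACTER_TYPES = {"text", "varchar", "character varying", "char", "character"}
BOOLEAN_TYPES = {"boolean", "bool"}


def _base_type(type_name):
    # Drop a length modifier such as "(10)" and normalize case and whitespace.
    return type_name.split("(")[0].strip().lower()


def is_character_type(type_name):
    return _base_type(type_name) in CHARACTER_TYPES


def is_boolean_type(type_name):
    return _base_type(type_name) in BOOLEAN_TYPES
```

A check like `is_character_type("varchar(10)")` then accepts any character type for the dependent column, where a plain comparison against `text` would reject it.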

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/minibatch-preprocessing-improvements

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #260


commit d5e996a1eb3ea1d28151b48e435f40a3a764aa51
Author: Nikhil Kak 
Date:   2018-04-06T18:35:16Z

Utilities: Add functions for postgres character/boolean type comparison.

This commit adds two functions to check if a given type matches one of the 
predefined postgres character or boolean types.

commit 0f6ca99f4de32f1a235fed612d3b74bf822ef3f9
Author: Nikhil Kak 
Date:   2018-04-06T18:42:41Z

MiniBatch Preprocessor: Check for all character types for dependent col

This commit allows the dependent column to be any of the postgres character
types instead of just `text`.

commit e3462580b7d43589c8a52244029e056ce182a529
Author: Nikhil Kak 
Date:   2018-04-06T20:55:46Z

Minibatch Preprocessor: Create temp table for standardization.

We ran a few experiments, and the results showed that creating a temp table
for standardization is faster than using a subquery.
Before this commit, we called the `utils_normalize_data` function inside the
main query. Now we create a temp table from the output of
`utils_normalize_data` and use that table in the main query.
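
The pattern described in the last commit can be sketched as follows. This uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, with made-up table and column names; it illustrates only the "materialize first, then query" structure, not the `utils_normalize_data` implementation or the reported speedup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, x REAL)")
cur.executemany("INSERT INTO source (id, x) VALUES (?, ?)",
                [(1, 2.0), (2, 4.0), (3, 6.0)])

# Mean and population standard deviation of x, computed once.
mean, mean_sq = cur.execute("SELECT AVG(x), AVG(x * x) FROM source").fetchone()
std = (mean_sq - mean * mean) ** 0.5

# Step 1: materialize the standardized values in a temp table
# (stands in for the output of `utils_normalize_data`).
cur.execute("CREATE TEMP TABLE normalized (id INTEGER PRIMARY KEY, x_norm REAL)")
cur.execute("INSERT INTO normalized SELECT id, (x - ?) / ? FROM source",
            (mean, std))

# Step 2: the main query reads the temp table instead of a subquery.
rows = cur.execute("SELECT id, x_norm FROM normalized ORDER BY id").fetchall()
conn.close()
```

Whether the temp table is actually faster depends on the database and the data; the commit message reports that it was in the authors' experiments.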




---


[GitHub] madlib pull request #255: MLP: Remove source table dependency for predicting...

2018-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/255


---


[GitHub] madlib issue #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/259
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/431/



---


[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/madlib/pull/259

Minibatch: Add one-hot encoding option for int

JIRA: MADLIB-1226

Integer dependent variables can be used either in regression or
classification. To use in classification, they need to be one-hot
encoded. This commit adds an option to allow users to pick whether an integer
dependent input needs to be one-hot encoded or not. The flag is ignored if
the variable is not of integer type.

Other changes include adding an appropriate test in install-check,
code cleanup and PEP8 conformance.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/minibatch_one_hot_encode

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #259


commit 4729973d4e477cfef42cb21f8b8a3778171a5a3d
Author: Rahul Iyer 
Date:   2018-04-10T19:34:23Z

Minibatch: Add one-hot encoding option for int

JIRA: MADLIB-1226

Integer dependent variables can be used either in regression or
classification. To use in classification, they need to be one-hot
encoded. This commit adds an option to allow users to pick whether an integer
dependent input needs to be one-hot encoded or not. The flag is ignored if
the variable is not of integer type.

Other changes include adding an appropriate test in install-check,
code cleanup and PEP8 conformance.




---


[GitHub] madlib issue #255: MLP: Remove source table dependency for predicting regres...

2018-04-10 Thread fmcquillan99
Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/255
  
LGTM, see https://issues.apache.org/jira/browse/MADLIB-1223 for the tests I ran.



---


[GitHub] madlib issue #255: MLP: Remove source table dependency for predicting regres...

2018-04-10 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/255
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/430/



---