[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int

2018-04-12 Thread asfgit
GitHub user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/259


---


[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread njayaram2
GitHub user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/259#discussion_r180576675
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in ---
@@ -91,6 +92,22 @@ minibatch_preprocessor(
When this value is NULL, no grouping is used and a single preprocessing step
is performed for the whole data set.
   
+
+  one_hot_encode_int_dep_var (optional)
+   BOOLEAN. default: FALSE.
+  A flag to decide whether to one-hot encode dependent variables that are
+scalar integers. This parameter is ignored if the dependent variable is not a
+scalar integer.
+
+@note The mini-batch preprocessor automatically encodes
+dependent variables that are boolean and character types such as text, char and
+varchar.  However, scalar integers are a special case because they can be used
+in both classification and regression problems, so you must tell the mini-batch
+preprocessor whether you want to encode them or not. In the case that you have
+already encoded the dependent variable yourself,  you can ignore this parameter.
+Also, if you want to encode float values for some reason, cast them to text
+first.
--- End diff --

+1 for the explanation.
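
The rule described in the note can be sketched roughly as follows. This is a
minimal illustration, not the preprocessor's actual code: the function name
should_one_hot_encode and the two type sets are hypothetical, and only the
behavior stated in the note (boolean and character types are always encoded,
integers only when the flag is set, everything else left alone) is modeled.

```python
# Hypothetical type groups, chosen for illustration only; the real
# preprocessor inspects the column's database type.
ALWAYS_ENCODED = {"boolean", "text", "char", "varchar"}
INTEGER_TYPES = {"smallint", "integer", "bigint"}


def should_one_hot_encode(dep_var_type, one_hot_encode_int_dep_var=False):
    """Return True if a dependent variable of this type would be encoded."""
    if dep_var_type in ALWAYS_ENCODED:
        # Boolean and character types are encoded regardless of the flag.
        return True
    if dep_var_type in INTEGER_TYPES:
        # Scalar integers are ambiguous, so the caller decides.
        return bool(one_hot_encode_int_dep_var)
    # Any other type (for example floats or arrays): the flag is ignored.
    return False
```

Under these assumptions, a float column is never encoded even with the flag
set, which is why the note suggests casting such values to text first.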


---


[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int

2018-04-10 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/madlib/pull/259

Minibatch: Add one-hot encoding option for int

JIRA: MADLIB-1226

Integer dependent variables can be used in either regression or
classification. To be used in classification, they need to be one-hot
encoded. This commit adds an option that lets users choose whether an
integer dependent input is one-hot encoded. The flag is ignored if
the variable is not of integer type.

Other changes include adding an appropriate test in install-check,
code cleanup and PEP8 conformance.
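
For readers unfamiliar with the term, one-hot encoding an integer dependent
variable can be sketched as below. This is a standalone illustration in plain
Python, not the code in this pull request; the helper name one_hot_encode and
the choice to order classes by sorted value are assumptions made for the
example.

```python
def one_hot_encode(labels):
    """Map each integer label to a 0/1 vector with a single 1.

    Returns the sorted list of distinct classes and one encoded row per
    input label. The column order follows the sorted class values.
    """
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    encoded = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        encoded.append(row)
    return classes, encoded


# For regression the labels would be used as-is; for classification each
# label becomes an indicator vector over the distinct classes.
classes, encoded = one_hot_encode([3, 1, 3, 7])
```

Here the labels 3, 1, 3, 7 have three distinct classes (1, 3, 7), so each row
of encoded has three entries with exactly one set to 1.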

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/minibatch_one_hot_encode

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #259


commit 4729973d4e477cfef42cb21f8b8a3778171a5a3d
Author: Rahul Iyer 
Date:   2018-04-10T19:34:23Z

Minibatch: Add one-hot encoding option for int

JIRA: MADLIB-1226

Integer dependent variables can be used in either regression or
classification. To be used in classification, they need to be one-hot
encoded. This commit adds an option that lets users choose whether an
integer dependent input is one-hot encoded. The flag is ignored if
the variable is not of integer type.

Other changes include adding an appropriate test in install-check,
code cleanup and PEP8 conformance.




---