[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int
Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/259

---
[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/259#discussion_r180576675

    --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in ---
    @@ -91,6 +92,22 @@ minibatch_preprocessor(
         When this value is NULL, no grouping is used and a single preprocessing step
         is performed for the whole data set.
    +
    +  one_hot_encode_int_dep_var (optional)
    +  BOOLEAN. default: FALSE.
    +  A flag to decide whether to one-hot encode dependent variables that are
    +scalar integers. This parameter is ignored if the dependent variable is not a
    +scalar integer.
    +
    +@note The mini-batch preprocessor automatically encodes
    +dependent variables that are boolean and character types such as text, char and
    +varchar. However, scalar integers are a special case because they can be used
    +in both classification and regression problems, so you must tell the mini-batch
    +preprocessor whether you want to encode them or not. In the case that you have
    +already encoded the dependent variable yourself, you can ignore this parameter.
    +Also, if you want to encode float values for some reason, cast them to text
    +first.
    --- End diff --

    +1 for the explanation.

---
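The rule described in the quoted documentation can be illustrated with a small standalone sketch. This is not MADlib's implementation. The helper names `one_hot_encode` and `prepare_dependent_var` are hypothetical, and plain Python lists stand in for a table column. It only mirrors the stated behaviour: boolean and character values are always encoded, integers are encoded only when the flag is set, and anything else (floats, for example) is left alone unless cast to text first.

```python
def one_hot_encode(labels):
    """Map each label to a 0/1 vector with a single 1 at its class position.

    Classes are ordered by sorting the distinct labels (an assumption made
    for this sketch; the ordering MADlib uses is not stated in the thread).
    """
    classes = sorted(set(labels))
    position = {c: i for i, c in enumerate(classes)}
    return [[1 if position[y] == j else 0 for j in range(len(classes))]
            for y in labels]


def prepare_dependent_var(labels, one_hot_encode_int_dep_var=False):
    """Hypothetical stand-in for the preprocessor's dependent-variable step."""
    # Boolean and character labels are always encoded.
    if all(isinstance(y, (bool, str)) for y in labels):
        return one_hot_encode(labels)
    # Integers are ambiguous (classification or regression), so the caller
    # decides. bool is a subclass of int in Python, hence the extra check.
    if all(isinstance(y, int) and not isinstance(y, bool) for y in labels):
        if one_hot_encode_int_dep_var:
            return one_hot_encode(labels)
        return list(labels)
    # Any other type: the flag is ignored and values pass through unchanged.
    return list(labels)


if __name__ == "__main__":
    print(prepare_dependent_var([3, 1, 3, 2], one_hot_encode_int_dep_var=True))
    print(prepare_dependent_var([3, 1, 3, 2]))
    # Floats are not encoded by the flag; cast them to text first.
    print(prepare_dependent_var([str(v) for v in [1.5, 2.0]]))
```

In this sketch the flag defaults to `False`, so integer labels are treated as regression targets unless the caller opts in, matching the default in the quoted documentation.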
[GitHub] madlib pull request #259: Minibatch: Add one-hot encoding option for int
GitHub user iyerr3 opened a pull request:

    https://github.com/apache/madlib/pull/259

    Minibatch: Add one-hot encoding option for int

    JIRA: MADLIB-1226

    Integer dependent variables can be used in either regression or
    classification. To be used in classification, they need to be one-hot
    encoded. This commit adds an option that lets users choose whether an
    integer dependent input needs to be one-hot encoded or not. The flag is
    ignored if the variable is not of integer type.

    Other changes include adding an appropriate test in install-check,
    code cleanup and PEP8 conformance.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib feature/minibatch_one_hot_encode

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #259

commit 4729973d4e477cfef42cb21f8b8a3778171a5a3d
Author: Rahul Iyer
Date:   2018-04-10T19:34:23Z

    Minibatch: Add one-hot encoding option for int

    JIRA: MADLIB-1226

    Integer dependent variables can be used in either regression or
    classification. To be used in classification, they need to be one-hot
    encoded. This commit adds an option that lets users choose whether an
    integer dependent input needs to be one-hot encoded or not. The flag is
    ignored if the variable is not of integer type.

    Other changes include adding an appropriate test in install-check,
    code cleanup and PEP8 conformance.

---