[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 Thanks for following up on this, Felix. Still waiting for an agreement on this... Would like to have more direction on this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17864 what's next on this one?
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 We can log a warning or raise an error if the input column is int and the imputation strategy is mean. Would that be OK with you, @hhbyyh @MLnick?
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/17864 Shall we pay extra attention to the Int case? E.g. the input column contains Double.NaN, 1, 2. The current implementation will return 1.5 as the surrogate. I'm not sure that is what some users expect. It's fine by me; I'm just bringing up the issue in case it was missed.
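The Int case raised here can be reproduced with a minimal plain-Scala sketch (no Spark involved). The object `MeanSurrogateSketch` and the helper `meanIgnoringNaN` are hypothetical names introduced for illustration and are not part of the Imputer; the sketch assumes missing values are encoded as `Double.NaN`, as in the example.

```scala
object MeanSurrogateSketch {
  // Hypothetical helper: average of the non-missing entries, where NaN marks a missing value.
  def meanIgnoringNaN(values: Seq[Double]): Double = {
    val present = values.filterNot(_.isNaN)
    require(present.nonEmpty, "no non-missing values to average")
    present.sum / present.size
  }

  def main(args: Array[String]): Unit = {
    // Integer-valued observations 1 and 2 plus one missing entry.
    val column = Seq(Double.NaN, 1.0, 2.0)
    val surrogate = meanIgnoringNaN(column)
    println(s"surrogate = $surrogate") // prints "surrogate = 1.5"
  }
}
```

The surrogate 1.5 has no exact integer representation, which is why the output type of an integer column matters for the mean strategy.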
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 Could a committer take another look at this PR? Thanks.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @MLnick Thanks much for your comments. Yes, I think always returning Double is consistent with Python and R and also with other transformers in ML. Plus, as @hhbyyh mentioned, this makes the implementation easier. Would you mind taking a look at the code and letting me know if you have any suggestions for improvement? The doc is already updated to make it clear that it always returns Double regardless of the input type.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17864 Originally the idea behind only supporting double was as @sethah posted above - there could be some issues with handling of int casting etc. As mentioned originally, we did consider "always cast to double". The only issue with it is the potential for surprising users who may expect the type of the input column to be maintained in the imputation. Having said that I would be broadly ok with just appending a double output column, provided we update the docstrings / guide to make things very clear.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 Ping folks for comments/review. Many thanks. @viirya @MLnick @jkbradley @hhbyyh @yanboliang @BenFradet
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @hhbyyh @sethah @MLnick Could you take a look at the new commit? Thanks.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17864 **[Test build #76513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76513/testReport)** for PR 17864 at commit [`86c8a10`](https://github.com/apache/spark/commit/86c8a1061a366f96b3db8acd2a4d1ace3bb81ee3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17864 Merged build finished. Test PASSed.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17864 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76513/ Test PASSed.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17864 Merged build finished. Test PASSed.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17864 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76511/ Test PASSed.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17864 **[Test build #76511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76511/testReport)** for PR 17864 at commit [`6479965`](https://github.com/apache/spark/commit/6479965f5d49965dfa59fd65730668f1b8f3ddd5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17864 **[Test build #76513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76513/testReport)** for PR 17864 at commit [`86c8a10`](https://github.com/apache/spark/commit/86c8a1061a366f96b3db8acd2a4d1ace3bb81ee3).
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17864 **[Test build #76511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76511/testReport)** for PR 17864 at commit [`6479965`](https://github.com/apache/spark/commit/6479965f5d49965dfa59fd65730668f1b8f3ddd5).
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @hhbyyh Thanks for the suggestion. I have made a new commit that always casts the input to double and outputs the imputed column as double.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/17864 I imagine most Int features will need to be converted to Double for a Vector anyway, so returning Double regardless of the input type makes sense. It also makes the implementation more straightforward.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @sethah Thanks for summarizing the previous discussions. What are you suggesting for this PR? I think it makes sense to log a warning when imputing integer types with mean. In addition, perhaps we can set "median" as the default strategy.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17864 So the other PR https://github.com/apache/spark/pull/11601 is really long. For reference, I am picking out the discussions relevant to this PR (also, someone tell me if there's a better way to link to PR comments :)

@MLnick "what do you think about handling different numeric types in input/output columns? If the input is say IntType, then strategy mode and median is ok but mean is somewhat problematic - or are we ok with rounding to an Int? The alternative is the Imputer always appends a Double output column. I propose we either (a) do the cast back to input type, but if the user selected "mean" and the input type is not Float or Double, log a warning; or (b) only support Float and Double type for this initial version of the Imputer."

@jkbradley "Just catching up now... I like the idea of maintaining the input type. I'm imagining using an Imputer to fill in continuous features with the mean and categoricals with the mode. Later on, we could even check to see if a column is categorical (in the metadata) and throw an exception for mean. I'd prefer your option (b) to be safe."

@sethah "For reference, I checked scikit-learn and the Imputer class returns floats regardless of inputs. I also checked R package "mlr" and it appears to do the same. One concern with (a) would be if the true median was something like 5.0, but approxQuantile returned 4.9. Then, we cast back to IntegerType and return 4. I wasn't able to produce this situation when I briefly experimented with it, and also the median is already approximate, so I'm not sure if this is really a problem."
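The cast-back concern with option (a) can be sketched in plain Scala. This assumes the cast to an integer type drops the fractional part the way Scala's `toInt` does; the thread does not confirm how the Imputer would perform the cast. The value 4.9 is the hypothetical approximate median from the comment above, and `CastBackSketch` with its two helpers is an illustrative name, not Spark API.

```scala
object CastBackSketch {
  // Truncation toward zero, which is what Double.toInt does.
  def truncateToInt(x: Double): Int = x.toInt

  // Rounding to the nearest integer, shown for comparison.
  def roundToInt(x: Double): Int = math.round(x).toInt

  def main(args: Array[String]): Unit = {
    // Hypothetical approximate median when the true median is 5.0.
    val approxMedian = 4.9
    println(truncateToInt(approxMedian)) // prints 4
    println(roundToInt(approxMedian))    // prints 5
  }
}
```

Rounding instead of truncating would hide this particular error, but it would still turn a fractional mean such as 1.5 into an integer, so it does not remove the need to decide on the output type.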
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17864 Merged build finished. Test PASSed.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17864 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76468/ Test PASSed.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17864 **[Test build #76468 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76468/testReport)** for PR 17864 at commit [`e9ab39c`](https://github.com/apache/spark/commit/e9ab39c2bdca76dae2b5cc40f90e4f5b2f9416c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17864 @yanboliang @srowen @MLnick @jkbradley The example below shows the failure of Imputer on integer data.

```
// Assumes a spark-shell session where `spark` is already in scope.
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val df = spark.createDataFrame(
  Seq(
    (0, 1.0, 1.0, 1.0),
    (1, 11.0, 11.0, 11.0),
    (2, 1.5, 1.5, 1.5),
    (3, Double.NaN, 4.5, 1.5)
  )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")

val imputer = new Imputer()
  .setInputCols(Array("value1"))
  .setOutputCols(Array("out1"))

// Fitting on the column cast to IntegerType fails with the exception below.
imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))
// java.lang.IllegalArgumentException: requirement failed: Column value1 must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type IntegerType.
```
[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17864 **[Test build #76468 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76468/testReport)** for PR 17864 at commit [`e9ab39c`](https://github.com/apache/spark/commit/e9ab39c2bdca76dae2b5cc40f90e4f5b2f9416c8).