[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-08-08 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
Thanks for following up on this, Felix. 
We are still waiting for agreement on the approach, and I would like more 
direction on how to proceed. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-08-08 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17864
  
what's next on this one?





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-06-26 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
We can log a warning or raise an error if the input column is an integer type 
and the imputation strategy is mean.
Would that be OK with you, @hhbyyh @MLnick? 





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-06-25 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17864
  
Shall we pay extra attention to the Int case? E.g., the input column contains 
Double.NaN, 1, 2. 

The current implementation will return 1.5 as the surrogate. I'm not sure 
that's what some users would expect. 

It's fine by me; I'm just bringing up the issue in case it was missed.
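
To make the Int case above concrete, here is a minimal sketch in plain Scala (no Spark). The object and method names are hypothetical, and treating NaN as the missing-value marker is an assumption taken from the example in this comment. It is not the Imputer implementation.

```scala
// Illustration only: plain Scala, no Spark. Names are hypothetical.
object IntMeanSurrogate {
  // Mean of the non-missing values, with NaN treated as missing.
  def meanSurrogate(values: Seq[Double]): Double = {
    val present = values.filterNot(_.isNaN)
    present.sum / present.size
  }

  def main(args: Array[String]): Unit = {
    val column = Seq(Double.NaN, 1.0, 2.0)
    val surrogate = meanSurrogate(column)
    println(surrogate)       // 1.5: a fractional surrogate for an integer column
    println(surrogate.toInt) // 1: what a truncating cast back to Int would give
  }
}
```

If the output column stays Double, the surrogate is 1.5. If it were cast back to an integer type, the fractional part would be lost.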





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-06-23 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
Could any committer take another look at this PR? Thanks. 





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-25 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
@MLnick Thanks very much for your comments. Yes, I think always returning 
Double is consistent with Python and R and also with other transformers in ML. 
Plus, as @hhbyyh mentioned, this makes the implementation easier. Would you 
mind taking a look at the code and letting me know if you have any suggestions 
for improvement? The doc has already been updated to make it clear that it 
always returns Double regardless of the input type. 





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-25 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17864
  
Originally the idea behind only supporting double was as @sethah posted 
above - there could be some issues with handling of int casting etc. As 
mentioned originally, we did consider "always cast to double". The only issue 
with it is the potential for surprising users who may expect the type of the 
input column to be maintained in the imputation.

Having said that I would be broadly ok with just appending a double output 
column, provided we update the docstrings / guide to make things very clear.






[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-23 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
Ping folks for comments/review. Many thanks. 
@viirya @MLnick @jkbradley @hhbyyh @yanboliang @BenFradet 





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-10 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
@hhbyyh @sethah @MLnick 
Could you take a look at the new commit? Thanks. 





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17864
  
**[Test build #76513 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76513/testReport)**
 for PR 17864 at commit 
[`86c8a10`](https://github.com/apache/spark/commit/86c8a1061a366f96b3db8acd2a4d1ace3bb81ee3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17864
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17864
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76513/
Test PASSed.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17864
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17864
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76511/
Test PASSed.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17864
  
**[Test build #76511 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76511/testReport)**
 for PR 17864 at commit 
[`6479965`](https://github.com/apache/spark/commit/6479965f5d49965dfa59fd65730668f1b8f3ddd5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17864
  
**[Test build #76513 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76513/testReport)**
 for PR 17864 at commit 
[`86c8a10`](https://github.com/apache/spark/commit/86c8a1061a366f96b3db8acd2a4d1ace3bb81ee3).





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17864
  
**[Test build #76511 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76511/testReport)**
 for PR 17864 at commit 
[`6479965`](https://github.com/apache/spark/commit/6479965f5d49965dfa59fd65730668f1b8f3ddd5).





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
@hhbyyh Thanks for the suggestion. I have made a new commit that always 
casts the input to double and outputs the imputed column as double. 





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-05 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17864
  
I imagine most Int features will need to be converted to Double for a 
Vector, so returning Double regardless of the input type makes sense. It also 
makes the implementation more straightforward.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
@sethah Thanks for summarizing the previous discussions. 
What are you suggesting for this PR? I think it makes sense to log a 
warning when imputing integer types with mean. In addition, perhaps we can set 
"median" as the default strategy. 





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/17864
  
So the other PR https://github.com/apache/spark/pull/11601 is really long. 
For reference, I am picking out the relevant discussions to this PR (also 
someone tell me if there's a better way to link to pr comments :)

@MLnick "what do you think about handling different numeric types in 
input/output columns? If the input is say IntType, then strategy mode and median 
is ok but mean is somewhat problematic - or are we ok with rounding to an Int? 
The alternative is the Imputer always appends a Double output column.

I propose we either (a) do the cast back to input type, but if the user 
selected "mean" and the input type is not Float or Double, log a warning; or 
(b) only support Float and Double type for this initial version of the Imputer."

@jkbradley "Just catching up now... I like the idea of maintaining the 
input type. I'm imagining using an Imputer to fill in continuous features with 
the mean and categoricals with the mode. Later on, we could even check to see 
if a column is categorical (in the metadata) and throw an exception for mean.

I'd prefer your option (b) to be safe."

@sethah "For reference, I checked scikit-learn and the Imputer class 
returns floats regardless of inputs. I also checked R package "mlr" and it 
appears to do the same. One concern with a.) would be if the true median was 
something like 5.0, but approxQuantile returned 4.9. Then, we cast back 
to IntegerType and return 4. I wasn't able to produce this situation when I 
briefly experimented with it, and also the median is already approximate, so 
I'm not sure if this is really a problem."
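
The concern in the last quote can be illustrated with a minimal sketch in plain Scala (no Spark). The object and method names are hypothetical. The value 4.9 is the hypothetical approximate median from the quote, not a measured result. The sketch assumes the cast back to an integer type truncates, as the quote describes.

```scala
// Illustration only: plain Scala, no Spark. Names are hypothetical.
object CastBackToInt {
  // A truncating conversion, as in the "cast back to IntegerType" scenario.
  def truncate(x: Double): Int = x.toInt

  // Rounding to the nearest integer, shown for contrast.
  def roundNearest(x: Double): Int = math.round(x).toInt

  def main(args: Array[String]): Unit = {
    val approxMedian = 4.9 // hypothetical approximate median; true median is 5.0
    println(truncate(approxMedian))     // 4
    println(roundNearest(approxMedian)) // 5
  }
}
```

Always returning a Double output column sidesteps this choice, since no conversion back to the input type is needed.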






[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17864
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17864
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76468/
Test PASSed.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17864
  
**[Test build #76468 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76468/testReport)**
 for PR 17864 at commit 
[`e9ab39c`](https://github.com/apache/spark/commit/e9ab39c2bdca76dae2b5cc40f90e4f5b2f9416c8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17864
  
@yanboliang @srowen @MLnick @jkbradley 

The example below shows that Imputer fails on integer data. 
```
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val df = spark.createDataFrame(Seq(
  (0, 1.0, 1.0, 1.0),
  (1, 11.0, 11.0, 11.0),
  (2, 1.5, 1.5, 1.5),
  (3, Double.NaN, 4.5, 1.5)
)).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
val imputer = new Imputer()
  .setInputCols(Array("value1"))
  .setOutputCols(Array("out1"))
// Casting the input column to IntegerType makes fit() fail:
imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

// java.lang.IllegalArgumentException: requirement failed: Column value1 must
// be of type equal to one of the following types: [DoubleType, FloatType] but was
// actually of type IntegerType.
```






[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17864
  
**[Test build #76468 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76468/testReport)**
 for PR 17864 at commit 
[`e9ab39c`](https://github.com/apache/spark/commit/e9ab39c2bdca76dae2b5cc40f90e4f5b2f9416c8).

