[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14858


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r79282998
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -109,7 +114,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): Bucketizer = {
 transformSchema(dataset.schema, logging = true)
-val splits = dataset.stat.approxQuantile($(inputCol),
+val splits = 
dataset.select($(inputCol)).na.drop().stat.approxQuantile($(inputCol),
--- End diff --

@VinceShieh if you're interested in proceeding with the change you 
describe, go ahead. The new behavior should be documented explicitly, because I 
think it's the behavior one would already expect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r78518465
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -109,7 +114,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): Bucketizer = {
 transformSchema(dataset.schema, logging = true)
-val splits = dataset.stat.approxQuantile($(inputCol),
+val splits = 
dataset.select($(inputCol)).na.drop().stat.approxQuantile($(inputCol),
--- End diff --

Yes, I agree that a similar argument applies for approxQuantile methods. I 
think the most reasonable semantics are to ignore NaN as well.

QuantileSummaries should probably reject insertion of NaN too.

I'd support making that change as well here, and expanding the scope 
accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r78518538
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -109,7 +114,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): Bucketizer = {
 transformSchema(dataset.schema, logging = true)
-val splits = dataset.stat.approxQuantile($(inputCol),
+val splits = 
dataset.select($(inputCol)).na.drop().stat.approxQuantile($(inputCol),
--- End diff --

CC @thunterdb for an opinion on that one, as he has touched most of this 
code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-13 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r78513514
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -109,7 +114,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): Bucketizer = {
 transformSchema(dataset.schema, logging = true)
-val splits = dataset.stat.approxQuantile($(inputCol),
+val splits = 
dataset.select($(inputCol)).na.drop().stat.approxQuantile($(inputCol),
--- End diff --

@srowen I have one question here. The reason why we dont move this NaN 
filter into approxiQuantile or multipleApproxQuantiles is that, those apis are 
shared with sparkSQL? Becoz, personally I think it would look better if we put 
this filter inside multipleApproxQuantiles, though it would introduce more 
changes and,  should make sure it doesnt impact other components other than 
mllib.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77489786
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

What would you do with NaN in approxQuantile except ignore it? it can't 
affect a quantile. The question of what the bucketizer should do with NaN is 
separate. I'd still favor one behavior there, but keeping that separate, let's 
figure out what approxQuantile should do.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-04 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77465278
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

@srowen  then maybe we should, as we discussed earlier on JIRA, align with 
R, by having a NaN checker in approxQuantile, that is, having a NaN filter 
inside of approxQuantile, rather than ahead of calling approxQuantile.

We can also have a same flag for user to choose to either remove NaN values 
or throw an error when there is NaN in data,  although, this API change will 
introduce collateral impact on several existing function calls. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77320714
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

Sure it could be faster, but it's also incorrect. We can't return a wrong 
answer here to save time. Do we disagree on the effect of including NaN when 
computing the quantiles?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77283036
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

Nit. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77156190
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

I would suggest:

The number of bins is set by the `numBuckets` parameter. It's possible that 
the number of buckets used will be less than this value, for example, if there 
are too few distinct values of the input to create enough distinct quantiles. 
Note also that NaN values are handled specially and placed into their own 
bucket. For example, if 4 buckets are used, then non-NaN data will be put into 
buckets 0-3, but NaNs will be counted in a special bucket 4.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77145656
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

for this part of documentation, how about put it:
"The number of bins is set by the `numBuckets` parameter, but the returned 
'actualNumBuckets' might differ from what was request depending on the data 
sample value; while data will go into Buckets[0] to Buckets[actualNumBuckets - 
1] and NaN value, if existed, will go into Buckets[actualNumBuckets]"
sounds good?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77141804
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

It's always possible to have less data than buckets. The problem here is 
that you might have enough non-NaN data, even, to properly determine distinct 
buckets, but fail to do so because of NaNs making some splits NaN. You'd end up 
with fewer splits than intended when you could have created all meaningful 
splits.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77138983
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

consider situation, when a high proportion of duplicated data and/or NaN 
exist in a data sample, the exact number of buckets is hard to get, it could be 
less than/equal to/ more than 'numBuckets'. what we can be sure is that, the 
NaN value if existed will be grouped in the last bucket.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77138037
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

hmm, if we filter NaN before the data, with size m, goes to approxQuantile, 
the increased complexity would be o(m), filter NaN in the splits would normally 
be more cheap, with number of splits far less than that of entire dataset. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77135297
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

It's a correctness issue though. You will compute the wrong splits. Why is 
it complex? just filter the input, it seems.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77134887
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

I didnt take this approach, becoz it will increase the complexity, 
especially when the dataset is huge


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77133636
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

Ah, I think this is a little bit at odds with the intent. We need to filter 
NaN before the data goes to approxQuantile. They should have no input on the 
quantiles. Then there's no need to filter out NaN from splits.

The message below can then remain unchanged from what it was before. This 
message is not related to the behavior of NaN.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77132904
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

OK, but this doesn't specify the behavior. It should be explicit that while 
data will go into buckets 0 through numBuckets-1, that NaN values will be 
counted in bucket numBuckets.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76953758
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,11 +114,9 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
-if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
-}
+val distinctSplits = splits.filter(!_.isNaN).distinct
+log.info(s"The exact number of returned buckets might differ from what 
was requested" +
--- End diff --

OK, though now this is logged unconditionally. I think we'd only want to 
log (an info-level message) if the number of buckets didn't match the request, 
which is what you had previously?

Separately, I think the behavior of NaN has to be documented somewhere in 
this class too, to make people aware that it's _always_ possible to get an 
extra bucket of data if there are NaNs.

Otherwise looking good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-30 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76773626
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +114,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
 if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
+  log.warn(s"Bucketizing to ${distinctSplits.length - 1} buckets if 
some quantiles" +
+s" were identical; or ${distinctSplits.length} buckets if NaN 
value existed")
--- End diff --

I think we just put it "the returned number of buckets might differ from 
what was requested depending on the data sample values. " , since the result 
number could be less than/equal to the requested number when same quantiles 
were spotted. And if no same quantiles exists in the splits, but dataset has 
NaN value, the actually number of buckets would then be  greater than requested.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76771034
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +114,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
 if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
+  log.warn(s"Bucketizing to ${distinctSplits.length - 1} buckets if 
some quantiles" +
+s" were identical; or ${distinctSplits.length} buckets if NaN 
value existed")
--- End diff --

I think this is now a bit confusing, since it's reporting different things 
based on state that isn't logged for the user. If it's hard just say "bucketing 
to fewer buckets" as before at this stage.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-30 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76738479
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -106,18 +106,19 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 @Since("1.6.0")
 object Bucketizer extends DefaultParamsReadable[Bucketizer] {
 
-  /** We require splits to be of length >= 3 and to be in strictly 
increasing order. */
+  /** We require splits to be of length >= 3 and to be in strictly 
increasing order.
+* No NaN split should be accepted. */
   private[feature] def checkSplits(splits: Array[Double]): Boolean = {
 if (splits.length < 3) {
   false
 } else {
   var i = 0
   val n = splits.length - 1
   while (i < n) {
-if (splits(i) >= splits(i + 1)) return false
+if (splits(i) >= splits(i + 1) || splits(i).isNaN) return false
--- End diff --

add a safe checker for NaN split here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-30 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76738333
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +114,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
 if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
+  log.warn(s"Bucketizing to ${distinctSplits.length - 1} buckets if 
some quantiles" +
+s" were identical; or ${distinctSplits.length} buckets if NaN 
value existed")
--- End diff --

if we are to give user the exact number of returned buckets, we need to go 
through the whole input dataset to check whether NaN value exists, the 
computation incurred just for a log warning msg is too high, so I choose to 
give user such msg instead


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76572693
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 
 val distinctSplits = splits.distinct
 if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
+  log.warn(s"Some quantiles were identical. Will return fewer buckets 
than requested!")
--- End diff --

Yes, another reason to reject NaN splits.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76572410
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 
 val distinctSplits = splits.distinct
 if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
+  log.warn(s"Some quantiles were identical. Will return fewer buckets 
than requested!")
--- End diff --

when the splits has NaN, we have to know the number of non-NaN value in the 
splits to get the exact number of returned buckets, considering splits :{ -Inf, 
-1, 1, Inf, NaN, NaN}, the old msg will actually get us a wrong result


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76572124
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
 
   /** @group setParam */
   @Since("1.4.0")
-  def setSplits(value: Array[Double]): this.type = set(splits, value)
+  def setSplits(value: Array[Double]): this.type = set(splits, 
value.filter(!_.isNaN))
--- End diff --

Yeah, that seems like an error. NaN has no ordering so it doesn't make 
sense to consider or return them as part of quantiles. I think these other 
classes have to handle this case too and they can be rejected here. It's not 
something this class can handle for them because it doesn't know what the 
quantiles would have been without NaN.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76571886
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -129,17 +129,21 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
 if (feature == splits.last) {
   splits.length - 2
 } else {
-  val idx = ju.Arrays.binarySearch(splits, feature)
-  if (idx >= 0) {
-idx
+  if (feature.isNaN) {
--- End diff --

I think this might just be me, but, it felt like the first two checks were 
at the same level:

```
if (...) {
} else if (...) {
} else {
  ...
}
```

Or alternatively, I could see handling the `feature == splits.last` case in 
the `idx >= 0` case. Really that's the case that handle exact match of a value 
and right now it's split across two clauses. Then it makes sense to start the 
method by handling NaN alone.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76571614
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 
 val distinctSplits = splits.distinct
 if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
+  log.warn(s"Some quantiles were identical. Will return fewer buckets 
than requested!")
--- End diff --

I think this message was OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76571166
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
 
   /** @group setParam */
   @Since("1.4.0")
-  def setSplits(value: Array[Double]): this.type = set(splits, value)
+  def setSplits(value: Array[Double]): this.type = set(splits, 
value.filter(!_.isNaN))
--- End diff --

rejecting? do you mean we throw an error for splits with NaN ?
User might not specify NaN bounded splits, but it could be set by other 
function calls, such as approxQuantile, from a dataset with quite a number of 
NaN


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76570646
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 
 val distinctSplits = splits.distinct
 if (splits.length != distinctSplits.length) {
-  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
-s" buckets as a result.")
+  log.warn(s"Some quantiles were identical. Will return fewer buckets 
than requested!")
--- End diff --

considering cases when, there are duplicated non-nan values, along with 
multiple NaNs in the splits, to get the exact number of buckets, will require 
extra computation and increase complexity which I dont think it worths for a 
warning msg


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76570285
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
 
   /** @group setParam */
   @Since("1.4.0")
-  def setSplits(value: Array[Double]): this.type = set(splits, value)
+  def setSplits(value: Array[Double]): this.type = set(splits, 
value.filter(!_.isNaN))
--- End diff --

How about just rejecting splits with NaN? I don't see a valid reason a 
caller could specify splits bounded by NaN since it has no ordering.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76569942
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
 
   /** @group setParam */
   @Since("1.4.0")
-  def setSplits(value: Array[Double]): this.type = set(splits, value)
+  def setSplits(value: Array[Double]): this.type = set(splits, 
value.filter(!_.isNaN))
--- End diff --

we only accept non-Nan splits


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r76569900
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -129,17 +129,21 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
 if (feature == splits.last) {
   splits.length - 2
 } else {
-  val idx = ju.Arrays.binarySearch(splits, feature)
-  if (idx >= 0) {
-idx
+  if (feature.isNaN) {
+splits.length - 1
--- End diff --

index NaN value as ${splits.length -1}
''' all the other changes within this files, without comments, are just for 
code refactoring 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
GitHub user VinceShieh opened a pull request:

https://github.com/apache/spark/pull/14858

[SPARK-17219][ML] Add NaN value handling in Bucketizer

## What changes were proposed in this pull request?
This PR fixes an issue when a cutpoints vector containing NaN is sent to 
Bucketizer.
Sometimes, null value might also be useful to users, so in these cases, 
Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal 
exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VinceShieh/spark spark-17219

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14858.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14858


commit bfb5b333d0a4e4a9d05a25cc0d47a5cdbd496965
Author: VinceShieh 
Date:   2016-08-26T13:02:31Z

[SPARK-17219][ML] Add NaN splits handling in Bucketizer

This PR fixes an issue when a cutpoints vector containing NaN is sent to 
Bucketizer.
Sometimes, null value might also be useful to users, so in these cases, 
Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal 
exception.

Two unit tests added in BucketizerSuite

Signed-off-by: VinceShieh 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org