[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15428


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84731699
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
 It is possible that the number of buckets used will be less than this value, for example, if
 there are too few distinct values of the input to create enough distinct quantiles. Note also
-that NaN values are handled specially and placed into their own bucket. For example, if 4
-buckets are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in
-a special bucket(4).
+that QuantileDiscretizer will raise an error when it finds NaN value in the dataset, but user
--- End diff --

This isn't available in Python yet, so can you please revert this change to 
feature.py?





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84731675
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -100,6 +102,33 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("1.6.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.
--- End diff --

I'd just write: ```default: "error"```
Rewording as "report" instead of "throw" could confuse people.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84731665
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,52 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle" +
+    "invalid entries. Options are skip (which will filter out rows with invalid values), or" +
+    "error (which will throw an error), or keep (which will keep the invalid values" +
+    " in certain way). Default behaviour is to report an error for invalid entries.",
+    ParamValidators.inArray(Array("skip", "error", "keep")))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def gethandleInvalid: Option[Boolean] = $(handleInvalid) match {
--- End diff --

This should just return ```$(handleInvalid)```, just like any other Param 
getter method.
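
For reference, the conventional Spark ML getter shape this comment asks for would look roughly like the sketch below (names and annotations follow the surrounding diff, not a final API):

```scala
/** @group getParam */
@Since("2.1.0")
def getHandleInvalid: String = $(handleInvalid)
```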





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84731655
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,52 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.
--- End diff --

I'd just write: ```default: "error"```
Rewording as "report" instead of "throw" could confuse people.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84563124
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -82,14 +82,23 @@ class QuantileDiscretizerSuite
   .setInputCol("input")
   .setOutputCol("result")
   .setNumBuckets(numBuckets)
-
-// Reserve extra one bucket for NaN
-val expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
-val result = discretizer.fit(df).transform(df)
-val observedNumBuckets = result.select("result").distinct.count
-assert(observedNumBuckets == expectedNumBuckets,
-  s"Observed number of buckets are not correct." +
-s" Expected $expectedNumBuckets but found $observedNumBuckets")
+val splits = Array(Double.NegativeInfinity, 1.0, Double.PositiveInfinity)
+
+for (handleNaN <- Array("keep", "skip")) {
+  discretizer.setHandleNaN(handleNaN)
+  val expectedNumBuckets = {
--- End diff --

This is kind of complex logic just to get the expected number of splits.  
I'd recommend just putting the expected bucket *values* in the original 
DataFrame:
```
val df = sc.parallelize(Seq(
  (1.0, /*expected value for option "keep"*/, /*expected value for option "skip"*/),
  ...
)).toDF("input", "expectedKeep", "expectedSkip")
```
Then you can compare with the actual values.  That'll be an easier test to 
understand IMO.
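
One hypothetical shape such a test could take is sketched below; the rows, expected bucket values, and the `-1.0` sentinel are made up for illustration and would depend on the splits actually used in the suite:

```scala
// Sketch only: hypothetical data following the reviewer's suggested layout.
import org.apache.spark.sql.Row

val df = sc.parallelize(Seq(
  (0.5, 0.0, 0.0),        // ordinary value: same bucket under "keep" and "skip"
  (1.5, 1.0, 1.0),
  (Double.NaN, 2.0, -1.0) // NaN: extra bucket under "keep"; sentinel for a row "skip" drops
)).toDF("input", "expectedKeep", "expectedSkip")

val kept = discretizer.setHandleNaN("keep").fit(df).transform(df)
kept.select("result", "expectedKeep").collect().foreach {
  case Row(actual: Double, expected: Double) => assert(actual == expected)
}
```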





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84563480
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleNaN: Param[String] = new Param[String](this, "handleNaNs", "how to handle NaN" +
+    "entries. Options are skip (which will filter out rows with NaN values), or error" +
+    "(which will throw an error), or keep (which will make the NaN values an extra bucket)." +
+    "More options may be added later", ParamValidators.inArray(Array("skip", "error", "keep")))
--- End diff --

Remove "More options may be added later."





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84563237
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
 It is possible that the number of buckets used will be less than this value, for example, if
 there are too few distinct values of the input to create enough distinct quantiles. Note also
-that NaN values are handled specially and placed into their own bucket. For example, if 4
-buckets are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in
-a special bucket(4).
+that QuantileDiscretizer will raise an error when it finds NaN value in the dataset, but user
--- End diff --

Actually no need to update Python API until it is updated to include 
handleNaN





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84563388
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -100,6 +102,24 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("1.6.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
--- End diff --

State default





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84563374
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -100,6 +102,24 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("1.6.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleNaN: Param[String] = new Param[String](this, "handleNaNs", "how to handle NaN" +
--- End diff --

Add getHandleNaN





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84563479
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
--- End diff --

Remove "More options may be added later."

Also state default.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84561834
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleNaN: Param[String] = new Param[String](this, "handleNaNs", "how to handle NaN" +
+    "entries. Options are skip (which will filter out rows with NaN values), or error" +
+    "(which will throw an error), or keep (which will make the NaN values an extra bucket)." +
+    "More options may be added later", ParamValidators.inArray(Array("skip", "error", "keep")))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def getHandleNaN: Option[Boolean] = $(handleNaN) match {
+case "keep" => Some(true)
+case "skip" => Some(false)
+case _ => None
+  }
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setHandleNaN(value: String): this.type = set(handleNaN, value)
+  setDefault(handleNaN, "error")
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema)
-val bucketizer = udf { feature: Double =>
-  Bucketizer.binarySearchForBuckets($(splits), feature)
+
+val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Ah, sorry, one more comment.  I'm not quite sure how closure capture 
behaves currently, but it might be good to define local vals for 
```$(splits)``` and ```getHandleNaN.isDefined && getHandleNaN.get```.  Since 
these reference methods in the Bucketizer class, I believe the UDF may capture 
the whole Bucketizer class instead of just those vals.

After you define them in local vals here, you can use those vals in this 
UDF.
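
A sketch of the change the comment describes is below; the val names are illustrative, and the NaN-handling branch of the UDF body is omitted since it depends on the rest of the PR:

```scala
// Read the Param values into local vals first, so the UDF closure captures
// just these two values rather than the whole Bucketizer instance.
val localSplits: Array[Double] = $(splits)
val keepNaN: Boolean = getHandleNaN.isDefined && getHandleNaN.get

val bucketizer = udf { feature: Double =>
  // body unchanged, but referring only to localSplits and keepNaN
  Bucketizer.binarySearchForBuckets(localSplits, feature)
}
```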





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84562692
  
--- Diff: docs/ml-features.md ---
@@ -1104,9 +1104,11 @@ for more details on the API.
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
 categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
 that the number of buckets used will be less than this value, for example, if there are too few
-distinct values of the input to create enough distinct quantiles. Note also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+distinct values of the input to create enough distinct quantiles. Note also that QuantileDiscretizer
--- End diff --

Yep, you're right





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-19 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84205467
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
 It is possible that the number of buckets used will be less than this value, for example, if
--- End diff --

same as the comment above.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-19 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84205458
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
  * categorical features. The number of bins can be set using the `numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, for example, if there
- * are too few distinct values of the input to create enough distinct quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, for example, if there are
--- End diff --

same as the comment above





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-19 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r84205414
  
--- Diff: docs/ml-features.md ---
@@ -1104,9 +1104,11 @@ for more details on the API.
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
 categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
 that the number of buckets used will be less than this value, for example, if there are too few
-distinct values of the input to create enough distinct quantiles. Note also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+distinct values of the input to create enough distinct quantiles. Note also that QuantileDiscretizer
--- End diff --

In cases when the number of buckets requested by the user is greater than 
the number of distinct splits generated from Bucketizer, the returned number of 
buckets will be less than requested.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911499
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -114,6 +115,7 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
   .setInputCol("feature")
--- End diff --

Just set splits here; there's no need to set the other Params for this test.
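
A sketch of the slimmed-down setup the comment suggests (the splits array here is illustrative, not taken from the suite):

```scala
// Only splits matters for this test; other Params can stay at their defaults.
val bucketizer = new Bucketizer()
  .setSplits(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
```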





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83760190
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+  setDefault(handleInvalid, "error")
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema)
-val bucketizer = udf { feature: Double =>
-  Bucketizer.binarySearchForBuckets($(splits), feature)
+val bucketizer: UserDefinedFunction = udf { (feature: Double, flag: String) =>
--- End diff --

Use Boolean flag internally to avoid the string comparison on each call.
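
A sketch of what that could look like; the names and the extra-bucket index are illustrative, not the PR's final code:

```scala
// Resolve the string Param to a Boolean once, outside the UDF, so each row
// pays a cheap Boolean test instead of a string comparison.
val keepInvalid: Boolean = $(handleInvalid) == "keep"
val localSplits: Array[Double] = $(splits)

val bucketizer = udf { feature: Double =>
  if (feature.isNaN && keepInvalid) {
    localSplits.length - 1.0 // hypothetical index of the extra NaN bucket
  } else {
    Bucketizer.binarySearchForBuckets(localSplits, feature)
  }
}
```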





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911464
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
 * categorical features. The number of bins can be set using the `numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, for example, if there
- * are too few distinct values of the input to create enough distinct quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, for example, if there are
+ * too few distinct values of the input to create enough distinct quantiles. Note also that
+ * QuantileDiscretizer will raise an error when it finds NaN value in the dataset, but user can
+ * also choose to either keep or remove NaN values within the dataset by calling setHandleInvalid.
+ * If user chooses to keep NaN values, they will be handled specially and placed into their own
--- End diff --

This documentation should go in the handleInvalid Param doc string.
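
One way to fold that documentation into the Param itself might look like the sketch below (wording and formatting are illustrative, not the merged text):

```scala
val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
  "How to handle NaN entries: 'error' (throw an error, the default), 'skip' (filter out " +
  "rows with NaN values), or 'keep' (place NaN values into a special extra bucket).",
  ParamValidators.inArray(Array("skip", "error", "keep")))
```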





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911574
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
 It is possible that the number of buckets used will be less than this value, for example, if
--- End diff --

Here too: no longer the case





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911542
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -83,10 +83,20 @@ class QuantileDiscretizerSuite
   .setOutputCol("result")
   .setNumBuckets(numBuckets)
 
-// Reserve extra one bucket for NaN
-val expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
-val result = discretizer.fit(df).transform(df)
-val observedNumBuckets = result.select("result").distinct.count
+// Reserve extra one bucket for NaN values
+discretizer.setHandleInvalid("keep")
+var expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
+var result = discretizer.fit(df).transform(df)
+var observedNumBuckets = result.select("result").distinct.count
+assert(observedNumBuckets == expectedNumBuckets,
+  s"Observed number of buckets are not correct." +
+s" Expected $expectedNumBuckets but found $observedNumBuckets")
+
+// Remove NaN values
+discretizer.setHandleInvalid("skip")
+expectedNumBuckets = discretizer.fit(df).getSplits.length - 2
--- End diff --

Same here; I'd also recommend directly testing the values of the splits.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911537
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala 
---
@@ -83,10 +83,20 @@ class QuantileDiscretizerSuite
   .setOutputCol("result")
   .setNumBuckets(numBuckets)
 
-// Reserve extra one bucket for NaN
-val expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
-val result = discretizer.fit(df).transform(df)
-val observedNumBuckets = result.select("result").distinct.count
+// Reserve extra one bucket for NaN values
+discretizer.setHandleInvalid("keep")
+var expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
--- End diff --

I'd also recommend directly testing the values of the splits.  The current 
test makes sure that handleInvalid is passed to the bucketizer correctly, which 
is important but separate.

Also, please use vals (not vars) for clarity.  I'd recommend making a 
helper method for lines 87-93, which can be reused for the test of 
handleInvalid = "skip".
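The reviewer's suggested helper could look something like the following — a minimal, Spark-free sketch (plain Scala over arrays rather than DataFrames; `BucketCountSketch`, `observedNumBuckets`, and `bucketOf` are illustrative names, not the suite's actual code) of counting distinct output buckets under the two `handleInvalid` modes:

```scala
// Illustrative sketch of the reviewer's suggested helper. "skip" drops NaN
// rows before bucketing; "keep" routes NaNs into one extra trailing bucket.
object BucketCountSketch {
  // Assumes v lies within [splits.head, splits.last] when not NaN.
  private def bucketOf(splits: Array[Double], v: Double): Int =
    if (v.isNaN) splits.length - 1            // extra NaN bucket
    else if (v == splits.last) splits.length - 2
    else splits.indexWhere(s => v < s) - 1

  /** Counts distinct buckets produced, mimicking handleInvalid = "keep"/"skip". */
  def observedNumBuckets(data: Array[Double],
                         splits: Array[Double],
                         handleInvalid: String): Int = {
    val rows = if (handleInvalid == "skip") data.filterNot(_.isNaN) else data
    rows.map(bucketOf(splits, _)).distinct.length
  }
}
```

With splits `[0, 1, 2, 3]` and one NaN in the data, "keep" yields one more distinct bucket than "skip", which is exactly the `- 1` vs `- 2` arithmetic in the quoted test.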





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911282
  
--- Diff: docs/ml-features.md ---
@@ -1104,9 +1104,11 @@ for more details on the API.
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
 categorical features. The number of bins is set by the `numBuckets` 
parameter. It is possible
 that the number of buckets used will be less than this value, for example, 
if there are too few
-distinct values of the input to create enough distinct quantiles. Note 
also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 
buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a 
special bucket[4].
+distinct values of the input to create enough distinct quantiles. Note 
also that QuantileDiscretizer
--- End diff --

(same as below) Is "possible that the number of buckets used will be less 
than this value" true?  It was true before this used Dataset.approxQuantiles, 
but I don't think it is true any longer.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911357
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, 
value)
+  setDefault(handleInvalid, "error")
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema)
-val bucketizer = udf { feature: Double =>
-  Bucketizer.binarySearchForBuckets($(splits), feature)
+val bucketizer: UserDefinedFunction = udf { (feature: Double, flag: 
String) =>
+  Bucketizer.binarySearchForBuckets($(splits), feature, flag)
+}
+val filteredDataset = {
+  if ("skip" == getHandleInvalid) {
+dataset.na.drop
+  } else {
+dataset
--- End diff --

I'd put ```dataset.toDF()``` here and not import existentials.





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911408
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -128,8 +145,11 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @throws SparkException if a feature is < splits.head or > splits.last
*/
-  private[feature] def binarySearchForBuckets(splits: Array[Double], 
feature: Double): Double = {
-if (feature.isNaN) {
+  private[feature] def binarySearchForBuckets(
+   splits: Array[Double],
+   feature: Double,
+   flag: String): Double = {
--- End diff --

Use a better name than "flag". How about "keepNaNs"?
Also, please document this argument.
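Applying both suggestions, the helper might look roughly like this — a minimal sketch, not the PR's final code, with the Spark dependency removed so it stands alone (`BucketizerSketch` and the `keepNaN` name are assumptions following the reviewer's suggestion):

```scala
// Sketch of the renamed, documented helper: a boolean "keepNaN" in place of
// the stringly-typed "flag". Pure Scala; uses java.util.Arrays.binarySearch.
object BucketizerSketch {
  /**
   * Places a feature value into a bucket defined by `splits`.
   * @param splits  monotonically increasing bucket boundaries
   * @param feature value to bucket
   * @param keepNaN if true, NaN is placed into an extra trailing bucket
   *                (index splits.length - 1); if false, NaN raises an error
   */
  def binarySearchForBuckets(splits: Array[Double],
                             feature: Double,
                             keepNaN: Boolean): Double = {
    if (feature.isNaN) {
      if (keepNaN) splits.length - 1.0
      else throw new IllegalArgumentException(
        "Bucketizer encountered NaN with handleInvalid = error")
    } else {
      require(feature >= splits.head && feature <= splits.last,
        s"Feature $feature out of range [${splits.head}, ${splits.last}]")
      if (feature == splits.last) splits.length - 2.0
      else {
        val idx = java.util.Arrays.binarySearch(splits, feature)
        // binarySearch returns -(insertionPoint) - 1 when the key is absent
        if (idx >= 0) idx.toDouble else (-idx - 2).toDouble
      }
    }
  }
}
```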





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911300
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, 
value)
+  setDefault(handleInvalid, "error")
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema)
-val bucketizer = udf { feature: Double =>
-  Bucketizer.binarySearchForBuckets($(splits), feature)
+val bucketizer: UserDefinedFunction = udf { (feature: Double, flag: 
String) =>
--- End diff --

Also, at this point, you already know the value for flag, so you can just 
use it here, rather than making the UDF take an extra argument.
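The point here is closure capture: the `handleInvalid` setting is fixed once at transform time, so the per-row function need not take it as a second argument. A minimal, Spark-free sketch of that shape (the real code builds a Spark UDF the same way; `makeBucketizer` is an illustrative name):

```scala
// Sketch of the reviewer's suggestion: bake the known keepNaN setting into
// the closure once, so the per-row function takes only the feature value.
object ClosureSketch {
  def makeBucketizer(splits: Array[Double], keepNaN: Boolean): Double => Double =
    (feature: Double) =>
      if (feature.isNaN) {
        if (keepNaN) splits.length - 1.0   // keepNaN captured once, not per row
        else throw new IllegalArgumentException("NaN with handleInvalid = error")
      } else if (feature == splits.last) splits.length - 2.0
      // Assumes feature within [splits.head, splits.last]
      else splits.indexWhere(s => feature < s) - 1.0
}
```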





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911437
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends 
Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and 
outputs a column with binned
  * categorical features. The number of bins can be set using the 
`numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, 
for example, if there
- * are too few distinct values of the input to create enough distinct 
quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For 
example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs 
will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, 
for example, if there are
+ * too few distinct values of the input to create enough distinct 
quantiles. Note also that
+ * QuantileDiscretizer will raise an error when it finds NaN value in the 
dataset, but user can
+ * also choose to either keep or remove NaN values within the dataset by 
calling setHandleInvalid.
--- End diff --

"calling setHandleInvalid" -> "setting `handleInvalid`"





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83760784
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala ---
@@ -270,10 +270,10 @@ private[ml] trait HasFitIntercept extends Params {
 private[ml] trait HasHandleInvalid extends Params {
 
   /**
-   * Param for how to handle invalid entries. Options are skip (which will 
filter out rows with bad values), or error (which will throw an error). More 
options may be added later.
+   * Param for how to handle invalid entries. Options are skip (which will 
filter out rows with bad values), or error (which will throw an error), or keep 
(which will keep the bad values in certain way). More options may be added 
later.
* @group param
*/
-  final val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid", "how to handle invalid entries. Options are skip (which will 
filter out rows with bad values), or error (which will throw an error). More 
options may be added later", ParamValidators.inArray(Array("skip", "error")))
+  final val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid", "how to handle invalid entries. Options are skip (which will 
filter out rows with bad values), or error (which will throw an error), or keep 
(which will keep the bad values in certain way). More options may be added 
later", ParamValidators.inArray(Array("skip", "keep", "error")))
--- End diff --

Since this is used by StringIndexer, which will not support "keep," we 
should not modify HasHandleInvalid.  I recommend copying out the shared param 
and just putting it in Bucketizer and QuantileDiscretizer, rather than trying 
to reuse HasHandleInvalid.  That will let you specialize the documentation too 
so that you can be more specific.

HasHandleInvalid may not be a good shared Param yet, but perhaps in the 
future.
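A copied-out, specialized parameter along the lines suggested might be sketched as follows — plain Scala standing in for Spark's `Param` + `ParamValidators.inArray` machinery, with the "keep" option and the "error" default from this PR (`HandleInvalidParamSketch` is a hypothetical name, not Spark API):

```scala
// Minimal sketch of a non-shared handleInvalid param specialized for
// Bucketizer/QuantileDiscretizer, which do support "keep".
class HandleInvalidParamSketch {
  private val supported = Array("skip", "keep", "error")
  private var value: String = "error"  // default, matching the PR

  def set(v: String): this.type = {
    require(supported.contains(v),
      s"handleInvalid must be one of ${supported.mkString(", ")}, got: $v")
    value = v
    this
  }
  def get: String = value
}
```

Keeping the param local also lets its doc string be specific about what "keep" means for these two estimators, as the reviewer notes.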





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83760037
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -128,8 +145,11 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @throws SparkException if a feature is < splits.head or > splits.last
*/
-  private[feature] def binarySearchForBuckets(splits: Array[Double], 
feature: Double): Double = {
-if (feature.isNaN) {
+  private[feature] def binarySearchForBuckets(
+   splits: Array[Double],
--- End diff --

fix indentation





[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r83911419
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends 
Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and 
outputs a column with binned
  * categorical features. The number of bins can be set using the 
`numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, 
for example, if there
- * are too few distinct values of the input to create enough distinct 
quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For 
example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs 
will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, 
for example, if there are
--- End diff --

Is this true?  It was true before this used Dataset.approxQuantiles, but I 
don't think it is true any longer.

