[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15428

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84731699

--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
     It is possible that the number of buckets used will be less than this value, for example, if
     there are too few distinct values of the input to create enough distinct quantiles. Note also
-    that NaN values are handled specially and placed into their own bucket. For example, if 4
-    buckets are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in
-    a special bucket(4).
+    that QuantileDiscretizer will raise an error when it finds NaN value in the dataset, but user
--- End diff --

This isn't available in Python yet, so can you please revert this change to feature.py?
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84731675

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -100,6 +102,33 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("1.6.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.
--- End diff --

I'd just write:
```
default: "error"
```
Rewording as "report" instead of "throw" could confuse people.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84731665

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,52 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle" +
+    "invalid entries. Options are skip (which will filter out rows with invalid values), or" +
+    "error (which will throw an error), or keep (which will keep the invalid values" +
+    " in certain way). Default behaviour is to report an error for invalid entries.",
+    ParamValidators.inArray(Array("skip", "error", "keep")))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def gethandleInvalid: Option[Boolean] = $(handleInvalid) match {
--- End diff --

This should just return ```$(handleInvalid)```, just like any other Param getter method.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84731655

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,52 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with
+   * invalid values), or error (which will throw an error), or keep (which will keep the invalid
+   * values in certain way). Default behaviour is to report an error for invalid entries.
--- End diff --

I'd just write:
```
default: "error"
```
Rewording as "report" instead of "throw" could confuse people.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563124

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -82,14 +82,23 @@ class QuantileDiscretizerSuite
       .setInputCol("input")
       .setOutputCol("result")
       .setNumBuckets(numBuckets)
-
-    // Reserve extra one bucket for NaN
-    val expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
-    val result = discretizer.fit(df).transform(df)
-    val observedNumBuckets = result.select("result").distinct.count
-    assert(observedNumBuckets == expectedNumBuckets,
-      s"Observed number of buckets are not correct." +
-      s" Expected $expectedNumBuckets but found $observedNumBuckets")
+    val splits = Array(Double.NegativeInfinity, 1.0, Double.PositiveInfinity)
+
+    for (handleNaN <- Array("keep", "skip")) {
+      discretizer.setHandleNaN(handleNaN)
+      val expectedNumBuckets = {
--- End diff --

This is kind of complex logic just to get the expected number of splits. I'd recommend just putting the expected bucket *values* in the original DataFrame:
```
val df = sc.parallelize(Seq(
  (1.0, /*expected value for option "keep"*/, /*expected value for option "skip"*/),
  ...
)).toDF("input", "expectedKeep", "expectedSkip")
```
Then you can compare with the actual values. That'll be an easier test to understand IMO.
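The test pattern the review recommends (pair each input with its expected bucket per option, then read expectations straight off the data) can be sketched outside Spark in plain Python. `bucketize` below is a hypothetical stand-in for Bucketizer's bucket assignment, not Spark's actual implementation:

```python
import math

# Hypothetical stand-in for Bucketizer's bucket assignment: "keep" maps
# NaN to one extra bucket past the regular ones, "skip" drops the row,
# "error" raises.
def bucketize(value, splits, handle_nan="error"):
    if math.isnan(value):
        if handle_nan == "keep":
            return len(splits) - 1   # the extra NaN bucket
        if handle_nan == "skip":
            return None              # row would be filtered out
        raise ValueError("NaN value found in dataset")
    for i in range(len(splits) - 1):
        if splits[i] <= value < splits[i + 1]:
            return i
    raise ValueError("value out of range of splits")

# Each row carries the input plus its expected bucket per option,
# mirroring the suggested (input, expectedKeep, expectedSkip) columns.
rows = [
    # (input, expected for "keep", expected for "skip")
    (0.5, 0, 0),
    (1.5, 1, 1),
    (float("nan"), 2, None),
]
splits = [float("-inf"), 1.0, float("inf")]
for value, expected_keep, expected_skip in rows:
    assert bucketize(value, splits, "keep") == expected_keep
    assert bucketize(value, splits, "skip") == expected_skip
```

The point of the pattern is that the expectations live next to the inputs, so a failing row identifies itself without reverse-engineering a bucket-count formula.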
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563480

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleNaN: Param[String] = new Param[String](this, "handleNaNs", "how to handle NaN" +
+    "entries. Options are skip (which will filter out rows with NaN values), or error" +
+    "(which will throw an error), or keep (which will make the NaN values an extra bucket)." +
+    "More options may be added later", ParamValidators.inArray(Array("skip", "error", "keep")))
--- End diff --

Remove "More options may be added later."
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563237

--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
     It is possible that the number of buckets used will be less than this value, for example, if
     there are too few distinct values of the input to create enough distinct quantiles. Note also
-    that NaN values are handled specially and placed into their own bucket. For example, if 4
-    buckets are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in
-    a special bucket(4).
+    that QuantileDiscretizer will raise an error when it finds NaN value in the dataset, but user
--- End diff --

Actually no need to update Python API until it is updated to include handleNaN.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563388

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -100,6 +102,24 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("1.6.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
--- End diff --

State default.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563374

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -100,6 +102,24 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("1.6.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleNaN: Param[String] = new Param[String](this, "handleNaNs", "how to handle NaN" +
--- End diff --

Add getHandleNaN.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563479

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
--- End diff --

Remove "More options may be added later." Also state default.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84561834

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /**
+   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
+   * NaN values), or error (which will throw an error), or keep (which will make the NaN
+   * values an extra bucket). More options may be added later.
+   *
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleNaN: Param[String] = new Param[String](this, "handleNaNs", "how to handle NaN" +
+    "entries. Options are skip (which will filter out rows with NaN values), or error" +
+    "(which will throw an error), or keep (which will make the NaN values an extra bucket)." +
+    "More options may be added later", ParamValidators.inArray(Array("skip", "error", "keep")))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def getHandleNaN: Option[Boolean] = $(handleNaN) match {
+    case "keep" => Some(true)
+    case "skip" => Some(false)
+    case _ => None
+  }
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setHandleNaN(value: String): this.type = set(handleNaN, value)
+  setDefault(handleNaN, "error")
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     transformSchema(dataset.schema)
-    val bucketizer = udf { feature: Double =>
-      Bucketizer.binarySearchForBuckets($(splits), feature)
+
+    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Ah, sorry, one more comment. I'm not quite sure how closure capture behaves currently, but it might be good to define local vals for ```$(splits)``` and ```getHandleNaN.isDefined && getHandleNaN.get```. Since these reference methods in the Bucketizer class, I believe the UDF may capture the whole Bucketizer class instead of just those vals. After you define them in local vals here, you can use those vals in this UDF.
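The closure-capture concern raised above can be illustrated outside Spark with a plain Python sketch (the class and field names here are made up for illustration): a closure that reads fields through `self` holds the whole object in its captured environment, while copying the fields into locals first captures only those two values, which is what matters when Spark serializes the UDF to ship it to executors.

```python
class Discretizer:
    def __init__(self, splits, keep_invalid):
        self.splits = splits
        self.keep_invalid = keep_invalid

    def udf_capturing_self(self):
        # Reads fields through self: the closure's only free variable is
        # self, so the whole object would be serialized with it.
        return lambda x: self.splits[0] if self.keep_invalid else x

    def udf_capturing_locals(self):
        # Copy the fields into local variables first, as the review
        # suggests; the closure then captures just these two values.
        splits = self.splits
        keep_invalid = self.keep_invalid
        return lambda x: splits[0] if keep_invalid else x

d = Discretizer([0.0, 1.0], True)
# The captured (free) variables make the difference visible:
assert d.udf_capturing_self().__code__.co_freevars == ("self",)
assert set(d.udf_capturing_locals().__code__.co_freevars) == {"splits", "keep_invalid"}
```

The same reasoning applies to the Scala UDF: referencing `$(splits)` or `getHandleNaN` inside the UDF body references methods on the Bucketizer instance, so the instance itself gets pulled into the serialized closure.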
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84562692

--- Diff: docs/ml-features.md ---
@@ -1104,9 +1104,11 @@ for more details on the API.
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
 categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
 that the number of buckets used will be less than this value, for example, if there are too few
-distinct values of the input to create enough distinct quantiles. Note also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+distinct values of the input to create enough distinct quantiles. Note also that QuantileDiscretizer
--- End diff --

Yep, you're right.
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84205467

--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
     It is possible that the number of buckets used will be less than this value, for example, if
--- End diff --

Same as the comment above.
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84205458

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
  * categorical features. The number of bins can be set using the `numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, for example, if there
- * are too few distinct values of the input to create enough distinct quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, for example, if there are
--- End diff --

Same as the comment above.
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84205414

--- Diff: docs/ml-features.md ---
@@ -1104,9 +1104,11 @@ for more details on the API.
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
 categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
 that the number of buckets used will be less than this value, for example, if there are too few
-distinct values of the input to create enough distinct quantiles. Note also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+distinct values of the input to create enough distinct quantiles. Note also that QuantileDiscretizer
--- End diff --

In cases when the number of buckets requested by the user is greater than the number of distinct splits generated from Bucketizer, the returned number of buckets will be less than requested.
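The situation described here, where duplicate quantiles collapse and fewer buckets come back than requested, can be reproduced with a small Python sketch using the standard library's quantile function. This is an analogue for intuition only, not Spark's `approxQuantile`, and `quantile_splits` is a made-up helper name:

```python
import statistics

def quantile_splits(data, num_buckets):
    # Interior quantiles become the candidate split points; duplicates
    # collapse, so fewer buckets than requested can come back.
    qs = statistics.quantiles(data, n=num_buckets, method="inclusive")
    distinct = sorted(set(qs))
    return [float("-inf")] + distinct + [float("inf")]

# 90% of the data is a single value, so all three quartile points
# coincide at 1.0 and only one distinct interior split survives.
data = [1.0] * 90 + [2.0] * 10
splits = quantile_splits(data, num_buckets=4)
num_buckets = len(splits) - 1
assert num_buckets == 2   # 4 buckets requested, only 2 returned
```

With well-spread data the full count comes back; the shrinkage only appears when distinct values (and hence distinct quantiles) run out.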
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911499

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
@@ -114,6 +115,7 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
       .setInputCol("feature")
--- End diff --

Just set splits here; there's no need to set the other Params for this test.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83760190

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("2.1.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+  setDefault(handleInvalid, "error")
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     transformSchema(dataset.schema)
-    val bucketizer = udf { feature: Double =>
-      Bucketizer.binarySearchForBuckets($(splits), feature)
+    val bucketizer: UserDefinedFunction = udf { (feature: Double, flag: String) =>
--- End diff --

Use Boolean flag internally to avoid the string comparison on each call.
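The per-call string comparison flagged above can be hoisted out by resolving the Param to a Boolean once, before the per-row function is built. A minimal Python analogue (`make_bucketizer` is a hypothetical name; `bisect` stands in for the binary search Bucketizer performs over its splits):

```python
import bisect

def make_bucketizer(splits, handle_invalid):
    # Resolve the string Param to a Boolean once, outside the per-row
    # function, so each row pays only a cheap Boolean check rather than
    # a string comparison.
    keep_invalid = handle_invalid == "keep"

    def bucketize(feature):
        if feature != feature:           # NaN never equals itself
            if keep_invalid:
                return len(splits) - 1   # extra bucket reserved for NaN
            raise ValueError("NaN value found in dataset")
        # binary search for the left-inclusive bucket containing feature
        return bisect.bisect_right(splits, feature) - 1

    return bucketize

buckets = make_bucketizer([float("-inf"), 1.0, float("inf")], "keep")
assert [buckets(x) for x in (0.5, 1.0, float("nan"))] == [0, 1, 2]
```

The closure captures the already-computed `keep_invalid` value, mirroring the review's suggestion for the Scala UDF.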
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911464

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
  * categorical features. The number of bins can be set using the `numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, for example, if there
- * are too few distinct values of the input to create enough distinct quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, for example, if there are
+ * too few distinct values of the input to create enough distinct quantiles. Note also that
+ * QuantileDiscretizer will raise an error when it finds NaN value in the dataset, but user can
+ * also choose to either keep or remove NaN values within the dataset by calling setHandleInvalid.
+ * If user chooses to keep NaN values, they will be handled specially and placed into their own
--- End diff --

This documentation should go in the handleInvalid Param doc string.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911574

--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
     It is possible that the number of buckets used will be less than this value, for example, if
--- End diff --

Here too: no longer the case.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911542

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -83,10 +83,20 @@ class QuantileDiscretizerSuite
       .setOutputCol("result")
       .setNumBuckets(numBuckets)

-    // Reserve extra one bucket for NaN
-    val expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
-    val result = discretizer.fit(df).transform(df)
-    val observedNumBuckets = result.select("result").distinct.count
+    // Reserve extra one bucket for NaN values
+    discretizer.setHandleInvalid("keep")
+    var expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
+    var result = discretizer.fit(df).transform(df)
+    var observedNumBuckets = result.select("result").distinct.count
+    assert(observedNumBuckets == expectedNumBuckets,
+      s"Observed number of buckets are not correct." +
+      s" Expected $expectedNumBuckets but found $observedNumBuckets")
+
+    // Remove NaN values
+    discretizer.setHandleInvalid("skip")
+    expectedNumBuckets = discretizer.fit(df).getSplits.length - 2
--- End diff --

Same here; I'd also recommend directly testing the values of the splits.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911537

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -83,10 +83,20 @@ class QuantileDiscretizerSuite
       .setOutputCol("result")
       .setNumBuckets(numBuckets)

-    // Reserve extra one bucket for NaN
-    val expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
-    val result = discretizer.fit(df).transform(df)
-    val observedNumBuckets = result.select("result").distinct.count
+    // Reserve extra one bucket for NaN values
+    discretizer.setHandleInvalid("keep")
+    var expectedNumBuckets = discretizer.fit(df).getSplits.length - 1
--- End diff --

I'd also recommend directly testing the values of the splits. The current test makes sure that handleInvalid is passed to the bucketizer correctly, which is important but separate. Also, please use vals (not vars) for clarity. I'd recommend making a helper method for lines 87-93, which can be reused for the test of handleInvalid = "skip".
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911282

--- Diff: docs/ml-features.md ---
@@ -1104,9 +1104,11 @@ for more details on the API.
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
 categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
 that the number of buckets used will be less than this value, for example, if there are too few
-distinct values of the input to create enough distinct quantiles. Note also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+distinct values of the input to create enough distinct quantiles. Note also that QuantileDiscretizer
--- End diff --

(same as below) Is "possible that the number of buckets used will be less than this value" true? It was true before this used Dataset.approxQuantiles, but I don't think it is true any longer.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911357

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("2.1.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+  setDefault(handleInvalid, "error")
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     transformSchema(dataset.schema)
-    val bucketizer = udf { feature: Double =>
-      Bucketizer.binarySearchForBuckets($(splits), feature)
+    val bucketizer: UserDefinedFunction = udf { (feature: Double, flag: String) =>
+      Bucketizer.binarySearchForBuckets($(splits), feature, flag)
+    }
+    val filteredDataset = {
+      if ("skip" == getHandleInvalid) {
+        dataset.na.drop
+      } else {
+        dataset
--- End diff --

I'd put `dataset.toDF()` here and not import existentials.
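A minimal sketch of what that change could look like, assuming the surrounding `transform` body from the diff above:

```scala
// Sketch: convert the existentially-typed Dataset[_] to a DataFrame up front,
// so filteredDataset has the concrete type DataFrame on both branches and no
// `import scala.language.existentials` is needed.
val filteredDataset = if (getHandleInvalid == "skip") {
  dataset.na.drop().toDF() // drop rows containing null/NaN values first
} else {
  dataset.toDF()
}
```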
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911408

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -128,8 +145,11 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
    * Binary searching in several buckets to place each data point.
    * @throws SparkException if a feature is < splits.head or > splits.last
    */
-  private[feature] def binarySearchForBuckets(splits: Array[Double], feature: Double): Double = {
-    if (feature.isNaN) {
+  private[feature] def binarySearchForBuckets(
+      splits: Array[Double],
+      feature: Double,
+      flag: String): Double = {
--- End diff --

Use a better name than "flag". How about "keepNaNs"? Also, please document this argument.
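Applying both suggestions, the signature could become something like the sketch below. This is only an illustration of the review comment: the `keepNaNs` name follows the reviewer's proposal, the Boolean type (instead of the PR's String flag) and the exception message are assumptions, and the body of the non-NaN branch is elided:

```scala
/**
 * Binary searching in several buckets to place each data point.
 * @param splits sorted split points defining the buckets
 * @param feature the feature value to bucketize
 * @param keepNaNs if true, NaN features are placed into an extra bucket after the
 *                 last regular one; if false, a NaN feature raises an exception
 * @throws SparkException if a feature is < splits.head or > splits.last
 */
private[feature] def binarySearchForBuckets(
    splits: Array[Double],
    feature: Double,
    keepNaNs: Boolean): Double = {
  if (feature.isNaN) {
    if (keepNaNs) {
      splits.length - 1 // index of the reserved NaN bucket
    } else {
      throw new SparkException("Bucketizer encountered NaN value; to handle or " +
        "remove NaNs, set the handleInvalid parameter to 'keep' or 'skip'.")
    }
  } else {
    // ... unchanged binary search over splits ...
    ???
  }
}
```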
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911300

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     transformSchema(dataset.schema)
-    val bucketizer = udf { feature: Double =>
-      Bucketizer.binarySearchForBuckets($(splits), feature)
+    val bucketizer: UserDefinedFunction = udf { (feature: Double, flag: String) =>
--- End diff --

Also, at this point, you already know the value for flag, so you can just use it here, rather than making the UDF take an extra argument.
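In other words, the handleInvalid value can be captured in the UDF's closure at definition time. A hedged sketch (the local name `keepInvalid` is hypothetical, and it assumes a `binarySearchForBuckets` overload taking that extra argument):

```scala
// Sketch: getHandleInvalid is already known when transform runs, so capture it
// in the closure instead of threading it through the UDF as a per-row column.
val keepInvalid = getHandleInvalid == "keep"
val bucketizer = udf { feature: Double =>
  Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
}
```

This also keeps the UDF single-argument, so the `transform` call site does not need to pass a constant literal column for the flag on every row.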
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911437

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
  * categorical features. The number of bins can be set using the `numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, for example, if there
- * are too few distinct values of the input to create enough distinct quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, for example, if there are
+ * too few distinct values of the input to create enough distinct quantiles. Note also that
+ * QuantileDiscretizer will raise an error when it finds NaN value in the dataset, but user can
+ * also choose to either keep or remove NaN values within the dataset by calling setHandleInvalid.
--- End diff --

"calling setHandleInvalid" -> "setting `handleInvalid`"
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83760784

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala ---
@@ -270,10 +270,10 @@ private[ml] trait HasFitIntercept extends Params {
 private[ml] trait HasHandleInvalid extends Params {

   /**
-   * Param for how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error). More options may be added later.
+   * Param for how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error), or keep (which will keep the bad values in certain way). More options may be added later.
    * @group param
    */
-  final val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error). More options may be added later", ParamValidators.inArray(Array("skip", "error")))
+  final val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error), or keep (which will keep the bad values in certain way). More options may be added later", ParamValidators.inArray(Array("skip", "keep", "error")))
--- End diff --

Since this is used by StringIndexer, which will not support "keep," we should not modify HasHandleInvalid. I recommend copying out the shared param and just putting it in Bucketizer and QuantileDiscretizer, rather than trying to reuse HasHandleInvalid. That will let you specialize the documentation too so that you can be more specific. HasHandleInvalid may not be a good shared Param yet, but perhaps in the future.
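A component-local param along those lines could look like the sketch below. This is an illustration of the suggestion, not the PR's code; the exact wording, `@Since` version, and option order are assumptions:

```scala
// Sketch: a Bucketizer/QuantileDiscretizer-specific param declared in the class
// itself rather than in the shared HasHandleInvalid trait, so the documentation
// can mention NaN handling explicitly and "keep" does not leak into StringIndexer.
@Since("2.1.0")
val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
  "how to handle invalid entries (NaN values). Options are skip (filter out " +
    "rows with invalid values), error (throw an error), or keep (put NaN values " +
    "into a special additional bucket)",
  ParamValidators.inArray(Array("skip", "error", "keep")))
```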
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83760037

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -128,8 +145,11 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
    * Binary searching in several buckets to place each data point.
    * @throws SparkException if a feature is < splits.head or > splits.last
    */
-  private[feature] def binarySearchForBuckets(splits: Array[Double], feature: Double): Double = {
-    if (feature.isNaN) {
+  private[feature] def binarySearchForBuckets(
+      splits: Array[Double],
--- End diff --

fix indentation
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r83911419

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
  * categorical features. The number of bins can be set using the `numBuckets` parameter. It is
- * possible that the number of buckets used will be less than this value, for example, if there
- * are too few distinct values of the input to create enough distinct quantiles. Note also that
- * NaN values are handled specially and placed into their own bucket. For example, if 4 buckets
- * are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in a special
- * bucket(4).
+ * possible that the number of buckets used will be less than this value, for example, if there are
--- End diff --

Is this true? It was true before this used Dataset.approxQuantiles, but I don't think it is true any longer.