[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16971

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r106569149

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala

```
@@ -55,7 +55,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
   }

   private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = {
-    val approx = summary.query(quant)
+    val approx = summary.query(quant).get
```

OK, I will add a test on empty `data` here.
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r106569064

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

```
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
```

Thanks all. I will add a comment here.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r106335040

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

```
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
```

Thank you!
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r106309333

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

```
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
```

Yes, this is correct. But you should leave a comment, since it is not obvious.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r105588823

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

```
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
```

cc @thunterdb to confirm it.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r105588798

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala

```
@@ -55,7 +55,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
   }

   private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = {
-    val approx = summary.query(quant)
+    val approx = summary.query(quant).get
```

Add a test case with `summary.count == 0` and improve this helper function to cover it?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r105588596

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

```
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
```

Yes. I think it is impossible to hit it. We need some test cases to ensure it. See my next comment.
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r103872828

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

```
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
```

It looks like it is impossible to return `None` here: since `summaries.count != 0`, `summaries.sampled` should not be empty, so `None` will not be returned. @thunterdb Is this correct?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r103844548

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

```
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
```

Is it possible to return `None`? If so, you will get a strange exception. Could you also add a test case for `getPercentiles` in `ApproximatePercentileQuerySuite`?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r103844105

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

```
@@ -91,7 +99,9 @@ object StatFunctions extends Logging {
     }
     val summaries = df.select(columns: _*).rdd.aggregate(emptySummaries)(apply, merge)
-    summaries.map { summary => probabilities.map(summary.query) }
+    summaries.map { summary =>
+      probabilities.flatMap(summary.query)
+    }
```

Nit:
```
summaries.map { summary => probabilities.flatMap(summary.query) }
```
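Why `flatMap` rather than `map` matters here: once `query` returns `Option[Double]`, `flatMap` silently drops the `None` results, so an all-null/NaN column yields an empty array instead of throwing. A minimal standalone sketch, with a hypothetical `query` (a naive nearest-rank lookup standing in for the real `QuantileSummaries` sketch):

```scala
// Hypothetical stand-in for QuantileSummaries.query: returns None when
// there is no data to summarize (all inputs were null/NaN, or none at all).
def query(samples: Seq[Double])(q: Double): Option[Double] =
  if (samples.isEmpty) None
  else {
    val sorted = samples.sorted
    Some(sorted(math.min((q * sorted.length).toInt, sorted.length - 1)))
  }

val probabilities = Array(0.5, 0.8)

// flatMap drops None results: an empty summary yields an empty array
// instead of a NoSuchElementException.
println(probabilities.flatMap(query(Seq.empty)).toList)          // List()
println(probabilities.flatMap(query(Seq(1.0, 2.0, 3.0))).toList) // List(2.0, 3.0)
```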
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r103844048

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

```
@@ -78,7 +81,12 @@ object StatFunctions extends Logging {
     def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
       var i = 0
       while (i < summaries.length) {
-        summaries(i) = summaries(i).insert(row.getDouble(i))
+        if (!row.isNullAt(i)) {
+          val v = row.getDouble(i)
+          if (!v.isNaN) {
+            summaries(i) = summaries(i).insert(v)
+          }
```

Nit:
```
if (!v.isNaN) summaries(i) = summaries(i).insert(v)
```
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r103844007

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala

```
@@ -176,17 +176,21 @@ class QuantileSummaries(
    * @param quantile the target quantile
    * @return
    */
-  def query(quantile: Double): Double = {
+  def query(quantile: Double): Option[Double] = {
     require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
     require(headSampled.isEmpty, "Cannot operate on an uncompressed summary, call compress() first")
+    if (sampled.isEmpty) {
+      return None
+    }
```

Nit:
```
if (sampled.isEmpty) return None
```
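For readers outside the Spark codebase, the Option-returning contract under review can be sketched in plain Scala with no Spark dependencies. `TinySummary` and its naive nearest-rank lookup are illustrative stand-ins, not the real Greenwald-Khanna sketch in `QuantileSummaries`:

```scala
// Simplified stand-in for QuantileSummaries: `sampled` holds the
// compressed samples; query returns None when no data was inserted.
case class TinySummary(sampled: Vector[Double]) {
  def query(quantile: Double): Option[Double] = {
    require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
    if (sampled.isEmpty) return None
    // Naive nearest-rank lookup, unlike the real approximate sketch.
    val sorted = sampled.sorted
    val idx = math.min((quantile * sorted.length).toInt, sorted.length - 1)
    Some(sorted(idx))
  }
}

val empty = TinySummary(Vector.empty)
val filled = TinySummary(Vector(1.0, 2.0, 3.0, 4.0))
println(empty.query(0.5))   // None, instead of a NoSuchElementException
println(filled.query(0.5))  // Some(3.0)
```

With this shape, a caller that has already checked `sampled` is non-empty can safely use `.get`, which is exactly the reasoning in the `ApproximatePercentile` thread above.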
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r103843868

Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

```
@@ -89,22 +88,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *       the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be ignored in numerical columns before calculation. For
+   *       columns only containing null or NaN values, an empty array is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
-    try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
-        probabilities, relativeError).map(_.toArray).toArray
-    } catch {
-      case e: NoSuchElementException => null
-    }
+    StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
+      probabilities, relativeError).map(_.toArray).toArray
```

Nit: style issue
```
StatFunctions.multipleApproxQuantiles(
  df.select(cols.map(col): _*), cols, probabilities, relativeError).map(_.toArray).toArray
```
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r103270963

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

```
@@ -91,7 +100,13 @@ object StatFunctions extends Logging {
     }
     val summaries = df.select(columns: _*).rdd.aggregate(emptySummaries)(apply, merge)
-    summaries.map { summary => probabilities.map(summary.query) }
+    summaries.map { summary =>
+      try {
+        probabilities.map(summary.query)
+      } catch {
+        case e: SparkException => Seq.empty[Double]
```

Please do not use exception handling for this purpose. Instead, you can return `None`.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102601793

Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

```
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *       the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be removed from the numerical column before calculation. If
+   *       the dataframe is empty, or all rows in some column contain null or NaN, null is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
     try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
         probabilities, relativeError).map(_.toArray).toArray
     } catch {
       case e: NoSuchElementException => null
```

In Spark SQL, all the other built-in functions will not throw an exception if the input data set is empty. An empty input data set is pretty normal. Returning either `null` or an empty `Array` looks OK to me.
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102592174

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

```
@@ -78,7 +80,12 @@ object StatFunctions extends Logging {
     def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
       var i = 0
       while (i < summaries.length) {
-        summaries(i) = summaries(i).insert(row.getDouble(i))
+        if (!row.isNullAt(i)) {
```

Thank you for fixing this issue; it was an oversight in my original implementation. The current exception being thrown depends on an implementation detail (calling `sampled.head`). Can you modify the function `def query` below to explicitly throw an exception if `sampled` is empty, and document this behavior in that function? This way, we will not forget it if we decide to change the semantics of that class. As @MLnick mentioned above, it would be preferable to eventually return an Option, null, or NaN, but this can wait for more consensus.
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102589719

Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

```
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *       the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be removed from the numerical column before calculation. If
+   *       the dataframe is empty, or all rows in some column contain null or NaN, null is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
     try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
         probabilities, relativeError).map(_.toArray).toArray
     } catch {
       case e: NoSuchElementException => null
```

+1. I tend to think that the result should be `NaN` (following the IEEE convention) or `null` (following the Scala `Option` convention). But pending a resolution, I am fine with throwing an exception because it is the most conservative behavior (stopping computations).
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102167055

Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala

```
@@ -214,20 +214,29 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
     val q1 = 0.5
     val q2 = 0.8
     val epsilon = 0.1
-    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, 1.0),
+    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, -1.0),
       Row(-1.0, Double.NaN), Row(Double.NaN, Double.NaN), Row(null, null), Row(null, 1.0),
       Row(-1.0, null), Row(Double.NaN, null)))
     val schema = StructType(Seq(StructField("input1", DoubleType, nullable = true),
       StructField("input2", DoubleType, nullable = true)))
     val dfNaN = spark.createDataFrame(rows, schema)
-    val resNaN = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
-    assert(resNaN.count(_.isNaN) === 0)
-    assert(resNaN.count(_ == null) === 0)
+    val resNaN1 = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
+    assert(resNaN1.count(_.isNaN) === 0)
+    assert(resNaN1.count(_ == null) === 0)

-    val resNaN2 = dfNaN.stat.approxQuantile(Array("input1", "input2"),
+    val resNaN2 = dfNaN.stat.approxQuantile("input2", Array(q1, q2), epsilon)
+    assert(resNaN2.count(_.isNaN) === 0)
+    assert(resNaN2.count(_ == null) === 0)
+
+    val resNaNAll = dfNaN.stat.approxQuantile(Array("input1", "input2"),
       Array(q1, q2), epsilon)
-    assert(resNaN2.flatten.count(_.isNaN) === 0)
-    assert(resNaN2.flatten.count(_ == null) === 0)
+    assert(resNaNAll.flatten.count(_.isNaN) === 0)
+    assert(resNaNAll.flatten.count(_ == null) === 0)
+
+    assert(resNaN1(0) === resNaNAll(0)(0))
+    assert(resNaN1(1) === resNaNAll(0)(1))
+    assert(resNaN2(0) === resNaNAll(1)(0))
+    assert(resNaN2(1) === resNaNAll(1)(1))
```

Yes, I created a new column containing only NaN/null.
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102146260

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

```
@@ -78,7 +80,13 @@ object StatFunctions extends Logging {
     def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
       var i = 0
       while (i < summaries.length) {
-        summaries(i) = summaries(i).insert(row.getDouble(i))
+        val item = row(i)
```

This works, though perhaps we can do:
```scala
if (!row.isNullAt(i)) {
  val v = row.getDouble(i)
  if (!v.isNaN) {
    summaries(i) = summaries(i).insert(v)
  }
}
```
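The suggested null/NaN guard can be exercised outside Spark. A sketch in plain Scala, where an `Array[Option[Double]]` is a hypothetical stand-in for a row of nullable `DoubleType` columns (`None` modelling SQL null):

```scala
// A "row" of nullable double columns: None models SQL null.
val row: Array[Option[Double]] = Array(Some(1.5), None, Some(Double.NaN), Some(-2.0))

// Keep only values that survive the null check and the NaN check,
// mirroring: if (!row.isNullAt(i)) { val v = row.getDouble(i); if (!v.isNaN) insert(v) }
val inserted = row.flatMap {
  case Some(v) if !v.isNaN => Some(v)
  case _                   => None
}

println(inserted.toList)  // List(1.5, -2.0): null and NaN are ignored, not errors
```

The ordering of the two checks matters: `row.getDouble(i)` on a null cell would return `0.0` (or throw, depending on the accessor), so the null test must come first, which is what the suggested snippet does.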
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102145908

Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

```
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *       the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be removed from the numerical column before calculation. If
+   *       the dataframe is empty, or all rows in some column contain null or NaN, null is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
     try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
         probabilities, relativeError).map(_.toArray).toArray
     } catch {
       case e: NoSuchElementException => null
```

This went in with the other PR, but I still question whether we should be returning `null` here. Is this standard in Spark SQL? What about returning an empty `Array`? cc @gatorsmile
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102145412

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

```
@@ -54,6 +54,8 @@ object StatFunctions extends Logging {
    * Note that values greater than 1 are accepted but give the same result as 1.
    *
    * @return for each column, returns the requested approximations
+   *
+   * @note null and NaN values will be removed from the numerical column before calculation.
```

I think "will be ignored" is more accurate than "will be removed".
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102145538

Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

```
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
+   * @note null and NaN values will be removed from the numerical column before calculation. If
```

Again, "ignored" is slightly better than "removed from".
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102146144

Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala

```
@@ -214,20 +214,29 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
     val q1 = 0.5
     val q2 = 0.8
     val epsilon = 0.1
-    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, 1.0),
+    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, -1.0),
       Row(-1.0, Double.NaN), Row(Double.NaN, Double.NaN), Row(null, null), Row(null, 1.0),
       Row(-1.0, null), Row(Double.NaN, null)))
     val schema = StructType(Seq(StructField("input1", DoubleType, nullable = true),
       StructField("input2", DoubleType, nullable = true)))
     val dfNaN = spark.createDataFrame(rows, schema)
-    val resNaN = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
-    assert(resNaN.count(_.isNaN) === 0)
-    assert(resNaN.count(_ == null) === 0)
+    val resNaN1 = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
+    assert(resNaN1.count(_.isNaN) === 0)
+    assert(resNaN1.count(_ == null) === 0)

-    val resNaN2 = dfNaN.stat.approxQuantile(Array("input1", "input2"),
+    val resNaN2 = dfNaN.stat.approxQuantile("input2", Array(q1, q2), epsilon)
+    assert(resNaN2.count(_.isNaN) === 0)
+    assert(resNaN2.count(_ == null) === 0)
+
+    val resNaNAll = dfNaN.stat.approxQuantile(Array("input1", "input2"),
       Array(q1, q2), epsilon)
-    assert(resNaN2.flatten.count(_.isNaN) === 0)
-    assert(resNaN2.flatten.count(_ == null) === 0)
+    assert(resNaNAll.flatten.count(_.isNaN) === 0)
+    assert(resNaNAll.flatten.count(_ == null) === 0)
+
+    assert(resNaN1(0) === resNaNAll(0)(0))
+    assert(resNaN1(1) === resNaNAll(0)(1))
+    assert(resNaN2(0) === resNaNAll(1)(0))
+    assert(resNaN2(1) === resNaNAll(1)(1))
```

Do we need a test for one column of all nulls (that it returns null)?
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/16971

[SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile

## What changes were proposed in this pull request?
Update `StatFunctions.multipleApproxQuantiles` to handle NaN/null.

## How was this patch tested?
Existing tests and added tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark quantiles_nan

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16971.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16971

commit d5e79a809b1edd91a7e0c1d8046bb8bfec2ba4c9
Author: Zheng RuiFeng
Date: 2017-02-17T03:18:43Z

    create pr