[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19496

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r145182120

Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

```
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
 testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
 Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
 }
+
+test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+val d = 0.034567890
+val df = Seq(d, d, d, d, d, d, d, d, d, d).toDF("DecimalCol")
+val result = df.select('DecimalCol cast DecimalType(38, 33))
+.select(col("DecimalCol")).describe()
+val mean = result.select("DecimalCol").where($"summary" === "mean")
+assert(mean.collect.toSet === Set(Row("0.034567890")))
```

@gatorsmile Thanks Sean for your review. I will fix the problems.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r145034923

Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

```
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
 testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
 Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
 }
+
+test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+val d = 0.034567890
+val df = Seq(d, d, d, d, d, d, d, d, d, d).toDF("DecimalCol")
+val result = df.select('DecimalCol cast DecimalType(38, 33))
+.select(col("DecimalCol")).describe()
+val mean = result.select("DecimalCol").where($"summary" === "mean")
+assert(mean.collect.toSet === Set(Row("0.034567890")))
```

Nit: `collect` -> `collect()`
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r145034645

Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

```
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
 testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
 Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
 }
+
+test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+val d = 0.034567890
+val df = Seq(d, d, d, d, d, d, d, d, d, d).toDF("DecimalCol")
+val result = df.select('DecimalCol cast DecimalType(38, 33))
+.select(col("DecimalCol")).describe()
```

Nit: two space indents.
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r144732734

Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

```
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
 testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
 Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
 }
+
+test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+val d: BigDecimal = BigDecimal(0.034567890)
```

It's not necessary. I will remove. Thanks for your review.
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r144723280

Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

```
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
 testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
 Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
 }
+
+test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+val d: BigDecimal = BigDecimal(0.034567890)
```

Do we need `: BigDecimal`?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r144689475

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala

```
@@ -80,7 +80,8 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit
 case DecimalType.Fixed(p, s) =>
   // increase the precision and scale to prevent precision loss
   val dt = DecimalType.bounded(p + 14, s + 4)
-  Cast(Cast(sum, dt) / Cast(count, dt), resultType)
+  Cast(Cast(sum, dt) / Cast(count, DecimalType.bounded (DecimalType.MAX_PRECISION, 0)),
```

No need to add space after `bounded`.
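For context on the types in the diff above: with an input column of `DecimalType(38, 33)`, the widened intermediate type `dt = DecimalType.bounded(p + 14, s + 4)` caps both values at `MAX_PRECISION = 38`, which yields `DecimalType(38, 37)`. A minimal sketch of that bounding rule (a Python stand-in mimicking `DecimalType.bounded`, not Spark's actual code):

```python
MAX_PRECISION = 38  # Spark SQL's DecimalType.MAX_PRECISION

def bounded(precision, scale):
    # Stand-in for DecimalType.bounded: cap precision and scale at the maximum.
    return (min(precision, MAX_PRECISION), min(scale, MAX_PRECISION))

# Input column DecimalType(38, 33): p = 38, s = 33
p, s = 38, 33
dt = bounded(p + 14, s + 4)
print(dt)  # (38, 37): precision is already at the cap, scale grows to 37
```

Note that the precision cannot grow past 38, but the scale still climbs to 37, leaving only one digit for the integral part of any value cast to `dt`.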
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/19496

[SPARK-22271][SQL] mean overflows and returns null for some decimal variables

## What changes were proposed in this pull request?

In Average.scala, it has

```
override lazy val evaluateExpression = child.dataType match {
  case DecimalType.Fixed(p, s) =>
    // increase the precision and scale to prevent precision loss
    val dt = DecimalType.bounded(p + 14, s + 4)
    Cast(Cast(sum, dt) / Cast(count, dt), resultType)
  case _ =>
    Cast(sum, resultType) / Cast(count, resultType)
}

def setChild (newchild: Expression) = {
  child = newchild
}
```

It is possible that `Cast(count, dt)` requires a precision bigger than 38 for the decimal value, and this causes overflow. Since count is an integer and doesn't need a scale, I will cast it using `DecimalType.bounded(38, 0)`.

## How was this patch tested?

A test case is added in DataFrameSuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-22271

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19496.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19496

commit a3437ee4a87d1f51b362adeb20d4fcc264085ba7
Author: Huaxin Gao
Date: 2017-10-14T04:45:27Z

[SPARK-22271][SQL] mean overflows and returns null for some decimal variables
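To see why the old expression could return null: for an input of `DecimalType(38, 33)`, the widened type is `DecimalType(38, 37)`, and a row count such as 10 cast to that type needs 39 significant digits (10 shifted by 37 decimal places), one more than the precision allows, so the cast overflows to null and the whole mean becomes null. A hedged sketch of that digit arithmetic (plain Python mimicking the fit check, not Spark's actual implementation):

```python
from decimal import Decimal

def fits(value, precision, scale):
    # A value fits DecimalType(precision, scale) iff its unscaled form
    # (value * 10^scale) has at most `precision` significant digits.
    unscaled = int(Decimal(value).scaleb(scale))
    return len(str(abs(unscaled))) <= precision

# Before the fix: count = 10 cast to the widened DecimalType(38, 37).
print(fits(10, 38, 37))  # False -> the cast overflows, mean becomes null

# After the fix: count cast with DecimalType.bounded(38, 0); an
# integral count always fits at scale 0.
print(fits(10, 38, 0))   # True
```

The sum, by contrast, fits fine at (38, 37) because its integral part is tiny; it is only the integer count that cannot survive a scale-37 cast.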