[GitHub] spark issue #23239: [SPARK-26021][SQL][followup] only deal with NaN and -0.0...
Github user adoron commented on the issue: https://github.com/apache/spark/pull/23239

@cloud-fan what about UnsafeRow::setDouble/Float? It doesn't go through the same flow. Is it not used?
[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...
Github user adoron commented on a diff in the pull request: https://github.com/apache/spark/pull/23141#discussion_r236189376

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide
   - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`.
-  - In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings and JSON datasource skips the same independetly of its mode if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
+  - In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings and JSON datasource skips the same independetly of its mode if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
   - The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.
   - In Spark version 2.4 and earlier, users can create map values with map type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since Spark 3.0, it's not allowed to create map values with map type key with these built-in functions. Users can still read map values with map type key from data source or Java/Scala collections, though they are not very useful.
-
+  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.
+  - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
--- End diff --

What version of hive did you test? It was fixed in https://issues.apache.org/jira/browse/HIVE-11174
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user adoron commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r235685712

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,18 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
         "grouping expressions: [current_date(None)], value: [key: int, value: string], " +
         "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when grouping") {
+    val colName = "i"
+    val doubles = Seq(0.0d, 0.0d, -0.0d).toDF(colName).groupBy(colName).count().collect()
--- End diff --

Example:
```
Seq(0.0d, 0.0d, -0.0d).toDF(colName).groupBy(colName).count().show()
+---+-----+
|  i|count|
+---+-----+
|0.0|    3|
+---+-----+

Seq(0.0d, -0.0d, 0.0d).toDF(colName).groupBy(colName).count().show()
+----+-----+
|   i|count|
+----+-----+
| 0.0|    1|
|-0.0|    2|
+----+-----+

Seq(-0.0d, -0.0d, 0.0d).toDF(colName).groupBy(colName).count().show()
+----+-----+
|   i|count|
+----+-----+
|-0.0|    3|
+----+-----+
```
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user adoron commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r235683861

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,18 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
         "grouping expressions: [current_date(None)], value: [key: int, value: string], " +
         "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when grouping") {
+    val colName = "i"
+    val doubles = Seq(0.0d, 0.0d, -0.0d).toDF(colName).groupBy(colName).count().collect()
--- End diff --

Actually yes: if codegen is enabled, a generated FastHashMap is used for the partial grouping before the shuffle, and that map doesn't separate 0.0 and -0.0. In addition, the unit test runs 2 threads over 2 partitions. With the order of the doubles Seq, each partition's partial grouping ends up keyed on 0.0, so after the shuffle the groups are merged. In the floats case the order of the elements in the Seq is different, so the first (partial) grouping gives one partition keyed on 0.0 and the other on -0.0, and after the shuffle they are treated as different groups (before the fix). Now I remember this is the reason I originally disabled codegen in the test. I think I'll just reorder the doubles Seq as well so that it manifests the bug.
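For anyone reproducing this locally, a quick way to see how the Seq lands in the test's partitions (a sketch only — it assumes the suite's 2-partition local setup and that `spark.implicits._` is in scope):
```
Seq(0.0d, 0.0d, -0.0d).toDF("i").rdd.glom().collect()
  .zipWithIndex.foreach { case (part, idx) => println(s"partition $idx: ${part.mkString(", ")}") }
```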
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user adoron commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r234942649

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,32 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
         "grouping expressions: [current_date(None)], value: [key: int, value: string], " +
         "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when grouping") {
+    val colName = "i"
+    def groupByCollect(df: DataFrame): Array[Row] = {
+      df.groupBy(colName).count().collect()
+    }
+    def assertResult[T](result: Array[Row], zero: T)(implicit ordering: Ordering[T]): Unit = {
+      assert(result.length == 1)
+      // using compare since 0.0 == -0.0 is true
+      assert(ordering.compare(result(0).getAs[T](0), zero) == 0)
+      assert(result(0).getLong(1) == 3)
+    }
+
+    spark.conf.set("spark.sql.codegen.wholeStage", "false")
+    val doubles =
+      groupByCollect(Seq(0.0d, 0.0d, -0.0d).toDF(colName))
+    val doublesBoxed =
+      groupByCollect(Seq(Double.box(0.0d), Double.box(0.0d), Double.box(-0.0d)).toDF(colName))
+    val floats =
+      groupByCollect(Seq(0.0f, -0.0f, 0.0f).toDF(colName))
--- End diff --

Looks like leftovers from a different solution. Also there's no need to test the boxed version now that it's not in the codegen. I'll simplify the test.
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Platf...
Github user adoron commented on the issue: https://github.com/apache/spark/pull/23043

@kiszk is there a use case where the preliminary RDD isn't created with UnsafeRows? If not then the data will already be corrected on reading. Anyway, looking at all different implementations of InternalRow.setDouble I found the following places that aren't handled:
```
OnHeapColumnVector.putDouble
MutableDouble.update
GenericInternalRow.update
SpecificInternalRow.setDouble
```
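A quick way to poke at one of the listed paths directly — a hedged sketch, the constructor and method signatures below are from memory of the current API and should be double-checked:
```
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.DoubleType

val vec = new OnHeapColumnVector(1, DoubleType)
vec.putDouble(0, -0.0d)
// If this write path does not normalize, the sign bit of -0.0 survives the round trip:
println(java.lang.Double.doubleToRawLongBits(vec.getDouble(0)).toHexString)
vec.close()
```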
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user adoron commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r234676540

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,32 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
         "grouping expressions: [current_date(None)], value: [key: int, value: string], " +
         "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when grouping") {
+    val colName = "i"
+    def groupByCollect(df: DataFrame): Array[Row] = {
+      df.groupBy(colName).count().collect()
+    }
+    def assertResult[T](result: Array[Row], zero: T)(implicit ordering: Ordering[T]): Unit = {
+      assert(result.length == 1)
+      // using compare since 0.0 == -0.0 is true
+      assert(ordering.compare(result(0).getAs[T](0), zero) == 0)
--- End diff --

I'm not sure I follow. Below this I'm constructing Seqs with 0 and -0 like in the JIRA, and in the assertResult helper I'm checking that there's only 1 line, like you said. Do you mean the check that the key is indeed 0.0 and not -0.0 is redundant?
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user adoron commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r234674948

--- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java ---
@@ -157,4 +159,15 @@ public void heapMemoryReuse() {
     Assert.assertEquals(onheap4.size(), 1024 * 1024 + 7);
     Assert.assertEquals(obj3, onheap4.getBaseObject());
   }
+
+  @Test
+  // SPARK-26021
+  public void writeMinusZeroIsReplacedWithZero() {
+    byte[] doubleBytes = new byte[Double.BYTES];
+    byte[] floatBytes = new byte[Float.BYTES];
+    Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
+    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
+    Assert.assertEquals(0, Double.compare(0.0d, ByteBuffer.wrap(doubleBytes).getDouble()));
--- End diff --

Yeah, it fails. Indeed 0.0 == -0.0, so I'm using Double.compare == 0 to test this.
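For reference, the distinction the assertion relies on is plain JVM semantics, nothing Spark-specific:
```
assert(0.0d == -0.0d)                              // IEEE 754 equality treats the two zeros as equal
assert(java.lang.Double.compare(0.0d, -0.0d) > 0)  // but compare orders -0.0 strictly below 0.0
```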
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...
Github user adoron commented on the issue: https://github.com/apache/spark/pull/23043

@cloud-fan changing writeDouble/writeFloat in UnsafeWriter indeed does the trick! I'll fix the PR. I was thinking about making the change in `Platform::putDouble` so that all accesses get affected, in UnsafeRow as well as UnsafeWriter.
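For illustration, a minimal sketch of the normalization being discussed — only the value-level transformation a `Platform::putDouble`-style fix (or the UnsafeWriter one) would apply before writing the bytes, not the actual Spark code:
```
def normalizeDouble(value: Double): Double =
  if (value == -0.0d) 0.0d else value  // 0.0 == -0.0 is also true, so +0.0 simply stays +0.0

def normalizeFloat(value: Float): Float =
  if (value == -0.0f) 0.0f else value  // NaN compares false here, so NaN is left untouched
```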
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...
Github user adoron commented on the issue: https://github.com/apache/spark/pull/23043

@cloud-fan that's what I thought as well at first, but the flow doesn't go through that code; I checked by running `Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().collect()` and adding a breakpoint. The reason -0.0 and 0.0 are put in different buckets of the "group by" is in UnsafeFixedWidthAggregationMap::getAggregationBufferFromUnsafeRow():
```
public UnsafeRow getAggregationBufferFromUnsafeRow(UnsafeRow key) {
  return getAggregationBufferFromUnsafeRow(key, key.hashCode());
}
```
The hashing is done on the UnsafeRow, and by this point the whole row is hashed as a unit, so it's hard to find the double columns and their values.
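In miniature, the reason the row-level hash separates the two (plain JVM behaviour, shown here outside of Spark):
```
// 0.0 and -0.0 compare equal, but their IEEE 754 bit patterns — and therefore the
// bytes stored in the UnsafeRow and fed to the hash — differ.
val plusZeroBits  = java.lang.Double.doubleToRawLongBits(0.0d)   // 0x0000000000000000L
val minusZeroBits = java.lang.Double.doubleToRawLongBits(-0.0d)  // 0x8000000000000000L
assert(plusZeroBits != minusZeroBits)
```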
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...
Github user adoron commented on the issue: https://github.com/apache/spark/pull/23043

@srowen @gatorsmile @cloud-fan
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
GitHub user adoron opened a pull request: https://github.com/apache/spark/pull/23043

[SPARK-26021][SQL] replace minus zero with zero in UnsafeProjection

GROUP BY treats -0.0 and 0.0 as different values, which is unlike Hive's behavior. In addition, the current behavior with codegen is unpredictable (see the example in the JIRA ticket).

## What changes were proposed in this pull request?

In the BoundReference class, in the generated code, replace -0.0 with 0.0 if the data type is double or float.

## How was this patch tested?

Added tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adoron/spark adoron-spark-26021-replace-minus-zero-with-zero

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23043.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23043

commit ee0ef91a7047d47328efac753e66ec97a91c0e37
Author: Alon Doron
Date: 2018-11-14T16:18:30Z

    replace -0.0 with 0.0 in BoundAttribute
    added tests

commit 63b7f59ad44d0876ea6dde02e4204fc0140d0df6
Author: Alon Doron
Date: 2018-11-14T16:27:24Z

    minor
    remove var type
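To see the behaviour this PR targets, a quick reproduction from a spark-shell (assuming `spark.implicits._` is in scope): before the fix the output may show two separate groups for the signed zeros, after it a single group of 3 is expected.
```
Seq(0.0d, -0.0d, 0.0d).toDF("i").groupBy("i").count().show()
```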