[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-12-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23141#discussion_r239333919
  
--- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java ---
@@ -165,10 +165,14 @@ public void writeMinusZeroIsReplacedWithZero() {
     byte[] floatBytes = new byte[Float.BYTES];
     Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
     Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
-    double doubleFromPlatform = Platform.getDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET);
-    float floatFromPlatform = Platform.getFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET);

-    Assert.assertEquals(Double.doubleToLongBits(0.0d), Double.doubleToLongBits(doubleFromPlatform));
-    Assert.assertEquals(Float.floatToIntBits(0.0f), Float.floatToIntBits(floatFromPlatform));
+    byte[] doubleBytes2 = new byte[Double.BYTES];
+    byte[] floatBytes2 = new byte[Float.BYTES];
+    Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, 0.0d);
+    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, 0.0f);
--- End diff --

ah good catch! I'm surprised this test passed before...
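A plausible reason it still passed: a freshly allocated `byte[]` in Java is zero-initialized, and the all-zeros bit pattern is exactly IEEE 754 +0.0, so the second pair of buffers already matched the normalized -0.0 buffers even though they were never written. A minimal standalone check of that assumption:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ZeroBytesDemo {
    public static void main(String[] args) {
        // A fresh byte[] is zero-initialized; all-zeros is IEEE 754 +0.0.
        byte[] fresh = new byte[Double.BYTES];
        System.out.println(ByteBuffer.wrap(fresh).getDouble());            // 0.0
        // So a buffer holding a write-normalized -0.0 (i.e. +0.0) compares
        // equal to an untouched buffer, masking the copy-paste bug.
        System.out.println(Arrays.equals(fresh, new byte[Double.BYTES]));  // true
    }
}
```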


---




[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23141#discussion_r239255684
  
--- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java ---
@@ -165,10 +165,14 @@ public void writeMinusZeroIsReplacedWithZero() {
     byte[] floatBytes = new byte[Float.BYTES];
     Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
-    double doubleFromPlatform = Platform.getDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET);
-    float floatFromPlatform = Platform.getFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET);

-    Assert.assertEquals(Double.doubleToLongBits(0.0d), Double.doubleToLongBits(doubleFromPlatform));
-    Assert.assertEquals(Float.floatToIntBits(0.0f), Float.floatToIntBits(floatFromPlatform));
+    byte[] doubleBytes2 = new byte[Double.BYTES];
+    byte[] floatBytes2 = new byte[Float.BYTES];
+    Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, 0.0d);
+    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, 0.0f);
--- End diff --

Unfortunately, these should be `doubleBytes2` and `floatBytes2`.
I'll comment on the follow-up PR https://github.com/apache/spark/pull/23239, too. We can fix it there.
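For reference, a minimal sketch of how the corrected test could look once the writes target the new buffers (variable names follow the diff above; the exact assertions in the follow-up PR may differ):

```java
import org.apache.spark.unsafe.Platform;
import org.junit.Assert;
import org.junit.Test;

public class MinusZeroWriteSuite {
  @Test
  public void writeMinusZeroIsReplacedWithZero() {
    byte[] doubleBytes = new byte[Double.BYTES];
    byte[] floatBytes = new byte[Float.BYTES];
    Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);

    // The fix: write 0.0 into the *second* pair of buffers, not the first.
    byte[] doubleBytes2 = new byte[Double.BYTES];
    byte[] floatBytes2 = new byte[Float.BYTES];
    Platform.putDouble(doubleBytes2, Platform.BYTE_ARRAY_OFFSET, 0.0d);
    Platform.putFloat(floatBytes2, Platform.BYTE_ARRAY_OFFSET, 0.0f);

    // -0.0 must be replaced with 0.0 on write, so the byte patterns match.
    Assert.assertArrayEquals(doubleBytes, doubleBytes2);
    Assert.assertArrayEquals(floatBytes, floatBytes2);
  }
}
```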


---




[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-11-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23141


---




[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-11-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23141#discussion_r236200834
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide
 
  - Since Spark 3.0, the `from_json` function supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` 
did not conform to either `PERMISSIVE` or `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
-  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently of its 
mode if there is no valid root JSON token in its input (` ` for example). Since 
Spark 3.0, such input is treated as a bad record and handled according to the 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if the specified schema is `key STRING, value INT`. 
+  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently of its 
mode if there is no valid root JSON token in its input (` ` for example). Since 
Spark 3.0, such input is treated as a bad record and handled according to the 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
   - In Spark version 2.4 and earlier, users can create map values with map 
type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can still read map values with map type key from data 
source or Java/Scala collections, though they are not very useful.
-  
+
  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a 
grouped dataset with key attribute wrongly named as "value", if the key is 
non-struct type, e.g. int, string, array, etc. This is counterintuitive and 
makes the schema of aggregation queries weird. For example, the schema of 
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the 
grouping attribute "key". The old behaviour is preserved under a newly added 
configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a 
default value of `false`.
 
+  - In Spark version 2.4 and earlier, float/double -0.0 is semantically 
equal to 0.0, but users can still distinguish them via `Dataset.show`, 
`Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 
internally, and users can't distinguish them any more.
--- End diff --

I ran a few simple queries on Hive 2.1.

Simple comparison seems ok:

```
hive> select 1 where 0.0=-0.0;
OK
1
Time taken: 0.047 seconds, Fetched: 1 row(s)
hive> select 1 where -0.0<0.0;
OK
Time taken: 0.053 seconds
```

But the group-by behavior seems incorrect:
```
hive> select * from test;
OK
0.0
-0.0
0.0
Time taken: 0.11 seconds, Fetched: 3 row(s)
hive> select a, count(*) from test group by a;
-0.0    3
Time taken: 1.308 seconds, Fetched: 1 row(s)
```
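The anomaly fits the two notions of floating-point equality on the JVM: primitive `==` treats the two zeros as equal, while the boxed `equals`/`hashCode` (and the raw bit patterns a hash aggregate might key on) treat them as different. A small self-contained illustration:

```java
public class TwoZeros {
    public static void main(String[] args) {
        System.out.println(-0.0d == 0.0d);                                      // true
        System.out.println(Double.valueOf(-0.0d).equals(Double.valueOf(0.0d))); // false
        System.out.println(Double.doubleToLongBits(-0.0d));  // -9223372036854775808
        System.out.println(Double.doubleToLongBits(0.0d));   // 0
        System.out.println(Double.hashCode(-0.0d) == Double.hashCode(0.0d));    // false
    }
}
```

Depending on which convention each code path uses, -0.0 and 0.0 land in one group or two; mixing the conventions produces results like the single `-0.0` group above.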




---




[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-11-26 Thread adoron
Github user adoron commented on a diff in the pull request:

https://github.com/apache/spark/pull/23141#discussion_r236189376
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide
 
  - Since Spark 3.0, the `from_json` function supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` 
did not conform to either `PERMISSIVE` or `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
-  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently of its 
mode if there is no valid root JSON token in its input (` ` for example). Since 
Spark 3.0, such input is treated as a bad record and handled according to the 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if the specified schema is `key STRING, value INT`. 
+  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently of its 
mode if there is no valid root JSON token in its input (` ` for example). Since 
Spark 3.0, such input is treated as a bad record and handled according to the 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
   - In Spark version 2.4 and earlier, users can create map values with map 
type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can still read map values with map type key from data 
source or Java/Scala collections, though they are not very useful.
-  
+
  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a 
grouped dataset with key attribute wrongly named as "value", if the key is 
non-struct type, e.g. int, string, array, etc. This is counterintuitive and 
makes the schema of aggregation queries weird. For example, the schema of 
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the 
grouping attribute "key". The old behaviour is preserved under a newly added 
configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a 
default value of `false`.
 
+  - In Spark version 2.4 and earlier, float/double -0.0 is semantically 
equal to 0.0, but users can still distinguish them via `Dataset.show`, 
`Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 
internally, and users can't distinguish them any more.
--- End diff --

Which version of Hive did you test?
It was fixed in https://issues.apache.org/jira/browse/HIVE-11174.


---




[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-11-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23141#discussion_r236143942
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide
 
  - Since Spark 3.0, the `from_json` function supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` 
did not conform to either `PERMISSIVE` or `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
-  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently of its 
mode if there is no valid root JSON token in its input (` ` for example). Since 
Spark 3.0, such input is treated as a bad record and handled according to the 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if the specified schema is `key STRING, value INT`. 
+  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently of its 
mode if there is no valid root JSON token in its input (` ` for example). Since 
Spark 3.0, such input is treated as a bad record and handled according to the 
specified mode. For example, in the `PERMISSIVE` mode the ` ` input is 
converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
   - In Spark version 2.4 and earlier, users can create map values with map 
type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can still read map values with map type key from data 
source or Java/Scala collections, though they are not very useful.
-  
+
  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a 
grouped dataset with key attribute wrongly named as "value", if the key is 
non-struct type, e.g. int, string, array, etc. This is counterintuitive and 
makes the schema of aggregation queries weird. For example, the schema of 
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the 
grouping attribute "key". The old behaviour is preserved under a newly added 
configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a 
default value of `false`.
 
+  - In Spark version 2.4 and earlier, float/double -0.0 is semantically 
equal to 0.0, but users can still distinguish them via `Dataset.show`, 
`Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 
internally, and users can't distinguish them any more.
--- End diff --

I checked Presto and Postgres; the behavior is the same. Hive distinguishes 
-0.0 and 0.0, but it has the group-by bug.
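Normalizing at write time, which is what the change under test does in `Platform.putDouble`/`putFloat`, sidesteps the problem by ensuring only one zero ever reaches storage. A minimal standalone sketch of the technique (not the actual Spark code):

```java
public class NormalizeZero {
    // Rewrite IEEE 754 negative zero to positive zero before storing, so that
    // grouping, joins, and comparisons downstream only ever see one zero.
    static double normalize(double value) {
        // -0.0d == 0.0d is true, so this catches both zeros and canonicalizes
        // them to the +0.0 bit pattern; NaN and all other values pass through.
        return value == 0.0d ? 0.0d : value;
    }

    public static void main(String[] args) {
        System.out.println(Double.doubleToLongBits(normalize(-0.0d))); // 0
        System.out.println(Double.doubleToLongBits(normalize(0.0d)));  // 0
    }
}
```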


---




[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-11-25 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/23141

[SPARK-26021][SQL][followup] add test for special floating point values

## What changes were proposed in this pull request?

A follow-up of https://github.com/apache/spark/pull/23124. Add a test to 
show the minor behavior change introduced by #23124, and add a migration guide entry.

## How was this patch tested?

A new test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark follow

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23141


commit 8a9103c47931eb61cb329ece046d5efc50e855c2
Author: Wenchen Fan 
Date:   2018-11-26T06:11:09Z

add test for special floating point values




---
