[GitHub] spark issue #23239: [SPARK-26021][SQL][followup] only deal with NaN and -0.0...

2018-12-05 Thread adoron
Github user adoron commented on the issue:

https://github.com/apache/spark/pull/23239
  
@cloud-fan what about UnsafeRow::setDouble/setFloat? They don't go through the 
same flow. Are they not used?


---




[GitHub] spark pull request #23141: [SPARK-26021][SQL][followup] add test for special...

2018-11-26 Thread adoron
Github user adoron commented on a diff in the pull request:

https://github.com/apache/spark/pull/23141#discussion_r236189376
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide
 
  - Since Spark 3.0, the `from_json` function supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, the behavior of 
`from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially 
in processing of malformed JSON records. For example, the JSON string 
`{"a" 1}` with the schema `a INT` is converted to `null` by previous versions 
but Spark 3.0 converts it to `Row(null)`.
 
-  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently 
of its mode if there is no valid root JSON token in its input (` ` for 
example). Since Spark 3.0, such input is treated as a bad record and handled 
according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` 
input is converted to `Row(null, null)` if the specified schema is 
`key STRING, value INT`. 
+  - In Spark version 2.4 and earlier, the `from_json` function produces 
`null`s for JSON strings and the JSON datasource skips the same independently 
of its mode if there is no valid root JSON token in its input (` ` for 
example). Since Spark 3.0, such input is treated as a bad record and handled 
according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` 
input is converted to `Row(null, null)` if the specified schema is 
`key STRING, value INT`.
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
  - In Spark version 2.4 and earlier, users can create map values with map 
type keys via built-in functions like `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type keys with these 
built-in functions. Users can still read map values with map type keys from 
data sources or Java/Scala collections, though they are not very useful.
-  
+
  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a 
grouped dataset whose key attribute is wrongly named "value" if the key is a 
non-struct type, e.g. int, string, array, etc. This is counterintuitive and 
makes the schema of aggregation queries weird. For example, the schema of 
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the 
grouping attribute "key". The old behaviour is preserved under a newly added 
configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with 
a default value of `false`.
 
+  - In Spark version 2.4 and earlier, float/double -0.0 is semantically 
equal to 0.0, but users can still distinguish them via `Dataset.show`, 
`Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 
internally, and users can't distinguish them any more.
--- End diff --

Which version of Hive did you test against? 
This was fixed on the Hive side in https://issues.apache.org/jira/browse/HIVE-11174
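
For reference, a minimal spark-shell illustration of the difference the quoted 
note describes (a sketch only; `spark.implicits._` is pre-imported in the shell, 
and the 3.0 behaviour shown follows from the note above rather than from a run):
```
// Spark 2.4 and earlier: the sign bit of -0.0 survives and is visible to the user.
// Since Spark 3.0 (per the migration note) the stored value is normalized to 0.0.
Seq(-0.0d).toDF("d").collect()   // 2.4: Array([-0.0])   3.0 (per the note): Array([0.0])
```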


---




[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...

2018-11-22 Thread adoron
Github user adoron commented on a diff in the pull request:

https://github.com/apache/spark/pull/23043#discussion_r235685712
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,18 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   "grouping expressions: [current_date(None)], value: [key: int, 
value: string], " +
 "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when 
grouping") {
+val colName = "i"
+val doubles = Seq(0.0d, 0.0d, 
-0.0d).toDF(colName).groupBy(colName).count().collect()
--- End diff --

Example:
```
Seq(0.0d, 0.0d, -0.0d).toDF(colName).groupBy(colName).count().show()
+---+-----+
|  i|count|
+---+-----+
|0.0|    3|
+---+-----+

Seq(0.0d, -0.0d, 0.0d).toDF(colName).groupBy(colName).count().show()
+----+-----+
|   i|count|
+----+-----+
| 0.0|    1|
|-0.0|    2|
+----+-----+

Seq(-0.0d, -0.0d, 0.0d).toDF(colName).groupBy(colName).count().show()
+----+-----+
|   i|count|
+----+-----+
|-0.0|    3|
+----+-----+
```


---




[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...

2018-11-22 Thread adoron
Github user adoron commented on a diff in the pull request:

https://github.com/apache/spark/pull/23043#discussion_r235683861
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,18 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   "grouping expressions: [current_date(None)], value: [key: int, 
value: string], " +
 "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when 
grouping") {
+val colName = "i"
+val doubles = Seq(0.0d, 0.0d, 
-0.0d).toDF(colName).groupBy(colName).count().collect()
--- End diff --

Actually yes: if codegen is enabled, a generated FastHashMap is used for the 
partial grouping before the shuffle, and that map doesn't separate 0.0 and -0.0. 
In addition, there are 2 threads for 2 partitions in the unit test. With the 
doubles Seq order, each partition's partial aggregation ends up keyed on 0.0, so 
after the shuffle the groups are merged.
In the floats case the order of the elements in the Seq is different, so in the 
first grouping one partition is keyed on 0.0 and the other on -0.0, and after 
the shuffle they are treated as different groups (before the fix).
Now I remember this is the reason I originally disabled codegen in the test.
I think I'll just reorder the doubles Seq as well so it manifests the bug.
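
For reference, a sketch of how the partitioning could be pinned down explicitly 
instead of relying on the two local threads (assumes a local `SparkSession` named 
`spark` with codegen enabled; before the fix the output is expected to show two 
groups):
```
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Two explicit partitions: [0.0] and [-0.0, 0.0]. The partial aggregation keys
// the first partition on 0.0 and the second on -0.0, so the byte-based merge
// after the shuffle sees two different keys.
val rdd = spark.sparkContext.parallelize(Seq(Row(0.0d), Row(-0.0d), Row(0.0d)), numSlices = 2)
val df = spark.createDataFrame(rdd, StructType(Seq(StructField("i", DoubleType))))
df.groupBy("i").count().show()   // before the fix: rows for both 0.0 and -0.0
```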


---




[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...

2018-11-20 Thread adoron
Github user adoron commented on a diff in the pull request:

https://github.com/apache/spark/pull/23043#discussion_r234942649
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,32 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   "grouping expressions: [current_date(None)], value: [key: int, 
value: string], " +
 "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when 
grouping") {
+val colName = "i"
+def groupByCollect(df: DataFrame): Array[Row] = {
+  df.groupBy(colName).count().collect()
+}
+def assertResult[T](result: Array[Row], zero: T)(implicit ordering: 
Ordering[T]): Unit = {
+  assert(result.length == 1)
+  // using compare since 0.0 == -0.0 is true
+  assert(ordering.compare(result(0).getAs[T](0), zero) == 0)
+  assert(result(0).getLong(1) == 3)
+}
+
+spark.conf.set("spark.sql.codegen.wholeStage", "false")
+val doubles =
+  groupByCollect(Seq(0.0d, 0.0d, -0.0d).toDF(colName))
+val doublesBoxed =
+  groupByCollect(Seq(Double.box(0.0d), Double.box(0.0d), 
Double.box(-0.0d)).toDF(colName))
+val floats =
+  groupByCollect(Seq(0.0f, -0.0f, 0.0f).toDF(colName))
--- End diff --

Looks like leftovers from a different solution. Also, there's no need to test 
the boxed version now that the fix isn't in the codegen. I'll simplify the test.
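
Roughly what the simplified test might look like (a sketch only, not the final 
patch; it assumes it lives in DataFrameAggregateSuite so `toDF` comes from the 
suite's `testImplicits`):
```
test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when grouping") {
  val colName = "i"
  // Ordering chosen so that, with two local partitions, the partial aggregations
  // produce different zero keys and the bug manifests even with codegen enabled.
  val doubles = Seq(0.0d, -0.0d, 0.0d).toDF(colName).groupBy(colName).count().collect()
  val floats = Seq(0.0f, -0.0f, 0.0f).toDF(colName).groupBy(colName).count().collect()
  assert(doubles.length == 1 && doubles.head.getLong(1) == 3)
  assert(floats.length == 1 && floats.head.getLong(1) == 3)
}
```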


---




[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Platf...

2018-11-19 Thread adoron
Github user adoron commented on the issue:

https://github.com/apache/spark/pull/23043
  
@kiszk is there a use case where the preliminary RDD isn't created with 
UnsafeRows? If not, the data will already have been corrected on reading.

Anyway, looking at all the different implementations of InternalRow.setDouble, I 
found the following places that aren't handled:
```
OnHeapColumnVector.putDouble
MutableDouble.update
GenericInternalRow.update
SpecificInternalRow.setDouble
```
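
A sketch of how one of these paths could be checked for the same issue 
(illustrative only, not part of the PR; it assumes OnHeapColumnVector's public 
constructor and its putDouble/getDouble accessors):
```
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.DoubleType

// If the write path does not normalize the sign bit, -0.0 survives the round trip.
val vec = new OnHeapColumnVector(1, DoubleType)
vec.putDouble(0, -0.0d)
val bits = java.lang.Double.doubleToRawLongBits(vec.getDouble(0))
println(bits == java.lang.Double.doubleToRawLongBits(-0.0d))  // true => -0.0 preserved
```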


---




[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...

2018-11-19 Thread adoron
Github user adoron commented on a diff in the pull request:

https://github.com/apache/spark/pull/23043#discussion_r234676540
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,32 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   "grouping expressions: [current_date(None)], value: [key: int, 
value: string], " +
 "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when 
grouping") {
+val colName = "i"
+def groupByCollect(df: DataFrame): Array[Row] = {
+  df.groupBy(colName).count().collect()
+}
+def assertResult[T](result: Array[Row], zero: T)(implicit ordering: 
Ordering[T]): Unit = {
+  assert(result.length == 1)
+  // using compare since 0.0 == -0.0 is true
+  assert(ordering.compare(result(0).getAs[T](0), zero) == 0)
--- End diff --

I'm not sure I follow. Below this I'm constructing Seqs with 0.0 and -0.0 as in 
the JIRA, and in the assertResult helper I'm checking that there's only 1 row, 
as you said.
Do you mean the check that the key is indeed 0.0 and not -0.0 is redundant?
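
For context, a quick plain-Scala illustration of why `compare` is used here 
rather than `==`:
```
// == follows IEEE 754 semantics and cannot tell the two zeros apart,
// while Double.compare (which Ordering[Double] delegates to) can.
0.0d == -0.0d                          // true
java.lang.Double.compare(0.0d, -0.0d)  // positive: 0.0 sorts after -0.0
Ordering[Double].compare(0.0d, -0.0d)  // also non-zero, hence the compare-based assert
```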


---




[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...

2018-11-19 Thread adoron
Github user adoron commented on a diff in the pull request:

https://github.com/apache/spark/pull/23043#discussion_r234674948
  
--- Diff: 
common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java ---
@@ -157,4 +159,15 @@ public void heapMemoryReuse() {
 Assert.assertEquals(onheap4.size(), 1024 * 1024 + 7);
 Assert.assertEquals(obj3, onheap4.getBaseObject());
   }
+
+  @Test
+  // SPARK-26021
+  public void writeMinusZeroIsReplacedWithZero() {
+    byte[] doubleBytes = new byte[Double.BYTES];
+    byte[] floatBytes = new byte[Float.BYTES];
+    Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
+    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
+    Assert.assertEquals(0, Double.compare(0.0d, ByteBuffer.wrap(doubleBytes).getDouble()));
--- End diff --

Yeah, it fails. Indeed 0.0 == -0.0 is true, so I'm using Double.compare == 0 to 
test this.


---




[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...

2018-11-17 Thread adoron
Github user adoron commented on the issue:

https://github.com/apache/spark/pull/23043
  
@cloud-fan changing writeDouble/writeFloat in UnsafeWriter indeed does the 
trick!
I'll fix the PR. I was thinking about making the change in 
`Platform::putDouble` so that all accesses are affected, in UnsafeRow as well 
as UnsafeWriter.
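
For reference, the replacement being discussed amounts to something like the 
following (a Scala sketch of the check a write path such as 
UnsafeWriter.writeDouble/writeFloat or `Platform::putDouble` would apply, not 
the actual patch):
```
// -0.0 == 0.0 is true, so this branch matches both zeros; rewriting to 0.0 is a
// no-op for 0.0 and clears the sign bit for -0.0 before the bytes are written.
def normalizeDouble(value: Double): Double = if (value == -0.0d) 0.0d else value
def normalizeFloat(value: Float): Float = if (value == -0.0f) 0.0f else value

assert(java.lang.Double.doubleToRawLongBits(normalizeDouble(-0.0d)) == 0L)
assert(normalizeDouble(1.5d) == 1.5d)  // other values pass through unchanged
```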


---




[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...

2018-11-16 Thread adoron
Github user adoron commented on the issue:

https://github.com/apache/spark/pull/23043
  
@cloud-fan that's what I thought at first as well, but the flow doesn't go 
through that code. I checked by running 
`Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().collect()` 
with a breakpoint set there.

The reason -0.0 and 0.0 are put in different buckets of "group by" is in 
UnsafeFixedWidthAggregationMap::getAggregationBufferFromUnsafeRow():
```
public UnsafeRow getAggregationBufferFromUnsafeRow(UnsafeRow key) {
  return getAggregationBufferFromUnsafeRow(key, key.hashCode());
}
```
The hashing is done on the UnsafeRow, and by this point the whole row is hashed 
as a unit, so it's hard to find the double columns and their values.
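
The underlying reason the byte-level hashing separates them is that the two 
zeros compare equal but have different bit patterns (a quick check):
```
// Only the sign bit differs, but that is enough for hashing and equality on the
// UnsafeRow's raw bytes to treat the two rows as different grouping keys.
0.0d == -0.0d                                            // true
java.lang.Double.doubleToRawLongBits(0.0d)               // 0
java.lang.Double.doubleToRawLongBits(-0.0d).toHexString  // "8000000000000000"
```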



---




[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...

2018-11-15 Thread adoron
Github user adoron commented on the issue:

https://github.com/apache/spark/pull/23043
  
@srowen @gatorsmile  @cloud-fan 


---




[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...

2018-11-15 Thread adoron
GitHub user adoron opened a pull request:

https://github.com/apache/spark/pull/23043

[SPARK-26021][SQL] replace minus zero with zero in UnsafeProjection

GROUP BY treats -0.0 and 0.0 as different values, which is unlike Hive's 
behavior.
In addition, the current behavior with codegen is unpredictable (see the 
example in the JIRA ticket).

## What changes were proposed in this pull request?

In the BoundReference class, in the generated code, replace -0.0 with 0.0 if 
the data type is double or float.

## How was this patch tested?

Added tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adoron/spark adoron-spark-26021-replace-minus-zero-with-zero

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23043.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23043


commit ee0ef91a7047d47328efac753e66ec97a91c0e37
Author: Alon Doron 
Date:   2018-11-14T16:18:30Z

replace -0.0 with 0.0 in BoundAttribute
added tests

commit 63b7f59ad44d0876ea6dde02e4204fc0140d0df6
Author: Alon Doron 
Date:   2018-11-14T16:27:24Z

minor remove var type




---
