[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21372


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r190383386
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -169,6 +170,14 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-24322 Fix incorrect workaround for bug in 
java.sql.Timestamp") {
+withTempPath { path =>
+  val ts = Timestamp.valueOf("1900-05-05 12:34:56.000789")
+  Seq(ts).toDF.write.orc(path.getCanonicalPath)
+  checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+}
+  }
--- End diff --

`OrcSourceSuite` is dedicated for `native` Orc Reader . For `hive` ORC 
reader, `HiveOrcSourceSuite`. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r190382953
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -169,6 +170,14 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-24322 Fix incorrect workaround for bug in 
java.sql.Timestamp") {
+withTempPath { path =>
+  val ts = Timestamp.valueOf("1900-05-05 12:34:56.000789")
+  Seq(ts).toDF.write.orc(path.getCanonicalPath)
+  checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+}
+  }
--- End diff --

Oh, I missed this comments. Hive ORC and ORC MR reader doesn't have this 
bug because it uses `java.sql.Timestamp` class to unserialize it. This happens 
when we directly access the ORC column's sub-vectors, `times` and `nanos`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r190253937
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -169,6 +170,14 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-24322 Fix incorrect workaround for bug in 
java.sql.Timestamp") {
+withTempPath { path =>
+  val ts = Timestamp.valueOf("1900-05-05 12:34:56.000789")
+  Seq(ts).toDF.write.orc(path.getCanonicalPath)
+  checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+}
+  }
--- End diff --

please explicitly set hive reader to native for this test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r190123342
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -169,6 +170,14 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-24322 Fix incorrect workaround for bug in 
java.sql.Timestamp") {
+withTempPath { path =>
+  val ts = Timestamp.valueOf("1900-05-05 12:34:56.000789")
+  Seq(ts).toDF.write.orc(path.getCanonicalPath)
+  checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+}
+  }
--- End diff --

Does that mean the Hive ORC reader works, but the native ORC reader has the 
bug? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-22 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r190073105
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -169,6 +170,14 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-24322 Fix incorrect workaround for bug in 
java.sql.Timestamp") {
+withTempPath { path =>
+  val ts = Timestamp.valueOf("1900-05-05 12:34:56.000789")
+  Seq(ts).toDF.write.orc(path.getCanonicalPath)
+  checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+}
+  }
--- End diff --

I added the test case for ORC-306 and update the PR title.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-22 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r190042175
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

No, what I mean is, with ORC-306 and this fix, there is no external impact 
outside Spark. More specifically, outside 
`OrcColumnVector`/`OrcColumnarBatchReader`. In other words, ORC 1.4.4 cannot be 
used with Apache Spark without this patch.

Java `Timestamp.getTime` and Timestamp.getNano` has an overlap by 
definition. Previously, ORC didn't stick to the definition.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r190030340
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

Are you saying no external impact of ORC-306?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r189720311
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

ORC-306 changes the content of exposed ORC column vectors in reader side. 
The interpretation is Spark's logic as we see in this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r189713487
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

Based on my understanding, ORC-306 changes the query result, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r189674371
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

`OrcHadoopFsRelationSuite` covers this changes via end-to-end write and 
read test cases.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r189673786
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

The change is on 
[TreeReaderFactory.java](https://github.com/apache/orc/commit/6c4865ad9acb75c35d97206f31b4dd9e0a3a7cb4#diff-dcf15a871eb200f0fceaa924e14a01d4R980).
 From Apache ORC project, the prior code is ORC-1 which is importing code from 
Hive two years ago.

Effectively, the writer side is the same. Only, reader side is changed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r189654707
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

Do you know when this issue was introduced in ORC?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r189644114
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

Add a test case? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21372#discussion_r189472897
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java
 ---
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
 int index = getRowIndex(rowId);
 if (isTimestamp) {
-  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000;
+  return timestampData.time[index] * 1000 + timestampData.nanos[index] 
/ 1000 % 1000;
--- End diff --

In Apache ORC 1.4.4, ORC-306 fixes this according to the [original 
definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46).
 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-20 Thread dongjoon-hyun
GitHub user dongjoon-hyun reopened a pull request:

https://github.com/apache/spark/pull/21372

[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

## What changes were proposed in this pull request?

ORC 1.4.4 includes [nine fixes like 
ORC-301](https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4).
 This issue aims to update it.


## How was this patch tested?

Pass the Jenkins.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK_ORC144

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21372


commit 2d11cdced5a53a7504cb2e52df67ad9256870d90
Author: Dongjoon Hyun 
Date:   2018-05-19T18:20:51Z

[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-20 Thread dongjoon-hyun
Github user dongjoon-hyun closed the pull request at:

https://github.com/apache/spark/pull/21372


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-19 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/21372

[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

## What changes were proposed in this pull request?

ORC 1.4.4 includes [nine fixes like 
ORC-301](https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4).
 This issue aims to update it.


## How was this patch tested?

Pass the Jenkins.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK_ORC144

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21372


commit 2d11cdced5a53a7504cb2e52df67ad9256870d90
Author: Dongjoon Hyun 
Date:   2018-05-19T18:20:51Z

[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org