[GitHub] spark issue #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and...

2018-02-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20634
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...

2018-02-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20613
  
cc @dongjoon-hyun  Do you want to make a try to help review this PR?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
LGTM 

Thanks! Merged to master. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20619: [SPARK-23390][SQL] Register task completion listeners fi...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20619
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
I agree on what @omalley said. The new reader based on ORC 1.4 is better 
than the old reader. That is why we chose the new reader as the default at the 
beginning. We also saw the performance improvement in the micro-benchmark. 

However, at the RC3 of Spark 2.3, we realize the new reader introduces the 
regression. After RC3, we reverted many PRs that caused the regressions and 
also rejected many bug fixes that could introduce new regressions. We are very 
conservative when merging the bug fixes in this stage. Thus, I also think the 
suggestion from @marmbrus is very reasonable. That is the strategy we are 
following in the previous Spark releases. 

Regarding this specific case, we did not revert the new reader, but just 
changed the default values. Users can try the new reader. We just want to avoid 
breaking the existing workloads when they upgrade to the upcoming Spark 2.3 
release. In the next Spark 2.4 release, I believe we can feel more confident to 
choose the ORC new reader as the default. 

@dongjoon-hyun Could you submit a PR against the master branch to turn on 
them by default? Also add a migration guide. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20568
  
@mrkm4ntr Thank you for your contribution! The PR has been merged using 
your Github account. Could you close this? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20619: [SPARK-23390][SQL] Register task completion listeners fi...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20619
  
He is already on vacation. : ) 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20630: [SPARK-23381][CORE] Murmur3 hash generates a different v...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20630
  
Thanks! Merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20630: [SPARK-23381][CORE] Murmur3 hash generates a diff...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20630#discussion_r168904179
  
--- Diff: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_x86_32.java ---
@@ -60,6 +60,8 @@ public static int hashUnsafeWords(Object base, long 
offset, int lengthInBytes, i
   }
 
   public static int hashUnsafeBytes(Object base, long offset, int 
lengthInBytes, int seed) {
+// This is not compatible with original and another implementations.
+// But remain it for backward compatibility for the components 
existing before 2.3.
--- End diff --

Will correct it in the follow-up PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20630: [SPARK-23381][CORE] Murmur3 hash generates a diff...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20630#discussion_r168904164
  
--- Diff: 
common/unsafe/src/test/java/org/apache/spark/unsafe/hash/Murmur3_x86_32Suite.java
 ---
@@ -51,6 +53,23 @@ public void testKnownLongInputs() {
 Assert.assertEquals(-2106506049, hasher.hashLong(Long.MAX_VALUE));
   }
 
+  // SPARK-23381 Check whether the hash of the byte array is the same as 
another implementations
--- End diff --

Yeah, we can do it in the follow-up PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20568
  
Submitted the PR https://github.com/apache/spark/pull/20630 to take this 
over. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20630: [SPARK-23381][CORE] Murmur3 hash generates a different v...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20630
  
cc @hvanhovell @sameeragarwal @jkbradley @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20630: [SPARK-23381][CORE] Murmur3 hash generates a diff...

2018-02-16 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20630

[SPARK-23381][CORE] Murmur3 hash generates a different value from other 
implementations

## What changes were proposed in this pull request?
Murmur3 hash generates a different value from the original and other 
implementations (like Scala standard library and Guava or so) when the length 
of a bytes array is not multiple of 4.

## How was this patch tested?
Added a unit test.

**Note:** When we merge this PR, please give all the credits to Shintaro 
Murakami. 

Author: Shintaro Murakami <mrkm4...@gmail.com>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark pr-20568

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20630


commit c20cd97d7ce5690993b4490bb7cca955e7703d90
Author: Shintaro Murakami <mrkm4ntr@...>
Date:   2018-02-10T14:29:48Z

[SPARK-23381][CORE] Fix Murmur3 for byte arrays whose byte length is not a 
multiple of 4

commit 5fcd994554a82d809ac8547378776e2b95c37d93
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-16T21:05:32Z

fix.

commit c406f9846f81ddf8030803cb918bb51506d4ae6b
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-16T21:19:28Z

fix.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20568: [SPARK-23381][CORE] Murmur3 hash generates a diff...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20568#discussion_r168870192
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
@@ -218,4 +221,32 @@ object FeatureHasher extends 
DefaultParamsReadable[FeatureHasher] {
 
   @Since("2.3.0")
   override def load(path: String): FeatureHasher = super.load(path)
+
+  private val seed = OldHashingTF.seed
+
+  /**
+   * Calculate a hash code value for the term object using
+   * Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32).
+   * This is the default hash algorithm used from Spark 2.0 onwards.
+   * Use hashUnsafeBytes2 to match the original algorithm with the value.
+   * See SPARK-23381.
+   */
+  @Since("2.3.0")
+  def murmur3Hash(term: Any): Int = {
--- End diff --

I would also address this comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20568
  
To speedup the work here, I will take this over. All the contributions 
should be given to @mrkm4ntr 

Thanks for your work! @mrkm4ntr 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
@dongjoon-hyun That also sounds good to me. Always welcome to seeing more 
contributions to help improve the test coverage. Doing our best to improve the 
test coverage and code readability. 

Normally, when the reviewers have questions, (sometimes the questions are 
stupid), that might also mean the codes need more comments/refactoring. This 
can help the future code developers and maintainers. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r168836009
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,15 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-23340 Empty float/double array columns raise EOFException") {
+Seq(Seq(Array.empty[Float]).toDF(), 
Seq(Array.empty[Double]).toDF()).foreach { df =>
+  withTempPath { path =>
--- End diff --

You know, we still can support complex type vectorization in the future, 
but we might not re-visist all the tests in the future. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r168833990
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,15 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-23340 Empty float/double array columns raise EOFException") {
+Seq(Seq(Array.empty[Float]).toDF(), 
Seq(Array.empty[Double]).toDF()).foreach { df =>
+  withTempPath { path =>
--- End diff --

When we write these test cases, we should do our best to cover all the 
scenarios even if it is not supported now. This is like black-box testing.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20611
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20625
  
Thanks for the fast fix! We need to merge it to SPARK-2.3.0 before RC4. 
Will merge it now. We can improve the fix later if anybody has better ideas. 

Thanks! Merged to master/2.3

Happy Lunar New Year!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
Thanks! Happy Lunar New Year!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20621
  
It sounds like Spark 2.2 already has this bug. This causes an incorrect 
result.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20621
  
This is a blocker-level regression.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r168817045
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,15 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-23340 Empty float/double array columns raise EOFException") {
+Seq(Seq(Array.empty[Float]).toDF(), 
Seq(Array.empty[Double]).toDF()).foreach { df =>
+  withTempPath { path =>
--- End diff --

We have three ORC readers, right? We need to check all of them, and also 
vectorized reader too, even if they do not support it. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20057
  
This test is a flaky test. Your changes did not fail any test case. I will 
review your PR after the 2.3 release. Thanks again!

cc @dongjoon-hyun Do you want to take a look at this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20057
  
Our overwrite semantics is confusing to most. We need to correct it in the 
next release, i.e., Spark 2.4.

Even if we try our best to keep the schema of the original table, the 
actual CREATE TABLE statements still take many vendor-specific info. It is hard 
for us to rebuild all of them. I can understand your use case for truncate. 

I am sorry this will not be part of Spark 2.3 release. We will include it 
in the next release. You can still do the change in your forked Spark. 

Just feel free to let us know if you find anything that we should do in 
Spark SQL JCBC to match the corresponding ones in SQOOP. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
@HyukjinKwon Will you submit a fix for the binary type today? We are very 
close to RC4. This is kind of urgent if we still want to block it in the Spark 
2.3.0 release. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r168686683
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,15 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-23340 Empty float/double array columns raise EOFException") {
+Seq(Seq(Array.empty[Float]).toDF(), 
Seq(Array.empty[Double]).toDF()).foreach { df =>
+  withTempPath { path =>
--- End diff --

Please also test both readers as we discussed above. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r168638420
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,15 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  test("SPARK-23340 Empty float/double array columns raise EOFException") {
+Seq(Seq(Array.empty[Float]).toDF(), 
Seq(Array.empty[Double]).toDF()).foreach { df =>
+  withTempPath { path =>
--- End diff --

Are we testing the native readers?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20619: [SPARK-23390][SQL] Register task completion listerners f...

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20619
  
cc @ala @michal-databricks @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20617: [MINOR][SQL] Fix an error message about inserting into b...

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20617
  
Thanks! Merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20057
  
@danielvdende @Fokko We definitely want to help the community replace Sqoop 
by Spark SQL. However, `truncate` is only used when users use 
SaveMode.Overwrite to write the external JDBC tables. In this 
specific scenario, Spark will truncate an existing table instead of dropping 
and recreating it. 

Could you show me the key missing features that are available in Sqoop but 
not in Spark SQL JDBC connectors?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20057
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
@dongjoon-hyun Could you rebase this PR? We want to merge it to master. 
Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20610: [SPARK-23426][SQL] Use `hive` ORC impl and disable PPD f...

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20610
  
LGTM

Thanks! Merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20614: Revert [SPARK-23094] Fix invalid character handling in J...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20614
  
BTW, the ongoing test will be killed by -9 at midnight 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20614: Revert [SPARK-23094] Fix invalid character handling in J...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20614
  
Since this is just to remove the file and the previous test already passed, 
I would merge it to master/2.3. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20606: [SPARK-23421] [SQL] Document the behavior change in SPAR...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20606
  
Thanks! Merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
What is the root cause? Do we have a trivial fix to resolve/block it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20610: [SPARK-23426][SQL] Use `hive` ORC impl and disabl...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20610#discussion_r168395230
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1004,6 +1004,29 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
 
 
+## ORC Files
+
+Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC 
file format for ORC files.
+To do that, the following configurations are newly added. The vectorized 
reader is used for the
+native ORC tables (e.g., the ones created using the clause `USING ORC`) 
when `spark.sql.orc.impl`
+is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to 
`true`. For the Hive ORC
+serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS 
(fileFormat 'ORC')`),
+the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is 
also set to `true`.
+
+
+  Property 
NameDefaultMeaning
+  
+spark.sql.orc.impl
+hive
+The name of ORC implementation. It can be one of 
native and hive. native means the native 
ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in 
Hive 1.2.1 which is used prior to Spark 2.3.
--- End diff --

Remove ` which is used prior to Spark 2.3`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20614: Revert [SPARK-23094] Fix invalid character handling in J...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20614
  
Saw the history. Our UTF-16 support is pretty weak. Let me revert the test 
case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
The behavior inconsistency between `toPandas` and `createDataFrame` looks 
confusing to end users, I have to admit. 

In the current stage, we are unable to merge the fix for these new features 
to Spark 2.3 branch. Let us wait for the release of Spark 2.3.0  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
Then, let us wait for the release of Spark 2.3.0. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20610: [SPARK-23426][SQL] Use `hive` ORC impl and disabl...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20610#discussion_r168347968
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1004,6 +1004,24 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
 
 
+## ORC Files
+
+Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC 
file format for ORC files. To do that, the following configurations are newly 
added. The vectorized reader is used for the native ORC tables (e.g., the ones 
created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to 
`native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the 
Hive ORC serde table (e.g., the ones created using the clause `USING HIVE 
OPTIONS (fileFormat 'ORC')`), the vectorized reader is used when 
`spark.sql.hive.convertMetastoreOrc` is set to `true`.
+
+
+  Property 
NameDefaultMeaning
+  
+spark.sql.orc.impl
+hive
+The name of ORC implementation. It can be one of 
native and hive. native means the native 
ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in 
Hive 1.2.1 which is used prior to Spark 2.3.
+  
+  
+spark.sql.orc.enableVectorizedReader
+true
+Enables vectorized orc decoding in native 
implementation. If false, a new non-vectorized ORC reader is used 
in native implementation. For hive implementation, 
this is ignored.
+  
+
--- End diff --

Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20610: [SPARK-23426][SQL] Use `hive` ORC impl and disabl...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20610#discussion_r168347783
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1004,6 +1004,24 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
 
 
+## ORC Files
+
+Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC 
file format for ORC files. To do that, the following configurations are newly 
added. The vectorized reader is used for the native ORC tables (e.g., the ones 
created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to 
`native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the 
Hive ORC serde table (e.g., the ones created using the clause `USING HIVE 
OPTIONS (fileFormat 'ORC')`), the vectorized reader is used when 
`spark.sql.hive.convertMetastoreOrc` is set to `true`.
--- End diff --

`table ` -> `tables `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20610: [SPARK-23426][SQL] Use `hive` ORC impl and disabl...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20610#discussion_r168347455
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -207,6 +207,16 @@ class FileStreamSourceSuite extends 
FileStreamSourceTest {
   .collect { case s @ StreamingRelation(dataSource, _, _) => s.schema 
}.head
   }
 
+  override def beforeAll(): Unit = {
+super.beforeAll()
+spark.sessionState.conf.setConf(SQLConf.ORC_IMPLEMENTATION, "native")
+  }
+
+  override def afterAll(): Unit = {
+spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
+super.afterAll()
--- End diff --

```
try {
  spark.sessionState.conf.unsetConf(SQLConf.ORC_IMPLEMENTATION)
} finally {
  super.afterAll()
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20406: [SPARK-23230][SQL]When hive.default.fileformat is other ...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20406
  
This is a trivial bug fix. I am fine if anybody wants to revert it from 
Spark 2.3 and merge it later after 2.3.0 release. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20614: Revert [SPARK-23094] Fix invalid character handling in J...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20614
  
cc @brkyvz @MaxGekk @dongjoon-hyun @HyukjinKwon @hvanhovell 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20614: Revert [SPARK-23094] Fix invalid character handli...

2018-02-14 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20614

Revert [SPARK-23094] Fix invalid character handling in JsonDataSource

## What changes were proposed in this pull request?
This PR is to revert the PR https://github.com/apache/spark/pull/20302, 
because it causes a regression.

## How was this patch tested?
Added a test case. Without the revert, it return an empty result set. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark revertJsonFix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20614.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20614


commit 3cd91b1d07d3cb2451045913d0c1e27226a67816
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-14T23:23:34Z

added test case

commit d4015d0c3c9b5cae0309bd6b9486b4990c7f4479
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-14T23:24:38Z

Revert "[SPARK-23094] Fix invalid character handling in JsonDataSource"

This reverts commit e01919e834d301e13adc8919932796ebae900576.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
uh, I missed the previous comment. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
Based on our standard, we do not change the dependencies in the minor 
releases. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20610: [SPARK-23426][SQL] Use `hive` ORC impl and disable PPD f...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20610
  
I think it makes sense to fix the test cases in the same PR, as long as 
they are not bug fixes. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20372: [SPARK-23249] [SQL] Improved block merging logic for par...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20372
  
This PR was merged to RC3 of Spark 2.3. For all such fixes, we should not 
merge them to Spark 2.3. The performance regression has been witnessed in RC3, 
compared with RC2. We did not investigate more. For safety, we have to revert 
it from Spark 2.3 and master.

Maybe you can submit a new PR and make it configurable? Does this sound 
good to you?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20372: [SPARK-23249] [SQL] Improved block merging logic for par...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20372
  
Reverted from 2.3 and master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20372: [SPARK-23249] [SQL] Improved block merging logic for par...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20372
  
We saw a performance regression in SPARK 2.3 about this change. Let me 
revert it now and please resubmit the PR with more reviews. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20610: [SPARK-23426][SQL] Use `hive` ORC implementation ...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20610#discussion_r168267941
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1784,7 +1784,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   Property 
NameDefaultMeaning
   
 spark.sql.orc.impl
-native
+hive
--- End diff --

We do not need this in the migration guide. Please create a new section for 
ORC


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20610: [SPARK-23426][SQL] Use `hive` ORC implementation ...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20610#discussion_r168267722
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -399,11 +399,11 @@ object SQLConf {
 
   val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
 .doc("When native, use the native version of ORC support instead of 
the ORC library in Hive " +
-  "1.2.1. It is 'hive' by default prior to Spark 2.3.")
+  "1.2.1. It is 'hive' by default.")
 .internal()
 .stringConf
 .checkValues(Set("hive", "native"))
-.createWithDefault("native")
+.createWithDefault("hive")
--- End diff --

We also need to disable the ORC pushdown, because the ORC reader of Hive 
1.2.1 has a few bugs. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20606: [SPARK-23421] [SQL] Document the behavior change ...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20606#discussion_r168265404
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -195,7 +195,7 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 
 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0")
+  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0", "2.2.1")
--- End diff --

The main goal of this PR is not to test it but to document it. We need 
another backport PR to SPARK 2.2 without including the test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20606: [SPARK-23421] [SQL] Document the behavior change ...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20606#discussion_r168265021
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -195,7 +195,7 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 
 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0")
+  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0", "2.2.1")
--- End diff --

Already has it in the PR description. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
Based on what @marmbrus suggested above, given how late it is in the cycle, 
we have to disable the native ORC reader (and also the filter pushdown) in 
2.3.0 to avoid delaying the release, but add to release notes on how to turn 
them on manually. 

In the long term, our strategy is to improve our test coverage in the Spark 
side, instead of relying on the test cases of external data sources. 

cc @dongjoon-hyun @sameeragarwal @cloud-fan @rxin 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
We are unable to contain option 3 in Spark 2.3.0. This is too big to merge 
it in the current stage. We still can do it in 2.3.1. 

If needed, I am fine to throw a better error message if the PR size is very 
small; otherwise, keep it unchanged in 2.3.0.

Also cc @liancheng @yhuai 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20606: [SPARK-23421] [SQL] Document the behavior change ...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20606#discussion_r168257301
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -195,7 +195,7 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 
 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0")
+  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0", "2.2.1")
--- End diff --

This is to verify what we explained in doc is correct. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20597: [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveExternal...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20597
  
Just submitted a PR https://github.com/apache/spark/pull/20606


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20606: [SPARK-23421] [SQL] Document the behavior change in SPAR...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20606
  
cc @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20606: [SPARK-23421] [SQL] Document the behavior change ...

2018-02-14 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20606

[SPARK-23421] [SQL] Document the behavior change in SPARK-22356

## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/19579 introduces a behavior change. We 
need to document it in the migration guide.

## How was this patch tested?
Also update the HiveExternalCatalogVersionsSuite to verify it.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark addMigrationGuide

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20606.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20606


commit 02275a3d64dfa296a356bfd1aab609c7ba879366
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-14T08:11:49Z

update




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20605: [SPARK-23419][SPARK-23416][SS] data source v2 write path...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20605
  
Also cc @tdas @marmbrus 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20593: [SPARK-23230][SQL][BRANCH-2.2]When hive.default.fileform...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20593
  
Thanks! Merged to 2.2.

Could you close it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20593: [SPARK-23230][SQL][BRANCH-2.2]When hive.default.fileform...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20593
  
Supporting Hadoop 3.0 is being discussed. Now, we do not support it yet. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20477
  
Thanks! The PR has been reverted. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20477
  
As pointed out by @tdas , since this PR impacts the streaming, I am 
reverting this PR from master. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20597: [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveExternal...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20597
  
This is a behavior change introduced in 
https://github.com/apache/spark/pull/19579. We need to add a migration guide 
for this case. Will do it by myself. 

@seancxmao Thanks for your contribution!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20597: [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveExternal...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20597
  
Let me take this over and do more investigation. This looks suspicious. I 
think we are changing the behaviors in 2.3 and 2.2.1


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20599: [SPARK-23407][SQL] add a config to try to inline all mut...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20599
  
This PR will be hold until 2.3 is released.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167958614
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,16 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  // This is a test case for ORC-285
+  test("SPARK-23340 Empty float/double array columns raise EOFException") {
+Seq(Seq(Array.empty[Float]).toDF(), 
Seq(Array.empty[Double]).toDF()).foreach { df =>
--- End diff --

uh, I see. I just saw [your reply to 
omalley](https://github.com/apache/spark/pull/20511#discussion_r167952000)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167952859
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala 
---
@@ -208,4 +208,15 @@ class HiveOrcQuerySuite extends OrcQueryTest with 
TestHiveSingleton {
   }
 }
   }
+
+  // This is a test case for ORC-285
--- End diff --

This is not a test case for ORC-285


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167952725
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,16 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  // This is a test case for ORC-285
+  test("SPARK-23340 Empty float/double array columns raise EOFException") {
+Seq(Seq(Array.empty[Float]).toDF(), 
Seq(Array.empty[Double]).toDF()).foreach { df =>
--- End diff --

Since this is the issue when we use vectorized ORC reader, please check 
both vectorized and non-vectorized readers.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167952413
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,16 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  // This is a test case for ORC-285
--- End diff --

Thank you for your investigation. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20419
  
@kiszk This is unable to merge to 2.3. Could you open a new JIRA for this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId i...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20419#discussion_r167947791
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1226,14 +1226,21 @@ class CodegenContext {
 
   /**
* Register a comment and return the corresponding place holder
+   *
+   * @param placeholderId a string for a place holder
+   * @param force whether to force registering the comments
*/
-  def registerComment(text: => String): String = {
+  def registerComment(
+  text: => String,
+  placeholderId: String = "",
+  force: Boolean = false): String = {
 // By default, disable comments in generated code because computing 
the comments themselves can
 // be extremely expensive in certain cases, such as deeply-nested 
expressions which operate over
 // inputs with wide schemas. For more details on the performance 
issues that motivated this
 // flat, see SPARK-15680.
-if (SparkEnv.get != null && 
SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {
-  val name = freshName("c")
+if (force ||
+  SparkEnv.get != null && 
SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {
--- End diff --

Now, the question is whether we can use `SQLConf.get` in this case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20548: [SPARK-23316][SQL] AnalysisException after max iteration...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20548
  
This is a regression introduced by 
https://github.com/apache/spark/pull/18968. We have to merge to 2.3. I resolved 
my comments when I merge the codes. 

Thanks! Merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167943246
  
--- Diff: pom.xml ---
@@ -1736,10 +1736,6 @@
 org.apache.hive
 hive-storage-api
   
-  
-io.airlift
--- End diff --

Thanks for the confirmation  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167943114
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,16 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  // This is a test case for ORC-285
--- End diff --

yeah, for hive table, we might need a new test case


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][SQL] Empty float/double array columns in O...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
I would still suggest to change the PR title back to `Upgrade ORC to 1.4.3`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Empty float/double array colum...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167940196
  
--- Diff: pom.xml ---
@@ -1736,10 +1736,6 @@
 org.apache.hive
 hive-storage-api
   
-  
-io.airlift
--- End diff --

This is by ORC 1.4.2?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Empty float/double array colum...

2018-02-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167937286
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,16 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   }
 }
   }
+
+  // This is a test case for ORC-285
--- End diff --

This is the only issue for the new ORC reader, right? 

For the other two ORC reading modes, it works well, right? Could you also 
added test cases for the others too?

If this is the case, this should be a regression we have to merge to 2.3 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20595: [SPARK-20090][FOLLOW-UP] Revert the deprecation of `name...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20595
  
cc @rxin @cloud-fan @HyukjinKwon 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20595: [SPARK-20090][FOLLOW-UP] Revert the deprecation o...

2018-02-12 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20595

[SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark 

## What changes were proposed in this pull request?
Deprecating the field `name` in PySpark is not expected. This PR is to 
revert the change.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark removeDeprecate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20595


commit 494dccd00217355f5277a65776a2768e3bab80ec
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-13T05:19:03Z

fix.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fiel...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20545#discussion_r167762799
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -104,6 +104,13 @@ case class StructType(fields: Array[StructField]) 
extends DataType with Seq[Stru
   /** Returns all field names in an array. */
   def fieldNames: Array[String] = fields.map(_.name)
 
+  /**
+   * Returns all field names in an array. This is an alias of `fieldNames`.
+   *
+   * @since 2.3.0
--- End diff --

This is too late to be merged to 2.3.0. Please change it to 2.4.0. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20477
  
LGTM Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20593: [SPARK-23230][SQL][BRANCH-2.2]When hive.default.fileform...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20593
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20548: [SPARK-23316][SQL] AnalysisException after max it...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20548#discussion_r167761502
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -298,22 +298,24 @@ object DataType {
* Returns true if the two data types share the same "shape", i.e. the 
types (including
* nullability) are the same, but the field names don't need to be the 
same.
--- End diff --

This comments need an update too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20548: [SPARK-23316][SQL] AnalysisException after max it...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20548#discussion_r167761409
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -298,22 +298,24 @@ object DataType {
* Returns true if the two data types share the same "shape", i.e. the 
types (including
* nullability) are the same, but the field names don't need to be the 
same.
*/
-  def equalsStructurally(from: DataType, to: DataType): Boolean = {
+  def equalsStructurally(from: DataType, to: DataType,
+  ignoreNullability: Boolean = false): Boolean = {
--- End diff --

We can fix it when merging the PR


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20548: [SPARK-23316][SQL] AnalysisException after max it...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20548#discussion_r167761351
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -298,22 +298,24 @@ object DataType {
* Returns true if the two data types share the same "shape", i.e. the 
types (including
* nullability) are the same, but the field names don't need to be the 
same.
*/
-  def equalsStructurally(from: DataType, to: DataType): Boolean = {
+  def equalsStructurally(from: DataType, to: DataType,
+  ignoreNullability: Boolean = false): Boolean = {
--- End diff --

Nit: the indents.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20548: [SPARK-23316][SQL] AnalysisException after max iteration...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20548
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20565: [SPARK-23379][SQL] skip when setting the same current da...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20565
  
Thanks! Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20548: [SPARK-23316][SQL] AnalysisException after max iteration...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20548
  
This sounds another regression in Spark 2.3. cc @cloud-fan @dilipbiswal 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20511: [SPARK-23340][BUILD] Update ORC to 1.4.3

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20511
  
@dongjoon-hyun Could you add a test case for `ORC-285` in the Spark side? 
This sounds like a bug fix that impacts Spark?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20588: [SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify sup...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20588
  
Could you close it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20588: [SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify sup...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20588
  
This PR contains multiple fixes. This is not good especially for the ones 
targeting to 2.3.0. We should split it to multiple independent PRs if possible. 

cc @ueshin 

Thanks! Merged to 2.3. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
RC3 is out. Just to avoid new regressions that might be introduced in the 
new PR. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   6   7   8   9   10   >