[jira] [Commented] (SPARK-29768) nondeterministic expression fails column pruning

2019-11-05 Thread yucai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968028#comment-16968028
 ] 

yucai commented on SPARK-29768:
---

[~smilegator] [~wenchen], is this an issue, or does it work as designed?

> nondeterministic expression fails column pruning
> 
>
> Key: SPARK-29768
> URL: https://issues.apache.org/jira/browse/SPARK-29768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: yucai
>Priority: Major
>
> A nondeterministic expression like monotonically_increasing_id breaks column
> pruning.
> {code}
> spark.range(10).selectExpr("id as key", "id * 2 as value").
>   write.format("parquet").save("/tmp/source")
> spark.range(10).selectExpr("id as key", "id * 3 as s1", "id * 5 as s2").
>   write.format("parquet").save("/tmp/target")
> val sourceDF = spark.read.parquet("/tmp/source")
> val targetDF = spark.read.parquet("/tmp/target").
>   withColumn("row_id", monotonically_increasing_id())
> sourceDF.join(targetDF, "key").select("key", "row_id").explain()
> {code}
> Spark reads all columns from targetDF, but only the `key` column is actually
> needed.
> {code}
> scala> sourceDF.join(targetDF, "key").select("key", "row_id").explain()
> == Physical Plan ==
> *(2) Project [key#78L, row_id#88L]
> +- *(2) BroadcastHashJoin [key#78L], [key#82L], Inner, BuildLeft
>    :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
>    :  +- *(1) Project [key#78L]
>    :     +- *(1) Filter isnotnull(key#78L)
>    :        +- *(1) FileScan parquet [key#78L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/source], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct<key:bigint>
>    +- *(2) Filter isnotnull(key#82L)
>       +- *(2) Project [key#82L, monotonically_increasing_id() AS row_id#88L]
>          +- *(2) FileScan parquet [key#82L,s1#83L,s2#84L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/target], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:bigint,s1:bigint,s2:bigint>
> {code}






[jira] [Created] (SPARK-29768) nondeterministic expression fails column pruning

2019-11-05 Thread yucai (Jira)
yucai created SPARK-29768:
-

 Summary: nondeterministic expression fails column pruning
 Key: SPARK-29768
 URL: https://issues.apache.org/jira/browse/SPARK-29768
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: yucai


A nondeterministic expression like monotonically_increasing_id breaks column
pruning.

{code}
spark.range(10).selectExpr("id as key", "id * 2 as value").
  write.format("parquet").save("/tmp/source")
spark.range(10).selectExpr("id as key", "id * 3 as s1", "id * 5 as s2").
  write.format("parquet").save("/tmp/target")

val sourceDF = spark.read.parquet("/tmp/source")
val targetDF = spark.read.parquet("/tmp/target").
  withColumn("row_id", monotonically_increasing_id())
sourceDF.join(targetDF, "key").select("key", "row_id").explain()
{code}

Spark reads all columns from targetDF, but only the `key` column is actually needed.
{code}
scala> sourceDF.join(targetDF, "key").select("key", "row_id").explain()
== Physical Plan ==
*(2) Project [key#78L, row_id#88L]
+- *(2) BroadcastHashJoin [key#78L], [key#82L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
   :  +- *(1) Project [key#78L]
   :     +- *(1) Filter isnotnull(key#78L)
   :        +- *(1) FileScan parquet [key#78L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/source], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct<key:bigint>
   +- *(2) Filter isnotnull(key#82L)
      +- *(2) Project [key#82L, monotonically_increasing_id() AS row_id#88L]
         +- *(2) FileScan parquet [key#82L,s1#83L,s2#84L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/target], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:bigint,s1:bigint,s2:bigint>
{code}
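
As an illustrative workaround sketch (not part of the original report; it reuses the paths and column names from the repro above), pruning manually before attaching the nondeterministic column lets the FileScan read only `key`:

{code}
// Select only the needed columns *before* adding the nondeterministic column,
// so the scan of /tmp/target no longer has to read s1/s2.
import org.apache.spark.sql.functions.monotonically_increasing_id

val prunedTargetDF = spark.read.parquet("/tmp/target").
  select("key").
  withColumn("row_id", monotonically_increasing_id())

// The FileScan for /tmp/target should now show ReadSchema: struct<key:bigint>.
sourceDF.join(prunedTargetDF, "key").select("key", "row_id").explain()
{code}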






[jira] [Created] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread yucai (JIRA)
yucai created SPARK-26909:
-

 Summary: use unsafeRow.hashCode() as hash value in HashAggregate
 Key: SPARK-26909
 URL: https://issues.apache.org/jira/browse/SPARK-26909
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: yucai


This is a follow-up to PR #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
Because the unsafe row includes the null bit set etc., the result will be
different, and the weird `48` is no longer needed.
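
A minimal sketch of the idea, using catalyst-internal classes purely for illustration (this is not the HashAggregate change itself): hashing the whole UnsafeRow means the null bit set participates in the hash, so rows that differ only in nullability hash differently.

{code}
// Illustration only: UnsafeRow.hashCode() hashes the row's raw bytes,
// which include the null bit set.
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{DataType, IntegerType, LongType}

val toUnsafe = UnsafeProjection.create(Array[DataType](LongType, IntegerType))
val withValue = toUnsafe(new GenericInternalRow(Array[Any](1L, 0))).copy()
val withNull  = toUnsafe(new GenericInternalRow(Array[Any](1L, null))).copy()

// Different hashes, because the null bit set is part of the hashed bytes.
println(withValue.hashCode())
println(withNull.hashCode())
{code}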






[jira] [Updated] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-26909:
--
Description: 
This is a followup PR for #21149.

New way uses unsafeRow.hashCode() as hash value in HashAggregate.
The unsafe row has [null bit set] etc., the result should be different, so we 
don't need weird `48`.

  was:
This is a followup PR for #21149.

New way uses unsafeRow.hashCode() as hash value in HashAggregate.
The unsafe row has [null bit set] etc., the result should be different, and we 
don't need weird `48` also.


> use unsafeRow.hashCode() as hash value in HashAggregate
> ---
>
> Key: SPARK-26909
> URL: https://issues.apache.org/jira/browse/SPARK-26909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>
> This is a followup PR for #21149.
> New way uses unsafeRow.hashCode() as hash value in HashAggregate.
> The unsafe row has [null bit set] etc., the result should be different, so we 
> don't need weird `48`.






[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Set main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:
 * BuiltInDataSourceWriteBenchmark
 * AvroWriteBenchmark 

  was:
Set main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 


> Make main args set correctly in BenchmarkBase
> -
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Set main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
>  * BuiltInDataSourceWriteBenchmark
>  * AvroWriteBenchmark 






[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Set main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 

  was:
Save main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 


> Make main args set correctly in BenchmarkBase
> -
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Set main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark 






[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Summary: Make main args set correctly in BenchmarkBase  (was: Make mainArgs 
correctly set in BenchmarkBase)

> Make main args set correctly in BenchmarkBase
> -
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Save main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark 






[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Save main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 

  was:
Make mainArgs correctly set in BenchmarkBase, it will benefit:

* BuiltInDataSourceWriteBenchmark

* AvroWriteBenchmark

* Any other case that needs to access main args after inheriting from 
BenchmarkBase class

 


> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Save main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark 






[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Make mainArgs correctly set in BenchmarkBase, it will benefit:

* BuiltInDataSourceWriteBenchmark

* AvroWriteBenchmark

* Any other case that needs to access main args after inheriting from 
BenchmarkBase class

 

  was:
Make mainArgs correctly set in BenchmarkBase, it will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark

- Any other case that needs to access main args after inheriting from 
BenchmarkBase class.

 


> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Make mainArgs correctly set in BenchmarkBase, it will benefit:
> * BuiltInDataSourceWriteBenchmark
> * AvroWriteBenchmark
> * Any other case that needs to access main args after inheriting from 
> BenchmarkBase class
>  






[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Make mainArgs correctly set in BenchmarkBase, it will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark

- Any other case that needs to access main args after inheriting from 
BenchmarkBase class.

 

  was:
Make mainArgs correctly set in BenchmarkBase, it will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark


> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Make mainArgs correctly set in BenchmarkBase, it will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark
> - Any other case that needs to access main args after inheriting from 
> BenchmarkBase class.
>  






[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-25475

> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Make mainArgs correctly set in BenchmarkBase, it will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark






[jira] [Created] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)
yucai created SPARK-25864:
-

 Summary: Make mainArgs correctly set in BenchmarkBase
 Key: SPARK-25864
 URL: https://issues.apache.org/jira/browse/SPARK-25864
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: yucai


Make mainArgs correctly set in BenchmarkBase; it will benefit (a sketch of the pattern follows the list):

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark
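
A rough sketch of the pattern being described (names are illustrative; this is not Spark's actual BenchmarkBase): the base class saves the args passed to main so that subclasses can read them while running their suites.

{code}
// Sketch only: a base class that stores main's args for its subclasses.
abstract class BenchmarkBaseSketch {
  protected var mainArgs: Array[String] = Array.empty

  def runBenchmarkSuite(): Unit

  def main(args: Array[String]): Unit = {
    mainArgs = args          // save the args before running the suite
    runBenchmarkSuite()
  }
}

object AvroWriteBenchmarkSketch extends BenchmarkBaseSketch {
  override def runBenchmarkSuite(): Unit = {
    // subclasses can now branch on whatever was passed on the command line
    println(s"running with args: ${mainArgs.mkString(" ")}")
  }
}
{code}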






[jira] [Commented] (SPARK-25663) Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use main method

2018-10-27 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1148#comment-1148
 ] 

yucai commented on SPARK-25663:
---

[~Gengliang.Wang] I made an improvement on this; could you help review?

https://github.com/apache/spark/pull/22861

> Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use 
> main method
> 
>
> Key: SPARK-25663
> URL: https://issues.apache.org/jira/browse/SPARK-25663
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-25850) Make the split threshold for the code generated method configurable

2018-10-26 Thread yucai (JIRA)
yucai created SPARK-25850:
-

 Summary: Make the split threshold for the code generated method 
configurable
 Key: SPARK-25850
 URL: https://issues.apache.org/jira/browse/SPARK-25850
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: yucai


As per the discussion in
[https://github.com/apache/spark/pull/22823/files#r228400706], add a new
configuration spark.sql.codegen.methodSplitThreshold to make the split
threshold for generated code methods configurable.
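
A usage sketch, assuming the configuration lands under the proposed name (the final name and default in the merged PR may differ):

{code}
// Assumed configuration key from the proposal above; adjust it to whatever
// name is finally merged.
spark.conf.set("spark.sql.codegen.methodSplitThreshold", "1024")

// Subsequent whole-stage codegen splits generated methods once they grow
// past the configured threshold.
spark.range(1000).selectExpr("id", "id * 2 AS b", "id * 3 AS c").collect()
{code}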






[jira] [Commented] (SPARK-25676) Refactor BenchmarkWideTable to use main method

2018-10-25 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663480#comment-16663480
 ] 

yucai commented on SPARK-25676:
---

I am working on this.

> Refactor BenchmarkWideTable to use main method
> --
>
> Key: SPARK-25676
> URL: https://issues.apache.org/jira/browse/SPARK-25676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-25508) Refactor OrcReadBenchmark to use main method

2018-09-21 Thread yucai (JIRA)
yucai created SPARK-25508:
-

 Summary: Refactor OrcReadBenchmark to use main method
 Key: SPARK-25508
 URL: https://issues.apache.org/jira/browse/SPARK-25508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.5.0
Reporter: yucai









[jira] [Created] (SPARK-25486) Refactor SortBenchmark to use main method

2018-09-20 Thread yucai (JIRA)
yucai created SPARK-25486:
-

 Summary: Refactor SortBenchmark to use main method
 Key: SPARK-25486
 URL: https://issues.apache.org/jira/browse/SPARK-25486
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.5.0
Reporter: yucai









[jira] [Created] (SPARK-25485) Refactor UnsafeProjectionBenchmark to use main method

2018-09-20 Thread yucai (JIRA)
yucai created SPARK-25485:
-

 Summary: Refactor UnsafeProjectionBenchmark to use main method
 Key: SPARK-25485
 URL: https://issues.apache.org/jira/browse/SPARK-25485
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.5.0
Reporter: yucai









[jira] [Updated] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method

2018-09-20 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25481:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-25475

> Refactor ColumnarBatchBenchmark to use main method
> --
>
> Key: SPARK-25481
> URL: https://issues.apache.org/jira/browse/SPARK-25481
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Priority: Major
>







[jira] [Created] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method

2018-09-20 Thread yucai (JIRA)
yucai created SPARK-25481:
-

 Summary: Refactor ColumnarBatchBenchmark to use main method
 Key: SPARK-25481
 URL: https://issues.apache.org/jira/browse/SPARK-25481
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.5.0
Reporter: yucai









[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-09-11 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-23207:
--
Description: 
Currently shuffle repartition uses RoundRobinPartitioning, the generated result 
is nondeterministic since the sequence of input rows are not determined.

The bug can be triggered when there is a repartition call following a shuffle 
(which would lead to non-deterministic row ordering), as the pattern shows 
below:
upstream stage -> repartition stage -> result stage
(-> indicate a shuffle)
When one of the executors process goes down, some tasks on the repartition 
stage will be retried and generate inconsistent ordering, and some tasks of the 
result stage will be retried generating different data.

The following code returns 931532, instead of 1000000:
{code:java}
import scala.sys.process._

import org.apache.spark.TaskContext
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}

  was:
Currently shuffle repartition uses RoundRobinPartitioning, the generated result 
is nondeterministic since the sequence of input rows are not determined.

The bug can be triggered when there is a repartition call following a shuffle 
(which would lead to non-deterministic row ordering), as the pattern shows 
below:
upstream stage -> repartition stage -> result stage
(-> indicate a shuffle)
When one of the executors process goes down, some tasks on the repartition 
stage will be retried and generate inconsistent ordering, and some tasks of the 
result stage will be retried generating different data.

The following code returns 931532, instead of 1000000:
{code}
import scala.sys.process._

import org.apache.spark.TaskContext
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}


> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.1.4, 2.2.3, 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning, the generated 
> result is nondeterministic since the sequence of input rows are not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicate a shuffle)
> When one of the executors process goes down, some tasks on the repartition 
> stage will be retried and generate inconsistent ordering, and some tasks of 
> the result stage will be retried generating different data.
> The following code returns 931532, instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
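
As an illustrative workaround sketch (not the actual fix that went into Spark): hash-partitioning on the value itself instead of round-robin makes the partition assignment depend only on the row, not on the nondeterministic input order, so retried tasks should produce the same placement.

{code}
import org.apache.spark.sql.functions.col

// Repartition by the value instead of round-robin; a retried map task sends
// each row to the same reducer regardless of the order it reads its input.
val res = spark.range(0, 1000 * 1000, 1)
  .repartition(200, col("id"))
  .map { x => x }
res.distinct().count()   // should remain 1000000 even if tasks are retried
{code}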






[jira] [Resolved] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai resolved SPARK-25206.
---
Resolution: Won't Fix

Not backporting to 2.3 as per [~cloud_fan]'s summary; closed.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but
> ID does not exist in /tmp/data (parquet is case sensitive; it actually has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even when spark.sql.caseSensitive is set
> to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will notice the issue and fix the query.
> But in Spark 2.3, the user gets the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
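
For anyone hitting this on an affected 2.x release before any backport, a minimal workaround sketch (reusing the repro above) is to declare the metastore schema with the same letter case as the Parquet files:

{code}
// Workaround sketch: keep the table's column case identical to the Parquet
// schema so the pushed-down filter references a column that really exists.
spark.range(10).write.mode("overwrite").parquet("/tmp/data")
sql("DROP TABLE IF EXISTS t")
sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show()   // now returns ids 1..9
{code}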






[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598113#comment-16598113
 ] 

yucai commented on SPARK-25206:
---

Based on our discussion in 
[https://github.com/apache/spark/pull/22184#issuecomment-416840509],

it seems [~cloud_fan] prefers not to backport; need his confirmation.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but
> ID does not exist in /tmp/data (parquet is case sensitive; it actually has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even when spark.sql.caseSensitive is set
> to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will notice the issue and fix the query.
> But in Spark 2.3, the user gets the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Commented] (SPARK-25281) Add tests to check the behavior when the physical schema and logical schema use different cases

2018-08-30 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597236#comment-16597236
 ] 

yucai commented on SPARK-25281:
---

cc [~smilegator], [~cloud_fan].

> Add tests to check the behavior when the physical schema and logical schema
> use different cases
> 
>
> Key: SPARK-25281
> URL: https://issues.apache.org/jira/browse/SPARK-25281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>
> As per the discussion in SPARK-25206, Spark needs tests to check the behavior
> when the physical schema and logical schema use different cases.
> https://issues.apache.org/jira/browse/SPARK-25206?focusedCommentId=16593041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593041






[jira] [Commented] (SPARK-25281) Add tests to check the behavior when the physical schema and logical schema use different cases

2018-08-30 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597217#comment-16597217
 ] 

yucai commented on SPARK-25281:
---

[~seancxmao] , since you have done many tests in PR22184 and SPARK-25175, do 
you want to take this task? If not, I can work on this :).

[https://github.com/apache/spark/pull/22184#discussion_r212405373]

https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593185=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593185

 

> Add tests to check the behavior when the physical schema and logical schema
> use different cases
> 
>
> Key: SPARK-25281
> URL: https://issues.apache.org/jira/browse/SPARK-25281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>
> As per the discussion in SPARK-25206, Spark needs tests to check the behavior
> when the physical schema and logical schema use different cases.
> https://issues.apache.org/jira/browse/SPARK-25206?focusedCommentId=16593041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593041






[jira] [Created] (SPARK-25281) Add tests to check the behavior when the physical schema and logical schema use different cases

2018-08-30 Thread yucai (JIRA)
yucai created SPARK-25281:
-

 Summary: Add tests to check the behavior when the physical schema
and logical schema use different cases
 Key: SPARK-25281
 URL: https://issues.apache.org/jira/browse/SPARK-25281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: yucai


As per the discussion in SPARK-25206, Spark needs tests to check the behavior
when the physical schema and logical schema use different cases.

https://issues.apache.org/jira/browse/SPARK-25206?focusedCommentId=16593041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593041
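
A rough sketch of what such a check could look like in a spark-shell session (paths and names are illustrative, and the asserts describe the intended behavior once SPARK-25132 / SPARK-24716 are in place; these are not the tests that were eventually added):

{code}
// Physical (Parquet) schema in lower case, logical (metastore) schema in
// upper case, with case-insensitive resolution enabled.
spark.conf.set("spark.sql.caseSensitive", "false")

spark.range(5).toDF("id").write.mode("overwrite").parquet("/tmp/case_check")
sql("DROP TABLE IF EXISTS case_check")
sql("CREATE TABLE case_check (ID LONG) USING parquet LOCATION '/tmp/case_check'")

// Rows should be returned both with and without a pushed-down filter.
assert(sql("SELECT ID FROM case_check").count() == 5)
assert(sql("SELECT ID FROM case_check WHERE ID >= 0").count() == 5)
{code}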






[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298
 ] 

yucai edited comment on SPARK-25206 at 8/28/18 5:06 PM:


 Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source 
table, it could be more meaningful.


was (Author: yucai):
 Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source 
table, it looks more meaningful.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but
> ID does not exist in /tmp/data (parquet is case sensitive; it actually has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even when spark.sql.caseSensitive is set
> to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will notice the issue and fix the query.
> But in Spark 2.3, the user gets the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298
 ] 

yucai edited comment on SPARK-25206 at 8/28/18 5:05 PM:


 Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source 
table, it looks more meaningful.


was (Author: yucai):
 

Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source 
table, it looks more meaningful.

 

 

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but
> ID does not exist in /tmp/data (parquet is case sensitive; it actually has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even when spark.sql.caseSensitive is set
> to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will notice the issue and fix the query.
> But in Spark 2.3, the user gets the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298
 ] 

yucai commented on SPARK-25206:
---

 

Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source 
table, it looks more meaningful.

 

 

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but
> ID does not exist in /tmp/data (parquet is case sensitive; it actually has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even when spark.sql.caseSensitive is set
> to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will notice the issue and fix the query.
> But in Spark 2.3, the user gets the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594958#comment-16594958
 ] 

yucai commented on SPARK-25206:
---

[~smilegator], 2.1's exception is from parquet.
{code:java}
java.lang.IllegalArgumentException: Column [ID] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
{code}
2.1 uses parquet 1.8.1, while 2.3 uses parquet 1.8.3; it is a behavior change in
parquet.

See:

https://issues.apache.org/jira/browse/PARQUET-389

[https://github.com/apache/parquet-mr/commit/2282c22c5b252859b459cc2474350fbaf2a588e9]

 

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but
> ID does not exist in /tmp/data (parquet is case sensitive; it actually has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even when spark.sql.caseSensitive is set
> to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will notice the issue and fix the query.
> But in Spark 2.3, the user gets the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter cases 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 addressed this issue already.

 

The biggest difference is, in Spark 2.1, user will get Exception for the same 
query:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will know the issue and fix the query.

But in Spark 2.3, user will get the wrong results sliently.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

 

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter cases 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

 

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 addressed this issue already.

 

The biggest difference is, in Spark 2.1, user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will know the issue and fix the query.

But in Spark 2.3, user will get the wrong results sliently.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter 
> cases between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in 

[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593152#comment-16593152
 ] 

yucai commented on SPARK-25175:
---

I pinged [~seancxmao] offline; he will give more details.

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.






[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: wrong records are returned when Hive metastore schema and parquet 
schema are in different letter cases  (was: data issue when Hive metastore 
schema and parquet schema are in different letter cases)

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter 
> cases between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is, in Spark 2.1, user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know the issue and fix the query.
> But in Spark 2.3, user will get the wrong results sliently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
schema are in different letter cases, even when spark.sql.caseSensitive is set
to false.

SPARK-25132 already addressed this issue.

 

The biggest difference is that in Spark 2.1, the user gets an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will know about the issue and fix the query.

But in Spark 2.3, the user silently gets wrong results.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
schema are in different letter cases, even when spark.sql.caseSensitive is set
to false.

SPARK-25132 solved this issue.

 

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue when Hive metastore schema and parquet schema are in different 
> letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive 
> is set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest 

[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
schema are in different letter cases, even when spark.sql.caseSensitive is set
to false.

SPARK-25132 solved this issue.

 

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
schema are in different letter cases, even when spark.sql.caseSensitive is set
to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue when Hive metastore schema and parquet schema are in different 
> letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive 
> is set to false.
> SPARK-25132 solved this issue.
>  
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column 

[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue when Hive metastore schema and parquet schema are in 
different letter cases  (was: data issue when Hive metastore schema and parquet 
schema have different letter case)

> data issue when Hive metastore schema and parquet schema are in different 
> letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive 
> is set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue when Hive metastore schema and parquet schema have 
different letter case  (was: data issue when )

> data issue when Hive metastore schema and parquet schema have different 
> letter case
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive 
> is set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> -SPARK-25132-'s backport is tracked in its own JIRA.
> Use this JIRA to track the backport of SPARK-24716.
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
schema are in different letter cases, even when spark.sql.caseSensitive is set
to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
schema are in different letter cases, even when spark.sql.caseSensitive is set
to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

 

-SPARK-25132-'s backport is tracked in its own JIRA.

Use this JIRA to track the backport of SPARK-24716.

 

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue when Hive metastore schema and parquet schema have different 
> letter case
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. 

[jira] [Updated] (SPARK-25206) data issue when

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue when   (was: data issue because wrong column is 
pushdown for parquet)

> data issue when 
> 
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive 
> is set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> -SPARK-25132-'s backport is tracked in its own JIRA.
> Use this JIRA to track the backport of SPARK-24716.
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
schema are in different letter cases, even when spark.sql.caseSensitive is set
to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

 

-SPARK-25132-'s backport is tracked in its own JIRA.

Use this JIRA to track the backport of SPARK-24716.

 

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. 

 

 

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive 
> is set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> -SPARK-25132-'s backport has been track in 

[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to different letter cases
between the Hive metastore schema and the Parquet schema.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

2. 

 

 

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

It has two issues.

1. Wrong column 

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> 2. 
>  
>  
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

It has two issues.

1. Wrong column 

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*
Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> It has two issues.
> 1. Wrong column 
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Comment Edited] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593126#comment-16593126
 ] 

yucai edited comment on SPARK-25206 at 8/27/18 2:27 AM:


[~dongjoon], because of the root cause below
{quote}Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, 
but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id).
{quote}
I changed the title to emphasize that the wrong column is pushed down: "id" should be 
pushed down instead of "ID".

Feel free to let me know if you have any concerns.

This issue exists in 2.3 only; master is different.


was (Author: yucai):
[~dongjoon], because of the root cause below
{quote}Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, 
but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id).
{quote}
I changed the title to emphasize that the wrong column is pushed down: "id" should be 
pushed down instead of "ID".

Feel free to let me know if you have any concerns.

> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Commented] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593126#comment-16593126
 ] 

yucai commented on SPARK-25206:
---

[~dongjoon], because of the root cause below
{quote}Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, 
but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id).
{quote}
I changed the title to emphasize that the wrong column is pushed down: "id" should be 
pushed down instead of "ID", as illustrated below.

Feel free to let me know if you have any concerns.
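To illustrate the point, here is a tiny sketch with hypothetical values, mirroring the 
description's FilterApi.gt(intColumn("ID"), 0: Integer) example; it is not code from the 
actual PR. The pushed predicate must be built from the Parquet field name, not the 
metastore name.
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi._

// "ID" does not exist in /tmp/data, so this predicate matches no rows.
val pushedWrong = gt(intColumn("ID"), Integer.valueOf(0))
// "id" is the column Parquet actually stores, so this is the predicate that should be pushed.
val pushedRight = gt(intColumn("id"), Integer.valueOf(0))
{code}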

> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*
Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
scala> sql("select * from t").show
++
|  ID|
++
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
++
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
scala> sql("set spark.sql.parquet.filterPushdown").show
++-+
| key|value|
++-+
|spark.sql.parquet...| true|
++-+
scala> sql("set spark.sql.parquet.filterPushdown=false").show
++-+
| key|value|
++-+
|spark.sql.parquet...|false|
++-+
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
{code}
 

*Root Cause*
Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
schema to do the pushdown, which resolves this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue because wrong column is pushdown for parquet  (was: 
Wrong data may be returned for Parquet)

> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
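A side note on the spark.sql.parquet.filterPushdown toggle shown in the description above: 
on its own it is not a workaround. A short sketch of why, under the assumptions used 
throughout this thread (Spark 2.3.x, metastore column ID vs. Parquet column id):
{code:java}
// With pushdown enabled, gt(intColumn("ID"), 0) is pushed and matches nothing in the file.
sql("set spark.sql.parquet.filterPushdown=false")

// With pushdown disabled, Parquet no longer filters, but the mismatched column still
// reads back as NULL (the SPARK-25132 issue), and `NULL > 0` evaluates to false,
// so the query still returns an empty result.
sql("select * from t where id > 0").show()   // still an empty +---+ | ID| table
{code}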






[jira] [Commented] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593110#comment-16593110
 ] 

yucai commented on SPARK-25207:
---

[~dongjoon] , sorry if I am confusing you.

 

This bug was created for the master branch, because master already has SPARK-25132 and 
-SPARK-24716-.

So it does not actually have the issue below.
{code:java}
scala> sql("select * from t").show// Parquet returns NULL for `ID` because 
it has `id`.
++
|  ID|
++
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
++

scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
+---+
| ID|
+---+
+---+
{code}

> Case-insensitive field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>  Labels: Parquet
> Attachments: image.png
>
>
> Currently, filter pushdown will not work if the Parquet schema and the Hive metastore 
> schema are in different letter cases, even when spark.sql.caseSensitive is false.
> Like the below case:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed 
> with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
> PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: 
> struct
> scala> sql("select * from t").show// Parquet returns NULL for `ID` 
> because it has `id`.
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}
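For context, here is a minimal sketch of the kind of case-insensitive field resolution this 
issue calls for, assuming parquet-mr's FilterApi and MessageType. It is illustrative only 
(the helper name pushGreaterThan is hypothetical), not the actual SPARK-25207 patch.
{code:java}
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.schema.MessageType
import scala.collection.JavaConverters._

// Resolve the requested field name against the Parquet file schema before building
// the pushed-down predicate. With caseSensitive = false, accept a unique
// case-insensitive match; if the name is missing or ambiguous, skip pushdown
// instead of pushing a column name that does not exist in the file.
def pushGreaterThan(
    fileSchema: MessageType,
    requested: String,
    value: java.lang.Integer,
    caseSensitive: Boolean): Option[FilterPredicate] = {
  val fieldNames = fileSchema.getFields.asScala.map(_.getName)
  val resolved =
    if (caseSensitive) {
      fieldNames.find(_ == requested)
    } else {
      fieldNames.filter(_.equalsIgnoreCase(requested)) match {
        case Seq(unique) => Some(unique)   // e.g. "ID" resolves to the file's "id"
        case _           => None           // missing or ambiguous: do not push down
      }
    }
  resolved.map(name => FilterApi.gt(FilterApi.intColumn(name), value))
}
{code}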






[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593102#comment-16593102
 ] 

yucai commented on SPARK-25206:
---

I am OK with the "known correctness bug in 2.3" approach; I just raised some concerns in 
my previous post.

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593100#comment-16593100
 ] 

yucai commented on SPARK-25206:
---

[~smilegator] , sure, I will add tests.

 

If we don't backport SPARK-25132 and SPARK-24716, users will hit the issue below.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
{code}
 

The biggest difference is that in Spark 2.1, they will get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will know about the issue and fix the query.

But in Spark 2.3, they will silently get wrong results, which might go unnoticed.

 

Could it be risky for the user?

 

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which resolves this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Comment Edited] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592453#comment-16592453
 ] 

yucai edited comment on SPARK-25206 at 8/25/18 5:01 AM:


{quote} # Vanilla Spark 2.2.0 ~ 2.3.1 always returns NULL for Parquet 
mismatched columns due to case sensitivity. It's not related to filters at 
all.{quote}
Agree, it has nothing to do with filters; actually, the issue has existed since 2.0.
{quote}The reason why SPARK-25132 is not complete to your situations is simply 
that it's based on `master` branch. It depends on lots of improvements only in 
`master` branch.
{quote}
This part might need clarification.

-SPARK-25132- has no dependencies; I can backport it alone, but the data issue still 
exists if I only backport that one.

This time the root cause is filter pushdown.


was (Author: yucai):
{quote} # Vanilla Spark 2.2.0 ~ 2.3.1 always returns NULL for Parquet 
mismatched columns due to case sensitivity. It's not related to filters at 
all.{quote}
Agree, it has nothing to do with filter, actually, the issue exists since 2.0.
{quote}The reason why SPARK-25132 is not complete to your situations is simply 
that it's based on `master` branch. It depends on lots of improvements only in 
`master` branch.
{quote}
This part might need to clarify.

-SPARK-25132- has no dependence, I can backport it alone, but data issue still 
exists if I only backport it.

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592460#comment-16592460
 ] 

yucai commented on SPARK-25206:
---

[~dongjoon] , thanks a lot for all the explanations. If we both agree to 
backport -SPARK-25132- + -SPARK-24716-,

we can go ahead :).

But master's Parquet filter pushdown is still buggy in case-insensitive mode; I 
have submitted a PR for it under SPARK-25207.

Kindly help review.
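
For reference, this is roughly what case-insensitive field resolution for pushdown 
means; a plain Scala illustration of the idea only (not the code in the SPARK-25207 
PR), where the filter's column name is resolved against the Parquet schema's field 
names before the Parquet filter is built:
{code:java}
// Illustration only: resolve the metastore column name against the field names
// read from the Parquet footer before constructing the pushed-down filter.
val parquetFields = Seq("id")   // field names from the Parquet file schema

def resolve(column: String, caseSensitive: Boolean): Option[String] = {
  if (caseSensitive) parquetFields.find(_ == column)
  else parquetFields.find(_.equalsIgnoreCase(column))
}

resolve("ID", caseSensitive = false)  // Some("id") -> safe to build gt(id, 0)
resolve("ID", caseSensitive = true)   // None       -> skip pushdown for this column
{code}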

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592453#comment-16592453
 ] 

yucai commented on SPARK-25206:
---

{quote} # Vanilla Spark 2.2.0 ~ 2.3.1 always returns NULL for Parquet 
mismatched columns due to case sensitivity. It's not related to filters at 
all.{quote}
Agreed, it has nothing to do with the filter; actually, the issue has existed since 2.0.
{quote}The reason why SPARK-25132 is not complete to your situations is simply 
that it's based on `master` branch. It depends on lots of improvements only in 
`master` branch.
{quote}
This part might need some clarification.

-SPARK-25132- has no dependency; I can backport it alone, but the data issue still 
exists if I only backport it.

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592425#comment-16592425
 ] 

yucai edited comment on SPARK-25206 at 8/25/18 3:33 AM:


[~dongjoon] , correct me if I am wrong.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
Based on 2.3.1:

With SPARK-25132 backported, Spark shows nothing if pushdown is enabled and shows 
"1 to 9" if pushdown is disabled.

With -SPARK-25132- + SPARK-24716 backported, Spark pushes down nothing and shows 
"1 to 9".

With -SPARK-25132- + -SPARK-24716- + SPARK-25207, Spark pushes down "id > 0" 
correctly and shows "1 to 9".


was (Author: yucai):
[~dongjoon] , correct me if I am wrong.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
Based on 2.3.1,

Backport SPARK-25132, Spark will show nothing if pushdown enabled and show "1 
to 9" if pushdown disabled.

Backport -SPARK-25132- + SPARK-24716, Spark will pushdown nothing and it shows 
"1 to 9".

-SPARK-25132- + -SPARK-24716- + SPARK-25207, Spark will pushdown "id > 0" 
correctly and shows "1 to 9".

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592425#comment-16592425
 ] 

yucai commented on SPARK-25206:
---

[~dongjoon] , correct me if I am wrong.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
Based on 2.3.1:

With SPARK-25132 backported, Spark shows nothing if pushdown is enabled and shows 
"1 to 9" if pushdown is disabled.

With -SPARK-25132- + SPARK-24716 backported, Spark pushes down nothing and shows 
"1 to 9".

With -SPARK-25132- + -SPARK-24716- + SPARK-25207, Spark pushes down "id > 0" 
correctly and shows "1 to 9".

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592406#comment-16592406
 ] 

yucai commented on SPARK-25206:
---

This is not a simple duplicate.

Backporting -SPARK-25132- without -SPARK-24716- is still buggy.

 

See my test below.

*{color:#FF}Attention{color}*: only SPARK-25132 is backported here.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+


scala> sql("set spark.sql.parquet.filterPushdown").show
++-+
| key|value|
++-+
|spark.sql.parquet...| true|
++-+


scala> sql("set spark.sql.parquet.filterPushdown=false").show
++-+
| key|value|
++-+
|spark.sql.parquet...|false|
++-+


scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
| 7|
| 8|
| 9|
| 2|
| 3|
| 4|
| 5|
| 6|
| 1|
+---+


{code}
 

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592392#comment-16592392
 ] 

yucai commented on SPARK-25206:
---

[~dongjoon] , the reason you see `null` without predicate pushdown is 
https://issues.apache.org/jira/browse/SPARK-25132. It is one of the 
issues behind this bug.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
scala> sql("select * from t").show
++
| ID|
++
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
++{code}
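
For what it's worth, the same all-NULL behaviour can be reproduced without a table 
at all; this is only my own sketch (the NULL outcome on 2.3.x is an assumption based 
on the description above), reading the files with a user-specified schema whose 
column name differs only by case:
{code:java}
// Sketch: ask the Parquet reader for "ID" while the files only contain "id".
// On 2.3.x this is expected to come back as NULLs, mirroring the table above.
import org.apache.spark.sql.types._

val wrongCase = spark.read
  .schema(StructType(Seq(StructField("ID", LongType))))
  .parquet("/tmp/data")
wrongCase.show()
{code}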

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592390#comment-16592390
 ] 

yucai commented on SPARK-25206:
---

Linked to SPARK-25132; this bug needs two PRs to be backported.

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592384#comment-16592384
 ] 

yucai commented on SPARK-25206:
---

[~dongjoon], I still think this bug is related to pushdown, but unfortunately 
there are actually two issues, which makes it quite confusing. Let me explain:


1. The wrong column name is pushed down into Parquet. Spark pushes down "ID > 0", 
but the Parquet file has "id" in its schema instead of "ID", so 0 records are 
returned in the {color:#FF}*PARQUET SCAN stage.*{color}

{color:#FF}*Attention*{color}: it is not because of the filter; no record comes 
out of the *{color:#FF}parquet scan{color}* in this case.

We can confirm this in Spark's SQL chart: "number of output rows" in the Scan node is 0.
{code:java}
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
{code}
 

!image-2018-08-25-10-04-21-901.png!

That's why we need to backport [https://github.com/apache/spark/pull/21696].

With it, Spark will push down the correct filter "id > 0" into Parquet.

2. Unfortunately, [https://github.com/apache/spark/pull/21696] alone is not 
enough; because of https://issues.apache.org/jira/browse/SPARK-25132, we 
still need to backport [https://github.com/apache/spark/pull/22183].

 

Does it make sense to you?
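
If it helps, the same check can be done without the SQL tab; a small sketch of my 
own (assuming 2.3.x APIs, and that "numOutputRows" is the scan node's metric key) 
that reads "number of output rows" from the executed plan after running the query once:
{code:java}
// Sketch: confirm the scan itself returns 0 rows, i.e. rows are lost before FilterExec.
val df = sql("select * from t where id > 0")
df.collect()                                            // run the query once
val scan = df.queryExecution.executedPlan.collectLeaves().head
println(scan.metrics("numOutputRows").value)            // expected: 0 in this repro
{code}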

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Attachment: image-2018-08-25-10-04-21-901.png

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-24 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Attachment: image-2018-08-25-09-54-53-219.png

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-24 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591756#comment-16591756
 ] 

yucai commented on SPARK-25206:
---

[~cloud_fan] , we need both [https://github.com/apache/spark/pull/21696] and 
[https://github.com/apache/spark/pull/22183] for this bug.

 

*With only* [https://github.com/apache/spark/pull/21696], no records are 
returned.
{code:java}
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").write.csv("/tmp/data_csv")

scala> spark.read.csv("/tmp/data_csv")
res4: org.apache.spark.sql.DataFrame = []{code}
*Root Cause*: No filter is pushed, but "ID" is selected from the Parquet file, 
which does not have this field, so 10 null records are returned from the parquet 
scan; they are then filtered out by "ID" > 0 in FilterExec, and finally 0 records 
are returned. See:

!image-2018-08-24-22-46-05-346.png!
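
(Side note: the 10 NULLs turning into 0 rows is just SQL null semantics; a tiny 
sketch of my own to illustrate:)
{code:java}
// NULL > 0 is never true, so FilterExec drops every NULL row coming from the scan.
import spark.implicits._
Seq[Option[Long]](None, None, None).toDF("ID").where("ID > 0").count()   // 0
{code}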

*With both* [https://github.com/apache/spark/pull/21696] and 
[https://github.com/apache/spark/pull/22183]
{code:java}
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").write.csv("/tmp/data_csv")

scala> spark.read.csv("/tmp/data_csv").show
+---+
|_c0|
+---+
| 2|
| 3|
| 4|
| 7|
| 8|
| 9|
| 5|
| 6|
| 1|
+---+{code}

> Wrong data may be returned when enable pushdown
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-24 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Attachment: image-2018-08-24-22-46-05-346.png

> Wrong data may be returned when enable pushdown
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-24 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Attachment: image-2018-08-24-22-34-11-539.png

> Wrong data may be returned when enable pushdown
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-24 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Attachment: image-2018-08-24-22-33-03-231.png

> Wrong data may be returned when enable pushdown
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-24 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Attachment: pr22183.png

> Wrong data may be returned when enable pushdown
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-24 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Attachment: image-2018-08-24-18-05-23-485.png

> Wrong data may be returned when enable pushdown
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet

2018-08-23 Thread yucai (JIRA)
yucai created SPARK-25207:
-

 Summary: Case-insensitive field resolution for filter pushdown when 
reading Parquet
 Key: SPARK-25207
 URL: https://issues.apache.org/jira/browse/SPARK-25207
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: yucai


Currently, filter pushdown will not work if the Parquet schema and the Hive 
metastore schema are in different letter cases, even when spark.sql.caseSensitive 
is false.

For example:
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
No filter will be pushed down.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-23 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+{code}
 

*Root Cause*
Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet 
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results silently.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
schema to do the pushdown, which is exactly what this issue needs.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+{code}
 

*Root Cause*
Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

 

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

[~yuming] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> Wrong data may be returned when enable pushdown
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25206) Wrong data may be returned when enable pushdown

2018-08-23 Thread yucai (JIRA)
yucai created SPARK-25206:
-

 Summary: Wrong data may be returned when enable pushdown
 Key: SPARK-25206
 URL: https://issues.apache.org/jira/browse/SPARK-25206
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: yucai


In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+{code}
 

*Root Cause*
Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet 
is case sensitive; it actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results silently.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
schema to do the pushdown, which is exactly what this issue needs.

[~yuming] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases

2018-08-16 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583145#comment-16583145
 ] 

yucai commented on SPARK-25132:
---

[~cloud_fan] [~smilegator] [~budde] [~ekhliang], do you have any insight?

> Spark returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases
> 
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases

2018-08-16 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582667#comment-16582667
 ] 

yucai commented on SPARK-25132:
---

If Spark allows case-insensitive data source resolution, querying t2 should return the numbers.
If Spark does not allow it, Spark should warn the user, because returning NULL 
may lead to problems that are very difficult to find.
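
As a side note, until this is fixed, one workaround is to read the data files 
directly so the Parquet schema, not the mismatched metastore schema of t2, is used. 
A sketch only, assuming the t1 warehouse path from the example below:
{code:java}
// Workaround sketch: bypass t2's metastore schema and use the Parquet footer schema.
val df = spark.read.parquet("hdfs://localhost/user/hive/warehouse/t1")
df.show()   // 0..4 instead of NULLs
{code}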

> Spark returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases
> 
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-10 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576417#comment-16576417
 ] 

yucai edited comment on SPARK-25084 at 8/10/18 3:17 PM:


[~smilegator], [~jerryshao]
Thanks a lot for marking it as a blocker.
A lot of eBay's tables use "distribute by" or "cluster by", so this is important for 
us in moving to Spark 2.3.


was (Author: yucai):
[~smilegator][~jerryshao]
Thanks a lot for marking it blocker.
A lot of eBay's tables use "distribute by" or "cluster by", it is important for 
us to move to Spark 2.3.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code}
> Exception:
> {code:java}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 131, Column 67: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 131, Column 67: One of ', )' expected instead of '['
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code}
> Wrong Codegen:
> {code:java}
> /* 131 */ private int computeHashForStruct_1(InternalRow 
> mutableStateArray[0], int value1) {
> /* 132 */
> /* 133 */
> /* 134 */ if (!mutableStateArray[0].isNullAt(5)) {
> /* 135 */
> /* 136 */ final int element5 = mutableStateArray[0].getInt(5);
> /* 137 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1);
> /* 138 */
> /* 139 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-10 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576417#comment-16576417
 ] 

yucai commented on SPARK-25084:
---

[~smilegator][~jerryshao]
Thanks a lot for marking it a blocker.
A lot of eBay's tables use "distribute by" or "cluster by", so this fix is important 
for us to be able to move to Spark 2.3.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code}
> Exception:
> {code:java}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 131, Column 67: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 131, Column 67: One of ', )' expected instead of '['
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code}
> Wrong Codegen:
> {code:java}
> /* 131 */ private int computeHashForStruct_1(InternalRow 
> mutableStateArray[0], int value1) {
> /* 132 */
> /* 133 */
> /* 134 */ if (!mutableStateArray[0].isNullAt(5)) {
> /* 135 */
> /* 136 */ final int element5 = mutableStateArray[0].getInt(5);
> /* 137 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1);
> /* 138 */
> /* 139 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-10 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25084:
--
Description: 
Test Query:
{code:java}
select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code}
Exception:
{code:java}
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 131, Column 67: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
131, Column 67: One of ', )' expected instead of '['
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
at 
org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code}
Wrong Codegen:
{code:java}
/* 131 */ private int computeHashForStruct_1(InternalRow mutableStateArray[0], 
int value1) {
/* 132 */
/* 133 */
/* 134 */ if (!mutableStateArray[0].isNullAt(5)) {
/* 135 */
/* 136 */ final int element5 = mutableStateArray[0].getInt(5);
/* 137 */ value1 = 
org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1);
/* 138 */
/* 139 */ }{code}
 

  was:
Test Query:
{code:java}
select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
ss_net_profit) limit 1000;{code}
Wrong Codegen:
{code:java}
/* 146 */ private int computeHashForStruct_0(InternalRow mutableStateArray[0], 
int value1) {
/* 147 */
/* 148 */
/* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
/* 150 */
/* 151 */ final int element = mutableStateArray[0].getInt(0);
/* 152 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, 
value1);
/* 153 */
/* 154 */ }{code}
 


> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code}
> Exception:
> {code:java}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 131, Column 67: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 131, Column 67: One of ', )' expected instead of '['
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code}
> Wrong Codegen:
> {code:java}
> /* 131 */ private int computeHashForStruct_1(InternalRow 
> mutableStateArray[0], int value1) {
> /* 132 */
> /* 133 */
> /* 134 */ if (!mutableStateArray[0].isNullAt(5)) {
> /* 135 */
> /* 136 */ final int element5 = mutableStateArray[0].getInt(5);
> /* 137 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1);
> /* 138 */
> /* 139 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-10 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575808#comment-16575808
 ] 

yucai commented on SPARK-25084:
---

It is a regression: when the generated code size exceeds 1024, newer Spark splits it 
into many functions, but the generated function definition is invalid, like below:
{code:java}
private int computeHashForStruct_0(InternalRow mutableStateArray[0], int 
value1) {
{code}
 

Older versions, such as 2.1.0, do not split the function, so they do not have this 
issue.

 

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Blocker
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread yucai (JIRA)
yucai created SPARK-25084:
-

 Summary: "distribute by" on multiple columns may lead to codegen 
issue
 Key: SPARK-25084
 URL: https://issues.apache.org/jira/browse/SPARK-25084
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: yucai


Test Query:
{code:java}
select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
ss_net_profit) limit 1000;{code}
Wrong Codegen:
{code:java}
/* 146 */ private int computeHashForStruct_0(InternalRow mutableStateArray[0], 
int value1) {
/* 147 */
/* 148 */
/* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
/* 150 */
/* 151 */ final int element = mutableStateArray[0].getInt(0);
/* 152 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, 
value1);
/* 153 */
/* 154 */ }{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-25 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556820#comment-16556820
 ] 

yucai commented on SPARK-24925:
---

[~cloud_fan], [~xiaoli], [~kiszk], any comments?

> input bytesRead metrics fluctuate from time to time
> ---
>
> Key: SPARK-24925
> URL: https://issues.apache.org/jira/browse/SPARK-24925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
> Attachments: bytesRead.gif
>
>
> input bytesRead metrics fluctuate from time to time, it is worse when 
> pushdown enabled.
> Query
> {code:java}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 
> 54GB, 53GB ... 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-25 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556818#comment-16556818
 ] 

yucai commented on SPARK-24925:
---

I think there could be two issues in FileScanRDD:

1. For ColumnarBatch reads, bytesRead is only updated once every 4096 * 1000 rows, 
which leaves the metric out of date.
2. When advancing to the next file, FileScanRDD always adds the whole file length to 
bytesRead, which is inaccurate (pushdown reads much less data).

For problem 1, in https://github.com/apache/spark/pull/21791 I tried to update 
bytesRead for every ColumnarBatch.
For problem 2, updateBytesReadWithFileSize says "If we can't get the bytes read from 
the FS stats, fall back to the file size"; can we add the file length only when that 
fallback is actually needed?
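
A rough sketch of the problem-2 idea (illustration only; `fsStatsAvailable` and `currentFile` are hypothetical names for this sketch, not actual FileScanRDD fields):

{code:scala}
// Only fall back to charging the whole file length when the filesystem
// statistics cannot report the real bytes read for this thread.
private def updateBytesReadWithFileSize(): Unit = {
  if (!fsStatsAvailable && currentFile != null) {   // hypothetical guard
    inputMetrics.incBytesRead(currentFile.length)   // fallback: whole file size
  }
}
{code}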

> input bytesRead metrics fluctuate from time to time
> ---
>
> Key: SPARK-24925
> URL: https://issues.apache.org/jira/browse/SPARK-24925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
> Attachments: bytesRead.gif
>
>
> input bytesRead metrics fluctuate from time to time, it is worse when 
> pushdown enabled.
> Query
> {code:java}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 
> 54GB, 53GB ... 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-25 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24925:
--
Attachment: bytesRead.gif

> input bytesRead metrics fluctuate from time to time
> ---
>
> Key: SPARK-24925
> URL: https://issues.apache.org/jira/browse/SPARK-24925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
> Attachments: bytesRead.gif
>
>
> input bytesRead metrics fluctuate from time to time, it is worse when 
> pushdown enabled.
> Query
> {code:java}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See attached gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 53GB 
> ... 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-25 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24925:
--
Description: 
input bytesRead metrics fluctuate from time to time, it is worse when pushdown 
enabled.

Query
{code:java}
CREATE TABLE dev AS
SELECT
...
FROM lstg_item cold, lstg_item_vrtn v
WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
...
{code}

Issue
See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 
53GB ... 

  was:
input bytesRead metrics fluctuate from time to time, it is worse when pushdown 
enabled.

Query
{code:java}
CREATE TABLE dev AS
SELECT
...
FROM lstg_item cold, lstg_item_vrtn v
WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
...
{code}

Issue
See attached gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 53GB ... 


> input bytesRead metrics fluctuate from time to time
> ---
>
> Key: SPARK-24925
> URL: https://issues.apache.org/jira/browse/SPARK-24925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
> Attachments: bytesRead.gif
>
>
> input bytesRead metrics fluctuate from time to time, it is worse when 
> pushdown enabled.
> Query
> {code:java}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 
> 54GB, 53GB ... 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-25 Thread yucai (JIRA)
yucai created SPARK-24925:
-

 Summary: input bytesRead metrics fluctuate from time to time
 Key: SPARK-24925
 URL: https://issues.apache.org/jira/browse/SPARK-24925
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: yucai


input bytesRead metrics fluctuate from time to time, it is worse when pushdown 
enabled.

Query
{code:java}
CREATE TABLE dev AS
SELECT
...
FROM lstg_item cold, lstg_item_vrtn v
WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
...
{code}

Issue
See attached gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 53GB ... 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24832) Improve inputMetrics's bytesRead update for ColumnarBatch

2018-07-25 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24832:
--
Summary: Improve inputMetrics's bytesRead update for ColumnarBatch  (was: 
When pushdown enabled, input bytesRead metrics is easy to fluctuate from time 
to time)

> Improve inputMetrics's bytesRead update for ColumnarBatch
> -
>
> Key: SPARK-24832
> URL: https://issues.apache.org/jira/browse/SPARK-24832
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Improve inputMetrics's bytesRead update for ColumnarBatch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24832) When pushdown enabled, input bytesRead metrics is easy to fluctuate from time to time

2018-07-25 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24832:
--
Summary: When pushdown enabled, input bytesRead metrics is easy to 
fluctuate from time to time  (was: Improve inputMetrics's bytesRead update for 
ColumnarBatch)

> When pushdown enabled, input bytesRead metrics is easy to fluctuate from time 
> to time
> -
>
> Key: SPARK-24832
> URL: https://issues.apache.org/jira/browse/SPARK-24832
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Improve inputMetrics's bytesRead update for ColumnarBatch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24832) Improve inputMetrics's bytesRead update for ColumnarBatch

2018-07-17 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546371#comment-16546371
 ] 

yucai commented on SPARK-24832:
---

Currently, for ColumnarBatch reads, bytesRead is only updated once every 4096 * 1000 
rows, which leaves the metric out of date.
Can we update it for each batch?

{code:java}
if (nextElement.isInstanceOf[ColumnarBatch]) {
  inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
} else {
  inputMetrics.incRecordsRead(1)
}
if (inputMetrics.recordsRead %
    SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
  updateBytesRead()
}
{code}
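
A sketch of the change proposed above (not current Spark code): keep the interval-based refresh for row-at-a-time reads, but refresh the byte counter once per ColumnarBatch.

{code:scala}
if (nextElement.isInstanceOf[ColumnarBatch]) {
  inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
  updateBytesRead()   // proposed: refresh bytesRead once per batch
} else {
  inputMetrics.incRecordsRead(1)
  if (inputMetrics.recordsRead %
      SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    updateBytesRead()
  }
}
{code}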

> Improve inputMetrics's bytesRead update for ColumnarBatch
> -
>
> Key: SPARK-24832
> URL: https://issues.apache.org/jira/browse/SPARK-24832
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Improve inputMetrics's bytesRead update for ColumnarBatch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24832) Improve inputMetrics's bytesRead update for ColumnarBatch

2018-07-17 Thread yucai (JIRA)
yucai created SPARK-24832:
-

 Summary: Improve inputMetrics's bytesRead update for ColumnarBatch
 Key: SPARK-24832
 URL: https://issues.apache.org/jira/browse/SPARK-24832
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.3.1
Reporter: yucai


Improve inputMetrics's bytesRead update for ColumnarBatch






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24556:
--
Description: 
Currently, ReusedExchange would rewrite output partitioning if child's 
partitioning is HashPartitioning, but it does not do the same when child's 
partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, 
see:
{code:java}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}
Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as...
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

  was:
Currently, ReusedExchange would rewrite output partitioning if child's 
partitioning is HashPartitioning, but it does not do the same when child's 
partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, 
see:
{code}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}
Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}


> ReusedExchange should rewrite output partitioning also when child's 
> partitioning is RangePartitioning
> -
>
> Key: SPARK-24556
> URL: https://issues.apache.org/jira/browse/SPARK-24556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: yucai
>Priority: Major
>
> Currently, ReusedExchange would rewrite output partitioning if 

[jira] [Updated] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24556:
--
Description: 
Currently, ReusedExchange would rewrite output partitioning if child's 
partitioning is HashPartitioning, but it does not do the same when child's 
partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, 
see:
{code}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}
Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

  was:
Currently, ReusedExchange would rewrite output partitioning if child's 
partitioning is HashPartitioning, but it does not do the same when child's 
partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, 
see:

{code:scala}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}

Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  

[jira] [Created] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread yucai (JIRA)
yucai created SPARK-24556:
-

 Summary: ReusedExchange should rewrite output partitioning also 
when child's partitioning is RangePartitioning
 Key: SPARK-24556
 URL: https://issues.apache.org/jira/browse/SPARK-24556
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: yucai


Currently, ReusedExchange would rewrite output partitioning if child's 
partitioning is HashPartitioning, but it does not do the same when child's 
partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, 
see:

{code:scala}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}

Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
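
Conceptually, the fix would be for ReusedExchangeExec.outputPartitioning to remap the child's attributes to the reused node's output for RangePartitioning the same way it already does for HashPartitioning. A rough sketch of members inside ReusedExchangeExec (simplified for illustration, not the exact Spark source):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, Expression, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, RangePartitioning}

// Map each attribute of the original exchange's output to the corresponding
// attribute of the reused node's output.
private lazy val attrMap = AttributeMap(child.output.zip(output))

private def updateAttr(e: Expression): Expression = e.transform {
  case a: Attribute => attrMap.getOrElse(a, a)
}

override def outputPartitioning: Partitioning = child.outputPartitioning match {
  case h: HashPartitioning =>
    h.copy(expressions = h.expressions.map(updateAttr))
  case r: RangePartitioning =>
    r.copy(ordering = r.ordering.map(s => updateAttr(s).asInstanceOf[SortOrder]))
  case other => other
}
{code}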



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24343:
--
Description: 
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
as per shuffle.partition; can we avoid this?

See below example:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
Suppose web_sales_bucketed is bucketed into 4 buckets and set 
shuffle.partition=10

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan should avoid the shuffle in the bucket table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse if we enable adaptive execution, because it usually 
prefers a large shuffle.partition.

 

  was:
When shuffle.partition > bucket number, Spark needs to shuffle the bucket table 
as per the shuffle.partition, can we avoid this?

See below example:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and set shuffle.partition=10

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan should avoid the shuffle in the bucket table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse if we enable the adaptive execution, because it 
usually prefers a big shuffle.parititon.

 


> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucket 
> table as per the shuffle.partition, can we avoid this?
> See below example:
> 

[jira] [Updated] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24343:
--
Description: 
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
as per shuffle.partition; can we avoid this?

See below example:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
Suppose web_sales_bucketed is bucketed into 4 buckets and set 
shuffle.partition=10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan should avoid the shuffle in the bucket table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse if we enable adaptive execution, because it usually 
prefers a large shuffle.partition.

 

  was:
When shuffle.partition > bucket number, Spark needs to shuffle the bucket table 
as per the shuffle.partition, can we avoid this?

See below example:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
Suppose web_sales_bucketed is bucketed into 4 buckets and set 
shuffle.partition=10

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan should avoid the shuffle in the bucket table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse if we enable the adaptive execution, because it 
usually prefers a big shuffle.parititon.

 


> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucket 
> table as per the shuffle.partition, can we avoid this?
> See below example:

[jira] [Updated] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24343:
--
Description: 
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
as per shuffle.partition; can we avoid this?

See below example:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and set shuffle.partition=10

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan should avoid the shuffle in the bucket table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse if we enable adaptive execution, because it usually 
prefers a large shuffle.partition.

 

  was:
When shuffle.partition > bucket number, Spark needs to shuffle the bucket table 
as per the shuffle.partition, can we avoid this?

See below example:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and set shuffle.partition=10

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan should avoid the shuffle in the bucket table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
 

 

This problem could be worse if we enable the adaptive execution, because it 
usually prefers a big shuffle.parititon.

 


> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucket 
> table as per the shuffle.partition, can we avoid this?
> See below example:
> 

[jira] [Created] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)
yucai created SPARK-24343:
-

 Summary: Avoid shuffle for the bucketed table when 
shuffle.partition > bucket number
 Key: SPARK-24343
 URL: https://issues.apache.org/jira/browse/SPARK-24343
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: yucai


When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
as per shuffle.partition; can we avoid this?

See below example:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and set shuffle.partition=10

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan should avoid the shuffle in the bucket table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
 

 

This problem could be worse if we enable adaptive execution, because it usually 
prefers a large shuffle.partition.
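
As a possible workaround until the planner handles this (a sketch, not from this report): pinning spark.sql.shuffle.partitions to the bucket count makes the bucketed side's output partitioning already satisfy the join's required distribution, at the cost of less parallelism on the other side.

{code:scala}
// With web_sales_bucketed bucketed into 4 buckets, forcing 4 shuffle
// partitions lets the bucketed side avoid the Exchange entirely.
spark.conf.set("spark.sql.shuffle.partitions", "4")
spark.sql(
  """CREATE TABLE dev
    |USING PARQUET
    |AS SELECT ws_item_sk, i_item_sk
    |FROM web_sales_bucketed
    |JOIN item ON ws_item_sk = i_item_sk""".stripMargin)
{code}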

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys

2018-04-25 Thread yucai (JIRA)
yucai created SPARK-24087:
-

 Summary: Avoid shuffle when join keys are a super-set of bucket 
keys
 Key: SPARK-24087
 URL: https://issues.apache.org/jira/browse/SPARK-24087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: yucai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192

2018-04-25 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451743#comment-16451743
 ] 

yucai commented on SPARK-24076:
---

1. When shuffle.partition = 8192, any two tuples x and y that end up in the same 
partition satisfy:

hash(tuple x) = hash(tuple y) + n * 8192

2. In the next HashAggregate stage, tuples from the same partition are inserted into 
a 16K-slot BytesToBytesMap (unsafeRowAggBuffer).

Because the HashAggregate uses the same hash algorithm and seed as the shuffle, all 
tuples of a partition hash to only 2 distinct slots of that map. That is why the 
hash conflict happens.
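
A small self-contained illustration of the effect (a sketch using Scala's MurmurHash3 as a stand-in for Spark's Murmur3_x86_32; 8192 and 16K match the scenario above):

{code:scala}
import scala.util.hashing.MurmurHash3

object HashConflictDemo {
  def main(args: Array[String]): Unit = {
    val numShufflePartitions = 8192
    val aggMapSlots = 16 * 1024          // 16K-slot hash map, a power of two
    val seed = 42                        // same seed reused by both stages

    def hash(key: Int): Int = MurmurHash3.stringHash(key.toString, seed)

    // Keys that the shuffle routes to partition 0.
    val keysInPartition0 =
      (0 until 500000).filter(k => Math.floorMod(hash(k), numShufflePartitions) == 0)

    // With the same hash function and seed, those keys occupy only 2 distinct
    // slots (0 and 8192) in the 16K aggregation hash map.
    val slots = keysInPartition0.map(k => Math.floorMod(hash(k), aggMapSlots)).distinct

    println(s"keys in partition 0: ${keysInPartition0.size}, distinct agg slots: ${slots.size}")
  }
}
{code}

With an unrelated seed or hash for the aggregation map, the same keys would spread across roughly all 16K slots, which would explain why 8000 partitions behave fine while 8192 does not.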

> very bad performance when shuffle.partition = 8192
> --
>
> Key: SPARK-24076
> URL: https://issues.apache.org/jira/browse/SPARK-24076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
> Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png
>
>
> We see very bad performance when shuffle.partition = 8192 on some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192

2018-04-25 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451728#comment-16451728
 ] 

yucai commented on SPARK-24076:
---

Root cause: very bad hash conflict in hashaggregate.

!image-2018-04-25-14-29-39-958.png!

 

> very bad performance when shuffle.partition = 8192
> --
>
> Key: SPARK-24076
> URL: https://issues.apache.org/jira/browse/SPARK-24076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
> Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png
>
>
> We see very bad performance when shuffle.partition = 8192 on some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192

2018-04-25 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24076:
--
Attachment: image-2018-04-25-14-29-39-958.png

> very bad performance when shuffle.partition = 8192
> --
>
> Key: SPARK-24076
> URL: https://issues.apache.org/jira/browse/SPARK-24076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
> Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png
>
>
> We see very bad performance when shuffle.partition = 8192 on some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192

2018-04-25 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451727#comment-16451727
 ] 

yucai commented on SPARK-24076:
---

The query example:

{code:sql}
insert overwrite table target_xxx
SELECT
 item_id,
 auct_end_dt
FROM
 (select cast(item_id as double) as item_id, auct_end_dt from source_xxx) t
GROUP BY
 item_id,
 auct_end_dt
{code}


> very bad performance when shuffle.partition = 8192
> --
>
> Key: SPARK-24076
> URL: https://issues.apache.org/jira/browse/SPARK-24076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
> Attachments: p1.png, p2.png
>
>
> We see very bad performance when shuffle.partition = 8192 on some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192

2018-04-24 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451540#comment-16451540
 ] 

yucai commented on SPARK-24076:
---

shuffle.partition = 8192

!p1.png!

shuffle.partition = 8000

!p2.png!

> very bad performance when shuffle.partition = 8192
> --
>
> Key: SPARK-24076
> URL: https://issues.apache.org/jira/browse/SPARK-24076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
> Attachments: p1.png, p2.png
>
>
> We see very bad performance when shuffle.partition = 8192 on some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192

2018-04-24 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24076:
--
Attachment: p2.png
p1.png

> very bad performance when shuffle.partition = 8192
> --
>
> Key: SPARK-24076
> URL: https://issues.apache.org/jira/browse/SPARK-24076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
> Attachments: p1.png, p2.png
>
>
> We see very bad performance when shuffle.partition = 8192 on some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24076) very bad performance when shuffle.partition = 8192

2018-04-24 Thread yucai (JIRA)
yucai created SPARK-24076:
-

 Summary: very bad performance when shuffle.partition = 8192
 Key: SPARK-24076
 URL: https://issues.apache.org/jira/browse/SPARK-24076
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: yucai


We see very bad performance when shuffle.partition = 8192 on some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


