[jira] [Commented] (SPARK-29768) nondeterministic expression fails column pruning
[ https://issues.apache.org/jira/browse/SPARK-29768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968028#comment-16968028 ]

yucai commented on SPARK-29768:
-------------------------------

[~smilegator] [~wenchen], is this an issue, or does it work as designed?

> nondeterministic expression fails column pruning
> ------------------------------------------------
>
>                 Key: SPARK-29768
>                 URL: https://issues.apache.org/jira/browse/SPARK-29768
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: yucai
>            Priority: Major
>
> A nondeterministic expression like monotonically_increasing_id fails column pruning:
> {code}
> spark.range(10).selectExpr("id as key", "id * 2 as value").
>   write.format("parquet").save("/tmp/source")
> spark.range(10).selectExpr("id as key", "id * 3 as s1", "id * 5 as s2").
>   write.format("parquet").save("/tmp/target")
>
> val sourceDF = spark.read.parquet("/tmp/source")
> val targetDF = spark.read.parquet("/tmp/target").
>   withColumn("row_id", monotonically_increasing_id())
>
> sourceDF.join(targetDF, "key").select("key", "row_id").explain()
> {code}
> Spark reads all columns from targetDF, but actually we only need the `key` column.
> {code}
> scala> sourceDF.join(targetDF, "key").select("key", "row_id").explain()
> == Physical Plan ==
> *(2) Project [key#78L, row_id#88L]
> +- *(2) BroadcastHashJoin [key#78L], [key#82L], Inner, BuildLeft
>    :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
>    :  +- *(1) Project [key#78L]
>    :     +- *(1) Filter isnotnull(key#78L)
>    :        +- *(1) FileScan parquet [key#78L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/source], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct
>    +- *(2) Filter isnotnull(key#82L)
>       +- *(2) Project [key#82L, monotonically_increasing_id() AS row_id#88L]
>          +- *(2) FileScan parquet [key#82L,s1#83L,s2#84L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/target], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
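Until the optimizer handles this case, one workaround (a sketch of my own, not from the ticket) is to prune columns by hand before attaching the nondeterministic column, so the scan of /tmp/target never has to materialize s1/s2. This assumes the same /tmp/source and /tmp/target files from the reproduction above and requires a running Spark session.

```scala
// Hypothetical workaround sketch: project the needed column BEFORE adding
// the nondeterministic column, so the Parquet scan reads only `key`.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val sourceDF = spark.read.parquet("/tmp/source")
val targetDF = spark.read.parquet("/tmp/target")
  .select("key")                                       // manual column pruning
  .withColumn("row_id", monotonically_increasing_id())

// The FileScan for /tmp/target should now list only the key column.
sourceDF.join(targetDF, "key").select("key", "row_id").explain()
```

Because the projection happens before `withColumn`, the nondeterministic expression no longer blocks the ColumnPruning rule from reaching the scan.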
[jira] [Created] (SPARK-29768) nondeterministic expression fails column pruning
yucai created SPARK-29768:
------------------------------

             Summary: nondeterministic expression fails column pruning
                 Key: SPARK-29768
                 URL: https://issues.apache.org/jira/browse/SPARK-29768
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: yucai


A nondeterministic expression like monotonically_increasing_id fails column pruning:

{code}
spark.range(10).selectExpr("id as key", "id * 2 as value").
  write.format("parquet").save("/tmp/source")
spark.range(10).selectExpr("id as key", "id * 3 as s1", "id * 5 as s2").
  write.format("parquet").save("/tmp/target")

val sourceDF = spark.read.parquet("/tmp/source")
val targetDF = spark.read.parquet("/tmp/target").
  withColumn("row_id", monotonically_increasing_id())

sourceDF.join(targetDF, "key").select("key", "row_id").explain()
{code}

Spark reads all columns from targetDF, but actually we only need the `key` column.

{code}
scala> sourceDF.join(targetDF, "key").select("key", "row_id").explain()
== Physical Plan ==
*(2) Project [key#78L, row_id#88L]
+- *(2) BroadcastHashJoin [key#78L], [key#82L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
   :  +- *(1) Project [key#78L]
   :     +- *(1) Filter isnotnull(key#78L)
   :        +- *(1) FileScan parquet [key#78L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/source], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct
   +- *(2) Filter isnotnull(key#82L)
      +- *(2) Project [key#82L, monotonically_increasing_id() AS row_id#88L]
         +- *(2) FileScan parquet [key#82L,s1#83L,s2#84L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/target], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate
yucai created SPARK-26909:
------------------------------

             Summary: use unsafeRow.hashCode() as hash value in HashAggregate
                 Key: SPARK-26909
                 URL: https://issues.apache.org/jira/browse/SPARK-26909
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: yucai


This is a followup PR for #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate. Since the unsafe row includes the [null bit set] etc., the result should be different, and we no longer need the weird `48`.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate
[ https://issues.apache.org/jira/browse/SPARK-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai updated SPARK-26909:
--------------------------
    Description: 
This is a followup PR for #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate. Since the unsafe row includes the [null bit set] etc., the result should be different, so we no longer need the weird `48`.

  was:
This is a followup PR for #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate. Since the unsafe row includes the [null bit set] etc., the result should be different, and we no longer need the weird `48`.

> use unsafeRow.hashCode() as hash value in HashAggregate
> -------------------------------------------------------
>
>                 Key: SPARK-26909
>                 URL: https://issues.apache.org/jira/browse/SPARK-26909
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: yucai
>            Priority: Major
>
> This is a followup PR for #21149.
> The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
> Since the unsafe row includes the [null bit set] etc., the result should be
> different, so we no longer need the weird `48`.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
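The point about the [null bit set] can be illustrated without Spark internals: if the hash covers the row's null bitmap as well as its field bytes, a null field (whose value slot is zeroed) and a genuine zero hash differently. A toy sketch of that idea, not Spark's actual UnsafeRow layout or hash function:

```scala
// Toy row: a null bitmap word followed by fixed-width field values.
// This mimics only the idea behind UnsafeRow's [null bit set] region;
// Spark's real layout and Murmur3-based hashing differ.
final case class ToyRow(nullBits: Int, fields: Array[Long]) {
  def hash: Int = {
    var h = nullBits                      // include the null bitmap in the hash
    for (f <- fields) h = 31 * h + java.lang.Long.hashCode(f)
    h
  }
}

// Field 0 is null (slot zeroed out) vs. field 0 holding a real 0:
val withNull = ToyRow(nullBits = 1, fields = Array(0L, 42L))
val withZero = ToyRow(nullBits = 0, fields = Array(0L, 42L))
assert(withNull.hash != withZero.hash)   // the bitmap keeps them distinct
```

Hashing the whole row representation, bitmap included, is what makes the result differ from a hash computed over field values alone.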
[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase
[ https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25864: -- Description: Set main args correctly in BenchmarkBase, to make it accessible for its subclass. It will benefit: * BuiltInDataSourceWriteBenchmark * AvroWriteBenchmark was: Set main args correctly in BenchmarkBase, to make it accessible for its subclass. It will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark > Make main args set correctly in BenchmarkBase > - > > Key: SPARK-25864 > URL: https://issues.apache.org/jira/browse/SPARK-25864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: yucai >Priority: Major > > Set main args correctly in BenchmarkBase, to make it accessible for its > subclass. > It will benefit: > * BuiltInDataSourceWriteBenchmark > * AvroWriteBenchmark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase
[ https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25864: -- Description: Set main args correctly in BenchmarkBase, to make it accessible for its subclass. It will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark was: Save main args correctly in BenchmarkBase, to make it accessible for its subclass. It will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark > Make main args set correctly in BenchmarkBase > - > > Key: SPARK-25864 > URL: https://issues.apache.org/jira/browse/SPARK-25864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: yucai >Priority: Major > > Set main args correctly in BenchmarkBase, to make it accessible for its > subclass. > It will benefit: > - BuiltInDataSourceWriteBenchmark > - AvroWriteBenchmark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase
[ https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25864: -- Summary: Make main args set correctly in BenchmarkBase (was: Make mainArgs correctly set in BenchmarkBase) > Make main args set correctly in BenchmarkBase > - > > Key: SPARK-25864 > URL: https://issues.apache.org/jira/browse/SPARK-25864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: yucai >Priority: Major > > Save main args correctly in BenchmarkBase, to make it accessible for its > subclass. > It will benefit: > - BuiltInDataSourceWriteBenchmark > - AvroWriteBenchmark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase
[ https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25864: -- Description: Save main args correctly in BenchmarkBase, to make it accessible for its subclass. It will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark was: Make mainArgs correctly set in BenchmarkBase, it will benefit: * BuiltInDataSourceWriteBenchmark * AvroWriteBenchmark * Any other case that needs to access main args after inheriting from BenchmarkBase class > Make mainArgs correctly set in BenchmarkBase > > > Key: SPARK-25864 > URL: https://issues.apache.org/jira/browse/SPARK-25864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: yucai >Priority: Major > > Save main args correctly in BenchmarkBase, to make it accessible for its > subclass. > It will benefit: > - BuiltInDataSourceWriteBenchmark > - AvroWriteBenchmark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase
[ https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25864: -- Description: Make mainArgs correctly set in BenchmarkBase, it will benefit: * BuiltInDataSourceWriteBenchmark * AvroWriteBenchmark * Any other case that needs to access main args after inheriting from BenchmarkBase class was: Make mainArgs correctly set in BenchmarkBase, it will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark - Any other case that needs to access main args after inheriting from BenchmarkBase class. > Make mainArgs correctly set in BenchmarkBase > > > Key: SPARK-25864 > URL: https://issues.apache.org/jira/browse/SPARK-25864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: yucai >Priority: Major > > Make mainArgs correctly set in BenchmarkBase, it will benefit: > * BuiltInDataSourceWriteBenchmark > * AvroWriteBenchmark > * Any other case that needs to access main args after inheriting from > BenchmarkBase class > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase
[ https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25864: -- Description: Make mainArgs correctly set in BenchmarkBase, it will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark - Any other case that needs to access main args after inheriting from BenchmarkBase class. was: Make mainArgs correctly set in BenchmarkBase, it will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark > Make mainArgs correctly set in BenchmarkBase > > > Key: SPARK-25864 > URL: https://issues.apache.org/jira/browse/SPARK-25864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: yucai >Priority: Major > > Make mainArgs correctly set in BenchmarkBase, it will benefit: > - BuiltInDataSourceWriteBenchmark > - AvroWriteBenchmark > - Any other case that needs to access main args after inheriting from > BenchmarkBase class. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase
[ https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25864: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-25475 > Make mainArgs correctly set in BenchmarkBase > > > Key: SPARK-25864 > URL: https://issues.apache.org/jira/browse/SPARK-25864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: yucai >Priority: Major > > Make mainArgs correctly set in BenchmarkBase, it will benefit: > - BuiltInDataSourceWriteBenchmark > - AvroWriteBenchmark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase
yucai created SPARK-25864: - Summary: Make mainArgs correctly set in BenchmarkBase Key: SPARK-25864 URL: https://issues.apache.org/jira/browse/SPARK-25864 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: yucai Make mainArgs correctly set in BenchmarkBase, it will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
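The change being described amounts to capturing `main`'s arguments in a field of the base class before running the suite, so that subclasses such as BuiltInDataSourceWriteBenchmark can read them. A minimal sketch of the idea (illustrative names, not Spark's actual BenchmarkBase):

```scala
// Illustrative sketch, not Spark's actual BenchmarkBase.
abstract class BenchmarkBaseSketch {
  // Saved here so subclasses can inspect the launch arguments.
  var mainArgs: Array[String] = Array.empty

  def runBenchmarkSuite(): Unit

  final def main(args: Array[String]): Unit = {
    mainArgs = args          // capture args BEFORE the suite runs
    runBenchmarkSuite()
  }
}

object WriteBenchmarkSketch extends BenchmarkBaseSketch {
  override def runBenchmarkSuite(): Unit = {
    // The subclass can now branch on how it was launched.
    val numRows = mainArgs.headOption.map(_.toInt).getOrElse(1024)
    println(s"would benchmark writing $numRows rows")
  }
}
```

The key ordering constraint is that the assignment happens before `runBenchmarkSuite()` is invoked; otherwise the subclass would observe an empty array.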
[jira] [Commented] (SPARK-25663) Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1148#comment-1148 ]

yucai commented on SPARK-25663:
-------------------------------

[~Gengliang.Wang] I made an improvement on this; could you help review?
https://github.com/apache/spark/pull/22861

> Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use
> main method
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25663
>                 URL: https://issues.apache.org/jira/browse/SPARK-25663
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25850) Make the split threshold for the code generated method configurable
yucai created SPARK-25850:
------------------------------

             Summary: Make the split threshold for the code generated method configurable
                 Key: SPARK-25850
                 URL: https://issues.apache.org/jira/browse/SPARK-25850
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: yucai


As per the discussion in https://github.com/apache/spark/pull/22823/files#r228400706, add a new configuration spark.sql.codegen.methodSplitThreshold to make the split threshold for the code-generated method configurable.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
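The intent of such a threshold can be sketched outside of Spark's codegen: accumulate generated statements into one method body until adding the next would exceed the threshold, then start a new method. This is a simplified illustration; the real splitting in Spark's CodeGenerator is more involved, and the proposed `spark.sql.codegen.methodSplitThreshold` governs the source length it uses.

```scala
// Simplified illustration of threshold-based method splitting; Spark's real
// logic in CodeGenerator also tracks parameters, expressions, etc.
def splitIntoMethods(statements: Seq[String], thresholdChars: Int): Seq[String] = {
  val methods = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  for (stmt <- statements) {
    if (current.nonEmpty && current.length + stmt.length > thresholdChars) {
      methods += current.toString   // close out the current method body
      current.clear()
    }
    current ++= stmt
  }
  if (current.nonEmpty) methods += current.toString
  methods.toSeq
}
```

A smaller threshold produces more, smaller methods, which helps stay under the JVM's 64KB-per-method bytecode limit and can keep methods JIT-friendly, at the cost of extra calls; making it configurable lets users tune that trade-off.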
[jira] [Commented] (SPARK-25676) Refactor BenchmarkWideTable to use main method
[ https://issues.apache.org/jira/browse/SPARK-25676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663480#comment-16663480 ]

yucai commented on SPARK-25676:
-------------------------------

I am working on this.

> Refactor BenchmarkWideTable to use main method
> ----------------------------------------------
>
>                 Key: SPARK-25676
>                 URL: https://issues.apache.org/jira/browse/SPARK-25676
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25508) Refactor OrcReadBenchmark to use main method
yucai created SPARK-25508: - Summary: Refactor OrcReadBenchmark to use main method Key: SPARK-25508 URL: https://issues.apache.org/jira/browse/SPARK-25508 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.5.0 Reporter: yucai -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25486) Refactor SortBenchmark to use main method
yucai created SPARK-25486: - Summary: Refactor SortBenchmark to use main method Key: SPARK-25486 URL: https://issues.apache.org/jira/browse/SPARK-25486 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.5.0 Reporter: yucai -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25485) Refactor UnsafeProjectionBenchmark to use main method
yucai created SPARK-25485: - Summary: Refactor UnsafeProjectionBenchmark to use main method Key: SPARK-25485 URL: https://issues.apache.org/jira/browse/SPARK-25485 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.5.0 Reporter: yucai -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25481: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-25475 > Refactor ColumnarBatchBenchmark to use main method > -- > > Key: SPARK-25481 > URL: https://issues.apache.org/jira/browse/SPARK-25481 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method
yucai created SPARK-25481: - Summary: Refactor ColumnarBatchBenchmark to use main method Key: SPARK-25481 URL: https://issues.apache.org/jira/browse/SPARK-25481 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.5.0 Reporter: yucai -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai updated SPARK-23207:
--------------------------
    Description: 
Currently shuffle repartition uses RoundRobinPartitioning; the generated result is nondeterministic because the sequence of input rows is not determined.

The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as in the pattern below:

upstream stage -> repartition stage -> result stage
(-> indicates a shuffle)

When one of the executor processes goes down, some tasks of the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried, generating different data.

The following code returns 931532 instead of 1000000:
{code:java}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}

  was:
Currently shuffle repartition uses RoundRobinPartitioning; the generated result is nondeterministic because the sequence of input rows is not determined.

The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as in the pattern below:

upstream stage -> repartition stage -> result stage
(-> indicates a shuffle)

When one of the executor processes goes down, some tasks of the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried, generating different data.

The following code returns 931532 instead of 1000000:
{code}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> -------------------------------------------------------------------
>
>                 Key: SPARK-23207
>                 URL: https://issues.apache.org/jira/browse/SPARK-23207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>            Reporter: Xingbo Jiang
>            Assignee: Xingbo Jiang
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.1.4, 2.2.3, 2.3.0
>
> Currently shuffle repartition uses RoundRobinPartitioning; the generated
> result is nondeterministic because the sequence of input rows is not
> determined.
> The bug can be triggered when there is a repartition call following a shuffle
> (which would lead to non-deterministic row ordering), as in the pattern below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition
> stage will be retried and generate inconsistent ordering, and some tasks of
> the result stage will be retried, generating different data.
> The following code returns 931532 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
>
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
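The root cause can be demonstrated without Spark: round-robin placement is a function of arrival order, so a retried task that receives the same rows in a different order routes them to different partitions. Making placement depend only on the data, for example by sorting each input partition first (in spirit, the direction the eventual fix took), restores determinism. An illustrative sketch:

```scala
// Round-robin assignment depends on input order: a retried task that sees
// the same elements in a different order routes them differently.
def roundRobin[T](rows: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  rows.zipWithIndex
    .groupBy { case (_, idx) => idx % numPartitions }
    .map { case (p, xs) => p -> xs.map(_._1) }

val run1 = roundRobin(Seq(1, 2, 3, 4), 2)
val run2 = roundRobin(Seq(2, 1, 4, 3), 2) // same rows, different arrival order
assert(run1 != run2)                       // inconsistent placement across "retries"

// Sorting first makes placement a pure function of the data:
def deterministic[T: Ordering](rows: Seq[T], n: Int): Map[Int, Seq[T]] =
  roundRobin(rows.sorted, n)
assert(deterministic(Seq(1, 2, 3, 4), 2) == deterministic(Seq(2, 1, 4, 3), 2))
```

When only some result-stage tasks are recomputed against the reshuffled placement, rows can be lost or duplicated, which is why `res.distinct().count()` comes out below one million.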
[jira] [Resolved] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai resolved SPARK-25206.
---------------------------
    Resolution: Won't Fix

Per [~cloud_fan]'s summary, this will not be backported to 2.3; closing.

> wrong records are returned when Hive metastore schema and parquet schema are
> in different letter cases
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-25206
>                 URL: https://issues.apache.org/jira/browse/SPARK-25206
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.2, 2.3.1
>            Reporter: yucai
>            Priority: Blocker
>              Labels: Parquet, correctness
>         Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png
>
> In current Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
>
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
>
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
> into parquet, but {color:#ff}ID{color} does not exist in /tmp/data
> (parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which is perfect for this issue.
>
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even with spark.sql.caseSensitive set
> to false.
> SPARK-25132 addressed this issue already.
>
> The biggest difference is that in Spark 2.1, the user gets an Exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
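In essence, the combination of SPARK-24716 and SPARK-25132 makes Spark resolve the metastore column name against the physical Parquet field names case-insensitively (when spark.sql.caseSensitive is false) rather than requiring an exact match. A simplified, standalone illustration of that resolution step, outside Spark:

```scala
// Simplified sketch of case-insensitive column resolution, not Spark's code.
// Returns the physical field name to read, or None if absent or ambiguous.
def resolveColumn(requested: String,
                  physicalFields: Seq[String],
                  caseSensitive: Boolean): Option[String] =
  if (caseSensitive) physicalFields.find(_ == requested)
  else physicalFields.filter(_.equalsIgnoreCase(requested)) match {
    case Seq(unique) => Some(unique)   // exactly one case-insensitive match
    case _           => None           // missing, or ambiguous (e.g. both "id" and "ID")
  }

// Metastore says "ID"; the Parquet file physically contains "id".
assert(resolveColumn("ID", Seq("id"), caseSensitive = false) == Some("id"))
assert(resolveColumn("ID", Seq("id"), caseSensitive = true).isEmpty)
```

With the resolved physical name ("id") in hand, both the pushed-down filter and the read schema refer to a column that actually exists in the file, so the query above returns rows instead of an empty result.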
[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16598113#comment-16598113 ]

yucai commented on SPARK-25206:
-------------------------------

Based on our discussion in https://github.com/apache/spark/pull/22184#issuecomment-416840509, it seems [~cloud_fan] prefers not to backport; need his confirmation.

> wrong records are returned when Hive metastore schema and parquet schema are
> in different letter cases
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-25206
>                 URL: https://issues.apache.org/jira/browse/SPARK-25206
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.2, 2.3.1
>            Reporter: yucai
>            Priority: Blocker
>              Labels: Parquet, correctness
>         Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png
>
> In current Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
>
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the parquet schema.
>
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer)
> into parquet, but {color:#ff}ID{color} does not exist in /tmp/data
> (parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore
> schema to do the pushdown, which is perfect for this issue.
>
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet
> schema are in different letter cases, even with spark.sql.caseSensitive set
> to false.
> SPARK-25132 addressed this issue already.
>
> The biggest difference is that in Spark 2.1, the user gets an Exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25281) Add tests to check the behavior when the physical schema and logical schema use difference cases
[ https://issues.apache.org/jira/browse/SPARK-25281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597236#comment-16597236 ]

yucai commented on SPARK-25281:
-------------------------------

cc [~smilegator], [~cloud_fan].

> Add tests to check the behavior when the physical schema and logical schema
> use difference cases
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-25281
>                 URL: https://issues.apache.org/jira/browse/SPARK-25281
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: yucai
>            Priority: Major
>
> As per the discussion in SPARK-25206, Spark needs tests to check the behavior
> when the physical schema and logical schema use different cases.
> https://issues.apache.org/jira/browse/SPARK-25206?focusedCommentId=16593041&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593041

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25281) Add tests to check the behavior when the physical schema and logical schema use difference cases
[ https://issues.apache.org/jira/browse/SPARK-25281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597217#comment-16597217 ]

yucai commented on SPARK-25281:
-------------------------------

[~seancxmao], since you have done many tests in PR22184 and SPARK-25175, do you want to take this task? If not, I can work on this :).

https://github.com/apache/spark/pull/22184#discussion_r212405373
https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593185&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593185

> Add tests to check the behavior when the physical schema and logical schema
> use difference cases
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-25281
>                 URL: https://issues.apache.org/jira/browse/SPARK-25281
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: yucai
>            Priority: Major
>
> As per the discussion in SPARK-25206, Spark needs tests to check the behavior
> when the physical schema and logical schema use different cases.
> https://issues.apache.org/jira/browse/SPARK-25206?focusedCommentId=16593041&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593041

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25281) Add tests to check the behavior when the physical schema and logical schema use difference cases
yucai created SPARK-25281:
------------------------------

             Summary: Add tests to check the behavior when the physical schema and logical schema use difference cases
                 Key: SPARK-25281
                 URL: https://issues.apache.org/jira/browse/SPARK-25281
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: yucai


As per the discussion in SPARK-25206, Spark needs tests to check the behavior when the physical schema and logical schema use different cases.

https://issues.apache.org/jira/browse/SPARK-25206?focusedCommentId=16593041&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593041

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595298#comment-16595298 ] yucai edited comment on SPARK-25206 at 8/28/18 5:06 PM: Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, which could be more meaningful. was (Author: yucai): Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful. > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to different letter > cases between the Hive metastore schema and the parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 addressed this issue already. > > The biggest difference is that in Spark 2.1, the user will get an Exception for the same > query: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > So they will know about the issue and fix the query. > But in Spark 2.3, the user will get wrong results silently. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
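The pushdown failure described in the root cause above can be illustrated outside Spark with a small sketch. This is illustrative Python, not Spark or Parquet code; the row format and predicate are assumptions made for the example:

```python
# Simulate a Parquet file whose physical column is lowercase "id",
# queried through a metastore schema that spells it uppercase "ID".
rows = [{"id": i} for i in range(10)]

def scan(rows, column, predicate, case_sensitive):
    """Return rows whose resolved column value satisfies the predicate."""
    out = []
    for row in rows:
        if case_sensitive:
            # A case-sensitive reader matches names exactly: "ID" is simply
            # absent, so the predicate never fires and every row is dropped.
            value = row.get(column)
        else:
            # Case-insensitive resolution maps "ID" back to physical "id".
            matches = [v for k, v in row.items() if k.lower() == column.lower()]
            value = matches[0] if matches else None
        if value is not None and predicate(value):
            out.append(row)
    return out

# Case-sensitive lookup of "ID" silently returns no rows, mirroring the
# empty result of `select * from t where id > 0` in the report above.
assert scan(rows, "ID", lambda v: v > 0, case_sensitive=True) == []
# Resolving the name case-insensitively recovers the expected 9 rows.
assert len(scan(rows, "ID", lambda v: v > 0, case_sensitive=False)) == 9
```

The sketch only models name resolution; the real fix (SPARK-24716) instead builds the pushed-down filter from the Parquet file schema so the column name already has the physical spelling.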
[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298 ] yucai edited comment on SPARK-25206 at 8/28/18 5:05 PM: Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful. was (Author: yucai): Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful.
[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298 ] yucai commented on SPARK-25206: --- Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful.
[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594958#comment-16594958 ] yucai commented on SPARK-25206: --- [~smilegator] , 2.1's exception is from parquet.
{code:java}
java.lang.IllegalArgumentException: Column [ID] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
{code}
2.1 uses parquet 1.8.1, while 2.3 uses parquet 1.8.3; this is a behavior change in parquet.
See: https://issues.apache.org/jira/browse/PARQUET-389 [https://github.com/apache/parquet-mr/commit/2282c22c5b252859b459cc2474350fbaf2a588e9]
[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After a deep dive, we found two issues, both related to different letter cases between the Hive metastore schema and the parquet schema. 1. The wrong column is pushed down. Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive; it actually has {color:#ff}id{color}). So no records are returned. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. SPARK-25132 addressed this issue already. The biggest difference is that in Spark 2.1, the user will get an Exception for the same query: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} So they will know about the issue and fix the query. But in Spark 2.3, the user will get wrong results silently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter cases between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 addressed this issue already. The biggest difference is, in Spark 2.1, user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} So they will know the issue and fix the query. But in Spark 2.3, user will get the wrong results sliently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593152#comment-16593152 ] yucai commented on SPARK-25175: --- I pinged [~seancxmao] offline, he will give more details. > Case-insensitive field resolution when reading from ORC > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues. Since Spark has two > OrcFileFormat implementations, we should add support for both. > * Since SPARK-2883, Spark has supported ORC inside the sql/hive module with a Hive > dependency. This Hive OrcFileFormat doesn't support case-insensitive field > resolution at all. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution; however, it cannot > handle duplicate fields.
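The duplicate-field limitation mentioned for the native OrcFileFormat can be sketched as follows. This is illustrative Python, not ORC's actual code; raising on ambiguity is an assumption about reasonable behavior, not what any Spark version does:

```python
def resolve_case_insensitive(requested, physical_fields):
    """Case-insensitively resolve one logical name; ambiguity is an error."""
    matches = [f for f in physical_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        # Two physical fields differ only by case, so a case-insensitive
        # lookup cannot pick one deterministically.
        raise ValueError(f"ambiguous field {requested!r}: matches {matches}")
    return matches[0] if matches else None

# Unambiguous case difference resolves fine.
assert resolve_case_insensitive("name", ["NAME", "age"]) == "NAME"

# A file containing both "id" and "ID" makes the lookup ambiguous.
try:
    resolve_case_insensitive("id", ["id", "ID"])
    raised = False
except ValueError:
    raised = True
assert raised
```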
[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: wrong records are returned when Hive metastore schema and parquet schema are in different letter cases (was: data issue when Hive metastore schema and parquet schema are in different letter cases)
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter cases between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 addressed this issue already. The biggest difference is, in Spark 2.1, user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} So they will know the issue and fix the query. But in Spark 2.3, user will get the wrong results sliently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. 
{code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. 
Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: data issue when Hive metastore schema and parquet schema are in different letter cases (was: data issue when Hive metastore schema and parquet schema have different letter case) > data issue when Hive metastore schema and parquet schema are in different > letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. > SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
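The second issue above is a schema-resolution problem that can be illustrated outside Spark. Below is a minimal Python sketch of the idea behind SPARK-25132's fix (the function name `resolve_fields` and its shape are hypothetical, not Spark's actual implementation): match each metastore column against the Parquet file's physical field names, ignoring case when `spark.sql.caseSensitive` is false, and fail loudly on ambiguity instead of silently returning NULL.

```python
def resolve_fields(metastore_cols, parquet_fields, case_sensitive=False):
    """Map each metastore column name to the physical Parquet field name.

    Returns None for a column with no physical match (Spark would then read
    it as NULL); raises on a case-insensitive ambiguity (e.g. both "id" and
    "ID" present in the same file).
    """
    if case_sensitive:
        # Exact-name matching only: "ID" will not find "id".
        return {c: (c if c in parquet_fields else None) for c in metastore_cols}

    # Group physical field names by their lowercase form.
    by_lower = {}
    for f in parquet_fields:
        by_lower.setdefault(f.lower(), []).append(f)

    resolved = {}
    for c in metastore_cols:
        matches = by_lower.get(c.lower(), [])
        if len(matches) > 1:
            raise ValueError(f"Ambiguous match for column {c!r}: {matches}")
        resolved[c] = matches[0] if matches else None
    return resolved

# The scenario from this issue: the metastore says ID, the file has id.
print(resolve_fields(["ID"], ["id"]))                       # {'ID': 'id'}
print(resolve_fields(["ID"], ["id"], case_sensitive=True))  # {'ID': None} -> NULLs
```

With case-sensitive matching the lookup fails and the column reads as NULL, which is exactly why `select * from t` shows all-NULL rows in the reproduction above.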
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: data issue when Hive metastore schema and parquet schema have different letter case (was: data issue when ) > data issue when Hive metastore schema and parquet schema have different > letter case > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. > SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > -SPARK-25132-'s backport is tracked in its own JIRA. Use this JIRA to track the backport of SPARK-24716. > > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. 1. The wrong column is pushed down. Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. 1. The wrong column is pushed down. Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- -SPARK-25132-'s backport is tracked in its own JIRA. Use this JIRA to track the backport of SPARK-24716. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue when Hive metastore schema and parquet schema have different > letter case > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > 2.
[jira] [Updated] (SPARK-25206) data issue when
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: data issue when (was: data issue because wrong column is pushdown for parquet) > data issue when > > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. > SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > -SPARK-25132-'s backport is tracked in its own JIRA. Use this JIRA to track the backport of SPARK-24716. > > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. 1. The wrong column is pushed down. Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- -SPARK-25132-'s backport is tracked in its own JIRA. Use this JIRA to track the backport of SPARK-24716. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. 1. The wrong column is pushed down. Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. 2. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. > SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > -SPARK-25132-'s backport has been tracked in
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. 1. The wrong column is pushed down. Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. 2. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* It has two issues. 1. Wrong column. Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to the letter-case difference between the Hive metastore schema and the Parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > 2. > > > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* It has two issues. 1. Wrong column. Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > It has two issues. > 1. Wrong column > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
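The change SPARK-24716 relies on can be sketched as a name-rewriting step: build the pushdown predicate against the file's own (Parquet) schema rather than the metastore schema, so the filter references `id`, the column that actually exists in the file. A minimal Python sketch follows; the `GreaterThan` class and `rewrite_for_file` helper are hypothetical stand-ins, not Parquet's real `FilterApi`.

```python
from dataclasses import dataclass

@dataclass
class GreaterThan:
    """A toy stand-in for a pushdown predicate like FilterApi.gt(col, value)."""
    column: str
    value: int

def rewrite_for_file(pred, parquet_fields):
    """Rewrite a pushdown predicate to use the physical column name.

    If no unique case-insensitive match exists, return None: the filter is
    not pushed down, and the engine must evaluate it after the scan instead.
    """
    matches = [f for f in parquet_fields if f.lower() == pred.column.lower()]
    if len(matches) != 1:
        return None  # unmatched or ambiguous: skip pushdown entirely
    return GreaterThan(matches[0], pred.value)

# The metastore schema says ID; the Parquet file only has id.
print(rewrite_for_file(GreaterThan("ID", 0), ["id"]))  # GreaterThan(column='id', value=0)
```

Pushing `id > 0` instead of `ID > 0` is exactly the difference between returning the ten expected rows and silently returning none.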
[jira] [Comment Edited] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593126#comment-16593126 ] yucai edited comment on SPARK-25206 at 8/27/18 2:27 AM: [~dongjoon], because of the root cause below {quote}Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). {quote} I changed the title to emphasize that the wrong column is pushed down: "id" should be pushed down instead of "ID". Feel free to let me know if you have any concern. This issue exists only in 2.3; master is different. was (Author: yucai): [~dongjoon], because of the root cause below {quote}Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). {quote} I changed the title to emphasize that the wrong column is pushed down: "id" should be pushed down instead of "ID". Feel free to let me know if you have any concern. > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Commented] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593126#comment-16593126 ] yucai commented on SPARK-25206: --- [~dongjoon], because of the root cause below {quote}Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). {quote} I changed the title to emphasize that the wrong column is pushed down: "id" should be pushed down instead of "ID". Feel free to let me know if you have any concern. > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t").show ++ | ID| ++ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| ++ scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ scala> sql("set spark.sql.parquet.filterPushdown").show ++-+ | key|value| ++-+ |spark.sql.parquet...| true| ++-+ scala> sql("set spark.sql.parquet.filterPushdown=false").show ++-+ | key|value| ++-+ |spark.sql.parquet...|false| ++-+ scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). So no records are returned. In Spark 2.1, the user gets an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they get wrong results silently. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.2, 2.3.1 > Reporter: yucai > Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} > But in Spark 2.3, they get wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
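The filterPushdown experiment above shows why disabling pushdown does not help: with `spark.sql.parquet.filterPushdown=false` the scan still resolves `ID` case-sensitively against the file, reads every value as NULL (the second issue), and the post-scan filter `id > 0` then rejects every NULL row, so the result stays empty. A minimal Python simulation of that interaction (the `scan` helper is hypothetical, not Spark code):

```python
def scan(rows, metastore_col, parquet_col):
    """Simulate a case-sensitive scan: an unmatched column name reads as NULL."""
    name = parquet_col if metastore_col == parquet_col else None
    return [row.get(name) for row in rows]

file_rows = [{"id": i} for i in range(10)]

# Pushdown disabled: scan first, apply the filter afterwards.
values = scan(file_rows, "ID", "id")  # all None, because "ID" != "id"
result = [v for v in values if v is not None and v > 0]
print(values[:3], result)  # [None, None, None] []
```

Only when the scan matches the names case-insensitively (SPARK-25132) do real values come back for the filter to act on, which is why both fixes are needed together.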
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai updated SPARK-25206:
--------------------------
    Summary: data issue because wrong column is pushdown for parquet  (was: Wrong data may be returned for Parquet)
[jira] [Commented] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593110#comment-16593110 ]

yucai commented on SPARK-25207:
-------------------------------
[~dongjoon], sorry if I am confusing you. This bug is created for the master branch, because master already has SPARK-25132 and -SPARK-24716-, so it does not hit the issue below:
{code:java}
scala> sql("select * from t").show    // Parquet returns NULL for `ID` because it has `id`.
+----+
|  ID|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+

scala> sql("select * from t where id > 0").show    // `NULL > 0` is `false`.
+---+
| ID|
+---+
+---+
{code}

> Case-insensitive field resolution for filter pushdown when reading Parquet
> --------------------------------------------------------------------------
>
>                 Key: SPARK-25207
>                 URL: https://issues.apache.org/jira/browse/SPARK-25207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: yucai
>            Priority: Major
>              Labels: Parquet
>         Attachments: image.png
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive metastore schema are in different letter cases, even when spark.sql.caseSensitive is false. For example:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain    // Filters are pushed with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>    +- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: struct
>
> scala> sql("select * from t").show    // Parquet returns NULL for `ID` because it has `id`.
> +----+
> |  ID|
> +----+
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> +----+
>
> scala> sql("select * from t where id > 0").show    // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}
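The `NULL > 0` note above follows SQL's three-valued logic: a comparison with NULL evaluates to unknown, and a WHERE clause keeps only rows that evaluate to true. A minimal sketch in plain Python, modelling NULL as `None`:

```python
def sql_gt(a, b):
    """SQL-style `a > b`: any NULL (None) operand yields unknown (None)."""
    if a is None or b is None:
        return None
    return a > b

# Every `ID` value is read back as NULL because of the case mismatch,
# so `WHERE id > 0` evaluates to unknown for every row and keeps none.
rows = [None] * 10
kept = [r for r in rows if sql_gt(r, 0) is True]
print(len(kept))  # 0
```

This is why the filtered query returns an empty result even though the unfiltered query returns ten (NULL) rows.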
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593102#comment-16593102 ]

yucai commented on SPARK-25206:
-------------------------------
I am OK with treating this as a "known correctness bug in 2.3"; I just raised some concerns in my previous post.
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593100#comment-16593100 ]

yucai commented on SPARK-25206:
-------------------------------
[~smilegator], sure, I will add tests. If we don't backport SPARK-25132 and SPARK-24716, users will hit the issue below:
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
{code}
The biggest difference is that in Spark 2.1 they get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code}
so they notice the issue and fix the query. But in Spark 2.3 they silently get wrong results, which might go unnoticed. Could that be risky for users?
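The contrast the comment draws (2.1 fails fast, 2.3 silently returns nothing) can be sketched generically. The `strict` flag and `lookup_column` below are hypothetical illustrations, not Spark options or APIs:

```python
def lookup_column(name, schema, strict):
    """Strict lookup raises, like Spark 2.1's Parquet reader;
    lenient lookup silently matches nothing, like Spark 2.3."""
    if name in schema:
        return name
    if strict:
        raise ValueError("Column [%s] was not found in schema!" % name)
    return None  # filter matches no rows -> silently empty result

schema = ["id"]
try:
    lookup_column("ID", schema, strict=True)
except ValueError as e:
    print(e)  # surfaces the case mismatch to the user
print(lookup_column("ID", schema, strict=False))  # None: wrong results, no warning
```

For a correctness bug, failing fast is the safer of the two behaviours, which is the concern raised above about backporting.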
[jira] [Comment Edited] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592453#comment-16592453 ]

yucai edited comment on SPARK-25206 at 8/25/18 5:01 AM:
--------------------------------------------------------
{quote}# Vanilla Spark 2.2.0 ~ 2.3.1 always returns NULL for Parquet mismatched columns due to case sensitivity. It's not related to filters at all.{quote}
Agreed, that part has nothing to do with filters; actually, that issue has existed since 2.0.
{quote}The reason why SPARK-25132 is not complete to your situations is simply that it's based on `master` branch. It depends on lots of improvements only in `master` branch.{quote}
This part needs clarification: -SPARK-25132- has no dependencies, so I can backport it alone, but the data issue still exists if I backport only it. This time the root cause is filter pushdown.
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592460#comment-16592460 ]

yucai commented on SPARK-25206:
-------------------------------
[~dongjoon], thanks a lot for all the explanations. If we both agree to backport -SPARK-25132- + -SPARK-24716-, we can go ahead :). But master's Parquet filter pushdown is still buggy in case-insensitive mode; I have submitted the PR in SPARK-25207. Kindly help review.
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592453#comment-16592453 ]

yucai commented on SPARK-25206:
-------------------------------
{quote}# Vanilla Spark 2.2.0 ~ 2.3.1 always returns NULL for Parquet mismatched columns due to case sensitivity. It's not related to filters at all.{quote}
Agreed, that part has nothing to do with filters; actually, that issue has existed since 2.0.
{quote}The reason why SPARK-25132 is not complete to your situations is simply that it's based on `master` branch. It depends on lots of improvements only in `master` branch.{quote}
This part needs clarification: -SPARK-25132- has no dependencies, so I can backport it alone, but the data issue still exists if I backport only it.
[jira] [Comment Edited] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592425#comment-16592425 ]

yucai edited comment on SPARK-25206 at 8/25/18 3:33 AM:
--------------------------------------------------------
[~dongjoon], correct me if I am wrong.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
Based on 2.3.1:
* Backport SPARK-25132 only: Spark shows nothing if pushdown is enabled, and shows "1 to 9" if pushdown is disabled.
* Backport -SPARK-25132- + SPARK-24716: Spark pushes down nothing and shows "1 to 9".
* Backport -SPARK-25132- + -SPARK-24716- + SPARK-25207: Spark pushes down "id > 0" correctly and shows "1 to 9".
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592425#comment-16592425 ]

yucai commented on SPARK-25206:
-------------------------------
[~dongjoon], correct me if I am wrong.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
Based on 2.3.1:
* Backport SPARK-25132 only: Spark shows nothing if pushdown is enabled, and shows "1 to 9" if pushdown is disabled.
* Backport -SPARK-25132- + SPARK-24716: Spark pushes down nothing and shows "1 to 9".
* Backport -SPARK-25132- + -SPARK-24716- + SPARK-25207: Spark pushes down "id > 0" correctly and shows "1 to 9".
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592406#comment-16592406 ]

yucai commented on SPARK-25206:
-------------------------------
This is not a simple duplicate: backporting -SPARK-25132- without -SPARK-24716- is still buggy. See my test below (*Attention*: SPARK-25132 backported only).
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

scala> sql("set spark.sql.parquet.filterPushdown").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.parquet...| true|
+--------------------+-----+

scala> sql("set spark.sql.parquet.filterPushdown=false").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.parquet...|false|
+--------------------+-----+

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
|  7|
|  8|
|  9|
|  2|
|  3|
|  4|
|  5|
|  6|
|  1|
+---+
{code}
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592392#comment-16592392 ]

yucai commented on SPARK-25206:
-------------------------------
[~dongjoon], the reason you see `null` even without predicate pushdown is https://issues.apache.org/jira/browse/SPARK-25132, which is one of the issues behind this bug.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t").show
+----+
|  ID|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+{code}
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592390#comment-16592390 ] yucai commented on SPARK-25206:
---
Linking to SPARK-25132; fixing this bug needs two PRs backported.
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592384#comment-16592384 ] yucai commented on SPARK-25206:
---
[~dongjoon], I still think this bug is related to pushdown, but unfortunately there are actually two issues, which makes it quite confusing. Let me explain:

1. The wrong column name is pushed down into Parquet. Spark pushes down "ID > 0", but the Parquet file has "id" in its schema instead of "ID", so 0 records are returned by the *Parquet scan stage*. *Attention*: it is not the Filter that drops them; no record comes out of the *Parquet scan* at all in this case. We can confirm this in Spark's SQL metrics: "number of output rows" in the Scan node is 0.
{code:java}
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell

spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
{code}
!image-2018-08-25-10-04-21-901.png!

That's why we need to backport https://github.com/apache/spark/pull/21696. With it, Spark pushes the correct filter "id > 0" down into Parquet.

2. Unfortunately, https://github.com/apache/spark/pull/21696 alone is not enough: because of https://issues.apache.org/jira/browse/SPARK-25132, we also need to backport https://github.com/apache/spark/pull/22183.

Does it make sense to you?
[jira] [Updated] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Attachment: image-2018-08-25-10-04-21-901.png
[jira] [Updated] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Attachment: image-2018-08-25-09-54-53-219.png
[jira] [Commented] (SPARK-25206) Wrong data may be returned when enable pushdown
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591756#comment-16591756 ] yucai commented on SPARK-25206:
---
[~cloud_fan], we need both https://github.com/apache/spark/pull/21696 and https://github.com/apache/spark/pull/22183 for this bug.

*With only* https://github.com/apache/spark/pull/21696, no records are returned.
{code:java}
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell

spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").write.csv("/tmp/data_csv")

scala> spark.read.csv("/tmp/data_csv")
res4: org.apache.spark.sql.DataFrame = []
{code}
*Root Cause*: no filter is pushed down, but "ID" is selected from the Parquet file, which has no such field, so 10 null records are returned by the Parquet scan; they are then all filtered out by "ID > 0" in FilterExec, and finally 0 records are returned. See: !image-2018-08-24-22-46-05-346.png!

*With both* https://github.com/apache/spark/pull/21696 and https://github.com/apache/spark/pull/22183:
{code:java}
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell

spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").write.csv("/tmp/data_csv")

scala> spark.read.csv("/tmp/data_csv").show
+---+
|_c0|
+---+
|  2|
|  3|
|  4|
|  7|
|  8|
|  9|
|  5|
|  6|
|  1|
+---+
{code}
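The second failure mode described above can be simulated without Spark at all. The sketch below is a simplified model (not Spark's reader code): the scan fills a column it cannot find in the file schema with nulls, and the residual Filter then drops every row.

```scala
// Simplified model of the SPARK-25132 behavior: when the requested column
// name is not found in the file schema (exact match), the reader yields
// nulls; the post-scan Filter "ID > 0" then removes all of them.
val fileData: Seq[Long] = (0L until 10L).toSeq // stored under column "id"

def scanColumn(requested: String, fileField: String): Seq[Option[Long]] =
  if (requested == fileField) fileData.map(Some(_)) // found: real values
  else fileData.map(_ => None)                      // missing: all nulls

val scanned  = scanColumn("ID", "id")          // metastore asks for "ID"
val filtered = scanned.filter(_.exists(_ > 0)) // FilterExec: null > 0 never holds

assert(scanned.length == 10 && scanned.forall(_.isEmpty)) // 10 null records from the scan
assert(filtered.isEmpty)                                  // final result: 0 records
```

This is why the two fixes are complementary: PR 21696 stops the wrong name from being pushed down, and the SPARK-25132 fix stops the scan itself from fabricating nulls for a case-mismatched column.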
[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Attachment: image-2018-08-24-22-46-05-346.png
[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Attachment: image-2018-08-24-22-34-11-539.png
[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Attachment: image-2018-08-24-22-33-03-231.png
[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Attachment: pr22183.png
[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Attachment: image-2018-08-24-18-05-23-485.png
[jira] [Created] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
yucai created SPARK-25207:
---
Summary: Case-insensitive field resolution for filter pushdown when reading Parquet
Key: SPARK-25207
URL: https://issues.apache.org/jira/browse/SPARK-25207
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: yucai

Currently, filter pushdown does not work if the Parquet schema and the Hive metastore schema are in different letter cases, even when spark.sql.caseSensitive is false. Like the below case:
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
No filter will be pushed down.
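A minimal sketch of the resolution rule this issue calls for (the function name and shape are illustrative assumptions, not Spark's actual API): resolve the filter's column against the Parquet schema case-insensitively when spark.sql.caseSensitive is false, and refuse to push down when the match is missing or ambiguous.

```scala
// Illustrative sketch: case-insensitive field resolution for pushdown.
// Returns the file schema's own spelling when exactly one field matches,
// and None (skip pushdown) when the match is missing or ambiguous.
def resolveForPushdown(column: String,
                       parquetFields: Seq[String],
                       caseSensitive: Boolean): Option[String] =
  if (caseSensitive) parquetFields.find(_ == column)
  else parquetFields.filter(_.equalsIgnoreCase(column)) match {
    case Seq(unique) => Some(unique) // safe: push the file's own spelling
    case _           => None         // missing, or duplicates differing only in case
  }

assert(resolveForPushdown("ID", Seq("id"), caseSensitive = false) == Some("id"))
assert(resolveForPushdown("ID", Seq("id"), caseSensitive = true).isEmpty)
assert(resolveForPushdown("ID", Seq("id", "ID"), caseSensitive = false).isEmpty)
```

Pushing the file's own spelling ("id") rather than the metastore spelling ("ID") is what makes the pushed predicate actually match row groups in the file.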
[jira] [Updated] (SPARK-25206) Wrong data may be returned when enable pushdown
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206:
---
Description: edited to mention [~yumwang] instead of [~yuming]; the description text is otherwise unchanged.
[jira] [Created] (SPARK-25206) Wrong data may be returned when enable pushdown
yucai created SPARK-25206:
---
Summary: Wrong data may be returned when enable pushdown
Key: SPARK-25206
URL: https://issues.apache.org/jira/browse/SPARK-25206
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.1
Reporter: yucai
[jira] [Commented] (SPARK-25132) Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583145#comment-16583145 ] yucai commented on SPARK-25132:
---
[~cloud_fan] [~smilegator] [~budde] [~ekhliang], do you have any insight?

> Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases
> -------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Chenxiao Mao
> Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, regardless of whether spark.sql.caseSensitive is set to true or false.
> Here is a simple example to reproduce this issue:
>
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
>
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>   `serialization.format` '1'
> )
>
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>          > USING parquet
>          > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
>
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
>
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
[jira] [Commented] (SPARK-25132) Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582667#comment-16582667 ] yucai commented on SPARK-25132:
---
If Spark allows case-insensitive data source resolution, querying t2 should return the numbers. If Spark does not allow it, it should warn the user; silently returning NULL can lead to problems that are very difficult to find.
[jira] [Comment Edited] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue
[ https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576417#comment-16576417 ] yucai edited comment on SPARK-25084 at 8/10/18 3:17 PM: [~smilegator], [~jerryshao] Thanks a lot for marking it blocker. A lot of eBay's tables use "distribute by" or "cluster by", it is important for us to move to Spark 2.3. was (Author: yucai): [~smilegator][~jerryshao] Thanks a lot for marking it blocker. A lot of eBay's tables use "distribute by" or "cluster by", it is important for us to move to Spark 2.3. > "distribute by" on multiple columns may lead to codegen issue > - > > Key: SPARK-25084 > URL: https://issues.apache.org/jira/browse/SPARK-25084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Blocker > > Test Query: > {code:java} > select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, > ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code} > Exception: > {code:java} > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 131, Column 67: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 131, Column 67: One of ', )' expected instead of '[' > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code} > Wrong Codegen: > {code:java} > 
/* 131 */ private int computeHashForStruct_1(InternalRow > mutableStateArray[0], int value1) { > /* 132 */ > /* 133 */ > /* 134 */ if (!mutableStateArray[0].isNullAt(5)) { > /* 135 */ > /* 136 */ final int element5 = mutableStateArray[0].getInt(5); > /* 137 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1); > /* 138 */ > /* 139 */ }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue
[ https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576417#comment-16576417 ] yucai commented on SPARK-25084: --- [~smilegator][~jerryshao] Thanks a lot for marking it blocker. A lot of eBay's tables use "distribute by" or "cluster by", it is important for us to move to Spark 2.3. > "distribute by" on multiple columns may lead to codegen issue > - > > Key: SPARK-25084 > URL: https://issues.apache.org/jira/browse/SPARK-25084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Blocker > > Test Query: > {code:java} > select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, > ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code} > Exception: > {code:java} > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 131, Column 67: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 131, Column 67: One of ', )' expected instead of '[' > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code} > Wrong Codegen: > {code:java} > /* 131 */ private int computeHashForStruct_1(InternalRow > mutableStateArray[0], int value1) { > /* 132 */ > /* 133 */ > /* 134 */ if (!mutableStateArray[0].isNullAt(5)) { > /* 135 */ > /* 136 */ final int 
element5 = mutableStateArray[0].getInt(5); > /* 137 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1); > /* 138 */ > /* 139 */ }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue
[ https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25084: -- Description: Test Query: {code:java} select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code} Exception: {code:java} Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 131, Column 67: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 131, Column 67: One of ', )' expected instead of '[' at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494) at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code} Wrong Codegen: {code:java} /* 131 */ private int computeHashForStruct_1(InternalRow mutableStateArray[0], int value1) { /* 132 */ /* 133 */ /* 134 */ if (!mutableStateArray[0].isNullAt(5)) { /* 135 */ /* 136 */ final int element5 = mutableStateArray[0].getInt(5); /* 137 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, value1); /* 138 */ /* 139 */ }{code} was: Test Query: {code:java} select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, ss_net_profit) limit 1000;{code} Wrong Codegen: {code:java} /* 146 */ private int computeHashForStruct_0(InternalRow mutableStateArray[0], int value1) { /* 147 */ /* 148 */ /* 149 */ if 
(!mutableStateArray[0].isNullAt(0)) { /* 150 */ /* 151 */ final int element = mutableStateArray[0].getInt(0); /* 152 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1); /* 153 */ /* 154 */ }{code} > "distribute by" on multiple columns may lead to codegen issue > - > > Key: SPARK-25084 > URL: https://issues.apache.org/jira/browse/SPARK-25084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Blocker > > Test Query: > {code:java} > select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, > ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk) limit 1;{code} > Exception: > {code:java} > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 131, Column 67: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 131, Column 67: One of ', )' expected instead of '[' > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1435) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342){code} > Wrong Codegen: > {code:java} > /* 131 */ private int computeHashForStruct_1(InternalRow > mutableStateArray[0], int value1) { > /* 132 */ > /* 133 */ > /* 134 */ if (!mutableStateArray[0].isNullAt(5)) { > /* 135 */ > /* 136 */ final int element5 = mutableStateArray[0].getInt(5); > /* 137 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element5, 
value1); > /* 138 */ > /* 139 */ }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue
[ https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575808#comment-16575808 ] yucai commented on SPARK-25084: --- It is a regression: when the generated code size is more than 1024, newer Spark splits it into many functions, but the function definition is wrong, like below: {code:java} private int computeHashForStruct_0(InternalRow mutableStateArray[0], int value1) { {code} Older versions, like 2.1.0, do not split the function, so they do not have this issue. > "distribute by" on multiple columns may lead to codegen issue > - > > Key: SPARK-25084 > URL: https://issues.apache.org/jira/browse/SPARK-25084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Blocker > > Test Query: > {code:java} > select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, > ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, > ss_net_profit) limit 1000;{code} > Wrong Codegen: > {code:java} > /* 146 */ private int computeHashForStruct_0(InternalRow > mutableStateArray[0], int value1) { > /* 147 */ > /* 148 */ > /* 149 */ if (!mutableStateArray[0].isNullAt(0)) { > /* 150 */ > /* 151 */ final int element = mutableStateArray[0].getInt(0); > /* 152 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1); > /* 153 */ > /* 154 */ }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
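The signature is invalid because the splitter pasted the caller-side expression text (`mutableStateArray[0]`) into the parameter list, and `[` cannot appear in a Java parameter name, hence Janino's "One of ', )' expected instead of '['". A simplified Python sketch of that failure mode (illustrative, not the actual CodeGenerator logic; helper names are made up):

```python
import re

# A legal Java identifier: letters, digits, '_' and '$', not starting with a digit.
JAVA_IDENT = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$]*$")

def helper_signature(param_name):
    """Build a split helper's signature from a caller-side name."""
    return f"private int computeHashForStruct_1(InternalRow {param_name}, int value1)"

def params_are_valid(signature):
    """Check every parameter name in the signature is a legal identifier."""
    names = re.findall(r"InternalRow ([^,)]+)", signature)
    return all(JAVA_IDENT.match(n) for n in names)

# Pasting the expression verbatim reproduces the compile failure:
assert not params_are_valid(helper_signature("mutableStateArray[0]"))

# A fresh identifier, with the row passed at the call site, is well-formed:
assert params_are_valid(helper_signature("row"))
```

The fix direction this suggests: the split function must take a fresh parameter name and receive `mutableStateArray[0]` as an argument at the call site instead.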
[jira] [Created] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue
yucai created SPARK-25084: - Summary: "distribute by" on multiple columns may lead to codegen issue Key: SPARK-25084 URL: https://issues.apache.org/jira/browse/SPARK-25084 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: yucai Test Query: {code:java} select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, ss_net_profit) limit 1000;{code} Wrong Codegen: {code:java} /* 146 */ private int computeHashForStruct_0(InternalRow mutableStateArray[0], int value1) { /* 147 */ /* 148 */ /* 149 */ if (!mutableStateArray[0].isNullAt(0)) { /* 150 */ /* 151 */ final int element = mutableStateArray[0].getInt(0); /* 152 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1); /* 153 */ /* 154 */ }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time
[ https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556820#comment-16556820 ] yucai commented on SPARK-24925: --- [~cloud_fan], [~xiaoli] , [~kiszk] , any comments? > input bytesRead metrics fluctuate from time to time > --- > > Key: SPARK-24925 > URL: https://issues.apache.org/jira/browse/SPARK-24925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > Attachments: bytesRead.gif > > > input bytesRead metrics fluctuate from time to time, it is worse when > pushdown enabled. > Query > {code:java} > CREATE TABLE dev AS > SELECT > ... > FROM lstg_item cold, lstg_item_vrtn v > WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > ... > {code} > Issue > See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, > 54GB, 53GB ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time
[ https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556818#comment-16556818 ] yucai commented on SPARK-24925: --- I think there could be two issues in FileScanRDD: 1. ColumnarBatch's bytesRead is only updated every 4096 * 1000 rows, which leaves the metric out of date. 2. When advancing to the next file, FileScanRDD always adds the whole file length into bytesRead, which is inaccurate (pushdown reads much less data). For problem 1, in https://github.com/apache/spark/pull/21791, I tried to update the ColumnarBatch's bytesRead for each batch. For problem 2, updateBytesReadWithFileSize says, "If we can't get the bytes read from the FS stats, fall back to the file size"; can we add the file size only when that situation actually happens? > input bytesRead metrics fluctuate from time to time > --- > > Key: SPARK-24925 > URL: https://issues.apache.org/jira/browse/SPARK-24925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > Attachments: bytesRead.gif > > > input bytesRead metrics fluctuate from time to time, it is worse when > pushdown enabled. > Query > {code:java} > CREATE TABLE dev AS > SELECT > ... > FROM lstg_item cold, lstg_item_vrtn v > WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > ... > {code} > Issue > See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, > 54GB, 53GB ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
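The second problem can be sketched in a few lines of plain Python (a sketch of the idea in the comment, not Spark code; the function name is made up): use the filesystem-statistics delta when it is available, and fall back to the raw file size only when it is not, since unconditionally adding the file length over-counts whenever pushdown skipped row groups.

```python
def bytes_read_for_file(fs_stats_delta, file_length):
    """Account bytesRead when a scan finishes one file.

    fs_stats_delta: bytes actually read per FS statistics, or None
    when no statistics are available for the scheme.
    """
    # Prefer real statistics; fall back to the file size only on None.
    return fs_stats_delta if fs_stats_delta is not None else file_length

MB = 1024 * 1024
# Pushdown read only 10 MB of a 100 MB file; stats keep the metric honest:
assert bytes_read_for_file(10 * MB, 100 * MB) == 10 * MB
# Without stats, the file size is the only estimate left:
assert bytes_read_for_file(None, 100 * MB) == 100 * MB
```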
[jira] [Updated] (SPARK-24925) input bytesRead metrics fluctuate from time to time
[ https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24925: -- Attachment: bytesRead.gif > input bytesRead metrics fluctuate from time to time > --- > > Key: SPARK-24925 > URL: https://issues.apache.org/jira/browse/SPARK-24925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > Attachments: bytesRead.gif > > > input bytesRead metrics fluctuate from time to time, it is worse when > pushdown enabled. > Query > {code:java} > CREATE TABLE dev AS > SELECT > ... > FROM lstg_item cold, lstg_item_vrtn v > WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > ... > {code} > Issue > See attached gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 53GB > ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24925) input bytesRead metrics fluctuate from time to time
[ https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24925: -- Description: input bytesRead metrics fluctuate from time to time, it is worse when pushdown enabled. Query {code:java} CREATE TABLE dev AS SELECT ... FROM lstg_item cold, lstg_item_vrtn v WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) ... {code} Issue See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 53GB ... was: input bytesRead metrics fluctuate from time to time, it is worse when pushdown enabled. Query {code:java} CREATE TABLE dev AS SELECT ... FROM lstg_item cold, lstg_item_vrtn v WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) ... {code} Issue See attached gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 53GB ... > input bytesRead metrics fluctuate from time to time > --- > > Key: SPARK-24925 > URL: https://issues.apache.org/jira/browse/SPARK-24925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > Attachments: bytesRead.gif > > > input bytesRead metrics fluctuate from time to time, it is worse when > pushdown enabled. > Query > {code:java} > CREATE TABLE dev AS > SELECT > ... > FROM lstg_item cold, lstg_item_vrtn v > WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > ... > {code} > Issue > See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, > 54GB, 53GB ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24925) input bytesRead metrics fluctuate from time to time
yucai created SPARK-24925: - Summary: input bytesRead metrics fluctuate from time to time Key: SPARK-24925 URL: https://issues.apache.org/jira/browse/SPARK-24925 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: yucai input bytesRead metrics fluctuate from time to time, it is worse when pushdown enabled. Query {code:java} CREATE TABLE dev AS SELECT ... FROM lstg_item cold, lstg_item_vrtn v WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) ... {code} Issue See attached gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 54GB, 53GB ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24832) Improve inputMetrics's bytesRead update for ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24832: -- Summary: Improve inputMetrics's bytesRead update for ColumnarBatch (was: When pushdown enabled, input bytesRead metrics is easy to fluctuate from time to time) > Improve inputMetrics's bytesRead update for ColumnarBatch > - > > Key: SPARK-24832 > URL: https://issues.apache.org/jira/browse/SPARK-24832 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > > Improve inputMetrics's bytesRead update for ColumnarBatch -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24832) When pushdown enabled, input bytesRead metrics is easy to fluctuate from time to time
[ https://issues.apache.org/jira/browse/SPARK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24832: -- Summary: When pushdown enabled, input bytesRead metrics is easy to fluctuate from time to time (was: Improve inputMetrics's bytesRead update for ColumnarBatch) > When pushdown enabled, input bytesRead metrics is easy to fluctuate from time > to time > - > > Key: SPARK-24832 > URL: https://issues.apache.org/jira/browse/SPARK-24832 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > > Improve inputMetrics's bytesRead update for ColumnarBatch -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24832) Improve inputMetrics's bytesRead update for ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546371#comment-16546371 ] yucai commented on SPARK-24832: --- Currently, ColumnarBatch's bytesRead is only updated every 4096 * 1000 rows, which leaves the metric out of date. Can we update it for each batch? {code:java} if (nextElement.isInstanceOf[ColumnarBatch]) { inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows()) } else { inputMetrics.incRecordsRead(1) } if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) { updateBytesRead() } {code} > Improve inputMetrics's bytesRead update for ColumnarBatch > - > > Key: SPARK-24832 > URL: https://issues.apache.org/jira/browse/SPARK-24832 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > > Improve inputMetrics's bytesRead update for ColumnarBatch -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
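The effect of that change can be seen with a rough Python simulation (interval and batch size taken from the comment; this is a simulation, not Spark code): if each ColumnarBatch is counted as a single record, the interval-based refresh is effectively starved, while counting its rows lets it fire at the intended cadence.

```python
UPDATE_INTERVAL = 4096 * 1000    # records between bytesRead refreshes
BATCH_ROWS = 4096                # rows carried by one ColumnarBatch

def refreshes(num_batches, count_rows):
    """How many times the interval check fires while scanning batches.

    count_rows=False models counting a whole batch as one record;
    count_rows=True models incRecordsRead(batch.numRows()).
    """
    records = updates = 0
    for _ in range(num_batches):
        records += BATCH_ROWS if count_rows else 1
        if records % UPDATE_INTERVAL == 0:
            updates += 1
    return updates

# 2000 batches (~8.2M rows): one-record-per-batch counting never even
# reaches the interval, so bytesRead is never refreshed mid-scan.
assert refreshes(2000, count_rows=False) == 0
assert refreshes(2000, count_rows=True) == 2
```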
[jira] [Created] (SPARK-24832) Improve inputMetrics's bytesRead update for ColumnarBatch
yucai created SPARK-24832: - Summary: Improve inputMetrics's bytesRead update for ColumnarBatch Key: SPARK-24832 URL: https://issues.apache.org/jira/browse/SPARK-24832 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.3.1 Reporter: yucai Improve inputMetrics's bytesRead update for ColumnarBatch -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
[ https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24556: -- Description: Currently, ReusedExchange would rewrite output partitioning if child's partitioning is HashPartitioning, but it does not do the same when child's partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, see: {code:java} val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j") val df1 = df.as("t1") val df2 = df.as("t2") val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right") t.cache.orderBy($"t2.j").explain {code} Before fix: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200) +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder... +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as... : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) {code} Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)", like: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder... 
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) {code} was: Currently, ReusedExchange would rewrite output partitioning if child's partitioning is HashPartitioning, but it does not do the same when child's partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, see: {code} val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j") val df1 = df.as("t1") val df2 = df.as("t2") val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right") t.cache.orderBy($"t2.j").explain {code} Before fix: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200) +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder... +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) {code} Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)", like: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder... 
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) {code} > ReusedExchange should rewrite output partitioning also when child's > partitioning is RangePartitioning > - > > Key: SPARK-24556 > URL: https://issues.apache.org/jira/browse/SPARK-24556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: yucai >Priority: Major > > Currently, ReusedExchange would rewrite output partitioning if
[jira] [Updated] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
[ https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24556: -- Description: Currently, ReusedExchange would rewrite output partitioning if child's partitioning is HashPartitioning, but it does not do the same when child's partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, see: {code} val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j") val df1 = df.as("t1") val df2 = df.as("t2") val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right") t.cache.orderBy($"t2.j").explain {code} Before fix: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200) +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder... +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) {code} Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)", like: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder... 
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) {code} was: Currently, ReusedExchange would rewrite output partitioning if child's partitioning is HashPartitioning, but it does not do the same when child's partitioning is RangePartitioning, sometimes, it could introduce extra shuffle, see: {code:scala} val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j") val df1 = df.as("t1") val df2 = df.as("t2") val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right") t.cache.orderBy($"t2.j").explain {code} Before fix: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200) +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) ,None) +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], 
Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) {code} Better plan should avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)", like: {code:sql} == Physical Plan == *(1) Sort [j#14 ASC NULLS FIRST], true, 0 +- InMemoryTableScan [i#5, j#6, i#13, j#14] +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) :+- LocalTableScan [i#5, j#6] +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0 +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200) ,None) +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
[jira] [Created] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
yucai created SPARK-24556:
-------------------------

             Summary: ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
                 Key: SPARK-24556
                 URL: https://issues.apache.org/jira/browse/SPARK-24556
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.2
            Reporter: yucai


Currently, ReusedExchange rewrites its output partitioning when the child's partitioning is HashPartitioning, but it does not do the same when the child's partitioning is RangePartitioning. This can introduce an extra shuffle. For example:

{code:scala}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}

Before fix:

{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
         +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
:     +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:        +- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
,None)
               +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
                  :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
                  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
                  :     +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
                  :        +- LocalTableScan [i#5, j#6]
                  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
                     +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

A better plan would avoid the extra "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)":

{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
      +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
:     +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:        +- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
,None)
            +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
               :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
               :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
               :     +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
               :        +- LocalTableScan [i#5, j#6]
               +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
                  +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
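The rewrite the issue asks for can be sketched with a toy model. This is a minimal standalone Scala illustration, not Spark's actual catalyst classes: `Attr`, `HashPartitioning`, and `RangePartitioning` here are simplified stand-ins, and `rewrite` only models the ExprId remapping that ReusedExchange would need to apply to a RangePartitioning the same way it already does for HashPartitioning.

```scala
// Toy model of ReusedExchange's outputPartitioning rewrite (not Spark code).
// A reused exchange exposes the same data under fresh attribute ids, so its
// partitioning must be remapped from the original ids to the new ones.
sealed trait Partitioning
case class Attr(name: String, exprId: Long)
case class HashPartitioning(exprs: Seq[Attr], numPartitions: Int) extends Partitioning
case class RangePartitioning(ordering: Seq[Attr], numPartitions: Int) extends Partitioning

def rewrite(p: Partitioning, idMap: Map[Long, Long]): Partitioning = {
  def remap(a: Attr) = a.copy(exprId = idMap.getOrElse(a.exprId, a.exprId))
  p match {
    case h: HashPartitioning  => h.copy(exprs = h.exprs.map(remap))        // handled today
    case r: RangePartitioning => r.copy(ordering = r.ordering.map(remap))  // proposed extension
  }
}

// j#6 from the original exchange becomes j#14 in the reused exchange's output,
// so downstream operators recognize the range ordering and skip a new shuffle.
val rewritten = rewrite(RangePartitioning(Seq(Attr("j", 6)), 200), Map(6L -> 14L))
println(rewritten)
```

With the RangePartitioning case missing (as in Spark 2.3.2), the reused side reports an unrecognized partitioning and the planner inserts the redundant `Exchange rangepartitioning(j#14 ...)` seen above.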
[jira] [Updated] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number
[ https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai updated SPARK-24343:
--------------------------
    Description: 
When shuffle.partition > bucket number, Spark shuffles the bucketed table as per shuffle.partition; can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev USING PARQUET AS SELECT ws_item_sk, i_item_sk FROM web_sales_bucketed JOIN item ON ws_item_sk = i_item_sk;{code}
Suppose web_sales_bucketed is bucketed into 4 buckets and shuffle.partition=10.

Currently, both tables are shuffled into 10 partitions:
{code:java}
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
   :- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(ws_item_sk#2, 10)
   :     +- *(1) Project [ws_item_sk#2]
   :        +- *(1) Filter isnotnull(ws_item_sk#2)
   :           +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: true, Format: Parquet, Location:...
   +- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(i_item_sk#6, 10)
         +- *(3) Project [i_item_sk#6]
            +- *(3) Filter isnotnull(i_item_sk#6)
               +- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: Parquet, Location:...
{code}
A better plan would avoid the shuffle of the bucketed table:
{code:java}
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
   :- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
   :  +- *(1) Project [ws_item_sk#2]
   :     +- *(1) Filter isnotnull(ws_item_sk#2)
   :        +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: true, Format: Parquet, Location:...
   +- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(i_item_sk#6, 4)
         +- *(2) Project [i_item_sk#6]
            +- *(2) Filter isnotnull(i_item_sk#6)
               +- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: Parquet, Location:...
{code}
This problem could be worse with adaptive execution enabled, because it usually prefers a large shuffle.partition.

> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24343
>                 URL: https://issues.apache.org/jira/browse/SPARK-24343
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: yucai
>            Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucketed
> table as per shuffle.partition; can we avoid this?
> See below example:
>
[jira] [Created] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number
yucai created SPARK-24343:
-------------------------

             Summary: Avoid shuffle for the bucketed table when shuffle.partition > bucket number
                 Key: SPARK-24343
                 URL: https://issues.apache.org/jira/browse/SPARK-24343
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: yucai


When shuffle.partition > bucket number, Spark shuffles the bucketed table as per shuffle.partition; can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev USING PARQUET AS SELECT ws_item_sk, i_item_sk FROM web_sales_bucketed JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets, and shuffle.partition=10.

Currently, both tables are shuffled into 10 partitions:
{code:java}
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
   :- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(ws_item_sk#2, 10)
   :     +- *(1) Project [ws_item_sk#2]
   :        +- *(1) Filter isnotnull(ws_item_sk#2)
   :           +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: true, Format: Parquet, Location:...
   +- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(i_item_sk#6, 10)
         +- *(3) Project [i_item_sk#6]
            +- *(3) Filter isnotnull(i_item_sk#6)
               +- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: Parquet, Location:...
{code}
A better plan would avoid the shuffle of the bucketed table:
{code:java}
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
   :- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
   :  +- *(1) Project [ws_item_sk#2]
   :     +- *(1) Filter isnotnull(ws_item_sk#2)
   :        +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: true, Format: Parquet, Location:...
   +- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(i_item_sk#6, 4)
         +- *(2) Project [i_item_sk#6]
            +- *(2) Filter isnotnull(i_item_sk#6)
               +- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: Parquet, Location:...
{code}
This problem could be worse with adaptive execution enabled, because it usually prefers a large shuffle.partition.
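The planning decision the issue asks for can be sketched as a toy model. This is a standalone Scala illustration, not Spark's EnsureRequirements: `Side` and `targetPartitions` are hypothetical names, and the sketch only captures the rule "if one join side is already hash-partitioned by the join key into b buckets, shuffle only the other side into b partitions instead of re-shuffling both to spark.sql.shuffle.partitions".

```scala
// Toy model of choosing the join's partition count (not Spark code).
// Some(n) means the side is bucketed by the join key into n buckets.
case class Side(name: String, numPartitions: Option[Int])

def targetPartitions(left: Side, right: Side, shufflePartitions: Int): Int =
  (left.numPartitions, right.numPartitions) match {
    case (Some(n), None) => n  // reuse the bucketed layout; shuffle only `right` into n
    case (None, Some(n)) => n  // symmetric case
    case _               => shufflePartitions  // neither side helps; shuffle both
  }

// With web_sales_bucketed in 4 buckets and shuffle.partition=10, only `item`
// would be exchanged, into hashpartitioning(i_item_sk, 4) as in the better plan.
val n = targetPartitions(Side("web_sales_bucketed", Some(4)), Side("item", None), 10)
println(n)
```

The trade-off is that the join then runs with only 4 tasks, so a real implementation would also have to weigh bucket count against desired parallelism.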
[jira] [Created] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys
yucai created SPARK-24087:
-------------------------

             Summary: Avoid shuffle when join keys are a super-set of bucket keys
                 Key: SPARK-24087
                 URL: https://issues.apache.org/jira/browse/SPARK-24087
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: yucai
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451743#comment-16451743 ]

yucai commented on SPARK-24076:
-------------------------------

1. When shuffle.partition = 8192, the hashes of tuples in the same partition satisfy: hash(tuple x) = hash(tuple y) + n * 8192.

2. In the next HashAggregate stage, tuples from the same partition are put into a 16K BytesToBytesMap (unsafeRowAggBuffer). Because HashAggregate uses the same hash algorithm and seed as the shuffle, all tuples of one partition are actually hashed into only 2 distinct slots. That is why the hash conflict happens.

> very bad performance when shuffle.partition = 8192
> --------------------------------------------------
>
>                 Key: SPARK-24076
>                 URL: https://issues.apache.org/jira/browse/SPARK-24076
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: yucai
>            Priority: Major
>         Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png
>
> We see very bad performance when shuffle.partition = 8192 in some cases.
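The arithmetic in that comment can be demonstrated directly. This is a standalone Scala sketch, not Spark code; it only models the modulo relationship between an 8192-partition shuffle and a 16384-slot (16K) map that indexes by the same hash values.

```scala
// All tuples of one shuffle partition share hash % 8192 == partitionId,
// i.e. their hashes are partitionId + n * 8192 for various n.
val numPartitions = 8192
val mapCapacity   = 16384        // the 16K BytesToBytesMap
val partitionId   = 42           // any fixed partition

// Sample hashes of tuples that land in this partition.
val hashes = (0 until 100).map(n => partitionId + n * numPartitions)

// Slot in the aggregate map, reusing the same hash value.
val slots = hashes.map(_ % mapCapacity).toSet
println(slots)  // Set(42, 8234): only partitionId and partitionId + 8192
```

Because 16384 = 2 * 8192, each partition's tuples can only ever occupy 2 of the 16384 slots, so nearly every insert collides. A non-power-of-two setting like 8000 breaks the alignment, which matches the p1.png/p2.png comparison attached to the issue.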
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451728#comment-16451728 ]

yucai commented on SPARK-24076:
-------------------------------

Root cause: severe hash conflicts in HashAggregate.

!image-2018-04-25-14-29-39-958.png!

> very bad performance when shuffle.partition = 8192
> --------------------------------------------------
>
>                 Key: SPARK-24076
>                 URL: https://issues.apache.org/jira/browse/SPARK-24076
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: yucai
>            Priority: Major
>         Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png
>
> We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai updated SPARK-24076:
--------------------------
    Attachment: image-2018-04-25-14-29-39-958.png

> very bad performance when shuffle.partition = 8192
> --------------------------------------------------
>
>                 Key: SPARK-24076
>                 URL: https://issues.apache.org/jira/browse/SPARK-24076
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: yucai
>            Priority: Major
>         Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png
>
> We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451727#comment-16451727 ]

yucai commented on SPARK-24076:
-------------------------------

The query example:
{code:sql}
insert overwrite table target_xxx
SELECT item_id, auct_end_dt
FROM (
  select cast(item_id as double) as item_id, auct_end_dt
  from source_xxx
  GROUP BY item_id, auct_end_dt
) t
{code}

> very bad performance when shuffle.partition = 8192
> --------------------------------------------------
>
>                 Key: SPARK-24076
>                 URL: https://issues.apache.org/jira/browse/SPARK-24076
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: yucai
>            Priority: Major
>         Attachments: p1.png, p2.png
>
> We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451540#comment-16451540 ]

yucai commented on SPARK-24076:
-------------------------------

shuffle.partition = 8192
!p1.png!

shuffle.partition = 8000
!p2.png!

> very bad performance when shuffle.partition = 8192
> --------------------------------------------------
>
>                 Key: SPARK-24076
>                 URL: https://issues.apache.org/jira/browse/SPARK-24076
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: yucai
>            Priority: Major
>         Attachments: p1.png, p2.png
>
> We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yucai updated SPARK-24076:
--------------------------
    Attachment: p2.png
                p1.png

> very bad performance when shuffle.partition = 8192
> --------------------------------------------------
>
>                 Key: SPARK-24076
>                 URL: https://issues.apache.org/jira/browse/SPARK-24076
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: yucai
>            Priority: Major
>         Attachments: p1.png, p2.png
>
> We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Created] (SPARK-24076) very bad performance when shuffle.partition = 8192
yucai created SPARK-24076:
-------------------------

             Summary: very bad performance when shuffle.partition = 8192
                 Key: SPARK-24076
                 URL: https://issues.apache.org/jira/browse/SPARK-24076
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: yucai


We see very bad performance when shuffle.partition = 8192 in some cases.