[jira] [Commented] (SPARK-17601) SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long
[ https://issues.apache.org/jira/browse/SPARK-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514019#comment-15514019 ]

Gang Wu commented on SPARK-17601:
----------------------------------

[~hyukjin.kwon] Yes, I agree. I just created these JIRAs for issues we ran into in production; there can definitely be more issues for ORC, Parquet, etc. Schema evolution is always painful to tackle. It seems you are working on this area. Would you mind sharing a bit more about your plan there? I'd like to know. Thanks!

> SparkSQL vectorization cannot handle schema evolution for parquet tables when
> parquet files use Int whereas DataFrame uses Long
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17601
>                 URL: https://issues.apache.org/jira/browse/SPARK-17601
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Gang Wu
>
> This is a JIRA related to SPARK-17477.
> Consider using SparkSession to read a Hive table that is stored as parquet
> files, where a column's type has evolved from int to long: some old parquet
> files use int for the column while newer ones use long, and in the Hive
> metastore the type is long (bigint). If vectorization is enabled in SparkSQL,
> we get the following exception:
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
> at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
> at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
> at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
> at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
> at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
> at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
> ... 48 elided
> Caused by: java.lang.NullPointerException
> at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPla
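A minimal reproduction sketch of the scenario described above, assuming Spark 2.0.x with Hive support enabled; the table name evolve_t and column name id are hypothetical, not taken from the report:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical names throughout; this sketches the reported scenario and
// is not a confirmed test case from the ticket.
val spark = SparkSession.builder()
  .appName("SPARK-17601-repro")
  .enableHiveSupport()
  .getOrCreate()

// 1. Old data: a parquet file whose physical type for `id` is INT32.
spark.sql("CREATE TABLE evolve_t (id INT) STORED AS PARQUET")
spark.sql("INSERT INTO evolve_t VALUES (1)")

// 2. Schema evolution: the metastore type becomes BIGINT while the old
//    file on disk still stores INT32. (If Spark rejects this DDL, the
//    same type change can be made from the Hive CLI instead.)
spark.sql("ALTER TABLE evolve_t CHANGE id id BIGINT")
spark.sql("INSERT INTO evolve_t VALUES (9999999999)") // new file: INT64

// 3. With the vectorized parquet reader enabled (the 2.0 default),
//    scanning the old file reportedly fails with the NullPointerException
//    quoted above.
spark.table("evolve_t").show()
{code}

A possible mitigation while the bug stands, at the cost of scan speed, is to fall back to the non-vectorized parquet reader with spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false").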
[jira] [Commented] (SPARK-17601) SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long
[ https://issues.apache.org/jira/browse/SPARK-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505066#comment-15505066 ]

Hyukjin Kwon commented on SPARK-17601:
---------------------------------------

We might have to avoid opening multiple related issues. I mean, we definitely wouldn't want a JIRA for every type handled in ORC, Parquet, and the Parquet vectorized reader. BTW, you might want to convert this into a sub-task of SPARK-16518.
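For what it's worth, the NPE at OnHeapColumnVector.getLong is consistent with a column vector allocated for the file's physical type (int) being read through the table's logical type (long). A schematic illustration of that layout assumption, not Spark's actual OnHeapColumnVector code:

{code:scala}
// Schematic only -- not the real org.apache.spark.sql.execution.vectorized code.
class ToyColumnVector(capacity: Int, isLong: Boolean) {
  // Only the array matching the vector's declared type is allocated.
  val intData: Array[Int]   = if (!isLong) new Array[Int](capacity) else null
  val longData: Array[Long] = if (isLong) new Array[Long](capacity) else null

  def getInt(rowId: Int): Int   = intData(rowId)  // fine for int vectors
  def getLong(rowId: Int): Long = longData(rowId) // NPE for int vectors
}

// Old file => vector allocated as int; the table schema says bigint, so the
// generated iterator calls getLong and dereferences the null longData.
val v = new ToyColumnVector(4096, isLong = false)
// v.getLong(0) // throws java.lang.NullPointerException
{code}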