Baohe Zhang created SPARK-35010:
-----------------------------------

             Summary: nestedSchemaPruning causes issues when reading Hive-generated ORC files
                 Key: SPARK-35010
                 URL: https://issues.apache.org/jira/browse/SPARK-35010
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0, 3.0.0
            Reporter: Baohe Zhang


In Spark 3, we have spark.sql.orc.impl=native and 
spark.sql.optimizer.nestedSchemaPruning.enabled=true as the default settings, 
and this combination causes failures when querying struct fields of 
Hive-generated ORC files.

For example, we get the following error when running this query in Spark 3:
{code:java}
spark.table("testtable").filter(col("utc_date") === 
"20210122").select(col("open_count.d35")).show(false)
{code}
The error is
{code:java}
Caused by: java.lang.AssertionError: assertion failed: The given data schema struct<open_count:struct<d35:map<string,double>>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
  at scala.Predef$.assert(Predef.scala:223)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:153)
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
{code}
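For reference, the table has roughly the shape sketched below. This is a hypothetical DDL: names other than open_count.d35 and utc_date are made up, and the real table has 48 columns. Note also that hitting the bug requires the ORC files to have been written by Hive 1.2, whose writer omits field names from the physical schema, so a table written by Spark itself may not reproduce the failure.
{code:java}
// Hypothetical sketch of the table shape (run in spark-shell with Hive
// support); the real table has 48 columns and its ORC files were written
// by Hive 1.2.
spark.sql("""
  CREATE TABLE testtable (
    open_count STRUCT<d35: MAP<STRING, DOUBLE>>,
    extra_col STRING)
  PARTITIONED BY (utc_date STRING)
  STORED AS ORC
""")
{code}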
 

I think the reason is that we apply nestedSchemaPruning to the dataSchema here: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L75]


This nestedSchemaPruning not only prunes the unused fields of the struct, it 
also prunes the unused top-level columns. In my test, the dataSchema originally 
has 48 columns, but after nested schema pruning it is pruned to 1 column. 
This pruning results in an assertion error in 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L159]
 because column pruning is not supported for Hive-generated ORC files.
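To illustrate with a small stand-in schema (3 columns instead of the real 48; all names are hypothetical):
{code:java}
import org.apache.spark.sql.types._

// Stand-in for the full 48-column dataSchema of the table.
val dataSchema = StructType(Seq(
  StructField("open_count", StructType(Seq(
    StructField("d35", MapType(StringType, DoubleType)),
    StructField("d7", MapType(StringType, DoubleType))))),
  StructField("click_count", LongType),
  StructField("user_id", StringType)))

// After nested schema pruning for select(col("open_count.d35")), only the
// requested nested field survives, and the sibling top-level columns are
// dropped entirely:
val prunedSchema = StructType(Seq(
  StructField("open_count", StructType(Seq(
    StructField("d35", MapType(StringType, DoubleType)))))))

// OrcUtils.requestedColumnIds then sees a 1-field data schema against a
// 3-field physical ORC schema that carries no field names (Hive 1.2 writes
// columns as _col0, _col1, ...), and fails the assertion quoted above.
{code}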

This issue also seems related to the Hive version: we use Hive 1.2, whose ORC 
writer does not record field names in the physical schema.
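
Until this is fixed, either of the following session settings should avoid the failing code path, since each disables one of the two defaults that trigger it (a workaround sketch; both change reader behavior, so verify they are acceptable for the workload):
{code:java}
// Workaround 1: turn off nested schema pruning so the dataSchema keeps
// all columns.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")

// Workaround 2: fall back to the hive ORC reader instead of the native one.
spark.conf.set("spark.sql.orc.impl", "hive")
{code}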


