SaurabhChawla100 edited a comment on pull request #29045:
URL: https://github.com/apache/spark/pull/29045#issuecomment-657978027


   > Can you be more specific about the problem? Are you saying that the actual file schema doesn't match the table schema specified by the user?
   
   For ORC data written by Hive, the physical schema has no real field names 
(only placeholders such as `_col0`). Please see the code below for reference.
   
https://github.com/apache/spark/blob/24be81689cee76e03cd5136dfd089123bbff4595/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L133
   
   So this code resolves each requested column to its index in the dataSchema 
and passes that index on.
   
   In the code below, however, we pass in the result schema, and that result 
schema does not have a field at the index passed from OrcUtils.scala.
   
https://github.com/apache/spark/blob/24be81689cee76e03cd5136dfd089123bbff4595/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L211
   
   For example - 
   
   ```
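   // Select a single column from a Hive-created ORC table
   val u = """select date_dim.d_year from date_dim limit 5"""
   
   // Collecting the result triggers the ORC read
   spark.sql(u).collect()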
   ```
   
   Here the index of d_year returned by OrcUtils.scala#L133 is 6,
   
   whereas the resultSchema passed in OrcFileFormat.scala#L211 has only one 
field: struct<`d_year`:int>
   
   So using index 6 on the resultSchema, which has size 1, throws the 
following exception:
   
   ```
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver): java.lang.ArrayIndexOutOfBoundsException: 6
       at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
       at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
   ```
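
To make the mismatch concrete, here is a minimal standalone sketch in plain Scala with no Spark dependency. It is only an illustration under assumptions: the column list is an assumed layout for date_dim in which d_year sits at position 6, and `positionalIds` and `lookup` are hypothetical stand-ins for the index resolution and the reader-side lookup, not the actual Spark code.

```scala
object OrcIndexMismatchSketch {
  // ASSUMPTION: leading columns of date_dim, chosen so that d_year is at index 6.
  val dataSchema: Vector[String] = Vector(
    "d_date_sk", "d_date_id", "d_date", "d_month_seq",
    "d_week_seq", "d_quarter_seq", "d_year")

  // Pruned schema for `select date_dim.d_year from date_dim`.
  val resultSchema: Vector[String] = Vector("d_year")

  // Hypothetical stand-in for positional resolution: when the file has no
  // usable field names, each required column is mapped to its position in
  // the full data schema.
  def positionalIds(required: Seq[String], data: Seq[String]): Seq[Int] =
    required.map(name => data.indexOf(name))

  // Hypothetical stand-in for the reader: looks up each id in whatever schema
  // it was handed. Returns Left with the exception class name on failure.
  def lookup(ids: Seq[Int], schema: Vector[String]): Either[String, Seq[String]] =
    scala.util.Try(ids.map(i => schema(i)))
      .toEither
      .left.map(_.getClass.getSimpleName)

  def main(args: Array[String]): Unit = {
    val ids = positionalIds(resultSchema, dataSchema)
    println(s"ids resolved against dataSchema: $ids")
    // Index 6 is valid for the 7-column data schema ...
    println(s"lookup in dataSchema:   ${lookup(ids, dataSchema)}")
    // ... but out of range for the 1-column result schema.
    println(s"lookup in resultSchema: ${lookup(ids, resultSchema)}")
  }
}
```

The point of the sketch is that the ids and the schema they are applied to must come from the same schema: ids computed against the dataSchema are only safe to use with the dataSchema.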




