umehrot2 commented on issue #2180:
URL: https://github.com/apache/hudi/issues/2180#issuecomment-709480582


   This is strange. The exception appears to be happening at 
https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala#L138
 while doing `fileStatuses.toArray`.
   
   Now it is a `java.lang.ArrayStoreException`, which indicates that it is possibly trying to store the wrong type of object, `org.apache.spark.sql.execution.datasources.SerializableFileStatus`, in an array of `FileStatus`. That would mean Spark itself is returning `Seq[SerializableFileStatus]` instead of `Seq[FileStatus]`, which is not possible with open-source Spark.
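
   To make that failure mode concrete, here is a minimal, self-contained Scala sketch of the mechanism. `OtherStatus` is just a hypothetical stand-in for Spark's `SerializableFileStatus`; the point is that `toArray` on a `Seq[FileStatus]` allocates an `Array[FileStatus]` from the implicit `ClassTag`, and the JVM's array-store check rejects any element that is not actually a `FileStatus`:

   ```scala
   import org.apache.hadoop.fs.FileStatus

   // Hypothetical stand-in for a status class that does NOT extend FileStatus,
   // playing the role of Spark's SerializableFileStatus.
   class OtherStatus

   object ArrayStoreDemo extends App {
     // Type erasure lets a sequence of the wrong element type masquerade as Seq[FileStatus].
     val statuses: Seq[FileStatus] = Seq(new OtherStatus).asInstanceOf[Seq[FileStatus]]

     // toArray allocates an Array[FileStatus] (via the implicit ClassTag[FileStatus]) and
     // copies the elements into it; the JVM's array-store check fails because the runtime
     // element type is OtherStatus, not FileStatus.
     statuses.toArray // throws java.lang.ArrayStoreException
   }
   ```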
   
   In open-source Spark, `SerializableFileStatus` is always converted back to `FileStatus` before the listing is returned, and that's the contract: https://github.com/apache/spark/blob/v2.4.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L246 . `SerializableFileStatus` is private to that class. So my guess is that the Databricks Spark implementation differs here and is possibly returning `Seq[SerializableFileStatus]`, which is why this is happening. Not sure whether this is something we want to consider fixing, and how. @bvaradar @garyli1019 
@vinothchandar 
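
   For reference, the sketch below is only a rough approximation of what the linked `InMemoryFileIndex` code does. `SerializableFileStatus`/`SerializableBlockLocation` are private to that class, so the field names here are assumptions; the idea is simply that the serializable wrappers are rebuilt into Hadoop `FileStatus` objects before the listing is handed back to callers:

   ```scala
   import org.apache.hadoop.fs.{BlockLocation, FileStatus, LocatedFileStatus, Path}

   object FileStatusConversion {
     // Hypothetical stand-ins for Spark's private wrapper classes; field names are assumed.
     case class SerializableBlockLocation(names: Array[String], hosts: Array[String],
                                          offset: Long, length: Long)
     case class SerializableFileStatus(path: String, length: Long, isDir: Boolean,
                                       blockReplication: Int, blockSize: Long,
                                       modificationTime: Long,
                                       blockLocations: Array[SerializableBlockLocation])

     // Rebuild a real Hadoop FileStatus from the serializable wrapper, which is the
     // conversion open-source Spark performs before returning the listing.
     def toFileStatus(f: SerializableFileStatus): FileStatus = {
       val locations = f.blockLocations.map(l => new BlockLocation(l.names, l.hosts, l.offset, l.length))
       val status = new FileStatus(f.length, f.isDir, f.blockReplication, f.blockSize,
                                   f.modificationTime, new Path(f.path))
       new LocatedFileStatus(status, locations)
     }
   }
   ```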
   

