[ 
https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34416.
-----------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 31543
[https://github.com/apache/spark/pull/31543]

> Support avroSchemaUrl in addition to avroSchema
> -----------------------------------------------
>
>                 Key: SPARK-34416
>                 URL: https://issues.apache.org/jira/browse/SPARK-34416
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Ohad Raviv
>            Priority: Minor
>             Fix For: 3.2.0
>
>
> We have a use case in which we read a huge table in Avro format, with about
> 30k columns.
> Using the default Hive reader, `AvroGenericRecordReader`, the job just hangs
> forever: after 4 hours, not even one task has finished.
> We tried instead to use
> `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:
> ```
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
> ..
>  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
>  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
>  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
>  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
>  ... 53 elided
> ```
>  
> because the files' schemas contain duplicate column names (when compared
> case-insensitively).
> So we wanted to provide a user schema with non-duplicated fields, but the
> schema is huge, a few MBs, so it is not practical to provide it in JSON format.
>  
> So we patched spark-avro to also accept `avroSchemaUrl` in addition
> to `avroSchema`, and it worked perfectly.
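
For illustration only, here is a minimal Python sketch of the two steps the
description above implies: spotting field names that collide when case is
ignored, and reading with the schema taken from a URL. It assumes Spark 3.2.0
or later with the spark-avro module on the classpath and the built-in `avro`
format name. The helper names, the sample schema, and any paths passed in are
hypothetical and are not taken from the pull request.

```python
import json
from collections import defaultdict


def find_case_insensitive_duplicates(avro_schema_json):
    """Return groups of top-level field names that collide when case is ignored.

    Only the top-level fields of a record schema are inspected in this sketch.
    """
    schema = json.loads(avro_schema_json)
    groups = defaultdict(list)
    for field in schema.get("fields", []):
        # Group the original names under their lower-cased form.
        groups[field["name"].lower()].append(field["name"])
    return [names for names in groups.values() if len(names) > 1]


def read_avro_with_schema_url(spark, data_path, schema_url):
    """Read Avro files using a schema fetched from `schema_url`.

    `spark` is assumed to be a SparkSession. The `avroSchemaUrl` option is
    the one this issue adds, so this needs Spark 3.2.0 or later.
    """
    return (
        spark.read.format("avro")
        .option("avroSchemaUrl", schema_url)
        .load(data_path)
    )


if __name__ == "__main__":
    # Hypothetical three-field schema in which "id" and "ID" collide.
    sample = json.dumps({
        "type": "record",
        "name": "Row",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "ID", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    })
    print(find_case_insensitive_duplicates(sample))  # [['id', 'ID']]
```

A de-duplicated copy of the schema could then be stored as an `.avsc` file and
its location passed as `schema_url`, instead of embedding several MBs of JSON
in the `avroSchema` option.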



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
