[jira] [Assigned] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema

2021-02-14 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-34416:
-

Assignee: Ohad Raviv

> Support avroSchemaUrl in addition to avroSchema
> ---
>
> Key: SPARK-34416
> URL: https://issues.apache.org/jira/browse/SPARK-34416
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Ohad Raviv
> Assignee: Ohad Raviv
> Priority: Minor
> Fix For: 3.2.0
>
>
> We have a use case in which we read a huge table in Avro format, with about
> 30k columns.
> With the default Hive reader, `AvroGenericRecordReader`, the job just hangs
> forever: after 4 hours, not even one task has finished.
> We tried instead to use
> `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:
> ```
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
> ..
>  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
>  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
>  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
>  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
>  ... 53 elided
> ```
>  
> The failure occurs because the files' schema contains column names that are
> duplicates when compared case-insensitively.
> So we wanted to provide a user schema with non-duplicated fields, but the
> schema is huge (a few MB), so it is not practical to pass it inline as a
> JSON string.
>  
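
To make the duplicate-column problem concrete, here is a minimal sketch in plain Python (no Spark required) that finds top-level Avro field names colliding under case-insensitive comparison and builds a schema without them. The function names, the three-field example schema, and the "keep the first field of each colliding group" policy are all assumptions for illustration; the issue does not say how the reporter built the de-duplicated schema.

```python
import json
from collections import defaultdict


def find_case_insensitive_duplicates(schema: dict) -> dict:
    """Group top-level field names that collide when lower-cased."""
    groups = defaultdict(list)
    for field in schema.get("fields", []):
        groups[field["name"].lower()].append(field["name"])
    # Keep only the groups that contain more than one original name.
    return {key: names for key, names in groups.items() if len(names) > 1}


def dedupe_fields(schema: dict) -> dict:
    """Return a copy of the schema that keeps the first field of each colliding group.

    Assumption: dropping the later duplicates is acceptable for the reader.
    """
    seen = set()
    kept = []
    for field in schema.get("fields", []):
        key = field["name"].lower()
        if key not in seen:
            seen.add(key)
            kept.append(field)
    return {**schema, "fields": kept}


# Hypothetical three-field schema standing in for the real ~30k-column one.
example_schema = json.loads("""
{
  "type": "record",
  "name": "Example",
  "fields": [
    {"name": "userId", "type": "string"},
    {"name": "UserID", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
""")

duplicates = find_case_insensitive_duplicates(example_schema)
deduped = dedupe_fields(example_schema)
```

Here `duplicates` maps the lower-cased name `userid` to the two original spellings, and `deduped` keeps `userId` and `amount`. Renaming the colliding fields instead of dropping them would be an equally plausible policy.
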
> So we patched spark-avro to also accept an `avroSchemaUrl` option in addition
> to `avroSchema`, and it worked perfectly.
>  
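
The idea behind the patch, taking the schema either inline or from a location, can be sketched in plain Python as follows. This is not the spark-avro implementation: `resolve_avro_schema` is a hypothetical helper, and letting the inline `avroSchema` win when both options are present is an assumption. Only the two option names come from the issue.

```python
import json
import tempfile
from pathlib import Path
from urllib.request import urlopen


def resolve_avro_schema(options: dict):
    """Return the parsed schema from `avroSchema`, else from `avroSchemaUrl`, else None.

    Assumption: the inline option takes precedence when both are given.
    """
    if "avroSchema" in options:
        return json.loads(options["avroSchema"])
    if "avroSchemaUrl" in options:
        # The URL may point at a multi-megabyte schema file, so only the
        # short URL has to travel through the reader options.
        with urlopen(options["avroSchemaUrl"]) as stream:
            return json.loads(stream.read().decode("utf-8"))
    return None


# Tiny hypothetical schema, written to a temporary file and loaded by file:// URL.
schema = {
    "type": "record",
    "name": "Example",
    "fields": [{"name": "userId", "type": "string"}],
}
with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "example.avsc"
    schema_path.write_text(json.dumps(schema), encoding="utf-8")
    from_url = resolve_avro_schema({"avroSchemaUrl": schema_path.as_uri()})

inline = resolve_avro_schema({"avroSchema": json.dumps(schema)})
```

Both `from_url` and `inline` hold the same parsed schema. The difference is only in what the caller has to pass: a short URL instead of a JSON string of a few MB.
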



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema

2021-02-10 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34416:


Assignee: (was: Apache Spark)

> Support avroSchemaUrl in addition to avroSchema
> ---
>
> Key: SPARK-34416
> URL: https://issues.apache.org/jira/browse/SPARK-34416
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0, 2.4.0, 3.2.0
> Reporter: Ohad Raviv
> Priority: Minor
>






[jira] [Assigned] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema

2021-02-10 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34416:


Assignee: Apache Spark

> Support avroSchemaUrl in addition to avroSchema
> ---
>
> Key: SPARK-34416
> URL: https://issues.apache.org/jira/browse/SPARK-34416
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0, 2.4.0, 3.2.0
> Reporter: Ohad Raviv
> Assignee: Apache Spark
> Priority: Minor
>


