[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2021-03-17 Thread Yu Xiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303213#comment-17303213
 ] 

Yu Xiang commented on SPARK-24924:
--

[~tgraves], [~Gengliang.Wang] [~dongjoon], 

Hi, I am struggling with the "Spark Multiple sources found for " issue. Is it a 
bug, or is it a problem with my Spark versions? 

I have a Java program in which I call Spark's textFile function. It works 
well locally when I run the program from the IDE. However, when I use 
`spark-submit` with the jar file, it fails with "Spark Multiple sources 
found for text". Even when I specify the default format 
"org.apache.spark.sql.execution.datasources.text.TextFileFormat", the error 
still occurs in `spark-submit` mode. 

 

The detailed description of the problem is here: 
[https://stackoverflow.com/questions/4181/spark-multiple-sources-found-for-text]

 

Could you help take a look? Thank you.

 

 

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-17 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583868#comment-16583868
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

Hi, all.
I created SPARK-25143 as a more general and sustainable approach for 
CSV/ORC/Avro. Hopefully, we can remove our internal mappings for 
`com.databricks.spark.*` without any problem in Spark 3. Since SPARK-25143 is a 
general configuration, we can remove those in Spark 2.4 if we want.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-17 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583483#comment-16583483
 ] 

Gengliang Wang commented on SPARK-24924:


[~dongjoon] I see. I am now +1 on adding a new configuration.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-16 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582948#comment-16582948
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

[~Gengliang.Wang], er, the latest consensus isn't to remove the mapping. With a 
configuration, we can maximize the benefit to users, especially for 
Spark's datasource tables.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-15 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581929#comment-16581929
 ] 

Gengliang Wang commented on SPARK-24924:


Since the package "org.apache.spark.sql.avro" is an external module and is not 
loaded by default, we should not prevent users from using 
"com.databricks.spark.avro".

+1 on removing the mapping. I will create a PR for it.

 

 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-15 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581608#comment-16581608
 ] 

Thomas Graves commented on SPARK-24924:
---

I'd be OK with that, but CSV has been that way for a long time already, so I 
don't think it's required. I would vote for not doing that; if someone wants 
it, do it under a separate Jira. I want to see the config for Avro go in 
before 2.4 is released, for compatibility reasons.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-15 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581546#comment-16581546
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

[~tgraves], in that case, for consistency, we should add two configurations, 
one for Avro and one for CSV. Shall we discuss that in a new minor 
improvement Jira issue?




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-15 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581117#comment-16581117
 ] 

Thomas Graves commented on SPARK-24924:
---

[~cloud_fan] [~hyukjin.kwon] It seems no one else has a strong opinion on this.

Since there is precedent here with the CSV mapping, how about we just add a 
config to allow users to turn the mapping off? That would let them easily 
continue to use their own version if they want, but if they are using Hive 
tables and want those to work with the internal version, they can use the config.
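
To make the proposal concrete, here is a minimal sketch of a switchable mapping in plain Scala. It is not Spark's actual implementation: the object name `ProviderMapping`, the method `resolve`, and the `mappingEnabled` flag are hypothetical names chosen for illustration. Only the provider name `com.databricks.spark.avro` and the short name `avro` come from this thread.

```scala
// Hypothetical sketch of a configurable provider mapping (not Spark source code).
object ProviderMapping {
  // Third-party provider name that existing tables may have stored.
  val LegacyAvro = "com.databricks.spark.avro"
  // Short name of the built-in Avro data source, as used in this thread.
  val BuiltInAvro = "avro"

  /** Returns the provider name that would be looked up for a stored provider.
    *
    * With the flag on, the legacy name is redirected to the built-in source;
    * with it off, the stored name is used unchanged, so a user-supplied jar
    * providing the legacy package would be picked up instead.
    */
  def resolve(storedProvider: String, mappingEnabled: Boolean): String =
    if (mappingEnabled && storedProvider == LegacyAvro) BuiltInAvro
    else storedProvider
}
```

With the flag off, a table whose stored provider is `com.databricks.spark.avro` keeps resolving to whatever jar supplies that package. With it on, the same table is read by the built-in source without any change to its metadata.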

Do we have release notes or something documented for compatibility (I didn't 
see anything in sql-programming-guide)?

[~mridulm80] [~irashid] Tagging you as a couple of others who might use Avro, 
to see if you have input.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-08 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573221#comment-16573221
 ] 

Thomas Graves commented on SPARK-24924:
---

| There was a discussion about why we shouldn't support it: 
[https://github.com/apache/spark/pull/21841]

There is no discussion on that PR?  Assume you are referring to comment that 
points to by? It looks like we aren't supporting it because Python and R aren't 
going to be supported, correct? It may be fine for us not to support it 
internally; I'm not against that. I'm saying it is not a very good 
compatibility or upgrade story for users who want to switch from Databricks 
Avro to the internal Avro. We are adding this mapping so users can easily 
upgrade, and claiming it's functionally the same, but it's not really that easy, 
as they potentially have to change their code to not use 
spark.read/write.avro.

If we don't support spark.read/write.avro, I know that at least for my users I 
will create something so that it works for the 2.4 feature release, because I 
view that as an API incompatibility and they don't expect that in a feature 
release. I realize this is a 3rd-party library, though, so we may be able to 
get away with it, but that doesn't mean it's nice for our users.

 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572564#comment-16572564
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

[~cloud_fan], yeah, adding them as implicits does not sound like a good idea. 
But I think we can still add {{spark.read.avro}} in {{DataFrameReader}}, 
although it looks a bit weird since Avro is an external package. 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-07 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572556#comment-16572556
 ] 

Wenchen Fan commented on SPARK-24924:
-

>  I assume we could theoretically also support the spark.read.avro format as 
> well

There was a discussion about why we shouldn't support it: 
https://github.com/apache/spark/pull/21841

Users always need to do some manual work to use `spark.read.avro`, even with 
the databricks avro package. Now users can still define an implicit class to 
support `spark.read.avro` if they want to.
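
A minimal sketch of such an implicit class is shown below. It assumes Spark 2.4 or later with the external Avro module on the classpath, so that the `avro` format name resolves. The object name `AvroImplicits` and the wrapper class names are hypothetical; `format`, `load`, and `save` are the standard `DataFrameReader` and `DataFrameWriter` methods.

```scala
import org.apache.spark.sql.{DataFrame, DataFrameReader, DataFrameWriter}

// Hypothetical helper object; users would define it in their own code base.
object AvroImplicits {
  // Adds spark.read.avro(path) by delegating to the generic format/load API.
  implicit class AvroDataFrameReader(reader: DataFrameReader) {
    def avro(path: String): DataFrame = reader.format("avro").load(path)
  }

  // Adds df.write.avro(path) by delegating to the generic format/save API.
  implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) {
    def avro(path: String): Unit = writer.format("avro").save(path)
  }
}
```

After `import AvroImplicits._`, calls such as `spark.read.avro("/some/path")` compile again. That import is the manual step referred to above.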




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-07 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571908#comment-16571908
 ] 

Thomas Graves commented on SPARK-24924:
---

So originally, when I started on this, I didn't know about the side effects of 
the Hive table here.

So this isn't as straightforward as I originally thought. I still personally 
don't like remapping this, because users get something other than what they 
explicitly asked for. But if we want to keep this compatibility, we either have 
to do that or actually have a com.databricks.avro class that would just map 
into our internal Avro. That would give the benefit that they could eclipse it 
with their own jar if they wanted to keep using their custom version, and I 
assume we could theoretically also support the spark.read.avro format as well. 
Or, I guess, the third option is to just break compatibility and require 
users to change the table property, but then they can't read it with older 
versions of Spark. 

It also seems bad to me that we aren't supporting spark.read.avro, so it's an 
API compatibility issue. We magically help users with compatibility for their 
tables by mapping them, but we don't support the old API, and they have to 
update their code. This feels like an inconsistent story to me, and I'm not 
sure how that fits with our versioning policy, since it's a 3rd-party thing.

Not sure I like any of these options. Seems like these are the options:

1) Add the class com.databricks.avro into the Spark source that does the remap, 
and support spark.read/write.avro for a couple of releases for compatibility; 
then remove it and tell people to change the table property, or provide an API 
to do that. 

2) Make the mapping of com.databricks.avro => internal Avro configurable. That 
would allow users to continue to use their version of com.databricks.avro until 
they can update to the new API.

3) Do nothing: leave this as is with this Jira, and the user has to deal with 
losing the spark.read.avro API, plus possible confusion and breakage if they 
are using a modified version of com.databricks.avro. 

Thoughts from others?

 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-07 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571852#comment-16571852
 ] 

Thomas Graves commented on SPARK-24924:
---

Thanks, I missed it in the output for Spark, as I was just looking at the table 
properties. 

So what you are saying is that, without this change to map Databricks Avro to 
our internal Avro, the only way to update Hive tables to use the internal Avro 
version is to have users manually set the table properties? 

Do you know offhand whether you are able to write to a Hive table with 
datasource "com.databricks.spark.avro" using the internal Avro version, or does 
it error?  




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570880#comment-16570880
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

Yep. It will work if those 3rd-party packages are rebuilt on Apache Spark 2.4. 
So, it will be their next releases, not the currently existing ones.

Spark hides Spark-generated metadata. You can see it via the `hive` CLI as 
follows.

1. Run the Apache Hive 1.2.2 CLI and check tables. This initializes the metastore, too.
{code}
hive> show tables;
OK
Time taken: 1.163 seconds
{code}

2. Apache Spark 2.3.1 Result
{code}
scala> spark.version
res1: String = 2.3.1
scala> spark.range(10).write.format("com.databricks.spark.avro").saveAsTable("t")
scala> sql("desc formatted t").show(false)
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|id                          |bigint                                                   |null   |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database                    |default                                                  |       |
|Table                       |t                                                        |       |
|Owner                       |dongjoon                                                 |       |
|Created Time                |Mon Aug 06 15:41:40 PDT 2018                             |       |
|Last Access                 |Wed Dec 31 16:00:00 PST 1969                             |       |
|Created By                  |Spark 2.3.1                                              |       |
|Type                        |MANAGED                                                  |       |
|Provider                    |com.databricks.spark.avro                                |       |
|Table Properties            |[transient_lastDdlTime=1533595300]                       |       |
|Location                    |file:/user/hive/warehouse/t                              |       |
|Serde Library               |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat                 |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Storage Properties          |[serialization.format=1]                                 |       |
+----------------------------+---------------------------------------------------------+-------+
{code}

3. Apache Hive 1.2.2 CLI Result
{code}
hive> describe formatted t;
OK
# col_name              data_type               comment

col                     array                   from deserializer

# Detailed Table Information
Database:               default
Owner:                  dongjoon
CreateTime:             Mon Aug 06 15:41:40 PDT 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               file:/Users/dongjoon/spark-release/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t
Table Type:             MANAGED_TABLE
Table Parameters:
        spark.sql.create.version            2.3.1
        spark.sql.sources.provider          com.databricks.spark.avro
        spark.sql.sources.schema.numParts   1
        spark.sql.sources.schema.part.0     {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
        transient_lastDdlTime               1533595300

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        path                    file:/user/hive/warehouse/t
        serialization.format    1
Time taken: 1.373 seconds, Fetched: 31 row(s)
{code}


[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570840#comment-16570840
 ] 

Thomas Graves commented on SPARK-24924:
---

So officially, the Spark API compatibility is only at the compilation level: 
[http://spark.apache.org/versioning-policy.html]. We try to keep binary 
compatibility, but it's not guaranteed between releases. It might be worth 
bringing up, though, to make sure they thought of that, as it should be a 
conscious decision.

I think if you rebuild Databricks Avro with Spark 2.4 it works, right?

I unfortunately don't have a Hive setup working with Spark 2.4 right now. When 
I wrote a table (saveAsTable) with 2.3 Databricks Avro, I don't see a table 
property spark.sql.sources.provider. What am I missing?




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570756#comment-16570756
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

1. Theoretically, Spark 2.4 should handle both Hive tables simultaneously if 
the jars co-exist.
2. `ALTER TABLE` is technically possible, but it does not seem like a good 
approach for users because `spark.sql.sources.provider` is Spark-generated 
metadata.
3. For now, there is another issue with the `FileFormat` trait. In Spark 2.4, 
SPARK-24691 adds `FileFormat.supportDataType` and uses it to verify data types. 
Currently, it's a breaking change because the latest 3rd-party file formats, 
like Databricks Avro 4.0.0, don't have that method. The current Spark 2.4 master 
branch raises `java.lang.AbstractMethodError`. I think we had better fix this 
on the Spark side for compatibility.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570736#comment-16570736
 ] 

Thomas Graves commented on SPARK-24924:
---

So if the user includes the Databricks jar and specifies 
"com.databricks.spark.avro", can we support that, or is there some conflict that 
won't allow us to have both loaded? 

Can the user simply change the sources.provider to be 'avro' and have it work 
with the new internal version?

Sorry, I'm trying to make sure I don't miss anything with the compatibility 
story here.  




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570702#comment-16570702
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

For Hive tables, the format name is stored as a table parameter, 
`spark.sql.sources.provider`. For example, 
`spark.sql.sources.provider=com.databricks.spark.avro`. So, without this 
mapping, built-in avro format will not be used for that table. IIUC, one of the 
purposes of the new policy is not to support that.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570638#comment-16570638
 ] 

Thomas Graves commented on SPARK-24924:
---

So something I just thought of that I want to clarify: is this format name 
explicitly stored and used anywhere, say in created tables? For instance, let's 
say I'm using the Databricks Avro format and I create a table with it and save 
it out. Can I read that table fine with the new built-in Avro support without 
this mapping? 

 

 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570617#comment-16570617
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

Thank you for confirming and giving the right direction for this, [~tgraves]. 
It must be a consistent and clear policy for Apache Spark. +1 for moving 
forward in that direction by reverting the commits of this JIRA.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570586#comment-16570586
 ] 

Thomas Graves commented on SPARK-24924:
---

For compatibility, we can't remove it except in a major version, so my vote 
would be to remove it in 3.0.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570562#comment-16570562
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

Sorry for the late response, [~tgraves] and guys. I was OOO last week. When I 
made this JIRA, I didn't expect a long discussion like this.
Now, it looks like we are setting a new policy. [~tgraves], with a new policy, 
I'm wondering if we are going to remove the `com.databricks.spark.csv` mapping 
in Apache Spark 3.0.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570220#comment-16570220
 ] 

Thomas Graves commented on SPARK-24924:
---

{quote}I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours in particular within 2.4.0. If I 
missed some and this introduced a bug or behaviour changes, I personally think 
we should fix them within 2.4.0. That was one of key things I took into account 
when I merged some changes.
{quote}
Sorry, I didn't mean to claim that anyone introduced bugs by merging this in. 
{quote}In this case, users should provide their own short name of the package. 
I would say it's discouraged to use the same name with Spark's builtin 
datasources, or other packages name reserved - I wonder if users would actually 
try to have the same name in practice.

 
{quote}
I disagree with this. It's already a 3rd party package, not named 
org.apache.spark, and it provides its own short name that used to work before 
this. It's one thing to reference just "avro", but when users spell out the 
entire com.databricks name, we should not be remapping it.
{quote}We will make this in the release note - I think I listed up the possible 
stories about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708
{quote}
Yes, but like anything else, that requires a user to read it. Many users just 
get new versions deployed on their cluster, and if their jobs continue to run 
they don't notice or pay attention.
{quote}I also realize these are 3rd party packages but I think we are making 
the assumption here based on this being a databricks package, which in my 
opinion we shouldn't. What if this was companyX package which we didn't know 
about, what would/should be the expected behavior?

I think the main reason for this is that the code is actually ported from Avro 
{{com.databricks.*}}. The problem here is a worry that {{com.databricks.*}} 
indicates the builtin Avro, right?
{quote}
Yes. Personally, I don't think we should be remapping any third-party libraries 
to Apache Spark. In my opinion this case is even worse: we don't support 
spark.read.avro, yet it happens to work if you include the Databricks package, 
and even then it doesn't really call into the Databricks code; it calls into 
the Spark code. If I remove the Databricks jar, then spark.read.avro doesn't 
work. Really confusing to users, IMHO.
{quote}For clarification, it's not personally related to me in any way at all 
but I thought we better keep it consistent with CSV's.
To sum up, I get your position but I think the current approach makes a 
coherent point too. In that case, I think we better follow what we have done 
with CSV.
{quote}
I understand, and that definitely makes sense, but I don't agree that we should 
have even done it for CSV. Unfortunately, I didn't see that go in, so I couldn't 
disagree at the time. I think we should have made the message more 
user-friendly and told them to please update to use Spark's or rename it. We 
can't be responsible for keeping compatibility with all 3rd party libraries 
like that. We can't control what names they use. 

I'm fine with mapping to our implementation when they specify just the short 
name "avro", but if they use the full com.databricks name we should respect it, 
or throw an error if we can't. If everyone agrees, I can file a separate Jira 
to revert.

 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-04 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569109#comment-16569109
 ] 

Felix Cheung commented on SPARK-24924:
--

 

I tend to agree that we shouldn't "magically" remap different implementations 
or change behavior across versions, especially since we have never really 
tested them for compatibility or documented them as compatible in any way.

Do we have agreement on what the behavior should be then? Could someone 
summarize?




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569104#comment-16569104
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

For a fully qualified path, we can already specify something like 
{{com.databricks.spark.avro.AvroFormat}}, and I guess that will use the 
third-party one, if I am not mistaken. 

Probably we should not do this, but this is what we do with CSV, which kind of 
makes a point as well. Wouldn't it be better to just follow what we already do?

If we should raise an error for this case, I guess that change should target 
3.0.0 for CSV, and this PR should be reverted.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-04 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569085#comment-16569085
 ] 

Wenchen Fan commented on SPARK-24924:
-

When the short name conflicts, I feel it's better to pick the built-in data 
source than to fail the job and report a conflict. When the full class name of 
the data source is specified, like com.databricks.spark.avro, we should respect 
it.
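
The rule proposed above can be modelled roughly as follows. This is a minimal 
Python sketch, not Spark's actual lookup code: the function name `resolve`, 
the `registered` dictionary, and the prefix check used to recognise built-in 
sources are all assumptions made for illustration.

```python
# Assumption: a class under this prefix is treated as a built-in source.
BUILTIN_PREFIX = "org.apache.spark.sql."


def resolve(provider, registered):
    """Pick the class to load for `provider`.

    `registered` is a hypothetical map from a short name to every class
    on the classpath that claims that short name.
    """
    if "." in provider:
        # Fully qualified name: respect exactly what the user asked for.
        return provider
    candidates = registered.get(provider.lower(), [])
    builtin = [c for c in candidates if c.startswith(BUILTIN_PREFIX)]
    if builtin:
        # Short name, with or without a conflict: the built-in source wins.
        return builtin[0]
    if len(candidates) == 1:
        # Only one external package claims the name, so there is no conflict.
        return candidates[0]
    raise LookupError(f"cannot resolve data source {provider!r}")


# Both the external jar and the built-in module claim the short name "avro".
registered = {
    "avro": [
        "com.databricks.spark.avro.DefaultSource",
        "org.apache.spark.sql.avro.AvroFileFormat",
    ]
}
```

With this model, `resolve("avro", registered)` selects the built-in class, 
while `resolve("com.databricks.spark.avro", registered)` is passed through 
unchanged and would load whatever the external jar provides.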




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569072#comment-16569072
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

Also, for clarification, we already issue warnings:

{code}
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
{code}

So, I guess it's virtually error vs warning.
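
To make the two options concrete, here is a minimal Python sketch of the 
choice. It is only a model: the function `choose`, its `strict` flag, and the 
prefix test for internal sources are hypothetical, and only the two message 
texts are taken from the logs quoted in this thread.

```python
import logging

log = logging.getLogger("DataSource")

# Assumption: a class under this prefix is treated as the internal source.
INTERNAL_PREFIX = "org.apache.spark.sql."


def choose(name, sources, strict):
    """Choose one class from a non-empty list of `sources` registered for `name`.

    strict=True models the older behaviour (fail on any conflict);
    strict=False models the current one (warn and use the internal source).
    """
    if len(sources) == 1:
        return sources[0]
    listed = ", ".join(sources)
    internal = [s for s in sources if s.startswith(INTERNAL_PREFIX)]
    if strict or len(internal) != 1:
        raise RuntimeError(
            f"Multiple sources found for {name} ({listed}), "
            "please specify the fully qualified class name."
        )
    log.warning(
        "Multiple sources found for %s (%s), defaulting to the internal "
        "datasource (%s).", name, listed, internal[0],
    )
    return internal[0]


csv_sources = [
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "com.databricks.spark.csv.DefaultSource15",
]
```

Calling `choose("csv", csv_sources, strict=False)` logs the warning and 
returns the internal class; the same call with `strict=True` raises instead, 
which is the behaviour users complained about for CSV.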




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569071#comment-16569071
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

If it already throws an error for the CSV case too, I would of course prefer 
to have the improved error message.

{quote}
I don't buy this argument, the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.
{quote}

I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours, in particular within 2.4.0. If I 
missed some and this introduced a bug or behaviour change, I personally think 
we should fix it within 2.4.0. That was one of the key things I took into 
account when I merged some changes.

{quote}
Users could have also made their own modified version of the databricks 
spark-avro package (which we actually have to support primitive types) and thus 
the implementation is not the same and yet you are assuming it is.  
{quote}

In this case, users should provide their own short name for the package. I 
would say it's discouraged to use the same name as Spark's built-in data 
sources, or names reserved by other packages. I wonder if users would actually 
try to use the same name in practice.

{quote}
 I'm worried about other users who didn't happen to see this jira.
{quote}

We will mention this in the release notes. I think I listed the possible 
scenarios in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708

{quote}
I also realize these are 3rd party packages but I think we are making the 
assumption here based on this being a databricks package, which in my opinion 
we shouldn't.   What if this was companyX package which we didn't know about, 
what would/should be the expected behavior? 
{quote}

I think the main reason for this is that the code is actually ported from the 
{{com.databricks.*}} Avro package. The concern here is that {{com.databricks.*}} 
would point to the built-in Avro, right? 

{quote}
How many users complained about the csv thing? 
{quote}

So far, I see some issues as below:

https://github.com/databricks/spark-csv/issues/367
https://github.com/databricks/spark-csv/issues/373
https://github.com/apache/spark/pull/17916#issuecomment-301898567

For clarification, it's not related to me in any way, but I thought we had 
better keep it consistent with CSV's.
To sum up, I get your position, but I think the current approach makes a 
coherent point too. In that case, I think we had better follow what we have 
done with CSV.






[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568454#comment-16568454
 ] 

Reynold Xin commented on SPARK-24924:
-

I like the improved error message (I didn't read the earlier comments in this 
thread).

 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568393#comment-16568393
 ] 

Thomas Graves commented on SPARK-24924:
---

| It wouldn't be very different for 2.4.0. It could be different but I guess it 
should be incremental improvement without behaviour changes.

I don't buy this argument. The code has been restructured a lot, and you could 
have introduced bugs, behavior changes, etc. If a user has been using the 
Databricks spark-avro version with other releases and it was working fine, and 
now we magically map it to a different version and their job breaks, they are 
going to complain and say, "I didn't change anything, why did this break?" 

Users could have also made their own modified version of the Databricks 
spark-avro package (which we actually have, to support primitive types), and 
thus the implementation is not the same, and yet you are assuming it is. Just 
to note, the fact that we use a different version isn't my issue; I'm happy to 
make that work. I'm worried about other users who didn't happen to see this 
Jira. I also realize these are 3rd party packages, but I think we are making 
an assumption here based on this being a Databricks package, which in my 
opinion we shouldn't. What if this was a companyX package that we didn't know 
about? What would/should the expected behavior be? 

How many users complained about the CSV thing? Could we just improve the error 
message to state more simply: "Multiple sources found, perhaps you are 
including an external package that also supports avro. Spark started supporting 
it internally as of release X.Y; please remove the external package or rewrite 
to use a different function."




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568352#comment-16568352
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

cc [~cloud_fan] since we talked about this for CSV, and [~rxin] who agreed upon 
not adding .avro for now, FYI.




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568348#comment-16568348
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

{quote}
but at the same time we aren't adding the spark.read.avro syntax, so does it 
break in that case, or do they get a different implementation by default? 
{quote}

If users call this, it is still going to use the built-in implementation 
(https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26)
 as it's a short name for {{format("com.databricks.spark.avro")}}.

{quote}
our internal implementation which could very well be different.
{quote}

It wouldn't be very different for 2.4.0. It could be different but I guess it 
should be incremental improvement without behaviour changes.

{quote}
 I would rather just plain error out saying these conflict, either update or 
change your external package to use a different name. 
{quote}

IIRC, we did this for the CSV data source in the past, and many users 
complained about it.

{code}
java.lang.RuntimeException: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
class name.
{code}

In practice, I am actually a bit more confident in the current approach, since 
users complained about this a lot and I am not seeing (so far) complaints about 
the current approach.

{quote}
There is also the case one might be able to argue it's breaking API 
compatibility since the .avro option went away, but it's a third party library 
so you can probably get away with that. 
{quote}

It went away, so I guess that if the jar is provided with the implicit import 
that supports this, it should work as usual and, in theory, use the internal 
implementation. If the jar is not given, the .avro API is not supported and the 
internal implementation will be used. 





[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568204#comment-16568204
 ] 

Thomas Graves commented on SPARK-24924:
---

[~felixcheung] did your discussion on the same thing with csv get resolved?  




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568199#comment-16568199
 ] 

Thomas Graves commented on SPARK-24924:
---

Hmm, so we are adding this for ease of upgrading, I guess (so users don't have 
to change their code), but at the same time we aren't adding the 
spark.read.avro syntax. So does it break in that case, or do they get a 
different implementation by default? 

This doesn't make sense to me. Personally, I don't like having some other 
add-on package's names in our code at all, and here we are mapping what the 
user thought they would get to our internal implementation, which could very 
well be different. I would rather just plainly error out, saying that these 
conflict and that the user should either update or change their external 
package to use a different name. There is also the case that one might be able 
to argue it's breaking API compatibility since the .avro option went away, but 
it's a third-party library, so you can probably get away with that. 




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-02 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567708#comment-16567708
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

A similar discussion took place in SPARK-20590 when we ported CSV. In my 
experience, users really don't know whether {{com.databricks.spark.avro}} or 
{{avro}} means the external Avro jar or the internal one (the same thing 
happened with CSV; I was active in the Databricks Spark CSV package, FWIW).

If users were using the external Avro, they will likely hit the error if they 
upgrade Spark directly. Otherwise, users will see in the release notes that the 
Avro package is included in 2.4.0, and they will not provide the external jar.
If users miss the release notes, they will try to explicitly provide the 
third-party jar, which will now produce a message like:

{code}
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
{code}

Encouraging use of the built-in one is probably preferable, since its 
behaviour will be kept the same as far as possible for now.
Otherwise, if the external Avro must be used, I think in theory it can still be 
used by specifying the source with its fully qualified path.
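
For readers unfamiliar with what "mapping" means here, the idea can be 
sketched as a lookup table applied to the provider string before the class is 
loaded. This Python model is illustrative only: the names 
`BACKWARD_COMPATIBILITY_MAP` and `remap` are hypothetical, the CSV class name 
is the one shown in the log above, and the Avro class name is an assumption 
about where the built-in implementation lives.

```python
# Hypothetical table: external provider string -> built-in class to load.
BACKWARD_COMPATIBILITY_MAP = {
    "com.databricks.spark.csv":
        "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    # Assumed class name for the built-in Avro source.
    "com.databricks.spark.avro":
        "org.apache.spark.sql.avro.AvroFileFormat",
}


def remap(provider):
    """Return the provider to load, redirecting known external names."""
    # Names that are not in the table are passed through untouched.
    return BACKWARD_COMPATIBILITY_MAP.get(provider, provider)
```

Under this model, `remap("com.databricks.spark.avro")` is redirected to the 
built-in class even when the external jar is on the classpath, whereas a name 
that is not in the table, such as a specific class inside the external 
package, is left alone and would load the third-party code.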




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-02 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567045#comment-16567045
 ] 

Thomas Graves commented on SPARK-24924:
---

Why are we doing this? If a user ships the Databricks spark-avro jar and 
references the com.databricks.spark.avro class, why do we want to map that to 
our built-in version, which might be different?




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560624#comment-16560624
 ] 

Apache Spark commented on SPARK-24924:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/21906




[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-07-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556427#comment-16556427
 ] 

Apache Spark commented on SPARK-24924:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21878

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org