[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303213#comment-17303213 ]

Yu Xiang commented on SPARK-24924:
----------------------------------

[~tgraves], [~Gengliang.Wang], [~dongjoon],

Hi, I am struggling with the "Spark Multiple sources found for " issue. Is it a bug, or is it just a problem with the Spark versions?

I have a Java program in which I call the Spark textFile function. It works well locally when I run the Java program from the IDE. However, when I use `spark-submit` with the jar file, it fails with "Spark Multiple sources found for text". Even when I specify the default format "org.apache.spark.sql.execution.datasources.text.TextFileFormat", the error still occurs in `spark-submit` mode.

The detailed description of the problem is here: [https://stackoverflow.com/questions/4181/spark-multiple-sources-found-for-text]

Could you help take a look? Thank you.

> Add mapping for built-in Avro data source
> -----------------------------------------
>
>                 Key: SPARK-24924
>                 URL: https://issues.apache.org/jira/browse/SPARK-24924
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> This issue aims to do the following.
> # Like the `com.databricks.spark.csv` mapping, we had better map `com.databricks.spark.avro` to the built-in Avro data source.
> # Remove the incorrect error message, `Please find an Avro package at ...`.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583868#comment-16583868 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

Hi, All.

I created SPARK-25143 as a more general and sustainable way for CSV/ORC/AVRO. Hopefully, we can remove our internal mappings for `com.databricks.spark.*` without any problem in Spark 3. Since SPARK-25143 is a general configuration, we can remove those in Spark 2.4, if we want.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583483#comment-16583483 ]

Gengliang Wang commented on SPARK-24924:
----------------------------------------

[~dongjoon] I see. I am now +1 on adding a new configuration.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582948#comment-16582948 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

[~Gengliang.Wang]. Er, the latest consensus isn't to remove the mapping. With configurations, we can maximize the benefit to users, especially for Spark's datasource tables.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581929#comment-16581929 ]

Gengliang Wang commented on SPARK-24924:
----------------------------------------

As the package "org.apache.spark.sql.avro" is an external module and is not loaded by default, we should not prevent users from using "com.databricks.spark.avro". +1 on removing the mapping. I will create a PR for it.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581608#comment-16581608 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

I'd be OK with that, but CSV has already been that way for a long time, so I don't think it's required. I would vote for not doing that; if someone wants it, do it under a separate JIRA. I want to see the config for Avro go in before 2.4 is released, for compatibility reasons.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581546#comment-16581546 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

[~tgraves]. In that case, for consistency, we had better add two configurations, one for Avro and one for CSV. Shall we discuss that in a new minor-improvement JIRA issue?
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581117#comment-16581117 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

[~cloud_fan] [~hyukjin.kwon] It seems no one else has a strong opinion on this. Since there is precedent here with the CSV mapping, how about we just add a config to allow users to turn the mapping off? That would let them easily continue to use their own version if they want, but if they are using Hive tables and want those to work with the internal version, they can use the config.

Do we have release notes or something documented for compatibility? (I didn't see anything in sql-programming-guide.)

[~mridulm80] [~irashid], pinging you as a couple of others who might use Avro, to see if you have input.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573221#comment-16573221 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

| There was a discussion about why we shouldn't support it: [https://github.com/apache/spark/pull/21841]

There is no discussion on that PR? I assume you are referring to the comment it links to? It looks like we aren't supporting it because Python and R aren't going to be supported, correct?

That may be a fine reason for us not to support it internally; I'm not against that. I'm saying it is not a very good compatibility or upgrade story for users who want to switch from Databricks Avro to the internal Avro. We are adding this mapping so users can easily upgrade, and claiming it's functionally the same, but it's not really that easy, as they potentially have to change their code to not use spark.read/write.avro.

If we don't support spark.read/write.avro, I know that at least for my users I will create something so that it works for the 2.4 feature release, because I view that as an API incompatibility and they don't expect that in a feature release. I realize this is a 3rd-party library, so we may be able to get away with it, but that doesn't mean it's nice for our users.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572564#comment-16572564 ]

Hyukjin Kwon commented on SPARK-24924:
--------------------------------------

[~cloud_fan], Yea, adding them as implicits doesn't sound like a good idea. But I think we can still add {{spark.read.avro}} in {{DataFrameReader}}, although it looks a bit weird since Avro is an external package.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572556#comment-16572556 ]

Wenchen Fan commented on SPARK-24924:
-------------------------------------

> I assume we could theoretically also support the spark.read.avro format as well

There was a discussion about why we shouldn't support it: https://github.com/apache/spark/pull/21841

Users always need to do some manual work to use `spark.read.avro`, even with the Databricks Avro package. Now users can still define an implicit class to support `spark.read.avro` if they want to.
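The implicit-class pattern mentioned in this comment can be sketched roughly as below. To stay self-contained, the sketch uses a hypothetical stand-in `Reader` rather than Spark's `DataFrameReader`; `Reader`, `AvroReaderOps`, and the string result are assumptions for illustration only, not Spark APIs. With Spark on the classpath, the same pattern would presumably wrap `org.apache.spark.sql.DataFrameReader` and return a `DataFrame`.

```scala
object AvroReaderSketch {
  // Hypothetical stand-in for a DataFrameReader-like builder (not a Spark class).
  final case class Reader(source: String = "parquet") {
    def format(name: String): Reader = copy(source = name)
    // Returns a description of what would be loaded; stands in for a DataFrame.
    def load(path: String): String = s"$source:$path"
  }

  // Extension method: adds `.avro(path)` to Reader without modifying Reader itself.
  implicit class AvroReaderOps(private val reader: Reader) extends AnyVal {
    def avro(path: String): String = reader.format("avro").load(path)
  }
}
```

After `import AvroReaderSketch._`, a call such as `Reader().avro("/tmp/episodes")` resolves through the implicit class, which is the "manual work" a user would do once per project.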
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571908#comment-16571908 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

So originally, when I started on this, I didn't know about the side effects on Hive tables here, so this isn't as straightforward as I originally thought. I still personally don't like remapping this, because users get something other than what they explicitly asked for. But if we want to keep this compatibility, we either have to do that or actually have a com.databricks.avro class that would just map into our internal Avro. That would give the benefit that they could eclipse it with their own jar if they wanted to keep using their custom version. I assume we could theoretically also support the spark.read.avro format as well. Or I guess the third option is to just break compatibility and require users to change the table property, but then they can't read the table with older versions of Spark.

It also seems bad to me that we aren't supporting spark.read.avro, so it's an API compatibility issue. We magically help them with compatibility for their tables by mapping them, but we don't support the old API and they have to update their code. This feels like an inconsistent story to me, and I'm not sure how that fits with our versioning policy since it's a 3rd-party thing.

Not sure I like any of these options. Seems like these are the options:

1) Add the class com.databricks.avro to the Spark source, have it do the remap, and support spark.read/write.avro for a couple of releases for compatibility; then remove it and tell people to change the table property, or provide an API to do that.
2) Make the mapping of com.databricks.avro => internal Avro configurable; that would allow them to continue to use their version of com.databricks.avro until they can update to the new API.
3) Do nothing: leave this as is with this JIRA, and the user has to deal with losing the spark.read.avro API, and with possible confusion and breakage if they are using a modified version of com.databricks.avro.

Thoughts from others?
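Option 2 above (a configurable mapping) can be illustrated with a minimal sketch. The object name, the `remapEnabled` parameter, and the target class-name string are assumptions made for illustration; this is not Spark's actual resolution code or configuration key.

```scala
object ConfigurableMappingSketch {
  // Provider name used by the third-party package discussed in this thread.
  val DatabricksAvro = "com.databricks.spark.avro"
  // Illustrative name for the built-in implementation (assumption).
  val BuiltInAvro = "org.apache.spark.sql.avro.AvroFileFormat"

  // `remapEnabled` plays the role of the proposed configuration switch:
  // when it is off, the user's explicitly requested provider is left untouched.
  def resolve(provider: String, remapEnabled: Boolean): String =
    if (remapEnabled && provider == DatabricksAvro) BuiltInAvro else provider
}
```

With the switch on, existing tables that record the Databricks provider name would resolve to the built-in source; with it off, a user-supplied jar that provides `com.databricks.spark.avro` would still be the one that is looked up.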
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571852#comment-16571852 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

Thanks, I missed it in the output for Spark as I was just looking at table properties.

So what you are saying is that without this change to map Databricks Avro to our internal Avro, the only way to update Hive tables to use the internal Avro version is to have users manually set the table properties?

Do you know offhand whether you are able to write to a Hive table with datasource "com.databricks.spark.avro" using the internal Avro version, or does it error?
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570880#comment-16570880 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

Yep. It will work if those 3rd-party packages are rebuilt against Apache Spark 2.4. So, it will be the next releases, not the currently existing ones.

Spark hides Spark-generated metadata. You can see it via the `hive` CLI as follows.

1. Run the Apache Hive 1.2.2 CLI and check tables. This initializes the metastore, too.
{code}
hive> show tables;
OK
Time taken: 1.163 seconds
{code}

2. Apache Spark 2.3.1 result
{code}
scala> spark.version
res1: String = 2.3.1

scala> spark.range(10).write.format("com.databricks.spark.avro").saveAsTable("t")

scala> sql("desc formatted t").show(false)
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|id                          |bigint                                                   |null   |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database                    |default                                                  |       |
|Table                       |t                                                        |       |
|Owner                       |dongjoon                                                 |       |
|Created Time                |Mon Aug 06 15:41:40 PDT 2018                             |       |
|Last Access                 |Wed Dec 31 16:00:00 PST 1969                             |       |
|Created By                  |Spark 2.3.1                                              |       |
|Type                        |MANAGED                                                  |       |
|Provider                    |com.databricks.spark.avro                                |       |
|Table Properties            |[transient_lastDdlTime=1533595300]                       |       |
|Location                    |file:/user/hive/warehouse/t                              |       |
|Serde Library               |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat                 |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Storage Properties          |[serialization.format=1]                                 |       |
+----------------------------+---------------------------------------------------------+-------+
{code}

3. Apache Hive 1.2.2 CLI result
{code}
hive> describe formatted t;
OK
# col_name              data_type               comment

col                     array                   from deserializer

# Detailed Table Information
Database:               default
Owner:                  dongjoon
CreateTime:             Mon Aug 06 15:41:40 PDT 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               file:/Users/dongjoon/spark-release/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t
Table Type:             MANAGED_TABLE
Table Parameters:
        spark.sql.create.version                2.3.1
        spark.sql.sources.provider              com.databricks.spark.avro
        spark.sql.sources.schema.numParts       1
        spark.sql.sources.schema.part.0         {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
        transient_lastDdlTime                   1533595300

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        path                    file:/user/hive/warehouse/t
        serialization.format    1
Time taken: 1.373 seconds, Fetched: 31 row(s)
{code}
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570840#comment-16570840 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

So officially, Spark API compatibility is only at the compilation level: [http://spark.apache.org/versioning-policy.html]. We try to keep binary compatibility, but it's not guaranteed between releases. It might be worth bringing up, though, to make sure they thought of that, as it should be a conscious decision. I think if you rebuild Databricks Avro with Spark 2.4 it works, right?

I unfortunately don't have a Hive setup working with Spark 2.4 right now. When I wrote a table (saveAsTable) with 2.3 Databricks Avro, I don't see a table property spark.sql.sources.provider. What am I missing?
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570756#comment-16570756 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

1. Theoretically, Spark 2.4 should handle both Hive tables simultaneously if the jars co-exist.
2. `ALTER TABLE` is technically possible, but it doesn't seem a good way for users because `spark.sql.sources.provider` is Spark-generated metadata.
3. For now, there is another issue with the `FileFormat` trait. In Spark 2.4, SPARK-24691 adds `FileFormat.supportDataType` and uses it to verify data types. Currently, it's a breaking change because the latest 3rd-party file formats, like Databricks Avro 4.0.0, don't have that method. The current Spark 2.4 master branch raises `java.lang.AbstractMethodError`. I think we had better fix this on the Spark side for compatibility.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570736#comment-16570736 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

So if the user includes the Databricks jar and specifies "com.databricks.spark.avro", can we support that, or is there some conflict that won't allow us to have both loaded?

Can the user simply change the sources.provider to 'avro' and have it work with the new internal version?

Sorry, I'm trying to make sure I don't miss anything in the compatibility story here.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570702#comment-16570702 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

For Hive tables, the format name is stored as a table parameter, `spark.sql.sources.provider`. For example, `spark.sql.sources.provider=com.databricks.spark.avro`. So, without this mapping, the built-in Avro format will not be used for that table. IIUC, one of the purposes of the new policy is not to support that.
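The dependency described in this comment, where the provider recorded in the table parameters decides which data source is used, can be sketched as follows. The object and function names and the contents of `compatMap` are illustrative assumptions, not Spark's internal code; only the parameter key `spark.sql.sources.provider` and the provider names come from the thread.

```scala
object ProviderLookupSketch {
  // Legacy provider name -> built-in short name; entries are illustrative.
  val compatMap: Map[String, String] = Map(
    "com.databricks.spark.csv"  -> "csv",
    "com.databricks.spark.avro" -> "avro"
  )

  // Reads the provider recorded in the table parameters and applies the mapping.
  // Providers without a mapping entry are returned unchanged.
  def providerFor(tableParams: Map[String, String]): Option[String] =
    tableParams
      .get("spark.sql.sources.provider")
      .map(p => compatMap.getOrElse(p, p))
}
```

Without the `com.databricks.spark.avro` entry, a table created by the third-party package would keep resolving to that package's name, which is the situation the comment describes.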
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570638#comment-16570638 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

So, something I just thought of that I want to clarify: is this format name explicitly stored and used anywhere, say in tables created? For instance, let's say I'm using the Databricks Avro format, and I create a table with it and save it out. Can I read that table fine with the new built-in Avro support without this mapping?
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570617#comment-16570617 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

Thank you for confirming and giving the right direction for this, [~tgraves]. It must be a consistent and clear policy for Apache Spark. +1 for moving forward in that direction by reverting the commits of this JIRA.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570586#comment-16570586 ]

Thomas Graves commented on SPARK-24924:
---------------------------------------

For compatibility, we can't remove it except in a major version, so my vote would be to remove it in 3.0.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570562#comment-16570562 ]

Dongjoon Hyun commented on SPARK-24924:
---------------------------------------

Sorry for the late responses, [~tgraves] and guys. I was OOO last week.

When I created this JIRA, I didn't expect a long discussion like this. Now, it looks like we are setting a new policy. [~tgraves], with a new policy, I'm wondering if we are going to remove the `com.databricks.spark.csv` mapping in Apache Spark 3.0.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570220#comment-16570220 ] Thomas Graves commented on SPARK-24924: ---
{quote}I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours the same, in particular within 2.4.0. If I missed something and this introduced a bug or behaviour change, I personally think we should fix it within 2.4.0. That was one of the key things I took into account when I merged some changes. {quote}
Sorry, I didn't mean to claim that anyone introduced bugs by merging this in.
{quote}In this case, users should provide their own short name for the package. I would say it's discouraged to use the same name as Spark's builtin datasources, or other reserved package names - I wonder if users would actually try to use the same name in practice. {quote}
I disagree with this. It's already a 3rd-party package that is not called org.apache.spark, and they are providing their own short name, which used to work before this. It's one thing to just reference "avro", but when they put the entire com.databricks name, we should not be remapping it.
{quote}We will mention this in the release notes - I think I listed the possible scenarios in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote}
Yes, but like anything else, that requires a user to read it. Many users just get new versions deployed on their cluster, and if their job continues to run they don't notice or pay attention.
{quote}I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? I think the main reason for this is that the code is actually ported from Avro {{com.databricks.*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote}
Yes. Personally, I don't think we should be remapping any third-party libraries to Apache Spark. In my opinion this is even worse, since we don't support spark.read.avro ourselves: it happens to work if you include the databricks package, yet it doesn't really call into the databricks code; it calls into the Spark code. If I remove the databricks jar, then spark.read.avro doesn't work. Really confusing to users, IMHO.
{quote}For clarification, it's not personally related to me in any way at all but I thought we had better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we had better follow what we have done with CSV. {quote}
I understand, and that definitely makes sense, but I don't agree that we should have even done it for csv. Unfortunately, I didn't see that go in, so I couldn't disagree at the time. I think we should have made the message more user-friendly and told users to either update to use Spark's implementation or rename theirs. We can't be responsible for keeping compatibility with all 3rd-party libraries like that. We can't control what names they use. I'm fine with mapping the short name "avro" to our implementation when that is all they specify, but if they use the full com.databricks name we should respect it, or throw an error if we can't. If everyone agrees, I can file a separate Jira to revert.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569109#comment-16569109 ] Felix Cheung commented on SPARK-24924: -- I tend to agree that we shouldn't "magically" remap different implementations or change behavior across versions, esp. since we have never really tested them for compatibility or documented them as such in any way. Do we have agreement on what the behavior should be then? Could someone summarize?
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569104#comment-16569104 ] Hyukjin Kwon commented on SPARK-24924: -- For a fully qualified path, we could already specify something like {{com.databricks.spark.avro.AvroFormat}}, and I guess that will use the third-party one, if I am not mistaken. Probably we should not do this, but this is what we do with CSV, which kind of makes a point as well. Wouldn't it be better to just follow what we already do? If we should raise an error for this case, I guess that should target 3.0.0 for CSV, and this PR should be reverted.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569085#comment-16569085 ] Wenchen Fan commented on SPARK-24924: - When the short name conflicts, I feel it's better to pick the built-in data source than to fail the job and say it conflicts. When the full class name of the data source is specified, like com.databricks.spark.avro, we should respect it.
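The policy described in this comment can be sketched in a few lines. This is a minimal, self-contained model and not Spark's actual lookup code: the function name `resolve_provider`, the `registered` dictionary, and the rule "a name containing a dot is a full class name" are all assumptions made for illustration. The class names come from the warning quoted elsewhere in this thread.

```python
import warnings

# Assumption for this sketch: built-in sources live under this package prefix.
BUILT_IN_PREFIX = "org.apache.spark."


def resolve_provider(provider, registered):
    """Pick a provider class name.

    provider:   what the user passed to format(...), either a short name
                such as "csv" or a full class name.
    registered: hypothetical map of short name -> class names found on
                the classpath.
    """
    # A full class name is respected as-is (assumed heuristic: it has a dot).
    if "." in provider:
        return provider

    candidates = registered.get(provider, [])
    if not candidates:
        raise LookupError(f"Failed to find data source: {provider}")
    if len(candidates) == 1:
        return candidates[0]

    # Short-name conflict: prefer the single built-in source and warn,
    # instead of failing the job.
    built_in = [c for c in candidates if c.startswith(BUILT_IN_PREFIX)]
    if len(built_in) == 1:
        warnings.warn(
            f"Multiple sources found for {provider} ({', '.join(candidates)}), "
            f"defaulting to the internal datasource ({built_in[0]})."
        )
        return built_in[0]

    raise RuntimeError(
        f"Multiple sources found for {provider} ({', '.join(candidates)}), "
        "please specify the fully qualified class name."
    )
```

With both the built-in and the external CSV classes registered under "csv", `resolve_provider("csv", registered)` returns the built-in class and emits a warning, while passing the external class name directly returns it unchanged.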
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569072#comment-16569072 ] Hyukjin Kwon commented on SPARK-24924: -- Also, for clarification, we already issue warnings:
{code}
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
{code}
So, I guess it's virtually a matter of error vs. warning.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569071#comment-16569071 ] Hyukjin Kwon commented on SPARK-24924: -- If it already throws an error for the CSV case too, I would prefer to have the improved error message, of course.
{quote} I don't buy this argument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote}
I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours the same, in particular within 2.4.0. If I missed something and this introduced a bug or behaviour change, I personally think we should fix it within 2.4.0. That was one of the key things I took into account when I merged some changes.
{quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote}
In this case, users should provide their own short name for the package. I would say it's discouraged to use the same name as Spark's builtin datasources, or other reserved package names - I wonder if users would actually try to use the same name in practice.
{quote} I'm worried about other users who didn't happen to see this jira. {quote}
We will mention this in the release notes - I think I listed the possible scenarios in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708
{quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote}
I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right?
{quote} How many users complained about the csv thing? {quote}
So far, I see some issues as below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567
For clarification, it's not related to me in any way, but I thought we had better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we had better follow what we have done with CSV.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568454#comment-16568454 ] Reynold Xin commented on SPARK-24924: - I like the improved error message (I didn't read the earlier comments in this thread).
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568393#comment-16568393 ] Thomas Graves commented on SPARK-24924: ---
| It wouldn't be very different for 2.4.0. It could be different but I guess it should be incremental improvement without behaviour changes.
I don't buy this argument. The code has been restructured a lot, and you could have introduced bugs, behavior changes, etc. If a user has been using the databricks spark-avro version for other releases and it was working fine, and now we magically map it to a different version and they break, they are going to complain and say, "I didn't change anything, why did this break?" Users could have also made their own modified version of the databricks spark-avro package (which we actually have, to support primitive types), and thus the implementation is not the same, yet you are assuming it is. Just a note: the fact that we use a different version isn't my issue, I'm happy to make that work; I'm worried about other users who didn't happen to see this jira. I also realize these are 3rd party packages, but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was a companyX package which we didn't know about, what would/should be the expected behavior? How many users complained about the csv thing? Could we just improve the error message to more simply state, "Multiple sources found, perhaps you are including an external package that also supports avro. Spark started supporting it internally as of release X.Y, please remove the external package or rewrite to use a different function"?
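The friendlier failure mode proposed in this comment could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions, not Spark code: `multiple_sources_message` and `strict_lookup` are hypothetical helpers, the wording is adapted from the suggestion above, and the release number is passed in by the caller rather than assumed.

```python
def multiple_sources_message(short_name, sources, since_release):
    """Build the more explanatory error text suggested in the thread.

    All three parameters are supplied by the caller; nothing here queries
    a real Spark installation.
    """
    return (
        f"Multiple sources found for {short_name} ({', '.join(sources)}). "
        f"Perhaps you are including an external package that also supports "
        f"{short_name}. Spark started supporting it internally as of release "
        f"{since_release}; please remove the external package or refer to it "
        f"by a different name."
    )


def strict_lookup(short_name, sources, since_release):
    """Fail loudly on a conflict instead of silently remapping."""
    if not sources:
        raise LookupError(f"Failed to find data source: {short_name}")
    if len(sources) > 1:
        raise RuntimeError(
            multiple_sources_message(short_name, sources, since_release)
        )
    return sources[0]
```

Under this policy a job that ships an external Avro jar alongside a Spark release with built-in Avro stops with an actionable message, which is the trade-off being argued for here: an explicit failure rather than a quiet switch of implementation.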
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568352#comment-16568352 ] Hyukjin Kwon commented on SPARK-24924: -- cc [~cloud_fan] since we talked about this for CSV, and [~rxin] who agreed upon not adding .avro for now, FYI.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568348#comment-16568348 ] Hyukjin Kwon commented on SPARK-24924: --
{quote} but at the same time we aren't adding the spark.read.avro syntax so it breaks in that case or they get a different implementation by default? {quote}
If users call this, that's still going to use the builtin implementation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26) as it's a short name for {{format("com.databricks.spark.avro")}}.
{quote} our internal implementation which could very well be different. {quote}
It wouldn't be very different for 2.4.0. It could be different but I guess it should be incremental improvement without behaviour changes.
{quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote}
IIRC, in the past, we did that for the CSV datasource and many users complained about it.
{code}
java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
{code}
In practice, I am actually a bit more confident in the current approach, since users actually complained about this a lot, and so far I am not seeing complaints about the current approach.
{quote} There is also the case one might be able to argue it's breaking API compatibility since .avro option went away, but it's a third party library so you can probably get away with that. {quote}
It went away, so I guess if the jar is provided with the implicit import to support this, it should work as usual and use the internal implementation, in theory. If the jar is not given, the .avro API is not supported and the internal implementation will be used.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568204#comment-16568204 ] Thomas Graves commented on SPARK-24924: --- [~felixcheung] did your discussion on the same thing with csv get resolved?
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568199#comment-16568199 ] Thomas Graves commented on SPARK-24924: --- Hmm, so we are adding this for ease of upgrading I guess (so the user doesn't have to change their code), but at the same time we aren't adding the spark.read.avro syntax, so it breaks in that case, or they get a different implementation by default? This doesn't make sense to me. Personally I don't like having some other add-on package names in our code at all, and here we are mapping what the user thought they would get to our internal implementation, which could very well be different. I would rather just plain error out saying these conflict, either update or change your external package to use a different name. There is also the case that one might be able to argue it's breaking API compatibility since the .avro option went away, but it's a third-party library so you can probably get away with that.
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567708#comment-16567708 ] Hyukjin Kwon commented on SPARK-24924: -- A similar discussion took place in SPARK-20590 when we ported CSV. In my experience, users really don't know whether {{com.databricks.spark.avro}} or {{avro}} means the external Avro jar or the internal one (the same thing happened with CSV - I was active in that Spark CSV (databricks) package, FWIW). If users were using the external Avro, they will likely hit the error if they directly upgrade Spark. Otherwise, users will see the release note that the Avro package is included in 2.4.0, and they will not provide the external jar. If users miss the release note, then they will try to explicitly provide the third-party jar, which will now give a warning message like:
{code}
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
{code}
Encouraging users to use the builtin one is probably preferable, since the behaviours will be kept the same as far as possible for now. Otherwise, if the external Avro must be used, I think it can still be used, in theory, if the source is specified by its fully qualified path.
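The remapping under discussion can be modelled as a small alias table applied before any class lookup. The sketch below is illustrative only: `LEGACY_ALIASES`, `canonical_provider`, and the `remap_legacy` switch are hypothetical names, and the built-in Avro class name used as the mapping target is an assumption rather than something stated in this thread.

```python
# Hypothetical alias table: legacy third-party provider name -> built-in class.
LEGACY_ALIASES = {
    "com.databricks.spark.csv":
        "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    # Target class name assumed for illustration.
    "com.databricks.spark.avro":
        "org.apache.spark.sql.avro.AvroFileFormat",
}


def canonical_provider(name, remap_legacy=True):
    """Return the provider that would actually be loaded for `name`.

    Only exact matches on the legacy package name are remapped; a longer,
    fully qualified class name passes through untouched.
    """
    if remap_legacy:
        return LEGACY_ALIASES.get(name, name)
    return name
```

In this model `canonical_provider("com.databricks.spark.avro")` yields the built-in class, whereas a fully qualified name such as `com.databricks.spark.avro.DefaultSource`, or any unknown third-party package, is returned unchanged. Turning `remap_legacy` off corresponds to the alternative raised in the thread, where the name the user wrote is respected.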
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567045#comment-16567045 ] Thomas Graves commented on SPARK-24924: --- Why are we doing this? If a user ships the spark-avro databricks jar and references the com.databricks.spark.avro class, why do we want to map that to our built-in version, which might be different?
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560624#comment-16560624 ] Apache Spark commented on SPARK-24924: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/21906
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556427#comment-16556427 ] Apache Spark commented on SPARK-24924: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21878