[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605308#comment-16605308 ] Gengliang Wang commented on SPARK-24771:

[~vanzin] I am OK either way. Shading Avro 1.8 in the data source only seems reasonable, but I am not confident enough to make the change myself. Can you open a PR for it?

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Labels: release-notes
> Fix For: 2.4.0

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604877#comment-16604877 ] Marcelo Vanzin commented on SPARK-24771:

I ran a couple of our tests that exercise Avro and they worked fine with 2.4. They're not comprehensive, though:

- one uses the data source to read / write data, and that shouldn't really be affected by the change
- the other uses {{GenericRecord}}, so it doesn't really use generated Avro types.

So I don't really have a test that can say for sure what will break when you use generated types, which is the part that is explicitly called out as being changed in 1.8.

I still think it would be good to try to shade Avro 1.8 in the data source, and not expose it to other parts of Spark, but otherwise a strongly worded release note might be OK, although not optimal.
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583077#comment-16583077 ] Steve Loughran commented on SPARK-24771:

All the wire stuff (e.g. to HDFS) is protobuf. Rummaging around for Avro records, I only see them being used as a persistence format for the event history of the MR client.

h2. General Avro API use

Assume: not going to break with the version upgrade.

h3. @Stringable

Both Path and Text import/use org.apache.avro.reflect.Stringable and are tagged as @Stringable, a runtime attribute which tells Avro that toString() can be used to marshal it. Shouldn't even need Avro on the classpath.

h3. Package org.apache.hadoop.io.serializer.avro

Lets you serialize/deserialize Avro records. Declared as one of the default serializations in {{org.apache.hadoop.io.serializer.SerializationFactory}} if not overridden in the {{io.serializations}} conf option.

{code}
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization, org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, org.apache.hadoop.io.serializer.avro.AvroReflectSerialization</value>
  <description>A list of serialization classes that can be used for obtaining serializers and deserializers.</description>
</property>
{code}

Don't think that'll be brittle to change; it just means that Avro gets handled as a wire format. *I have no idea what would break here, or how*.

h3. class org.apache.hadoop.fs.AvroFSInput

Lets Avro use FSDataInputStreams as {{org.apache.avro.file.SeekableInput}} sources.

h2. Avro schemas and record generation (outside of tests of that serialization)

h3. org.apache.hadoop.mapreduce

* uses it for history events.
* defines records in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/avro/Events.avpr

Unless Spark is using the hadoop-mapreduce-client code, this isn't going to be directly relevant. If there's something downstream which needs to have Spark & MR coexist on the classpath, well, that'll be something for them to address.
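To make the {{io.serializations}} point above concrete, here is a minimal Java sketch of how a comma-separated class list like the quoted default could be split and checked for Avro entries. It is an illustration only and not Hadoop's own parsing code: the class {{SerializationList}} and its methods are hypothetical names, and only the three serialization class names are taken from the comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: inspects a comma-separated class list such as io.serializations. */
final class SerializationList {
    // Default value as quoted in the comment above.
    static final String DEFAULT_VALUE =
        "org.apache.hadoop.io.serializer.WritableSerialization, "
      + "org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, "
      + "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization";

    /** Splits on commas and trims whitespace, dropping empty entries. */
    static List<String> parse(String value) {
        List<String> names = new ArrayList<>();
        for (String part : value.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Returns the entries that live in the Avro serializer package. */
    static List<String> avroEntries(String value) {
        List<String> hits = new ArrayList<>();
        for (String name : parse(value)) {
            if (name.startsWith("org.apache.hadoop.io.serializer.avro.")) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Lists the Avro-backed serializations declared in the default value.
        for (String name : avroEntries(DEFAULT_VALUE)) {
            System.out.println(name);
        }
    }
}
```

A deployment that overrides {{io.serializations}} without the two Avro entries would produce an empty list here, which is one quick way to see whether the Avro serializations are declared at all.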
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583009#comment-16583009 ] Marcelo Vanzin commented on SPARK-24771: Do the Hadoop services use Avro for their protocol? I was under the impression they used protobufs. There are input format APIs for Avro, but maybe those don't get affected by the changes to the compiled classes... In any case, the Spark Avro integration seems to be in its own separate module, so maybe it would be possible to only upgrade Avro to 1.8 in that one module, and shade it? Then the rest of Spark could remain on 1.7.
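For readers unfamiliar with shading: build tools such as the Maven Shade Plugin can relocate a dependency by rewriting every reference under one package prefix so that it points at a renamed private copy. The Java sketch below illustrates only that renaming rule, under stated assumptions: the class {{Relocation}} and the target prefix {{org.example.shaded.avro.}} are hypothetical and are not names used by Spark.

```java
/** Illustrative only: the prefix-rewriting rule that class relocation (shading) applies. */
final class Relocation {
    private final String pattern;       // original package prefix, e.g. "org.apache.avro."
    private final String shadedPattern; // hypothetical relocated prefix

    Relocation(String pattern, String shadedPattern) {
        this.pattern = pattern;
        this.shadedPattern = shadedPattern;
    }

    /** Rewrites a class name under the relocated package; other names are returned unchanged. */
    String apply(String className) {
        return className.startsWith(pattern)
            ? shadedPattern + className.substring(pattern.length())
            : className;
    }

    public static void main(String[] args) {
        Relocation r = new Relocation("org.apache.avro.", "org.example.shaded.avro.");
        // A reference inside the shaded module would be rewritten ...
        System.out.println(r.apply("org.apache.avro.Schema"));
        // ... while unrelated classes are left alone.
        System.out.println(r.apply("org.apache.hadoop.fs.Path"));
    }
}
```

Under such a scheme the data source module would bind to its relocated Avro 1.8 classes, while user code compiled against {{org.apache.avro.*}} would keep resolving to whatever unshaded Avro version the rest of Spark ships.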
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582911#comment-16582911 ] Steve Loughran commented on SPARK-24771: Linking to the previous PR, as that's got the discussion on compatibility risks. Ignoring stuff related to Hadoop MR/YARN (which may need some shading, if not already done), this will need to be tagged as "you will need to recompile your Avro support".
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580644#comment-16580644 ] Wenchen Fan commented on SPARK-24771: I've sent the email, let's wait for the feedback.
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580062#comment-16580062 ] Marcelo Vanzin commented on SPARK-24771: Asking is a good start. But I have anecdotal evidence that there are quite a few people who use Avro/RDDs... not sure whether they're planning to move to SQL any time soon. In any case, it would be good to know exactly what breaks so that we can have a proper release note, instead of just dumping the problem on the user's lap.
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579243#comment-16579243 ] Wenchen Fan commented on SPARK-24771: It's good to pay more attention to compatibility issues. I've added the release-notes label to this ticket and created https://issues.apache.org/jira/browse/SPARK-25110 to track the Flume streaming connector. I'm not sure how many users use Avro with RDDs, but I feel it should be rare, as there was a spark-avro package available for Spark SQL. Shall we send an email to the user/dev lists?
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579213#comment-16579213 ] Marcelo Vanzin commented on SPARK-24771: The main problem pointed out in the original attempt is AVRO-1502. It means that people with code generated by Avro 1.7 might run into problems if Spark ships with Avro 1.8. That basically amounts to breaking binary compatibility, which we always try not to do in minor releases. It may be that it only applies in some specific situations and it may be acceptable to release note it. But it would be nice to be sure, since this affects existing users of Avro, including the Flume streaming connector, which is still available in Spark 2.4...
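When chasing a binary compatibility problem of this kind, a useful first step is confirming which Avro jar a job actually loads. The following Java sketch uses only standard reflection to report where a class came from. The class name {{ClasspathProbe}} is hypothetical, and the reported version may be {{null}} if the jar's manifest does not declare an Implementation-Version; the jar location usually carries the version in its file name anyway.

```java
import java.net.URL;
import java.security.CodeSource;

/** Illustrative only: reports which jar a class is loaded from, if it is present at all. */
final class ClasspathProbe {
    /** Describes the version and location of a class, or reports that it is absent. */
    static String describe(String className) {
        try {
            // Load without running static initializers.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            Package pkg = cls.getPackage();
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            URL location = (src == null) ? null : src.getLocation();
            return className + ": version=" + version + ", location=" + location;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + ": not on classpath";
        }
    }

    public static void main(String[] args) {
        // org.apache.avro.Schema is the core Avro schema class.
        System.out.println(describe("org.apache.avro.Schema"));
    }
}
```

Running this inside the driver and inside an executor task can show whether user-supplied Avro classes or the ones bundled with Spark win on each side.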
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579133#comment-16579133 ] Michael Heuer commented on SPARK-24771: I'm looking forward to testing this with [ADAM|https://github.com/bigdatagenomics/adam] and all of our downstream projects as part of the 2.4.0 release candidate process. If it is worth my time doing so before then, please let me know. Parquet + Avro is at the core of what we do, and having the 1.8 vs 1.7 internal conflict present in Spark resolved would be very welcome.
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579128#comment-16579128 ] Sean Owen commented on SPARK-24771: I confess I just don't know enough to have a strong opinion. A minor version upgrade isn't out of the question for a minor Spark upgrade. You are right that this is considered a non-core integration. It sounds like there are incompatibility issues. However, that cuts both ways: some users are facing problems because Spark _isn't_ on 1.8.x. If spark-avro is already on 1.8, I can see the need for Spark to update as well. Not blessing this so much as saying I don't object.
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579126#comment-16579126 ] Wenchen Fan commented on SPARK-24771: cc [~r...@databricks.com] [~srowen]
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579124#comment-16579124 ] Wenchen Fan commented on SPARK-24771: Sorry, I was not aware of https://issues.apache.org/jira/browse/SPARK-16617 . So we proposed to upgrade Avro 2 years ago, and gave up because it's not binary compatible and the benefit was not that large. I think things have changed now. This upgrade is super important to the Avro data source, for date/timestamp/decimal support. Also, as people pointed out in https://issues.apache.org/jira/browse/SPARK-16617 , this is an important bug fix for using Parquet and Avro together. BTW I don't think the impact is that large. Spark doesn't have a stable API for plugging in Avro support, so Avro users have to do some manual work to migrate to new Spark versions. As an example, I don't think the databricks spark-avro package can work with Spark 2.4 without modification.
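As background on the date/timestamp/decimal support mentioned above, the Avro specification defines these logical types as annotations on primitive types: {{date}} is an int counting days since the Unix epoch, {{timestamp-millis}} is a long counting milliseconds since the epoch, and {{decimal}} stores the unscaled value as big-endian two's-complement bytes with the scale held in the schema. The Java sketch below shows just those encodings using the standard library. It does not call the Avro API, and the class {{LogicalTypeEncodings}} and its method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.Instant;
import java.time.LocalDate;

/** Illustrative only: the value encodings behind Avro's date, timestamp-millis and decimal logical types. */
final class LogicalTypeEncodings {
    /** date: an int counting days since 1970-01-01. */
    static int encodeDate(LocalDate date) {
        return Math.toIntExact(date.toEpochDay());
    }

    /** timestamp-millis: a long counting milliseconds since 1970-01-01T00:00:00Z. */
    static long encodeTimestampMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    /**
     * decimal: big-endian two's-complement bytes of the unscaled value.
     * The scale is fixed by the schema; setScale throws ArithmeticException
     * if the value cannot be represented at that scale without rounding.
     */
    static byte[] encodeDecimal(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().toByteArray();
    }

    /** Rebuilds the decimal from its bytes and the schema-declared scale. */
    static BigDecimal decodeDecimal(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        byte[] encoded = encodeDecimal(new BigDecimal("12.34"), 2);
        System.out.println(decodeDecimal(encoded, 2));
    }
}
```

Without logical type support, a reader sees only the underlying int, long or bytes, which is why the data source needs a library version that understands these annotations to map them to Spark SQL types.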
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579005#comment-16579005 ] Marcelo Vanzin commented on SPARK-24771: Hi guys, why was this accepted? It has been tried in the past and we re-targeted it to 3.0 because Avro 1.8 is not backwards compatible with Avro 1.7: https://issues.apache.org/jira/browse/SPARK-16617 In particular the discussion in the PR is helpful here: https://github.com/apache/spark/pull/17163
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543319#comment-16543319 ] Apache Spark commented on SPARK-24771: User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21761