[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-09-06 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605308#comment-16605308
 ] 

Gengliang Wang commented on SPARK-24771:


[~vanzin] I am OK either way. Shading Avro 1.8 only in the data source seems 
reasonable.
But I am not confident enough to make the change myself. Can you open a PR for it?



> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-09-05 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604877#comment-16604877
 ] 

Marcelo Vanzin commented on SPARK-24771:


I ran a couple of our tests that exercise avro and they worked fine with 2.4. 
They're not comprehensive, though:

- one uses the data source to read / write data, and that shouldn't really be 
affected by the change
- the other uses {{GenericRecord}}, so it doesn't really use generated Avro 
types.

So I don't really have a test that can say for sure what will break when you 
use generated types, which is the part that is explicitly called out as being 
changed in 1.8. I still think it would be good to try to shade Avro 1.8 in the 
data source and not expose it to other parts of Spark; otherwise, a 
strongly worded release note might be OK, although not optimal.




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-16 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583077#comment-16583077
 ] 

Steve Loughran commented on SPARK-24771:


All the wire stuff (e.g. to HDFS) is protobuf. Rummaging around for Avro 
records, I only see them being used as a persistence format for the event 
history of the MR client. 



h2. General Avro API use

Assume: not going to break with the version upgrade.

h3. @Stringable

Both Path and Text import org.apache.avro.reflect.Stringable and are tagged 
as @Stringable,
a runtime annotation which tells Avro that toString() can be used to marshal the object.
This shouldn't even need Avro on the classpath.

h3. Package org.apache.hadoop.io.serializer.avro

Lets you serialize/deserialize Avro records.

Declared as one of the default serializations in 
{{org.apache.hadoop.io.serializer.SerializationFactory}}
if not overridden in the {{io.serializations}} conf option.

{code}
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization, org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, org.apache.hadoop.io.serializer.avro.AvroReflectSerialization</value>
  <description>A list of serialization classes that can be used for
  obtaining serializers and deserializers.</description>
</property>
{code}

I don't think that'll be brittle to change; it just means that Avro gets handled 
as a wire format.

*I have no idea what would break here, or how*.
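
As a rough illustration of the lookup described above, the following Python sketch models how a comma-separated {{io.serializations}} value might be resolved, falling back to the defaults when the option is not set. It is not Hadoop code: the {{conf}} dict and the helper names {{resolve_serializations}} and {{avro_serializations}} are assumptions made for illustration. Only the option name and the three class names come from the config above.

```python
# Hypothetical model of the lookup; the real logic lives in
# org.apache.hadoop.io.serializer.SerializationFactory.
DEFAULT_SERIALIZATIONS = [
    "org.apache.hadoop.io.serializer.WritableSerialization",
    "org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization",
    "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization",
]


def resolve_serializations(conf):
    """Return the serialization class names, using the defaults if unset."""
    raw = conf.get("io.serializations")
    if raw is None:
        return list(DEFAULT_SERIALIZATIONS)
    # Split on commas, dropping surrounding whitespace and empty entries.
    return [name.strip() for name in raw.split(",") if name.strip()]


def avro_serializations(conf):
    """Return only the entries that come from the Avro serializer package."""
    prefix = "org.apache.hadoop.io.serializer.avro."
    return [n for n in resolve_serializations(conf) if n.startswith(prefix)]
```

Under this model, a deployment that overrides {{io.serializations}} with only the Writable serialization would not touch the Avro classes through this path at all.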

h3. class org.apache.hadoop.fs.AvroFSInput

Lets Avro use FSDataInputStreams as {{org.apache.avro.file.SeekableInput}} 
sources.

h2. Avro schemas and record generation

(outside of tests of that serialization)

h3. org.apache.hadoop.mapreduce

* uses it for history events. 
* defines records in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/avro/Events.avpr

Unless Spark is using the hadoop-mapreduce-client code, this isn't going to be 
directly relevant.

If there's something downstream which needs to have Spark and MR coexist on the 
classpath,
well, that'll be something for them to address.





[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-16 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583009#comment-16583009
 ] 

Marcelo Vanzin commented on SPARK-24771:


Do the Hadoop services use Avro for their protocol? I was under the impression 
they used protobufs. There are input format APIs for Avro, but maybe those 
don't get affected by the changes to the compiled classes...

In any case, the spark avro integration seems to be in its own separate module, 
so maybe it would be possible to only upgrade avro to 1.8 in that one module, 
and shade it? Then the rest of Spark could remain on 1.7.
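
To make the shading idea a bit more concrete, here is a small Python sketch of what package relocation does at the class-name level: references under one package prefix are rewritten to a private prefix, so two Avro versions could coexist in one JVM. The shaded prefix and the helper {{relocate}} are hypothetical, chosen only for illustration; a real build would do this with a bytecode-rewriting tool rather than string handling.

```python
AVRO_PREFIX = "org.apache.avro"
# Hypothetical target package; not taken from the Spark build.
SHADED_PREFIX = "org.apache.spark.sql.avro.shaded.org.apache.avro"


def relocate(class_name, pattern=AVRO_PREFIX, shaded=SHADED_PREFIX):
    """Rewrite class_name if it lives in `pattern` or one of its subpackages."""
    if class_name == pattern or class_name.startswith(pattern + "."):
        return shaded + class_name[len(pattern):]
    # Anything outside the pattern (including look-alike prefixes) is untouched.
    return class_name
```

With a mapping like this, the intent would be that the data source module sees Avro 1.8 under the private name, while user code compiled against {{org.apache.avro}} keeps resolving to the 1.7 classes on the classpath.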




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-16 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582911#comment-16582911
 ] 

Steve Loughran commented on SPARK-24771:


Linking to the previous PR, as that's got the discussion on compatibility 
risks. 

Ignoring stuff related to Hadoop MR/Yarn (which may need some shading, if not 
already done), this will need to be tagged as "you will need to recompile your 
avro support"




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580644#comment-16580644
 ] 

Wenchen Fan commented on SPARK-24771:
-

I've sent the email, let's wait for the feedback.




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580062#comment-16580062
 ] 

Marcelo Vanzin commented on SPARK-24771:


Asking is a good start. But I have anecdotal evidence that there are quite a 
few people who use Avro/RDDs... not sure whether they're planning to move to 
SQL any time soon.

In any case, it would be good to know exactly what breaks so that we can have a 
proper release note, instead of just dumping the problem on the user's lap.




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579243#comment-16579243
 ] 

Wenchen Fan commented on SPARK-24771:
-

It's good to pay more attention to compatibility issues. I've added the 
release-notes label to this ticket and created 
https://issues.apache.org/jira/browse/SPARK-25110 to track the Flume streaming 
connector.

I'm not sure how many users use Avro with RDDs, but I feel it should be 
rare, as there was a spark-avro package available for Spark SQL. Shall we send 
an email to the user/dev lists?




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579213#comment-16579213
 ] 

Marcelo Vanzin commented on SPARK-24771:


The main problem pointed out in the original attempt is AVRO-1502; it means 
that people with code generated by Avro 1.7 might run into problems if Spark 
ships with Avro 1.8; that basically amounts to a binary compatibility issue, 
which we always try not to break in minor releases.

It may be that it only applies in some specific situations and it may be 
acceptable to release note it. But it would be nice to be sure, since this 
affects existing users of Avro - including the Flume streaming connector which 
is still available in Spark 2.4...




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579133#comment-16579133
 ] 

Michael Heuer commented on SPARK-24771:
---

I'm looking forward to testing this with 
[ADAM|https://github.com/bigdatagenomics/adam] and all of our downstream 
projects as part of the 2.4.0 release candidate process.  If it is worth my 
time doing so before then, please let me know.  Parquet + Avro is at the core 
of what we do, and having the 1.8 vs 1.7 internal conflict present in Spark 
resolved would be very welcome.




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579128#comment-16579128
 ] 

Sean Owen commented on SPARK-24771:
---

I confess I just don't know enough to have a strong opinion. A minor version 
upgrade isn't out of the question for a minor Spark upgrade. You are right that 
this is considered a non-core integration. It sounds like there are 
incompatibility issues. However that cuts two ways; some users are facing 
problems because Spark _isn't_ on 1.8.x. If spark-avro is already on 1.8, I can 
see the need for Spark to update as well. Not blessing this so much as saying I 
don't object.




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579126#comment-16579126
 ] 

Wenchen Fan commented on SPARK-24771:
-

cc [~r...@databricks.com] [~srowen]




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579124#comment-16579124
 ] 

Wenchen Fan commented on SPARK-24771:
-

Sorry I was not aware of https://issues.apache.org/jira/browse/SPARK-16617 .

So we proposed upgrading Avro two years ago, and gave up on it because it's not 
binary compatible and the benefit was not that large.

I think things have changed now. This upgrade is very important to the Avro 
data source, for date/timestamp/decimal support. Also, as people pointed out in 
https://issues.apache.org/jira/browse/SPARK-16617 , this is an important bug 
fix for using Parquet with Avro.
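
For context on the date/timestamp/decimal point: the Avro specification expresses such values as logical types layered on primitive types. The sketch below, in plain Python with no Avro library, shows a schema fragment and the underlying values such fields carry. The record name, field names, and helper functions are made up for illustration; only the {{logicalType}} annotations and their encodings follow the specification.

```python
import json
from datetime import date, datetime, timezone
from decimal import Decimal

# Hypothetical record; the logicalType annotations follow the Avro spec.
schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "day", "type": {"type": "int", "logicalType": "date"}},
        {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal",
                                    "precision": 10, "scale": 2}},
    ],
}
schema_json = json.dumps(schema, indent=2)

EPOCH_DAY = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_date(d):
    """date -> days since the Unix epoch (stored as an Avro int)."""
    return (d - EPOCH_DAY).days


def encode_timestamp_millis(ts):
    """Timezone-aware datetime -> milliseconds since the Unix epoch (Avro long)."""
    delta = ts - EPOCH_TS
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def encode_decimal(value, scale):
    """Decimal -> big-endian two's-complement bytes of the unscaled integer.

    Assumes `value` has no more fractional digits than `scale`.
    """
    unscaled = int(value.scaleb(scale).to_integral_exact())
    length = max(1, (unscaled.bit_length() + 8) // 8)  # leave room for the sign bit
    return unscaled.to_bytes(length, byteorder="big", signed=True)
```

A reader that does not understand these annotations would see only a plain int, long, or byte array, which is why library support for logical types matters to the data source.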

BTW I don't think the impact is that large. Spark doesn't have a stable API for 
plugging in Avro support, so Avro users have to do some manual work to migrate to 
new Spark versions anyway. As an example, I don't think the databricks spark-avro 
package can work with Spark 2.4 without modification.




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579005#comment-16579005
 ] 

Marcelo Vanzin commented on SPARK-24771:


Hi guys, why was this accepted? It has been tried in the past and we 
re-targeted it to 3.0 because Avro 1.8 is not backwards compatible with Avro 
1.7:

https://issues.apache.org/jira/browse/SPARK-16617

In particular the discussion in the PR is helpful here:
https://github.com/apache/spark/pull/17163




[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-07-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543319#comment-16543319
 ] 

Apache Spark commented on SPARK-24771:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21761
