[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2023-01-05 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654965#comment-17654965
 ] 

dzcxzl commented on SPARK-35744:


This problem should be resolved by the upgrade to Avro 1.11.0 
([AVRO-3186|https://issues.apache.org/jira/browse/AVRO-3186]) in 
[SPARK-37206|https://issues.apache.org/jira/browse/SPARK-37206]. We should be 
able to close this ticket.

> Performance degradation in avro SpecificRecordBuilders
> --
>
> Key: SPARK-35744
> URL: https://issues.apache.org/jira/browse/SPARK-35744
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Steven Aerts
> Priority: Minor
>
> Creating this bug to let you know that when we tested Spark 3.2.0 we saw
> a significant performance degradation where our code handles Avro
> Specific Record objects.  This slowed down some of our jobs by a factor of 4.
> Spark 3.2.0 upgrades the Avro version from 1.8.2 to 1.10.2.
> The degradation was caused by a change introduced in Avro 1.9.0.  This change
> degrades performance when creating Avro specific records in certain
> classloader topologies, like the ones used in Spark.
> We notified the Avro project and [proposed|https://github.com/apache/avro/pull/1253] a simple
> fix upstream.  (The links contain more details.)
> It is unclear to us how many other projects use Avro specific records
> in a Spark context and will be impacted by this degradation.
> Feel free to close this issue if you think it is too much of a
> corner case.
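
The report above ties the slowdown to classloader topologies. As an illustration only, here is a minimal plain-JDK sketch. It assumes the cost comes from repeatedly looking a class up by name through a loader that cannot see it; that assumption is not confirmed by this thread. {{CountingLoader}}, {{CachingLookup}} and {{com.example.MyRecord}} are hypothetical names. This is not Avro's code, and the upstream fix may well take a different approach.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ClassLookupSketch {

    /** Hypothetical loader that cannot see the requested class and counts each failed search. */
    static final class CountingLoader extends ClassLoader {
        final AtomicInteger misses = new AtomicInteger();

        CountingLoader() {
            super(null); // no parent: only bootstrap classes are visible
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            misses.incrementAndGet(); // a real loader would search its classpath here
            throw new ClassNotFoundException(name);
        }
    }

    /** Uncached lookup by name: every call asks the loader again, even after a miss. */
    static Optional<Class<?>> lookup(ClassLoader loader, String name) {
        try {
            return Optional.of(Class.forName(name, false, loader));
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        }
    }

    /** Remembers hits and misses per class name for one fixed loader. */
    static final class CachingLookup {
        private final ClassLoader loader;
        private final Map<String, Optional<Class<?>>> cache = new ConcurrentHashMap<>();

        CachingLookup(ClassLoader loader) {
            this.loader = loader;
        }

        Optional<Class<?>> lookup(String name) {
            return cache.computeIfAbsent(name, n -> ClassLookupSketch.lookup(loader, n));
        }
    }

    public static void main(String[] args) {
        String name = "com.example.MyRecord"; // hypothetical record class name

        CountingLoader uncached = new CountingLoader();
        for (int i = 0; i < 1000; i++) {
            lookup(uncached, name); // each call triggers a fresh failed search
        }
        System.out.println("uncached searches: " + uncached.misses.get());

        CountingLoader cached = new CountingLoader();
        CachingLookup caching = new CachingLookup(cached);
        for (int i = 0; i < 1000; i++) {
            caching.lookup(name); // only the first call reaches the loader
        }
        System.out.println("cached searches: " + cached.misses.get());
    }
}
```

Caching a miss is only safe if the class cannot become visible to that loader later, so handing the needed object to the caller directly, where possible, avoids the lookup altogether.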



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2021-06-21 Thread Steven Aerts (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367036#comment-17367036
 ] 

Steven Aerts commented on SPARK-35744:
--

Hi [~xkrogen] ,

We took a different path.  It more closely follows the flow of 
{{org.apache.spark.sql.catalyst.JavaTypeInference}}.
You create an {{ExpressionEncoder}} by calling 
{{AvroSpecificRecordEncoder.from(classOf[MySpecificRecord])}}, which takes 
the schema of {{MySpecificRecord}} and generates from it the 
expressions/code to serialize and deserialize to and from the generated classes.

The resulting {{StructType}} for the class matches the one you expect, as it 
internally uses {{SchemaConverters.toSqlType(schema)}}.  This means it is 
compatible with all other Avro handling within Spark.

The code is rather complete, performant and standalone.  It supports (almost) 
all Avro constructs.  The test set around it, however, is more entangled, as it 
uses internal classes.

Feel free to contact me at https://github.com/steven-aerts.




[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2021-06-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366892#comment-17366892
 ] 

Erik Krogen commented on SPARK-35744:
-

[~steven.aerts] going a bit off topic from this JIRA, but out of curiosity -- 
is your work based off of SPARK-25789 / [PR 
#22878|https://github.com/apache/spark/pull/22878]? We (LinkedIn) also maintain 
an {{AvroEncoder}} for {{SpecificRecord}} classes which is based off of that 
PR. We've also been planning to make another effort to push this upstream since 
the attempt in #22878 eventually stalled. I'd be interested in learning more 
about your work and potentially collaborating here.




[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2021-06-14 Thread Steven Aerts (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363119#comment-17363119
 ] 

Steven Aerts commented on SPARK-35744:
--

[~xkrogen] in the past we used them solely in RDDs, which I guess is the most 
common use case.

But we implemented an {{AvroSpecificRecordEncoder}}, an 
{{ExpressionEncoder}} that allows you to use them in the {{Dataset}} API.

Btw: this is custom code, but if anybody is interested, we can push it 
upstream.  It is an encoder for {{SpecificRecord}}s that is faster than the 
generic Avro encoder in Spark, because it is code-generated for the specific 
schema you use.




[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2021-06-14 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363029#comment-17363029
 ] 

Erik Krogen commented on SPARK-35744:
-

[~steven.aerts] can you elaborate on where you're using {{SpecificRecord}} 
classes? Are you using them via RDD APIs? I ask since as you mentioned, the 
{{spark-avro}} package leverages {{GenericData}} APIs and shouldn't be 
susceptible to this problem.




[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2021-06-14 Thread Steven Aerts (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362804#comment-17362804
 ] 

Steven Aerts commented on SPARK-35744:
--

[~Gengliang.Wang] in the Avro Java/Scala world there are two ways of handling 
data.

You can use 
[GenericData|https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/generic/GenericData.html], 
which gives you a generic way to handle any Avro data.  This is also what 
spark-avro uses internally.

The other option is to use 
[SpecificData|https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/specific/SpecificData.html], 
where you let the [Avro codegen 
generate|https://avro.apache.org/docs/1.10.2/gettingstartedjava.html#Serializing+and+deserializing+with+code+generation] 
classes specifically for your Avro schema and then use those classes.  If you 
use these classes in Spark you will hit the issue mentioned.

I am not sure how common this issue is, and I would totally understand if you 
closed this issue as too exotic.
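
To make the contrast between the two styles concrete, here is a rough plain-JDK sketch. {{GenericRec}} and {{User}} are hypothetical stand-ins written for this illustration, not Avro classes. Real code would use {{GenericData}} records on one side and classes produced by the Avro code generator, with their builders, on the other.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GenericVsSpecificSketch {

    /** Generic style: one container type for any schema, fields addressed by name at runtime. */
    static final class GenericRec {
        private final List<String> fields; // stand-in for a schema: just the field names
        private final Map<String, Object> values = new LinkedHashMap<>();

        GenericRec(List<String> fields) {
            this.fields = fields;
        }

        GenericRec put(String field, Object value) {
            if (!fields.contains(field)) {
                throw new IllegalArgumentException("Not in schema: " + field);
            }
            values.put(field, value);
            return this;
        }

        Object get(String field) {
            return values.get(field);
        }
    }

    /** Specific style: a class written for exactly one schema, with typed accessors and a builder. */
    static final class User {
        private final String name;
        private final int favoriteNumber;

        private User(String name, int favoriteNumber) {
            this.name = name;
            this.favoriteNumber = favoriteNumber;
        }

        String getName() {
            return name;
        }

        int getFavoriteNumber() {
            return favoriteNumber;
        }

        static Builder newBuilder() {
            return new Builder();
        }

        static final class Builder {
            private String name;
            private int favoriteNumber;

            Builder setName(String value) {
                this.name = value;
                return this;
            }

            Builder setFavoriteNumber(int value) {
                this.favoriteNumber = value;
                return this;
            }

            User build() {
                return new User(name, favoriteNumber);
            }
        }
    }

    public static void main(String[] args) {
        // Generic: field names are strings and values come back as Object.
        GenericRec generic = new GenericRec(List.of("name", "favoriteNumber"))
                .put("name", "Ada")
                .put("favoriteNumber", 7);
        System.out.println(generic.get("name"));

        // Specific: the compiler checks field names and types; records are created via a builder.
        User user = User.newBuilder().setName("Ada").setFavoriteNumber(7).build();
        System.out.println(user.getName() + " " + user.getFavoriteNumber());
    }
}
```

The builder step in the second style is where a record object gets created, which is the code path the issue title refers to.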




[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2021-06-14 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362780#comment-17362780
 ] 

Gengliang Wang commented on SPARK-35744:


[~steven.aerts] Thanks for reporting.
Could you tell us more about how you use {{SpecificRecordBuilderBase}}? I think we 
can keep this open and upgrade the Avro version after it is fixed in Apache Avro.




[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders

2021-06-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362693#comment-17362693
 ] 

Hyukjin Kwon commented on SPARK-35744:
--

cc [~Gengliang.Wang] FYI
