Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-19 Thread Wenchen Fan
+1, thanks for making the release!

On Sat, Feb 17, 2024 at 3:54 AM Sean Owen  wrote:

> Yeah let's get that fix in, but it seems to be a minor test only issue so
> should not block release.
>
> On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
>
>> Very sorry. When I was fixing SPARK-45242 (
>> https://github.com/apache/spark/pull/43594), I noticed that the
>> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
>> didn't realize that it had also been merged into branch-3.5, so I didn't
>> advocate for SPARK-45357 to be backported to branch-3.5.
>>
>>
>>
>> As far as I know, the condition that triggers this test failure is: when
>> using Maven to test the `connect` module, if `sparkTestRelation` in
>> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
>> then the `id` of `sparkTestRelation` will no longer be 0. So I think this
>> is indeed related to the order in which Maven executes the test cases in
>> the `connect` module.
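The order-dependence described above can be sketched with a toy model (the names `Relation` and `_next_plan_id` are illustrative, not Spark's actual internals): each new plan takes an id from a process-wide counter, so the id a suite observes depends on how many plans earlier suites created.

```python
import itertools

# Process-wide counter, analogous to ids assigned once per JVM.
_next_plan_id = itertools.count(0)

class Relation:
    """Toy stand-in for a DataFrame's logical plan."""
    def __init__(self):
        self.id = next(_next_plan_id)

# If the suite under discussion runs first, its test relation gets id 0 ...
first = Relation()
assert first.id == 0

# ... but if another suite created a plan beforehand, the id shifts,
# and any golden plan that hard-codes id 0 no longer matches.
other_suite_plan = Relation()
proto_suite_relation = Relation()
assert proto_suite_relation.id != 0
```

This is why the failure shows up only under a test runner (here, Maven) that happens to schedule other suites first.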
>>
>>
>>
>> I have submitted a backport PR
>>  to branch-3.5, and if
>> necessary, we can merge it to fix this test issue.
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Friday, February 16, 2024, 22:15
>> *To:* Sean Owen , Rui Wang 
>> *Cc:* dev 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>>
>>
>>
>> I traced back relevant changes and got a sense of what happened.
>>
>>
>>
>> Yangjie figured out the issue in the linked discussion.
>> It's a tricky issue according to the comments from Yangjie - the test
>> depends on the execution order of the test suites. He said it does not
>> fail in sbt, hence the CI build couldn't catch it.
>>
>> He fixed it in the linked PR,
>> but we missed that the offending commit had also been ported back to 3.5,
>> hence the fix wasn't ported back to 3.5.
>>
>>
>>
>> Surprisingly, I can't reproduce it locally, even with Maven. In my attempt
>> to reproduce, SparkConnectProtoSuite was executed third:
>> SparkConnectStreamingQueryCacheSuite, then ExecuteEventsManagerSuite, and
>> then SparkConnectProtoSuite. Maybe it's very specific to the environment,
>> not just Maven? My env: MBP M1 Pro chip, macOS 14.3.1, OpenJDK 17.0.9. I
>> used build/mvn (Maven 3.8.8).
>>
>>
>>
>> I'm not 100% sure this is something we should fail the release over, as
>> it's test-only and sounds very environment-dependent, but I'll respect
>> your call on the vote.
>>
>>
>>
>> Btw, it looks like Rui also made a relevant fix in the linked PR (not
>> to fix the failing test but to fix other issues), and this also wasn't
>> ported back to 3.5. @Rui Wang, do you think this
>> is a regression issue and warrants a new RC?
>>
>>
>>
>>
>>
>> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
>>
>> Is anyone seeing this Spark Connect test failure? Then again, I have some
>> weird issue with this env that always fails 1 or 2 tests that nobody else
>> can replicate.
>>
>>
>>
>> - Test observe *** FAILED ***
>>   == FAIL: Plans do not match ===
>>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 44
>>    +- LocalRelation , [id#0, name#0]   +- LocalRelation , [id#0, name#0]
>> (PlanTest.scala:179)
>>
>>
>>
>> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>> DISCLAIMER: The RC for Apache Spark 3.5.1 starts with RC2, as I belatedly
>> found a doc generation issue after tagging RC1.
>>
>>
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.5.1.
>>
>> The vote is open until February 18th 9AM (PST) and passes if a majority of
>> +1 PMC votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>> 
>>
>> The tag to be voted on is v3.5.1-rc2 (commit
>> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
>> https://github.com/apache/spark/tree/v3.5.1-rc2
>> 
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
>> 

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-19 Thread Jungtaek Lim
Thanks Sean, let's continue the process for this RC.

+1 (non-binding)

- downloaded all files from URL
- checked signature
- extracted all archives
- ran all tests from the sources in the source archive, via "sbt clean test
package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.
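The verification steps above can be sketched as a script. This is a minimal sketch of the checksum part only: the file below is a locally created stand-in, since the real archives and their `.sha512` files live under the dist.apache.org URL from the vote mail.

```shell
#!/bin/sh
set -e

# Stand-in for a downloaded release archive; a real verification would
# fetch the .tgz and its .sha512 from dist.apache.org instead.
printf 'release bits' > spark-3.5.1-rc2-example.tgz

# The .sha512 file pairs the digest with the file name, so `sha512sum -c`
# can recompute and compare in one step.
sha512sum spark-3.5.1-rc2-example.tgz > spark-3.5.1-rc2-example.tgz.sha512
sha512sum -c spark-3.5.1-rc2-example.tgz.sha512

# The signature step would be: gpg --verify <archive>.asc <archive>,
# after importing the release manager's key from the project's KEYS file.
```

On success `sha512sum -c` prints the file name followed by `OK` and exits 0, which is what makes it convenient in a scripted check.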

Also bumping dev@ to encourage participation - it looks like the timing is
not great for US folks, but let's give it a few more days.


On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:

> Yeah let's get that fix in, but it seems to be a minor test only issue so
> should not block release.
>
> On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
>
>> Very sorry. When I was fixing SPARK-45242 (
>> https://github.com/apache/spark/pull/43594), I noticed that the
>> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
>> didn't realize that it had also been merged into branch-3.5, so I didn't
>> advocate for SPARK-45357 to be backported to branch-3.5.
>>
>>
>>
>> As far as I know, the condition that triggers this test failure is: when
>> using Maven to test the `connect` module, if `sparkTestRelation` in
>> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
>> then the `id` of `sparkTestRelation` will no longer be 0. So I think this
>> is indeed related to the order in which Maven executes the test cases in
>> the `connect` module.
>>
>>
>>
>> I have submitted a backport PR
>>  to branch-3.5, and if
>> necessary, we can merge it to fix this test issue.
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Friday, February 16, 2024, 22:15
>> *To:* Sean Owen , Rui Wang 
>> *Cc:* dev 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>>
>>
>>
>> I traced back relevant changes and got a sense of what happened.
>>
>>
>>
>> Yangjie figured out the issue in the linked discussion.
>> It's a tricky issue according to the comments from Yangjie - the test
>> depends on the execution order of the test suites. He said it does not
>> fail in sbt, hence the CI build couldn't catch it.
>>
>> He fixed it in the linked PR,
>> but we missed that the offending commit had also been ported back to 3.5,
>> hence the fix wasn't ported back to 3.5.
>>
>>
>>
>> Surprisingly, I can't reproduce it locally, even with Maven. In my attempt
>> to reproduce, SparkConnectProtoSuite was executed third:
>> SparkConnectStreamingQueryCacheSuite, then ExecuteEventsManagerSuite, and
>> then SparkConnectProtoSuite. Maybe it's very specific to the environment,
>> not just Maven? My env: MBP M1 Pro chip, macOS 14.3.1, OpenJDK 17.0.9. I
>> used build/mvn (Maven 3.8.8).
>>
>>
>>
>> I'm not 100% sure this is something we should fail the release over, as
>> it's test-only and sounds very environment-dependent, but I'll respect
>> your call on the vote.
>>
>>
>>
>> Btw, it looks like Rui also made a relevant fix in the linked PR (not
>> to fix the failing test but to fix other issues), and this also wasn't
>> ported back to 3.5. @Rui Wang, do you think this
>> is a regression issue and warrants a new RC?
>>
>>
>>
>>
>>
>> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
>>
>> Is anyone seeing this Spark Connect test failure? Then again, I have some
>> weird issue with this env that always fails 1 or 2 tests that nobody else
>> can replicate.
>>
>>
>>
>> - Test observe *** FAILED ***
>>   == FAIL: Plans do not match ===
>>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 44
>>    +- LocalRelation , [id#0, name#0]   +- LocalRelation , [id#0, name#0]
>> (PlanTest.scala:179)
>>
>>
>>
>> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>> DISCLAIMER: The RC for Apache Spark 3.5.1 starts with RC2, as I belatedly
>> found a doc generation issue after tagging RC1.
>>
>>
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.5.1.
>>
>> The vote is open until February 18th 9AM (PST) and passes if a majority of
>> +1 PMC votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>> 
>>
>> The tag to be voted on is v3.5.1-rc2 (commit
>> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
>> https://github.com/apache/spark/tree/v3.5.1-rc2

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Mon, 19 Feb 2024 at 17:24, Chao Sun  wrote:

> Hi Mich,
>
> > Also have you got some benchmark results from your tests that you can
> possibly share?
>
> We only have some partial benchmark results internally so far. Once
> shuffle and better memory management have been introduced, we plan to
> publish the benchmark results (at least TPC-H) in the repo.
>
> > Compared to standard Spark, what kind of performance gains can be
> expected with Comet?
>
> Currently, users could benefit from Comet in a few areas:
> - Parquet read: a few improvements have been made against reading from S3
> in particular, so users can expect better scan performance in this scenario
> - Hash aggregation
> - Columnar shuffle
> - Decimals (Java's BigDecimal is pretty slow)
>
> > Can one use Comet on k8s in conjunction with something like a Volcano
> addon?
>
> I think so. Comet is mostly orthogonal to the Spark scheduler framework.
>
> Chao
>
>
>
>
>
>
> On Fri, Feb 16, 2024 at 5:39 AM Mich Talebzadeh 
> wrote:
>
>> Hi Chao,
>>
>> As a cool feature
>>
>>
>>- Compared to standard Spark, what kind of performance gains can be
>>expected with Comet?
>>-  Can one use Comet on k8s in conjunction with something like a
>>Volcano addon?
>>
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge, sourced from both personal expertise and other resources, but
>> of course cannot be guaranteed. It is essential to note that, as with any
>> advice, one verified and tested result holds more weight than a thousand
>> expert opinions.
>>
>>
>> On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:
>>
>>> Hi all,
>>>
>>> We are very happy to announce that Project Comet, a plugin to
>>> accelerate Spark query execution by leveraging DataFusion and Arrow,
>>> has now been open sourced under the Apache Arrow umbrella. Please
>>> check the project repo
>>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> you are interested. We'd love to collaborate with people from the open
>>> source community who share similar goals.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich,

> Also have you got some benchmark results from your tests that you can
possibly share?

We only have some partial benchmark results internally so far. Once shuffle
and better memory management have been introduced, we plan to publish the
benchmark results (at least TPC-H) in the repo.

> Compared to standard Spark, what kind of performance gains can be
expected with Comet?

Currently, users could benefit from Comet in a few areas:
- Parquet read: a few improvements have been made against reading from S3
in particular, so users can expect better scan performance in this scenario
- Hash aggregation
- Columnar shuffle
- Decimals (Java's BigDecimal is pretty slow)

> Can one use Comet on k8s in conjunction with something like a Volcano
addon?

I think so. Comet is mostly orthogonal to the Spark scheduler framework.
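For readers who want to try it, enabling Comet amounts to putting the Comet jar on the classpath and setting a few Spark confs; because it hooks in via the generic `spark.plugins` mechanism rather than the scheduler, this is consistent with it being orthogonal to schedulers such as Volcano. The jar name, plugin class, and `spark.comet.*` keys below are assumptions taken from the project's early documentation and may change; consult the repo for the current values.

```shell
# Sketch only: names and keys are assumptions from Comet's early docs; see
# https://github.com/apache/arrow-datafusion-comet for the current values.
spark-submit \
  --jars comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  your-app.py
```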

Chao






On Fri, Feb 16, 2024 at 5:39 AM Mich Talebzadeh 
wrote:

> Hi Chao,
>
> As a cool feature
>
>
>- Compared to standard Spark, what kind of performance gains can be
>expected with Comet?
>-  Can one use Comet on k8s in conjunction with something like a
>Volcano addon?
>
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge, sourced from both personal expertise and other resources, but
> of course cannot be guaranteed. It is essential to note that, as with any
> advice, one verified and tested result holds more weight than a thousand
> expert opinions.
>
>
> On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:
>
>> Hi all,
>>
>> We are very happy to announce that Project Comet, a plugin to
>> accelerate Spark query execution by leveraging DataFusion and Arrow,
>> has now been open sourced under the Apache Arrow umbrella. Please
>> check the project repo
>> https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>