Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Sean Owen
Yeah, let's get that fix in, but it seems to be a minor test-only issue, so it
should not block the release.

On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:

> Very sorry. When I was fixing `SPARK-45242 (
> https://github.com/apache/spark/pull/43594)`
> , I noticed that the
> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
> didn't realize that it had also been merged into branch-3.5, so I didn't
> advocate for SPARK-45357 to be backported to branch-3.5.
>
>
>
> As far as I know, the condition to trigger this test failure is: when
> using Maven to test the `connect` module, if `sparkTestRelation` in
> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
> is indeed related to the order in which Maven executes the test cases in
> the `connect` module.
>
>
>
> I have submitted a backport PR to branch-3.5, and if
> necessary, we can merge it to fix this test issue.
>
>
>
> Jie Yang
>
>
>
> *From:* Jungtaek Lim
> *Date:* Friday, February 16, 2024, 22:15
> *To:* Sean Owen, Rui Wang
> *Cc:* dev
> *Subject:* Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>
>
>
> I traced back relevant changes and got a sense of what happened.
>
>
>
> Yangjie figured out the issue via link.
> It's a tricky issue according to the comments from Yangjie - the test
> depends on the order in which the test suites are executed. He said it
> does not fail in sbt, hence the CI build couldn't catch it.
>
> He fixed it via link,
> but we missed that the offending commit had also been ported back to 3.5,
> hence the fix wasn't ported back to 3.5.
>
>
>
> Surprisingly, I can't reproduce it locally even with Maven. In my attempt
> to reproduce, SparkConnectProtoSuite was executed third:
> SparkConnectStreamingQueryCacheSuite and ExecuteEventsManagerSuite ran
> first, and then SparkConnectProtoSuite. Maybe it is very specific to the
> environment, not just Maven? My env: MBP with an M1 Pro chip, macOS
> 14.3.1, OpenJDK 17.0.9. I used build/mvn (Maven 3.8.8).
>
>
>
> I'm not 100% sure this is something we should fail the release for, as
> it's test-only and sounds very environment-dependent, but I'll respect
> your call on the vote.
>
>
>
> Btw, it looks like Rui also made a relevant fix via link (not
> to fix the failing test but to fix other issues), but this also wasn't
> ported back to 3.5. @Rui Wang Do you think this is
> a regression issue and warrants a new RC?
>
>
>
>
>
> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
>
> Is anyone seeing this Spark Connect test failure? Then again, I have some
> weird issue with this env that always fails 1 or 2 tests that nobody else
> can replicate.
>
>
>
> - Test observe *** FAILED ***
>   == FAIL: Plans do not match ===
>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
> max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
> [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
> 44
>+- LocalRelation , [id#0, name#0]
>   +- LocalRelation , [id#0, name#0]
> (PlanTest.scala:179)
>
>
>
> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim 
> wrote:
>
> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2, as I belatedly
> found a doc generation issue after tagging RC1.
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.1.
>
> The vote is open until February 18th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
> 
>
> The tag to be voted on is v3.5.1-rc2 (commit
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> https://github.com/apache/spark/tree/v3.5.1-rc2
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
> 
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 

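For readers less familiar with the failure mode Yangjie describes above: the assertion breaks because a relation's `id` is handed out by a process-wide counter, so its value depends on how many `DataFrame`s other suites created first. Here is a minimal Python sketch of that mechanism (a hypothetical stand-in for illustration, not Spark's actual code; the names are invented):

```python
import itertools

# Hypothetical stand-in for a process-wide counter that hands out plan ids.
_plan_id_counter = itertools.count()

class Relation:
    """A toy analogue of a DataFrame's relation, tagged with a global id."""
    def __init__(self, name):
        self.name = name
        # The id depends on how many relations were created before this one.
        self.id = next(_plan_id_counter)

# If the suite's relation is the first DataFrame initialized, a hard-coded
# expectation of id == 0 holds:
spark_test_relation = Relation("sparkTestRelation")
assert spark_test_relation.id == 0

# But if another suite had initialized DataFrames beforehand (as can happen
# under a different suite execution order), the same assertion would fail:
Relation("someOtherSuiteRelation")
late_relation = Relation("sparkTestRelation")
print(late_relation.id)  # 2 under this ordering, not 0
```

This ordering dependence is presumably also why the plan diff Sean posted shows a trailing `0` expected against `44` actual.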
Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Jungtaek Lim
I traced back relevant changes and got a sense of what happened.

Yangjie figured out the issue via link. It's a
tricky issue according to the comments from Yangjie - the test depends on
the order in which the test suites are executed. He said it does not fail
in sbt, hence the CI build couldn't catch it.
He fixed it via link, but we
missed that the offending commit had also been ported back to 3.5, hence
the fix wasn't ported back to 3.5.

Surprisingly, I can't reproduce it locally even with Maven. In my attempt to
reproduce, SparkConnectProtoSuite was executed third:
SparkConnectStreamingQueryCacheSuite and ExecuteEventsManagerSuite ran
first, and then SparkConnectProtoSuite. Maybe it is very specific to the
environment, not just Maven? My env: MBP with an M1 Pro chip, macOS 14.3.1,
OpenJDK 17.0.9. I used build/mvn (Maven 3.8.8).

I'm not 100% sure this is something we should fail the release for, as it's
test-only and sounds very environment-dependent, but I'll respect your call
on the vote.

Btw, it looks like Rui also made a relevant fix via link
(not to fix the failing test
but to fix other issues), but this also wasn't ported back to 3.5. @Rui Wang
Do you think this is a regression issue and warrants
a new RC?


On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:

> Is anyone seeing this Spark Connect test failure? Then again, I have some
> weird issue with this env that always fails 1 or 2 tests that nobody else
> can replicate.
>
> - Test observe *** FAILED ***
>   == FAIL: Plans do not match ===
>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
> max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
> [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
> 44
>+- LocalRelation , [id#0, name#0]
>   +- LocalRelation , [id#0, name#0]
> (PlanTest.scala:179)
>
> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim 
> wrote:
>
>> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2, as I belatedly
>> found a doc generation issue after tagging RC1.
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.5.1.
>>
>> The vote is open until February 18th 9AM (PST) and passes if a majority
>> +1 PMC votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.5.1-rc2 (commit
>> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
>> https://github.com/apache/spark/tree/v3.5.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1452/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-docs/
>>
>> The list of bug fixes going into 3.5.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353495
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/pyspark-3.5.1.tar.gz
>> "
>> and see if anything important breaks.
>> In Java/Scala, you can add the staging repository to your project's
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.5.1?
>> ===
>>
>> The current list of open tickets targeted at 3.5.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, 

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao,

As a cool feature, a couple of questions:


   - Compared to standard Spark, what kind of performance gains can be
   expected with Comet?
   - Can one use Comet on k8s in conjunction with something like a Volcano
   addon?


HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom





 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge, sourced from both personal expertise and other resources, but of
course cannot be guaranteed. It is essential to note that, as with any
advice, one verified and tested result holds more weight than a thousand
expert opinions.


On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:

> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveraging DataFusion and Arrow,
> has now been open sourced under the Apache Arrow umbrella. Please
> check the project repo
> https://github.com/apache/arrow-datafusion-comet for more details if
> you are interested. We'd love to collaborate with people from the open
> source community who share similar goals.
>
> Thanks,
> Chao
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>