Re: Apache Spark 3.3.4 EOL Release?

2023-12-11 Thread Jungtaek Lim
Sorry for the late reply, I've been busy these days and haven't had time to
respond.

I didn't realize you were doing release preparation and discussion in
parallel. I totally agree you should go ahead since you've already taken that step.

Also, thanks for the suggestion! Unfortunately I became busy after
volunteering, but I'll figure out how I can make it, hopefully before the end
of this year.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Dec 9, 2023 at 2:22 AM Dongjoon Hyun 
wrote:

> Thank you, Mridul, and Kent, too.
>
> Additionally, thank you for volunteering as a release manager, Jungtaek.
>
> For the 3.3.4 EOL release, I've already been testing and preparing for one
> week since my first email.
>
> So, why don't you proceed with the Apache Spark 3.5.1 release? It has 142
> patches already.
>
> $ git log --oneline v3.5.0..HEAD | wc -l
>  142
>
> I'd recommend that you proceed by sending an independent discussion
> email to the dev mailing list.
>
> I'd love to see Apache Spark 3.5.1 in December, too.
>
> BTW, as you mentioned, there is no strict timeline for 3.5.1, so take your
> time.
>
> Thanks,
> Dongjoon.
>
>
>
> On Fri, Dec 8, 2023 at 2:04 AM Jungtaek Lim 
> wrote:
>
>> +1 to release 3.3.4 and consider 3.3 as EOL.
>>
>> Btw, it would probably be ideal if we could offer the opportunity to go
>> through the release process to people who haven't had a chance yet (when
>> there are people who are happy to take it). If you don't mind and we are not
>> very strict on the timeline, I'd be happy to volunteer and give it a try.
>>
>> On Tue, Dec 5, 2023 at 12:12 PM Kent Yao  wrote:
>>
>>> +1
>>>
>>> Thank you for driving this EOL release, Dongjoon!
>>>
>>> Kent Yao
>>>
>>> On 2023/12/04 19:40:10 Mridul Muralidharan wrote:
>>> > +1
>>> >
>>> > Regards,
>>> > Mridul
>>> >
>>> > On Mon, Dec 4, 2023 at 11:40 AM L. C. Hsieh  wrote:
>>> >
>>> > > +1
>>> > >
>>> > > Thanks Dongjoon!
>>> > >
>>> > > On Mon, Dec 4, 2023 at 9:26 AM Yang Jie 
>>> wrote:
>>> > > >
>>> > > > +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
>>> > > >
>>> > > > Jie Yang
>>> > > >
>>> > > > On 2023/12/04 15:08:25 Tom Graves wrote:
>>> > > > >  +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
>>> > > > > Tom
>>> > > > > On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon
>>> Hyun <
>>> > > dongjoon.h...@gmail.com> wrote:
>>> > > > >
>>> > > > >  Hi, All.
>>> > > > >
>>> > > > > Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022,
>>> > > branch-3.3 has been maintained and served well until now.
>>> > > > >
>>> > > > > - https://github.com/apache/spark/releases/tag/v3.3.0 (tagged
>>> on Jun
>>> > > 9th, 2022)
>>> > > > > -
>>> https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm
>>> > > (vote result on June 14th, 2022)
>>> > > > >
>>> > > > > As of today, branch-3.3 has 56 additional patches after v3.3.3 (tagged
>>> > > > > on Aug 3rd, about 4 months ago) and reaches end-of-life this month
>>> > > > > according to the Apache Spark release cadence,
>>> > > > > https://spark.apache.org/versioning-policy.html .
>>> > > > >
>>> > > > > $ git log --oneline v3.3.3..HEAD | wc -l
>>> > > > > 56
>>> > > > >
>>> > > > > Along with the recent Apache Spark 3.4.2 release, I hope the
>>> users can
>>> > > get a chance to have these last bits of Apache Spark 3.3.x, and I'd
>>> like to
>>> > > propose to have Apache Spark 3.3.4 EOL Release vote on December 11th
>>> and
>>> > > volunteer as the release manager.
>>> > > > >
>>> > > > > WDYT?
>>> > > > >
>>> > > > > Please let us know if you need more patches on branch-3.3.
>>> > > > >
>>> > > > > Thanks,
>>> > > > > Dongjoon.
>>> > > > >
>>> > > >
>>> > > >


Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-11 Thread Malcolm Decuire
+1

On Mon, Dec 11, 2023 at 6:21 PM Yang Jie  wrote:

> +1
>
> On 2023/12/11 03:03:39 "L. C. Hsieh" wrote:
> > +1
> >
> > On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
> > >
> > > +1 (non-binding)
> > >
> > > Kent Yao
> > >
> > > Yuming Wang wrote on Mon, Dec 11, 2023 at 09:33:
> > > >
> > > > +1
> > > >
> > > > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun 
> wrote:
> > > >>
> > > >> +1
> > > >>
> > > >> Dongjoon
> > > >>
> > > >> On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
> > > >> > Please vote on releasing the following candidate as Apache Spark
> version
> > > >> > 3.3.4.
> > > >> >
> > > >> > The vote is open until December 15th 1AM (PST) and passes if a
> majority +1
> > > >> > PMC votes are cast, with a minimum of 3 +1 votes.
> > > >> >
> > > >> > [ ] +1 Release this package as Apache Spark 3.3.4
> > > >> > [ ] -1 Do not release this package because ...
> > > >> >
> > > >> > To learn more about Apache Spark, please see
> https://spark.apache.org/
> > > >> >
> > > >> > The tag to be voted on is v3.3.4-rc1 (commit
> > > >> > 18db204995b32e87a650f2f09f9bcf047ddafa90)
> > > >> > https://github.com/apache/spark/tree/v3.3.4-rc1
> > > >> >
> > > >> > The release files, including signatures, digests, etc. can be
> found at:
> > > >> >
> > > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
> > > >> >
> > > >> >
> > > >> > Signatures used for Spark RCs can be found in this file:
> > > >> >
> > > >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > > >> >
> > > >> >
> > > >> > The staging repository for this release can be found at:
> > > >> >
> > > >> >
> https://repository.apache.org/content/repositories/orgapachespark-1451/
> > > >> >
> > > >> >
> > > >> > The documentation corresponding to this release can be found at:
> > > >> >
> > > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
> > > >> >
> > > >> >
> > > >> > The list of bug fixes going into 3.3.4 can be found at the
> following URL:
> > > >> >
> > > >> > https://issues.apache.org/jira/projects/SPARK/versions/12353505
> > > >> >
> > > >> >
> > > >> > This release is using the release script of the tag v3.3.4-rc1.
> > > >> >
> > > >> >
> > > >> > FAQ
> > > >> >
> > > >> >
> > > >> > =
> > > >> >
> > > >> > How can I help test this release?
> > > >> >
> > > >> > =
> > > >> >
> > > >> >
> > > >> >
> > > >> > If you are a Spark user, you can help us test this release by
> taking
> > > >> >
> > > >> > an existing Spark workload and running on this release candidate,
> then
> > > >> >
> > > >> > reporting any regressions.
> > > >> >
> > > >> >
> > > >> >
> > > >> > If you're working in PySpark you can set up a virtual env and
> install
> > > >> >
> > > >> > the current RC and see if anything important breaks, in the
> Java/Scala
> > > >> >
> > > >> > you can add the staging repository to your project's resolvers and
> test
> > > >> >
> > > >> > with the RC (make sure to clean up the artifact cache
> before/after so
> > > >> >
> > > >> > you don't end up building with an out-of-date RC going forward).
> > > >> >
> > > >> >
> > > >> >
> > > >> > ===
> > > >> >
> > > >> > What should happen to JIRA tickets still targeting 3.3.4?
> > > >> >
> > > >> > ===
> > > >> >
> > > >> >
> > > >> >
> > > >> > The current list of open tickets targeted at 3.3.4 can be found
> at:
> > > >> >
> > > >> > https://issues.apache.org/jira/projects/SPARK and search for
> "Target
> > > >> > Version/s" = 3.3.4
> > > >> >
> > > >> >
> > > >> > Committers should look at those and triage. Extremely important
> bug
> > > >> >
> > > >> > fixes, documentation, and API tweaks that impact compatibility
> should
> > > >> >
> > > >> > be worked on immediately. Everything else please retarget to an
> > > >> >
> > > >> > appropriate release.
> > > >> >
> > > >> >
> > > >> >
> > > >> > ==
> > > >> >
> > > >> > But my bug isn't fixed?
> > > >> >
> > > >> > ==
> > > >> >
> > > >> >
> > > >> >
> > > >> > In order to make timely releases, we will typically not hold the
> > > >> >
> > > >> > release unless the bug in question is a regression from the
> previous
> > > >> >
> > > >> > release. That being said, if there is something which is a
> regression
> > > >> >
> > > >> > that has not been correctly targeted please ping me or a
> committer to
> > > >> >
> > > >> > help target the issue.
> > > >> >
> > > >>
> > > >>

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
Where exactly are you getting this information from?

As far as I can tell, spark.sql.cbo.enabled has defaulted to false since it was
introduced 7 years ago. It has never been enabled by default.

And I cannot see mention of spark.sql.cbo.strategy anywhere at all in the code 
base.

So again, where is this information coming from? Please link directly to your 
source.
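
For anyone who wants to double-check locally, here is a minimal PySpark sketch
(the exact exception type raised for the unknown key is an assumption; the point
is only that spark.conf.get falls back to a key's built-in default when one
exists):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A documented SQLConf entry falls back to its built-in default when unset.
    print(spark.conf.get("spark.sql.cbo.enabled"))  # prints "false"

    # A key that is not defined anywhere has no default, so reading it fails
    # unless someone has set it explicitly (in which case it is just an inert
    # string, since nothing in the code base reads it).
    try:
        print(spark.conf.get("spark.sql.cbo.strategy"))
    except Exception as e:
        print("spark.sql.cbo.strategy is not a defined config:", type(e).__name__)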



> On Dec 11, 2023, at 5:45 PM, Mich Talebzadeh  
> wrote:
> 
> You are right. By default CBO is not enabled. Whilst the CBO was the default 
> optimizer in earlier versions of Spark, it has been replaced by the AQE in 
> recent releases.
> 
> spark.sql.cbo.strategy
> 
> As I understand, The spark.sql.cbo.strategy configuration property specifies 
> the optimizer strategy used by Spark SQL to generate query execution plans. 
> There are two main optimizer strategies available:
> CBO (Cost-Based Optimization): The default optimizer strategy, which analyzes 
> the query plan and estimates the execution costs associated with each 
> operation. It uses statistics to guide its decisions, selecting the plan with 
> the lowest estimated cost.
> 
> CBO-Like (Cost-Based Optimization-Like): A simplified optimizer strategy that 
> mimics some of the CBO's logic, but without the ability to estimate costs. 
> This strategy is faster than CBO for simple queries, but may not produce the 
> most efficient plan for complex queries.
> 
> The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like. The 
> default value is AUTO, which means that Spark will automatically choose the 
> most appropriate strategy based on the complexity of the query and 
> availability of statistics.
> 
> 
> 
> On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas  > wrote:
>> 
>>> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh >> > wrote:
>>> 
>>> By default, the CBO is enabled in Spark.
>> 
>> Note that this is not correct. AQE is enabled by default, but CBO isn’t.



Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-11 Thread Yang Jie
+1

On 2023/12/11 03:03:39 "L. C. Hsieh" wrote:
> +1
> 
> On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
> >
> > +1 (non-binding)
> >
> > Kent Yao
> >
> > > Yuming Wang wrote on Mon, Dec 11, 2023 at 09:33:
> > >
> > > +1
> > >
> > > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun  wrote:
> > >>
> > >> +1
> > >>
> > >> Dongjoon
> > >>
> > >> On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
> > >> > Please vote on releasing the following candidate as Apache Spark 
> > >> > version
> > >> > 3.3.4.
> > >> >
> > >> > The vote is open until December 15th 1AM (PST) and passes if a 
> > >> > majority +1
> > >> > PMC votes are cast, with a minimum of 3 +1 votes.
> > >> >
> > >> > [ ] +1 Release this package as Apache Spark 3.3.4
> > >> > [ ] -1 Do not release this package because ...
> > >> >
> > >> > To learn more about Apache Spark, please see https://spark.apache.org/
> > >> >
> > >> > The tag to be voted on is v3.3.4-rc1 (commit
> > >> > 18db204995b32e87a650f2f09f9bcf047ddafa90)
> > >> > https://github.com/apache/spark/tree/v3.3.4-rc1
> > >> >
> > >> > The release files, including signatures, digests, etc. can be found at:
> > >> >
> > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
> > >> >
> > >> >
> > >> > Signatures used for Spark RCs can be found in this file:
> > >> >
> > >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >> >
> > >> >
> > >> > The staging repository for this release can be found at:
> > >> >
> > >> > https://repository.apache.org/content/repositories/orgapachespark-1451/
> > >> >
> > >> >
> > >> > The documentation corresponding to this release can be found at:
> > >> >
> > >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
> > >> >
> > >> >
> > >> > The list of bug fixes going into 3.3.4 can be found at the following 
> > >> > URL:
> > >> >
> > >> > https://issues.apache.org/jira/projects/SPARK/versions/12353505
> > >> >
> > >> >
> > >> > This release is using the release script of the tag v3.3.4-rc1.
> > >> >
> > >> >
> > >> > FAQ
> > >> >
> > >> >
> > >> > =
> > >> >
> > >> > How can I help test this release?
> > >> >
> > >> > =
> > >> >
> > >> >
> > >> >
> > >> > If you are a Spark user, you can help us test this release by taking
> > >> >
> > >> > an existing Spark workload and running on this release candidate, then
> > >> >
> > >> > reporting any regressions.
> > >> >
> > >> >
> > >> >
> > >> > If you're working in PySpark you can set up a virtual env and install
> > >> >
> > >> > the current RC and see if anything important breaks, in the Java/Scala
> > >> >
> > >> > you can add the staging repository to your project's resolvers and test
> > >> >
> > >> > with the RC (make sure to clean up the artifact cache before/after so
> > >> >
> > >> > you don't end up building with an out-of-date RC going forward).
> > >> >
> > >> >
> > >> >
> > >> > ===
> > >> >
> > >> > What should happen to JIRA tickets still targeting 3.3.4?
> > >> >
> > >> > ===
> > >> >
> > >> >
> > >> >
> > >> > The current list of open tickets targeted at 3.3.4 can be found at:
> > >> >
> > >> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > >> > Version/s" = 3.3.4
> > >> >
> > >> >
> > >> > Committers should look at those and triage. Extremely important bug
> > >> >
> > >> > fixes, documentation, and API tweaks that impact compatibility should
> > >> >
> > >> > be worked on immediately. Everything else please retarget to an
> > >> >
> > >> > appropriate release.
> > >> >
> > >> >
> > >> >
> > >> > ==
> > >> >
> > >> > But my bug isn't fixed?
> > >> >
> > >> > ==
> > >> >
> > >> >
> > >> >
> > >> > In order to make timely releases, we will typically not hold the
> > >> >
> > >> > release unless the bug in question is a regression from the previous
> > >> >
> > >> > release. That being said, if there is something which is a regression
> > >> >
> > >> > that has not been correctly targeted please ping me or a committer to
> > >> >
> > >> > help target the issue.
> > >> >
> > >>



Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-11 Thread Dongjoon Hyun
Hi, Mridul.

> I am currently on Python 3.11.6, Java 8.

For the above: I added `Python 3.11 support` in Apache Spark 3.4.0. That's
exactly one of the reasons why I wanted to do the EOL release of Apache
Spark 3.3.4.

https://issues.apache.org/jira/browse/SPARK-41454 (Support Python 3.11)

Thanks,
Dongjoon.




On Mon, Dec 11, 2023 at 12:22 PM Mridul Muralidharan 
wrote:

>
> I am seeing a bunch of Python-related (43) failures in the sql module (for
> example [1]) ... I am currently on Python 3.11.6, Java 8.
> Not sure if Ubuntu modified anything from under me. Thoughts?
>
> I am currently testing this against an older branch to make sure it is not
> an issue with my desktop.
>
> Regards,
> Mridul
>
>
> [1]
>
>
> org.apache.spark.sql.IntegratedUDFTestUtils.shouldTestGroupedAggPandasUDFs
> was false (QueryCompilationErrorsSuite.scala:112)
> Traceback (most recent call last):
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py", line
> 458, in dumps
> return cloudpickle.dumps(obj, pickle_protocol)
>^^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 73, in dumps
> cp.dump(obj)
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 602, in dump
> return Pickler.dump(self, obj)
>^^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 692, in reducer_override
> return self._function_reduce(obj)
>^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 565, in _function_reduce
> return self._dynamic_function_reduce(obj)
>^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 546, in _dynamic_function_reduce
> state = _function_getstate(func)
> 
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 157, in _function_getstate
> f_globals_ref = _extract_code_globals(func.__code__)
> 
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
> line 334, in _extract_code_globals
> out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
> ^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
> line 334, in 
> out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
>  ~^^^
> IndexError: tuple index out of range
> Traceback (most recent call last):
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py", line
> 458, in dumps
> return cloudpickle.dumps(obj, pickle_protocol)
>^^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 73, in dumps
> cp.dump(obj)
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 602, in dump
> return Pickler.dump(self, obj)
>^^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 692, in reducer_override
> return self._function_reduce(obj)
>^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 565, in _function_reduce
> return self._dynamic_function_reduce(obj)
>^^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 546, in _dynamic_function_reduce
> state = _function_getstate(func)
> 
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
> line 157, in _function_getstate
> f_globals_ref = _extract_code_globals(func.__code__)
> 
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
> line 334, in _extract_code_globals
> out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
> ^
>   File
> "/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
> line 334, in 
> out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
>  ~^^^
> IndexError: tuple index out of range
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>  

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Mich Talebzadeh
You are right. By default CBO is not enabled. Whilst the CBO was the
default optimizer in earlier versions of Spark, it has been replaced by the
AQE in recent releases.

spark.sql.cbo.strategy

As I understand it, the spark.sql.cbo.strategy configuration property
specifies the optimizer strategy used by Spark SQL to generate query
execution plans. There are two main optimizer strategies available:

   - CBO (Cost-Based Optimization): The default optimizer strategy, which
     analyzes the query plan and estimates the execution costs associated with
     each operation. It uses statistics to guide its decisions, selecting the
     plan with the lowest estimated cost.

   - CBO-Like (Cost-Based Optimization-Like): A simplified optimizer strategy
     that mimics some of the CBO's logic, but without the ability to estimate
     costs. This strategy is faster than CBO for simple queries, but may not
     produce the most efficient plan for complex queries.

The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like.
The default value is AUTO, which means that Spark will automatically choose
the most appropriate strategy based on the complexity of the query and
availability of statistics.


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
   https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas 
wrote:

>
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh 
> wrote:
>
> By default, the CBO is enabled in Spark.
>
>
> Note that this is not correct. AQE is enabled by default, but CBO isn’t.
>


Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-11 Thread Mridul Muralidharan
I am seeing a bunch of Python-related (43) failures in the sql module (for
example [1]) ... I am currently on Python 3.11.6, Java 8.
Not sure if Ubuntu modified anything from under me. Thoughts?

I am currently testing this against an older branch to make sure it is not
an issue with my desktop.

Regards,
Mridul


[1]


org.apache.spark.sql.IntegratedUDFTestUtils.shouldTestGroupedAggPandasUDFs
was false (QueryCompilationErrorsSuite.scala:112)
Traceback (most recent call last):
  File "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py",
line 458, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 73, in dumps
cp.dump(obj)
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 602, in dump
return Pickler.dump(self, obj)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 692, in reducer_override
return self._function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 565, in _function_reduce
return self._dynamic_function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 546, in _dynamic_function_reduce
state = _function_getstate(func)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 157, in _function_getstate
f_globals_ref = _extract_code_globals(func.__code__)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in _extract_code_globals
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in 
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
 ~^^^
IndexError: tuple index out of range
Traceback (most recent call last):
  File "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py",
line 458, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 73, in dumps
cp.dump(obj)
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 602, in dump
return Pickler.dump(self, obj)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 692, in reducer_override
return self._function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 565, in _function_reduce
return self._dynamic_function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 546, in _dynamic_function_reduce
state = _function_getstate(func)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 157, in _function_getstate
f_globals_ref = _extract_code_globals(func.__code__)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in _extract_code_globals
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in 
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
 ~^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py",
line 468, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index
out of range
- UNSUPPORTED_FEATURE: Using Python UDF with unsupported join condition ***
FAILED ***



On Sun, Dec 10, 2023 at 9:05 PM L. C. Hsieh  wrote:

> +1
>
> On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
> >
> > +1 (non-binding)
> >
> > Kent Yao
> >
> > > Yuming Wang wrote on Mon, Dec 11, 2023 at 09:33:
> > >
> > > +1
> > >
> > > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun 
> wrote:
> > >>
> 

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default optimizer, 
> or NONE to disable it completely.
> 
Hmm, I’ve also never heard of this setting before and can’t seem to find it in 
the Spark docs or source code.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> 
> By default, the CBO is enabled in Spark.

Note that this is not correct. AQE is enabled by default, but CBO isn’t.
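
A quick way to confirm from a PySpark shell (a minimal sketch; the values
printed are the built-in defaults when nothing has been set explicitly):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    print(spark.conf.get("spark.sql.adaptive.enabled"))  # "true": AQE, on by default since 3.2
    print(spark.conf.get("spark.sql.cbo.enabled"))       # "false": CBO, off by default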

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Mich Talebzadeh
Some of these have been around outside of Spark for years, like CBO and RBO,
but I concur that they have a place in Spark's docs.

Simply put, statistics  provide insights into the characteristics of data,
such as distribution, skewness, and cardinalities, which help the optimizer
make informed decisions about data partitioning, aggregation strategies,
and join order.

Not so differently, Spark utilizes statistics to:

   - Partition Data Effectively: Spark partitions data into smaller chunks
   to distribute and parallelize computations across worker nodes. Accurate
   statistics enable the optimizer to choose the most appropriate partitioning
   strategy for each data set, considering factors like data distribution and
   skewness.
   - Optimize Join Operations: Spark employs statistics to determine the
   most efficient join order, considering the join factors and their
   respective cardinalities. This helps reduce the amount of data shuffled
   during joins, improving performance and minimizing data transfer overhead.
   - Choose Optimal Aggregation Strategies: When performing aggregations,
   Spark uses statistics to determine the most efficient aggregation algorithm
   based on the data distribution and the desired aggregation functions. This
   ensures that aggregations are performed efficiently without compromising
   accuracy.


With regard to type of statistics:


   - Catalog Statistics: These are pre-computed statistics that are stored
   in the Spark SQL catalog and associated with table or dataset metadata.
   They are typically gathered using the ANALYZE TABLE statement or through
   data source-specific mechanisms (see the sketch after this list).
   - Data Source Statistics: These statistics are computed by the data
   source itself, such as Parquet or Hive, and are associated with the
   internal format of the data. Spark can access and utilize these statistics
   when working with external data sources.
   - Runtime Statistics: These are statistics that are dynamically computed
   during query execution. Spark can gather runtime statistics for certain
   operations, such as aggregations or joins, to refine its optimization
   decisions based on the actual data encountered.
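
As referenced under Catalog Statistics above, here is a minimal sketch of
gathering and inspecting catalog statistics (the table and column names are
made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.range(10_000).withColumnRenamed("id", "order_id") \
        .write.mode("overwrite").saveAsTable("orders")

    # Table-level statistics: total size in bytes and row count.
    spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS")

    # Column-level statistics (distinct count, min/max, null count), which
    # cost-based rules such as join reordering rely on.
    spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS order_id")

    # Inspect what was stored in the catalog.
    spark.sql("DESCRIBE EXTENDED orders").show(truncate=False)
    spark.sql("DESCRIBE EXTENDED orders order_id").show(truncate=False)

With a Hive-backed metastore the collected statistics are persisted with the
table metadata; with the default in-memory catalog they only live for the
session.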

It is important to mention Cost-Based Optimization (CBO). CBO in Spark
analyzes the query plan and estimates the execution costs associated with
each operation. It uses statistics to guide its decisions, selecting the
plan with the lowest estimated cost. I do not know of any RDBMS that uses a
rule-based optimizer (RBO) anymore.

By default, the CBO is enabled in Spark. However, you can explicitly enable
or disable it using the following options (a sketch follows the list):

   - spark.sql.cbo.enabled: Set to true to enable the CBO, or false to
     disable it.

   - spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default
     optimizer, or NONE to disable it completely.
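
For reference, a minimal sketch of enabling these cost-based rules from a
PySpark session. Only spark.sql.cbo.enabled from the list above is used here;
spark.sql.cbo.joinReorder.enabled is an additional, related key added for
illustration:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Explicitly enable cost-based optimization.
        .config("spark.sql.cbo.enabled", "true")
        # Allow the optimizer to reorder multi-way joins using column statistics.
        .config("spark.sql.cbo.joinReorder.enabled", "true")
        .getOrCreate()
    )

    # CBO only has something to work with after statistics have been collected,
    # e.g. via ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS ...
    print(spark.conf.get("spark.sql.cbo.enabled"))  # "true"

Both keys can also be set at runtime with spark.conf.set(...) or SET in SQL.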

HTH
Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
   https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







On Mon, 11 Dec 2023 at 02:36, Nicholas Chammas 
wrote:

> I’ve done some reading and have a slightly better understanding of
> statistics now.
>
> Every implementation of LeafNode.computeStats offers its own way to get
> statistics:
>
>    - LocalRelation estimates the size of the relation directly from the row
>      count.
>    - HiveTableRelation pulls those statistics from the catalog or metastore.
>    - DataSourceV2Relation