[VOTE][RESULT] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
The vote passes with 13 +1s (8 of them binding) and one +0.

(* = binding)
+1:
Chao Sun (*)
Liang-Chi Hsieh (*)
Huaxin Gao (*)
Bo Yang
Dongjoon Hyun (*)
Kent Yao
Wenchen Fan (*)
Ryan Blue
Anton Okolnychyi
Zhou Jiang
Gengliang Wang (*)
Xiao Li (*)
Hyukjin Kwon (*)

+0:
Mich Talebzadeh


-1: None

Thanks all.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
Hi all,

Thanks all for participating and for your support! The vote has passed.
I'll send out the result in a separate thread.

On Wed, May 15, 2024 at 4:44 PM Hyukjin Kwon  wrote:
>
> +1
>
> On Tue, 14 May 2024 at 16:39, Wenchen Fan  wrote:
>>
>> +1
>>
>> On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:
>>>
>>> +1 (non-binding)
>>>
>>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:

 Hi all,

 I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.

 Please also refer to:

- Discussion thread:
 https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
- SPIP doc: 
 https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/


 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …


 Thank you!

 Liang-Chi Hsieh

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

>>>
>>>
>>> --
>>> Zhou JIANG
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread Hyukjin Kwon
+1

On Tue, 14 May 2024 at 16:39, Wenchen Fan  wrote:

> +1
>
> On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:
>
>> +1 (non-binding)
>>
>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>>
>>> Thank you!
>>>
>>> Liang-Chi Hsieh
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> *Zhou JIANG*
>>
>>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-15 Thread Wenchen Fan
Thanks all for the feedback here! Let me put up a new version, which
clarifies the definition of "users":

Behavior changes mean user-visible functional changes in a new release via
public APIs. The "user" here is not only the user who writes queries and/or
develops Spark plugins, but also the user who deploys and/or manages Spark
clusters. New features, and even bug fixes that eliminate NPEs or correct
query results, are behavior changes. Things like performance improvements,
code refactoring, and changes to unreleased APIs/features are not. All
behavior changes should be called out in the PR description. We need to
write an item in the migration guide (and probably add a legacy config) for
those that may break users when upgrading:

   - Bug fixes that change query results. Users may need to backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schemas. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Removing configs.
   - Renaming an error class/condition.
   - Any non-additive change to the public Python/SQL/Scala/Java/R APIs
   (including developer APIs): renaming a function, removing parameters,
   adding parameters, renaming parameters, changing parameter default
   values, etc. These changes should be avoided in general, or done in a
   binary-compatible way, like deprecating and adding a new function
   instead of renaming (see the sketch after this list).
   - Any non-additive change to the way Spark should be deployed and
   managed.
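
To make the binary-compatibility point concrete, here is a minimal Scala
sketch (the object and method names are made up for illustration, not an
actual Spark API) of evolving a public method by deprecating the old
signature and forwarding it to a new one, instead of renaming in place:

// Hypothetical public API evolving in a binary-compatible way.
object TableUtils {
  // Keep the old entry point, forward it to the new one, and deprecate it,
  // so existing compiled callers keep working for at least one release.
  @deprecated("Use compactTable(table, dryRun) instead", "4.0.0")
  def compact(table: String): Unit = compactTable(table, dryRun = false)

  // New entry point: the added parameter has a default value, so source
  // compatibility is preserved for existing call sites as well.
  def compactTable(table: String, dryRun: Boolean = false): Unit = {
    // ... actual implementation ...
  }
}

Callers compiled against the old signature keep resolving compact(String),
and the deprecation warning gives library authors a full release cycle to
migrate before the old method is removed.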

The list above is not supposed to be comprehensive. Anyone can raise a
concern when reviewing a PR and ask the author to add a migration guide
entry if they believe the change is risky and may break users.
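
As an example of the "legacy config" escape hatch mentioned above, the
usual pattern inside Spark's SQLConf looks roughly like the following
sketch (the config name is hypothetical, and buildConf is an internal
helper that only exists inside SQLConf, so this is not standalone code):

// Sketch of a legacy config entry, following the pattern used throughout
// org.apache.spark.sql.internal.SQLConf. The name is hypothetical.
val LEGACY_KEEP_OLD_FOO_BEHAVIOR =
  buildConf("spark.sql.legacy.keepOldFooBehavior")
    .internal()
    .doc("When true, restores the pre-fix behavior of foo. Provided as " +
      "an escape hatch for users whose pipelines depend on the old behavior.")
    .version("4.0.0")
    .booleanConf
    .createWithDefault(false)

Users who are broken by the fix can then set
spark.sql.legacy.keepOldFooBehavior=true while they migrate, and the
migration guide entry points them at this config.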

On Thu, May 2, 2024 at 10:25 PM Will Raschkowski 
wrote:

> To add some user perspective, I wanted to share our experience from
> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
> Palantir:
>
>
>
> We didn't mind "loud" changes that threw exceptions. We have some infra to
> try running jobs with Spark 3 and fall back to Spark 2 if there's an exception.
> E.g., the datetime parsing and rebasing migration in Spark 3 was great:
> Spark threw a helpful exception but never silently changed results.
> Similarly, for things listed in the migration guide as silent changes
> (e.g., add_months's handling of last-day-of-month), we wrote custom check
> rules to throw unless users acknowledged the change through config.
>
>
>
> Silent changes *not* in the migration guide were really bad for us:
> Trusting the migration guide to be exhaustive, we automatically upgraded
> jobs which then “succeeded” but wrote incorrect results. For example, some
> expression increased timestamp precision in Spark 3; a query implicitly
> relied on the reduced precision, and then produced bad results on upgrade.
> It’s a silly query but a note in the migration guide would have helped.
>
>
>
> To summarize: the migration guide was invaluable, we appreciated every
> entry, and we'd appreciate Wenchen's stricter definition of "behavior
> changes" (especially for silent ones).
>
>
>
> *From: *Nimrod Ofek 
> *Date: *Thursday, 2 May 2024 at 11:57
> *To: *Wenchen Fan 
> *Cc: *Erik Krogen , Spark dev list <
> dev@spark.apache.org>
> *Subject: *Re: [DISCUSS] clarify the definition of behavior changes
>
> *CAUTION:* This email originates from an external party (outside of
> Palantir). If you believe this message is suspicious in nature, please use
> the "Report Message" button built into Outlook.
>
>
>
> Hi Erik and Wenchen,
>
>
>
> I think a good practice with public APIs, and with internal APIs that have
> a big impact and a lot of usage, is to ease in changes: provide defaults
> for new parameters, keep the former behaviour in a method with the previous
> signature plus a deprecation notice, and delete that deprecated function in
> the next release. The actual break then happens one release later, after
> all libraries have had the chance to align with the API, and upgrades can
> be done while already using the new version.
>
>
>
> Another thing is that we should probably examine which private APIs are
> used externally, so we can provide a better experience and proper public
> APIs to meet those needs (for instance, applicative metrics and some way
> of creating custom behaviour columns).
>
>
>
> Thanks,
>
> Nimrod
>
>
>
> בתאריך יום ה׳, 2 במאי 2024, 03:51, מאת Wenchen Fan ‏:
>
> Hi Erik,
>
>
>
> Thanks for sharing your thoughts! Note: developer APIs are also public
> APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking
> changes should be avoided as much as we can and new APIs should be
> mentioned in the release notes. Breaking binary compatibility is also a
> "functional change" and should be treated as a behavior change.
>
>
>
> BTW, AFAIK some downstream libraries use private APIs such as Catalyst
> Expression and LogicalPlan. It's too much work to track all the changes to
> private APIs 

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-15 Thread Wenchen Fan
RC1 failed because of this issue. I'll cut RC2 after we downgrade Jetty to
9.x.
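
For context, the root cause is that Hadoop's AmIpFilter still implements
javax.servlet.Filter, while the Jetty version shaded into this RC expects
jakarta.servlet.Filter. A minimal Scala sketch of the assignability check
that fails (assuming both the Hadoop YARN web-proxy jar and the jakarta
servlet API are on the classpath):

// Sketch: roughly the check Jetty's FilterHolder performs at startup.
val filterClass = Class.forName(
  "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter")
// false: AmIpFilter implements javax.servlet.Filter, not
// jakarta.servlet.Filter, hence the IllegalStateException quoted below.
val isJakartaFilter =
  classOf[jakarta.servlet.Filter].isAssignableFrom(filterClass)

Downgrading to Jetty 9.x restores the javax.servlet API that Hadoop's
filter implements.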

On Sat, May 11, 2024 at 3:37 PM Cheng Pan  wrote:

> -1 (non-binding)
>
> A small question: the tag is an orphan, but I suppose it should belong to
> the master branch.
>
> Seems YARN integration is broken due to the javax => jakarta namespace
> migration. I filed SPARK-48238 and left some comments on
> https://github.com/apache/spark/pull/45154
>
> Caused by: java.lang.IllegalStateException: class
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a
> jakarta.servlet.Filter
> at
> org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
> ~[?:?]
> at
> java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
> ~[?:?]
> at
> java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
> ~[?:?]
> at
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> ... 38 more
>
> Thanks,
> Cheng Pan
>
>
> > On May 11, 2024, at 13:55, Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 4.0.0-preview1.
> >
> > The vote is open until May 16 PST and passes if a majority of +1 PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v4.0.0-preview1-rc1 (commit
> > 7dcf77c739c3854260464d732dbfb9a0f54706e7):
> > https://github.com/apache/spark/tree/v4.0.0-preview1-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1454/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/
> >
> > The list of bug fixes going into 4.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running it on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
>
>


Community over Code EU 2024: The countdown has started!

2024-05-14 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one
or more project dev@ mailing lists at the Apache Software Foundation.]

We are very close to Community Over Code EU -- check out the amazing
program and the special discounts that we have for you.

Special discounts

You still have the opportunity to secure your ticket for Community
Over Code EU. Explore the various options available, including the
regular pass, the committer and groups pass, and the newly introduced
one-day pass tailored for locals in Bratislava.

We also have a special discount for you to attend both Community Over
Code and Berlin Buzzwords from June 9th to 11th. Visit our website to
find out more about this opportunity and contact te...@sg.com.mx to
get the discount code.

Take advantage of the discounts and register now!
https://eu.communityovercode.org/tickets/

Check out the full program!

This year Community Over Code Europe will bring to you three days of
keynotes and sessions that cover topics of interest for ASF projects
and the greater open source ecosystem including data engineering,
performance engineering, search, Internet of Things (IoT) as well as
sessions with tips and lessons learned on building a healthy open
source community.

Check out the program: https://eu.communityovercode.org/program/

Keynote speaker highlights for Community Over Code Europe include:

* Dirk-Willem Van Gulik, VP of Public Policy at the Apache Software
Foundation, will discuss the Cyber Resiliency Act and its impact on
open source (All your code belongs to Policy Makers, Politicians, and
the Law).

* Dr. Sherae Daniel will share the results of her study on the impact
of self-promotion for open source software developers (To Toot or not
to Toot, that is the question).

* Asim Hussain, Executive Director of the Green Software Foundation,
will present a framework they have developed for quantifying the
environmental impact of software (Doing for Sustainability what Open
Source did for Software).

* Ruth Ikegah will discuss the growth of the open source movement in
Africa (From Local Roots to Global Impact: Building an Inclusive Open
Source Community in Africa)

* A discussion panel on EU policies and regulations affecting
specialists working in Open Source Program Offices

Additional activities

* Poster sessions: We invite you to stop by our poster area and see if
the ideas presented ignite a conversation within your team.

* BOF time: Don't miss the opportunity to discuss in person with your
open source colleagues on your shared interests.

* Participants' reception: At the end of the first day, we will have a
reception at the event venue. All participants are welcome to attend!

* Spontaneous talks: There is a dedicated room and social space for
having spontaneous talks and sessions. Get ready to share with your
peers.

* Lightning talks: At the end of the event we will have the awaited
Lightning talks, where every participant is welcome to share and
enlighten us.

Please remember: If you haven't applied for the visa, we will provide
the necessary letter for the process. In the unfortunate case of a
visa rejection, your ticket will be reimbursed.

See you in Bratislava,

Community Over Code EU Team

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Xiao Li
+1

Gengliang Wang  于2024年5月13日周一 16:24写道:

> +1
>
> On Mon, May 13, 2024 at 12:30 PM Zhou Jiang 
> wrote:
>
>> +1 (non-binding)
>>
>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>>
>>> Thank you!
>>>
>>> Liang-Chi Hsieh
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> *Zhou JIANG*
>>
>>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Wenchen Fan
+1

On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:

> +1 (non-binding)
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> *Zhou JIANG*
>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Gengliang Wang
+1

On Mon, May 13, 2024 at 12:30 PM Zhou Jiang  wrote:

> +1 (non-binding)
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> *Zhou JIANG*
>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Zhou Jiang
+1 (non-binding)

On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:

> Hi all,
>
> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>
> Please also refer to:
>
>- Discussion thread:
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
> Thank you!
>
> Liang-Chi Hsieh
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
*Zhou JIANG*


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Anton Okolnychyi
+1

On 2024/05/13 15:33:33 Ryan Blue wrote:
> +1
> 
> On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh 
> wrote:
> 
> > +0
> >
> > For reasons I outlined in the discussion thread
> >
> > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >
> > Mich Talebzadeh,
> > Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> > London
> > United Kingdom
> >
> >
> >view my Linkedin profile
> > 
> >
> >
> >  https://en.everybodywiki.com/Mich_Talebzadeh
> >
> >
> >
> > *Disclaimer:* The information provided is correct to the best of my
> > knowledge but of course cannot be guaranteed . It is essential to note
> > that, as with any advice, quote "one test result is worth one-thousand
> > expert opinions (Werner Von Braun)".
> >
> >
> > On Mon, 13 May 2024 at 08:24, Wenchen Fan  wrote:
> >
> >> +1
> >>
> >> On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:
> >>
> >>> +1
> >>>
> >>> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
> >>> >
> >>> > +1
> >>> >
> >>> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
> >>> wrote:
> >>> >>
> >>> >> +1
> >>> >>
> >>> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
> >>> >>>
> >>> >>> +1
> >>> >>>
> >>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >>> >>> >
> >>> >>> > +1
> >>> >>> >
> >>> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
> >>> wrote:
> >>> >>> >>
> >>> >>> >> Hi all,
> >>> >>> >>
> >>> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
> >>> Catalogs.
> >>> >>> >>
> >>> >>> >> Please also refer to:
> >>> >>> >>
> >>> >>> >>- Discussion thread:
> >>> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>> >>> >>- JIRA ticket:
> >>> https://issues.apache.org/jira/browse/SPARK-44167
> >>> >>> >>- SPIP doc:
> >>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> Please vote on the SPIP for the next 72 hours:
> >>> >>> >>
> >>> >>> >> [ ] +1: Accept the proposal as an official SPIP
> >>> >>> >> [ ] +0
> >>> >>> >> [ ] -1: I don’t think this is a good idea because …
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> Thank you!
> >>> >>> >>
> >>> >>> >> Liang-Chi Hsieh
> >>> >>> >>
> >>> >>> >>
> >>> -
> >>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>> >>
> >>> >>>
> >>> >>> -
> >>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>>
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>>
> 
> -- 
> Ryan Blue
> Tabular
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Ryan Blue
+1

On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh 
wrote:

> +0
>
> For reasons I outlined in the discussion thread
>
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Mon, 13 May 2024 at 08:24, Wenchen Fan  wrote:
>
>> +1
>>
>> On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:
>>
>>> +1
>>>
>>> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
>>> >
>>> > +1
>>> >
>>> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>>> >>>
>>> >>> +1
>>> >>>
>>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>>> >>> >
>>> >>> > +1
>>> >>> >
>>> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
>>> wrote:
>>> >>> >>
>>> >>> >> Hi all,
>>> >>> >>
>>> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
>>> Catalogs.
>>> >>> >>
>>> >>> >> Please also refer to:
>>> >>> >>
>>> >>> >>- Discussion thread:
>>> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>> >>> >>- JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-44167
>>> >>> >>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> >>> >>
>>> >>> >>
>>> >>> >> Please vote on the SPIP for the next 72 hours:
>>> >>> >>
>>> >>> >> [ ] +1: Accept the proposal as an official SPIP
>>> >>> >> [ ] +0
>>> >>> >> [ ] -1: I don’t think this is a good idea because …
>>> >>> >>
>>> >>> >>
>>> >>> >> Thank you!
>>> >>> >>
>>> >>> >> Liang-Chi Hsieh
>>> >>> >>
>>> >>> >>
>>> -
>>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>> >>
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Ryan Blue
Tabular


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
Hi Nicholas,

Thanks for your help! I'm definitely interested in participating in this
unification work. Let me know how I can help.

Wenchen

On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas 
wrote:

> Re: unification
>
> We also have a long-standing problem with how we manage Python
> dependencies, something I’ve tried (unsuccessfully) to fix in the past.
>
> Consider, for example, how many separate places this numpy dependency is
> installed:
>
> 1.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
> 2.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
> 3.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
> 4.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
> 5.
> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
> 6.
> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
> 7.
> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
> 8.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
> 9.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
> 10.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
> 11.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
> 12.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>
> None of those installations reference a unified version requirement, so
> naturally they are inconsistent across all these different lines. Some say
> `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In
> several cases there is no version requirement specified at all.
>
> I’m interested in trying again to fix this problem, but it needs to be in
> collaboration with a committer since I cannot fully test the release
> scripts. (This testing gap is what doomed my last attempt at fixing this
> problem.)
>
> Nick
>
>
> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
>
> After finishing the 4.0.0-preview1 RC1, I have more experience with this
> topic now.
>
> In fact, the main job of the release process (building packages and
> documents) is tested in GitHub Actions jobs. However, the way we test them
> is different from what we do in the release scripts.
>
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile:
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile:
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release
> process needs to set up more things so it may not be viable to use a single
> Dockerfile for both.
>
> 2. the execution code is different. Use building documents as an example:
> The release scripts:
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job:
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify
> them.
>
> It's better if we can run the release scripts as Github Action jobs, but I
> think it's more important to do the unification now.
>
> Thanks,
> Wenchen
>
>
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:
>
>> Hello,
>>
>> I can answer some of your common questions with other Apache projects.
>>
>> > Who currently has permissions for Github actions? Is there a specific
>> owner for that today or a different volunteer each time?
>>
>> The Apache organization owns Github Actions, and committers (contributors
>> with write permissions) can retrigger/cancel a Github Actions workflow, but
>> Github Actions runners are managed by the Apache infra team.
>>
>> > What are the current limits of GitHub Actions, who set them - and what
>> is the process to change those (if possible at all, but I presume not all
>> Apache projects have the same limits)?
>>
>> For limits, I don't think there is any significant limit, especially
>> since the Apache organization has 900 donated runners used by its projects,
>> and there is an initiative from the Infra team to add self-hosted runners
>> running on Kubernetes (document).
>>
>> > Where should the artifacts be 

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Mich Talebzadeh
+0

For reasons I outlined in the discussion thread

https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Mon, 13 May 2024 at 08:24, Wenchen Fan  wrote:

> +1
>
> On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:
>
>> +1
>>
>> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
>> >
>> > +1
>> >
>> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >>> >
>> >>> > +1
>> >>> >
>> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
>> wrote:
>> >>> >>
>> >>> >> Hi all,
>> >>> >>
>> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
>> Catalogs.
>> >>> >>
>> >>> >> Please also refer to:
>> >>> >>
>> >>> >>- Discussion thread:
>> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>> >>- JIRA ticket:
>> https://issues.apache.org/jira/browse/SPARK-44167
>> >>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>> >>
>> >>> >>
>> >>> >> Please vote on the SPIP for the next 72 hours:
>> >>> >>
>> >>> >> [ ] +1: Accept the proposal as an official SPIP
>> >>> >> [ ] +0
>> >>> >> [ ] -1: I don’t think this is a good idea because …
>> >>> >>
>> >>> >>
>> >>> >> Thank you!
>> >>> >>
>> >>> >> Liang-Chi Hsieh
>> >>> >>
>> >>> >>
>> -
>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>> >>
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification

We also have a long-standing problem with how we manage Python dependencies,
something I’ve tried (unsuccessfully) to fix in the past.

Consider, for example, how many separate places this numpy dependency is 
installed:

1. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
2. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
3. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
4. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
5. 
https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
6. 
https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
7. 
https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
8. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
9. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
10. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
11. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
12. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92

None of those installations reference a unified version requirement, so 
naturally they are inconsistent across all these different lines. Some say 
`>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several 
cases there is no version requirement specified at all.

I’m interested in trying again to fix this problem, but it needs to be in 
collaboration with a committer since I cannot fully test the release scripts. 
(This testing gap is what doomed my last attempt at fixing this problem.)

Nick


> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
> 
> After finishing the 4.0.0-preview1 RC1, I have more experience with this 
> topic now.
> 
> In fact, the main job of the release process (building packages and
> documents) is tested in GitHub Actions jobs. However, the way we test them
> is different from what we do in the release scripts.
> 
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release 
> process needs to set up more things so it may not be viable to use a single 
> Dockerfile for both.
> 
> 2. the execution code is different. Use building documents as an example:
> The release scripts: 
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job: 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify them.
> 
> It's better if we can run the release scripts as Github Action jobs, but I 
> think it's more important to do the unification now.
> 
> Thanks,
> Wenchen
> 
> 
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala  > wrote:
>> Hello,
>> 
>> I can answer some of your common questions with other Apache projects.
>> 
>> > Who currently has permissions for Github actions? Is there a specific 
>> > owner for that today or a different volunteer each time?
>> 
>> The Apache organization owns Github Actions, and committers (contributors 
>> with write permissions) can retrigger/cancel a Github Actions workflow, but 
>> Github Actions runners are managed by the Apache infra team.
>> 
>> > What are the current limits of GitHub Actions, who set them - and what is 
>> > the process to change those (if possible at all, but I presume not all 
>> > Apache projects have the same limits)?
>> 
>> For limits, I don't think there is any significant limit, especially since 
>> the Apache organization has 900 donated runners used by its projects, and 
>> there is an initiative from the Infra team to add self-hosted runners 
>> running on Kubernetes (document).
>> 
>> > Where should the artifacts be stored?
>> 
>> Usually, we use Maven for jars, DockerHub for Docker images, and Github 
>> cache for workflow cache. But we can use Github artifacts to store any kind 
>> of package (even Docker images in the 

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Wenchen Fan
+1

On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:

> +1
>
> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
> >
> > +1
> >
> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
> wrote:
> >>
> >> +1
> >>
> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
> >>>
> >>> +1
> >>>
> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >>> >
> >>> > +1
> >>> >
> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
> wrote:
> >>> >>
> >>> >> Hi all,
> >>> >>
> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
> Catalogs.
> >>> >>
> >>> >> Please also refer to:
> >>> >>
> >>> >>- Discussion thread:
> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
> >>> >>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>> >>
> >>> >>
> >>> >> Please vote on the SPIP for the next 72 hours:
> >>> >>
> >>> >> [ ] +1: Accept the proposal as an official SPIP
> >>> >> [ ] +0
> >>> >> [ ] -1: I don’t think this is a good idea because …
> >>> >>
> >>> >>
> >>> >> Thank you!
> >>> >>
> >>> >> Liang-Chi Hsieh
> >>> >>
> >>> >>
> -
> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this
topic now.

In fact, the main job of the release process (building packages and
documents) is tested in GitHub Actions jobs. However, the way we test them
is different from what we do in the release scripts.

1. the execution environment is different:
The release scripts define the execution environment with this Dockerfile:
https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
However, Github Action jobs use a different Dockerfile:
https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
We should figure out a way to unify it. The docker image for the release
process needs to set up more things so it may not be viable to use a single
Dockerfile for both.

2. the execution code is different. Use building documents as an example:
The release scripts:
https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
The Github Action job:
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
I don't know which one is more correct, but we should definitely unify them.

It's better if we can run the release scripts as Github Action jobs, but I
think it's more important to do the unification now.

Thanks,
Wenchen


On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:

> Hello,
>
> I can answer some of your common questions with other Apache projects.
>
> > Who currently has permissions for Github actions? Is there a specific
> owner for that today or a different volunteer each time?
>
> The Apache organization owns Github Actions, and committers (contributors
> with write permissions) can retrigger/cancel a Github Actions workflow, but
> Github Actions runners are managed by the Apache infra team.
>
> > What are the current limits of GitHub Actions, who set them - and what
> is the process to change those (if possible at all, but I presume not all
> Apache projects have the same limits)?
>
> For limits, I don't think there is any significant limit, especially since
> the Apache organization has 900 donated runners used by its projects, and
> there is an initiative from the Infra team to add self-hosted runners
> running on Kubernetes (document).
>
> > Where should the artifacts be stored?
>
> Usually, we use Maven for jars, DockerHub for Docker images, and Github
> cache for workflow cache. But we can use Github artifacts to store any kind
> of package (even Docker images in the ghcr), which is fully accepted by
> Apache policies. Also if the project has a cloud account (AWS, GCP, Azure,
> ...), a bucket can be used to store some of the packages.
>
>
>  > Who should be permitted to sign a version - and what is the process for
> that?
>
> The Apache documentation is clear about this, by default only PMC members
> can be release managers, but we can contact the infra team to add one of
> the committers as a release manager (document). The process of creating a
> new version is described in this document.
>
>
> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek  wrote:
>
>> Following the conversation started with Spark 4.0.0 release, this is a
>> thread to discuss improvements to our release processes.
>>
>> I'll Start by raising some questions that probably should have answers to
>> start the discussion:
>>
>>
>>1. What is currently running in GitHub Actions?
>>2. Who currently has permissions for Github actions? Is there a
>>specific owner for that today or a different volunteer each time?
>>3. What are the current limits of GitHub Actions, who set them - and
>>what is the process to change those (if possible at all, but I presume not
>>all Apache projects have the same limits)?
>>4. What versions should we support as an output for the build?
>>5. Where should the artifacts be stored?
>>6. What should be the output? only tar or also a docker image
>>published somewhere?
>>7. Do we want to have a release on fixed dates or a manual release
>>upon request?
>>8. Who should be permitted to sign a version - and what is the
>>process for that?
>>
>>
>> Thanks!
>> Nimrod
>>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Kent Yao
+1

Dongjoon Hyun  于2024年5月13日周一 08:39写道:
>
> +1
>
> On Sun, May 12, 2024 at 3:50 PM huaxin gao  wrote:
>>
>> +1
>>
>> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>>>
>>> +1
>>>
>>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>>> >
>>> > +1
>>> >
>>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>> >>
>>> >> Please also refer to:
>>> >>
>>> >>- Discussion thread:
>>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>> >>- SPIP doc: 
>>> >> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> >>
>>> >>
>>> >> Please vote on the SPIP for the next 72 hours:
>>> >>
>>> >> [ ] +1: Accept the proposal as an official SPIP
>>> >> [ ] +0
>>> >> [ ] -1: I don’t think this is a good idea because …
>>> >>
>>> >>
>>> >> Thank you!
>>> >>
>>> >> Liang-Chi Hsieh
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Dongjoon Hyun
+1

On Sun, May 12, 2024 at 3:50 PM huaxin gao  wrote:

> +1
>
> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >
>> > +1
>> >
>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>> >>
>> >> Please also refer to:
>> >>
>> >>- Discussion thread:
>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>
>> >>
>> >> Please vote on the SPIP for the next 72 hours:
>> >>
>> >> [ ] +1: Accept the proposal as an official SPIP
>> >> [ ] +0
>> >> [ ] -1: I don’t think this is a good idea because …
>> >>
>> >>
>> >> Thank you!
>> >>
>> >> Liang-Chi Hsieh
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread bo yang
+1

On Sat, May 11, 2024 at 4:43 PM huaxin gao  wrote:

> +1
>
> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >
>> > +1
>> >
>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>> >>
>> >> Please also refer to:
>> >>
>> >>- Discussion thread:
>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>
>> >>
>> >> Please vote on the SPIP for the next 72 hours:
>> >>
>> >> [ ] +1: Accept the proposal as an official SPIP
>> >> [ ] +0
>> >> [ ] -1: I don’t think this is a good idea because …
>> >>
>> >>
>> >> Thank you!
>> >>
>> >> Liang-Chi Hsieh
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread huaxin gao
+1

On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:

> +1
>
> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >
> > +1
> >
> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
> >>
> >> Hi all,
> >>
> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
> >>
> >> Please also refer to:
> >>
> >>- Discussion thread:
> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
> >>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>
> >>
> >> Please vote on the SPIP for the next 72 hours:
> >>
> >> [ ] +1: Accept the proposal as an official SPIP
> >> [ ] +0
> >> [ ] -1: I don’t think this is a good idea because …
> >>
> >>
> >> Thank you!
> >>
> >> Liang-Chi Hsieh
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
+1

On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>
> +1
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc: 
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Chao Sun
+1

On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:

> Hi all,
>
> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>
> Please also refer to:
>
>- Discussion thread:
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
> Thank you!
>
> Liang-Chi Hsieh
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
Hi all,

I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.

Please also refer to:

   - Discussion thread:
https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
   - JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
   - SPIP doc: 
https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …


Thank you!

Liang-Chi Hsieh

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Mich Talebzadeh
Thanks

In the context of stored procedures API for Catalogs, this approach
deviates from the traditional definition of stored procedures in RDBMS for
two key reasons:

   - Compilation vs. Interpretation: Traditional stored procedures are
   typically pre-compiled into machine code for faster execution. This
   approach, however, focuses on loading and interpreting the code on demand,
   similar to how scripts are run in some programming languages like Python.
   - Schema Changes and Invalidation: In RDBMS, changes to the underlying
   tables can invalidate compiled procedures as they might reference
   non-existent columns or have incompatible data types. This approach aims to
   avoid invalidation by potentially adapting to minor schema changes.

So, while it leverages the concept of pre-defined procedures stored within
the database and accessible through the Catalog API, it is evident that
this approach functions more like dynamic scripts than traditional compiled
stored procedures.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Sat, 11 May 2024 at 19:25, Anton Okolnychyi 
wrote:

> Mich, I don't think the invalidation will be necessary in our case as
> there is no plan to preprocess or compile the procedures into executable
> objects. They will be loaded and executed on demand via the Catalog API.
>
> пт, 10 трав. 2024 р. о 10:37 Mich Talebzadeh 
> пише:
>
>> Hi,
>>
>> If the underlying table changes (DDL), if I recall from RDBMSs like
>> Oracle, the stored procedure will be invalidated as it is a compiled
>> object. How is this going to be handled? Does it follow the same mechanism?
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>>
>> On Sat, 20 Apr 2024 at 02:34, Anton Okolnychyi 
>> wrote:
>>
>>> Hi folks,
>>>
>>> I'd like to start a discussion on SPARK-44167 that aims to enable
>>> catalogs to expose custom routines as stored procedures. I believe this
>>> functionality will enhance Spark’s ability to interact with external
>>> connectors and allow users to perform more operations in plain SQL.
>>>
>>> SPIP [1] contains proposed API changes and parser extensions. Any
>>> feedback is more than welcome!
>>>
>>> Unlike the initial proposal for stored procedures with Python [2], this
>>> one focuses on exposing pre-defined stored procedures via the catalog API.
>>> This approach is inspired by a similar functionality in Trino and avoids
>>> the challenges of supporting user-defined routines discussed earlier [3].
>>>
>>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>>
>>> - Anton
>>>
>>> [1] -
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> [2] -
>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
>>> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>>>
>>>
>>>
>>>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Anton Okolnychyi
Mich, I don't think the invalidation will be necessary in our case as there
is no plan to preprocess or compile the procedures into executable objects.
They will be loaded and executed on demand via the Catalog API.
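
To give a feel for the shape of the API (names here are illustrative only;
the SPIP doc linked in the vote thread defines the actual proposed
interfaces), a catalog plugin could expose procedures roughly like this
Scala sketch:

import org.apache.spark.sql.connector.catalog.Identifier

// Illustrative sketch, not the SPIP's exact interfaces: a procedure is
// looked up by identifier and executed on demand, with nothing
// pre-compiled or cached as an executable object.
trait Procedure {
  def name: String
  def execute(args: Map[String, Any]): Iterator[Any]
}

// A catalog capability mirroring how TableCatalog exposes tables;
// CALL my_catalog.system.some_proc(...) would resolve through this.
trait ProcedureCatalog {
  def loadProcedure(ident: Identifier): Procedure
}

Because resolution happens at execution time, a DDL change to an underlying
table is simply seen by the next invocation; there is no compiled object to
invalidate.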

пт, 10 трав. 2024 р. о 10:37 Mich Talebzadeh 
пише:

> Hi,
>
> If the underlying table changes (DDL), if I recall from RDBMSs like
> Oracle, the stored procedure will be invalidated as it is a compiled
> object. How is this going to be handled? Does it follow the same mechanism?
>
> Thanks
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Sat, 20 Apr 2024 at 02:34, Anton Okolnychyi 
> wrote:
>
>> Hi folks,
>>
>> I'd like to start a discussion on SPARK-44167 that aims to enable
>> catalogs to expose custom routines as stored procedures. I believe this
>> functionality will enhance Spark’s ability to interact with external
>> connectors and allow users to perform more operations in plain SQL.
>>
>> SPIP [1] contains proposed API changes and parser extensions. Any
>> feedback is more than welcome!
>>
>> Unlike the initial proposal for stored procedures with Python [2], this
>> one focuses on exposing pre-defined stored procedures via the catalog API.
>> This approach is inspired by a similar functionality in Trino and avoids
>> the challenges of supporting user-defined routines discussed earlier [3].
>>
>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>
>> - Anton
>>
>> [1] -
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> [2] -
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
>> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>>
>>
>>
>>


Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-11 Thread Cheng Pan
-1 (non-binding)

A small question: the tag is an orphan, but I suppose it should belong to
the master branch.

Seems YARN integration is broken due to the javax => jakarta namespace
migration. I filed SPARK-48238 and left some comments on
https://github.com/apache/spark/pull/45154

Caused by: java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
jakarta.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99) 
~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
 ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
 ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
 ~[?:?]
at 
java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
 ~[?:?]
at 
java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
 ~[?:?]
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
 ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
... 38 more

Thanks,
Cheng Pan


> On May 11, 2024, at 13:55, Wenchen Fan  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 4.0.0-preview1.
> 
> The vote is open until May 16 PST and passes if a majority of +1 PMC votes
> are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
> The tag to be voted on is v4.0.0-preview1-rc1 (commit 
> 7dcf77c739c3854260464d732dbfb9a0f54706e7):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc1
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1454/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/
> 
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 16 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc1 (commit
7dcf77c739c3854260464d732dbfb9a0f54706e7):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1454/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).
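
For the Java/Scala route, a minimal sbt sketch of pointing a project at the
staging repository (the repository URL and version come from this email; the
Scala patch version is an assumption):

// build.sbt -- test your own project against the RC, then clean the
// artifact cache afterwards as noted above.
ThisBuild / scalaVersion := "2.13.13"  // assumed; match the RC's Scala build

resolvers += "Apache Spark 4.0.0-preview1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1454/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0-preview1"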


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-10 Thread Mich Talebzadeh
Hi,

If the underlying table changes (DDL), if I recall from RDBMSs like Oracle,
the stored procedure will be invalidated as it is a compiled object. How is
this going to be handled? Does it follow the same mechanism?
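
For concreteness, the kind of invocation in question (catalog and procedure
names are hypothetical, and the syntax is modeled on the Trino-style CALL
statement the SPIP cites as inspiration) would look something like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("call-sketch").getOrCreate()

// The procedure is resolved through the catalog at call time rather than
// stored as a compiled object inside Spark, which is the design point
// behind the invalidation question above.
spark.sql("CALL my_catalog.system.expire_snapshots(table => 'db.events')")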

Thanks

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Sat, 20 Apr 2024 at 02:34, Anton Okolnychyi 
wrote:

> Hi folks,
>
> I'd like to start a discussion on SPARK-44167 that aims to enable catalogs
> to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
>
> SPIP [1] contains proposed API changes and parser extensions. Any feedback
> is more than welcome!
>
> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
>
> Liang-Chi was kind enough to shepherd this effort. Thanks!
>
> - Anton
>
> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>
>
>
>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread huaxin gao
Thanks Anton for the updated proposal -- it looks great! I appreciate the
hard work put into refining it. I am looking forward to the upcoming vote
and moving forward with this initiative.

Thanks,
Huaxin

On Thu, May 9, 2024 at 7:30 PM L. C. Hsieh  wrote:

> Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison, and
> anyone else participating in the discussion whom I may have missed.
>
> I suppose we have reached a consensus, or are close to one, on the design.
>
> If you have some more comments, please let us know.
>
> If not, I will start a vote in a few days.
>
> Thank you.
>
> On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi 
> wrote:
> >
> > Thanks to everyone who commented on the design doc. I updated the
> proposal and it is ready for another look. I hope we can converge and move
> forward with this effort!
> >
> > - Anton
> >
> > пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi 
> пише:
> >>
> >> Hi folks,
> >>
> >> I'd like to start a discussion on SPARK-44167 that aims to enable
> catalogs to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
> >>
> >> SPIP [1] contains proposed API changes and parser extensions. Any
> feedback is more than welcome!
> >>
> >> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
> >>
> >> Liang-Chi was kind enough to shepherd this effort. Thanks!
> >>
> >> - Anton
> >>
> >> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> >> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
> >>
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward.

On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh  wrote:

> Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison, and
> anyone else participating in the discussion whom I may have missed.
>
> I suppose we have reached a consensus, or are close to one, on the design.
>
> If you have some more comments, please let us know.
>
> If not, I will start a vote in a few days.
>
> Thank you.
>
> On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi 
> wrote:
> >
> > Thanks to everyone who commented on the design doc. I updated the
> proposal and it is ready for another look. I hope we can converge and move
> forward with this effort!
> >
> > - Anton
> >
> > пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi 
> пише:
> >>
> >> Hi folks,
> >>
> >> I'd like to start a discussion on SPARK-44167 that aims to enable
> catalogs to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
> >>
> >> SPIP [1] contains proposed API changes and parser extensions. Any
> feedback is more than welcome!
> >>
> >> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
> >>
> >> Liang-Chi was kind enough to shepherd this effort. Thanks!
> >>
> >> - Anton
> >>
> >> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> >> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
> >>
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread L. C. Hsieh
Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison, and
anyone else participating in the discussion whom I may have missed.

I suppose we have reached a consensus, or are close to one, on the design.

If you have some more comments, please let us know.

If not, I will start a vote in a few days.

Thank you.

On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi  wrote:
>
> Thanks to everyone who commented on the design doc. I updated the proposal 
> and it is ready for another look. I hope we can converge and move forward 
> with this effort!
>
> - Anton
>
> пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi  пише:
>>
>> Hi folks,
>>
>> I'd like to start a discussion on SPARK-44167 that aims to enable catalogs 
>> to expose custom routines as stored procedures. I believe this functionality 
>> will enhance Spark’s ability to interact with external connectors and allow 
>> users to perform more operations in plain SQL.
>>
>> SPIP [1] contains proposed API changes and parser extensions. Any feedback 
>> is more than welcome!
>>
>> Unlike the initial proposal for stored procedures with Python [2], this one 
>> focuses on exposing pre-defined stored procedures via the catalog API. This 
>> approach is inspired by a similar functionality in Trino and avoids the 
>> challenges of supporting user-defined routines discussed earlier [3].
>>
>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>
>> - Anton
>>
>> [1] - 
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> [2] - 
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
>> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>>
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Anton Okolnychyi
Thanks to everyone who commented on the design doc. I updated the proposal
and it is ready for another look. I hope we can converge and move forward
with this effort!

- Anton

пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi  пише:

> Hi folks,
>
> I'd like to start a discussion on SPARK-44167 that aims to enable catalogs
> to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
>
> SPIP [1] contains proposed API changes and parser extensions. Any feedback
> is more than welcome!
>
> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
>
> Liang-Chi was kind enough to shepherd this effort. Thanks!
>
> - Anton
>
> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>
>
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE:

I've successfully uploaded the release packages:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
(I skipped SparkR as I was not able to fix the errors; I'll get back to it
later.)

However, there is a new issue with doc building:
https://github.com/apache/spark/pull/44628#discussion_r1595718574

I'll continue after the issue is fixed.

On Fri, May 10, 2024 at 12:29 AM Dongjoon Hyun 
wrote:

> Please re-try to upload, Wenchen. ASF Infra team bumped up our upload
> limit based on our request.
>
> > Your upload limit has been increased to 650MB
>
> Dongjoon.
>
>
>
> On Thu, May 9, 2024 at 8:12 AM Wenchen Fan  wrote:
>
>> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776
>>
>> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
>> wrote:
>>
>>> In addition, FYI, I was the latest release manager with Apache Spark
>>> 3.4.3 (2024-04-15 Vote)
>>>
>>> According to my work log, I uploaded the following binaries to SVN from
>>> EC2 (us-west-2) without any issues.
>>>
>>> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
>>> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
>>> spark-3.4.3-bin-hadoop3-scala2.13.tgz
>>> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
>>> spark-3.4.3-bin-hadoop3.tgz
>>> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
>>> spark-3.4.3-bin-without-hadoop.tgz
>>> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
>>> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>>>
>>> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination,
>>> the total size should be smaller than the 3.4.3 binaries.
>>>
>>> Given that, if there is any INFRA change, that could happen after 4/15.
>>>
>>> Dongjoon.
>>>
>>> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
>>> wrote:
>>>
 Could you file an INFRA JIRA issue with the error message and context
 first, Wenchen?

 As you know, if we see something, we had better file a JIRA issue
 because it could be not only an Apache Spark issue but an issue for all ASF
 projects.

 Dongjoon.


 On Thu, May 9, 2024 at 12:28 AM Wenchen Fan 
 wrote:

> UPDATE:
>
> After resolving a few issues in the release scripts, I can finally
> build the release packages. However, I can't upload them to the staging 
> SVN
> repo due to a transmitting error, and it seems like a limitation from the
> server side. I tried it on both my local laptop and remote AWS instance,
> but neither works. These package binaries are like 300-400 MBs, and we 
> just
> did a release last month. Not sure if this is a new limitation due to cost
> saving.
>
> While I'm looking for help to get unblocked, I'm wondering if we can
> upload release packages to a public git repo instead, under the Apache
> account?
>
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Please re-try to upload, Wenchen. ASF Infra team bumped up our upload limit
based on our request.

> Your upload limit has been increased to 650MB

Dongjoon.



On Thu, May 9, 2024 at 8:12 AM Wenchen Fan  wrote:

> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776
>
> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
> wrote:
>
>> In addition, FYI, I was the latest release manager with Apache Spark
>> 3.4.3 (2024-04-15 Vote)
>>
>> According to my work log, I uploaded the following binaries to SVN from
>> EC2 (us-west-2) without any issues.
>>
>> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
>> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
>> spark-3.4.3-bin-hadoop3-scala2.13.tgz
>> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
>> spark-3.4.3-bin-hadoop3.tgz
>> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
>> spark-3.4.3-bin-without-hadoop.tgz
>> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
>> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>>
>> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination, the
>> total size should be smaller than the 3.4.3 binaries.
>>
>> Given that, if there is any INFRA change, that could happen after 4/15.
>>
>> Dongjoon.
>>
>> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Could you file an INFRA JIRA issue with the error message and context
>>> first, Wenchen?
>>>
>>> As you know, if we see something, we had better file a JIRA issue
>>> because it could be not only an Apache Spark issue but an issue for all ASF
>>> projects.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>>>
 UPDATE:

 After resolving a few issues in the release scripts, I can finally
 build the release packages. However, I can't upload them to the staging SVN
 repo due to a transmitting error, and it seems like a limitation from the
 server side. I tried it on both my local laptop and remote AWS instance,
 but neither works. These package binaries are like 300-400 MBs, and we just
 did a release last month. Not sure if this is a new limitation due to cost
 saving.

 While I'm looking for help to get unblocked, I'm wondering if we can
 upload release packages to a public git repo instead, under the Apache
 account?

>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776

On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
wrote:

> In addition, FYI, I was the latest release manager with Apache Spark 3.4.3
> (2024-04-15 Vote)
>
> According to my work log, I uploaded the following binaries to SVN from
> EC2 (us-west-2) without any issues.
>
> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
> spark-3.4.3-bin-hadoop3-scala2.13.tgz
> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
> spark-3.4.3-bin-hadoop3.tgz
> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
> spark-3.4.3-bin-without-hadoop.tgz
> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>
> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination, the
> total size should be smaller than the 3.4.3 binaries.
>
> Given that, if there is any INFRA change, that could happen after 4/15.
>
> Dongjoon.
>
> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
> wrote:
>
>> Could you file an INFRA JIRA issue with the error message and context
>> first, Wenchen?
>>
>> As you know, if we see something, we had better file a JIRA issue because
>> it could be not only an Apache Spark issue but an issue for all ASF
>> projects.
>>
>> Dongjoon.
>>
>>
>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> After resolving a few issues in the release scripts, I can finally build
>>> the release packages. However, I can't upload them to the staging SVN repo
>>> due to a transmitting error, and it seems like a limitation from the server
>>> side. I tried it on both my local laptop and remote AWS instance, but
>>> neither works. These package binaries are like 300-400 MBs, and we just did
>>> a release last month. Not sure if this is a new limitation due to cost
>>> saving.
>>>
>>> While I'm looking for help to get unblocked, I'm wondering if we can
>>> upload release packages to a public git repo instead, under the Apache
>>> account?
>>>




Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
In addition, FYI, I was the latest release manager with Apache Spark 3.4.3
(2024-04-15 Vote)

According to my work log, I uploaded the following binaries to SVN from EC2
(us-west-2) without any issues.

-rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
-rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
spark-3.4.3-bin-hadoop3-scala2.13.tgz
-rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
spark-3.4.3-bin-hadoop3.tgz
-rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
spark-3.4.3-bin-without-hadoop.tgz
-rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
-rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz

Since Apache Spark 4.0.0-preview doesn't have a Scala 2.12 combination, the
total size should be smaller than the 3.4.3 binaries.

Given that, if there is any INFRA change, that could happen after 4/15.

Dongjoon.

On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
wrote:

> Could you file an INFRA JIRA issue with the error message and context
> first, Wenchen?
>
> As you know, if we see something, we had better file a JIRA issue because
> it could be not only an Apache Spark issue but an issue for all ASF
> projects.
>
> Dongjoon.
>
>
> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> After resolving a few issues in the release scripts, I can finally build
>> the release packages. However, I can't upload them to the staging SVN repo
>> due to a transmitting error, and it seems like a limitation from the server
>> side. I tried it on both my local laptop and remote AWS instance, but
>> neither works. These package binaries are like 300-400 MBs, and we just did
>> a release last month. Not sure if this is a new limitation due to cost
>> saving.
>>
>> While I'm looking for help to get unblocked, I'm wondering if we can
>> upload release packages to a public git repo instead, under the Apache
>> account?
>>
>>>
>>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Could you file an INFRA JIRA issue with the error message and context
first, Wenchen?

As you know, if we see something, we had better file a JIRA issue because
it could be not only an Apache Spark issue but an issue for all ASF
projects.

Dongjoon.


On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:

> UPDATE:
>
> After resolving a few issues in the release scripts, I can finally build
> the release packages. However, I can't upload them to the staging SVN repo
> due to a transmitting error, and it seems like a limitation from the server
> side. I tried it on both my local laptop and remote AWS instance, but
> neither works. These package binaries are like 300-400 MBs, and we just did
> a release last month. Not sure if this is a new limitation due to cost
> saving.
>
> While I'm looking for help to get unblocked, I'm wondering if we can
> upload release packages to a public git repo instead, under the Apache
> account?
>
> On Thu, May 9, 2024 at 12:39 AM Holden Karau 
> wrote:
>
>> That looks cool, maybe let’s split off a thread on how to improve our
>> release processes?
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:
>>
>>> On that note, GitHub recently released (public preview) a new feature
>>> called Artifact Attestations which may be relevant/useful here: Introducing
>>> Artifact Attestations–now in public beta - The GitHub Blog
>>> 
>>>
>>> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek 
>>> wrote:
>>>
 I have no permissions so I can't do it but I'm happy to help (although
 I am more familiar with Gitlab CICD than Github Actions).
 Is there some point of contact that can provide me needed context and
 permissions?
 I'd also love to see why the costs are high and see how we can reduce
 them...

 Thanks,
 Nimrod

 On Wed, May 8, 2024 at 8:26 AM Holden Karau 
 wrote:

> I think signing the artifacts produced from a secure CI sounds like a
> good idea. I know we’ve been asked to reduce our GitHub action usage but
> perhaps someone interested could volunteer to set that up.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
> wrote:
>
>> Hi,
>> Thanks for the reply.
>>
>> From my experience, a build on a build server would be much more
>> predictable and less error prone than building on some laptop- and of
>> course much faster to have builds, snapshots, release candidates, early
>> previews releases, release candidates or final releases.
>> It will enable us to have a preview version with current changes-
>> snapshot version, either automatically every day or if we need to save
>> costs (although build is really not expensive) - with a click of a 
>> button.
>>
>> Regarding keys for signing. - that's what vaults are for, all across
>> the industry we are using vaults (such as hashicorp vault)- but if the
>> build will be automated and the only thing which will be manual is to 
>> sign
>> the release for security reasons that would be reasonable.
>>
>> Thanks,
>> Nimrod
>>
>>
>> On Wed, May 8, 2024 at 00:54, Holden Karau <
>> holden.ka...@gmail.com>:
>>
>>> Indeed. We could conceivably build the release in CI/CD but the
>>> final verification / signing should be done locally to keep the keys 
>>> safe
>>> (there was some concern from earlier release processes).
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>> wrote:
>>>
 Hi,

 Sorry for the novice question, Wenchen - the release is done
 manually from a laptop? Not using a CI CD process on a build server?

 Thanks,
 Nimrod

 On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
 wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and
> get it ready for the release process (docker desktop doesn't work 
> anymore,
> my pgp key is lost, etc.). I'll start the RC process tomorrow (my time).
> Thanks
> for your patience!
>
> Wenchen

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at
least add a test job to make sure the release script can produce the
packages correctly. Today it's kind of being manually tested by the
release manager each time, which slows down the release process. It's
better if we can automate it entirely, so that making a release is a simple
click by authorized people.

On Thu, May 9, 2024 at 9:48 PM Nimrod Ofek  wrote:

> Following the conversation started with Spark 4.0.0 release, this is a
> thread to discuss improvements to our release processes.
>
> I'll start by raising some questions that probably should have answers to
> start the discussion:
>
>
>1. What is currently running in GitHub Actions?
>2. Who currently has permissions for Github actions? Is there a
>specific owner for that today or a different volunteer each time?
>3. What are the current limits of GitHub Actions, who set them - and
>what is the process to change those (if possible at all, but I presume not
>all Apache projects have the same limits)?
>4. What versions should we support as an output for the build?
>5. Where should the artifacts be stored?
>6. What should be the output? only tar or also a docker image
>published somewhere?
>7. Do we want to have a release on fixed dates or a manual release
>upon request?
>8. Who should be permitted to sign a version - and what is the process
>for that?
>
>
> Thanks!
> Nimrod
>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Hussein Awala
Hello,

I can answer some of your common questions with other Apache projects.

> Who currently has permissions for Github actions? Is there a specific
owner for that today or a different volunteer each time?

The Apache organization owns Github Actions, and committers (contributors
with write permissions) can retrigger/cancel a Github Actions workflow, but
Github Actions runners are managed by the Apache infra team.

> What are the current limits of GitHub Actions, who set them - and what is
the process to change those (if possible at all, but I presume not all
Apache projects have the same limits)?

For limits, I don't think there is any significant limit, especially since
the Apache organization has 900 donated runners used by its projects, and
there is an initiative from the Infra team to add self-hosted runners
running on Kubernetes (document).

> Where should the artifacts be stored?

Usually, we use Maven for jars, DockerHub for Docker images, and Github
cache for workflow cache. But we can use Github artifacts to store any kind
of package (even Docker images in the ghcr), which is fully accepted by
Apache policies. Also if the project has a cloud account (AWS, GCP, Azure,
...), a bucket can be used to store some of the packages.


 > Who should be permitted to sign a version - and what is the process for
that?

The Apache documentation is clear about this: by default, only PMC members
can be release managers, but we can contact the infra team to add one of
the committers as a release manager (document). The
process of creating a new version is described in this document.


On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek  wrote:

> Following the conversation started with Spark 4.0.0 release, this is a
> thread to discuss improvements to our release processes.
>
> I'll start by raising some questions that probably should have answers to
> start the discussion:
>
>
>1. What is currently running in GitHub Actions?
>2. Who currently has permissions for Github actions? Is there a
>specific owner for that today or a different volunteer each time?
>3. What are the current limits of GitHub Actions, who set them - and
>what is the process to change those (if possible at all, but I presume not
>all Apache projects have the same limits)?
>4. What versions should we support as an output for the build?
>5. Where should the artifacts be stored?
>6. What should be the output? only tar or also a docker image
>published somewhere?
>7. Do we want to have a release on fixed dates or a manual release
>upon request?
>8. Who should be permitted to sign a version - and what is the process
>for that?
>
>
> Thanks!
> Nimrod
>


[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started with Spark 4.0.0 release, this is a
thread to discuss improvements to our release processes.

I'll start by raising some questions that probably should have answers to
start the discussion:


   1. What is currently running in GitHub Actions?
   2. Who currently has permissions for Github actions? Is there a specific
   owner for that today or a different volunteer each time?
   3. What are the current limits of GitHub Actions, who set them - and
   what is the process to change those (if possible at all, but I presume not
   all Apache projects have the same limits)?
   4. What versions should we support as an output for the build?
   5. Where should the artifacts be stored?
   6. What should be the output? only tar or also a docker image published
   somewhere?
   7. Do we want to have a release on fixed dates or a manual release upon
   request?
   8. Who should be permitted to sign a version - and what is the process
   for that?


Thanks!
Nimrod


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE:

After resolving a few issues in the release scripts, I can finally build
the release packages. However, I can't upload them to the staging SVN repo
due to a transmission error, and it seems like a limitation from the server
side. I tried it on both my local laptop and remote AWS instance, but
neither works. These package binaries are like 300-400 MBs, and we just did
a release last month. Not sure if this is a new limitation due to cost
saving.

While I'm looking for help to get unblocked, I'm wondering if we can upload
release packages to a public git repo instead, under the Apache account?

On Thu, May 9, 2024 at 12:39 AM Holden Karau  wrote:

> That looks cool, maybe let’s split off a thread on how to improve our
> release processes?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:
>
>> On that note, GitHub recently released (public preview) a new feature
>> called Artifact Attestations which may be relevant/useful here: Introducing
>> Artifact Attestations–now in public beta - The GitHub Blog
>> 
>>
>> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>>
>>> I have no permissions so I can't do it but I'm happy to help (although I
>>> am more familiar with Gitlab CICD than Github Actions).
>>> Is there some point of contact that can provide me needed context and
>>> permissions?
>>> I'd also love to see why the costs are high and see how we can reduce
>>> them...
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>>> wrote:
>>>
 I think signing the artifacts produced from a secure CI sounds like a
 good idea. I know we’ve been asked to reduce our GitHub action usage but
 perhaps someone interested could volunteer to set that up.

 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau


 On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
 wrote:

> Hi,
> Thanks for the reply.
>
> From my experience, a build on a build server would be much more
> predictable and less error prone than building on some laptop- and of
> course much faster to have builds, snapshots, release candidates, early
> previews releases, release candidates or final releases.
> It will enable us to have a preview version with current changes-
> snapshot version, either automatically every day or if we need to save
> costs (although build is really not expensive) - with a click of a button.
>
> Regarding keys for signing. - that's what vaults are for, all across
> the industry we are using vaults (such as hashicorp vault)- but if the
> build will be automated and the only thing which will be manual is to sign
> the release for security reasons that would be reasonable.
>
> Thanks,
> Nimrod
>
>
> On Wed, May 8, 2024 at 00:54, Holden Karau <
> holden.ka...@gmail.com>:
>
>> Indeed. We could conceivably build the release in CI/CD but the final
>> verification / signing should be done locally to keep the keys safe 
>> (there
>> was some concern from earlier release processes).
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for the novice question, Wenchen - the release is done
>>> manually from a laptop? Not using a CI CD process on a build server?
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>>> wrote:
>>>
 UPDATE:

 Unfortunately, it took me quite some time to set up my laptop and
 get it ready for the release process (docker desktop doesn't work 
 anymore,
 my pgp key is lost, etc.). I'll start the RC process tomorrow (my time).
 Thanks
 for your patience!

 Wenchen

 On Fri, May 3, 2024 at 7:47 AM yangjie01 
 wrote:

> +1
>
>
>
> *From:* Jungtaek Lim 
> *Date:* Thursday, May 2, 2024, 10:21
> *To:* Holden Karau 
> *Cc:* Chao Sun , Xiao Li <
> gatorsm...@gmail.com>, Tathagata Das ,
> Wenchen Fan , Cheng Pan ,
> Nicholas Chammas , Dongjoon Hyun <
> dongjoon.h...@gmail.com>, Cheng Pan , Spark
> dev 

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Very helpful!

On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh 
wrote:

> *Potential reasons*
>
>
>- Data Serialization: Spark needs to serialize the DataFrame into an
>in-memory format suitable for storage. This process can be time-consuming,
>especially for large datasets like 3.2 GB with complex schemas.
>- Shuffle Operations: If your transformations involve shuffle
>operations, Spark might need to shuffle data across the cluster to ensure
>    efficient storage. Shuffling can be slow, especially on large datasets or
>    with limited network bandwidth or nodes. Check the Spark UI's Stages and
>    Executors tabs for info on shuffle reads and writes.
>- Memory Allocation: Spark allocates memory for the cached DataFrame.
>Depending on the cluster configuration and available memory, this
>allocation can take some time.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Wed, 8 May 2024 at 13:41, Prem Sahoo  wrote:
>
>> Could anyone help me here?
>> Sent from my iPhone
>>
>> > On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
>> >
>> > 
>> > Hello Folks,
>> > in Spark I have read a file and done some transformation and finally
>> writing to hdfs.
>> >
>> > Now I am interested in writing the same dataframe to MapRFS but for
>> this Spark will execute the full DAG again  (recompute all the previous
>> steps)(all the read + transformations ).
>> >
>> > I don't want this recompute again so I decided to cache() the dataframe
>> so that 2nd/nth write won't recompute all the steps .
>> >
>> > But here is a catch: the cache() takes more time to persist the data in
>> memory.
>> >
>> > I have a question: when the dataframe is already in memory, why does
>> saving it to another space in memory take more time (3.2 GB of data, 6 mins)?
>> >
>> > May I know what operations in cache() are taking such a long time ?
>> >
>> > I would appreciate it if someone would share the information .
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our
release processes?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:

> On that note, GitHub recently released (public preview) a new feature
> called Artifact Attestations which may be relevant/useful here: Introducing
> Artifact Attestations–now in public beta - The GitHub Blog
> 
>
> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>
>> I have no permissions so I can't do it but I'm happy to help (although I
>> am more familiar with Gitlab CICD than Github Actions).
>> Is there some point of contact that can provide me needed context and
>> permissions?
>> I'd also love to see why the costs are high and see how we can reduce
>> them...
>>
>> Thanks,
>> Nimrod
>>
>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>> wrote:
>>
>>> I think signing the artifacts produced from a secure CI sounds like a
>>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>>> perhaps someone interested could volunteer to set that up.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
>>> wrote:
>>>
 Hi,
 Thanks for the reply.

 From my experience, a build on a build server would be much more
 predictable and less error prone than building on some laptop- and of
 course much faster to have builds, snapshots, release candidates, early
 previews releases, release candidates or final releases.
 It will enable us to have a preview version with current changes-
 snapshot version, either automatically every day or if we need to save
 costs (although build is really not expensive) - with a click of a button.

 Regarding keys for signing. - that's what vaults are for, all across
 the industry we are using vaults (such as hashicorp vault)- but if the
 build will be automated and the only thing which will be manual is to sign
 the release for security reasons that would be reasonable.

 Thanks,
 Nimrod


 On Wed, May 8, 2024 at 00:54, Holden Karau <
 holden.ka...@gmail.com>:

> Indeed. We could conceivably build the release in CI/CD but the final
> verification / signing should be done locally to keep the keys safe (there
> was some concern from earlier release processes).
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
> wrote:
>
>> Hi,
>>
>> Sorry for the novice question, Wenchen - the release is done manually
>> from a laptop? Not using a CI CD process on a build server?
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>> wrote:
>>
>>> UPDATE:
>>>
>>> Unfortunately, it took me quite some time to set up my laptop and
>>> get it ready for the release process (docker desktop doesn't work 
>>> anymore,
>>> my pgp key is lost, etc.). I'll start the RC process tomorrow (my time).
>>> Thanks
>>> for your patience!
>>>
>>> Wenchen
>>>
>>> On Fri, May 3, 2024 at 7:47 AM yangjie01 
>>> wrote:
>>>
 +1



 *From:* Jungtaek Lim 
 *Date:* Thursday, May 2, 2024, 10:21
 *To:* Holden Karau 
 *Cc:* Chao Sun , Xiao Li <
 gatorsm...@gmail.com>, Tathagata Das ,
 Wenchen Fan , Cheng Pan ,
 Nicholas Chammas , Dongjoon Hyun <
 dongjoon.h...@gmail.com>, Cheng Pan , Spark
 dev list , Anish Shrigondekar <
 anish.shrigonde...@databricks.com>
 *Subject:* Re: [DISCUSS] Spark 4.0.0 release



 +1 love to see it!



 On Thu, May 2, 2024 at 10:08 AM Holden Karau <
 holden.ka...@gmail.com> wrote:

 +1 :) yay previews



 On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

 +1



 On Wed, May 1, 2024 at 5:23 PM Xiao Li 
 wrote:

 +1 for next Monday.



 We can do more previews when the other features are ready for
 preview.



 Tathagata Das  于2024年5月1日周三 08:46写道:

 Next week sounds 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Erik Krogen
On that note, GitHub recently released (public preview) a new feature
called Artifact Attestations which may be relevant/useful here: Introducing
Artifact Attestations–now in public beta - The GitHub Blog


On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:

> I have no permissions so I can't do it but I'm happy to help (although I
> am more familiar with Gitlab CICD than Github Actions).
> Is there some point of contact that can provide me needed context and
> permissions?
> I'd also love to see why the costs are high and see how we can reduce
> them...
>
> Thanks,
> Nimrod
>
> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
> wrote:
>
>> I think signing the artifacts produced from a secure CI sounds like a
>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>> perhaps someone interested could volunteer to set that up.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:
>>
>>> Hi,
>>> Thanks for the reply.
>>>
>>> From my experience, a build on a build server would be much more
>>> predictable and less error prone than building on some laptop- and of
>>> course much faster to have builds, snapshots, release candidates, early
>>> previews releases, release candidates or final releases.
>>> It will enable us to have a preview version with current changes-
>>> snapshot version, either automatically every day or if we need to save
>>> costs (although build is really not expensive) - with a click of a button.
>>>
>>> Regarding keys for signing. - that's what vaults are for, all across the
>>> industry we are using vaults (such as hashicorp vault)- but if the build
>>> will be automated and the only thing which will be manual is to sign the
>>> release for security reasons that would be reasonable.
>>>
>>> Thanks,
>>> Nimrod
>>>
>>>
>>> On Wed, May 8, 2024 at 00:54, Holden Karau <
>>> holden.ka...@gmail.com>:
>>>
 Indeed. We could conceivably build the release in CI/CD but the final
 verification / signing should be done locally to keep the keys safe (there
 was some concern from earlier release processes).

 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau


 On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
 wrote:

> Hi,
>
> Sorry for the novice question, Wenchen - the release is done manually
> from a laptop? Not using a CI CD process on a build server?
>
> Thanks,
> Nimrod
>
> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
> wrote:
>
>> UPDATE:
>>
>> Unfortunately, it took me quite some time to set up my laptop and get
>> it ready for the release process (docker desktop doesn't work anymore, my
>> pgp key is lost, etc.). I'll start the RC process tomorrow (my time). Thanks
>> for your patience!
>>
>> Wenchen
>>
>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> *From:* Jungtaek Lim 
>>> *Date:* Thursday, May 2, 2024, 10:21
>>> *To:* Holden Karau 
>>> *Cc:* Chao Sun , Xiao Li ,
>>> Tathagata Das , Wenchen Fan <
>>> cloud0...@gmail.com>, Cheng Pan , Nicholas
>>> Chammas , Dongjoon Hyun <
>>> dongjoon.h...@gmail.com>, Cheng Pan , Spark
>>> dev list , Anish Shrigondekar <
>>> anish.shrigonde...@databricks.com>
>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>
>>>
>>>
>>> +1 love to see it!
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>> wrote:
>>>
>>> +1 :) yay previews
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
>>> +1 for next Monday.
>>>
>>>
>>>
>>> We can do more previews when the other features are ready for
>>> preview.
>>>
>>>
>>>
>>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>>> wrote:
>>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about 
>>> we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>>
>>>
>>> 

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Mich Talebzadeh
*Potential reasons*


   - Data Serialization: Spark needs to serialize the DataFrame into an
   in-memory format suitable for storage. This process can be time-consuming,
   especially for large datasets like 3.2 GB with complex schemas.
   - Shuffle Operations: If your transformations involve shuffle
   operations, Spark might need to shuffle data across the cluster to ensure
   efficient storage. Shuffling can be slow, especially on large datasets or
   with limited network bandwidth or nodes. Check the Spark UI's Stages and
   Executors tabs for info on shuffle reads and writes.
   - Memory Allocation: Spark allocates memory for the cached DataFrame.
   Depending on the cluster configuration and available memory, this
   allocation can take some time (see the sketch below).
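
A minimal Scala sketch of the pattern under discussion (paths and the
transformation are invented for illustration). Note that cache()/persist() is
lazy: the first action is what pays the serialization, shuffle, and
allocation costs above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()

val df = spark.read.parquet("hdfs:///data/input")  // hypothetical path
  .filter("status = 'active'")                     // stand-in transformation

// Mark for caching, then force materialization once with an action.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()

// Subsequent writes reuse the cached data instead of re-running the DAG.
cached.write.mode("overwrite").parquet("hdfs:///data/out")
cached.write.mode("overwrite").parquet("maprfs:///data/out")

cached.unpersist()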

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Wed, 8 May 2024 at 13:41, Prem Sahoo  wrote:

> Could anyone help me here?
> Sent from my iPhone
>
> > On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
> >
> > 
> > Hello Folks,
> > in Spark I have read a file and done some transformation and finally
> writing to hdfs.
> >
> > Now I am interested in writing the same dataframe to MapRFS but for this
> Spark will execute the full DAG again  (recompute all the previous
> steps)(all the read + transformations ).
> >
> > I don't want this recompute again so I decided to cache() the dataframe
> so that 2nd/nth write won't recompute all the steps .
> >
> > But here is a catch: the cache() takes more time to persist the data in
> memory.
> >
> > I have a question: when the dataframe is already in memory, why does
> saving it to another space in memory take more time (3.2 GB of data, 6 mins)?
> >
> > May I know what operations in cache() are taking such a long time ?
> >
> > I would appreciate it if someone would share the information .
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Could anyone help me here?
Sent from my iPhone

> On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
> 
> 
> Hello Folks,
> in Spark I have read a file, done some transformations, and finally written
> the result to HDFS.
> 
> Now I am interested in writing the same dataframe to MapRFS, but for this
> Spark will execute the full DAG again (recompute all the previous steps: all
> the reads + transformations).
> 
> I don't want this recompute again, so I decided to cache() the dataframe so
> that the 2nd/nth write won't recompute all the steps.
> 
> But here is a catch: the cache() takes more time to persist the data in
> memory.
> 
> I have a question: when the dataframe is already in memory, why does saving
> it to another space in memory take more time (3.2 GB of data, 6 mins)?
> 
> May I know what operations in cache() are taking such a long time?
> 
> I would appreciate it if someone would share the information.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I have no permissions so I can't do it but I'm happy to help (although I am
more familiar with Gitlab CICD than Github Actions).
Is there some point of contact that can provide me needed context and
permissions?
I'd also love to see why the costs are high and see how we can reduce
them...

Thanks,
Nimrod

On Wed, May 8, 2024 at 8:26 AM Holden Karau  wrote:

> I think signing the artifacts produced from a secure CI sounds like a good
> idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
> someone interested could volunteer to set that up.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:
>
>> Hi,
>> Thanks for the reply.
>>
>> From my experience, a build on a build server would be much more
>> predictable and less error prone than building on some laptop- and of
>> course much faster to have builds, snapshots, release candidates, early
>> previews releases, release candidates or final releases.
>> It will enable us to have a preview version with current changes-
>> snapshot version, either automatically every day or if we need to save
>> costs (although build is really not expensive) - with a click of a button.
>>
>> Regarding keys for signing. - that's what vaults are for, all across the
>> industry we are using vaults (such as hashicorp vault)- but if the build
>> will be automated and the only thing which will be manual is to sign the
>> release for security reasons that would be reasonable.
>>
>> Thanks,
>> Nimrod
>>
>>
>> On Wed, May 8, 2024 at 00:54, Holden Karau <
>> holden.ka...@gmail.com>:
>>
>>> Indeed. We could conceivably build the release in CI/CD but the final
>>> verification / signing should be done locally to keep the keys safe (there
>>> was some concern from earlier release processes).
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>> wrote:
>>>
 Hi,

 Sorry for the novice question, Wenchen - the release is done manually
 from a laptop? Not using a CI CD process on a build server?

 Thanks,
 Nimrod

 On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get
> it ready for the release process (docker desktop doesn't work anymore, my
> pgp key is lost, etc.). I'll start the RC process tomorrow (my time). Thanks
> for your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Thursday, May 2, 2024, 10:21
>> *To:* Holden Karau 
>> *Cc:* Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas
>> Chammas , Dongjoon Hyun <
>> dongjoon.h...@gmail.com>, Cheng Pan , Spark dev
>> list , Anish Shrigondekar <
>> anish.shrigonde...@databricks.com>
>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>> wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We
>> don't need to wait for all the ongoing projects to be ready. How about we
>> do a 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard 
>> to
>> do that without a Preview release. So the sooner we make a Preview 
>> release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good
idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
someone interested could volunteer to set that up.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:

> Hi,
> Thanks for the reply.
>
> From my experience, a build on a build server would be much more
> predictable and less error prone than building on some laptop- and of
> course much faster to have builds, snapshots, release candidates, early
> previews releases, release candidates or final releases.
> It will enable us to have a preview version with current changes- snapshot
> version, either automatically every day or if we need to save costs
> (although build is really not expensive) - with a click of a button.
>
> Regarding keys for signing. - that's what vaults are for, all across the
> industry we are using vaults (such as hashicorp vault)- but if the build
> will be automated and the only thing which will be manual is to sign the
> release for security reasons that would be reasonable.
>
> Thanks,
> Nimrod
>
>
> On Wed, May 8, 2024 at 00:54, Holden Karau <
> holden.ka...@gmail.com>:
>
>> Indeed. We could conceivably build the release in CI/CD but the final
>> verification / signing should be done locally to keep the keys safe (there
>> was some concern from earlier release processes).
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for the novice question, Wenchen - the release is done manually
>>> from a laptop? Not using a CI CD process on a build server?
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>>
 UPDATE:

 Unfortunately, it took me quite some time to set up my laptop and get
 it ready for the release process (docker desktop doesn't work anymore, my
 pgp key is lost, etc.). I'll start the RC process tomorrow my time. Thanks
 for your patience!

 Wenchen

 On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:

> +1
>
>
>
> *From:* Jungtaek Lim 
> *Date:* Thursday, May 2, 2024 10:21
> *To:* Holden Karau 
> *Cc:* Chao Sun , Xiao Li ,
> Tathagata Das , Wenchen Fan <
> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas
> , Dongjoon Hyun ,
> Cheng Pan , Spark dev list ,
> Anish Shrigondekar 
> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>
>
>
> +1 love to see it!
>
>
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
> +1 :) yay previews
>
>
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
> +1
>
>
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
> +1 for next Monday.
>
>
>
> We can do more previews when the other features are ready for preview.
>
>
>
> Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:
>
> Next week sounds great! Thank you Wenchen!
>
>
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
> wrote:
>
> Yea I think a preview release won't hurt (without a branch cut). We
> don't need to wait for all the ongoing projects to be ready. How about we
> do a 4.0 preview release based on the current master branch next Monday?
>
>
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
> Hey all,
>
>
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard
> to do that without a Preview release. So the sooner we make a Preview
> release, the faster we can start getting feedback for fixing things for a
> great Spark 4.0 final release.
>
>
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
> wrote:
>
> Thank you all for the replies!
>
>
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
>
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,
Thanks for the reply.

From my experience, a build on a build server would be much more
predictable and less error-prone than building on someone's laptop, and of
course much faster for producing builds: snapshots, early
preview releases, release candidates, or final releases.
It would enable us to have a preview version with the current changes (a
snapshot version), either automatically every day or, if we need to save
costs (although builds are really not expensive), with a click of a button.

Regarding keys for signing: that's what vaults are for; all across the
industry we use vaults (such as HashiCorp Vault). But if the build
is automated and the only manual step is signing the
release, for security reasons, that would be reasonable.

Thanks,
Nimrod


On Wednesday, May 8, 2024 at 00:54, Holden Karau wrote:

> Indeed. We could conceivably build the release in CI/CD but the final
> verification / signing should be done locally to keep the keys safe (there
> was some concern from earlier release processes).
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:
>
>> Hi,
>>
>> Sorry for the novice question, Wenchen - the release is done manually
>> from a laptop? Not using a CI CD process on a build server?
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> Unfortunately, it took me quite some time to set up my laptop and get it
>>> ready for the release process (docker desktop doesn't work anymore, my pgp
>>> key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
>>> your patience!
>>>
>>> Wenchen
>>>
>>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>>
 +1



 *From:* Jungtaek Lim 
 *Date:* Thursday, May 2, 2024 10:21
 *To:* Holden Karau 
 *Cc:* Chao Sun , Xiao Li ,
 Tathagata Das , Wenchen Fan <
 cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
 nicholas.cham...@gmail.com>, Dongjoon Hyun ,
 Cheng Pan , Spark dev list ,
 Anish Shrigondekar 
 *Subject:* Re: [DISCUSS] Spark 4.0.0 release



 +1 love to see it!



 On Thu, May 2, 2024 at 10:08 AM Holden Karau 
 wrote:

 +1 :) yay previews



 On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

 +1



 On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:

 +1 for next Monday.



 We can do more previews when the other features are ready for preview.



 Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:

 Next week sounds great! Thank you Wenchen!



 On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
 wrote:

 Yea I think a preview release won't hurt (without a branch cut). We
 don't need to wait for all the ongoing projects to be ready. How about we
 do a 4.0 preview release based on the current master branch next Monday?



 On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

 Hey all,



 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.



 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.



 Thanks!





 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

 Thank you all for the replies!



 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
 PR and let's finish others before the 4.0 release.

 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.

 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.

 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.



 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

 will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

 Thanks,
 Cheng Pan


 > On Apr 15, 2024, at 09:58, Jungtaek Lim 
 wrote:
 >
 > 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD but the final
verification / signing should be done locally to keep the keys safe (there
was some concern from earlier release processes).
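
For readers unfamiliar with that step, the local part is roughly a checksum
verification plus a detached GPG signature made with the release manager's
key. A sketch with placeholder file names, not the project's actual release
scripts:

```
# Verify the CI-built artifact locally...
sha512sum -c spark-X.Y.Z-bin.tgz.sha512

# ...then sign it with a key that never leaves the local machine.
gpg --armor --detach-sign spark-X.Y.Z-bin.tgz
```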

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:

> Hi,
>
> Sorry for the novice question, Wenchen - the release is done manually from
> a laptop? Not using a CI CD process on a build server?
>
> Thanks,
> Nimrod
>
> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> Unfortunately, it took me quite some time to set up my laptop and get it
>> ready for the release process (docker desktop doesn't work anymore, my pgp
>> key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
>> your patience!
>>
>> Wenchen
>>
>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> *From:* Jungtaek Lim 
>>> *Date:* Thursday, May 2, 2024 10:21
>>> *To:* Holden Karau 
>>> *Cc:* Chao Sun , Xiao Li ,
>>> Tathagata Das , Wenchen Fan <
>>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>>> Cheng Pan , Spark dev list ,
>>> Anish Shrigondekar 
>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>
>>>
>>>
>>> +1 love to see it!
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>> wrote:
>>>
>>> +1 :) yay previews
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
>>> +1 for next Monday.
>>>
>>>
>>>
>>> We can do more previews when the other features are ready for preview.
>>>
>>>
>>>
>>> Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:
>>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>>
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>>
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
>>> Thank you all for the replies!
>>>
>>>
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>>
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>>
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>>
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>>> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>>> and I think it's time to prepare for it and discuss the ongoing projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Dongjoon Hyun
Thank you so much for the update, Wenchen!

Dongjoon.

On Tue, May 7, 2024 at 10:49 AM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Thursday, May 2, 2024 10:21
>> *To:* Holden Karau 
>> *Cc:* Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>> Cheng Pan , Spark dev list ,
>> Anish Shrigondekar 
>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>
>>
>>
>> --
>>
>> Twitter: https://twitter.com/holdenkarau
>> 
>>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> 
>>
>> YouTube Live Streams: 

caching a dataframe in Spark takes lot of time

2024-05-07 Thread Prem Sahoo
Hello Folks,
In Spark I have read a file, done some transformations, and finally
written the result to HDFS.

Now I am interested in writing the same dataframe to MapRFS, but for this
Spark will execute the full DAG again (recompute all the previous
steps: the read plus the transformations).

I don't want this recomputation, so I decided to cache() the dataframe so
that the 2nd/nth write won't recompute all the steps.

But here is the catch: cache() takes more time to persist the data in
memory.

My question is: when the dataframe has already been computed, why does
saving it to another place in memory take so long (3.2 GB of data, 6
minutes)?

May I know what operations in cache() are taking such a long time?

I would appreciate it if someone would share the information.
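
For reference, a minimal sketch of the pattern being described, in
spark-shell style (the paths, the filter, and the storage level are
placeholders, not details from the original job):

```scala
import org.apache.spark.storage.StorageLevel

// Read once and transform (placeholder path and logic).
val df = spark.read.parquet("hdfs:///input/events")
  .filter("status = 'ok'")

// persist() only marks the dataframe for caching; it is lazy, so the
// blocks are materialized by the first action that scans the data.
df.persist(StorageLevel.MEMORY_AND_DISK)

// First write: computes the full DAG once and fills the cache as a
// side effect.
df.write.mode("overwrite").parquet("hdfs:///output/events")

// Second/nth write: served from the cached blocks, without recomputing
// the read and the transformations.
df.write.mode("overwrite").parquet("maprfs:///output/events")

// Release the cached blocks when done.
df.unpersist()
```

Note that filling the cache is itself real work: the first action both
computes the plan and copies the rows into the executors' block store
(serializing them, depending on the storage level), so extra time on the
first, cache-filling pass is expected.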


Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,

Sorry for the novice question, Wenchen - the release is done manually from
a laptop? Not using a CI CD process on a build server?

Thanks,
Nimrod

On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Thursday, May 2, 2024 10:21
>> *To:* Holden Karau 
>> *Cc:* Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>> Cheng Pan , Spark dev list ,
>> Anish Shrigondekar 
>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>
>>
>>
>> --
>>
>> Twitter: https://twitter.com/holdenkarau
>> 
>>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
UPDATE:

Unfortunately, it took me quite some time to set up my laptop and get it
ready for the release process (docker desktop doesn't work anymore, my pgp
key is lost, etc.). I'll start the RC process tomorrow my time. Thanks for
your patience!

Wenchen

On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:

> +1
>
>
>
> *From:* Jungtaek Lim 
> *Date:* Thursday, May 2, 2024 10:21
> *To:* Holden Karau 
> *Cc:* Chao Sun , Xiao Li ,
> Tathagata Das , Wenchen Fan <
> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
> Cheng Pan , Spark dev list ,
> Anish Shrigondekar 
> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>
>
>
> +1 love to see it!
>
>
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
> +1 :) yay previews
>
>
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
> +1
>
>
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
> +1 for next Monday.
>
>
>
> We can do more previews when the other features are ready for preview.
>
>
>
> Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:
>
> Next week sounds great! Thank you Wenchen!
>
>
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
> Yea I think a preview release won't hurt (without a branch cut). We don't
> need to wait for all the ongoing projects to be ready. How about we do a
> 4.0 preview release based on the current master branch next Monday?
>
>
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
> wrote:
>
> Hey all,
>
>
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
>
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>
> Thank you all for the replies!
>
>
>
> To @Nicholas Chammas  : Thanks for cleaning
> up the error terminology and documentation! I've merged the first PR and
> let's finish others before the 4.0 release.
>
> To @Dongjoon Hyun  : Thanks for driving the ANSI
> on by default effort! Now the vote has passed, let's flip the config and
> finish the DataFrame error context feature before 4.0.
>
> To @Jungtaek Lim  : Ack. We can treat the
> Streaming state store data source as completed for 4.0 then.
>
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
>
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my understanding
> is that we want to release the feature to 4.0.0, but there are several
> remaining works to be done. While the tentative timeline for releasing is
> June 2024, what would be the tentative timeline for the RC cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June 2024),
> and I think it's time to prepare for it and discuss the ongoing projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I would
> like to volunteer as the release manager for Apache Spark 4.0.0 if there is
> no objection. Thank you all for the great work that fills Spark 4.0!
> >
> > Wenchen Fan
>
>
>
>
> --
>
> Twitter: https://twitter.com/holdenkarau
> 
>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> 
>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> 
>
>


Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks,

I wanted to check why Spark doesn't create a staging dir while doing an
insertInto on partitioned tables. I'm running the example code below –
```
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3)))
val df = spark.createDataFrame(rdd)
df.write.insertInto("testing_table") // testing table is partitioned on "_1"
```
In this scenario FileOutputCommitter considers the table path as the
output path and creates temporary folders like
`/testing_table/_temporary/0`, and then moves them to the
partition location when the job commit happens.

But if multiple parallel apps are inserting into the same
partition, this can cause race conditions when deleting the
`_temporary` dir. Ideally each app should have a unique staging dir
where the job writes its output.

Is there any specific reason for this, or am I missing something here?
Thanks for your time and assistance regarding this!
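
For illustration, a sketch of the per-application staging pattern suggested
above. The warehouse layout and paths here are hypothetical, and this is a
manual workaround, not something Spark's committer does for insertInto:

```scala
import java.util.UUID
import org.apache.hadoop.fs.{FileSystem, Path}

// Each application first writes to its own unique staging directory...
val staging = new Path(s"/tmp/staging-${UUID.randomUUID()}")
df.write.parquet(staging.toString)

// ...then moves the part files under the final partition directory, so
// concurrent jobs never share a _temporary directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val target = new Path("/warehouse/testing_table/_1=1") // hypothetical layout
fs.mkdirs(target)
fs.listStatus(staging)
  .filter(_.getPath.getName.startsWith("part-"))
  .foreach(f => fs.rename(f.getPath, new Path(target, f.getPath.getName)))
```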

Kind regards
Sanskar


Re: ASF board report draft for May

2024-05-06 Thread Matei Zaharia
I’ll mention that we’re working toward a preview release, even if the details 
are not finalized by the time we send the report.

> On May 6, 2024, at 10:52 AM, Holden Karau  wrote:
> 
> I trust Wenchen to manage the preview release effectively but if there are 
> concerns around how to manage a developer preview release lets split that off 
> from the board report discussion.
> 
> On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh  > wrote:
>> I did some historical digging on this.
>> 
>> Whilst both preview release and RCs are pre-release versions, the main 
>> difference lies in their maturity and readiness for production use. Preview 
>> releases are early versions aimed at gathering feedback, while release 
>> candidates (RCs) are nearly finished versions that undergo final testing and 
>> voting before the official release.
>> 
>> So in our case, we have two options:
>> 
>> 1. Skip mentioning the Preview and focus on "We are intending to gather 
>> feedback on version 4 by releasing an earlier version to the community for 
>> look and feel feedback, especially focused on APIs".
>> 2. Mention the Preview in the form: "There will be a Preview release with the aim 
>> of gathering feedback from the community focused on APIs".
>> IMO Preview release does not require a formal vote. Preview releases are 
>> often considered experimental or pre-alpha versions and are not expected to 
>> meet the same level of stability and completeness as release candidates or 
>> final releases.
>> 
>> HTH
>> 
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> 
>> London
>> United Kingdom
>> 
>>view my Linkedin profile 
>> 
>> 
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>  
>> Disclaimer: The information provided is correct to the best of my knowledge 
>> but of course cannot be guaranteed . It is essential to note that, as with 
>> any advice, quote "one test result is worth one-thousand expert opinions 
>> (Werner  Von Braun 
>> )".
>> 
>> 
>> On Mon, 6 May 2024 at 14:10, Mich Talebzadeh > > wrote:
>>> @Wenchen Fan  
>>> 
>>> Thanks for the update! To clarify, is the vote for approving a specific 
>>> preview build, or is it for moving towards an RC stage? I gather there is a 
>>> distinction between these two?
>>> 
>>> 
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> 
>>> London
>>> United Kingdom
>>> 
>>>view my Linkedin profile 
>>> 
>>> 
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>  
>>> Disclaimer: The information provided is correct to the best of my knowledge 
>>> but of course cannot be guaranteed . It is essential to note that, as with 
>>> any advice, quote "one test result is worth one-thousand expert opinions 
>>> (Werner  Von Braun 
>>> )".
>>> 
>>> 
>>> On Mon, 6 May 2024 at 13:03, Wenchen Fan >> > wrote:
 The preview release also needs a vote. I'll try my best to cut the RC on 
 Monday, but the actual release may take some time. Hopefully, we can get 
 it out this week but if the vote fails, it will take longer as we need 
 more RCs.
 
 On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun >>> > wrote:
> +1 for Holden's comment. Yes, it would be great to mention `it` as 
> "soon". 
> (If Wenchen release it on Monday, we can simply mention the release)
> 
> In addition, Apache Spark PMC received an official notice from ASF Infra 
> team.
> 
> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF 
> > projects
> 
> To track and comply with the new ASF Infra Policy as much as possible, we 
> opened a blocker-level JIRA issue and have been working on it.
> - https://infra.apache.org/github-actions-policy.html
> 
> Please include a sentence that Apache Spark PMC is working on under the 
> following umbrella JIRA issue.
> 
> https://issues.apache.org/jira/browse/SPARK-48094
> > Reduce GitHub Action usage according to ASF project allowance
> 
> Thanks,
> Dongjoon.
> 
> 
> On Sun, May 5, 2024 at 3:45 PM Holden Karau  > wrote:
>> Do we want to include that we’re planning on having a preview release of 
>> Spark 4 so folks can see the APIs “soon”?
>> 
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): 
>> https://amzn.to/2MaRAG9  

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
I trust Wenchen to manage the preview release effectively but if there are
concerns around how to manage a developer preview release lets split that
off from the board report discussion.

On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh 
wrote:

> I did some historical digging on this.
>
> Whilst both preview release and RCs are pre-release versions, the main
> difference lies in their maturity and readiness for production use. Preview
> releases are early versions aimed at gathering feedback, while release
> candidates (RCs) are nearly finished versions that undergo final testing
> and voting before the official release.
>
> So in our case, we have two options:
>
>
>1. Skip mentioning the Preview and focus on "We are intending to
>gather feedback on version 4 by releasing an earlier version to the
>community for look and feel feedback, especially focused on APIs".
>2. Mention the Preview in the form: "There will be a Preview release with
>the aim of gathering feedback from the community focused on APIs".
>
> IMO Preview release does not require a formal vote. Preview releases are
> often considered experimental or pre-alpha versions and are not expected to
> meet the same level of stability and completeness as release candidates or
> final releases.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 6 May 2024 at 14:10, Mich Talebzadeh 
> wrote:
>
>> @Wenchen Fan 
>>
>> Thanks for the update! To clarify, is the vote for approving a specific
>> preview build, or is it for moving towards an RC stage? I gather there is a
>> distinction between these two?
>>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 6 May 2024 at 13:03, Wenchen Fan  wrote:
>>
>>> The preview release also needs a vote. I'll try my best to cut the RC on
>>> Monday, but the actual release may take some time. Hopefully, we can get it
>>> out this week but if the vote fails, it will take longer as we need more
>>> RCs.
>>>
>>> On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1 for Holden's comment. Yes, it would be great to mention `it` as
 "soon".
 (If Wenchen release it on Monday, we can simply mention the release)

 In addition, Apache Spark PMC received an official notice from ASF
 Infra team.

 https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
 > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for
 ASF projects

 To track and comply with the new ASF Infra Policy as much as possible,
 we opened a blocker-level JIRA issue and have been working on it.
 - https://infra.apache.org/github-actions-policy.html

 Please include a sentence that Apache Spark PMC is working on under the
 following umbrella JIRA issue.

 https://issues.apache.org/jira/browse/SPARK-48094
 > Reduce GitHub Action usage according to ASF project allowance

 Thanks,
 Dongjoon.


 On Sun, May 5, 2024 at 3:45 PM Holden Karau 
 wrote:

> Do we want to include that we’re planning on having a preview release
> of Spark 4 so folks can see the APIs “soon”?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
> wrote:
>
>> It’s time for our quarterly ASF board report on Apache Spark this
>> Wednesday. Here’s a draft, feel free to suggest changes.
>>
>> 
>>
>> Description:
>>
>> Apache Spark is a fast and general purpose engine for large-scale
>> data processing. It offers high-level APIs in Java, 

Re: Why spark-submit works with package not with jar

2024-05-06 Thread Mich Talebzadeh
Thanks David. I wanted to explain the difference between --packages and
--jars, with comments from the community from previous discussions a few
years back.
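
In short, as Jeff points out below: --jars ships only the jar files you
list, while --packages resolves the Maven coordinate and its transitive
dependencies (here, the Google client library that provides
HttpRequestInitializer) via Ivy. A sketch of the two invocations, reusing
the commands from the quoted thread (the application jar and remaining
arguments are elided):

```
# --jars: only the listed jars are added; transitive dependencies of
# spark-bigquery_2.11 must be supplied by hand, otherwise the class is
# missing at runtime.
spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar \
  --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar \
  ...

# --packages: the coordinate and its transitive dependencies are resolved
# from the local Ivy cache / Maven repositories, so the class is found.
spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar \
  --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar \
  --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 \
  ...
```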

cheers


Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime


London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 6 May 2024 at 18:32, David Rabinowitz  wrote:

> Hi,
>
> It seems this library is several years old. Have you considered using the
> Google provided connector? You can find it in
> https://github.com/GoogleCloudDataproc/spark-bigquery-connector
>
> Regards,
> David Rabinowitz
>
> On Sun, May 5, 2024 at 6:07 PM Jeff Zhang  wrote:
>
>> Are you sure com.google.api.client.http.HttpRequestInitializer is in
>> the spark-bigquery-latest.jar, or might it be in a transitive dependency
>> of spark-bigquery_2.11?
>>
>> On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> -- Forwarded message -
>>> From: Mich Talebzadeh 
>>> Date: Tue, 20 Oct 2020 at 16:50
>>> Subject: Why spark-submit works with package not with jar
>>> To: user @spark 
>>>
>>>
>>> Hi,
>>>
>>> I have a scenario that I use in Spark submit as follows:
>>>
>>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
>>> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>>>
>>> As you can see the jar files needed are added.
>>>
>>>
>>> This comes back with error message as below
>>>
>>>
>>> Creating model test.weights_MODEL
>>>
>>> java.lang.NoClassDefFoundError:
>>> com/google/api/client/http/HttpRequestInitializer
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>>
>>>   ... 76 elided
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> com.google.api.client.http.HttpRequestInitializer
>>>
>>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>>
>>>
>>> So there is an issue with finding the class, although the jar file used
>>>
>>>
>>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>>
>>> has it.
>>>
>>>
>>> Now if *I remove the above jar file and replace it with the same
>>> version but package* it works!
>>>
>>>
>>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
>>> *-**-packages com.github.samelamin:spark-bigquery_2.11:0.2.6*
>>>
>>>
>>> I have read the write-ups about packages searching the maven
>>> libraries etc. Not convinced why using the package should make so much
>>> difference between a failure and success. In other words, when to use a
>>> package rather than a jar.
>>>
>>>
>>> Any ideas will be appreciated.
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>


Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
I did some historical digging on this.

Whilst both preview releases and RCs are pre-release versions, the main
difference lies in their maturity and readiness for production use. Preview
releases are early versions aimed at gathering feedback, while release
candidates (RCs) are nearly finished versions that undergo final testing
and voting before the official release.

So in our case, we have two options:


   1. Skip mentioning the Preview and focus on "We are intending to
   gather feedback on version 4 by releasing an earlier version to the
   community for look and feel feedback, especially focused on APIs".
   2. Mention the Preview in the form: "There will be a Preview release with
   the aim of gathering feedback from the community focused on APIs".

IMO a Preview release does not require a formal vote. Preview releases are
often considered experimental or pre-alpha versions and are not expected to
meet the same level of stability and completeness as release candidates or
final releases.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 6 May 2024 at 14:10, Mich Talebzadeh 
wrote:

> @Wenchen Fan 
>
> Thanks for the update! To clarify, is the vote for approving a specific
> preview build, or is it for moving towards an RC stage? I gather there is a
> distinction between these two?
>
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 6 May 2024 at 13:03, Wenchen Fan  wrote:
>
>> The preview release also needs a vote. I'll try my best to cut the RC on
>> Monday, but the actual release may take some time. Hopefully, we can get it
>> out this week but if the vote fails, it will take longer as we need more
>> RCs.
>>
>> On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1 for Holden's comment. Yes, it would be great to mention `it` as
>>> "soon".
>>> (If Wenchen release it on Monday, we can simply mention the release)
>>>
>>> In addition, Apache Spark PMC received an official notice from ASF Infra
>>> team.
>>>
>>> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
>>> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for
>>> ASF projects
>>>
>>> To track and comply with the new ASF Infra Policy as much as possible,
>>> we opened a blocker-level JIRA issue and have been working on it.
>>> - https://infra.apache.org/github-actions-policy.html
>>>
>>> Please include a sentence that Apache Spark PMC is working on under the
>>> following umbrella JIRA issue.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-48094
>>> > Reduce GitHub Action usage according to ASF project allowance
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
>>> wrote:
>>>
 Do we want to include that we’re planning on having a preview release
 of Spark 4 so folks can see the APIs “soon”?

 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau


 On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
 wrote:

> It’s time for our quarterly ASF board report on Apache Spark this
> Wednesday. Here’s a draft, feel free to suggest changes.
>
> 
>
> Description:
>
> Apache Spark is a fast and general purpose engine for large-scale data
> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
> well as a rich set of libraries including stream processing, machine
> learning, and graph analytics.
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
> Spark 3.4.2 on April 18, 2024.
> - The votes on "SPIP: Structured Logging Framework for Apache Spark"

Re: Why spark-submit works with package not with jar

2024-05-06 Thread David Rabinowitz
Hi,

It seems this library is several years old. Have you considered using the
Google provided connector? You can find it in
https://github.com/GoogleCloudDataproc/spark-bigquery-connector

Regards,
David Rabinowitz

On Sun, May 5, 2024 at 6:07 PM Jeff Zhang  wrote:

> Are you sure com.google.api.client.http.HttpRequestInitializer is in
> the spark-bigquery-latest.jar, or might it be in a transitive dependency
> of spark-bigquery_2.11?
>
> On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh 
> wrote:
>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> -- Forwarded message -
>> From: Mich Talebzadeh 
>> Date: Tue, 20 Oct 2020 at 16:50
>> Subject: Why spark-submit works with package not with jar
>> To: user @spark 
>>
>>
>> Hi,
>>
>> I have a scenario that I use in Spark submit as follows:
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
>> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>>
>> As you can see the jar files needed are added.
>>
>>
>> This comes back with error message as below
>>
>>
>> Creating model test.weights_MODEL
>>
>> java.lang.NoClassDefFoundError:
>> com/google/api/client/http/HttpRequestInitializer
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>
>>   ... 76 elided
>>
>> Caused by: java.lang.ClassNotFoundException:
>> com.google.api.client.http.HttpRequestInitializer
>>
>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>>
>>
>> So there is an issue with finding the class, although the jar file used
>>
>>
>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>
>> has it.
>>
>>
>> Now if *I remove the above jar file and replace it with the same version
>> but package* it works!
>>
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
>> *-**-packages com.github.samelamin:spark-bigquery_2.11:0.2.6*
>>
>>
>> I have read the write-ups about packages searching the maven
>> libraries etc. Not convinced why using the package should make so much
>> difference between a failure and success. In other words, when to use a
>> package rather than a jar.
>>
>>
>> Any ideas will be appreciated.
>>
>>
>> Thanks
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
@Wenchen Fan 

Thanks for the update! To clarify, is the vote for approving a specific
preview build, or is it for moving towards an RC stage? I gather there is a
distinction between these two?


Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 6 May 2024 at 13:03, Wenchen Fan  wrote:

> The preview release also needs a vote. I'll try my best to cut the RC on
> Monday, but the actual release may take some time. Hopefully, we can get it
> out this week but if the vote fails, it will take longer as we need more
> RCs.
>
> On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
> wrote:
>
>> +1 for Holden's comment. Yes, it would be great to mention `it` as
>> "soon".
>> (If Wenchen release it on Monday, we can simply mention the release)
>>
>> In addition, Apache Spark PMC received an official notice from ASF Infra
>> team.
>>
>> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
>> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
>> projects
>>
>> To track and comply with the new ASF Infra Policy as much as possible, we
>> opened a blocker-level JIRA issue and have been working on it.
>> - https://infra.apache.org/github-actions-policy.html
>>
>> Please include a sentence that Apache Spark PMC is working on under the
>> following umbrella JIRA issue.
>>
>> https://issues.apache.org/jira/browse/SPARK-48094
>> > Reduce GitHub Action usage according to ASF project allowance
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
>> wrote:
>>
>>> Do we want to include that we’re planning on having a preview release of
>>> Spark 4 so folks can see the APIs “soon”?
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
>>> wrote:
>>>
 It’s time for our quarterly ASF board report on Apache Spark this
 Wednesday. Here’s a draft, feel free to suggest changes.

 

 Description:

 Apache Spark is a fast and general purpose engine for large-scale data
 processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
 well as a rich set of libraries including stream processing, machine
 learning, and graph analytics.

 Issues for the board:

 - None

 Project status:

 - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
 Spark 3.4.2 on April 18, 2024.
 - The votes on "SPIP: Structured Logging Framework for Apache Spark"
 and "Pure Python Package in PyPI (Spark Connect)" have passed.
 - The votes for two behavior changes have passed: "SPARK-4: Use
 ANSI SQL mode by default" and "SPARK-46122: Set
 spark.sql.legacy.createHiveTableByDefault to false".
 - The community decided that upcoming Spark 4.0 release will drop
 support for Python 3.8.
 - We started a discussion about the definition of behavior changes that
 is critical for version upgrades and user experience.
 - We've opened a dedicated repository for the Spark Kubernetes Operator
 at https://github.com/apache/spark-kubernetes-operator. We added a new
 version in Apache Spark JIRA for versioning of the Spark operator based on
 a vote result.

 Trademarks:

 - No changes since the last report.

 Latest releases:
 - Spark 3.4.3 was released on April 18, 2024
 - Spark 3.5.1 was released on February 28, 2024
 - Spark 3.3.4 was released on December 16, 2023

 Committers and PMC:

 - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
 - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
 Yikun Jiang).

 
 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
If folks are against the term soon we could say “in-progress”

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, May 6, 2024 at 2:08 AM Mich Talebzadeh 
wrote:

> Hi,
>
> We should reconsider using the term "soon" for the ASF board, as it is
> subjective with no date attached (assuming this is an official communication
> on Wednesday). We ought to say:
>
>  "Spark 4, the next major release after Spark 3.x, is currently under
> development. We plan to make a preview version available for evaluation as
> soon as it is feasible"
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 6 May 2024 at 05:09, Dongjoon Hyun 
> wrote:
>
>> +1 for Holden's comment. Yes, it would be great to mention `it` as
>> "soon".
>> (If Wenchen release it on Monday, we can simply mention the release)
>>
>> In addition, Apache Spark PMC received an official notice from ASF Infra
>> team.
>>
>> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
>> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
>> projects
>>
>> To track and comply with the new ASF Infra Policy as much as possible, we
>> opened a blocker-level JIRA issue and have been working on it.
>> - https://infra.apache.org/github-actions-policy.html
>>
>> Please include a sentence that Apache Spark PMC is working on under the
>> following umbrella JIRA issue.
>>
>> https://issues.apache.org/jira/browse/SPARK-48094
>> > Reduce GitHub Action usage according to ASF project allowance
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
>> wrote:
>>
>>> Do we want to include that we’re planning on having a preview release of
>>> Spark 4 so folks can see the APIs “soon”?
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
>>> wrote:
>>>
 It’s time for our quarterly ASF board report on Apache Spark this
 Wednesday. Here’s a draft, feel free to suggest changes.

 

 Description:

 Apache Spark is a fast and general purpose engine for large-scale data
 processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
 well as a rich set of libraries including stream processing, machine
 learning, and graph analytics.

 Issues for the board:

 - None

 Project status:

 - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
 Spark 3.4.3 on April 18, 2024.
 - The votes on "SPIP: Structured Logging Framework for Apache Spark"
 and "Pure Python Package in PyPI (Spark Connect)" have passed.
 - The votes for two behavior changes have passed: "SPARK-44444: Use
 ANSI SQL mode by default" and "SPARK-46122: Set
 spark.sql.legacy.createHiveTableByDefault to false".
 - The community decided that upcoming Spark 4.0 release will drop
 support for Python 3.8.
 - We started a discussion about the definition of behavior changes that
 is critical for version upgrades and user experience.
 - We've opened a dedicated repository for the Spark Kubernetes Operator
 at https://github.com/apache/spark-kubernetes-operator. We added a new
 version in Apache Spark JIRA for versioning of the Spark operator based on
 a vote result.

 Trademarks:

 - No changes since the last report.

 Latest releases:
 - Spark 3.4.3 was released on April 18, 2024
 - Spark 3.5.1 was released on February 28, 2024
 - Spark 3.3.4 was released on December 16, 2023

 Committers and PMC:

 - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
 - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
 Yikun Jiang).

 
 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
Hi,

We should reconsider using the term "soon" for ASF board as it is
subjective with no date (assuming this is an official communication on
Wednesday). We ought to say

 "Spark 4, the next major release after Spark 3.x, is currently under
development. We plan to make a preview version available for evaluation as
soon as it is feasible"

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Mon, 6 May 2024 at 05:09, Dongjoon Hyun  wrote:

> +1 for Holden's comment. Yes, it would be great to mention `it` as "soon".
> (If Wenchen release it on Monday, we can simply mention the release)
>
> In addition, Apache Spark PMC received an official notice from ASF Infra
> team.
>
> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
> projects
>
> To track and comply with the new ASF Infra Policy as much as possible, we
> opened a blocker-level JIRA issue and have been working on it.
> - https://infra.apache.org/github-actions-policy.html
>
> Please include a sentence that Apache Spark PMC is working on under the
> following umbrella JIRA issue.
>
> https://issues.apache.org/jira/browse/SPARK-48094
> > Reduce GitHub Action usage according to ASF project allowance
>
> Thanks,
> Dongjoon.
>
>
> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
> wrote:
>
>> Do we want to include that we’re planning on having a preview release of
>> Spark 4 so folks can see the APIs “soon”?
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
>> wrote:
>>
>>> It’s time for our quarterly ASF board report on Apache Spark this
>>> Wednesday. Here’s a draft, feel free to suggest changes.
>>>
>>> 
>>>
>>> Description:
>>>
>>> Apache Spark is a fast and general purpose engine for large-scale data
>>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>>> well as a rich set of libraries including stream processing, machine
>>> learning, and graph analytics.
>>>
>>> Issues for the board:
>>>
>>> - None
>>>
>>> Project status:
>>>
>>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
>>> Spark 3.4.3 on April 18, 2024.
>>> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
>>> "Pure Python Package in PyPI (Spark Connect)" have passed.
>>> - The votes for two behavior changes have passed: "SPARK-44444: Use ANSI
>>> SQL mode by default" and "SPARK-46122: Set
>>> spark.sql.legacy.createHiveTableByDefault to false".
>>> - The community decided that upcoming Spark 4.0 release will drop
>>> support for Python 3.8.
>>> - We started a discussion about the definition of behavior changes that
>>> is critical for version upgrades and user experience.
>>> - We've opened a dedicated repository for the Spark Kubernetes Operator
>>> at https://github.com/apache/spark-kubernetes-operator. We added a new
>>> version in Apache Spark JIRA for versioning of the Spark operator based on
>>> a vote result.
>>>
>>> Trademarks:
>>>
>>> - No changes since the last report.
>>>
>>> Latest releases:
>>> - Spark 3.4.3 was released on April 18, 2024
>>> - Spark 3.5.1 was released on February 28, 2024
>>> - Spark 3.3.4 was released on December 16, 2023
>>>
>>> Committers and PMC:
>>>
>>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>>> Yikun Jiang).
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: ASF board report draft for May

2024-05-06 Thread Wenchen Fan
The preview release also needs a vote. I'll try my best to cut the RC on
Monday, but the actual release may take some time. Hopefully, we can get it
out this week but if the vote fails, it will take longer as we need more
RCs.

On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
wrote:

> +1 for Holden's comment. Yes, it would be great to mention `it` as "soon".
> (If Wenchen release it on Monday, we can simply mention the release)
>
> In addition, Apache Spark PMC received an official notice from ASF Infra
> team.
>
> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
> projects
>
> To track and comply with the new ASF Infra Policy as much as possible, we
> opened a blocker-level JIRA issue and have been working on it.
> - https://infra.apache.org/github-actions-policy.html
>
> Please include a sentence that Apache Spark PMC is working on under the
> following umbrella JIRA issue.
>
> https://issues.apache.org/jira/browse/SPARK-48094
> > Reduce GitHub Action usage according to ASF project allowance
>
> Thanks,
> Dongjoon.
>
>
> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
> wrote:
>
>> Do we want to include that we’re planning on having a preview release of
>> Spark 4 so folks can see the APIs “soon”?
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
>> wrote:
>>
>>> It’s time for our quarterly ASF board report on Apache Spark this
>>> Wednesday. Here’s a draft, feel free to suggest changes.
>>>
>>> 
>>>
>>> Description:
>>>
>>> Apache Spark is a fast and general purpose engine for large-scale data
>>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>>> well as a rich set of libraries including stream processing, machine
>>> learning, and graph analytics.
>>>
>>> Issues for the board:
>>>
>>> - None
>>>
>>> Project status:
>>>
>>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
>>> Spark 3.4.3 on April 18, 2024.
>>> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
>>> "Pure Python Package in PyPI (Spark Connect)" have passed.
>>> - The votes for two behavior changes have passed: "SPARK-44444: Use ANSI
>>> SQL mode by default" and "SPARK-46122: Set
>>> spark.sql.legacy.createHiveTableByDefault to false".
>>> - The community decided that upcoming Spark 4.0 release will drop
>>> support for Python 3.8.
>>> - We started a discussion about the definition of behavior changes that
>>> is critical for version upgrades and user experience.
>>> - We've opened a dedicated repository for the Spark Kubernetes Operator
>>> at https://github.com/apache/spark-kubernetes-operator. We added a new
>>> version in Apache Spark JIRA for versioning of the Spark operator based on
>>> a vote result.
>>>
>>> Trademarks:
>>>
>>> - No changes since the last report.
>>>
>>> Latest releases:
>>> - Spark 3.4.3 was released on April 18, 2024
>>> - Spark 3.5.1 was released on February 28, 2024
>>> - Spark 3.3.4 was released on December 16, 2023
>>>
>>> Committers and PMC:
>>>
>>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>>> Yikun Jiang).
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Why spark-submit works with package not with jar

2024-05-05 Thread Jeff Zhang
Are you sure com.google.api.client.http.HttpRequestInitializer is in
spark-bigquery-latest.jar? It may be in a transitive dependency
of spark-bigquery_2.11.
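
A quick way to check which jar actually contains the class (a jar is just a
zip archive; the paths are the ones from this thread):

import zipfile

jar = "/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar"
with zipfile.ZipFile(jar) as zf:
    hits = [n for n in zf.namelist() if "HttpRequestInitializer" in n]
print(hits or "not in this jar -- it must come from a transitive dependency")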

On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh 
wrote:

>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> -- Forwarded message -
> From: Mich Talebzadeh 
> Date: Tue, 20 Oct 2020 at 16:50
> Subject: Why spark-submit works with package not with jar
> To: user @spark 
>
>
> Hi,
>
> I have a scenario that I use in Spark submit as follows:
>
> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>
> As you can see the jar files needed are added.
>
>
> This comes back with error message as below
>
>
> Creating model test.weights_MODEL
>
> java.lang.NoClassDefFoundError:
> com/google/api/client/http/HttpRequestInitializer
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>
>   at
> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>
>   ... 76 elided
>
> Caused by: java.lang.ClassNotFoundException:
> com.google.api.client.http.HttpRequestInitializer
>
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
>
> So there is an issue with finding the class, although the jar file used
>
>
> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>
> has it.
>
>
> Now if *I remove the above jar file and replace it with the same version
> but package* it works!
>
>
> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>
>
> I have read the write-ups about packages searching the maven
> libraries etc. Not convinced why using the package should make so much
> difference between a failure and success. In other words, when to use a
> package rather than a jar.
>
>
> Any ideas will be appreciated.
>
>
> Thanks
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


-- 
Best Regards

Jeff Zhang


Re: ASF board report draft for May

2024-05-05 Thread Dongjoon Hyun
+1 for Holden's comment. Yes, it would be great to mention `it` as "soon".
(If Wenchen release it on Monday, we can simply mention the release)

In addition, Apache Spark PMC received an official notice from ASF Infra
team.

https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
> [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
projects

To track and comply with the new ASF Infra Policy as much as possible, we
opened a blocker-level JIRA issue and have been working on it.
- https://infra.apache.org/github-actions-policy.html

Please include a sentence that Apache Spark PMC is working on under the
following umbrella JIRA issue.

https://issues.apache.org/jira/browse/SPARK-48094
> Reduce GitHub Action usage according to ASF project allowance

Thanks,
Dongjoon.


On Sun, May 5, 2024 at 3:45 PM Holden Karau  wrote:

> Do we want to include that we’re planning on having a preview release of
> Spark 4 so folks can see the APIs “soon”?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
> wrote:
>
>> It’s time for our quarterly ASF board report on Apache Spark this
>> Wednesday. Here’s a draft, feel free to suggest changes.
>>
>> 
>>
>> Description:
>>
>> Apache Spark is a fast and general purpose engine for large-scale data
>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>> well as a rich set of libraries including stream processing, machine
>> learning, and graph analytics.
>>
>> Issues for the board:
>>
>> - None
>>
>> Project status:
>>
>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and Spark
>> 3.4.3 on April 18, 2024.
>> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
>> "Pure Python Package in PyPI (Spark Connect)" have passed.
>> - The votes for two behavior changes have passed: "SPARK-44444: Use ANSI
>> SQL mode by default" and "SPARK-46122: Set
>> spark.sql.legacy.createHiveTableByDefault to false".
>> - The community decided that upcoming Spark 4.0 release will drop support
>> for Python 3.8.
>> - We started a discussion about the definition of behavior changes that
>> is critical for version upgrades and user experience.
>> - We've opened a dedicated repository for the Spark Kubernetes Operator
>> at https://github.com/apache/spark-kubernetes-operator. We added a new
>> version in Apache Spark JIRA for versioning of the Spark operator based on
>> a vote result.
>>
>> Trademarks:
>>
>> - No changes since the last report.
>>
>> Latest releases:
>> - Spark 3.4.3 was released on April 18, 2024
>> - Spark 3.5.1 was released on February 28, 2024
>> - Spark 3.3.4 was released on December 16, 2023
>>
>> Committers and PMC:
>>
>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>> Yikun Jiang).
>>
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: ASF board report draft for May

2024-05-05 Thread Holden Karau
Do we want to include that we’re planning on having a preview release of
Spark 4 so folks can see the APIs “soon”?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
wrote:

> It’s time for our quarterly ASF board report on Apache Spark this
> Wednesday. Here’s a draft, feel free to suggest changes.
>
> 
>
> Description:
>
> Apache Spark is a fast and general purpose engine for large-scale data
> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
> well as a rich set of libraries including stream processing, machine
> learning, and graph analytics.
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and Spark
> 3.4.3 on April 18, 2024.
> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
> "Pure Python Package in PyPI (Spark Connect)" have passed.
> - The votes for two behavior changes have passed: "SPARK-44444: Use ANSI
> SQL mode by default" and "SPARK-46122: Set
> spark.sql.legacy.createHiveTableByDefault to false".
> - The community decided that upcoming Spark 4.0 release will drop support
> for Python 3.8.
> - We started a discussion about the definition of behavior changes that is
> critical for version upgrades and user experience.
> - We've opened a dedicated repository for the Spark Kubernetes Operator at
> https://github.com/apache/spark-kubernetes-operator. We added a new
> version in Apache Spark JIRA for versioning of the Spark operator based on
> a vote result.
>
> Trademarks:
>
> - No changes since the last report.
>
> Latest releases:
> - Spark 3.4.3 was released on April 18, 2024
> - Spark 3.5.1 was released on February 28, 2024
> - Spark 3.3.4 was released on December 16, 2023
>
> Committers and PMC:
>
> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
> Yikun Jiang).
>
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


ASF board report draft for May

2024-05-05 Thread Matei Zaharia
It’s time for our quarterly ASF board report on Apache Spark this Wednesday. 
Here’s a draft, feel free to suggest changes.



Description:

Apache Spark is a fast and general purpose engine for large-scale data 
processing. It offers high-level APIs in Java, Scala, Python, R and SQL as well 
as a rich set of libraries including stream processing, machine learning, and 
graph analytics.

Issues for the board:

- None

Project status:

- We made two patch releases: Spark 3.5.1 on February 28, 2024, and Spark 3.4.3
on April 18, 2024.
- The votes on "SPIP: Structured Logging Framework for Apache Spark" and "Pure
Python Package in PyPI (Spark Connect)" have passed.
- The votes for two behavior changes have passed: "SPARK-44444: Use ANSI SQL
mode by default" and "SPARK-46122: Set
spark.sql.legacy.createHiveTableByDefault to false".
- The community decided that upcoming Spark 4.0 release will drop support for 
Python 3.8.
- We started a discussion about the definition of behavior changes that is 
critical for version upgrades and user experience.
- We've opened a dedicated repository for the Spark Kubernetes Operator at 
https://github.com/apache/spark-kubernetes-operator. We added a new version in 
Apache Spark JIRA for versioning of the Spark operator based on a vote result.

Trademarks:

- No changes since the last report.

Latest releases:
- Spark 3.4.3 was released on April 18, 2024
- Spark 3.5.1 was released on February 28, 2024
- Spark 3.3.4 was released on December 16, 2023

Committers and PMC:

- The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
- The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and Yikun 
Jiang).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Mich Talebzadeh
In answer to this part of your question:

"*1. Understanding the Issue:* Are there known reasons within Spark that
could explain this difference in behavior when loading dependencies via
`--packages` versus placing JARs directly?"

--jars adds only the jars you list on the command line.
--packages adds the artifact plus its transitive dependencies resolved from Maven.
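
For illustration, a minimal sketch of the same idea driven from PySpark
config rather than spark-submit flags (the coordinate and path are the ones
from this thread):

from pyspark.sql import SparkSession

# Equivalent of --packages: Spark resolves the artifact *and* its
# transitive dependencies (e.g. google-api-client) from Maven.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "com.github.samelamin:spark-bigquery_2.11:0.2.6")
    .getOrCreate()
)

# Equivalent of --jars: only the listed files are shipped; transitive
# dependencies must already be on the classpath.
# .config("spark.jars", "/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar")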

*HTH*

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Sat, 4 May 2024 at 12:24, Damien Hawes  wrote:

> Hi folks,
>
> I'm contributing to the OpenLineage project, specifically the Apache Spark
> integration. My current focus is on extending the project to support data
> lineage extraction for Spark Streaming, beginning with Apache Kafka sources
> and sinks.
>
> I've encountered an obstacle when attempting to access information
> essential for lineage extraction from Apache Kafka-related classes within
> the OpenLineage Spark code base. Specifically, I need to access details
> like Kafka topic names and bootstrap servers from objects like
> StreamingDataSourceV2Relation.
>
> While I can successfully access these details if the Kafka JARs are placed
> directly in the 'spark/jars' directory, I'm unable to do so when using the
> `--packages` option for dependency management. This creates a significant
> obstacle for users who rely on `--packages` for their Spark applications.
>
> I've taken initial steps to investigate (viewable in this GitHub PR
> , the class in
> question is *StreamingDataSourceV2RelationVisitor*), but I'd greatly
> appreciate any insights or guidance on the following:
>
> *1. Understanding the Issue:* Are there known reasons within Spark that
> could explain this difference in behavior when loading dependencies via
> `--packages` versus placing JARs directly?
> *2. Alternative Approaches:*  Are there recommended techniques or
> patterns to access the necessary Kafka class information within a
> SparkListener extension, especially when dependencies are managed via
> `--packages`?
>
> I'm eager to find a solution that avoids heavy reliance on reflection.
>
> Thank you for your time and assistance!
>
> Kind regards,
> Damien
>
>


Fwd: Why spark-submit works with package not with jar

2024-05-04 Thread Mich Talebzadeh
Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


-- Forwarded message -
From: Mich Talebzadeh 
Date: Tue, 20 Oct 2020 at 16:50
Subject: Why spark-submit works with package not with jar
To: user @spark 


Hi,

I have a scenario that I use in Spark submit as follows:

spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
/home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
*/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*

As you can see the jar files needed are added.


This comes back with error message as below


Creating model test.weights_MODEL

java.lang.NoClassDefFoundError:
com/google/api/client/http/HttpRequestInitializer

  at
com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)

  at
com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)

  at
com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)

  ... 76 elided

Caused by: java.lang.ClassNotFoundException:
com.google.api.client.http.HttpRequestInitializer

  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)



So there is an issue with finding the class, although the jar file used


/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar

has it.


Now if *I remove the above jar file and replace it with the same version
but package* it works!


spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
/home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
--packages com.github.samelamin:spark-bigquery_2.11:0.2.6


I have read the write-ups about packages searching the maven libraries etc.
Not convinced why using the package should make so much difference between
a failure and success. In other words, when to use a package rather than a
jar.


Any ideas will be appreciated.


Thanks



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Fwd: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Damien Hawes
Hi folks,

I'm contributing to the OpenLineage project, specifically the Apache Spark
integration. My current focus is on extending the project to support data
lineage extraction for Spark Streaming, beginning with Apache Kafka sources
and sinks.

I've encountered an obstacle when attempting to access information
essential for lineage extraction from Apache Kafka-related classes within
the OpenLineage Spark code base. Specifically, I need to access details
like Kafka topic names and bootstrap servers from objects like
StreamingDataSourceV2Relation.

While I can successfully access these details if the Kafka JARs are placed
directly in the 'spark/jars' directory, I'm unable to do so when using the
`--packages` option for dependency management. This creates a significant
obstacle for users who rely on `--packages` for their Spark applications.

I've taken initial steps to investigate (viewable in this GitHub PR
, the class in
question is *StreamingDataSourceV2RelationVisitor*), but I'd greatly
appreciate any insights or guidance on the following:

*1. Understanding the Issue:* Are there known reasons within Spark that
could explain this difference in behavior when loading dependencies via
`--packages` versus placing JARs directly?
*2. Alternative Approaches:*  Are there recommended techniques or patterns
to access the necessary Kafka class information within a SparkListener
extension, especially when dependencies are managed via `--packages`?

I'm eager to find a solution that avoids heavy reliance on reflection.

Thank you for your time and assistance!

Kind regards,
Damien


Re: Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Jungtaek Lim
(removing user@ as the topic is not aimed at the user group)

I would like to clarify what an SPIP is, as there have been multiple
improper proposals, and this ticket also mentions SPIP without
fulfilling its requirements.
SPIP is only effective when there is a dedicated individual or group to
work on the project, with a concrete plan on design and implementation.
Here the "proposal" does not mean "feature request", but a proposal about
development.

https://spark.apache.org/improvement-proposals.html
I'm quoting a couple of relevant sentences here to explain what is the
requirement for SPIP.

The purpose of an SPIP is to inform and involve the user community in major
> improvements to the Spark codebase *throughout the development process*,
> to increase the likelihood that user needs are met.


SPIP Author is any community member who authors a SPIP and *is committed to
> pushing the change through the entire process*. SPIP authorship can be
> transferred.


The SPIP author is the one who needs to lead the "design" and "code work"
effort. The SPIP doc format doesn't strictly require the design, but most
likely there is at least a high-level design, and in many cases there is a
separate doc for the detailed design (this is optional, but people tend to
provide it for projects with non-trivial design).

Hope this clarifies the meaning of SPIP.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, May 4, 2024 at 5:11 AM Mich Talebzadeh 
wrote:

> Hi,
>
> I have raised a ticket SPARK-48117
>  for enhancing Spark
> capabilities with Materialised Views (MV). Currently both Hive and
> Databricks support this. I have added these potential benefits  to the
> ticket
>
> -* Improved Query Performance (especially for Streaming Data):*
> Materialized Views can significantly improve query performance,
> particularly for use cases involving Spark Structured Streaming. When
> dealing with continuous data streams, materialized views can pre-compute
> and store frequently accessed aggregations or transformations. Subsequent
> queries on the materialized view can retrieve the results much faster
> compared to continuously processing the entire streaming data. This is
> crucial for real-time analytics where low latency is essential.
> *Enhancing Data Management:* They offer a way to pre-aggregate or
> transform data, making complex queries more efficient.
> - *Reduced Data Movement*: Materialized Views can be materialized on
> specific clusters or storage locations closer to where the data will be
> consumed. This minimizes data movement across the network, further
> improving query performance and reducing overall processing time.
> - *Simplified Workflows:* Developers and analysts can leverage
> pre-defined Materialized Views that represent specific business logic or
> data subsets. This simplifies data access, reduces development time for
> queries that rely on these views, and fosters code reuse.
>
> Please have a look at the ticket and add your comments.
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi,

I have raised a ticket SPARK-48117
 for enhancing Spark
capabilities with Materialised Views (MV). Currently both Hive and
Databricks support this. I have added these potential benefits  to the
ticket

-* Improved Query Performance (especially for Streaming Data):*
Materialized Views can significantly improve query performance,
particularly for use cases involving Spark Structured Streaming. When
dealing with continuous data streams, materialized views can pre-compute
and store frequently accessed aggregations or transformations. Subsequent
queries on the materialized view can retrieve the results much faster
compared to continuously processing the entire streaming data. This is
crucial for real-time analytics where low latency is essential.
*Enhancing Data Management:* They offer a way to pre-aggregate or transform
data, making complex queries more efficient.
- *Reduced Data Movement*: Materialized Views can be materialized on
specific clusters or storage locations closer to where the data will be
consumed. This minimizes data movement across the network, further
improving query performance and reducing overall processing time.
- *Simplified Workflows:* Developers and analysts can leverage pre-defined
Materialized Views that represent specific business logic or data subsets.
This simplifies data access, reduces development time for queries that rely
on these views, and fosters code reuse.

Please have a look at the ticket and add your comments.

Thanks

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".


Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received.

So in summary, Apache Spark itself doesn't directly manage materialized
views (MVs), but it can work with them through integration with the
underlying data storage systems like Hive or through Iceberg. I believe
Databricks, through Unity Catalog, supports MVs as well.

Moreover, there is a case for supporting MVs. However, Spark can utilize
materialized views even though it doesn't directly manage them. This came
about because someone in the Spark user forum enquired about "Spark
streaming issue to Elastic data". One option I thought of was that using
materialized views with Spark Structured Streaming and Change Data Capture
(CDC) is a potential solution for efficiently streaming view data updates
in this scenario.


Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Fri, 3 May 2024 at 00:54, Mich Talebzadeh 
wrote:

> An issue I encountered while working with Materialized Views in Spark SQL.
> It appears that there is an inconsistency between the behavior of
> Materialized Views in Spark SQL and Hive.
>
> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
> the keyword MATERIALIZED is not recognized. However, the same statement
> executes successfully in Hive without any errors.
>
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)
>
> == SQL ==
> DROP MATERIALIZED VIEW IF EXISTS test.mv
> -^^^
>
> Here are the versions I am using:
>
> Hive: 3.1.1
> Spark: 3.4
> my Spark session:
>
> spark = SparkSession.builder \
>   .appName("test") \
>   .enableHiveSupport() \
>   .getOrCreate()
>
> Has anyone seen this behaviour or encountered a similar issue or if there
> are any insights into why this discrepancy exists between Spark SQL and
> Hive.
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with
CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess
you must have created the view from Hive and are trying to drop it from
Spark, and that is why you are running into the issue with DROP first.

There is some work in the Iceberg community to add the support to Spark
through SQL extensions, and Iceberg support for views and
materialization tables. Some recent discussions can be found here [1] along
with a WIP Iceberg-Spark PR.

[1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc

Thanks,
Walaa.

On Thu, May 2, 2024 at 4:55 PM Mich Talebzadeh 
wrote:

> An issue I encountered while working with Materialized Views in Spark SQL.
> It appears that there is an inconsistency between the behavior of
> Materialized Views in Spark SQL and Hive.
>
> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
> the keyword MATERIALIZED is not recognized. However, the same statement
> executes successfully in Hive without any errors.
>
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)
>
> == SQL ==
> DROP MATERIALIZED VIEW IF EXISTS test.mv
> -^^^
>
> Here are the versions I am using:
>
> Hive: 3.1.1
> Spark: 3.4
> my Spark session:
>
> spark = SparkSession.builder \
>   .appName("test") \
>   .enableHiveSupport() \
>   .getOrCreate()
>
> Has anyone seen this behaviour or encountered a similar issue or if there
> are any insights into why this discrepancy exists between Spark SQL and
> Hive.
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
I encountered an issue while working with Materialized Views in Spark SQL.
It appears that there is an inconsistency between the behavior of
Materialized Views in Spark SQL and Hive.

When attempting to execute a statement like DROP MATERIALIZED VIEW IF
EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
the keyword MATERIALIZED is not recognized. However, the same statement
executes successfully in Hive without any errors.

pyspark.errors.exceptions.captured.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)

== SQL ==
DROP MATERIALIZED VIEW IF EXISTS test.mv
-^^^

Here are the versions I am using:

Hive: 3.1.1
Spark: 3.4
my Spark session:

spark = SparkSession.builder \
  .appName("test") \
  .enableHiveSupport() \
  .getOrCreate()

Has anyone seen this behaviour or encountered a similar issue, or are there
any insights into why this discrepancy exists between Spark SQL and
Hive?
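
In the meantime, a minimal workaround sketch (table and database names are
hypothetical): emulate the materialized view with a plain table that a
scheduled job rebuilds, since Spark's parser accepts neither CREATE nor DROP
MATERIALIZED VIEW.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("mv-refresh") \
  .enableHiveSupport() \
  .getOrCreate()

# "Refresh" the pseudo-materialized view by rebuilding a plain table.
spark.sql("DROP TABLE IF EXISTS test.mv_orders")
spark.sql("""
    CREATE TABLE test.mv_orders
    USING parquet
    AS SELECT customer_id, SUM(amount) AS total_amount
    FROM test.orders
    GROUP BY customer_id
""")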

Thanks

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".


Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread yangjie01
+1

From: Jungtaek Lim
Date: Thursday, May 2, 2024 10:21
To: Holden Karau
Cc: Chao Sun, Xiao Li, Tathagata Das, Wenchen Fan, Cheng Pan,
Nicholas Chammas, Dongjoon Hyun, Cheng Pan, Spark dev list,
Anish Shrigondekar
Subject: Re: [DISCUSS] Spark 4.0.0 release

+1 love to see it!

On Thu, May 2, 2024 at 10:08 AM Holden Karau  wrote:
+1 :) yay previews

On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
+1

On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
+1 for next Monday.

We can do more previews when the other features are ready for preview.

On Wed, May 1, 2024 at 08:46, Tathagata Das  wrote:
Next week sounds great! Thank you Wenchen!

On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
Yea I think a preview release won't hurt (without a branch cut). We don't need 
to wait for all the ongoing projects to be ready. How about we do a 4.0 preview 
release based on the current master branch next Monday?

On Wed, May 1, 2024 at 11:06 PM Tathagata Das  wrote:
Hey all,

Reviving this thread, but Spark master has already accumulated a huge amount of 
changes.  As a downstream project maintainer, I want to really start testing 
the new features and other breaking changes, and it's hard to do that without a 
Preview release. So the sooner we make a Preview release, the faster we can 
start getting feedback for fixing things for a great Spark 4.0 final release.

So I urge the community to produce a Spark 4.0 Preview soon even if certain 
features targeting the Delta 4.0 release are still incomplete.

Thanks!


On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
Thank you all for the replies!

To @Nicholas Chammas: Thanks for cleaning
up the error terminology and documentation! I've merged the first PR and let's
finish others before the 4.0 release.
To @Dongjoon Hyun: Thanks for driving the ANSI
on by default effort! Now the vote has passed, let's flip the config and finish
the DataFrame error context feature before 4.0.
To @Jungtaek Lim: Ack. We can treat the
Streaming state store data source as completed for 4.0 then.
To @Cheng Pan: Yea we definitely should have a
preview release. Let's collect more feedback on the ongoing projects and then 
we can propose a date for the preview release.

On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

Thanks,
Cheng Pan


> On Apr 15, 2024, at 09:58, Jungtaek Lim  wrote:
>
> W.r.t. state data source - reader (SPARK-45511), there are several follow-up 
> tickets, but we don't plan to address them soon. The current implementation 
> is the final shape for Spark 4.0.0, unless there are demands on the follow-up 
> tickets.
>
> We may want to check the plan for transformWithState - my understanding is 
> that we want to release the feature to 4.0.0, but there are several remaining 
> works to be done. While the tentative timeline for releasing is June 2024, 
> what would be the tentative timeline for the RC cut?
> (cc. Anish to add more context on the plan for transformWithState)
>
> On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> Hi all,
>
> It's close to the previously proposed 4.0.0 release date (June 2024), and I 
> think it's time to prepare for it and discuss the ongoing projects:
> • ANSI by default
> • Spark Connect GA
> • Structured Logging
> • Streaming state store data source
> • new data type VARIANT
> • STRING collation support
> • Spark k8s operator versioning
> Please help to add more items to this list that are missed here. I would like 
> to volunteer as the release manager for Apache Spark 4.0.0 if there is no 
> objection. Thank you all for the great work that fills Spark 4.0!
>
> Wenchen Fan


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Mich Talebzadeh
   - Integration with additional external data sources or systems, say Hive
   - Enhancements to the Spark UI for improved monitoring and debugging
   - Enhancements to machine learning (MLlib) algorithms and capabilities,
   with integrations like TensorFlow or PyTorch (if any are in the pipeline)

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Thu, 2 May 2024 at 17:02, Steve Loughran 
wrote:

> There's a new parquet RC up this week which would be good to pull in.
>
> On Thu, 2 May 2024 at 03:20, Jungtaek Lim 
> wrote:
>
>> +1 love to see it!
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>>> +1 :) yay previews
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
 +1

 On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:

> +1 for next Monday.
>
> We can do more previews when the other features are ready for preview.
>
> Tathagata Das  于2024年5月1日周三 08:46写道:
>
>> Next week sounds great! Thank you Wenchen!
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>> wrote:
>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about 
>>> we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Hey all,

 Reviving this thread, but Spark master has already accumulated a
 huge amount of changes.  As a downstream project maintainer, I want to
 really start testing the new features and other breaking changes, and 
 it's
 hard to do that without a Preview release. So the sooner we make a 
 Preview
 release, the faster we can start getting feedback for fixing things 
 for a
 great Spark 4.0 final release.

 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.

 Thanks!


 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the 
> first
> PR and let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving
> the ANSI on by default effort! Now the vote has passed, let's flip the
> config and finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can
> treat the Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should
> have a preview release. Let's collect more feedback on the ongoing 
> projects
> and then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan 
> wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>> 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are
>> several follow-up tickets, but we don't plan to address them soon. 
>> The
>> current implementation is the final shape for Spark 4.0.0, unless 
>> there are
>> demands on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my
>> understanding is that we want to release the feature to 4.0.0, but 
>> there
>> are several remaining works to be done. While the tentative timeline 
>> for
>> releasing is June 2024, what would be the tentative timeline for the 
>> RC cut?
>> > (cc. Anish to add more context on the plan for
>> transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan <
>> cloud0...@gmail.com> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June
>> 2024), and I think it's time to prepare for it and discuss the 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Steve Loughran
There's a new parquet RC up this week which would be good to pull in.

On Thu, 2 May 2024 at 03:20, Jungtaek Lim 
wrote:

> +1 love to see it!
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
>> +1 :) yay previews
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>>> +1
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
 +1 for next Monday.

 We can do more previews when the other features are ready for preview.

 Tathagata Das  于2024年5月1日周三 08:46写道:

> Next week sounds great! Thank you Wenchen!
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
> wrote:
>
>> Yea I think a preview release won't hurt (without a branch cut). We
>> don't need to wait for all the ongoing projects to be ready. How about we
>> do a 4.0 preview release based on the current master branch next Monday?
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Reviving this thread, but Spark master has already accumulated a
>>> huge amount of changes.  As a downstream project maintainer, I want to
>>> really start testing the new features and other breaking changes, and 
>>> it's
>>> hard to do that without a Preview release. So the sooner we make a 
>>> Preview
>>> release, the faster we can start getting feedback for fixing things for 
>>> a
>>> great Spark 4.0 final release.
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
>>> wrote:
>>>
 Thank you all for the replies!

 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the 
 first
 PR and let's finish others before the 4.0 release.
 To @Dongjoon Hyun  : Thanks for driving
 the ANSI on by default effort! Now the vote has passed, let's flip the
 config and finish the DataFrame error context feature before 4.0.
 To @Jungtaek Lim  : Ack. We can
 treat the Streaming state store data source as completed for 4.0 then.
 To @Cheng Pan  : Yea we definitely should
 have a preview release. Let's collect more feedback on the ongoing 
 projects
 and then we can propose a date for the preview release.

 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan 
 wrote:

> will we have preview release for 4.0.0 like we did for 2.0.0 and
> 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are
> several follow-up tickets, but we don't plan to address them soon. The
> current implementation is the final shape for Spark 4.0.0, unless 
> there are
> demands on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my
> understanding is that we want to release the feature to 4.0.0, but 
> there
> are several remaining works to be done. While the tentative timeline 
> for
> releasing is June 2024, what would be the tentative timeline for the 
> RC cut?
> > (cc. Anish to add more context on the plan for
> transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
> wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June
> 2024), and I think it's time to prepare for it and discuss the ongoing
> projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here.
> I would like to volunteer as the release manager for Apache Spark 
> 4.0.0 if
> there is no objection. Thank you all for the great work that fills 
> Spark
> 4.0!
> >
> > Wenchen Fan
>
>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Will Raschkowski
To add some user perspective, I wanted to share our experience from 
automatically upgrading tens of thousands of jobs from Spark 2 to 3 at Palantir:

We didn't mind "loud" changes that threw exceptions. We have some infra that
tries running jobs with Spark 3 and falls back to Spark 2 if there's an
exception. E.g., the datetime parsing and rebasing migration in Spark 3 was
great: Spark threw a helpful exception but never silently changed results.
Similarly, for things listed in the migration guide as silent changes (e.g.,
add_months's handling of last-day-of-month), we wrote custom check rules to
throw unless users acknowledged the change through config.
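
A hedged sketch of that kind of check rule (the helper and config key are
made up for illustration, not a real Spark or Palantir API):

def require_ack(spark, key, message):
    """Fail fast unless the user acknowledged a behavior change via config."""
    if spark.conf.get(key, "false") != "true":
        raise RuntimeError(f"{message} Set {key}=true once reviewed.")

# e.g. require_ack(spark, "myorg.migration.ack.addMonths",
#                  "add_months last-day-of-month handling changed in Spark 3.")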

Silent changes not in the migration guide were really bad for us: Trusting the 
migration guide to be exhaustive, we automatically upgraded jobs which then 
“succeeded” but wrote incorrect results. For example, some expression increased 
timestamp precision in Spark 3; a query implicitly relied on the reduced 
precision, and then produced bad results on upgrade. It’s a silly query but a 
note in the migration guide would have helped.

To summarize: the migration guide was invaluable, we appreciated every entry, 
and we'd appreciate Wenchen's stricter definition of "behavior changes" 
(especially for silent ones).

From: Nimrod Ofek 
Date: Thursday, 2 May 2024 at 11:57
To: Wenchen Fan 
Cc: Erik Krogen , Spark dev list 
Subject: Re: [DISCUSS] clarify the definition of behavior changes

Hi Erik and Wenchen,

I think a good practice with public APIs, and with internal APIs that have big
impact and wide usage, is to ease in changes: give new parameters defaults that
keep the former behaviour in a method with the previous signature, add a
deprecation notice, and delete the deprecated method in the next release. The
actual break then lands one release later, after all libraries have had the
chance to align with the API, and upgrades can be done while already using the
new version.

Another thing: we should probably examine which private APIs are used
externally and provide proper public APIs to meet those needs (for instance,
application metrics and some way of creating custom-behaviour columns).

Thanks,
Nimrod

On Thu, May 2, 2024 at 03:51, Wenchen Fan  wrote:
Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs 
(such as Data Source V2 API, Spark Listener API, etc.), so breaking changes 
should be avoided as much as we can and new APIs should be mentioned in the 
release notes. Breaking binary compatibility is also a "functional change" and 
should be treated as a behavior change.

BTW, AFAIK some downstream libraries use private APIs such as Catalyst 
Expression and LogicalPlan. It's too much work to track all the changes to 
private APIs and I think it's the downstream library's responsibility to check 
such changes in new Spark versions, or avoid using private APIs. Exceptions can 
happen if certain private APIs are used too widely and we should avoid breaking 
them.

Thanks,
Wenchen

On Wed, May 1, 2024 at 11:51 PM Erik Krogen  wrote:
Thanks for raising this important discussion Wenchen! Two points I would like 
to raise, though I'm fully supportive of any improvements in this regard, my 
points below notwithstanding -- I am not intending to let perfect be the enemy 
of good here.

On a similar note as Santosh's comment, we should consider how this relates to 
developer APIs. Let's say I am an end user relying on some library like 
frameless 
[github.com],
 which relies on developer APIs in Spark. When we make a change to Spark's 
developer APIs that requires a corresponding change in frameless, I don't 
directly see that change as an end user, but it does impact me, because now I 
have to upgrade to a new version of frameless that supports those new changes. 
This can have ripple effects across the ecosystem. Should we call out such 
changes so that end users understand the potential impact to libraries they use?

Second point, what about binary compatibility? Currently our versioning policy 
says "Link-level compatibility is something we’ll try to guarantee in future 
releases." (FWIW, it has said this since at least 2016 
[web.archive.org]...)
 One step towards this would be to clearly call out any binary-incompatible 

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen,

I think a good practice with public APIs, and with internal APIs that have
big impact and wide usage, is to ease in changes: give new parameters
defaults that keep the former behaviour in a method with the previous
signature, add a deprecation notice, and delete the deprecated method in the
next release. The actual break then lands one release later, after all
libraries have had the chance to align with the API, and upgrades can be done
while already using the new version.
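
A minimal sketch of this pattern (hypothetical function, not an actual Spark
API):

import warnings

def load(path, validate=None):
    # New parameter defaults to the old behaviour; callers who rely on the
    # implicit default get a deprecation warning, and the fallback is
    # deleted one release later.
    if validate is None:
        warnings.warn(
            "load(path) without 'validate' is deprecated and keeps the old "
            "behaviour; pass validate explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        validate = False
    print(f"loading {path} (validate={validate})")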

Another thing is that we should probably examine what private apis are used
externally to provide better experience and provide proper public apis to
meet those needs (for instance, applicative metrics and some way of
creating custom behaviour columns).

Thanks,
Nimrod


On Thu, May 2, 2024, 03:51, Wenchen Fan wrote:

> Hi Erik,
>
> Thanks for sharing your thoughts! Note: developer APIs are also public
> APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking
> changes should be avoided as much as we can and new APIs should be
> mentioned in the release notes. Breaking binary compatibility is also a
> "functional change" and should be treated as a behavior change.
>
> BTW, AFAIK some downstream libraries use private APIs such as Catalyst
> Expression and LogicalPlan. It's too much work to track all the changes to
> private APIs and I think it's the downstream library's responsibility to
> check such changes in new Spark versions, or avoid using private APIs.
> Exceptions can happen if certain private APIs are used too widely and we
> should avoid breaking them.
>
> Thanks,
> Wenchen
>
> On Wed, May 1, 2024 at 11:51 PM Erik Krogen  wrote:
>
>> Thanks for raising this important discussion Wenchen! Two points I would
>> like to raise, though I'm fully supportive of any improvements in this
>> regard, my points below notwithstanding -- I am not intending to let
>> perfect be the enemy of good here.
>>
>> On a similar note as Santosh's comment, we should consider how this
>> relates to developer APIs. Let's say I am an end user relying on some
>> library like frameless , which
>> relies on developer APIs in Spark. When we make a change to Spark's
>> developer APIs that requires a corresponding change in frameless, I don't
>> directly see that change as an end user, but it *does* impact me,
>> because now I have to upgrade to a new version of frameless that supports
>> those new changes. This can have ripple effects across the ecosystem.
>> Should we call out such changes so that end users understand the potential
>> impact to libraries they use?
>>
>> Second point, what about binary compatibility? Currently our versioning
>> policy says "Link-level compatibility is something we’ll try to guarantee
>> in future releases." (FWIW, it has said this since at least 2016
>> ...)
>> One step towards this would be to clearly call out any binary-incompatible
>> changes in our release notes, to help users understand if they may be
>> impacted. Similar to my first point, this has ripple effects across the
>> ecosystem -- if I just use Spark itself, recompiling is probably not a big
>> deal, but if I use N libraries that each depend on Spark, then after a
>> binary-incompatible change is made I have to wait for all N libraries to
>> publish new compatible versions before I can upgrade myself, presenting a
>> nontrivial barrier to adoption.
>>
>> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>>  wrote:
>>
>>> Thanks Wenchen for starting this!
>>>
>>> How do we define "the user" for spark?
>>> 1. End users: There are some users that use spark as a service from a
>>> provider
>>> 2. Providers/Operators: There are some users that provide spark as a
>>> service for their internal(on-prem setup with yarn/k8s)/external(Something
>>> like EMR) customers
>>> 3. ?
>>>
>>> Perhaps we need to consider infrastructure behavior changes as well to
>>> accommodate the second group of users.
>>>
>>> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>>>
>>> Hi all,
>>>
>>> It's exciting to see innovations keep happening in the Spark community
>>> and Spark keeps evolving itself. To make these innovations available to
>>> more users, it's important to help users upgrade to newer Spark versions
>>> easily. We've done a good job on it: the PR template requires the author to
>>> write down user-facing behavior changes, and the migration guide contains
>>> behavior changes that need attention from users. Sometimes behavior changes
>>> come with a legacy config to restore the old behavior. However, we still
>>> lack a clear definition of behavior changes and I propose the following
>>> definition:
>>>
>>> Behavior changes mean user-visible functional changes in a new release
>>> via public APIs. This means new features, and even bug fixes that eliminate
>>> NPE or correct query results, 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Jungtaek Lim
+1 love to see it!

On Thu, May 2, 2024 at 10:08 AM Holden Karau  wrote:

> +1 :) yay previews
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
>> +1
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>>> +1 for next Monday.
>>>
>>> We can do more previews when the other features are ready for preview.
>>>
>>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>>
 Next week sounds great! Thank you Wenchen!

 On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
 wrote:

> Yea I think a preview release won't hurt (without a branch cut). We
> don't need to wait for all the ongoing projects to be ready. How about we
> do a 4.0 preview release based on the current master branch next Monday?
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Hey all,
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard 
>> to
>> do that without a Preview release. So the sooner we make a Preview 
>> release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>> Thanks!
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
>> wrote:
>>
>>> Thank you all for the replies!
>>>
>>> To @Nicholas Chammas  : Thanks for
>>> cleaning up the error terminology and documentation! I've merged the 
>>> first
>>> PR and let's finish others before the 4.0 release.
>>> To @Dongjoon Hyun  : Thanks for driving
>>> the ANSI on by default effort! Now the vote has passed, let's flip the
>>> config and finish the DataFrame error context feature before 4.0.
>>> To @Jungtaek Lim  : Ack. We can treat
>>> the Streaming state store data source as completed for 4.0 then.
>>> To @Cheng Pan  : Yea we definitely should have
>>> a preview release. Let's collect more feedback on the ongoing projects 
>>> and
>>> then we can propose a date for the preview release.
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
 will we have preview release for 4.0.0 like we did for 2.0.0 and
 3.0.0?

 Thanks,
 Cheng Pan


 > On Apr 15, 2024, at 09:58, Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:
 >
 > W.r.t. state data source - reader (SPARK-45511), there are
 several follow-up tickets, but we don't plan to address them soon. The
 current implementation is the final shape for Spark 4.0.0, unless 
 there are
 demands on the follow-up tickets.
 >
 > We may want to check the plan for transformWithState - my
 understanding is that we want to release the feature to 4.0.0, but 
 there
 are several remaining works to be done. While the tentative timeline 
 for
 releasing is June 2024, what would be the tentative timeline for the 
 RC cut?
 > (cc. Anish to add more context on the plan for transformWithState)
 >
 > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
 wrote:
 > Hi all,
 >
 > It's close to the previously proposed 4.0.0 release date (June
 2024), and I think it's time to prepare for it and discuss the ongoing
 projects:
 > • ANSI by default
 > • Spark Connect GA
 > • Structured Logging
 > • Streaming state store data source
 > • new data type VARIANT
 > • STRING collation support
 > • Spark k8s operator versioning
 > Please help to add more items to this list that are missed here.
 I would like to volunteer as the release manager for Apache Spark 
 4.0.0 if
 there is no objection. Thank you all for the great work that fills 
 Spark
 4.0!
 >
 > Wenchen Fan


>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
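
As an aside on the state data source reader (SPARK-45511) discussed in the
quoted thread above, a hedged sketch of its intended usage; the format name
and option follow the proposal and may differ in the released version:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]").appName("state-reader").getOrCreate()

    // Read the state of a streaming query from its checkpoint location
    // (the path below is made up for illustration).
    val stateDf = spark.read
      .format("statestore")
      .option("path", "/tmp/checkpoints/my-streaming-query")
      .load()

    stateDf.printSchema()  // expect key/value structs plus a partition id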


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Holden Karau
+1 :) yay previews

On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

> +1
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
>> +1 for next Monday.
>>
>> We can do more previews when the other features are ready for preview.
>>
>> Tathagata Das wrote on Wed, May 1, 2024 at 08:46:
>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
 Yea I think a preview release won't hurt (without a branch cut). We
 don't need to wait for all the ongoing projects to be ready. How about we
 do a 4.0 preview release based on the current master branch next Monday?

 On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Hey all,
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard 
> to
> do that without a Preview release. So the sooner we make a Preview 
> release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
> Thanks!
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
> wrote:
>
>> Thank you all for the replies!
>>
>> To @Nicholas Chammas  : Thanks for
>> cleaning up the error terminology and documentation! I've merged the 
>> first
>> PR and let's finish others before the 4.0 release.
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>> To @Jungtaek Lim  : Ack. We can treat
>> the Streaming state store data source as completed for 4.0 then.
>> To @Cheng Pan  : Yea we definitely should have
>> a preview release. Let's collect more feedback on the ongoing projects 
>> and
>> then we can propose a date for the preview release.
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>>> 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are 
>>> demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC 
>>> cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June
>>> 2024), and I think it's time to prepare for it and discuss the ongoing
>>> projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new data type VARIANT
>>> > • STRING collation support
>>> > • Spark k8s operator versioning
>>> > Please help to add more items to this list that are missed here. I
>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>> there is no objection. Thank you all for the great work that fills Spark
>>> 4.0!
>>> >
>>> > Wenchen Fan
>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Hi Erik,

Thanks for sharing your thoughts! Note: developer APIs are also public APIs
(such as Data Source V2 API, Spark Listener API, etc.), so breaking changes
should be avoided as much as we can and new APIs should be mentioned in the
release notes. Breaking binary compatibility is also a "functional change"
and should be treated as a behavior change.

BTW, AFAIK some downstream libraries use private APIs such as Catalyst
Expression and LogicalPlan. It's too much work to track all the changes to
private APIs and I think it's the downstream library's responsibility to
check such changes in new Spark versions, or avoid using private APIs.
Exceptions can happen if certain private APIs are used too widely and we
should avoid breaking them.

Thanks,
Wenchen
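
To make the "developer APIs are also public APIs" point concrete, a
third-party consumer of the Spark Listener API can be as small as the
sketch below; the listener classes are the real developer API, the rest
is illustrative:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}
    import org.apache.spark.sql.SparkSession

    // Renaming the parent class or the event type would break third-party
    // code like this, which is why such changes count as behavior changes.
    class AppEndLogger extends SparkListener {
      override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit =
        println(s"Application ended at ${event.time}")
    }

    object ListenerDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("listener-demo").getOrCreate()
        spark.sparkContext.addSparkListener(new AppEndLogger)
        spark.stop()
      }
    }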

On Wed, May 1, 2024 at 11:51 PM Erik Krogen  wrote:

> Thanks for raising this important discussion Wenchen! Two points I would
> like to raise, though I'm fully supportive of any improvements in this
> regard, my points below notwithstanding -- I am not intending to let
> perfect be the enemy of good here.
>
> On a similar note as Santosh's comment, we should consider how this
> relates to developer APIs. Let's say I am an end user relying on some
> library like frameless , which
> relies on developer APIs in Spark. When we make a change to Spark's
> developer APIs that requires a corresponding change in frameless, I don't
> directly see that change as an end user, but it *does* impact me, because
> now I have to upgrade to a new version of frameless that supports those new
> changes. This can have ripple effects across the ecosystem. Should we call
> out such changes so that end users understand the potential impact to
> libraries they use?
>
> Second point, what about binary compatibility? Currently our versioning
> policy says "Link-level compatibility is something we’ll try to guarantee
> in future releases." (FWIW, it has said this since at least 2016
> ...)
> One step towards this would be to clearly call out any binary-incompatible
> changes in our release notes, to help users understand if they may be
> impacted. Similar to my first point, this has ripple effects across the
> ecosystem -- if I just use Spark itself, recompiling is probably not a big
> deal, but if I use N libraries that each depend on Spark, then after a
> binary-incompatible change is made I have to wait for all N libraries to
> publish new compatible versions before I can upgrade myself, presenting a
> nontrivial barrier to adoption.
>
> On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
>  wrote:
>
>> Thanks Wenchen for starting this!
>>
>> How do we define "the user" for spark?
>> 1. End users: There are some users that use spark as a service from a
>> provider
>> 2. Providers/Operators: There are some users that provide spark as a
>> service for their internal(on-prem setup with yarn/k8s)/external(Something
>> like EMR) customers
>> 3. ?
>>
>> Perhaps we need to consider infrastructure behavior changes as well to
>> accommodate the second group of users.
>>
>> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>>
>> Hi all,
>>
>> It's exciting to see innovations keep happening in the Spark community
>> and Spark keeps evolving itself. To make these innovations available to
>> more users, it's important to help users upgrade to newer Spark versions
>> easily. We've done a good job on it: the PR template requires the author to
>> write down user-facing behavior changes, and the migration guide contains
>> behavior changes that need attention from users. Sometimes behavior changes
>> come with a legacy config to restore the old behavior. However, we still
>> lack a clear definition of behavior changes and I propose the following
>> definition:
>>
>> Behavior changes mean user-visible functional changes in a new release
>> via public APIs. This means new features, and even bug fixes that eliminate
>> NPE or correct query results, are behavior changes. Things like performance
>> improvement, code refactoring, and changes to unreleased APIs/features are
>> not. All behavior changes should be called out in the PR description. We
>> need to write an item in the migration guide (and probably legacy config)
>> for those that may break users when upgrading:
>>
>>- Bug fixes that change query results. Users may need to do backfill
>>to correct the existing data and must know about these correctness fixes.
>>- Bug fixes that change query schema. Users may need to update the
>>schema of the tables in their data pipelines and must know about these
>>changes.
>>- Remove configs
>>- Rename error class/condition
>>- Any change to the public Python/SQL/Scala/Java/R APIs: rename
>>function, remove parameters, add parameters, rename parameters, change
>>parameter default values, etc. These changes should be avoided in general,
>>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Chao Sun
+1

On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:

> +1 for next Monday.
>
> We can do more previews when the other features are ready for preview.
>
> Tathagata Das wrote on Wed, May 1, 2024 at 08:46:
>
>> Next week sounds great! Thank you Wenchen!
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Hey all,

 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.

 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.

 Thanks!


 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config
> and finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat
> the Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>> 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are 
>> demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my
>> understanding is that we want to release the feature to 4.0.0, but there
>> are several remaining works to be done. While the tentative timeline for
>> releasing is June 2024, what would be the tentative timeline for the RC 
>> cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June
>> 2024), and I think it's time to prepare for it and discuss the ongoing
>> projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Hyukjin Kwon
SGTM

On Thu, 2 May 2024 at 02:06, Dongjoon Hyun  wrote:

> +1 for next Monday.
>
> Dongjoon.
>
> On Wed, May 1, 2024 at 8:46 AM Tathagata Das 
> wrote:
>
>> Next week sounds great! Thank you Wenchen!
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Hey all,

 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.

 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.

 Thanks!


 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config
> and finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat
> the Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>> 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are 
>> demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my
>> understanding is that we want to release the feature to 4.0.0, but there
>> are several remaining works to be done. While the tentative timeline for
>> releasing is June 2024, what would be the tentative timeline for the RC 
>> cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June
>> 2024), and I think it's time to prepare for it and discuss the ongoing
>> projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Xiao Li
+1 for next Monday.

We can do more previews when the other features are ready for preview.

Tathagata Das wrote on Wed, May 1, 2024 at 08:46:

> Next week sounds great! Thank you Wenchen!
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
 Thank you all for the replies!

 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
 PR and let's finish others before the 4.0 release.
 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.
 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.
 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.

 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are 
> demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my
> understanding is that we want to release the feature to 4.0.0, but there
> are several remaining works to be done. While the tentative timeline for
> releasing is June 2024, what would be the tentative timeline for the RC 
> cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
> wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June
> 2024), and I think it's time to prepare for it and discuss the ongoing
> projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I
> would like to volunteer as the release manager for Apache Spark 4.0.0 if
> there is no objection. Thank you all for the great work that fills Spark
> 4.0!
> >
> > Wenchen Fan
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Dongjoon Hyun
+1 for next Monday.

Dongjoon.

On Wed, May 1, 2024 at 8:46 AM Tathagata Das 
wrote:

> Next week sounds great! Thank you Wenchen!
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
 Thank you all for the replies!

 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
 PR and let's finish others before the 4.0 release.
 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.
 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.
 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.

 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are 
> demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my
> understanding is that we want to release the feature to 4.0.0, but there
> are several remaining works to be done. While the tentative timeline for
> releasing is June 2024, what would be the tentative timeline for the RC 
> cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
> wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June
> 2024), and I think it's time to prepare for it and discuss the ongoing
> projects:
> > • ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I
> would like to volunteer as the release manager for Apache Spark 4.0.0 if
> there is no objection. Thank you all for the great work that fills Spark
> 4.0!
> >
> > Wenchen Fan
>
>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Erik Krogen
Thanks for raising this important discussion Wenchen! Two points I would
like to raise, though I'm fully supportive of any improvements in this
regard, my points below notwithstanding -- I am not intending to let
perfect be the enemy of good here.

On a similar note as Santosh's comment, we should consider how this relates
to developer APIs. Let's say I am an end user relying on some library like
frameless , which relies on
developer APIs in Spark. When we make a change to Spark's developer APIs
that requires a corresponding change in frameless, I don't directly see
that change as an end user, but it *does* impact me, because now I have to
upgrade to a new version of frameless that supports those new changes. This
can have ripple effects across the ecosystem. Should we call out such
changes so that end users understand the potential impact to libraries they
use?

Second point, what about binary compatibility? Currently our versioning
policy says "Link-level compatibility is something we’ll try to guarantee
in future releases." (FWIW, it has said this since at least 2016
...)
One step towards this would be to clearly call out any binary-incompatible
changes in our release notes, to help users understand if they may be
impacted. Similar to my first point, this has ripple effects across the
ecosystem -- if I just use Spark itself, recompiling is probably not a big
deal, but if I use N libraries that each depend on Spark, then after a
binary-incompatible change is made I have to wait for all N libraries to
publish new compatible versions before I can upgrade myself, presenting a
nontrivial barrier to adoption.
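
One practical way for a library author to catch such breaks early is MiMa
(the sbt Migration Manager plugin, which Spark itself uses for its own
compatibility checks). A build.sbt sketch with placeholder coordinates and
an illustrative plugin version:

    // project/plugins.sbt (version is illustrative):
    //   addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "1.1.3")

    // build.sbt
    import com.typesafe.tools.mima.core._

    // Compare the current build against the last released artifact.
    mimaPreviousArtifacts := Set("org.example" %% "my-spark-lib" % "1.0.0")

    // Intentional, documented breaks can be filtered explicitly.
    mimaBinaryIssueFilters += ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.example.internal.Util.helper")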

On Wed, May 1, 2024 at 8:18 AM Santosh Pingale
 wrote:

> Thanks Wenchen for starting this!
>
> How do we define "the user" for spark?
> 1. End users: There are some users that use spark as a service from a
> provider
> 2. Providers/Operators: There are some users that provide spark as a
> service for their internal(on-prem setup with yarn/k8s)/external(Something
> like EMR) customers
> 3. ?
>
> Perhaps we need to consider infrastructure behavior changes as well to
> accommodate the second group of users.
>
> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>
> Hi all,
>
> It's exciting to see innovations keep happening in the Spark community and
> Spark keeps evolving itself. To make these innovations available to more
> users, it's important to help users upgrade to newer Spark versions easily.
> We've done a good job on it: the PR template requires the author to write
> down user-facing behavior changes, and the migration guide contains
> behavior changes that need attention from users. Sometimes behavior changes
> come with a legacy config to restore the old behavior. However, we still
> lack a clear definition of behavior changes and I propose the following
> definition:
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. This means new features, and even bug fixes that eliminate NPE
> or correct query results, are behavior changes. Things like performance
> improvement, code refactoring, and changes to unreleased APIs/features are
> not. All behavior changes should be called out in the PR description. We
> need to write an item in the migration guide (and probably legacy config)
> for those that may break users when upgrading:
>
>- Bug fixes that change query results. Users may need to do backfill
>to correct the existing data and must know about these correctness fixes.
>- Bug fixes that change query schema. Users may need to update the
>schema of the tables in their data pipelines and must know about these
>changes.
>- Remove configs
>- Rename error class/condition
>- Any change to the public Python/SQL/Scala/Java/R APIs: rename
>function, remove parameters, add parameters, rename parameters, change
>parameter default values, etc. These changes should be avoided in general,
>or do it in a compatible way like deprecating and adding a new function
>instead of renaming.
>
> Once we reach a conclusion, I'll document it in
> https://spark.apache.org/versioning-policy.html .
>
> Thanks,
> Wenchen
>
>
>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Good point, Santosh!

I was originally targeting end users who write queries with Spark, as this
is probably the largest user base. But we should definitely consider other
users who deploy and manage Spark clusters. Those users are usually more
tolerant of behavior changes and I think it should be sufficient to put
behavior changes in this area in the release notes.

On Wed, May 1, 2024 at 11:18 PM Santosh Pingale 
wrote:

> Thanks Wenchen for starting this!
>
> How do we define "the user" for spark?
> 1. End users: There are some users that use spark as a service from a
> provider
> 2. Providers/Operators: There are some users that provide spark as a
> service for their internal(on-prem setup with yarn/k8s)/external(Something
> like EMR) customers
> 3. ?
>
> Perhaps we need to consider infrastructure behavior changes as well to
> accommodate the second group of users.
>
> On 1 May 2024, at 06:08, Wenchen Fan  wrote:
>
> Hi all,
>
> It's exciting to see innovations keep happening in the Spark community and
> Spark keeps evolving itself. To make these innovations available to more
> users, it's important to help users upgrade to newer Spark versions easily.
> We've done a good job on it: the PR template requires the author to write
> down user-facing behavior changes, and the migration guide contains
> behavior changes that need attention from users. Sometimes behavior changes
> come with a legacy config to restore the old behavior. However, we still
> lack a clear definition of behavior changes and I propose the following
> definition:
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. This means new features, and even bug fixes that eliminate NPE
> or correct query results, are behavior changes. Things like performance
> improvement, code refactoring, and changes to unreleased APIs/features are
> not. All behavior changes should be called out in the PR description. We
> need to write an item in the migration guide (and probably legacy config)
> for those that may break users when upgrading:
>
>- Bug fixes that change query results. Users may need to do backfill
>to correct the existing data and must know about these correctness fixes.
>- Bug fixes that change query schema. Users may need to update the
>schema of the tables in their data pipelines and must know about these
>changes.
>- Remove configs
>- Rename error class/condition
>- Any change to the public Python/SQL/Scala/Java/R APIs: rename
>function, remove parameters, add parameters, rename parameters, change
>parameter default values, etc. These changes should be avoided in general,
>or do it in a compatible way like deprecating and adding a new function
>instead of renaming.
>
> Once we reach a conclusion, I'll document it in
> https://spark.apache.org/versioning-policy.html .
>
> Thanks,
> Wenchen
>
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Next week sounds great! Thank you Wenchen!

On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:

> Yea I think a preview release won't hurt (without a branch cut). We don't
> need to wait for all the ongoing projects to be ready. How about we do a
> 4.0 preview release based on the current master branch next Monday?
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
> wrote:
>
>> Hey all,
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>> Thanks!
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>>> Thank you all for the replies!
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
 will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

 Thanks,
 Cheng Pan


 > On Apr 15, 2024, at 09:58, Jungtaek Lim 
 wrote:
 >
 > W.r.t. state data source - reader (SPARK-45511), there are several
 follow-up tickets, but we don't plan to address them soon. The current
 implementation is the final shape for Spark 4.0.0, unless there are demands
 on the follow-up tickets.
 >
 > We may want to check the plan for transformWithState - my
 understanding is that we want to release the feature to 4.0.0, but there
 are several remaining works to be done. While the tentative timeline for
 releasing is June 2024, what would be the tentative timeline for the RC 
 cut?
 > (cc. Anish to add more context on the plan for transformWithState)
 >
 > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
 wrote:
 > Hi all,
 >
 > It's close to the previously proposed 4.0.0 release date (June 2024),
 and I think it's time to prepare for it and discuss the ongoing projects:
 > • ANSI by default
 > • Spark Connect GA
 > • Structured Logging
 > • Streaming state store data source
 > • new data type VARIANT
 > • STRING collation support
 > • Spark k8s operator versioning
 > Please help to add more items to this list that are missed here. I
 would like to volunteer as the release manager for Apache Spark 4.0.0 if
 there is no objection. Thank you all for the great work that fills Spark
 4.0!
 >
 > Wenchen Fan




Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Santosh Pingale
Thanks Wenchen for starting this!

How do we define "the user" for spark?
1. End users: There are some users that use spark as a service from a
provider
2. Providers/Operators: There are some users that provide spark as a
service for their internal(on-prem setup with yarn/k8s)/external(Something
like EMR) customers
3. ?

Perhaps we need to consider infrastructure behavior changes as well to
accommodate the second group of users.

On 1 May 2024, at 06:08, Wenchen Fan  wrote:

Hi all,

It's exciting to see innovations keep happening in the Spark community and
Spark keeps evolving itself. To make these innovations available to more
users, it's important to help users upgrade to newer Spark versions easily.
We've done a good job on it: the PR template requires the author to write
down user-facing behavior changes, and the migration guide contains
behavior changes that need attention from users. Sometimes behavior changes
come with a legacy config to restore the old behavior. However, we still
lack a clear definition of behavior changes and I propose the following
definition:

Behavior changes mean user-visible functional changes in a new release via
public APIs. This means new features, and even bug fixes that eliminate NPE
or correct query results, are behavior changes. Things like performance
improvement, code refactoring, and changes to unreleased APIs/features are
not. All behavior changes should be called out in the PR description. We
need to write an item in the migration guide (and probably legacy config)
for those that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any change to the public Python/SQL/Scala/Java/R APIs: rename
   function, remove parameters, add parameters, rename parameters, change
   parameter default values, etc. These changes should be avoided in general,
   or do it in a compatible way like deprecating and adding a new function
   instead of renaming.

Once we reach a conclusion, I'll document it in
https://spark.apache.org/versioning-policy.html .

Thanks,
Wenchen


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Wenchen Fan
Yea I think a preview release won't hurt (without a branch cut). We don't
need to wait for all the ongoing projects to be ready. How about we do a
4.0 preview release based on the current master branch next Monday?

On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
wrote:

> Hey all,
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
> Thanks!
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>
>> Thank you all for the replies!
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>>> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>>> and I think it's time to prepare for it and discuss the ongoing projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new data type VARIANT
>>> > • STRING collation support
>>> > • Spark k8s operator versioning
>>> > Please help to add more items to this list that are missed here. I
>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>> there is no objection. Thank you all for the great work that fills Spark
>>> 4.0!
>>> >
>>> > Wenchen Fan
>>>
>>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Hey all,

Reviving this thread, but Spark master has already accumulated a huge
amount of changes.  As a downstream project maintainer, I want to really
start testing the new features and other breaking changes, and it's hard to
do that without a Preview release. So the sooner we make a Preview release,
the faster we can start getting feedback for fixing things for a great
Spark 4.0 final release.

So I urge the community to produce a Spark 4.0 Preview soon even if certain
features targeting the Delta 4.0 release are still incomplete.

Thanks!


On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:

> Thank you all for the replies!
>
> To @Nicholas Chammas  : Thanks for cleaning
> up the error terminology and documentation! I've merged the first PR and
> let's finish others before the 4.0 release.
> To @Dongjoon Hyun  : Thanks for driving the ANSI
> on by default effort! Now the vote has passed, let's flip the config and
> finish the DataFrame error context feature before 4.0.
> To @Jungtaek Lim  : Ack. We can treat the
> Streaming state store data source as completed for 4.0 then.
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > • ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>


Re: Potential Impact of Hive Upgrades on Spark Tables

2024-05-01 Thread Mich Talebzadeh
It is important to consider potential impacts on Spark tables stored
in the Hive metastore during an "upgrade". Depending on the upgrade
path, the Hive metastore schema or SerDe behavior might change,
requiring adjustments in the Spark code or configurations. I mentioned
the need to test Spark applications thoroughly after a Hive upgrade,
which necessitates liaising with the Hive group, as you are relying on
their metadata.


Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

On Wed, 1 May 2024 at 04:30, Wenchen Fan  wrote:
>
> Yes, Spark has a shim layer to support all Hive versions. It shouldn't be an 
> issue as many users create native Spark data source tables already today, by 
> explicitly putting the `USING` clause in the CREATE TABLE statement.
>
> On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh  
> wrote:
>>
>> @Wenchen Fan Got your explanation, thanks!
>>
>> My understanding is that even if we create Spark tables using Spark's
>> native data sources, by default, the metadata about these tables will
>> be stored in the Hive metastore. As a consequence, a Hive upgrade can
>> potentially affect Spark tables. For example, depending on the
>> severity of the changes, the Hive metastore schema might change, which
>> could require Spark code to be updated to handle these changes in how
>> table metadata is represented. Is this assertion correct?
>>
>> Thanks
>>
>> Mich Talebzadeh,
>>
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] clarify the definition of behavior changes

2024-04-30 Thread Wenchen Fan
Hi all,

It's exciting to see innovations keep happening in the Spark community and
Spark keeps evolving itself. To make these innovations available to more
users, it's important to help users upgrade to newer Spark versions easily.
We've done a good job on it: the PR template requires the author to write
down user-facing behavior changes, and the migration guide contains
behavior changes that need attention from users. Sometimes behavior changes
come with a legacy config to restore the old behavior. However, we still
lack a clear definition of behavior changes and I propose the following
definition:

Behavior changes mean user-visible functional changes in a new release via
public APIs. This means new features, and even bug fixes that eliminate NPE
or correct query results, are behavior changes. Things like performance
improvement, code refactoring, and changes to unreleased APIs/features are
not. All behavior changes should be called out in the PR description. We
need to write an item in the migration guide (and probably legacy config)
for those that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any change to the public Python/SQL/Scala/Java/R APIs: rename
   function, remove parameters, add parameters, rename parameters, change
   parameter default values, etc. These changes should be avoided in general,
   or do it in a compatible way like deprecating and adding a new function
   instead of renaming.

Once we reach a conclusion, I'll document it in
https://spark.apache.org/versioning-policy.html .

Thanks,
Wenchen
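
For the migration-guide items listed above, the usual escape hatch is a
spark.sql.legacy.* flag. A minimal sketch of the pattern, with a made-up
config name rather than a real Spark setting:

    import org.apache.spark.sql.SparkSession

    object LegacyFlagSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("legacy-flag").getOrCreate()

        // Hypothetical flag guarding a corrected behaviour; the default
        // ("false") gives the new, fixed result.
        val legacy = spark.conf
          .get("spark.sql.legacy.exampleNullHandling", "false").toBoolean

        println(if (legacy) "old result" else "corrected result")
        spark.stop()
      }
    }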


Re: Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Wenchen Fan
Yes, Spark has a shim layer to support all Hive versions. It shouldn't be
an issue as many users create native Spark data source tables already
today, by explicitly putting the `USING` clause in the CREATE TABLE
statement.
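
Both table flavours land in the same Hive metastore; only the table
provider, and hence the reader/writer, differs. A minimal sketch (table
names are made up; the SQL itself is standard Spark SQL):

    import org.apache.spark.sql.SparkSession

    object TableProviderDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("provider-demo")
          // The shim layer (spark.sql.hive.metastore.version plus matching
          // spark.sql.hive.metastore.jars) picks the metastore client;
          // values are deployment-specific, so they are omitted here.
          .enableHiveSupport()
          .getOrCreate()

        // Native Spark data source table: Spark's own Parquet reader/writer.
        spark.sql("CREATE TABLE t_native (id INT) USING parquet")

        // Hive SerDe table: Hive reader/writer, text format unless
        // STORED AS is specified.
        spark.sql("CREATE TABLE t_serde (id INT) STORED AS parquet")

        spark.stop()
      }
    }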

On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh 
wrote:

> @Wenchen Fan Got your explanation, thanks!
>
> My understanding is that even if we create Spark tables using Spark's
> native data sources, by default, the metadata about these tables will
> be stored in the Hive metastore. As a consequence, a Hive upgrade can
> potentially affect Spark tables. For example, depending on the
> severity of the changes, the Hive metastore schema might change, which
> could require Spark code to be updated to handle these changes in how
> table metadata is represented. Is this assertion correct?
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Mich Talebzadeh
@Wenchen Fan Got your explanation, thanks!

My understanding is that even if we create Spark tables using Spark's
native data sources, by default, the metadata about these tables will
be stored in the Hive metastore. As a consequence, a Hive upgrade can
potentially affect Spark tables. For example, depending on the
severity of the changes, the Hive metastore schema might change, which
could require Spark code to be updated to handle these changes in how
table metadata is represented. Is this assertion correct?

Thanks

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Kent Yao
+1

Kent Yao

On 2024/04/30 09:07:21 Yuming Wang wrote:
> +1
> 
> On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin  wrote:
> 
> > +1
> > Sent from my iPhone
> >
> > On Apr 30, 2024, at 3:23 PM, DB Tsai  wrote:
> >
> > 
> > +1
> >
> > On Apr 29, 2024, at 8:01 PM, Wenchen Fan  wrote:
> >
> > 
> > To add more color:
> >
> > Spark data source tables and Hive Serde tables are both stored in the Hive
> > metastore and keep their data files in the table directory. The only
> > difference is the "table provider", which determines which reader/writer
> > Spark uses. Ideally, the Spark native data source reader/writer is faster
> > than the Hive Serde one.
> >
> > What's more, the default format of Hive Serde is text. I don't think
> > people want to use text format tables in production. Most people will add
> > `STORED AS parquet` or `USING parquet` explicitly. By setting this config
> > to false, we have a more reasonable default behavior: creating Parquet
> > tables (or whatever is specified by `spark.sql.sources.default`).
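> >
> > As a rough sketch of the effect (assuming a Hive-enabled Spark 3.x session
> > named `spark`; the table names are made up):
> >
> >   // Explicit Hive Serde table (Parquet via Hive's reader/writer)
> >   spark.sql("CREATE TABLE t_hive (id INT) STORED AS parquet")
> >
> >   // Explicit Spark native table (Spark's own Parquet reader/writer)
> >   spark.sql("CREATE TABLE t_native (id INT) USING parquet")
> >
> >   // With the config off, a bare CREATE TABLE falls back to
> >   // spark.sql.sources.default (parquet) instead of a Hive text table
> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
> >   spark.sql("CREATE TABLE t_default (id INT)")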
> >
> > On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan  wrote:
> >
> >> @Mich Talebzadeh  there seems to be a
> >> misunderstanding here. The Spark native data source table is still stored
> >> in the Hive metastore; it's just that Spark will use a different (and
> >> faster) reader/writer for it. `hive-site.xml` should work as it is today.
> >>
> >> On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon 
> >> wrote:
> >>
> >>> +1
> >>>
> >>> It's a legacy conf that we should eventually remove. Spark
> >>> should create a Spark table by default, not a Hive table.
> >>>
> >>> Mich, for your workload, you can simply switch that conf off if it
> >>> concerns you. We also enabled ANSI (which you agreed on). It's a bit
> >>> awkward to stop in the middle for this compatibility reason while making
> >>> Spark sound. The compatibility has been tested in production for a long
> >>> time, so I don't see any particular issue with the compatibility case you
> >>> mentioned.
> >>>
> >>> On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh <
> >>> mich.talebza...@gmail.com> wrote:
> >>>
> 
>  Hi @Wenchen Fan 
> 
>  Thanks for your response. I believe we have not had enough time to
>  "DISCUSS" this matter.
> 
>  Currently, in order to make Spark take advantage of Hive, I create a
>  soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is
>  3.1.1.
> 
>   /opt/spark/conf/hive-site.xml ->
>  /data6/hduser/hive-3.1.1/conf/hive-site.xml
> 
>  This works fine for me in my lab. So in the future, if we opt to set
>  "spark.sql.legacy.createHiveTableByDefault" to false, there will
>  not be a need for this symbolic link?
>  On the face of it, this looks fine, but in real life it may require a
>  number of changes to old scripts. Hence my concern.
>  As a matter of interest, has anyone liaised with the Hive team to ensure
>  they have introduced the additional changes you outlined?
> 
>  HTH
> 
>  Mich Talebzadeh,
>  Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>  London
>  United Kingdom
> 
> 
> view my Linkedin profile
>  
> 
> 
>   https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
>  *Disclaimer:* The information provided is correct to the best of my
>  knowledge but of course cannot be guaranteed. It is essential to note
>  that, as with any advice, "one test result is worth one-thousand
>  expert opinions" (Wernher von Braun).
> 
> 
>  On Sun, 28 Apr 2024 at 09:34, Wenchen Fan  wrote:
> 
> > @Mich Talebzadeh  thanks for sharing your
> > concern!
> >
> > Note: creating Spark native data source tables is usually Hive
> > compatible as well, unless we use features that Hive does not support
> > (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to
> > create a Spark native table in this case, instead of creating a Hive
> > table and failing.
> >
> > On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan  wrote:
> >
> >> +1 (non-binding)
> >>
> >> Thanks,
> >> Cheng Pan
> >>
> >> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau 
> >> wrote:
> >> >
> >> > +1
> >> >
> >> > Twitter: https://twitter.com/holdenkarau
> >> > Books (Learning Spark, High Performance Spark, etc.):
> >> https://amzn.to/2MaRAG9
> >> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >> >
> >> >
> >> > On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh 
> >> wrote:
> >> >>
> >> >> +1
> >> >>
> >> >> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun <
> >> dongj...@apache.org> 
